[clang] [clang-tools-extra] [llvm] [clang][DependencyScanning] Separate clangDependencyScanning and DependencyScanningTool (NFC) (PR #169962)

Naveen Seth Hanig via llvm-commits llvm-commits at lists.llvm.org
Wed Dec 3 14:35:47 PST 2025


https://github.com/naveen-seth updated https://github.com/llvm/llvm-project/pull/169962

From 65063e8f4927bd408b3f8356da9d2bfd5740c867 Mon Sep 17 00:00:00 2001
From: Naveen Seth Hanig <naveen.hanig at outlook.com>
Date: Fri, 28 Nov 2025 21:21:48 +0100
Subject: [PATCH 1/3] [clang][deps] Separate clangDependencyScanning and
 DependencyScanningTool (NFC)

This patch is the first of two that refactor Clang's dependency scanning
tooling to remove its dependency on clangDriver.

It separates Tooling/DependencyScanningTool.cpp from the rest of
clangDependencyScanning and moves clangDependencyScanning out of the
clangTooling source tree into a standalone, top-level library. No
functional changes are introduced.

The follow-up patch will restrict clangDependencyScanning to handling only
-cc1 command lines and move the handling of driver command lines into
clangTooling (DependencyScanningTool.cpp).

This is part of a broader effort to support driver-managed
builds for compilations using C++ named modules and/or Clang modules.
It is required for linking the dependency scanning tooling against the
driver without introducing cyclic dependencies, which would otherwise
cause build failures when dynamic linking is enabled.

The RFC for this change can be found here:
https://discourse.llvm.org/t/rfc-new-clangoptions-library-remove-dependency-on-clangdriver-from-clangfrontend-and-flangfrontend/88773?u=naveen-seth
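
For out-of-tree users of the scanning libraries, the change amounts to
updating include paths, namespaces, and CMake dependencies. A minimal
sketch of the before/after, using only headers and names that appear in
this patch:

  // Before:
  //   #include "clang/Tooling/DependencyScanning/DependencyScanningService.h"
  //   #include "clang/Tooling/DependencyScanning/DependencyScanningTool.h"
  //   using namespace clang::tooling::dependencies;
  // After:
  #include "clang/DependencyScanning/DependencyScanningService.h"
  #include "clang/Tooling/DependencyScanningTool.h"

  // The core scanning classes now live in clang::dependencies ...
  clang::dependencies::DependencyScanningService Service(
      clang::dependencies::ScanningMode::DependencyDirectivesScan,
      clang::dependencies::ScanningOutputFormat::Make);

  // ... while DependencyScanningTool stays in clang::tooling::dependencies
  // and links against both clangTooling and clangDependencyScanning.
  clang::tooling::dependencies::DependencyScanningTool Tool(Service);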
---
 .../clangd/ScanningProjectModules.cpp         |  10 +-
 .../DependencyScannerImpl.h                   |   7 +-
 .../DependencyScanningFilesystem.h            |  10 +-
 .../DependencyScanningService.h               |  14 +-
 .../DependencyScanningUtils.h                 | 166 ++++++++++++++++
 .../DependencyScanningWorker.h                |  14 +-
 .../DependencyScanning/InProcessModuleCache.h |  12 +-
 .../DependencyScanning/ModuleDepCollector.h   |  23 ++-
 .../DependencyScanningTool.h                  | 187 +++---------------
 clang/lib/CMakeLists.txt                      |   1 +
 .../DependencyScanning/CMakeLists.txt         |   2 +-
 .../DependencyScannerImpl.cpp                 |   5 +-
 .../DependencyScanningFilesystem.cpp          |   5 +-
 .../DependencyScanningService.cpp             |   5 +-
 .../DependencyScanningUtils.cpp               |  38 ++++
 .../DependencyScanningWorker.cpp              |   7 +-
 .../InProcessModuleCache.cpp                  |   5 +-
 .../DependencyScanning/ModuleDepCollector.cpp |   5 +-
 clang/lib/Tooling/CMakeLists.txt              |   3 +-
 .../DependencyScanningTool.cpp                |  31 +--
 clang/tools/clang-scan-deps/ClangScanDeps.cpp |  10 +-
 clang/unittests/CMakeLists.txt                |   1 +
 .../DependencyScanning/CMakeLists.txt         |  11 ++
 .../DependencyScanningFilesystemTest.cpp      |   4 +-
 .../DependencyScanningWorkerTest.cpp          |  97 +++++++++
 clang/unittests/Tooling/CMakeLists.txt        |   3 +-
 .../DependencyScannerTest.cpp                 |  88 +--------
 27 files changed, 417 insertions(+), 347 deletions(-)
 rename clang/{lib/Tooling => include/clang}/DependencyScanning/DependencyScannerImpl.h (97%)
 rename clang/include/clang/{Tooling => }/DependencyScanning/DependencyScanningFilesystem.h (98%)
 rename clang/include/clang/{Tooling => }/DependencyScanning/DependencyScanningService.h (89%)
 create mode 100644 clang/include/clang/DependencyScanning/DependencyScanningUtils.h
 rename clang/include/clang/{Tooling => }/DependencyScanning/DependencyScanningWorker.h (94%)
 rename clang/include/clang/{Tooling => }/DependencyScanning/InProcessModuleCache.h (75%)
 rename clang/include/clang/{Tooling => }/DependencyScanning/ModuleDepCollector.h (95%)
 rename clang/include/clang/Tooling/{DependencyScanning => }/DependencyScanningTool.h (51%)
 rename clang/lib/{Tooling => }/DependencyScanning/CMakeLists.txt (93%)
 rename clang/lib/{Tooling => }/DependencyScanning/DependencyScannerImpl.cpp (99%)
 rename clang/lib/{Tooling => }/DependencyScanning/DependencyScanningFilesystem.cpp (99%)
 rename clang/lib/{Tooling => }/DependencyScanning/DependencyScanningService.cpp (82%)
 create mode 100644 clang/lib/DependencyScanning/DependencyScanningUtils.cpp
 rename clang/lib/{Tooling => }/DependencyScanning/DependencyScanningWorker.cpp (97%)
 rename clang/lib/{Tooling => }/DependencyScanning/InProcessModuleCache.cpp (95%)
 rename clang/lib/{Tooling => }/DependencyScanning/ModuleDepCollector.cpp (99%)
 rename clang/lib/Tooling/{DependencyScanning => }/DependencyScanningTool.cpp (88%)
 create mode 100644 clang/unittests/DependencyScanning/CMakeLists.txt
 rename clang/unittests/{Tooling => }/DependencyScanning/DependencyScanningFilesystemTest.cpp (98%)
 create mode 100644 clang/unittests/DependencyScanning/DependencyScanningWorkerTest.cpp
 rename clang/unittests/Tooling/{DependencyScanning => }/DependencyScannerTest.cpp (78%)

diff --git a/clang-tools-extra/clangd/ScanningProjectModules.cpp b/clang-tools-extra/clangd/ScanningProjectModules.cpp
index 672e99632019d..6a21ad2920764 100644
--- a/clang-tools-extra/clangd/ScanningProjectModules.cpp
+++ b/clang-tools-extra/clangd/ScanningProjectModules.cpp
@@ -8,8 +8,8 @@
 
 #include "ProjectModules.h"
 #include "support/Logger.h"
-#include "clang/Tooling/DependencyScanning/DependencyScanningService.h"
-#include "clang/Tooling/DependencyScanning/DependencyScanningTool.h"
+#include "clang/DependencyScanning/DependencyScanningService.h"
+#include "clang/Tooling/DependencyScanningTool.h"
 
 namespace clang::clangd {
 namespace {
@@ -36,8 +36,8 @@ class ModuleDependencyScanner {
       std::shared_ptr<const clang::tooling::CompilationDatabase> CDB,
       const ThreadsafeFS &TFS)
       : CDB(CDB), TFS(TFS),
-        Service(tooling::dependencies::ScanningMode::CanonicalPreprocessing,
-                tooling::dependencies::ScanningOutputFormat::P1689) {}
+        Service(dependencies::ScanningMode::CanonicalPreprocessing,
+                dependencies::ScanningOutputFormat::P1689) {}
 
   /// The scanned modules dependency information for a specific source file.
   struct ModuleDependencyInfo {
@@ -81,7 +81,7 @@ class ModuleDependencyScanner {
   // Whether the scanner has scanned the project globally.
   bool GlobalScanned = false;
 
-  clang::tooling::dependencies::DependencyScanningService Service;
+  clang::dependencies::DependencyScanningService Service;
 
   // TODO: Add a scanning cache.
 
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScannerImpl.h b/clang/include/clang/DependencyScanning/DependencyScannerImpl.h
similarity index 97%
rename from clang/lib/Tooling/DependencyScanning/DependencyScannerImpl.h
rename to clang/include/clang/DependencyScanning/DependencyScannerImpl.h
index b94d1b472f920..0a0808dd9b93e 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScannerImpl.h
+++ b/clang/include/clang/DependencyScanning/DependencyScannerImpl.h
@@ -9,18 +9,18 @@
 #ifndef LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNER_H
 #define LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNER_H
 
+#include "clang/DependencyScanning/DependencyScanningFilesystem.h"
+#include "clang/DependencyScanning/ModuleDepCollector.h"
 #include "clang/Driver/Compilation.h"
+#include "clang/Driver/Driver.h"
 #include "clang/Frontend/CompilerInstance.h"
 #include "clang/Frontend/CompilerInvocation.h"
 #include "clang/Frontend/TextDiagnosticPrinter.h"
 #include "clang/Serialization/ObjectFilePCHContainerReader.h"
-#include "clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h"
-#include "clang/Tooling/DependencyScanning/ModuleDepCollector.h"
 
 namespace clang {
 class DiagnosticConsumer;
 
-namespace tooling {
 namespace dependencies {
 class DependencyScanningService;
 class DependencyScanningWorker;
@@ -206,7 +206,6 @@ class CompilerInstanceWithContext {
   llvm::Error handleReturnStatus(bool Success);
 };
 } // namespace dependencies
-} // namespace tooling
 } // namespace clang
 
 #endif
diff --git a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h b/clang/include/clang/DependencyScanning/DependencyScanningFilesystem.h
similarity index 98%
rename from clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h
rename to clang/include/clang/DependencyScanning/DependencyScanningFilesystem.h
index 2b21be7712693..a4516ff77509d 100644
--- a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h
+++ b/clang/include/clang/DependencyScanning/DependencyScanningFilesystem.h
@@ -1,4 +1,4 @@
-//===- DependencyScanningFilesystem.h - clang-scan-deps fs ===---*- C++ -*-===//
+//===- DependencyScanningFilesystem.h - Optimized Scanning FS ---*- C++ -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
@@ -6,8 +6,8 @@
 //
 //===----------------------------------------------------------------------===//
 
-#ifndef LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNINGFILESYSTEM_H
-#define LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNINGFILESYSTEM_H
+#ifndef LLVM_CLANG_DEPENDENCYSCANNING_DEPENDENCYSCANNINGFILESYSTEM_H
+#define LLVM_CLANG_DEPENDENCYSCANNING_DEPENDENCYSCANNINGFILESYSTEM_H
 
 #include "clang/Basic/LLVM.h"
 #include "clang/Lex/DependencyDirectivesScanner.h"
@@ -21,7 +21,6 @@
 #include <variant>
 
 namespace clang {
-namespace tooling {
 namespace dependencies {
 
 using DependencyDirectivesTy =
@@ -521,7 +520,6 @@ class DependencyScanningWorkerFilesystem
 };
 
 } // end namespace dependencies
-} // end namespace tooling
 } // end namespace clang
 
-#endif // LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNINGFILESYSTEM_H
+#endif // LLVM_CLANG_DEPENDENCYSCANNING_DEPENDENCYSCANNINGFILESYSTEM_H
diff --git a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h b/clang/include/clang/DependencyScanning/DependencyScanningService.h
similarity index 89%
rename from clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
rename to clang/include/clang/DependencyScanning/DependencyScanningService.h
index 4e97c7bc9f36e..96dd33c28cf5a 100644
--- a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
+++ b/clang/include/clang/DependencyScanning/DependencyScanningService.h
@@ -1,4 +1,4 @@
-//===- DependencyScanningService.h - clang-scan-deps service ===-*- C++ -*-===//
+//===- DependencyScanningService.h - Scanning Service -----------*- C++ -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
@@ -6,16 +6,15 @@
 //
 //===----------------------------------------------------------------------===//
 
-#ifndef LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNINGSERVICE_H
-#define LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNINGSERVICE_H
+#ifndef LLVM_CLANG_DEPENDENCYSCANNING_DEPENDENCYSCANNINGSERVICE_H
+#define LLVM_CLANG_DEPENDENCYSCANNING_DEPENDENCYSCANNINGSERVICE_H
 
-#include "clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h"
-#include "clang/Tooling/DependencyScanning/InProcessModuleCache.h"
+#include "clang/DependencyScanning/DependencyScanningFilesystem.h"
+#include "clang/DependencyScanning/InProcessModuleCache.h"
 #include "llvm/ADT/BitmaskEnum.h"
 #include "llvm/Support/Chrono.h"
 
 namespace clang {
-namespace tooling {
 namespace dependencies {
 
 /// The mode in which the dependency scanner will operate to find the
@@ -125,7 +124,6 @@ class DependencyScanningService {
 };
 
 } // end namespace dependencies
-} // end namespace tooling
 } // end namespace clang
 
-#endif // LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNINGSERVICE_H
+#endif // LLVM_CLANG_DEPENDENCYSCANNING_DEPENDENCYSCANNINGSERVICE_H
diff --git a/clang/include/clang/DependencyScanning/DependencyScanningUtils.h b/clang/include/clang/DependencyScanning/DependencyScanningUtils.h
new file mode 100644
index 0000000000000..e2fb5ad3e5cf3
--- /dev/null
+++ b/clang/include/clang/DependencyScanning/DependencyScanningUtils.h
@@ -0,0 +1,166 @@
+//===- DependencyScanningUtils.h - Common Scanning Utilities ----*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CLANG_DEPENDENCYSCANNING_DEPENDENCYSCANNINGUTILS_H
+#define LLVM_CLANG_DEPENDENCYSCANNING_DEPENDENCYSCANNINGUTILS_H
+
+#include "clang/DependencyScanning/DependencyScannerImpl.h"
+#include "clang/DependencyScanning/DependencyScanningWorker.h"
+#include "clang/DependencyScanning/ModuleDepCollector.h"
+
+namespace clang {
+namespace dependencies {
+
+/// Graph of modular dependencies.
+using ModuleDepsGraph = std::vector<clang::dependencies::ModuleDeps>;
+
+/// The full dependencies and module graph for a specific input.
+struct TranslationUnitDeps {
+  /// The graph of direct and transitive modular dependencies.
+  ModuleDepsGraph ModuleGraph;
+
+  /// The identifier of the C++20 module this translation unit exports.
+  ///
+  /// If the translation unit is not a module then \c ID.ModuleName is empty.
+  clang::dependencies::ModuleID ID;
+
+  /// A collection of absolute paths to files that this translation unit
+  /// directly depends on, not including transitive dependencies.
+  std::vector<std::string> FileDeps;
+
+  /// A collection of prebuilt modules this translation unit directly depends
+  /// on, not including transitive dependencies.
+  std::vector<clang::dependencies::PrebuiltModuleDep> PrebuiltModuleDeps;
+
+  /// A list of modules this translation unit directly depends on, not including
+  /// transitive dependencies.
+  ///
+  /// This may include modules with a different context hash when it can be
+  /// determined that the differences are benign for this compilation.
+  std::vector<clang::dependencies::ModuleID> ClangModuleDeps;
+
+  /// A list of module names that are visible to this translation unit. This
+  /// includes both direct and transitive module dependencies.
+  std::vector<std::string> VisibleModules;
+
+  /// A list of the C++20 named modules this translation unit depends on.
+  std::vector<std::string> NamedModuleDeps;
+
+  /// The sequence of commands required to build the translation unit. Commands
+  /// should be executed in order.
+  ///
+  /// FIXME: If we add support for multi-arch builds in clang-scan-deps, we
+  /// should make the dependencies between commands explicit to enable parallel
+  /// builds of each architecture.
+  std::vector<clang::dependencies::Command> Commands;
+
+  /// Deprecated driver command-line. This will be removed in a future version.
+  std::vector<std::string> DriverCommandLine;
+};
+
+class FullDependencyConsumer : public clang::dependencies::DependencyConsumer {
+public:
+  FullDependencyConsumer(
+      const llvm::DenseSet<clang::dependencies::ModuleID> &AlreadySeen)
+      : AlreadySeen(AlreadySeen) {}
+
+  void handleBuildCommand(clang::dependencies::Command Cmd) override {
+    Commands.push_back(std::move(Cmd));
+  }
+
+  void handleDependencyOutputOpts(const DependencyOutputOptions &) override {}
+
+  void handleFileDependency(StringRef File) override {
+    Dependencies.push_back(std::string(File));
+  }
+
+  void handlePrebuiltModuleDependency(
+      clang::dependencies::PrebuiltModuleDep PMD) override {
+    PrebuiltModuleDeps.emplace_back(std::move(PMD));
+  }
+
+  void handleModuleDependency(clang::dependencies::ModuleDeps MD) override {
+    ClangModuleDeps[MD.ID] = std::move(MD);
+  }
+
+  void handleDirectModuleDependency(clang::dependencies::ModuleID ID) override {
+    DirectModuleDeps.push_back(ID);
+  }
+
+  void handleVisibleModule(std::string ModuleName) override {
+    VisibleModules.push_back(ModuleName);
+  }
+
+  void handleContextHash(std::string Hash) override {
+    ContextHash = std::move(Hash);
+  }
+
+  void handleProvidedAndRequiredStdCXXModules(
+      std::optional<clang::dependencies::P1689ModuleInfo> Provided,
+      std::vector<clang::dependencies::P1689ModuleInfo> Requires) override {
+    ModuleName = Provided ? Provided->ModuleName : "";
+    llvm::transform(Requires, std::back_inserter(NamedModuleDeps),
+                    [](const auto &Module) { return Module.ModuleName; });
+  }
+
+  TranslationUnitDeps takeTranslationUnitDeps();
+
+private:
+  std::vector<std::string> Dependencies;
+  std::vector<clang::dependencies::PrebuiltModuleDep> PrebuiltModuleDeps;
+  llvm::MapVector<clang::dependencies::ModuleID,
+                  clang::dependencies::ModuleDeps>
+      ClangModuleDeps;
+  std::string ModuleName;
+  std::vector<std::string> NamedModuleDeps;
+  std::vector<clang::dependencies::ModuleID> DirectModuleDeps;
+  std::vector<std::string> VisibleModules;
+  std::vector<clang::dependencies::Command> Commands;
+  std::string ContextHash;
+  const llvm::DenseSet<clang::dependencies::ModuleID> &AlreadySeen;
+};
+
+/// A callback to lookup module outputs for "-fmodule-file=", "-o" etc.
+using LookupModuleOutputCallback =
+    llvm::function_ref<std::string(const clang::dependencies::ModuleDeps &,
+                                   clang::dependencies::ModuleOutputKind)>;
+
+/// A simple dependency action controller that uses a callback. If no callback
+/// is provided, it is assumed that looking up module outputs is unreachable.
+class CallbackActionController
+    : public clang::dependencies::DependencyActionController {
+public:
+  virtual ~CallbackActionController();
+
+  static std::string
+  lookupUnreachableModuleOutput(const clang::dependencies::ModuleDeps &MD,
+                                clang::dependencies::ModuleOutputKind Kind) {
+    llvm::report_fatal_error("unexpected call to lookupModuleOutput");
+  };
+
+  CallbackActionController(LookupModuleOutputCallback LMO)
+      : LookupModuleOutput(std::move(LMO)) {
+    if (!LookupModuleOutput) {
+      LookupModuleOutput = lookupUnreachableModuleOutput;
+    }
+  }
+
+  std::string
+  lookupModuleOutput(const clang::dependencies::ModuleDeps &MD,
+                     clang::dependencies::ModuleOutputKind Kind) override {
+    return LookupModuleOutput(MD, Kind);
+  }
+
+private:
+  LookupModuleOutputCallback LookupModuleOutput;
+};
+
+} // end namespace dependencies
+} // end namespace clang
+
+#endif // LLVM_CLANG_DEPENDENCYSCANNING_DEPENDENCYSCANNINGUTILS_H
diff --git a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningWorker.h b/clang/include/clang/DependencyScanning/DependencyScanningWorker.h
similarity index 94%
rename from clang/include/clang/Tooling/DependencyScanning/DependencyScanningWorker.h
rename to clang/include/clang/DependencyScanning/DependencyScanningWorker.h
index e2c353a254bf3..9d3966c25414a 100644
--- a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningWorker.h
+++ b/clang/include/clang/DependencyScanning/DependencyScanningWorker.h
@@ -1,4 +1,4 @@
-//===- DependencyScanningWorker.h - clang-scan-deps worker ===---*- C++ -*-===//
+//===- DependencyScanningWorker.h - Thread-Safe Scanning Worker -*- C++ -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
@@ -6,15 +6,15 @@
 //
 //===----------------------------------------------------------------------===//
 
-#ifndef LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNINGWORKER_H
-#define LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNINGWORKER_H
+#ifndef LLVM_CLANG_DEPENDENCYSCANNING_DEPENDENCYSCANNINGWORKER_H
+#define LLVM_CLANG_DEPENDENCYSCANNING_DEPENDENCYSCANNINGWORKER_H
 
 #include "clang/Basic/DiagnosticOptions.h"
 #include "clang/Basic/FileManager.h"
 #include "clang/Basic/LLVM.h"
+#include "clang/DependencyScanning/DependencyScanningService.h"
+#include "clang/DependencyScanning/ModuleDepCollector.h"
 #include "clang/Frontend/PCHContainerOperations.h"
-#include "clang/Tooling/DependencyScanning/DependencyScanningService.h"
-#include "clang/Tooling/DependencyScanning/ModuleDepCollector.h"
 #include "llvm/Support/Error.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/MemoryBufferRef.h"
@@ -25,7 +25,6 @@ namespace clang {
 
 class DependencyOutputOptions;
 
-namespace tooling {
 namespace dependencies {
 
 class DependencyScanningWorkerFilesystem;
@@ -185,7 +184,6 @@ class DependencyScanningWorker {
 };
 
 } // end namespace dependencies
-} // end namespace tooling
 } // end namespace clang
 
-#endif // LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNINGWORKER_H
+#endif // LLVM_CLANG_DEPENDENCYSCANNING_DEPENDENCYSCANNINGWORKER_H
diff --git a/clang/include/clang/Tooling/DependencyScanning/InProcessModuleCache.h b/clang/include/clang/DependencyScanning/InProcessModuleCache.h
similarity index 75%
rename from clang/include/clang/Tooling/DependencyScanning/InProcessModuleCache.h
rename to clang/include/clang/DependencyScanning/InProcessModuleCache.h
index 213e60b39c199..c0e8f00b7fb59 100644
--- a/clang/include/clang/Tooling/DependencyScanning/InProcessModuleCache.h
+++ b/clang/include/clang/DependencyScanning/InProcessModuleCache.h
@@ -1,4 +1,4 @@
-//===----------------------------------------------------------------------===//
+//===- InProcessModuleCache.h - Implicit Module Cache -----------*- C++ -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
@@ -6,8 +6,8 @@
 //
 //===----------------------------------------------------------------------===//
 
-#ifndef LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_INPROCESSMODULECACHE_H
-#define LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_INPROCESSMODULECACHE_H
+#ifndef LLVM_CLANG_DEPENDENCYSCANNING_INPROCESSMODULECACHE_H
+#define LLVM_CLANG_DEPENDENCYSCANNING_INPROCESSMODULECACHE_H
 
 #include "clang/Serialization/ModuleCache.h"
 #include "llvm/ADT/StringMap.h"
@@ -16,8 +16,8 @@
 #include <shared_mutex>
 
 namespace clang {
-namespace tooling {
 namespace dependencies {
+
 struct ModuleCacheEntry {
   std::shared_mutex CompilationMutex;
   std::atomic<std::time_t> Timestamp = 0;
@@ -30,8 +30,8 @@ struct ModuleCacheEntries {
 
 IntrusiveRefCntPtr<ModuleCache>
 makeInProcessModuleCache(ModuleCacheEntries &Entries);
+
 } // namespace dependencies
-} // namespace tooling
 } // namespace clang
 
-#endif
+#endif // LLVM_CLANG_DEPENDENCYSCANNING_INPROCESSMODULECACHE_H
diff --git a/clang/include/clang/Tooling/DependencyScanning/ModuleDepCollector.h b/clang/include/clang/DependencyScanning/ModuleDepCollector.h
similarity index 95%
rename from clang/include/clang/Tooling/DependencyScanning/ModuleDepCollector.h
rename to clang/include/clang/DependencyScanning/ModuleDepCollector.h
index b0a91b60ff6da..0243f7abcbe10 100644
--- a/clang/include/clang/Tooling/DependencyScanning/ModuleDepCollector.h
+++ b/clang/include/clang/DependencyScanning/ModuleDepCollector.h
@@ -6,18 +6,18 @@
 //
 //===----------------------------------------------------------------------===//
 
-#ifndef LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_MODULEDEPCOLLECTOR_H
-#define LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_MODULEDEPCOLLECTOR_H
+#ifndef LLVM_CLANG_DEPENDENCYSCANNING_MODULEDEPCOLLECTOR_H
+#define LLVM_CLANG_DEPENDENCYSCANNING_MODULEDEPCOLLECTOR_H
 
 #include "clang/Basic/LLVM.h"
 #include "clang/Basic/Module.h"
 #include "clang/Basic/SourceManager.h"
+#include "clang/DependencyScanning/DependencyScanningService.h"
 #include "clang/Frontend/CompilerInvocation.h"
 #include "clang/Frontend/Utils.h"
 #include "clang/Lex/HeaderSearch.h"
 #include "clang/Lex/PPCallbacks.h"
 #include "clang/Serialization/ASTReader.h"
-#include "clang/Tooling/DependencyScanning/DependencyScanningService.h"
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/Hashing.h"
 #include "llvm/ADT/StringSet.h"
@@ -28,7 +28,6 @@
 #include <variant>
 
 namespace clang {
-namespace tooling {
 namespace dependencies {
 
 class DependencyActionController;
@@ -109,7 +108,7 @@ struct ModuleID {
            std::tie(Other.ModuleName, Other.ContextHash);
   }
 
-  bool operator<(const ModuleID& Other) const {
+  bool operator<(const ModuleID &Other) const {
     return std::tie(ModuleName, ContextHash) <
            std::tie(Other.ModuleName, Other.ContextHash);
   }
@@ -264,10 +263,11 @@ class ModuleDepCollectorPP final : public PPCallbacks {
 
   /// Traverses the affecting modules and updates \c MD with references to the
   /// parent \c ModuleDepCollector info.
-  void addAllAffectingClangModules(const Module *M, ModuleDeps &MD,
+  void
+  addAllAffectingClangModules(const Module *M, ModuleDeps &MD,
                               llvm::DenseSet<const Module *> &AddedModules);
   void addAffectingClangModule(const Module *M, ModuleDeps &MD,
-                          llvm::DenseSet<const Module *> &AddedModules);
+                               llvm::DenseSet<const Module *> &AddedModules);
 
   /// Add discovered module dependency for the given module.
   void addOneModuleDep(const Module *M, const ModuleID ID, ModuleDeps &MD);
@@ -406,16 +406,15 @@ bool areOptionsInStableDir(const ArrayRef<StringRef> Directories,
                            const HeaderSearchOptions &HSOpts);
 
 } // end namespace dependencies
-} // end namespace tooling
 } // end namespace clang
 
 namespace llvm {
-inline hash_code hash_value(const clang::tooling::dependencies::ModuleID &ID) {
+inline hash_code hash_value(const clang::dependencies::ModuleID &ID) {
   return hash_combine(ID.ModuleName, ID.ContextHash);
 }
 
-template <> struct DenseMapInfo<clang::tooling::dependencies::ModuleID> {
-  using ModuleID = clang::tooling::dependencies::ModuleID;
+template <> struct DenseMapInfo<clang::dependencies::ModuleID> {
+  using ModuleID = clang::dependencies::ModuleID;
   static inline ModuleID getEmptyKey() { return ModuleID{"", ""}; }
   static inline ModuleID getTombstoneKey() {
     return ModuleID{"~", "~"}; // ~ is not a valid module name or context hash
@@ -427,4 +426,4 @@ template <> struct DenseMapInfo<clang::tooling::dependencies::ModuleID> {
 };
 } // namespace llvm
 
-#endif // LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_MODULEDEPCOLLECTOR_H
+#endif // LLVM_CLANG_DEPENDENCYSCANNING_MODULEDEPCOLLECTOR_H
diff --git a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningTool.h b/clang/include/clang/Tooling/DependencyScanningTool.h
similarity index 51%
rename from clang/include/clang/Tooling/DependencyScanning/DependencyScanningTool.h
rename to clang/include/clang/Tooling/DependencyScanningTool.h
index ed562f46cfdaa..8e03f6e949689 100644
--- a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningTool.h
+++ b/clang/include/clang/Tooling/DependencyScanningTool.h
@@ -6,12 +6,13 @@
 //
 //===----------------------------------------------------------------------===//
 
-#ifndef LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNINGTOOL_H
-#define LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNINGTOOL_H
+#ifndef LLVM_CLANG_TOOLING_DEPENDENCYSCANNINGTOOL_H
+#define LLVM_CLANG_TOOLING_DEPENDENCYSCANNINGTOOL_H
 
-#include "clang/Tooling/DependencyScanning/DependencyScanningService.h"
-#include "clang/Tooling/DependencyScanning/DependencyScanningWorker.h"
-#include "clang/Tooling/DependencyScanning/ModuleDepCollector.h"
+#include "clang/DependencyScanning/DependencyScanningService.h"
+#include "clang/DependencyScanning/DependencyScanningUtils.h"
+#include "clang/DependencyScanning/DependencyScanningWorker.h"
+#include "clang/DependencyScanning/ModuleDepCollector.h"
 #include "clang/Tooling/JSONCompilationDatabase.h"
 #include "llvm/ADT/DenseSet.h"
 #include "llvm/ADT/MapVector.h"
@@ -25,61 +26,10 @@ namespace clang {
 namespace tooling {
 namespace dependencies {
 
-/// A callback to lookup module outputs for "-fmodule-file=", "-o" etc.
-using LookupModuleOutputCallback =
-    llvm::function_ref<std::string(const ModuleDeps &, ModuleOutputKind)>;
-
-/// Graph of modular dependencies.
-using ModuleDepsGraph = std::vector<ModuleDeps>;
-
-/// The full dependencies and module graph for a specific input.
-struct TranslationUnitDeps {
-  /// The graph of direct and transitive modular dependencies.
-  ModuleDepsGraph ModuleGraph;
-
-  /// The identifier of the C++20 module this translation unit exports.
-  ///
-  /// If the translation unit is not a module then \c ID.ModuleName is empty.
-  ModuleID ID;
-
-  /// A collection of absolute paths to files that this translation unit
-  /// directly depends on, not including transitive dependencies.
-  std::vector<std::string> FileDeps;
-
-  /// A collection of prebuilt modules this translation unit directly depends
-  /// on, not including transitive dependencies.
-  std::vector<PrebuiltModuleDep> PrebuiltModuleDeps;
-
-  /// A list of modules this translation unit directly depends on, not including
-  /// transitive dependencies.
-  ///
-  /// This may include modules with a different context hash when it can be
-  /// determined that the differences are benign for this compilation.
-  std::vector<ModuleID> ClangModuleDeps;
-
-  /// A list of module names that are visible to this translation unit. This
-  /// includes both direct and transitive module dependencies.
-  std::vector<std::string> VisibleModules;
-
-  /// A list of the C++20 named modules this translation unit depends on.
-  std::vector<std::string> NamedModuleDeps;
-
-  /// The sequence of commands required to build the translation unit. Commands
-  /// should be executed in order.
-  ///
-  /// FIXME: If we add support for multi-arch builds in clang-scan-deps, we
-  /// should make the dependencies between commands explicit to enable parallel
-  /// builds of each architecture.
-  std::vector<Command> Commands;
-
-  /// Deprecated driver command-line. This will be removed in a future version.
-  std::vector<std::string> DriverCommandLine;
-};
-
 struct P1689Rule {
   std::string PrimaryOutput;
-  std::optional<P1689ModuleInfo> Provides;
-  std::vector<P1689ModuleInfo> Requires;
+  std::optional<clang::dependencies::P1689ModuleInfo> Provides;
+  std::vector<clang::dependencies::P1689ModuleInfo> Requires;
 };
 
 /// The high-level implementation of the dependency discovery tool that runs on
@@ -90,9 +40,10 @@ class DependencyScanningTool {
   ///
   /// @param Service  The parent service. Must outlive the tool.
   /// @param FS The filesystem for the tool to use. Defaults to the physical FS.
-  DependencyScanningTool(DependencyScanningService &Service,
-                         llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem> FS =
-                             llvm::vfs::createPhysicalFileSystem());
+  DependencyScanningTool(
+      clang::dependencies::DependencyScanningService &Service,
+      llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem> FS =
+          llvm::vfs::createPhysicalFileSystem());
 
   /// Print out the dependency information into a string using the dependency
   /// file format that is specified in the options (-MD is the default) and
@@ -145,10 +96,11 @@ class DependencyScanningTool {
   ///
   /// \returns a \c StringError with the diagnostic output if clang errors
   /// occurred, \c TranslationUnitDeps otherwise.
-  llvm::Expected<TranslationUnitDeps> getTranslationUnitDependencies(
+  llvm::Expected<clang::dependencies::TranslationUnitDeps>
+  getTranslationUnitDependencies(
       const std::vector<std::string> &CommandLine, StringRef CWD,
-      const llvm::DenseSet<ModuleID> &AlreadySeen,
-      LookupModuleOutputCallback LookupModuleOutput,
+      const llvm::DenseSet<clang::dependencies::ModuleID> &AlreadySeen,
+      clang::dependencies::LookupModuleOutputCallback LookupModuleOutput,
       std::optional<llvm::MemoryBufferRef> TUBuffer = std::nullopt);
 
   /// Given a compilation context specified via the Clang driver command-line,
@@ -157,10 +109,12 @@ class DependencyScanningTool {
   /// TODO: this method should be removed as soon as Swift and our C-APIs adopt
   /// CompilerInstanceWithContext. We are keeping it here so that it is easier
   /// to coordinate with Swift and C-API changes.
-  llvm::Expected<TranslationUnitDeps> getModuleDependencies(
+  llvm::Expected<clang::dependencies::TranslationUnitDeps>
+  getModuleDependencies(
       StringRef ModuleName, const std::vector<std::string> &CommandLine,
-      StringRef CWD, const llvm::DenseSet<ModuleID> &AlreadySeen,
-      LookupModuleOutputCallback LookupModuleOutput);
+      StringRef CWD,
+      const llvm::DenseSet<clang::dependencies::ModuleID> &AlreadySeen,
+      clang::dependencies::LookupModuleOutputCallback LookupModuleOutput);
 
   /// The following three methods provide a new interface to perform
   /// by name dependency scan. The new interface's intention is to improve
@@ -190,9 +144,11 @@ class DependencyScanningTool {
   ///                           arguments for dependencies.
   /// @return An instance of \c TranslationUnitDeps if the scan is successful.
   ///         Otherwise it returns an error.
-  llvm::Expected<TranslationUnitDeps> computeDependenciesByNameWithContext(
-      StringRef ModuleName, const llvm::DenseSet<ModuleID> &AlreadySeen,
-      LookupModuleOutputCallback LookupModuleOutput);
+  llvm::Expected<clang::dependencies::TranslationUnitDeps>
+  computeDependenciesByNameWithContext(
+      StringRef ModuleName,
+      const llvm::DenseSet<clang::dependencies::ModuleID> &AlreadySeen,
+      clang::dependencies::LookupModuleOutputCallback LookupModuleOutput);
 
   /// @brief This method finializes the compiler instance. It finalizes the
   ///        diagnostics and deletes the compiler instance. Call this method
@@ -203,96 +159,13 @@ class DependencyScanningTool {
   llvm::vfs::FileSystem &getWorkerVFS() const { return Worker.getVFS(); }
 
 private:
-  DependencyScanningWorker Worker;
-};
-
-class FullDependencyConsumer : public DependencyConsumer {
-public:
-  FullDependencyConsumer(const llvm::DenseSet<ModuleID> &AlreadySeen)
-      : AlreadySeen(AlreadySeen) {}
-
-  void handleBuildCommand(Command Cmd) override {
-    Commands.push_back(std::move(Cmd));
-  }
-
-  void handleDependencyOutputOpts(const DependencyOutputOptions &) override {}
-
-  void handleFileDependency(StringRef File) override {
-    Dependencies.push_back(std::string(File));
-  }
-
-  void handlePrebuiltModuleDependency(PrebuiltModuleDep PMD) override {
-    PrebuiltModuleDeps.emplace_back(std::move(PMD));
-  }
-
-  void handleModuleDependency(ModuleDeps MD) override {
-    ClangModuleDeps[MD.ID] = std::move(MD);
-  }
-
-  void handleDirectModuleDependency(ModuleID ID) override {
-    DirectModuleDeps.push_back(ID);
-  }
-
-  void handleVisibleModule(std::string ModuleName) override {
-    VisibleModules.push_back(ModuleName);
-  }
-
-  void handleContextHash(std::string Hash) override {
-    ContextHash = std::move(Hash);
-  }
-
-  void handleProvidedAndRequiredStdCXXModules(
-      std::optional<P1689ModuleInfo> Provided,
-      std::vector<P1689ModuleInfo> Requires) override {
-    ModuleName = Provided ? Provided->ModuleName : "";
-    llvm::transform(Requires, std::back_inserter(NamedModuleDeps),
-                    [](const auto &Module) { return Module.ModuleName; });
-  }
-
-  TranslationUnitDeps takeTranslationUnitDeps();
-
-private:
-  std::vector<std::string> Dependencies;
-  std::vector<PrebuiltModuleDep> PrebuiltModuleDeps;
-  llvm::MapVector<ModuleID, ModuleDeps> ClangModuleDeps;
-  std::string ModuleName;
-  std::vector<std::string> NamedModuleDeps;
-  std::vector<ModuleID> DirectModuleDeps;
-  std::vector<std::string> VisibleModules;
-  std::vector<Command> Commands;
-  std::string ContextHash;
-  const llvm::DenseSet<ModuleID> &AlreadySeen;
-};
-
-/// A simple dependency action controller that uses a callback. If no callback
-/// is provided, it is assumed that looking up module outputs is unreachable.
-class CallbackActionController : public DependencyActionController {
-public:
-  virtual ~CallbackActionController();
-
-  static std::string lookupUnreachableModuleOutput(const ModuleDeps &MD,
-                                                   ModuleOutputKind Kind) {
-    llvm::report_fatal_error("unexpected call to lookupModuleOutput");
-  };
-
-  CallbackActionController(LookupModuleOutputCallback LMO)
-      : LookupModuleOutput(std::move(LMO)) {
-    if (!LookupModuleOutput) {
-      LookupModuleOutput = lookupUnreachableModuleOutput;
-    }
-  }
-
-  std::string lookupModuleOutput(const ModuleDeps &MD,
-                                 ModuleOutputKind Kind) override {
-    return LookupModuleOutput(MD, Kind);
-  }
-
-private:
-  LookupModuleOutputCallback LookupModuleOutput;
+  clang::dependencies::DependencyScanningWorker Worker;
+  std::unique_ptr<clang::dependencies::TextDiagnosticsPrinterWithOutput>
+      DiagPrinterWithOS;
 };
 
 } // end namespace dependencies
 } // end namespace tooling
 } // end namespace clang
 
-#endif // LLVM_CLANG_TOOLING_DEPENDENCYSCANNING_DEPENDENCYSCANNINGTOOL_H
+#endif // LLVM_CLANG_TOOLING_DEPENDENCYSCANNINGTOOL_H
diff --git a/clang/lib/CMakeLists.txt b/clang/lib/CMakeLists.txt
index e90b009da606a..2fc69e4e4fa6f 100644
--- a/clang/lib/CMakeLists.txt
+++ b/clang/lib/CMakeLists.txt
@@ -18,6 +18,7 @@ add_subdirectory(Serialization)
 add_subdirectory(Frontend)
 add_subdirectory(FrontendTool)
 add_subdirectory(Tooling)
+add_subdirectory(DependencyScanning)
 add_subdirectory(DirectoryWatcher)
 add_subdirectory(Index)
 add_subdirectory(IndexSerialization)
diff --git a/clang/lib/Tooling/DependencyScanning/CMakeLists.txt b/clang/lib/DependencyScanning/CMakeLists.txt
similarity index 93%
rename from clang/lib/Tooling/DependencyScanning/CMakeLists.txt
rename to clang/lib/DependencyScanning/CMakeLists.txt
index 76bdc50097fff..2976f7c236f2e 100644
--- a/clang/lib/Tooling/DependencyScanning/CMakeLists.txt
+++ b/clang/lib/DependencyScanning/CMakeLists.txt
@@ -9,7 +9,7 @@ add_clang_library(clangDependencyScanning
   DependencyScanningFilesystem.cpp
   DependencyScanningService.cpp
   DependencyScanningWorker.cpp
-  DependencyScanningTool.cpp
+  DependencyScanningUtils.cpp
   DependencyScannerImpl.cpp
   InProcessModuleCache.cpp
   ModuleDepCollector.cpp
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScannerImpl.cpp b/clang/lib/DependencyScanning/DependencyScannerImpl.cpp
similarity index 99%
rename from clang/lib/Tooling/DependencyScanning/DependencyScannerImpl.cpp
rename to clang/lib/DependencyScanning/DependencyScannerImpl.cpp
index 657547d299abd..b17d6aec7263e 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScannerImpl.cpp
+++ b/clang/lib/DependencyScanning/DependencyScannerImpl.cpp
@@ -6,17 +6,16 @@
 //
 //===----------------------------------------------------------------------===//
 
-#include "DependencyScannerImpl.h"
+#include "clang/DependencyScanning/DependencyScannerImpl.h"
 #include "clang/Basic/DiagnosticFrontend.h"
 #include "clang/Basic/DiagnosticSerialization.h"
+#include "clang/DependencyScanning/DependencyScanningWorker.h"
 #include "clang/Driver/Driver.h"
 #include "clang/Frontend/FrontendActions.h"
-#include "clang/Tooling/DependencyScanning/DependencyScanningWorker.h"
 #include "llvm/ADT/ScopeExit.h"
 #include "llvm/TargetParser/Host.h"
 
 using namespace clang;
-using namespace tooling;
 using namespace dependencies;
 
 namespace {
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp b/clang/lib/DependencyScanning/DependencyScanningFilesystem.cpp
similarity index 99%
rename from clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
rename to clang/lib/DependencyScanning/DependencyScanningFilesystem.cpp
index 266944ee730cb..24a794e4a6a22 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
+++ b/clang/lib/DependencyScanning/DependencyScanningFilesystem.cpp
@@ -1,4 +1,4 @@
-//===- DependencyScanningFilesystem.cpp - clang-scan-deps fs --------------===//
+//===- DependencyScanningFilesystem.cpp - Optimized Scanning FS -----------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
@@ -6,13 +6,12 @@
 //
 //===----------------------------------------------------------------------===//
 
-#include "clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h"
+#include "clang/DependencyScanning/DependencyScanningFilesystem.h"
 #include "llvm/Support/MemoryBuffer.h"
 #include "llvm/Support/Threading.h"
 #include <optional>
 
 using namespace clang;
-using namespace tooling;
 using namespace dependencies;
 
 llvm::ErrorOr<DependencyScanningWorkerFilesystem::TentativeEntry>
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningService.cpp b/clang/lib/DependencyScanning/DependencyScanningService.cpp
similarity index 82%
rename from clang/lib/Tooling/DependencyScanning/DependencyScanningService.cpp
rename to clang/lib/DependencyScanning/DependencyScanningService.cpp
index 7f40c99f07287..72f359e56d116 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningService.cpp
+++ b/clang/lib/DependencyScanning/DependencyScanningService.cpp
@@ -1,4 +1,4 @@
-//===- DependencyScanningService.cpp - clang-scan-deps service ------------===//
+//===- DependencyScanningService.cpp - Scanning Service -------------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
@@ -6,10 +6,9 @@
 //
 //===----------------------------------------------------------------------===//
 
-#include "clang/Tooling/DependencyScanning/DependencyScanningService.h"
+#include "clang/DependencyScanning/DependencyScanningService.h"
 
 using namespace clang;
-using namespace tooling;
 using namespace dependencies;
 
 DependencyScanningService::DependencyScanningService(
diff --git a/clang/lib/DependencyScanning/DependencyScanningUtils.cpp b/clang/lib/DependencyScanning/DependencyScanningUtils.cpp
new file mode 100644
index 0000000000000..e27c597a14fcc
--- /dev/null
+++ b/clang/lib/DependencyScanning/DependencyScanningUtils.cpp
@@ -0,0 +1,38 @@
+//===- DependencyScanningUtils.cpp - Common Scanning Utilities ------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "clang/DependencyScanning/DependencyScanningUtils.h"
+
+using namespace clang;
+using namespace dependencies;
+
+TranslationUnitDeps FullDependencyConsumer::takeTranslationUnitDeps() {
+  TranslationUnitDeps TU;
+
+  TU.ID.ContextHash = std::move(ContextHash);
+  TU.ID.ModuleName = std::move(ModuleName);
+  TU.NamedModuleDeps = std::move(NamedModuleDeps);
+  TU.FileDeps = std::move(Dependencies);
+  TU.PrebuiltModuleDeps = std::move(PrebuiltModuleDeps);
+  TU.VisibleModules = std::move(VisibleModules);
+  TU.Commands = std::move(Commands);
+
+  for (auto &&M : ClangModuleDeps) {
+    auto &MD = M.second;
+    // TODO: Avoid handleModuleDependency even being called for modules
+    //   we've already seen.
+    if (AlreadySeen.count(M.first))
+      continue;
+    TU.ModuleGraph.push_back(std::move(MD));
+  }
+  TU.ClangModuleDeps = std::move(DirectModuleDeps);
+
+  return TU;
+}
+
+CallbackActionController::~CallbackActionController() {}
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp b/clang/lib/DependencyScanning/DependencyScanningWorker.cpp
similarity index 97%
rename from clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
rename to clang/lib/DependencyScanning/DependencyScanningWorker.cpp
index 0bc17f9c80605..b22b0f456fd5c 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
+++ b/clang/lib/DependencyScanning/DependencyScanningWorker.cpp
@@ -1,4 +1,4 @@
-//===- DependencyScanningWorker.cpp - clang-scan-deps worker --------------===//
+//===- DependencyScanningWorker.cpp - Thread-Safe Scanning Worker ---------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
@@ -6,14 +6,13 @@
 //
 //===----------------------------------------------------------------------===//
 
-#include "clang/Tooling/DependencyScanning/DependencyScanningWorker.h"
-#include "DependencyScannerImpl.h"
+#include "clang/DependencyScanning/DependencyScanningWorker.h"
 #include "clang/Basic/DiagnosticFrontend.h"
+#include "clang/DependencyScanning/DependencyScannerImpl.h"
 #include "clang/Driver/Driver.h"
 #include "clang/Driver/Tool.h"
 
 using namespace clang;
-using namespace tooling;
 using namespace dependencies;
 
 DependencyScanningWorker::DependencyScanningWorker(
diff --git a/clang/lib/Tooling/DependencyScanning/InProcessModuleCache.cpp b/clang/lib/DependencyScanning/InProcessModuleCache.cpp
similarity index 95%
rename from clang/lib/Tooling/DependencyScanning/InProcessModuleCache.cpp
rename to clang/lib/DependencyScanning/InProcessModuleCache.cpp
index d1e543b438225..1dd2d34032a96 100644
--- a/clang/lib/Tooling/DependencyScanning/InProcessModuleCache.cpp
+++ b/clang/lib/DependencyScanning/InProcessModuleCache.cpp
@@ -1,4 +1,4 @@
-//===----------------------------------------------------------------------===//
+//===- InProcessModuleCache.cpp - Implicit Module Cache ---------*- C++ -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
@@ -6,7 +6,7 @@
 //
 //===----------------------------------------------------------------------===//
 
-#include "clang/Tooling/DependencyScanning/InProcessModuleCache.h"
+#include "clang/DependencyScanning/InProcessModuleCache.h"
 
 #include "clang/Serialization/InMemoryModuleCache.h"
 #include "llvm/Support/AdvisoryLock.h"
@@ -15,7 +15,6 @@
 #include <mutex>
 
 using namespace clang;
-using namespace tooling;
 using namespace dependencies;
 
 namespace {
diff --git a/clang/lib/Tooling/DependencyScanning/ModuleDepCollector.cpp b/clang/lib/DependencyScanning/ModuleDepCollector.cpp
similarity index 99%
rename from clang/lib/Tooling/DependencyScanning/ModuleDepCollector.cpp
rename to clang/lib/DependencyScanning/ModuleDepCollector.cpp
index 3a99f8c882b8f..39bd2e2ab0032 100644
--- a/clang/lib/Tooling/DependencyScanning/ModuleDepCollector.cpp
+++ b/clang/lib/DependencyScanning/ModuleDepCollector.cpp
@@ -6,18 +6,17 @@
 //
 //===----------------------------------------------------------------------===//
 
-#include "clang/Tooling/DependencyScanning/ModuleDepCollector.h"
+#include "clang/DependencyScanning/ModuleDepCollector.h"
 
 #include "clang/Basic/MakeSupport.h"
+#include "clang/DependencyScanning/DependencyScanningWorker.h"
 #include "clang/Frontend/CompilerInstance.h"
 #include "clang/Lex/Preprocessor.h"
-#include "clang/Tooling/DependencyScanning/DependencyScanningWorker.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/Support/BLAKE3.h"
 #include <optional>
 
 using namespace clang;
-using namespace tooling;
 using namespace dependencies;
 
 void ModuleDeps::forEachFileDep(llvm::function_ref<void(StringRef)> Cb) const {
diff --git a/clang/lib/Tooling/CMakeLists.txt b/clang/lib/Tooling/CMakeLists.txt
index faaa53276d0e6..0972ecb08437f 100644
--- a/clang/lib/Tooling/CMakeLists.txt
+++ b/clang/lib/Tooling/CMakeLists.txt
@@ -10,7 +10,6 @@ add_subdirectory(Inclusions)
 add_subdirectory(Refactoring)
 add_subdirectory(ASTDiff)
 add_subdirectory(Syntax)
-add_subdirectory(DependencyScanning)
 add_subdirectory(Transformer)
 
 add_clang_library(clangTooling
@@ -18,6 +17,7 @@ add_clang_library(clangTooling
   ArgumentsAdjusters.cpp
   CommonOptionsParser.cpp
   CompilationDatabase.cpp
+  DependencyScanningTool.cpp
   Execution.cpp
   ExpandResponseFilesCompilationDatabase.cpp
   FileMatchTrie.cpp
@@ -39,6 +39,7 @@ add_clang_library(clangTooling
   clangAST
   clangASTMatchers
   clangBasic
+  clangDependencyScanning
   clangDriver
   clangOptions
   clangFormat
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningTool.cpp b/clang/lib/Tooling/DependencyScanningTool.cpp
similarity index 88%
rename from clang/lib/Tooling/DependencyScanning/DependencyScanningTool.cpp
rename to clang/lib/Tooling/DependencyScanningTool.cpp
index a1f2db7a471be..e037420f4fcf2 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningTool.cpp
+++ b/clang/lib/Tooling/DependencyScanningTool.cpp
@@ -6,13 +6,14 @@
 //
 //===----------------------------------------------------------------------===//
 
-#include "clang/Tooling/DependencyScanning/DependencyScanningTool.h"
+#include "clang/Tooling/DependencyScanningTool.h"
 #include "clang/Frontend/Utils.h"
 #include <optional>
 
 using namespace clang;
 using namespace tooling;
-using namespace dependencies;
+using namespace clang::dependencies;
+using namespace clang::tooling::dependencies;
 
 DependencyScanningTool::DependencyScanningTool(
     DependencyScanningService &Service,
@@ -200,29 +201,3 @@ DependencyScanningTool::computeDependenciesByNameWithContext(
 llvm::Error DependencyScanningTool::finalizeCompilerInstanceWithContext() {
   return Worker.finalizeCompilerInstanceWithContextOrError();
 }
-
-TranslationUnitDeps FullDependencyConsumer::takeTranslationUnitDeps() {
-  TranslationUnitDeps TU;
-
-  TU.ID.ContextHash = std::move(ContextHash);
-  TU.ID.ModuleName = std::move(ModuleName);
-  TU.NamedModuleDeps = std::move(NamedModuleDeps);
-  TU.FileDeps = std::move(Dependencies);
-  TU.PrebuiltModuleDeps = std::move(PrebuiltModuleDeps);
-  TU.VisibleModules = std::move(VisibleModules);
-  TU.Commands = std::move(Commands);
-
-  for (auto &&M : ClangModuleDeps) {
-    auto &MD = M.second;
-    // TODO: Avoid handleModuleDependency even being called for modules
-    //   we've already seen.
-    if (AlreadySeen.count(M.first))
-      continue;
-    TU.ModuleGraph.push_back(std::move(MD));
-  }
-  TU.ClangModuleDeps = std::move(DirectModuleDeps);
-
-  return TU;
-}
-
-CallbackActionController::~CallbackActionController() {}
diff --git a/clang/tools/clang-scan-deps/ClangScanDeps.cpp b/clang/tools/clang-scan-deps/ClangScanDeps.cpp
index 5f5bf42df5e6b..984a51c915f45 100644
--- a/clang/tools/clang-scan-deps/ClangScanDeps.cpp
+++ b/clang/tools/clang-scan-deps/ClangScanDeps.cpp
@@ -6,14 +6,14 @@
 //
 //===----------------------------------------------------------------------===//
 
+#include "clang/DependencyScanning/DependencyScanningService.h"
+#include "clang/DependencyScanning/DependencyScanningWorker.h"
 #include "clang/Driver/Compilation.h"
 #include "clang/Driver/Driver.h"
 #include "clang/Frontend/CompilerInstance.h"
 #include "clang/Frontend/TextDiagnosticPrinter.h"
 #include "clang/Tooling/CommonOptionsParser.h"
-#include "clang/Tooling/DependencyScanning/DependencyScanningService.h"
-#include "clang/Tooling/DependencyScanning/DependencyScanningTool.h"
-#include "clang/Tooling/DependencyScanning/DependencyScanningWorker.h"
+#include "clang/Tooling/DependencyScanningTool.h"
 #include "clang/Tooling/JSONCompilationDatabase.h"
 #include "clang/Tooling/Tooling.h"
 #include "llvm/ADT/STLExtras.h"
@@ -40,7 +40,9 @@
 #include "Opts.inc"
 
 using namespace clang;
-using namespace tooling::dependencies;
+using namespace tooling;
+using namespace clang::dependencies;
+using namespace clang::tooling::dependencies;
 
 namespace {
 
diff --git a/clang/unittests/CMakeLists.txt b/clang/unittests/CMakeLists.txt
index 54c781a35c20c..438a5c4c2e711 100644
--- a/clang/unittests/CMakeLists.txt
+++ b/clang/unittests/CMakeLists.txt
@@ -79,6 +79,7 @@ add_subdirectory(Basic)
 add_subdirectory(Lex)
 add_subdirectory(Parse)
 add_subdirectory(Driver)
+add_subdirectory(DependencyScanning)
 if(CLANG_ENABLE_STATIC_ANALYZER)
   add_subdirectory(Analysis)
   add_subdirectory(StaticAnalyzer)
diff --git a/clang/unittests/DependencyScanning/CMakeLists.txt b/clang/unittests/DependencyScanning/CMakeLists.txt
new file mode 100644
index 0000000000000..40425820d4d08
--- /dev/null
+++ b/clang/unittests/DependencyScanning/CMakeLists.txt
@@ -0,0 +1,11 @@
+add_clang_unittest(ClangDependencyScanningTests
+  DependencyScanningFilesystemTest.cpp
+  DependencyScanningWorkerTest.cpp
+  CLANG_LIBS
+  clangDependencyScanning
+  clangFrontend # For TextDiagnosticPrinter.
+  LLVM_COMPONENTS
+  ${LLVM_TARGETS_TO_BUILD}
+  Option
+  Support
+  )
diff --git a/clang/unittests/Tooling/DependencyScanning/DependencyScanningFilesystemTest.cpp b/clang/unittests/DependencyScanning/DependencyScanningFilesystemTest.cpp
similarity index 98%
rename from clang/unittests/Tooling/DependencyScanning/DependencyScanningFilesystemTest.cpp
rename to clang/unittests/DependencyScanning/DependencyScanningFilesystemTest.cpp
index cdb0ce2100d60..0e195411915aa 100644
--- a/clang/unittests/Tooling/DependencyScanning/DependencyScanningFilesystemTest.cpp
+++ b/clang/unittests/DependencyScanning/DependencyScanningFilesystemTest.cpp
@@ -6,12 +6,12 @@
 //
 //===----------------------------------------------------------------------===//
 
-#include "clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h"
+#include "clang/DependencyScanning/DependencyScanningFilesystem.h"
 #include "llvm/ADT/SmallString.h"
 #include "llvm/Support/VirtualFileSystem.h"
 #include "gtest/gtest.h"
 
-using namespace clang::tooling::dependencies;
+using namespace clang::dependencies;
 
 TEST(DependencyScanningFilesystem, OpenFileAndGetBufferRepeatedly) {
   auto InMemoryFS = llvm::makeIntrusiveRefCnt<llvm::vfs::InMemoryFileSystem>();
diff --git a/clang/unittests/DependencyScanning/DependencyScanningWorkerTest.cpp b/clang/unittests/DependencyScanning/DependencyScanningWorkerTest.cpp
new file mode 100644
index 0000000000000..e6a5684b10cc9
--- /dev/null
+++ b/clang/unittests/DependencyScanning/DependencyScanningWorkerTest.cpp
@@ -0,0 +1,97 @@
+//===- DependencyScanningWorkerTest.cpp -----------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "clang/DependencyScanning/DependencyScanningWorker.h"
+#include "clang/DependencyScanning/DependencyScanningUtils.h"
+#include "llvm/Support/FormatVariadic.h"
+#include "gtest/gtest.h"
+#include <string>
+
+using namespace clang;
+using namespace dependencies;
+
+TEST(DependencyScanner, ScanDepsWithDiagConsumer) {
+  StringRef CWD = "/root";
+
+  auto VFS = llvm::makeIntrusiveRefCnt<llvm::vfs::InMemoryFileSystem>();
+  VFS->setCurrentWorkingDirectory(CWD);
+  auto Sept = llvm::sys::path::get_separator();
+  std::string HeaderPath =
+      std::string(llvm::formatv("{0}root{0}header.h", Sept));
+  std::string TestPath = std::string(llvm::formatv("{0}root{0}test.cpp", Sept));
+  std::string AsmPath = std::string(llvm::formatv("{0}root{0}test.s", Sept));
+
+  VFS->addFile(HeaderPath, 0, llvm::MemoryBuffer::getMemBuffer("\n"));
+  VFS->addFile(TestPath, 0,
+               llvm::MemoryBuffer::getMemBuffer("#include \"header.h\"\n"));
+  VFS->addFile(AsmPath, 0, llvm::MemoryBuffer::getMemBuffer(""));
+
+  DependencyScanningService Service(ScanningMode::DependencyDirectivesScan,
+                                    ScanningOutputFormat::Make);
+  DependencyScanningWorker Worker(Service, VFS);
+
+  llvm::DenseSet<ModuleID> AlreadySeen;
+  FullDependencyConsumer DC(AlreadySeen);
+  CallbackActionController AC(nullptr);
+
+  struct EnsureFinishedConsumer : public DiagnosticConsumer {
+    bool Finished = false;
+    void finish() override { Finished = true; }
+  };
+
+  {
+    // Check that a successful scan calls DiagConsumer.finish().
+    std::vector<std::string> Args = {"clang",
+                                     "-target",
+                                     "x86_64-apple-macosx10.7",
+                                     "-c",
+                                     "test.cpp",
+                                     "-o"
+                                     "test.cpp.o"};
+
+    EnsureFinishedConsumer DiagConsumer;
+    bool Success = Worker.computeDependencies(CWD, Args, DC, AC, DiagConsumer);
+
+    EXPECT_TRUE(Success);
+    EXPECT_EQ(DiagConsumer.getNumErrors(), 0u);
+    EXPECT_TRUE(DiagConsumer.Finished);
+  }
+
+  {
+    // Check that an invalid command-line, which never enters the scanning
+    // action calls DiagConsumer.finish().
+    std::vector<std::string> Args = {"clang", "-invalid-arg"};
+    EnsureFinishedConsumer DiagConsumer;
+    bool Success = Worker.computeDependencies(CWD, Args, DC, AC, DiagConsumer);
+
+    EXPECT_FALSE(Success);
+    EXPECT_GE(DiagConsumer.getNumErrors(), 1u);
+    EXPECT_TRUE(DiagConsumer.Finished);
+  }
+
+  {
+    // Check that a valid command line that produces no scanning jobs calls
+    // DiagConsumer.finish().
+    std::vector<std::string> Args = {"clang",
+                                     "-target",
+                                     "x86_64-apple-macosx10.7",
+                                     "-c",
+                                     "-x",
+                                     "assembler",
+                                     "test.s",
+                                     "-o"
+                                     "test.cpp.o"};
+
+    EnsureFinishedConsumer DiagConsumer;
+    bool Success = Worker.computeDependencies(CWD, Args, DC, AC, DiagConsumer);
+
+    EXPECT_FALSE(Success);
+    EXPECT_EQ(DiagConsumer.getNumErrors(), 1u);
+    EXPECT_TRUE(DiagConsumer.Finished);
+  }
+}
diff --git a/clang/unittests/Tooling/CMakeLists.txt b/clang/unittests/Tooling/CMakeLists.txt
index 106c6b9dc38bd..8c8b22250cd83 100644
--- a/clang/unittests/Tooling/CMakeLists.txt
+++ b/clang/unittests/Tooling/CMakeLists.txt
@@ -13,8 +13,7 @@ add_clang_unittest(ToolingTests
   LookupTest.cpp
   QualTypeNamesTest.cpp
   RangeSelectorTest.cpp
-  DependencyScanning/DependencyScannerTest.cpp
-  DependencyScanning/DependencyScanningFilesystemTest.cpp
+  DependencyScannerTest.cpp
   RecursiveASTVisitorTests/Attr.cpp
   RecursiveASTVisitorTests/BitfieldInitializer.cpp
   RecursiveASTVisitorTests/CallbacksLeaf.cpp
diff --git a/clang/unittests/Tooling/DependencyScanning/DependencyScannerTest.cpp b/clang/unittests/Tooling/DependencyScannerTest.cpp
similarity index 78%
rename from clang/unittests/Tooling/DependencyScanning/DependencyScannerTest.cpp
rename to clang/unittests/Tooling/DependencyScannerTest.cpp
index 4523af33e3c28..9fcd0545b17fa 100644
--- a/clang/unittests/Tooling/DependencyScanning/DependencyScannerTest.cpp
+++ b/clang/unittests/Tooling/DependencyScannerTest.cpp
@@ -9,13 +9,13 @@
 #include "clang/AST/ASTConsumer.h"
 #include "clang/AST/DeclCXX.h"
 #include "clang/AST/DeclGroup.h"
+#include "clang/DependencyScanning/DependencyScanningWorker.h"
 #include "clang/Frontend/ASTUnit.h"
 #include "clang/Frontend/CompilerInstance.h"
 #include "clang/Frontend/FrontendAction.h"
 #include "clang/Frontend/FrontendActions.h"
 #include "clang/Tooling/CompilationDatabase.h"
-#include "clang/Tooling/DependencyScanning/DependencyScanningTool.h"
-#include "clang/Tooling/DependencyScanning/DependencyScanningWorker.h"
+#include "clang/Tooling/DependencyScanningTool.h"
 #include "clang/Tooling/Tooling.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/MC/TargetRegistry.h"
@@ -29,7 +29,8 @@
 
 using namespace clang;
 using namespace tooling;
-using namespace dependencies;
+using namespace clang::dependencies;
+using namespace tooling::dependencies;
 
 namespace {
 
@@ -304,84 +305,3 @@ TEST(DependencyScanner, ScanDepsWithModuleLookup) {
   EXPECT_TRUE(!llvm::is_contained(InterceptFS->StatPaths, OtherPath));
   EXPECT_EQ(InterceptFS->ReadFiles, std::vector<std::string>{"test.m"});
 }
-
-TEST(DependencyScanner, ScanDepsWithDiagConsumer) {
-  StringRef CWD = "/root";
-
-  auto VFS = llvm::makeIntrusiveRefCnt<llvm::vfs::InMemoryFileSystem>();
-  VFS->setCurrentWorkingDirectory(CWD);
-  auto Sept = llvm::sys::path::get_separator();
-  std::string HeaderPath =
-      std::string(llvm::formatv("{0}root{0}header.h", Sept));
-  std::string TestPath = std::string(llvm::formatv("{0}root{0}test.cpp", Sept));
-  std::string AsmPath = std::string(llvm::formatv("{0}root{0}test.s", Sept));
-
-  VFS->addFile(HeaderPath, 0, llvm::MemoryBuffer::getMemBuffer("\n"));
-  VFS->addFile(TestPath, 0,
-               llvm::MemoryBuffer::getMemBuffer("#include \"header.h\"\n"));
-  VFS->addFile(AsmPath, 0, llvm::MemoryBuffer::getMemBuffer(""));
-
-  DependencyScanningService Service(ScanningMode::DependencyDirectivesScan,
-                                    ScanningOutputFormat::Make);
-  DependencyScanningWorker Worker(Service, VFS);
-
-  llvm::DenseSet<ModuleID> AlreadySeen;
-  FullDependencyConsumer DC(AlreadySeen);
-  CallbackActionController AC(nullptr);
-
-  struct EnsureFinishedConsumer : public DiagnosticConsumer {
-    bool Finished = false;
-    void finish() override { Finished = true; }
-  };
-
-  {
-    // Check that a successful scan calls DiagConsumer.finish().
-    std::vector<std::string> Args = {"clang",
-                                     "-target",
-                                     "x86_64-apple-macosx10.7",
-                                     "-c",
-                                     "test.cpp",
-                                     "-o"
-                                     "test.cpp.o"};
-
-    EnsureFinishedConsumer DiagConsumer;
-    bool Success = Worker.computeDependencies(CWD, Args, DC, AC, DiagConsumer);
-
-    EXPECT_TRUE(Success);
-    EXPECT_EQ(DiagConsumer.getNumErrors(), 0u);
-    EXPECT_TRUE(DiagConsumer.Finished);
-  }
-
-  {
-    // Check that an invalid command-line, which never enters the scanning
-    // action calls DiagConsumer.finish().
-    std::vector<std::string> Args = {"clang", "-invalid-arg"};
-    EnsureFinishedConsumer DiagConsumer;
-    bool Success = Worker.computeDependencies(CWD, Args, DC, AC, DiagConsumer);
-
-    EXPECT_FALSE(Success);
-    EXPECT_GE(DiagConsumer.getNumErrors(), 1u);
-    EXPECT_TRUE(DiagConsumer.Finished);
-  }
-
-  {
-    // Check that a valid command line that produces no scanning jobs calls
-    // DiagConsumer.finish().
-    std::vector<std::string> Args = {"clang",
-                                     "-target",
-                                     "x86_64-apple-macosx10.7",
-                                     "-c",
-                                     "-x",
-                                     "assembler",
-                                     "test.s",
-                                     "-o"
-                                     "test.cpp.o"};
-
-    EnsureFinishedConsumer DiagConsumer;
-    bool Success = Worker.computeDependencies(CWD, Args, DC, AC, DiagConsumer);
-
-    EXPECT_FALSE(Success);
-    EXPECT_EQ(DiagConsumer.getNumErrors(), 1u);
-    EXPECT_TRUE(DiagConsumer.Finished);
-  }
-}

>From 1cd11341fbcfc682852f000d4faf0bb46cfcf26a Mon Sep 17 00:00:00 2001
From: Naveen Seth Hanig <naveen at linux.fritz.box>
Date: Wed, 3 Dec 2025 22:48:27 +0100
Subject: [PATCH 2/3] Merge from upstream
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

commit 1054a6e9dee0198da0a3d234fd3254aa9e143319
Author: Florian Hahn <flo at fhahn.com>
Date:   Wed Dec 3 21:13:11 2025 +0000

    [SCEV] Handle non-constant start values in AddRec UDiv canonicalization. (#170474)

    Follow-up to https://github.com/llvm/llvm-project/pull/169576 to enable
    UDiv canonicalization if the start of the AddRec is not constant.

    The fold is not restricted to constant start values, as long as we are
    able to compute a constant remainder. The fold is only applied if the
    subtraction of the remainder can be folded into the start expression, but
    that is just to avoid creating more complex AddRecs.

    For reference, the proof from #169576 is
    https://alive2.llvm.org/ce/z/iu2tav

    PR: https://github.com/llvm/llvm-project/pull/170474
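
    A rough sketch of the shape of this fold (my illustration, not taken from
    the patch; wrapping is ignored and C is assumed to divide Step), where R
    is the constant remainder Start mod C:

    ```latex
    \frac{\mathit{Start} + i\,\mathit{Step}}{C}
      = \frac{(\mathit{Start} - R) + i\,\mathit{Step}}{C}
      = \frac{\mathit{Start} - R}{C} + i\,\frac{\mathit{Step}}{C}
    ```

    i.e. {Start,+,Step} /u C becomes {(Start - R) /u C,+,Step /u C} once the
    constant remainder has been subtracted from the start expression.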

commit 095f8e07933636bba726e3a903f215ce9fc7e2dd
Author: Florian Hahn <flo at fhahn.com>
Date:   Wed Dec 3 21:06:36 2025 +0000

    [LV] Add more tests for finding the first-iv of argmin.

    Adds more test coverage for
    https://github.com/llvm/llvm-project/pull/170223.

commit 2fb2d7eb412f25fbe48f47a31b017a87d2398f8a
Author: Valentin Clement (バレンタイン クレメン) <clementval at gmail.com>
Date:   Wed Dec 3 13:05:28 2025 -0800

    [flang][cuda] Change how to handle static shared memory variables (#170388)

    Generate one global per static shared variable so the alignment can be
    set separately. Dynamic shared memory is unchanged.

commit d2accd386f3e9727309c97ecea1e22f11b617237
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Wed Dec 3 13:02:02 2025 -0800

    [Github] Make issue-write workflow support reading from multiple files

    This is so that we can read from multiple files emitted by the premerge
    workflow.

    Reviewers: tstellar, cmtice

    Reviewed By: cmtice

    Pull Request: https://github.com/llvm/llvm-project/pull/170411

commit b13b41a891dbd4bb7b49f3bfec4ebe4f42983f58
Author: Vladislav Dzhidzhoev <vdzhidzhoev at accesssoftek.com>
Date:   Wed Dec 3 22:01:34 2025 +0100

    Revert "[LLDB] Add SBFrameExtensions Tests (#169236)" (#170555)

    This reverts commit 5e5937c3d2e493a48837b2bdf179a53e8b80a66a, since the
    added test fails on the `lldb-x86_64-win` buildbot.

    https://lab.llvm.org/buildbot/#/builders/211/builds/4246

commit 04c81a99735c04b2018eeb687e74f9860e1d0e1b
Author: Matt Arsenault <Matthew.Arsenault at amd.com>
Date:   Wed Dec 3 16:00:12 2025 -0500

    CodeGen: Add LibcallLoweringInfo analysis pass (#168622)

    The libcall lowering decisions should be program dependent,
    depending on the current module's RuntimeLibcallInfo. We need
    another related analysis derived from that plus the current
    function's subtarget to provide concrete lowering decisions.

    This takes on a somewhat unusual form. It's a Module analysis,
    with a lookup keyed on the subtarget. This is a separate module
    analysis from RuntimeLibraryAnalysis to avoid that depending on
    codegen. It's not a function pass to avoid depending on any
    particular function, to avoid repeated subtarget map lookups in
    most of the use passes, and to avoid any recomputation in the
    common case of one subtarget (and keeps it reusable across
    repeated compilations).

    This also switches ExpandFp and PreISelIntrinsicLowering over as
    a sample function pass and module pass. Note this is not yet wired
    up to SelectionDAG, which is still using the LibcallLoweringInfo
    constructed inside of TargetLowering.

commit 5cbd294ca2390069181d984644dac6ca34b5e95c
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Wed Dec 3 20:57:45 2025 +0000

    [Github] Fix issue-write workflow after #170216

    This changed the name of one of the outputs that issue-write used to
    control whether or not it ran. This patch should fix that.

commit 3d598c33350a6691807441666f9c5014c18aff39
Author: Koakuma <koachan at protonmail.com>
Date:   Thu Dec 4 03:38:48 2025 +0700

    [SPARC] Remove CCIfConsecutiveRegs for f128 returns (#170133)

    It appears that using it will result in callers mistakenly thinking that
    complex f128 returns are done the sret way, when they should be returned
    in registers.

commit 58dd3a4fef51b11d4ea5f6c4f7c349589fb12255
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Wed Dec 3 20:26:09 2025 +0000

    [Github] Also run test issue write when touching issue-write.yml

    We should actually run the test workflow when touching the workflow we
    are attempting to test.

commit 7b3ec5191a701dcebbf3c05a53b938ddd5f3c2d1
Author: Ramkumar Ramachandra <ramkumar.ramachandra at codasip.com>
Date:   Wed Dec 3 20:25:52 2025 +0000

    [VPlan] Consolidate logic for narrowToSingleScalars (NFCI) (#167360)

    The logic for narrowing to single scalar recipes is in two different
    places: narrowToSingleScalarRecipes and legalizeAndOptimizeInductions.
    Consolidate them.

commit 562d911857d9e050b002b9904d64d0f08bf4a762
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Wed Dec 3 12:23:21 2025 -0800

    [Github] Make unprivileged-download-artifact download multiple artifacts

    This is designed to allow a workflow (e.g., premerge) to upload comments
    across multiple jobs. Subsequent PRs will wire this up within the
    issue-write workflow to support reading comments from multiple files.

    Reviewers: tstellar, cmtice

    Reviewed By: cmtice

    Pull Request: https://github.com/llvm/llvm-project/pull/170216

commit 8f6e95ef45d20709f338b0753a362c172a51eff7
Author: Ahmed Nour <ahmednour.mohamed2012 at gmail.com>
Date:   Wed Dec 3 22:19:54 2025 +0200

    [Clang][X86] Add constexpr support for permute4x64_pd and permute4x64_epi64 (#170442)

    This PR adds constexpr support for the AVX2 cross-lane permute
    intrinsics _mm256_permute4x64_pd and _mm256_permute4x64_epi64

    Resolves https://github.com/llvm/llvm-project/issues/169304

commit 6164b0785efcf6d9565cdcf42eada2187897e434
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Wed Dec 3 12:17:54 2025 -0800

    [Github] Add workflow to test the issue write workflow

    This does not test most of the functionality (i.e., that writing to an
    existing comment still works), but does ensure that the plumbing works
    and things are not completely broken.

    Reviewers: tstellar, cmtice

    Reviewed By: cmtice

    Pull Request: https://github.com/llvm/llvm-project/pull/170209

commit c5e6f4e99d6a1d74614cdfd866cf0f81ecc43984
Author: Florian Hahn <flo at fhahn.com>
Date:   Wed Dec 3 20:14:58 2025 +0000

    [AArch64] Add unrolling test with -mcpu=apple-a17.

    Currently Apple unrolling preferences are not applied to apple-a17.

commit 43b69166e7df5f82c15b7536e61f251428df07af
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Wed Dec 3 20:14:23 2025 +0000

    Revert "[clangd] Enable lit internal shell by default (#170186)"

    This reverts commit 671a8ce6bed475830ee9eb67cd3afb950e5a17e1.

    This still broke the clangd-ubuntu-tsan bot. It seems like somehow the
    PATH variable is not getting propagated in the
    system-include-extractor.test test.

commit c656bf30e6fd84bbc2aa8d7b8bacf32ee7d13d09
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Wed Dec 3 20:09:32 2025 +0000

    [Github] Add user of issue-write for #170209

    So that we can actually test the workflow before committing it into the tree.

commit fc1e91112b8388ec684b8f59c5b03337331db8c2
Author: Charles Zablit <c_zablit at apple.com>
Date:   Wed Dec 3 21:14:05 2025 +0100

    [lldb] ensure comment conforms to LLVM guidelines (#170533)

    This patch is a follow up to
    https://github.com/llvm/llvm-project/pull/170471.

commit 671a8ce6bed475830ee9eb67cd3afb950e5a17e1
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Wed Dec 3 12:05:39 2025 -0800

    Reapply "[clangd] Enable lit internal shell by default" (#170186)

    This reverts commit 4cfbc44ebe26692c209655c37aeb0b6cbf1d479b.

    This was failing due to a missing chmod binary on one of the bots
    (clangd-ubuntu-tsan). This patch fixes that by explicitly checking for
    the presence of a chmod binary. This should not be necessary (I have
    added a TODO for myself to update once I have looked at the bot setup
    which I am currently waiting on access to), as check-llvm works while
    requiring chmod unconditionally.

commit 4715e525648dde9abc50dfc93fa2cd3a67708cc7
Author: Fateme Hosseini <quic_fhossein at quicinc.com>
Date:   Wed Dec 3 14:05:07 2025 -0600

    [Hexagon] Add an option to use fast FP to int convert for some HVX cases (#169562)

    Lowering several flavors of fptosi for HVX can be done faster, but this
    violates the C/C++ convention on some arch tags. Nevertheless, customers
    are using direct intrinsics with the "incorrect" rounding mode and want
    the compiler to do the same.

    Default behavior is not changed.

    Patch By: Fateme Hosseini

    Co-authored-by: Sergei Larin <slarin at codeaurora.org>
    Co-authored-by: Sergei Larin <slarin at qti.qualcomm.com>

commit 50916a4adc106e140fc389097aa21eb93ea2f798
Author: Florian Hahn <flo at fhahn.com>
Date:   Wed Dec 3 19:48:23 2025 +0000

    [VPlan] Use predicate in VPInstruction::computeCost for selects. (#170278)

    In some cases, the lowering of a select depends on the predicate. If the
    condition of a select is a compare instruction, thread the predicate
    through to the TTI hook.

    PR: https://github.com/llvm/llvm-project/pull/170278

commit c5fa1f8c4bcc097ec8336bda8ef0b0a223abc2e6
Author: Valeriy Savchenko <vsavchenko at apple.com>
Date:   Wed Dec 3 19:34:21 2025 +0000

    [DAGCombiner] Handle type-promoted constants in UDIV lowering (#169491)

commit d041d5d4e07ba0eddd5120efd66520b3984a2b9b
Author: Daniel Thornburgh <dthorn at google.com>
Date:   Wed Dec 3 11:24:56 2025 -0800

    [clang] "modular_format" attribute for functions using format strings (#147431)

    This provides a C language `modular_format` attribute. This combines
    with information from the existing `format` attribute to set the new IR
    `modular-format` attribute.

    The purpose of these attributes is to enable "modular printf". A
    statically linked libc can provide a modular variant of printf that only
    weakly references implementation routines. The link-time symbol `printf`
    would strongly reference aspect symbols (e.g. for float, fixed point,
    etc.) that are provided by those routines, restoring the status quo.
    However, the compiler could transform calls with constant format strings
    to calls to the modular printf instead, and at the same time, it would
    emit strong references to the aspect symbols that are needed to
    implement the format string. Then, the printf implementation would
    contain only the union of the aspects requested.

    See issue #146159 for context.

commit bdf90227abd55b24821b126a50ab89e49a39a2b5
Author: Jason Rice <ricejasonf at gmail.com>
Date:   Wed Dec 3 11:15:00 2025 -0800

    [MLIR] Test generated build functions with move-only parameter types (#170391)

    This adds a test of the MLIR TableGen `OpBuilder` syntax with move-only
    parameter types. Additionally, an overload is added to test defining a
    builder outside of the TableGen interface.

commit d7cc82b9c53fa03dd25f7ae9b8f07871a89e7b56
Author: Philip Reames <preames at rivosinc.com>
Date:   Wed Dec 3 11:06:40 2025 -0800

    [IndVars] Split NumElimCmp statistic into three pieces (#170514)

    Only one of the three update paths actually eliminates the comparison.

    While here, use early return to clarify the code structure.

commit 33a80a7d8e34b4448f7a3af64ba1ec3a56c1e553
Author: Charles Zablit <c_zablit at apple.com>
Date:   Wed Dec 3 19:59:45 2025 +0100

    [lldb][windows] fix a use before allocation crash (#170530)

commit 4ca61f56619c6ed2e4a1113682503bdb3da79b35
Author: Yonah Goldberg <ygoldberg at nvidia.com>
Date:   Wed Dec 3 10:58:30 2025 -0800

    [NFC][SROA] Clean up rewritePartition type selection process (#169106)

    This change reverts
    https://github.com/llvm/llvm-project/commit/257251247a267c3fa30fdeef17ffa4987d8a52e5,
    which landed on Aug 8, 2022. That commit addressed the problem that if
    you have IR that looks something like:

    ```
    %alloca = alloca <4 x float>
    store <4 x float> %data, ptr %alloca
    %load = load half, ptr %alloca
    ```

    `getCommonType` would return `<4 x float>` because the `load half` isn't
    to the entire partition, so we skip the first `getTypePartition` check.
    https://github.com/llvm/llvm-project/commit/257251247a267c3fa30fdeef17ffa4987d8a52e5
    added a later check that sees that `<4 x float>` is not vector
    promotable because of the `load half`, and then calls
    `getTypePartition`, which changes the `sliceTy` to `<8 x half>`, which
    is vector promotable because the store can be changed to `store <8 x
    half>`. So we set the `sliceTy` to `<8 x half>`, we can promote the
    alloca, and everyone is happy.

    This code became unnecessary after
    https://github.com/llvm/llvm-project/commit/529eafd9beff233ba8debfc73e0b5c04cac36835,
    which landed ~3 months later, which fixes the issue in a different way.
    `isVectorPromotionViable` was already smart enough to try `<8 x half>`
    as a type candidate because it sees the `load half`. However, this
    candidate didn't work because it conflicts with `store <4 x float>`.
    This commit added logic to try integer-ifying candidates if there is no
    common type. So the `<8 x half>` candidate gets converted to `<8 x
    i16>`, which works because we can convert the alloca to `alloca <8 x
    i16>` and the load to `load i16`, allowing promotion.

    After
    https://github.com/llvm/llvm-project/commit/529eafd9beff233ba8debfc73e0b5c04cac36835,
    the original commit is pointless. It tries to refine the `SliceTy`, but
    if `isVectorPromotionViable` succeeds, it returns a new type to use and
    we will ignore the `SliceTy`.

    This change is my first patch to try to simplify the type selection
    process in rewritePartition. I had some other ideas that I tried in
    https://github.com/llvm/llvm-project/pull/167771 and
    https://github.com/llvm/llvm-project/pull/168796, but they need
    refinement.

commit c2472be3fb359e640587f84ea668c98a2d86888b
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Wed Dec 3 18:40:48 2025 +0000

    [VectorCombine][X86] foldShuffleOfIntrinsics - provide the arguments to a getShuffleCost call (#170465)

    Ensure the arguments are passed to the getShuffleCost calls to improve
    cost analysis; in particular, if these are constant, the costs will be
    recognised as free.

    Noticed while reviewing #170052

commit 907c94b3c2cc271a06afe9fe149d954578188c31
Author: Medha Tiwari <75640645+medhatiwari at users.noreply.github.com>
Date:   Thu Dec 4 00:08:40 2025 +0530

    [X86][Clang] Add constexpr support for AVX512 kshift intrinsics (#170480)

    Add AVX512 kshiftli/kshiftri mask intrinsics to be used in constexpr.

    Enables constexpr evaluation for:
    - `_kshiftli_mask8/16/32/64`
    - `_kshiftri_mask8/16/32/64`

    Fixes #162056

commit 7c33b8247d7ed0f8ff0e5ac8cc899ca3d6f8d183
Author: Krzysztof Parzyszek <Krzysztof.Parzyszek at amd.com>
Date:   Wed Dec 3 12:30:53 2025 -0600

    [flang][OpenMP] Move two functions to check-omp-loop.cpp, NFC (#170526)

    These are checks for clauses that apply to loop constructs.

commit 106edbdabef8bcd914ec1720f7fa6adb07aa4e6b
Author: Jacob Lalonde <jalalonde at fb.com>
Date:   Wed Dec 3 10:29:18 2025 -0800

    [LLDB] Fix deadlock in module callback when running in parallel (#168425)

    When the target is being created, the target list acquires the mutex for
    the duration of the target creation process. However if a module
    callback is enabled and is being called in parallel there exists an
    opportunity to deadlock if the callback calls into targetlist. I've
    created a minimum repro
    [here](https://gist.github.com/Jlalond/2557e06fa09825f338eca08b1d21884f).

    ```
    command script import dead-lock-example (from above gist)
    ...
    target create a.out
    [hangs]
    ```

    This looks like a straightforward fix, where `CreateTargetInternal`
    doesn't access any state directly, and instead calls methods which are
    themselves thread-safe. So I've moved the lock to when we update the
    list with the created target. I'm not sure if this is a comprehensive
    fix, but it does fix my above example and in my (albeit limited)
    testing, doesn't cause any strange change in behavior.

commit a8ccd42ab23af6848929a638cd6b099953c7e491
Author: Tom Stellard <tstellar at redhat.com>
Date:   Wed Dec 3 10:27:28 2025 -0800

    workflows: Factor out artifact attestation and upload into a composite action (#169621)

    Also, switch the release-sources workflow over to use this new action.
    As a result of this change, the attestation file for the sources will be
    renamed from attestation.jsonl to $TAG-sources-attestation.jsonl.

commit 2221f4a06ec2409f7396ce4408442f115aca1ae0
Author: Jay Foad <jay.foad at amd.com>
Date:   Wed Dec 3 18:26:40 2025 +0000

    [AMDGPU] Add a RUN line to check VGPR MSBs for VOPD pairs (#170494)

    Some tests were added in #157168. This patch makes failures more obvious
    because they will hit an "Invalid VOPD pair was created" assertion
    during VGPR lowering.

commit 63ea3537d55f75be0d6fb92aef16465b291fa9ed
Author: Nathan Corbyn <n_corbyn at apple.com>
Date:   Wed Dec 3 18:17:54 2025 +0000

    [libunwind](TestOnly) Mark failing tests as unsupported on Apple targets (#170488)

    #167642 introduced a number of test failures on one of our stage 2
    builds:
    https://ci.swift.org/job/llvm.org/job/clang-stage2-Rthinlto/1403/. This
    PR marks these tests as unsupported on `.*-apple.*` targets.

commit 0006cd694f8640cb3820d16c9d49d1155b06cda6
Author: Jasmine Tang <jjasmine at igalia.com>
Date:   Wed Dec 3 18:08:40 2025 +0000

    [CIR] Upstream builtin scatter from ClangIR incubator (#170353)

    Part of [#167752](https://github.com/llvm/llvm-project/issues/167752)

commit 94232f9f560f84d2ae7f50b2d1df5bc26b2ce69e
Author: Jan André Reuter <j.reuter at fz-juelich.de>
Date:   Wed Dec 3 19:03:34 2025 +0100

    [OpenMP][OMPT] Use global thread id for `codeptr_ra` in `end_critical` (#169826)

    When a critical construct has finished, it will trigger a
    critical-released event. If a tool is attached, and the `mutex_released`
    callback was registered, the tool will receive an event containing the
    `codeptr_ra`, the return address of the callback invocation.

    All the way back in 82e94a593433f36734e2d34898d353a2ecb65b8b, this
    `codeptr_ra` was implemented by calling `__ompt_load_return_address`
    with a fixed global thread id of `0`. However, this approach results in
    a race condition and can yield incorrect results to the tool.

    `__ompt_load_return_address(0)` points to the current return address of
    the thread 0 in `__kmp_threads`. This thread may already execute some
    other construct. A tool might therefore receive the return address of
    e.g. some `libomp` internals, or other parts of the user code.
    Additionally, a call to `__ompt_load_return_address` resets the
    `th.ompt_thread_info.return_address` to `NULL`, therefore also affecting
    the return address of thread 0. Another dispatched event, e.g.
    parallel-begin might therefore not transfer any `codeptr_ra`.

    To fix this, replace the fixed thread id by the `global_tid`, which is
    stored just before dispatching the `mutex_released` callback.

    Signed-off-by: Jan André Reuter <j.reuter at fz-juelich.de>

commit 540fd18568deb299a35b009d34ce32f96e3944bd
Author: Matt Arsenault <Matthew.Arsenault at amd.com>
Date:   Wed Dec 3 13:01:21 2025 -0500

    DAG: Avoid using getLibcallName when looking for a divrem call (#170413)

    Also introduce an error if it's not available, which is not yet
    testable.

commit cdb501064f35dbe5a1d49721daf59eca261057e9
Author: Matt Arsenault <Matthew.Arsenault at amd.com>
Date:   Wed Dec 3 13:01:04 2025 -0500

    DAG: Avoid more uses of getLibcallName (#170402)

commit 8d6c5cddf245d652bb2badc065848d280ef8aa9f
Author: Matt Arsenault <Matthew.Arsenault at amd.com>
Date:   Wed Dec 3 13:00:45 2025 -0500

    DAG: Use LibcallImpl in various getLibFunc helpers (#170400)

    Avoid using getLibcallName in favor of querying the
    libcall impl, and getting the ABI details from that.

commit 14ed98271bb55cfb72ba1045fb1dec6c285a7456
Author: Charles Zablit <c_zablit at apple.com>
Date:   Wed Dec 3 18:42:46 2025 +0100

    [NFC][lldb][windows] refactor the creation of inherited handles (#170301)

    Co-authored-by: Saleem Abdulrasool <compnerd at compnerd.org>

commit 817ab49ece9b0ccafd9a01ad7bd910c102f161c2
Author: Andy Kaylor <akaylor at nvidia.com>
Date:   Wed Dec 3 09:30:16 2025 -0800

    [CIR][NFC] Add infrastructure for AArch64 builtins (#170386)

    This change adds the basic code structure for handling AArch64 builtins.
    The structure of this code is brought over from classic codegen to make
    implementing missing builtins easier. In some cases, the handling
    involved too much logic for a simple NFC change, so those parts were
    replaced with a MissingFeature assert.

    The actual handling for all builtins is left for later changes.

commit bd4c21b3c8a897e5ca467134d26ec6d831c8087a
Author: Mehdi Amini <joker.eph at gmail.com>
Date:   Thu Aug 21 08:04:12 2025 -0700

    [MLIR] Apply clang-tidy fixes for performance-move-const-arg in NVGPUTransformOps.cpp (NFC)

commit c379f7cc0151fdf39cca8bfaf65e701308c77de0
Author: Sang Ik Lee <sang.ik.lee at intel.com>
Date:   Wed Dec 3 09:22:18 2025 -0800

    [MLIR][XeGPU] Add integration with XeGPU load / store ops to / from memref subview. (#170385)

    Add XeGPU integration test for missing usage case: base memory from
    memref subview.

commit 70dd63b7804255daba4154c7cc5061c1072923f7
Author: Craig Topper <craig.topper at sifive.com>
Date:   Wed Dec 3 09:22:01 2025 -0800

    [RISCV] Move tuning features below non-tuning features. Put CPU family in their own section. NFC (#170352)

    We had 4 features after all the tuning features, but there didn't seem
    to be a particular reason for it.

    Put the CPU family tuning features in their own section after the tuning
    features instead of in the middle.

commit 93832466cc40c142eb39d96876f98b49927c255b
Author: Sebastian Pop <spop at nvidia.com>
Date:   Wed Dec 3 11:19:56 2025 -0600

    [DA] Fix zero coeff bug in Strong SIV test with runtime assumptions (#149991) (#155037)

    Fix GitHub issue #149991 where Strong SIV test incorrectly concludes
    'none!' for symbolic coefficients that could be zero, leading to 0/0
    undef behavior.

    The Strong SIV test was incorrectly concluding "no dependence" when the
    coefficient is symbolic and the delta (difference between source and
    destination) is zero.

    When delta=0, the Strong SIV test divides delta/coeff to get the
    distance. The bug occurs when coeff is an unknown symbolic value: if
    coeff=0 at runtime, then 0/0 is undefined and all iterations access the
    same memory location, creating a true dependence that was being missed.
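
    As a hedged illustration (mine, not the actual regression test), a loop of
    this shape shows why that conclusion is unsound:

    ```cpp
    // Both accesses use the subscript k*i + 5, so the delta between them is 0
    // and Strong SIV would compute the distance as 0/k. If k is 0 at runtime,
    // every iteration reads and writes A[5], so a dependence exists and
    // reporting "no dependence" would be wrong.
    void example(double *A, long k, long n) {
      for (long i = 0; i < n; ++i)
        A[k * i + 5] = A[k * i + 5] + 1.0;
    }
    ```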

commit d18d53fda8755a6f29be00b9bf0a6672a85dd444
Author: Sebastian Pop <spop at nvidia.com>
Date:   Wed Dec 3 11:16:05 2025 -0600

    [DA] add testcases for bug #148435 (#154980)

    Add regression tests from issue #148435 .

commit 0ffabf4d084ffb40345c4660c2182b7067475df5
Author: Jan Svoboda <jan_svoboda at apple.com>
Date:   Wed Dec 3 09:11:43 2025 -0800

    [clang][deps] Use the caching VFS even in the 'preprocess' mode (#168970)

    The dependency scanner worker's VFS originally unconditionally did two
    things: file system access caching and dependency directives extraction.
    That's why `clang-scan-deps -mode preprocess` avoided using the VFS
    entirely. Since then, the dependency directives extraction was made
    lazy/on-demand/optional, meaning it should be possible to use only the
    caching parts of the VFS. This PR does exactly that, speeding up
    `clang-scan-deps -mode preprocess` on my config of Clang/LLVM from ~80s
    to ~38s. (For comparison, `clang-scan-deps -mode
    preprocess-dependency-directives` runs in ~13s.)

    (The real motivation was to simplify the VFS handling in the scanner,
    this is just a nice side-effect.)

commit 838ad0efbf57dfcd6c42c2c5497b30f26492e925
Author: Nicolai Hähnle <nicolai.haehnle at amd.com>
Date:   Wed Dec 3 09:11:05 2025 -0800

    AMDGPU: Generalize and normalize some tests to avoid future churn (#170508)

commit 836935197b8ff38459bb86c5aa592ef018311250
Author: Tarun Prabhu <tarun at lanl.gov>
Date:   Wed Dec 3 10:08:33 2025 -0700

    [flang][docs] Fix title and text in the release notes page

    The title of the release notes page always showed "|version|
    (In-Progress)". This has been fixed so the release version is shown as
    expected. '(In-Progress)' is now only shown on non-release branches.
    Unlike clang, flang does not use ${LLVM_VERSION_SUFFIX}, so even on
    non-release branches the 'git' suffix will not be shown.

commit 9f9e15f71553a2cfad040b87cb8e9a3ab5bee808
Author: Amr Hesham <amr96 at programmer.net>
Date:   Wed Dec 3 17:17:56 2025 +0100

    [CIR] Upstream SizeOf for VariableArrayType (#169993)

    Upstream SizeOf support for VariableArrayType

commit c752bb9203954ebb2c6032d462e020fbbad4757e
Author: Philip Reames <preames at rivosinc.com>
Date:   Wed Dec 3 08:16:22 2025 -0800

    [IndVars] Strengthen inference of samesign flags (#170363)

    When reviewing another change, I noticed that we were failing to infer
    samsign for two cases: 1) an unsigned comparison, and 2) when both
    arguments were known negative.

    Using CVP and InstCombine as a reference, we need to be careful to not
    allow eq/ne comparisons. I'm a bit unclear on the why of that, and for
    now am going with the low risk change. I may return to investigate that
    in a follow up.

    Compile time results look like noise to me, see:
    https://llvm-compile-time-tracker.com/compare.php?from=49a978712893fcf9e5f40ac488315d029cf15d3d&to=2ddb263604fd7d538e09dc1f805ebc30eb3ffab0&stat=instructions:u

commit ec6a15f84db135186f5075e15146c7f2ec532d3a
Author: Folkert de Vries <folkert at folkertdev.nl>
Date:   Wed Dec 3 17:04:16 2025 +0100

    [X86] optimize masked truncated saturating stores (#169827)

    Combine the saturating operation into the masked truncating store.

    https://godbolt.org/z/n1YfavKP6

    ```asm
    _mm256_mask_cvtusepi16_storeu_epi8_manual: # @_mm256_mask_cvtusepi16_storeu_epi8_manual
            kmovd   k1, esi
            vmovdqa ymm0, ymmword ptr [rdx]
            vpminuw ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
            vpmovwb xmmword ptr [rdi] {k1}, ymm0
            vzeroupper
            ret
    _mm256_mask_cvtusepi16_storeu_epi8_intrinsic: # @_mm256_mask_cvtusepi16_storeu_epi8_intrinsic
            kmovd   k1, esi
            vmovdqa ymm0, ymmword ptr [rdx]
            vpmovuswb       xmmword ptr [rdi] {k1}, ymm0
            vzeroupper
            ret
    ```

commit bd21095d8ba0bff04f5718096601638ecf9270db
Author: Hongyu Chen <xxs_chy at outlook.com>
Date:   Wed Dec 3 23:55:59 2025 +0800

    [MachineBasicBlock] Don't split loop header successor if the terminator is unanalyzable (#170146)

    Fixes https://github.com/llvm/llvm-project/issues/170051
    The previous implementation allowed splitting the successor if it's the
    loop header, regardless of whether the terminator of `this` is
    analyzable.

commit 58d74febfa3958f7d870c9dca35eb20264c211e8
Author: Bertik23 <39457484+Bertik23 at users.noreply.github.com>
Date:   Wed Dec 3 16:50:55 2025 +0100

    [SupportLSP] Add ShowMessageParams (#164626)

    Adds ShowMessageParams to LSP support according to the [LSP
    specification](https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#showMessageRequestParams).

commit eb7db0b9ecceed9719f841cc789ecaa6d5c9aaef
Author: Charitha Saumya <136391709+charithaintc at users.noreply.github.com>
Date:   Wed Dec 3 07:48:00 2025 -0800

    [mlir][xegpu] Change `index` arithmetic ops to `arith` ops. (#170390)

    Index ops cause some issues during SIMT distribution because they don't
    have the `Elementwise` mappable trait. This PR replaces all index
    arithmetic ops with matching `arith` dialect ops.

commit 267865a7b54dd84dc22f147623ec281d34bf7a3f
Author: Philip Reames <preames at rivosinc.com>
Date:   Wed Dec 3 07:31:29 2025 -0800

    [SCEV] Factor out utility for proving same sign of two SCEVs [nfc] (#170376)

    This is a slightly different API than ConstantRange's
    areInsensitiveToSignednessOfICmpPredicate. The only actual difference
    (beyond naming) is the handling of empty ranges (i.e. unreachable code).
    I wanted to keep the existing SCEV behavior for the unreachable code as
    we should be folding that to poison, not reasoning about samesign. I
    tried the other variant locally, and saw no test changes.

commit ccd4e7b1ed3858c64b4667787929b939513bc929
Author: John Brawn <john.brawn at arm.com>
Date:   Wed Dec 3 15:28:46 2025 +0000

    [LSR] Make OptimizeLoopTermCond able to handle some non-cmp conditions (#165590)

    Currently OptimizeLoopTermCond can only convert a cmp instruction to
    using a postincrement induction variable, which means it can't handle
    predicated loops where the termination condition comes from
    get_active_lane_mask. Relax this restriction so that we can handle any
    kind of instruction, though only if it's the instruction immediately
    before the branch (except for possibly an extractelement).

commit c128fd9bebf7d281ac7cf12d8258573e8928672b
Author: Oleksandr T. <oleksandr.tarasiuk at outlook.com>
Date:   Wed Dec 3 17:24:33 2025 +0200

    [Clang] prevent crash on invalid nested name specifiers with a single colon (#169246)

    Fixes #167905

    ---

    This patch addresses an issue where invalid nested name specifier
    sequences containing a single colon (`a:c::`) could be treated during
    recovery as valid scope specifiers, which in turn led to a crash

    https://github.com/llvm/llvm-project/blob/c543615744d61e0967b956c402e310946d741570/clang/lib/Parse/ParseExprCXX.cpp#L404-L418

    For malformed inputs like `a:c::`, the single colon recovery incorrectly
    triggers and produces an `annot_cxxscope`. When tentative parsing later
    runs

    https://github.com/llvm/llvm-project/blob/996213c6ea0dc2e47624c6b06c0833a882c1c1f7/clang/lib/Parse/ParseTentative.cpp#L1739-L1740

    the classifier returns `Ambiguous`, which doesn't stop parsing. The
    parser then enters the

    https://github.com/llvm/llvm-project/blob/996213c6ea0dc2e47624c6b06c0833a882c1c1f7/clang/lib/Parse/ParseTentative.cpp#L1750-L1752

    and consumes the invalid scope annotation, eventually reaching `EOF` and
    crashing.

commit d0f5a49fb6f3604dbb7d6692ad0f81ed1cdf3a86
Author: Nikita Popov <npopov at redhat.com>
Date:   Wed Dec 3 16:21:47 2025 +0100

    [Support] Support debug counters in non-assertion builds (#170468)

    This enables the use of debug counters in (non-assertion) release
    builds. This is useful to enable debugging without having to switch to
    an assertion-enabled build, which may not always be easy.

    After some recent improvements, always supporting debug counters no
    longer has measurable overhead.
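
    As a hedged illustration (not code from this patch), this is roughly how a
    pass declares and consults a debug counter; with this change the check is
    also honoured in non-assertion release builds:

    ```cpp
    #include "llvm/Support/DebugCounter.h"
    using namespace llvm;

    // The counter name and description below are illustrative only.
    DEBUG_COUNTER(ExampleCounter, "example-transform",
                  "Controls which candidate transformations are applied");

    static bool tryTransform() {
      // Consults the counter (controlled via the -debug-counter command-line
      // option); when it says no, the transformation is skipped.
      if (!DebugCounter::shouldExecute(ExampleCounter))
        return false;
      // ... perform the transformation ...
      return true;
    }
    ```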

commit 5ab8c3a590681b557b117827f8cfcded6dd72015
Author: Sohaib Iftikhar <sohaib1692 at gmail.com>
Date:   Wed Dec 3 16:05:25 2025 +0100

    [LLDB|BUILD] Fix for c50802cb (#170484)

    Fix after #170236

commit 4c09e45f1d54730bd1e50efdca8df5c768558376
Author: Mehdi Amini <joker.eph at gmail.com>
Date:   Thu Aug 21 11:03:23 2025 -0700

    [MLIR] Apply clang-tidy fixes for llvm-qualified-auto in OpenMPToLLVMIRTranslation.cpp (NFC)

commit 45b697e610fd24b4114d78f9d7819fa5e9461371
Author: Björn Pettersson <bjorn.a.pettersson at ericsson.com>
Date:   Wed Dec 3 15:45:26 2025 +0100

    [MemoryBuiltins] Consider index type size when aggregating gep offsets (#132365)

    [MemoryBuiltins] Consider index type size when aggregating gep offsets
    Main goal here is to fix some bugs seen with LowerConstantIntrinsics
    pass and the lowering of llvm.objectsize.

    In ObjectSizeOffsetVisitor::computeImpl we are using an external
    analysis together with stripAndAccumulateConstantOffsets. The idea
    is to compute the Min/Max value of individual offsets within a GEP.
    The bug solved here is that when doing the Min/Max comparisons the
    external analysis wasn't considering the index type size (given by
    the data layout), it was simply using the type from the IR. Since a
    GEP is defined as sext/truncating indices we need to consider the
    index type size in the external analysis.

    This solves a regression (false ubsan warnings) seen after commit

    https://github.com/llvm/llvm-project/commit/02b8ee281947f6cb39c7eb3c4bbba59322e9015b
    (https://github.com/llvm/llvm-project/pull/117849).

commit 045331e4a035fa5dd4e91db03c5c7d6335443c03
Author: Mehdi Amini <joker.eph at gmail.com>
Date:   Thu Aug 21 13:29:40 2025 -0700

    [MLIR] Apply clang-tidy fixes for performance-unnecessary-value-param in SymbolTableTest.cpp (NFC)

commit b1d06058a39579cfc6ea48c496a1f63f023c5cb5
Author: Oleksandr T. <oleksandr.tarasiuk at outlook.com>
Date:   Wed Dec 3 16:20:12 2025 +0200

    [Clang] adjust caret placement for the suggested attribute location for enum class (#168092)

    Fixes #163224

    ---

    This patch addresses the issue by correcting the caret insertion
    location for attributes incorrectly positioned before an enum. The
    location is now derived from the associated `EnumDecl`: for named enums,
    the attribute is placed before the identifier, while for anonymous enum
    definitions, it is placed before the opening brace, with a fallback to
    the semicolon when no brace is present.

    For example:

    ```cpp
      [[nodiscard]] enum class E1 {};
    ```

    is now suggested as:

    ```cpp
      enum class [[nodiscard]] E1 {};
    ```

commit be3204a59d53f1e44080b99813fb69db0672b5d1
Author: sstwcw <su3e8a96kzlver at posteo.net>
Date:   Wed Dec 3 14:14:01 2025 +0000

    [clang-format] Ignore C++ keywords when formatting Verilog (#167984)

    In the sample below, the `private` identifier is the name of the type,
    and the `try` identifier is the name of the variable.

    new

    ```SystemVerilog
    begin
      private try;
    end
    ```

    old

    ```SystemVerilog
    begin
    private
      try
        ;
    end
    ```

commit 75c85bafb830e5a7bd7fda13d2648180538ff513
Author: sstwcw <su3e8a96kzlver at posteo.net>
Date:   Wed Dec 3 14:13:42 2025 +0000

    [clang-format] Continue aligned lines without parentheses (#167979)

    before, with the options `AlignConsecutiveDeclarations` and
    `AlignConsecutiveAssignments` enabled

    ```C++
    veryverylongvariablename = somethingelse;
    shortervariablename      = anotherverylonglonglongvariablename + //
                          somevariablethatwastoolongtofitonthesamerow;

    double i234 = 0;
    auto   v    = false ? type{}
                        : type{
                         1,
                     };
    ```

    after

    ```C++
    veryverylongvariablename = somethingelse;
    shortervariablename      = anotherverylonglonglongvariablename + //
                               somevariablethatwastoolongtofitonthesamerow;

    double i234 = 0;
    auto   v    = false ? type{}
                        : type{
                              1,
                          };
    ```

    Fixes #126873.

    Fixes #57612.

    Previously, the part for determining whether aligning a line should move
    the next line relied on having a pair of tokens such as parentheses
    surrounding both lines. There are often no such tokens. For example in
    the first block above. This patch removes the requirement for those
    tokens.

    Now the program keeps track of how the position is calculated. The
    alignment step moves the next line if its position is based on a column
    to the right of the token that gets aligned.

    The column that the position of the line is based on is more detailed
    than the `IsAligned` property that the program used before this patch.
    It enables the program to handle cases where parts that should not
    usually move with the previous line and parts that should are nested
    like in the second block above. That is why the patch uses it instead of
    fake parentheses.

commit f83f6f565f408c8d24ff024146a002f6a1ea77c7
Author: Mehdi Amini <joker.eph at gmail.com>
Date:   Wed Dec 3 14:59:34 2025 +0100

    Fix lit testing to support standalone testing (#170365)

    To be able to test lit without having a configuration of LLVM, we need
    to support invocations that are not going through the lit.site.cfg and
    thus don't have an llvm_config set up.

commit cb5362a43329c0e9747e1d63202b00d461db4831
Author: Krzysztof Parzyszek <Krzysztof.Parzyszek at amd.com>
Date:   Wed Dec 3 07:35:33 2025 -0600

    [flang][OpenMP] Rename OmpLoopRangeClause to OmpLooprangeClause, NFC (#170370)

    The convention is to change spelling from snake_case to UpperCamel, and
    use the result as a stem in derived names, e.g.
    - spelling is "some_clause" -> stem is SomeClause
    - spelling is "someclause" -> stem is Someclause

    Member of the OmpClause variant is <stem> itself, e.g. Looprange as in
    parser::OmpClause::Looprange.

    Specific clause class name is Omp<stem>Clause, e.g. OmpLooprangeClause.

commit 21d006c4828a2f547e861c23549796834a377d2b
Author: Eugene Epshteyn <eepshteyn at nvidia.com>
Date:   Wed Dec 3 08:29:43 2025 -0500

    [flang] Support kind/index lookup inside of EQUIVALENCE (#170056)

    Turn off "in EQUIVALENCE" check for processing of array subscripts,
    since subscripts themselves are not part of the EQUIVALENCE.

    Fixes #169590

commit 00c8e615e30a6f38698b7bb7e426f83abb8b5798
Author: Lukacma <Marian.Lukac at arm.com>
Date:   Wed Dec 3 12:55:19 2025 +0000

    [AArch64] Add bitcasts for lowering saturating add/sub and shift intrinsics.  (#161840)

    This is a follow-up patch to #157680. In this patch, we are adding
    explicit bitcasts to floating-point type when lowering saturating
    add/sub and shift NEON scalar intrinsics using SelectionDAG, so they can
    be picked up by patterns added in first part of this series. To do that,
    we have to create new nodes for these intrinsics, which operate on
    floating-point types and wrap them in bitcast nodes.

commit 8b94997a475192d0e519d03cf009f5c51d6a389e
Author: Charles Zablit <c_zablit at apple.com>
Date:   Wed Dec 3 13:53:17 2025 +0100

    [lldb][windows] fix invalid corefile error message (#170471)

commit 2fc12754009b835f00dd8b604096b68bad96e3c1
Author: Pengcheng Wang <wangpengcheng.pp at bytedance.com>
Date:   Wed Dec 3 20:42:12 2025 +0800

    [RISCV] Fix corner cases after #170070 (#170438)

    There are two fixes:

    1. Clear kill flags for `FalseReg` in foldVMergeToMask or we can't
    pass the MachineVerifier because of using a killed virtual register.
    2. Restrict `lookThruCopies` to only look through COPYs with
    one non-debug use.

    This was found when backporting #170070 to 21.x branch.

commit 6af1c3f3a927497081d114f202501667cbbf80c2
Author: Yingwei Zheng <dtcxzyw2333 at gmail.com>
Date:   Wed Dec 3 20:37:30 2025 +0800

    [ValueTracking] Support scalable vector splats in computeKnownBits (#170345)

    Similar to https://github.com/llvm/llvm-project/pull/170325, this patch
    adds support for scalable vector splats in computeKnownBits.

commit 2e87463603171a61713c9b9c3c07fc90b31a555e
Author: Nathan Corbyn <n_corbyn at apple.com>
Date:   Wed Dec 3 12:15:39 2025 +0000

    [Clang] Fix `PPChainedCallbacks::EmbedFileNotFound()` (#170293)

    We've had internal test failures since #166188 landed. The root cause is
    that `PPChainedCallbacks::EmbedFileNotFound()` incorrectly calls
    `PPCallbacks::FileNotFound()` not `PPCallbacks::EmbedFileNotFound()`.

commit 09efb48991dd86ed6a2db89a3eb126aff7337090
Author: LLVM GN Syncbot <llvmgnsyncbot at gmail.com>
Date:   Wed Dec 3 12:12:15 2025 +0000

    [gn build] Port e9bda498e6a0

commit e9bda498e6a061354b3a3e97c29b93e775d721d3
Author: Ebuka Ezike <yerimyah1 at gmail.com>
Date:   Wed Dec 3 12:09:23 2025 +0000

    [lldb] add libstdcpp span formatter (#168705)

commit e947139f082f16c654e6536a90221e15bc0fc96c
Author: Benjamin Maxwell <benjamin.maxwell at arm.com>
Date:   Wed Dec 3 12:06:03 2025 +0000

    [SDAG] Disable illegal extract_subvector splitting for scalable vectors (#170315)

    The "half spanning" legalization of extract_subvector is only valid for
    fixed-length vectors. This patch disables it for scalable vectors and
    makes more careful use of ElementCount in the lowering.

    Fixes regression from https://github.com/llvm/llvm-project/pull/154101,
    which was encountered here:
    https://github.com/llvm/llvm-project/pull/166748#issuecomment-3600498185

    Note: We could optimize this case given the known vscale, but this patch
    only attempts to fix the miscompile.

commit 22d354a2f25e3817ab2e9816eff43fc7ad4de472
Author: Hamza Hassanain <53662962+HamzaHassanain at users.noreply.github.com>
Date:   Wed Dec 3 13:59:37 2025 +0200

    [X86][Clang] Support constexpr evaluation of cvtpd2ps intrinsics (#169980)

    This patch implements constant evaluation support for the following X86
    intrinsics:
    - _mm_cvtpd_ps, _mm256_cvtpd_ps (Packed Double to Float)
    - _mm_cvtsd_ss (Scalar Double to Float merge)
    - Masked variants of the above

    It implements the strict "Exact and Finite" rule: conversions that are
    inexact, infinite, or NaN are rejected in constexpr contexts.
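
    A hedged sketch of that rule (my illustration, not a test from the patch;
    assumes the x86 intrinsic headers are available and the right target
    features are enabled):

    ```cpp
    #include <immintrin.h>

    // 1.5 converts from double to float exactly, so constexpr evaluation is
    // accepted under the "Exact and Finite" rule.
    constexpr __m128 Ok = _mm_cvtpd_ps(_mm_set_pd(0.0, 1.5));

    // 0.1 is inexact when converted to float, so this would be rejected:
    // constexpr __m128 Bad = _mm_cvtpd_ps(_mm_set_pd(0.0, 0.1));
    ```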

    Fixes #169370

commit d68f5432532bb2bb641258b9f9236f0eba53c4fd
Author: Baranov Victor <bar.victor.2002 at gmail.com>
Date:   Wed Dec 3 14:35:09 2025 +0300

    [clang-tidy] Remove 'clang-analyzer-*' checks from default checks. (#157306)

    Closes https://github.com/llvm/llvm-project/issues/146482.

commit 4497c53298a6121dae51da490b3c228beb053e89
Author: Timm Baeder <tbaeder at redhat.com>
Date:   Wed Dec 3 12:21:36 2025 +0100

    [clang][bytecode] Accept current PC argument in Function::dump() (#170449)

    This is useful since we can highlight the opcode that OpPC points to.

commit dd9a516e0eb3b3a55890adbdc2221e70a3bf7719
Author: LLVM GN Syncbot <llvmgnsyncbot at gmail.com>
Date:   Wed Dec 3 11:14:16 2025 +0000

    [gn build] Port aeb36a925234

commit 0dcbc870ed9baa54dc7c46e483d40a26dff28f96
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Wed Dec 3 03:10:49 2025 -0800

    [lldb/docs] Add ScriptingFrameProvider documentation to the website

    This patch adds the documentation for ScriptedFrameProviders to the
    lldb website.

    Signed-off-by: Med Ismail Bennani <ismail at bennani.ma>

commit 4286a474b476e300079efa127d084593e833b1d6
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Wed Dec 3 03:10:22 2025 -0800

    Revert "[lldb/docs] Add ScriptingFrameProvider documentation to the website"

    This reverts commit bfde296d081098605bdad0e4487c4bad9ca19c95.

commit 1ca763b76423a17a2101a4579b5d74bade4f0ce4
Author: Martin Storsjö <martin at martin.st>
Date:   Wed Dec 3 13:09:14 2025 +0200

    [llvm-readobj] [ARMWinEH] Fix printing of packed unwind with H=1, RegI=RegF=0, CR!=1 (#170294)

    In these cases, there are no other GPRs or float registers backed up
    before the register homing area that would have allocated space on the
    stack for the saved registers.

    Normally, the register homing part of the prologue consists of 4 nop
    unwind codes. However, if we haven't allocated stack space for those
    arguments yet, there's no space to store them in. The previous printout,
    printing "stp x0, x1, [sp, #-N]!" wouldn't work when interpreted as a
    nop unwind code.

    Based on "dumpbin -unwindinfo", and from empirical inspection with
    RtlVirtualUnwind, it turns out that the homing of argument registers is
    done outside of the prologue. In these cases, "dumpbin -unwindinfo"
    prints an annotation "(argument registers homed post-prolog)".

    Adjust the printout accordingly. In these cases, the later stack
    allocation (either "stp x29, x30, [sp, #-LocSZ]!" or "sub sp, sp,
    #LocSZ") is adjusted to include the space for the homed registers (i.e.
    be the full size from FrameSize).

commit 6822e3c91b5df96ea980c94655a5d547c5f510b8
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Wed Dec 3 11:07:26 2025 +0000

    [VectorCombine][X86] Add tests showing failure to push a shuffle through a fma with multiple constants (#170458)

    Despite 2 of the 3 arguments of the fma intrinsic calls being constant
    (free shuffle), foldShuffleOfIntrinsics fails to fold the shuffle
    through.

commit aeb36a92523427b63466555d92b35bd3aa26ee40
Author: Lang Hames <lhames at gmail.com>
Date:   Wed Dec 3 22:03:23 2025 +1100

    [ORC] Port CallableTraitsHelper from the new ORC runtime. (#170441)

    The code for this commit was taken with minimal modification to fit LLVM
    style from llvm-project/orc-rt/include/CallableTraitsHelper.h and
    llvm-project/orc-rt/unittests/CallableTraitsHelperTest.cpp (originally
    committed in 40fce325011)

    CallableTraitsHelper identifies the return type and argument types of a
    callable type and passes those to an implementation class template to
    operate on. E.g. the CallableArgInfoImpl class exposes these types as
    typedefs.

    Porting CallableTraitsHelper from the new ORC runtime will allow us to
    simplify existing and upcoming "callable-traits" classes in ORC.

commit c0f0936f5a47270d47486f6d5860b5f8e30e0e32
Author: Felipe de Azevedo Piovezan <fpiovezan at apple.com>
Date:   Wed Dec 3 10:51:27 2025 +0000

    [lldb] Fix ThreadPlanStepOut::DoPlanExplainsStop inspection of BreakpointSite (#169799)

    Suppose two threads are performing the exact same step out plan. They
    will both have an internal breakpoint set at their parent frame. Now
    supposed both of those breakpoints are in the same address (i.e. the
    same BreakpointSite).

    At the end of `ThreadPlanStepOut::DoPlanExplainsStop`, we see this:

    ```
    // If there was only one owner, then we're done.  But if we also hit
    // some user breakpoint on our way out, we should mark ourselves as
    // done, but also not claim to explain the stop, since it is more
    // important to report the user breakpoint than the step out
    // completion.

    if (site_sp->GetNumberOfConstituents() == 1)
      return true;
    ```

    In other words, the plan looks at the number of constituents of the
    site to decide whether it explains the stop, the logic being that a
    _user_ might have put a breakpoint there. However, the implementation is
    not correct; in particular, it will fail in the situation described
    above. We should only care about non-internal breakpoints that would
    stop for the current thread.

    It is tricky to test this, as it depends on the timing of threads, but I
    was able to consistently reproduce the issue with a swift program using
    concurrency.

    rdar://165481473

commit bfde296d081098605bdad0e4487c4bad9ca19c95
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Wed Dec 3 02:46:01 2025 -0800

    [lldb/docs] Add ScriptingFrameProvider documentation to the website

    This patch adds the documentation for ScriptedFrameProviders to the
    lldb website.

    Signed-off-by: Med Ismail Bennani <ismail at bennani.ma>

commit 4e4763a8a4659dc252429a003c613f762d5a1083
Author: Felipe de Azevedo Piovezan <fpiovezan at apple.com>
Date:   Wed Dec 3 10:43:48 2025 +0000

    [lldb] Handle backwards branches in UnwindAssemblyInstEmulation (#169633)

    This allows the unwinder to handle code with mid-function epilogues
    where the subsequent code is reachable through a backwards branch.

    Two changes are required to accomplish this:

    1. Do not enqueue the subsequent instruction if the current instruction
       is a barrier(*).
    2. When processing an instruction, stop ignoring branches with negative
       offsets.

    (*) As per the definition in LLVM's MC layer, a barrier is any
    instruction that "stops control flow from executing the instruction
    immediately following it". See `MCInstrDesc::isBarrier` in MCInstrDesc.h

    Part of a sequence of PRs:
    [lldb][NFCI] Rewrite UnwindAssemblyInstEmulation in terms of a CFG visit
    #169630
    [lldb][NFC] Rename forward_branch_offset to branch_offset in
    UnwindAssemblyInstEmulation #169631
    [lldb] Add DisassemblerLLVMC::IsBarrier API #169632
    [lldb] Handle backwards branches in UnwindAssemblyInstEmulation #169633

    commit-id:fd266c13

commit 49774448d69b55f5c46aef2147b45537fd61276a
Author: mitchell <mitchell.xu2 at gmail.com>
Date:   Wed Dec 3 18:41:45 2025 +0800

    [clang-tidy] Fix false positive in `readability-redundant-typename` (#170034)

    Closes #169166

    ---------

    Co-authored-by: Victor Chernyakin <chernyakin.victor.j at outlook.com>

commit f17abc280c708c16f622be2de2ab7d0710cc8bc1
Author: Gil Rapaport <gil.rapaport at mobileye.com>
Date:   Wed Dec 3 12:41:29 2025 +0200

    [mlir][emitc] Add address-of and dereference ops (#72569)

    EmitC currently models C's `&` and `*` operators via its `apply` op,
    which has several drawbacks:

    - Its pre-lvalue semantics combines dereferencing with memory access.

    - Representing multiple opcodes (selected by an attribute) in a single
    op complicates the code by adding a second, attribute-based selection
    layer on top of MLIR's standard `isa<>` mechanism.

    This patch adds two distinct, lvalue-based ops to model these C operators.
    EmitC passes were converted to use the new ops instead of `apply`, which
    is now deprecated.

commit 2697c8cb459c1705f6c3a60c908462ca099e657f
Author: Fabian Ritter <fabian.ritter at amd.com>
Date:   Wed Dec 3 11:13:52 2025 +0100

    [LowerMemIntrinsics] Factor control flow generation out of the memcpy lowering (#169039)

    So far, memcpy with known size, memcpy with unknown size, memmove with known
    size, and memmove with unknown size have individual optimized loop lowering
    implementations, while memset and memset.pattern use an unoptimized loop
    lowering. This patch extracts the parts of the memcpy lowerings (for known and
    unknown sizes) that generate the control flow for the loop expansion into an
    `insertLoopExpansion` function. The `createMemCpyLoop(Unk|K)nownSize` functions
    then only collect the necessary arguments for `insertLoopExpansion`, call it,
    and fill the generated loop basic blocks.

    The immediate benefit of this is that logic from the two memcpy lowerings is
    deduplicated. Moreover, it enables follow-up patches that will use
    `insertLoopExpansion` to optimize the memset and memset.pattern implementations
    similarly to memcpy, since they can use the exact same control flow patterns.
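
    The shape of that factoring, in a self-contained sketch (toy types; the
    real code builds IR basic blocks rather than strings, and the helper
    name follows the summary above):

    ```cpp
    #include <functional>
    #include <string>

    // Toy model of the refactoring: one helper owns the control-flow layout
    // (main loop plus optional residual loop); each lowering only fills in
    // the bodies it needs.
    struct LoopSkeleton {
      std::string main_loop_body;
      std::string residual_loop_body; // empty when no residual loop exists
    };

    LoopSkeleton insertLoopExpansionSketch(
        bool needs_residual, const std::function<void(LoopSkeleton &)> &fill) {
      LoopSkeleton s;
      s.main_loop_body = "<to be filled>";
      s.residual_loop_body = needs_residual ? "<to be filled>" : "";
      fill(s); // the caller (e.g. a memcpy lowering) fills the bodies
      return s;
    }

    // A "createMemCpyLoopKnownSize"-style caller now just gathers its
    // arguments, calls the shared helper, and fills the generated blocks.
    LoopSkeleton lowerMemCpySketch() {
      return insertLoopExpansionSketch(/*needs_residual=*/true,
                                       [](LoopSkeleton &s) {
                                         s.main_loop_body = "wide copies";
                                         s.residual_loop_body = "byte copies";
                                       });
    }
    ```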

    The test changes are due to more consistent and useful basic block names in the
    loop expansion and an improvement in basic block ordering: previously, the
    basic block that determines if the residual loop is executed would be put at
    the end of the function; now it is put before the residual loop body.
    Otherwise, the generated code should be equivalent.

    This patch doesn't affect memmove; deduplicating its logic would also be nice,
    but to extract all CF generation from the memmove lowering,
    `insertLoopExpansion` would need to be able to also create code that iterates
    backwards over the argument buffers. That would make `insertLoopExpansion` a
    lot more complex for a code path that's only used for memmove, so it's probably
    not worth refactoring.

    For SWDEV-543208.

commit 8feb6762ba9fb83f8e13ef9486c3b743e1b5cfa7
Author: Pierre van Houtryve <pierre.vanhoutryve at amd.com>
Date:   Wed Dec 3 10:37:58 2025 +0100

    [AMDGPU] Take BUF instructions into account in mayAccessScratchThroughFlat (#170274)

    BUF instructions can access the scratch address space, so
    SIInsertWaitCnt needs to be able to track the SCRATCH_WRITE_ACCESS event
    for such BUF instructions.

    The release-vgprs.mir test had to be updated because BUF instructions
    without an MMO are now tracked as a SCRATCH_WRITE_ACCESS. I added an MMO
    that touches global memory to keep the test result unchanged, and also
    added a couple of test cases with no MMO to test the corrected behavior.

commit 5ccf8c90d1e4020d5f9bc255fe521aa0763f2b2b
Author: Tom Eccles <tom.eccles at arm.com>
Date:   Wed Dec 3 09:36:22 2025 +0000

    [flang] implement VECTOR VECTORLENGTH directive (#170114)

    This should match exactly the llvm attributes generated by classic
    flang.

commit 114ca6522e4ea425115adb778c39fd89745a6853
Author: Shih-Po Hung <shihpo.hung at sifive.com>
Date:   Wed Dec 3 17:24:40 2025 +0800

    [TTI] Use MemIntrinsicCostAttributes for getStridedOpCost (#170436)

    - Following #168029. This is a step toward a unified interface for
    masked/gather-scatter/strided/expand-compress cost modeling.
    - Replace the ad-hoc parameter list with a single attributes object.

    API change:
    ```
    - InstructionCost getStridedMemoryOpCost(unsigned Opcode, Type *DataTy, const Value *Ptr,
    -                                        bool VariableMask, Align Alignment,
    -                                        TTI::TargetCostKind CostKind,
    -                                        const Instruction *I = nullptr);
    + InstructionCost getStridedMemoryOpCost(MemIntrinsicCostAttributes,
    +                                        CostKind);
    ```

    Notes:
    - NFCI intended: callers populate MemIntrinsicCostAttributes with the
    same information as before.

commit 9296223b28029095c1e734ba9373b9bcfc853d7b
Author: mitchell <mitchell.xu2 at gmail.com>
Date:   Wed Dec 3 17:22:34 2025 +0800

    [clang-tidy] Fix `cppcoreguidelines-pro-type-member-init` check (#169832)

    Closes [#169677](https://github.com/llvm/llvm-project/issues/169677)

    ---------

    Co-authored-by: EugeneZelenko <eugene.zelenko at gmail.com>

commit 2b725ab8bf08b0bde29910ec4fa1c610eaaffa63
Author: Felipe de Azevedo Piovezan <fpiovezan at apple.com>
Date:   Wed Dec 3 09:08:05 2025 +0000

    [lldb] Add DisassemblerLLVMC::IsBarrier API (#169632)

    This will allow the instruction emulation unwinder to reason about
    instructions that prevent the subsequent instruction from executing.

    Part of a sequence of PRs:
    [lldb][NFCI] Rewrite UnwindAssemblyInstEmulation in terms of a CFG visit
    #169630
    [lldb][NFC] Rename forward_branch_offset to branch_offset in
    UnwindAssemblyInstEmulation #169631
    [lldb] Add DisassemblerLLVMC::IsBarrier API #169632
    [lldb] Handle backwards branches in UnwindAssemblyInstEmulation #169633

    commit-id:bb5df4aa

commit 7cdb27a4b3757879446596d6f042f87b5119c638
Author: Aaditya <115080342+easyonaadit at users.noreply.github.com>
Date:   Wed Dec 3 14:36:25 2025 +0530

    [NFC][AMDGPU] Refactor wave reduce test files (#170440)

    Separate out float wave-reduce intrinsic tests from the overloaded call.
    Moved float add/sub/min/max ops from:
    `llvm.amdgcn.reduce.add/sub/min/max` to
    `llvm.amdgcn.reduce.fadd/fsub/fmin/fmax`.

commit 8b7a07a5f7e7b2a96417665f807cbf79a3161a76
Author: Ebuka Ezike <yerimyah1 at gmail.com>
Date:   Wed Dec 3 08:55:11 2025 +0000

    [lldb]  Fix abi_tag parsing for operator<< and operator-named tags (#170224)

    The parser now correctly handles:
    - abi_tags attached to operator<<: `operator<<[abi:SOMETAG]`
    - abi_tags with "operator" as the tag name: `func[abi:operator]`

commit 4b0a9759395f3e9cbefa9c194ca331f4d88003bf
Author: Hongyu Chen <xxs_chy at outlook.com>
Date:   Wed Dec 3 16:53:25 2025 +0800

    [OpenCL][NVPTX] Don't set calling convention for OpenCL kernel (#170170)

    Fixes #154772
    We previously set `ptx_kernel` for all kernels. But it's incorrect to
    add `ptx_kernel` to the stub version of the kernel introduced in #115821.
    This patch copies the workaround of AMDGPU.

commit 6638d59c972512d45da474c214abc67ec3cfe333
Author: Felipe de Azevedo Piovezan <fpiovezan at apple.com>
Date:   Wed Dec 3 08:31:34 2025 +0000

    [lldb][NFC] Rename forward_branch_offset to branch_offset in UnwindAssemblyInstEmulation (#169631)

    This will reduce the diff in subsequent patches

    Part of a sequence of PRs:
    [lldb][NFCI] Rewrite UnwindAssemblyInstEmulation in terms of a CFG visit
    #169630
    [lldb][NFC] Rename forward_branch_offset to branch_offset in
    UnwindAssemblyInstEmulation #169631
    [lldb] Add DisassemblerLLVMC::IsBarrier API #169632
    [lldb] Handle backwards branches in UnwindAssemblyInstEmulation #169633
    commit-id:5e758a22

commit c5ecdec9fb84e6865fe44f69e380afa1291c2adf
Author: Ebuka Ezike <yerimyah1 at gmail.com>
Date:   Wed Dec 3 08:30:35 2025 +0000

    [lldb-dap] start all sent protocol message from number one. (#170378)

    This aligns with the DAP
    [specification](https://microsoft.github.io/debug-adapter-protocol//specification.html#Base_Protocol_ProtocolMessage)

    Force it to be an error in test cases.

commit cd86b2ab32bb2c444fb48e41a40f43c80a7eaeae
Author: Vikash Gupta <Vikash.Gupta at amd.com>
Date:   Wed Dec 3 13:56:15 2025 +0530

    [CodeGen] Add MO_LaneMask type and a new COPY_LANEMASK instruction (#151944)

    Introduce MO_LaneMask as a new machine operand type. It can be used to
    hold liveness information at sub-register granularity for register-type
    operands. We also introduce a new COPY_LANEMASK instruction that uses an
    MO_LaneMask operand to perform a partial copy from its source register
    operand.

    One such use case of MO_LaneMask can be seen in #151123, where it can be
    used to store live regUnits information corresponding to the source
    register of COPY instructions, which can later be used during
    CopyPhysReg expansion.

commit ae4289f0e6e1bf61f45f88870aec220c9164800b
Author: Shih-Po Hung <shihpo.hung at sifive.com>
Date:   Wed Dec 3 16:24:56 2025 +0800

    [Hexagon][NFC] Drop no-op getMaskedMemoryOpCost/getGatherScatterOpCost stubs (#170426)

    These stubs (from 4bdf1aa416b02) don’t actually override anything.
    Removing them eliminates the need for a local getMemIntrinsicCost()
    forwarder in #169885.

commit befa4e85e4fab6a109203903a2fbeb979164cd2e
Author: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
Date:   Wed Dec 3 00:14:32 2025 -0800

    [AMDGPU] Avoid undefs in hazard-gfx1250-flat-scr-hi.mir. NFC (#170396)

commit 5ee6cff90ba5d8e08066eeeef0c27aa0b6f24d2c
Author: Jonas Hahnfeld <jonas.hahnfeld at cern.ch>
Date:   Wed Dec 3 09:10:56 2025 +0100

    [clang] Propagate definition data to all redecls (#170090)

    Fix the propagation added in commit 0d490ae55f to include all redecls,
    not only previous ones. This fixes another instance of the assertion
    "Cannot get layout of forward declarations" in getASTRecordLayout().

    Kudos to Alexander Kornienko for providing an initial version of the
    reproducer that I further simplified.

    Fixes #170084

commit 98182f4d209ded292cb6030f45bcae132096acae
Author: Sven van Haastregt <sven.vanhaastregt at arm.com>
Date:   Wed Dec 3 08:58:31 2025 +0100

    Move CodeGenFunction::EmitScalarOrConstFoldImmArg; NFC (#170286)

    This function is called from various .cpp files under `TargetBuiltins/`,
    and was moved unintentionally into `AMDGPU.cpp` in PR #132252. Move it
    to a common place.

commit e6110cb3395b22a941cba4726c9e36308e5b5613
Author: Matthias Springer <me at m-sp.org>
Date:   Wed Dec 3 08:35:05 2025 +0100

    [mlir][Transforms] Fix crash in `-remove-dead-values` on private functions (#169269)

    This commit fixes two crashes in the `-remove-dead-values` pass related
    to private functions.

    Private functions are considered entirely "dead" by the liveness
    analysis, which drives the `-remove-dead-values` pass.

    The `-remove-dead-values` pass removes dead block arguments from private
    functions. Private functions are entirely dead, so all of their block
    arguments are removed. However, the pass did not correctly update all
    users of these dropped block arguments.

    1. A side-effecting operation must be removed if one of its operands is
    dead. Otherwise, the operation would end up with a NULL operand. Note:
    The liveness analysis would not have marked an SSA value as "dead" if it
    had any reachable side-effecting users. (Therefore, it is safe to erase
    such side-effecting operations.)
    2. A branch operation must be removed if one of its non-forwarded
    operands is dead. (E.g., the condition value of a `cf.cond_br`.)
    Whenever a terminator is removed, a `ub.unreachable` operation is
    inserted. This fixes #158760.

commit 30f479fa2b08d6e480939a57384996f7a276eb91
Author: Henrich Lauko <xlauko at mail.muni.cz>
Date:   Wed Dec 3 08:20:50 2025 +0100

    [CIR] Use default attribute printer/parser (NFC) (#170366)

commit 042a38f0bfe5c9f49df5d4cb5e23092e512c9fbe
Author: Nikita Popov <npopov at redhat.com>
Date:   Wed Dec 3 07:55:06 2025 +0100

    [Support] Optimize DebugCounter (#170305)

    Currently, DebugCounters work by creating a unique counter ID during
    registration, and then using that ID to look up the counter information
    in the global registry.

    However, this means that anything working with counters has to always go
    through the global instance. This includes the fast path that checks
    whether any counters are enabled.

    Instead, we can drop the counter IDs, and make the counter variables use
    CounterInfo themselves. We can then directly check whether the specific
    counter is active without going through the global registry. This is
    both faster for the fast-path where all counters are disabled, and also
    faster for the case where only one counter is active (as the fast-path
    can now still be used for all the disabled counters).

    After this change, disabled counters become essentially free at runtime,
    and we should be able to enable them in non-assert builds as well.
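
    A self-contained sketch of the before/after shape (toy types, not the
    real llvm::DebugCounter):

    ```cpp
    #include <map>

    // Before (sketch): the counter variable is only an ID, so every check,
    // including the "nothing is enabled" fast path, must consult a global
    // registry object.
    struct RegistrySketch {
      std::map<unsigned, bool> enabled;
      std::map<unsigned, long> counts;
      bool shouldExecute(unsigned id) {
        auto it = enabled.find(id);
        if (it == enabled.end() || !it->second)
          return true;           // fast path, but already inside the registry
        return ++counts[id] > 0; // stand-in for the real threshold logic
      }
    };

    // After (sketch): the counter variable carries its own state, so a
    // disabled counter is one direct branch with no registry lookup at all.
    struct CounterInfoSketch {
      bool enabled = false;
      long count = 0;
      bool shouldExecute() {
        if (!enabled)
          return true;       // essentially free when the counter is disabled
        return ++count > 0;  // stand-in for the real threshold logic
      }
    };
    ```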

commit 9f634c6777701794a6ed5577857ffb8f202513b8
Author: Jianjian Guan <jacquesguan at me.com>
Date:   Wed Dec 3 14:19:20 2025 +0800

    [RISCV][GISel] Fix legalize G_EXTRACT_SUBVECTOR (#169877)

    Fix the wrong mask type used by G_VSLIDEDOWN_VL.

commit 689b3cc7c700b1687cf4aaaf4ef2c81a4e988917
Author: Jinjie Huang <huangjinjie at bytedance.com>
Date:   Wed Dec 3 14:08:20 2025 +0800

    [clang] Support header shadowing diagnostics in Clang header search (#162491)

    When including a header file, multiple files with the same name may
    exist across different search paths, like:
      |-- main.cpp
      |-- header.h
      |-- include
      |   └── header.h
    The compiler usually picks the first match it finds (typically following
    MSVC rules for current/include-chain paths first, then regular -I
    paths), which may not be the user’s intended header.
    This silent behavior can lead to subtle runtime API mismatches or
    increase the cost of resolving errors such as “error: use of undeclared
    identifier”, especially in large projects.

    Therefore, this patch tries to provide a diagnostic message without
    changing the current header selection. It does this by performing an
    additional search for duplicate filenames across all search paths (both
    MSVC rules and standard paths). This informs the user about a potential
    "header shadowing" issue and clarifies which header path was actually
    used.

    Since header searching is much cheaper than file loading, the added
    overhead should be within an acceptable range -- assuming the diagnostic
    message is valuable.
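
    Conceptually, the extra search is just a scan over every search
    directory for the same filename; a self-contained approximation using
    std::filesystem (not Clang's actual HeaderSearch):

    ```cpp
    #include <filesystem>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace fs = std::filesystem;

    // Report potential "header shadowing": the first match is what the
    // compiler would use; any later match with the same name is shadowed.
    void diagnoseHeaderShadowing(const std::vector<fs::path> &searchDirs,
                                 const std::string &headerName) {
      std::vector<fs::path> matches;
      for (const fs::path &dir : searchDirs) {
        fs::path candidate = dir / headerName;
        std::error_code ec;
        if (fs::exists(candidate, ec))
          matches.push_back(candidate);
      }
      if (matches.size() > 1) {
        std::cout << "warning: '" << headerName << "' is shadowed; using "
                  << matches.front() << "\n";
        for (size_t i = 1; i < matches.size(); ++i)
          std::cout << "note: also found at " << matches[i] << "\n";
      }
    }
    ```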

commit 73036cf9113b4748d4fbb28037e8714ff2486238
Author: Baranov Victor <bar.victor.2002 at gmail.com>
Date:   Wed Dec 3 09:02:50 2025 +0300

    [clang-tidy][NFC] Fix miscellaneous clang-tidy warnings (#170424)

commit d05370e6863e28fcf988b8491dc583fcf5e4e1be
Author: Baranov Victor <bar.victor.2002 at gmail.com>
Date:   Wed Dec 3 08:56:24 2025 +0300

    [clang-tidy][NFC] Enable readability-any-all-of check (#167134)

    Closes https://github.com/llvm/llvm-project/issues/156161.
    Assisted-by: Claude Sonnet 4.5 via Claude Code

commit 822fc449985553c609e44915374f935672c0db50
Author: Rajat Bajpai <rbajpai at nvidia.com>
Date:   Wed Dec 3 10:49:17 2025 +0530

    [LLVM][Intrinsics] Adds an API to automatically resolve overload types (#169007)

    Currently, the getOrInsertDeclaration API requires callers to explicitly
    provide overload types for overloaded intrinsics, placing a significant
    burden on callers who must determine whether overload types are needed.
    This typically results in conditional logic at each call site to check
    if the intrinsic is overloaded and manually match the intrinsic
    signature.

    This patch introduces a new getOrInsertDeclaration overload that
    automatically deduces overload types from the provided return type and
    argument types, then uses this API to simplify
    IRBuilder::CreateIntrinsic. The new API uses
    Intrinsic::matchIntrinsicSignature internally to resolve overloaded
    types, eliminating the need for callers to do manual overload detection.

commit 1f35b52a00ebd7d595deaffd5e72f72088f450b1
Author: Michael Buch <michaelbuch12 at gmail.com>
Date:   Wed Dec 3 12:07:16 2025 +0900

    [lldb][DWARFASTParserClang] Treat DW_TAG_template_alias like we do DW_TAG_typedef (#170135)

    Depends on:
    * https://github.com/llvm/llvm-project/pull/170132

    Clang gained the `-gtemplate-alias` flag not too long ago, which emits C++
    alias templates as `DW_TAG_template_alias` (instead of
    `DW_TAG_typedef`). The main difference is that `DW_TAG_template_alias`
    has `DW_TAG_template_XXX` children. The flag was not enabled by default
    because consumers (mainly LLDB) didn't know how to handle it. This patch
    adds rudimentary support for debugging with `DW_TAG_template_alias`.

    This patch simply creates the same kind of `TypedefDecl` as we do for
    `DW_TAG_typedef`. The more complete solution would be to create a
    `TypeAliasTemplateDecl` and associated `TypeAliasDecl`. But that would
    require DWARF to carry generic template information, but currently each
    `DW_TAG_template_alias` represents a concrete instantiation. We could
    probably hack up some working AST representation that includes the
    template parameters, but I currently don't see a compelling reason to.
    All we need is the `DW_AT_name` and the `DW_AT_type` that the typedef
    refers to.

    rdar://137499401

commit 1c86f4a8f1a254a6286342a5bffb13c99168267b
Author: Shih-Po Hung <shihpo.hung at sifive.com>
Date:   Wed Dec 3 11:01:35 2025 +0800

    [TTI] Use MemIntrinsicCostAttributes for getGatherScatterOpCost (#168650)

    - Following #168029. This is a step toward a unified interface for
    masked/gather-scatter/strided/expand-compress cost modeling.
    - Replace the ad-hoc parameter list with a single attributes object.

    API change:
    ```
    - InstructionCost getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
    -                                        Alignment, CostKind, Inst);

    + InstructionCost getGatherScatterOpCost(MemIntrinsicCostAttributes,
    +                                       CostKind);
    ```

    Notes:
    - NFCI intended: callers populate MemIntrinsicCostAttributes with the
    same information as before.

commit 2978b20af43f9dbba8c775c9b2b5a20f60ec9fe7
Author: Ryotaro Kasuga <kasuga.ryotaro at fujitsu.com>
Date:   Wed Dec 3 12:01:18 2025 +0900

    [Delinearization] Add validation for large size arrays (#169902)

    This patch adds a check in validation for delinearization to ensure that
    the offset calculation does not overflow. If it overflows, different
    array accesses (e.g., `A[0][0]` and `A[1][0]`) could map to the same
    linear index, leading to incorrect behavior.
    For fixed-size arrays, the check is relatively straightforward. However,
    for dynamic-size arrays (i.e., arrays where the size is not known at
    compile time), it's difficult to prove this statically, and it is going to
    fail for almost all cases. Maybe we need to add some runtime checks or
    reasoning based on `inbounds` like LAA does.
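
    As a self-contained illustration of the hazard (numbers chosen to make
    the wraparound obvious, not taken from the actual test cases):

    ```cpp
    #include <cstdint>
    #include <iostream>

    int main() {
      // Row stride of 2^31 elements with a 32-bit index type: the linearized
      // offsets of A[0][0] and A[2][0] collide because i * stride wraps.
      const uint32_t row_stride = 1u << 31;
      auto linearize = [&](uint32_t i, uint32_t j) { return i * row_stride + j; };
      std::cout << linearize(0, 0) << '\n'; // prints 0
      std::cout << linearize(2, 0) << '\n'; // also prints 0 (2 * 2^31 wraps)

      // The validation therefore has to prove that i * stride + j cannot
      // overflow the index type, e.g. with checked-arithmetic builtins.
      uint32_t prod = 0, sum = 0;
      bool wraps = __builtin_mul_overflow(uint32_t{2}, row_stride, &prod) ||
                   __builtin_add_overflow(prod, uint32_t{0}, &sum);
      std::cout << (wraps ? "overflow -> reject\n" : "no overflow -> ok\n");
    }
    ```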

    Fixes the test cases added in #169048.

commit 542a8f25c0d93a01e90f270fc73107d9ce2280c6
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Tue Dec 2 18:59:14 2025 -0800

    [lldb/test] Add missing import for decorator (NFC)

    Signed-off-by: Med Ismail Bennani <ismail at bennani.ma>

commit 76cb98442b280bf9b5862f0bec3a56c2cc37d70f
Author: Pengcheng Wang <wangpengcheng.pp at bytedance.com>
Date:   Wed Dec 3 10:55:38 2025 +0800

    [RISCV] Sources of vmerge shouldn't overlap V0 (#170070)

    According to the spec:

    > A vector register cannot be used to provide source operands with more
    > than one EEW for a single instruction. A mask register source is
    > considered to have EEW=1 for this constraint.

    There must be a mask `V0` in `vmerge` variants so the sources should
    use register classes without `V0`.

    This fixes #169905.

    Co-authored-by: Luke Lau <luke at igalia.com>

commit 242077ad1c0df4ecfd12769a38cf6fcb1b0b1d72
Author: Vlad Serebrennikov <serebrennikov.vladislav at gmail.com>
Date:   Wed Dec 3 06:46:46 2025 +0400

    [clang][NFC] Promote CWG3005 test to "ready"

    Not updating cxx_dr_status.html yet, because the CWG2917 test might need major adjustments before make_cxx_dr_status can be run.

commit 6f5a69b54cf186d984971ad0f098b4bab51ba742
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Tue Dec 2 18:52:40 2025 -0800

    [lldb/test] Skip ScriptedFrameProviders tests on arm32 (NFC)

    It looks like the providers don't get loaded on arm32 bots:

    https://github.com/llvm/llvm-project/issues/170412

    Skipping for now since I don't have access to a machine to investigate
    it.

    Signed-off-by: Med Ismail Bennani <ismail at bennani.ma>

commit 93ebe63f2e7a252038bde01a4399c14e0123cdac
Author: Chuanqi Xu <yedeng.yd at linux.alibaba.com>
Date:   Wed Dec 3 10:26:05 2025 +0800

    [C++20] [Modules] Fix ADL for friend in modules

    Close https://github.com/llvm/llvm-project/issues/170235

    The cause of the issue is that we didn't check friendship for decls
    in an ordinary namespace if they aren't visible.

    This is fine for code before modules, since everything is visible.
    But it is no longer true once modules come in. This patch adjusts that.

    Note that this doesn't change the control flow for non-modules code:
    the decls in an ordinary namespace are always visible there, so they
    never reach the following friendship check.

commit 82c6ad655ddbfd86d22d8d1aa3de1fb5d6ec2f6b
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Tue Dec 2 18:28:42 2025 -0800

    [lldb/test] Add missing import for decorator (NFC)

    Signed-off-by: Med Ismail Bennani <ismail at bennani.ma>

commit 2cf276880d58effab669f89dcda4d27bb9c15d73
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Tue Dec 2 17:24:11 2025 -0800

    [lldb/test] XFAIL TestFrameProviderCircularDependency.py on Windows

    This patch disables TestFrameProviderCircularDependency.py on Windows
    since the scripted frame provider uses SBTarget.FindFunctions which
    doesn't seem to be working (according to TestTargetAPI.test_find_functions).

    Signed-off-by: Med Ismail Bennani <ismail at bennani.ma>

commit 83ab875b8337aad5970fb8f519fec91a43dce906
Author: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
Date:   Tue Dec 2 17:22:07 2025 -0800

    [AMDGPU] Handle phys regs in flat_scratch_base_hi operand check (#170395)

commit 271e99daf0ff860d0ab50c688ba5e0480de78847
Author: Omar Hossam <moar.ahmed at gmail.com>
Date:   Wed Dec 3 02:19:41 2025 +0100

    [CIR] Support x86 builtin rotate (#169566)

    This PR implements CodeGen for rotate builtins in CIR upstream.
    Issue https://github.com/llvm/llvm-project/issues/167765

commit 7685e1f82383e8a7c21de338ba376e7b317e0fa3
Author: Michael Buch <michaelbuch12 at gmail.com>
Date:   Wed Dec 3 10:19:30 2025 +0900

    [lldb][test] DWARFASTParserClangTests: extract test setup into helper structure (#170132)

    Depends on:
    * https://github.com/llvm/llvm-project/pull/170249

    We keep repeating the boilerplate of creating a
    `DWARFASTParserClangStub` and `TypeSystemClangHolder` in all the
    unit-test cases. Let's extract this into a helper to make the tests
    easier to grok.

    We actually only need the `DWARFASTParserClangStub` and a
    `TypeSystemClangHolder` in one of the test cases. For the rest, we can
    just re-use the typesystem/parser that the `YAMLModuleTester` created.
    Re-using them makes it more straightforward to write test cases because
    we don't need to worry about which TypeSystem each DWARFParser created
    types in.

commit ac19d38e6f3f97ae920f71dc2618800f54668332
Author: Michael Buch <michaelbuch12 at gmail.com>
Date:   Wed Dec 3 09:49:41 2025 +0900

    [lldb][DWARFASTParserClang] Complete and make use of LLVM's RTTI support (#170249)

    We almost had RTTI support for `DWARFASTParserClang`, but because
    `classof` was protected, using `llvm::cast`/etc. on it would fail to
    compile with:
    ```
    llvm/include/llvm/Support/Casting.h:64:57: error: 'classof' is a protected member of 'DWARFASTParserClang'
       64 |   static inline bool doit(const From &Val) { return To::classof(&Val); }
          |                                                         ^
    llvm/include/llvm/Support/Casting.h:110:32: note: in instantiation of member function 'llvm::isa_impl<DWARFASTParserClang, lldb_private::plugin::dwarf::DWARFASTParser>::doit' requested here
      110 |     return isa_impl<To, From>::doit(*Val);
    ```

    This patch makes `classof` public and turns `static_cast`s of
    `DWARFASTParserClang` into `llvm::cast`s.
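
    A hedged sketch of the resulting pattern (the kind discriminator below
    is made up for illustration and is not the actual LLDB enum):

    ```cpp
    // With a public classof, llvm::isa/cast/dyn_cast (from
    // llvm/Support/Casting.h) can be used on base-class pointers.
    struct ParserBaseSketch {
      enum class Kind { Clang, Other };
      explicit ParserBaseSketch(Kind k) : kind(k) {}
      virtual ~ParserBaseSketch() = default;
      Kind getKind() const { return kind; }

    private:
      Kind kind;
    };

    struct ParserClangSketch : ParserBaseSketch {
      ParserClangSketch() : ParserBaseSketch(Kind::Clang) {}
      // Public, so To::classof(&Val) in Casting.h is accessible.
      static bool classof(const ParserBaseSketch *p) {
        return p->getKind() == Kind::Clang;
      }
    };
    ```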

commit dea86c6fb0b5eabacc1e9237489bac3ba53119b8
Author: Changpeng Fang <changpeng.fang at amd.com>
Date:   Tue Dec 2 16:44:33 2025 -0800

    [AMDGPU][NFC] Add occupancy checks for gfx950 and gfx1250 (#170392)

commit b30a48c389cee20479419a672d841cb32eaf107a
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Tue Dec 2 16:37:58 2025 -0800

    [lldb/test] Fix scripted frame provider tests on ARM32

    On ARM32, FixCodeAddress unconditionally clears bit 0 (the Thumb bit)
    from all code addresses, including synthetic frame PCs. This causes
    test failures where synthetic PCs like 0xFFFF and 0xDEADBEEF become
    0xFFFE and 0xDEADBEEE respectively.

    This adjusts the tests to expect the modified PC values on ARM32.

    Signed-off-by: Med Ismail Bennani <ismail at bennani.ma>

commit e05fffbbc54d201a60e55e8c051bad81eaebd69a
Author: Nico Weber <thakis at chromium.org>
Date:   Tue Dec 2 19:19:54 2025 -0500

    Revert "[Clang] Add __builtin_common_reference (#121199)"

    This reverts commit 3b9e203364dcd8234b12eb447ddbcf97a877558c.
    Causes not-yet-understood semantic differences, see commits
    on #121199.

commit c5e9289ba5e643967faa5caad72f15195f764d08
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Tue Dec 2 16:36:29 2025 -0800

    [llvm-exegesis] Make rvv/filter.test deterministic

    This should prevent the flaky failures that have been plaguing the
    buildbots since the test was introduced and allow for offline
    investigation without disrupting CI.

    Reviewers: topperc, mshockwave

    Reviewed By: mshockwave

    Pull Request: https://github.com/llvm/llvm-project/pull/170014

commit 325a08267de9362a9b17a8fc80fdc59568fd30f8
Author: Zachary Fogg <zach.fogg at gmail.com>
Date:   Tue Dec 2 19:36:01 2025 -0500

    [lldb] Fix Doxygen warning in SBTrace.h (#170394)

    Remove errant `\a` command before `<directory>` in `SaveToDisk`
    documentation. The `\a` Doxygen command expects a word argument, but
    `<directory>` starts with `<` which Doxygen interprets as HTML. This
    fixes:

    ```
    llvm-project/lldb/include/lldb/API/SBTrace.h:60:
    Warning 564: Error parsing Doxygen command a: No word followed the command. Command ignored.
    ```

commit 94c8940f449ebc3a42c8343ebbdf5b888a436854
Author: Max Desiatov <m_desiatov at apple.com>
Date:   Wed Dec 3 00:27:14 2025 +0000

    lldbgdbremote.md: Update `qWasmLocal` result description (#170393)

    The current description mistakenly specified that an address of a local
    value in some address space is returned. When testing this with Wasm
    runtimes that already implement this command, it can be observed that
    the value itself is returned. That value may be an address for
    languages that use a shadow stack in Wasm linear memory, but the value
    of an arbitrary local does not always contain that address.

commit 9fd288e8866788d9defccccfcc75272eb27f54fe
Author: Matt Arsenault <Matthew.Arsenault at amd.com>
Date:   Tue Dec 2 19:11:30 2025 -0500

    clang/AMDGPU: Enable opencl 2.0 features for unknown target (#170308)

    Assume amdhsa triples support flat addressing, which matches
    the backend logic for the default target. This fixes the
    rocm device-libs build.

commit 9dd33465896032d402f851ac5a3ef047723ed3d8
Author: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
Date:   Tue Dec 2 16:08:00 2025 -0800

    [AMDGPU] Prevent folding of flat_scr_base_hi into a 64-bit SALU (#170373)

    Fixes: SWDEV-563886

commit dd1b4abfb74809481100ed20c5a099f062ef0625
Author: Farzon Lotfi <farzonlotfi at microsoft.com>
Date:   Tue Dec 2 19:02:25 2025 -0500

    [HLSL][Matrix] Add support for Matrix element and trunc Casts (#168915)

    fixes #168737
    fixes #168755

    This change adds support for Matrix truncations via the
    ICK_HLSL_Matrix_Truncation enum. That ends up being most of the files
    changed.

    It also allows Matrix as an HLSL elementwise cast as long as the cast
    does not perform a shape transformation, i.e., 3x2 to 2x3.

    Tests for the new elementwise and truncation behavior were added, as
    well as sema tests to make sure we error on the shape transformation
    cast.

    I am punting right now on the ConstExpr Matrix support. That will need
    to be addressed later. I will file a separate issue for that if
    reviewers agree it can wait.

commit 45918f50aa956e7c9ecb0d931a85e533c488d741
Author: David Stone <davidfromonline at gmail.com>
Date:   Tue Dec 2 17:02:14 2025 -0700

    [llvm][NFC] In `SetVector`, `contains` and `count` now automatically accept `const T *` arguments when the key is `T *` (#170377)

    Also use `is_contained` to implement `contains`, since this tries the
    `contains` member function of the set type first.
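
    A small usage example of what this enables (assumes LLVM's ADT headers
    are available):

    ```cpp
    #include "llvm/ADT/SetVector.h"

    struct Node { int id; };

    // With a key type of Node *, contains() and count() now also accept
    // const Node *, so read-only queries no longer need a const_cast.
    bool hasNode(const llvm::SetVector<Node *> &nodes, const Node *n) {
      return nodes.contains(n);
    }
    ```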

commit 6c32535b204006488ed9d800dee549118f0fd719
Author: David Stone <davidfromonline at gmail.com>
Date:   Tue Dec 2 17:02:00 2025 -0700

    [clang][NFC] Remove unused CFGStmtMap.h includes (#170383)

commit e9c127428cd2bc38c64ea788007e336d21e5f199
Author: Mircea Trofin <mtrofin at google.com>
Date:   Tue Dec 2 15:47:35 2025 -0800

    [LTT] mark the CFI jumptable naked on Windows (#170371)

    We were not marking the `.cfi.jumptable` functions as `naked` on Windows. The referenced bug (https://llvm.org/bugs/show_bug.cgi?id=28641#c3) appears to be fixed:

    ```bash
    build/bin/opt -S -passes=lowertypetests -mtriple=i686-pc-win32 llvm/test/Transforms/LowerTypeTests/function.ll | build/bin/llc -O0
    ```

    ```
    L_.cfi.jumptable:                       # @.cfi.jumptable
    # %bb.0:                                # %entry
            #APP
            jmp     _f.cfi at PLT
            int3
            int3
            int3

            #NO_APP
            #APP
            jmp     _g.cfi at PLT
            int3
            int3
            int3

            #NO_APP
                                            # -- End function
            .section        .rdata,"dr"
            .p2align        4, 0x0                          # @0

    ```

    Not seeing the spilled registers described in the bug anymore.

commit 6bdb838a05bb7c6f293e53800f46ba182a22f571
Author: Thibault Monnier <97551402+Thibault-Monnier at users.noreply.github.com>
Date:   Wed Dec 3 00:29:12 2025 +0100

    [CIR] Upstream vec shuffle builtins in CIR codegen (#169178)

    This PR is part of #167752. It upstreams the codegen and tests for the
    shuffle builtins implemented in the incubator, including:
    - `vinsert` + `insert`
    - `pblend` + `blend`
    - `vpermilp`
    - `pshuf` + `shufp`
    - `palignr`

    It does NOT upstream the `perm`, `vperm2`, `vpshuf`, `shuf_i` / `shuf_f`
    and `align` builtins, which are not yet implemented in the incubator.

    This _is_ a large commit, but most of it is tests.

    The `pshufd` / `vpermilp` builtins seem to have no test coverage in the
    incubator; what should I do?

commit 9c78bc5de4fc2450d8fd5e5d52e8168ef653958e
Author: Drew Kersnar <dkersnar at nvidia.com>
Date:   Tue Dec 2 17:27:58 2025 -0600

    Revert "[LSV] Merge contiguous chains across scalar types" (#170381)

    Reverts llvm/llvm-project#154069. I pointed out a number of issues
    post-merge, most importantly examples of miscompiles:
    https://github.com/llvm/llvm-project/pull/154069#issuecomment-3603854626.

    While the motivation of the change is clear, I think the implementation
    approach is flawed. It seems like the goal is to allow elements like
    `load <2xi16>` and `load i32` to be vectorized together despite the
    current algorithm not grouping them into the same equivalence classes. I
    personally think that if we want to attempt this it should be a more
    holistic approach, maybe even redefining the concept of an equivalence
    class. This current solution seems like it would be really hard to do
    bug-free, and even if the bugs were not present, it is only able to
    merge chains that happen to be adjacent to each other after
    `splitChainByContiguity`, which seems like it is leaving things up to
    chance whether this optimization kicks in. But we can discuss more in
    the re-land. Maybe the broader approach I'm proposing is too difficult,
    and a narrow optimization is worthwhile. Regardless, this should be
    reverted; it needs more iteration before it is correct.

commit e5f1d025aa9981b5ccad29e367c8a79d23c736f2
Author: Hendrik Hübner <117831077+HendrikHuebner at users.noreply.github.com>
Date:   Wed Dec 3 00:22:46 2025 +0100

    [CIR] Lower calls to trivial copy constructor to cir::CopyOp (#168281)

    This PR is a follow up to #167975 and replaces calls to trivial copy
    constructors with `cir::CopyOp`.

    ---------

    Co-authored-by: Andy Kaylor <akaylor at nvidia.com>
    Co-authored-by: Henrich Lauko <henrich.lau at gmail.com>

commit dbb702fbcb5f43a642db876fac29d1845e320b7a
Author: Shilei Tian <i at tianshilei.me>
Date:   Tue Dec 2 18:17:09 2025 -0500

    [NFC][AMDGPU] Remove trailing white spaces in `AMDGPU.td`

commit 0f235c346c1592345c118565b3e3aaf5e9c72520
Author: Björn Pettersson <bjorn.a.pettersson at ericsson.com>
Date:   Wed Dec 3 00:12:42 2025 +0100

    [LowerConstantIntrinsics] Improve tests related to llvm.objectsize. NFC (#132364)

    Adding some new test cases (including FIXME:s) to highlight some bugs
    related to lowering of llvm.objectsize.

    One special case is when there are getelementptr instruction with index
    types that are larger than the index type size for the pointer being
    analysed. This will add a couple of tests to show what happens both when
    using a smaller and larger index type, and when having out-of-bounds
    indices (both too large and negative).

commit aeea056f604200e3acd78cf279d1ea41eb3f2bfd
Author: Petar Avramovic <Petar.Avramovic at amd.com>
Date:   Tue Dec 2 23:49:21 2025 +0100

    AMDGPU/GlobalISel: Report RegBankLegalize errors using reportGISelFailure (#169918)

    Use standard GlobalISel error reporting with reportGISelFailure
    and have the pass return false instead of hitting llvm_unreachable.
    This also enables -global-isel-abort=0 or 2 for -global-isel -new-reg-bank-select.
    Note: new-reg-bank-select with abort 0 or 2 runs LCSSA,
    while "intended use" without abort or with abort 1 does not run LCSSA.

commit ec6091f4de8a530af198f259db1622e99b2bd954
Author: Alex Duran <alejandro.duran at intel.com>
Date:   Tue Dec 2 23:45:23 2025 +0100

    [OFFLOAD][LIBOMPTARGET] Start to update debug messages in libomptarget (#170265)

    * Add compatibility support for DP and REPORT macros
    * Define a set of predefined Debug Type for libomptarget
    * Start to update libomptarget files (OffloadRTL.cpp, device.cpp)

commit 9885aed474acccccda929f9d784c48ae0041939a
Author: Valentin Clement (バレンタイン クレメン) <clementval at gmail.com>
Date:   Tue Dec 2 14:31:55 2025 -0800

    [flang][cuda] Add address cast for src and dst in TMA operations (#170375)

    The src and dst pointers need to have an address cast.

commit 434127b0c1dbd95a9c776fdf266d51e21da3f770
Author: Helena Kotas <hekotas at microsoft.com>
Date:   Tue Dec 2 14:25:17 2025 -0800

    [HLSL] Static resources (#166880)

    This change fixes a couple of issues with static resources:
    - Enables assignment to static resource or resource array variables (fixes #166458)
    - Initializes static resources and resource arrays with default constructor that sets the handle to poison

commit fff45ddcc05eeed711d19392fcc6786674fa56ca
Author: John Harrison <harjohn at google.com>
Date:   Tue Dec 2 14:19:05 2025 -0800

    [lldb-dap] Follow the spec more closely on 'initialize' arguments. (#170350)

    Updates `InitializeRequestArguments` to correctly follow the spec, see
    https://microsoft.github.io/debug-adapter-protocol/specification#Requests_Initialize.

    This should correct which fields are tracked as optional and simplifies
    some of the types to make sure they're meaningful (e.g. an
    `optional<bool>` isn't any more helpful than a `bool` since undefined and
    false are basically equivalent and it requires us to handle interpreting undefined as the default value in all the places we use the `optional<bool>`).
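
    A tiny illustration of that simplification (the field name is picked
    only for the example):

    ```cpp
    #include <optional>

    // Before (sketch): every consumer has to spell out the default.
    struct InitializeArgsBefore {
      std::optional<bool> supportsProgressReporting;
    };
    // bool on = args.supportsProgressReporting.value_or(false);

    // After (sketch): "undefined" and "false" mean the same thing, so a
    // defaulted bool carries the same information without the unwrapping.
    struct InitializeArgsAfter {
      bool supportsProgressReporting = false;
    };
    ```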

commit 41519b390fa1ae90221af33342d24fd4caa4734f
Author: Florian Hahn <flo at fhahn.com>
Date:   Tue Dec 2 22:15:43 2025 +0000

    [SCEV] Add UDiv canonicalization tests with nested AddRecs.

    Add more tests for follow-up to
    https://github.com/llvm/llvm-project/pull/169576.

commit d3256d935dbd0d9c7c1a525b347783d760e2cb98
Author: Valentin Clement (バレンタイン クレメン) <clementval at gmail.com>
Date:   Tue Dec 2 14:13:19 2025 -0800

    [flang][cuda] Add alignment to shared memory operation (#170372)

    Shared memory for TMA operations needs to be aligned to 16. Add the
    ability to set an alignment on the cuf.shared_memory operation.

commit bd5fa633355638f4e9b176ca82007ff755bb51e9
Author: Florian Hahn <flo at fhahn.com>
Date:   Tue Dec 2 21:59:38 2025 +0000

    [VPlan] Remove duplicated computeCost call (NFC).

    Remove a duplicated computeCost call. NFC, just skipping an
    unneeded call.

commit 4006df9b3276a8c8f03194e09386465d3b611b88
Author: Erich Keane <ekeane at nvidia.com>
Date:   Tue Dec 2 13:56:42 2025 -0800

    [OpenACC][CIR] Implement 'nohost' lowering. (#170369)

    This clause is pretty small/trivial: it is a simple 'set a bool' value
    on the IR node, so its implementation is quite simple. We create the
    Operation with this as 'false', so 'nohost' always marks it as true.

commit f0e1254bce44b85bdeb14fb5318163dab72ccff6
Author: Florian Hahn <flo at fhahn.com>
Date:   Tue Dec 2 21:39:54 2025 +0000

    [LV] Use forced cost once for whole interleave group in legacy costmodel (#168270)

    The VPlan-based cost model assigns the forced cost once for a whole
    VPInterleaveRecipe. Update the legacy cost model to match this behavior.
    This fixes a cost-model divergence, and assigns the cost in a way that
    matches the generated code more accurately.

    PR: https://github.com/llvm/llvm-project/pull/168270

commit 139ebfa63def4935e4cc935254bbc3be5a2bde9e
Author: Jason Macnak <jmacnak at gmail.com>
Date:   Tue Dec 2 13:32:52 2025 -0800

    [Bazel] Fix `--warn-backrefs` errors in `Analysis` target (#170357)

    Commit b262785 introduced a separate `AnalysisFpExc` target to try to
    work around the lack of a bazel equivalent of single source file
    properties. However, this introduces backref errors when
    `--warn-backrefs` is enabled.

    This change alternatively just adds the `-ftrapping-math` copt to the
    entire `Analysis` target.

    Fix suggested by @rocallahan.

commit d97746c56b820d6603c409a0f7d53d8e64f3ee93
Author: asmok-g <102585811+asmok-g at users.noreply.github.com>
Date:   Tue Dec 2 22:18:50 2025 +0100

    [libc++] Fix the rest of __gnu_cxx::hash_XXX copy construction (#160525)

    Co-authored-by: Alexander Kornienko <alexfh at google.com>
    Co-authored-by: Louis Dionne <ldionne.2 at gmail.com>

commit 12ae72744c16610f9f63c8311578f4573d56667b
Author: Andy Kaylor <akaylor at nvidia.com>
Date:   Tue Dec 2 13:15:17 2025 -0800

    [CIR] Upstream support for builtin_constant_p (#170354)

    This upstreams the handler for the BI__builtin_constant_p function.

commit c77fe5845ee75071385755b6b9fc5c905dffad93
Author: Kyungtak Woo <kevinwkt1997 at gmail.com>
Date:   Tue Dec 2 15:04:07 2025 -0600

    [bazel] update bazel build for PluginScriptedProcess (#170364)

    Adding the following dependencies to PluginScriptedProcess:
    -         "//lldb:CoreHeaders",
    -         "//lldb:SymbolHeaders",
    -         "//llvm:Support",

    For c50802cbee3f6f25059422ba0edcc455e395a207

commit c910d821dc3fb33339504e44a1b9c30e25f7b0df
Author: Erich Keane <ekeane at nvidia.com>
Date:   Tue Dec 2 12:58:11 2025 -0800

    [OpenACC][CIR] Add worker/vector clause lowering for Routine (#170358)

    These two are both incredibly similar and simple, basically identical to
    'seq'. This patch adds them both together.

commit 0bb987f4091083d1d8637d1880ecd918ab76793e
Author: Yaxun (Sam) Liu <yaxun.liu at amd.com>
Date:   Tue Dec 2 14:19:55 2025 -0500

    Revert "[CUDA][HIP] Fix CTAD for host/device constructors (#168711)"

    This reverts commit e719e93d4157edfad17e9bf40670decc158470c4.

    revert this since it caused regression in our internal CI.

    Deduction guides with host/device attrs have already been
    used in

    https://github.com/ROCm/rocm-libraries/blob/develop/projects/rocrand/library/src/rng/utils/cpp_utils.hpp#L249

    ```
    template<class V>
    __host__ __device__ vec_wrapper(V) -> vec_wrapper<V>;
    ```

commit ca3de05eca474aaa7f53a62832a3c4cc80c5f43d
Author: Andy Kaylor <akaylor at nvidia.com>
Date:   Tue Dec 2 12:29:04 2025 -0800

    [CIR][NFC] Fix a release build warning (#170359)

    This moves a call inside an assert to avoid a warning about the result
    variable being unused in release builds.

commit 49a978712893fcf9e5f40ac488315d029cf15d3d
Author: Philip Reames <preames at rivosinc.com>
Date:   Tue Dec 2 12:13:11 2025 -0800

    [SCEV] Regenerate a subset of auto updated tests

    Reducing spurious diff in an upcoming change.

commit b50a590984a342a400cf23e6c5e210f9c062eb52
Author: Razvan Lupusoru <razvan.lupusoru at gmail.com>
Date:   Tue Dec 2 12:09:32 2025 -0800

    [acc][flang] Add genLoad and genStore to PointerLikeType (#170348)

    This patch extends the OpenACC PointerLikeType interface with two new
    methods for generating load and store operations, enabling
    dialect-agnostic memory access patterns.

    New Interface Methods:
    - genLoad(builder, loc, srcPtr, valueType): Generates a load operation
    from a pointer-like value. Returns the loaded value.

    - genStore(builder, loc, valueToStore, destPtr): Generates a store
    operation to a pointer-like value.

    Implementations provided for FIR pointer-like types, memref type (rank-0
    only), and LLVM pointer types.

    Extended TestPointerLikeTypeInterface.cpp with 'load' and 'store' test
    modes.

commit 6dd639ec9e7aeb957ec0b2bc0830ecdf6ce5efaa
Author: Erich Keane <ekeane at nvidia.com>
Date:   Tue Dec 2 11:55:14 2025 -0800

    [CIR][OpenACC] Implement 'routine' lowering + seq clause (#170207)

    The 'routine' construct just adds an acc.routine element to the global
    module, which contains all of the information about the directive. It
    contains a reference to the function, which in turn contains a reference
    to the acc.routine that this generates.

    This handles both the implicit-func version (where the routine is
        spelled without parens, and just applies to the next function) and
    the explicit-func version (where the routine is spelled with the func
        name in parens).

    The AST stores the directive in an OpenACCRoutineDeclAttr in the
    implicit case, so we can emit that when we hit the function declaration.
    The explicit case is held in an OpenACCRoutineAnnotAttr on the function;
    however, when we emit the function we haven't necessarily seen the
    construct yet, so we can't depend on that attribute. Instead, we save up
    the list in Sema so that we can emit them all at the end.

    This results in the tests getting really hard to read (because ordering
    is a little awkward based on spelling, with no way to fix it), so we
    instead split the tests up based on topic.

    One last thing: Flang spends some time determining if the clause lists
    of two routines on the same function are identical, and omits the
    duplicates. However, it seems to do a poor job on this when the ordering
    isn't the same, or references are slightly different. This patch doesn't
    bother trying that, and instead emits them all, trusting the ACC dialect
    to remove or handle duplicates gracefully.

    Note: This doesn't cause emission of functions that would otherwise not
    be emitted, but DOES emit routine references based on which function
    they are attached to.

commit fae64adaa6a69eafb1c5dca0db82cbc48694e3f2
Author: David Peixotto <peix at meta.com>
Date:   Tue Dec 2 11:13:48 2025 -0800

    [lldb] Handle deref of register and implicit locations (#169419)

    This commit modifies the dwarf expression evaluator in how we handle the
    deref operation for register and implicit locations on the stack. For a
    typical memory location a deref operation will read the value from
    memory. For register and implicit locations the deref operation will
    read the value from the register or its implicit location. In lldb we
    eagerly read register and implicit values and push them on the stack so
    the deref operation for these becomes a "no-op" that leaves the value on
    the stack and updates the tracked location kind.

    The motivation for this change is to handle `DW_OP_deref*` operations on
    location descriptions as described by the heterogeneous debugging
    [extensions](https://rocm.docs.amd.com/projects/llvm-project/en/latest/LLVM/llvm/html/AMDGPUDwarfExtensionsForHeterogeneousDebugging.html#a-2-5-4-4-4-register-location-description-operations).

    Specifically, for register locations it states

    > These operations obtain a register location. To fetch the contents of
    > a register, it is necessary to use DW_OP_regval_type, use one of the
    > DW_OP_breg* register-based addressing operations, or use DW_OP_deref*
    on
    > a register location description.

    My understanding is that this is the intended behavior from dwarf5 as
    well and is not a change in behavior.

commit 3f2e3e67c11d3a86123aeb9ef5adfd9c9eb6f3ba
Author: Krzysztof Drewniak <Krzysztof.Drewniak at amd.com>
Date:   Tue Dec 2 11:02:45 2025 -0800

    [mlir][AMDGPU][NFC] Fix overlapping masked load refinements (#159805)

    The two patterns for handling vector.maskedload on AMD GPUs had an
    overlap: both the "scalar mask becomes an if statement" pattern and the
    "masked loads become a normal load + a select on buffers" pattern could
    handle a load with a broadcast mask on a fat buffer resource.

    This commit adds checks to resolve the overlap.

commit c50802cbee3f6f25059422ba0edcc455e395a207
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Tue Dec 2 10:59:40 2025 -0800

    Reland "[lldb] Introduce ScriptedFrameProvider for real threads (#161870)" (#170236)

    This patch re-lands #161870 with fixes to the previous test failures.

    rdar://161834688

    Signed-off-by: Med Ismail Bennani <ismail at bennani.ma>

commit 879dddf2b4ede2e6474964f9e5b63545d271c733
Author: David Green <david.green at arm.com>
Date:   Tue Dec 2 18:58:32 2025 +0000

    [AArch64] Add tests for umulh. NFC

commit 6e262aa8ba3f06a23e1df6857aa65042ea4f5ef5
Author: LLVM GN Syncbot <llvmgnsyncbot at gmail.com>
Date:   Tue Dec 2 18:51:35 2025 +0000

    [gn build] Port 41a53c0a23ee

commit 73979c1df9695f281d78ad8e18a7023bcbbceab9
Author: Erick Ochoa Lopez <erick.ochoalopez at amd.com>
Date:   Tue Dec 2 13:48:31 2025 -0500

    [mlir][amdgpu] Lower amdgpu.make_dma_base (#169817)

    * Adds lowering for `amdgpu.make_dma_base`

commit 697b1be09cefd0a2c166fdbdfd5b744224808d02
Author: Changpeng Fang <changpeng.fang at amd.com>
Date:   Tue Dec 2 10:47:00 2025 -0800

    [AMDGPU][NFC] Put gfx125x common features into 12_50_Common (#170338)

commit 5c3c0020af102f4d1887f277ecb726c3ccf00daf
Author: Robert Imschweiler <robert.imschweiler at amd.com>
Date:   Tue Dec 2 19:42:31 2025 +0100

    [NFC] Refactor TargetLowering::getTgtMemIntrinsic to take CallBase parameter (#170334)

    cf.
    https://github.com/llvm/llvm-project/pull/133907#discussion_r2578576548

commit 2183846a15a04791cf7d85ca5d61d4c89505d3ab
Author: hjagasiaAMD <harsha.jagasia at amd.com>
Date:   Tue Dec 2 12:41:16 2025 -0600

    [AMDGPU] Fix AGPR_32 reg assign for mfma scale ops (#168964)

    In the MFMA rewrite pass, prevent AGPR_32 reg class assignment for scale
    operands, which is not permitted by the instruction format.

    ---------

    Co-authored-by: Matt Arsenault <arsenm2 at gmail.com>

commit 41a53c0a23ee3268c930fa67cc0a39f18c49efc4
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Tue Dec 2 10:41:03 2025 -0800

    [lldb/Target] Add BorrowedStackFrame and make StackFrame methods virtual (#170191)

    This change makes StackFrame methods virtual to enable subclass
    overrides and introduces BorrowedStackFrame, a wrapper that presents an
    existing StackFrame with a different frame index.

    This enables creating synthetic frame views or renumbering frames
    without copying the underlying frame data, which is useful for frame
    manipulation scenarios.

    This also adds a new borrowed-info format entity to show the original
    frame index of the borrowed frame.
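
    A minimal model of that wrapper idea (toy classes, not LLDB's actual
    StackFrame):

    ```cpp
    #include <cstdint>
    #include <memory>

    // Toy model: virtual accessors let a wrapper re-present an existing
    // frame under a different index without copying its data.
    struct FrameSketch {
      virtual ~FrameSketch() = default;
      virtual uint32_t GetFrameIndex() const = 0;
      virtual uint64_t GetPC() const = 0;
    };

    struct ConcreteFrameSketch : FrameSketch {
      ConcreteFrameSketch(uint32_t idx, uint64_t pc) : idx(idx), pc(pc) {}
      uint32_t GetFrameIndex() const override { return idx; }
      uint64_t GetPC() const override { return pc; }
      uint32_t idx;
      uint64_t pc;
    };

    // "Borrowed" frame: forwards everything to the wrapped frame but
    // reports a caller-chosen index, which is what synthetic frame views
    // or renumbered frame lists need.
    struct BorrowedFrameSketch : FrameSketch {
      BorrowedFrameSketch(std::shared_ptr<FrameSketch> wrapped, uint32_t new_idx)
          : wrapped(std::move(wrapped)), new_idx(new_idx) {}
      uint32_t GetFrameIndex() const override { return new_idx; }
      uint64_t GetPC() const override { return wrapped->GetPC(); }
      std::shared_ptr<FrameSketch> wrapped;
      uint32_t new_idx;
    };
    ```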

    Signed-off-by: Med Ismail Bennani <ismail at bennani.ma>

commit 1a3709cc7e88cbd354b6b102e87d02f379bce3b9
Author: Nick Sarnie <nick.sarnie at intel.com>
Date:   Wed Dec 3 03:24:59 2025 +0900

    [SPIRV] Error for zero-length arrays if not a shader (#169732)

    I had a case where the frontend was generating a zero-element array in
    non-shader code, so it was just crashing in a release build.
    Add a real error and make it not crash.

    ---------

    Signed-off-by: Nick Sarnie <nick.sarnie at intel.com>

commit e0db7f347c0afe2f1cdf3511f2e99cf5fc8541ed
Author: Jasmine Tang <jjasmine at igalia.com>
Date:   Tue Dec 2 18:23:17 2025 +0000

    [WebAssembly] Optimize away mask of 63 for sra and srl( zext (and i32 63))) (#170128)

    Follow up to #71844 after shl implementation

commit 23a22d0497eae08fa1ba7a0ecb2570eb07f5cfc8
Author: Shubham Sandeep Rastogi <Shubham.Rastogi at sony.com>
Date:   Tue Dec 2 10:12:20 2025 -0800

    [SROA] Unify the names of new instructions created in SROA. (#167917)

    In Debug builds, the names of adjusted pointers have a pointer-specific
    name prefix which doesn't exist in non-debug builds.

    This causes differences in output when looking at the output of SROA
    with a Debug or Release compiler.

    For most of our ongoing testing, we use essentially a Release+Asserts
    build (basically Release but without NDEBUG defined); however, we ship a
    Release compiler. Therefore we want to say with reasonable confidence
    that building a large project with Release vs a Release+Asserts build
    gives us the same output when the same compiler version is used.

    This difference, however, makes it difficult to prove that the output is
    the same if the only difference is the name when using LTO builds and
    looking at bitcode.

    Hence this change is being proposed.

commit 4587fe6be8e7c6269766fa7a26b120bd88e9bf40
Author: serge-sans-paille <sguelton at mozilla.com>
Date:   Tue Dec 2 18:11:27 2025 +0000

    [lld] Fix typo in lld manpage, nfc (#170299)

commit 2c38632639e588818add82ba9c8bac5ae774840e
Author: Matt Arsenault <Matthew.Arsenault at amd.com>
Date:   Tue Dec 2 13:10:48 2025 -0500

    LTO: Remove unused TargetLibraryInfo include (#170340)

commit ef49c9227155a9b9483356c7206a92eda693e90b
Author: Rahul Joshi <rjoshi at nvidia.com>
Date:   Tue Dec 2 10:03:31 2025 -0800

    [NFC][LLVM] Namespace cleanup in ScalarEvolution (#166620)

commit 5e5937c3d2e493a48837b2bdf179a53e8b80a66a
Author: Ahmed Nour <ahmednour.mohamed2012 at gmail.com>
Date:   Tue Dec 2 20:02:11 2025 +0200

    [LLDB] Add SBFrameExtensions Tests (#169236)

    Fixes part of https://github.com/llvm/llvm-project/issues/168920

commit ac66ae45cd22a7958ace645a035831000bfcbf51
Author: Kyungtak Woo <kevinwkt1997 at gmail.com>
Date:   Tue Dec 2 12:00:07 2025 -0600

    [bazel] feat: update bazel lldb for llvm:support dep (#170344)

    Adding the llvm:Support dep since the plugin started using llvm/ADT/...

commit e0f330293edb929152f44f1566d986b74ad5c1fc
Author: Yingwei Zheng <dtcxzyw2333 at gmail.com>
Date:   Wed Dec 3 01:56:38 2025 +0800

    [ValueTracking] Support scalable vector splats in computeKnownFPClass (#170325)

    Address comment
    https://github.com/llvm/llvm-project/pull/169904#discussion_r2576299467

commit ea00593dd10336ea452f34cb38269e911136286c
Author: Artem Kroviakov <71938912+akroviakov at users.noreply.github.com>
Date:   Tue Dec 2 18:49:06 2025 +0100

    [MLIR][XeGPU][Quickfix] Disable block count in propagation (#170304)

    One of the previous PRs
    https://github.com/llvm/llvm-project/pull/169267/ has reintroduced block
    count to layout propagation that was removed in
    https://github.com/llvm/llvm-project/pull/168504/. This PR patches the
    issue.

commit a8ef3c8eb9d4afff8c87b291f04fd826977b7414
Author: Andrzej Warzyński <andrzej.warzynski at arm.com>
Date:   Tue Dec 2 17:44:20 2025 +0000

    [mlir][vector][test] Fix comment in test (nfc) (#170336)

    Fix a comment post #162167

commit 5fa103a7fc804ab39c6253b384fdd38b4de388ce
Author: J. Ryan Stinnett <jryans at gmail.com>
Date:   Tue Dec 2 17:43:35 2025 +0000

    [clang][Docs] Move debug info flags into groups (#169942)

    This moves a few existing debug info flags that were floating in the
    general pool of unorganised flags over to the existing groups for debug
    info flags (so that they are presented together in documentation).

    As a tiny further tweak, this also fixes the spelling of "DWARF" in the
    flag docs for consistency with other flags.

commit 1e6476ddb70daab17533617aa8712cfd6c9f0c76
Author: Florian Hahn <flo at fhahn.com>
Date:   Tue Dec 2 17:39:02 2025 +0000

    [LV] Add predicated store sinking tests requiring further noalias checks

    Add additional tests where extra no-alias checks are needed, as future
    extensions of https://github.com/llvm/llvm-project/pull/168771.

commit c0371289ed6289549da73f79d29e827867d9ef2f
Author: David Green <david.green at arm.com>
Date:   Tue Dec 2 17:23:15 2025 +0000

    [ARM] Introduce intrinsics for MVE minnm/maxnm under strict-fp. (#169795)

    Similar to #169156 again, this is mostly for denormal handling as there
    is no rounding step in a minnum/maxnum.

commit 2ad71745cd2b6a266b4bd08e6a82a14e393ee915
Author: John Brawn <john.brawn at arm.com>
Date:   Tue Dec 2 17:15:00 2025 +0000

    [LSR] Insert the transformed IV increment in the user block (#169515)

    Currently we try to hoist the transformed IV increment instruction to
    the header block to help with generation of postincrement instructions,
    but this only works if the user instruction is also in the header. We
    should instead be trying to insert it in the same block as the user.

commit 90634160d0687a58a5dec8d199013eb31203de5e
Author: Jason-VanBeusekom <jason.van-beusekom at hpe.com>
Date:   Tue Dec 2 11:12:56 2025 -0600

    [OpenMP][clang] Remove metadata checks in amdgcn_weak_alias.c (#170326)

    4394aa685c4b01ad3782a137fcfebeadc4941df1 introduced the test
    amdgcn_weak_alias, which is failing on the reverse iteration build, due
    to the order of the aliasees being different. This failure is a test
    issue, not a bug, so the metadata checks are removed.

commit 5c552c5cff656f8f3b292fcfb527a8f1c0e52798
Author: Ebuka Ezike <yerimyah1 at gmail.com>
Date:   Tue Dec 2 17:10:08 2025 +0000

    [lldb] Fix GetExpressionPath for vector registers (#169210)

    Vector registers have synthetic values for display purposes. This causes
    SBValue::GetExpressionPath to dispatch
    to ValueObjectSynthetic instead of ValueObjectRegister, producing
    incorrect results.

    Fixes #147144

commit 2209d335206c6901d28efc8624a242e66b982022
Author: Ahmed Nour <ahmednour.mohamed2012 at gmail.com>
Date:   Tue Dec 2 19:04:14 2025 +0200

    [CIR][X86] Add support for kunpck builtins (#168757)

    Part of https://github.com/llvm/llvm-project/issues/167765

commit 5681c71a803e8bb4f574f8199406085272e4a7c3
Author: Michael Kruse <llvm-project at meinersbur.de>
Date:   Tue Dec 2 16:57:31 2025 +0100

    Revert "[flang] implement show_descriptor intrinsic, a non-standard extension (#169137)"

    This reverts commit e7748e92cd5d71af2e1699328b7c575e9b9bf479.

    It broke the Windows build

    https://github.com/llvm/llvm-project/actions/runs/19842117405/job/56852610863
    https://lab.llvm.org/buildbot/#/builders/166/builds/4535

    After #170142 fixed another issue, this was also the remaining reason
    for this buildbot to fail:

    https://lab.llvm.org/buildbot/#/builders/207/builds/10423

commit 669683a036bf256e9cfba21bd2b70bafbf03be45
Author: Matt Arsenault <Matthew.Arsenault at amd.com>
Date:   Tue Dec 2 11:46:16 2025 -0500

    clang/AMDGPU: Add missing __opencl_c_read_write_images feature macro (#170307)

    This is a partial fix for the rocm device-libs build. This
    was most likely broken by 423bdb2bf257e19271d62e60b6339d84b8ce05aa

commit 4ff3d1cd9d6e3a2bbe2869c5027c2531ff12e3ce
Author: Mircea Trofin <mtrofin at google.com>
Date:   Tue Dec 2 08:39:42 2025 -0800

    [profcheck] update exclude list (#170316)

commit c6910201cc70014d1360f6038b5eb61fdc3c8788
Author: Zahira Ammarguellat <zahira.ammarguellat at intel.com>
Date:   Tue Dec 2 11:38:18 2025 -0500

    [NFC] [clang-tidy] Fix potential SA issues. (#170289)

    This patch addresses issues identified by the static analyzers, which
    appear to be legitimate problems.

    `FloatLoopCounterCheck.cpp`: "Dereferencing a pointer that might be
    `nullptr` FS when calling `getInc`".
    `ProBoundsAvoidUncheckedContainerAccessCheck.cpp`: "Dereferencing a
    pointer that might be `nullptr Callee` when calling `getBeginLoc`".
    `ExpandModularHeadersPPCallbacks.cpp`: Non-static class member
    `CurrentToken.Flags` is not initialized in this constructor nor in any
    functions that it calls. (line #101).

commit 6984f942bc5bd7a64095597d41d0b23d4734f070
Author: Marco Elver <elver at google.com>
Date:   Tue Dec 2 17:38:02 2025 +0100

    [MemProf] Require x86 for memprof-pgho.cpp test (#170321)

    This requires an x86 build, otherwise the test will fail with:

    ```
    Error running ThinLTO backend: No available targets are compatible with triple "x86_64-unknown-linux-gnu"
    ```

commit c21fd448a3daaa81fb59b076f9e7eae490fc28d5
Author: David Stone <davidfromonline at gmail.com>
Date:   Tue Dec 2 09:24:05 2025 -0700

    [clang][deps][NFC] Replace a vector with an array (#169555)

    `ResourceDirectoryCache::findResourceDir` uses a `std::vector` when a
    `std::array` would do.

commit e07e60e5dc911f689ba02c0bcbad472b436eef87
Author: Matt Arsenault <Matthew.Arsenault at amd.com>
Date:   Tue Dec 2 11:23:22 2025 -0500

    libclc: Fix build in atomic_def.inc (#170306)

commit 0d853aefecf6232121ac2d33664e90aa6759632b
Author: Matt Arsenault <Matthew.Arsenault at amd.com>
Date:   Tue Dec 2 11:19:46 2025 -0500

    AMDGPU: Fix treating unknown mem operands as uniform (#170309)

    The test changes are mostly GlobalISel specific regressions.
    GlobalISel is still relying on isUniformMMO, but it doesn't really
    have an excuse for doing so. These should be avoidable with new
    regbankselect.

    There is an additional regression for addrspacecast for cov4. We
    probably ought to be using a separate PseudoSourceValue for the
    access of the queue pointer.

commit cdc41478a0142529e57d2669a3025601f5d136c0
Author: Felipe de Azevedo Piovezan <fpiovezan at apple.com>
Date:   Tue Dec 2 16:13:54 2025 +0000

    [lldb][shell tests] Properly fix fallout from c8031c3dd743

commit 25b6a15dfd228a4bf10c77240cecb26864e0e527
Author: Petar Avramovic <Petar.Avramovic at amd.com>
Date:   Tue Dec 2 17:12:10 2025 +0100

    GlobalISel: Stop using TPC to check if GlobalISelAbort is enabled (#169917)

    The new pass manager does not use TargetPassConfig.
    GlobalISel requires TargetPassConfig for reportGISelFailure,
    and its only actual use is to check whether GlobalISelAbort is enabled.
    TargetPassConfig uses TargetMachine to check whether GlobalISelAbort is
    enabled, but TargetMachine is also available from MachineFunction.

commit 47d66bf34bc96fd7d667e8c3efd44bdf8d7a056a
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Tue Dec 2 16:08:37 2025 +0000

    [X86] Add tests showing failure to concat fcmp instructions together (#170313)

    Some of the AVX512 cases are already handled by #170295

commit 23e6dbf864f4ff730dc2949dcc74d75633641624
Author: Matt Arsenault <Matthew.Arsenault at amd.com>
Date:   Tue Dec 2 10:48:02 2025 -0500

    AMDGPU: Use ConstantPool as source value for DAG lowered kernarg loads (#168917)

    This isn't quite a constant pool, but probably close enough for this
    purpose. We just need some known invariant value address. The aliasing
    queries against the real kernarg base pointer will falsely report
    no aliasing, but for invariant memory it probably doesn't matter.

commit 734a912d0f025559fcf76bde9aaaeb0383c1625a
Author: Nathan Gauër <brioche at google.com>
Date:   Tue Dec 2 16:35:25 2025 +0100

    [Clang][HLSL] Fix invalid flag passed by the driver (#170300)

    The tests were using the DXC driver in Clang, which adds the
    `--spirv-ext=` option. Turns out some buildbots are built without this
    flag support, meaning any test using this driver would fail with an
    'unknown command line argument' error.

commit b4149a013d8ea93c3a34fe88a1eb0a80a8c8b6b9
Author: Yue Huang <30948580+AdUhTkJm at users.noreply.github.com>
Date:   Tue Dec 2 15:35:18 2025 +0000

    [MLIR][Presburger] Fix Gaussian elimination (#164437)

    In the Presburger library, there are two minor bugs of Gaussian
    elimination.

    In Barvinok.cpp, the `if (equations(i, i) != 0) continue;` is intended
    to skip only the row-swapping, but it in fact skips the whole loop
    body altogether, including the elimination parts.

    In IntegerRelation.cpp, the Gaussian elimination forgets to advance
    `firstVar` (the number of finished columns) when it finishes a column.
    Moreover, when it checks the pivot row of each column, it does not ignore
    the rows already considered.

    As an example, suppose the constraints are
    ```
    1 0 0 1 2 = 0
    0 1 0 0 3 = 0
    0 0 0 1 4 = 0
    ...
    ```
    For the 4th column, it will think the pivot is the first row `1 0 0 1 2
    = 0`, rather than the correct 3rd row `0 0 0 1 4 = 0`.

    (This bug went undiscovered because, if we don't advance `firstVar`,
    this Gaussian elimination process will simply do nothing. Moreover,
    it is called only in `simplify()`, and the existing test cases don't
    care whether a set has been simplified.)

commit bfc45712f836b3a48eb5c4e1779b6368ae7ac80d
Author: Felipe de Azevedo Piovezan <fpiovezan at apple.com>
Date:   Tue Dec 2 15:34:59 2025 +0000

    Revert "Revert "[LLDB] Update Shell lit config to handle c8031c3dd743"" (#170312)

    Reverts llvm/llvm-project#170288

    Turns out this was not the cause of the failure

commit e719e93d4157edfad17e9bf40670decc158470c4
Author: Yaxun (Sam) Liu <yaxun.liu at amd.com>
Date:   Tue Dec 2 10:34:48 2025 -0500

    [CUDA][HIP] Fix CTAD for host/device constructors (#168711)

    Clang currently does not allow using CTAD in CUDA/HIP device functions
    since deduction guides are treated as host-only. This patch fixes that
    by treating deduction guides as host+device. The rationale is that
    deduction guides do not actually generate code in IR, and there is an
    existing check for device/host correctness for constructors.

    The patch also suppresses duplicate implicit deduction guides from
    host/device constructors with identical signatures and constraints
    to prevent ambiguity.

    For CUDA/HIP, deduction guides are now always implicitly enabled for
    both host and device, which matches nvcc's effective behavior. Unlike
    nvcc, which silently ignores explicit CUDA/HIP target attributes on
    deduction guides, Clang diagnoses such attributes as errors to keep
    the syntax clean and avoid confusion.

    This ensures CTAD works correctly in CUDA/HIP for constructors with
    different target attributes and provides clearer diagnostics when users
    attempt to annotate deduction guides with CUDA/HIP target attributes.

    Example:

    ```
      #include <tuple>

      __host__ __device__ void func()
      {
        std::tuple<int, int> t = std::tuple(1, 1);
      }
    ```

    This compiles with nvcc but fails with clang for CUDA/HIP without this
    fix.

    Reference: https://godbolt.org/z/WhT1GrhWE

    Fixes: https://github.com/ROCm/ROCm/issues/5646

    Fixes: https://github.com/llvm/llvm-project/issues/146646

commit 00f3410719d090fe8aa77cc5ecc1a280c01fbf0d
Author: Arjun P <arjunpitchanathan at gmail.com>
Date:   Tue Dec 2 15:17:05 2025 +0000

    [MLIR][Presburger] add atConstraint to index into combined constraint matrix

commit 87f4e809425da31b19a5a86833c3f1af4981cc99
Author: Nick Sarnie <nick.sarnie at intel.com>
Date:   Wed Dec 3 00:21:55 2025 +0900

    [SPIRV] Add support for CodeSectionINTEL storage class in legalizer (#167961)

    The
    [SPV_INTEL_function_pointers](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/spirv-extensions/SPV_INTEL_function_pointers.asciidoc)
    extension defines a new storage class `CodeSectionINTEL` that is
    represented in LLVM IR as `addrspace(9)`.

    Per the spec, it is basically not allowed to be cast to, or interact
    with, pointers in other storage classes.

    Add `addrspace(9)` as a known pointer type to the legalizer, and then
    add some error cases for IR that is impossible to legalize.

    Right now, if you try to run the SPIR-V backend on such input, basically
    everything errors out saying that it is unable to legalize because `ptr
    addrspace(9)` is not considered a pointer type.

    Ideally the FE should not generate the illegal IR or error out earlier,
    but we should catch it before generating invalid SPIR-V.

    ---------

    Signed-off-by: Nick Sarnie <nick.sarnie at intel.com>

commit 3c6864ab8879f7274e6c24c3b7c8e8139cd135dd
Author: Marco Elver <elver at google.com>
Date:   Tue Dec 2 16:13:45 2025 +0100

    [Clang][CodeGen] Remove explicit insertion of AllocToken pass (#169360)

    Remove explicit insertion of the AllocTokenPass, which is now handled by
    the PassBuilder. Emit AllocToken configuration as LLVM module flags to
    persist into the backend.

    Specifically, this also means it will now be handled by LTO backend
    phases; this avoids interference with other optimizations (e.g. PGHO)
    and enables late heap-allocation optimizations with LTO enabled.

commit ca7edf2d141379827a9e107656a11bfe3735d11e
Author: Chinmay Deshpande <chdeshpa at amd.com>
Date:   Tue Dec 2 06:55:25 2025 -0800

    [AMDGPU][GISel] Add RegBankLegalize support for G_STRICT_{FADD|FMUL} (#169406)

commit 854df547a023715bb6229d410d0699be2d3c3d04
Author: Nikita Popov <npopov at redhat.com>
Date:   Tue Dec 2 15:53:26 2025 +0100

    [Support] Optimize DebugCounter hot path (NFC) (#170260)

    When enabling ShouldPrintCounter, also set Enabled, so that we only have
    to check one of them. This cuts down the cost of (disabled) debug
    counters by half.
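
    A rough sketch of the pattern described above (illustrative names only,
    not the actual DebugCounter members): anything that needs the slow path
    also sets the single flag tested on the hot path.

    ```cpp
    // Illustrative sketch: one cached flag guards the hot path, and enabling
    // printing also sets that flag, so the fast path stays a single boolean
    // test.
    struct CounterSketch {
      bool Enabled = false;            // the only flag tested on the hot path
      bool ShouldPrintCounter = false;

      void enablePrinting() {
        ShouldPrintCounter = true;
        Enabled = true;                // piggy-back on the existing hot-path check
      }

      bool shouldExecute() {
        if (!Enabled)                  // common case: counters disabled
          return true;
        // ... slow path: counting, optional printing ...
        return true;
      }
    };
    ```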

commit 5a32fd3ea501ab78f5a5fc820f61fe81a98edc40
Author: Felipe de Azevedo Piovezan <fpiovezan at apple.com>
Date:   Tue Dec 2 14:53:02 2025 +0000

    [lldb][NFCI] Rewrite UnwindAssemblyInstEmulation in terms of a CFG visit (#169630)

    Currently, UnwindAssemblyInstEmulation visits instructions in the order
    in which they appear in a function. This commit makes an NFCI change to
    UnwindAssemblyInstEmulation so that it follows the function's CFG:

    1. The first instruction is enqueued.
    2. While the queue is not empty:
    2.1 Visit the instruction in the *back* queue to compute the new unwind
        state.
    2.2 Push(+) the next instruction to the *back* of the queue.
    2.3 If the instruction is a forward branch with a known branch target,
        push(+) the destination instruction to the *front* of the queue.

    (+) Only push if this instruction hasn't been enqueued before.
    (+) When pushing an instruction, the current unwind state is attached to
    it.

    Note that:
    * the "next instruction" is pushed to the *back* of the queue,
    * a branch target is pushed to the *front* of the queue, and
    * we always dequeue from the *back* of the queue.

    This means that consecutive instructions are visited one after the
    other; this is important to support "conditional blocks" [1] of
    instructions (see the line with "if last_condition != new_condition").
    This is arguably a very Thumb specific thing, so maybe it shouldn't be
    in the generic algorithm; that said, it is already in the code, so we
    have to support it.
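
    A minimal sketch of the queue discipline described above (illustrative
    only, not the actual LLDB code): pop from the back, push the fall-through
    instruction to the back, and push forward-branch targets to the front.

    ```cpp
    #include <cstdint>
    #include <deque>
    #include <set>

    struct WorkItem {
      uint32_t InsnIdx; // the attached unwind state is omitted here
    };

    void Visit(uint32_t num_insns) {
      std::deque<WorkItem> queue;
      std::set<uint32_t> enqueued;
      queue.push_back({0});              // 1. enqueue the first instruction
      enqueued.insert(0);
      while (!queue.empty()) {           // 2.
        WorkItem item = queue.back();    // 2.1 visit the item at the *back*
        queue.pop_back();
        // ... emulate instruction item.InsnIdx and update the unwind state ...
        uint32_t next = item.InsnIdx + 1;
        if (next < num_insns && enqueued.insert(next).second)
          queue.push_back({next});       // 2.2 next instruction -> *back*
        // 2.3 if item is a forward branch with a known target `target`:
        //   if (enqueued.insert(target).second)
        //     queue.push_front({target}); // branch target -> *front*
      }
    }
    ```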

    The main reason this patch is NFCI and not NFC is that, now, the
    destination of a forward branch is visited in a slightly different
    moment than before. This should not cause any changes in output, as if a
    branch destination is reachable through two different paths, any well
    behaved compiler will generate the same unwind state in both paths.

    The motivation for this patch is to change step 2.2 so that it _only_
    pushes the next instruction if the current instruction is not an
    unconditional branch / return, and to change step 2.3 so that backwards
    branches are also allowed, fixing the bug described by [2].

    [1]:
    https://developer.arm.com/documentation/dui0473/m/arm-and-thumb-instructions/it
    [2]: https://github.com/llvm/llvm-project/pull/168398

    Part of a sequence of PRs:
    [lldb][NFCI] Rewrite UnwindAssemblyInstEmulation in terms of a CFG visit
    #169630
    [lldb][NFC] Rename forward_branch_offset to branch_offset in
    UnwindAssemblyInstEmulation #169631
    [lldb] Add DisassemblerLLVMC::IsBarrier API #169632
    [lldb] Handle backwards branches in UnwindAssemblyInstEmulation #169633

    commit-id:dce6b515

commit 84e46aa62d66fab59c0b3beee7b4b154d62eeb0f
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Tue Dec 2 14:50:51 2025 +0000

    [X86] combineConcatVectorOps - add handling to concat setcc instructions together (#170295)

    So far this only handles AVX512 predicate masks, which is by far the
    easiest to support - AVX1/AVX2 support can mostly be dealt with via CMPP
    + CMPEQ/GT nodes (but these still fail for some icmp expansions where
    nodes have multiple uses).

commit d6f92050c0c2f60e78f3c8bcf557c5e69b025d7a
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Tue Dec 2 14:38:01 2025 +0000

    [XRay] Mark test unsupported on armhf

    This addresses a buildbot failure now that these tests actually run more
    broadly.

    error: ALWAYSINSTR: expected string not found in input
    // ALWAYSINSTR: {{.*function-name:.*main.*}}
                    ^
    <stdin>:1:1: note: scanning from here

commit 7e29448b4e517631b228b11e855b8ecd1d357dff
Author: Hendrik Hübner <117831077+HendrikHuebner at users.noreply.github.com>
Date:   Tue Dec 2 15:28:12 2025 +0100

    [CIR] Upstream var arg copy builtin (#169415)

    This PR upstreams `__builtin_va_copy`, and extends the existing tests.

commit 5d876093b72182ede3d8beb551397b7fe90faa84
Author: Florian Hahn <flo at fhahn.com>
Date:   Tue Dec 2 14:09:53 2025 +0000

    [SCEV] Allow udiv canonicalization of potentially-wrapping AddRecs (#169576)

    Extend the {X,+,N}/C => {(X - X%N),+,N}/C canonicalization to handle
    AddRecs that may wrap, when X < N <= C and both N,C are powers of 2. The
    alignment and power-of-2 properties ensure division results remain
    equivalent for all offsets [(X - X%N), X).
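
    For illustration only (values chosen to satisfy X < N <= C with N, C
    powers of 2): with X = 3, N = 4, C = 8, the AddRec {3,+,4} takes the
    values 3, 7, 11, 15, ..., which divided by 8 give 0, 0, 1, 1, ...; the
    rewritten AddRec {3 - 3%4,+,4} = {0,+,4} takes 0, 4, 8, 12, ..., which
    divided by 8 give the same 0, 0, 1, 1, ...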

    Alive2 Proof: https://alive2.llvm.org/ce/z/iu2tav

    Fixes https://github.com/llvm/llvm-project/issues/168709

    PR: https://github.com/llvm/llvm-project/pull/169576

commit 8a40d08b836ff5f363783f99efd901a44d8575de
Author: Nathan Gauër <brioche at google.com>
Date:   Tue Dec 2 15:08:31 2025 +0100

    [HLSL][SPIR-V] Implement vk::location for inputs (#169479)

    This commit adds the support for vk::location attribute which can be
    applied to input and output variables.

    As in/inout parameters are not supported yet, vk::location on such
    parameters is not tested.

    As implemented in DXC, vk::location has the following rules:
    - inputs and outputs are handled independently.
    - inputs/outputs lowered to SPIR-V builtins do not use the assigned
    vk::location, which is thus ignored.
    - inputs/outputs lowered to a Location decoration must either all have
    explicit locations, or none. Mixing is not allowed (except with
    builtins).

commit c2a0350d8663b37ff96d3e7ca088ffb7d995ef29
Author: Nikolas Klauser <nikolasklauser at berlin.de>
Date:   Tue Dec 2 15:02:04 2025 +0100

    [libc++] Simplify a few places where we use __index_sequence (#167213)

    This is done in two ways:
    1) `index_sequence_for` is back-ported as `__index_sequence_for`
    2) Extra functions just to expand the parameter pack are replaced with
    lambdas

commit cd5ed7ca87fbf287c4453c728cb92f77a4ecf78c
Author: Louis Dionne <ldionne.2 at gmail.com>
Date:   Tue Dec 2 08:58:00 2025 -0500

    [libc++] Make CC and CXX environment variables mandatory in run-buildbot (#166875)

    Previously, the bootstrapping-build job defined in run-buildbot required
    the CC and CXX environment variables to be defined even though
    run-buildbot documents these environment variables as being optional. It
    also relied on ccache being available.

    Refactor run-buildbot to make CC and CXX mandatory, and refactor various
    places in the CI where we called run-buildbot without setting CC and
    CXX. After this patch, all places that use run-buildbot are setting CC
    and CXX before calling the script, which makes it easier to track what
    compiler is used where. This also allows simplifying run-buildbot
    itself.

    Finally, this patch makes ccache optional for running the bootstrapping
    build.

commit 63f48fd829ff8e1d400d9896ba6ab9730fd2773b
Author: Bertik23 <39457484+Bertik23 at users.noreply.github.com>
Date:   Tue Dec 2 14:48:18 2025 +0100

    [CFGPrinter] Add node id formater (#164623)

    This PR is part of the LLVM IR LSP server project
    ([RFC](https://discourse.llvm.org/t/rfc-ir-visualization-with-vs-code-extension-using-an-lsp-server/87773))

    Sometimes it is nice to be able to specify the IDs of nodes in the printed
    CFG, for easier manipulation of the output CFG.
    In our case we will use it for navigation between IR and CFG views.

    This adds an argument to DOTFuncInfo - a function that takes a
    BasicBlock and returns a node ID, to be printed in the result dot.

commit e88a83acde69b2fc395474c905b9a17c22f61c05
Author: Alexandros Lamprineas <alexandros.lamprineas at arm.com>
Date:   Tue Dec 2 13:46:39 2025 +0000

    [GlobalOpt][FMV] Perform expensive checks when NumVersions < Threshold (#168054)

    Extends the static resolution algorithm to handle cases where we can
    infer additional information on why a prior caller version of higher
    priority was skipped, based on the features of the current caller
    version.

    For example let's say the current caller is aes+sve2 and a previous
    caller was mops+sve2. Knowing that sve2 is available we could deduce
    that mops is unavailable. This would allow us to skip callee versions
    which depend on mops.

    This comes at the expense of performing more checks. However, we can
    control the threshold (number of versions) which decides whether the
    expensive checks will be performed or not.

commit b341885126ec0c63e45dfae96df78bb4902c6f35
Author: Igor Wodiany <igor.wodiany at imgtec.com>
Date:   Tue Dec 2 13:42:18 2025 +0000

    [mlir][spirv] (De)serialize Coherent decoration (#170280)

commit e74b425ddcac22ccc4d0bd5d65f95ffc2682b62f
Author: Nathan Gauër <brioche at google.com>
Date:   Tue Dec 2 14:19:00 2025 +0100

    [HLSL][SPIR-V] Add support for SV_Target semantic (#168743)

    This PR adds support for the SV_Target semantic and improves the
    diagnostics when the stage is correct but the direction is disallowed.

    This PR will require #168735 to be merged first.

commit c26fa8bfb9f1293a9ab3b7400a193676738aa486
Author: Aaron Ballman <aaron at aaronballman.com>
Date:   Tue Dec 2 08:16:42 2025 -0500

    Fix docs build

    This amends 7e2411cd2b39625443bcf59be20e6636ba31ae8d

commit ac23264c0327a055bd439fd12264461f5b7d16b9
Author: Felipe de Azevedo Piovezan <fpiovezan at apple.com>
Date:   Tue Dec 2 13:11:02 2025 +0000

    Revert "[LLDB] Update Shell lit config to handle c8031c3dd743" (#170288)

    Reverts llvm/llvm-project#170225

    See failures in
    https://ci.swift.org/view/all/job/llvm.org/job/as-lldb-cmake/36912/

    ```
    [2025-12-02T01:20:37.083Z] # .---command stderr------------
    [2025-12-02T01:20:37.083Z] # | clang: warning: no such sysroot directory: 'b/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk' [-Wmissing-sysroot]
    [2025-12-02T01:20:37.083Z] # | clang: warning: argument unused during compilation: '-fmodules-cache-path=/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/lldb-build/lldb-test-build.noindex/module-cache-clang/lldb-shell' [-Wunused-command-line-argument]
    [2025-12-02T01:20:37.083Z] # | /Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/lldb-build/tools/lldb/test/Shell/Settings/Output/TestFrameFormatFunctionPrefix.test.tmp/main.m:2:13: warning: non-void function does not return a value [-Wreturn-type]
    [2025-12-02T01:20:37.083Z] # |     2 | int func() {}
    [2025-12-02T01:20:37.083Z] # |       |             ^
    [2025-12-02T01:20:37.083Z] # | /Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake/lldb-build/tools/lldb/test/Shell/Settings/Output/TestFrameFormatFunctionPrefix.test.tmp/main.m:3:21: warning: non-void function does not return a value [-Wreturn-type]
    [2025-12-02T01:20:37.083Z] # |     3 | int bar() { func(); }
    [2025-12-02T01:20:37.083Z] # |       |                     ^
    [2025-12-02T01:20:37.083Z] # | 2 warnings generated.
    [2025-12-02T01:20:37.083Z] # | ld: library 'System' not found
    [2025-12-02T01:20:37.083Z] # | clang: error: linker command failed with exit code 1 (use -v to see invocation)
    [2025-12-02T01:20:37.083Z] # `-----------------------------
    [2025-12-02T01:20:37.083Z] # error: command failed with exit status: 1
    ```

commit 7e2411cd2b39625443bcf59be20e6636ba31ae8d
Author: Miko <110693261+mikomikotaishi at users.noreply.github.com>
Date:   Tue Dec 2 13:04:42 2025 +0000

    [clang][docs] Add link to C++ modules Wikipedia page to docs (#169200)

    This PR adds a link to the "[Modules
    (C++)](https://en.wikipedia.org/wiki/Modules_(C++))" page on Wikipedia
    and a similar one on cppreference, as recommended by another
    contributor.

commit 42898499310b7b105db90701b579dc1cb8c23e8b
Author: Akshiitaa06 <akshitasaxena206 at gmail.com>
Date:   Tue Dec 2 18:27:53 2025 +0530

    Improve formatting in BAT.md (#170254)

    Make "Header" a subheading to improve readability in the Functions table
    section.

commit ea3fdc5972db7f2d459e543307af05c357f2be26
Author: Lewis Crawford <lcrawford at nvidia.com>
Date:   Tue Dec 2 12:43:03 2025 +0000

    Avoid maxnum(sNaN, x) optimizations / folds (#170181)

    The behaviour of constant-folding `maxnum(sNaN, x)` and `minnum(sNaN,
    x)` has become controversial, and there are ongoing discussions about
    which behaviour we want to specify in the LLVM IR LangRef.

    See:
      - https://github.com/llvm/llvm-project/issues/170082
      - https://github.com/llvm/llvm-project/pull/168838
      - https://github.com/llvm/llvm-project/pull/138451
      - https://github.com/llvm/llvm-project/pull/170067
    -
    https://discourse.llvm.org/t/rfc-a-consistent-set-of-semantics-for-the-floating-point-minimum-and-maximum-operations/89006

    This patch removes optimizations and constant-folding support for
    `maxnum(sNaN, x)` but keeps it folded/optimized for `qNaN`. This should
    allow for some more flexibility so the implementation can conform to
    either the old or new version of the semantics specified without any
    changes.

    As far as I am aware, optimizations involving constant `sNaN` should
    generally be edge-cases that rarely occur, so here should hopefully be
    very little real-world performance impact from disabling these
    optimizations.

commit f5dd2dc7129ad070832db4c2f1b1d5ec6ad87f04
Author: Anton Sidorenko <anton.sidorenko at mail.com>
Date:   Tue Dec 2 15:41:31 2025 +0300

    [cmake] Fix semicolon expansion when passing LLVM_TABLEGEN_FLAGS (#169518)

    This patch uses the common workaround for CMake's expansion of semicolons
    to spaces.

commit 0e6d6127d43d6589408d5eed9b73c40238b6e741
Author: Graham Hunter <graham.hunter at arm.com>
Date:   Tue Dec 2 12:29:49 2025 +0000

    [AArch64] Improve select dagcombine (#169925)

    An AnyOf reduction (aka vector.reduce.or) with a fixed-width vector is
    canonicalized to a bitcast of the mask vector to an integer of the same
    overall size, which is then compared against zero.

    If the scalar result of the bitcast is smaller than the element size of
    vectors being selected, we often end up with suboptimal codegen. This
    fixes the main cases, removing scalarized code.

commit 153c7e47d6d160b1e158018b7d016aa3b227b9ed
Author: Valery Mironov <valera.mironow at gmail.com>
Date:   Tue Dec 2 13:29:27 2025 +0100

    [libc++] Use private CMake flags to enable the pragma system_header macro when building (#138826)

    That property doesn't need to be propagated beyond the translation units
    of the libc++ built library itself.

commit 7bced745766930e795e4e588366d84fe456311b3
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Tue Dec 2 12:05:59 2025 +0000

    [X86] combine-icmp.ll - fix copy+paste typo in concat_icmp_v64i8_v16i8 test (#170281)

    I changed the condcode for variety but failed to update the constant to prevent constant folding

commit 9ba5fa2e7199a558154dd4f8955dbee52c63da17
Author: Ryotaro Kasuga <kasuga.ryotaro at fujitsu.com>
Date:   Tue Dec 2 20:49:21 2025 +0900

    [Delinearization] Add test for inferred array size exceeds integer range (NFC) (#169048)

    Add test cases where the delinearized arrays may not satisfy the
    following "common" property:

    `&A[I_1][I_2]...[I_n] == &A[J_1][J_2]...[J_n]` iff
    `(I_1, I_2, ..., I_n) == (J_1, J_2, ..., J_n)`

    The root cause of this issue is that the inferred array size is too
    large and the offset calculation overflows.
    Such results should be discarded during validation. This will be fixed
    by #169902 .

commit 3098bfe7d98bca214c78303ac7869083c4da517d
Author: daniilavdeev <daniilavdeev237 at gmail.com>
Date:   Tue Dec 2 14:47:17 2025 +0300

    [llvm][Docs] Add release notes about dwarf fission with relaxations (#169871)

commit f7418517316f1b3d66733c5a607c785d15099fab
Author: David Green <david.green at arm.com>
Date:   Tue Dec 2 11:46:50 2025 +0000

    Revert "[AArch64][ARM] Move ARM-specific InstCombine transforms into `Transforms/Utils` (#169589)"

    This reverts commit 1c32b6f51ccaaf9c65be11d7dca9e5a476cddb5a due to failures on
    BUILD_SHARED_LIBS builds.

commit e8bf01108589be73f4057fe285cd7e04b4143f4a
Author: Tibor Győri <tibor.gyori at chem.u-szeged.hu>
Date:   Tue Dec 2 12:46:41 2025 +0100

    [LV] Emit better debug and opt-report messages when vectorization is disallowed in the LoopVectorizer (#158513)

    While looking into fixing #158499, I found some other cases where the
    messages emitted could be improved. This PR improves both the messages
    printed to the debug output and the missed-optimization messages in
    cases where:

    - loop vectorization is explicitly disabled
    - loop vectorization is implicitly disabled by disabling all loop
    transformations
    - loop vectorization is set to happen only where explicitly enabled

    A branch that should currently be unreachable is also added. If the
    related logic ever breaks (e.g. due to changes to getForce() or the
    ForceKind enum), this should alert devs and users. New test cases are
    also added to verify that the correct messages (and only those) are
    emitted.

    ---------

    Co-authored-by: GYT <tiborgyri at gmail.com>
    Co-authored-by: Florian Hahn <flo at fhahn.com>

commit 4b6ad1187633c55087e00ab90567260ae6aafd0d
Author: Florian Hahn <flo at fhahn.com>
Date:   Tue Dec 2 11:43:37 2025 +0000

    [VPlan] Sink predicated stores with complementary masks. (#168771)

    Extend the logic to hoist predicated loads
    (https://github.com/llvm/llvm-project/pull/168373) to sink predicated
    stores with complementary masks in a similar fashion.

    The patch refactors some of the existing logic for legality checks to be
    shared between hoisting and sinking, and adds a new sinking transform on
    top.

    With respect to the legality checks, for sinking stores the code also
    checks whether there are any stores that may alias, not only loads.

    PR: https://github.com/llvm/llvm-project/pull/168771

commit 753f47d6a5043b32f6eebf467cca26f5e1a0611a
Author: ArnavM3434 <84486711+ArnavM3434 at users.noreply.github.com>
Date:   Tue Dec 2 06:40:01 2025 -0500

    [X86] Make VBMI2 funnel shifts use VSHLD/VSHRD for const splats (#169401)

    Make ISD::FSHL/FSHR legal on VBMI2 vector targets and convert to VSHLD/VSHRD in a combine

    closes #166949

commit 458035027c80e984bf5862140aed20d5e50dd22a
Author: Martin Storsjö <martin at martin.st>
Date:   Fri Nov 28 14:13:08 2025 +0200

    [AArch64] [test] Make unwind info tests actually use the right instructions

    This makes them match the expected decoding of the unwind info
    opcodes, avoiding mismatch indications from "dumpbin -unwindinfo".

commit 4a619a7d082473ca2373824ac9c7f2ea61011a73
Author: Martin Storsjö <martin at martin.st>
Date:   Fri Nov 28 14:04:44 2025 +0200

    [AArch64] [test] Spell out the matching instructions for SVE unwind opcodes

    The MS dumpbin.exe tool can dump the unwind opcodes with the
    "-unwindinfo" option; this mode also checks that the instructions
    actually match the expected ones here. (This mode doesn't seem
    to fully work for all instructions, but spell out all the
    intended instructions anyway.)

commit 9e27fefc180da46bed8731cb19f542f1bd6f2ff4
Author: Martin Storsjö <martin at martin.st>
Date:   Fri Nov 28 14:03:40 2025 +0200

    [AArch64] [test] Fix stack allocation instructions in the seh.s test

    The actual unwind opcodes only store stack increments in units
    of 16 (which is how they are listed in the unwind opcode
    dumping by llvm-readobj); actually write what we intend to encode.

commit e50ac8a9b1bca90b5d3fea7a20b9985870a08df4
Author: Martin Storsjö <martin at martin.st>
Date:   Fri Nov 28 14:00:36 2025 +0200

    [AArch64] [test] Move tests for custom unwind opcodes to a separate function

    These custom opcodes disable the checker for having the prologue
    length actually match the opcodes (see checkARM64Instructions in
    MCWin64EH.cpp) - which led to the prologue mismatching the opcodes
    by one instruction, since 312d6b488ef9d7c0e4d649827820db7285e36406.

    Move the special opcodes to a separate test function.

    Remove the mismatched nop instruction at the end of the main
    function, as this prologue is now assembled with the strict length
    checking enabled.

commit 3e5b86cec112f5f5639c71bd54e7ca7862cf58bb
Author: Martin Storsjö <martin at martin.st>
Date:   Fri Nov 28 14:06:35 2025 +0200

    [AArch64] [test] Write the seh.s test output object to a file

    This is what is done in other tests; this makes it easier to
    inspect the output of this test manually.

commit aaa37afbc283aef885afc779dcb1539a3b3775e6
Author: Paul Walker <paul.walker at arm.com>
Date:   Tue Dec 2 11:31:52 2025 +0000

    [LLVM][CodeGen][SVE] Add lowering for ISD::[ANY,SIGN,ZERO]_EXTEND_VECTOR_INREG. (#169847)

commit c12dd598e2f76864b2ea4bcc8616334872d1c112
Author: Arseniy Zaostrovnykh <necto.ne at gmail.com>
Date:   Tue Dec 2 12:30:31 2025 +0100

    [NFC][analyzer] Constify AnalysisConsumer::getModeForDecl (#170275)

    In my previous commit I forgot that `this` argument of
    AnalysisConsumer::getModeForDecl() is also never modified.
    Here is the missing trailing const.

commit 4a0b5bc2b5ddacc12260761a0ea18fdf2442e412
Author: Martin Storsjö <martin at martin.st>
Date:   Tue Dec 2 13:29:51 2025 +0200

    [MC] [Win64EH] Produce packed unwind for the special case of X19+LR (#169697)

commit 1c32b6f51ccaaf9c65be11d7dca9e5a476cddb5a
Author: valadaptive <79560998+valadaptive at users.noreply.github.com>
Date:   Tue Dec 2 06:17:12 2025 -0500

    [AArch64][ARM] Move ARM-specific InstCombine transforms into `Transforms/Utils` (#169589)

    Back when `TargetTransformInfo::instCombineIntrinsic` was added in
    https://reviews.llvm.org/D81728, several transforms common to both ARM
    and AArch64 were kept in the non-target-specific `InstCombineCalls.cpp`
    so they could be shared between the two targets.

    I want to extend the transform of the `tbl` intrinsics into static
    `shufflevector`s in a similar manner to
    https://github.com/llvm/llvm-project/pull/169110 (right now it only
    works with a 64-bit `tbl1`, but `shufflevector` should allow it to work
    with up to 2 operands, and it can definitely work with 128-bit vectors).
    I think separating out the transform into a TTI hook is a prerequisite.

    ~~I'm not happy about creating an entirely new module for this and
    having to wire it up through CMake and everything, but I'm not sure
    about the alternatives. If any maintainers can think of a cleaner way of
    doing this, I'm very open to it.~~

    I've moved the transforms into
    `Transforms/Utils/ARMCommonInstCombineIntrinsic.cpp`, which is a lot
    simpler.

commit 2f86bc207af7a5bd766d4d224c30851d5ced5999
Author: David Spickett <david.spickett at linaro.org>
Date:   Tue Dec 2 11:05:55 2025 +0000

    [clang] Only build c-index-test and apinotes-test when clang tests are included (#151157)

    Those programs are only used for testing, and the tests that use them
    are already guarded by CLANG_INCLUDE_TESTS in clang/CMakeLists.txt.

    This change enables us to do builds with
    LLVM_INSTALL_TOOLCHAIN_ONLY=OFF, and CLANG_INCLUDE_TESTS=OFF, which
    contain the required files to build other bits of llvm-project
    standalone, but do not include those unnecessary testing programs.

commit d20d84fec5945fcc16aa6f63879e1458d4af9ea6
Author: David Spickett <david.spickett at linaro.org>
Date:   Tue Dec 2 10:58:22 2025 +0000

    [lldb] Make sure SBError is valid when SBDebugger::InitializeWithErrorHandling succeeds (#170156)

    Fixes #169788

    When this function fails to initialise the debugger, it sets the SBError
    using the returned error from the initialise function. This results in
    Success being false and IsValid being true. This is expected behaviour.

    When it does not fail to initialise, it was returning the default
    constructed SBError which has Success() == true but IsValid == false.
    IsValid should be true, to show that the success can be trusted.

    To fix this, construct the SBError using a default constructed Status,
    which results in Success and IsValid being true.
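
    Illustrative usage only, assuming the behaviour described above; with
    this fix both checks pass after a successful initialisation.

    ```cpp
    #include "lldb/API/SBDebugger.h"
    #include "lldb/API/SBError.h"

    int main() {
      lldb::SBError error = lldb::SBDebugger::InitializeWithErrorHandling();
      // On success, the returned SBError is now both valid and successful.
      bool ok = error.IsValid() && error.Success();
      lldb::SBDebugger::Terminate();
      return ok ? 0 : 1;
    }
    ```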

commit f01e8ac0041da783608029825cb8931f0b9f5b9f
Author: Corentin Jabot <corentinjabot at gmail.com>
Date:   Tue Dec 2 11:55:43 2025 +0100

    [Clang] Fix handling of zero-length arrays in sfinae context. (#170144)

    We were producing a diagnostic for zero-length arrays in Sfinae context,
    without invalidating the overload.

    This causes the diagnostic to be emitted
    if and when that undiagnosed overload is selected.

    Fixes #170040

commit 437fa02c074221ddc635caf6261e056ce44f5178
Author: David Green <david.green at arm.com>
Date:   Tue Dec 2 10:46:39 2025 +0000

    [ARM] Add tests for over-sized mulh. NFC

    The double-sized v8i32 do OK, but the larger v16i32 do not currently get
    converted to umulh.

commit bbbc681463316425e3e511a030a2f932e5999bef
Author: Dan Blackwell <dan_blackwell at apple.com>
Date:   Tue Dec 2 10:34:04 2025 +0000

    [AArch64] Force dwarf unwind for MTE-tagged stack frames (#168530)

    Currently, on Darwin, running with -fsanitize=memtag-stack generates
    compact-unwind exception unwinding that does not untag MTE-tagged memory
    on the way back up.

    This patch forces dwarf unwinding on MTE-tagged frames.

    rdar://162195539

commit 535f604dabfb6563dab2a2478fb665699523fd0a
Author: Martin Storsjö <martin at martin.st>
Date:   Tue Dec 2 12:11:13 2025 +0200

    [MC] [Win64EH] Clarify the comment about a skipped case of packed unwind info (#169784)

    Clarify the comment from 924defada9bc0e3c89b0c0e288d7cb4dd654e7d4. There
    is no longer any ambiguity about this case; newer versions of Windows
    correctly match the documentation, making it clear that the older
    versions were incorrect. Mention specific versions that have and don't
    have the inconsistency.

    Even if we wouldn't care about the older versions of Windows, we can't
    enable this case of unwind info packing, unless the implementation also
    is changed to match for asymmetrical prologs/epilogs.

commit 885509b1a2c08071c6b14eba84a2d80741cc9520
Author: Martin Storsjö <martin at martin.st>
Date:   Tue Dec 2 12:07:49 2025 +0200

    [llvm-readobj] [ARMWinEH] Fix the interpretation of packed unwind CR=01 RegI=1 (#169676)

    Even though the table for how to expand packed unwind info at [1]
    doesn't explicitly say this, this case is mentioned at [2] under the
    case "Only x19 saved":

        sub    sp,sp,#16                // reg save area allocation*
        stp    x19,lr,[sp]              // save x19, lr
        sub    sp,sp,#(framesz-16)      // allocate the remaining local area

    This was discussed and clarified at [3].

    [1]
    https://learn.microsoft.com/en-us/cpp/build/arm64-exception-handling?view=msvc-170#packed-unwind-data
    [2]
    https://learn.microsoft.com/en-us/cpp/build/arm64-exception-handling?view=msvc-170#arm64-stack-frame-layout
    [3]
    https://github.com/llvm/llvm-project/issues/169588#issuecomment-3581688753

commit 96c69b7393be845dce997ead88a4cfd3ea0f8944
Author: Marco Elver <elver at google.com>
Date:   Tue Dec 2 10:50:39 2025 +0100

    [LTO][AllocToken] Support AllocToken instrumentation in backend (#169358)

    Unconditionally add AllocTokenPass to the optimization pipelines, and
    ensure that it runs last in LTO backend pipelines. The latter ensures
    that AllocToken instrumentation can be moved later in the LTO pipeline
    to avoid interference with other optimizations (e.g. PGHO) and enable
    late heap-allocation optimizations.

    In preparation for no longer having Clang add the AllocTokenPass, add
    support for AllocTokenPass to read configuration options from LLVM
    module flags.
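
    A generic sketch of reading an integer option from a module flag; the
    helper and flag name below are hypothetical, not AllocToken's actual
    options.

    ```cpp
    #include "llvm/IR/Constants.h"
    #include "llvm/IR/Metadata.h"
    #include "llvm/IR/Module.h"
    using namespace llvm;

    // Hypothetical helper: return an integer module flag, or a default value
    // if the flag is absent.
    static uint64_t getIntModuleFlagOrDefault(const Module &M, StringRef Name,
                                              uint64_t Default) {
      if (auto *CI = mdconst::extract_or_null<ConstantInt>(M.getModuleFlag(Name)))
        return CI->getZExtValue();
      return Default;
    }
    ```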

    To optimize, given the pass now runs unconditionally, only retrieve
    TargetLibraryInfo and OptimizationRemarkEmitter when necessary.

commit 0dec52b2c3f8d9b94c7ca8aadc0da7a1ed5055d7
Author: Sven van Haastregt <sven.vanhaastregt at arm.com>
Date:   Tue Dec 2 10:24:11 2025 +0100

    Fix NDEBUG Wundef warning; NFC (#170153)

    The `NDEBUG` macro is tested for defined-ness everywhere else. The
    instance here triggers a warning when compiling with `-Wundef`.

commit b17e644eedf31e498350b61855a3ac19b9c11d2c
Author: Jean-Didier PAILLEUX <jean-didier.pailleux at sipearl.com>
Date:   Tue Dec 2 10:19:11 2025 +0100

    [flang/flang-rt] Adding support of RAND, IRAND and SRAND intrinsics (#166780)

    This PR adds support of
    [RAND](https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gfortran/RAND.html),
    [IRAND](https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gfortran/IRAND.html)
    and
    [SRAND](https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gfortran/SRAND.html)
    intrinsics in Flang, which are part of the GNU extension.
    These intrinsics are used in the following benchmark:
    [floatingspeed](https://github.com/ahbarnett/floatingspeed/)

commit fa2ddf24e1d538836438c51fcbaa1eabff31bfa2
Author: Diana Picus <Diana-Magda.Picus at amd.com>
Date:   Tue Dec 2 10:16:53 2025 +0100

    [AMDGPU] Fixup 30219f0f4300 (#170266)

commit 9107d097227e8b10ad3aebdd109539ea13ddb170
Author: Benjamin Maxwell <benjamin.maxwell at arm.com>
Date:   Tue Dec 2 09:16:30 2025 +0000

    [AArch64][SME] Avoid clobbering X0 in the MachineSMEABIPass (#170131)

    This tweaks `findStateChangeInsertionPoint` to also avoid clobbering X0,
    which should be possible in most cases (since X0's live ranges are
    likely to be very short before register allocation).

    This improves codegen in a few cases, as not all redundant copies
    to/from X0 were otherwise being eliminated.

commit a09571ed5be3054b546b714c62c078b595d2f1cd
Author: jeanPerier <jperier at nvidia.com>
Date:   Tue Dec 2 10:13:23 2025 +0100

    [flang] represent ABSTRACT in fir.type_info (#170109)

    This patch keeps information about ABSTRACT derived types and DEFERRED
    type bound procedures inside fir.type_info dispatch tables.

    This is part of the effort to delay generation of runtime type info
    globals by keeping the type information in a more condensed fashion inside
    fir.type_info (which is also easier to use for any potential
    optimizations).

commit 96056669493dfd3db653790579b0dbeba0cdaa5f
Author: Robert Imschweiler <robert.imschweiler at amd.com>
Date:   Tue Dec 2 09:54:36 2025 +0100

    Fix Windows OpenMP build (#170142)

    fixes Windows build issue in
    https://github.com/llvm/llvm-project/pull/168554

commit 04dd71cb0b5f1a59e68d6aa843e0a50e0d725af7
Author: Henrich Lauko <xlauko at mail.muni.cz>
Date:   Tue Dec 2 09:53:50 2025 +0100

    [CIR] Align inline-kind FuncOp attribute with incubator (#170050)

    Switches to more efficient explicit enum property instead of a wrapped
    storage, simplifying the string representation. The attribute is now
    placed before the symbol name for consistency with other FuncOp
    attributes. FileCheck patterns are also simplified to match only the
    attributes under test.

commit b76815218ad9e3754f00d656f59aea9badc037e7
Author: Cullen Rhodes <cullen.rhodes at arm.com>
Date:   Tue Dec 2 08:45:33 2025 +0000

    Revert "[Attributor] Support nested conditional branches" (#170257)

    Reverts llvm/llvm-project#168532

    Causing a crash in the flang-rt that needs to be investigated, see
    #170211.

commit 30219f0f4300b767ece5ea07609a707bf62c7300
Author: Diana Picus <Diana-Magda.Picus at amd.com>
Date:   Tue Dec 2 09:44:35 2025 +0100

    [AMDGPU] Allow any SGPRs for chain callees (#168345)

    Chain calls never return and don't need to preserve any SGPRs.
    Therefore, we don't need to limit the registers used for callees to the
    CCR_SGPR_64 register class - it's fine to use any available SGPRs.

    Also introduce a new pseudo, SI_TCRETURN_CHAIN, which also has a plain
    SGPR_64 operand. This is necessary because we won't be able to lower
    SI_CS_CHAIN_TC to SI_TCRETURN anymore, since its operand accepts a wider
    range of registers than the latter.

commit 34c699246d9d2ad0e09306d4faed6e8d7ec87aa5
Author: Vladi Krapp <vladi.krapp at arm.com>
Date:   Tue Dec 2 08:39:26 2025 +0000

    [Arm] Control forced unrolling of small loops (#170127)

    * Add flag to control cost threshold for forced unrolling of loops.
      Existing value preserved as default.

commit 2024d6732b9ab0ad3077a5ddc80b647cd5aa138e
Author: Sameer Sahasrabuddhe <sameer.sahasrabuddhe at amd.com>
Date:   Tue Dec 2 12:11:42 2025 +0530

    [NFC][AMDGPU] modify lit test to use update_llc_test_checks

commit 87d37956b33d4b582e6ff7a0ed4707e70bef361d
Author: Fangrui Song <i at maskray.me>
Date:   Tue Dec 2 00:18:09 2025 -0800

    [lld-macho] Remove cuIndices indirection in UnwindInfoSection. NFC (#170252)

    cuEntries was sorted indirectly through a separate `cuIndices`.
    Eliminate cuIndices for simplicity.

    Linking chromium_framework from `#48001` with `-no_uuid` gives identical
    executable using this patch.

commit b5f7058e9114bffccadba520eca9d83891782cde
Author: Nathan Corbyn <n_corbyn at apple.com>
Date:   Tue Dec 2 07:57:47 2025 +0000

    [AArch64][GlobalISel] Don't crash when legalising vector G_SHL (#168848)

commit c4def4631c2c826046c2f496b51143a43109124f
Author: Vasily Leonenko <vleonen at users.noreply.github.com>
Date:   Tue Dec 2 10:41:12 2025 +0300

    [BOLT] Allow missing DT_FINI{,_ARRAY} if instrumentation-sleep-time is used (#170086)

    This PR allows instrumenting binaries without the .fini and .fini_array
    entries if the user passes the `instrumentation-sleep-time` option.
    The `.fini` or `.fini_array` entries are used to hook into process
    finalization and write a profile at that point. However,
    with the `instrumentation-sleep-time` option, the profile should be
    written periodically, without the need for it to be written at
    finalization.

    Co-authored-by: Vasily Leonenko <vasily.leonenko at huawei.com>

commit 8c7c940585c1eb5e9cc1a00c42691051f183863d
Author: Durgadoss R <durgadossr at nvidia.com>
Date:   Tue Dec 2 12:32:36 2025 +0530

    [MLIR][NVVM] Update mbarrier.test.wait Op (#169698)

    This patch extends mbarrier.test_wait to support scope,
    semantics, and phase-parity, completing the updates for
    this Op up to Blackwell. Corresponding lit tests are added
    to verify the lowering.

    Signed-off-by: Durgadoss R <durgadossr at nvidia.com>

commit 4522448cd16489b84c80ec39ae06960b01fd3b59
Author: Nikita Popov <npopov at redhat.com>
Date:   Tue Dec 2 07:04:03 2025 +0100

    [PowerPC][MC] Diagnose out of range branch fixups (#165859)

    Currently, out-of-range branch fixups will be silently miscompiled. GNU
    as will instead print an "operand out of range" error.

    Check that the branch target fixups fit into the proper range and have
    low zero bits. The implementation is inspired by SystemZ:
    https://github.com/llvm/llvm-project/blob/0ed8e66f88b689c152245d6b968a06fa8e67b51f/llvm/lib/Target/SystemZ/MCTargetDesc/SystemZMCAsmBackend.cpp#L31

    (My actual interest here is not assembly usage, but LLVM
    failing to use the correct branch kind for reasons I've not tracked down
    yet. Currently this just silently truncates the branch target instead of
    producing an error.)
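
    A minimal sketch of the kind of check described (illustrative, not the
    actual backend code): the fixup value must have the required low zero
    bits and fit in the signed width of the branch field.

    ```cpp
    #include <cstdint>

    // Returns true if Value fits a signed NumBits-wide branch displacement
    // whose low ZeroBits bits must be zero (i.e. an aligned target).
    static bool isValidBranchFixup(int64_t Value, unsigned NumBits,
                                   unsigned ZeroBits) {
      if (Value & ((int64_t(1) << ZeroBits) - 1))
        return false;                          // misaligned target
      int64_t Min = -(int64_t(1) << (NumBits - 1));
      int64_t Max = (int64_t(1) << (NumBits - 1)) - 1;
      return Value >= Min && Value <= Max;     // in range for the field
    }
    ```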

commit cb6e1967c4f5239170b7c088ba6f86910ae66a63
Author: Jason Rice <ricejasonf at gmail.com>
Date:   Mon Dec 1 21:27:23 2025 -0800

    [MLIR] Forward generated OpTy::create arguments (#170012)

    The recent changes in the MLIR TableGen interface for generated
    OpTy::build functions involve a new OpTy::create function that is
    generated to pass arguments without forwarding. This is problematic for
    arguments that are move-only, such as `std::unique_ptr`. My particular
    use case involves `std::unique_ptr<mlir::Region>` which is desirable as
    the `mlir::OperationState` object accepts calls to
    `addRegion(std::unique_ptr<mlir::Region>`.

    In Discord, the use of `extraClassDeclarations` was suggested which I
    may go with regardless since I still have to define the builder function
    anyways, but perhaps you would consider this trivial change as it
    supports a broader class of argument types for this approach.

    Consider the declaration in TableGen:
    ```
      let builders = [
        OpBuilder<(ins "::mlir::Value":$cdr,
                       "::mlir::ValueRange":$packs,
                       "std::unique_ptr<::mlir::Region>":$body)>
      ];
    ```

    Which currently generates:
    ```cpp
     ExpandPacksOp ExpandPacksOp::create(::mlir::OpBuilder &builder, ::mlir::Location location, ::mlir::Value cdr, ::mlir::ValueRange packs, std::unique_ptr<::mlir::Region> body) {
      ::mlir::OperationState __state__(location, getOperationName());
      build(builder, __state__, std::forward<decltype(cdr)>(cdr), std::forward<decltype(packs)>(packs), std::forward<decltype(body)>(body));
      auto __res__ = ::llvm::dyn_cast<ExpandPacksOp>(builder.create(__state__));
      assert(__res__ && "builder didn't return the right type");
      return __res__;
    }

    ```
    With this change it will generate:
    ```cpp
    ExpandPacksOp ExpandPacksOp::create(::mlir::OpBuilder &builder, ::mlir::Location location, ::mlir::Value cdr, ::mlir::ValueRange packs, std::unique_ptr<::mlir::Region>&&body) {
      ::mlir::OperationState __state__(location, getOperationName());
      build(builder, __state__, static_cast<decltype(cdr)>(cdr), std::forward<decltype(packs)>(packs), std::forward<decltype(body)>(body));
      auto __res__ = ::llvm::dyn_cast<ExpandPacksOp>(builder.create(__state__));
      assert(__res__ && "builder didn't return the right type");
      return __res__;
    }
    ```

    Another option could be to make this function a template but then it
    would not be hidden in the generated translation unit. I don't know if
    that was the original intent. Thank you for your consideration.

commit f3501d70d8e4ae8640ad01663fc27d64af31d4aa
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Tue Dec 2 04:55:39 2025 +0000

    [XRay] Mark default-options.cpp unsupported on ppc

    This test fails now that it actually runs:

    ld.lld: error: undefined symbol: std::__throw_system_error(int)

commit ff3d550d7ec4ec36750b730afb993cdf061b01f7
Author: Timm Baeder <tbaeder at redhat.com>
Date:   Tue Dec 2 05:40:15 2025 +0100

    [clang][bytecode][NFC] Add popToUInt64() to builtin evaluation (#170164)

    We often don't need the APSInt at all, so add a version that pops the
    integral from the stack and just static_casts to uint64_t.

commit 867d353cff54d4257afcd254196a75f9d057743e
Author: Kareem Ergawy <kareem.ergawy at amd.com>
Date:   Tue Dec 2 05:20:10 2025 +0100

    [OpenMP][flang] Support GPU team-reductions on allocatables (#169651)

    Extends the work started in #165714 by supporting team reductions.
    Similar to what was done in #165714, this PR introduces proper
    allocations, loads, and stores for by-ref reductions in teams-related
    callbacks:
    * `_omp_reduction_list_to_global_copy_func`,
    * `_omp_reduction_list_to_global_reduce_func`,
    * `_omp_reduction_global_to_list_copy_func`, and
    * `_omp_reduction_global_to_list_reduce_func`.

commit 728cada359199d2952bdae43d0726fc5a208df6e
Author: cmtice <cmtice at google.com>
Date:   Mon Dec 1 20:08:19 2025 -0800

    [LLDB] Add type casting to DIL, part 1 of 3. (#165199)

    This is an alternative to
    https://github.com/llvm/llvm-project/pull/159500, breaking that PR down
    into three separate PRs, to make it easier to review.

    This first PR of the three adds the basic framework for doing type
    casting to the DIL code, but it does not actually do any casting: in this
    PR the DIL parser only recognizes builtin type names, and the DIL
    interpreter does not do anything except return the original operand (no
    casting). The second and third PRs will add most of the type parsing,
    and do the actual type casting, respectively.

commit fbdf8ab59005bc35f23b3167e0783013c7ee5fa4
Author: Anshil Gandhi <95053726+gandhi56 at users.noreply.github.com>
Date:   Mon Dec 1 23:05:17 2025 -0500

    [LSV] Merge contiguous chains across scalar types (#154069)

    This change enables the LoadStoreVectorizer to merge and vectorize
    contiguous chains even when their scalar element types differ, as long
    as the total bitwidth matches. To do so, we rebase offsets between
    chains, normalize value types to a common integer type, and insert the
    necessary casts around loads and stores. This uncovers more
    vectorization opportunities and explains the expected codegen updates
    across AMDGPU tests.

    Key changes:
    - Chain merging
      - Build contiguous subchains and then merge adjacent ones when:
    - They refer to the same underlying pointer object and address space.
        - They are either all loads or all stores.
        - A constant leader-to-leader delta exists.
    - Rebasing one chain into the other's coordinate space does not overlap.
        - All elements have equal total bit width.
    - Rebase the second chain by the computed delta and append it to the
    first.

    - Type normalization and casting
    - Normalize merged chains to a common integer type sized to the total
    bits.
    - For loads: create a new load of the normalized type, copy metadata,
    and cast back to the original type for uses if needed.
      - For stores: bitcast the value to the normalized type and store that.
    - Insert zext/trunc for integer size changes; use bit-or-pointer casts
    when sizes match.

    - Cleanups
      - Erase replaced instructions and DCE pointer operands when safe.
    - New helpers: computeLeaderDelta, chainsOverlapAfterRebase,
    rebaseChain, normalizeChainToType, and allElemsMatchTotalBits.

    Impact:
    - Increases vectorization opportunities across mixed-typed but
    size-compatible access chains.
    - Large set of expected AMDGPU codegen diffs due to more/changed
    vectorization.

    This PR resolves #97715.

commit 039f883f7c350d2c8bd5cf07a05d757890ddcfdf
Author: Longsheng Mou <longshengmou at gmail.com>
Date:   Tue Dec 2 11:49:01 2025 +0800

    [mlir][tensor] Fix bug in `ConcatOpInterface` (#168676)

    This PR fixes an issue in `ConcatOpInterface` where `tensor.concat`
    fails when the concat dimension is dynamic while the result type is
    static. The fix unifies the computation by using `OpFoldResult`,
    avoiding the need to separately handle dynamic and static dimension
    values. Fixes #162776.

commit 0a03b7e6569ae89d55c9703faedf8e2503bcc728
Author: Jianjian Guan <jacquesguan at me.com>
Date:   Tue Dec 2 10:31:50 2025 +0800

    [CIR] Upstream CIR codegen for vec_set x86 builtin (#169265)

    Support CIR codegen for x86 builtin vec_set.

commit 91531f320830e6cb5e9d48d011b5f9ac7e578fc7
Author: Mingjie Xu <xumingjie.enna1 at bytedance.com>
Date:   Tue Dec 2 10:08:50 2025 +0800

    [ThinLTO] Fix parsing null aliasee in alias summary (#169490)

    In
    https://github.com/llvm/llvm-project/commit/f8182f1aef5b6ec74cbe2c1618e759f0113921ba,
    we add support for printing "null" aliasee in AsmWriter, but missing
    support in LLParser.

commit 8dc6abbab383fe86508e8a1b4d429ed8150da06d
Author: lonely eagle <2020382038 at qq.com>
Date:   Tue Dec 2 10:01:01 2025 +0800

    [mlir][presburger] Implement moveColumns using std::rotate (#168243)

commit 28e200495e5b39b7599846c511e61723cde2f478
Author: Michael Liao <michael.hliao at gmail.com>
Date:   Mon Dec 1 20:54:41 2025 -0500

    [clang][CUDA] Clean up tests from device-side kernel call support. NFC

    - Remove unused 'CHECK' from the CUDASema test

commit da1a8876086fde210bace059bb5e9ea9886362f1
Author: Kelvin Li <kli at ca.ibm.com>
Date:   Mon Dec 1 20:53:44 2025 -0500

    [flang] Enable debug test on AIX (NFC) (#169945)

commit 744480a2a6b83f819b782ca0df11a25b23d9b245
Author: David Zbarsky <dzbarsky at gmail.com>
Date:   Mon Dec 1 20:45:21 2025 -0500

    [bazel] Rewrite overlay handling to starlark (#170000)

    Starlark is perfectly capable of doing what we need and this avoids the
    dependency on a host Python

commit 1f794e62b4467ac89ef093a7d8f061b0c4fc07ba
Author: Brandon Wu <brandon.wu at sifive.com>
Date:   Tue Dec 2 09:35:11 2025 +0800

    [NFC][RISCV] Correct fminimumnum test case (#170169)

    The test case mismatch was introduced in #135727

commit 755733e219a11a265e47cc1e4f63ad2dbb15f41e
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Mon Dec 1 17:23:43 2025 -0800

    [lldb/Target] Track containing StackFrameList to avoid circular dependencies (#170226)

    This change adds tracking of the StackFrameList that produced each frame
    by storing a weak pointer (m_frame_list_wp) in both `StackFrame` and
    `ExecutionContextRef`.

    When resolving frames through `ExecutionContextRef::GetFrameSP`, the
    code now first attempts to use the remembered frame list instead of
    immediately calling `Thread::GetStackFrameList`. This breaks circular
    dependencies that can occur during frame provider initialization, where
    creating a frame provider might trigger `ExecutionContext` resolution,
    which would then call back into `Thread::GetStackFrameList()`, creating
    an infinite loop.

    The `StackFrameList` now sets m_frame_list_wp on every frame it creates,
    and a new virtual method `GetOriginatingStackFrameList` allows frames to
    expose their originating list.

    Signed-off-by: Med Ismail Bennani <ismail at bennani.ma>

commit 326ee7af410a5bba12ea20c80c0ad16bb915e47f
Author: mitchell <mitchell.xu2 at gmail.com>
Date:   Tue Dec 2 09:16:17 2025 +0800

    [clang-tidy] Fix false positive in `misc-const-correctness` (#170065)

    Closes https://github.com/llvm/llvm-project/issues/131599 and
    https://github.com/llvm/llvm-project/issues/170033

commit 91e8780424c0e7c2f11f1adfc47915f975691d87
Author: Letu Ren <fantasquex at gmail.com>
Date:   Tue Dec 2 08:51:08 2025 +0800

    [CIR] Upstream Builtin FloorOp (#169954)

commit c6f501d479e82185a5906096b758480e43f2edcc
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Mon Dec 1 16:23:43 2025 -0800

    [XRay] Run tests inside bootstrapping build (#168378)

    COMPILER_RT_STANDALONE_BUILD is set when doing a bootstrapping build
    through LLVM_ENABLE_RUNTIMES with the CMake source directory being in
    llvm/. This patch changes the XRay tests to also detect that we have
    LLVM sources and the llvm-xray tool if we are in a bootstrapping build
    through the use of the LLVM_TREE_AVAILABLE variable which is set in
    runtimes/CMakeLists.txt.

commit e27dec5eed902e6c2e34afa1b593aa46699ce0a2
Author: Dan Liew <dan at su-root.co.uk>
Date:   Mon Dec 1 16:13:19 2025 -0800

    [BoundsSafety][LLDB] Implement instrumentation plugin for -fbounds-safety soft traps (#169117)

    This patch tries to upstream code landed downstream in
    https://github.com/swiftlang/llvm-project/pull/11835.

    This patch implements an instrumentation plugin for the
    `-fbounds-safety` soft trap mode first implemented in
    https://github.com/swiftlang/llvm-project/pull/11645 (rdar://158088757).
    That functionality isn't supported in upstream Clang yet; however, the
    instrumentation plugin can be compiled without issue, so this patch tries
    to upstream it. The included tests are all disabled when the clang used
    for testing doesn't support `-fbounds-safety`, which means the tests will
    be skipped. However, it's fairly easy to point LLDB at a clang that does
    support `-fbounds-safety`. I've done this and confirmed the tests pass.
    To use a custom clang the following can be done:

    * For API tests set the `LLDB_TEST_COMPILER` CMake cache variable to
      point to appropriate compiler.
    * For shell tests applying a patch like this can be used to set the
      appropriate compiler:

    ```
    --- a/lldb/test/Shell/helper/toolchain.py
    +++ b/lldb/test/Shell/helper/toolchain.py
    @@ -271,6 +271,7 @@ def use_support_substitutions(config):
         if config.lldb_lit_tools_dir:
             additional_tool_dirs.append(config.lldb_lit_tools_dir)

    +    config.environment['CLANG'] = '/path/to/clang'
         llvm_config.use_clang(
    ```

    The current implementation of -fbounds-safety soft traps works by
    emitting calls to runtime functions intended to log the occurrence of a
    soft trap.
    While the user could just set a breakpoint on these functions, the
    instrumentation plugin sets it automatically and provides several
    additional features:

    When debug info is available:

    * It adjusts the stop reason to be the reason for trapping. This is
      extracted from the artificial frame in the debug info (similar to
      -fbounds-safety hard traps).
    * It adjusts the selected frame to be the frame where the soft trap
      occurred.

    When debug info is not available:

    * For the `call-with-str` soft trap mode the soft trap reason is
      read from the first argument register.
    * For the `call-minimal` soft trap mode the stop reason is adjusted
      to note it's a bounds check failure, but no further information is
      given because none is available.
    * In this situation the selected frame is not adjusted because in
      this mode the user will be looking at assembly and adjusting the
      frame makes things confusing.

    This patch includes shell and api tests. The shell tests seemed like the
    best way to test behavior when debug info is missing because those tests
    make it easy to disable building with debug info completely.

    rdar://163230807

commit 677e1d078eacdeff10c7a69e4b83f88cceffead4
Author: Jasmine Tang <jjasmine at igalia.com>
Date:   Tue Dec 2 00:10:33 2025 +0000

    [CIR] Upstream gather intrinsics (#169157)

commit be5db3386dbab435a5b44118b653c7785ad34168
Author: Victor Mustya <victor.mustya at intel.com>
Date:   Mon Dec 1 16:06:37 2025 -0800

    [libclc] Fix bitfield_insert implementation (#170208)

    The `bitfield_insert` function in the OpenCL C library had an incorrect
    `__CLC_BODY` definition that included the `.inc` file for the
    `__clc_bitfield_insert` declaration instead of the correct
    implementation. As a result, the function was not defined at all,
    leading to linker errors when trying to use it.

commit 64a762804893ebf2c0942ad7970118188f27b16a
Author: Andres-Salamanca <andrealebarbaritos at gmail.com>
Date:   Mon Dec 1 18:48:26 2025 -0500

    [CIR] Upstream Emit the resume branch for cir.await op (#169864)

    This PR upstreams the emission of the `cir.await` resume branch.
    Handling the case where the return value of `co_await` is not ignored is
    deferred to a future PR, which will be added once `co_return` is
    upstreamed. Additionally, the `forLValue` variable is always `false` in
    the current implementation. When support for emitting `coro_yield` is
    added, this variable will be set to `true`, so that work is also
    deferred to a future PR.

commit ace65a0a8d7b9ad48c1f321cc70c711ecdab29bf
Author: Alex Langford <alangford at apple.com>
Date:   Mon Dec 1 15:44:01 2025 -0800

    [LLDB] Update Shell lit config to handle c8031c3dd743 (#170225)

commit 6883d4a23605dbd67d24a44801e9c1888ffdf586
Author: Matt Arsenault <Matthew.Arsenault at amd.com>
Date:   Mon Dec 1 18:40:26 2025 -0500

    AMDGPU: Try to use zext to implement constant-32-bit addrspacecast (#168977)

    If the high bits are assumed 0 for the cast, use zext. Previously
    we would emit a build_vector and a bitcast with the high element
    as 0. The zext is more easily optimized. I'm less convinced this is
    good for globalisel, since you still need to have the inttoptr back
    to the original pointer type.

    The default value is 0, though I'm not sure if this is meaningful
    in the real world. The real uses might always override the high
    bit value with the attribute.

commit 9324dae70f009ceb5c0e93b99e73c51fcaf67911
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Mon Dec 1 22:46:43 2025 +0000

    [X86] Add tests showing failure to concat icmp instructions together. (#170210)

commit b7c358c44af3cda2b731f6bb94f6d765350017a4
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Mon Dec 1 14:45:54 2025 -0800

    [lldb/ScriptInterpreter] Add a way to retrieve script module file path (#170202)

    This adds a new virtual method `GetScriptedModulePath()` to
    `ScriptedInterface` that allows retrieving the file path of the Python
    module containing the scripted object implementation.

    The Python implementation acquires the GIL and walks through the
    object's `__class__.__module__` to find the module's `__file__`
    attribute. This will be used by ScriptedFrame to populate the module and
    compile unit for frames pointing to Python source files.

    Signed-off-by: Med Ismail Bennani <ismail at bennani.ma>

commit 229dca66df6d0f9253273565d82972fd1787bd4a
Author: Peter Klausler <pklausler at nvidia.com>
Date:   Mon Dec 1 14:41:57 2025 -0800

    [flang] Handle assumed-type dummy arguments in ExtractDataRef (#169080)

    Assumed-type dummy argument symbols s are never packaged in DataRefs
    since the only way they can be used in Fortran is by forwarded as actual
    arguments to other calls. When an ActualArgument comprising a forwarded
    assumed-type dummy argument is presented to ExtractDataRef, it fails,
    because ExtractDataRef for ActualArgument only handles actual argument
    expressions (including variable references). Add support for actual
    arguments that are assumed-type dummy arguments.

    Fixes https://github.com/llvm/llvm-project/issues/168978.

commit 76c5b6abc96c8fd2e977e9d5ed50e038a0b4477a
Author: Peter Klausler <pklausler at nvidia.com>
Date:   Mon Dec 1 14:41:32 2025 -0800

    [flang] Warn on invalid argument to SQRT() (#168607)

    When folding SQRT(), notice invalid argument exceptions and optionally
    warn about them.

commit b76300acc5207b77ffce5677f31491ee58f06c06
Author: Daniel Thornburgh <mysterymath at gmail.com>
Date:   Mon Dec 1 14:41:20 2025 -0800

    [libc][malloc] Ensure a minimum block alignment of 4 (#169447)

    Most platforms inherently have a size_t alignment of 4, but this isn't
    true on every platform LLVM has some degree of backend support for.
    Accordingly, it's simple enough to just set the min alignment of Block
    to 4 and lose the static_assert.

commit 2864afbe4d5922511d0f65b3f5231ef6f7ae7c10
Author: Florian Hahn <flo at fhahn.com>
Date:   Mon Dec 1 22:40:04 2025 +0000

    [LV] Add more tests for argmin finding the first index.

    Add more test coverage for supporting argmin/argmax with strict
    predicates, in preparation for follow up to 99addbf73db596403a17.

commit ae68377c69db15d1d567368b2321455d31f41b69
Author: Jason Molenda <jmolenda at apple.com>
Date:   Mon Dec 1 14:37:55 2025 -0800

    [lldb][NFC] Change ObjectFile's DataExtractor to a shared ptr (#170066)

    ObjectFile has an m_data DataExtractor ivar which may be default
    constructed initially, or initialized with a DataBuffer passed in to its
    ctor. If the DataExtractor does not get a DataBuffer source passed in,
    the subclass will initialize it with access to the object file's data.
    When a DataBuffer is passed in to the base class ctor, the DataExtractor
    only has its buffer initialized; ObjectFile doesn't yet know the address
    size and endianness to fully initialize the DataExtractor.

    This patch changes ObjectFile to instead have a DataExtractorSP ivar
    which is always initialized with at least a default-constructed
    DataExtractor object in the base class ctor. The next patch I will be
    writing is to change the ObjectFile ctor to take an optional
    DataExtractorSP, so the caller can pass a DataExtractor subclass -- the
    VirtualDataExtractor being added via
    https://github.com/llvm/llvm-project/pull/168802
    instead of a DataBuffer which is trivially saved into the DataExtractor.

    The change is otherwise mechanical; all `m_data.` changed to
    `m_data_sp->` and all the places where `m_data` was passed in for a
    by-ref call were changed to `*m_data_sp.get()`. The shared pointer is
    always initialized to contain an object.

    I built & ran the testsuite on macOS and on aarch64-Ubuntu (thanks for
    getting the Linux testsuite to run on SME-only systems David). All of
    the ObjectFile subclasses I modified compile cleanly, but I haven't
    tested them beyond any unit tests they may have (prob breakpad).

    rdar://148939795

commit 28d2208f7d067c58eb81495fbb9606e508678f6f
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Mon Dec 1 14:34:31 2025 -0800

    [CI] Add checkmark emojis for passing builds (#170183)

    This better matches the code formatter and I personally find the visual
    indication valuable when I am scrolling/glancing at a comment.

commit 60513b8d6ebacde46e8fbe4faf1319ac87e990e3
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Mon Dec 1 22:27:21 2025 +0000

    [Github] Fix typo in unprivileged-download-artifact test workflow

    s/Chekcout/Checkout

commit 4ed97c153b6a289256a320238f98b74eaf4844d6
Author: Daniel Sanders <daniel_l_sanders at apple.com>
Date:   Mon Dec 1 14:23:04 2025 -0800

    [lldb] Add type hints for gdbclientutils.py when base class returns Literal[T] (#170182)

    Pyright automatically deduces these functions to return `Literal["foo"]`
    since the implementation returns "foo". This causes any overload that
    returns a different literal or a plain string to be reported as
    overloaded in an incompatible way. By correctly annotating them as
    returning `str`, the overloads can return different strings without this
    error.

    I was encountering these a lot while writing tests for my downstream
    target

commit e0e897e5c8976cbbc4b99987eb6fbc7faa6d03cf
Author: google-yfyang <yfyang at google.com>
Date:   Mon Dec 1 17:19:50 2025 -0500

    [CUDA][NFC] Fix an unused variable error when compiled with optimization (#170205)

    #165519 causes some builds to fail.
    [clang/lib/CodeGen/CGCUDARuntime.cpp:65]:15: error: unused variable
    'Ctx' [-Werror,-Wunused-variable]
       65 |   ASTContext &Ctx = CGM.getContext();

commit d08b0f7240aaba42d344fef942f812e6a38e5331
Author: Erik Enikeev <evonatarius at gmail.com>
Date:   Tue Dec 2 01:03:55 2025 +0300

    [ARM] Disable strict node mutation and use correct lowering for several strict ops (#170136)

    Changes in this PR were discussed and reviewed in
    https://github.com/llvm/llvm-project/pull/137101.

commit e7748e92cd5d71af2e1699328b7c575e9b9bf479
Author: Valery Dmitriev <valeryd at nvidia.com>
Date:   Mon Dec 1 13:53:13 2025 -0800

    [flang] implement show_descriptor intrinsic, a non-standard extension (#169137)

    The show_descriptor intrinsic prints details of a descriptor (an
    extended Fortran pointer).
    It accepts a descriptor for any type and rank, including scalars.
    It requires use of the flang_debug module.

    Example:
    program test
      use flang_debug
      implicit none
      integer :: a(4) = (/ 1,3,5,7 /)
      call show_descriptor(a(1:3))
    end program test

    and its output:
    Descriptor @ 0x7ffe01ec6a98:
      base_addr 0x563b7035103c
      elem_len  4
      version   20240719
      rank      1
      type      9 "INTEGER(kind=4)"
      attribute 0
      extra     0
        addendum  0
        alloc_idx 0
      dim[0] lower_bound 1
             extent      3
             sm          4

commit aa727db23ed9c6c2399e5728d5d689c110bd7f80
Author: lntue <lntue at google.com>
Date:   Mon Dec 1 16:45:21 2025 -0500

    [libc][docs] Add links to 2025 talks. (#170206)

commit 93e18db3e48dc28818d4880e813b9027bfbf3c16
Author: Tom Honermann <tom.honermann at intel.com>
Date:   Mon Dec 1 16:34:37 2025 -0500

    [libsycl] fix license link in README.md.

commit b545e54f7a87291198d3e615e523a2b37a967482
Author: Alexey Samsonov <vonosmas at gmail.com>
Date:   Mon Dec 1 13:33:51 2025 -0800

    [libc] Remove btowc / wctob from wctype_utils internal header. (#170200)

    They are not used anywhere except for the btowc/wctob entrypoints, so
    just move the implementation there. Internal code should probably be
    using the safer mbrtowc variants anyway, if applicable.

    This allows us to remove the use of wint_t, which is problematic for
    some uses through `libc/shared/` when a host system doesn't have
    wide-character support (see PR #165884 comments). There's no such
    problems with `wchar_t`, since it's a fundamental type in C++.

commit eb1533d3f9c8755369a13b0a941fc58ef959d00b
Author: Jordan Rupprecht <rupprecht at google.com>
Date:   Mon Dec 1 15:30:26 2025 -0600

    [bazel] Move clang-fuzzer to separate package (#170167)

    This avoids needing to pull in protobuf deps for clang-fuzzer when that
    is not needed.

    Recently requested: #168928

    Previously requested: #123126 / #123833

commit 41aade49d2600a464a338f1086328a58b30b7f95
Author: Jordan Rupprecht <rupprecht at google.com>
Date:   Mon Dec 1 15:29:51 2025 -0600

    [bazel][mlir][acc] Port #170189: acc deps (#170203)

commit 3ca85e74a1af2771ea46519107e6d366316a03ee
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Mon Dec 1 13:12:42 2025 -0800

    [lldb] Handle staticmethod/classmethod descriptors in ScriptedPythonInterface (#170188)

    Extract the `__func__` attribute from staticmethod/classmethod
    descriptors before treating them as callables. Python's `@staticmethod`
    and `@classmethod` decorators wrap methods in descriptor objects that
    PyCallable_Check does not accept as directly usable PythonCallables.

    The actual callable function is stored in the `__func__` attribute of
    these descriptors, so we need to unwrap them to properly validate and
    invoke the decorated methods in scripted interfaces.

    Signed-off-by: Med Ismail Bennani <ismail at bennani.ma>

commit 637a230ee3a2c07679b4e7207467a5a740ba3a3e
Author: Ramkumar Ramachandra <ramkumar.ramachandra at codasip.com>
Date:   Mon Dec 1 21:06:28 2025 +0000

    [MapVector] Introduce {keys,values} iterators (#169675)

    Similar to DenseMap::{keys,values}, introduce MapVector::{keys,values}.
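
    A minimal usage sketch, assuming the new ranges mirror DenseMap's
    keys()/values() and iterate in MapVector's insertion order:

    ```
    #include "llvm/ADT/MapVector.h"
    #include "llvm/ADT/StringRef.h"
    #include "llvm/Support/raw_ostream.h"

    int main() {
      llvm::MapVector<int, llvm::StringRef> MV;
      MV.insert({1, "one"});
      MV.insert({2, "two"});
      for (int K : MV.keys())                 // keys in insertion order
        llvm::outs() << K << "\n";
      for (llvm::StringRef V : MV.values())   // values in insertion order
        llvm::outs() << V << "\n";
    }
    ```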

commit da9e8d5c57c845c852cc676c104c499eff06ae09
Author: David Peixotto <peix at meta.com>
Date:   Mon Dec 1 13:05:27 2025 -0800

    [lldb] Unify DW_OP_deref and DW_OP_deref_size implementations (#169587)

    This commit unifies the code in the dwarf expression evaluator that
    handles these two deref operations. Previously we had similar, but not
    identical code for handling them.

    The implementation was taken from the DW_OP_deref_size code path since
    that handles the general case. We put that code into an
    Evaluate_DW_OP_deref_size function and call it with the address size
    when evaluating DW_OP_deref.

    There were initially two test failures when I made the change. The
    `DW_op_deref_no_ptr_fixing` unittest failed because we were not
    correctly setting the address size when we created the DataExtractor.
    The `DW_OP_deref` test failed because previously the expression
    `DW_OP_lit4, DW_OP_deref` would evaluate to a LoadAddress, but the code
    for deref_size was evaluating it to a scalar.

    The difference was in how it handled a deref of a scalar value type. In
    the deref code path we convert a scalar to a load address, but this was
    not done in the deref_size code path.

    ```
      case Value::ValueType::Scalar:
          stack.back().SetValueType(Value::ValueType::LoadAddress);
    ```

    I decided to modify the deref_size code to also convert the value to a
    load address to keep the test passing.

    There is no functional change intended here. The motivation is to reduce
    the number of code paths that implement the deref operation.

commit 5d4c4411f13755d5f12a83a0d6705e8501f33d5f
Author: davidtrevelyan <davidtrevelyan at users.noreply.github.com>
Date:   Mon Dec 1 20:56:43 2025 +0000

    [rtsan] Handle attributed IR function declarations (#169577)

    Addresses https://github.com/llvm/llvm-project/issues/169377.

    Previously, the RealtimeSanitizer pass only handled attributed function
    _definitions_ in IR, and we have recently found that attributed function
    _declarations_ caused it to crash. To fix the issue, we must check
    whether the IR function is empty before attempting to do any
    manipulation of its instructions.

    This PR:

    - Adds checks for whether IR `Function`s are `empty()` ~~in each
    relevant~~ at the top-level RTSan pass routine
    - ~~Removes the utility function `rtsanPreservedCFGAnalyses` from the
    pass, whose result was unused and which would otherwise have complicated
    the fix~~

commit af2e2468217d1fe6e87b3d0f37f9524cc95c9298
Author: Max191 <44243577+Max191 at users.noreply.github.com>
Date:   Mon Dec 1 15:46:42 2025 -0500

    [mlir] Add missing pad reshape propagation patterns (#168888)

    The existing `FoldPadWithProducerReshapeOpByExpansion` and
    `FoldPadWithProducerReshapeOpByCollapsing` patterns did not cover all
    reshape propagation cases, because they only consider cases where the
    pad op is the consumer operation. This PR adds 2 new patterns to cover
    the cases where the pad op is the producer operation, which completes
    the propagation pattern set for pad op with expand_shape and
    collapse_shape.

    Note for integration: This PR also removes the single user restriction
    for the `FoldPadWithProducerReshapeOpByExpansion` and
    `FoldPadWithProducerReshapeOpByCollapsing` patterns, which leaves more
    control to the users of the pattern. If this constraint is needed, then
    it should be added to the control function for these patterns.

    ---------

    Signed-off-by: Max Dawkins <max.dawkins at gmail.com>

commit 7b2ee464d278c05a0539482ecf3562649e9ea27d
Author: Amr Hesham <amr96 at programmer.net>
Date:   Mon Dec 1 21:45:51 2025 +0100

    [CIR] Add boolean to the Complex type constraints msg (#170192)

    Update the type constraints error message to also mention the boolean
    type

commit d7b5469b39d0c8b5d5db2f87bbd2446365f2dc35
Author: macurtis-amd <macurtis at amd.com>
Date:   Mon Dec 1 14:41:46 2025 -0600

    [clang] Handle null dtor decl during consumed analysis (#170180)

    See similar PR #169593.

    This is another case where null was not handled when returned from
    `getDestructorDecl`.

commit 258cb467e9af70cca5bcd13aef0a9443860960d9
Author: Razvan Lupusoru <razvan.lupusoru at gmail.com>
Date:   Mon Dec 1 12:26:43 2025 -0800

    [mlir][acc] Add acc serial to acc parallel conversion (#170189)

    This patch introduces a new transformation pass that converts
    `acc.serial` constructs into `acc.parallel` constructs with
    num_gangs(1), num_workers(1), and vector_length(1).

    The transformation is semantically equivalent since an OpenACC serial
    region executes sequentially, which is identical to a parallel region
    with a single gang, worker, and vector. This unification simplifies
    processing of acc regions by enabling code reuse in later compilation
    stages.

    Co-authored-by: Vijay Kandiah <vkandiah at nvidia.com>

commit da76a48943f090bc6d5aa8b462d07d361f401d37
Author: Yu Hao <yuhaoyu at google.com>
Date:   Mon Dec 1 12:24:07 2025 -0800

    [clang][transformer] Add `merge` range-selector for selecting the merge of two ranges. (#169560)

    This new range-selector `merge` takes in two ranges and selects from
    min(begin locs of input ranges) to max(end locs of input ranges). This
    is useful when the user needs to select a range that is a merge of
    two arbitrary ranges (potentially overlapping and out of order).

    The existing `enclose` range-selector does something similar, but it
    requires that the first range's begin loc appear before the second
    range's end loc. The `merge` range-selector complements `enclose`.
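
    A minimal sketch of how the new selector might be used, assuming it is
    spelled merge(R1, R2) and composes like the existing enclose() selector;
    the matcher, the "lhs"/"rhs" bindings, and the replacement text are
    hypothetical:

    ```
    #include "clang/ASTMatchers/ASTMatchers.h"
    #include "clang/Tooling/Transformer/RangeSelector.h"
    #include "clang/Tooling/Transformer/RewriteRule.h"
    #include "clang/Tooling/Transformer/Stencil.h"

    using namespace clang::ast_matchers;
    using namespace clang::transformer;

    // Replace the merged source range covering both operands with "0",
    // regardless of which operand appears first in the source.
    RewriteRule R = makeRule(
        binaryOperator(hasLHS(expr().bind("lhs")),
                       hasRHS(expr().bind("rhs"))),
        changeTo(merge(node("rhs"), node("lhs")), cat("0")));
    ```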

    ---------

    Co-authored-by: Yitzhak Mandelbaum <ymand at users.noreply.github.com>

commit b9b9a239df4785b42b050b128eff18694871bc14
Author: Max <628527+mxms0 at users.noreply.github.com>
Date:   Mon Dec 1 15:22:08 2025 -0500

    [ProfData] Improve efficiency of reader (#169730)

    Pre-reserve space in the map before inserting. In release builds, 9.4%
    of all CPU time is spent in llvm::sampleprof::ProfileSymbolList::add. Of
    that 9.4%, roughly half is in llvm::DenseMapBase::grow.
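
    A minimal generic sketch of the idea, not the actual profile-reader
    code; the container, Names, and the surrounding loop are stand-ins:

    ```
    #include "llvm/ADT/DenseMap.h"
    #include "llvm/ADT/StringRef.h"
    #include <vector>

    void buildIndex(const std::vector<llvm::StringRef> &Names) {
      llvm::DenseMap<llvm::StringRef, unsigned> Index;
      // Reserving up front avoids repeated DenseMap growth while the
      // symbols are inserted in bulk.
      Index.reserve(Names.size());
      for (llvm::StringRef Name : Names)
        Index.try_emplace(Name, Index.size());
    }
    ```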

    ---------

    Co-authored-by: mxms <mxms at google.com>
    Co-authored-by: Vitaly Buka <vitalybuka at gmail.com>

commit 1d3384e5d4c6bd1b297110b2de8a79d8a4b274e2
Author: Louis Dionne <ldionne.2 at gmail.com>
Date:   Mon Dec 1 15:17:44 2025 -0500

    [libc++] Update the Docker image hash in run-buildbot-container (#170165)

    The current Docker image used by our CI is d6b22a347f813cf4a983, but we
    forgot to synchronize the value in run-buildbot-container.

commit 860146c4b6856e5c2a57218fd9e70f131280b00f
Author: Louis Dionne <ldionne.2 at gmail.com>
Date:   Mon Dec 1 15:17:16 2025 -0500

    [libc++] Make sure the LLVM libc shared utilities use libc++'s assertion mechanism (#170149)

    Otherwise, they would use their own mechanism based on C assert. It's
    better to use the same assertion mechanism consistently everywhere since
    this code is considered an implementation detail of libc++.

commit c8031c3dd7434635dd64ad8a4abe9a96f86a384b
Author: Tomohiro Kashiwada <kikairoya at gmail.com>
Date:   Tue Dec 2 05:06:17 2025 +0900

    [LIT] remove `to_unicode`, `to_string`, and `to_bytes` helpers (#165950)

    These helpers, which handle the difference between Python 2.x and Python
    3.x, are no longer required.

    Co-authored-by: Alexander Richardson <mail at alexrichardson.me>

commit 33bcde0678707ffffb7f01188d530da05bed47b8
Author: Gleb Popov <6yearold at gmail.com>
Date:   Mon Dec 1 23:05:37 2025 +0300

    libunwind: Remove OS requirements from tests to make them run on more OSes (#167642)

    There might be a cleaner way to enable these tests running on FreeBSD,
    I'm open to suggestions.

    Co-authored-by: Alexander Richardson <mail at alexrichardson.me>

commit df3e1b59d85b153a369d344f9ef335f5315d84a5
Author: Erick Ochoa Lopez <erick.ochoalopez at amd.com>
Date:   Mon Dec 1 15:05:02 2025 -0500

    [mlir][amdgpu] Add amdgpu.make_dma_descriptor (#169407)

    Co-authored-by: Jakub Kuderski <kubakuderski at gmail.com>

commit 28ac6b36c14376f5a80b974d6b1c49a89201b594
Author: Valentin Clement (バレンタイン クレメン) <clementval at gmail.com>
Date:   Mon Dec 1 11:51:35 2025 -0800

    [flang][cuda] Use the option to populate conversion patterns (#170190)

    #169740 split the conversion patterns, but the option was not used when
    populating them.

commit fffe9bcbc7d5d93872ad00a7f212483d749ae71d
Author: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
Date:   Mon Dec 1 11:46:30 2025 -0800

    [AMDGPU] Allow hazard checks for WMMA co-exec (#168805)

    Currently we just insert V_NOP instructions; instead, try to schedule
    something into the shadow.

    It is still somewhat imprecise, for example AdvanceCycle() will
    use TII.getNumWaitStates() anyway, but in scheduling mode
    we are not required to be precise. We only have to be fully precise
    in the hazard recognizer mode. The EmittedInstrs buffer is also
    limited to MaxLookAhead even though VALU-only hazards may actually
    never expire and would require an endless buffer. But that's OK; we can
    at least mitigate the hazards within what the buffer can hold. The
    buffer is also currently much bigger than any of the VALU hazards may
    need.

    That said, the rest of the 'fix*' functions here -- those that use
    V_NOPs -- can be changed the same way. This one is just the worst
    because it may require up to 9 nops.

commit 00276b67d36a665119a6a7b39dbba69f45c44e58
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Mon Dec 1 11:34:36 2025 -0800

    [WPD] Remove undef from tests (#170179)

commit 281f4ea58e4684e0817e15a7284c42fb29e37704
Author: Nico Weber <thakis at chromium.org>
Date:   Mon Dec 1 14:29:16 2025 -0500

    [gn] port f65c19982d2a

commit fd8bf3c69a10cfe60f89799710c60c4f5dd4e22d
Author: Med Ismail Bennani <ismail at bennani.ma>
Date:   Mon Dec 1 11:24:50 2025 -0800

    [lldb/ScriptInterpreter] Fix typo in AbstractMethodCheckerPayload (NFC) (#170187)

    This fixes a typo in ScriptedPythonInterface and changes
    `AbstrackMethodCheckerPayload` to `AbstractMethodCheckerPayload`.

    Signed-off-by: Med Ismail Bennani <ismail at bennani.ma>

commit 6397e2f59ee06814693016bea181fce9480622d2
Author: Lucas Ste <38472950+LucasSte at users.noreply.github.com>
Date:   Mon Dec 1 16:19:39 2025 -0300

    Revert "[BPF] Allow libcalls behind a feature gate (#168442)" (#169733)

    **Problem**

    As mentioned in
    https://github.com/llvm/llvm-project/pull/168442#pullrequestreview-3501502448,
    #168442 is not the right solution for the problem.

    I'll follow @arsenm's suggestions, starting with
    https://github.com/llvm/llvm-project/pull/169537, to properly allow safe
    math operations and i128 support in BPF.

    **Solution**

    Revert #168442.

commit e6ae2462bd6dcf583ccd13c6627fe3ffe8a17f2c
Author: Stanislav Mekhanoshin <Stanislav.Mekhanoshin at amd.com>
Date:   Mon Dec 1 10:59:52 2025 -0800

    [AMDGPU] Refactor hazard recognizer for VALU-pipeline hazards. NFCI. (#168801)

    This is in preparation for handling these in the scheduler. I do not
    expect any changes to the produced code here; it is just infrastructure.
    Our current problem with the VALU pipeline hazards is that we only
    insert V_NOP instructions in the hazard recognizer mode, but ignore
    them during scheduling. This patch is meant to create a mechanism to
    actually account for that during scheduling.

commit 0ff0f52460531c0bfa213d0dcfa0cfb4ba61e934
Author: Greg Clayton <gclayton at fb.com>
Date:   Mon Dec 1 10:57:28 2025 -0800

    Fix __apple_XXX iterator that iterates over all entries. (#157538)

    The previous iterator for __apple_XXX sections was assuming that all
    entries in the table would be contiguous and it wasn't using the offsets
    table to access each chain of entries for a given name. This patch fixes
    it so the iterator does the right thing.

    This issue became apparent after a modification to strip template names
    from DW_AT_name entries, which allows adding both the template class
    base name and the full name with template parameters as entries. The
    commit hash is 2e7ee4dc21430b0fe4c9ee306dc1d8c7986a6646. The problem is
    that if the name starts with a "<" it will try to split the name. So if
    the name is `"<get-size>"` it will return an empty string as the
    function name, and this empty string gets added to the __apple_names
    table and causes large delays when using the iterators.

commit 9edbf83667821e3154446d5e2429e41bf261e26f
Author: Charles Zablit <c_zablit at apple.com>
Date:   Mon Dec 1 19:46:16 2025 +0100

    [lldb][windows] fix environment handling in CreateProcessW setup (#168733)

    This patch refactors and documents the setup of the `CreateProcessW`
    invocation in `ProcessLauncherWindows`. It's a dependency of
    https://github.com/llvm/llvm-project/pull/168729.

    `CreateEnvironmentBufferW` now sorts the environment variable keys
    before concatenating them into a string. From [the `CreateProcess`
    documentation](https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createprocessw):
    > An application must manually pass the current directory information to
    the new process. To do so, the application must explicitly create these
    environment variable strings, sort them alphabetically (because the
    system uses a sorted environment), and put them into the environment
    block. Typically, they will go at the front of the environment block,
    due to the environment block sort order.

    `GetFlattenedWindowsCommandStringW` now returns an error which will be
    surfaced, instead of failing silently.

    Types were converted to their wide equivalents (i.e., appending `W` to
    them, see `STARTUPINFOEX`) since we are calling the `W` variant of
    `CreateProcess`.

commit b73385dda5caa21570ddc6d7277c22aca8f2de1e
Author: Nico Weber <thakis at chromium.org>
Date:   Mon Dec 1 13:42:28 2025 -0500

    [TySan] Attempt to unbreak build after #169036

    If tysan was not in COMPILER_RT_SANITIZERS_TO_BUILD, we used to
    get an error after #169036, see comments there for details.

commit f65c19982d2af7f791115e0b51c095a52ad5da4a
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Mon Dec 1 10:42:36 2025 -0800

    Reapply "[clangd] Make lit tests work with the internal shell" (#169972)

    This reverts commit bd04ef6df50e8e6e5212762fc798ea9fbdcfc897.

    This reapply fixes the broken case where we would fail at CMake
    configuration time if LLVM_INCLUDE_BENCHMARKS was explicitly turned off.

commit 4a48740831d0f0779780e0bea64ec4a16d9f6d97
Author: Helena Kotas <hekotas at microsoft.com>
Date:   Mon Dec 1 10:34:29 2025 -0800

    [HLSL] Update indexed vector elements individually (#169144)

    When an individual element of a vector is updated via indexing into the
    vector, it needs to be handled as a store operation on that one vector
    element.

    Clang treats vectors as one unit, so when a vector element needs to be
    updated, the whole vector is loaded, the element is modified, and then
    the whole vector is stored. In HLSL, vector elements are handled
    individually. We need to avoid this load/modify/store sequence to
    prevent overwriting other vector elements that might be getting updated
    in parallel.

    Fixes #167729

    Contributes to #160208.

commit 56d061ccc5afb07e8d9a4d2c501bbcb56031ccc9
Author: William Tran-Viet <wtranviet at proton.me>
Date:   Mon Dec 1 13:33:48 2025 -0500

    [libc++][NFC] Add optional<T&> synopsis (#170043)

commit c103d61758e61a9fe4c1963b29d602ffe2c22427
Author: Jonas Devlieghere <jonas at devlieghere.com>
Date:   Mon Dec 1 10:30:31 2025 -0800

    [lldb] Fix a bug when disabling the statusline. (#169127)

    Currently, disabling the statusline with `settings set show-statusline
    false` leaves LLDB in a broken state. The same is true when trying to
    toggle the setting again.

    The issue was that setting the scroll window to 0 is apparently not
    identical to setting it to the correct number of rows, even though some
    documentation online incorrectly claims so.

    Fixes #166608

commit c9d9dddc1c5e9f203f5db890f383b956c5b2d295
Author: nerix <nerixdev at outlook.de>
Date:   Mon Dec 1 19:27:54 2025 +0100

    [LLDB][NativePDB] Look for PDBs in `target.debug-file-search-paths` (#169719)

    Similar to DWARF's DWO, we should look for PDBs in
    `target.debug-file-search-paths` if the PDB isn't at the original
    location or next to the executable.

    With this PR, the search order is as follows:

    1. PDB path specified in the PE/COFF file
    2. Next to the executable
    3. In `target.debug-file-search-paths`

    This roughly matches [the order Visual Studio
    uses](https://learn.microsoft.com/en-us/visualstudio/debugger/specify-symbol-dot-pdb-and-source-files-in-the-visual-studio-debugger?view=vs-2022#where-the-debugger-looks-for-symbols),
    except that we don't have a project folder and don't support symbol
    servers.

    Closes #125355 (though I think this is already fixed in the native
    plugin).

commit 21e64d1f5a3dbf539eaf9c7ac160469e60222ba2
Author: Valentin Clement (バレンタイン クレメン) <clementval at gmail.com>
Date:   Mon Dec 1 10:19:52 2025 -0800

    [flang][cuda][NFC] Split allocation related operation conversion from other cuf operations (#169740)

    Split AllocOp, FreeOp, AllocateOp and DeallocateOp from other
    conversion. Patterns are currently added to the base CUFOpConversion
    when the option is enabled.
    This split is a pre-requisite to be more flexible where we do the
    allocation related operations conversion in the pipeline.

commit d1899acd08d3eb876de0e5394f6c3a2441e04756
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Mon Dec 1 18:17:15 2025 +0000

    [X86] combineConcatVectorOps - add handling to concat ISD::FROUND/FFLOOR intrinsics together (#170176)

    These were missed in #170160

commit dec77e4f878cd4a530aa0be6106859fc69726928
Author: Ramkumar Ramachandra <ramkumar.ramachandra at codasip.com>
Date:   Mon Dec 1 17:47:32 2025 +0000

    [VPlan] Improve code in VPInstruction::generate (NFC) (#169470)

    Make miscellaneous improvements including inlining some expressions and
    re-using the existing State.Builder reference.

commit 61881c307c059a43ec04b2f9a9923c57d9a38f23
Author: darkbuck <michael.hliao at gmail.com>
Date:   Mon Dec 1 12:45:10 2025 -0500

    [CUDA] Add device-side kernel launch support (#165519)

    - CUDA's dynamic parallelism extension allows device-side kernel
    launches, which share identical syntax with host-side launches, e.g.,

        kernel<<<Dg, Db, Ns, S>>>(arguments);

    but differ in code generation. A device-side kernel launch is
    eventually translated into the following sequence:

        config = cudaGetParameterBuffer(alignment, size);
        // setup arguments by copying them into `config`.
        cudaLaunchDevice(func, config, Dg, Db, Ns, S);

    - To support the device-side kernel launch, 'CUDAKernelCallExpr' is
    reused, but its config expr is set to a call to 'cudaLaunchDevice'.
    During code generation, 'CUDAKernelCallExpr' is expanded into the
    aforementioned sequence.

    - As the device-side kernel launch requires the source to be compiled as
    relocatable device code and linked with '-lcudadevrt', linkers are
    changed to pass the relevant link options to 'nvlink'.

commit a7c1f467339abd1942c89f2ef8b79083e89e7dad
Author: Igor Wodiany <igor.wodiany at imgtec.com>
Date:   Mon Dec 1 17:41:52 2025 +0000

    [mlir][spirv] Enable block splitting for `spirv.Switch` (#170147)

    This is not strictly necessary now that selection regions can yield
    values; however, splitting the block simplifies the code as it avoids
    unnecessary values being sunk just to be yielded later.

commit 25ab47bd407d6d92e587e2d545081ab25c909d86
Author: Florian Hahn <flo at fhahn.com>
Date:   Mon Dec 1 17:33:36 2025 +0000

    [VPlan] Use wide IV if scalar lanes > 0 are used with scalable vectors. (#169796)

    For scalable vectors, VPScalarIVStepsRecipe cannot create all scalar
    step values. At the moment, it creates a vector in addition to the
    first lane. The only supported case for this is when only the last lane
    is used. A recipe should not set both scalar and vector values.

    Instead, we can simply use a vector induction. It would also be possible
    to preserve the current vector code-gen, by creating VPInstructions
    based on the first lane of VPScalarIVStepsRecipe, but using a vector
    induction seems simpler.

    PR: https://github.com/llvm/llvm-project/pull/169796

commit 3d862cfcea9bba5fe04d22beaa6c46f850a76a73
Author: Steven Perron <stevenperron at google.com>
Date:   Mon Dec 1 12:27:36 2025 -0500

    [SPIRV] Add legalization for long vectors (#169665)

    This patch introduces the necessary infrastructure to legalize vector
    operations on vectors that are longer than what the SPIR-V target
    supports. For instance, shaders only support vectors up to 4 elements.

    The legalization is done by splitting the long vectors into smaller
    vectors of a legal size.

    Specifically, this patch does the following:
    - Introduces `vectorElementCountIsGreaterThan` and
      `vectorElementCountIsLessThanOrEqualTo` legality predicates.
    - Adds legalization rules for `G_SHUFFLE_VECTOR`,
    `G_EXTRACT_VECTOR_ELT`,
      `G_BUILD_VECTOR`, `G_CONCAT_VECTORS`, `G_SPLAT_VECTOR`, and
      `G_UNMERGE_VALUES`.
    - Handles `G_BITCAST` of long vectors by converting them to
      `@llvm.spv.bitcast` intrinsics which are then legalized.
    - Updates `selectUnmergeValues` to handle extraction of both scalars
      and vectors from a larger vector, using `OpCompositeExtract` and
      `OpVectorShuffle` respectively.

    Fixes https://github.com/llvm/llvm-project/pull/165444

commit 8a3891ceadad3a156b8fbcdccd82f0aa7dece982
Author: Prasoon Mishra <Prasoon.Mishra at amd.com>
Date:   Mon Dec 1 22:56:37 2025 +0530

    [AMDGPU][NPM] Preserve analyses in AMDGPURewriteAGPRCopyMFMA for NPM (#170130)

    The pass preserved LiveStacksAnalysis but failed to preserve
    LiveIntervalsAnalysis, LiveRegMatrixAnalysis, VirtRegMapAnalysis, and
    SlotIndexesAnalysis under NPM. This caused these analyses to be
    invalidated and recomputed, leading to incorrect behavior in subsequent
    passes like VirtRegRewriter.

    Fix by explicitly preserving all required analyses in the NPM version,
    matching the legacy pass manager behavior.

    ---------

    Co-authored-by: vikhegde <vikram.hegde at amd.com>

commit dae9139d8fecf09d975f59b012646bc04f694c35
Author: Muhammad Abdul <alilo.ghazali at gmail.com>
Date:   Tue Dec 2 00:24:16 2025 +0700

    [X86][Clang] VectorExprEvaluator::VisitCallExpr / InterpretBuiltin - allow AVX512 kmov intrinsics to be used in constexp (#169895)

    Resolves #166975

commit 4e316d7e81a3e481dc55804a662e6204ec6a62a6
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Mon Dec 1 17:23:53 2025 +0000

    [X86] Add test coverage for the concatenation of ISD::FROUND intrinsics (#170166)

    These were missed in #170160

commit 8ccdb3540b6d9085bf2112aa7cbed4a292837c01
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Mon Dec 1 17:19:37 2025 +0000

    [X86] Add test coverage for the concatenation of ISD::FFLOOR intrinsics (#170168)

    These were missed in #170160

commit a15a6c870b9cf34340c3332b586beff6bdf15424
Author: Sam Elliott <aelliott at qti.qualcomm.com>
Date:   Mon Dec 1 17:18:02 2025 +0000

    [RISCV] Rename SFB Base Feature (#169607)

    New SFB subsets are being added with the scheduler class name as a
    suffix, so now is the time to go back to the base extension and add IALU
    to its name.

    This also:
    - Drops a hyphen from the other SFB features for mul and minmax, to more
    closely match their scheduling classes.
    - Updates the predicates on specific SFB pseudos so we get verifier
    errors if we introduce the pseudos when we don't have the right
    subtarget feature.
    - Updates the SFB Documentation comment to make it no longer
    SiFive-specific.

commit 65666b2586383c34a4cdc3f324836192258dddc3
Author: Igor Wodiany <igor.wodiany at imgtec.com>
Date:   Mon Dec 1 17:14:14 2025 +0000

    [mlir][spirv] Rename `*.spv` tests to `*.spvasm`. (#170161)

    This patch renames two of the SPIR-V tests to `*.spvasm` since both
    files are assembly files, rather than SPIR-V binaries. The `lit.cfg.py`
    is adjusted and we no longer need to run `*.spv` tests since none are
    present.

commit 5c2601563789a232a9d0575c95edacdc2c25a97d
Author: David Green <david.green at arm.com>
Date:   Mon Dec 1 16:53:40 2025 +0000

    [AArch64][GlobalISel] Add GISel coverage for i32 lround and lrint. NFC

commit 40aa91f12a498b42be4eabbdacfb4c5e25a77be1
Author: anoopkg6 <anoop.kumar6 at ibm.com>
Date:   Mon Dec 1 10:52:24 2025 -0600

    [TySan] TySan support for SystemZ - Re-submission of original pr#162396  (#169850)

    This is a re-submission of the original reverted patch
    [#162396](https://github.com/llvm/llvm-project/pull/162396) for adding
    TySan support for SystemZ, along with the build failure patch
    [#169746](https://github.com/llvm/llvm-project/pull/169746).

    See conversations in #169746.

    Co-authored-by: anoopkg6 <anoopkg6 at github.com>

commit fddf7b0510e5df7a08c512a177ea9c1ec4307718
Author: Durgadoss R <durgadossr at nvidia.com>
Date:   Mon Dec 1 22:19:34 2025 +0530

    [MLIR][NVVM] Update mbarrier.arrive.expect_tx Op (#169922)

    This patch updates the mbarrier.arrive.expect_tx Op.
    It also adds an Op for its arrive_drop version.

    * No change in the existing inline-asm lowering.
       This functionality continues to work as is.
    * An optional return value is added for shared_cta space.
    * The scope and semantics are added as attributes.
    * Inline-PTX lowering is available when `predicate` is provided.
      Otherwise, the Op lowers to intrinsics.
    * lit tests are added to verify the lowering to intrinsics.
    * Specific negative tests are added to check the invalid cases for
    inline-ptx lowering.

    Signed-off-by: Durgadoss R <durgadossr at nvidia.com>

commit f3cce97ba79ee507adfe4069ba907dcc842def31
Author: Krzysztof Parzyszek <Krzysztof.Parzyszek at amd.com>
Date:   Mon Dec 1 10:43:45 2025 -0600

    [flang][OpenMP] Remove directive-specific code from GetOmpDirectiveName, NFC (#170157)

    It is unnecessary; existing overloads handle these cases already.

commit bb06f909433dc053166c0f02d4f5164b83b5b39f
Author: LLVM GN Syncbot <llvmgnsyncbot at gmail.com>
Date:   Mon Dec 1 16:32:09 2025 +0000

    [gn build] Port 9438b741d449

commit 9438b741d4491b400cb04b4ec47aae0936e2e954
Author: Jonas Devlieghere <jonas at devlieghere.com>
Date:   Mon Dec 1 08:27:42 2025 -0800

    [lldb] Add VirtualDataExtractor for virtual address translation (#168802)

    Introduce VirtualDataExtractor, a DataExtractor subclass that enables
    reading data at virtual addresses by translating them to physical buffer
    offsets using a lookup table. The lookup table maps virtual address
    ranges to physical offsets and enforces boundaries to prevent reads from
    crossing entry limits.

    The new class inherits from DataExtractor, overriding GetData and
    PeekData to provide transparent virtual address translation for most of
    the DataExtractor methods. The exceptions are the unchecked methods,
    which bypass those overrides and are therefore overloaded as well.

commit 318d932ca028830625290227004180c6d9c776f9
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Mon Dec 1 16:24:40 2025 +0000

    [X86] combineConcatVectorOps - add handling to concat fp rounding intrinsics together (#170160)

commit 9f54c2a6743ed4770c2453bb3a8b4d7ee8e2b152
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Mon Dec 1 16:20:43 2025 +0000

    [X86] Add test coverage for the concatenation of vXf64 sqrt intrinsics (#170158)

commit 46c34bec134be0cb606fba1affbc70920b4fc266
Author: Jordan Rupprecht <rupprecht at google.com>
Date:   Mon Dec 1 10:00:53 2025 -0600

    [benchmark][NFC] Update cc_binary load (#169710)

    cc_binary now needs to be loaded from the rules_cc repo

    I don't think this file is actually used, but I'm updating it to be
    more syntactically correct anyway.

commit 979a987d3ad51d421f091730a5a1cf9326b47bbc
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Mon Dec 1 07:34:54 2025 -0800

    [WPD] Change Devirt Cutoff to use DebugCounter (#170009)

    This removes the presence of global state from within the pass which is
    blocking some efforts around test daemonization and is not good design
    practice in general for LLVM. See

    https://discourse.llvm.org/t/rfc-reducing-process-creation-overhead-in-llvm-regression-tests/88612/11
    for more discussion.

    This patch replaces the usage of global state with a DebugCounter, which
    helps fix the global state problem and also increases the flexibility of
    the option as now an explicit range can be passed.

    Co-authored-by: Mingming Liu <mingmingl at google.com>

commit b76cada909cff3c63a454a97fd247388a3650b4c
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Mon Dec 1 15:31:01 2025 +0000

    [X86] combineConcatVectorOps - add handling to concat RCPPS/RSQRTPS intrinsics together (#170148)

    Limited to 128->256 cases as we can't safely convert to the RCP14/RSQRT14 variants

commit 37858b087a00e4cd7dd6e9983d4f45b015e9e3a1
Author: Yu Hao <yuhaoyu at google.com>
Date:   Mon Dec 1 07:21:17 2025 -0800

    [clang][ASTMatchers] Add `arrayTypeLoc` ast matcher for ArrayTypeLoc (#168990)

    There's an `arrayType` matcher for matching `ArrayType`, but no matcher
    for `ArrayTypeLoc`. This change complements it.

    Note that there's already `hasElementTypeLoc` matcher, which was
    declared together with the `hasElementType` matcher.

commit 73889c35713ecc659935445ef066fa74ae62f3fa
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Mon Dec 1 07:12:40 2025 -0800

    [llvm-exegesis] Add CLI Option to set Fixed RNG seed

    The primary motivation for this is to set a fixed RNG seed for flaky
    tests. This also has the bonus of adding debug logging for what seed
    gets used, which can make it much easier to reproduce issues that only
    happen occasionally and are seed-dependent.

    Reviewers: sjoerdmeijer, davemgreen, mshockwave

    Reviewed By: davemgreen

    Pull Request: https://github.com/llvm/llvm-project/pull/170013

commit 10ceca8a9661fb700dc1288ba0cc21188663b2b9
Author: Aaron <aaron at tinyblob.com>
Date:   Mon Dec 1 15:06:48 2025 +0000

    [lldb-dap] Fix segfault in JSONUtils.cpp when GetUUIDString() returns nullptr (#169844)

    When creating a stack frame in JSONUtils.cpp's CreateStackFrame(), the
    code constructs a std::string from module.GetUUIDString(), which can
    return nullptr in some cases (as documented in the implementation of
    SBModule::GetUUIDString()). This causes a segmentation fault when the
    result is passed to the std::string constructor.

    This fix adds a null check before constructing the UUID string, falling
    back to an empty string if nullptr is returned. The existing empty check
    ensures the moduleId field is omitted from the JSON when no UUID exists.
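
    A minimal sketch of the guarded construction; the function, object, and
    field names below are illustrative stand-ins for the actual lldb-dap
    code:

    ```
    #include "lldb/API/SBModule.h"
    #include "llvm/Support/JSON.h"
    #include <string>

    void addModuleId(lldb::SBModule &module, llvm::json::Object &object) {
      const char *UUIDCStr = module.GetUUIDString(); // may be nullptr
      std::string ModuleId = UUIDCStr ? UUIDCStr : "";
      if (!ModuleId.empty())
        object.try_emplace("moduleId", ModuleId); // omitted when no UUID exists
    }
    ```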

    rdar://163811812

    ---------

    Co-authored-by: Ebuka Ezike <yerimyah1 at gmail.com>

commit 7b6bf8b060f74669d7027d33f488a35cfb448b29
Author: Arseniy Zaostrovnykh <necto.ne at gmail.com>
Date:   Mon Dec 1 15:48:59 2025 +0100

    [NFC][analyzer] const ptr param in AnalysisConsumer::getModeForDecl (#170145)

    This is a tiny change that would make the function contract more clear
    and our work downstream easier.

commit c7c6c0a45c1d840d05b414d73f7bab5136dcb8c2
Author: Yingwei Zheng <dtcxzyw2333 at gmail.com>
Date:   Mon Dec 1 22:46:16 2025 +0800

    [AggressiveInstCombine] Fix memory location for alias analysis (#169953)

    When LOps.RootInsert comes after LI2, since we use LI2 as the new insert
    point, we should make sure the memory region accessed by LOps isn't
    modified. However, the original implementation passes the bit width
    `LOps.LoadSize` as the number of bytes to be accessed, causing BasicAA
    to return NoAlias:

    https://github.com/llvm/llvm-project/blob/a941e150749650e6a75e948f10d46b0bedcc128b/llvm/lib/Analysis/BasicAliasAnalysis.cpp#L1658-L1667
    With `-aa-trace`, we get:
    ```
    End ptr getelementptr inbounds nuw (i8, ptr @g, i64 4) @ LocationSize::precise(1),   %gep1 = getelementptr i8, ptr %p, i64 4 @ LocationSize::precise(32) = NoAlias
    ```
    This patch uses `getTypeStoreSize` to compute the correct access size
    for LOps. Instead of modifying the MemoryLocation for End (i.e.,
    `LOps.RootInsert`), it also uses the computed base and AATag for
    correctness.

    Closes https://github.com/llvm/llvm-project/issues/169921.
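
    A minimal sketch of the corrected alias query; the function parameters
    stand in for values the combine already has on hand:

    ```
    #include "llvm/Analysis/MemoryLocation.h"
    #include "llvm/IR/DataLayout.h"

    llvm::MemoryLocation locationFor(const llvm::Value *Base,
                                     llvm::Type *LoadTy,
                                     const llvm::DataLayout &DL,
                                     const llvm::AAMDNodes &AATags) {
      // The size must be the store size in bytes, not LOps.LoadSize (a bit
      // width); otherwise BasicAA sees a much smaller location and may
      // answer NoAlias.
      return llvm::MemoryLocation(
          Base, llvm::LocationSize::precise(DL.getTypeStoreSize(LoadTy)),
          AATags);
    }
    ```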

commit 97e0573f9e16fb6b7970130ff24e5c9eba98e164
Author: Erich Keane <ekeane at nvidia.com>
Date:   Mon Dec 1 06:44:42 2025 -0800

    [CIR] Start printing/parsing func 'attributes' (#169674)

    This patch adds the ability to print and parse MLIR-standard
    'attributes' alongside the standard function printing for func.

    This patch also seeds the initial "disallowed" list so that we don't
    print things that we have custom printing for, AND will disallow them
    from being parsed. I believe this list to be complete, and it passes all
    tests.

    This printing of attributes is necessary for testing some OpenACC things
    for which adding dedicated custom func printing seems unnecessary.

commit 3b9e203364dcd8234b12eb447ddbcf97a877558c
Author: Nikolas Klauser <nikolasklauser at berlin.de>
Date:   Mon Dec 1 15:42:33 2025 +0100

    [Clang] Add __builtin_common_reference (#121199)

commit fa6d611f0a352967eefb8a8175f1556241cacc17
Author: Ryotaro Kasuga <kasuga.ryotaro at fujitsu.com>
Date:   Mon Dec 1 23:34:10 2025 +0900

    [DA] Remove special handling for SCEVAddExpr in GCD MIV (#169927)

    In `gcdMIVtest`, there is logic that assumes, without any checks, that
    the addition(s) of a `SCEVAddExpr` don't overflow. Adding overflow
    checks would be fine, but this part appears to be of little use, so this
    patch removes it.

    Fix one of the tests added in #169926.

commit 235d44d8b6f40b8804537d950d5655fcfe80d9c7
Author: Mehdi Amini <joker.eph at gmail.com>
Date:   Mon Dec 1 05:50:31 2025 -0800

    Fix LLVM test to use %python instead of python

    This uses lit substitution, which fixes running this test in
    environments where 'python' isn't in the path.

commit aa04b654b4113d3e2c1a36baf769d601ab378096
Author: Mend Renovate <bot at renovateapp.com>
Date:   Mon Dec 1 14:26:22 2025 +0000

    [Github] Update GHA Dependencies (#170057)

    This PR contains the following updates:

    | Package | Type | Update | Change | Pending |
    |---|---|---|---|---|
    | [actions/setup-python](https://redirect.github.com/actions/setup-python) | action | minor | `v6.0.0` -> `v6.1.0` | |
    | [github/codeql-action](https://redirect.github.com/github/codeql-action) | action | patch | `v4.31.4` -> `v4.31.5` | `v4.31.6` |
    | [hendrikmuhs/ccache-action](https://redirect.github.com/hendrikmuhs/ccache-action) | action | patch | `v1.2.19` -> `v1.2.20` | |

commit 1ced99aa4a989b54bda8a68f0f39ecd9004afd81
Author: Mend Renovate <bot at renovateapp.com>
Date:   Mon Dec 1 14:23:12 2025 +0000

    [Github] Update actions/upload-artifact action to v5 (#170058)

    This PR contains the following updates:

    | Package | Type | Update | Change |
    |---|---|---|---|
    | [actions/upload-artifact](https://redirect.github.com/actions/upload-artifact) | action | major | `v4.6.2` -> `v5.0.0` |

commit ad656d3a1954dd6157ba689b3003b6fbb97a0833
Author: Jakub Kuderski <jakub at nod-labs.com>
Date:   Mon Dec 1 09:19:07 2025 -0500

    [mlir][linalg][arm] Fix use of fill in arm integration tests (#170143)

    Follow up to
    https://github.com/llvm/llvm-project/pull/169567#issuecomment-3596220014

commit 2538f6382a10af359c05a07738a0021f9eae221a
Author: Igor Wodiany <igor.wodiany at imgtec.com>
Date:   Mon Dec 1 14:15:02 2025 +0000

    [mlir][spirv] Support (de)serialization of block operands in `spirv.Switch` (#168899)

commit 4978cd3cdf64fb1cd87f1ddf73fc44bb8ca223c2
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Mon Dec 1 06:08:04 2025 -0800

    Revert "Fix LLVM test to use %python instead of python"

    This reverts commit b4c30b0e1ece2bc97ef91e4bbed422c2e620be05.

    This substitution is not available from within these tests.

commit aaa59e34894d3d0648631776afe2b297e2ad0895
Author: Nico Weber <thakis at chromium.org>
Date:   Mon Dec 1 09:05:28 2025 -0500

    [gn] port 29fef3a51e6d (bolt PassTests)

commit 461433fea23d18d6a9da73bf09698bd4b3c68ef6
Author: Ryotaro Kasuga <kasuga.ryotaro at fujitsu.com>
Date:   Mon Dec 1 23:03:52 2025 +0900

    [DA] Add overflow check when calculating Delta in GCD MIV (#169928)

    Add overflow check when computing `Delta` in `gcdMIVtest`.

    Fix one of the tests added by #169926.

commit b4c30b0e1ece2bc97ef91e4bbed422c2e620be05
Author: Mehdi Amini <joker.eph at gmail.com>
Date:   Mon Dec 1 05:50:31 2025 -0800

    Fix LLVM test to use %python instead of python

    This uses lit substitution, which fixes running this test in
    environments where 'python' isn't in the path.

commit b27301ff5d9ab39ab4dfc5d0041273cdd80546a4
Author: Ryan Holt <ryanholt at mathworks.com>
Date:   Mon Dec 1 08:52:20 2025 -0500

    [mlir][linalg] Re-enable linalg runtime verification test (#170129)

    Test seems to pass after re-enabling without any additional changes.

commit c25ad27174c47f01c7bd542fac55e8a7cdec5b73
Author: David Green <david.green at arm.com>
Date:   Mon Dec 1 13:43:16 2025 +0000

    [AArch64] Remove unused references to MVT::f80. (#169545)

    These f80 fp types are only supported on X86 and can be removed from
    AArch64. It looks like they were copied from another backend by mistake.

commit d431f38860ff6759bb9648e5620d587c6581b951
Author: Ryotaro Kasuga <kasuga.ryotaro at fujitsu.com>
Date:   Mon Dec 1 22:36:01 2025 +0900

    [DA] Add tests for GCD MIV misses dependency due to overflow (NFC) (#169926)

    Add two test cases where dependencies are missed due to overflows. These
    will be fixed by #169927 and #169928, respectively.

commit 8808beeb1a35c8f2ffe228b9e91af5067388f909
Author: Robert Imschweiler <robert.imschweiler at amd.com>
Date:   Mon Dec 1 14:18:31 2025 +0100

    Reland: [OpenMP] Implement omp_get_uid_from_device() / omp_get_device_from_uid() (#168554)

    Reland https://github.com/llvm/llvm-project/pull/164392, with Fortran support moved to a follow-up PR.

commit 4a6451af7b945bb8283ee71bf9628b385bd69ec0
Author: Jasmine Tang <jjasmine at igalia.com>
Date:   Mon Dec 1 12:53:47 2025 +0000

    Fix typo in attr.td: Avaiable -> Available (#170116)

    Follow up to #163618

commit 05ad84095a04adba2a0d8699629fc3db705b23f6
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Mon Dec 1 12:45:45 2025 +0000

    [X86] combineConcatVectorOps - add handling to concat sqrt intrinsics together (#170113)

    Similar to fdiv, we should be trying to concat these high latency instructions together

commit d3edc94d113d2d30a7a26fa4d72496ac0b9256b8
Author: Giacomo Castiglioni <giacastiglioni at gmail.com>
Date:   Mon Dec 1 13:39:59 2025 +0100

    [MLIR][GPU] subgroup_mma fp64 extension - take 2 (#169061)

    This PR re-lands #165873.

    This PR extends the gpu.subgroup_mma_* ops to support the fp64 type.
    The extension requires special handling during the lowering to nvvm due
    to the return type of the load ops for fragments a and b (they return a
    scalar instead of a struct).

    The original PR did not guard the new test based on the required
    architecture (sm80), which led to a failure on the cuda runners with T4
    GPUs.

commit 8478de3d00a7a16b532b3902d5d9794405ae2379
Author: Paul Walker <paul.walker at arm.com>
Date:   Mon Dec 1 12:32:58 2025 +0000

    [LLVM][CodeGen] Remove failure cases when widening EXTRACT/INSERT_SUBVECTOR. (#162308)

    This PR implements catch-all handling for widening the scalable
    subvector operand (INSERT_SUBVECTOR) or result (EXTRACT_SUBVECTOR). It
    does this via the stack using masked memory operations. With general
    handling available we can add optimizations for specific cases.

commit 989ac4c9db3aaa660dcfd0d1d5683b4c07dffaec
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Mon Dec 1 12:07:01 2025 +0000

    [X86] Add tests showing failure to concat fp rounding intrinsics together. (#170108)

commit 6157d4625941870392a0f5377b8ab08c4c204ce4
Author: Sohaib Iftikhar <sohaibiftikhar at google.com>
Date:   Mon Dec 1 13:00:58 2025 +0100

    [MLIR|BUILD]: Fix for 8ceeba838 (#170110)

commit 58770200a7045dd46dfb8c85299eee504d95026c
Author: Ryotaro Kasuga <kasuga.ryotaro at fujitsu.com>
Date:   Mon Dec 1 20:57:09 2025 +0900

    [DA] Clean up unnecessary member function declarations (#170106)

    Follow-up for #169047. The previous PR moved some functions from DA to
    Delinearization, but the member function declarations were not updated
    accordingly. This patch removes them.

commit d0df51bc93fb5a254dd8a05752b782a13dc1f64d
Author: Luke Lau <luke at igalia.com>
Date:   Mon Dec 1 19:51:56 2025 +0800

    [ConstantRange] Allow casting to the same bitwidth. NFC (#170102)

    From the review in
    https://github.com/llvm/llvm-project/pull/169527#discussion_r2567122387,
    there are some users where we want to extend or truncate a ConstantRange
    only if it's not already the destination bitwidth. Previously this
    asserted, so this PR relaxes it to just be a no-op, similar to
    IRBuilder::createZExt and friends.

commit 48931e5e5942304afd1c0a493be91b662ffd221b
Author: Timm Baeder <tbaeder at redhat.com>
Date:   Mon Dec 1 12:43:35 2025 +0100

    [clang][bytecode] Check memcmp builtin for one-past-the-end pointers (#170097)

    We can't read from those and will run into an assertion sooner or later.

    Fixes https://github.com/llvm/llvm-project/issues/170031

commit 577cd6fb02959270dcdc48864ea0fba1d540cef4
Author: Mehdi Amini <joker.eph at gmail.com>
Date:   Mon Dec 1 12:39:25 2025 +0100

    [LIT] Workaround the 60 processes limit on Windows (#157759)

    Python multiprocessing is limited to 60 workers at most:

    https://github.com/python/cpython/blob/6bc65c30ff1fd0b581a2c93416496fc720bc442c/Lib/concurrent/futures/process.py#L669-L672

    Since the limit is per thread pool, we can work around it by using
    multiple pools on Windows when we want to actually use more workers.

commit 130746addfed03e9a53b62dfc0da47e2c18ee959
Author: Jan Patrick Lehr <JanPatrick.Lehr at amd.com>
Date:   Mon Dec 1 12:37:09 2025 +0100

    [MLIR] Fix build after #169982 (#170107)

commit edd1856686a44db896d64a3083619dfcc473a65f
Author: Jasmine Tang <jjasmine at igalia.com>
Date:   Mon Dec 1 11:32:46 2025 +0000

    [WebAssembly] Optimize away mask of 63 for shl ( zext (and i32 63))) (#152397)

    Fixes https://github.com/llvm/llvm-project/issues/71844
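
    A small illustration, not the actual backend code, of why the mask is
    redundant: WebAssembly's 64-bit shift takes its count modulo 64, so
    masking an already sub-64 value with 63 is a no-op.

    ```cpp
    #include <cstdint>

    // C model of wasm's i64.shl: the shift count is taken modulo 64.
    uint64_t wasm_i64_shl(uint64_t v, uint64_t amt) { return v << (amt & 63); }

    // shl(zext(x & 63)) and shl(zext(x)) agree for every x, since
    // (x & 63) & 63 == x & 63 == x mod 64, so the explicit mask can be
    // dropped.
    uint64_t with_mask(uint64_t v, uint32_t x) { return wasm_i64_shl(v, x & 63u); }
    uint64_t without_mask(uint64_t v, uint32_t x) { return wasm_i64_shl(v, x); }
    ```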

commit 0e721b75aaa39181c71e798d5a95102eb349bf1c
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Mon Dec 1 11:28:34 2025 +0000

    [X86] Add tests showing failure to concat RCPPS + RSQRTPS intrinsics together. (#170098)

    Can only do this for 128->256 cases as we can't safely convert to the RCP14/RSQRT14 variants

commit 6c0a02f2adb4dd92c965bd5a70f19d59d4c597a5
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Mon Dec 1 11:23:43 2025 +0000

    [X86] Add tests showing failure to concat sqrt intrinsics together. (#170096)

    Similar to fdiv, we should be trying to concat these high latency instructions together

commit bf22687c4842fe4f78cee34ec4e5e2d3e6e1fb59
Author: Tom Eccles <tom.eccles at arm.com>
Date:   Mon Dec 1 11:23:14 2025 +0000

    [OMPIRBuilder] CANCEL IF(FALSE) is still a cancellation point (#170095)

    From OpenMP 4.0:

    > When an if clause is present on a cancel construct and the if
    expression
    > evaluates to false, the cancel construct does not activate
    cancellation.
    > The cancellation point associated with the cancel construct is always
    > encountered regardless of the value of the if expression.

    This wording is retained unmodified in OpenMP 6.0.

    This re-opens the already approved PR #164587, which was closed by
    accident. The only changes are a rebase.

commit b60a84a46fa558dd14497f53fc8ad6f7ff505aaa
Author: Tom Eccles <tom.eccles at arm.com>
Date:   Mon Dec 1 11:19:12 2025 +0000

    Revert "[flang][TBAA] refine TARGET/POINTER encoding" (#170105)

    Reverts llvm/llvm-project#169544

    [Regressed](https://lab.llvm.org/buildbot/#/builders/143/builds/12956)
    gfortran test suite

commit 2c217909839b345760de964cf87bf1045c9ff784
Author: Ming Yan <ming.yan at terapines.com>
Date:   Mon Dec 1 19:02:02 2025 +0800

    Revert "[MLIR][SCF] Sink scf.if from scf.while before region into after region in scf-uplift-while-to-for" (#169888)

    Reverts llvm/llvm-project#165216
    It is implemented in #169892.

commit 29fef3a51e6dcc5e6b5683c281ce7c19b19f0bbf
Author: Gergely Bálint <gergely.balint at arm.com>
Date:   Mon Dec 1 12:00:31 2025 +0100

    [BOLT] Improve DWARF CFI generation for pac-ret binaries (#163381)

    During the InsertNegateRAState pass we check the annotations on
    instructions to decide where to generate the OpNegateRAState CFIs in
    the output binary.

    As only instructions in the input binary were annotated, we have to make
    a judgement on instructions generated by other BOLT passes.
    Incorrect placement may cause issues when an (async) unwind request
    is received during the new "unknown" instructions.

    This patch adds more logic to make a more informed decision by taking
    into account that:
    - Unknown instructions in a BasicBlock have the same RAState as the
      other instructions in that block. Previously, if the BasicBlock
      started with an unknown instruction, the RAState was copied from the
      preceding block. Now, the RAState is copied from the succeeding
      instructions in the same block.
    - Some BasicBlocks may only contain instructions with unknown RAState.
      As explained in issue #160989, these blocks already have incorrect
      unwind info. Because of this, the last known RAState based on the
      layout order is copied.

    Updated bolt/docs/PacRetDesign.md to reflect changes.

commit 8ceeba83812d551423a9e50f600cc77ea4718ca2
Author: Ming Yan <ming.yan at terapines.com>
Date:   Mon Dec 1 18:54:21 2025 +0800

    [MLIR][SCF] Canonicalize redundant scf.if from scf.while before region into after region (#169892)

    When a `scf.if` directly precedes a `scf.condition` in the before region
    of a `scf.while` and both share the same condition, move the if into the
    after region of the loop. This helps simplify the control flow to enable
    uplifting `scf.while` to `scf.for`.

commit b7721c55fc09616d186bbe1f9e3e4b9df8fb4009
Author: Jim Lin <jim at andestech.com>
Date:   Mon Dec 1 13:32:45 2025 +0800

    [RISCV] Remove the duplicate for RV32/RV64 in zicond-fp-select-zfinx.ll. NFC.

commit d1500d12be60f21f9a80fdbfb3cfa24b8f20a0c9
Author: Luke Lau <luke at igalia.com>
Date:   Mon Dec 1 18:33:50 2025 +0800

    [SelectionDAG] Add SelectionDAG::getTypeSize. NFC (#169764)

    Similar to how getElementCount avoids the need to reason about fixed and
    scalable ElementCounts separately, this patch adds getTypeSize to do the
    same for TypeSize.

    It also goes through and replaces some of the manual uses of getVScale
    with getTypeSize/getElementCount where possible.

commit b1620996f49611767d1950927835fa20284355d5
Author: Timm Baeder <tbaeder at redhat.com>
Date:   Mon Dec 1 11:33:33 2025 +0100

    [clang][bytecode] Fix discarding ImplicitValueInitExprs (#170089)

    They don't have side-effects, so this should be fine.

    Fixes https://github.com/llvm/llvm-project/issues/170064

commit 2c9e9ffa77e37fa0ff5d15325dab5471636b8a44
Author: Luke Lau <luke at igalia.com>
Date:   Mon Dec 1 18:29:21 2025 +0800

    [SCCP] Handle llvm.experimental.get.vector.length calls (#169527)

    As noted in the reproducer provided in
    https://github.com/llvm/llvm-project/issues/164762#issuecomment-3554719231,
    on RISC-V after LTO we sometimes have trip counts exposed to vectorized
    loops. The loop vectorizer will have generated calls to
    @llvm.experimental.get.vector.length, but there are [some
    properties](https://llvm.org/docs/LangRef.html#id2399) about the
    intrinsic we can use to simplify it:

    - The result is always less than both Count and MaxLanes
    - If Count <= MaxLanes, then the result is Count

    This teaches SCCP to handle these cases with the intrinsic, which allows
    some single-iteration-after-LTO loops to be unfolded.

    #169293 is related and also simplifies the intrinsic in InstCombine via
    computeKnownBits, but it can't fully remove the loop since
    computeKnownBits only does limited reasoning on recurrences.
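
    Illustrative only (an assumed helper, not the SCCP code): the folding
    rule described above, expressed as plain arithmetic on known constant
    operands.

    ```cpp
    #include <cstdint>
    #include <optional>

    // If the trip count is already no larger than the maximum number of
    // lanes, llvm.experimental.get.vector.length is guaranteed to return
    // the trip count, so the call can be replaced by that constant.
    std::optional<uint64_t> foldGetVectorLength(uint64_t Count,
                                                uint64_t MaxLanes) {
      if (Count <= MaxLanes)
        return Count;
      // Otherwise the exact value is unknown here; it is still bounded by
      // both Count and MaxLanes.
      return std::nullopt;
    }
    ```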

commit 8ec2112ec8b43a0fdf8f5e000f0c6376b6105987
Author: Tom Eccles <tom.eccles at arm.com>
Date:   Mon Dec 1 10:07:19 2025 +0000

    [OMPIRBuilder] re-land cancel barriers patch #164586 (#169931)

    A barrier will pause execution until all threads reach it. If some go to
    a different barrier then we deadlock. This manifests in that the
    finalization callback must only be run once. Fix by ensuring we always
    go through the same finalization block whether the thread is cancelled
    or not, no matter which cancellation point causes the cancellation.

    The old callback only affected PARALLEL, so it has been moved into the
    code generating PARALLEL. For this reason, we don't need similar changes
    for other cancellable constructs. We need to create the barrier on the
    shared exit from the outlined function instead of only on the cancelled
    branch to make sure that threads exiting normally (without cancellation)
    meet the same barriers as those which were cancelled. For example,
    previously we might have generated code like

    ```
    ...
      %ret = call i32 @__kmpc_cancel(...)
      %cond = icmp eq i32 %ret, 0
      br i1 %cond, label %continue, label %cancel

    continue:
      // do the rest of the callback, eventually branching to %fini
      br label %fini

    cancel:
      // Populated by the callback:
      // unsafe: if any thread makes it to the end without being cancelled
      // it won't reach this barrier and then the program will deadlock
      %unused = call i32 @__kmpc_cancel_barrier(...)
      br label %fini

    fini:
      // run destructors etc
      ret
    ```

    In the new version the barrier is moved into fini. I generate it *after*
    the destructors because the standard describes the barrier as occurring
    after the end of the parallel region.

    ```
    ...
      %ret = call i32 @__kmpc_cancel(...)
      %cond = icmp eq i32 %ret, 0
      br i1 %cond, label %continue, label %cancel

    continue:
      // do the rest of the callback, eventually branching to %fini
      br label %fini

    cancel:
      br label %fini

    fini:
      // run destructors etc
      // safe so long as every exit from the function happens via this block:
      %unused = call i32 @__kmpc_cancel_barrier(...)
      ret
    ```

    To achieve this, the barrier is now generated alongside the finalization
    code instead of in the callback. This is the reason for the changes to
    the unit test.

    I'm unsure if I should keep the incorrect barrier generation callback
    only on the cancellation branch in clang with the OMPIRBuilder backend
    because that would match clang's ordinary codegen. Right now I have
    opted to remove it entirely because it is a deadlock waiting to happen.

    ---

    This re-lands #164586 with a small fix for a failing buildbot running
    address sanitizer on clang lit tests.

    In the previous version of the patch I added an insertion point guard
    "just to be safe" and never removed it. There isn't insertion point
    guarding on the other route out of this function and we do not
    preserve the insertion point around getFiniBB either so it is not
    needed here.

    The problem flagged by the sanitizers was because the saved insertion
    point pointed to an instruction which was then removed inside the FiniCB
    for some clang codegen functions. The instruction was freed when it was
    removed. Then accessing it to restore the insertion point was a use
    after free bug.

commit 34c44f21ae9bf5532e467fa2e942fe61715d1394
Author: Tom Eccles <tom.eccles at arm.com>
Date:   Mon Dec 1 10:05:56 2025 +0000

    [flang][TBAA] refine TARGET/POINTER encoding (#169544)

    Previously we were less specific for POINTER/TARGET: encoding that they
    could alias with (almost) anything.

    In the new system, the "target data" tree is now a sibling of the other
    trees (e.g. "global data"). POINTER variables go at the root of the
    "target data" tree, whereas TARGET variables get their own nodes under
    that tree. For example,

    ```
    integer, pointer :: ip
    real, pointer :: rp
    integer, target :: it
    integer, target :: it2(:)
    real, target :: rt
    integer :: i
    real :: r
    ```
    - `ip` and `rp` may alias with any variable except `i` and `r`.
    - `it`, `it2`, and `rt` may alias only with `ip` or `rp`.
    - `i` and `r` cannot alias with any other variable.

    Fortran 2023 15.5.2.14 gives restrictions on entities associated with
    dummy arguments. These do not allow non-target globals to be modified
    through dummy arguments and therefore I don't think we need to make all
    globals alias with dummy arguments.

    I haven't implemented it in this patch, but I wonder whether it is ever
    possible for `ip` to alias with `rt` or even `it2`.

    While I was updating the tests I fixed up some tests that still assumed
    that local alloc tbaa wasn't the default.

    I found no functional regressions in the gfortran test suite, fujitsu
    test suite, spec2017, or a selection of HPC apps we test internally.

commit 1317083530b95fcf052f3017394a7719a67546fa
Author: Benjamin Maxwell <benjamin.maxwell at arm.com>
Date:   Mon Dec 1 09:55:49 2025 +0000

    [AArch64][SME] Support saving/restoring ZT0 in the MachineSMEABIPass (#166362)

    This patch extends the MachineSMEABIPass to support ZT0. This is done
    with the addition of two new states:

    - `ACTIVE_ZT0_SAVED`
      * This is used when calling a function that shares ZA, but does not
        share ZT0 (i.e., no ZT0 attributes)
      * This state indicates ZT0 must be saved to the save slot, but ZA must
        remain on, with no lazy save setup
    - `LOCAL_COMMITTED`
      * This is used for saving ZT0 in functions without ZA state
      * This state indicates ZA is off and ZT0 has been saved
      * This state is general enough to support ZA, but the required
        transitions have not been implemented†

    To aid with readability, the state transitions have been reworked to a
    switch of `transitionFrom(<FromState>).to(<ToState>)`, rather than
    nested ifs, which helps manage more transitions.

    † This could be implemented to handle some cases of undefined behavior
    better.

commit dda15ad0aadf0bf485498e3d5f22e5caf94925e5
Author: Igor Wodiany <igor.wodiany at imgtec.com>
Date:   Mon Dec 1 09:43:25 2025 +0000

    [mlir][spirv] Use MapVector for BlockMergeInfoMap (#169636)

    This should ensure that the structurizer while loop is deterministic
    across runs. Use of `MapVector` addresses the source of the
    nondeterminism which is use of a `Block*` as a map key.

    fixes #128547

commit 8e6fb0ee84dcfba7e712f3ee4cc9d9819bc2a757
Author: Gergely Bálint <gergely.balint at arm.com>
Date:   Mon Dec 1 10:20:23 2025 +0100

    Reapply "[BOLT][BTI] Skip inlining BasicBlocks containing indirect tailcalls" (#169881) (#169929)

    This reapplies commit 5d6d74359d69d3aada6a46c7cf51d84eb0848b70.

    Fix: added assertions to the requirements of the test

    --------

    Original commit message:

    In the Inliner pass, tailcalls are converted to calls in the inlined
    BasicBlock. If the tailcall is indirect, the `BR` is converted to `BLR`.

    These instructions require different BTI landing pads at their targets.

    As the targets of indirect tailcalls are unknown, inlining such blocks
    is unsound for BTI: they should be skipped instead.

commit 8079d033c97f3ad8d289fa014b0f1c85cf3bbbad
Author: Steven Wu <stevenwu at apple.com>
Date:   Mon Dec 1 17:10:39 2025 +0800

    [CAS] Temporarily skip tests on old windows version (#170063)

commit eb711d8e142683e06ae14b652218b881896f5046
Author: Carlos Galvez <carlosgalvezp at gmail.com>
Date:   Mon Dec 1 09:50:19 2025 +0100

    [clang-tidy][doc] Fix incorrect link syntax in cppcoreguidelines-pro-… (#170088)

    …bounds-avoid-unchecked-container-access

    Missing a trailing underscore to render it as a link.

    Co-authored-by: Carlos Gálvez <carlos.galvez at zenseact.com>

commit 147c466bcd0efcd3efe7b403db441ec8d4912d6a
Author: Matthias Springer <me at m-sp.org>
Date:   Mon Dec 1 16:50:02 2025 +0800

    [mlir][arith] Add support for min/max to `ArithToAPFloat` (#169760)

    Add support for `arith.minnumf`, `arith.maxnumf`, `arith.minimumf`,
    `arith.maximumf`.

commit 9afb651613a9383923b0f52885fb2221a5ec134f
Author: ShashwathiNavada <shashwathinavada at gmail.com>
Date:   Mon Dec 1 14:03:32 2025 +0530

    Adding support for iterator in motion clauses. (#159112)

    As described in section 2.14.6 of openmp spec, the patch implements
    support for iterator in motion clauses.

    ---------

    Co-authored-by: Shashwathi N <nshashwa at pe31.hpc.amslabs.hpecorp.net>

commit 05b19895510af314a78ed42c6a969c4478a8f496
Author: Matthias Springer <me at m-sp.org>
Date:   Mon Dec 1 16:28:23 2025 +0800

    [mlir][arith] Add support for `negf` to `ArithToAPFloat` (#169759)

    Add support for `arith.negf`.

commit f67b01847031aadd4d9d9b90e82c99d0490c4287
Author: Matthias Springer <me at m-sp.org>
Date:   Mon Dec 1 16:15:15 2025 +0800

    [mlir][SPIRV] Improve ub.unreachable lowering test case (#170083)

    Addresses a comment on the PR that introduces the ub.unreachable ->
    spirv.Unreachable lowering
    (https://github.com/llvm/llvm-project/pull/169872#discussion_r2573670611).

commit 7ce71414ec3c7eebe77c1c248c119a7df5067369
Author: Abhishek Varma <avarma094 at gmail.com>
Date:   Mon Dec 1 13:44:15 2025 +0530

    [NFC][Linalg] Follow-up on ConvMatchBuilder (#170080)

    -- This commit addresses [follow-up review comments on
    169704](https://github.com/llvm/llvm-project/pull/169704#pullrequestreview-3521785548).
    -- Contains NFC nit/minor changes.

    Signed-off-by: Abhishek Varma <abhvarma at amd.com>

commit 17677ad7eb2b2391d61c976887bbd2616e7d6c3e
Author: David Sherwood <david.sherwood at arm.com>
Date:   Mon Dec 1 08:12:41 2025 +0000

    [LV] Don't create WidePtrAdd recipes for scalar VFs (#169344)

    While attempting to remove the use of undef from more loop vectoriser
    tests I discovered a bug where this assert was firing:

    ```
    llvm::Constant* llvm::Constant::getSplatValue(bool) const: Assertion `this->getType()->isVectorTy() && "Only valid for vectors!"' failed.
    ...
     #8 0x0000aaaab9e2fba4 llvm::Constant::getSplatValue
     #9 0x0000aaaab9dfb844 llvm::ConstantFoldBinaryInstruction
    ```

    This seems to be happening because we are incorrectly generating
    WidePtrAdd recipes for scalar VFs. The PR fixes this by checking whether
    a plan has a scalar VF only in legalizeAndOptimizeInductions.

    This PR also removes the use of undef from the test `both` in
    Transforms/LoopVectorize/iv_outside_user.ll, which is what started
    triggering the assert.

    Fixes #169334

commit 4d7abe535512e1076ff7e5fea14afde29615a8ed
Author: Matthias Springer <me at m-sp.org>
Date:   Mon Dec 1 16:12:11 2025 +0800

    [mlir][arith] Add support for `cmpf` to `ArithToAPFloat` (#169753)

    Add support for `arith.cmpf`.

commit a751ed97acf1ea760d6724bc6ea72b1b9b59a448
Author: Vasily Leonenko <vleonen at users.noreply.github.com>
Date:   Mon Dec 1 10:55:00 2025 +0300

    [BOLT] Support runtime library hook via DT_INIT_ARRAY (#167467)

    The major part of this PR is a commit implementing support for
    DT_INIT_ARRAY for BOLT runtime library initialization. It also adds a
    related hook-init test & fixes a couple of X86 instrumentation tests.

    This commit follows implementation of instrumentation hook via
    DT_FINI_ARRAY (https://github.com/llvm/llvm-project/pull/67348) and
    extends it for BOLT runtime libraries (including instrumentation
    library) initialization hooking.

    Initialization has differences compared to finalization:
    - Executables always use the ELF entry point address. The update code
    checks this and updates the init_array entry if the ELF is a shared
    library (has no interp entry) and has no DT_INIT entry. This commit also
    introduces the "runtime-lib-init-hook" option to select the primary
    initialization hook (entry_point, init, init_array), with fallback to
    the next available hook in the input binary, e.g. for libc we can
    explicitly set it to init_array.
    - Relocations for shared library init_array entries usually have
    R_AARCH64_ABS64 type on AArch64 binaries. We check the relocation type
    and adjust the methods for reading init_array relocations in the
    discovery and update methods.

    ---------

    Co-authored-by: Vasily Leonenko <vasily.leonenko at huawei.com>

commit bbb0dbadfaf292766922f5914f1c8946e2ef8519
Author: Timm Baeder <tbaeder at redhat.com>
Date:   Mon Dec 1 08:33:54 2025 +0100

    [clang][AST] Add `RecordDecl::getNumFields()` (#170022)

    Not sure why that didn't exist yet, but we have quite a few places using
    the same `std::distance` pattern.

commit dc5ce79cc143e2e33e9cabbaa41349199b919cda
Author: Luke Lau <luke at igalia.com>
Date:   Mon Dec 1 15:22:45 2025 +0800

    [LV] Regenerate some check lines. NFC

    The scalar loop doesn't exist anymore after 8907b6d39371d439461cdd3475d5590f87821377

commit 9416b19e4f3b471216dcc3fcabac98f2a430faea
Author: Yingwei Zheng <dtcxzyw2333 at gmail.com>
Date:   Mon Dec 1 15:20:45 2025 +0800

    [InstCombine] Add missing constant check (#170068)

    `cast<Constant>` is not guarded by a type check during canonicalization
    of predicates. This patch adds a type check in the outer if to avoid the
    crash. `dyn_cast` may introduce another nested if, so I just use
    `isa<Constant>` instead.

    Address the crash reported in
    https://github.com/llvm/llvm-project/pull/153053#issuecomment-3593914124.
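
    A minimal sketch of the guard pattern (assumed names, not the exact
    InstCombine code): the type check moves into the outer if so the later
    cast<> can no longer fire on a non-constant operand.

    ```cpp
    #include "llvm/IR/Constants.h"
    #include "llvm/IR/Value.h"
    using namespace llvm;

    static Constant *asConstantOrNull(Value *V) {
      // The crash came from calling cast<Constant>(V) with no type check;
      // guarding with isa<Constant> in the outer condition avoids it
      // without introducing another nested if for a dyn_cast result.
      if (isa<Constant>(V))
        return cast<Constant>(V);
      return nullptr;
    }
    ```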

commit 036279addf48cc5a5d7596f4abd06d33242f4f19
Author: Jason Molenda <jmolenda at apple.com>
Date:   Sun Nov 30 21:40:13 2025 -0800

    [lldb][debugserver] Return shared cache filepath in jGetSharedCacheInfo (#168474)

    Add a "shared_cache_path" key-value to the jGetSharedCacheInfo response,
    if we can fetch the shared cache path.

    If debugserver and the inferior process are running with the same shared
    cache UUID, there is a simple SPI to get debugserver's own shared cache
    filepath and we will return that.

    On newer OSes, there are SPI we can use to get the inferior process'
    shared cache filepath, use that if necessary and the SPI are available.

    The response for the jGetSharedCacheInfo packet will now look like

    {"shared_cache_base_address":6609256448,"shared_cache_uuid":"B69FF43C-DBFD-3FB1-B4FE-A8FE32EA1062","no_shared_cache":false,"shared_cache_private_cache":false,"shared_cache_path":"/System/Volumes/Preboot/Cryptexes/OS/System/Library/dyld/dyld_shared_cache_arm64e"}

    when we have the full information about the shared cache in the
    inferior. There are three possible types of responses:

    1. inferior has not yet mapped in a shared cache (read: when stopped at
    dyld_start and dyld hasn't started executing yet). In this case, no
    "shared_cache_path" is listed. ("shared_cache_base_address" will be 0,
    "shared_cache_uuid" will be all-zeroes uuid)

    2. inferior has a shared cache, but it is different than debugserver's
    and we do not have the new SPI to query the shared cache filepath. No
    "shared_cache_path" is listed.

    3. We were able to find the shared cache filepath, and it is included in
    the response, as above.

    I'm not using this information in lldb yet, but changes that build on
    this will be forthcoming.

    rdar://148939795

commit 81c5d468cf00d6e41112fba6c89d6c40013bcbda
Author: Men-cotton <mencotton0410 at gmail.com>
Date:   Mon Dec 1 13:20:13 2025 +0900

    [MLIR][NVVM] Propagate verification failure for unsupported SM targets (#170001)

    Fixes: https://github.com/llvm/llvm-project/issues/169113

    Correctly propagate verification failure when
    `NVVM::RequiresSMInterface` check fails during `gpu.module`
    verification.
    Previously, the walk was interrupted but the function returned
    `success()`, causing a mismatch between the emitted diagnostic and the
    return status. This led to assertion failures in Python bindings which
    expect `failure()` when diagnostics are emitted.

    CC: @grypp

commit e2181400d70857bc5a212a4053d5d7940c84acaf
Author: Brandon Wu <brandon.wu at sifive.com>
Date:   Mon Dec 1 11:03:50 2025 +0800

    [RISCV][llvm] Correct shamt in P extension EXTRACT_VECTOR_ELT lowering (#169823)

    During operation legalization, the element type will already have been
    turned into XLenVT, which makes the SHL a no-op. We need to use the
    exact vector element type instead.

commit 6369279a0c4ca1a008241f171657c1db83cfe026
Author: Matt Arsenault <Matthew.Arsenault at amd.com>
Date:   Sun Nov 30 21:56:47 2025 -0500

    Revert "Revert "LangRef: Clarify llvm.minnum and llvm.maxnum about sNaN and signed zero (#112852)"" (#170067)

    Reverts llvm/llvm-project#168838

    Justification is confused and this did not receive adequate discussion,
    particularly during a holiday week

commit 7494f3df14e5d401b73f2f8ccbd811f3556c5be5
Author: Aadesh Premkumar <aadesh.premkumar at multicorewareinc.com>
Date:   Mon Dec 1 08:14:51 2025 +0530

    [SPIRV] Added support for extension SPV_ALTERA_arbitrary_precision_fixed_point and name change of SPV_INTEL_arbitrary_precision_integers to SPV_ALTERA_arbitrary_precision_integers  (#136085)

    --Added support for extension SPV_ALTERA_arbitrary_precision_fixed_point
    --Added test files for extension
    SPV_ALTERA_arbitrary_precision_fixed_point

commit 2e21bb815d527ebbe4d53f0396d1e40aae9e2146
Author: fennecJ <hwahwa649 at gmail.com>
Date:   Mon Dec 1 10:19:56 2025 +0800

    [RISCV][ISelLowering] Use Zicond for FP selects on Zfinx/Zdinx (#169299)

    ### Summary

    This patch lets RISCVTargetLowering::lowerSELECT lower some
    floating-point select operations through an integer Zicond select when:
    * Zicond is available, and
    * FP values live in GPRs (Zfinx/Zdinx), and
    * Select condition is an integer type.

    In that scenario there is no extra cost for GPR <-> "FP GPR" moves, so
    we can implement FP selects with a CZERO-based sequence instead of a
    branch.

    For example, for
    ```c
    float foo(int cond, float x) {
        return (cond != 0) ? x : 0.0f;
    }
    ```
    the current lowering produces:
    ```asm
    foo:
      mv    a2, a0
      li    a0, 0
      beqz  a2, .LBB0_2
    .LBB0_1:
      mv    a0, a1
    .LBB0_2:
      ret
    ```

    With this patch, when targeting rv64ima_zicond_zfinx we instead get:

    ```asm
    foo:
      czero.nez  a2, zero, a0
      czero.eqz  a0, a1, a0
      or         a0, a2, a0
      ret
    ```

    The existing branch-based lowering is preserved for:
    * targets without Zicond
    * targets where FP registers are separate (+f, +d without zfinx/zdinx)

    ### Testing

    Adds llvm/test/CodeGen/RISCV/zicond-fp-select-zfinx.ll to cover:
    * RV64 Zfinx/Zicond vs Zfinx without Zicond
    * RV64 Zdinx/Zicond vs Zdinx without Zicond
    * RV32 Zfinx/Zicond vs Zfinx without Zicond

    Also adds baseline RV32F/RV64F/RV64D cases to ensure we still use
    branches when FP registers are separate.

    The tests check that:
    * With Zicond + Zfinx/Zdinx, FP select lowers to a CZERO+OR sequence
    with no conditional branches.
    * Without Zicond (or without Zfinx/Zdinx), we still get branch-based
    code and no czero.* instructions.

commit e110abc3c65bb33f738738a9fa6e0f5b602ed97f
Author: lonely eagle <2020382038 at qq.com>
Date:   Mon Dec 1 10:00:54 2025 +0800

    [mlir][affine] Use iter argument replace init when delete loop in the coalesceLoops function (#169514)

    Fix https://github.com/llvm/llvm-project/issues/169483 by using the iter
    argument to replace the init value when deleting the loop in the
    coalesceLoops function.

commit 75aa01b89553bf4213a3b0e83829b6d0689941b9
Author: Phoebe Wang <phoebe.wang at intel.com>
Date:   Mon Dec 1 09:35:00 2025 +0800

    Revert "LangRef: Clarify llvm.minnum and llvm.maxnum about sNaN and signed zero (#112852)" (#168838)

    This reverts commit 363b05944f9212511ee6811d0eb1af841c177226.

    This is a follow up of #166912. Sorry for not noticing the change at the
    beginning, but I disagree with both sNaN and signed zero semantics
    change.

    I have 3 justifications:

    - llvm.minnum and llvm.maxnum are common intrinsics, we cannot change
    the definition just because "some architectures" support the changed
    semantic. For example, X86 min/max instructions neither distinguish sNaN
    nor signed zero. We have to add a couple of extra instructions to match
    the new definition, which makes the intrinsics less efficient. But
    efficiency is not the reason for the objection. I object because such a
    cost is unnecessary;
    - As the example ``minnum(fadd(sNaN, -0.0), 1.0)`` shows, minnum/maxnum
    themselves cannot guarantee a consistent result if multiple FP
    arithmetic operations are involved. That makes the sacrifice of
    performance totally unnecessary. `Behavior of Floating-Point NaN values`
    notes all NaNs can be treated as quiet NaNs unless using Constrained
    Floating-Point Intrinsics. So the cost is only worthwhile for the
    constrained minnum/maxnum ones if we want to define them;
    - Signed zero handling is not necessary either, because even the C
    functions don't require it. If any other front ends require it, they can
    use the existing fminnum_ieee/fmaxnum_ieee or define new intrinsics;

    Fixes: https://github.com/llvm/llvm-project/issues/138303 and
    https://github.com/llvm/llvm-project/issues/169122

commit ef3785887c7c306d1ea933430befb78fb17e1650
Author: Fangrui Song <i at maskray.me>
Date:   Sun Nov 30 14:37:34 2025 -0800

    ELF: Move .eh_frame_hdr code closer to .eh_frame . NFC

    ... as they are closely related. Also improve the comments.

commit c465a56e9d1f244a32ea00a426d449bc7f38a9b1
Author: Florian Hahn <flo at fhahn.com>
Date:   Sun Nov 30 21:50:37 2025 +0000

    [VPlan] Handle canonical IVs in ::isSingleScalar. (NFCI)

    The canonical IV is always a single scalar. They are already treated as
    uniform-across-UF-and-VF.

    This should currently be NFC.

commit 113e0c95a89ab3ce9f1ac4e2ba6351d957a64da9
Author: Florian Hahn <flo at fhahn.com>
Date:   Sun Nov 30 21:07:28 2025 +0000

    [LV] Add additional tests for argmin with find-first wrapping IV ranges.

    Add test cases for upcoming argmin vectorization changes that have
    wrapping IV ranges.

commit f42e58f61680e325555f382cab5115c54df6f6df
Author: Nico Weber <thakis at chromium.org>
Date:   Sun Nov 30 13:59:11 2025 -0500

    [gn] port a6643f27ecda (libc++ picolib/newlib)

commit 38678a91f3eb984a76db40b71d573e336194029a
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Sun Nov 30 18:26:24 2025 +0000

    [DAG] getCarry - always succeed if we encounter a i1 type during trunc/ext peeling (#169777)

    If we are force reconstructing a carry from a raw MVT::i1 type, make
    sure we don't miss any cases while peeling through trunc/ext chains -
    check for i1 types at the start of the while loop

    Fixes #169691

commit 76d5dd5f9e9154a34dd6cfee232f322fc8112d63
Author: Shih-Po Hung <shihpo.hung at sifive.com>
Date:   Mon Dec 1 01:46:46 2025 +0800

    [TTI][RISCV] Add cost modelling for intrinsic vp.load.ff (#169890)

    This patch is a rework of #160470 (which was reverted).
    With getMemIntrinsicCost() now available, we can re‑land the change and
    reduce vp_load_ff boilerplate.

commit 0bd2f12753604cd072ae0935820ba9a23bb17ccc
Author: Jakub Kuderski <jakub at nod-labs.com>
Date:   Sun Nov 30 09:51:37 2025 -0500

    [mlir][linalg] Restrict fill initial value type to output element type (#169567)

    Disallow implicit casting, which is surprising, and, IME, usually
    indicative of copy-paste errors.

    Because the initial value must be a scalar, I don't expect this to
    affect any data movement.

commit b22825631293c19f70ca9969bd9de6094c688430
Author: David Green <david.green at arm.com>
Date:   Sun Nov 30 11:12:53 2025 +0000

    [ARM] Introduce intrinsics for MVE fma under strict-fp. (#169771)

    Similar to #169156, this adds an @arm.mve.fma intrinsic for strict-fp. A
    Builder class is added to act as the common subclass of IRBuilder and
    IRInt.

commit ce2c0813f0615084f387b31715ca0e1d8377134e
Author: Jonas Hahnfeld <jonas.hahnfeld at cern.ch>
Date:   Sun Nov 30 11:56:58 2025 +0100

    [clang] Move and update comment in getASTRecordLayout, NFC.

    isDefinition was already renamed to isCompleteDefinition by commit
    f937c023bf in 2011, later the same day the comment was originally
    written.

commit 22257e8d6ed5600d9c689fecbd17ea68e9d08a6f
Author: Aiden Grossman <aidengrossman at google.com>
Date:   Sun Nov 30 00:45:18 2025 -0800

    [bazel] Port #169873 (#170027)

    A new dependency was added.

commit dda1fcf7b14cdcaeb39fc7aed377d8d4483ebcac
Author: Tomer Shafir <tomer.shafir8 at gmail.com>
Date:   Sun Nov 30 10:42:29 2025 +0200

    [llc][NFC] Remove unreachable return statement (#169915)

    `reportError()` is a `[[noreturn]]` that calls `exit(1)`.
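
    A small illustration, assuming a standalone reimplementation rather than
    the llc source, of why the trailing return was dead code:

    ```cpp
    #include <cstdio>
    #include <cstdlib>

    // Because reportError never returns, any statement placed after a call
    // to it is unreachable.
    [[noreturn]] static void reportError(const char *Msg) {
      std::fprintf(stderr, "error: %s\n", Msg);
      std::exit(1);
    }

    static int run(bool Failed) {
      if (Failed) {
        reportError("something went wrong");
        // return 1;  // unreachable: reportError is [[noreturn]]
      }
      return 0;
    }
    ```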

commit 70970d0a5bc07f5614cfdb3c224b1ee8bbd58546
Author: sathvikreddy853 <157317970+sathvikreddy853 at users.noreply.github.com>
Date:   Sun Nov 30 08:37:14 2025 +0530

    [flang] Implement lowering for the PAUSE statement (Fixes #166821) (#167115)

    Implements lowering for the Fortran `PAUSE` statement.

    - Handles PAUSE with no operand.
    - Handles PAUSE with integer argument.
    - Handles PAUSE with character literal argument.
    - Adds a new lowering test: flang/test/Lower/pause-statement.f90.

    Unlike STOP, PAUSE does not unconditionally terminate control flow.
    The lowering preserves labels and GOTOs, consistent with legacy Fortran
    behavior.

    Fixes: #166821

    ---------

    Co-authored-by: aditya nath <adityanath5002 at gmail.com>
    Co-authored-by: Eugene Epshteyn <eepshteyn at nvidia.com>

commit 3de11e9251bba9f974b99947662eea69329075b2
Author: Matthias Springer <me at m-sp.org>
Date:   Sun Nov 30 10:36:57 2025 +0800

    [mlir][CF] Add `ub.unreachable` canonicalization (#169873)

    Basic blocks with only a `ub.unreachable` terminator are unreachable.
    This commit adds a canonicalization pattern that folds `cf.cond_br`
    to `cf.br` if one of the destination branches is unreachable.

commit a8cffb82991f76c3a004820d94dd4e0853bce1db
Author: Fangrui Song <i at maskray.me>
Date:   Sat Nov 29 17:43:26 2025 -0800

    Remove unused MCObjectFileInfo::SupportsWeakOmittedEHFrame

    The code is related to pre-AsmPrinter legacy code (see
    9cb0e94dc79657144d639c722619e1e4fc19040e in 2008). The only caller has
    been removed by bb237c72a69e6294258874a40aaaf14ad2747710 in 2011.

commit 24b87b8d4891d90afd8c4033a4997dedecbdd107
Author: Florian Hahn <flo at fhahn.com>
Date:   Sat Nov 29 22:00:30 2025 +0000

    [VPlan] Skip cost verification for loops with EVL gather/scatter.

    The VPlan-based cost model use vp_gather/vp_scatter for gather/scatter
    costs, which is different to the legacy cost model and cannot be matched
    there. Don't verify the costs match for plans containing gather/scatters
    with EVL.

    Fixes https://github.com/llvm/llvm-project/issues/169948.

commit 9ffd2e40c1c469e3ccb0798fa15fc38d6df42652
Author: Lucie Choi <clucie at google.com>
Date:   Sat Nov 29 13:57:06 2025 -0800

    [SimplifyCFG] Fix `SimplifyCFG` pass to skip folding when both blocks contain convergence loop/entry intrinsics. (#166452)

    Fixes a bug https://github.com/llvm/llvm-project/issues/165642. [Similar
    fix](https://github.com/llvm/llvm-project/pull/165643) is being made in
    `IndVarSimplify` pass to account for convergence tokens.

    [LLVM
    Spec](https://llvm.org/docs/ConvergentOperations.html#llvm-experimental-convergence-loop)
    states that only a single loop / entry convergence token can be included
    in a basic block.

    This PR fixes the issue in `SimplifyCFG` pass so that when a basic block
    and its predecessor both contain such convergence intrinsics, it skips
    merging the two blocks.

commit cd3192a2c9c422f41d517428afef0a2232b9db8f
Author: Florian Hahn <flo at fhahn.com>
Date:   Sat Nov 29 20:49:22 2025 +0000

    [VPlan] Turn IVOp assertion into early exit.

    Turn assertion added in 99addbf73 [0] into an early exit.
    There are cases where the operand may not be a
    VPWidenIntOrFpInductionRecipe, e.g. if the IV increment is selected,
    as in the test cases.

    [0] https://github.com/llvm/llvm-project/pull/141431

commit f57129312421b05eb2a46cf715f2c1db32f56c83
Author: Fangrui Song <i at maskray.me>
Date:   Sat Nov 29 11:23:59 2025 -0800

    ARC: Override pseudos with pointers

    This ports #159881 fix for other targets and fixes
    ```
    error: missing target override for pseudoinstruction using PointerLikeRegClass
    ```

commit 435bafd0d534c8888783f0610afb86ed20d34fa7
Author: AIT <45133884+GeneraluseAI at users.noreply.github.com>
Date:   Sun Nov 30 03:22:25 2025 +0800

    [CIR][X86] Implement lowering for AVX512 mask builtins  (#169774)

    This patch adds CIR codegen support for AVX512 mask operations on X86,
    including kadd, kand, kandn, kor, kxor, knot, and kmov in all supported
    mask widths. Each builtin now lowers to the expected vector<i1> form and
    bitcast representations in CIR, matching the semantics of the
    corresponding LLVM intrinsics.

commit 246528cb3ad67ededee5f076fd1ef501af97f294
Author: Islam Imad <143586474+Islam-Imad at users.noreply.github.com>
Date:   Sat Nov 29 19:38:17 2025 +0200

    [clang][bytecode] Unify elementwise integer builtins using callback pattern (#169957)

    This patch refactors the handling of elementwise integer unary
    operations to use a unified callback-based approach, eliminating code
    duplication.

    Changes:
    - Extended interp__builtin_elementwise_int_unaryop to handle vector types
    - Replaced BI__builtin_elementwise_popcount with callback invocation
    - Replaced BI__builtin_elementwise_bitreverse with callback invocation
    - Removed  interp__builtin_elementwise_popcount function

    The new approach uses a lambda function to specify the operation
    (popcount or reverseBits), which is applied uniformly to both scalar and
    vector operands. This reduces code duplication and makes it easier to
    add similar builtins in the future.

    Fixes #169657

commit 8462cff40daf40e58d705f5d86d4e91ef6e6294c
Author: David Stone <davidfromonline at gmail.com>
Date:   Sat Nov 29 09:48:44 2025 -0700

    [clang][NFC] Declare `CXXBasePaths::isAmbiguous` as `const` (#169944)

    To make this change, we have to use `lookup` instead of `operator[]` on
    a map. They both return the same thing: a default constructed value. The
    difference is that `lookup` default constructs a value and then returns
    it, whereas `operator[]` default constructs a value, inserts it into the
    map, and then returns a reference to that. Given that we are using a
    by-value return, the only way this is different is if a later use of the
    map depends on a value being at that key.

    The map is a private variable of the class, so the only possible users
    are other member functions. The only other use of the map that cares
    about the contents of the map is in `lookupInBases`, and it accesses the
    map with `operator[]`. This means that attempting to access the same
    element in this function will default construct the value before doing
    anything with it, which means it would do the exact thing it needs to do
    in the case where we are looking up a non-existent key, therefore no
    behavior has changed.

    In terms of performance, this would either be a win or neutral. The
    benefit is that in some cases, we can avoid a memory allocation just to
    read the contents of a 32-bit `0`. If a call to `isAmbiguous` is always
    followed up with a call to `lookupInBases`, then we allocate the memory
    just a little bit later for no difference in performance.
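
    A short sketch of the difference being relied on, using a hypothetical
    DenseMap rather than the CXXBasePaths member itself:

    ```cpp
    #include "llvm/ADT/DenseMap.h"
    using namespace llvm;

    // lookup() never modifies the map, so it works on a const map; a
    // missing key simply yields a value-initialized result (0 here).
    unsigned queryConst(const DenseMap<int, unsigned> &M, int Key) {
      return M.lookup(Key);
    }

    // operator[] default-constructs and *inserts* the value for a missing
    // key, so it needs a non-const map and may allocate.
    unsigned queryMutating(DenseMap<int, unsigned> &M, int Key) {
      return M[Key];
    }
    ```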

commit a09c5792ed3b6a0644c990060f890c53f042b267
Author: Abhishek Varma <avarma094 at gmail.com>
Date:   Sat Nov 29 22:01:19 2025 +0530

    [NFC][Linalg] Introduce ConvMatchBuilder + refactor Conv matchers (#169704)

    -- This commit is a follow-up and third in the series of adding
    matchers for conv/pool ops. Refer:
    https://github.com/llvm/llvm-project/pull/163724
    -- It introduces ConvMatchBuilder class in order to reduce the
       repetitive code across Conv1D/2D/3D/Depthwise/Pooling variants.
    -- Refer to [Conv2D
    thread](https://github.com/llvm/llvm-project/pull/168362#issuecomment-3575972133)
    for further context.

    Signed-off-by: Abhishek Varma <abhvarma at amd.com>

commit 7925a9ea1e63b5e7c1f57e467a05e819f6ef7c27
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date:   Sat Nov 29 14:32:05 2025 +0000

    [X86] combineConcatVectorOps - add handling for vXi1 concat(logicop(),logicop()) patterns. (#169998)

commit 3e16aef2a650a8c2da4ebd5c58c6a9e261361828
Author: Koakuma <koachan at protonmail.com>
Date:   Sat Nov 29 21:30:39 2025 +0700

    [SPARC] Properly handle CC for long double on sparc32 (#162226)

    Pass and return `long double`s indirectly, as specified in the psABI.
    This continues the patch at https://reviews.llvm.org/D89130.

    This should fix the issue at https://github.com/llvm/llvm-project/issues/41838.

commit 3a1079fa2514d16c51bfe53b3da8a8b8d78128c1
Author: theRonShark <ron.lieberman at amd.com>
Date:   Sat Nov 29 08:01:23 2025 -0500

    Revert "[RegAlloc] Relax the split constrain on MBB prolog" (#169990)

    Reverts llvm/llvm-project#168259

    breaks HIP buildbot

commit d3762edd5fc11e6ad670950d89d51edabf30f8b5
Author: Tirthankar Mazumder <63574588+wermos at users.noreply.github.com>
Date:   Sat Nov 29 18:30:21 2025 +0530

    [docs] Fix typos and remove redundant whitespace (#169981)

    As the title says, I fixed some spelling mistakes I found in the docs.

commit 66d33cec991c5526b4ec3bbfec741a2a9e78b21f
Author: Florian Hahn <flo at fhahn.com>
Date:   Sat Nov 29 12:26:51 2025 +0000

    [LV] Extend test coverage for inductions depending on complex SCEVs.

    Re-generate check lines, add test with complex SCEV as induction start
    value and add stores to existing loops to make them not trivial.

commit f5742c4d540a20651a67de51e16242a52e5d4064
Author: Qihan Cai <caiqihan021 at hotmail.com>
Date:   Sat Nov 29 19:07:19 2025 +1100

    [RISCV] Intrinsic Support for XCVelw (#129168)

commit 5dd2b06d60d3eb9b07c7513358ad8b04386f79bc
Author: mitchell <mitchell.xu2 at gmail.com>
Date:   Sat Nov 29 10:36:01 2025 +0800

    [clang-tidy] Fix OOB access in `FormatStringConverter` with signed chars (#169215)

    `FormatStringConverter::appendFormatText` incorrectly treated non-ASCII
    characters (e.g. UTF-8) as negative values when using signed chars. This
    caused them to pass the `< 32` check for control characters.

    The negative values were passed to `llvm::hexdigit`, resulting in an OOB
    access and a crash.

    This closes
    [#169198](https://github.com/llvm/llvm-project/issues/169198)
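
    An illustration of the bug class with the fix applied, assuming a
    simplified escape routine rather than the clang-tidy code: with a signed
    `char`, a UTF-8 byte such as 0xC3 compares as a negative value, wrongly
    satisfies `< 32`, and then indexes the hex lookup out of bounds.

    ```cpp
    #include <cstdio>

    void appendEscaped(char C) {
      // Fix: compare and index with the unsigned value of the byte.
      unsigned char UC = static_cast<unsigned char>(C);
      if (UC < 32) {
        static const char Hex[] = "0123456789abcdef";
        std::printf("\\x%c%c", Hex[UC >> 4], Hex[UC & 0xF]); // in bounds
      } else {
        std::printf("%c", C);
      }
    }
    ```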

commit 9bae84b01718e53495abf50958abc86ea45f16bb
Author: Luo Yuanke <lyk_03 at hotmail.com>
Date:   Sat Nov 29 07:27:19 2025 +0800

    [RegAlloc] Relax the split constrain on MBB prolog (#168259)

    https://reviews.llvm.org/D52052 is to prevent register split on the MBB
    which have prolog instructions defining the exec register (or mask register
    that activate the threads of a warp in GPU). The constrain seems too
    strict, because 1) If the split is allowed, it may fit the free live range
    of a physical register, and no spill will happen; 2) The register class of
    register that is under splitting may not be the same to the register that
    is defined in prolog, so there is no interference with the register being
    defined in prolog.
    The current code has another small issue. The MBB->getFirstNonDebugInstr()
    just skip debug instructions, but SA->getFirstSplitPoint(Number) would skip
    label and phi instructions. This cause some MBB with label instruction
    being taken as prolog.
    This patch is to relax the split constrain on MMB with prolog by checking
    if the register defined in prolog has the common register class with the
    register being split. It allow the split if the register defined in prolog
    is physical register or there is no common register class.

    ---------

    Co-authored-by: Yuanke Luo <ykluo at birentech.com>

commit 99addbf73db596403a1702ac5c3f92e58f9e9f55
Author: Florian Hahn <flo at fhahn.com>
Date:   Fri Nov 28 22:26:19 2025 +0000

    [LV] Vectorize selecting last IV of min/max element. (#141431)

    Add support for vectorizing loops that select the index of the minimum
    or maximum element. The patch implements vectorizing those patterns by
    combining Min/Max and FindFirstIV reductions.

    It extends matching Min/Max reductions to allow in-loop users that are
    FindLastIV reductions. It records a flag indicating that the Min/Max
    reduction is used by another reduction. The extra user is then checked
    as part of the new `handleMultiUseReductions` VPlan transformation.

    It processes any reduction that has other reduction users. The reduction
    using the min/max reduction currently must be a FindLastIV reduction,
    which needs adjusting to compute the correct result:
     1. We need to find the last IV for which the condition based on the
         min/max reduction is true,
     2. Compare the partial min/max reduction result to its final value and,
     3. Select the lanes of the partial FindLastIV reductions which
         correspond to the lanes matching the min/max reduction result.

    Depends on https://github.com/llvm/llvm-project/pull/140451

    PR: https://github.com/llvm/llvm-project/pull/141431

commit e99d8adf8d34da521d9243ba225995ac543745df
Author: Martin Storsjö <martin at martin.st>
Date:   Fri Nov 28 20:28:52 2025 +0200

    [MC] [Win64EH] Fix the operator ordering for UOP_SaveFPLRX. NFC.

    The encoded offset should be (OffsetInBytes/8)-1 due to an
    implicit offset of 1. Previously the operator ordering was
    inverted. As the offset is a multiple of 8, the incorrect
    operator ordering did produce the right result in all cases
    anyway.
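
    A quick check of that claim, assuming the inverted form was
    (OffsetInBytes - 1) / 8:

    ```cpp
    #include <cassert>

    int main() {
      // Correct encoding: (Offset / 8) - 1.  Inverted: (Offset - 1) / 8.
      // For offsets that are multiples of 8 the two coincide.
      for (int Offset = 8; Offset <= 512; Offset += 8)
        assert(Offset / 8 - 1 == (Offset - 1) / 8);
      return 0;
    }
    ```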

commit 8a2965dfa929b49ecc3ba7e508d2f6970ac418af
Author: Martin Storsjö <martin at martin.st>
Date:   Fri Nov 28 20:45:18 2025 +0200

    [llvm-readobj] Remove a leftover comment from 6ad4fdacaeea4777e98a3ab41512c49d3d1b6151. NFC.

    This case did get documented upstream, in
    https://github.com/MicrosoftDocs/cpp-docs/pull/4202, and the
    way that llvm-readobj prints it, implemented in that commit, is
    correct.
---
 .ci/generate_test_report_lib.py               |    8 +-
 .ci/generate_test_report_lib_test.py          |    6 +-
 .github/workflows/check-ci.yml                |    2 +-
 .github/workflows/ci-post-commit-analyzer.yml |    2 +-
 .github/workflows/docs.yml                    |    2 +-
 .github/workflows/gha-codeql.yml              |    4 +-
 .github/workflows/issue-write-test.yaml       |   33 +
 .github/workflows/issue-write.yml             |   12 +-
 .github/workflows/libc-fullbuild-tests.yml    |    2 +-
 .github/workflows/libc-overlay-tests.yml      |    2 +-
 .github/workflows/libclang-python-tests.yml   |    4 +-
 .github/workflows/libcxx-build-and-test.yaml  |   29 +-
 .../libcxx-check-generated-files.yml          |    3 +
 .github/workflows/libcxx-run-benchmarks.yml   |    2 +-
 .github/workflows/mlir-spirv-tests.yml        |    2 +-
 .github/workflows/premerge.yaml               |    4 +-
 .github/workflows/release-binaries.yml        |    2 +-
 .github/workflows/release-documentation.yml   |    2 +-
 .github/workflows/release-doxygen.yml         |    2 +-
 .github/workflows/release-sources.yml         |   57 +-
 .github/workflows/scorecard.yml               |    2 +-
 .github/workflows/spirv-tests.yml             |    2 +-
 .../test-unprivileged-download-artifact.yml   |   28 +-
 .../unprivileged-download-artifact/action.yml |   59 +-
 .../upload-release-artifact/action.yml        |  105 ++
 bolt/docs/BAT.md                              |    1 +
 bolt/docs/CommandLineArgumentReference.md     |    9 +
 bolt/docs/PacRetDesign.md                     |   21 +-
 bolt/include/bolt/Core/BinaryContext.h        |    9 +
 .../bolt/Passes/InsertNegateRAStatePass.h     |   25 +-
 bolt/include/bolt/Rewrite/RewriteInstance.h   |   11 +-
 bolt/lib/Passes/Inliner.cpp                   |   26 +
 bolt/lib/Passes/InsertNegateRAStatePass.cpp   |  147 +-
 bolt/lib/Rewrite/RewriteInstance.cpp          |  241 ++-
 bolt/test/AArch64/hook-fini.s                 |   14 +-
 bolt/test/AArch64/hook-init.s                 |  221 +++
 bolt/test/AArch64/inline-bti-dbg.s            |   40 +
 bolt/test/AArch64/inline-bti.s                |   38 +
 bolt/test/AArch64/instrument-no-fini.s        |   34 +
 bolt/test/X86/hook-init.s                     |  221 +++
 bolt/test/X86/instrument-no-fini.s            |   34 +
 bolt/test/X86/internal-call-instrument-so.s   |    9 +-
 .../runtime/X86/instrument-wrong-target.s     |    7 +
 bolt/unittests/CMakeLists.txt                 |    1 +
 bolt/unittests/Passes/CMakeLists.txt          |   30 +
 bolt/unittests/Passes/InsertNegateRAState.cpp |  333 ++++
 clang-tools-extra/clang-tidy/.clang-tidy      |    3 +-
 .../ClangTidyDiagnosticConsumer.cpp           |    9 +-
 .../ExpandModularHeadersPPCallbacks.h         |    2 +-
 .../clang-tidy/bugprone/BranchCloneCheck.cpp  |    9 +-
 .../CapturingThisInMemberVariableCheck.cpp    |   23 +-
 .../EasilySwappableParametersCheck.cpp        |    8 +-
 .../bugprone/ExceptionEscapeCheck.cpp         |    3 +-
 .../bugprone/FloatLoopCounterCheck.cpp        |    1 +
 .../clang-tidy/bugprone/InfiniteLoopCheck.cpp |   18 +-
 .../bugprone/SuspiciousReallocUsageCheck.cpp  |    7 +-
 .../ProBoundsArrayToPointerDecayCheck.cpp     |    9 +-
 ...undsAvoidUncheckedContainerAccessCheck.cpp |    2 +-
 .../ProTypeMemberInitCheck.cpp                |   47 +-
 .../fuchsia/VirtualInheritanceCheck.cpp       |    7 +-
 .../misc/NewDeleteOverloadsCheck.cpp          |   12 +-
 .../clang-tidy/misc/UnusedParametersCheck.cpp |   11 +-
 .../clang-tidy/modernize/LoopConvertCheck.cpp |   19 +-
 .../clang-tidy/modernize/LoopConvertUtils.cpp |   19 +-
 .../clang-tidy/modernize/PassByValueCheck.cpp |    6 +-
 .../clang-tidy/modernize/UseEmplaceCheck.cpp  |   17 +-
 .../clang-tidy/modernize/UseStdPrintCheck.h   |    2 +-
 .../modernize/UseTrailingReturnTypeCheck.cpp  |   11 +-
 .../clang-tidy/objc/MissingHashCheck.cpp      |    8 +-
 .../TriviallyDestructibleCheck.cpp            |    9 +-
 .../AmbiguousSmartptrResetCallCheck.cpp       |    9 +-
 .../readability/IdentifierNamingCheck.cpp     |    4 +-
 .../OperatorsRepresentationCheck.cpp          |    8 +-
 .../readability/RedundantTypenameCheck.cpp    |    6 +-
 .../SuspiciousCallArgumentCheck.cpp           |   10 +-
 .../clang-tidy/tool/ClangTidyMain.cpp         |    3 +-
 .../clang-tidy/utils/Aliasing.cpp             |   12 +-
 .../clang-tidy/utils/DeclRefExprUtils.cpp     |    5 +-
 .../clang-tidy/utils/ExprSequence.cpp         |    9 +-
 .../utils/FormatStringConverter.cpp           |    7 +-
 .../clang-tidy/utils/TypeTraits.cpp           |   20 +-
 clang-tools-extra/clangd/test/CMakeLists.txt  |    4 +
 .../test/include-cleaner-batch-fix.test       |    4 +-
 .../clangd/test/index-tools.test              |    9 +-
 clang-tools-extra/clangd/test/lit.cfg.py      |    4 +
 .../clangd/test/lit.site.cfg.py.in            |    1 +
 .../clangd/test/system-include-extractor.test |    3 +-
 clang-tools-extra/docs/ReleaseNotes.rst       |   23 +-
 ...ounds-avoid-unchecked-container-access.rst |    4 +-
 .../test/clang-tidy/check_clang_tidy.py       |    2 +
 .../pro-type-member-init.ignorearrays.cpp     |   36 +
 .../const-correctness-pointer-as-pointers.cpp |   15 +
 .../checkers/modernize/use-std-print.cpp      |   12 +
 .../readability/redundant-typename.cpp        |   24 +
 clang/docs/LibASTMatchersReference.html       |   12 +
 clang/docs/OpenMPSupport.rst                  |    2 +-
 clang/docs/ReleaseNotes.rst                   |   18 +
 clang/docs/StandardCPlusPlusModules.rst       |    5 +-
 clang/include/clang/AST/ASTConsumer.h         |    6 +
 clang/include/clang/AST/ASTContext.h          |   16 +
 clang/include/clang/AST/CXXInheritance.h      |    2 +-
 clang/include/clang/AST/Decl.h                |   14 +
 clang/include/clang/AST/OpenMPClause.h        |   44 +-
 clang/include/clang/AST/OperationKinds.def    |    3 +
 clang/include/clang/ASTMatchers/ASTMatchers.h |   13 +
 clang/include/clang/Basic/Attr.td             |   20 +-
 clang/include/clang/Basic/AttrDocs.td         |   52 +
 clang/include/clang/Basic/BuiltinsX86.td      |   52 +-
 clang/include/clang/Basic/DiagnosticGroups.td |    5 +
 .../include/clang/Basic/DiagnosticLexKinds.td |    4 +
 .../clang/Basic/DiagnosticSemaKinds.td        |   29 +-
 clang/include/clang/Basic/OpenMPKinds.def     |    1 +
 clang/include/clang/Basic/TokenKinds.h        |    2 +-
 clang/include/clang/Basic/arm_mve.td          |   14 +-
 clang/include/clang/Basic/arm_mve_defs.td     |   13 +-
 clang/include/clang/CIR/CIRGenerator.h        |    3 +
 .../include/clang/CIR/Dialect/IR/CIRAttrs.td  |   42 +-
 .../clang/CIR/Dialect/IR/CIRDialect.td        |    9 +-
 clang/include/clang/CIR/Dialect/IR/CIROps.td  |  124 +-
 .../CIR/Dialect/IR/CIRTypeConstraints.td      |    2 +-
 clang/include/clang/CIR/MissingFeatures.h     |    7 +
 .../DependencyScanningWorker.h                |   31 +-
 clang/include/clang/Lex/HeaderSearch.h        |    6 +
 clang/include/clang/Lex/PPCallbacks.h         |    4 +-
 clang/include/clang/Lex/Token.h               |    8 +-
 clang/include/clang/Options/Options.td        |   36 +-
 clang/include/clang/Sema/Overload.h           |    3 +
 clang/include/clang/Sema/Sema.h               |    5 +
 clang/include/clang/Sema/SemaCUDA.h           |    5 +
 clang/include/clang/Sema/SemaHLSL.h           |   49 +-
 clang/include/clang/Sema/SemaOpenACC.h        |    9 +
 clang/include/clang/Sema/SemaOpenMP.h         |    4 +-
 clang/include/clang/Serialization/ASTReader.h |    2 +-
 .../clang/Tooling/Transformer/RangeSelector.h |    4 +
 clang/lib/AST/ByteCode/Compiler.cpp           |    3 +
 clang/lib/AST/ByteCode/Disasm.cpp             |   21 +-
 clang/lib/AST/ByteCode/Function.h             |    4 +-
 clang/lib/AST/ByteCode/InterpBuiltin.cpp      |  339 +++-
 clang/lib/AST/ByteCode/Source.h               |    1 +
 clang/lib/AST/CXXInheritance.cpp              |    4 +-
 clang/lib/AST/ComparisonCategories.cpp        |    2 +-
 clang/lib/AST/Expr.cpp                        |    1 +
 clang/lib/AST/ExprConstant.cpp                |  222 ++-
 clang/lib/AST/OpenMPClause.cpp                |   38 +-
 clang/lib/AST/RecordLayoutBuilder.cpp         |    9 +-
 clang/lib/ASTMatchers/ASTMatchersInternal.cpp |    1 +
 clang/lib/ASTMatchers/Dynamic/Registry.cpp    |    1 +
 clang/lib/Analysis/Consumed.cpp               |    9 +-
 clang/lib/Analysis/ExprMutationAnalyzer.cpp   |    5 +
 clang/lib/Basic/Targets/AMDGPU.h              |   16 +-
 clang/lib/Basic/Targets/Sparc.cpp             |    1 +
 clang/lib/Basic/Targets/Sparc.h               |    7 +
 clang/lib/CIR/CodeGen/CIRGenBuiltin.cpp       |   60 +-
 .../lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp  | 1583 +++++++++++++++++
 clang/lib/CIR/CodeGen/CIRGenBuiltinX86.cpp    |  462 ++++-
 clang/lib/CIR/CodeGen/CIRGenClass.cpp         |    3 +-
 clang/lib/CIR/CodeGen/CIRGenCoroutine.cpp     |   39 +
 clang/lib/CIR/CodeGen/CIRGenDeclOpenACC.cpp   |   87 +-
 clang/lib/CIR/CodeGen/CIRGenExpr.cpp          |    5 +-
 clang/lib/CIR/CodeGen/CIRGenExprComplex.cpp   |    1 +
 clang/lib/CIR/CodeGen/CIRGenExprConstant.cpp  |    1 +
 clang/lib/CIR/CodeGen/CIRGenExprScalar.cpp    |   31 +-
 clang/lib/CIR/CodeGen/CIRGenFunction.h        |   26 +-
 clang/lib/CIR/CodeGen/CIRGenModule.cpp        |   28 +-
 clang/lib/CIR/CodeGen/CIRGenModule.h          |    6 +
 clang/lib/CIR/CodeGen/CIRGenerator.cpp        |   12 +
 clang/lib/CIR/CodeGen/CMakeLists.txt          |    1 +
 clang/lib/CIR/Dialect/IR/CIRAttrs.cpp         |   18 -
 clang/lib/CIR/Dialect/IR/CIRDialect.cpp       |   89 +-
 .../Dialect/Transforms/LoweringPrepare.cpp    |   24 +-
 clang/lib/CIR/FrontendAction/CIRGenAction.cpp |    5 +
 .../CIR/Lowering/DirectToLLVM/LowerToLLVM.cpp |   36 +-
 clang/lib/CodeGen/BackendUtil.cpp             |   52 +-
 clang/lib/CodeGen/CGCUDARuntime.cpp           |  106 ++
 clang/lib/CodeGen/CGCUDARuntime.h             |    4 +
 clang/lib/CodeGen/CGCall.cpp                  |   13 +
 clang/lib/CodeGen/CGExpr.cpp                  |   36 +-
 clang/lib/CodeGen/CGExprAgg.cpp               |    3 +-
 clang/lib/CodeGen/CGExprCXX.cpp               |    4 +
 clang/lib/CodeGen/CGExprComplex.cpp           |    1 +
 clang/lib/CodeGen/CGExprConstant.cpp          |    1 +
 clang/lib/CodeGen/CGExprScalar.cpp            |   62 +-
 clang/lib/CodeGen/CGHLSLRuntime.cpp           |  109 +-
 clang/lib/CodeGen/CGHLSLRuntime.h             |    6 +-
 clang/lib/CodeGen/CGOpenMPRuntime.cpp         |   18 +
 clang/lib/CodeGen/CodeGenFunction.cpp         |   17 +
 clang/lib/CodeGen/CodeGenModule.cpp           |   19 +-
 clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp   |   17 -
 clang/lib/CodeGen/Targets/AMDGPU.cpp          |    5 +-
 clang/lib/CodeGen/Targets/NVPTX.cpp           |    3 -
 clang/lib/CodeGen/Targets/SPIR.cpp            |   21 -
 clang/lib/CodeGen/Targets/Sparc.cpp           |   34 +-
 .../DependencyScannerImpl.cpp                 |    6 +-
 .../DependencyScanningWorker.cpp              |   39 +-
 clang/lib/Driver/ToolChains/Linux.cpp         |    2 +-
 clang/lib/Edit/RewriteObjCFoundationAPI.cpp   |    1 +
 clang/lib/Format/ContinuationIndenter.cpp     |  121 +-
 clang/lib/Format/ContinuationIndenter.h       |   51 +-
 clang/lib/Format/FormatTokenLexer.cpp         |    2 +
 clang/lib/Format/UnwrappedLineFormatter.cpp   |    6 +-
 clang/lib/Format/WhitespaceManager.cpp        |   43 +-
 clang/lib/Format/WhitespaceManager.h          |   25 +-
 clang/lib/Headers/avx512bwintrin.h            |   10 +-
 clang/lib/Headers/avx512dqintrin.h            |    6 +-
 clang/lib/Headers/avx512fintrin.h             |   52 +-
 clang/lib/Headers/avx512vlintrin.h            |   16 +-
 clang/lib/Headers/avxintrin.h                 |    9 +-
 clang/lib/Headers/emmintrin.h                 |    7 +-
 clang/lib/Lex/HeaderSearch.cpp                |   67 +
 clang/lib/Lex/PPDirectives.cpp                |   11 +-
 clang/lib/Parse/ParseOpenMP.cpp               |   31 +-
 clang/lib/Parse/ParseTentative.cpp            |    2 +-
 clang/lib/Parse/Parser.cpp                    |   41 +-
 clang/lib/Sema/AnalysisBasedWarnings.cpp      |    1 -
 clang/lib/Sema/CodeCompleteConsumer.cpp       |    3 +-
 clang/lib/Sema/Sema.cpp                       |    3 +
 clang/lib/Sema/SemaCUDA.cpp                   |   99 +-
 clang/lib/Sema/SemaChecking.cpp               |   23 +-
 clang/lib/Sema/SemaDecl.cpp                   |   53 +-
 clang/lib/Sema/SemaDeclAttr.cpp               |   72 +
 clang/lib/Sema/SemaExprCXX.cpp                |   22 +-
 clang/lib/Sema/SemaHLSL.cpp                   |  167 +-
 clang/lib/Sema/SemaLookup.cpp                 |    7 +-
 clang/lib/Sema/SemaOpenACC.cpp                |   23 +-
 clang/lib/Sema/SemaOpenMP.cpp                 |   42 +-
 clang/lib/Sema/SemaOverload.cpp               |   73 +-
 clang/lib/Sema/SemaType.cpp                   |    6 +-
 clang/lib/Sema/TreeTransform.h                |   38 +-
 clang/lib/Serialization/ASTReader.cpp         |   12 +-
 clang/lib/Serialization/ASTReaderDecl.cpp     |    4 +-
 clang/lib/Serialization/ASTWriter.cpp         |   41 +-
 clang/lib/StaticAnalyzer/Core/BugReporter.cpp |    1 -
 clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp |    1 +
 .../Frontend/AnalysisConsumer.cpp             |    6 +-
 clang/lib/Tooling/Transformer/Parsing.cpp     |    2 +-
 .../lib/Tooling/Transformer/RangeSelector.cpp |   57 +
 clang/test/AST/ByteCode/builtin-functions.cpp |    7 +
 clang/test/AST/ByteCode/c.c                   |   13 +
 .../CIR/CodeGen/aapcs-volatile-bitfields.c    |   14 +-
 .../CIR/CodeGen/address-space-conversion.cpp  |    8 +-
 clang/test/CIR/CodeGen/address-space.c        |    6 +-
 clang/test/CIR/CodeGen/array-ctor.cpp         |    8 +-
 .../CIR/CodeGen/asm-label-inline-builtins.c   |    2 +-
 clang/test/CIR/CodeGen/assign-operator.cpp    |    4 +-
 clang/test/CIR/CodeGen/bitfield-union.c       |    2 +-
 clang/test/CIR/CodeGen/bitfields.c            |    6 +-
 clang/test/CIR/CodeGen/bitfields.cpp          |    6 +-
 clang/test/CIR/CodeGen/bitfields_be.c         |    4 +-
 clang/test/CIR/CodeGen/constant-inits.cpp     |    2 +-
 clang/test/CIR/CodeGen/copy-constructor.cpp   |    2 +-
 clang/test/CIR/CodeGen/coro-task.cpp          |   22 +-
 .../CIR/CodeGen/cxx-conversion-operators.cpp  |   10 +-
 .../CIR/CodeGen/cxx-special-member-attr.cpp   |   35 +-
 clang/test/CIR/CodeGen/delete.cpp             |    8 +-
 clang/test/CIR/CodeGen/destructors.cpp        |    6 +-
 clang/test/CIR/CodeGen/dtors.cpp              |    2 +-
 clang/test/CIR/CodeGen/dynamic-cast.cpp       |   12 +-
 clang/test/CIR/CodeGen/global-ctor-dtor.cpp   |    8 +-
 clang/test/CIR/CodeGen/goto.cpp               |   16 +-
 clang/test/CIR/CodeGen/inline-attributes.cpp  |   12 +-
 clang/test/CIR/CodeGen/label-values.c         |    8 +-
 clang/test/CIR/CodeGen/label.c                |   14 +-
 .../CIR/CodeGen/lambda-static-invoker.cpp     |    6 +-
 clang/test/CIR/CodeGen/lambda.cpp             |   32 +-
 clang/test/CIR/CodeGen/linkage-spec.cpp       |   28 +-
 clang/test/CIR/CodeGen/no-prototype.c         |   16 +-
 clang/test/CIR/CodeGen/placement-new.cpp      |    2 +-
 clang/test/CIR/CodeGen/ptrdiff.cpp            |    2 +-
 clang/test/CIR/CodeGen/size-of-vla.cpp        |  156 ++
 clang/test/CIR/CodeGen/statement-exprs.c      |   16 +-
 clang/test/CIR/CodeGen/stmt-expr.cpp          |    4 +-
 clang/test/CIR/CodeGen/struct.cpp             |   18 +-
 clang/test/CIR/CodeGen/var_arg.c              |   26 +-
 .../CIR/CodeGen/variable-decomposition.cpp    |    2 +-
 clang/test/CIR/CodeGen/vbase.cpp              |    2 +-
 clang/test/CIR/CodeGen/volatile.cpp           |   20 +-
 clang/test/CIR/CodeGen/vtable-emission.cpp    |    4 +-
 .../CIR/CodeGenBuiltins/X86/avx-builtins.c    |   74 +-
 .../CIR/CodeGenBuiltins/X86/avx2-builtins.c   |   53 +
 .../CodeGenBuiltins/X86/avx512bw-builtins.c   |  460 ++++-
 .../CodeGenBuiltins/X86/avx512dq-builtins.c   |  210 +++
 .../CodeGenBuiltins/X86/avx512f-builtins.c    |  618 +++++++
 .../CodeGenBuiltins/X86/avx512vl-builtins.c   |  201 +++
 .../CIR/CodeGenBuiltins/X86/sse-builtins.c    |   12 +
 .../CIR/CodeGenBuiltins/X86/sse2-builtins.c   |   55 +-
 .../CodeGenBuiltins/X86/vec-set-builtins.c    |  141 ++
 .../CIR/CodeGenBuiltins/X86/xop-builtins.c    |   92 +
 .../CIR/CodeGenBuiltins/builtin-constant-p.c  |  281 +++
 .../CIR/CodeGenBuiltins/builtin-fcmp-sse.c    |   16 +-
 .../test/CIR/CodeGenBuiltins/builtin_inline.c |    2 +-
 .../CIR/CodeGenBuiltins/builtin_prefetch.c    |    2 +-
 .../CodeGenBuiltins/builtins-floating-point.c |   21 +
 .../CIR/CodeGenBuiltins/builtins-overflow.cpp |   60 +-
 .../combined-firstprivate-clause.cpp          |   12 +-
 .../compute-firstprivate-clause-templates.cpp |    4 +-
 .../compute-firstprivate-clause.cpp           |   15 +-
 .../firstprivate-clause-recipes.cpp           |    6 +-
 .../openacc-not-implemented-global.cpp        |    6 -
 .../CIR/CodeGenOpenACC/routine-anon-ns.cpp    |   27 +
 .../CIR/CodeGenOpenACC/routine-clauses.cpp    |   38 +
 .../CIR/CodeGenOpenACC/routine-globals.cpp    |   35 +
 .../CIR/CodeGenOpenACC/routine-globals2.cpp   |   44 +
 .../CIR/CodeGenOpenACC/routine-locals.cpp     |   24 +
 .../CIR/CodeGenOpenACC/routine-members.cpp    |   55 +
 clang/test/CIR/CodeGenOpenACC/routine-ns.cpp  |   28 +
 .../test/CIR/CodeGenOpenACC/routine-templ.cpp |   16 +
 clang/test/CIR/IR/func.cir                    |    8 +
 clang/test/CIR/IR/inline-attrs.cir            |   37 +-
 clang/test/CIR/IR/invalid-func-attr.cir       |   11 +
 clang/test/CIR/func-linkage.cpp               |    8 +-
 clang/test/CXX/drs/cwg30xx.cpp                |    2 +-
 clang/test/CodeGen/Sparc/sparcv8-abi.c        |   36 +-
 clang/test/CodeGen/X86/avx-builtins.c         |    2 +
 clang/test/CodeGen/X86/avx2-builtins.c        |   22 +
 clang/test/CodeGen/X86/avx512bw-builtins.c    |   26 +
 clang/test/CodeGen/X86/avx512dq-builtins.c    |   13 +
 clang/test/CodeGen/X86/avx512f-builtins.c     |   25 +
 clang/test/CodeGen/X86/avx512vl-builtins.c    |    8 +
 clang/test/CodeGen/X86/sse2-builtins.c        |    4 +
 clang/test/CodeGen/alloc-token-lower.c        |   13 +-
 clang/test/CodeGen/alloc-token-module-flags.c |   22 +
 .../test/CodeGen/arm-mve-intrinsics/ternary.c | 1012 +++++++----
 .../CodeGen/arm-mve-intrinsics/vmaxnmaq.c     |   84 +-
 .../test/CodeGen/arm-mve-intrinsics/vmaxnmq.c |  110 +-
 .../CodeGen/arm-mve-intrinsics/vminnmaq.c     |   84 +-
 .../test/CodeGen/arm-mve-intrinsics/vminnmq.c |  110 +-
 clang/test/CodeGen/attr-modular-format.c      |   49 +
 .../distributed-thin-lto/memprof-pgho.cpp     |   69 +
 clang/test/CodeGen/lto-newpm-pipeline.c       |    8 +-
 clang/test/CodeGenCUDA/Inputs/cuda.h          |    8 +-
 clang/test/CodeGenCUDA/device-kernel-call.cu  |   35 +
 .../BasicFeatures/MatrixElementTypeCast.hlsl  |  219 +++
 .../MatrixExplicitTruncation.hlsl             |  156 ++
 .../MatrixImplicitTruncation.hlsl             |  138 ++
 clang/test/CodeGenHLSL/BoolVector.hlsl        |   15 +-
 .../builtins/VectorElementStore.hlsl          |   41 +
 clang/test/CodeGenHLSL/builtins/lit.hlsl      |    6 +-
 .../CodeGenHLSL/semantics/SV_Target.ps.hlsl   |   19 +
 ...antic.explicit-location-output-struct.hlsl |   37 +
 .../semantics/semantic.explicit-location.hlsl |   19 +
 .../semantic.explicit-mix-builtin.hlsl        |   39 +
 .../semantic.explicit-mix-builtin.vs.hlsl     |   31 +
 .../semantics/semantic.explicit-mix.lib.hlsl  |   40 +
 clang/test/CodeGenOpenCL/address-spaces.cl    |    5 +-
 clang/test/CodeGenOpenCL/builtins-alloca.cl   |    4 +-
 clang/test/CodeGenOpenCL/ptx-calls.cl         |   26 +-
 clang/test/CodeGenOpenCL/reflect.cl           |    2 +-
 clang/test/Driver/autocomplete.c              |    2 +
 clang/test/Driver/nvlink-wrapper.c            |   11 +-
 clang/test/FixIt/fixit-cxx0x-attributes.cpp   |   48 +
 clang/test/Misc/amdgcn.languageOptsOpenCL.cl  |    7 +
 ...a-attribute-supported-attributes-list.test |    2 +
 clang/test/Modules/GH170084.cpp               |   75 +
 clang/test/Modules/pr170235.cppm              |   32 +
 clang/test/OpenMP/amdgcn_weak_alias.c         |    7 -
 clang/test/OpenMP/cancel_codegen.cpp          |  119 +-
 clang/test/OpenMP/critical_codegen.cpp        |    2 +
 clang/test/OpenMP/critical_codegen_attr.cpp   |    2 +
 .../OpenMP/irbuilder_nested_parallel_for.c    |  108 +-
 clang/test/OpenMP/masked_codegen.cpp          |    2 +
 clang/test/OpenMP/master_codegen.cpp          |    2 +
 clang/test/OpenMP/nested_loop_codegen.cpp     |    4 +
 clang/test/OpenMP/ordered_codegen.cpp         |   40 +-
 clang/test/OpenMP/parallel_codegen.cpp        |    4 +
 clang/test/OpenMP/target_update_codegen.cpp   |   32 +
 .../target_update_iterator_ast_print.cpp      |   16 +
 .../target_update_iterator_serialization.cpp  |   35 +
 clang/test/Parser/cxx-nested-name-spec.cpp    |   10 +
 clang/test/Preprocessor/header-shadowing.c    |   57 +
 clang/test/Preprocessor/init.c                |   22 +-
 .../Preprocessor/predefined-arch-macros.c     |    5 +
 clang/test/Sema/attr-modular-format.c         |   26 +
 clang/test/SemaCUDA/Inputs/cuda.h             |    7 +
 .../test/SemaCUDA/call-kernel-from-kernel.cu  |    5 +-
 clang/test/SemaCUDA/device-kernel-call.cu     |   15 +
 clang/test/SemaCUDA/function-overload.cu      |   26 +-
 clang/test/SemaCUDA/function-target.cu        |    4 +-
 clang/test/SemaCUDA/reference-to-kernel-fn.cu |    4 +-
 .../SemaCXX/constexpr-x86-avx-builtins.cpp    |   18 +
 .../constexpr-x86-avx512f-builtins.cpp        |  230 +++
 .../constexpr-x86-avx512vl-builtins.cpp       |  120 ++
 .../SemaCXX/constexpr-x86-sse2-builtins.cpp   |   79 +
 .../SemaCXX/no-warn-consumed-analysis.cpp     |    9 +
 clang/test/SemaCXX/zero-length-arrays.cpp     |   17 +-
 .../MatrixElementOverloadResolution.hlsl      |  293 +++
 .../test/SemaHLSL/Semantics/position.ps.hlsl  |    2 +-
 .../semantic.explicit-mix-builtin-vs.hlsl     |   16 +
 .../semantic.explicit-mix-location-2.hlsl     |   15 +
 .../semantic.explicit-mix-location.hlsl       |   15 +
 .../SemaHLSL/Semantics/target.ps.input.hlsl   |    7 +
 .../SemaHLSL/Semantics/target.vs.input.hlsl   |    8 +
 .../SemaHLSL/Semantics/target.vs.output.hlsl  |    7 +
 .../Types/BuiltinMatrix/MatrixCastErrors.hlsl |   21 +
 .../MatrixImplicitTruncCastWarnings.hlsl      |   50 +
 clang/test/SemaHLSL/static_resources.hlsl     |  138 ++
 clang/tools/CMakeLists.txt                    |    6 +-
 .../ClangNVLinkWrapper.cpp                    |   21 +-
 .../tools/clang-nvlink-wrapper/NVLinkOpts.td  |    5 +
 clang/tools/clang-scan-deps/ClangScanDeps.cpp |    8 +-
 .../ASTMatchers/ASTMatchersNodeTest.cpp       |   20 +
 .../Analysis/ExprMutationAnalyzerTest.cpp     |   17 +
 clang/unittests/Format/FormatTest.cpp         |   59 +
 clang/unittests/Format/FormatTestObjC.cpp     |   26 +
 clang/unittests/Format/FormatTestVerilog.cpp  |    5 +
 clang/unittests/Format/TokenAnnotatorTest.cpp |    7 +
 clang/unittests/Lex/PPCallbacksTest.cpp       |   56 +-
 clang/unittests/Tooling/RangeSelectorTest.cpp |   39 +
 clang/utils/TableGen/MveEmitter.cpp           |   15 +-
 .../cmake/Modules/AllSupportedArchDefs.cmake  |    2 +-
 compiler-rt/cmake/config-ix.cmake             |    1 +
 compiler-rt/lib/builtins/CMakeLists.txt       |    5 +-
 compiler-rt/lib/tysan/tysan_platform.h        |    7 +
 compiler-rt/test/CMakeLists.txt               |    7 +
 compiler-rt/test/builtins/CMakeLists.txt      |    3 +-
 .../sanitizer_common/TestCases/printf-ldbl.c  |    3 -
 .../sanitizer_common/TestCases/scanf-ldbl.c   |    3 -
 .../ubsan/TestCases/Float/cast-overflow.cpp   |    3 -
 .../ubsan/TestCases/Misc/log-path_test.cpp    |    3 -
 .../Posix/always-never-instrument.cpp         |    2 +
 .../xray/TestCases/Posix/default-options.cpp  |    2 +
 compiler-rt/test/xray/lit.site.cfg.py.in      |    2 +-
 flang-rt/lib/runtime/extensions.cpp           |   58 +
 flang/docs/Directives.md                      |    9 +
 flang/docs/Intrinsics.md                      |   42 +
 flang/docs/ReleaseNotes.md                    |    7 +-
 flang/docs/ReleaseNotesTemplate.txt           |    6 +-
 flang/docs/conf.py                            |   28 +
 flang/include/flang/Lower/OpenMP/Clauses.h    |    2 +-
 .../flang/Optimizer/Builder/CUFCommon.h       |    2 +-
 .../flang/Optimizer/Builder/IntrinsicCall.h   |    4 +
 .../Optimizer/Builder/Runtime/Intrinsics.h    |    6 +
 .../flang/Optimizer/Dialect/CUF/CUFOps.td     |    8 +-
 .../include/flang/Optimizer/Dialect/FIROps.td |   35 +-
 .../Support/FIROpenACCTypeInterfaces.h        |    9 +
 .../Transforms/CUDA/CUFAllocationConversion.h |   33 +
 .../flang/Optimizer/Transforms/Passes.td      |    8 +
 flang/include/flang/Parser/dump-parse-tree.h  |    4 +-
 flang/include/flang/Parser/openmp-utils.h     |   12 +-
 flang/include/flang/Parser/parse-tree.h       |   18 +-
 flang/include/flang/Runtime/extensions.h      |    9 +
 flang/lib/Evaluate/fold-real.cpp              |   10 +-
 flang/lib/Evaluate/intrinsics.cpp             |    8 +
 flang/lib/Evaluate/tools.cpp                  |    6 +-
 flang/lib/Lower/Bridge.cpp                    |   56 +-
 flang/lib/Lower/OpenMP/Clauses.cpp            |    2 +-
 flang/lib/Lower/Runtime.cpp                   |   51 +-
 .../Optimizer/Builder/CUDAIntrinsicCall.cpp   |   16 +-
 flang/lib/Optimizer/Builder/IntrinsicCall.cpp |   35 +
 .../Optimizer/Builder/Runtime/Intrinsics.cpp  |   24 +
 flang/lib/Optimizer/Dialect/CUF/CUFOps.cpp    |    3 +-
 flang/lib/Optimizer/Dialect/FIROps.cpp        |    8 +
 .../Support/FIROpenACCTypeInterfaces.cpp      |  110 ++
 flang/lib/Optimizer/Transforms/CMakeLists.txt |    1 +
 .../CUDA/CUFAllocationConversion.cpp          |  468 +++++
 .../CUFComputeSharedMemoryOffsetsAndSize.cpp  |   96 +-
 .../Transforms/CUFGPUToLLVMConversion.cpp     |    7 +-
 .../Optimizer/Transforms/CUFOpConversion.cpp  |  390 +---
 flang/lib/Parser/Fortran-parsers.cpp          |   11 +
 flang/lib/Parser/openmp-parsers.cpp           |    4 +-
 flang/lib/Parser/unparse.cpp                  |   21 +-
 .../lib/Semantics/canonicalize-directives.cpp |    4 +
 flang/lib/Semantics/check-omp-loop.cpp        |   14 +
 flang/lib/Semantics/check-omp-structure.cpp   |   15 -
 flang/lib/Semantics/resolve-names.cpp         |   11 +
 flang/test/Evaluate/bug168978.f90             |    6 +
 flang/test/Evaluate/folding03.f90             |    4 +
 flang/test/Fir/CUDA/cuda-code-gen.mlir        |    4 +-
 flang/test/Fir/CUDA/cuda-shared-offset.mlir   |   23 +-
 flang/test/Fir/CUDA/cuda-shared-to-llvm.mlir  |    6 +-
 .../OpenACC/pointer-like-interface-load.mlir  |   95 +
 .../OpenACC/pointer-like-interface-store.mlir |   85 +
 flang/test/Fir/fir-ops.fir                    |    7 +
 .../parallel-private-reduction-worstcase.f90  |    5 +-
 flang/test/Lower/CUDA/cuda-device-proc.cuf    |   21 +-
 flang/test/Lower/Intrinsics/rand.f90          |   41 +
 flang/test/Lower/dispatch-table-abstract.f90  |   21 +
 .../Lower/module-debug-file-loc-linux.f90     |    2 +-
 flang/test/Lower/pause-statement.f90          |   28 +-
 flang/test/Lower/vectorlength.f90             |   67 +
 flang/test/Parser/OpenMP/fuse-looprange.f90   |    2 +-
 flang/test/Parser/compiler-directives.f90     |   22 +
 flang/test/Semantics/equiv-kind.f90           |   19 +
 flang/test/Transforms/debug-dwarf-version.fir |    2 +-
 .../Transforms/debug-line-table-existing.fir  |    2 +-
 .../Transforms/debug-line-table-inc-file.fir  |    2 +-
 .../debug-line-table-inc-same-file.fir        |    2 +-
 libc/docs/talks.rst                           |   68 +-
 libc/fuzzing/__support/freelist_heap_fuzz.cpp |    8 +-
 libc/src/__support/CMakeLists.txt             |    1 -
 libc/src/__support/block.h                    |   70 +-
 libc/src/__support/freelist_heap.h            |    6 +-
 libc/src/__support/freestore.h                |    9 +-
 libc/src/__support/wctype_utils.h             |   26 -
 libc/src/wchar/CMakeLists.txt                 |    2 -
 libc/src/wchar/btowc.cpp                      |    8 +-
 libc/src/wchar/wctob.cpp                      |   12 +-
 libc/test/src/__support/block_test.cpp        |   38 +-
 .../test/src/__support/freelist_heap_test.cpp |    4 +-
 libc/test/src/__support/freestore_test.cpp    |    4 +-
 .../opencl/lib/generic/atomic/atomic_def.inc  |    3 +-
 .../lib/generic/integer/bitfield_insert.cl    |    2 +-
 libcxx/docs/AddingNewCIJobs.rst               |    3 +
 libcxx/docs/Contributing.rst                  |    3 +
 libcxx/include/__functional/bind.h            |   20 +-
 libcxx/include/__mutex/once_flag.h            |   10 +-
 libcxx/include/__utility/integer_sequence.h   |    3 +
 libcxx/include/__utility/pair.h               |    6 +-
 libcxx/include/ext/hash_map                   |    5 +-
 libcxx/include/ext/hash_set                   |    5 +-
 libcxx/include/future                         |   10 +-
 libcxx/include/optional                       |   65 +
 libcxx/include/scoped_allocator               |    4 +-
 libcxx/include/tuple                          |   25 +-
 .../src/include/from_chars_floating_point.h   |   14 +-
 .../extensions/gnu/hash_map/copy.pass.cpp     |   27 +
 .../extensions/gnu/hash_set/copy.pass.cpp     |   27 +
 libcxx/test/selftest/dsl/lit.local.cfg        |    2 +-
 libcxx/utils/ci/buildkite-pipeline.yml        |   30 +
 libcxx/utils/ci/run-buildbot                  |  100 +-
 libcxx/utils/ci/run-buildbot-container        |    4 +-
 libsycl/README.md                             |    2 +-
 libunwind/test/aarch64_vg_unwind.pass.cpp     |    3 +-
 libunwind/test/aarch64_za_unwind.pass.cpp     |    3 +-
 libunwind/test/bad_unwind_info.pass.cpp       |    4 +-
 libunwind/test/eh_frame_fde_pc_range.pass.cpp |    6 +-
 libunwind/test/floatregister.pass.cpp         |    3 +-
 libunwind/test/forceunwind.pass.cpp           |    4 +-
 libunwind/test/remember_state_leak.pass.sh.s  |    4 +-
 libunwind/test/signal_unwind.pass.cpp         |    4 +-
 libunwind/test/unwind_leaffunction.pass.cpp   |    4 +-
 .../test/unwind_scalable_vectors.pass.cpp     |    2 +-
 lld/ELF/Options.td                            |    2 +-
 lld/ELF/SyntheticSections.cpp                 |  157 +-
 lld/ELF/SyntheticSections.h                   |   30 +-
 lld/MachO/UnwindInfoSection.cpp               |   75 +-
 lldb/bindings/python/python-wrapper.swig      |   12 +
 lldb/docs/CMakeLists.txt                      |    1 +
 lldb/docs/dil-expr-lang.ebnf                  |   28 +-
 lldb/docs/python_extensions.rst               |    8 +
 lldb/docs/resources/lldbgdbremote.md          |    6 +-
 .../templates/scripted_frame_provider.py      |   47 +
 .../python/templates/scripted_process.py      |   47 +-
 lldb/include/lldb/API/SBTarget.h              |   30 +
 lldb/include/lldb/API/SBThread.h              |    1 +
 lldb/include/lldb/API/SBThreadCollection.h    |    1 +
 lldb/include/lldb/API/SBTrace.h               |    2 +-
 lldb/include/lldb/Breakpoint/BreakpointSite.h |    4 +
 lldb/include/lldb/Core/Disassembler.h         |    4 +
 lldb/include/lldb/Core/FormatEntity.h         |    1 +
 .../Host/windows/ProcessLauncherWindows.h     |   31 +
 .../ScriptedFrameProviderInterface.h          |   18 +
 .../Interfaces/ScriptedInterface.h            |    4 +
 .../lldb/Interpreter/ScriptInterpreter.h      |    3 +
 lldb/include/lldb/Symbol/ObjectFile.h         |   11 +-
 lldb/include/lldb/Target/BorrowedStackFrame.h |  146 ++
 lldb/include/lldb/Target/ExecutionContext.h   |   14 +-
 lldb/include/lldb/Target/StackFrame.h         |   98 +-
 lldb/include/lldb/Target/StackFrameList.h     |   36 +-
 .../lldb/Target/SyntheticFrameProvider.h      |   30 +-
 lldb/include/lldb/Target/Target.h             |   38 +
 lldb/include/lldb/Target/TargetList.h         |    5 +
 lldb/include/lldb/Target/Thread.h             |   12 +
 lldb/include/lldb/Target/ThreadSpec.h         |    2 +
 lldb/include/lldb/Utility/DataExtractor.h     |   14 +-
 lldb/include/lldb/Utility/RangeMap.h          |    4 +
 lldb/include/lldb/Utility/ScriptedMetadata.h  |   27 +
 .../lldb/Utility/VirtualDataExtractor.h       |   75 +
 lldb/include/lldb/ValueObject/DILAST.h        |   33 +
 lldb/include/lldb/ValueObject/DILEval.h       |    1 +
 lldb/include/lldb/ValueObject/DILParser.h     |    6 +
 .../lldb/ValueObject/ValueObjectSynthetic.h   |    5 +
 lldb/include/lldb/lldb-enumerations.h         |    1 +
 lldb/include/lldb/lldb-forward.h              |    2 +
 lldb/include/lldb/lldb-private-interfaces.h   |    4 +-
 .../Python/lldbsuite/test/gdbclientutils.py   |   74 +-
 .../test/tools/lldb-dap/dap_server.py         |    2 +-
 lldb/source/API/SBDebugger.cpp                |    2 +-
 lldb/source/API/SBTarget.cpp                  |   82 +
 lldb/source/Breakpoint/BreakpointSite.cpp     |   16 +
 lldb/source/Commands/CommandObjectTarget.cpp  |  200 +++
 lldb/source/Core/Disassembler.cpp             |    5 +
 lldb/source/Core/FormatEntity.cpp             |   19 +
 lldb/source/Core/Statusline.cpp               |    9 +-
 lldb/source/Expression/DWARFExpression.cpp    |  301 ++--
 lldb/source/Expression/ObjectFileJIT.cpp      |   10 +-
 .../Host/windows/ProcessLauncherWindows.cpp   |  191 +-
 lldb/source/Interpreter/ScriptInterpreter.cpp |    7 +
 lldb/source/Plugins/CMakeLists.txt            |    1 +
 .../Disassembler/LLVMC/DisassemblerLLVMC.cpp  |   13 +
 .../BoundsSafety/CMakeLists.txt               |   13 +
 .../InstrumentationRuntimeBoundsSafety.cpp    |  481 +++++
 .../InstrumentationRuntimeBoundsSafety.h      |   61 +
 .../InstrumentationRuntime/CMakeLists.txt     |    1 +
 .../Plugins/Language/CPlusPlus/CMakeLists.txt |    1 +
 .../Language/CPlusPlus/CPlusPlusLanguage.cpp  |    9 +
 .../CPlusPlus/CPlusPlusNameParser.cpp         |   11 +-
 .../Plugins/Language/CPlusPlus/LibStdcpp.h    |    4 +
 .../Language/CPlusPlus/LibStdcppSpan.cpp      |  112 ++
 .../Breakpad/ObjectFileBreakpad.cpp           |    8 +-
 .../ObjectFile/COFF/ObjectFileCOFF.cpp        |    4 +-
 .../Plugins/ObjectFile/ELF/ObjectFileELF.cpp  |   26 +-
 .../ObjectFile/Mach-O/ObjectFileMachO.cpp     |  247 +--
 .../ObjectFile/PECOFF/ObjectFilePECOFF.cpp    |   88 +-
 .../ObjectFile/XCOFF/ObjectFileXCOFF.cpp      |    2 +-
 .../ObjectFile/wasm/ObjectFileWasm.cpp        |   12 +-
 .../Process/Windows/Common/ProcessWindows.cpp |    4 +-
 .../Process/gdb-remote/ProcessGDBRemote.cpp   |    8 +-
 .../Process/scripted/ScriptedFrame.cpp        |   87 +-
 .../Plugins/Process/scripted/ScriptedFrame.h  |   40 +-
 .../Process/scripted/ScriptedThread.cpp       |    6 +-
 .../ScriptInterpreterPythonInterfaces.cpp     |    2 +
 .../ScriptedFrameProviderPythonInterface.cpp  |   58 +-
 .../ScriptedFrameProviderPythonInterface.h    |   23 +-
 .../Interfaces/ScriptedPythonInterface.cpp    |   13 +
 .../Interfaces/ScriptedPythonInterface.h      |  192 +-
 .../Python/PythonDataObjects.cpp              |   11 +
 .../Python/SWIGPythonBridge.h                 |    1 +
 .../SymbolFile/DWARF/DWARFASTParserClang.cpp  |   14 +-
 .../SymbolFile/DWARF/DWARFASTParserClang.h    |    9 +-
 .../SymbolFile/DWARF/SymbolFileDWARF.cpp      |   11 +-
 .../NativePDB/SymbolFileNativePDB.cpp         |   52 +-
 .../SyntheticFrameProvider/CMakeLists.txt     |    1 +
 .../ScriptedFrameProvider/CMakeLists.txt      |   12 +
 .../ScriptedFrameProvider.cpp                 |  221 +++
 .../ScriptedFrameProvider.h                   |   53 +
 .../UnwindAssemblyInstEmulation.cpp           |   81 +-
 .../UnwindAssemblyInstEmulation.h             |    4 +-
 lldb/source/Symbol/ObjectFile.cpp             |   22 +-
 lldb/source/Target/BorrowedStackFrame.cpp     |  187 ++
 lldb/source/Target/CMakeLists.txt             |    1 +
 lldb/source/Target/ExecutionContext.cpp       |   17 +-
 lldb/source/Target/StackFrame.cpp             |    3 +
 lldb/source/Target/StackFrameList.cpp         |   44 +
 lldb/source/Target/SyntheticFrameProvider.cpp |   25 +-
 lldb/source/Target/Target.cpp                 |   55 +
 lldb/source/Target/TargetList.cpp             |    5 +-
 lldb/source/Target/Thread.cpp                 |   72 +-
 lldb/source/Target/ThreadPlanStepOut.cpp      |   11 +-
 lldb/source/Target/ThreadSpec.cpp             |    4 +
 lldb/source/Utility/CMakeLists.txt            |    1 +
 lldb/source/Utility/VirtualDataExtractor.cpp  |  139 ++
 lldb/source/ValueObject/DILAST.cpp            |    4 +
 lldb/source/ValueObject/DILEval.cpp           |   12 +
 lldb/source/ValueObject/DILParser.cpp         |  167 +-
 .../ValueObject/ValueObjectSynthetic.cpp      |   15 +
 .../generic/span/TestDataFormatterStdSpan.py  |   26 +-
 .../scripted_frame_provider/Makefile          |    3 +
 .../TestScriptedFrameProvider.py              |  428 +++++
 .../circular_dependency/Makefile              |    3 +
 .../TestFrameProviderCircularDependency.py    |  119 ++
 .../circular_dependency/frame_provider.py     |  102 ++
 .../circular_dependency/main.c                |   21 +
 .../scripted_frame_provider/main.cpp          |   53 +
 .../test_frame_providers.py                   |  222 +++
 .../statusline/TestStatusline.py              |    6 +-
 .../API/lang/BoundsSafety/soft_trap/Makefile  |   10 +
 .../TestBoundsSafetyInstrumentationPlugin.py  |  148 ++
 .../API/lang/BoundsSafety/soft_trap/main.c    |   10 +
 .../soft_trap/mockSoftTrapRuntime.c           |   17 +
 .../API/python_api/exprpath_register/Makefile |    3 +
 .../TestExprPathRegisters.py                  |   64 +
 .../API/python_api/exprpath_register/main.c   |   10 +
 .../boundsSafetyMockCallSoftTrapRuntime.c     |    8 +
 .../Inputs/boundsSafetyMockSoftTrapRuntime.c  |   15 +
 .../Inputs/boundsSafetySoftTraps.c            |   12 +
 .../boundsSafetySoftTrapsMissingReason.c      |   20 +
 .../boundssafety_soft_trap_call_minimal.test  |   31 +
 ...soft_trap_call_minimal_missing_reason.test |   34 +
 ...ty_soft_trap_call_minimal_no_dbg_info.test |   33 +
 ...fety_soft_trap_call_minimal_no_plugin.test |   30 +
 .../boundssafety_soft_trap_call_str.test      |   31 +
 ...oft_trap_call_with_str_missing_reason.test |   34 +
 ...y_soft_trap_call_with_str_no_dbg_info.test |   30 +
 ...ap_call_with_str_no_dbg_info_null_str.test |   36 +
 ...ety_soft_trap_call_with_str_no_plugin.test |   30 +
 .../NativePDB/find-pdb-next-to-exe.test       |   76 +
 lldb/test/Shell/helper/toolchain.py           |    2 +-
 lldb/tools/debugserver/source/DNB.cpp         |    2 +-
 .../debugserver/source/MacOSX/MachProcess.h   |   14 +-
 .../debugserver/source/MacOSX/MachProcess.mm  |  125 +-
 lldb/tools/lldb-dap/DAP.cpp                   |    7 +-
 .../Handler/CompileUnitsRequestHandler.cpp    |    6 +-
 .../Handler/InitializeRequestHandler.cpp      |    2 +-
 lldb/tools/lldb-dap/JSONUtils.cpp             |    5 +-
 lldb/tools/lldb-dap/Protocol/ProtocolBase.h   |    4 +-
 .../lldb-dap/Protocol/ProtocolRequests.cpp    |   11 +-
 .../lldb-dap/Protocol/ProtocolRequests.h      |   12 +-
 lldb/unittests/DAP/ProtocolRequestsTest.cpp   |   66 +-
 lldb/unittests/DAP/TestBase.cpp               |    1 +
 .../Expression/DWARFExpressionTest.cpp        |  109 +-
 .../CPlusPlus/CPlusPlusLanguageTest.cpp       |    6 +
 .../Python/PythonTestSuite.cpp                |    5 +
 .../unittests/Symbol/TestClangASTImporter.cpp |    2 +-
 .../DWARF/DWARFASTParserClangTests.cpp        |  501 ++++--
 .../ARM64/TestArm64InstEmulation.cpp          |   23 +-
 lldb/unittests/Utility/CMakeLists.txt         |    1 +
 .../Utility/VirtualDataExtractorTest.cpp      |  583 ++++++
 llvm/cmake/modules/CrossCompile.cmake         |    4 +-
 llvm/docs/InstCombineContributorGuide.md      |    2 +-
 llvm/docs/KeyInstructionsDebugInfo.md         |    2 +-
 llvm/docs/MIRLangRef.rst                      |   16 +
 llvm/docs/ReleaseNotes.md                     |    2 +
 llvm/docs/SPIRVUsage.rst                      |   13 +-
 llvm/docs/Telemetry.rst                       |   48 +-
 llvm/include/llvm/ADT/MapVector.h             |    5 +
 llvm/include/llvm/ADT/SetVector.h             |   12 +-
 llvm/include/llvm/Analysis/CFGPrinter.h       |   48 +-
 .../llvm/Analysis/DependenceAnalysis.h        |   15 +-
 llvm/include/llvm/Analysis/IVDescriptors.h    |   18 +-
 .../llvm/Analysis/RuntimeLibcallInfo.h        |   22 +-
 llvm/include/llvm/Analysis/ScalarEvolution.h  |   12 +
 .../llvm/Analysis/TargetTransformInfo.h       |    3 +-
 .../llvm/Analysis/TargetTransformInfoImpl.h   |   12 +-
 llvm/include/llvm/CodeGen/BasicTTIImpl.h      |   68 +-
 .../CodeGen/GlobalISel/InstructionSelector.h  |    3 -
 .../llvm/CodeGen/GlobalISel/LegalizerInfo.h   |   10 +
 .../llvm/CodeGen/GlobalISel/RegBankSelect.h   |    3 -
 llvm/include/llvm/CodeGen/GlobalISel/Utils.h  |    3 -
 .../llvm/CodeGen/LibcallLoweringInfo.h        |   68 +
 llvm/include/llvm/CodeGen/MIR2Vec.h           |    2 +-
 llvm/include/llvm/CodeGen/MachineInstr.h      |    5 +
 .../llvm/CodeGen/MachineInstrBuilder.h        |    5 +
 llvm/include/llvm/CodeGen/MachineOperand.h    |   17 +-
 llvm/include/llvm/CodeGen/SelectionDAG.h      |   14 +-
 llvm/include/llvm/CodeGen/TargetLowering.h    |    2 +-
 .../DebugInfo/DWARF/DWARFAcceleratorTable.h   |   29 +-
 .../Orc/CallableTraitsHelper.h                |   74 +
 llvm/include/llvm/Frontend/OpenMP/ClauseT.h   |    4 +-
 llvm/include/llvm/Frontend/OpenMP/OMP.td      |    2 +-
 .../llvm/Frontend/OpenMP/OMPIRBuilder.h       |   59 +-
 llvm/include/llvm/IR/Intrinsics.h             |   15 +
 llvm/include/llvm/IR/IntrinsicsARM.td         |   10 +
 llvm/include/llvm/IR/IntrinsicsRISCVXCV.td    |    4 +
 llvm/include/llvm/InitializePasses.h          |    1 +
 llvm/include/llvm/MC/MCObjectFileInfo.h       |    7 -
 llvm/include/llvm/ProfileData/SampleProf.h    |    1 +
 llvm/include/llvm/Support/DebugCounter.h      |  125 +-
 llvm/include/llvm/Support/LSP/Protocol.h      |   27 +
 llvm/include/llvm/Support/TargetOpcodes.def   |    6 +
 llvm/include/llvm/Target/Target.td            |    7 +
 .../Transforms/Instrumentation/AllocToken.h   |    2 +-
 .../Vectorize/LoopVectorizationLegality.h     |    9 +
 llvm/lib/Analysis/CFGPrinter.cpp              |    6 +-
 llvm/lib/Analysis/ConstantFolding.cpp         |    4 +
 llvm/lib/Analysis/Delinearization.cpp         |   63 +
 llvm/lib/Analysis/DependenceAnalysis.cpp      |   68 +-
 llvm/lib/Analysis/IVDescriptors.cpp           |   51 +
 llvm/lib/Analysis/InstructionSimplify.cpp     |   25 +-
 llvm/lib/Analysis/MemoryBuiltins.cpp          |   38 +-
 llvm/lib/Analysis/RuntimeLibcallInfo.cpp      |   16 +
 llvm/lib/Analysis/ScalarEvolution.cpp         |   56 +-
 llvm/lib/Analysis/ScalarEvolutionDivision.cpp |    4 -
 llvm/lib/Analysis/ValueTracking.cpp           |   11 +
 llvm/lib/AsmParser/LLParser.cpp               |   29 +-
 llvm/lib/Bitcode/Reader/BitcodeReader.cpp     |   19 +-
 llvm/lib/Bitcode/Writer/BitcodeWriter.cpp     |   11 +-
 llvm/lib/CodeGen/CodeGen.cpp                  |    1 +
 llvm/lib/CodeGen/ExpandFp.cpp                 |   40 +-
 llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp  |   21 +-
 .../CodeGen/GlobalISel/InstructionSelect.cpp  |   15 +-
 .../CodeGen/GlobalISel/LegalityPredicates.cpp |   20 +
 llvm/lib/CodeGen/GlobalISel/Legalizer.cpp     |    4 +-
 .../CodeGen/GlobalISel/LegalizerHelper.cpp    |   21 +-
 llvm/lib/CodeGen/GlobalISel/RegBankSelect.cpp |    9 +-
 llvm/lib/CodeGen/GlobalISel/Utils.cpp         |   25 +-
 llvm/lib/CodeGen/LibcallLoweringInfo.cpp      |   42 +
 llvm/lib/CodeGen/MIRParser/MILexer.cpp        |    1 +
 llvm/lib/CodeGen/MIRParser/MILexer.h          |    1 +
 llvm/lib/CodeGen/MIRParser/MIParser.cpp       |   28 +
 llvm/lib/CodeGen/MIRPrinter.cpp               |    3 +-
 llvm/lib/CodeGen/MIRVRegNamerUtils.cpp        |    3 +-
 llvm/lib/CodeGen/MachineBasicBlock.cpp        |   16 +-
 llvm/lib/CodeGen/MachineOperand.cpp           |   19 +-
 llvm/lib/CodeGen/MachineStableHash.cpp        |    4 +
 llvm/lib/CodeGen/MachineVerifier.cpp          |   40 +
 llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp |   53 +-
 llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp |   46 +-
 llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp |   15 +-
 .../SelectionDAG/LegalizeVectorTypes.cpp      |  146 +-
 .../lib/CodeGen/SelectionDAG/SelectionDAG.cpp |  180 +-
 .../SelectionDAG/SelectionDAGBuilder.cpp      |   14 +-
 .../CodeGen/SelectionDAG/TargetLowering.cpp   |   43 +-
 .../DebugInfo/DWARF/DWARFAcceleratorTable.cpp |   24 +-
 llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp     |  355 ++--
 llvm/lib/IR/ConstantRange.cpp                 |    6 +
 llvm/lib/IR/IRBuilder.cpp                     |   16 +-
 llvm/lib/IR/Intrinsics.cpp                    |   56 +-
 llvm/lib/LTO/LTO.cpp                          |    1 -
 llvm/lib/LTO/LTOBackend.cpp                   |    6 +
 llvm/lib/MC/MCObjectFileInfo.cpp              |    4 -
 llvm/lib/MC/MCWin64EH.cpp                     |   60 +-
 llvm/lib/Passes/PassBuilderPipelines.cpp      |   28 +
 llvm/lib/Passes/PassRegistry.def              |    1 +
 llvm/lib/ProfileData/SampleProf.cpp           |    6 +
 llvm/lib/Support/DebugCounter.cpp             |   75 +-
 llvm/lib/Support/LSP/Protocol.cpp             |   18 +
 .../AArch64/AArch64ExpandPseudoInsts.cpp      |    1 +
 .../Target/AArch64/AArch64ISelLowering.cpp    |  160 +-
 llvm/lib/Target/AArch64/AArch64ISelLowering.h |    3 +-
 .../lib/Target/AArch64/AArch64InstrFormats.td |   56 +-
 llvm/lib/Target/AArch64/AArch64InstrInfo.td   |   45 +-
 .../lib/Target/AArch64/AArch64SMEInstrInfo.td |    6 +
 .../AArch64/AArch64TargetTransformInfo.cpp    |   19 +-
 .../AArch64/AArch64TargetTransformInfo.h      |    6 +-
 .../MCTargetDesc/AArch64AsmBackend.cpp        |    5 +
 llvm/lib/Target/AArch64/MachineSMEABIPass.cpp |  205 ++-
 llvm/lib/Target/AMDGPU/AMDGPU.td              |   50 +-
 llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp |    4 +-
 llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp |   18 +-
 llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.cpp    |   15 +-
 .../lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp |   36 +-
 llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h  |    1 +
 llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp  |    3 +-
 .../Target/AMDGPU/AMDGPURegBankLegalize.cpp   |    6 +-
 .../AMDGPU/AMDGPURegBankLegalizeHelper.cpp    |  199 ++-
 .../AMDGPU/AMDGPURegBankLegalizeHelper.h      |   39 +-
 .../AMDGPU/AMDGPURegBankLegalizeRules.cpp     |   33 +-
 .../AMDGPU/AMDGPURegBankLegalizeRules.h       |    4 +-
 .../AMDGPU/AMDGPURewriteAGPRCopyMFMA.cpp      |   24 +-
 .../lib/Target/AMDGPU/GCNHazardRecognizer.cpp |   91 +-
 llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h  |    9 +-
 llvm/lib/Target/AMDGPU/GCNSubtarget.h         |    6 +
 llvm/lib/Target/AMDGPU/SIISelLowering.cpp     |   38 +-
 llvm/lib/Target/AMDGPU/SIISelLowering.h       |    4 +-
 llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp   |    4 +-
 llvm/lib/Target/AMDGPU/SIInstrInfo.cpp        |   17 +-
 llvm/lib/Target/AMDGPU/SIInstrInfo.h          |    6 +-
 llvm/lib/Target/AMDGPU/SIInstructions.td      |   28 +-
 .../Target/AMDGPU/SILateBranchLowering.cpp    |    2 +-
 llvm/lib/Target/ARC/ARC.td                    |    2 +
 llvm/lib/Target/ARM/ARMExpandPseudoInsts.cpp  |    1 +
 llvm/lib/Target/ARM/ARMISelLowering.cpp       |   66 +-
 llvm/lib/Target/ARM/ARMISelLowering.h         |    3 +-
 llvm/lib/Target/ARM/ARMInstrMVE.td            |   34 +-
 llvm/lib/Target/ARM/ARMInstrVFP.td            |    8 +-
 .../lib/Target/ARM/ARMTargetTransformInfo.cpp |   23 +-
 llvm/lib/Target/ARM/ARMTargetTransformInfo.h  |    6 +-
 llvm/lib/Target/BPF/BPF.td                    |    4 -
 llvm/lib/Target/BPF/BPFISelLowering.cpp       |   23 +-
 llvm/lib/Target/BPF/BPFISelLowering.h         |   10 -
 llvm/lib/Target/BPF/BPFSubtarget.cpp          |    1 -
 llvm/lib/Target/BPF/BPFSubtarget.h            |    3 -
 .../Target/Hexagon/HexagonISelLowering.cpp    |    2 +-
 llvm/lib/Target/Hexagon/HexagonISelLowering.h |    2 +-
 .../Target/Hexagon/HexagonISelLoweringHVX.cpp |   30 +
 .../Hexagon/HexagonTargetTransformInfo.cpp    |   13 -
 .../Hexagon/HexagonTargetTransformInfo.h      |    8 -
 .../LoongArch/LoongArchISelLowering.cpp       |    2 +-
 .../Target/LoongArch/LoongArchISelLowering.h  |    2 +-
 llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp   |    7 +-
 llvm/lib/Target/NVPTX/NVPTXISelLowering.h     |    2 +-
 .../PowerPC/MCTargetDesc/PPCAsmBackend.cpp    |   24 +-
 llvm/lib/Target/PowerPC/PPCISelLowering.cpp   |    2 +-
 llvm/lib/Target/PowerPC/PPCISelLowering.h     |    3 +-
 .../Target/RISCV/AsmParser/RISCVAsmParser.cpp |    3 +
 .../RISCV/GISel/RISCVInstructionSelector.cpp  |    2 +-
 .../Target/RISCV/GISel/RISCVLegalizerInfo.cpp |    2 +-
 llvm/lib/Target/RISCV/RISCVFeatures.td        |  146 +-
 llvm/lib/Target/RISCV/RISCVISelLowering.cpp   |   80 +-
 llvm/lib/Target/RISCV/RISCVISelLowering.h     |    6 +-
 llvm/lib/Target/RISCV/RISCVInstrInfo.cpp      |    4 +-
 llvm/lib/Target/RISCV/RISCVInstrInfoSFB.td    |   25 +-
 .../Target/RISCV/RISCVInstrInfoVPseudos.td    |   30 +-
 llvm/lib/Target/RISCV/RISCVInstrInfoXAndes.td |    2 +-
 llvm/lib/Target/RISCV/RISCVInstrInfoXCV.td    |    9 +-
 llvm/lib/Target/RISCV/RISCVInstrInfoXqci.td   |    4 +-
 llvm/lib/Target/RISCV/RISCVProcessors.td      |    4 +-
 llvm/lib/Target/RISCV/RISCVSubtarget.h        |    2 +-
 .../Target/RISCV/RISCVTargetTransformInfo.cpp |   60 +-
 .../Target/RISCV/RISCVTargetTransformInfo.h   |   20 +-
 llvm/lib/Target/RISCV/RISCVVectorPeephole.cpp |   59 +-
 llvm/lib/Target/SPIRV/SPIRVBuiltins.cpp       |   73 +
 llvm/lib/Target/SPIRV/SPIRVBuiltins.td        |   14 +
 llvm/lib/Target/SPIRV/SPIRVCommandLine.cpp    |   10 +-
 llvm/lib/Target/SPIRV/SPIRVGlobalRegistry.cpp |   16 +-
 llvm/lib/Target/SPIRV/SPIRVISelLowering.cpp   |    2 +-
 llvm/lib/Target/SPIRV/SPIRVISelLowering.h     |    2 +-
 llvm/lib/Target/SPIRV/SPIRVInstrInfo.td       |   24 +
 .../Target/SPIRV/SPIRVInstructionSelector.cpp |   50 +-
 llvm/lib/Target/SPIRV/SPIRVLegalizerInfo.cpp  |  234 ++-
 llvm/lib/Target/SPIRV/SPIRVLegalizerInfo.h    |    4 +
 llvm/lib/Target/SPIRV/SPIRVModuleAnalysis.cpp |   21 +
 llvm/lib/Target/SPIRV/SPIRVPreLegalizer.cpp   |    2 +-
 .../lib/Target/SPIRV/SPIRVSymbolicOperands.td |    6 +-
 llvm/lib/Target/Sparc/SparcCallingConv.td     |    9 +-
 llvm/lib/Target/Sparc/SparcISelLowering.cpp   |  177 +-
 .../WebAssembly/WebAssemblyISelLowering.cpp   |    2 +-
 .../WebAssembly/WebAssemblyISelLowering.h     |    2 +-
 .../WebAssembly/WebAssemblyInstrInfo.td       |    2 +-
 .../WebAssembly/WebAssemblyInstrInteger.td    |    7 +
 llvm/lib/Target/X86/X86ISelLowering.cpp       |  190 +-
 llvm/lib/Target/X86/X86ISelLowering.h         |    2 +-
 .../lib/Target/X86/X86TargetTransformInfo.cpp |   16 +-
 llvm/lib/Target/X86/X86TargetTransformInfo.h  |    8 +-
 .../AggressiveInstCombine.cpp                 |   12 +-
 .../Transforms/IPO/AttributorAttributes.cpp   |   32 +-
 llvm/lib/Transforms/IPO/GlobalOpt.cpp         |   62 +-
 llvm/lib/Transforms/IPO/LowerTypeTests.cpp    |    7 +-
 .../lib/Transforms/IPO/WholeProgramDevirt.cpp |   20 +-
 .../InstCombine/InstCombineSelect.cpp         |    1 +
 .../Transforms/Instrumentation/AllocToken.cpp |   69 +-
 .../Instrumentation/RealtimeSanitizer.cpp     |    3 +
 .../Transforms/Scalar/LoopStrengthReduce.cpp  |   84 +-
 llvm/lib/Transforms/Scalar/SROA.cpp           |   86 +-
 llvm/lib/Transforms/Utils/BasicBlockUtils.cpp |   16 +
 llvm/lib/Transforms/Utils/LoopUnroll.cpp      |    2 +
 .../Transforms/Utils/LowerMemIntrinsics.cpp   |  518 +++---
 llvm/lib/Transforms/Utils/SCCPSolver.cpp      |   32 +
 llvm/lib/Transforms/Utils/SimplifyIndVar.cpp  |   40 +-
 .../Vectorize/LoopVectorizationLegality.cpp   |   33 +-
 .../Transforms/Vectorize/LoopVectorize.cpp    |   69 +-
 .../Transforms/Vectorize/SLPVectorizer.cpp    |    1 +
 llvm/lib/Transforms/Vectorize/VPlan.h         |   24 +-
 .../Vectorize/VPlanConstruction.cpp           |  177 +-
 .../Transforms/Vectorize/VPlanPatternMatch.h  |    6 +
 .../lib/Transforms/Vectorize/VPlanRecipes.cpp |   67 +-
 .../Transforms/Vectorize/VPlanTransforms.cpp  |  358 ++--
 .../Transforms/Vectorize/VPlanTransforms.h    |   12 +
 llvm/lib/Transforms/Vectorize/VPlanUtils.cpp  |    3 +-
 .../Transforms/Vectorize/VectorCombine.cpp    |    5 +-
 .../Analysis/CostModel/RISCV/cmp-select.ll    |    2 +-
 .../Analysis/CostModel/RISCV/vp-intrinsics.ll |  130 +-
 .../constant_functions_multi_dim.ll           |    2 +-
 .../Delinearization/multidim_only_ivs_2d.ll   |    4 +-
 .../Delinearization/multidim_only_ivs_3d.ll   |    2 +-
 ..._two_accesses_different_delinearization.ll |    4 +-
 .../Delinearization/validation_large_size.ll  |  180 ++
 .../Analysis/DependenceAnalysis/DADelin.ll    |   48 +-
 .../DependenceAnalysis/DifferentOffsets.ll    |   18 +-
 .../DependenceAnalysis/MIVCheckConst.ll       |    1 +
 .../Analysis/DependenceAnalysis/PR148435.ll   |   86 +
 .../Analysis/DependenceAnalysis/StrongSIV.ll  |   20 +-
 .../DependenceAnalysis/SymbolicSIV.ll         |   16 +-
 .../DependenceAnalysis/WeakCrossingSIV.ll     |    8 +-
 .../DependenceAnalysis/WeakZeroDstSIV.ll      |    4 +-
 .../DependenceAnalysis/WeakZeroSrcSIV.ll      |    4 +-
 .../becount-couldnotcompute.ll                |    4 +-
 .../DependenceAnalysis/bounds-check.ll        |   27 +
 .../compute-absolute-value.ll                 |    8 +-
 .../DependenceAnalysis/gcd-miv-overflow.ll    |  122 +-
 .../DependenceAnalysis/monotonicity-cast.ll   |    4 +-
 .../DependenceAnalysis/wrapping-addrec-1.ll   |   35 +
 .../DependenceAnalysis/wrapping-addrec.ll     |   33 +
 .../DependenceAnalysis/zero-coefficient.ll    |   34 +
 .../addrec-may-wrap-udiv-canonicalize.ll      |  245 ++-
 llvm/test/Analysis/ScalarEvolution/pr44605.ll |    4 +-
 .../ValueTracking/assume-queries-counter.ll   |    1 -
 llvm/test/Assembler/thinlto-summary.ll        |   56 +-
 .../AArch64/GlobalISel/counter-fallback.ll    |    2 -
 llvm/test/CodeGen/AArch64/O0-pipeline.ll      |    2 +
 llvm/test/CodeGen/AArch64/O3-pipeline.ll      |    2 +
 llvm/test/CodeGen/AArch64/arm64-int-neon.ll   |  238 +++
 llvm/test/CodeGen/AArch64/arm64-vmul.ll       |  184 +-
 llvm/test/CodeGen/AArch64/arm64-vshift.ll     |   34 +-
 ...rleaving-reductions-predicated-scalable.ll |   40 +-
 llvm/test/CodeGen/AArch64/expand-select.ll    |   50 +-
 .../CodeGen/AArch64/lrint-conv-fp16-win.ll    |   45 +-
 llvm/test/CodeGen/AArch64/lrint-conv-win.ll   |   55 +-
 .../CodeGen/AArch64/lround-conv-fp16-win.ll   |   55 +-
 llvm/test/CodeGen/AArch64/lround-conv-win.ll  |   64 +-
 .../machine-sme-abi-find-insert-pt.mir        |    4 +-
 .../CodeGen/AArch64/memtag-compact-unwind.ll  |   27 +
 llvm/test/CodeGen/AArch64/neon-anyof-splat.ll |   67 +
 .../CodeGen/AArch64/print-pipeline-passes.ll  |    2 +-
 llvm/test/CodeGen/AArch64/rem-by-const.ll     |   65 +-
 llvm/test/CodeGen/AArch64/shift.ll            |   31 +
 llvm/test/CodeGen/AArch64/sme-agnostic-za.ll  |   16 +-
 llvm/test/CodeGen/AArch64/sme-dynamic-tls.ll  |    6 +-
 .../CodeGen/AArch64/sme-lazy-save-call.ll     |    8 +-
 .../test/CodeGen/AArch64/sme-peephole-opts.ll |    4 -
 .../test/CodeGen/AArch64/sme-za-exceptions.ll |  273 ++-
 llvm/test/CodeGen/AArch64/sme-zt0-state.ll    |  104 +-
 .../AArch64/sve-extract-scalable-vector.ll    |  253 ++-
 .../sve-fixed-vector-extract-256-bits.ll      |   23 +
 .../test/CodeGen/AArch64/sve-insert-vector.ll |  279 ++-
 .../test/CodeGen/AArch64/sve-int-mulh-pred.ll |  278 ++-
 llvm/test/CodeGen/AArch64/sve-sext-zext.ll    |  128 ++
 .../CodeGen/AArch64/sve-stack-frame-layout.ll |   14 +-
 llvm/test/CodeGen/AArch64/sve2-int-mulh.ll    |  277 ++-
 .../AArch64/udiv-by-const-promoted-ops.ll     |   78 +
 .../AMDGPU/GlobalISel/extractelement.i128.ll  |  274 ++-
 ...licit-kernarg-backend-usage-global-isel.ll |   70 +-
 .../irtranslator-amdgcn-cs-chain.ll           |    8 +-
 .../GlobalISel/irtranslator-amdgpu_kernel.ll  |  432 ++---
 .../GlobalISel/legalize-addrspacecast.mir     |   12 +-
 .../legalize-load-constant-32bit.mir          |   32 +-
 .../legalize-sextload-constant-32bit.mir      |   36 +-
 .../legalize-zextload-constant-32bit.mir      |   48 +-
 .../CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll  |    4 +-
 .../AMDGPU/GlobalISel/load-constant32bit.ll   |    4 +-
 .../AMDGPU/GlobalISel/regbankselect-load.mir  |  146 +-
 ...gbankselect-split-scalar-load-metadata.mir |   16 +-
 .../regbankselect-widen-scalar-loads.mir      |  152 +-
 .../amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll |   42 +-
 .../AMDGPU/buffer-fat-pointers-memcpy.ll      |   32 +-
 .../codegen-prepare-addrspacecast-non-null.ll |   27 +-
 .../AMDGPU/fcanonicalize-elimination.ll       |    6 +-
 .../AMDGPU/hazard-gfx1250-flat-scr-hi.mir     |  183 ++
 .../isel-amdgcn-cs-chain-intrinsic-w32.ll     |   64 +-
 .../isel-amdgcn-cs-chain-intrinsic-w64.ll     |   64 +-
 ...-amdgpu-cs-chain-intrinsic-dyn-vgpr-w32.ll |   32 +-
 llvm/test/CodeGen/AMDGPU/lds-dma-waits.ll     |  327 +++-
 llvm/test/CodeGen/AMDGPU/llc-pipeline.ll      |   10 +
 .../CodeGen/AMDGPU/llvm.amdgcn.reduce.add.ll  | 1001 -----------
 .../CodeGen/AMDGPU/llvm.amdgcn.reduce.fadd.ll | 1021 +++++++++++
 .../CodeGen/AMDGPU/llvm.amdgcn.reduce.fmax.ll |  928 ++++++++++
 .../CodeGen/AMDGPU/llvm.amdgcn.reduce.fmin.ll |  928 ++++++++++
 .../CodeGen/AMDGPU/llvm.amdgcn.reduce.fsub.ll | 1021 +++++++++++
 .../CodeGen/AMDGPU/llvm.amdgcn.reduce.max.ll  |  911 ----------
 .../CodeGen/AMDGPU/llvm.amdgcn.reduce.min.ll  |  911 ----------
 .../CodeGen/AMDGPU/llvm.amdgcn.reduce.sub.ll  | 1001 -----------
 .../lower-buffer-fat-pointers-mem-transfer.ll |  144 +-
 .../CodeGen/AMDGPU/lower-mem-intrinsics.ll    |  420 ++---
 .../CodeGen/AMDGPU/memcpy-crash-issue63986.ll |   54 +-
 .../CodeGen/AMDGPU/memintrinsic-unroll.ll     |   60 +-
 llvm/test/CodeGen/AMDGPU/memmove-var-size.ll  |  120 +-
 .../misched-into-wmma-hazard-shadow.mir       |   56 +
 llvm/test/CodeGen/AMDGPU/occupancy-levels.ll  |   82 +
 llvm/test/CodeGen/AMDGPU/release-vgprs.mir    |   70 +-
 .../CodeGen/AMDGPU/remove-register-flags.mir  |    2 +-
 .../rewrite-vgpr-mfma-scale-to-agpr.mir       |   10 +-
 llvm/test/CodeGen/AMDGPU/strict_fadd.f16.ll   |  220 +--
 llvm/test/CodeGen/AMDGPU/strict_fadd.f32.ll   |    8 +-
 llvm/test/CodeGen/AMDGPU/strict_fadd.f64.ll   |   47 +-
 llvm/test/CodeGen/AMDGPU/strict_fmul.f16.ll   |   48 +-
 llvm/test/CodeGen/AMDGPU/strict_fmul.f32.ll   |   15 +-
 llvm/test/CodeGen/AMDGPU/strict_fmul.f64.ll   |   72 +-
 .../CodeGen/AMDGPU/vector-alloca-atomic.ll    |   16 +-
 .../CodeGen/AMDGPU/vector-alloca-bitcast.ll   |    6 +-
 .../CodeGen/AMDGPU/vopd-combine-gfx1250.mir   |  967 ++++++++++
 llvm/test/CodeGen/AMDGPU/waitcnt-debug.mir    |    1 -
 llvm/test/CodeGen/ARM/fminmax-folds.ll        |   39 +-
 llvm/test/CodeGen/ARM/fp-intrinsics-vector.ll | 1499 ++++++++++++++++
 llvm/test/CodeGen/ARM/fp16-fullfp16.ll        |    8 +-
 llvm/test/CodeGen/BPF/atomic-oversize.ll      |    2 +
 llvm/test/CodeGen/BPF/builtin_calls.ll        |   39 -
 llvm/test/CodeGen/BPF/struct_ret1.ll          |    2 +-
 llvm/test/CodeGen/BPF/struct_ret2.ll          |    2 +-
 .../CodeGen/Hexagon/autohvx/fp-to-int_2.ll    |   75 +
 llvm/test/CodeGen/LoongArch/O0-pipeline.ll    |    2 +
 llvm/test/CodeGen/LoongArch/opt-pipeline.ll   |    2 +
 .../parse-lanemask-operand-empty-lanemask.mir |   13 +
 ...arse-lanemask-operand-invalid-lanemask.mir |   13 +
 .../parse-lanemask-operand-missing-lparen.mir |   13 +
 .../parse-lanemask-operand-missing-rparen.mir |   13 +
 .../MIR/AMDGPU/parse-lanemask-operand.mir     |   17 +
 .../Inputs/reference_x86_vocab_print.txt      |    2 +
 .../reference_x86_vocab_wo=0.5_print.txt      |    2 +
 llvm/test/CodeGen/NVPTX/lower-aggr-copies.ll  |   28 +-
 llvm/test/CodeGen/NVPTX/param-add.ll          |    5 -
 .../test/CodeGen/NVPTX/switch-loop-header.mir |  182 ++
 llvm/test/CodeGen/NVPTX/switch.ll             |   73 +
 llvm/test/CodeGen/PowerPC/O0-pipeline.ll      |    2 +
 llvm/test/CodeGen/PowerPC/O3-pipeline.ll      |    2 +
 .../CodeGen/PowerPC/peephole-counter-XToI.mir |    1 -
 .../instruction-select/rvv/select.mir         |   44 +-
 .../rvv/legalize-extract-subvector.mir        |   56 +-
 llvm/test/CodeGen/RISCV/O0-pipeline.ll        |    2 +
 llvm/test/CodeGen/RISCV/O3-pipeline.ll        |    2 +
 llvm/test/CodeGen/RISCV/cmov-branch-opt.ll    |    8 +-
 llvm/test/CodeGen/RISCV/features-info.ll      |    6 +-
 llvm/test/CodeGen/RISCV/min-max.ll            |    6 +-
 llvm/test/CodeGen/RISCV/rvp-ext-rv32.ll       |   32 +
 llvm/test/CodeGen/RISCV/rvp-ext-rv64.ll       |   12 +
 .../RISCV/rvv/combine-reduce-add-to-vcpop.ll  |    4 +-
 llvm/test/CodeGen/RISCV/rvv/copyprop.mir      |    2 +-
 .../rvv/fixed-vectors-interleaved-access.ll   |  822 ++++-----
 .../RISCV/rvv/fixed-vectors-reduction-fp.ll   |  434 +++--
 .../test/CodeGen/RISCV/rvv/fmaximum-sdnode.ll |  261 +--
 .../test/CodeGen/RISCV/rvv/fminimum-sdnode.ll |  261 +--
 .../CodeGen/RISCV/rvv/fminimumnum-sdnode.ll   |  444 ++---
 .../test/CodeGen/RISCV/rvv/mask-reg-alloc.mir |    4 +-
 llvm/test/CodeGen/RISCV/rvv/pr88576.ll        |    2 +-
 .../RISCV/rvv/rvv-peephole-vmerge-to-vmv.mir  |   36 +-
 .../RISCV/rvv/rvv-peephole-vmerge-vops.ll     |    3 +-
 llvm/test/CodeGen/RISCV/rvv/vector-splice.ll  |   24 +-
 .../test/CodeGen/RISCV/rvv/vl-opt-op-info.mir |   36 +-
 .../RISCV/rvv/vl-optimizer-subreg-assert.mir  |    8 +-
 .../CodeGen/RISCV/rvv/vmerge-peephole.mir     |   79 +-
 .../CodeGen/RISCV/rvv/vmv.v.v-peephole.mir    |    2 +-
 llvm/test/CodeGen/RISCV/select-bare.ll        |    2 +-
 llvm/test/CodeGen/RISCV/select-cc.ll          |    2 +-
 llvm/test/CodeGen/RISCV/select-cond.ll        |    2 +-
 llvm/test/CodeGen/RISCV/select-const.ll       |    2 +-
 llvm/test/CodeGen/RISCV/select.ll             |    2 +-
 .../RISCV/short-forward-branch-load-imm.ll    |    4 +-
 .../RISCV/short-forward-branch-opt-min-max.ll |    8 +-
 .../RISCV/short-forward-branch-opt-mul.ll     |    8 +-
 llvm/test/CodeGen/RISCV/xcvelw.ll             |   27 +
 llvm/test/CodeGen/RISCV/xqcicli.ll            |    2 +-
 llvm/test/CodeGen/RISCV/xqcicm.ll             |    2 +-
 llvm/test/CodeGen/RISCV/xqcics.ll             |    2 +-
 .../CodeGen/RISCV/zicond-fp-select-zfinx.ll   |  703 ++++++++
 llvm/test/CodeGen/SPARC/fp128-abi.ll          |  164 ++
 llvm/test/CodeGen/SPARC/fp16-promote.ll       |   21 +-
 llvm/test/CodeGen/SPARC/llvm.sincos.ll        |  195 +-
 .../SPIRV/GlobalISel/fn-ptr-addrspacecast.ll  |    8 +
 .../CodeGen/SPIRV/GlobalISel/fn-ptr-load.ll   |    8 +
 .../CodeGen/SPIRV/GlobalISel/fn-ptr-memcpy.ll |    8 +
 .../CodeGen/SPIRV/GlobalISel/fn-ptr-memset.ll |    8 +
 .../CodeGen/SPIRV/GlobalISel/fn-ptr-store.ll  |    8 +
 ...arbitrary-precision-fixed-point-numbers.ll |  254 +++
 .../SPV_INTEL_arbitrary_precision_integers.ll |    6 +-
 .../SPV_INTEL_function_pointers/fp_cmp.ll     |   27 +
 .../extensions/SPV_INTEL_int4/negative.ll     |    6 +-
 ...both-allowed-disallowed-extension-error.ll |    6 +-
 .../enable-all-extensions-but-one.ll          |    4 +-
 .../SPIRV/extensions/enable-all-extensions.ll |    2 +-
 ...-SPV_INTEL_arbitrary_precision_integers.ll |    6 +-
 .../vector-legalization-kernel.ll             |   69 +
 .../vector-legalization-shader.ll             |  133 ++
 llvm/test/CodeGen/SPIRV/llc-pipeline.ll       |    4 +
 .../llvm-intrinsics/bitreverse_small_type.ll  |    8 +-
 .../test/CodeGen/SPIRV/semantics/target.ps.ll |   33 +
 .../CodeGen/SPIRV/trunc-nonstd-bitwidth.ll    |    4 +-
 llvm/test/CodeGen/SPIRV/zero-length-array.ll  |   10 +-
 .../test/CodeGen/Thumb2/mve-blockplacement.ll |    9 +-
 .../mve-intrinsics/strict-intrinsics.ll       |  213 ++-
 llvm/test/CodeGen/Thumb2/mve-vmulh.ll         |  493 ++++-
 .../test/CodeGen/WebAssembly/masked-shifts.ll |   45 +
 llvm/test/CodeGen/X86/O0-pipeline.ll          |    2 +
 llvm/test/CodeGen/X86/addcarry.ll             |   15 +-
 .../CodeGen/X86/avx512-skx-insert-subvec.ll   |   22 +-
 llvm/test/CodeGen/X86/combine-fceil.ll        |  193 ++
 llvm/test/CodeGen/X86/combine-fcmp.ll         |  330 ++++
 llvm/test/CodeGen/X86/combine-ffloor.ll       |  193 ++
 llvm/test/CodeGen/X86/combine-fnearbyint.ll   |  193 ++
 llvm/test/CodeGen/X86/combine-frint.ll        |  193 ++
 llvm/test/CodeGen/X86/combine-fround.ll       |  419 +++++
 llvm/test/CodeGen/X86/combine-froundeven.ll   |  193 ++
 llvm/test/CodeGen/X86/combine-fsqrt.ll        |  174 ++
 llvm/test/CodeGen/X86/combine-ftrunc.ll       |  193 ++
 llvm/test/CodeGen/X86/combine-icmp.ll         |  905 ++++++++++
 llvm/test/CodeGen/X86/combine-rcp.ll          |   65 +
 llvm/test/CodeGen/X86/combine-rndscale.ll     |  162 ++
 llvm/test/CodeGen/X86/combine-rsqrt.ll        |   65 +
 llvm/test/CodeGen/X86/dag-combine-counter.ll  |    2 -
 llvm/test/CodeGen/X86/fmaxnum.ll              |   45 +-
 llvm/test/CodeGen/X86/fminnum.ll              |   45 +-
 llvm/test/CodeGen/X86/kmov.ll                 |   44 +-
 .../CodeGen/X86/masked_store_trunc_ssat.ll    |  130 +-
 .../CodeGen/X86/masked_store_trunc_usat.ll    |  109 +-
 llvm/test/CodeGen/X86/opt-pipeline.ll         |    2 +
 llvm/test/CodeGen/X86/pr114360.ll             |    1 -
 .../CodeGen/X86/prefer-avx256-mask-extend.ll  |   76 +-
 .../CodeGen/X86/prefer-avx256-mask-shuffle.ll |   58 +-
 .../AllocToken/hot-cold-new.ll                |   20 +
 .../AllocToken/module-flags.ll                |   35 +
 .../RealtimeSanitizer/rtsan_attrib_declare.ll |   11 +
 llvm/test/LTO/X86/alloc-token-hot-cold-new.ll |   25 +
 llvm/test/LTO/X86/alloc-token.ll              |   27 +
 .../MC/AArch64/seh-large-func-multi-epilog.s  |   38 +-
 llvm/test/MC/AArch64/seh-packed-epilog.s      |    2 +-
 llvm/test/MC/AArch64/seh-packed-unwind.s      |  113 +-
 llvm/test/MC/AArch64/seh.s                    |  106 +-
 llvm/test/MC/PowerPC/fixup-out-of-range.s     |   91 +
 llvm/test/MC/RISCV/corev/XCVelw-pseudo.s      |   11 +
 ...verifier-copyLanemask-invalid-lanemask.mir |   37 +
 ...verifier-copyLanemask-missing-lanemask.mir |   19 +
 .../Other/X86/debugcounter-divrempairs.ll     |    1 -
 .../debugcounter-partiallyinlinelibcalls.ll   |    1 -
 llvm/test/Other/debugcounter-dce.ll           |    1 -
 llvm/test/Other/debugcounter-earlycse.ll      |    1 -
 llvm/test/Other/debugcounter-newgvn.ll        |    1 -
 llvm/test/Other/debugcounter-predicateinfo.ll |    1 -
 llvm/test/Other/debugcounter-slsr.ll          |    1 -
 llvm/test/Other/new-pm-O0-defaults.ll         |   10 +-
 llvm/test/Other/new-pm-defaults.ll            |    1 +
 llvm/test/Other/new-pm-lto-defaults.ll        |    1 +
 .../Other/new-pm-thinlto-postlink-defaults.ll |    1 +
 .../new-pm-thinlto-postlink-pgo-defaults.ll   |    1 +
 ...-pm-thinlto-postlink-samplepgo-defaults.ll |    1 +
 llvm/test/Other/print-debug-counter.ll        |    2 -
 .../match-table-cxx.td                        |    2 +-
 .../AggressiveInstCombine/X86/or-load.ll      |   53 +
 .../Attributor/dereferenceable-1.ll           |    9 +-
 llvm/test/Transforms/Attributor/nofpclass.ll  |   28 +-
 llvm/test/Transforms/Attributor/nonnull.ll    |  411 ++---
 .../Attributor/value-simplify-pointer-info.ll |    6 +-
 llvm/test/Transforms/Attributor/willreturn.ll |    2 +-
 .../DeadStoreElimination/debug-counter.ll     |    2 -
 .../Transforms/ExpandFp/AMDGPU/frem-inf.ll    |    4 +-
 llvm/test/Transforms/ExpandFp/AMDGPU/frem.ll  |    2 +-
 .../ExpandFp/AMDGPU/missing-analysis.ll       |    6 +
 .../ExpandFp/AMDGPU/pass-parameters.ll        |   16 +-
 .../X86/expand-large-fp-convert-fptosi129.ll  |    2 +-
 .../X86/expand-large-fp-convert-fptoui129.ll  |    2 +-
 .../X86/expand-large-fp-convert-si129tofp.ll  |    2 +-
 .../X86/expand-large-fp-convert-ui129tofp.ll  |    2 +-
 .../X86/expand-large-fp-optnone.ll            |    2 +-
 llvm/test/Transforms/FunctionAttrs/nonnull.ll |    3 +-
 .../Transforms/GlobalOpt/resolve-fmv-ifunc.ll |   92 +-
 .../IndVarSimplify/AArch64/widen-loop-comp.ll |    5 +-
 .../IndVarSimplify/ARM/code-size.ll           |   18 +-
 .../ARM/indvar-unroll-imm-cost.ll             |    2 +-
 .../IndVarSimplify/X86/eliminate-trunc.ll     |    6 +-
 .../Transforms/IndVarSimplify/X86/iv-widen.ll |    4 +-
 .../Transforms/IndVarSimplify/X86/pr59615.ll  |    6 +-
 .../IndVarSimplify/backedge-on-min-max.ll     |    4 +-
 .../IndVarSimplify/canonicalize-cmp.ll        |    4 +-
 .../IndVarSimplify/constant_result.ll         |    2 +-
 .../Transforms/IndVarSimplify/cycled_phis.ll  |    8 +-
 .../IndVarSimplify/debugloc-rem-subst.ll      |    2 +-
 .../IndVarSimplify/dont-recompute.ll          |    2 +-
 .../IndVarSimplify/eliminate-exit.ll          |  218 ++-
 .../IndVarSimplify/eliminate-sat.ll           |    8 +-
 .../IndVarSimplify/exit_value_tests.ll        |    2 +-
 .../IndVarSimplify/floating-point-small-iv.ll |   16 +-
 .../invalidate-modified-lcssa-phi.ll          |    2 +-
 .../IndVarSimplify/loop-predication.ll        |    4 +-
 .../IndVarSimplify/monotonic_checks.ll        |   12 +-
 .../IndVarSimplify/negative_ranges.ll         |    4 +-
 .../IndVarSimplify/post-inc-range.ll          |   16 +-
 .../test/Transforms/IndVarSimplify/pr38674.ll |    2 +-
 .../test/Transforms/IndVarSimplify/pr39673.ll |   14 +-
 .../test/Transforms/IndVarSimplify/pr56242.ll |    2 +-
 .../test/Transforms/IndVarSimplify/pr57247.ll |    8 +-
 .../test/Transforms/IndVarSimplify/pr62992.ll |    2 +-
 .../IndVarSimplify/sharpen-range.ll           |    2 +-
 .../IndVarSimplify/shift-range-checks.ll      |    4 +-
 .../simplify-pointer-arithmetic.ll            |   10 +-
 .../skip-predication-convergence.ll           |    2 +-
 .../skip-predication-nested-convergence.ll    |    4 +-
 .../IndVarSimplify/turn-to-invariant.ll       |    2 +-
 .../widen-nonnegative-countdown.ll            |   24 +-
 .../test/Transforms/InstCombine/known-bits.ll |   17 +
 .../InstCombine/saturating-add-sub.ll         |   16 +
 .../InstCombine/simplify-demanded-fpclass.ll  |    6 +-
 .../InstSimplify/ConstProp/min-max.ll         |   12 +-
 .../Transforms/InstSimplify/fminmax-folds.ll  |   35 +-
 llvm/test/Transforms/LICM/lnicm.ll            |    3 +
 .../LoopDistribute/laa-invalidation.ll        |    6 +-
 .../AArch64/non-cmp-cond.ll                   |  205 +++
 .../LoopStrengthReduce/AArch64/prefer-all.ll  |    4 +-
 .../LoopUnroll/AArch64/apple-unrolling.ll     |  506 ++++--
 .../peel-multiple-unreachable-exits.ll        |    6 +-
 ...turn-invariant-accesses-dereferenceable.ll |    2 +-
 .../runtime-loop-multiexit-dom-verify.ll      |   46 +-
 .../LoopUnroll/runtime-loop-multiple-exits.ll |    6 +-
 .../unroll-header-exiting-with-phis.ll        |    2 +-
 .../LoopUnrollAndJam/unroll-and-jam.ll        |   26 +-
 .../AArch64/force-target-instruction-cost.ll  |  160 +-
 .../AArch64/pr60831-sve-inv-store-crash.ll    |  157 +-
 .../LoopVectorize/AArch64/select-costs.ll     |   23 +
 .../LoopVectorize/AArch64/select-index.ll     |  297 +++-
 .../AArch64/sve-interleaved-accesses.ll       |    6 +-
 .../RISCV/gather-scatter-cost.ll              |  116 ++
 .../LoopVectorize/X86/uniformshift.ll         |   61 +-
 .../diag-disabled-vectorization-msgs.ll       |   94 +
 ...predicated-loads-with-predicated-stores.ll |  608 ++++---
 .../LoopVectorize/iv_outside_user.ll          |  139 +-
 .../LoopVectorize/pr58811-scev-expansion.ll   |  567 ++++--
 .../select-index-interleaving.ll              |  228 ++-
 .../LoopVectorize/select-smax-last-index.ll   |  177 +-
 .../LoopVectorize/select-smin-first-index.ll  |  262 +++
 .../LoopVectorize/select-smin-last-index.ll   |  178 +-
 .../LoopVectorize/select-umax-last-index.ll   |  177 +-
 .../LoopVectorize/select-umin-first-index.ll  |  226 ++-
 .../LoopVectorize/select-umin-last-index.ll   |  223 ++-
 .../Transforms/LoopVectorize/struct-return.ll |   28 +-
 .../builtin-object-size-idxsize.ll            |  262 +++
 .../builtin-object-size-phi.ll                |   57 +-
 .../Transforms/LowerTypeTests/function.ll     |   11 +-
 .../OpenMP/parallel_region_merging.ll         |   24 +-
 .../SCCP/get_vector_length-intrinsic.ll       |  147 ++
 .../SLPVectorizer/X86/debug-counter.ll        |    1 -
 ...rging-duplicate-convergence-instrinsics.ll |   68 +
 .../Transforms/Util/assume-builder-counter.ll |    1 -
 .../VectorCombine/X86/shuffle-of-fma-const.ll |   48 +
 .../WholeProgramDevirt/calls-to-devirt.ll     |   79 +
 .../WholeProgramDevirt/import-indir.ll        |    4 +-
 .../Transforms/WholeProgramDevirt/import.ll   |   12 +-
 .../uniform-retval-invoke.ll                  |    3 +-
 .../tools/llvm-exegesis/RISCV/rvv/filter.test |    8 +-
 .../X86/snippet-generator-seed.test           |   16 +
 .../llvm-readobj/COFF/arm64-packed-unwind.s   |   46 +-
 llvm/tools/llc/NewPMDriver.cpp                |   12 +
 llvm/tools/llc/llc.cpp                        |   11 +-
 .../llvm-exegesis/lib/SnippetGenerator.cpp    |   14 +-
 llvm/tools/llvm-readobj/ARMWinEHPrinter.cpp   |   32 +-
 llvm/tools/opt/NewPMDriver.cpp                |   23 +-
 llvm/tools/opt/NewPMDriver.h                  |   10 +-
 llvm/tools/opt/optdriver.cpp                  |   19 +-
 llvm/unittests/ADT/MapVectorTest.cpp          |   18 +
 llvm/unittests/ADT/SetVectorTest.cpp          |    2 +-
 llvm/unittests/CAS/CASTestConfig.h            |    9 +
 .../GlobalISel/InstructionSelectTest.cpp      |    2 -
 llvm/unittests/CodeGen/MachineOperandTest.cpp |   17 +
 .../ExecutionEngine/Orc/CMakeLists.txt        |    1 +
 .../Orc/CallableTraitsHelperTest.cpp          |   70 +
 .../Frontend/OpenMPIRBuilderTest.cpp          |  107 +-
 llvm/unittests/IR/ConstantRangeTest.cpp       |    9 +
 llvm/unittests/IR/IntrinsicsTest.cpp          |  214 ++-
 .../Transforms/Utils/MemTransferLowering.cpp  |    9 +-
 .../gn/secondary/bolt/unittests/BUILD.gn      |    1 +
 .../secondary/bolt/unittests/Passes/BUILD.gn  |   48 +
 .../clang-tools-extra/clangd/test/BUILD.gn    |    1 +
 .../gn/secondary/libcxx/include/BUILD.gn      |    2 +
 .../Plugins/Language/CPlusPlus/BUILD.gn       |    1 +
 .../gn/secondary/lldb/source/Target/BUILD.gn  |    1 +
 .../gn/secondary/lldb/source/Utility/BUILD.gn |    1 +
 .../llvm/lib/Target/AArch64/BUILD.gn          |    1 +
 .../unittests/ExecutionEngine/Orc/BUILD.gn    |    1 +
 llvm/utils/lit/lit/TestRunner.py              |   97 +-
 llvm/utils/lit/lit/builtin_commands/diff.py   |    6 +-
 llvm/utils/lit/lit/formats/googletest.py      |    2 +-
 llvm/utils/lit/lit/llvm/config.py             |    6 +-
 llvm/utils/lit/lit/reports.py                 |    4 +-
 llvm/utils/lit/lit/run.py                     |   68 +-
 llvm/utils/lit/lit/util.py                    |   81 +-
 llvm/utils/lit/tests/lit.cfg                  |    4 +
 llvm/utils/lit/tests/windows-pools.py         |   27 +
 llvm/utils/profcheck-xfail.txt                |   37 -
 mlir/docs/Dialects/Linalg/OpDSL.md            |   17 +-
 .../Analysis/Presburger/IntegerRelation.h     |   14 +
 .../mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h |    2 +-
 mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td |  127 +-
 mlir/include/mlir/Dialect/EmitC/IR/EmitC.td   |   58 +-
 mlir/include/mlir/Dialect/GPU/IR/GPUBase.td   |    2 +-
 mlir/include/mlir/Dialect/GPU/IR/GPUOps.td    |    8 +-
 mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td   |  142 +-
 .../Linalg/IR/LinalgNamedStructuredOps.yaml   |   18 +-
 .../mlir/Dialect/OpenACC/OpenACCOps.td        |   19 +
 .../Dialect/OpenACC/OpenACCTypeInterfaces.td  |   44 +
 .../mlir/Dialect/OpenACC/Transforms/Passes.td |   16 +
 .../mlir/Dialect/SPIRV/IR/SPIRVBase.td        |    2 +-
 mlir/include/mlir/Transforms/Passes.td        |    1 +
 mlir/lib/Analysis/Presburger/Barvinok.cpp     |   14 +-
 .../Analysis/Presburger/IntegerRelation.cpp   |   22 +-
 mlir/lib/Analysis/Presburger/Matrix.cpp       |   41 +-
 .../AMDGPUToROCDL/AMDGPUToROCDL.cpp           |   78 +-
 .../ArithToAPFloat/ArithToAPFloat.cpp         |  205 ++-
 .../Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp    |   52 +-
 .../MemRefToEmitC/MemRefToEmitC.cpp           |   14 +-
 .../Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp    |    6 +-
 mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp  |   56 +
 .../AMDGPU/Transforms/MaskedloadToLoad.cpp    |   34 +-
 mlir/lib/Dialect/Affine/Utils/LoopUtils.cpp   |    6 +
 .../lib/Dialect/ControlFlow/IR/CMakeLists.txt |    1 +
 .../Dialect/ControlFlow/IR/ControlFlowOps.cpp |   34 +-
 mlir/lib/Dialect/EmitC/IR/EmitC.cpp           |   29 +
 mlir/lib/Dialect/GPU/IR/GPUDialect.cpp        |    4 +-
 mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp    |  175 +-
 .../Dialect/Linalg/IR/LinalgInterfaces.cpp    |   23 +-
 .../Linalg/Transforms/ElementwiseOpFusion.cpp |  301 +++-
 mlir/lib/Dialect/Linalg/Utils/Utils.cpp       |  756 ++++----
 .../NVGPU/TransformOps/NVGPUTransformOps.cpp  |    4 +-
 mlir/lib/Dialect/OpenACC/IR/OpenACC.cpp       |   74 +
 .../OpenACC/Transforms/ACCLegalizeSerial.cpp  |  117 ++
 .../Dialect/OpenACC/Transforms/CMakeLists.txt |    1 +
 mlir/lib/Dialect/SCF/IR/CMakeLists.txt        |    2 +-
 mlir/lib/Dialect/SCF/IR/SCF.cpp               |  131 +-
 .../SCF/Transforms/UpliftWhileToFor.cpp       |   79 +-
 .../BufferizableOpInterfaceImpl.cpp           |   56 +-
 mlir/lib/Dialect/XeGPU/IR/XeGPUDialect.cpp    |    9 +-
 .../XeGPU/Transforms/XeGPUPropagateLayout.cpp |    6 +-
 mlir/lib/Dialect/XeGPU/Utils/XeGPUUtils.cpp   |    3 +-
 mlir/lib/ExecutionEngine/APFloatWrappers.cpp  |   40 +
 mlir/lib/Target/Cpp/TranslateToCpp.cpp        |   29 +-
 .../OpenMP/OpenMPToLLVMIRTranslation.cpp      |   20 +-
 .../SPIRV/Deserialization/Deserializer.cpp    |   35 +-
 .../SPIRV/Deserialization/Deserializer.h      |   14 +-
 .../Target/SPIRV/Serialization/Serializer.cpp |   16 +-
 mlir/lib/Transforms/CMakeLists.txt            |    1 +
 mlir/lib/Transforms/RemoveDeadValues.cpp      |   47 +-
 .../linalg/opdsl/ops/core_named_ops.py        |    8 +-
 ...cvt_scale_pk-gfx1250.mlir => gfx1250.mlir} |   48 +
 .../ArithToApfloat/arith-to-apfloat.mlir      |   65 +
 .../GPUToNVVM/wmma-ops-to-nvvm.mlir           |   22 +
 .../memref-to-emitc-alloc-copy.mlir           |   12 +-
 .../MemRefToEmitC/memref-to-emitc-copy.mlir   |    6 +-
 .../MemRefToEmitC/memref-to-emitc.mlir        |    2 +-
 .../Conversion/NVVMToLLVM/nvvm-to-llvm.mlir   |    8 +-
 .../Conversion/UBToSPIRV/ub-to-spirv.mlir     |   16 +-
 mlir/test/Dialect/AMDGPU/invalid.mlir         |   61 +
 mlir/test/Dialect/AMDGPU/ops.mlir             |   57 +-
 mlir/test/Dialect/Affine/loop-coalescing.mlir |   28 +
 .../Affine/value-bounds-reification.mlir      |    4 +-
 .../Dialect/ControlFlow/canonicalize.mlir     |   22 +
 mlir/test/Dialect/EmitC/invalid_ops.mlir      |   16 +
 mlir/test/Dialect/EmitC/ops.mlir              |   10 +
 mlir/test/Dialect/GPU/invalid.mlir            |    4 +-
 .../Dialect/LLVMIR/nvvm-target-invalid.mlir   |   11 +
 mlir/test/Dialect/LLVMIR/nvvm.mlir            |   13 -
 .../fuse-with-reshape-by-collapsing.mlir      |   75 +
 .../Linalg/fusion-elementwise-ops.mlir        |   26 +-
 .../generalize-named-polymorphic-ops.mlir     |    8 +-
 mlir/test/Dialect/Linalg/invalid.mlir         |   18 +
 mlir/test/Dialect/Linalg/reshape_fusion.mlir  |   75 +
 .../test/Dialect/OpenACC/legalize-serial.mlir |  164 ++
 .../OpenACC/pointer-like-interface-load.mlir  |   29 +
 .../OpenACC/pointer-like-interface-store.mlir |   39 +
 mlir/test/Dialect/SCF/canonicalize.mlir       |   50 +
 mlir/test/Dialect/SCF/uplift-while.mlir       |   31 -
 mlir/test/Dialect/Tensor/bufferize.mlir       |   40 +-
 mlir/test/Dialect/Vector/vector-sink.mlir     |    2 +-
 .../XeGPU/propagate-layout-inst-data.mlir     |    3 +
 .../Dialect/XeGPU/subgroup-distribute.mlir    |   24 +-
 .../Dialect/XeGPU/xegpu-attr-interface.mlir   |   18 +-
 .../test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir |   14 +-
 .../XeGPU/xegpu-wg-to-sg-unify-ops-rr.mlir    |   42 +-
 .../XeGPU/xegpu-wg-to-sg-unify-ops.mlir       |  119 +-
 mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir   |   28 +-
 .../Arith/CPU/test-apfloat-emulation.mlir     |   12 +
 .../Linalg/CPU/ArmSME/matmul-transpose-a.mlir |    4 +-
 .../Dialect/Linalg/CPU/ArmSME/matmul.mlir     |    4 +-
 .../Linalg/CPU/ArmSME/multi-tile-matmul.mlir  |    4 +-
 .../Linalg/CPU/runtime-verification.mlir      |    8 +-
 .../Linalg/CPU/test-matmul-masked-vec.mlir    |    4 +-
 .../Dialect/Transform/match_matmul.mlir       |    4 +-
 .../XeGPU/LANE/load_store_subview.mlir        |   63 +
 .../CUDA/TensorCore/sm80/wmma-matmul-f64.mlir |   72 +
 mlir/test/Target/Cpp/common-cpp.mlir          |   19 +
 mlir/test/Target/Cpp/expressions.mlir         |   30 +-
 .../LLVMIR/allocatable_gpu_reduction.mlir     |    2 +
 .../allocatable_gpu_reduction_teams.mlir      |  121 ++
 .../LLVMIR/nvvm/mbar_arr_drop_expect_tx.mlir  |   68 +
 .../LLVMIR/nvvm/mbar_arr_expect_tx.mlir       |   68 +
 mlir/test/Target/LLVMIR/nvvm/mbar_init.mlir   |   20 -
 .../test/Target/LLVMIR/nvvm/mbar_invalid.mlir |   73 +
 .../Target/LLVMIR/nvvm/mbar_test_wait.mlir    |   73 +
 .../Target/LLVMIR/openmp-barrier-cancel.mlir  |   14 +-
 mlir/test/Target/LLVMIR/openmp-cancel.mlir    |   81 +-
 .../LLVMIR/openmp-cancellation-point.mlir     |   14 +-
 .../LLVMIR/openmp-outline-infinite-loop.mlir  |    6 +-
 .../openmp-parallel-reduction-multiblock.mlir |    4 +-
 .../openmp-reduction-array-sections.mlir      |   14 +-
 .../LLVMIR/openmp-reduction-init-arg.mlir     |    4 +-
 .../LLVMIR/openmp-reduction-sections.mlir     |   16 +-
 ...ction.spv => consecutive-selection.spvasm} |    0
 mlir/test/Target/SPIRV/decorations.mlir       |    7 +
 mlir/test/Target/SPIRV/selection.mlir         |   58 +
 .../SPIRV/{selection.spv => selection.spvasm} |    0
 .../test/Target/SPIRV/selection_switch.spvasm |   69 +
 mlir/test/Transforms/remove-dead-values.mlir  |   27 +
 .../OpenACC/TestPointerLikeTypeInterface.cpp  |  121 +-
 mlir/test/lib/Dialect/Test/TestOpDefs.cpp     |   11 +
 mlir/test/lib/Dialect/Test/TestOps.td         |   18 +
 mlir/test/lit.cfg.py                          |    2 +-
 mlir/test/mlir-tblgen/op-decl-and-defs.td     |    4 +-
 .../integration/dialects/linalg/opsrun.py     |   26 +-
 mlir/tools/mlir-tblgen/OpDefinitionsGen.cpp   |    9 +-
 .../Analysis/Presburger/BarvinokTest.cpp      |    9 +
 .../Presburger/IntegerRelationTest.cpp        |   15 +
 mlir/unittests/IR/SymbolTableTest.cpp         |   12 +-
 offload/include/OpenMP/omp.h                  |    8 +
 offload/include/Shared/Debug.h                |  323 ++--
 offload/include/omptarget.h                   |    2 +
 offload/libomptarget/OffloadRTL.cpp           |    9 +-
 offload/libomptarget/OpenMP/API.cpp           |   58 +
 offload/libomptarget/device.cpp               |   15 +-
 offload/libomptarget/exports                  |    2 +
 offload/test/api/omp_device_uid.c             |   76 +
 openmp/device/include/DeviceTypes.h           |    3 +
 openmp/device/include/Interface.h             |    4 +
 openmp/device/src/State.cpp                   |    6 +
 openmp/runtime/src/dllexports                 |    2 +
 openmp/runtime/src/exports_so.txt             |    2 +
 openmp/runtime/src/exports_test_so.txt        |    2 +
 openmp/runtime/src/include/omp.h.var          |    5 +
 openmp/runtime/src/kmp_csupport.cpp           |    2 +-
 openmp/runtime/src/kmp_ftn_entry.h            |   35 +-
 openmp/runtime/src/kmp_ftn_os.h               |    8 +
 openmp/runtime/test/api/omp_device_uid.c      |   77 +
 runtimes/cmake/Modules/WarningFlags.cmake     |    2 +-
 .../benchmark/bindings/python/build_defs.bzl  |    4 +-
 utils/bazel/configure.bzl                     |   86 +-
 .../llvm-project-overlay/clang/BUILD.bazel    |   63 -
 .../clang/tools/clang-fuzzer/BUILD.bazel      |   70 +
 .../llvm-project-overlay/libc/BUILD.bazel     |    4 -
 .../lldb/source/Plugins/BUILD.bazel           |    5 +
 .../llvm-project-overlay/llvm/BUILD.bazel     |   33 +-
 .../llvm-project-overlay/mlir/BUILD.bazel     |    3 +
 utils/bazel/overlay_directories.py            |   99 --
 1474 files changed, 58151 insertions(+), 17069 deletions(-)
 create mode 100644 .github/workflows/issue-write-test.yaml
 create mode 100644 .github/workflows/upload-release-artifact/action.yml
 create mode 100644 bolt/test/AArch64/hook-init.s
 create mode 100644 bolt/test/AArch64/inline-bti-dbg.s
 create mode 100644 bolt/test/AArch64/inline-bti.s
 create mode 100644 bolt/test/AArch64/instrument-no-fini.s
 create mode 100644 bolt/test/X86/hook-init.s
 create mode 100644 bolt/test/X86/instrument-no-fini.s
 create mode 100644 bolt/unittests/Passes/CMakeLists.txt
 create mode 100644 bolt/unittests/Passes/InsertNegateRAState.cpp
 create mode 100644 clang/lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp
 create mode 100644 clang/test/CIR/CodeGen/size-of-vla.cpp
 create mode 100644 clang/test/CIR/CodeGenBuiltins/X86/avx2-builtins.c
 create mode 100644 clang/test/CIR/CodeGenBuiltins/X86/avx512dq-builtins.c
 create mode 100644 clang/test/CIR/CodeGenBuiltins/X86/avx512vl-builtins.c
 create mode 100644 clang/test/CIR/CodeGenBuiltins/X86/vec-set-builtins.c
 create mode 100644 clang/test/CIR/CodeGenBuiltins/X86/xop-builtins.c
 create mode 100644 clang/test/CIR/CodeGenBuiltins/builtin-constant-p.c
 delete mode 100644 clang/test/CIR/CodeGenOpenACC/openacc-not-implemented-global.cpp
 create mode 100644 clang/test/CIR/CodeGenOpenACC/routine-anon-ns.cpp
 create mode 100644 clang/test/CIR/CodeGenOpenACC/routine-clauses.cpp
 create mode 100644 clang/test/CIR/CodeGenOpenACC/routine-globals.cpp
 create mode 100644 clang/test/CIR/CodeGenOpenACC/routine-globals2.cpp
 create mode 100644 clang/test/CIR/CodeGenOpenACC/routine-locals.cpp
 create mode 100644 clang/test/CIR/CodeGenOpenACC/routine-members.cpp
 create mode 100644 clang/test/CIR/CodeGenOpenACC/routine-ns.cpp
 create mode 100644 clang/test/CIR/CodeGenOpenACC/routine-templ.cpp
 create mode 100644 clang/test/CIR/IR/invalid-func-attr.cir
 create mode 100644 clang/test/CodeGen/alloc-token-module-flags.c
 create mode 100644 clang/test/CodeGen/attr-modular-format.c
 create mode 100644 clang/test/CodeGen/distributed-thin-lto/memprof-pgho.cpp
 create mode 100644 clang/test/CodeGenCUDA/device-kernel-call.cu
 create mode 100644 clang/test/CodeGenHLSL/BasicFeatures/MatrixElementTypeCast.hlsl
 create mode 100644 clang/test/CodeGenHLSL/BasicFeatures/MatrixExplicitTruncation.hlsl
 create mode 100644 clang/test/CodeGenHLSL/BasicFeatures/MatrixImplicitTruncation.hlsl
 create mode 100644 clang/test/CodeGenHLSL/builtins/VectorElementStore.hlsl
 create mode 100644 clang/test/CodeGenHLSL/semantics/SV_Target.ps.hlsl
 create mode 100644 clang/test/CodeGenHLSL/semantics/semantic.explicit-location-output-struct.hlsl
 create mode 100644 clang/test/CodeGenHLSL/semantics/semantic.explicit-location.hlsl
 create mode 100644 clang/test/CodeGenHLSL/semantics/semantic.explicit-mix-builtin.hlsl
 create mode 100644 clang/test/CodeGenHLSL/semantics/semantic.explicit-mix-builtin.vs.hlsl
 create mode 100644 clang/test/CodeGenHLSL/semantics/semantic.explicit-mix.lib.hlsl
 create mode 100644 clang/test/FixIt/fixit-cxx0x-attributes.cpp
 create mode 100644 clang/test/Modules/GH170084.cpp
 create mode 100644 clang/test/Modules/pr170235.cppm
 create mode 100644 clang/test/OpenMP/target_update_iterator_ast_print.cpp
 create mode 100644 clang/test/OpenMP/target_update_iterator_serialization.cpp
 create mode 100644 clang/test/Parser/cxx-nested-name-spec.cpp
 create mode 100644 clang/test/Preprocessor/header-shadowing.c
 create mode 100644 clang/test/Sema/attr-modular-format.c
 create mode 100644 clang/test/SemaCUDA/device-kernel-call.cu
 create mode 100644 clang/test/SemaCXX/constexpr-x86-avx-builtins.cpp
 create mode 100644 clang/test/SemaCXX/constexpr-x86-avx512f-builtins.cpp
 create mode 100644 clang/test/SemaCXX/constexpr-x86-avx512vl-builtins.cpp
 create mode 100644 clang/test/SemaCXX/constexpr-x86-sse2-builtins.cpp
 create mode 100644 clang/test/SemaCXX/no-warn-consumed-analysis.cpp
 create mode 100644 clang/test/SemaHLSL/MatrixElementOverloadResolution.hlsl
 create mode 100644 clang/test/SemaHLSL/Semantics/semantic.explicit-mix-builtin-vs.hlsl
 create mode 100644 clang/test/SemaHLSL/Semantics/semantic.explicit-mix-location-2.hlsl
 create mode 100644 clang/test/SemaHLSL/Semantics/semantic.explicit-mix-location.hlsl
 create mode 100644 clang/test/SemaHLSL/Semantics/target.ps.input.hlsl
 create mode 100644 clang/test/SemaHLSL/Semantics/target.vs.input.hlsl
 create mode 100644 clang/test/SemaHLSL/Semantics/target.vs.output.hlsl
 create mode 100644 clang/test/SemaHLSL/Types/BuiltinMatrix/MatrixCastErrors.hlsl
 create mode 100644 clang/test/SemaHLSL/Types/BuiltinMatrix/MatrixImplicitTruncCastWarnings.hlsl
 create mode 100644 clang/test/SemaHLSL/static_resources.hlsl
 create mode 100644 flang/include/flang/Optimizer/Transforms/CUDA/CUFAllocationConversion.h
 create mode 100644 flang/lib/Optimizer/Transforms/CUDA/CUFAllocationConversion.cpp
 create mode 100644 flang/test/Evaluate/bug168978.f90
 create mode 100644 flang/test/Fir/OpenACC/pointer-like-interface-load.mlir
 create mode 100644 flang/test/Fir/OpenACC/pointer-like-interface-store.mlir
 create mode 100644 flang/test/Lower/Intrinsics/rand.f90
 create mode 100644 flang/test/Lower/dispatch-table-abstract.f90
 create mode 100644 flang/test/Lower/vectorlength.f90
 create mode 100644 flang/test/Semantics/equiv-kind.f90
 create mode 100644 libcxx/test/extensions/gnu/hash_map/copy.pass.cpp
 create mode 100644 libcxx/test/extensions/gnu/hash_set/copy.pass.cpp
 create mode 100644 lldb/include/lldb/Target/BorrowedStackFrame.h
 create mode 100644 lldb/include/lldb/Utility/VirtualDataExtractor.h
 create mode 100644 lldb/source/Plugins/InstrumentationRuntime/BoundsSafety/CMakeLists.txt
 create mode 100644 lldb/source/Plugins/InstrumentationRuntime/BoundsSafety/InstrumentationRuntimeBoundsSafety.cpp
 create mode 100644 lldb/source/Plugins/InstrumentationRuntime/BoundsSafety/InstrumentationRuntimeBoundsSafety.h
 create mode 100644 lldb/source/Plugins/Language/CPlusPlus/LibStdcppSpan.cpp
 create mode 100644 lldb/source/Plugins/SyntheticFrameProvider/CMakeLists.txt
 create mode 100644 lldb/source/Plugins/SyntheticFrameProvider/ScriptedFrameProvider/CMakeLists.txt
 create mode 100644 lldb/source/Plugins/SyntheticFrameProvider/ScriptedFrameProvider/ScriptedFrameProvider.cpp
 create mode 100644 lldb/source/Plugins/SyntheticFrameProvider/ScriptedFrameProvider/ScriptedFrameProvider.h
 create mode 100644 lldb/source/Target/BorrowedStackFrame.cpp
 create mode 100644 lldb/source/Utility/VirtualDataExtractor.cpp
 create mode 100644 lldb/test/API/functionalities/scripted_frame_provider/Makefile
 create mode 100644 lldb/test/API/functionalities/scripted_frame_provider/TestScriptedFrameProvider.py
 create mode 100644 lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/Makefile
 create mode 100644 lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/TestFrameProviderCircularDependency.py
 create mode 100644 lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/frame_provider.py
 create mode 100644 lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/main.c
 create mode 100644 lldb/test/API/functionalities/scripted_frame_provider/main.cpp
 create mode 100644 lldb/test/API/functionalities/scripted_frame_provider/test_frame_providers.py
 create mode 100644 lldb/test/API/lang/BoundsSafety/soft_trap/Makefile
 create mode 100644 lldb/test/API/lang/BoundsSafety/soft_trap/TestBoundsSafetyInstrumentationPlugin.py
 create mode 100644 lldb/test/API/lang/BoundsSafety/soft_trap/main.c
 create mode 100644 lldb/test/API/lang/BoundsSafety/soft_trap/mockSoftTrapRuntime.c
 create mode 100644 lldb/test/API/python_api/exprpath_register/Makefile
 create mode 100644 lldb/test/API/python_api/exprpath_register/TestExprPathRegisters.py
 create mode 100644 lldb/test/API/python_api/exprpath_register/main.c
 create mode 100644 lldb/test/Shell/BoundsSafety/Inputs/boundsSafetyMockCallSoftTrapRuntime.c
 create mode 100644 lldb/test/Shell/BoundsSafety/Inputs/boundsSafetyMockSoftTrapRuntime.c
 create mode 100644 lldb/test/Shell/BoundsSafety/Inputs/boundsSafetySoftTraps.c
 create mode 100644 lldb/test/Shell/BoundsSafety/Inputs/boundsSafetySoftTrapsMissingReason.c
 create mode 100644 lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal.test
 create mode 100644 lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal_missing_reason.test
 create mode 100644 lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal_no_dbg_info.test
 create mode 100644 lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal_no_plugin.test
 create mode 100644 lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_str.test
 create mode 100644 lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_missing_reason.test
 create mode 100644 lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_no_dbg_info.test
 create mode 100644 lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_no_dbg_info_null_str.test
 create mode 100644 lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_no_plugin.test
 create mode 100644 lldb/test/Shell/SymbolFile/NativePDB/find-pdb-next-to-exe.test
 create mode 100644 lldb/unittests/Utility/VirtualDataExtractorTest.cpp
 create mode 100644 llvm/include/llvm/ExecutionEngine/Orc/CallableTraitsHelper.h
 create mode 100644 llvm/test/Analysis/Delinearization/validation_large_size.ll
 create mode 100644 llvm/test/Analysis/DependenceAnalysis/PR148435.ll
 create mode 100644 llvm/test/Analysis/DependenceAnalysis/bounds-check.ll
 create mode 100644 llvm/test/Analysis/DependenceAnalysis/wrapping-addrec-1.ll
 create mode 100644 llvm/test/Analysis/DependenceAnalysis/wrapping-addrec.ll
 create mode 100644 llvm/test/Analysis/DependenceAnalysis/zero-coefficient.ll
 create mode 100644 llvm/test/CodeGen/AArch64/arm64-int-neon.ll
 create mode 100644 llvm/test/CodeGen/AArch64/memtag-compact-unwind.ll
 create mode 100644 llvm/test/CodeGen/AArch64/neon-anyof-splat.ll
 create mode 100644 llvm/test/CodeGen/AArch64/sve-fixed-vector-extract-256-bits.ll
 create mode 100644 llvm/test/CodeGen/AArch64/udiv-by-const-promoted-ops.ll
 create mode 100644 llvm/test/CodeGen/AMDGPU/hazard-gfx1250-flat-scr-hi.mir
 create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fadd.ll
 create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fmax.ll
 create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fmin.ll
 create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fsub.ll
 create mode 100644 llvm/test/CodeGen/AMDGPU/misched-into-wmma-hazard-shadow.mir
 create mode 100644 llvm/test/CodeGen/ARM/fp-intrinsics-vector.ll
 delete mode 100644 llvm/test/CodeGen/BPF/builtin_calls.ll
 create mode 100644 llvm/test/CodeGen/Hexagon/autohvx/fp-to-int_2.ll
 create mode 100644 llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-empty-lanemask.mir
 create mode 100644 llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-invalid-lanemask.mir
 create mode 100644 llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-missing-lparen.mir
 create mode 100644 llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-missing-rparen.mir
 create mode 100644 llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand.mir
 create mode 100644 llvm/test/CodeGen/NVPTX/switch-loop-header.mir
 create mode 100644 llvm/test/CodeGen/NVPTX/switch.ll
 create mode 100644 llvm/test/CodeGen/RISCV/xcvelw.ll
 create mode 100644 llvm/test/CodeGen/RISCV/zicond-fp-select-zfinx.ll
 create mode 100644 llvm/test/CodeGen/SPARC/fp128-abi.ll
 create mode 100644 llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-addrspacecast.ll
 create mode 100644 llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-load.ll
 create mode 100644 llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-memcpy.ll
 create mode 100644 llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-memset.ll
 create mode 100644 llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-store.ll
 create mode 100644 llvm/test/CodeGen/SPIRV/extensions/SPV_ALTERA_arbitrary_precision_fixed_point/capability-arbitrary-precision-fixed-point-numbers.ll
 create mode 100644 llvm/test/CodeGen/SPIRV/extensions/SPV_INTEL_function_pointers/fp_cmp.ll
 create mode 100644 llvm/test/CodeGen/SPIRV/legalization/vector-legalization-kernel.ll
 create mode 100644 llvm/test/CodeGen/SPIRV/legalization/vector-legalization-shader.ll
 create mode 100644 llvm/test/CodeGen/SPIRV/semantics/target.ps.ll
 create mode 100644 llvm/test/CodeGen/X86/combine-fceil.ll
 create mode 100644 llvm/test/CodeGen/X86/combine-fcmp.ll
 create mode 100644 llvm/test/CodeGen/X86/combine-ffloor.ll
 create mode 100644 llvm/test/CodeGen/X86/combine-fnearbyint.ll
 create mode 100644 llvm/test/CodeGen/X86/combine-frint.ll
 create mode 100644 llvm/test/CodeGen/X86/combine-fround.ll
 create mode 100644 llvm/test/CodeGen/X86/combine-froundeven.ll
 create mode 100644 llvm/test/CodeGen/X86/combine-fsqrt.ll
 create mode 100644 llvm/test/CodeGen/X86/combine-ftrunc.ll
 create mode 100644 llvm/test/CodeGen/X86/combine-icmp.ll
 create mode 100644 llvm/test/CodeGen/X86/combine-rcp.ll
 create mode 100644 llvm/test/CodeGen/X86/combine-rndscale.ll
 create mode 100644 llvm/test/CodeGen/X86/combine-rsqrt.ll
 create mode 100644 llvm/test/Instrumentation/AllocToken/hot-cold-new.ll
 create mode 100644 llvm/test/Instrumentation/AllocToken/module-flags.ll
 create mode 100644 llvm/test/Instrumentation/RealtimeSanitizer/rtsan_attrib_declare.ll
 create mode 100644 llvm/test/LTO/X86/alloc-token-hot-cold-new.ll
 create mode 100644 llvm/test/LTO/X86/alloc-token.ll
 create mode 100644 llvm/test/MC/PowerPC/fixup-out-of-range.s
 create mode 100644 llvm/test/MC/RISCV/corev/XCVelw-pseudo.s
 create mode 100644 llvm/test/MachineVerifier/AMDGPU/verifier-copyLanemask-invalid-lanemask.mir
 create mode 100644 llvm/test/MachineVerifier/AMDGPU/verifier-copyLanemask-missing-lanemask.mir
 create mode 100644 llvm/test/Transforms/ExpandFp/AMDGPU/missing-analysis.ll
 create mode 100644 llvm/test/Transforms/LoopStrengthReduce/AArch64/non-cmp-cond.ll
 create mode 100644 llvm/test/Transforms/LoopVectorize/diag-disabled-vectorization-msgs.ll
 create mode 100644 llvm/test/Transforms/LoopVectorize/select-smin-first-index.ll
 create mode 100644 llvm/test/Transforms/LowerConstantIntrinsics/builtin-object-size-idxsize.ll
 create mode 100644 llvm/test/Transforms/SCCP/get_vector_length-intrinsic.ll
 create mode 100644 llvm/test/Transforms/SimplifyCFG/skip-merging-duplicate-convergence-instrinsics.ll
 create mode 100644 llvm/test/Transforms/VectorCombine/X86/shuffle-of-fma-const.ll
 create mode 100644 llvm/test/Transforms/WholeProgramDevirt/calls-to-devirt.ll
 create mode 100644 llvm/test/tools/llvm-exegesis/X86/snippet-generator-seed.test
 create mode 100644 llvm/unittests/ExecutionEngine/Orc/CallableTraitsHelperTest.cpp
 create mode 100644 llvm/utils/gn/secondary/bolt/unittests/Passes/BUILD.gn
 create mode 100644 llvm/utils/lit/tests/windows-pools.py
 create mode 100644 mlir/lib/Dialect/OpenACC/Transforms/ACCLegalizeSerial.cpp
 rename mlir/test/Conversion/AMDGPUToROCDL/{cvt_scale_pk-gfx1250.mlir => gfx1250.mlir} (81%)
 create mode 100644 mlir/test/Dialect/LLVMIR/nvvm-target-invalid.mlir
 create mode 100644 mlir/test/Dialect/OpenACC/legalize-serial.mlir
 create mode 100644 mlir/test/Dialect/OpenACC/pointer-like-interface-load.mlir
 create mode 100644 mlir/test/Dialect/OpenACC/pointer-like-interface-store.mlir
 create mode 100644 mlir/test/Integration/Dialect/XeGPU/LANE/load_store_subview.mlir
 create mode 100644 mlir/test/Integration/GPU/CUDA/TensorCore/sm80/wmma-matmul-f64.mlir
 create mode 100644 mlir/test/Target/LLVMIR/allocatable_gpu_reduction_teams.mlir
 create mode 100644 mlir/test/Target/LLVMIR/nvvm/mbar_arr_drop_expect_tx.mlir
 create mode 100644 mlir/test/Target/LLVMIR/nvvm/mbar_arr_expect_tx.mlir
 create mode 100644 mlir/test/Target/LLVMIR/nvvm/mbar_test_wait.mlir
 rename mlir/test/Target/SPIRV/{consecutive-selection.spv => consecutive-selection.spvasm} (100%)
 rename mlir/test/Target/SPIRV/{selection.spv => selection.spvasm} (100%)
 create mode 100644 mlir/test/Target/SPIRV/selection_switch.spvasm
 create mode 100644 offload/test/api/omp_device_uid.c
 create mode 100644 openmp/runtime/test/api/omp_device_uid.c
 create mode 100644 utils/bazel/llvm-project-overlay/clang/tools/clang-fuzzer/BUILD.bazel
 delete mode 100755 utils/bazel/overlay_directories.py

diff --git a/.ci/generate_test_report_lib.py b/.ci/generate_test_report_lib.py
index ce8262f0dc73f..c62c901fe46f5 100644
--- a/.ci/generate_test_report_lib.py
+++ b/.ci/generate_test_report_lib.py
@@ -184,8 +184,8 @@ def generate_report(
         if return_code == 0:
             report.extend(
                 [
-                    "The build succeeded and no tests ran. This is expected in some "
-                    "build configurations."
+                    ":white_check_mark: The build succeeded and no tests ran. "
+                    "This is expected in some build configurations."
                 ]
             )
         else:
@@ -272,6 +272,10 @@ def plural(num_tests):
                 ]
             )
             report.extend(_format_failures(ninja_failures, failure_explanations))
+    else:
+        report.extend(
+            ["", ":white_check_mark: The build succeeded and all tests passed."]
+        )
 
     if failures or return_code != 0:
         report.extend(["", UNRELATED_FAILURES_STR])
diff --git a/.ci/generate_test_report_lib_test.py b/.ci/generate_test_report_lib_test.py
index 341cf3037b921..182af1d52641a 100644
--- a/.ci/generate_test_report_lib_test.py
+++ b/.ci/generate_test_report_lib_test.py
@@ -194,7 +194,7 @@ def test_title_only(self):
                 """\
                 # Foo
 
-                The build succeeded and no tests ran. This is expected in some build configurations."""
+                :white_check_mark: The build succeeded and no tests ran. This is expected in some build configurations."""
             ),
         )
 
@@ -308,7 +308,9 @@ def test_no_failures(self):
                     """\
               # Foo
 
-              * 1 test passed"""
+              * 1 test passed
+              
+              :white_check_mark: The build succeeded and all tests passed."""
                 )
             ),
         )
diff --git a/.github/workflows/check-ci.yml b/.github/workflows/check-ci.yml
index 914ead803181e..9f63b4ce22a28 100644
--- a/.github/workflows/check-ci.yml
+++ b/.github/workflows/check-ci.yml
@@ -26,7 +26,7 @@ jobs:
         with:
           sparse-checkout: .ci
       - name: Setup Python
-        uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
+        uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548 # v6.1.0
         with:
           python-version: 3.14
           cache: 'pip'
diff --git a/.github/workflows/ci-post-commit-analyzer.yml b/.github/workflows/ci-post-commit-analyzer.yml
index abd64809f9dc9..a823dadb5979f 100644
--- a/.github/workflows/ci-post-commit-analyzer.yml
+++ b/.github/workflows/ci-post-commit-analyzer.yml
@@ -44,7 +44,7 @@ jobs:
         uses: actions/checkout@1af3b93b6815bc44a9784bd300feb67ff0d1eeb3 # v6.0.0
 
       - name: Setup ccache
-        uses: hendrikmuhs/ccache-action@bfa03e1de4d7f7c3e80ad9109feedd05c4f5a716 # v1.2.19
+        uses: hendrikmuhs/ccache-action@5ebbd400eff9e74630f759d94ddd7b6c26299639 # v1.2.20
         with:
           # A full build of llvm, clang, lld, and lldb takes about 250MB
           # of ccache space. There's not much reason to have more than this,
diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
index ef1b727dc00a8..f621bb1d64086 100644
--- a/.github/workflows/docs.yml
+++ b/.github/workflows/docs.yml
@@ -95,7 +95,7 @@ jobs:
             workflow:
               - '.github/workflows/docs.yml'
       - name: Setup Python env
-        uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
+        uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548 # v6.1.0
         with:
           python-version: '3.14'
           cache: 'pip'
diff --git a/.github/workflows/gha-codeql.yml b/.github/workflows/gha-codeql.yml
index a94c064757d29..7baaa6ce0cf08 100644
--- a/.github/workflows/gha-codeql.yml
+++ b/.github/workflows/gha-codeql.yml
@@ -29,9 +29,9 @@ jobs:
           sparse-checkout: |
             .github/
       - name: Initialize CodeQL
-        uses: github/codeql-action/init@e12f0178983d466f2f6028f5cc7a6d786fd97f4b # v4.31.4
+        uses: github/codeql-action/init@fdbfb4d2750291e159f0156def62b853c2798ca2 # v4.31.5
         with:
           languages: actions
           queries: security-extended
       - name: Perform CodeQL Analysis
-        uses: github/codeql-action/analyze@e12f0178983d466f2f6028f5cc7a6d786fd97f4b # v4.31.4
+        uses: github/codeql-action/analyze@fdbfb4d2750291e159f0156def62b853c2798ca2 # v4.31.5
diff --git a/.github/workflows/issue-write-test.yaml b/.github/workflows/issue-write-test.yaml
new file mode 100644
index 0000000000000..a54e716d1dee9
--- /dev/null
+++ b/.github/workflows/issue-write-test.yaml
@@ -0,0 +1,33 @@
+name: Test Issue Write
+
+permissions:
+  contents: read
+
+on:
+  pull_request:
+    paths:
+      - '.github/workflows/issue-write-test.yaml'
+      - '.github/workflows/issue-write.yml'
+
+jobs:
+  test-issue-write:
+    name: "Test Issue Write"
+    runs-on: ubuntu-24.04
+    if: github.repository == 'llvm/llvm-project'
+    steps:
+      - name: Write Comment
+        run: |
+          echo '[{"body": "This is a comment for testing the issue write workflow"}]' > comments-foo
+          echo '[{"body": "This is another comment for testing the issue write workflow that was placed in a separate file"}]' > comments-bar
+      - name: Upload Comment
+        uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
+        with:
+          name: workflow-args-foo
+          path: |
+            comments-foo
+      - name: Upload Comment
+        uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
+        with:
+          name: workflow-args-bar
+          path: |
+            comments-bar
diff --git a/.github/workflows/issue-write.yml b/.github/workflows/issue-write.yml
index ece6081ce9ba6..eebaf89e027be 100644
--- a/.github/workflows/issue-write.yml
+++ b/.github/workflows/issue-write.yml
@@ -8,6 +8,7 @@ on:
       - "PR Request Release Note"
       - "Code lint"
       - "CI Checks"
+      - "Test Issue Write"
     types:
       - completed
 
@@ -40,13 +41,18 @@ jobs:
           artifact-name: workflow-args
 
       - name: 'Comment on PR'
-        if: steps.download-artifact.outputs.artifact-id != ''
+        if: steps.download-artifact.outputs.artifact-ids != ''
         uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
         with:
           github-token: ${{ secrets.GITHUB_TOKEN }}
           script: |
             var fs = require('fs');
-            const comments = JSON.parse(fs.readFileSync('./comments'));
+            var comments = []
+            for (local_file of fs.readdirSync('.')) {
+              if (local_file.startsWith("comments")) {
+                comments.push(...JSON.parse(fs.readFileSync(local_file)))
+              }
+            }
             if (!comments || comments.length == 0) {
               return;
             }
@@ -155,5 +161,5 @@ jobs:
       - name: Dump comments file
         if: >-
           always() &&
-          steps.download-artifact.outputs.artifact-id != ''
+          steps.download-artifact.outputs.artifact-ids != ''
         run: cat comments
diff --git a/.github/workflows/libc-fullbuild-tests.yml b/.github/workflows/libc-fullbuild-tests.yml
index 13c0c2b82ab42..3c7dc5d4fcd75 100644
--- a/.github/workflows/libc-fullbuild-tests.yml
+++ b/.github/workflows/libc-fullbuild-tests.yml
@@ -97,7 +97,7 @@ jobs:
     # Do not use direct GHAC access even though it is supported by sccache. GHAC rejects
     # frequent small object writes.
     - name: Setup ccache
-      uses: hendrikmuhs/ccache-action@bfa03e1de4d7f7c3e80ad9109feedd05c4f5a716 # v1.2.19
+      uses: hendrikmuhs/ccache-action@5ebbd400eff9e74630f759d94ddd7b6c26299639 # v1.2.20
       with:
         max-size: 1G
         key: libc_fullbuild_${{ matrix.c_compiler }}
diff --git a/.github/workflows/libc-overlay-tests.yml b/.github/workflows/libc-overlay-tests.yml
index 29bcd0f600490..ca6e9ad92ab1b 100644
--- a/.github/workflows/libc-overlay-tests.yml
+++ b/.github/workflows/libc-overlay-tests.yml
@@ -51,7 +51,7 @@ jobs:
     # Do not use direct GHAC access even though it is supported by sccache. GHAC rejects
     # frequent small object writes.
     - name: Setup ccache
-      uses: hendrikmuhs/ccache-action@bfa03e1de4d7f7c3e80ad9109feedd05c4f5a716 # v1.2.19
+      uses: hendrikmuhs/ccache-action@5ebbd400eff9e74630f759d94ddd7b6c26299639 # v1.2.20
       with:
         max-size: 1G
         key: libc_overlay_build_${{ matrix.os }}_${{ matrix.compiler.c_compiler }}
diff --git a/.github/workflows/libclang-python-tests.yml b/.github/workflows/libclang-python-tests.yml
index 69d281b65b5fb..27ad0aa512c03 100644
--- a/.github/workflows/libclang-python-tests.yml
+++ b/.github/workflows/libclang-python-tests.yml
@@ -34,11 +34,11 @@ jobs:
     steps:
       - uses: actions/checkout@1af3b93b6815bc44a9784bd300feb67ff0d1eeb3 # v6.0.0
       - name: Setup Python
-        uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
+        uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548 # v6.1.0
         with:
           python-version: ${{ matrix.python-version }}
       - name: Setup ccache
-        uses: hendrikmuhs/ccache-action@bfa03e1de4d7f7c3e80ad9109feedd05c4f5a716 # v1.2.19
+        uses: hendrikmuhs/ccache-action@5ebbd400eff9e74630f759d94ddd7b6c26299639 # v1.2.20
         with:
           max-size: 2G
           key: spirv-ubuntu-24.04
diff --git a/.github/workflows/libcxx-build-and-test.yaml b/.github/workflows/libcxx-build-and-test.yaml
index 8e6dc48f4c495..bded56d54ed37 100644
--- a/.github/workflows/libcxx-build-and-test.yaml
+++ b/.github/workflows/libcxx-build-and-test.yaml
@@ -223,6 +223,9 @@ jobs:
           source .venv/bin/activate
           python -m pip install psutil
           xcrun bash libcxx/utils/ci/run-buildbot ${{ matrix.config }}
+        env:
+          CC: clang
+          CXX: clang++
       - uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
         if: always()  # Upload artifacts even if the build or test suite fails
         with:
@@ -241,16 +244,16 @@ jobs:
       fail-fast: false
       matrix:
         include:
-        - { config: clang-cl-dll, mingw: false }
-        - { config: clang-cl-static, mingw: false }
-        - { config: clang-cl-no-vcruntime, mingw: false }
-        - { config: clang-cl-debug, mingw: false }
-        - { config: clang-cl-static-crt, mingw: false }
-        - { config: mingw-dll, mingw: true }
-        - { config: mingw-static, mingw: true }
-        - { config: mingw-dll-i686, mingw: true }
-        - { config: mingw-incomplete-sysroot, mingw: true }
-        - { config: mingw-static, mingw: true, runner: windows-11-arm }
+        - { config: clang-cl-dll,             mingw: false, cc: clang-cl, cxx: clang-cl }
+        - { config: clang-cl-static,          mingw: false, cc: clang-cl, cxx: clang-cl }
+        - { config: clang-cl-no-vcruntime,    mingw: false, cc: clang-cl, cxx: clang-cl }
+        - { config: clang-cl-debug,           mingw: false, cc: clang-cl, cxx: clang-cl }
+        - { config: clang-cl-static-crt,      mingw: false, cc: clang-cl, cxx: clang-cl }
+        - { config: mingw-dll,                mingw: true,  cc: cc,       cxx: c++ }
+        - { config: mingw-dll,                mingw: true,  cc: i686-w64-mingw32-clang, cxx: i686-w64-mingw32-clang++ }
+        - { config: mingw-static,             mingw: true,  cc: cc,       cxx: c++ }
+        - { config: mingw-incomplete-sysroot, mingw: true,  cc: cc,       cxx: c++ }
+        - { config: mingw-static,             mingw: true,  cc: cc,       cxx: c++, runner: windows-11-arm }
     runs-on: ${{ matrix.runner != '' && matrix.runner || 'windows-2022' }}
     steps:
       - uses: actions/checkout@1af3b93b6815bc44a9784bd300feb67ff0d1eeb3 # v6.0.0
@@ -286,5 +289,7 @@ jobs:
         run: |
           echo "c:\Program Files\LLVM\bin" | Out-File -FilePath $Env:GITHUB_PATH -Encoding utf8 -Append
       - name: Build and test
-        run: |
-          bash libcxx/utils/ci/run-buildbot ${{ matrix.config }}
+        run: bash libcxx/utils/ci/run-buildbot ${{ matrix.config }}
+        env:
+          CC: ${{ matrix.cc }}
+          CXX: ${{ matrix.cxx }}
diff --git a/.github/workflows/libcxx-check-generated-files.yml b/.github/workflows/libcxx-check-generated-files.yml
index a25dc8b70001d..ae1f680d95235 100644
--- a/.github/workflows/libcxx-check-generated-files.yml
+++ b/.github/workflows/libcxx-check-generated-files.yml
@@ -22,3 +22,6 @@ jobs:
 
       - name: Check generated files
         run: libcxx/utils/ci/run-buildbot check-generated-output
+        env:
+          CC: cc
+          CXX: c++
diff --git a/.github/workflows/libcxx-run-benchmarks.yml b/.github/workflows/libcxx-run-benchmarks.yml
index 64a902482f9a3..9f4cd257704e4 100644
--- a/.github/workflows/libcxx-run-benchmarks.yml
+++ b/.github/workflows/libcxx-run-benchmarks.yml
@@ -33,7 +33,7 @@ jobs:
 
     runs-on: llvm-premerge-libcxx-next-runners # TODO: This should run on a dedicated set of machines
     steps:
-      - uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
+      - uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548 # v6.1.0
         with:
           python-version: '3.14'
 
diff --git a/.github/workflows/mlir-spirv-tests.yml b/.github/workflows/mlir-spirv-tests.yml
index e9b0cddb391be..ee565aaf5c5a2 100644
--- a/.github/workflows/mlir-spirv-tests.yml
+++ b/.github/workflows/mlir-spirv-tests.yml
@@ -30,7 +30,7 @@ jobs:
     steps:
       - uses: actions/checkout@1af3b93b6815bc44a9784bd300feb67ff0d1eeb3 # v6.0.0
       - name: Setup ccache
-        uses: hendrikmuhs/ccache-action@bfa03e1de4d7f7c3e80ad9109feedd05c4f5a716 # v1.2.19
+        uses: hendrikmuhs/ccache-action@5ebbd400eff9e74630f759d94ddd7b6c26299639 # v1.2.20
         with:
           max-size: 2G
           key: spirv-mlir-ubuntu-24.04
diff --git a/.github/workflows/premerge.yaml b/.github/workflows/premerge.yaml
index 252d7fbe8e67f..10f7f6a827b30 100644
--- a/.github/workflows/premerge.yaml
+++ b/.github/workflows/premerge.yaml
@@ -120,7 +120,7 @@ jobs:
           retention-days: 5
           include-hidden-files: 'true'
       - name: Upload Comment
-        uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
+        uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
         if: ${{ always() && !startsWith(matrix.runs-on, 'depot-ubuntu-24.04-arm') }}
         continue-on-error: true
         with:
@@ -200,7 +200,7 @@ jobs:
         with:
           fetch-depth: 2
       - name: Setup ccache
-        uses: hendrikmuhs/ccache-action@bfa03e1de4d7f7c3e80ad9109feedd05c4f5a716 # v1.2.19
+        uses: hendrikmuhs/ccache-action@5ebbd400eff9e74630f759d94ddd7b6c26299639 # v1.2.20
         with:
           max-size: "2000M"
       - name: Install Ninja
diff --git a/.github/workflows/release-binaries.yml b/.github/workflows/release-binaries.yml
index 104d37db8a28d..a8bae830fc609 100644
--- a/.github/workflows/release-binaries.yml
+++ b/.github/workflows/release-binaries.yml
@@ -66,7 +66,7 @@ jobs:
     steps:
     # It's good practice to use setup-python, but this is also required on macos-14
     # due to https://github.com/actions/runner-images/issues/10385
-    - uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
+    - uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548 # v6.1.0
       with:
         python-version: '3.14'
 
diff --git a/.github/workflows/release-documentation.yml b/.github/workflows/release-documentation.yml
index 23bc0aed4a546..043a3f1ed9e08 100644
--- a/.github/workflows/release-documentation.yml
+++ b/.github/workflows/release-documentation.yml
@@ -41,7 +41,7 @@ jobs:
         uses: actions/checkout at 1af3b93b6815bc44a9784bd300feb67ff0d1eeb3 # v6.0.0
 
       - name: Setup Python env
-        uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
+        uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548 # v6.1.0
         with:
           cache: 'pip'
           cache-dependency-path: './llvm/docs/requirements.txt'
diff --git a/.github/workflows/release-doxygen.yml b/.github/workflows/release-doxygen.yml
index 6e6ea883ef1d0..4cf1b9b14ccd6 100644
--- a/.github/workflows/release-doxygen.yml
+++ b/.github/workflows/release-doxygen.yml
@@ -43,7 +43,7 @@ jobs:
         uses: actions/checkout at 1af3b93b6815bc44a9784bd300feb67ff0d1eeb3 # v6.0.0
 
       - name: Setup Python env
-        uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
+        uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548 # v6.1.0
         with:
           cache: 'pip'
           cache-dependency-path: './llvm/docs/requirements.txt'
diff --git a/.github/workflows/release-sources.yml b/.github/workflows/release-sources.yml
index 9b21d2adfd27a..20e7c62976098 100644
--- a/.github/workflows/release-sources.yml
+++ b/.github/workflows/release-sources.yml
@@ -64,11 +64,11 @@ jobs:
     name: Package Release Sources
     if: github.repository_owner == 'llvm'
     runs-on: ubuntu-24.04
+    outputs:
+      digest: ${{ steps.digest.outputs.digest }}
+      artifact-id: ${{ steps.artifact-upload.outputs.artifact-id }}
     needs:
       - inputs
-    permissions:
-      id-token: write
-      attestations: write
     steps:
       - name: Checkout LLVM
         uses: actions/checkout@1af3b93b6815bc44a9784bd300feb67ff0d1eeb3 # v6.0.0
@@ -79,30 +79,47 @@ jobs:
         run: |
           pip install --require-hashes -r ./llvm/utils/git/requirements.txt
 
-      - name: Check Permissions
-        if: github.event_name != 'pull_request'
-        env:
-          GITHUB_TOKEN: ${{ github.token }}
-          USER_TOKEN: ${{ secrets.RELEASE_TASKS_USER_TOKEN }}
-        run: |
-          ./llvm/utils/release/./github-upload-release.py --token "$GITHUB_TOKEN" --user ${{ github.actor }} --user-token "$USER_TOKEN" check-permissions
       - name: Create Tarballs
         run: |
           ./llvm/utils/release/export.sh ${{ needs.inputs.outputs.export-args }}
-      - name: Attest Build Provenance
-        if: github.event_name != 'pull_request'
-        id: provenance
-        uses: actions/attest-build-provenance@977bb373ede98d70efdf65b84cb5f73e068dcc2a # v3.0.0
-        with:
-          subject-path: "*.xz"
-      - if: github.event_name != 'pull_request'
+
+      - name: Generate sha256 digest for sources
+        id: digest
         run: |
-          mv ${{ steps.provenance.outputs.bundle-path }} .
-      - name: Create Tarball Artifacts
+          echo "digest=$(cat *.xz | sha256sum | cut -d ' ' -f 1)" >> $GITHUB_OUTPUT
+    
+      - name: Release Sources Artifact
+        id: artifact-upload
         uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
         with:
+          name: ${{ needs.inputs.outputs.ref }}-sources
           path: |
             *.xz
-            attestation.jsonl
 
+  attest-release-sources:
+    name: Attest Release Sources
+    runs-on: ubuntu-24.04
+    if: github.event_name != 'pull_request'
+    needs:
+      - inputs
+      - release-sources
+    permissions:
+      id-token: write
+      attestations: write
+    steps:
+      - name: Checkout Release Scripts
+        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
+        with:
+          sparse-checkout: |
+            .github/workflows/upload-release-artifact
+            llvm/utils/release/github-upload-release.py
+            llvm/utils/git/requirements.txt
+          sparse-checkout-cone-mode: false
 
+      - name: Upload Artifacts
+        uses: ./.github/workflows/upload-release-artifact
+        with:
+          artifact-id: ${{ needs.release-sources.outputs.artifact-id }}
+          attestation-name: ${{ needs.inputs.outputs.ref }}-sources-attestation
+          digest: ${{ needs.release-sources.outputs.digest }}
+          upload: false
diff --git a/.github/workflows/scorecard.yml b/.github/workflows/scorecard.yml
index 1ec56d2447df4..95aa8b59413cc 100644
--- a/.github/workflows/scorecard.yml
+++ b/.github/workflows/scorecard.yml
@@ -57,6 +57,6 @@ jobs:
 
       # Upload the results to GitHub's code scanning dashboard.
       - name: "Upload to code-scanning"
-        uses: github/codeql-action/upload-sarif@e12f0178983d466f2f6028f5cc7a6d786fd97f4b # v4.31.4
+        uses: github/codeql-action/upload-sarif@fdbfb4d2750291e159f0156def62b853c2798ca2 # v4.31.5
         with:
           sarif_file: results.sarif
diff --git a/.github/workflows/spirv-tests.yml b/.github/workflows/spirv-tests.yml
index 5ede9df3b006d..c33c6b3e5650c 100644
--- a/.github/workflows/spirv-tests.yml
+++ b/.github/workflows/spirv-tests.yml
@@ -26,7 +26,7 @@ jobs:
     steps:
       - uses: actions/checkout@1af3b93b6815bc44a9784bd300feb67ff0d1eeb3 # v6.0.0
       - name: Setup ccache
-        uses: hendrikmuhs/ccache-action@bfa03e1de4d7f7c3e80ad9109feedd05c4f5a716 # v1.2.19
+        uses: hendrikmuhs/ccache-action@5ebbd400eff9e74630f759d94ddd7b6c26299639 # v1.2.20
         with:
           max-size: 2G
           key: spirv-ubuntu-24.04
diff --git a/.github/workflows/test-unprivileged-download-artifact.yml b/.github/workflows/test-unprivileged-download-artifact.yml
index 39ac3d57a3879..ad41cdfdb7525 100644
--- a/.github/workflows/test-unprivileged-download-artifact.yml
+++ b/.github/workflows/test-unprivileged-download-artifact.yml
@@ -21,15 +21,23 @@ jobs:
     if: github.repository_owner == 'llvm'
     runs-on: ubuntu-24.04
     steps:
-      - name: Create Test File
+      - name: Create Test Files
         run: |
-          echo "test" > comment
-      - name: Upload Test File
+          echo "foo" > comment1
+          echo "bar" > comment2
+      - name: Upload Test File 1
         uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
         with:
-          name: workflow-args
+          name: artifact-name-1
           path: |
-            comment
+            comment1
+      - name: Upload Test File 2
+        uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
+        with:
+          name: artifact-name-2
+          path: |
+            comment2
+        
 
   test-download:
     name: Test Unprivileged Download Artifact
@@ -37,7 +45,7 @@ jobs:
     runs-on: ubuntu-24.04
     needs: [ upload-test-artifact ]
     steps:
-      - name: Chekcout LLVM
+      - name: Checkout LLVM
         uses: actions/checkout@1af3b93b6815bc44a9784bd300feb67ff0d1eeb3 # v6.0.0
         with:
           sparse-checkout: |
@@ -47,8 +55,10 @@ jobs:
         id: download-artifact
         with:
           run-id: ${{ github.run_id }}
-          artifact-name: workflow-args
+          artifact-name: artifact-name-
       - name: Assert That Contents are the Same
         run: |
-          cat comment
-          [[ "$(cat comment)" == "test" ]]
+          cat comment1
+          [[ "$(cat comment1)" == "foo" ]]
+          cat comment2
+          [[ "$(cat comment2)" == "bar" ]]
diff --git a/.github/workflows/unprivileged-download-artifact/action.yml b/.github/workflows/unprivileged-download-artifact/action.yml
index 72815b26bcf41..173b8ca93252f 100644
--- a/.github/workflows/unprivileged-download-artifact/action.yml
+++ b/.github/workflows/unprivileged-download-artifact/action.yml
@@ -19,9 +19,9 @@ outputs:
       The filename of the downloaded artifact or the empty string if the
       artifact was not found.
     value: ${{ steps.download-artifact.outputs.filename }}
-  artifact-id:
+  artifact-ids:
     description: "The id of the artifact being downloaded."
-    value: ${{ steps.artifact-url.outputs.id }}
+    value: ${{ steps.artifact-url.outputs.ids }}
 
 
 runs:
@@ -36,46 +36,67 @@ runs:
             response = await github.rest.actions.listArtifactsForRepo({
               owner: context.repo.owner,
               repo: context.repo.repo,
-              name: "${{ inputs.artifact-name }}"
             })
           } else {
             response = await github.rest.actions.listWorkflowRunArtifacts({
               owner: context.repo.owner,
               repo: context.repo.repo,
               run_id: "${{ inputs.run-id }}",
-              name: "${{ inputs.artifact-name }}"
             })
           }
 
           console.log(response)
 
+          artifacts_to_download = []
           for (artifact of response.data.artifacts) {
+            if (artifact.name.startsWith("${{ inputs.artifact-name }}")) {
+              artifacts_to_download.push(artifact)
+            }
+          }
+
+          for (artifact of artifacts_to_download) {
             console.log(artifact);
           }
 
-          if (response.data.artifacts.length == 0) {
-            console.log("Could not find artifact ${{ inputs.artifact-name }} for workflow run ${{ inputs.run-id }}")
+          if (artifacts_to_download.length == 0) {
+            console.log("Could not find artifacts starting with name ${{ inputs.artifact-name }} for workflow run ${{ inputs.run-id }}")
             return;
           }
 
-          const url_response = await github.rest.actions.downloadArtifact({
-            owner: context.repo.owner,
-            repo: context.repo.repo,
-            artifact_id: response.data.artifacts[0].id,
-            archive_format: "zip"
-          })
+          artifact_ids = []
+          artifact_urls = []
+          artifact_names = []
+          for (artifact_to_download of artifacts_to_download) {
+            const url_response = await github.rest.actions.downloadArtifact({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              artifact_id: artifact_to_download.id,
+              archive_format: "zip"
+            })
+
+            artifact_ids.push(artifact_to_download.id)
+            artifact_urls.push('"' + url_response.url + '"')
+            artifact_names.push('"' + artifact_to_download.name + '"')
+          }
 
-          core.setOutput("url", url_response.url);
-          core.setOutput("id", response.data.artifacts[0].id);
+          core.setOutput("urls", artifact_urls.join(" "));
+          core.setOutput("ids", artifact_ids.join(" "));
+          core.setOutput("names", artifact_names.join(" "));
 
     - shell: bash
-      if: steps.artifact-url.outputs.url != ''
+      if: steps.artifact-url.outputs.urls != ''
       id: download-artifact
       run: |
-        curl -L -o ${{ inputs.artifact-name }}.zip "${{ steps.artifact-url.outputs.url }}"
-        echo "filename=${{ inputs.artifact-name }}.zip" >> $GITHUB_OUTPUT
+        artifact_urls=(${{ steps.artifact-url.outputs.urls }})
+        artifact_names=(${{ steps.artifact-url.outputs.names }})
+        for i in "${!artifact_urls[@]}"; do
+          curl -L -o "${artifact_names[$i]}.zip" "${artifact_urls[$i]}"
+        done
 
     - shell: bash
-      if: steps.download-artifact.outputs.filename != ''
+      if: steps.artifact-url.outputs.names != ''
       run: |
-        unzip ${{ steps.download-artifact.outputs.filename }}
+        artifact_names=(${{ steps.artifact-url.outputs.names }})
+        for name in "${artifact_names[@]}"; do
+          unzip "${name}.zip"
+        done
diff --git a/.github/workflows/upload-release-artifact/action.yml b/.github/workflows/upload-release-artifact/action.yml
new file mode 100644
index 0000000000000..b2adb31f269c1
--- /dev/null
+++ b/.github/workflows/upload-release-artifact/action.yml
@@ -0,0 +1,105 @@
+name: Upload Release Artifact
+description: >-
+  Upload release artifact along with an attestation.  The action assumes that
+  the llvm-project repository has already been checked out.
+inputs:
+  release-version:
+    description: >-
+      The release where the artifact will be attached.
+    required: true
+  upload:
+    description: >-
+      Whether or not to upload the file and attestation to the release.  If this
+      is set to false, then the file will be uploaded to the job as an artifact,
+      but no attestation will be generated and the artifact won't be uploaded
+      to the release.
+    default: true
+  user-token:
+    description: >-
+      Token with permissions to read llvm teams that is used to ensure that
+      the person who triggered the action has permission to upload artifacts.
+      This is required if upload is true.
+    required: false
+  attestation-name:
+    description: >-
+      This will be used as the artifact name attached to the workflow and as the
+      basename for the attestation file, which will be called
+      $attestation-name.jsonl.  If this is not set, it will default to the
+      value of `files`.
+    required: false
+  artifact-id:
+    description: >-
+      Artifact id of the artifact with the files to upload.
+    required: true
+  digest:
+    description: >-
+      sha256 digest to verify the authenticity of the files being uploaded.
+    required: true
+
+runs:
+  using: "composite"
+  steps:
+    - name: Download Artifact
+      uses: actions/download-artifact at 018cc2cf5baa6db3ef3c5f8a56943fffe632ef53 # v6.0.0
+      id: download-artifact
+      with:
+        artifact-ids: ${{ inputs.artifact-id }}
+        path: downloads
+
+    # In theory github artifacts are immutable so we could just rely on using
+    # the artifact-id to download it, but just to be extra safe we want to
+    # generate a digest for the files we are uploading so we can verify it
+    # when downloading.
+    # See also: https://irsl.medium.com/github-artifact-immutability-is-a-lie-9b6244095694
+    - name: Verify Files
+      shell: bash
+      env:
+        INPUTS_DIGEST: ${{ inputs.digest }}
+      run: |
+        digest_file="sha256"
+        echo "$INPUTS_DIGEST -" > $digest_file
+        cat ${{ steps.download-artifact.outputs.download-path }}/* | sha256sum -c $digest_file
+
+    - name: Attest Build Provenance
+      id: provenance
+      uses: actions/attest-build-provenance at 977bb373ede98d70efdf65b84cb5f73e068dcc2a # v3.0.0
+      with:
+        subject-path: ${{ steps.download-artifact.outputs.download-path }}/*
+
+    - name: Rename attestation file
+      shell: bash
+      env:
+        INPUTS_ATTESTATION_NAME: ${{ inputs.attestation-name }}
+      run: |
+        mv ${{ steps.provenance.outputs.bundle-path }} "$INPUTS_ATTESTATION_NAME".jsonl
+
+    - name: Upload Build Provenance
+      uses: actions/upload-artifact at 330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
+      with:
+        name: ${{ inputs.attestation-name }}
+        path: |
+          ${{ inputs.attestation-name }}.jsonl
+
+    - name: Install Python Requirements
+      if: inputs.upload == 'true'
+      shell: bash
+      run: |
+        pip install --require-hashes -r ./llvm/utils/git/requirements.txt
+
+    - name: Check Permissions
+      if: inputs.upload == 'true'
+      env:
+        GITHUB_TOKEN: ${{ github.token }}
+        USER_TOKEN: ${{ inputs.user-token }}
+      shell: bash
+      run: |
+        ./llvm/utils/release/github-upload-release.py --token "$GITHUB_TOKEN" --user "$GITHUB_ACTOR" --user-token "$USER_TOKEN" check-permissions
+    - name: Upload Release
+      shell: bash
+      if: inputs.upload == 'true'
+      run: |
+        ./llvm/utils/release/github-upload-release.py \
+        --token ${{ github.token }} \
+        --release ${{ inputs.release-version }} \
+        upload \
+        --files ${{ steps.download-artifact.outputs.download-path }}/* ${{ inputs.attestation-name }}.jsonl
diff --git a/bolt/docs/BAT.md b/bolt/docs/BAT.md
index 817ad288aa34b..fa43e81553d5c 100644
--- a/bolt/docs/BAT.md
+++ b/bolt/docs/BAT.md
@@ -61,6 +61,7 @@ Functions table:
 
 ### Functions table
 Hot and cold functions tables share the encoding except differences marked below.
+
 Header:
 | Entry  | Encoding | Description |
 | ------ | ----- | ----------- |
diff --git a/bolt/docs/CommandLineArgumentReference.md b/bolt/docs/CommandLineArgumentReference.md
index 7c6e01d669b74..0dbf6f59d5e88 100644
--- a/bolt/docs/CommandLineArgumentReference.md
+++ b/bolt/docs/CommandLineArgumentReference.md
@@ -811,6 +811,15 @@
 
   Specify file name of the runtime instrumentation library
 
+- `--runtime-lib-init-hook=<value>`
+
+  Primary target for hooking runtime library initialization, used in
+  fallback order of availability in input binary (entry_point -> init
+   -> init_array) (default: entry_point)
+  - `entry_point`: use ELF Header Entry Point
+  - `init`: use ELF DT_INIT entry
+  - `init_array`: use ELF .init_array entry
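+
+  For example, an illustrative invocation that prefers hooking through DT_INIT
+  when instrumenting (the input and output file names here are placeholders):
+
+  ```bash
+  llvm-bolt input.exe -o output.instrumented --instrument --runtime-lib-init-hook=init
+  ```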
+
 - `--sctc-mode=<value>`
 
   Mode for simplify conditional tail calls
diff --git a/bolt/docs/PacRetDesign.md b/bolt/docs/PacRetDesign.md
index f3fe5fbd522cb..2e3cb7b91e0ce 100644
--- a/bolt/docs/PacRetDesign.md
+++ b/bolt/docs/PacRetDesign.md
@@ -200,15 +200,22 @@ This pass runs after optimizations. It performns the _inverse_ of MarkRAState pa
 Some BOLT passes can add new Instructions. In InsertNegateRAStatePass, we have
 to know what RA state these have.
 
-The current solution has the `inferUnknownStates` function to cover these, using
-a fairly simple strategy: unknown states inherit the last known state.
+> [!important]
+> As issue #160989 explains, unwind info is missing from stubs.
+> Because of this, we cannot generate correct pac-specific unwind info: the
+> signedness of the _incorrect_ return address is meaningless.
 
-This will be updated to a more robust solution.
+Assignment of RAStates to newly generated instructions is done in `inferUnknownStates`.
+We have two different cases to cover (a rough sketch follows the list):
 
-> [!important]
-> As issue #160989 describes, unwind info is incorrect in stubs with multiple callers.
-> For this same reason, we cannot generate correct pac-specific unwind info: the signess
-> of the _incorrect_ return address is meaningless.
+1. If a BasicBlock has some instructions with a known RAState and some without, we
+   can copy the RAState of the known instructions to the unknown ones. Because
+   control flow only changes between BasicBlocks, instructions in the same
+   BasicBlock have the same return address. (The exception is noreturn calls, but
+   these only cause problems if the newly inserted instruction is right after the
+   call.)
+
+2. If a BasicBlock has no instructions with known RAState, we have to copy the
+   RAState of the previous BasicBlock in layout order.
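+
+A rough sketch of how `inferUnknownStates` dispatches these two cases (it mirrors
+the structure of the implementation in `InsertNegateRAStatePass.cpp`; the helper
+names are taken from that pass):
+
+```cpp
+void InsertNegateRAState::inferUnknownStates(BinaryFunction &BF) {
+  BinaryContext &BC = BF.getBinaryContext();
+  // Case 1: propagate known RAStates to unknown instructions within each block.
+  for (BinaryBasicBlock &BB : BF)
+    fillUnknownStateInBB(BC, BB);
+  // Case 2: blocks made up entirely of new instructions (stubs) inherit the
+  // RAState of the previous BasicBlock in layout order.
+  fillUnknownStubs(BF);
+}
+```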
 
 ### Optimizations requiring special attention
 
diff --git a/bolt/include/bolt/Core/BinaryContext.h b/bolt/include/bolt/Core/BinaryContext.h
index 2af1d330b7545..8a90febcea3cc 100644
--- a/bolt/include/bolt/Core/BinaryContext.h
+++ b/bolt/include/bolt/Core/BinaryContext.h
@@ -807,6 +807,15 @@ class BinaryContext {
   /// the execution of the binary is completed.
   std::optional<uint64_t> FiniFunctionAddress;
 
+  /// DT_INIT.
+  std::optional<uint64_t> InitAddress;
+
+  /// DT_INIT_ARRAY. Only used when DT_INIT is not set.
+  std::optional<uint64_t> InitArrayAddress;
+
+  /// DT_INIT_ARRAYSZ. Only used when DT_INIT is not set.
+  std::optional<uint64_t> InitArraySize;
+
   /// DT_FINI.
   std::optional<uint64_t> FiniAddress;
 
diff --git a/bolt/include/bolt/Passes/InsertNegateRAStatePass.h b/bolt/include/bolt/Passes/InsertNegateRAStatePass.h
index 836948bf5e9c0..3f003af96162d 100644
--- a/bolt/include/bolt/Passes/InsertNegateRAStatePass.h
+++ b/bolt/include/bolt/Passes/InsertNegateRAStatePass.h
@@ -1,4 +1,4 @@
-//===- bolt/Passes/InsertNegateRAStatePass.cpp ----------------------------===//
+//===- bolt/Passes/InsertNegateRAStatePass.h ------------------------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
@@ -30,9 +30,30 @@ class InsertNegateRAState : public BinaryFunctionPass {
 private:
   /// Because states are tracked as MCAnnotations on individual instructions,
   /// newly inserted instructions do not have a state associated with them.
-  /// New states are "inherited" from the last known state.
+  /// Uses fillUnknownStateInBB and fillUnknownStubs.
   void inferUnknownStates(BinaryFunction &BF);
 
+  /// Simple case: copy RAStates to unknown insts from previous inst.
+  /// If the first inst has an unknown state, set it to the first known state.
+  /// Accounts for signing and authenticating insts.
+  void fillUnknownStateInBB(BinaryContext &BC, BinaryBasicBlock &BB);
+
+  /// Fill in RAState in BasicBlocks consisting entirely of new instructions.
+  /// As of #160989, we have to copy the RAState from the previous BB in the
+  /// layout, because CFIs are already incorrect here.
+  void fillUnknownStubs(BinaryFunction &BF);
+
+  /// Returns the first known RAState from \p BB, or std::nullopt if all are
+  /// unknown.
+  std::optional<bool> getFirstKnownRAState(BinaryContext &BC,
+                                           BinaryBasicBlock &BB);
+
+  /// Returns true if all instructions in \p BB have unknown RAState.
+  bool isUnknownBlock(BinaryContext &BC, BinaryBasicBlock &BB);
+
+  /// Set all instructions in \p BB to \p State.
+  void markUnknownBlock(BinaryContext &BC, BinaryBasicBlock &BB, bool State);
+
   /// Support for function splitting:
   /// if two consecutive BBs with Signed state are going to end up in different
   /// functions (so are held by different FunctionFragments), we have to add a
diff --git a/bolt/include/bolt/Rewrite/RewriteInstance.h b/bolt/include/bolt/Rewrite/RewriteInstance.h
index 35abf6b4d4ddd..5950b3c1630e1 100644
--- a/bolt/include/bolt/Rewrite/RewriteInstance.h
+++ b/bolt/include/bolt/Rewrite/RewriteInstance.h
@@ -93,14 +93,23 @@ class RewriteInstance {
   /// section allocations if found.
   void discoverBOLTReserved();
 
+  /// Check whether we should use DT_INIT or DT_INIT_ARRAY for instrumentation.
+  /// DT_INIT is preferred; DT_INIT_ARRAY is only used when no DT_INIT entry was
+  /// found.
+  Error discoverRtInitAddress();
+
   /// Check whether we should use DT_FINI or DT_FINI_ARRAY for instrumentation.
   /// DT_FINI is preferred; DT_FINI_ARRAY is only used when no DT_FINI entry was
   /// found.
   Error discoverRtFiniAddress();
 
+  /// If DT_INIT_ARRAY is used for instrumentation, update the relocation of its
+  /// first entry to point to the instrumentation library's init address.
+  Error updateRtInitReloc();
+
   /// If DT_FINI_ARRAY is used for instrumentation, update the relocation of its
   /// first entry to point to the instrumentation library's fini address.
-  void updateRtFiniReloc();
+  Error updateRtFiniReloc();
 
   /// Create and initialize metadata rewriters for this instance.
   void initializeMetadataManager();
diff --git a/bolt/lib/Passes/Inliner.cpp b/bolt/lib/Passes/Inliner.cpp
index 5a7d02a34b4d8..0740fcef9102b 100644
--- a/bolt/lib/Passes/Inliner.cpp
+++ b/bolt/lib/Passes/Inliner.cpp
@@ -491,6 +491,32 @@ bool Inliner::inlineCallsInFunction(BinaryFunction &Function) {
         }
       }
 
+      // AArch64 BTI:
+      // If the callee has an indirect tailcall (BR), we would transform it to
+      // an indirect call (BLR) in InlineCall. Because of this, we would have to
+      // update the BTI at the target of the tailcall. However, these targets
+      // are not known. Instead, we skip inlining callees that contain indirect
+      // tailcalls.
+      auto HasIndirectTailCall = [&](const BinaryFunction &BF) -> bool {
+        for (const auto &BB : BF) {
+          for (const auto &II : BB) {
+            if (BC.MIB->isIndirectBranch(II) && BC.MIB->isTailCall(II)) {
+              return true;
+            }
+          }
+        }
+        return false;
+      };
+
+      if (BC.isAArch64() && BC.usesBTI() &&
+          HasIndirectTailCall(*TargetFunction)) {
+        ++InstIt;
+        LLVM_DEBUG(dbgs() << "BOLT-DEBUG: Skipping inlining block with tailcall"
+                          << " in " << Function << " : " << BB->getName()
+                          << " to keep BTIs consistent.\n");
+        continue;
+      }
+
       LLVM_DEBUG(dbgs() << "BOLT-DEBUG: inlining call to " << *TargetFunction
                         << " in " << Function << " : " << BB->getName()
                         << ". Count: " << BB->getKnownExecutionCount()
diff --git a/bolt/lib/Passes/InsertNegateRAStatePass.cpp b/bolt/lib/Passes/InsertNegateRAStatePass.cpp
index 775b7795e77c5..ed4de8a56f89f 100644
--- a/bolt/lib/Passes/InsertNegateRAStatePass.cpp
+++ b/bolt/lib/Passes/InsertNegateRAStatePass.cpp
@@ -52,8 +52,8 @@ void InsertNegateRAState::runOnFunction(BinaryFunction &BF) {
         MCInst &Inst = *It;
         if (BC.MIB->isCFI(Inst))
           continue;
-        auto RAState = BC.MIB->getRAState(Inst);
-        if (!RAState) {
+        std::optional<bool> RAState = BC.MIB->getRAState(Inst);
+        if (!RAState.has_value()) {
           BC.errs() << "BOLT-ERROR: unknown RAState after inferUnknownStates "
                     << " in function " << BF.getPrintName() << "\n";
           PassFailed = true;
@@ -74,6 +74,20 @@ void InsertNegateRAState::runOnFunction(BinaryFunction &BF) {
   }
 }
 
+void InsertNegateRAState::inferUnknownStates(BinaryFunction &BF) {
+  BinaryContext &BC = BF.getBinaryContext();
+
+  // Fill in missing RAStates in simple cases (inside BBs).
+  for (BinaryBasicBlock &BB : BF) {
+    fillUnknownStateInBB(BC, BB);
+  }
+  // BasicBlocks which are made entirely of "new instructions" (instructions
+  // without RAState annotation) are stubs, and do not have correct unwind info.
+  // We should iterate in layout order and fill them based on previous known
+  // RAState.
+  fillUnknownStubs(BF);
+}
+
 void InsertNegateRAState::coverFunctionFragmentStart(BinaryFunction &BF,
                                                      FunctionFragment &FF) {
   BinaryContext &BC = BF.getBinaryContext();
@@ -92,8 +106,8 @@ void InsertNegateRAState::coverFunctionFragmentStart(BinaryFunction &BF,
   // If a function is already split in the input, the first FF can also start
   // with Signed state. This covers that scenario as well.
   auto II = (*FirstNonEmpty)->getFirstNonPseudo();
-  auto RAState = BC.MIB->getRAState(*II);
-  if (!RAState) {
+  std::optional<bool> RAState = BC.MIB->getRAState(*II);
+  if (!RAState.has_value()) {
     BC.errs() << "BOLT-ERROR: unknown RAState after inferUnknownStates "
               << " in function " << BF.getPrintName() << "\n";
     PassFailed = true;
@@ -104,32 +118,119 @@ void InsertNegateRAState::coverFunctionFragmentStart(BinaryFunction &BF,
                          MCCFIInstruction::createNegateRAState(nullptr));
 }
 
-void InsertNegateRAState::inferUnknownStates(BinaryFunction &BF) {
+std::optional<bool>
+InsertNegateRAState::getFirstKnownRAState(BinaryContext &BC,
+                                          BinaryBasicBlock &BB) {
+  for (const MCInst &Inst : BB) {
+    if (BC.MIB->isCFI(Inst))
+      continue;
+    std::optional<bool> RAState = BC.MIB->getRAState(Inst);
+    if (RAState.has_value())
+      return RAState;
+  }
+  return std::nullopt;
+}
+
+bool InsertNegateRAState::isUnknownBlock(BinaryContext &BC,
+                                         BinaryBasicBlock &BB) {
+  std::optional<bool> FirstRAState = getFirstKnownRAState(BC, BB);
+  return !FirstRAState.has_value();
+}
+
+void InsertNegateRAState::fillUnknownStateInBB(BinaryContext &BC,
+                                               BinaryBasicBlock &BB) {
+
+  auto First = BB.getFirstNonPseudo();
+  if (First == BB.end())
+    return;
+  // If the first instruction has unknown RAState, we should copy the first
+  // known RAState.
+  std::optional<bool> RAState = BC.MIB->getRAState(*First);
+  if (!RAState.has_value()) {
+    std::optional<bool> FirstRAState = getFirstKnownRAState(BC, BB);
+    if (!FirstRAState.has_value())
+      // We fill unknown BBs later.
+      return;
+
+    BC.MIB->setRAState(*First, *FirstRAState);
+  }
+
+  // At this point we know the RAState of the first instruction,
+  // so we can propagate the RAStates to all subsequent unknown instructions.
+  MCInst Prev = *First;
+  for (auto It = First + 1; It != BB.end(); ++It) {
+    MCInst &Inst = *It;
+    if (BC.MIB->isCFI(Inst))
+      continue;
+
+    // No need to check for nullopt: we only entered this loop after the first
+    // instruction had its RAState set, and RAState is always set for the
+    // previous instruction in the previous iteration of the loop.
+    std::optional<bool> PrevRAState = BC.MIB->getRAState(Prev);
+
+    std::optional<bool> RAState = BC.MIB->getRAState(Inst);
+    if (!RAState.has_value()) {
+      if (BC.MIB->isPSignOnLR(Prev))
+        PrevRAState = true;
+      else if (BC.MIB->isPAuthOnLR(Prev))
+        PrevRAState = false;
+      BC.MIB->setRAState(Inst, *PrevRAState);
+    }
+    Prev = Inst;
+  }
+}
+
+void InsertNegateRAState::markUnknownBlock(BinaryContext &BC,
+                                           BinaryBasicBlock &BB, bool State) {
+  // If we call this when an Instruction has either kRASigned or kRAUnsigned
+  // annotation, setRASigned or setRAUnsigned would fail.
+  assert(isUnknownBlock(BC, BB) &&
+         "markUnknownBlock should only be called on unknown blocks");
+  for (MCInst &Inst : BB) {
+    if (BC.MIB->isCFI(Inst))
+      continue;
+    BC.MIB->setRAState(Inst, State);
+  }
+}
+
+void InsertNegateRAState::fillUnknownStubs(BinaryFunction &BF) {
   BinaryContext &BC = BF.getBinaryContext();
   bool FirstIter = true;
   MCInst PrevInst;
-  for (BinaryBasicBlock &BB : BF) {
-    for (MCInst &Inst : BB) {
-      if (BC.MIB->isCFI(Inst))
-        continue;
+  for (FunctionFragment &FF : BF.getLayout().fragments()) {
+    for (BinaryBasicBlock *BB : FF) {
+      if (FirstIter) {
+        FirstIter = false;
+        if (isUnknownBlock(BC, *BB))
+          // If the first BasicBlock is unknown, the function's entry RAState
+          // should be used.
+          markUnknownBlock(BC, *BB, BF.getInitialRAState());
+      } else if (isUnknownBlock(BC, *BB)) {
+        // As explained in issue #160989, the unwind info is incorrect for
+        // stubs. Indicating the correct RAState without the rest of the unwind
+        // info being correct is not useful. Instead, we copy the RAState from
+        // the previous instruction.
+        std::optional<bool> PrevRAState = BC.MIB->getRAState(PrevInst);
+        if (!PrevRAState.has_value()) {
+          // No non-cfi instruction encountered in the function yet.
+          // This means the RAState is the same as at the function entry.
+          markUnknownBlock(BC, *BB, BF.getInitialRAState());
+          continue;
+        }
 
-      auto RAState = BC.MIB->getRAState(Inst);
-      if (!FirstIter && !RAState) {
         if (BC.MIB->isPSignOnLR(PrevInst))
-          RAState = true;
+          PrevRAState = true;
         else if (BC.MIB->isPAuthOnLR(PrevInst))
-          RAState = false;
-        else {
-          auto PrevRAState = BC.MIB->getRAState(PrevInst);
-          RAState = PrevRAState ? *PrevRAState : false;
-        }
-        BC.MIB->setRAState(Inst, *RAState);
-      } else {
-        FirstIter = false;
-        if (!RAState)
-          BC.MIB->setRAState(Inst, BF.getInitialRAState());
+          PrevRAState = false;
+        markUnknownBlock(BC, *BB, *PrevRAState);
       }
-      PrevInst = Inst;
+      // This function iterates on BasicBlocks, so the PrevInst has to be
+      // updated to the last instruction of the current BasicBlock. If the
+      // BasicBlock is empty, or only has PseudoInstructions, PrevInst will not
+      // be updated.
+      auto Last = BB->getLastNonPseudo();
+      if (Last != BB->rend())
+        PrevInst = *Last;
     }
   }
 }
diff --git a/bolt/lib/Rewrite/RewriteInstance.cpp b/bolt/lib/Rewrite/RewriteInstance.cpp
index 8a5bbe28e5f66..067a3e6636f0b 100644
--- a/bolt/lib/Rewrite/RewriteInstance.cpp
+++ b/bolt/lib/Rewrite/RewriteInstance.cpp
@@ -80,6 +80,7 @@ namespace opts {
 extern cl::list<std::string> HotTextMoveSections;
 extern cl::opt<bool> Hugify;
 extern cl::opt<bool> Instrument;
+extern cl::opt<uint32_t> InstrumentationSleepTime;
 extern cl::opt<bool> KeepNops;
 extern cl::opt<bool> Lite;
 extern cl::list<std::string> PrintOnly;
@@ -294,6 +295,28 @@ cl::bits<GadgetScannerKind> GadgetScannersToRun(
         clEnumValN(GS_ALL, "all", "All implemented scanners")),
     cl::ZeroOrMore, cl::CommaSeparated, cl::cat(BinaryAnalysisCategory));
 
+// Primary targets for hooking runtime library initialization, with fallback to
+// the next item if the current item is not available in the input binary.
+enum RuntimeLibInitHookTarget : char {
+  RLIH_ENTRY_POINT = 0, /// Use ELF Header Entry Point
+  RLIH_INIT = 1,        /// Use ELF DT_INIT entry
+  RLIH_INIT_ARRAY = 2,  /// Use ELF .init_array entry
+};
+
+cl::opt<RuntimeLibInitHookTarget> RuntimeLibInitHook(
+    "runtime-lib-init-hook",
+    cl::desc("Primary target for hooking runtime library initialization, used "
+             "in fallback order of availabiliy in input binary (entry_point -> "
+             "init -> init_array) (default: entry_point)"),
+    cl::Hidden, cl::init(RLIH_ENTRY_POINT),
+    cl::values(clEnumValN(RLIH_ENTRY_POINT, "entry_point",
+                          "use ELF Header Entry Point"),
+               clEnumValN(RLIH_INIT, "init", "use ELF DT_INIT entry"),
+               clEnumValN(RLIH_INIT_ARRAY, "init_array",
+                          "use ELF .init_array entry")),
+    cl::ZeroOrMore, cl::cat(BoltOptCategory));
+
 } // namespace opts
 
 // FIXME: implement a better way to mark sections for replacement.
@@ -741,9 +764,12 @@ Error RewriteInstance::run() {
   adjustCommandLineOptions();
   discoverFileObjects();
 
-  if (opts::Instrument && !BC->IsStaticExecutable)
+  if (opts::Instrument && !BC->IsStaticExecutable) {
+    if (Error E = discoverRtInitAddress())
+      return E;
     if (Error E = discoverRtFiniAddress())
       return E;
+  }
 
   preprocessProfileData();
 
@@ -785,8 +811,12 @@ Error RewriteInstance::run() {
 
   updateMetadata();
 
-  if (opts::Instrument && !BC->IsStaticExecutable)
-    updateRtFiniReloc();
+  if (opts::Instrument && !BC->IsStaticExecutable) {
+    if (Error E = updateRtInitReloc())
+      return E;
+    if (Error E = updateRtFiniReloc())
+      return E;
+  }
 
   if (opts::OutputFilename == "/dev/null") {
     BC->outs() << "BOLT-INFO: skipping writing final binary to disk\n";
@@ -1411,6 +1441,65 @@ void RewriteInstance::discoverBOLTReserved() {
   NextAvailableAddress = BC->BOLTReserved.start();
 }
 
+Error RewriteInstance::discoverRtInitAddress() {
+  if (BC->HasInterpHeader && opts::RuntimeLibInitHook == opts::RLIH_ENTRY_POINT)
+    return Error::success();
+
+  // Use DT_INIT if it's available.
+  if (BC->InitAddress && opts::RuntimeLibInitHook <= opts::RLIH_INIT) {
+    BC->StartFunctionAddress = BC->InitAddress;
+    return Error::success();
+  }
+
+  if (!BC->InitArrayAddress || !BC->InitArraySize) {
+    return createStringError(std::errc::not_supported,
+                             "Instrumentation of shared library needs either "
+                             "DT_INIT or DT_INIT_ARRAY");
+  }
+
+  if (*BC->InitArraySize < BC->AsmInfo->getCodePointerSize()) {
+    return createStringError(std::errc::not_supported,
+                             "Need at least 1 DT_INIT_ARRAY slot");
+  }
+
+  ErrorOr<BinarySection &> InitArraySection =
+      BC->getSectionForAddress(*BC->InitArrayAddress);
+  if (auto EC = InitArraySection.getError())
+    return errorCodeToError(EC);
+
+  if (InitArraySection->getAddress() != *BC->InitArrayAddress) {
+    return createStringError(std::errc::not_supported,
+                             "Inconsistent address of .init_array section");
+  }
+
+  if (const Relocation *Reloc = InitArraySection->getDynamicRelocationAt(0)) {
+    if (Reloc->isRelative()) {
+      BC->StartFunctionAddress = Reloc->Addend;
+    } else {
+      MCSymbol *Sym = Reloc->Symbol;
+      if (!Sym)
+        return createStringError(
+            std::errc::not_supported,
+            "Failed to locate symbol for 0 entry of .init_array");
+      const BinaryFunction *BF = BC->getFunctionForSymbol(Sym);
+      if (!BF)
+        return createStringError(
+            std::errc::not_supported,
+            "Failed to locate binary function for 0 entry of .init_array");
+      BC->StartFunctionAddress = BF->getAddress() + Reloc->Addend;
+    }
+    return Error::success();
+  }
+
+  if (const Relocation *Reloc = InitArraySection->getRelocationAt(0)) {
+    BC->StartFunctionAddress = Reloc->Value;
+    return Error::success();
+  }
+
+  return createStringError(std::errc::not_supported,
+                           "No relocation for first DT_INIT_ARRAY slot");
+}
+
 Error RewriteInstance::discoverRtFiniAddress() {
   // Use DT_FINI if it's available.
   if (BC->FiniAddress) {
@@ -1419,6 +1508,9 @@ Error RewriteInstance::discoverRtFiniAddress() {
   }
 
   if (!BC->FiniArrayAddress || !BC->FiniArraySize) {
+    // Missing fini hooks are allowed when instrumentation-sleep-time is in use.
+    if (opts::InstrumentationSleepTime > 0)
+      return Error::success();
     return createStringError(
         std::errc::not_supported,
         "Instrumentation needs either DT_FINI or DT_FINI_ARRAY");
@@ -1434,6 +1526,11 @@ Error RewriteInstance::discoverRtFiniAddress() {
   if (auto EC = FiniArraySection.getError())
     return errorCodeToError(EC);
 
+  if (FiniArraySection->getAddress() != *BC->FiniArrayAddress) {
+    return createStringError(std::errc::not_supported,
+                             "Inconsistent address of .fini_array section");
+  }
+
   if (const Relocation *Reloc = FiniArraySection->getDynamicRelocationAt(0)) {
     BC->FiniFunctionAddress = Reloc->Addend;
     return Error::success();
@@ -1448,26 +1545,99 @@ Error RewriteInstance::discoverRtFiniAddress() {
                            "No relocation for first DT_FINI_ARRAY slot");
 }
 
-void RewriteInstance::updateRtFiniReloc() {
+Error RewriteInstance::updateRtInitReloc() {
+  if (BC->HasInterpHeader && opts::RuntimeLibInitHook == opts::RLIH_ENTRY_POINT)
+    return Error::success();
+
+  // Updating DT_INIT is handled by patchELFDynamic.
+  if (BC->InitAddress && opts::RuntimeLibInitHook <= opts::RLIH_INIT)
+    return Error::success();
+
+  const RuntimeLibrary *RT = BC->getRuntimeLibrary();
+  if (!RT || !RT->getRuntimeStartAddress())
+    return Error::success();
+
+  if (!BC->InitArrayAddress)
+    return Error::success();
+
+  if (!BC->InitArrayAddress || !BC->InitArraySize)
+    return createStringError(std::errc::not_supported,
+                             "inconsistent .init_array state");
+
+  ErrorOr<BinarySection &> InitArraySection =
+      BC->getSectionForAddress(*BC->InitArrayAddress);
+  if (!InitArraySection)
+    return createStringError(std::errc::not_supported, ".init_array removed");
+
+  if (std::optional<Relocation> Reloc =
+          InitArraySection->takeDynamicRelocationAt(0)) {
+    if (Reloc->isRelative()) {
+      if (Reloc->Addend != BC->StartFunctionAddress)
+        return createStringError(std::errc::not_supported,
+                                 "inconsistent .init_array dynamic relocation");
+      Reloc->Addend = RT->getRuntimeStartAddress();
+      InitArraySection->addDynamicRelocation(*Reloc);
+    } else {
+      MCSymbol *Sym = Reloc->Symbol;
+      if (!Sym)
+        return createStringError(
+            std::errc::not_supported,
+            "Failed to locate symbol for 0 entry of .init_array");
+      const BinaryFunction *BF = BC->getFunctionForSymbol(Sym);
+      if (!BF)
+        return createStringError(
+            std::errc::not_supported,
+            "Failed to locate binary function for 0 entry of .init_array");
+      if (BF->getAddress() + Reloc->Addend != BC->StartFunctionAddress)
+        return createStringError(std::errc::not_supported,
+                                 "inconsistent .init_array dynamic relocation");
+      InitArraySection->addDynamicRelocation(Relocation{
+          /*Offset*/ 0, /*Symbol*/ nullptr, /*Type*/ Relocation::getAbs64(),
+          /*Addend*/ RT->getRuntimeStartAddress(), /*Value*/ 0});
+    }
+  }
+  // Update the static relocation by adding a pending relocation which will get
+  // patched when flushPendingRelocations is called in rewriteFile. Note that
+  // flushPendingRelocations will calculate the value to patch as
+  // "Symbol + Addend". Since we don't have a symbol, just set the addend to the
+  // desired value.
+  InitArraySection->addPendingRelocation(Relocation{
+      /*Offset*/ 0, /*Symbol*/ nullptr, /*Type*/ Relocation::getAbs64(),
+      /*Addend*/ RT->getRuntimeStartAddress(), /*Value*/ 0});
+  BC->outs()
+      << "BOLT-INFO: runtime library initialization was hooked via .init_array "
+         "entry, set to 0x"
+      << Twine::utohexstr(RT->getRuntimeStartAddress()) << "\n";
+  return Error::success();
+}
+
+Error RewriteInstance::updateRtFiniReloc() {
   // Updating DT_FINI is handled by patchELFDynamic.
   if (BC->FiniAddress)
-    return;
+    return Error::success();
 
   const RuntimeLibrary *RT = BC->getRuntimeLibrary();
   if (!RT || !RT->getRuntimeFiniAddress())
-    return;
+    return Error::success();
 
-  assert(BC->FiniArrayAddress && BC->FiniArraySize &&
-         "inconsistent .fini_array state");
+  if (!BC->FiniArrayAddress || !BC->FiniArraySize) {
+    // Missing fini hooks are allowed when instrumentation-sleep-time is in use.
+    if (opts::InstrumentationSleepTime > 0)
+      return Error::success();
+    return createStringError(std::errc::not_supported,
+                             "inconsistent .fini_array state");
+  }
 
   ErrorOr<BinarySection &> FiniArraySection =
       BC->getSectionForAddress(*BC->FiniArrayAddress);
-  assert(FiniArraySection && ".fini_array removed");
+  if (!FiniArraySection)
+    return createStringError(std::errc::not_supported, ".fini_array removed");
 
   if (std::optional<Relocation> Reloc =
           FiniArraySection->takeDynamicRelocationAt(0)) {
-    assert(Reloc->Addend == BC->FiniFunctionAddress &&
-           "inconsistent .fini_array dynamic relocation");
+    if (Reloc->Addend != BC->FiniFunctionAddress)
+      return createStringError(std::errc::not_supported,
+                               "inconsistent .fini_array dynamic relocation");
     Reloc->Addend = RT->getRuntimeFiniAddress();
     FiniArraySection->addDynamicRelocation(*Reloc);
   }
@@ -1480,6 +1650,10 @@ void RewriteInstance::updateRtFiniReloc() {
   FiniArraySection->addPendingRelocation(Relocation{
       /*Offset*/ 0, /*Symbol*/ nullptr, /*Type*/ Relocation::getAbs64(),
       /*Addend*/ RT->getRuntimeFiniAddress(), /*Value*/ 0});
+  BC->outs() << "BOLT-INFO: runtime library finalization was hooked via "
+                ".fini_array entry, set to 0x"
+             << Twine::utohexstr(RT->getRuntimeFiniAddress()) << "\n";
+  return Error::success();
 }
 
 void RewriteInstance::registerFragments() {
@@ -2178,6 +2352,14 @@ void RewriteInstance::adjustCommandLineOptions() {
     exit(1);
   }
 
+  if (opts::Instrument && opts::RuntimeLibInitHook == opts::RLIH_ENTRY_POINT &&
+      !BC->HasInterpHeader) {
+    BC->errs()
+        << "BOLT-WARNING: adjusted runtime-lib-init-hook to 'init' due to "
+           "absence of INTERP header\n";
+    opts::RuntimeLibInitHook = opts::RLIH_INIT;
+  }
+
   if (opts::HotText && opts::HotTextMoveSections.getNumOccurrences() == 0) {
     opts::HotTextMoveSections.addValue(".stub");
     opts::HotTextMoveSections.addValue(".mover");
@@ -4849,9 +5031,14 @@ void RewriteInstance::patchELFSectionHeaderTable(ELFObjectFile<ELFT> *File) {
   ELFEhdrTy NewEhdr = Obj.getHeader();
 
   if (BC->HasRelocations) {
-    if (RuntimeLibrary *RtLibrary = BC->getRuntimeLibrary())
+    RuntimeLibrary *RtLibrary = BC->getRuntimeLibrary();
+    if (RtLibrary && opts::RuntimeLibInitHook == opts::RLIH_ENTRY_POINT) {
       NewEhdr.e_entry = RtLibrary->getRuntimeStartAddress();
-    else
+      BC->outs()
+          << "BOLT-INFO: runtime library initialization was hooked via ELF "
+             "Header Entry Point, set to 0x"
+          << Twine::utohexstr(NewEhdr.e_entry) << "\n";
+    } else
       NewEhdr.e_entry = getNewFunctionAddress(NewEhdr.e_entry);
     assert((NewEhdr.e_entry || !Obj.getHeader().e_entry) &&
            "cannot find new address for entry point");
@@ -5692,14 +5879,23 @@ void RewriteInstance::patchELFDynamic(ELFObjectFile<ELFT> *File) {
       }
       RuntimeLibrary *RtLibrary = BC->getRuntimeLibrary();
       if (RtLibrary && Dyn.getTag() == ELF::DT_FINI) {
-        if (uint64_t Addr = RtLibrary->getRuntimeFiniAddress())
+        if (uint64_t Addr = RtLibrary->getRuntimeFiniAddress()) {
           NewDE.d_un.d_ptr = Addr;
+          BC->outs()
+              << "BOLT-INFO: runtime library finalization was hooked via "
+                 "DT_FINI, set to 0x"
+              << Twine::utohexstr(Addr) << "\n";
+        }
       }
-      if (RtLibrary && Dyn.getTag() == ELF::DT_INIT && !BC->HasInterpHeader) {
+      if (RtLibrary && Dyn.getTag() == ELF::DT_INIT &&
+          (!BC->HasInterpHeader ||
+           opts::RuntimeLibInitHook == opts::RLIH_INIT)) {
         if (auto Addr = RtLibrary->getRuntimeStartAddress()) {
-          LLVM_DEBUG(dbgs() << "BOLT-DEBUG: Set DT_INIT to 0x"
-                            << Twine::utohexstr(Addr) << '\n');
           NewDE.d_un.d_ptr = Addr;
+          BC->outs()
+              << "BOLT-INFO: runtime library initialization was hooked via "
+                 "DT_INIT, set to 0x"
+              << Twine::utohexstr(Addr) << "\n";
         }
       }
       break;
@@ -5767,10 +5963,13 @@ Error RewriteInstance::readELFDynamic(ELFObjectFile<ELFT> *File) {
   for (const Elf_Dyn &Dyn : DynamicEntries) {
     switch (Dyn.d_tag) {
     case ELF::DT_INIT:
-      if (!BC->HasInterpHeader) {
-        LLVM_DEBUG(dbgs() << "BOLT-DEBUG: Set start function address\n");
-        BC->StartFunctionAddress = Dyn.getPtr();
-      }
+      BC->InitAddress = Dyn.getPtr();
+      break;
+    case ELF::DT_INIT_ARRAY:
+      BC->InitArrayAddress = Dyn.getPtr();
+      break;
+    case ELF::DT_INIT_ARRAYSZ:
+      BC->InitArraySize = Dyn.getPtr();
       break;
     case ELF::DT_FINI:
       BC->FiniAddress = Dyn.getPtr();
diff --git a/bolt/test/AArch64/hook-fini.s b/bolt/test/AArch64/hook-fini.s
index 4f321d463ef32..3bb95f9317b1b 100644
--- a/bolt/test/AArch64/hook-fini.s
+++ b/bolt/test/AArch64/hook-fini.s
@@ -15,13 +15,13 @@
 # RUN: %clang %cflags -pie %s -Wl,-q -o %t.exe
 # RUN: llvm-readelf -d %t.exe | FileCheck --check-prefix=DYN-FINI %s
 # RUN: llvm-readelf -r %t.exe | FileCheck --check-prefix=RELOC-PIE %s
-# RUN: llvm-bolt %t.exe -o %t --instrument
+# RUN: llvm-bolt %t.exe -o %t --instrument | FileCheck --check-prefix=CHECK-BOLT-RT-FINI %s
 # RUN: llvm-readelf -drs %t | FileCheck --check-prefix=CHECK-FINI %s
 
 # RUN: %clang %cflags -pie %s -Wl,-q,-fini=0 -o %t-no-fini.exe
 # RUN: llvm-readelf -d %t-no-fini.exe | FileCheck --check-prefix=DYN-NO-FINI %s
 # RUN: llvm-readelf -r %t-no-fini.exe | FileCheck --check-prefix=RELOC-PIE %s
-# RUN: llvm-bolt %t-no-fini.exe -o %t-no-fini --instrument
+# RUN: llvm-bolt %t-no-fini.exe -o %t-no-fini --instrument | FileCheck --check-prefix=CHECK-BOLT-RT-FINI-ARRAY %s
 # RUN: llvm-readelf -drs %t-no-fini | FileCheck --check-prefix=CHECK-NO-FINI %s
 # RUN: llvm-readelf -ds -x .fini_array %t-no-fini | FileCheck --check-prefix=CHECK-NO-FINI-RELOC %s
 
@@ -29,7 +29,7 @@
 # RUN: %clang %cflags %p/../Inputs/stub.c -fPIC -shared -o %t-stub.so
 # RUN: %clang %cflags %s -no-pie -Wl,-q,-fini=0 %t-stub.so -o %t-no-pie-no-fini.exe
 # RUN: llvm-readelf -r %t-no-pie-no-fini.exe | FileCheck --check-prefix=RELOC-NO-PIE %s
-# RUN: llvm-bolt %t-no-pie-no-fini.exe -o %t-no-pie-no-fini --instrument
+# RUN: llvm-bolt %t-no-pie-no-fini.exe -o %t-no-pie-no-fini --instrument | FileCheck --check-prefix=CHECK-BOLT-RT-FINI-ARRAY %s
 # RUN: llvm-readelf -ds -x .fini_array %t-no-pie-no-fini | FileCheck --check-prefix=CHECK-NO-PIE-NO-FINI %s
 
 ## With fini: dynamic section should contain DT_FINI
@@ -46,6 +46,14 @@
 ## Without PIE: binary should not have relative relocations
 # RELOC-NO-PIE-NOT: R_AARCH64_RELATIVE
 
+## Check BOLT output for the finalization hook (DT_FINI)
+# CHECK-BOLT-RT-FINI: runtime library finalization was hooked via DT_FINI
+# CHECK-BOLT-RT-FINI-NOT: runtime library finalization was hooked via .fini_array entry
+
+## Check BOLT output for the finalization hook (.fini_array entry)
+# CHECK-BOLT-RT-FINI-ARRAY-NOT: runtime library finalization was hooked via DT_FINI
+# CHECK-BOLT-RT-FINI-ARRAY: runtime library finalization was hooked via .fini_array entry
+
 ## Check that DT_FINI is set to __bolt_runtime_fini
 # CHECK-FINI:     Dynamic section at offset {{.*}} contains {{.*}} entries:
 # CHECK-FINI-DAG: (FINI) 0x[[FINI:[[:xdigit:]]+]]
diff --git a/bolt/test/AArch64/hook-init.s b/bolt/test/AArch64/hook-init.s
new file mode 100644
index 0000000000000..a48328b630fa0
--- /dev/null
+++ b/bolt/test/AArch64/hook-init.s
@@ -0,0 +1,221 @@
+## Test the different ways of hooking the init function for instrumentation (via
+## entry point, DT_INIT and via DT_INIT_ARRAY). We test the latter for both PIE
+## and non-PIE binaries, and for both an executable and a shared library, because
+## of the different ways of handling relocations (static or dynamic).
+## All tests perform the following steps:
+## - Compile and link for the case to be tested
+## - Some sanity-checks on the dynamic section and relocations in the binary to
+##   verify it has the shape we want for testing:
+##   - INTERP in Program Headers
+##   - DT_INIT or DT_INIT_ARRAY in dynamic section
+##   - No relative relocations for non-PIE
+## - Instrument (with extra --runtime-lib-init-hook=init/init_array options
+##   in some cases)
+## - Verify generated binary
+# REQUIRES: system-linux,bolt-runtime,target=aarch64{{.*}}
+
+# RUN: %clang %cflags -pie %s -Wl,-q -o %t.exe
+# RUN: llvm-readelf -d %t.exe | FileCheck --check-prefix=DYN-INIT %s
+# RUN: llvm-readelf -l %t.exe | FileCheck --check-prefix=PH-INTERP %s
+# RUN: llvm-readelf -r %t.exe | FileCheck --check-prefix=RELOC-PIE %s
+# RUN: llvm-bolt %t.exe -o %t --instrument | FileCheck --check-prefix=CHECK-BOLT-RT-EP %s
+# RUN: llvm-readelf -hdrs %t | FileCheck --check-prefix=CHECK-INIT-EP %s
+# RUN: llvm-bolt %t.exe -o %t-no-ep --instrument --runtime-lib-init-hook=init | FileCheck --check-prefix=CHECK-BOLT-RT-INIT %s
+# RUN: llvm-readelf -hdrs %t-no-ep | FileCheck --check-prefix=CHECK-INIT-NO-EP %s
+# RUN: llvm-bolt %t.exe -o %t-no-ep --instrument --runtime-lib-init-hook=init_array | FileCheck --check-prefix=CHECK-BOLT-RT-INIT-ARRAY %s
+# RUN: llvm-readelf -hdrs %t-no-ep | FileCheck --check-prefix=CHECK-INIT-ARRAY-NO-EP %s
+
+# RUN: %clang -shared %cflags -pie %s -Wl,-q -o %t-shared.exe
+# RUN: llvm-readelf -d %t-shared.exe | FileCheck --check-prefix=DYN-INIT %s
+# RUN: llvm-readelf -l %t-shared.exe | FileCheck --check-prefix=PH-INTERP-SHARED %s
+# RUN: llvm-readelf -r %t-shared.exe | FileCheck --check-prefix=RELOC-SHARED-PIE %s
+# RUN: llvm-bolt %t-shared.exe -o %t-shared --instrument | FileCheck --check-prefix=CHECK-BOLT-RT-INIT %s
+# RUN: llvm-readelf -hdrs %t-shared | FileCheck --check-prefix=CHECK-SHARED-INIT %s
+
+# RUN: %clang %cflags -pie %s -Wl,-q,-init=0 -o %t-no-init.exe
+# RUN: llvm-readelf -d %t-no-init.exe | FileCheck --check-prefix=DYN-NO-INIT %s
+# RUN: llvm-readelf -l %t-no-init.exe | FileCheck --check-prefix=PH-INTERP %s
+# RUN: llvm-readelf -r %t-no-init.exe | FileCheck --check-prefix=RELOC-PIE %s
+# RUN: llvm-bolt %t-no-init.exe -o %t-no-init --instrument | FileCheck --check-prefix=CHECK-BOLT-RT-EP %s
+# RUN: llvm-readelf -hdrs %t-no-init | FileCheck --check-prefix=CHECK-NO-INIT-EP %s
+# RUN: llvm-bolt %t-no-init.exe -o %t-no-init-no-ep --instrument --runtime-lib-init-hook=init | FileCheck --check-prefix=CHECK-BOLT-RT-INIT-ARRAY %s
+# RUN: llvm-readelf -hdrs %t-no-init-no-ep | FileCheck --check-prefix=CHECK-NO-INIT-NO-EP %s
+
+# RUN: %clang -shared %cflags -pie %s -Wl,-q,-init=0 -o %t-shared-no-init.exe
+# RUN: llvm-readelf -d %t-shared-no-init.exe | FileCheck --check-prefix=DYN-NO-INIT %s
+# RUN: llvm-readelf -l %t-shared-no-init.exe | FileCheck --check-prefix=PH-INTERP-SHARED %s
+# RUN: llvm-readelf -r %t-shared-no-init.exe | FileCheck --check-prefix=RELOC-SHARED-PIE %s
+# RUN: llvm-bolt %t-shared-no-init.exe -o %t-shared-no-init --instrument | FileCheck --check-prefix=CHECK-BOLT-RT-INIT-ARRAY %s
+# RUN: llvm-readelf -drs %t-shared-no-init | FileCheck --check-prefix=CHECK-SHARED-NO-INIT %s
+
+## Create a dummy shared library to link against to force creation of the dynamic section.
+# RUN: %clang %cflags %p/../Inputs/stub.c -fPIC -shared -o %t-stub.so
+# RUN: %clang %cflags %s -no-pie -Wl,-q,-init=0 %t-stub.so -o %t-no-pie-no-init.exe
+# RUN: llvm-readelf -r %t-no-pie-no-init.exe | FileCheck --check-prefix=RELOC-NO-PIE %s
+# RUN: llvm-bolt %t-no-pie-no-init.exe -o %t-no-pie-no-init --instrument | FileCheck --check-prefix=CHECK-BOLT-RT-EP %s
+# RUN: llvm-readelf -hds %t-no-pie-no-init | FileCheck --check-prefix=CHECK-NO-PIE-NO-INIT-EP %s
+
+## With init: dynamic section should contain DT_INIT
+# DYN-INIT: (INIT)
+
+## Without init: dynamic section should only contain DT_INIT_ARRAY
+# DYN-NO-INIT-NOT: (INIT)
+# DYN-NO-INIT:     (INIT_ARRAY)
+# DYN-NO-INIT:     (INIT_ARRAYSZ)
+
+## With interp program header (executable)
+# PH-INTERP: Program Headers:
+# PH-INTERP: INTERP
+
+## Without interp program header (shared library)
+# PH-INTERP-SHARED:     Program Headers:
+# PH-INTERP-SHARED-NOT: INTERP
+
+## With PIE: binary should have relative relocations
+# RELOC-PIE: R_AARCH64_RELATIVE
+
+## With PIE: binary should have relative relocations
+# RELOC-SHARED-PIE: R_AARCH64_ABS64
+
+## Without PIE: binary should not have relative relocations
+# RELOC-NO-PIE-NOT: R_AARCH64_RELATIVE
+
+## Check BOLT output for the initialization hook (ELF Header Entry Point)
+# CHECK-BOLT-RT-EP: runtime library initialization was hooked via ELF Header Entry Point
+# CHECK-BOLT-RT-EP-NOT: runtime library initialization was hooked via DT_INIT
+# CHECK-BOLT-RT-EP-NOT: runtime library initialization was hooked via .init_array entry
+
+## Check BOLT output for the initialization hook (DT_INIT)
+# CHECK-BOLT-RT-INIT-NOT: runtime library initialization was hooked via ELF Header Entry Point
+# CHECK-BOLT-RT-INIT: runtime library initialization was hooked via DT_INIT
+# CHECK-BOLT-RT-INIT-NOT: runtime library initialization was hooked via .init_array entry
+
+## Check BOLT output for the initialization hook (.init_array entry)
+# CHECK-BOLT-RT-INIT-ARRAY-NOT: runtime library initialization was hooked via ELF Header Entry Point
+# CHECK-BOLT-RT-INIT-ARRAY-NOT: runtime library initialization was hooked via DT_INIT
+# CHECK-BOLT-RT-INIT-ARRAY: runtime library initialization was hooked via .init_array entry
+
+## Check that entry point address is set to __bolt_runtime_start for PIE executable with DT_INIT
+# CHECK-INIT-EP:               ELF Header:
+# CHECK-INIT-EP:               Entry point address: 0x[[#%x,EP_ADDR:]]
+## Check that the dynamic relocations at .init and .init_array were not patched
+# CHECK-INIT-EP:               Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-INIT-EP-NOT:           (INIT) 0x[[#%x, EP_ADDR]]
+# CHECK-INIT-EP-NOT:           (INIT_ARRAY) 0x[[#%x, EP_ADDR]]
+## Check that the new entry point address points to __bolt_runtime_start
+# CHECK-INIT-EP:               Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-INIT-EP:               {{0+}}[[#%x, EP_ADDR]] {{.*}} __bolt_runtime_start
+
+## Check that DT_INIT address is set to __bolt_runtime_start for PIE executable with DT_INIT
+# CHECK-INIT-NO-EP:            ELF Header:
+# CHECK-INIT-NO-EP:            Entry point address: 0x[[#%x,EP_ADDR:]]
+## Read Dynamic section DT_INIT and DT_INIT_ARRAY entries
+# CHECK-INIT-NO-EP:            Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-INIT-NO-EP-DAG:        (INIT) 0x[[#%x,INIT:]]
+# CHECK-INIT-NO-EP-DAG:        (INIT_ARRAY) 0x[[#%x,INIT_ARRAY:]]
+## Check if ELF entry point address points to _start symbol and new DT_INIT entry points to __bolt_runtime_start
+# CHECK-INIT-NO-EP:            Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-INIT-NO-EP-DAG:        {{0+}}[[#%x, EP_ADDR]] {{.*}} _start
+# CHECK-INIT-NO-EP-DAG:        {{0+}}[[#%x, INIT]] {{.*}} __bolt_runtime_start
+
+## Check that 1st entry of DT_INIT_ARRAY is set to __bolt_runtime_start and DT_INIT was not changed
+# CHECK-INIT-ARRAY-NO-EP:      ELF Header:
+# CHECK-INIT-ARRAY-NO-EP:      Entry point address: 0x[[#%x,EP_ADDR:]]
+## Read Dynamic section DT_INIT and DT_INIT_ARRAY entries
+# CHECK-INIT-ARRAY-NO-EP:      Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-INIT-ARRAY-NO-EP-DAG:  (INIT) 0x[[#%x,INIT:]]
+# CHECK-INIT-ARRAY-NO-EP-DAG:  (INIT_ARRAY) 0x[[#%x,INIT_ARRAY:]]
+## Read the dynamic relocation from 1st entry of .init_array
+# CHECK-INIT-ARRAY-NO-EP:      Relocation section '.rela.dyn' at offset {{.*}} contains {{.*}} entries
+# CHECK-INIT-ARRAY-NO-EP:      {{0+}}[[#%x,INIT_ARRAY]] {{.*}} R_AARCH64_RELATIVE [[#%x,INIT_ADDR:]]
+# CHECK-INIT-ARRAY-NO-EP-NOT:  {{0+}}[[#%x,INIT_ARRAY]] {{.*}} R_AARCH64_RELATIVE [[#%x,INIT]]
+## Check that 1st entry of .init_array points to __bolt_runtime_start
+# CHECK-INIT-ARRAY-NO-EP:      Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-INIT-ARRAY-NO-EP-DAG:  {{0+}}[[#%x, EP_ADDR]] {{.*}} _start
+# CHECK-INIT-ARRAY-NO-EP-DAG:  {{[0-9]*}}: {{0+}}[[#%x, INIT_ADDR]] {{.*}} __bolt_runtime_start
+
+## Check that entry point address is set to __bolt_runtime_start for PIE executable without DT_INIT
+# CHECK-NO-INIT-EP:            ELF Header:
+# CHECK-NO-INIT-EP:            Entry point address: 0x[[#%x,EP_ADDR:]]
+## Check that the dynamic relocations at .init and .init_array were not patched
+# CHECK-NO-INIT-EP:            Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-NO-INIT-EP-NOT:        (INIT) 0x[[#%x, EP_ADDR]]
+# CHECK-NO-INIT-EP-NOT:        (INIT_ARRAY) 0x[[#%x, EP_ADDR]]
+## Check that the new entry point address points to __bolt_runtime_start
+# CHECK-NO-INIT-EP:            Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-NO-INIT-EP:            {{0+}}[[#%x, EP_ADDR]] {{.*}} __bolt_runtime_start
+
+## Check that DT_INIT is set to __bolt_runtime_start for shared library with DT_INIT
+# CHECK-SHARED-INIT:           Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-SHARED-INIT-DAG:       (INIT) 0x[[#%x, INIT:]]
+# CHECK-SHARED-INIT-DAG:       (INIT_ARRAY) 0x[[#%x, INIT_ARRAY:]]
+## Check that the dynamic relocation at .init_array was not patched
+# CHECK-SHARED-INIT:           Relocation section '.rela.dyn' at offset {{.*}} contains {{.*}} entries
+# CHECK-SHARED-INIT-NOT:       {{0+}}[[#%x, INIT_ARRAY]] {{.*}} R_AARCH64_ABS64 {{0+}}[[#%x, INIT]]
+## Check that dynamic section DT_INIT points to __bolt_runtime_start
+# CHECK-SHARED-INIT:           Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-SHARED-INIT:           {{0+}}[[#%x, INIT]] {{.*}} __bolt_runtime_start
+
+## Check that the 1st entry of .init_array is set to __bolt_runtime_start for PIE executable without DT_INIT
+# CHECK-NO-INIT-NO-EP:         ELF Header:
+# CHECK-NO-INIT-NO-EP:         Entry point address: 0x[[#%x,EP_ADDR:]]
+# CHECK-NO-INIT-NO-EP:         Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-NO-INIT-NO-EP-NOT:     (INIT)
+# CHECK-NO-INIT-NO-EP:         (INIT_ARRAY) 0x[[#%x,INIT_ARRAY:]]
+## Read the dynamic relocation from 1st entry of .init_array
+# CHECK-NO-INIT-NO-EP:         Relocation section '.rela.dyn' at offset {{.*}} contains {{.*}} entries
+# CHECK-NO-INIT-NO-EP:         {{0+}}[[#%x,INIT_ARRAY]] {{.*}} R_AARCH64_RELATIVE [[#%x,INIT_ADDR:]]
+## Check that 1st entry of .init_array points to __bolt_runtime_start
+# CHECK-NO-INIT-NO-EP:         Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-NO-INIT-NO-EP-DAG:     {{0+}}[[#%x, EP_ADDR]] {{.*}} _start
+# CHECK-NO-INIT-NO-EP-DAG:     {{[0-9]*}}: {{0+}}[[#%x, INIT_ADDR]] {{.*}} __bolt_runtime_start
+
+## Check that the 1st entry of .init_array is set to __bolt_runtime_start for shared library without DT_INIT
+# CHECK-SHARED-NO-INIT:        Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-SHARED-NO-INIT-NOT:    (INIT)
+# CHECK-SHARED-NO-INIT:        (INIT_ARRAY) 0x[[#%x,INIT_ARRAY:]]
+## Read the dynamic relocation from 1st entry of .init_array
+# CHECK-SHARED-NO-INIT:        Relocation section '.rela.dyn' at offset {{.*}} contains {{.*}} entries
+# CHECK-SHARED-NO-INIT:        {{0+}}[[#%x, INIT_ARRAY]] {{.*}} R_AARCH64_ABS64 [[#%x,INIT_ADDR:]]
+## Check that 1st entry of .init_array points to __bolt_runtime_start
+# CHECK-SHARED-NO-INIT:        Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-SHARED-NO-INIT:        {{[0-9]*}}: {{0+}}[[#%x, INIT_ADDR]] {{.*}} __bolt_runtime_start
+
+## Check that entry point address is set to __bolt_runtime_start for non-PIE executable without DT_INIT
+# CHECK-NO-PIE-NO-INIT-EP:     ELF Header:
+# CHECK-NO-PIE-NO-INIT-EP:     Entry point address: 0x[[#%x,EP_ADDR:]]
+## Check that the dynamic relocations at .init and .init_array were not patched
+# CHECK-NO-PIE-NO-INIT-EP:     Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-NO-PIE-NO-INIT-EP-NOT: (INIT) 0x[[#%x, EP_ADDR]]
+# CHECK-NO-PIE-NO-INIT-EP-NOT: (INIT_ARRAY) 0x[[#%x, EP_ADDR]]
+## Check that the new entry point address points to __bolt_runtime_start
+# CHECK-NO-PIE-NO-INIT-EP:     Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-NO-PIE-NO-INIT-EP:     {{0+}}[[#%x, EP_ADDR]] {{.*}} __bolt_runtime_start
+
+  .globl _start
+  .type _start, %function
+_start:
+  # Dummy relocation to force relocation mode.
+  .reloc 0, R_AARCH64_NONE
+  ret
+.size _start, .-_start
+
+  .globl _init
+  .type _init, %function
+_init:
+  ret
+  .size _init, .-_init
+
+  .globl _fini
+  .type _fini, %function
+_fini:
+  ret
+  .size _fini, .-_fini
+
+  .section .init_array,"aw"
+  .align 3
+  .dword _init
+
+  .section .fini_array,"aw"
+  .align 3
+  .dword _fini
diff --git a/bolt/test/AArch64/inline-bti-dbg.s b/bolt/test/AArch64/inline-bti-dbg.s
new file mode 100644
index 0000000000000..a0db4589d39ac
--- /dev/null
+++ b/bolt/test/AArch64/inline-bti-dbg.s
@@ -0,0 +1,40 @@
+# This test checks that for AArch64 binaries with BTI, we do not inline blocks with indirect tailcalls.
+# Same as inline-bti.s, but checks the debug output, and therefore requires assertions.
+
+# REQUIRES: system-linux, assertions
+
+# RUN: llvm-mc -filetype=obj -triple aarch64-unknown-unknown %s -o %t.o
+# RUN: %clang %cflags -O0 %t.o -o %t.exe -Wl,-q -Wl,-z,force-bti
+# RUN: llvm-bolt --inline-all %t.exe -o %t.bolt --debug 2>&1 | FileCheck %s
+
+# For BTI, we should not inline foo.
+# CHECK: BOLT-DEBUG: Skipping inlining block with tailcall in _Z3barP1A : .LBB01 to keep BTIs consistent.
+# CHECK-NOT: BOLT-INFO: inlined {{[0-9]+}} calls at {{[0-9]+}} call sites in {{[0-9]+}} iteration(s). Change in binary size: {{[0-9]+}} bytes.
+
+	.text
+	.globl	_Z3fooP1A
+	.type	_Z3fooP1A, at function
+_Z3fooP1A:
+	ldr	x8, [x0]
+	ldr	w0, [x8]
+	br x30
+	.size	_Z3fooP1A, .-_Z3fooP1A
+
+	.globl	_Z3barP1A
+	.type	_Z3barP1A, at function
+_Z3barP1A:
+	stp	x29, x30, [sp, #-16]!
+	mov	x29, sp
+	bl	_Z3fooP1A
+	mul	w0, w0, w0
+	ldp	x29, x30, [sp], #16
+	ret
+	.size	_Z3barP1A, .-_Z3barP1A
+
+	.globl	main
+	.p2align	2
+	.type	main, at function
+main:
+	mov	w0, wzr
+	ret
+	.size	main, .-main
diff --git a/bolt/test/AArch64/inline-bti.s b/bolt/test/AArch64/inline-bti.s
new file mode 100644
index 0000000000000..62f6ea6f4b63a
--- /dev/null
+++ b/bolt/test/AArch64/inline-bti.s
@@ -0,0 +1,38 @@
+## This test checks that for AArch64 binaries with BTI, we do not inline blocks with indirect tailcalls.
+
+# REQUIRES: system-linux
+
+# RUN: llvm-mc -filetype=obj -triple aarch64-unknown-unknown %s -o %t.o
+# RUN: %clang %cflags -O0 %t.o -o %t.exe -Wl,-q -Wl,-z,force-bti
+# RUN: llvm-bolt --inline-all %t.exe -o %t.bolt  | FileCheck %s
+
+# For BTI, we should not inline foo.
+# CHECK-NOT: BOLT-INFO: inlined {{[0-9]+}} calls at {{[0-9]+}} call sites in {{[0-9]+}} iteration(s). Change in binary size: {{[0-9]+}} bytes.
+
+	.text
+	.globl	_Z3fooP1A
+	.type	_Z3fooP1A, at function
+_Z3fooP1A:
+	ldr	x8, [x0]
+	ldr	w0, [x8]
+	br x30
+	.size	_Z3fooP1A, .-_Z3fooP1A
+
+	.globl	_Z3barP1A
+	.type	_Z3barP1A, at function
+_Z3barP1A:
+	stp	x29, x30, [sp, #-16]!
+	mov	x29, sp
+	bl	_Z3fooP1A
+	mul	w0, w0, w0
+	ldp	x29, x30, [sp], #16
+	ret
+	.size	_Z3barP1A, .-_Z3barP1A
+
+	.globl	main
+	.p2align	2
+	.type	main, at function
+main:
+	mov	w0, wzr
+	ret
+	.size	main, .-main
diff --git a/bolt/test/AArch64/instrument-no-fini.s b/bolt/test/AArch64/instrument-no-fini.s
new file mode 100644
index 0000000000000..526ce11080f4f
--- /dev/null
+++ b/bolt/test/AArch64/instrument-no-fini.s
@@ -0,0 +1,34 @@
+# Test that BOLT produces an error by default and passes with the instrumentation-sleep-time option
+
+# REQUIRES: system-linux,bolt-runtime,target=aarch64{{.*}}
+
+# RUN: llvm-mc -triple aarch64 -filetype=obj %s -o %t.o
+# RUN: ld.lld -q -pie -o %t.exe %t.o
+# RUN: llvm-readelf -d %t.exe | FileCheck --check-prefix=CHECK-NO-FINI %s
+# RUN: not llvm-bolt --instrument -o %t.out %t.exe 2>&1 | FileCheck %s --check-prefix=CHECK-BOLT-FAIL
+# RUN: llvm-bolt --instrument --instrumentation-sleep-time=1 -o %t.out %t.exe 2>&1 | FileCheck %s --check-prefix=CHECK-BOLT-PASS
+
+# CHECK-NO-FINI: INIT
+# CHECK-NO-FINI-NOT: FINI
+# CHECK-NO-FINI-NOT: FINI_ARRAY
+
+# CHECK-BOLT-FAIL: Instrumentation needs either DT_FINI or DT_FINI_ARRAY
+
+# CHECK-BOLT-PASS-NOT: Instrumentation needs either DT_FINI or DT_FINI_ARRAY
+# CHECK-BOLT-PASS: runtime library initialization was hooked via DT_INIT
+
+    .text
+    .globl _start
+    .type _start, %function
+_start:
+    # BOLT errs when instrumenting without relocations; create a dummy one.
+    .reloc 0, R_AARCH64_NONE
+    ret
+    .size _start, .-_start
+
+    .globl _init
+    .type _init, %function
+    # Force DT_INIT to be created (needed for instrumentation).
+_init:
+    ret
+    .size _init, .-_init
diff --git a/bolt/test/X86/hook-init.s b/bolt/test/X86/hook-init.s
new file mode 100644
index 0000000000000..3184541f040b9
--- /dev/null
+++ b/bolt/test/X86/hook-init.s
@@ -0,0 +1,221 @@
+## Test the different ways of hooking the init function for instrumentation (via
+## the entry point, via DT_INIT, and via DT_INIT_ARRAY). The latter is tested for
+## both PIE and non-PIE binaries, and for both executables and shared libraries,
+## because relocations are handled differently (static or dynamic).
+## All tests perform the following steps:
+## - Compile and link for the case to be tested
+## - Run sanity checks on the dynamic section and relocations to verify the
+##   binary has the shape we want for testing:
+##   - INTERP in Program Headers
+##   - DT_INIT or DT_INIT_ARRAY in dynamic section
+##   - No relative relocations for non-PIE
+## - Instrument (with extra --runtime-lib-init-hook=init/init_array options
+##   in some cases)
+## - Verify generated binary
+# REQUIRES: system-linux,bolt-runtime,target=x86_64-{{.*}}
+
+# RUN: %clang %cflags -pie %s -Wl,-q -o %t.exe
+# RUN: llvm-readelf -d %t.exe | FileCheck --check-prefix=DYN-INIT %s
+# RUN: llvm-readelf -l %t.exe | FileCheck --check-prefix=PH-INTERP %s
+# RUN: llvm-readelf -r %t.exe | FileCheck --check-prefix=RELOC-PIE %s
+# RUN: llvm-bolt %t.exe -o %t --instrument | FileCheck --check-prefix=CHECK-BOLT-RT-EP %s
+# RUN: llvm-readelf -hdrs %t | FileCheck --check-prefix=CHECK-INIT-EP %s
+# RUN: llvm-bolt %t.exe -o %t-no-ep --instrument --runtime-lib-init-hook=init | FileCheck --check-prefix=CHECK-BOLT-RT-INIT %s
+# RUN: llvm-readelf -hdrs %t-no-ep | FileCheck --check-prefix=CHECK-INIT-NO-EP %s
+# RUN: llvm-bolt %t.exe -o %t-no-ep --instrument --runtime-lib-init-hook=init_array | FileCheck --check-prefix=CHECK-BOLT-RT-INIT-ARRAY %s
+# RUN: llvm-readelf -hdrs %t-no-ep | FileCheck --check-prefix=CHECK-INIT-ARRAY-NO-EP %s
+
+# RUN: %clang -shared %cflags -pie %s -Wl,-q -o %t-shared.exe
+# RUN: llvm-readelf -d %t-shared.exe | FileCheck --check-prefix=DYN-INIT %s
+# RUN: llvm-readelf -l %t-shared.exe | FileCheck --check-prefix=PH-INTERP-SHARED %s
+# RUN: llvm-readelf -r %t-shared.exe | FileCheck --check-prefix=RELOC-SHARED-PIE %s
+# RUN: llvm-bolt %t-shared.exe -o %t-shared --instrument | FileCheck --check-prefix=CHECK-BOLT-RT-INIT %s
+# RUN: llvm-readelf -hdrs %t-shared | FileCheck --check-prefix=CHECK-SHARED-INIT %s
+
+# RUN: %clang %cflags -pie %s -Wl,-q,-init=0 -o %t-no-init.exe
+# RUN: llvm-readelf -d %t-no-init.exe | FileCheck --check-prefix=DYN-NO-INIT %s
+# RUN: llvm-readelf -l %t-no-init.exe | FileCheck --check-prefix=PH-INTERP %s
+# RUN: llvm-readelf -r %t-no-init.exe | FileCheck --check-prefix=RELOC-PIE %s
+# RUN: llvm-bolt %t-no-init.exe -o %t-no-init --instrument | FileCheck --check-prefix=CHECK-BOLT-RT-EP %s
+# RUN: llvm-readelf -hdrs %t-no-init | FileCheck --check-prefix=CHECK-NO-INIT-EP %s
+# RUN: llvm-bolt %t-no-init.exe -o %t-no-init-no-ep --instrument --runtime-lib-init-hook=init | FileCheck --check-prefix=CHECK-BOLT-RT-INIT-ARRAY %s
+# RUN: llvm-readelf -hdrs %t-no-init-no-ep | FileCheck --check-prefix=CHECK-NO-INIT-NO-EP %s
+
+# RUN: %clang -shared %cflags -pie %s -Wl,-q,-init=0 -o %t-shared-no-init.exe
+# RUN: llvm-readelf -d %t-shared-no-init.exe | FileCheck --check-prefix=DYN-NO-INIT %s
+# RUN: llvm-readelf -l %t-shared-no-init.exe | FileCheck --check-prefix=PH-INTERP-SHARED %s
+# RUN: llvm-readelf -r %t-shared-no-init.exe | FileCheck --check-prefix=RELOC-SHARED-PIE %s
+# RUN: llvm-bolt %t-shared-no-init.exe -o %t-shared-no-init --instrument | FileCheck --check-prefix=CHECK-BOLT-RT-INIT-ARRAY %s
+# RUN: llvm-readelf -drs %t-shared-no-init | FileCheck --check-prefix=CHECK-SHARED-NO-INIT %s
+
+## Create a dummy shared library to link against to force creation of the dynamic section.
+# RUN: %clang %cflags %p/../Inputs/stub.c -fPIC -shared -o %t-stub.so
+# RUN: %clang %cflags %s -no-pie -Wl,-q,-init=0 %t-stub.so -o %t-no-pie-no-init.exe
+# RUN: llvm-readelf -r %t-no-pie-no-init.exe | FileCheck --check-prefix=RELOC-NO-PIE %s
+# RUN: llvm-bolt %t-no-pie-no-init.exe -o %t-no-pie-no-init --instrument | FileCheck --check-prefix=CHECK-BOLT-RT-EP %s
+# RUN: llvm-readelf -hds %t-no-pie-no-init | FileCheck --check-prefix=CHECK-NO-PIE-NO-INIT-EP %s
+
+## With init: dynamic section should contain DT_INIT
+# DYN-INIT: (INIT)
+
+## Without init: dynamic section should only contain DT_INIT_ARRAY
+# DYN-NO-INIT-NOT: (INIT)
+# DYN-NO-INIT:     (INIT_ARRAY)
+# DYN-NO-INIT:     (INIT_ARRAYSZ)
+
+## With interp program header (executable)
+# PH-INTERP: Program Headers:
+# PH-INTERP: INTERP
+
+## Without interp program header (shared library)
+# PH-INTERP-SHARED:     Program Headers:
+# PH-INTERP-SHARED-NOT: INTERP
+
+## With PIE: binary should have relative relocations
+# RELOC-PIE: R_X86_64_RELATIVE
+
+## Shared PIE: binary should have dynamic R_X86_64_64 relocations
+# RELOC-SHARED-PIE: R_X86_64_64
+
+## Without PIE: binary should not have relative relocations
+# RELOC-NO-PIE-NOT: R_X86_64_RELATIVE
+
+## Check BOLT output for the initialization hook (ELF Header Entry Point)
+# CHECK-BOLT-RT-EP: runtime library initialization was hooked via ELF Header Entry Point
+# CHECK-BOLT-RT-EP-NOT: runtime library initialization was hooked via DT_INIT
+# CHECK-BOLT-RT-EP-NOT: runtime library initialization was hooked via .init_array entry
+
+## Check BOLT output for the initialization hook (DT_INIT)
+# CHECK-BOLT-RT-INIT-NOT: runtime library initialization was hooked via ELF Header Entry Point
+# CHECK-BOLT-RT-INIT: runtime library initialization was hooked via DT_INIT
+# CHECK-BOLT-RT-INIT-NOT: runtime library initialization was hooked via .init_array entry
+
+## Check BOLT output for the initialization hook (1st entry of .init_array)
+# CHECK-BOLT-RT-INIT-ARRAY-NOT: runtime library initialization was hooked via ELF Header Entry Point
+# CHECK-BOLT-RT-INIT-ARRAY-NOT: runtime library initialization was hooked via DT_INIT
+# CHECK-BOLT-RT-INIT-ARRAY: runtime library initialization was hooked via .init_array entry
+
+## Check that entry point address is set to __bolt_runtime_start for PIE executable with DT_INIT
+# CHECK-INIT-EP:               ELF Header:
+# CHECK-INIT-EP:               Entry point address: 0x[[#%x,EP_ADDR:]]
+## Check that the dynamic entries for .init and .init_array were not patched
+# CHECK-INIT-EP:               Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-INIT-EP-NOT:           (INIT) 0x[[#%x, EP_ADDR]]
+# CHECK-INIT-EP-NOT:           (INIT_ARRAY) 0x[[#%x, EP_ADDR]]
+## Check that the new entry point address points to __bolt_runtime_start
+# CHECK-INIT-EP:               Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-INIT-EP:               {{0+}}[[#%x, EP_ADDR]] {{.*}} __bolt_runtime_start
+
+## Check that DT_INIT address is set to __bolt_runtime_start for PIE executable with DT_INIT
+# CHECK-INIT-NO-EP:            ELF Header:
+# CHECK-INIT-NO-EP:            Entry point address: 0x[[#%x,EP_ADDR:]]
+## Read Dynamic section DT_INIT and DT_INIT_ARRAY entries
+# CHECK-INIT-NO-EP:            Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-INIT-NO-EP-DAG:        (INIT) 0x[[#%x,INIT:]]
+# CHECK-INIT-NO-EP-DAG:        (INIT_ARRAY) 0x[[#%x,INIT_ARRAY:]]
+## Check if ELF entry point address points to _start symbol and new DT_INIT entry points to __bolt_runtime_start
+# CHECK-INIT-NO-EP:            Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-INIT-NO-EP-DAG:        {{0+}}[[#%x, EP_ADDR]] {{.*}} _start
+# CHECK-INIT-NO-EP-DAG:        {{0+}}[[#%x, INIT]] {{.*}} __bolt_runtime_start
+
+## Check that 1st entry of DT_INIT_ARRAY is set to __bolt_runtime_start and DT_INIT was not changed
+# CHECK-INIT-ARRAY-NO-EP:      ELF Header:
+# CHECK-INIT-ARRAY-NO-EP:      Entry point address: 0x[[#%x,EP_ADDR:]]
+## Read Dynamic section DT_INIT and DT_INIT_ARRAY entries
+# CHECK-INIT-ARRAY-NO-EP:      Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-INIT-ARRAY-NO-EP-DAG:  (INIT) 0x[[#%x,INIT:]]
+# CHECK-INIT-ARRAY-NO-EP-DAG:  (INIT_ARRAY) 0x[[#%x,INIT_ARRAY:]]
+## Read the dynamic relocation from 1st entry of .init_array
+# CHECK-INIT-ARRAY-NO-EP:      Relocation section '.rela.dyn' at offset {{.*}} contains {{.*}} entries
+# CHECK-INIT-ARRAY-NO-EP:      {{0+}}[[#%x,INIT_ARRAY]] {{.*}} R_X86_64_RELATIVE [[#%x,INIT_ADDR:]]
+# CHECK-INIT-ARRAY-NO-EP-NOT:  {{0+}}[[#%x,INIT_ARRAY]] {{.*}} R_X86_64_RELATIVE [[#%x,INIT]]
+## Check that 1st entry of .init_array points to __bolt_runtime_start
+# CHECK-INIT-ARRAY-NO-EP:      Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-INIT-ARRAY-NO-EP-DAG:  {{0+}}[[#%x, EP_ADDR]] {{.*}} _start
+# CHECK-INIT-ARRAY-NO-EP-DAG:  {{[0-9]*}}: {{0+}}[[#%x, INIT_ADDR]] {{.*}} __bolt_runtime_start
+
+## Check that entry point address is set to __bolt_runtime_start for PIE executable without DT_INIT
+# CHECK-NO-INIT-EP:            ELF Header:
+# CHECK-NO-INIT-EP:            Entry point address: 0x[[#%x,EP_ADDR:]]
+## Check that the dynamic entries for .init and .init_array were not patched
+# CHECK-NO-INIT-EP:            Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-NO-INIT-EP-NOT:        (INIT) 0x[[#%x, EP_ADDR]]
+# CHECK-NO-INIT-EP-NOT:        (INIT_ARRAY) 0x[[#%x, EP_ADDR]]
+## Check that the new entry point address points to __bolt_runtime_start
+# CHECK-NO-INIT-EP:            Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-NO-INIT-EP:            {{0+}}[[#%x, EP_ADDR]] {{.*}} __bolt_runtime_start
+
+## Check that DT_INIT is set to __bolt_runtime_start for shared library with DT_INIT
+# CHECK-SHARED-INIT:           Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-SHARED-INIT-DAG:       (INIT) 0x[[#%x, INIT:]]
+# CHECK-SHARED-INIT-DAG:       (INIT_ARRAY) 0x[[#%x, INIT_ARRAY:]]
+## Check that the dynamic relocation at .init_array was not patched
+# CHECK-SHARED-INIT:           Relocation section '.rela.dyn' at offset {{.*}} contains {{.*}} entries
+# CHECK-SHARED-INIT-NOT:       {{0+}}[[#%x, INIT_ARRAY]] {{.*}} R_X86_64_64 {{0+}}[[#%x, INIT]]
+## Check that dynamic section DT_INIT points to __bolt_runtime_start
+# CHECK-SHARED-INIT:           Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-SHARED-INIT:           {{0+}}[[#%x, INIT]] {{.*}} __bolt_runtime_start
+
+## Check that the 1st entry of .init_array is set to __bolt_runtime_start for PIE executable without DT_INIT
+# CHECK-NO-INIT-NO-EP:         ELF Header:
+# CHECK-NO-INIT-NO-EP:         Entry point address: 0x[[#%x,EP_ADDR:]]
+# CHECK-NO-INIT-NO-EP:         Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-NO-INIT-NO-EP-NOT:     (INIT)
+# CHECK-NO-INIT-NO-EP:         (INIT_ARRAY) 0x[[#%x,INIT_ARRAY:]]
+## Read the dynamic relocation from 1st entry of .init_array
+# CHECK-NO-INIT-NO-EP:         Relocation section '.rela.dyn' at offset {{.*}} contains {{.*}} entries
+# CHECK-NO-INIT-NO-EP:         {{0+}}[[#%x,INIT_ARRAY]] {{.*}} R_X86_64_RELATIVE [[#%x,INIT_ADDR:]]
+## Check that 1st entry of .init_array points to __bolt_runtime_start
+# CHECK-NO-INIT-NO-EP:         Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-NO-INIT-NO-EP-DAG:     {{0+}}[[#%x, EP_ADDR]] {{.*}} _start
+# CHECK-NO-INIT-NO-EP-DAG:     {{[0-9]*}}: {{0+}}[[#%x, INIT_ADDR]] {{.*}} __bolt_runtime_start
+
+## Check that the 1st entry of .init_array is set to __bolt_runtime_start for shared library without DT_INIT
+# CHECK-SHARED-NO-INIT:        Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-SHARED-NO-INIT-NOT:    (INIT)
+# CHECK-SHARED-NO-INIT:        (INIT_ARRAY) 0x[[#%x,INIT_ARRAY:]]
+## Read the dynamic relocation from 1st entry of .init_array
+# CHECK-SHARED-NO-INIT:        Relocation section '.rela.dyn' at offset {{.*}} contains {{.*}} entries
+# CHECK-SHARED-NO-INIT:        {{0+}}[[#%x, INIT_ARRAY]] {{.*}} R_X86_64_64 [[#%x,INIT_ADDR:]]
+## Check that 1st entry of .init_array points to __bolt_runtime_start
+# CHECK-SHARED-NO-INIT:        Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-SHARED-NO-INIT:        {{[0-9]*}}: {{0+}}[[#%x, INIT_ADDR]] {{.*}} __bolt_runtime_start
+
+## Check that entry point address is set to __bolt_runtime_start for non-PIE executable without DT_INIT
+# CHECK-NO-PIE-NO-INIT-EP:     ELF Header:
+# CHECK-NO-PIE-NO-INIT-EP:     Entry point address: 0x[[#%x,EP_ADDR:]]
+## Check that the dynamic entries for .init and .init_array were not patched
+# CHECK-NO-PIE-NO-INIT-EP:     Dynamic section at offset {{.*}} contains {{.*}} entries:
+# CHECK-NO-PIE-NO-INIT-EP-NOT: (INIT) 0x[[#%x, EP_ADDR]]
+# CHECK-NO-PIE-NO-INIT-EP-NOT: (INIT_ARRAY) 0x[[#%x, EP_ADDR]]
+## Check that the new entry point address points to __bolt_runtime_start
+# CHECK-NO-PIE-NO-INIT-EP:     Symbol table '.symtab' contains {{.*}} entries:
+# CHECK-NO-PIE-NO-INIT-EP:     {{0+}}[[#%x, EP_ADDR]] {{.*}} __bolt_runtime_start
+
+  .globl _start
+  .type _start, %function
+_start:
+  # Dummy relocation to force relocation mode.
+  .reloc 0, R_X86_64_NONE
+  retq
+.size _start, .-_start
+
+  .globl _init
+  .type _init, %function
+_init:
+  retq
+  .size _init, .-_init
+
+  .globl _fini
+  .type _fini, %function
+_fini:
+  retq
+  .size _fini, .-_fini
+
+  .section .init_array,"aw"
+  .align 8
+  .quad _init
+
+  .section .fini_array,"aw"
+  .align 8
+  .quad _fini
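
As a side note on the hook points that hook-init.s above exercises: the loader and libc startup run DT_INIT and every slot of DT_INIT_ARRAY before main, while the ELF entry point is where execution begins, so the BOLT runtime can be started from whichever of the three the binary provides. A minimal standalone sketch of what an .init_array slot is at the source level (not part of this patch; the compiler places the constructor-attribute function into .init_array):

// init_hooks.cpp -- hypothetical standalone example, not part of the patch.
// An .init_array slot is exactly the kind of entry the test patches to
// __bolt_runtime_start when no DT_INIT is available.
#include <cstdio>

// The constructor attribute places a pointer to this function in .init_array,
// so it runs before main().
__attribute__((constructor)) static void init_array_slot() {
  std::puts("called from .init_array");
}

int main() {
  std::puts("main");
  return 0;
}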
diff --git a/bolt/test/X86/instrument-no-fini.s b/bolt/test/X86/instrument-no-fini.s
new file mode 100644
index 0000000000000..fff23761d1499
--- /dev/null
+++ b/bolt/test/X86/instrument-no-fini.s
@@ -0,0 +1,34 @@
+# Test that BOLT produces an error by default and passes with the instrumentation-sleep-time option
+
+# REQUIRES: system-linux,bolt-runtime,target=x86_64-{{.*}}
+
+# RUN: llvm-mc -triple x86_64 -filetype=obj %s -o %t.o
+# RUN: ld.lld -q -pie -o %t.exe %t.o
+# RUN: llvm-readelf -d %t.exe | FileCheck --check-prefix=CHECK-NO-FINI %s
+# RUN: not llvm-bolt --instrument -o %t.out %t.exe 2>&1 | FileCheck %s --check-prefix=CHECK-BOLT-FAIL
+# RUN: llvm-bolt --instrument --instrumentation-sleep-time=1 -o %t.out %t.exe 2>&1 | FileCheck %s --check-prefix=CHECK-BOLT-PASS
+
+# CHECK-NO-FINI: INIT
+# CHECK-NO-FINI-NOT: FINI
+# CHECK-NO-FINI-NOT: FINI_ARRAY
+
+# CHECK-BOLT-FAIL: Instrumentation needs either DT_FINI or DT_FINI_ARRAY
+
+# CHECK-BOLT-PASS-NOT: Instrumentation needs either DT_FINI or DT_FINI_ARRAY
+# CHECK-BOLT-PASS: runtime library initialization was hooked via DT_INIT
+
+    .text
+    .globl _start
+    .type _start, %function
+_start:
+    # BOLT errs when instrumenting without relocations; create a dummy one.
+    .reloc 0, R_X86_64_NONE
+    retq
+    .size _start, .-_start
+
+    .globl _init
+    .type _init, %function
+    # Force DT_INIT to be created (needed for instrumentation).
+_init:
+    retq
+    .size _init, .-_init
diff --git a/bolt/test/X86/internal-call-instrument-so.s b/bolt/test/X86/internal-call-instrument-so.s
index 99e5b29221409..fe23bc61afa32 100644
--- a/bolt/test/X86/internal-call-instrument-so.s
+++ b/bolt/test/X86/internal-call-instrument-so.s
@@ -5,7 +5,7 @@
 # RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s -o %t.o
 # Delete our BB symbols so BOLT doesn't mark them as entry points
 # RUN: llvm-strip --strip-unneeded %t.o
-# RUN: ld.lld %t.o -o %t.exe -q -shared -fini=_fini
+# RUN: ld.lld %t.o -o %t.exe -q -shared -fini=_fini -init=_init
 # RUN: llvm-bolt --instrument %t.exe --relocs -o %t.out
 
   .text
@@ -48,6 +48,13 @@ _fini:
   hlt
   .size _fini, .-_fini
 
+  .globl  _init
+  .type _init, %function
+  .p2align  4
+_init:
+  retq
+  .size _init, .-_init
+
   .data
   .globl var
 var:
diff --git a/bolt/test/runtime/X86/instrument-wrong-target.s b/bolt/test/runtime/X86/instrument-wrong-target.s
index 343d93a89ed13..fa40d43f10a0f 100644
--- a/bolt/test/runtime/X86/instrument-wrong-target.s
+++ b/bolt/test/runtime/X86/instrument-wrong-target.s
@@ -19,6 +19,13 @@ _start:
     ret
     .size _start, .-_start
 
+    .globl _init
+    .type _init, %function
+    # Force DT_INIT to be created (needed for instrumentation).
+_init:
+    ret
+    .size _init, .-_init
+
     .globl _fini
     .type _fini, %function
     # Force DT_FINI to be created (needed for instrumentation).
diff --git a/bolt/unittests/CMakeLists.txt b/bolt/unittests/CMakeLists.txt
index 64414b83d39fe..d47ddc46b7388 100644
--- a/bolt/unittests/CMakeLists.txt
+++ b/bolt/unittests/CMakeLists.txt
@@ -7,3 +7,4 @@ endfunction()
 
 add_subdirectory(Core)
 add_subdirectory(Profile)
+add_subdirectory(Passes)
diff --git a/bolt/unittests/Passes/CMakeLists.txt b/bolt/unittests/Passes/CMakeLists.txt
new file mode 100644
index 0000000000000..3dc578adeb357
--- /dev/null
+++ b/bolt/unittests/Passes/CMakeLists.txt
@@ -0,0 +1,30 @@
+set(LLVM_LINK_COMPONENTS
+  DebugInfoDWARF
+  Object
+  MC
+  ${BOLT_TARGETS_TO_BUILD}
+  )
+
+add_bolt_unittest(PassTests
+  InsertNegateRAState.cpp
+
+  DISABLE_LLVM_LINK_LLVM_DYLIB
+  )
+
+target_link_libraries(PassTests
+  PRIVATE
+  LLVMBOLTCore
+  LLVMBOLTRewrite
+  LLVMBOLTPasses
+  LLVMBOLTProfile
+  LLVMBOLTUtils
+  )
+
+foreach (tgt ${BOLT_TARGETS_TO_BUILD})
+  include_directories(
+    ${LLVM_MAIN_SRC_DIR}/lib/Target/${tgt}
+    ${LLVM_BINARY_DIR}/lib/Target/${tgt}
+  )
+  string(TOUPPER "${tgt}" upper)
+  target_compile_definitions(PassTests PRIVATE "${upper}_AVAILABLE")
+endforeach()
diff --git a/bolt/unittests/Passes/InsertNegateRAState.cpp b/bolt/unittests/Passes/InsertNegateRAState.cpp
new file mode 100644
index 0000000000000..2ef78d381e570
--- /dev/null
+++ b/bolt/unittests/Passes/InsertNegateRAState.cpp
@@ -0,0 +1,333 @@
+//===- bolt/unittests/Passes/InsertNegateRAState.cpp ---------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifdef AARCH64_AVAILABLE
+#include "AArch64Subtarget.h"
+#include "MCTargetDesc/AArch64MCTargetDesc.h"
+#endif // AARCH64_AVAILABLE
+
+#include "bolt/Core/BinaryBasicBlock.h"
+#include "bolt/Core/BinaryFunction.h"
+#include "bolt/Passes/InsertNegateRAStatePass.h"
+#include "bolt/Rewrite/BinaryPassManager.h"
+#include "bolt/Rewrite/RewriteInstance.h"
+#include "llvm/BinaryFormat/ELF.h"
+#include "llvm/MC/MCDwarf.h"
+#include "llvm/MC/MCInstBuilder.h"
+#include "llvm/Support/TargetSelect.h"
+#include "gtest/gtest.h"
+
+using namespace llvm;
+using namespace llvm::object;
+using namespace llvm::ELF;
+using namespace bolt;
+
+namespace {
+struct PassTester : public testing::TestWithParam<Triple::ArchType> {
+  void SetUp() override {
+    initializeLLVM();
+    prepareElf();
+    initializeBolt();
+  }
+
+protected:
+  void initializeLLVM() {
+#define BOLT_TARGET(target)                                                    \
+  LLVMInitialize##target##TargetInfo();                                        \
+  LLVMInitialize##target##TargetMC();                                          \
+  LLVMInitialize##target##AsmParser();                                         \
+  LLVMInitialize##target##Disassembler();                                      \
+  LLVMInitialize##target##Target();                                            \
+  LLVMInitialize##target##AsmPrinter();
+
+#include "bolt/Core/TargetConfig.def"
+  }
+
+#define PREPARE_FUNC(name)                                                     \
+  constexpr uint64_t FunctionAddress = 0x1000;                                 \
+  BinaryFunction *BF = BC->createBinaryFunction(                               \
+      name, *TextSection, FunctionAddress, /*Size=*/0, /*SymbolSize=*/0,       \
+      /*Alignment=*/16);                                                       \
+  /* Make sure the pass runs on the BF.*/                                      \
+  BF->updateState(BinaryFunction::State::CFG);                                 \
+  BF->setContainedNegateRAState();                                             \
+  /* All tests need at least one BB. */                                        \
+  BinaryBasicBlock *BB = BF->addBasicBlock();                                  \
+  BF->addEntryPoint(*BB);                                                      \
+  BB->setCFIState(0);
+
+  void prepareElf() {
+    memcpy(ElfBuf, "\177ELF", 4);
+    ELF64LE::Ehdr *EHdr = reinterpret_cast<typename ELF64LE::Ehdr *>(ElfBuf);
+    EHdr->e_ident[llvm::ELF::EI_CLASS] = llvm::ELF::ELFCLASS64;
+    EHdr->e_ident[llvm::ELF::EI_DATA] = llvm::ELF::ELFDATA2LSB;
+    EHdr->e_machine = GetParam() == Triple::aarch64 ? EM_AARCH64 : EM_X86_64;
+    MemoryBufferRef Source(StringRef(ElfBuf, sizeof(ElfBuf)), "ELF");
+    ObjFile = cantFail(ObjectFile::createObjectFile(Source));
+  }
+  void initializeBolt() {
+    Relocation::Arch = ObjFile->makeTriple().getArch();
+    BC = cantFail(BinaryContext::createBinaryContext(
+        ObjFile->makeTriple(), std::make_shared<orc::SymbolStringPool>(),
+        ObjFile->getFileName(), nullptr, true, DWARFContext::create(*ObjFile),
+        {llvm::outs(), llvm::errs()}));
+    ASSERT_FALSE(!BC);
+    BC->initializeTarget(std::unique_ptr<MCPlusBuilder>(
+        createMCPlusBuilder(GetParam(), BC->MIA.get(), BC->MII.get(),
+                            BC->MRI.get(), BC->STI.get())));
+
+    PassManager = std::make_unique<BinaryFunctionPassManager>(*BC);
+    PassManager->registerPass(std::make_unique<InsertNegateRAState>());
+
+    TextSection = &BC->registerOrUpdateSection(
+        ".text", ELF::SHT_PROGBITS, ELF::SHF_ALLOC | ELF::SHF_EXECINSTR,
+        /*Data=*/nullptr, /*Size=*/0,
+        /*Alignment=*/16);
+  }
+
+  std::vector<int> findCFIOffsets(BinaryFunction &BF) {
+    std::vector<int> Locations;
+    int Idx = 0;
+    int InstSize = 4; // AArch64
+    for (BinaryBasicBlock &BB : BF) {
+      for (MCInst &Inst : BB) {
+        if (BC->MIB->isCFI(Inst)) {
+          const MCCFIInstruction *CFI = BF.getCFIFor(Inst);
+          if (CFI->getOperation() == MCCFIInstruction::OpNegateRAState)
+            Locations.push_back(Idx * InstSize);
+        }
+        Idx++;
+      }
+    }
+    return Locations;
+  }
+
+  char ElfBuf[sizeof(typename ELF64LE::Ehdr)] = {};
+  std::unique_ptr<ObjectFile> ObjFile;
+  std::unique_ptr<BinaryContext> BC;
+  std::unique_ptr<BinaryFunctionPassManager> PassManager;
+  BinarySection *TextSection;
+};
+} // namespace
+
+TEST_P(PassTester, ExampleTest) {
+  if (GetParam() != Triple::aarch64)
+    GTEST_SKIP();
+
+  ASSERT_NE(TextSection, nullptr);
+
+  PREPARE_FUNC("ExampleFunction");
+
+  MCInst UnsignedInst = MCInstBuilder(AArch64::ADDSXri)
+                            .addReg(AArch64::X0)
+                            .addReg(AArch64::X0)
+                            .addImm(0)
+                            .addImm(0);
+  BC->MIB->setRAState(UnsignedInst, false);
+  BB->addInstruction(UnsignedInst);
+
+  MCInst SignedInst = MCInstBuilder(AArch64::ADDSXri)
+                          .addReg(AArch64::X0)
+                          .addReg(AArch64::X0)
+                          .addImm(1)
+                          .addImm(0);
+  BC->MIB->setRAState(SignedInst, true);
+  BB->addInstruction(SignedInst);
+
+  Error E = PassManager->runPasses();
+  EXPECT_FALSE(E);
+
+  /* Expected layout of BF after the pass:
+
+   .LBB0 (3 instructions, align : 1)
+      Entry Point
+      CFI State : 0
+        00000000:   adds    x0, x0, #0x0
+        00000004:   !CFI    $0      ; OpNegateRAState
+        00000004:   adds    x0, x0, #0x1
+      CFI State: 0
+   */
+  auto CFILoc = findCFIOffsets(*BF);
+  EXPECT_EQ(CFILoc.size(), 1u);
+  EXPECT_EQ(CFILoc[0], 4);
+}
+
+TEST_P(PassTester, fillUnknownStateInBBTest) {
+  /* Check that if a BB starts with an unknown RAState, we can fill the
+   unknown states based on following instructions with known RAStates.
+   *
+   * .LBB0 (1 instructions, align : 1)
+        Entry Point
+        CFI State : 0
+          00000000:   adds    x0, x0, #0x0
+        CFI State: 0
+
+     .LBB1 (4 instructions, align : 1)
+        CFI State : 0
+          00000004:   !CFI    $0      ; OpNegateRAState
+          00000004:   adds    x0, x0, #0x1
+          00000008:   adds    x0, x0, #0x2
+          0000000c:   adds    x0, x0, #0x3
+        CFI State: 0
+   */
+  if (GetParam() != Triple::aarch64)
+    GTEST_SKIP();
+
+  ASSERT_NE(TextSection, nullptr);
+
+  PREPARE_FUNC("FuncWithUnknownStateInBB");
+  BinaryBasicBlock *BB2 = BF->addBasicBlock();
+  BB2->setCFIState(0);
+
+  MCInst Unsigned = MCInstBuilder(AArch64::ADDSXri)
+                        .addReg(AArch64::X0)
+                        .addReg(AArch64::X0)
+                        .addImm(0)
+                        .addImm(0);
+  BC->MIB->setRAState(Unsigned, false);
+  BB->addInstruction(Unsigned);
+
+  MCInst Unknown = MCInstBuilder(AArch64::ADDSXri)
+                       .addReg(AArch64::X0)
+                       .addReg(AArch64::X0)
+                       .addImm(1)
+                       .addImm(0);
+  MCInst Unknown1 = MCInstBuilder(AArch64::ADDSXri)
+                        .addReg(AArch64::X0)
+                        .addReg(AArch64::X0)
+                        .addImm(2)
+                        .addImm(0);
+  MCInst Signed = MCInstBuilder(AArch64::ADDSXri)
+                      .addReg(AArch64::X0)
+                      .addReg(AArch64::X0)
+                      .addImm(3)
+                      .addImm(0);
+  BC->MIB->setRAState(Signed, true);
+  BB2->addInstruction(Unknown);
+  BB2->addInstruction(Unknown1);
+  BB2->addInstruction(Signed);
+
+  Error E = PassManager->runPasses();
+  EXPECT_FALSE(E);
+
+  auto CFILoc = findCFIOffsets(*BF);
+  EXPECT_EQ(CFILoc.size(), 1u);
+  EXPECT_EQ(CFILoc[0], 4);
+  // Check that the pass set Unknown and Unknown1 to signed.
+  // begin() is the CFI, begin() + 1 is Unknown, begin() + 2 is Unknown1.
+  std::optional<bool> RAState = BC->MIB->getRAState(*(BB2->begin() + 1));
+  EXPECT_TRUE(RAState.has_value());
+  EXPECT_TRUE(*RAState);
+  std::optional<bool> RAState1 = BC->MIB->getRAState(*(BB2->begin() + 2));
+  EXPECT_TRUE(RAState1.has_value());
+  EXPECT_TRUE(*RAState1);
+}
+
+TEST_P(PassTester, fillUnknownStubs) {
+  /*
+   * Stubs that are not part of the function's CFG should inherit the RAState of
+   the BasicBlock before them.
+   *
+   * LBB1 is not part of the CFG: LBB0 jumps unconditionally to LBB2.
+   * LBB1 would be a stub inserted in LongJmp in real code.
+   * We do not add any NegateRAState CFIs, as other CFIs are not added either.
+   * See issue #160989 for more details.
+   *
+   *  .LBB0 (1 instructions, align : 1)
+       Entry Point
+         00000000:   b       .LBB2
+       Successors: .LBB2
+
+     .LBB1 (1 instructions, align : 1)
+         00000004:   ret
+
+     .LBB2 (1 instructions, align : 1)
+       Predecessors: .LBB0
+          00000008:   ret
+   */
+  if (GetParam() != Triple::aarch64)
+    GTEST_SKIP();
+
+  ASSERT_NE(TextSection, nullptr);
+
+  PREPARE_FUNC("FuncWithStub");
+  BinaryBasicBlock *BB2 = BF->addBasicBlock();
+  BB2->setCFIState(0);
+  BinaryBasicBlock *BB3 = BF->addBasicBlock();
+  BB3->setCFIState(0);
+
+  BB->addSuccessor(BB3);
+
+  // Jumping over BB2, to BB3.
+  MCInst Jump;
+  BC->MIB->createUncondBranch(Jump, BB3->getLabel(), BC->Ctx.get());
+  BB->addInstruction(Jump);
+  BC->MIB->setRAState(Jump, false);
+
+  // BB2, in real code it would be a ShortJmp.
+  // Unknown RAState.
+  MCInst StubInst;
+  BC->MIB->createReturn(StubInst);
+  BB2->addInstruction(StubInst);
+
+  // Can be any instruction.
+  MCInst Ret;
+  BC->MIB->createReturn(Ret);
+  BB3->addInstruction(Ret);
+  BC->MIB->setRAState(Ret, false);
+
+  Error E = PassManager->runPasses();
+  EXPECT_FALSE(E);
+
+  // Check that we did not generate any NegateRAState CFIs.
+  auto CFILoc = findCFIOffsets(*BF);
+  EXPECT_EQ(CFILoc.size(), 0u);
+}
+
+TEST_P(PassTester, fillUnknownStubsEmpty) {
+  /*
+   * This test checks that BOLT can set the RAState of unknown BBs,
+   * even if all previous BBs are empty, hence no PrevInst gets set.
+   *
+   * As this means that the current (empty) BB is the first with non-pseudo
+   * instructions, the function's initialRAState should be used.
+   */
+  if (GetParam() != Triple::aarch64)
+    GTEST_SKIP();
+
+  ASSERT_NE(TextSection, nullptr);
+
+  PREPARE_FUNC("FuncWithStub");
+  BF->setInitialRAState(false);
+  BinaryBasicBlock *BB2 = BF->addBasicBlock();
+  BB2->setCFIState(0);
+
+  // BB is empty.
+  BB->addSuccessor(BB2);
+
+  // BB2, in real code it would be a ShortJmp.
+  // Unknown RAState.
+  MCInst StubInst;
+  BC->MIB->createReturn(StubInst);
+  BB2->addInstruction(StubInst);
+
+  Error E = PassManager->runPasses();
+  EXPECT_FALSE(E);
+
+  // Check that BOLT added an RAState to BB2.
+  std::optional<bool> RAState = BC->MIB->getRAState(*(BB2->begin()));
+  EXPECT_TRUE(RAState.has_value());
+  // BB2 should be set to BF.initialRAState (false).
+  EXPECT_FALSE(*RAState);
+}
+
+#ifdef AARCH64_AVAILABLE
+INSTANTIATE_TEST_SUITE_P(AArch64, PassTester,
+                         ::testing::Values(Triple::aarch64));
+#endif
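
For readers extending the suite, a sketch of what an additional case on the same fixture could look like (hypothetical test, not part of the patch; it reuses PREPARE_FUNC and findCFIOffsets from the file above): a function whose instructions all share one RA state should need no NegateRAState CFIs.

// Hypothetical extra case built on the PassTester fixture above.
TEST_P(PassTester, noCFIsWhenStateIsUniform) {
  if (GetParam() != Triple::aarch64)
    GTEST_SKIP();

  PREPARE_FUNC("UniformRAStateFunc");

  // One instruction, unsigned RA state everywhere: nothing to negate.
  MCInst Inst = MCInstBuilder(AArch64::ADDSXri)
                    .addReg(AArch64::X0)
                    .addReg(AArch64::X0)
                    .addImm(0)
                    .addImm(0);
  BC->MIB->setRAState(Inst, false);
  BB->addInstruction(Inst);

  Error E = PassManager->runPasses();
  EXPECT_FALSE(E);

  // No RA-state change anywhere, so the pass should insert no CFIs.
  EXPECT_EQ(findCFIOffsets(*BF).size(), 0u);
}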
diff --git a/clang-tools-extra/clang-tidy/.clang-tidy b/clang-tools-extra/clang-tidy/.clang-tidy
index 82d0df8697178..576b4a7b8443e 100644
--- a/clang-tools-extra/clang-tidy/.clang-tidy
+++ b/clang-tools-extra/clang-tidy/.clang-tidy
@@ -32,8 +32,7 @@ Checks: >
   -readability-qualified-auto,
   -readability-simplify-boolean-expr,
   -readability-static-definition-in-anonymous-namespace,
-  -readability-suspicious-call-argument,
-  -readability-use-anyofallof
+  -readability-suspicious-call-argument
 
 CheckOptions:
   - key:             performance-move-const-arg.CheckTriviallyCopyableMove
diff --git a/clang-tools-extra/clang-tidy/ClangTidyDiagnosticConsumer.cpp b/clang-tools-extra/clang-tidy/ClangTidyDiagnosticConsumer.cpp
index 6716d90a1acaf..a8fd499e45c92 100644
--- a/clang-tools-extra/clang-tidy/ClangTidyDiagnosticConsumer.cpp
+++ b/clang-tools-extra/clang-tidy/ClangTidyDiagnosticConsumer.cpp
@@ -478,11 +478,10 @@ bool ClangTidyDiagnosticConsumer::passesLineFilter(StringRef FileName,
     if (FileName.ends_with(Filter.Name)) {
       if (Filter.LineRanges.empty())
         return true;
-      for (const FileFilter::LineRange &Range : Filter.LineRanges) {
-        if (Range.first <= LineNumber && LineNumber <= Range.second)
-          return true;
-      }
-      return false;
+      return llvm::any_of(
+          Filter.LineRanges, [&](const FileFilter::LineRange &Range) {
+            return Range.first <= LineNumber && LineNumber <= Range.second;
+          });
     }
   }
   return false;
diff --git a/clang-tools-extra/clang-tidy/ExpandModularHeadersPPCallbacks.h b/clang-tools-extra/clang-tidy/ExpandModularHeadersPPCallbacks.h
index 95216368492ca..d72d021f44838 100644
--- a/clang-tools-extra/clang-tidy/ExpandModularHeadersPPCallbacks.h
+++ b/clang-tools-extra/clang-tidy/ExpandModularHeadersPPCallbacks.h
@@ -137,7 +137,7 @@ class ExpandModularHeadersPPCallbacks : public PPCallbacks {
   std::unique_ptr<Preprocessor> PP;
   bool EnteredMainFile = false;
   bool StartedLexing = false;
-  Token CurrentToken;
+  Token CurrentToken = Token();
 };
 
 } // namespace tooling
diff --git a/clang-tools-extra/clang-tidy/bugprone/BranchCloneCheck.cpp b/clang-tools-extra/clang-tidy/bugprone/BranchCloneCheck.cpp
index 103b403b9fe5d..4f33670a8500a 100644
--- a/clang-tools-extra/clang-tidy/bugprone/BranchCloneCheck.cpp
+++ b/clang-tools-extra/clang-tidy/bugprone/BranchCloneCheck.cpp
@@ -75,12 +75,9 @@ static bool isFallthroughSwitchBranch(const SwitchBranch &Branch) {
       if (!S)
         return true;
 
-      for (const Attr *A : S->getAttrs()) {
-        if (isa<FallThroughAttr>(A))
-          return false;
-      }
-
-      return true;
+      return llvm::all_of(S->getAttrs(), [](const Attr *A) {
+        return !isa<FallThroughAttr>(A);
+      });
     }
   } Visitor;
 
diff --git a/clang-tools-extra/clang-tidy/bugprone/CapturingThisInMemberVariableCheck.cpp b/clang-tools-extra/clang-tidy/bugprone/CapturingThisInMemberVariableCheck.cpp
index a376de505dd70..6aed454813a22 100644
--- a/clang-tools-extra/clang-tidy/bugprone/CapturingThisInMemberVariableCheck.cpp
+++ b/clang-tools-extra/clang-tidy/bugprone/CapturingThisInMemberVariableCheck.cpp
@@ -44,18 +44,17 @@ AST_MATCHER(CXXRecordDecl, correctHandleCaptureThisLambda) {
   if (Node.hasSimpleMoveAssignment())
     return false;
 
-  for (const CXXConstructorDecl *C : Node.ctors()) {
-    if (C->isCopyOrMoveConstructor() && C->isDefaulted() && !C->isDeleted())
-      return false;
-  }
-  for (const CXXMethodDecl *M : Node.methods()) {
-    if (M->isCopyAssignmentOperator())
-      llvm::errs() << M->isDeleted() << "\n";
-    if (M->isCopyAssignmentOperator() && M->isDefaulted() && !M->isDeleted())
-      return false;
-    if (M->isMoveAssignmentOperator() && M->isDefaulted() && !M->isDeleted())
-      return false;
-  }
+  if (llvm::any_of(Node.ctors(), [](const CXXConstructorDecl *C) {
+        return C->isCopyOrMoveConstructor() && C->isDefaulted() &&
+               !C->isDeleted();
+      }))
+    return false;
+  if (llvm::any_of(Node.methods(), [](const CXXMethodDecl *M) {
+        return (M->isCopyAssignmentOperator() ||
+                M->isMoveAssignmentOperator()) &&
+               M->isDefaulted() && !M->isDeleted();
+      }))
+    return false;
   // FIXME: find ways to identifier correct handle capture this lambda
   return true;
 }
diff --git a/clang-tools-extra/clang-tidy/bugprone/EasilySwappableParametersCheck.cpp b/clang-tools-extra/clang-tidy/bugprone/EasilySwappableParametersCheck.cpp
index a07a68c8a3e65..496f3e5015990 100644
--- a/clang-tools-extra/clang-tidy/bugprone/EasilySwappableParametersCheck.cpp
+++ b/clang-tools-extra/clang-tidy/bugprone/EasilySwappableParametersCheck.cpp
@@ -1589,11 +1589,9 @@ static bool lazyMapOfSetsIntersectionExists(const MapTy &Map, const ElemTy &E1,
   if (E1Iterator == Map.end() || E2Iterator == Map.end())
     return false;
 
-  for (const auto &E1SetElem : E1Iterator->second)
-    if (E2Iterator->second.contains(E1SetElem))
-      return true;
-
-  return false;
+  return llvm::any_of(E1Iterator->second, [&E2Iterator](const auto &E1SetElem) {
+    return E2Iterator->second.contains(E1SetElem);
+  });
 }
 
 /// Implements the heuristic that marks two parameters related if there is
diff --git a/clang-tools-extra/clang-tidy/bugprone/ExceptionEscapeCheck.cpp b/clang-tools-extra/clang-tidy/bugprone/ExceptionEscapeCheck.cpp
index b7de8395ffa05..1cfb1511fa94e 100644
--- a/clang-tools-extra/clang-tidy/bugprone/ExceptionEscapeCheck.cpp
+++ b/clang-tools-extra/clang-tidy/bugprone/ExceptionEscapeCheck.cpp
@@ -72,7 +72,8 @@ void ExceptionEscapeCheck::storeOptions(ClangTidyOptions::OptionMap &Opts) {
 
 void ExceptionEscapeCheck::registerMatchers(MatchFinder *Finder) {
   auto MatchIf = [](bool Enabled, const auto &Matcher) {
-    ast_matchers::internal::Matcher<FunctionDecl> Nothing = unless(anything());
+    const ast_matchers::internal::Matcher<FunctionDecl> Nothing =
+        unless(anything());
     return Enabled ? Matcher : Nothing;
   };
   Finder->addMatcher(
diff --git a/clang-tools-extra/clang-tidy/bugprone/FloatLoopCounterCheck.cpp b/clang-tools-extra/clang-tidy/bugprone/FloatLoopCounterCheck.cpp
index adf2d2b4bcc07..38a0234337756 100644
--- a/clang-tools-extra/clang-tidy/bugprone/FloatLoopCounterCheck.cpp
+++ b/clang-tools-extra/clang-tidy/bugprone/FloatLoopCounterCheck.cpp
@@ -31,6 +31,7 @@ void FloatLoopCounterCheck::registerMatchers(MatchFinder *Finder) {
 
 void FloatLoopCounterCheck::check(const MatchFinder::MatchResult &Result) {
   const auto *FS = Result.Nodes.getNodeAs<ForStmt>("for");
+  assert(FS && "FS should not be null");
 
   diag(FS->getInc()->getBeginLoc(), "loop induction expression should not have "
                                     "floating-point type")
diff --git a/clang-tools-extra/clang-tidy/bugprone/InfiniteLoopCheck.cpp b/clang-tools-extra/clang-tidy/bugprone/InfiniteLoopCheck.cpp
index 50280d22be0d8..6749c59d5fd57 100644
--- a/clang-tools-extra/clang-tidy/bugprone/InfiniteLoopCheck.cpp
+++ b/clang-tools-extra/clang-tidy/bugprone/InfiniteLoopCheck.cpp
@@ -119,14 +119,9 @@ static bool isAtLeastOneCondVarChanged(const Decl *Func, const Stmt *LoopStmt,
   if (isVarThatIsPossiblyChanged(Func, LoopStmt, Cond, Context))
     return true;
 
-  for (const Stmt *Child : Cond->children()) {
-    if (!Child)
-      continue;
-
-    if (isAtLeastOneCondVarChanged(Func, LoopStmt, Child, Context))
-      return true;
-  }
-  return false;
+  return llvm::any_of(Cond->children(), [&](const Stmt *Child) {
+    return Child && isAtLeastOneCondVarChanged(Func, LoopStmt, Child, Context);
+  });
 }
 
 /// Return the variable names in `Cond`.
@@ -240,10 +235,9 @@ static bool hasStaticLocalVariable(const Stmt *Cond) {
           return true;
   }
 
-  for (const Stmt *Child : Cond->children())
-    if (Child && hasStaticLocalVariable(Child))
-      return true;
-  return false;
+  return llvm::any_of(Cond->children(), [](const Stmt *Child) {
+    return Child && hasStaticLocalVariable(Child);
+  });
 }
 
 /// Tests if the loop condition `Cond` involves static local variables and
diff --git a/clang-tools-extra/clang-tidy/bugprone/SuspiciousReallocUsageCheck.cpp b/clang-tools-extra/clang-tidy/bugprone/SuspiciousReallocUsageCheck.cpp
index 7cc3630204e63..bf31218131d5e 100644
--- a/clang-tools-extra/clang-tidy/bugprone/SuspiciousReallocUsageCheck.cpp
+++ b/clang-tools-extra/clang-tidy/bugprone/SuspiciousReallocUsageCheck.cpp
@@ -92,10 +92,9 @@ class FindAssignToVarBefore
     return false;
   }
   bool VisitStmt(const Stmt *S) {
-    for (const Stmt *Child : S->children())
-      if (Child && Visit(Child))
-        return true;
-    return false;
+    return llvm::any_of(S->children(), [this](const Stmt *Child) {
+      return Child && Visit(Child);
+    });
   }
 };
 
diff --git a/clang-tools-extra/clang-tidy/cppcoreguidelines/ProBoundsArrayToPointerDecayCheck.cpp b/clang-tools-extra/clang-tidy/cppcoreguidelines/ProBoundsArrayToPointerDecayCheck.cpp
index d0f86526d1a29..1c5c854cb4d84 100644
--- a/clang-tools-extra/clang-tidy/cppcoreguidelines/ProBoundsArrayToPointerDecayCheck.cpp
+++ b/clang-tools-extra/clang-tidy/cppcoreguidelines/ProBoundsArrayToPointerDecayCheck.cpp
@@ -19,10 +19,11 @@ namespace clang::tidy::cppcoreguidelines {
 namespace {
 AST_MATCHER_P(CXXForRangeStmt, hasRangeBeginEndStmt,
               ast_matchers::internal::Matcher<DeclStmt>, InnerMatcher) {
-  for (const DeclStmt *Stmt : {Node.getBeginStmt(), Node.getEndStmt()})
-    if (Stmt != nullptr && InnerMatcher.matches(*Stmt, Finder, Builder))
-      return true;
-  return false;
+  return llvm::any_of(llvm::ArrayRef{Node.getBeginStmt(), Node.getEndStmt()},
+                      [&](const DeclStmt *Stmt) {
+                        return Stmt &&
+                               InnerMatcher.matches(*Stmt, Finder, Builder);
+                      });
 }
 
 AST_MATCHER(Stmt, isInsideOfRangeBeginEndStmt) {
diff --git a/clang-tools-extra/clang-tidy/cppcoreguidelines/ProBoundsAvoidUncheckedContainerAccessCheck.cpp b/clang-tools-extra/clang-tidy/cppcoreguidelines/ProBoundsAvoidUncheckedContainerAccessCheck.cpp
index 54c4692923949..cf4b445a554e8 100644
--- a/clang-tools-extra/clang-tidy/cppcoreguidelines/ProBoundsAvoidUncheckedContainerAccessCheck.cpp
+++ b/clang-tools-extra/clang-tidy/cppcoreguidelines/ProBoundsAvoidUncheckedContainerAccessCheck.cpp
@@ -176,7 +176,7 @@ void ProBoundsAvoidUncheckedContainerAccessCheck::check(
     }
   } else if (const auto *MCE = dyn_cast<CXXMemberCallExpr>(MatchedExpr)) {
     // Case: a.operator[](i) or a->operator[](i)
-    const auto *Callee = dyn_cast<MemberExpr>(MCE->getCallee());
+    const auto *Callee = cast<MemberExpr>(MCE->getCallee());
 
     if (FixMode == At) {
       // Cases: a.operator[](i) => a.at(i) and a->operator[](i) => a->at(i)
diff --git a/clang-tools-extra/clang-tidy/cppcoreguidelines/ProTypeMemberInitCheck.cpp b/clang-tools-extra/clang-tidy/cppcoreguidelines/ProTypeMemberInitCheck.cpp
index 66508da89f0dd..f2676468b6871 100644
--- a/clang-tools-extra/clang-tidy/cppcoreguidelines/ProTypeMemberInitCheck.cpp
+++ b/clang-tools-extra/clang-tidy/cppcoreguidelines/ProTypeMemberInitCheck.cpp
@@ -361,7 +361,8 @@ void ProTypeMemberInitCheck::storeOptions(ClangTidyOptions::OptionMap &Opts) {
 }
 
 // FIXME: Copied from clang/lib/Sema/SemaDeclCXX.cpp.
-static bool isIncompleteOrZeroLengthArrayType(ASTContext &Context, QualType T) {
+static bool isIncompleteOrZeroLengthArrayType(const ASTContext &Context,
+                                              QualType T) {
   if (T->isIncompleteArrayType())
     return true;
 
@@ -375,7 +376,7 @@ static bool isIncompleteOrZeroLengthArrayType(ASTContext &Context, QualType T) {
   return false;
 }
 
-static bool isEmpty(ASTContext &Context, const QualType &Type) {
+static bool isEmpty(const ASTContext &Context, const QualType &Type) {
   if (const CXXRecordDecl *ClassDecl = Type->getAsCXXRecordDecl()) {
     return ClassDecl->isEmpty();
   }
@@ -431,19 +432,13 @@ static llvm::StringLiteral getInitializer(QualType QT, bool UseAssignment) {
   }
 }
 
-void ProTypeMemberInitCheck::checkMissingMemberInitializer(
-    ASTContext &Context, const CXXRecordDecl &ClassDecl,
-    const CXXConstructorDecl *Ctor) {
-  const bool IsUnion = ClassDecl.isUnion();
-
-  if (IsUnion && ClassDecl.hasInClassInitializer())
-    return;
-
-  // Gather all fields (direct and indirect) that need to be initialized.
-  SmallPtrSet<const FieldDecl *, 16> FieldsToInit;
+static void
+computeFieldsToInit(const ASTContext &Context, const RecordDecl &Record,
+                    bool IgnoreArrays,
+                    SmallPtrSetImpl<const FieldDecl *> &FieldsToInit) {
   bool AnyMemberHasInitPerUnion = false;
   forEachFieldWithFilter(
-      ClassDecl, ClassDecl.fields(), AnyMemberHasInitPerUnion,
+      Record, Record.fields(), AnyMemberHasInitPerUnion,
       [&](const FieldDecl *F) {
         if (IgnoreArrays && F->getType()->isArrayType())
           return;
@@ -458,6 +453,19 @@ void ProTypeMemberInitCheck::checkMissingMemberInitializer(
             !AnyMemberHasInitPerUnion)
           FieldsToInit.insert(F);
       });
+}
+
+void ProTypeMemberInitCheck::checkMissingMemberInitializer(
+    ASTContext &Context, const CXXRecordDecl &ClassDecl,
+    const CXXConstructorDecl *Ctor) {
+  const bool IsUnion = ClassDecl.isUnion();
+
+  if (IsUnion && ClassDecl.hasInClassInitializer())
+    return;
+
+  // Gather all fields (direct and indirect) that need to be initialized.
+  SmallPtrSet<const FieldDecl *, 16> FieldsToInit;
+  computeFieldsToInit(Context, ClassDecl, IgnoreArrays, FieldsToInit);
   if (FieldsToInit.empty())
     return;
 
@@ -507,7 +515,7 @@ void ProTypeMemberInitCheck::checkMissingMemberInitializer(
   // Collect all fields but only suggest a fix for the first member of unions,
   // as initializing more than one union member is an error.
   SmallPtrSet<const FieldDecl *, 16> FieldsToFix;
-  AnyMemberHasInitPerUnion = false;
+  bool AnyMemberHasInitPerUnion = false;
   forEachFieldWithFilter(ClassDecl, ClassDecl.fields(),
                          AnyMemberHasInitPerUnion, [&](const FieldDecl *F) {
                            if (!FieldsToInit.contains(F))
@@ -582,6 +590,17 @@ void ProTypeMemberInitCheck::checkMissingBaseClassInitializer(
 
 void ProTypeMemberInitCheck::checkUninitializedTrivialType(
     const ASTContext &Context, const VarDecl *Var) {
+  // Verify that the record actually needs initialization
+  const CXXRecordDecl *Record = Var->getType()->getAsCXXRecordDecl();
+  if (!Record)
+    return;
+
+  SmallPtrSet<const FieldDecl *, 16> FieldsToInit;
+  computeFieldsToInit(Context, *Record, IgnoreArrays, FieldsToInit);
+
+  if (FieldsToInit.empty())
+    return;
+
   const DiagnosticBuilder Diag =
       diag(Var->getBeginLoc(), "uninitialized record type: %0") << Var;
 
diff --git a/clang-tools-extra/clang-tidy/fuchsia/VirtualInheritanceCheck.cpp b/clang-tools-extra/clang-tidy/fuchsia/VirtualInheritanceCheck.cpp
index b6fb22c66d374..9c98b4938844f 100644
--- a/clang-tools-extra/clang-tidy/fuchsia/VirtualInheritanceCheck.cpp
+++ b/clang-tools-extra/clang-tidy/fuchsia/VirtualInheritanceCheck.cpp
@@ -20,10 +20,9 @@ AST_MATCHER(CXXRecordDecl, hasDirectVirtualBaseClass) {
     return false;
   if (!Node.getNumVBases())
     return false;
-  for (const CXXBaseSpecifier &Base : Node.bases())
-    if (Base.isVirtual())
-      return true;
-  return false;
+  return llvm::any_of(Node.bases(), [](const CXXBaseSpecifier &Base) {
+    return Base.isVirtual();
+  });
 }
 } // namespace
 
diff --git a/clang-tools-extra/clang-tidy/misc/NewDeleteOverloadsCheck.cpp b/clang-tools-extra/clang-tidy/misc/NewDeleteOverloadsCheck.cpp
index a44e9b381d982..0471ba8ae291d 100644
--- a/clang-tools-extra/clang-tidy/misc/NewDeleteOverloadsCheck.cpp
+++ b/clang-tools-extra/clang-tidy/misc/NewDeleteOverloadsCheck.cpp
@@ -114,17 +114,15 @@ hasCorrespondingOverloadInBaseClass(const CXXMethodDecl *MD,
     RD = MD->getParent();
   }
 
-  for (const auto &BS : RD->bases()) {
+  return llvm::any_of(RD->bases(), [&](const CXXBaseSpecifier &BS) {
     // We can't say much about a dependent base class, but to avoid false
     // positives assume it can have a corresponding overload.
     if (BS.getType()->isDependentType())
       return true;
-    if (const auto *BaseRD = BS.getType()->getAsCXXRecordDecl())
-      if (hasCorrespondingOverloadInBaseClass(MD, BaseRD))
-        return true;
-  }
-
-  return false;
+    if (const CXXRecordDecl *BaseRD = BS.getType()->getAsCXXRecordDecl())
+      return hasCorrespondingOverloadInBaseClass(MD, BaseRD);
+    return false;
+  });
 }
 
 void NewDeleteOverloadsCheck::registerMatchers(MatchFinder *Finder) {
diff --git a/clang-tools-extra/clang-tidy/misc/UnusedParametersCheck.cpp b/clang-tools-extra/clang-tidy/misc/UnusedParametersCheck.cpp
index 870905087a9bd..9c38bb129022f 100644
--- a/clang-tools-extra/clang-tidy/misc/UnusedParametersCheck.cpp
+++ b/clang-tools-extra/clang-tidy/misc/UnusedParametersCheck.cpp
@@ -30,13 +30,10 @@ static bool isOverrideMethod(const FunctionDecl *Function) {
 
 static bool hasAttrAfterParam(const SourceManager *SourceManager,
                               const ParmVarDecl *Param) {
-  for (const auto *Attr : Param->attrs()) {
-    if (SourceManager->isBeforeInTranslationUnit(Param->getLocation(),
-                                                 Attr->getLocation())) {
-      return true;
-    }
-  }
-  return false;
+  return llvm::any_of(Param->attrs(), [&](const Attr *Attr) {
+    return SourceManager->isBeforeInTranslationUnit(Param->getLocation(),
+                                                    Attr->getLocation());
+  });
 }
 
 void UnusedParametersCheck::registerMatchers(MatchFinder *Finder) {
diff --git a/clang-tools-extra/clang-tidy/modernize/LoopConvertCheck.cpp b/clang-tools-extra/clang-tidy/modernize/LoopConvertCheck.cpp
index 668ba08400b2b..19b406f05a746 100644
--- a/clang-tools-extra/clang-tidy/modernize/LoopConvertCheck.cpp
+++ b/clang-tools-extra/clang-tidy/modernize/LoopConvertCheck.cpp
@@ -510,27 +510,24 @@ static bool canBeModified(ASTContext *Context, const Expr *E) {
 /// Returns true when it can be guaranteed that the elements of the
 /// container are not being modified.
 static bool usagesAreConst(ASTContext *Context, const UsageResult &Usages) {
-  for (const Usage &U : Usages) {
+  return llvm::none_of(Usages, [&Context](const Usage &U) {
     // Lambda captures are just redeclarations (VarDecl) of the same variable,
     // not expressions. If we want to know if a variable that is captured by
     // reference can be modified in an usage inside the lambda's body, we need
     // to find the expression corresponding to that particular usage, later in
     // this loop.
-    if (U.Kind != Usage::UK_CaptureByCopy && U.Kind != Usage::UK_CaptureByRef &&
-        canBeModified(Context, U.Expression))
-      return false;
-  }
-  return true;
+    return U.Kind != Usage::UK_CaptureByCopy &&
+           U.Kind != Usage::UK_CaptureByRef &&
+           canBeModified(Context, U.Expression);
+  });
 }
 
 /// Returns true if the elements of the container are never accessed
 /// by reference.
 static bool usagesReturnRValues(const UsageResult &Usages) {
-  for (const auto &U : Usages) {
-    if (U.Expression && !U.Expression->isPRValue())
-      return false;
-  }
-  return true;
+  return llvm::all_of(Usages, [](const Usage &U) {
+    return !U.Expression || U.Expression->isPRValue();
+  });
 }
 
 /// Returns true if the container is const-qualified.
diff --git a/clang-tools-extra/clang-tidy/modernize/LoopConvertUtils.cpp b/clang-tools-extra/clang-tidy/modernize/LoopConvertUtils.cpp
index 170a4f6d8731f..f6685dda7e09e 100644
--- a/clang-tools-extra/clang-tidy/modernize/LoopConvertUtils.cpp
+++ b/clang-tools-extra/clang-tidy/modernize/LoopConvertUtils.cpp
@@ -89,13 +89,11 @@ bool DependencyFinderASTVisitor::VisitVarDecl(VarDecl *V) {
 
   // Next, check if the variable was removed from existence by an earlier
   // iteration.
-  for (const auto &I : *ReplacedVars) {
-    if (I.second == V) {
-      DependsOnInsideVariable = true;
-      return false;
-    }
-  }
-  return true;
+  if (llvm::none_of(*ReplacedVars,
+                    [&](const auto &I) { return I.second == V; }))
+    return true;
+  DependsOnInsideVariable = true;
+  return false;
 }
 
 /// If we already created a variable for TheLoop, check to make sure
@@ -234,11 +232,8 @@ static bool containsExpr(ASTContext *Context, const ContainerT *Container,
                          const Expr *E) {
   llvm::FoldingSetNodeID ID;
   E->Profile(ID, *Context, true);
-  for (const auto &I : *Container) {
-    if (ID == I.second)
-      return true;
-  }
-  return false;
+  return llvm::any_of(*Container,
+                      [&](const auto &I) { return ID == I.second; });
 }
 
 /// Returns true when the index expression is a declaration reference to
diff --git a/clang-tools-extra/clang-tidy/modernize/PassByValueCheck.cpp b/clang-tools-extra/clang-tidy/modernize/PassByValueCheck.cpp
index a257f5325f780..09d98ee8bea6f 100644
--- a/clang-tools-extra/clang-tidy/modernize/PassByValueCheck.cpp
+++ b/clang-tools-extra/clang-tidy/modernize/PassByValueCheck.cpp
@@ -196,11 +196,7 @@ static bool hasRValueOverload(const CXXConstructorDecl *Ctor,
     return true;
   };
 
-  for (const auto *Candidate : Record->ctors()) {
-    if (IsRValueOverload(Candidate))
-      return true;
-  }
-  return false;
+  return llvm::any_of(Record->ctors(), IsRValueOverload);
 }
 
 /// Find all references to \p ParamDecl across all of the
diff --git a/clang-tools-extra/clang-tidy/modernize/UseEmplaceCheck.cpp b/clang-tools-extra/clang-tidy/modernize/UseEmplaceCheck.cpp
index e585dd1d40002..ca97b11b9990b 100644
--- a/clang-tools-extra/clang-tidy/modernize/UseEmplaceCheck.cpp
+++ b/clang-tools-extra/clang-tidy/modernize/UseEmplaceCheck.cpp
@@ -44,17 +44,12 @@ AST_MATCHER_P(NamedDecl, hasAnyNameIgnoringTemplates, std::vector<StringRef>,
   // clang/lib/ASTMatchers/ASTMatchersInternal.cpp and checks whether
   // FullNameTrimmed matches any of the given Names.
   const StringRef FullNameTrimmedRef = FullNameTrimmed;
-  for (const StringRef Pattern : Names) {
-    if (Pattern.starts_with("::")) {
-      if (FullNameTrimmed == Pattern)
-        return true;
-    } else if (FullNameTrimmedRef.ends_with(Pattern) &&
-               FullNameTrimmedRef.drop_back(Pattern.size()).ends_with("::")) {
-      return true;
-    }
-  }
-
-  return false;
+  return llvm::any_of(Names, [&](const StringRef Pattern) {
+    if (Pattern.starts_with("::"))
+      return FullNameTrimmed == Pattern;
+    return FullNameTrimmedRef.ends_with(Pattern) &&
+           FullNameTrimmedRef.drop_back(Pattern.size()).ends_with("::");
+  });
 }
 
 // Checks if the given matcher is the last argument of the given CallExpr.
diff --git a/clang-tools-extra/clang-tidy/modernize/UseStdPrintCheck.h b/clang-tools-extra/clang-tidy/modernize/UseStdPrintCheck.h
index 18cff9aa962b5..7d771c4446cd3 100644
--- a/clang-tools-extra/clang-tidy/modernize/UseStdPrintCheck.h
+++ b/clang-tools-extra/clang-tidy/modernize/UseStdPrintCheck.h
@@ -36,7 +36,7 @@ class UseStdPrintCheck : public ClangTidyCheck {
   }
 
 private:
-  Preprocessor *PP;
+  Preprocessor *PP = nullptr;
   bool StrictMode;
   std::vector<StringRef> PrintfLikeFunctions;
   std::vector<StringRef> FprintfLikeFunctions;
diff --git a/clang-tools-extra/clang-tidy/modernize/UseTrailingReturnTypeCheck.cpp b/clang-tools-extra/clang-tidy/modernize/UseTrailingReturnTypeCheck.cpp
index f9afd5044b584..02865b65a9ec2 100644
--- a/clang-tools-extra/clang-tidy/modernize/UseTrailingReturnTypeCheck.cpp
+++ b/clang-tools-extra/clang-tidy/modernize/UseTrailingReturnTypeCheck.cpp
@@ -55,13 +55,12 @@ struct UnqualNameVisitor : public RecursiveASTVisitor<UnqualNameVisitor> {
 
   bool visitUnqualName(StringRef UnqualName) {
     // Check for collisions with function arguments.
-    for (const ParmVarDecl *Param : F.parameters())
+    Collision = llvm::any_of(F.parameters(), [&](const ParmVarDecl *Param) {
       if (const IdentifierInfo *Ident = Param->getIdentifier())
-        if (Ident->getName() == UnqualName) {
-          Collision = true;
-          return true;
-        }
-    return false;
+        return Ident->getName() == UnqualName;
+      return false;
+    });
+    return Collision;
   }
 
   bool TraverseTypeLoc(TypeLoc TL, bool TraverseQualifier = true) {
diff --git a/clang-tools-extra/clang-tidy/objc/MissingHashCheck.cpp b/clang-tools-extra/clang-tidy/objc/MissingHashCheck.cpp
index 7b48fd9f77bca..b8010e0d29eb5 100644
--- a/clang-tools-extra/clang-tidy/objc/MissingHashCheck.cpp
+++ b/clang-tools-extra/clang-tidy/objc/MissingHashCheck.cpp
@@ -25,11 +25,9 @@ AST_MATCHER_P(ObjCImplementationDecl, hasInterface,
 AST_MATCHER_P(ObjCContainerDecl, hasInstanceMethod,
               ast_matchers::internal::Matcher<ObjCMethodDecl>, Base) {
   // Check each instance method against the provided matcher.
-  for (const auto *I : Node.instance_methods()) {
-    if (Base.matches(*I, Finder, Builder))
-      return true;
-  }
-  return false;
+  return llvm::any_of(Node.instance_methods(), [&](const ObjCMethodDecl *I) {
+    return Base.matches(*I, Finder, Builder);
+  });
 }
 
 } // namespace
diff --git a/clang-tools-extra/clang-tidy/performance/TriviallyDestructibleCheck.cpp b/clang-tools-extra/clang-tidy/performance/TriviallyDestructibleCheck.cpp
index 2f54b17367b06..416c41d7acd66 100644
--- a/clang-tools-extra/clang-tidy/performance/TriviallyDestructibleCheck.cpp
+++ b/clang-tools-extra/clang-tidy/performance/TriviallyDestructibleCheck.cpp
@@ -23,12 +23,9 @@ namespace {
 AST_MATCHER(Decl, isFirstDecl) { return Node.isFirstDecl(); }
 
 AST_MATCHER_P(CXXRecordDecl, hasBase, Matcher<QualType>, InnerMatcher) {
-  for (const CXXBaseSpecifier &BaseSpec : Node.bases()) {
-    const QualType BaseType = BaseSpec.getType();
-    if (InnerMatcher.matches(BaseType, Finder, Builder))
-      return true;
-  }
-  return false;
+  return llvm::any_of(Node.bases(), [&](const CXXBaseSpecifier &BaseSpec) {
+    return InnerMatcher.matches(BaseSpec.getType(), Finder, Builder);
+  });
 }
 
 } // namespace
diff --git a/clang-tools-extra/clang-tidy/readability/AmbiguousSmartptrResetCallCheck.cpp b/clang-tools-extra/clang-tidy/readability/AmbiguousSmartptrResetCallCheck.cpp
index 22ff5ce1545a5..ef9263beebfdd 100644
--- a/clang-tools-extra/clang-tidy/readability/AmbiguousSmartptrResetCallCheck.cpp
+++ b/clang-tools-extra/clang-tidy/readability/AmbiguousSmartptrResetCallCheck.cpp
@@ -20,12 +20,9 @@ namespace clang::tidy::readability {
 namespace {
 
 AST_MATCHER(CXXMethodDecl, hasOnlyDefaultParameters) {
-  for (const auto *Param : Node.parameters()) {
-    if (!Param->hasDefaultArg())
-      return false;
-  }
-
-  return true;
+  return llvm::all_of(Node.parameters(), [](const ParmVarDecl *Param) {
+    return Param->hasDefaultArg();
+  });
 }
 
 const auto DefaultSmartPointers = "::std::shared_ptr;::std::unique_ptr;"
diff --git a/clang-tools-extra/clang-tidy/readability/IdentifierNamingCheck.cpp b/clang-tools-extra/clang-tidy/readability/IdentifierNamingCheck.cpp
index d1583a62a8e5e..79f8437057b23 100644
--- a/clang-tools-extra/clang-tidy/readability/IdentifierNamingCheck.cpp
+++ b/clang-tools-extra/clang-tidy/readability/IdentifierNamingCheck.cpp
@@ -318,8 +318,8 @@ std::string IdentifierNamingCheck::HungarianNotation::getDeclTypeName(
   if (!EOL)
     EOL = Begin + strlen(Begin);
 
-  const char *PosList[] = {strchr(Begin, '='), strchr(Begin, ';'),
-                           strchr(Begin, ','), strchr(Begin, ')'), EOL};
+  const char *const PosList[] = {strchr(Begin, '='), strchr(Begin, ';'),
+                                 strchr(Begin, ','), strchr(Begin, ')'), EOL};
   for (const auto &Pos : PosList) {
     if (Pos > Begin)
       EOL = std::min(EOL, Pos);
diff --git a/clang-tools-extra/clang-tidy/readability/OperatorsRepresentationCheck.cpp b/clang-tools-extra/clang-tidy/readability/OperatorsRepresentationCheck.cpp
index 269996dc07916..05f31c76c4c75 100644
--- a/clang-tools-extra/clang-tidy/readability/OperatorsRepresentationCheck.cpp
+++ b/clang-tools-extra/clang-tidy/readability/OperatorsRepresentationCheck.cpp
@@ -136,11 +136,9 @@ getRepresentation(const std::vector<llvm::StringRef> &Config,
 template <typename T>
 static bool isAnyOperatorEnabled(const std::vector<llvm::StringRef> &Config,
                                  const T &Operators) {
-  for (const auto &[traditional, alternative] : Operators) {
-    if (!getRepresentation(Config, traditional, alternative).empty())
-      return true;
-  }
-  return false;
+  return llvm::any_of(Operators, [&](const auto &Op) {
+    return !getRepresentation(Config, Op.first, Op.second).empty();
+  });
 }
 
 OperatorsRepresentationCheck::OperatorsRepresentationCheck(
diff --git a/clang-tools-extra/clang-tidy/readability/RedundantTypenameCheck.cpp b/clang-tools-extra/clang-tidy/readability/RedundantTypenameCheck.cpp
index a4edd2b46b86b..5f2519ce9d5c3 100644
--- a/clang-tools-extra/clang-tidy/readability/RedundantTypenameCheck.cpp
+++ b/clang-tools-extra/clang-tidy/readability/RedundantTypenameCheck.cpp
@@ -47,6 +47,9 @@ void RedundantTypenameCheck::check(const MatchFinder::MatchResult &Result) {
   const SourceLocation ElaboratedKeywordLoc = [&] {
     if (const auto *NonDependentTypeLoc =
             Result.Nodes.getNodeAs<TypeLoc>("nonDependentTypeLoc")) {
+      if (NonDependentTypeLoc->getType()->isDependentType())
+        return SourceLocation();
+
       if (const auto TL = NonDependentTypeLoc->getAs<TypedefTypeLoc>())
         return TL.getElaboratedKeywordLoc();
 
@@ -59,8 +62,7 @@ void RedundantTypenameCheck::check(const MatchFinder::MatchResult &Result) {
 
       if (const auto TL =
               NonDependentTypeLoc->getAs<TemplateSpecializationTypeLoc>())
-        if (!TL.getType()->isDependentType())
-          return TL.getElaboratedKeywordLoc();
+        return TL.getElaboratedKeywordLoc();
     } else {
       TypeLoc InnermostTypeLoc =
           *Result.Nodes.getNodeAs<TypeLoc>("dependentTypeLoc");
diff --git a/clang-tools-extra/clang-tidy/readability/SuspiciousCallArgumentCheck.cpp b/clang-tools-extra/clang-tidy/readability/SuspiciousCallArgumentCheck.cpp
index d9ccd9920bb6f..0b52a9664a7a5 100644
--- a/clang-tools-extra/clang-tidy/readability/SuspiciousCallArgumentCheck.cpp
+++ b/clang-tools-extra/clang-tidy/readability/SuspiciousCallArgumentCheck.cpp
@@ -758,7 +758,7 @@ bool SuspiciousCallArgumentCheck::areParamAndArgComparable(
 
 bool SuspiciousCallArgumentCheck::areArgsSwapped(std::size_t Position1,
                                                  std::size_t Position2) const {
-  for (const Heuristic H : AppliedHeuristics) {
+  return llvm::any_of(AppliedHeuristics, [&](Heuristic H) {
     const bool A1ToP2Similar = areNamesSimilar(
         ArgNames[Position2], ParamNames[Position1], H, BoundKind::SimilarAbove);
     const bool A2ToP1Similar = areNamesSimilar(
@@ -771,11 +771,9 @@ bool SuspiciousCallArgumentCheck::areArgsSwapped(std::size_t Position1,
         !areNamesSimilar(ArgNames[Position2], ParamNames[Position2], H,
                          BoundKind::DissimilarBelow);
 
-    if ((A1ToP2Similar || A2ToP1Similar) && A1ToP1Dissimilar &&
-        A2ToP2Dissimilar)
-      return true;
-  }
-  return false;
+    return (A1ToP2Similar || A2ToP1Similar) && A1ToP1Dissimilar &&
+           A2ToP2Dissimilar;
+  });
 }
 
 bool SuspiciousCallArgumentCheck::areNamesSimilar(StringRef Arg,
diff --git a/clang-tools-extra/clang-tidy/tool/ClangTidyMain.cpp b/clang-tools-extra/clang-tidy/tool/ClangTidyMain.cpp
index bc6bd164e24f8..6a1f61dd6b9e1 100644
--- a/clang-tools-extra/clang-tidy/tool/ClangTidyMain.cpp
+++ b/clang-tools-extra/clang-tidy/tool/ClangTidyMain.cpp
@@ -104,8 +104,7 @@ Configuration files:
 )");
 
 const char DefaultChecks[] = // Enable these checks by default:
-    "clang-diagnostic-*,"    //   * compiler diagnostics
-    "clang-analyzer-*";      //   * Static Analyzer checks
+    "clang-diagnostic-*";    //   * compiler diagnostics
 
 static cl::opt<std::string> Checks("checks", desc(R"(
 Comma-separated list of globs with optional '-'
diff --git a/clang-tools-extra/clang-tidy/utils/Aliasing.cpp b/clang-tools-extra/clang-tidy/utils/Aliasing.cpp
index a22d2358bc560..1b12859c7e450 100644
--- a/clang-tools-extra/clang-tidy/utils/Aliasing.cpp
+++ b/clang-tools-extra/clang-tidy/utils/Aliasing.cpp
@@ -65,15 +65,9 @@ static bool hasPtrOrReferenceInStmt(const Stmt *S, const ValueDecl *Var) {
   if (isPtrOrReferenceForVar(S, Var))
     return true;
 
-  for (const Stmt *Child : S->children()) {
-    if (!Child)
-      continue;
-
-    if (hasPtrOrReferenceInStmt(Child, Var))
-      return true;
-  }
-
-  return false;
+  return llvm::any_of(S->children(), [&](const Stmt *Child) {
+    return Child && hasPtrOrReferenceInStmt(Child, Var);
+  });
 }
 
 static bool refersToEnclosingLambdaCaptureByRef(const Decl *Func,
diff --git a/clang-tools-extra/clang-tidy/utils/DeclRefExprUtils.cpp b/clang-tools-extra/clang-tidy/utils/DeclRefExprUtils.cpp
index 75a6dafed3c5e..a807c951a0e98 100644
--- a/clang-tools-extra/clang-tidy/utils/DeclRefExprUtils.cpp
+++ b/clang-tools-extra/clang-tidy/utils/DeclRefExprUtils.cpp
@@ -21,10 +21,7 @@ using llvm::SmallPtrSet;
 
 template <typename S>
 static bool isSetDifferenceEmpty(const S &S1, const S &S2) {
-  for (auto E : S1)
-    if (S2.count(E) == 0)
-      return false;
-  return true;
+  return llvm::none_of(S1, [&S2](const auto &E) { return !S2.contains(E); });
 }
 
 // Extracts all Nodes keyed by ID from Matches and inserts them into Nodes.
diff --git a/clang-tools-extra/clang-tidy/utils/ExprSequence.cpp b/clang-tools-extra/clang-tidy/utils/ExprSequence.cpp
index 0375d0f6c740f..45fcacf584157 100644
--- a/clang-tools-extra/clang-tidy/utils/ExprSequence.cpp
+++ b/clang-tools-extra/clang-tidy/utils/ExprSequence.cpp
@@ -148,12 +148,9 @@ bool ExprSequence::inSequence(const Stmt *Before, const Stmt *After) const {
 
   // If 'After' is a parent of 'Before' or is sequenced after one of these
   // parents, we know that it is sequenced after 'Before'.
-  for (const Stmt *Parent : BeforeParents) {
-    if (Parent == After || inSequence(Parent, After))
-      return true;
-  }
-
-  return false;
+  return llvm::any_of(BeforeParents, [&](const Stmt *Parent) {
+    return Parent == After || inSequence(Parent, After);
+  });
 }
 
 bool ExprSequence::potentiallyAfter(const Stmt *After,
diff --git a/clang-tools-extra/clang-tidy/utils/FormatStringConverter.cpp b/clang-tools-extra/clang-tidy/utils/FormatStringConverter.cpp
index 23dae04916e9b..d210b000dfd33 100644
--- a/clang-tools-extra/clang-tidy/utils/FormatStringConverter.cpp
+++ b/clang-tools-extra/clang-tidy/utils/FormatStringConverter.cpp
@@ -700,6 +700,7 @@ void FormatStringConverter::finalizeFormatText() {
 /// Append literal parts of the format text, reinstating escapes as required.
 void FormatStringConverter::appendFormatText(const StringRef Text) {
   for (const char Ch : Text) {
+    const auto UCh = static_cast<unsigned char>(Ch);
     if (Ch == '\a')
       StandardFormatString += "\\a";
     else if (Ch == '\b')
@@ -724,10 +725,10 @@ void FormatStringConverter::appendFormatText(const StringRef Text) {
     } else if (Ch == '}') {
       StandardFormatString += "}}";
       FormatStringNeededRewriting = true;
-    } else if (Ch < 32) {
+    } else if (UCh < 32) {
       StandardFormatString += "\\x";
-      StandardFormatString += llvm::hexdigit(Ch >> 4, true);
-      StandardFormatString += llvm::hexdigit(Ch & 0xf, true);
+      StandardFormatString += llvm::hexdigit(UCh >> 4, true);
+      StandardFormatString += llvm::hexdigit(UCh & 0xf, true);
     } else
       StandardFormatString += Ch;
   }
diff --git a/clang-tools-extra/clang-tidy/utils/TypeTraits.cpp b/clang-tools-extra/clang-tidy/utils/TypeTraits.cpp
index 98a5d40d49313..dde6e9a8dca70 100644
--- a/clang-tools-extra/clang-tidy/utils/TypeTraits.cpp
+++ b/clang-tools-extra/clang-tidy/utils/TypeTraits.cpp
@@ -24,11 +24,9 @@ static bool hasDeletedCopyConstructor(QualType Type) {
   auto *Record = Type->getAsCXXRecordDecl();
   if (!Record || !Record->hasDefinition())
     return false;
-  for (const auto *Constructor : Record->ctors()) {
-    if (Constructor->isCopyConstructor() && Constructor->isDeleted())
-      return true;
-  }
-  return false;
+  return llvm::any_of(Record->ctors(), [](const auto *Constructor) {
+    return Constructor->isCopyConstructor() && Constructor->isDeleted();
+  });
 }
 
 std::optional<bool> isExpensiveToCopy(QualType Type,
@@ -70,14 +68,10 @@ bool recordIsTriviallyDefaultConstructible(const RecordDecl &RecordDecl,
       return false;
   }
   // If all its direct bases are trivially constructible.
-  for (const CXXBaseSpecifier &Base : ClassDecl->bases()) {
-    if (!isTriviallyDefaultConstructible(Base.getType(), Context))
-      return false;
-    if (Base.isVirtual())
-      return false;
-  }
-
-  return true;
+  return llvm::all_of(ClassDecl->bases(), [&](const CXXBaseSpecifier &Base) {
+    return isTriviallyDefaultConstructible(Base.getType(), Context) &&
+           !Base.isVirtual();
+  });
 }
 
 // Based on QualType::isTrivial.
diff --git a/clang-tools-extra/clangd/test/CMakeLists.txt b/clang-tools-extra/clangd/test/CMakeLists.txt
index 42fc3506641f2..eef8f529667f7 100644
--- a/clang-tools-extra/clangd/test/CMakeLists.txt
+++ b/clang-tools-extra/clangd/test/CMakeLists.txt
@@ -5,6 +5,10 @@ set(CLANGD_TEST_DEPS
   split-file
   )
 
+if (LLVM_INCLUDE_BENCHMARKS)
+  list(APPEND CLANGD_TEST_DEPS IndexBenchmark)
+endif()
+
 if(CLANGD_BUILD_XPC)
   list(APPEND CLANGD_TEST_DEPS clangd-xpc-test-client)
   list(APPEND CLANGD_TEST_DEPS ClangdXpcUnitTests)
diff --git a/clang-tools-extra/clangd/test/include-cleaner-batch-fix.test b/clang-tools-extra/clangd/test/include-cleaner-batch-fix.test
index 07ebe1009a78f..5a87a87e2f63a 100644
--- a/clang-tools-extra/clangd/test/include-cleaner-batch-fix.test
+++ b/clang-tools-extra/clangd/test/include-cleaner-batch-fix.test
@@ -7,7 +7,9 @@
 # RUN: cp -r %S/Inputs/include-cleaner %t/include
 # RUN: echo '-I%t/include' > %t/compile_flags.txt
 # Create a config file enabling include-cleaner features.
-# RUN: echo $'Diagnostics:\n  UnusedIncludes: Strict\n  MissingIncludes: Strict' >> %t/clangd/config.yaml
+# RUN: echo 'Diagnostics:' > %t/clangd/config.yaml
+# RUN: echo '  UnusedIncludes: Strict' >> %t/clangd/config.yaml
+# RUN: echo '  MissingIncludes: Strict' >> %t/clangd/config.yaml
 
 # RUN: env XDG_CONFIG_HOME=%t clangd -lit-test -enable-config --compile-commands-dir=%t < %s | FileCheck -strict-whitespace %s
 {"jsonrpc":"2.0","id":0,"method":"initialize","params":{"processId":123,"rootPath":"clangd","capabilities":{"workspace":{"workspaceEdit":{"documentChanges":true, "changeAnnotationSupport":{"groupsOnLabel":true}}}},"trace":"off"}}
diff --git a/clang-tools-extra/clangd/test/index-tools.test b/clang-tools-extra/clangd/test/index-tools.test
index 93cf56fea371a..cc01e196f113f 100644
--- a/clang-tools-extra/clangd/test/index-tools.test
+++ b/clang-tools-extra/clangd/test/index-tools.test
@@ -1,6 +1,7 @@
+# Paths are not constructed correctly for the test to run on Windows.
+# UNSUPPORTED: system-windows
+# REQUIRES: have-benchmarks
 # RUN: clangd-indexer %p/Inputs/BenchmarkSource.cpp -- -I%p/Inputs > %t.index
-# FIXME: By default, benchmarks are excluded from the list of default targets hence not built. Find a way to depend on benchmarks to run the next command.
-# REQUIRES: shell
-# RUN: if [ -f %clangd-benchmark-dir/IndexBenchmark ]; then %clangd-benchmark-dir/IndexBenchmark %t.index %p/Inputs/requests.json --benchmark_min_time=0.01 ; fi
+# RUN: %clangd-benchmark-dir/IndexBenchmark %t.index %p/Inputs/requests.json --benchmark_min_time=0.01
 # Pass invalid JSON file and check that IndexBenchmark fails to parse it.
-# RUN: if [ -f %clangd-benchmark-dir/IndexBenchmark ]; then not %clangd-benchmark-dir/IndexBenchmark %t.index %t --benchmark_min_time=0.01 ; fi
+# RUN: not %clangd-benchmark-dir/IndexBenchmark %t.index %t --benchmark_min_time=0.01
diff --git a/clang-tools-extra/clangd/test/lit.cfg.py b/clang-tools-extra/clangd/test/lit.cfg.py
index 8ab4309e337d1..05a0f5e7383e9 100644
--- a/clang-tools-extra/clangd/test/lit.cfg.py
+++ b/clang-tools-extra/clangd/test/lit.cfg.py
@@ -1,4 +1,5 @@
 import lit.llvm
+import lit.util
 
 lit.llvm.initialize(lit_config, config)
 lit.llvm.llvm_config.clang_setup()
@@ -37,6 +38,9 @@ def calculate_arch_features(arch_string):
 if config.have_zlib:
     config.available_features.add("zlib")
 
+if lit.util.pythonize_bool(config.have_benchmarks):
+    config.available_features.add("have-benchmarks")
+
 # It is not realistically possible to account for all options that could
 # possibly be present in system and user configuration files, so disable
 # default configs for the test runs.
diff --git a/clang-tools-extra/clangd/test/lit.site.cfg.py.in b/clang-tools-extra/clangd/test/lit.site.cfg.py.in
index a0bb3561e19ee..f5ae3eb1f0743 100644
--- a/clang-tools-extra/clangd/test/lit.site.cfg.py.in
+++ b/clang-tools-extra/clangd/test/lit.site.cfg.py.in
@@ -19,6 +19,7 @@ config.clangd_build_dexp = @CLANGD_BUILD_DEXP@
 config.clangd_enable_remote = @CLANGD_ENABLE_REMOTE@
 config.clangd_tidy_checks = @CLANGD_TIDY_CHECKS@
 config.have_zlib = @LLVM_ENABLE_ZLIB@
+config.have_benchmarks = "@LLVM_INCLUDE_BENCHMARKS@"
 
 # Delegate logic to lit.cfg.py.
 lit_config.load_config(config, "@CMAKE_CURRENT_SOURCE_DIR@/lit.cfg.py")
diff --git a/clang-tools-extra/clangd/test/system-include-extractor.test b/clang-tools-extra/clangd/test/system-include-extractor.test
index 83a8c28bf7d56..3314be806a801 100644
--- a/clang-tools-extra/clangd/test/system-include-extractor.test
+++ b/clang-tools-extra/clangd/test/system-include-extractor.test
@@ -5,7 +5,8 @@
 
 # Create a bin directory to store the mock-driver and add it to the path
 # RUN: mkdir -p %t.dir/bin
-# RUN: export PATH=%t.dir/bin:$PATH
+# RUN: %python -c "print(__import__('os').environ['PATH'])" > %t.path
+# RUN: export PATH=%t.dir/bin:%{readfile:%t.path}
 # Generate a mock-driver that will print %temp_dir%/my/dir and
 # %temp_dir%/my/dir2 as include search paths.
 # RUN: echo '#!/bin/sh' >> %t.dir/bin/my_driver.sh
diff --git a/clang-tools-extra/docs/ReleaseNotes.rst b/clang-tools-extra/docs/ReleaseNotes.rst
index a6f80e3721db1..9533d56c219f7 100644
--- a/clang-tools-extra/docs/ReleaseNotes.rst
+++ b/clang-tools-extra/docs/ReleaseNotes.rst
@@ -58,6 +58,10 @@ Potentially Breaking Changes
   :program:`clang-tidy-20`. Users should use the check-specific options of the
   same name instead.
 
+- Removed `clang-analyzer-*` checks from the default checks in :program:`clang-tidy`.
+  From now on, users who want Clang Static Analyzer (CSA) checks to run in
+  :program:`clang-tidy` should enable them explicitly via `clang-analyzer-*`.
+
 - Renamed a few :program:`clang-tidy` check options, as they
   were misspelled:
 
@@ -69,7 +73,7 @@ Potentially Breaking Changes
   - `CharTypdefsToIgnore` to `CharTypedefsToIgnore` in
     :doc:`bugprone-signed-char-misuse
     <clang-tidy/checks/bugprone/signed-char-misuse>`
-  
+
 - Modified the custom message format of :doc:`bugprone-unsafe-functions
   <clang-tidy/checks/bugprone/unsafe-functions>` by assigning a special meaning
   to the character ``>`` at the start of the value of the option
@@ -184,6 +188,10 @@ Improvements to clang-tidy
   scripts by adding the `-hide-progress` option to suppress progress and
   informational messages.
 
+- Removed `clang-analyzer-*` checks from the default checks in :program:`clang-tidy`.
+  From now on, users should explicitly specify that they want CSA checks to run
+  in :program:`clang-tidy`.
+
 - Deprecated the :program:`clang-tidy` ``zircon`` module. All checks have been
   moved to the ``fuchsia`` module instead. The ``zircon`` module will be removed
   in the 24th release.
@@ -394,7 +402,7 @@ Changes in existing checks
   <clang-tidy/checks/bugprone/unhandled-self-assignment>` check by adding
   an additional matcher that generalizes the copy-and-swap idiom pattern
   detection.
-  
+
 - Improved :doc:`bugprone-unsafe-functions
   <clang-tidy/checks/bugprone/unsafe-functions>` check by hiding the default
   suffix when the reason starts with the character `>` in the `CustomFunctions`
@@ -424,6 +432,11 @@ Changes in existing checks
   adding an option to allow pointer arithmetic via prefix/postfix increment or
   decrement operators.
 
+- Improved :doc:`cppcoreguidelines-pro-type-member-init
+  <clang-tidy/checks/cppcoreguidelines/pro-type-member-init>` check to
+  correctly ignore ``std::array`` and other array-like containers when the
+  `IgnoreArrays` option is set to `true`.
+
 - Improved :doc:`google-readability-casting
   <clang-tidy/checks/google/readability-casting>` check by adding fix-it
   notes for downcasts and casts to void pointer.
@@ -447,7 +460,8 @@ Changes in existing checks
   positives when pointers is transferred to non-const references
   and avoid false positives of function pointer and fix false
   positives on return of non-const pointer and fix false positives on
-  pointer-to-member operator.
+  pointer-to-member operator and avoid false positives when the address
+  of a variable is taken and passed to a function.
 
 - Improved :doc:`misc-coroutine-hostile-raii
   <clang-tidy/checks/misc/coroutine-hostile-raii>` check by adding the option
@@ -497,7 +511,8 @@ Changes in existing checks
 - Improved :doc:`modernize-use-std-print
   <clang-tidy/checks/modernize/use-std-print>` check to correctly match
   when the format string is converted to a different type by an implicit
-  constructor call.
+  constructor call, and fixed a crash when handling format strings
+  containing non-ASCII characters.
 
 - Improved :doc:`performance-unnecessary-copy-initialization
   <clang-tidy/checks/performance/unnecessary-copy-initialization>` by printing
diff --git a/clang-tools-extra/docs/clang-tidy/checks/cppcoreguidelines/pro-bounds-avoid-unchecked-container-access.rst b/clang-tools-extra/docs/clang-tidy/checks/cppcoreguidelines/pro-bounds-avoid-unchecked-container-access.rst
index fe78ad8056443..38143c94cd3ae 100644
--- a/clang-tools-extra/docs/clang-tidy/checks/cppcoreguidelines/pro-bounds-avoid-unchecked-container-access.rst
+++ b/clang-tools-extra/docs/clang-tidy/checks/cppcoreguidelines/pro-bounds-avoid-unchecked-container-access.rst
@@ -29,9 +29,9 @@ STL containers for which ``operator[]`` is well-defined for all inputs are
 excluded from this check (e.g.: ``std::map::operator[]``).
 
 This check enforces part of the `SL.con.3
-<https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#slcon3-avoid-bounds-errors>`
+<https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#slcon3-avoid-bounds-errors>`_
 guideline and is part of the `Bounds Safety (Bounds 4)
-<https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#pro-bounds-arrayindex>`
+<https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#pro-bounds-arrayindex>`_
 profile from the C++ Core Guidelines.
 
 Options
diff --git a/clang-tools-extra/test/clang-tidy/check_clang_tidy.py b/clang-tools-extra/test/clang-tidy/check_clang_tidy.py
index 183b33f135be8..b173ecf4fbdca 100755
--- a/clang-tools-extra/test/clang-tidy/check_clang_tidy.py
+++ b/clang-tools-extra/test/clang-tidy/check_clang_tidy.py
@@ -398,6 +398,8 @@ def parse_arguments() -> Tuple[argparse.Namespace, List[str]]:
 
 
 def main() -> None:
+    sys.stdout.reconfigure(encoding="utf-8")
+    sys.stderr.reconfigure(encoding="utf-8")
     args, extra_args = parse_arguments()
 
     abbreviated_stds = args.std
diff --git a/clang-tools-extra/test/clang-tidy/checkers/cppcoreguidelines/pro-type-member-init.ignorearrays.cpp b/clang-tools-extra/test/clang-tidy/checkers/cppcoreguidelines/pro-type-member-init.ignorearrays.cpp
index 01859b3ad98f4..e4cfe679cfce9 100644
--- a/clang-tools-extra/test/clang-tidy/checkers/cppcoreguidelines/pro-type-member-init.ignorearrays.cpp
+++ b/clang-tools-extra/test/clang-tidy/checkers/cppcoreguidelines/pro-type-member-init.ignorearrays.cpp
@@ -14,3 +14,39 @@ struct HasArrayMember {
   int RawArray[4];
   int Number;
 };
+
+namespace std {
+template <typename T, int N>
+struct array {
+  T _Elems[N];
+  void fill(const T &);
+};
+}
+
+void test_local_std_array() {
+  std::array<int, 4> a;
+}
+
+struct OnlyArray {
+  int a[4];
+};
+
+void test_local_only_array() {
+  OnlyArray a;
+}
+
+struct Mixed {
+  int a[4];
+  int b;
+};
+
+void test_local_mixed() {
+  Mixed m;
+  // CHECK-MESSAGES: :[[@LINE-1]]:3: warning: uninitialized record type: 'm'
+}
+
+void test_std_array_fill() {
+  std::array<char, 10> someArray;
+  // CHECK-MESSAGES-NOT: warning: uninitialized record type: 'someArray'
+  someArray.fill('n');
+}
diff --git a/clang-tools-extra/test/clang-tidy/checkers/misc/const-correctness-pointer-as-pointers.cpp b/clang-tools-extra/test/clang-tidy/checkers/misc/const-correctness-pointer-as-pointers.cpp
index 4c847b58d395c..0cb58c2e83643 100644
--- a/clang-tools-extra/test/clang-tidy/checkers/misc/const-correctness-pointer-as-pointers.cpp
+++ b/clang-tools-extra/test/clang-tidy/checkers/misc/const-correctness-pointer-as-pointers.cpp
@@ -73,3 +73,18 @@ void ignoreNonConstRefOps() {
   int* p2 {nullptr};
   int*& r2 = (int*&)p2;
 }
+
+void pointer_to_pointer_param(int**);
+void pass_address_to_pointer_to_pointer() {
+  int i = 0;
+  int* ip = &i;
+  // CHECK-NOT: warning
+  pointer_to_pointer_param(&ip);
+}
+
+void void_pointer_to_pointer_param(void**);
+void pass_address_to_void_pointer_to_pointer() {
+  void* ptr = nullptr;
+  // CHECK-NOT: warning
+  void_pointer_to_pointer_param(&ptr);
+}
diff --git a/clang-tools-extra/test/clang-tidy/checkers/modernize/use-std-print.cpp b/clang-tools-extra/test/clang-tidy/checkers/modernize/use-std-print.cpp
index ec37f077df7fc..63972cc0fd25e 100644
--- a/clang-tools-extra/test/clang-tidy/checkers/modernize/use-std-print.cpp
+++ b/clang-tools-extra/test/clang-tidy/checkers/modernize/use-std-print.cpp
@@ -54,6 +54,12 @@ void printf_deceptive_newline() {
   // CHECK-FIXES: std::println("Hello");
 }
 
+void printf_utf8_text() {
+  printf("你好世界\n");
+  // CHECK-MESSAGES: [[@LINE-1]]:3: warning: use 'std::println' instead of 'printf' [modernize-use-std-print]
+  // CHECK-FIXES: std::println("你好世界");
+}
+
 void printf_crlf_newline() {
   printf("Hello\r\n");
   // CHECK-MESSAGES: [[@LINE-1]]:3: warning: use 'std::print' instead of 'printf' [modernize-use-std-print]
@@ -303,6 +309,12 @@ void fprintf_simple() {
   // CHECK-FIXES: std::print(stderr, "Hello");
 }
 
+void fprintf_utf8_text() {
+  fprintf(stderr, "你好世界\n");
+  // CHECK-MESSAGES: [[@LINE-1]]:3: warning: use 'std::println' instead of 'fprintf' [modernize-use-std-print]
+  // CHECK-FIXES: std::println(stderr, "你好世界");
+}
+
 void std_printf_simple() {
   std::printf("std::Hello");
   // CHECK-MESSAGES: [[@LINE-1]]:3: warning: use 'std::print' instead of 'printf' [modernize-use-std-print]
diff --git a/clang-tools-extra/test/clang-tidy/checkers/readability/redundant-typename.cpp b/clang-tools-extra/test/clang-tidy/checkers/readability/redundant-typename.cpp
index 2efafd1a9a649..e8fcd9bcd5731 100644
--- a/clang-tools-extra/test/clang-tidy/checkers/readability/redundant-typename.cpp
+++ b/clang-tools-extra/test/clang-tidy/checkers/readability/redundant-typename.cpp
@@ -267,3 +267,27 @@ WHOLE_TYPE_IN_MACRO Macro2;
 
 #define WHOLE_DECLARATION_IN_MACRO typename NotDependent::R Macro3
 WHOLE_DECLARATION_IN_MACRO;
+
+template <typename T> struct Wrapper {};
+template <typename T>
+struct ClassWrapper {
+    using R = T;
+    Wrapper<R> f();
+};
+
+template <typename T>
+Wrapper<typename ClassWrapper<T>::R> ClassWrapper<T>::f() {
+    return {};
+}
+
+template <typename T> struct StructWrapper {};
+template <typename T>
+class ClassWithNestedStruct {
+  struct Nested {};
+  StructWrapper<Nested> f();
+};
+
+template <typename T>
+StructWrapper<typename ClassWithNestedStruct<T>::Nested> ClassWithNestedStruct<T>::f() {
+  return {};
+}
diff --git a/clang/docs/LibASTMatchersReference.html b/clang/docs/LibASTMatchersReference.html
index ac1abb4d9f381..e34ac30b8f5a4 100644
--- a/clang/docs/LibASTMatchersReference.html
+++ b/clang/docs/LibASTMatchersReference.html
@@ -2449,6 +2449,18 @@ <h2 id="decl-matchers">Node Matchers</h2>
 </pre></td></tr>
 
 
+<tr><td>Matcher<<a href="https://clang.llvm.org/doxygen/classclang_1_1TypeLoc.html">TypeLoc</a>></td><td class="name" onclick="toggle('arrayTypeLoc0')"><a name="arrayTypeLoc0Anchor">arrayTypeLoc</a></td><td>Matcher<<a href="https://clang.llvm.org/doxygen/classclang_1_1ArrayTypeLoc.html">ArrayTypeLoc</a>>...</td></tr>
+<tr><td colspan="4" class="doc" id="arrayTypeLoc0"><pre>Matches `ArrayTypeLoc`s.
+
+Given
+  int a[] = {1, 2};
+  int b[3];
+  void f() { int c[a[0]]; }
+arrayTypeLoc()
+  matches "int a[]", "int b[3]" and "int c[a[0]]".
+</pre></td></tr>
+
+
 <tr><td>Matcher<<a href="https://clang.llvm.org/doxygen/classclang_1_1TypeLoc.html">TypeLoc</a>></td><td class="name" onclick="toggle('pointerTypeLoc0')"><a name="pointerTypeLoc0Anchor">pointerTypeLoc</a></td><td>Matcher<<a href="https://clang.llvm.org/doxygen/classclang_1_1PointerTypeLoc.html">PointerTypeLoc</a>>...</td></tr>
 <tr><td colspan="4" class="doc" id="pointerTypeLoc0"><pre>Matches pointer `TypeLoc`s.
 
diff --git a/clang/docs/OpenMPSupport.rst b/clang/docs/OpenMPSupport.rst
index e7ca7b0bd0792..ab3f2c48983ca 100644
--- a/clang/docs/OpenMPSupport.rst
+++ b/clang/docs/OpenMPSupport.rst
@@ -266,7 +266,7 @@ implementation.
 +------------------------------+--------------------------------------------------------------+--------------------------+-----------------------------------------------------------------------+
 | device                       | has_device_addr clause on target construct                   | :none:`unclaimed`        |                                                                       |
 +------------------------------+--------------------------------------------------------------+--------------------------+-----------------------------------------------------------------------+
-| device                       | iterators in map clause or motion clauses                    | :none:`unclaimed`        |                                                                       |
+| device                       | iterators in map clause or motion clauses                    | :good:`done`             | https://github.com/llvm/llvm-project/pull/159112                      |
 +------------------------------+--------------------------------------------------------------+--------------------------+-----------------------------------------------------------------------+
 | device                       | indirect clause on declare target directive                  | :part:`In Progress`      |                                                                       |
 +------------------------------+--------------------------------------------------------------+--------------------------+-----------------------------------------------------------------------+
diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index da064534c25d9..654a8e48cd104 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -360,6 +360,12 @@ Attribute Changes in Clang
   attribute, but `malloc_span` applies not to functions returning pointers, but to functions returning
   span-like structures (i.e. those that contain a pointer field and a size integer field or two pointers).
 
+- Added new attribute ``modular_format`` to allow dynamically selecting at link
+  time which aspects of a statically linked libc's printf (et al)
+  implementation are required. This can reduce code size without requiring e.g.
+  multilibs for printf features. Requires cooperation with the libc
+  implementation.
+
 Improvements to Clang's diagnostics
 -----------------------------------
 - Diagnostics messages now refer to ``structured binding`` instead of ``decomposition``,
@@ -448,6 +454,11 @@ Improvements to Clang's diagnostics
 - A new warning ``-Wenum-compare-typo`` has been added to detect potential erroneous
   comparison operators when mixed with bitwise operators in enum value initializers.
   This can be locally disabled by explicitly casting the initializer value.
+- Clang now provides correct caret placement when attributes appear before
+  ``enum class`` (#GH163224).
+
+- A new warning ``-Wshadow-header`` has been added to detect when a header file
+  is found in multiple search directories (excluding system paths).
 
 Improvements to Clang's time-trace
 ----------------------------------
@@ -563,6 +574,7 @@ Bug Fixes to C++ Support
 - Fix for clang incorrectly rejecting the default construction of a union with
   nontrivial member when another member has an initializer. (#GH81774)
 - Fixed a template depth issue when parsing lambdas inside a type constraint. (#GH162092)
+- Fix the support of zero-length arrays in SFINAE context. (#GH170040)
 - Diagnose unresolved overload sets in non-dependent compound requirements. (#GH51246) (#GH97753)
 - Fix a crash when extracting unavailable member type from alias in template deduction. (#GH165560)
 - Fix incorrect diagnostics for lambdas with init-captures inside braced initializers. (#GH163498)
@@ -641,6 +653,9 @@ RISC-V Support
 - `__GCC_CONSTRUCTIVE_SIZE` and `__GCC_DESTRUCTIVE_SIZE` are changed to 64. These values are
   unstable according to `Clang's documentation <https://clang.llvm.org/docs/LanguageExtensions.html#gcc-destructive-size-and-gcc-constructive-size>`_.
 
+- DWARF fission is now compatible with linker relaxations, allowing `-gsplit-dwarf` and `-mrelax`
+  to be used together when building for the RISC-V platform.
+
 CUDA/HIP Language Changes
 ^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -689,6 +704,7 @@ AST Matchers
 - Fixed detection of explicit parameter lists in ``LambdaExpr``. (#GH168452)
 - Added ``hasExplicitParameters`` for ``LambdaExpr`` as an output attribute to
   AST JSON dumps.
+- Added the ``arrayTypeLoc`` matcher for matching ``ArrayTypeLoc``.
 
 clang-format
 ------------
@@ -732,6 +748,8 @@ Crash and bug fixes
   ``[[assume(expr)]]`` attribute was enclosed in parentheses.  (#GH151529)
 - Fixed a crash when parsing ``#embed`` parameters with unmatched closing brackets. (#GH152829)
 - Fixed a crash when compiling ``__real__`` or ``__imag__`` unary operator on scalar value with type promotion. (#GH160583)
+- Fixed a crash when parsing invalid nested name specifier sequences
+  containing a single colon. (#GH167905)
 
 Improvements
 ^^^^^^^^^^^^
diff --git a/clang/docs/StandardCPlusPlusModules.rst b/clang/docs/StandardCPlusPlusModules.rst
index 7155ad6cff83f..71988d0fced98 100644
--- a/clang/docs/StandardCPlusPlusModules.rst
+++ b/clang/docs/StandardCPlusPlusModules.rst
@@ -28,7 +28,10 @@ Standard C++ Named modules
 In order to better understand the compiler's behavior, it is helpful to
 understand some terms and definitions for readers who are not familiar with the
 C++ feature. This document is not a tutorial on C++; it only introduces
-necessary concepts to better understand use of modules in a project.
+necessary concepts to better understand use of modules in a project. Other
+resources at `Wikipedia <https://en.wikipedia.org/wiki/Modules_(C++)>`_ and
+`cppreference <https://en.cppreference.com/w/cpp/language/modules.html>`_ can
+provide more background information about modules if needed.
 
 Background and terminology
 --------------------------
diff --git a/clang/include/clang/AST/ASTConsumer.h b/clang/include/clang/AST/ASTConsumer.h
index 447f2592d2359..a1ef187ee2069 100644
--- a/clang/include/clang/AST/ASTConsumer.h
+++ b/clang/include/clang/AST/ASTConsumer.h
@@ -27,6 +27,7 @@ namespace clang {
   class VarDecl;
   class FunctionDecl;
   class ImportDecl;
+  class OpenACCRoutineDecl;
 
 /// ASTConsumer - This is an abstract interface that should be implemented by
 /// clients that read ASTs.  This abstraction layer allows the client to be
@@ -116,6 +117,11 @@ class ASTConsumer {
   // variable has been instantiated.
   virtual void HandleCXXStaticMemberVarInstantiation(VarDecl *D) {}
 
+  /// Callback to handle the end-of-translation-unit attachment of OpenACC
+  /// routine declaration information.
+  virtual void HandleOpenACCRoutineReference(const FunctionDecl *FD,
+                                             const OpenACCRoutineDecl *RD) {}
+
   /// Callback involved at the end of a translation unit to
   /// notify the consumer that a vtable for the given C++ class is
   /// required.
diff --git a/clang/include/clang/AST/ASTContext.h b/clang/include/clang/AST/ASTContext.h
index 33aa2d343aa7a..f64e29be3205f 100644
--- a/clang/include/clang/AST/ASTContext.h
+++ b/clang/include/clang/AST/ASTContext.h
@@ -488,6 +488,10 @@ class ASTContext : public RefCountedBase<ASTContext> {
 
   /// Declaration for the CUDA cudaConfigureCall function.
   FunctionDecl *cudaConfigureCallDecl = nullptr;
+  /// Declaration for the CUDA cudaGetParameterBuffer function.
+  FunctionDecl *cudaGetParameterBufferDecl = nullptr;
+  /// Declaration for the CUDA cudaLaunchDevice function.
+  FunctionDecl *cudaLaunchDeviceDecl = nullptr;
 
   /// Keeps track of all declaration attributes.
   ///
@@ -1641,6 +1645,18 @@ class ASTContext : public RefCountedBase<ASTContext> {
     return cudaConfigureCallDecl;
   }
 
+  void setcudaGetParameterBufferDecl(FunctionDecl *FD) {
+    cudaGetParameterBufferDecl = FD;
+  }
+
+  FunctionDecl *getcudaGetParameterBufferDecl() {
+    return cudaGetParameterBufferDecl;
+  }
+
+  void setcudaLaunchDeviceDecl(FunctionDecl *FD) { cudaLaunchDeviceDecl = FD; }
+
+  FunctionDecl *getcudaLaunchDeviceDecl() { return cudaLaunchDeviceDecl; }
+
   /// Returns true iff we need copy/dispose helpers for the given type.
   bool BlockRequiresCopying(QualType Ty, const VarDecl *D);
 
diff --git a/clang/include/clang/AST/CXXInheritance.h b/clang/include/clang/AST/CXXInheritance.h
index e89326081a180..72d365bfbc1f3 100644
--- a/clang/include/clang/AST/CXXInheritance.h
+++ b/clang/include/clang/AST/CXXInheritance.h
@@ -192,7 +192,7 @@ class CXXBasePaths {
   /// Determine whether the path from the most-derived type to the
   /// given base type is ambiguous (i.e., it refers to multiple subobjects of
   /// the same base type).
-  bool isAmbiguous(CanQualType BaseType);
+  bool isAmbiguous(CanQualType BaseType) const;
 
   /// Whether we are finding multiple paths to detect ambiguities.
   bool isFindingAmbiguities() const { return FindAmbiguities; }
diff --git a/clang/include/clang/AST/Decl.h b/clang/include/clang/AST/Decl.h
index ee2321dd158d4..2e8ceff453547 100644
--- a/clang/include/clang/AST/Decl.h
+++ b/clang/include/clang/AST/Decl.h
@@ -4040,6 +4040,11 @@ class EnumDecl : public TagDecl {
   /// and can be accessed with the provided accessors.
   unsigned ODRHash;
 
+  /// Source range covering the enum key:
+  ///  - 'enum'              (unscoped)
+  ///  - 'enum class|struct' (scoped)
+  SourceRange EnumKeyRange;
+
   EnumDecl(ASTContext &C, DeclContext *DC, SourceLocation StartLoc,
            SourceLocation IdLoc, IdentifierInfo *Id, EnumDecl *PrevDecl,
            bool Scoped, bool ScopedUsingClassTag, bool Fixed);
@@ -4077,6 +4082,10 @@ class EnumDecl : public TagDecl {
   /// Microsoft-style enumeration with a fixed underlying type.
   void setFixed(bool Fixed = true) { EnumDeclBits.IsFixed = Fixed; }
 
+  SourceRange getEnumKeyRange() const { return EnumKeyRange; }
+
+  void setEnumKeyRange(SourceRange Range) { EnumKeyRange = Range; }
+
 private:
   /// True if a valid hash is stored in ODRHash.
   bool hasODRHash() const { return EnumDeclBits.HasODRHash; }
@@ -4524,6 +4533,11 @@ class RecordDecl : public TagDecl {
     return field_begin() == field_end();
   }
 
+  /// Returns the number of fields (non-static data members) in this record.
+  unsigned getNumFields() const {
+    return std::distance(field_begin(), field_end());
+  }
+
   /// noload_fields - Iterate over the fields stored in this record
   /// that are currently loaded; don't attempt to retrieve anything
   /// from an external source.
diff --git a/clang/include/clang/AST/OpenMPClause.h b/clang/include/clang/AST/OpenMPClause.h
index 72c5efde7449b..d9c3cf239451e 100644
--- a/clang/include/clang/AST/OpenMPClause.h
+++ b/clang/include/clang/AST/OpenMPClause.h
@@ -7582,7 +7582,8 @@ class OMPToClause final : public OMPMappableExprListClause<OMPToClause>,
 
   /// Motion-modifiers for the 'to' clause.
   OpenMPMotionModifierKind MotionModifiers[NumberOfOMPMotionModifiers] = {
-      OMPC_MOTION_MODIFIER_unknown, OMPC_MOTION_MODIFIER_unknown};
+      OMPC_MOTION_MODIFIER_unknown, OMPC_MOTION_MODIFIER_unknown,
+      OMPC_MOTION_MODIFIER_unknown};
 
   /// Location of motion-modifiers for the 'to' clause.
   SourceLocation MotionModifiersLoc[NumberOfOMPMotionModifiers];
@@ -7654,6 +7655,9 @@ class OMPToClause final : public OMPMappableExprListClause<OMPToClause>,
     MotionModifiersLoc[I] = TLoc;
   }
 
+  void setIteratorModifier(Expr *IteratorModifier) {
+    getTrailingObjects<Expr *>()[2 * varlist_size()] = IteratorModifier;
+  }
   /// Set colon location.
   void setColonLoc(SourceLocation Loc) { ColonLoc = Loc; }
 
@@ -7662,7 +7666,7 @@ class OMPToClause final : public OMPMappableExprListClause<OMPToClause>,
   size_t numTrailingObjects(OverloadToken<Expr *>) const {
     // There are varlist_size() of expressions, and varlist_size() of
     // user-defined mappers.
-    return 2 * varlist_size();
+    return 2 * varlist_size() + 1;
   }
   size_t numTrailingObjects(OverloadToken<ValueDecl *>) const {
     return getUniqueDeclarationsNum();
@@ -7688,15 +7692,14 @@ class OMPToClause final : public OMPMappableExprListClause<OMPToClause>,
   /// \param UDMQualifierLoc C++ nested name specifier for the associated
   /// user-defined mapper.
   /// \param MapperId The identifier of associated user-defined mapper.
-  static OMPToClause *Create(const ASTContext &C, const OMPVarListLocTy &Locs,
-                             ArrayRef<Expr *> Vars,
-                             ArrayRef<ValueDecl *> Declarations,
-                             MappableExprComponentListsRef ComponentLists,
-                             ArrayRef<Expr *> UDMapperRefs,
-                             ArrayRef<OpenMPMotionModifierKind> MotionModifiers,
-                             ArrayRef<SourceLocation> MotionModifiersLoc,
-                             NestedNameSpecifierLoc UDMQualifierLoc,
-                             DeclarationNameInfo MapperId);
+  static OMPToClause *
+  Create(const ASTContext &C, const OMPVarListLocTy &Locs,
+         ArrayRef<Expr *> Vars, ArrayRef<ValueDecl *> Declarations,
+         MappableExprComponentListsRef ComponentLists,
+         ArrayRef<Expr *> UDMapperRefs, Expr *IteratorModifier,
+         ArrayRef<OpenMPMotionModifierKind> MotionModifiers,
+         ArrayRef<SourceLocation> MotionModifiersLoc,
+         NestedNameSpecifierLoc UDMQualifierLoc, DeclarationNameInfo MapperId);
 
   /// Creates an empty clause with the place for \a NumVars variables.
   ///
@@ -7717,7 +7720,9 @@ class OMPToClause final : public OMPMappableExprListClause<OMPToClause>,
            "Requested modifier exceeds the total number of modifiers.");
     return MotionModifiers[Cnt];
   }
-
+  Expr *getIteratorModifier() const {
+    return getTrailingObjects<Expr *>()[2 * varlist_size()];
+  }
   /// Fetches the motion-modifier location at 'Cnt' index of array of modifiers'
   /// locations.
   ///
@@ -7782,7 +7787,8 @@ class OMPFromClause final
 
   /// Motion-modifiers for the 'from' clause.
   OpenMPMotionModifierKind MotionModifiers[NumberOfOMPMotionModifiers] = {
-      OMPC_MOTION_MODIFIER_unknown, OMPC_MOTION_MODIFIER_unknown};
+      OMPC_MOTION_MODIFIER_unknown, OMPC_MOTION_MODIFIER_unknown,
+      OMPC_MOTION_MODIFIER_unknown};
 
   /// Location of motion-modifiers for the 'from' clause.
   SourceLocation MotionModifiersLoc[NumberOfOMPMotionModifiers];
@@ -7843,7 +7849,9 @@ class OMPFromClause final
            "Unexpected index to store motion modifier, exceeds array size.");
     MotionModifiers[I] = T;
   }
-
+  void setIteratorModifier(Expr *IteratorModifier) {
+    getTrailingObjects<Expr *>()[2 * varlist_size()] = IteratorModifier;
+  }
   /// Set location for the motion-modifier.
   ///
   /// \param I index for motion-modifier location.
@@ -7862,7 +7870,7 @@ class OMPFromClause final
   size_t numTrailingObjects(OverloadToken<Expr *>) const {
     // There are varlist_size() of expressions, and varlist_size() of
     // user-defined mappers.
-    return 2 * varlist_size();
+    return 2 * varlist_size() + 1;
   }
   size_t numTrailingObjects(OverloadToken<ValueDecl *>) const {
     return getUniqueDeclarationsNum();
@@ -7892,7 +7900,7 @@ class OMPFromClause final
   Create(const ASTContext &C, const OMPVarListLocTy &Locs,
          ArrayRef<Expr *> Vars, ArrayRef<ValueDecl *> Declarations,
          MappableExprComponentListsRef ComponentLists,
-         ArrayRef<Expr *> UDMapperRefs,
+         ArrayRef<Expr *> UDMapperRefs, Expr *IteratorExpr,
          ArrayRef<OpenMPMotionModifierKind> MotionModifiers,
          ArrayRef<SourceLocation> MotionModifiersLoc,
          NestedNameSpecifierLoc UDMQualifierLoc, DeclarationNameInfo MapperId);
@@ -7916,7 +7924,9 @@ class OMPFromClause final
            "Requested modifier exceeds the total number of modifiers.");
     return MotionModifiers[Cnt];
   }
-
+  Expr *getIteratorModifier() const {
+    return getTrailingObjects<Expr *>()[2 * varlist_size()];
+  }
   /// Fetches the motion-modifier location at 'Cnt' index of array of modifiers'
   /// locations.
   ///
diff --git a/clang/include/clang/AST/OperationKinds.def b/clang/include/clang/AST/OperationKinds.def
index c2dca895e8411..8a13ad988403b 100644
--- a/clang/include/clang/AST/OperationKinds.def
+++ b/clang/include/clang/AST/OperationKinds.def
@@ -364,6 +364,9 @@ CAST_OPERATION(IntToOCLSampler)
 // Truncate a vector type by dropping elements from the end (HLSL only).
 CAST_OPERATION(HLSLVectorTruncation)
 
+// Truncate a matrix type by dropping elements from the end (HLSL only).
+CAST_OPERATION(HLSLMatrixTruncation)
+
 // Non-decaying array RValue cast (HLSL only).
 CAST_OPERATION(HLSLArrayRValue)
 
diff --git a/clang/include/clang/ASTMatchers/ASTMatchers.h b/clang/include/clang/ASTMatchers/ASTMatchers.h
index bca2d8425b3f5..e3ec26207d2bc 100644
--- a/clang/include/clang/ASTMatchers/ASTMatchers.h
+++ b/clang/include/clang/ASTMatchers/ASTMatchers.h
@@ -7003,6 +7003,19 @@ AST_MATCHER_P(ReferenceTypeLoc, hasReferentLoc, internal::Matcher<TypeLoc>,
   return ReferentMatcher.matches(Node.getPointeeLoc(), Finder, Builder);
 }
 
+/// Matches `ArrayTypeLoc`s.
+///
+/// Given
+/// \code
+///   int a[] = {1, 2};
+///   int b[3];
+///   void f() { int c[a[0]]; }
+/// \endcode
+/// arrayTypeLoc()
+///   matches "int a[]", "int b[3]" and "int c[a[0]]".
+extern const internal::VariadicDynCastAllOfMatcher<TypeLoc, ArrayTypeLoc>
+    arrayTypeLoc;
+
 /// Matches template specialization `TypeLoc`s.
 ///
 /// Given
diff --git a/clang/include/clang/Basic/Attr.td b/clang/include/clang/Basic/Attr.td
index 8e5f7ef0bb82d..d8d1675f245a1 100644
--- a/clang/include/clang/Basic/Attr.td
+++ b/clang/include/clang/Basic/Attr.td
@@ -508,7 +508,7 @@ def TargetMicrosoftRecordLayout : TargetArch<["x86", "x86_64", "arm", "thumb",
   let CustomCode = [{ Target.hasMicrosoftRecordLayout() }];
 }
 
-def TargetMustTailAvaiable:  TargetSpec {
+def TargetMustTailAvailable:  TargetSpec {
   let CustomCode = [{ Target.hasMustTail() }];
 }
 
@@ -1917,7 +1917,7 @@ def NoMerge : DeclOrStmtAttr {
                              "functions, statements and variables">;
 }
 
-def MustTail : StmtAttr, TargetSpecificAttr<TargetMustTailAvaiable> {
+def MustTail : StmtAttr, TargetSpecificAttr<TargetMustTailAvailable> {
   let Spellings = [Clang<"musttail">];
   let Documentation = [MustTailDocs];
   let Subjects = SubjectList<[ReturnStmt], ErrorDiag, "return statements">;
@@ -5172,6 +5172,14 @@ def HLSLVkConstantId : InheritableAttr {
   let Documentation = [VkConstantIdDocs];
 }
 
+def HLSLVkLocation : HLSLAnnotationAttr {
+  let Spellings = [CXX11<"vk", "location">];
+  let Args = [IntArgument<"Location">];
+  let Subjects = SubjectList<[ParmVar, Field, Function], ErrorDiag>;
+  let LangOpts = [HLSL];
+  let Documentation = [HLSLVkLocationDocs];
+}
+
 def RandomizeLayout : InheritableAttr {
   let Spellings = [GCC<"randomize_layout">];
   let Subjects = SubjectList<[Record]>;
@@ -5323,3 +5331,11 @@ def NonString : InheritableAttr {
   let Subjects = SubjectList<[Var, Field]>;
   let Documentation = [NonStringDocs];
 }
+
+def ModularFormat : InheritableAttr {
+  let Spellings = [Clang<"modular_format">];
+  let Args = [IdentifierArgument<"ModularImplFn">, StringArgument<"ImplName">,
+              VariadicStringArgument<"Aspects">];
+  let Subjects = SubjectList<[Function]>;
+  let Documentation = [ModularFormatDocs];
+}
diff --git a/clang/include/clang/Basic/AttrDocs.td b/clang/include/clang/Basic/AttrDocs.td
index c1b1510f363d4..ae929c7dea37d 100644
--- a/clang/include/clang/Basic/AttrDocs.td
+++ b/clang/include/clang/Basic/AttrDocs.td
@@ -8981,6 +8981,18 @@ The descriptor set is optional and defaults to 0 if not provided.
   }];
 }
 
+def HLSLVkLocationDocs : Documentation {
+  let Category = DocCatVariable;
+  let Content = [{
+Attribute used to specify the location number for stage input/output
+variables. It is allowed on function parameters, function return values, and
+struct fields. The attribute has no effect when used outside of an entrypoint
+parameter, parameter field, or return value.
+
+This attribute maps to the 'Location' SPIR-V decoration.
+  }];
+}
+
 def WebAssemblyFuncrefDocs : Documentation {
   let Category = DocCatType;
   let Content = [{
@@ -9630,3 +9642,43 @@ silence diagnostics with code like:
   __attribute__((nonstring)) char NotAStr[3] = "foo"; // Not diagnosed
   }];
 }
+
+def ModularFormatDocs : Documentation {
+  let Category = DocCatFunction;
+  let Content = [{
+The ``modular_format`` attribute can be applied to a function that bears the
+``format`` attribute (or standard library functions) to indicate that the
+implementation is "modular", that is, that the implementation is logically
+divided into a number of named aspects. When the compiler can determine that
+not all aspects of the implementation are needed for a given call, the compiler
+may redirect the call to the identifier given as the first argument to the
+attribute (the modular implementation function).
+
+The second argument is an implementation name, and the remaining arguments are
+aspects of the format string for the compiler to report. The implementation
+name is an unevaluated identifier in the C namespace.
+
+The compiler reports that a call requires an aspect by issuing a relocation for
+the symbol ``<impl_name>_<aspect>`` at the point of the call. This arranges for
+code and data needed to support the aspect of the implementation to be brought
+into the link to satisfy weak references in the modular implementation function.
+If the compiler does not understand an aspect, it must summarily consider any
+call to require that aspect.
+
+For example, say ``printf`` is annotated with
+``modular_format(__modular_printf, "__printf", "float")``. Then, a call to
+``printf(var, 42)`` would be untouched. A call to ``printf("%d", 42)`` would
+become a call to ``__modular_printf`` with the same arguments, as would
+``printf("%f", 42.0)``. The latter would be accompanied with a strong
+relocation against the symbol ``__printf_float``, which would bring floating
+point support for ``printf`` into the link.
+
+If the attribute appears more than once on a declaration, or across a chain of
+redeclarations, it is an error for the attributes to have different arguments,
+except that the aspects may be listed in any order.
+
+The following aspects are currently supported:
+
+- ``float``: The call has a floating-point argument
+  }];
+}
diff --git a/clang/include/clang/Basic/BuiltinsX86.td b/clang/include/clang/Basic/BuiltinsX86.td
index 98cea35beb0ea..a4b7215d6334d 100644
--- a/clang/include/clang/Basic/BuiltinsX86.td
+++ b/clang/include/clang/Basic/BuiltinsX86.td
@@ -166,14 +166,20 @@ let Features = "sse2", Attributes = [NoThrow] in {
   def movnti : X86Builtin<"void(int *, int)">;
 }
 
+let Features = "sse2", Attributes = [NoThrow, Const, Constexpr, RequiredVectorWidth<128>] in {
+  def cvtpd2ps : X86Builtin<"_Vector<4, float>(_Vector<2, double>)">;
+  def cvtsd2ss : X86Builtin<"_Vector<4, float>(_Vector<4, float>, _Vector<2, double>)">;
+}
+let Features = "avx512f", Attributes = [NoThrow, Const, Constexpr, RequiredVectorWidth<128>] in {
+  def cvtsd2ss_round_mask : X86Builtin<"_Vector<4, float>(_Vector<4, float>, _Vector<2, double>, _Vector<4, float>, unsigned char, _Constant int)">;
+}
+
 let Features = "sse2", Attributes = [NoThrow, Const, RequiredVectorWidth<128>] in {
   def psadbw128 : X86Builtin<"_Vector<2, long long int>(_Vector<16, char>, _Vector<16, char>)">;
   def cvtpd2dq : X86Builtin<"_Vector<2, long long int>(_Vector<2, double>)">;
-  def cvtpd2ps : X86Builtin<"_Vector<4, float>(_Vector<2, double>)">;
   def cvttpd2dq : X86Builtin<"_Vector<4, int>(_Vector<2, double>)">;
   def cvtsd2si : X86Builtin<"int(_Vector<2, double>)">;
   def cvttsd2si : X86Builtin<"int(_Vector<2, double>)">;
-  def cvtsd2ss : X86Builtin<"_Vector<4, float>(_Vector<4, float>, _Vector<2, double>)">;
   def cvtps2dq : X86Builtin<"_Vector<4, int>(_Vector<4, float>)">;
   def cvttps2dq : X86Builtin<"_Vector<4, int>(_Vector<4, float>)">;
 }
@@ -462,11 +468,14 @@ let Features = "avx", Attributes = [NoThrow, Const, Constexpr, RequiredVectorWid
   def vpermilvarps256 : X86Builtin<"_Vector<8, float>(_Vector<8, float>, _Vector<8, int>)">;
 }
 
+let Features = "avx", Attributes = [NoThrow, Const, Constexpr, RequiredVectorWidth<256>] in {
+  def cvtpd2ps256 : X86Builtin<"_Vector<4, float>(_Vector<4, double>)">;
+}
+
 let Features = "avx", Attributes = [NoThrow, Const, RequiredVectorWidth<256>] in {
   def dpps256 : X86Builtin<"_Vector<8, float>(_Vector<8, float>, _Vector<8, float>, _Constant char)">;
   def cmppd256 : X86Builtin<"_Vector<4, double>(_Vector<4, double>, _Vector<4, double>, _Constant char)">;
   def cmpps256 : X86Builtin<"_Vector<8, float>(_Vector<8, float>, _Vector<8, float>, _Constant char)">;
-  def cvtpd2ps256 : X86Builtin<"_Vector<4, float>(_Vector<4, double>)">;
   def cvtps2dq256 : X86Builtin<"_Vector<8, int>(_Vector<8, float>)">;
   def cvttpd2dq256 : X86Builtin<"_Vector<4, int>(_Vector<4, double>)">;
   def cvtpd2dq256 : X86Builtin<"_Vector<4, int>(_Vector<4, double>)">;
@@ -474,7 +483,6 @@ let Features = "avx", Attributes = [NoThrow, Const, RequiredVectorWidth<256>] in
   def vperm2f128_pd256 : X86Builtin<"_Vector<4, double>(_Vector<4, double>, _Vector<4, double>, _Constant int)">;
   def vperm2f128_ps256 : X86Builtin<"_Vector<8, float>(_Vector<8, float>, _Vector<8, float>, _Constant int)">;
   def vperm2f128_si256 : X86Builtin<"_Vector<8, int>(_Vector<8, int>, _Vector<8, int>, _Constant int)">;
-
   foreach Op = ["max", "min"] in {
     def Op#pd256 : X86Builtin<"_Vector<4, double>(_Vector<4, double>, _Vector<4, double>)">;
     def Op#ps256 : X86Builtin<"_Vector<8, float>(_Vector<8, float>, _Vector<8, float>)">;
@@ -577,13 +585,14 @@ let Features = "avx2", Attributes = [NoThrow, Const, RequiredVectorWidth<256>] i
   def psadbw256
       : X86Builtin<
             "_Vector<4, long long int>(_Vector<32, char>, _Vector<32, char>)">;
-  def permdf256 : X86Builtin<"_Vector<4, double>(_Vector<4, double>, _Constant int)">;
   def permti256 : X86Builtin<"_Vector<4, long long int>(_Vector<4, long long int>, _Vector<4, long long int>, _Constant int)">;
-  def permdi256 : X86Builtin<"_Vector<4, long long int>(_Vector<4, long long int>, _Constant int)">;
 }
 
-
 let Features = "avx2", Attributes = [NoThrow, Const, Constexpr, RequiredVectorWidth<256>] in {
+  def permdf256
+      : X86Builtin<"_Vector<4, double>(_Vector<4, double>, _Constant int)">;
+  def permdi256 : X86Builtin<"_Vector<4, long long int>(_Vector<4, long long "
+                             "int>, _Constant int)">;
   def pmovmskb256 : X86Builtin<"int(_Vector<32, char>)">;
   def pavgb256 : X86Builtin<"_Vector<32, unsigned char>(_Vector<32, unsigned char>, _Vector<32, unsigned char>)">;
   def pavgw256 : X86Builtin<"_Vector<16, unsigned short>(_Vector<16, unsigned short>, _Vector<16, unsigned short>)">;
@@ -1004,6 +1013,10 @@ let Features = "avx512vl", Attributes = [NoThrow, Const, RequiredVectorWidth<128
   def cmppd128_mask : X86Builtin<"unsigned char(_Vector<2, double>, _Vector<2, double>, _Constant int, unsigned char)">;
 }
 
+let Features = "avx512f", Attributes = [NoThrow, Const, Constexpr, RequiredVectorWidth<512>] in {
+  def cvtpd2ps512_mask : X86Builtin<"_Vector<8, float>(_Vector<8, double>, _Vector<8, float>, unsigned char, _Constant int)">;
+}
+
 let Features = "avx512f", Attributes = [NoThrow, Const, RequiredVectorWidth<512>] in {
   def rndscaleps_mask : X86Builtin<"_Vector<16, float>(_Vector<16, float>, _Constant int, _Vector<16, float>, unsigned short, _Constant int)">;
   def rndscalepd_mask : X86Builtin<"_Vector<8, double>(_Vector<8, double>, _Constant int, _Vector<8, double>, unsigned char, _Constant int)">;
@@ -1017,7 +1030,6 @@ let Features = "avx512f", Attributes = [NoThrow, Const, RequiredVectorWidth<512>
   def maxpd512 : X86Builtin<"_Vector<8, double>(_Vector<8, double>, _Vector<8, double>, _Constant int)">;
   def cvtdq2ps512_mask : X86Builtin<"_Vector<16, float>(_Vector<16, int>, _Vector<16, float>, unsigned short, _Constant int)">;
   def cvtudq2ps512_mask : X86Builtin<"_Vector<16, float>(_Vector<16, int>, _Vector<16, float>, unsigned short, _Constant int)">;
-  def cvtpd2ps512_mask : X86Builtin<"_Vector<8, float>(_Vector<8, double>, _Vector<8, float>, unsigned char, _Constant int)">;
   def vcvtps2ph512_mask : X86Builtin<"_Vector<16, short>(_Vector<16, float>, _Constant int, _Vector<16, short>, unsigned short)">;
   def vcvtph2ps512_mask : X86Builtin<"_Vector<16, float>(_Vector<16, short>, _Vector<16, float>, unsigned short, _Constant int)">;
 }
@@ -1452,9 +1464,12 @@ let Features = "avx512vl", Attributes = [NoThrow, RequiredVectorWidth<256>] in {
   def compressstoresi256_mask : X86Builtin<"void(_Vector<8, int *>, _Vector<8, int>, unsigned char)">;
 }
 
+let Features = "avx512vl", Attributes = [NoThrow, Const, Constexpr, RequiredVectorWidth<128>] in {
+  def cvtpd2ps_mask : X86Builtin<"_Vector<4, float>(_Vector<2, double>, _Vector<4, float>, unsigned char)">;
+}
+
 let Features = "avx512vl", Attributes = [NoThrow, Const, RequiredVectorWidth<128>] in {
   def cvtpd2dq128_mask : X86Builtin<"_Vector<4, int>(_Vector<2, double>, _Vector<4, int>, unsigned char)">;
-  def cvtpd2ps_mask : X86Builtin<"_Vector<4, float>(_Vector<2, double>, _Vector<4, float>, unsigned char)">;
   def cvtpd2udq128_mask : X86Builtin<"_Vector<4, int>(_Vector<2, double>, _Vector<4, int>, unsigned char)">;
 }
 
@@ -3134,41 +3149,41 @@ let Features = "avx512bw", Attributes = [NoThrow, Const, Constexpr] in {
   def kxordi : X86Builtin<"unsigned long long int(unsigned long long int, unsigned long long int)">;
 }
 
-let Features = "avx512dq", Attributes = [NoThrow, Const] in {
+let Features = "avx512dq", Attributes = [NoThrow, Const, Constexpr] in {
   def kshiftliqi : X86Builtin<"unsigned char(unsigned char, _Constant unsigned int)">;
 }
 
-let Features = "avx512f", Attributes = [NoThrow, Const] in {
+let Features = "avx512f", Attributes = [NoThrow, Const, Constexpr] in {
   def kshiftlihi : X86Builtin<"unsigned short(unsigned short, _Constant unsigned int)">;
 }
 
-let Features = "avx512bw", Attributes = [NoThrow, Const] in {
+let Features = "avx512bw", Attributes = [NoThrow, Const, Constexpr] in {
   def kshiftlisi : X86Builtin<"unsigned int(unsigned int, _Constant unsigned int)">;
   def kshiftlidi : X86Builtin<"unsigned long long int(unsigned long long int, _Constant unsigned int)">;
 }
 
-let Features = "avx512dq", Attributes = [NoThrow, Const] in {
+let Features = "avx512dq", Attributes = [NoThrow, Const, Constexpr] in {
   def kshiftriqi : X86Builtin<"unsigned char(unsigned char, _Constant unsigned int)">;
 }
 
-let Features = "avx512f", Attributes = [NoThrow, Const] in {
+let Features = "avx512f", Attributes = [NoThrow, Const, Constexpr] in {
   def kshiftrihi : X86Builtin<"unsigned short(unsigned short, _Constant unsigned int)">;
 }
 
-let Features = "avx512bw", Attributes = [NoThrow, Const] in {
+let Features = "avx512bw", Attributes = [NoThrow, Const, Constexpr] in {
   def kshiftrisi : X86Builtin<"unsigned int(unsigned int, _Constant unsigned int)">;
   def kshiftridi : X86Builtin<"unsigned long long int(unsigned long long int, _Constant unsigned int)">;
 }
 
-let Features = "avx512dq", Attributes = [NoThrow, Const] in {
+let Features = "avx512dq", Attributes = [NoThrow, Const, Constexpr] in {
   def kmovb : X86Builtin<"unsigned char(unsigned char)">;
 }
 
-let Features = "avx512f", Attributes = [NoThrow, Const] in {
+let Features = "avx512f", Attributes = [NoThrow, Const, Constexpr] in {
   def kmovw : X86Builtin<"unsigned short(unsigned short)">;
 }
 
-let Features = "avx512bw", Attributes = [NoThrow, Const] in {
+let Features = "avx512bw", Attributes = [NoThrow, Const, Constexpr] in {
   def kmovd : X86Builtin<"unsigned int(unsigned int)">;
   def kmovq : X86Builtin<"unsigned long long int(unsigned long long int)">;
 }
@@ -3288,7 +3303,6 @@ let Features = "avx512bw,avx512vl",
 }
 
 let Features = "avx512f", Attributes = [NoThrow, Const, RequiredVectorWidth<128>] in {
-  def cvtsd2ss_round_mask : X86Builtin<"_Vector<4, float>(_Vector<4, float>, _Vector<2, double>, _Vector<4, float>, unsigned char, _Constant int)">;
   def cvtsi2ss32 : X86Builtin<"_Vector<4, float>(_Vector<4, float>, int, _Constant int)">;
   def cvtss2sd_round_mask : X86Builtin<"_Vector<2, double>(_Vector<2, double>, _Vector<4, float>, _Vector<2, double>, unsigned char, _Constant int)">;
   def cvtusi2ss32 : X86Builtin<"_Vector<4, float>(_Vector<4, float>, unsigned int, _Constant int)">;
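Note: the kshift/kmov builtins above only gain the Constexpr attribute; their semantics are unchanged. A minimal sketch of what this enables, assuming the usual <immintrin.h> wrapper macros (_kshiftli_mask16, _kshiftri_mask16) and an -mavx512f build:

  #include <immintrin.h>

  // With __builtin_ia32_kshiftlihi / kshiftrihi usable in constant expressions,
  // 16-bit mask shifts can now fold at compile time.
  static_assert(_kshiftli_mask16((__mmask16)0x0001, 4) == 0x0010, "");
  static_assert(_kshiftri_mask16((__mmask16)0x8000, 15) == 0x0001, "");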
diff --git a/clang/include/clang/Basic/DiagnosticGroups.td b/clang/include/clang/Basic/DiagnosticGroups.td
index 2fff32bbc4d6c..80fc12caa1d24 100644
--- a/clang/include/clang/Basic/DiagnosticGroups.td
+++ b/clang/include/clang/Basic/DiagnosticGroups.td
@@ -790,6 +790,8 @@ def ShadowFieldInConstructor : DiagGroup<"shadow-field-in-constructor",
 def ShadowIvar : DiagGroup<"shadow-ivar">;
 def ShadowUncapturedLocal : DiagGroup<"shadow-uncaptured-local">;
 
+def ShadowHeader : DiagGroup<"shadow-header">;
+
 // -Wshadow-all is a catch-all for all shadowing. -Wshadow is just the
 // shadowing that we think is unsafe.
 def Shadow : DiagGroup<"shadow", [ShadowFieldInConstructorModified,
@@ -1061,6 +1063,7 @@ def SuperSubClassMismatch : DiagGroup<"super-class-method-mismatch">;
 def OverridingMethodMismatch : DiagGroup<"overriding-method-mismatch">;
 def VariadicMacros : DiagGroup<"variadic-macros">;
 def VectorConversion : DiagGroup<"vector-conversion">;      // clang specific
+def MatrixConversion : DiagGroup<"matrix-conversion">;      // clang specific
 def VexingParse : DiagGroup<"vexing-parse">;
 def VLAUseStaticAssert : DiagGroup<"vla-extension-static-assert">;
 def VLACxxExtension : DiagGroup<"vla-cxx-extension", [VLAUseStaticAssert]>;
@@ -1335,6 +1338,8 @@ def : DiagGroup<"int-conversions",
                 [IntConversion]>; // -Wint-conversions = -Wint-conversion
 def : DiagGroup<"vector-conversions",
                 [VectorConversion]>; // -Wvector-conversions = -Wvector-conversion
+def : DiagGroup<"matrix-conversions",
+                [MatrixConversion]>; // -Wmatrix-conversions = -Wmatrix-conversion
 def : DiagGroup<"unused-local-typedefs", [UnusedLocalTypedef]>;
                 // -Wunused-local-typedefs = -Wunused-local-typedef
 
diff --git a/clang/include/clang/Basic/DiagnosticLexKinds.td b/clang/include/clang/Basic/DiagnosticLexKinds.td
index 417187222e448..a72d3f37b1b72 100644
--- a/clang/include/clang/Basic/DiagnosticLexKinds.td
+++ b/clang/include/clang/Basic/DiagnosticLexKinds.td
@@ -959,6 +959,10 @@ def warn_quoted_include_in_framework_header : Warning<
 def warn_framework_include_private_from_public : Warning<
   "public framework header includes private framework header '%0'"
   >, InGroup<FrameworkIncludePrivateFromPublic>;
+def warn_header_shadowing : Warning<
+  "multiple candidates for header '%0' found; "
+  "directory '%1' chosen, ignoring others including '%2'">,
+  InGroup<ShadowHeader>, DefaultIgnore;
 def warn_deprecated_module_dot_map : Warning<
   "'%0' as a module map name is deprecated, rename it to %select{module.modulemap|module.private.modulemap}1%select{| in the 'Modules' directory of the framework}2">,
   InGroup<DeprecatedModuleDotMap>;
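For context, a hypothetical setup (not part of this patch) that would trigger the new -Wshadow-header diagnostic above: two search directories both provide the same header, and only the first one on the path is used.

  // Assumed layout: first/include/config.h and second/include/config.h both exist.
  //   clang++ -Wshadow-header -Ifirst/include -Isecond/include main.cpp
  #include "config.h"  // resolves against first/include; the candidate in
                       // second/include is shadowed and reported (the warning is
                       // DefaultIgnore, so it only fires under -Wshadow-header)

  int main() { return 0; }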
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index 4a145fd71eedd..e2c694cb2d9df 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -4356,6 +4356,9 @@ def warn_param_typestate_mismatch : Warning<
 def warn_unknown_sanitizer_ignored : Warning<
   "unknown sanitizer '%0' ignored">, InGroup<UnknownSanitizers>;
 
+def warn_impcast_matrix_scalar : Warning<
+  "implicit conversion turns matrix to scalar: %0 to %1">,
+  InGroup<MatrixConversion>;
 def warn_impcast_vector_scalar : Warning<
   "implicit conversion turns vector to scalar: %0 to %1">,
   InGroup<Conversion>, DefaultIgnore;
@@ -9532,6 +9535,8 @@ def err_kern_is_nonstatic_method : Error<
   "kernel function %0 must be a free function or static member function">;
 def err_config_scalar_return : Error<
   "CUDA special function '%0' must have scalar return type">;
+def err_config_pointer_return
+    : Error<"CUDA special function '%0' must have pointer return type">;
 def err_kern_call_not_global_function : Error<
   "kernel call to non-global function %0">;
 def err_global_call_not_config : Error<
@@ -11272,6 +11277,8 @@ def warn_duplicate_attribute_exact : Warning<
 def warn_duplicate_attribute : Warning<
   "attribute %0 is already applied with different arguments">,
   InGroup<IgnoredAttributes>;
+def err_duplicate_attribute
+    : Error<"attribute %0 is already applied with different arguments">;
 def err_disallowed_duplicate_attribute : Error<
   "attribute %0 cannot appear more than once on a declaration">;
 
@@ -13065,6 +13072,12 @@ def err_get_vtable_pointer_requires_complete_type
     : Error<"__builtin_get_vtable_pointer requires an argument with a complete "
             "type, but %0 is incomplete">;
 
+def err_modular_format_attribute_no_format
+    : Error<"'modular_format' attribute requires 'format' attribute">;
+
+def err_modular_format_duplicate_aspect
+    : Error<"duplicate aspect '%0' in 'modular_format' attribute">;
+
 // SYCL-specific diagnostics
 def warn_sycl_kernel_num_of_template_params : Warning<
   "'sycl_kernel' attribute only applies to a function template with at least"
@@ -13233,6 +13246,12 @@ def err_hlsl_semantic_indexing_not_supported
 def err_hlsl_init_priority_unsupported : Error<
   "initializer priorities are not supported in HLSL">;
 def err_hlsl_semantic_index_overlap : Error<"semantic index overlap %0">;
+def err_hlsl_semantic_unsupported_iotype_for_stage
+    : Error<"semantic %0 is unsupported in %2 shaders as %1, requires one of "
+            "the following: %3">;
+def err_hlsl_semantic_partial_explicit_indexing
+    : Error<"partial explicit stage input location assignment via "
+            "vk::location(X) unsupported">;
 
 def warn_hlsl_user_defined_type_missing_member: Warning<"binding type '%select{t|u|b|s|c}0' only applies to types containing %select{SRV resources|UAV resources|constant buffer resources|sampler state|numeric types}0">, InGroup<LegacyConstantRegisterBinding>;
 def err_hlsl_binding_type_mismatch: Error<"binding type '%select{t|u|b|s|c}0' only applies to %select{SRV resources|UAV resources|constant buffer resources|sampler state|numeric variables in the global scope}0">;
@@ -13274,7 +13293,10 @@ def err_hlsl_builtin_scalar_vector_mismatch
           "vector type with matching scalar element type%diff{: $ vs $|}2,3">;
 
 def warn_hlsl_impcast_vector_truncation : Warning<
-  "implicit conversion truncates vector: %0 to %1">, InGroup<Conversion>;
+  "implicit conversion truncates vector: %0 to %1">, InGroup<VectorConversion>;
+
+def warn_hlsl_impcast_matrix_truncation : Warning<
+  "implicit conversion truncates matrix: %0 to %1">, InGroup<MatrixConversion>;
 
 def warn_hlsl_availability : Warning<
   "%0 is only available %select{|in %4 environment }3on %1 %2 or newer">,
@@ -13749,4 +13771,9 @@ def warn_comparison_in_enum_initializer : Warning<
 def note_enum_compare_typo_suggest : Note<
   "use '%0' to perform a bitwise shift">;
 
+def err_cuda_device_kernel_launch_not_supported
+    : Error<"device-side kernel call/launch is not supported">;
+def err_cuda_device_kernel_launch_require_rdc
+    : Error<"kernel launch from __device__ or __global__ function requires "
+            "relocatable device code (i.e. requires -fgpu-rdc)">;
 } // end of sema component.
diff --git a/clang/include/clang/Basic/OpenMPKinds.def b/clang/include/clang/Basic/OpenMPKinds.def
index b98b946cad75a..ceac89d3aba6d 100644
--- a/clang/include/clang/Basic/OpenMPKinds.def
+++ b/clang/include/clang/Basic/OpenMPKinds.def
@@ -207,6 +207,7 @@ OPENMP_MAP_MODIFIER_KIND(ompx_hold)
 
 // Modifiers for 'to' or 'from' clause.
 OPENMP_MOTION_MODIFIER_KIND(mapper)
+OPENMP_MOTION_MODIFIER_KIND(iterator)
 OPENMP_MOTION_MODIFIER_KIND(present)
 
 // Static attributes for 'dist_schedule' clause.
diff --git a/clang/include/clang/Basic/TokenKinds.h b/clang/include/clang/Basic/TokenKinds.h
index d84f3598cbf33..a801113c57715 100644
--- a/clang/include/clang/Basic/TokenKinds.h
+++ b/clang/include/clang/Basic/TokenKinds.h
@@ -98,7 +98,7 @@ inline bool isLiteral(TokenKind K) {
   const bool isInLiteralRange =
       K >= tok::numeric_constant && K <= tok::utf32_string_literal;
 
-#if !NDEBUG
+#ifndef NDEBUG
   const bool isLiteralExplicit =
       K == tok::numeric_constant || K == tok::char_constant ||
       K == tok::wide_char_constant || K == tok::utf8_char_constant ||
diff --git a/clang/include/clang/Basic/arm_mve.td b/clang/include/clang/Basic/arm_mve.td
index 2e5e1d93be096..77531c31538c1 100644
--- a/clang/include/clang/Basic/arm_mve.td
+++ b/clang/include/clang/Basic/arm_mve.td
@@ -167,7 +167,9 @@ multiclass FMA<bit add> {
   // second multiply input.
   defvar m2_cg = !if(add, (id $m2), (fneg $m2));
 
-  defvar unpred_cg = (IRIntBase<"fma", [Vector]> $m1, m2_cg, $addend);
+  defvar fma = strictFPAlt<IRIntBase<"fma", [Vector]>,
+                           IRInt<"fma", [Vector]>>;
+  defvar unpred_cg = (fma $m1, m2_cg, $addend);
   defvar pred_cg   = (IRInt<"fma_predicated", [Vector, Predicate]>
                           $m1, m2_cg, $addend, $pred);
 
@@ -723,7 +725,7 @@ multiclass compare_with_pred<string condname, dag arguments,
        NameOverride<"vcmp" # condname # "q_m" # suffix>;
 }
 
-multiclass compare<string condname, IRBuilder cmpop> {
+multiclass compare<string condname, Builder cmpop> {
   // Make all four variants of a comparison: the vector/vector and
   // vector/scalar forms, each using compare_with_pred to make a
   // predicated and unpredicated version.
@@ -781,15 +783,15 @@ let params = T.Unsigned in {
 }
 let params = T.Float in {
   def vminnmq: Intrinsic<Vector, (args Vector:$a, Vector:$b),
-                                 (IRIntBase<"minnum", [Vector]> $a, $b)>;
+                                 (fminnm $a, $b)>;
   def vmaxnmq: Intrinsic<Vector, (args Vector:$a, Vector:$b),
-                                 (IRIntBase<"maxnum", [Vector]> $a, $b)>;
+                                 (fmaxnm $a, $b)>;
   def vminnmaq: Intrinsic<Vector, (args Vector:$a, Vector:$b),
-                                  (IRIntBase<"minnum", [Vector]>
+                                  (fminnm
                                    (IRIntBase<"fabs", [Vector]> $a),
                                    (IRIntBase<"fabs", [Vector]> $b))>;
   def vmaxnmaq: Intrinsic<Vector, (args Vector:$a, Vector:$b),
-                                  (IRIntBase<"maxnum", [Vector]>
+                                  (fmaxnm
                                    (IRIntBase<"fabs", [Vector]> $a),
                                    (IRIntBase<"fabs", [Vector]> $b))>;
 }
diff --git a/clang/include/clang/Basic/arm_mve_defs.td b/clang/include/clang/Basic/arm_mve_defs.td
index eeca9153dd742..3210549d0cb56 100644
--- a/clang/include/clang/Basic/arm_mve_defs.td
+++ b/clang/include/clang/Basic/arm_mve_defs.td
@@ -34,7 +34,8 @@ class IRBuilderAddrParam<int index_> : IRBuilderParam<index_>;
 class IRBuilderIntParam<int index_, string type_> : IRBuilderParam<index_> {
   string type = type_;
 }
-class IRBuilderBase {
+class Builder {}
+class IRBuilderBase : Builder {
   // The prefix of the function call, including an open parenthesis.
   string prefix;
 
@@ -166,7 +167,7 @@ def address;
 // Another node class you can use in the codegen dag. This one corresponds to
 // an IR intrinsic function, which has to be specialized to a particular list
 // of types.
-class IRIntBase<string name_, list<Type> params_ = [], bit appendKind_ = 0> {
+class IRIntBase<string name_, list<Type> params_ = [], bit appendKind_ = 0> : Builder {
   string intname = name_;       // base name of the intrinsic
   list<Type> params = params_;  // list of parameter types
 
@@ -214,8 +215,8 @@ def bitsize;
 
 // strictFPAlt allows a node to have different code generation under strict-fp.
 // TODO: The standard node can be IRBuilderBase or IRIntBase.
-class strictFPAlt<IRBuilderBase standard_, IRIntBase strictfp_> {
-  IRBuilderBase standard = standard_;
+class strictFPAlt<Builder standard_, IRIntBase strictfp_> : Builder {
+  Builder standard = standard_;
   IRIntBase strictfp = strictfp_;
 }
 
@@ -588,6 +589,10 @@ def fsub: strictFPAlt<fsub_node,
                       IRInt<"vsub", [Vector]>>;
 def fmul: strictFPAlt<fmul_node,
                       IRInt<"vmul", [Vector]>>;
+def fminnm : strictFPAlt<IRIntBase<"minnum", [Vector]>,
+                         IRInt<"vminnm", [Vector]>>;
+def fmaxnm : strictFPAlt<IRIntBase<"maxnum", [Vector]>,
+                         IRInt<"vmaxnm", [Vector]>>;
 
 // -----------------------------------------------------------------------------
 // Convenience lists of parameter types. 'T' is just a container record, so you
diff --git a/clang/include/clang/CIR/CIRGenerator.h b/clang/include/clang/CIR/CIRGenerator.h
index 5ea11463ffa9f..31dead2d7b585 100644
--- a/clang/include/clang/CIR/CIRGenerator.h
+++ b/clang/include/clang/CIR/CIRGenerator.h
@@ -81,6 +81,9 @@ class CIRGenerator : public clang::ASTConsumer {
   void HandleTagDeclDefinition(clang::TagDecl *d) override;
   void HandleTagDeclRequiredDefinition(const clang::TagDecl *D) override;
   void HandleCXXStaticMemberVarInstantiation(clang::VarDecl *D) override;
+  void
+  HandleOpenACCRoutineReference(const clang::FunctionDecl *FD,
+                                const clang::OpenACCRoutineDecl *RD) override;
   void CompleteTentativeDefinition(clang::VarDecl *d) override;
   void HandleVTable(clang::CXXRecordDecl *rd) override;
 
diff --git a/clang/include/clang/CIR/Dialect/IR/CIRAttrs.td b/clang/include/clang/CIR/Dialect/IR/CIRAttrs.td
index 12bc9cf7b5b04..98d4636dafc29 100644
--- a/clang/include/clang/CIR/Dialect/IR/CIRAttrs.td
+++ b/clang/include/clang/CIR/Dialect/IR/CIRAttrs.td
@@ -1085,44 +1085,10 @@ def CIR_TypeInfoAttr : CIR_Attr<"TypeInfo", "typeinfo", [TypedAttrInterface]> {
 //===----------------------------------------------------------------------===//
 
 def CIR_InlineKind : CIR_I32EnumAttr<"InlineKind", "inlineKind", [
-  I32EnumAttrCase<"NoInline", 1, "never">,
-  I32EnumAttrCase<"AlwaysInline", 2, "always">,
-  I32EnumAttrCase<"InlineHint", 3, "hint">
-]> {
-  let genSpecializedAttr = 0;
-}
-
-def CIR_InlineAttr : CIR_EnumAttr<CIR_InlineKind, "inline"> {
-  let summary = "Inline attribute";
-  let description = [{
-    Inline attribute represents user directives for inlining behavior.
-    This attribute is only used by `cir.func` operations.
-
-    Values:
-    - `never`: Prevents the function from being inlined (__attribute__((noinline)))
-    - `always`: Forces the function to be inlined (__attribute__((always_inline)))
-    - `hint`: Suggests the function should be inlined (inline keyword)
-
-    Example:
-    ```
-    cir.func @noinline_func(%arg0: !s32i) -> !s32i inline(never) {
-      cir.return %arg0 : !s32i
-    }
-    cir.func @always_inline_func() -> !s32i inline(always) {
-      %0 = cir.const #cir.int<42> : !s32i
-      cir.return %0 : !s32i
-    }
-    ```
-  }];
-
-  let cppClassName = "InlineAttr";
-
-  let extraClassDeclaration = [{
-    bool isNoInline() const { return getValue() == InlineKind::NoInline; };
-    bool isAlwaysInline() const { return getValue() == InlineKind::AlwaysInline; };
-    bool isInlineHint() const { return getValue() == InlineKind::InlineHint; };
-  }];
-}
+  I32EnumAttrCase<"NoInline", 1, "no_inline">,
+  I32EnumAttrCase<"AlwaysInline", 2, "always_inline">,
+  I32EnumAttrCase<"InlineHint", 3, "inline_hint">
+]>;
 
 //===----------------------------------------------------------------------===//
 // CatchAllAttr & UnwindAttr
diff --git a/clang/include/clang/CIR/Dialect/IR/CIRDialect.td b/clang/include/clang/CIR/Dialect/IR/CIRDialect.td
index 34df9af7fc06d..7c38492544b39 100644
--- a/clang/include/clang/CIR/Dialect/IR/CIRDialect.td
+++ b/clang/include/clang/CIR/Dialect/IR/CIRDialect.td
@@ -24,8 +24,7 @@ def CIR_Dialect : Dialect {
 
   let cppNamespace = "::cir";
 
-  let useDefaultAttributePrinterParser = 0;
-  let useDefaultTypePrinterParser = 0;
+  let useDefaultAttributePrinterParser = 1;
 
   // Enable constant materialization for the CIR dialect. This generates a
   // declaration for the cir::CIRDialect::materializeConstant function. This
@@ -52,12 +51,6 @@ def CIR_Dialect : Dialect {
     mlir::Type parseType(mlir::DialectAsmParser &parser) const override;
     void printType(mlir::Type type,
                    mlir::DialectAsmPrinter &printer) const override;
-
-    mlir::Attribute parseAttribute(mlir::DialectAsmParser &parser,
-                                   mlir::Type type) const override;
-
-    void printAttribute(mlir::Attribute attr,
-                        mlir::DialectAsmPrinter &os) const override;
   }];
 }
 
diff --git a/clang/include/clang/CIR/Dialect/IR/CIROps.td b/clang/include/clang/CIR/Dialect/IR/CIROps.td
index 5f5fab6f12300..ae199f35cb10e 100644
--- a/clang/include/clang/CIR/Dialect/IR/CIROps.td
+++ b/clang/include/clang/CIR/Dialect/IR/CIROps.td
@@ -1173,6 +1173,35 @@ def CIR_SwitchOp : CIR_Op<"switch", [
   let hasLLVMLowering = false;
 }
 
+//===----------------------------------------------------------------------===//
+// IsConstantOp
+//===----------------------------------------------------------------------===//
+
+def CIR_IsConstantOp : CIR_Op<"is_constant", [Pure]> {
+  let summary = "Test for manifest compile-time constant";
+  let description = [{
+    Returns `true` if the argument is known to be a manifest compile-time
+    constant otherwise returns `false`. If the argument is a constant expression
+    which refers to a global (the address of which _is_ a constant, but not
+    manifest during the compile), then the intrinsic evaluates to `false`.
+
+    This is used to represent `__builtin_constant_p` in cases where the argument
+    isn't known to be constant during initial translation of the source code but
+    might be proven to be constant after later optimizations.
+
+    Example:
+    ```
+    %1 = cir.is_constant %2 : !s32i -> !cir.bool
+    ```
+  }];
+  let arguments = (ins CIR_AnyType:$val);
+  let results = (outs CIR_BoolType:$result);
+
+  let assemblyFormat = [{
+    $val `:` qualified(type($val)) `->` qualified(type($result)) attr-dict
+  }];
+}
+
 //===----------------------------------------------------------------------===//
 // SwitchFlatOp
 //===----------------------------------------------------------------------===//
@@ -2564,9 +2593,9 @@ def CIR_FuncOp : CIR_Op<"func", [
     Similarly, for global destructors both `global_dtor` and
     `global_dtor(<priority>)` are available.
 
-    The `inline(never)` keyword marks a function that should not be inlined.
-    The `inline(always)` keyword marks a function that should always be inlined.
-    The `inline(hint)` keyword suggests that the function should be inlined.
+    The `no_inline` attribute marks a function that should not be inlined.
+    The `always_inline` attribute marks a function that should always be inlined.
+    The `inline_hint` attribute suggests that the function should be inlined.
 
     Example:
 
@@ -2580,7 +2609,10 @@ def CIR_FuncOp : CIR_Op<"func", [
 
     // Linkage information
     cir.func linkonce_odr @some_method(...)
-    ```
+
+    // Inline information
+    cir.func no_inline @some_method(...)
+    
     // Builtin function
     cir.func builtin @__builtin_coro_end(!cir.ptr<i8>, !cir.bool) -> !cir.bool
     // Coroutine
@@ -2592,26 +2624,29 @@ def CIR_FuncOp : CIR_Op<"func", [
     ```
   }];
 
-  let arguments = (ins SymbolNameAttr:$sym_name,
-                       CIR_VisibilityAttr:$global_visibility,
-                       TypeAttrOf<CIR_FuncType>:$function_type,
-                       UnitAttr:$builtin,
-                       UnitAttr:$coroutine,
-                       UnitAttr:$lambda,
-                       UnitAttr:$no_proto,
-                       UnitAttr:$dso_local,
-                       DefaultValuedAttr<CIR_GlobalLinkageKind,
-                                         "cir::GlobalLinkageKind::ExternalLinkage">:$linkage,
-                       OptionalAttr<CIR_InlineAttr>:$inline_kind,
-                       OptionalAttr<StrAttr>:$sym_visibility,
-                       UnitAttr:$comdat,
-                       OptionalAttr<DictArrayAttr>:$arg_attrs,
-                       OptionalAttr<DictArrayAttr>:$res_attrs,
-                       OptionalAttr<FlatSymbolRefAttr>:$aliasee,
-                       CIR_OptionalPriorityAttr:$global_ctor_priority,
-                       CIR_OptionalPriorityAttr:$global_dtor_priority,
-                       OptionalAttr<CIR_CXXSpecialMemberAttr>:$cxx_special_member
-   );
+  let arguments = (ins 
+    SymbolNameAttr:$sym_name,
+    CIR_VisibilityAttr:$global_visibility,
+    TypeAttrOf<CIR_FuncType>:$function_type,
+    UnitAttr:$builtin,
+    UnitAttr:$coroutine,
+    OptionalAttr<CIR_InlineKind>:$inline_kind,
+    UnitAttr:$lambda,
+    UnitAttr:$no_proto,
+    UnitAttr:$dso_local,
+    DefaultValuedAttr<
+      CIR_GlobalLinkageKind,
+      "cir::GlobalLinkageKind::ExternalLinkage"
+    >:$linkage,
+    OptionalAttr<StrAttr>:$sym_visibility,
+    UnitAttr:$comdat,
+    OptionalAttr<DictArrayAttr>:$arg_attrs,
+    OptionalAttr<DictArrayAttr>:$res_attrs,
+    OptionalAttr<FlatSymbolRefAttr>:$aliasee,
+    CIR_OptionalPriorityAttr:$global_ctor_priority,
+    CIR_OptionalPriorityAttr:$global_dtor_priority,
+    OptionalAttr<CIR_CXXSpecialMemberAttr>:$cxx_special_member
+  );
 
   let regions = (region AnyRegion:$body);
 
@@ -4722,6 +4757,16 @@ def CIR_FAbsOp : CIR_UnaryFPToFPBuiltinOp<"fabs", "FAbsOp"> {
   }];
 }
 
+def CIR_FloorOp : CIR_UnaryFPToFPBuiltinOp<"floor", "FloorOp"> {
+  let summary = "Computes the floating-point floor value";
+  let description = [{
+    `cir.floor` computes the floor of a floating-point operand and returns
+    a result of the same type.
+
+    Floating-point exceptions are ignored, and it does not set `errno`.
+  }];
+}
+
 //===----------------------------------------------------------------------===//
 // Variadic Operations
 //===----------------------------------------------------------------------===//
@@ -4804,6 +4849,37 @@ def CIR_VAEndOp : CIR_Op<"va_end"> {
   }];
 }
 
+def CIR_VACopyOp : CIR_Op<"va_copy"> {
+  let summary = "Copies a variable argument list";
+  let description = [{
+    The `cir.va_copy` operation models the C/C++ `va_copy` macro.
+    The variable argument list passed as `$src_list` is copied to an
+    uninitialized `va_list` in the destination operand. The next argument that
+    can be extracted from the copied list is the same as the next argument in
+    the source list. The copied list must be destroyed with `va_end`.
+
+    Example:
+
+    ```mlir
+    // %args : !cir.ptr<!cir.array<!rec___va_list_tag x 1>>
+    %p = cir.cast array_to_ptrdecay %args
+          : !cir.ptr<!cir.array<!rec___va_list_tag x 1>>
+          -> !cir.ptr<!rec___va_list_tag>
+    cir.va_copy %p to %dst
+          : (!cir.ptr<!rec___va_list_tag>, !cir.ptr<!rec___va_list_tag>)
+    ```
+  }];
+
+  let arguments = (ins
+    CIR_PointerType:$dst_list,
+    CIR_PointerType:$src_list
+  );
+
+  let assemblyFormat = [{
+    $src_list `to` $dst_list attr-dict `:` type(operands)
+  }];
+}
+
 def CIR_VAArgOp : CIR_Op<"va_arg"> {
   let summary = "Fetches next variadic element as a given type";
   let description = [{
diff --git a/clang/include/clang/CIR/Dialect/IR/CIRTypeConstraints.td b/clang/include/clang/CIR/Dialect/IR/CIRTypeConstraints.td
index b2c146c5d2c39..ddca98eac93ab 100644
--- a/clang/include/clang/CIR/Dialect/IR/CIRTypeConstraints.td
+++ b/clang/include/clang/CIR/Dialect/IR/CIRTypeConstraints.td
@@ -173,7 +173,7 @@ def CIR_AnyComplexType : CIR_TypeBase<"::cir::ComplexType", "complex type">;
 
 def CIR_AnyComplexOrIntOrBoolOrFloatType
     : AnyTypeOf<[CIR_AnyComplexType, CIR_AnyIntOrBoolOrFloatType],
-                "complex, integer or floating point type"> {
+                "complex, integer, boolean or floating point type"> {
   let cppFunctionName = "isComplexOrIntegerOrBoolOrFloatingPointType";
 }
 
diff --git a/clang/include/clang/CIR/MissingFeatures.h b/clang/include/clang/CIR/MissingFeatures.h
index 1427c677d0f34..5075071661fb5 100644
--- a/clang/include/clang/CIR/MissingFeatures.h
+++ b/clang/include/clang/CIR/MissingFeatures.h
@@ -153,6 +153,8 @@ struct MissingFeatures {
   static bool coroEndBuiltinCall() { return false; }
   static bool emitBodyAndFallthrough() { return false; }
   static bool coroOutsideFrameMD() { return false; }
+  static bool coroCoReturn() { return false; }
+  static bool coroCoYield() { return false; }
 
   // Various handling of deferred processing in CIRGenModule.
   static bool cgmRelease() { return false; }
@@ -188,6 +190,10 @@ struct MissingFeatures {
   static bool globalCtorAssociatedData() { return false; }
 
   // Misc
+  static bool aarch64SIMDIntrinsics() { return false; }
+  static bool aarch64SMEIntrinsics() { return false; }
+  static bool aarch64SVEIntrinsics() { return false; }
+  static bool aarch64TblBuiltinExpr() { return false; }
   static bool abiArgInfo() { return false; }
   static bool addAutoInitAnnotation() { return false; }
   static bool addHeapAllocSiteMetadata() { return false; }
@@ -291,6 +297,7 @@ struct MissingFeatures {
   static bool metaDataNode() { return false; }
   static bool moduleNameHash() { return false; }
   static bool msabi() { return false; }
+  static bool neonSISDIntrinsics() { return false; }
   static bool nrvo() { return false; }
   static bool objCBlocks() { return false; }
   static bool objCGC() { return false; }
diff --git a/clang/include/clang/DependencyScanning/DependencyScanningWorker.h b/clang/include/clang/DependencyScanning/DependencyScanningWorker.h
index 9d3966c25414a..8d91c78c72322 100644
--- a/clang/include/clang/DependencyScanning/DependencyScanningWorker.h
+++ b/clang/include/clang/DependencyScanning/DependencyScanningWorker.h
@@ -85,9 +85,9 @@ class DependencyScanningWorker {
   /// Construct a dependency scanning worker.
   ///
   /// @param Service The parent service. Must outlive the worker.
-  /// @param FS The filesystem for the worker to use.
+  /// @param BaseFS The filesystem for the worker to use.
   DependencyScanningWorker(DependencyScanningService &Service,
-                           llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem> FS);
+                           IntrusiveRefCntPtr<llvm::vfs::FileSystem> BaseFS);
 
   ~DependencyScanningWorker();
 
@@ -156,31 +156,26 @@ class DependencyScanningWorker {
                                        DependencyActionController &Controller);
   bool finalizeCompilerInstance();
 
-  llvm::vfs::FileSystem &getVFS() const { return *BaseFS; }
+  llvm::vfs::FileSystem &getVFS() const { return *DepFS; }
 
 private:
   /// The parent dependency scanning service.
   DependencyScanningService &Service;
   std::shared_ptr<PCHContainerOperations> PCHContainerOps;
-  /// The file system to be used during the scan.
-  /// This is either \c FS passed in the constructor (when performing canonical
-  /// preprocessing), or \c DepFS (when performing dependency directives scan).
-  llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem> BaseFS;
-  /// When performing dependency directives scan, this is the caching (and
-  /// dependency-directives-extracting) filesystem overlaid on top of \c FS
-  /// (passed in the constructor).
-  llvm::IntrusiveRefCntPtr<DependencyScanningWorkerFilesystem> DepFS;
+  /// This is the caching (and optionally dependency-directives-providing) VFS
+  /// overlaid on top of the base VFS passed in the constructor.
+  IntrusiveRefCntPtr<DependencyScanningWorkerFilesystem> DepFS;
 
   friend CompilerInstanceWithContext;
   std::unique_ptr<CompilerInstanceWithContext> CIWithContext;
 
-  /// Private helper functions.
-  bool scanDependencies(StringRef WorkingDirectory,
-                        const std::vector<std::string> &CommandLine,
-                        DependencyConsumer &Consumer,
-                        DependencyActionController &Controller,
-                        DiagnosticConsumer &DC,
-                        llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem> FS);
+  /// Actually carries out the scan. If \c OverlayFS is provided, it must be
+  /// layered on top of \c DepFS.
+  bool scanDependencies(
+      StringRef WorkingDirectory, const std::vector<std::string> &CommandLine,
+      DependencyConsumer &Consumer, DependencyActionController &Controller,
+      DiagnosticConsumer &DC,
+      IntrusiveRefCntPtr<llvm::vfs::FileSystem> OverlayFS = nullptr);
 };
 
 } // end namespace dependencies
diff --git a/clang/include/clang/Lex/HeaderSearch.h b/clang/include/clang/Lex/HeaderSearch.h
index 850aea41c4c3b..5369c872ac1cd 100644
--- a/clang/include/clang/Lex/HeaderSearch.h
+++ b/clang/include/clang/Lex/HeaderSearch.h
@@ -465,6 +465,12 @@ class HeaderSearch {
     ExternalSource = ES;
   }
 
+  void diagnoseHeaderShadowing(
+      StringRef Filename, OptionalFileEntryRef FE, bool &DiagnosedShadowing,
+      SourceLocation IncludeLoc, ConstSearchDirIterator FromDir,
+      ArrayRef<std::pair<OptionalFileEntryRef, DirectoryEntryRef>> Includers,
+      bool isAngled, int IncluderLoopIndex, ConstSearchDirIterator MainLoopIt);
+
   /// Set the target information for the header search, if not
   /// already known.
   void setTarget(const TargetInfo &Target);
diff --git a/clang/include/clang/Lex/PPCallbacks.h b/clang/include/clang/Lex/PPCallbacks.h
index 313b730afbab8..e6120c5648798 100644
--- a/clang/include/clang/Lex/PPCallbacks.h
+++ b/clang/include/clang/Lex/PPCallbacks.h
@@ -499,10 +499,10 @@ class PPChainedCallbacks : public PPCallbacks {
   }
 
   bool EmbedFileNotFound(StringRef FileName) override {
-    bool Skip = First->FileNotFound(FileName);
+    bool Skip = First->EmbedFileNotFound(FileName);
     // Make sure to invoke the second callback, no matter if the first already
     // returned true to skip the file.
-    Skip |= Second->FileNotFound(FileName);
+    Skip |= Second->EmbedFileNotFound(FileName);
     return Skip;
   }
 
diff --git a/clang/include/clang/Lex/Token.h b/clang/include/clang/Lex/Token.h
index d9dc5a562d802..43091a6f3a8c6 100644
--- a/clang/include/clang/Lex/Token.h
+++ b/clang/include/clang/Lex/Token.h
@@ -100,13 +100,19 @@ class Token {
   /// is/isNot - Predicates to check if this token is a specific kind, as in
   /// "if (Tok.is(tok::l_brace)) {...}".
   bool is(tok::TokenKind K) const { return Kind == K; }
-  bool isNot(tok::TokenKind K) const { return Kind != K; }
   template <typename... Ts> bool isOneOf(Ts... Ks) const {
     static_assert(sizeof...(Ts) > 0,
                   "requires at least one tok::TokenKind specified");
     return (is(Ks) || ...);
   }
 
+  bool isNot(tok::TokenKind K) const { return Kind != K; }
+  template <typename... Ts> bool isNoneOf(Ts... Ks) const {
+    static_assert(sizeof...(Ts) > 0,
+                  "requires at least one tok::TokenKind specified");
+    return (isNot(Ks) && ...);
+  }
+
   /// Return true if this is a raw identifier (when lexing
   /// in raw mode) or a non-keyword identifier (when lexing in non-raw mode).
   bool isAnyIdentifier() const {
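A small usage sketch (not from the patch) for the new Token::isNoneOf: it folds isNot over its arguments, so chained negative checks collapse into a single call.

  #include "clang/Lex/Token.h"

  // Equivalent to Tok.isNot(tok::eof) && Tok.isNot(tok::comma) && Tok.isNot(tok::r_paren).
  static bool canConsume(const clang::Token &Tok) {
    return Tok.isNoneOf(clang::tok::eof, clang::tok::comma, clang::tok::r_paren);
  }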
diff --git a/clang/include/clang/Options/Options.td b/clang/include/clang/Options/Options.td
index 756d6deed7130..d31bd7d6be322 100644
--- a/clang/include/clang/Options/Options.td
+++ b/clang/include/clang/Options/Options.td
@@ -4765,25 +4765,25 @@ def ggdb3 : Flag<["-"], "ggdb3">, Group<ggdbN_Group>;
 def glldb : Flag<["-"], "glldb">, Group<gTune_Group>;
 def gsce : Flag<["-"], "gsce">, Group<gTune_Group>;
 def gdbx : Flag<["-"], "gdbx">, Group<gTune_Group>;
-// Equivalent to our default dwarf version. Forces usual dwarf emission when
+// Equivalent to our default DWARF version. Forces usual DWARF emission when
 // CodeView is enabled.
 def gdwarf : Flag<["-"], "gdwarf">, Group<g_Group>,
   Visibility<[ClangOption, CLOption, DXCOption, FlangOption]>,
-  HelpText<"Generate source-level debug information with the default dwarf version">;
+  HelpText<"Generate source-level debug information with the default DWARF version">;
 
 let Visibility = [ClangOption, FlangOption] in {
 def gdwarf_2 : Flag<["-"], "gdwarf-2">, Group<g_Group>,
-  HelpText<"Generate source-level debug information with dwarf version 2">;
+  HelpText<"Generate source-level debug information with DWARF version 2">;
 def gdwarf_3 : Flag<["-"], "gdwarf-3">, Group<g_Group>,
-  HelpText<"Generate source-level debug information with dwarf version 3">;
+  HelpText<"Generate source-level debug information with DWARF version 3">;
 def gdwarf_4 : Flag<["-"], "gdwarf-4">, Group<g_Group>,
-  HelpText<"Generate source-level debug information with dwarf version 4">;
+  HelpText<"Generate source-level debug information with DWARF version 4">;
 def gdwarf_5 : Flag<["-"], "gdwarf-5">, Group<g_Group>,
-  HelpText<"Generate source-level debug information with dwarf version 5">;
+  HelpText<"Generate source-level debug information with DWARF version 5">;
 def gdwarf_6
     : Flag<["-"], "gdwarf-6">,
       Group<g_Group>,
-      HelpText<"Generate source-level debug information with dwarf version 6">;
+      HelpText<"Generate source-level debug information with DWARF version 6">;
 }
 def gdwarf64 : Flag<["-"], "gdwarf64">, Group<g_Group>,
   Visibility<[ClangOption, CC1Option, CC1AsOption]>,
@@ -4793,7 +4793,7 @@ def gdwarf32 : Flag<["-"], "gdwarf32">, Group<g_Group>,
   Visibility<[ClangOption, CC1Option, CC1AsOption]>,
   HelpText<"Enables DWARF32 format for ELF binaries, if debug information emission is enabled.">;
 
-def gcodeview : Flag<["-"], "gcodeview">,
+def gcodeview : Flag<["-"], "gcodeview">, Group<g_Group>,
   HelpText<"Generate CodeView debug information">,
   Visibility<[ClangOption, CC1Option, CC1AsOption, CLOption, DXCOption]>,
   MarshallingInfoFlag<CodeGenOpts<"EmitCodeView">>;
@@ -4801,17 +4801,20 @@ defm codeview_ghash : BoolOption<"g", "codeview-ghash",
   CodeGenOpts<"CodeViewGHash">, DefaultFalse,
   PosFlag<SetTrue, [], [ClangOption, CC1Option],
           "Emit type record hashes in a .debug$H section">,
-  NegFlag<SetFalse>, BothFlags<[], [ClangOption, CLOption, DXCOption]>>;
+  NegFlag<SetFalse>, BothFlags<[], [ClangOption, CLOption, DXCOption]>>,
+  Group<g_flags_Group>;
 defm codeview_command_line : BoolOption<"g", "codeview-command-line",
   CodeGenOpts<"CodeViewCommandLine">, DefaultTrue,
   PosFlag<SetTrue, [], [ClangOption], "Emit compiler path and command line into CodeView debug information">,
   NegFlag<SetFalse, [], [ClangOption], "Don't emit compiler path and command line into CodeView debug information">,
-  BothFlags<[], [ClangOption, CLOption, DXCOption, CC1Option]>>;
+  BothFlags<[], [ClangOption, CLOption, DXCOption, CC1Option]>>,
+  Group<g_flags_Group>;
 defm inline_line_tables : BoolGOption<"inline-line-tables",
   CodeGenOpts<"NoInlineLineTables">, DefaultFalse,
   NegFlag<SetTrue, [], [ClangOption, CC1Option],
           "Don't emit inline line tables.">,
-  PosFlag<SetFalse>, BothFlags<[], [ClangOption, CLOption, DXCOption]>>;
+  PosFlag<SetFalse>, BothFlags<[], [ClangOption, CLOption, DXCOption]>>,
+  Group<g_flags_Group>;
 
 def gfull : Flag<["-"], "gfull">, Group<g_Group>;
 def gused : Flag<["-"], "gused">, Group<g_Group>;
@@ -4836,7 +4839,8 @@ defm strict_dwarf : BoolOption<"g", "strict-dwarf",
 defm omit_unreferenced_methods : BoolGOption<"omit-unreferenced-methods",
   CodeGenOpts<"DebugOmitUnreferencedMethods">, DefaultFalse,
   NegFlag<SetFalse>,
-  PosFlag<SetTrue, [], [CC1Option]>, BothFlags<[], [ClangOption, CLOption, DXCOption]>>;
+  PosFlag<SetTrue, [], [CC1Option]>, BothFlags<[], [ClangOption, CLOption, DXCOption]>>,
+  Group<g_flags_Group>;
 defm column_info : BoolOption<"g", "column-info",
   CodeGenOpts<"DebugColumnInfo">, DefaultTrue,
   NegFlag<SetFalse, [], [ClangOption, CC1Option]>,
@@ -4903,6 +4907,7 @@ defm structor_decl_linkage_names
                           "Attach linkage names to C++ constructor/destructor "
                           "declarations in DWARF.">,
                   BothFlags<[], [ClangOption, CLOption, CC1Option]>>,
+                  Group<g_flags_Group>,
                   DocBrief<[{On some ABIs (e.g., Itanium), constructors and destructors may have multiple variants. Historically, when generating DWARF, Clang did not attach ``DW_AT_linkage_name`` to structor DIEs because there were multiple possible manglings (depending on the structor variant) that could be used. With ``-gstructor-decl-linkage-names``, for ABIs with structor variants, we attach a "unified" mangled name to structor declarations DIEs which debuggers can use to look up all the definitions for a structor declaration. E.g., a "unified" mangled name ``_ZN3FooC4Ev`` may have multiple definitions associated with it such as ``_ZN3FooC1Ev`` and ``_ZN3FooC2Ev``.
 
 Enabling this flag results in a better interactive debugging experience (both GDB and LLDB have support for understanding these "unified" linkage names). However, it comes with a significant increase in debug-info size (particularly the `.debug_str` section). As an escape hatch, users can disable this feature using ``-gno-structor-decl-linkage-names``.}]>;
@@ -4911,7 +4916,8 @@ defm key_instructions : BoolGOption<"key-instructions",
     NegFlag<SetFalse>, PosFlag<SetTrue, [], [],
         "Enable Key Instructions, which reduces the jumpiness of debug stepping in optimized C/C++ code"
         " in some debuggers. DWARF only.">,
-    BothFlags<[], [ClangOption, CLOption, CC1Option]>>;
+    BothFlags<[], [ClangOption, CLOption, CC1Option]>>,
+  Group<g_flags_Group>;
 def headerpad__max__install__names : Joined<["-"], "headerpad_max_install_names">;
 def help : Flag<["-", "--"], "help">,
     Visibility<[ClangOption, CC1Option, CC1AsOption,
@@ -8530,7 +8536,7 @@ def main_file_name : Separate<["-"], "main-file-name">,
   Visibility<[CC1Option, CC1AsOption]>,
   MarshallingInfoString<CodeGenOpts<"MainFileName">>;
 def split_dwarf_output : Separate<["-"], "split-dwarf-output">,
-  HelpText<"File name to use for split dwarf debug info output">,
+  HelpText<"File name to use for split DWARF debug info output">,
   Visibility<[CC1Option, CC1AsOption, FC1Option]>,
   MarshallingInfoString<CodeGenOpts<"SplitDwarfOutput">>;
 
@@ -8564,7 +8570,7 @@ def dependent_lib : Joined<["--"], "dependent-lib=">,
   MarshallingInfoStringVector<CodeGenOpts<"DependentLibraries">>;
 
 def split_dwarf_file : Separate<["-"], "split-dwarf-file">,
-  HelpText<"Name of the split dwarf debug info file to encode in the object file">,
+  HelpText<"Name of the split DWARF debug info file to encode in the object file">,
   MarshallingInfoString<CodeGenOpts<"SplitDwarfFile">>;
 
 } // let Visibility = [CC1Option, FC1Option]
diff --git a/clang/include/clang/Sema/Overload.h b/clang/include/clang/Sema/Overload.h
index 59bbd0fbd9e95..ab45328ee8ab7 100644
--- a/clang/include/clang/Sema/Overload.h
+++ b/clang/include/clang/Sema/Overload.h
@@ -198,6 +198,9 @@ class Sema;
     /// HLSL vector truncation.
     ICK_HLSL_Vector_Truncation,
 
+    /// HLSL matrix truncation.
+    ICK_HLSL_Matrix_Truncation,
+
     /// HLSL non-decaying array rvalue cast.
     ICK_HLSL_Array_RValue,
 
diff --git a/clang/include/clang/Sema/Sema.h b/clang/include/clang/Sema/Sema.h
index 78ecbccbe4efc..4a601a0eaf1b9 100644
--- a/clang/include/clang/Sema/Sema.h
+++ b/clang/include/clang/Sema/Sema.h
@@ -4957,6 +4957,11 @@ class Sema final : public SemaBase {
                                             IdentifierInfo *Format,
                                             int FormatIdx,
                                             StringLiteral *FormatStr);
+  ModularFormatAttr *mergeModularFormatAttr(Decl *D,
+                                            const AttributeCommonInfo &CI,
+                                            IdentifierInfo *ModularImplFn,
+                                            StringRef ImplName,
+                                            MutableArrayRef<StringRef> Aspects);
 
   /// AddAlignedAttr - Adds an aligned attribute to a particular declaration.
   void AddAlignedAttr(Decl *D, const AttributeCommonInfo &CI, Expr *E,
diff --git a/clang/include/clang/Sema/SemaCUDA.h b/clang/include/clang/Sema/SemaCUDA.h
index dbc1432860d89..dbb4290f5d149 100644
--- a/clang/include/clang/Sema/SemaCUDA.h
+++ b/clang/include/clang/Sema/SemaCUDA.h
@@ -273,6 +273,11 @@ class SemaCUDA : public SemaBase {
   /// of the function that will be called to configure kernel call, with the
   /// parameters specified via <<<>>>.
   std::string getConfigureFuncName() const;
+  /// Return the name of the parameter buffer allocation function for the
+  /// device kernel launch.
+  std::string getGetParameterBufferFuncName() const;
+  /// Return the name of the device kernel launch function.
+  std::string getLaunchDeviceFuncName() const;
 
   /// Record variables that are potentially ODR-used in CUDA/HIP.
   void recordPotentialODRUsedVariable(MultiExprArg Args,
diff --git a/clang/include/clang/Sema/SemaHLSL.h b/clang/include/clang/Sema/SemaHLSL.h
index 15edb7e77a22b..a2faa91d1e54d 100644
--- a/clang/include/clang/Sema/SemaHLSL.h
+++ b/clang/include/clang/Sema/SemaHLSL.h
@@ -134,9 +134,6 @@ class SemaHLSL : public SemaBase {
   void CheckEntryPoint(FunctionDecl *FD);
   bool CheckResourceBinOp(BinaryOperatorKind Opc, Expr *LHSExpr, Expr *RHSExpr,
                           SourceLocation Loc);
-  void DiagnoseAttrStageMismatch(
-      const Attr *A, llvm::Triple::EnvironmentType Stage,
-      std::initializer_list<llvm::Triple::EnvironmentType> AllowedStages);
 
   QualType handleVectorBinOpConversion(ExprResult &LHS, ExprResult &RHS,
                                        QualType LHSType, QualType RHSType,
@@ -171,6 +168,7 @@ class SemaHLSL : public SemaBase {
   void handleWaveSizeAttr(Decl *D, const ParsedAttr &AL);
   void handleVkConstantIdAttr(Decl *D, const ParsedAttr &AL);
   void handleVkBindingAttr(Decl *D, const ParsedAttr &AL);
+  void handleVkLocationAttr(Decl *D, const ParsedAttr &AL);
   void handlePackOffsetAttr(Decl *D, const ParsedAttr &AL);
   void handleShaderAttr(Decl *D, const ParsedAttr &AL);
   void handleResourceBindingAttr(Decl *D, const ParsedAttr &AL);
@@ -239,9 +237,36 @@ class SemaHLSL : public SemaBase {
 
   IdentifierInfo *RootSigOverrideIdent = nullptr;
 
+  // Information about the current subtree being flattened.
   struct SemanticInfo {
     HLSLParsedSemanticAttr *Semantic;
-    std::optional<uint32_t> Index;
+    std::optional<uint32_t> Index = std::nullopt;
+  };
+
+  // Bitmask used to track whether the current semantic subtree is
+  // input, output or inout.
+  enum IOType {
+    In = 0b01,
+    Out = 0b10,
+    InOut = 0b11,
+  };
+
+  // The context shared by all semantics with the same IOType during
+  // flattening.
+  struct SemanticContext {
+    // Present if any semantic sharing the same IO type has an explicit or
+    // implicit SPIR-V location index assigned.
+    std::optional<bool> UsesExplicitVkLocations = std::nullopt;
+    // The set of semantics found to be active during flattening. Used to detect
+    // index collisions.
+    llvm::StringSet<> ActiveSemantics = {};
+    // The IOType of this semantic set.
+    IOType CurrentIOType;
+  };
+
+  struct SemanticStageInfo {
+    llvm::Triple::EnvironmentType Stage;
+    IOType AllowedIOTypesMask;
   };
 
 private:
@@ -251,24 +276,30 @@ class SemaHLSL : public SemaBase {
 
   void checkSemanticAnnotation(FunctionDecl *EntryPoint, const Decl *Param,
                                const HLSLAppliedSemanticAttr *SemanticAttr,
-                               bool IsInput);
+                               const SemanticContext &SC);
 
   bool determineActiveSemanticOnScalar(FunctionDecl *FD,
                                        DeclaratorDecl *OutputDecl,
                                        DeclaratorDecl *D,
                                        SemanticInfo &ActiveSemantic,
-                                       llvm::StringSet<> &ActiveSemantics,
-                                       bool IsInput);
+                                       SemanticContext &SC);
 
   bool determineActiveSemantic(FunctionDecl *FD, DeclaratorDecl *OutputDecl,
                                DeclaratorDecl *D, SemanticInfo &ActiveSemantic,
-                               llvm::StringSet<> &ActiveSemantics,
-                               bool IsInput);
+                               SemanticContext &SC);
 
   void processExplicitBindingsOnDecl(VarDecl *D);
 
   void diagnoseAvailabilityViolations(TranslationUnitDecl *TU);
 
+  void diagnoseAttrStageMismatch(
+      const Attr *A, llvm::Triple::EnvironmentType Stage,
+      std::initializer_list<llvm::Triple::EnvironmentType> AllowedStages);
+
+  void diagnoseSemanticStageMismatch(
+      const Attr *A, llvm::Triple::EnvironmentType Stage, IOType CurrentIOType,
+      std::initializer_list<SemanticStageInfo> AllowedStages);
+
   uint32_t getNextImplicitBindingOrderID() {
     return ImplicitBindingNextOrderID++;
   }
diff --git a/clang/include/clang/Sema/SemaOpenACC.h b/clang/include/clang/Sema/SemaOpenACC.h
index f751e985ae0ff..b5e3ecab36d22 100644
--- a/clang/include/clang/Sema/SemaOpenACC.h
+++ b/clang/include/clang/Sema/SemaOpenACC.h
@@ -37,8 +37,16 @@ class Scope;
 class SemaOpenACC : public SemaBase {
 public:
   using DeclGroupPtrTy = OpaquePtr<DeclGroupRef>;
+  using RoutineRefListTy = std::pair<FunctionDecl *, OpenACCRoutineDecl *>;
 
 private:
+  // We save a list of routine clauses that refer to a different function (that
+  // is, routine-with-a-name) so that we can do the emission at the 'end'.  We
+  // have to do this, since functions can be emitted before they are referenced,
+  // and the OpenACCRoutineDecl isn't necessarily emitted, as it might be in a
+  // function/etc. So we do these emits at the end of the TU.
+  llvm::SmallVector<RoutineRefListTy> RoutineRefList;
+
   struct ComputeConstructInfo {
     /// Which type of compute construct we are inside of, which we can use to
     /// determine whether we should add loops to the above collection.  We can
@@ -752,6 +760,7 @@ class SemaOpenACC : public SemaBase {
   };
 
   SemaOpenACC(Sema &S);
+  void ActOnEndOfTranslationUnit(TranslationUnitDecl *TU);
 
   // Called when we encounter a 'while' statement, before looking at its 'body'.
   void ActOnWhileStmt(SourceLocation WhileLoc);
diff --git a/clang/include/clang/Sema/SemaOpenMP.h b/clang/include/clang/Sema/SemaOpenMP.h
index 686e51ee92a08..2d05b4423140b 100644
--- a/clang/include/clang/Sema/SemaOpenMP.h
+++ b/clang/include/clang/Sema/SemaOpenMP.h
@@ -1351,7 +1351,7 @@ class SemaOpenMP : public SemaBase {
   OMPClause *
   ActOnOpenMPToClause(ArrayRef<OpenMPMotionModifierKind> MotionModifiers,
                       ArrayRef<SourceLocation> MotionModifiersLoc,
-                      CXXScopeSpec &MapperIdScopeSpec,
+                      Expr *IteratorModifier, CXXScopeSpec &MapperIdScopeSpec,
                       DeclarationNameInfo &MapperId, SourceLocation ColonLoc,
                       ArrayRef<Expr *> VarList, const OMPVarListLocTy &Locs,
                       ArrayRef<Expr *> UnresolvedMappers = {});
@@ -1359,7 +1359,7 @@ class SemaOpenMP : public SemaBase {
   OMPClause *
   ActOnOpenMPFromClause(ArrayRef<OpenMPMotionModifierKind> MotionModifiers,
                         ArrayRef<SourceLocation> MotionModifiersLoc,
-                        CXXScopeSpec &MapperIdScopeSpec,
+                        Expr *IteratorModifier, CXXScopeSpec &MapperIdScopeSpec,
                         DeclarationNameInfo &MapperId, SourceLocation ColonLoc,
                         ArrayRef<Expr *> VarList, const OMPVarListLocTy &Locs,
                         ArrayRef<Expr *> UnresolvedMappers = {});
diff --git a/clang/include/clang/Serialization/ASTReader.h b/clang/include/clang/Serialization/ASTReader.h
index a27cfe8a9b307..d276f0d21b958 100644
--- a/clang/include/clang/Serialization/ASTReader.h
+++ b/clang/include/clang/Serialization/ASTReader.h
@@ -1005,7 +1005,7 @@ class ASTReader
   ///
   /// The AST context tracks a few important decls, currently cudaConfigureCall,
   /// directly.
-  SmallVector<GlobalDeclID, 2> CUDASpecialDeclRefs;
+  SmallVector<GlobalDeclID, 4> CUDASpecialDeclRefs;
 
   /// The floating point pragma option settings.
   SmallVector<uint64_t, 1> FPPragmaOptions;
diff --git a/clang/include/clang/Tooling/Transformer/RangeSelector.h b/clang/include/clang/Tooling/Transformer/RangeSelector.h
index 462a9da8f10eb..c76a5106edd65 100644
--- a/clang/include/clang/Tooling/Transformer/RangeSelector.h
+++ b/clang/include/clang/Tooling/Transformer/RangeSelector.h
@@ -37,6 +37,10 @@ RangeSelector enclose(RangeSelector Begin, RangeSelector End);
 /// Convenience version of \c range where end-points are bound nodes.
 RangeSelector encloseNodes(std::string BeginID, std::string EndID);
 
+/// Selects the merge of the two ranges, i.e. from min(First.begin,
+/// Second.begin) to max(First.end, Second.end).
+RangeSelector merge(RangeSelector First, RangeSelector Second);
+
 /// DEPRECATED. Use `enclose`.
 inline RangeSelector range(RangeSelector Begin, RangeSelector End) {
   return enclose(std::move(Begin), std::move(End));
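A small sketch (assumed usage, not from this patch) of the new merge selector inside a Transformer rule; the rule name and matcher are illustrative only.

  #include "clang/ASTMatchers/ASTMatchers.h"
  #include "clang/Tooling/Transformer/RangeSelector.h"
  #include "clang/Tooling/Transformer/RewriteRule.h"
  #include "clang/Tooling/Transformer/Stencil.h"

  using namespace clang::ast_matchers;
  using namespace clang::transformer;

  // Deletes the span covering both arguments of a two-argument call, whichever
  // argument's range happens to start first.
  RewriteRule dropBothArgs() {
    return makeRule(callExpr(hasArgument(0, expr().bind("a")),
                             hasArgument(1, expr().bind("b"))),
                    changeTo(merge(node("a"), node("b")), cat("")));
  }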
diff --git a/clang/lib/AST/ByteCode/Compiler.cpp b/clang/lib/AST/ByteCode/Compiler.cpp
index dd0b8e790d444..58e84ef70abb7 100644
--- a/clang/lib/AST/ByteCode/Compiler.cpp
+++ b/clang/lib/AST/ByteCode/Compiler.cpp
@@ -1705,6 +1705,9 @@ bool Compiler<Emitter>::VisitFixedPointUnaryOperator(const UnaryOperator *E) {
 template <class Emitter>
 bool Compiler<Emitter>::VisitImplicitValueInitExpr(
     const ImplicitValueInitExpr *E) {
+  if (DiscardResult)
+    return true;
+
   QualType QT = E->getType();
 
   if (OptPrimType T = classify(QT))
diff --git a/clang/lib/AST/ByteCode/Disasm.cpp b/clang/lib/AST/ByteCode/Disasm.cpp
index 638028f84ff24..30ed41cd37ca3 100644
--- a/clang/lib/AST/ByteCode/Disasm.cpp
+++ b/clang/lib/AST/ByteCode/Disasm.cpp
@@ -138,9 +138,16 @@ static size_t getNumDisplayWidth(size_t N) {
   return L;
 }
 
-LLVM_DUMP_METHOD void Function::dump() const { dump(llvm::errs()); }
+LLVM_DUMP_METHOD void Function::dump(CodePtr PC) const {
+  dump(llvm::errs(), PC);
+}
 
-LLVM_DUMP_METHOD void Function::dump(llvm::raw_ostream &OS) const {
+LLVM_DUMP_METHOD void Function::dump(llvm::raw_ostream &OS,
+                                     CodePtr OpPC) const {
+  if (OpPC) {
+    assert(OpPC >= getCodeBegin());
+    assert(OpPC <= getCodeEnd());
+  }
   {
     ColorScope SC(OS, true, {llvm::raw_ostream::BRIGHT_GREEN, true});
     OS << getName() << " " << (const void *)this << "\n";
@@ -154,6 +161,7 @@ LLVM_DUMP_METHOD void Function::dump(llvm::raw_ostream &OS) const {
     size_t Addr;
     std::string Op;
     bool IsJump;
+    bool CurrentOp = false;
     llvm::SmallVector<std::string> Args;
   };
 
@@ -171,6 +179,7 @@ LLVM_DUMP_METHOD void Function::dump(llvm::raw_ostream &OS) const {
     auto Op = PC.read<Opcode>();
     Text.Addr = Addr;
     Text.IsJump = isJumpOpcode(Op);
+    Text.CurrentOp = (PC == OpPC);
     switch (Op) {
 #define GET_DISASM
 #include "Opcodes.inc"
@@ -198,9 +207,15 @@ LLVM_DUMP_METHOD void Function::dump(llvm::raw_ostream &OS) const {
   Text.reserve(Code.size());
   size_t LongestLine = 0;
   // Print code to a string, one at a time.
-  for (auto C : Code) {
+  for (const auto &C : Code) {
     std::string Line;
     llvm::raw_string_ostream LS(Line);
+    if (OpPC) {
+      if (C.CurrentOp)
+        LS << " * ";
+      else
+        LS << "   ";
+    }
     LS << C.Addr;
     LS.indent(LongestAddr - getNumDisplayWidth(C.Addr) + 4);
     LS << C.Op;
diff --git a/clang/lib/AST/ByteCode/Function.h b/clang/lib/AST/ByteCode/Function.h
index 8c309c921afa9..80283afb6e987 100644
--- a/clang/lib/AST/ByteCode/Function.h
+++ b/clang/lib/AST/ByteCode/Function.h
@@ -312,8 +312,8 @@ class Function final {
 
 public:
   /// Dumps the disassembled bytecode to \c llvm::errs().
-  void dump() const;
-  void dump(llvm::raw_ostream &OS) const;
+  void dump(CodePtr PC = {}) const;
+  void dump(llvm::raw_ostream &OS, CodePtr PC = {}) const;
 };
 
 } // namespace interp
diff --git a/clang/lib/AST/ByteCode/InterpBuiltin.cpp b/clang/lib/AST/ByteCode/InterpBuiltin.cpp
index d21f42d94d3a5..4a789fe3a6af4 100644
--- a/clang/lib/AST/ByteCode/InterpBuiltin.cpp
+++ b/clang/lib/AST/ByteCode/InterpBuiltin.cpp
@@ -48,6 +48,11 @@ static void discard(InterpStack &Stk, PrimType T) {
   TYPE_SWITCH(T, { Stk.discard<T>(); });
 }
 
+static uint64_t popToUInt64(const InterpState &S, const Expr *E) {
+  INT_TYPE_SWITCH(*S.getContext().classify(E->getType()),
+                  return static_cast<uint64_t>(S.Stk.pop<T>()));
+}
+
 static APSInt popToAPSInt(InterpStack &Stk, PrimType T) {
   INT_TYPE_SWITCH(T, return Stk.pop<T>().toAPSInt());
 }
@@ -167,6 +172,38 @@ static llvm::APSInt convertBoolVectorToInt(const Pointer &Val) {
   return Result;
 }
 
+// Strict double -> float conversion used for X86 PD2PS/cvtsd2ss intrinsics.
+// Rejects NaN/Inf/subnormal inputs and any lossy/inexact conversions.
+static bool convertDoubleToFloatStrict(APFloat Src, Floating &Dst,
+                                       InterpState &S, const Expr *DiagExpr) {
+  if (Src.isInfinity()) {
+    if (S.diagnosing())
+      S.CCEDiag(DiagExpr, diag::note_constexpr_float_arithmetic) << 0;
+    return false;
+  }
+  if (Src.isNaN()) {
+    if (S.diagnosing())
+      S.CCEDiag(DiagExpr, diag::note_constexpr_float_arithmetic) << 1;
+    return false;
+  }
+  APFloat Val = Src;
+  bool LosesInfo = false;
+  APFloat::opStatus Status = Val.convert(
+      APFloat::IEEEsingle(), APFloat::rmNearestTiesToEven, &LosesInfo);
+  if (LosesInfo || Val.isDenormal()) {
+    if (S.diagnosing())
+      S.CCEDiag(DiagExpr, diag::note_constexpr_float_arithmetic_strict);
+    return false;
+  }
+  if (Status != APFloat::opOK) {
+    if (S.diagnosing())
+      S.CCEDiag(DiagExpr, diag::note_invalid_subexpr_in_const_expr);
+    return false;
+  }
+  Dst.copy(Val);
+  return true;
+}
+
 static bool interp__builtin_is_constant_evaluated(InterpState &S, CodePtr OpPC,
                                                   const InterpFrame *Frame,
                                                   const CallExpr *Call) {
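For context, a minimal standalone sketch of the strictness rule the helper above enforces (illustrative only, not part of the patch; it assumes llvm::APFloat and a hypothetical helper name):

    #include "llvm/ADT/APFloat.h"

    // True iff D converts to float exactly: finite, not NaN, no precision
    // loss, and the converted value is not a denormal, matching the checks
    // performed above.
    static bool convertsToFloatStrictly(double D) {
      llvm::APFloat Val(D);
      if (Val.isInfinity() || Val.isNaN())
        return false;
      bool LosesInfo = false;
      llvm::APFloat::opStatus Status = Val.convert(
          llvm::APFloat::IEEEsingle(), llvm::APFloat::rmNearestTiesToEven,
          &LosesInfo);
      return Status == llvm::APFloat::opOK && !LosesInfo && !Val.isDenormal();
    }
    // convertsToFloatStrictly(1.5) == true;
    // convertsToFloatStrictly(0.1) == false (inexact).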
@@ -212,8 +249,7 @@ static bool interp__builtin_strcmp(InterpState &S, CodePtr OpPC,
   uint64_t Limit = ~static_cast<uint64_t>(0);
   if (ID == Builtin::BIstrncmp || ID == Builtin::BI__builtin_strncmp ||
       ID == Builtin::BIwcsncmp || ID == Builtin::BI__builtin_wcsncmp)
-    Limit = popToAPSInt(S.Stk, *S.getContext().classify(Call->getArg(2)))
-                .getZExtValue();
+    Limit = popToUInt64(S, Call->getArg(2));
 
   const Pointer &B = S.Stk.pop<Pointer>();
   const Pointer &A = S.Stk.pop<Pointer>();
@@ -991,7 +1027,7 @@ static bool interp__builtin_atomic_lock_free(InterpState &S, CodePtr OpPC,
   };
 
   const Pointer &Ptr = S.Stk.pop<Pointer>();
-  const APSInt &SizeVal = popToAPSInt(S, Call->getArg(0));
+  uint64_t SizeVal = popToUInt64(S, Call->getArg(0));
 
   // For __atomic_is_lock_free(sizeof(_Atomic(T))), if the size is a power
   // of two less than or equal to the maximum inline atomic width, we know it
@@ -1003,7 +1039,7 @@ static bool interp__builtin_atomic_lock_free(InterpState &S, CodePtr OpPC,
   // x86-64 processors.
 
   // Check power-of-two.
-  CharUnits Size = CharUnits::fromQuantity(SizeVal.getZExtValue());
+  CharUnits Size = CharUnits::fromQuantity(SizeVal);
   if (Size.isPowerOfTwo()) {
     // Check against inlining width.
     unsigned InlineWidthBits =
@@ -1057,9 +1093,9 @@ static bool interp__builtin_c11_atomic_is_lock_free(InterpState &S,
                                                     CodePtr OpPC,
                                                     const InterpFrame *Frame,
                                                     const CallExpr *Call) {
-  const APSInt &SizeVal = popToAPSInt(S, Call->getArg(0));
+  uint64_t SizeVal = popToUInt64(S, Call->getArg(0));
 
-  CharUnits Size = CharUnits::fromQuantity(SizeVal.getZExtValue());
+  CharUnits Size = CharUnits::fromQuantity(SizeVal);
   if (Size.isPowerOfTwo()) {
     // Check against inlining width.
     unsigned InlineWidthBits =
@@ -1626,51 +1662,6 @@ static bool interp__builtin_elementwise_abs(InterpState &S, CodePtr OpPC,
   return true;
 }
 
-/// Can be called with an integer or vector as the first and only parameter.
-static bool interp__builtin_elementwise_popcount(InterpState &S, CodePtr OpPC,
-                                                 const InterpFrame *Frame,
-                                                 const CallExpr *Call,
-                                                 unsigned BuiltinID) {
-  assert(Call->getNumArgs() == 1);
-  if (Call->getArg(0)->getType()->isIntegerType()) {
-    APSInt Val = popToAPSInt(S, Call->getArg(0));
-
-    if (BuiltinID == Builtin::BI__builtin_elementwise_popcount) {
-      pushInteger(S, Val.popcount(), Call->getType());
-    } else {
-      pushInteger(S, Val.reverseBits(), Call->getType());
-    }
-    return true;
-  }
-  // Otherwise, the argument must be a vector.
-  assert(Call->getArg(0)->getType()->isVectorType());
-  const Pointer &Arg = S.Stk.pop<Pointer>();
-  assert(Arg.getFieldDesc()->isPrimitiveArray());
-  const Pointer &Dst = S.Stk.peek<Pointer>();
-  assert(Dst.getFieldDesc()->isPrimitiveArray());
-  assert(Arg.getFieldDesc()->getNumElems() ==
-         Dst.getFieldDesc()->getNumElems());
-
-  QualType ElemType = Arg.getFieldDesc()->getElemQualType();
-  PrimType ElemT = *S.getContext().classify(ElemType);
-  unsigned NumElems = Arg.getNumElems();
-
-  // FIXME: Reading from uninitialized vector elements?
-  for (unsigned I = 0; I != NumElems; ++I) {
-    INT_TYPE_SWITCH_NO_BOOL(ElemT, {
-      if (BuiltinID == Builtin::BI__builtin_elementwise_popcount) {
-        Dst.elem<T>(I) = T::from(Arg.elem<T>(I).toAPSInt().popcount());
-      } else {
-        Dst.elem<T>(I) =
-            T::from(Arg.elem<T>(I).toAPSInt().reverseBits().getZExtValue());
-      }
-    });
-  }
-  Dst.initializeAllElements();
-
-  return true;
-}
-
 /// Can be called with an integer or vector as the first and only parameter.
 static bool interp__builtin_elementwise_countzeroes(InterpState &S,
                                                     CodePtr OpPC,
@@ -1764,12 +1755,10 @@ static bool interp__builtin_memcpy(InterpState &S, CodePtr OpPC,
                                    const CallExpr *Call, unsigned ID) {
   assert(Call->getNumArgs() == 3);
   const ASTContext &ASTCtx = S.getASTContext();
-  APSInt Size = popToAPSInt(S, Call->getArg(2));
+  uint64_t Size = popToUInt64(S, Call->getArg(2));
   Pointer SrcPtr = S.Stk.pop<Pointer>().expand();
   Pointer DestPtr = S.Stk.pop<Pointer>().expand();
 
-  assert(!Size.isSigned() && "memcpy and friends take an unsigned size");
-
   if (ID == Builtin::BImemcpy || ID == Builtin::BImemmove)
     diagnoseNonConstexprBuiltin(S, OpPC, ID);
 
@@ -1781,7 +1770,7 @@ static bool interp__builtin_memcpy(InterpState &S, CodePtr OpPC,
                ID == Builtin::BI__builtin_wmemmove;
 
   // If the size is zero, we treat this as always being a valid no-op.
-  if (Size.isZero()) {
+  if (Size == 0) {
     S.Stk.push<Pointer>(DestPtr);
     return true;
   }
@@ -1843,11 +1832,10 @@ static bool interp__builtin_memcpy(InterpState &S, CodePtr OpPC,
   if (WChar) {
     uint64_t WCharSize =
         ASTCtx.getTypeSizeInChars(ASTCtx.getWCharType()).getQuantity();
-    Size *= APSInt(APInt(Size.getBitWidth(), WCharSize, /*IsSigned=*/false),
-                   /*IsUnsigend=*/true);
+    Size *= WCharSize;
   }
 
-  if (Size.urem(DestElemSize) != 0) {
+  if (Size % DestElemSize != 0) {
     S.FFDiag(S.Current->getSource(OpPC),
              diag::note_constexpr_memcpy_unsupported)
         << Move << WChar << 0 << DestElemType << Size << DestElemSize;
@@ -1880,12 +1868,12 @@ static bool interp__builtin_memcpy(InterpState &S, CodePtr OpPC,
   // Check if we have enough elements to read from and write to.
   size_t RemainingDestBytes = RemainingDestElems * DestElemSize;
   size_t RemainingSrcBytes = RemainingSrcElems * SrcElemSize;
-  if (Size.ugt(RemainingDestBytes) || Size.ugt(RemainingSrcBytes)) {
-    APInt N = Size.udiv(DestElemSize);
+  if (Size > RemainingDestBytes || Size > RemainingSrcBytes) {
+    APInt N = APInt(64, Size / DestElemSize);
     S.FFDiag(S.Current->getSource(OpPC),
              diag::note_constexpr_memcpy_unsupported)
-        << Move << WChar << (Size.ugt(RemainingSrcBytes) ? 1 : 2)
-        << DestElemType << toString(N, 10, /*Signed=*/false);
+        << Move << WChar << (Size > RemainingSrcBytes ? 1 : 2) << DestElemType
+        << toString(N, 10, /*Signed=*/false);
     return false;
   }
 
@@ -1902,18 +1890,17 @@ static bool interp__builtin_memcpy(InterpState &S, CodePtr OpPC,
 
     unsigned SrcIndex = SrcP.expand().getIndex() * SrcP.elemSize();
     unsigned DstIndex = DestP.expand().getIndex() * DestP.elemSize();
-    unsigned N = Size.getZExtValue();
 
-    if ((SrcIndex <= DstIndex && (SrcIndex + N) > DstIndex) ||
-        (DstIndex <= SrcIndex && (DstIndex + N) > SrcIndex)) {
+    if ((SrcIndex <= DstIndex && (SrcIndex + Size) > DstIndex) ||
+        (DstIndex <= SrcIndex && (DstIndex + Size) > SrcIndex)) {
       S.FFDiag(S.Current->getSource(OpPC), diag::note_constexpr_memcpy_overlap)
           << /*IsWChar=*/false;
       return false;
     }
   }
 
-  assert(Size.getZExtValue() % DestElemSize == 0);
-  if (!DoMemcpy(S, OpPC, SrcPtr, DestPtr, Bytes(Size.getZExtValue()).toBits()))
+  assert(Size % DestElemSize == 0);
+  if (!DoMemcpy(S, OpPC, SrcPtr, DestPtr, Bytes(Size).toBits()))
     return false;
 
   S.Stk.push<Pointer>(DestPtr);
@@ -1930,7 +1917,7 @@ static bool interp__builtin_memcmp(InterpState &S, CodePtr OpPC,
                                    const InterpFrame *Frame,
                                    const CallExpr *Call, unsigned ID) {
   assert(Call->getNumArgs() == 3);
-  const APSInt &Size = popToAPSInt(S, Call->getArg(2));
+  uint64_t Size = popToUInt64(S, Call->getArg(2));
   const Pointer &PtrB = S.Stk.pop<Pointer>();
   const Pointer &PtrA = S.Stk.pop<Pointer>();
 
@@ -1938,7 +1925,7 @@ static bool interp__builtin_memcmp(InterpState &S, CodePtr OpPC,
       ID == Builtin::BIwmemcmp)
     diagnoseNonConstexprBuiltin(S, OpPC, ID);
 
-  if (Size.isZero()) {
+  if (Size == 0) {
     pushInteger(S, 0, Call->getType());
     return true;
   }
@@ -1966,6 +1953,10 @@ static bool interp__builtin_memcmp(InterpState &S, CodePtr OpPC,
   if (PtrA.isDummy() || PtrB.isDummy())
     return false;
 
+  if (!CheckRange(S, OpPC, PtrA, AK_Read) ||
+      !CheckRange(S, OpPC, PtrB, AK_Read))
+    return false;
+
   // Now, read both pointers to a buffer and compare those.
   BitcastBuffer BufferA(
       Bits(ASTCtx.getTypeSize(ElemTypeA) * PtrA.getNumElems()));
@@ -1991,7 +1982,7 @@ static bool interp__builtin_memcmp(InterpState &S, CodePtr OpPC,
     ElemSize = ASTCtx.getTypeSizeInChars(ASTCtx.getWCharType()).getQuantity();
   // The Size given for the wide variants is in wide-char units. Convert it
   // to bytes.
-  size_t ByteSize = Size.getZExtValue() * ElemSize;
+  size_t ByteSize = Size * ElemSize;
   size_t CmpSize = std::min(MinBufferSize, ByteSize);
 
   for (size_t I = 0; I != CmpSize; I += ElemSize) {
@@ -2279,7 +2270,7 @@ static bool interp__builtin_object_size(InterpState &S, CodePtr OpPC,
   // clear, objects are whole variables. If it is set, a closest surrounding
   // subobject is considered the object a pointer points to. The second bit
   // determines if maximum or minimum of remaining bytes is computed.
-  unsigned Kind = popToAPSInt(S, Call->getArg(1)).getZExtValue();
+  unsigned Kind = popToUInt64(S, Call->getArg(1));
   assert(Kind <= 3 && "unexpected kind");
   bool UseFieldDesc = (Kind & 1u);
   bool ReportMinimum = (Kind & 2u);
@@ -2407,18 +2398,39 @@ static bool interp__builtin_elementwise_int_unaryop(
     InterpState &S, CodePtr OpPC, const CallExpr *Call,
     llvm::function_ref<APInt(const APSInt &)> Fn) {
   assert(Call->getNumArgs() == 1);
-  assert(Call->getType()->isIntegerType());
 
   // Single integer case.
   if (!Call->getArg(0)->getType()->isVectorType()) {
+    assert(Call->getType()->isIntegerType());
     APSInt Src = popToAPSInt(S, Call->getArg(0));
     APInt Result = Fn(Src);
     pushInteger(S, APSInt(std::move(Result), !Src.isSigned()), Call->getType());
     return true;
   }
 
-  // TODO: Add vector integer handling.
-  return false;
+  // Vector case.
+  const Pointer &Arg = S.Stk.pop<Pointer>();
+  assert(Arg.getFieldDesc()->isPrimitiveArray());
+  const Pointer &Dst = S.Stk.peek<Pointer>();
+  assert(Dst.getFieldDesc()->isPrimitiveArray());
+  assert(Arg.getFieldDesc()->getNumElems() ==
+         Dst.getFieldDesc()->getNumElems());
+
+  QualType ElemType = Arg.getFieldDesc()->getElemQualType();
+  PrimType ElemT = *S.getContext().classify(ElemType);
+  unsigned NumElems = Arg.getNumElems();
+  bool DestUnsigned = Call->getType()->isUnsignedIntegerOrEnumerationType();
+
+  for (unsigned I = 0; I != NumElems; ++I) {
+    INT_TYPE_SWITCH_NO_BOOL(ElemT, {
+      APSInt Src = Arg.elem<T>(I).toAPSInt();
+      APInt Result = Fn(Src);
+      Dst.elem<T>(I) = static_cast<T>(APSInt(std::move(Result), DestUnsigned));
+    });
+  }
+  Dst.initializeAllElements();
+
+  return true;
 }
 
 static bool interp__builtin_elementwise_int_binop(
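Illustrative only, not part of the patch: the kind of vector constant the generic unary path above folds, assuming the builtin is enabled for constant evaluation on the target:

    typedef int v4si __attribute__((vector_size(16)));

    // Element-wise popcount over a vector constant; each lane is folded
    // independently, matching the per-element loop above.
    constexpr v4si Src = {1, 3, 7, 0};
    constexpr v4si Pop = __builtin_elementwise_popcount(Src);
    // Pop == {1, 2, 3, 0}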
@@ -3383,6 +3395,122 @@ static bool interp__builtin_ia32_cvt_vec2mask(InterpState &S, CodePtr OpPC,
   pushInteger(S, RetMask, Call->getType());
   return true;
 }
+static bool interp__builtin_ia32_cvtsd2ss(InterpState &S, CodePtr OpPC,
+                                          const CallExpr *Call,
+                                          bool HasRoundingMask) {
+  APSInt Rounding, MaskInt;
+  Pointer Src, B, A;
+
+  if (HasRoundingMask) {
+    assert(Call->getNumArgs() == 5);
+    Rounding = popToAPSInt(S, Call->getArg(4));
+    MaskInt = popToAPSInt(S, Call->getArg(3));
+    Src = S.Stk.pop<Pointer>();
+    B = S.Stk.pop<Pointer>();
+    A = S.Stk.pop<Pointer>();
+    if (!CheckLoad(S, OpPC, A) || !CheckLoad(S, OpPC, B) ||
+        !CheckLoad(S, OpPC, Src))
+      return false;
+  } else {
+    assert(Call->getNumArgs() == 2);
+    B = S.Stk.pop<Pointer>();
+    A = S.Stk.pop<Pointer>();
+    if (!CheckLoad(S, OpPC, A) || !CheckLoad(S, OpPC, B))
+      return false;
+  }
+
+  const auto *DstVTy = Call->getType()->castAs<VectorType>();
+  unsigned NumElems = DstVTy->getNumElements();
+  const Pointer &Dst = S.Stk.peek<Pointer>();
+
+  // Copy all elements except lane 0 (overwritten below) from A to Dst.
+  for (unsigned I = 1; I != NumElems; ++I)
+    Dst.elem<Floating>(I) = A.elem<Floating>(I);
+
+  // Convert element 0 from double to float, or use Src if masked off.
+  if (!HasRoundingMask || (MaskInt.getZExtValue() & 0x1)) {
+    assert(S.getASTContext().FloatTy == DstVTy->getElementType() &&
+           "cvtsd2ss requires float element type in destination vector");
+
+    Floating Conv = S.allocFloat(
+        S.getASTContext().getFloatTypeSemantics(DstVTy->getElementType()));
+    APFloat SrcVal = B.elem<Floating>(0).getAPFloat();
+    if (!convertDoubleToFloatStrict(SrcVal, Conv, S, Call))
+      return false;
+    Dst.elem<Floating>(0) = Conv;
+  } else {
+    Dst.elem<Floating>(0) = Src.elem<Floating>(0);
+  }
+
+  Dst.initializeAllElements();
+  return true;
+}
+
+static bool interp__builtin_ia32_cvtpd2ps(InterpState &S, CodePtr OpPC,
+                                          const CallExpr *Call, bool IsMasked,
+                                          bool HasRounding) {
+
+  APSInt MaskVal;
+  Pointer PassThrough;
+  Pointer Src;
+  APSInt Rounding;
+
+  if (IsMasked) {
+    // Pop in reverse order.
+    if (HasRounding) {
+      Rounding = popToAPSInt(S, Call->getArg(3));
+      MaskVal = popToAPSInt(S, Call->getArg(2));
+      PassThrough = S.Stk.pop<Pointer>();
+      Src = S.Stk.pop<Pointer>();
+    } else {
+      MaskVal = popToAPSInt(S, Call->getArg(2));
+      PassThrough = S.Stk.pop<Pointer>();
+      Src = S.Stk.pop<Pointer>();
+    }
+
+    if (!CheckLoad(S, OpPC, PassThrough))
+      return false;
+  } else {
+    // Pop source only.
+    Src = S.Stk.pop<Pointer>();
+  }
+
+  if (!CheckLoad(S, OpPC, Src))
+    return false;
+
+  const auto *RetVTy = Call->getType()->castAs<VectorType>();
+  unsigned RetElems = RetVTy->getNumElements();
+  unsigned SrcElems = Src.getNumElems();
+  const Pointer &Dst = S.Stk.peek<Pointer>();
+
+  // Initialize destination with passthrough or zeros.
+  for (unsigned I = 0; I != RetElems; ++I)
+    if (IsMasked)
+      Dst.elem<Floating>(I) = PassThrough.elem<Floating>(I);
+    else
+      Dst.elem<Floating>(I) = Floating(APFloat(0.0f));
+
+  assert(S.getASTContext().FloatTy == RetVTy->getElementType() &&
+         "cvtpd2ps requires float element type in return vector");
+
+  // Convert double to float for enabled elements (only process source elements
+  // that exist).
+  for (unsigned I = 0; I != SrcElems; ++I) {
+    if (IsMasked && !MaskVal[I])
+      continue;
+
+    APFloat SrcVal = Src.elem<Floating>(I).getAPFloat();
+
+    Floating Conv = S.allocFloat(
+        S.getASTContext().getFloatTypeSemantics(RetVTy->getElementType()));
+    if (!convertDoubleToFloatStrict(SrcVal, Conv, S, Call))
+      return false;
+    Dst.elem<Floating>(I) = Conv;
+  }
+
+  Dst.initializeAllElements();
+  return true;
+}
 
 static bool interp__builtin_ia32_shuffle_generic(
     InterpState &S, CodePtr OpPC, const CallExpr *Call,
@@ -4127,6 +4255,30 @@ bool InterpretBuiltin(InterpState &S, CodePtr OpPC, const CallExpr *Call,
           return APInt(sizeof(unsigned char) * 8, (A | B) == 0);
         });
 
+  case clang::X86::BI__builtin_ia32_kshiftliqi:
+  case clang::X86::BI__builtin_ia32_kshiftlihi:
+  case clang::X86::BI__builtin_ia32_kshiftlisi:
+  case clang::X86::BI__builtin_ia32_kshiftlidi:
+    return interp__builtin_elementwise_int_binop(
+        S, OpPC, Call, [](const APSInt &LHS, const APSInt &RHS) {
+          unsigned Amt = RHS.getZExtValue() & 0xFF;
+          if (Amt >= LHS.getBitWidth())
+            return APInt::getZero(LHS.getBitWidth());
+          return LHS.shl(Amt);
+        });
+
+  case clang::X86::BI__builtin_ia32_kshiftriqi:
+  case clang::X86::BI__builtin_ia32_kshiftrihi:
+  case clang::X86::BI__builtin_ia32_kshiftrisi:
+  case clang::X86::BI__builtin_ia32_kshiftridi:
+    return interp__builtin_elementwise_int_binop(
+        S, OpPC, Call, [](const APSInt &LHS, const APSInt &RHS) {
+          unsigned Amt = RHS.getZExtValue() & 0xFF;
+          if (Amt >= LHS.getBitWidth())
+            return APInt::getZero(LHS.getBitWidth());
+          return LHS.lshr(Amt);
+        });
+
   case clang::X86::BI__builtin_ia32_lzcnt_u16:
   case clang::X86::BI__builtin_ia32_lzcnt_u32:
   case clang::X86::BI__builtin_ia32_lzcnt_u64:
@@ -4212,9 +4364,13 @@ bool InterpretBuiltin(InterpState &S, CodePtr OpPC, const CallExpr *Call,
     return interp__builtin_vector_reduce(S, OpPC, Call, BuiltinID);
 
   case Builtin::BI__builtin_elementwise_popcount:
+    return interp__builtin_elementwise_int_unaryop(
+        S, OpPC, Call, [](const APSInt &Src) {
+          return APInt(Src.getBitWidth(), Src.popcount());
+        });
   case Builtin::BI__builtin_elementwise_bitreverse:
-    return interp__builtin_elementwise_popcount(S, OpPC, Frame, Call,
-                                                BuiltinID);
+    return interp__builtin_elementwise_int_unaryop(
+        S, OpPC, Call, [](const APSInt &Src) { return Src.reverseBits(); });
 
   case Builtin::BI__builtin_elementwise_abs:
     return interp__builtin_elementwise_abs(S, OpPC, Frame, Call, BuiltinID);
@@ -4960,6 +5116,16 @@ bool InterpretBuiltin(InterpState &S, CodePtr OpPC, const CallExpr *Call,
           return std::make_pair(0, static_cast<int>(LaneOffset + Index));
         });
 
+  case X86::BI__builtin_ia32_permdf256:
+  case X86::BI__builtin_ia32_permdi256:
+    return interp__builtin_ia32_shuffle_generic(
+        S, OpPC, Call, [](unsigned DstIdx, unsigned Control) {
+          // permute4x64 operates on 4 64-bit elements
+          // For element i (0-3), extract bits [2*i+1:2*i] from Control
+          unsigned Index = (Control >> (2 * DstIdx)) & 0x3;
+          return std::make_pair(0, static_cast<int>(Index));
+        });
+
   case X86::BI__builtin_ia32_vpmultishiftqb128:
   case X86::BI__builtin_ia32_vpmultishiftqb256:
   case X86::BI__builtin_ia32_vpmultishiftqb512:
@@ -5019,6 +5185,13 @@ bool InterpretBuiltin(InterpState &S, CodePtr OpPC, const CallExpr *Call,
         S, OpPC, Call,
         [](const APSInt &LHS, const APSInt &RHS) { return LHS + RHS; });
 
+  case X86::BI__builtin_ia32_kmovb:
+  case X86::BI__builtin_ia32_kmovw:
+  case X86::BI__builtin_ia32_kmovd:
+  case X86::BI__builtin_ia32_kmovq:
+    return interp__builtin_elementwise_int_unaryop(
+        S, OpPC, Call, [](const APSInt &Src) { return Src; });
+
   case X86::BI__builtin_ia32_kunpckhi:
   case X86::BI__builtin_ia32_kunpckdi:
   case X86::BI__builtin_ia32_kunpcksi:
@@ -5189,6 +5362,20 @@ bool InterpretBuiltin(InterpState &S, CodePtr OpPC, const CallExpr *Call,
   case X86::BI__builtin_ia32_cvtq2mask512:
     return interp__builtin_ia32_cvt_vec2mask(S, OpPC, Call, BuiltinID);
 
+  case X86::BI__builtin_ia32_cvtsd2ss:
+    return interp__builtin_ia32_cvtsd2ss(S, OpPC, Call, false);
+
+  case X86::BI__builtin_ia32_cvtsd2ss_round_mask:
+    return interp__builtin_ia32_cvtsd2ss(S, OpPC, Call, true);
+
+  case X86::BI__builtin_ia32_cvtpd2ps:
+  case X86::BI__builtin_ia32_cvtpd2ps256:
+    return interp__builtin_ia32_cvtpd2ps(S, OpPC, Call, false, false);
+  case X86::BI__builtin_ia32_cvtpd2ps_mask:
+    return interp__builtin_ia32_cvtpd2ps(S, OpPC, Call, true, false);
+  case X86::BI__builtin_ia32_cvtpd2ps512_mask:
+    return interp__builtin_ia32_cvtpd2ps(S, OpPC, Call, true, true);
+
   case X86::BI__builtin_ia32_cmpb128_mask:
   case X86::BI__builtin_ia32_cmpw128_mask:
   case X86::BI__builtin_ia32_cmpd128_mask:
diff --git a/clang/lib/AST/ByteCode/Source.h b/clang/lib/AST/ByteCode/Source.h
index f355d14db5e30..56ca197e66473 100644
--- a/clang/lib/AST/ByteCode/Source.h
+++ b/clang/lib/AST/ByteCode/Source.h
@@ -51,6 +51,7 @@ class CodePtr final {
   explicit operator bool() const { return Ptr; }
   bool operator<=(const CodePtr &RHS) const { return Ptr <= RHS.Ptr; }
   bool operator>=(const CodePtr &RHS) const { return Ptr >= RHS.Ptr; }
+  bool operator==(const CodePtr &RHS) const { return Ptr == RHS.Ptr; }
 
   /// Reads data and advances the pointer.
   template <typename T> std::enable_if_t<!std::is_pointer<T>::value, T> read() {
diff --git a/clang/lib/AST/CXXInheritance.cpp b/clang/lib/AST/CXXInheritance.cpp
index 7a3e7ea4e5b8f..29f5916284ebb 100644
--- a/clang/lib/AST/CXXInheritance.cpp
+++ b/clang/lib/AST/CXXInheritance.cpp
@@ -34,9 +34,9 @@ using namespace clang;
 /// ambiguous, i.e., there are two or more paths that refer to
 /// different base class subobjects of the same type. BaseType must be
 /// an unqualified, canonical class type.
-bool CXXBasePaths::isAmbiguous(CanQualType BaseType) {
+bool CXXBasePaths::isAmbiguous(CanQualType BaseType) const {
   BaseType = BaseType.getUnqualifiedType();
-  IsVirtBaseAndNumberNonVirtBases Subobjects = ClassSubobjects[BaseType];
+  IsVirtBaseAndNumberNonVirtBases Subobjects = ClassSubobjects.lookup(BaseType);
   return Subobjects.NumberOfNonVirtBases + (Subobjects.IsVirtBase ? 1 : 0) > 1;
 }
 
diff --git a/clang/lib/AST/ComparisonCategories.cpp b/clang/lib/AST/ComparisonCategories.cpp
index 0c7a7f4eacbbf..1b9c938e2ace3 100644
--- a/clang/lib/AST/ComparisonCategories.cpp
+++ b/clang/lib/AST/ComparisonCategories.cpp
@@ -49,7 +49,7 @@ bool ComparisonCategoryInfo::ValueInfo::hasValidIntValue() const {
   // Before we attempt to get the value of the first field, ensure that we
   // actually have one (and only one) field.
   const auto *Record = VD->getType()->getAsCXXRecordDecl();
-  if (std::distance(Record->field_begin(), Record->field_end()) != 1 ||
+  if (Record->getNumFields() != 1 ||
       !Record->field_begin()->getType()->isIntegralOrEnumerationType())
     return false;
 
diff --git a/clang/lib/AST/Expr.cpp b/clang/lib/AST/Expr.cpp
index 1f405920ce6b5..ca7f3e16a9276 100644
--- a/clang/lib/AST/Expr.cpp
+++ b/clang/lib/AST/Expr.cpp
@@ -1934,6 +1934,7 @@ bool CastExpr::CastConsistency() const {
   case CK_FixedPointToBoolean:
   case CK_HLSLArrayRValue:
   case CK_HLSLVectorTruncation:
+  case CK_HLSLMatrixTruncation:
   case CK_HLSLElementwiseCast:
   case CK_HLSLAggregateSplatCast:
   CheckNoBasePath:
diff --git a/clang/lib/AST/ExprConstant.cpp b/clang/lib/AST/ExprConstant.cpp
index b986ee6ca4fa3..11c5e1c6e90f4 100644
--- a/clang/lib/AST/ExprConstant.cpp
+++ b/clang/lib/AST/ExprConstant.cpp
@@ -3971,8 +3971,7 @@ static bool constructAggregate(EvalInfo &Info, const FPOptions FPO,
       if (auto *CXXRD = dyn_cast<CXXRecordDecl>(RD))
         NumBases = CXXRD->getNumBases();
 
-      *Res = APValue(APValue::UninitStruct(), NumBases,
-                     std::distance(RD->field_begin(), RD->field_end()));
+      *Res = APValue(APValue::UninitStruct(), NumBases, RD->getNumFields());
 
       SmallVector<std::tuple<APValue *, QualType, unsigned>> ReverseList;
       // we need to traverse backwards
@@ -5529,8 +5528,8 @@ static bool handleDefaultInitValue(QualType T, APValue &Result) {
       Result = APValue((const FieldDecl *)nullptr);
       return true;
     }
-    Result = APValue(APValue::UninitStruct(), RD->getNumBases(),
-                     std::distance(RD->field_begin(), RD->field_end()));
+    Result =
+        APValue(APValue::UninitStruct(), RD->getNumBases(), RD->getNumFields());
 
     unsigned Index = 0;
     for (CXXRecordDecl::base_class_const_iterator I = RD->bases_begin(),
@@ -7184,7 +7183,7 @@ static bool HandleConstructorCall(const Expr *E, const LValue &This,
   if (!Result.hasValue()) {
     if (!RD->isUnion())
       Result = APValue(APValue::UninitStruct(), RD->getNumBases(),
-                       std::distance(RD->field_begin(), RD->field_end()));
+                       RD->getNumFields());
     else
       // A union starts with no active member.
       Result = APValue((const FieldDecl*)nullptr);
@@ -8135,8 +8134,7 @@ class BufferToAPValueConverter {
     if (auto *CXXRD = dyn_cast<CXXRecordDecl>(RD))
       NumBases = CXXRD->getNumBases();
 
-    APValue ResultVal(APValue::UninitStruct(), NumBases,
-                      std::distance(RD->field_begin(), RD->field_end()));
+    APValue ResultVal(APValue::UninitStruct(), NumBases, RD->getNumFields());
 
     // Visit the base classes.
     if (auto *CXXRD = dyn_cast<CXXRecordDecl>(RD)) {
@@ -11146,7 +11144,7 @@ static bool HandleClassZeroInitialization(EvalInfo &Info, const Expr *E,
   assert(!RD->isUnion() && "Expected non-union class type");
   const CXXRecordDecl *CD = dyn_cast<CXXRecordDecl>(RD);
   Result = APValue(APValue::UninitStruct(), CD ? CD->getNumBases() : 0,
-                   std::distance(RD->field_begin(), RD->field_end()));
+                   RD->getNumFields());
 
   if (RD->isInvalidDecl()) return false;
   const ASTRecordLayout &Layout = Info.Ctx.getASTRecordLayout(RD);
@@ -11342,7 +11340,7 @@ bool RecordExprEvaluator::VisitCXXParenListOrInitListExpr(
 
   if (!Result.hasValue())
     Result = APValue(APValue::UninitStruct(), CXXRD ? CXXRD->getNumBases() : 0,
-                     std::distance(RD->field_begin(), RD->field_end()));
+                     RD->getNumFields());
   unsigned ElementNo = 0;
   bool Success = true;
 
@@ -11549,8 +11547,7 @@ bool RecordExprEvaluator::VisitLambdaExpr(const LambdaExpr *E) {
   if (ClosureClass->isInvalidDecl())
     return false;
 
-  const size_t NumFields =
-      std::distance(ClosureClass->field_begin(), ClosureClass->field_end());
+  const size_t NumFields = ClosureClass->getNumFields();
 
   assert(NumFields == (size_t)std::distance(E->capture_init_begin(),
                                             E->capture_init_end()) &&
@@ -11773,6 +11770,10 @@ bool VectorExprEvaluator::VisitCastExpr(const CastExpr *E) {
       Elements.push_back(Val.getVectorElt(I));
     return Success(Elements, E);
   }
+  case CK_HLSLMatrixTruncation: {
+    // TODO: See #168935. Add matrix truncation support to expr constant.
+    return Error(E);
+  }
   case CK_HLSLAggregateSplatCast: {
     APValue Val;
     QualType ValTy;
@@ -12165,7 +12166,36 @@ static bool evalShuffleGeneric(
   Out = APValue(ResultElements.data(), ResultElements.size());
   return true;
 }
+static bool ConvertDoubleToFloatStrict(EvalInfo &Info, const Expr *E,
+                                       APFloat OrigVal, APValue &Result) {
+
+  if (OrigVal.isInfinity()) {
+    Info.CCEDiag(E, diag::note_constexpr_float_arithmetic) << 0;
+    return false;
+  }
+  if (OrigVal.isNaN()) {
+    Info.CCEDiag(E, diag::note_constexpr_float_arithmetic) << 1;
+    return false;
+  }
+
+  APFloat Val = OrigVal;
+  bool LosesInfo = false;
+  APFloat::opStatus Status = Val.convert(
+      APFloat::IEEEsingle(), APFloat::rmNearestTiesToEven, &LosesInfo);
+
+  if (LosesInfo || Val.isDenormal()) {
+    Info.CCEDiag(E, diag::note_constexpr_float_arithmetic_strict);
+    return false;
+  }
 
+  if (Status != APFloat::opOK) {
+    Info.CCEDiag(E, diag::note_invalid_subexpr_in_const_expr);
+    return false;
+  }
+
+  Result = APValue(Val);
+  return true;
+}
 static bool evalShiftWithCount(
     EvalInfo &Info, const CallExpr *Call, APValue &Out,
     llvm::function_ref<APInt(const APInt &, uint64_t)> ShiftOp,
@@ -12924,6 +12954,120 @@ bool VectorExprEvaluator::VisitCallExpr(const CallExpr *E) {
 
     return Success(APValue(ResultElements.data(), ResultElements.size()), E);
   }
+
+  case X86::BI__builtin_ia32_cvtsd2ss: {
+    APValue VecA, VecB;
+    if (!EvaluateAsRValue(Info, E->getArg(0), VecA) ||
+        !EvaluateAsRValue(Info, E->getArg(1), VecB))
+      return false;
+
+    SmallVector<APValue, 4> Elements;
+
+    APValue ResultVal;
+    if (!ConvertDoubleToFloatStrict(Info, E, VecB.getVectorElt(0).getFloat(),
+                                    ResultVal))
+      return false;
+
+    Elements.push_back(ResultVal);
+
+    unsigned NumEltsA = VecA.getVectorLength();
+    for (unsigned I = 1; I < NumEltsA; ++I) {
+      Elements.push_back(VecA.getVectorElt(I));
+    }
+
+    return Success(Elements, E);
+  }
+  case X86::BI__builtin_ia32_cvtsd2ss_round_mask: {
+    APValue VecA, VecB, VecSrc, MaskValue;
+
+    if (!EvaluateAsRValue(Info, E->getArg(0), VecA) ||
+        !EvaluateAsRValue(Info, E->getArg(1), VecB) ||
+        !EvaluateAsRValue(Info, E->getArg(2), VecSrc) ||
+        !EvaluateAsRValue(Info, E->getArg(3), MaskValue))
+      return false;
+
+    unsigned Mask = MaskValue.getInt().getZExtValue();
+    SmallVector<APValue, 4> Elements;
+
+    if (Mask & 1) {
+      APValue ResultVal;
+      if (!ConvertDoubleToFloatStrict(Info, E, VecB.getVectorElt(0).getFloat(),
+                                      ResultVal))
+        return false;
+      Elements.push_back(ResultVal);
+    } else {
+      Elements.push_back(VecSrc.getVectorElt(0));
+    }
+
+    unsigned NumEltsA = VecA.getVectorLength();
+    for (unsigned I = 1; I < NumEltsA; ++I) {
+      Elements.push_back(VecA.getVectorElt(I));
+    }
+
+    return Success(Elements, E);
+  }
+  case X86::BI__builtin_ia32_cvtpd2ps:
+  case X86::BI__builtin_ia32_cvtpd2ps256:
+  case X86::BI__builtin_ia32_cvtpd2ps_mask:
+  case X86::BI__builtin_ia32_cvtpd2ps512_mask: {
+
+    const auto BuiltinID = E->getBuiltinCallee();
+    bool IsMasked = (BuiltinID == X86::BI__builtin_ia32_cvtpd2ps_mask ||
+                     BuiltinID == X86::BI__builtin_ia32_cvtpd2ps512_mask);
+
+    APValue InputValue;
+    if (!EvaluateAsRValue(Info, E->getArg(0), InputValue))
+      return false;
+
+    APValue MergeValue;
+    unsigned Mask = 0xFFFFFFFF;
+    bool NeedsMerge = false;
+    if (IsMasked) {
+      APValue MaskValue;
+      if (!EvaluateAsRValue(Info, E->getArg(2), MaskValue))
+        return false;
+      Mask = MaskValue.getInt().getZExtValue();
+      auto NumEltsResult = E->getType()->getAs<VectorType>()->getNumElements();
+      for (unsigned I = 0; I < NumEltsResult; ++I) {
+        if (!((Mask >> I) & 1)) {
+          NeedsMerge = true;
+          break;
+        }
+      }
+      if (NeedsMerge) {
+        if (!EvaluateAsRValue(Info, E->getArg(1), MergeValue))
+          return false;
+      }
+    }
+
+    unsigned NumEltsResult =
+        E->getType()->getAs<VectorType>()->getNumElements();
+    unsigned NumEltsInput = InputValue.getVectorLength();
+    SmallVector<APValue, 8> Elements;
+    for (unsigned I = 0; I < NumEltsResult; ++I) {
+      if (IsMasked && !((Mask >> I) & 1)) {
+        if (!NeedsMerge) {
+          return false;
+        }
+        Elements.push_back(MergeValue.getVectorElt(I));
+        continue;
+      }
+
+      if (I >= NumEltsInput) {
+        Elements.push_back(APValue(APFloat::getZero(APFloat::IEEEsingle())));
+        continue;
+      }
+
+      APValue ResultVal;
+      if (!ConvertDoubleToFloatStrict(
+              Info, E, InputValue.getVectorElt(I).getFloat(), ResultVal))
+        return false;
+
+      Elements.push_back(ResultVal);
+    }
+    return Success(Elements, E);
+  }
+
   case X86::BI__builtin_ia32_shufps:
   case X86::BI__builtin_ia32_shufps256:
   case X86::BI__builtin_ia32_shufps512: {
@@ -13125,6 +13269,19 @@ bool VectorExprEvaluator::VisitCallExpr(const CallExpr *E) {
     return Success(R, E);
   }
 
+  case X86::BI__builtin_ia32_permdf256:
+  case X86::BI__builtin_ia32_permdi256: {
+    APValue R;
+    if (!evalShuffleGeneric(Info, E, R, [](unsigned DstIdx, unsigned Control) {
+          // permute4x64 operates on 4 64-bit elements
+          // For element i (0-3), extract bits [2*i+1:2*i] from Control
+          unsigned Index = (Control >> (2 * DstIdx)) & 0x3;
+          return std::make_pair(0, static_cast<int>(Index));
+        }))
+      return false;
+    return Success(R, E);
+  }
+
   case X86::BI__builtin_ia32_vpermilvarps:
   case X86::BI__builtin_ia32_vpermilvarps256:
   case X86::BI__builtin_ia32_vpermilvarps512: {
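Illustrative only, not part of the patch: how the permute4x64 immediate is decoded by the lambda above, as a tiny standalone helper plus one worked value:

    // Destination lane DstIdx reads source lane (Control >> (2 * DstIdx)) & 3.
    static unsigned permute4x64SourceLane(unsigned Control, unsigned DstIdx) {
      return (Control >> (2 * DstIdx)) & 0x3;
    }
    // For Control = 0xB4 (binary 10'11'01'00), destination lanes {0, 1, 2, 3}
    // read source lanes {0, 1, 3, 2}.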
@@ -16900,6 +17057,40 @@ bool IntExprEvaluator::VisitBuiltinCallExpr(const CallExpr *E,
         [](const APSInt &LHS, const APSInt &RHS) { return LHS + RHS; });
   }
 
+  case X86::BI__builtin_ia32_kmovb:
+  case X86::BI__builtin_ia32_kmovw:
+  case X86::BI__builtin_ia32_kmovd:
+  case X86::BI__builtin_ia32_kmovq: {
+    APSInt Val;
+    if (!EvaluateInteger(E->getArg(0), Val, Info))
+      return false;
+    return Success(Val, E);
+  }
+
+  case X86::BI__builtin_ia32_kshiftliqi:
+  case X86::BI__builtin_ia32_kshiftlihi:
+  case X86::BI__builtin_ia32_kshiftlisi:
+  case X86::BI__builtin_ia32_kshiftlidi: {
+    return HandleMaskBinOp([](const APSInt &LHS, const APSInt &RHS) {
+      unsigned Amt = RHS.getZExtValue() & 0xFF;
+      if (Amt >= LHS.getBitWidth())
+        return APSInt(APInt::getZero(LHS.getBitWidth()), LHS.isUnsigned());
+      return APSInt(LHS.shl(Amt), LHS.isUnsigned());
+    });
+  }
+
+  case X86::BI__builtin_ia32_kshiftriqi:
+  case X86::BI__builtin_ia32_kshiftrihi:
+  case X86::BI__builtin_ia32_kshiftrisi:
+  case X86::BI__builtin_ia32_kshiftridi: {
+    return HandleMaskBinOp([](const APSInt &LHS, const APSInt &RHS) {
+      unsigned Amt = RHS.getZExtValue() & 0xFF;
+      if (Amt >= LHS.getBitWidth())
+        return APSInt(APInt::getZero(LHS.getBitWidth()), LHS.isUnsigned());
+      return APSInt(LHS.lshr(Amt), LHS.isUnsigned());
+    });
+  }
+
   case clang::X86::BI__builtin_ia32_vec_ext_v4hi:
   case clang::X86::BI__builtin_ia32_vec_ext_v16qi:
   case clang::X86::BI__builtin_ia32_vec_ext_v8hi:
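Illustrative only, not part of the patch: the mask-shift semantics the cases above implement, sketched on a plain 8-bit mask with a hypothetical helper name:

    #include <cstdint>

    // kshiftli on an 8-bit mask: only the low 8 bits of the immediate count,
    // and a shift amount equal to or larger than the mask width yields zero.
    static uint8_t kshiftliQi(uint8_t Mask, unsigned Imm) {
      unsigned Amt = Imm & 0xFF;
      if (Amt >= 8)
        return 0;
      return static_cast<uint8_t>(Mask << Amt);
    }
    // kshiftliQi(0x0F, 2) == 0x3C; kshiftliQi(0x0F, 8) == 0.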
@@ -18433,6 +18624,10 @@ bool IntExprEvaluator::VisitCastExpr(const CastExpr *E) {
       return Error(E);
     return Success(Val.getVectorElt(0), E);
   }
+  case CK_HLSLMatrixTruncation: {
+    // TODO: See #168935. Add matrix truncation support to expr constant.
+    return Error(E);
+  }
   case CK_HLSLElementwiseCast: {
     SmallVector<APValue> SrcVals;
     SmallVector<QualType> SrcTypes;
@@ -19026,6 +19221,10 @@ bool FloatExprEvaluator::VisitCastExpr(const CastExpr *E) {
       return Error(E);
     return Success(Val.getVectorElt(0), E);
   }
+  case CK_HLSLMatrixTruncation: {
+    // TODO: See #168935. Add matrix truncation support to expr constant.
+    return Error(E);
+  }
   case CK_HLSLElementwiseCast: {
     SmallVector<APValue> SrcVals;
     SmallVector<QualType> SrcTypes;
@@ -19183,6 +19382,7 @@ bool ComplexExprEvaluator::VisitCastExpr(const CastExpr *E) {
   case CK_IntegralToFixedPoint:
   case CK_MatrixCast:
   case CK_HLSLVectorTruncation:
+  case CK_HLSLMatrixTruncation:
   case CK_HLSLElementwiseCast:
   case CK_HLSLAggregateSplatCast:
     llvm_unreachable("invalid cast kind for complex value");
diff --git a/clang/lib/AST/OpenMPClause.cpp b/clang/lib/AST/OpenMPClause.cpp
index 0640fed823771..2183d77de8fa7 100644
--- a/clang/lib/AST/OpenMPClause.cpp
+++ b/clang/lib/AST/OpenMPClause.cpp
@@ -1321,7 +1321,7 @@ OMPToClause *OMPToClause::Create(
     const ASTContext &C, const OMPVarListLocTy &Locs, ArrayRef<Expr *> Vars,
     ArrayRef<ValueDecl *> Declarations,
     MappableExprComponentListsRef ComponentLists, ArrayRef<Expr *> UDMapperRefs,
-    ArrayRef<OpenMPMotionModifierKind> MotionModifiers,
+    Expr *IteratorModifier, ArrayRef<OpenMPMotionModifierKind> MotionModifiers,
     ArrayRef<SourceLocation> MotionModifiersLoc,
     NestedNameSpecifierLoc UDMQualifierLoc, DeclarationNameInfo MapperId) {
   OMPMappableExprListSizeTy Sizes;
@@ -1343,7 +1343,7 @@ OMPToClause *OMPToClause::Create(
   void *Mem = C.Allocate(
       totalSizeToAlloc<Expr *, ValueDecl *, unsigned,
                        OMPClauseMappableExprCommon::MappableComponent>(
-          2 * Sizes.NumVars, Sizes.NumUniqueDeclarations,
+          2 * Sizes.NumVars + 1, Sizes.NumUniqueDeclarations,
           Sizes.NumUniqueDeclarations + Sizes.NumComponentLists,
           Sizes.NumComponents));
 
@@ -1353,6 +1353,7 @@ OMPToClause *OMPToClause::Create(
   Clause->setVarRefs(Vars);
   Clause->setUDMapperRefs(UDMapperRefs);
   Clause->setClauseInfo(Declarations, ComponentLists);
+  Clause->setIteratorModifier(IteratorModifier);
   return Clause;
 }
 
@@ -1361,17 +1362,19 @@ OMPToClause *OMPToClause::CreateEmpty(const ASTContext &C,
   void *Mem = C.Allocate(
       totalSizeToAlloc<Expr *, ValueDecl *, unsigned,
                        OMPClauseMappableExprCommon::MappableComponent>(
-          2 * Sizes.NumVars, Sizes.NumUniqueDeclarations,
+          2 * Sizes.NumVars + 1, Sizes.NumUniqueDeclarations,
           Sizes.NumUniqueDeclarations + Sizes.NumComponentLists,
           Sizes.NumComponents));
-  return new (Mem) OMPToClause(Sizes);
+  OMPToClause *Clause = new (Mem) OMPToClause(Sizes);
+  Clause->setIteratorModifier(nullptr);
+  return Clause;
 }
 
 OMPFromClause *OMPFromClause::Create(
     const ASTContext &C, const OMPVarListLocTy &Locs, ArrayRef<Expr *> Vars,
     ArrayRef<ValueDecl *> Declarations,
     MappableExprComponentListsRef ComponentLists, ArrayRef<Expr *> UDMapperRefs,
-    ArrayRef<OpenMPMotionModifierKind> MotionModifiers,
+    Expr *IteratorModifier, ArrayRef<OpenMPMotionModifierKind> MotionModifiers,
     ArrayRef<SourceLocation> MotionModifiersLoc,
     NestedNameSpecifierLoc UDMQualifierLoc, DeclarationNameInfo MapperId) {
   OMPMappableExprListSizeTy Sizes;
@@ -1393,7 +1396,7 @@ OMPFromClause *OMPFromClause::Create(
   void *Mem = C.Allocate(
       totalSizeToAlloc<Expr *, ValueDecl *, unsigned,
                        OMPClauseMappableExprCommon::MappableComponent>(
-          2 * Sizes.NumVars, Sizes.NumUniqueDeclarations,
+          2 * Sizes.NumVars + 1, Sizes.NumUniqueDeclarations,
           Sizes.NumUniqueDeclarations + Sizes.NumComponentLists,
           Sizes.NumComponents));
 
@@ -1404,6 +1407,7 @@ OMPFromClause *OMPFromClause::Create(
   Clause->setVarRefs(Vars);
   Clause->setUDMapperRefs(UDMapperRefs);
   Clause->setClauseInfo(Declarations, ComponentLists);
+  Clause->setIteratorModifier(IteratorModifier);
   return Clause;
 }
 
@@ -1413,10 +1417,12 @@ OMPFromClause::CreateEmpty(const ASTContext &C,
   void *Mem = C.Allocate(
       totalSizeToAlloc<Expr *, ValueDecl *, unsigned,
                        OMPClauseMappableExprCommon::MappableComponent>(
-          2 * Sizes.NumVars, Sizes.NumUniqueDeclarations,
+          2 * Sizes.NumVars + 1, Sizes.NumUniqueDeclarations,
           Sizes.NumUniqueDeclarations + Sizes.NumComponentLists,
           Sizes.NumComponents));
-  return new (Mem) OMPFromClause(Sizes);
+  OMPFromClause *Clause = new (Mem) OMPFromClause(Sizes);
+  Clause->setIteratorModifier(nullptr);
+  return Clause;
 }
 
 void OMPUseDevicePtrClause::setPrivateCopies(ArrayRef<Expr *> VL) {
@@ -2694,12 +2700,16 @@ template <typename T> void OMPClausePrinter::VisitOMPMotionClause(T *Node) {
     OS << '(';
     for (unsigned I = 0; I < NumberOfOMPMotionModifiers; ++I) {
       if (Node->getMotionModifier(I) != OMPC_MOTION_MODIFIER_unknown) {
-        OS << getOpenMPSimpleClauseTypeName(Node->getClauseKind(),
-                                            Node->getMotionModifier(I));
-        if (Node->getMotionModifier(I) == OMPC_MOTION_MODIFIER_mapper)
-          PrintMapper(OS, Node, Policy);
-        if (I < ModifierCount - 1)
-          OS << ", ";
+        if (Node->getMotionModifier(I) == OMPC_MOTION_MODIFIER_iterator) {
+          PrintIterator(OS, Node, Policy);
+        } else {
+          OS << getOpenMPSimpleClauseTypeName(Node->getClauseKind(),
+                                              Node->getMotionModifier(I));
+          if (Node->getMotionModifier(I) == OMPC_MOTION_MODIFIER_mapper)
+            PrintMapper(OS, Node, Policy);
+          if (I < ModifierCount - 1)
+            OS << ", ";
+        }
       }
     }
     OS << ':';
diff --git a/clang/lib/AST/RecordLayoutBuilder.cpp b/clang/lib/AST/RecordLayoutBuilder.cpp
index ac18d4da22e8c..5d8f54f111ab2 100644
--- a/clang/lib/AST/RecordLayoutBuilder.cpp
+++ b/clang/lib/AST/RecordLayoutBuilder.cpp
@@ -3363,16 +3363,15 @@ void MicrosoftRecordLayoutBuilder::computeVtorDispSet(
 /// position information.
 const ASTRecordLayout &
 ASTContext::getASTRecordLayout(const RecordDecl *D) const {
-  // These asserts test different things.  A record has a definition
-  // as soon as we begin to parse the definition.  That definition is
-  // not a complete definition (which is what isDefinition() tests)
-  // until we *finish* parsing the definition.
-
   if (D->hasExternalLexicalStorage() && !D->getDefinition())
     getExternalSource()->CompleteType(const_cast<RecordDecl*>(D));
   // Complete the redecl chain (if necessary).
   (void)D->getMostRecentDecl();
 
+  // These asserts test different things.  A record has a definition
+  // as soon as we begin to parse the definition.  That definition is
+  // not a complete definition (which is what isCompleteDefinition() tests)
+  // until we *finish* parsing the definition.
   D = D->getDefinition();
   assert(D && "Cannot get layout of forward declarations!");
   assert(!D->isInvalidDecl() && "Cannot get layout of invalid decl!");
diff --git a/clang/lib/ASTMatchers/ASTMatchersInternal.cpp b/clang/lib/ASTMatchers/ASTMatchersInternal.cpp
index 0874b3d0c45f5..fbb8b49676045 100644
--- a/clang/lib/ASTMatchers/ASTMatchersInternal.cpp
+++ b/clang/lib/ASTMatchers/ASTMatchersInternal.cpp
@@ -807,6 +807,7 @@ const internal::VariadicDynCastAllOfMatcher<TypeLoc, PointerTypeLoc>
     pointerTypeLoc;
 const internal::VariadicDynCastAllOfMatcher<TypeLoc, ReferenceTypeLoc>
     referenceTypeLoc;
+const internal::VariadicDynCastAllOfMatcher<TypeLoc, ArrayTypeLoc> arrayTypeLoc;
 const internal::VariadicDynCastAllOfMatcher<TypeLoc,
                                             TemplateSpecializationTypeLoc>
     templateSpecializationTypeLoc;
diff --git a/clang/lib/ASTMatchers/Dynamic/Registry.cpp b/clang/lib/ASTMatchers/Dynamic/Registry.cpp
index 66848f7c42127..d1b19659905ca 100644
--- a/clang/lib/ASTMatchers/Dynamic/Registry.cpp
+++ b/clang/lib/ASTMatchers/Dynamic/Registry.cpp
@@ -138,6 +138,7 @@ RegistryMaps::RegistryMaps() {
   REGISTER_MATCHER(argumentCountAtLeast);
   REGISTER_MATCHER(arraySubscriptExpr);
   REGISTER_MATCHER(arrayType);
+  REGISTER_MATCHER(arrayTypeLoc);
   REGISTER_MATCHER(asString);
   REGISTER_MATCHER(asmStmt);
   REGISTER_MATCHER(atomicExpr);
diff --git a/clang/lib/Analysis/Consumed.cpp b/clang/lib/Analysis/Consumed.cpp
index f2c714ab1528d..efc7098e52042 100644
--- a/clang/lib/Analysis/Consumed.cpp
+++ b/clang/lib/Analysis/Consumed.cpp
@@ -1354,12 +1354,13 @@ void ConsumedAnalyzer::run(AnalysisDeclContext &AC) {
 
       case CFGElement::AutomaticObjectDtor: {
         const CFGAutomaticObjDtor &DTor = B.castAs<CFGAutomaticObjDtor>();
+        const auto *DD = DTor.getDestructorDecl(AC.getASTContext());
+        if (!DD)
+          break;
+
         SourceLocation Loc = DTor.getTriggerStmt()->getEndLoc();
         const VarDecl *Var = DTor.getVarDecl();
-
-        Visitor.checkCallability(PropagationInfo(Var),
-                                 DTor.getDestructorDecl(AC.getASTContext()),
-                                 Loc);
+        Visitor.checkCallability(PropagationInfo(Var), DD, Loc);
         break;
       }
 
diff --git a/clang/lib/Analysis/ExprMutationAnalyzer.cpp b/clang/lib/Analysis/ExprMutationAnalyzer.cpp
index 2f40c7e4888e3..86d7dcab807d3 100644
--- a/clang/lib/Analysis/ExprMutationAnalyzer.cpp
+++ b/clang/lib/Analysis/ExprMutationAnalyzer.cpp
@@ -135,6 +135,11 @@ class ExprPointeeResolve {
     if (const auto *PE = dyn_cast<ParenExpr>(E))
       return resolveExpr(PE->getSubExpr());
 
+    if (const auto *UO = dyn_cast<UnaryOperator>(E)) {
+      if (UO->getOpcode() == UO_AddrOf)
+        return resolveExpr(UO->getSubExpr());
+    }
+
     if (const auto *ICE = dyn_cast<ImplicitCastExpr>(E)) {
       // only implicit cast needs to be treated as resolvable.
       // explicit cast will be checked in `findPointeeToNonConst`
diff --git a/clang/lib/Basic/Targets/AMDGPU.h b/clang/lib/Basic/Targets/AMDGPU.h
index a51d8d2375cfe..8dcf1d1c9561a 100644
--- a/clang/lib/Basic/Targets/AMDGPU.h
+++ b/clang/lib/Basic/Targets/AMDGPU.h
@@ -84,6 +84,18 @@ class LLVM_LIBRARY_VISIBILITY AMDGPUTargetInfo final : public TargetInfo {
     return TT.getArch() == llvm::Triple::r600;
   }
 
+  bool hasFlatSupport() const {
+    if (GPUKind >= llvm::AMDGPU::GK_GFX700)
+      return true;
+
+    // Dummy target is assumed to be gfx700+ for amdhsa.
+    if (GPUKind == llvm::AMDGPU::GK_NONE &&
+        getTriple().getOS() == llvm::Triple::AMDHSA)
+      return true;
+
+    return false;
+  }
+
 public:
   AMDGPUTargetInfo(const llvm::Triple &Triple, const TargetOptions &Opts);
 
@@ -316,14 +328,16 @@ class LLVM_LIBRARY_VISIBILITY AMDGPUTargetInfo final : public TargetInfo {
       Opts["cl_amd_media_ops"] = true;
       Opts["cl_amd_media_ops2"] = true;
 
+      // FIXME: Check subtarget for image support.
       Opts["__opencl_c_images"] = true;
       Opts["__opencl_c_3d_image_writes"] = true;
+      Opts["__opencl_c_read_write_images"] = true;
       Opts["cl_khr_3d_image_writes"] = true;
       Opts["__opencl_c_program_scope_global_variables"] = true;
       Opts["__opencl_c_atomic_order_seq_cst"] = true;
       Opts["__opencl_c_atomic_scope_all_devices"] = true;
 
-      if (GPUKind >= llvm::AMDGPU::GK_GFX700) {
+      if (hasFlatSupport()) {
         Opts["__opencl_c_generic_address_space"] = true;
         Opts["__opencl_c_device_enqueue"] = true;
       }
diff --git a/clang/lib/Basic/Targets/Sparc.cpp b/clang/lib/Basic/Targets/Sparc.cpp
index d47eecb3cf058..fe1aad6804aa6 100644
--- a/clang/lib/Basic/Targets/Sparc.cpp
+++ b/clang/lib/Basic/Targets/Sparc.cpp
@@ -165,6 +165,7 @@ void SparcV8TargetInfo::getTargetDefines(const LangOptions &Opts,
     Builder.defineMacro("__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4");
     Builder.defineMacro("__GCC_HAVE_SYNC_COMPARE_AND_SWAP_8");
   }
+  Builder.defineMacro("__LONG_DOUBLE_128__");
 }
 
 void SparcV9TargetInfo::getTargetDefines(const LangOptions &Opts,
diff --git a/clang/lib/Basic/Targets/Sparc.h b/clang/lib/Basic/Targets/Sparc.h
index 3215e648ba6c3..acc27194c38ea 100644
--- a/clang/lib/Basic/Targets/Sparc.h
+++ b/clang/lib/Basic/Targets/Sparc.h
@@ -166,6 +166,13 @@ class LLVM_LIBRARY_VISIBILITY SparcV8TargetInfo : public SparcTargetInfo {
       PtrDiffType = SignedLong;
       break;
     }
+
+    // The SPARCv8 System V ABI defines long double as 128 bits in size, but
+    // only 64-bit aligned.
+    LongDoubleWidth = 128;
+    LongDoubleAlign = 64;
+    LongDoubleFormat = &llvm::APFloat::IEEEquad();
+
     // Up to 32 bits (V8) or 64 bits (V9) are lock-free atomic, but we're
     // willing to do atomic ops on up to 64 bits.
     MaxAtomicPromoteWidth = 64;
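Illustrative only, not part of the patch: what the new SPARCv8 long double settings imply for code compiled for a 32-bit SPARC target (hypothetical checks):

    // With the settings above, long double is a 128-bit IEEE quad that is
    // only 64-bit aligned on sparc (V8).
    static_assert(sizeof(long double) == 16, "128 bits wide");
    static_assert(alignof(long double) == 8, "but 64-bit aligned");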
diff --git a/clang/lib/CIR/CodeGen/CIRGenBuiltin.cpp b/clang/lib/CIR/CodeGen/CIRGenBuiltin.cpp
index 7d4d13121d5e5..72495270b11ed 100644
--- a/clang/lib/CIR/CodeGen/CIRGenBuiltin.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenBuiltin.cpp
@@ -266,7 +266,12 @@ RValue CIRGenFunction::emitBuiltinExpr(const GlobalDecl &gd, unsigned builtinID,
   case Builtin::BI__builtin_va_end:
     emitVAEnd(emitVAListRef(e->getArg(0)).getPointer());
     return {};
-
+  case Builtin::BI__builtin_va_copy: {
+    mlir::Value dstPtr = emitVAListRef(e->getArg(0)).getPointer();
+    mlir::Value srcPtr = emitVAListRef(e->getArg(1)).getPointer();
+    cir::VACopyOp::create(builder, dstPtr.getLoc(), dstPtr, srcPtr);
+    return {};
+  }
   case Builtin::BIcos:
   case Builtin::BIcosf:
   case Builtin::BIcosl:
@@ -321,6 +326,16 @@ RValue CIRGenFunction::emitBuiltinExpr(const GlobalDecl &gd, unsigned builtinID,
   case Builtin::BI__builtin_fabsf128:
     return emitUnaryMaybeConstrainedFPBuiltin<cir::FAbsOp>(*this, *e);
 
+  case Builtin::BIfloor:
+  case Builtin::BIfloorf:
+  case Builtin::BIfloorl:
+  case Builtin::BI__builtin_floor:
+  case Builtin::BI__builtin_floorf:
+  case Builtin::BI__builtin_floorf16:
+  case Builtin::BI__builtin_floorl:
+  case Builtin::BI__builtin_floorf128:
+    return emitUnaryMaybeConstrainedFPBuiltin<cir::FloorOp>(*this, *e);
+
   case Builtin::BI__assume:
   case Builtin::BI__builtin_assume: {
     if (e->getArg(0)->HasSideEffects(getContext()))
@@ -527,6 +542,45 @@ RValue CIRGenFunction::emitBuiltinExpr(const GlobalDecl &gd, unsigned builtinID,
     return emitCall(e->getCallee()->getType(), CIRGenCallee::forDirect(fnOp), e,
                     returnValue);
   }
+
+  case Builtin::BI__builtin_constant_p: {
+    mlir::Type resultType = convertType(e->getType());
+
+    const Expr *arg = e->getArg(0);
+    QualType argType = arg->getType();
+    // FIXME: The allowance for Obj-C pointers and block pointers is historical
+    // and likely a mistake.
+    if (!argType->isIntegralOrEnumerationType() && !argType->isFloatingType() &&
+        !argType->isObjCObjectPointerType() && !argType->isBlockPointerType()) {
+      // Per the GCC documentation, only numeric constants are recognized after
+      // inlining.
+      return RValue::get(
+          builder.getConstInt(getLoc(e->getSourceRange()),
+                              mlir::cast<cir::IntType>(resultType), 0));
+    }
+
+    if (arg->HasSideEffects(getContext())) {
+      // The argument is unevaluated, so be conservative if it might have
+      // side-effects.
+      return RValue::get(
+          builder.getConstInt(getLoc(e->getSourceRange()),
+                              mlir::cast<cir::IntType>(resultType), 0));
+    }
+
+    mlir::Value argValue = emitScalarExpr(arg);
+    if (argType->isObjCObjectPointerType()) {
+      cgm.errorNYI(e->getSourceRange(),
+                   "__builtin_constant_p: Obj-C object pointer");
+      return {};
+    }
+    argValue = builder.createBitcast(argValue, convertType(argType));
+
+    mlir::Value result = cir::IsConstantOp::create(
+        builder, getLoc(e->getSourceRange()), argValue);
+    // IsConstantOp returns a bool, but __builtin_constant_p returns an int.
+    result = builder.createBoolToInt(result, resultType);
+    return RValue::get(result);
+  }
   case Builtin::BI__builtin_dynamic_object_size:
   case Builtin::BI__builtin_object_size: {
     unsigned type =
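Illustrative only, not part of the patch: typical __builtin_constant_p inputs that the new lowering above distinguishes (hypothetical snippet):

    int classify(int X) {
      int A = __builtin_constant_p(42) ? 1 : 0; // numeric constant, folds to 1
      int B = __builtin_constant_p(X) ? 1 : 0;  // lowered via IsConstantOp and
                                                // resolved after optimization
      return A + B;
    }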
@@ -1307,9 +1361,13 @@ static mlir::Value emitTargetArchBuiltinExpr(CIRGenFunction *cgf,
   case llvm::Triple::armeb:
   case llvm::Triple::thumb:
   case llvm::Triple::thumbeb:
+    // These are actually NYI, but that will be reported by emitBuiltinExpr.
+    // At this point, we don't even know that the builtin is target-specific.
+    return nullptr;
   case llvm::Triple::aarch64:
   case llvm::Triple::aarch64_32:
   case llvm::Triple::aarch64_be:
+    return cgf->emitAArch64BuiltinExpr(builtinID, e, returnValue, arch);
   case llvm::Triple::bpfeb:
   case llvm::Triple::bpfel:
     // These are actually NYI, but that will be reported by emitBuiltinExpr.
diff --git a/clang/lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp b/clang/lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp
new file mode 100644
index 0000000000000..5a9ae59ca253a
--- /dev/null
+++ b/clang/lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp
@@ -0,0 +1,1583 @@
+//===---- CIRGenBuiltinAArch64.cpp - Emit CIR for AArch64 builtins --------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This contains code to emit ARM64 Builtin calls as CIR or a function call
+// to be later resolved.
+//
+//===----------------------------------------------------------------------===//
+
+#include "CIRGenFunction.h"
+#include "clang/CIR/MissingFeatures.h"
+
+// TODO(cir): once all builtins are covered, decide whether we still
+// need to use LLVM intrinsics or if there's a better approach to follow. Right
+// now the intrinsics are reused because it is convenient to encode all
+// thousands of them and pass them down to LLVM lowering.
+#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/IntrinsicsAArch64.h"
+
+#include "mlir/IR/Value.h"
+#include "clang/AST/GlobalDecl.h"
+#include "clang/Basic/Builtins.h"
+#include "clang/Basic/TargetBuiltins.h"
+
+using namespace clang;
+using namespace clang::CIRGen;
+using namespace llvm;
+
+mlir::Value CIRGenFunction::emitAArch64SVEBuiltinExpr(unsigned builtinID,
+                                                      const CallExpr *expr) {
+  if (builtinID >= SVE::BI__builtin_sve_reinterpret_s8_s8 &&
+      builtinID <= SVE::BI__builtin_sve_reinterpret_f64_f64_x4) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  assert(!cir::MissingFeatures::aarch64SVEIntrinsics());
+
+  switch (builtinID) {
+  default:
+    return {};
+
+  case SVE::BI__builtin_sve_svreinterpret_b:
+  case SVE::BI__builtin_sve_svreinterpret_c:
+  case SVE::BI__builtin_sve_svpsel_lane_b8:
+  case SVE::BI__builtin_sve_svpsel_lane_b16:
+  case SVE::BI__builtin_sve_svpsel_lane_b32:
+  case SVE::BI__builtin_sve_svpsel_lane_b64:
+  case SVE::BI__builtin_sve_svpsel_lane_c8:
+  case SVE::BI__builtin_sve_svpsel_lane_c16:
+  case SVE::BI__builtin_sve_svpsel_lane_c32:
+  case SVE::BI__builtin_sve_svpsel_lane_c64:
+  case SVE::BI__builtin_sve_svmov_b_z:
+  case SVE::BI__builtin_sve_svnot_b_z:
+  case SVE::BI__builtin_sve_svmovlb_u16:
+  case SVE::BI__builtin_sve_svmovlb_u32:
+  case SVE::BI__builtin_sve_svmovlb_u64:
+  case SVE::BI__builtin_sve_svmovlb_s16:
+  case SVE::BI__builtin_sve_svmovlb_s32:
+  case SVE::BI__builtin_sve_svmovlb_s64:
+  case SVE::BI__builtin_sve_svmovlt_u16:
+  case SVE::BI__builtin_sve_svmovlt_u32:
+  case SVE::BI__builtin_sve_svmovlt_u64:
+  case SVE::BI__builtin_sve_svmovlt_s16:
+  case SVE::BI__builtin_sve_svmovlt_s32:
+  case SVE::BI__builtin_sve_svmovlt_s64:
+  case SVE::BI__builtin_sve_svpmullt_u16:
+  case SVE::BI__builtin_sve_svpmullt_u64:
+  case SVE::BI__builtin_sve_svpmullt_n_u16:
+  case SVE::BI__builtin_sve_svpmullt_n_u64:
+  case SVE::BI__builtin_sve_svpmullb_u16:
+  case SVE::BI__builtin_sve_svpmullb_u64:
+  case SVE::BI__builtin_sve_svpmullb_n_u16:
+  case SVE::BI__builtin_sve_svpmullb_n_u64:
+  case SVE::BI__builtin_sve_svdup_n_b8:
+  case SVE::BI__builtin_sve_svdup_n_b16:
+  case SVE::BI__builtin_sve_svdup_n_b32:
+  case SVE::BI__builtin_sve_svdup_n_b64:
+  case SVE::BI__builtin_sve_svdupq_n_b8:
+  case SVE::BI__builtin_sve_svdupq_n_b16:
+  case SVE::BI__builtin_sve_svdupq_n_b32:
+  case SVE::BI__builtin_sve_svdupq_n_b64:
+  case SVE::BI__builtin_sve_svdupq_n_u8:
+  case SVE::BI__builtin_sve_svdupq_n_s8:
+  case SVE::BI__builtin_sve_svdupq_n_u64:
+  case SVE::BI__builtin_sve_svdupq_n_f64:
+  case SVE::BI__builtin_sve_svdupq_n_s64:
+  case SVE::BI__builtin_sve_svdupq_n_u16:
+  case SVE::BI__builtin_sve_svdupq_n_f16:
+  case SVE::BI__builtin_sve_svdupq_n_bf16:
+  case SVE::BI__builtin_sve_svdupq_n_s16:
+  case SVE::BI__builtin_sve_svdupq_n_u32:
+  case SVE::BI__builtin_sve_svdupq_n_f32:
+  case SVE::BI__builtin_sve_svdupq_n_s32:
+  case SVE::BI__builtin_sve_svpfalse_b:
+  case SVE::BI__builtin_sve_svpfalse_c:
+  case SVE::BI__builtin_sve_svlen_bf16:
+  case SVE::BI__builtin_sve_svlen_f16:
+  case SVE::BI__builtin_sve_svlen_f32:
+  case SVE::BI__builtin_sve_svlen_f64:
+  case SVE::BI__builtin_sve_svlen_s8:
+  case SVE::BI__builtin_sve_svlen_s16:
+  case SVE::BI__builtin_sve_svlen_s32:
+  case SVE::BI__builtin_sve_svlen_s64:
+  case SVE::BI__builtin_sve_svlen_u8:
+  case SVE::BI__builtin_sve_svlen_u16:
+  case SVE::BI__builtin_sve_svlen_u32:
+  case SVE::BI__builtin_sve_svlen_u64:
+  case SVE::BI__builtin_sve_svtbl2_u8:
+  case SVE::BI__builtin_sve_svtbl2_s8:
+  case SVE::BI__builtin_sve_svtbl2_u16:
+  case SVE::BI__builtin_sve_svtbl2_s16:
+  case SVE::BI__builtin_sve_svtbl2_u32:
+  case SVE::BI__builtin_sve_svtbl2_s32:
+  case SVE::BI__builtin_sve_svtbl2_u64:
+  case SVE::BI__builtin_sve_svtbl2_s64:
+  case SVE::BI__builtin_sve_svtbl2_f16:
+  case SVE::BI__builtin_sve_svtbl2_bf16:
+  case SVE::BI__builtin_sve_svtbl2_f32:
+  case SVE::BI__builtin_sve_svtbl2_f64:
+  case SVE::BI__builtin_sve_svset_neonq_s8:
+  case SVE::BI__builtin_sve_svset_neonq_s16:
+  case SVE::BI__builtin_sve_svset_neonq_s32:
+  case SVE::BI__builtin_sve_svset_neonq_s64:
+  case SVE::BI__builtin_sve_svset_neonq_u8:
+  case SVE::BI__builtin_sve_svset_neonq_u16:
+  case SVE::BI__builtin_sve_svset_neonq_u32:
+  case SVE::BI__builtin_sve_svset_neonq_u64:
+  case SVE::BI__builtin_sve_svset_neonq_f16:
+  case SVE::BI__builtin_sve_svset_neonq_f32:
+  case SVE::BI__builtin_sve_svset_neonq_f64:
+  case SVE::BI__builtin_sve_svset_neonq_bf16:
+  case SVE::BI__builtin_sve_svget_neonq_s8:
+  case SVE::BI__builtin_sve_svget_neonq_s16:
+  case SVE::BI__builtin_sve_svget_neonq_s32:
+  case SVE::BI__builtin_sve_svget_neonq_s64:
+  case SVE::BI__builtin_sve_svget_neonq_u8:
+  case SVE::BI__builtin_sve_svget_neonq_u16:
+  case SVE::BI__builtin_sve_svget_neonq_u32:
+  case SVE::BI__builtin_sve_svget_neonq_u64:
+  case SVE::BI__builtin_sve_svget_neonq_f16:
+  case SVE::BI__builtin_sve_svget_neonq_f32:
+  case SVE::BI__builtin_sve_svget_neonq_f64:
+  case SVE::BI__builtin_sve_svget_neonq_bf16:
+  case SVE::BI__builtin_sve_svdup_neonq_s8:
+  case SVE::BI__builtin_sve_svdup_neonq_s16:
+  case SVE::BI__builtin_sve_svdup_neonq_s32:
+  case SVE::BI__builtin_sve_svdup_neonq_s64:
+  case SVE::BI__builtin_sve_svdup_neonq_u8:
+  case SVE::BI__builtin_sve_svdup_neonq_u16:
+  case SVE::BI__builtin_sve_svdup_neonq_u32:
+  case SVE::BI__builtin_sve_svdup_neonq_u64:
+  case SVE::BI__builtin_sve_svdup_neonq_f16:
+  case SVE::BI__builtin_sve_svdup_neonq_f32:
+  case SVE::BI__builtin_sve_svdup_neonq_f64:
+  case SVE::BI__builtin_sve_svdup_neonq_bf16:
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  // Unreachable: All cases in the switch above return.
+}
+
+mlir::Value CIRGenFunction::emitAArch64SMEBuiltinExpr(unsigned builtinID,
+                                                      const CallExpr *expr) {
+  assert(!cir::MissingFeatures::aarch64SMEIntrinsics());
+
+  cgm.errorNYI(expr->getSourceRange(),
+               std::string("unimplemented AArch64 builtin call: ") +
+                   getContext().BuiltinInfo.getName(builtinID));
+  return {};
+}
+
+// Some intrinsics are equivalent for codegen.
+static const std::pair<unsigned, unsigned> neonEquivalentIntrinsicMap[] = {
+    {
+        NEON::BI__builtin_neon_splat_lane_bf16,
+        NEON::BI__builtin_neon_splat_lane_v,
+    },
+    {
+        NEON::BI__builtin_neon_splat_laneq_bf16,
+        NEON::BI__builtin_neon_splat_laneq_v,
+    },
+    {
+        NEON::BI__builtin_neon_splatq_lane_bf16,
+        NEON::BI__builtin_neon_splatq_lane_v,
+    },
+    {
+        NEON::BI__builtin_neon_splatq_laneq_bf16,
+        NEON::BI__builtin_neon_splatq_laneq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vabd_f16,
+        NEON::BI__builtin_neon_vabd_v,
+    },
+    {
+        NEON::BI__builtin_neon_vabdq_f16,
+        NEON::BI__builtin_neon_vabdq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vabs_f16,
+        NEON::BI__builtin_neon_vabs_v,
+    },
+    {
+        NEON::BI__builtin_neon_vabsq_f16,
+        NEON::BI__builtin_neon_vabsq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcage_f16,
+        NEON::BI__builtin_neon_vcage_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcageq_f16,
+        NEON::BI__builtin_neon_vcageq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcagt_f16,
+        NEON::BI__builtin_neon_vcagt_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcagtq_f16,
+        NEON::BI__builtin_neon_vcagtq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcale_f16,
+        NEON::BI__builtin_neon_vcale_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcaleq_f16,
+        NEON::BI__builtin_neon_vcaleq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcalt_f16,
+        NEON::BI__builtin_neon_vcalt_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcaltq_f16,
+        NEON::BI__builtin_neon_vcaltq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vceqz_f16,
+        NEON::BI__builtin_neon_vceqz_v,
+    },
+    {
+        NEON::BI__builtin_neon_vceqzq_f16,
+        NEON::BI__builtin_neon_vceqzq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcgez_f16,
+        NEON::BI__builtin_neon_vcgez_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcgezq_f16,
+        NEON::BI__builtin_neon_vcgezq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcgtz_f16,
+        NEON::BI__builtin_neon_vcgtz_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcgtzq_f16,
+        NEON::BI__builtin_neon_vcgtzq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vclez_f16,
+        NEON::BI__builtin_neon_vclez_v,
+    },
+    {
+        NEON::BI__builtin_neon_vclezq_f16,
+        NEON::BI__builtin_neon_vclezq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcltz_f16,
+        NEON::BI__builtin_neon_vcltz_v,
+    },
+    {
+        NEON::BI__builtin_neon_vcltzq_f16,
+        NEON::BI__builtin_neon_vcltzq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vfma_f16,
+        NEON::BI__builtin_neon_vfma_v,
+    },
+    {
+        NEON::BI__builtin_neon_vfma_lane_f16,
+        NEON::BI__builtin_neon_vfma_lane_v,
+    },
+    {
+        NEON::BI__builtin_neon_vfma_laneq_f16,
+        NEON::BI__builtin_neon_vfma_laneq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vfmaq_f16,
+        NEON::BI__builtin_neon_vfmaq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vfmaq_lane_f16,
+        NEON::BI__builtin_neon_vfmaq_lane_v,
+    },
+    {
+        NEON::BI__builtin_neon_vfmaq_laneq_f16,
+        NEON::BI__builtin_neon_vfmaq_laneq_v,
+    },
+    {NEON::BI__builtin_neon_vld1_bf16_x2, NEON::BI__builtin_neon_vld1_x2_v},
+    {NEON::BI__builtin_neon_vld1_bf16_x3, NEON::BI__builtin_neon_vld1_x3_v},
+    {NEON::BI__builtin_neon_vld1_bf16_x4, NEON::BI__builtin_neon_vld1_x4_v},
+    {NEON::BI__builtin_neon_vld1_bf16, NEON::BI__builtin_neon_vld1_v},
+    {NEON::BI__builtin_neon_vld1_dup_bf16, NEON::BI__builtin_neon_vld1_dup_v},
+    {NEON::BI__builtin_neon_vld1_lane_bf16, NEON::BI__builtin_neon_vld1_lane_v},
+    {NEON::BI__builtin_neon_vld1q_bf16_x2, NEON::BI__builtin_neon_vld1q_x2_v},
+    {NEON::BI__builtin_neon_vld1q_bf16_x3, NEON::BI__builtin_neon_vld1q_x3_v},
+    {NEON::BI__builtin_neon_vld1q_bf16_x4, NEON::BI__builtin_neon_vld1q_x4_v},
+    {NEON::BI__builtin_neon_vld1q_bf16, NEON::BI__builtin_neon_vld1q_v},
+    {NEON::BI__builtin_neon_vld1q_dup_bf16, NEON::BI__builtin_neon_vld1q_dup_v},
+    {NEON::BI__builtin_neon_vld1q_lane_bf16,
+     NEON::BI__builtin_neon_vld1q_lane_v},
+    {NEON::BI__builtin_neon_vld2_bf16, NEON::BI__builtin_neon_vld2_v},
+    {NEON::BI__builtin_neon_vld2_dup_bf16, NEON::BI__builtin_neon_vld2_dup_v},
+    {NEON::BI__builtin_neon_vld2_lane_bf16, NEON::BI__builtin_neon_vld2_lane_v},
+    {NEON::BI__builtin_neon_vld2q_bf16, NEON::BI__builtin_neon_vld2q_v},
+    {NEON::BI__builtin_neon_vld2q_dup_bf16, NEON::BI__builtin_neon_vld2q_dup_v},
+    {NEON::BI__builtin_neon_vld2q_lane_bf16,
+     NEON::BI__builtin_neon_vld2q_lane_v},
+    {NEON::BI__builtin_neon_vld3_bf16, NEON::BI__builtin_neon_vld3_v},
+    {NEON::BI__builtin_neon_vld3_dup_bf16, NEON::BI__builtin_neon_vld3_dup_v},
+    {NEON::BI__builtin_neon_vld3_lane_bf16, NEON::BI__builtin_neon_vld3_lane_v},
+    {NEON::BI__builtin_neon_vld3q_bf16, NEON::BI__builtin_neon_vld3q_v},
+    {NEON::BI__builtin_neon_vld3q_dup_bf16, NEON::BI__builtin_neon_vld3q_dup_v},
+    {NEON::BI__builtin_neon_vld3q_lane_bf16,
+     NEON::BI__builtin_neon_vld3q_lane_v},
+    {NEON::BI__builtin_neon_vld4_bf16, NEON::BI__builtin_neon_vld4_v},
+    {NEON::BI__builtin_neon_vld4_dup_bf16, NEON::BI__builtin_neon_vld4_dup_v},
+    {NEON::BI__builtin_neon_vld4_lane_bf16, NEON::BI__builtin_neon_vld4_lane_v},
+    {NEON::BI__builtin_neon_vld4q_bf16, NEON::BI__builtin_neon_vld4q_v},
+    {NEON::BI__builtin_neon_vld4q_dup_bf16, NEON::BI__builtin_neon_vld4q_dup_v},
+    {NEON::BI__builtin_neon_vld4q_lane_bf16,
+     NEON::BI__builtin_neon_vld4q_lane_v},
+    {
+        NEON::BI__builtin_neon_vmax_f16,
+        NEON::BI__builtin_neon_vmax_v,
+    },
+    {
+        NEON::BI__builtin_neon_vmaxnm_f16,
+        NEON::BI__builtin_neon_vmaxnm_v,
+    },
+    {
+        NEON::BI__builtin_neon_vmaxnmq_f16,
+        NEON::BI__builtin_neon_vmaxnmq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vmaxq_f16,
+        NEON::BI__builtin_neon_vmaxq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vmin_f16,
+        NEON::BI__builtin_neon_vmin_v,
+    },
+    {
+        NEON::BI__builtin_neon_vminnm_f16,
+        NEON::BI__builtin_neon_vminnm_v,
+    },
+    {
+        NEON::BI__builtin_neon_vminnmq_f16,
+        NEON::BI__builtin_neon_vminnmq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vminq_f16,
+        NEON::BI__builtin_neon_vminq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vmulx_f16,
+        NEON::BI__builtin_neon_vmulx_v,
+    },
+    {
+        NEON::BI__builtin_neon_vmulxq_f16,
+        NEON::BI__builtin_neon_vmulxq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vpadd_f16,
+        NEON::BI__builtin_neon_vpadd_v,
+    },
+    {
+        NEON::BI__builtin_neon_vpaddq_f16,
+        NEON::BI__builtin_neon_vpaddq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vpmax_f16,
+        NEON::BI__builtin_neon_vpmax_v,
+    },
+    {
+        NEON::BI__builtin_neon_vpmaxnm_f16,
+        NEON::BI__builtin_neon_vpmaxnm_v,
+    },
+    {
+        NEON::BI__builtin_neon_vpmaxnmq_f16,
+        NEON::BI__builtin_neon_vpmaxnmq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vpmaxq_f16,
+        NEON::BI__builtin_neon_vpmaxq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vpmin_f16,
+        NEON::BI__builtin_neon_vpmin_v,
+    },
+    {
+        NEON::BI__builtin_neon_vpminnm_f16,
+        NEON::BI__builtin_neon_vpminnm_v,
+    },
+    {
+        NEON::BI__builtin_neon_vpminnmq_f16,
+        NEON::BI__builtin_neon_vpminnmq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vpminq_f16,
+        NEON::BI__builtin_neon_vpminq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrecpe_f16,
+        NEON::BI__builtin_neon_vrecpe_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrecpeq_f16,
+        NEON::BI__builtin_neon_vrecpeq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrecps_f16,
+        NEON::BI__builtin_neon_vrecps_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrecpsq_f16,
+        NEON::BI__builtin_neon_vrecpsq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrnd_f16,
+        NEON::BI__builtin_neon_vrnd_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrnda_f16,
+        NEON::BI__builtin_neon_vrnda_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrndaq_f16,
+        NEON::BI__builtin_neon_vrndaq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrndi_f16,
+        NEON::BI__builtin_neon_vrndi_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrndiq_f16,
+        NEON::BI__builtin_neon_vrndiq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrndm_f16,
+        NEON::BI__builtin_neon_vrndm_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrndmq_f16,
+        NEON::BI__builtin_neon_vrndmq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrndn_f16,
+        NEON::BI__builtin_neon_vrndn_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrndnq_f16,
+        NEON::BI__builtin_neon_vrndnq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrndp_f16,
+        NEON::BI__builtin_neon_vrndp_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrndpq_f16,
+        NEON::BI__builtin_neon_vrndpq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrndq_f16,
+        NEON::BI__builtin_neon_vrndq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrndx_f16,
+        NEON::BI__builtin_neon_vrndx_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrndxq_f16,
+        NEON::BI__builtin_neon_vrndxq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrsqrte_f16,
+        NEON::BI__builtin_neon_vrsqrte_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrsqrteq_f16,
+        NEON::BI__builtin_neon_vrsqrteq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrsqrts_f16,
+        NEON::BI__builtin_neon_vrsqrts_v,
+    },
+    {
+        NEON::BI__builtin_neon_vrsqrtsq_f16,
+        NEON::BI__builtin_neon_vrsqrtsq_v,
+    },
+    {
+        NEON::BI__builtin_neon_vsqrt_f16,
+        NEON::BI__builtin_neon_vsqrt_v,
+    },
+    {
+        NEON::BI__builtin_neon_vsqrtq_f16,
+        NEON::BI__builtin_neon_vsqrtq_v,
+    },
+    {NEON::BI__builtin_neon_vst1_bf16_x2, NEON::BI__builtin_neon_vst1_x2_v},
+    {NEON::BI__builtin_neon_vst1_bf16_x3, NEON::BI__builtin_neon_vst1_x3_v},
+    {NEON::BI__builtin_neon_vst1_bf16_x4, NEON::BI__builtin_neon_vst1_x4_v},
+    {NEON::BI__builtin_neon_vst1_bf16, NEON::BI__builtin_neon_vst1_v},
+    {NEON::BI__builtin_neon_vst1_lane_bf16, NEON::BI__builtin_neon_vst1_lane_v},
+    {NEON::BI__builtin_neon_vst1q_bf16_x2, NEON::BI__builtin_neon_vst1q_x2_v},
+    {NEON::BI__builtin_neon_vst1q_bf16_x3, NEON::BI__builtin_neon_vst1q_x3_v},
+    {NEON::BI__builtin_neon_vst1q_bf16_x4, NEON::BI__builtin_neon_vst1q_x4_v},
+    {NEON::BI__builtin_neon_vst1q_bf16, NEON::BI__builtin_neon_vst1q_v},
+    {NEON::BI__builtin_neon_vst1q_lane_bf16,
+     NEON::BI__builtin_neon_vst1q_lane_v},
+    {NEON::BI__builtin_neon_vst2_bf16, NEON::BI__builtin_neon_vst2_v},
+    {NEON::BI__builtin_neon_vst2_lane_bf16, NEON::BI__builtin_neon_vst2_lane_v},
+    {NEON::BI__builtin_neon_vst2q_bf16, NEON::BI__builtin_neon_vst2q_v},
+    {NEON::BI__builtin_neon_vst2q_lane_bf16,
+     NEON::BI__builtin_neon_vst2q_lane_v},
+    {NEON::BI__builtin_neon_vst3_bf16, NEON::BI__builtin_neon_vst3_v},
+    {NEON::BI__builtin_neon_vst3_lane_bf16, NEON::BI__builtin_neon_vst3_lane_v},
+    {NEON::BI__builtin_neon_vst3q_bf16, NEON::BI__builtin_neon_vst3q_v},
+    {NEON::BI__builtin_neon_vst3q_lane_bf16,
+     NEON::BI__builtin_neon_vst3q_lane_v},
+    {NEON::BI__builtin_neon_vst4_bf16, NEON::BI__builtin_neon_vst4_v},
+    {NEON::BI__builtin_neon_vst4_lane_bf16, NEON::BI__builtin_neon_vst4_lane_v},
+    {NEON::BI__builtin_neon_vst4q_bf16, NEON::BI__builtin_neon_vst4q_v},
+    {NEON::BI__builtin_neon_vst4q_lane_bf16,
+     NEON::BI__builtin_neon_vst4q_lane_v},
+    // The mangling rules cause us to have one ID for each type for
+    // vldap1(q)_lane and vstl1(q)_lane, but codegen is equivalent for all of
+    // them. Choose an arbitrary one to be handled as the canonical variation.
+    {NEON::BI__builtin_neon_vldap1_lane_u64,
+     NEON::BI__builtin_neon_vldap1_lane_s64},
+    {NEON::BI__builtin_neon_vldap1_lane_f64,
+     NEON::BI__builtin_neon_vldap1_lane_s64},
+    {NEON::BI__builtin_neon_vldap1_lane_p64,
+     NEON::BI__builtin_neon_vldap1_lane_s64},
+    {NEON::BI__builtin_neon_vldap1q_lane_u64,
+     NEON::BI__builtin_neon_vldap1q_lane_s64},
+    {NEON::BI__builtin_neon_vldap1q_lane_f64,
+     NEON::BI__builtin_neon_vldap1q_lane_s64},
+    {NEON::BI__builtin_neon_vldap1q_lane_p64,
+     NEON::BI__builtin_neon_vldap1q_lane_s64},
+    {NEON::BI__builtin_neon_vstl1_lane_u64,
+     NEON::BI__builtin_neon_vstl1_lane_s64},
+    {NEON::BI__builtin_neon_vstl1_lane_f64,
+     NEON::BI__builtin_neon_vstl1_lane_s64},
+    {NEON::BI__builtin_neon_vstl1_lane_p64,
+     NEON::BI__builtin_neon_vstl1_lane_s64},
+    {NEON::BI__builtin_neon_vstl1q_lane_u64,
+     NEON::BI__builtin_neon_vstl1q_lane_s64},
+    {NEON::BI__builtin_neon_vstl1q_lane_f64,
+     NEON::BI__builtin_neon_vstl1q_lane_s64},
+    {NEON::BI__builtin_neon_vstl1q_lane_p64,
+     NEON::BI__builtin_neon_vstl1q_lane_s64},
+};
+
+mlir::Value
+CIRGenFunction::emitAArch64BuiltinExpr(unsigned builtinID, const CallExpr *expr,
+                                       ReturnValueSlot returnValue,
+                                       llvm::Triple::ArchType arch) {
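+  // Dispatch order: the SVE and SME builtin ranges are handled first, then
+  // individual scalar builtins, then the NEON equivalence remap, and finally
+  // the non-overloaded and overloaded NEON switches.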
+  if (builtinID >= clang::AArch64::FirstSVEBuiltin &&
+      builtinID <= clang::AArch64::LastSVEBuiltin)
+    return emitAArch64SVEBuiltinExpr(builtinID, expr);
+
+  if (builtinID >= clang::AArch64::FirstSMEBuiltin &&
+      builtinID <= clang::AArch64::LastSMEBuiltin)
+    return emitAArch64SMEBuiltinExpr(builtinID, expr);
+
+  if (builtinID == Builtin::BI__builtin_cpu_supports) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  switch (builtinID) {
+  default:
+    break;
+  case clang::AArch64::BI__builtin_arm_nop:
+  case clang::AArch64::BI__builtin_arm_yield:
+  case clang::AArch64::BI__yield:
+  case clang::AArch64::BI__builtin_arm_wfe:
+  case clang::AArch64::BI__wfe:
+  case clang::AArch64::BI__builtin_arm_wfi:
+  case clang::AArch64::BI__wfi:
+  case clang::AArch64::BI__builtin_arm_sev:
+  case clang::AArch64::BI__sev:
+  case clang::AArch64::BI__builtin_arm_sevl:
+  case clang::AArch64::BI__sevl:
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_trap) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_get_sme_state) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_rbit) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+  if (builtinID == clang::AArch64::BI__builtin_arm_rbit64) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_clz ||
+      builtinID == clang::AArch64::BI__builtin_arm_clz64) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_cls) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+  if (builtinID == clang::AArch64::BI__builtin_arm_cls64) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_rint32zf ||
+      builtinID == clang::AArch64::BI__builtin_arm_rint32z) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_rint64zf ||
+      builtinID == clang::AArch64::BI__builtin_arm_rint64z) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_rint32xf ||
+      builtinID == clang::AArch64::BI__builtin_arm_rint32x) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_rint64xf ||
+      builtinID == clang::AArch64::BI__builtin_arm_rint64x) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_jcvt) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_ld64b ||
+      builtinID == clang::AArch64::BI__builtin_arm_st64b ||
+      builtinID == clang::AArch64::BI__builtin_arm_st64bv ||
+      builtinID == clang::AArch64::BI__builtin_arm_st64bv0) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_rndr ||
+      builtinID == clang::AArch64::BI__builtin_arm_rndrrs) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__clear_cache) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if ((builtinID == clang::AArch64::BI__builtin_arm_ldrex ||
+       builtinID == clang::AArch64::BI__builtin_arm_ldaex) &&
+      getContext().getTypeSize(expr->getType()) == 128) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+  if (builtinID == clang::AArch64::BI__builtin_arm_ldrex ||
+      builtinID == clang::AArch64::BI__builtin_arm_ldaex) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if ((builtinID == clang::AArch64::BI__builtin_arm_strex ||
+       builtinID == clang::AArch64::BI__builtin_arm_stlex) &&
+      getContext().getTypeSize(expr->getArg(0)->getType()) == 128) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_strex ||
+      builtinID == clang::AArch64::BI__builtin_arm_stlex) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__getReg) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__break) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_clrex) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI_ReadWriteBarrier) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  // CRC32
+  Intrinsic::ID crcIntrinsicID = Intrinsic::not_intrinsic;
+  switch (builtinID) {
+  case clang::AArch64::BI__builtin_arm_crc32b:
+    crcIntrinsicID = Intrinsic::aarch64_crc32b;
+    break;
+  case clang::AArch64::BI__builtin_arm_crc32cb:
+    crcIntrinsicID = Intrinsic::aarch64_crc32cb;
+    break;
+  case clang::AArch64::BI__builtin_arm_crc32h:
+    crcIntrinsicID = Intrinsic::aarch64_crc32h;
+    break;
+  case clang::AArch64::BI__builtin_arm_crc32ch:
+    crcIntrinsicID = Intrinsic::aarch64_crc32ch;
+    break;
+  case clang::AArch64::BI__builtin_arm_crc32w:
+    crcIntrinsicID = Intrinsic::aarch64_crc32w;
+    break;
+  case clang::AArch64::BI__builtin_arm_crc32cw:
+    crcIntrinsicID = Intrinsic::aarch64_crc32cw;
+    break;
+  case clang::AArch64::BI__builtin_arm_crc32d:
+    crcIntrinsicID = Intrinsic::aarch64_crc32x;
+    break;
+  case clang::AArch64::BI__builtin_arm_crc32cd:
+    crcIntrinsicID = Intrinsic::aarch64_crc32cx;
+    break;
+  }
+
+  if (crcIntrinsicID != Intrinsic::not_intrinsic) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  // Memory Operations (MOPS)
+  if (builtinID == AArch64::BI__builtin_arm_mops_memset_tag) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  // Memory Tagging Extensions (MTE) Intrinsics
+  Intrinsic::ID mteIntrinsicID = Intrinsic::not_intrinsic;
+  switch (builtinID) {
+  case clang::AArch64::BI__builtin_arm_irg:
+    mteIntrinsicID = Intrinsic::aarch64_irg;
+    break;
+  case clang::AArch64::BI__builtin_arm_addg:
+    mteIntrinsicID = Intrinsic::aarch64_addg;
+    break;
+  case clang::AArch64::BI__builtin_arm_gmi:
+    mteIntrinsicID = Intrinsic::aarch64_gmi;
+    break;
+  case clang::AArch64::BI__builtin_arm_ldg:
+    mteIntrinsicID = Intrinsic::aarch64_ldg;
+    break;
+  case clang::AArch64::BI__builtin_arm_stg:
+    mteIntrinsicID = Intrinsic::aarch64_stg;
+    break;
+  case clang::AArch64::BI__builtin_arm_subp:
+    mteIntrinsicID = Intrinsic::aarch64_subp;
+    break;
+  }
+
+  if (mteIntrinsicID != Intrinsic::not_intrinsic) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_arm_rsr ||
+      builtinID == clang::AArch64::BI__builtin_arm_rsr64 ||
+      builtinID == clang::AArch64::BI__builtin_arm_rsr128 ||
+      builtinID == clang::AArch64::BI__builtin_arm_rsrp ||
+      builtinID == clang::AArch64::BI__builtin_arm_wsr ||
+      builtinID == clang::AArch64::BI__builtin_arm_wsr64 ||
+      builtinID == clang::AArch64::BI__builtin_arm_wsr128 ||
+      builtinID == clang::AArch64::BI__builtin_arm_wsrp) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI_ReadStatusReg ||
+      builtinID == clang::AArch64::BI_WriteStatusReg ||
+      builtinID == clang::AArch64::BI__sys) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI_AddressOfReturnAddress) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__builtin_sponentry) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == clang::AArch64::BI__mulh ||
+      builtinID == clang::AArch64::BI__umulh) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == AArch64::BI__writex18byte ||
+      builtinID == AArch64::BI__writex18word ||
+      builtinID == AArch64::BI__writex18dword ||
+      builtinID == AArch64::BI__writex18qword) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == AArch64::BI__readx18byte ||
+      builtinID == AArch64::BI__readx18word ||
+      builtinID == AArch64::BI__readx18dword ||
+      builtinID == AArch64::BI__readx18qword) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == AArch64::BI__addx18byte ||
+      builtinID == AArch64::BI__addx18word ||
+      builtinID == AArch64::BI__addx18dword ||
+      builtinID == AArch64::BI__addx18qword ||
+      builtinID == AArch64::BI__incx18byte ||
+      builtinID == AArch64::BI__incx18word ||
+      builtinID == AArch64::BI__incx18dword ||
+      builtinID == AArch64::BI__incx18qword) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == AArch64::BI_CopyDoubleFromInt64 ||
+      builtinID == AArch64::BI_CopyFloatFromInt32 ||
+      builtinID == AArch64::BI_CopyInt32FromFloat ||
+      builtinID == AArch64::BI_CopyInt64FromDouble) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == AArch64::BI_CountLeadingOnes ||
+      builtinID == AArch64::BI_CountLeadingOnes64 ||
+      builtinID == AArch64::BI_CountLeadingZeros ||
+      builtinID == AArch64::BI_CountLeadingZeros64) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == AArch64::BI_CountLeadingSigns ||
+      builtinID == AArch64::BI_CountLeadingSigns64) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == AArch64::BI_CountOneBits ||
+      builtinID == AArch64::BI_CountOneBits64) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == AArch64::BI__prefetch) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == AArch64::BI__hlt) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  if (builtinID == NEON::BI__builtin_neon_vcvth_bf16_f32) {
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  // Handle MSVC intrinsics before argument evaluation to prevent double
+  // evaluation.
+  assert(!cir::MissingFeatures::msvcBuiltins());
+
+  // Some intrinsics are equivalent - if they are, use the base intrinsic ID.
+  auto it = llvm::find_if(neonEquivalentIntrinsicMap, [builtinID](auto &p) {
+    return p.first == builtinID;
+  });
+  if (it != end(neonEquivalentIntrinsicMap))
+    builtinID = it->second;
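+  // E.g., after this remap __builtin_neon_vabs_f16 is handled exactly like
+  // __builtin_neon_vabs_v for the remainder of this function.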
+
+  // Find out if any arguments are required to be integer constant
+  // expressions.
+  assert(!cir::MissingFeatures::handleBuiltinICEArguments());
+
+  assert(!cir::MissingFeatures::neonSISDIntrinsics());
+
+  // Handle non-overloaded intrinsics first.
+  switch (builtinID) {
+  default:
+    break;
+  case NEON::BI__builtin_neon_vabsh_f16:
+  case NEON::BI__builtin_neon_vaddq_p128:
+  case NEON::BI__builtin_neon_vldrq_p128:
+  case NEON::BI__builtin_neon_vstrq_p128:
+  case NEON::BI__builtin_neon_vcvts_f32_u32:
+  case NEON::BI__builtin_neon_vcvtd_f64_u64:
+  case NEON::BI__builtin_neon_vcvts_f32_s32:
+  case NEON::BI__builtin_neon_vcvtd_f64_s64:
+  case NEON::BI__builtin_neon_vcvth_f16_u16:
+  case NEON::BI__builtin_neon_vcvth_f16_u32:
+  case NEON::BI__builtin_neon_vcvth_f16_u64:
+  case NEON::BI__builtin_neon_vcvth_f16_s16:
+  case NEON::BI__builtin_neon_vcvth_f16_s32:
+  case NEON::BI__builtin_neon_vcvth_f16_s64:
+  case NEON::BI__builtin_neon_vcvtah_u16_f16:
+  case NEON::BI__builtin_neon_vcvtmh_u16_f16:
+  case NEON::BI__builtin_neon_vcvtnh_u16_f16:
+  case NEON::BI__builtin_neon_vcvtph_u16_f16:
+  case NEON::BI__builtin_neon_vcvth_u16_f16:
+  case NEON::BI__builtin_neon_vcvtah_s16_f16:
+  case NEON::BI__builtin_neon_vcvtmh_s16_f16:
+  case NEON::BI__builtin_neon_vcvtnh_s16_f16:
+  case NEON::BI__builtin_neon_vcvtph_s16_f16:
+  case NEON::BI__builtin_neon_vcvth_s16_f16:
+  case NEON::BI__builtin_neon_vcaleh_f16:
+  case NEON::BI__builtin_neon_vcalth_f16:
+  case NEON::BI__builtin_neon_vcageh_f16:
+  case NEON::BI__builtin_neon_vcagth_f16:
+  case NEON::BI__builtin_neon_vcvth_n_s16_f16:
+  case NEON::BI__builtin_neon_vcvth_n_u16_f16:
+  case NEON::BI__builtin_neon_vcvth_n_f16_s16:
+  case NEON::BI__builtin_neon_vcvth_n_f16_u16:
+  case NEON::BI__builtin_neon_vpaddd_s64:
+  case NEON::BI__builtin_neon_vpaddd_f64:
+  case NEON::BI__builtin_neon_vpadds_f32:
+  case NEON::BI__builtin_neon_vceqzd_s64:
+  case NEON::BI__builtin_neon_vceqzd_f64:
+  case NEON::BI__builtin_neon_vceqzs_f32:
+  case NEON::BI__builtin_neon_vceqzh_f16:
+  case NEON::BI__builtin_neon_vcgezd_s64:
+  case NEON::BI__builtin_neon_vcgezd_f64:
+  case NEON::BI__builtin_neon_vcgezs_f32:
+  case NEON::BI__builtin_neon_vcgezh_f16:
+  case NEON::BI__builtin_neon_vclezd_s64:
+  case NEON::BI__builtin_neon_vclezd_f64:
+  case NEON::BI__builtin_neon_vclezs_f32:
+  case NEON::BI__builtin_neon_vclezh_f16:
+  case NEON::BI__builtin_neon_vcgtzd_s64:
+  case NEON::BI__builtin_neon_vcgtzd_f64:
+  case NEON::BI__builtin_neon_vcgtzs_f32:
+  case NEON::BI__builtin_neon_vcgtzh_f16:
+  case NEON::BI__builtin_neon_vcltzd_s64:
+  case NEON::BI__builtin_neon_vcltzd_f64:
+  case NEON::BI__builtin_neon_vcltzs_f32:
+  case NEON::BI__builtin_neon_vcltzh_f16:
+  case NEON::BI__builtin_neon_vceqzd_u64:
+  case NEON::BI__builtin_neon_vceqd_f64:
+  case NEON::BI__builtin_neon_vcled_f64:
+  case NEON::BI__builtin_neon_vcltd_f64:
+  case NEON::BI__builtin_neon_vcged_f64:
+  case NEON::BI__builtin_neon_vcgtd_f64:
+  case NEON::BI__builtin_neon_vceqs_f32:
+  case NEON::BI__builtin_neon_vcles_f32:
+  case NEON::BI__builtin_neon_vclts_f32:
+  case NEON::BI__builtin_neon_vcges_f32:
+  case NEON::BI__builtin_neon_vcgts_f32:
+  case NEON::BI__builtin_neon_vceqh_f16:
+  case NEON::BI__builtin_neon_vcleh_f16:
+  case NEON::BI__builtin_neon_vclth_f16:
+  case NEON::BI__builtin_neon_vcgeh_f16:
+  case NEON::BI__builtin_neon_vcgth_f16:
+  case NEON::BI__builtin_neon_vceqd_s64:
+  case NEON::BI__builtin_neon_vceqd_u64:
+  case NEON::BI__builtin_neon_vcgtd_s64:
+  case NEON::BI__builtin_neon_vcgtd_u64:
+  case NEON::BI__builtin_neon_vcltd_s64:
+  case NEON::BI__builtin_neon_vcltd_u64:
+  case NEON::BI__builtin_neon_vcged_u64:
+  case NEON::BI__builtin_neon_vcged_s64:
+  case NEON::BI__builtin_neon_vcled_u64:
+  case NEON::BI__builtin_neon_vcled_s64:
+  case NEON::BI__builtin_neon_vtstd_s64:
+  case NEON::BI__builtin_neon_vtstd_u64:
+  case NEON::BI__builtin_neon_vset_lane_i8:
+  case NEON::BI__builtin_neon_vset_lane_i16:
+  case NEON::BI__builtin_neon_vset_lane_i32:
+  case NEON::BI__builtin_neon_vset_lane_i64:
+  case NEON::BI__builtin_neon_vset_lane_bf16:
+  case NEON::BI__builtin_neon_vset_lane_f32:
+  case NEON::BI__builtin_neon_vsetq_lane_i8:
+  case NEON::BI__builtin_neon_vsetq_lane_i16:
+  case NEON::BI__builtin_neon_vsetq_lane_i32:
+  case NEON::BI__builtin_neon_vsetq_lane_i64:
+  case NEON::BI__builtin_neon_vsetq_lane_bf16:
+  case NEON::BI__builtin_neon_vsetq_lane_f32:
+  case NEON::BI__builtin_neon_vset_lane_f64:
+  case NEON::BI__builtin_neon_vset_lane_mf8:
+  case NEON::BI__builtin_neon_vsetq_lane_mf8:
+  case NEON::BI__builtin_neon_vsetq_lane_f64:
+  case NEON::BI__builtin_neon_vget_lane_i8:
+  case NEON::BI__builtin_neon_vdupb_lane_i8:
+  case NEON::BI__builtin_neon_vgetq_lane_i8:
+  case NEON::BI__builtin_neon_vdupb_laneq_i8:
+  case NEON::BI__builtin_neon_vget_lane_mf8:
+  case NEON::BI__builtin_neon_vdupb_lane_mf8:
+  case NEON::BI__builtin_neon_vgetq_lane_mf8:
+  case NEON::BI__builtin_neon_vdupb_laneq_mf8:
+  case NEON::BI__builtin_neon_vget_lane_i16:
+  case NEON::BI__builtin_neon_vduph_lane_i16:
+  case NEON::BI__builtin_neon_vgetq_lane_i16:
+  case NEON::BI__builtin_neon_vduph_laneq_i16:
+  case NEON::BI__builtin_neon_vget_lane_i32:
+  case NEON::BI__builtin_neon_vdups_lane_i32:
+  case NEON::BI__builtin_neon_vdups_lane_f32:
+  case NEON::BI__builtin_neon_vgetq_lane_i32:
+  case NEON::BI__builtin_neon_vdups_laneq_i32:
+  case NEON::BI__builtin_neon_vget_lane_i64:
+  case NEON::BI__builtin_neon_vdupd_lane_i64:
+  case NEON::BI__builtin_neon_vdupd_lane_f64:
+  case NEON::BI__builtin_neon_vgetq_lane_i64:
+  case NEON::BI__builtin_neon_vdupd_laneq_i64:
+  case NEON::BI__builtin_neon_vget_lane_f32:
+  case NEON::BI__builtin_neon_vget_lane_f64:
+  case NEON::BI__builtin_neon_vgetq_lane_f32:
+  case NEON::BI__builtin_neon_vdups_laneq_f32:
+  case NEON::BI__builtin_neon_vgetq_lane_f64:
+  case NEON::BI__builtin_neon_vdupd_laneq_f64:
+  case NEON::BI__builtin_neon_vaddh_f16:
+  case NEON::BI__builtin_neon_vsubh_f16:
+  case NEON::BI__builtin_neon_vmulh_f16:
+  case NEON::BI__builtin_neon_vdivh_f16:
+  case NEON::BI__builtin_neon_vfmah_f16:
+  case NEON::BI__builtin_neon_vfmsh_f16:
+  case NEON::BI__builtin_neon_vaddd_s64:
+  case NEON::BI__builtin_neon_vaddd_u64:
+  case NEON::BI__builtin_neon_vsubd_s64:
+  case NEON::BI__builtin_neon_vsubd_u64:
+  case NEON::BI__builtin_neon_vqdmlalh_s16:
+  case NEON::BI__builtin_neon_vqdmlslh_s16:
+  case NEON::BI__builtin_neon_vqshlud_n_s64:
+  case NEON::BI__builtin_neon_vqshld_n_u64:
+  case NEON::BI__builtin_neon_vqshld_n_s64:
+  case NEON::BI__builtin_neon_vrshrd_n_u64:
+  case NEON::BI__builtin_neon_vrshrd_n_s64:
+  case NEON::BI__builtin_neon_vrsrad_n_u64:
+  case NEON::BI__builtin_neon_vrsrad_n_s64:
+  case NEON::BI__builtin_neon_vshld_n_s64:
+  case NEON::BI__builtin_neon_vshld_n_u64:
+  case NEON::BI__builtin_neon_vshrd_n_s64:
+  case NEON::BI__builtin_neon_vshrd_n_u64:
+  case NEON::BI__builtin_neon_vsrad_n_s64:
+  case NEON::BI__builtin_neon_vsrad_n_u64:
+  case NEON::BI__builtin_neon_vqdmlalh_lane_s16:
+  case NEON::BI__builtin_neon_vqdmlalh_laneq_s16:
+  case NEON::BI__builtin_neon_vqdmlslh_lane_s16:
+  case NEON::BI__builtin_neon_vqdmlslh_laneq_s16:
+  case NEON::BI__builtin_neon_vqdmlals_s32:
+  case NEON::BI__builtin_neon_vqdmlsls_s32:
+  case NEON::BI__builtin_neon_vqdmlals_lane_s32:
+  case NEON::BI__builtin_neon_vqdmlals_laneq_s32:
+  case NEON::BI__builtin_neon_vqdmlsls_lane_s32:
+  case NEON::BI__builtin_neon_vqdmlsls_laneq_s32:
+  case NEON::BI__builtin_neon_vget_lane_bf16:
+  case NEON::BI__builtin_neon_vduph_lane_bf16:
+  case NEON::BI__builtin_neon_vduph_lane_f16:
+  case NEON::BI__builtin_neon_vgetq_lane_bf16:
+  case NEON::BI__builtin_neon_vduph_laneq_bf16:
+  case NEON::BI__builtin_neon_vduph_laneq_f16:
+  case NEON::BI__builtin_neon_vcvt_bf16_f32:
+  case NEON::BI__builtin_neon_vcvtq_low_bf16_f32:
+  case NEON::BI__builtin_neon_vcvtq_high_bf16_f32:
+  case clang::AArch64::BI_InterlockedAdd:
+  case clang::AArch64::BI_InterlockedAdd_acq:
+  case clang::AArch64::BI_InterlockedAdd_rel:
+  case clang::AArch64::BI_InterlockedAdd_nf:
+  case clang::AArch64::BI_InterlockedAdd64:
+  case clang::AArch64::BI_InterlockedAdd64_acq:
+  case clang::AArch64::BI_InterlockedAdd64_rel:
+  case clang::AArch64::BI_InterlockedAdd64_nf:
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  // Not all intrinsics handled by the common case work for AArch64 yet, so only
+  // defer to common code if it's been added to our special map.
+  assert(!cir::MissingFeatures::aarch64SIMDIntrinsics());
+
+  assert(!cir::MissingFeatures::aarch64TblBuiltinExpr());
+
+  switch (builtinID) {
+  default:
+    return {};
+  case NEON::BI__builtin_neon_vbsl_v:
+  case NEON::BI__builtin_neon_vbslq_v:
+  case NEON::BI__builtin_neon_vfma_lane_v:
+  case NEON::BI__builtin_neon_vfmaq_lane_v:
+  case NEON::BI__builtin_neon_vfma_laneq_v:
+  case NEON::BI__builtin_neon_vfmaq_laneq_v:
+  case NEON::BI__builtin_neon_vfmah_lane_f16:
+  case NEON::BI__builtin_neon_vfmas_lane_f32:
+  case NEON::BI__builtin_neon_vfmah_laneq_f16:
+  case NEON::BI__builtin_neon_vfmas_laneq_f32:
+  case NEON::BI__builtin_neon_vfmad_lane_f64:
+  case NEON::BI__builtin_neon_vfmad_laneq_f64:
+  case NEON::BI__builtin_neon_vmull_v:
+  case NEON::BI__builtin_neon_vmax_v:
+  case NEON::BI__builtin_neon_vmaxq_v:
+  case NEON::BI__builtin_neon_vmaxh_f16:
+  case NEON::BI__builtin_neon_vmin_v:
+  case NEON::BI__builtin_neon_vminq_v:
+  case NEON::BI__builtin_neon_vminh_f16:
+  case NEON::BI__builtin_neon_vabd_v:
+  case NEON::BI__builtin_neon_vabdq_v:
+  case NEON::BI__builtin_neon_vpadal_v:
+  case NEON::BI__builtin_neon_vpadalq_v:
+  case NEON::BI__builtin_neon_vpmin_v:
+  case NEON::BI__builtin_neon_vpminq_v:
+  case NEON::BI__builtin_neon_vpmax_v:
+  case NEON::BI__builtin_neon_vpmaxq_v:
+  case NEON::BI__builtin_neon_vminnm_v:
+  case NEON::BI__builtin_neon_vminnmq_v:
+  case NEON::BI__builtin_neon_vminnmh_f16:
+  case NEON::BI__builtin_neon_vmaxnm_v:
+  case NEON::BI__builtin_neon_vmaxnmq_v:
+  case NEON::BI__builtin_neon_vmaxnmh_f16:
+  case NEON::BI__builtin_neon_vrecpss_f32:
+  case NEON::BI__builtin_neon_vrecpsd_f64:
+  case NEON::BI__builtin_neon_vrecpsh_f16:
+  case NEON::BI__builtin_neon_vqshrun_n_v:
+  case NEON::BI__builtin_neon_vqrshrun_n_v:
+  case NEON::BI__builtin_neon_vqshrn_n_v:
+  case NEON::BI__builtin_neon_vrshrn_n_v:
+  case NEON::BI__builtin_neon_vqrshrn_n_v:
+  case NEON::BI__builtin_neon_vrndah_f16:
+  case NEON::BI__builtin_neon_vrnda_v:
+  case NEON::BI__builtin_neon_vrndaq_v:
+  case NEON::BI__builtin_neon_vrndih_f16:
+  case NEON::BI__builtin_neon_vrndmh_f16:
+  case NEON::BI__builtin_neon_vrndm_v:
+  case NEON::BI__builtin_neon_vrndmq_v:
+  case NEON::BI__builtin_neon_vrndnh_f16:
+  case NEON::BI__builtin_neon_vrndn_v:
+  case NEON::BI__builtin_neon_vrndnq_v:
+  case NEON::BI__builtin_neon_vrndns_f32:
+  case NEON::BI__builtin_neon_vrndph_f16:
+  case NEON::BI__builtin_neon_vrndp_v:
+  case NEON::BI__builtin_neon_vrndpq_v:
+  case NEON::BI__builtin_neon_vrndxh_f16:
+  case NEON::BI__builtin_neon_vrndx_v:
+  case NEON::BI__builtin_neon_vrndxq_v:
+  case NEON::BI__builtin_neon_vrndh_f16:
+  case NEON::BI__builtin_neon_vrnd32x_f32:
+  case NEON::BI__builtin_neon_vrnd32xq_f32:
+  case NEON::BI__builtin_neon_vrnd32x_f64:
+  case NEON::BI__builtin_neon_vrnd32xq_f64:
+  case NEON::BI__builtin_neon_vrnd32z_f32:
+  case NEON::BI__builtin_neon_vrnd32zq_f32:
+  case NEON::BI__builtin_neon_vrnd32z_f64:
+  case NEON::BI__builtin_neon_vrnd32zq_f64:
+  case NEON::BI__builtin_neon_vrnd64x_f32:
+  case NEON::BI__builtin_neon_vrnd64xq_f32:
+  case NEON::BI__builtin_neon_vrnd64x_f64:
+  case NEON::BI__builtin_neon_vrnd64xq_f64:
+  case NEON::BI__builtin_neon_vrnd64z_f32:
+  case NEON::BI__builtin_neon_vrnd64zq_f32:
+  case NEON::BI__builtin_neon_vrnd64z_f64:
+  case NEON::BI__builtin_neon_vrnd64zq_f64:
+  case NEON::BI__builtin_neon_vrnd_v:
+  case NEON::BI__builtin_neon_vrndq_v:
+  case NEON::BI__builtin_neon_vcvt_f64_v:
+  case NEON::BI__builtin_neon_vcvtq_f64_v:
+  case NEON::BI__builtin_neon_vcvt_f64_f32:
+  case NEON::BI__builtin_neon_vcvt_f32_f64:
+  case NEON::BI__builtin_neon_vcvt_s32_v:
+  case NEON::BI__builtin_neon_vcvt_u32_v:
+  case NEON::BI__builtin_neon_vcvt_s64_v:
+  case NEON::BI__builtin_neon_vcvt_u64_v:
+  case NEON::BI__builtin_neon_vcvt_s16_f16:
+  case NEON::BI__builtin_neon_vcvt_u16_f16:
+  case NEON::BI__builtin_neon_vcvtq_s32_v:
+  case NEON::BI__builtin_neon_vcvtq_u32_v:
+  case NEON::BI__builtin_neon_vcvtq_s64_v:
+  case NEON::BI__builtin_neon_vcvtq_u64_v:
+  case NEON::BI__builtin_neon_vcvtq_s16_f16:
+  case NEON::BI__builtin_neon_vcvtq_u16_f16:
+  case NEON::BI__builtin_neon_vcvta_s16_f16:
+  case NEON::BI__builtin_neon_vcvta_u16_f16:
+  case NEON::BI__builtin_neon_vcvta_s32_v:
+  case NEON::BI__builtin_neon_vcvtaq_s16_f16:
+  case NEON::BI__builtin_neon_vcvtaq_s32_v:
+  case NEON::BI__builtin_neon_vcvta_u32_v:
+  case NEON::BI__builtin_neon_vcvtaq_u16_f16:
+  case NEON::BI__builtin_neon_vcvtaq_u32_v:
+  case NEON::BI__builtin_neon_vcvta_s64_v:
+  case NEON::BI__builtin_neon_vcvtaq_s64_v:
+  case NEON::BI__builtin_neon_vcvta_u64_v:
+  case NEON::BI__builtin_neon_vcvtaq_u64_v:
+  case NEON::BI__builtin_neon_vcvtm_s16_f16:
+  case NEON::BI__builtin_neon_vcvtm_s32_v:
+  case NEON::BI__builtin_neon_vcvtmq_s16_f16:
+  case NEON::BI__builtin_neon_vcvtmq_s32_v:
+  case NEON::BI__builtin_neon_vcvtm_u16_f16:
+  case NEON::BI__builtin_neon_vcvtm_u32_v:
+  case NEON::BI__builtin_neon_vcvtmq_u16_f16:
+  case NEON::BI__builtin_neon_vcvtmq_u32_v:
+  case NEON::BI__builtin_neon_vcvtm_s64_v:
+  case NEON::BI__builtin_neon_vcvtmq_s64_v:
+  case NEON::BI__builtin_neon_vcvtm_u64_v:
+  case NEON::BI__builtin_neon_vcvtmq_u64_v:
+  case NEON::BI__builtin_neon_vcvtn_s16_f16:
+  case NEON::BI__builtin_neon_vcvtn_s32_v:
+  case NEON::BI__builtin_neon_vcvtnq_s16_f16:
+  case NEON::BI__builtin_neon_vcvtnq_s32_v:
+  case NEON::BI__builtin_neon_vcvtn_u16_f16:
+  case NEON::BI__builtin_neon_vcvtn_u32_v:
+  case NEON::BI__builtin_neon_vcvtnq_u16_f16:
+  case NEON::BI__builtin_neon_vcvtnq_u32_v:
+  case NEON::BI__builtin_neon_vcvtn_s64_v:
+  case NEON::BI__builtin_neon_vcvtnq_s64_v:
+  case NEON::BI__builtin_neon_vcvtn_u64_v:
+  case NEON::BI__builtin_neon_vcvtnq_u64_v:
+  case NEON::BI__builtin_neon_vcvtp_s16_f16:
+  case NEON::BI__builtin_neon_vcvtp_s32_v:
+  case NEON::BI__builtin_neon_vcvtpq_s16_f16:
+  case NEON::BI__builtin_neon_vcvtpq_s32_v:
+  case NEON::BI__builtin_neon_vcvtp_u16_f16:
+  case NEON::BI__builtin_neon_vcvtp_u32_v:
+  case NEON::BI__builtin_neon_vcvtpq_u16_f16:
+  case NEON::BI__builtin_neon_vcvtpq_u32_v:
+  case NEON::BI__builtin_neon_vcvtp_s64_v:
+  case NEON::BI__builtin_neon_vcvtpq_s64_v:
+  case NEON::BI__builtin_neon_vcvtp_u64_v:
+  case NEON::BI__builtin_neon_vcvtpq_u64_v:
+  case NEON::BI__builtin_neon_vmulx_v:
+  case NEON::BI__builtin_neon_vmulxq_v:
+  case NEON::BI__builtin_neon_vmulxh_lane_f16:
+  case NEON::BI__builtin_neon_vmulxh_laneq_f16:
+  case NEON::BI__builtin_neon_vmul_lane_v:
+  case NEON::BI__builtin_neon_vmul_laneq_v:
+  case NEON::BI__builtin_neon_vnegd_s64:
+  case NEON::BI__builtin_neon_vnegh_f16:
+  case NEON::BI__builtin_neon_vpmaxnm_v:
+  case NEON::BI__builtin_neon_vpmaxnmq_v:
+  case NEON::BI__builtin_neon_vpminnm_v:
+  case NEON::BI__builtin_neon_vpminnmq_v:
+  case NEON::BI__builtin_neon_vsqrth_f16:
+  case NEON::BI__builtin_neon_vsqrt_v:
+  case NEON::BI__builtin_neon_vsqrtq_v:
+  case NEON::BI__builtin_neon_vrbit_v:
+  case NEON::BI__builtin_neon_vrbitq_v:
+  case NEON::BI__builtin_neon_vmaxv_f16:
+  case NEON::BI__builtin_neon_vmaxvq_f16:
+  case NEON::BI__builtin_neon_vminv_f16:
+  case NEON::BI__builtin_neon_vminvq_f16:
+  case NEON::BI__builtin_neon_vmaxnmv_f16:
+  case NEON::BI__builtin_neon_vmaxnmvq_f16:
+  case NEON::BI__builtin_neon_vminnmv_f16:
+  case NEON::BI__builtin_neon_vminnmvq_f16:
+  case NEON::BI__builtin_neon_vmul_n_f64:
+  case NEON::BI__builtin_neon_vaddlv_u8:
+  case NEON::BI__builtin_neon_vaddlv_u16:
+  case NEON::BI__builtin_neon_vaddlvq_u8:
+  case NEON::BI__builtin_neon_vaddlvq_u16:
+  case NEON::BI__builtin_neon_vaddlv_s8:
+  case NEON::BI__builtin_neon_vaddlv_s16:
+  case NEON::BI__builtin_neon_vaddlvq_s8:
+  case NEON::BI__builtin_neon_vaddlvq_s16:
+  case NEON::BI__builtin_neon_vsri_n_v:
+  case NEON::BI__builtin_neon_vsriq_n_v:
+  case NEON::BI__builtin_neon_vsli_n_v:
+  case NEON::BI__builtin_neon_vsliq_n_v:
+  case NEON::BI__builtin_neon_vsra_n_v:
+  case NEON::BI__builtin_neon_vsraq_n_v:
+  case NEON::BI__builtin_neon_vrsra_n_v:
+  case NEON::BI__builtin_neon_vrsraq_n_v:
+  case NEON::BI__builtin_neon_vld1_v:
+  case NEON::BI__builtin_neon_vld1q_v:
+  case NEON::BI__builtin_neon_vst1_v:
+  case NEON::BI__builtin_neon_vst1q_v:
+  case NEON::BI__builtin_neon_vld1_lane_v:
+  case NEON::BI__builtin_neon_vld1q_lane_v:
+  case NEON::BI__builtin_neon_vldap1_lane_s64:
+  case NEON::BI__builtin_neon_vldap1q_lane_s64:
+  case NEON::BI__builtin_neon_vld1_dup_v:
+  case NEON::BI__builtin_neon_vld1q_dup_v:
+  case NEON::BI__builtin_neon_vst1_lane_v:
+  case NEON::BI__builtin_neon_vst1q_lane_v:
+  case NEON::BI__builtin_neon_vstl1_lane_s64:
+  case NEON::BI__builtin_neon_vstl1q_lane_s64:
+  case NEON::BI__builtin_neon_vld2_v:
+  case NEON::BI__builtin_neon_vld2q_v:
+  case NEON::BI__builtin_neon_vld3_v:
+  case NEON::BI__builtin_neon_vld3q_v:
+  case NEON::BI__builtin_neon_vld4_v:
+  case NEON::BI__builtin_neon_vld4q_v:
+  case NEON::BI__builtin_neon_vld2_dup_v:
+  case NEON::BI__builtin_neon_vld2q_dup_v:
+  case NEON::BI__builtin_neon_vld3_dup_v:
+  case NEON::BI__builtin_neon_vld3q_dup_v:
+  case NEON::BI__builtin_neon_vld4_dup_v:
+  case NEON::BI__builtin_neon_vld4q_dup_v:
+  case NEON::BI__builtin_neon_vld2_lane_v:
+  case NEON::BI__builtin_neon_vld2q_lane_v:
+  case NEON::BI__builtin_neon_vld3_lane_v:
+  case NEON::BI__builtin_neon_vld3q_lane_v:
+  case NEON::BI__builtin_neon_vld4_lane_v:
+  case NEON::BI__builtin_neon_vld4q_lane_v:
+  case NEON::BI__builtin_neon_vst2_v:
+  case NEON::BI__builtin_neon_vst2q_v:
+  case NEON::BI__builtin_neon_vst2_lane_v:
+  case NEON::BI__builtin_neon_vst2q_lane_v:
+  case NEON::BI__builtin_neon_vst3_v:
+  case NEON::BI__builtin_neon_vst3q_v:
+  case NEON::BI__builtin_neon_vst3_lane_v:
+  case NEON::BI__builtin_neon_vst3q_lane_v:
+  case NEON::BI__builtin_neon_vst4_v:
+  case NEON::BI__builtin_neon_vst4q_v:
+  case NEON::BI__builtin_neon_vst4_lane_v:
+  case NEON::BI__builtin_neon_vst4q_lane_v:
+  case NEON::BI__builtin_neon_vtrn_v:
+  case NEON::BI__builtin_neon_vtrnq_v:
+  case NEON::BI__builtin_neon_vuzp_v:
+  case NEON::BI__builtin_neon_vuzpq_v:
+  case NEON::BI__builtin_neon_vzip_v:
+  case NEON::BI__builtin_neon_vzipq_v:
+  case NEON::BI__builtin_neon_vqtbl1q_v:
+  case NEON::BI__builtin_neon_vqtbl2q_v:
+  case NEON::BI__builtin_neon_vqtbl3q_v:
+  case NEON::BI__builtin_neon_vqtbl4q_v:
+  case NEON::BI__builtin_neon_vqtbx1q_v:
+  case NEON::BI__builtin_neon_vqtbx2q_v:
+  case NEON::BI__builtin_neon_vqtbx3q_v:
+  case NEON::BI__builtin_neon_vqtbx4q_v:
+  case NEON::BI__builtin_neon_vsqadd_v:
+  case NEON::BI__builtin_neon_vsqaddq_v:
+  case NEON::BI__builtin_neon_vuqadd_v:
+  case NEON::BI__builtin_neon_vuqaddq_v:
+  case NEON::BI__builtin_neon_vluti2_laneq_mf8:
+  case NEON::BI__builtin_neon_vluti2_laneq_bf16:
+  case NEON::BI__builtin_neon_vluti2_laneq_f16:
+  case NEON::BI__builtin_neon_vluti2_laneq_p16:
+  case NEON::BI__builtin_neon_vluti2_laneq_p8:
+  case NEON::BI__builtin_neon_vluti2_laneq_s16:
+  case NEON::BI__builtin_neon_vluti2_laneq_s8:
+  case NEON::BI__builtin_neon_vluti2_laneq_u16:
+  case NEON::BI__builtin_neon_vluti2_laneq_u8:
+  case NEON::BI__builtin_neon_vluti2q_laneq_mf8:
+  case NEON::BI__builtin_neon_vluti2q_laneq_bf16:
+  case NEON::BI__builtin_neon_vluti2q_laneq_f16:
+  case NEON::BI__builtin_neon_vluti2q_laneq_p16:
+  case NEON::BI__builtin_neon_vluti2q_laneq_p8:
+  case NEON::BI__builtin_neon_vluti2q_laneq_s16:
+  case NEON::BI__builtin_neon_vluti2q_laneq_s8:
+  case NEON::BI__builtin_neon_vluti2q_laneq_u16:
+  case NEON::BI__builtin_neon_vluti2q_laneq_u8:
+  case NEON::BI__builtin_neon_vluti2_lane_mf8:
+  case NEON::BI__builtin_neon_vluti2_lane_bf16:
+  case NEON::BI__builtin_neon_vluti2_lane_f16:
+  case NEON::BI__builtin_neon_vluti2_lane_p16:
+  case NEON::BI__builtin_neon_vluti2_lane_p8:
+  case NEON::BI__builtin_neon_vluti2_lane_s16:
+  case NEON::BI__builtin_neon_vluti2_lane_s8:
+  case NEON::BI__builtin_neon_vluti2_lane_u16:
+  case NEON::BI__builtin_neon_vluti2_lane_u8:
+  case NEON::BI__builtin_neon_vluti2q_lane_mf8:
+  case NEON::BI__builtin_neon_vluti2q_lane_bf16:
+  case NEON::BI__builtin_neon_vluti2q_lane_f16:
+  case NEON::BI__builtin_neon_vluti2q_lane_p16:
+  case NEON::BI__builtin_neon_vluti2q_lane_p8:
+  case NEON::BI__builtin_neon_vluti2q_lane_s16:
+  case NEON::BI__builtin_neon_vluti2q_lane_s8:
+  case NEON::BI__builtin_neon_vluti2q_lane_u16:
+  case NEON::BI__builtin_neon_vluti2q_lane_u8:
+  case NEON::BI__builtin_neon_vluti4q_lane_mf8:
+  case NEON::BI__builtin_neon_vluti4q_lane_p8:
+  case NEON::BI__builtin_neon_vluti4q_lane_s8:
+  case NEON::BI__builtin_neon_vluti4q_lane_u8:
+  case NEON::BI__builtin_neon_vluti4q_laneq_mf8:
+  case NEON::BI__builtin_neon_vluti4q_laneq_p8:
+  case NEON::BI__builtin_neon_vluti4q_laneq_s8:
+  case NEON::BI__builtin_neon_vluti4q_laneq_u8:
+  case NEON::BI__builtin_neon_vluti4q_lane_bf16_x2:
+  case NEON::BI__builtin_neon_vluti4q_lane_f16_x2:
+  case NEON::BI__builtin_neon_vluti4q_lane_p16_x2:
+  case NEON::BI__builtin_neon_vluti4q_lane_s16_x2:
+  case NEON::BI__builtin_neon_vluti4q_lane_u16_x2:
+  case NEON::BI__builtin_neon_vluti4q_laneq_bf16_x2:
+  case NEON::BI__builtin_neon_vluti4q_laneq_f16_x2:
+  case NEON::BI__builtin_neon_vluti4q_laneq_p16_x2:
+  case NEON::BI__builtin_neon_vluti4q_laneq_s16_x2:
+  case NEON::BI__builtin_neon_vluti4q_laneq_u16_x2:
+  case NEON::BI__builtin_neon_vmmlaq_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vmmlaq_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vcvt1_low_bf16_mf8_fpm:
+  case NEON::BI__builtin_neon_vcvt1_bf16_mf8_fpm:
+  case NEON::BI__builtin_neon_vcvt1_high_bf16_mf8_fpm:
+  case NEON::BI__builtin_neon_vcvt2_low_bf16_mf8_fpm:
+  case NEON::BI__builtin_neon_vcvt2_bf16_mf8_fpm:
+  case NEON::BI__builtin_neon_vcvt2_high_bf16_mf8_fpm:
+  case NEON::BI__builtin_neon_vcvt1_low_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vcvt1_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vcvt1_high_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vcvt2_low_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vcvt2_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vcvt2_high_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vcvt_mf8_f32_fpm:
+  case NEON::BI__builtin_neon_vcvt_mf8_f16_fpm:
+  case NEON::BI__builtin_neon_vcvtq_mf8_f16_fpm:
+  case NEON::BI__builtin_neon_vcvt_high_mf8_f32_fpm:
+  case NEON::BI__builtin_neon_vdot_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vdotq_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vdot_lane_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vdotq_lane_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vdot_laneq_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vdotq_laneq_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vdot_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vdotq_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vdot_lane_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vdotq_lane_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vdot_laneq_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vdotq_laneq_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlalbq_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlaltq_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlallbbq_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlallbtq_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlalltbq_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlallttq_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlalbq_lane_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlalbq_laneq_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlaltq_lane_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlaltq_laneq_f16_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlallbbq_lane_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlallbbq_laneq_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlallbtq_lane_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlallbtq_laneq_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlalltbq_lane_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlalltbq_laneq_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlallttq_lane_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vmlallttq_laneq_f32_mf8_fpm:
+  case NEON::BI__builtin_neon_vamin_f16:
+  case NEON::BI__builtin_neon_vaminq_f16:
+  case NEON::BI__builtin_neon_vamin_f32:
+  case NEON::BI__builtin_neon_vaminq_f32:
+  case NEON::BI__builtin_neon_vaminq_f64:
+  case NEON::BI__builtin_neon_vamax_f16:
+  case NEON::BI__builtin_neon_vamaxq_f16:
+  case NEON::BI__builtin_neon_vamax_f32:
+  case NEON::BI__builtin_neon_vamaxq_f32:
+  case NEON::BI__builtin_neon_vamaxq_f64:
+  case NEON::BI__builtin_neon_vscale_f16:
+  case NEON::BI__builtin_neon_vscaleq_f16:
+  case NEON::BI__builtin_neon_vscale_f32:
+  case NEON::BI__builtin_neon_vscaleq_f32:
+  case NEON::BI__builtin_neon_vscaleq_f64:
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented AArch64 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
+  }
+
+  // Unreachable: All cases in the switch above return.
+}
diff --git a/clang/lib/CIR/CodeGen/CIRGenBuiltinX86.cpp b/clang/lib/CIR/CodeGen/CIRGenBuiltinX86.cpp
index 0e43345bad6f1..1b2e3f41479db 100644
--- a/clang/lib/CIR/CodeGen/CIRGenBuiltinX86.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenBuiltinX86.cpp
@@ -11,10 +11,14 @@
 //
 //===----------------------------------------------------------------------===//
 
+#include "CIRGenBuilder.h"
 #include "CIRGenFunction.h"
 #include "CIRGenModule.h"
+#include "mlir/IR/Location.h"
+#include "mlir/IR/ValueRange.h"
 #include "clang/Basic/Builtins.h"
 #include "clang/Basic/TargetBuiltins.h"
+#include "clang/CIR/Dialect/IR/CIRTypes.h"
 #include "clang/CIR/MissingFeatures.h"
 
 using namespace clang;
@@ -85,6 +89,186 @@ static mlir::Value getMaskVecValue(CIRGenBuilderTy &builder, mlir::Location loc,
   return maskVec;
 }
 
+// Builds the VecShuffleOp for pshuflw and pshufhw x86 builtins.
+//
+// The vector is split into lanes of 8 word elements (16 bits). The lower or
+// upper half of each lane, controlled by `isLow`, is shuffled in the following
+// way: The immediate is truncated to 8 bits and split into 4 2-bit fields. The
+// value of the i-th field selects the source element (within the half lane)
+// that ends up at position i of the shuffled half lane. The other half of the
+// lane remains unchanged.
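+//
+// For example, for pshuflw (isLow = true) with an immediate of 0x1B (fields
+// 3,2,1,0), lane 0 of the result uses source indices <3,2,1,0,4,5,6,7>; every
+// higher lane is shuffled the same way relative to its own base index.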
+static cir::VecShuffleOp emitPshufWord(CIRGenBuilderTy &builder,
+                                       const mlir::Value vec,
+                                       const mlir::Value immediate,
+                                       const mlir::Location loc,
+                                       const bool isLow) {
+  uint32_t imm = CIRGenFunction::getZExtIntValueFromConstOp(immediate);
+
+  auto vecTy = cast<cir::VectorType>(vec.getType());
+  unsigned numElts = vecTy.getSize();
+
+  unsigned firstHalfStart = isLow ? 0 : 4;
+  unsigned secondHalfStart = 4 - firstHalfStart;
+
+  // Splat the low 8 bits of the immediate 4 times so each lane reads the same
+  // fields as the loop shifts them out.
+  imm = (imm & 0xff) * 0x01010101;
+
+  int64_t indices[32];
+  for (unsigned l = 0; l != numElts; l += 8) {
+    for (unsigned i = firstHalfStart; i != firstHalfStart + 4; ++i) {
+      indices[l + i] = l + (imm & 3) + firstHalfStart;
+      imm >>= 2;
+    }
+    for (unsigned i = secondHalfStart; i != secondHalfStart + 4; ++i)
+      indices[l + i] = l + i;
+  }
+
+  return builder.createVecShuffle(loc, vec, ArrayRef(indices, numElts));
+}
+
+// Builds the shuffle mask for the pshufd, vpermilps/pd, and shufpd/shufps x86
+// builtins. The shuffle mask is written to outIndices.
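+//
+// For example, shufps (isShufP = true) with an immediate of 0x4E (fields
+// 2,3,0,1) yields the mask <2,3,4,5>: the low two results are taken from the
+// first vector and the high two from the second.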
+static void
+computeFullLaneShuffleMask(CIRGenFunction &cgf, const mlir::Value vec,
+                           uint32_t imm, const bool isShufP,
+                           llvm::SmallVectorImpl<int64_t> &outIndices) {
+  auto vecTy = cast<cir::VectorType>(vec.getType());
+  unsigned numElts = vecTy.getSize();
+  unsigned numLanes = cgf.cgm.getDataLayout().getTypeSizeInBits(vecTy) / 128;
+  unsigned numLaneElts = numElts / numLanes;
+
+  // Splat the low 8 bits of the immediate 4 times so each lane reads the same
+  // fields as the loop consumes them.
+  imm = (imm & 0xff) * 0x01010101;
+
+  for (unsigned l = 0; l != numElts; l += numLaneElts) {
+    for (unsigned i = 0; i != numLaneElts; ++i) {
+      uint32_t idx = imm % numLaneElts;
+      imm /= numLaneElts;
+      if (isShufP && i >= (numLaneElts / 2))
+        idx += numElts;
+      outIndices[l + i] = l + idx;
+    }
+  }
+
+  outIndices.resize(numElts);
+}
+
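+// Emits an AVX-512 mask 'kadd' operation: both scalar mask operands are
+// bitcast to vXi1 vectors, the named intrinsic is called on them, and the
+// result is bitcast back to the original scalar mask type.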
+static mlir::Value emitX86MaskAddLogic(CIRGenBuilderTy &builder,
+                                       mlir::Location loc,
+                                       const std::string &intrinsicName,
+                                       SmallVectorImpl<mlir::Value> &ops) {
+
+  auto intTy = cast<cir::IntType>(ops[0].getType());
+  unsigned numElts = intTy.getWidth();
+  mlir::Value lhsVec = getMaskVecValue(builder, loc, ops[0], numElts);
+  mlir::Value rhsVec = getMaskVecValue(builder, loc, ops[1], numElts);
+  mlir::Type vecTy = lhsVec.getType();
+  mlir::Value resVec = emitIntrinsicCallOp(builder, loc, intrinsicName, vecTy,
+                                           mlir::ValueRange{lhsVec, rhsVec});
+  return builder.createBitcast(resVec, ops[0].getType());
+}
+
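+// Emits an AVX-512 'kunpck' mask unpack: the low half of each scalar mask is
+// extracted, the halves are concatenated (RHS low half in the low elements,
+// LHS low half in the high elements), and the result is bitcast back to the
+// scalar mask type.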
+static mlir::Value emitX86MaskUnpack(CIRGenBuilderTy &builder,
+                                     mlir::Location loc,
+                                     const std::string &intrinsicName,
+                                     SmallVectorImpl<mlir::Value> &ops) {
+  unsigned numElems = cast<cir::IntType>(ops[0].getType()).getWidth();
+
+  // Convert both operands to mask vectors.
+  mlir::Value lhs = getMaskVecValue(builder, loc, ops[0], numElems);
+  mlir::Value rhs = getMaskVecValue(builder, loc, ops[1], numElems);
+
+  mlir::Type i32Ty = builder.getSInt32Ty();
+
+  // Create indices for extracting the first half of each vector.
+  SmallVector<mlir::Attribute, 32> halfIndices;
+  for (auto i : llvm::seq<unsigned>(0, numElems / 2))
+    halfIndices.push_back(cir::IntAttr::get(i32Ty, i));
+
+  // Extract first half of each vector. This gives better codegen than
+  // doing it in a single shuffle.
+  mlir::Value lhsHalf = builder.createVecShuffle(loc, lhs, lhs, halfIndices);
+  mlir::Value rhsHalf = builder.createVecShuffle(loc, rhs, rhs, halfIndices);
+
+  // Create indices for concatenating the vectors.
+  // NOTE: Operands are swapped to match the intrinsic definition.
+  // After the half extraction, both vectors have numElems/2 elements.
+  // In createVecShuffle(rhsHalf, lhsHalf, indices), indices [0..numElems/2-1]
+  // select from rhsHalf, and indices [numElems/2..numElems-1] select from
+  // lhsHalf.
+  SmallVector<mlir::Attribute, 64> concatIndices;
+  for (auto i : llvm::seq<unsigned>(0, numElems))
+    concatIndices.push_back(cir::IntAttr::get(i32Ty, i));
+
+  // Concat the vectors (RHS first, then LHS).
+  mlir::Value res =
+      builder.createVecShuffle(loc, rhsHalf, lhsHalf, concatIndices);
+  return builder.createBitcast(res, ops[0].getType());
+}
+
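+// Emits a bitwise logic operation (kand/kandn/kor/kxor/kxnor) on two scalar
+// mask values: the operands are converted to vXi1 vectors, the LHS is
+// optionally inverted, and the result is bitcast back to the scalar mask type.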
+static mlir::Value emitX86MaskLogic(CIRGenBuilderTy &builder,
+                                    mlir::Location loc,
+                                    cir::BinOpKind binOpKind,
+                                    SmallVectorImpl<mlir::Value> &ops,
+                                    bool invertLHS = false) {
+  unsigned numElts = cast<cir::IntType>(ops[0].getType()).getWidth();
+  mlir::Value lhs = getMaskVecValue(builder, loc, ops[0], numElts);
+  mlir::Value rhs = getMaskVecValue(builder, loc, ops[1], numElts);
+
+  if (invertLHS)
+    lhs = builder.createNot(lhs);
+  return builder.createBitcast(builder.createBinop(loc, lhs, binOpKind, rhs),
+                               ops[0].getType());
+}
+
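+// Emits a cir.vec.insert for the vec_set_* builtins: the constant index
+// operand is reduced modulo the number of vector elements and 'value' is
+// inserted at that position.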
+static mlir::Value emitVecInsert(CIRGenBuilderTy &builder, mlir::Location loc,
+                                 mlir::Value vec, mlir::Value value,
+                                 mlir::Value indexOp) {
+  unsigned numElts = cast<cir::VectorType>(vec.getType()).getSize();
+
+  uint64_t index =
+      indexOp.getDefiningOp<cir::ConstantOp>().getIntValue().getZExtValue();
+
+  index &= numElts - 1;
+
+  cir::ConstantOp indexVal = builder.getUInt64(index, loc);
+
+  return cir::VecInsertOp::create(builder, loc, vec, value, indexVal);
+}
+
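+// Emits an LLVM fshl/fshr funnel-shift intrinsic call. The rotate builtins
+// (prol*/pror*) are lowered through this helper by passing the same value for
+// both data operands.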
+static mlir::Value emitX86FunnelShift(CIRGenBuilderTy &builder,
+                                      mlir::Location location, mlir::Value &op0,
+                                      mlir::Value &op1, mlir::Value &amt,
+                                      bool isRight) {
+  mlir::Type op0Ty = op0.getType();
+
+  // The amount may be a scalar immediate, in which case we create a splat
+  // vector. Funnel shift amounts are taken modulo the element width, and all
+  // element widths are powers of two, so we only care about the lowest log2
+  // bits anyway.
+  if (amt.getType() != op0Ty) {
+    auto vecTy = mlir::cast<cir::VectorType>(op0Ty);
+    uint64_t numElems = vecTy.getSize();
+
+    auto amtTy = mlir::cast<cir::IntType>(amt.getType());
+    auto vecElemTy = mlir::cast<cir::IntType>(vecTy.getElementType());
+
+    // If signed, first cast to an unsigned type of the same width so that the
+    // subsequent cast to the (possibly wider) unsigned `vecElemTy` zero-extends.
+    if (amtTy.isSigned()) {
+      cir::IntType unsignedAmtTy = builder.getUIntNTy(amtTy.getWidth());
+      amt = builder.createIntCast(amt, unsignedAmtTy);
+    }
+    cir::IntType unsignedVecElemType = builder.getUIntNTy(vecElemTy.getWidth());
+    amt = builder.createIntCast(amt, unsignedVecElemType);
+    amt = cir::VecSplatOp::create(
+        builder, location, cir::VectorType::get(unsignedVecElemType, numElems),
+        amt);
+  }
+
+  const StringRef intrinsicName = isRight ? "fshr" : "fshl";
+  return emitIntrinsicCallOp(builder, location, intrinsicName, op0Ty,
+                             mlir::ValueRange{op0, op1, amt});
+}
+
 mlir::Value CIRGenFunction::emitX86BuiltinExpr(unsigned builtinID,
                                                const CallExpr *expr) {
   if (builtinID == Builtin::BI__builtin_cpu_is) {
@@ -187,9 +371,7 @@ mlir::Value CIRGenFunction::emitX86BuiltinExpr(unsigned builtinID,
   case X86::BI__builtin_ia32_vec_ext_v4di: {
     unsigned numElts = cast<cir::VectorType>(ops[0].getType()).getSize();
 
-    uint64_t index =
-        ops[1].getDefiningOp<cir::ConstantOp>().getIntValue().getZExtValue();
-
+    uint64_t index = getZExtIntValueFromConstOp(ops[1]);
     index &= numElts - 1;
 
     cir::ConstantOp indexVal =
@@ -208,11 +390,19 @@ mlir::Value CIRGenFunction::emitX86BuiltinExpr(unsigned builtinID,
   case X86::BI__builtin_ia32_vec_set_v32qi:
   case X86::BI__builtin_ia32_vec_set_v16hi:
   case X86::BI__builtin_ia32_vec_set_v8si:
-  case X86::BI__builtin_ia32_vec_set_v4di:
-    cgm.errorNYI(expr->getSourceRange(),
-                 std::string("unimplemented X86 builtin call: ") +
-                     getContext().BuiltinInfo.getName(builtinID));
-    return {};
+  case X86::BI__builtin_ia32_vec_set_v4di: {
+    return emitVecInsert(builder, getLoc(expr->getExprLoc()), ops[0], ops[1],
+                         ops[2]);
+  }
+  case X86::BI__builtin_ia32_kunpckhi:
+    return emitX86MaskUnpack(builder, getLoc(expr->getExprLoc()),
+                             "x86.avx512.kunpackb", ops);
+  case X86::BI__builtin_ia32_kunpcksi:
+    return emitX86MaskUnpack(builder, getLoc(expr->getExprLoc()),
+                             "x86.avx512.kunpackw", ops);
+  case X86::BI__builtin_ia32_kunpckdi:
+    return emitX86MaskUnpack(builder, getLoc(expr->getExprLoc()),
+                             "x86.avx512.kunpackd", ops);
   case X86::BI_mm_setcsr:
   case X86::BI__builtin_ia32_ldmxcsr: {
     mlir::Location loc = getLoc(expr->getExprLoc());
@@ -457,6 +647,10 @@ mlir::Value CIRGenFunction::emitX86BuiltinExpr(unsigned builtinID,
   case X86::BI__builtin_ia32_compressqi128_mask:
   case X86::BI__builtin_ia32_compressqi256_mask:
   case X86::BI__builtin_ia32_compressqi512_mask:
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented X86 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
   case X86::BI__builtin_ia32_gather3div2df:
   case X86::BI__builtin_ia32_gather3div2di:
   case X86::BI__builtin_ia32_gather3div4df:
@@ -480,7 +674,93 @@ mlir::Value CIRGenFunction::emitX86BuiltinExpr(unsigned builtinID,
   case X86::BI__builtin_ia32_gathersiv8di:
   case X86::BI__builtin_ia32_gathersiv16si:
   case X86::BI__builtin_ia32_gatherdiv8di:
-  case X86::BI__builtin_ia32_gatherdiv16si:
+  case X86::BI__builtin_ia32_gatherdiv16si: {
+    StringRef intrinsicName;
+    switch (builtinID) {
+    default:
+      llvm_unreachable("Unexpected builtin");
+    case X86::BI__builtin_ia32_gather3div2df:
+      intrinsicName = "x86.avx512.mask.gather3div2.df";
+      break;
+    case X86::BI__builtin_ia32_gather3div2di:
+      intrinsicName = "x86.avx512.mask.gather3div2.di";
+      break;
+    case X86::BI__builtin_ia32_gather3div4df:
+      intrinsicName = "x86.avx512.mask.gather3div4.df";
+      break;
+    case X86::BI__builtin_ia32_gather3div4di:
+      intrinsicName = "x86.avx512.mask.gather3div4.di";
+      break;
+    case X86::BI__builtin_ia32_gather3div4sf:
+      intrinsicName = "x86.avx512.mask.gather3div4.sf";
+      break;
+    case X86::BI__builtin_ia32_gather3div4si:
+      intrinsicName = "x86.avx512.mask.gather3div4.si";
+      break;
+    case X86::BI__builtin_ia32_gather3div8sf:
+      intrinsicName = "x86.avx512.mask.gather3div8.sf";
+      break;
+    case X86::BI__builtin_ia32_gather3div8si:
+      intrinsicName = "x86.avx512.mask.gather3div8.si";
+      break;
+    case X86::BI__builtin_ia32_gather3siv2df:
+      intrinsicName = "x86.avx512.mask.gather3siv2.df";
+      break;
+    case X86::BI__builtin_ia32_gather3siv2di:
+      intrinsicName = "x86.avx512.mask.gather3siv2.di";
+      break;
+    case X86::BI__builtin_ia32_gather3siv4df:
+      intrinsicName = "x86.avx512.mask.gather3siv4.df";
+      break;
+    case X86::BI__builtin_ia32_gather3siv4di:
+      intrinsicName = "x86.avx512.mask.gather3siv4.di";
+      break;
+    case X86::BI__builtin_ia32_gather3siv4sf:
+      intrinsicName = "x86.avx512.mask.gather3siv4.sf";
+      break;
+    case X86::BI__builtin_ia32_gather3siv4si:
+      intrinsicName = "x86.avx512.mask.gather3siv4.si";
+      break;
+    case X86::BI__builtin_ia32_gather3siv8sf:
+      intrinsicName = "x86.avx512.mask.gather3siv8.sf";
+      break;
+    case X86::BI__builtin_ia32_gather3siv8si:
+      intrinsicName = "x86.avx512.mask.gather3siv8.si";
+      break;
+    case X86::BI__builtin_ia32_gathersiv8df:
+      intrinsicName = "x86.avx512.mask.gather.dpd.512";
+      break;
+    case X86::BI__builtin_ia32_gathersiv16sf:
+      intrinsicName = "x86.avx512.mask.gather.dps.512";
+      break;
+    case X86::BI__builtin_ia32_gatherdiv8df:
+      intrinsicName = "x86.avx512.mask.gather.qpd.512";
+      break;
+    case X86::BI__builtin_ia32_gatherdiv16sf:
+      intrinsicName = "x86.avx512.mask.gather.qps.512";
+      break;
+    case X86::BI__builtin_ia32_gathersiv8di:
+      intrinsicName = "x86.avx512.mask.gather.dpq.512";
+      break;
+    case X86::BI__builtin_ia32_gathersiv16si:
+      intrinsicName = "x86.avx512.mask.gather.dpi.512";
+      break;
+    case X86::BI__builtin_ia32_gatherdiv8di:
+      intrinsicName = "x86.avx512.mask.gather.qpq.512";
+      break;
+    case X86::BI__builtin_ia32_gatherdiv16si:
+      intrinsicName = "x86.avx512.mask.gather.qpi.512";
+      break;
+    }
+
+    mlir::Location loc = getLoc(expr->getExprLoc());
+    unsigned minElts =
+        std::min(cast<cir::VectorType>(ops[0].getType()).getSize(),
+                 cast<cir::VectorType>(ops[2].getType()).getSize());
+    ops[3] = getMaskVecValue(builder, loc, ops[3], minElts);
+    return emitIntrinsicCallOp(builder, loc, intrinsicName,
+                               convertType(expr->getType()), ops);
+  }
   case X86::BI__builtin_ia32_scattersiv8df:
   case X86::BI__builtin_ia32_scattersiv16sf:
   case X86::BI__builtin_ia32_scatterdiv8df:
@@ -504,7 +784,94 @@ mlir::Value CIRGenFunction::emitX86BuiltinExpr(unsigned builtinID,
   case X86::BI__builtin_ia32_scattersiv4sf:
   case X86::BI__builtin_ia32_scattersiv4si:
   case X86::BI__builtin_ia32_scattersiv8sf:
-  case X86::BI__builtin_ia32_scattersiv8si:
+  case X86::BI__builtin_ia32_scattersiv8si: {
+    llvm::StringRef intrinsicName;
+    switch (builtinID) {
+    default:
+      llvm_unreachable("Unexpected builtin");
+    case X86::BI__builtin_ia32_scattersiv8df:
+      intrinsicName = "x86.avx512.mask.scatter.dpd.512";
+      break;
+    case X86::BI__builtin_ia32_scattersiv16sf:
+      intrinsicName = "x86.avx512.mask.scatter.dps.512";
+      break;
+    case X86::BI__builtin_ia32_scatterdiv8df:
+      intrinsicName = "x86.avx512.mask.scatter.qpd.512";
+      break;
+    case X86::BI__builtin_ia32_scatterdiv16sf:
+      intrinsicName = "x86.avx512.mask.scatter.qps.512";
+      break;
+    case X86::BI__builtin_ia32_scattersiv8di:
+      intrinsicName = "x86.avx512.mask.scatter.dpq.512";
+      break;
+    case X86::BI__builtin_ia32_scattersiv16si:
+      intrinsicName = "x86.avx512.mask.scatter.dpi.512";
+      break;
+    case X86::BI__builtin_ia32_scatterdiv8di:
+      intrinsicName = "x86.avx512.mask.scatter.qpq.512";
+      break;
+    case X86::BI__builtin_ia32_scatterdiv16si:
+      intrinsicName = "x86.avx512.mask.scatter.qpi.512";
+      break;
+    case X86::BI__builtin_ia32_scatterdiv2df:
+      intrinsicName = "x86.avx512.mask.scatterdiv2.df";
+      break;
+    case X86::BI__builtin_ia32_scatterdiv2di:
+      intrinsicName = "x86.avx512.mask.scatterdiv2.di";
+      break;
+    case X86::BI__builtin_ia32_scatterdiv4df:
+      intrinsicName = "x86.avx512.mask.scatterdiv4.df";
+      break;
+    case X86::BI__builtin_ia32_scatterdiv4di:
+      intrinsicName = "x86.avx512.mask.scatterdiv4.di";
+      break;
+    case X86::BI__builtin_ia32_scatterdiv4sf:
+      intrinsicName = "x86.avx512.mask.scatterdiv4.sf";
+      break;
+    case X86::BI__builtin_ia32_scatterdiv4si:
+      intrinsicName = "x86.avx512.mask.scatterdiv4.si";
+      break;
+    case X86::BI__builtin_ia32_scatterdiv8sf:
+      intrinsicName = "x86.avx512.mask.scatterdiv8.sf";
+      break;
+    case X86::BI__builtin_ia32_scatterdiv8si:
+      intrinsicName = "x86.avx512.mask.scatterdiv8.si";
+      break;
+    case X86::BI__builtin_ia32_scattersiv2df:
+      intrinsicName = "x86.avx512.mask.scattersiv2.df";
+      break;
+    case X86::BI__builtin_ia32_scattersiv2di:
+      intrinsicName = "x86.avx512.mask.scattersiv2.di";
+      break;
+    case X86::BI__builtin_ia32_scattersiv4df:
+      intrinsicName = "x86.avx512.mask.scattersiv4.df";
+      break;
+    case X86::BI__builtin_ia32_scattersiv4di:
+      intrinsicName = "x86.avx512.mask.scattersiv4.di";
+      break;
+    case X86::BI__builtin_ia32_scattersiv4sf:
+      intrinsicName = "x86.avx512.mask.scattersiv4.sf";
+      break;
+    case X86::BI__builtin_ia32_scattersiv4si:
+      intrinsicName = "x86.avx512.mask.scattersiv4.si";
+      break;
+    case X86::BI__builtin_ia32_scattersiv8sf:
+      intrinsicName = "x86.avx512.mask.scattersiv8.sf";
+      break;
+    case X86::BI__builtin_ia32_scattersiv8si:
+      intrinsicName = "x86.avx512.mask.scattersiv8.si";
+      break;
+    }
+
+    mlir::Location loc = getLoc(expr->getExprLoc());
+    unsigned minElts =
+        std::min(cast<cir::VectorType>(ops[2].getType()).getSize(),
+                 cast<cir::VectorType>(ops[3].getType()).getSize());
+    ops[1] = getMaskVecValue(builder, loc, ops[1], minElts);
+
+    return emitIntrinsicCallOp(builder, loc, intrinsicName,
+                               convertType(expr->getType()), ops);
+  }
   case X86::BI__builtin_ia32_vextractf128_pd256:
   case X86::BI__builtin_ia32_vextractf128_ps256:
   case X86::BI__builtin_ia32_vextractf128_si256:
@@ -547,12 +914,20 @@ mlir::Value CIRGenFunction::emitX86BuiltinExpr(unsigned builtinID,
   case X86::BI__builtin_ia32_pblendw256:
   case X86::BI__builtin_ia32_pblendd128:
   case X86::BI__builtin_ia32_pblendd256:
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented X86 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
   case X86::BI__builtin_ia32_pshuflw:
   case X86::BI__builtin_ia32_pshuflw256:
   case X86::BI__builtin_ia32_pshuflw512:
+    return emitPshufWord(builder, ops[0], ops[1], getLoc(expr->getExprLoc()),
+                         true);
   case X86::BI__builtin_ia32_pshufhw:
   case X86::BI__builtin_ia32_pshufhw256:
   case X86::BI__builtin_ia32_pshufhw512:
+    return emitPshufWord(builder, ops[0], ops[1], getLoc(expr->getExprLoc()),
+                         false);
   case X86::BI__builtin_ia32_pshufd:
   case X86::BI__builtin_ia32_pshufd256:
   case X86::BI__builtin_ia32_pshufd512:
@@ -561,13 +936,28 @@ mlir::Value CIRGenFunction::emitX86BuiltinExpr(unsigned builtinID,
   case X86::BI__builtin_ia32_vpermilpd256:
   case X86::BI__builtin_ia32_vpermilps256:
   case X86::BI__builtin_ia32_vpermilpd512:
-  case X86::BI__builtin_ia32_vpermilps512:
+  case X86::BI__builtin_ia32_vpermilps512: {
+    const uint32_t imm = getSExtIntValueFromConstOp(ops[1]);
+
+    llvm::SmallVector<int64_t, 16> mask(16);
+    computeFullLaneShuffleMask(*this, ops[0], imm, false, mask);
+
+    return builder.createVecShuffle(getLoc(expr->getExprLoc()), ops[0], mask);
+  }
   case X86::BI__builtin_ia32_shufpd:
   case X86::BI__builtin_ia32_shufpd256:
   case X86::BI__builtin_ia32_shufpd512:
   case X86::BI__builtin_ia32_shufps:
   case X86::BI__builtin_ia32_shufps256:
-  case X86::BI__builtin_ia32_shufps512:
+  case X86::BI__builtin_ia32_shufps512: {
+    const uint32_t imm = getZExtIntValueFromConstOp(ops[2]);
+
+    llvm::SmallVector<int64_t, 16> mask(16);
+    computeFullLaneShuffleMask(*this, ops[0], imm, true, mask);
+
+    return builder.createVecShuffle(getLoc(expr->getExprLoc()), ops[0], ops[1],
+                                    mask);
+  }
   case X86::BI__builtin_ia32_permdi256:
   case X86::BI__builtin_ia32_permdf256:
   case X86::BI__builtin_ia32_permdi512:
@@ -661,12 +1051,16 @@ mlir::Value CIRGenFunction::emitX86BuiltinExpr(unsigned builtinID,
   case X86::BI__builtin_ia32_prolq128:
   case X86::BI__builtin_ia32_prolq256:
   case X86::BI__builtin_ia32_prolq512:
+    return emitX86FunnelShift(builder, getLoc(expr->getExprLoc()), ops[0],
+                              ops[0], ops[1], false);
   case X86::BI__builtin_ia32_prord128:
   case X86::BI__builtin_ia32_prord256:
   case X86::BI__builtin_ia32_prord512:
   case X86::BI__builtin_ia32_prorq128:
   case X86::BI__builtin_ia32_prorq256:
   case X86::BI__builtin_ia32_prorq512:
+    return emitX86FunnelShift(builder, getLoc(expr->getExprLoc()), ops[0],
+                              ops[0], ops[1], true);
   case X86::BI__builtin_ia32_selectb_128:
   case X86::BI__builtin_ia32_selectb_256:
   case X86::BI__builtin_ia32_selectb_512:
@@ -743,41 +1137,75 @@ mlir::Value CIRGenFunction::emitX86BuiltinExpr(unsigned builtinID,
   case X86::BI__builtin_ia32_ktestzsi:
   case X86::BI__builtin_ia32_ktestcdi:
   case X86::BI__builtin_ia32_ktestzdi:
+    cgm.errorNYI(expr->getSourceRange(),
+                 std::string("unimplemented X86 builtin call: ") +
+                     getContext().BuiltinInfo.getName(builtinID));
+    return {};
   case X86::BI__builtin_ia32_kaddqi:
+    return emitX86MaskAddLogic(builder, getLoc(expr->getExprLoc()),
+                               "x86.avx512.kadd.b", ops);
   case X86::BI__builtin_ia32_kaddhi:
+    return emitX86MaskAddLogic(builder, getLoc(expr->getExprLoc()),
+                               "x86.avx512.kadd.w", ops);
   case X86::BI__builtin_ia32_kaddsi:
+    return emitX86MaskAddLogic(builder, getLoc(expr->getExprLoc()),
+                               "x86.avx512.kadd.d", ops);
   case X86::BI__builtin_ia32_kadddi:
+    return emitX86MaskAddLogic(builder, getLoc(expr->getExprLoc()),
+                               "x86.avx512.kadd.q", ops);
   case X86::BI__builtin_ia32_kandqi:
   case X86::BI__builtin_ia32_kandhi:
   case X86::BI__builtin_ia32_kandsi:
   case X86::BI__builtin_ia32_kanddi:
+    return emitX86MaskLogic(builder, getLoc(expr->getExprLoc()),
+                            cir::BinOpKind::And, ops);
   case X86::BI__builtin_ia32_kandnqi:
   case X86::BI__builtin_ia32_kandnhi:
   case X86::BI__builtin_ia32_kandnsi:
   case X86::BI__builtin_ia32_kandndi:
+    return emitX86MaskLogic(builder, getLoc(expr->getExprLoc()),
+                            cir::BinOpKind::And, ops, true);
   case X86::BI__builtin_ia32_korqi:
   case X86::BI__builtin_ia32_korhi:
   case X86::BI__builtin_ia32_korsi:
   case X86::BI__builtin_ia32_kordi:
+    return emitX86MaskLogic(builder, getLoc(expr->getExprLoc()),
+                            cir::BinOpKind::Or, ops);
   case X86::BI__builtin_ia32_kxnorqi:
   case X86::BI__builtin_ia32_kxnorhi:
   case X86::BI__builtin_ia32_kxnorsi:
   case X86::BI__builtin_ia32_kxnordi:
+    return emitX86MaskLogic(builder, getLoc(expr->getExprLoc()),
+                            cir::BinOpKind::Xor, ops, true);
   case X86::BI__builtin_ia32_kxorqi:
   case X86::BI__builtin_ia32_kxorhi:
   case X86::BI__builtin_ia32_kxorsi:
   case X86::BI__builtin_ia32_kxordi:
+    return emitX86MaskLogic(builder, getLoc(expr->getExprLoc()),
+                            cir::BinOpKind::Xor, ops);
   case X86::BI__builtin_ia32_knotqi:
   case X86::BI__builtin_ia32_knothi:
   case X86::BI__builtin_ia32_knotsi:
-  case X86::BI__builtin_ia32_knotdi:
+  case X86::BI__builtin_ia32_knotdi: {
+    cir::IntType intTy = cast<cir::IntType>(ops[0].getType());
+    unsigned numElts = intTy.getWidth();
+    mlir::Value resVec =
+        getMaskVecValue(builder, getLoc(expr->getExprLoc()), ops[0], numElts);
+    return builder.createBitcast(builder.createNot(resVec), ops[0].getType());
+  }
   case X86::BI__builtin_ia32_kmovb:
   case X86::BI__builtin_ia32_kmovw:
   case X86::BI__builtin_ia32_kmovd:
-  case X86::BI__builtin_ia32_kmovq:
-  case X86::BI__builtin_ia32_kunpckdi:
-  case X86::BI__builtin_ia32_kunpcksi:
-  case X86::BI__builtin_ia32_kunpckhi:
+  case X86::BI__builtin_ia32_kmovq: {
+    // Bitcast to vXi1 type and then back to integer. This gets the mask
+    // register type into the IR, but might be optimized out depending on
+    // what's around it.
+    cir::IntType intTy = cast<cir::IntType>(ops[0].getType());
+    unsigned numElts = intTy.getWidth();
+    mlir::Value resVec =
+        getMaskVecValue(builder, getLoc(expr->getExprLoc()), ops[0], numElts);
+    return builder.createBitcast(resVec, ops[0].getType());
+  }
   case X86::BI__builtin_ia32_sqrtsh_round_mask:
   case X86::BI__builtin_ia32_sqrtsd_round_mask:
   case X86::BI__builtin_ia32_sqrtss_round_mask:
diff --git a/clang/lib/CIR/CodeGen/CIRGenClass.cpp b/clang/lib/CIR/CodeGen/CIRGenClass.cpp
index c98d9bb0724f6..ca9fe939139cd 100644
--- a/clang/lib/CIR/CodeGen/CIRGenClass.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenClass.cpp
@@ -126,8 +126,7 @@ static void emitMemberInitializer(CIRGenFunction &cgf,
                             lhs.isVolatileQualified());
       // Ensure that we destroy the objects if an exception is thrown later in
       // the constructor.
-      QualType::DestructionKind dtorKind = fieldType.isDestructedType();
-      assert(!cgf.needsEHCleanup(dtorKind) &&
+      assert(!cgf.needsEHCleanup(fieldType.isDestructedType()) &&
              "Arrays of non-record types shouldn't need EH cleanup");
       return;
     }
diff --git a/clang/lib/CIR/CodeGen/CIRGenCoroutine.cpp b/clang/lib/CIR/CodeGen/CIRGenCoroutine.cpp
index f7df811a67c26..b4f185d0b2e3e 100644
--- a/clang/lib/CIR/CodeGen/CIRGenCoroutine.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenCoroutine.cpp
@@ -32,6 +32,9 @@ struct clang::CIRGen::CGCoroData {
 
   // Stores the result of __builtin_coro_begin call.
   mlir::Value coroBegin = nullptr;
+
+  // The promise type's 'unhandled_exception' handler, if it defines one.
+  Stmt *exceptionHandler = nullptr;
 };
 
 // Defining these here allows to keep CGCoroData private to this file.
@@ -272,6 +275,17 @@ CIRGenFunction::emitCoroutineBody(const CoroutineBodyStmt &s) {
   }
   return mlir::success();
 }
+
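+// Returns false only when the callee of the member call is declared noexcept
+// and the call is known not to throw; conservatively returns true otherwise.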
+static bool memberCallExpressionCanThrow(const Expr *e) {
+  if (const auto *ce = dyn_cast<CXXMemberCallExpr>(e))
+    if (const auto *proto =
+            ce->getMethodDecl()->getType()->getAs<FunctionProtoType>())
+      if (isNoexceptExceptionSpec(proto->getExceptionSpecType()) &&
+          proto->canThrow() == CT_Cannot)
+        return false;
+  return true;
+}
+
 // Given a suspend expression which roughly looks like:
 //
 //   auto && x = CommonExpr();
@@ -333,6 +347,31 @@ emitSuspendExpression(CIRGenFunction &cgf, CGCoroData &coro,
       },
       /*resumeBuilder=*/
       [&](mlir::OpBuilder &b, mlir::Location loc) {
+        // Exception handling requires additional IR. If the 'await_resume'
+        // function is marked as 'noexcept', we avoid generating this additional
+        // IR.
+        CXXTryStmt *tryStmt = nullptr;
+        if (coro.exceptionHandler && kind == cir::AwaitKind::Init &&
+            memberCallExpressionCanThrow(s.getResumeExpr()))
+          cgf.cgm.errorNYI("Coro resume Exception");
+
+        // FIXME(cir): the alloca for the resume expr should be placed in the
+        // enclosing cir.scope instead.
+        if (forLValue) {
+          assert(!cir::MissingFeatures::coroCoYield());
+        } else {
+          awaitRes.rv =
+              cgf.emitAnyExpr(s.getResumeExpr(), aggSlot, ignoreResult);
+          if (!awaitRes.rv.isIgnored())
+            // Create the alloca in the block before the scope wrapping
+            // cir.await.
+            assert(!cir::MissingFeatures::coroCoReturn());
+        }
+
+        if (tryStmt)
+          cgf.cgm.errorNYI("Coro tryStmt");
+
+        // Returns control back to parent.
         cir::YieldOp::create(builder, loc);
       });
 
diff --git a/clang/lib/CIR/CodeGen/CIRGenDeclOpenACC.cpp b/clang/lib/CIR/CodeGen/CIRGenDeclOpenACC.cpp
index d52986db49ea6..a5322ac4e1930 100644
--- a/clang/lib/CIR/CodeGen/CIRGenDeclOpenACC.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenDeclOpenACC.cpp
@@ -287,9 +287,92 @@ void CIRGenModule::emitGlobalOpenACCDeclareDecl(const OpenACCDeclareDecl *d) {
 }
 
 void CIRGenFunction::emitOpenACCRoutine(const OpenACCRoutineDecl &d) {
-  getCIRGenModule().errorNYI(d.getSourceRange(), "OpenACC Routine Construct");
+  // Do nothing here. The OpenACCRoutineDeclAttr handles the implicit name
+  // cases, and the end-of-TU handling manages the named cases. This is
+  // necessary because these references aren't necessarily emitted themselves,
+  // but can be named anywhere.
 }
 
 void CIRGenModule::emitGlobalOpenACCRoutineDecl(const OpenACCRoutineDecl *d) {
-  errorNYI(d->getSourceRange(), "OpenACC Global Routine Construct");
+  // Do nothing here. The OpenACCRoutineDeclAttr handles the implicit name
+  // cases, and the end-of-TU handling manages the named cases. This is
+  // necessary because these references aren't necessarily emitted themselves,
+  // but can be named anywhere.
+}
+
+namespace {
+class OpenACCRoutineClauseEmitter final
+    : public OpenACCClauseVisitor<OpenACCRoutineClauseEmitter> {
+  CIRGen::CIRGenBuilderTy &builder;
+  mlir::acc::RoutineOp routineOp;
+  llvm::SmallVector<mlir::acc::DeviceType> lastDeviceTypeValues;
+
+public:
+  OpenACCRoutineClauseEmitter(CIRGen::CIRGenBuilderTy &builder,
+                              mlir::acc::RoutineOp routineOp)
+      : builder(builder), routineOp(routineOp) {}
+
+  void emitClauses(ArrayRef<const OpenACCClause *> clauses) {
+    this->VisitClauseList(clauses);
+  }
+
+  void VisitClause(const OpenACCClause &clause) {
+    llvm_unreachable("Invalid OpenACC clause on routine");
+  }
+
+  void VisitSeqClause(const OpenACCSeqClause &clause) {
+    routineOp.addSeq(builder.getContext(), lastDeviceTypeValues);
+  }
+  void VisitWorkerClause(const OpenACCWorkerClause &clause) {
+    routineOp.addWorker(builder.getContext(), lastDeviceTypeValues);
+  }
+  void VisitVectorClause(const OpenACCVectorClause &clause) {
+    routineOp.addVector(builder.getContext(), lastDeviceTypeValues);
+  }
+
+  void VisitNoHostClause(const OpenACCNoHostClause &clause) {
+    routineOp.setNohost(/*attrValue=*/true);
+  }
+};
+} // namespace
+
+void CIRGenModule::emitOpenACCRoutineDecl(
+    const clang::FunctionDecl *funcDecl, cir::FuncOp func,
+    SourceLocation pragmaLoc, ArrayRef<const OpenACCClause *> clauses) {
+  mlir::OpBuilder::InsertionGuard guardCase(builder);
+  // These need to appear at the global module.
+  builder.setInsertionPointToEnd(&getModule().getBodyRegion().front());
+
+  mlir::Location routineLoc = getLoc(pragmaLoc);
+
+  std::stringstream routineNameSS;
+  // This follows the same naming format as Flang.
+  routineNameSS << "acc_routine_" << routineCounter++;
+  std::string routineName = routineNameSS.str();
+
+  // There isn't a good constructor for RoutineOp that just takes a location +
+  // name + function, so we use one that creates an otherwise empty RoutineOp
+  // and count on the clause visitor/emitter to fill the rest in.
+  auto routineOp = mlir::acc::RoutineOp::create(
+      builder, routineLoc, routineName,
+      mlir::SymbolRefAttr::get(builder.getContext(), func.getName()),
+      /*implicit=*/false);
+
+  // We have to add a pointer going the other direction via an acc.routine_info,
+  // from the func to the routine.
+  llvm::SmallVector<mlir::SymbolRefAttr> funcRoutines;
+  if (auto routineInfo =
+          func.getOperation()->getAttrOfType<mlir::acc::RoutineInfoAttr>(
+              mlir::acc::getRoutineInfoAttrName()))
+    funcRoutines.append(routineInfo.getAccRoutines().begin(),
+                        routineInfo.getAccRoutines().end());
+
+  funcRoutines.push_back(
+      mlir::SymbolRefAttr::get(builder.getContext(), routineName));
+  func.getOperation()->setAttr(
+      mlir::acc::getRoutineInfoAttrName(),
+      mlir::acc::RoutineInfoAttr::get(func.getContext(), funcRoutines));
+
+  OpenACCRoutineClauseEmitter emitter{builder, routineOp};
+  emitter.emitClauses(clauses);
 }
diff --git a/clang/lib/CIR/CodeGen/CIRGenExpr.cpp b/clang/lib/CIR/CodeGen/CIRGenExpr.cpp
index 4065124f8f568..5d509e37f4621 100644
--- a/clang/lib/CIR/CodeGen/CIRGenExpr.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenExpr.cpp
@@ -188,6 +188,7 @@ Address CIRGenFunction::emitPointerWithAlignment(const Expr *expr,
     case CK_HLSLArrayRValue:
     case CK_HLSLElementwiseCast:
     case CK_HLSLVectorTruncation:
+    case CK_HLSLMatrixTruncation:
     case CK_IntToOCLSampler:
     case CK_IntegralCast:
     case CK_IntegralComplexCast:
@@ -1323,6 +1324,7 @@ LValue CIRGenFunction::emitCastLValue(const CastExpr *e) {
   case CK_IntegralToFixedPoint:
   case CK_MatrixCast:
   case CK_HLSLVectorTruncation:
+  case CK_HLSLMatrixTruncation:
   case CK_HLSLArrayRValue:
   case CK_HLSLElementwiseCast:
   case CK_HLSLAggregateSplatCast:
@@ -1870,8 +1872,7 @@ CIRGenCallee CIRGenFunction::emitDirectCallee(const GlobalDecl &gd) {
         clone.setLinkageAttr(cir::GlobalLinkageKindAttr::get(
             &cgm.getMLIRContext(), cir::GlobalLinkageKind::InternalLinkage));
         clone.setSymVisibility("private");
-        clone.setInlineKindAttr(cir::InlineAttr::get(
-            &cgm.getMLIRContext(), cir::InlineKind::AlwaysInline));
+        clone.setInlineKind(cir::InlineKind::AlwaysInline);
       }
       return CIRGenCallee::forDirect(clone, gd);
     }
diff --git a/clang/lib/CIR/CodeGen/CIRGenExprComplex.cpp b/clang/lib/CIR/CodeGen/CIRGenExprComplex.cpp
index 9ed920085c8c6..fe06f8cc2c430 100644
--- a/clang/lib/CIR/CodeGen/CIRGenExprComplex.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenExprComplex.cpp
@@ -534,6 +534,7 @@ mlir::Value ComplexExprEmitter::emitCast(CastKind ck, Expr *op,
   case CK_IntegralToFixedPoint:
   case CK_MatrixCast:
   case CK_HLSLVectorTruncation:
+  case CK_HLSLMatrixTruncation:
   case CK_HLSLArrayRValue:
   case CK_HLSLElementwiseCast:
   case CK_HLSLAggregateSplatCast:
diff --git a/clang/lib/CIR/CodeGen/CIRGenExprConstant.cpp b/clang/lib/CIR/CodeGen/CIRGenExprConstant.cpp
index 66f8ef9b05913..329fd08bc8914 100644
--- a/clang/lib/CIR/CodeGen/CIRGenExprConstant.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenExprConstant.cpp
@@ -1012,6 +1012,7 @@ class ConstExprEmitter
     case CK_MatrixCast:
     case CK_HLSLArrayRValue:
     case CK_HLSLVectorTruncation:
+    case CK_HLSLMatrixTruncation:
     case CK_HLSLElementwiseCast:
     case CK_HLSLAggregateSplatCast:
       return {};
diff --git a/clang/lib/CIR/CodeGen/CIRGenExprScalar.cpp b/clang/lib/CIR/CodeGen/CIRGenExprScalar.cpp
index a8c2061ddbd6c..3e9d3db768bea 100644
--- a/clang/lib/CIR/CodeGen/CIRGenExprScalar.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenExprScalar.cpp
@@ -2344,25 +2344,30 @@ mlir::Value ScalarExprEmitter::VisitUnaryExprOrTypeTraitExpr(
         } else {
           // C99 6.5.3.4p2: If the argument is an expression of type
           // VLA, it is evaluated.
-          cgf.getCIRGenModule().errorNYI(
-              e->getSourceRange(),
-              "sizeof operator for VariableArrayType & evaluateExtent "
-              "ignoredExpr",
-              e->getStmtClassName());
-          return {};
+          cgf.emitIgnoredExpr(e->getArgumentExpr());
         }
 
         // For _Countof, we just want to return the size of a single dimension.
         if (kind == UETT_CountOf)
           return cgf.getVLAElements1D(vat).numElts;
 
-        cgf.getCIRGenModule().errorNYI(
-            e->getSourceRange(),
-            "sizeof operator for VariableArrayType & evaluateExtent",
-            e->getStmtClassName());
-        return builder.getConstant(
-            loc, cir::IntAttr::get(cgf.cgm.uInt64Ty,
-                                   -llvm::APSInt(llvm::APInt(64, 1), true)));
+        // For sizeof and __datasizeof, we need to scale the number of elements
+        // by the size of the array element type.
+        CIRGenFunction::VlaSizePair vlaSize = cgf.getVLASize(vat);
+        mlir::Value numElts = vlaSize.numElts;
+
+        // Scale the number of non-VLA elements by the non-VLA element size.
+        CharUnits eltSize = cgf.getContext().getTypeSizeInChars(vlaSize.type);
+        if (!eltSize.isOne()) {
+          mlir::Location loc = cgf.getLoc(e->getSourceRange());
+          mlir::Value eltSizeValue =
+              builder.getConstAPInt(numElts.getLoc(), numElts.getType(),
+                                    cgf.cgm.getSize(eltSize).getValue());
+          return builder.createMul(loc, eltSizeValue, numElts,
+                                   cir::OverflowBehavior::NoUnsignedWrap);
+        }
+
+        return numElts;
       }
     }
   } else if (e->getKind() == UETT_OpenMPRequiredSimdAlign) {
diff --git a/clang/lib/CIR/CodeGen/CIRGenFunction.h b/clang/lib/CIR/CodeGen/CIRGenFunction.h
index b6926bb88ac85..91b5ffa8b9ff9 100644
--- a/clang/lib/CIR/CodeGen/CIRGenFunction.h
+++ b/clang/lib/CIR/CodeGen/CIRGenFunction.h
@@ -203,6 +203,22 @@ class CIRGenFunction : public CIRGenTypeCache {
     return convertType(getContext().getTypeDeclType(t));
   }
 
+  /// Get the sign-extended integer value from a mlir::Value that is defined by
+  /// a cir::ConstantOp.
+  static int64_t getSExtIntValueFromConstOp(mlir::Value val) {
+    auto constOp = val.getDefiningOp<cir::ConstantOp>();
+    assert(constOp && "getSExtIntValueFromConstOp called on non-ConstantOp");
+    return constOp.getIntValue().getSExtValue();
+  }
+
+  /// Get the zero-extended integer value from a mlir::Value that is defined by
+  /// a cir::ConstantOp.
+  static int64_t getZExtIntValueFromConstOp(mlir::Value val) {
+    auto constOp = val.getDefiningOp<cir::ConstantOp>();
+    assert(constOp &&
+           "getZExtIntValueFromConstOp called on non-ConstantOp");
+    return constOp.getIntValue().getZExtValue();
+  }
+
   ///  Return the cir::TypeEvaluationKind of QualType \c type.
   static cir::TypeEvaluationKind getEvaluationKind(clang::QualType type);
 
@@ -1220,6 +1236,14 @@ class CIRGenFunction : public CIRGenTypeCache {
   /// CIR emit functions
   /// ----------------------
 public:
+  mlir::Value emitAArch64BuiltinExpr(unsigned builtinID, const CallExpr *expr,
+                                     ReturnValueSlot returnValue,
+                                     llvm::Triple::ArchType arch);
+  mlir::Value emitAArch64SMEBuiltinExpr(unsigned builtinID,
+                                        const CallExpr *expr);
+  mlir::Value emitAArch64SVEBuiltinExpr(unsigned builtinID,
+                                        const CallExpr *expr);
+
   mlir::Value emitAlignmentAssumption(mlir::Value ptrValue, QualType ty,
                                       SourceLocation loc,
                                       SourceLocation assumptionLoc,
@@ -1816,7 +1840,7 @@ class CIRGenFunction : public CIRGenTypeCache {
 
   mlir::LogicalResult emitWhileStmt(const clang::WhileStmt &s);
 
-  mlir::Value emitX86BuiltinExpr(unsigned builtinID, const CallExpr *e);
+  mlir::Value emitX86BuiltinExpr(unsigned builtinID, const CallExpr *expr);
 
   /// Given an assignment `*lhs = rhs`, emit a test that checks if \p rhs is
   /// nonnull, if 1\p LHS is marked _Nonnull.
diff --git a/clang/lib/CIR/CodeGen/CIRGenModule.cpp b/clang/lib/CIR/CodeGen/CIRGenModule.cpp
index 809c24f8aa670..1d8e4a3b444ee 100644
--- a/clang/lib/CIR/CodeGen/CIRGenModule.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenModule.cpp
@@ -1975,7 +1975,6 @@ void CIRGenModule::setCIRFunctionAttributesForDefinition(
       existingInlineKind && *existingInlineKind == cir::InlineKind::NoInline;
   bool isAlwaysInline = existingInlineKind &&
                         *existingInlineKind == cir::InlineKind::AlwaysInline;
-
   if (!decl) {
     assert(!cir::MissingFeatures::hlsl());
 
@@ -1984,8 +1983,7 @@ void CIRGenModule::setCIRFunctionAttributesForDefinition(
       // If inlining is disabled and we don't have a declaration to control
       // inlining, mark the function as 'noinline' unless it is explicitly
       // marked as 'alwaysinline'.
-      f.setInlineKindAttr(
-          cir::InlineAttr::get(&getMLIRContext(), cir::InlineKind::NoInline));
+      f.setInlineKind(cir::InlineKind::NoInline);
     }
 
     return;
@@ -2002,19 +2000,16 @@ void CIRGenModule::setCIRFunctionAttributesForDefinition(
   // Handle inline attributes
   if (decl->hasAttr<NoInlineAttr>() && !isAlwaysInline) {
     // Add noinline if the function isn't always_inline.
-    f.setInlineKindAttr(
-        cir::InlineAttr::get(&getMLIRContext(), cir::InlineKind::NoInline));
+    f.setInlineKind(cir::InlineKind::NoInline);
   } else if (decl->hasAttr<AlwaysInlineAttr>() && !isNoInline) {
     // Don't override AlwaysInline with NoInline, or vice versa, since we can't
     // specify both in IR.
-    f.setInlineKindAttr(
-        cir::InlineAttr::get(&getMLIRContext(), cir::InlineKind::AlwaysInline));
+    f.setInlineKind(cir::InlineKind::AlwaysInline);
   } else if (codeGenOpts.getInlining() == CodeGenOptions::OnlyAlwaysInlining) {
     // If inlining is disabled, force everything that isn't always_inline
     // to carry an explicit noinline attribute.
     if (!isAlwaysInline) {
-      f.setInlineKindAttr(
-          cir::InlineAttr::get(&getMLIRContext(), cir::InlineKind::NoInline));
+      f.setInlineKind(cir::InlineKind::NoInline);
     }
   } else {
     // Otherwise, propagate the inline hint attribute and potentially use its
@@ -2036,13 +2031,11 @@ void CIRGenModule::setCIRFunctionAttributesForDefinition(
         return any_of(pattern->redecls(), checkRedeclForInline);
       };
       if (checkForInline(fd)) {
-        f.setInlineKindAttr(cir::InlineAttr::get(&getMLIRContext(),
-                                                 cir::InlineKind::InlineHint));
+        f.setInlineKind(cir::InlineKind::InlineHint);
       } else if (codeGenOpts.getInlining() ==
                      CodeGenOptions::OnlyHintInlining &&
                  !fd->isInlined() && !isAlwaysInline) {
-        f.setInlineKindAttr(
-            cir::InlineAttr::get(&getMLIRContext(), cir::InlineKind::NoInline));
+        f.setInlineKind(cir::InlineKind::NoInline);
       }
     }
   }
@@ -2234,6 +2227,15 @@ CIRGenModule::createCIRFunction(mlir::Location loc, StringRef name,
 
     if (!cgf)
       theModule.push_back(func);
+
+    if (this->getLangOpts().OpenACC) {
+      // We only have to handle this attribute, since OpenACCAnnotAttrs are
+      // handled via the end-of-TU work.
+      for (const auto *attr :
+           funcDecl->specific_attrs<OpenACCRoutineDeclAttr>())
+        emitOpenACCRoutineDecl(funcDecl, func, attr->getLocation(),
+                               attr->Clauses);
+    }
   }
   return func;
 }
diff --git a/clang/lib/CIR/CodeGen/CIRGenModule.h b/clang/lib/CIR/CodeGen/CIRGenModule.h
index 6600d086f8f61..d7aee8ebf4d7a 100644
--- a/clang/lib/CIR/CodeGen/CIRGenModule.h
+++ b/clang/lib/CIR/CodeGen/CIRGenModule.h
@@ -461,6 +461,12 @@ class CIRGenModule : public CIRGenTypeCache {
                                             OpenACCModifierKind modifiers,
                                             bool structured, bool implicit,
                                             bool requiresDtor);
+  // Each of the acc.routine operations must have a unique name, so we just use
+  // an integer counter.  This is how Flang does it, so it seems reasonable.
+  unsigned routineCounter = 0;
+  void emitOpenACCRoutineDecl(const clang::FunctionDecl *funcDecl,
+                              cir::FuncOp func, SourceLocation pragmaLoc,
+                              ArrayRef<const OpenACCClause *> clauses);
 
   // C++ related functions.
   void emitDeclContext(const DeclContext *dc);
diff --git a/clang/lib/CIR/CodeGen/CIRGenerator.cpp b/clang/lib/CIR/CodeGen/CIRGenerator.cpp
index aa4d9eba35c04..0208eeea7146a 100644
--- a/clang/lib/CIR/CodeGen/CIRGenerator.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenerator.cpp
@@ -166,6 +166,18 @@ void CIRGenerator::HandleCXXStaticMemberVarInstantiation(VarDecl *D) {
   cgm->handleCXXStaticMemberVarInstantiation(D);
 }
 
+void CIRGenerator::HandleOpenACCRoutineReference(const FunctionDecl *FD,
+                                                 const OpenACCRoutineDecl *RD) {
+  llvm::StringRef mangledName = cgm->getMangledName(FD);
+  cir::FuncOp entry =
+      mlir::dyn_cast_if_present<cir::FuncOp>(cgm->getGlobalValue(mangledName));
+
+  // If the function hasn't been emitted yet, don't force it to be.
+  if (!entry)
+    return;
+  cgm->emitOpenACCRoutineDecl(FD, entry, RD->getBeginLoc(), RD->clauses());
+}
+
 void CIRGenerator::CompleteTentativeDefinition(VarDecl *d) {
   if (diags.hasErrorOccurred())
     return;
diff --git a/clang/lib/CIR/CodeGen/CMakeLists.txt b/clang/lib/CIR/CodeGen/CMakeLists.txt
index d3e2290ceea0b..d6cd15039a9bc 100644
--- a/clang/lib/CIR/CodeGen/CMakeLists.txt
+++ b/clang/lib/CIR/CodeGen/CMakeLists.txt
@@ -12,6 +12,7 @@ add_clang_library(clangCIR
   CIRGenAtomic.cpp
   CIRGenBuilder.cpp
   CIRGenBuiltin.cpp
+  CIRGenBuiltinAArch64.cpp
   CIRGenBuiltinX86.cpp
   CIRGenCall.cpp
   CIRGenClass.cpp
diff --git a/clang/lib/CIR/Dialect/IR/CIRAttrs.cpp b/clang/lib/CIR/Dialect/IR/CIRAttrs.cpp
index 64ac97025e7c7..ee296f171e0d9 100644
--- a/clang/lib/CIR/Dialect/IR/CIRAttrs.cpp
+++ b/clang/lib/CIR/Dialect/IR/CIRAttrs.cpp
@@ -68,24 +68,6 @@ using namespace cir;
 // General CIR parsing / printing
 //===----------------------------------------------------------------------===//
 
-Attribute CIRDialect::parseAttribute(DialectAsmParser &parser,
-                                     Type type) const {
-  llvm::SMLoc typeLoc = parser.getCurrentLocation();
-  llvm::StringRef mnemonic;
-  Attribute genAttr;
-  OptionalParseResult parseResult =
-      generatedAttributeParser(parser, &mnemonic, type, genAttr);
-  if (parseResult.has_value())
-    return genAttr;
-  parser.emitError(typeLoc, "unknown attribute in CIR dialect");
-  return Attribute();
-}
-
-void CIRDialect::printAttribute(Attribute attr, DialectAsmPrinter &os) const {
-  if (failed(generatedAttributePrinter(attr, os)))
-    llvm_unreachable("unexpected CIR type kind");
-}
-
 static void printRecordMembers(mlir::AsmPrinter &printer,
                                mlir::ArrayAttr members) {
   printer << '{';
diff --git a/clang/lib/CIR/Dialect/IR/CIRDialect.cpp b/clang/lib/CIR/Dialect/IR/CIRDialect.cpp
index d505ca141d383..396d97ddd794e 100644
--- a/clang/lib/CIR/Dialect/IR/CIRDialect.cpp
+++ b/clang/lib/CIR/Dialect/IR/CIRDialect.cpp
@@ -220,6 +220,41 @@ void parseVisibilityAttr(OpAsmParser &parser, cir::VisibilityAttr &visibility) {
   visibility = cir::VisibilityAttr::get(parser.getContext(), visibilityKind);
 }
 
+//===----------------------------------------------------------------------===//
+// InlineKindAttr (FIXME: remove once FuncOp uses assembly format)
+//===----------------------------------------------------------------------===//
+
+ParseResult parseInlineKindAttr(OpAsmParser &parser,
+                                cir::InlineKindAttr &inlineKindAttr) {
+  // Static list of possible inline kind keywords
+  static constexpr llvm::StringRef keywords[] = {"no_inline", "always_inline",
+                                                 "inline_hint"};
+
+  // Parse the inline kind keyword (optional)
+  llvm::StringRef keyword;
+  if (parser.parseOptionalKeyword(&keyword, keywords).failed()) {
+    // Not an inline kind keyword, leave inlineKindAttr empty
+    return success();
+  }
+
+  // Parse the enum value from the keyword
+  auto inlineKindResult = ::cir::symbolizeEnum<::cir::InlineKind>(keyword);
+  if (!inlineKindResult) {
+    return parser.emitError(parser.getCurrentLocation(), "expected one of [")
+           << llvm::join(llvm::ArrayRef(keywords), ", ")
+           << "] for inlineKind, got: " << keyword;
+  }
+
+  inlineKindAttr =
+      ::cir::InlineKindAttr::get(parser.getContext(), *inlineKindResult);
+  return success();
+}
+
+void printInlineKindAttr(OpAsmPrinter &p, cir::InlineKindAttr inlineKindAttr) {
+  if (inlineKindAttr) {
+    p << " " << stringifyInlineKind(inlineKindAttr.getValue());
+  }
+}
 //===----------------------------------------------------------------------===//
 // CIR Custom Parsers/Printers
 //===----------------------------------------------------------------------===//
@@ -1753,6 +1788,7 @@ ParseResult cir::FuncOp::parse(OpAsmParser &parser, OperationState &state) {
 
   mlir::StringAttr builtinNameAttr = getBuiltinAttrName(state.name);
   mlir::StringAttr coroutineNameAttr = getCoroutineAttrName(state.name);
+  mlir::StringAttr inlineKindNameAttr = getInlineKindAttrName(state.name);
   mlir::StringAttr lambdaNameAttr = getLambdaAttrName(state.name);
   mlir::StringAttr noProtoNameAttr = getNoProtoAttrName(state.name);
   mlir::StringAttr visNameAttr = getSymVisibilityAttrName(state.name);
@@ -1765,6 +1801,14 @@ ParseResult cir::FuncOp::parse(OpAsmParser &parser, OperationState &state) {
   if (::mlir::succeeded(
           parser.parseOptionalKeyword(coroutineNameAttr.strref())))
     state.addAttribute(coroutineNameAttr, parser.getBuilder().getUnitAttr());
+
+  // Parse optional inline kind attribute
+  cir::InlineKindAttr inlineKindAttr;
+  if (failed(parseInlineKindAttr(parser, inlineKindAttr)))
+    return failure();
+  if (inlineKindAttr)
+    state.addAttribute(inlineKindNameAttr, inlineKindAttr);
+
   if (::mlir::succeeded(parser.parseOptionalKeyword(lambdaNameAttr.strref())))
     state.addAttribute(lambdaNameAttr, parser.getBuilder().getUnitAttr());
   if (parser.parseOptionalKeyword(noProtoNameAttr).succeeded())
@@ -1890,36 +1934,20 @@ ParseResult cir::FuncOp::parse(OpAsmParser &parser, OperationState &state) {
       }).failed())
     return failure();
 
-  // Parse optional inline kind: inline(never|always|hint)
-  if (parser.parseOptionalKeyword("inline").succeeded()) {
-    if (parser.parseLParen().failed())
-      return failure();
-
-    llvm::StringRef inlineKindStr;
-    const std::array<llvm::StringRef, cir::getMaxEnumValForInlineKind()>
-        allowedInlineKindStrs{
-            cir::stringifyInlineKind(cir::InlineKind::NoInline),
-            cir::stringifyInlineKind(cir::InlineKind::AlwaysInline),
-            cir::stringifyInlineKind(cir::InlineKind::InlineHint),
-        };
-    if (parser.parseOptionalKeyword(&inlineKindStr, allowedInlineKindStrs)
-            .failed())
-      return parser.emitError(parser.getCurrentLocation(),
-                              "expected 'never', 'always', or 'hint'");
-
-    std::optional<InlineKind> inlineKind =
-        cir::symbolizeInlineKind(inlineKindStr);
-    if (!inlineKind)
-      return parser.emitError(parser.getCurrentLocation(),
-                              "invalid inline kind");
-
-    state.addAttribute(getInlineKindAttrName(state.name),
-                       cir::InlineAttr::get(builder.getContext(), *inlineKind));
+  // Parse the rest of the attributes.
+  NamedAttrList parsedAttrs;
+  if (parser.parseOptionalAttrDictWithKeyword(parsedAttrs))
+    return failure();
 
-    if (parser.parseRParen().failed())
-      return failure();
+  for (StringRef disallowed : cir::FuncOp::getAttributeNames()) {
+    if (parsedAttrs.get(disallowed))
+      return parser.emitError(loc, "attribute '")
+             << disallowed
+             << "' should not be specified in the explicit attribute list";
   }
 
+  state.attributes.append(parsedAttrs);
+
   // Parse the optional function body.
   auto *body = state.addRegion();
   OptionalParseResult parseResult = parser.parseOptionalRegion(
@@ -2014,6 +2042,8 @@ void cir::FuncOp::print(OpAsmPrinter &p) {
   if (getCoroutine())
     p << " coroutine";
 
+  printInlineKindAttr(p, getInlineKindAttr());
+
   if (getLambda())
     p << " lambda";
 
@@ -2069,9 +2099,8 @@ void cir::FuncOp::print(OpAsmPrinter &p) {
       p << "(" << globalDtorPriority.value() << ")";
   }
 
-  if (cir::InlineAttr inlineAttr = getInlineKindAttr()) {
-    p << " inline(" << cir::stringifyInlineKind(inlineAttr.getValue()) << ")";
-  }
+  function_interface_impl::printFunctionAttributes(
+      p, *this, cir::FuncOp::getAttributeNames());
 
   // Print the body if this is not an external function.
   Region &body = getOperation()->getRegion(0);
diff --git a/clang/lib/CIR/Dialect/Transforms/LoweringPrepare.cpp b/clang/lib/CIR/Dialect/Transforms/LoweringPrepare.cpp
index cedc2a73b9260..94e143e202736 100644
--- a/clang/lib/CIR/Dialect/Transforms/LoweringPrepare.cpp
+++ b/clang/lib/CIR/Dialect/Transforms/LoweringPrepare.cpp
@@ -74,6 +74,7 @@ struct LoweringPreparePass
   void lowerDynamicCastOp(cir::DynamicCastOp op);
   void lowerArrayDtor(cir::ArrayDtor op);
   void lowerArrayCtor(cir::ArrayCtor op);
+  void lowerTrivialCopyCall(cir::CallOp op);
 
   /// Build the function that initializes the specified global
   cir::FuncOp buildCXXGlobalVarDeclInitFunc(cir::GlobalOp op);
@@ -1086,6 +1087,25 @@ void LoweringPreparePass::lowerArrayCtor(cir::ArrayCtor op) {
                              true);
 }
 
+void LoweringPreparePass::lowerTrivialCopyCall(cir::CallOp op) {
+  cir::FuncOp funcOp = getCalledFunction(op);
+  if (!funcOp)
+    return;
+
+  std::optional<cir::CtorKind> ctorKind = funcOp.getCxxConstructorKind();
+  if (ctorKind && *ctorKind == cir::CtorKind::Copy &&
+      funcOp.isCxxTrivialMemberFunction()) {
+    // Replace the trivial copy constructor call with a `CopyOp`
+    CIRBaseBuilderTy builder(getContext());
+    mlir::ValueRange operands = op.getOperands();
+    mlir::Value dest = operands[0];
+    mlir::Value src = operands[1];
+    builder.setInsertionPoint(op);
+    builder.createCopy(dest, src);
+    op.erase();
+  }
+}
+
 void LoweringPreparePass::runOnOp(mlir::Operation *op) {
   if (auto arrayCtor = dyn_cast<cir::ArrayCtor>(op)) {
     lowerArrayCtor(arrayCtor);
@@ -1103,6 +1123,8 @@ void LoweringPreparePass::runOnOp(mlir::Operation *op) {
     lowerDynamicCastOp(dynamicCast);
   } else if (auto unary = mlir::dyn_cast<cir::UnaryOp>(op)) {
     lowerUnaryOp(unary);
+  } else if (auto callOp = dyn_cast<cir::CallOp>(op)) {
+    lowerTrivialCopyCall(callOp);
   } else if (auto fnOp = dyn_cast<cir::FuncOp>(op)) {
     if (auto globalCtor = fnOp.getGlobalCtorPriority())
       globalCtorList.emplace_back(fnOp.getName(), globalCtor.value());
@@ -1121,7 +1143,7 @@ void LoweringPreparePass::runOnOperation() {
   op->walk([&](mlir::Operation *op) {
     if (mlir::isa<cir::ArrayCtor, cir::ArrayDtor, cir::CastOp,
                   cir::ComplexMulOp, cir::ComplexDivOp, cir::DynamicCastOp,
-                  cir::FuncOp, cir::GlobalOp, cir::UnaryOp>(op))
+                  cir::FuncOp, cir::CallOp, cir::GlobalOp, cir::UnaryOp>(op))
       opsToTransform.push_back(op);
   });
 
diff --git a/clang/lib/CIR/FrontendAction/CIRGenAction.cpp b/clang/lib/CIR/FrontendAction/CIRGenAction.cpp
index 67bb5657d4001..daec8ae409e0f 100644
--- a/clang/lib/CIR/FrontendAction/CIRGenAction.cpp
+++ b/clang/lib/CIR/FrontendAction/CIRGenAction.cpp
@@ -88,6 +88,11 @@ class CIRGenConsumer : public clang::ASTConsumer {
     Gen->HandleCXXStaticMemberVarInstantiation(VD);
   }
 
+  void HandleOpenACCRoutineReference(const FunctionDecl *FD,
+                                     const OpenACCRoutineDecl *RD) override {
+    Gen->HandleOpenACCRoutineReference(FD, RD);
+  }
+
   void HandleInlineFunctionDefinition(FunctionDecl *D) override {
     Gen->HandleInlineFunctionDefinition(D);
   }
diff --git a/clang/lib/CIR/Lowering/DirectToLLVM/LowerToLLVM.cpp b/clang/lib/CIR/Lowering/DirectToLLVM/LowerToLLVM.cpp
index 0c34d87734c3e..40e14474890dc 100644
--- a/clang/lib/CIR/Lowering/DirectToLLVM/LowerToLLVM.cpp
+++ b/clang/lib/CIR/Lowering/DirectToLLVM/LowerToLLVM.cpp
@@ -210,6 +210,15 @@ mlir::LogicalResult CIRToLLVMExp2OpLowering::matchAndRewrite(
   return mlir::success();
 }
 
+mlir::LogicalResult CIRToLLVMFloorOpLowering::matchAndRewrite(
+    cir::FloorOp op, OpAdaptor adaptor,
+    mlir::ConversionPatternRewriter &rewriter) const {
+  mlir::Type resTy = typeConverter->convertType(op.getType());
+  rewriter.replaceOpWithNewOp<mlir::LLVM::FFloorOp>(op, resTy,
+                                                    adaptor.getSrc());
+  return mlir::success();
+}
+
 static mlir::Value getLLVMIntCast(mlir::ConversionPatternRewriter &rewriter,
                                   mlir::Value llvmSrc, mlir::Type llvmDstIntTy,
                                   bool isUnsigned, uint64_t cirSrcWidth,
@@ -1978,10 +1987,10 @@ mlir::LogicalResult CIRToLLVMFuncOpLowering::matchAndRewrite(
 
   assert(!cir::MissingFeatures::opFuncMultipleReturnVals());
 
-  if (auto inlineKind = op.getInlineKind()) {
-    fn.setNoInline(inlineKind == cir::InlineKind::NoInline);
-    fn.setInlineHint(inlineKind == cir::InlineKind::InlineHint);
-    fn.setAlwaysInline(inlineKind == cir::InlineKind::AlwaysInline);
+  if (std::optional<cir::InlineKind> inlineKind = op.getInlineKind()) {
+    fn.setNoInline(*inlineKind == cir::InlineKind::NoInline);
+    fn.setInlineHint(*inlineKind == cir::InlineKind::InlineHint);
+    fn.setAlwaysInline(*inlineKind == cir::InlineKind::AlwaysInline);
   }
 
   fn.setVisibility_Attr(mlir::LLVM::VisibilityAttr::get(
@@ -3970,6 +3979,13 @@ mlir::LogicalResult CIRToLLVMGetBitfieldOpLowering::matchAndRewrite(
   return mlir::success();
 }
 
+mlir::LogicalResult CIRToLLVMIsConstantOpLowering::matchAndRewrite(
+    cir::IsConstantOp op, OpAdaptor adaptor,
+    mlir::ConversionPatternRewriter &rewriter) const {
+  rewriter.replaceOpWithNewOp<mlir::LLVM::IsConstantOp>(op, adaptor.getVal());
+  return mlir::success();
+}
+
 mlir::LogicalResult CIRToLLVMInlineAsmOpLowering::matchAndRewrite(
     cir::InlineAsmOp op, OpAdaptor adaptor,
     mlir::ConversionPatternRewriter &rewriter) const {
@@ -4052,6 +4068,18 @@ mlir::LogicalResult CIRToLLVMVAEndOpLowering::matchAndRewrite(
   return mlir::success();
 }
 
+mlir::LogicalResult CIRToLLVMVACopyOpLowering::matchAndRewrite(
+    cir::VACopyOp op, OpAdaptor adaptor,
+    mlir::ConversionPatternRewriter &rewriter) const {
+  auto opaquePtr = mlir::LLVM::LLVMPointerType::get(getContext());
+  auto dstList = mlir::LLVM::BitcastOp::create(rewriter, op.getLoc(), opaquePtr,
+                                               adaptor.getDstList());
+  auto srcList = mlir::LLVM::BitcastOp::create(rewriter, op.getLoc(), opaquePtr,
+                                               adaptor.getSrcList());
+  rewriter.replaceOpWithNewOp<mlir::LLVM::VaCopyOp>(op, dstList, srcList);
+  return mlir::success();
+}
+
 mlir::LogicalResult CIRToLLVMVAArgOpLowering::matchAndRewrite(
     cir::VAArgOp op, OpAdaptor adaptor,
     mlir::ConversionPatternRewriter &rewriter) const {
diff --git a/clang/lib/CodeGen/BackendUtil.cpp b/clang/lib/CodeGen/BackendUtil.cpp
index 82ca831f35da2..af3480d5755f1 100644
--- a/clang/lib/CodeGen/BackendUtil.cpp
+++ b/clang/lib/CodeGen/BackendUtil.cpp
@@ -19,6 +19,7 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/ADT/StringSwitch.h"
 #include "llvm/Analysis/GlobalsModRef.h"
+#include "llvm/Analysis/RuntimeLibcallInfo.h"
 #include "llvm/Analysis/TargetLibraryInfo.h"
 #include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/Bitcode/BitcodeReader.h"
@@ -66,7 +67,6 @@
 #include "llvm/Transforms/InstCombine/InstCombine.h"
 #include "llvm/Transforms/Instrumentation/AddressSanitizer.h"
 #include "llvm/Transforms/Instrumentation/AddressSanitizerOptions.h"
-#include "llvm/Transforms/Instrumentation/AllocToken.h"
 #include "llvm/Transforms/Instrumentation/BoundsChecking.h"
 #include "llvm/Transforms/Instrumentation/DataFlowSanitizer.h"
 #include "llvm/Transforms/Instrumentation/GCOVProfiler.h"
@@ -234,17 +234,6 @@ class EmitAssemblyHelper {
 };
 } // namespace
 
-static AllocTokenOptions getAllocTokenOptions(const LangOptions &LangOpts,
-                                              const CodeGenOptions &CGOpts) {
-  AllocTokenOptions Opts;
-  if (LangOpts.AllocTokenMode)
-    Opts.Mode = *LangOpts.AllocTokenMode;
-  Opts.MaxTokens = LangOpts.AllocTokenMax;
-  Opts.Extended = CGOpts.SanitizeAllocTokenExtended;
-  Opts.FastABI = CGOpts.SanitizeAllocTokenFastABI;
-  return Opts;
-}
-
 static SanitizerCoverageOptions
 getSancovOptsFromCGOpts(const CodeGenOptions &CGOpts) {
   SanitizerCoverageOptions Opts;
@@ -667,6 +656,11 @@ bool EmitAssemblyHelper::AddEmitPasses(legacy::PassManager &CodeGenPasses,
       llvm::driver::createTLII(TargetTriple, CodeGenOpts.getVecLib()));
   CodeGenPasses.add(new TargetLibraryInfoWrapperPass(*TLII));
 
+  const llvm::TargetOptions &Options = TM->Options;
+  CodeGenPasses.add(new RuntimeLibraryInfoWrapper(
+      TargetTriple, Options.ExceptionModel, Options.FloatABIType,
+      Options.EABIVersion, Options.MCOptions.ABIName, Options.VecLib));
+
   // Normal mode, emit a .s or .o file by running the code generator. Note,
   // this also adds codegenerator level optimization passes.
   CodeGenFileType CGFT = getCodeGenFileType(Action);
@@ -873,23 +867,6 @@ static void addSanitizers(const Triple &TargetTriple,
   }
 }
 
-static void addAllocTokenPass(const Triple &TargetTriple,
-                              const CodeGenOptions &CodeGenOpts,
-                              const LangOptions &LangOpts, PassBuilder &PB) {
-  PB.registerOptimizerLastEPCallback([&](ModulePassManager &MPM,
-                                         OptimizationLevel Level,
-                                         ThinOrFullLTOPhase) {
-    if (Level == OptimizationLevel::O0 &&
-        LangOpts.Sanitize.has(SanitizerKind::AllocToken)) {
-      // The default pass builder only infers libcall function attrs when
-      // optimizing, so we insert it here because we need it for accurate
-      // memory allocation function detection with -fsanitize=alloc-token.
-      MPM.addPass(InferFunctionAttrsPass());
-    }
-    MPM.addPass(AllocTokenPass(getAllocTokenOptions(LangOpts, CodeGenOpts)));
-  });
-}
-
 void EmitAssemblyHelper::RunOptimizationPipeline(
     BackendAction Action, std::unique_ptr<raw_pwrite_stream> &OS,
     std::unique_ptr<llvm::ToolOutputFile> &ThinLinkOS, BackendConsumer *BC) {
@@ -1141,12 +1118,23 @@ void EmitAssemblyHelper::RunOptimizationPipeline(
         FPM.addPass(BoundsCheckingPass(Options));
       });
 
-    // Don't add sanitizers if we are here from ThinLTO PostLink. That already
-    // done on PreLink stage.
     if (!IsThinLTOPostLink) {
+      // Most sanitizers only run during PreLink stage.
       addSanitizers(TargetTriple, CodeGenOpts, LangOpts, PB);
       addKCFIPass(TargetTriple, LangOpts, PB);
-      addAllocTokenPass(TargetTriple, CodeGenOpts, LangOpts, PB);
+
+      PB.registerPipelineStartEPCallback(
+          [&](ModulePassManager &MPM, OptimizationLevel Level) {
+            if (Level == OptimizationLevel::O0 &&
+                LangOpts.Sanitize.has(SanitizerKind::AllocToken)) {
+              // With the default O0 pipeline, LibFunc attrs are not inferred,
+              // so we insert it here because we need it for accurate memory
+              // allocation function detection with -fsanitize=alloc-token.
+              // Note: This could also be added to the default O0 pipeline, but
+              // has a non-trivial effect on generated IR size (attributes).
+              MPM.addPass(InferFunctionAttrsPass());
+            }
+          });
     }
 
     if (std::optional<GCOVOptions> Options =
diff --git a/clang/lib/CodeGen/CGCUDARuntime.cpp b/clang/lib/CodeGen/CGCUDARuntime.cpp
index 121a481213396..9c831b26c3a7b 100644
--- a/clang/lib/CodeGen/CGCUDARuntime.cpp
+++ b/clang/lib/CodeGen/CGCUDARuntime.cpp
@@ -22,6 +22,112 @@ using namespace CodeGen;
 
 CGCUDARuntime::~CGCUDARuntime() {}
 
+static llvm::Value *emitGetParamBuf(CodeGenFunction &CGF,
+                                    const CUDAKernelCallExpr *E) {
+  auto *GetParamBuf = CGF.getContext().getcudaGetParameterBufferDecl();
+  const FunctionProtoType *GetParamBufProto =
+      GetParamBuf->getType()->getAs<FunctionProtoType>();
+
+  DeclRefExpr *DRE = DeclRefExpr::Create(
+      CGF.getContext(), {}, {}, GetParamBuf,
+      /*RefersToEnclosingVariableOrCapture=*/false, GetParamBuf->getNameInfo(),
+      GetParamBuf->getType(), VK_PRValue);
+  auto *ImpCast = ImplicitCastExpr::Create(
+      CGF.getContext(), CGF.getContext().getPointerType(GetParamBuf->getType()),
+      CK_FunctionToPointerDecay, DRE, nullptr, VK_PRValue, FPOptionsOverride());
+
+  CGCallee Callee = CGF.EmitCallee(ImpCast);
+  CallArgList Args;
+  // Use 64B alignment.
+  Args.add(RValue::get(CGF.CGM.getSize(CharUnits::fromQuantity(64))),
+           CGF.getContext().getSizeType());
+  // Calculate parameter sizes.
+  const PointerType *PT = E->getCallee()->getType()->getAs<PointerType>();
+  const FunctionProtoType *FTP =
+      PT->getPointeeType()->getAs<FunctionProtoType>();
+  CharUnits Offset = CharUnits::Zero();
+  for (auto ArgTy : FTP->getParamTypes()) {
+    auto TInfo = CGF.CGM.getContext().getTypeInfoInChars(ArgTy);
+    Offset = Offset.alignTo(TInfo.Align) + TInfo.Width;
+  }
+  Args.add(RValue::get(CGF.CGM.getSize(Offset)),
+           CGF.getContext().getSizeType());
+  const CGFunctionInfo &CallInfo = CGF.CGM.getTypes().arrangeFreeFunctionCall(
+      Args, GetParamBufProto, /*ChainCall=*/false);
+  auto Ret = CGF.EmitCall(CallInfo, Callee, /*ReturnValue=*/{}, Args);
+
+  return Ret.getScalarVal();
+}
+
+RValue CGCUDARuntime::EmitCUDADeviceKernelCallExpr(
+    CodeGenFunction &CGF, const CUDAKernelCallExpr *E,
+    ReturnValueSlot ReturnValue, llvm::CallBase **CallOrInvoke) {
+  assert(CGM.getContext().getcudaLaunchDeviceDecl() ==
+         E->getConfig()->getDirectCallee());
+
+  llvm::BasicBlock *ConfigOKBlock = CGF.createBasicBlock("dkcall.configok");
+  llvm::BasicBlock *ContBlock = CGF.createBasicBlock("dkcall.end");
+
+  llvm::Value *Config = emitGetParamBuf(CGF, E);
+  CGF.Builder.CreateCondBr(
+      CGF.Builder.CreateICmpNE(Config,
+                               llvm::Constant::getNullValue(Config->getType())),
+      ConfigOKBlock, ContBlock);
+
+  CodeGenFunction::ConditionalEvaluation eval(CGF);
+
+  eval.begin(CGF);
+  CGF.EmitBlock(ConfigOKBlock);
+
+  QualType KernelCalleeFuncTy =
+      E->getCallee()->getType()->getAs<PointerType>()->getPointeeType();
+  CGCallee KernelCallee = CGF.EmitCallee(E->getCallee());
+  // Emit kernel arguments.
+  CallArgList KernelCallArgs;
+  CGF.EmitCallArgs(KernelCallArgs,
+                   KernelCalleeFuncTy->getAs<FunctionProtoType>(),
+                   E->arguments(), E->getDirectCallee());
+  // Copy emitted kernel arguments into that parameter buffer.
+  RawAddress CfgBase(Config, CGM.Int8Ty,
+                     /*Alignment=*/CharUnits::fromQuantity(64));
+  CharUnits Offset = CharUnits::Zero();
+  for (auto &Arg : KernelCallArgs) {
+    auto TInfo = CGM.getContext().getTypeInfoInChars(Arg.getType());
+    Offset = Offset.alignTo(TInfo.Align);
+    Address Addr =
+        CGF.Builder.CreateConstInBoundsGEP(CfgBase, Offset.getQuantity());
+    Arg.copyInto(CGF, Addr);
+    Offset += TInfo.Width;
+  }
+  // Make `cudaLaunchDevice` call, i.e. E->getConfig().
+  const CallExpr *LaunchCall = E->getConfig();
+  QualType LaunchCalleeFuncTy = LaunchCall->getCallee()
+                                    ->getType()
+                                    ->getAs<PointerType>()
+                                    ->getPointeeType();
+  CGCallee LaunchCallee = CGF.EmitCallee(LaunchCall->getCallee());
+  CallArgList LaunchCallArgs;
+  CGF.EmitCallArgs(LaunchCallArgs,
+                   LaunchCalleeFuncTy->getAs<FunctionProtoType>(),
+                   LaunchCall->arguments(), LaunchCall->getDirectCallee());
+  // Replace func and parameter buffer arguments.
+  LaunchCallArgs[0] = CallArg(RValue::get(KernelCallee.getFunctionPointer()),
+                              CGM.getContext().VoidPtrTy);
+  LaunchCallArgs[1] = CallArg(RValue::get(Config), CGM.getContext().VoidPtrTy);
+  const CGFunctionInfo &LaunchCallInfo = CGM.getTypes().arrangeFreeFunctionCall(
+      LaunchCallArgs, LaunchCalleeFuncTy->getAs<FunctionProtoType>(),
+      /*ChainCall=*/false);
+  CGF.EmitCall(LaunchCallInfo, LaunchCallee, ReturnValue, LaunchCallArgs,
+               CallOrInvoke,
+               /*IsMustTail=*/false, E->getExprLoc());
+  CGF.EmitBranch(ContBlock);
+
+  CGF.EmitBlock(ContBlock);
+  eval.end(CGF);
+
+  return RValue::get(nullptr);
+}
+
 RValue CGCUDARuntime::EmitCUDAKernelCallExpr(CodeGenFunction &CGF,
                                              const CUDAKernelCallExpr *E,
                                              ReturnValueSlot ReturnValue,
diff --git a/clang/lib/CodeGen/CGCUDARuntime.h b/clang/lib/CodeGen/CGCUDARuntime.h
index 86f776004ee7c..64fb9a31422e0 100644
--- a/clang/lib/CodeGen/CGCUDARuntime.h
+++ b/clang/lib/CodeGen/CGCUDARuntime.h
@@ -88,6 +88,10 @@ class CGCUDARuntime {
                          ReturnValueSlot ReturnValue,
                          llvm::CallBase **CallOrInvoke = nullptr);
 
+  virtual RValue EmitCUDADeviceKernelCallExpr(
+      CodeGenFunction &CGF, const CUDAKernelCallExpr *E,
+      ReturnValueSlot ReturnValue, llvm::CallBase **CallOrInvoke = nullptr);
+
   /// Emits a kernel launch stub.
   virtual void emitDeviceStub(CodeGenFunction &CGF, FunctionArgList &Args) = 0;
 
diff --git a/clang/lib/CodeGen/CGCall.cpp b/clang/lib/CodeGen/CGCall.cpp
index efacb3cc04c01..4a9025b6e0b0f 100644
--- a/clang/lib/CodeGen/CGCall.cpp
+++ b/clang/lib/CodeGen/CGCall.cpp
@@ -2559,6 +2559,19 @@ void CodeGenModule::ConstructAttributeList(StringRef Name,
 
     if (TargetDecl->hasAttr<ArmLocallyStreamingAttr>())
       FuncAttrs.addAttribute("aarch64_pstate_sm_body");
+
+    if (auto *ModularFormat = TargetDecl->getAttr<ModularFormatAttr>()) {
+      FormatAttr *Format = TargetDecl->getAttr<FormatAttr>();
+      StringRef Type = Format->getType()->getName();
+      std::string FormatIdx = std::to_string(Format->getFormatIdx());
+      std::string FirstArg = std::to_string(Format->getFirstArg());
+      SmallVector<StringRef> Args = {
+          Type, FormatIdx, FirstArg,
+          ModularFormat->getModularImplFn()->getName(),
+          ModularFormat->getImplName()};
+      llvm::append_range(Args, ModularFormat->aspects());
+      FuncAttrs.addAttribute("modular-format", llvm::join(Args, ","));
+    }
   }
 
   // Attach "no-builtins" attributes to:
diff --git a/clang/lib/CodeGen/CGExpr.cpp b/clang/lib/CodeGen/CGExpr.cpp
index c8f669b69d991..e842158236cd4 100644
--- a/clang/lib/CodeGen/CGExpr.cpp
+++ b/clang/lib/CodeGen/CGExpr.cpp
@@ -2575,6 +2575,32 @@ void CodeGenFunction::EmitStoreThroughLValue(RValue Src, LValue Dst,
                                              bool isInit) {
   if (!Dst.isSimple()) {
     if (Dst.isVectorElt()) {
+      if (getLangOpts().HLSL) {
+        // HLSL allows direct access to vector elements, so a store to an
+        // individual element of a vector through VectorElt is emitted as a
+        // separate store instruction.
+        Address DstAddr = Dst.getVectorAddress();
+        llvm::Type *DestAddrTy = DstAddr.getElementType();
+        llvm::Type *ElemTy = DestAddrTy->getScalarType();
+        CharUnits ElemAlign = CharUnits::fromQuantity(
+            CGM.getDataLayout().getPrefTypeAlign(ElemTy));
+
+        assert(ElemTy->getScalarSizeInBits() >= 8 &&
+               "vector element type must be at least byte-sized");
+
+        llvm::Value *Val = Src.getScalarVal();
+        if (Val->getType()->getPrimitiveSizeInBits() <
+            ElemTy->getScalarSizeInBits())
+          Val = Builder.CreateZExt(Val, ElemTy->getScalarType());
+
+        llvm::Value *Idx = Dst.getVectorIdx();
+        llvm::Value *Zero = llvm::ConstantInt::get(Int32Ty, 0);
+        Address DstElemAddr =
+            Builder.CreateGEP(DstAddr, {Zero, Idx}, DestAddrTy, ElemAlign);
+        Builder.CreateStore(Val, DstElemAddr, Dst.isVolatileQualified());
+        return;
+      }
+
       // Read/modify/write the vector, inserting the new element.
       llvm::Value *Vec = Builder.CreateLoad(Dst.getVectorAddress(),
                                             Dst.isVolatileQualified());
@@ -5772,6 +5798,7 @@ LValue CodeGenFunction::EmitCastLValue(const CastExpr *E) {
   case CK_IntegralToFixedPoint:
   case CK_MatrixCast:
   case CK_HLSLVectorTruncation:
+  case CK_HLSLMatrixTruncation:
   case CK_HLSLArrayRValue:
   case CK_HLSLElementwiseCast:
   case CK_HLSLAggregateSplatCast:
@@ -6330,8 +6357,15 @@ LValue CodeGenFunction::EmitBinaryOperatorLValue(const BinaryOperator *E) {
 LValue CodeGenFunction::EmitHLSLArrayAssignLValue(const BinaryOperator *E) {
   // Don't emit an LValue for the RHS because it might not be an LValue
   LValue LHS = EmitLValue(E->getLHS());
+
+  // If the RHS is a global resource array, copy all individual resources
+  // into LHS.
+  if (E->getRHS()->getType()->isHLSLResourceRecordArray())
+    if (CGM.getHLSLRuntime().emitResourceArrayCopy(LHS, E->getRHS(), *this))
+      return LHS;
+
   // In C the RHS of an assignment operator is an RValue.
-  // EmitAggregateAssign takes anan LValue for the RHS. Instead we can call
+  // EmitAggregateAssign takes an LValue for the RHS. Instead we can call
   // EmitInitializationToLValue to emit an RValue into an LValue.
   EmitInitializationToLValue(E->getRHS(), LHS);
   return LHS;
diff --git a/clang/lib/CodeGen/CGExprAgg.cpp b/clang/lib/CodeGen/CGExprAgg.cpp
index 67b5f919d1b2a..7cc4d6c8f06f6 100644
--- a/clang/lib/CodeGen/CGExprAgg.cpp
+++ b/clang/lib/CodeGen/CGExprAgg.cpp
@@ -1036,7 +1036,7 @@ void AggExprEmitter::VisitCastExpr(CastExpr *E) {
   case CK_ZeroToOCLOpaqueType:
   case CK_MatrixCast:
   case CK_HLSLVectorTruncation:
-
+  case CK_HLSLMatrixTruncation:
   case CK_IntToOCLSampler:
   case CK_FloatingToFixedPoint:
   case CK_FixedPointToFloating:
@@ -1550,6 +1550,7 @@ static bool castPreservesZero(const CastExpr *CE) {
   case CK_NonAtomicToAtomic:
   case CK_AtomicToNonAtomic:
   case CK_HLSLVectorTruncation:
+  case CK_HLSLMatrixTruncation:
   case CK_HLSLElementwiseCast:
   case CK_HLSLAggregateSplatCast:
     return true;
diff --git a/clang/lib/CodeGen/CGExprCXX.cpp b/clang/lib/CodeGen/CGExprCXX.cpp
index 14d8db32bafc6..ce2ed9026fa1f 100644
--- a/clang/lib/CodeGen/CGExprCXX.cpp
+++ b/clang/lib/CodeGen/CGExprCXX.cpp
@@ -503,6 +503,10 @@ RValue CodeGenFunction::EmitCXXOperatorMemberCallExpr(
 RValue CodeGenFunction::EmitCUDAKernelCallExpr(const CUDAKernelCallExpr *E,
                                                ReturnValueSlot ReturnValue,
                                                llvm::CallBase **CallOrInvoke) {
+  // Emit as a device kernel call if CUDA device code is to be generated.
+  if (getLangOpts().CUDAIsDevice)
+    return CGM.getCUDARuntime().EmitCUDADeviceKernelCallExpr(
+        *this, E, ReturnValue, CallOrInvoke);
   return CGM.getCUDARuntime().EmitCUDAKernelCallExpr(*this, E, ReturnValue,
                                                      CallOrInvoke);
 }
diff --git a/clang/lib/CodeGen/CGExprComplex.cpp b/clang/lib/CodeGen/CGExprComplex.cpp
index bca7c30557f03..e5815ef1130dc 100644
--- a/clang/lib/CodeGen/CGExprComplex.cpp
+++ b/clang/lib/CodeGen/CGExprComplex.cpp
@@ -621,6 +621,7 @@ ComplexPairTy ComplexExprEmitter::EmitCast(CastKind CK, Expr *Op,
   case CK_IntegralToFixedPoint:
   case CK_MatrixCast:
   case CK_HLSLVectorTruncation:
+  case CK_HLSLMatrixTruncation:
   case CK_HLSLArrayRValue:
   case CK_HLSLElementwiseCast:
   case CK_HLSLAggregateSplatCast:
diff --git a/clang/lib/CodeGen/CGExprConstant.cpp b/clang/lib/CodeGen/CGExprConstant.cpp
index 6407afc3d9447..0eec4dba4824a 100644
--- a/clang/lib/CodeGen/CGExprConstant.cpp
+++ b/clang/lib/CodeGen/CGExprConstant.cpp
@@ -1333,6 +1333,7 @@ class ConstExprEmitter
     case CK_ZeroToOCLOpaqueType:
     case CK_MatrixCast:
     case CK_HLSLVectorTruncation:
+    case CK_HLSLMatrixTruncation:
     case CK_HLSLArrayRValue:
     case CK_HLSLElementwiseCast:
     case CK_HLSLAggregateSplatCast:
diff --git a/clang/lib/CodeGen/CGExprScalar.cpp b/clang/lib/CodeGen/CGExprScalar.cpp
index 714192db1b15c..769bc37b0e131 100644
--- a/clang/lib/CodeGen/CGExprScalar.cpp
+++ b/clang/lib/CodeGen/CGExprScalar.cpp
@@ -2422,9 +2422,31 @@ static Value *EmitHLSLElementwiseCast(CodeGenFunction &CGF, LValue SrcVal,
     }
     return V;
   }
+  if (auto *MatTy = DestTy->getAs<ConstantMatrixType>()) {
+    assert(LoadList.size() >= MatTy->getNumElementsFlattened() &&
+           "Flattened type on RHS must have the same number or more elements "
+           "than vector on LHS.");
+
+    llvm::Value *V =
+        CGF.Builder.CreateLoad(CGF.CreateIRTemp(DestTy, "flatcast.tmp"));
+    // V is an allocated temporary to build the truncated matrix into.
+    for (unsigned I = 0, E = MatTy->getNumElementsFlattened(); I < E; I++) {
+      unsigned ColMajorIndex =
+          (I % MatTy->getNumRows()) * MatTy->getNumColumns() +
+          (I / MatTy->getNumRows());
+      RValue RVal = CGF.EmitLoadOfLValue(LoadList[ColMajorIndex], Loc);
+      assert(RVal.isScalar() &&
+             "All flattened source values should be scalars.");
+      llvm::Value *Cast = CGF.EmitScalarConversion(
+          RVal.getScalarVal(), LoadList[ColMajorIndex].getType(),
+          MatTy->getElementType(), Loc);
+      V = CGF.Builder.CreateInsertElement(V, Cast, I);
+    }
+    return V;
+  }
   // if its a builtin just do an extract element or load.
   assert(DestTy->isBuiltinType() &&
-         "Destination type must be a vector or builtin type.");
+         "Destination type must be a vector, matrix, or builtin type.");
   RValue RVal = CGF.EmitLoadOfLValue(LoadList[0], Loc);
   assert(RVal.isScalar() && "All flattened source values should be scalars.");
   return CGF.EmitScalarConversion(RVal.getScalarVal(), LoadList[0].getType(),
@@ -2954,15 +2976,47 @@ Value *ScalarExprEmitter::VisitCastExpr(CastExpr *CE) {
     llvm::Value *Zero = llvm::Constant::getNullValue(CGF.SizeTy);
     return Builder.CreateExtractElement(Vec, Zero, "cast.vtrunc");
   }
+  case CK_HLSLMatrixTruncation: {
+    assert((DestTy->isMatrixType() || DestTy->isBuiltinType()) &&
+           "Destination type must be a matrix or builtin type.");
+    Value *Mat = Visit(E);
+    if (auto *MatTy = DestTy->getAs<ConstantMatrixType>()) {
+      SmallVector<int> Mask;
+      unsigned NumCols = MatTy->getNumColumns();
+      unsigned NumRows = MatTy->getNumRows();
+      unsigned ColOffset = NumCols;
+      if (auto *SrcMatTy = E->getType()->getAs<ConstantMatrixType>())
+        ColOffset = SrcMatTy->getNumColumns();
+      for (unsigned R = 0; R < NumRows; R++) {
+        for (unsigned C = 0; C < NumCols; C++) {
+          unsigned I = R * ColOffset + C;
+          Mask.push_back(I);
+        }
+      }
+
+      return Builder.CreateShuffleVector(Mat, Mask, "trunc");
+    }
+    llvm::Value *Zero = llvm::Constant::getNullValue(CGF.SizeTy);
+    return Builder.CreateExtractElement(Mat, Zero, "cast.mtrunc");
+  }
   case CK_HLSLElementwiseCast: {
     RValue RV = CGF.EmitAnyExpr(E);
     SourceLocation Loc = CE->getExprLoc();
 
-    assert(RV.isAggregate() && "Not a valid HLSL Elementwise Cast.");
-    // RHS is an aggregate
-    LValue SrcVal = CGF.MakeAddrLValue(RV.getAggregateAddress(), E->getType());
+    Address SrcAddr = Address::invalid();
+
+    if (RV.isAggregate()) {
+      SrcAddr = RV.getAggregateAddress();
+    } else {
+      SrcAddr = CGF.CreateMemTemp(E->getType(), "hlsl.ewcast.src");
+      LValue TmpLV = CGF.MakeAddrLValue(SrcAddr, E->getType());
+      CGF.EmitStoreThroughLValue(RV, TmpLV);
+    }
+
+    LValue SrcVal = CGF.MakeAddrLValue(SrcAddr, E->getType());
     return EmitHLSLElementwiseCast(CGF, SrcVal, DestTy, Loc);
   }
+
   } // end of switch
 
   llvm_unreachable("unknown scalar cast");
diff --git a/clang/lib/CodeGen/CGHLSLRuntime.cpp b/clang/lib/CodeGen/CGHLSLRuntime.cpp
index f5c07fe2e33ff..f485fdd49e43f 100644
--- a/clang/lib/CodeGen/CGHLSLRuntime.cpp
+++ b/clang/lib/CodeGen/CGHLSLRuntime.cpp
@@ -22,6 +22,7 @@
 #include "clang/AST/ASTContext.h"
 #include "clang/AST/Attrs.inc"
 #include "clang/AST/Decl.h"
+#include "clang/AST/Expr.h"
 #include "clang/AST/HLSLResource.h"
 #include "clang/AST/RecursiveASTVisitor.h"
 #include "clang/AST/Type.h"
@@ -94,6 +95,14 @@ void addRootSignatureMD(llvm::dxbc::RootSignatureVersion RootSigVer,
   RootSignatureValMD->addOperand(MDVals);
 }
 
+// Find array variable declaration from DeclRef expression
+static const ValueDecl *getArrayDecl(const Expr *E) {
+  if (const DeclRefExpr *DRE =
+          dyn_cast_or_null<DeclRefExpr>(E->IgnoreImpCasts()))
+    return DRE->getDecl();
+  return nullptr;
+}
+
 // Find array variable declaration from nested array subscript AST nodes
 static const ValueDecl *getArrayDecl(const ArraySubscriptExpr *ASE) {
   const Expr *E = nullptr;
@@ -103,9 +112,7 @@ static const ValueDecl *getArrayDecl(const ArraySubscriptExpr *ASE) {
       return nullptr;
     ASE = dyn_cast<ArraySubscriptExpr>(E);
   }
-  if (const DeclRefExpr *DRE = dyn_cast_or_null<DeclRefExpr>(E))
-    return DRE->getDecl();
-  return nullptr;
+  return getArrayDecl(E);
 }
 
 // Get the total size of the array, or -1 if the array is unbounded.
@@ -582,20 +589,22 @@ static llvm::Value *createSPIRVLocationLoad(IRBuilder<> &B, llvm::Module &M,
   return B.CreateLoad(Ty, GV);
 }
 
-llvm::Value *
-CGHLSLRuntime::emitSPIRVUserSemanticLoad(llvm::IRBuilder<> &B, llvm::Type *Type,
-                                         HLSLAppliedSemanticAttr *Semantic,
-                                         std::optional<unsigned> Index) {
+llvm::Value *CGHLSLRuntime::emitSPIRVUserSemanticLoad(
+    llvm::IRBuilder<> &B, llvm::Type *Type, const clang::DeclaratorDecl *Decl,
+    HLSLAppliedSemanticAttr *Semantic, std::optional<unsigned> Index) {
   Twine BaseName = Twine(Semantic->getAttrName()->getName());
   Twine VariableName = BaseName.concat(Twine(Index.value_or(0)));
 
   unsigned Location = SPIRVLastAssignedInputSemanticLocation;
+  if (auto *L = Decl->getAttr<HLSLVkLocationAttr>())
+    Location = L->getLocation();
 
   // DXC completely ignores the semantic/index pair. Location are assigned from
   // the first semantic to the last.
   llvm::ArrayType *AT = dyn_cast<llvm::ArrayType>(Type);
   unsigned ElementCount = AT ? AT->getNumElements() : 1;
   SPIRVLastAssignedInputSemanticLocation += ElementCount;
+
   return createSPIRVLocationLoad(B, CGM.getModule(), Type, Location,
                                  VariableName.str());
 }
@@ -616,10 +625,14 @@ static void createSPIRVLocationStore(IRBuilder<> &B, llvm::Module &M,
 
 void CGHLSLRuntime::emitSPIRVUserSemanticStore(
     llvm::IRBuilder<> &B, llvm::Value *Source,
-    HLSLAppliedSemanticAttr *Semantic, std::optional<unsigned> Index) {
+    const clang::DeclaratorDecl *Decl, HLSLAppliedSemanticAttr *Semantic,
+    std::optional<unsigned> Index) {
   Twine BaseName = Twine(Semantic->getAttrName()->getName());
   Twine VariableName = BaseName.concat(Twine(Index.value_or(0)));
+
   unsigned Location = SPIRVLastAssignedOutputSemanticLocation;
+  if (auto *L = Decl->getAttr<HLSLVkLocationAttr>())
+    Location = L->getLocation();
 
   // DXC completely ignores the semantic/index pair. Location are assigned from
   // the first semantic to the last.
@@ -671,7 +684,7 @@ llvm::Value *CGHLSLRuntime::emitUserSemanticLoad(
     IRBuilder<> &B, llvm::Type *Type, const clang::DeclaratorDecl *Decl,
     HLSLAppliedSemanticAttr *Semantic, std::optional<unsigned> Index) {
   if (CGM.getTarget().getTriple().isSPIRV())
-    return emitSPIRVUserSemanticLoad(B, Type, Semantic, Index);
+    return emitSPIRVUserSemanticLoad(B, Type, Decl, Semantic, Index);
 
   if (CGM.getTarget().getTriple().isDXIL())
     return emitDXILUserSemanticLoad(B, Type, Semantic, Index);
@@ -684,7 +697,7 @@ void CGHLSLRuntime::emitUserSemanticStore(IRBuilder<> &B, llvm::Value *Source,
                                           HLSLAppliedSemanticAttr *Semantic,
                                           std::optional<unsigned> Index) {
   if (CGM.getTarget().getTriple().isSPIRV())
-    return emitSPIRVUserSemanticStore(B, Source, Semantic, Index);
+    return emitSPIRVUserSemanticStore(B, Source, Decl, Semantic, Index);
 
   if (CGM.getTarget().getTriple().isDXIL())
     return emitDXILUserSemanticStore(B, Source, Semantic, Index);
@@ -693,8 +706,9 @@ void CGHLSLRuntime::emitUserSemanticStore(IRBuilder<> &B, llvm::Value *Source,
 }
 
 llvm::Value *CGHLSLRuntime::emitSystemSemanticLoad(
-    IRBuilder<> &B, llvm::Type *Type, const clang::DeclaratorDecl *Decl,
-    HLSLAppliedSemanticAttr *Semantic, std::optional<unsigned> Index) {
+    IRBuilder<> &B, const FunctionDecl *FD, llvm::Type *Type,
+    const clang::DeclaratorDecl *Decl, HLSLAppliedSemanticAttr *Semantic,
+    std::optional<unsigned> Index) {
 
   std::string SemanticName = Semantic->getAttrName()->getName().upper();
   if (SemanticName == "SV_GROUPINDEX") {
@@ -730,8 +744,12 @@ llvm::Value *CGHLSLRuntime::emitSystemSemanticLoad(
     return buildVectorInput(B, GroupIDIntrinsic, Type);
   }
 
+  const auto *ShaderAttr = FD->getAttr<HLSLShaderAttr>();
+  assert(ShaderAttr && "Entry point has no shader attribute");
+  llvm::Triple::EnvironmentType ST = ShaderAttr->getType();
+
   if (SemanticName == "SV_POSITION") {
-    if (CGM.getTriple().getEnvironment() == Triple::EnvironmentType::Pixel) {
+    if (ST == Triple::EnvironmentType::Pixel) {
       if (CGM.getTarget().getTriple().isSPIRV())
         return createSPIRVBuiltinLoad(B, CGM.getModule(), Type,
                                       Semantic->getAttrName()->getName(),
@@ -740,7 +758,7 @@ llvm::Value *CGHLSLRuntime::emitSystemSemanticLoad(
         return emitDXILUserSemanticLoad(B, Type, Semantic, Index);
     }
 
-    if (CGM.getTriple().getEnvironment() == Triple::EnvironmentType::Vertex) {
+    if (ST == Triple::EnvironmentType::Vertex) {
       return emitUserSemanticLoad(B, Type, Decl, Semantic, Index);
     }
   }
@@ -783,6 +801,11 @@ void CGHLSLRuntime::emitSystemSemanticStore(IRBuilder<> &B, llvm::Value *Source,
     }
   }
 
+  if (SemanticName == "SV_TARGET") {
+    emitUserSemanticStore(B, Source, Decl, Semantic, Index);
+    return;
+  }
+
   llvm_unreachable(
       "Store hasn't been implemented yet for this system semantic. FIXME");
 }
@@ -793,7 +816,7 @@ llvm::Value *CGHLSLRuntime::handleScalarSemanticLoad(
 
   std::optional<unsigned> Index = Semantic->getSemanticIndex();
   if (Semantic->getAttrName()->getName().starts_with_insensitive("SV_"))
-    return emitSystemSemanticLoad(B, Type, Decl, Semantic, Index);
+    return emitSystemSemanticLoad(B, FD, Type, Decl, Semantic, Index);
   return emitUserSemanticLoad(B, Type, Decl, Semantic, Index);
 }
 
@@ -816,8 +839,7 @@ CGHLSLRuntime::handleStructSemanticLoad(
   const llvm::StructType *ST = cast<StructType>(Type);
   const clang::RecordDecl *RD = Decl->getType()->getAsRecordDecl();
 
-  assert(std::distance(RD->field_begin(), RD->field_end()) ==
-         ST->getNumElements());
+  assert(RD->getNumFields() == ST->getNumElements());
 
   llvm::Value *Aggregate = llvm::PoisonValue::get(Type);
   auto FieldDecl = RD->field_begin();
@@ -849,8 +871,7 @@ CGHLSLRuntime::handleStructSemanticStore(
     RD = Decl->getType()->getAsRecordDecl();
   assert(RD);
 
-  assert(std::distance(RD->field_begin(), RD->field_end()) ==
-         ST->getNumElements());
+  assert(RD->getNumFields() == ST->getNumElements());
 
   auto FieldDecl = RD->field_begin();
   for (unsigned I = 0; I < ST->getNumElements(); ++I) {
@@ -1200,12 +1221,13 @@ std::optional<LValue> CGHLSLRuntime::emitResourceArraySubscriptExpr(
           ArraySubsExpr->getType()->isHLSLResourceRecordArray()) &&
          "expected resource array subscript expression");
 
-  // Let clang codegen handle local resource array subscripts,
+  // Let clang codegen handle local and static resource array subscripts,
   // or when the subscript references on opaque expression (as part of
   // ArrayInitLoopExpr AST node).
   const VarDecl *ArrayDecl =
       dyn_cast_or_null<VarDecl>(getArrayDecl(ArraySubsExpr));
-  if (!ArrayDecl || !ArrayDecl->hasGlobalStorage())
+  if (!ArrayDecl || !ArrayDecl->hasGlobalStorage() ||
+      ArrayDecl->getStorageClass() == SC_Static)
     return std::nullopt;
 
   // get the resource array type
@@ -1235,7 +1257,7 @@ std::optional<LValue> CGHLSLRuntime::emitResourceArraySubscriptExpr(
   // Find binding info for the resource array. For implicit binding
   // an HLSLResourceBindingAttr should have been added by SemaHLSL.
   ResourceBindingAttrs Binding(ArrayDecl);
-  assert((Binding.hasBinding()) &&
+  assert(Binding.hasBinding() &&
          "resource array must have a binding attribute");
 
   // Find the individual resource type.
@@ -1291,6 +1313,49 @@ std::optional<LValue> CGHLSLRuntime::emitResourceArraySubscriptExpr(
   return CGF.MakeAddrLValue(TmpVar, ResultTy, AlignmentSource::Decl);
 }
 
+// If RHSExpr is a global resource array, initialize all of its resources and
+// set them into LHS. Returns false if no copy has been performed and the
+// array copy should be handled by Clang codegen.
+bool CGHLSLRuntime::emitResourceArrayCopy(LValue &LHS, Expr *RHSExpr,
+                                          CodeGenFunction &CGF) {
+  QualType ResultTy = RHSExpr->getType();
+  assert(ResultTy->isHLSLResourceRecordArray() && "expected resource array");
+
+  // Let Clang codegen handle local and static resource array copies.
+  const VarDecl *ArrayDecl = dyn_cast_or_null<VarDecl>(getArrayDecl(RHSExpr));
+  if (!ArrayDecl || !ArrayDecl->hasGlobalStorage() ||
+      ArrayDecl->getStorageClass() == SC_Static)
+    return false;
+
+  // Find binding info for the resource array. For implicit binding
+  // the HLSLResourceBindingAttr should have been added by SemaHLSL.
+  ResourceBindingAttrs Binding(ArrayDecl);
+  assert(Binding.hasBinding() &&
+         "resource array must have a binding attribute");
+
+  // Find the individual resource type.
+  ASTContext &AST = ArrayDecl->getASTContext();
+  QualType ResTy = AST.getBaseElementType(ResultTy);
+  const auto *ResArrayTy = cast<ConstantArrayType>(ResultTy.getTypePtr());
+
+  // Use the provided LHS for the result.
+  AggValueSlot ValueSlot = AggValueSlot::forAddr(
+      LHS.getAddress(), Qualifiers(), AggValueSlot::IsDestructed_t(true),
+      AggValueSlot::DoesNotNeedGCBarriers, AggValueSlot::IsAliased_t(false),
+      AggValueSlot::DoesNotOverlap);
+
+  // Create Value for index and total array size (= range size).
+  int Size = getTotalArraySize(AST, ResArrayTy);
+  llvm::Value *Zero = llvm::ConstantInt::get(CGM.IntTy, 0);
+  llvm::Value *Range = llvm::ConstantInt::get(CGM.IntTy, Size);
+
+  // Initialize individual resources in the array into LHS.
+  std::optional<llvm::Value *> EndIndex = initializeLocalResourceArray(
+      CGF, ResTy->getAsCXXRecordDecl(), ResArrayTy, ValueSlot, Range, Zero,
+      ArrayDecl->getName(), Binding, {Zero}, RHSExpr->getExprLoc());
+  return EndIndex.has_value();
+}
+
 std::optional<LValue> CGHLSLRuntime::emitBufferArraySubscriptExpr(
     const ArraySubscriptExpr *E, CodeGenFunction &CGF,
     llvm::function_ref<llvm::Value *(bool Promote)> EmitIdxAfterBase) {
diff --git a/clang/lib/CodeGen/CGHLSLRuntime.h b/clang/lib/CodeGen/CGHLSLRuntime.h
index c883282a8d9c8..c7cd668419d10 100644
--- a/clang/lib/CodeGen/CGHLSLRuntime.h
+++ b/clang/lib/CodeGen/CGHLSLRuntime.h
@@ -179,7 +179,8 @@ class CGHLSLRuntime {
 protected:
   CodeGenModule &CGM;
 
-  llvm::Value *emitSystemSemanticLoad(llvm::IRBuilder<> &B, llvm::Type *Type,
+  llvm::Value *emitSystemSemanticLoad(llvm::IRBuilder<> &B,
+                                      const FunctionDecl *FD, llvm::Type *Type,
                                       const clang::DeclaratorDecl *Decl,
                                       HLSLAppliedSemanticAttr *Semantic,
                                       std::optional<unsigned> Index);
@@ -257,6 +258,7 @@ class CGHLSLRuntime {
   std::optional<LValue>
   emitResourceArraySubscriptExpr(const ArraySubscriptExpr *E,
                                  CodeGenFunction &CGF);
+  bool emitResourceArrayCopy(LValue &LHS, Expr *RHSExpr, CodeGenFunction &CGF);
 
   std::optional<LValue> emitBufferArraySubscriptExpr(
       const ArraySubscriptExpr *E, CodeGenFunction &CGF,
@@ -278,6 +280,7 @@ class CGHLSLRuntime {
                                    HLSLResourceBindingAttr *RBA);
 
   llvm::Value *emitSPIRVUserSemanticLoad(llvm::IRBuilder<> &B, llvm::Type *Type,
+                                         const clang::DeclaratorDecl *Decl,
                                          HLSLAppliedSemanticAttr *Semantic,
                                          std::optional<unsigned> Index);
   llvm::Value *emitDXILUserSemanticLoad(llvm::IRBuilder<> &B, llvm::Type *Type,
@@ -289,6 +292,7 @@ class CGHLSLRuntime {
                                     std::optional<unsigned> Index);
 
   void emitSPIRVUserSemanticStore(llvm::IRBuilder<> &B, llvm::Value *Source,
+                                  const clang::DeclaratorDecl *Decl,
                                   HLSLAppliedSemanticAttr *Semantic,
                                   std::optional<unsigned> Index);
   void emitDXILUserSemanticStore(llvm::IRBuilder<> &B, llvm::Value *Source,
diff --git a/clang/lib/CodeGen/CGOpenMPRuntime.cpp b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
index a8255ac74cfcf..9bd6da4a38df8 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntime.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntime.cpp
@@ -8634,6 +8634,15 @@ class MappableExprsHandler {
       if (llvm::is_contained(C->getMotionModifiers(),
                              OMPC_MOTION_MODIFIER_present))
         Kind = Present;
+      if (llvm::is_contained(C->getMotionModifiers(),
+                             OMPC_MOTION_MODIFIER_iterator)) {
+        if (auto *IteratorExpr = dyn_cast<OMPIteratorExpr>(
+                C->getIteratorModifier()->IgnoreParenImpCasts())) {
+          const auto *VD = cast<VarDecl>(IteratorExpr->getIteratorDecl(0));
+          CGF.EmitVarDecl(*VD);
+        }
+      }
+
       const auto *EI = C->getVarRefs().begin();
       for (const auto L : C->component_lists()) {
         InfoGen(std::get<0>(L), Kind, std::get<1>(L), OMPC_MAP_to, {},
@@ -8650,6 +8659,15 @@ class MappableExprsHandler {
       if (llvm::is_contained(C->getMotionModifiers(),
                              OMPC_MOTION_MODIFIER_present))
         Kind = Present;
+      if (llvm::is_contained(C->getMotionModifiers(),
+                             OMPC_MOTION_MODIFIER_iterator)) {
+        if (auto *IteratorExpr = dyn_cast<OMPIteratorExpr>(
+                C->getIteratorModifier()->IgnoreParenImpCasts())) {
+          const auto *VD = cast<VarDecl>(IteratorExpr->getIteratorDecl(0));
+          CGF.EmitVarDecl(*VD);
+        }
+      }
+
       const auto *EI = C->getVarRefs().begin();
       for (const auto L : C->component_lists()) {
         InfoGen(std::get<0>(L), Kind, std::get<1>(L), OMPC_MAP_from, {},
diff --git a/clang/lib/CodeGen/CodeGenFunction.cpp b/clang/lib/CodeGen/CodeGenFunction.cpp
index 64e594d09067b..ac25bd95f0463 100644
--- a/clang/lib/CodeGen/CodeGenFunction.cpp
+++ b/clang/lib/CodeGen/CodeGenFunction.cpp
@@ -2158,6 +2158,23 @@ void CodeGenFunction::EmitBranchOnBoolExpr(
   }
 }
 
+llvm::Value *CodeGenFunction::EmitScalarOrConstFoldImmArg(unsigned ICEArguments,
+                                                          unsigned Idx,
+                                                          const CallExpr *E) {
+  llvm::Value *Arg = nullptr;
+  if ((ICEArguments & (1 << Idx)) == 0) {
+    Arg = EmitScalarExpr(E->getArg(Idx));
+  } else {
+    // If this is required to be a constant, constant fold it so that we
+    // know that the generated intrinsic gets a ConstantInt.
+    std::optional<llvm::APSInt> Result =
+        E->getArg(Idx)->getIntegerConstantExpr(getContext());
+    assert(Result && "Expected argument to be a constant");
+    Arg = llvm::ConstantInt::get(getLLVMContext(), *Result);
+  }
+  return Arg;
+}
+
 /// ErrorUnsupported - Print out an error that codegen doesn't support the
 /// specified stmt yet.
 void CodeGenFunction::ErrorUnsupported(const Stmt *S, const char *Type) {
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 4789c6b26797f..a6a1b84e278b9 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -1635,6 +1635,22 @@ void CodeGenModule::EmitBackendOptionsMetadata(
     getModule().addModuleFlag(llvm::Module::Min, "SmallDataLimit",
                               CodeGenOpts.SmallDataLimit);
   }
+
+  // Set AllocToken configuration for backend pipeline.
+  if (LangOpts.AllocTokenMode) {
+    StringRef S = llvm::getAllocTokenModeAsString(*LangOpts.AllocTokenMode);
+    getModule().addModuleFlag(llvm::Module::Error, "alloc-token-mode",
+                              llvm::MDString::get(VMContext, S));
+  }
+  if (LangOpts.AllocTokenMax)
+    getModule().addModuleFlag(
+        llvm::Module::Error, "alloc-token-max",
+        llvm::ConstantInt::get(llvm::Type::getInt64Ty(VMContext),
+                               *LangOpts.AllocTokenMax));
+  if (CodeGenOpts.SanitizeAllocTokenFastABI)
+    getModule().addModuleFlag(llvm::Module::Error, "alloc-token-fast-abi", 1);
+  if (CodeGenOpts.SanitizeAllocTokenExtended)
+    getModule().addModuleFlag(llvm::Module::Error, "alloc-token-extended", 1);
 }
 
 void CodeGenModule::UpdateCompletedType(const TagDecl *TD) {
@@ -5960,7 +5976,8 @@ void CodeGenModule::EmitGlobalVarDefinition(const VarDecl *D,
              (D->getType()->isHLSLResourceRecord() ||
               D->getType()->isHLSLResourceRecordArray())) {
     Init = llvm::PoisonValue::get(getTypes().ConvertType(ASTTy));
-    NeedsGlobalCtor = D->getType()->isHLSLResourceRecord();
+    NeedsGlobalCtor = D->getType()->isHLSLResourceRecord() ||
+                      D->getStorageClass() == SC_Static;
   } else if (D->hasAttr<LoaderUninitializedAttr>()) {
     Init = llvm::UndefValue::get(getTypes().ConvertTypeForMem(ASTTy));
   } else if (!InitExpr) {
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index 81b3fe9e79483..eabdc370da6b4 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -343,23 +343,6 @@ void CodeGenFunction::ProcessOrderScopeAMDGCN(Value *Order, Value *Scope,
     SSID = getLLVMContext().getOrInsertSyncScopeID(SSN);
 }
 
-llvm::Value *CodeGenFunction::EmitScalarOrConstFoldImmArg(unsigned ICEArguments,
-                                                          unsigned Idx,
-                                                          const CallExpr *E) {
-  llvm::Value *Arg = nullptr;
-  if ((ICEArguments & (1 << Idx)) == 0) {
-    Arg = EmitScalarExpr(E->getArg(Idx));
-  } else {
-    // If this is required to be a constant, constant fold it so that we
-    // know that the generated intrinsic gets a ConstantInt.
-    std::optional<llvm::APSInt> Result =
-        E->getArg(Idx)->getIntegerConstantExpr(getContext());
-    assert(Result && "Expected argument to be a constant");
-    Arg = llvm::ConstantInt::get(getLLVMContext(), *Result);
-  }
-  return Arg;
-}
-
 void CodeGenFunction::AddAMDGPUFenceAddressSpaceMMRA(llvm::Instruction *Inst,
                                                      const CallExpr *E) {
   constexpr const char *Tag = "amdgpu-synchronize-as";
diff --git a/clang/lib/CodeGen/Targets/AMDGPU.cpp b/clang/lib/CodeGen/Targets/AMDGPU.cpp
index e4ad078dab197..0ab6c753b8bad 100644
--- a/clang/lib/CodeGen/Targets/AMDGPU.cpp
+++ b/clang/lib/CodeGen/Targets/AMDGPU.cpp
@@ -439,11 +439,8 @@ void AMDGPUTargetCodeGenInfo::setTargetAttributes(
     return;
 
   const FunctionDecl *FD = dyn_cast_or_null<FunctionDecl>(D);
-  if (FD) {
+  if (FD)
     setFunctionDeclAttributes(FD, F, M);
-    if (FD->hasAttr<DeviceKernelAttr>() && !M.getLangOpts().OpenCL)
-      F->setCallingConv(getDeviceKernelCallingConv());
-  }
   if (!getABIInfo().getCodeGenOpts().EmitIEEENaNCompliantInsts)
     F->addFnAttr("amdgpu-ieee", "false");
 }
diff --git a/clang/lib/CodeGen/Targets/NVPTX.cpp b/clang/lib/CodeGen/Targets/NVPTX.cpp
index f6715861d91bc..ba2acd821c704 100644
--- a/clang/lib/CodeGen/Targets/NVPTX.cpp
+++ b/clang/lib/CodeGen/Targets/NVPTX.cpp
@@ -276,9 +276,6 @@ void NVPTXTargetCodeGenInfo::setTargetAttributes(
         M.handleCUDALaunchBoundsAttr(F, Attr);
     }
   }
-  // Attach kernel metadata directly if compiling for NVPTX.
-  if (FD->hasAttr<DeviceKernelAttr>())
-    F->setCallingConv(getDeviceKernelCallingConv());
 }
 
 void NVPTXTargetCodeGenInfo::addNVVMMetadata(llvm::GlobalValue *GV,
diff --git a/clang/lib/CodeGen/Targets/SPIR.cpp b/clang/lib/CodeGen/Targets/SPIR.cpp
index 1a8c85d8871ec..ccc35a22d9938 100644
--- a/clang/lib/CodeGen/Targets/SPIR.cpp
+++ b/clang/lib/CodeGen/Targets/SPIR.cpp
@@ -77,8 +77,6 @@ class CommonSPIRTargetCodeGenInfo : public TargetCodeGenInfo {
   llvm::Constant *getNullPointer(const CodeGen::CodeGenModule &CGM,
                                  llvm::PointerType *T,
                                  QualType QT) const override;
-  void setTargetAttributes(const Decl *D, llvm::GlobalValue *GV,
-                           CodeGen::CodeGenModule &M) const override;
 };
 class SPIRVTargetCodeGenInfo : public CommonSPIRTargetCodeGenInfo {
 public:
@@ -292,22 +290,6 @@ CommonSPIRTargetCodeGenInfo::getNullPointer(const CodeGen::CodeGenModule &CGM,
       llvm::ConstantPointerNull::get(NPT), PT);
 }
 
-void CommonSPIRTargetCodeGenInfo::setTargetAttributes(
-    const Decl *D, llvm::GlobalValue *GV, CodeGen::CodeGenModule &M) const {
-  if (M.getLangOpts().OpenCL || GV->isDeclaration())
-    return;
-
-  const FunctionDecl *FD = dyn_cast<FunctionDecl>(D);
-  if (!FD)
-    return;
-
-  llvm::Function *F = dyn_cast<llvm::Function>(GV);
-  assert(F && "Expected GlobalValue to be a Function");
-
-  if (FD->hasAttr<DeviceKernelAttr>())
-    F->setCallingConv(getDeviceKernelCallingConv());
-}
-
 LangAS
 SPIRVTargetCodeGenInfo::getGlobalVarAddressSpace(CodeGenModule &CGM,
                                                  const VarDecl *D) const {
@@ -342,9 +324,6 @@ void SPIRVTargetCodeGenInfo::setTargetAttributes(
   llvm::Function *F = dyn_cast<llvm::Function>(GV);
   assert(F && "Expected GlobalValue to be a Function");
 
-  if (FD->hasAttr<DeviceKernelAttr>())
-    F->setCallingConv(getDeviceKernelCallingConv());
-
   if (!M.getLangOpts().HIP ||
       M.getTarget().getTriple().getVendor() != llvm::Triple::AMD)
     return;
diff --git a/clang/lib/CodeGen/Targets/Sparc.cpp b/clang/lib/CodeGen/Targets/Sparc.cpp
index 38dbebdec2429..3fa4e84823d51 100644
--- a/clang/lib/CodeGen/Targets/Sparc.cpp
+++ b/clang/lib/CodeGen/Targets/Sparc.cpp
@@ -26,23 +26,39 @@ class SparcV8ABIInfo : public DefaultABIInfo {
 
 private:
   ABIArgInfo classifyReturnType(QualType RetTy) const;
+  ABIArgInfo classifyArgumentType(QualType Ty) const;
   void computeInfo(CGFunctionInfo &FI) const override;
 };
 } // end anonymous namespace
 
+ABIArgInfo SparcV8ABIInfo::classifyReturnType(QualType Ty) const {
+  const auto *CT = Ty->getAs<ComplexType>();
+  const auto *BT = Ty->getAs<BuiltinType>();
+  if (CT)
+    BT = CT->getElementType()->getAs<BuiltinType>();
+  bool IsLongDouble = BT && BT->getKind() == BuiltinType::LongDouble;
 
-ABIArgInfo
-SparcV8ABIInfo::classifyReturnType(QualType Ty) const {
-  if (Ty->isAnyComplexType()) {
-    return ABIArgInfo::getDirect();
-  }
-  else {
-    return DefaultABIInfo::classifyReturnType(Ty);
-  }
+  // long double _Complex is special in that it should be marked as inreg.
+  if (CT)
+    return IsLongDouble ? ABIArgInfo::getDirectInReg()
+                        : ABIArgInfo::getDirect();
+
+  if (IsLongDouble)
+    return getNaturalAlignIndirect(Ty, getDataLayout().getAllocaAddrSpace(),
+                                   /*ByVal=*/false);
+
+  return DefaultABIInfo::classifyReturnType(Ty);
 }
 
-void SparcV8ABIInfo::computeInfo(CGFunctionInfo &FI) const {
+ABIArgInfo SparcV8ABIInfo::classifyArgumentType(QualType Ty) const {
+  if (const auto *BT = Ty->getAs<BuiltinType>();
+      BT && BT->getKind() == BuiltinType::LongDouble)
+    return getNaturalAlignIndirect(Ty, getDataLayout().getAllocaAddrSpace());
 
+  return DefaultABIInfo::classifyArgumentType(Ty);
+}
+
+void SparcV8ABIInfo::computeInfo(CGFunctionInfo &FI) const {
   FI.getReturnInfo() = classifyReturnType(FI.getReturnType());
   for (auto &Arg : FI.arguments())
     Arg.info = classifyArgumentType(Arg.type);
diff --git a/clang/lib/DependencyScanning/DependencyScannerImpl.cpp b/clang/lib/DependencyScanning/DependencyScannerImpl.cpp
index b17d6aec7263e..8ebbffee14c18 100644
--- a/clang/lib/DependencyScanning/DependencyScannerImpl.cpp
+++ b/clang/lib/DependencyScanning/DependencyScannerImpl.cpp
@@ -523,8 +523,8 @@ bool dependencies::initializeScanCompilerInstance(
   // Create a new FileManager to match the invocation's FileSystemOptions.
   ScanInstance.createFileManager();
 
-  // Use the dependency scanning optimized file system if requested to do so.
-  if (DepFS) {
+  // Use DepFS for getting the dependency directives if requested to do so.
+  if (Service.getMode() == ScanningMode::DependencyDirectivesScan) {
     DepFS->resetBypassedPathPrefix();
     SmallString<256> ModulesCachePath;
     normalizeModuleCachePath(ScanInstance.getFileManager(),
@@ -716,7 +716,7 @@ bool CompilerInstanceWithContext::initialize(DiagnosticConsumer *DC) {
   }
 
   std::tie(OverlayFS, CommandLine) = initVFSForByNameScanning(
-      Worker.BaseFS, CommandLine, CWD, "ScanningByName");
+      Worker.DepFS, CommandLine, CWD, "ScanningByName");
 
   DiagEngineWithCmdAndOpts = std::make_unique<DignosticsEngineWithDiagOpts>(
       CommandLine, OverlayFS, *DiagConsumer);
diff --git a/clang/lib/DependencyScanning/DependencyScanningWorker.cpp b/clang/lib/DependencyScanning/DependencyScanningWorker.cpp
index b22b0f456fd5c..fdb94212a2f40 100644
--- a/clang/lib/DependencyScanning/DependencyScanningWorker.cpp
+++ b/clang/lib/DependencyScanning/DependencyScanningWorker.cpp
@@ -17,7 +17,7 @@ using namespace dependencies;
 
 DependencyScanningWorker::DependencyScanningWorker(
     DependencyScanningService &Service,
-    llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem> FS)
+    llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem> BaseFS)
     : Service(Service) {
   PCHContainerOps = std::make_shared<PCHContainerOperations>();
   // We need to read object files from PCH built outside the scanner.
@@ -27,19 +27,11 @@ DependencyScanningWorker::DependencyScanningWorker(
   PCHContainerOps->registerWriter(std::make_unique<RawPCHContainerWriter>());
 
   if (Service.shouldTraceVFS())
-    FS = llvm::makeIntrusiveRefCnt<llvm::vfs::TracingFileSystem>(std::move(FS));
-
-  switch (Service.getMode()) {
-  case ScanningMode::DependencyDirectivesScan:
-    DepFS = llvm::makeIntrusiveRefCnt<DependencyScanningWorkerFilesystem>(
-        Service.getSharedCache(), FS);
-    BaseFS = DepFS;
-    break;
-  case ScanningMode::CanonicalPreprocessing:
-    DepFS = nullptr;
-    BaseFS = FS;
-    break;
-  }
+    BaseFS = llvm::makeIntrusiveRefCnt<llvm::vfs::TracingFileSystem>(
+        std::move(BaseFS));
+
+  DepFS = llvm::makeIntrusiveRefCnt<DependencyScanningWorkerFilesystem>(
+      Service.getSharedCache(), std::move(BaseFS));
 }
 
 DependencyScanningWorker::~DependencyScanningWorker() = default;
@@ -101,7 +93,18 @@ bool DependencyScanningWorker::scanDependencies(
     StringRef WorkingDirectory, const std::vector<std::string> &CommandLine,
     DependencyConsumer &Consumer, DependencyActionController &Controller,
     DiagnosticConsumer &DC,
-    llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem> FS) {
+    IntrusiveRefCntPtr<llvm::vfs::FileSystem> OverlayFS) {
+  IntrusiveRefCntPtr<llvm::vfs::FileSystem> FS = DepFS;
+  if (OverlayFS) {
+#ifndef NDEBUG
+    bool SawDepFS = false;
+    OverlayFS->visit(
+        [&](llvm::vfs::FileSystem &VFS) { SawDepFS |= &VFS == DepFS.get(); });
+    assert(SawDepFS && "OverlayFS not based on DepFS");
+#endif
+    FS = std::move(OverlayFS);
+  }
+
   DignosticsEngineWithDiagOpts DiagEngineWithCmdAndOpts(CommandLine, FS, DC);
   DependencyScanningAction Action(Service, WorkingDirectory, Consumer,
                                   Controller, DepFS);
@@ -157,13 +160,13 @@ bool DependencyScanningWorker::computeDependencies(
     DiagnosticConsumer &DC, std::optional<llvm::MemoryBufferRef> TUBuffer) {
   if (TUBuffer) {
     auto [FinalFS, FinalCommandLine] = initVFSForTUBufferScanning(
-        BaseFS, CommandLine, WorkingDirectory, *TUBuffer);
+        DepFS, CommandLine, WorkingDirectory, *TUBuffer);
     return scanDependencies(WorkingDirectory, FinalCommandLine, Consumer,
                             Controller, DC, FinalFS);
   } else {
-    BaseFS->setCurrentWorkingDirectory(WorkingDirectory);
+    DepFS->setCurrentWorkingDirectory(WorkingDirectory);
     return scanDependencies(WorkingDirectory, CommandLine, Consumer, Controller,
-                            DC, BaseFS);
+                            DC);
   }
 }
 
diff --git a/clang/lib/Driver/ToolChains/Linux.cpp b/clang/lib/Driver/ToolChains/Linux.cpp
index 2c741a38fce1a..cdbf21fb90263 100644
--- a/clang/lib/Driver/ToolChains/Linux.cpp
+++ b/clang/lib/Driver/ToolChains/Linux.cpp
@@ -922,7 +922,7 @@ SanitizerMask Linux::getSupportedSanitizers() const {
   if (IsX86_64 || IsMIPS64 || IsAArch64 || IsPowerPC64 || IsSystemZ ||
       IsLoongArch64 || IsRISCV64)
     Res |= SanitizerKind::Thread;
-  if (IsX86_64 || IsAArch64)
+  if (IsX86_64 || IsAArch64 || IsSystemZ)
     Res |= SanitizerKind::Type;
   if (IsX86_64 || IsSystemZ || IsPowerPC64)
     Res |= SanitizerKind::KernelMemory;
diff --git a/clang/lib/Edit/RewriteObjCFoundationAPI.cpp b/clang/lib/Edit/RewriteObjCFoundationAPI.cpp
index 40f8348241ecc..e8d4660fd36b2 100644
--- a/clang/lib/Edit/RewriteObjCFoundationAPI.cpp
+++ b/clang/lib/Edit/RewriteObjCFoundationAPI.cpp
@@ -1085,6 +1085,7 @@ static bool rewriteToNumericBoxedExpression(const ObjCMessageExpr *Msg,
       llvm_unreachable("OpenCL-specific cast in Objective-C?");
 
     case CK_HLSLVectorTruncation:
+    case CK_HLSLMatrixTruncation:
     case CK_HLSLElementwiseCast:
     case CK_HLSLAggregateSplatCast:
       llvm_unreachable("HLSL-specific cast in Objective-C?");
diff --git a/clang/lib/Format/ContinuationIndenter.cpp b/clang/lib/Format/ContinuationIndenter.cpp
index 9ab024a03fbd7..1272bb72d423f 100644
--- a/clang/lib/Format/ContinuationIndenter.cpp
+++ b/clang/lib/Format/ContinuationIndenter.cpp
@@ -240,6 +240,45 @@ RawStringFormatStyleManager::getEnclosingFunctionStyle(
   return It->second;
 }
 
+IndentationAndAlignment
+IndentationAndAlignment::addPadding(unsigned Spaces) const {
+  return IndentationAndAlignment(Total + Spaces, IndentedFrom);
+}
+
+IndentationAndAlignment
+IndentationAndAlignment::operator+(unsigned Spaces) const {
+  return IndentationAndAlignment(Total + Spaces, Total);
+}
+
+IndentationAndAlignment
+IndentationAndAlignment::operator-(unsigned Spaces) const {
+  return IndentationAndAlignment(Total - Spaces, Total);
+}
+
+IndentationAndAlignment &IndentationAndAlignment::operator+=(unsigned Spaces) {
+  *this = *this + Spaces;
+  return *this;
+}
+
+IndentationAndAlignment::IndentationAndAlignment(unsigned Total,
+                                                 unsigned IndentedFrom)
+    : Total(Total), IndentedFrom(IndentedFrom) {}
+
+IndentationAndAlignment::IndentationAndAlignment(unsigned Spaces)
+    : Total(Spaces), IndentedFrom(Spaces) {}
+
+bool IndentationAndAlignment::operator<(
+    const IndentationAndAlignment &Other) const {
+  if (Total != Other.Total)
+    return Total < Other.Total;
+  // The sign to use here was decided arbitrarily. This operator is mostly used
+  // when a line's indentation should be the max of 2 things. Using this sign
+  // here makes the program prefer alignment over continuation indentation. That
+  // is, it makes the alignment step that follows prefer to move the line when
+  // aligning the previous line.
+  return IndentedFrom > Other.IndentedFrom;
+}
+
 ContinuationIndenter::ContinuationIndenter(const FormatStyle &Style,
                                            const AdditionalKeywords &Keywords,
                                            const SourceManager &SourceMgr,
@@ -491,7 +530,7 @@ bool ContinuationIndenter::mustBreak(const LineState &State) {
     return true;
   }
 
-  unsigned NewLineColumn = getNewLineColumn(State);
+  unsigned NewLineColumn = getNewLineColumn(State).Total;
   if (Current.isMemberAccess() && Style.ColumnLimit != 0 &&
       State.Column + getLengthToNextOperator(Current) > Style.ColumnLimit &&
       (State.Column > NewLineColumn ||
@@ -819,8 +858,9 @@ void ContinuationIndenter::addTokenOnCurrentLine(LineState &State, bool DryRun,
   }
 
   if (Current.is(TT_SelectorName) && !CurrentState.ObjCSelectorNameFound) {
-    unsigned MinIndent = std::max(
-        State.FirstIndent + Style.ContinuationIndentWidth, CurrentState.Indent);
+    unsigned MinIndent =
+        std::max(State.FirstIndent + Style.ContinuationIndentWidth,
+                 CurrentState.Indent.Total);
     unsigned FirstColonPos = State.Column + Spaces + Current.ColumnWidth;
     if (Current.LongestObjCSelectorName == 0)
       CurrentState.AlignColons = false;
@@ -910,7 +950,8 @@ void ContinuationIndenter::addTokenOnCurrentLine(LineState &State, bool DryRun,
     return !Next || Next->isMemberAccess() ||
            Next->is(TT_FunctionDeclarationLParen) || IsFunctionCallParen(*Next);
   };
-  if (IsOpeningBracket(Previous) && State.Column > getNewLineColumn(State) &&
+  if (IsOpeningBracket(Previous) &&
+      State.Column > getNewLineColumn(State).Total &&
       // Don't do this for simple (no expressions) one-argument function calls
       // as that feels like needlessly wasting whitespace, e.g.:
       //
@@ -955,7 +996,7 @@ void ContinuationIndenter::addTokenOnCurrentLine(LineState &State, bool DryRun,
     CurrentState.NoLineBreak = true;
 
   if (startsSegmentOfBuilderTypeCall(Current) &&
-      State.Column > getNewLineColumn(State)) {
+      State.Column > getNewLineColumn(State).Total) {
     CurrentState.ContainsUnwrappedBuilder = true;
   }
 
@@ -1086,7 +1127,8 @@ unsigned ContinuationIndenter::addTokenOnNewLine(LineState &State,
     Penalty += Style.PenaltyBreakFirstLessLess;
   }
 
-  State.Column = getNewLineColumn(State);
+  const auto [TotalColumn, IndentedFromColumn] = getNewLineColumn(State);
+  State.Column = TotalColumn;
 
   // Add Penalty proportional to amount of whitespace away from FirstColumn
   // This tends to penalize several lines that are far-right indented,
@@ -1132,9 +1174,9 @@ unsigned ContinuationIndenter::addTokenOnNewLine(LineState &State,
       } else {
         CurrentState.ColonPos =
             (shouldIndentWrappedSelectorName(Style, State.Line->Type)
-                 ? std::max(CurrentState.Indent,
+                 ? std::max(CurrentState.Indent.Total,
                             State.FirstIndent + Style.ContinuationIndentWidth)
-                 : CurrentState.Indent) +
+                 : CurrentState.Indent.Total) +
             std::max(NextNonComment->LongestObjCSelectorName,
                      NextNonComment->ColumnWidth);
       }
@@ -1155,7 +1197,7 @@ unsigned ContinuationIndenter::addTokenOnNewLine(LineState &State,
     // when we consume all of the "}"'s FakeRParens at the "{".
     if (State.Stack.size() > 1) {
       State.Stack[State.Stack.size() - 2].LastSpace =
-          std::max(CurrentState.LastSpace, CurrentState.Indent) +
+          std::max(CurrentState.LastSpace, CurrentState.Indent.Total) +
           Style.ContinuationIndentWidth;
     }
   }
@@ -1196,7 +1238,8 @@ unsigned ContinuationIndenter::addTokenOnNewLine(LineState &State,
                                      State.Line->Type != LT_ImportStatement &&
                                      Current.isNot(TT_LineComment);
     Whitespaces.replaceWhitespace(Current, Newlines, State.Column, State.Column,
-                                  CurrentState.IsAligned, ContinuePPDirective);
+                                  CurrentState.IsAligned, ContinuePPDirective,
+                                  IndentedFromColumn);
   }
 
   if (!Current.isTrailingComment())
@@ -1340,7 +1383,8 @@ unsigned ContinuationIndenter::addTokenOnNewLine(LineState &State,
   return Penalty;
 }
 
-unsigned ContinuationIndenter::getNewLineColumn(const LineState &State) {
+IndentationAndAlignment
+ContinuationIndenter::getNewLineColumn(const LineState &State) {
   if (!State.NextToken || !State.NextToken->Previous)
     return 0;
 
@@ -1354,8 +1398,9 @@ unsigned ContinuationIndenter::getNewLineColumn(const LineState &State) {
 
   const FormatToken &Previous = *Current.Previous;
   // If we are continuing an expression, we want to use the continuation indent.
-  unsigned ContinuationIndent =
-      std::max(CurrentState.LastSpace, CurrentState.Indent) +
+  const auto ContinuationIndent =
+      std::max(IndentationAndAlignment(CurrentState.LastSpace),
+               CurrentState.Indent) +
       Style.ContinuationIndentWidth;
   const FormatToken *PreviousNonComment = Current.getPreviousNonComment();
   const FormatToken *NextNonComment = Previous.getNextNonComment();
@@ -1365,7 +1410,7 @@ unsigned ContinuationIndenter::getNewLineColumn(const LineState &State) {
   // Java specific bits.
   if (Style.isJava() &&
       Current.isOneOf(Keywords.kw_implements, Keywords.kw_extends)) {
-    return std::max(CurrentState.LastSpace,
+    return std::max(IndentationAndAlignment(CurrentState.LastSpace),
                     CurrentState.Indent + Style.ContinuationIndentWidth);
   }
 
@@ -1378,7 +1423,8 @@ unsigned ContinuationIndenter::getNewLineColumn(const LineState &State) {
 
   if (Style.BreakBeforeBraces == FormatStyle::BS_Whitesmiths &&
       State.Line->First->is(tok::kw_enum)) {
-    return (Style.IndentWidth * State.Line->First->IndentLevel) +
+    return IndentationAndAlignment(Style.IndentWidth *
+                                   State.Line->First->IndentLevel) +
            Style.IndentWidth;
   }
 
@@ -1497,7 +1543,7 @@ unsigned ContinuationIndenter::getNewLineColumn(const LineState &State) {
       //    * not remove the 'lead' ContinuationIndentWidth
       //    * always un-indent by the operator when
       //    BreakBeforeTernaryOperators=true
-      unsigned Indent = CurrentState.Indent;
+      unsigned Indent = CurrentState.Indent.Total;
       if (Style.AlignOperands != FormatStyle::OAS_DontAlign)
         Indent -= Style.ContinuationIndentWidth;
       if (Style.BreakBeforeTernaryOperators && CurrentState.UnindentOperator)
@@ -1537,14 +1583,16 @@ unsigned ContinuationIndenter::getNewLineColumn(const LineState &State) {
                                     TT_LeadingJavaAnnotation))) ||
       (!Style.IndentWrappedFunctionNames &&
        NextNonComment->isOneOf(tok::kw_operator, TT_FunctionDeclarationName))) {
-    return std::max(CurrentState.LastSpace, CurrentState.Indent);
+    return std::max(IndentationAndAlignment(CurrentState.LastSpace),
+                    CurrentState.Indent);
   }
   if (NextNonComment->is(TT_SelectorName)) {
     if (!CurrentState.ObjCSelectorNameFound) {
-      unsigned MinIndent = CurrentState.Indent;
+      auto MinIndent = CurrentState.Indent;
       if (shouldIndentWrappedSelectorName(Style, State.Line->Type)) {
-        MinIndent = std::max(MinIndent,
-                             State.FirstIndent + Style.ContinuationIndentWidth);
+        MinIndent =
+            std::max(MinIndent, IndentationAndAlignment(State.FirstIndent) +
+                                    Style.ContinuationIndentWidth);
       }
       // If LongestObjCSelectorName is 0, we are indenting the first
       // part of an ObjC selector (or a selector component which is
@@ -1555,10 +1603,10 @@ unsigned ContinuationIndenter::getNewLineColumn(const LineState &State) {
       // component of the ObjC selector.
       //
       // In either case, we want to respect Style.IndentWrappedFunctionNames.
-      return MinIndent +
-             std::max(NextNonComment->LongestObjCSelectorName,
-                      NextNonComment->ColumnWidth) -
-             NextNonComment->ColumnWidth;
+      return MinIndent.addPadding(
+          std::max(NextNonComment->LongestObjCSelectorName,
+                   NextNonComment->ColumnWidth) -
+          NextNonComment->ColumnWidth);
     }
     if (!CurrentState.AlignColons)
       return CurrentState.Indent;
@@ -1628,7 +1676,7 @@ unsigned ContinuationIndenter::getNewLineColumn(const LineState &State) {
     return CurrentState.Indent - NextNonComment->Tok.getLength() -
            NextNonComment->SpacesRequiredBefore;
   }
-  if (CurrentState.Indent == State.FirstIndent && PreviousNonComment &&
+  if (CurrentState.Indent.Total == State.FirstIndent && PreviousNonComment &&
       PreviousNonComment->isNoneOf(tok::r_brace, TT_CtorInitializerComma)) {
     // Ensure that we fall back to the continuation indent width instead of
     // just flushing continuations left.
@@ -1718,7 +1766,7 @@ unsigned ContinuationIndenter::moveStateToNextToken(LineState &State,
                                                   FormatStyle::BCIS_BeforeComma
                                               ? 0
                                               : 2);
-    CurrentState.NestedBlockIndent = CurrentState.Indent;
+    CurrentState.NestedBlockIndent = CurrentState.Indent.Total;
     if (Style.PackConstructorInitializers > FormatStyle::PCIS_BinPack) {
       CurrentState.AvoidBinPacking = true;
       CurrentState.BreakBeforeParameter =
@@ -1733,7 +1781,7 @@ unsigned ContinuationIndenter::moveStateToNextToken(LineState &State,
       Style.BreakConstructorInitializers == FormatStyle::BCIS_AfterColon) {
     CurrentState.Indent =
         State.FirstIndent + Style.ConstructorInitializerIndentWidth;
-    CurrentState.NestedBlockIndent = CurrentState.Indent;
+    CurrentState.NestedBlockIndent = CurrentState.Indent.Total;
     if (Style.PackConstructorInitializers > FormatStyle::PCIS_BinPack)
       CurrentState.AvoidBinPacking = true;
     else
@@ -1877,8 +1925,9 @@ void ContinuationIndenter::moveStatePastFakeLParens(LineState &State,
         (!Style.isTableGen() ||
          (Previous && Previous->isOneOf(TT_TableGenDAGArgListComma,
                                         TT_TableGenDAGArgListCommaToBreak)))) {
-      NewParenState.Indent = std::max(
-          std::max(State.Column, NewParenState.Indent), CurrentState.LastSpace);
+      NewParenState.Indent =
+          std::max({IndentationAndAlignment(State.Column), NewParenState.Indent,
+                    IndentationAndAlignment(CurrentState.LastSpace)});
     }
 
     // Special case for generic selection expressions, its comma-separated
@@ -1986,7 +2035,7 @@ void ContinuationIndenter::moveStatePastScopeOpener(LineState &State,
     return Prev->is(tok::comma);
   }(Current.MatchingParen);
 
-  unsigned NewIndent;
+  IndentationAndAlignment NewIndent = 0;
   unsigned LastSpace = CurrentState.LastSpace;
   bool AvoidBinPacking;
   bool BreakBeforeParameter = false;
@@ -1999,7 +2048,7 @@ void ContinuationIndenter::moveStatePastScopeOpener(LineState &State,
                   std::min(State.Column, CurrentState.NestedBlockIndent);
     } else if (Current.is(tok::l_brace)) {
       const auto Width = Style.BracedInitializerIndentWidth;
-      NewIndent = CurrentState.LastSpace +
+      NewIndent = IndentationAndAlignment(CurrentState.LastSpace) +
                   (Width < 0 ? Style.ContinuationIndentWidth : Width);
     } else {
       NewIndent = CurrentState.LastSpace + Style.ContinuationIndentWidth;
@@ -2014,9 +2063,9 @@ void ContinuationIndenter::moveStatePastScopeOpener(LineState &State,
     if (Current.ParameterCount > 1)
       NestedBlockIndent = std::max(NestedBlockIndent, State.Column + 1);
   } else {
-    NewIndent =
-        Style.ContinuationIndentWidth +
-        std::max(CurrentState.LastSpace, CurrentState.StartOfFunctionCall);
+    NewIndent = IndentationAndAlignment(std::max(
+                    CurrentState.LastSpace, CurrentState.StartOfFunctionCall)) +
+                Style.ContinuationIndentWidth;
 
     if (Style.isTableGen() && Current.is(TT_TableGenDAGArgOpenerToBreak) &&
         Style.TableGenBreakInsideDAGArg == FormatStyle::DAS_BreakElements) {
@@ -2035,7 +2084,7 @@ void ContinuationIndenter::moveStatePastScopeOpener(LineState &State,
     // FIXME: We likely want to do this for more combinations of brackets.
     if (Current.is(tok::less) && Current.ParentBracket == tok::l_paren) {
       NewIndent = std::max(NewIndent, CurrentState.Indent);
-      LastSpace = std::max(LastSpace, CurrentState.Indent);
+      LastSpace = std::max(LastSpace, CurrentState.Indent.Total);
     }
 
     // If ObjCBinPackProtocolList is unspecified, fall back to BinPackParameters
@@ -2281,7 +2330,7 @@ unsigned ContinuationIndenter::reformatRawStringLiteral(
   unsigned CurrentIndent =
       (!Newline && Current.Next && Current.Next->is(tok::r_paren))
           ? State.Stack.back().NestedBlockIndent
-          : State.Stack.back().Indent;
+          : State.Stack.back().Indent.Total;
   unsigned NextStartColumn = ContentStartsOnNewline
                                  ? CurrentIndent + Style.IndentWidth
                                  : FirstStartColumn;
diff --git a/clang/lib/Format/ContinuationIndenter.h b/clang/lib/Format/ContinuationIndenter.h
index fe957cf43721a..1554fb441dff0 100644
--- a/clang/lib/Format/ContinuationIndenter.h
+++ b/clang/lib/Format/ContinuationIndenter.h
@@ -43,6 +43,41 @@ struct RawStringFormatStyleManager {
   getEnclosingFunctionStyle(StringRef EnclosingFunction) const;
 };
 
+/// Represents the spaces at the start of a line, keeping track of what the
+/// spaces are for.
+struct IndentationAndAlignment {
+  unsigned Total;
+
+  /// The column that the position of the start of the line is calculated
+  /// from. It can be more than Total.
+  unsigned IndentedFrom;
+
+  /// Add spaces for right-justifying the token. The IndentedFrom field does not
+  /// change.
+  ///
+  /// This example in Objective-C shows why the field should not change.  The
+  /// token `xx` is right-justified with this method to align the `:`
+  /// symbols. The `:` symbols should remain aligned through the step that
+  /// aligns assignments. That step uses the IndentedFrom field to tell what
+  /// lines to move. Not changing the field in this method ensures that the 2
+  /// lines move together.
+  ///
+  /// [x //
+  ///     xxxx:0
+  ///       xx:0];
+  IndentationAndAlignment addPadding(unsigned Spaces) const;
+  /// Adding indentation is more common than padding. So the operator does that.
+  IndentationAndAlignment operator+(unsigned Spaces) const;
+  IndentationAndAlignment operator-(unsigned Spaces) const;
+  IndentationAndAlignment &operator+=(unsigned Spaces);
+
+  IndentationAndAlignment(unsigned Total, unsigned IndentedFrom);
+
+  IndentationAndAlignment(unsigned Spaces);
+
+  bool operator<(const IndentationAndAlignment &Other) const;
+};
+
 class ContinuationIndenter {
 public:
   /// Constructs a \c ContinuationIndenter to format \p Line starting in
@@ -168,7 +203,7 @@ class ContinuationIndenter {
   unsigned addTokenOnNewLine(LineState &State, bool DryRun);
 
   /// Calculate the new column for a line wrap before the next token.
-  unsigned getNewLineColumn(const LineState &State);
+  IndentationAndAlignment getNewLineColumn(const LineState &State);
 
   /// Adds a multiline token to the \p State.
   ///
@@ -195,10 +230,10 @@ class ContinuationIndenter {
 };
 
 struct ParenState {
-  ParenState(const FormatToken *Tok, unsigned Indent, unsigned LastSpace,
-             bool AvoidBinPacking, bool NoLineBreak)
+  ParenState(const FormatToken *Tok, IndentationAndAlignment Indent,
+             unsigned LastSpace, bool AvoidBinPacking, bool NoLineBreak)
       : Tok(Tok), Indent(Indent), LastSpace(LastSpace),
-        NestedBlockIndent(Indent), IsAligned(false),
+        NestedBlockIndent(Indent.Total), IsAligned(false),
         BreakBeforeClosingBrace(false), BreakBeforeClosingParen(false),
         BreakBeforeClosingAngle(false), AvoidBinPacking(AvoidBinPacking),
         BreakBeforeParameter(false), NoLineBreak(NoLineBreak),
@@ -219,7 +254,7 @@ struct ParenState {
 
   /// The position to which a specific parenthesis level needs to be
   /// indented.
-  unsigned Indent;
+  IndentationAndAlignment Indent;
 
   /// The position of the last space on each level.
   ///
@@ -356,8 +391,8 @@ struct ParenState {
   bool UnindentOperator : 1;
 
   bool operator<(const ParenState &Other) const {
-    if (Indent != Other.Indent)
-      return Indent < Other.Indent;
+    if (Indent.Total != Other.Indent.Total)
+      return Indent.Total < Other.Indent.Total;
     if (LastSpace != Other.LastSpace)
       return LastSpace < Other.LastSpace;
     if (NestedBlockIndent != Other.NestedBlockIndent)
@@ -406,7 +441,7 @@ struct ParenState {
       return IsWrappedConditional;
     if (UnindentOperator != Other.UnindentOperator)
       return UnindentOperator;
-    return false;
+    return Indent < Other.Indent;
   }
 };
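
A small sketch (not from the patch) of how Total and IndentedFrom diverge,
following the definitions added above: operator+ records the previous Total
as the new IndentedFrom, while addPadding right-justifies without touching
IndentedFrom.

  IndentationAndAlignment Indent(/*Spaces=*/4); // Total = 4, IndentedFrom = 4
  auto Continued = Indent + 2;                  // Total = 6, IndentedFrom = 4
  auto Deeper = Continued + 2;                  // Total = 8, IndentedFrom = 6
  auto Padded = Indent.addPadding(10);          // Total = 14, IndentedFrom = 4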
 
diff --git a/clang/lib/Format/FormatTokenLexer.cpp b/clang/lib/Format/FormatTokenLexer.cpp
index a9ea5ec9009c4..eb8658396ecde 100644
--- a/clang/lib/Format/FormatTokenLexer.cpp
+++ b/clang/lib/Format/FormatTokenLexer.cpp
@@ -1408,6 +1408,8 @@ FormatToken *FormatTokenLexer::getNextToken() {
       FormatTok->Tok.setKind(tok::identifier);
     } else if (Style.isTableGen() && !Keywords.isTableGenKeyword(*FormatTok)) {
       FormatTok->Tok.setKind(tok::identifier);
+    } else if (Style.isVerilog() && Keywords.isVerilogIdentifier(*FormatTok)) {
+      FormatTok->Tok.setKind(tok::identifier);
     }
   } else if (const bool Greater = FormatTok->is(tok::greatergreater);
              Greater || FormatTok->is(tok::lessless)) {
diff --git a/clang/lib/Format/UnwrappedLineFormatter.cpp b/clang/lib/Format/UnwrappedLineFormatter.cpp
index d31d656a63fc5..913789afd9919 100644
--- a/clang/lib/Format/UnwrappedLineFormatter.cpp
+++ b/clang/lib/Format/UnwrappedLineFormatter.cpp
@@ -1052,8 +1052,8 @@ static void markFinalized(FormatToken *Tok) {
 static void printLineState(const LineState &State) {
   llvm::dbgs() << "State: ";
   for (const ParenState &P : State.Stack) {
-    llvm::dbgs() << (P.Tok ? P.Tok->TokenText : "F") << "|" << P.Indent << "|"
-                 << P.LastSpace << "|" << P.NestedBlockIndent << " ";
+    llvm::dbgs() << (P.Tok ? P.Tok->TokenText : "F") << "|" << P.Indent.Total
+                 << "|" << P.LastSpace << "|" << P.NestedBlockIndent << " ";
   }
   llvm::dbgs() << State.NextToken->TokenText << "\n";
 }
@@ -1111,7 +1111,7 @@ class LineFormatter {
       const ParenState &P = State.Stack.back();
 
       int AdditionalIndent =
-          P.Indent - Previous.Children[0]->Level * Style.IndentWidth;
+          P.Indent.Total - Previous.Children[0]->Level * Style.IndentWidth;
       Penalty +=
           BlockFormatter->format(Previous.Children, DryRun, AdditionalIndent,
                                  /*FixBadIndentation=*/true);
diff --git a/clang/lib/Format/WhitespaceManager.cpp b/clang/lib/Format/WhitespaceManager.cpp
index 94ccf9eb7842a..805bb78f5c90e 100644
--- a/clang/lib/Format/WhitespaceManager.cpp
+++ b/clang/lib/Format/WhitespaceManager.cpp
@@ -35,13 +35,15 @@ WhitespaceManager::Change::Change(const FormatToken &Tok,
                                   bool CreateReplacement,
                                   SourceRange OriginalWhitespaceRange,
                                   int Spaces, unsigned StartOfTokenColumn,
+                                  unsigned IndentedFromColumn,
                                   unsigned NewlinesBefore,
                                   StringRef PreviousLinePostfix,
                                   StringRef CurrentLinePrefix, bool IsAligned,
                                   bool ContinuesPPDirective, bool IsInsideToken)
     : Tok(&Tok), CreateReplacement(CreateReplacement),
       OriginalWhitespaceRange(OriginalWhitespaceRange),
-      StartOfTokenColumn(StartOfTokenColumn), NewlinesBefore(NewlinesBefore),
+      StartOfTokenColumn(StartOfTokenColumn),
+      IndentedFromColumn(IndentedFromColumn), NewlinesBefore(NewlinesBefore),
       PreviousLinePostfix(PreviousLinePostfix),
       CurrentLinePrefix(CurrentLinePrefix), IsAligned(IsAligned),
       ContinuesPPDirective(ContinuesPPDirective), Spaces(Spaces),
@@ -53,13 +55,15 @@ WhitespaceManager::Change::Change(const FormatToken &Tok,
 void WhitespaceManager::replaceWhitespace(FormatToken &Tok, unsigned Newlines,
                                           unsigned Spaces,
                                           unsigned StartOfTokenColumn,
-                                          bool IsAligned, bool InPPDirective) {
+                                          bool IsAligned, bool InPPDirective,
+                                          unsigned IndentedFromColumn) {
   if (Tok.Finalized || (Tok.MacroCtx && Tok.MacroCtx->Role == MR_ExpandedArg))
     return;
   Tok.setDecision((Newlines > 0) ? FD_Break : FD_Continue);
   Changes.push_back(Change(Tok, /*CreateReplacement=*/true, Tok.WhitespaceRange,
-                           Spaces, StartOfTokenColumn, Newlines, "", "",
-                           IsAligned, InPPDirective && !Tok.IsFirst,
+                           Spaces, StartOfTokenColumn, IndentedFromColumn,
+                           Newlines, "", "", IsAligned,
+                           InPPDirective && !Tok.IsFirst,
                            /*IsInsideToken=*/false));
 }
 
@@ -67,11 +71,11 @@ void WhitespaceManager::addUntouchableToken(const FormatToken &Tok,
                                             bool InPPDirective) {
   if (Tok.Finalized || (Tok.MacroCtx && Tok.MacroCtx->Role == MR_ExpandedArg))
     return;
-  Changes.push_back(Change(Tok, /*CreateReplacement=*/false,
-                           Tok.WhitespaceRange, /*Spaces=*/0,
-                           Tok.OriginalColumn, Tok.NewlinesBefore, "", "",
-                           /*IsAligned=*/false, InPPDirective && !Tok.IsFirst,
-                           /*IsInsideToken=*/false));
+  Changes.push_back(Change(
+      Tok, /*CreateReplacement=*/false, Tok.WhitespaceRange, /*Spaces=*/0,
+      Tok.OriginalColumn, /*IndentedFromColumn=*/0, Tok.NewlinesBefore, "", "",
+      /*IsAligned=*/false, InPPDirective && !Tok.IsFirst,
+      /*IsInsideToken=*/false));
 }
 
 llvm::Error
@@ -95,7 +99,8 @@ void WhitespaceManager::replaceWhitespaceInToken(
   Changes.push_back(
       Change(Tok, /*CreateReplacement=*/true,
              SourceRange(Start, Start.getLocWithOffset(ReplaceChars)), Spaces,
-             std::max(0, Spaces), Newlines, PreviousPostfix, CurrentPrefix,
+             std::max(0, Spaces), /*IndentedFromColumn=*/0, Newlines,
+             PreviousPostfix, CurrentPrefix,
              /*IsAligned=*/true, InPPDirective && !Tok.IsFirst,
              /*IsInsideToken=*/true));
 }
@@ -287,6 +292,7 @@ AlignTokenSequence(const FormatStyle &Style, unsigned Start, unsigned End,
                    unsigned Column, bool RightJustify,
                    ArrayRef<unsigned> Matches,
                    SmallVector<WhitespaceManager::Change, 16> &Changes) {
+  unsigned OriginalMatchColumn = 0;
   int Shift = 0;
   // Set when the shift is applied anywhere in the line. Cleared when the line
   // ends.
@@ -330,21 +336,19 @@ AlignTokenSequence(const FormatStyle &Style, unsigned Start, unsigned End,
     // Keep track of the level that should not move with the aligned token.
     if (ScopeStack.size() == 1u && CurrentChange.NewlinesBefore != 0u &&
         CurrentChange.indentAndNestingLevel() > ScopeStack[0] &&
-        !CurrentChange.IsAligned) {
+        CurrentChange.IndentedFromColumn < OriginalMatchColumn) {
       ScopeStack.push_back(CurrentChange.indentAndNestingLevel());
     }
 
     bool InsideNestedScope =
         !ScopeStack.empty() &&
-        CurrentChange.indentAndNestingLevel() > ScopeStack[0];
-    bool ContinuedStringLiteral = i > Start &&
-                                  CurrentChange.Tok->is(tok::string_literal) &&
-                                  Changes[i - 1].Tok->is(tok::string_literal);
-    bool SkipMatchCheck = InsideNestedScope || ContinuedStringLiteral;
+        (CurrentChange.indentAndNestingLevel() > ScopeStack[0] ||
+         (CurrentChange.indentAndNestingLevel() == ScopeStack[0] &&
+          CurrentChange.IndentedFromColumn >= OriginalMatchColumn));
 
     if (CurrentChange.NewlinesBefore > 0) {
       LineShifted = false;
-      if (!SkipMatchCheck)
+      if (!InsideNestedScope)
         Shift = 0;
     }
 
@@ -352,6 +356,7 @@ AlignTokenSequence(const FormatStyle &Style, unsigned Start, unsigned End,
     // spaces it has to be shifted, so the rest of the changes on the line are
     // shifted by the same amount
     if (!Matches.empty() && Matches[0] == i) {
+      OriginalMatchColumn = CurrentChange.StartOfTokenColumn;
       Shift = Column - (RightJustify ? CurrentChange.TokenLength : 0) -
               CurrentChange.StartOfTokenColumn;
       ScopeStack = {CurrentChange.indentAndNestingLevel()};
@@ -365,7 +370,7 @@ AlignTokenSequence(const FormatStyle &Style, unsigned Start, unsigned End,
     // not in a scope that should not move.
     if ((!Matches.empty() && Matches[0] == i) ||
         (ScopeStack.size() == 1u && CurrentChange.NewlinesBefore > 0 &&
-         (ContinuedStringLiteral || InsideNestedScope))) {
+         InsideNestedScope)) {
       LineShifted = true;
       CurrentChange.Spaces += Shift;
     }
@@ -623,7 +628,7 @@ static unsigned AlignTokens(const FormatStyle &Style, F &&Matches,
         // int k  = bar(   | We still want to align the =  | int k = bar(
         //     argument1,  | here, even if we can't move   |     argument1,
         //     argument2); | the following lines.          |     argument2);
-        if (static_cast<unsigned>(Change.Spaces) < ChangeWidthStart)
+        if (Change.IndentedFromColumn < ChangeWidthStart)
           break;
         CurrentChangeWidthRight = Change.Spaces - ChangeWidthStart;
       } else {
diff --git a/clang/lib/Format/WhitespaceManager.h b/clang/lib/Format/WhitespaceManager.h
index 6d18db74cd2e4..3e6fa9dc32978 100644
--- a/clang/lib/Format/WhitespaceManager.h
+++ b/clang/lib/Format/WhitespaceManager.h
@@ -49,9 +49,15 @@ class WhitespaceManager {
   /// \p StartOfTokenColumn is the column at which the token will start after
   /// this replacement. It is needed for determining how \p Spaces is turned
   /// into tabs and spaces for some format styles.
+  ///
+  /// \p IndentedFromColumn is only used when the replacement starts a new
+  /// line. It should be the column that the position of the line is derived
+  /// from. It is used for determining what lines the alignment process should
+  /// move.
   void replaceWhitespace(FormatToken &Tok, unsigned Newlines, unsigned Spaces,
                          unsigned StartOfTokenColumn, bool IsAligned = false,
-                         bool InPPDirective = false);
+                         bool InPPDirective = false,
+                         unsigned IndentedFromColumn = 0);
 
   /// Adds information about an unchangeable token's whitespace.
   ///
@@ -104,13 +110,15 @@ class WhitespaceManager {
     /// \p PreviousLinePostfix, \p NewlinesBefore line breaks, \p Spaces spaces
     /// and \p CurrentLinePrefix.
     ///
-    /// \p StartOfTokenColumn and \p InPPDirective will be used to lay out
-    /// trailing comments and escaped newlines.
+    /// \p StartOfTokenColumn and \p ContinuesPPDirective will be used to lay
+    /// out trailing comments and escaped newlines. \p IndentedFromColumn will
+    /// be used to continue aligned lines.
     Change(const FormatToken &Tok, bool CreateReplacement,
            SourceRange OriginalWhitespaceRange, int Spaces,
-           unsigned StartOfTokenColumn, unsigned NewlinesBefore,
-           StringRef PreviousLinePostfix, StringRef CurrentLinePrefix,
-           bool IsAligned, bool ContinuesPPDirective, bool IsInsideToken);
+           unsigned StartOfTokenColumn, unsigned IndentedFromColumn,
+           unsigned NewlinesBefore, StringRef PreviousLinePostfix,
+           StringRef CurrentLinePrefix, bool IsAligned,
+           bool ContinuesPPDirective, bool IsInsideToken);
 
     // The kind of the token whose whitespace this change replaces, or in which
     // this change inserts whitespace.
@@ -123,6 +131,11 @@ class WhitespaceManager {
     // FormatToken around to query its information.
     SourceRange OriginalWhitespaceRange;
     unsigned StartOfTokenColumn;
+    // Only used when the token is at the start of a line. The column that the
+    // position of the line is derived from. The alignment procedure moves the
+    // line when it moves a token in the same unwrapped line that is to the left
+    // of said column.
+    unsigned IndentedFromColumn;
     unsigned NewlinesBefore;
     std::string PreviousLinePostfix;
     std::string CurrentLinePrefix;
diff --git a/clang/lib/Headers/avx512bwintrin.h b/clang/lib/Headers/avx512bwintrin.h
index 67e8461560b04..48b7c98df7b68 100644
--- a/clang/lib/Headers/avx512bwintrin.h
+++ b/clang/lib/Headers/avx512bwintrin.h
@@ -178,22 +178,22 @@ _kadd_mask64(__mmask64 __A, __mmask64 __B) {
 #define _kshiftri_mask64(A, I) \
   ((__mmask64)__builtin_ia32_kshiftridi((__mmask64)(A), (unsigned int)(I)))
 
-static __inline__ unsigned int __DEFAULT_FN_ATTRS
-_cvtmask32_u32(__mmask32 __A) {
+static __inline__ unsigned int
+    __DEFAULT_FN_ATTRS_CONSTEXPR _cvtmask32_u32(__mmask32 __A) {
   return (unsigned int)__builtin_ia32_kmovd((__mmask32)__A);
 }
 
-static __inline__ unsigned long long __DEFAULT_FN_ATTRS
+static __inline__ unsigned long long __DEFAULT_FN_ATTRS_CONSTEXPR
 _cvtmask64_u64(__mmask64 __A) {
   return (unsigned long long)__builtin_ia32_kmovq((__mmask64)__A);
 }
 
-static __inline__ __mmask32 __DEFAULT_FN_ATTRS
+static __inline__ __mmask32 __DEFAULT_FN_ATTRS_CONSTEXPR
 _cvtu32_mask32(unsigned int __A) {
   return (__mmask32)__builtin_ia32_kmovd((__mmask32)__A);
 }
 
-static __inline__ __mmask64 __DEFAULT_FN_ATTRS
+static __inline__ __mmask64 __DEFAULT_FN_ATTRS_CONSTEXPR
 _cvtu64_mask64(unsigned long long __A) {
   return (__mmask64)__builtin_ia32_kmovq((__mmask64)__A);
 }
diff --git a/clang/lib/Headers/avx512dqintrin.h b/clang/lib/Headers/avx512dqintrin.h
index f200b22f27ee1..ae02cdd47af2e 100644
--- a/clang/lib/Headers/avx512dqintrin.h
+++ b/clang/lib/Headers/avx512dqintrin.h
@@ -123,12 +123,12 @@ _kadd_mask16(__mmask16 __A, __mmask16 __B) {
 #define _kshiftri_mask8(A, I) \
   ((__mmask8)__builtin_ia32_kshiftriqi((__mmask8)(A), (unsigned int)(I)))
 
-static __inline__ unsigned int __DEFAULT_FN_ATTRS
-_cvtmask8_u32(__mmask8 __A) {
+static __inline__ unsigned int
+    __DEFAULT_FN_ATTRS_CONSTEXPR _cvtmask8_u32(__mmask8 __A) {
   return (unsigned int)__builtin_ia32_kmovb((__mmask8)__A);
 }
 
-static __inline__ __mmask8 __DEFAULT_FN_ATTRS
+static __inline__ __mmask8 __DEFAULT_FN_ATTRS_CONSTEXPR
 _cvtu32_mask8(unsigned int __A) {
   return (__mmask8)__builtin_ia32_kmovb((__mmask8)__A);
 }
diff --git a/clang/lib/Headers/avx512fintrin.h b/clang/lib/Headers/avx512fintrin.h
index 806a13c414c10..02282cbccf05d 100644
--- a/clang/lib/Headers/avx512fintrin.h
+++ b/clang/lib/Headers/avx512fintrin.h
@@ -207,9 +207,7 @@ _mm512_undefined(void)
   return (__m512)__builtin_ia32_undef512();
 }
 
-static __inline__ __m512 __DEFAULT_FN_ATTRS512
-_mm512_undefined_ps(void)
-{
+static __inline__ __m512 __DEFAULT_FN_ATTRS512 _mm512_undefined_ps(void) {
   return (__m512)__builtin_ia32_undef512();
 }
 
@@ -3489,44 +3487,38 @@ _mm512_mask_cvtepu32lo_pd(__m512d __W, __mmask8 __U, __m512i __A) {
                                            (__v8sf)_mm256_setzero_ps(), \
                                            (__mmask8)(U), (int)(R)))
 
-static __inline__ __m256 __DEFAULT_FN_ATTRS512
-_mm512_cvtpd_ps (__m512d __A)
-{
-  return (__m256) __builtin_ia32_cvtpd2ps512_mask ((__v8df) __A,
-                (__v8sf) _mm256_undefined_ps (),
-                (__mmask8) -1,
-                _MM_FROUND_CUR_DIRECTION);
+static __inline__ __m256
+    __DEFAULT_FN_ATTRS512_CONSTEXPR _mm512_cvtpd_ps(__m512d __A) {
+  return (__m256)__builtin_ia32_cvtpd2ps512_mask(
+      (__v8df)__A, (__v8sf)_mm256_setzero_ps(), (__mmask8)-1,
+      _MM_FROUND_CUR_DIRECTION);
 }
 
-static __inline__ __m256 __DEFAULT_FN_ATTRS512
-_mm512_mask_cvtpd_ps (__m256 __W, __mmask8 __U, __m512d __A)
-{
+static __inline__ __m256 __DEFAULT_FN_ATTRS512_CONSTEXPR
+_mm512_mask_cvtpd_ps(__m256 __W, __mmask8 __U, __m512d __A) {
   return (__m256) __builtin_ia32_cvtpd2ps512_mask ((__v8df) __A,
                 (__v8sf) __W,
                 (__mmask8) __U,
                 _MM_FROUND_CUR_DIRECTION);
 }
 
-static __inline__ __m256 __DEFAULT_FN_ATTRS512
-_mm512_maskz_cvtpd_ps (__mmask8 __U, __m512d __A)
-{
+static __inline__ __m256 __DEFAULT_FN_ATTRS512_CONSTEXPR
+_mm512_maskz_cvtpd_ps(__mmask8 __U, __m512d __A) {
   return (__m256) __builtin_ia32_cvtpd2ps512_mask ((__v8df) __A,
                 (__v8sf) _mm256_setzero_ps (),
                 (__mmask8) __U,
                 _MM_FROUND_CUR_DIRECTION);
 }
 
-static __inline__ __m512 __DEFAULT_FN_ATTRS512
-_mm512_cvtpd_pslo (__m512d __A)
-{
+static __inline__ __m512 __DEFAULT_FN_ATTRS512_CONSTEXPR
+_mm512_cvtpd_pslo(__m512d __A) {
   return (__m512) __builtin_shufflevector((__v8sf) _mm512_cvtpd_ps(__A),
                 (__v8sf) _mm256_setzero_ps (),
                 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
 }
 
-static __inline__ __m512 __DEFAULT_FN_ATTRS512
-_mm512_mask_cvtpd_pslo (__m512 __W, __mmask8 __U,__m512d __A)
-{
+static __inline__ __m512 __DEFAULT_FN_ATTRS512_CONSTEXPR
+_mm512_mask_cvtpd_pslo(__m512 __W, __mmask8 __U, __m512d __A) {
   return (__m512) __builtin_shufflevector (
                 (__v8sf) _mm512_mask_cvtpd_ps (_mm512_castps512_ps256(__W),
                                                __U, __A),
@@ -8069,12 +8061,12 @@ _mm512_kxor(__mmask16 __A, __mmask16 __B) {
 #define _kshiftri_mask16(A, I) \
   ((__mmask16)__builtin_ia32_kshiftrihi((__mmask16)(A), (unsigned int)(I)))
 
-static __inline__ unsigned int __DEFAULT_FN_ATTRS
-_cvtmask16_u32(__mmask16 __A) {
+static __inline__ unsigned int
+    __DEFAULT_FN_ATTRS_CONSTEXPR _cvtmask16_u32(__mmask16 __A) {
   return (unsigned int)__builtin_ia32_kmovw((__mmask16)__A);
 }
 
-static __inline__ __mmask16 __DEFAULT_FN_ATTRS
+static __inline__ __mmask16 __DEFAULT_FN_ATTRS_CONSTEXPR
 _cvtu32_mask16(unsigned int __A) {
   return (__mmask16)__builtin_ia32_kmovw((__mmask16)__A);
 }
@@ -8654,18 +8646,16 @@ _mm512_mask_compressstoreu_epi32 (void *__P, __mmask16 __U, __m512i __A)
                                               (__v4sf)_mm_setzero_ps(), \
                                               (__mmask8)(U), (int)(R)))
 
-static __inline__ __m128 __DEFAULT_FN_ATTRS128
-_mm_mask_cvtsd_ss (__m128 __W, __mmask8 __U, __m128 __A, __m128d __B)
-{
+static __inline__ __m128 __DEFAULT_FN_ATTRS128_CONSTEXPR
+_mm_mask_cvtsd_ss(__m128 __W, __mmask8 __U, __m128 __A, __m128d __B) {
   return __builtin_ia32_cvtsd2ss_round_mask ((__v4sf)__A,
                                              (__v2df)__B,
                                              (__v4sf)__W,
                                              (__mmask8)__U, _MM_FROUND_CUR_DIRECTION);
 }
 
-static __inline__ __m128 __DEFAULT_FN_ATTRS128
-_mm_maskz_cvtsd_ss (__mmask8 __U, __m128 __A, __m128d __B)
-{
+static __inline__ __m128 __DEFAULT_FN_ATTRS128_CONSTEXPR
+_mm_maskz_cvtsd_ss(__mmask8 __U, __m128 __A, __m128d __B) {
   return __builtin_ia32_cvtsd2ss_round_mask ((__v4sf)__A,
                                              (__v2df)__B,
                                              (__v4sf)_mm_setzero_ps(),
diff --git a/clang/lib/Headers/avx512vlintrin.h b/clang/lib/Headers/avx512vlintrin.h
index 388f99d812312..6e2b19c8a2f7b 100644
--- a/clang/lib/Headers/avx512vlintrin.h
+++ b/clang/lib/Headers/avx512vlintrin.h
@@ -1791,30 +1791,30 @@ _mm256_maskz_cvtpd_epi32 (__mmask8 __U, __m256d __A) {
                                              (__v4si)_mm_setzero_si128());
 }
 
-static __inline__ __m128 __DEFAULT_FN_ATTRS128
-_mm_mask_cvtpd_ps (__m128 __W, __mmask8 __U, __m128d __A) {
+static __inline__ __m128 __DEFAULT_FN_ATTRS128_CONSTEXPR
+_mm_mask_cvtpd_ps(__m128 __W, __mmask8 __U, __m128d __A) {
   return (__m128) __builtin_ia32_cvtpd2ps_mask ((__v2df) __A,
             (__v4sf) __W,
             (__mmask8) __U);
 }
 
-static __inline__ __m128 __DEFAULT_FN_ATTRS128
-_mm_maskz_cvtpd_ps (__mmask8 __U, __m128d __A) {
+static __inline__ __m128 __DEFAULT_FN_ATTRS128_CONSTEXPR
+_mm_maskz_cvtpd_ps(__mmask8 __U, __m128d __A) {
   return (__m128) __builtin_ia32_cvtpd2ps_mask ((__v2df) __A,
             (__v4sf)
             _mm_setzero_ps (),
             (__mmask8) __U);
 }
 
-static __inline__ __m128 __DEFAULT_FN_ATTRS256
-_mm256_mask_cvtpd_ps (__m128 __W, __mmask8 __U, __m256d __A) {
+static __inline__ __m128 __DEFAULT_FN_ATTRS256_CONSTEXPR
+_mm256_mask_cvtpd_ps(__m128 __W, __mmask8 __U, __m256d __A) {
   return (__m128)__builtin_ia32_selectps_128((__mmask8)__U,
                                              (__v4sf)_mm256_cvtpd_ps(__A),
                                              (__v4sf)__W);
 }
 
-static __inline__ __m128 __DEFAULT_FN_ATTRS256
-_mm256_maskz_cvtpd_ps (__mmask8 __U, __m256d __A) {
+static __inline__ __m128 __DEFAULT_FN_ATTRS256_CONSTEXPR
+_mm256_maskz_cvtpd_ps(__mmask8 __U, __m256d __A) {
   return (__m128)__builtin_ia32_selectps_128((__mmask8)__U,
                                              (__v4sf)_mm256_cvtpd_ps(__A),
                                              (__v4sf)_mm_setzero_ps());
diff --git a/clang/lib/Headers/avxintrin.h b/clang/lib/Headers/avxintrin.h
index 54a6e0cd73ab9..9b45bc3e56bdb 100644
--- a/clang/lib/Headers/avxintrin.h
+++ b/clang/lib/Headers/avxintrin.h
@@ -2186,9 +2186,8 @@ _mm256_cvtepi32_ps(__m256i __a) {
 /// \param __a
 ///    A 256-bit vector of [4 x double].
 /// \returns A 128-bit vector of [4 x float] containing the converted values.
-static __inline __m128 __DEFAULT_FN_ATTRS
-_mm256_cvtpd_ps(__m256d __a)
-{
+static __inline __m128 __DEFAULT_FN_ATTRS_CONSTEXPR
+_mm256_cvtpd_ps(__m256d __a) {
   return (__m128)__builtin_ia32_cvtpd2ps256((__v4df) __a);
 }
 
@@ -3606,9 +3605,7 @@ _mm256_undefined_pd(void)
 /// This intrinsic has no corresponding instruction.
 ///
 /// \returns A 256-bit vector of [8 x float] containing undefined values.
-static __inline__ __m256 __DEFAULT_FN_ATTRS
-_mm256_undefined_ps(void)
-{
+static __inline__ __m256 __DEFAULT_FN_ATTRS _mm256_undefined_ps(void) {
   return (__m256)__builtin_ia32_undef256();
 }
 
diff --git a/clang/lib/Headers/emmintrin.h b/clang/lib/Headers/emmintrin.h
index 9d71c69878e47..1ca7097cd170a 100644
--- a/clang/lib/Headers/emmintrin.h
+++ b/clang/lib/Headers/emmintrin.h
@@ -1278,7 +1278,8 @@ static __inline__ int __DEFAULT_FN_ATTRS _mm_ucomineq_sd(__m128d __a,
 ///    A 128-bit vector of [2 x double].
 /// \returns A 128-bit vector of [4 x float] whose lower 64 bits contain the
 ///    converted values. The upper 64 bits are set to zero.
-static __inline__ __m128 __DEFAULT_FN_ATTRS _mm_cvtpd_ps(__m128d __a) {
+static __inline__ __m128 __DEFAULT_FN_ATTRS_CONSTEXPR
+_mm_cvtpd_ps(__m128d __a) {
   return __builtin_ia32_cvtpd2ps((__v2df)__a);
 }
 
@@ -1383,8 +1384,8 @@ static __inline__ int __DEFAULT_FN_ATTRS _mm_cvtsd_si32(__m128d __a) {
 /// \returns A 128-bit vector of [4 x float]. The lower 32 bits contain the
 ///    converted value from the second parameter. The upper 96 bits are copied
 ///    from the upper 96 bits of the first parameter.
-static __inline__ __m128 __DEFAULT_FN_ATTRS _mm_cvtsd_ss(__m128 __a,
-                                                         __m128d __b) {
+static __inline__ __m128 __DEFAULT_FN_ATTRS_CONSTEXPR
+_mm_cvtsd_ss(__m128 __a, __m128d __b) {
   return (__m128)__builtin_ia32_cvtsd2ss((__v4sf)__a, (__v2df)__b);
 }
 
diff --git a/clang/lib/Lex/HeaderSearch.cpp b/clang/lib/Lex/HeaderSearch.cpp
index f05c28fd7a123..b2ed24f765dab 100644
--- a/clang/lib/Lex/HeaderSearch.cpp
+++ b/clang/lib/Lex/HeaderSearch.cpp
@@ -881,6 +881,66 @@ diagnoseFrameworkInclude(DiagnosticsEngine &Diags, SourceLocation IncludeLoc,
         << IncludeFilename;
 }
 
+void HeaderSearch::diagnoseHeaderShadowing(
+    StringRef Filename, OptionalFileEntryRef FE, bool &DiagnosedShadowing,
+    SourceLocation IncludeLoc, ConstSearchDirIterator FromDir,
+    ArrayRef<std::pair<OptionalFileEntryRef, DirectoryEntryRef>> Includers,
+    bool isAngled, int IncluderLoopIndex, ConstSearchDirIterator MainLoopIt) {
+
+  if (Diags.isIgnored(diag::warn_header_shadowing, IncludeLoc) ||
+      DiagnosedShadowing)
+    return;
+  // Ignore diagnostics from system headers.
+  if (MainLoopIt && MainLoopIt->isSystemHeaderDirectory())
+    return;
+
+  DiagnosedShadowing = true;
+
+  // Indicates that the file was first found in the includer's directory.
+  if (!MainLoopIt) {
+    for (size_t i = IncluderLoopIndex + 1; i < Includers.size(); ++i) {
+      const auto &IncluderAndDir = Includers[i];
+      SmallString<1024> TmpDir = IncluderAndDir.second.getName();
+      llvm::sys::path::append(TmpDir, Filename);
+      if (auto File = getFileMgr().getFileRef(TmpDir, false, false)) {
+        if (&File->getFileEntry() == *FE)
+          continue;
+        Diags.Report(IncludeLoc, diag::warn_header_shadowing)
+            << Filename << (*FE).getDir().getName()
+            << IncluderAndDir.second.getName();
+        return;
+      } else {
+        llvm::errorToErrorCode(File.takeError());
+      }
+    }
+  }
+
+  // Continue searching in the regular search paths
+  ConstSearchDirIterator It =
+      isAngled ? angled_dir_begin() : search_dir_begin();
+  if (MainLoopIt) {
+    It = std::next(MainLoopIt);
+  } else if (FromDir) {
+    It = FromDir;
+  }
+  for (; It != search_dir_end(); ++It) {
+    // Suppress check for system headers, as duplicates are often intentional.
+    if (It->getDirCharacteristic() != SrcMgr::C_User)
+      continue;
+    SmallString<1024> TmpPath = It->getName();
+    llvm::sys::path::append(TmpPath, Filename);
+    if (auto File = getFileMgr().getFileRef(TmpPath, false, false)) {
+      if (&File->getFileEntry() == *FE)
+        continue;
+      Diags.Report(IncludeLoc, diag::warn_header_shadowing)
+          << Filename << (*FE).getDir().getName() << It->getName();
+      return;
+    } else {
+      llvm::errorToErrorCode(File.takeError());
+    }
+  }
+}
+
 /// LookupFile - Given a "foo" or \<foo> reference, look up the indicated file,
 /// return null on failure.  isAngled indicates whether the file reference is
 /// for system \#include's or not (i.e. using <> instead of ""). Includers, if
@@ -930,6 +990,7 @@ OptionalFileEntryRef HeaderSearch::LookupFile(
   // This is the header that MSVC's header search would have found.
   ModuleMap::KnownHeader MSSuggestedModule;
   OptionalFileEntryRef MSFE;
+  bool DiagnosedShadowing = false;
 
   // Check to see if the file is in the #includer's directory. This cannot be
   // based on CurDir, because each includer could be a #include of a
@@ -963,6 +1024,9 @@ OptionalFileEntryRef HeaderSearch::LookupFile(
       if (OptionalFileEntryRef FE = getFileAndSuggestModule(
               TmpDir, IncludeLoc, IncluderAndDir.second, IncluderIsSystemHeader,
               RequestingModule, SuggestedModule)) {
+        diagnoseHeaderShadowing(Filename, FE, DiagnosedShadowing, IncludeLoc,
+                                FromDir, Includers, isAngled,
+                                &IncluderAndDir - Includers.begin(), nullptr);
         if (!Includer) {
           assert(First && "only first includer can have no file");
           return FE;
@@ -1097,6 +1161,9 @@ OptionalFileEntryRef HeaderSearch::LookupFile(
     if (!File)
       continue;
 
+    diagnoseHeaderShadowing(Filename, File, DiagnosedShadowing, IncludeLoc,
+                            FromDir, Includers, isAngled, -1, It);
+
     CurDir = It;
 
     IncludeNames[*File] = Filename;
diff --git a/clang/lib/Lex/PPDirectives.cpp b/clang/lib/Lex/PPDirectives.cpp
index 891c8ab7f3155..764a893eebe3c 100644
--- a/clang/lib/Lex/PPDirectives.cpp
+++ b/clang/lib/Lex/PPDirectives.cpp
@@ -1397,11 +1397,12 @@ void Preprocessor::HandleDirective(Token &Result) {
       return HandleIdentSCCSDirective(Result);
     case tok::pp_sccs:
       return HandleIdentSCCSDirective(Result);
-    case tok::pp_embed:
-      return HandleEmbedDirective(SavedHash.getLocation(), Result,
-                                  getCurrentFileLexer()
-                                      ? *getCurrentFileLexer()->getFileEntry()
-                                      : static_cast<FileEntry *>(nullptr));
+    case tok::pp_embed: {
+      if (PreprocessorLexer *CurrentFileLexer = getCurrentFileLexer())
+        if (OptionalFileEntryRef FERef = CurrentFileLexer->getFileEntry())
+          return HandleEmbedDirective(SavedHash.getLocation(), Result, *FERef);
+      return HandleEmbedDirective(SavedHash.getLocation(), Result, nullptr);
+    }
     case tok::pp_assert:
       //isExtension = true;  // FIXME: implement #assert
       break;
diff --git a/clang/lib/Parse/ParseOpenMP.cpp b/clang/lib/Parse/ParseOpenMP.cpp
index 3b69c286634bb..15c3f7594bf44 100644
--- a/clang/lib/Parse/ParseOpenMP.cpp
+++ b/clang/lib/Parse/ParseOpenMP.cpp
@@ -4925,19 +4925,28 @@ bool Parser::ParseOpenMPVarList(OpenMPDirectiveKind DKind,
         break;
       Data.MotionModifiers.push_back(Modifier);
       Data.MotionModifiersLoc.push_back(Tok.getLocation());
-      ConsumeToken();
-      if (Modifier == OMPC_MOTION_MODIFIER_mapper) {
-        IsInvalidMapperModifier = parseMapperModifier(Data);
-        if (IsInvalidMapperModifier)
+      if (PP.getSpelling(Tok) == "iterator" && getLangOpts().OpenMP >= 51) {
+        ExprResult Tail;
+        Tail = ParseOpenMPIteratorsExpr();
+        Tail = Actions.ActOnFinishFullExpr(Tail.get(), T.getOpenLocation(),
+                                           /*DiscardedValue=*/false);
+        if (Tail.isUsable())
+          Data.IteratorExpr = Tail.get();
+      } else {
+        ConsumeToken();
+        if (Modifier == OMPC_MOTION_MODIFIER_mapper) {
+          IsInvalidMapperModifier = parseMapperModifier(Data);
+          if (IsInvalidMapperModifier)
+            break;
+        }
+        // OpenMP < 5.1 doesn't permit a ',' or additional modifiers.
+        if (getLangOpts().OpenMP < 51)
           break;
+        // OpenMP 5.1 accepts an optional ',' even if the next character is ':'.
+        // TODO: Is that intentional?
+        if (Tok.is(tok::comma))
+          ConsumeToken();
       }
-      // OpenMP < 5.1 doesn't permit a ',' or additional modifiers.
-      if (getLangOpts().OpenMP < 51)
-        break;
-      // OpenMP 5.1 accepts an optional ',' even if the next character is ':'.
-      // TODO: Is that intentional?
-      if (Tok.is(tok::comma))
-        ConsumeToken();
     }
     if (!Data.MotionModifiers.empty() && Tok.isNot(tok::colon)) {
       if (!IsInvalidMapperModifier) {
diff --git a/clang/lib/Parse/ParseTentative.cpp b/clang/lib/Parse/ParseTentative.cpp
index 82f2294ff5bb7..9622a00687ca5 100644
--- a/clang/lib/Parse/ParseTentative.cpp
+++ b/clang/lib/Parse/ParseTentative.cpp
@@ -1063,7 +1063,7 @@ Parser::isCXXDeclarationSpecifier(ImplicitTypenameContext AllowImplicitTypename,
       return TPResult::False;
     }
 
-    if (Next.isNot(tok::coloncolon) && Next.isNot(tok::less)) {
+    if (Next.isNoneOf(tok::coloncolon, tok::less, tok::colon)) {
       // Determine whether this is a valid expression. If not, we will hit
       // a parse error one way or another. In that case, tell the caller that
       // this is ambiguous. Typo-correct to type and expression keywords and
diff --git a/clang/lib/Parse/Parser.cpp b/clang/lib/Parse/Parser.cpp
index a6fc676f23a51..7b425dd3dda43 100644
--- a/clang/lib/Parse/Parser.cpp
+++ b/clang/lib/Parse/Parser.cpp
@@ -1100,30 +1100,25 @@ Parser::DeclGroupPtrTy Parser::ParseDeclOrFunctionDefInternal(
   // C99 6.7.2.3p6: Handle "struct-or-union identifier;", "enum { X };"
   // declaration-specifiers init-declarator-list[opt] ';'
   if (Tok.is(tok::semi)) {
-    auto LengthOfTSTToken = [](DeclSpec::TST TKind) {
-      assert(DeclSpec::isDeclRep(TKind));
-      switch(TKind) {
-      case DeclSpec::TST_class:
-        return 5;
-      case DeclSpec::TST_struct:
-        return 6;
-      case DeclSpec::TST_union:
-        return 5;
-      case DeclSpec::TST_enum:
-        return 4;
-      case DeclSpec::TST_interface:
-        return 9;
-      default:
-        llvm_unreachable("we only expect to get the length of the class/struct/union/enum");
+    // Suggest correct location to fix '[[attrib]] struct' to 'struct
+    // [[attrib]]'
+    SourceLocation CorrectLocationForAttributes{};
+    TypeSpecifierType TKind = DS.getTypeSpecType();
+    if (DeclSpec::isDeclRep(TKind)) {
+      if (TKind == DeclSpec::TST_enum) {
+        if (const auto *ED = dyn_cast_or_null<EnumDecl>(DS.getRepAsDecl())) {
+          CorrectLocationForAttributes =
+              PP.getLocForEndOfToken(ED->getEnumKeyRange().getEnd());
+        }
       }
-
-    };
-    // Suggest correct location to fix '[[attrib]] struct' to 'struct [[attrib]]'
-    SourceLocation CorrectLocationForAttributes =
-        DeclSpec::isDeclRep(DS.getTypeSpecType())
-            ? DS.getTypeSpecTypeLoc().getLocWithOffset(
-                  LengthOfTSTToken(DS.getTypeSpecType()))
-            : SourceLocation();
+      if (CorrectLocationForAttributes.isInvalid()) {
+        const auto &Policy = Actions.getASTContext().getPrintingPolicy();
+        unsigned Offset =
+            StringRef(DeclSpec::getSpecifierName(TKind, Policy)).size();
+        CorrectLocationForAttributes =
+            DS.getTypeSpecTypeLoc().getLocWithOffset(Offset);
+      }
+    }
     ProhibitAttributes(Attrs, CorrectLocationForAttributes);
     ConsumeToken();
     RecordDecl *AnonRecord = nullptr;
diff --git a/clang/lib/Sema/AnalysisBasedWarnings.cpp b/clang/lib/Sema/AnalysisBasedWarnings.cpp
index 43d2b9a829545..1ac217293ba4a 100644
--- a/clang/lib/Sema/AnalysisBasedWarnings.cpp
+++ b/clang/lib/Sema/AnalysisBasedWarnings.cpp
@@ -36,7 +36,6 @@
 #include "clang/Analysis/Analyses/UnsafeBufferUsage.h"
 #include "clang/Analysis/AnalysisDeclContext.h"
 #include "clang/Analysis/CFG.h"
-#include "clang/Analysis/CFGStmtMap.h"
 #include "clang/Analysis/FlowSensitive/DataflowWorklist.h"
 #include "clang/Basic/Diagnostic.h"
 #include "clang/Basic/DiagnosticSema.h"
diff --git a/clang/lib/Sema/CodeCompleteConsumer.cpp b/clang/lib/Sema/CodeCompleteConsumer.cpp
index e3fc7c11f4594..50a552272f421 100644
--- a/clang/lib/Sema/CodeCompleteConsumer.cpp
+++ b/clang/lib/Sema/CodeCompleteConsumer.cpp
@@ -539,8 +539,7 @@ unsigned CodeCompleteConsumer::OverloadCandidate::getNumParams() const {
     return Template->getTemplateParameters()->size();
 
   if (Kind == CK_Aggregate) {
-    unsigned Count =
-        std::distance(AggregateType->field_begin(), AggregateType->field_end());
+    unsigned Count = AggregateType->getNumFields();
     if (const auto *CRD = dyn_cast<CXXRecordDecl>(AggregateType))
       Count += CRD->getNumBases();
     return Count;
diff --git a/clang/lib/Sema/Sema.cpp b/clang/lib/Sema/Sema.cpp
index 1541b2cc95d8c..d32d7b960288d 100644
--- a/clang/lib/Sema/Sema.cpp
+++ b/clang/lib/Sema/Sema.cpp
@@ -1497,6 +1497,9 @@ void Sema::ActOnEndOfTranslationUnit() {
 
   if (LangOpts.HLSL)
     HLSL().ActOnEndOfTranslationUnit(getASTContext().getTranslationUnitDecl());
+  if (LangOpts.OpenACC)
+    OpenACC().ActOnEndOfTranslationUnit(
+        getASTContext().getTranslationUnitDecl());
 
   // If there were errors, disable 'unused' warnings since they will mostly be
   // noise. Don't warn for a use from a module: either we should warn on all
diff --git a/clang/lib/Sema/SemaCUDA.cpp b/clang/lib/Sema/SemaCUDA.cpp
index 31735a0f5feb3..dd9bcab56b083 100644
--- a/clang/lib/Sema/SemaCUDA.cpp
+++ b/clang/lib/Sema/SemaCUDA.cpp
@@ -52,16 +52,94 @@ bool SemaCUDA::PopForceHostDevice() {
 ExprResult SemaCUDA::ActOnExecConfigExpr(Scope *S, SourceLocation LLLLoc,
                                          MultiExprArg ExecConfig,
                                          SourceLocation GGGLoc) {
-  FunctionDecl *ConfigDecl = getASTContext().getcudaConfigureCallDecl();
+  bool IsDeviceKernelCall = false;
+  switch (CurrentTarget()) {
+  case CUDAFunctionTarget::Global:
+  case CUDAFunctionTarget::Device:
+    IsDeviceKernelCall = true;
+    break;
+  case CUDAFunctionTarget::HostDevice:
+    if (getLangOpts().CUDAIsDevice) {
+      IsDeviceKernelCall = true;
+      if (FunctionDecl *Caller =
+              SemaRef.getCurFunctionDecl(/*AllowLambda=*/true);
+          Caller && isImplicitHostDeviceFunction(Caller)) {
+        // During device compilation, a config call inside an HD function should
+        // be treated as a device kernel call. But for implicit HD functions
+        // (such as lambdas), we need to check whether RDC is enabled.
+        if (!getLangOpts().GPURelocatableDeviceCode)
+          IsDeviceKernelCall = false;
+        // HIP doesn't support device-side kernel calls yet. Still treat this
+        // as a host-side kernel call.
+        if (getLangOpts().HIP)
+          IsDeviceKernelCall = false;
+      }
+    }
+    break;
+  default:
+    break;
+  }
+
+  if (IsDeviceKernelCall && getLangOpts().HIP)
+    return ExprError(
+        Diag(LLLLoc, diag::err_cuda_device_kernel_launch_not_supported));
+
+  if (IsDeviceKernelCall && !getLangOpts().GPURelocatableDeviceCode)
+    return ExprError(
+        Diag(LLLLoc, diag::err_cuda_device_kernel_launch_require_rdc));
+
+  FunctionDecl *ConfigDecl = IsDeviceKernelCall
+                                 ? getASTContext().getcudaLaunchDeviceDecl()
+                                 : getASTContext().getcudaConfigureCallDecl();
   if (!ConfigDecl)
     return ExprError(Diag(LLLLoc, diag::err_undeclared_var_use)
-                     << getConfigureFuncName());
+                     << (IsDeviceKernelCall ? getLaunchDeviceFuncName()
+                                            : getConfigureFuncName()));
+  // Additional check on the launch function if it's a device kernel call.
+  if (IsDeviceKernelCall) {
+    auto *GetParamBuf = getASTContext().getcudaGetParameterBufferDecl();
+    if (!GetParamBuf)
+      return ExprError(Diag(LLLLoc, diag::err_undeclared_var_use)
+                       << getGetParameterBufferFuncName());
+  }
+
   QualType ConfigQTy = ConfigDecl->getType();
 
   DeclRefExpr *ConfigDR = new (getASTContext()) DeclRefExpr(
       getASTContext(), ConfigDecl, false, ConfigQTy, VK_LValue, LLLLoc);
   SemaRef.MarkFunctionReferenced(LLLLoc, ConfigDecl);
 
+  if (IsDeviceKernelCall) {
+    SmallVector<Expr *> Args;
+    // Use a null pointer as the kernel function, which may not be resolvable
+    // here. For example, resolving that kernel function may need additional
+    // kernel arguments.
+    llvm::APInt Zero(SemaRef.Context.getTypeSize(SemaRef.Context.IntTy), 0);
+    Args.push_back(IntegerLiteral::Create(SemaRef.Context, Zero,
+                                          SemaRef.Context.IntTy, LLLLoc));
+    // Use a null pointer as a placeholder for the parameter buffer, which
+    // will be replaced with the actual allocation later, during codegen.
+    Args.push_back(IntegerLiteral::Create(SemaRef.Context, Zero,
+                                          SemaRef.Context.IntTy, LLLLoc));
+    // Add the original config arguments.
+    llvm::append_range(Args, ExecConfig);
+    // Add the default blockDim if it's missing.
+    if (Args.size() < 4) {
+      llvm::APInt One(SemaRef.Context.getTypeSize(SemaRef.Context.IntTy), 1);
+      Args.push_back(IntegerLiteral::Create(SemaRef.Context, One,
+                                            SemaRef.Context.IntTy, LLLLoc));
+    }
+    // Add the default sharedMemSize if it's missing.
+    if (Args.size() < 5)
+      Args.push_back(IntegerLiteral::Create(SemaRef.Context, Zero,
+                                            SemaRef.Context.IntTy, LLLLoc));
+    // Add the default stream if it's missing.
+    if (Args.size() < 6)
+      Args.push_back(new (SemaRef.Context) CXXNullPtrLiteralExpr(
+          SemaRef.Context.NullPtrTy, LLLLoc));
+    return SemaRef.BuildCallExpr(S, ConfigDR, LLLLoc, Args, GGGLoc, nullptr,
+                                 /*IsExecConfig=*/true);
+  }
   return SemaRef.BuildCallExpr(S, ConfigDR, LLLLoc, ExecConfig, GGGLoc, nullptr,
                                /*IsExecConfig=*/true);
 }
@@ -246,12 +324,12 @@ SemaCUDA::IdentifyPreference(const FunctionDecl *Caller,
       CalleeTarget == CUDAFunctionTarget::InvalidTarget)
     return CFP_Never;
 
-  // (a) Can't call global from some contexts until we support CUDA's
-  // dynamic parallelism.
+  // (a) Calling a global function from global or device contexts is allowed
+  // as part of CUDA's dynamic parallelism support.
   if (CalleeTarget == CUDAFunctionTarget::Global &&
       (CallerTarget == CUDAFunctionTarget::Global ||
        CallerTarget == CUDAFunctionTarget::Device))
-    return CFP_Never;
+    return CFP_Native;
 
   // (b) Calling HostDevice is OK for everyone.
   if (CalleeTarget == CUDAFunctionTarget::HostDevice)
@@ -279,7 +357,8 @@ SemaCUDA::IdentifyPreference(const FunctionDecl *Caller,
   if (CallerTarget == CUDAFunctionTarget::HostDevice) {
     // It's OK to call a compilation-mode matching function from an HD one.
     if ((getLangOpts().CUDAIsDevice &&
-         CalleeTarget == CUDAFunctionTarget::Device) ||
+         (CalleeTarget == CUDAFunctionTarget::Device ||
+          CalleeTarget == CUDAFunctionTarget::Global)) ||
         (!getLangOpts().CUDAIsDevice &&
          (CalleeTarget == CUDAFunctionTarget::Host ||
           CalleeTarget == CUDAFunctionTarget::Global)))
@@ -1103,6 +1182,14 @@ std::string SemaCUDA::getConfigureFuncName() const {
   return "cudaConfigureCall";
 }
 
+std::string SemaCUDA::getGetParameterBufferFuncName() const {
+  return "cudaGetParameterBuffer";
+}
+
+std::string SemaCUDA::getLaunchDeviceFuncName() const {
+  return "cudaLaunchDevice";
+}
+
 // Record any local constexpr variables that are passed one way on the host
 // and another on the device.
 void SemaCUDA::recordPotentialODRUsedVariable(
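For context, a minimal sketch (CUDA C++, illustrative only and not part of this
patch) of the device-side launch pattern the SemaCUDA changes above are meant to
accept; the kernel names are invented, and the sketch assumes relocatable device
code is enabled, since err_cuda_device_kernel_launch_require_rdc fires otherwise:

  __global__ void child(int *data) { data[threadIdx.x] += 1; }

  __global__ void parent(int *data) {
    // A config call from a global (or device) context is now CFP_Native
    // rather than CFP_Never, i.e. CUDA dynamic parallelism.
    child<<<1, 32>>>(data);
  }
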
diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index 0ffb4854ba86d..58de9fe48162b 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -37,6 +37,7 @@
 #include "clang/AST/TemplateBase.h"
 #include "clang/AST/TemplateName.h"
 #include "clang/AST/Type.h"
+#include "clang/AST/TypeBase.h"
 #include "clang/AST/TypeLoc.h"
 #include "clang/AST/UnresolvedSet.h"
 #include "clang/Basic/AddressSpaces.h"
@@ -12570,9 +12571,10 @@ void Sema::CheckImplicitConversion(Expr *E, QualType T, SourceLocation CC,
       if (SourceMgr.isInSystemMacro(CC))
         return;
       return DiagnoseImpCast(*this, E, T, CC, diag::warn_impcast_vector_scalar);
-    } else if (getLangOpts().HLSL &&
-               Target->castAs<VectorType>()->getNumElements() <
-                   Source->castAs<VectorType>()->getNumElements()) {
+    }
+    if (getLangOpts().HLSL &&
+        Target->castAs<VectorType>()->getNumElements() <
+            Source->castAs<VectorType>()->getNumElements()) {
       // Diagnose vector truncation but don't return. We may also want to
       // diagnose an element conversion.
       DiagnoseImpCast(*this, E, T, CC,
@@ -12588,9 +12590,22 @@ void Sema::CheckImplicitConversion(Expr *E, QualType T, SourceLocation CC,
     Source = cast<VectorType>(Source)->getElementType().getTypePtr();
     Target = cast<VectorType>(Target)->getElementType().getTypePtr();
   }
-  if (auto VecTy = dyn_cast<VectorType>(Target))
+  if (const auto *VecTy = dyn_cast<VectorType>(Target))
     Target = VecTy->getElementType().getTypePtr();
 
+  if (isa<ConstantMatrixType>(Source)) {
+    if (Target->isScalarType())
+      return DiagnoseImpCast(*this, E, T, CC, diag::warn_impcast_matrix_scalar);
+
+    if (getLangOpts().HLSL &&
+        Target->castAs<ConstantMatrixType>()->getNumElementsFlattened() <
+            Source->castAs<ConstantMatrixType>()->getNumElementsFlattened()) {
+      // Diagnose matrix truncation but don't return. We may also want to
+      // diagnose an element conversion.
+      DiagnoseImpCast(*this, E, T, CC,
+                      diag::warn_hlsl_impcast_matrix_truncation);
+    }
+  }
   // Strip complex types.
   if (isa<ComplexType>(Source)) {
     if (!isa<ComplexType>(Target)) {
diff --git a/clang/lib/Sema/SemaDecl.cpp b/clang/lib/Sema/SemaDecl.cpp
index 7ad1f17263c4d..4b74b4c0354b2 100644
--- a/clang/lib/Sema/SemaDecl.cpp
+++ b/clang/lib/Sema/SemaDecl.cpp
@@ -58,6 +58,7 @@
 #include "clang/Sema/SemaSwift.h"
 #include "clang/Sema/SemaWasm.h"
 #include "clang/Sema/Template.h"
+#include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/STLForwardCompat.h"
 #include "llvm/ADT/ScopeExit.h"
 #include "llvm/ADT/SmallPtrSet.h"
@@ -2901,6 +2902,10 @@ static bool mergeDeclAttribute(Sema &S, NamedDecl *D,
   else if (const auto *FMA = dyn_cast<FormatMatchesAttr>(Attr))
     NewAttr = S.mergeFormatMatchesAttr(
         D, *FMA, FMA->getType(), FMA->getFormatIdx(), FMA->getFormatString());
+  else if (const auto *MFA = dyn_cast<ModularFormatAttr>(Attr))
+    NewAttr = S.mergeModularFormatAttr(
+        D, *MFA, MFA->getModularImplFn(), MFA->getImplName(),
+        MutableArrayRef<StringRef>{MFA->aspects_begin(), MFA->aspects_size()});
   else if (const auto *SA = dyn_cast<SectionAttr>(Attr))
     NewAttr = S.mergeSectionAttr(D, *SA, SA->getName());
   else if (const auto *CSA = dyn_cast<CodeSegAttr>(Attr))
@@ -7217,6 +7222,11 @@ static void checkLifetimeBoundAttr(Sema &S, NamedDecl &ND) {
   }
 }
 
+static void checkModularFormatAttr(Sema &S, NamedDecl &ND) {
+  if (ND.hasAttr<ModularFormatAttr>() && !ND.hasAttr<FormatAttr>())
+    S.Diag(ND.getLocation(), diag::err_modular_format_attribute_no_format);
+}
+
 static void checkAttributesAfterMerging(Sema &S, NamedDecl &ND) {
   // Ensure that an auto decl is deduced otherwise the checks below might cache
   // the wrong linkage.
@@ -7229,6 +7239,7 @@ static void checkAttributesAfterMerging(Sema &S, NamedDecl &ND) {
   checkHybridPatchableAttr(S, ND);
   checkInheritableAttr(S, ND);
   checkLifetimeBoundAttr(S, ND);
+  checkModularFormatAttr(S, ND);
 }
 
 static void checkDLLAttributeRedeclaration(Sema &S, NamedDecl *OldDecl,
@@ -11046,14 +11057,30 @@ Sema::ActOnFunctionDeclarator(Scope *S, Declarator &D, DeclContext *DC,
   }
 
   if (getLangOpts().CUDA) {
-    IdentifierInfo *II = NewFD->getIdentifier();
-    if (II && II->isStr(CUDA().getConfigureFuncName()) &&
-        !NewFD->isInvalidDecl() &&
-        NewFD->getDeclContext()->getRedeclContext()->isTranslationUnit()) {
-      if (!R->castAs<FunctionType>()->getReturnType()->isScalarType())
-        Diag(NewFD->getLocation(), diag::err_config_scalar_return)
-            << CUDA().getConfigureFuncName();
-      Context.setcudaConfigureCallDecl(NewFD);
+    if (IdentifierInfo *II = NewFD->getIdentifier()) {
+      if (II->isStr(CUDA().getConfigureFuncName()) && !NewFD->isInvalidDecl() &&
+          NewFD->getDeclContext()->getRedeclContext()->isTranslationUnit()) {
+        if (!R->castAs<FunctionType>()->getReturnType()->isScalarType())
+          Diag(NewFD->getLocation(), diag::err_config_scalar_return)
+              << CUDA().getConfigureFuncName();
+        Context.setcudaConfigureCallDecl(NewFD);
+      }
+      if (II->isStr(CUDA().getGetParameterBufferFuncName()) &&
+          !NewFD->isInvalidDecl() &&
+          NewFD->getDeclContext()->getRedeclContext()->isTranslationUnit()) {
+        if (!R->castAs<FunctionType>()->getReturnType()->isPointerType())
+          Diag(NewFD->getLocation(), diag::err_config_pointer_return)
+              << CUDA().getGetParameterBufferFuncName();
+        Context.setcudaGetParameterBufferDecl(NewFD);
+      }
+      if (II->isStr(CUDA().getLaunchDeviceFuncName()) &&
+          !NewFD->isInvalidDecl() &&
+          NewFD->getDeclContext()->getRedeclContext()->isTranslationUnit()) {
+        if (!R->castAs<FunctionType>()->getReturnType()->isScalarType())
+          Diag(NewFD->getLocation(), diag::err_config_scalar_return)
+              << CUDA().getLaunchDeviceFuncName();
+        Context.setcudaLaunchDeviceDecl(NewFD);
+      }
     }
   }
 
@@ -18468,17 +18495,21 @@ Sema::ActOnTag(Scope *S, unsigned TagSpec, TagUseKind TUK, SourceLocation KWLoc,
                            cast_or_null<EnumDecl>(PrevDecl), ScopedEnum,
                            ScopedEnumUsesClassTag, IsFixed);
 
+    EnumDecl *ED = cast<EnumDecl>(New);
+    ED->setEnumKeyRange(SourceRange(
+        KWLoc, ScopedEnumKWLoc.isValid() ? ScopedEnumKWLoc : KWLoc));
+
     if (isStdAlignValT && (!StdAlignValT || getStdAlignValT()->isImplicit()))
       StdAlignValT = cast<EnumDecl>(New);
 
     // If this is an undefined enum, warn.
     if (TUK != TagUseKind::Definition && !Invalid) {
       TagDecl *Def;
-      if (IsFixed && cast<EnumDecl>(New)->isFixed()) {
+      if (IsFixed && ED->isFixed()) {
         // C++0x: 7.2p2: opaque-enum-declaration.
         // Conflicts are diagnosed above. Do nothing.
-      }
-      else if (PrevDecl && (Def = cast<EnumDecl>(PrevDecl)->getDefinition())) {
+      } else if (PrevDecl &&
+                 (Def = cast<EnumDecl>(PrevDecl)->getDefinition())) {
         Diag(Loc, diag::ext_forward_ref_enum_def)
           << New;
         Diag(Def->getLocation(), diag::note_previous_definition);
diff --git a/clang/lib/Sema/SemaDeclAttr.cpp b/clang/lib/Sema/SemaDeclAttr.cpp
index e3af5023c74d0..04cd68a4223d8 100644
--- a/clang/lib/Sema/SemaDeclAttr.cpp
+++ b/clang/lib/Sema/SemaDeclAttr.cpp
@@ -6973,6 +6973,71 @@ static void handleVTablePointerAuthentication(Sema &S, Decl *D,
       CustomDiscriminationValue));
 }
 
+static bool modularFormatAttrsEquiv(const ModularFormatAttr *Existing,
+                                    IdentifierInfo *ModularImplFn,
+                                    StringRef ImplName,
+                                    ArrayRef<StringRef> Aspects) {
+  return Existing->getModularImplFn() == ModularImplFn &&
+         Existing->getImplName() == ImplName &&
+         Existing->aspects_size() == Aspects.size() &&
+         llvm::equal(Existing->aspects(), Aspects);
+}
+
+ModularFormatAttr *
+Sema::mergeModularFormatAttr(Decl *D, const AttributeCommonInfo &CI,
+                             IdentifierInfo *ModularImplFn, StringRef ImplName,
+                             MutableArrayRef<StringRef> Aspects) {
+  if (const auto *Existing = D->getAttr<ModularFormatAttr>()) {
+    if (!modularFormatAttrsEquiv(Existing, ModularImplFn, ImplName, Aspects)) {
+      Diag(Existing->getLocation(), diag::err_duplicate_attribute) << *Existing;
+      Diag(CI.getLoc(), diag::note_conflicting_attribute);
+    }
+    return nullptr;
+  }
+  return ::new (Context) ModularFormatAttr(Context, CI, ModularImplFn, ImplName,
+                                           Aspects.data(), Aspects.size());
+}
+
+static void handleModularFormat(Sema &S, Decl *D, const ParsedAttr &AL) {
+  bool Valid = true;
+  StringRef ImplName;
+  if (!S.checkStringLiteralArgumentAttr(AL, 1, ImplName))
+    Valid = false;
+  SmallVector<StringRef> Aspects;
+  llvm::DenseSet<StringRef> SeenAspects;
+  for (unsigned I = 2, E = AL.getNumArgs(); I != E; ++I) {
+    StringRef Aspect;
+    if (!S.checkStringLiteralArgumentAttr(AL, I, Aspect))
+      return;
+    if (!SeenAspects.insert(Aspect).second) {
+      S.Diag(AL.getArgAsExpr(I)->getExprLoc(),
+             diag::err_modular_format_duplicate_aspect)
+          << Aspect;
+      Valid = false;
+      continue;
+    }
+    Aspects.push_back(Aspect);
+  }
+  if (!Valid)
+    return;
+
+  // Store aspects sorted.
+  llvm::sort(Aspects);
+  IdentifierInfo *ModularImplFn = AL.getArgAsIdent(0)->getIdentifierInfo();
+
+  if (const auto *Existing = D->getAttr<ModularFormatAttr>()) {
+    if (!modularFormatAttrsEquiv(Existing, ModularImplFn, ImplName, Aspects)) {
+      S.Diag(AL.getLoc(), diag::err_duplicate_attribute) << *Existing;
+      S.Diag(Existing->getLoc(), diag::note_conflicting_attribute);
+    }
+    // Ignore the later declaration in favor of the earlier one.
+    return;
+  }
+
+  D->addAttr(::new (S.Context) ModularFormatAttr(
+      S.Context, AL, ModularImplFn, ImplName, Aspects.data(), Aspects.size()));
+}
+
 //===----------------------------------------------------------------------===//
 // Top Level Sema Entry Points
 //===----------------------------------------------------------------------===//
@@ -7703,6 +7768,9 @@ ProcessDeclAttribute(Sema &S, Scope *scope, Decl *D, const ParsedAttr &AL,
   case ParsedAttr::AT_HLSLUnparsedSemantic:
     S.HLSL().handleSemanticAttr(D, AL);
     break;
+  case ParsedAttr::AT_HLSLVkLocation:
+    S.HLSL().handleVkLocationAttr(D, AL);
+    break;
 
   case ParsedAttr::AT_AbiTag:
     handleAbiTagAttr(S, D, AL);
@@ -7910,6 +7978,10 @@ ProcessDeclAttribute(Sema &S, Scope *scope, Decl *D, const ParsedAttr &AL,
   case ParsedAttr::AT_VTablePointerAuthentication:
     handleVTablePointerAuthentication(S, D, AL);
     break;
+
+  case ParsedAttr::AT_ModularFormat:
+    handleModularFormat(S, D, AL);
+    break;
   }
 }
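A hypothetical usage sketch of the attribute handled above, matching the
argument order handleModularFormat parses (an identifier, a string
implementation name, then distinct string "aspects") and the
checkModularFormatAttr requirement that a format attribute be present; the
modular_format spelling and all function names below are assumptions for
illustration, not taken from the patch:

  int my_printf_impl(const char *fmt, ...);

  // Without format(...), err_modular_format_attribute_no_format is emitted.
  __attribute__((format(printf, 1, 2)))
  __attribute__((modular_format(my_printf_impl, "printf", "float", "str")))
  int my_printf(const char *fmt, ...);
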
 
diff --git a/clang/lib/Sema/SemaExprCXX.cpp b/clang/lib/Sema/SemaExprCXX.cpp
index d6f70e728be29..69719ebd1fc8c 100644
--- a/clang/lib/Sema/SemaExprCXX.cpp
+++ b/clang/lib/Sema/SemaExprCXX.cpp
@@ -5196,6 +5196,7 @@ Sema::PerformImplicitConversion(Expr *From, QualType ToType,
   case ICK_Incompatible_Pointer_Conversion:
   case ICK_HLSL_Array_RValue:
   case ICK_HLSL_Vector_Truncation:
+  case ICK_HLSL_Matrix_Truncation:
   case ICK_HLSL_Vector_Splat:
     llvm_unreachable("Improper second standard conversion");
   }
@@ -5203,12 +5204,10 @@ Sema::PerformImplicitConversion(Expr *From, QualType ToType,
   if (SCS.Dimension != ICK_Identity) {
     // If SCS.Element is not ICK_Identity the To and From types must be HLSL
     // vectors or matrices.
-
-    // TODO: Support HLSL matrices.
-    assert((!From->getType()->isMatrixType() && !ToType->isMatrixType()) &&
-           "Dimension conversion for matrix types is not implemented yet.");
-    assert((ToType->isVectorType() || ToType->isBuiltinType()) &&
-           "Dimension conversion output must be vector or scalar type.");
+    assert(
+        (ToType->isVectorType() || ToType->isConstantMatrixType() ||
+         ToType->isBuiltinType()) &&
+        "Dimension conversion output must be vector, matrix, or scalar type.");
     switch (SCS.Dimension) {
     case ICK_HLSL_Vector_Splat: {
       // Vector splat from any arithmetic type to a vector.
@@ -5234,6 +5233,17 @@ Sema::PerformImplicitConversion(Expr *From, QualType ToType,
 
       break;
     }
+    case ICK_HLSL_Matrix_Truncation: {
+      auto *FromMat = From->getType()->castAs<ConstantMatrixType>();
+      QualType TruncTy = FromMat->getElementType();
+      if (auto *ToMat = ToType->getAs<ConstantMatrixType>())
+        TruncTy = Context.getConstantMatrixType(TruncTy, ToMat->getNumRows(),
+                                                ToMat->getNumColumns());
+      From = ImpCastExprToType(From, TruncTy, CK_HLSLMatrixTruncation,
+                               From->getValueKind())
+                 .get();
+      break;
+    }
     case ICK_Identity:
     default:
       llvm_unreachable("Improper element standard conversion");
diff --git a/clang/lib/Sema/SemaHLSL.cpp b/clang/lib/Sema/SemaHLSL.cpp
index ecab3946b58c7..dadea331f930b 100644
--- a/clang/lib/Sema/SemaHLSL.cpp
+++ b/clang/lib/Sema/SemaHLSL.cpp
@@ -771,12 +771,33 @@ void SemaHLSL::ActOnTopLevelFunction(FunctionDecl *FD) {
   }
 }
 
+static bool isVkPipelineBuiltin(const ASTContext &AstContext, FunctionDecl *FD,
+                                HLSLAppliedSemanticAttr *Semantic,
+                                bool IsInput) {
+  if (AstContext.getTargetInfo().getTriple().getOS() != llvm::Triple::Vulkan)
+    return false;
+
+  const auto *ShaderAttr = FD->getAttr<HLSLShaderAttr>();
+  assert(ShaderAttr && "Entry point has no shader attribute");
+  llvm::Triple::EnvironmentType ST = ShaderAttr->getType();
+  auto SemanticName = Semantic->getSemanticName().upper();
+
+  // The SV_Position semantic is lowered to:
+  //  - Position built-in for vertex output.
+  //  - FragCoord built-in for fragment input.
+  if (SemanticName == "SV_POSITION") {
+    return (ST == llvm::Triple::Vertex && !IsInput) ||
+           (ST == llvm::Triple::Pixel && IsInput);
+  }
+
+  return false;
+}
+
 bool SemaHLSL::determineActiveSemanticOnScalar(FunctionDecl *FD,
                                                DeclaratorDecl *OutputDecl,
                                                DeclaratorDecl *D,
                                                SemanticInfo &ActiveSemantic,
-                                               llvm::StringSet<> &UsedSemantics,
-                                               bool IsInput) {
+                                               SemaHLSL::SemanticContext &SC) {
   if (ActiveSemantic.Semantic == nullptr) {
     ActiveSemantic.Semantic = D->getAttr<HLSLParsedSemanticAttr>();
     if (ActiveSemantic.Semantic)
@@ -795,11 +816,26 @@ bool SemaHLSL::determineActiveSemanticOnScalar(FunctionDecl *FD,
   if (!A)
     return false;
 
-  checkSemanticAnnotation(FD, D, A, IsInput);
+  checkSemanticAnnotation(FD, D, A, SC);
   OutputDecl->addAttr(A);
 
   unsigned Location = ActiveSemantic.Index.value_or(0);
 
+  if (!isVkPipelineBuiltin(getASTContext(), FD, A,
+                           SC.CurrentIOType & IOType::In)) {
+    bool HasVkLocation = false;
+    if (auto *A = D->getAttr<HLSLVkLocationAttr>()) {
+      HasVkLocation = true;
+      Location = A->getLocation();
+    }
+
+    if (SC.UsesExplicitVkLocations.value_or(HasVkLocation) != HasVkLocation) {
+      Diag(D->getLocation(), diag::err_hlsl_semantic_partial_explicit_indexing);
+      return false;
+    }
+    SC.UsesExplicitVkLocations = HasVkLocation;
+  }
+
   const ConstantArrayType *AT = dyn_cast<ConstantArrayType>(D->getType());
   unsigned ElementCount = AT ? AT->getZExtSize() : 1;
   ActiveSemantic.Index = Location + ElementCount;
@@ -808,7 +844,7 @@ bool SemaHLSL::determineActiveSemanticOnScalar(FunctionDecl *FD,
   for (unsigned I = 0; I < ElementCount; ++I) {
     Twine VariableName = BaseName.concat(Twine(Location + I));
 
-    auto [_, Inserted] = UsedSemantics.insert(VariableName.str());
+    auto [_, Inserted] = SC.ActiveSemantics.insert(VariableName.str());
     if (!Inserted) {
       Diag(D->getLocation(), diag::err_hlsl_semantic_index_overlap)
           << VariableName.str();
@@ -823,8 +859,7 @@ bool SemaHLSL::determineActiveSemantic(FunctionDecl *FD,
                                        DeclaratorDecl *OutputDecl,
                                        DeclaratorDecl *D,
                                        SemanticInfo &ActiveSemantic,
-                                       llvm::StringSet<> &UsedSemantics,
-                                       bool IsInput) {
+                                       SemaHLSL::SemanticContext &SC) {
   if (ActiveSemantic.Semantic == nullptr) {
     ActiveSemantic.Semantic = D->getAttr<HLSLParsedSemanticAttr>();
     if (ActiveSemantic.Semantic)
@@ -837,13 +872,12 @@ bool SemaHLSL::determineActiveSemantic(FunctionDecl *FD,
   const RecordType *RT = dyn_cast<RecordType>(T);
   if (!RT)
     return determineActiveSemanticOnScalar(FD, OutputDecl, D, ActiveSemantic,
-                                           UsedSemantics, IsInput);
+                                           SC);
 
   const RecordDecl *RD = RT->getDecl();
   for (FieldDecl *Field : RD->fields()) {
     SemanticInfo Info = ActiveSemantic;
-    if (!determineActiveSemantic(FD, OutputDecl, Field, Info, UsedSemantics,
-                                 IsInput)) {
+    if (!determineActiveSemantic(FD, OutputDecl, Field, Info, SC)) {
       Diag(Field->getLocation(), diag::note_hlsl_semantic_used_here) << Field;
       return false;
     }
@@ -873,14 +907,14 @@ void SemaHLSL::CheckEntryPoint(FunctionDecl *FD) {
   case llvm::Triple::Miss:
   case llvm::Triple::Callable:
     if (const auto *NT = FD->getAttr<HLSLNumThreadsAttr>()) {
-      DiagnoseAttrStageMismatch(NT, ST,
+      diagnoseAttrStageMismatch(NT, ST,
                                 {llvm::Triple::Compute,
                                  llvm::Triple::Amplification,
                                  llvm::Triple::Mesh});
       FD->setInvalidDecl();
     }
     if (const auto *WS = FD->getAttr<HLSLWaveSizeAttr>()) {
-      DiagnoseAttrStageMismatch(WS, ST,
+      diagnoseAttrStageMismatch(WS, ST,
                                 {llvm::Triple::Compute,
                                  llvm::Triple::Amplification,
                                  llvm::Triple::Mesh});
@@ -916,7 +950,9 @@ void SemaHLSL::CheckEntryPoint(FunctionDecl *FD) {
     llvm_unreachable("Unhandled environment in triple");
   }
 
-  llvm::StringSet<> ActiveInputSemantics;
+  SemaHLSL::SemanticContext InputSC = {};
+  InputSC.CurrentIOType = IOType::In;
+
   for (ParmVarDecl *Param : FD->parameters()) {
     SemanticInfo ActiveSemantic;
     ActiveSemantic.Semantic = Param->getAttr<HLSLParsedSemanticAttr>();
@@ -924,26 +960,25 @@ void SemaHLSL::CheckEntryPoint(FunctionDecl *FD) {
       ActiveSemantic.Index = ActiveSemantic.Semantic->getSemanticIndex();
 
     // FIXME: Verify output semantics in parameters.
-    if (!determineActiveSemantic(FD, Param, Param, ActiveSemantic,
-                                 ActiveInputSemantics, /* IsInput= */ true)) {
+    if (!determineActiveSemantic(FD, Param, Param, ActiveSemantic, InputSC)) {
       Diag(Param->getLocation(), diag::note_previous_decl) << Param;
       FD->setInvalidDecl();
     }
   }
 
   SemanticInfo ActiveSemantic;
-  llvm::StringSet<> ActiveOutputSemantics;
+  SemaHLSL::SemanticContext OutputSC = {};
+  OutputSC.CurrentIOType = IOType::Out;
   ActiveSemantic.Semantic = FD->getAttr<HLSLParsedSemanticAttr>();
   if (ActiveSemantic.Semantic)
     ActiveSemantic.Index = ActiveSemantic.Semantic->getSemanticIndex();
   if (!FD->getReturnType()->isVoidType())
-    determineActiveSemantic(FD, FD, FD, ActiveSemantic, ActiveOutputSemantics,
-                            /* IsInput= */ false);
+    determineActiveSemantic(FD, FD, FD, ActiveSemantic, OutputSC);
 }
 
 void SemaHLSL::checkSemanticAnnotation(
     FunctionDecl *EntryPoint, const Decl *Param,
-    const HLSLAppliedSemanticAttr *SemanticAttr, bool IsInput) {
+    const HLSLAppliedSemanticAttr *SemanticAttr, const SemanticContext &SC) {
   auto *ShaderAttr = EntryPoint->getAttr<HLSLShaderAttr>();
   assert(ShaderAttr && "Entry point has no shader attribute");
   llvm::Triple::EnvironmentType ST = ShaderAttr->getType();
@@ -954,7 +989,8 @@ void SemaHLSL::checkSemanticAnnotation(
       SemanticName == "SV_GROUPID") {
 
     if (ST != llvm::Triple::Compute)
-      DiagnoseAttrStageMismatch(SemanticAttr, ST, {llvm::Triple::Compute});
+      diagnoseSemanticStageMismatch(SemanticAttr, ST, SC.CurrentIOType,
+                                    {{llvm::Triple::Compute, IOType::In}});
 
     if (SemanticAttr->getSemanticIndex() != 0) {
       std::string PrettyName =
@@ -969,10 +1005,15 @@ void SemaHLSL::checkSemanticAnnotation(
   if (SemanticName == "SV_POSITION") {
     // SV_Position can be an input or output in vertex shaders,
     // but only an input in pixel shaders.
-    if (ST == llvm::Triple::Vertex || (ST == llvm::Triple::Pixel && IsInput))
-      return;
-    DiagnoseAttrStageMismatch(SemanticAttr, ST,
-                              {llvm::Triple::Pixel, llvm::Triple::Vertex});
+    diagnoseSemanticStageMismatch(SemanticAttr, ST, SC.CurrentIOType,
+                                  {{llvm::Triple::Vertex, IOType::InOut},
+                                   {llvm::Triple::Pixel, IOType::In}});
+    return;
+  }
+
+  if (SemanticName == "SV_TARGET") {
+    diagnoseSemanticStageMismatch(SemanticAttr, ST, SC.CurrentIOType,
+                                  {{llvm::Triple::Pixel, IOType::Out}});
     return;
   }
 
@@ -982,7 +1023,7 @@ void SemaHLSL::checkSemanticAnnotation(
     llvm_unreachable("Unknown SemanticAttr");
 }
 
-void SemaHLSL::DiagnoseAttrStageMismatch(
+void SemaHLSL::diagnoseAttrStageMismatch(
     const Attr *A, llvm::Triple::EnvironmentType Stage,
     std::initializer_list<llvm::Triple::EnvironmentType> AllowedStages) {
   SmallVector<StringRef, 8> StageStrings;
@@ -996,6 +1037,48 @@ void SemaHLSL::DiagnoseAttrStageMismatch(
       << (AllowedStages.size() != 1) << join(StageStrings, ", ");
 }
 
+void SemaHLSL::diagnoseSemanticStageMismatch(
+    const Attr *A, llvm::Triple::EnvironmentType Stage, IOType CurrentIOType,
+    std::initializer_list<SemanticStageInfo> Allowed) {
+
+  for (auto &Case : Allowed) {
+    if (Case.Stage != Stage)
+      continue;
+
+    if (CurrentIOType & Case.AllowedIOTypesMask)
+      return;
+
+    SmallVector<std::string, 8> ValidCases;
+    llvm::transform(
+        Allowed, std::back_inserter(ValidCases), [](SemanticStageInfo Case) {
+          SmallVector<std::string, 2> ValidType;
+          if (Case.AllowedIOTypesMask & IOType::In)
+            ValidType.push_back("input");
+          if (Case.AllowedIOTypesMask & IOType::Out)
+            ValidType.push_back("output");
+          return std::string(
+                     HLSLShaderAttr::ConvertEnvironmentTypeToStr(Case.Stage)) +
+                 " " + join(ValidType, "/");
+        });
+    Diag(A->getLoc(), diag::err_hlsl_semantic_unsupported_iotype_for_stage)
+        << A->getAttrName() << (CurrentIOType & IOType::In ? "input" : "output")
+        << llvm::Triple::getEnvironmentTypeName(Case.Stage)
+        << join(ValidCases, ", ");
+    return;
+  }
+
+  SmallVector<StringRef, 8> StageStrings;
+  llvm::transform(
+      Allowed, std::back_inserter(StageStrings), [](SemanticStageInfo Case) {
+        return StringRef(
+            HLSLShaderAttr::ConvertEnvironmentTypeToStr(Case.Stage));
+      });
+
+  Diag(A->getLoc(), diag::err_hlsl_attr_unsupported_in_stage)
+      << A->getAttrName() << llvm::Triple::getEnvironmentTypeName(Stage)
+      << (Allowed.size() != 1) << join(StageStrings, ", ");
+}
+
 template <CastKind Kind>
 static void castVector(Sema &S, ExprResult &E, QualType &Ty, unsigned Sz) {
   if (const auto *VTy = Ty->getAs<VectorType>())
@@ -1707,6 +1790,15 @@ void SemaHLSL::handleVkBindingAttr(Decl *D, const ParsedAttr &AL) {
                  HLSLVkBindingAttr(getASTContext(), AL, Binding, Set));
 }
 
+void SemaHLSL::handleVkLocationAttr(Decl *D, const ParsedAttr &AL) {
+  uint32_t Location;
+  if (!SemaRef.checkUInt32Argument(AL, AL.getArgAsExpr(0), Location))
+    return;
+
+  D->addAttr(::new (getASTContext())
+                 HLSLVkLocationAttr(getASTContext(), AL, Location));
+}
+
 bool SemaHLSL::diagnoseInputIDType(QualType T, const ParsedAttr &AL) {
   const auto *VT = T->getAs<VectorType>();
 
@@ -1797,6 +1889,16 @@ void SemaHLSL::diagnoseSystemSemanticAttr(Decl *D, const ParsedAttr &AL,
     return;
   }
 
+  if (SemanticName == "SV_TARGET") {
+    const auto *VT = ValueType->getAs<VectorType>();
+    if (!ValueType->hasFloatingRepresentation() ||
+        (VT && VT->getNumElements() > 4))
+      Diag(AL.getLoc(), diag::err_hlsl_attr_invalid_type)
+          << AL << "float/float1/float2/float3/float4";
+    D->addAttr(createSemanticAttr<HLSLParsedSemanticAttr>(AL, Index));
+    return;
+  }
+
   Diag(AL.getLoc(), diag::err_hlsl_unknown_semantic) << AL;
 }
 
@@ -3721,7 +3823,6 @@ bool SemaHLSL::CanPerformAggregateSplatCast(Expr *Src, QualType DestTy) {
 }
 
 // Can we perform an HLSL Elementwise cast?
-// TODO: update this code when matrices are added; see issue #88060
 bool SemaHLSL::CanPerformElementwiseCast(Expr *Src, QualType DestTy) {
 
   // Don't handle casts where LHS and RHS are any combination of scalar/vector
@@ -3734,6 +3835,10 @@ bool SemaHLSL::CanPerformElementwiseCast(Expr *Src, QualType DestTy) {
       (DestTy->isScalarType() || DestTy->isVectorType()))
     return false;
 
+  if (SrcTy->isConstantMatrixType() &&
+      (DestTy->isScalarType() || DestTy->isConstantMatrixType()))
+    return false;
+
   llvm::SmallVector<QualType> DestTypes;
   BuildFlattenedTypeList(DestTy, DestTypes);
   llvm::SmallVector<QualType> SrcTypes;
@@ -3916,7 +4021,9 @@ void SemaHLSL::ActOnVariableDeclarator(VarDecl *VD) {
     // process explicit bindings
     processExplicitBindingsOnDecl(VD);
 
-    if (VD->getType()->isHLSLResourceRecordArray()) {
+    // Add implicit binding attribute to non-static resource arrays.
+    if (VD->getType()->isHLSLResourceRecordArray() &&
+        VD->getStorageClass() != SC_Static) {
       // If the resource array does not have an explicit binding attribute,
       // create an implicit one. It will be used to transfer implicit binding
       // order_ID to codegen.
@@ -4110,8 +4217,8 @@ bool SemaHLSL::ActOnUninitializedVarDecl(VarDecl *VD) {
   if (VD->getType().getAddressSpace() == LangAS::hlsl_constant)
     return true;
 
-  // Initialize resources at the global scope
-  if (VD->hasGlobalStorage()) {
+  // Initialize non-static resources at the global scope.
+  if (VD->hasGlobalStorage() && VD->getStorageClass() != SC_Static) {
     const Type *Ty = VD->getType().getTypePtr();
     if (Ty->isHLSLResourceRecord())
       return initGlobalResourceDecl(VD);
@@ -4135,10 +4242,10 @@ bool SemaHLSL::CheckResourceBinOp(BinaryOperatorKind Opc, Expr *LHSExpr,
   while (auto *ASE = dyn_cast<ArraySubscriptExpr>(E))
     E = ASE->getBase()->IgnoreParenImpCasts();
 
-  // Report error if LHS is a resource declared at a global scope.
+  // Report error if LHS is a non-static resource declared at a global scope.
   if (DeclRefExpr *DRE = dyn_cast<DeclRefExpr>(E->IgnoreParens())) {
     if (VarDecl *VD = dyn_cast<VarDecl>(DRE->getDecl())) {
-      if (VD->hasGlobalStorage()) {
+      if (VD->hasGlobalStorage() && VD->getStorageClass() != SC_Static) {
         // assignment to global resource is not allowed
         SemaRef.Diag(Loc, diag::err_hlsl_assign_to_global_resource) << VD;
         SemaRef.Diag(VD->getLocation(), diag::note_var_declared_here) << VD;
diff --git a/clang/lib/Sema/SemaLookup.cpp b/clang/lib/Sema/SemaLookup.cpp
index 5915d6e57d893..b9fac5a4a1153 100644
--- a/clang/lib/Sema/SemaLookup.cpp
+++ b/clang/lib/Sema/SemaLookup.cpp
@@ -3927,7 +3927,8 @@ void Sema::ArgumentDependentLookup(DeclarationName Name, SourceLocation Loc,
             break;
           }
 
-          if (!getLangOpts().CPlusPlusModules)
+          if (!D->getOwningModule() ||
+              !D->getOwningModule()->getTopLevelModule()->isNamedModule())
             continue;
 
           if (D->isInExportDeclContext()) {
@@ -3959,7 +3960,9 @@ void Sema::ArgumentDependentLookup(DeclarationName Name, SourceLocation Loc,
               break;
             }
           }
-        } else if (D->getFriendObjectKind()) {
+        }
+
+        if (D->getFriendObjectKind()) {
           auto *RD = cast<CXXRecordDecl>(D->getLexicalDeclContext());
           // [basic.lookup.argdep]p4:
           //   Argument-dependent lookup finds all declarations of functions and
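An illustrative C++20 named-module case for the hunk above (not from the
patch): the exported/friend handling now keys off the declaration's owning
named module rather than the global CPlusPlusModules language option.

  // m.cppm
  export module m;
  export struct S {
    friend void touch(S) {}   // hidden friend; reachable through ADL
  };

  // use.cpp
  import m;
  void f() { touch(S{}); }    // found by argument-dependent lookup
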
diff --git a/clang/lib/Sema/SemaOpenACC.cpp b/clang/lib/Sema/SemaOpenACC.cpp
index f0f3832e160cd..1115efbb8305c 100644
--- a/clang/lib/Sema/SemaOpenACC.cpp
+++ b/clang/lib/Sema/SemaOpenACC.cpp
@@ -12,6 +12,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "clang/Sema/SemaOpenACC.h"
+#include "clang/AST/ASTConsumer.h"
 #include "clang/AST/DeclOpenACC.h"
 #include "clang/AST/StmtOpenACC.h"
 #include "clang/Basic/DiagnosticSema.h"
@@ -2457,7 +2458,8 @@ OpenACCRoutineDecl *SemaOpenACC::CheckRoutineDecl(
     ArrayRef<const OpenACCClause *> Clauses, SourceLocation EndLoc) {
   assert(LParenLoc.isValid());
 
-  if (FunctionDecl *FD = getFunctionFromRoutineName(FuncRef)) {
+  FunctionDecl *FD = nullptr;
+  if ((FD = getFunctionFromRoutineName(FuncRef))) {
     // OpenACC 3.3 2.15:
     // In C and C++, function static variables are not supported in functions to
     // which a routine directive applies.
@@ -2509,11 +2511,9 @@ OpenACCRoutineDecl *SemaOpenACC::CheckRoutineDecl(
                                                         {DirLoc, BindLoc});
     FD->addAttr(RAA);
     // In case we are referencing not the 'latest' version, make sure we add
-    // the attribute to all declarations.
-    while (FD != FD->getMostRecentDecl()) {
-      FD = FD->getMostRecentDecl();
-      FD->addAttr(RAA);
-    }
+    // the attribute to all declarations after the 'found' one.
+    for (auto *CurFD : FD->redecls())
+      CurFD->addAttr(RAA->clone(getASTContext()));
   }
 
   LastRoutineDecl = OpenACCRoutineDecl::Create(
@@ -2522,9 +2522,20 @@ OpenACCRoutineDecl *SemaOpenACC::CheckRoutineDecl(
   LastRoutineDecl->setAccess(AS_public);
   getCurContext()->addDecl(LastRoutineDecl);
 
+  if (FD) {
+    // Record this routine reference so that codegen can revisit it later. FD
+    // doesn't necessarily exist, but the null case should already have been
+    // diagnosed.
+    RoutineRefList.emplace_back(FD, LastRoutineDecl);
+  }
   return LastRoutineDecl;
 }
 
+void SemaOpenACC::ActOnEndOfTranslationUnit(TranslationUnitDecl *TU) {
+  for (auto [FD, RoutineDecl] : RoutineRefList)
+    SemaRef.Consumer.HandleOpenACCRoutineReference(FD, RoutineDecl);
+}
+
 DeclGroupRef SemaOpenACC::ActOnEndRoutineDeclDirective(
     SourceLocation StartLoc, SourceLocation DirLoc, SourceLocation LParenLoc,
     Expr *ReferencedFunc, SourceLocation RParenLoc,
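A minimal sketch of the source-level construct these SemaOpenACC changes
record (illustrative only; the names are invented): the (function,
routine-decl) pair for 'work' is appended to RoutineRefList and handed to the
ASTConsumer at the end of the translation unit.

  int work(int x);
  #pragma acc routine(work) seq
  void caller() { (void)work(1); }
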
diff --git a/clang/lib/Sema/SemaOpenMP.cpp b/clang/lib/Sema/SemaOpenMP.cpp
index 31c8f0cd30c56..431c545c07e47 100644
--- a/clang/lib/Sema/SemaOpenMP.cpp
+++ b/clang/lib/Sema/SemaOpenMP.cpp
@@ -18712,16 +18712,16 @@ OMPClause *SemaOpenMP::ActOnOpenMPVarListClause(OpenMPClauseKind Kind,
         ExtraModifierLoc, ColonLoc, VarList, Locs);
     break;
   case OMPC_to:
-    Res =
-        ActOnOpenMPToClause(Data.MotionModifiers, Data.MotionModifiersLoc,
-                            Data.ReductionOrMapperIdScopeSpec,
-                            Data.ReductionOrMapperId, ColonLoc, VarList, Locs);
+    Res = ActOnOpenMPToClause(
+        Data.MotionModifiers, Data.MotionModifiersLoc, Data.IteratorExpr,
+        Data.ReductionOrMapperIdScopeSpec, Data.ReductionOrMapperId, ColonLoc,
+        VarList, Locs);
     break;
   case OMPC_from:
-    Res = ActOnOpenMPFromClause(Data.MotionModifiers, Data.MotionModifiersLoc,
-                                Data.ReductionOrMapperIdScopeSpec,
-                                Data.ReductionOrMapperId, ColonLoc, VarList,
-                                Locs);
+    Res = ActOnOpenMPFromClause(
+        Data.MotionModifiers, Data.MotionModifiersLoc, Data.IteratorExpr,
+        Data.ReductionOrMapperIdScopeSpec, Data.ReductionOrMapperId, ColonLoc,
+        VarList, Locs);
     break;
   case OMPC_use_device_ptr:
     Res = ActOnOpenMPUseDevicePtrClause(VarList, Locs);
@@ -24457,11 +24457,12 @@ void SemaOpenMP::ActOnOpenMPDeclareTargetInitializer(Decl *TargetDecl) {
 
 OMPClause *SemaOpenMP::ActOnOpenMPToClause(
     ArrayRef<OpenMPMotionModifierKind> MotionModifiers,
-    ArrayRef<SourceLocation> MotionModifiersLoc,
+    ArrayRef<SourceLocation> MotionModifiersLoc, Expr *IteratorExpr,
     CXXScopeSpec &MapperIdScopeSpec, DeclarationNameInfo &MapperId,
     SourceLocation ColonLoc, ArrayRef<Expr *> VarList,
     const OMPVarListLocTy &Locs, ArrayRef<Expr *> UnresolvedMappers) {
   OpenMPMotionModifierKind Modifiers[] = {OMPC_MOTION_MODIFIER_unknown,
+                                          OMPC_MOTION_MODIFIER_unknown,
                                           OMPC_MOTION_MODIFIER_unknown};
   SourceLocation ModifiersLoc[NumberOfOMPMotionModifiers];
 
@@ -24485,20 +24486,25 @@ OMPClause *SemaOpenMP::ActOnOpenMPToClause(
                               MapperIdScopeSpec, MapperId, UnresolvedMappers);
   if (MVLI.ProcessedVarList.empty())
     return nullptr;
-
+  if (IteratorExpr)
+    if (auto *DRE = dyn_cast<DeclRefExpr>(IteratorExpr))
+      if (auto *VD = dyn_cast<VarDecl>(DRE->getDecl()))
+        DSAStack->addIteratorVarDecl(VD);
   return OMPToClause::Create(
       getASTContext(), Locs, MVLI.ProcessedVarList, MVLI.VarBaseDeclarations,
-      MVLI.VarComponents, MVLI.UDMapperList, Modifiers, ModifiersLoc,
-      MapperIdScopeSpec.getWithLocInContext(getASTContext()), MapperId);
+      MVLI.VarComponents, MVLI.UDMapperList, IteratorExpr, Modifiers,
+      ModifiersLoc, MapperIdScopeSpec.getWithLocInContext(getASTContext()),
+      MapperId);
 }
 
 OMPClause *SemaOpenMP::ActOnOpenMPFromClause(
     ArrayRef<OpenMPMotionModifierKind> MotionModifiers,
-    ArrayRef<SourceLocation> MotionModifiersLoc,
+    ArrayRef<SourceLocation> MotionModifiersLoc, Expr *IteratorExpr,
     CXXScopeSpec &MapperIdScopeSpec, DeclarationNameInfo &MapperId,
     SourceLocation ColonLoc, ArrayRef<Expr *> VarList,
     const OMPVarListLocTy &Locs, ArrayRef<Expr *> UnresolvedMappers) {
   OpenMPMotionModifierKind Modifiers[] = {OMPC_MOTION_MODIFIER_unknown,
+                                          OMPC_MOTION_MODIFIER_unknown,
                                           OMPC_MOTION_MODIFIER_unknown};
   SourceLocation ModifiersLoc[NumberOfOMPMotionModifiers];
 
@@ -24522,11 +24528,15 @@ OMPClause *SemaOpenMP::ActOnOpenMPFromClause(
                               MapperIdScopeSpec, MapperId, UnresolvedMappers);
   if (MVLI.ProcessedVarList.empty())
     return nullptr;
-
+  if (IteratorExpr)
+    if (auto *DRE = dyn_cast<DeclRefExpr>(IteratorExpr))
+      if (auto *VD = dyn_cast<VarDecl>(DRE->getDecl()))
+        DSAStack->addIteratorVarDecl(VD);
   return OMPFromClause::Create(
       getASTContext(), Locs, MVLI.ProcessedVarList, MVLI.VarBaseDeclarations,
-      MVLI.VarComponents, MVLI.UDMapperList, Modifiers, ModifiersLoc,
-      MapperIdScopeSpec.getWithLocInContext(getASTContext()), MapperId);
+      MVLI.VarComponents, MVLI.UDMapperList, IteratorExpr, Modifiers,
+      ModifiersLoc, MapperIdScopeSpec.getWithLocInContext(getASTContext()),
+      MapperId);
 }
 
 OMPClause *
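A sketch of OpenMP motion clauses with iterator modifiers, which the hunks
above thread through Sema (illustrative source, not from the patch); 'it' is
the iterator variable that the new addIteratorVarDecl call registers on the
DSA stack.

  void update(double *a, double *b, int n) {
  #pragma omp target update to(iterator(it = 0:n): a[it])
  #pragma omp target update from(iterator(it = 0:n): b[it])
  }
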
diff --git a/clang/lib/Sema/SemaOverload.cpp b/clang/lib/Sema/SemaOverload.cpp
index c12f92dfdab66..9a3a78164f0f8 100644
--- a/clang/lib/Sema/SemaOverload.cpp
+++ b/clang/lib/Sema/SemaOverload.cpp
@@ -162,6 +162,7 @@ ImplicitConversionRank clang::GetConversionRank(ImplicitConversionKind Kind) {
       ICR_C_Conversion_Extension,
       ICR_Conversion,
       ICR_HLSL_Dimension_Reduction,
+      ICR_HLSL_Dimension_Reduction,
       ICR_Conversion,
       ICR_HLSL_Scalar_Widening,
   };
@@ -224,6 +225,7 @@ static const char *GetImplicitConversionName(ImplicitConversionKind Kind) {
       "Incompatible pointer conversion",
       "Fixed point conversion",
       "HLSL vector truncation",
+      "HLSL matrix truncation",
       "Non-decaying array conversion",
       "HLSL vector splat",
   };
@@ -2060,9 +2062,10 @@ static bool IsFloatingPointConversion(Sema &S, QualType FromType,
   return true;
 }
 
-static bool IsVectorElementConversion(Sema &S, QualType FromType,
-                                      QualType ToType,
-                                      ImplicitConversionKind &ICK, Expr *From) {
+static bool IsVectorOrMatrixElementConversion(Sema &S, QualType FromType,
+                                              QualType ToType,
+                                              ImplicitConversionKind &ICK,
+                                              Expr *From) {
   if (S.Context.hasSameUnqualifiedType(FromType, ToType))
     return true;
 
@@ -2102,6 +2105,57 @@ static bool IsVectorElementConversion(Sema &S, QualType FromType,
   return false;
 }
 
+/// Determine whether the conversion from FromType to ToType is a valid
+/// matrix conversion.
+///
+/// \param ICK Will be set to the matrix conversion kind, if this is a matrix
+/// conversion.
+static bool IsMatrixConversion(Sema &S, QualType FromType, QualType ToType,
+                               ImplicitConversionKind &ICK,
+                               ImplicitConversionKind &ElConv, Expr *From,
+                               bool InOverloadResolution, bool CStyle) {
+  // Implicit conversions for matrices are an HLSL feature not present in C/C++.
+  if (!S.getLangOpts().HLSL)
+    return false;
+
+  auto *ToMatrixType = ToType->getAs<ConstantMatrixType>();
+  auto *FromMatrixType = FromType->getAs<ConstantMatrixType>();
+
+  // If both operands are matrices, handle possible matrix truncation and
+  // element conversion.
+  if (ToMatrixType && FromMatrixType) {
+    unsigned FromCols = FromMatrixType->getNumColumns();
+    unsigned ToCols = ToMatrixType->getNumColumns();
+    if (FromCols < ToCols)
+      return false;
+
+    unsigned FromRows = FromMatrixType->getNumRows();
+    unsigned ToRows = ToMatrixType->getNumRows();
+    if (FromRows < ToRows)
+      return false;
+
+    if (FromRows == ToRows && FromCols == ToCols)
+      ElConv = ICK_Identity;
+    else
+      ElConv = ICK_HLSL_Matrix_Truncation;
+
+    QualType FromElTy = FromMatrixType->getElementType();
+    QualType ToElTy = ToMatrixType->getElementType();
+    if (S.Context.hasSameUnqualifiedType(FromElTy, ToElTy))
+      return true;
+    return IsVectorOrMatrixElementConversion(S, FromElTy, ToElTy, ICK, From);
+  }
+  if (FromMatrixType && !ToMatrixType) {
+    ElConv = ICK_HLSL_Matrix_Truncation;
+    QualType FromElTy = FromMatrixType->getElementType();
+    if (S.Context.hasSameUnqualifiedType(FromElTy, ToType))
+      return true;
+    return IsVectorOrMatrixElementConversion(S, FromElTy, ToType, ICK, From);
+  }
+
+  return false;
+}
+
 /// Determine whether the conversion from FromType to ToType is a valid
 /// vector conversion.
 ///
@@ -2141,14 +2195,14 @@ static bool IsVectorConversion(Sema &S, QualType FromType, QualType ToType,
       QualType ToElTy = ToExtType->getElementType();
       if (S.Context.hasSameUnqualifiedType(FromElTy, ToElTy))
         return true;
-      return IsVectorElementConversion(S, FromElTy, ToElTy, ICK, From);
+      return IsVectorOrMatrixElementConversion(S, FromElTy, ToElTy, ICK, From);
     }
     if (FromExtType && !ToExtType) {
       ElConv = ICK_HLSL_Vector_Truncation;
       QualType FromElTy = FromExtType->getElementType();
       if (S.Context.hasSameUnqualifiedType(FromElTy, ToType))
         return true;
-      return IsVectorElementConversion(S, FromElTy, ToType, ICK, From);
+      return IsVectorOrMatrixElementConversion(S, FromElTy, ToType, ICK, From);
     }
     // Fallthrough for the case where ToType is a vector and FromType is not.
   }
@@ -2175,7 +2229,8 @@ static bool IsVectorConversion(Sema &S, QualType FromType, QualType ToType,
       if (S.getLangOpts().HLSL) {
         ElConv = ICK_HLSL_Vector_Splat;
         QualType ToElTy = ToExtType->getElementType();
-        return IsVectorElementConversion(S, FromType, ToElTy, ICK, From);
+        return IsVectorOrMatrixElementConversion(S, FromType, ToElTy, ICK,
+                                                 From);
       }
       ICK = ICK_Vector_Splat;
       return true;
@@ -2474,6 +2529,11 @@ static bool IsStandardConversion(Sema &S, Expr* From, QualType ToType,
     SCS.Second = SecondICK;
     SCS.Dimension = DimensionICK;
     FromType = ToType.getUnqualifiedType();
+  } else if (IsMatrixConversion(S, FromType, ToType, SecondICK, DimensionICK,
+                                From, InOverloadResolution, CStyle)) {
+    SCS.Second = SecondICK;
+    SCS.Dimension = DimensionICK;
+    FromType = ToType.getUnqualifiedType();
   } else if (!S.getLangOpts().CPlusPlus &&
              S.Context.typesAreCompatible(ToType, FromType)) {
     // Compatible conversions (Clang extension for C function overloading)
@@ -6251,6 +6311,7 @@ static bool CheckConvertedConstantConversions(Sema &S,
   case ICK_Incompatible_Pointer_Conversion:
   case ICK_Fixed_Point_Conversion:
   case ICK_HLSL_Vector_Truncation:
+  case ICK_HLSL_Matrix_Truncation:
     return false;
 
   case ICK_Lvalue_To_Rvalue:
diff --git a/clang/lib/Sema/SemaType.cpp b/clang/lib/Sema/SemaType.cpp
index eaf95a8371c2f..fd64d4456cbfa 100644
--- a/clang/lib/Sema/SemaType.cpp
+++ b/clang/lib/Sema/SemaType.cpp
@@ -2259,6 +2259,8 @@ QualType Sema::BuildArrayType(QualType T, ArraySizeModifier ASM,
              isSFINAEContext() ? diag::err_typecheck_zero_array_size
                                : diag::ext_typecheck_zero_array_size)
             << 0 << ArraySize->getSourceRange();
+        if (isSFINAEContext())
+          return QualType();
       }
 
       // Is the array too large?
@@ -3796,8 +3798,10 @@ static CallingConv getCCForDeclaratorChunk(
       }
     }
   }
+
   for (const ParsedAttr &AL : llvm::concat<ParsedAttr>(
-           D.getDeclSpec().getAttributes(), D.getAttributes())) {
+           D.getDeclSpec().getAttributes(), D.getAttributes(),
+           D.getDeclarationAttributes())) {
     if (AL.getKind() == ParsedAttr::AT_DeviceKernel) {
       CC = CC_DeviceKernel;
       break;
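A sketch of the SFINAE behavior the first SemaType.cpp hunk establishes
(illustrative only, assuming the new early return on a zero-size array bound):
substituting N == 0 now fails instead of being accepted as a zero-size array
extension, so overload resolution falls through to the variadic candidate.

  template <int N> char pick(int (*)[N]);  // substitution fails for N == 0
  template <int N> long pick(...);
  static_assert(sizeof(pick<0>(nullptr)) == sizeof(long), "");
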
diff --git a/clang/lib/Sema/TreeTransform.h b/clang/lib/Sema/TreeTransform.h
index 0e8b674a006d0..8e5dbeb792348 100644
--- a/clang/lib/Sema/TreeTransform.h
+++ b/clang/lib/Sema/TreeTransform.h
@@ -2221,13 +2221,14 @@ class TreeTransform {
   OMPClause *
   RebuildOMPToClause(ArrayRef<OpenMPMotionModifierKind> MotionModifiers,
                      ArrayRef<SourceLocation> MotionModifiersLoc,
-                     CXXScopeSpec &MapperIdScopeSpec,
+                     Expr *IteratorModifier, CXXScopeSpec &MapperIdScopeSpec,
                      DeclarationNameInfo &MapperId, SourceLocation ColonLoc,
                      ArrayRef<Expr *> VarList, const OMPVarListLocTy &Locs,
                      ArrayRef<Expr *> UnresolvedMappers) {
     return getSema().OpenMP().ActOnOpenMPToClause(
-        MotionModifiers, MotionModifiersLoc, MapperIdScopeSpec, MapperId,
-        ColonLoc, VarList, Locs, UnresolvedMappers);
+        MotionModifiers, MotionModifiersLoc, IteratorModifier,
+        MapperIdScopeSpec, MapperId, ColonLoc, VarList, Locs,
+        UnresolvedMappers);
   }
 
   /// Build a new OpenMP 'from' clause.
@@ -2237,13 +2238,14 @@ class TreeTransform {
   OMPClause *
   RebuildOMPFromClause(ArrayRef<OpenMPMotionModifierKind> MotionModifiers,
                        ArrayRef<SourceLocation> MotionModifiersLoc,
-                       CXXScopeSpec &MapperIdScopeSpec,
+                       Expr *IteratorModifier, CXXScopeSpec &MapperIdScopeSpec,
                        DeclarationNameInfo &MapperId, SourceLocation ColonLoc,
                        ArrayRef<Expr *> VarList, const OMPVarListLocTy &Locs,
                        ArrayRef<Expr *> UnresolvedMappers) {
     return getSema().OpenMP().ActOnOpenMPFromClause(
-        MotionModifiers, MotionModifiersLoc, MapperIdScopeSpec, MapperId,
-        ColonLoc, VarList, Locs, UnresolvedMappers);
+        MotionModifiers, MotionModifiersLoc, IteratorModifier,
+        MapperIdScopeSpec, MapperId, ColonLoc, VarList, Locs,
+        UnresolvedMappers);
   }
 
   /// Build a new OpenMP 'use_device_ptr' clause.
@@ -11535,6 +11537,13 @@ template <typename Derived>
 OMPClause *TreeTransform<Derived>::TransformOMPToClause(OMPToClause *C) {
   OMPVarListLocTy Locs(C->getBeginLoc(), C->getLParenLoc(), C->getEndLoc());
   llvm::SmallVector<Expr *, 16> Vars;
+  Expr *IteratorModifier = C->getIteratorModifier();
+  if (IteratorModifier) {
+    ExprResult MapModRes = getDerived().TransformExpr(IteratorModifier);
+    if (MapModRes.isInvalid())
+      return nullptr;
+    IteratorModifier = MapModRes.get();
+  }
   CXXScopeSpec MapperIdScopeSpec;
   DeclarationNameInfo MapperIdInfo;
   llvm::SmallVector<Expr *, 16> UnresolvedMappers;
@@ -11542,14 +11551,22 @@ OMPClause *TreeTransform<Derived>::TransformOMPToClause(OMPToClause *C) {
           *this, C, Vars, MapperIdScopeSpec, MapperIdInfo, UnresolvedMappers))
     return nullptr;
   return getDerived().RebuildOMPToClause(
-      C->getMotionModifiers(), C->getMotionModifiersLoc(), MapperIdScopeSpec,
-      MapperIdInfo, C->getColonLoc(), Vars, Locs, UnresolvedMappers);
+      C->getMotionModifiers(), C->getMotionModifiersLoc(), IteratorModifier,
+      MapperIdScopeSpec, MapperIdInfo, C->getColonLoc(), Vars, Locs,
+      UnresolvedMappers);
 }
 
 template <typename Derived>
 OMPClause *TreeTransform<Derived>::TransformOMPFromClause(OMPFromClause *C) {
   OMPVarListLocTy Locs(C->getBeginLoc(), C->getLParenLoc(), C->getEndLoc());
   llvm::SmallVector<Expr *, 16> Vars;
+  Expr *IteratorModifier = C->getIteratorModifier();
+  if (IteratorModifier) {
+    ExprResult MapModRes = getDerived().TransformExpr(IteratorModifier);
+    if (MapModRes.isInvalid())
+      return nullptr;
+    IteratorModifier = MapModRes.get();
+  }
   CXXScopeSpec MapperIdScopeSpec;
   DeclarationNameInfo MapperIdInfo;
   llvm::SmallVector<Expr *, 16> UnresolvedMappers;
@@ -11557,8 +11574,9 @@ OMPClause *TreeTransform<Derived>::TransformOMPFromClause(OMPFromClause *C) {
           *this, C, Vars, MapperIdScopeSpec, MapperIdInfo, UnresolvedMappers))
     return nullptr;
   return getDerived().RebuildOMPFromClause(
-      C->getMotionModifiers(), C->getMotionModifiersLoc(), MapperIdScopeSpec,
-      MapperIdInfo, C->getColonLoc(), Vars, Locs, UnresolvedMappers);
+      C->getMotionModifiers(), C->getMotionModifiersLoc(), IteratorModifier,
+      MapperIdScopeSpec, MapperIdInfo, C->getColonLoc(), Vars, Locs,
+      UnresolvedMappers);
 }
 
 template <typename Derived>
diff --git a/clang/lib/Serialization/ASTReader.cpp b/clang/lib/Serialization/ASTReader.cpp
index 55c52154c4113..aec61322fb8be 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -5580,9 +5580,13 @@ void ASTReader::InitializeContext() {
 
   // If there were any CUDA special declarations, deserialize them.
   if (!CUDASpecialDeclRefs.empty()) {
-    assert(CUDASpecialDeclRefs.size() == 1 && "More decl refs than expected!");
+    assert(CUDASpecialDeclRefs.size() == 3 && "Unexpected decl ref count!");
     Context.setcudaConfigureCallDecl(
-                           cast<FunctionDecl>(GetDecl(CUDASpecialDeclRefs[0])));
+        cast_or_null<FunctionDecl>(GetDecl(CUDASpecialDeclRefs[0])));
+    Context.setcudaGetParameterBufferDecl(
+        cast_or_null<FunctionDecl>(GetDecl(CUDASpecialDeclRefs[1])));
+    Context.setcudaLaunchDeviceDecl(
+        cast_or_null<FunctionDecl>(GetDecl(CUDASpecialDeclRefs[2])));
   }
 
   // Re-export any modules that were imported by a non-module AST file.
@@ -12387,6 +12391,8 @@ void OMPClauseReader::VisitOMPToClause(OMPToClause *C) {
     C->setMotionModifier(
         I, static_cast<OpenMPMotionModifierKind>(Record.readInt()));
     C->setMotionModifierLoc(I, Record.readSourceLocation());
+    if (C->getMotionModifier(I) == OMPC_MOTION_MODIFIER_iterator)
+      C->setIteratorModifier(Record.readExpr());
   }
   C->setMapperQualifierLoc(Record.readNestedNameSpecifierLoc());
   C->setMapperIdInfo(Record.readDeclarationNameInfo());
@@ -12443,6 +12449,8 @@ void OMPClauseReader::VisitOMPFromClause(OMPFromClause *C) {
     C->setMotionModifier(
         I, static_cast<OpenMPMotionModifierKind>(Record.readInt()));
     C->setMotionModifierLoc(I, Record.readSourceLocation());
+    if (C->getMotionModifier(I) == OMPC_MOTION_MODIFIER_iterator)
+      C->setIteratorModifier(Record.readExpr());
   }
   C->setMapperQualifierLoc(Record.readNestedNameSpecifierLoc());
   C->setMapperIdInfo(Record.readDeclarationNameInfo());
diff --git a/clang/lib/Serialization/ASTReaderDecl.cpp b/clang/lib/Serialization/ASTReaderDecl.cpp
index 5456e73956659..882d54f31280a 100644
--- a/clang/lib/Serialization/ASTReaderDecl.cpp
+++ b/clang/lib/Serialization/ASTReaderDecl.cpp
@@ -2107,8 +2107,8 @@ void ASTDeclMerger::MergeDefinitionData(
     auto *Def = DD.Definition;
     DD = std::move(MergeDD);
     DD.Definition = Def;
-    while ((Def = Def->getPreviousDecl()))
-      cast<CXXRecordDecl>(Def)->DefinitionData = ⅅ
+    for (auto *D : Def->redecls())
+      cast<CXXRecordDecl>(D)->DefinitionData = ⅅ
     return;
   }
 
diff --git a/clang/lib/Serialization/ASTWriter.cpp b/clang/lib/Serialization/ASTWriter.cpp
index e8c0d3f2b4ee9..667e04049dac8 100644
--- a/clang/lib/Serialization/ASTWriter.cpp
+++ b/clang/lib/Serialization/ASTWriter.cpp
@@ -5706,8 +5706,13 @@ void ASTWriter::PrepareWritingSpecialDecls(Sema &SemaRef) {
     GetDeclRef(SemaRef.getStdAlignValT());
   }
 
-  if (Context.getcudaConfigureCallDecl())
+  if (Context.getcudaConfigureCallDecl() ||
+      Context.getcudaGetParameterBufferDecl() ||
+      Context.getcudaLaunchDeviceDecl()) {
     GetDeclRef(Context.getcudaConfigureCallDecl());
+    GetDeclRef(Context.getcudaGetParameterBufferDecl());
+    GetDeclRef(Context.getcudaLaunchDeviceDecl());
+  }
 
   // Writing all of the known namespaces.
   for (const auto &I : SemaRef.KnownNamespaces)
@@ -5834,19 +5839,19 @@ void ASTWriter::WriteSpecialDeclRecords(Sema &SemaRef) {
       Stream.EmitRecord(PENDING_IMPLICIT_INSTANTIATIONS, PendingInstantiations);
   }
 
+  auto AddEmittedDeclRefOrZero = [this](RecordData &Refs, Decl *D) {
+    if (!D || !wasDeclEmitted(D))
+      Refs.push_back(0);
+    else
+      AddDeclRef(D, Refs);
+  };
+
   // Write the record containing declaration references of Sema.
   RecordData SemaDeclRefs;
   if (SemaRef.StdNamespace || SemaRef.StdBadAlloc || SemaRef.StdAlignValT) {
-    auto AddEmittedDeclRefOrZero = [this, &SemaDeclRefs](Decl *D) {
-      if (!D || !wasDeclEmitted(D))
-        SemaDeclRefs.push_back(0);
-      else
-        AddDeclRef(D, SemaDeclRefs);
-    };
-
-    AddEmittedDeclRefOrZero(SemaRef.getStdNamespace());
-    AddEmittedDeclRefOrZero(SemaRef.getStdBadAlloc());
-    AddEmittedDeclRefOrZero(SemaRef.getStdAlignValT());
+    AddEmittedDeclRefOrZero(SemaDeclRefs, SemaRef.getStdNamespace());
+    AddEmittedDeclRefOrZero(SemaDeclRefs, SemaRef.getStdBadAlloc());
+    AddEmittedDeclRefOrZero(SemaDeclRefs, SemaRef.getStdAlignValT());
   }
   if (!SemaDeclRefs.empty())
     Stream.EmitRecord(SEMA_DECL_REFS, SemaDeclRefs);
@@ -5862,9 +5867,13 @@ void ASTWriter::WriteSpecialDeclRecords(Sema &SemaRef) {
 
   // Write the record containing CUDA-specific declaration references.
   RecordData CUDASpecialDeclRefs;
-  if (auto *CudaCallDecl = Context.getcudaConfigureCallDecl();
-      CudaCallDecl && wasDeclEmitted(CudaCallDecl)) {
-    AddDeclRef(CudaCallDecl, CUDASpecialDeclRefs);
+  if (auto *CudaCallDecl = Context.getcudaConfigureCallDecl(),
+      *CudaGetParamDecl = Context.getcudaGetParameterBufferDecl(),
+      *CudaLaunchDecl = Context.getcudaLaunchDeviceDecl();
+      CudaCallDecl || CudaGetParamDecl || CudaLaunchDecl) {
+    AddEmittedDeclRefOrZero(CUDASpecialDeclRefs, CudaCallDecl);
+    AddEmittedDeclRefOrZero(CUDASpecialDeclRefs, CudaGetParamDecl);
+    AddEmittedDeclRefOrZero(CUDASpecialDeclRefs, CudaLaunchDecl);
     Stream.EmitRecord(CUDA_SPECIAL_DECL_REFS, CUDASpecialDeclRefs);
   }
 
@@ -8417,6 +8426,8 @@ void OMPClauseWriter::VisitOMPToClause(OMPToClause *C) {
   for (unsigned I = 0; I < NumberOfOMPMotionModifiers; ++I) {
     Record.push_back(C->getMotionModifier(I));
     Record.AddSourceLocation(C->getMotionModifierLoc(I));
+    if (C->getMotionModifier(I) == OMPC_MOTION_MODIFIER_iterator)
+      Record.AddStmt(C->getIteratorModifier());
   }
   Record.AddNestedNameSpecifierLoc(C->getMapperQualifierLoc());
   Record.AddDeclarationNameInfo(C->getMapperIdInfo());
@@ -8447,6 +8458,8 @@ void OMPClauseWriter::VisitOMPFromClause(OMPFromClause *C) {
   for (unsigned I = 0; I < NumberOfOMPMotionModifiers; ++I) {
     Record.push_back(C->getMotionModifier(I));
     Record.AddSourceLocation(C->getMotionModifierLoc(I));
+    if (C->getMotionModifier(I) == OMPC_MOTION_MODIFIER_iterator)
+      Record.AddStmt(C->getIteratorModifier());
   }
   Record.AddNestedNameSpecifierLoc(C->getMapperQualifierLoc());
   Record.AddDeclarationNameInfo(C->getMapperIdInfo());
diff --git a/clang/lib/StaticAnalyzer/Core/BugReporter.cpp b/clang/lib/StaticAnalyzer/Core/BugReporter.cpp
index 5fe64dc5e9270..4c066520b668f 100644
--- a/clang/lib/StaticAnalyzer/Core/BugReporter.cpp
+++ b/clang/lib/StaticAnalyzer/Core/BugReporter.cpp
@@ -25,7 +25,6 @@
 #include "clang/AST/StmtObjC.h"
 #include "clang/Analysis/AnalysisDeclContext.h"
 #include "clang/Analysis/CFG.h"
-#include "clang/Analysis/CFGStmtMap.h"
 #include "clang/Analysis/PathDiagnostic.h"
 #include "clang/Analysis/ProgramPoint.h"
 #include "clang/Basic/LLVM.h"
diff --git a/clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp b/clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp
index 4ddf8fd5b4b0f..db27c06cd18a3 100644
--- a/clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp
+++ b/clang/lib/StaticAnalyzer/Core/ExprEngineC.cpp
@@ -560,6 +560,7 @@ void ExprEngine::VisitCast(const CastExpr *CastE, const Expr *Ex,
       case CK_VectorSplat:
       case CK_HLSLElementwiseCast:
       case CK_HLSLAggregateSplatCast:
+      case CK_HLSLMatrixTruncation:
       case CK_HLSLVectorTruncation: {
         QualType resultType = CastE->getType();
         if (CastE->isGLValue())
diff --git a/clang/lib/StaticAnalyzer/Frontend/AnalysisConsumer.cpp b/clang/lib/StaticAnalyzer/Frontend/AnalysisConsumer.cpp
index e0deec16afb12..827fcaaf1b634 100644
--- a/clang/lib/StaticAnalyzer/Frontend/AnalysisConsumer.cpp
+++ b/clang/lib/StaticAnalyzer/Frontend/AnalysisConsumer.cpp
@@ -364,7 +364,7 @@ class AnalysisConsumer : public AnalysisASTConsumer,
   void storeTopLevelDecls(DeclGroupRef DG);
 
   /// Check if we should skip (not analyze) the given function.
-  AnalysisMode getModeForDecl(Decl *D, AnalysisMode Mode);
+  AnalysisMode getModeForDecl(const Decl *D, AnalysisMode Mode) const;
   void runAnalysisOnTranslationUnit(ASTContext &C);
 
   /// Print \p S to stderr if \c Opts.AnalyzerDisplayProgress is set.
@@ -677,7 +677,7 @@ void AnalysisConsumer::HandleTranslationUnit(ASTContext &C) {
 }
 
 AnalysisConsumer::AnalysisMode
-AnalysisConsumer::getModeForDecl(Decl *D, AnalysisMode Mode) {
+AnalysisConsumer::getModeForDecl(const Decl *D, AnalysisMode Mode) const {
   if (!Opts.AnalyzeSpecificFunction.empty() &&
       AnalysisDeclContext::getFunctionName(D) != Opts.AnalyzeSpecificFunction &&
       cross_tu::CrossTranslationUnitContext::getLookupName(D).value_or("") !=
@@ -695,7 +695,7 @@ AnalysisConsumer::getModeForDecl(Decl *D, AnalysisMode Mode) {
 
   const SourceManager &SM = Ctx->getSourceManager();
 
-  const SourceLocation Loc = [&SM](Decl *D) -> SourceLocation {
+  const SourceLocation Loc = [&SM](const Decl *D) -> SourceLocation {
     const Stmt *Body = D->getBody();
     SourceLocation SL = Body ? Body->getBeginLoc() : D->getLocation();
     return SM.getExpansionLoc(SL);
diff --git a/clang/lib/Tooling/Transformer/Parsing.cpp b/clang/lib/Tooling/Transformer/Parsing.cpp
index 19a1c7c66df46..f7bffda6967a9 100644
--- a/clang/lib/Tooling/Transformer/Parsing.cpp
+++ b/clang/lib/Tooling/Transformer/Parsing.cpp
@@ -108,7 +108,7 @@ getBinaryStringSelectors() {
 static const llvm::StringMap<RangeSelectorOp<RangeSelector, RangeSelector>> &
 getBinaryRangeSelectors() {
   static const llvm::StringMap<RangeSelectorOp<RangeSelector, RangeSelector>>
-      M = {{"enclose", enclose}, {"between", between}};
+      M = {{"enclose", enclose}, {"between", between}, {"merge", merge}};
   return M;
 }
 
diff --git a/clang/lib/Tooling/Transformer/RangeSelector.cpp b/clang/lib/Tooling/Transformer/RangeSelector.cpp
index 54a1590d3106d..68b16f91652fb 100644
--- a/clang/lib/Tooling/Transformer/RangeSelector.cpp
+++ b/clang/lib/Tooling/Transformer/RangeSelector.cpp
@@ -178,6 +178,63 @@ RangeSelector transformer::encloseNodes(std::string BeginID,
   return transformer::enclose(node(std::move(BeginID)), node(std::move(EndID)));
 }
 
+RangeSelector transformer::merge(RangeSelector First, RangeSelector Second) {
+  return [First,
+          Second](const MatchResult &Result) -> Expected<CharSourceRange> {
+    Expected<CharSourceRange> FirstRange = First(Result);
+    if (!FirstRange)
+      return FirstRange.takeError();
+    Expected<CharSourceRange> SecondRange = Second(Result);
+    if (!SecondRange)
+      return SecondRange.takeError();
+
+    SourceLocation FirstB = FirstRange->getBegin();
+    SourceLocation FirstE = FirstRange->getEnd();
+    SourceLocation SecondB = SecondRange->getBegin();
+    SourceLocation SecondE = SecondRange->getEnd();
+    // Result begin loc is the minimum of the begin locs of the two ranges.
+    SourceLocation B =
+        Result.SourceManager->isBeforeInTranslationUnit(FirstB, SecondB)
+            ? FirstB
+            : SecondB;
+    if (FirstRange->isTokenRange() && SecondRange->isTokenRange()) {
+      // Both ranges are token ranges. Just take the maximum of their end locs.
+      SourceLocation E =
+          Result.SourceManager->isBeforeInTranslationUnit(FirstE, SecondE)
+              ? SecondE
+              : FirstE;
+      return CharSourceRange::getTokenRange(B, E);
+    }
+
+    if (FirstRange->isTokenRange()) {
+      // The end of the first range is a token. Need to resolve the token to a
+      // char range.
+      FirstE = Lexer::getLocForEndOfToken(FirstE, /*Offset=*/0,
+                                          *Result.SourceManager,
+                                          Result.Context->getLangOpts());
+      if (FirstE.isInvalid())
+        return invalidArgumentError(
+            "merge: can't resolve first token range to valid source range");
+    }
+    if (SecondRange->isTokenRange()) {
+      // The end of the second range is a token. Need to resolve the token to a
+      // char range.
+      SecondE = Lexer::getLocForEndOfToken(SecondE, /*Offset=*/0,
+                                           *Result.SourceManager,
+                                           Result.Context->getLangOpts());
+      if (SecondE.isInvalid())
+        return invalidArgumentError(
+            "merge: can't resolve second token range to valid source range");
+    }
+    // Result end loc is the maximum of the end locs of the two ranges.
+    SourceLocation E =
+        Result.SourceManager->isBeforeInTranslationUnit(FirstE, SecondE)
+            ? SecondE
+            : FirstE;
+    return CharSourceRange::getCharRange(B, E);
+  };
+}
+
 RangeSelector transformer::member(std::string ID) {
   return [ID](const MatchResult &Result) -> Expected<CharSourceRange> {
     Expected<DynTypedNode> Node = getNode(Result.Nodes, ID);
diff --git a/clang/test/AST/ByteCode/builtin-functions.cpp b/clang/test/AST/ByteCode/builtin-functions.cpp
index 4a53cb66b2fdd..3076b5239ebbe 100644
--- a/clang/test/AST/ByteCode/builtin-functions.cpp
+++ b/clang/test/AST/ByteCode/builtin-functions.cpp
@@ -1545,6 +1545,13 @@ namespace Memcmp {
 
   int unknown;
   void foo(void) { unknown *= __builtin_memcmp(0, 0, 2); }
+
+  constexpr int onepasttheend(char a) {
+    __builtin_memcmp(&a, &a + 1, 1); // both-note {{read of dereferenced one-past-the-end pointer}}
+    return 1;
+  }
+  static_assert(onepasttheend(10)); // both-error {{not an integral constant expression}} \
+                                    // both-note {{in call to}}
 }
 
 namespace Memchr {
diff --git a/clang/test/AST/ByteCode/c.c b/clang/test/AST/ByteCode/c.c
index bffd557ff77a6..0d3d97b5eeab2 100644
--- a/clang/test/AST/ByteCode/c.c
+++ b/clang/test/AST/ByteCode/c.c
@@ -392,3 +392,16 @@ void plainComplex(void) {
   _Complex cd; // all-warning {{_Complex double}}
   cd = *(_Complex *)&(struct { double r, i; }){0.0, 0.0}; // all-warning {{_Complex double}}
 }
+
+/// This test results in an ImplicitValueInitExpr with DiscardResult set.
+struct M{
+  char c;
+};
+typedef struct S64 {
+  struct M m;
+  char a[64];
+} I64;
+
+_Static_assert((((I64){}, 1)), ""); // all-warning {{left operand of comma operator has no effect}} \
+                                    // pedantic-warning {{use of an empty initializer is a C23 extension}} \
+                                    // pedantic-warning {{expression is not an integer constant expression; folding it to a constant is a GNU extension}}
diff --git a/clang/test/CIR/CodeGen/aapcs-volatile-bitfields.c b/clang/test/CIR/CodeGen/aapcs-volatile-bitfields.c
index 19362cf79b107..66891f9e1ad78 100644
--- a/clang/test/CIR/CodeGen/aapcs-volatile-bitfields.c
+++ b/clang/test/CIR/CodeGen/aapcs-volatile-bitfields.c
@@ -82,7 +82,7 @@ int check_load(st1 *s1) {
   return s1->b;
 }
 
-// CIR:  cir.func dso_local @check_load
+// CIR:  cir.func {{.*}} @check_load
 // CIR:    [[LOAD:%.*]] = cir.load align(8) {{.*}} : !cir.ptr<!cir.ptr<!rec_st1>>, !cir.ptr<!rec_st1>
 // CIR:    [[MEMBER:%.*]] = cir.get_member [[LOAD]][0] {name = "b"} : !cir.ptr<!rec_st1> -> !cir.ptr<!u16i>
 // CIR:    [[BITFI:%.*]] = cir.get_bitfield align(4) (#bfi_b, [[MEMBER]] {is_volatile} : !cir.ptr<!u16i>) -> !u32i
@@ -114,7 +114,7 @@ int check_load_exception(st3 *s3) {
   return s3->b;
 }
 
-// CIR:  cir.func dso_local @check_load_exception
+// CIR:  cir.func {{.*}} @check_load_exception
 // CIR:    [[LOAD:%.*]] = cir.load align(8) {{.*}} : !cir.ptr<!cir.ptr<!rec_st3>>, !cir.ptr<!rec_st3>
 // CIR:    [[MEMBER:%.*]] = cir.get_member [[LOAD]][2] {name = "b"} : !cir.ptr<!rec_st3> -> !cir.ptr<!u8i>
 // CIR:    [[BITFI:%.*]] = cir.get_bitfield align(4) (#bfi_b1, [[MEMBER]] {is_volatile} : !cir.ptr<!u8i>) -> !u32i
@@ -151,7 +151,7 @@ int clip_load_exception2(clip *c) {
   return c->a;
 }
 
-// CIR:  cir.func dso_local @clip_load_exception2
+// CIR:  cir.func {{.*}} @clip_load_exception2
 // CIR:    [[LOAD:%.*]] = cir.load align(8) {{.*}} : !cir.ptr<!cir.ptr<!rec_clip>>, !cir.ptr<!rec_clip>
 // CIR:    [[MEMBER:%.*]] = cir.get_member [[LOAD]][0] {name = "a"} : !cir.ptr<!rec_clip> -> !cir.ptr<!cir.array<!u8i x 3>>
 // CIR:    [[BITFI:%.*]] = cir.get_bitfield align(4) (#bfi_a1, [[MEMBER]] {is_volatile} : !cir.ptr<!cir.array<!u8i x 3>>) -> !s32i
@@ -178,7 +178,7 @@ void check_store(st2 *s2) {
   s2->a = 1;
 }
 
-// CIR:  cir.func dso_local @check_store
+// CIR:  cir.func {{.*}} @check_store
 // CIR:    [[CONST:%.*]] = cir.const #cir.int<1> : !s32i
 // CIR:    [[CAST:%.*]] = cir.cast integral [[CONST]] : !s32i -> !s16i
 // CIR:    [[LOAD:%.*]] = cir.load align(8) {{.*}} : !cir.ptr<!cir.ptr<!rec_st2>>, !cir.ptr<!rec_st2>
@@ -209,7 +209,7 @@ void check_store_exception(st3 *s3) {
   s3->b = 2;
 }
 
-// CIR:  cir.func dso_local @check_store_exception
+// CIR:  cir.func {{.*}} @check_store_exception
 // CIR:    [[CONST:%.*]] = cir.const #cir.int<2> : !s32i
 // CIR:    [[CAST:%.*]] = cir.cast integral [[CONST]] : !s32i -> !u32i
 // CIR:    [[LOAD:%.*]] = cir.load align(8) {{.*}} : !cir.ptr<!cir.ptr<!rec_st3>>, !cir.ptr<!rec_st3>
@@ -239,7 +239,7 @@ void clip_store_exception2(clip *c) {
   c->a = 3;
 }
 
-// CIR:  cir.func dso_local @clip_store_exception2
+// CIR:  cir.func {{.*}} @clip_store_exception2
 // CIR:    [[CONST:%.*]] = cir.const #cir.int<3> : !s32i
 // CIR:    [[LOAD:%.*]] = cir.load align(8) {{.*}} : !cir.ptr<!cir.ptr<!rec_clip>>, !cir.ptr<!rec_clip>
 // CIR:    [[MEMBER:%.*]] = cir.get_member [[LOAD]][0] {name = "a"} : !cir.ptr<!rec_clip> -> !cir.ptr<!cir.array<!u8i x 3>>
@@ -261,7 +261,7 @@ void check_store_second_member (st4 *s4) {
   s4->b = 1;
 }
 
-// CIR:  cir.func dso_local @check_store_second_member
+// CIR:  cir.func {{.*}} @check_store_second_member
 // CIR:    [[ONE:%.*]] = cir.const #cir.int<1> : !s32i
 // CIR:    [[CAST:%.*]] = cir.cast integral [[ONE]] : !s32i -> !u64i
 // CIR:    [[LOAD:%.*]] = cir.load align(8) {{.*}} : !cir.ptr<!cir.ptr<!rec_st4>>, !cir.ptr<!rec_st4>
diff --git a/clang/test/CIR/CodeGen/address-space-conversion.cpp b/clang/test/CIR/CodeGen/address-space-conversion.cpp
index ca026be60ee71..9ce1f5e4b8e24 100644
--- a/clang/test/CIR/CodeGen/address-space-conversion.cpp
+++ b/clang/test/CIR/CodeGen/address-space-conversion.cpp
@@ -11,7 +11,7 @@ using pi2_t = int __attribute__((address_space(2))) *;
 using ri1_t = int __attribute__((address_space(1))) &;
 using ri2_t = int __attribute__((address_space(2))) &;
 
-// CIR: cir.func dso_local @{{.*test_ptr.*}}
+// CIR: cir.func {{.*}} @{{.*test_ptr.*}}
 // LLVM: define dso_local void @{{.*test_ptr.*}}
 // OGCG: define dso_local void @{{.*test_ptr.*}}
 void test_ptr() {
@@ -30,7 +30,7 @@ void test_ptr() {
   // OGCG-NEXT: store ptr addrspace(2)  %{{.*}}, ptr %{{.*}}
 }
 
-// CIR: cir.func dso_local @{{.*test_ref.*}}
+// CIR: cir.func {{.*}} @{{.*test_ref.*}}
 // LLVM: define dso_local void @{{.*test_ref.*}}
 // OGCG: define dso_local void @{{.*test_ref.*}}
 void test_ref() {
@@ -56,7 +56,7 @@ void test_ref() {
   // OGCG-NEXT: store ptr addrspace(2) %{{.*}}, ptr %{{.*}}
 }
 
-// CIR: cir.func dso_local @{{.*test_nullptr.*}}
+// CIR: cir.func {{.*}} @{{.*test_nullptr.*}}
 // LLVM: define dso_local void @{{.*test_nullptr.*}}
 // OGCG: define dso_local void @{{.*test_nullptr.*}}
 void test_nullptr() {
@@ -74,7 +74,7 @@ void test_nullptr() {
   // OGCG-NEXT: store ptr addrspace(2) null, ptr %{{.*}}
 }
 
-// CIR: cir.func dso_local @{{.*test_side_effect.*}}
+// CIR: cir.func {{.*}} @{{.*test_side_effect.*}}
 // LLVM: define dso_local void @{{.*test_side_effect.*}}
 // OGCG: define dso_local void @{{.*test_side_effect.*}}
 void test_side_effect(pi1_t b) {
diff --git a/clang/test/CIR/CodeGen/address-space.c b/clang/test/CIR/CodeGen/address-space.c
index a334b8a2907e4..2a5c0e15d5850 100644
--- a/clang/test/CIR/CodeGen/address-space.c
+++ b/clang/test/CIR/CodeGen/address-space.c
@@ -6,7 +6,7 @@
 // RUN: FileCheck --input-file=%t.ll %s -check-prefix=OGCG
 
 // Test address space 1
-// CIR: cir.func dso_local @foo(%arg0: !cir.ptr<!s32i, target_address_space(1)>
+// CIR: cir.func {{.*}} @foo(%arg0: !cir.ptr<!s32i, target_address_space(1)>
 // LLVM: define dso_local void @foo(ptr addrspace(1) %0)
 // OGCG: define dso_local void @foo(ptr addrspace(1) noundef %arg)
 void foo(int __attribute__((address_space(1))) *arg) {
@@ -14,7 +14,7 @@ void foo(int __attribute__((address_space(1))) *arg) {
 }
 
 // Test explicit address space 0 (should be same as default)
-// CIR: cir.func dso_local @bar(%arg0: !cir.ptr<!s32i, target_address_space(0)>
+// CIR: cir.func {{.*}} @bar(%arg0: !cir.ptr<!s32i, target_address_space(0)>
 // LLVM: define dso_local void @bar(ptr %0)
 // OGCG: define dso_local void @bar(ptr noundef %arg)
 void bar(int __attribute__((address_space(0))) *arg) {
@@ -22,7 +22,7 @@ void bar(int __attribute__((address_space(0))) *arg) {
 }
 
 // Test default address space (no attribute)
-// CIR: cir.func dso_local @baz(%arg0: !cir.ptr<!s32i>
+// CIR: cir.func {{.*}} @baz(%arg0: !cir.ptr<!s32i>
 // LLVM: define dso_local void @baz(ptr %0)
 // OGCG: define dso_local void @baz(ptr noundef %arg)
 void baz(int *arg) {
diff --git a/clang/test/CIR/CodeGen/array-ctor.cpp b/clang/test/CIR/CodeGen/array-ctor.cpp
index 1fb14ecf0663e..8643c8c644e11 100644
--- a/clang/test/CIR/CodeGen/array-ctor.cpp
+++ b/clang/test/CIR/CodeGen/array-ctor.cpp
@@ -14,7 +14,7 @@ void foo() {
     S s[42];
 }
 
-// CIR-BEFORE-LPP: cir.func dso_local @_Z3foov()
+// CIR-BEFORE-LPP: cir.func {{.*}} @_Z3foov()
 // CIR-BEFORE-LPP:   %[[ARRAY:.*]] = cir.alloca !cir.array<!rec_S x 42>, !cir.ptr<!cir.array<!rec_S x 42>>, ["s", init]
 // CIR-BEFORE-LPP:   cir.array.ctor %[[ARRAY]] : !cir.ptr<!cir.array<!rec_S x 42>> {
 // CIR-BEFORE-LPP:    ^bb0(%[[ARG:.*]]: !cir.ptr<!rec_S>):
@@ -24,7 +24,7 @@ void foo() {
 // CIR-BEFORE-LPP:   cir.return
 // CIR-BEFORE-LPP: }
 
-// CIR: cir.func dso_local @_Z3foov()
+// CIR: cir.func {{.*}} @_Z3foov()
 // CIR:   %[[ARRAY:.*]] = cir.alloca !cir.array<!rec_S x 42>, !cir.ptr<!cir.array<!rec_S x 42>>, ["s", init]
 // CIR:   %[[CONST42:.*]] = cir.const #cir.int<42> : !u64i
 // CIR:   %[[DECAY:.*]] = cir.cast array_to_ptrdecay %[[ARRAY]] : !cir.ptr<!cir.array<!rec_S x 42>> -> !cir.ptr<!rec_S>
@@ -84,12 +84,12 @@ void zero_sized() {
     S s[0];
 }
 
-// CIR-BEFORE-LPP:     cir.func dso_local @_Z10zero_sizedv()
+// CIR-BEFORE-LPP:     cir.func {{.*}} @_Z10zero_sizedv()
 // CIR-BEFORE-LPP:       cir.alloca !cir.array<!rec_S x 0>, !cir.ptr<!cir.array<!rec_S x 0>>, ["s"]
 // CIR-BEFORE-LPP-NOT:   cir.array.ctor
 // CIR-BEFORE-LPP:       cir.return
 
-// CIR:     cir.func dso_local @_Z10zero_sizedv()
+// CIR:     cir.func {{.*}} @_Z10zero_sizedv()
 // CIR:       cir.alloca !cir.array<!rec_S x 0>, !cir.ptr<!cir.array<!rec_S x 0>>, ["s"]
 // CIR-NOT:   cir.do
 // CIR-NOT:   cir.call @_ZN1SC1Ev
diff --git a/clang/test/CIR/CodeGen/asm-label-inline-builtins.c b/clang/test/CIR/CodeGen/asm-label-inline-builtins.c
index 24c9a32e7c41d..bad521aed7821 100644
--- a/clang/test/CIR/CodeGen/asm-label-inline-builtins.c
+++ b/clang/test/CIR/CodeGen/asm-label-inline-builtins.c
@@ -31,7 +31,7 @@ void test(const char *fmt, __builtin_va_list ap) {
   vprintf(fmt, ap);
 }
 
-// CIR: cir.func internal private @__vprintfieee128.inline({{.*}}) -> !s32i inline(always)
+// CIR: cir.func always_inline internal private @__vprintfieee128.inline({{.*}}) -> !s32i
 // CIR:   cir.call @__vfprintf_chkieee128(%{{.*}}, %{{.*}}, %{{.*}}, %{{.*}})
 //
 // CIR: cir.func {{.*}} @test({{.*}})
diff --git a/clang/test/CIR/CodeGen/assign-operator.cpp b/clang/test/CIR/CodeGen/assign-operator.cpp
index 66d4b4818c10e..ad3e5c00911c4 100644
--- a/clang/test/CIR/CodeGen/assign-operator.cpp
+++ b/clang/test/CIR/CodeGen/assign-operator.cpp
@@ -13,7 +13,7 @@ void a() {
   a = 1u;
 }
 
-// CIR: cir.func private @_ZN1xaSEi(!cir.ptr<!rec_x>, !s32i)
+// CIR: cir.func {{.*}} @_ZN1xaSEi(!cir.ptr<!rec_x>, !s32i)
 // CIR: cir.func{{.*}} @_Z1av()
 // CIR:   %[[A_ADDR:.*]] = cir.alloca !rec_x, !cir.ptr<!rec_x>, ["a"]
 // CIR:   %[[ONE:.*]] = cir.const #cir.int<1> : !u32i
@@ -63,7 +63,7 @@ void copy_c(C &c1, C &c2) {
 
 // Implicit assignment operator for C.
 
-// CIR: cir.func comdat linkonce_odr @_ZN1CaSERKS_(%arg0: !cir.ptr<!rec_C> {{.*}}, %arg1: !cir.ptr<!rec_C> {{.*}}) -> !cir.ptr<!rec_C>
+// CIR: cir.func {{.*}} @_ZN1CaSERKS_(%arg0: !cir.ptr<!rec_C> {{.*}}, %arg1: !cir.ptr<!rec_C> {{.*}}) -> !cir.ptr<!rec_C>
 // CIR:   %[[THIS_ADDR:.*]] = cir.alloca !cir.ptr<!rec_C>, !cir.ptr<!cir.ptr<!rec_C>>, ["this", init]
 // CIR:   %[[ARG1_ADDR:.*]] = cir.alloca !cir.ptr<!rec_C>, !cir.ptr<!cir.ptr<!rec_C>>, ["", init, const]
 // CIR:   %[[RET_ADDR:.*]] = cir.alloca !cir.ptr<!rec_C>, !cir.ptr<!cir.ptr<!rec_C>>, ["__retval"]
diff --git a/clang/test/CIR/CodeGen/bitfield-union.c b/clang/test/CIR/CodeGen/bitfield-union.c
index 14a2aaf68d318..9c235c5dc6195 100644
--- a/clang/test/CIR/CodeGen/bitfield-union.c
+++ b/clang/test/CIR/CodeGen/bitfield-union.c
@@ -39,7 +39,7 @@ void f() {
 // CIR: #bfi_y = #cir.bitfield_info<name = "y", storage_type = !u8i, size = 4, offset = 0, is_signed = true>
 // CIR: #bfi_z = #cir.bitfield_info<name = "z", storage_type = !u8i, size = 8, offset = 0, is_signed = true>
 
-// CIR:   cir.func no_proto dso_local @f
+// CIR:   cir.func {{.*}} @f
 // CIR:    [[ALLOC:%.*]] = cir.alloca !rec_demo, !cir.ptr<!rec_demo>, ["d"] {alignment = 4 : i64}
 // CIR:    [[ONE:%.*]] = cir.const #cir.int<1> : !s32i
 // CIR:    [[X:%.*]] = cir.get_member [[ALLOC]][0] {name = "x"} : !cir.ptr<!rec_demo> -> !cir.ptr<!s32i>
diff --git a/clang/test/CIR/CodeGen/bitfields.c b/clang/test/CIR/CodeGen/bitfields.c
index b2c7d1c0be926..d1160399fd919 100644
--- a/clang/test/CIR/CodeGen/bitfields.c
+++ b/clang/test/CIR/CodeGen/bitfields.c
@@ -122,7 +122,7 @@ unsigned int load_field_unsigned(A* s) {
   return s->more_bits;
 }
 
-//CIR: cir.func dso_local @load_field_unsigned
+//CIR: cir.func {{.*}} @load_field_unsigned
 //CIR:   [[TMP0:%.*]] = cir.alloca !cir.ptr<!rec_A>, !cir.ptr<!cir.ptr<!rec_A>>, ["s", init] {alignment = 8 : i64}
 //CIR:   [[TMP1:%.*]] = cir.load align(8) [[TMP0]] : !cir.ptr<!cir.ptr<!rec_A>>, !cir.ptr<!rec_A>
 //CIR:   [[TMP2:%.*]] = cir.get_member [[TMP1]][3] {name = "more_bits"} : !cir.ptr<!rec_A> -> !cir.ptr<!u16i>
@@ -228,7 +228,7 @@ void get_volatile(V* v) {
   v->b = 3;
 }
 
-// CIR: cir.func dso_local @get_volatile
+// CIR: cir.func {{.*}} @get_volatile
 // CIR:   [[TMP0:%.*]] = cir.alloca !cir.ptr<!rec_V>, !cir.ptr<!cir.ptr<!rec_V>>, ["v", init] {alignment = 8 : i64}
 // CIR:   [[TMP1:%.*]] = cir.const #cir.int<3> : !s32i
 // CIR:   [[TMP2:%.*]] = cir.load align(8) [[TMP0]] : !cir.ptr<!cir.ptr<!rec_V>>, !cir.ptr<!rec_V>
@@ -255,7 +255,7 @@ void get_volatile(V* v) {
 void set_volatile(V* v) {
   v->b = 3;
 }
-//CIR: cir.func dso_local @set_volatile
+//CIR: cir.func {{.*}} @set_volatile
 //CIR:   [[TMP0:%.*]] = cir.alloca !cir.ptr<!rec_V>, !cir.ptr<!cir.ptr<!rec_V>>, ["v", init] {alignment = 8 : i64}
 //CIR:   [[TMP1:%.*]] = cir.const #cir.int<3> : !s32i
 //CIR:   [[TMP2:%.*]] = cir.load align(8) [[TMP0]] : !cir.ptr<!cir.ptr<!rec_V>>, !cir.ptr<!rec_V>
diff --git a/clang/test/CIR/CodeGen/bitfields.cpp b/clang/test/CIR/CodeGen/bitfields.cpp
index 7650e0b83faf6..8eeb71c3edfea 100644
--- a/clang/test/CIR/CodeGen/bitfields.cpp
+++ b/clang/test/CIR/CodeGen/bitfields.cpp
@@ -35,7 +35,7 @@ void def() {
 int load_field(S* s) {
   return s->c;
 }
-// CIR: cir.func dso_local @_Z10load_field
+// CIR: cir.func {{.*}} @_Z10load_field
 // CIR:   [[TMP0:%.*]] = cir.alloca !cir.ptr<!rec_S>, !cir.ptr<!cir.ptr<!rec_S>>, ["s", init]
 // CIR:   [[TMP1:%.*]] = cir.load{{.*}} [[TMP0]] : !cir.ptr<!cir.ptr<!rec_S>>, !cir.ptr<!rec_S>
 // CIR:   [[TMP2:%.*]] = cir.get_member [[TMP1]][0] {name = "c"} : !cir.ptr<!rec_S> -> !cir.ptr<!u64i>
@@ -63,7 +63,7 @@ void store_field() {
   S s;
   s.a = 3;
 }
-// CIR: cir.func dso_local @_Z11store_field
+// CIR: cir.func {{.*}} @_Z11store_field
 // CIR:   [[TMP0:%.*]] = cir.alloca !rec_S, !cir.ptr<!rec_S>
 // CIR:   [[TMP1:%.*]] = cir.const #cir.int<3> : !s32i
 // CIR:   [[TMP2:%.*]] = cir.get_member [[TMP0]][0] {name = "a"} : !cir.ptr<!rec_S> -> !cir.ptr<!u64i>
@@ -88,7 +88,7 @@ void store_bitfield_to_bitfield(S* s) {
   s->a = s->b = 3;
 }
 
-// CIR: cir.func dso_local @_Z26store_bitfield_to_bitfieldP1S
+// CIR: cir.func {{.*}} @_Z26store_bitfield_to_bitfieldP1S
 // CIR:   [[TMP0:%.*]] = cir.alloca !cir.ptr<!rec_S>, !cir.ptr<!cir.ptr<!rec_S>>, ["s", init] {alignment = 8 : i64}
 // CIR:   [[TMP1:%.*]] = cir.const #cir.int<3> : !s32i
 // CIR:   [[TMP2:%.*]] = cir.load align(8) [[TMP0]] : !cir.ptr<!cir.ptr<!rec_S>>, !cir.ptr<!rec_S>
diff --git a/clang/test/CIR/CodeGen/bitfields_be.c b/clang/test/CIR/CodeGen/bitfields_be.c
index 3e1f05401728a..f4f3476d2ef23 100644
--- a/clang/test/CIR/CodeGen/bitfields_be.c
+++ b/clang/test/CIR/CodeGen/bitfields_be.c
@@ -21,7 +21,7 @@ int init(S* s) {
   return s->c;
 }
 
-//CIR: cir.func dso_local @init
+//CIR: cir.func {{.*}} @init
 //CIR:   [[TMP0:%.*]] = cir.alloca !cir.ptr<!rec_S>, !cir.ptr<!cir.ptr<!rec_S>>, ["s", init] {alignment = 8 : i64}
 //CIR:   [[TMP1:%.*]] = cir.load align(8) [[TMP0]] : !cir.ptr<!cir.ptr<!rec_S>>, !cir.ptr<!rec_S>
 //CIR:   [[TMP2:%.*]] = cir.get_member [[TMP1]][0] {name = "c"} : !cir.ptr<!rec_S> -> !cir.ptr<!u32i>
@@ -51,7 +51,7 @@ void load(S* s) {
 }
 
 // field 'a'
-// CIR: cir.func dso_local @load
+// CIR: cir.func {{.*}} @load
 // CIR:    %[[PTR0:.*]] = cir.alloca !cir.ptr<!rec_S>, !cir.ptr<!cir.ptr<!rec_S>>, ["s", init] {alignment = 8 : i64} loc(#loc35)
 // CIR:    %[[CONST1:.*]] = cir.const #cir.int<4> : !s32i
 // CIR:    %[[MIN1:.*]] = cir.unary(minus, %[[CONST1]]) nsw : !s32i, !s32i
diff --git a/clang/test/CIR/CodeGen/constant-inits.cpp b/clang/test/CIR/CodeGen/constant-inits.cpp
index ef9802de405c1..ff0c0da26b559 100644
--- a/clang/test/CIR/CodeGen/constant-inits.cpp
+++ b/clang/test/CIR/CodeGen/constant-inits.cpp
@@ -159,7 +159,7 @@ void function() {
 // CIR-DAG-SAME:   #cir.int<125> : !u8i
 // CIR-DAG-SAME: }> : !rec_mixed_partial_bitfields
 
-// CIR-LABEL: cir.func dso_local @_Z8functionv()
+// CIR-LABEL: cir.func {{.*}} @_Z8functionv()
 // CIR:   cir.return
 
 
diff --git a/clang/test/CIR/CodeGen/copy-constructor.cpp b/clang/test/CIR/CodeGen/copy-constructor.cpp
index be05bd582d6f0..97c514ac67e03 100644
--- a/clang/test/CIR/CodeGen/copy-constructor.cpp
+++ b/clang/test/CIR/CodeGen/copy-constructor.cpp
@@ -12,7 +12,7 @@ struct HasScalarArrayMember {
 
 HasScalarArrayMember::HasScalarArrayMember(const HasScalarArrayMember &) = default;
 
-// CIR-LABEL: cir.func dso_local @_ZN20HasScalarArrayMemberC2ERKS_(
+// CIR-LABEL: cir.func {{.*}} @_ZN20HasScalarArrayMemberC2ERKS_(
 // CIR-NEXT:    %[[THIS:.*]] = cir.alloca !cir.ptr<!rec_HasScalarArrayMember>
 // CIR-NEXT:    %[[OTHER:.*]] = cir.alloca !cir.ptr<!rec_HasScalarArrayMember>
 // CIR-NEXT:    cir.store %arg0, %[[THIS]]
diff --git a/clang/test/CIR/CodeGen/coro-task.cpp b/clang/test/CIR/CodeGen/coro-task.cpp
index 4843f2433fa64..c6e21c993b64f 100644
--- a/clang/test/CIR/CodeGen/coro-task.cpp
+++ b/clang/test/CIR/CodeGen/coro-task.cpp
@@ -130,7 +130,7 @@ VoidTask silly_task() {
   co_await std::suspend_always();
 }
 
-// CIR: cir.func coroutine dso_local @_Z10silly_taskv() -> ![[VoidTask]]
+// CIR: cir.func coroutine {{.*}} @_Z10silly_taskv() -> ![[VoidTask]]
 // CIR: %[[VoidTaskAddr:.*]] = cir.alloca ![[VoidTask]], {{.*}}, ["__retval"]
 // CIR: %[[SavedFrameAddr:.*]] = cir.alloca !cir.ptr<!void>, !cir.ptr<!cir.ptr<!void>>, ["__coro_frame_addr"]
 // CIR: %[[VoidPromisseAddr:.*]] = cir.alloca ![[VoidPromisse]], {{.*}}, ["__promise"]
@@ -203,16 +203,26 @@ VoidTask silly_task() {
 // CIR:     %[[CoroHandleVoidReload:.*]] = cir.load{{.*}} %[[CoroHandleVoidAddr]] : !cir.ptr<![[CoroHandleVoid]]>, ![[CoroHandleVoid]]
 // CIR:     cir.call @_ZNSt14suspend_always13await_suspendESt16coroutine_handleIvE(%[[SuspendAlwaysAddr]], %[[CoroHandleVoidReload]])
 // CIR:     cir.yield
+
+// Third region `resume` handles coroutine resuming logic.
+
 // CIR:   }, resume : {
+// CIR:     cir.call @_ZNSt14suspend_always12await_resumeEv(%[[SuspendAlwaysAddr]])
 // CIR:     cir.yield
 // CIR:   },)
 // CIR: }
 
+// Since we already tested the cir.await guts above, the remaining checks only cover:
+// - The actual user written co_await
+// - The promise call
+// - The final suspend co_await
+// - Return
+
 folly::coro::Task<int> byRef(const std::string& s) {
   co_return s.size();
 }
 
-// CIR:  cir.func coroutine dso_local @_Z5byRefRKSt6string(%[[ARG:.*]]: !cir.ptr<![[StdString]]> {{.*}}) -> ![[IntTask]]
+// CIR:  cir.func coroutine {{.*}} @_Z5byRefRKSt6string(%[[ARG:.*]]: !cir.ptr<![[StdString]]> {{.*}}) -> ![[IntTask]]
 // CIR:    %[[AllocaParam:.*]] = cir.alloca !cir.ptr<![[StdString]]>, {{.*}}, ["s", init, const]
 // CIR:    %[[IntTaskAddr:.*]] = cir.alloca ![[IntTask]], {{.*}}, ["__retval"]
 // CIR:    %[[SavedFrameAddr:.*]]  = cir.alloca !cir.ptr<!void>, !cir.ptr<!cir.ptr<!void>>, ["__coro_frame_addr"]
@@ -245,6 +255,8 @@ folly::coro::Task<int> byRef(const std::string& s) {
 // CIR:       %[[CoroHandleVoidReload:.*]] = cir.load{{.*}} %[[CoroHandleVoidAddr]] : !cir.ptr<![[CoroHandleVoid]]>, ![[CoroHandleVoid]]
 // CIR:       cir.call @_ZNSt14suspend_always13await_suspendESt16coroutine_handleIvE(%[[SuspendAlwaysAddr]], %[[CoroHandleVoidReload]])
 // CIR:       cir.yield
-// CIR:      }, resume : {
-// CIR:        cir.yield
-// CIR:      },)
+// CIR:       }, resume : {
+// CIR:         cir.call @_ZNSt14suspend_always12await_resumeEv(%[[SuspendAlwaysAddr]])
+// CIR:         cir.yield
+// CIR:       },)
+// CIR:     }
diff --git a/clang/test/CIR/CodeGen/cxx-conversion-operators.cpp b/clang/test/CIR/CodeGen/cxx-conversion-operators.cpp
index a386a4161f82d..30e6298d65fa5 100644
--- a/clang/test/CIR/CodeGen/cxx-conversion-operators.cpp
+++ b/clang/test/CIR/CodeGen/cxx-conversion-operators.cpp
@@ -27,7 +27,7 @@ void test() {
   x = o;
 }
 
-// CIR: cir.func dso_local @_ZN20out_of_line_operatorcviEv(%[[THIS_ARG:.+]]: !cir.ptr<!rec_out_of_line_operator>{{.*}}) -> !s32i
+// CIR: cir.func {{.*}} @_ZN20out_of_line_operatorcviEv(%[[THIS_ARG:.+]]: !cir.ptr<!rec_out_of_line_operator>{{.*}}) -> !s32i
 // CIR:   %[[THIS_ALLOCA:.+]] = cir.alloca !cir.ptr<!rec_out_of_line_operator>, !cir.ptr<!cir.ptr<!rec_out_of_line_operator>>, ["this", init]
 // CIR:   %[[RETVAL:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
 // CIR:   cir.store %[[THIS_ARG]], %[[THIS_ALLOCA]] : !cir.ptr<!rec_out_of_line_operator>, !cir.ptr<!cir.ptr<!rec_out_of_line_operator>>
@@ -38,7 +38,7 @@ void test() {
 // CIR:   cir.return %[[RET_LOAD]] : !s32i
 // CIR: }
 
-// CIR: cir.func comdat linkonce_odr @_ZNK15inline_operatorcviEv(%[[INLINE_THIS_ARG:.+]]: !cir.ptr<!rec_inline_operator>{{.*}}) -> !s32i
+// CIR: cir.func no_inline comdat linkonce_odr @_ZNK15inline_operatorcviEv(%[[INLINE_THIS_ARG:.+]]: !cir.ptr<!rec_inline_operator>{{.*}}) -> !s32i
 // CIR:   %[[INLINE_THIS_ALLOCA:.+]] = cir.alloca !cir.ptr<!rec_inline_operator>, !cir.ptr<!cir.ptr<!rec_inline_operator>>, ["this", init]
 // CIR:   %[[INLINE_RETVAL:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
 // CIR:   cir.store %[[INLINE_THIS_ARG]], %[[INLINE_THIS_ALLOCA]] : !cir.ptr<!rec_inline_operator>, !cir.ptr<!cir.ptr<!rec_inline_operator>>
@@ -49,7 +49,7 @@ void test() {
 // CIR:   cir.return %[[INLINE_RET_LOAD]] : !s32i
 // CIR: }
 
-// CIR: cir.func dso_local @_Z4testv()
+// CIR: cir.func {{.*}} @_Z4testv()
 // CIR:   %[[X_ALLOCA:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["x", init]
 // CIR:   %[[I_ALLOCA:.+]] = cir.alloca {{.*}}, {{.*}}, ["i"]
 // CIR:   %[[O_ALLOCA:.+]] = cir.alloca {{.*}}, {{.*}}, ["o"]
@@ -82,7 +82,7 @@ void test() {
 // LLVM:   ret i32 %[[INLINE_RET_LOAD]]
 // LLVM: }
 
-// LLVM: define dso_local void @_Z4testv()
+// LLVM: define {{.*}} void @_Z4testv()
 // LLVM:   %[[X_ALLOCA:.+]] = alloca i32, i64 1
 // LLVM:   %[[I_ALLOCA:.+]] = alloca {{.*}}, i64 1
 // LLVM:   %[[O_ALLOCA:.+]] = alloca {{.*}}, i64 1
@@ -102,7 +102,7 @@ void test() {
 // OGCG:   ret i32 123
 // OGCG: }
 
-// OGCG: define dso_local void @_Z4testv()
+// OGCG: define {{.*}} void @_Z4testv()
 // OGCG: entry:
 // OGCG:   %[[X_VAR:.+]] = alloca i32
 // OGCG:   %[[I_VAR:.+]] = alloca {{.*}}
diff --git a/clang/test/CIR/CodeGen/cxx-special-member-attr.cpp b/clang/test/CIR/CodeGen/cxx-special-member-attr.cpp
index 815ef2c2aaa25..f2c2c1f683395 100644
--- a/clang/test/CIR/CodeGen/cxx-special-member-attr.cpp
+++ b/clang/test/CIR/CodeGen/cxx-special-member-attr.cpp
@@ -3,25 +3,15 @@
 
 struct Flub {
   int a = 123;
-  // CIR: @_ZN4FlubC1ERKS_(%arg0: !cir.ptr<!rec_Flub> loc({{.*}}), %arg1: !cir.ptr<!rec_Flub> loc({{.*}})) special_member<#cir.cxx_ctor<!rec_Flub, copy, trivial true>>
-  // CIR: @_ZN4FlubC2EOS_(%arg0: !cir.ptr<!rec_Flub> loc({{.*}}), %arg1: !cir.ptr<!rec_Flub> loc({{.*}})) special_member<#cir.cxx_ctor<!rec_Flub, move, trivial true>
-  // CIR: @_ZN4FlubaSERKS_(%arg0: !cir.ptr<!rec_Flub> loc({{.*}}), %arg1: !cir.ptr<!rec_Flub> loc({{.*}})) -> !cir.ptr<!rec_Flub> special_member<#cir.cxx_assign<!rec_Flub, copy, trivial true>>
-  // CIR: @_ZN4FlubaSEOS_(%arg0: !cir.ptr<!rec_Flub> loc({{.*}}), %arg1: !cir.ptr<!rec_Flub> loc({{.*}})) -> !cir.ptr<!rec_Flub> special_member<#cir.cxx_assign<!rec_Flub, move, trivial true>>
 };
 
 struct Foo {
   int a;
 
-  // CIR: @_ZN3FooC2Ev(%arg0: !cir.ptr<!rec_Foo> loc({{.*}})) special_member<#cir.cxx_ctor<!rec_Foo, default>>
   Foo() : a(123) {}
-
-  // CIR: @_ZN3FooC2ERKS_(%arg0: !cir.ptr<!rec_Foo> loc({{.*}}), %arg1: !cir.ptr<!rec_Foo> loc({{.*}})) special_member<#cir.cxx_ctor<!rec_Foo, copy>>
   Foo(const Foo &other) : a(other.a) {}
-
-  // CIR: @_ZN3FooC2EOS_(%arg0: !cir.ptr<!rec_Foo> loc({{.*}}), %arg1: !cir.ptr<!rec_Foo> loc({{.*}})) special_member<#cir.cxx_ctor<!rec_Foo, move>>
   Foo(Foo &&other) noexcept : a(other.a) { other.a = 0; }
 
-  // CIR: @_ZN3FooaSERKS_(%arg0: !cir.ptr<!rec_Foo> loc({{.*}}), %arg1: !cir.ptr<!rec_Foo> loc({{.*}})) -> !cir.ptr<!rec_Foo> special_member<#cir.cxx_assign<!rec_Foo, copy>>
   Foo &operator=(const Foo &other) {
     if (this != &other) {
       a = other.a;
@@ -29,7 +19,6 @@ struct Foo {
     return *this;
   }
 
-  // CIR: @_ZN3FooaSEOS_(%arg0: !cir.ptr<!rec_Foo> loc({{.*}}), %arg1: !cir.ptr<!rec_Foo> loc({{.*}})) -> !cir.ptr<!rec_Foo> special_member<#cir.cxx_assign<!rec_Foo, move>>
   Foo &operator=(Foo &&other) noexcept {
     if (this != &other) {
       a = other.a;
@@ -38,22 +27,40 @@ struct Foo {
     return *this;
   }
 
-  // CIR: @_ZN3FooD1Ev(!cir.ptr<!rec_Foo>) special_member<#cir.cxx_dtor<!rec_Foo>>
   ~Foo();
 };
 
-void trivial() {
+void trivial_func() {
   Flub f1{};
+
   Flub f2 = f1;
+  // Trivial copy constructors/assignments are replaced with cir.copy
+  // CIR: cir.copy {{.*}} : !cir.ptr<!rec_Flub>
+
   Flub f3 = static_cast<Flub&&>(f1);
+  // CIR: @_ZN4FlubC1EOS_(%arg0: !cir.ptr<!rec_Flub> loc({{.*}}), %arg1: !cir.ptr<!rec_Flub> loc({{.*}})) special_member<#cir.cxx_ctor<!rec_Flub, move, trivial true>
+
   f2 = f1;
+  // CIR: @_ZN4FlubaSERKS_(%arg0: !cir.ptr<!rec_Flub> loc({{.*}}), %arg1: !cir.ptr<!rec_Flub> loc({{.*}})) -> !cir.ptr<!rec_Flub> special_member<#cir.cxx_assign<!rec_Flub, copy, trivial true>>
+
   f1 = static_cast<Flub&&>(f3);
+  // CIR: @_ZN4FlubaSEOS_(%arg0: !cir.ptr<!rec_Flub> loc({{.*}}), %arg1: !cir.ptr<!rec_Flub> loc({{.*}})) -> !cir.ptr<!rec_Flub> special_member<#cir.cxx_assign<!rec_Flub, move, trivial true>>
 }
 
-void non_trivial() {
+void non_trivial_func() {
   Foo f1{};
+  // CIR: @_ZN3FooC2Ev(%arg0: !cir.ptr<!rec_Foo> loc({{.*}})) special_member<#cir.cxx_ctor<!rec_Foo, default>>
+
   Foo f2 = f1;
+  // CIR: @_ZN3FooC2ERKS_(%arg0: !cir.ptr<!rec_Foo> loc({{.*}}), %arg1: !cir.ptr<!rec_Foo> loc({{.*}})) special_member<#cir.cxx_ctor<!rec_Foo, copy>>
+
   Foo f3 = static_cast<Foo&&>(f1);
+  // CIR: @_ZN3FooC2EOS_(%arg0: !cir.ptr<!rec_Foo> loc({{.*}}), %arg1: !cir.ptr<!rec_Foo> loc({{.*}})) special_member<#cir.cxx_ctor<!rec_Foo, move>>
+
   f2 = f1;
+  // CIR: @_ZN3FooaSERKS_(%arg0: !cir.ptr<!rec_Foo> loc({{.*}}), %arg1: !cir.ptr<!rec_Foo> loc({{.*}})) -> !cir.ptr<!rec_Foo> special_member<#cir.cxx_assign<!rec_Foo, copy>>
+
   f1 = static_cast<Foo&&>(f3);
+  // CIR: @_ZN3FooaSEOS_(%arg0: !cir.ptr<!rec_Foo> loc({{.*}}), %arg1: !cir.ptr<!rec_Foo> loc({{.*}})) -> !cir.ptr<!rec_Foo> special_member<#cir.cxx_assign<!rec_Foo, move>>
+  // CIR: @_ZN3FooD1Ev(!cir.ptr<!rec_Foo>) special_member<#cir.cxx_dtor<!rec_Foo>>
 }
diff --git a/clang/test/CIR/CodeGen/delete.cpp b/clang/test/CIR/CodeGen/delete.cpp
index d8ac4361bb538..c8d6f050179fd 100644
--- a/clang/test/CIR/CodeGen/delete.cpp
+++ b/clang/test/CIR/CodeGen/delete.cpp
@@ -19,7 +19,7 @@ void test_sized_delete(SizedDelete *x) {
 // CIR:  cir.func private @_ZN11SizedDeletedlEPvm(!cir.ptr<!void>, !u64i)
 // LLVM: declare void @_ZN11SizedDeletedlEPvm(ptr, i64)
 
-// CIR: cir.func dso_local @_Z17test_sized_deleteP11SizedDelete
+// CIR: cir.func {{.*}} @_Z17test_sized_deleteP11SizedDelete
 // CIR:   %[[X:.*]] = cir.load{{.*}} %{{.*}}
 // CIR:   %[[X_CAST:.*]] = cir.cast bitcast %[[X]] : !cir.ptr<!rec_SizedDelete> -> !cir.ptr<!void>
 // CIR:   %[[OBJ_SIZE:.*]] = cir.const #cir.int<4> : !u64i
@@ -49,15 +49,15 @@ struct Container {
 Container::~Container() { delete contents; }
 
 // Contents::~Contents()
-// CIR: cir.func comdat linkonce_odr @_ZN8ContentsD2Ev
+// CIR: cir.func {{.*}} @_ZN8ContentsD2Ev
 // LLVM: define linkonce_odr void @_ZN8ContentsD2Ev
 
 // operator delete(void*, unsigned long)
-// CIR: cir.func private @_ZdlPvm(!cir.ptr<!void>, !u64i)
+// CIR: cir.func {{.*}} @_ZdlPvm(!cir.ptr<!void>, !u64i)
 // LLVM: declare void @_ZdlPvm(ptr, i64)
 
 // Container::~Container()
-// CIR: cir.func dso_local @_ZN9ContainerD2Ev
+// CIR: cir.func {{.*}} @_ZN9ContainerD2Ev
 // CIR:   %[[THIS:.*]] = cir.load %{{.*}}
 // CIR:   %[[CONTENTS_PTR_ADDR:.*]] = cir.get_member %[[THIS]][0] {name = "contents"} : !cir.ptr<!rec_Container> -> !cir.ptr<!cir.ptr<!rec_Contents>>
 // CIR:   %[[CONTENTS_PTR:.*]] = cir.load{{.*}} %[[CONTENTS_PTR_ADDR]]
diff --git a/clang/test/CIR/CodeGen/destructors.cpp b/clang/test/CIR/CodeGen/destructors.cpp
index 4363db5ad34dc..ec190f59b2f1d 100644
--- a/clang/test/CIR/CodeGen/destructors.cpp
+++ b/clang/test/CIR/CodeGen/destructors.cpp
@@ -18,7 +18,7 @@ out_of_line_destructor::~out_of_line_destructor() {
 
 // CIR: !rec_out_of_line_destructor = !cir.record<struct "out_of_line_destructor" {!s32i}>
 
-// CIR: cir.func dso_local @_ZN22out_of_line_destructorD2Ev(%{{.+}}: !cir.ptr<!rec_out_of_line_destructor>
+// CIR: cir.func {{.*}} @_ZN22out_of_line_destructorD2Ev(%{{.+}}: !cir.ptr<!rec_out_of_line_destructor>
 // CIR:   cir.call @_Z13some_functionv() nothrow : () -> () 
 // CIR:   cir.return 
 
@@ -30,7 +30,7 @@ out_of_line_destructor::~out_of_line_destructor() {
 // OGCG:   call void @_Z13some_functionv()
 // OGCG:   ret void
 
-// CIR: cir.func dso_local @_ZN22out_of_line_destructorD1Ev(%{{.+}}: !cir.ptr<!rec_out_of_line_destructor>
+// CIR: cir.func {{.*}} @_ZN22out_of_line_destructorD1Ev(%{{.+}}: !cir.ptr<!rec_out_of_line_destructor>
 // CIR:  cir.call @_ZN22out_of_line_destructorD2Ev(%{{.*}}) nothrow : (!cir.ptr<!rec_out_of_line_destructor>)
 // CIR:  cir.return
 
@@ -61,7 +61,7 @@ void test_array_destructor() {
   array_element arr[5]{};
 }
 
-// CIR: cir.func dso_local @_Z21test_array_destructorv()
+// CIR: cir.func {{.*}} @_Z21test_array_destructorv()
 // CIR:   %[[ARR:.*]] = cir.alloca !cir.array<!rec_array_element x 5>, !cir.ptr<!cir.array<!rec_array_element x 5>>, ["arr", init]
 // CIR:   %[[ARR_PTR:.*]] = cir.alloca !cir.ptr<!rec_array_element>, !cir.ptr<!cir.ptr<!rec_array_element>>, ["arrayinit.temp", init]
 // CIR:   %[[BEGIN:.*]] = cir.cast array_to_ptrdecay %[[ARR]] : !cir.ptr<!cir.array<!rec_array_element x 5>>
diff --git a/clang/test/CIR/CodeGen/dtors.cpp b/clang/test/CIR/CodeGen/dtors.cpp
index 1fe048b7d5327..aeee0854dacf0 100644
--- a/clang/test/CIR/CodeGen/dtors.cpp
+++ b/clang/test/CIR/CodeGen/dtors.cpp
@@ -13,7 +13,7 @@ void test_temporary_dtor() {
   A();
 }
 
-// CIR: cir.func dso_local @_Z19test_temporary_dtorv()
+// CIR: cir.func {{.*}} @_Z19test_temporary_dtorv()
 // CIR:   %[[ALLOCA:.*]] = cir.alloca !rec_A, !cir.ptr<!rec_A>, ["agg.tmp.ensured"]
 // CIR:   cir.call @_ZN1AD1Ev(%[[ALLOCA]]) nothrow : (!cir.ptr<!rec_A>) -> ()
 
diff --git a/clang/test/CIR/CodeGen/dynamic-cast.cpp b/clang/test/CIR/CodeGen/dynamic-cast.cpp
index 5d010d20bb9f1..e963be01950c4 100644
--- a/clang/test/CIR/CodeGen/dynamic-cast.cpp
+++ b/clang/test/CIR/CodeGen/dynamic-cast.cpp
@@ -21,11 +21,11 @@ Derived *ptr_cast(Base *b) {
   return dynamic_cast<Derived *>(b);
 }
 
-// CIR-BEFORE: cir.func dso_local @_Z8ptr_castP4Base
+// CIR-BEFORE: cir.func {{.*}} @_Z8ptr_castP4Base
 // CIR-BEFORE:   %{{.+}} = cir.dyn_cast ptr %{{.+}} : !cir.ptr<!rec_Base> -> !cir.ptr<!rec_Derived> #dyn_cast_info__ZTI4Base__ZTI7Derived
 // CIR-BEFORE: }
 
-//      CIR-AFTER: cir.func dso_local @_Z8ptr_castP4Base
+//      CIR-AFTER: cir.func {{.*}} @_Z8ptr_castP4Base
 //      CIR-AFTER:   %[[SRC:.*]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!rec_Base>>, !cir.ptr<!rec_Base>
 // CIR-AFTER-NEXT:   %[[SRC_IS_NOT_NULL:.*]] = cir.cast ptr_to_bool %[[SRC]] : !cir.ptr<!rec_Base> -> !cir.bool
 // CIR-AFTER-NEXT:   %{{.+}} = cir.ternary(%[[SRC_IS_NOT_NULL]], true {
@@ -69,11 +69,11 @@ Derived &ref_cast(Base &b) {
   return dynamic_cast<Derived &>(b);
 }
 
-// CIR-BEFORE: cir.func dso_local @_Z8ref_castR4Base
+// CIR-BEFORE: cir.func {{.*}} @_Z8ref_castR4Base
 // CIR-BEFORE:   %{{.+}} = cir.dyn_cast ref %{{.+}} : !cir.ptr<!rec_Base> -> !cir.ptr<!rec_Derived> #dyn_cast_info__ZTI4Base__ZTI7Derived
 // CIR-BEFORE: }
 
-//      CIR-AFTER: cir.func dso_local @_Z8ref_castR4Base
+//      CIR-AFTER: cir.func {{.*}} @_Z8ref_castR4Base
 //      CIR-AFTER:   %[[SRC_VOID_PTR:.*]] = cir.cast bitcast %{{.+}} : !cir.ptr<!rec_Base> -> !cir.ptr<!void>
 // CIR-AFTER-NEXT:   %[[SRC_RTTI:.*]] = cir.const #cir.global_view<@_ZTI4Base> : !cir.ptr<!u8i>
 // CIR-AFTER-NEXT:   %[[DEST_RTTI:.*]] = cir.const #cir.global_view<@_ZTI7Derived> : !cir.ptr<!u8i>
@@ -106,11 +106,11 @@ void *ptr_cast_to_complete(Base *ptr) {
   return dynamic_cast<void *>(ptr);
 }
 
-// CIR-BEFORE: cir.func dso_local @_Z20ptr_cast_to_completeP4Base
+// CIR-BEFORE: cir.func {{.*}} @_Z20ptr_cast_to_completeP4Base
 // CIR-BEFORE:   %{{.+}} = cir.dyn_cast ptr %{{.+}} : !cir.ptr<!rec_Base> -> !cir.ptr<!void>
 // CIR-BEFORE: }
 
-//      CIR-AFTER: cir.func dso_local @_Z20ptr_cast_to_completeP4Base
+//      CIR-AFTER: cir.func {{.*}} @_Z20ptr_cast_to_completeP4Base
 //      CIR-AFTER:   %[[SRC:.*]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!rec_Base>>, !cir.ptr<!rec_Base>
 // CIR-AFTER-NEXT:   %[[SRC_IS_NOT_NULL:.*]] = cir.cast ptr_to_bool %[[SRC]] : !cir.ptr<!rec_Base> -> !cir.bool
 // CIR-AFTER-NEXT:   %{{.+}} = cir.ternary(%[[SRC_IS_NOT_NULL]], true {
diff --git a/clang/test/CIR/CodeGen/global-ctor-dtor.cpp b/clang/test/CIR/CodeGen/global-ctor-dtor.cpp
index 2e03ff3e88c7d..63f175281a02e 100644
--- a/clang/test/CIR/CodeGen/global-ctor-dtor.cpp
+++ b/clang/test/CIR/CodeGen/global-ctor-dtor.cpp
@@ -13,28 +13,28 @@ void foo(void) {
   bar();
 }
 
-// CIR-BEFORE-LPP: cir.func dso_local @_Z3foov() global_ctor
+// CIR-BEFORE-LPP: cir.func {{.*}} @_Z3foov() global_ctor
 
 void foo2(void) __attribute__((constructor(777)));
 void foo2(void) {
   bar();
 }
 
-// CIR-BEFORE-LPP: cir.func dso_local @_Z4foo2v() global_ctor(777)
+// CIR-BEFORE-LPP: cir.func {{.*}} @_Z4foo2v() global_ctor(777)
 
 void foo3(void) __attribute__((destructor));
 void foo3(void) {
   bar();
 }
 
-// CIR-BEFORE-LPP: cir.func dso_local @_Z4foo3v() global_dtor
+// CIR-BEFORE-LPP: cir.func {{.*}} @_Z4foo3v() global_dtor
 
 void foo4(void) __attribute__((destructor(789)));
 void foo4(void) {
   bar();
 }
 
-// CIR-BEFORE-LPP: cir.func dso_local @_Z4foo4v() global_dtor(789)
+// CIR-BEFORE-LPP: cir.func {{.*}} @_Z4foo4v() global_dtor(789)
 
 // CIR-AFTER: module @{{.*}} attributes {cir.global_ctors = [#cir.global_ctor<"_Z3foov", 65535>, #cir.global_ctor<"_Z4foo2v", 777>], cir.global_dtors = [#cir.global_dtor<"_Z4foo3v", 65535>, #cir.global_dtor<"_Z4foo4v", 789>]
 
diff --git a/clang/test/CIR/CodeGen/goto.cpp b/clang/test/CIR/CodeGen/goto.cpp
index 257c2550c2399..4b825d619c221 100644
--- a/clang/test/CIR/CodeGen/goto.cpp
+++ b/clang/test/CIR/CodeGen/goto.cpp
@@ -12,7 +12,7 @@ int shouldNotGenBranchRet(int x) {
 err:
   return -1;
 }
-// CIR:  cir.func dso_local @_Z21shouldNotGenBranchReti
+// CIR:  cir.func {{.*}} @_Z21shouldNotGenBranchReti
 // CIR:    cir.if {{.*}} {
 // CIR:      cir.goto "err"
 // CIR:    }
@@ -63,7 +63,7 @@ int shouldGenBranch(int x) {
 err:
   return -1;
 }
-// CIR:  cir.func dso_local @_Z15shouldGenBranchi
+// CIR:  cir.func {{.*}} @_Z15shouldGenBranchi
 // CIR:    cir.if {{.*}} {
 // CIR:      cir.goto "err"
 // CIR:    }
@@ -99,7 +99,7 @@ void severalLabelsInARow(int a) {
 end2:
   b = b + 2;
 }
-// CIR:  cir.func dso_local @_Z19severalLabelsInARowi
+// CIR:  cir.func {{.*}} @_Z19severalLabelsInARowi
 // CIR:    cir.goto "end1"
 // CIR:  ^bb[[#BLK1:]]
 // CIR:    cir.goto "end2"
@@ -132,7 +132,7 @@ void severalGotosInARow(int a) {
 end:
   b = b + 2;
 }
-// CIR:  cir.func dso_local @_Z18severalGotosInARowi
+// CIR:  cir.func {{.*}} @_Z18severalGotosInARowi
 // CIR:    cir.goto "end"
 // CIR:  ^bb[[#BLK1:]]:
 // CIR:    cir.goto "end"
@@ -163,7 +163,7 @@ extern "C" void multiple_non_case(int v) {
   }
 }
 
-// CIR: cir.func dso_local @multiple_non_case
+// CIR: cir.func {{.*}} @multiple_non_case
 // CIR: cir.switch
 // CIR: cir.case(default, []) {
 // CIR: cir.call @action1()
@@ -202,7 +202,7 @@ extern "C" void case_follow_label(int v) {
   }
 }
 
-// CIR: cir.func dso_local @case_follow_label
+// CIR: cir.func {{.*}} @case_follow_label
 // CIR: cir.switch
 // CIR: cir.case(equal, [#cir.int<1> : !s32i]) {
 // CIR:   cir.br ^bb1
@@ -264,7 +264,7 @@ extern "C" void default_follow_label(int v) {
   }
 }
 
-// CIR: cir.func dso_local @default_follow_label
+// CIR: cir.func {{.*}} @default_follow_label
 // CIR: cir.switch
 // CIR: cir.case(equal, [#cir.int<1> : !s32i]) {
 // CIR:   cir.yield
@@ -313,7 +313,7 @@ void g3() {
   goto label;
 }
 
-// CIR:  cir.func dso_local @_Z2g3v
+// CIR:  cir.func {{.*}} @_Z2g3v
 // CIR:    cir.br ^bb1
 // CIR:  ^bb1:
 // CIR:    cir.label "label"
diff --git a/clang/test/CIR/CodeGen/inline-attributes.cpp b/clang/test/CIR/CodeGen/inline-attributes.cpp
index fab4010354daf..54777975e32ab 100644
--- a/clang/test/CIR/CodeGen/inline-attributes.cpp
+++ b/clang/test/CIR/CodeGen/inline-attributes.cpp
@@ -29,17 +29,17 @@ int (*inline_hint_ptr)(int) = &inline_hint_function;
 int (*noinline_ptr)(int) = &noinline_function;
 int (*regular_ptr)(int) = &regular_function;
 
-// CIR-LABEL: cir.func dso_local @_Z17noinline_functioni(%arg0: !s32i {{.*}}) -> !s32i inline(never)
+// CIR-LABEL: cir.func no_inline dso_local @_Z17noinline_functioni(%arg0: !s32i {{.*}}) -> !s32i
 
 // CIR-LABEL: cir.func dso_local @_Z16regular_functioni(%arg0: !s32i {{.*}}) -> !s32i
-// CIR-NOT: inline(never)
-// CIR-NOT: inline(always)
-// CIR-NOT: inline(hint)
+// CIR-NOT: no_inline
+// CIR-NOT: always_inline
+// CIR-NOT: inline_hint
 // CIR-SAME: {
 
-// CIR-LABEL: cir.func {{.*}}@_Z22always_inline_functioni(%arg0: !s32i {{.*}}) -> !s32i inline(always)
+// CIR-LABEL: cir.func{{.*}} always_inline {{.*}}@_Z22always_inline_functioni(%arg0: !s32i {{.*}}) -> !s32i
 
-// CIR-LABEL: cir.func {{.*}}@_Z20inline_hint_functioni(%arg0: !s32i {{.*}}) -> !s32i inline(hint)
+// CIR-LABEL: cir.func{{.*}} inline_hint {{.*}}@_Z20inline_hint_functioni(%arg0: !s32i {{.*}}) -> !s32i
 
 // LLVM: ; Function Attrs:{{.*}} noinline
 // LLVM: define{{.*}} i32 @_Z17noinline_functioni
diff --git a/clang/test/CIR/CodeGen/label-values.c b/clang/test/CIR/CodeGen/label-values.c
index 41178e3f62f20..0305442a38471 100644
--- a/clang/test/CIR/CodeGen/label-values.c
+++ b/clang/test/CIR/CodeGen/label-values.c
@@ -6,7 +6,7 @@ void A(void) {
 LABEL_A:
   return;
 }
-// CIR:  cir.func dso_local @A
+// CIR:  cir.func {{.*}} @A
 // CIR:    [[PTR:%.*]] = cir.alloca !cir.ptr<!void>, !cir.ptr<!cir.ptr<!void>>, ["ptr", init] {alignment = 8 : i64}
 // CIR:    [[BLOCK:%.*]] = cir.block_address <@A, "LABEL_A"> : !cir.ptr<!void>
 // CIR:    cir.store align(8) [[BLOCK]], [[PTR]] : !cir.ptr<!void>, !cir.ptr<!cir.ptr<!void>>
@@ -20,7 +20,7 @@ void B(void) {
   void *ptr = &&LABEL_B;
 }
 
-// CIR:  cir.func dso_local @B()
+// CIR:  cir.func {{.*}} @B()
 // CIR:    [[PTR:%.*]] = cir.alloca !cir.ptr<!void>, !cir.ptr<!cir.ptr<!void>>, ["ptr", init] {alignment = 8 : i64}
 // CIR:    cir.br ^bb1
 // CIR:   ^bb1:
@@ -37,7 +37,7 @@ void C(int x) {
     return;
 }
 
-// CIR:  cir.func dso_local @C
+// CIR:  cir.func {{.*}} @C
 // CIR:    [[BLOCK1:%.*]] = cir.block_address <@C, "LABEL_A"> : !cir.ptr<!void>
 // CIR:    [[BLOCK2:%.*]] = cir.block_address <@C, "LABEL_B"> : !cir.ptr<!void>
 // CIR:    [[COND:%.*]] = cir.select if [[CMP:%.*]] then [[BLOCK1]] else [[BLOCK2]] : (!cir.bool, !cir.ptr<!void>, !cir.ptr<!void>) -> !cir.ptr<!void>
@@ -60,7 +60,7 @@ void D(void) {
   return;
 }
 
-// CIR:  cir.func dso_local @D
+// CIR:  cir.func {{.*}} @D
 // CIR:    %[[PTR:.*]] = cir.alloca !cir.ptr<!void>, !cir.ptr<!cir.ptr<!void>>, ["ptr", init]
 // CIR:    %[[PTR2:.*]] = cir.alloca !cir.ptr<!void>, !cir.ptr<!cir.ptr<!void>>, ["ptr2", init]
 // CIR:    %[[PTR3:.*]] = cir.alloca !cir.ptr<!void>, !cir.ptr<!cir.ptr<!void>>, ["ptr3", init]
diff --git a/clang/test/CIR/CodeGen/label.c b/clang/test/CIR/CodeGen/label.c
index fd3c7f233fc8b..d3fdf5f5abb77 100644
--- a/clang/test/CIR/CodeGen/label.c
+++ b/clang/test/CIR/CodeGen/label.c
@@ -10,7 +10,7 @@ void label() {
   return;
 }
 
-// CIR:  cir.func no_proto dso_local @label
+// CIR:  cir.func {{.*}} @label
 // CIR:     cir.br ^bb1
 // CIR:  ^bb1:
 // CIR:    cir.label "labelA"
@@ -32,7 +32,7 @@ void multiple_labels() {
   return;
 }
 
-// CIR:  cir.func no_proto dso_local @multiple_labels
+// CIR:  cir.func {{.*}} @multiple_labels
 // CIR:    cir.br ^bb1
 // CIR:  ^bb1:
 // CIR:    cir.label "labelB"
@@ -62,7 +62,7 @@ void label_in_if(int cond) {
   }
 }
 
-// CIR:  cir.func dso_local @label_in_if
+// CIR:  cir.func {{.*}} @label_in_if
 // CIR:      cir.if {{.*}} {
 // CIR:        cir.br ^bb1
 // CIR:      ^bb1:
@@ -107,7 +107,7 @@ void after_return() {
   label:
 }
 
-// CIR:  cir.func no_proto dso_local @after_return
+// CIR:  cir.func {{.*}} @after_return
 // CIR:    cir.br ^bb1
 // CIR:  ^bb1:  // 2 preds: ^bb0, ^bb2
 // CIR:    cir.return
@@ -133,7 +133,7 @@ void after_unreachable() {
   label:
 }
 
-// CIR:  cir.func no_proto dso_local @after_unreachable
+// CIR:  cir.func {{.*}} @after_unreachable
 // CIR:    cir.unreachable
 // CIR:  ^bb1:
 // CIR:    cir.label "label"
@@ -153,7 +153,7 @@ void labelWithoutMatch() {
 end:
   return;
 }
-// CIR:  cir.func no_proto dso_local @labelWithoutMatch
+// CIR:  cir.func {{.*}} @labelWithoutMatch
 // CIR:    cir.br ^bb1
 // CIR:  ^bb1:
 // CIR:    cir.label "end"
@@ -181,7 +181,7 @@ void foo() {
   }
 }
 
-// CIR: cir.func no_proto dso_local @foo
+// CIR: cir.func {{.*}} @foo
 // CIR:   cir.scope {
 // CIR:     %0 = cir.alloca !rec_S, !cir.ptr<!rec_S>, ["agg.tmp0"]
 // CIR:      cir.br ^bb1
diff --git a/clang/test/CIR/CodeGen/lambda-static-invoker.cpp b/clang/test/CIR/CodeGen/lambda-static-invoker.cpp
index e7d199b976865..fc68447f7c445 100644
--- a/clang/test/CIR/CodeGen/lambda-static-invoker.cpp
+++ b/clang/test/CIR/CodeGen/lambda-static-invoker.cpp
@@ -37,7 +37,7 @@ int g3() {
 // OGCG:   ret ptr @"_ZZ2g3vEN3$_08__invokeERKi"
 
 // lambda operator()
-// CIR: cir.func lambda internal private dso_local @_ZZ2g3vENK3$_0clERKi(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_G3:.*]]> {{.*}}, %[[REF_I_ARG:.*]]: !cir.ptr<!s32i> {{.*}})
+// CIR: cir.func no_inline lambda internal private dso_local @_ZZ2g3vENK3$_0clERKi(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_G3:.*]]> {{.*}}, %[[REF_I_ARG:.*]]: !cir.ptr<!s32i> {{.*}})
 // CIR:   %[[THIS_ALLOCA:.*]] = cir.alloca !cir.ptr<![[REC_LAM_G3]]>, !cir.ptr<!cir.ptr<![[REC_LAM_G3]]>>, ["this", init]
 // CIR:   %[[REF_I_ALLOCA:.*]] = cir.alloca {{.*}} ["i", init, const]
 // CIR:   %[[RETVAL:.*]] = cir.alloca {{.*}} ["__retval"]
@@ -66,7 +66,7 @@ int g3() {
 // In OGCG, the _ZZ2g3vENK3$_0clERKi function is emitted after _ZZ2g3vEN3$_08__invokeERKi, see below.
 
 // lambda invoker
-// CIR: cir.func internal private dso_local @_ZZ2g3vEN3$_08__invokeERKi(%[[REF_I_ARG:.*]]: !cir.ptr<!s32i> {{.*}}) -> !s32i{{.*}} {
+// CIR: cir.func no_inline internal private dso_local @_ZZ2g3vEN3$_08__invokeERKi(%[[REF_I_ARG:.*]]: !cir.ptr<!s32i> {{.*}}) -> !s32i{{.*}} {
 // CIR:   %[[REF_I_ALLOCA:.*]] = cir.alloca {{.*}} ["i", init, const]
 // CIR:   %[[RETVAL:.*]] = cir.alloca {{.*}} ["__retval"]
 // CIR:   %[[LAM_ALLOCA:.*]] = cir.alloca ![[REC_LAM_G3]], !cir.ptr<![[REC_LAM_G3]]>, ["unused.capture"]
@@ -91,7 +91,7 @@ int g3() {
 // In OGCG, the _ZZ2g3vEN3$_08__invokeERKi function is emitted after _ZN1A3barEv, see below.
 
 // lambda operator int (*)(int const&)()
-// CIR:   cir.func internal private dso_local @_ZZ2g3vENK3$_0cvPFiRKiEEv(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_G3]]> {{.*}}) -> !cir.ptr<!cir.func<(!cir.ptr<!s32i>) -> !s32i>>{{.*}} {
+// CIR:   cir.func no_inline internal private dso_local @_ZZ2g3vENK3$_0cvPFiRKiEEv(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_G3]]> {{.*}}) -> !cir.ptr<!cir.func<(!cir.ptr<!s32i>) -> !s32i>>{{.*}} {
 // CIR:   %[[THIS_ALLOCA:.*]] = cir.alloca !cir.ptr<![[REC_LAM_G3]]>, !cir.ptr<!cir.ptr<![[REC_LAM_G3]]>>, ["this", init]
 // CIR:   %[[RETVAL:.*]] = cir.alloca !cir.ptr<!cir.func<(!cir.ptr<!s32i>) -> !s32i>>, !cir.ptr<!cir.ptr<!cir.func<(!cir.ptr<!s32i>) -> !s32i>>>, ["__retval"]
 // CIR:   cir.store %[[THIS_ARG]], %[[THIS_ALLOCA]]
diff --git a/clang/test/CIR/CodeGen/lambda.cpp b/clang/test/CIR/CodeGen/lambda.cpp
index 1d06496a85530..3c2b0969fa7a2 100644
--- a/clang/test/CIR/CodeGen/lambda.cpp
+++ b/clang/test/CIR/CodeGen/lambda.cpp
@@ -14,7 +14,7 @@ void use_global_lambda() {
 }
 
 // CIR: cir.global "private" internal dso_local @global_lambda = #cir.undef : ![[REC_LAM_GLOBAL_LAMBDA:.*]] {alignment = 1 : i64}
-// CIR: cir.func lambda internal private dso_local @_ZNK3$_0clEv(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_GLOBAL_LAMBDA]]> {{.*}})
+// CIR: cir.func {{.*}} lambda internal private dso_local @_ZNK3$_0clEv(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_GLOBAL_LAMBDA]]> {{.*}})
 // CIR:   %[[THIS:.*]] = cir.alloca !cir.ptr<![[REC_LAM_GLOBAL_LAMBDA]]>, !cir.ptr<!cir.ptr<![[REC_LAM_GLOBAL_LAMBDA]]>>, ["this", init]
 // CIR:   cir.store %[[THIS_ARG]], %[[THIS]]
 // CIR:   cir.load %[[THIS]]
@@ -46,13 +46,13 @@ void fn() {
   a();
 }
 
-// CIR: cir.func lambda internal private dso_local @_ZZ2fnvENK3$_0clEv(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_FN_A:.*]]> {{.*}}) {{.*}} {
+// CIR: cir.func {{.*}} lambda internal private dso_local @_ZZ2fnvENK3$_0clEv(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_FN_A:.*]]> {{.*}})
 // CIR:   %[[THIS:.*]] = cir.alloca !cir.ptr<![[REC_LAM_FN_A]]>, !cir.ptr<!cir.ptr<![[REC_LAM_FN_A]]>>, ["this", init]
 // CIR:   cir.store %[[THIS_ARG]], %[[THIS]]
 // CIR:   cir.load %[[THIS]]
 // CIR:   cir.return
 
-// CIR: cir.func dso_local @_Z2fnv() {{.*}} {
+// CIR: cir.func {{.*}} @_Z2fnv()
 // CIR:   %[[A:.*]] = cir.alloca ![[REC_LAM_FN_A]], !cir.ptr<![[REC_LAM_FN_A]]>, ["a"]
 // CIR:   cir.call @_ZZ2fnvENK3$_0clEv(%[[A]])
 
@@ -85,7 +85,7 @@ void l0() {
   a();
 }
 
-// CIR: cir.func lambda internal private dso_local @_ZZ2l0vENK3$_0clEv(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_L0_A:.*]]> {{.*}}) {{.*}} {
+// CIR: cir.func {{.*}} lambda internal private dso_local @_ZZ2l0vENK3$_0clEv(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_L0_A:.*]]> {{.*}})
 // CIR:   %[[THIS_ADDR:.*]] = cir.alloca !cir.ptr<![[REC_LAM_L0_A]]>, !cir.ptr<!cir.ptr<![[REC_LAM_L0_A]]>>, ["this", init] {alignment = 8 : i64}
 // CIR:   cir.store %[[THIS_ARG]], %[[THIS_ADDR]]
 // CIR:   %[[THIS:.*]] = cir.load %[[THIS_ADDR]]
@@ -99,7 +99,7 @@ void l0() {
 // CIR:   cir.store{{.*}} %[[I_PLUS_ONE]], %[[I_ADDR]]
 // CIR:   cir.return
 
-// CIR: cir.func {{.*}} @_Z2l0v() {{.*}} {
+// CIR: cir.func {{.*}} @_Z2l0v()
 // CIR:   %[[I:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["i"]
 // CIR:   %[[A:.*]] = cir.alloca ![[REC_LAM_L0_A]], !cir.ptr<![[REC_LAM_L0_A]]>, ["a", init]
 // CIR:   %[[I_ADDR:.*]] = cir.get_member %[[A]][0] {name = "i"}
@@ -157,7 +157,7 @@ auto g() {
   };
 }
 
-// CIR: cir.func dso_local @_Z1gv() -> ![[REC_LAM_G:.*]] {{.*}} {
+// CIR: cir.func {{.*}} @_Z1gv() -> ![[REC_LAM_G:.*]] {
 // CIR:   %[[RETVAL:.*]] = cir.alloca ![[REC_LAM_G]], !cir.ptr<![[REC_LAM_G]]>, ["__retval"]
 // CIR:   %[[I_ADDR:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["i", init]
 // CIR:   %[[TWELVE:.*]] = cir.const #cir.int<12> : !s32i
@@ -199,7 +199,7 @@ auto g2() {
 }
 
 // Should be same as above because of NRVO
-// CIR: cir.func dso_local @_Z2g2v() -> ![[REC_LAM_G2:.*]] {{.*}} {
+// CIR: cir.func {{.*}} @_Z2g2v() -> ![[REC_LAM_G2:.*]] {
 // CIR:   %[[RETVAL:.*]] = cir.alloca ![[REC_LAM_G2]], !cir.ptr<![[REC_LAM_G2]]>, ["__retval", init]
 // CIR:   %[[I_ADDR:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["i", init]
 // CIR:   %[[TWELVE:.*]] = cir.const #cir.int<12> : !s32i
@@ -232,7 +232,7 @@ int f() {
   return g2()();
 }
 
-// CIR:cir.func lambda internal private dso_local @_ZZ2g2vENK3$_0clEv(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_G2]]> {{.*}}) -> !s32i {{.*}} {
+// CIR:cir.func {{.*}} lambda internal private dso_local @_ZZ2g2vENK3$_0clEv(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_G2]]> {{.*}}) -> !s32i
 // CIR:   %[[THIS_ADDR:.*]] = cir.alloca !cir.ptr<![[REC_LAM_G2]]>, !cir.ptr<!cir.ptr<![[REC_LAM_G2]]>>, ["this", init]
 // CIR:   %[[RETVAL:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
 // CIR:   cir.store %[[THIS_ARG]], %[[THIS_ADDR]]
@@ -250,7 +250,7 @@ int f() {
 // CIR:   %[[RET:.*]] = cir.load %[[RETVAL]]
 // CIR:   cir.return %[[RET]]
 
-// CIR: cir.func dso_local @_Z1fv() -> !s32i {{.*}} {
+// CIR: cir.func {{.*}} @_Z1fv() -> !s32i
 // CIR:   %[[RETVAL:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
 // CIR:   cir.scope {
 // CIR:     %[[TMP:.*]] = cir.alloca ![[REC_LAM_G2]], !cir.ptr<![[REC_LAM_G2]]>, ["ref.tmp0"]
@@ -332,7 +332,7 @@ struct A {
 // OGCG:   call noundef i32 @_ZN1A3barEv(ptr {{.*}} %[[A_THIS]])
 
 // lambda operator() in foo()
-// CIR: cir.func lambda comdat linkonce_odr @_ZZN1A3fooEvENKUlvE_clEv(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_A:.*]]> {{.*}}) {{.*}} {
+// CIR: cir.func {{.*}} lambda comdat linkonce_odr @_ZZN1A3fooEvENKUlvE_clEv(%[[THIS_ARG:.*]]: !cir.ptr<![[REC_LAM_A:.*]]> {{.*}})
 // CIR:   %[[THIS_ADDR:.*]] = cir.alloca !cir.ptr<![[REC_LAM_A]]>, !cir.ptr<!cir.ptr<![[REC_LAM_A]]>>, ["this", init]
 // CIR:   %[[RETVAL:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
 // CIR:   cir.store{{.*}} %[[THIS_ARG]], %[[THIS_ADDR]]
@@ -359,7 +359,7 @@ struct A {
 // The function above is defined after _ZN1A3barEv in OGCG, see below.
 
 // A::foo()
-// CIR: cir.func {{.*}} @_ZN1A3fooEv(%[[THIS_ARG:.*]]: !cir.ptr<!rec_A> {{.*}}) -> !s32i {{.*}} {
+// CIR: cir.func {{.*}} @_ZN1A3fooEv(%[[THIS_ARG:.*]]: !cir.ptr<!rec_A> {{.*}}) -> !s32i
 // CIR:   %[[THIS_ADDR:.*]] = cir.alloca !cir.ptr<!rec_A>, !cir.ptr<!cir.ptr<!rec_A>>, ["this", init]
 // CIR:   %[[RETVAL:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
 // CIR:   cir.store %[[THIS_ARG]], %[[THIS_ADDR]]
@@ -367,7 +367,7 @@ struct A {
 // CIR:   cir.scope {
 // CIR:     %[[LAM_ADDR:.*]] = cir.alloca ![[REC_LAM_A]], !cir.ptr<![[REC_LAM_A]]>, ["ref.tmp0"]
 // CIR:     %[[STRUCT_A:.*]] = cir.get_member %[[LAM_ADDR]][0] {name = "this"} : !cir.ptr<![[REC_LAM_A]]> -> !cir.ptr<!rec_A>
-// CIR:     cir.call @_ZN1AC1ERKS_(%[[STRUCT_A]], %[[THIS]]){{.*}} : (!cir.ptr<!rec_A>, !cir.ptr<!rec_A>){{.*}} -> ()
+// CIR:     cir.copy %[[THIS]] to %[[STRUCT_A]] : !cir.ptr<!rec_A>
 // CIR:     %[[LAM_RET:.*]] = cir.call @_ZZN1A3fooEvENKUlvE_clEv(%[[LAM_ADDR]])
 // CIR:     cir.store{{.*}} %[[LAM_RET]], %[[RETVAL]]
 // CIR:   }
@@ -383,7 +383,7 @@ struct A {
 // LLVM:   br label %[[SCOPE_BB:.*]]
 // LLVM: [[SCOPE_BB]]:
 // LLVM:   %[[STRUCT_A:.*]] = getelementptr %[[REC_LAM_A]], ptr %[[LAM_ALLOCA]], i32 0, i32 0
-// LLVM:   call void @_ZN1AC1ERKS_(ptr %[[STRUCT_A]], ptr %[[THIS]])
+// LLVM:   call void @llvm.memcpy.p0.p0.i32(ptr %[[STRUCT_A]], ptr %[[THIS]], i32 4, i1 false)
 // LLVM:   %[[LAM_RET:.*]] = call i32 @_ZZN1A3fooEvENKUlvE_clEv(ptr %[[LAM_ALLOCA]])
 // LLVM:   store i32 %[[LAM_RET]], ptr %[[RETVAL]]
 // LLVM:   br label %[[RET_BB:.*]]
@@ -402,7 +402,7 @@ struct A {
 // OGCG:   ret i32 %[[LAM_RET]]
 
 // lambda operator() in bar()
-// CIR: cir.func {{.*}} @_ZZN1A3barEvENKUlvE_clEv(%[[THIS_ARG2:.*]]: !cir.ptr<![[REC_LAM_PTR_A:.*]]> {{.*}}) -> !s32i {{.*}} {
+// CIR: cir.func {{.*}} @_ZZN1A3barEvENKUlvE_clEv(%[[THIS_ARG2:.*]]: !cir.ptr<![[REC_LAM_PTR_A:.*]]> {{.*}}) -> !s32i
 // CIR:   %[[THIS_ADDR:.*]] = cir.alloca !cir.ptr<![[REC_LAM_PTR_A]]>, !cir.ptr<!cir.ptr<![[REC_LAM_PTR_A]]>>, ["this", init]
 // CIR:   %[[RETVAL:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
 // CIR:   cir.store{{.*}} %[[THIS_ARG]], %[[THIS_ADDR]]
@@ -431,7 +431,7 @@ struct A {
 // The function above is defined after _ZZN1A3fooEvENKUlvE_clEv in OGCG, see below.
 
 // A::bar()
-// CIR: cir.func {{.*}} @_ZN1A3barEv(%[[THIS_ARG:.*]]: !cir.ptr<!rec_A> {{.*}}) -> !s32i {{.*}} {
+// CIR: cir.func {{.*}} @_ZN1A3barEv(%[[THIS_ARG:.*]]: !cir.ptr<!rec_A> {{.*}}) -> !s32i
 // CIR:   %[[THIS_ADDR:.*]] = cir.alloca !cir.ptr<!rec_A>, !cir.ptr<!cir.ptr<!rec_A>>, ["this", init]
 // CIR:   %[[RETVAL:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
 // CIR:   cir.store %[[THIS_ARG]], %[[THIS_ADDR]]
@@ -499,7 +499,7 @@ int test_lambda_this1(){
   return x+y;
 }
 
-// CIR: cir.func {{.*}} @_Z17test_lambda_this1v{{.*}} {
+// CIR: cir.func {{.*}} @_Z17test_lambda_this1v
 // CIR:   cir.call @_ZN1AC1Ev(%[[A_THIS:.*]]){{.*}} : (!cir.ptr<!rec_A>) -> ()
 // CIR:   cir.call @_ZN1A3fooEv(%[[A_THIS]]){{.*}} : (!cir.ptr<!rec_A>) -> !s32i
 // CIR:   cir.call @_ZN1A3barEv(%[[A_THIS]]){{.*}} : (!cir.ptr<!rec_A>) -> !s32i
diff --git a/clang/test/CIR/CodeGen/linkage-spec.cpp b/clang/test/CIR/CodeGen/linkage-spec.cpp
index 1affecd28d488..bfb21f868bf63 100644
--- a/clang/test/CIR/CodeGen/linkage-spec.cpp
+++ b/clang/test/CIR/CodeGen/linkage-spec.cpp
@@ -1,42 +1,42 @@
 // RUN: %clang_cc1 -std=c++20 -triple x86_64-unknown-linux-gnu -fclangir -emit-cir %s -o - 2>&1 | FileCheck %s
 
 extern "C" void TopLevelC(){}
-// CHECK: cir.func dso_local @TopLevelC() inline(never) {
+// CHECK: cir.func no_inline dso_local @TopLevelC()
 extern "C++" void TopLevelCpp(){}
-// CHECK: cir.func dso_local @_Z11TopLevelCppv() inline(never) {
+// CHECK: cir.func no_inline dso_local @_Z11TopLevelCppv()
 
 extern "C++" {
   void ExternCppEmpty(){}
-  // CHECK: cir.func dso_local @_Z14ExternCppEmptyv() inline(never) {
+  // CHECK: cir.func no_inline dso_local @_Z14ExternCppEmptyv()
   extern "C" void ExternCpp_C(){}
-  // CHECK: cir.func dso_local @ExternCpp_C() inline(never) {
+  // CHECK: cir.func no_inline dso_local @ExternCpp_C()
   extern "C++" void ExternCpp_Cpp(){}
-  // CHECK: cir.func dso_local @_Z13ExternCpp_Cppv() inline(never) {
+  // CHECK: cir.func no_inline dso_local @_Z13ExternCpp_Cppv()
 
   extern "C" {
   void ExternCpp_CEmpty(){}
-  // CHECK: cir.func dso_local @ExternCpp_CEmpty() inline(never) {
+  // CHECK: cir.func no_inline dso_local @ExternCpp_CEmpty()
   extern "C" void ExternCpp_C_C(){}
-  // CHECK: cir.func dso_local @ExternCpp_C_C() inline(never) {
+  // CHECK: cir.func no_inline dso_local @ExternCpp_C_C()
   extern "C++" void ExternCpp_C_Cpp(){}
-  // CHECK: cir.func dso_local @_Z15ExternCpp_C_Cppv() inline(never) {
+  // CHECK: cir.func no_inline dso_local @_Z15ExternCpp_C_Cppv()
   }
 }
 
 extern "C" {
   void ExternCEmpty(){}
-  // CHECK: cir.func dso_local @ExternCEmpty() inline(never) {
+  // CHECK: cir.func no_inline dso_local @ExternCEmpty()
   extern "C" void ExternC_C(){}
-  // CHECK: cir.func dso_local @ExternC_C() inline(never) {
+  // CHECK: cir.func no_inline dso_local @ExternC_C()
   extern "C++" void ExternC_Cpp(){}
-  // CHECK: cir.func dso_local @_Z11ExternC_Cppv() inline(never) {
+  // CHECK: cir.func no_inline dso_local @_Z11ExternC_Cppv()
   extern "C++" {
   void ExternC_CppEmpty(){}
-  // CHECK: cir.func dso_local @_Z16ExternC_CppEmptyv() inline(never) {
+  // CHECK: cir.func no_inline dso_local @_Z16ExternC_CppEmptyv()
   extern "C" void ExternC_Cpp_C(){}
-  // CHECK: cir.func dso_local @ExternC_Cpp_C() inline(never) {
+  // CHECK: cir.func no_inline dso_local @ExternC_Cpp_C()
   extern "C++" void ExternC_Cpp_Cpp(){}
-  // CHECK: cir.func dso_local @_Z15ExternC_Cpp_Cppv() inline(never) {
+  // CHECK: cir.func no_inline dso_local @_Z15ExternC_Cpp_Cppv()
   }
 }
 
diff --git a/clang/test/CIR/CodeGen/no-prototype.c b/clang/test/CIR/CodeGen/no-prototype.c
index 728c4b80b95a2..d266ccb86448a 100644
--- a/clang/test/CIR/CodeGen/no-prototype.c
+++ b/clang/test/CIR/CodeGen/no-prototype.c
@@ -7,9 +7,9 @@
 
 // No-proto definition followed by a correct call.
 int noProto0(x) int x; { return x; }
-// CHECK: cir.func no_proto dso_local @noProto0(%arg0: !s32i {{.+}}) -> !s32i
+// CHECK: cir.func {{.*}} no_proto {{.*}} @noProto0(%arg0: !s32i {{.+}}) -> !s32i
 int test0(int x) {
-  // CHECK: cir.func dso_local @test0
+  // CHECK: cir.func {{.*}} @test0
   return noProto0(x); // We know the definition. Should be a direct call.
   // CHECK: %{{.+}} = cir.call @noProto0(%{{.+}})
 }
@@ -21,9 +21,9 @@ int test0(int x) {
 // definition is not marked as no-proto.
 int noProto1();
 int noProto1(int x) { return x; }
-// CHECK: cir.func dso_local @noProto1(%arg0: !s32i {{.+}}) -> !s32i
+// CHECK: cir.func {{.*}} @noProto1(%arg0: !s32i {{.+}}) -> !s32i
 int test1(int x) {
-  // CHECK: cir.func dso_local @test1
+  // CHECK: cir.func {{.*}} @test1
   return noProto1(x);
   // CHECK: %{{.+}} = cir.call @noProto1(%{{[0-9]+}}) : (!s32i) -> !s32i
 }
@@ -39,7 +39,7 @@ int test2(int x) {
   // CHECK:  {{.*}} = cir.call [[GGO]](%{{[0-9]+}}) : (!cir.ptr<!cir.func<(!s32i) -> !s32i>>, !s32i) -> !s32i
 }
 int noProto2(int x) { return x; }
-// CHECK: cir.func no_proto dso_local @noProto2(%arg0: !s32i {{.+}}) -> !s32i
+// CHECK: cir.func {{.*}} no_proto {{.*}} @noProto2(%arg0: !s32i {{.+}}) -> !s32i
 
 // No-proto declaration without definition (any call here is "correct").
 //
@@ -48,7 +48,7 @@ int noProto2(int x) { return x; }
 int noProto3();
 // cir.func private no_proto @noProto3(...) -> !s32i
 int test3(int x) {
-// CHECK: cir.func dso_local @test3
+// CHECK: cir.func {{.*}} @test3
   return noProto3(x);
   // CHECK:  [[GGO:%.*]] = cir.get_global @noProto3 : !cir.ptr<!cir.func<(...) -> !s32i>>
   // CHECK:  [[CAST:%.*]] = cir.cast bitcast [[GGO]] : !cir.ptr<!cir.func<(...) -> !s32i>> -> !cir.ptr<!cir.func<(!s32i) -> !s32i>>
@@ -64,7 +64,7 @@ int test3(int x) {
 
 // No-proto definition followed by an incorrect call due to extra args.
 int noProto4() { return 0; }
-// cir.func private no_proto @noProto4() -> !s32i
+// cir.func {{.*}} no_proto {{.*}} @noProto4() -> !s32i
 int test4(int x) {
   return noProto4(x); // Even if we know the definition, this should compile.
   // CHECK:  [[GGO:%.*]] = cir.get_global @noProto4 : !cir.ptr<!cir.func<() -> !s32i>>
@@ -81,4 +81,4 @@ int test5(int x) {
   // CHECK:  {{%.*}} = cir.call [[CAST]]() : (!cir.ptr<!cir.func<() -> !s32i>>) -> !s32i
 }
 int noProto5(int x) { return x; }
-// CHECK: cir.func no_proto dso_local @noProto5(%arg0: !s32i {{.+}}) -> !s32i
+// CHECK: cir.func {{.*}} no_proto {{.*}} @noProto5(%arg0: !s32i {{.+}}) -> !s32i
diff --git a/clang/test/CIR/CodeGen/placement-new.cpp b/clang/test/CIR/CodeGen/placement-new.cpp
index 7ceaa0a359e1f..ccc3548091ef3 100644
--- a/clang/test/CIR/CodeGen/placement-new.cpp
+++ b/clang/test/CIR/CodeGen/placement-new.cpp
@@ -16,7 +16,7 @@ void test_reserved_placement_new(void *p) {
   new (p) A();
 }
 
-// CIR-LABEL:   cir.func dso_local @_Z27test_reserved_placement_newPv(
+// CIR-LABEL:   cir.func {{.*}} @_Z27test_reserved_placement_newPv(
 // CIR-SAME:                                   %[[ARG0:.*]]: !cir.ptr<!void>
 // CIR:           %[[P:.*]] = cir.alloca !cir.ptr<!void>, !cir.ptr<!cir.ptr<!void>>, ["p", init]
 // CIR:           cir.store %[[ARG0]], %[[P]] : !cir.ptr<!void>, !cir.ptr<!cir.ptr<!void>>
diff --git a/clang/test/CIR/CodeGen/ptrdiff.cpp b/clang/test/CIR/CodeGen/ptrdiff.cpp
index 34ba0ff725581..5805349b74879 100644
--- a/clang/test/CIR/CodeGen/ptrdiff.cpp
+++ b/clang/test/CIR/CodeGen/ptrdiff.cpp
@@ -8,7 +8,7 @@
 typedef unsigned long size_type;
 
 size_type size(unsigned long *_start, unsigned long *_finish) {
-  // CIR-LABEL: cir.func dso_local @_Z4sizePmS_
+  // CIR-LABEL: cir.func {{.*}} @_Z4sizePmS_
   // CIR: %[[D:.*]] = cir.ptr_diff {{.*}} : !cir.ptr<!u64i> -> !s64i
   // CIR: %[[U:.*]] = cir.cast integral %[[D]] : !s64i -> !u64i
   // CIR: cir.return {{.*}} : !u64i
diff --git a/clang/test/CIR/CodeGen/size-of-vla.cpp b/clang/test/CIR/CodeGen/size-of-vla.cpp
new file mode 100644
index 0000000000000..bcaab27781aa3
--- /dev/null
+++ b/clang/test/CIR/CodeGen/size-of-vla.cpp
@@ -0,0 +1,156 @@
+// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -Wno-unused-value -fclangir -emit-cir %s -o %t.cir
+// RUN: FileCheck --input-file=%t.cir %s -check-prefix=CIR
+// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -Wno-unused-value -fclangir -emit-llvm %s -o %t-cir.ll
+// RUN: FileCheck --input-file=%t-cir.ll %s -check-prefix=LLVM
+// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -Wno-unused-value -emit-llvm %s -o %t.ll
+// RUN: FileCheck --input-file=%t.ll %s -check-prefix=OGCG
+
+void vla_type_with_element_type_of_size_1() {
+  unsigned long n = 10ul;
+  unsigned long size = sizeof(bool[n]);
+}
+
+// CIR: %[[N_ADDR:.*]] = cir.alloca !u64i, !cir.ptr<!u64i>, ["n", init]
+// CIR: %[[SIZE_ADDR:.*]] = cir.alloca !u64i, !cir.ptr<!u64i>, ["size", init]
+// CIR: %[[CONST_10:.*]] = cir.const #cir.int<10> : !u64i
+// CIR: cir.store {{.*}} %[[CONST_10]], %[[N_ADDR]] : !u64i, !cir.ptr<!u64i>
+// CIR: %[[TMP_N:.*]] = cir.load {{.*}} %[[N_ADDR]] : !cir.ptr<!u64i>, !u64i
+// CIR: cir.store {{.*}} %[[TMP_N]], %[[SIZE_ADDR]] : !u64i, !cir.ptr<!u64i>
+
+// LLVM: %[[N_ADDR:.*]] = alloca i64, i64 1, align 8
+// LLVM: %[[SIZE_ADDR:.*]] = alloca i64, i64 1, align 8
+// LLVM: store i64 10, ptr %[[N_ADDR]], align 8
+// LLVM: %[[TMP_N:.*]] = load i64, ptr %[[N_ADDR]], align 8
+// LLVM: store i64 %[[TMP_N]], ptr %[[SIZE_ADDR]], align 8
+
+// OGCG: %[[N_ADDR:.*]] = alloca i64, align 8
+// OGCG: %[[SIZE_ADDR:.*]] = alloca i64, align 8
+// OGCG: store i64 10, ptr %[[N_ADDR]], align 8
+// OGCG: %[[TMP_N:.*]] = load i64, ptr %[[N_ADDR]], align 8
+// OGCG: store i64 %[[TMP_N]], ptr %[[SIZE_ADDR]], align 8
+
+void vla_type_with_element_type_int() {
+  unsigned long n = 10ul;
+  unsigned long size = sizeof(int[n]);
+}
+
+// CIR: %[[N_ADDR:.*]] = cir.alloca !u64i, !cir.ptr<!u64i>, ["n", init]
+// CIR: %[[SIZE_ADDR:.*]] = cir.alloca !u64i, !cir.ptr<!u64i>, ["size", init]
+// CIR: %[[CONST_10:.*]] = cir.const #cir.int<10> : !u64i
+// CIR: cir.store {{.*}} %[[CONST_10]], %[[N_ADDR]] : !u64i, !cir.ptr<!u64i>
+// CIR: %[[TMP_N:.*]] = cir.load {{.*}} %[[N_ADDR]] : !cir.ptr<!u64i>, !u64i
+// CIR: %[[CONST_4:.*]] = cir.const #cir.int<4> : !u64i
+// CIR: %[[SIZE:.*]] = cir.binop(mul, %[[CONST_4]], %[[TMP_N]]) nuw : !u64i
+// CIR: cir.store {{.*}} %[[SIZE]], %[[SIZE_ADDR]] : !u64i, !cir.ptr<!u64i>
+
+// LLVM: %[[N_ADDR:.*]] = alloca i64, i64 1, align 8
+// LLVM: %[[SIZE_ADDR:.*]] = alloca i64, i64 1, align 8
+// LLVM: store i64 10, ptr %[[N_ADDR]], align 8
+// LLVM: %[[TMP_N:.*]] = load i64, ptr %[[N_ADDR]], align 8
+// LLVM: %[[SIZE:.*]] = mul nuw i64 4, %[[TMP_N]]
+// LLVM: store i64 %[[SIZE]], ptr %[[SIZE_ADDR]], align 8
+
+// OGCG: %[[N_ADDR:.*]] = alloca i64, align 8
+// OGCG: %[[SIZE_ADDR:.*]] = alloca i64, align 8
+// OGCG: store i64 10, ptr %[[N_ADDR]], align 8
+// OGCG: %[[TMP_N:.*]] = load i64, ptr %[[N_ADDR]], align 8
+// OGCG: %[[SIZE:.*]] = mul nuw i64 4, %[[TMP_N]]
+// OGCG: store i64 %[[SIZE]], ptr %[[SIZE_ADDR]], align 8
+
+void vla_expr_element_type_of_size_1() {
+  unsigned long n = 10ul;
+  bool arr[n];
+  unsigned long size = sizeof(arr);
+}
+
+// CIR: %[[N_ADDR:.*]] = cir.alloca !u64i, !cir.ptr<!u64i>, ["n", init]
+// CIR: %[[SAVED_STACK_ADDR:.*]] = cir.alloca !cir.ptr<!u8i>, !cir.ptr<!cir.ptr<!u8i>>, ["saved_stack"]
+// CIR: %[[CONST_10:.*]] = cir.const #cir.int<10> : !u64i
+// CIR: cir.store {{.*}} %[[CONST_10]], %[[N_ADDR]] : !u64i, !cir.ptr<!u64i>
+// CIR: %[[TMP_N:.*]] = cir.load {{.*}} %[[N_ADDR]] : !cir.ptr<!u64i>, !u64i
+// CIR: %[[STACK_SAVE:.*]] = cir.stacksave : !cir.ptr<!u8i>
+// CIR: cir.store {{.*}} %[[STACK_SAVE]], %[[SAVED_STACK_ADDR]] : !cir.ptr<!u8i>, !cir.ptr<!cir.ptr<!u8i>>
+// CIR: %[[ARR_ADDR:.*]] = cir.alloca !cir.bool, !cir.ptr<!cir.bool>, %[[TMP_N]] : !u64i, ["arr"]
+// CIR: %[[SIZE_ADDR:.*]] = cir.alloca !u64i, !cir.ptr<!u64i>, ["size", init]
+// CIR: cir.store {{.*}} %[[TMP_N]], %[[SIZE_ADDR]] : !u64i, !cir.ptr<!u64i>
+// CIR: %[[TMP_SAVED_STACK:.*]] = cir.load {{.*}} %[[SAVED_STACK_ADDR]] : !cir.ptr<!cir.ptr<!u8i>>, !cir.ptr<!u8i>
+// CIR: cir.stackrestore %[[TMP_SAVED_STACK]] : !cir.ptr<!u8i>
+
+// LLVM: %[[N_ADDR:.*]] = alloca i64, i64 1, align 8
+// LLVM: %[[SAVED_STACK_ADDR:.*]] = alloca ptr, i64 1, align 8
+// LLVM: store i64 10, ptr %[[N_ADDR]], align 8
+// LLVM: %[[TMP_N:.*]] = load i64, ptr %[[N_ADDR]], align 8
+// LLVM: %[[STACK_SAVE:.*]] = call ptr @llvm.stacksave.p0()
+// LLVM: store ptr %[[STACK_SAVE]], ptr %[[SAVED_STACK_ADDR]], align 8
+// LLVM: %[[ARR_ADDR:.*]] = alloca i8, i64 %[[TMP_N]], align 16
+// LLVM: %[[SIZE_ADDR:.*]] = alloca i64, i64 1, align 8
+// LLVM: store i64 %[[TMP_N]], ptr %[[SIZE_ADDR]], align 8
+// LLVM: %[[TMP_SAVED_STACK:.*]] = load ptr, ptr %[[SAVED_STACK_ADDR]], align 8
+// LLVM: call void @llvm.stackrestore.p0(ptr %[[TMP_SAVED_STACK]])
+
+// Note: VLA_EXPR0 below is emitted to capture debug info.
+
+// OGCG: %[[N_ADDR:.*]] = alloca i64, align 8
+// OGCG: %[[SAVED_STACK_ADDR:.*]] = alloca ptr, align 8
+// OGCG: %[[VLA_EXPR0:.*]] = alloca i64, align 8
+// OGCG: %[[SIZE_ADDR:.*]] = alloca i64, align 8
+// OGCG: store i64 10, ptr %[[N_ADDR]], align 8
+// OGCG: %[[TMP_N:.*]] = load i64, ptr %[[N_ADDR]], align 8
+// OGCG: %[[STACK_SAVE:.*]] = call ptr @llvm.stacksave.p0()
+// OGCG: store ptr %[[STACK_SAVE]], ptr %[[SAVED_STACK_ADDR]], align 8
+// OGCG: %[[ARR_ADDR:.*]] = alloca i8, i64 %[[TMP_N]], align 16
+// OGCG: store i64 %[[TMP_N]], ptr %[[VLA_EXPR0]], align 8
+// OGCG: store i64 %[[TMP_N]], ptr %[[SIZE_ADDR]], align 8
+// OGCG: %[[TMP_SAVED_STACK:.*]] = load ptr, ptr %[[SAVED_STACK_ADDR]], align 8
+// OGCG: call void @llvm.stackrestore.p0(ptr %[[TMP_SAVED_STACK]])
+
+void vla_expr_element_type_int() {
+  unsigned long n = 10ul;
+  int arr[n];
+  unsigned long size = sizeof(arr);
+}
+
+// CIR: %[[N_ADDR:.*]] = cir.alloca !u64i, !cir.ptr<!u64i>, ["n", init]
+// CIR: %[[SAVED_STACK_ADDR:.*]] = cir.alloca !cir.ptr<!u8i>, !cir.ptr<!cir.ptr<!u8i>>, ["saved_stack"]
+// CIR: %[[CONST_10:.*]] = cir.const #cir.int<10> : !u64i
+// CIR: cir.store {{.*}} %[[CONST_10]], %[[N_ADDR]] : !u64i, !cir.ptr<!u64i>
+// CIR: %[[TMP_N:.*]] = cir.load {{.*}} %[[N_ADDR]] : !cir.ptr<!u64i>, !u64i
+// CIR: %[[STACK_SAVE:.*]] = cir.stacksave : !cir.ptr<!u8i>
+// CIR: cir.store {{.*}} %[[STACK_SAVE]], %[[SAVED_STACK_ADDR]] : !cir.ptr<!u8i>, !cir.ptr<!cir.ptr<!u8i>>
+// CIR: %[[ARR_ADDR:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, %[[TMP_N]] : !u64i, ["arr"]
+// CIR: %[[SIZE_ADDR:.*]] = cir.alloca !u64i, !cir.ptr<!u64i>, ["size", init]
+// CIR: %[[CONST_4:.*]] = cir.const #cir.int<4> : !u64i
+// CIR: %[[SIZE:.*]] = cir.binop(mul, %[[CONST_4]], %[[TMP_N]]) nuw : !u64i
+// CIR: cir.store {{.*}} %[[SIZE]], %[[SIZE_ADDR]] : !u64i, !cir.ptr<!u64i>
+// CIR: %[[TMP_SAVED_STACK:.*]] = cir.load {{.*}} %[[SAVED_STACK_ADDR]] : !cir.ptr<!cir.ptr<!u8i>>, !cir.ptr<!u8i>
+// CIR: cir.stackrestore %[[TMP_SAVED_STACK]] : !cir.ptr<!u8i>
+
+// LLVM: %[[N_ADDR:.*]] = alloca i64, i64 1, align 8
+// LLVM: %[[SAVED_STACK_ADDR:.*]] = alloca ptr, i64 1, align 8
+// LLVM: store i64 10, ptr %[[N_ADDR]], align 8
+// LLVM: %[[TMP_N:.*]] = load i64, ptr %[[N_ADDR]], align 8
+// LLVM: %[[STACK_SAVE:.*]] = call ptr @llvm.stacksave.p0()
+// LLVM: store ptr %[[STACK_SAVE]], ptr %[[SAVED_STACK_ADDR]], align 8
+// LLVM: %[[ARR_ADDR:.*]] = alloca i32, i64 %[[TMP_N]], align 16
+// LLVM: %[[SIZE_ADDR:.*]] = alloca i64, i64 1, align 8
+// LLVM: %[[SIZE:.*]] = mul nuw i64 4, %[[TMP_N]]
+// LLVM: store i64 %[[SIZE]], ptr %[[SIZE_ADDR]], align 8
+// LLVM: %[[TMP_SAVED_STACK:.*]] = load ptr, ptr %[[SAVED_STACK_ADDR]], align 8
+// LLVM: call void @llvm.stackrestore.p0(ptr %[[TMP_SAVED_STACK]])
+
+// Note: VLA_EXPR0 below is emitted to capture debug info.
+
+// OGCG: %[[N_ADDR:.*]] = alloca i64, align 8
+// OGCG: %[[SAVED_STACK_ADDR:.*]] = alloca ptr, align 8
+// OGCG: %[[VLA_EXPR0:.*]] = alloca i64, align 8
+// OGCG: %[[SIZE_ADDR:.*]] = alloca i64, align 8
+// OGCG: store i64 10, ptr %[[N_ADDR]], align 8
+// OGCG: %[[TMP_N:.*]] = load i64, ptr %[[N_ADDR]], align 8
+// OGCG: %[[STACK_SAVE:.*]] = call ptr @llvm.stacksave.p0()
+// OGCG: store ptr %[[STACK_SAVE]], ptr %[[SAVED_STACK_ADDR]], align 8
+// OGCG: %[[ARR_ADDR:.*]] = alloca i32, i64 %[[TMP_N]], align 16
+// OGCG: store i64 %[[TMP_N]], ptr %[[VLA_EXPR0]], align 8
+// OGCG: %[[SIZE:.*]] = mul nuw i64 4, %[[TMP_N]]
+// OGCG: store i64 %[[SIZE]], ptr %[[SIZE_ADDR]], align 8
+// OGCG: %[[TMP_SAVED_STACK:.*]] = load ptr, ptr %[[SAVED_STACK_ADDR]], align 8
+// OGCG: call void @llvm.stackrestore.p0(ptr %[[TMP_SAVED_STACK]])
diff --git a/clang/test/CIR/CodeGen/statement-exprs.c b/clang/test/CIR/CodeGen/statement-exprs.c
index f917334ade829..2ea4672cbf7f9 100644
--- a/clang/test/CIR/CodeGen/statement-exprs.c
+++ b/clang/test/CIR/CodeGen/statement-exprs.c
@@ -9,7 +9,7 @@ int f19(void) {
   return ({ 3;;4; });
 }
 
-// CIR: cir.func dso_local @f19() -> !s32i
+// CIR: cir.func {{.*}} @f19() -> !s32i
 // CIR:   %[[RETVAL:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
 // CIR:   %[[TMP:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["tmp"]
 // CIR:   cir.scope {
@@ -47,7 +47,7 @@ void f20(void) {
   return ({ 3;;4;; });
 }
 
-// CIR-LABEL: cir.func dso_local @f20() {{[^-]*}}
+// CIR-LABEL: cir.func {{.*}} @f20() {{[^-]*}}
 // CIR: cir.return {{[^%]*}}
 
 // LLVM-LABEL: define{{.*}} void @f20
@@ -61,7 +61,7 @@ int nested(void) {
   }
 }
 
-// CIR: cir.func dso_local @nested() -> !s32i
+// CIR: cir.func {{.*}} @nested() -> !s32i
 // CIR:   %[[RETVAL:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
 // CIR:   %[[TMP_OUTER:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["tmp"]
 // CIR:   cir.scope {
@@ -164,7 +164,7 @@ void empty() {
   return ({;;;;});
 }
 
-// CIR: cir.func no_proto dso_local @empty()
+// CIR: cir.func {{.*}} @empty()
 // CIR-NEXT:   cir.return
 
 // LLVM: define dso_local void @empty()
@@ -177,7 +177,7 @@ void empty() {
 
 void empty2() { ({ }); }
 
-// CIR: @empty2
+// CIR: cir.func {{.*}} @empty2
 // CIR-NEXT: cir.return
 
 // LLVM: @empty2()
@@ -191,7 +191,7 @@ void empty2() { ({ }); }
 
 // Yields an out-of-scope scalar.
 void test2() { ({int x = 3; x; }); }
-// CIR: @test2
+// CIR: cir.func {{.*}} @test2
 // CIR: %[[RETVAL:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>
 // CIR: cir.scope {
 // CIR:   %[[VAR:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["x", init]
@@ -226,7 +226,7 @@ void test2() { ({int x = 3; x; }); }
 // Yields an aggregate.
 struct S { int x; };
 int test3() { return ({ struct S s = {1}; s; }).x; }
-// CIR: cir.func no_proto dso_local @test3() -> !s32i
+// CIR: cir.func {{.*}} @test3() -> !s32i
 // CIR:   %[[RETVAL:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
 // CIR:   cir.scope {
 // CIR:     %[[REF_TMP0:.+]] = cir.alloca !rec_S, !cir.ptr<!rec_S>, ["ref.tmp0"]
@@ -277,6 +277,6 @@ int test3() { return ({ struct S s = {1}; s; }).x; }
 
 // Expression is wrapped in an expression attribute (just ensure it does not crash).
 void test4(int x) { ({[[gsl::suppress("foo")]] x;}); }
-// CIR: @test4
+// CIR: cir.func {{.*}} @test4
 // LLVM: @test4
 // OGCG: @test4
diff --git a/clang/test/CIR/CodeGen/stmt-expr.cpp b/clang/test/CIR/CodeGen/stmt-expr.cpp
index 9e3911f638ba7..f65bf9b7e010f 100644
--- a/clang/test/CIR/CodeGen/stmt-expr.cpp
+++ b/clang/test/CIR/CodeGen/stmt-expr.cpp
@@ -20,7 +20,7 @@ void test1() {
   }).Foo();
 }
 
-// CIR: cir.func dso_local @_Z5test1v()
+// CIR: cir.func {{.*}} @_Z5test1v()
 // CIR:   cir.scope {
 // CIR:     %[[REF_TMP0:.+]] = cir.alloca !rec_A, !cir.ptr<!rec_A>, ["ref.tmp0"]
 // CIR:     %[[TMP:.+]]      = cir.alloca !rec_A, !cir.ptr<!rec_A>, ["tmp"]
@@ -67,7 +67,7 @@ void cleanup() {
   ({ with_dtor wd; });
 }
 
-// CIR: cir.func dso_local @_Z7cleanupv()
+// CIR: cir.func {{.*}} @_Z7cleanupv()
 // CIR:   cir.scope {
 // CIR:     %[[WD:.+]] = cir.alloca !rec_with_dtor, !cir.ptr<!rec_with_dtor>, ["wd"]
 // CIR:     cir.call @_ZN9with_dtorD1Ev(%[[WD]]) nothrow : (!cir.ptr<!rec_with_dtor>) -> ()
diff --git a/clang/test/CIR/CodeGen/struct.cpp b/clang/test/CIR/CodeGen/struct.cpp
index c15e7e7c57b9f..dc3e24113d8d8 100644
--- a/clang/test/CIR/CodeGen/struct.cpp
+++ b/clang/test/CIR/CodeGen/struct.cpp
@@ -109,13 +109,13 @@ void paren_expr() {
 // CIR:   %[[B_ADDR:.*]] = cir.alloca !rec_Point, !cir.ptr<!rec_Point>, ["b", init]
 // CIR:   %[[CONST:.*]] = cir.const #cir.zero : !rec_Point
 // CIR:   cir.store{{.*}} %[[CONST]], %[[A_ADDR]] : !rec_Point, !cir.ptr<!rec_Point>
-// CIR:   cir.call @_ZZ10paren_exprvEN5PointC1ERKS_(%[[B_ADDR]], %[[A_ADDR]]) nothrow : (!cir.ptr<!rec_Point>, !cir.ptr<!rec_Point>) -> ()
+// CIR:   cir.copy %[[A_ADDR]] to %[[B_ADDR]] : !cir.ptr<!rec_Point>
 
 // LLVM: define{{.*}} void @_Z10paren_exprv()
 // LLVM:   %[[A_ADDR:.*]] = alloca %struct.Point, i64 1, align 4
 // LLVM:   %[[B_ADDR:.*]] = alloca %struct.Point, i64 1, align 4
 // LLVM:   store %struct.Point zeroinitializer, ptr %[[A_ADDR]], align 4
-// LLVM:   call void @_ZZ10paren_exprvEN5PointC1ERKS_(ptr %[[B_ADDR]], ptr %[[A_ADDR]])
+// LLVM:   call void @llvm.memcpy.p0.p0.i32(ptr %[[B_ADDR]], ptr %[[A_ADDR]], i32 8, i1 false)
 
 // OGCG: define{{.*}} void @_Z10paren_exprv()
 // OGCG:   %[[A_ADDR:.*]] = alloca %struct.Point, align 4
@@ -133,14 +133,13 @@ void choose_expr() {
 // CIR:   %[[A_ADDR:.*]] = cir.alloca !rec_CompleteS, !cir.ptr<!rec_CompleteS>, ["a"]
 // CIR:   %[[B_ADDR:.*]] = cir.alloca !rec_CompleteS, !cir.ptr<!rec_CompleteS>, ["b"]
 // CIR:   %[[C_ADDR:.*]] = cir.alloca !rec_CompleteS, !cir.ptr<!rec_CompleteS>, ["c", init]
-// TODO(cir): Call to default copy constructor should be replaced by `cir.copy` op
-// CIR:   cir.call @_ZN9CompleteSC1ERKS_(%[[C_ADDR]], %[[A_ADDR]]) nothrow : (!cir.ptr<!rec_CompleteS>, !cir.ptr<!rec_CompleteS>) -> ()
+// CIR:   cir.copy %[[A_ADDR]] to %[[C_ADDR]] : !cir.ptr<!rec_CompleteS>
 
 // LLVM: define{{.*}} void @_Z11choose_exprv()
 // LLVM:   %[[A_ADDR:.*]] = alloca %struct.CompleteS, i64 1, align 4
 // LLVM:   %[[B_ADDR:.*]] = alloca %struct.CompleteS, i64 1, align 4
 // LLVM:   %[[C_ADDR:.*]] = alloca %struct.CompleteS, i64 1, align 4
-// LLVM:   call void @_ZN9CompleteSC1ERKS_(ptr %[[C_ADDR]], ptr %[[A_ADDR]])
+// LLVM:   call void @llvm.memcpy.p0.p0.i32(ptr %[[C_ADDR]], ptr %[[A_ADDR]], i32 8, i1 false)
 
 // OGCG: define{{.*}} void @_Z11choose_exprv()
 // OGCG:   %[[A_ADDR:.*]] = alloca %struct.CompleteS, align 4
@@ -160,15 +159,14 @@ void generic_selection() {
 // CIR:   %[[B_ADDR:.*]] = cir.alloca !rec_CompleteS, !cir.ptr<!rec_CompleteS>, ["b"]
 // CIR:   %[[C_ADDR:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["c"]
 // CIR:   %[[D_ADDR:.*]] = cir.alloca !rec_CompleteS, !cir.ptr<!rec_CompleteS>, ["d", init]
-// TODO(cir): Call to default copy constructor should be replaced by `cir.copy` op
-// CIR:   cir.call @_ZN9CompleteSC1ERKS_(%[[D_ADDR]], %[[A_ADDR]]) nothrow : (!cir.ptr<!rec_CompleteS>, !cir.ptr<!rec_CompleteS>) -> ()
+// CIR:   cir.copy %[[A_ADDR]] to %[[D_ADDR]] : !cir.ptr<!rec_CompleteS>
 
 // LLVM: define{{.*}} void @_Z17generic_selectionv()
 // LLVM:   %1 = alloca %struct.CompleteS, i64 1, align 4
 // LLVM:   %2 = alloca %struct.CompleteS, i64 1, align 4
 // LLVM:   %3 = alloca i32, i64 1, align 4
 // LLVM:   %4 = alloca %struct.CompleteS, i64 1, align 4
-// LLVM:   call void @_ZN9CompleteSC1ERKS_(ptr %4, ptr %1)
+// LLVM:   call void @llvm.memcpy.p0.p0.i32(ptr %4, ptr %1, i32 8, i1 false)
 
 // OGCG: define{{.*}} void @_Z17generic_selectionv()
 // OGCG:   %[[A_ADDR:.*]] = alloca %struct.CompleteS, align 4
@@ -188,7 +186,7 @@ void designated_init_update_expr() {
 // CIR: %[[A_ADDR:.*]] = cir.alloca !rec_CompleteS, !cir.ptr<!rec_CompleteS>, ["a"]
 // CIR: %[[B_ADDR:.*]] = cir.alloca !rec_Container, !cir.ptr<!rec_Container>, ["b", init]
 // CIR: %[[C_ADDR:.*]] = cir.get_member %[[B_ADDR]][0] {name = "c"} : !cir.ptr<!rec_Container> -> !cir.ptr<!rec_CompleteS>
-// CIR: cir.call @_ZN9CompleteSC1ERKS_(%2, %[[A_ADDR]]) nothrow : (!cir.ptr<!rec_CompleteS>, !cir.ptr<!rec_CompleteS>) -> ()
+// CIR: cir.copy %[[A_ADDR]] to %[[C_ADDR]] : !cir.ptr<!rec_CompleteS>
 // CIR: %[[ELEM_0_PTR:.*]] = cir.get_member %[[C_ADDR]][0] {name = "a"} : !cir.ptr<!rec_CompleteS> -> !cir.ptr<!s32i>
 // CIR: %[[CONST_1:.*]] = cir.const #cir.int<1> : !s32i
 // CIR: cir.store{{.*}} %[[CONST_1]], %[[ELEM_0_PTR]] : !s32i, !cir.ptr<!s32i>
@@ -197,7 +195,7 @@ void designated_init_update_expr() {
 // LLVM: %[[A_ADDR:.*]] = alloca %struct.CompleteS, i64 1, align 4
 // LLVM: %[[B_ADDR:.*]] = alloca %struct.Container, i64 1, align 4
 // LLVM: %[[C_ADDR:.*]] = getelementptr %struct.Container, ptr %[[B_ADDR]], i32 0, i32 0
-// LLVM: call void @_ZN9CompleteSC1ERKS_(ptr %[[C_ADDR]], ptr %[[A_ADDR]])
+// LLVM: call void @llvm.memcpy.p0.p0.i32(ptr %[[C_ADDR]], ptr %[[A_ADDR]], i32 8, i1 false)
 // LLVM: %[[ELEM_0_PTR:.*]] = getelementptr %struct.CompleteS, ptr %[[C_ADDR]], i32 0, i32 0
 // LLVM: store i32 1, ptr %[[ELEM_0_PTR]], align 4
 // LLVM: %[[ELEM_1_PTR:.*]] = getelementptr %struct.CompleteS, ptr %[[C_ADDR]], i32 0, i32 1
diff --git a/clang/test/CIR/CodeGen/var_arg.c b/clang/test/CIR/CodeGen/var_arg.c
index f5b92c61e11ad..aab909eb67672 100644
--- a/clang/test/CIR/CodeGen/var_arg.c
+++ b/clang/test/CIR/CodeGen/var_arg.c
@@ -17,7 +17,7 @@ int varargs(int count, ...) {
     return res;
 }
 
-// CIR-LABEL: cir.func dso_local @varargs(
+// CIR-LABEL: cir.func {{.*}} @varargs(
 // CIR:   %[[COUNT_ADDR:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["count", init]
 // CIR:   %[[RET_ADDR:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
 // CIR:   %[[VAAREA:.+]] = cir.alloca !cir.array<!rec___va_list_tag x 1>, !cir.ptr<!cir.array<!rec___va_list_tag x 1>>, ["args"]
@@ -93,7 +93,7 @@ int stdarg_start(int count, ...) {
     return res;
 }
 
-// CIR-LABEL: cir.func dso_local @stdarg_start(
+// CIR-LABEL: cir.func {{.*}} @stdarg_start(
 // CIR:   %[[COUNT_ADDR:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["count", init]
 // CIR:   %[[RET_ADDR:.+]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
 // CIR:   %[[VAAREA:.+]] = cir.alloca !cir.array<!rec___va_list_tag x 1>, !cir.ptr<!cir.array<!rec___va_list_tag x 1>>, ["args"]
@@ -141,7 +141,7 @@ int stdarg_start(int count, ...) {
 // OGCG:   %[[COND:.+]] = icmp ule i32 %[[GPOFFSET]], 40
 // OGCG:   br i1 %[[COND]], label %vaarg.in_reg, label %vaarg.in_mem
 //
-// OGCG: vaarg.in_reg:                                   
+// OGCG: vaarg.in_reg:
 // OGCG:   %[[REGSAVE_PTR:.+]] = getelementptr inbounds nuw %struct.__va_list_tag, ptr %[[DECAY1]], i32 0, i32 3
 // OGCG:   %[[REGSAVE:.+]] = load ptr, ptr %[[REGSAVE_PTR]]
 // OGCG:   %[[VAADDR1:.+]] = getelementptr i8, ptr %[[REGSAVE]], i32 %[[GPOFFSET]]
@@ -164,3 +164,23 @@ int stdarg_start(int count, ...) {
 // OGCG:   call void @llvm.va_end.p0(ptr %[[DECAY2]])
 // OGCG:   %[[VAL:.+]] = load i32, ptr %[[RES_ADDR]]
 // OGCG:   ret i32 %[[VAL]]
+
+void stdarg_copy() {
+    __builtin_va_list src, dest;
+    __builtin_va_copy(src, dest);
+}
+
+// CIR-LABEL: @stdarg_copy
+// CIR:    %{{.*}} = cir.cast array_to_ptrdecay %{{.*}} : !cir.ptr<!cir.array<!rec___va_list_tag x 1>> -> !cir.ptr<!rec___va_list_tag>
+// CIR:    %{{.*}} = cir.cast array_to_ptrdecay %{{.*}} : !cir.ptr<!cir.array<!rec___va_list_tag x 1>> -> !cir.ptr<!rec___va_list_tag>
+// CIR:    cir.va_copy %{{.*}} to %{{.*}} : !cir.ptr<!rec___va_list_tag>, !cir.ptr<!rec___va_list_tag>
+
+// LLVM-LABEL: @stdarg_copy
+// LLVM:   %{{.*}} = getelementptr %struct.__va_list_tag, ptr %{{.*}}
+// LLVM:   %{{.*}} = getelementptr %struct.__va_list_tag, ptr %{{.*}}
+// LLVM:   call void @llvm.va_copy.p0(ptr %{{.*}}, ptr %{{.*}}
+
+// OGCG-LABEL: @stdarg_copy
+// OGCG:   %{{.*}} = getelementptr inbounds [1 x %struct.__va_list_tag], ptr %{{.*}}
+// OGCG:   %{{.*}} = getelementptr inbounds [1 x %struct.__va_list_tag], ptr %{{.*}}
+// OGCG:   call void @llvm.va_copy.p0(ptr %{{.*}}, ptr %{{.*}}
diff --git a/clang/test/CIR/CodeGen/variable-decomposition.cpp b/clang/test/CIR/CodeGen/variable-decomposition.cpp
index f0e19263cd6db..3ba2fac3151c9 100644
--- a/clang/test/CIR/CodeGen/variable-decomposition.cpp
+++ b/clang/test/CIR/CodeGen/variable-decomposition.cpp
@@ -16,7 +16,7 @@ float function() {
   return a + b;
 }
 
-// CIR-LABEL: cir.func dso_local @_Z8functionv() -> !cir.float
+// CIR-LABEL: cir.func {{.*}} @_Z8functionv() -> !cir.float
 // CIR:  %[[RETVAL:.+]] = cir.alloca !cir.float, !cir.ptr<!cir.float>, ["__retval"]
 // CIR:  %[[STRUCT:.+]] = cir.alloca !rec_some_struct, !cir.ptr<!rec_some_struct>, ["", init]
 // CIR:  %[[CONST:.+]] = cir.const #cir.const_record<{#cir.int<1> : !s32i, #cir.fp<2.000000e+00> : !cir.float}> : !rec_some_struct
diff --git a/clang/test/CIR/CodeGen/vbase.cpp b/clang/test/CIR/CodeGen/vbase.cpp
index 8fcb2a442cd16..c1f3972b0aed3 100644
--- a/clang/test/CIR/CodeGen/vbase.cpp
+++ b/clang/test/CIR/CodeGen/vbase.cpp
@@ -128,7 +128,7 @@ void ppp() { B b; }
 // OGCG:   ret void
 
 // Constructor for B
-// CIR: cir.func comdat linkonce_odr @_ZN1BC1Ev(%arg0: !cir.ptr<!rec_B>
+// CIR: cir.func {{.*}} @_ZN1BC1Ev(%arg0: !cir.ptr<!rec_B>
 // CIR:   %[[THIS_ADDR:.*]] = cir.alloca !cir.ptr<!rec_B>, !cir.ptr<!cir.ptr<!rec_B>>, ["this", init]
 // CIR:   cir.store %arg0, %[[THIS_ADDR]] : !cir.ptr<!rec_B>, !cir.ptr<!cir.ptr<!rec_B>>
 // CIR:   %[[THIS:.*]] = cir.load %[[THIS_ADDR]] : !cir.ptr<!cir.ptr<!rec_B>>, !cir.ptr<!rec_B>
diff --git a/clang/test/CIR/CodeGen/volatile.cpp b/clang/test/CIR/CodeGen/volatile.cpp
index df1d3a66733e3..17a7154291692 100644
--- a/clang/test/CIR/CodeGen/volatile.cpp
+++ b/clang/test/CIR/CodeGen/volatile.cpp
@@ -9,7 +9,7 @@ int test_load(volatile int *ptr) {
   return *ptr;
 }
 
-// CIR: cir.func dso_local @_Z9test_loadPVi
+// CIR: cir.func {{.*}} @_Z9test_loadPVi
 // CIR:   cir.load volatile
 
 // LLVM: define {{.*}} i32 @_Z9test_loadPVi
@@ -22,7 +22,7 @@ void test_store(volatile int *ptr) {
   *ptr = 42;
 }
 
-// CIR: cir.func dso_local @_Z10test_storePVi
+// CIR: cir.func {{.*}} @_Z10test_storePVi
 // CIR:   cir.store volatile
 
 // LLVM: define {{.*}} void @_Z10test_storePVi
@@ -41,7 +41,7 @@ int test_load_field1(volatile Foo *ptr) {
   return ptr->x;
 }
 
-// CIR: cir.func dso_local @_Z16test_load_field1PV3Foo
+// CIR: cir.func {{.*}} @_Z16test_load_field1PV3Foo
 // CIR:   %[[MEMBER_ADDR:.*]] = cir.get_member
 // CIR:   %{{.+}} = cir.load volatile{{.*}} %[[MEMBER_ADDR]]
 
@@ -57,7 +57,7 @@ int test_load_field2(Foo *ptr) {
   return ptr->y;
 }
 
-// CIR: cir.func dso_local @_Z16test_load_field2P3Foo
+// CIR: cir.func {{.*}} @_Z16test_load_field2P3Foo
 // CIR:   %[[MEMBER_ADDR:.*]] = cir.get_member
 // CIR:   %{{.+}} = cir.load volatile{{.*}} %[[MEMBER_ADDR]]
 
@@ -73,7 +73,7 @@ int test_load_field3(Foo *ptr) {
   return ptr->z;
 }
 
-// CIR: cir.func dso_local @_Z16test_load_field3P3Foo
+// CIR: cir.func {{.*}} @_Z16test_load_field3P3Foo
 // CIR:   %[[MEMBER_ADDR:.*]] = cir.get_member
 // CIR:   %{{.*}} = cir.get_bitfield align(4) (#bfi_z, %[[MEMBER_ADDR:.+]] {is_volatile} : !cir.ptr<!u8i>) -> !s32i
 
@@ -95,7 +95,7 @@ void test_store_field1(volatile Foo *ptr) {
   ptr->x = 42;
 }
 
-// CIR: cir.func dso_local @_Z17test_store_field1PV3Foo
+// CIR: cir.func {{.*}} @_Z17test_store_field1PV3Foo
 // CIR:   %[[MEMBER_ADDR:.*]] = cir.get_member
 // CIR:   cir.store volatile{{.*}} %{{.+}}, %[[MEMBER_ADDR]]
 
@@ -111,7 +111,7 @@ void test_store_field2(Foo *ptr) {
   ptr->y = 42;
 }
 
-// CIR: cir.func dso_local @_Z17test_store_field2P3Foo
+// CIR: cir.func {{.*}} @_Z17test_store_field2P3Foo
 // CIR:   %[[MEMBER_ADDR:.*]] = cir.get_member
 // CIR:   cir.store volatile{{.*}} %{{.+}}, %[[MEMBER_ADDR]]
 
@@ -127,7 +127,7 @@ void test_store_field3(Foo *ptr) {
   ptr->z = 4;
 }
 
-// CIR: cir.func dso_local @_Z17test_store_field3P3Foo
+// CIR: cir.func {{.*}} @_Z17test_store_field3P3Foo
 // CIR:   %[[MEMBER_ADDR:.*]] = cir.get_member
 // CIR:   cir.set_bitfield align(4) (#bfi_z, %[[MEMBER_ADDR:.+]] : !cir.ptr<!u8i>, %1 : !s32i) {is_volatile}
 
@@ -155,7 +155,7 @@ void A::set_x(int val) volatile {
   x = val;
 }
 
-// CIR: cir.func dso_local @_ZNV1A5set_xEi
+// CIR: cir.func {{.*}} @_ZNV1A5set_xEi
 // CIR:   %[[MEMBER_ADDR:.*]] = cir.get_member %{{.*}}[0] {name = "x"}
 // CIR:   cir.store volatile {{.*}} %{{.*}}, %[[MEMBER_ADDR]]
 
@@ -171,7 +171,7 @@ int A::get_x() volatile {
   return x;
 }
 
-// CIR: cir.func dso_local @_ZNV1A5get_xEv
+// CIR: cir.func {{.*}} @_ZNV1A5get_xEv
 // CIR:   %[[MEMBER_ADDR:.*]] = cir.get_member %{{.*}}[0] {name = "x"}
 // CIR:   cir.load volatile {{.*}} %[[MEMBER_ADDR]]
 
diff --git a/clang/test/CIR/CodeGen/vtable-emission.cpp b/clang/test/CIR/CodeGen/vtable-emission.cpp
index 9a34573b475c3..ceefb2ab31443 100644
--- a/clang/test/CIR/CodeGen/vtable-emission.cpp
+++ b/clang/test/CIR/CodeGen/vtable-emission.cpp
@@ -32,7 +32,7 @@ void S::key() {}
 // OGCG:      @_ZTV1S = unnamed_addr constant { [4 x ptr] } { [4 x ptr]
 // OGCG-SAME:      [ptr null, ptr null, ptr @_ZN1S3keyEv, ptr @_ZN1S6nonKeyEv] }
 
-// CHECK: cir.func dso_local @_ZN1S3keyEv
+// CHECK: cir.func {{.*}} @_ZN1S3keyEv
 
 // The reference from the vtable should result in nonKey being emitted.
-// CHECK: cir.func comdat linkonce_odr @_ZN1S6nonKeyEv
+// CHECK: cir.func no_inline comdat linkonce_odr @_ZN1S6nonKeyEv
diff --git a/clang/test/CIR/CodeGenBuiltins/X86/avx-builtins.c b/clang/test/CIR/CodeGenBuiltins/X86/avx-builtins.c
index 82fa4358dc400..d9a8771023fa7 100644
--- a/clang/test/CIR/CodeGenBuiltins/X86/avx-builtins.c
+++ b/clang/test/CIR/CodeGenBuiltins/X86/avx-builtins.c
@@ -73,4 +73,76 @@ __m256i test_mm256_undefined_si256(void) {
   // OGCG-LABEL: test_mm256_undefined_si256
   // OGCG: ret <4 x i64> zeroinitializer
   return _mm256_undefined_si256();
-}
\ No newline at end of file
+}
+
+__m256d test_mm256_shuffle_pd(__m256d A, __m256d B) {
+  // CIR-LABEL: test_mm256_shuffle_pd
+  // CIR: %{{.*}} = cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<4 x !cir.double>) [#cir.int<0> : !s32i, #cir.int<4> : !s32i, #cir.int<2> : !s32i, #cir.int<6> : !s32i] : !cir.vector<4 x !cir.double>
+
+  // LLVM-LABEL: test_mm256_shuffle_pd
+  // LLVM: shufflevector <4 x double> %{{.*}}, <4 x double> %{{.*}}, <4 x i32> <i32 0, i32 4, i32 2, i32 6>
+
+  // OGCG-LABEL: test_mm256_shuffle_pd
+  // OGCG: shufflevector <4 x double> %{{.*}}, <4 x double> %{{.*}}, <4 x i32> <i32 0, i32 4, i32 2, i32 6>
+  return _mm256_shuffle_pd(A, B, 0);
+}
+
+__m256 test_mm256_shuffle_ps(__m256 A, __m256 B) {
+  // CIR-LABEL: test_mm256_shuffle_ps
+  // CIR: %{{.*}} = cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<8 x !cir.float>) [#cir.int<0> : !s32i, #cir.int<0> : !s32i, #cir.int<8> : !s32i, #cir.int<8> : !s32i, #cir.int<4> : !s32i, #cir.int<4> : !s32i, #cir.int<12> : !s32i, #cir.int<12> : !s32i] : !cir.vector<8 x !cir.float>
+
+  // LLVM-LABEL: test_mm256_shuffle_ps
+  // LLVM: shufflevector <8 x float> %{{.*}}, <8 x float> %{{.*}}, <8 x i32> <i32 0, i32 0, i32 8, i32 8, i32 4, i32 4, i32 12, i32 12>
+
+  // OGCG-LABEL: test_mm256_shuffle_ps
+  // OGCG: shufflevector <8 x float> %{{.*}}, <8 x float> %{{.*}}, <8 x i32> <i32 0, i32 0, i32 8, i32 8, i32 4, i32 4, i32 12, i32 12>
+  return _mm256_shuffle_ps(A, B, 0);
+}
+
+__m128 test_mm_permute_ps(__m128 A) {
+    // CIR-LABEL: test_mm_permute_ps
+    // CIR: cir.vec.shuffle(%{{.*}}, %{{.*}} :  !cir.vector<4 x !cir.float>) [#cir.int<2> : !s32i, #cir.int<3> : !s32i, #cir.int<0> : !s32i, #cir.int<1> : !s32i] : !cir.vector<4 x !cir.float>
+
+    // LLVM-LABEL: test_mm_permute_ps
+    // LLVM: shufflevector <4 x float> %{{.*}}, <4 x float> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
+
+    // OGCG-LABEL: test_mm_permute_ps
+    // OGCG: shufflevector <4 x float> %{{.*}}, <4 x float> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
+    return _mm_permute_ps(A, 0x4E);
+}
+
+__m256 test_mm256_permute_ps(__m256 A) {
+    // CIR-LABEL: test_mm256_permute_ps
+    // CIR: cir.vec.shuffle(%{{.*}}, %{{.*}} :  !cir.vector<8 x !cir.float>) [#cir.int<2> : !s32i, #cir.int<3> : !s32i, #cir.int<0> : !s32i, #cir.int<1> : !s32i, #cir.int<6> : !s32i, #cir.int<7> : !s32i, #cir.int<4> : !s32i, #cir.int<5> : !s32i] : !cir.vector<8 x !cir.float>
+
+    // LLVM-LABEL: test_mm256_permute_ps
+    // LLVM: shufflevector <8 x float> %{{.*}}, <8 x float> poison, <8 x i32> <i32 2, i32 3, i32 0, i32 1, i32 6, i32 7, i32 4, i32 5>
+
+    // OGCG-LABEL: test_mm256_permute_ps
+    // OGCG: shufflevector <8 x float> %{{.*}}, <8 x float> poison, <8 x i32> <i32 2, i32 3, i32 0, i32 1, i32 6, i32 7, i32 4, i32 5>
+    return _mm256_permute_ps(A, 0x4E);
+}
+
+__m128d test_mm_permute_pd(__m128d A) {
+    // CIR-LABEL: test_mm_permute_pd
+    // CIR: cir.vec.shuffle(%{{.*}}, %{{.*}} :  !cir.vector<2 x !cir.double>) [#cir.int<1> : !s32i, #cir.int<0> : !s32i] : !cir.vector<2 x !cir.double>
+
+    // LLVM-LABEL: test_mm_permute_pd
+    // LLVM: shufflevector <2 x double> %{{.*}}, <2 x double> poison, <2 x i32> <i32 1, i32 0>
+
+    // OGCG-LABEL: test_mm_permute_pd
+    // OGCG: shufflevector <2 x double> %{{.*}}, <2 x double> poison, <2 x i32> <i32 1, i32 0>
+    return _mm_permute_pd(A, 0x1);
+}
+
+__m256d test_mm256_permute_pd(__m256d A) {
+    // CIR-LABEL: test_mm256_permute_pd
+    // CIR: cir.vec.shuffle(%{{.*}}, %{{.*}} :  !cir.vector<4 x !cir.double>) [#cir.int<1> : !s32i, #cir.int<0> : !s32i, #cir.int<3> : !s32i, #cir.int<2> : !s32i] : !cir.vector<4 x !cir.double>
+
+    // LLVM-LABEL: test_mm256_permute_pd
+    // LLVM: shufflevector <4 x double> %{{.*}}, <4 x double> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
+
+    // OGCG-LABEL: test_mm256_permute_pd
+    // OGCG: shufflevector <4 x double> %{{.*}}, <4 x double> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
+    return _mm256_permute_pd(A, 0x5);
+}
diff --git a/clang/test/CIR/CodeGenBuiltins/X86/avx2-builtins.c b/clang/test/CIR/CodeGenBuiltins/X86/avx2-builtins.c
new file mode 100644
index 0000000000000..b7497c2053b2d
--- /dev/null
+++ b/clang/test/CIR/CodeGenBuiltins/X86/avx2-builtins.c
@@ -0,0 +1,53 @@
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx2 -fclangir -emit-cir -o %t.cir -Wall -Werror
+// RUN: FileCheck --check-prefixes=CIR --input-file=%t.cir %s
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx2 -fno-signed-char -fclangir -emit-cir -o %t.cir -Wall -Werror
+// RUN: FileCheck --check-prefixes=CIR --input-file=%t.cir %s
+
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx2 -fclangir -emit-llvm -o %t.ll -Wall -Werror
+// RUN: FileCheck --check-prefixes=LLVM --input-file=%t.ll %s
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx2 -fno-signed-char -fclangir -emit-llvm -o %t.ll -Wall -Werror
+// RUN: FileCheck --check-prefixes=LLVM --input-file=%t.ll %s
+
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx2 -fclangir -emit-cir -o %t.cir -Wall -Werror
+// RUN: FileCheck --check-prefixes=CIR --input-file=%t.cir %s
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx2 -fno-signed-char -fclangir -emit-cir -o %t.cir -Wall -Werror
+// RUN: FileCheck --check-prefixes=CIR --input-file=%t.cir %s
+
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx2 -fclangir -emit-llvm -o %t.ll -Wall -Werror
+// RUN: FileCheck --check-prefixes=LLVM --input-file=%t.ll %s
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx2 -fno-signed-char -fclangir -emit-llvm -o %t.ll -Wall -Werror
+// RUN: FileCheck --check-prefixes=LLVM --input-file=%t.ll %s
+
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +avx2 -emit-llvm -o - -Wall -Werror | FileCheck %s --check-prefixes=OGCG
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +avx2 -fno-signed-char -emit-llvm -o - -Wall -Werror | FileCheck %s --check-prefixes=OGCG
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +avx2 -emit-llvm -o - -Wall -Werror | FileCheck %s --check-prefixes=OGCG
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +avx2 -fno-signed-char -emit-llvm -o - -Wall -Werror | FileCheck %s --check-prefixes=OGCG
+
+// This test mimics clang/test/CodeGen/X86/avx2-builtins.c, which eventually
+// CIR shall be able to support fully.
+
+#include <immintrin.h>
+
+__m256i test_mm256_shufflelo_epi16(__m256i a) {
+  // CIR-LABEL: _mm256_shufflelo_epi16
+  // CIR: %{{.*}} = cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<16 x !s16i>) [#cir.int<3> : !s32i, #cir.int<0> : !s32i, #cir.int<1> : !s32i, #cir.int<1> : !s32i, #cir.int<4> : !s32i, #cir.int<5> : !s32i, #cir.int<6> : !s32i, #cir.int<7> : !s32i, #cir.int<11> : !s32i, #cir.int<8> : !s32i, #cir.int<9> : !s32i, #cir.int<9> : !s32i, #cir.int<12> : !s32i, #cir.int<13> : !s32i, #cir.int<14> : !s32i, #cir.int<15> : !s32i] : !cir.vector<16 x !s16i>
+
+  // LLVM-LABEL: test_mm256_shufflelo_epi16
+  // LLVM: shufflevector <16 x i16> %{{.*}}, <16 x i16> poison, <16 x i32> <i32 3, i32 0, i32 1, i32 1, i32 4, i32 5, i32 6, i32 7, i32 11, i32 8, i32 9, i32 9, i32 12, i32 13, i32 14, i32 15>
+
+  // OGCG-LABEL: test_mm256_shufflelo_epi16
+  // OGCG: shufflevector <16 x i16> %{{.*}}, <16 x i16> poison, <16 x i32> <i32 3, i32 0, i32 1, i32 1, i32 4, i32 5, i32 6, i32 7, i32 11, i32 8, i32 9, i32 9, i32 12, i32 13, i32 14, i32 15>
+  return _mm256_shufflelo_epi16(a, 83);
+}
+
+__m256i test_mm256_shufflehi_epi16(__m256i a) {
+  // CIR-LABEL: _mm256_shufflehi_epi16
+  // CIR: %{{.*}} = cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<16 x !s16i>) [#cir.int<0> : !s32i, #cir.int<1> : !s32i, #cir.int<2> : !s32i, #cir.int<3> : !s32i, #cir.int<7> : !s32i, #cir.int<6> : !s32i, #cir.int<6> : !s32i, #cir.int<5> : !s32i, #cir.int<8> : !s32i, #cir.int<9> : !s32i, #cir.int<10> : !s32i, #cir.int<11> : !s32i, #cir.int<15> : !s32i, #cir.int<14> : !s32i, #cir.int<14> : !s32i, #cir.int<13> : !s32i] : !cir.vector<16 x !s16i>
+
+  // LLVM-LABEL: test_mm256_shufflehi_epi16
+  // LLVM: shufflevector <16 x i16> %{{.*}}, <16 x i16> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 7, i32 6, i32 6, i32 5, i32 8, i32 9, i32 10, i32 11, i32 15, i32 14, i32 14, i32 13>
+
+  // OGCG-LABEL: test_mm256_shufflehi_epi16
+  // OGCG: shufflevector <16 x i16> %{{.*}}, <16 x i16> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 7, i32 6, i32 6, i32 5, i32 8, i32 9, i32 10, i32 11, i32 15, i32 14, i32 14, i32 13>
+  return _mm256_shufflehi_epi16(a, 107);
+}
diff --git a/clang/test/CIR/CodeGenBuiltins/X86/avx512bw-builtins.c b/clang/test/CIR/CodeGenBuiltins/X86/avx512bw-builtins.c
index 3522e2c7e50bf..48a89769ea10f 100644
--- a/clang/test/CIR/CodeGenBuiltins/X86/avx512bw-builtins.c
+++ b/clang/test/CIR/CodeGenBuiltins/X86/avx512bw-builtins.c
@@ -1,15 +1,32 @@
-// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fclangir -emit-cir -o %t.cir -Wall -Werror
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fclangir -emit-cir -o %t.cir -Wall -Werror -Wsign-conversion
+// RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fno-signed-char -fclangir -emit-cir -o %t.cir -Wall -Werror -Wsign-conversion
 // RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
-// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fclangir -emit-llvm -o %t.ll -Wall -Werror
-// RUN: FileCheck --check-prefix=LLVM --input-file=%t.ll %s
 
-// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fno-signed-char -fclangir -emit-cir -o %t.cir -Wall -Werror
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fclangir -emit-cir -o %t.cir -Wall -Werror -Wsign-conversion
+// RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fno-signed-char -fclangir -emit-cir -o %t.cir -Wall -Werror -Wsign-conversion
 // RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
-// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fno-signed-char -fclangir -emit-llvm -o %t.ll -Wall -Werror
-// RUN: FileCheck --check-prefix=LLVM --input-file=%t.ll %s
 
-// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -emit-llvm -o - -Wall -Werror | FileCheck %s -check-prefix=OGCG
-// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -emit-llvm -o - -Wall -Werror | FileCheck %s -check-prefix=OGCG
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fclangir -emit-llvm -o %t.ll -Wall -Werror -Wsign-conversion
+// RUN: FileCheck --check-prefixes=LLVM --input-file=%t.ll %s
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fno-signed-char -fclangir -emit-llvm -o %t.ll -Wall -Werror -Wsign-conversion
+// RUN: FileCheck --check-prefixes=LLVM --input-file=%t.ll %s
+
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fclangir -emit-llvm -o %t.ll -Wall -Werror -Wsign-conversion
+// RUN: FileCheck --check-prefixes=LLVM --input-file=%t.ll %s
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fno-signed-char -fclangir -emit-llvm -o %t.ll -Wall -Werror -Wsign-conversion
+// RUN: FileCheck --check-prefixes=LLVM --input-file=%t.ll %s
+
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -emit-llvm -o - -Wall -Werror -Wsign-conversion | FileCheck %s --check-prefix=OGCG
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fno-signed-char -emit-llvm -o - -Wall -Werror -Wsign-conversion | FileCheck %s --check-prefix=OGCG
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +avx512bw -emit-llvm -o - -Wall -Werror -Wsign-conversion | FileCheck %s --check-prefixes=OGCG
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +avx512bw -fno-signed-char -emit-llvm -o - -Wall -Werror -Wsign-conversion | FileCheck %s --check-prefixes=OGCG
+
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -emit-llvm -o - -Wall -Werror -Wsign-conversion | FileCheck %s --check-prefix=OGCG
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512bw -fno-signed-char -emit-llvm -o - -Wall -Werror -Wsign-conversion | FileCheck %s --check-prefix=OGCG
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +avx512bw -emit-llvm -o - -Wall -Werror -Wsign-conversion | FileCheck %s --check-prefixes=OGCG
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +avx512bw -fno-signed-char -emit-llvm -o - -Wall -Werror -Wsign-conversion | FileCheck %s --check-prefixes=OGCG
 
 // This test mimics clang/test/CodeGen/X86/avx512bw-builtins.c, which eventually
 // CIR shall be able to support fully.
@@ -115,3 +132,430 @@ __mmask32 test_kshiftri_mask32_out_of_range(__mmask32 A) {
 
   return _kshiftri_mask32(A, 33);
 }
+
+__mmask32 test_kadd_mask32(__mmask32 A, __mmask32 B) {
+  // CIR-LABEL: _kadd_mask32
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.kadd.d"
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<32 x !cir.int<u, 1>> -> !u32i
+
+  // LLVM-LABEL: _kadd_mask32
+  // LLVM: [[L:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: [[R:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: [[RES:%.*]] = call <32 x i1> @llvm.x86.avx512.kadd.d(<32 x i1> [[L]], <32 x i1> [[R]])
+  // LLVM: bitcast <32 x i1> [[RES]] to i32
+
+  // OGCG-LABEL: _kadd_mask32
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: call <32 x i1> @llvm.x86.avx512.kadd.d
+  // OGCG: bitcast <32 x i1> {{.*}} to i32
+  return _kadd_mask32(A, B);
+}
+
+__mmask64 test_kadd_mask64(__mmask64 A, __mmask64 B) {
+  // CIR-LABEL: _kadd_mask64
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.kadd.q"
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<64 x !cir.int<u, 1>> -> !u64i
+
+  // LLVM-LABEL: _kadd_mask64
+  // LLVM: [[L:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: [[R:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: [[RES:%.*]] = call <64 x i1> @llvm.x86.avx512.kadd.q(<64 x i1> [[L]], <64 x i1> [[R]])
+  // LLVM: bitcast <64 x i1> [[RES]] to i64
+
+  // OGCG-LABEL: _kadd_mask64
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: call <64 x i1> @llvm.x86.avx512.kadd.q
+  // OGCG: bitcast <64 x i1> {{.*}} to i64
+  return _kadd_mask64(A, B);
+}
+
+__mmask32 test_kand_mask32(__mmask32 A, __mmask32 B) {
+  // CIR-LABEL: _kand_mask32
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.binop(and, {{.*}}, {{.*}}) : !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<32 x !cir.int<u, 1>> -> !u32i
+
+  // LLVM-LABEL: _kand_mask32
+  // LLVM: [[L:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: [[R:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: [[RES:%.*]] = and <32 x i1> [[L]], [[R]]
+  // LLVM: bitcast <32 x i1> [[RES]] to i32
+
+  // OGCG-LABEL: _kand_mask32
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: and <32 x i1>
+  // OGCG: bitcast <32 x i1> {{.*}} to i32
+  return _kand_mask32(A, B);
+}
+
+__mmask64 test_kand_mask64(__mmask64 A, __mmask64 B) {
+  // CIR-LABEL: _kand_mask64
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.binop(and, {{.*}}, {{.*}}) : !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<64 x !cir.int<u, 1>> -> !u64i
+
+  // LLVM-LABEL: _kand_mask64
+  // LLVM: [[L:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: [[R:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: [[RES:%.*]] = and <64 x i1> [[L]], [[R]]
+  // LLVM: bitcast <64 x i1> [[RES]] to i64
+
+  // OGCG-LABEL: _kand_mask64
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: and <64 x i1>
+  // OGCG: bitcast <64 x i1> {{.*}} to i64
+  return _kand_mask64(A, B);
+}
+
+__mmask32 test_kandn_mask32(__mmask32 A, __mmask32 B) {
+  // CIR-LABEL: _kandn_mask32
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.unary(not, {{.*}}) : !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.binop(and, {{.*}}, {{.*}}) : !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<32 x !cir.int<u, 1>> -> !u32i
+
+  // LLVM-LABEL: _kandn_mask32
+  // LLVM: [[L:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: [[R:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: xor <32 x i1> [[L]], splat (i1 true)
+  // LLVM: and <32 x i1>
+  // LLVM: bitcast <32 x i1> {{.*}} to i32
+
+  // OGCG-LABEL: _kandn_mask32
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: xor <32 x i1>
+  // OGCG: and <32 x i1>
+  // OGCG: bitcast <32 x i1> {{.*}} to i32
+  return _kandn_mask32(A, B);
+}
+
+__mmask64 test_kandn_mask64(__mmask64 A, __mmask64 B) {
+  // CIR-LABEL: _kandn_mask64
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.unary(not, {{.*}}) : !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.binop(and, {{.*}}, {{.*}}) : !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<64 x !cir.int<u, 1>> -> !u64i
+
+  // LLVM-LABEL: _kandn_mask64
+  // LLVM: [[L:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: [[R:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: xor <64 x i1> [[L]], splat (i1 true)
+  // LLVM: and <64 x i1>
+  // LLVM: bitcast <64 x i1> {{.*}} to i64
+
+  // OGCG-LABEL: _kandn_mask64
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: xor <64 x i1>
+  // OGCG: and <64 x i1>
+  // OGCG: bitcast <64 x i1> {{.*}} to i64
+  return _kandn_mask64(A, B);
+}
+
+__mmask32 test_kor_mask32(__mmask32 A, __mmask32 B) {
+  // CIR-LABEL: _kor_mask32
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.binop(or, {{.*}}, {{.*}}) : !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<32 x !cir.int<u, 1>> -> !u32i
+
+  // LLVM-LABEL: _kor_mask32
+  // LLVM: [[L:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: [[R:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: or <32 x i1> [[L]], [[R]]
+  // LLVM: bitcast <32 x i1> {{.*}} to i32
+
+  // OGCG-LABEL: _kor_mask32
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: or <32 x i1>
+  // OGCG: bitcast <32 x i1> {{.*}} to i32
+  return _kor_mask32(A, B);
+}
+
+__mmask64 test_kor_mask64(__mmask64 A, __mmask64 B) {
+  // CIR-LABEL: _kor_mask64
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.binop(or, {{.*}}, {{.*}}) : !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<64 x !cir.int<u, 1>> -> !u64i
+
+  // LLVM-LABEL: _kor_mask64
+  // LLVM: [[L:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: [[R:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: or <64 x i1> [[L]], [[R]]
+  // LLVM: bitcast <64 x i1> {{.*}} to i64
+
+  // OGCG-LABEL: _kor_mask64
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: or <64 x i1>
+  // OGCG: bitcast <64 x i1> {{.*}} to i64
+  return _kor_mask64(A, B);
+}
+
+__mmask32 test_kxor_mask32(__mmask32 A, __mmask32 B) {
+  // CIR-LABEL: _kxor_mask32
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.binop(xor, {{.*}}, {{.*}}) : !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<32 x !cir.int<u, 1>> -> !u32i
+
+  // LLVM-LABEL: _kxor_mask32
+  // LLVM: [[L:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: [[R:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: xor <32 x i1> [[L]], [[R]]
+  // LLVM: bitcast <32 x i1> {{.*}} to i32
+
+  // OGCG-LABEL: _kxor_mask32
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: xor <32 x i1>
+  // OGCG: bitcast <32 x i1> {{.*}} to i32
+  return _kxor_mask32(A, B);
+}
+
+__mmask64 test_kxor_mask64(__mmask64 A, __mmask64 B) {
+  // CIR-LABEL: _kxor_mask64
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.binop(xor, {{.*}}, {{.*}}) : !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<64 x !cir.int<u, 1>> -> !u64i
+
+  // LLVM-LABEL: _kxor_mask64
+  // LLVM: [[L:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: [[R:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: xor <64 x i1> [[L]], [[R]]
+  // LLVM: bitcast <64 x i1> {{.*}} to i64
+
+  // OGCG-LABEL: _kxor_mask64
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: xor <64 x i1>
+  // OGCG: bitcast <64 x i1> {{.*}} to i64
+  return _kxor_mask64(A, B);
+}
+
+__mmask32 test_kxnor_mask32(__mmask32 A, __mmask32 B) {
+  // CIR-LABEL: _kxnor_mask32
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.unary(not, {{.*}}) : !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.binop(xor, {{.*}}, {{.*}}) : !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<32 x !cir.int<u, 1>> -> !u32i
+
+  // LLVM-LABEL: _kxnor_mask32
+  // LLVM: [[L:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: [[R:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: [[NOT:%.*]] = xor <32 x i1> [[L]], splat (i1 true)
+  // LLVM: [[RES:%.*]] = xor <32 x i1> [[NOT]], [[R]]
+  // LLVM: bitcast <32 x i1> [[RES]] to i32
+
+  // OGCG-LABEL: _kxnor_mask32
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: xor <32 x i1>
+  // OGCG: xor <32 x i1>
+  // OGCG: bitcast <32 x i1> {{.*}} to i32
+
+  return _kxnor_mask32(A, B);
+}
+
+__mmask64 test_kxnor_mask64(__mmask64 A, __mmask64 B) {
+  // CIR-LABEL: _kxnor_mask64
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.unary(not, {{.*}}) : !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.binop(xor, {{.*}}, {{.*}}) : !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<64 x !cir.int<u, 1>> -> !u64i
+
+  // LLVM-LABEL: _kxnor_mask64
+  // LLVM: [[L:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: [[R:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: [[NOT:%.*]] = xor <64 x i1> [[L]], splat (i1 true)
+  // LLVM: [[RES:%.*]] = xor <64 x i1> [[NOT]], [[R]]
+  // LLVM: bitcast <64 x i1> [[RES]] to i64
+
+  // OGCG-LABEL: _kxnor_mask64
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: xor <64 x i1>
+  // OGCG: xor <64 x i1>
+  // OGCG: bitcast <64 x i1> {{.*}} to i64
+
+  return _kxnor_mask64(A, B);
+}
+
+__mmask32 test_knot_mask32(__mmask32 A) {
+  // CIR-LABEL: _knot_mask32
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.unary(not, {{.*}}) : !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<32 x !cir.int<u, 1>> -> !u32i
+
+  // LLVM-LABEL: _knot_mask32
+  // LLVM: bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: xor <32 x i1>
+  // LLVM: bitcast <32 x i1> {{.*}} to i32
+
+  // OGCG-LABEL: _knot_mask32
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: xor <32 x i1>
+  // OGCG: bitcast <32 x i1> {{.*}} to i32
+  return _knot_mask32(A);
+}
+
+__mmask64 test_knot_mask64(__mmask64 A) {
+  // CIR-LABEL: _knot_mask64
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.unary(not, {{.*}}) : !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<64 x !cir.int<u, 1>> -> !u64i
+
+  // LLVM-LABEL: _knot_mask64
+  // LLVM: bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: xor <64 x i1>
+  // LLVM: bitcast <64 x i1> {{.*}} to i64
+
+  // OGCG-LABEL: _knot_mask64
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: xor <64 x i1>
+  // OGCG: bitcast <64 x i1> {{.*}} to i64
+  return _knot_mask64(A);
+}
+
+// Multiple user-level mask helpers inline to this same kmov builtin.
+// CIR does not implement any special lowering for those helpers.
+//
+// Therefore, testing the builtin (__builtin_ia32_kmov*) directly is
+// sufficient to cover the CIR lowering behavior. Testing each helper
+// individually would add no new CIR paths.
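+//
+// For illustration only (assuming the usual avx512bwintrin.h wrappers; m and
+// n stand for arbitrary __mmask32/__mmask64 values), such helpers reduce to
+// the same bitcast round-trip that test_kmov_d and test_kmov_q check:
+//
+//   unsigned int       u = _cvtmask32_u32(m);  // wraps __builtin_ia32_kmovd
+//   unsigned long long q = _cvtmask64_u64(n);  // wraps __builtin_ia32_kmovq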
+
+__mmask32 test_kmov_d(__mmask32 A) {
+  // CIR-LABEL: test_kmov_d
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<32 x !cir.int<u, 1>> -> !u32i
+
+  // LLVM-LABEL: test_kmov_d
+  // LLVM: bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: bitcast <32 x i1> {{.*}} to i32
+
+  // OGCG-LABEL: test_kmov_d
+  // OGCG: bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: bitcast <32 x i1> {{.*}} to i32
+
+  return __builtin_ia32_kmovd(A);
+}
+
+__mmask64 test_kmov_q(__mmask64 A) {
+  // CIR-LABEL: test_kmov_q
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<64 x !cir.int<u, 1>> -> !u64i
+
+  // LLVM-LABEL: test_kmov_q
+  // LLVM: bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: bitcast <64 x i1> {{.*}} to i64
+
+  // OGCG-LABEL: test_kmov_q
+  // OGCG: bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: bitcast <64 x i1> {{.*}} to i64
+
+  return __builtin_ia32_kmovq(A);
+}
+
+__mmask32 test_mm512_kunpackw(__mmask32 A, __mmask32 B) {
+  // CIR-LABEL: _mm512_kunpackw
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u32i -> !cir.vector<32 x !cir.int<u, 1>>
+  // CIR: cir.vec.shuffle
+  // CIR: cir.vec.shuffle
+  // CIR: cir.vec.shuffle
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<32 x !cir.int<u, 1>> -> !u32i
+
+  // LLVM-LABEL: _mm512_kunpackw
+  // LLVM: [[A_VEC:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: [[B_VEC:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // LLVM: [[A_HALF:%.*]] = shufflevector <32 x i1> [[A_VEC]], <32 x i1> [[A_VEC]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  // LLVM: [[B_HALF:%.*]] = shufflevector <32 x i1> [[B_VEC]], <32 x i1> [[B_VEC]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  // LLVM: [[RES:%.*]] = shufflevector <16 x i1> [[B_HALF]], <16 x i1> [[A_HALF]], <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  // LLVM: bitcast <32 x i1> [[RES]] to i32
+
+  // OGCG-LABEL: _mm512_kunpackw
+  // OGCG: [[A_VEC:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: [[B_VEC:%.*]] = bitcast i32 %{{.*}} to <32 x i1>
+  // OGCG: [[A_HALF:%.*]] = shufflevector <32 x i1> [[A_VEC]], <32 x i1> [[A_VEC]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  // OGCG: [[B_HALF:%.*]] = shufflevector <32 x i1> [[B_VEC]], <32 x i1> [[B_VEC]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  // OGCG: [[RES:%.*]] = shufflevector <16 x i1> [[B_HALF]], <16 x i1> [[A_HALF]], <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  // OGCG: bitcast <32 x i1> [[RES]] to i32
+  return _mm512_kunpackw(A, B);
+}
+
+__mmask64 test_mm512_kunpackd(__mmask64 A, __mmask64 B) {
+  // CIR-LABEL: _mm512_kunpackd
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u64i -> !cir.vector<64 x !cir.int<u, 1>>
+  // CIR: cir.vec.shuffle
+  // CIR: cir.vec.shuffle
+  // CIR: cir.vec.shuffle
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<64 x !cir.int<u, 1>> -> !u64i
+
+  // LLVM-LABEL: _mm512_kunpackd
+  // LLVM: [[A_VEC:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: [[B_VEC:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // LLVM: [[A_HALF:%.*]] = shufflevector <64 x i1> [[A_VEC]], <64 x i1> [[A_VEC]], <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  // LLVM: [[B_HALF:%.*]] = shufflevector <64 x i1> [[B_VEC]], <64 x i1> [[B_VEC]], <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  // LLVM: [[RES:%.*]] = shufflevector <32 x i1> [[B_HALF]], <32 x i1> [[A_HALF]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
+  // LLVM: bitcast <64 x i1> [[RES]] to i64
+
+  // OGCG-LABEL: _mm512_kunpackd
+  // OGCG: [[A_VEC:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: [[B_VEC:%.*]] = bitcast i64 %{{.*}} to <64 x i1>
+  // OGCG: [[A_HALF:%.*]] = shufflevector <64 x i1> [[A_VEC]], <64 x i1> [[A_VEC]], <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  // OGCG: [[B_HALF:%.*]] = shufflevector <64 x i1> [[B_VEC]], <64 x i1> [[B_VEC]], <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  // OGCG: [[RES:%.*]] = shufflevector <32 x i1> [[B_HALF]], <32 x i1> [[A_HALF]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
+  // OGCG: bitcast <64 x i1> [[RES]] to i64
+  return _mm512_kunpackd(A, B);
+}
+
+__m512i test_mm512_shufflelo_epi16(__m512i __A) {
+  // CIR-LABEL: _mm512_shufflelo_epi16
+  // CIR: %{{.*}} = cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<32 x !s16i>) [#cir.int<1> : !s32i, #cir.int<1> : !s32i, #cir.int<0> : !s32i, #cir.int<0> : !s32i, #cir.int<4> : !s32i, #cir.int<5> : !s32i, #cir.int<6> : !s32i, #cir.int<7> : !s32i, #cir.int<9> : !s32i, #cir.int<9> : !s32i, #cir.int<8> : !s32i, #cir.int<8> : !s32i, #cir.int<12> : !s32i, #cir.int<13> : !s32i, #cir.int<14> : !s32i, #cir.int<15> : !s32i, #cir.int<17> : !s32i, #cir.int<17> : !s32i, #cir.int<16> : !s32i, #cir.int<16> : !s32i, #cir.int<20> : !s32i, #cir.int<21> : !s32i, #cir.int<22> : !s32i, #cir.int<23> : !s32i, #cir.int<25> : !s32i, #cir.int<25> : !s32i, #cir.int<24> : !s32i, #cir.int<24> : !s32i, #cir.int<28> : !s32i, #cir.int<29> : !s32i, #cir.int<30> : !s32i, #cir.int<31> : !s32i] : !cir.vector<32 x !s16i>
+
+  // LLVM-LABEL: test_mm512_shufflelo_epi16
+  // LLVM: shufflevector <32 x i16> %{{.*}}, <32 x i16> poison, <32 x i32> <i32 1, i32 1, i32 0, i32 0, i32 4, i32 5, i32 6, i32 7, i32 9, i32 9, i32 8, i32 8, i32 12, i32 13, i32 14, i32 15, i32 17, i32 17, i32 16, i32 16, i32 20, i32 21, i32 22, i32 23, i32 25, i32 25, i32 24, i32 24, i32 28, i32 29, i32 30, i32 31>
+
+  // OGCG-LABEL: test_mm512_shufflelo_epi16
+  // OGCG: shufflevector <32 x i16> %{{.*}}, <32 x i16> poison, <32 x i32> <i32 1, i32 1, i32 0, i32 0, i32 4, i32 5, i32 6, i32 7, i32 9, i32 9, i32 8, i32 8, i32 12, i32 13, i32 14, i32 15, i32 17, i32 17, i32 16, i32 16, i32 20, i32 21, i32 22, i32 23, i32 25, i32 25, i32 24, i32 24, i32 28, i32 29, i32 30, i32 31>
+  return _mm512_shufflelo_epi16(__A, 5);
+}
+
+__m512i test_mm512_shufflehi_epi16(__m512i __A) {
+  // CIR-LABEL: _mm512_shufflehi_epi16
+  // CIR: %{{.*}} = cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<32 x !s16i>) [#cir.int<0> : !s32i, #cir.int<1> : !s32i, #cir.int<2> : !s32i, #cir.int<3> : !s32i, #cir.int<5> : !s32i, #cir.int<5> : !s32i, #cir.int<4> : !s32i, #cir.int<4> : !s32i, #cir.int<8> : !s32i, #cir.int<9> : !s32i, #cir.int<10> : !s32i, #cir.int<11> : !s32i, #cir.int<13> : !s32i, #cir.int<13> : !s32i, #cir.int<12> : !s32i, #cir.int<12> : !s32i, #cir.int<16> : !s32i, #cir.int<17> : !s32i, #cir.int<18> : !s32i, #cir.int<19> : !s32i, #cir.int<21> : !s32i, #cir.int<21> : !s32i, #cir.int<20> : !s32i, #cir.int<20> : !s32i, #cir.int<24> : !s32i, #cir.int<25> : !s32i, #cir.int<26> : !s32i, #cir.int<27> : !s32i, #cir.int<29> : !s32i, #cir.int<29> : !s32i, #cir.int<28> : !s32i, #cir.int<28> : !s32i] : !cir.vector<32 x !s16i>
+
+  // LLVM-LABEL: test_mm512_shufflehi_epi16
+  // LLVM: shufflevector <32 x i16> %{{.*}}, <32 x i16> poison, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 5, i32 5, i32 4, i32 4, i32 8, i32 9, i32 10, i32 11, i32 13, i32 13, i32 12, i32 12, i32 16, i32 17, i32 18, i32 19, i32 21, i32 21, i32 20, i32 20, i32 24, i32 25, i32 26, i32 27, i32 29, i32 29, i32 28, i32 28>
+
+  // OGCG-LABEL: test_mm512_shufflehi_epi16
+  // OGCG: shufflevector <32 x i16> %{{.*}}, <32 x i16> poison, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 5, i32 5, i32 4, i32 4, i32 8, i32 9, i32 10, i32 11, i32 13, i32 13, i32 12, i32 12, i32 16, i32 17, i32 18, i32 19, i32 21, i32 21, i32 20, i32 20, i32 24, i32 25, i32 26, i32 27, i32 29, i32 29, i32 28, i32 28>
+  return _mm512_shufflehi_epi16(__A, 5);
+}
diff --git a/clang/test/CIR/CodeGenBuiltins/X86/avx512dq-builtins.c b/clang/test/CIR/CodeGenBuiltins/X86/avx512dq-builtins.c
new file mode 100644
index 0000000000000..5d81f666271be
--- /dev/null
+++ b/clang/test/CIR/CodeGenBuiltins/X86/avx512dq-builtins.c
@@ -0,0 +1,210 @@
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512dq -fclangir -emit-cir -o %t.cir -Wall -Werror
+// RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512dq -fclangir -emit-llvm -o %t.ll -Wall -Werror
+// RUN: FileCheck --check-prefix=LLVM --input-file=%t.ll %s
+
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512dq -fno-signed-char -fclangir -emit-cir -o %t.cir -Wall -Werror
+// RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512dq -fno-signed-char -fclangir -emit-llvm -o %t.ll -Wall -Werror
+// RUN: FileCheck --check-prefix=LLVM --input-file=%t.ll %s
+
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512dq -emit-llvm -o - -Wall -Werror | FileCheck %s -check-prefix=OGCG
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512dq -emit-llvm -o - -Wall -Werror | FileCheck %s -check-prefix=OGCG
+
+#include <immintrin.h>
+
+__mmask8 test_kadd_mask8(__mmask8 A, __mmask8 B) {
+  // CIR-LABEL: _kadd_mask8
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.kadd.b"
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<8 x !cir.int<u, 1>> -> !u8i
+
+  // LLVM-LABEL: _kadd_mask8
+  // LLVM: [[L:%.*]] = bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: [[R:%.*]] = bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: [[RES:%.*]] = call <8 x i1> @llvm.x86.avx512.kadd.b(<8 x i1> [[L]], <8 x i1> [[R]])
+  // LLVM: bitcast <8 x i1> [[RES]] to i8
+
+  // OGCG-LABEL: _kadd_mask8
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: call <8 x i1> @llvm.x86.avx512.kadd.b
+  // OGCG: bitcast <8 x i1> {{.*}} to i8
+  return _kadd_mask8(A, B);
+}
+
+__mmask16 test_kadd_mask16(__mmask16 A, __mmask16 B) {
+  // CIR-LABEL: _kadd_mask16
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.kadd.w"
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<16 x !cir.int<u, 1>> -> !u16i
+
+  // LLVM-LABEL: _kadd_mask16
+  // LLVM: [[L:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: [[R:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: [[RES:%.*]] = call <16 x i1> @llvm.x86.avx512.kadd.w(<16 x i1> [[L]], <16 x i1> [[R]])
+  // LLVM: bitcast <16 x i1> [[RES]] to i16
+
+  // OGCG-LABEL: _kadd_mask16
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: call <16 x i1> @llvm.x86.avx512.kadd.w
+  // OGCG: bitcast <16 x i1> {{.*}} to i16
+  return _kadd_mask16(A, B);
+}
+
+__mmask8 test_kand_mask8(__mmask8 A, __mmask8 B) {
+  // CIR-LABEL: _kand_mask8
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.binop(and, {{.*}}, {{.*}}) : !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<8 x !cir.int<u, 1>> -> !u8i
+
+  // LLVM-LABEL: _kand_mask8
+  // LLVM: [[L:%.*]] = bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: [[R:%.*]] = bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: [[RES:%.*]] = and <8 x i1> [[L]], [[R]]
+  // LLVM: bitcast <8 x i1> [[RES]] to i8
+
+  // OGCG-LABEL: _kand_mask8
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: and <8 x i1>
+  // OGCG: bitcast <8 x i1> {{.*}} to i8
+  return _kand_mask8(A, B);
+}
+
+__mmask8 test_kandn_mask8(__mmask8 A, __mmask8 B) {
+  // CIR-LABEL: _kandn_mask8
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.unary(not, {{.*}}) : !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.binop(and, {{.*}}, {{.*}}) : !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<8 x !cir.int<u, 1>> -> !u8i
+
+  // LLVM-LABEL: _kandn_mask8
+  // LLVM: [[L:%.*]] = bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: [[R:%.*]] = bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: xor <8 x i1> [[L]], splat (i1 true)
+  // LLVM: and <8 x i1>
+  // LLVM: bitcast <8 x i1> {{.*}} to i8
+
+  // OGCG-LABEL: _kandn_mask8
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: xor <8 x i1>
+  // OGCG: and <8 x i1>
+  // OGCG: bitcast <8 x i1> {{.*}} to i8
+
+  return _kandn_mask8(A, B);
+}
+
+__mmask8 test_kor_mask8(__mmask8 A, __mmask8 B) {
+  // CIR-LABEL: _kor_mask8
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.binop(or, {{.*}}, {{.*}}) : !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<8 x !cir.int<u, 1>> -> !u8i
+
+  // LLVM-LABEL: _kor_mask8
+  // LLVM: [[L:%.*]] = bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: [[R:%.*]] = bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: or <8 x i1> [[L]], [[R]]
+  // LLVM: bitcast <8 x i1> {{.*}} to i8
+
+  // OGCG-LABEL: _kor_mask8
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: or <8 x i1>
+  // OGCG: bitcast <8 x i1> {{.*}} to i8
+  return _kor_mask8(A, B);
+}
+
+__mmask8 test_kxor_mask8(__mmask8 A, __mmask8 B) {
+  // CIR-LABEL: _kxor_mask8
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.binop(xor, {{.*}}, {{.*}}) : !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<8 x !cir.int<u, 1>> -> !u8i
+
+  // LLVM-LABEL: _kxor_mask8
+  // LLVM: [[L:%.*]] = bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: [[R:%.*]] = bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: xor <8 x i1> [[L]], [[R]]
+  // LLVM: bitcast <8 x i1> {{.*}} to i8
+
+  // OGCG-LABEL: _kxor_mask8
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: xor <8 x i1>
+  // OGCG: bitcast <8 x i1> {{.*}} to i8
+  return _kxor_mask8(A, B);
+}
+
+__mmask8 test_kxnor_mask8(__mmask8 A, __mmask8 B) {
+  // CIR-LABEL: _kxnor_mask8
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.unary(not, {{.*}}) : !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.binop(xor, {{.*}}, {{.*}}) : !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<8 x !cir.int<u, 1>> -> !u8i
+
+  // LLVM-LABEL: _kxnor_mask8
+  // LLVM: [[L:%.*]] = bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: [[R:%.*]] = bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: [[NOT:%.*]] = xor <8 x i1> [[L]], splat (i1 true)
+  // LLVM: [[RES:%.*]] = xor <8 x i1> [[NOT]], [[R]]
+  // LLVM: bitcast <8 x i1> [[RES]] to i8
+
+  // OGCG-LABEL: _kxnor_mask8
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: xor <8 x i1>
+  // OGCG: xor <8 x i1>
+  // OGCG: bitcast <8 x i1> {{.*}} to i8
+  return _kxnor_mask8(A, B);
+}
+
+__mmask8 test_knot_mask8(__mmask8 A) {
+  // CIR-LABEL: _knot_mask8
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.unary(not, {{.*}}) : !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<8 x !cir.int<u, 1>> -> !u8i
+
+  // LLVM-LABEL: _knot_mask8
+  // LLVM: [[L:%.*]] = bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: xor <8 x i1> [[L]], {{.*}}
+  // LLVM: bitcast <8 x i1> {{.*}} to i8
+
+  // OGCG-LABEL: _knot_mask8
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: xor <8 x i1>
+  // OGCG: bitcast <8 x i1> {{.*}} to i8
+  return _knot_mask8(A);
+}
+
+// Multiple user-level mask helpers inline to this same kmov builtin.
+// CIR does not implement any special lowering for those helpers.
+//
+// Therefore, testing the builtin (__builtin_ia32_kmov*) directly is
+// sufficient to cover the CIR lowering behavior. Testing each helper
+// individually would add no new CIR paths.
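+//
+// For illustration only (assuming the usual avx512dqintrin.h wrappers),
+// helpers such as _cvtmask8_u32 and _load_mask8 wrap __builtin_ia32_kmovb
+// and would produce the same bitcast pattern as test_kmov_b below.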
+
+__mmask8 test_kmov_b(__mmask8 A) {
+  // CIR-LABEL: test_kmov_b
+  // CIR: cir.cast bitcast {{.*}} : !u8i -> !cir.vector<8 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<8 x !cir.int<u, 1>> -> !u8i
+
+  // LLVM-LABEL: test_kmov_b
+  // LLVM: bitcast i8 %{{.*}} to <8 x i1>
+  // LLVM: bitcast <8 x i1> {{.*}} to i8
+
+  // OGCG-LABEL: test_kmov_b
+  // OGCG: bitcast i8 %{{.*}} to <8 x i1>
+  // OGCG: bitcast <8 x i1> {{.*}} to i8
+  return __builtin_ia32_kmovb(A);
+}
diff --git a/clang/test/CIR/CodeGenBuiltins/X86/avx512f-builtins.c b/clang/test/CIR/CodeGenBuiltins/X86/avx512f-builtins.c
index dc54a87856a7c..cdcdad42b2845 100644
--- a/clang/test/CIR/CodeGenBuiltins/X86/avx512f-builtins.c
+++ b/clang/test/CIR/CodeGenBuiltins/X86/avx512f-builtins.c
@@ -77,3 +77,621 @@ __m512i test_mm512_undefined_epi32(void) {
   // OGCG: ret <8 x i64> zeroinitializer
   return _mm512_undefined_epi32();
 }
+
+__m512d test_mm512_shuffle_pd(__m512d __M, __m512d __V) {
+  // CIR-LABEL: test_mm512_shuffle_pd
+  // CIR: %{{.*}} = cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<8 x !cir.double>) [#cir.int<0> : !s32i, #cir.int<8> : !s32i, #cir.int<3> : !s32i, #cir.int<10> : !s32i, #cir.int<4> : !s32i, #cir.int<12> : !s32i, #cir.int<6> : !s32i, #cir.int<14> : !s32i] : !cir.vector<8 x !cir.double>
+
+  // LLVM-LABEL: test_mm512_shuffle_pd
+  // LLVM: shufflevector <8 x double> %{{.*}}, <8 x double> %{{.*}}, <8 x i32> <i32 0, i32 8, i32 3, i32 10, i32 4, i32 12, i32 6, i32 14>
+
+  // OGCG-LABEL: test_mm512_shuffle_pd
+  // OGCG: shufflevector <8 x double> %{{.*}}, <8 x double> %{{.*}}, <8 x i32> <i32 0, i32 8, i32 3, i32 10, i32 4, i32 12, i32 6, i32 14>
+  return _mm512_shuffle_pd(__M, __V, 4);
+}
+
+__m512 test_mm512_shuffle_ps(__m512 __M, __m512 __V) {
+  // CIR-LABEL: test_mm512_shuffle_ps
+  // CIR: %{{.*}} = cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<16 x !cir.float>) [#cir.int<0> : !s32i, #cir.int<1> : !s32i, #cir.int<16> : !s32i, #cir.int<16> : !s32i, #cir.int<4> : !s32i, #cir.int<5> : !s32i, #cir.int<20> : !s32i, #cir.int<20> : !s32i, #cir.int<8> : !s32i, #cir.int<9> : !s32i, #cir.int<24> : !s32i, #cir.int<24> : !s32i, #cir.int<12> : !s32i, #cir.int<13> : !s32i, #cir.int<28> : !s32i, #cir.int<28> : !s32i] : !cir.vector<16 x !cir.float>
+
+  // LLVM-LABEL: test_mm512_shuffle_ps
+  // LLVM: shufflevector <16 x float> %{{.*}}, <16 x float> %{{.*}}, <16 x i32> <i32 0, i32 1, i32 16, i32 16, i32 4, i32 5, i32 20, i32 20, i32 8, i32 9, i32 24, i32 24, i32 12, i32 13, i32 28, i32 28>
+
+  // OGCG-LABEL: test_mm512_shuffle_ps
+  // OGCG: shufflevector <16 x float> %{{.*}}, <16 x float> %{{.*}}, <16 x i32> <i32 0, i32 1, i32 16, i32 16, i32 4, i32 5, i32 20, i32 20, i32 8, i32 9, i32 24, i32 24, i32 12, i32 13, i32 28, i32 28>
+  return _mm512_shuffle_ps(__M, __V, 4);
+}
+
+__m512 test_mm512_permute_ps(__m512 A) {
+    // CIR-LABEL: test_mm512_permute_ps
+    // CIR: cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<16 x !cir.float>) [#cir.int<2> : !s32i, #cir.int<3> : !s32i, #cir.int<0> : !s32i, #cir.int<1> : !s32i, #cir.int<6> : !s32i, #cir.int<7> : !s32i, #cir.int<4> : !s32i, #cir.int<5> : !s32i, #cir.int<10> : !s32i, #cir.int<11> : !s32i, #cir.int<8> : !s32i, #cir.int<9> : !s32i, #cir.int<14> : !s32i, #cir.int<15> : !s32i, #cir.int<12> : !s32i, #cir.int<13> : !s32i] : !cir.vector<16 x !cir.float>
+
+    // LLVM-LABEL: test_mm512_permute_ps
+    // LLVM: shufflevector <16 x float> %{{.*}}, <16 x float> poison, <16 x i32> <i32 2, i32 3, i32 0, i32 1, i32 6, i32 7, i32 4, i32 5, i32 10, i32 11, i32 8, i32 9, i32 14, i32 15, i32 12, i32 13>
+
+    // OGCG-LABEL: test_mm512_permute_ps
+    // OGCG: shufflevector <16 x float> %{{.*}}, <16 x float> poison, <16 x i32> <i32 2, i32 3, i32 0, i32 1, i32 6, i32 7, i32 4, i32 5, i32 10, i32 11, i32 8, i32 9, i32 14, i32 15, i32 12, i32 13>
+    return _mm512_permute_ps(A, 0x4E);
+}
+
+__m512d test_mm512_permute_pd(__m512d A) {
+    // CIR-LABEL: test_mm512_permute_pd
+    // CIR: cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<8 x !cir.double>) [#cir.int<1> : !s32i, #cir.int<0> : !s32i, #cir.int<3> : !s32i, #cir.int<2> : !s32i, #cir.int<5> : !s32i, #cir.int<4> : !s32i, #cir.int<7> : !s32i, #cir.int<6> : !s32i] : !cir.vector<8 x !cir.double>
+
+    // LLVM-LABEL: test_mm512_permute_pd
+    // LLVM: shufflevector <8 x double> %{{.*}}, <8 x double> poison, <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6>
+
+    // OGCG-LABEL: test_mm512_permute_pd
+    // OGCG: shufflevector <8 x double> %{{.*}}, <8 x double> poison, <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6>
+    return _mm512_permute_pd(A, 0x55);
+}
+
+__mmask16 test_mm512_kand(__mmask16 A, __mmask16 B) {
+  // CIR-LABEL: _mm512_kand
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.binop(and, {{.*}}, {{.*}}) : !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<16 x !cir.int<u, 1>> -> !u16i
+
+  // LLVM-LABEL: _mm512_kand
+  // LLVM: [[L:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: [[R:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: [[RES:%.*]] = and <16 x i1> [[L]], [[R]]
+  // LLVM: bitcast <16 x i1> [[RES]] to i16
+
+  // OGCG-LABEL: _mm512_kand
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: and <16 x i1>
+  // OGCG: bitcast <16 x i1> {{.*}} to i16
+  return _mm512_kand(A, B);
+}
+
+__mmask16 test_mm512_kandn(__mmask16 A, __mmask16 B) {
+  // CIR-LABEL: _mm512_kandn
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.unary(not, {{.*}}) : !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.binop(and, {{.*}}, {{.*}}) : !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<16 x !cir.int<u, 1>> -> !u16i
+
+  // LLVM-LABEL: _mm512_kandn
+  // LLVM: [[L:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: [[R:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: xor <16 x i1> [[L]], splat (i1 true)
+  // LLVM: and <16 x i1>
+  // LLVM: bitcast <16 x i1> {{.*}} to i16
+
+  // OGCG-LABEL: _mm512_kandn
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: xor <16 x i1>
+  // OGCG: and <16 x i1>
+  // OGCG: bitcast <16 x i1> {{.*}} to i16
+  return _mm512_kandn(A, B);
+}
+
+__mmask16 test_mm512_kor(__mmask16 A, __mmask16 B) {
+  // CIR-LABEL: _mm512_kor
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.binop(or, {{.*}}, {{.*}}) : !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<16 x !cir.int<u, 1>> -> !u16i
+
+  // LLVM-LABEL: _mm512_kor
+  // LLVM: [[L:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: [[R:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: or <16 x i1> [[L]], [[R]]
+  // LLVM: bitcast <16 x i1> {{.*}} to i16
+
+  // OGCG-LABEL: _mm512_kor
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: or <16 x i1>
+  // OGCG: bitcast <16 x i1> {{.*}} to i16
+  return _mm512_kor(A, B);
+}
+
+__mmask16 test_mm512_kxnor(__mmask16 A, __mmask16 B) {
+  // CIR-LABEL: _mm512_kxnor
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.unary(not, {{.*}}) : !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.binop(xor, {{.*}}, {{.*}}) : !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<16 x !cir.int<u, 1>> -> !u16i
+
+  // LLVM-LABEL: _mm512_kxnor
+  // LLVM: [[L:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: [[R:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: [[NOT:%.*]] = xor <16 x i1> [[L]], splat (i1 true)
+  // LLVM: [[RES:%.*]] = xor <16 x i1> [[NOT]], [[R]]
+  // LLVM: bitcast <16 x i1> [[RES]] to i16
+
+  // OGCG-LABEL: _mm512_kxnor
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: xor <16 x i1>
+  // OGCG: xor <16 x i1>
+  // OGCG: bitcast <16 x i1> {{.*}} to i16
+  return _mm512_kxnor(A, B);
+}
+
+__mmask16 test_mm512_kxor(__mmask16 A, __mmask16 B) {
+  // CIR-LABEL: _mm512_kxor
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.binop(xor, {{.*}}, {{.*}}) : !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<16 x !cir.int<u, 1>> -> !u16i
+
+  // LLVM-LABEL: _mm512_kxor
+  // LLVM: [[L:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: [[R:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: xor <16 x i1> [[L]], [[R]]
+  // LLVM: bitcast <16 x i1> {{.*}} to i16
+
+  // OGCG-LABEL: _mm512_kxor
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: xor <16 x i1>
+  // OGCG: bitcast <16 x i1> {{.*}} to i16
+  return _mm512_kxor(A, B);
+}
+
+__mmask16 test_mm512_knot(__mmask16 A) {
+  // CIR-LABEL: _mm512_knot
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.unary(not, {{.*}}) : !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<16 x !cir.int<u, 1>> -> !u16i
+
+  // LLVM-LABEL: _mm512_knot
+  // LLVM: bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: xor <16 x i1>
+  // LLVM: bitcast <16 x i1> {{.*}} to i16
+
+  // OGCG-LABEL: _mm512_knot
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: xor <16 x i1>
+  // OGCG: bitcast <16 x i1> {{.*}} to i16
+  return _mm512_knot(A);
+}
+
+// Multiple user-level mask helpers inline to this same kmov builtin.
+// CIR does not implement any special lowering for those helpers.
+//
+// Therefore, testing the builtin (__builtin_ia32_kmov*) directly is
+// sufficient to cover the CIR lowering behavior. Testing each helper
+// individually would add no new CIR paths.
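+//
+// For illustration only (assuming the usual avx512fintrin.h wrappers),
+// helpers such as _mm512_kmov and _cvtmask16_u32 wrap __builtin_ia32_kmovw
+// and would produce the same bitcast pattern as test_kmov_w below.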
+
+__mmask16 test_kmov_w(__mmask16 A) {
+  // CIR-LABEL: test_kmov_w
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<16 x !cir.int<u, 1>> -> !u16i
+
+  // LLVM-LABEL: test_kmov_w
+  // LLVM: bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: bitcast <16 x i1> {{.*}} to i16
+
+  // OGCG-LABEL: test_kmov_w
+  // OGCG: bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: bitcast <16 x i1> {{.*}} to i16
+  return __builtin_ia32_kmovw(A);
+}
+
+__mmask16 test_mm512_kunpackb(__mmask16 A, __mmask16 B) {
+  // CIR-LABEL: _mm512_kunpackb
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.cast bitcast {{.*}} : !u16i -> !cir.vector<16 x !cir.int<u, 1>>
+  // CIR: cir.vec.shuffle
+  // CIR: cir.vec.shuffle
+  // CIR: cir.vec.shuffle
+  // CIR: cir.cast bitcast {{.*}} : !cir.vector<16 x !cir.int<u, 1>> -> !u16i
+
+  // LLVM-LABEL: _mm512_kunpackb
+  // LLVM: [[A_VEC:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: [[B_VEC:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // LLVM: [[A_HALF:%.*]] = shufflevector <16 x i1> [[A_VEC]], <16 x i1> [[A_VEC]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  // LLVM: [[B_HALF:%.*]] = shufflevector <16 x i1> [[B_VEC]], <16 x i1> [[B_VEC]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  // LLVM: [[RES:%.*]] = shufflevector <8 x i1> [[B_HALF]], <8 x i1> [[A_HALF]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  // LLVM: bitcast <16 x i1> [[RES]] to i16
+
+  // OGCG-LABEL: _mm512_kunpackb
+  // OGCG: [[A_VEC:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: [[B_VEC:%.*]] = bitcast i16 %{{.*}} to <16 x i1>
+  // OGCG: [[A_HALF:%.*]] = shufflevector <16 x i1> [[A_VEC]], <16 x i1> [[A_VEC]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  // OGCG: [[B_HALF:%.*]] = shufflevector <16 x i1> [[B_VEC]], <16 x i1> [[B_VEC]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  // OGCG: [[RES:%.*]] = shufflevector <8 x i1> [[B_HALF]], <8 x i1> [[A_HALF]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  // OGCG: bitcast <16 x i1> [[RES]] to i16
+  return _mm512_kunpackb(A, B);
+}
+
+__m256 test_mm512_i64gather_ps(__m512i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_i64gather_ps
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.qps.512"
+
+  // LLVM-LABEL: test_mm512_i64gather_ps
+  // LLVM: call <8 x float> @llvm.x86.avx512.mask.gather.qps.512
+
+  // OGCG-LABEL: test_mm512_i64gather_ps
+  // OGCG: call <8 x float> @llvm.x86.avx512.mask.gather.qps.512
+  return _mm512_i64gather_ps(__index, __addr, 2);
+}
+
+__m256 test_mm512_mask_i64gather_ps(__m256 __v1_old, __mmask8 __mask, __m512i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_mask_i64gather_ps
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.qps.512"
+
+  // LLVM-LABEL: test_mm512_mask_i64gather_ps
+  // LLVM: call <8 x float> @llvm.x86.avx512.mask.gather.qps.512
+
+  // OGCG-LABEL: test_mm512_mask_i64gather_ps
+  // OGCG: call <8 x float> @llvm.x86.avx512.mask.gather.qps.512
+  return _mm512_mask_i64gather_ps(__v1_old, __mask, __index, __addr, 2);
+}
+
+__m256i test_mm512_i64gather_epi32(__m512i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_i64gather_epi32
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.qpi.512"
+
+  // LLVM-LABEL: test_mm512_i64gather_epi32
+  // LLVM: call <8 x i32> @llvm.x86.avx512.mask.gather.qpi.512
+
+  // OGCG-LABEL: test_mm512_i64gather_epi32
+  // OGCG: call <8 x i32> @llvm.x86.avx512.mask.gather.qpi.512
+  return _mm512_i64gather_epi32(__index, __addr, 2);
+}
+
+__m256i test_mm512_mask_i64gather_epi32(__m256i __v1_old, __mmask8 __mask, __m512i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_mask_i64gather_epi32
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.qpi.512"
+
+  // LLVM-LABEL: test_mm512_mask_i64gather_epi32
+  // LLVM: call <8 x i32> @llvm.x86.avx512.mask.gather.qpi.512
+
+  // OGCG-LABEL: test_mm512_mask_i64gather_epi32
+  // OGCG: call <8 x i32> @llvm.x86.avx512.mask.gather.qpi.512
+  return _mm512_mask_i64gather_epi32(__v1_old, __mask, __index, __addr, 2);
+}
+
+__m512d test_mm512_i64gather_pd(__m512i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_i64gather_pd
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.qpd.512
+
+  // LLVM-LABEL: test_mm512_i64gather_pd
+  // LLVM: call <8 x double> @llvm.x86.avx512.mask.gather.qpd.512
+
+  // OGCG-LABEL: test_mm512_i64gather_pd
+  // OGCG: call <8 x double> @llvm.x86.avx512.mask.gather.qpd.512
+  return _mm512_i64gather_pd(__index, __addr, 2);
+}
+
+__m512d test_mm512_mask_i64gather_pd(__m512d __v1_old, __mmask8 __mask, __m512i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_mask_i64gather_pd
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.qpd.512
+
+  // LLVM-LABEL: test_mm512_mask_i64gather_pd
+  // LLVM: call <8 x double> @llvm.x86.avx512.mask.gather.qpd.512
+
+  // OGCG-LABEL: test_mm512_mask_i64gather_pd
+  // OGCG: call <8 x double> @llvm.x86.avx512.mask.gather.qpd.512
+  return _mm512_mask_i64gather_pd(__v1_old, __mask, __index, __addr, 2);
+}
+
+__m512i test_mm512_i64gather_epi64(__m512i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_i64gather_epi64
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.qpq.512
+
+  // LLVM-LABEL: test_mm512_i64gather_epi64
+  // LLVM: call <8 x i64> @llvm.x86.avx512.mask.gather.qpq.512
+
+  // OGCG-LABEL: test_mm512_i64gather_epi64
+  // OGCG: call <8 x i64> @llvm.x86.avx512.mask.gather.qpq.512
+  return _mm512_i64gather_epi64(__index, __addr, 2);
+}
+
+__m512i test_mm512_mask_i64gather_epi64(__m512i __v1_old, __mmask8 __mask, __m512i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_mask_i64gather_epi64
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.qpq.512
+
+  // LLVM-LABEL: test_mm512_mask_i64gather_epi64
+  // LLVM: call <8 x i64> @llvm.x86.avx512.mask.gather.qpq.512
+
+  // OGCG-LABEL: test_mm512_mask_i64gather_epi64
+  // OGCG: call <8 x i64> @llvm.x86.avx512.mask.gather.qpq.512
+  return _mm512_mask_i64gather_epi64(__v1_old, __mask, __index, __addr, 2);
+}
+
+__m512 test_mm512_i32gather_ps(__m512i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_i32gather_ps
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.dps.512
+
+  // LLVM-LABEL: test_mm512_i32gather_ps
+  // LLVM: call <16 x float> @llvm.x86.avx512.mask.gather.dps.512
+
+  // OGCG-LABEL: test_mm512_i32gather_ps
+  // OGCG: call <16 x float> @llvm.x86.avx512.mask.gather.dps.512
+  return _mm512_i32gather_ps(__index, __addr, 2);
+}
+
+__m512 test_mm512_mask_i32gather_ps(__m512 v1_old, __mmask16 __mask, __m512i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_mask_i32gather_ps
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.dps.512
+
+  // LLVM-LABEL: test_mm512_mask_i32gather_ps
+  // LLVM: call <16 x float> @llvm.x86.avx512.mask.gather.dps.512
+
+  // OGCG-LABEL: test_mm512_mask_i32gather_ps
+  // OGCG: call <16 x float> @llvm.x86.avx512.mask.gather.dps.512
+  return _mm512_mask_i32gather_ps(v1_old, __mask, __index, __addr, 2);
+}
+
+__m512i test_mm512_i32gather_epi32(__m512i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_i32gather_epi32
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.dpi.512
+
+  // LLVM-LABEL: test_mm512_i32gather_epi32
+  // LLVM: call <16 x i32> @llvm.x86.avx512.mask.gather.dpi.512
+
+  // OGCG-LABEL: test_mm512_i32gather_epi32
+  // OGCG: call <16 x i32> @llvm.x86.avx512.mask.gather.dpi.512
+  return _mm512_i32gather_epi32(__index, __addr, 2);
+}
+
+__m512i test_mm512_mask_i32gather_epi32(__m512i __v1_old, __mmask16 __mask, __m512i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_mask_i32gather_epi32
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.dpi.512
+
+  // LLVM-LABEL: test_mm512_mask_i32gather_epi32
+  // LLVM: call <16 x i32> @llvm.x86.avx512.mask.gather.dpi.512
+
+  // OGCG-LABEL: test_mm512_mask_i32gather_epi32
+  // OGCG: call <16 x i32> @llvm.x86.avx512.mask.gather.dpi.512
+  return _mm512_mask_i32gather_epi32(__v1_old, __mask, __index, __addr, 2);
+}
+
+__m512d test_mm512_i32gather_pd(__m256i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_i32gather_pd
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.dpd.512
+
+  // LLVM-LABEL: test_mm512_i32gather_pd
+  // LLVM: call <8 x double> @llvm.x86.avx512.mask.gather.dpd.512
+
+  // OGCG-LABEL: test_mm512_i32gather_pd
+  // OGCG: call <8 x double> @llvm.x86.avx512.mask.gather.dpd.512
+  return _mm512_i32gather_pd(__index, __addr, 2);
+}
+
+__m512d test_mm512_mask_i32gather_pd(__m512d __v1_old, __mmask8 __mask, __m256i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_mask_i32gather_pd
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.dpd.512
+
+  // LLVM-LABEL: test_mm512_mask_i32gather_pd
+  // LLVM: call <8 x double> @llvm.x86.avx512.mask.gather.dpd.512
+
+  // OGCG-LABEL: test_mm512_mask_i32gather_pd
+  // OGCG: call <8 x double> @llvm.x86.avx512.mask.gather.dpd.512
+  return _mm512_mask_i32gather_pd(__v1_old, __mask, __index, __addr, 2);
+}
+
+__m512i test_mm512_i32gather_epi64(__m256i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_i32gather_epi64
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.dpq.512
+
+  // LLVM-LABEL: test_mm512_i32gather_epi64
+  // LLVM: call <8 x i64> @llvm.x86.avx512.mask.gather.dpq.512
+
+  // OGCG-LABEL: test_mm512_i32gather_epi64
+  // OGCG: call <8 x i64> @llvm.x86.avx512.mask.gather.dpq.512
+  return _mm512_i32gather_epi64(__index, __addr, 2);
+}
+
+__m512i test_mm512_mask_i32gather_epi64(__m512i __v1_old, __mmask8 __mask, __m256i __index, void const *__addr) {
+  // CIR-LABEL: test_mm512_mask_i32gather_epi64
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather.dpq.512
+
+  // LLVM-LABEL: test_mm512_mask_i32gather_epi64
+  // LLVM: call <8 x i64> @llvm.x86.avx512.mask.gather.dpq.512
+
+  // OGCG-LABEL: test_mm512_mask_i32gather_epi64
+  // OGCG: call <8 x i64> @llvm.x86.avx512.mask.gather.dpq.512
+  return _mm512_mask_i32gather_epi64(__v1_old, __mask, __index, __addr, 2);
+}
+
+__m512i test_mm512_ror_epi32(__m512i __A) {
+  // CIR-LABEL: test_mm512_ror_epi32
+  // CIR: cir.cast integral %{{.*}} : !s32i -> !u32i
+  // CIR: cir.vec.splat %{{.*}} : !u32i, !cir.vector<16 x !u32i>
+  // CIR: cir.call_llvm_intrinsic "fshr" %{{.*}}: (!cir.vector<16 x !s32i>, !cir.vector<16 x !s32i>, !cir.vector<16 x !u32i>) -> !cir.vector<16 x !s32i> 
+
+  // LLVM-LABEL: test_mm512_ror_epi32
+  // LLVM: %[[CASTED_VAR:.*]] = bitcast <8 x i64> %{{.*}} to <16 x i32>
+  // LLVM: call <16 x i32> @llvm.fshr.v16i32(<16 x i32> %[[CASTED_VAR]], <16 x i32> %[[CASTED_VAR]], <16 x i32> splat (i32 5))
+
+  // OGCG-LABEL: test_mm512_ror_epi32
+  // OGCG: %[[CASTED_VAR:.*]] = bitcast <8 x i64> %{{.*}} to <16 x i32>
+  // OGCG: call <16 x i32> @llvm.fshr.v16i32(<16 x i32> %[[CASTED_VAR]], <16 x i32> %[[CASTED_VAR]], <16 x i32> splat (i32 5))
+  return _mm512_ror_epi32(__A, 5); 
+}
+
+__m512i test_mm512_ror_epi64(__m512i __A) {
+  // CIR-LABEL: test_mm512_ror_epi64
+  // CIR: cir.cast integral %{{.*}} : !s32i -> !u32i
+  // CIR: cir.cast integral %{{.*}} : !u32i -> !u64i
+  // CIR: cir.vec.splat %{{.*}} : !u64i, !cir.vector<8 x !u64i>
+  // CIR: cir.call_llvm_intrinsic "fshr" %{{.*}}: (!cir.vector<8 x !s64i>, !cir.vector<8 x !s64i>, !cir.vector<8 x !u64i>) -> !cir.vector<8 x !s64i> 
+
+  // LLVM-LABEL: test_mm512_ror_epi64
+  // LLVM: %[[VAR:.*]] = load <8 x i64>, ptr %{{.*}}, align 64
+  // LLVM: call <8 x i64> @llvm.fshr.v8i64(<8 x i64> %[[VAR]], <8 x i64> %[[VAR]], <8 x i64> splat (i64 5))
+
+  // OGCG-LABEL: test_mm512_ror_epi64
+  // OGCG: %[[VAR:.*]] = load <8 x i64>, ptr %{{.*}}, align 64
+  // OGCG: call <8 x i64> @llvm.fshr.v8i64(<8 x i64> %[[VAR]], <8 x i64> %[[VAR]], <8 x i64> splat (i64 5))
+  return _mm512_ror_epi64(__A, 5); 
+}
+
+void test_mm512_i32scatter_pd(void *__addr, __m256i __index, __m512d __v1) {
+  // CIR-LABEL: test_mm512_i32scatter_pd
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.dpd.512"
+
+  // LLVM-LABEL: test_mm512_i32scatter_pd
+  // LLVM: @llvm.x86.avx512.mask.scatter.dpd.512
+
+  // OGCG-LABEL: test_mm512_i32scatter_pd
+  // OGCG: @llvm.x86.avx512.mask.scatter.dpd.512
+  return _mm512_i32scatter_pd(__addr, __index, __v1, 2);
+}
+
+void test_mm512_mask_i32scatter_pd(void *__addr, __mmask8 __mask, __m256i __index, __m512d __v1) {
+  // CIR-LABEL: test_mm512_mask_i32scatter_pd
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.dpd.512"
+
+  // LLVM-LABEL: test_mm512_mask_i32scatter_pd
+  // LLVM: @llvm.x86.avx512.mask.scatter.dpd.512
+
+  // OGCG-LABEL: test_mm512_mask_i32scatter_pd
+  // OGCG: @llvm.x86.avx512.mask.scatter.dpd.512
+  return _mm512_mask_i32scatter_pd(__addr, __mask, __index, __v1, 2);
+}
+
+void test_mm512_i32scatter_ps(void *__addr, __m512i __index, __m512 __v1) {
+  // CIR-LABEL: test_mm512_i32scatter_ps
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.dps.512"
+
+  // LLVM-LABEL: test_mm512_i32scatter_ps
+  // LLVM: @llvm.x86.avx512.mask.scatter.dps.512
+
+  // OGCG-LABEL: test_mm512_i32scatter_ps
+  // OGCG: @llvm.x86.avx512.mask.scatter.dps.512
+  return _mm512_i32scatter_ps(__addr, __index, __v1, 2);
+}
+
+void test_mm512_mask_i32scatter_ps(void *__addr, __mmask16 __mask, __m512i __index, __m512 __v1) {
+  // CIR-LABEL: test_mm512_mask_i32scatter_ps
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.dps.512"
+
+  // LLVM-LABEL: test_mm512_mask_i32scatter_ps
+  // LLVM: @llvm.x86.avx512.mask.scatter.dps.512
+
+  // OGCG-LABEL: test_mm512_mask_i32scatter_ps
+  // OGCG: @llvm.x86.avx512.mask.scatter.dps.512
+  return _mm512_mask_i32scatter_ps(__addr, __mask, __index, __v1, 2);
+}
+
+void test_mm512_i64scatter_pd(void *__addr, __m512i __index, __m512d __v1) {
+  // CIR-LABEL: test_mm512_i64scatter_pd
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.qpd.512"
+
+  // LLVM-LABEL: test_mm512_i64scatter_pd
+  // LLVM: @llvm.x86.avx512.mask.scatter.qpd.512
+
+  // OGCG-LABEL: test_mm512_i64scatter_pd
+  // OGCG: @llvm.x86.avx512.mask.scatter.qpd.512
+  return _mm512_i64scatter_pd(__addr, __index, __v1, 2);
+}
+
+void test_mm512_mask_i64scatter_pd(void *__addr, __mmask8 __mask, __m512i __index, __m512d __v1) {
+  // CIR-LABEL: test_mm512_mask_i64scatter_pd
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.qpd.512"
+
+  // LLVM-LABEL: test_mm512_mask_i64scatter_pd
+  // LLVM: @llvm.x86.avx512.mask.scatter.qpd.512
+
+  // OGCG-LABEL: test_mm512_mask_i64scatter_pd
+  // OGCG: @llvm.x86.avx512.mask.scatter.qpd.512
+  return _mm512_mask_i64scatter_pd(__addr, __mask, __index, __v1, 2);
+}
+
+void test_mm512_i64scatter_ps(void *__addr, __m512i __index, __m256 __v1) {
+  // CIR-LABEL: test_mm512_i64scatter_ps
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.qps.512"
+
+  // LLVM-LABEL: test_mm512_i64scatter_ps
+  // LLVM: @llvm.x86.avx512.mask.scatter.qps.512
+
+  // OGCG-LABEL: test_mm512_i64scatter_ps
+  // OGCG: @llvm.x86.avx512.mask.scatter.qps.512
+  return _mm512_i64scatter_ps(__addr, __index, __v1, 2);
+}
+
+void test_mm512_mask_i64scatter_ps(void *__addr, __mmask8 __mask, __m512i __index, __m256 __v1) {
+  // CIR-LABEL: test_mm512_mask_i64scatter_ps
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.qps.512"
+
+  // LLVM-LABEL: test_mm512_mask_i64scatter_ps
+  // LLVM: @llvm.x86.avx512.mask.scatter.qps.512
+
+  // OGCG-LABEL: test_mm512_mask_i64scatter_ps
+  // OGCG: @llvm.x86.avx512.mask.scatter.qps.512
+  return _mm512_mask_i64scatter_ps(__addr, __mask, __index, __v1, 2);
+}
+
+void test_mm512_i32scatter_epi32(void *__addr, __m512i __index, __m512i __v1) {
+  // CIR-LABEL: test_mm512_i32scatter_epi32
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.dpi.512"
+
+  // LLVM-LABEL: test_mm512_i32scatter_epi32
+  // LLVM: @llvm.x86.avx512.mask.scatter.dpi.512
+
+  // OGCG-LABEL: test_mm512_i32scatter_epi32
+  // OGCG: @llvm.x86.avx512.mask.scatter.dpi.512
+  return _mm512_i32scatter_epi32(__addr, __index, __v1, 2);
+}
+
+void test_mm512_mask_i32scatter_epi32(void *__addr, __mmask16 __mask, __m512i __index, __m512i __v1) {
+  // CIR-LABEL: test_mm512_mask_i32scatter_epi32
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.dpi.512"
+
+  // LLVM-LABEL: test_mm512_mask_i32scatter_epi32
+  // LLVM: @llvm.x86.avx512.mask.scatter.dpi.512
+
+  // OGCG-LABEL: test_mm512_mask_i32scatter_epi32
+  // OGCG: @llvm.x86.avx512.mask.scatter.dpi.512
+  return _mm512_mask_i32scatter_epi32(__addr, __mask, __index, __v1, 2);
+}
+
+void test_mm512_i64scatter_epi64(void *__addr, __m512i __index, __m512i __v1) {
+  // CIR-LABEL: test_mm512_i64scatter_epi64
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.qpq.512"
+
+  // LLVM-LABEL: test_mm512_i64scatter_epi64
+  // LLVM: @llvm.x86.avx512.mask.scatter.qpq.512
+
+  // OGCG-LABEL: test_mm512_i64scatter_epi64
+  // OGCG: @llvm.x86.avx512.mask.scatter.qpq.512
+  return _mm512_i64scatter_epi64(__addr, __index, __v1, 2);
+}
+
+void test_mm512_mask_i64scatter_epi64(void *__addr, __mmask8 __mask, __m512i __index, __m512i __v1) {
+  // CIR-LABEL: test_mm512_mask_i64scatter_epi64
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.qpq.512"
+
+  // LLVM-LABEL: test_mm512_mask_i64scatter_epi64
+  // LLVM: @llvm.x86.avx512.mask.scatter.qpq.512
+
+  // OGCG-LABEL: test_mm512_mask_i64scatter_epi64
+  // OGCG: @llvm.x86.avx512.mask.scatter.qpq.512
+  return _mm512_mask_i64scatter_epi64(__addr, __mask, __index, __v1, 2);
+}
+
+void test_mm512_i64scatter_epi32(void *__addr, __m512i __index, __m256i __v1) {
+  // CIR-LABEL: test_mm512_i64scatter_epi32
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.qpi.512"
+
+  // LLVM-LABEL: test_mm512_i64scatter_epi32
+  // LLVM: @llvm.x86.avx512.mask.scatter.qpi.512
+
+  // OGCG-LABEL: test_mm512_i64scatter_epi32
+  // OGCG: @llvm.x86.avx512.mask.scatter.qpi.512
+  return _mm512_i64scatter_epi32(__addr, __index, __v1, 2);
+}
+
+void test_mm512_mask_i64scatter_epi32(void *__addr, __mmask8 __mask, __m512i __index, __m256i __v1) {
+  // CIR-LABEL: test_mm512_mask_i64scatter_epi32
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.scatter.qpi.512"
+
+  // LLVM-LABEL: test_mm512_mask_i64scatter_epi32
+  // LLVM: @llvm.x86.avx512.mask.scatter.qpi.512
+
+  // OGCG-LABEL: test_mm512_mask_i64scatter_epi32
+  // OGCG: @llvm.x86.avx512.mask.scatter.qpi.512
+  return _mm512_mask_i64scatter_epi32(__addr, __mask, __index, __v1, 2);
+}
diff --git a/clang/test/CIR/CodeGenBuiltins/X86/avx512vl-builtins.c b/clang/test/CIR/CodeGenBuiltins/X86/avx512vl-builtins.c
new file mode 100644
index 0000000000000..accf1f60d7c32
--- /dev/null
+++ b/clang/test/CIR/CodeGenBuiltins/X86/avx512vl-builtins.c
@@ -0,0 +1,201 @@
+// RUN: %clang_cc1 -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512f -target-feature +avx512vl -fclangir -emit-cir -o %t.cir -Wall -Werror -Wsign-conversion 
+// RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
+// RUN: %clang_cc1 -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512f -target-feature +avx512vl -fclangir -emit-llvm -o %t.ll -Wall -Werror -Wsign-conversion
+// RUN: FileCheck --check-prefixes=LLVM --input-file=%t.ll %s
+// RUN: %clang_cc1 -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +avx512f -target-feature +avx512vl -emit-llvm -o %t.ll -Wall -Werror -Wsign-conversion
+// RUN: FileCheck --check-prefixes=OGCG --input-file=%t.ll %s
+
+
+#include <immintrin.h>
+
+__m128d test_mm_mmask_i64gather_pd(__m128d __v1_old, __mmask8 __mask, __m128i __index, void const *__addr) {
+  // CIR-LABEL: test_mm_mmask_i64gather_pd
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3div2.df"
+
+  // LLVM-LABEL: @test_mm_mmask_i64gather_pd
+  // LLVM: @llvm.x86.avx512.mask.gather3div2.df
+
+  // OGCG-LABEL: @test_mm_mmask_i64gather_pd
+  // OGCG: @llvm.x86.avx512.mask.gather3div2.df
+  return _mm_mmask_i64gather_pd(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m128i test_mm_mmask_i64gather_epi64(__m128i __v1_old, __mmask8 __mask, __m128i __index, void const *__addr) {
+  // CIR-LABEL: test_mm_mmask_i64gather_epi64
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3div2.di"
+
+  // LLVM-LABEL: @test_mm_mmask_i64gather_epi64
+  // LLVM: @llvm.x86.avx512.mask.gather3div2.di
+
+  // OGCG-LABEL: @test_mm_mmask_i64gather_epi64
+  // OGCG: @llvm.x86.avx512.mask.gather3div2.di
+  return _mm_mmask_i64gather_epi64(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m256d test_mm256_mmask_i64gather_pd(__m256d __v1_old, __mmask8 __mask, __m256i __index, void const *__addr) {
+  // CIR-LABEL: test_mm256_mmask_i64gather_pd
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3div4.df"
+
+  // LLVM-LABEL: @test_mm256_mmask_i64gather_pd
+  // LLVM: @llvm.x86.avx512.mask.gather3div4.df
+
+  // OGCG-LABEL: @test_mm256_mmask_i64gather_pd
+  // OGCG: @llvm.x86.avx512.mask.gather3div4.df
+  return _mm256_mmask_i64gather_pd(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m256i test_mm256_mmask_i64gather_epi64(__m256i __v1_old, __mmask8 __mask, __m256i __index, void const *__addr) {
+  // CIR-LABEL: test_mm256_mmask_i64gather_epi64
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3div4.di"
+
+  // LLVM-LABEL: @test_mm256_mmask_i64gather_epi64
+  // LLVM: @llvm.x86.avx512.mask.gather3div4.di
+
+  // OGCG-LABEL: @test_mm256_mmask_i64gather_epi64
+  // OGCG: @llvm.x86.avx512.mask.gather3div4.di
+  return _mm256_mmask_i64gather_epi64(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m128 test_mm_mmask_i64gather_ps(__m128 __v1_old, __mmask8 __mask, __m128i __index, void const *__addr) {
+  // CIR-LABEL: test_mm_mmask_i64gather_ps
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3div4.sf"
+
+  // LLVM-LABEL: @test_mm_mmask_i64gather_ps
+  // LLVM: @llvm.x86.avx512.mask.gather3div4.sf
+
+  // OGCG-LABEL: @test_mm_mmask_i64gather_ps
+  // OGCG: @llvm.x86.avx512.mask.gather3div4.sf
+  return _mm_mmask_i64gather_ps(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m128i test_mm_mmask_i64gather_epi32(__m128i __v1_old, __mmask8 __mask, __m128i __index, void const *__addr) {
+  // CIR-LABEL: test_mm_mmask_i64gather_epi32
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3div4.si"
+
+  // LLVM-LABEL: @test_mm_mmask_i64gather_epi32
+  // LLVM: @llvm.x86.avx512.mask.gather3div4.si
+
+  // OGCG-LABEL: @test_mm_mmask_i64gather_epi32
+  // OGCG: @llvm.x86.avx512.mask.gather3div4.si
+  return _mm_mmask_i64gather_epi32(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m128 test_mm256_mmask_i64gather_ps(__m128 __v1_old, __mmask8 __mask, __m256i __index, void const *__addr) {
+  // CIR-LABEL: test_mm256_mmask_i64gather_ps
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3div8.sf"
+
+  // LLVM-LABEL: @test_mm256_mmask_i64gather_ps
+  // LLVM: @llvm.x86.avx512.mask.gather3div8.sf
+
+  // OGCG-LABEL: @test_mm256_mmask_i64gather_ps
+  // OGCG: @llvm.x86.avx512.mask.gather3div8.sf
+  return _mm256_mmask_i64gather_ps(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m128i test_mm256_mmask_i64gather_epi32(__m128i __v1_old, __mmask8 __mask, __m256i __index, void const *__addr) {
+  // CIR-LABEL: test_mm256_mmask_i64gather_epi32
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3div8.si"
+
+  // LLVM-LABEL: @test_mm256_mmask_i64gather_epi32
+  // LLVM: @llvm.x86.avx512.mask.gather3div8.si
+
+  // OGCG-LABEL: @test_mm256_mmask_i64gather_epi32
+  // OGCG: @llvm.x86.avx512.mask.gather3div8.si
+  return _mm256_mmask_i64gather_epi32(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m128d test_mm_mask_i32gather_pd(__m128d __v1_old, __mmask8 __mask, __m128i __index, void const *__addr) {
+  // CIR-LABEL: test_mm_mask_i32gather_pd
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3siv2.df"
+
+  // LLVM-LABEL: @test_mm_mask_i32gather_pd
+  // LLVM: @llvm.x86.avx512.mask.gather3siv2.df
+
+  // OGCG-LABEL: @test_mm_mask_i32gather_pd
+  // OGCG: @llvm.x86.avx512.mask.gather3siv2.df
+  return _mm_mmask_i32gather_pd(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m128i test_mm_mask_i32gather_epi64(__m128i __v1_old, __mmask8 __mask, __m128i __index, void const *__addr) {
+  // CIR-LABEL: test_mm_mask_i32gather_epi64
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3siv2.di"
+
+  // LLVM-LABEL: @test_mm_mask_i32gather_epi64
+  // LLVM: @llvm.x86.avx512.mask.gather3siv2.di
+
+  // OGCG-LABEL: @test_mm_mask_i32gather_epi64
+  // OGCG: @llvm.x86.avx512.mask.gather3siv2.di
+  return _mm_mmask_i32gather_epi64(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m256d test_mm256_mask_i32gather_pd(__m256d __v1_old, __mmask8 __mask, __m128i __index, void const *__addr) {
+  // CIR-LABEL: test_mm256_mask_i32gather_pd
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3siv4.df"
+
+  // LLVM-LABEL: @test_mm256_mask_i32gather_pd
+  // LLVM: @llvm.x86.avx512.mask.gather3siv4.df
+
+  // OGCG-LABEL: @test_mm256_mask_i32gather_pd
+  // OGCG: @llvm.x86.avx512.mask.gather3siv4.df
+  return _mm256_mmask_i32gather_pd(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m256i test_mm256_mask_i32gather_epi64(__m256i __v1_old, __mmask8 __mask, __m128i __index, void const *__addr) {
+  // CIR-LABEL: test_mm256_mask_i32gather_epi64
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3siv4.di"
+
+  // LLVM-LABEL: @test_mm256_mask_i32gather_epi64
+  // LLVM: @llvm.x86.avx512.mask.gather3siv4.di
+
+  // OGCG-LABEL: @test_mm256_mask_i32gather_epi64
+  // OGCG: @llvm.x86.avx512.mask.gather3siv4.di
+  return _mm256_mmask_i32gather_epi64(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m128 test_mm_mask_i32gather_ps(__m128 __v1_old, __mmask8 __mask, __m128i __index, void const *__addr) {
+  // CIR-LABEL: test_mm_mask_i32gather_ps
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3siv4.sf"
+
+  // LLVM-LABEL: @test_mm_mask_i32gather_ps
+  // LLVM: @llvm.x86.avx512.mask.gather3siv4.sf
+
+  // OGCG-LABEL: @test_mm_mask_i32gather_ps
+  // OGCG: @llvm.x86.avx512.mask.gather3siv4.sf
+  return _mm_mmask_i32gather_ps(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m128i test_mm_mask_i32gather_epi32(__m128i __v1_old, __mmask8 __mask, __m128i __index, void const *__addr) {
+  // CIR-LABEL: test_mm_mask_i32gather_epi32
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3siv4.si"
+
+  // LLVM-LABEL: @test_mm_mask_i32gather_epi32
+  // LLVM: @llvm.x86.avx512.mask.gather3siv4.si
+
+  // OGCG-LABEL: @test_mm_mask_i32gather_epi32
+  // OGCG: @llvm.x86.avx512.mask.gather3siv4.si
+  return _mm_mmask_i32gather_epi32(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m256 test_mm256_mask_i32gather_ps(__m256 __v1_old, __mmask8 __mask, __m256i __index, void const *__addr) {
+  // CIR-LABEL: test_mm256_mask_i32gather_ps
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3siv8.sf"
+
+  // LLVM-LABEL: @test_mm256_mask_i32gather_ps
+  // LLVM: @llvm.x86.avx512.mask.gather3siv8.sf
+
+  // OGCG-LABEL: @test_mm256_mask_i32gather_ps
+  // OGCG: @llvm.x86.avx512.mask.gather3siv8.sf
+  return _mm256_mmask_i32gather_ps(__v1_old, __mask, __index, __addr, 2); 
+}
+
+__m256i test_mm256_mask_i32gather_epi32(__m256i __v1_old, __mmask8 __mask, __m256i __index, void const *__addr) {
+  // CIR-LABEL: test_mm256_mask_i32gather_epi32
+  // CIR: cir.call_llvm_intrinsic "x86.avx512.mask.gather3siv8.si"
+
+  // LLVM-LABEL: @test_mm256_mask_i32gather_epi32
+  // LLVM: @llvm.x86.avx512.mask.gather3siv8.si
+
+  // OGCG-LABEL: @test_mm256_mask_i32gather_epi32
+  // OGCG: @llvm.x86.avx512.mask.gather3siv8.si
+  return _mm256_mmask_i32gather_epi32(__v1_old, __mask, __index, __addr, 2); 
+}
diff --git a/clang/test/CIR/CodeGenBuiltins/X86/sse-builtins.c b/clang/test/CIR/CodeGenBuiltins/X86/sse-builtins.c
index c893859b297cc..db52021d1aa9f 100644
--- a/clang/test/CIR/CodeGenBuiltins/X86/sse-builtins.c
+++ b/clang/test/CIR/CodeGenBuiltins/X86/sse-builtins.c
@@ -71,3 +71,15 @@ __m128 test_mm_undefined_ps(void) {
   // OGCG: ret <4 x float> zeroinitializer
   return _mm_undefined_ps();
 }
+
+__m128 test_mm_shuffle_ps(__m128 A, __m128 B) {
+  // CIR-LABEL: test_mm_shuffle_ps
+  // CIR: %{{.*}} = cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<4 x !cir.float>) [#cir.int<0> : !s32i, #cir.int<0> : !s32i, #cir.int<4> : !s32i, #cir.int<4> : !s32i] : !cir.vector<4 x !cir.float>
+
+  // LLVM-LABEL: test_mm_shuffle_ps
+  // LLVM: shufflevector <4 x float> {{.*}}, <4 x float> {{.*}}, <4 x i32> <i32 0, i32 0, i32 4, i32 4>
+
+  // OGCG-LABEL: test_mm_shuffle_ps
+  // OGCG: shufflevector <4 x float> {{.*}}, <4 x float> {{.*}}, <4 x i32> <i32 0, i32 0, i32 4, i32 4>
+  return _mm_shuffle_ps(A, B, 0);
+}
diff --git a/clang/test/CIR/CodeGenBuiltins/X86/sse2-builtins.c b/clang/test/CIR/CodeGenBuiltins/X86/sse2-builtins.c
index f5e07cdc28ccd..4bb17e9d20bc6 100644
--- a/clang/test/CIR/CodeGenBuiltins/X86/sse2-builtins.c
+++ b/clang/test/CIR/CodeGenBuiltins/X86/sse2-builtins.c
@@ -8,8 +8,11 @@
 // RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +sse2 -fno-signed-char -fclangir -emit-llvm -o %t.ll -Wall -Werror
 // RUN: FileCheck --check-prefixes=LLVM --input-file=%t.ll %s
 
-// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +sse -emit-llvm -o - -Wall -Werror | FileCheck %s -check-prefix=OGCG
-// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +sse -emit-llvm -o - -Wall -Werror | FileCheck %s -check-prefix=OGCG
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +sse2 -emit-llvm -o - -Wall -Werror | FileCheck %s --check-prefixes=OGCG
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +sse2 -fno-signed-char -emit-llvm -o - -Wall -Werror | FileCheck %s --check-prefixes=OGCG
+
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +sse2 -emit-llvm -o - -Wall -Werror | FileCheck %s --check-prefixes=OGCG
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +sse2 -fno-signed-char -emit-llvm -o - -Wall -Werror | FileCheck %s --check-prefixes=OGCG
 
 // This test mimics clang/test/CodeGen/X86/sse2-builtins.c, which eventually
 // CIR shall be able to support fully.
@@ -108,3 +111,51 @@ void test_mm_pause(void) {
   // LLVM: call void @llvm.x86.sse2.pause()
   // OGCG: call void @llvm.x86.sse2.pause()
 }
+
+__m128i test_mm_shufflelo_epi16(__m128i A) {
+  // CIR-LABEL: test_mm_shufflelo_epi16
+  // CIR: %{{.*}} = cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<8 x !s16i>) [#cir.int<0> : !s32i, #cir.int<0> : !s32i, #cir.int<0> : !s32i, #cir.int<0> : !s32i, #cir.int<4> : !s32i, #cir.int<5> : !s32i, #cir.int<6> : !s32i, #cir.int<7> : !s32i] : !cir.vector<8 x !s16i>
+
+  // LLVM-LABEL: test_mm_shufflelo_epi16
+  // LLVM: shufflevector <8 x i16> %{{.*}}, <8 x i16> poison, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 4, i32 5, i32 6, i32 7>
+
+  // OGCG-LABEL: test_mm_shufflelo_epi16
+  // OGCG: shufflevector <8 x i16> %{{.*}}, <8 x i16> poison, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 4, i32 5, i32 6, i32 7>
+  return _mm_shufflelo_epi16(A, 0);
+}
+
+__m128i test_mm_shufflehi_epi16(__m128i A) {
+  // CIR-LABEL: test_mm_shufflehi_epi16
+  // CIR: %{{.*}} = cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<8 x !s16i>) [#cir.int<0> : !s32i, #cir.int<1> : !s32i, #cir.int<2> : !s32i, #cir.int<3> : !s32i, #cir.int<4> : !s32i, #cir.int<4> : !s32i, #cir.int<4> : !s32i, #cir.int<4> : !s32i] : !cir.vector<8 x !s16i>
+
+  // LLVM-LABEL: test_mm_shufflehi_epi16
+  // LLVM: shufflevector <8 x i16> %{{.*}}, <8 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 4, i32 4, i32 4>
+
+  // OGCG-LABEL: test_mm_shufflehi_epi16
+  // OGCG: shufflevector <8 x i16> %{{.*}}, <8 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 4, i32 4, i32 4>
+  return _mm_shufflehi_epi16(A, 0);
+}
+
+__m128d test_mm_shuffle_pd(__m128d A, __m128d B) {
+  // CIR-LABEL: test_mm_shuffle_pd
+  // CIR: %{{.*}} = cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<2 x !cir.double>) [#cir.int<1> : !s32i, #cir.int<2> : !s32i] : !cir.vector<2 x !cir.double>
+
+  // LLVM-LABEL: test_mm_shuffle_pd
+  // LLVM: shufflevector <2 x double> %{{.*}}, <2 x double> %{{.*}}, <2 x i32> <i32 1, i32 2>
+
+  // OGCG-LABEL: test_mm_shuffle_pd
+  // OGCG: shufflevector <2 x double> %{{.*}}, <2 x double> %{{.*}}, <2 x i32> <i32 1, i32 2>
+  return _mm_shuffle_pd(A, B, 1);
+}
+
+__m128i test_mm_shuffle_epi32(__m128i A) {
+  // CIR-LABEL: test_mm_shuffle_epi32
+  // CIR: %{{.*}} = cir.vec.shuffle(%{{.*}}, %{{.*}} : !cir.vector<4 x !s32i>) [#cir.int<2> : !s32i, #cir.int<3> : !s32i, #cir.int<0> : !s32i, #cir.int<1> : !s32i] : !cir.vector<4 x !s32i>
+
+  // LLVM-LABEL: test_mm_shuffle_epi32
+  // LLVM: shufflevector <4 x i32> %{{.*}}, <4 x i32> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
+
+  // OGCG-LABEL: test_mm_shuffle_epi32
+  // OGCG: shufflevector <4 x i32> %{{.*}}, <4 x i32> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
+  return _mm_shuffle_epi32(A, 0x4E);
+}
diff --git a/clang/test/CIR/CodeGenBuiltins/X86/vec-set-builtins.c b/clang/test/CIR/CodeGenBuiltins/X86/vec-set-builtins.c
new file mode 100644
index 0000000000000..c166128b8147d
--- /dev/null
+++ b/clang/test/CIR/CodeGenBuiltins/X86/vec-set-builtins.c
@@ -0,0 +1,141 @@
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +sse4.1 -target-feature +avx -fclangir -emit-cir -o %t.cir -Wall -Werror
+// RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +sse4.1 -target-feature +avx -fclangir -emit-llvm -o %t.ll -Wall -Werror
+// RUN: FileCheck --check-prefixes=LLVM --input-file=%t.ll %s
+
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +sse4.1 -target-feature +avx  -fclangir -emit-cir -o %t.cir -Wall -Werror
+// RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +sse4.1 -target-feature +avx -fclangir -emit-llvm -o %t.ll -Wall -Werror
+// RUN: FileCheck --check-prefixes=LLVM --input-file=%t.ll %s
+
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +sse4.1 -target-feature +avx -emit-llvm -o - -Wall -Werror | FileCheck %s -check-prefix=OGCG
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-unknown-linux -target-feature +sse4.1 -target-feature +avx -emit-llvm -o - -Wall -Werror | FileCheck %s -check-prefix=OGCG
+
+#include <immintrin.h>
+
+typedef short __v4hi __attribute__((__vector_size__(8)));
+typedef char __v16qi __attribute__((__vector_size__(16)));
+typedef short __v8hi __attribute__((__vector_size__(16)));
+typedef int __v4si __attribute__((__vector_size__(16)));
+typedef long long __v2di __attribute__((__vector_size__(16)));
+typedef char __v32qi __attribute__((__vector_size__(32)));
+typedef short __v16hi __attribute__((__vector_size__(32)));
+typedef int __v8si __attribute__((__vector_size__(32)));
+typedef long long __v4di __attribute__((__vector_size__(32)));
+
+__v4hi test_vec_set_v4hi(__v4hi a, short b) {
+  // CIR-LABEL: test_vec_set_v4hi
+  // CIR: {{%.*}} = cir.const #cir.int<2> : !u64i
+  // CIR: {{%.*}} = cir.vec.insert %{{.*}}, %{{.*}}[%{{.*}} : !u64i] : !cir.vector<4 x !s16i>
+
+  // LLVM-LABEL: test_vec_set_v4hi
+  // LLVM: {{%.*}} = insertelement <4 x i16> {{%.*}}, i16 {{%.*}}, i64 2
+
+  // OGCG-LABEL: test_vec_set_v4hi
+  // OGCG: {{%.*}} = insertelement <4 x i16> {{%.*}}, i16 {{%.*}}, i64 2
+  return __builtin_ia32_vec_set_v4hi(a, b, 2);
+}
+
+__v16qi test_vec_set_v16qi(__v16qi a, char b) {
+  // CIR-LABEL: test_vec_set_v16qi
+  // CIR: {{%.*}} = cir.const #cir.int<5> : !u64i
+  // CIR: {{%.*}} = cir.vec.insert %{{.*}}, %{{.*}}[%{{.*}} : !u64i] : !cir.vector<16 x !s8i>
+
+  // LLVM-LABEL: test_vec_set_v16qi
+  // LLVM: {{%.*}} = insertelement <16 x i8> {{%.*}}, i8 {{%.*}}, i64 5
+
+  // OGCG-LABEL: test_vec_set_v16qi
+  // OGCG: {{%.*}} = insertelement <16 x i8> {{%.*}}, i8 {{%.*}}, i64 5
+  return __builtin_ia32_vec_set_v16qi(a, b, 5);
+}
+
+__v8hi test_vec_set_v8hi(__v8hi a, short b) {
+  // CIR-LABEL: test_vec_set_v8hi
+  // CIR: {{%.*}} = cir.const #cir.int<3> : !u64i
+  // CIR: {{%.*}} = cir.vec.insert %{{.*}}, %{{.*}}[%{{.*}} : !u64i] : !cir.vector<8 x !s16i>
+
+  // LLVM-LABEL: test_vec_set_v8hi
+  // LLVM: {{%.*}} = insertelement <8 x i16> {{%.*}}, i16 {{%.*}}, i64 3
+
+  // OGCG-LABEL: test_vec_set_v8hi
+  // OGCG: {{%.*}} = insertelement <8 x i16> {{%.*}}, i16 {{%.*}}, i64 3
+  return __builtin_ia32_vec_set_v8hi(a, b, 3);
+}
+
+__v4si test_vec_set_v4si(__v4si a, int b) {
+  // CIR-LABEL: test_vec_set_v4si
+  // CIR: {{%.*}} = cir.const #cir.int<1> : !u64i
+  // CIR: {{%.*}} = cir.vec.insert %{{.*}}, %{{.*}}[%{{.*}} : !u64i] : !cir.vector<4 x !s32i>
+
+  // LLVM-LABEL: test_vec_set_v4si
+  // LLVM: {{%.*}} = insertelement <4 x i32> {{%.*}}, i32 {{%.*}}, i64 1
+
+  // OGCG-LABEL: test_vec_set_v4si
+  // OGCG: {{%.*}} = insertelement <4 x i32> {{%.*}}, i32 {{%.*}}, i64 1
+  return __builtin_ia32_vec_set_v4si(a, b, 1);
+}
+
+__v2di test_vec_set_v2di(__v2di a, long long b) {
+  // CIR-LABEL: test_vec_set_v2di
+  // CIR: {{%.*}} = cir.const #cir.int<0> : !u64i
+  // CIR: {{%.*}} = cir.vec.insert %{{.*}}, %{{.*}}[%{{.*}} : !u64i] : !cir.vector<2 x !s64i>
+
+  // LLVM-LABEL: test_vec_set_v2di
+  // LLVM: {{%.*}} = insertelement <2 x i64> {{%.*}}, i64 {{%.*}}, i64 0
+
+  // OGCG-LABEL: test_vec_set_v2di
+  // OGCG: {{%.*}} = insertelement <2 x i64> {{%.*}}, i64 {{%.*}}, i64 0
+  return __builtin_ia32_vec_set_v2di(a, b, 0);
+}
+
+__v32qi test_vec_set_v32qi(__v32qi a, char b) {
+  // CIR-LABEL: test_vec_set_v32qi
+  // CIR: {{%.*}} = cir.const #cir.int<10> : !u64i
+  // CIR: {{%.*}} = cir.vec.insert %{{.*}}, %{{.*}}[%{{.*}} : !u64i] : !cir.vector<32 x !s8i>
+
+  // LLVM-LABEL: test_vec_set_v32qi
+  // LLVM: {{%.*}} = insertelement <32 x i8> {{%.*}}, i8 {{%.*}}, i64 10
+
+  // OGCG-LABEL: test_vec_set_v32qi
+  // OGCG: {{%.*}} = insertelement <32 x i8> {{%.*}}, i8 {{%.*}}, i64 10
+  return __builtin_ia32_vec_set_v32qi(a, b, 10);
+}
+
+__v16hi test_vec_set_v16hi(__v16hi a, short b) {
+  // CIR-LABEL: test_vec_set_v16hi
+  // CIR: {{%.*}} = cir.const #cir.int<7> : !u64i
+  // CIR: {{%.*}} = cir.vec.insert %{{.*}}, %{{.*}}[%{{.*}} : !u64i] : !cir.vector<16 x !s16i>
+
+  // LLVM-LABEL: test_vec_set_v16hi
+  // LLVM: {{%.*}} = insertelement <16 x i16> {{%.*}}, i16 {{%.*}}, i64 7
+
+  // OGCG-LABEL: test_vec_set_v16hi
+  // OGCG: {{%.*}} = insertelement <16 x i16> {{%.*}}, i16 {{%.*}}, i64 7
+  return __builtin_ia32_vec_set_v16hi(a, b, 7);
+}
+
+__v8si test_vec_set_v8si(__v8si a, int b) {
+  // CIR-LABEL: test_vec_set_v8si
+  // CIR: {{%.*}} = cir.const #cir.int<4> : !u64i
+  // CIR: {{%.*}} = cir.vec.insert %{{.*}}, %{{.*}}[%{{.*}} : !u64i] : !cir.vector<8 x !s32i>
+
+  // LLVM-LABEL: test_vec_set_v8si
+  // LLVM: {{%.*}} = insertelement <8 x i32> {{%.*}}, i32 {{%.*}}, i64 4
+
+  // OGCG-LABEL: test_vec_set_v8si
+  // OGCG: {{%.*}} = insertelement <8 x i32> {{%.*}}, i32 {{%.*}}, i64 4
+  return __builtin_ia32_vec_set_v8si(a, b, 4);
+}
+
+__v4di test_vec_set_v4di(__v4di a, long long b) {
+  // CIR-LABEL: test_vec_set_v4di
+  // CIR: {{%.*}} = cir.const #cir.int<2> : !u64i
+  // CIR: {{%.*}} = cir.vec.insert %{{.*}}, %{{.*}}[%{{.*}} : !u64i] : !cir.vector<4 x !s64i>
+
+  // LLVM-LABEL: test_vec_set_v4di
+  // LLVM: {{%.*}} = insertelement <4 x i64> {{%.*}}, i64 {{%.*}}, i64 2
+
+  // OGCG-LABEL: test_vec_set_v4di
+  // OGCG: {{%.*}} = insertelement <4 x i64> {{%.*}}, i64 {{%.*}}, i64 2
+  return __builtin_ia32_vec_set_v4di(a, b, 2);
+}
diff --git a/clang/test/CIR/CodeGenBuiltins/X86/xop-builtins.c b/clang/test/CIR/CodeGenBuiltins/X86/xop-builtins.c
new file mode 100644
index 0000000000000..0aaba7b46327d
--- /dev/null
+++ b/clang/test/CIR/CodeGenBuiltins/X86/xop-builtins.c
@@ -0,0 +1,92 @@
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +xop -fclangir -emit-cir -o %t.cir
+// RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +xop -fno-signed-char -fclangir -emit-cir -o %t.cir
+// RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
+
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +xop -fclangir -emit-llvm -o %t.ll
+// RUN: FileCheck --check-prefix=LLVM --input-file=%t.ll %s
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +xop -fno-signed-char -fclangir -emit-llvm -o %t.ll
+// RUN: FileCheck --check-prefix=LLVM --input-file=%t.ll %s
+
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +xop -fclangir -emit-cir -o %t.cir
+// RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +xop -fno-signed-char -fclangir -emit-cir -o %t.cir
+// RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
+
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +xop -fclangir -emit-llvm -o %t.ll
+// RUN: FileCheck --check-prefix=LLVM --input-file=%t.ll %s
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +xop -fno-signed-char -fclangir -emit-llvm -o %t.ll
+// RUN: FileCheck --check-prefix=LLVM --input-file=%t.ll %s
+
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +xop -emit-llvm -o - -Wall -Werror | FileCheck %s -check-prefix=OGCG
+// RUN: %clang_cc1 -x c -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +xop -fno-signed-char -emit-llvm -o - -Wall -Werror | FileCheck %s -check-prefix=OGCG
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +xop -emit-llvm -o - -Wall -Werror | FileCheck %s -check-prefix=OGCG
+// RUN: %clang_cc1 -x c++ -flax-vector-conversions=none -ffreestanding %s -triple=x86_64-apple-darwin -target-feature +xop -fno-signed-char -emit-llvm -o - -Wall -Werror | FileCheck %s -check-prefix=OGCG
+
+#include <x86intrin.h>
+
+// This test mimics clang/test/CodeGen/X86/xop-builtins.c, which eventually
+// CIR shall be able to support fully.
+
+__m128i test_mm_roti_epi8(__m128i a) {
+  // CIR-LABEL: test_mm_roti_epi8
+  // CIR: cir.vec.splat %{{.*}} : !{{[us]}}8i, !cir.vector<16 x !{{[us]}}8i> 
+  // CIR: cir.call_llvm_intrinsic "fshl" %{{.*}} : (!cir.vector<16 x !{{[su]}}8i>, !cir.vector<16 x !{{[su]}}8i>, !cir.vector<16 x !{{[su]}}8i>) -> !cir.vector<16 x !{{[su]}}8i> 
+  
+  // LLVM-LABEL: test_mm_roti_epi8
+  // LLVM: %[[CASTED_VAR:.*]] = bitcast <2 x i64> %{{.*}} to <16 x i8>
+  // LLVM: call <16 x i8> @llvm.fshl.v16i8(<16 x i8> %[[CASTED_VAR]], <16 x i8> %[[CASTED_VAR]], <16 x i8> splat (i8 1))
+  
+  // OGCG-LABEL: test_mm_roti_epi8
+  // OGCG: %[[CASTED_VAR:.*]] = bitcast <2 x i64> %{{.*}} to <16 x i8>
+  // OGCG: call <16 x i8> @llvm.fshl.v16i8(<16 x i8> %[[CASTED_VAR]], <16 x i8> %[[CASTED_VAR]], <16 x i8> splat (i8 1))
+  return _mm_roti_epi8(a, 1);
+}
+
+__m128i test_mm_roti_epi16(__m128i a) {
+  // CIR-LABEL: test_mm_roti_epi16
+  // CIR: cir.cast integral %{{.*}} : !u8i -> !u16i
+  // CIR: cir.vec.splat %{{.*}} : !{{[us]}}16i, !cir.vector<8 x !u16i> 
+  // CIR: cir.call_llvm_intrinsic "fshl" %{{.*}} : (!cir.vector<8 x !{{[su]}}16i>, !cir.vector<8 x !{{[su]}}16i>, !cir.vector<8 x !u16i>) -> !cir.vector<8 x !{{[su]}}16i> 
+  
+  // LLVM-LABEL: test_mm_roti_epi16
+  // LLVM: %[[CASTED_VAR:.*]] = bitcast <2 x i64> %{{.*}} to <8 x i16>
+  // LLVM: call <8 x i16> @llvm.fshl.v8i16(<8 x i16> %[[CASTED_VAR]], <8 x i16> %[[CASTED_VAR]], <8 x i16> splat (i16 50))
+  
+  // OGCG-LABEL: test_mm_roti_epi16
+  // OGCG: %[[CASTED_VAR:.*]] = bitcast <2 x i64> %{{.*}} to <8 x i16>
+  // OGCG: call <8 x i16> @llvm.fshl.v8i16(<8 x i16> %[[CASTED_VAR]], <8 x i16> %[[CASTED_VAR]], <8 x i16> splat (i16 50))
+  return _mm_roti_epi16(a, 50);
+}
+
+__m128i test_mm_roti_epi32(__m128i a) {
+  // CIR-LABEL: test_mm_roti_epi32
+  // CIR: cir.cast integral %{{.*}} : !u8i -> !u32i
+  // CIR: cir.vec.splat %{{.*}} : !{{[us]}}32i, !cir.vector<4 x !u32i> 
+  // CIR: cir.call_llvm_intrinsic "fshl" %{{.*}} : (!cir.vector<4 x !{{[su]}}32i>, !cir.vector<4 x !{{[su]}}32i>, !cir.vector<4 x !u32i>) -> !cir.vector<4 x !{{[su]}}32i> 
+  
+  // LLVM-LABEL: test_mm_roti_epi32
+  // LLVM: %[[CASTED_VAR:.*]] = bitcast <2 x i64> %{{.*}} to <4 x i32>
+  // LLVM: call <4 x i32> @llvm.fshl.v4i32(<4 x i32> %[[CASTED_VAR]], <4 x i32> %[[CASTED_VAR]], <4 x i32> splat (i32 226))
+  
+  // OGCG-LABEL: test_mm_roti_epi32
+  // OGCG: %[[CASTED_VAR:.*]] = bitcast <2 x i64> %{{.*}} to <4 x i32>
+  // OGCG: call <4 x i32> @llvm.fshl.v4i32(<4 x i32> %[[CASTED_VAR]], <4 x i32> %[[CASTED_VAR]], <4 x i32> splat (i32 226))
+  return _mm_roti_epi32(a, -30);
+}
+
+__m128i test_mm_roti_epi64(__m128i a) {
+  // CIR-LABEL: test_mm_roti_epi64
+  // CIR: cir.cast integral %{{.*}} : !u8i -> !u64i
+  // CIR: cir.vec.splat %{{.*}} : !u64i, !cir.vector<2 x !u64i> 
+  // CIR: cir.call_llvm_intrinsic "fshl" %{{.*}} : (!cir.vector<2 x !{{[su]}}64i>, !cir.vector<2 x !{{[su]}}64i>, !cir.vector<2 x !u64i>) -> !cir.vector<2 x !s64i> 
+  
+  // LLVM-LABEL: test_mm_roti_epi64
+  // LLVM: %[[VAR:.*]] = load <2 x i64>, ptr %{{.*}}, align 16
+  // LLVM: call <2 x i64> @llvm.fshl.v2i64(<2 x i64> %[[VAR]], <2 x i64> %[[VAR]], <2 x i64> splat (i64 100))
+  
+  // OGCG-LABEL: test_mm_roti_epi64
+  // OGCG: %[[VAR:.*]] = load <2 x i64>, ptr %{{.*}}, align 16
+  // OGCG: call <2 x i64> @llvm.fshl.v2i64(<2 x i64> %[[VAR]], <2 x i64> %[[VAR]], <2 x i64> splat (i64 100))
+  return _mm_roti_epi64(a, 100);
+}
diff --git a/clang/test/CIR/CodeGenBuiltins/builtin-constant-p.c b/clang/test/CIR/CodeGenBuiltins/builtin-constant-p.c
new file mode 100644
index 0000000000000..d684659216cba
--- /dev/null
+++ b/clang/test/CIR/CodeGenBuiltins/builtin-constant-p.c
@@ -0,0 +1,281 @@
+// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -fclangir -emit-cir %s -o %t.cir
+// RUN: FileCheck --input-file=%t.cir %s -check-prefix=CIR
+// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -fclangir -emit-llvm %s -o %t-cir.ll
+// RUN: FileCheck --input-file=%t-cir.ll %s -check-prefix=LLVM
+// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -emit-llvm %s -o %t.ll
+// RUN: FileCheck --input-file=%t.ll %s -check-prefix=OGCG
+
+int a = 42;
+
+/* --- Compound literals */
+
+struct foo { int x, y; };
+
+int y;
+struct foo f = (struct foo){ __builtin_constant_p(y), 42 };
+
+// CIR: cir.global external @f = #cir.const_record<{#cir.int<0> : !s32i, #cir.int<42> : !s32i}> : !rec_foo
+// LLVM: @f = global %struct.foo { i32 0, i32 42 }
+// OGCG: @f = global %struct.foo { i32 0, i32 42 }
+
+struct foo test0(int expr) {
+  struct foo f = (struct foo){ __builtin_constant_p(expr), 42 };
+  return f;
+}
+
+// CIR: cir.func {{.*}} @test0(%[[ARG0:.*]]: !s32i {{.*}}) -> !rec_foo
+// CIR:   %[[EXPR_ADDR:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["expr", init]
+// CIR:   cir.store %[[ARG0]], %[[EXPR_ADDR]]
+// CIR:   %[[EXPR:.*]] = cir.load{{.*}} %[[EXPR_ADDR]]
+// CIR:   %[[IS_CONSTANT:.*]] = cir.is_constant %[[EXPR]] : !s32i -> !cir.bool
+
+// LLVM: define{{.*}} %struct.foo @test0(i32 %[[ARG0:.*]])
+// LLVM:   %[[EXPR_ADDR:.*]] = alloca i32
+// LLVM:   store i32 %[[ARG0]], ptr %[[EXPR_ADDR]]
+// LLVM:   %[[EXPR:.*]] = load i32, ptr %[[EXPR_ADDR]]
+// LLVM:   %[[IS_CONSTANT:.*]] = call i1 @llvm.is.constant.i32(i32 %[[EXPR]])
+
+// OGCG: define{{.*}} i64 @test0(i32 {{.*}} %[[ARG0:.*]])
+// OGCG:   %[[EXPR_ADDR:.*]] = alloca i32
+// OGCG:   store i32 %[[ARG0]], ptr %[[EXPR_ADDR]]
+// OGCG:   %[[EXPR:.*]] = load i32, ptr %[[EXPR_ADDR]]
+// OGCG:   %[[IS_CONSTANT:.*]] = call i1 @llvm.is.constant.i32(i32 %[[EXPR]])
+
+/* --- Pointer types */
+
+int test1(void) {
+  return __builtin_constant_p(&a - 13);
+}
+
+// CIR: cir.func {{.*}} @test1() -> !s32i
+// CIR:   %[[TMP1:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
+// CIR:   %[[ZERO:.*]] = cir.const #cir.int<0> : !s32i
+// CIR:   cir.store %[[ZERO]], %[[TMP1]] : !s32i, !cir.ptr<!s32i>
+// CIR:   %[[TMP2:.*]] = cir.load %[[TMP1]] : !cir.ptr<!s32i>, !s32i
+// CIR:   cir.return %[[TMP2]] : !s32i
+
+// LLVM: define{{.*}} i32 @test1()
+// LLVM:   %[[TMP1:.*]] = alloca i32
+// LLVM:   store i32 0, ptr %[[TMP1]]
+// LLVM:   %[[TMP2:.*]] = load i32, ptr %[[TMP1]]
+// LLVM:   ret i32 %[[TMP2]]
+
+// OGCG: define{{.*}} i32 @test1()
+// OGCG:   ret i32 0
+
+/* --- Aggregate types */
+
+int b[] = {1, 2, 3};
+
+int test2(void) {
+  return __builtin_constant_p(b);
+}
+
+// CIR: cir.func {{.*}} @test2() -> !s32i
+// CIR:   %[[TMP1:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
+// CIR:   %[[ZERO:.*]] = cir.const #cir.int<0> : !s32i
+// CIR:   cir.store %[[ZERO]], %[[TMP1]] : !s32i, !cir.ptr<!s32i>
+// CIR:   %[[TMP2:.*]] = cir.load %[[TMP1]] : !cir.ptr<!s32i>, !s32i
+// CIR:   cir.return %[[TMP2]] : !s32i
+
+// LLVM: define{{.*}} i32 @test2()
+// LLVM:   %[[TMP1:.*]] = alloca i32
+// LLVM:   store i32 0, ptr %[[TMP1]]
+// LLVM:   %[[TMP2:.*]] = load i32, ptr %[[TMP1]]
+// LLVM:   ret i32 %[[TMP2]]
+
+// OGCG: define{{.*}} i32 @test2()
+// OGCG:   ret i32 0
+
+const char test3_c[] = {1, 2, 3, 0};
+
+int test3(void) {
+  return __builtin_constant_p(test3_c);
+}
+
+// CIR: cir.func {{.*}} @test3() -> !s32i
+// CIR:   %[[TMP1:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
+// CIR:   %[[ZERO:.*]] = cir.const #cir.int<0> : !s32i
+// CIR:   cir.store %[[ZERO]], %[[TMP1]] : !s32i, !cir.ptr<!s32i>
+// CIR:   %[[TMP2:.*]] = cir.load %[[TMP1]] : !cir.ptr<!s32i>, !s32i
+// CIR:   cir.return %[[TMP2]] : !s32i
+
+// LLVM: define{{.*}} i32 @test3()
+// LLVM:   %[[TMP1:.*]] = alloca i32
+// LLVM:   store i32 0, ptr %[[TMP1]]
+// LLVM:   %[[TMP2:.*]] = load i32, ptr %[[TMP1]]
+// LLVM:   ret i32 %[[TMP2]]
+
+// OGCG: define{{.*}} i32 @test3()
+// OGCG:   ret i32 0
+
+inline char test4_i(const char *x) {
+  return x[1];
+}
+
+int test4(void) {
+  return __builtin_constant_p(test4_i(test3_c));
+}
+
+// CIR: cir.func {{.*}} @test4() -> !s32i
+// CIR:   %[[TMP1:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
+// CIR:   %[[ZERO:.*]] = cir.const #cir.int<0> : !s32i
+// CIR:   cir.store %[[ZERO]], %[[TMP1]] : !s32i, !cir.ptr<!s32i>
+// CIR:   %[[TMP2:.*]] = cir.load %[[TMP1]] : !cir.ptr<!s32i>, !s32i
+// CIR:   cir.return %[[TMP2]] : !s32i
+
+// LLVM: define{{.*}} i32 @test4()
+// LLVM:   %[[TMP1:.*]] = alloca i32
+// LLVM:   store i32 0, ptr %[[TMP1]]
+// LLVM:   %[[TMP2:.*]] = load i32, ptr %[[TMP1]]
+// LLVM:   ret i32 %[[TMP2]]
+
+// OGCG: define{{.*}} i32 @test4()
+// OGCG:   ret i32 0
+
+/* --- Constant global variables */
+
+const int c = 42;
+
+int test5(void) {
+  return __builtin_constant_p(c);
+}
+
+// CIR: cir.func {{.*}} @test5() -> !s32i
+// CIR:   %[[TMP1:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
+// CIR:   %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+// CIR:   cir.store %[[ONE]], %[[TMP1]] : !s32i, !cir.ptr<!s32i>
+// CIR:   %[[TMP2:.*]] = cir.load %[[TMP1]] : !cir.ptr<!s32i>, !s32i
+// CIR:   cir.return %[[TMP2]] : !s32i
+
+// LLVM: define{{.*}} i32 @test5()
+// LLVM:   %[[TMP1:.*]] = alloca i32
+// LLVM:   store i32 1, ptr %[[TMP1]]
+// LLVM:   %[[TMP2:.*]] = load i32, ptr %[[TMP1]]
+// LLVM:   ret i32 %[[TMP2]]
+
+// OGCG: define{{.*}} i32 @test5()
+// OGCG:   ret i32 1
+
+/* --- Array types */
+
+int arr[] = { 1, 2, 3 };
+
+int test6(void) {
+  return __builtin_constant_p(arr[2]);
+}
+
+// CIR: cir.func {{.*}} @test6() -> !s32i
+// CIR:   %[[TWO:.*]] = cir.const #cir.int<2> : !s32i
+// CIR:   %[[ARR:.*]] = cir.get_global @arr : !cir.ptr<!cir.array<!s32i x 3>>
+// CIR:   %[[ARR_PTR:.*]] = cir.cast array_to_ptrdecay %[[ARR]] : !cir.ptr<!cir.array<!s32i x 3>> -> !cir.ptr<!s32i>
+// CIR:   %[[ELE_PTR:.*]] = cir.ptr_stride %[[ARR_PTR]], %[[TWO]] : (!cir.ptr<!s32i>, !s32i) -> !cir.ptr<!s32i>
+// CIR:   %[[ELE:.*]] = cir.load{{.*}} %[[ELE_PTR]] : !cir.ptr<!s32i>, !s32i
+// CIR:   %[[IS_CONSTANT:.*]] = cir.is_constant %[[ELE]] : !s32i -> !cir.bool
+
+// LLVM: define {{.*}} i32 @test6()
+// LLVM:   %[[TMP1:.*]] = load i32, ptr getelementptr inbounds nuw (i8, ptr @arr, i64 8)
+// LLVM:   %[[TMP2:.*]] = call i1 @llvm.is.constant.i32(i32 %[[TMP1]])
+
+// OGCG: define {{.*}} i32 @test6()
+// OGCG:   %[[TMP1:.*]] = load i32, ptr getelementptr inbounds ([3 x i32], ptr @arr, i64 0, i64 2)
+// OGCG:   %[[TMP2:.*]] = call i1 @llvm.is.constant.i32(i32 %[[TMP1]])
+
+const int c_arr[] = { 1, 2, 3 };
+
+int test7(void) {
+  return __builtin_constant_p(c_arr[2]);
+}
+
+// CIR: cir.func {{.*}} @test7() -> !s32i
+// CIR:   %[[TMP1:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
+// CIR:   %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+// CIR:   cir.store %[[ONE]], %[[TMP1]] : !s32i, !cir.ptr<!s32i>
+// CIR:   %[[TMP2:.*]] = cir.load %[[TMP1]] : !cir.ptr<!s32i>, !s32i
+// CIR:   cir.return %[[TMP2]] : !s32i
+
+// LLVM: define{{.*}} i32 @test7()
+// LLVM:   %[[TMP1:.*]] = alloca i32
+// LLVM:   store i32 1, ptr %[[TMP1]]
+// LLVM:   %[[TMP2:.*]] = load i32, ptr %[[TMP1]]
+// LLVM:   ret i32 %[[TMP2]]
+
+// OGCG: define{{.*}} i32 @test7()
+// OGCG:   ret i32 1
+
+int test8(void) {
+  return __builtin_constant_p(c_arr);
+}
+
+// CIR: cir.func {{.*}} @test8() -> !s32i
+// CIR:   %[[TMP1:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
+// CIR:   %[[ZERO:.*]] = cir.const #cir.int<0> : !s32i
+// CIR:   cir.store %[[ZERO]], %[[TMP1]] : !s32i, !cir.ptr<!s32i>
+// CIR:   %[[TMP2:.*]] = cir.load %[[TMP1]] : !cir.ptr<!s32i>, !s32i
+// CIR:   cir.return %[[TMP2]] : !s32i
+
+// LLVM: define{{.*}} i32 @test8()
+// LLVM:   %[[TMP1:.*]] = alloca i32
+// LLVM:   store i32 0, ptr %[[TMP1]]
+// LLVM:   %[[TMP2:.*]] = load i32, ptr %[[TMP1]]
+// LLVM:   ret i32 %[[TMP2]]
+
+// OGCG: define{{.*}} i32 @test8()
+// OGCG:   ret i32 0
+
+/* --- Function pointers */
+
+int test9(void) {
+  return __builtin_constant_p(&test9);
+}
+
+// CIR: cir.func {{.*}} @test9() -> !s32i
+// CIR:   %[[TMP1:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
+// CIR:   %[[ZERO:.*]] = cir.const #cir.int<0> : !s32i
+// CIR:   cir.store %[[ZERO]], %[[TMP1]] : !s32i, !cir.ptr<!s32i>
+// CIR:   %[[TMP2:.*]] = cir.load %[[TMP1]] : !cir.ptr<!s32i>, !s32i
+// CIR:   cir.return %[[TMP2]] : !s32i
+
+// LLVM: define{{.*}} i32 @test9()
+// LLVM:   %[[TMP1:.*]] = alloca i32
+// LLVM:   store i32 0, ptr %[[TMP1]]
+// LLVM:   %[[TMP2:.*]] = load i32, ptr %[[TMP1]]
+// LLVM:   ret i32 %[[TMP2]]
+
+// OGCG: define{{.*}} i32 @test9()
+// OGCG:   ret i32 0
+
+int test10(void) {
+  return __builtin_constant_p(&test10 != 0);
+}
+
+// CIR: cir.func {{.*}} @test10() -> !s32i
+// CIR:   %[[TMP1:.*]] = cir.alloca !s32i, !cir.ptr<!s32i>, ["__retval"]
+// CIR:   %[[ONE:.*]] = cir.const #cir.int<1> : !s32i
+// CIR:   cir.store %[[ONE]], %[[TMP1]] : !s32i, !cir.ptr<!s32i>
+// CIR:   %[[TMP2:.*]] = cir.load %[[TMP1]] : !cir.ptr<!s32i>, !s32i
+// CIR:   cir.return %[[TMP2]] : !s32i
+
+// LLVM: define{{.*}} i32 @test10()
+// LLVM:   %[[TMP1:.*]] = alloca i32
+// LLVM:   store i32 1, ptr %[[TMP1]]
+// LLVM:   %[[TMP2:.*]] = load i32, ptr %[[TMP1]]
+// LLVM:   ret i32 %[[TMP2]]
+
+// OGCG: define{{.*}} i32 @test10()
+// OGCG:   ret i32 1
+
+int test11_f(void);
+void test11(void) {
+  int a, b;
+  (void)__builtin_constant_p((a = b, test11_f()));
+}
+
+// CIR: cir.func {{.*}} @test11()
+// CIR-NOT: call {{.*}}test11_f
+
+// LLVM: define{{.*}} void @test11()
+// LLVM-NOT: call {{.*}}test11_f
+
+// OGCG: define{{.*}} void @test11()
+// OGCG-NOT: call {{.*}}test11_f
diff --git a/clang/test/CIR/CodeGenBuiltins/builtin-fcmp-sse.c b/clang/test/CIR/CodeGenBuiltins/builtin-fcmp-sse.c
index c273d6b3fca0e..35abd1b57ecb0 100644
--- a/clang/test/CIR/CodeGenBuiltins/builtin-fcmp-sse.c
+++ b/clang/test/CIR/CodeGenBuiltins/builtin-fcmp-sse.c
@@ -9,8 +9,8 @@ typedef float __m128 __attribute__((__vector_size__(16), __aligned__(16)));
 typedef double __m128d __attribute__((__vector_size__(16), __aligned__(16)));
 
 __m128 test_cmpnleps(__m128 A, __m128 B) {
-  // CIR-LABEL:   cir.func dso_local @test_cmpnleps(
-  // CIR:           %[[ARG0:.*]]: !cir.vector<4 x !cir.float> {{.*}}, %[[ARG1:.*]]: !cir.vector<4 x !cir.float> {{.*}}) -> !cir.vector<4 x !cir.float> inline(never) {
+  // CIR-LABEL:   cir.func no_inline dso_local @test_cmpnleps(
+  // CIR:           %[[ARG0:.*]]: !cir.vector<4 x !cir.float> {{.*}}, %[[ARG1:.*]]: !cir.vector<4 x !cir.float> {{.*}}) -> !cir.vector<4 x !cir.float> {
   // CIR:           %[[ALLOCA_0:.*]] = cir.alloca !cir.vector<4 x !cir.float>, !cir.ptr<!cir.vector<4 x !cir.float>>, ["A", init] {alignment = 16 : i64}
   // CIR:           %[[ALLOCA_1:.*]] = cir.alloca !cir.vector<4 x !cir.float>, !cir.ptr<!cir.vector<4 x !cir.float>>, ["B", init] {alignment = 16 : i64}
   // CIR:           %[[ALLOCA_2:.*]] = cir.alloca !cir.vector<4 x !cir.float>, !cir.ptr<!cir.vector<4 x !cir.float>>, ["__retval"] {alignment = 16 : i64}
@@ -60,8 +60,8 @@ __m128 test_cmpnleps(__m128 A, __m128 B) {
 }
 
 __m128d test_cmpnlepd(__m128d A, __m128d B) {
-  // CIR-LABEL:   cir.func dso_local @test_cmpnlepd(
-  // CIR:          %[[ARG0:.*]]: !cir.vector<2 x !cir.double> {{.*}}, %[[ARG1:.*]]: !cir.vector<2 x !cir.double> {{.*}}) -> !cir.vector<2 x !cir.double> inline(never) {
+  // CIR-LABEL:   cir.func no_inline dso_local @test_cmpnlepd(
+  // CIR:           %[[ARG0:.*]]: !cir.vector<2 x !cir.double> {{.*}}, %[[ARG1:.*]]: !cir.vector<2 x !cir.double> {{.*}}) -> !cir.vector<2 x !cir.double> {
   // CIR:           %[[ALLOCA_0:.*]] = cir.alloca !cir.vector<2 x !cir.double>, !cir.ptr<!cir.vector<2 x !cir.double>>, ["A", init] {alignment = 16 : i64} 
   // CIR:           %[[ALLOCA_1:.*]] = cir.alloca !cir.vector<2 x !cir.double>, !cir.ptr<!cir.vector<2 x !cir.double>>, ["B", init] {alignment = 16 : i64} 
   // CIR:           %[[ALLOCA_2:.*]] = cir.alloca !cir.vector<2 x !cir.double>, !cir.ptr<!cir.vector<2 x !cir.double>>, ["__retval"] {alignment = 16 : i64} 
@@ -111,8 +111,8 @@ __m128d test_cmpnlepd(__m128d A, __m128d B) {
 }
 
 __m128 test_cmpnltps(__m128 A, __m128 B) {
-  // CIR-LABEL:   cir.func dso_local @test_cmpnltps(
-  // CIR-SAME:      %[[ARG0:.*]]: !cir.vector<4 x !cir.float> {{.*}}, %[[ARG1:.*]]: !cir.vector<4 x !cir.float> {{.*}}) -> !cir.vector<4 x !cir.float> inline(never) {
+  // CIR-LABEL:   cir.func no_inline dso_local @test_cmpnltps(
+  // CIR:           %[[ARG0:.*]]: !cir.vector<4 x !cir.float> {{.*}}, %[[ARG1:.*]]: !cir.vector<4 x !cir.float> {{.*}}) -> !cir.vector<4 x !cir.float> {
   // CIR:           %[[ALLOCA_0:.*]] = cir.alloca !cir.vector<4 x !cir.float>, !cir.ptr<!cir.vector<4 x !cir.float>>, ["A", init] {alignment = 16 : i64} 
   // CIR:           %[[ALLOCA_1:.*]] = cir.alloca !cir.vector<4 x !cir.float>, !cir.ptr<!cir.vector<4 x !cir.float>>, ["B", init] {alignment = 16 : i64} 
   // CIR:           %[[ALLOCA_2:.*]] = cir.alloca !cir.vector<4 x !cir.float>, !cir.ptr<!cir.vector<4 x !cir.float>>, ["__retval"] {alignment = 16 : i64} 
@@ -162,8 +162,8 @@ __m128 test_cmpnltps(__m128 A, __m128 B) {
 }
 
 __m128d test_cmpnltpd(__m128d A, __m128d B) {
-  // CIR-LABEL:   cir.func dso_local @test_cmpnltpd(
-  // CIR:           %[[ARG0:.*]]: !cir.vector<2 x !cir.double> {{.*}}, %[[ARG1:.*]]: !cir.vector<2 x !cir.double> {{.*}}) -> !cir.vector<2 x !cir.double> inline(never) {
+  // CIR-LABEL:   cir.func no_inline dso_local @test_cmpnltpd(
+  // CIR:           %[[ARG0:.*]]: !cir.vector<2 x !cir.double> {{.*}}, %[[ARG1:.*]]: !cir.vector<2 x !cir.double> {{.*}}) -> !cir.vector<2 x !cir.double> {
   // CIR:           %[[ALLOCA_0:.*]] = cir.alloca !cir.vector<2 x !cir.double>, !cir.ptr<!cir.vector<2 x !cir.double>>, ["A", init] {alignment = 16 : i64} 
   // CIR:           %[[ALLOCA_1:.*]] = cir.alloca !cir.vector<2 x !cir.double>, !cir.ptr<!cir.vector<2 x !cir.double>>, ["B", init] {alignment = 16 : i64} 
   // CIR:           %[[ALLOCA_2:.*]] = cir.alloca !cir.vector<2 x !cir.double>, !cir.ptr<!cir.vector<2 x !cir.double>>, ["__retval"] {alignment = 16 : i64} 
diff --git a/clang/test/CIR/CodeGenBuiltins/builtin_inline.c b/clang/test/CIR/CodeGenBuiltins/builtin_inline.c
index 83a3ba6e53f4b..06437ecd6ccd6 100644
--- a/clang/test/CIR/CodeGenBuiltins/builtin_inline.c
+++ b/clang/test/CIR/CodeGenBuiltins/builtin_inline.c
@@ -20,7 +20,7 @@ void *test_inline_builtin_memcpy(void *a, const void *b, size_t c) {
   return memcpy(a, b, c);
 }
 
-// CIR: cir.func internal private{{.*}}@memcpy.inline({{.*}}) -> !cir.ptr<!void> inline(always)
+// CIR: cir.func always_inline internal private{{.*}}@memcpy.inline({{.*}}) -> !cir.ptr<!void>
 
 // CIR-LABEL: @test_inline_builtin_memcpy(
 // CIR:         cir.call @memcpy.inline(
diff --git a/clang/test/CIR/CodeGenBuiltins/builtin_prefetch.c b/clang/test/CIR/CodeGenBuiltins/builtin_prefetch.c
index cfe85b9ba8104..15eb37bb2f88b 100644
--- a/clang/test/CIR/CodeGenBuiltins/builtin_prefetch.c
+++ b/clang/test/CIR/CodeGenBuiltins/builtin_prefetch.c
@@ -9,7 +9,7 @@ void foo(void *a) {
   __builtin_prefetch(a, 1, 1);  // rw=1, locality=1
 }
 
-// CIR-LABEL: cir.func dso_local @foo(
+// CIR-LABEL: cir.func {{.*}} @foo(
 // CIR: %[[ALLOCA:.*]] = cir.alloca !cir.ptr<!void>
 // CIR: cir.store %arg0, %[[ALLOCA]] : !cir.ptr<!void>, !cir.ptr<!cir.ptr<!void>>
 // CIR: %[[P1:.*]] = cir.load{{.*}} %[[ALLOCA]] : !cir.ptr<!cir.ptr<!void>>, !cir.ptr<!void>
diff --git a/clang/test/CIR/CodeGenBuiltins/builtins-floating-point.c b/clang/test/CIR/CodeGenBuiltins/builtins-floating-point.c
index a4307c57b04b6..010633551f57d 100644
--- a/clang/test/CIR/CodeGenBuiltins/builtins-floating-point.c
+++ b/clang/test/CIR/CodeGenBuiltins/builtins-floating-point.c
@@ -67,3 +67,24 @@ long double my_exp2l(long double f) {
   // LLVM: %{{.*}} = call fp128 @llvm.exp2.f128(fp128 %{{.*}})
   // OGCG: %{{.*}} = call fp128 @llvm.exp2.f128(fp128 %{{.*}})
 }
+
+float floorf(float f) {
+  return __builtin_floorf(f);
+  // CIR: %{{.*}} = cir.floor %{{.*}} : !cir.float
+  // LLVM: %{{.*}} = call float @llvm.floor.f32(float %{{.*}})
+  // OGCG: %{{.*}} = call float @llvm.floor.f32(float %{{.*}})
+}
+
+double floor(double f) {
+  return __builtin_floor(f);
+  // CIR: %{{.*}} = cir.floor %{{.*}} : !cir.double
+  // LLVM: %{{.*}} = call double @llvm.floor.f64(double %{{.*}})
+  // OGCG: %{{.*}} = call double @llvm.floor.f64(double %{{.*}})
+}
+
+long double floorl(long double f) {
+  return __builtin_floorl(f);
+  // CIR: %{{.*}} = cir.floor %{{.*}} : !cir.long_double<!cir.f128>
+  // LLVM: %{{.*}} = call fp128 @llvm.floor.f128(fp128 %{{.*}})
+  // OGCG: %{{.*}} = call fp128 @llvm.floor.f128(fp128 %{{.*}})
+}
diff --git a/clang/test/CIR/CodeGenBuiltins/builtins-overflow.cpp b/clang/test/CIR/CodeGenBuiltins/builtins-overflow.cpp
index 9ee3e7c015209..4e568a865dcbc 100644
--- a/clang/test/CIR/CodeGenBuiltins/builtins-overflow.cpp
+++ b/clang/test/CIR/CodeGenBuiltins/builtins-overflow.cpp
@@ -9,7 +9,7 @@ bool test_add_overflow_uint_uint_uint(unsigned x, unsigned y, unsigned *res) {
   return __builtin_add_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z32test_add_overflow_uint_uint_uintjjPj
+//      CIR: cir.func {{.*}} @_Z32test_add_overflow_uint_uint_uintjjPj
 //      CIR:   %[[#LHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u32i>, !u32i
 // CIR-NEXT:   %[[#RHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u32i>, !u32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!u32i>>, !cir.ptr<!u32i>
@@ -27,7 +27,7 @@ bool test_add_overflow_int_int_int(int x, int y, int *res) {
   return __builtin_add_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z29test_add_overflow_int_int_intiiPi
+//      CIR: cir.func {{.*}} @_Z29test_add_overflow_int_int_intiiPi
 //      CIR:   %[[#LHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#RHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s32i>>, !cir.ptr<!s32i>
@@ -39,7 +39,7 @@ bool test_add_overflow_xint31_xint31_xint31(_BitInt(31) x, _BitInt(31) y, _BitIn
   return __builtin_add_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z38test_add_overflow_xint31_xint31_xint31DB31_S_PS_
+//      CIR: cir.func {{.*}} @_Z38test_add_overflow_xint31_xint31_xint31DB31_S_PS_
 //      CIR:   %[[#LHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.int<s, 31>>, !cir.int<s, 31>
 // CIR-NEXT:   %[[#RHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.int<s, 31>>, !cir.int<s, 31>
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!cir.int<s, 31>>>, !cir.ptr<!cir.int<s, 31>>
@@ -51,7 +51,7 @@ bool test_sub_overflow_uint_uint_uint(unsigned x, unsigned y, unsigned *res) {
   return __builtin_sub_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z32test_sub_overflow_uint_uint_uintjjPj
+//      CIR: cir.func {{.*}} @_Z32test_sub_overflow_uint_uint_uintjjPj
 //      CIR:   %[[#LHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u32i>, !u32i
 // CIR-NEXT:   %[[#RHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u32i>, !u32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!u32i>>, !cir.ptr<!u32i>
@@ -63,7 +63,7 @@ bool test_sub_overflow_int_int_int(int x, int y, int *res) {
   return __builtin_sub_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z29test_sub_overflow_int_int_intiiPi
+//      CIR: cir.func {{.*}} @_Z29test_sub_overflow_int_int_intiiPi
 //      CIR:   %[[#LHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#RHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s32i>>, !cir.ptr<!s32i>
@@ -75,7 +75,7 @@ bool test_sub_overflow_xint31_xint31_xint31(_BitInt(31) x, _BitInt(31) y, _BitIn
   return __builtin_sub_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z38test_sub_overflow_xint31_xint31_xint31DB31_S_PS_
+//      CIR: cir.func {{.*}} @_Z38test_sub_overflow_xint31_xint31_xint31DB31_S_PS_
 //      CIR:   %[[#LHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.int<s, 31>>, !cir.int<s, 31>
 // CIR-NEXT:   %[[#RHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.int<s, 31>>, !cir.int<s, 31>
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!cir.int<s, 31>>>, !cir.ptr<!cir.int<s, 31>>
@@ -87,7 +87,7 @@ bool test_mul_overflow_uint_uint_uint(unsigned x, unsigned y, unsigned *res) {
   return __builtin_mul_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z32test_mul_overflow_uint_uint_uintjjPj
+//      CIR: cir.func {{.*}} @_Z32test_mul_overflow_uint_uint_uintjjPj
 //      CIR:   %[[#LHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u32i>, !u32i
 // CIR-NEXT:   %[[#RHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u32i>, !u32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!u32i>>, !cir.ptr<!u32i>
@@ -99,7 +99,7 @@ bool test_mul_overflow_int_int_int(int x, int y, int *res) {
   return __builtin_mul_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z29test_mul_overflow_int_int_intiiPi
+//      CIR: cir.func {{.*}} @_Z29test_mul_overflow_int_int_intiiPi
 //      CIR:   %[[#LHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#RHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s32i>>, !cir.ptr<!s32i>
@@ -111,7 +111,7 @@ bool test_mul_overflow_xint31_xint31_xint31(_BitInt(31) x, _BitInt(31) y, _BitIn
   return __builtin_mul_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z38test_mul_overflow_xint31_xint31_xint31DB31_S_PS_
+//      CIR: cir.func {{.*}} @_Z38test_mul_overflow_xint31_xint31_xint31DB31_S_PS_
 //      CIR:   %[[#LHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.int<s, 31>>, !cir.int<s, 31>
 // CIR-NEXT:   %[[#RHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.int<s, 31>>, !cir.int<s, 31>
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!cir.int<s, 31>>>, !cir.ptr<!cir.int<s, 31>>
@@ -123,7 +123,7 @@ bool test_mul_overflow_ulong_ulong_long(unsigned long x, unsigned long y, unsign
   return __builtin_mul_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z34test_mul_overflow_ulong_ulong_longmmPm
+//      CIR: cir.func {{.*}} @_Z34test_mul_overflow_ulong_ulong_longmmPm
 //      CIR:   %[[#LHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#RHS:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!u64i>>, !cir.ptr<!u64i>
@@ -135,7 +135,7 @@ bool test_add_overflow_uint_int_int(unsigned x, int y, int *res) {
   return __builtin_add_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z30test_add_overflow_uint_int_intjiPi
+//      CIR: cir.func {{.*}} @_Z30test_add_overflow_uint_int_intjiPi
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u32i>, !u32i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s32i>>, !cir.ptr<!s32i>
@@ -149,7 +149,7 @@ bool test_add_overflow_volatile(int x, int y, volatile int *res) {
   return __builtin_add_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z26test_add_overflow_volatileiiPVi
+//      CIR: cir.func {{.*}} @_Z26test_add_overflow_volatileiiPVi
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s32i>>, !cir.ptr<!s32i>
@@ -161,7 +161,7 @@ bool test_uadd_overflow(unsigned x, unsigned y, unsigned *res) {
   return __builtin_uadd_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z18test_uadd_overflowjjPj
+//      CIR: cir.func {{.*}} @_Z18test_uadd_overflowjjPj
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u32i>, !u32i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u32i>, !u32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!u32i>>, !cir.ptr<!u32i>
@@ -173,7 +173,7 @@ bool test_uaddl_overflow(unsigned long x, unsigned long y, unsigned long *res) {
   return __builtin_uaddl_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z19test_uaddl_overflowmmPm
+//      CIR: cir.func {{.*}} @_Z19test_uaddl_overflowmmPm
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!u64i>>, !cir.ptr<!u64i>
@@ -185,7 +185,7 @@ bool test_uaddll_overflow(unsigned long long x, unsigned long long y, unsigned l
   return __builtin_uaddll_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z20test_uaddll_overflowyyPy
+//      CIR: cir.func {{.*}} @_Z20test_uaddll_overflowyyPy
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!u64i>>, !cir.ptr<!u64i>
@@ -197,7 +197,7 @@ bool test_usub_overflow(unsigned x, unsigned y, unsigned *res) {
   return __builtin_usub_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z18test_usub_overflowjjPj
+//      CIR: cir.func {{.*}} @_Z18test_usub_overflowjjPj
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u32i>, !u32i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u32i>, !u32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!u32i>>, !cir.ptr<!u32i>
@@ -209,7 +209,7 @@ bool test_usubl_overflow(unsigned long x, unsigned long y, unsigned long *res) {
   return __builtin_usubl_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z19test_usubl_overflowmmPm
+//      CIR: cir.func {{.*}} @_Z19test_usubl_overflowmmPm
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!u64i>>, !cir.ptr<!u64i>
@@ -221,7 +221,7 @@ bool test_usubll_overflow(unsigned long long x, unsigned long long y, unsigned l
   return __builtin_usubll_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z20test_usubll_overflowyyPy
+//      CIR: cir.func {{.*}} @_Z20test_usubll_overflowyyPy
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!u64i>>, !cir.ptr<!u64i>
@@ -233,7 +233,7 @@ bool test_umul_overflow(unsigned x, unsigned y, unsigned *res) {
   return __builtin_umul_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z18test_umul_overflowjjPj
+//      CIR: cir.func {{.*}} @_Z18test_umul_overflowjjPj
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u32i>, !u32i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u32i>, !u32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!u32i>>, !cir.ptr<!u32i>
@@ -245,7 +245,7 @@ bool test_umull_overflow(unsigned long x, unsigned long y, unsigned long *res) {
   return __builtin_umull_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z19test_umull_overflowmmPm
+//      CIR: cir.func {{.*}} @_Z19test_umull_overflowmmPm
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!u64i>>, !cir.ptr<!u64i>
@@ -257,7 +257,7 @@ bool test_umulll_overflow(unsigned long long x, unsigned long long y, unsigned l
   return __builtin_umulll_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z20test_umulll_overflowyyPy
+//      CIR: cir.func {{.*}} @_Z20test_umulll_overflowyyPy
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!u64i>, !u64i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!u64i>>, !cir.ptr<!u64i>
@@ -269,7 +269,7 @@ bool test_sadd_overflow(int x, int y, int *res) {
   return __builtin_sadd_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z18test_sadd_overflowiiPi
+//      CIR: cir.func {{.*}} @_Z18test_sadd_overflowiiPi
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s32i>>, !cir.ptr<!s32i>
@@ -281,7 +281,7 @@ bool test_saddl_overflow(long x, long y, long *res) {
   return __builtin_saddl_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z19test_saddl_overflowllPl
+//      CIR: cir.func {{.*}} @_Z19test_saddl_overflowllPl
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s64i>, !s64i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s64i>, !s64i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s64i>>, !cir.ptr<!s64i>
@@ -293,7 +293,7 @@ bool test_saddll_overflow(long long x, long long y, long long *res) {
   return __builtin_saddll_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z20test_saddll_overflowxxPx
+//      CIR: cir.func {{.*}} @_Z20test_saddll_overflowxxPx
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s64i>, !s64i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s64i>, !s64i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s64i>>, !cir.ptr<!s64i>
@@ -305,7 +305,7 @@ bool test_ssub_overflow(int x, int y, int *res) {
   return __builtin_ssub_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z18test_ssub_overflowiiPi
+//      CIR: cir.func {{.*}} @_Z18test_ssub_overflowiiPi
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s32i>>, !cir.ptr<!s32i>
@@ -317,7 +317,7 @@ bool test_ssubl_overflow(long x, long y, long *res) {
   return __builtin_ssubl_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z19test_ssubl_overflowllPl
+//      CIR: cir.func {{.*}} @_Z19test_ssubl_overflowllPl
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s64i>, !s64i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s64i>, !s64i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s64i>>, !cir.ptr<!s64i>
@@ -329,7 +329,7 @@ bool test_ssubll_overflow(long long x, long long y, long long *res) {
   return __builtin_ssubll_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z20test_ssubll_overflowxxPx
+//      CIR: cir.func {{.*}} @_Z20test_ssubll_overflowxxPx
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s64i>, !s64i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s64i>, !s64i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s64i>>, !cir.ptr<!s64i>
@@ -341,7 +341,7 @@ bool test_smul_overflow(int x, int y, int *res) {
   return __builtin_smul_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z18test_smul_overflowiiPi
+//      CIR: cir.func {{.*}} @_Z18test_smul_overflowiiPi
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s32i>, !s32i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s32i>>, !cir.ptr<!s32i>
@@ -353,7 +353,7 @@ bool test_smull_overflow(long x, long y, long *res) {
   return __builtin_smull_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z19test_smull_overflowllPl
+//      CIR: cir.func {{.*}} @_Z19test_smull_overflowllPl
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s64i>, !s64i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s64i>, !s64i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s64i>>, !cir.ptr<!s64i>
@@ -365,7 +365,7 @@ bool test_smulll_overflow(long long x, long long y, long long *res) {
   return __builtin_smulll_overflow(x, y, res);
 }
 
-//      CIR: cir.func dso_local @_Z20test_smulll_overflowxxPx
+//      CIR: cir.func {{.*}} @_Z20test_smulll_overflowxxPx
 //      CIR:   %[[#X:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s64i>, !s64i
 // CIR-NEXT:   %[[#Y:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!s64i>, !s64i
 // CIR-NEXT:   %[[#RES_PTR:]] = cir.load{{.*}} %{{.+}} : !cir.ptr<!cir.ptr<!s64i>>, !cir.ptr<!s64i>
diff --git a/clang/test/CIR/CodeGenOpenACC/combined-firstprivate-clause.cpp b/clang/test/CIR/CodeGenOpenACC/combined-firstprivate-clause.cpp
index 5ee51aaa2446e..ba3c53b6bb03e 100644
--- a/clang/test/CIR/CodeGenOpenACC/combined-firstprivate-clause.cpp
+++ b/clang/test/CIR/CodeGenOpenACC/combined-firstprivate-clause.cpp
@@ -43,7 +43,7 @@ struct HasDtor {
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: } copy {
 // CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}):
-// CHECK-NEXT: cir.call @_ZN15NoCopyConstructC1ERKS_(%[[ARG_TO]], %[[ARG_FROM]]) nothrow : (!cir.ptr<!rec_NoCopyConstruct>, !cir.ptr<!rec_NoCopyConstruct>) -> ()
+// CHECK-NEXT: cir.copy %[[ARG_FROM]] to %[[ARG_TO]] : !cir.ptr<!rec_NoCopyConstruct>
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: }
 //
@@ -63,7 +63,7 @@ struct HasDtor {
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: } copy {
 // CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
-// CHECK-NEXT: cir.call @_ZN14NonDefaultCtorC1ERKS_(%[[ARG_TO]], %[[ARG_FROM]]) nothrow : (!cir.ptr<!rec_NonDefaultCtor>, !cir.ptr<!rec_NonDefaultCtor>) -> ()
+// CHECK-NEXT: cir.copy %[[ARG_FROM]] to %[[ARG_TO]] : !cir.ptr<!rec_NonDefaultCtor>
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: }
 //
@@ -73,7 +73,7 @@ struct HasDtor {
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: } copy {
 // CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
-// CHECK-NEXT: cir.call @_ZN7HasDtorC1ERKS_(%[[ARG_TO]], %[[ARG_FROM]]) nothrow : (!cir.ptr<!rec_HasDtor>, !cir.ptr<!rec_HasDtor>) -> ()
+// CHECK-NEXT: cir.copy %[[ARG_FROM]] to %[[ARG_TO]] : !cir.ptr<!rec_HasDtor>
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: } destroy {
 // CHECK-NEXT: ^bb0(%[[ORIG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}, %[[ARG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
@@ -176,7 +176,7 @@ struct HasDtor {
 // CHECK-NEXT: %[[STRIDE_FROM:.*]] = cir.ptr_stride %[[DECAY_FROM]], %[[ITR_LOAD]] : (!cir.ptr<!rec_NoCopyConstruct>, !u64i) -> !cir.ptr<!rec_NoCopyConstruct>
 // CHECK-NEXT: %[[DECAY_TO:.*]] = cir.cast array_to_ptrdecay %[[ARG_TO]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> -> !cir.ptr<!rec_NoCopyConstruct>
 // CHECK-NEXT: %[[STRIDE_TO:.*]] = cir.ptr_stride %[[DECAY_TO]], %[[ITR_LOAD]] : (!cir.ptr<!rec_NoCopyConstruct>, !u64i) -> !cir.ptr<!rec_NoCopyConstruct>
-// CHECK-NEXT: cir.call @_ZN15NoCopyConstructC1ERKS_(%[[STRIDE_TO]], %[[STRIDE_FROM]]) nothrow : (!cir.ptr<!rec_NoCopyConstruct>, !cir.ptr<!rec_NoCopyConstruct>) -> ()
+// CHECK-NEXT: cir.copy %[[STRIDE_FROM]] to %[[STRIDE_TO]] : !cir.ptr<!rec_NoCopyConstruct>
 // CHECK-NEXT: cir.yield
 // CHECK-NEXT: } step {
 // CHECK-NEXT: %[[ITR_LOAD]] = cir.load %[[ITR]] : !cir.ptr<!u64i>, !u64i
@@ -246,7 +246,7 @@ struct HasDtor {
 // CHECK-NEXT: %[[STRIDE_FROM:.*]] = cir.ptr_stride %[[DECAY_FROM]], %[[ITR_LOAD]] : (!cir.ptr<!rec_NonDefaultCtor>, !u64i) -> !cir.ptr<!rec_NonDefaultCtor>
 // CHECK-NEXT: %[[DECAY_TO:.*]] = cir.cast array_to_ptrdecay %[[ARG_TO]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> -> !cir.ptr<!rec_NonDefaultCtor>
 // CHECK-NEXT: %[[STRIDE_TO:.*]] = cir.ptr_stride %[[DECAY_TO]], %[[ITR_LOAD]] : (!cir.ptr<!rec_NonDefaultCtor>, !u64i) -> !cir.ptr<!rec_NonDefaultCtor>
-// CHECK-NEXT: cir.call @_ZN14NonDefaultCtorC1ERKS_(%[[STRIDE_TO]], %[[STRIDE_FROM]]) nothrow : (!cir.ptr<!rec_NonDefaultCtor>, !cir.ptr<!rec_NonDefaultCtor>) -> ()
+// CHECK-NEXT: cir.copy %[[STRIDE_FROM]] to %[[STRIDE_TO]] : !cir.ptr<!rec_NonDefaultCtor>
 // CHECK-NEXT: cir.yield
 // CHECK-NEXT: } step {
 // CHECK-NEXT: %[[ITR_LOAD]] = cir.load %[[ITR]] : !cir.ptr<!u64i>, !u64i
@@ -281,7 +281,7 @@ struct HasDtor {
 // CHECK-NEXT: %[[STRIDE_FROM:.*]] = cir.ptr_stride %[[DECAY_FROM]], %[[ITR_LOAD]] : (!cir.ptr<!rec_HasDtor>, !u64i) -> !cir.ptr<!rec_HasDtor>
 // CHECK-NEXT: %[[DECAY_TO:.*]] = cir.cast array_to_ptrdecay %[[ARG_TO]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>> -> !cir.ptr<!rec_HasDtor>
 // CHECK-NEXT: %[[STRIDE_TO:.*]] = cir.ptr_stride %[[DECAY_TO]], %[[ITR_LOAD]] : (!cir.ptr<!rec_HasDtor>, !u64i) -> !cir.ptr<!rec_HasDtor>
-// CHECK-NEXT: cir.call @_ZN7HasDtorC1ERKS_(%[[STRIDE_TO]], %[[STRIDE_FROM]]) nothrow : (!cir.ptr<!rec_HasDtor>, !cir.ptr<!rec_HasDtor>) -> ()
+// CHECK-NEXT: cir.copy %[[STRIDE_FROM]] to %[[STRIDE_TO]] : !cir.ptr<!rec_HasDtor>
 // CHECK-NEXT: cir.yield
 // CHECK-NEXT: } step {
 // CHECK-NEXT: %[[ITR_LOAD]] = cir.load %[[ITR]] : !cir.ptr<!u64i>, !u64i
diff --git a/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause-templates.cpp b/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause-templates.cpp
index 741b7dc4cb315..52340f2c3efbf 100644
--- a/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause-templates.cpp
+++ b/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause-templates.cpp
@@ -30,7 +30,7 @@ struct HasDtor {
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: } copy {
 // CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
-// CHECK-NEXT: cir.call @_ZN14NonDefaultCtorC1ERKS_(%[[ARG_TO]], %[[ARG_FROM]]) nothrow : (!cir.ptr<!rec_NonDefaultCtor>, !cir.ptr<!rec_NonDefaultCtor>) -> ()
+// CHECK-NEXT: cir.copy %[[ARG_FROM]] to %[[ARG_TO]] : !cir.ptr<!rec_NonDefaultCtor>
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: }
 //
@@ -40,7 +40,7 @@ struct HasDtor {
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: } copy {
 // CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
-// CHECK-NEXT: cir.call @_ZN7HasDtorC1ERKS_(%[[ARG_TO]], %[[ARG_FROM]]) nothrow : (!cir.ptr<!rec_HasDtor>, !cir.ptr<!rec_HasDtor>) -> ()
+// CHECK-NEXT: cir.copy %[[ARG_FROM]] to %[[ARG_TO]] : !cir.ptr<!rec_HasDtor>
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: } destroy {
 // CHECK-NEXT: ^bb0(%[[ORIG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}, %[[ARG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
diff --git a/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause.cpp b/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause.cpp
index b98eb0edba7c6..8f723c16175d3 100644
--- a/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause.cpp
+++ b/clang/test/CIR/CodeGenOpenACC/compute-firstprivate-clause.cpp
@@ -1,4 +1,5 @@
-// RUN: %clang_cc1 -fopenacc -triple x86_64-linux-gnu -Wno-openacc-self-if-potential-conflict -emit-cir -fclangir -triple x86_64-linux-pc %s -o - | FileCheck %s
+// RUN: %clang_cc1 -fopenacc -triple x86_64-linux-gnu -Wno-openacc-self-if-potential-conflict -emit-cir -fclangir -triple x86_64-linux-pc %s -o %t.ll
+// RUN: FileCheck --input-file=%t.ll %s
 
 struct NoCopyConstruct {};
 
@@ -43,7 +44,7 @@ struct HasDtor {
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: } copy {
 // CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NoCopyConstruct> {{.*}}):
-// CHECK-NEXT: cir.call @_ZN15NoCopyConstructC1ERKS_(%[[ARG_TO]], %[[ARG_FROM]]) nothrow : (!cir.ptr<!rec_NoCopyConstruct>, !cir.ptr<!rec_NoCopyConstruct>) -> ()
+// CHECK-NEXT: cir.copy %[[ARG_FROM]] to %[[ARG_TO]] : !cir.ptr<!rec_NoCopyConstruct>
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: }
 //
@@ -63,7 +64,7 @@ struct HasDtor {
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: } copy {
 // CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_NonDefaultCtor> {{.*}}):
-// CHECK-NEXT: cir.call @_ZN14NonDefaultCtorC1ERKS_(%[[ARG_TO]], %[[ARG_FROM]]) nothrow : (!cir.ptr<!rec_NonDefaultCtor>, !cir.ptr<!rec_NonDefaultCtor>) -> ()
+// CHECK-NEXT: cir.copy %[[ARG_FROM]] to %[[ARG_TO]] : !cir.ptr<!rec_NonDefaultCtor>
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: }
 //
@@ -73,7 +74,7 @@ struct HasDtor {
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: } copy {
 // CHECK-NEXT: ^bb0(%[[ARG_FROM:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}, %[[ARG_TO:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
-// CHECK-NEXT: cir.call @_ZN7HasDtorC1ERKS_(%[[ARG_TO]], %[[ARG_FROM]]) nothrow : (!cir.ptr<!rec_HasDtor>, !cir.ptr<!rec_HasDtor>) -> ()
+// CHECK-NEXT: cir.copy %[[ARG_FROM]] to %[[ARG_TO]] : !cir.ptr<!rec_HasDtor>
 // CHECK-NEXT: acc.yield
 // CHECK-NEXT: } destroy {
 // CHECK-NEXT: ^bb0(%[[ORIG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}, %[[ARG:.*]]: !cir.ptr<!rec_HasDtor> {{.*}}):
@@ -176,7 +177,7 @@ struct HasDtor {
 // CHECK-NEXT: %[[STRIDE_FROM:.*]] = cir.ptr_stride %[[DECAY_FROM]], %[[ITR_LOAD]] : (!cir.ptr<!rec_NoCopyConstruct>, !u64i) -> !cir.ptr<!rec_NoCopyConstruct>
 // CHECK-NEXT: %[[DECAY_TO:.*]] = cir.cast array_to_ptrdecay %[[ARG_TO]] : !cir.ptr<!cir.array<!rec_NoCopyConstruct x 5>> -> !cir.ptr<!rec_NoCopyConstruct>
 // CHECK-NEXT: %[[STRIDE_TO:.*]] = cir.ptr_stride %[[DECAY_TO]], %[[ITR_LOAD]] : (!cir.ptr<!rec_NoCopyConstruct>, !u64i) -> !cir.ptr<!rec_NoCopyConstruct>
-// CHECK-NEXT: cir.call @_ZN15NoCopyConstructC1ERKS_(%[[STRIDE_TO]], %[[STRIDE_FROM]]) nothrow : (!cir.ptr<!rec_NoCopyConstruct>, !cir.ptr<!rec_NoCopyConstruct>) -> ()
+// CHECK-NEXT: cir.copy %[[STRIDE_FROM]] to %[[STRIDE_TO]] : !cir.ptr<!rec_NoCopyConstruct>
 // CHECK-NEXT: cir.yield
 // CHECK-NEXT: } step {
 // CHECK-NEXT: %[[ITR_LOAD]] = cir.load %[[ITR]] : !cir.ptr<!u64i>, !u64i
@@ -246,7 +247,7 @@ struct HasDtor {
 // CHECK-NEXT: %[[STRIDE_FROM:.*]] = cir.ptr_stride %[[DECAY_FROM]], %[[ITR_LOAD]] : (!cir.ptr<!rec_NonDefaultCtor>, !u64i) -> !cir.ptr<!rec_NonDefaultCtor>
 // CHECK-NEXT: %[[DECAY_TO:.*]] = cir.cast array_to_ptrdecay %[[ARG_TO]] : !cir.ptr<!cir.array<!rec_NonDefaultCtor x 5>> -> !cir.ptr<!rec_NonDefaultCtor>
 // CHECK-NEXT: %[[STRIDE_TO:.*]] = cir.ptr_stride %[[DECAY_TO]], %[[ITR_LOAD]] : (!cir.ptr<!rec_NonDefaultCtor>, !u64i) -> !cir.ptr<!rec_NonDefaultCtor>
-// CHECK-NEXT: cir.call @_ZN14NonDefaultCtorC1ERKS_(%[[STRIDE_TO]], %[[STRIDE_FROM]]) nothrow : (!cir.ptr<!rec_NonDefaultCtor>, !cir.ptr<!rec_NonDefaultCtor>) -> ()
+// CHECK-NEXT: cir.copy %[[STRIDE_FROM]] to %[[STRIDE_TO]] : !cir.ptr<!rec_NonDefaultCtor>
 // CHECK-NEXT: cir.yield
 // CHECK-NEXT: } step {
 // CHECK-NEXT: %[[ITR_LOAD]] = cir.load %[[ITR]] : !cir.ptr<!u64i>, !u64i
@@ -281,7 +282,7 @@ struct HasDtor {
 // CHECK-NEXT: %[[STRIDE_FROM:.*]] = cir.ptr_stride %[[DECAY_FROM]], %[[ITR_LOAD]] : (!cir.ptr<!rec_HasDtor>, !u64i) -> !cir.ptr<!rec_HasDtor>
 // CHECK-NEXT: %[[DECAY_TO:.*]] = cir.cast array_to_ptrdecay %[[ARG_TO]] : !cir.ptr<!cir.array<!rec_HasDtor x 5>> -> !cir.ptr<!rec_HasDtor>
 // CHECK-NEXT: %[[STRIDE_TO:.*]] = cir.ptr_stride %[[DECAY_TO]], %[[ITR_LOAD]] : (!cir.ptr<!rec_HasDtor>, !u64i) -> !cir.ptr<!rec_HasDtor>
-// CHECK-NEXT: cir.call @_ZN7HasDtorC1ERKS_(%[[STRIDE_TO]], %[[STRIDE_FROM]]) nothrow : (!cir.ptr<!rec_HasDtor>, !cir.ptr<!rec_HasDtor>) -> ()
+// CHECK-NEXT: cir.copy %[[STRIDE_FROM]] to %[[STRIDE_TO]] : !cir.ptr<!rec_HasDtor>
 // CHECK-NEXT: cir.yield
 // CHECK-NEXT: } step {
 // CHECK-NEXT: %[[ITR_LOAD]] = cir.load %[[ITR]] : !cir.ptr<!u64i>, !u64i
diff --git a/clang/test/CIR/CodeGenOpenACC/firstprivate-clause-recipes.cpp b/clang/test/CIR/CodeGenOpenACC/firstprivate-clause-recipes.cpp
index 95168812316ea..3a13a81b48828 100644
--- a/clang/test/CIR/CodeGenOpenACC/firstprivate-clause-recipes.cpp
+++ b/clang/test/CIR/CodeGenOpenACC/firstprivate-clause-recipes.cpp
@@ -75,7 +75,7 @@ void do_things(unsigned A, unsigned B) {
 // CHECK-NEXT: %[[BOUND1_STRIDE_FROM:.*]] = cir.ptr_stride %[[BOUND2_STRIDE_DECAY_FROM]], %[[ITR1_LOAD]] : (!cir.ptr<!rec_NoOps>, !u64i) -> !cir.ptr<!rec_NoOps>
 // CHECK-NEXT: %[[BOUND2_STRIDE_DECAY_TO:.*]] = cir.cast array_to_ptrdecay %[[BOUND2_STRIDE_TO]] : !cir.ptr<!cir.array<!rec_NoOps x 5>> -> !cir.ptr<!rec_NoOps>
 // CHECK-NEXT: %[[BOUND1_STRIDE_TO:.*]] = cir.ptr_stride %[[BOUND2_STRIDE_DECAY_TO]], %[[ITR1_LOAD]] : (!cir.ptr<!rec_NoOps>, !u64i) -> !cir.ptr<!rec_NoOps>
-// CHECK-NEXT: cir.call @_ZN5NoOpsC1ERKS_(%[[BOUND1_STRIDE_TO]], %[[BOUND1_STRIDE_FROM]]) nothrow : (!cir.ptr<!rec_NoOps>, !cir.ptr<!rec_NoOps>) -> ()
+// CHECK-NEXT: cir.copy %[[BOUND1_STRIDE_FROM]] to %[[BOUND1_STRIDE_TO]] : !cir.ptr<!rec_NoOps>
 // CHECK-NEXT: cir.yield
 // CHECK-NEXT: } step {
 // CHECK-NEXT: %[[ITR1_LOAD]] = cir.load %[[ITR1]] : !cir.ptr<!u64i>, !u64i
@@ -342,7 +342,7 @@ void do_things(unsigned A, unsigned B) {
 // CHECK-NEXT: %[[STRIDE_FROM:.*]] = cir.ptr_stride %[[BOUND2_STRIDE_LOAD_FROM]], %[[ITR1_LOAD]] : (!cir.ptr<!rec_NoOps>, !u64i) -> !cir.ptr<!rec_NoOps>
 // CHECK-NEXT: %[[BOUND2_STRIDE_LOAD_TO:.*]] = cir.load %[[BOUND2_STRIDE_TO]] : !cir.ptr<!cir.ptr<!rec_NoOps>>, !cir.ptr<!rec_NoOps>
 // CHECK-NEXT: %[[STRIDE_TO:.*]] = cir.ptr_stride %[[BOUND2_STRIDE_LOAD_TO]], %[[ITR1_LOAD]] : (!cir.ptr<!rec_NoOps>, !u64i) -> !cir.ptr<!rec_NoOps>
-// CHECK-NEXT: cir.call @_ZN5NoOpsC1ERKS_(%[[BOUND1_STRIDE_TO]], %[[BOUND1_STRIDE_FROM]]) nothrow : (!cir.ptr<!rec_NoOps>, !cir.ptr<!rec_NoOps>) -> ()
+// CHECK-NEXT: cir.copy %[[STRIDE_FROM]] to %[[STRIDE_TO]] : !cir.ptr<!rec_NoOps>
 // CHECK-NEXT: cir.yield
 // CHECK-NEXT: } step {
 // CHECK-NEXT: %[[ITR1_LOAD]] = cir.load %[[ITR1]] : !cir.ptr<!u64i>, !u64i
@@ -580,7 +580,7 @@ void do_things(unsigned A, unsigned B) {
 // CHECK-NEXT: %[[STRIDE_FROM:.*]] = cir.ptr_stride %[[BOUND2_STRIDE_LOAD_FROM]], %[[ITR1_LOAD]] : (!cir.ptr<!rec_CtorDtor>, !u64i) -> !cir.ptr<!rec_CtorDtor>
 // CHECK-NEXT: %[[BOUND2_STRIDE_LOAD_TO:.*]] = cir.load %[[BOUND2_STRIDE_TO]] : !cir.ptr<!cir.ptr<!rec_CtorDtor>>, !cir.ptr<!rec_CtorDtor>
 // CHECK-NEXT: %[[STRIDE_TO:.*]] = cir.ptr_stride %[[BOUND2_STRIDE_LOAD_TO]], %[[ITR1_LOAD]] : (!cir.ptr<!rec_CtorDtor>, !u64i) -> !cir.ptr<!rec_CtorDtor>
-// CHECK-NEXT: cir.call @_ZN8CtorDtorC1ERKS_(%[[BOUND1_STRIDE_TO]], %[[BOUND1_STRIDE_FROM]]) nothrow : (!cir.ptr<!rec_CtorDtor>, !cir.ptr<!rec_CtorDtor>) -> ()
+// CHECK-NEXT: cir.copy %[[STRIDE_FROM]] to %[[STRIDE_TO]] : !cir.ptr<!rec_CtorDtor>
 // CHECK-NEXT: cir.yield
 // CHECK-NEXT: } step {
 // CHECK-NEXT: %[[ITR1_LOAD]] = cir.load %[[ITR1]] : !cir.ptr<!u64i>, !u64i
diff --git a/clang/test/CIR/CodeGenOpenACC/openacc-not-implemented-global.cpp b/clang/test/CIR/CodeGenOpenACC/openacc-not-implemented-global.cpp
deleted file mode 100644
index a5e4694c6f5e6..0000000000000
--- a/clang/test/CIR/CodeGenOpenACC/openacc-not-implemented-global.cpp
+++ /dev/null
@@ -1,6 +0,0 @@
-// RUN: %clang_cc1 -std=c++17 -triple x86_64-unknown-linux-gnu -fopenacc -fclangir -emit-cir %s -o %t.cir -verify
-// RUN: %clang_cc1 -std=c++17 -triple x86_64-unknown-linux-gnu -fopenacc -fclangir -emit-llvm %s -o %t-cir.ll -verify
-
-void foo() {}
-// expected-error at +1{{ClangIR code gen Not Yet Implemented: OpenACC Global Routine Construct}}
-#pragma acc routine(foo) seq
diff --git a/clang/test/CIR/CodeGenOpenACC/routine-anon-ns.cpp b/clang/test/CIR/CodeGenOpenACC/routine-anon-ns.cpp
new file mode 100644
index 0000000000000..7c0a2edee5257
--- /dev/null
+++ b/clang/test/CIR/CodeGenOpenACC/routine-anon-ns.cpp
@@ -0,0 +1,27 @@
+// RUN: %clang_cc1 -fopenacc -Wno-openacc-self-if-potential-conflict -emit-cir -fclangir %s -o - | FileCheck %s
+
+namespace {
+#pragma acc routine seq
+  void NSFunc1(){}
+#pragma acc routine seq
+  auto Lambda1 = [](){};
+
+  auto Lambda2 = [](){};
+} // namespace 
+
+#pragma acc routine(NSFunc1) seq
+#pragma acc routine(Lambda2) seq
+void force_emit() {
+  NSFunc1();
+  Lambda1();
+  Lambda2();
+}
+
+// CHECK: cir.func{{.*}} @[[F1_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[F1_R_NAME:.*]], @[[F1_R2_NAME:.*]]]>}
+// CHECK: cir.func {{.*}}lambda{{.*}} @[[L1_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[L1_R_NAME:.*]]]>}
+// CHECK: cir.func {{.*}}lambda{{.*}} @[[L2_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[L2_R_NAME:.*]]]>}
+//
+// CHECK: acc.routine @[[F1_R_NAME]] func(@[[F1_NAME]]) seq
+// CHECK: acc.routine @[[L1_R_NAME]] func(@[[L1_NAME]]) seq
+// CHECK: acc.routine @[[F1_R2_NAME]] func(@[[F1_NAME]]) seq
+// CHECK: acc.routine @[[L2_R_NAME]] func(@[[L2_NAME]]) seq
diff --git a/clang/test/CIR/CodeGenOpenACC/routine-clauses.cpp b/clang/test/CIR/CodeGenOpenACC/routine-clauses.cpp
new file mode 100644
index 0000000000000..81437e7e02ab1
--- /dev/null
+++ b/clang/test/CIR/CodeGenOpenACC/routine-clauses.cpp
@@ -0,0 +1,38 @@
+// RUN: %clang_cc1 -fopenacc -Wno-openacc-self-if-potential-conflict -emit-cir -fclangir %s -o - | FileCheck %s
+
+#pragma acc routine seq nohost
+void Func1() {}
+
+void Func2() {}
+#pragma acc routine(Func2) seq
+
+#pragma acc routine worker
+void Func3() {}
+
+void Func4() {}
+#pragma acc routine(Func4) worker nohost
+
+#pragma acc routine nohost vector
+void Func5() {}
+
+void Func6() {}
+#pragma acc routine(Func6) nohost vector
+
+// CHECK: cir.func{{.*}} @[[F1_NAME:.*Func1[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[F1_R_NAME:.*]]]>}
+// CHECK: acc.routine @[[F1_R_NAME]] func(@[[F1_NAME]]) seq nohost
+
+// CHECK: cir.func{{.*}} @[[F2_NAME:.*Func2[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[F2_R_NAME:.*]]]>}
+
+// CHECK: cir.func{{.*}} @[[F3_NAME:.*Func3[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[F3_R_NAME:.*]]]>}
+// CHECK: acc.routine @[[F3_R_NAME]] func(@[[F3_NAME]]) worker
+
+// CHECK: cir.func{{.*}} @[[F4_NAME:.*Func4[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[F4_R_NAME:.*]]]>}
+
+// CHECK: cir.func{{.*}} @[[F5_NAME:.*Func5[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[F5_R_NAME:.*]]]>}
+// CHECK: acc.routine @[[F5_R_NAME]] func(@[[F5_NAME]]) vector
+
+// CHECK: cir.func{{.*}} @[[F6_NAME:.*Func6[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[F6_R_NAME:.*]]]>}
+
+// CHECK: acc.routine @[[F2_R_NAME]] func(@[[F2_NAME]]) seq
+// CHECK: acc.routine @[[F4_R_NAME]] func(@[[F4_NAME]]) worker nohost
+// CHECK: acc.routine @[[F6_R_NAME]] func(@[[F6_NAME]]) vector nohost
diff --git a/clang/test/CIR/CodeGenOpenACC/routine-globals.cpp b/clang/test/CIR/CodeGenOpenACC/routine-globals.cpp
new file mode 100644
index 0000000000000..5f125bbce6cb8
--- /dev/null
+++ b/clang/test/CIR/CodeGenOpenACC/routine-globals.cpp
@@ -0,0 +1,35 @@
+// RUN: %clang_cc1 -fopenacc -Wno-openacc-self-if-potential-conflict -emit-cir -fclangir %s -o - | FileCheck %s
+
+#pragma acc routine seq
+auto Lambda1 = [](){};
+
+auto Lambda2 = [](){};
+#pragma acc routine(Lambda2) seq
+#pragma acc routine(Lambda2) seq
+
+#pragma acc routine seq
+int GlobalFunc1();
+
+int GlobalFunc2();
+#pragma acc routine(GlobalFunc2) seq
+#pragma acc routine(GlobalFunc1) seq
+
+void force_emit() {
+  Lambda1();
+  Lambda2();
+  GlobalFunc1();
+  GlobalFunc2();
+}
+
+// CHECK: cir.func {{.*}}lambda{{.*}} @[[L1_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[L1_R_NAME:.*]]]>}
+// CHECK: cir.func {{.*}}lambda{{.*}} @[[L2_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[L2_R_NAME:.*]], @[[L2_R2_NAME:.*]]]>}
+//
+// CHECK: cir.func{{.*}} @[[G1_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[G1_R_NAME:.*]], @[[G1_R2_NAME:.*]]]>}
+// CHECK: cir.func{{.*}} @[[G2_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[G2_R_NAME:.*]]]>}
+
+// CHECK: acc.routine @[[L1_R_NAME]] func(@[[L1_NAME]]) seq
+// CHECK: acc.routine @[[G1_R_NAME]] func(@[[G1_NAME]]) seq
+// CHECK: acc.routine @[[L2_R_NAME]] func(@[[L2_NAME]]) seq
+// CHECK: acc.routine @[[L2_R2_NAME]] func(@[[L2_NAME]]) seq
+// CHECK: acc.routine @[[G2_R_NAME]] func(@[[G2_NAME]]) seq
+// CHECK: acc.routine @[[G1_R2_NAME]] func(@[[G1_NAME]]) seq
diff --git a/clang/test/CIR/CodeGenOpenACC/routine-globals2.cpp b/clang/test/CIR/CodeGenOpenACC/routine-globals2.cpp
new file mode 100644
index 0000000000000..e1aa5046684da
--- /dev/null
+++ b/clang/test/CIR/CodeGenOpenACC/routine-globals2.cpp
@@ -0,0 +1,44 @@
+// RUN: %clang_cc1 -fopenacc -Wno-openacc-self-if-potential-conflict -emit-cir -fclangir %s -o - | FileCheck %s
+
+#pragma acc routine seq
+void GlobalFunc4();
+#pragma acc routine(GlobalFunc4) seq
+
+#pragma acc routine seq
+#pragma acc routine seq
+void GlobalFunc5();
+#pragma acc routine(GlobalFunc5) seq
+#pragma acc routine(GlobalFunc5) seq
+
+void GlobalFunc6();
+void GlobalFunc6();
+#pragma acc routine(GlobalFunc6) seq
+void GlobalFunc6(){}
+
+void GlobalFunc7(){}
+#pragma acc routine(GlobalFunc7) seq
+
+void force_emit() {
+  GlobalFunc4();
+  GlobalFunc5();
+  GlobalFunc6();
+  GlobalFunc7();
+}
+
+// CHECK: cir.func{{.*}} @[[G6_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[G6_R_NAME:.*]]]>}
+// CHECK: cir.func{{.*}} @[[G7_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[G7_R_NAME:.*]]]>}
+
+// CHECK: cir.func{{.*}} @[[G4_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[G4_R_NAME:.*]], @[[G4_R2_NAME:.*]]]>}
+// CHECK: cir.func{{.*}} @[[G5_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[G5_R_NAME:.*]], @[[G5_R1_NAME:.*]], @[[G5_R2_NAME:.*]], @[[G5_R3_NAME:.*]]]>}
+
+// CHECK: acc.routine @[[G4_R_NAME]] func(@[[G4_NAME]]) seq
+// CHECK: acc.routine @[[G5_R_NAME]] func(@[[G5_NAME]]) seq
+// CHECK: acc.routine @[[G5_R1_NAME]] func(@[[G5_NAME]]) seq
+//
+// CHECK: acc.routine @[[G4_R2_NAME]] func(@[[G4_NAME]]) seq
+//
+// CHECK: acc.routine @[[G5_R2_NAME]] func(@[[G5_NAME]]) seq
+// CHECK: acc.routine @[[G5_R3_NAME]] func(@[[G5_NAME]]) seq
+//
+// CHECK: acc.routine @[[G6_R_NAME]] func(@[[G6_NAME]]) seq
+// CHECK: acc.routine @[[G7_R_NAME]] func(@[[G7_NAME]]) seq
diff --git a/clang/test/CIR/CodeGenOpenACC/routine-locals.cpp b/clang/test/CIR/CodeGenOpenACC/routine-locals.cpp
new file mode 100644
index 0000000000000..d338a9cea0d09
--- /dev/null
+++ b/clang/test/CIR/CodeGenOpenACC/routine-locals.cpp
@@ -0,0 +1,24 @@
+// RUN: %clang_cc1 -fopenacc -Wno-openacc-self-if-potential-conflict -emit-cir -fclangir %s -o - | FileCheck %s
+
+void GlobalFunc();
+void InFunc() {
+
+#pragma acc routine(GlobalFunc) seq
+  GlobalFunc();
+
+#pragma acc routine seq
+  auto Lambda1 = [](){};
+  Lambda1();
+
+  auto Lambda2 = [](){};
+#pragma acc routine(Lambda2) seq
+  Lambda2();
+};
+
+// CHECK: cir.func{{.*}} @[[G1_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[G1_R_NAME:.*]]]>}
+// CHECK: cir.func {{.*}}lambda{{.*}} @[[L1_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[L1_R_NAME:.*]]]>}
+// CHECK: cir.func {{.*}}lambda{{.*}} @[[L2_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[L2_R_NAME:.*]]]>}
+
+// CHECK: acc.routine @[[L1_R_NAME]] func(@[[L1_NAME]]) seq
+// CHECK: acc.routine @[[G1_R_NAME]] func(@[[G1_NAME]]) seq
+// CHECK: acc.routine @[[L2_R_NAME]] func(@[[L2_NAME]]) seq
diff --git a/clang/test/CIR/CodeGenOpenACC/routine-members.cpp b/clang/test/CIR/CodeGenOpenACC/routine-members.cpp
new file mode 100644
index 0000000000000..713500cfe3868
--- /dev/null
+++ b/clang/test/CIR/CodeGenOpenACC/routine-members.cpp
@@ -0,0 +1,55 @@
+// RUN: %clang_cc1 -fopenacc -Wno-openacc-self-if-potential-conflict -emit-cir -fclangir %s -o - | FileCheck %s
+
+struct S {
+#pragma acc routine seq
+  void MemFunc1();
+  void MemFunc2();
+#pragma acc routine(S::MemFunc2) seq
+  void MemFunc3();
+#pragma acc routine(S::MemFunc3) seq
+
+#pragma acc routine seq
+  static void StaticMemFunc1();
+  static void StaticMemFunc2();
+  static void StaticMemFunc3();
+#pragma acc routine(StaticMemFunc3) seq
+
+#pragma acc routine seq
+  static constexpr auto StaticLambda1 = [](){};
+  static constexpr auto StaticLambda2 = [](){};
+};
+#pragma acc routine(S::MemFunc2) seq
+#pragma acc routine(S::StaticLambda2) seq
+#pragma acc routine(S::StaticMemFunc2) seq
+
+void force_emit() {
+  S{}.MemFunc1();
+  S{}.MemFunc2();
+  S{}.MemFunc3();
+  S::StaticMemFunc1();
+  S::StaticMemFunc2();
+  S::StaticMemFunc3();
+  S::StaticLambda1();
+  S::StaticLambda2();
+}
+
+// CHECK: cir.func{{.*}} @[[MEM1_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[MEM1_R_NAME:.*]]]>}
+// CHECK: cir.func{{.*}} @[[MEM2_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[MEM2_R_NAME:.*]], @[[MEM2_R2_NAME:.*]]]>}
+// CHECK: cir.func{{.*}} @[[MEM3_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[MEM3_R_NAME:.*]]]>}
+//
+// CHECK: cir.func{{.*}} @[[STATICMEM1_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[STATICMEM1_R_NAME:.*]]]>}
+// CHECK: cir.func{{.*}} @[[STATICMEM2_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[STATICMEM2_R_NAME:.*]]]>}
+// CHECK: cir.func{{.*}} @[[STATICMEM3_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[STATICMEM3_R_NAME:.*]]]>}
+//
+// CHECK: cir.func {{.*}}lambda{{.*}} @[[L1_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[L1_R_NAME:.*]]]>}
+// CHECK: cir.func {{.*}}lambda{{.*}} @[[L2_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[L2_R_NAME:.*]]]>}
+//
+// CHECK: acc.routine @[[MEM1_R_NAME]] func(@[[MEM1_NAME]]) seq
+// CHECK: acc.routine @[[STATICMEM1_R_NAME]] func(@[[STATICMEM1_NAME]]) seq
+// CHECK: acc.routine @[[L1_R_NAME]] func(@[[L1_NAME]]) seq
+// CHECK: acc.routine @[[MEM2_R_NAME]] func(@[[MEM2_NAME]]) seq
+// CHECK: acc.routine @[[MEM3_R_NAME]] func(@[[MEM3_NAME]]) seq
+// CHECK: acc.routine @[[STATICMEM3_R_NAME]] func(@[[STATICMEM3_NAME]]) seq
+// CHECK: acc.routine @[[MEM2_R2_NAME]] func(@[[MEM2_NAME]]) seq
+// CHECK: acc.routine @[[L2_R_NAME]] func(@[[L2_NAME]]) seq
+// CHECK: acc.routine @[[STATICMEM2_R_NAME]] func(@[[STATICMEM2_NAME]]) seq
diff --git a/clang/test/CIR/CodeGenOpenACC/routine-ns.cpp b/clang/test/CIR/CodeGenOpenACC/routine-ns.cpp
new file mode 100644
index 0000000000000..9d1d677e79db8
--- /dev/null
+++ b/clang/test/CIR/CodeGenOpenACC/routine-ns.cpp
@@ -0,0 +1,28 @@
+// RUN: %clang_cc1 -fopenacc -Wno-openacc-self-if-potential-conflict -emit-cir -fclangir %s -o - | FileCheck %s
+
+namespace NS1 {
+#pragma acc routine seq
+  int NSFunc1();
+#pragma acc routine seq
+  auto Lambda1 = [](){};
+
+  auto Lambda2 = [](){};
+} // namespace NS1
+
+#pragma acc routine(NS1::NSFunc1) seq
+#pragma acc routine(NS1::Lambda2) seq
+
+void force_emit() {
+  NS1::NSFunc1();
+  NS1::Lambda1();
+  NS1::Lambda2();
+}
+
+// CHECK: cir.func{{.*}} @[[F1_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[F1_R_NAME:.*]], @[[F1_R2_NAME:.*]]]>}
+// CHECK: cir.func {{.*}}lambda{{.*}} @[[L1_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[L1_R_NAME:.*]]]>}
+// CHECK: cir.func {{.*}}lambda{{.*}} @[[L2_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[L2_R_NAME:.*]]]>}
+//
+// CHECK: acc.routine @[[F1_R_NAME]] func(@[[F1_NAME]]) seq
+// CHECK: acc.routine @[[L1_R_NAME]] func(@[[L1_NAME]]) seq
+// CHECK: acc.routine @[[F1_R2_NAME]] func(@[[F1_NAME]]) seq
+// CHECK: acc.routine @[[L2_R_NAME]] func(@[[L2_NAME]]) seq
diff --git a/clang/test/CIR/CodeGenOpenACC/routine-templ.cpp b/clang/test/CIR/CodeGenOpenACC/routine-templ.cpp
new file mode 100644
index 0000000000000..419442220a1ba
--- /dev/null
+++ b/clang/test/CIR/CodeGenOpenACC/routine-templ.cpp
@@ -0,0 +1,16 @@
+// RUN: %clang_cc1 -fopenacc -Wno-openacc-self-if-potential-conflict -emit-cir -fclangir %s -o - | FileCheck %s
+
+#pragma acc routine seq
+template<typename T>
+void func(){}
+
+void use() {
+  func<int>();
+  func<float>();
+}
+
+// CHECK: cir.func{{.*}} @[[T1_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[T1_R_NAME:.*]]]>}
+// CHECK: cir.func{{.*}} @[[T2_NAME:[^\(]*]]({{.*}}){{.*}} attributes {acc.routine_info = #acc.routine_info<[@[[T2_R_NAME:.*]]]>}
+//
+// CHECK: acc.routine @[[T1_R_NAME]] func(@[[T1_NAME]]) seq
+// CHECK: acc.routine @[[T2_R_NAME]] func(@[[T2_NAME]]) seq
diff --git a/clang/test/CIR/IR/func.cir b/clang/test/CIR/IR/func.cir
index 10df27b7e168f..d8906ab3e1301 100644
--- a/clang/test/CIR/IR/func.cir
+++ b/clang/test/CIR/IR/func.cir
@@ -186,3 +186,11 @@ cir.func @Foo_move_assign() special_member<#cir.cxx_assign<!rec_Foo, move>> {
 // CHECK: cir.func @Foo_move_assign() special_member<#cir.cxx_assign<!rec_Foo, move>> {
 // CHECK:   cir.return
 // CHECK: }
+
+cir.func @has_attrs() attributes {foo, baz = 5, floof = "flop"} {
+  cir.return
+}
+
+// CHECK: cir.func @has_attrs() attributes {baz = 5 : i64{{.*}}, floof = "flop", foo} {
+// CHECK:   cir.return
+// CHECK: }
diff --git a/clang/test/CIR/IR/inline-attrs.cir b/clang/test/CIR/IR/inline-attrs.cir
index f525abe240366..b1437a975eb78 100644
--- a/clang/test/CIR/IR/inline-attrs.cir
+++ b/clang/test/CIR/IR/inline-attrs.cir
@@ -3,31 +3,38 @@
 !s32i = !cir.int<s, 32>
 
 module {
-  cir.func @noinline_func(%arg0: !s32i) -> !s32i inline(never) {
+  // CHECK: cir.func no_inline @noinline_func(%arg0: !s32i) -> !s32i
+  cir.func no_inline @noinline_func(%arg0: !s32i) -> !s32i {
     cir.return %arg0 : !s32i
   }
-  cir.func @always_inline_func(%arg0: !s32i) -> !s32i inline(always) {
+
+  // CHECK: cir.func always_inline @always_inline_func(%arg0: !s32i) -> !s32i
+  cir.func always_inline @always_inline_func(%arg0: !s32i) -> !s32i {
     cir.return %arg0 : !s32i
   }
-  cir.func @inline_hint_func(%arg0: !s32i) -> !s32i inline(hint) {
+
+  // CHECK: cir.func inline_hint @inline_hint_func(%arg0: !s32i) -> !s32i
+  cir.func inline_hint @inline_hint_func(%arg0: !s32i) -> !s32i {
     cir.return %arg0 : !s32i
   }
+  
+  // CHECK: cir.func @regular_func(%arg0: !s32i) -> !s32i
   cir.func @regular_func(%arg0: !s32i) -> !s32i {
     cir.return %arg0 : !s32i
   }
-  cir.func dso_local @noinline_with_attrs(%arg0: !s32i) -> !s32i inline(never) {
+
+  // CHECK: cir.func no_inline dso_local @noinline_with_attrs(%arg0: !s32i) -> !s32i
+  cir.func no_inline dso_local @noinline_with_attrs(%arg0: !s32i) -> !s32i {
     cir.return %arg0 : !s32i
   }
-  cir.func private @noinline_decl(!s32i) -> !s32i inline(never)
-  cir.func private @always_inline_decl(!s32i) -> !s32i inline(always)
-  cir.func private @inline_hint_decl(!s32i) -> !s32i inline(hint)
+
+  // CHECK: cir.func no_inline private @noinline_decl(!s32i) -> !s32i
+  cir.func no_inline private @noinline_decl(!s32i) -> !s32i
+  
+  // CHECK: cir.func always_inline private @always_inline_decl(!s32i) -> !s32i
+  cir.func always_inline private @always_inline_decl(!s32i) -> !s32i
+
+  // CHECK: cir.func inline_hint private @inline_hint_decl(!s32i) -> !s32i
+  cir.func inline_hint private @inline_hint_decl(!s32i) -> !s32i
 }
 
-// CHECK: cir.func @noinline_func(%arg0: !s32i) -> !s32i inline(never)
-// CHECK: cir.func @always_inline_func(%arg0: !s32i) -> !s32i inline(always)
-// CHECK: cir.func @inline_hint_func(%arg0: !s32i) -> !s32i inline(hint)
-// CHECK: cir.func @regular_func(%arg0: !s32i) -> !s32i {
-// CHECK: cir.func dso_local @noinline_with_attrs(%arg0: !s32i) -> !s32i inline(never)
-// CHECK: cir.func private @noinline_decl(!s32i) -> !s32i inline(never)
-// CHECK: cir.func private @always_inline_decl(!s32i) -> !s32i inline(always)
-// CHECK: cir.func private @inline_hint_decl(!s32i) -> !s32i inline(hint)
diff --git a/clang/test/CIR/IR/invalid-func-attr.cir b/clang/test/CIR/IR/invalid-func-attr.cir
new file mode 100644
index 0000000000000..aaaaba7a7bf6f
--- /dev/null
+++ b/clang/test/CIR/IR/invalid-func-attr.cir
@@ -0,0 +1,11 @@
+// RUN: cir-opt %s -verify-diagnostics
+
+module {
+  cir.func @l0() {
+    cir.return
+  }
+
+  cir.func @disallowedAttr() attributes {comdat} { // expected-error{{custom op 'cir.func' attribute 'comdat' should not be specified in the explicit attribute list}}
+    cir.return
+  }
+}
diff --git a/clang/test/CIR/func-linkage.cpp b/clang/test/CIR/func-linkage.cpp
index d43f7ed273063..c90a69cba105d 100644
--- a/clang/test/CIR/func-linkage.cpp
+++ b/clang/test/CIR/func-linkage.cpp
@@ -8,7 +8,7 @@
 
 void a() {}
 
-// CIR: cir.func dso_local @_Z1av()
+// CIR: cir.func no_inline dso_local @_Z1av()
 // LLVM: define dso_local void @_Z1av()
 // OGCG: define dso_local void @_Z1av()
 
@@ -18,12 +18,12 @@ extern void b();
 // OGCG: declare void @_Z1bv()
 
 static void c() {}
-// CIR: cir.func internal private dso_local @_ZL1cv()
+// CIR: cir.func no_inline internal private dso_local @_ZL1cv()
 // LLVM: define internal void @_ZL1cv()
 // OGCG: define internal void @_ZL1cv()
 
 inline void d() {}
-// CIR: cir.func comdat linkonce_odr @_Z1dv()
+// CIR: cir.func {{.*}} comdat linkonce_odr @_Z1dv()
 // LLVM: define linkonce_odr void @_Z1dv()
 // OGCG: define linkonce_odr void @_Z1dv(){{.*}} comdat
 
@@ -31,7 +31,7 @@ namespace {
   void e() {}
 }
 
-// CIR: cir.func internal private dso_local @_ZN12_GLOBAL__N_11eEv()
+// CIR: cir.func {{.*}} internal private dso_local @_ZN12_GLOBAL__N_11eEv()
 // LLVM: define internal void @_ZN12_GLOBAL__N_11eEv()
 // OGCG: define internal void @_ZN12_GLOBAL__N_11eEv()
 
diff --git a/clang/test/CXX/drs/cwg30xx.cpp b/clang/test/CXX/drs/cwg30xx.cpp
index 0be3f0b1e88ea..648ba9e78cd66 100644
--- a/clang/test/CXX/drs/cwg30xx.cpp
+++ b/clang/test/CXX/drs/cwg30xx.cpp
@@ -7,7 +7,7 @@
 // RUN: %clang_cc1 -std=c++2c -pedantic-errors -verify=expected %s
 
 
-namespace cwg3005 { // cwg3005: 21 tentatively ready 2025-09-12
+namespace cwg3005 { // cwg3005: 21 ready 2025-09-12
 
 void f(
     int _, // #cwg3005-first-param
diff --git a/clang/test/CodeGen/Sparc/sparcv8-abi.c b/clang/test/CodeGen/Sparc/sparcv8-abi.c
index c5faf130890f8..7beddd20e5e4d 100644
--- a/clang/test/CodeGen/Sparc/sparcv8-abi.c
+++ b/clang/test/CodeGen/Sparc/sparcv8-abi.c
@@ -1,22 +1,52 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --filter "^define |^entry:" --version 6
 // RUN: %clang_cc1 -triple sparc-unknown-unknown -emit-llvm %s -o - | FileCheck %s
 
-// CHECK-LABEL: define{{.*}} { float, float } @p(ptr noundef byval({ float, float }) align 4 %a, ptr noundef byval({ float, float }) align 4 %b) #0 {
 float __complex__
+// CHECK-LABEL: define dso_local { float, float } @p(
+// CHECK-SAME: ptr noundef byval({ float, float }) align 4 [[A:%.*]], ptr noundef byval({ float, float }) align 4 [[B:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK:  [[ENTRY:.*:]]
+//
 p (float __complex__  a, float __complex__  b)
 {
   return 0;
 }
 
-// CHECK-LABEL: define{{.*}} { double, double } @q(ptr noundef byval({ double, double }) align 8 %a, ptr noundef byval({ double, double }) align 8 %b) #0 {
 double __complex__
+// CHECK-LABEL: define dso_local { double, double } @q(
+// CHECK-SAME: ptr noundef byval({ double, double }) align 8 [[A:%.*]], ptr noundef byval({ double, double }) align 8 [[B:%.*]]) #[[ATTR0]] {
+// CHECK:  [[ENTRY:.*:]]
+//
 q (double __complex__  a, double __complex__  b)
 {
   return 0;
 }
 
-// CHECK-LABEL: define{{.*}} { i64, i64 } @r(ptr noundef byval({ i64, i64 }) align 8 %a, ptr noundef byval({ i64, i64 }) align 8 %b) #0 {
 long long __complex__
+// CHECK-LABEL: define dso_local { i64, i64 } @r(
+// CHECK-SAME: ptr noundef byval({ i64, i64 }) align 8 [[A:%.*]], ptr noundef byval({ i64, i64 }) align 8 [[B:%.*]]) #[[ATTR0]] {
+// CHECK:  [[ENTRY:.*:]]
+//
 r (long long __complex__  a, long long __complex__  b)
 {
   return 0;
 }
+
+long double
+// CHECK-LABEL: define dso_local void @s(
+// CHECK-SAME: ptr dead_on_unwind noalias writable sret(fp128) align 8 [[AGG_RESULT:%.*]], ptr noundef byval(fp128) align 8 [[TMP0:%.*]]) #[[ATTR0]] {
+// CHECK:  [[ENTRY:.*:]]
+//
+s(long double a)
+{
+    return 0;
+}
+
+long double _Complex
+// CHECK-LABEL: define dso_local inreg { fp128, fp128 } @t(
+// CHECK-SAME: ptr noundef byval({ fp128, fp128 }) align 8 [[A:%.*]]) #[[ATTR0]] {
+// CHECK:  [[ENTRY:.*:]]
+//
+t(long double _Complex a)
+{
+    return 0;
+}
diff --git a/clang/test/CodeGen/X86/avx-builtins.c b/clang/test/CodeGen/X86/avx-builtins.c
index 00bcf9cc1da58..13da4292c5b92 100644
--- a/clang/test/CodeGen/X86/avx-builtins.c
+++ b/clang/test/CodeGen/X86/avx-builtins.c
@@ -968,6 +968,8 @@ __m128 test_mm256_cvtpd_ps(__m256d A) {
   return _mm256_cvtpd_ps(A);
 }
 
+TEST_CONSTEXPR(match_m128(_mm256_cvtpd_ps((__m256d){ 0.0, -1.0, +2.0, +3.5 }), 0.0f, -1.0f, +2.0f, +3.5f));
+
 __m256i test_mm256_cvtps_epi32(__m256 A) {
   // CHECK-LABEL: test_mm256_cvtps_epi32
   // CHECK: call <8 x i32> @llvm.x86.avx.cvt.ps2dq.256(<8 x float> %{{.*}})
diff --git a/clang/test/CodeGen/X86/avx2-builtins.c b/clang/test/CodeGen/X86/avx2-builtins.c
index d6facfea8962e..c9474e94476fc 100644
--- a/clang/test/CodeGen/X86/avx2-builtins.c
+++ b/clang/test/CodeGen/X86/avx2-builtins.c
@@ -1111,12 +1111,34 @@ __m256i test_mm256_permute4x64_epi64(__m256i a) {
   // CHECK: shufflevector <4 x i64> %{{.*}}, <4 x i64> poison, <4 x i32> <i32 3, i32 0, i32 2, i32 0>
   return _mm256_permute4x64_epi64(a, 35);
 }
+// Control value 0x00: [0,0,0,0] -> broadcast element 0
+TEST_CONSTEXPR(match_v4di(_mm256_permute4x64_epi64(((__m256i)(__v4di){40LL, 30LL, 20LL, 10LL}), 0x00), 40LL, 40LL, 40LL, 40LL));
+// Control value 0x1B: [0,1,2,3] -> reverse order [3,2,1,0] = [D,C,B,A]
+TEST_CONSTEXPR(match_v4di(_mm256_permute4x64_epi64(((__m256i)(__v4di){40LL, 30LL, 20LL, 10LL}), 0x1B), 10LL, 20LL, 30LL, 40LL));
+// Control value 0x39: [1,2,3,0] -> rotate left [B,C,D,A]
+TEST_CONSTEXPR(match_v4di(_mm256_permute4x64_epi64(((__m256i)(__v4di){40LL, 30LL, 20LL, 10LL}), 0x39), 30LL, 20LL, 10LL, 40LL));
+// Control value 0x12: [2,0,1,0] -> [C,A,B,A]
+TEST_CONSTEXPR(match_v4di(_mm256_permute4x64_epi64(((__m256i)(__v4di){40LL, 30LL, 20LL, 10LL}), 0x12), 20LL, 40LL, 30LL, 40LL));
+// Control value 0xE4: [3,2,1,0] -> identity [A,B,C,D]
+TEST_CONSTEXPR(match_v4di(_mm256_permute4x64_epi64(((__m256i)(__v4di){40LL, 30LL, 20LL, 10LL}), 0xE4), 40LL, 30LL, 20LL, 10LL));
+// Test with negative values
+TEST_CONSTEXPR(match_v4di(_mm256_permute4x64_epi64(((__m256i)(__v4di){-40LL, -30LL, -20LL, -10LL}), 0x1B), -10LL, -20LL, -30LL, -40LL));
 
 __m256d test_mm256_permute4x64_pd(__m256d a) {
   // CHECK-LABEL: test_mm256_permute4x64_pd
   // CHECK: shufflevector <4 x double> %{{.*}}, <4 x double> poison, <4 x i32> <i32 1, i32 2, i32 1, i32 0>
   return _mm256_permute4x64_pd(a, 25);
 }
+// Control value 0x00: [0,0,0,0] -> broadcast element 0
+TEST_CONSTEXPR(match_m256d(_mm256_permute4x64_pd(((__m256d){4.0, 3.0, 2.0, 1.0}), 0x00), 4.0, 4.0, 4.0, 4.0));
+// Control value 0x1B: [0,1,2,3] -> reverse order [3,2,1,0] = [D,C,B,A]
+TEST_CONSTEXPR(match_m256d(_mm256_permute4x64_pd(((__m256d){4.0, 3.0, 2.0, 1.0}), 0x1B), 1.0, 2.0, 3.0, 4.0));
+// Control value 0x39: [1,2,3,0] -> rotate left [B,C,D,A]
+TEST_CONSTEXPR(match_m256d(_mm256_permute4x64_pd(((__m256d){4.0, 3.0, 2.0, 1.0}), 0x39), 3.0, 2.0, 1.0, 4.0));
+// Control value 0x12: [2,0,1,0] -> [C,A,B,A]
+TEST_CONSTEXPR(match_m256d(_mm256_permute4x64_pd(((__m256d){4.0, 3.0, 2.0, 1.0}), 0x12), 2.0, 4.0, 3.0, 4.0));
+// Control value 0xE4: [3,2,1,0] -> identity [A,B,C,D]
+TEST_CONSTEXPR(match_m256d(_mm256_permute4x64_pd(((__m256d){4.0, 3.0, 2.0, 1.0}), 0xE4), 4.0, 3.0, 2.0, 1.0));
 
 __m256i test_mm256_permutevar8x32_epi32(__m256i a, __m256i b) {
   // CHECK-LABEL: test_mm256_permutevar8x32_epi32
diff --git a/clang/test/CodeGen/X86/avx512bw-builtins.c b/clang/test/CodeGen/X86/avx512bw-builtins.c
index c9c30dab389db..7cdec9b4cbbee 100644
--- a/clang/test/CodeGen/X86/avx512bw-builtins.c
+++ b/clang/test/CodeGen/X86/avx512bw-builtins.c
@@ -534,6 +534,10 @@ __mmask32 test_kshiftli_mask32(__m512i A, __m512i B, __m512i C, __m512i D) {
   // CHECK: [[RES:%.*]] = shufflevector <32 x i1> zeroinitializer, <32 x i1> [[VAL]], <32 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32>
   return _mm512_mask_cmpneq_epu16_mask(_kshiftli_mask32(_mm512_cmpneq_epu16_mask(A, B), 31), C, D);
 }
+TEST_CONSTEXPR(_kshiftli_mask32(0x00000001, 1) == 0x00000002);
+TEST_CONSTEXPR(_kshiftli_mask32(0x00000001, 31) == 0x80000000);
+TEST_CONSTEXPR(_kshiftli_mask32(0x00000001, 32) == 0x00000000);
+TEST_CONSTEXPR(_kshiftli_mask32(0x0000FFFF, 8) == 0x00FFFF00);
 
 __mmask32 test_kshiftri_mask32(__m512i A, __m512i B, __m512i C, __m512i D) {
   // CHECK-LABEL: test_kshiftri_mask32
@@ -541,6 +545,10 @@ __mmask32 test_kshiftri_mask32(__m512i A, __m512i B, __m512i C, __m512i D) {
   // CHECK: [[RES:%.*]] = shufflevector <32 x i1> [[VAL]], <32 x i1> zeroinitializer, <32 x i32> <i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62>
   return _mm512_mask_cmpneq_epu16_mask(_kshiftri_mask32(_mm512_cmpneq_epu16_mask(A, B), 31), C, D);
 }
+TEST_CONSTEXPR(_kshiftri_mask32(0x80000000, 1) == 0x40000000);
+TEST_CONSTEXPR(_kshiftri_mask32(0x80000000, 31) == 0x00000001);
+TEST_CONSTEXPR(_kshiftri_mask32(0x80000000, 32) == 0x00000000);
+TEST_CONSTEXPR(_kshiftri_mask32(0xFFFF0000, 8) == 0x00FFFF00);
 
 __mmask64 test_kshiftli_mask64(__m512i A, __m512i B, __m512i C, __m512i D) {
   // CHECK-LABEL: test_kshiftli_mask64
@@ -548,6 +556,10 @@ __mmask64 test_kshiftli_mask64(__m512i A, __m512i B, __m512i C, __m512i D) {
   // CHECK: [[RES:%.*]] = shufflevector <64 x i1> zeroinitializer, <64 x i1> [[VAL]], <64 x i32> <i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63, i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71, i32 72, i32 73, i32 74, i32 75, i32 76, i32 77, i32 78, i32 79, i32 80, i32 81, i32 82, i32 83, i32 84, i32 85, i32 86, i32 87, i32 88, i32 89, i32 90, i32 91, i32 92, i32 93, i32 94, i32 95>
   return _mm512_mask_cmpneq_epu8_mask(_kshiftli_mask64(_mm512_cmpneq_epu8_mask(A, B), 32), C, D);
 }
+TEST_CONSTEXPR(_kshiftli_mask64(0x0000000000000001ULL, 1) == 0x0000000000000002ULL);
+TEST_CONSTEXPR(_kshiftli_mask64(0x0000000000000001ULL, 63) == 0x8000000000000000ULL);
+TEST_CONSTEXPR(_kshiftli_mask64(0x0000000000000001ULL, 64) == 0x0000000000000000ULL);
+TEST_CONSTEXPR(_kshiftli_mask64(0x00000000FFFFFFFFULL, 16) == 0x0000FFFFFFFF0000ULL);
 
 __mmask64 test_kshiftri_mask64(__m512i A, __m512i B, __m512i C, __m512i D) {
   // CHECK-LABEL: test_kshiftri_mask64
@@ -555,27 +567,41 @@ __mmask64 test_kshiftri_mask64(__m512i A, __m512i B, __m512i C, __m512i D) {
   // CHECK: [[RES:%.*]] = shufflevector <64 x i1> [[VAL]], <64 x i1> zeroinitializer, <64 x i32> <i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63, i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71, i32 72, i32 73, i32 74, i32 75, i32 76, i32 77, i32 78, i32 79, i32 80, i32 81, i32 82, i32 83, i32 84, i32 85, i32 86, i32 87, i32 88, i32 89, i32 90, i32 91, i32 92, i32 93, i32 94, i32 95>
   return _mm512_mask_cmpneq_epu8_mask(_kshiftri_mask64(_mm512_cmpneq_epu8_mask(A, B), 32), C, D);
 }
+TEST_CONSTEXPR(_kshiftri_mask64(0x8000000000000000ULL, 1) == 0x4000000000000000ULL);
+TEST_CONSTEXPR(_kshiftri_mask64(0x8000000000000000ULL, 63) == 0x0000000000000001ULL);
+TEST_CONSTEXPR(_kshiftri_mask64(0x8000000000000000ULL, 64) == 0x0000000000000000ULL);
+TEST_CONSTEXPR(_kshiftri_mask64(0xFFFFFFFF00000000ULL, 16) == 0x0000FFFFFFFF0000ULL);
 
 unsigned int test_cvtmask32_u32(__m512i A, __m512i B) {
   // CHECK-LABEL: test_cvtmask32_u32
   return _cvtmask32_u32(_mm512_cmpneq_epu16_mask(A, B));
 }
 
+TEST_CONSTEXPR(_cvtmask32_u32((__mmask32)0xDEADBEEF) == 0xDEADBEEF);
+
 unsigned long long test_cvtmask64_u64(__m512i A, __m512i B) {
   // CHECK-LABEL: test_cvtmask64_u64
   return _cvtmask64_u64(_mm512_cmpneq_epu8_mask(A, B));
 }
 
+TEST_CONSTEXPR(_cvtmask64_u64((__mmask64)0x123456789ABCDEF0ULL) == 0x123456789ABCDEF0ULL);
+
 __mmask32 test_cvtu32_mask32(__m512i A, __m512i B, unsigned int C) {
   // CHECK-LABEL: test_cvtu32_mask32
   return _mm512_mask_cmpneq_epu16_mask(_cvtu32_mask32(C), A, B);
 }
 
+TEST_CONSTEXPR(_cvtu32_mask32(0x13579BDF) == (__mmask32)0x13579BDF);
+TEST_CONSTEXPR(_cvtu32_mask32(_cvtmask32_u32((__mmask32)0x2468ACE0)) == (__mmask32)0x2468ACE0);
+
 __mmask64 test_cvtu64_mask64(__m512i A, __m512i B, unsigned long long C) {
   // CHECK-LABEL: test_cvtu64_mask64
   return _mm512_mask_cmpneq_epu8_mask(_cvtu64_mask64(C), A, B);
 }
 
+TEST_CONSTEXPR(_cvtu64_mask64(0x0F0F0F0F0F0F0F0FULL) == (__mmask64)0x0F0F0F0F0F0F0F0FULL);
+TEST_CONSTEXPR(_cvtu64_mask64(_cvtmask64_u64((__mmask64)0xF0F0F0F0F0F0F0F0ULL)) == (__mmask64)0xF0F0F0F0F0F0F0F0ULL);
+
 __mmask32 test_load_mask32(__mmask32 *A, __m512i B, __m512i C) {
   // CHECK-LABEL: test_load_mask32
   // CHECK: [[LOAD:%.*]] = load i32, ptr %{{.*}}
diff --git a/clang/test/CodeGen/X86/avx512dq-builtins.c b/clang/test/CodeGen/X86/avx512dq-builtins.c
index 542d4446a3690..d8647b5547ceb 100644
--- a/clang/test/CodeGen/X86/avx512dq-builtins.c
+++ b/clang/test/CodeGen/X86/avx512dq-builtins.c
@@ -364,6 +364,10 @@ __mmask8 test_kshiftli_mask8(__m512i A, __m512i B, __m512i C, __m512i D) {
   // CHECK: [[RES:%.*]] = shufflevector <8 x i1> zeroinitializer, <8 x i1> [[VAL]], <8 x i32> <i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13>
   return _mm512_mask_cmpneq_epu64_mask(_kshiftli_mask8(_mm512_cmpneq_epu64_mask(A, B), 2), C, D);
 }
+TEST_CONSTEXPR(_kshiftli_mask8(0x01, 1) == 0x02);
+TEST_CONSTEXPR(_kshiftli_mask8(0x01, 7) == 0x80);
+TEST_CONSTEXPR(_kshiftli_mask8(0x01, 8) == 0x00);
+TEST_CONSTEXPR(_kshiftli_mask8(0x0F, 2) == 0x3C);
 
 __mmask8 test_kshiftri_mask8(__m512i A, __m512i B, __m512i C, __m512i D) {
   // CHECK-LABEL: test_kshiftri_mask8
@@ -371,6 +375,10 @@ __mmask8 test_kshiftri_mask8(__m512i A, __m512i B, __m512i C, __m512i D) {
   // CHECK: [[RES:%.*]] = shufflevector <8 x i1> [[VAL]], <8 x i1> zeroinitializer, <8 x i32> <i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9>
   return _mm512_mask_cmpneq_epu64_mask(_kshiftri_mask8(_mm512_cmpneq_epu64_mask(A, B), 2), C, D);
 }
+TEST_CONSTEXPR(_kshiftri_mask8(0x80, 1) == 0x40);
+TEST_CONSTEXPR(_kshiftri_mask8(0x80, 7) == 0x01);
+TEST_CONSTEXPR(_kshiftri_mask8(0x80, 8) == 0x00);
+TEST_CONSTEXPR(_kshiftri_mask8(0xF0, 2) == 0x3C);
 
 unsigned int test_cvtmask8_u32(__m512i A, __m512i B) {
   // CHECK-LABEL: test_cvtmask8_u32
@@ -378,12 +386,17 @@ unsigned int test_cvtmask8_u32(__m512i A, __m512i B) {
   return _cvtmask8_u32(_mm512_cmpneq_epu64_mask(A, B));
 }
 
+TEST_CONSTEXPR(_cvtmask8_u32((__mmask8)0x5A) == 0x5A);
+
 __mmask8 test_cvtu32_mask8(__m512i A, __m512i B, unsigned int C) {
   // CHECK-LABEL: test_cvtu32_mask8
   // CHECK: trunc i32 %{{.*}} to i8
   return _mm512_mask_cmpneq_epu64_mask(_cvtu32_mask8(C), A, B);
 }
 
+TEST_CONSTEXPR(_cvtu32_mask8(0xB7) == (__mmask8)0xB7);
+TEST_CONSTEXPR(_cvtu32_mask8(_cvtmask8_u32((__mmask8)0xDE)) == (__mmask8)0xDE);
+
 __mmask8 test_load_mask8(__mmask8 *A, __m512i B, __m512i C) {
   // CHECK-LABEL: test_load_mask8
   // CHECK: [[LOAD:%.*]] = load i8, ptr %{{.*}}
diff --git a/clang/test/CodeGen/X86/avx512f-builtins.c b/clang/test/CodeGen/X86/avx512f-builtins.c
index 6401a0e55a83b..ab047a8ecd55e 100644
--- a/clang/test/CodeGen/X86/avx512f-builtins.c
+++ b/clang/test/CodeGen/X86/avx512f-builtins.c
@@ -9572,6 +9572,10 @@ __mmask16 test_kshiftli_mask16(__m512i A, __m512i B, __m512i C, __m512i D) {
   // CHECK: bitcast <16 x i1> {{.*}} to i16
   return _mm512_mask_cmpneq_epu32_mask(_kshiftli_mask16(_mm512_cmpneq_epu32_mask(A, B), 1), C, D);
 }
+TEST_CONSTEXPR(_kshiftli_mask16(0x0001, 1) == 0x0002);
+TEST_CONSTEXPR(_kshiftli_mask16(0x0001, 15) == 0x8000);
+TEST_CONSTEXPR(_kshiftli_mask16(0x0001, 16) == 0x0000);
+TEST_CONSTEXPR(_kshiftli_mask16(0x00FF, 4) == 0x0FF0);
 
 __mmask16 test_kshiftri_mask16(__m512i A, __m512i B, __m512i C, __m512i D) {
   // CHECK-LABEL: test_kshiftri_mask16
@@ -9580,6 +9584,10 @@ __mmask16 test_kshiftri_mask16(__m512i A, __m512i B, __m512i C, __m512i D) {
   // CHECK: bitcast <16 x i1> {{.*}} to i16
   return _mm512_mask_cmpneq_epu32_mask(_kshiftri_mask16(_mm512_cmpneq_epu32_mask(A, B), 1), C, D);
 }
+TEST_CONSTEXPR(_kshiftri_mask16(0x8000, 1) == 0x4000);
+TEST_CONSTEXPR(_kshiftri_mask16(0x8000, 15) == 0x0001);
+TEST_CONSTEXPR(_kshiftri_mask16(0x8000, 16) == 0x0000);
+TEST_CONSTEXPR(_kshiftri_mask16(0xFF00, 4) == 0x0FF0);
 
 unsigned int test_cvtmask16_u32(__m512i A, __m512i B) {
   // CHECK-LABEL: test_cvtmask16_u32
@@ -9589,6 +9597,8 @@ unsigned int test_cvtmask16_u32(__m512i A, __m512i B) {
   return _cvtmask16_u32(_mm512_cmpneq_epu32_mask(A, B));
 }
 
+TEST_CONSTEXPR(_cvtmask16_u32((__mmask16)0xBEEF) == 0xBEEF);
+
 __mmask16 test_cvtu32_mask16(__m512i A, __m512i B, unsigned int C) {
   // CHECK-LABEL: test_cvtu32_mask16
   // CHECK: trunc i32 %{{.*}} to i16
@@ -9596,6 +9606,9 @@ __mmask16 test_cvtu32_mask16(__m512i A, __m512i B, unsigned int C) {
   return _mm512_mask_cmpneq_epu32_mask(_cvtu32_mask16(C), A, B);
 }
 
+TEST_CONSTEXPR(_cvtu32_mask16(0xCAFE) == (__mmask16)0xCAFE);
+TEST_CONSTEXPR(_cvtu32_mask16(_cvtmask16_u32((__mmask16)0x1357)) == (__mmask16)0x1357);
+
 __mmask16 test_load_mask16(__mmask16 *A, __m512i B, __m512i C) {
   // CHECK-LABEL: test_load_mask16
   // CHECK: [[LOAD:%.*]] = load i16, ptr %{{.*}}{{$}}
@@ -10615,6 +10628,8 @@ __m256 test_mm512_cvtpd_ps (__m512d __A)
   return _mm512_cvtpd_ps (__A);
 }
 
+TEST_CONSTEXPR(match_m256(_mm512_cvtpd_ps((__m512d){ -1.0, +2.0, +4.0, +8.0, +16.0, +32.0, +64.0, +128.0 }), -1.0f, +2.0f, +4.0f, +8.0f, +16.0f, +32.0f, +64.0f, +128.0f));
+
 __m256 test_mm512_mask_cvtpd_ps (__m256 __W, __mmask8 __U, __m512d __A)
 {
   // CHECK-LABEL: test_mm512_mask_cvtpd_ps 
@@ -10622,6 +10637,8 @@ __m256 test_mm512_mask_cvtpd_ps (__m256 __W, __mmask8 __U, __m512d __A)
   return _mm512_mask_cvtpd_ps (__W,__U,__A);
 }
 
+TEST_CONSTEXPR(match_m256(_mm512_mask_cvtpd_ps((__m256){ 9.0f, 9.0f, 9.0f, 9.0f, 9.0f, 9.0f, 9.0f, 9.0f }, 0x05, (__m512d){ -1.0, +2.0, +4.0, +8.0, +16.0, +32.0, +64.0, +128.0 }), -1.0f, 9.0f, +4.0f, 9.0f, 9.0f, 9.0f, 9.0f, 9.0f));
+
 __m512 test_mm512_cvtpd_pslo(__m512d __A)
 {
   // CHECK-LABEL: test_mm512_cvtpd_pslo
@@ -10631,6 +10648,8 @@ __m512 test_mm512_cvtpd_pslo(__m512d __A)
   return _mm512_cvtpd_pslo(__A);
 }
 
+TEST_CONSTEXPR(match_m512(_mm512_cvtpd_pslo((__m512d){ -1.0, +2.0, +4.0, +8.0, +16.0, +32.0, +64.0, +128.0 }), -1.0f, +2.0f, +4.0f, +8.0f, +16.0f, +32.0f, +64.0f, +128.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f));
+
 __m512 test_mm512_mask_cvtpd_pslo(__m512 __W, __mmask8 __U, __m512d __A) {
   // CHECK-LABEL: test_mm512_mask_cvtpd_pslo
   // CHECK: @llvm.x86.avx512.mask.cvtpd2ps.512
@@ -10639,6 +10658,8 @@ __m512 test_mm512_mask_cvtpd_pslo(__m512 __W, __mmask8 __U, __m512d __A) {
   return _mm512_mask_cvtpd_pslo(__W, __U, __A);
 }
 
+TEST_CONSTEXPR(match_m512(_mm512_mask_cvtpd_pslo((__m512){ 9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f, 9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f }, 0x3, (__m512d){ -1.0, +2.0, +4.0, +8.0, +16.0, +32.0, +64.0, +128.0 }), -1.0f, +2.0f, 9.0f, 9.0f, 9.0f, 9.0f, 9.0f, 9.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f));
+
 __m256 test_mm512_maskz_cvtpd_ps (__mmask8 __U, __m512d __A)
 {
   // CHECK-LABEL: test_mm512_maskz_cvtpd_ps 
@@ -11860,12 +11881,16 @@ __m128 test_mm_mask_cvtsd_ss(__m128 __W, __mmask8 __U, __m128 __A, __m128d __B)
   return _mm_mask_cvtsd_ss(__W, __U, __A, __B); 
 }
 
+TEST_CONSTEXPR(match_m128(_mm_mask_cvtsd_ss((__m128){ 9.0f, 5.0f, 6.0f, 7.0f }, 0x1, (__m128){ 1.0f, 2.0f, 3.0f, 4.0f }, (__m128d){ -1.0, 42.0 }), -1.0f, 2.0f, 3.0f, 4.0f));
+
 __m128 test_mm_maskz_cvtsd_ss(__mmask8 __U, __m128 __A, __m128d __B) {
   // CHECK-LABEL: test_mm_maskz_cvtsd_ss
   // CHECK: @llvm.x86.avx512.mask.cvtsd2ss.round
   return _mm_maskz_cvtsd_ss(__U, __A, __B); 
 }
 
+TEST_CONSTEXPR(match_m128(_mm_maskz_cvtsd_ss(0x1, (__m128){ 1.0f, 2.0f, 3.0f, 4.0f }, (__m128d){ -1.0, 42.0 }), -1.0f, 2.0f, 3.0f, 4.0f));
+
 __m512i test_mm512_setzero_epi32(void)
 {
   // CHECK-LABEL: test_mm512_setzero_epi32
diff --git a/clang/test/CodeGen/X86/avx512vl-builtins.c b/clang/test/CodeGen/X86/avx512vl-builtins.c
index 5f6d8360888f5..013c19ba7a929 100644
--- a/clang/test/CodeGen/X86/avx512vl-builtins.c
+++ b/clang/test/CodeGen/X86/avx512vl-builtins.c
@@ -3999,23 +3999,31 @@ __m128 test_mm_mask_cvtpd_ps(__m128 __W, __mmask8 __U, __m128d __A) {
   // CHECK: @llvm.x86.avx512.mask.cvtpd2ps
   return _mm_mask_cvtpd_ps(__W,__U,__A); 
 }
+
+TEST_CONSTEXPR(match_m128(_mm_mask_cvtpd_ps((__m128){ 9.0f, 9.0f, 9.0f, 9.0f }, 0x3, (__m128d){ -1.0, +2.0 }), -1.0f, +2.0f, 9.0f, 9.0f));
 __m128 test_mm_maskz_cvtpd_ps(__mmask8 __U, __m128d __A) {
   // CHECK-LABEL: test_mm_maskz_cvtpd_ps
   // CHECK: @llvm.x86.avx512.mask.cvtpd2ps
   return _mm_maskz_cvtpd_ps(__U,__A); 
 }
+
+TEST_CONSTEXPR(match_m128(_mm_maskz_cvtpd_ps(0x1, (__m128d){ -1.0, +2.0 }), -1.0f, 0.0f, 0.0f, 0.0f));
 __m128 test_mm256_mask_cvtpd_ps(__m128 __W, __mmask8 __U, __m256d __A) {
   // CHECK-LABEL: test_mm256_mask_cvtpd_ps
   // CHECK: @llvm.x86.avx.cvt.pd2.ps.256
   // CHECK: select <4 x i1> {{.*}}, <4 x float> {{.*}}, <4 x float> {{.*}}
   return _mm256_mask_cvtpd_ps(__W,__U,__A); 
 }
+
+TEST_CONSTEXPR(match_m128(_mm256_mask_cvtpd_ps((__m128){ 9.0f, 9.0f, 9.0f, 9.0f }, 0xF, (__m256d){ 0.0, -1.0, +2.0, +3.5 }), 0.0f, -1.0f, +2.0f, +3.5f));
 __m128 test_mm256_maskz_cvtpd_ps(__mmask8 __U, __m256d __A) {
   // CHECK-LABEL: test_mm256_maskz_cvtpd_ps
   // CHECK: @llvm.x86.avx.cvt.pd2.ps.256
   // CHECK: select <4 x i1> {{.*}}, <4 x float> {{.*}}, <4 x float> {{.*}}
   return _mm256_maskz_cvtpd_ps(__U,__A); 
 }
+
+TEST_CONSTEXPR(match_m128(_mm256_maskz_cvtpd_ps(0x5, (__m256d){ 0.0, -1.0, +2.0, +3.5 }), 0.0f, 0.0f, +2.0f, 0.0f));
 __m128i test_mm_cvtpd_epu32(__m128d __A) {
   // CHECK-LABEL: test_mm_cvtpd_epu32
   // CHECK: @llvm.x86.avx512.mask.cvtpd2udq.128
diff --git a/clang/test/CodeGen/X86/sse2-builtins.c b/clang/test/CodeGen/X86/sse2-builtins.c
index ed1ac84b8c4a3..c4975b456ba22 100644
--- a/clang/test/CodeGen/X86/sse2-builtins.c
+++ b/clang/test/CodeGen/X86/sse2-builtins.c
@@ -573,6 +573,8 @@ __m128 test_mm_cvtpd_ps(__m128d A) {
   return _mm_cvtpd_ps(A);
 }
 
+TEST_CONSTEXPR(match_m128(_mm_cvtpd_ps((__m128d){ -1.0, +2.0 }), -1.0f, +2.0f, 0.0f, 0.0f));
+
 __m128i test_mm_cvtps_epi32(__m128 A) {
   // CHECK-LABEL: test_mm_cvtps_epi32
   // CHECK: call <4 x i32> @llvm.x86.sse2.cvtps2dq(<4 x float> %{{.*}})
@@ -614,6 +616,8 @@ __m128 test_mm_cvtsd_ss(__m128 A, __m128d B) {
   return _mm_cvtsd_ss(A, B);
 }
 
+TEST_CONSTEXPR(match_m128(_mm_cvtsd_ss((__m128){ 9.0f, 5.0f, 6.0f, 7.0f }, (__m128d){ -1.0, 42.0 }), -1.0f, 5.0f, 6.0f, 7.0f));
+
 int test_mm_cvtsi128_si32(__m128i A) {
   // CHECK-LABEL: test_mm_cvtsi128_si32
   // CHECK: extractelement <4 x i32> %{{.*}}, i32 0
diff --git a/clang/test/CodeGen/alloc-token-lower.c b/clang/test/CodeGen/alloc-token-lower.c
index 43d9a6337b7db..2d87b02c6a288 100644
--- a/clang/test/CodeGen/alloc-token-lower.c
+++ b/clang/test/CodeGen/alloc-token-lower.c
@@ -1,16 +1,20 @@
 // Test optimization pipelines do not interfere with AllocToken lowering, and we
 // pass on function attributes correctly.
 //
-// RUN: %clang_cc1     -fsanitize=alloc-token -triple x86_64-linux-gnu -emit-llvm %s -o - | FileCheck %s
-// RUN: %clang_cc1 -O1 -fsanitize=alloc-token -triple x86_64-linux-gnu -emit-llvm %s -o - | FileCheck %s
-// RUN: %clang_cc1 -O2 -fsanitize=alloc-token -triple x86_64-linux-gnu -emit-llvm %s -o - | FileCheck %s
+// RUN: %clang_cc1     -fsanitize=alloc-token -triple x86_64-linux-gnu -emit-llvm %s -o - | FileCheck %s --check-prefixes=CHECK,DEFAULT
+// RUN: %clang_cc1 -O1 -fsanitize=alloc-token -triple x86_64-linux-gnu -emit-llvm %s -o - | FileCheck %s --check-prefixes=CHECK,DEFAULT
+// RUN: %clang_cc1 -O2 -fsanitize=alloc-token -triple x86_64-linux-gnu -emit-llvm %s -o - | FileCheck %s --check-prefixes=CHECK,DEFAULT
+// RUN: %clang_cc1     -fsanitize=alloc-token -fsanitize-alloc-token-fast-abi -triple x86_64-linux-gnu -emit-llvm %s -o - | FileCheck %s --check-prefixes=CHECK,FASTABI
+// RUN: %clang_cc1 -O1 -fsanitize=alloc-token -fsanitize-alloc-token-fast-abi -triple x86_64-linux-gnu -emit-llvm %s -o - | FileCheck %s --check-prefixes=CHECK,FASTABI
+// RUN: %clang_cc1 -O2 -fsanitize=alloc-token -fsanitize-alloc-token-fast-abi -triple x86_64-linux-gnu -emit-llvm %s -o - | FileCheck %s --check-prefixes=CHECK,FASTABI
 
 typedef __typeof(sizeof(int)) size_t;
 
 void *malloc(size_t size);
 
 // CHECK-LABEL: @test_malloc(
-// CHECK: call{{.*}} ptr @__alloc_token_malloc(i64 noundef 4, i64 2689373973731826898){{.*}} !alloc_token [[META_INT:![0-9]+]]
+// DEFAULT: call{{.*}} ptr @__alloc_token_malloc(i64 noundef 4, i64 2689373973731826898){{.*}} !alloc_token [[META_INT:![0-9]+]]
+// FASTABI: call{{.*}} ptr @__alloc_token_2689373973731826898_malloc(i64 noundef 4){{.*}} !alloc_token [[META_INT:![0-9]+]]
 void *test_malloc() {
   return malloc(sizeof(int));
 }
@@ -26,6 +30,7 @@ void *no_sanitize_malloc(size_t size) __attribute__((no_sanitize("alloc-token"))
 // allocator will only implement standard allocation functions.
 void *nonstandard_malloc(size_t size) __attribute__((malloc));
 // CHECK-LABEL: @test_nonlibcall_malloc(
+// CHECK-NOT: __alloc_token_
 // CHECK: call{{.*}} ptr @nonstandard_malloc(i64 noundef 4){{.*}} !alloc_token [[META_INT]]
 void *test_nonlibcall_malloc() {
   return nonstandard_malloc(sizeof(int));
diff --git a/clang/test/CodeGen/alloc-token-module-flags.c b/clang/test/CodeGen/alloc-token-module-flags.c
new file mode 100644
index 0000000000000..6fc0d619915c8
--- /dev/null
+++ b/clang/test/CodeGen/alloc-token-module-flags.c
@@ -0,0 +1,22 @@
+// RUN: %clang_cc1 -fsanitize=alloc-token -emit-llvm -o - %s | FileCheck %s --check-prefix=DEFAULT
+// RUN: %clang_cc1 -fsanitize=alloc-token -falloc-token-mode=increment -emit-llvm -o - %s | FileCheck %s --check-prefix=INCREMENT
+// RUN: %clang_cc1 -fsanitize=alloc-token -falloc-token-max=100 -emit-llvm -o - %s | FileCheck %s --check-prefix=MAX
+// RUN: %clang_cc1 -fsanitize=alloc-token -fsanitize-alloc-token-fast-abi -emit-llvm -o - %s | FileCheck %s --check-prefix=FASTABI
+// RUN: %clang_cc1 -fsanitize=alloc-token -fsanitize-alloc-token-extended -emit-llvm -o - %s | FileCheck %s --check-prefix=EXTENDED
+
+// DEFAULT-NOT: !"alloc-token-mode"
+// DEFAULT-NOT: !"alloc-token-max"
+// DEFAULT-NOT: !"alloc-token-fast-abi"
+// DEFAULT-NOT: !"alloc-token-extended"
+
+// INCREMENT: !llvm.module.flags = !{{{.*}}![[FLAG:[0-9]+]]{{.*}}}
+// INCREMENT: ![[FLAG]] = !{i32 1, !"alloc-token-mode", !"increment"}
+
+// MAX: !llvm.module.flags = !{{{.*}}![[FLAG:[0-9]+]]{{.*}}}
+// MAX: ![[FLAG]] = !{i32 1, !"alloc-token-max", i64 100}
+
+// FASTABI: !llvm.module.flags = !{{{.*}}![[FLAG:[0-9]+]]{{.*}}}
+// FASTABI: ![[FLAG]] = !{i32 1, !"alloc-token-fast-abi", i32 1}
+
+// EXTENDED: !llvm.module.flags = !{{{.*}}![[FLAG:[0-9]+]]{{.*}}}
+// EXTENDED: ![[FLAG]] = !{i32 1, !"alloc-token-extended", i32 1}
diff --git a/clang/test/CodeGen/arm-mve-intrinsics/ternary.c b/clang/test/CodeGen/arm-mve-intrinsics/ternary.c
index 768d397cb5611..3ab84459e0515 100644
--- a/clang/test/CodeGen/arm-mve-intrinsics/ternary.c
+++ b/clang/test/CodeGen/arm-mve-intrinsics/ternary.c
@@ -1,15 +1,22 @@
 // NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
-// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -O0 -disable-O0-optnone -emit-llvm -o - %s | opt -S -passes=sroa | FileCheck %s
-// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -O0 -disable-O0-optnone -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=sroa | FileCheck %s
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -O0 -disable-O0-optnone -emit-llvm -o - %s | opt -S -passes=sroa | FileCheck %s --check-prefixes=CHECK,CHECK-NOSTRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -O0 -disable-O0-optnone -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=sroa | FileCheck %s --check-prefixes=CHECK,CHECK-NOSTRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -O0 -disable-O0-optnone -frounding-math -fexperimental-strict-floating-point -emit-llvm -o - %s | opt -S -passes=sroa | FileCheck %s --check-prefixes=CHECK,CHECK-STRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -O0 -disable-O0-optnone -frounding-math -fexperimental-strict-floating-point -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=sroa | FileCheck %s --check-prefixes=CHECK,CHECK-STRICT
 
 // REQUIRES: aarch64-registered-target || arm-registered-target
 
 #include <arm_mve.h>
 
-// CHECK-LABEL: @test_vfmaq_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.fma.v8f16(<8 x half> [[B:%.*]], <8 x half> [[C:%.*]], <8 x half> [[A:%.*]])
-// CHECK-NEXT:    ret <8 x half> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vfmaq_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.fma.v8f16(<8 x half> [[B:%.*]], <8 x half> [[C:%.*]], <8 x half> [[A:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vfmaq_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.arm.mve.fma.v8f16(<8 x half> [[B:%.*]], <8 x half> [[C:%.*]], <8 x half> [[A:%.*]]) #[[ATTR2:[0-9]+]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP0]]
 //
 float16x8_t test_vfmaq_f16(float16x8_t a, float16x8_t b, float16x8_t c) {
 #ifdef POLYMORPHIC
@@ -19,10 +26,15 @@ float16x8_t test_vfmaq_f16(float16x8_t a, float16x8_t b, float16x8_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmaq_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.fma.v4f32(<4 x float> [[B:%.*]], <4 x float> [[C:%.*]], <4 x float> [[A:%.*]])
-// CHECK-NEXT:    ret <4 x float> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vfmaq_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.fma.v4f32(<4 x float> [[B:%.*]], <4 x float> [[C:%.*]], <4 x float> [[A:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vfmaq_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.arm.mve.fma.v4f32(<4 x float> [[B:%.*]], <4 x float> [[C:%.*]], <4 x float> [[A:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP0]]
 //
 float32x4_t test_vfmaq_f32(float32x4_t a, float32x4_t b, float32x4_t c) {
 #ifdef POLYMORPHIC
@@ -32,12 +44,19 @@ float32x4_t test_vfmaq_f32(float32x4_t a, float32x4_t b, float32x4_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmaq_n_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <8 x half> poison, half [[C:%.*]], i64 0
-// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <8 x half> [[DOTSPLATINSERT]], <8 x half> poison, <8 x i32> zeroinitializer
-// CHECK-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.fma.v8f16(<8 x half> [[B:%.*]], <8 x half> [[DOTSPLAT]], <8 x half> [[A:%.*]])
-// CHECK-NEXT:    ret <8 x half> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vfmaq_n_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <8 x half> poison, half [[C:%.*]], i64 0
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <8 x half> [[DOTSPLATINSERT]], <8 x half> poison, <8 x i32> zeroinitializer
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.fma.v8f16(<8 x half> [[B:%.*]], <8 x half> [[DOTSPLAT]], <8 x half> [[A:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vfmaq_n_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <8 x half> poison, half [[C:%.*]], i64 0
+// CHECK-STRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <8 x half> [[DOTSPLATINSERT]], <8 x half> poison, <8 x i32> zeroinitializer
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.arm.mve.fma.v8f16(<8 x half> [[B:%.*]], <8 x half> [[DOTSPLAT]], <8 x half> [[A:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP0]]
 //
 float16x8_t test_vfmaq_n_f16(float16x8_t a, float16x8_t b, float16_t c) {
 #ifdef POLYMORPHIC
@@ -47,12 +66,19 @@ float16x8_t test_vfmaq_n_f16(float16x8_t a, float16x8_t b, float16_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmaq_n_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[C:%.*]], i64 0
-// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
-// CHECK-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.fma.v4f32(<4 x float> [[B:%.*]], <4 x float> [[DOTSPLAT]], <4 x float> [[A:%.*]])
-// CHECK-NEXT:    ret <4 x float> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vfmaq_n_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[C:%.*]], i64 0
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.fma.v4f32(<4 x float> [[B:%.*]], <4 x float> [[DOTSPLAT]], <4 x float> [[A:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vfmaq_n_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[C:%.*]], i64 0
+// CHECK-STRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.arm.mve.fma.v4f32(<4 x float> [[B:%.*]], <4 x float> [[DOTSPLAT]], <4 x float> [[A:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP0]]
 //
 float32x4_t test_vfmaq_n_f32(float32x4_t a, float32x4_t b, float32_t c) {
 #ifdef POLYMORPHIC
@@ -62,12 +88,19 @@ float32x4_t test_vfmaq_n_f32(float32x4_t a, float32x4_t b, float32_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmasq_n_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <8 x half> poison, half [[C:%.*]], i64 0
-// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <8 x half> [[DOTSPLATINSERT]], <8 x half> poison, <8 x i32> zeroinitializer
-// CHECK-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.fma.v8f16(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], <8 x half> [[DOTSPLAT]])
-// CHECK-NEXT:    ret <8 x half> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vfmasq_n_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <8 x half> poison, half [[C:%.*]], i64 0
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <8 x half> [[DOTSPLATINSERT]], <8 x half> poison, <8 x i32> zeroinitializer
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.fma.v8f16(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], <8 x half> [[DOTSPLAT]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vfmasq_n_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <8 x half> poison, half [[C:%.*]], i64 0
+// CHECK-STRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <8 x half> [[DOTSPLATINSERT]], <8 x half> poison, <8 x i32> zeroinitializer
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.arm.mve.fma.v8f16(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], <8 x half> [[DOTSPLAT]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP0]]
 //
 float16x8_t test_vfmasq_n_f16(float16x8_t a, float16x8_t b, float16_t c) {
 #ifdef POLYMORPHIC
@@ -77,12 +110,19 @@ float16x8_t test_vfmasq_n_f16(float16x8_t a, float16x8_t b, float16_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmasq_n_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[C:%.*]], i64 0
-// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
-// CHECK-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.fma.v4f32(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x float> [[DOTSPLAT]])
-// CHECK-NEXT:    ret <4 x float> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vfmasq_n_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[C:%.*]], i64 0
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.fma.v4f32(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x float> [[DOTSPLAT]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vfmasq_n_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[C:%.*]], i64 0
+// CHECK-STRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.arm.mve.fma.v4f32(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x float> [[DOTSPLAT]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP0]]
 //
 float32x4_t test_vfmasq_n_f32(float32x4_t a, float32x4_t b, float32_t c) {
 #ifdef POLYMORPHIC
@@ -92,11 +132,17 @@ float32x4_t test_vfmasq_n_f32(float32x4_t a, float32x4_t b, float32_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmsq_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = fneg <8 x half> [[C:%.*]]
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x half> @llvm.fma.v8f16(<8 x half> [[B:%.*]], <8 x half> [[TMP0]], <8 x half> [[A:%.*]])
-// CHECK-NEXT:    ret <8 x half> [[TMP1]]
+// CHECK-NOSTRICT-LABEL: @test_vfmsq_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = fneg <8 x half> [[C:%.*]]
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x half> @llvm.fma.v8f16(<8 x half> [[B:%.*]], <8 x half> [[TMP0]], <8 x half> [[A:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP1]]
+//
+// CHECK-STRICT-LABEL: @test_vfmsq_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = fneg <8 x half> [[C:%.*]]
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x half> @llvm.arm.mve.fma.v8f16(<8 x half> [[B:%.*]], <8 x half> [[TMP0]], <8 x half> [[A:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP1]]
 //
 float16x8_t test_vfmsq_f16(float16x8_t a, float16x8_t b, float16x8_t c) {
 #ifdef POLYMORPHIC
@@ -106,11 +152,17 @@ float16x8_t test_vfmsq_f16(float16x8_t a, float16x8_t b, float16x8_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmsq_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = fneg <4 x float> [[C:%.*]]
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x float> @llvm.fma.v4f32(<4 x float> [[B:%.*]], <4 x float> [[TMP0]], <4 x float> [[A:%.*]])
-// CHECK-NEXT:    ret <4 x float> [[TMP1]]
+// CHECK-NOSTRICT-LABEL: @test_vfmsq_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = fneg <4 x float> [[C:%.*]]
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x float> @llvm.fma.v4f32(<4 x float> [[B:%.*]], <4 x float> [[TMP0]], <4 x float> [[A:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP1]]
+//
+// CHECK-STRICT-LABEL: @test_vfmsq_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = fneg <4 x float> [[C:%.*]]
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x float> @llvm.arm.mve.fma.v4f32(<4 x float> [[B:%.*]], <4 x float> [[TMP0]], <4 x float> [[A:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP1]]
 //
 float32x4_t test_vfmsq_f32(float32x4_t a, float32x4_t b, float32x4_t c) {
 #ifdef POLYMORPHIC
@@ -312,11 +364,17 @@ uint32x4_t test_vmlasq_n_u32(uint32x4_t a, uint32x4_t b, uint32_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqdmlahq_n_s8(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <16 x i8> @llvm.arm.mve.vqdmlah.v16i8(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]])
-// CHECK-NEXT:    ret <16 x i8> [[TMP1]]
+// CHECK-NOSTRICT-LABEL: @test_vqdmlahq_n_s8(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <16 x i8> @llvm.arm.mve.vqdmlah.v16i8(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    ret <16 x i8> [[TMP1]]
+//
+// CHECK-STRICT-LABEL: @test_vqdmlahq_n_s8(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <16 x i8> @llvm.arm.mve.vqdmlah.v16i8(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <16 x i8> [[TMP1]]
 //
 int8x16_t test_vqdmlahq_n_s8(int8x16_t a, int8x16_t b, int8_t c) {
 #ifdef POLYMORPHIC
@@ -326,11 +384,17 @@ int8x16_t test_vqdmlahq_n_s8(int8x16_t a, int8x16_t b, int8_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqdmlahq_n_s16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x i16> @llvm.arm.mve.vqdmlah.v8i16(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]])
-// CHECK-NEXT:    ret <8 x i16> [[TMP1]]
+// CHECK-NOSTRICT-LABEL: @test_vqdmlahq_n_s16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x i16> @llvm.arm.mve.vqdmlah.v8i16(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x i16> [[TMP1]]
+//
+// CHECK-STRICT-LABEL: @test_vqdmlahq_n_s16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x i16> @llvm.arm.mve.vqdmlah.v8i16(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x i16> [[TMP1]]
 //
 int16x8_t test_vqdmlahq_n_s16(int16x8_t a, int16x8_t b, int16_t c) {
 #ifdef POLYMORPHIC
@@ -340,10 +404,15 @@ int16x8_t test_vqdmlahq_n_s16(int16x8_t a, int16x8_t b, int16_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqdmlahq_n_s32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <4 x i32> @llvm.arm.mve.vqdmlah.v4i32(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]])
-// CHECK-NEXT:    ret <4 x i32> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vqdmlahq_n_s32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <4 x i32> @llvm.arm.mve.vqdmlah.v4i32(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x i32> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vqdmlahq_n_s32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <4 x i32> @llvm.arm.mve.vqdmlah.v4i32(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x i32> [[TMP0]]
 //
 int32x4_t test_vqdmlahq_n_s32(int32x4_t a, int32x4_t b, int32_t c) {
 #ifdef POLYMORPHIC
@@ -353,11 +422,17 @@ int32x4_t test_vqdmlahq_n_s32(int32x4_t a, int32x4_t b, int32_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqdmlashq_n_s8(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i8 [[ADD:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <16 x i8> @llvm.arm.mve.vqdmlash.v16i8(<16 x i8> [[M1:%.*]], <16 x i8> [[M2:%.*]], i32 [[TMP0]])
-// CHECK-NEXT:    ret <16 x i8> [[TMP1]]
+// CHECK-NOSTRICT-LABEL: @test_vqdmlashq_n_s8(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[ADD:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <16 x i8> @llvm.arm.mve.vqdmlash.v16i8(<16 x i8> [[M1:%.*]], <16 x i8> [[M2:%.*]], i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    ret <16 x i8> [[TMP1]]
+//
+// CHECK-STRICT-LABEL: @test_vqdmlashq_n_s8(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[ADD:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <16 x i8> @llvm.arm.mve.vqdmlash.v16i8(<16 x i8> [[M1:%.*]], <16 x i8> [[M2:%.*]], i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <16 x i8> [[TMP1]]
 //
 int8x16_t test_vqdmlashq_n_s8(int8x16_t m1, int8x16_t m2, int8_t add) {
 #ifdef POLYMORPHIC
@@ -367,11 +442,17 @@ int8x16_t test_vqdmlashq_n_s8(int8x16_t m1, int8x16_t m2, int8_t add) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqdmlashq_n_s16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[ADD:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x i16> @llvm.arm.mve.vqdmlash.v8i16(<8 x i16> [[M1:%.*]], <8 x i16> [[M2:%.*]], i32 [[TMP0]])
-// CHECK-NEXT:    ret <8 x i16> [[TMP1]]
+// CHECK-NOSTRICT-LABEL: @test_vqdmlashq_n_s16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[ADD:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x i16> @llvm.arm.mve.vqdmlash.v8i16(<8 x i16> [[M1:%.*]], <8 x i16> [[M2:%.*]], i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x i16> [[TMP1]]
+//
+// CHECK-STRICT-LABEL: @test_vqdmlashq_n_s16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[ADD:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x i16> @llvm.arm.mve.vqdmlash.v8i16(<8 x i16> [[M1:%.*]], <8 x i16> [[M2:%.*]], i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x i16> [[TMP1]]
 //
 int16x8_t test_vqdmlashq_n_s16(int16x8_t m1, int16x8_t m2, int16_t add) {
 #ifdef POLYMORPHIC
@@ -381,10 +462,15 @@ int16x8_t test_vqdmlashq_n_s16(int16x8_t m1, int16x8_t m2, int16_t add) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqdmlashq_n_s32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <4 x i32> @llvm.arm.mve.vqdmlash.v4i32(<4 x i32> [[M1:%.*]], <4 x i32> [[M2:%.*]], i32 [[ADD:%.*]])
-// CHECK-NEXT:    ret <4 x i32> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vqdmlashq_n_s32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <4 x i32> @llvm.arm.mve.vqdmlash.v4i32(<4 x i32> [[M1:%.*]], <4 x i32> [[M2:%.*]], i32 [[ADD:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x i32> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vqdmlashq_n_s32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <4 x i32> @llvm.arm.mve.vqdmlash.v4i32(<4 x i32> [[M1:%.*]], <4 x i32> [[M2:%.*]], i32 [[ADD:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x i32> [[TMP0]]
 //
 int32x4_t test_vqdmlashq_n_s32(int32x4_t m1, int32x4_t m2, int32_t add) {
 #ifdef POLYMORPHIC
@@ -394,11 +480,17 @@ int32x4_t test_vqdmlashq_n_s32(int32x4_t m1, int32x4_t m2, int32_t add) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqrdmlahq_n_s8(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <16 x i8> @llvm.arm.mve.vqrdmlah.v16i8(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]])
-// CHECK-NEXT:    ret <16 x i8> [[TMP1]]
+// CHECK-NOSTRICT-LABEL: @test_vqrdmlahq_n_s8(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <16 x i8> @llvm.arm.mve.vqrdmlah.v16i8(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    ret <16 x i8> [[TMP1]]
+//
+// CHECK-STRICT-LABEL: @test_vqrdmlahq_n_s8(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <16 x i8> @llvm.arm.mve.vqrdmlah.v16i8(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <16 x i8> [[TMP1]]
 //
 int8x16_t test_vqrdmlahq_n_s8(int8x16_t a, int8x16_t b, int8_t c) {
 #ifdef POLYMORPHIC
@@ -408,11 +500,17 @@ int8x16_t test_vqrdmlahq_n_s8(int8x16_t a, int8x16_t b, int8_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqrdmlahq_n_s16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x i16> @llvm.arm.mve.vqrdmlah.v8i16(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]])
-// CHECK-NEXT:    ret <8 x i16> [[TMP1]]
+// CHECK-NOSTRICT-LABEL: @test_vqrdmlahq_n_s16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x i16> @llvm.arm.mve.vqrdmlah.v8i16(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x i16> [[TMP1]]
+//
+// CHECK-STRICT-LABEL: @test_vqrdmlahq_n_s16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x i16> @llvm.arm.mve.vqrdmlah.v8i16(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x i16> [[TMP1]]
 //
 int16x8_t test_vqrdmlahq_n_s16(int16x8_t a, int16x8_t b, int16_t c) {
 #ifdef POLYMORPHIC
@@ -422,10 +520,15 @@ int16x8_t test_vqrdmlahq_n_s16(int16x8_t a, int16x8_t b, int16_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqrdmlahq_n_s32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <4 x i32> @llvm.arm.mve.vqrdmlah.v4i32(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]])
-// CHECK-NEXT:    ret <4 x i32> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vqrdmlahq_n_s32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <4 x i32> @llvm.arm.mve.vqrdmlah.v4i32(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x i32> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vqrdmlahq_n_s32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <4 x i32> @llvm.arm.mve.vqrdmlah.v4i32(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x i32> [[TMP0]]
 //
 int32x4_t test_vqrdmlahq_n_s32(int32x4_t a, int32x4_t b, int32_t c) {
 #ifdef POLYMORPHIC
@@ -435,11 +538,17 @@ int32x4_t test_vqrdmlahq_n_s32(int32x4_t a, int32x4_t b, int32_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqrdmlashq_n_s8(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <16 x i8> @llvm.arm.mve.vqrdmlash.v16i8(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]])
-// CHECK-NEXT:    ret <16 x i8> [[TMP1]]
+// CHECK-NOSTRICT-LABEL: @test_vqrdmlashq_n_s8(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <16 x i8> @llvm.arm.mve.vqrdmlash.v16i8(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    ret <16 x i8> [[TMP1]]
+//
+// CHECK-STRICT-LABEL: @test_vqrdmlashq_n_s8(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <16 x i8> @llvm.arm.mve.vqrdmlash.v16i8(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <16 x i8> [[TMP1]]
 //
 int8x16_t test_vqrdmlashq_n_s8(int8x16_t a, int8x16_t b, int8_t c) {
 #ifdef POLYMORPHIC
@@ -449,11 +558,17 @@ int8x16_t test_vqrdmlashq_n_s8(int8x16_t a, int8x16_t b, int8_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqrdmlashq_n_s16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x i16> @llvm.arm.mve.vqrdmlash.v8i16(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]])
-// CHECK-NEXT:    ret <8 x i16> [[TMP1]]
+// CHECK-NOSTRICT-LABEL: @test_vqrdmlashq_n_s16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x i16> @llvm.arm.mve.vqrdmlash.v8i16(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x i16> [[TMP1]]
+//
+// CHECK-STRICT-LABEL: @test_vqrdmlashq_n_s16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x i16> @llvm.arm.mve.vqrdmlash.v8i16(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x i16> [[TMP1]]
 //
 int16x8_t test_vqrdmlashq_n_s16(int16x8_t a, int16x8_t b, int16_t c) {
 #ifdef POLYMORPHIC
@@ -463,10 +578,15 @@ int16x8_t test_vqrdmlashq_n_s16(int16x8_t a, int16x8_t b, int16_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqrdmlashq_n_s32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <4 x i32> @llvm.arm.mve.vqrdmlash.v4i32(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]])
-// CHECK-NEXT:    ret <4 x i32> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vqrdmlashq_n_s32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <4 x i32> @llvm.arm.mve.vqrdmlash.v4i32(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x i32> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vqrdmlashq_n_s32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <4 x i32> @llvm.arm.mve.vqrdmlash.v4i32(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x i32> [[TMP0]]
 //
 int32x4_t test_vqrdmlashq_n_s32(int32x4_t a, int32x4_t b, int32_t c) {
 #ifdef POLYMORPHIC
@@ -476,12 +596,19 @@ int32x4_t test_vqrdmlashq_n_s32(int32x4_t a, int32x4_t b, int32_t c) {
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmaq_m_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.fma.predicated.v8f16.v8i1(<8 x half> [[B:%.*]], <8 x half> [[C:%.*]], <8 x half> [[A:%.*]], <8 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <8 x half> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vfmaq_m_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.fma.predicated.v8f16.v8i1(<8 x half> [[B:%.*]], <8 x half> [[C:%.*]], <8 x half> [[A:%.*]], <8 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vfmaq_m_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.fma.predicated.v8f16.v8i1(<8 x half> [[B:%.*]], <8 x half> [[C:%.*]], <8 x half> [[A:%.*]], <8 x i1> [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP2]]
 //
 float16x8_t test_vfmaq_m_f16(float16x8_t a, float16x8_t b, float16x8_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -491,12 +618,19 @@ float16x8_t test_vfmaq_m_f16(float16x8_t a, float16x8_t b, float16x8_t c, mve_pr
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmaq_m_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.fma.predicated.v4f32.v4i1(<4 x float> [[B:%.*]], <4 x float> [[C:%.*]], <4 x float> [[A:%.*]], <4 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <4 x float> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vfmaq_m_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.fma.predicated.v4f32.v4i1(<4 x float> [[B:%.*]], <4 x float> [[C:%.*]], <4 x float> [[A:%.*]], <4 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vfmaq_m_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.fma.predicated.v4f32.v4i1(<4 x float> [[B:%.*]], <4 x float> [[C:%.*]], <4 x float> [[A:%.*]], <4 x i1> [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP2]]
 //
 float32x4_t test_vfmaq_m_f32(float32x4_t a, float32x4_t b, float32x4_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -506,14 +640,23 @@ float32x4_t test_vfmaq_m_f32(float32x4_t a, float32x4_t b, float32x4_t c, mve_pr
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmaq_m_n_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <8 x half> poison, half [[C:%.*]], i64 0
-// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <8 x half> [[DOTSPLATINSERT]], <8 x half> poison, <8 x i32> zeroinitializer
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.fma.predicated.v8f16.v8i1(<8 x half> [[B:%.*]], <8 x half> [[DOTSPLAT]], <8 x half> [[A:%.*]], <8 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <8 x half> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vfmaq_m_n_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <8 x half> poison, half [[C:%.*]], i64 0
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <8 x half> [[DOTSPLATINSERT]], <8 x half> poison, <8 x i32> zeroinitializer
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.fma.predicated.v8f16.v8i1(<8 x half> [[B:%.*]], <8 x half> [[DOTSPLAT]], <8 x half> [[A:%.*]], <8 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vfmaq_m_n_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <8 x half> poison, half [[C:%.*]], i64 0
+// CHECK-STRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <8 x half> [[DOTSPLATINSERT]], <8 x half> poison, <8 x i32> zeroinitializer
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.fma.predicated.v8f16.v8i1(<8 x half> [[B:%.*]], <8 x half> [[DOTSPLAT]], <8 x half> [[A:%.*]], <8 x i1> [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP2]]
 //
 float16x8_t test_vfmaq_m_n_f16(float16x8_t a, float16x8_t b, float16_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -523,14 +666,23 @@ float16x8_t test_vfmaq_m_n_f16(float16x8_t a, float16x8_t b, float16_t c, mve_pr
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmaq_m_n_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[C:%.*]], i64 0
-// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.fma.predicated.v4f32.v4i1(<4 x float> [[B:%.*]], <4 x float> [[DOTSPLAT]], <4 x float> [[A:%.*]], <4 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <4 x float> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vfmaq_m_n_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[C:%.*]], i64 0
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.fma.predicated.v4f32.v4i1(<4 x float> [[B:%.*]], <4 x float> [[DOTSPLAT]], <4 x float> [[A:%.*]], <4 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vfmaq_m_n_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[C:%.*]], i64 0
+// CHECK-STRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.fma.predicated.v4f32.v4i1(<4 x float> [[B:%.*]], <4 x float> [[DOTSPLAT]], <4 x float> [[A:%.*]], <4 x i1> [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP2]]
 //
 float32x4_t test_vfmaq_m_n_f32(float32x4_t a, float32x4_t b, float32_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -540,15 +692,25 @@ float32x4_t test_vfmaq_m_n_f32(float32x4_t a, float32x4_t b, float32_t c, mve_pr
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmasq_m_n_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <8 x half> poison, half [[C:%.*]], i64 0
-// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <8 x half> [[DOTSPLATINSERT]], <8 x half> poison, <8 x i32> zeroinitializer
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.fma.v8f16(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], <8 x half> [[DOTSPLAT]])
-// CHECK-NEXT:    [[TMP3:%.*]] = select <8 x i1> [[TMP1]], <8 x half> [[TMP2]], <8 x half> [[A]]
-// CHECK-NEXT:    ret <8 x half> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vfmasq_m_n_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <8 x half> poison, half [[C:%.*]], i64 0
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <8 x half> [[DOTSPLATINSERT]], <8 x half> poison, <8 x i32> zeroinitializer
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.fma.v8f16(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], <8 x half> [[DOTSPLAT]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = select <8 x i1> [[TMP1]], <8 x half> [[TMP2]], <8 x half> [[A]]
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vfmasq_m_n_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <8 x half> poison, half [[C:%.*]], i64 0
+// CHECK-STRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <8 x half> [[DOTSPLATINSERT]], <8 x half> poison, <8 x i32> zeroinitializer
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.fma.v8f16(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], <8 x half> [[DOTSPLAT]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = select <8 x i1> [[TMP1]], <8 x half> [[TMP2]], <8 x half> [[A]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP3]]
 //
 float16x8_t test_vfmasq_m_n_f16(float16x8_t a, float16x8_t b, float16_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -558,15 +720,25 @@ float16x8_t test_vfmasq_m_n_f16(float16x8_t a, float16x8_t b, float16_t c, mve_p
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmasq_m_n_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[C:%.*]], i64 0
-// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.fma.v4f32(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x float> [[DOTSPLAT]])
-// CHECK-NEXT:    [[TMP3:%.*]] = select <4 x i1> [[TMP1]], <4 x float> [[TMP2]], <4 x float> [[A]]
-// CHECK-NEXT:    ret <4 x float> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vfmasq_m_n_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[C:%.*]], i64 0
+// CHECK-NOSTRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.fma.v4f32(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x float> [[DOTSPLAT]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = select <4 x i1> [[TMP1]], <4 x float> [[TMP2]], <4 x float> [[A]]
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vfmasq_m_n_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <4 x float> poison, float [[C:%.*]], i64 0
+// CHECK-STRICT-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <4 x float> [[DOTSPLATINSERT]], <4 x float> poison, <4 x i32> zeroinitializer
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.fma.v4f32(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x float> [[DOTSPLAT]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = select <4 x i1> [[TMP1]], <4 x float> [[TMP2]], <4 x float> [[A]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP3]]
 //
 float32x4_t test_vfmasq_m_n_f32(float32x4_t a, float32x4_t b, float32_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -576,13 +748,21 @@ float32x4_t test_vfmasq_m_n_f32(float32x4_t a, float32x4_t b, float32_t c, mve_p
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmsq_m_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = fneg <8 x half> [[C:%.*]]
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <8 x half> @llvm.arm.mve.fma.predicated.v8f16.v8i1(<8 x half> [[B:%.*]], <8 x half> [[TMP0]], <8 x half> [[A:%.*]], <8 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <8 x half> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vfmsq_m_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = fneg <8 x half> [[C:%.*]]
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <8 x half> @llvm.arm.mve.fma.predicated.v8f16.v8i1(<8 x half> [[B:%.*]], <8 x half> [[TMP0]], <8 x half> [[A:%.*]], <8 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vfmsq_m_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = fneg <8 x half> [[C:%.*]]
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <8 x half> @llvm.arm.mve.fma.predicated.v8f16.v8i1(<8 x half> [[B:%.*]], <8 x half> [[TMP0]], <8 x half> [[A:%.*]], <8 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP3]]
 //
 float16x8_t test_vfmsq_m_f16(float16x8_t a, float16x8_t b, float16x8_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -592,13 +772,21 @@ float16x8_t test_vfmsq_m_f16(float16x8_t a, float16x8_t b, float16x8_t c, mve_pr
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vfmsq_m_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = fneg <4 x float> [[C:%.*]]
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <4 x float> @llvm.arm.mve.fma.predicated.v4f32.v4i1(<4 x float> [[B:%.*]], <4 x float> [[TMP0]], <4 x float> [[A:%.*]], <4 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <4 x float> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vfmsq_m_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = fneg <4 x float> [[C:%.*]]
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <4 x float> @llvm.arm.mve.fma.predicated.v4f32.v4i1(<4 x float> [[B:%.*]], <4 x float> [[TMP0]], <4 x float> [[A:%.*]], <4 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vfmsq_m_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = fneg <4 x float> [[C:%.*]]
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <4 x float> @llvm.arm.mve.fma.predicated.v4f32.v4i1(<4 x float> [[B:%.*]], <4 x float> [[TMP0]], <4 x float> [[A:%.*]], <4 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP3]]
 //
 float32x4_t test_vfmsq_m_f32(float32x4_t a, float32x4_t b, float32x4_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -608,13 +796,21 @@ float32x4_t test_vfmsq_m_f32(float32x4_t a, float32x4_t b, float32x4_t c, mve_pr
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmlaq_m_n_s8(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vmla.n.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <16 x i8> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vmlaq_m_n_s8(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vmla.n.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <16 x i8> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vmlaq_m_n_s8(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vmla.n.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <16 x i8> [[TMP3]]
 //
 int8x16_t test_vmlaq_m_n_s8(int8x16_t a, int8x16_t b, int8_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -624,13 +820,21 @@ int8x16_t test_vmlaq_m_n_s8(int8x16_t a, int8x16_t b, int8_t c, mve_pred16_t p)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmlaq_m_n_s16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vmla.n.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <8 x i16> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vmlaq_m_n_s16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vmla.n.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x i16> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vmlaq_m_n_s16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vmla.n.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x i16> [[TMP3]]
 //
 int16x8_t test_vmlaq_m_n_s16(int16x8_t a, int16x8_t b, int16_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -640,12 +844,19 @@ int16x8_t test_vmlaq_m_n_s16(int16x8_t a, int16x8_t b, int16_t c, mve_pred16_t p
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmlaq_m_n_s32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vmla.n.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <4 x i32> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vmlaq_m_n_s32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vmla.n.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x i32> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vmlaq_m_n_s32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vmla.n.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x i32> [[TMP2]]
 //
 int32x4_t test_vmlaq_m_n_s32(int32x4_t a, int32x4_t b, int32_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -655,13 +866,21 @@ int32x4_t test_vmlaq_m_n_s32(int32x4_t a, int32x4_t b, int32_t c, mve_pred16_t p
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmlaq_m_n_u8(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vmla.n.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <16 x i8> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vmlaq_m_n_u8(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vmla.n.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <16 x i8> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vmlaq_m_n_u8(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vmla.n.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <16 x i8> [[TMP3]]
 //
 uint8x16_t test_vmlaq_m_n_u8(uint8x16_t a, uint8x16_t b, uint8_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -671,13 +890,21 @@ uint8x16_t test_vmlaq_m_n_u8(uint8x16_t a, uint8x16_t b, uint8_t c, mve_pred16_t
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmlaq_m_n_u16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vmla.n.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <8 x i16> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vmlaq_m_n_u16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vmla.n.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x i16> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vmlaq_m_n_u16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vmla.n.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x i16> [[TMP3]]
 //
 uint16x8_t test_vmlaq_m_n_u16(uint16x8_t a, uint16x8_t b, uint16_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -687,12 +914,19 @@ uint16x8_t test_vmlaq_m_n_u16(uint16x8_t a, uint16x8_t b, uint16_t c, mve_pred16
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmlaq_m_n_u32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vmla.n.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <4 x i32> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vmlaq_m_n_u32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vmla.n.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x i32> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vmlaq_m_n_u32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vmla.n.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x i32> [[TMP2]]
 //
 uint32x4_t test_vmlaq_m_n_u32(uint32x4_t a, uint32x4_t b, uint32_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -702,13 +936,21 @@ uint32x4_t test_vmlaq_m_n_u32(uint32x4_t a, uint32x4_t b, uint32_t c, mve_pred16
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmlasq_m_n_s8(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vmlas.n.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <16 x i8> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vmlasq_m_n_s8(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vmlas.n.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <16 x i8> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vmlasq_m_n_s8(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vmlas.n.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <16 x i8> [[TMP3]]
 //
 int8x16_t test_vmlasq_m_n_s8(int8x16_t a, int8x16_t b, int8_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -718,13 +960,21 @@ int8x16_t test_vmlasq_m_n_s8(int8x16_t a, int8x16_t b, int8_t c, mve_pred16_t p)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmlasq_m_n_s16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vmlas.n.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <8 x i16> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vmlasq_m_n_s16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vmlas.n.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x i16> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vmlasq_m_n_s16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vmlas.n.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x i16> [[TMP3]]
 //
 int16x8_t test_vmlasq_m_n_s16(int16x8_t a, int16x8_t b, int16_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -734,12 +984,19 @@ int16x8_t test_vmlasq_m_n_s16(int16x8_t a, int16x8_t b, int16_t c, mve_pred16_t
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmlasq_m_n_s32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vmlas.n.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <4 x i32> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vmlasq_m_n_s32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vmlas.n.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x i32> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vmlasq_m_n_s32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vmlas.n.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x i32> [[TMP2]]
 //
 int32x4_t test_vmlasq_m_n_s32(int32x4_t a, int32x4_t b, int32_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -749,13 +1006,21 @@ int32x4_t test_vmlasq_m_n_s32(int32x4_t a, int32x4_t b, int32_t c, mve_pred16_t
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmlasq_m_n_u8(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vmlas.n.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <16 x i8> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vmlasq_m_n_u8(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vmlas.n.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <16 x i8> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vmlasq_m_n_u8(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vmlas.n.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <16 x i8> [[TMP3]]
 //
 uint8x16_t test_vmlasq_m_n_u8(uint8x16_t a, uint8x16_t b, uint8_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -765,13 +1030,21 @@ uint8x16_t test_vmlasq_m_n_u8(uint8x16_t a, uint8x16_t b, uint8_t c, mve_pred16_
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmlasq_m_n_u16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vmlas.n.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <8 x i16> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vmlasq_m_n_u16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vmlas.n.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x i16> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vmlasq_m_n_u16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vmlas.n.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x i16> [[TMP3]]
 //
 uint16x8_t test_vmlasq_m_n_u16(uint16x8_t a, uint16x8_t b, uint16_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -781,12 +1054,19 @@ uint16x8_t test_vmlasq_m_n_u16(uint16x8_t a, uint16x8_t b, uint16_t c, mve_pred1
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmlasq_m_n_u32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vmlas.n.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <4 x i32> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vmlasq_m_n_u32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vmlas.n.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x i32> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vmlasq_m_n_u32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vmlas.n.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x i32> [[TMP2]]
 //
 uint32x4_t test_vmlasq_m_n_u32(uint32x4_t a, uint32x4_t b, uint32_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -796,13 +1076,21 @@ uint32x4_t test_vmlasq_m_n_u32(uint32x4_t a, uint32x4_t b, uint32_t c, mve_pred1
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqdmlahq_m_n_s8(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vqdmlah.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <16 x i8> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vqdmlahq_m_n_s8(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vqdmlah.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <16 x i8> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vqdmlahq_m_n_s8(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vqdmlah.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <16 x i8> [[TMP3]]
 //
 int8x16_t test_vqdmlahq_m_n_s8(int8x16_t a, int8x16_t b, int8_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -812,13 +1100,21 @@ int8x16_t test_vqdmlahq_m_n_s8(int8x16_t a, int8x16_t b, int8_t c, mve_pred16_t
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqdmlahq_m_n_s16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vqdmlah.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <8 x i16> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vqdmlahq_m_n_s16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vqdmlah.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x i16> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vqdmlahq_m_n_s16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vqdmlah.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x i16> [[TMP3]]
 //
 int16x8_t test_vqdmlahq_m_n_s16(int16x8_t a, int16x8_t b, int16_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -828,12 +1124,19 @@ int16x8_t test_vqdmlahq_m_n_s16(int16x8_t a, int16x8_t b, int16_t c, mve_pred16_
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqdmlahq_m_n_s32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vqdmlah.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <4 x i32> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vqdmlahq_m_n_s32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vqdmlah.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x i32> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vqdmlahq_m_n_s32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vqdmlah.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x i32> [[TMP2]]
 //
 int32x4_t test_vqdmlahq_m_n_s32(int32x4_t a, int32x4_t b, int32_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -843,13 +1146,21 @@ int32x4_t test_vqdmlahq_m_n_s32(int32x4_t a, int32x4_t b, int32_t c, mve_pred16_
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqdmlashq_m_n_s8(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i8 [[ADD:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vqdmlash.predicated.v16i8.v16i1(<16 x i8> [[M1:%.*]], <16 x i8> [[M2:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <16 x i8> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vqdmlashq_m_n_s8(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[ADD:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vqdmlash.predicated.v16i8.v16i1(<16 x i8> [[M1:%.*]], <16 x i8> [[M2:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <16 x i8> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vqdmlashq_m_n_s8(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[ADD:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vqdmlash.predicated.v16i8.v16i1(<16 x i8> [[M1:%.*]], <16 x i8> [[M2:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <16 x i8> [[TMP3]]
 //
 int8x16_t test_vqdmlashq_m_n_s8(int8x16_t m1, int8x16_t m2, int8_t add, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -859,13 +1170,21 @@ int8x16_t test_vqdmlashq_m_n_s8(int8x16_t m1, int8x16_t m2, int8_t add, mve_pred
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqdmlashq_m_n_s16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[ADD:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vqdmlash.predicated.v8i16.v8i1(<8 x i16> [[M1:%.*]], <8 x i16> [[M2:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <8 x i16> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vqdmlashq_m_n_s16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[ADD:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vqdmlash.predicated.v8i16.v8i1(<8 x i16> [[M1:%.*]], <8 x i16> [[M2:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x i16> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vqdmlashq_m_n_s16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[ADD:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vqdmlash.predicated.v8i16.v8i1(<8 x i16> [[M1:%.*]], <8 x i16> [[M2:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x i16> [[TMP3]]
 //
 int16x8_t test_vqdmlashq_m_n_s16(int16x8_t m1, int16x8_t m2, int16_t add, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -875,12 +1194,19 @@ int16x8_t test_vqdmlashq_m_n_s16(int16x8_t m1, int16x8_t m2, int16_t add, mve_pr
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqdmlashq_m_n_s32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vqdmlash.predicated.v4i32.v4i1(<4 x i32> [[M1:%.*]], <4 x i32> [[M2:%.*]], i32 [[ADD:%.*]], <4 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <4 x i32> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vqdmlashq_m_n_s32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vqdmlash.predicated.v4i32.v4i1(<4 x i32> [[M1:%.*]], <4 x i32> [[M2:%.*]], i32 [[ADD:%.*]], <4 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x i32> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vqdmlashq_m_n_s32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vqdmlash.predicated.v4i32.v4i1(<4 x i32> [[M1:%.*]], <4 x i32> [[M2:%.*]], i32 [[ADD:%.*]], <4 x i1> [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x i32> [[TMP2]]
 //
 int32x4_t test_vqdmlashq_m_n_s32(int32x4_t m1, int32x4_t m2, int32_t add, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -890,13 +1216,21 @@ int32x4_t test_vqdmlashq_m_n_s32(int32x4_t m1, int32x4_t m2, int32_t add, mve_pr
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqrdmlahq_m_n_s8(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vqrdmlah.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <16 x i8> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vqrdmlahq_m_n_s8(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vqrdmlah.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <16 x i8> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vqrdmlahq_m_n_s8(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vqrdmlah.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <16 x i8> [[TMP3]]
 //
 int8x16_t test_vqrdmlahq_m_n_s8(int8x16_t a, int8x16_t b, int8_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -906,13 +1240,21 @@ int8x16_t test_vqrdmlahq_m_n_s8(int8x16_t a, int8x16_t b, int8_t c, mve_pred16_t
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqrdmlahq_m_n_s16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vqrdmlah.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <8 x i16> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vqrdmlahq_m_n_s16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vqrdmlah.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x i16> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vqrdmlahq_m_n_s16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vqrdmlah.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x i16> [[TMP3]]
 //
 int16x8_t test_vqrdmlahq_m_n_s16(int16x8_t a, int16x8_t b, int16_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -922,12 +1264,19 @@ int16x8_t test_vqrdmlahq_m_n_s16(int16x8_t a, int16x8_t b, int16_t c, mve_pred16
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqrdmlahq_m_n_s32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vqrdmlah.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <4 x i32> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vqrdmlahq_m_n_s32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vqrdmlah.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x i32> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vqrdmlahq_m_n_s32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vqrdmlah.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x i32> [[TMP2]]
 //
 int32x4_t test_vqrdmlahq_m_n_s32(int32x4_t a, int32x4_t b, int32_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -937,13 +1286,21 @@ int32x4_t test_vqrdmlahq_m_n_s32(int32x4_t a, int32x4_t b, int32_t c, mve_pred16
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqrdmlashq_m_n_s8(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vqrdmlash.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <16 x i8> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vqrdmlashq_m_n_s8(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vqrdmlash.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <16 x i8> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vqrdmlashq_m_n_s8(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i8 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <16 x i8> @llvm.arm.mve.vqrdmlash.predicated.v16i8.v16i1(<16 x i8> [[A:%.*]], <16 x i8> [[B:%.*]], i32 [[TMP0]], <16 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <16 x i8> [[TMP3]]
 //
 int8x16_t test_vqrdmlashq_m_n_s8(int8x16_t a, int8x16_t b, int8_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -953,13 +1310,21 @@ int8x16_t test_vqrdmlashq_m_n_s8(int8x16_t a, int8x16_t b, int8_t c, mve_pred16_
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqrdmlashq_m_n_s16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
-// CHECK-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vqrdmlash.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
-// CHECK-NEXT:    ret <8 x i16> [[TMP3]]
+// CHECK-NOSTRICT-LABEL: @test_vqrdmlashq_m_n_s16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vqrdmlash.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x i16> [[TMP3]]
+//
+// CHECK-STRICT-LABEL: @test_vqrdmlashq_m_n_s16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[C:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP3:%.*]] = call <8 x i16> @llvm.arm.mve.vqrdmlash.predicated.v8i16.v8i1(<8 x i16> [[A:%.*]], <8 x i16> [[B:%.*]], i32 [[TMP0]], <8 x i1> [[TMP2]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x i16> [[TMP3]]
 //
 int16x8_t test_vqrdmlashq_m_n_s16(int16x8_t a, int16x8_t b, int16_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
@@ -969,12 +1334,19 @@ int16x8_t test_vqrdmlashq_m_n_s16(int16x8_t a, int16x8_t b, int16_t c, mve_pred1
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vqrdmlashq_m_n_s32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vqrdmlash.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <4 x i32> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vqrdmlashq_m_n_s32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vqrdmlash.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x i32> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vqrdmlashq_m_n_s32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.arm.mve.vqrdmlash.predicated.v4i32.v4i1(<4 x i32> [[A:%.*]], <4 x i32> [[B:%.*]], i32 [[C:%.*]], <4 x i1> [[TMP1]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x i32> [[TMP2]]
 //
 int32x4_t test_vqrdmlashq_m_n_s32(int32x4_t a, int32x4_t b, int32_t c, mve_pred16_t p) {
 #ifdef POLYMORPHIC
diff --git a/clang/test/CodeGen/arm-mve-intrinsics/vmaxnmaq.c b/clang/test/CodeGen/arm-mve-intrinsics/vmaxnmaq.c
index 613a390bc6d36..04834ece3a4a6 100644
--- a/clang/test/CodeGen/arm-mve-intrinsics/vmaxnmaq.c
+++ b/clang/test/CodeGen/arm-mve-intrinsics/vmaxnmaq.c
@@ -1,17 +1,26 @@
 // NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
-// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s
-// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-NOSTRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-NOSTRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -frounding-math -fexperimental-strict-floating-point -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-STRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -frounding-math -fexperimental-strict-floating-point -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-STRICT
 
 // REQUIRES: aarch64-registered-target || arm-registered-target
 
 #include <arm_mve.h>
 
-// CHECK-LABEL: @test_vmaxnmaq_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.fabs.v8f16(<8 x half> [[A:%.*]])
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x half> @llvm.fabs.v8f16(<8 x half> [[B:%.*]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.maxnum.v8f16(<8 x half> [[TMP0]], <8 x half> [[TMP1]])
-// CHECK-NEXT:    ret <8 x half> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vmaxnmaq_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.fabs.v8f16(<8 x half> [[A:%.*]])
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x half> @llvm.fabs.v8f16(<8 x half> [[B:%.*]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.maxnum.v8f16(<8 x half> [[TMP0]], <8 x half> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vmaxnmaq_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.fabs.v8f16(<8 x half> [[A:%.*]]) #[[ATTR3:[0-9]+]]
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x half> @llvm.fabs.v8f16(<8 x half> [[B:%.*]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.vmaxnm.v8f16(<8 x half> [[TMP0]], <8 x half> [[TMP1]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP2]]
 //
 float16x8_t test_vmaxnmaq_f16(float16x8_t a, float16x8_t b)
 {
@@ -22,12 +31,19 @@ float16x8_t test_vmaxnmaq_f16(float16x8_t a, float16x8_t b)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmaxnmaq_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.fabs.v4f32(<4 x float> [[A:%.*]])
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x float> @llvm.fabs.v4f32(<4 x float> [[B:%.*]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.maxnum.v4f32(<4 x float> [[TMP0]], <4 x float> [[TMP1]])
-// CHECK-NEXT:    ret <4 x float> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vmaxnmaq_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.fabs.v4f32(<4 x float> [[A:%.*]])
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x float> @llvm.fabs.v4f32(<4 x float> [[B:%.*]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.maxnum.v4f32(<4 x float> [[TMP0]], <4 x float> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vmaxnmaq_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.fabs.v4f32(<4 x float> [[A:%.*]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x float> @llvm.fabs.v4f32(<4 x float> [[B:%.*]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.vmaxnm.v4f32(<4 x float> [[TMP0]], <4 x float> [[TMP1]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP2]]
 //
 float32x4_t test_vmaxnmaq_f32(float32x4_t a, float32x4_t b)
 {
@@ -38,12 +54,19 @@ float32x4_t test_vmaxnmaq_f32(float32x4_t a, float32x4_t b)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmaxnmaq_m_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.vmaxnma.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], <8 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <8 x half> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vmaxnmaq_m_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.vmaxnma.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], <8 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vmaxnmaq_m_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.vmaxnma.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], <8 x i1> [[TMP1]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP2]]
 //
 float16x8_t test_vmaxnmaq_m_f16(float16x8_t a, float16x8_t b, mve_pred16_t p)
 {
@@ -54,12 +77,19 @@ float16x8_t test_vmaxnmaq_m_f16(float16x8_t a, float16x8_t b, mve_pred16_t p)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmaxnmaq_m_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.vmaxnma.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <4 x float> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vmaxnmaq_m_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.vmaxnma.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vmaxnmaq_m_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.vmaxnma.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x i1> [[TMP1]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP2]]
 //
 float32x4_t test_vmaxnmaq_m_f32(float32x4_t a, float32x4_t b, mve_pred16_t p)
 {
@@ -69,3 +99,5 @@ float32x4_t test_vmaxnmaq_m_f32(float32x4_t a, float32x4_t b, mve_pred16_t p)
     return vmaxnmaq_m_f32(a, b, p);
 #endif /* POLYMORPHIC */
 }
+//// NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+// CHECK: {{.*}}
diff --git a/clang/test/CodeGen/arm-mve-intrinsics/vmaxnmq.c b/clang/test/CodeGen/arm-mve-intrinsics/vmaxnmq.c
index bad7cd903ab16..1225353a5a9d2 100644
--- a/clang/test/CodeGen/arm-mve-intrinsics/vmaxnmq.c
+++ b/clang/test/CodeGen/arm-mve-intrinsics/vmaxnmq.c
@@ -1,15 +1,22 @@
 // NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
-// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s
-// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-NOSTRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-NOSTRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -frounding-math -fexperimental-strict-floating-point -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-STRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -frounding-math -fexperimental-strict-floating-point -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-STRICT
 
 // REQUIRES: aarch64-registered-target || arm-registered-target
 
 #include <arm_mve.h>
 
-// CHECK-LABEL: @test_vmaxnmq_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.maxnum.v8f16(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]])
-// CHECK-NEXT:    ret <8 x half> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vmaxnmq_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.maxnum.v8f16(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vmaxnmq_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.arm.mve.vmaxnm.v8f16(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]]) #[[ATTR2:[0-9]+]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP0]]
 //
 float16x8_t test_vmaxnmq_f16(float16x8_t a, float16x8_t b)
 {
@@ -20,10 +27,15 @@ float16x8_t test_vmaxnmq_f16(float16x8_t a, float16x8_t b)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmaxnmq_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.maxnum.v4f32(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]])
-// CHECK-NEXT:    ret <4 x float> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vmaxnmq_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.maxnum.v4f32(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vmaxnmq_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.arm.mve.vmaxnm.v4f32(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP0]]
 //
 float32x4_t test_vmaxnmq_f32(float32x4_t a, float32x4_t b)
 {
@@ -34,12 +46,19 @@ float32x4_t test_vmaxnmq_f32(float32x4_t a, float32x4_t b)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmaxnmq_m_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.max.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], i32 0, <8 x i1> [[TMP1]], <8 x half> [[INACTIVE:%.*]])
-// CHECK-NEXT:    ret <8 x half> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vmaxnmq_m_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.max.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], i32 0, <8 x i1> [[TMP1]], <8 x half> [[INACTIVE:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vmaxnmq_m_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.max.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], i32 0, <8 x i1> [[TMP1]], <8 x half> [[INACTIVE:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP2]]
 //
 float16x8_t test_vmaxnmq_m_f16(float16x8_t inactive, float16x8_t a, float16x8_t b, mve_pred16_t p)
 {
@@ -50,12 +69,19 @@ float16x8_t test_vmaxnmq_m_f16(float16x8_t inactive, float16x8_t a, float16x8_t
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmaxnmq_m_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.max.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], i32 0, <4 x i1> [[TMP1]], <4 x float> [[INACTIVE:%.*]])
-// CHECK-NEXT:    ret <4 x float> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vmaxnmq_m_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.max.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], i32 0, <4 x i1> [[TMP1]], <4 x float> [[INACTIVE:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vmaxnmq_m_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.max.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], i32 0, <4 x i1> [[TMP1]], <4 x float> [[INACTIVE:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP2]]
 //
 float32x4_t test_vmaxnmq_m_f32(float32x4_t inactive, float32x4_t a, float32x4_t b, mve_pred16_t p)
 {
@@ -66,12 +92,19 @@ float32x4_t test_vmaxnmq_m_f32(float32x4_t inactive, float32x4_t a, float32x4_t
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmaxnmq_x_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.max.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], i32 0, <8 x i1> [[TMP1]], <8 x half> undef)
-// CHECK-NEXT:    ret <8 x half> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vmaxnmq_x_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.max.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], i32 0, <8 x i1> [[TMP1]], <8 x half> undef)
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vmaxnmq_x_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.max.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], i32 0, <8 x i1> [[TMP1]], <8 x half> undef) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP2]]
 //
 float16x8_t test_vmaxnmq_x_f16(float16x8_t a, float16x8_t b, mve_pred16_t p)
 {
@@ -82,12 +115,19 @@ float16x8_t test_vmaxnmq_x_f16(float16x8_t a, float16x8_t b, mve_pred16_t p)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vmaxnmq_x_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.max.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], i32 0, <4 x i1> [[TMP1]], <4 x float> undef)
-// CHECK-NEXT:    ret <4 x float> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vmaxnmq_x_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.max.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], i32 0, <4 x i1> [[TMP1]], <4 x float> undef)
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vmaxnmq_x_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.max.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], i32 0, <4 x i1> [[TMP1]], <4 x float> undef) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP2]]
 //
 float32x4_t test_vmaxnmq_x_f32(float32x4_t a, float32x4_t b, mve_pred16_t p)
 {
@@ -97,3 +137,5 @@ float32x4_t test_vmaxnmq_x_f32(float32x4_t a, float32x4_t b, mve_pred16_t p)
     return vmaxnmq_x_f32(a, b, p);
 #endif /* POLYMORPHIC */
 }
+//// NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+// CHECK: {{.*}}
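(The same pattern repeats across these hunks: with the added `-frounding-math -fexperimental-strict-floating-point` RUN lines, the unpredicated float min/max intrinsics are checked to lower to the target-specific `@llvm.arm.mve.vmaxnm.*` / `@llvm.arm.mve.vminnm.*` calls instead of the generic `@llvm.maxnum.*` / `@llvm.minnum.*` intrinsics, and an attribute group (`#[[ATTR2]]` / `#[[ATTR3]]`) is attached to the calls. A minimal standalone sketch of what one such check exercises is below; the file name is illustrative and not part of the patch, and the command simply mirrors the cc1 RUN lines above.)

```c
/* repro_vmaxnm.c -- hedged reproducer sketch, not part of the patch.
 * Compile roughly as the strict-FP RUN lines above do (using clang -cc1):
 *   -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp
 *   -mfloat-abi hard -disable-O0-optnone -frounding-math
 *   -fexperimental-strict-floating-point -emit-llvm -o -
 */
#include <arm_mve.h>

float32x4_t repro_vmaxnmq(float32x4_t a, float32x4_t b) {
  /* Per the CHECK-STRICT lines above, this is expected to emit
   *   call <4 x float> @llvm.arm.mve.vmaxnm.v4f32(...)
   * whereas the CHECK-NOSTRICT lines show the generic
   *   call <4 x float> @llvm.maxnum.v4f32(...)
   */
  return vmaxnmq_f32(a, b);
}
```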
diff --git a/clang/test/CodeGen/arm-mve-intrinsics/vminnmaq.c b/clang/test/CodeGen/arm-mve-intrinsics/vminnmaq.c
index 0182cf7c5b6b3..fc0dc5701e4d9 100644
--- a/clang/test/CodeGen/arm-mve-intrinsics/vminnmaq.c
+++ b/clang/test/CodeGen/arm-mve-intrinsics/vminnmaq.c
@@ -1,17 +1,26 @@
 // NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
-// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s
-// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-NOSTRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-NOSTRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -frounding-math -fexperimental-strict-floating-point -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-STRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -frounding-math -fexperimental-strict-floating-point -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-STRICT
 
 // REQUIRES: aarch64-registered-target || arm-registered-target
 
 #include <arm_mve.h>
 
-// CHECK-LABEL: @test_vminnmaq_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.fabs.v8f16(<8 x half> [[A:%.*]])
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x half> @llvm.fabs.v8f16(<8 x half> [[B:%.*]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.minnum.v8f16(<8 x half> [[TMP0]], <8 x half> [[TMP1]])
-// CHECK-NEXT:    ret <8 x half> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vminnmaq_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.fabs.v8f16(<8 x half> [[A:%.*]])
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x half> @llvm.fabs.v8f16(<8 x half> [[B:%.*]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.minnum.v8f16(<8 x half> [[TMP0]], <8 x half> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vminnmaq_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.fabs.v8f16(<8 x half> [[A:%.*]]) #[[ATTR3:[0-9]+]]
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x half> @llvm.fabs.v8f16(<8 x half> [[B:%.*]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.vminnm.v8f16(<8 x half> [[TMP0]], <8 x half> [[TMP1]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP2]]
 //
 float16x8_t test_vminnmaq_f16(float16x8_t a, float16x8_t b)
 {
@@ -22,12 +31,19 @@ float16x8_t test_vminnmaq_f16(float16x8_t a, float16x8_t b)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vminnmaq_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.fabs.v4f32(<4 x float> [[A:%.*]])
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x float> @llvm.fabs.v4f32(<4 x float> [[B:%.*]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.minnum.v4f32(<4 x float> [[TMP0]], <4 x float> [[TMP1]])
-// CHECK-NEXT:    ret <4 x float> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vminnmaq_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.fabs.v4f32(<4 x float> [[A:%.*]])
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x float> @llvm.fabs.v4f32(<4 x float> [[B:%.*]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.minnum.v4f32(<4 x float> [[TMP0]], <4 x float> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vminnmaq_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.fabs.v4f32(<4 x float> [[A:%.*]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x float> @llvm.fabs.v4f32(<4 x float> [[B:%.*]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.vminnm.v4f32(<4 x float> [[TMP0]], <4 x float> [[TMP1]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP2]]
 //
 float32x4_t test_vminnmaq_f32(float32x4_t a, float32x4_t b)
 {
@@ -38,12 +54,19 @@ float32x4_t test_vminnmaq_f32(float32x4_t a, float32x4_t b)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vminnmaq_m_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.vminnma.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], <8 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <8 x half> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vminnmaq_m_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.vminnma.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], <8 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vminnmaq_m_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.vminnma.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], <8 x i1> [[TMP1]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP2]]
 //
 float16x8_t test_vminnmaq_m_f16(float16x8_t a, float16x8_t b, mve_pred16_t p)
 {
@@ -54,12 +77,19 @@ float16x8_t test_vminnmaq_m_f16(float16x8_t a, float16x8_t b, mve_pred16_t p)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vminnmaq_m_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.vminnma.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x i1> [[TMP1]])
-// CHECK-NEXT:    ret <4 x float> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vminnmaq_m_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.vminnma.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x i1> [[TMP1]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vminnmaq_m_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.vminnma.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x i1> [[TMP1]]) #[[ATTR3]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP2]]
 //
 float32x4_t test_vminnmaq_m_f32(float32x4_t a, float32x4_t b, mve_pred16_t p)
 {
@@ -69,3 +99,5 @@ float32x4_t test_vminnmaq_m_f32(float32x4_t a, float32x4_t b, mve_pred16_t p)
     return vminnmaq_m_f32(a, b, p);
 #endif /* POLYMORPHIC */
 }
+//// NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+// CHECK: {{.*}}
diff --git a/clang/test/CodeGen/arm-mve-intrinsics/vminnmq.c b/clang/test/CodeGen/arm-mve-intrinsics/vminnmq.c
index b48ff9d84b8f6..7dbad94c77674 100644
--- a/clang/test/CodeGen/arm-mve-intrinsics/vminnmq.c
+++ b/clang/test/CodeGen/arm-mve-intrinsics/vminnmq.c
@@ -1,15 +1,22 @@
 // NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
-// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s
-// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-NOSTRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-NOSTRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -frounding-math -fexperimental-strict-floating-point -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-STRICT
+// RUN: %clang_cc1 -triple thumbv8.1m.main-none-none-eabi -target-feature +mve.fp -mfloat-abi hard -disable-O0-optnone -frounding-math -fexperimental-strict-floating-point -DPOLYMORPHIC -emit-llvm -o - %s | opt -S -passes=mem2reg | FileCheck %s --check-prefixes=CHECK,CHECK-STRICT
 
 // REQUIRES: aarch64-registered-target || arm-registered-target
 
 #include <arm_mve.h>
 
-// CHECK-LABEL: @test_vminnmq_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.minnum.v8f16(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]])
-// CHECK-NEXT:    ret <8 x half> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vminnmq_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.minnum.v8f16(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vminnmq_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <8 x half> @llvm.arm.mve.vminnm.v8f16(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]]) #[[ATTR2:[0-9]+]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP0]]
 //
 float16x8_t test_vminnmq_f16(float16x8_t a, float16x8_t b)
 {
@@ -20,10 +27,15 @@ float16x8_t test_vminnmq_f16(float16x8_t a, float16x8_t b)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vminnmq_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.minnum.v4f32(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]])
-// CHECK-NEXT:    ret <4 x float> [[TMP0]]
+// CHECK-NOSTRICT-LABEL: @test_vminnmq_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.minnum.v4f32(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP0]]
+//
+// CHECK-STRICT-LABEL: @test_vminnmq_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = call <4 x float> @llvm.arm.mve.vminnm.v4f32(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP0]]
 //
 float32x4_t test_vminnmq_f32(float32x4_t a, float32x4_t b)
 {
@@ -34,12 +46,19 @@ float32x4_t test_vminnmq_f32(float32x4_t a, float32x4_t b)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vminnmq_m_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.min.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], i32 0, <8 x i1> [[TMP1]], <8 x half> [[INACTIVE:%.*]])
-// CHECK-NEXT:    ret <8 x half> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vminnmq_m_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.min.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], i32 0, <8 x i1> [[TMP1]], <8 x half> [[INACTIVE:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vminnmq_m_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.min.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], i32 0, <8 x i1> [[TMP1]], <8 x half> [[INACTIVE:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP2]]
 //
 float16x8_t test_vminnmq_m_f16(float16x8_t inactive, float16x8_t a, float16x8_t b, mve_pred16_t p)
 {
@@ -50,12 +69,19 @@ float16x8_t test_vminnmq_m_f16(float16x8_t inactive, float16x8_t a, float16x8_t
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vminnmq_m_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.min.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], i32 0, <4 x i1> [[TMP1]], <4 x float> [[INACTIVE:%.*]])
-// CHECK-NEXT:    ret <4 x float> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vminnmq_m_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.min.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], i32 0, <4 x i1> [[TMP1]], <4 x float> [[INACTIVE:%.*]])
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vminnmq_m_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.min.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], i32 0, <4 x i1> [[TMP1]], <4 x float> [[INACTIVE:%.*]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP2]]
 //
 float32x4_t test_vminnmq_m_f32(float32x4_t inactive, float32x4_t a, float32x4_t b, mve_pred16_t p)
 {
@@ -66,12 +92,19 @@ float32x4_t test_vminnmq_m_f32(float32x4_t inactive, float32x4_t a, float32x4_t
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vminnmq_x_f16(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.min.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], i32 0, <8 x i1> [[TMP1]], <8 x half> undef)
-// CHECK-NEXT:    ret <8 x half> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vminnmq_x_f16(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.min.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], i32 0, <8 x i1> [[TMP1]], <8 x half> undef)
+// CHECK-NOSTRICT-NEXT:    ret <8 x half> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vminnmq_x_f16(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <8 x half> @llvm.arm.mve.min.predicated.v8f16.v8i1(<8 x half> [[A:%.*]], <8 x half> [[B:%.*]], i32 0, <8 x i1> [[TMP1]], <8 x half> undef) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <8 x half> [[TMP2]]
 //
 float16x8_t test_vminnmq_x_f16(float16x8_t a, float16x8_t b, mve_pred16_t p)
 {
@@ -82,12 +115,19 @@ float16x8_t test_vminnmq_x_f16(float16x8_t a, float16x8_t b, mve_pred16_t p)
 #endif /* POLYMORPHIC */
 }
 
-// CHECK-LABEL: @test_vminnmq_x_f32(
-// CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
-// CHECK-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
-// CHECK-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.min.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], i32 0, <4 x i1> [[TMP1]], <4 x float> undef)
-// CHECK-NEXT:    ret <4 x float> [[TMP2]]
+// CHECK-NOSTRICT-LABEL: @test_vminnmq_x_f32(
+// CHECK-NOSTRICT-NEXT:  entry:
+// CHECK-NOSTRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-NOSTRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]])
+// CHECK-NOSTRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.min.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], i32 0, <4 x i1> [[TMP1]], <4 x float> undef)
+// CHECK-NOSTRICT-NEXT:    ret <4 x float> [[TMP2]]
+//
+// CHECK-STRICT-LABEL: @test_vminnmq_x_f32(
+// CHECK-STRICT-NEXT:  entry:
+// CHECK-STRICT-NEXT:    [[TMP0:%.*]] = zext i16 [[P:%.*]] to i32
+// CHECK-STRICT-NEXT:    [[TMP1:%.*]] = call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 [[TMP0]]) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    [[TMP2:%.*]] = call <4 x float> @llvm.arm.mve.min.predicated.v4f32.v4i1(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], i32 0, <4 x i1> [[TMP1]], <4 x float> undef) #[[ATTR2]]
+// CHECK-STRICT-NEXT:    ret <4 x float> [[TMP2]]
 //
 float32x4_t test_vminnmq_x_f32(float32x4_t a, float32x4_t b, mve_pred16_t p)
 {
@@ -97,3 +137,5 @@ float32x4_t test_vminnmq_x_f32(float32x4_t a, float32x4_t b, mve_pred16_t p)
     return vminnmq_x_f32(a, b, p);
 #endif /* POLYMORPHIC */
 }
+//// NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+// CHECK: {{.*}}
diff --git a/clang/test/CodeGen/attr-modular-format.c b/clang/test/CodeGen/attr-modular-format.c
new file mode 100644
index 0000000000000..5474ce361fbc2
--- /dev/null
+++ b/clang/test/CodeGen/attr-modular-format.c
@@ -0,0 +1,49 @@
+// RUN: %clang_cc1 -triple x86_64-unknown-unknown -emit-llvm %s -o - | FileCheck %s
+
+int printf(const char *fmt, ...)  __attribute__((modular_format(__modular_printf, "__printf", "float")));
+int myprintf(const char *fmt, ...)  __attribute__((modular_format(__modular_printf, "__printf", "float"), format(printf, 1, 2)));
+
+// CHECK-LABEL: define dso_local void @test_inferred_format(
+// CHECK:    {{.*}} = call i32 (ptr, ...) @printf(ptr noundef @.str) #[[ATTR:[0-9]+]]
+void test_inferred_format(void) {
+  printf("hello");
+}
+
+// CHECK-LABEL: define dso_local void @test_explicit_format(
+// CHECK:    {{.*}} = call i32 (ptr, ...) @myprintf(ptr noundef @.str) #[[ATTR:[0-9]+]]
+void test_explicit_format(void) {
+  myprintf("hello");
+}
+
+int redecl(const char *fmt, ...) __attribute__((format(printf, 1, 2)));
+int redecl(const char *fmt, ...) __attribute__((modular_format(__dupe_impl, "__dupe", "1")));
+int redecl(const char *fmt, ...) __attribute__((modular_format(__dupe_impl, "__dupe", "1")));
+
+// CHECK-LABEL: define dso_local void @test_redecl(
+// CHECK:    {{.*}} = call i32 (ptr, ...) @redecl(ptr noundef @.str) #[[ATTR_DUPE_IDENTICAL:[0-9]+]]
+void test_redecl(void) {
+  redecl("hello");
+}
+
+int order1(const char *fmt, ...) __attribute__((modular_format(__modular_printf, "__printf", "a", "b"), format(printf, 1, 2)));
+int order2(const char *fmt, ...) __attribute__((modular_format(__modular_printf, "__printf", "b", "a"), format(printf, 1, 2)));
+
+// CHECK-LABEL: define dso_local void @test_order(
+// CHECK:    {{.*}} = call i32 (ptr, ...) @order1(ptr noundef @.str) #[[ATTR_ORDER:[0-9]+]]
+// CHECK:    {{.*}} = call i32 (ptr, ...) @order2(ptr noundef @.str) #[[ATTR_ORDER]]
+void test_order(void) {
+  order1("hello");
+  order2("hello");
+}
+
+int duplicate_identical(const char *fmt, ...) __attribute__((modular_format(__dupe_impl, "__dupe", "1"), modular_format(__dupe_impl, "__dupe", "1"), format(printf, 1, 2)));
+
+// CHECK-LABEL: define dso_local void @test_duplicate_identical(
+// CHECK:    {{.*}} = call i32 (ptr, ...) @duplicate_identical(ptr noundef @.str) #[[ATTR_DUPE_IDENTICAL]]
+void test_duplicate_identical(void) {
+  duplicate_identical("hello");
+}
+
+// CHECK: attributes #[[ATTR]] = { "modular-format"="printf,1,2,__modular_printf,__printf,float" }
+// CHECK: attributes #[[ATTR_DUPE_IDENTICAL]] = { "modular-format"="printf,1,2,__dupe_impl,__dupe,1" }
+// CHECK: attributes #[[ATTR_ORDER]] = { "modular-format"="printf,1,2,__modular_printf,__printf,a,b" }
diff --git a/clang/test/CodeGen/distributed-thin-lto/memprof-pgho.cpp b/clang/test/CodeGen/distributed-thin-lto/memprof-pgho.cpp
new file mode 100644
index 0000000000000..317efd1b3a138
--- /dev/null
+++ b/clang/test/CodeGen/distributed-thin-lto/memprof-pgho.cpp
@@ -0,0 +1,69 @@
+// Test the end-to-end ThinLTO optimization pipeline with PGHO, checking that it
+// does not interfere with other allocation instrumentation features.
+//
+// REQUIRES: x86-registered-target
+//
+// RUN: split-file %s %t
+// RUN: llvm-profdata merge %t/memprof.yaml -o %t/use.memprofdata
+//
+// RUN: %clangxx --target=x86_64-linux-gnu -O2 -flto=thin -g -fmemory-profile-use=%t/use.memprofdata %t/src.cpp -c -o %t.o
+// RUN: llvm-lto2 run %t.o -thinlto-distributed-indexes -supports-hot-cold-new -r=%t.o,main,plx -r=%t.o,_Z3foov,plx -r=%t.o,_Znam, -o %t.out
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -O1 -x ir %t.o -fthinlto-index=%t.o.thinlto.bc -mllvm -optimize-hot-cold-new -emit-llvm -o - 2>&1 | FileCheck %s --check-prefixes=CHECK,DEFAULT
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -O2 -x ir %t.o -fthinlto-index=%t.o.thinlto.bc -mllvm -optimize-hot-cold-new -emit-llvm -o - 2>&1 | FileCheck %s --check-prefixes=CHECK,DEFAULT
+//
+// RUN: %clangxx --target=x86_64-linux-gnu -O2 -flto=thin -g -fsanitize=alloc-token -falloc-token-max=32 -fmemory-profile-use=%t/use.memprofdata %t/src.cpp -c -o %t.o
+// RUN: llvm-lto2 run %t.o -thinlto-distributed-indexes -supports-hot-cold-new -r=%t.o,main,plx -r=%t.o,_Z3foov,plx -r=%t.o,_Znam, -o %t.out
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -O1 -x ir %t.o -fsanitize=alloc-token -fthinlto-index=%t.o.thinlto.bc -mllvm -optimize-hot-cold-new -emit-llvm -o - 2>&1 | FileCheck %s --check-prefixes=CHECK,ALLOCTOKEN
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -O2 -x ir %t.o -fsanitize=alloc-token -fthinlto-index=%t.o.thinlto.bc -mllvm -optimize-hot-cold-new -emit-llvm -o - 2>&1 | FileCheck %s --check-prefixes=CHECK,ALLOCTOKEN
+
+//--- memprof.yaml
+---
+HeapProfileRecords:
+  - GUID: 0x7f8d88fcc70a347b
+    AllocSites:
+    - Callstack:
+      - { Function: 0x7f8d88fcc70a347b, LineOffset: 1, Column: 10, IsInlineFrame: false }
+      - { Function: 0xdb956436e78dd5fa, LineOffset: 1, Column: 13, IsInlineFrame: false }
+      MemInfoBlock:
+        AllocCount: 1
+        TotalAccessCount: 0
+        MinAccessCount: 0
+        MaxAccessCount: 0
+        TotalSize: 10
+        MinSize: 10
+        MaxSize: 10
+        AllocTimestamp: 100
+        DeallocTimestamp: 100
+        TotalLifetime: 100000
+        MinLifetime: 100000
+        MaxLifetime: 100000
+        AllocCpuId: 0
+        DeallocCpuId: 0
+        NumMigratedCpu: 0
+        NumLifetimeOverlaps: 0
+        NumSameAllocCpu: 0
+        NumSameDeallocCpu: 0
+        DataTypeId: 0
+        TotalAccessDensity: 0
+        MinAccessDensity: 0
+        MaxAccessDensity: 0
+        TotalLifetimeAccessDensity: 0
+        MinLifetimeAccessDensity: 0
+        MaxLifetimeAccessDensity: 0
+        AccessHistogramSize: 0
+        AccessHistogram: 0
+...
+
+//--- src.cpp
+// CHECK-LABEL: define{{.*}} ptr @_Z3foov()
+// DEFAULT:    call {{.*}} ptr @_Znam12__hot_cold_t(i64 10, i8 -128)
+// ALLOCTOKEN: call {{.*}} ptr @__alloc_token__Znam12__hot_cold_t(i64 10, i8 -128, i64 12){{.*}} !alloc_token
+char *foo() {
+  return new char[10];
+}
+
+int main() {
+  char *a = foo();
+  delete[] a;
+  return 0;
+}
diff --git a/clang/test/CodeGen/lto-newpm-pipeline.c b/clang/test/CodeGen/lto-newpm-pipeline.c
index dceaaf136ebfc..ea9784a76f923 100644
--- a/clang/test/CodeGen/lto-newpm-pipeline.c
+++ b/clang/test/CodeGen/lto-newpm-pipeline.c
@@ -32,12 +32,10 @@
 // CHECK-FULL-O0-NEXT: Running pass: AlwaysInlinerPass
 // CHECK-FULL-O0-NEXT: Running analysis: ProfileSummaryAnalysis
 // CHECK-FULL-O0-NEXT: Running pass: CoroConditionalWrapper
-// CHECK-FULL-O0-NEXT: Running pass: AllocTokenPass
-// CHECK-FULL-O0-NEXT: Running analysis: OptimizationRemarkEmitterAnalysis
-// CHECK-FULL-O0-NEXT: Running analysis: TargetLibraryAnalysis
 // CHECK-FULL-O0-NEXT: Running pass: CanonicalizeAliasesPass
 // CHECK-FULL-O0-NEXT: Running pass: NameAnonGlobalPass
 // CHECK-FULL-O0-NEXT: Running pass: AnnotationRemarksPass
+// CHECK-FULL-O0-NEXT: Running analysis: TargetLibraryAnalysis
 // CHECK-FULL-O0-NEXT: Running pass: VerifierPass
 // CHECK-FULL-O0-NEXT: Running pass: BitcodeWriterPass
 
@@ -48,12 +46,10 @@
 // CHECK-THIN-O0-NEXT: Running pass: AlwaysInlinerPass
 // CHECK-THIN-O0-NEXT: Running analysis: ProfileSummaryAnalysis
 // CHECK-THIN-O0-NEXT: Running pass: CoroConditionalWrapper
-// CHECK-THIN-O0-NEXT: Running pass: AllocTokenPass
-// CHECK-THIN-O0-NEXT: Running analysis: OptimizationRemarkEmitterAnalysis
-// CHECK-THIN-O0-NEXT: Running analysis: TargetLibraryAnalysis
 // CHECK-THIN-O0-NEXT: Running pass: CanonicalizeAliasesPass
 // CHECK-THIN-O0-NEXT: Running pass: NameAnonGlobalPass
 // CHECK-THIN-O0-NEXT: Running pass: AnnotationRemarksPass
+// CHECK-THIN-O0-NEXT: Running analysis: TargetLibraryAnalysis
 // CHECK-THIN-O0-NEXT: Running pass: VerifierPass
 // CHECK-THIN-O0-NEXT: Running pass: ThinLTOBitcodeWriterPass
 
diff --git a/clang/test/CodeGenCUDA/Inputs/cuda.h b/clang/test/CodeGenCUDA/Inputs/cuda.h
index e7ad784335027..421fa4dd7dbae 100644
--- a/clang/test/CodeGenCUDA/Inputs/cuda.h
+++ b/clang/test/CodeGenCUDA/Inputs/cuda.h
@@ -72,7 +72,13 @@ extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim,
 extern "C" cudaError_t cudaLaunchKernel_ptsz(const void *func, dim3 gridDim,
                                         dim3 blockDim, void **args,
                                         size_t sharedMem, cudaStream_t stream);
-
+extern "C" __device__ cudaError_t cudaLaunchDevice(void *func,
+                                                   void *parameterBuffer,
+                                                   dim3 gridDim, dim3 blockDim,
+                                                   unsigned int sharedMem,
+                                                   cudaStream_t stream);
+extern "C" __device__ void *cudaGetParameterBuffer(size_t alignment,
+                                                   size_t size);
 #endif
 
 extern "C" __device__ int printf(const char*, ...);
diff --git a/clang/test/CodeGenCUDA/device-kernel-call.cu b/clang/test/CodeGenCUDA/device-kernel-call.cu
new file mode 100644
index 0000000000000..eff2b37bd298d
--- /dev/null
+++ b/clang/test/CodeGenCUDA/device-kernel-call.cu
@@ -0,0 +1,35 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 6
+// RUN: %clang_cc1 -triple nvptx64-nvidia-cuda -fcuda-is-device -fgpu-rdc -emit-llvm %s -o - | FileCheck %s
+
+#include "Inputs/cuda.h"
+
+// CHECK-LABEL: define dso_local ptx_kernel void @_Z2g2i(
+// CHECK-SAME: i32 noundef [[X:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[X_ADDR:%.*]] = alloca i32, align 4
+// CHECK-NEXT:    store i32 [[X]], ptr [[X_ADDR]], align 4
+// CHECK-NEXT:    ret void
+//
+__global__ void g2(int x) {}
+
+// CHECK-LABEL: define dso_local ptx_kernel void @_Z2g1v(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[AGG_TMP:%.*]] = alloca [[STRUCT_DIM3:%.*]], align 4
+// CHECK-NEXT:    [[AGG_TMP1:%.*]] = alloca [[STRUCT_DIM3]], align 4
+// CHECK-NEXT:    [[CALL:%.*]] = call ptr @cudaGetParameterBuffer(i64 noundef 64, i64 noundef 4) #[[ATTR3:[0-9]+]]
+// CHECK-NEXT:    [[TMP0:%.*]] = icmp ne ptr [[CALL]], null
+// CHECK-NEXT:    br i1 [[TMP0]], label %[[DKCALL_CONFIGOK:.*]], label %[[DKCALL_END:.*]]
+// CHECK:       [[DKCALL_CONFIGOK]]:
+// CHECK-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[CALL]], i64 0
+// CHECK-NEXT:    store i32 42, ptr [[TMP1]], align 64
+// CHECK-NEXT:    call void @_ZN4dim3C1Ejjj(ptr noundef nonnull align 4 dereferenceable(12) [[AGG_TMP]], i32 noundef 1, i32 noundef 1, i32 noundef 1) #[[ATTR3]]
+// CHECK-NEXT:    call void @_ZN4dim3C1Ejjj(ptr noundef nonnull align 4 dereferenceable(12) [[AGG_TMP1]], i32 noundef 1, i32 noundef 1, i32 noundef 1) #[[ATTR3]]
+// CHECK-NEXT:    [[CALL2:%.*]] = call i32 @cudaLaunchDevice(ptr noundef @_Z2g2i, ptr noundef [[CALL]], ptr noundef byval([[STRUCT_DIM3]]) align 4 [[AGG_TMP]], ptr noundef byval([[STRUCT_DIM3]]) align 4 [[AGG_TMP1]], i32 noundef 0, ptr noundef null) #[[ATTR3]]
+// CHECK-NEXT:    br label %[[DKCALL_END]]
+// CHECK:       [[DKCALL_END]]:
+// CHECK-NEXT:    ret void
+//
+__global__ void g1(void) {
+  g2<<<1, 1>>>(42);
+}
diff --git a/clang/test/CodeGenHLSL/BasicFeatures/MatrixElementTypeCast.hlsl b/clang/test/CodeGenHLSL/BasicFeatures/MatrixElementTypeCast.hlsl
new file mode 100644
index 0000000000000..3bd7636212862
--- /dev/null
+++ b/clang/test/CodeGenHLSL/BasicFeatures/MatrixElementTypeCast.hlsl
@@ -0,0 +1,219 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 6
+// RUN: %clang_cc1 -finclude-default-header -triple dxil-pc-shadermodel6.3-library -x hlsl -emit-llvm -disable-llvm-passes -fnative-half-type -fnative-int16-type -o - %s | FileCheck %s
+
+
+// CHECK-LABEL: define hidden noundef <6 x i32> @_Z22elementwise_type_cast0u11matrix_typeILm3ELm2EfE(
+// CHECK-SAME: <6 x float> noundef nofpclass(nan inf) [[F32:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[F32_ADDR:%.*]] = alloca [6 x float], align 4
+// CHECK-NEXT:    [[I32:%.*]] = alloca [6 x i32], align 4
+// CHECK-NEXT:    store <6 x float> [[F32]], ptr [[F32_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <6 x float>, ptr [[F32_ADDR]], align 4
+// CHECK-NEXT:    [[CONV:%.*]] = fptosi <6 x float> [[TMP0]] to <6 x i32>
+// CHECK-NEXT:    store <6 x i32> [[CONV]], ptr [[I32]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <6 x i32>, ptr [[I32]], align 4
+// CHECK-NEXT:    ret <6 x i32> [[TMP1]]
+//
+int3x2 elementwise_type_cast0(float3x2 f32) {
+    int3x2 i32 = (int3x2)f32;
+    return i32;
+}
+
+// CHECK-LABEL: define hidden noundef <6 x i32> @_Z22elementwise_type_cast1u11matrix_typeILm3ELm2EsE(
+// CHECK-SAME: <6 x i16> noundef [[I16_32:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I16_32_ADDR:%.*]] = alloca [6 x i16], align 2
+// CHECK-NEXT:    [[I32:%.*]] = alloca [6 x i32], align 4
+// CHECK-NEXT:    store <6 x i16> [[I16_32]], ptr [[I16_32_ADDR]], align 2
+// CHECK-NEXT:    [[TMP0:%.*]] = load <6 x i16>, ptr [[I16_32_ADDR]], align 2
+// CHECK-NEXT:    [[CONV:%.*]] = sext <6 x i16> [[TMP0]] to <6 x i32>
+// CHECK-NEXT:    store <6 x i32> [[CONV]], ptr [[I32]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <6 x i32>, ptr [[I32]], align 4
+// CHECK-NEXT:    ret <6 x i32> [[TMP1]]
+//
+int3x2 elementwise_type_cast1(int16_t3x2 i16_32) {
+    int3x2 i32 = (int3x2)i16_32;
+    return i32;
+}
+
+// CHECK-LABEL: define hidden noundef <6 x i32> @_Z22elementwise_type_cast2u11matrix_typeILm3ELm2ElE(
+// CHECK-SAME: <6 x i64> noundef [[I64_32:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I64_32_ADDR:%.*]] = alloca [6 x i64], align 8
+// CHECK-NEXT:    [[I32:%.*]] = alloca [6 x i32], align 4
+// CHECK-NEXT:    store <6 x i64> [[I64_32]], ptr [[I64_32_ADDR]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load <6 x i64>, ptr [[I64_32_ADDR]], align 8
+// CHECK-NEXT:    [[CONV:%.*]] = trunc <6 x i64> [[TMP0]] to <6 x i32>
+// CHECK-NEXT:    store <6 x i32> [[CONV]], ptr [[I32]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <6 x i32>, ptr [[I32]], align 4
+// CHECK-NEXT:    ret <6 x i32> [[TMP1]]
+//
+int3x2 elementwise_type_cast2(int64_t3x2 i64_32) {
+    int3x2 i32 = (int3x2)i64_32;
+    return i32;
+}
+
+// CHECK-LABEL: define hidden noundef <6 x i16> @_Z22elementwise_type_cast3u11matrix_typeILm2ELm3EDhE(
+// CHECK-SAME: <6 x half> noundef nofpclass(nan inf) [[H23:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[H23_ADDR:%.*]] = alloca [6 x half], align 2
+// CHECK-NEXT:    [[I23:%.*]] = alloca [6 x i16], align 2
+// CHECK-NEXT:    store <6 x half> [[H23]], ptr [[H23_ADDR]], align 2
+// CHECK-NEXT:    [[TMP0:%.*]] = load <6 x half>, ptr [[H23_ADDR]], align 2
+// CHECK-NEXT:    [[CONV:%.*]] = fptosi <6 x half> [[TMP0]] to <6 x i16>
+// CHECK-NEXT:    store <6 x i16> [[CONV]], ptr [[I23]], align 2
+// CHECK-NEXT:    [[TMP1:%.*]] = load <6 x i16>, ptr [[I23]], align 2
+// CHECK-NEXT:    ret <6 x i16> [[TMP1]]
+//
+int16_t2x3 elementwise_type_cast3(half2x3 h23) {
+    int16_t2x3 i23 = (int16_t2x3)h23;
+    return i23;
+}
+
+// CHECK-LABEL: define hidden noundef <6 x i32> @_Z22elementwise_type_cast4u11matrix_typeILm3ELm2EdE(
+// CHECK-SAME: <6 x double> noundef nofpclass(nan inf) [[D32:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[D32_ADDR:%.*]] = alloca [6 x double], align 8
+// CHECK-NEXT:    [[I32:%.*]] = alloca [6 x i32], align 4
+// CHECK-NEXT:    store <6 x double> [[D32]], ptr [[D32_ADDR]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load <6 x double>, ptr [[D32_ADDR]], align 8
+// CHECK-NEXT:    [[CONV:%.*]] = fptosi <6 x double> [[TMP0]] to <6 x i32>
+// CHECK-NEXT:    store <6 x i32> [[CONV]], ptr [[I32]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <6 x i32>, ptr [[I32]], align 4
+// CHECK-NEXT:    ret <6 x i32> [[TMP1]]
+//
+int3x2 elementwise_type_cast4(double3x2 d32) {
+    int3x2 i32 = (int3x2)d32;
+    return i32;
+}
+
+// CHECK-LABEL: define hidden void @_Z5call2v(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A:%.*]] = alloca [2 x [1 x i32]], align 4
+// CHECK-NEXT:    [[B:%.*]] = alloca [2 x i32], align 4
+// CHECK-NEXT:    [[AGG_TEMP:%.*]] = alloca [2 x [1 x i32]], align 4
+// CHECK-NEXT:    [[FLATCAST_TMP:%.*]] = alloca <2 x i32>, align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[A]], ptr align 4 @__const._Z5call2v.A, i32 8, i1 false)
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 4 [[AGG_TEMP]], ptr align 4 [[A]], i32 8, i1 false)
+// CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds [2 x [1 x i32]], ptr [[AGG_TEMP]], i32 0, i32 0, i32 0
+// CHECK-NEXT:    [[GEP1:%.*]] = getelementptr inbounds [2 x [1 x i32]], ptr [[AGG_TEMP]], i32 0, i32 1, i32 0
+// CHECK-NEXT:    [[TMP0:%.*]] = load <2 x i32>, ptr [[FLATCAST_TMP]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[GEP]], align 4
+// CHECK-NEXT:    [[TMP2:%.*]] = insertelement <2 x i32> [[TMP0]], i32 [[TMP1]], i64 0
+// CHECK-NEXT:    [[TMP3:%.*]] = load i32, ptr [[GEP1]], align 4
+// CHECK-NEXT:    [[TMP4:%.*]] = insertelement <2 x i32> [[TMP2]], i32 [[TMP3]], i64 1
+// CHECK-NEXT:    store <2 x i32> [[TMP4]], ptr [[B]], align 4
+// CHECK-NEXT:    ret void
+//
+void call2() {
+  int A[2][1] = {{1},{2}};
+  int2x1 B = (int2x1)A;
+}
+
+struct S {
+  int X;
+  float Y;
+};
+
+// CHECK-LABEL: define hidden void @_Z5call3v(
+// CHECK-SAME: ) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[S:%.*]] = alloca [[STRUCT_S:%.*]], align 1
+// CHECK-NEXT:    [[A:%.*]] = alloca [2 x i32], align 4
+// CHECK-NEXT:    [[AGG_TEMP:%.*]] = alloca [[STRUCT_S]], align 1
+// CHECK-NEXT:    [[FLATCAST_TMP:%.*]] = alloca <2 x i32>, align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[S]], ptr align 1 @__const._Z5call3v.s, i32 8, i1 false)
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[AGG_TEMP]], ptr align 1 [[S]], i32 8, i1 false)
+// CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds [[STRUCT_S]], ptr [[AGG_TEMP]], i32 0, i32 0
+// CHECK-NEXT:    [[GEP1:%.*]] = getelementptr inbounds [[STRUCT_S]], ptr [[AGG_TEMP]], i32 0, i32 1
+// CHECK-NEXT:    [[TMP0:%.*]] = load <2 x i32>, ptr [[FLATCAST_TMP]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[GEP]], align 4
+// CHECK-NEXT:    [[TMP2:%.*]] = insertelement <2 x i32> [[TMP0]], i32 [[TMP1]], i64 0
+// CHECK-NEXT:    [[TMP3:%.*]] = load float, ptr [[GEP1]], align 4
+// CHECK-NEXT:    [[CONV:%.*]] = fptosi float [[TMP3]] to i32
+// CHECK-NEXT:    [[TMP4:%.*]] = insertelement <2 x i32> [[TMP2]], i32 [[CONV]], i64 1
+// CHECK-NEXT:    store <2 x i32> [[TMP4]], ptr [[A]], align 4
+// CHECK-NEXT:    ret void
+//
+void call3() {
+  S s = {1, 2.0};
+  int2x1 A = (int2x1)s;
+}
+
+struct BFields {
+  double D;
+  int E: 15;
+  int : 8;
+  float F;
+};
+
+struct Derived : BFields {
+  int G;
+};
+
+// CHECK-LABEL: define hidden void @_Z5call47Derived(
+// CHECK-SAME: ptr noundef byval([[STRUCT_DERIVED:%.*]]) align 1 [[D:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A:%.*]] = alloca [4 x i32], align 4
+// CHECK-NEXT:    [[AGG_TEMP:%.*]] = alloca [[STRUCT_DERIVED]], align 1
+// CHECK-NEXT:    [[FLATCAST_TMP:%.*]] = alloca <4 x i32>, align 4
+// CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr align 1 [[AGG_TEMP]], ptr align 1 [[D]], i32 19, i1 false)
+// CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds [[STRUCT_DERIVED]], ptr [[AGG_TEMP]], i32 0, i32 0
+// CHECK-NEXT:    [[E:%.*]] = getelementptr inbounds nuw [[STRUCT_BFIELDS:%.*]], ptr [[GEP]], i32 0, i32 1
+// CHECK-NEXT:    [[GEP1:%.*]] = getelementptr inbounds [[STRUCT_DERIVED]], ptr [[AGG_TEMP]], i32 0, i32 0, i32 0
+// CHECK-NEXT:    [[GEP2:%.*]] = getelementptr inbounds [[STRUCT_DERIVED]], ptr [[AGG_TEMP]], i32 0, i32 0, i32 2
+// CHECK-NEXT:    [[GEP3:%.*]] = getelementptr inbounds [[STRUCT_DERIVED]], ptr [[AGG_TEMP]], i32 0, i32 1
+// CHECK-NEXT:    [[TMP0:%.*]] = load <4 x i32>, ptr [[FLATCAST_TMP]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load double, ptr [[GEP1]], align 8
+// CHECK-NEXT:    [[CONV:%.*]] = fptosi double [[TMP1]] to i32
+// CHECK-NEXT:    [[TMP2:%.*]] = insertelement <4 x i32> [[TMP0]], i32 [[CONV]], i64 0
+// CHECK-NEXT:    [[TMP3:%.*]] = load float, ptr [[GEP2]], align 4
+// CHECK-NEXT:    [[CONV4:%.*]] = fptosi float [[TMP3]] to i32
+// CHECK-NEXT:    [[TMP4:%.*]] = insertelement <4 x i32> [[TMP2]], i32 [[CONV4]], i64 1
+// CHECK-NEXT:    [[BF_LOAD:%.*]] = load i24, ptr [[E]], align 1
+// CHECK-NEXT:    [[BF_SHL:%.*]] = shl i24 [[BF_LOAD]], 9
+// CHECK-NEXT:    [[BF_ASHR:%.*]] = ashr i24 [[BF_SHL]], 9
+// CHECK-NEXT:    [[BF_CAST:%.*]] = sext i24 [[BF_ASHR]] to i32
+// CHECK-NEXT:    [[TMP5:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[BF_CAST]], i64 2
+// CHECK-NEXT:    [[TMP6:%.*]] = load i32, ptr [[GEP3]], align 4
+// CHECK-NEXT:    [[TMP7:%.*]] = insertelement <4 x i32> [[TMP5]], i32 [[TMP6]], i64 3
+// CHECK-NEXT:    store <4 x i32> [[TMP7]], ptr [[A]], align 4
+// CHECK-NEXT:    ret void
+//
+void call4(Derived D) {
+  int2x2 A = (int2x2)D;
+}
+
+// CHECK-LABEL: define hidden noundef nofpclass(nan inf) <4 x float> @_Z5call5Dv4_f(
+// CHECK-SAME: <4 x float> noundef nofpclass(nan inf) [[V:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[V_ADDR:%.*]] = alloca <4 x float>, align 16
+// CHECK-NEXT:    [[M:%.*]] = alloca [4 x float], align 4
+// CHECK-NEXT:    [[HLSL_EWCAST_SRC:%.*]] = alloca <4 x float>, align 16
+// CHECK-NEXT:    [[FLATCAST_TMP:%.*]] = alloca <4 x float>, align 4
+// CHECK-NEXT:    store <4 x float> [[V]], ptr [[V_ADDR]], align 16
+// CHECK-NEXT:    [[TMP0:%.*]] = load <4 x float>, ptr [[V_ADDR]], align 16
+// CHECK-NEXT:    store <4 x float> [[TMP0]], ptr [[HLSL_EWCAST_SRC]], align 16
+// CHECK-NEXT:    [[VECTOR_GEP:%.*]] = getelementptr inbounds <4 x float>, ptr [[HLSL_EWCAST_SRC]], i32 0
+// CHECK-NEXT:    [[TMP1:%.*]] = load <4 x float>, ptr [[FLATCAST_TMP]], align 4
+// CHECK-NEXT:    [[TMP2:%.*]] = load <4 x float>, ptr [[VECTOR_GEP]], align 16
+// CHECK-NEXT:    [[VECEXT:%.*]] = extractelement <4 x float> [[TMP2]], i32 0
+// CHECK-NEXT:    [[TMP3:%.*]] = insertelement <4 x float> [[TMP1]], float [[VECEXT]], i64 0
+// CHECK-NEXT:    [[TMP4:%.*]] = load <4 x float>, ptr [[VECTOR_GEP]], align 16
+// CHECK-NEXT:    [[VECEXT1:%.*]] = extractelement <4 x float> [[TMP4]], i32 2
+// CHECK-NEXT:    [[TMP5:%.*]] = insertelement <4 x float> [[TMP3]], float [[VECEXT1]], i64 1
+// CHECK-NEXT:    [[TMP6:%.*]] = load <4 x float>, ptr [[VECTOR_GEP]], align 16
+// CHECK-NEXT:    [[VECEXT2:%.*]] = extractelement <4 x float> [[TMP6]], i32 1
+// CHECK-NEXT:    [[TMP7:%.*]] = insertelement <4 x float> [[TMP5]], float [[VECEXT2]], i64 2
+// CHECK-NEXT:    [[TMP8:%.*]] = load <4 x float>, ptr [[VECTOR_GEP]], align 16
+// CHECK-NEXT:    [[VECEXT3:%.*]] = extractelement <4 x float> [[TMP8]], i32 3
+// CHECK-NEXT:    [[TMP9:%.*]] = insertelement <4 x float> [[TMP7]], float [[VECEXT3]], i64 3
+// CHECK-NEXT:    store <4 x float> [[TMP9]], ptr [[M]], align 4
+// CHECK-NEXT:    [[TMP10:%.*]] = load <4 x float>, ptr [[M]], align 4
+// CHECK-NEXT:    ret <4 x float> [[TMP10]]
+//
+float2x2 call5(float4 v) {
+    float2x2 m = (float2x2)v;
+    return m;
+}
diff --git a/clang/test/CodeGenHLSL/BasicFeatures/MatrixExplicitTruncation.hlsl b/clang/test/CodeGenHLSL/BasicFeatures/MatrixExplicitTruncation.hlsl
new file mode 100644
index 0000000000000..f3c4bc496d5a4
--- /dev/null
+++ b/clang/test/CodeGenHLSL/BasicFeatures/MatrixExplicitTruncation.hlsl
@@ -0,0 +1,156 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 6
+// RUN: %clang_cc1 -triple dxil-pc-shadermodel6.7-library -disable-llvm-passes -emit-llvm -finclude-default-header -o - %s | FileCheck %s
+
+// CHECK-LABEL: define hidden noundef <12 x i32> @_Z10trunc_castu11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I34:%.*]] = alloca [12 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <12 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
+// CHECK-NEXT:    store <12 x i32> [[TRUNC]], ptr [[I34]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <12 x i32>, ptr [[I34]], align 4
+// CHECK-NEXT:    ret <12 x i32> [[TMP1]]
+//
+ int3x4 trunc_cast(int4x4 i44) {
+    int3x4 i34 = (int3x4)i44;
+    return i34;
+}
+
+// CHECK-LABEL: define hidden noundef <12 x i32> @_Z11trunc_cast0u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I43:%.*]] = alloca [12 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <12 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6, i32 8, i32 9, i32 10, i32 12, i32 13, i32 14>
+// CHECK-NEXT:    store <12 x i32> [[TRUNC]], ptr [[I43]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <12 x i32>, ptr [[I43]], align 4
+// CHECK-NEXT:    ret <12 x i32> [[TMP1]]
+//
+ int4x3 trunc_cast0(int4x4 i44) {
+    int4x3 i43 = (int4x3)i44;
+    return i43;
+}
+
+// CHECK-LABEL: define hidden noundef <9 x i32> @_Z11trunc_cast1u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I33:%.*]] = alloca [9 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <9 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6, i32 8, i32 9, i32 10>
+// CHECK-NEXT:    store <9 x i32> [[TRUNC]], ptr [[I33]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <9 x i32>, ptr [[I33]], align 4
+// CHECK-NEXT:    ret <9 x i32> [[TMP1]]
+//
+ int3x3 trunc_cast1(int4x4 i44) {
+    int3x3 i33 = (int3x3)i44;
+    return i33;
+}
+
+// CHECK-LABEL: define hidden noundef <6 x i32> @_Z11trunc_cast2u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I32:%.*]] = alloca [6 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <6 x i32> <i32 0, i32 1, i32 4, i32 5, i32 8, i32 9>
+// CHECK-NEXT:    store <6 x i32> [[TRUNC]], ptr [[I32]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <6 x i32>, ptr [[I32]], align 4
+// CHECK-NEXT:    ret <6 x i32> [[TMP1]]
+//
+ int3x2 trunc_cast2(int4x4 i44) {
+    int3x2 i32 = (int3x2)i44;
+    return i32;
+}
+
+// CHECK-LABEL: define hidden noundef <6 x i32> @_Z11trunc_cast3u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I23:%.*]] = alloca [6 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <6 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6>
+// CHECK-NEXT:    store <6 x i32> [[TRUNC]], ptr [[I23]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <6 x i32>, ptr [[I23]], align 4
+// CHECK-NEXT:    ret <6 x i32> [[TMP1]]
+//
+ int2x3 trunc_cast3(int4x4 i44) {
+    int2x3 i23 = (int2x3)i44;
+    return i23;
+}
+
+// CHECK-LABEL: define hidden noundef <4 x i32> @_Z11trunc_cast4u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I22:%.*]] = alloca [4 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
+// CHECK-NEXT:    store <4 x i32> [[TRUNC]], ptr [[I22]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <4 x i32>, ptr [[I22]], align 4
+// CHECK-NEXT:    ret <4 x i32> [[TMP1]]
+//
+ int2x2 trunc_cast4(int4x4 i44) {
+    int2x2 i22 = (int2x2)i44;
+    return i22;
+}
+
+// CHECK-LABEL: define hidden noundef <2 x i32> @_Z11trunc_cast5u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I21:%.*]] = alloca [2 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <2 x i32> <i32 0, i32 4>
+// CHECK-NEXT:    store <2 x i32> [[TRUNC]], ptr [[I21]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <2 x i32>, ptr [[I21]], align 4
+// CHECK-NEXT:    ret <2 x i32> [[TMP1]]
+//
+ int2x1 trunc_cast5(int4x4 i44) {
+    int2x1 i21 = (int2x1)i44;
+    return i21;
+}
+
+// CHECK-LABEL: define hidden noundef i32 @_Z11trunc_cast6u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I1:%.*]] = alloca i32, align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[CAST_MTRUNC:%.*]] = extractelement <16 x i32> [[TMP0]], i32 0
+// CHECK-NEXT:    store i32 [[CAST_MTRUNC]], ptr [[I1]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[I1]], align 4
+// CHECK-NEXT:    ret i32 [[TMP1]]
+//
+ int trunc_cast6(int4x4 i44) {
+    int i1 = (int)i44;
+    return i1;
+}
+
+// CHECK-LABEL: define hidden noundef i32 @_Z16trunc_multi_castu11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I1:%.*]] = alloca i32, align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <12 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
+// CHECK-NEXT:    [[CAST_MTRUNC:%.*]] = extractelement <12 x i32> [[TRUNC]], i32 0
+// CHECK-NEXT:    store i32 [[CAST_MTRUNC]], ptr [[I1]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[I1]], align 4
+// CHECK-NEXT:    ret i32 [[TMP1]]
+//
+ int trunc_multi_cast(int4x4 i44) {
+    int i1 = (int)(int3x4)i44;
+    return i1;
+}
diff --git a/clang/test/CodeGenHLSL/BasicFeatures/MatrixImplicitTruncation.hlsl b/clang/test/CodeGenHLSL/BasicFeatures/MatrixImplicitTruncation.hlsl
new file mode 100644
index 0000000000000..e621f68623bd1
--- /dev/null
+++ b/clang/test/CodeGenHLSL/BasicFeatures/MatrixImplicitTruncation.hlsl
@@ -0,0 +1,138 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 6
+// RUN: %clang_cc1 -triple dxil-pc-shadermodel6.7-library -disable-llvm-passes -emit-llvm -finclude-default-header -o - %s | FileCheck %s
+
+// CHECK-LABEL: define hidden noundef <12 x i32> @_Z10trunc_castu11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I34:%.*]] = alloca [12 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <12 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
+// CHECK-NEXT:    store <12 x i32> [[TRUNC]], ptr [[I34]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <12 x i32>, ptr [[I34]], align 4
+// CHECK-NEXT:    ret <12 x i32> [[TMP1]]
+//
+ int3x4 trunc_cast(int4x4 i44) {
+    int3x4 i34 = i44;
+    return i34;
+}
+
+// CHECK-LABEL: define hidden noundef <12 x i32> @_Z11trunc_cast0u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I43:%.*]] = alloca [12 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <12 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6, i32 8, i32 9, i32 10, i32 12, i32 13, i32 14>
+// CHECK-NEXT:    store <12 x i32> [[TRUNC]], ptr [[I43]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <12 x i32>, ptr [[I43]], align 4
+// CHECK-NEXT:    ret <12 x i32> [[TMP1]]
+//
+ int4x3 trunc_cast0(int4x4 i44) {
+    int4x3 i43 = i44;
+    return i43;
+}
+
+// CHECK-LABEL: define hidden noundef <9 x i32> @_Z11trunc_cast1u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I33:%.*]] = alloca [9 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <9 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6, i32 8, i32 9, i32 10>
+// CHECK-NEXT:    store <9 x i32> [[TRUNC]], ptr [[I33]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <9 x i32>, ptr [[I33]], align 4
+// CHECK-NEXT:    ret <9 x i32> [[TMP1]]
+//
+ int3x3 trunc_cast1(int4x4 i44) {
+    int3x3 i33 = i44;
+    return i33;
+}
+
+// CHECK-LABEL: define hidden noundef <6 x i32> @_Z11trunc_cast2u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I32:%.*]] = alloca [6 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <6 x i32> <i32 0, i32 1, i32 4, i32 5, i32 8, i32 9>
+// CHECK-NEXT:    store <6 x i32> [[TRUNC]], ptr [[I32]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <6 x i32>, ptr [[I32]], align 4
+// CHECK-NEXT:    ret <6 x i32> [[TMP1]]
+//
+ int3x2 trunc_cast2(int4x4 i44) {
+    int3x2 i32 = i44;
+    return i32;
+}
+
+// CHECK-LABEL: define hidden noundef <6 x i32> @_Z11trunc_cast3u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I23:%.*]] = alloca [6 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <6 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6>
+// CHECK-NEXT:    store <6 x i32> [[TRUNC]], ptr [[I23]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <6 x i32>, ptr [[I23]], align 4
+// CHECK-NEXT:    ret <6 x i32> [[TMP1]]
+//
+ int2x3 trunc_cast3(int4x4 i44) {
+    int2x3 i23 = i44;
+    return i23;
+}
+
+// CHECK-LABEL: define hidden noundef <4 x i32> @_Z11trunc_cast4u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I22:%.*]] = alloca [4 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
+// CHECK-NEXT:    store <4 x i32> [[TRUNC]], ptr [[I22]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <4 x i32>, ptr [[I22]], align 4
+// CHECK-NEXT:    ret <4 x i32> [[TMP1]]
+//
+ int2x2 trunc_cast4(int4x4 i44) {
+    int2x2 i22 = i44;
+    return i22;
+}
+
+// CHECK-LABEL: define hidden noundef <2 x i32> @_Z11trunc_cast5u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I21:%.*]] = alloca [2 x i32], align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TRUNC:%.*]] = shufflevector <16 x i32> [[TMP0]], <16 x i32> poison, <2 x i32> <i32 0, i32 4>
+// CHECK-NEXT:    store <2 x i32> [[TRUNC]], ptr [[I21]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load <2 x i32>, ptr [[I21]], align 4
+// CHECK-NEXT:    ret <2 x i32> [[TMP1]]
+//
+ int2x1 trunc_cast5(int4x4 i44) {
+    int2x1 i21 = i44;
+    return i21;
+}
+
+// CHECK-LABEL: define hidden noundef i32 @_Z11trunc_cast6u11matrix_typeILm4ELm4EiE(
+// CHECK-SAME: <16 x i32> noundef [[I44:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[I44_ADDR:%.*]] = alloca [16 x i32], align 4
+// CHECK-NEXT:    [[I1:%.*]] = alloca i32, align 4
+// CHECK-NEXT:    store <16 x i32> [[I44]], ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[TMP0:%.*]] = load <16 x i32>, ptr [[I44_ADDR]], align 4
+// CHECK-NEXT:    [[CAST_MTRUNC:%.*]] = extractelement <16 x i32> [[TMP0]], i32 0
+// CHECK-NEXT:    store i32 [[CAST_MTRUNC]], ptr [[I1]], align 4
+// CHECK-NEXT:    [[TMP1:%.*]] = load i32, ptr [[I1]], align 4
+// CHECK-NEXT:    ret i32 [[TMP1]]
+//
+ int trunc_cast6(int4x4 i44) {
+    int i1 = i44;
+    return i1;
+}
diff --git a/clang/test/CodeGenHLSL/BoolVector.hlsl b/clang/test/CodeGenHLSL/BoolVector.hlsl
index d5054a5a92b5d..6be90e8f51ce2 100644
--- a/clang/test/CodeGenHLSL/BoolVector.hlsl
+++ b/clang/test/CodeGenHLSL/BoolVector.hlsl
@@ -69,9 +69,8 @@ bool fn4() {
 // CHECK-LABEL: define hidden void {{.*}}fn5{{.*}}
 // CHECK: [[Arr:%.*]] = alloca <2 x i32>, align 8
 // CHECK-NEXT: store <2 x i32> splat (i32 1), ptr [[Arr]], align 8
-// CHECK-NEXT: [[L:%.*]] = load <2 x i32>, ptr [[Arr]], align 8
-// CHECK-NEXT: [[V:%.*]] = insertelement <2 x i32> [[L]], i32 0, i32 1
-// CHECK-NEXT: store <2 x i32> [[V]], ptr [[Arr]], align 8
+// CHECK-NEXT: [[Ptr:%.*]] = getelementptr <2 x i32>, ptr [[Arr]]
+// CHECK-NEXT: store i32 0, ptr [[Ptr]], align 4
 // CHECK-NEXT: ret void
 void fn5() {
   bool2 Arr = {true,true};
@@ -86,10 +85,9 @@ void fn5() {
 // CHECK-NEXT: [[Y:%.*]] = load i32, ptr [[V]], align 4
 // CHECK-NEXT: [[LV:%.*]] = trunc i32 [[Y]] to i1
 // CHECK-NEXT: [[BV:%.*]] = getelementptr inbounds nuw %struct.S, ptr [[S]], i32 0, i32 0
-// CHECK-NEXT: [[X:%.*]] = load <2 x i32>, ptr [[BV]], align 1
 // CHECK-NEXT: [[Z:%.*]] = zext i1 [[LV]] to i32
-// CHECK-NEXT: [[VI:%.*]] = insertelement <2 x i32> [[X]], i32 [[Z]], i32 1
-// CHECK-NEXT: store <2 x i32> [[VI]], ptr [[BV]], align 1
+// CHECK-NEXT: [[Ptr:%.*]] = getelementptr <2 x i32>, ptr [[BV]], i32 0, i32 1
+// CHECK-NEXT: store i32 [[Z]], ptr [[Ptr]], align 4
 // CHECK-NEXT: ret void
 void fn6() {
   bool V = false;
@@ -101,9 +99,8 @@ void fn6() {
 // CHECK: [[Arr:%.*]] = alloca [2 x <2 x i32>], align 8
 // CHECK-NEXT: call void @llvm.memcpy.p0.p0.i32(ptr align 8 [[Arr]], ptr align 8 {{.*}}, i32 16, i1 false)
 // CHECK-NEXT: [[Idx:%.*]] = getelementptr inbounds [2 x <2 x i32>], ptr [[Arr]], i32 0, i32 0
-// CHECK-NEXT: [[X:%.*]] = load <2 x i32>, ptr [[Idx]], align 8
-// CHECK-NEXT: [[VI:%.*]] = insertelement <2 x i32> [[X]], i32 0, i32 1
-// CHECK-NEXT: store <2 x i32> [[VI]], ptr [[Idx]], align 8
+// CHECK-NEXT: %[[Ptr:.*]] = getelementptr <2 x i32>, ptr [[Idx]], i32 0, i32 1
+// CHECK-NEXT: store i32 0, ptr %[[Ptr]], align 4
 // CHECK-NEXT: ret void
 void fn7() {
   bool2 Arr[2] = {{true,true}, {false,false}};
diff --git a/clang/test/CodeGenHLSL/builtins/VectorElementStore.hlsl b/clang/test/CodeGenHLSL/builtins/VectorElementStore.hlsl
new file mode 100644
index 0000000000000..e0c3aa54aaeba
--- /dev/null
+++ b/clang/test/CodeGenHLSL/builtins/VectorElementStore.hlsl
@@ -0,0 +1,41 @@
+// RUN: %clang_cc1 -finclude-default-header -emit-llvm -disable-llvm-passes  \
+// RUN:   -triple dxil-pc-shadermodel6.3-library %s -o - | FileCheck %s
+
+// Test groupshared vector element store for uint.
+// CHECK-LABEL: test_uint4
+// CHECK: [[VAL:%.*]] = load i32, ptr %Val.addr, align 4
+// CHECK: [[IDX:%.*]] = load i32, ptr %Idx.addr, align 4
+// CHECK: [[PTR:%.*]] = getelementptr <4 x i32>, ptr addrspace(3) @SMem, i32 0, i32 [[IDX]]
+// CHECK: store i32 [[VAL]], ptr addrspace(3) [[PTR]], align 4
+// CHECK: ret void
+groupshared uint4 SMem;
+void test_uint4(uint Idx, uint Val) {
+  SMem[Idx] = Val;
+}
+
+// Test local vector element store for bool.
+// CHECK: [[COND1:%.*]] = load i32, ptr addrspace(3) @Cond, align 4
+// CHECK: [[COND2:%.*]] = trunc i32 [[COND1]] to i1
+// CHECK: [[IDX:%.*]] = load i32, ptr %Idx.addr, align 4
+// CHECK: [[COND3:%.*]] = zext i1 [[COND2]] to i32
+// CHECK: [[PTR:%.*]] = getelementptr <3 x i32>, ptr %Val, i32 0, i32 [[IDX]]
+// CHECK: store i32 [[COND3]], ptr [[PTR]], align 4
+// CHECK: ret
+groupshared bool Cond;
+bool3 test_bool(uint Idx) {
+  bool3 Val = { false, false, false};
+  Val[Idx] = Cond;
+  return Val;
+}
+
+// Test resource vector element store for float.
+// CHECK: [[VAL:%.*]] = load float, ptr %Val.addr, align 4
+// CHECK: [[RES_PTR:%.*]] = call {{.*}} ptr @_ZN4hlsl18RWStructuredBufferIDv4_fEixEj(ptr {{.*}} @_ZL3Buf, i32 noundef 0)
+// CHECK: [[IDX:%.*]] = load i32, ptr %Idx.addr, align 4
+// CHECK: [[PTR:%.*]] = getelementptr <4 x float>, ptr [[RES_PTR]], i32 0, i32 [[IDX]]
+// CHECK: store float [[VAL]], ptr [[PTR]], align 4
+// CHECK: ret void
+RWStructuredBuffer<float4> Buf : register(u0);
+void test_float(uint Idx, float Val) {
+  Buf[0][Idx] = Val;
+}
diff --git a/clang/test/CodeGenHLSL/builtins/lit.hlsl b/clang/test/CodeGenHLSL/builtins/lit.hlsl
index b7979960de9f6..364c2e8794ea2 100644
--- a/clang/test/CodeGenHLSL/builtins/lit.hlsl
+++ b/clang/test/CodeGenHLSL/builtins/lit.hlsl
@@ -11,7 +11,8 @@
 // CHECK: %mul.i = fmul reassoc nnan ninf nsz arcp afn half [[LOG]], %{{.*}}
 // CHECK: [[EXP:%.*]] = call reassoc nnan ninf nsz arcp afn half @llvm.exp.f16(half %mul.i)
 // CHECK: %hlsl.select7.i = select reassoc nnan ninf nsz arcp afn i1 %{{.*}}, half 0xH0000, half %{{.*}}
-// CHECK: %vecins.i = insertelement <4 x half> %{{.*}}, half %hlsl.select7.i, i32 2
+// CHECK: [[PTR:%.*]] = getelementptr <4 x half>, ptr %Result.i, i32 0, i32 2
+// CHECK: store half %hlsl.select7.i, ptr [[PTR]], align 2
 // CHECK: ret <4 x half> %{{.*}}
 half4 test_lit_half(half NDotL, half NDotH, half M) { return lit(NDotL, NDotH, M); }
 
@@ -26,6 +27,7 @@ half4 test_lit_half(half NDotL, half NDotH, half M) { return lit(NDotL, NDotH, M
 // CHECK: %mul.i = fmul reassoc nnan ninf nsz arcp afn float [[LOG]], %{{.*}}
 // CHECK: [[EXP:%.*]] = call reassoc nnan ninf nsz arcp afn float @llvm.exp.f32(float %mul.i)
 // CHECK: %hlsl.select7.i = select reassoc nnan ninf nsz arcp afn i1 %{{.*}}, float 0.000000e+00, float %{{.*}}
-// CHECK: %vecins.i = insertelement <4 x float> %{{.*}}, float %hlsl.select7.i, i32 2
+// CHECK: [[PTR:%.*]] = getelementptr <4 x float>, ptr %Result.i, i32 0, i32 2
+// CHECK: store float %hlsl.select7.i, ptr [[PTR]], align 4
 // CHECK: ret <4 x float> %{{.*}}
 float4 test_lit_float(float NDotL, float NDotH, float M) { return lit(NDotL, NDotH, M); }
diff --git a/clang/test/CodeGenHLSL/semantics/SV_Target.ps.hlsl b/clang/test/CodeGenHLSL/semantics/SV_Target.ps.hlsl
new file mode 100644
index 0000000000000..4dc622a1eb6bb
--- /dev/null
+++ b/clang/test/CodeGenHLSL/semantics/SV_Target.ps.hlsl
@@ -0,0 +1,19 @@
+// RUN: %clang_cc1 -triple spirv-pc-vulkan1.3-pixel -x hlsl -emit-llvm -finclude-default-header -disable-llvm-passes -o - %s | FileCheck %s --check-prefix=CHECK-SPIRV
+// RUN: %clang_cc1 -triple dxil-pc-shadermodel6.3-pixel -x hlsl -emit-llvm -finclude-default-header -disable-llvm-passes -o - %s | FileCheck %s --check-prefix=CHECK-DXIL
+
+// CHECK-SPIRV: @SV_Target0 = external hidden thread_local addrspace(8) global <4 x float>, !spirv.Decorations ![[#MD_2:]]
+
+// CHECK: define void @main() {{.*}} {
+float4 main(float4 p : SV_Position) : SV_Target {
+  // CHECK-SPIRV: %[[#R:]] = call spir_func <4 x float> @_Z4mainDv4_f(<4 x float> %[[#]])
+  // CHECK-SPIRV:            store <4 x float> %[[#R]], ptr addrspace(8) @SV_Target0, align 16
+
+  // CHECK-DXIL:    %[[#TMP:]] = call <4 x float> @_Z4mainDv4_f(<4 x float> %SV_Position0)
+  // CHECK-DXIL:                 call void @llvm.dx.store.output.v4f32(i32 4, i32 0, i32 0, i8 0, i32 poison, <4 x float> %[[#TMP]])
+  return p;
+}
+
+// CHECK-SPIRV-DAG: ![[#MD_2]] = !{![[#MD_3:]]}
+// CHECK-SPIRV-DAG: ![[#MD_3]] = !{i32 30, i32 0}
+//                                      |       `-> Location index
+//                                      `-> SPIR-V decoration 'Location'
diff --git a/clang/test/CodeGenHLSL/semantics/semantic.explicit-location-output-struct.hlsl b/clang/test/CodeGenHLSL/semantics/semantic.explicit-location-output-struct.hlsl
new file mode 100644
index 0000000000000..c5d86637fb4ea
--- /dev/null
+++ b/clang/test/CodeGenHLSL/semantics/semantic.explicit-location-output-struct.hlsl
@@ -0,0 +1,37 @@
+// RUN: %clang_cc1 -triple spirv-pc-vulkan1.3-pixel -x hlsl -emit-llvm -finclude-default-header -disable-llvm-passes -o - %s | FileCheck %s --check-prefixes=CHECK,CHECK-SPIRV
+// RUN: %clang_cc1 -triple dxil-pc-shadermodel6.3-pixel -x hlsl -emit-llvm -finclude-default-header -disable-llvm-passes -o - %s | FileCheck %s --check-prefixes=CHECK,CHECK-DXIL
+
+// CHECK-SPIRV: @SV_Position = external hidden thread_local addrspace(7) externally_initialized constant <4 x float>, !spirv.Decorations ![[#MD_0:]]
+// CHECK-SPIRV: @SV_Target0 = external hidden thread_local addrspace(8) global <4 x float>, !spirv.Decorations ![[#MD_2:]]
+
+struct Output {
+  [[vk::location(2)]] float4 field : SV_Target;
+};
+
+// CHECK: define void @main() {{.*}} {
+Output main(float4 p : SV_Position) {
+  // CHECK:   %[[#OUT:]] = alloca %struct.Output, align 16
+
+  // CHECK-SPIRV:    %[[#IN:]] = load <4 x float>, ptr addrspace(7) @SV_Position, align 16
+  // CHECK-SPIRV:                call spir_func void @_Z4mainDv4_f(ptr %[[#OUT]], <4 x float> %[[#IN]])
+
+  // CHECK-DXIL:                 call void @_Z4mainDv4_f(ptr %[[#OUT]], <4 x float> %SV_Position0)
+
+  // CHECK:   %[[#TMP:]] = load %struct.Output, ptr %[[#OUT]], align 16
+  // CHECK: %[[#FIELD:]] = extractvalue %struct.Output %[[#TMP]], 0
+
+  // CHECK-SPIRV:                store <4 x float> %[[#FIELD]], ptr addrspace(8) @SV_Target0, align 16
+  // CHECK-DXIL:                 call void @llvm.dx.store.output.v4f32(i32 4, i32 0, i32 0, i8 0, i32 poison, <4 x float> %[[#FIELD]])
+  Output o;
+  o.field = p;
+  return o;
+}
+
+// CHECK-SPIRV-DAG: ![[#MD_0]] = !{![[#MD_1:]]}
+// CHECK-SPIRV-DAG: ![[#MD_1]] = !{i32 11, i32 15}
+//                                      |       `-> BuiltIn 'FragCoord'
+//                                      `-> SPIR-V decoration 'BuiltIn'
+// CHECK-SPIRV-DAG: ![[#MD_2]] = !{![[#MD_3:]]}
+// CHECK-SPIRV-DAG: ![[#MD_3]] = !{i32 30, i32 2}
+//                                      |       `-> Location index
+//                                      `-> SPIR-V decoration 'Location'
diff --git a/clang/test/CodeGenHLSL/semantics/semantic.explicit-location.hlsl b/clang/test/CodeGenHLSL/semantics/semantic.explicit-location.hlsl
new file mode 100644
index 0000000000000..41e28bf1259d6
--- /dev/null
+++ b/clang/test/CodeGenHLSL/semantics/semantic.explicit-location.hlsl
@@ -0,0 +1,19 @@
+// RUN: %clang_cc1 -triple spirv-pc-vulkan1.3-pixel -x hlsl -emit-llvm -finclude-default-header -disable-llvm-passes -o - %s | FileCheck %s --check-prefix=CHECK-SPIRV
+// RUN: %clang_cc1 -triple dxil-pc-shadermodel6.3-pixel -x hlsl -emit-llvm -finclude-default-header -disable-llvm-passes -o - %s | FileCheck %s --check-prefix=CHECK-DXIL
+
+// CHECK-SPIRV: @SV_Target0 = external hidden thread_local addrspace(8) global <4 x float>, !spirv.Decorations ![[#MD_2:]]
+
+// CHECK: define void @main() {{.*}} {
+[[vk::location(2)]] float4 main(float4 p : SV_Position) : SV_Target {
+  // CHECK-SPIRV: %[[#R:]] = call spir_func <4 x float> @_Z4mainDv4_f(<4 x float> %[[#]])
+  // CHECK-SPIRV:            store <4 x float> %[[#R]], ptr addrspace(8) @SV_Target0, align 16
+
+  // CHECK-DXIL:    %[[#TMP:]] = call <4 x float> @_Z4mainDv4_f(<4 x float> %SV_Position0)
+  // CHECK-DXIL:                 call void @llvm.dx.store.output.v4f32(i32 4, i32 0, i32 0, i8 0, i32 poison, <4 x float> %[[#TMP]])
+  return p;
+}
+
+// CHECK-SPIRV-DAG: ![[#MD_2]] = !{![[#MD_3:]]}
+// CHECK-SPIRV-DAG: ![[#MD_3]] = !{i32 30, i32 2}
+//                                      |       `-> Location index
+//                                      `-> SPIR-V decoration 'Location'
diff --git a/clang/test/CodeGenHLSL/semantics/semantic.explicit-mix-builtin.hlsl b/clang/test/CodeGenHLSL/semantics/semantic.explicit-mix-builtin.hlsl
new file mode 100644
index 0000000000000..bc2ecd926dd51
--- /dev/null
+++ b/clang/test/CodeGenHLSL/semantics/semantic.explicit-mix-builtin.hlsl
@@ -0,0 +1,39 @@
+// RUN: %clang_cc1 -triple spirv-linux-vulkan-pixel -x hlsl -emit-llvm -finclude-default-header -disable-llvm-passes -o - %s | FileCheck %s --check-prefixes=CHECK,CHECK-SPIRV
+
+// The following code is allowed because the `SV_Position` semantic is
+// translated into a SPIR-V builtin here, meaning there is no implicit
+// `Location` assignment.
+
+struct S2 {
+  float4 a;
+  float4 b;
+};
+
+struct S1 {
+  float4 position : SV_Position;
+  [[vk::location(3)]] float4 color0 : COLOR0;
+};
+
+// CHECK-SPIRV: @SV_Position = external hidden thread_local addrspace(7) externally_initialized constant <4 x float>, !spirv.Decorations ![[#MD_0:]]
+// CHECK-SPIRV: @COLOR0 = external hidden thread_local addrspace(7) externally_initialized constant <4 x float>, !spirv.Decorations ![[#MD_2:]]
+// CHECK-SPIRV: @SV_Target0 = external hidden thread_local addrspace(8) global <4 x float>, !spirv.Decorations ![[#MD_4:]]
+
+[shader("pixel")]
+float4 main(S1 p) : SV_Target {
+  return p.position + p.color0;
+}
+// CHECK-SPIRV:    %[[#SV_POS:]] = load <4 x float>, ptr addrspace(7) @SV_Position, align 16
+// CHECK:            %[[#TMP1:]] = insertvalue %struct.S1 poison, <4 x float> %[[#SV_POS]], 0
+// CHECK-SPIRV:        %[[#A0:]] = load <4 x float>, ptr addrspace(7) @COLOR0, align 16
+// CHECK:            %[[#TMP2:]] = insertvalue %struct.S1 %[[#TMP1]], <4 x float> %[[#A0]], 1
+// CHECK:               %[[#P:]] = alloca %struct.S1, align 16
+// CHECK:                          store %struct.S1 %[[#TMP2]], ptr %[[#P]], align 16
+// CHECK-SPIRV:         %[[#R:]] = call spir_func <4 x float> @_Z4main2S1(ptr %[[#P]])
+// CHECK-SPIRV:                    store <4 x float> %[[#R]], ptr addrspace(8) @SV_Target0, align 16
+
+// CHECK-SPIRV: ![[#MD_0]] = !{![[#MD_1:]]}
+// CHECK-SPIRV: ![[#MD_1]] = !{i32 11, i32 15}
+// CHECK-SPIRV: ![[#MD_2]] = !{![[#MD_3:]]}
+// CHECK-SPIRV: ![[#MD_3]] = !{i32 30, i32 3}
+// CHECK-SPIRV: ![[#MD_4]] = !{![[#MD_5:]]}
+// CHECK-SPIRV: ![[#MD_5]] = !{i32 30, i32 0}
diff --git a/clang/test/CodeGenHLSL/semantics/semantic.explicit-mix-builtin.vs.hlsl b/clang/test/CodeGenHLSL/semantics/semantic.explicit-mix-builtin.vs.hlsl
new file mode 100644
index 0000000000000..43dc30f089d9e
--- /dev/null
+++ b/clang/test/CodeGenHLSL/semantics/semantic.explicit-mix-builtin.vs.hlsl
@@ -0,0 +1,31 @@
+// RUN: %clang_cc1 -triple spirv-linux-vulkan-vertex -x hlsl -emit-llvm -finclude-default-header -disable-llvm-passes -o - %s | FileCheck %s
+
+// This is almost the same as semantic.explicit-mix-builtin.hlsl, except this
+// time we build a vertex shader. This means the SV_Position semantic output
+// is also a BuiltIn, so we can mix implicit and explicit location
+// assignment.
+struct S1 {
+  float4 position : SV_Position;
+  [[vk::location(3)]] float4 color : A;
+};
+
+// CHECK: @SV_Position0 = external hidden thread_local addrspace(7) externally_initialized constant <4 x float>, !spirv.Decorations ![[#MD_0:]]
+// CHECK: @SV_Position = external hidden thread_local addrspace(8) global <4 x float>, !spirv.Decorations ![[#MD_2:]]
+// CHECK: @A0 = external hidden thread_local addrspace(8) global <4 x float>, !spirv.Decorations ![[#MD_0]]
+
+[shader("vertex")]
+S1 main1(float4 position : SV_Position) {
+  S1 output;
+  output.position = position;
+  output.color = position;
+  return output;
+}
+
+// CHECK: ![[#MD_0]] = !{![[#MD_1:]]}
+// CHECK: ![[#MD_1]] = !{i32 30, i32 0}
+//                            |       `-> Location index
+//                            `-> SPIR-V decoration 'Location'
+// CHECK: ![[#MD_2]] = !{![[#MD_3:]]}
+// CHECK: ![[#MD_3]] = !{i32 11, i32 0}
+//                            |       `-> BuiltIn 'Position'
+//                            `-> SPIR-V decoration 'BuiltIn'
diff --git a/clang/test/CodeGenHLSL/semantics/semantic.explicit-mix.lib.hlsl b/clang/test/CodeGenHLSL/semantics/semantic.explicit-mix.lib.hlsl
new file mode 100644
index 0000000000000..456c9bf9aee05
--- /dev/null
+++ b/clang/test/CodeGenHLSL/semantics/semantic.explicit-mix.lib.hlsl
@@ -0,0 +1,40 @@
+// RUN: %clang_cc1 -triple spirv-linux-vulkan-library -x hlsl -emit-llvm -finclude-default-header -disable-llvm-passes -o - %s | FileCheck %s --check-prefixes=CHECK
+
+// The following file contains both implicit and explicit vk::location, but
+// because each entrypoint has only one kind, this is allowed.
+
+[shader("vertex")]
+float4 vs_main(float4 p : SV_Position) : A {
+  return p;
+}
+
+[shader("pixel")]
+float4 ps_main([[vk::location(0)]] float4 p : A) : SV_Target {
+  return p;
+}
+
+// The following function is not marked as being a shader entrypoint, so the
+// semantics and [[vk::location]] attributes are ignored.
+// Otherwise, the partial explicit location assignment would be illegal.
+float4 not_an_entry([[vk::location(0)]] float4 a : A, float4 b : B) : C {
+  return a + b;
+}
+
+// CHECK: @SV_Position0 = external hidden thread_local addrspace(7) externally_initialized constant <4 x float>, !spirv.Decorations ![[#MD_0:]]
+// CHECK: @A0 = external hidden thread_local addrspace(8) global <4 x float>, !spirv.Decorations ![[#MD_0:]]
+// CHECK: @A0.1 = external hidden thread_local addrspace(7) externally_initialized constant <4 x float>, !spirv.Decorations ![[#MD_0:]]
+// CHECK: @SV_Target0 = external hidden thread_local addrspace(8) global <4 x float>, !spirv.Decorations ![[#MD_2:]]
+
+
+// CHECK: define void @vs_main()
+// CHECK: %[[#]] = load <4 x float>, ptr addrspace(7) @SV_Position0, align 16
+// CHECK: store <4 x float> %[[#]], ptr addrspace(8) @A0, align 16
+
+// CHECK: define void @ps_main()
+// CHECK: %[[#]] = load <4 x float>, ptr addrspace(7) @A0.1, align 16
+// CHECK: store <4 x float> %[[#]], ptr addrspace(8) @SV_Target0, align 16
+
+// CHECK: ![[#MD_0]] = !{![[#MD_1:]]}
+// CHECK: ![[#MD_1]] = !{i32 30, i32 0}
+// CHECK: ![[#MD_2]] = !{![[#MD_3:]]}
+// CHECK: ![[#MD_3]] = !{i32 30, i32 1}
diff --git a/clang/test/CodeGenOpenCL/address-spaces.cl b/clang/test/CodeGenOpenCL/address-spaces.cl
index 5b2a95c6ac16a..b9f01069fa26c 100644
--- a/clang/test/CodeGenOpenCL/address-spaces.cl
+++ b/clang/test/CodeGenOpenCL/address-spaces.cl
@@ -2,9 +2,10 @@
 // RUN: %clang_cc1 %s -O0 -cl-std=CL3.0 -cl-ext=-all -ffake-address-space-map -emit-llvm -o - | FileCheck %s --check-prefixes=CHECK,SPIR
 // RUN: %clang_cc1 %s -O0 -cl-std=clc++2021 -cl-ext=-all -ffake-address-space-map -emit-llvm -o - | FileCheck %s --check-prefixes=CHECK,SPIR
 // RUN: %clang_cc1 %s -O0 -DCL20 -cl-std=CL2.0 -ffake-address-space-map -emit-llvm -o - | FileCheck %s --check-prefixes=CL20,CL20SPIR
-// RUN: %clang_cc1 %s -O0 -triple amdgcn-amd-amdhsa -emit-llvm -o - | FileCheck --check-prefixes=CHECK,AMDGCN %s
-// RUN: %clang_cc1 %s -O0 -triple amdgcn-amd-amdhsa -cl-std=CL3.0 -emit-llvm -o - | FileCheck --check-prefixes=CHECK,AMDGCN %s
+// RUN: %clang_cc1 %s -O0 -triple amdgcn-amd-mesa3d -emit-llvm -o - | FileCheck --check-prefixes=CHECK,AMDGCN %s
+// RUN: %clang_cc1 %s -O0 -triple amdgcn-amd-mesa3d -cl-std=CL3.0 -emit-llvm -o - | FileCheck --check-prefixes=CHECK,AMDGCN %s
 // RUN: %clang_cc1 %s -O0 -triple amdgcn-amd-amdhsa -DCL20 -cl-std=CL2.0 -emit-llvm -o - | FileCheck %s --check-prefixes=CL20,CL20AMDGCN
+// RUN: %clang_cc1 %s -O0 -triple amdgcn-amd-amdhsa -DCL20 -cl-std=CL3.0 -emit-llvm -o - | FileCheck %s --check-prefixes=CL20,CL20AMDGCN
 // RUN: %clang_cc1 %s -O0 -triple amdgcn-mesa-mesa3d -emit-llvm -o - | FileCheck --check-prefixes=CHECK,AMDGCN %s
 // RUN: %clang_cc1 %s -O0 -triple amdgcn-mesa-mesa3d -cl-std=CL3.0 -emit-llvm -o - | FileCheck --check-prefixes=CHECK,AMDGCN %s
 // RUN: %clang_cc1 %s -O0 -triple r600-- -emit-llvm -o - | FileCheck --check-prefixes=CHECK,AMDGCN %s
diff --git a/clang/test/CodeGenOpenCL/builtins-alloca.cl b/clang/test/CodeGenOpenCL/builtins-alloca.cl
index ce7da3aba9e45..51da8e3b3badb 100644
--- a/clang/test/CodeGenOpenCL/builtins-alloca.cl
+++ b/clang/test/CodeGenOpenCL/builtins-alloca.cl
@@ -3,9 +3,9 @@
 // RUN:     -emit-llvm -o - | FileCheck --check-prefixes=OPENCL12 %s
 // RUN: %clang_cc1 %s -O0 -triple amdgcn-amd-amdhsa -cl-std=CL2.0 \
 // RUN:     -emit-llvm -o - | FileCheck --check-prefixes=OPENCL20 %s
-// RUN: %clang_cc1 %s -O0 -triple amdgcn-amd-amdhsa -cl-std=CL3.0 \
+// RUN: %clang_cc1 %s -O0 -triple amdgcn-amd-mesa3d -target-cpu gfx600 -cl-std=CL3.0 \
 // RUN:     -emit-llvm -o - | FileCheck --check-prefixes=OPENCL30 %s
-// RUN: %clang_cc1 %s -O0 -triple amdgcn-amd-amdhsa -cl-std=CL3.0 -cl-ext=+__opencl_c_generic_address_space \
+// RUN: %clang_cc1 %s -O0 -triple amdgcn-amd-amdhsa -cl-std=CL3.0 \
 // RUN:     -emit-llvm -o - | FileCheck --check-prefixes=OPENCL30GAS %s
 
 // OPENCL-LABEL: define dso_local void @test1_builtin_alloca(
diff --git a/clang/test/CodeGenOpenCL/ptx-calls.cl b/clang/test/CodeGenOpenCL/ptx-calls.cl
index ae187173b1730..17c25ee78ef45 100644
--- a/clang/test/CodeGenOpenCL/ptx-calls.cl
+++ b/clang/test/CodeGenOpenCL/ptx-calls.cl
@@ -1,11 +1,31 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --include-generated-funcs --version 6
 // RUN: %clang_cc1 %s -triple nvptx-unknown-unknown -emit-llvm -O0 -o - | FileCheck %s
 
 void device_function() {
 }
-// CHECK-LABEL: define{{.*}} void @device_function()
 
 __kernel void kernel_function() {
   device_function();
 }
-// CHECK-LABEL: define{{.*}} ptx_kernel void @kernel_function()
-// CHECK: call void @device_function()
+// CHECK-LABEL: define dso_local void @device_function(
+// CHECK-SAME: ) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    ret void
+//
+//
+// CHECK-LABEL: define dso_local ptx_kernel void @kernel_function(
+// CHECK-SAME: ) #[[ATTR1:[0-9]+]] !kernel_arg_addr_space [[META3:![0-9]+]] !kernel_arg_access_qual [[META3]] !kernel_arg_type [[META3]] !kernel_arg_base_type [[META3]] !kernel_arg_type_qual [[META3]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    call void @__clang_ocl_kern_imp_kernel_function() #[[ATTR2:[0-9]+]]
+// CHECK-NEXT:    ret void
+//
+//
+// CHECK-LABEL: define dso_local void @__clang_ocl_kern_imp_kernel_function(
+// CHECK-SAME: ) #[[ATTR0]] !kernel_arg_addr_space [[META3]] !kernel_arg_access_qual [[META3]] !kernel_arg_type [[META3]] !kernel_arg_base_type [[META3]] !kernel_arg_type_qual [[META3]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    call void @device_function() #[[ATTR2]]
+// CHECK-NEXT:    ret void
+//
+//.
+// CHECK: [[META3]] = !{}
+//.
diff --git a/clang/test/CodeGenOpenCL/reflect.cl b/clang/test/CodeGenOpenCL/reflect.cl
index 4abb40aa3ed50..a69e338641167 100644
--- a/clang/test/CodeGenOpenCL/reflect.cl
+++ b/clang/test/CodeGenOpenCL/reflect.cl
@@ -26,7 +26,7 @@ __kernel void kernel_function(__global int *i) {
 // CHECK-NEXT:    ret void
 //
 //
-// CHECK-LABEL: define dso_local ptx_kernel void @__clang_ocl_kern_imp_kernel_function(
+// CHECK-LABEL: define dso_local void @__clang_ocl_kern_imp_kernel_function(
 // CHECK-SAME: ptr addrspace(1) noundef align 4 [[I:%.*]]) #[[ATTR0]] !kernel_arg_addr_space [[META3]] !kernel_arg_access_qual [[META4]] !kernel_arg_type [[META5]] !kernel_arg_base_type [[META5]] !kernel_arg_type_qual [[META6]] {
 // CHECK-NEXT:  entry:
 // CHECK-NEXT:    [[I_ADDR:%.*]] = alloca ptr addrspace(1), align 4
diff --git a/clang/test/Driver/autocomplete.c b/clang/test/Driver/autocomplete.c
index 4983b71496834..1fd60929751ee 100644
--- a/clang/test/Driver/autocomplete.c
+++ b/clang/test/Driver/autocomplete.c
@@ -117,6 +117,8 @@
 // WARNING-NEXT: -Wmany-braces-around-scalar-init
 // WARNING-NEXT: -Wmath-errno-enabled-with-veclib
 // WARNING-NEXT: -Wmathematical-notation-identifier-extension
+// WARNING-NEXT: -Wmatrix-conversion
+// WARNING-NEXT: -Wmatrix-conversions
 // WARNING-NEXT: -Wmax-tokens
 // WARNING-NEXT: -Wmax-unsigned-zero
 // RUN: %clang --autocomplete=-Wno-invalid-pp- | FileCheck %s -check-prefix=NOWARNING
diff --git a/clang/test/Driver/nvlink-wrapper.c b/clang/test/Driver/nvlink-wrapper.c
index 79f4a6641732f..2c2cf9d6415c0 100644
--- a/clang/test/Driver/nvlink-wrapper.c
+++ b/clang/test/Driver/nvlink-wrapper.c
@@ -42,7 +42,7 @@ int baz() { return y + x; }
 //
 // Check that we forward any unrecognized argument to 'nvlink'.
 //
-// RUN: clang-nvlink-wrapper --dry-run -arch sm_52 %t-u.o -foo -o a.out 2>&1 \
+// RUN: clang-nvlink-wrapper --dry-run --assume-device-object -arch sm_52 %t-u.o -foo -o a.out 2>&1 \
 // RUN:   | FileCheck %s --check-prefix=ARGS
 // ARGS: nvlink{{.*}} -arch sm_52 -foo -o a.out [[INPUT:.+]].cubin
 
@@ -51,14 +51,17 @@ int baz() { return y + x; }
 // `libx.a` and `liby.a` because extern weak symbols do not extract and `libz.a`
 // is not used at all.
 //
-// RUN: clang-nvlink-wrapper --dry-run %t-x.a %t-u.a %t-y.a %t-z.a %t-w.a %t.o \
+// RUN: clang-nvlink-wrapper --dry-run --assume-device-object %t-x.a %t-u.a %t-y.a %t-z.a %t-w.a %t.o \
 // RUN:   -arch sm_52 -o a.out 2>&1 | FileCheck %s --check-prefix=LINK
+// RUN: clang-nvlink-wrapper --dry-run %t-x.a %t-u.a %t-y.a %t-z.a %t-w.a %t.o \
+// RUN:   -arch sm_52 -o a.out 2>&1 | FileCheck %s --check-prefix=FORWARD
 // LINK: nvlink{{.*}} -arch sm_52 -o a.out [[INPUT:.+]].cubin {{.*}}-x-{{.*}}.cubin{{.*}}-y-{{.*}}.cubin
+// FORWARD: nvlink{{.*}} -arch sm_52 -o a.out [[INPUT:.+]].cubin {{.*}}-x-{{.*}}.o {{.*}}-u-{{.*}}.o {{.*}}-y-{{.*}}.o {{.*}}-z-{{.*}}.o {{.*}}-w-{{.*}}.o
 
 //
 // Same as above but we use '--undefined' to forcibly extract 'libz.a'
 //
-// RUN: clang-nvlink-wrapper --dry-run %t-x.a %t-u.a %t-y.a %t-z.a %t-w.a %t.o \
+// RUN: clang-nvlink-wrapper --dry-run --assume-device-object %t-x.a %t-u.a %t-y.a %t-z.a %t-w.a %t.o \
 // RUN:   -u z -arch sm_52 -o a.out 2>&1 | FileCheck %s --check-prefix=LINK
 // UNDEFINED: nvlink{{.*}} -arch sm_52 -o a.out [[INPUT:.+]].cubin {{.*}}-x-{{.*}}.cubin{{.*}}-y-{{.*}}.cubin{{.*}}-z-{{.*}}.cubin
 
@@ -66,7 +69,7 @@ int baz() { return y + x; }
 // Check that the LTO interface works and properly preserves symbols used in a
 // regular object file.
 //
-// RUN: clang-nvlink-wrapper --dry-run %t.o %t-u.o %t-y.a \
+// RUN: clang-nvlink-wrapper --dry-run --assume-device-object %t.o %t-u.o %t-y.a \
 // RUN:   -arch sm_52 -o a.out 2>&1 | FileCheck %s --check-prefix=LTO
 // LTO: ptxas{{.*}} -m64 -c [[PTX:.+]].s -O3 -arch sm_52 -o [[CUBIN:.+]].cubin
 // LTO: nvlink{{.*}} -arch sm_52 -o a.out [[CUBIN]].cubin {{.*}}-u-{{.*}}.cubin {{.*}}-y-{{.*}}.cubin
diff --git a/clang/test/FixIt/fixit-cxx0x-attributes.cpp b/clang/test/FixIt/fixit-cxx0x-attributes.cpp
new file mode 100644
index 0000000000000..92f18e60458f7
--- /dev/null
+++ b/clang/test/FixIt/fixit-cxx0x-attributes.cpp
@@ -0,0 +1,48 @@
+// RUN: %clang_cc1 -fsyntax-only -verify %s
+// RUN: not %clang_cc1 -fsyntax-only -fdiagnostics-parseable-fixits -fno-diagnostics-show-line-numbers %s 2>&1 | FileCheck %s -strict-whitespace
+
+[[nodiscard]] enum class E1 { };
+// expected-error at -1 {{misplaced attributes; expected attributes here}}
+// CHECK: {{^}}{{\[\[}}nodiscard]] enum class E1 { };
+// CHECK: {{^}}~~~~~~~~~~~~~           ^
+// CHECK: fix-it:"{{.*}}":{[[@LINE-4]]:1-[[@LINE-4]]:15}:""
+// CHECK: fix-it:"{{.*}}":{[[@LINE-5]]:25-[[@LINE-5]]:25}:"{{\[\[}}nodiscard]]"
+
+[[nodiscard]] enum struct E2 { };
+// expected-error at -1 {{misplaced attributes; expected attributes here}}
+// CHECK: {{^}}{{\[\[}}nodiscard]] enum struct E2 { };
+// CHECK: {{^}}~~~~~~~~~~~~~            ^
+// CHECK: fix-it:"{{.*}}":{[[@LINE-4]]:1-[[@LINE-4]]:15}:""
+// CHECK: fix-it:"{{.*}}":{[[@LINE-5]]:26-[[@LINE-5]]:26}:"{{\[\[}}nodiscard]]"
+
+[[nodiscard]] enum          class E3 { };
+// expected-error at -1 {{misplaced attributes; expected attributes here}}
+// CHECK: {{^}}{{\[\[}}nodiscard]] enum          class E3 { };
+// CHECK: {{^}}~~~~~~~~~~~~~                    ^
+// CHECK: fix-it:"{{.*}}":{[[@LINE-4]]:1-[[@LINE-4]]:15}:""
+// CHECK: fix-it:"{{.*}}":{[[@LINE-5]]:34-[[@LINE-5]]:34}:"{{\[\[}}nodiscard]]"
+
+[[nodiscard]] enum  /*comment*/ class E4 { };
+// expected-error at -1 {{misplaced attributes; expected attributes here}}
+// CHECK: {{^}}{{\[\[}}nodiscard]] enum  /*comment*/ class E4 { };
+// CHECK: {{^}}~~~~~~~~~~~~~                        ^
+// CHECK: fix-it:"{{.*}}":{[[@LINE-4]]:1-[[@LINE-4]]:15}:""
+// CHECK: fix-it:"{{.*}}":{[[@LINE-5]]:38-[[@LINE-5]]:38}:"{{\[\[}}nodiscard]]"
+
+[[nodiscard]] enum { A = 0 };
+// expected-error at -1 {{misplaced attributes; expected attributes here}}
+// CHECK: {{^}}{{\[\[}}nodiscard]] enum { A = 0 };
+// CHECK: {{^}}~~~~~~~~~~~~~     ^
+// CHECK: fix-it:"{{.*}}":{[[@LINE-4]]:1-[[@LINE-4]]:15}:""
+// CHECK: fix-it:"{{.*}}":{[[@LINE-5]]:19-[[@LINE-5]]:19}:"{{\[\[}}nodiscard]]"
+
+namespace NS {
+  enum class E5;
+}
+
+[[nodiscard]] enum class NS::E5 { };
+// expected-error at -1 {{misplaced attributes; expected attributes here}}
+// CHECK: {{^}}{{\[\[}}nodiscard]] enum class NS::E5 { };
+// CHECK: {{^}}~~~~~~~~~~~~~           ^
+// CHECK: fix-it:"{{.*}}":{[[@LINE-4]]:1-[[@LINE-4]]:15}:""
+// CHECK: fix-it:"{{.*}}":{[[@LINE-5]]:25-[[@LINE-5]]:25}:"{{\[\[}}nodiscard]]"
diff --git a/clang/test/Misc/amdgcn.languageOptsOpenCL.cl b/clang/test/Misc/amdgcn.languageOptsOpenCL.cl
index 80c0825895c86..08715fc5a1f4a 100644
--- a/clang/test/Misc/amdgcn.languageOptsOpenCL.cl
+++ b/clang/test/Misc/amdgcn.languageOptsOpenCL.cl
@@ -11,6 +11,9 @@
 // RUN: %clang_cc1 -x cl -cl-std=CL3.0 %s -verify -triple amdgcn-unknown-unknown -Wpedantic-core-features -DTEST_CORE_FEATURES
 // RUN: %clang_cc1 -x cl -cl-std=CL3.0 %s -verify -triple amdgcn-unknown-unknown -target-cpu gfx700 -Wpedantic-core-features -DTEST_CORE_FEATURES -DFLAT_SUPPORT
 
+// Test with no target CPU and the amdhsa triple, which implies >= gfx700
+// RUN: %clang_cc1 -x cl -cl-std=CL3.0 %s -verify -triple amdgcn-unknown-amdhsa -Wpedantic-core-features -DTEST_CORE_FEATURES -DFLAT_SUPPORT
+
 // Extensions in all versions
 #ifndef cl_clang_storage_class_specifiers
 #error "Missing cl_clang_storage_class_specifiers define"
@@ -162,6 +165,10 @@
   #ifndef __opencl_c_program_scope_global_variables
     #error "Missing __opencl_c_program_scope_global_variables define"
   #endif
+
+  #ifndef __opencl_c_read_write_images
+    #error "Missing __opencl_c_read_write_images define"
+  #endif
 #endif
 
 #if (__OPENCL_C_VERSION__ >= 300)
diff --git a/clang/test/Misc/pragma-attribute-supported-attributes-list.test b/clang/test/Misc/pragma-attribute-supported-attributes-list.test
index 747eb17446c87..081ea8d5c821c 100644
--- a/clang/test/Misc/pragma-attribute-supported-attributes-list.test
+++ b/clang/test/Misc/pragma-attribute-supported-attributes-list.test
@@ -89,6 +89,7 @@
 // CHECK-NEXT: FunctionReturnThunks (SubjectMatchRule_function)
 // CHECK-NEXT: GNUInline (SubjectMatchRule_function)
 // CHECK-NEXT: HIPManaged (SubjectMatchRule_variable)
+// CHECK-NEXT: HLSLVkLocation (SubjectMatchRule_variable_is_parameter, SubjectMatchRule_field, SubjectMatchRule_function)
 // CHECK-NEXT: Hot (SubjectMatchRule_function)
 // CHECK-NEXT: HybridPatchable (SubjectMatchRule_function)
 // CHECK-NEXT: IBAction (SubjectMatchRule_objc_method_is_instance)
@@ -110,6 +111,7 @@
 // CHECK-NEXT: Mips16 (SubjectMatchRule_function)
 // CHECK-NEXT: MipsLongCall (SubjectMatchRule_function)
 // CHECK-NEXT: MipsShortCall (SubjectMatchRule_function)
+// CHECK-NEXT: ModularFormat (SubjectMatchRule_function)
 // CHECK-NEXT: NSConsumed (SubjectMatchRule_variable_is_parameter)
 // CHECK-NEXT: NSConsumesSelf (SubjectMatchRule_objc_method)
 // CHECK-NEXT: NSErrorDomain (SubjectMatchRule_enum)
diff --git a/clang/test/Modules/GH170084.cpp b/clang/test/Modules/GH170084.cpp
new file mode 100644
index 0000000000000..950499467a6bb
--- /dev/null
+++ b/clang/test/Modules/GH170084.cpp
@@ -0,0 +1,75 @@
+// RUN: rm -rf %t
+// RUN: mkdir -p %t
+// RUN: split-file %s %t
+// RUN: cd %t
+
+// RUN: %clang_cc1 -fmodule-name=stl -fno-cxx-modules -emit-module -fmodules -xc++ stl.cppmap -o stl.pcm
+// RUN: %clang_cc1 -fmodule-name=d -fno-cxx-modules -emit-module -fmodules -fmodule-file=stl.pcm -xc++ d.cppmap -o d.pcm
+// RUN: %clang_cc1 -fmodule-name=b -fno-cxx-modules -emit-module -fmodules -fmodule-file=stl.pcm -xc++ b.cppmap -o b.pcm
+// RUN: %clang_cc1 -fmodule-name=a -fno-cxx-modules -emit-module -fmodules -fmodule-file=stl.pcm -fmodule-file=d.pcm -fmodule-file=b.pcm -xc++ a.cppmap -o a.pcm
+// RUN: %clang_cc1 -fno-cxx-modules -fmodules -fmodule-file=a.pcm -emit-llvm -o /dev/null main.cpp
+
+//--- a.cppmap
+module "a" {
+header "a.h"
+}
+
+//--- a.h
+#include "b.h"
+namespace {
+void a(absl::set<char> c) {
+  absl::set<int> b;
+  c.end();
+  c.contains();
+}
+}  // namespace
+
+//--- b.cppmap
+module "b" {
+header "b.h"
+}
+
+//--- b.h
+#include "c.h"
+void b() { absl::set<char> x; }
+
+//--- c.h
+#include "stl.h"
+namespace absl {
+template <typename>
+class set {
+ public:
+  struct iterator {
+    void u() const;
+  };
+  iterator end() const { return {}; }
+  void contains() const { end().u(); }
+  pair<iterator> e();
+};
+}  // namespace absl
+
+//--- d.cppmap
+module "d" {
+header "d.h"
+}
+
+//--- d.h
+#include "c.h"
+void d() { absl::set<char> x; }
+
+//--- stl.cppmap
+module "stl" {
+header "stl.h"
+}
+
+//--- stl.h
+#ifndef _STL_H_
+#define _STL_H_
+template <class>
+struct pair;
+#endif
+
+//--- main.cpp
+// expected-no-diagnostics
+#include "c.h"
+void f(absl::set<char> o) { o.contains(); }
diff --git a/clang/test/Modules/pr170235.cppm b/clang/test/Modules/pr170235.cppm
new file mode 100644
index 0000000000000..614c3abce3d61
--- /dev/null
+++ b/clang/test/Modules/pr170235.cppm
@@ -0,0 +1,32 @@
+// RUN: rm -rf %t
+// RUN: mkdir -p %t
+// RUN: split-file %s %t
+
+// RUN: %clang_cc1 -std=c++20 %t/lib.cppm -emit-module-interface -o %t/lib.pcm
+// RUN: %clang_cc1 -std=c++20 %t/main.cpp -fmodule-file=lib=%t/lib.pcm -fsyntax-only -verify
+//
+// RUN: %clang_cc1 -std=c++20 %t/lib.cppm -emit-reduced-module-interface -o %t/lib.pcm
+// RUN: %clang_cc1 -std=c++20 %t/main.cpp -fmodule-file=lib=%t/lib.pcm -fsyntax-only -verify
+
+//--- lib.cppm
+export module lib;
+namespace lib {
+    struct A;
+    // Definition comes BEFORE the class definition
+    int foo(const A &, int) { return 42; }
+
+    struct A {
+        // Friend declaration inside the class
+        friend int foo(const A &, int);
+    };
+
+    export A a{};
+}
+//--- main.cpp
+// expected-no-diagnostics
+import lib;
+int main() {
+    // Should be found via ADL since lib::a is of type lib::A
+    auto res1 = foo(lib::a, 1);
+    return 0;
+}
diff --git a/clang/test/OpenMP/amdgcn_weak_alias.c b/clang/test/OpenMP/amdgcn_weak_alias.c
index a9d5c1737b321..33c7dc0041810 100644
--- a/clang/test/OpenMP/amdgcn_weak_alias.c
+++ b/clang/test/OpenMP/amdgcn_weak_alias.c
@@ -94,10 +94,3 @@ int Three(void) __attribute__ ((weak, alias("__Three")));
 int Three_(void) __attribute__ ((alias("__Three")));
 extern int __attribute__((weak, alias("__Three_var"))) Three_var;
 extern int __attribute__((alias("__Three_var"))) Three_var_;
-//.
-// HOST: [[META0:![0-9]+]] = !{i32 1, !"__Two_var", i32 0, i32 0}
-// HOST: [[META1:![0-9]+]] = !{i32 1, !"__Three_var", i32 0, i32 1}
-//.
-// DEVICE: [[META0:![0-9]+]] = !{i32 1, !"__Two_var", i32 0, i32 0}
-// DEVICE: [[META1:![0-9]+]] = !{i32 1, !"__Three_var", i32 0, i32 1}
-//.
diff --git a/clang/test/OpenMP/cancel_codegen.cpp b/clang/test/OpenMP/cancel_codegen.cpp
index 16e7542a8e826..600aae211087a 100644
--- a/clang/test/OpenMP/cancel_codegen.cpp
+++ b/clang/test/OpenMP/cancel_codegen.cpp
@@ -774,10 +774,8 @@ for (int i = 0; i < argc; ++i) {
 // CHECK3-NEXT:    call void @__kmpc_barrier(ptr @[[GLOB2:[0-9]+]], i32 [[OMP_GLOBAL_THREAD_NUM12]])
 // CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_AFTER:%.*]]
 // CHECK3:       omp_section_loop.after:
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_AFTERSECTIONS_FINI:%.*]]
-// CHECK3:       omp_section_loop.aftersections.fini:
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_PREHEADER13:%.*]]
-// CHECK3:       omp_section_loop.preheader13:
+// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_PREHEADER16:%.*]]
+// CHECK3:       omp_section_loop.preheader16:
 // CHECK3-NEXT:    store i32 0, ptr [[P_LOWERBOUND29]], align 4
 // CHECK3-NEXT:    store i32 1, ptr [[P_UPPERBOUND30]], align 4
 // CHECK3-NEXT:    store i32 1, ptr [[P_STRIDE31]], align 4
@@ -787,54 +785,52 @@ for (int i = 0; i < argc; ++i) {
 // CHECK3-NEXT:    [[TMP10:%.*]] = load i32, ptr [[P_UPPERBOUND30]], align 4
 // CHECK3-NEXT:    [[TMP11:%.*]] = sub i32 [[TMP10]], [[TMP9]]
 // CHECK3-NEXT:    [[TMP12:%.*]] = add i32 [[TMP11]], 1
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_HEADER14:%.*]]
-// CHECK3:       omp_section_loop.header14:
-// CHECK3-NEXT:    [[OMP_SECTION_LOOP_IV20:%.*]] = phi i32 [ 0, [[OMP_SECTION_LOOP_PREHEADER13]] ], [ [[OMP_SECTION_LOOP_NEXT22:%.*]], [[OMP_SECTION_LOOP_INC17:%.*]] ]
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_COND15:%.*]]
-// CHECK3:       omp_section_loop.cond15:
+// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_HEADER17:%.*]]
+// CHECK3:       omp_section_loop.header17:
+// CHECK3-NEXT:    [[OMP_SECTION_LOOP_IV20:%.*]] = phi i32 [ 0, [[OMP_SECTION_LOOP_PREHEADER16]] ], [ [[OMP_SECTION_LOOP_NEXT22:%.*]], [[OMP_SECTION_LOOP_INC17:%.*]] ]
+// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_COND18:%.*]]
+// CHECK3:       omp_section_loop.cond18:
 // CHECK3-NEXT:    [[OMP_SECTION_LOOP_CMP21:%.*]] = icmp ult i32 [[OMP_SECTION_LOOP_IV20]], [[TMP12]]
-// CHECK3-NEXT:    br i1 [[OMP_SECTION_LOOP_CMP21]], label [[OMP_SECTION_LOOP_BODY16:%.*]], label [[OMP_SECTION_LOOP_EXIT18:%.*]]
-// CHECK3:       omp_section_loop.body16:
+// CHECK3-NEXT:    br i1 [[OMP_SECTION_LOOP_CMP21]], label [[OMP_SECTION_LOOP_BODY19:%.*]], label [[OMP_SECTION_LOOP_EXIT21:%.*]]
+// CHECK3:       omp_section_loop.body19:
 // CHECK3-NEXT:    [[TMP13:%.*]] = add i32 [[OMP_SECTION_LOOP_IV20]], [[TMP9]]
 // CHECK3-NEXT:    [[TMP14:%.*]] = mul i32 [[TMP13]], 1
 // CHECK3-NEXT:    [[TMP15:%.*]] = add i32 [[TMP14]], 0
 // CHECK3-NEXT:    switch i32 [[TMP15]], label [[OMP_SECTION_LOOP_BODY16_SECTIONS_AFTER:%.*]] [
-// CHECK3-NEXT:      i32 0, label [[OMP_SECTION_LOOP_BODY_CASE23:%.*]]
-// CHECK3-NEXT:      i32 1, label [[OMP_SECTION_LOOP_BODY_CASE25:%.*]]
+// CHECK3-NEXT:      i32 0, label [[OMP_SECTION_LOOP_BODY_CASE26:%.*]]
+// CHECK3-NEXT:      i32 1, label [[OMP_SECTION_LOOP_BODY_CASE29:%.*]]
 // CHECK3-NEXT:    ]
-// CHECK3:       omp_section_loop.body.case23:
+// CHECK3:       omp_section_loop.body.case26:
 // CHECK3-NEXT:    [[OMP_GLOBAL_THREAD_NUM24:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB1]])
 // CHECK3-NEXT:    [[TMP16:%.*]] = call i32 @__kmpc_cancel(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM24]], i32 3)
 // CHECK3-NEXT:    [[TMP17:%.*]] = icmp eq i32 [[TMP16]], 0
-// CHECK3-NEXT:    br i1 [[TMP17]], label [[OMP_SECTION_LOOP_BODY_CASE23_SPLIT:%.*]], label [[OMP_SECTION_LOOP_BODY_CASE23_CNCL:%.*]]
-// CHECK3:       omp_section_loop.body.case23.split:
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_BODY_CASE23_SECTION_AFTER:%.*]]
-// CHECK3:       omp_section_loop.body.case23.section.after:
+// CHECK3-NEXT:    br i1 [[TMP17]], label [[OMP_SECTION_LOOP_BODY_CASE26_SPLIT:%.*]], label [[OMP_SECTION_LOOP_BODY_CASE26_CNCL:%.*]]
+// CHECK3:       omp_section_loop.body.case26.split:
+// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_BODY_CASE26_SECTION_AFTER:%.*]]
+// CHECK3:       omp_section_loop.body.case26.section.after:
 // CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_BODY16_SECTIONS_AFTER]]
-// CHECK3:       omp_section_loop.body.case25:
+// CHECK3:       omp_section_loop.body.case29:
 // CHECK3-NEXT:    [[OMP_GLOBAL_THREAD_NUM27:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB1]])
 // CHECK3-NEXT:    [[TMP18:%.*]] = call i32 @__kmpc_cancel(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM27]], i32 3)
 // CHECK3-NEXT:    [[TMP19:%.*]] = icmp eq i32 [[TMP18]], 0
-// CHECK3-NEXT:    br i1 [[TMP19]], label [[OMP_SECTION_LOOP_BODY_CASE25_SPLIT:%.*]], label [[OMP_SECTION_LOOP_BODY_CASE25_CNCL:%.*]]
-// CHECK3:       omp_section_loop.body.case25.split:
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_BODY_CASE25_SECTION_AFTER26:%.*]]
-// CHECK3:       omp_section_loop.body.case25.section.after26:
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_BODY_CASE25_SECTION_AFTER:%.*]]
-// CHECK3:       omp_section_loop.body.case25.section.after:
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_BODY16_SECTIONS_AFTER]]
-// CHECK3:       omp_section_loop.body16.sections.after:
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_INC17]]
-// CHECK3:       omp_section_loop.inc17:
+// CHECK3-NEXT:    br i1 [[TMP19]], label [[OMP_SECTION_LOOP_BODY_CASE29_SPLIT:%.*]], label [[OMP_SECTION_LOOP_BODY_CASE29_CNCL:%.*]]
+// CHECK3:       omp_section_loop.body.case29.split:
+// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_BODY_CASE25_SECTION_AFTER29:%.*]]
+// CHECK3:       omp_section_loop.body.case29.section.after30:
+// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_BODY_CASE29_SECTION_AFTER:%.*]]
+// CHECK3:       omp_section_loop.body.case29.section.after:
+// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_BODY19_SECTIONS_AFTER:.*]]
+// CHECK3:       omp_section_loop.body19.sections.after:
+// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_INC20:.*]]
+// CHECK3:       omp_section_loop.inc20:
 // CHECK3-NEXT:    [[OMP_SECTION_LOOP_NEXT22]] = add nuw i32 [[OMP_SECTION_LOOP_IV20]], 1
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_HEADER14]]
-// CHECK3:       omp_section_loop.exit18:
+// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_HEADER17]]
+// CHECK3:       omp_section_loop.exit21:
 // CHECK3-NEXT:    call void @__kmpc_for_static_fini(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM32]])
 // CHECK3-NEXT:    [[OMP_GLOBAL_THREAD_NUM33:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB1]])
 // CHECK3-NEXT:    call void @__kmpc_barrier(ptr @[[GLOB2]], i32 [[OMP_GLOBAL_THREAD_NUM33]])
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_AFTER19:%.*]]
-// CHECK3:       omp_section_loop.after19:
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_AFTER19SECTIONS_FINI:%.*]]
-// CHECK3:       omp_section_loop.after19sections.fini:
+// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_AFTER22:%.*]]
+// CHECK3:       omp_section_loop.after22:
 // CHECK3-NEXT:    [[TMP20:%.*]] = load i32, ptr [[ARGC_ADDR]], align 4
 // CHECK3-NEXT:    store i32 [[TMP20]], ptr [[DOTCAPTURE_EXPR_]], align 4
 // CHECK3-NEXT:    [[TMP21:%.*]] = load i32, ptr [[DOTCAPTURE_EXPR_]], align 4
@@ -891,11 +887,11 @@ for (int i = 0; i < argc; ++i) {
 // CHECK3:       .cancel.exit:
 // CHECK3-NEXT:    br label [[CANCEL_EXIT:%.*]]
 // CHECK3:       omp_section_loop.body.case.cncl:
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_EXIT]]
-// CHECK3:       omp_section_loop.body.case23.cncl:
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_EXIT18]]
-// CHECK3:       omp_section_loop.body.case25.cncl:
-// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_EXIT18]]
+// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_EXIT:.*]]
+// CHECK3:       omp_section_loop.body.case26.cncl:
+// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_EXIT18:.*]]
+// CHECK3:       omp_section_loop.body.case29.cncl:
+// CHECK3-NEXT:    br label [[OMP_SECTION_LOOP_EXIT21:.*]]
 // CHECK3:       .cancel.continue:
 // CHECK3-NEXT:    br label [[OMP_IF_END:%.*]]
 // CHECK3:       omp_if.else:
@@ -954,8 +950,17 @@ for (int i = 0; i < argc; ++i) {
 // CHECK3-NEXT:    [[TOBOOL:%.*]] = fcmp une float [[TMP2]], 0.000000e+00
 // CHECK3-NEXT:    br i1 [[TOBOOL]], label [[TMP14:%.*]], label [[TMP3:%.*]]
 // CHECK3:       3:
-// CHECK3-NEXT:    br label [[TMP4:%.*]]
-// CHECK3:       4:
+// CHECK3-NEXT:    %[[GTN:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
+// CHECK3-NEXT:    %[[CANCEL_POINT:.*]] = call i32 @__kmpc_cancellationpoint(ptr @1, i32 %[[GTN]], i32 1)
+// CHECK3-NEXT:    %[[COND:.*]] = icmp eq i32 %[[CANCEL_POINT]], 0
+// CHECK3-NEXT:    br i1 %[[COND]], label %[[SPLIT:.*]], label %[[CNCL:.*]]
+// CHECK3:       .cncl:
+// CHECK3-NEXT:    br label %[[FINI:.*]]
+// CHECK3:       .fini:
+// CHECK3-NEXT:    br label %[[EXIT_STUB:omp.par.exit.exitStub]]
+// CHECK3:       .split:
+// CHECK3-NEXT:    br label [[TMP6:%.*]]
+// CHECK3:       6:
 // CHECK3-NEXT:    [[TMP5:%.*]] = load i32, ptr [[LOADGEP_ARGC_ADDR]], align 4
 // CHECK3-NEXT:    [[CONV:%.*]] = trunc i32 [[TMP5]] to i8
 // CHECK3-NEXT:    [[TMP6:%.*]] = load ptr, ptr [[LOADGEP_ARGV_ADDR]], align 8
@@ -967,8 +972,8 @@ for (int i = 0; i < argc; ++i) {
 // CHECK3-NEXT:    [[TMP8:%.*]] = call i32 @__kmpc_cancel_barrier(ptr @[[GLOB3:[0-9]+]], i32 [[OMP_GLOBAL_THREAD_NUM4]])
 // CHECK3-NEXT:    [[TMP9:%.*]] = icmp eq i32 [[TMP8]], 0
 // CHECK3-NEXT:    br i1 [[TMP9]], label [[DOTCONT:%.*]], label [[DOTCNCL5:%.*]]
-// CHECK3:       .cncl5:
-// CHECK3-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT_EXITSTUB:%.*]]
+// CHECK3:       .cncl7:
+// CHECK3-NEXT:    br label %[[FINI]]
 // CHECK3:       .cont:
 // CHECK3-NEXT:    [[TMP10:%.*]] = load i32, ptr [[LOADGEP_ARGC_ADDR]], align 4
 // CHECK3-NEXT:    [[TMP11:%.*]] = load ptr, ptr [[LOADGEP_ARGV_ADDR]], align 8
@@ -984,18 +989,16 @@ for (int i = 0; i < argc; ++i) {
 // CHECK3:       omp.par.region.parallel.after:
 // CHECK3-NEXT:    br label [[OMP_PAR_PRE_FINALIZE:%.*]]
 // CHECK3:       omp.par.pre_finalize:
-// CHECK3-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT_EXITSTUB]]
-// CHECK3:       14:
+// CHECK3-NEXT:    br label %[[FINI]]
+// CHECK3:       16:
 // CHECK3-NEXT:    [[OMP_GLOBAL_THREAD_NUM1:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB1]])
 // CHECK3-NEXT:    [[TMP15:%.*]] = call i32 @__kmpc_cancel(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM1]], i32 1)
 // CHECK3-NEXT:    [[TMP16:%.*]] = icmp eq i32 [[TMP15]], 0
 // CHECK3-NEXT:    br i1 [[TMP16]], label [[DOTSPLIT:%.*]], label [[DOTCNCL:%.*]]
-// CHECK3:       .cncl:
-// CHECK3-NEXT:    [[OMP_GLOBAL_THREAD_NUM2:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB1]])
-// CHECK3-NEXT:    [[TMP17:%.*]] = call i32 @__kmpc_cancel_barrier(ptr @[[GLOB2]], i32 [[OMP_GLOBAL_THREAD_NUM2]])
-// CHECK3-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT_EXITSTUB]]
-// CHECK3:       .split:
-// CHECK3-NEXT:    br label [[TMP4]]
+// CHECK3:       .cncl4:
+// CHECK3-NEXT:    br label %[[FINI]]
+// CHECK3:       .split3:
+// CHECK3-NEXT:    br label {{.+}}
 // CHECK3:       omp.par.exit.exitStub:
 // CHECK3-NEXT:    ret void
 //
@@ -1089,7 +1092,7 @@ for (int i = 0; i < argc; ++i) {
 // CHECK3:       .omp.sections.case.split:
 // CHECK3-NEXT:    br label [[DOTOMP_SECTIONS_EXIT]]
 // CHECK3:       .omp.sections.case.cncl:
-// CHECK3-NEXT:    br label [[CANCEL_CONT:%.*]]
+// CHECK3-NEXT:    br label [[FINI:%.*]]
 // CHECK3:       .omp.sections.exit:
 // CHECK3-NEXT:    br label [[OMP_INNER_FOR_INC:%.*]]
 // CHECK3:       omp.inner.for.inc:
@@ -1100,7 +1103,7 @@ for (int i = 0; i < argc; ++i) {
 // CHECK3:       omp.inner.for.end:
 // CHECK3-NEXT:    [[OMP_GLOBAL_THREAD_NUM3:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB19:[0-9]+]])
 // CHECK3-NEXT:    call void @__kmpc_for_static_fini(ptr @[[GLOB15]], i32 [[OMP_GLOBAL_THREAD_NUM3]])
-// CHECK3-NEXT:    br label [[CANCEL_CONT]]
+// CHECK3-NEXT:    br label [[CANCEL_CONT:.*]]
 // CHECK3:       cancel.cont:
 // CHECK3-NEXT:    ret void
 // CHECK3:       cancel.exit:
@@ -1153,6 +1156,8 @@ for (int i = 0; i < argc; ++i) {
 // CHECK3:       .omp.sections.case.split:
 // CHECK3-NEXT:    br label [[DOTOMP_SECTIONS_EXIT]]
 // CHECK3:       .omp.sections.case.cncl:
+// CHECK3-NEXT:    br label [[DOTFINI:%.*]]
+// CHECK3:       .fini:
 // CHECK3-NEXT:    br label [[CANCEL_CONT:%.*]]
 // CHECK3:       .omp.sections.case2:
 // CHECK3-NEXT:    [[OMP_GLOBAL_THREAD_NUM3:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB1]])
@@ -1162,9 +1167,11 @@ for (int i = 0; i < argc; ++i) {
 // CHECK3:       .omp.sections.case2.split:
 // CHECK3-NEXT:    br label [[DOTOMP_SECTIONS_CASE2_SECTION_AFTER:%.*]]
 // CHECK3:       .omp.sections.case2.section.after:
-// CHECK3-NEXT:    br label [[DOTOMP_SECTIONS_EXIT]]
+// CHECK3-NEXT:    br label [[OMP_REGION_FINALIZE:.*]]
+// CHECK3:       omp_region.finalize:
+// CHECK3-NEXT:    br label [[OMP_SECTIONS_EXIT:.*]]
 // CHECK3:       .omp.sections.case2.cncl:
-// CHECK3-NEXT:    br label [[OMP_INNER_FOR_END]]
+// CHECK3-NEXT:    br label [[FINI:.*]]
 // CHECK3:       .omp.sections.exit:
 // CHECK3-NEXT:    br label [[OMP_INNER_FOR_INC:%.*]]
 // CHECK3:       omp.inner.for.inc:
diff --git a/clang/test/OpenMP/critical_codegen.cpp b/clang/test/OpenMP/critical_codegen.cpp
index 5c752d354804b..9620613dfdb87 100644
--- a/clang/test/OpenMP/critical_codegen.cpp
+++ b/clang/test/OpenMP/critical_codegen.cpp
@@ -35,6 +35,8 @@ int main() {
 // ALL-NEXT:  			store i8 2, ptr [[A_ADDR]]
 // IRBUILDER-NEXT:		br label %[[AFTER:[^ ,]+]]
 // IRBUILDER:			[[AFTER]]
+// IRBUILDER-NEXT:		br label %[[OMP_REGION_FINALIZE:[^ ,]+]]
+// IRBUILDER:			[[OMP_REGION_FINALIZE]]
 // ALL-NEXT:  			call {{.*}}void @__kmpc_end_critical(ptr [[DEFAULT_LOC]], i32 [[GTID]], ptr [[UNNAMED_LOCK]])
 #pragma omp critical
   a = 2;
diff --git a/clang/test/OpenMP/critical_codegen_attr.cpp b/clang/test/OpenMP/critical_codegen_attr.cpp
index 32482a92e76b8..50b0b04fcfd4a 100644
--- a/clang/test/OpenMP/critical_codegen_attr.cpp
+++ b/clang/test/OpenMP/critical_codegen_attr.cpp
@@ -35,6 +35,8 @@ int main() {
 // ALL-NEXT:  			store i8 2, ptr [[A_ADDR]]
 // IRBUILDER-NEXT:		br label %[[AFTER:[^ ,]+]]
 // IRBUILDER:			[[AFTER]]
+// IRBUILDER-NEXT:		br label %[[OMP_REGION_FINALIZE:[^ ,]+]]
+// IRBUILDER:			[[OMP_REGION_FINALIZE]]
 // ALL-NEXT:  			call {{.*}}void @__kmpc_end_critical(ptr [[DEFAULT_LOC]], i32 [[GTID]], ptr [[UNNAMED_LOCK]])
   [[omp::directive(critical)]]
   a = 2;
diff --git a/clang/test/OpenMP/irbuilder_nested_parallel_for.c b/clang/test/OpenMP/irbuilder_nested_parallel_for.c
index 5cc5640a5173b..56cf9644de5ed 100644
--- a/clang/test/OpenMP/irbuilder_nested_parallel_for.c
+++ b/clang/test/OpenMP/irbuilder_nested_parallel_for.c
@@ -449,7 +449,7 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-NEXT:    call void @__captured_stmt.19(ptr [[DOTCOUNT_ADDR188]], ptr [[AGG_CAPTURED186]])
 // CHECK-NEXT:    [[DOTCOUNT189:%.*]] = load i32, ptr [[DOTCOUNT_ADDR188]], align 4
 // CHECK-NEXT:    br label [[OMP_LOOP_PREHEADER190:%.*]]
-// CHECK:       omp_loop.preheader187:
+// CHECK:       omp_loop.preheader190:
 // CHECK-NEXT:    store i32 0, ptr [[P_LOWERBOUND204]], align 4
 // CHECK-NEXT:    [[TMP3:%.*]] = sub i32 [[DOTCOUNT189]], 1
 // CHECK-NEXT:    store i32 [[TMP3]], ptr [[P_UPPERBOUND205]], align 4
@@ -461,13 +461,13 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-NEXT:    [[TMP6:%.*]] = sub i32 [[TMP5]], [[TMP4]]
 // CHECK-NEXT:    [[TMP7:%.*]] = add i32 [[TMP6]], 1
 // CHECK-NEXT:    br label [[OMP_LOOP_HEADER191:%.*]]
-// CHECK:       omp_loop.header188:
+// CHECK:       omp_loop.header191:
 // CHECK-NEXT:    [[OMP_LOOP_IV197:%.*]] = phi i32 [ 0, [[OMP_LOOP_PREHEADER190]] ], [ [[OMP_LOOP_NEXT199:%.*]], [[OMP_LOOP_INC194:%.*]] ]
 // CHECK-NEXT:    br label [[OMP_LOOP_COND192:%.*]]
-// CHECK:       omp_loop.cond189:
+// CHECK:       omp_loop.cond192:
 // CHECK-NEXT:    [[OMP_LOOP_CMP198:%.*]] = icmp ult i32 [[OMP_LOOP_IV197]], [[TMP7]]
 // CHECK-NEXT:    br i1 [[OMP_LOOP_CMP198]], label [[OMP_LOOP_BODY193:%.*]], label [[OMP_LOOP_EXIT195:%.*]]
-// CHECK:       omp_loop.body190:
+// CHECK:       omp_loop.body193:
 // CHECK-NEXT:    [[TMP8:%.*]] = add i32 [[OMP_LOOP_IV197]], [[TMP4]]
 // CHECK-NEXT:    call void @__captured_stmt.20(ptr [[I185]], i32 [[TMP8]], ptr [[AGG_CAPTURED187]])
 // CHECK-NEXT:    [[TMP9:%.*]] = load i32, ptr [[A_ADDR]], align 4
@@ -478,15 +478,15 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-NEXT:    [[TMP11:%.*]] = load ptr, ptr [[R_ADDR]], align 8
 // CHECK-NEXT:    store float [[CONV202]], ptr [[TMP11]], align 4
 // CHECK-NEXT:    br label [[OMP_LOOP_INC194]]
-// CHECK:       omp_loop.inc191:
+// CHECK:       omp_loop.inc194:
 // CHECK-NEXT:    [[OMP_LOOP_NEXT199]] = add nuw i32 [[OMP_LOOP_IV197]], 1
 // CHECK-NEXT:    br label [[OMP_LOOP_HEADER191]]
-// CHECK:       omp_loop.exit192:
+// CHECK:       omp_loop.exit195:
 // CHECK-NEXT:    call void @__kmpc_for_static_fini(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM207]])
 // CHECK-NEXT:    [[OMP_GLOBAL_THREAD_NUM208:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB1]])
 // CHECK-NEXT:    call void @__kmpc_barrier(ptr @[[GLOB2]], i32 [[OMP_GLOBAL_THREAD_NUM208]])
 // CHECK-NEXT:    br label [[OMP_LOOP_AFTER196:%.*]]
-// CHECK:       omp_loop.after193:
+// CHECK:       omp_loop.after196:
 // CHECK-NEXT:    ret void
 //
 //
@@ -576,7 +576,7 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-NEXT:    call void @__captured_stmt.17(ptr [[DOTCOUNT_ADDR163]], ptr [[AGG_CAPTURED161]])
 // CHECK-NEXT:    [[DOTCOUNT164:%.*]] = load i32, ptr [[DOTCOUNT_ADDR163]], align 4
 // CHECK-NEXT:    br label [[OMP_LOOP_PREHEADER165:%.*]]
-// CHECK:       omp_loop.preheader163:
+// CHECK:       omp_loop.preheader165:
 // CHECK-NEXT:    store i32 0, ptr [[P_LOWERBOUND179]], align 4
 // CHECK-NEXT:    [[TMP13:%.*]] = sub i32 [[DOTCOUNT164]], 1
 // CHECK-NEXT:    store i32 [[TMP13]], ptr [[P_UPPERBOUND180]], align 4
@@ -588,24 +588,24 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-NEXT:    [[TMP16:%.*]] = sub i32 [[TMP15]], [[TMP14]]
 // CHECK-NEXT:    [[TMP17:%.*]] = add i32 [[TMP16]], 1
 // CHECK-NEXT:    br label [[OMP_LOOP_HEADER166:%.*]]
-// CHECK:       omp_loop.header164:
+// CHECK:       omp_loop.header166:
 // CHECK-NEXT:    [[OMP_LOOP_IV172:%.*]] = phi i32 [ 0, [[OMP_LOOP_PREHEADER165]] ], [ [[OMP_LOOP_NEXT174:%.*]], [[OMP_LOOP_INC169:%.*]] ]
 // CHECK-NEXT:    br label [[OMP_LOOP_COND167:%.*]]
-// CHECK:       omp_loop.cond165:
+// CHECK:       omp_loop.cond167:
 // CHECK-NEXT:    [[OMP_LOOP_CMP173:%.*]] = icmp ult i32 [[OMP_LOOP_IV172]], [[TMP17]]
 // CHECK-NEXT:    br i1 [[OMP_LOOP_CMP173]], label [[OMP_LOOP_BODY168:%.*]], label [[OMP_LOOP_EXIT170:%.*]]
-// CHECK:       omp_loop.exit168:
+// CHECK:       omp_loop.exit170:
 // CHECK-NEXT:    call void @__kmpc_for_static_fini(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM182]])
 // CHECK-NEXT:    [[OMP_GLOBAL_THREAD_NUM183:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB1]])
 // CHECK-NEXT:    call void @__kmpc_barrier(ptr @[[GLOB2]], i32 [[OMP_GLOBAL_THREAD_NUM183]])
 // CHECK-NEXT:    br label [[OMP_LOOP_AFTER171:%.*]]
-// CHECK:       omp_loop.after169:
+// CHECK:       omp_loop.after171:
 // CHECK-NEXT:    br label [[OMP_PAR_REGION_PARALLEL_AFTER:%.*]]
 // CHECK:       omp.par.region.parallel.after:
 // CHECK-NEXT:    br label [[OMP_PAR_PRE_FINALIZE:%.*]]
 // CHECK:       omp.par.pre_finalize:
 // CHECK-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT184_EXITSTUB:%.*]]
-// CHECK:       omp_loop.body166:
+// CHECK:       omp_loop.body168:
 // CHECK-NEXT:    [[TMP18:%.*]] = add i32 [[OMP_LOOP_IV172]], [[TMP14]]
 // CHECK-NEXT:    call void @__captured_stmt.18(ptr [[I160]], i32 [[TMP18]], ptr [[AGG_CAPTURED162]])
 // CHECK-NEXT:    [[TMP19:%.*]] = load i32, ptr [[LOADGEP_A_ADDR]], align 4
@@ -616,7 +616,7 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-NEXT:    [[TMP21:%.*]] = load ptr, ptr [[LOADGEP_R_ADDR]], align 8
 // CHECK-NEXT:    store float [[CONV177]], ptr [[TMP21]], align 4
 // CHECK-NEXT:    br label [[OMP_LOOP_INC169]]
-// CHECK:       omp_loop.inc167:
+// CHECK:       omp_loop.inc169:
 // CHECK-NEXT:    [[OMP_LOOP_NEXT174]] = add nuw i32 [[OMP_LOOP_IV172]], 1
 // CHECK-NEXT:    br label [[OMP_LOOP_HEADER166]]
 // CHECK:       omp_loop.body:
@@ -758,7 +758,7 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK:       omp_loop.after86:
 // CHECK-NEXT:    [[OMP_GLOBAL_THREAD_NUM99:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB1]])
 // CHECK-NEXT:    br label [[OMP_PARALLEL213:%.*]]
-// CHECK:       omp_parallel210:
+// CHECK:       omp_parallel213:
 // CHECK-NEXT:    [[GEP_A_ADDR210:%.*]] = getelementptr { ptr, ptr, ptr }, ptr [[STRUCTARG209]], i32 0, i32 0
 // CHECK-NEXT:    store ptr [[LOADGEP_A_ADDR]], ptr [[GEP_A_ADDR210]], align 8
 // CHECK-NEXT:    [[GEP_B_ADDR211:%.*]] = getelementptr { ptr, ptr, ptr }, ptr [[STRUCTARG209]], i32 0, i32 1
@@ -777,7 +777,7 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-NEXT:    call void @__captured_stmt.15(ptr [[DOTCOUNT_ADDR138]], ptr [[AGG_CAPTURED136]])
 // CHECK-NEXT:    [[DOTCOUNT139:%.*]] = load i32, ptr [[DOTCOUNT_ADDR138]], align 4
 // CHECK-NEXT:    br label [[OMP_LOOP_PREHEADER140:%.*]]
-// CHECK:       omp_loop.preheader139:
+// CHECK:       omp_loop.preheader140:
 // CHECK-NEXT:    store i32 0, ptr [[P_LOWERBOUND154]], align 4
 // CHECK-NEXT:    [[TMP21:%.*]] = sub i32 [[DOTCOUNT139]], 1
 // CHECK-NEXT:    store i32 [[TMP21]], ptr [[P_UPPERBOUND155]], align 4
@@ -789,24 +789,26 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-NEXT:    [[TMP24:%.*]] = sub i32 [[TMP23]], [[TMP22]]
 // CHECK-NEXT:    [[TMP25:%.*]] = add i32 [[TMP24]], 1
 // CHECK-NEXT:    br label [[OMP_LOOP_HEADER141:%.*]]
-// CHECK:       omp_loop.header140:
+// CHECK:       omp_loop.header141:
 // CHECK-NEXT:    [[OMP_LOOP_IV147:%.*]] = phi i32 [ 0, [[OMP_LOOP_PREHEADER140]] ], [ [[OMP_LOOP_NEXT149:%.*]], [[OMP_LOOP_INC144:%.*]] ]
 // CHECK-NEXT:    br label [[OMP_LOOP_COND142:%.*]]
-// CHECK:       omp_loop.cond141:
+// CHECK:       omp_loop.cond142:
 // CHECK-NEXT:    [[OMP_LOOP_CMP148:%.*]] = icmp ult i32 [[OMP_LOOP_IV147]], [[TMP25]]
 // CHECK-NEXT:    br i1 [[OMP_LOOP_CMP148]], label [[OMP_LOOP_BODY143:%.*]], label [[OMP_LOOP_EXIT145:%.*]]
-// CHECK:       omp_loop.exit144:
+// CHECK:       omp_loop.exit145:
 // CHECK-NEXT:    call void @__kmpc_for_static_fini(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM157]])
 // CHECK-NEXT:    [[OMP_GLOBAL_THREAD_NUM158:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB1]])
 // CHECK-NEXT:    call void @__kmpc_barrier(ptr @[[GLOB2]], i32 [[OMP_GLOBAL_THREAD_NUM158]])
 // CHECK-NEXT:    br label [[OMP_LOOP_AFTER146:%.*]]
-// CHECK:       omp_loop.after145:
+// CHECK:       omp_loop.after146:
 // CHECK-NEXT:    br label [[OMP_PAR_REGION9_PARALLEL_AFTER:%.*]]
 // CHECK:       omp.par.region9.parallel.after:
 // CHECK-NEXT:    br label [[OMP_PAR_PRE_FINALIZE10:%.*]]
 // CHECK:       omp.par.pre_finalize10:
-// CHECK-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT159_EXITSTUB:%.*]]
-// CHECK:       omp_loop.body142:
+// CHECK-NEXT:    br label [[FINI159:%.*]]
+// CHECK:       .fini159:
+// CHECK-NEXT:    br label [[OMP_PAR_EXIT11_EXITSTUB:%.*]]
+// CHECK:       omp_loop.body143:
 // CHECK-NEXT:    [[TMP26:%.*]] = add i32 [[OMP_LOOP_IV147]], [[TMP22]]
 // CHECK-NEXT:    call void @__captured_stmt.16(ptr [[I135]], i32 [[TMP26]], ptr [[AGG_CAPTURED137]])
 // CHECK-NEXT:    [[TMP27:%.*]] = load i32, ptr [[LOADGEP_A_ADDR]], align 4
@@ -817,7 +819,7 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-NEXT:    [[TMP29:%.*]] = load ptr, ptr [[LOADGEP_R_ADDR]], align 8
 // CHECK-NEXT:    store float [[CONV152]], ptr [[TMP29]], align 4
 // CHECK-NEXT:    br label [[OMP_LOOP_INC144]]
-// CHECK:       omp_loop.inc143:
+// CHECK:       omp_loop.inc144:
 // CHECK-NEXT:    [[OMP_LOOP_NEXT149]] = add nuw i32 [[OMP_LOOP_IV147]], 1
 // CHECK-NEXT:    br label [[OMP_LOOP_HEADER141]]
 // CHECK:       omp_loop.body83:
@@ -1557,6 +1559,8 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG:       omp.par.region.parallel.after:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_PRE_FINALIZE:%.*]]
 // CHECK-DEBUG:       omp.par.pre_finalize:
+// CHECK-DEBUG-NEXT:    br label [[FINI:.*]]
+// CHECK-DEBUG:       .fini:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT_EXITSTUB:%.*]], !dbg [[DBG30]]
 // CHECK-DEBUG:       omp_loop.body:
 // CHECK-DEBUG-NEXT:    [[TMP9:%.*]] = add i32 [[OMP_LOOP_IV]], [[TMP5]], !dbg [[DBG29]]
@@ -1700,6 +1704,8 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG:       omp.par.region.parallel.after:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_PRE_FINALIZE:%.*]]
 // CHECK-DEBUG:       omp.par.pre_finalize:
+// CHECK-DEBUG-NEXT:    br label [[FINI16:%.*]]
+// CHECK-DEBUG:       .fini16:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT16_EXITSTUB:%.*]], !dbg [[DBG92]]
 // CHECK-DEBUG:       omp.par.exit.exitStub:
 // CHECK-DEBUG-NEXT:    ret void
@@ -1769,6 +1775,8 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG:       omp.par.region5.parallel.after:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_PRE_FINALIZE6:%.*]]
 // CHECK-DEBUG:       omp.par.pre_finalize6:
+// CHECK-DEBUG-NEXT:    br label [[FINI:%.*]]
+// CHECK-DEBUG:       .fini:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT_EXITSTUB:%.*]], !dbg [[DBG103]]
 // CHECK-DEBUG:       omp_loop.body:
 // CHECK-DEBUG-NEXT:    [[TMP10:%.*]] = add i32 [[OMP_LOOP_IV]], [[TMP6]], !dbg [[DBG102]]
@@ -1899,7 +1907,7 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG-NEXT:    call void @__captured_stmt.19(ptr [[DOTCOUNT_ADDR188]], ptr [[AGG_CAPTURED186]]), !dbg [[DBG148]]
 // CHECK-DEBUG-NEXT:    [[DOTCOUNT189:%.*]] = load i32, ptr [[DOTCOUNT_ADDR188]], align 4, !dbg [[DBG148]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_PREHEADER190:%.*]], !dbg [[DBG148]]
-// CHECK-DEBUG:       omp_loop.preheader187:
+// CHECK-DEBUG:       omp_loop.preheader190:
 // CHECK-DEBUG-NEXT:    store i32 0, ptr [[P_LOWERBOUND204]], align 4, !dbg [[DBG148]]
 // CHECK-DEBUG-NEXT:    [[TMP3:%.*]] = sub i32 [[DOTCOUNT189]], 1, !dbg [[DBG148]]
 // CHECK-DEBUG-NEXT:    store i32 [[TMP3]], ptr [[P_UPPERBOUND205]], align 4, !dbg [[DBG148]]
@@ -1911,13 +1919,13 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG-NEXT:    [[TMP6:%.*]] = sub i32 [[TMP5]], [[TMP4]], !dbg [[DBG148]]
 // CHECK-DEBUG-NEXT:    [[TMP7:%.*]] = add i32 [[TMP6]], 1, !dbg [[DBG148]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_HEADER191:%.*]], !dbg [[DBG148]]
-// CHECK-DEBUG:       omp_loop.header188:
+// CHECK-DEBUG:       omp_loop.header191:
 // CHECK-DEBUG-NEXT:    [[OMP_LOOP_IV197:%.*]] = phi i32 [ 0, [[OMP_LOOP_PREHEADER190]] ], [ [[OMP_LOOP_NEXT199:%.*]], [[OMP_LOOP_INC194:%.*]] ], !dbg [[DBG148]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_COND192:%.*]], !dbg [[DBG148]]
-// CHECK-DEBUG:       omp_loop.cond189:
+// CHECK-DEBUG:       omp_loop.cond192:
 // CHECK-DEBUG-NEXT:    [[OMP_LOOP_CMP198:%.*]] = icmp ult i32 [[OMP_LOOP_IV197]], [[TMP7]], !dbg [[DBG148]]
 // CHECK-DEBUG-NEXT:    br i1 [[OMP_LOOP_CMP198]], label [[OMP_LOOP_BODY193:%.*]], label [[OMP_LOOP_EXIT195:%.*]], !dbg [[DBG148]]
-// CHECK-DEBUG:       omp_loop.body190:
+// CHECK-DEBUG:       omp_loop.body193:
 // CHECK-DEBUG-NEXT:    [[TMP8:%.*]] = add i32 [[OMP_LOOP_IV197]], [[TMP4]], !dbg [[DBG150:![0-9]+]]
 // CHECK-DEBUG-NEXT:    call void @__captured_stmt.20(ptr [[I185]], i32 [[TMP8]], ptr [[AGG_CAPTURED187]]), !dbg [[DBG148]]
 // CHECK-DEBUG-NEXT:    [[TMP9:%.*]] = load i32, ptr [[A_ADDR]], align 4, !dbg [[DBG151:![0-9]+]]
@@ -1928,15 +1936,15 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG-NEXT:    [[TMP11:%.*]] = load ptr, ptr [[R_ADDR]], align 8, !dbg [[DBG153:![0-9]+]]
 // CHECK-DEBUG-NEXT:    store float [[CONV202]], ptr [[TMP11]], align 4, !dbg [[DBG154:![0-9]+]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_INC194]], !dbg [[DBG148]]
-// CHECK-DEBUG:       omp_loop.inc191:
+// CHECK-DEBUG:       omp_loop.inc194:
 // CHECK-DEBUG-NEXT:    [[OMP_LOOP_NEXT199]] = add nuw i32 [[OMP_LOOP_IV197]], 1, !dbg [[DBG148]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_HEADER191]], !dbg [[DBG148]]
-// CHECK-DEBUG:       omp_loop.exit192:
+// CHECK-DEBUG:       omp_loop.exit195:
 // CHECK-DEBUG-NEXT:    call void @__kmpc_for_static_fini(ptr @[[GLOB42]], i32 [[OMP_GLOBAL_THREAD_NUM207]]), !dbg [[DBG148]]
 // CHECK-DEBUG-NEXT:    [[OMP_GLOBAL_THREAD_NUM208:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB42]]), !dbg [[DBG150]]
 // CHECK-DEBUG-NEXT:    call void @__kmpc_barrier(ptr @[[GLOB43:[0-9]+]], i32 [[OMP_GLOBAL_THREAD_NUM208]]), !dbg [[DBG150]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_AFTER196:%.*]], !dbg [[DBG148]]
-// CHECK-DEBUG:       omp_loop.after193:
+// CHECK-DEBUG:       omp_loop.after196:
 // CHECK-DEBUG-NEXT:    ret void, !dbg [[DBG155:![0-9]+]]
 //
 //
@@ -2031,7 +2039,7 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG-NEXT:    call void @__captured_stmt.17(ptr [[DOTCOUNT_ADDR163]], ptr [[AGG_CAPTURED161]]), !dbg [[DBG174]]
 // CHECK-DEBUG-NEXT:    [[DOTCOUNT164:%.*]] = load i32, ptr [[DOTCOUNT_ADDR163]], align 4, !dbg [[DBG174]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_PREHEADER165:%.*]], !dbg [[DBG174]]
-// CHECK-DEBUG:       omp_loop.preheader163:
+// CHECK-DEBUG:       omp_loop.preheader165:
 // CHECK-DEBUG-NEXT:    store i32 0, ptr [[P_LOWERBOUND179]], align 4, !dbg [[DBG174]]
 // CHECK-DEBUG-NEXT:    [[TMP13:%.*]] = sub i32 [[DOTCOUNT164]], 1, !dbg [[DBG174]]
 // CHECK-DEBUG-NEXT:    store i32 [[TMP13]], ptr [[P_UPPERBOUND180]], align 4, !dbg [[DBG174]]
@@ -2043,24 +2051,26 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG-NEXT:    [[TMP16:%.*]] = sub i32 [[TMP15]], [[TMP14]], !dbg [[DBG174]]
 // CHECK-DEBUG-NEXT:    [[TMP17:%.*]] = add i32 [[TMP16]], 1, !dbg [[DBG174]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_HEADER166:%.*]], !dbg [[DBG174]]
-// CHECK-DEBUG:       omp_loop.header164:
+// CHECK-DEBUG:       omp_loop.header166:
 // CHECK-DEBUG-NEXT:    [[OMP_LOOP_IV172:%.*]] = phi i32 [ 0, [[OMP_LOOP_PREHEADER165]] ], [ [[OMP_LOOP_NEXT174:%.*]], [[OMP_LOOP_INC169:%.*]] ], !dbg [[DBG174]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_COND167:%.*]], !dbg [[DBG174]]
-// CHECK-DEBUG:       omp_loop.cond165:
+// CHECK-DEBUG:       omp_loop.cond167:
 // CHECK-DEBUG-NEXT:    [[OMP_LOOP_CMP173:%.*]] = icmp ult i32 [[OMP_LOOP_IV172]], [[TMP17]], !dbg [[DBG174]]
 // CHECK-DEBUG-NEXT:    br i1 [[OMP_LOOP_CMP173]], label [[OMP_LOOP_BODY168:%.*]], label [[OMP_LOOP_EXIT170:%.*]], !dbg [[DBG174]]
-// CHECK-DEBUG:       omp_loop.exit168:
+// CHECK-DEBUG:       omp_loop.exit170:
 // CHECK-DEBUG-NEXT:    call void @__kmpc_for_static_fini(ptr @[[GLOB39]], i32 [[OMP_GLOBAL_THREAD_NUM182]]), !dbg [[DBG174]]
 // CHECK-DEBUG-NEXT:    [[OMP_GLOBAL_THREAD_NUM183:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB39]]), !dbg [[DBG176:![0-9]+]]
 // CHECK-DEBUG-NEXT:    call void @__kmpc_barrier(ptr @[[GLOB40:[0-9]+]], i32 [[OMP_GLOBAL_THREAD_NUM183]]), !dbg [[DBG176]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_AFTER171:%.*]], !dbg [[DBG174]]
-// CHECK-DEBUG:       omp_loop.after169:
+// CHECK-DEBUG:       omp_loop.after171:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_REGION_PARALLEL_AFTER:%.*]], !dbg [[DBG177:![0-9]+]]
 // CHECK-DEBUG:       omp.par.region.parallel.after:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_PRE_FINALIZE:%.*]]
 // CHECK-DEBUG:       omp.par.pre_finalize:
+// CHECK-DEBUG-NEXT:    br label [[FINI184:%.*]]
+// CHECK-DEBUG:       .fini184:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT184_EXITSTUB:%.*]], !dbg [[DBG177]]
-// CHECK-DEBUG:       omp_loop.body166:
+// CHECK-DEBUG:       omp_loop.body168:
 // CHECK-DEBUG-NEXT:    [[TMP18:%.*]] = add i32 [[OMP_LOOP_IV172]], [[TMP14]], !dbg [[DBG176]]
 // CHECK-DEBUG-NEXT:    call void @__captured_stmt.18(ptr [[I160]], i32 [[TMP18]], ptr [[AGG_CAPTURED162]]), !dbg [[DBG174]]
 // CHECK-DEBUG-NEXT:    [[TMP19:%.*]] = load i32, ptr [[LOADGEP_A_ADDR]], align 4, !dbg [[DBG178:![0-9]+]]
@@ -2071,7 +2081,7 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG-NEXT:    [[TMP21:%.*]] = load ptr, ptr [[LOADGEP_R_ADDR]], align 8, !dbg [[DBG180:![0-9]+]]
 // CHECK-DEBUG-NEXT:    store float [[CONV177]], ptr [[TMP21]], align 4, !dbg [[DBG181:![0-9]+]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_INC169]], !dbg [[DBG174]]
-// CHECK-DEBUG:       omp_loop.inc167:
+// CHECK-DEBUG:       omp_loop.inc169:
 // CHECK-DEBUG-NEXT:    [[OMP_LOOP_NEXT174]] = add nuw i32 [[OMP_LOOP_IV172]], 1, !dbg [[DBG174]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_HEADER166]], !dbg [[DBG174]]
 // CHECK-DEBUG:       omp_loop.body:
@@ -2218,7 +2228,7 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG:       omp_loop.after86:
 // CHECK-DEBUG-NEXT:    [[OMP_GLOBAL_THREAD_NUM99:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB31:[0-9]+]]), !dbg [[DBG208:![0-9]+]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_PARALLEL213:%.*]]
-// CHECK-DEBUG:       omp_parallel210:
+// CHECK-DEBUG:       omp_parallel213:
 // CHECK-DEBUG-NEXT:    [[GEP_A_ADDR210:%.*]] = getelementptr { ptr, ptr, ptr }, ptr [[STRUCTARG209]], i32 0, i32 0
 // CHECK-DEBUG-NEXT:    store ptr [[LOADGEP_A_ADDR]], ptr [[GEP_A_ADDR210]], align 8
 // CHECK-DEBUG-NEXT:    [[GEP_B_ADDR211:%.*]] = getelementptr { ptr, ptr, ptr }, ptr [[STRUCTARG209]], i32 0, i32 1
@@ -2238,7 +2248,7 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG-NEXT:    call void @__captured_stmt.15(ptr [[DOTCOUNT_ADDR138]], ptr [[AGG_CAPTURED136]]), !dbg [[DBG217]]
 // CHECK-DEBUG-NEXT:    [[DOTCOUNT139:%.*]] = load i32, ptr [[DOTCOUNT_ADDR138]], align 4, !dbg [[DBG217]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_PREHEADER140:%.*]], !dbg [[DBG217]]
-// CHECK-DEBUG:       omp_loop.preheader139:
+// CHECK-DEBUG:       omp_loop.preheader140:
 // CHECK-DEBUG-NEXT:    store i32 0, ptr [[P_LOWERBOUND154]], align 4, !dbg [[DBG217]]
 // CHECK-DEBUG-NEXT:    [[TMP21:%.*]] = sub i32 [[DOTCOUNT139]], 1, !dbg [[DBG217]]
 // CHECK-DEBUG-NEXT:    store i32 [[TMP21]], ptr [[P_UPPERBOUND155]], align 4, !dbg [[DBG217]]
@@ -2250,24 +2260,26 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG-NEXT:    [[TMP24:%.*]] = sub i32 [[TMP23]], [[TMP22]], !dbg [[DBG217]]
 // CHECK-DEBUG-NEXT:    [[TMP25:%.*]] = add i32 [[TMP24]], 1, !dbg [[DBG217]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_HEADER141:%.*]], !dbg [[DBG217]]
-// CHECK-DEBUG:       omp_loop.header140:
+// CHECK-DEBUG:       omp_loop.header141:
 // CHECK-DEBUG-NEXT:    [[OMP_LOOP_IV147:%.*]] = phi i32 [ 0, [[OMP_LOOP_PREHEADER140]] ], [ [[OMP_LOOP_NEXT149:%.*]], [[OMP_LOOP_INC144:%.*]] ], !dbg [[DBG217]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_COND142:%.*]], !dbg [[DBG217]]
-// CHECK-DEBUG:       omp_loop.cond141:
+// CHECK-DEBUG:       omp_loop.cond142:
 // CHECK-DEBUG-NEXT:    [[OMP_LOOP_CMP148:%.*]] = icmp ult i32 [[OMP_LOOP_IV147]], [[TMP25]], !dbg [[DBG217]]
 // CHECK-DEBUG-NEXT:    br i1 [[OMP_LOOP_CMP148]], label [[OMP_LOOP_BODY143:%.*]], label [[OMP_LOOP_EXIT145:%.*]], !dbg [[DBG217]]
-// CHECK-DEBUG:       omp_loop.exit144:
+// CHECK-DEBUG:       omp_loop.exit145:
 // CHECK-DEBUG-NEXT:    call void @__kmpc_for_static_fini(ptr @[[GLOB36]], i32 [[OMP_GLOBAL_THREAD_NUM157]]), !dbg [[DBG217]]
 // CHECK-DEBUG-NEXT:    [[OMP_GLOBAL_THREAD_NUM158:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB36]]), !dbg [[DBG219:![0-9]+]]
 // CHECK-DEBUG-NEXT:    call void @__kmpc_barrier(ptr @[[GLOB37:[0-9]+]], i32 [[OMP_GLOBAL_THREAD_NUM158]]), !dbg [[DBG219]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_AFTER146:%.*]], !dbg [[DBG217]]
-// CHECK-DEBUG:       omp_loop.after145:
+// CHECK-DEBUG:       omp_loop.after146:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_REGION9_PARALLEL_AFTER:%.*]], !dbg [[DBG220:![0-9]+]]
 // CHECK-DEBUG:       omp.par.region9.parallel.after:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_PRE_FINALIZE10:%.*]]
 // CHECK-DEBUG:       omp.par.pre_finalize10:
+// CHECK-DEBUG-NEXT:    br label [[FINI159:%.*]]
+// CHECK-DEBUG:       .fini159:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT159_EXITSTUB:%.*]], !dbg [[DBG220]]
-// CHECK-DEBUG:       omp_loop.body142:
+// CHECK-DEBUG:       omp_loop.body143:
 // CHECK-DEBUG-NEXT:    [[TMP26:%.*]] = add i32 [[OMP_LOOP_IV147]], [[TMP22]], !dbg [[DBG219]]
 // CHECK-DEBUG-NEXT:    call void @__captured_stmt.16(ptr [[I135]], i32 [[TMP26]], ptr [[AGG_CAPTURED137]]), !dbg [[DBG217]]
 // CHECK-DEBUG-NEXT:    [[TMP27:%.*]] = load i32, ptr [[LOADGEP_A_ADDR]], align 4, !dbg [[DBG221:![0-9]+]]
@@ -2278,7 +2290,7 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG-NEXT:    [[TMP29:%.*]] = load ptr, ptr [[LOADGEP_R_ADDR]], align 8, !dbg [[DBG223:![0-9]+]]
 // CHECK-DEBUG-NEXT:    store float [[CONV152]], ptr [[TMP29]], align 4, !dbg [[DBG224:![0-9]+]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_INC144]], !dbg [[DBG217]]
-// CHECK-DEBUG:       omp_loop.inc143:
+// CHECK-DEBUG:       omp_loop.inc144:
 // CHECK-DEBUG-NEXT:    [[OMP_LOOP_NEXT149]] = add nuw i32 [[OMP_LOOP_IV147]], 1, !dbg [[DBG217]]
 // CHECK-DEBUG-NEXT:    br label [[OMP_LOOP_HEADER141]], !dbg [[DBG217]]
 // CHECK-DEBUG:       omp_loop.body83:
@@ -2375,8 +2387,8 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG:       omp_loop.after121:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_REGION103_PARALLEL_AFTER:%.*]], !dbg [[DBG244:![0-9]+]]
 // CHECK-DEBUG:       omp.par.region103.parallel.after:
-// CHECK-DEBUG-NEXT:    br label [[OMP_PAR_PRE_FINALIZE104:%.*]]
-// CHECK-DEBUG:       omp.par.pre_finalize104:
+// CHECK-DEBUG-NEXT:    br label [[FINI134:%.*]]
+// CHECK-DEBUG:       .fini134:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT134_EXITSTUB:%.*]], !dbg [[DBG244]]
 // CHECK-DEBUG:       omp_loop.body118:
 // CHECK-DEBUG-NEXT:    [[TMP10:%.*]] = add i32 [[OMP_LOOP_IV122]], [[TMP6]], !dbg [[DBG243]]
@@ -2460,6 +2472,8 @@ void parallel_for_2(float *r, int a, double b) {
 // CHECK-DEBUG:       omp.par.region44.parallel.after:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_PRE_FINALIZE45:%.*]]
 // CHECK-DEBUG:       omp.par.pre_finalize45:
+// CHECK-DEBUG-NEXT:    br label [[FINI:%.*]]
+// CHECK-DEBUG:       .fini:
 // CHECK-DEBUG-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT_EXITSTUB:%.*]], !dbg [[DBG260]]
 // CHECK-DEBUG:       omp_loop.body59:
 // CHECK-DEBUG-NEXT:    [[TMP10:%.*]] = add i32 [[OMP_LOOP_IV63]], [[TMP6]], !dbg [[DBG259]]
diff --git a/clang/test/OpenMP/masked_codegen.cpp b/clang/test/OpenMP/masked_codegen.cpp
index a39de12d69337..bc6f68de9b248 100644
--- a/clang/test/OpenMP/masked_codegen.cpp
+++ b/clang/test/OpenMP/masked_codegen.cpp
@@ -35,6 +35,8 @@ int main() {
 // ALL-NEXT:  			store i8 2, ptr [[A_ADDR]]
 // IRBUILDER-NEXT:		br label %[[AFTER:[^ ,]+]]
 // IRBUILDER:			[[AFTER]]
+// IRBUILDER-NEXT:		br label %[[OMP_REGION_FINALIZE:[^ ,]+]]
+// IRBUILDER:			[[OMP_REGION_FINALIZE]]
 // ALL-NEXT:  			call {{.*}}void @__kmpc_end_masked(ptr [[DEFAULT_LOC]], i32 [[GTID]])
 // ALL-NEXT:  			br label {{%?}}[[EXIT]]
 // ALL:       			[[EXIT]]
diff --git a/clang/test/OpenMP/master_codegen.cpp b/clang/test/OpenMP/master_codegen.cpp
index a7af326caacfe..5a92444d9a927 100644
--- a/clang/test/OpenMP/master_codegen.cpp
+++ b/clang/test/OpenMP/master_codegen.cpp
@@ -35,6 +35,8 @@ int main() {
 // ALL-NEXT:  			store i8 2, ptr [[A_ADDR]]
 // IRBUILDER-NEXT:		br label %[[AFTER:[^ ,]+]]
 // IRBUILDER:			[[AFTER]]
+// IRBUILDER-NEXT:		br label %[[OMP_REGION_FINALIZE:[^ ,]+]]
+// IRBUILDER:			[[OMP_REGION_FINALIZE]]
 // ALL-NEXT:  			call {{.*}}void @__kmpc_end_master(ptr [[DEFAULT_LOC]], i32 [[GTID]])
 // ALL-NEXT:  			br label {{%?}}[[EXIT]]
 // ALL:       			[[EXIT]]
diff --git a/clang/test/OpenMP/nested_loop_codegen.cpp b/clang/test/OpenMP/nested_loop_codegen.cpp
index 9aefc6a739e51..e01fd0da31ee8 100644
--- a/clang/test/OpenMP/nested_loop_codegen.cpp
+++ b/clang/test/OpenMP/nested_loop_codegen.cpp
@@ -904,6 +904,8 @@ int inline_decl() {
 // CHECK4:       omp.par.region.parallel.after:
 // CHECK4-NEXT:    br label [[OMP_PAR_PRE_FINALIZE:%.*]]
 // CHECK4:       omp.par.pre_finalize:
+// CHECK4-NEXT:    br label [[FINI:%.*]]
+// CHECK4:       .fini:
 // CHECK4-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT_EXITSTUB:%.*]], !dbg [[DBG27]]
 // CHECK4:       for.body:
 // CHECK4-NEXT:    store i32 0, ptr [[LOADGEP_K]], align 4, !dbg [[DBG28:![0-9]+]]
@@ -1083,6 +1085,8 @@ int inline_decl() {
 // CHECK4:       omp.par.region.parallel.after:
 // CHECK4-NEXT:    br label [[OMP_PAR_PRE_FINALIZE:%.*]]
 // CHECK4:       omp.par.pre_finalize:
+// CHECK4-NEXT:    br label [[FINI:%.*]]
+// CHECK4:       .fini:
 // CHECK4-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT_EXITSTUB:%.*]], !dbg [[DBG90]]
 // CHECK4:       for.body:
 // CHECK4-NEXT:      #dbg_declare(ptr [[K]], [[META91:![0-9]+]], !DIExpression(), [[META95:![0-9]+]])
diff --git a/clang/test/OpenMP/ordered_codegen.cpp b/clang/test/OpenMP/ordered_codegen.cpp
index 5cd95f1927e5c..3b29feac7caa2 100644
--- a/clang/test/OpenMP/ordered_codegen.cpp
+++ b/clang/test/OpenMP/ordered_codegen.cpp
@@ -794,6 +794,8 @@ void foo_simd(int low, int up) {
 // CHECK1-IRBUILDER-NEXT:    store float [[MUL8]], ptr [[ARRAYIDX10]], align 4
 // CHECK1-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_BODY_ORDERED_AFTER:%.*]]
 // CHECK1-IRBUILDER:       omp.inner.for.body.ordered.after:
+// CHECK1-IRBUILDER-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+// CHECK1-IRBUILDER:       omp_region.finalize:
 // CHECK1-IRBUILDER-NEXT:    call void @__kmpc_end_ordered(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM2]])
 // CHECK1-IRBUILDER-NEXT:    br label [[OMP_BODY_CONTINUE:%.*]]
 // CHECK1-IRBUILDER:       omp.body.continue:
@@ -884,6 +886,8 @@ void foo_simd(int low, int up) {
 // CHECK1-IRBUILDER-NEXT:    store float [[MUL7]], ptr [[ARRAYIDX8]], align 4
 // CHECK1-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_BODY_ORDERED_AFTER:%.*]]
 // CHECK1-IRBUILDER:       omp.inner.for.body.ordered.after:
+// CHECK1-IRBUILDER-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+// CHECK1-IRBUILDER:       omp_region.finalize:
 // CHECK1-IRBUILDER-NEXT:    call void @__kmpc_end_ordered(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM3]])
 // CHECK1-IRBUILDER-NEXT:    br label [[OMP_BODY_CONTINUE:%.*]]
 // CHECK1-IRBUILDER:       omp.body.continue:
@@ -1022,6 +1026,8 @@ void foo_simd(int low, int up) {
 // CHECK1-IRBUILDER-NEXT:    store float [[MUL29]], ptr [[ARRAYIDX31]], align 4
 // CHECK1-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_BODY_ORDERED_AFTER:%.*]]
 // CHECK1-IRBUILDER:       omp.inner.for.body.ordered.after:
+// CHECK1-IRBUILDER-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+// CHECK1-IRBUILDER:       omp_region.finalize:
 // CHECK1-IRBUILDER-NEXT:    call void @__kmpc_end_ordered(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM23]])
 // CHECK1-IRBUILDER-NEXT:    br label [[OMP_BODY_CONTINUE:%.*]]
 // CHECK1-IRBUILDER:       omp.body.continue:
@@ -1131,6 +1137,8 @@ void foo_simd(int low, int up) {
 // CHECK1-IRBUILDER-NEXT:    store float [[MUL14]], ptr [[ARRAYIDX16]], align 4
 // CHECK1-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_BODY_ORDERED_AFTER:%.*]]
 // CHECK1-IRBUILDER:       omp.inner.for.body.ordered.after:
+// CHECK1-IRBUILDER-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+// CHECK1-IRBUILDER:       omp_region.finalize:
 // CHECK1-IRBUILDER-NEXT:    call void @__kmpc_end_ordered(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM8]])
 // CHECK1-IRBUILDER-NEXT:    br label [[OMP_BODY_CONTINUE:%.*]]
 // CHECK1-IRBUILDER:       omp.body.continue:
@@ -1296,17 +1304,19 @@ void foo_simd(int low, int up) {
 // CHECK1-IRBUILDER-NEXT:    call void @__captured_stmt.1(ptr [[I28]])
 // CHECK1-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_BODY33_ORDERED_AFTER:%.*]]
 // CHECK1-IRBUILDER:       omp.inner.for.body33.ordered.after:
-// CHECK1-IRBUILDER-NEXT:    br label [[OMP_BODY_CONTINUE38:%.*]]
-// CHECK1-IRBUILDER:       omp.body.continue38:
-// CHECK1-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_INC39:%.*]]
-// CHECK1-IRBUILDER:       omp.inner.for.inc39:
+// CHECK1-IRBUILDER-NEXT:    br label [[OMP_REGION_FINALIZE38:%.*]]
+// CHECK1-IRBUILDER:       omp_region.finalize38:
+// CHECK1-IRBUILDER-NEXT:    br label [[OMP_BODY_CONTINUE39:%.*]]
+// CHECK1-IRBUILDER:       omp.body.continue39:
+// CHECK1-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_INC40:%.*]]
+// CHECK1-IRBUILDER:       omp.inner.for.inc40:
 // CHECK1-IRBUILDER-NEXT:    [[TMP32:%.*]] = load i32, ptr [[DOTOMP_IV16]], align 4
 // CHECK1-IRBUILDER-NEXT:    [[ADD40:%.*]] = add i32 [[TMP32]], 1
 // CHECK1-IRBUILDER-NEXT:    store i32 [[ADD40]], ptr [[DOTOMP_IV16]], align 4
 // CHECK1-IRBUILDER-NEXT:    [[OMP_GLOBAL_THREAD_NUM41:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB12]])
 // CHECK1-IRBUILDER-NEXT:    call void @__kmpc_dispatch_fini_4u(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM41]])
 // CHECK1-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_COND30]], !llvm.loop [[LOOP5:![0-9]+]]
-// CHECK1-IRBUILDER:       omp.inner.for.end42:
+// CHECK1-IRBUILDER:       omp.inner.for.end43:
 // CHECK1-IRBUILDER-NEXT:    br label [[OMP_DISPATCH_INC:%.*]]
 // CHECK1-IRBUILDER:       omp.dispatch.inc:
 // CHECK1-IRBUILDER-NEXT:    br label [[OMP_DISPATCH_COND]]
@@ -2034,6 +2044,8 @@ void foo_simd(int low, int up) {
 // CHECK3-IRBUILDER-NEXT:    store float [[MUL8]], ptr [[ARRAYIDX10]], align 4
 // CHECK3-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_BODY_ORDERED_AFTER:%.*]]
 // CHECK3-IRBUILDER:       omp.inner.for.body.ordered.after:
+// CHECK3-IRBUILDER-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+// CHECK3-IRBUILDER:       omp_region.finalize:
 // CHECK3-IRBUILDER-NEXT:    call void @__kmpc_end_ordered(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM2]])
 // CHECK3-IRBUILDER-NEXT:    br label [[OMP_BODY_CONTINUE:%.*]]
 // CHECK3-IRBUILDER:       omp.body.continue:
@@ -2124,6 +2136,8 @@ void foo_simd(int low, int up) {
 // CHECK3-IRBUILDER-NEXT:    store float [[MUL7]], ptr [[ARRAYIDX8]], align 4
 // CHECK3-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_BODY_ORDERED_AFTER:%.*]]
 // CHECK3-IRBUILDER:       omp.inner.for.body.ordered.after:
+// CHECK3-IRBUILDER-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+// CHECK3-IRBUILDER:       omp_region.finalize:
 // CHECK3-IRBUILDER-NEXT:    call void @__kmpc_end_ordered(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM3]])
 // CHECK3-IRBUILDER-NEXT:    br label [[OMP_BODY_CONTINUE:%.*]]
 // CHECK3-IRBUILDER:       omp.body.continue:
@@ -2262,6 +2276,8 @@ void foo_simd(int low, int up) {
 // CHECK3-IRBUILDER-NEXT:    store float [[MUL29]], ptr [[ARRAYIDX31]], align 4
 // CHECK3-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_BODY_ORDERED_AFTER:%.*]]
 // CHECK3-IRBUILDER:       omp.inner.for.body.ordered.after:
+// CHECK3-IRBUILDER-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+// CHECK3-IRBUILDER:       omp_region.finalize:
 // CHECK3-IRBUILDER-NEXT:    call void @__kmpc_end_ordered(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM23]])
 // CHECK3-IRBUILDER-NEXT:    br label [[OMP_BODY_CONTINUE:%.*]]
 // CHECK3-IRBUILDER:       omp.body.continue:
@@ -2371,6 +2387,8 @@ void foo_simd(int low, int up) {
 // CHECK3-IRBUILDER-NEXT:    store float [[MUL14]], ptr [[ARRAYIDX16]], align 4
 // CHECK3-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_BODY_ORDERED_AFTER:%.*]]
 // CHECK3-IRBUILDER:       omp.inner.for.body.ordered.after:
+// CHECK3-IRBUILDER-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+// CHECK3-IRBUILDER:       omp_region.finalize:
 // CHECK3-IRBUILDER-NEXT:    call void @__kmpc_end_ordered(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM8]])
 // CHECK3-IRBUILDER-NEXT:    br label [[OMP_BODY_CONTINUE:%.*]]
 // CHECK3-IRBUILDER:       omp.body.continue:
@@ -2536,17 +2554,19 @@ void foo_simd(int low, int up) {
 // CHECK3-IRBUILDER-NEXT:    call void @__captured_stmt.1(ptr [[I28]])
 // CHECK3-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_BODY33_ORDERED_AFTER:%.*]]
 // CHECK3-IRBUILDER:       omp.inner.for.body33.ordered.after:
-// CHECK3-IRBUILDER-NEXT:    br label [[OMP_BODY_CONTINUE38:%.*]]
-// CHECK3-IRBUILDER:       omp.body.continue38:
-// CHECK3-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_INC39:%.*]]
-// CHECK3-IRBUILDER:       omp.inner.for.inc39:
+// CHECK3-IRBUILDER-NEXT:    br label [[OMP_REGION_FINALIZE38:%.*]]
+// CHECK3-IRBUILDER:       omp_region.finalize38:
+// CHECK3-IRBUILDER-NEXT:    br label [[OMP_BODY_CONTINUE39:%.*]]
+// CHECK3-IRBUILDER:       omp.body.continue39:
+// CHECK3-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_INC40:%.*]]
+// CHECK3-IRBUILDER:       omp.inner.for.inc40:
 // CHECK3-IRBUILDER-NEXT:    [[TMP32:%.*]] = load i32, ptr [[DOTOMP_IV16]], align 4
 // CHECK3-IRBUILDER-NEXT:    [[ADD40:%.*]] = add i32 [[TMP32]], 1
 // CHECK3-IRBUILDER-NEXT:    store i32 [[ADD40]], ptr [[DOTOMP_IV16]], align 4
 // CHECK3-IRBUILDER-NEXT:    [[OMP_GLOBAL_THREAD_NUM41:%.*]] = call i32 @__kmpc_global_thread_num(ptr @[[GLOB12]])
 // CHECK3-IRBUILDER-NEXT:    call void @__kmpc_dispatch_fini_4u(ptr @[[GLOB1]], i32 [[OMP_GLOBAL_THREAD_NUM41]])
 // CHECK3-IRBUILDER-NEXT:    br label [[OMP_INNER_FOR_COND30]], !llvm.loop [[LOOP5:![0-9]+]]
-// CHECK3-IRBUILDER:       omp.inner.for.end42:
+// CHECK3-IRBUILDER:       omp.inner.for.end43:
 // CHECK3-IRBUILDER-NEXT:    br label [[OMP_DISPATCH_INC:%.*]]
 // CHECK3-IRBUILDER:       omp.dispatch.inc:
 // CHECK3-IRBUILDER-NEXT:    br label [[OMP_DISPATCH_COND]]
diff --git a/clang/test/OpenMP/parallel_codegen.cpp b/clang/test/OpenMP/parallel_codegen.cpp
index e8e57aedaa164..9f6004e37db9c 100644
--- a/clang/test/OpenMP/parallel_codegen.cpp
+++ b/clang/test/OpenMP/parallel_codegen.cpp
@@ -906,6 +906,8 @@ int main (int argc, char **argv) {
 // CHECK4:       omp.par.region.parallel.after:
 // CHECK4-NEXT:    br label [[OMP_PAR_PRE_FINALIZE:%.*]]
 // CHECK4:       omp.par.pre_finalize:
+// CHECK4-NEXT:    br label [[FINI:%.*]]
+// CHECK4:       .fini:
 // CHECK4-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT_EXITSTUB:%.*]], !dbg [[DBG35]]
 // CHECK4:       omp.par.exit.exitStub:
 // CHECK4-NEXT:    ret void
@@ -975,6 +977,8 @@ int main (int argc, char **argv) {
 // CHECK4:       omp.par.region.parallel.after:
 // CHECK4-NEXT:    br label [[OMP_PAR_PRE_FINALIZE:%.*]]
 // CHECK4:       omp.par.pre_finalize:
+// CHECK4-NEXT:    br label [[FINI:%.*]]
+// CHECK4:       .fini:
 // CHECK4-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT_EXITSTUB:%.*]], !dbg [[DBG66]]
 // CHECK4:       omp.par.exit.exitStub:
 // CHECK4-NEXT:    ret void
diff --git a/clang/test/OpenMP/target_update_codegen.cpp b/clang/test/OpenMP/target_update_codegen.cpp
index c8211f475c7fc..6c754c1c953ea 100644
--- a/clang/test/OpenMP/target_update_codegen.cpp
+++ b/clang/test/OpenMP/target_update_codegen.cpp
@@ -1560,5 +1560,37 @@ void foo(int arg) {
   { ++arg; }
 }
 
+#endif
+// RUN: %clang_cc1 -DCK26 -verify -Wno-vla -fopenmp -fopenmp-version=51 -fopenmp-targets=powerpc64le-ibm-linux-gnu -x c++ -triple powerpc64le-unknown-unknown -emit-llvm %s -o - | FileCheck %s --check-prefix CK26 --check-prefix CK26-64
+// RUN: %clang_cc1 -DCK26 -fopenmp -fopenmp-version=51 -fopenmp-targets=powerpc64le-ibm-linux-gnu -x c++ -std=c++11 -triple powerpc64le-unknown-unknown -emit-pch -o %t %s
+// RUN: %clang_cc1 -fopenmp -fopenmp-version=51 -fopenmp-targets=powerpc64le-ibm-linux-gnu -x c++ -triple powerpc64le-unknown-unknown -std=c++11 -include-pch %t -verify -Wno-vla %s -emit-llvm -o - | FileCheck %s  --check-prefix CK26 --check-prefix CK26-64
+// RUN: %clang_cc1 -DCK26 -fopenmp-version=51 -verify -Wno-vla -fopenmp -fopenmp-targets=i386-pc-linux-gnu -x c++ -triple i386-unknown-unknown -emit-llvm %s -o - | FileCheck %s  --check-prefix CK26 --check-prefix CK26-32
+// RUN: %clang_cc1 -DCK26 -fopenmp -fopenmp-version=51 -fopenmp-targets=i386-pc-linux-gnu -x c++ -std=c++11 -triple i386-unknown-unknown -emit-pch -o %t %s
+// RUN: %clang_cc1 -fopenmp -fopenmp-version=51 -fopenmp-targets=i386-pc-linux-gnu -x c++ -triple i386-unknown-unknown -std=c++11 -include-pch %t -verify -Wno-vla %s -emit-llvm -o - | FileCheck %s  --check-prefix CK26 --check-prefix CK26-32
+
+// RUN: %clang_cc1 -DCK26 -verify -Wno-vla -fopenmp-simd -fopenmp-version=51 -fopenmp-targets=powerpc64le-ibm-linux-gnu -x c++ -triple powerpc64le-unknown-unknown -emit-llvm %s -o - | FileCheck --check-prefix SIMD-ONLY19 %s
+// RUN: %clang_cc1 -DCK26 -fopenmp-simd -fopenmp-version=51 -fopenmp-targets=powerpc64le-ibm-linux-gnu -x c++ -std=c++11 -triple powerpc64le-unknown-unknown -emit-pch -o %t %s
+// RUN: %clang_cc1 -fopenmp-simd -fopenmp-version=51 -fopenmp-targets=powerpc64le-ibm-linux-gnu -x c++ -triple powerpc64le-unknown-unknown -std=c++11 -include-pch %t -verify -Wno-vla %s -emit-llvm -o - | FileCheck --check-prefix SIMD-ONLY19 %s
+// RUN: %clang_cc1 -DCK26 -verify -Wno-vla -fopenmp-simd -fopenmp-version=51 -fopenmp-targets=i386-pc-linux-gnu -x c++ -triple i386-unknown-unknown -emit-llvm %s -o - | FileCheck --check-prefix SIMD-ONLY19 %s
+// RUN: %clang_cc1 -DCK26 -fopenmp-simd -fopenmp-version=51 -fopenmp-targets=i386-pc-linux-gnu -x c++ -std=c++11 -triple i386-unknown-unknown -emit-pch -o %t %s
+// RUN: %clang_cc1 -fopenmp-simd -fopenmp-version=51 -fopenmp-targets=i386-pc-linux-gnu -x c++ -triple i386-unknown-unknown -std=c++11 -include-pch %t -verify -Wno-vla %s -emit-llvm -o - | FileCheck --check-prefix SIMD-ONLY19 %s
+// SIMD-ONLY19-NOT: {{__kmpc|__tgt}}
+#ifdef CK26
+void foo() {
+int a[10];
+#pragma omp target update to(iterator(int it = 0:10) : a[it])
+// CK26-LABEL: define {{.+}}foo
+// CK26: %[[ITER:[a-zA-Z0-9_]+]] = alloca i32, align 4
+// CK26: %[[LOAD2:.*]] = load i32, ptr %[[ITER]], align 4
+}
+
+void foo1() {
+int a[10];
+#pragma omp target update from(iterator(int it = 0:10) : a[it])
+// CK26-LABEL: define {{.+}}foo1
+// CK26: %[[ITER:[a-zA-Z0-9_]+]] = alloca i32, align 4
+// CK26: %[[LOAD2:.*]] = load i32, ptr %[[ITER]], align 4
+}
+
 #endif
 #endif
diff --git a/clang/test/OpenMP/target_update_iterator_ast_print.cpp b/clang/test/OpenMP/target_update_iterator_ast_print.cpp
new file mode 100644
index 0000000000000..322f565c9c732
--- /dev/null
+++ b/clang/test/OpenMP/target_update_iterator_ast_print.cpp
@@ -0,0 +1,16 @@
+// RUN: %clang_cc1 -verify -fopenmp -fopenmp-version=51 -ast-print %s | FileCheck %s
+// expected-no-diagnostics
+
+#ifndef HEADER
+#define HEADER
+
+void test() {
+  int a[10];
+  #pragma omp target update to(iterator(int it = 0:10): a[it]) 
+  // CHECK:   int a[10];
+  // CHECK: #pragma omp target update to(iterator(int it = 0:10): a[it])
+  #pragma omp target update from(iterator(int it = 0:10): a[it]) 
+  // CHECK: #pragma omp target update from(iterator(int it = 0:10): a[it])
+}
+
+#endif
diff --git a/clang/test/OpenMP/target_update_iterator_serialization.cpp b/clang/test/OpenMP/target_update_iterator_serialization.cpp
new file mode 100644
index 0000000000000..c1ad380f7c9a5
--- /dev/null
+++ b/clang/test/OpenMP/target_update_iterator_serialization.cpp
@@ -0,0 +1,35 @@
+// Test without serialization:
+// RUN: %clang_cc1 -std=c++20 -fopenmp  %s -ast-dump | FileCheck %s
+
+// Test with serialization:
+// RUN: %clang_cc1 -std=c++20 -fopenmp  -emit-pch -o %t %s
+// RUN: %clang_cc1 -x c++ -std=c++20 -fopenmp -include-pch %t -ast-dump-all /dev/null  \
+// RUN:   | sed -e "s/ <undeserialized declarations>//" -e "s/ imported//" \
+// RUN:   | FileCheck %s
+
+// CHECK: OMPTargetUpdateDirective
+// CHECK-NEXT: OMPFromClause
+// CHECK-NEXT: ArraySubscriptExpr
+// CHECK: DeclRefExpr {{.*}} 'a'
+// CHECK: DeclRefExpr {{.*}} 'it'
+
+
+void foo1() {
+  int a[10];
+
+#pragma omp target update from(iterator(int it = 0:10) : a[it])
+  ;
+}
+
+// CHECK: OMPTargetUpdateDirective
+// CHECK-NEXT: OMPToClause
+// CHECK-NEXT: ArraySubscriptExpr
+// CHECK: DeclRefExpr {{.*}} 'a'
+// CHECK: DeclRefExpr {{.*}} 'it'
+
+void foo2() {
+  int a[10];
+
+#pragma omp target update to(iterator(int it = 0:10) : a[it])
+  ;
+}
diff --git a/clang/test/Parser/cxx-nested-name-spec.cpp b/clang/test/Parser/cxx-nested-name-spec.cpp
new file mode 100644
index 0000000000000..3a551a4f2221f
--- /dev/null
+++ b/clang/test/Parser/cxx-nested-name-spec.cpp
@@ -0,0 +1,10 @@
+// RUN: %clang_cc1 -fsyntax-only -verify %s
+
+namespace a { b c ( a:c::
+// expected-error@-1 {{unknown type name 'b'}}
+// expected-error@-2 {{unexpected ':' in nested name specifier; did you mean '::'?}}
+// expected-error@-3 {{no member named 'c' in namespace 'a'}}
+// expected-error@-4 {{expected ';' after top level declarator}}
+// expected-note@-5 {{to match this '{'}}
+// expected-error@+1 {{expected unqualified-id}} \
+// expected-error@+1 {{expected '}'}}
diff --git a/clang/test/Preprocessor/header-shadowing.c b/clang/test/Preprocessor/header-shadowing.c
new file mode 100644
index 0000000000000..c6d90d6f7760e
--- /dev/null
+++ b/clang/test/Preprocessor/header-shadowing.c
@@ -0,0 +1,57 @@
+// RUN: rm -rf %t
+// RUN: split-file %s %t
+
+/// Check that:
+/// - Quoted includes ("...") trigger the diagnostic.
+/// - System headers are ignored.
+/// - #include_next does not cause a duplicate warning.
+// RUN: %clang_cc1 -Wshadow-header -Eonly %t/main.c -I %t/include1 -I %t/include2 \
+// RUN: -isystem %t/system1 -isystem %t/system2 2>&1 | FileCheck %s --check-prefix=SHADOWING
+
+// SHADOWING: {{.*}} warning: multiple candidates for header 'header.h' found; directory '{{.*}}include1' chosen, ignoring others including '{{.*}}include2' [-Wshadow-header]
+// SHADOWING: warning: include1/header.h included!
+// SHADOWING-NOT: {{.*}} warning: multiple candidates for header 'header.h' found; directory '{{.*}}include2' chosen, ignoring others including '{{.*}}include1' [-Wshadow-header]
+// SHADOWING: warning: include2/header.h included!
+// SHADOWING-NOT: {{.*}} warning: multiple candidates for header 'stdio.h' found; directory '{{.*}}system1' chosen, ignoring others including '{{.*}}system2' [-Wshadow-header]
+// SHADOWING: warning: system1/stdio.h included!
+
+/// Check that the diagnostic is only performed once in MSVC compatibility mode.
+// RUN: %clang_cc1 -fms-compatibility -Wshadow-header -Eonly %t/t.c 2>&1 | FileCheck %s --check-prefix=SHADOWING-MS
+
+// SHADOWING-MS: {{.*}} warning: multiple candidates for header 't3.h' found; directory '{{.*}}foo' chosen, ignoring others including '{{.*}}' [-Wshadow-header]
+// SHADOWING-MS-NOT: {{.*}} warning: multiple candidates for header 't3.h' found; directory '{{.*}}' chosen, ignoring others including '{{.*}}foo' [-Wshadow-header]
+// SHADOWING-MS: warning: Found foo/t3.h.
+
+//--- main.c
+#include "header.h"
+#include <stdio.h>
+
+//--- include1/header.h
+#warning include1/header.h included!
+#include_next "header.h"
+
+//--- include2/header.h
+#warning include2/header.h included!
+
+//--- system1/stdio.h
+#warning system1/stdio.h included!
+
+//--- system2/stdio.h
+#warning system2/stdio.h included!
+
+
+/// Used to test when running in MSVC compatibility
+//--- t.c
+#include "foo/t1.h"
+
+//--- foo/t1.h
+#include "bar/t2.h"
+
+//--- foo/bar/t2.h
+#include "t3.h"
+
+//--- foo/t3.h
+#warning Found foo/t3.h.
+
+//--- t3.h
+#warning Found t3.h.
diff --git a/clang/test/Preprocessor/init.c b/clang/test/Preprocessor/init.c
index 4dea1b583a089..32c758699120e 100644
--- a/clang/test/Preprocessor/init.c
+++ b/clang/test/Preprocessor/init.c
@@ -1106,19 +1106,19 @@
 // SPARC:#define __INT_LEAST8_MAX__ 127
 // SPARC:#define __INT_LEAST8_TYPE__ signed char
 // SPARC:#define __INT_MAX__ 2147483647
-// SPARC:#define __LDBL_DENORM_MIN__ 4.9406564584124654e-324L
-// SPARC:#define __LDBL_DIG__ 15
-// SPARC:#define __LDBL_EPSILON__ 2.2204460492503131e-16L
+// SPARC:#define __LDBL_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966L
+// SPARC:#define __LDBL_DIG__ 33
+// SPARC:#define __LDBL_EPSILON__ 1.92592994438723585305597794258492732e-34L
 // SPARC:#define __LDBL_HAS_DENORM__ 1
 // SPARC:#define __LDBL_HAS_INFINITY__ 1
 // SPARC:#define __LDBL_HAS_QUIET_NAN__ 1
-// SPARC:#define __LDBL_MANT_DIG__ 53
-// SPARC:#define __LDBL_MAX_10_EXP__ 308
-// SPARC:#define __LDBL_MAX_EXP__ 1024
-// SPARC:#define __LDBL_MAX__ 1.7976931348623157e+308L
-// SPARC:#define __LDBL_MIN_10_EXP__ (-307)
-// SPARC:#define __LDBL_MIN_EXP__ (-1021)
-// SPARC:#define __LDBL_MIN__ 2.2250738585072014e-308L
+// SPARC:#define __LDBL_MANT_DIG__ 113
+// SPARC:#define __LDBL_MAX_10_EXP__ 4932
+// SPARC:#define __LDBL_MAX_EXP__ 16384
+// SPARC:#define __LDBL_MAX__ 1.18973149535723176508575932662800702e+4932L
+// SPARC:#define __LDBL_MIN_10_EXP__ (-4931)
+// SPARC:#define __LDBL_MIN_EXP__ (-16381)
+// SPARC:#define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L
 // SPARC:#define __LONG_LONG_MAX__ 9223372036854775807LL
 // SPARC:#define __LONG_MAX__ 2147483647L
 // SPARC-NOT:#define __LP64__
@@ -1134,7 +1134,7 @@
 // SPARC:#define __SIZEOF_DOUBLE__ 8
 // SPARC:#define __SIZEOF_FLOAT__ 4
 // SPARC:#define __SIZEOF_INT__ 4
-// SPARC:#define __SIZEOF_LONG_DOUBLE__ 8
+// SPARC:#define __SIZEOF_LONG_DOUBLE__ 16
 // SPARC:#define __SIZEOF_LONG_LONG__ 8
 // SPARC:#define __SIZEOF_LONG__ 4
 // SPARC:#define __SIZEOF_POINTER__ 4
diff --git a/clang/test/Preprocessor/predefined-arch-macros.c b/clang/test/Preprocessor/predefined-arch-macros.c
index 27feeb57b5de2..1e38b4d3ba350 100644
--- a/clang/test/Preprocessor/predefined-arch-macros.c
+++ b/clang/test/Preprocessor/predefined-arch-macros.c
@@ -4210,6 +4210,11 @@
 // CHECK_SPARC-NOT: #define __sparcv9 1
 // CHECK_SPARC-NOT: #define __sparcv9__ 1
 
+// RUN: %clang -E -dM %s -o - 2>&1 \
+// RUN:     -target sparc-unknown-linux \
+// RUN:   | FileCheck -match-full-lines %s -check-prefix=CHECK_SPARC_LDBL
+// CHECK_SPARC_LDBL: #define __LONG_DOUBLE_128__ 1
+
 // RUN: %clang -mcpu=v9 -E -dM %s -o - 2>&1 \
 // RUN:     -target sparc-unknown-linux \
 // RUN:   | FileCheck -match-full-lines %s -check-prefix=CHECK_SPARC-V9
diff --git a/clang/test/Sema/attr-modular-format.c b/clang/test/Sema/attr-modular-format.c
new file mode 100644
index 0000000000000..fc5b28b0b88be
--- /dev/null
+++ b/clang/test/Sema/attr-modular-format.c
@@ -0,0 +1,26 @@
+//RUN: %clang_cc1 -fsyntax-only -verify %s
+
+int printf(const char *fmt, ...)  __attribute__((modular_format(__modular_printf, "__printf", "float")));  // no-error
+int myprintf(const char *fmt, ...)  __attribute__((modular_format(__modular_printf, "__printf", "float")));  // expected-error {{'modular_format' attribute requires 'format' attribute}}
+
+int dupe(const char *fmt, ...)  __attribute__((modular_format(__modular_printf, "__printf", "float", "int", "float"), format(printf, 1, 2))); // expected-error {{duplicate aspect 'float' in 'modular_format' attribute}}
+int multi_dupe(const char *fmt, ...)  __attribute__((modular_format(__modular_printf, "__printf", "float", "int", "float", "int"), format(printf, 1, 2))); // expected-error {{duplicate aspect 'float' in 'modular_format' attribute}} \
+                                                                                                                                                                 // expected-error {{duplicate aspect 'int' in 'modular_format' attribute}}
+
+// Test with multiple identical attributes on the same declaration.
+int same_attr(const char *fmt, ...) __attribute__((modular_format(__modular_printf, "__printf", "float"), modular_format(__modular_printf, "__printf", "float"), format(printf, 1, 2))); // no-warning
+
+// Test with multiple different attributes on the same declaration.
+int diff_attr(const char *fmt, ...) __attribute__((modular_format(__modular_printf, "__printf", "float"), format(printf, 1, 2), modular_format(__modular_printf, "__printf", "int"))); // expected-error {{attribute 'modular_format' is already applied with different arguments}} expected-note {{conflicting attribute is here}}
+
+int diff_attr2(const char *fmt, ...) __attribute__((modular_format(__modular_printf, "__printf", "float"), format(printf, 1, 2), modular_format(__modular_printf, "__other", "float"))); // expected-error {{attribute 'modular_format' is already applied with different arguments}} expected-note {{conflicting attribute is here}}
+
+int diff_attr3(const char *fmt, ...) __attribute__((modular_format(__modular_printf, "__printf", "float"), format(printf, 1, 2), modular_format(__other, "__printf", "float"))); // expected-error {{attribute 'modular_format' is already applied with different arguments}} expected-note {{conflicting attribute is here}}
+
+// Test with same attributes but different aspect order.
+int diff_order(const char *fmt, ...) __attribute__((modular_format(__modular_printf, "__printf", "float", "int"), format(printf, 1, 2), modular_format(__modular_printf, "__printf", "int", "float"))); // no-error
+
+// Test with multiple different attributes on a declaration and a redeclaration
+int redecl(const char *fmt, ...) __attribute__((format(printf, 1, 2))); // no-error
+int redecl(const char *fmt, ...) __attribute__((modular_format(__modular_printf, "__printf", "float"))); // expected-note {{conflicting attribute is here}}
+int redecl(const char *fmt, ...) __attribute__((modular_format(__modular_printf, "__printf", "int"))); // expected-error {{attribute 'modular_format' is already applied with different arguments}}
diff --git a/clang/test/SemaCUDA/Inputs/cuda.h b/clang/test/SemaCUDA/Inputs/cuda.h
index 2bf45e03d91c7..de6f7fb635421 100644
--- a/clang/test/SemaCUDA/Inputs/cuda.h
+++ b/clang/test/SemaCUDA/Inputs/cuda.h
@@ -46,6 +46,13 @@ extern "C" int __cudaPushCallConfiguration(dim3 gridSize, dim3 blockSize,
 extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim,
                                         dim3 blockDim, void **args,
                                         size_t sharedMem, cudaStream_t stream);
+extern "C" __device__ cudaError_t cudaLaunchDevice(void *func,
+                                                   void *parameterBuffer,
+                                                   dim3 gridDim, dim3 blockDim,
+                                                   unsigned int sharedMem,
+                                                   cudaStream_t stream);
+extern "C" __device__ void *cudaGetParameterBuffer(size_t alignment,
+                                                   size_t size);
 #endif
 
 // Host- and device-side placement new overloads.
diff --git a/clang/test/SemaCUDA/call-kernel-from-kernel.cu b/clang/test/SemaCUDA/call-kernel-from-kernel.cu
index 5f8832f3cd070..01dba44339520 100644
--- a/clang/test/SemaCUDA/call-kernel-from-kernel.cu
+++ b/clang/test/SemaCUDA/call-kernel-from-kernel.cu
@@ -1,9 +1,12 @@
 // RUN: %clang_cc1 %s --std=c++11 -triple nvptx -o - \
 // RUN:   -verify -fcuda-is-device -fsyntax-only -verify-ignore-unexpected=note
+// RUN: %clang_cc1 %s --std=c++11 -fgpu-rdc -triple nvptx -o - \
+// RUN:   -verify=rdc -fcuda-is-device -fsyntax-only -verify-ignore-unexpected=note
+// rdc-no-diagnostics
 
 #include "Inputs/cuda.h"
 
 __global__ void kernel1();
 __global__ void kernel2() {
-  kernel1<<<1,1>>>(); // expected-error {{reference to __global__ function 'kernel1' in __global__ function}}
+  kernel1<<<1,1>>>(); // expected-error {{kernel launch from __device__ or __global__ function requires relocatable device code (i.e. requires -fgpu-rdc)}}
 }
diff --git a/clang/test/SemaCUDA/device-kernel-call.cu b/clang/test/SemaCUDA/device-kernel-call.cu
new file mode 100644
index 0000000000000..856cbd88404e6
--- /dev/null
+++ b/clang/test/SemaCUDA/device-kernel-call.cu
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -fcuda-is-device -verify=nordc %s
+// RUN: %clang_cc1 -fcuda-is-device -fgpu-rdc -verify=rdc %s
+// RUN: %clang_cc1 -x hip -fcuda-is-device -verify=hip %s
+
+// rdc-no-diagnostics
+
+#include "Inputs/cuda.h"
+
+__global__ void g2(int x) {}
+
+__global__ void g1(void) {
+  g2<<<1, 1>>>(42);
+  // nordc-error@-1 {{kernel launch from __device__ or __global__ function requires relocatable device code (i.e. requires -fgpu-rdc)}}
+  // hip-error@-2 {{device-side kernel call/launch is not supported}}
+}
diff --git a/clang/test/SemaCUDA/function-overload.cu b/clang/test/SemaCUDA/function-overload.cu
index 3d05839af7528..11f84a912ea7b 100644
--- a/clang/test/SemaCUDA/function-overload.cu
+++ b/clang/test/SemaCUDA/function-overload.cu
@@ -91,10 +91,7 @@ __host__ HostReturnTy h() { return HostReturnTy(); }
 // devdefer-note@-4 1+ {{candidate function not viable: call to __host__ function from __global__ function}}
 
 __global__ void g() {}
-// dev-note@-1 1+ {{'g' declared here}}
-// devdefer-note@-2 1+ {{candidate function not viable: call to __global__ function from __device__ function}}
 // expected-note@-3 0+ {{candidate function not viable: call to __global__ function from __host__ __device__ function}}
-// devdefer-note@-4 1+ {{candidate function not viable: call to __global__ function from __global__ function}}
 
 extern "C" __device__ DeviceReturnTy cd() { return DeviceReturnTy(); }
 // host-note@-1 1+ {{'cd' declared here}}
@@ -144,9 +141,9 @@ __device__ void devicef() {
   DeviceFnPtr fp_cdh = cdh;
   DeviceReturnTy ret_cdh = cdh();
 
-  GlobalFnPtr fp_g = g; // dev-error {{reference to __global__ function 'g' in __device__ function}}
-  g(); // devdefer-error {{no matching function for call to 'g'}}
-  g<<<0,0>>>(); // dev-error {{reference to __global__ function 'g' in __device__ function}}
+  GlobalFnPtr fp_g = g;
+  g(); // expected-error {{call to global function 'g' not configured}}
+  g<<<0,0>>>(); // expected-error {{kernel launch from __device__ or __global__ function requires relocatable device code (i.e. requires -fgpu-rdc)}}
 }
 
 __global__ void globalf() {
@@ -165,9 +162,9 @@ __global__ void globalf() {
   DeviceFnPtr fp_cdh = cdh;
   DeviceReturnTy ret_cdh = cdh();
 
-  GlobalFnPtr fp_g = g; // dev-error {{reference to __global__ function 'g' in __global__ function}}
-  g(); // devdefer-error {{no matching function for call to 'g'}}
-  g<<<0,0>>>(); // dev-error {{reference to __global__ function 'g' in __global__ function}}
+  GlobalFnPtr fp_g = g;
+  g(); // expected-error {{call to global function 'g' not configured}}
+  g<<<0,0>>>(); // expected-error {{kernel launch from __device__ or __global__ function requires relocatable device code (i.e. requires -fgpu-rdc)}}
 }
 
 __host__ __device__ void hostdevicef() {
@@ -199,20 +196,13 @@ __host__ __device__ void hostdevicef() {
   CurrentReturnTy ret_cdh = cdh();
 
   GlobalFnPtr fp_g = g;
-#if defined(__CUDA_ARCH__)
-  // expected-error@-2 {{reference to __global__ function 'g' in __host__ __device__ function}}
-#endif
 
   g();
-#if defined (__CUDA_ARCH__)
-  // expected-error@-2 {{reference to __global__ function 'g' in __host__ __device__ function}}
-#else
-  // expected-error@-4 {{call to global function 'g' not configured}}
-#endif
+  // expected-error@-1 {{call to global function 'g' not configured}}
 
   g<<<0,0>>>();
 #if defined(__CUDA_ARCH__)
-  // expected-error@-2 {{reference to __global__ function 'g' in __host__ __device__ function}}
+  // expected-error@-2 {{kernel launch from __device__ or __global__ function requires relocatable device code (i.e. requires -fgpu-rdc)}}
 #endif
 }
 
diff --git a/clang/test/SemaCUDA/function-target.cu b/clang/test/SemaCUDA/function-target.cu
index 64444b6676248..66704a320cee1 100644
--- a/clang/test/SemaCUDA/function-target.cu
+++ b/clang/test/SemaCUDA/function-target.cu
@@ -24,11 +24,11 @@ __host__ void h1(void) {
 __host__ void d1h(void); // expected-note {{candidate function not viable: call to __host__ function from __device__ function}}
 __device__ void d1d(void);
 __host__ __device__ void d1hd(void);
-__global__ void d1g(void); // dev-note {{'d1g' declared here}}
+__global__ void d1g(void);
 
 __device__ void d1(void) {
   d1h(); // expected-error {{no matching function}}
   d1d();
   d1hd();
-  d1g<<<1, 1>>>(); // dev-error {{reference to __global__ function 'd1g' in __device__ function}}
+  d1g<<<1, 1>>>(); // expected-error {{kernel launch from __device__ or __global__ function requires relocatable device code (i.e. requires -fgpu-rdc)}}
 }
diff --git a/clang/test/SemaCUDA/reference-to-kernel-fn.cu b/clang/test/SemaCUDA/reference-to-kernel-fn.cu
index 70a1cda6ab0c8..bdb70fc8b55d1 100644
--- a/clang/test/SemaCUDA/reference-to-kernel-fn.cu
+++ b/clang/test/SemaCUDA/reference-to-kernel-fn.cu
@@ -8,6 +8,7 @@
 // device-side kernel launches.)
 
 // host-no-diagnostics
+// dev-no-diagnostics
 
 #include "Inputs/cuda.h"
 
@@ -19,11 +20,10 @@ typedef void (*fn_ptr_t)();
 
 __host__ __device__ fn_ptr_t get_ptr_hd() {
   return kernel;
-  // dev-error@-1 {{reference to __global__ function}}
 }
 __host__ fn_ptr_t get_ptr_h() {
   return kernel;
 }
 __device__ fn_ptr_t get_ptr_d() {
-  return kernel;  // dev-error {{reference to __global__ function}}
+  return kernel;
 }
diff --git a/clang/test/SemaCXX/constexpr-x86-avx-builtins.cpp b/clang/test/SemaCXX/constexpr-x86-avx-builtins.cpp
new file mode 100644
index 0000000000000..724aff3011ded
--- /dev/null
+++ b/clang/test/SemaCXX/constexpr-x86-avx-builtins.cpp
@@ -0,0 +1,18 @@
+// RUN: %clang_cc1 -std=c++20 -ffreestanding -fexperimental-new-constant-interpreter -triple x86_64-unknown-unknown -target-feature +avx -verify %s
+
+#include <immintrin.h>
+#include "../CodeGen/X86/builtin_test_helpers.h"
+
+namespace Test_mm256_cvtpd_ps {
+namespace OK {
+constexpr __m256d a = { 0.0, -1.0, +2.0, +3.5 };
+TEST_CONSTEXPR(match_m128(_mm256_cvtpd_ps(a), 0.0f, -1.0f, +2.0f, +3.5f));
+}
+namespace Inexact {
+constexpr __m256d a = { 1.0000000000000002, 0.0, 0.0, 0.0 };
+constexpr __m128 r = _mm256_cvtpd_ps(a);
+// expected-error@-1 {{must be initialized by a constant expression}}
+// expected-note@avxintrin.h:* {{compile time floating point arithmetic suppressed in strict evaluation modes}}
+// expected-note@-3 {{in call to '_mm256_cvtpd_ps({1.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00})'}}
+}
+}
diff --git a/clang/test/SemaCXX/constexpr-x86-avx512f-builtins.cpp b/clang/test/SemaCXX/constexpr-x86-avx512f-builtins.cpp
new file mode 100644
index 0000000000000..0d2a82cbbb83c
--- /dev/null
+++ b/clang/test/SemaCXX/constexpr-x86-avx512f-builtins.cpp
@@ -0,0 +1,230 @@
+// RUN: %clang_cc1 -std=c++20 -ffreestanding -fexperimental-new-constant-interpreter -triple x86_64-unknown-unknown -target-feature +avx512f -verify %s
+
+#include <immintrin.h>
+#include "../CodeGen/X86/builtin_test_helpers.h"
+
+namespace Test_mm_mask_cvtsd_ss {
+namespace OK {
+constexpr __m128 src = { 9.0f, 5.0f, 6.0f, 7.0f };
+constexpr __m128 a = { 1.0f, 2.0f, 3.0f, 4.0f };
+constexpr __m128d b = { -1.0, 42.0 };
+TEST_CONSTEXPR(match_m128(_mm_mask_cvtsd_ss(src, 0x1, a, b), -1.0f, 2.0f, 3.0f, 4.0f));
+}
+namespace MaskOff {
+constexpr __m128 src = { 9.0f, 5.0f, 6.0f, 7.0f };
+constexpr __m128 a = { 1.0f, 2.0f, 3.0f, 4.0f };
+constexpr __m128d b = { -1.0, 42.0 };
+TEST_CONSTEXPR(match_m128(_mm_mask_cvtsd_ss(src, 0x0, a, b), 9.0f, 2.0f, 3.0f, 4.0f));
+}
+namespace MaskOffInexact {
+constexpr __m128 src = { 9.0f, 5.0f, 6.0f, 7.0f };
+constexpr __m128 a = { 1.0f, 2.0f, 3.0f, 4.0f };
+constexpr __m128d b_inexact = { 1.0000000000000002, 0.0 };
+constexpr __m128 r = _mm_mask_cvtsd_ss(src, 0x0, a, b_inexact);
+TEST_CONSTEXPR(match_m128(r, 9.0f, 2.0f, 3.0f, 4.0f));
+}
+namespace MaskOnInexact {
+constexpr __m128 src = { 9.0f, 5.0f, 6.0f, 7.0f };
+constexpr __m128 a = { 1.0f, 2.0f, 3.0f, 4.0f };
+constexpr __m128d b_inexact = { 1.0000000000000002, 0.0 };
+constexpr __m128 r = _mm_mask_cvtsd_ss(src, 0x1, a, b_inexact);
+// expected-error@-1 {{must be initialized by a constant expression}}
+// expected-note@avx512fintrin.h:* {{compile time floating point arithmetic suppressed in strict evaluation modes}}
+// expected-note@-3 {{in call to '_mm_mask_cvtsd_ss({9.000000e+00, 5.000000e+00, 6.000000e+00, 7.000000e+00}, 1, {1.000000e+00, 2.000000e+00, 3.000000e+00, 4.000000e+00}, {1.000000e+00, 0.000000e+00})'}}
+}
+namespace MaskOnInf {
+constexpr __m128 src = { 9.0f, 5.0f, 6.0f, 7.0f };
+constexpr __m128 a = { 1.0f, 2.0f, 3.0f, 4.0f };
+constexpr __m128d b_inf = { __builtin_huge_val(), 0.0 };
+constexpr __m128 r = _mm_mask_cvtsd_ss(src, 0x1, a, b_inf);
+// expected-error@-1 {{must be initialized by a constant expression}}
+// expected-note@avx512fintrin.h:* {{floating point arithmetic produces an infinity}}
+// expected-note@-3 {{in call to '_mm_mask_cvtsd_ss({9.000000e+00, 5.000000e+00, 6.000000e+00, 7.000000e+00}, 1, {1.000000e+00, 2.000000e+00, 3.000000e+00, 4.000000e+00}, {INF, 0.000000e+00})'}}
+}
+namespace MaskOnNaN {
+constexpr __m128 src = { 9.0f, 5.0f, 6.0f, 7.0f };
+constexpr __m128 a = { 1.0f, 2.0f, 3.0f, 4.0f };
+constexpr __m128d b_nan = { __builtin_nan(""), 0.0 };
+constexpr __m128 r = _mm_mask_cvtsd_ss(src, 0x1, a, b_nan);
+// expected-error@-1 {{must be initialized by a constant expression}}
+// expected-note@avx512fintrin.h:* {{floating point arithmetic produces a NaN}}
+// expected-note@-3 {{in call to '_mm_mask_cvtsd_ss({9.000000e+00, 5.000000e+00, 6.000000e+00, 7.000000e+00}, 1, {1.000000e+00, 2.000000e+00, 3.000000e+00, 4.000000e+00}, {nan, 0.000000e+00})'}}
+}
+namespace MaskOnSubnormal {
+constexpr __m128 src = { 9.0f, 5.0f, 6.0f, 7.0f };
+constexpr __m128 a = { 1.0f, 2.0f, 3.0f, 4.0f };
+constexpr __m128d b_sub = { 1e-310, 0.0 };
+constexpr __m128 r = _mm_mask_cvtsd_ss(src, 0x1, a, b_sub);
+// expected-error@-1 {{must be initialized by a constant expression}}
+// expected-note@avx512fintrin.h:* {{compile time floating point arithmetic suppressed in strict evaluation modes}}
+// expected-note@-3 {{in call to '_mm_mask_cvtsd_ss({9.000000e+00, 5.000000e+00, 6.000000e+00, 7.000000e+00}, 1, {1.000000e+00, 2.000000e+00, 3.000000e+00, 4.000000e+00}, {1.000000e-310, 0.000000e+00})'}}
+}
+}
+
+namespace Test_mm_maskz_cvtsd_ss {
+namespace OK {
+constexpr __m128 a = { 1.0f, 2.0f, 3.0f, 4.0f };
+constexpr __m128d b = { -1.0, 42.0 };
+TEST_CONSTEXPR(match_m128(_mm_maskz_cvtsd_ss(0x1, a, b), -1.0f, 2.0f, 3.0f, 4.0f));
+}
+namespace MaskOff {
+constexpr __m128 a = { 1.0f, 2.0f, 3.0f, 4.0f };
+constexpr __m128d b = { -1.0, 42.0 };
+TEST_CONSTEXPR(match_m128(_mm_maskz_cvtsd_ss(0x0, a, b), 0.0f, 2.0f, 3.0f, 4.0f));
+}
+namespace MaskOffInexact {
+constexpr __m128 a = { 1.0f, 2.0f, 3.0f, 4.0f };
+constexpr __m128d b_inexact = { 1.0000000000000002, 0.0 };
+TEST_CONSTEXPR(match_m128(_mm_maskz_cvtsd_ss(0x0, a, b_inexact), 0.0f, 2.0f, 3.0f, 4.0f));
+}
+namespace MaskOnInf {
+constexpr __m128 a = { 1.0f, 2.0f, 3.0f, 4.0f };
+constexpr __m128d b_inf = { __builtin_huge_val(), 0.0 };
+constexpr __m128 r = _mm_maskz_cvtsd_ss(0x1, a, b_inf);
+// expected-error@-1 {{must be initialized by a constant expression}}
+// expected-note@avx512fintrin.h:* {{floating point arithmetic produces an infinity}}
+// expected-note@-3 {{in call to '_mm_maskz_cvtsd_ss(1, {1.000000e+00, 2.000000e+00, 3.000000e+00, 4.000000e+00}, {INF, 0.000000e+00})'}}
+}
+namespace MaskOnNaN {
+constexpr __m128 a = { 1.0f, 2.0f, 3.0f, 4.0f };
+constexpr __m128d b_nan = { __builtin_nan(""), 0.0 };
+constexpr __m128 r = _mm_maskz_cvtsd_ss(0x1, a, b_nan);
+// expected-error@-1 {{must be initialized by a constant expression}}
+// expected-note@avx512fintrin.h:* {{floating point arithmetic produces a NaN}}
+// expected-note@-3 {{in call to '_mm_maskz_cvtsd_ss(1, {1.000000e+00, 2.000000e+00, 3.000000e+00, 4.000000e+00}, {nan, 0.000000e+00})'}}
+}
+}
+
+namespace Test_mm512_cvtpd_ps {
+namespace OK {
+constexpr __m512d a = { -1.0, +2.0, +4.0, +8.0, +16.0, +32.0, +64.0, +128.0 };
+TEST_CONSTEXPR(match_m256(_mm512_cvtpd_ps(a), -1.0f, +2.0f, +4.0f, +8.0f, +16.0f, +32.0f, +64.0f, +128.0f));
+}
+namespace Inexact {
+constexpr __m512d a = { 1.0000000000000002, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };
+constexpr __m256 r = _mm512_cvtpd_ps(a);
+// expected-error@-1 {{must be initialized by a constant expression}}
+// expected-note@avx512fintrin.h:* {{compile time floating point arithmetic suppressed in strict evaluation modes}}
+// expected-note@-3 {{in call to '_mm512_cvtpd_ps({1.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00})'}}
+}
+}
+
+namespace Test_mm512_mask_cvtpd_ps {
+namespace OK {
+constexpr __m256 src = { 9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f };
+constexpr __m512d a = { -1.0, +2.0, +4.0, +8.0, +16.0, +32.0, +64.0, +128.0 };
+TEST_CONSTEXPR(match_m256(_mm512_mask_cvtpd_ps(src, 0x05, a), -1.0f, 9.0f, +4.0f, 9.0f, 9.0f, 9.0f, 9.0f, 9.0f));
+}
+namespace MaskOffInexact {
+constexpr __m256 src = { 9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f };
+constexpr __m512d a_inexact = { -1.0, +2.0, +4.0, +8.0, +16.0, 1.0000000000000002, +64.0, +128.0 };
+TEST_CONSTEXPR(match_m256(_mm512_mask_cvtpd_ps(src, 0b11011111, a_inexact), -1.0f, +2.0f, +4.0f, +8.0f, +16.0f, 9.0f, +64.0f, +128.0f));
+}
+namespace MaskOffInf {
+constexpr __m256 src = { 9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f };
+constexpr __m512d a_inf = { -1.0, +2.0, +4.0, +8.0, +16.0, __builtin_huge_val(), +64.0, +128.0 };
+TEST_CONSTEXPR(match_m256(_mm512_mask_cvtpd_ps(src, 0x1F, a_inf), -1.0f, +2.0f, +4.0f, +8.0f, +16.0f, 9.0f, 9.0f, 9.0f));
+}
+namespace MaskOffNaN {
+constexpr __m256 src = { 9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f };
+constexpr __m512d a_nan = { -1.0, +2.0, +4.0, +8.0, +16.0, __builtin_nan(""), +64.0, +128.0 };
+TEST_CONSTEXPR(match_m256(_mm512_mask_cvtpd_ps(src, 0x1F, a_nan), -1.0f, +2.0f, +4.0f, +8.0f, +16.0f, 9.0f, 9.0f, 9.0f));
+}
+namespace MaskOnInf {
+constexpr __m256 src = { 9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f };
+constexpr __m512d a_inf = { -1.0, +2.0, +4.0, __builtin_huge_val(), +16.0, +32.0, +64.0, +128.0 };
+constexpr __m256 r = _mm512_mask_cvtpd_ps(src, 0x08, a_inf);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avx512fintrin.h:* {{floating point arithmetic produces an infinity}}
+// expected-note at -3 {{in call to '_mm512_mask_cvtpd_ps({9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00}, 8, {-1.000000e+00, 2.000000e+00, 4.000000e+00, INF, 1.600000e+01, 3.200000e+01, 6.400000e+01, 1.280000e+02})'}}
+}
+namespace MaskOnNaN {
+constexpr __m256 src = { 9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f };
+constexpr __m512d a_nan = { -1.0, +2.0, +4.0, __builtin_nan(""), +16.0, +32.0, +64.0, +128.0 };
+constexpr __m256 r = _mm512_mask_cvtpd_ps(src, 0x08, a_nan);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avx512fintrin.h:* {{floating point arithmetic produces a NaN}}
+// expected-note at -3 {{in call to '_mm512_mask_cvtpd_ps({9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00}, 8, {-1.000000e+00, 2.000000e+00, 4.000000e+00, nan, 1.600000e+01, 3.200000e+01, 6.400000e+01, 1.280000e+02})'}}
+}
+}
+
+namespace Test_mm512_maskz_cvtpd_ps {
+namespace OK {
+constexpr __m512d a = { -1.0, +2.0, +4.0, +8.0, +16.0, +32.0, +64.0, +128.0 };
+TEST_CONSTEXPR(match_m256(_mm512_maskz_cvtpd_ps(0x81, a), -1.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, +128.0f));
+}
+namespace MaskOffInexact {
+constexpr __m512d a_inexact = { -1.0, +2.0, +4.0, +8.0, +16.0, 1.0000000000000002, +64.0, +128.0 };
+TEST_CONSTEXPR(match_m256(_mm512_maskz_cvtpd_ps(0b11011111, a_inexact), -1.0f, +2.0f, +4.0f, +8.0f, +16.0f, 0.0f, +64.0f, +128.0f));
+}
+namespace MaskOffInf {
+constexpr __m512d a_inf = { -1.0, +2.0, +4.0, +8.0, +16.0, __builtin_huge_val(), +64.0, +128.0 };
+TEST_CONSTEXPR(match_m256(_mm512_maskz_cvtpd_ps(0x1F, a_inf), -1.0f, +2.0f, +4.0f, +8.0f, +16.0f, 0.0f, 0.0f, 0.0f));
+}
+namespace MaskOffNaN {
+constexpr __m512d a_nan = { -1.0, +2.0, +4.0, +8.0, +16.0, __builtin_nan(""), +64.0, +128.0 };
+TEST_CONSTEXPR(match_m256(_mm512_maskz_cvtpd_ps(0x1F, a_nan), -1.0f, +2.0f, +4.0f, +8.0f, +16.0f, 0.0f, 0.0f, 0.0f));
+}
+namespace MaskOnInf {
+constexpr __m512d a_inf = { -1.0, +2.0, +4.0, __builtin_huge_val(), +16.0, +32.0, +64.0, +128.0 };
+constexpr __m256 r = _mm512_maskz_cvtpd_ps(0x08, a_inf);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avx512fintrin.h:* {{floating point arithmetic produces an infinity}}
+// expected-note at -3 {{in call to '_mm512_maskz_cvtpd_ps(8, {-1.000000e+00, 2.000000e+00, 4.000000e+00, INF, 1.600000e+01, 3.200000e+01, 6.400000e+01, 1.280000e+02})'}}
+}
+namespace MaskOnNaN {
+constexpr __m512d a_nan = { -1.0, +2.0, +4.0, __builtin_nan(""), +16.0, +32.0, +64.0, +128.0 };
+constexpr __m256 r = _mm512_maskz_cvtpd_ps(0x08, a_nan);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avx512fintrin.h:* {{floating point arithmetic produces a NaN}}
+// expected-note at -3 {{in call to '_mm512_maskz_cvtpd_ps(8, {-1.000000e+00, 2.000000e+00, 4.000000e+00, nan, 1.600000e+01, 3.200000e+01, 6.400000e+01, 1.280000e+02})'}}
+}
+}
+
+namespace Test_mm512_cvtpd_pslo {
+namespace OK {
+constexpr __m512d a = { -1.0, +2.0, +4.0, +8.0, +16.0, +32.0, +64.0, +128.0 };
+TEST_CONSTEXPR(match_m512(_mm512_cvtpd_pslo(a), -1.0f, +2.0f, +4.0f, +8.0f, +16.0f, +32.0f, +64.0f, +128.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f));
+}
+}
+
+namespace Test_mm512_mask_cvtpd_pslo {
+namespace OK {
+constexpr __m512 src = (__m512){ 9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,
+                                9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f };
+constexpr __m512d a = { -1.0, +2.0, +4.0, +8.0, +16.0, +32.0, +64.0, +128.0 };
+TEST_CONSTEXPR(match_m512(_mm512_mask_cvtpd_pslo(src, 0x3, a), -1.0f, +2.0f, 9.0f, 9.0f, 9.0f, 9.0f, 9.0f, 9.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f));
+}
+namespace MaskOffInf {
+constexpr __m512 src = (__m512){ 9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,
+                                9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f };
+constexpr __m512d a_inf = { -1.0, +2.0, __builtin_huge_val(), +8.0, +16.0, +32.0, +64.0, +128.0 };
+TEST_CONSTEXPR(match_m512(_mm512_mask_cvtpd_pslo(src, 0x3, a_inf), -1.0f, +2.0f, 9.0f, 9.0f, 9.0f, 9.0f, 9.0f, 9.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f));
+}
+namespace MaskOffNaN {
+constexpr __m512 src = (__m512){ 9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,
+                                9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f };
+constexpr __m512d a_nan = { -1.0, +2.0, +4.0, __builtin_nan(""), +16.0, +32.0, +64.0, +128.0 };
+TEST_CONSTEXPR(match_m512(_mm512_mask_cvtpd_pslo(src, 0x7, a_nan), -1.0f, +2.0f, +4.0f, 9.0f, 9.0f, 9.0f, 9.0f, 9.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f));
+}
+namespace MaskOnInf {
+constexpr __m512 src = (__m512){ 9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,
+                                9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f };
+constexpr __m512d a_inf = { -1.0, +2.0, __builtin_huge_val(), +8.0, +16.0, +32.0, +64.0, +128.0 };
+constexpr __m512 r = _mm512_mask_cvtpd_pslo(src, 0x4, a_inf);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avx512fintrin.h:* {{floating point arithmetic produces an infinity}}
+// expected-note at avx512fintrin.h:* {{in call to '_mm512_mask_cvtpd_ps({9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00}, 4, {-1.000000e+00, 2.000000e+00, INF, 8.000000e+00, 1.600000e+01, 3.200000e+01, 6.400000e+01, 1.280000e+02})'}}
+// expected-note at -4 {{in call to '_mm512_mask_cvtpd_pslo({9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00}, 4, {-1.000000e+00, 2.000000e+00, INF, 8.000000e+00, 1.600000e+01, 3.200000e+01, 6.400000e+01, 1.280000e+02})'}}
+}
+namespace MaskOnNaN {
+constexpr __m512 src = (__m512){ 9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,
+                                9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f,9.0f };
+constexpr __m512d a_nan = { -1.0, +2.0, __builtin_nan(""), +8.0, +16.0, +32.0, +64.0, +128.0 };
+constexpr __m512 r = _mm512_mask_cvtpd_pslo(src, 0x4, a_nan);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avx512fintrin.h:* {{floating point arithmetic produces a NaN}}
+// expected-note at avx512fintrin.h:* {{in call to '_mm512_mask_cvtpd_ps({9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00}, 4, {-1.000000e+00, 2.000000e+00, nan, 8.000000e+00, 1.600000e+01, 3.200000e+01, 6.400000e+01, 1.280000e+02})'}}
+// expected-note at -4 {{in call to '_mm512_mask_cvtpd_pslo({9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00}, 4, {-1.000000e+00, 2.000000e+00, nan, 8.000000e+00, 1.600000e+01, 3.200000e+01, 6.400000e+01, 1.280000e+02})'}}
+}
+}
diff --git a/clang/test/SemaCXX/constexpr-x86-avx512vl-builtins.cpp b/clang/test/SemaCXX/constexpr-x86-avx512vl-builtins.cpp
new file mode 100644
index 0000000000000..bdce60a357f13
--- /dev/null
+++ b/clang/test/SemaCXX/constexpr-x86-avx512vl-builtins.cpp
@@ -0,0 +1,120 @@
+// RUN: %clang_cc1 -std=c++20 -ffreestanding -fexperimental-new-constant-interpreter -triple x86_64-unknown-unknown -target-feature +avx512f -target-feature +avx512vl -verify %s
+
+#include <immintrin.h>
+#include "../CodeGen/X86/builtin_test_helpers.h"
+
+namespace Test_mm_mask_cvtpd_ps {
+namespace OK {
+constexpr __m128 src = { 9.0f, 9.0f, 9.0f, 9.0f };
+constexpr __m128d a = { -1.0, +2.0 };
+TEST_CONSTEXPR(match_m128(_mm_mask_cvtpd_ps(src, 0x3, a), -1.0f, +2.0f, 9.0f, 9.0f));
+}
+namespace Partial {
+constexpr __m128 src = { 9.0f, 9.0f, 9.0f, 9.0f };
+constexpr __m128d a = { -1.0, +2.0 };
+TEST_CONSTEXPR(match_m128(_mm_mask_cvtpd_ps(src, 0x1, a), -1.0f, 9.0f, 9.0f, 9.0f));
+}
+namespace MaskOffInexact {
+constexpr __m128 src = { 9.0f, 9.0f, 9.0f, 9.0f };
+constexpr __m128d a_inexact = { -1.0, 1.0000000000000002 };
+TEST_CONSTEXPR(match_m128(_mm_mask_cvtpd_ps(src, 0x1, a_inexact), -1.0f, 9.0f, 9.0f, 9.0f));
+}
+namespace MaskOnInexact {
+constexpr __m128 src = { 9.0f, 9.0f, 9.0f, 9.0f };
+constexpr __m128d a_inexact = { -1.0, 1.0000000000000002 };
+constexpr __m128 r = _mm_mask_cvtpd_ps(src, 0x2, a_inexact);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avx512vlintrin.h:* {{compile time floating point arithmetic suppressed in strict evaluation modes}}
+// expected-note at -3 {{in call to '_mm_mask_cvtpd_ps({9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00}, 2, {-1.000000e+00, 1.000000e+00})'}}
+}
+namespace MaskOnInf {
+constexpr __m128 src = { 9.0f, 9.0f, 9.0f, 9.0f };
+constexpr __m128d a_inf = { -1.0, __builtin_huge_val() };
+constexpr __m128 r = _mm_mask_cvtpd_ps(src, 0x2, a_inf);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avx512vlintrin.h:* {{floating point arithmetic produces an infinity}}
+// expected-note at -3 {{in call to '_mm_mask_cvtpd_ps({9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00}, 2, {-1.000000e+00, INF})'}}
+}
+namespace MaskOnNaN {
+constexpr __m128 src = { 9.0f, 9.0f, 9.0f, 9.0f };
+constexpr __m128d a_nan = { -1.0, __builtin_nan("") };
+constexpr __m128 r = _mm_mask_cvtpd_ps(src, 0x2, a_nan);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avx512vlintrin.h:* {{floating point arithmetic produces a NaN}}
+// expected-note at -3 {{in call to '_mm_mask_cvtpd_ps({9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00}, 2, {-1.000000e+00, nan})'}}
+}
+}
+
+namespace Test_mm_maskz_cvtpd_ps {
+namespace OK {
+constexpr __m128d a = { -1.0, +2.0 };
+TEST_CONSTEXPR(match_m128(_mm_maskz_cvtpd_ps(0x1, a), -1.0f, 0.0f, 0.0f, 0.0f));
+}
+namespace MaskOffInexact {
+constexpr __m128d a_inexact = { -1.0, 1.0000000000000002 };
+TEST_CONSTEXPR(match_m128(_mm_maskz_cvtpd_ps(0x1, a_inexact), -1.0f, 0.0f, 0.0f, 0.0f));
+}
+namespace MaskOnInf {
+constexpr __m128d a_inf = { -1.0, __builtin_huge_val() };
+constexpr __m128 r = _mm_maskz_cvtpd_ps(0x2, a_inf);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avx512vlintrin.h:* {{floating point arithmetic produces an infinity}}
+// expected-note at -3 {{in call to '_mm_maskz_cvtpd_ps(2, {-1.000000e+00, INF})'}}
+}
+namespace MaskOnNaN {
+constexpr __m128d a_nan = { -1.0, __builtin_nan("") };
+constexpr __m128 r = _mm_maskz_cvtpd_ps(0x2, a_nan);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avx512vlintrin.h:* {{floating point arithmetic produces a NaN}}
+// expected-note at -3 {{in call to '_mm_maskz_cvtpd_ps(2, {-1.000000e+00, nan})'}}
+}
+}
+
+namespace Test_mm256_mask_cvtpd_ps {
+namespace OK {
+constexpr __m128 src = { 9.0f, 9.0f, 9.0f, 9.0f };
+constexpr __m256d a = { 0.0, -1.0, +2.0, +3.5 };
+TEST_CONSTEXPR(match_m128(_mm256_mask_cvtpd_ps(src, 0xF, a), 0.0f, -1.0f, +2.0f, +3.5f));
+}
+namespace MaskOffInf {
+constexpr __m128 src = { 9.0f, 9.0f, 9.0f, 9.0f };
+constexpr __m256d a_inf = { -1.0, +2.0, __builtin_huge_val(), +8.0 };
+constexpr __m128 r = _mm256_mask_cvtpd_ps(src, 0x3, a_inf);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avxintrin.h:* {{floating point arithmetic produces an infinity}}
+// expected-note at avx512vlintrin.h:* {{in call to '_mm256_cvtpd_ps({-1.000000e+00, 2.000000e+00, INF, 8.000000e+00})'}}
+// expected-note at -4 {{in call to '_mm256_mask_cvtpd_ps({9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00}, 3, {-1.000000e+00, 2.000000e+00, INF, 8.000000e+00})'}}
+}
+namespace MaskOffNaN {
+constexpr __m128 src = { 9.0f, 9.0f, 9.0f, 9.0f };
+constexpr __m256d a_nan = { -1.0, +2.0, +4.0, __builtin_nan("") };
+constexpr __m128 r = _mm256_mask_cvtpd_ps(src, 0x7, a_nan);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avxintrin.h:* {{floating point arithmetic produces a NaN}}
+// expected-note at avx512vlintrin.h:* {{in call to '_mm256_cvtpd_ps({-1.000000e+00, 2.000000e+00, 4.000000e+00, nan})'}}
+// expected-note at -4 {{in call to '_mm256_mask_cvtpd_ps({9.000000e+00, 9.000000e+00, 9.000000e+00, 9.000000e+00}, 7, {-1.000000e+00, 2.000000e+00, 4.000000e+00, nan})'}}
+}
+}
+
+namespace Test_mm256_maskz_cvtpd_ps {
+namespace OK {
+constexpr __m256d a = { 0.0, -1.0, +2.0, +3.5 };
+TEST_CONSTEXPR(match_m128(_mm256_maskz_cvtpd_ps(0x5, a), 0.0f, 0.0f, +2.0f, 0.0f));
+}
+namespace MaskOffInf {
+constexpr __m256d a_inf = { -1.0, +2.0, __builtin_huge_val(), +8.0 };
+constexpr __m128 r = _mm256_maskz_cvtpd_ps(0x3, a_inf);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avxintrin.h:* {{floating point arithmetic produces an infinity}}
+// expected-note at avx512vlintrin.h:* {{in call to '_mm256_cvtpd_ps({-1.000000e+00, 2.000000e+00, INF, 8.000000e+00})'}}
+// expected-note at -4 {{in call to '_mm256_maskz_cvtpd_ps(3, {-1.000000e+00, 2.000000e+00, INF, 8.000000e+00})'}}
+}
+namespace MaskOffNaN {
+constexpr __m256d a_nan = { -1.0, +2.0, +4.0, __builtin_nan("") };
+constexpr __m128 r = _mm256_maskz_cvtpd_ps(0x7, a_nan);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at avxintrin.h:* {{floating point arithmetic produces a NaN}}
+// expected-note at avx512vlintrin.h:* {{in call to '_mm256_cvtpd_ps({-1.000000e+00, 2.000000e+00, 4.000000e+00, nan})'}}
+// expected-note at -4 {{in call to '_mm256_maskz_cvtpd_ps(7, {-1.000000e+00, 2.000000e+00, 4.000000e+00, nan})'}}
+}
+}
diff --git a/clang/test/SemaCXX/constexpr-x86-sse2-builtins.cpp b/clang/test/SemaCXX/constexpr-x86-sse2-builtins.cpp
new file mode 100644
index 0000000000000..319a3b02a94f9
--- /dev/null
+++ b/clang/test/SemaCXX/constexpr-x86-sse2-builtins.cpp
@@ -0,0 +1,79 @@
+// RUN: %clang_cc1 -std=c++20 -ffreestanding -fexperimental-new-constant-interpreter -triple x86_64-unknown-unknown -target-feature +sse2 -verify %s
+
+#include <immintrin.h>
+#include "../CodeGen/X86/builtin_test_helpers.h"
+
+namespace Test_mm_cvtsd_ss {
+namespace OK {
+constexpr __m128 a = { 9.0f, 5.0f, 6.0f, 7.0f };
+constexpr __m128d b = { -1.0, 42.0 };
+TEST_CONSTEXPR(match_m128(_mm_cvtsd_ss(a, b), -1.0f, 5.0f, 6.0f, 7.0f));
+}
+namespace Inexact {
+constexpr __m128 a = { 0.0f, 1.0f, 2.0f, 3.0f };
+constexpr __m128d b = { 1.0000000000000002, 0.0 };
+constexpr __m128 r = _mm_cvtsd_ss(a, b);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at emmintrin.h:* {{compile time floating point arithmetic suppressed in strict evaluation modes}}
+// expected-note at -3 {{in call to '_mm_cvtsd_ss({0.000000e+00, 1.000000e+00, 2.000000e+00, 3.000000e+00}, {1.000000e+00, 0.000000e+00})'}}
+}
+namespace Inf {
+constexpr __m128 a = { 0.0f, 1.0f, 2.0f, 3.0f };
+constexpr __m128d b = { __builtin_huge_val(), 0.0 };
+constexpr __m128 r = _mm_cvtsd_ss(a, b);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at emmintrin.h:* {{floating point arithmetic produces an infinity}}
+// expected-note at -3 {{in call to '_mm_cvtsd_ss({0.000000e+00, 1.000000e+00, 2.000000e+00, 3.000000e+00}, {INF, 0.000000e+00})'}}
+}
+namespace NaN {
+constexpr __m128 a = { 0.0f, 1.0f, 2.0f, 3.0f };
+constexpr __m128d b = { __builtin_nan(""), 0.0 };
+constexpr __m128 r = _mm_cvtsd_ss(a, b);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at emmintrin.h:* {{floating point arithmetic produces a NaN}}
+// expected-note at -3 {{in call to '_mm_cvtsd_ss({0.000000e+00, 1.000000e+00, 2.000000e+00, 3.000000e+00}, {nan, 0.000000e+00})'}}
+}
+namespace Subnormal {
+constexpr __m128 a = { 0.0f, 1.0f, 2.0f, 3.0f };
+constexpr __m128d b = { 1e-310, 0.0 };
+constexpr __m128 r = _mm_cvtsd_ss(a, b);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at emmintrin.h:* {{compile time floating point arithmetic suppressed in strict evaluation modes}}
+// expected-note at -3 {{in call to '_mm_cvtsd_ss({0.000000e+00, 1.000000e+00, 2.000000e+00, 3.000000e+00}, {1.000000e-310, 0.000000e+00})'}}
+}
+}
+
+namespace Test_mm_cvtpd_ps {
+namespace OK {
+constexpr __m128d a = { -1.0, +2.0 };
+TEST_CONSTEXPR(match_m128(_mm_cvtpd_ps(a), -1.0f, +2.0f, 0.0f, 0.0f));
+}
+namespace Inexact {
+constexpr __m128d a = { 1.0000000000000002, 0.0 };
+constexpr __m128 r = _mm_cvtpd_ps(a);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at emmintrin.h:* {{compile time floating point arithmetic suppressed in strict evaluation modes}}
+// expected-note at -3 {{in call to '_mm_cvtpd_ps({1.000000e+00, 0.000000e+00})'}}
+}
+namespace Inf {
+constexpr __m128d a = { __builtin_huge_val(), 0.0 };
+constexpr __m128 r = _mm_cvtpd_ps(a);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at emmintrin.h:* {{floating point arithmetic produces an infinity}}
+// expected-note at -3 {{in call to '_mm_cvtpd_ps({INF, 0.000000e+00})'}}
+}
+namespace NaN {
+constexpr __m128d a = { __builtin_nan(""), 0.0 };
+constexpr __m128 r = _mm_cvtpd_ps(a);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at emmintrin.h:* {{floating point arithmetic produces a NaN}}
+// expected-note at -3 {{in call to '_mm_cvtpd_ps({nan, 0.000000e+00})'}}
+}
+namespace Subnormal {
+constexpr __m128d a = { 1e-310, 0.0 };
+constexpr __m128 r = _mm_cvtpd_ps(a);
+// expected-error at -1 {{must be initialized by a constant expression}}
+// expected-note at emmintrin.h:* {{compile time floating point arithmetic suppressed in strict evaluation modes}}
+// expected-note at -3 {{in call to '_mm_cvtpd_ps({1.000000e-310, 0.000000e+00})'}}
+}
+}
diff --git a/clang/test/SemaCXX/no-warn-consumed-analysis.cpp b/clang/test/SemaCXX/no-warn-consumed-analysis.cpp
new file mode 100644
index 0000000000000..59d503661a0b1
--- /dev/null
+++ b/clang/test/SemaCXX/no-warn-consumed-analysis.cpp
@@ -0,0 +1,9 @@
+// RUN: %clang_cc1 -fsyntax-only -verify -Wconsumed -fcxx-exceptions -std=c++11 %s
+// expected-no-diagnostics
+
+struct foo {
+  ~foo();
+};
+struct bar : foo {};
+struct baz : bar {};
+baz foobar(baz a) { return a; }
diff --git a/clang/test/SemaCXX/zero-length-arrays.cpp b/clang/test/SemaCXX/zero-length-arrays.cpp
index 0802ec7020463..6bfc7a5fd2e35 100644
--- a/clang/test/SemaCXX/zero-length-arrays.cpp
+++ b/clang/test/SemaCXX/zero-length-arrays.cpp
@@ -1,6 +1,7 @@
 // RUN: %clang_cc1 -fsyntax-only -verify %s
 // RUN: %clang_cc1 -fsyntax-only -verify -std=c++98 %s
 // RUN: %clang_cc1 -fsyntax-only -verify -std=c++11 %s
+// RUN: %clang_cc1 -fsyntax-only -verify -std=c++20 %s
 
 class Foo {
   ~Foo();
@@ -19,7 +20,7 @@ class Bar {
   Foo foos3[2][0];
 
 public:
-  Bar(): foo_count(0) { }    
+  Bar(): foo_count(0) { }
   ~Bar() { }
 };
 
@@ -33,3 +34,17 @@ void testBar() {
 #endif
   b = b2;
 }
+
+namespace GH170040 {
+#if __cplusplus >= 202002L
+template <int N> struct Foo {
+    operator int() const requires(N == 2);
+    template <int I = 0, char (*)[(I)] = nullptr> operator long() const;
+};
+
+void test () {
+    Foo<2> foo;
+    long bar = foo;
+}
+#endif
+}
diff --git a/clang/test/SemaHLSL/MatrixElementOverloadResolution.hlsl b/clang/test/SemaHLSL/MatrixElementOverloadResolution.hlsl
new file mode 100644
index 0000000000000..04149e176edbd
--- /dev/null
+++ b/clang/test/SemaHLSL/MatrixElementOverloadResolution.hlsl
@@ -0,0 +1,293 @@
+// RUN: %clang_cc1 -triple dxil-pc-shadermodel6.3-library -fnative-half-type -finclude-default-header -verify -o - %s -DERROR=1
+// RUN: %clang_cc1 -triple dxil-pc-shadermodel6.3-library -fnative-half-type -finclude-default-header -ast-dump %s | FileCheck %s
+
+// This test verifies floating point type implicit conversion ranks for overload
+// resolution. In HLSL the built-in type ranks are half < float < double. This
+// applies to both scalar and matrix types.
+
+// HLSL allows implicit truncation of types, so it differentiates between
+// promotions (converting to larger types) and conversions (converting to
+// smaller types). Promotions are preferred over conversions. Promotions prefer
+// promoting to the next lowest type in the ranking order. Conversions prefer
+// converting to the next highest type in the ranking order.
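+//
+// As an illustrative scalar analogue of these rules (hypothetical overloads,
+// not exercised by the checks below, which all use matrix types): given
+// void g(half) and void g(double), a call with a float argument resolves to
+// g(double), since the promotion to double is preferred over the conversion
+// to half; given void g(float) and void g(double), a call with a half
+// argument is an ambiguous promotion. Cases 2 and 3 below exercise the same
+// behavior with matrix element types.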
+
+void HalfFloatDouble(double2x2 D);
+void HalfFloatDouble(float2x2 F);
+void HalfFloatDouble(half2x2 H);
+
+// CHECK: FunctionDecl {{.*}} used HalfFloatDouble 'void (double2x2)'
+// CHECK: FunctionDecl {{.*}} used HalfFloatDouble 'void (float2x2)'
+// CHECK: FunctionDecl {{.*}} used HalfFloatDouble 'void (half2x2)'
+
+void FloatDouble(double2x2 D); // expected-note {{candidate function}}
+void FloatDouble(float2x2 F); // expected-note {{candidate function}}
+
+// CHECK: FunctionDecl {{.*}} used FloatDouble 'void (double2x2)'
+// CHECK: FunctionDecl {{.*}} used FloatDouble 'void (float2x2)'
+
+void HalfDouble(double2x2 D);
+void HalfDouble(half2x2 H);
+
+// CHECK: FunctionDecl {{.*}} used HalfDouble 'void (double2x2)'
+// CHECK: FunctionDecl {{.*}} used HalfDouble 'void (half2x2)'
+
+void HalfFloat(float2x2 F); // expected-note {{candidate function}}
+void HalfFloat(half2x2 H); // expected-note {{candidate function}}
+
+// CHECK: FunctionDecl {{.*}} used HalfFloat 'void (float2x2)'
+// CHECK: FunctionDecl {{.*}} used HalfFloat 'void (half2x2)'
+
+void Double(double2x2 D);
+void Float(float2x2 F);
+void Half(half2x2 H);
+
+// CHECK: FunctionDecl {{.*}} used Double 'void (double2x2)'
+// CHECK: FunctionDecl {{.*}} used Float 'void (float2x2)'
+// CHECK: FunctionDecl {{.*}} used Half 'void (half2x2)'
+
+// Case 1: A function declared with overloads for half float and double types.
+//   (a) When called with half, it will resolve to half because half is an exact
+//   match.
+//   (b) When called with float it will resolve to float because float is an
+//   exact match.
+//   (c) When called with double it will resolve to double because it is an
+//   exact match.
+
+// CHECK-LABEL: FunctionDecl {{.*}} Case1 'void (half2x2, float2x2, double2x2)'
+void Case1(half2x2 H, float2x2 F, double2x2 D) {
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(half2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (half2x2)' lvalue Function {{.*}} 'HalfFloatDouble' 'void (half2x2)'
+  HalfFloatDouble(H);
+
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(float2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (float2x2)' lvalue Function {{.*}} 'HalfFloatDouble' 'void (float2x2)'
+  HalfFloatDouble(F);
+
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(double2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (double2x2)' lvalue Function {{.*}} 'HalfFloatDouble' 'void (double2x2)'
+  HalfFloatDouble(D);
+}
+
+// Case 2: A function declared with double and float overloads.
+//   (a) When called with half, it fails to resolve the ambiguous promotion.
+//   (b) When called with float it will resolve to float because float is an
+//   exact match.
+//   (c) When called with double it will resolve to double because it is an
+//   exact match.
+
+// CHECK-LABEL: FunctionDecl {{.*}} Case2 'void (half2x2, float2x2, double2x2)'
+void Case2(half2x2 H, float2x2 F, double2x2 D) {
+#if ERROR
+  FloatDouble(H); // expected-error {{call to 'FloatDouble' is ambiguous}}
+#endif
+
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(float2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (float2x2)' lvalue Function {{.*}} 'FloatDouble' 'void (float2x2)'
+  FloatDouble(F);
+
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(double2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (double2x2)' lvalue Function {{.*}} 'FloatDouble' 'void (double2x2)'
+  FloatDouble(D);
+}
+
+// Case 3: A function declared with half and double overloads
+//   (a) When called with half, it will resolve to half because it is an exact
+//   match.
+//   (b) When called with float, it will resolve to double because double is a
+//   valid promotion.
+//   (c) When called with double, it will resolve to double because it is an
+//   exact match.
+
+// CHECK-LABEL: FunctionDecl {{.*}} Case3 'void (half2x2, float2x2, double2x2)'
+void Case3(half2x2 H, float2x2 F, double2x2 D) {
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(half2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (half2x2)' lvalue Function {{.*}} 'HalfDouble' 'void (half2x2)'
+  HalfDouble(H);
+
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(double2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (double2x2)' lvalue Function {{.*}} 'HalfDouble' 'void (double2x2)'
+  HalfDouble(F);
+
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(double2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (double2x2)' lvalue Function {{.*}} 'HalfDouble' 'void (double2x2)'
+  HalfDouble(D);
+}
+
+// Case 4: A function declared with half and float overloads.
+//   (a) When called with half, it will resolve to half because half is an exact
+//   match.
+//   (b) When called with float it will resolve to float because float is an
+//   exact match.
+//   (c) When called with double it fails to resolve the ambiguous conversion.
+
+// CHECK-LABEL: FunctionDecl {{.*}} Case4 'void (half2x2, float2x2, double2x2)'
+void Case4(half2x2 H, float2x2 F, double2x2 D) {
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(half2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (half2x2)' lvalue Function {{.*}} 'HalfFloat' 'void (half2x2)'
+  HalfFloat(H);
+
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(float2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (float2x2)' lvalue Function {{.*}} 'HalfFloat' 'void (float2x2)'
+  HalfFloat(F);
+
+#if ERROR
+  HalfFloat(D); // expected-error{{call to 'HalfFloat' is ambiguous}}
+#endif
+}
+
+// Case 5: A function declared with only a double overload.
+//   (a) When called with half, it will resolve to double because double is a
+//   valid promotion.
+//   (b) When called with float it will resolve to double because double is a
+//   valid promotion.
+//   (c) When called with double it will resolve to double because it is an
+//   exact match.
+
+// CHECK-LABEL: FunctionDecl {{.*}} Case5 'void (half2x2, float2x2, double2x2)'
+void Case5(half2x2 H, float2x2 F, double2x2 D) {
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(double2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (double2x2)' lvalue Function {{.*}} 'Double' 'void (double2x2)'
+  Double(H);
+
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(double2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (double2x2)' lvalue Function {{.*}} 'Double' 'void (double2x2)'
+  Double(F);
+
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(double2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (double2x2)' lvalue Function {{.*}} 'Double' 'void (double2x2)'
+  Double(D);
+}
+
+// Case 6: A function declared with only a float overload.
+//   (a) When called with half, it will resolve to float because float is a
+//   valid promotion.
+//   (b) When called with float it will resolve to float because float is an
+//   exact match.
+//   (c) When called with double it will resolve to float because it is a
+//   valid conversion.
+
+// CHECK-LABEL: FunctionDecl {{.*}} Case6 'void (half2x2, float2x2, double2x2)'
+void Case6(half2x2 H, float2x2 F, double2x2 D) {
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(float2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (float2x2)' lvalue Function {{.*}} 'Float' 'void (float2x2)'
+  Float(H);
+
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(float2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (float2x2)' lvalue Function {{.*}} 'Float' 'void (float2x2)'
+  Float(F);
+
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(float2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (float2x2)' lvalue Function {{.*}} 'Float' 'void (float2x2)'
+  Float(D); // TODO: See #168944. Make this an expected warning. {{implicit conversion loses floating-point precision: 'double2x2' (aka 'matrix<double, 2, 2>') to 'matrix<float, 2, 2>' (matrix of 2 'float' values)}}
+}
+
+// Case 7: A function declared with only a half overload.
+//   (a) When called with half, it will resolve to half because half is an
+//   exact match
+//   (b) When called with float it will resolve to half because half is a
+//   valid conversion.
+//   (c) When called with double it will resolve to half because it is a
+//   valid conversion.
+
+// CHECK-LABEL: FunctionDecl {{.*}} Case7 'void (half2x2, float2x2, double2x2)'
+void Case7(half2x2 H, float2x2 F, double2x2 D) {
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(half2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (half2x2)' lvalue Function {{.*}} 'Half' 'void (half2x2)'
+  Half(H);
+
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(half2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (half2x2)' lvalue Function {{.*}} 'Half' 'void (half2x2)'
+  Half(F); // TODO: See #168944. Make this an expected warning. {{implicit conversion loses floating-point precision: 'float2x2' (aka 'matrix<float, 2, 2>') to 'matrix<half, 2, 2>' (matrix of 4 'half' values)}}
+
+  // CHECK: CallExpr {{.*}} 'void'
+  // CHECK-NEXT: ImplicitCastExpr {{.*}} 'void (*)(half2x2)' <FunctionToPointerDecay>
+  // CHECK-NEXT: DeclRefExpr {{.*}} 'void (half2x2)' lvalue Function {{.*}} 'Half' 'void (half2x2)'
+  Half(D); // TODO: See #168944. Make this an expected warning. {{implicit conversion loses floating-point precision: 'double2x2' (aka 'matrix<double, 2, 2>') to 'matrix<half, 2, 2>' (matrix of 4 'half' values)}}
+}
+
+void fn3x2(float3x2) {} // expected-note{{candidate function}}
+void fn2x2(float2x2) {}
+void fn2x2IO(inout float2x2) {}
+void fnI2x2IO(inout int2x2) {}
+
+void matOrVec(float4 F) {}
+void matOrVec(float2x2 F) {}
+
+void matOrVec2(float3 F) {} // expected-note{{candidate function}}
+void matOrVec2(float2x3 F) {} // expected-note{{candidate function}}
+
+export void Case8(float2x3 f23, float4x4 f44, float3x3 f33, float3x2 f32) {
+  int2x2 i22 = f23;
+  // expected-warning at -1{{implicit conversion truncates matrix: 'float2x3' (aka 'matrix<float, 2, 3>') to 'int2x2' (aka 'matrix<int, 2, 2>')}}
+  //CHECK: VarDecl {{.*}} i22 'int2x2':'matrix<int, 2, 2>' cinit
+  //CHECK-NEXT: ImplicitCastExpr {{.*}} 'int2x2':'matrix<int, 2, 2>' <FloatingToIntegral>
+  //CHECK-NEXT: ImplicitCastExpr {{.*}} 'float2x3':'matrix<float, 2, 3>' <LValueToRValue>
+#ifdef ERROR
+  int3x2 i32 = f23; // expected-error{{cannot initialize a variable of type 'matrix<int, 3, 2>' with an lvalue of type 'matrix<float, 2, 3>'}}
+  fn3x2(f23); // expected-error{{no matching function for call to 'fn3x2'}}
+#endif
+  
+  fn2x2(f23);
+  // expected-warning at -1{{implicit conversion truncates matrix: 'float2x3' (aka 'matrix<float, 2, 3>') to 'matrix<float, 2, 2>'}}
+  //CHECK: DeclRefExpr {{.*}} 'void (float2x2)' lvalue Function {{.*}} 'fn2x2' 'void (float2x2)'
+  //CHECK-NEXT: ImplicitCastExpr {{.*}} 'matrix<float, 2, 2>' <HLSLMatrixTruncation>
+  //CHECK-NEXT: ImplicitCastExpr {{.*}} 'float2x3':'matrix<float, 2, 3>' <LValueToRValue>
+
+#ifdef ERROR
+  fn2x2IO(f23); // expected-error{{assigning to 'matrix<[2 * ...], 3>' from incompatible type 'matrix<[2 * ...], 2>'}}
+  fnI2x2IO(f23); // expected-error{{assigning to 'matrix<float, [...], 3>' from incompatible type 'matrix<int, [...], 2>'}}
+#endif
+
+  matOrVec(f23);
+  // expected-warning at -1{{implicit conversion truncates matrix: 'float2x3' (aka 'matrix<float, 2, 3>') to 'matrix<float, 2, 2>'}}
+  //CHECK: DeclRefExpr {{.*}} 'void (float2x2)' lvalue Function {{.*}} 'matOrVec' 'void (float2x2)'
+  //CHECK-NEXT: ImplicitCastExpr {{.*}} 'matrix<float, 2, 2>' <HLSLMatrixTruncation>
+  //CHECK-NEXT: ImplicitCastExpr {{.*}} 'float2x3':'matrix<float, 2, 3>' <LValueToRValue>
+
+  matOrVec(f44);
+  // expected-warning at -1{{implicit conversion truncates matrix: 'float4x4' (aka 'matrix<float, 4, 4>') to 'matrix<float, 2, 2>'}}
+  //CHECK: DeclRefExpr {{.*}} 'void (float2x2)' lvalue Function {{.*}} 'matOrVec' 'void (float2x2)'
+  //CHECK-NEXT: ImplicitCastExpr {{.*}} 'matrix<float, 2, 2>' <HLSLMatrixTruncation>
+  //CHECK-NEXT: ImplicitCastExpr {{.*}} 'float4x4':'matrix<float, 4, 4>' <LValueToRValue>
+
+#ifdef ERROR
+  matOrVec(2.0); // TODO: See #168960 this should be ambiguous once we implement ICK_HLSL_Matrix_Splat.
+#endif
+  matOrVec2(f23);
+  //CHECK: DeclRefExpr {{.*}} 'void (float2x3)' lvalue Function {{.*}} 'matOrVec2' 'void (float2x3)'
+  //CHECK-NEXT: ImplicitCastExpr {{.*}} 'float2x3':'matrix<float, 2, 3>' <LValueToRValue>
+
+  matOrVec2(f44);
+  // expected-warning at -1{{implicit conversion truncates matrix: 'float4x4' (aka 'matrix<float, 4, 4>') to 'matrix<float, 2, 3>'}}
+  //CHECK: DeclRefExpr {{.*}} 'void (float2x3)' lvalue Function {{.*}} 'matOrVec2' 'void (float2x3)'
+  //CHECK-NEXT: ImplicitCastExpr {{.*}} 'matrix<float, 2, 3>' <HLSLMatrixTruncation>
+  //CHECK-NEXT: ImplicitCastExpr {{.*}} 'float4x4':'matrix<float, 4, 4>' <LValueToRValue>
+
+  matOrVec2(f33);
+  // expected-warning at -1{{implicit conversion truncates matrix: 'float3x3' (aka 'matrix<float, 3, 3>') to 'matrix<float, 2, 3>'}}
+  //CHECK: DeclRefExpr {{.*}} 'void (float2x3)' lvalue Function {{.*}} 'matOrVec2' 'void (float2x3)'
+  //CHECK-NEXT: ImplicitCastExpr {{.*}} 'matrix<float, 2, 3>' <HLSLMatrixTruncation>
+  //CHECK-NEXT: ImplicitCastExpr {{.*}} 'float3x3':'matrix<float, 3, 3>' <LValueToRValue>
+  
+#ifdef ERROR
+  matOrVec2(f32); // expected-error{{no matching function for call to 'matOrVec2'}}
+#endif
+}
diff --git a/clang/test/SemaHLSL/Semantics/position.ps.hlsl b/clang/test/SemaHLSL/Semantics/position.ps.hlsl
index 47d07887911d6..d0fe19d1a5407 100644
--- a/clang/test/SemaHLSL/Semantics/position.ps.hlsl
+++ b/clang/test/SemaHLSL/Semantics/position.ps.hlsl
@@ -2,6 +2,6 @@
 // RUN: %clang_cc1 -triple spirv-pc-vulkan1.3-pixel -finclude-default-header -x hlsl -verify -o - %s
 
 float4 main(float4 a : A) : SV_Position {
-// expected-error at -1 {{attribute 'SV_Position' is unsupported in 'pixel' shaders, requires one of the following: pixel, vertex}}
+// expected-error at -1 {{semantic 'SV_Position' is unsupported in pixel shaders as output, requires one of the following: vertex input/output, pixel input}}
   return a;
 }
diff --git a/clang/test/SemaHLSL/Semantics/semantic.explicit-mix-builtin-vs.hlsl b/clang/test/SemaHLSL/Semantics/semantic.explicit-mix-builtin-vs.hlsl
new file mode 100644
index 0000000000000..3abd1cb65ffc4
--- /dev/null
+++ b/clang/test/SemaHLSL/Semantics/semantic.explicit-mix-builtin-vs.hlsl
@@ -0,0 +1,16 @@
+// RUN: %clang_cc1 -triple spirv-linux-vulkan-vertex -x hlsl -emit-llvm -finclude-default-header -disable-llvm-passes -o - %s -verify -verify-ignore-unexpected=note
+
+// This is almost the same as semantic.explicit-mix-builtin.hlsl, except this
+// time we build a vertex shader. As a result, the SV_Position semantic is no
+// longer a BuiltIn but a Location-decorated variable, so implicit and explicit
+// location assignment end up being mixed.
+struct S1 {
+  float4 position : SV_Position;
+  [[vk::location(3)]] float4 color : A;
+  // expected-error at -1 {{partial explicit stage input location assignment via vk::location(X) unsupported}}
+};
+
+[shader("vertex")]
+float4 main1(S1 p) : A {
+  return p.position + p.color;
+}
diff --git a/clang/test/SemaHLSL/Semantics/semantic.explicit-mix-location-2.hlsl b/clang/test/SemaHLSL/Semantics/semantic.explicit-mix-location-2.hlsl
new file mode 100644
index 0000000000000..7b494f5a9cbe6
--- /dev/null
+++ b/clang/test/SemaHLSL/Semantics/semantic.explicit-mix-location-2.hlsl
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -finclude-default-header -triple spirv-pc-vulkan1.3-pixel %s -emit-llvm-only -disable-llvm-passes -verify -verify-ignore-unexpected=note
+
+// The following code is not legal: both semantics A and B will be lowered
+// into a Location decoration. And mixing implicit and explicit Location
+// assignment is not supported.
+struct S1 {
+  [[vk::location(3)]] float4 color : B;
+  float4 position : A;
+  // expected-error at -1 {{partial explicit stage input location assignment via vk::location(X) unsupported}}
+};
+
+[shader("pixel")]
+float4 main1(S1 p) : SV_Target {
+  return p.position + p.color;
+}
diff --git a/clang/test/SemaHLSL/Semantics/semantic.explicit-mix-location.hlsl b/clang/test/SemaHLSL/Semantics/semantic.explicit-mix-location.hlsl
new file mode 100644
index 0000000000000..74f110c286cae
--- /dev/null
+++ b/clang/test/SemaHLSL/Semantics/semantic.explicit-mix-location.hlsl
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -finclude-default-header -triple spirv-pc-vulkan1.3-pixel %s -emit-llvm-only -disable-llvm-passes -verify -verify-ignore-unexpected=note
+
+// The following code is not legal: both semantics A and B will be lowered
+// into a Location decoration. And mixing implicit and explicit Location
+// assignment is not supported.
+struct S1 {
+  float4 position : A;
+  [[vk::location(3)]] float4 color : B;
+  // expected-error at -1 {{partial explicit stage input location assignment via vk::location(X) unsupported}}
+};
+
+[shader("pixel")]
+float4 main1(S1 p) : SV_Target {
+  return p.position + p.color;
+}
diff --git a/clang/test/SemaHLSL/Semantics/target.ps.input.hlsl b/clang/test/SemaHLSL/Semantics/target.ps.input.hlsl
new file mode 100644
index 0000000000000..a77b46c0e9f1a
--- /dev/null
+++ b/clang/test/SemaHLSL/Semantics/target.ps.input.hlsl
@@ -0,0 +1,7 @@
+// RUN: %clang_cc1 -triple dxil-pc-shadermodel6.3-pixel -finclude-default-header -x hlsl -verify -o - %s
+// RUN: %clang_cc1 -triple spirv-pc-vulkan1.3-pixel -finclude-default-header -x hlsl -verify -o - %s
+
+float4 main(float4 a : SV_Target) : A {
+// expected-error at -1 {{semantic 'SV_Target' is unsupported in pixel shaders as input, requires one of the following: pixel out}}
+  return a;
+}
diff --git a/clang/test/SemaHLSL/Semantics/target.vs.input.hlsl b/clang/test/SemaHLSL/Semantics/target.vs.input.hlsl
new file mode 100644
index 0000000000000..add24732fc05a
--- /dev/null
+++ b/clang/test/SemaHLSL/Semantics/target.vs.input.hlsl
@@ -0,0 +1,8 @@
+// RUN: %clang_cc1 -triple dxil-pc-shadermodel6.3-vertex -finclude-default-header -x hlsl -verify -o - %s
+// RUN: %clang_cc1 -triple spirv-pc-vulkan1.3-vertex -finclude-default-header -x hlsl -verify -o - %s
+
+float4 main(float4 a : SV_Target) : A {
+// expected-error at -1 {{attribute 'SV_Target' is unsupported in 'vertex' shaders, requires pixel}}
+  return a;
+}
+
diff --git a/clang/test/SemaHLSL/Semantics/target.vs.output.hlsl b/clang/test/SemaHLSL/Semantics/target.vs.output.hlsl
new file mode 100644
index 0000000000000..0481bcdad0177
--- /dev/null
+++ b/clang/test/SemaHLSL/Semantics/target.vs.output.hlsl
@@ -0,0 +1,7 @@
+// RUN: %clang_cc1 -triple dxil-pc-shadermodel6.3-vertex -finclude-default-header -x hlsl -verify -o - %s
+// RUN: %clang_cc1 -triple spirv-pc-vulkan1.3-vertex -finclude-default-header -x hlsl -verify -o - %s
+
+float4 main(float4 a : SV_Position) : SV_Target {
+// expected-error at -1 {{attribute 'SV_Target' is unsupported in 'vertex' shaders, requires pixel}}
+  return a;
+}
diff --git a/clang/test/SemaHLSL/Types/BuiltinMatrix/MatrixCastErrors.hlsl b/clang/test/SemaHLSL/Types/BuiltinMatrix/MatrixCastErrors.hlsl
new file mode 100644
index 0000000000000..59d432cd3eb00
--- /dev/null
+++ b/clang/test/SemaHLSL/Types/BuiltinMatrix/MatrixCastErrors.hlsl
@@ -0,0 +1,21 @@
+// RUN: %clang_cc1 -triple dxil-pc-shadermodel6.6-library -finclude-default-header -std=hlsl202x -verify %s
+
+// Note column is too large
+export int3x2 shape_cast_error(float2x3 f23) {
+    int3x2 i32 = (int3x2)f23;
+    // expected-error at -1 {{conversion between matrix types 'int3x2' (aka 'matrix<int, 3, 2>') and 'matrix<float, 2, 3>' of different size is not allowed}}
+    return i32;
+}
+// Note row is too large
+export int2x3 shape_cast_error2(float3x2 f32) {
+    int2x3 i23 = (int2x3)f32;
+    // expected-error at -1 {{conversion between matrix types 'int2x3' (aka 'matrix<int, 2, 3>') and 'matrix<float, 3, 2>' of different size is not allowed}}
+    return i23;
+}
+
+// Note that changing the element type independently of the shape should still error
+export int2x3 shape_cast_error3(float3x2 f32) {
+    int2x3 i23 = (int3x2)f32;
+    // expected-error at -1 {{cannot initialize a variable of type 'matrix<[...], 2, 3>' with an rvalue of type 'matrix<[...], 3, 2>'}}
+    return i23;
+}
diff --git a/clang/test/SemaHLSL/Types/BuiltinMatrix/MatrixImplicitTruncCastWarnings.hlsl b/clang/test/SemaHLSL/Types/BuiltinMatrix/MatrixImplicitTruncCastWarnings.hlsl
new file mode 100644
index 0000000000000..2c50b957578ec
--- /dev/null
+++ b/clang/test/SemaHLSL/Types/BuiltinMatrix/MatrixImplicitTruncCastWarnings.hlsl
@@ -0,0 +1,50 @@
+// RUN: %clang_cc1 -triple dxil-pc-shadermodel6.6-library -finclude-default-header -verify %s
+
+export int3x4 trunc_cast(int4x4 i44) {
+    int3x4 i34 = i44;
+    // expected-warning at -1{{implicit conversion truncates matrix: 'int4x4' (aka 'matrix<int, 4, 4>') to 'matrix<int, 3, 4>'}}
+    return i34;
+}
+
+export int4x3 trunc_cast0(int4x4 i44) {
+    int4x3 i43 = i44;
+    // expected-warning at -1{{implicit conversion truncates matrix: 'int4x4' (aka 'matrix<int, 4, 4>') to 'matrix<int, 4, 3>'}}
+    return i43;
+}
+
+export int3x3 trunc_cast1(int4x4 i44) {
+    int3x3 i33 = i44;
+    // expected-warning at -1{{implicit conversion truncates matrix: 'int4x4' (aka 'matrix<int, 4, 4>') to 'matrix<int, 3, 3>'}}
+    return i33;
+}
+
+export int3x2 trunc_cast2(int4x4 i44) {
+    int3x2 i32 = i44;
+    // expected-warning at -1{{implicit conversion truncates matrix: 'int4x4' (aka 'matrix<int, 4, 4>') to 'matrix<int, 3, 2>'}}
+    return i32;
+}
+
+export int2x3 trunc_cast3(int4x4 i44) {
+    int2x3 i23 = i44;
+    // expected-warning at -1{{implicit conversion truncates matrix: 'int4x4' (aka 'matrix<int, 4, 4>') to 'matrix<int, 2, 3>'}}
+    return i23;
+}
+
+export int2x2 trunc_cast4(int4x4 i44) {
+    int2x2 i22 = i44;
+    // expected-warning at -1{{implicit conversion truncates matrix: 'int4x4' (aka 'matrix<int, 4, 4>') to 'matrix<int, 2, 2>'}}
+    return i22;
+}
+
+export int2x1 trunc_cast5(int4x4 i44) {
+    int2x1 i21 = i44;
+    // expected-warning at -1{{implicit conversion truncates matrix: 'int4x4' (aka 'matrix<int, 4, 4>') to 'matrix<int, 2, 1>'}}
+    return i21;
+}
+
+export int trunc_scalar_cast6(int4x4 i44) {
+    int i1 = i44;
+    // expected-warning at -1{{implicit conversion turns matrix to scalar: 'int4x4' (aka 'matrix<int, 4, 4>') to 'int'}}
+    return i1;
+}
+
diff --git a/clang/test/SemaHLSL/static_resources.hlsl b/clang/test/SemaHLSL/static_resources.hlsl
new file mode 100644
index 0000000000000..f71e9ea98e0d9
--- /dev/null
+++ b/clang/test/SemaHLSL/static_resources.hlsl
@@ -0,0 +1,138 @@
+// RUN: %clang_cc1 -triple dxil-pc-shadermodel6.6-compute -emit-llvm -disable-llvm-passes -o - %s | llvm-cxxfilt | FileCheck %s
+
+// CHECK-DAG: [[ONE_STR:@.*]] = private unnamed_addr constant [4 x i8] c"One\00"
+// CHECK-DAG: [[ARRAY_STR:@.*]] = private unnamed_addr constant [6 x i8] c"Array\00"
+// CHECK-DAG: [[ONEWITHCOUNTER_STR:@.*]] = private unnamed_addr constant [15 x i8] c"OneWithCounter\00"
+// CHECK-DAG: [[ARRAYWITHCOUNTER_STR:@.*]] = private unnamed_addr constant [17 x i8] c"ArrayWithCounter\00"
+// CHECK-NOT: private unnamed_addr constant [{{[0-9]+}} x i8] c"Static
+
+RWBuffer<float> One : register(u1, space5);
+RWBuffer<float> Array[2] : register(u10, space6);
+RWStructuredBuffer<int> OneWithCounter : register(u2, space4);
+RWStructuredBuffer<int> ArrayWithCounter[2] : register(u7, space4);
+
+// Check that the non-static resource One is initialized from binding on
+// startup (register 1, space 5).
+// CHECK: define internal void @__cxx_global_var_init{{.*}}
+// CHECK-NEXT: entry:
+// CHECK-NEXT: call void @hlsl::RWBuffer<float>::__createFromBinding(unsigned int, unsigned int, int, unsigned int, char const*)
+// CHECK-SAME: (ptr {{.*}} @One, i32 noundef 1, i32 noundef 5, i32 noundef 1, i32 noundef 0, ptr noundef [[ONE_STR]])
+
+// Check that the non-static resource OneWithCounter is initialized from binding on
+// startup (register 2, space 4).
+// CHECK: define internal void @__cxx_global_var_init{{.*}}
+// CHECK-NEXT: entry:
+// CHECK-NEXT: call void @hlsl::RWStructuredBuffer<int>::__createFromBindingWithImplicitCounter(unsigned int, unsigned int, int, unsigned int, char const*, unsigned int)
+// CHECK-SAME: (ptr {{.*}} @OneWithCounter, i32 noundef 2, i32 noundef 4, i32 noundef 1, i32 noundef 0, ptr noundef [[ONEWITHCOUNTER_STR]], i32 noundef 0)
+
+// Note that non-static resource arrays are not initialized on startup.
+// The individual resources from the array are initialized on access.
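+// (This is exercised further down in main(), where `StaticLocal = Array[1];`
+// causes Array element 1 to be created from its binding at the point of
+// access.)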
+
+static RWBuffer<float> StaticOne;
+static RWBuffer<float> StaticArray[2];
+
+// Check that StaticOne resource is initialized on startup with the default
+// constructor and not from binding. It will initialize the handle to poison.
+// CHECK: define internal void @__cxx_global_var_init{{.*}}
+// CHECK-NEXT: entry:
+// CHECK-NEXT: call void @hlsl::RWBuffer<float>::RWBuffer()(ptr {{.*}} @StaticOne)
+
+// Check that StaticArray elements are initialized on startup with the default
+// constructor and not from binding. The initializer will loop over the array
+// elements and call the default constructor for each one, setting the handle to poison.
+// CHECK: define internal void @__cxx_global_var_init{{.*}}
+// CHECK-NEXT: entry:
+// CHECK-NEXT: br label %arrayctor.loop
+// CHECK: arrayctor.loop:                                   ; preds = %arrayctor.loop, %entry
+// CHECK-NEXT:   %arrayctor.cur = phi ptr [ @StaticArray, %entry ], [ %arrayctor.next, %arrayctor.loop ]
+// CHECK-NEXT: call void @hlsl::RWBuffer<float>::RWBuffer()(ptr {{.*}} %arrayctor.cur)
+// CHECK-NEXT: %arrayctor.next = getelementptr inbounds %"class.hlsl::RWBuffer", ptr %arrayctor.cur, i32 1
+// CHECK-NEXT: %arrayctor.done = icmp eq ptr %arrayctor.next, getelementptr inbounds (%"class.hlsl::RWBuffer", ptr @StaticArray, i32 2)
+// CHECK-NEXT: br i1 %arrayctor.done, label %arrayctor.cont, label %arrayctor.loop
+// CHECK: arrayctor.cont:                                   ; preds = %arrayctor.loop
+// CHECK-NEXT: ret void
+
+static RWStructuredBuffer<int> StaticOneWithCounter;
+
+// Check that StaticOneWithCounter resource is initialized on startup with the default
+// constructor and not from binding. It will initialize the handle to poison.
+// CHECK: define internal void @__cxx_global_var_init{{.*}}
+// CHECK-NEXT: entry:
+// CHECK-NEXT: call void @hlsl::RWStructuredBuffer<int>::RWStructuredBuffer()(ptr {{.*}} @StaticOneWithCounter)
+
+// No other global initialization routines should be present.
+// CHECK-NOT: define internal void @__cxx_global_var_init{{.*}}
+
+[numthreads(4,1,1)]
+void main() {
+// CHECK: define internal void @main()()
+// CHECK-NEXT: entry:
+// CHECK-NEXT: %[[TMP0:.*]] = alloca %"class.hlsl::RWBuffer"
+
+  static RWBuffer<float> StaticLocal;
+// Check that StaticLocal is initialized by default constructor (handle set to poison)
+// and not from binding.
+// call void @hlsl::RWBuffer<float>::RWBuffer()(ptr {{.*}} @main()::StaticLocal)
+
+  StaticLocal = Array[1];
+// Array[1] is accessed here, so it should be initialized from binding (register 10, space 6, index 1),
+// and then assigned to StaticLocal using = operator.
+// CHECK: call void @hlsl::RWBuffer<float>::__createFromBinding(unsigned int, unsigned int, int, unsigned int, char const*)
+// CHECK-SAME: (ptr {{.*}} %[[TMP0]], i32 noundef 10, i32 noundef 6, i32 noundef 2, i32 noundef 1, ptr noundef [[ARRAY_STR]])
+// CHECK-NEXT: call {{.*}} ptr @hlsl::RWBuffer<float>::operator=({{.*}})(ptr {{.*}} @main()::StaticLocal, ptr {{.*}} %[[TMP0]])
+
+  StaticOne = One;
+// Operator = call to assign non-static One handle to static StaticOne.
+// CHECK-NEXT: call {{.*}} ptr @hlsl::RWBuffer<float>::operator=({{.*}})(ptr {{.*}} @StaticOne, ptr {{.*}} @One)
+
+  StaticArray = Array;
+// Check that each element of StaticArray is initialized from binding (register 10, space 6, indices 0 and 1).
+// CHECK: call void @hlsl::RWBuffer<float>::__createFromBinding(unsigned int, unsigned int, int, unsigned int, char const*)
+// CHECK-SAME: (ptr {{.*}} sret(%"class.hlsl::RWBuffer") align 4 @StaticArray, i32 noundef 10, i32 noundef 6, i32 noundef 2, i32 noundef 0, ptr noundef [[ARRAY_STR]])
+// CHECK-NEXT: call void @hlsl::RWBuffer<float>::__createFromBinding(unsigned int, unsigned int, int, unsigned int, char const*)
+// CHECK-SAME: (ptr {{.*}} sret(%"class.hlsl::RWBuffer") align 4 getelementptr ([2 x %"class.hlsl::RWBuffer"], ptr @StaticArray, i32 0, i32 1),
+// CHECK-SAME: i32 noundef 10, i32 noundef 6, i32 noundef 2, i32 noundef 1, ptr noundef [[ARRAY_STR]]
+
+  StaticArray[1] = One;
+// Operator = call to assign non-static One handle to StaticArray element.
+// CHECK-NEXT: call {{.*}} ptr @hlsl::RWBuffer<float>::operator=(hlsl::RWBuffer<float> const&)
+// CHECK-SAME: (ptr {{.*}} getelementptr inbounds ([2 x %"class.hlsl::RWBuffer"], ptr @StaticArray, i32 0, i32 1), ptr {{.*}} @One)
+
+  StaticLocal[0] = 123;
+// CHECK-NEXT: %[[PTR0:.*]] = call {{.*}} ptr @hlsl::RWBuffer<float>::operator[](unsigned int)(ptr {{.*}} @main()::StaticLocal, i32 noundef 0)
+// CHECK-NEXT: store float 1.230000e+02, ptr %[[PTR0]]
+
+  StaticOne[1] = 456;
+// CHECK-NEXT: %[[PTR1:.*]] = call {{.*}} ptr @hlsl::RWBuffer<float>::operator[](unsigned int)(ptr {{.*}} @StaticOne, i32 noundef 1)
+// CHECK-NEXT: store float 4.560000e+02, ptr %[[PTR1]], align 4
+
+  StaticArray[1][2] = 789;
+// CHECK-NEXT: %[[PTR2:.*]] = call {{.*}} ptr @hlsl::RWBuffer<float>::operator[](unsigned int)
+// CHECK-SAME: (ptr {{.*}} getelementptr inbounds ([2 x %"class.hlsl::RWBuffer"], ptr @StaticArray, i32 0, i32 1), i32 noundef 2)
+// CHECK-NEXT: store float 7.890000e+02, ptr %[[PTR2]], align 4
+
+  static RWStructuredBuffer<int> StaticLocalWithCounter;
+// Check that StaticLocalWithCounter is initialized by default constructor (handle set to poison)
+// and not from binding.
+// call void @hlsl::RWStructuredBuffer<int>::RWStructuredBuffer()(ptr {{.*}} @main()::StaticLocalWithCounter)
+
+  static RWStructuredBuffer<int> StaticLocalArrayWithCounter[2];
+
+  StaticLocalWithCounter = OneWithCounter;
+// Operator = call to assign non-static OneWithCounter handles to StaticLocalWithCounter handles.
+// CHECK: call {{.*}} ptr @hlsl::RWStructuredBuffer<int>::operator=(hlsl::RWStructuredBuffer<int> const&)(ptr {{.*}} @main()::StaticLocalWithCounter, ptr {{.*}} @OneWithCounter)
+
+  StaticLocalArrayWithCounter = ArrayWithCounter;
+// Check that each element of StaticLocalArrayWithCounter is initialized from binding
+// of ArrayWithCounter (register 7, space 4, indices 0 and 1).
+// CHECK: call void @hlsl::RWStructuredBuffer<int>::__createFromBindingWithImplicitCounter(unsigned int, unsigned int, int, unsigned int, char const*, unsigned int)
+// CHECK-SAME: (ptr {{.*}} sret(%"class.hlsl::RWStructuredBuffer") align 4 @main()::StaticLocalArrayWithCounter,
+// CHECK-SAME: i32 noundef 7, i32 noundef 4, i32 noundef 2, i32 noundef 0, ptr noundef [[ARRAYWITHCOUNTER_STR]], i32 noundef 1)
+
+// CHECK-NEXT: call void @hlsl::RWStructuredBuffer<int>::__createFromBindingWithImplicitCounter(unsigned int, unsigned int, int, unsigned int, char const*, unsigned int)
+// CHECK-SAME: (ptr {{.*}} sret(%"class.hlsl::RWStructuredBuffer") align 4 getelementptr ([2 x %"class.hlsl::RWStructuredBuffer"], ptr @main()::StaticLocalArrayWithCounter, i32 0, i32 1),
+// CHECK-SAME: i32 noundef 7, i32 noundef 4, i32 noundef 2, i32 noundef 1, ptr noundef [[ARRAYWITHCOUNTER_STR]], i32 noundef 1)
+}
+
+// No other binding initialization calls should be present.
+// CHECK-NOT: call void @hlsl::RWBuffer<float>::__createFrom{{.*}}Binding{{.*}}
diff --git a/clang/tools/CMakeLists.txt b/clang/tools/CMakeLists.txt
index 7a7c56ae217b0..afdd613b4ee99 100644
--- a/clang/tools/CMakeLists.txt
+++ b/clang/tools/CMakeLists.txt
@@ -2,7 +2,6 @@ create_subdirectory_options(CLANG TOOL)
 
 add_clang_subdirectory(diagtool)
 add_clang_subdirectory(driver)
-add_clang_subdirectory(apinotes-test)
 if(CLANG_ENABLE_CIR)
   add_clang_subdirectory(cir-opt)
   add_clang_subdirectory(cir-translate)
@@ -22,7 +21,10 @@ if(HAVE_CLANG_REPL_SUPPORT)
   add_clang_subdirectory(clang-repl)
 endif()
 
-add_clang_subdirectory(c-index-test)
+if(CLANG_INCLUDE_TESTS)
+  add_clang_subdirectory(c-index-test)
+  add_clang_subdirectory(apinotes-test)
+endif()
 
 add_clang_subdirectory(clang-refactor)
 # For MinGW/Cygwin we only enable shared library if LLVM_LINK_LLVM_DYLIB=ON.
diff --git a/clang/tools/clang-nvlink-wrapper/ClangNVLinkWrapper.cpp b/clang/tools/clang-nvlink-wrapper/ClangNVLinkWrapper.cpp
index 58eb671c61989..a6c7a3affa97d 100644
--- a/clang/tools/clang-nvlink-wrapper/ClangNVLinkWrapper.cpp
+++ b/clang/tools/clang-nvlink-wrapper/ClangNVLinkWrapper.cpp
@@ -165,6 +165,19 @@ void diagnosticHandler(const DiagnosticInfo &DI) {
   }
 }
 
+bool hasFatBinary(const ArgList &Args, MemoryBufferRef Buffer) {
+  if (Args.hasArg(OPT_dry_run) && Args.hasArg(OPT_assume_device_object))
+    return false;
+  if (identify_magic(Buffer.getBuffer()) != file_magic::elf_relocatable)
+    return false;
+  Expected<std::unique_ptr<ObjectFile>> ObjFile =
+      ObjectFile::createObjectFile(Buffer);
+  if (!ObjFile) // Assume fatbin if the object creation fails.
+    return !errorToBool(ObjFile.takeError());
+  return (*ObjFile)->getArch() != Triple::nvptx &&
+         (*ObjFile)->getArch() != Triple::nvptx64;
+}
+
 Expected<StringRef> createTempFile(const ArgList &Args, const Twine &Prefix,
                                    StringRef Extension) {
   SmallString<128> OutputFile;
@@ -556,6 +569,11 @@ Expected<SmallVector<StringRef>> getInput(const ArgList &Args) {
       if (!Input)
         continue;
 
+      if (hasFatBinary(Args, *Input)) {
+        LinkerInput.emplace_back(std::move(Input));
+        continue;
+      }
+
       // Archive members only extract if they define needed symbols. We will
       // re-scan all the inputs if any files were extracted for the link job.
       Expected<bool> ExtractOrErr = getSymbols(*Input, SymTab, IsLazy);
@@ -674,7 +692,8 @@ Expected<SmallVector<StringRef>> getInput(const ArgList &Args) {
   // of this input files could be extracted from an archive.
   for (auto &Input : LinkerInput) {
     auto TempFileOrErr = createTempFile(
-        Args, sys::path::stem(Input->getBufferIdentifier()), "cubin");
+        Args, sys::path::stem(Input->getBufferIdentifier()),
+        hasFatBinary(Args, Input->getMemBufferRef()) ? "o" : "cubin");
     if (!TempFileOrErr)
       return TempFileOrErr.takeError();
     Expected<std::unique_ptr<FileOutputBuffer>> OutputOrErr =
diff --git a/clang/tools/clang-nvlink-wrapper/NVLinkOpts.td b/clang/tools/clang-nvlink-wrapper/NVLinkOpts.td
index 7af35bf5989ec..9915aef8ba7cf 100644
--- a/clang/tools/clang-nvlink-wrapper/NVLinkOpts.td
+++ b/clang/tools/clang-nvlink-wrapper/NVLinkOpts.td
@@ -112,3 +112,8 @@ def mllvm_EQ : Joined<["-"], "mllvm=">, Flags<[HelpHidden]>, Alias<mllvm>;
 
 def dry_run : Flag<["--", "-"], "dry-run">, Flags<[WrapperOnlyOption]>,
   HelpText<"Print generated commands without running.">;
+def assume_device_object
+    : Flag<["--", "-"], "assume-device-object">,
+      Flags<[WrapperOnlyOption]>,
+      HelpText<
+          "Assume objects have device object files, only in dry-run mode.">;
diff --git a/clang/tools/clang-scan-deps/ClangScanDeps.cpp b/clang/tools/clang-scan-deps/ClangScanDeps.cpp
index 984a51c915f45..6a2acb0d4f20e 100644
--- a/clang/tools/clang-scan-deps/ClangScanDeps.cpp
+++ b/clang/tools/clang-scan-deps/ClangScanDeps.cpp
@@ -286,11 +286,9 @@ class ResourceDirectoryCache {
     if (CachedResourceDir != Cache.end())
       return CachedResourceDir->second;
 
-    std::vector<StringRef> PrintResourceDirArgs{ClangBinaryName};
-    if (ClangCLMode)
-      PrintResourceDirArgs.push_back("/clang:-print-resource-dir");
-    else
-      PrintResourceDirArgs.push_back("-print-resource-dir");
+    const std::array<StringRef, 2> PrintResourceDirArgs{
+        ClangBinaryName,
+        ClangCLMode ? "/clang:-print-resource-dir" : "-print-resource-dir"};
 
     llvm::SmallString<64> OutputFile, ErrorFile;
     llvm::sys::fs::createTemporaryFile("print-resource-dir-output",
diff --git a/clang/unittests/ASTMatchers/ASTMatchersNodeTest.cpp b/clang/unittests/ASTMatchers/ASTMatchersNodeTest.cpp
index 3fcb5582d3dd7..108b32e5d91b9 100644
--- a/clang/unittests/ASTMatchers/ASTMatchersNodeTest.cpp
+++ b/clang/unittests/ASTMatchers/ASTMatchersNodeTest.cpp
@@ -2353,6 +2353,26 @@ TEST_P(ASTMatchersTest, ReferenceTypeLocTest_BindsToAnyRvalueReferenceTypeLoc) {
   EXPECT_TRUE(matches("float&& r = 3.0;", matcher));
 }
 
+TEST_P(ASTMatchersTest, ArrayTypeLocTest_BindsToAnyArrayTypeLoc) {
+  auto matcher = varDecl(hasName("x"), hasTypeLoc(arrayTypeLoc()));
+  EXPECT_TRUE(matches("int x[3];", matcher));
+  EXPECT_TRUE(matches("float x[3];", matcher));
+  EXPECT_TRUE(matches("char x[3];", matcher));
+  EXPECT_TRUE(matches("void* x[3];", matcher));
+  EXPECT_TRUE(matches("const int x[3] = {1, 2, 3};", matcher));
+  EXPECT_TRUE(matches("int x[3][4];", matcher));
+  EXPECT_TRUE(matches("void foo(int x[]);", matcher));
+  EXPECT_TRUE(matches("int a[] = {1, 2}; void foo() {int x[a[0]];}", matcher));
+}
+
+TEST_P(ASTMatchersTest, ArrayTypeLocTest_DoesNotBindToNonArrayTypeLoc) {
+  auto matcher = varDecl(hasName("x"), hasTypeLoc(arrayTypeLoc()));
+  EXPECT_TRUE(notMatches("int x;", matcher));
+  EXPECT_TRUE(notMatches("float x;", matcher));
+  EXPECT_TRUE(notMatches("char x;", matcher));
+  EXPECT_TRUE(notMatches("void* x;", matcher));
+}
+
 TEST_P(ASTMatchersTest,
        TemplateSpecializationTypeLocTest_BindsToVarDeclTemplateSpecialization) {
   if (!GetParam().isCXX()) {
diff --git a/clang/unittests/Analysis/ExprMutationAnalyzerTest.cpp b/clang/unittests/Analysis/ExprMutationAnalyzerTest.cpp
index 8fc9a66dbda7e..d171d47ac1fef 100644
--- a/clang/unittests/Analysis/ExprMutationAnalyzerTest.cpp
+++ b/clang/unittests/Analysis/ExprMutationAnalyzerTest.cpp
@@ -2091,4 +2091,21 @@ TEST(ExprMutationAnalyzerTest, PointeeMutatedByPointerToMemberOperator) {
   EXPECT_TRUE(isPointeeMutated(Results, AST.get()));
 }
 
+TEST(ExprMutationAnalyzerTest, PointeeMutatedByPassAsPointerToPointer) {
+  {
+    const std::string Code = "void f(int**); void g() { int* ip; f(&ip); }";
+    auto AST = buildASTFromCode(Code);
+    auto Results =
+        match(withEnclosingCompound(declRefTo("ip")), AST->getASTContext());
+    EXPECT_TRUE(isPointeeMutated(Results, AST.get()));
+  }
+  {
+    const std::string Code = "void f(void**); void g() { void* ip; f(&ip); }";
+    auto AST = buildASTFromCode(Code);
+    auto Results =
+        match(withEnclosingCompound(declRefTo("ip")), AST->getASTContext());
+    EXPECT_TRUE(isPointeeMutated(Results, AST.get()));
+  }
+}
+
 } // namespace clang
diff --git a/clang/unittests/Format/FormatTest.cpp b/clang/unittests/Format/FormatTest.cpp
index 5a5d77075bb3a..3ff784035dd44 100644
--- a/clang/unittests/Format/FormatTest.cpp
+++ b/clang/unittests/Format/FormatTest.cpp
@@ -20689,6 +20689,16 @@ TEST_F(FormatTest, AlignWithLineBreaks) {
                "}",
                Style);
 
+  verifyFormat("auto someLongName = 3;\n"
+               "auto x            = someLongExpression //\n"
+               "                    | ranges::views::values;",
+               Style);
+  verifyFormat(
+      "veryverylongvariablename = somethingelse;\n"
+      "shortervariablename      = anotherverylonglonglongvariablename + //\n"
+      "                           somevariablethatwastoolongtofitonthesamerow;",
+      Style);
+
   // clang-format off
   verifyFormat("void foo() {\n"
                "  const int capacityBefore = Entries.capacity();\n"
@@ -20762,6 +20772,42 @@ TEST_F(FormatTest, AlignWithLineBreaks) {
                Style);
   // clang-format on
 
+  // The start of the closure is indented from the start of the line. It should
+  // not move with the equal sign.
+  Style.ContinuationIndentWidth = 6;
+  Style.IndentWidth = 8;
+  Style.BreakBeforeBraces = FormatStyle::BS_Custom;
+  Style.BraceWrapping.BeforeLambdaBody = true;
+  Style.BraceWrapping.IndentBraces = true;
+  Style.ColumnLimit = 32;
+  verifyFormat("auto aaaaaaaaaaa = {};\n"
+               "b = []() constexpr\n"
+               "      -> aaaaaaaaaaaaaaaaaaaaaaa\n"
+               "        {\n"
+               "                return {}; //\n"
+               "        };",
+               Style);
+  verifyFormat("auto aaaaaaaaaaaaaaaaaaaaa = {};\n"
+               "b = []()\n"
+               "        {\n"
+               "                return; //\n"
+               "        };",
+               Style);
+  Style.ColumnLimit = 33;
+  verifyFormat("auto aaaaaaaaaaa = {};\n"
+               "b                = []() constexpr\n"
+               "      -> aaaaaaaaaaaaaaaaaaaaaaa\n"
+               "        {\n"
+               "                return {}; //\n"
+               "        };",
+               Style);
+  verifyFormat("auto aaaaaaaaaaaaaaaaaaaaa = {};\n"
+               "b                          = []()\n"
+               "        {\n"
+               "                return; //\n"
+               "        };",
+               Style);
+
   Style = getLLVMStyleWithColumns(20);
   Style.AlignConsecutiveAssignments.Enabled = true;
   Style.IndentWidth = 4;
@@ -20826,6 +20872,13 @@ TEST_F(FormatTest, AlignWithLineBreaks) {
       Style);
 
   Style.AlignConsecutiveAssignments.Enabled = true;
+  verifyFormat("float i2 = 0;\n"
+               "auto  v  = false ? type{}\n"
+               "                 : type{\n"
+               "                       1,\n"
+               "                   };",
+               Style);
+
   Style.ColumnLimit = 15;
   verifyFormat("int i1 = 1;\n"
                "k      = bar(\n"
@@ -20886,6 +20939,12 @@ TEST_F(FormatTest, AlignWithInitializerPeriods) {
                "});",
                Style);
 
+  verifyFormat("auto aaaaaaaaaaaaaaaaaaaaa = {};\n"
+               "auto b                     = {.a = {\n"
+               "                                  .a = 0,\n"
+               "                              }};",
+               Style);
+
   Style.AlignConsecutiveAssignments.Enabled = false;
   Style.AlignConsecutiveDeclarations.Enabled = true;
   verifyFormat("void foo3(void) {\n"
diff --git a/clang/unittests/Format/FormatTestObjC.cpp b/clang/unittests/Format/FormatTestObjC.cpp
index cf8143ace7b45..c685c5554f9b5 100644
--- a/clang/unittests/Format/FormatTestObjC.cpp
+++ b/clang/unittests/Format/FormatTestObjC.cpp
@@ -876,6 +876,32 @@ TEST_F(FormatTestObjC, FormatObjCMethodExpr) {
   verifyFormat("aaaaaa = [aa aa:aa\n"
                "             aa:aa];");
 
+  Style.AlignConsecutiveAssignments.Enabled = true;
+  // When the method name and parameters are on their own lines, their positions
+  // only depend on the continuation indentation configuration, not where the
+  // square bracket is. Thus they should not move with the square bracket in the
+  // alignment step.
+  verifyFormat("aaaaaa = [aa aa:aa\n"
+               "             aa:aa];\n"
+               "a      = [a //\n"
+               "    aaaaaaa:aa];");
+  verifyFormat("aaaaaa = [aa aa:aa\n"
+               "             aa:aa];\n"
+               "aaaaa  = [a //\n"
+               "          a:aa\n"
+               "    aaaaaaa:aa];");
+  // When the method name is on the same line as the square bracket, the
+  // positions of the parameters depend on where the square bracket is. Thus
+  // they should move with the square bracket in the alignment step.
+  verifyFormat("aaaaa  = [a aa:aa\n"
+               "            aa:aa];\n"
+               "aaaaaa = [aa aa:aa\n"
+               "             aa:aa];\n"
+               "aaaaa  = [a aa:aa\n"
+               "    aaaaaaaaaa:aa];");
+
+  Style.AlignConsecutiveAssignments.Enabled = false;
+
   // Message receiver taking multiple lines.
   // Non-corner case.
   verifyFormat("[[object block:^{\n"
diff --git a/clang/unittests/Format/FormatTestVerilog.cpp b/clang/unittests/Format/FormatTestVerilog.cpp
index eee2bbdf551e6..f407fc36c3a12 100644
--- a/clang/unittests/Format/FormatTestVerilog.cpp
+++ b/clang/unittests/Format/FormatTestVerilog.cpp
@@ -423,6 +423,11 @@ TEST_F(FormatTestVerilog, Declaration) {
   verifyFormat("wire (strong1, pull0) mynet, mynet1 = enable;");
   verifyFormat("wire (strong1, pull0) mynet, //\n"
                "                      mynet1 = enable;");
+
+  // The type or variable can be a C++ keyword.
+  verifyFormat("private mynet;");
+  verifyFormat("switch mynet;");
+  verifyFormat("wire try;");
 }
 
 TEST_F(FormatTestVerilog, Delay) {
diff --git a/clang/unittests/Format/TokenAnnotatorTest.cpp b/clang/unittests/Format/TokenAnnotatorTest.cpp
index 008adff1cee2d..065d53f42f02b 100644
--- a/clang/unittests/Format/TokenAnnotatorTest.cpp
+++ b/clang/unittests/Format/TokenAnnotatorTest.cpp
@@ -2918,6 +2918,13 @@ TEST_F(TokenAnnotatorTest, UnderstandsVerilogOperators) {
   EXPECT_EQ(Tokens[0]->TokenText, R"(\busa+index\
 +)");
   EXPECT_TOKEN(Tokens[1], tok::semi, TT_Unknown);
+
+  // A C++ keyword should be treated as an identifier.
+  Tokens = Annotate("volatile delete;");
+  ASSERT_EQ(Tokens.size(), 4u) << Tokens;
+  EXPECT_TOKEN(Tokens[0], tok::identifier, TT_Unknown);
+  EXPECT_TOKEN(Tokens[1], tok::identifier, TT_StartOfName);
+
   // An escaped newline should not be treated as an escaped identifier.
   Tokens = Annotate("\\\n");
   ASSERT_EQ(Tokens.size(), 1u) << Tokens;
diff --git a/clang/unittests/Lex/PPCallbacksTest.cpp b/clang/unittests/Lex/PPCallbacksTest.cpp
index 990689c6b1e45..9533fbc776e6e 100644
--- a/clang/unittests/Lex/PPCallbacksTest.cpp
+++ b/clang/unittests/Lex/PPCallbacksTest.cpp
@@ -437,6 +437,7 @@ TEST_F(PPCallbacksTest, FileNotFoundSkipped) {
   PreprocessorOptions PPOpts;
   HeaderSearch HeaderInfo(HSOpts, SourceMgr, Diags, LangOpts, Target.get());
 
+  unsigned int NumCalls = 0;
   DiagnosticConsumer *DiagConsumer = new DiagnosticConsumer;
   DiagnosticsEngine FileNotFoundDiags(DiagID, DiagOpts, DiagConsumer);
   Preprocessor PP(PPOpts, FileNotFoundDiags, LangOpts, SourceMgr, HeaderInfo,
@@ -445,21 +446,68 @@ TEST_F(PPCallbacksTest, FileNotFoundSkipped) {
 
   class FileNotFoundCallbacks : public PPCallbacks {
   public:
-    unsigned int NumCalls = 0;
+    unsigned int &NumCalls;
+
+    FileNotFoundCallbacks(unsigned int &NumCalls) : NumCalls(NumCalls) {}
+
     bool FileNotFound(StringRef FileName) override {
       NumCalls++;
       return FileName == "skipped.h";
     }
   };
 
-  auto *Callbacks = new FileNotFoundCallbacks;
-  PP.addPPCallbacks(std::unique_ptr<PPCallbacks>(Callbacks));
+  PP.addPPCallbacks(std::make_unique<FileNotFoundCallbacks>(NumCalls));
+
+  // Lex source text.
+  PP.EnterMainSourceFile();
+  PP.LexTokensUntilEOF();
+
+  ASSERT_EQ(1u, NumCalls);
+  ASSERT_EQ(0u, DiagConsumer->getNumErrors());
+}
+
+TEST_F(PPCallbacksTest, EmbedFileNotFoundChained) {
+  const char *SourceText = "#embed \"notfound.h\"\n";
+
+  std::unique_ptr<llvm::MemoryBuffer> SourceBuf =
+      llvm::MemoryBuffer::getMemBuffer(SourceText);
+  SourceMgr.setMainFileID(SourceMgr.createFileID(std::move(SourceBuf)));
+
+  unsigned int NumCalls = 0;
+  HeaderSearchOptions HSOpts;
+  TrivialModuleLoader ModLoader;
+  PreprocessorOptions PPOpts;
+  HeaderSearch HeaderInfo(HSOpts, SourceMgr, Diags, LangOpts, Target.get());
+
+  DiagnosticConsumer *DiagConsumer = new DiagnosticConsumer;
+  DiagnosticsEngine EmbedFileNotFoundDiags(DiagID, DiagOpts, DiagConsumer);
+  Preprocessor PP(PPOpts, EmbedFileNotFoundDiags, LangOpts, SourceMgr,
+                  HeaderInfo, ModLoader, /*IILookup=*/nullptr,
+                  /*OwnsHeaderSearch=*/false);
+  PP.Initialize(*Target);
+
+  class EmbedFileNotFoundCallbacks : public PPCallbacks {
+  public:
+    unsigned int &NumCalls;
+
+    EmbedFileNotFoundCallbacks(unsigned int &NumCalls) : NumCalls(NumCalls) {}
+
+    bool EmbedFileNotFound(StringRef FileName) override {
+      NumCalls++;
+      return true;
+    }
+  };
+
+  // Add two instances of `EmbedFileNotFoundCallbacks` to ensure the
+  // preprocessor is using an instance of `PPChainedCallbacks`.
+  PP.addPPCallbacks(std::make_unique<EmbedFileNotFoundCallbacks>(NumCalls));
+  PP.addPPCallbacks(std::make_unique<EmbedFileNotFoundCallbacks>(NumCalls));
 
   // Lex source text.
   PP.EnterMainSourceFile();
   PP.LexTokensUntilEOF();
 
-  ASSERT_EQ(1u, Callbacks->NumCalls);
+  ASSERT_EQ(2u, NumCalls);
   ASSERT_EQ(0u, DiagConsumer->getNumErrors());
 }
 
diff --git a/clang/unittests/Tooling/RangeSelectorTest.cpp b/clang/unittests/Tooling/RangeSelectorTest.cpp
index d441da165b09b..9e83fa1dc92ff 100644
--- a/clang/unittests/Tooling/RangeSelectorTest.cpp
+++ b/clang/unittests/Tooling/RangeSelectorTest.cpp
@@ -327,6 +327,45 @@ TEST(RangeSelectorTest, EncloseOpGeneralParsed) {
   EXPECT_THAT_EXPECTED(select(*R, Match), HasValue("3, 7"));
 }
 
+TEST(RangeSelectorTest, MergeOp) {
+  StringRef Code = R"cc(
+    int f(int x, int y, int z) { return 3; }
+    int g() { return f(/* comment */ 3, 7 /* comment */, 9); }
+  )cc";
+  auto Matcher = callExpr(hasArgument(0, expr().bind("a0")),
+                          hasArgument(1, expr().bind("a1")),
+                          hasArgument(2, expr().bind("a2")));
+  RangeSelector R = merge(node("a0"), node("a1"));
+  TestMatch Match = matchCode(Code, Matcher);
+  EXPECT_THAT_EXPECTED(select(R, Match), HasValue("3, 7"));
+  // Test the merge of two non-contiguous and out-of-order token-ranges.
+  R = merge(node("a2"), node("a0"));
+  EXPECT_THAT_EXPECTED(select(R, Match), HasValue("3, 7 /* comment */, 9"));
+  // Test the merge of a token-range (expr node) with a char-range (before).
+  R = merge(node("a1"), before(node("a0")));
+  EXPECT_THAT_EXPECTED(select(R, Match), HasValue("3, 7"));
+  // Test the merge of two char-ranges.
+  R = merge(before(node("a0")), before(node("a1")));
+  EXPECT_THAT_EXPECTED(select(R, Match), HasValue("3, "));
+}
+
+TEST(RangeSelectorTest, MergeOpParsed) {
+  StringRef Code = R"cc(
+    int f(int x, int y, int z) { return 3; }
+    int g() { return f(/* comment */ 3, 7 /* comment */, 9); }
+  )cc";
+  auto Matcher = callExpr(hasArgument(0, expr().bind("a0")),
+                          hasArgument(1, expr().bind("a1")),
+                          hasArgument(2, expr().bind("a2")));
+  auto R = parseRangeSelector(R"rs(merge(node("a0"), node("a1")))rs");
+  ASSERT_THAT_EXPECTED(R, llvm::Succeeded());
+  TestMatch Match = matchCode(Code, Matcher);
+  EXPECT_THAT_EXPECTED(select(*R, Match), HasValue("3, 7"));
+  R = parseRangeSelector(R"rs(merge(node("a2"), node("a1")))rs");
+  ASSERT_THAT_EXPECTED(R, llvm::Succeeded());
+  EXPECT_THAT_EXPECTED(select(*R, Match), HasValue("7 /* comment */, 9"));
+}
+
 TEST(RangeSelectorTest, NodeOpStatement) {
   StringRef Code = "int f() { return 3; }";
   TestMatch Match = matchCode(Code, returnStmt().bind("id"));
diff --git a/clang/utils/TableGen/MveEmitter.cpp b/clang/utils/TableGen/MveEmitter.cpp
index 7681213d9675a..8fde56a0bb5ec 100644
--- a/clang/utils/TableGen/MveEmitter.cpp
+++ b/clang/utils/TableGen/MveEmitter.cpp
@@ -1260,7 +1260,9 @@ Result::Ptr EmitterBase::getCodeForDag(const DagInit *D,
     for (unsigned i = 0, e = D->getNumArgs(); i < e; ++i)
       Args.push_back(getCodeForDagArg(D, i, Scope, Param));
 
-    auto GenIRBuilderBase = [&](const Record *Op) {
+    auto GenIRBuilderBase = [&](const Record *Op) -> Result::Ptr {
+      assert(Op->isSubClassOf("IRBuilderBase") &&
+             "Expected IRBuilderBase in GenIRBuilderBase\n");
       std::set<unsigned> AddressArgs;
       std::map<unsigned, std::string> IntegerArgs;
       for (const Record *sp : Op->getValueAsListOfDefs("special_params")) {
@@ -1274,7 +1276,9 @@ Result::Ptr EmitterBase::getCodeForDag(const DagInit *D,
       return std::make_shared<IRBuilderResult>(Op->getValueAsString("prefix"),
                                                Args, AddressArgs, IntegerArgs);
     };
-    auto GenIRIntBase = [&](const Record *Op) {
+    auto GenIRIntBase = [&](const Record *Op) -> Result::Ptr {
+      assert(Op->isSubClassOf("IRIntBase") &&
+             "Expected IRIntBase in GenIRIntBase\n");
       std::vector<const Type *> ParamTypes;
       for (const Record *RParam : Op->getValueAsListOfDefs("params"))
         ParamTypes.push_back(getType(RParam, Param));
@@ -1289,8 +1293,11 @@ Result::Ptr EmitterBase::getCodeForDag(const DagInit *D,
     } else if (Op->isSubClassOf("IRIntBase")) {
       return GenIRIntBase(Op);
     } else if (Op->isSubClassOf("strictFPAlt")) {
-      auto Standard = GenIRBuilderBase(Op->getValueAsDef("standard"));
-      auto StrictFp = GenIRIntBase(Op->getValueAsDef("strictfp"));
+      auto StandardBuilder = Op->getValueAsDef("standard");
+      Result::Ptr Standard = StandardBuilder->isSubClassOf("IRBuilder")
+                                 ? GenIRBuilderBase(StandardBuilder)
+                                 : GenIRIntBase(StandardBuilder);
+      Result::Ptr StrictFp = GenIRIntBase(Op->getValueAsDef("strictfp"));
       return std::make_shared<StrictFpAltResult>(Standard, StrictFp);
     } else {
       PrintFatalError("Unsupported dag node " + Op->getName());
diff --git a/compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake b/compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake
index c10367715396e..f2317de8916e9 100644
--- a/compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake
+++ b/compiler-rt/cmake/Modules/AllSupportedArchDefs.cmake
@@ -89,7 +89,7 @@ else()
   set(ALL_TSAN_SUPPORTED_ARCH ${X86_64} ${MIPS64} ${ARM64} ${PPC64} ${S390X}
       ${LOONGARCH64} ${RISCV64})
 endif()
-set(ALL_TYSAN_SUPPORTED_ARCH ${X86_64} ${ARM64})
+set(ALL_TYSAN_SUPPORTED_ARCH ${X86_64} ${ARM64} ${S390X})
 set(ALL_UBSAN_SUPPORTED_ARCH ${X86} ${X86_64} ${ARM32} ${ARM64} ${RISCV64}
     ${MIPS32} ${MIPS64} ${PPC64} ${S390X} ${SPARC} ${SPARCV9} ${HEXAGON}
     ${LOONGARCH64})
diff --git a/compiler-rt/cmake/config-ix.cmake b/compiler-rt/cmake/config-ix.cmake
index 8dfbdecbd6b97..084a7060e8d13 100644
--- a/compiler-rt/cmake/config-ix.cmake
+++ b/compiler-rt/cmake/config-ix.cmake
@@ -856,6 +856,7 @@ else()
 endif()
 
 if (COMPILER_RT_HAS_SANITIZER_COMMON AND TYSAN_SUPPORTED_ARCH AND
+        "tysan" IN_LIST COMPILER_RT_SANITIZERS_TO_BUILD AND
         OS_NAME MATCHES "Linux|Darwin")
   set(COMPILER_RT_HAS_TYSAN TRUE)
 else()
diff --git a/compiler-rt/lib/builtins/CMakeLists.txt b/compiler-rt/lib/builtins/CMakeLists.txt
index aa2a2519afc02..7e8621855eb84 100644
--- a/compiler-rt/lib/builtins/CMakeLists.txt
+++ b/compiler-rt/lib/builtins/CMakeLists.txt
@@ -1011,9 +1011,10 @@ else ()
         list(APPEND BUILTIN_CFLAGS_${arch} -fomit-frame-pointer -DCOMPILER_RT_ARMHF_TARGET)
       endif()
 
-      # For RISCV32, we must force enable int128 for compiling long
+      # For RISCV32 and 32-bit SPARC, we must force enable int128 for compiling long
       # double routines.
-      if(COMPILER_RT_ENABLE_SOFTWARE_INT128 OR "${arch}" STREQUAL "riscv32")
+      if (COMPILER_RT_ENABLE_SOFTWARE_INT128 OR ("${arch}" MATCHES "riscv32|sparc$"
+        AND NOT CMAKE_COMPILER_IS_GNUCC))
         list(APPEND BUILTIN_CFLAGS_${arch} -fforce-enable-int128)
       endif()
 
diff --git a/compiler-rt/lib/tysan/tysan_platform.h b/compiler-rt/lib/tysan/tysan_platform.h
index 19f77f0cace6b..96049c8d1c9a2 100644
--- a/compiler-rt/lib/tysan/tysan_platform.h
+++ b/compiler-rt/lib/tysan/tysan_platform.h
@@ -45,6 +45,13 @@ struct Mapping48 {
   static const uptr kPtrShift = 3;
 };
 #define TYSAN_RUNTIME_VMA 1
+#elif defined(__s390x__)
+struct Mapping {
+  static const uptr kShadowAddr = 0x080000000000ULL;
+  static const uptr kAppAddr = 0x460000000000ULL;
+  static const uptr kAppMemMsk = ~0xC00000000000ULL;
+  static const uptr kPtrShift = 3;
+};
 #else
 #error "TySan not supported for this platform!"
 #endif
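
The s390x entry follows the same scheme as the existing mappings: kAppMemMsk strips the region bits of an application address, kPtrShift scales one application byte to one pointer-sized shadow slot, and kShadowAddr rebases the result into the shadow region. A standalone sketch of that arithmetic is below; the memToShadow formula is an illustrative assumption, the authoritative mapping lives in the TySan runtime sources rather than in this hunk.

#include <cstdint>
#include <cstdio>

// Constants copied from the new s390x Mapping above.
constexpr std::uint64_t kShadowAddr = 0x080000000000ULL;
constexpr std::uint64_t kAppAddr = 0x460000000000ULL;
constexpr std::uint64_t kAppMemMsk = ~0xC00000000000ULL;
constexpr std::uint64_t kPtrShift = 3;

// Assumed mapping for illustration: strip the region bits, scale one
// application byte to one 8-byte shadow slot, rebase into the shadow area.
constexpr std::uint64_t memToShadow(std::uint64_t App) {
  return ((App & kAppMemMsk) << kPtrShift) + kShadowAddr;
}

int main() {
  std::uint64_t App = kAppAddr + 0x1000;
  std::printf("app    0x%llx\nshadow 0x%llx\n", (unsigned long long)App,
              (unsigned long long)memToShadow(App));
}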
diff --git a/compiler-rt/test/CMakeLists.txt b/compiler-rt/test/CMakeLists.txt
index a2e4c8cbf5685..9cfb7ea559475 100644
--- a/compiler-rt/test/CMakeLists.txt
+++ b/compiler-rt/test/CMakeLists.txt
@@ -16,6 +16,13 @@ pythonize_bool(COMPILER_RT_HAS_AARCH64_SME)
 
 pythonize_bool(COMPILER_RT_HAS_NO_DEFAULT_CONFIG_FLAG)
 
+if(LLVM_TREE_AVAILABLE OR NOT COMPILER_RT_STANDALONE_BUILD)
+  set(COMPILER_RT_BUILT_WITH_LLVM TRUE)
+else()
+  set(COMPILER_RT_BUILT_WITH_LLVM FALSE)
+endif()
+pythonize_bool(COMPILER_RT_BUILT_WITH_LLVM)
+
 configure_compiler_rt_lit_site_cfg(
   ${CMAKE_CURRENT_SOURCE_DIR}/lit.common.configured.in
   ${CMAKE_CURRENT_BINARY_DIR}/lit.common.configured)
diff --git a/compiler-rt/test/builtins/CMakeLists.txt b/compiler-rt/test/builtins/CMakeLists.txt
index 8e3cb35183ba7..36135c7905900 100644
--- a/compiler-rt/test/builtins/CMakeLists.txt
+++ b/compiler-rt/test/builtins/CMakeLists.txt
@@ -48,7 +48,8 @@ foreach(arch ${BUILTIN_TEST_ARCH})
     string(REPLACE ";" " " BUILTINS_TEST_TARGET_CFLAGS "${BUILTINS_TEST_TARGET_CFLAGS}")
   endif()
 
-  if (COMPILER_RT_ENABLE_SOFTWARE_INT128 OR ${arch} STREQUAL "riscv32")
+  if (COMPILER_RT_ENABLE_SOFTWARE_INT128 OR ("${arch}" MATCHES "riscv32|sparc$"
+    AND NOT CMAKE_COMPILER_IS_GNUCC))
     list(APPEND BUILTINS_TEST_TARGET_CFLAGS -fforce-enable-int128)
     string(REPLACE ";" " " BUILTINS_TEST_TARGET_CFLAGS "${BUILTINS_TEST_TARGET_CFLAGS}")
   endif()
diff --git a/compiler-rt/test/sanitizer_common/TestCases/printf-ldbl.c b/compiler-rt/test/sanitizer_common/TestCases/printf-ldbl.c
index cfe8d800d3834..f6629ab81c3b3 100644
--- a/compiler-rt/test/sanitizer_common/TestCases/printf-ldbl.c
+++ b/compiler-rt/test/sanitizer_common/TestCases/printf-ldbl.c
@@ -1,8 +1,5 @@
 // RUN: %clang %s -o %t && %run %t 2>&1
 
-// Issue #41838
-// XFAIL: sparc-target-arch && target={{.*solaris.*}}
-
 #include <assert.h>
 #include <stdio.h>
 #include <string.h>
diff --git a/compiler-rt/test/sanitizer_common/TestCases/scanf-ldbl.c b/compiler-rt/test/sanitizer_common/TestCases/scanf-ldbl.c
index a38f34a245fae..9ca30f4a65688 100644
--- a/compiler-rt/test/sanitizer_common/TestCases/scanf-ldbl.c
+++ b/compiler-rt/test/sanitizer_common/TestCases/scanf-ldbl.c
@@ -1,8 +1,5 @@
 // RUN: %clang %s -o %t && %run %t 2>&1
 
-// Issue #41838
-// XFAIL: sparc-target-arch && target={{.*solaris.*}}
-
 #include <assert.h>
 #include <stdio.h>
 #include <string.h>
diff --git a/compiler-rt/test/ubsan/TestCases/Float/cast-overflow.cpp b/compiler-rt/test/ubsan/TestCases/Float/cast-overflow.cpp
index 8638bf69f749e..80063b7a0f9f9 100644
--- a/compiler-rt/test/ubsan/TestCases/Float/cast-overflow.cpp
+++ b/compiler-rt/test/ubsan/TestCases/Float/cast-overflow.cpp
@@ -9,9 +9,6 @@
 // RUN: %run %t 6 2>&1 | FileCheck %s --check-prefix=CHECK-6
 // RUN: %run %t 7 2>&1 | FileCheck %s --check-prefix=CHECK-7
 
-// Issue #41838
-// XFAIL: sparc-target-arch && target={{.*solaris.*}}
-
 // This test assumes float and double are IEEE-754 single- and double-precision.
 
 #if defined(__APPLE__)
diff --git a/compiler-rt/test/ubsan/TestCases/Misc/log-path_test.cpp b/compiler-rt/test/ubsan/TestCases/Misc/log-path_test.cpp
index 4773884cb4cc0..3fd02957a6903 100644
--- a/compiler-rt/test/ubsan/TestCases/Misc/log-path_test.cpp
+++ b/compiler-rt/test/ubsan/TestCases/Misc/log-path_test.cpp
@@ -24,9 +24,6 @@
 // FIXME: log_path is not supported on Windows yet.
 // XFAIL: target={{.*windows-msvc.*}}
 
-// Issue #41838
-// XFAIL: sparc-target-arch && target={{.*solaris.*}}
-
 #include <stdio.h>
 #include <stdlib.h>
 int main(int argc, char *argv[]) {
diff --git a/compiler-rt/test/xray/TestCases/Posix/always-never-instrument.cpp b/compiler-rt/test/xray/TestCases/Posix/always-never-instrument.cpp
index e5fefc07c1cc8..85a3d232ed1b4 100644
--- a/compiler-rt/test/xray/TestCases/Posix/always-never-instrument.cpp
+++ b/compiler-rt/test/xray/TestCases/Posix/always-never-instrument.cpp
@@ -12,6 +12,8 @@
 
 // REQUIRES: built-in-llvm-tree
 
+// UNSUPPORTED: armhf-linux
+
 // NOINSTR-NOT: {{.*__xray_NeverInstrumented.*}}
 int __xray_NeverInstrumented() {
   return 0;
diff --git a/compiler-rt/test/xray/TestCases/Posix/default-options.cpp b/compiler-rt/test/xray/TestCases/Posix/default-options.cpp
index e00ff3ba0a5cb..73cb4ddd9ea45 100644
--- a/compiler-rt/test/xray/TestCases/Posix/default-options.cpp
+++ b/compiler-rt/test/xray/TestCases/Posix/default-options.cpp
@@ -4,6 +4,8 @@
 
 // REQUIRES: built-in-llvm-tree
 
+// UNSUPPORTED: ppc
+
 extern "C" __attribute__((xray_never_instrument)) const char *
 __xray_default_options() {
   return "patch_premain=true:verbosity=1:xray_mode=xray-basic";
diff --git a/compiler-rt/test/xray/lit.site.cfg.py.in b/compiler-rt/test/xray/lit.site.cfg.py.in
index 72a7be6a80e3a..021d999dc7b21 100644
--- a/compiler-rt/test/xray/lit.site.cfg.py.in
+++ b/compiler-rt/test/xray/lit.site.cfg.py.in
@@ -5,7 +5,7 @@ config.name_suffix = "@XRAY_TEST_CONFIG_SUFFIX@"
 config.xray_lit_source_dir = "@XRAY_LIT_SOURCE_DIR@"
 config.target_cflags = "@XRAY_TEST_TARGET_CFLAGS@"
 config.target_arch = "@XRAY_TEST_TARGET_ARCH@"
-config.built_with_llvm = ("@COMPILER_RT_STANDALONE_BUILD@" != "TRUE")
+config.built_with_llvm = "@COMPILER_RT_BUILT_WITH_LLVM@"
 
 # TODO: Look into whether we can run a capability test on the standalone build to
 # see whether it can run 'llvm-xray convert' instead of turning off tests for a
diff --git a/flang-rt/lib/runtime/extensions.cpp b/flang-rt/lib/runtime/extensions.cpp
index d3a618c1a39ec..c110b0381890c 100644
--- a/flang-rt/lib/runtime/extensions.cpp
+++ b/flang-rt/lib/runtime/extensions.cpp
@@ -12,6 +12,7 @@
 #include "flang/Runtime/extensions.h"
 #include "unit.h"
 #include "flang-rt/runtime/descriptor.h"
+#include "flang-rt/runtime/lock.h"
 #include "flang-rt/runtime/terminator.h"
 #include "flang-rt/runtime/tools.h"
 #include "flang/Runtime/command.h"
@@ -23,6 +24,7 @@
 #include <cstdio>
 #include <cstring>
 #include <ctime>
+#include <limits>
 #include <signal.h>
 #include <stdlib.h>
 #include <thread>
@@ -60,6 +62,11 @@ inline void CtimeBuffer(char *buffer, size_t bufsize, const time_t cur_time,
 
 namespace Fortran::runtime {
 
+#define GFC_RAND_A 16807
+#define GFC_RAND_M 2147483647
+static unsigned rand_seed = 1;
+static Lock rand_seed_lock;
+
 // Common implementation that could be used for either SECNDS() or DSECNDS(),
 // which are defined for float or double.
 template <typename T> T SecndsImpl(T *refTime) {
@@ -409,6 +416,57 @@ std::int64_t RTNAME(time)() { return time(nullptr); }
 // MCLOCK: returns accumulated CPU time in ticks
 std::int32_t FORTRAN_PROCEDURE_NAME(mclock)() { return std::clock(); }
 
+static void _internal_srand(int seed) { rand_seed = seed ? seed : 123459876; }
+
+// IRAND(I)
+int RTNAME(Irand)(int *i) {
+  int j;
+  if (i)
+    j = *i;
+  else
+    j = 0;
+
+  rand_seed_lock.Take();
+  switch (j) {
+  case 0:
+    break;
+  case 1:
+    _internal_srand(0);
+    break;
+  default:
+    _internal_srand(j);
+    break;
+  }
+
+  rand_seed = 1ull * GFC_RAND_A * rand_seed % GFC_RAND_M; // avoid 32-bit wrap
+  j = (int)rand_seed;
+  rand_seed_lock.Drop();
+  return j;
+}
+
+// RAND(I)
+float RTNAME(Rand)(int *i, const char *sourceFile, int line) {
+  unsigned mask = 0;
+  constexpr int radix = std::numeric_limits<float>::radix;
+  constexpr int digits = std::numeric_limits<float>::digits;
+  if (radix == 2) {
+    mask = ~(unsigned)0u << (32 - digits + 1);
+  } else if (radix == 16) {
+    mask = ~(unsigned)0u << ((8 - digits) * 4 + 1);
+  } else {
+    Terminator terminator{sourceFile, line};
+    terminator.Crash("Radix unknown value.");
+  }
+  return ((unsigned)(RTNAME(Irand)(i) - 1) & mask) * (float)0x1.p-31f;
+}
+
+// SRAND(SEED)
+void FORTRAN_PROCEDURE_NAME(srand)(int *seed) {
+  rand_seed_lock.Take();
+  _internal_srand(*seed);
+  rand_seed_lock.Drop();
+}
+
 // Extension procedures related to I/O
 
 namespace io {
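
For readers who want to sanity-check the generator outside of flang-rt, here is a small self-contained sketch of the same recurrence and scaling. The constants and FLAG handling are taken from the hunk above; the float path corresponds to the radix-2 branch of RTNAME(Rand), and the helper names are illustrative, not part of the patch.

#include <cstdio>
#include <limits>

// gfortran-compatible Park-Miller recurrence, as in the extensions.cpp hunk.
constexpr unsigned long long kA = 16807;      // GFC_RAND_A
constexpr unsigned long long kM = 2147483647; // GFC_RAND_M == 2^31 - 1

static unsigned Seed = 1;

static void srandImpl(int S) { Seed = S ? (unsigned)S : 123459876u; }

// FLAG semantics from the docs: 0 = next value, 1 = restart, else reseed.
static int irandImpl(int Flag) {
  if (Flag == 1)
    srandImpl(0);
  else if (Flag != 0)
    srandImpl(Flag);
  Seed = (unsigned)(kA * Seed % kM); // 64-bit multiply, no 32-bit wrap
  return (int)Seed;
}

// Radix-2 scaling used by RTNAME(Rand): keep the top mantissa-width bits of
// the integer value and scale into [0, 1).
static float randImpl(int Flag) {
  constexpr int Digits = std::numeric_limits<float>::digits; // 24
  unsigned Mask = ~0u << (32 - Digits + 1);
  return ((unsigned)(irandImpl(Flag) - 1) & Mask) * 0x1.p-31f;
}

int main() {
  for (int I = 0; I < 3; ++I) {
    int IV = irandImpl(0);
    float RV = randImpl(0);
    std::printf("IRAND=%d RAND=%f\n", IV, RV);
  }
}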
diff --git a/flang/docs/Directives.md b/flang/docs/Directives.md
index 128d8f9b6b707..5640e44e16bae 100644
--- a/flang/docs/Directives.md
+++ b/flang/docs/Directives.md
@@ -57,6 +57,15 @@ A list of non-standard directives supported by Flang
 * `!dir$ vector always` forces vectorization on the following loop regardless
   of cost model decisions. The loop must still be vectorizable.
   [This directive currently only works on plain do loops without labels].
+* `!dir$ vector vectorlength({fixed|scalable|<num>|<num>,fixed|<num>,scalable})`
+  specifies a hint to the compiler about the desired vectorization factor. If
+  `fixed` is used, the compiler should prefer fixed-width vectorization.
+  Scalable vectorization instructions may still be used with a fixed-width
+  predicate. If `scalable` is used, the compiler should prefer scalable
+  vectorization, though it may still choose fixed-width vectorization or no
+  vectorization at all. `<num>`, which must be an integer literal, asks the
+  compiler to consider that specific vectorization factor. This directive
+  currently has the same limitations as `!dir$ vector always`.
 * `!dir$ unroll [n]` specifies that the compiler ought to unroll the immediately
   following loop `n` times. When `n` is `0` or `1`, the loop should not be unrolled
   at all. When `n` is `2` or greater, the loop should be unrolled exactly `n`
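
To connect this documentation with the lowering change later in the patch: the hint is materialized as an MLIR LLVM-dialect LoopVectorizeAttr. Below is a minimal sketch of what `!dir$ vector vectorlength(4, fixed)` would produce, following the argument order of the updated genLoopVectorizeAttr in Bridge.cpp; the `builder` variable is illustrative, and treating `fixed` as scalableEnable=false is an assumption on my part.

// Sketch: `vectorlength(4, fixed)` keeps vectorization enabled, requests
// width 4, and (assumed here) turns scalable vectors off.
mlir::MLIRContext *ctx = builder->getContext();
mlir::BoolAttr disable = mlir::BoolAttr::get(ctx, false);
mlir::BoolAttr scalableEnable = mlir::BoolAttr::get(ctx, false);
mlir::IntegerAttr vectorWidth =
    builder->getIntegerAttr(builder->getI64Type(), 4);
auto va = mlir::LLVM::LoopVectorizeAttr::get(
    ctx, /*disable=*/disable, /*predicate=*/{},
    /*scalableEnable=*/scalableEnable, /*vectorWidth=*/vectorWidth, {}, {}, {});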
diff --git a/flang/docs/Intrinsics.md b/flang/docs/Intrinsics.md
index 48e732d0b35a1..31bead9f8bfdc 100644
--- a/flang/docs/Intrinsics.md
+++ b/flang/docs/Intrinsics.md
@@ -1413,3 +1413,45 @@ This is prefixed by `STRING`, a colon and a space.
 - **Standard:** GNU extension
 - **Class:** subroutine
 - **Syntax:** `CALL PERROR(STRING)`
+
+### Non-Standard Intrinsics: SRAND
+
+#### Description
+`SRAND` reinitializes the pseudo-random number generator called by `RAND` and `IRAND`.
+The new seed used by the generator is specified by the required argument `SEED`.
+
+#### Usage and Info
+
+- **Standard:** GNU extension
+- **Class:** Subroutine
+- **Syntax:** `CALL SRAND(SEED)`
+
+### Non-Standard Intrinsics: IRAND
+
+#### Description
+`IRAND(FLAG)` returns a pseudo-random number from a uniform distribution between 0 and a system-dependent limit.
+If `FLAG` is 0, the next number in the current sequence is returned;
+If `FLAG` is 1, the generator is restarted by `CALL SRAND(0)`;
+If `FLAG` has any other value, it is used as a new seed with `SRAND`.
+The return value is of `INTEGER` type of kind 4.
+
+#### Usage and Info
+
+- **Standard:** GNU extension
+- **Class:** function
+- **Syntax:** `RESULT = IRAND(I)`
+
+### Non-Standard Intrinsics: RAND
+
+#### Description
+`RAND(FLAG)` returns a pseudo-random number from a uniform distribution between 0 and 1.
+If `FLAG` is 0, the next number in the current sequence is returned;
+If `FLAG` is 1, the generator is restarted by `CALL SRAND(0)`;
+If `FLAG` has any other value, it is used as a new seed with `SRAND`.
+The return value is of `REAL` type with the default kind.
+
+#### Usage and Info
+
+- **Standard:** GNU extension
+- **Class:** function
+- **Syntax:** `RESULT = RAND(I)`
diff --git a/flang/docs/ReleaseNotes.md b/flang/docs/ReleaseNotes.md
index 6a285f829053b..24122e7a77581 100644
--- a/flang/docs/ReleaseNotes.md
+++ b/flang/docs/ReleaseNotes.md
@@ -1,18 +1,18 @@
 <!-- If you want to modify sections/contents permanently, you should modify both
 ReleaseNotes.md and ReleaseNotesTemplate.txt. -->
 
-# Flang |version| (In-Progress) Release Notes
+# Flang {{version}} {{in_progress}}Release Notes
 
 > **warning**
 >
-> These are in-progress notes for the upcoming LLVM |version| release.
+> These are in-progress notes for the upcoming LLVM {{version}} release.
 > Release notes for previous releases can be found on [the Download
 > Page](https://releases.llvm.org/download.html).
 
 ## Introduction
 
 This document contains the release notes for the Flang Fortran frontend,
-part of the LLVM Compiler Infrastructure, release |version|. Here we
+part of the LLVM Compiler Infrastructure, release {{version}}. Here we
 describe the status of Flang in some detail, including major
 improvements from the previous release and new feature work. For the
 general LLVM release notes, see [the LLVM
@@ -45,7 +45,6 @@ page](https://llvm.org/releases/).
 
 ## New Issues Found
 
-
 ## Additional Information
 
 Flang's documentation is located in the `flang/docs/` directory in the
diff --git a/flang/docs/ReleaseNotesTemplate.txt b/flang/docs/ReleaseNotesTemplate.txt
index 2ccf5472ee234..b607e9a938424 100644
--- a/flang/docs/ReleaseNotesTemplate.txt
+++ b/flang/docs/ReleaseNotesTemplate.txt
@@ -1,18 +1,18 @@
 <!-- If you want to modify sections/contents permanently, you should modify both
 ReleaseNotes.md and ReleaseNotesTemplate.txt. -->
 
-# Flang |version| (In-Progress) Release Notes
+# Flang {{version}} {{in_progress}}Release Notes
 
 > **warning**
 >
-> These are in-progress notes for the upcoming LLVM |version| release.
+> These are in-progress notes for the upcoming LLVM {{version}} release.
 > Release notes for previous releases can be found on [the Download
 > Page](https://releases.llvm.org/download.html).
 
 ## Introduction
 
 This document contains the release notes for the Flang Fortran frontend,
-part of the LLVM Compiler Infrastructure, release |version|. Here we
+part of the LLVM Compiler Infrastructure, release {{version}}. Here we
 describe the status of Flang in some detail, including major
 improvements from the previous release and new feature work. For the
 general LLVM release notes, see [the LLVM
diff --git a/flang/docs/conf.py b/flang/docs/conf.py
index 0942dbf70ff16..6395834a736cb 100644
--- a/flang/docs/conf.py
+++ b/flang/docs/conf.py
@@ -10,6 +10,7 @@
 # serve to show the default.
 
 from datetime import date
+
 # If extensions (or modules to document with autodoc) are in another directory,
 # add these directories to sys.path here. If the directory is relative to the
 # documentation root, use os.path.abspath to make it absolute, like shown here.
@@ -46,6 +47,14 @@
 }
 myst_heading_anchors = 6
 
+# Enable myst's substitution extension since markdown files cannot use the
+# |version| and |release| substitutions available to .rst files.
+myst_enable_extensions = ["substitution"]
+
+# The substitutions to use in markdown files. This contains unconditional
+# substitutions, but more may be added once the configuration is obtained.
+myst_substitutions = {"in_progress": "(In-Progress) " if tags.has("PreRelease") else ""}
+
 import sphinx
 
 # The encoding of source files.
@@ -268,3 +277,22 @@
 
 # How to display URL addresses: 'footnote', 'no', or 'inline'.
 # texinfo_show_urls = 'footnote'
+
+
+# This can be treated as its own sphinx extension. setup() will be called by
+# sphinx. In it, register a function to be called when the configuration has
+# been initialized. The configuration will contain the values of the -D options
+# passed to sphinx-build on the command line.
+#
+# See llvm/cmake/modules/AddSphinxTarget.cmake for details on how sphinx-build
+# is invoked.
+def setup(app):
+    app.connect("config-inited", myst_substitutions_update)
+
+
+# Override the myst_parser substitutions map after the configuration has been
+# initialized.
+def myst_substitutions_update(app, config):
+    config.myst_substitutions.update(
+        {"release": config.release, "version": config.version}
+    )
diff --git a/flang/include/flang/Lower/OpenMP/Clauses.h b/flang/include/flang/Lower/OpenMP/Clauses.h
index 3eff90b95a20d..737b535d604d6 100644
--- a/flang/include/flang/Lower/OpenMP/Clauses.h
+++ b/flang/include/flang/Lower/OpenMP/Clauses.h
@@ -246,7 +246,7 @@ using Initializer = tomp::clause::InitializerT<TypeTy, IdTy, ExprTy>;
 using InReduction = tomp::clause::InReductionT<TypeTy, IdTy, ExprTy>;
 using IsDevicePtr = tomp::clause::IsDevicePtrT<TypeTy, IdTy, ExprTy>;
 using Lastprivate = tomp::clause::LastprivateT<TypeTy, IdTy, ExprTy>;
-using LoopRange = tomp::clause::LoopRangeT<TypeTy, IdTy, ExprTy>;
+using Looprange = tomp::clause::LooprangeT<TypeTy, IdTy, ExprTy>;
 using Linear = tomp::clause::LinearT<TypeTy, IdTy, ExprTy>;
 using Link = tomp::clause::LinkT<TypeTy, IdTy, ExprTy>;
 using Map = tomp::clause::MapT<TypeTy, IdTy, ExprTy>;
diff --git a/flang/include/flang/Optimizer/Builder/CUFCommon.h b/flang/include/flang/Optimizer/Builder/CUFCommon.h
index 98d01958846f7..736f90123969c 100644
--- a/flang/include/flang/Optimizer/Builder/CUFCommon.h
+++ b/flang/include/flang/Optimizer/Builder/CUFCommon.h
@@ -14,7 +14,7 @@
 #include "mlir/IR/BuiltinOps.h"
 
 static constexpr llvm::StringRef cudaDeviceModuleName = "cuda_device_mod";
-static constexpr llvm::StringRef cudaSharedMemSuffix = "__shared_mem";
+static constexpr llvm::StringRef cudaSharedMemSuffix = "__shared_mem__";
 
 namespace fir {
 class FirOpBuilder;
diff --git a/flang/include/flang/Optimizer/Builder/IntrinsicCall.h b/flang/include/flang/Optimizer/Builder/IntrinsicCall.h
index 005a9786e43b9..0ae9177f98fd8 100644
--- a/flang/include/flang/Optimizer/Builder/IntrinsicCall.h
+++ b/flang/include/flang/Optimizer/Builder/IntrinsicCall.h
@@ -330,6 +330,8 @@ struct IntrinsicLibrary {
   fir::ExtendedValue genIndex(mlir::Type, llvm::ArrayRef<fir::ExtendedValue>);
   mlir::Value genIor(mlir::Type, llvm::ArrayRef<mlir::Value>);
   fir::ExtendedValue genIparity(mlir::Type, llvm::ArrayRef<fir::ExtendedValue>);
+  fir::ExtendedValue genIrand(mlir::Type resultType,
+                              llvm::ArrayRef<fir::ExtendedValue>);
   fir::ExtendedValue genIsContiguous(mlir::Type,
                                      llvm::ArrayRef<fir::ExtendedValue>);
   template <Fortran::runtime::io::Iostat value>
@@ -377,6 +379,8 @@ struct IntrinsicLibrary {
   fir::ExtendedValue genProduct(mlir::Type, llvm::ArrayRef<fir::ExtendedValue>);
   fir::ExtendedValue genPutenv(std::optional<mlir::Type>,
                                llvm::ArrayRef<fir::ExtendedValue>);
+  fir::ExtendedValue genRand(mlir::Type resultType,
+                             llvm::ArrayRef<fir::ExtendedValue>);
   void genRandomInit(llvm::ArrayRef<fir::ExtendedValue>);
   void genRandomNumber(llvm::ArrayRef<fir::ExtendedValue>);
   void genRandomSeed(llvm::ArrayRef<fir::ExtendedValue>);
diff --git a/flang/include/flang/Optimizer/Builder/Runtime/Intrinsics.h b/flang/include/flang/Optimizer/Builder/Runtime/Intrinsics.h
index 5121ccce921c6..30c3189366cec 100644
--- a/flang/include/flang/Optimizer/Builder/Runtime/Intrinsics.h
+++ b/flang/include/flang/Optimizer/Builder/Runtime/Intrinsics.h
@@ -111,6 +111,12 @@ void genSleep(fir::FirOpBuilder &builder, mlir::Location loc,
 mlir::Value genChdir(fir::FirOpBuilder &builder, mlir::Location loc,
                      mlir::Value name);
 
+
+mlir::Value genIrand(fir::FirOpBuilder &builder, mlir::Location loc,
+                     mlir::Value i);
+mlir::Value genRand(fir::FirOpBuilder &builder, mlir::Location loc,
+                    mlir::Value i);
+
 } // namespace runtime
 } // namespace fir
 
diff --git a/flang/include/flang/Optimizer/Dialect/CUF/CUFOps.td b/flang/include/flang/Optimizer/Dialect/CUF/CUFOps.td
index 07bb47e26b968..920bef99dc996 100644
--- a/flang/include/flang/Optimizer/Dialect/CUF/CUFOps.td
+++ b/flang/include/flang/Optimizer/Dialect/CUF/CUFOps.td
@@ -350,15 +350,15 @@ def cuf_SharedMemoryOp
   let arguments = (ins TypeAttr:$in_type, OptionalAttr<StrAttr>:$uniq_name,
       OptionalAttr<StrAttr>:$bindc_name, Variadic<AnyIntegerType>:$typeparams,
       Variadic<AnyIntegerType>:$shape,
-      Optional<AnyIntegerType>:$offset // offset in bytes from the shared memory
-                                       // base address.
-  );
+      // offset in bytes from the shared memory base address.
+      Optional<AnyIntegerType>:$offset, OptionalAttr<I64Attr>:$alignment,
+      UnitAttr:$isStatic);
 
   let results = (outs fir_ReferenceType:$ptr);
 
   let assemblyFormat = [{
       (`[` $offset^ `:` type($offset) `]`)? $in_type (`(` $typeparams^ `:` type($typeparams) `)`)?
-        (`,` $shape^ `:` type($shape) )?  attr-dict `->` qualified(type($ptr))
+        (`,` $shape^ `:` type($shape) )?  (`align` $alignment^ )? attr-dict `->` qualified(type($ptr))
   }];
 
   let builders = [OpBuilder<(ins "mlir::Type":$inType,
diff --git a/flang/include/flang/Optimizer/Dialect/FIROps.td b/flang/include/flang/Optimizer/Dialect/FIROps.td
index 5d16b9816e318..cfce9fca504ec 100644
--- a/flang/include/flang/Optimizer/Dialect/FIROps.td
+++ b/flang/include/flang/Optimizer/Dialect/FIROps.td
@@ -3107,9 +3107,12 @@ def fir_TypeInfoOp : fir_Op<"type_info",
     between method identifiers and corresponding `FuncOp` symbols.
     The ordering of associations in the map is determined by the front end.
 
-    The "no_init" flag indicates that this type has no components requiring default
-    initialization (including setting allocatable component to a clean deallocated
-    state).
+    The "abstract" flag indicates that this type is an ABSTRACT derived type and
+    that it cannot be instantiated.
+
+    The "no_init" flag indicates that this type has no components requiring
+    default initialization (including setting allocatable component to a clean
+    deallocated state).
 
     The "no_destroy" flag indicates that there are no allocatable components
     that require deallocation.
@@ -3118,7 +3121,8 @@ def fir_TypeInfoOp : fir_Op<"type_info",
     for its parents ,or for components.
 
     ```
-      fir.type_info @_QMquuzTfoo noinit nofinal : !fir.type<_QMquuzTfoo{i:i32}> dispatch_table {
+      fir.type_info @_QMquuzTfoo abstract noinit nofinal
+        : !fir.type<_QMquuzTfoo{i:i32}> dispatch_table {
         fir.dt_entry method1, @_QFNMquuzTfooPmethod1AfooR
         fir.dt_entry method2, @_QFNMquuzTfooPmethod2AfooII
       }
@@ -3129,6 +3133,7 @@ def fir_TypeInfoOp : fir_Op<"type_info",
     SymbolNameAttr:$sym_name,
     TypeAttr:$type,
     OptionalAttr<TypeAttr>:$parent_type,
+    UnitAttr:$abstract,
     UnitAttr:$no_init,
     UnitAttr:$no_destroy,
     UnitAttr:$no_final
@@ -3147,8 +3152,9 @@ def fir_TypeInfoOp : fir_Op<"type_info",
   ];
 
   let assemblyFormat = [{
-    $sym_name (`noinit` $no_init^)? (`nodestroy` $no_destroy^)?
-    (`nofinal` $no_final^)? (`extends` $parent_type^)? attr-dict `:` $type
+    $sym_name (`abstract` $abstract^)? (`noinit` $no_init^)?
+    (`nodestroy` $no_destroy^)? (`nofinal` $no_final^)?
+    (`extends` $parent_type^)? attr-dict `:` $type
     (`dispatch_table` $dispatch_table^)?
     (`component_info` $component_info^)?
   }];
@@ -3174,23 +3180,34 @@ def fir_DTEntryOp : fir_Op<"dt_entry", [HasParent<"TypeInfoOp">]> {
   let summary = "map entry in a dispatch table";
 
   let description = [{
-    An entry in a dispatch table.  Allows a function symbol to be bound
-    to a specifier method identifier.  A dispatch operation uses the dynamic
+    An entry in a dispatch table. Allows a function symbol to be bound
+    to a specific method identifier. A dispatch operation uses the dynamic
     type of a distinguished argument to determine an exact dispatch table
     and uses the method identifier to select the type-bound procedure to
     be called.
 
+    The optional "deferred" flag indicates that the binding is a DEFERRED
+    type-bound procedure (declared but without an implementation at this
+    type level).
+
     ```
+      // Non-deferred binding
       fir.dt_entry method_name, @uniquedProcedure
+
+      // Deferred binding
+      fir.dt_entry method_name, @uniquedProcedure deferred
     ```
   }];
 
-  let arguments = (ins StrAttr:$method, SymbolRefAttr:$proc);
+  let arguments = (ins StrAttr:$method, SymbolRefAttr:$proc, UnitAttr:$deferred);
 
   let hasCustomAssemblyFormat = 1;
 
   let extraClassDeclaration = [{
     static constexpr llvm::StringRef getProcAttrNameStr() { return "proc"; }
+    static constexpr llvm::StringRef getDeferredAttrNameStr() {
+      return "deferred";
+    }
   }];
 }
 
diff --git a/flang/include/flang/Optimizer/OpenACC/Support/FIROpenACCTypeInterfaces.h b/flang/include/flang/Optimizer/OpenACC/Support/FIROpenACCTypeInterfaces.h
index 3167c554abbdd..0f133623475f8 100644
--- a/flang/include/flang/Optimizer/OpenACC/Support/FIROpenACCTypeInterfaces.h
+++ b/flang/include/flang/Optimizer/OpenACC/Support/FIROpenACCTypeInterfaces.h
@@ -43,6 +43,15 @@ struct OpenACCPointerLikeModel
                mlir::TypedValue<mlir::acc::PointerLikeType> destination,
                mlir::TypedValue<mlir::acc::PointerLikeType> source,
                mlir::Type varType) const;
+
+  mlir::Value genLoad(mlir::Type pointer, mlir::OpBuilder &builder,
+                      mlir::Location loc,
+                      mlir::TypedValue<mlir::acc::PointerLikeType> srcPtr,
+                      mlir::Type valueType) const;
+
+  bool genStore(mlir::Type pointer, mlir::OpBuilder &builder,
+                mlir::Location loc, mlir::Value valueToStore,
+                mlir::TypedValue<mlir::acc::PointerLikeType> destPtr) const;
 };
 
 template <typename T>
diff --git a/flang/include/flang/Optimizer/Transforms/CUDA/CUFAllocationConversion.h b/flang/include/flang/Optimizer/Transforms/CUDA/CUFAllocationConversion.h
new file mode 100644
index 0000000000000..2a4eb1cdb27f0
--- /dev/null
+++ b/flang/include/flang/Optimizer/Transforms/CUDA/CUFAllocationConversion.h
@@ -0,0 +1,33 @@
+//===------- CUFAllocationConversion.h -------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef FORTRAN_OPTIMIZER_TRANSFORMS_CUDA_CUFALLOCATIONCONVERSION_H_
+#define FORTRAN_OPTIMIZER_TRANSFORMS_CUDA_CUFALLOCATIONCONVERSION_H_
+
+#include "mlir/Pass/Pass.h"
+#include "mlir/Pass/PassRegistry.h"
+
+namespace fir {
+class LLVMTypeConverter;
+}
+
+namespace mlir {
+class DataLayout;
+class SymbolTable;
+} // namespace mlir
+
+namespace cuf {
+
+/// Patterns that convert allocation-related CUF operations to runtime calls.
+void populateCUFAllocationConversionPatterns(
+    const fir::LLVMTypeConverter &converter, mlir::DataLayout &dl,
+    const mlir::SymbolTable &symtab, mlir::RewritePatternSet &patterns);
+
+} // namespace cuf
+
+#endif // FORTRAN_OPTIMIZER_TRANSFORMS_CUDA_CUFALLOCATIONCONVERSION_H_
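
For orientation only: this header just declares the pattern-population entry point, and the pass itself is added via Passes.td below. A rough sketch of how a conversion pass could wire it up follows; the function name and legality choices are illustrative assumptions, and the fir::LLVMTypeConverter would be constructed the same way CUFOpConversion already does (elided here).

// Sketch only: gather the patterns and run a partial conversion on the module.
void convertAllocations(mlir::ModuleOp mod,
                        const fir::LLVMTypeConverter &converter,
                        mlir::DataLayout &dl) {
  mlir::SymbolTable symtab(mod);
  mlir::RewritePatternSet patterns(mod.getContext());
  cuf::populateCUFAllocationConversionPatterns(converter, dl, symtab, patterns);

  mlir::ConversionTarget target(*mod.getContext());
  target.addLegalDialect<fir::FIROpsDialect>(); // rewrites emit FIR runtime calls
  if (mlir::failed(
          mlir::applyPartialConversion(mod, target, std::move(patterns))))
    mod.emitError("CUF allocation conversion failed");
}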
diff --git a/flang/include/flang/Optimizer/Transforms/Passes.td b/flang/include/flang/Optimizer/Transforms/Passes.td
index f5403ab6ff503..f50202784e2dc 100644
--- a/flang/include/flang/Optimizer/Transforms/Passes.td
+++ b/flang/include/flang/Optimizer/Transforms/Passes.td
@@ -470,11 +470,19 @@ def AssumedRankOpConversion : Pass<"fir-assumed-rank-op", "mlir::ModuleOp"> {
   ];
 }
 
+def CUFAllocationConversion : Pass<"cuf-allocation-convert", "mlir::ModuleOp"> {
+  let summary = "Convert allocation related CUF operations to runtime calls";
+  let dependentDialects = ["fir::FIROpsDialect"];
+}
+
 def CUFOpConversion : Pass<"cuf-convert", "mlir::ModuleOp"> {
   let summary = "Convert some CUF operations to runtime calls";
   let dependentDialects = [
     "fir::FIROpsDialect", "mlir::gpu::GPUDialect", "mlir::DLTIDialect"
   ];
+  let options = [Option<"allocationConversion", "allocation-conversion", "bool",
+                        /*default=*/"true",
+                        "Convert allocation related operation with this pass">];
 }
 
 def CUFDeviceGlobal :
diff --git a/flang/include/flang/Parser/dump-parse-tree.h b/flang/include/flang/Parser/dump-parse-tree.h
index f460e61fbb915..58fa48e9f04c3 100644
--- a/flang/include/flang/Parser/dump-parse-tree.h
+++ b/flang/include/flang/Parser/dump-parse-tree.h
@@ -229,6 +229,8 @@ class ParseTreeDumper {
   NODE(CompilerDirective, NoInline)
   NODE(CompilerDirective, Unrecognized)
   NODE(CompilerDirective, VectorAlways)
+  NODE_ENUM(CompilerDirective::VectorLength, VectorLength::Kind)
+  NODE(CompilerDirective, VectorLength)
   NODE(CompilerDirective, Unroll)
   NODE(CompilerDirective, UnrollAndJam)
   NODE(CompilerDirective, NoVector)
@@ -643,7 +645,7 @@ class ParseTreeDumper {
   NODE_ENUM(OmpLinearModifier, Value)
   NODE(parser, OmpLocator)
   NODE(parser, OmpLocatorList)
-  NODE(parser, OmpLoopRangeClause)
+  NODE(parser, OmpLooprangeClause)
   NODE(parser, OmpMapClause)
   NODE(OmpMapClause, Modifier)
   NODE(parser, OmpMapper)
diff --git a/flang/include/flang/Parser/openmp-utils.h b/flang/include/flang/Parser/openmp-utils.h
index b72164e6cef4b..b7d990c9e75d6 100644
--- a/flang/include/flang/Parser/openmp-utils.h
+++ b/flang/include/flang/Parser/openmp-utils.h
@@ -67,17 +67,7 @@ struct DirectiveNameScope {
   template <typename T>
   static OmpDirectiveName GetOmpDirectiveName(const T &x) {
     if constexpr (WrapperTrait<T>) {
-      if constexpr (std::is_same_v<T, OpenMPCancelConstruct> ||
-          std::is_same_v<T, OpenMPCancellationPointConstruct> ||
-          std::is_same_v<T, OpenMPDepobjConstruct> ||
-          std::is_same_v<T, OpenMPFlushConstruct> ||
-          std::is_same_v<T, OpenMPInteropConstruct> ||
-          std::is_same_v<T, OpenMPSimpleStandaloneConstruct> ||
-          std::is_same_v<T, OpenMPGroupprivate>) {
-        return x.v.DirName();
-      } else {
-        return GetOmpDirectiveName(x.v);
-      }
+      return GetOmpDirectiveName(x.v);
     } else if constexpr (TupleTrait<T>) {
       if constexpr (std::is_base_of_v<OmpBlockConstruct, T>) {
         return std::get<OmpBeginDirective>(x.t).DirName();
diff --git a/flang/include/flang/Parser/parse-tree.h b/flang/include/flang/Parser/parse-tree.h
index dd928e1244a2f..cbfcaafa78dc0 100644
--- a/flang/include/flang/Parser/parse-tree.h
+++ b/flang/include/flang/Parser/parse-tree.h
@@ -3384,6 +3384,12 @@ struct CompilerDirective {
     std::tuple<common::Indirection<Designator>, uint64_t> t;
   };
   EMPTY_CLASS(VectorAlways);
+  struct VectorLength {
+    TUPLE_CLASS_BOILERPLATE(VectorLength);
+    ENUM_CLASS(Kind, Auto, Fixed, Scalable);
+
+    std::tuple<std::uint64_t, Kind> t;
+  };
   struct NameValue {
     TUPLE_CLASS_BOILERPLATE(NameValue);
     std::tuple<Name, std::optional<std::uint64_t>> t;
@@ -3408,9 +3414,9 @@ struct CompilerDirective {
   EMPTY_CLASS(Unrecognized);
   CharBlock source;
   std::variant<std::list<IgnoreTKR>, LoopCount, std::list<AssumeAligned>,
-      VectorAlways, std::list<NameValue>, Unroll, UnrollAndJam, Unrecognized,
-      NoVector, NoUnroll, NoUnrollAndJam, ForceInline, Inline, NoInline,
-      Prefetch, IVDep>
+      VectorAlways, VectorLength, std::list<NameValue>, Unroll, UnrollAndJam,
+      Unrecognized, NoVector, NoUnroll, NoUnrollAndJam, ForceInline, Inline,
+      NoInline, Prefetch, IVDep>
       u;
 };
 
@@ -4666,10 +4672,10 @@ struct OmpLinearClause {
 
 // Ref: [6.0:207-208]
 //
-// loop-range-clause ->
+// looprange-clause ->
 //    LOOPRANGE(first, count)                       // since 6.0
-struct OmpLoopRangeClause {
-  TUPLE_CLASS_BOILERPLATE(OmpLoopRangeClause);
+struct OmpLooprangeClause {
+  TUPLE_CLASS_BOILERPLATE(OmpLooprangeClause);
   std::tuple<ScalarIntConstantExpr, ScalarIntConstantExpr> t;
 };
 
diff --git a/flang/include/flang/Runtime/extensions.h b/flang/include/flang/Runtime/extensions.h
index 8db68eb9c245c..f2765a5987ea1 100644
--- a/flang/include/flang/Runtime/extensions.h
+++ b/flang/include/flang/Runtime/extensions.h
@@ -102,5 +102,14 @@ int FORTRAN_PROCEDURE_NAME(mclock)();
 float FORTRAN_PROCEDURE_NAME(secnds)(float *refTime);
 float RTNAME(Secnds)(float *refTime, const char *sourceFile, int line);
 
+// GNU extension function IRAND(I)
+int RTNAME(Irand)(int *i);
+
+// GNU extension function RAND(I)
+float RTNAME(Rand)(int *i, const char *sourceFile, int line);
+
+// GNU extension subroutine SRAND(SEED)
+void FORTRAN_PROCEDURE_NAME(srand)(int *seed);
+
 } // extern "C"
 #endif // FORTRAN_RUNTIME_EXTENSIONS_H_
diff --git a/flang/lib/Evaluate/fold-real.cpp b/flang/lib/Evaluate/fold-real.cpp
index 225e3402fd1ad..1ff941053a82e 100644
--- a/flang/lib/Evaluate/fold-real.cpp
+++ b/flang/lib/Evaluate/fold-real.cpp
@@ -425,8 +425,14 @@ Expr<Type<TypeCategory::Real, KIND>> FoldIntrinsicFunction(
             [](const Scalar<T> &x) -> Scalar<T> { return x.SPACING(); }));
   } else if (name == "sqrt") {
     return FoldElementalIntrinsic<T, T>(context, std::move(funcRef),
-        ScalarFunc<T, T>(
-            [](const Scalar<T> &x) -> Scalar<T> { return x.SQRT().value; }));
+        ScalarFunc<T, T>([&context](const Scalar<T> &x) -> Scalar<T> {
+          ValueWithRealFlags<Scalar<T>> result{x.SQRT()};
+          if (result.flags.test(RealFlag::InvalidArgument)) {
+            context.Warn(common::UsageWarning::FoldingValueChecks,
+                "Invalid argument to SQRT()"_warn_en_US);
+          }
+          return result.value;
+        }));
   } else if (name == "sum") {
     return FoldSum<T>(context, std::move(funcRef));
   } else if (name == "tiny") {
diff --git a/flang/lib/Evaluate/intrinsics.cpp b/flang/lib/Evaluate/intrinsics.cpp
index 2ba28a7ea752e..bbcb766274e7f 100644
--- a/flang/lib/Evaluate/intrinsics.cpp
+++ b/flang/lib/Evaluate/intrinsics.cpp
@@ -654,6 +654,10 @@ static const IntrinsicInterface genericIntrinsicFunction[]{
         {{"i", OperandUnsigned}, {"j", OperandUnsigned, Rank::elementalOrBOZ}},
         OperandUnsigned},
     {"ior", {{"i", BOZ}, {"j", SameIntOrUnsigned}}, SameIntOrUnsigned},
+    {"irand",
+        {{"i", TypePattern{IntType, KindCode::exactKind, 4}, Rank::scalar,
+            Optionality::optional}},
+        TypePattern{IntType, KindCode::exactKind, 4}, Rank::scalar},
     {"ishft", {{"i", SameIntOrUnsigned}, {"shift", AnyInt}}, SameIntOrUnsigned},
     {"ishftc",
         {{"i", SameIntOrUnsigned}, {"shift", AnyInt},
@@ -872,6 +876,10 @@ static const IntrinsicInterface genericIntrinsicFunction[]{
             common::Intent::In,
             {ArgFlag::canBeMoldNull, ArgFlag::onlyConstantInquiry}}},
         DefaultInt, Rank::scalar, IntrinsicClass::inquiryFunction},
+    {"rand",
+        {{"i", TypePattern{IntType, KindCode::exactKind, 4}, Rank::scalar,
+            Optionality::optional}},
+        TypePattern{RealType, KindCode::exactKind, 4}, Rank::scalar},
     {"range",
         {{"x", AnyNumeric, Rank::anyOrAssumedRank, Optionality::required,
             common::Intent::In,
diff --git a/flang/lib/Evaluate/tools.cpp b/flang/lib/Evaluate/tools.cpp
index 117b2249a9179..a0035ae330e35 100644
--- a/flang/lib/Evaluate/tools.cpp
+++ b/flang/lib/Evaluate/tools.cpp
@@ -63,7 +63,11 @@ Expr<SomeType> Parenthesize(Expr<SomeType> &&expr) {
 
 std::optional<DataRef> ExtractDataRef(
     const ActualArgument &arg, bool intoSubstring, bool intoComplexPart) {
-  return ExtractDataRef(arg.UnwrapExpr(), intoSubstring, intoComplexPart);
+  if (const Symbol *assumedType{arg.GetAssumedTypeDummy()}) {
+    return DataRef{*assumedType};
+  } else {
+    return ExtractDataRef(arg.UnwrapExpr(), intoSubstring, intoComplexPart);
+  }
 }
 
 std::optional<DataRef> ExtractSubstringBase(const Substring &substring) {
diff --git a/flang/lib/Lower/Bridge.cpp b/flang/lib/Lower/Bridge.cpp
index 6f9dc32297272..d175e2a8a73cb 100644
--- a/flang/lib/Lower/Bridge.cpp
+++ b/flang/lib/Lower/Bridge.cpp
@@ -307,7 +307,11 @@ class TypeInfoConverter {
     if (!insertPointIfCreated.isSet())
       return; // fir.type_info was already built in a previous call.
 
-    // Set init, destroy, and nofinal attributes.
+    // Set abstract, init, destroy, and nofinal attributes.
+    const Fortran::semantics::Symbol &dtSymbol = info.typeSpec.typeSymbol();
+    if (dtSymbol.attrs().test(Fortran::semantics::Attr::ABSTRACT))
+      dt->setAttr(dt.getAbstractAttrName(), builder.getUnitAttr());
+
     if (!info.typeSpec.HasDefaultInitialization(/*ignoreAllocatable=*/false,
                                                 /*ignorePointer=*/false))
       dt->setAttr(dt.getNoInitAttrName(), builder.getUnitAttr());
@@ -331,10 +335,14 @@ class TypeInfoConverter {
         if (details.numPrivatesNotOverridden() > 0)
           tbpName += "."s + std::to_string(details.numPrivatesNotOverridden());
         std::string bindingName = converter.mangleName(details.symbol());
-        fir::DTEntryOp::create(
+        auto dtEntry = fir::DTEntryOp::create(
             builder, info.loc,
             mlir::StringAttr::get(builder.getContext(), tbpName),
             mlir::SymbolRefAttr::get(builder.getContext(), bindingName));
+        // Propagate DEFERRED attribute on the binding to fir.dt_entry.
+        if (binding.get().attrs().test(Fortran::semantics::Attr::DEFERRED))
+          dtEntry->setAttr(fir::DTEntryOp::getDeferredAttrNameStr(),
+                           builder.getUnitAttr());
       }
       fir::FirEndOp::create(builder, info.loc);
     }
@@ -2576,12 +2584,16 @@ class FirConverter : public Fortran::lower::AbstractConverter {
 
   // Enabling loop vectorization attribute.
   mlir::LLVM::LoopVectorizeAttr
-  genLoopVectorizeAttr(mlir::BoolAttr disableAttr) {
+  genLoopVectorizeAttr(mlir::BoolAttr disableAttr,
+                       mlir::BoolAttr scalableEnable,
+                       mlir::IntegerAttr vectorWidth) {
     mlir::LLVM::LoopVectorizeAttr va;
     if (disableAttr)
-      va = mlir::LLVM::LoopVectorizeAttr::get(builder->getContext(),
-                                              /*disable=*/disableAttr, {}, {},
-                                              {}, {}, {}, {});
+      va = mlir::LLVM::LoopVectorizeAttr::get(
+          builder->getContext(),
+          /*disable=*/disableAttr, /*predicate=*/{},
+          /*scalableEnable=*/scalableEnable,
+          /*vectorWidth=*/vectorWidth, {}, {}, {});
     return va;
   }
 
@@ -2589,6 +2601,8 @@ class FirConverter : public Fortran::lower::AbstractConverter {
       IncrementLoopInfo &info,
       llvm::SmallVectorImpl<const Fortran::parser::CompilerDirective *> &dirs) {
     mlir::BoolAttr disableVecAttr;
+    mlir::BoolAttr scalableEnable;
+    mlir::IntegerAttr vectorWidth;
     mlir::LLVM::LoopUnrollAttr ua;
     mlir::LLVM::LoopUnrollAndJamAttr uja;
     llvm::SmallVector<mlir::LLVM::AccessGroupAttr> aga;
@@ -2601,6 +2615,30 @@ class FirConverter : public Fortran::lower::AbstractConverter {
                     mlir::BoolAttr::get(builder->getContext(), false);
                 has_attrs = true;
               },
+              [&](const Fortran::parser::CompilerDirective::VectorLength &vl) {
+                using Kind =
+                    Fortran::parser::CompilerDirective::VectorLength::Kind;
+                Kind kind = std::get<Kind>(vl.t);
+                uint64_t length = std::get<uint64_t>(vl.t);
+                disableVecAttr =
+                    mlir::BoolAttr::get(builder->getContext(), false);
+                if (length != 0)
+                  vectorWidth =
+                      builder->getIntegerAttr(builder->getI64Type(), length);
+                switch (kind) {
+                case Kind::Scalable:
+                  scalableEnable =
+                      mlir::BoolAttr::get(builder->getContext(), true);
+                  break;
+                case Kind::Fixed:
+                  scalableEnable =
+                      mlir::BoolAttr::get(builder->getContext(), false);
+                  break;
+                case Kind::Auto:
+                  break;
+                }
+                has_attrs = true;
+              },
               [&](const Fortran::parser::CompilerDirective::Unroll &u) {
                 ua = genLoopUnrollAttr(u.v);
                 has_attrs = true;
@@ -2632,7 +2670,8 @@ class FirConverter : public Fortran::lower::AbstractConverter {
               [&](const auto &) {}},
           dir->u);
     }
-    mlir::LLVM::LoopVectorizeAttr va = genLoopVectorizeAttr(disableVecAttr);
+    mlir::LLVM::LoopVectorizeAttr va =
+        genLoopVectorizeAttr(disableVecAttr, scalableEnable, vectorWidth);
     mlir::LLVM::LoopAnnotationAttr la = mlir::LLVM::LoopAnnotationAttr::get(
         builder->getContext(), {}, /*vectorize=*/va, {}, /*unroll*/ ua,
         /*unroll_and_jam*/ uja, {}, {}, {}, {}, {}, {}, {}, {}, {},
@@ -3339,6 +3378,9 @@ class FirConverter : public Fortran::lower::AbstractConverter {
             [&](const Fortran::parser::CompilerDirective::VectorAlways &) {
               attachDirectiveToLoop(dir, &eval);
             },
+            [&](const Fortran::parser::CompilerDirective::VectorLength &) {
+              attachDirectiveToLoop(dir, &eval);
+            },
             [&](const Fortran::parser::CompilerDirective::Unroll &) {
               attachDirectiveToLoop(dir, &eval);
             },
diff --git a/flang/lib/Lower/OpenMP/Clauses.cpp b/flang/lib/Lower/OpenMP/Clauses.cpp
index dc49a8118b0a5..61430fceafe2a 100644
--- a/flang/lib/Lower/OpenMP/Clauses.cpp
+++ b/flang/lib/Lower/OpenMP/Clauses.cpp
@@ -1076,7 +1076,7 @@ Link make(const parser::OmpClause::Link &inp,
   return Link{/*List=*/makeObjects(inp.v, semaCtx)};
 }
 
-LoopRange make(const parser::OmpClause::Looprange &inp,
+Looprange make(const parser::OmpClause::Looprange &inp,
                semantics::SemanticsContext &semaCtx) {
   llvm_unreachable("Unimplemented: looprange");
 }
diff --git a/flang/lib/Lower/Runtime.cpp b/flang/lib/Lower/Runtime.cpp
index 9ff6157f7487d..5f8586b9c8a88 100644
--- a/flang/lib/Lower/Runtime.cpp
+++ b/flang/lib/Lower/Runtime.cpp
@@ -169,12 +169,55 @@ void Fortran::lower::genUnlockStatement(
 
 void Fortran::lower::genPauseStatement(
     Fortran::lower::AbstractConverter &converter,
-    const Fortran::parser::PauseStmt &) {
+    const Fortran::parser::PauseStmt &stmt) {
+
   fir::FirOpBuilder &builder = converter.getFirOpBuilder();
   mlir::Location loc = converter.getCurrentLocation();
-  mlir::func::FuncOp callee =
-      fir::runtime::getRuntimeFunc<mkRTKey(PauseStatement)>(loc, builder);
-  fir::CallOp::create(builder, loc, callee, mlir::ValueRange{});
+  Fortran::lower::StatementContext stmtCtx;
+
+  llvm::SmallVector<mlir::Value> operands;
+  mlir::func::FuncOp callee;
+  mlir::FunctionType calleeType;
+
+  if (stmt.v.has_value()) {
+    const auto &code = stmt.v.value();
+    auto expr =
+        converter.genExprValue(*Fortran::semantics::GetExpr(code), stmtCtx);
+    expr.match(
+        // Character-valued expression -> call PauseStatementText (CHAR, LEN)
+        [&](const fir::CharBoxValue &x) {
+          callee = fir::runtime::getRuntimeFunc<mkRTKey(PauseStatementText)>(
+              loc, builder);
+          calleeType = callee.getFunctionType();
+
+          operands.push_back(
+              builder.createConvert(loc, calleeType.getInput(0), x.getAddr()));
+          operands.push_back(
+              builder.createConvert(loc, calleeType.getInput(1), x.getLen()));
+        },
+        // Unboxed value -> call PauseStatementInt which accepts an integer.
+        [&](fir::UnboxedValue x) {
+          callee = fir::runtime::getRuntimeFunc<mkRTKey(PauseStatementInt)>(
+              loc, builder);
+          calleeType = callee.getFunctionType();
+          assert(calleeType.getNumInputs() >= 1);
+          mlir::Value cast =
+              builder.createConvert(loc, calleeType.getInput(0), x);
+          operands.push_back(cast);
+        },
+        [&](auto) {
+          fir::emitFatalError(loc, "unhandled expression in PAUSE");
+        });
+  } else {
+    callee =
+        fir::runtime::getRuntimeFunc<mkRTKey(PauseStatement)>(loc, builder);
+    calleeType = callee.getFunctionType();
+  }
+
+  fir::CallOp::create(builder, loc, callee, operands);
+
+  // NOTE: PAUSE does not terminate the current block. The program may resume
+  // and continue normal execution, so we do not emit control-flow terminators.
 }
 
 void Fortran::lower::genPointerAssociate(fir::FirOpBuilder &builder,
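An illustrative Fortran use of the three lowering paths added above (PauseStatement, PauseStatementInt, PauseStatementText); PAUSE is a deleted feature retained as an extension, so this is only a sketch, not taken from the patch's tests:

    program pause_demo
      pause                 ! no stop-code      -> PauseStatement()
      pause 42              ! integer stop-code -> PauseStatementInt
      pause 'hit RETURN'    ! character code    -> PauseStatementText(addr, len)
    end program pause_demo
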
diff --git a/flang/lib/Optimizer/Builder/CUDAIntrinsicCall.cpp b/flang/lib/Optimizer/Builder/CUDAIntrinsicCall.cpp
index 270037f5fcb00..ae6120826f8d2 100644
--- a/flang/lib/Optimizer/Builder/CUDAIntrinsicCall.cpp
+++ b/flang/lib/Optimizer/Builder/CUDAIntrinsicCall.cpp
@@ -17,6 +17,8 @@
 #include "flang/Evaluate/common.h"
 #include "flang/Optimizer/Builder/FIRBuilder.h"
 #include "flang/Optimizer/Builder/MutableBox.h"
+#include "flang/Optimizer/Dialect/CUF/CUFOps.h"
+#include "flang/Optimizer/HLFIR/HLFIROps.h"
 #include "mlir/Dialect/Index/IR/IndexOps.h"
 #include "mlir/Dialect/SCF/IR/SCF.h"
 #include "mlir/Dialect/Vector/IR/VectorOps.h"
@@ -1489,6 +1491,13 @@ void CUDAIntrinsicLibrary::genTMABulkG2S(
       builder, loc, dst, src, barrier, fir::getBase(args[3]), {}, {});
 }
 
+static void setAlignment(mlir::Value ptr, unsigned alignment) {
+  if (auto declareOp = mlir::dyn_cast<hlfir::DeclareOp>(ptr.getDefiningOp()))
+    if (auto sharedOp = mlir::dyn_cast<cuf::SharedMemoryOp>(
+            declareOp.getMemref().getDefiningOp()))
+      sharedOp.setAlignment(alignment);
+}
+
 static void genTMABulkLoad(fir::FirOpBuilder &builder, mlir::Location loc,
                            mlir::Value barrier, mlir::Value src,
                            mlir::Value dst, mlir::Value nelem,
@@ -1496,8 +1505,11 @@ static void genTMABulkLoad(fir::FirOpBuilder &builder, mlir::Location loc,
   mlir::Value size = mlir::arith::MulIOp::create(builder, loc, nelem, eleSize);
   auto llvmPtrTy = mlir::LLVM::LLVMPointerType::get(builder.getContext());
   barrier = builder.createConvert(loc, llvmPtrTy, barrier);
-  dst = builder.createConvert(loc, llvmPtrTy, dst);
-  src = builder.createConvert(loc, llvmPtrTy, src);
+  setAlignment(dst, 16);
+  dst = convertPtrToNVVMSpace(builder, loc, dst,
+                              mlir::NVVM::NVVMMemorySpace::Shared);
+  src = convertPtrToNVVMSpace(builder, loc, src,
+                              mlir::NVVM::NVVMMemorySpace::Shared);
   mlir::NVVM::InlinePtxOp::create(
       builder, loc, mlir::TypeRange{}, {dst, src, size, barrier}, {},
       "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], "
diff --git a/flang/lib/Optimizer/Builder/IntrinsicCall.cpp b/flang/lib/Optimizer/Builder/IntrinsicCall.cpp
index f78afd9a21a4d..3619e5bb942db 100644
--- a/flang/lib/Optimizer/Builder/IntrinsicCall.cpp
+++ b/flang/lib/Optimizer/Builder/IntrinsicCall.cpp
@@ -499,6 +499,10 @@ static constexpr IntrinsicHandler handlers[]{
        {"dim", asValue},
        {"mask", asBox, handleDynamicOptional}}},
      /*isElemental=*/false},
+    {"irand",
+     &I::genIrand,
+     {{{"i", asAddr, handleDynamicOptional}}},
+     /*isElemental=*/false},
     {"is_contiguous",
      &I::genIsContiguous,
      {{{"array", asBox}}},
@@ -625,6 +629,10 @@ static constexpr IntrinsicHandler handlers[]{
      &I::genPutenv,
      {{{"str", asAddr}, {"status", asAddr, handleDynamicOptional}}},
      /*isElemental=*/false},
+    {"rand",
+     &I::genRand,
+     {{{"i", asAddr, handleDynamicOptional}}},
+     /*isElemental=*/false},
     {"random_init",
      &I::genRandomInit,
      {{{"repeatable", asValue}, {"image_distinct", asValue}}},
@@ -6158,6 +6166,20 @@ IntrinsicLibrary::genIparity(mlir::Type resultType,
                       "IPARITY", resultType, args);
 }
 
+// IRAND
+fir::ExtendedValue
+IntrinsicLibrary::genIrand(mlir::Type resultType,
+                           llvm::ArrayRef<fir::ExtendedValue> args) {
+  assert(args.size() == 1);
+  mlir::Value i =
+      isStaticallyPresent(args[0])
+          ? fir::getBase(args[0])
+          : fir::AbsentOp::create(builder, loc,
+                                  builder.getRefType(builder.getI32Type()))
+                .getResult();
+  return fir::runtime::genIrand(builder, loc, i);
+}
+
 // IS_CONTIGUOUS
 fir::ExtendedValue
 IntrinsicLibrary::genIsContiguous(mlir::Type resultType,
@@ -7184,6 +7206,19 @@ IntrinsicLibrary::genPutenv(std::optional<mlir::Type> resultType,
   return {};
 }
 
+// RAND
+fir::ExtendedValue
+IntrinsicLibrary::genRand(mlir::Type, llvm::ArrayRef<fir::ExtendedValue> args) {
+  assert(args.size() == 1);
+  mlir::Value i =
+      isStaticallyPresent(args[0])
+          ? fir::getBase(args[0])
+          : fir::AbsentOp::create(builder, loc,
+                                  builder.getRefType(builder.getI32Type()))
+                .getResult();
+  return fir::runtime::genRand(builder, loc, i);
+}
+
 // RANDOM_INIT
 void IntrinsicLibrary::genRandomInit(llvm::ArrayRef<fir::ExtendedValue> args) {
   assert(args.size() == 2);
diff --git a/flang/lib/Optimizer/Builder/Runtime/Intrinsics.cpp b/flang/lib/Optimizer/Builder/Runtime/Intrinsics.cpp
index 9fa3b18a255bd..4d366135c305f 100644
--- a/flang/lib/Optimizer/Builder/Runtime/Intrinsics.cpp
+++ b/flang/lib/Optimizer/Builder/Runtime/Intrinsics.cpp
@@ -470,3 +470,27 @@ mlir::Value fir::runtime::genChdir(fir::FirOpBuilder &builder,
       fir::runtime::createArguments(builder, loc, func.getFunctionType(), name);
   return fir::CallOp::create(builder, loc, func, args).getResult(0);
 }
+
+mlir::Value fir::runtime::genIrand(fir::FirOpBuilder &builder,
+                                   mlir::Location loc, mlir::Value i) {
+  auto runtimeFunc = fir::runtime::getRuntimeFunc<mkRTKey(Irand)>(loc, builder);
+  mlir::FunctionType runtimeFuncTy = runtimeFunc.getFunctionType();
+
+  llvm::SmallVector<mlir::Value> args =
+      fir::runtime::createArguments(builder, loc, runtimeFuncTy, i);
+  return fir::CallOp::create(builder, loc, runtimeFunc, args).getResult(0);
+}
+
+mlir::Value fir::runtime::genRand(fir::FirOpBuilder &builder,
+                                  mlir::Location loc, mlir::Value i) {
+  auto runtimeFunc = fir::runtime::getRuntimeFunc<mkRTKey(Rand)>(loc, builder);
+  mlir::FunctionType runtimeFuncTy = runtimeFunc.getFunctionType();
+
+  mlir::Value sourceFile = fir::factory::locationToFilename(builder, loc);
+  mlir::Value sourceLine =
+      fir::factory::locationToLineNo(builder, loc, runtimeFuncTy.getInput(2));
+
+  llvm::SmallVector<mlir::Value> args = fir::runtime::createArguments(
+      builder, loc, runtimeFuncTy, i, sourceFile, sourceLine);
+  return fir::CallOp::create(builder, loc, runtimeFunc, args).getResult(0);
+}
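Putting the intrinsic-table, lowering, and runtime-builder pieces together, a rough usage sketch of the rand/irand extensions (assuming gfortran-compatible semantics for the optional INTEGER(4) argument):

    program rand_demo
      real :: r
      integer :: i
      r = rand()     ! REAL(4) result via the Rand runtime entry point
      i = irand(0)   ! INTEGER(4) result via Irand; 0 requests the next value
      print *, r, i
    end program rand_demo
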
diff --git a/flang/lib/Optimizer/Dialect/CUF/CUFOps.cpp b/flang/lib/Optimizer/Dialect/CUF/CUFOps.cpp
index 687007d957225..97f7f76a8fbe7 100644
--- a/flang/lib/Optimizer/Dialect/CUF/CUFOps.cpp
+++ b/flang/lib/Optimizer/Dialect/CUF/CUFOps.cpp
@@ -333,7 +333,8 @@ void cuf::SharedMemoryOp::build(
       bindcName.empty() ? mlir::StringAttr{} : builder.getStringAttr(bindcName);
   build(builder, result, wrapAllocaResultType(inType),
         mlir::TypeAttr::get(inType), nameAttr, bindcAttr, typeparams, shape,
-        /*offset=*/mlir::Value{});
+        /*offset=*/mlir::Value{}, /*alignment=*/mlir::IntegerAttr{},
+        /*isStatic=*/nullptr);
   result.addAttributes(attributes);
 }
 
diff --git a/flang/lib/Optimizer/Dialect/FIROps.cpp b/flang/lib/Optimizer/Dialect/FIROps.cpp
index 97e544f30de3e..4e797d651cb7a 100644
--- a/flang/lib/Optimizer/Dialect/FIROps.cpp
+++ b/flang/lib/Optimizer/Dialect/FIROps.cpp
@@ -3230,11 +3230,19 @@ mlir::ParseResult fir::DTEntryOp::parse(mlir::OpAsmParser &parser,
       parser.parseAttribute(calleeAttr, fir::DTEntryOp::getProcAttrNameStr(),
                             result.attributes))
     return mlir::failure();
+
+  // Optional "deferred" keyword.
+  if (succeeded(parser.parseOptionalKeyword("deferred"))) {
+    result.addAttribute(fir::DTEntryOp::getDeferredAttrNameStr(),
+                        parser.getBuilder().getUnitAttr());
+  }
   return mlir::success();
 }
 
 void fir::DTEntryOp::print(mlir::OpAsmPrinter &p) {
   p << ' ' << getMethodAttr() << ", " << getProcAttr();
+  if ((*this)->getAttr(fir::DTEntryOp::getDeferredAttrNameStr()))
+    p << " deferred";
 }
 
 //===----------------------------------------------------------------------===//
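For context, the abstract and deferred markers threaded through the lowering and the fir.dt_entry printing above originate from Fortran along these lines (an illustrative sketch, not from the patch's tests):

    module shapes
      type, abstract :: shape
      contains
        ! DEFERRED binding: lowered to a fir.dt_entry carrying the new
        ! "deferred" unit attribute inside the abstract type's fir.type_info.
        procedure(area_iface), deferred :: area
      end type
      abstract interface
        real function area_iface(this)
          import :: shape
          class(shape), intent(in) :: this
        end function
      end interface
    end module shapes
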
diff --git a/flang/lib/Optimizer/OpenACC/Support/FIROpenACCTypeInterfaces.cpp b/flang/lib/Optimizer/OpenACC/Support/FIROpenACCTypeInterfaces.cpp
index ae0f5fb8197fa..9fcc7d3681c39 100644
--- a/flang/lib/Optimizer/OpenACC/Support/FIROpenACCTypeInterfaces.cpp
+++ b/flang/lib/Optimizer/OpenACC/Support/FIROpenACCTypeInterfaces.cpp
@@ -1014,4 +1014,114 @@ template bool OpenACCPointerLikeModel<fir::LLVMPointerType>::genCopy(
     mlir::TypedValue<mlir::acc::PointerLikeType> source,
     mlir::Type varType) const;
 
+template <typename Ty>
+mlir::Value OpenACCPointerLikeModel<Ty>::genLoad(
+    mlir::Type pointer, mlir::OpBuilder &builder, mlir::Location loc,
+    mlir::TypedValue<mlir::acc::PointerLikeType> srcPtr,
+    mlir::Type valueType) const {
+
+  // Unwrap to get the pointee type.
+  mlir::Type pointeeTy = fir::dyn_cast_ptrEleTy(pointer);
+  assert(pointeeTy && "expected pointee type to be extractable");
+
+  // Box types contain both a descriptor and referenced data. The genLoad API
+  // handles simple loads and cannot properly manage both parts.
+  if (fir::isa_box_type(pointeeTy))
+    return {};
+
+  // Unlimited polymorphic (class(*)) cannot be handled: the type is unknown.
+  if (fir::isUnlimitedPolymorphicType(pointeeTy))
+    return {};
+
+  // Return empty for dynamic size types because the load logic
+  // cannot be determined simply from the type.
+  if (fir::hasDynamicSize(pointeeTy))
+    return {};
+
+  mlir::Value loadedValue = fir::LoadOp::create(builder, loc, srcPtr);
+
+  // If valueType is provided and differs from the loaded type, insert a convert
+  if (valueType && loadedValue.getType() != valueType)
+    return fir::ConvertOp::create(builder, loc, valueType, loadedValue);
+
+  return loadedValue;
+}
+
+template mlir::Value OpenACCPointerLikeModel<fir::ReferenceType>::genLoad(
+    mlir::Type pointer, mlir::OpBuilder &builder, mlir::Location loc,
+    mlir::TypedValue<mlir::acc::PointerLikeType> srcPtr,
+    mlir::Type valueType) const;
+
+template mlir::Value OpenACCPointerLikeModel<fir::PointerType>::genLoad(
+    mlir::Type pointer, mlir::OpBuilder &builder, mlir::Location loc,
+    mlir::TypedValue<mlir::acc::PointerLikeType> srcPtr,
+    mlir::Type valueType) const;
+
+template mlir::Value OpenACCPointerLikeModel<fir::HeapType>::genLoad(
+    mlir::Type pointer, mlir::OpBuilder &builder, mlir::Location loc,
+    mlir::TypedValue<mlir::acc::PointerLikeType> srcPtr,
+    mlir::Type valueType) const;
+
+template mlir::Value OpenACCPointerLikeModel<fir::LLVMPointerType>::genLoad(
+    mlir::Type pointer, mlir::OpBuilder &builder, mlir::Location loc,
+    mlir::TypedValue<mlir::acc::PointerLikeType> srcPtr,
+    mlir::Type valueType) const;
+
+template <typename Ty>
+bool OpenACCPointerLikeModel<Ty>::genStore(
+    mlir::Type pointer, mlir::OpBuilder &builder, mlir::Location loc,
+    mlir::Value valueToStore,
+    mlir::TypedValue<mlir::acc::PointerLikeType> destPtr) const {
+
+  // Unwrap to get the pointee type.
+  mlir::Type pointeeTy = fir::dyn_cast_ptrEleTy(pointer);
+  assert(pointeeTy && "expected pointee type to be extractable");
+
+  // Box types contain both a descriptor and referenced data. The genStore API
+  // handles simple stores and cannot properly manage both parts.
+  if (fir::isa_box_type(pointeeTy))
+    return false;
+
+  // Unlimited polymorphic (class(*)) cannot be handled: the type is unknown.
+  if (fir::isUnlimitedPolymorphicType(pointeeTy))
+    return false;
+
+  // Return false for dynamic size types because the store logic
+  // cannot be determined simply from the type.
+  if (fir::hasDynamicSize(pointeeTy))
+    return false;
+
+  // Get the type from the value being stored
+  mlir::Type valueType = valueToStore.getType();
+  mlir::Value convertedValue = valueToStore;
+
+  // If the value type differs from the pointee type, insert a convert
+  if (valueType != pointeeTy)
+    convertedValue =
+        fir::ConvertOp::create(builder, loc, pointeeTy, valueToStore);
+
+  fir::StoreOp::create(builder, loc, convertedValue, destPtr);
+  return true;
+}
+
+template bool OpenACCPointerLikeModel<fir::ReferenceType>::genStore(
+    mlir::Type pointer, mlir::OpBuilder &builder, mlir::Location loc,
+    mlir::Value valueToStore,
+    mlir::TypedValue<mlir::acc::PointerLikeType> destPtr) const;
+
+template bool OpenACCPointerLikeModel<fir::PointerType>::genStore(
+    mlir::Type pointer, mlir::OpBuilder &builder, mlir::Location loc,
+    mlir::Value valueToStore,
+    mlir::TypedValue<mlir::acc::PointerLikeType> destPtr) const;
+
+template bool OpenACCPointerLikeModel<fir::HeapType>::genStore(
+    mlir::Type pointer, mlir::OpBuilder &builder, mlir::Location loc,
+    mlir::Value valueToStore,
+    mlir::TypedValue<mlir::acc::PointerLikeType> destPtr) const;
+
+template bool OpenACCPointerLikeModel<fir::LLVMPointerType>::genStore(
+    mlir::Type pointer, mlir::OpBuilder &builder, mlir::Location loc,
+    mlir::Value valueToStore,
+    mlir::TypedValue<mlir::acc::PointerLikeType> destPtr) const;
+
 } // namespace fir::acc
diff --git a/flang/lib/Optimizer/Transforms/CMakeLists.txt b/flang/lib/Optimizer/Transforms/CMakeLists.txt
index 0388439f89a54..619f3adc67c85 100644
--- a/flang/lib/Optimizer/Transforms/CMakeLists.txt
+++ b/flang/lib/Optimizer/Transforms/CMakeLists.txt
@@ -9,6 +9,7 @@ add_flang_library(FIRTransforms
   CompilerGeneratedNames.cpp
   ConstantArgumentGlobalisation.cpp
   ControlFlowConverter.cpp
+  CUDA/CUFAllocationConversion.cpp
   CUFAddConstructor.cpp
   CUFDeviceGlobal.cpp
   CUFOpConversion.cpp
diff --git a/flang/lib/Optimizer/Transforms/CUDA/CUFAllocationConversion.cpp b/flang/lib/Optimizer/Transforms/CUDA/CUFAllocationConversion.cpp
new file mode 100644
index 0000000000000..0acdb24bf62b1
--- /dev/null
+++ b/flang/lib/Optimizer/Transforms/CUDA/CUFAllocationConversion.cpp
@@ -0,0 +1,468 @@
+//===-- CUFAllocationConversion.cpp ---------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "flang/Optimizer/Transforms/CUDA/CUFAllocationConversion.h"
+#include "flang/Optimizer/Builder/CUFCommon.h"
+#include "flang/Optimizer/Builder/FIRBuilder.h"
+#include "flang/Optimizer/Builder/Runtime/CUDA/Descriptor.h"
+#include "flang/Optimizer/Builder/Runtime/RTBuilder.h"
+#include "flang/Optimizer/CodeGen/TypeConverter.h"
+#include "flang/Optimizer/Dialect/CUF/CUFOps.h"
+#include "flang/Optimizer/Dialect/FIRDialect.h"
+#include "flang/Optimizer/Dialect/FIROps.h"
+#include "flang/Optimizer/HLFIR/HLFIROps.h"
+#include "flang/Optimizer/Support/DataLayout.h"
+#include "flang/Runtime/CUDA/allocatable.h"
+#include "flang/Runtime/CUDA/common.h"
+#include "flang/Runtime/CUDA/descriptor.h"
+#include "flang/Runtime/CUDA/memory.h"
+#include "flang/Runtime/CUDA/pointer.h"
+#include "flang/Runtime/allocatable.h"
+#include "flang/Runtime/allocator-registry-consts.h"
+#include "flang/Support/Fortran.h"
+#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/IR/Matchers.h"
+#include "mlir/Pass/Pass.h"
+#include "mlir/Transforms/DialectConversion.h"
+#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
+
+namespace fir {
+#define GEN_PASS_DEF_CUFALLOCATIONCONVERSION
+#include "flang/Optimizer/Transforms/Passes.h.inc"
+} // namespace fir
+
+using namespace fir;
+using namespace mlir;
+using namespace Fortran::runtime;
+using namespace Fortran::runtime::cuda;
+
+namespace {
+
+template <typename OpTy>
+static bool isPinned(OpTy op) {
+  if (op.getDataAttr() && *op.getDataAttr() == cuf::DataAttribute::Pinned)
+    return true;
+  return false;
+}
+
+static inline unsigned getMemType(cuf::DataAttribute attr) {
+  if (attr == cuf::DataAttribute::Device)
+    return kMemTypeDevice;
+  if (attr == cuf::DataAttribute::Managed)
+    return kMemTypeManaged;
+  if (attr == cuf::DataAttribute::Pinned)
+    return kMemTypePinned;
+  if (attr == cuf::DataAttribute::Unified)
+    return kMemTypeUnified;
+  llvm_unreachable("unsupported memory type");
+}
+
+template <typename OpTy>
+static bool hasDoubleDescriptors(OpTy op) {
+  if (auto declareOp =
+          mlir::dyn_cast_or_null<fir::DeclareOp>(op.getBox().getDefiningOp())) {
+    if (mlir::isa_and_nonnull<fir::AddrOfOp>(
+            declareOp.getMemref().getDefiningOp())) {
+      if (isPinned(declareOp))
+        return false;
+      return true;
+    }
+  } else if (auto declareOp = mlir::dyn_cast_or_null<hlfir::DeclareOp>(
+                 op.getBox().getDefiningOp())) {
+    if (mlir::isa_and_nonnull<fir::AddrOfOp>(
+            declareOp.getMemref().getDefiningOp())) {
+      if (isPinned(declareOp))
+        return false;
+      return true;
+    }
+  }
+  return false;
+}
+
+static bool inDeviceContext(mlir::Operation *op) {
+  if (op->getParentOfType<cuf::KernelOp>())
+    return true;
+  if (auto funcOp = op->getParentOfType<mlir::gpu::GPUFuncOp>())
+    return true;
+  if (auto funcOp = op->getParentOfType<mlir::gpu::LaunchOp>())
+    return true;
+  if (auto funcOp = op->getParentOfType<mlir::func::FuncOp>()) {
+    if (auto cudaProcAttr =
+            funcOp.getOperation()->getAttrOfType<cuf::ProcAttributeAttr>(
+                cuf::getProcAttrName())) {
+      return cudaProcAttr.getValue() != cuf::ProcAttribute::Host &&
+             cudaProcAttr.getValue() != cuf::ProcAttribute::HostDevice;
+    }
+  }
+  return false;
+}
+
+template <typename OpTy>
+static mlir::LogicalResult convertOpToCall(OpTy op,
+                                           mlir::PatternRewriter &rewriter,
+                                           mlir::func::FuncOp func) {
+  auto mod = op->template getParentOfType<mlir::ModuleOp>();
+  fir::FirOpBuilder builder(rewriter, mod);
+  mlir::Location loc = op.getLoc();
+  auto fTy = func.getFunctionType();
+
+  mlir::Value sourceFile = fir::factory::locationToFilename(builder, loc);
+  mlir::Value sourceLine;
+  if constexpr (std::is_same_v<OpTy, cuf::AllocateOp>)
+    sourceLine = fir::factory::locationToLineNo(
+        builder, loc, op.getSource() ? fTy.getInput(7) : fTy.getInput(6));
+  else
+    sourceLine = fir::factory::locationToLineNo(builder, loc, fTy.getInput(4));
+
+  mlir::Value hasStat = op.getHasStat() ? builder.createBool(loc, true)
+                                        : builder.createBool(loc, false);
+
+  mlir::Value errmsg;
+  if (op.getErrmsg()) {
+    errmsg = op.getErrmsg();
+  } else {
+    mlir::Type boxNoneTy = fir::BoxType::get(builder.getNoneType());
+    errmsg = fir::AbsentOp::create(builder, loc, boxNoneTy).getResult();
+  }
+  llvm::SmallVector<mlir::Value> args;
+  if constexpr (std::is_same_v<OpTy, cuf::AllocateOp>) {
+    mlir::Value pinned =
+        op.getPinned()
+            ? op.getPinned()
+            : builder.createNullConstant(
+                  loc, fir::ReferenceType::get(
+                           mlir::IntegerType::get(op.getContext(), 1)));
+    if (op.getSource()) {
+      mlir::Value stream =
+          op.getStream() ? op.getStream()
+                         : builder.createNullConstant(loc, fTy.getInput(2));
+      args = fir::runtime::createArguments(
+          builder, loc, fTy, op.getBox(), op.getSource(), stream, pinned,
+          hasStat, errmsg, sourceFile, sourceLine);
+    } else {
+      mlir::Value stream =
+          op.getStream() ? op.getStream()
+                         : builder.createNullConstant(loc, fTy.getInput(1));
+      args = fir::runtime::createArguments(builder, loc, fTy, op.getBox(),
+                                           stream, pinned, hasStat, errmsg,
+                                           sourceFile, sourceLine);
+    }
+  } else {
+    args =
+        fir::runtime::createArguments(builder, loc, fTy, op.getBox(), hasStat,
+                                      errmsg, sourceFile, sourceLine);
+  }
+  auto callOp = fir::CallOp::create(builder, loc, func, args);
+  rewriter.replaceOp(op, callOp);
+  return mlir::success();
+}
+
+struct CUFAllocOpConversion : public mlir::OpRewritePattern<cuf::AllocOp> {
+  using OpRewritePattern::OpRewritePattern;
+
+  CUFAllocOpConversion(mlir::MLIRContext *context, mlir::DataLayout *dl,
+                       const fir::LLVMTypeConverter *typeConverter)
+      : OpRewritePattern(context), dl{dl}, typeConverter{typeConverter} {}
+
+  mlir::LogicalResult
+  matchAndRewrite(cuf::AllocOp op,
+                  mlir::PatternRewriter &rewriter) const override {
+
+    mlir::Location loc = op.getLoc();
+
+    if (inDeviceContext(op.getOperation())) {
+      // In device context, just replace the cuf.alloc operation with a
+      // fir.alloca; the cuf.free will be removed.
+      auto allocaOp =
+          fir::AllocaOp::create(rewriter, loc, op.getInType(),
+                                op.getUniqName() ? *op.getUniqName() : "",
+                                op.getBindcName() ? *op.getBindcName() : "",
+                                op.getTypeparams(), op.getShape());
+      allocaOp->setAttr(cuf::getDataAttrName(), op.getDataAttrAttr());
+      rewriter.replaceOp(op, allocaOp);
+      return mlir::success();
+    }
+
+    auto mod = op->getParentOfType<mlir::ModuleOp>();
+    fir::FirOpBuilder builder(rewriter, mod);
+    mlir::Value sourceFile = fir::factory::locationToFilename(builder, loc);
+
+    if (!mlir::dyn_cast_or_null<fir::BaseBoxType>(op.getInType())) {
+      // Convert scalar and known size array allocations.
+      mlir::Value bytes;
+      fir::KindMapping kindMap{fir::getKindMapping(mod)};
+      if (fir::isa_trivial(op.getInType())) {
+        int width = cuf::computeElementByteSize(loc, op.getInType(), kindMap);
+        bytes =
+            builder.createIntegerConstant(loc, builder.getIndexType(), width);
+      } else if (auto seqTy = mlir::dyn_cast_or_null<fir::SequenceType>(
+                     op.getInType())) {
+        std::size_t size = 0;
+        if (fir::isa_derived(seqTy.getEleTy())) {
+          mlir::Type structTy = typeConverter->convertType(seqTy.getEleTy());
+          size = dl->getTypeSizeInBits(structTy) / 8;
+        } else {
+          size = cuf::computeElementByteSize(loc, seqTy.getEleTy(), kindMap);
+        }
+        mlir::Value width =
+            builder.createIntegerConstant(loc, builder.getIndexType(), size);
+        mlir::Value nbElem;
+        if (fir::sequenceWithNonConstantShape(seqTy)) {
+          assert(!op.getShape().empty() && "expect shape with dynamic arrays");
+          nbElem = builder.loadIfRef(loc, op.getShape()[0]);
+          for (unsigned i = 1; i < op.getShape().size(); ++i) {
+            nbElem = mlir::arith::MulIOp::create(
+                rewriter, loc, nbElem,
+                builder.loadIfRef(loc, op.getShape()[i]));
+          }
+        } else {
+          nbElem = builder.createIntegerConstant(loc, builder.getIndexType(),
+                                                 seqTy.getConstantArraySize());
+        }
+        bytes = mlir::arith::MulIOp::create(rewriter, loc, nbElem, width);
+      } else if (fir::isa_derived(op.getInType())) {
+        mlir::Type structTy = typeConverter->convertType(op.getInType());
+        std::size_t structSize = dl->getTypeSizeInBits(structTy) / 8;
+        bytes = builder.createIntegerConstant(loc, builder.getIndexType(),
+                                              structSize);
+      } else if (fir::isa_char(op.getInType())) {
+        mlir::Type charTy = typeConverter->convertType(op.getInType());
+        std::size_t charSize = dl->getTypeSizeInBits(charTy) / 8;
+        bytes = builder.createIntegerConstant(loc, builder.getIndexType(),
+                                              charSize);
+      } else {
+        mlir::emitError(loc, "unsupported type in cuf.alloc\n");
+      }
+      mlir::func::FuncOp func =
+          fir::runtime::getRuntimeFunc<mkRTKey(CUFMemAlloc)>(loc, builder);
+      auto fTy = func.getFunctionType();
+      mlir::Value sourceLine =
+          fir::factory::locationToLineNo(builder, loc, fTy.getInput(3));
+      mlir::Value memTy = builder.createIntegerConstant(
+          loc, builder.getI32Type(), getMemType(op.getDataAttr()));
+      llvm::SmallVector<mlir::Value> args{fir::runtime::createArguments(
+          builder, loc, fTy, bytes, memTy, sourceFile, sourceLine)};
+      auto callOp = fir::CallOp::create(builder, loc, func, args);
+      callOp->setAttr(cuf::getDataAttrName(), op.getDataAttrAttr());
+      auto convOp = builder.createConvert(loc, op.getResult().getType(),
+                                          callOp.getResult(0));
+      rewriter.replaceOp(op, convOp);
+      return mlir::success();
+    }
+
+    // Convert descriptor allocations to function call.
+    auto boxTy = mlir::dyn_cast_or_null<fir::BaseBoxType>(op.getInType());
+    mlir::func::FuncOp func =
+        fir::runtime::getRuntimeFunc<mkRTKey(CUFAllocDescriptor)>(loc, builder);
+    auto fTy = func.getFunctionType();
+    mlir::Value sourceLine =
+        fir::factory::locationToLineNo(builder, loc, fTy.getInput(2));
+
+    mlir::Type structTy = typeConverter->convertBoxTypeAsStruct(boxTy);
+    std::size_t boxSize = dl->getTypeSizeInBits(structTy) / 8;
+    mlir::Value sizeInBytes =
+        builder.createIntegerConstant(loc, builder.getIndexType(), boxSize);
+
+    llvm::SmallVector<mlir::Value> args{fir::runtime::createArguments(
+        builder, loc, fTy, sizeInBytes, sourceFile, sourceLine)};
+    auto callOp = fir::CallOp::create(builder, loc, func, args);
+    callOp->setAttr(cuf::getDataAttrName(), op.getDataAttrAttr());
+    auto convOp = builder.createConvert(loc, op.getResult().getType(),
+                                        callOp.getResult(0));
+    rewriter.replaceOp(op, convOp);
+    return mlir::success();
+  }
+
+private:
+  mlir::DataLayout *dl;
+  const fir::LLVMTypeConverter *typeConverter;
+};
+
+struct CUFFreeOpConversion : public mlir::OpRewritePattern<cuf::FreeOp> {
+  using OpRewritePattern::OpRewritePattern;
+
+  mlir::LogicalResult
+  matchAndRewrite(cuf::FreeOp op,
+                  mlir::PatternRewriter &rewriter) const override {
+    if (inDeviceContext(op.getOperation())) {
+      rewriter.eraseOp(op);
+      return mlir::success();
+    }
+
+    if (!mlir::isa<fir::ReferenceType>(op.getDevptr().getType()))
+      return failure();
+
+    auto mod = op->getParentOfType<mlir::ModuleOp>();
+    fir::FirOpBuilder builder(rewriter, mod);
+    mlir::Location loc = op.getLoc();
+    mlir::Value sourceFile = fir::factory::locationToFilename(builder, loc);
+
+    auto refTy = mlir::dyn_cast<fir::ReferenceType>(op.getDevptr().getType());
+    if (!mlir::isa<fir::BaseBoxType>(refTy.getEleTy())) {
+      mlir::func::FuncOp func =
+          fir::runtime::getRuntimeFunc<mkRTKey(CUFMemFree)>(loc, builder);
+      auto fTy = func.getFunctionType();
+      mlir::Value sourceLine =
+          fir::factory::locationToLineNo(builder, loc, fTy.getInput(3));
+      mlir::Value memTy = builder.createIntegerConstant(
+          loc, builder.getI32Type(), getMemType(op.getDataAttr()));
+      llvm::SmallVector<mlir::Value> args{fir::runtime::createArguments(
+          builder, loc, fTy, op.getDevptr(), memTy, sourceFile, sourceLine)};
+      fir::CallOp::create(builder, loc, func, args);
+      rewriter.eraseOp(op);
+      return mlir::success();
+    }
+
+    // Convert cuf.free on descriptors.
+    mlir::func::FuncOp func =
+        fir::runtime::getRuntimeFunc<mkRTKey(CUFFreeDescriptor)>(loc, builder);
+    auto fTy = func.getFunctionType();
+    mlir::Value sourceLine =
+        fir::factory::locationToLineNo(builder, loc, fTy.getInput(2));
+    llvm::SmallVector<mlir::Value> args{fir::runtime::createArguments(
+        builder, loc, fTy, op.getDevptr(), sourceFile, sourceLine)};
+    auto callOp = fir::CallOp::create(builder, loc, func, args);
+    callOp->setAttr(cuf::getDataAttrName(), op.getDataAttrAttr());
+    rewriter.eraseOp(op);
+    return mlir::success();
+  }
+};
+
+struct CUFAllocateOpConversion
+    : public mlir::OpRewritePattern<cuf::AllocateOp> {
+  using OpRewritePattern::OpRewritePattern;
+
+  mlir::LogicalResult
+  matchAndRewrite(cuf::AllocateOp op,
+                  mlir::PatternRewriter &rewriter) const override {
+    auto mod = op->getParentOfType<mlir::ModuleOp>();
+    fir::FirOpBuilder builder(rewriter, mod);
+    mlir::Location loc = op.getLoc();
+
+    bool isPointer = false;
+
+    if (auto declareOp =
+            mlir::dyn_cast_or_null<fir::DeclareOp>(op.getBox().getDefiningOp()))
+      if (declareOp.getFortranAttrs() &&
+          bitEnumContainsAny(*declareOp.getFortranAttrs(),
+                             fir::FortranVariableFlagsEnum::pointer))
+        isPointer = true;
+
+    if (hasDoubleDescriptors(op)) {
+      // Allocations for module variables use a custom runtime entry point
+      // so the descriptors can be synchronized.
+      mlir::func::FuncOp func;
+      if (op.getSource()) {
+        func = isPointer ? fir::runtime::getRuntimeFunc<mkRTKey(
+                               CUFPointerAllocateSourceSync)>(loc, builder)
+                         : fir::runtime::getRuntimeFunc<mkRTKey(
+                               CUFAllocatableAllocateSourceSync)>(loc, builder);
+      } else {
+        func =
+            isPointer
+                ? fir::runtime::getRuntimeFunc<mkRTKey(CUFPointerAllocateSync)>(
+                      loc, builder)
+                : fir::runtime::getRuntimeFunc<mkRTKey(
+                      CUFAllocatableAllocateSync)>(loc, builder);
+      }
+      return convertOpToCall<cuf::AllocateOp>(op, rewriter, func);
+    }
+
+    mlir::func::FuncOp func;
+    if (op.getSource()) {
+      func =
+          isPointer
+              ? fir::runtime::getRuntimeFunc<mkRTKey(CUFPointerAllocateSource)>(
+                    loc, builder)
+              : fir::runtime::getRuntimeFunc<mkRTKey(
+                    CUFAllocatableAllocateSource)>(loc, builder);
+    } else {
+      func =
+          isPointer
+              ? fir::runtime::getRuntimeFunc<mkRTKey(CUFPointerAllocate)>(
+                    loc, builder)
+              : fir::runtime::getRuntimeFunc<mkRTKey(CUFAllocatableAllocate)>(
+                    loc, builder);
+    }
+
+    return convertOpToCall<cuf::AllocateOp>(op, rewriter, func);
+  }
+};
+
+struct CUFDeallocateOpConversion
+    : public mlir::OpRewritePattern<cuf::DeallocateOp> {
+  using OpRewritePattern::OpRewritePattern;
+
+  mlir::LogicalResult
+  matchAndRewrite(cuf::DeallocateOp op,
+                  mlir::PatternRewriter &rewriter) const override {
+
+    auto mod = op->getParentOfType<mlir::ModuleOp>();
+    fir::FirOpBuilder builder(rewriter, mod);
+    mlir::Location loc = op.getLoc();
+
+    if (hasDoubleDescriptors(op)) {
+      // Deallocation for module variables uses a custom runtime entry
+      // point so the descriptors can be synchronized.
+      mlir::func::FuncOp func =
+          fir::runtime::getRuntimeFunc<mkRTKey(CUFAllocatableDeallocate)>(
+              loc, builder);
+      return convertOpToCall<cuf::DeallocateOp>(op, rewriter, func);
+    }
+
+    // Deallocation of a local descriptor falls back on the standard runtime
+    // AllocatableDeallocate, as the dedicated deallocator is set in the
+    // descriptor before the call.
+    mlir::func::FuncOp func =
+        fir::runtime::getRuntimeFunc<mkRTKey(AllocatableDeallocate)>(loc,
+                                                                     builder);
+    return convertOpToCall<cuf::DeallocateOp>(op, rewriter, func);
+  }
+};
+
+class CUFAllocationConversion
+    : public fir::impl::CUFAllocationConversionBase<CUFAllocationConversion> {
+public:
+  void runOnOperation() override {
+    auto *ctx = &getContext();
+    mlir::RewritePatternSet patterns(ctx);
+    mlir::ConversionTarget target(*ctx);
+
+    mlir::Operation *op = getOperation();
+    mlir::ModuleOp module = mlir::dyn_cast<mlir::ModuleOp>(op);
+    if (!module)
+      return signalPassFailure();
+    mlir::SymbolTable symtab(module);
+
+    std::optional<mlir::DataLayout> dl = fir::support::getOrSetMLIRDataLayout(
+        module, /*allowDefaultLayout=*/false);
+    fir::LLVMTypeConverter typeConverter(module, /*applyTBAA=*/false,
+                                         /*forceUnifiedTBAATree=*/false, *dl);
+    target.addLegalDialect<fir::FIROpsDialect, mlir::arith::ArithDialect,
+                           mlir::gpu::GPUDialect>();
+    target.addLegalOp<cuf::StreamCastOp>();
+    cuf::populateCUFAllocationConversionPatterns(typeConverter, *dl, symtab,
+                                                 patterns);
+    if (mlir::failed(mlir::applyPartialConversion(getOperation(), target,
+                                                  std::move(patterns)))) {
+      mlir::emitError(mlir::UnknownLoc::get(ctx),
+                      "error in CUF allocation conversion\n");
+      signalPassFailure();
+    }
+  }
+};
+
+} // namespace
+
+void cuf::populateCUFAllocationConversionPatterns(
+    const fir::LLVMTypeConverter &converter, mlir::DataLayout &dl,
+    const mlir::SymbolTable &symtab, mlir::RewritePatternSet &patterns) {
+  patterns.insert<CUFAllocOpConversion>(patterns.getContext(), &dl, &converter);
+  patterns.insert<CUFFreeOpConversion, CUFAllocateOpConversion,
+                  CUFDeallocateOpConversion>(patterns.getContext());
+}
diff --git a/flang/lib/Optimizer/Transforms/CUFComputeSharedMemoryOffsetsAndSize.cpp b/flang/lib/Optimizer/Transforms/CUFComputeSharedMemoryOffsetsAndSize.cpp
index a64494510d847..7bae0602fe5ca 100644
--- a/flang/lib/Optimizer/Transforms/CUFComputeSharedMemoryOffsetsAndSize.cpp
+++ b/flang/lib/Optimizer/Transforms/CUFComputeSharedMemoryOffsetsAndSize.cpp
@@ -46,6 +46,43 @@ static bool isAssumedSize(mlir::ValueRange shape) {
   return false;
 }
 
+static void createSharedMemoryGlobal(fir::FirOpBuilder &builder,
+                                     mlir::Location loc, llvm::StringRef prefix,
+                                     llvm::StringRef suffix,
+                                     mlir::gpu::GPUModuleOp gpuMod,
+                                     mlir::Type sharedMemType, unsigned size,
+                                     unsigned align, bool isDynamic) {
+  std::string sharedMemGlobalName =
+      isDynamic ? (prefix + llvm::Twine(cudaSharedMemSuffix)).str()
+                : (prefix + llvm::Twine(cudaSharedMemSuffix) + suffix).str();
+
+  mlir::OpBuilder::InsertionGuard guard(builder);
+  builder.setInsertionPointToEnd(gpuMod.getBody());
+
+  mlir::StringAttr linkage = isDynamic ? builder.createExternalLinkage()
+                                       : builder.createInternalLinkage();
+  llvm::SmallVector<mlir::NamedAttribute> attrs;
+  auto globalOpName = mlir::OperationName(fir::GlobalOp::getOperationName(),
+                                          gpuMod.getContext());
+  attrs.push_back(mlir::NamedAttribute(
+      fir::GlobalOp::getDataAttrAttrName(globalOpName),
+      cuf::DataAttributeAttr::get(gpuMod.getContext(),
+                                  cuf::DataAttribute::Shared)));
+
+  mlir::DenseElementsAttr init = {};
+  mlir::Type i8Ty = builder.getI8Type();
+  if (size > 0) {
+    auto vecTy = mlir::VectorType::get(
+        static_cast<fir::SequenceType::Extent>(size), i8Ty);
+    mlir::Attribute zero = mlir::IntegerAttr::get(i8Ty, 0);
+    init = mlir::DenseElementsAttr::get(vecTy, llvm::ArrayRef(zero));
+  }
+  auto sharedMem =
+      fir::GlobalOp::create(builder, loc, sharedMemGlobalName, false, false,
+                            sharedMemType, init, linkage, attrs);
+  sharedMem.setAlignment(align);
+}
+
 struct CUFComputeSharedMemoryOffsetsAndSize
     : public fir::impl::CUFComputeSharedMemoryOffsetsAndSizeBase<
           CUFComputeSharedMemoryOffsetsAndSize> {
@@ -108,18 +145,23 @@ struct CUFComputeSharedMemoryOffsetsAndSize
                                                        crtDynOffset, dynSize);
           else
             crtDynOffset = dynSize;
-
-          continue;
+        } else {
+          // Static shared memory.
+          auto [size, align] = fir::getTypeSizeAndAlignmentOrCrash(
+              loc, sharedOp.getInType(), *dl, kindMap);
+          createSharedMemoryGlobal(
+              builder, sharedOp.getLoc(), funcOp.getName(),
+              *sharedOp.getBindcName(), gpuMod,
+              fir::SequenceType::get(size, i8Ty), size,
+              sharedOp.getAlignment() ? *sharedOp.getAlignment() : align,
+              /*isDynamic=*/false);
+          mlir::Value zero = builder.createIntegerConstant(loc, i32Ty, 0);
+          sharedOp.getOffsetMutable().assign(zero);
+          if (!sharedOp.getAlignment())
+            sharedOp.setAlignment(align);
+          sharedOp.setIsStatic(true);
+          ++nbStaticSharedVariables;
         }
-        auto [size, align] = fir::getTypeSizeAndAlignmentOrCrash(
-            sharedOp.getLoc(), sharedOp.getInType(), *dl, kindMap);
-        ++nbStaticSharedVariables;
-        mlir::Value offset = builder.createIntegerConstant(
-            loc, i32Ty, llvm::alignTo(sharedMemSize, align));
-        sharedOp.getOffsetMutable().assign(offset);
-        sharedMemSize =
-            llvm::alignTo(sharedMemSize, align) + llvm::alignTo(size, align);
-        alignment = std::max(alignment, align);
       }
 
       if (nbDynamicSharedVariables == 0 && nbStaticSharedVariables == 0)
@@ -130,35 +172,13 @@ struct CUFComputeSharedMemoryOffsetsAndSize
             funcOp.getLoc(),
             "static and dynamic shared variables in a single kernel");
 
-      mlir::DenseElementsAttr init = {};
-      if (sharedMemSize > 0) {
-        auto vecTy = mlir::VectorType::get(sharedMemSize, i8Ty);
-        mlir::Attribute zero = mlir::IntegerAttr::get(i8Ty, 0);
-        init = mlir::DenseElementsAttr::get(vecTy, llvm::ArrayRef(zero));
-      }
+      if (nbStaticSharedVariables > 0)
+        continue;
 
-      // Create the shared memory global where each shared variable will point
-      // to.
       auto sharedMemType = fir::SequenceType::get(sharedMemSize, i8Ty);
-      std::string sharedMemGlobalName =
-          (funcOp.getName() + llvm::Twine(cudaSharedMemSuffix)).str();
-      // Dynamic shared memory needs an external linkage while static shared
-      // memory needs an internal linkage.
-      mlir::StringAttr linkage = nbDynamicSharedVariables > 0
-                                     ? builder.createExternalLinkage()
-                                     : builder.createInternalLinkage();
-      builder.setInsertionPointToEnd(gpuMod.getBody());
-      llvm::SmallVector<mlir::NamedAttribute> attrs;
-      auto globalOpName = mlir::OperationName(fir::GlobalOp::getOperationName(),
-                                              gpuMod.getContext());
-      attrs.push_back(mlir::NamedAttribute(
-          fir::GlobalOp::getDataAttrAttrName(globalOpName),
-          cuf::DataAttributeAttr::get(gpuMod.getContext(),
-                                      cuf::DataAttribute::Shared)));
-      auto sharedMem = fir::GlobalOp::create(
-          builder, funcOp.getLoc(), sharedMemGlobalName, false, false,
-          sharedMemType, init, linkage, attrs);
-      sharedMem.setAlignment(alignment);
+      createSharedMemoryGlobal(builder, funcOp.getLoc(), funcOp.getName(), "",
+                               gpuMod, sharedMemType, sharedMemSize, alignment,
+                               /*isDynamic=*/true);
     }
   }
 };
diff --git a/flang/lib/Optimizer/Transforms/CUFGPUToLLVMConversion.cpp b/flang/lib/Optimizer/Transforms/CUFGPUToLLVMConversion.cpp
index 40f180a8c1657..d5a8212eb5472 100644
--- a/flang/lib/Optimizer/Transforms/CUFGPUToLLVMConversion.cpp
+++ b/flang/lib/Optimizer/Transforms/CUFGPUToLLVMConversion.cpp
@@ -249,8 +249,13 @@ struct CUFSharedMemoryOpConversion
                       "cuf.shared_memory must have an offset for code gen");
 
     auto gpuMod = op->getParentOfType<gpu::GPUModuleOp>();
+
     std::string sharedGlobalName =
-        (getFuncName(op) + llvm::Twine(cudaSharedMemSuffix)).str();
+        op.getIsStatic()
+            ? (getFuncName(op) + llvm::Twine(cudaSharedMemSuffix) +
+               *op.getBindcName())
+                  .str()
+            : (getFuncName(op) + llvm::Twine(cudaSharedMemSuffix)).str();
     mlir::Value sharedGlobalAddr =
         createAddressOfOp(rewriter, loc, gpuMod, sharedGlobalName);
 
diff --git a/flang/lib/Optimizer/Transforms/CUFOpConversion.cpp b/flang/lib/Optimizer/Transforms/CUFOpConversion.cpp
index 7ed34f865d0e9..424a8fd9d959b 100644
--- a/flang/lib/Optimizer/Transforms/CUFOpConversion.cpp
+++ b/flang/lib/Optimizer/Transforms/CUFOpConversion.cpp
@@ -16,6 +16,8 @@
 #include "flang/Optimizer/Dialect/FIROps.h"
 #include "flang/Optimizer/HLFIR/HLFIROps.h"
 #include "flang/Optimizer/Support/DataLayout.h"
+#include "flang/Optimizer/Transforms/CUDA/CUFAllocationConversion.h"
+#include "flang/Optimizer/Transforms/Passes.h"
 #include "flang/Runtime/CUDA/allocatable.h"
 #include "flang/Runtime/CUDA/common.h"
 #include "flang/Runtime/CUDA/descriptor.h"
@@ -44,207 +46,6 @@ using namespace Fortran::runtime::cuda;
 
 namespace {
 
-static inline unsigned getMemType(cuf::DataAttribute attr) {
-  if (attr == cuf::DataAttribute::Device)
-    return kMemTypeDevice;
-  if (attr == cuf::DataAttribute::Managed)
-    return kMemTypeManaged;
-  if (attr == cuf::DataAttribute::Unified)
-    return kMemTypeUnified;
-  if (attr == cuf::DataAttribute::Pinned)
-    return kMemTypePinned;
-  llvm::report_fatal_error("unsupported memory type");
-}
-
-template <typename OpTy>
-static bool isPinned(OpTy op) {
-  if (op.getDataAttr() && *op.getDataAttr() == cuf::DataAttribute::Pinned)
-    return true;
-  return false;
-}
-
-template <typename OpTy>
-static bool hasDoubleDescriptors(OpTy op) {
-  if (auto declareOp =
-          mlir::dyn_cast_or_null<fir::DeclareOp>(op.getBox().getDefiningOp())) {
-    if (mlir::isa_and_nonnull<fir::AddrOfOp>(
-            declareOp.getMemref().getDefiningOp())) {
-      if (isPinned(declareOp))
-        return false;
-      return true;
-    }
-  } else if (auto declareOp = mlir::dyn_cast_or_null<hlfir::DeclareOp>(
-                 op.getBox().getDefiningOp())) {
-    if (mlir::isa_and_nonnull<fir::AddrOfOp>(
-            declareOp.getMemref().getDefiningOp())) {
-      if (isPinned(declareOp))
-        return false;
-      return true;
-    }
-  }
-  return false;
-}
-
-static mlir::Value createConvertOp(mlir::PatternRewriter &rewriter,
-                                   mlir::Location loc, mlir::Type toTy,
-                                   mlir::Value val) {
-  if (val.getType() != toTy)
-    return fir::ConvertOp::create(rewriter, loc, toTy, val);
-  return val;
-}
-
-template <typename OpTy>
-static mlir::LogicalResult convertOpToCall(OpTy op,
-                                           mlir::PatternRewriter &rewriter,
-                                           mlir::func::FuncOp func) {
-  auto mod = op->template getParentOfType<mlir::ModuleOp>();
-  fir::FirOpBuilder builder(rewriter, mod);
-  mlir::Location loc = op.getLoc();
-  auto fTy = func.getFunctionType();
-
-  mlir::Value sourceFile = fir::factory::locationToFilename(builder, loc);
-  mlir::Value sourceLine;
-  if constexpr (std::is_same_v<OpTy, cuf::AllocateOp>)
-    sourceLine = fir::factory::locationToLineNo(
-        builder, loc, op.getSource() ? fTy.getInput(7) : fTy.getInput(6));
-  else
-    sourceLine = fir::factory::locationToLineNo(builder, loc, fTy.getInput(4));
-
-  mlir::Value hasStat = op.getHasStat() ? builder.createBool(loc, true)
-                                        : builder.createBool(loc, false);
-
-  mlir::Value errmsg;
-  if (op.getErrmsg()) {
-    errmsg = op.getErrmsg();
-  } else {
-    mlir::Type boxNoneTy = fir::BoxType::get(builder.getNoneType());
-    errmsg = fir::AbsentOp::create(builder, loc, boxNoneTy).getResult();
-  }
-  llvm::SmallVector<mlir::Value> args;
-  if constexpr (std::is_same_v<OpTy, cuf::AllocateOp>) {
-    mlir::Value pinned =
-        op.getPinned()
-            ? op.getPinned()
-            : builder.createNullConstant(
-                  loc, fir::ReferenceType::get(
-                           mlir::IntegerType::get(op.getContext(), 1)));
-    if (op.getSource()) {
-      mlir::Value stream =
-          op.getStream() ? op.getStream()
-                         : builder.createNullConstant(loc, fTy.getInput(2));
-      args = fir::runtime::createArguments(
-          builder, loc, fTy, op.getBox(), op.getSource(), stream, pinned,
-          hasStat, errmsg, sourceFile, sourceLine);
-    } else {
-      mlir::Value stream =
-          op.getStream() ? op.getStream()
-                         : builder.createNullConstant(loc, fTy.getInput(1));
-      args = fir::runtime::createArguments(builder, loc, fTy, op.getBox(),
-                                           stream, pinned, hasStat, errmsg,
-                                           sourceFile, sourceLine);
-    }
-  } else {
-    args =
-        fir::runtime::createArguments(builder, loc, fTy, op.getBox(), hasStat,
-                                      errmsg, sourceFile, sourceLine);
-  }
-  auto callOp = fir::CallOp::create(builder, loc, func, args);
-  rewriter.replaceOp(op, callOp);
-  return mlir::success();
-}
-
-struct CUFAllocateOpConversion
-    : public mlir::OpRewritePattern<cuf::AllocateOp> {
-  using OpRewritePattern::OpRewritePattern;
-
-  mlir::LogicalResult
-  matchAndRewrite(cuf::AllocateOp op,
-                  mlir::PatternRewriter &rewriter) const override {
-    auto mod = op->getParentOfType<mlir::ModuleOp>();
-    fir::FirOpBuilder builder(rewriter, mod);
-    mlir::Location loc = op.getLoc();
-
-    bool isPointer = false;
-
-    if (auto declareOp =
-            mlir::dyn_cast_or_null<fir::DeclareOp>(op.getBox().getDefiningOp()))
-      if (declareOp.getFortranAttrs() &&
-          bitEnumContainsAny(*declareOp.getFortranAttrs(),
-                             fir::FortranVariableFlagsEnum::pointer))
-        isPointer = true;
-
-    if (hasDoubleDescriptors(op)) {
-      // Allocation for module variable are done with custom runtime entry point
-      // so the descriptors can be synchronized.
-      mlir::func::FuncOp func;
-      if (op.getSource()) {
-        func = isPointer ? fir::runtime::getRuntimeFunc<mkRTKey(
-                               CUFPointerAllocateSourceSync)>(loc, builder)
-                         : fir::runtime::getRuntimeFunc<mkRTKey(
-                               CUFAllocatableAllocateSourceSync)>(loc, builder);
-      } else {
-        func =
-            isPointer
-                ? fir::runtime::getRuntimeFunc<mkRTKey(CUFPointerAllocateSync)>(
-                      loc, builder)
-                : fir::runtime::getRuntimeFunc<mkRTKey(
-                      CUFAllocatableAllocateSync)>(loc, builder);
-      }
-      return convertOpToCall<cuf::AllocateOp>(op, rewriter, func);
-    }
-
-    mlir::func::FuncOp func;
-    if (op.getSource()) {
-      func =
-          isPointer
-              ? fir::runtime::getRuntimeFunc<mkRTKey(CUFPointerAllocateSource)>(
-                    loc, builder)
-              : fir::runtime::getRuntimeFunc<mkRTKey(
-                    CUFAllocatableAllocateSource)>(loc, builder);
-    } else {
-      func =
-          isPointer
-              ? fir::runtime::getRuntimeFunc<mkRTKey(CUFPointerAllocate)>(
-                    loc, builder)
-              : fir::runtime::getRuntimeFunc<mkRTKey(CUFAllocatableAllocate)>(
-                    loc, builder);
-    }
-
-    return convertOpToCall<cuf::AllocateOp>(op, rewriter, func);
-  }
-};
-
-struct CUFDeallocateOpConversion
-    : public mlir::OpRewritePattern<cuf::DeallocateOp> {
-  using OpRewritePattern::OpRewritePattern;
-
-  mlir::LogicalResult
-  matchAndRewrite(cuf::DeallocateOp op,
-                  mlir::PatternRewriter &rewriter) const override {
-
-    auto mod = op->getParentOfType<mlir::ModuleOp>();
-    fir::FirOpBuilder builder(rewriter, mod);
-    mlir::Location loc = op.getLoc();
-
-    if (hasDoubleDescriptors(op)) {
-      // Deallocation for module variable are done with custom runtime entry
-      // point so the descriptors can be synchronized.
-      mlir::func::FuncOp func =
-          fir::runtime::getRuntimeFunc<mkRTKey(CUFAllocatableDeallocate)>(
-              loc, builder);
-      return convertOpToCall<cuf::DeallocateOp>(op, rewriter, func);
-    }
-
-    // Deallocation for local descriptor falls back on the standard runtime
-    // AllocatableDeallocate as the dedicated deallocator is set in the
-    // descriptor before the call.
-    mlir::func::FuncOp func =
-        fir::runtime::getRuntimeFunc<mkRTKey(AllocatableDeallocate)>(loc,
-                                                                     builder);
-    return convertOpToCall<cuf::DeallocateOp>(op, rewriter, func);
-  }
-};
-
 static bool inDeviceContext(mlir::Operation *op) {
   if (op->getParentOfType<cuf::KernelOp>())
     return true;
@@ -263,126 +64,13 @@ static bool inDeviceContext(mlir::Operation *op) {
   return false;
 }
 
-struct CUFAllocOpConversion : public mlir::OpRewritePattern<cuf::AllocOp> {
-  using OpRewritePattern::OpRewritePattern;
-
-  CUFAllocOpConversion(mlir::MLIRContext *context, mlir::DataLayout *dl,
-                       const fir::LLVMTypeConverter *typeConverter)
-      : OpRewritePattern(context), dl{dl}, typeConverter{typeConverter} {}
-
-  mlir::LogicalResult
-  matchAndRewrite(cuf::AllocOp op,
-                  mlir::PatternRewriter &rewriter) const override {
-
-    mlir::Location loc = op.getLoc();
-
-    if (inDeviceContext(op.getOperation())) {
-      // In device context just replace the cuf.alloc operation with a fir.alloc
-      // the cuf.free will be removed.
-      auto allocaOp =
-          fir::AllocaOp::create(rewriter, loc, op.getInType(),
-                                op.getUniqName() ? *op.getUniqName() : "",
-                                op.getBindcName() ? *op.getBindcName() : "",
-                                op.getTypeparams(), op.getShape());
-      allocaOp->setAttr(cuf::getDataAttrName(), op.getDataAttrAttr());
-      rewriter.replaceOp(op, allocaOp);
-      return mlir::success();
-    }
-
-    auto mod = op->getParentOfType<mlir::ModuleOp>();
-    fir::FirOpBuilder builder(rewriter, mod);
-    mlir::Value sourceFile = fir::factory::locationToFilename(builder, loc);
-
-    if (!mlir::dyn_cast_or_null<fir::BaseBoxType>(op.getInType())) {
-      // Convert scalar and known size array allocations.
-      mlir::Value bytes;
-      fir::KindMapping kindMap{fir::getKindMapping(mod)};
-      if (fir::isa_trivial(op.getInType())) {
-        int width = cuf::computeElementByteSize(loc, op.getInType(), kindMap);
-        bytes =
-            builder.createIntegerConstant(loc, builder.getIndexType(), width);
-      } else if (auto seqTy = mlir::dyn_cast_or_null<fir::SequenceType>(
-                     op.getInType())) {
-        std::size_t size = 0;
-        if (fir::isa_derived(seqTy.getEleTy())) {
-          mlir::Type structTy = typeConverter->convertType(seqTy.getEleTy());
-          size = dl->getTypeSizeInBits(structTy) / 8;
-        } else {
-          size = cuf::computeElementByteSize(loc, seqTy.getEleTy(), kindMap);
-        }
-        mlir::Value width =
-            builder.createIntegerConstant(loc, builder.getIndexType(), size);
-        mlir::Value nbElem;
-        if (fir::sequenceWithNonConstantShape(seqTy)) {
-          assert(!op.getShape().empty() && "expect shape with dynamic arrays");
-          nbElem = builder.loadIfRef(loc, op.getShape()[0]);
-          for (unsigned i = 1; i < op.getShape().size(); ++i) {
-            nbElem = mlir::arith::MulIOp::create(
-                rewriter, loc, nbElem,
-                builder.loadIfRef(loc, op.getShape()[i]));
-          }
-        } else {
-          nbElem = builder.createIntegerConstant(loc, builder.getIndexType(),
-                                                 seqTy.getConstantArraySize());
-        }
-        bytes = mlir::arith::MulIOp::create(rewriter, loc, nbElem, width);
-      } else if (fir::isa_derived(op.getInType())) {
-        mlir::Type structTy = typeConverter->convertType(op.getInType());
-        std::size_t structSize = dl->getTypeSizeInBits(structTy) / 8;
-        bytes = builder.createIntegerConstant(loc, builder.getIndexType(),
-                                              structSize);
-      } else if (fir::isa_char(op.getInType())) {
-        mlir::Type charTy = typeConverter->convertType(op.getInType());
-        std::size_t charSize = dl->getTypeSizeInBits(charTy) / 8;
-        bytes = builder.createIntegerConstant(loc, builder.getIndexType(),
-                                              charSize);
-      } else {
-        mlir::emitError(loc, "unsupported type in cuf.alloc\n");
-      }
-      mlir::func::FuncOp func =
-          fir::runtime::getRuntimeFunc<mkRTKey(CUFMemAlloc)>(loc, builder);
-      auto fTy = func.getFunctionType();
-      mlir::Value sourceLine =
-          fir::factory::locationToLineNo(builder, loc, fTy.getInput(3));
-      mlir::Value memTy = builder.createIntegerConstant(
-          loc, builder.getI32Type(), getMemType(op.getDataAttr()));
-      llvm::SmallVector<mlir::Value> args{fir::runtime::createArguments(
-          builder, loc, fTy, bytes, memTy, sourceFile, sourceLine)};
-      auto callOp = fir::CallOp::create(builder, loc, func, args);
-      callOp->setAttr(cuf::getDataAttrName(), op.getDataAttrAttr());
-      auto convOp = builder.createConvert(loc, op.getResult().getType(),
-                                          callOp.getResult(0));
-      rewriter.replaceOp(op, convOp);
-      return mlir::success();
-    }
-
-    // Convert descriptor allocations to function call.
-    auto boxTy = mlir::dyn_cast_or_null<fir::BaseBoxType>(op.getInType());
-    mlir::func::FuncOp func =
-        fir::runtime::getRuntimeFunc<mkRTKey(CUFAllocDescriptor)>(loc, builder);
-    auto fTy = func.getFunctionType();
-    mlir::Value sourceLine =
-        fir::factory::locationToLineNo(builder, loc, fTy.getInput(2));
-
-    mlir::Type structTy = typeConverter->convertBoxTypeAsStruct(boxTy);
-    std::size_t boxSize = dl->getTypeSizeInBits(structTy) / 8;
-    mlir::Value sizeInBytes =
-        builder.createIntegerConstant(loc, builder.getIndexType(), boxSize);
-
-    llvm::SmallVector<mlir::Value> args{fir::runtime::createArguments(
-        builder, loc, fTy, sizeInBytes, sourceFile, sourceLine)};
-    auto callOp = fir::CallOp::create(builder, loc, func, args);
-    callOp->setAttr(cuf::getDataAttrName(), op.getDataAttrAttr());
-    auto convOp = builder.createConvert(loc, op.getResult().getType(),
-                                        callOp.getResult(0));
-    rewriter.replaceOp(op, convOp);
-    return mlir::success();
-  }
-
-private:
-  mlir::DataLayout *dl;
-  const fir::LLVMTypeConverter *typeConverter;
-};
+static mlir::Value createConvertOp(mlir::PatternRewriter &rewriter,
+                                   mlir::Location loc, mlir::Type toTy,
+                                   mlir::Value val) {
+  if (val.getType() != toTy)
+    return fir::ConvertOp::create(rewriter, loc, toTy, val);
+  return val;
+}
 
 struct CUFDeviceAddressOpConversion
     : public mlir::OpRewritePattern<cuf::DeviceAddressOp> {
@@ -460,56 +148,6 @@ struct DeclareOpConversion : public mlir::OpRewritePattern<fir::DeclareOp> {
   const mlir::SymbolTable &symTab;
 };
 
-struct CUFFreeOpConversion : public mlir::OpRewritePattern<cuf::FreeOp> {
-  using OpRewritePattern::OpRewritePattern;
-
-  mlir::LogicalResult
-  matchAndRewrite(cuf::FreeOp op,
-                  mlir::PatternRewriter &rewriter) const override {
-    if (inDeviceContext(op.getOperation())) {
-      rewriter.eraseOp(op);
-      return mlir::success();
-    }
-
-    if (!mlir::isa<fir::ReferenceType>(op.getDevptr().getType()))
-      return failure();
-
-    auto mod = op->getParentOfType<mlir::ModuleOp>();
-    fir::FirOpBuilder builder(rewriter, mod);
-    mlir::Location loc = op.getLoc();
-    mlir::Value sourceFile = fir::factory::locationToFilename(builder, loc);
-
-    auto refTy = mlir::dyn_cast<fir::ReferenceType>(op.getDevptr().getType());
-    if (!mlir::isa<fir::BaseBoxType>(refTy.getEleTy())) {
-      mlir::func::FuncOp func =
-          fir::runtime::getRuntimeFunc<mkRTKey(CUFMemFree)>(loc, builder);
-      auto fTy = func.getFunctionType();
-      mlir::Value sourceLine =
-          fir::factory::locationToLineNo(builder, loc, fTy.getInput(3));
-      mlir::Value memTy = builder.createIntegerConstant(
-          loc, builder.getI32Type(), getMemType(op.getDataAttr()));
-      llvm::SmallVector<mlir::Value> args{fir::runtime::createArguments(
-          builder, loc, fTy, op.getDevptr(), memTy, sourceFile, sourceLine)};
-      fir::CallOp::create(builder, loc, func, args);
-      rewriter.eraseOp(op);
-      return mlir::success();
-    }
-
-    // Convert cuf.free on descriptors.
-    mlir::func::FuncOp func =
-        fir::runtime::getRuntimeFunc<mkRTKey(CUFFreeDescriptor)>(loc, builder);
-    auto fTy = func.getFunctionType();
-    mlir::Value sourceLine =
-        fir::factory::locationToLineNo(builder, loc, fTy.getInput(2));
-    llvm::SmallVector<mlir::Value> args{fir::runtime::createArguments(
-        builder, loc, fTy, op.getDevptr(), sourceFile, sourceLine)};
-    auto callOp = fir::CallOp::create(builder, loc, func, args);
-    callOp->setAttr(cuf::getDataAttrName(), op.getDataAttrAttr());
-    rewriter.eraseOp(op);
-    return mlir::success();
-  }
-};
-
 static bool isDstGlobal(cuf::DataTransferOp op) {
   if (auto declareOp = op.getDst().getDefiningOp<fir::DeclareOp>())
     if (declareOp.getMemref().getDefiningOp<fir::AddrOfOp>())
@@ -896,6 +534,8 @@ struct CUFSyncDescriptorOpConversion
 };
 
 class CUFOpConversion : public fir::impl::CUFOpConversionBase<CUFOpConversion> {
+  using CUFOpConversionBase::CUFOpConversionBase;
+
 public:
   void runOnOperation() override {
     auto *ctx = &getContext();
@@ -917,6 +557,9 @@ class CUFOpConversion : public fir::impl::CUFOpConversionBase<CUFOpConversion> {
     target.addLegalOp<cuf::StreamCastOp>();
     cuf::populateCUFToFIRConversionPatterns(typeConverter, *dl, symtab,
                                             patterns);
+    if (allocationConversion)
+      cuf::populateCUFAllocationConversionPatterns(typeConverter, *dl, symtab,
+                                                   patterns);
     if (mlir::failed(mlir::applyPartialConversion(getOperation(), target,
                                                   std::move(patterns)))) {
       mlir::emitError(mlir::UnknownLoc::get(ctx),
@@ -956,10 +599,7 @@ class CUFOpConversion : public fir::impl::CUFOpConversionBase<CUFOpConversion> {
 void cuf::populateCUFToFIRConversionPatterns(
     const fir::LLVMTypeConverter &converter, mlir::DataLayout &dl,
     const mlir::SymbolTable &symtab, mlir::RewritePatternSet &patterns) {
-  patterns.insert<CUFAllocOpConversion>(patterns.getContext(), &dl, &converter);
-  patterns.insert<CUFAllocateOpConversion, CUFDeallocateOpConversion,
-                  CUFFreeOpConversion, CUFSyncDescriptorOpConversion>(
-      patterns.getContext());
+  patterns.insert<CUFSyncDescriptorOpConversion>(patterns.getContext());
   patterns.insert<CUFDataTransferOpConversion>(patterns.getContext(), symtab,
                                                &dl, &converter);
   patterns.insert<CUFLaunchOpConversion, CUFDeviceAddressOpConversion>(
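
The allocation-related patterns removed above (CUFAllocOpConversion, CUFAllocateOpConversion, CUFDeallocateOpConversion, and CUFFreeOpConversion) are now registered only when the pass is created with the new allocationConversion option, through cuf::populateCUFAllocationConversionPatterns. The definition of that function is not part of this hunk; the following is a minimal sketch of what it could look like, assuming the relocated patterns keep the names, constructor arguments, and populate-function signature they had here:

void cuf::populateCUFAllocationConversionPatterns(
    const fir::LLVMTypeConverter &converter, mlir::DataLayout &dl,
    const mlir::SymbolTable &symtab, mlir::RewritePatternSet &patterns) {
  // Hypothetical registration mirroring the insert<> calls removed from
  // populateCUFToFIRConversionPatterns above.
  patterns.insert<CUFAllocOpConversion>(patterns.getContext(), &dl, &converter);
  patterns.insert<CUFAllocateOpConversion, CUFDeallocateOpConversion,
                  CUFFreeOpConversion>(patterns.getContext());
}

Since runOnOperation only adds these patterns when allocationConversion is set, the allocation lowering can now be staged separately from the rest of the CUF-to-FIR conversion.
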
diff --git a/flang/lib/Parser/Fortran-parsers.cpp b/flang/lib/Parser/Fortran-parsers.cpp
index fccb9d82f4fc9..988db5450abc9 100644
--- a/flang/lib/Parser/Fortran-parsers.cpp
+++ b/flang/lib/Parser/Fortran-parsers.cpp
@@ -1295,6 +1295,7 @@ TYPE_PARSER(construct<StatOrErrmsg>("STAT =" >> statVariable) ||
 // Directives, extensions, and deprecated statements
 // !DIR$ IGNORE_TKR [ [(tkrdmac...)] name ]...
 // !DIR$ LOOP COUNT (n1[, n2]...)
+// !DIR$ VECTOR VECTORLENGTH ({FIXED|SCALABLE|<num>|<num>,FIXED|<num>,SCALABLE})
 // !DIR$ name[=value] [, name[=value]]...
 // !DIR$ UNROLL [n]
 // !DIR$ PREFETCH designator[, designator]...
@@ -1311,6 +1312,15 @@ constexpr auto assumeAligned{"ASSUME_ALIGNED" >>
         indirect(designator), ":"_tok >> digitString64))};
 constexpr auto vectorAlways{
     "VECTOR ALWAYS" >> construct<CompilerDirective::VectorAlways>()};
+constexpr auto vectorLengthKind{
+    "FIXED" >> pure(CompilerDirective::VectorLength::Kind::Fixed) ||
+    "SCALABLE" >> pure(CompilerDirective::VectorLength::Kind::Scalable)};
+constexpr auto vectorLength{"VECTOR VECTORLENGTH" >>
+    parenthesized(construct<CompilerDirective::VectorLength>(
+                      digitString64, ","_tok >> vectorLengthKind) ||
+        construct<CompilerDirective::VectorLength>(pure(0), vectorLengthKind) ||
+        construct<CompilerDirective::VectorLength>(
+            digitString64, pure(CompilerDirective::VectorLength::Kind::Auto)))};
 constexpr auto unroll{
     "UNROLL" >> construct<CompilerDirective::Unroll>(maybe(digitString64))};
 constexpr auto prefetch{"PREFETCH" >>
@@ -1332,6 +1342,7 @@ TYPE_PARSER(beginDirective >> "DIR$ "_tok >>
                 construct<CompilerDirective>(loopCount) ||
                 construct<CompilerDirective>(assumeAligned) ||
                 construct<CompilerDirective>(vectorAlways) ||
+                construct<CompilerDirective>(vectorLength) ||
                 construct<CompilerDirective>(unrollAndJam) ||
                 construct<CompilerDirective>(unroll) ||
                 construct<CompilerDirective>(prefetch) ||
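
The three alternatives in the vectorLength parser above correspond to the accepted spellings of the directive, also exercised by the parser test added further down in flang/test/Parser/compiler-directives.f90: a length with a kind such as vectorlength(8,scalable), a kind alone such as vectorlength(fixed) or vectorlength(scalable) (stored with a length of 0 via pure(0)), and a bare length such as vectorlength(4) (stored with Kind::Auto).
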
diff --git a/flang/lib/Parser/openmp-parsers.cpp b/flang/lib/Parser/openmp-parsers.cpp
index bd259a9c6e01d..de26a1ba7910d 100644
--- a/flang/lib/Parser/openmp-parsers.cpp
+++ b/flang/lib/Parser/openmp-parsers.cpp
@@ -1278,7 +1278,7 @@ TYPE_PARSER(
         maybe(":"_tok >> nonemptyList(Parser<OmpLinearClause::Modifier>{})),
         /*PostModified=*/pure(true)))
 
-TYPE_PARSER(construct<OmpLoopRangeClause>(
+TYPE_PARSER(construct<OmpLooprangeClause>(
     scalarIntConstantExpr, "," >> scalarIntConstantExpr))
 
 // OpenMPv5.2 12.5.2 detach-clause -> DETACH (event-handle)
@@ -1471,7 +1471,7 @@ TYPE_PARSER( //
     "LINK" >> construct<OmpClause>(construct<OmpClause::Link>(
                   parenthesized(Parser<OmpObjectList>{}))) ||
     "LOOPRANGE" >> construct<OmpClause>(construct<OmpClause::Looprange>(
-                       parenthesized(Parser<OmpLoopRangeClause>{}))) ||
+                       parenthesized(Parser<OmpLooprangeClause>{}))) ||
     "MAP" >> construct<OmpClause>(construct<OmpClause::Map>(
                  parenthesized(Parser<OmpMapClause>{}))) ||
     "MATCH" >> construct<OmpClause>(construct<OmpClause::Match>(
diff --git a/flang/lib/Parser/unparse.cpp b/flang/lib/Parser/unparse.cpp
index 8e9c7d04bc522..ebb5148b3f614 100644
--- a/flang/lib/Parser/unparse.cpp
+++ b/flang/lib/Parser/unparse.cpp
@@ -1848,6 +1848,25 @@ class UnparseVisitor {
             [&](const CompilerDirective::VectorAlways &valways) {
               Word("!DIR$ VECTOR ALWAYS");
             },
+            [&](const CompilerDirective::VectorLength &vlength) {
+              using Kind = CompilerDirective::VectorLength::Kind;
+              std::uint64_t length = std::get<std::uint64_t>(vlength.t);
+              Kind kind = std::get<Kind>(vlength.t);
+
+              Word("!DIR$ VECTOR VECTORLENGTH (");
+              // The "|| kind == Kind::Auto" covers an explicit VECTORLENGTH(0),
+              // so the length is still printed instead of empty parentheses.
+              if (length != 0 || kind == Kind::Auto) {
+                Walk(length);
+              }
+              if (length != 0 && kind != Kind::Auto) {
+                Word(", ");
+              }
+              if (kind != Kind::Auto) {
+                Word(CompilerDirective::VectorLength::EnumToString(kind));
+              }
+              Word(")");
+            },
             [&](const std::list<CompilerDirective::NameValue> &names) {
               Walk("!DIR$ ", names, " ");
             },
@@ -2370,7 +2389,7 @@ class UnparseVisitor {
       }
     }
   }
-  void Unparse(const OmpLoopRangeClause &x) {
+  void Unparse(const OmpLooprangeClause &x) {
     Word("LOOPRANGE(");
     Walk(std::get<0>(x.t));
     Put(", ");
diff --git a/flang/lib/Semantics/canonicalize-directives.cpp b/flang/lib/Semantics/canonicalize-directives.cpp
index b21da4d041a97..f32a3d34c6572 100644
--- a/flang/lib/Semantics/canonicalize-directives.cpp
+++ b/flang/lib/Semantics/canonicalize-directives.cpp
@@ -56,6 +56,7 @@ bool CanonicalizeDirectives(
 static bool IsExecutionDirective(const parser::CompilerDirective &dir) {
   return std::holds_alternative<parser::CompilerDirective::VectorAlways>(
              dir.u) ||
+      std::holds_alternative<parser::CompilerDirective::VectorLength>(dir.u) ||
       std::holds_alternative<parser::CompilerDirective::Unroll>(dir.u) ||
       std::holds_alternative<parser::CompilerDirective::UnrollAndJam>(dir.u) ||
       std::holds_alternative<parser::CompilerDirective::NoVector>(dir.u) ||
@@ -121,6 +122,9 @@ void CanonicalizationOfDirectives::Post(parser::Block &block) {
           common::visitors{[&](parser::CompilerDirective::VectorAlways &) {
                              CheckLoopDirective(*dir, block, it);
                            },
+              [&](parser::CompilerDirective::VectorLength &) {
+                CheckLoopDirective(*dir, block, it);
+              },
               [&](parser::CompilerDirective::Unroll &) {
                 CheckLoopDirective(*dir, block, it);
               },
diff --git a/flang/lib/Semantics/check-omp-loop.cpp b/flang/lib/Semantics/check-omp-loop.cpp
index 9a78209369949..526e2d4658459 100644
--- a/flang/lib/Semantics/check-omp-loop.cpp
+++ b/flang/lib/Semantics/check-omp-loop.cpp
@@ -773,6 +773,20 @@ void OmpStructureChecker::Enter(const parser::OmpClause::Linear &x) {
   }
 }
 
+void OmpStructureChecker::Enter(const parser::OmpClause::Sizes &c) {
+  CheckAllowedClause(llvm::omp::Clause::OMPC_sizes);
+  for (const parser::Cosubscript &v : c.v)
+    RequiresPositiveParameter(llvm::omp::Clause::OMPC_sizes, v,
+        /*paramName=*/"parameter", /*allowZero=*/false);
+}
+
+void OmpStructureChecker::Enter(const parser::OmpClause::Looprange &x) {
+  CheckAllowedClause(llvm::omp::Clause::OMPC_looprange);
+  auto &[first, count]{x.v.t};
+  RequiresConstantPositiveParameter(llvm::omp::Clause::OMPC_looprange, count);
+  RequiresConstantPositiveParameter(llvm::omp::Clause::OMPC_looprange, first);
+}
+
 void OmpStructureChecker::Enter(const parser::DoConstruct &x) {
   Base::Enter(x);
   loopStack_.push_back(&x);
diff --git a/flang/lib/Semantics/check-omp-structure.cpp b/flang/lib/Semantics/check-omp-structure.cpp
index f7778472f71f1..f5883d30c492a 100644
--- a/flang/lib/Semantics/check-omp-structure.cpp
+++ b/flang/lib/Semantics/check-omp-structure.cpp
@@ -3393,21 +3393,6 @@ void OmpStructureChecker::Enter(const parser::OmpClause &x) {
   }
 }
 
-void OmpStructureChecker::Enter(const parser::OmpClause::Sizes &c) {
-  CheckAllowedClause(llvm::omp::Clause::OMPC_sizes);
-  for (const parser::Cosubscript &v : c.v)
-    RequiresPositiveParameter(llvm::omp::Clause::OMPC_sizes, v,
-        /*paramName=*/"parameter", /*allowZero=*/false);
-}
-
-void OmpStructureChecker::Enter(const parser::OmpClause::Looprange &x) {
-  CheckAllowedClause(llvm::omp::Clause::OMPC_looprange);
-  auto &first = std::get<0>(x.v.t);
-  auto &count = std::get<1>(x.v.t);
-  RequiresConstantPositiveParameter(llvm::omp::Clause::OMPC_looprange, count);
-  RequiresConstantPositiveParameter(llvm::omp::Clause::OMPC_looprange, first);
-}
-
 // Restrictions specific to each clause are implemented apart from the
 // generalized restrictions.
 
diff --git a/flang/lib/Semantics/resolve-names.cpp b/flang/lib/Semantics/resolve-names.cpp
index 2a487a6d39d51..345a0e4e8ecce 100644
--- a/flang/lib/Semantics/resolve-names.cpp
+++ b/flang/lib/Semantics/resolve-names.cpp
@@ -2153,6 +2153,8 @@ class ResolveNamesVisitor : public virtual ScopeHandler,
   void Post(const parser::AssignedGotoStmt &);
   void Post(const parser::CompilerDirective &);
 
+  bool Pre(const parser::SectionSubscript &);
+
   // These nodes should never be reached: they are handled in ProgramUnit
   bool Pre(const parser::MainProgram &) {
     llvm_unreachable("This node is handled in ProgramUnit");
@@ -10075,6 +10077,7 @@ void ResolveNamesVisitor::Post(const parser::AssignedGotoStmt &x) {
 
 void ResolveNamesVisitor::Post(const parser::CompilerDirective &x) {
   if (std::holds_alternative<parser::CompilerDirective::VectorAlways>(x.u) ||
+      std::holds_alternative<parser::CompilerDirective::VectorLength>(x.u) ||
       std::holds_alternative<parser::CompilerDirective::Unroll>(x.u) ||
       std::holds_alternative<parser::CompilerDirective::UnrollAndJam>(x.u) ||
       std::holds_alternative<parser::CompilerDirective::NoVector>(x.u) ||
@@ -10217,6 +10220,14 @@ template <typename A> std::set<SourceName> GetUses(const A &x) {
   return uses;
 }
 
+bool ResolveNamesVisitor::Pre(const parser::SectionSubscript &x) {
+  // Turn off the "in EQUIVALENCE" check while walking array subscripts,
+  // because the indices themselves are not part of the EQUIVALENCE.
+  auto restorer{common::ScopedSet(inEquivalenceStmt_, false)};
+  Walk(x.u);
+  return false;
+}
+
 bool ResolveNamesVisitor::Pre(const parser::Program &x) {
   if (Scope * hermetic{context().currentHermeticModuleFileScope()}) {
     // Processing either the dependent modules or first module of a
diff --git a/flang/test/Evaluate/bug168978.f90 b/flang/test/Evaluate/bug168978.f90
new file mode 100644
index 0000000000000..ffe77500aeba5
--- /dev/null
+++ b/flang/test/Evaluate/bug168978.f90
@@ -0,0 +1,6 @@
+!RUN: %flang_fc1 -fdebug-unparse %s 2>&1 | FileCheck %s
+subroutine sub(dd)
+  type(*)::dd(..)
+  !CHECK: PRINT *, size(lbound(dd))
+  print *, size(lbound(dd)) ! do not fold
+end
diff --git a/flang/test/Evaluate/folding03.f90 b/flang/test/Evaluate/folding03.f90
index 5b7ddd3c6c230..1d79098e085b8 100644
--- a/flang/test/Evaluate/folding03.f90
+++ b/flang/test/Evaluate/folding03.f90
@@ -83,6 +83,8 @@ module real_tests
   real(4), parameter :: r4_pinf = 1._4/0._4
   !WARN: warning: division by zero [-Wfolding-exception]
   real(4), parameter :: r4_ninf = -1._4/0._4
+  !WARN: warning: Invalid argument to SQRT() [-Wfolding-value-checks]
+  real(4), parameter :: r4_sqrtneg = sqrt(-1._4)
 
   logical, parameter :: test_r4_nan_parentheses1 = .NOT.(((r4_nan)).EQ.r4_nan)
   logical, parameter :: test_r4_nan_parentheses2 = .NOT.(((r4_nan)).LT.r4_nan)
@@ -155,6 +157,8 @@ module real_tests
   TEST_ISNAN(r4_nan_add5)
   real(4), parameter :: r4_nan_add6 = r4_nan + r4_nan
   TEST_ISNAN(r4_nan_add6)
+  real(4), parameter :: r4_nan_sqrt = sqrt(r4_nan)
+  TEST_ISNAN(r4_nan_sqrt)
 
   !WARN: warning: overflow on multiplication [-Wfolding-exception]
   logical, parameter :: test_inf_r4_mult1 = (1.5_4*r4_pmax).eq.(r4_pinf)
diff --git a/flang/test/Fir/CUDA/cuda-code-gen.mlir b/flang/test/Fir/CUDA/cuda-code-gen.mlir
index 60cda9e98c7d8..e83648f21bdf1 100644
--- a/flang/test/Fir/CUDA/cuda-code-gen.mlir
+++ b/flang/test/Fir/CUDA/cuda-code-gen.mlir
@@ -201,9 +201,9 @@ func.func @_QMm1Psub1(%arg0: !fir.box<!fir.array<?xi32>> {cuf.data_attr = #cuf.c
 
 // -----
 
-fir.global common @_QPshared_static__shared_mem(dense<0> : vector<28xi8>) {alignment = 8 : i64, data_attr = #cuf.cuda<shared>} : !fir.array<28xi8>
+fir.global common @_QPshared_static__shared_mem__(dense<0> : vector<28xi8>) {alignment = 8 : i64, data_attr = #cuf.cuda<shared>} : !fir.array<28xi8>
 
-// CHECK: llvm.mlir.global common @_QPshared_static__shared_mem(dense<0> : vector<28xi8>) {addr_space = 3 : i32, alignment = 8 : i64} : !llvm.array<28 x i8>
+// CHECK: llvm.mlir.global common @_QPshared_static__shared_mem__(dense<0> : vector<28xi8>) {addr_space = 3 : i32, alignment = 8 : i64} : !llvm.array<28 x i8>
 
 // -----
 
diff --git a/flang/test/Fir/CUDA/cuda-shared-offset.mlir b/flang/test/Fir/CUDA/cuda-shared-offset.mlir
index 37b36b2bd050e..1a39fefe85cda 100644
--- a/flang/test/Fir/CUDA/cuda-shared-offset.mlir
+++ b/flang/test/Fir/CUDA/cuda-shared-offset.mlir
@@ -17,7 +17,7 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<!llvm.ptr, dense<
 // CHECK: %{{.*}} = cuf.shared_memory[%c0{{.*}} : i32] !fir.array<?xf32>, %{{.*}} : index {bindc_name = "r", uniq_name = "_QFdynsharedEr"} -> !fir.ref<!fir.array<?xf32>>       
 // CHECK: gpu.return
 // CHECK: }
-// CHECK: fir.global external @_QPdynshared__shared_mem {alignment = 4 : i64, data_attr = #cuf.cuda<shared>} : !fir.array<0xi8>
+// CHECK: fir.global external @_QPdynshared__shared_mem__ {alignment = 4 : i64, data_attr = #cuf.cuda<shared>} : !fir.array<0xi8>
 
 // -----
 
@@ -43,15 +43,20 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<!llvm.ptr, dense<
 
 // CHECK-LABEL: gpu.module @cuda_device_mod
 // CHECK: gpu.func @_QPshared_static()
-// CHECK: cuf.shared_memory[%c0{{.*}} : i32] i32 {bindc_name = "a", uniq_name = "_QFshared_staticEa"} -> !fir.ref<i32>      
-// CHECK: cuf.shared_memory[%c4{{.*}} : i32] i32 {bindc_name = "b", uniq_name = "_QFshared_staticEb"} -> !fir.ref<i32>
-// CHECK: cuf.shared_memory[%c8{{.*}} : i32] i32 {bindc_name = "c", uniq_name = "_QFshared_staticEc"} -> !fir.ref<i32>
-// CHECK: cuf.shared_memory[%c12{{.*}} : i32] i32 {bindc_name = "d", uniq_name = "_QFshared_staticEd"} -> !fir.ref<i32>
-// CHECK: cuf.shared_memory[%c16{{.*}} : i32] i64 {bindc_name = "e", uniq_name = "_QFshared_staticEe"} -> !fir.ref<i64>
-// CHECK: cuf.shared_memory[%c24{{.*}} : i32] f32 {bindc_name = "r", uniq_name = "_QFshared_staticEr"} -> !fir.ref<f32>
+// CHECK: cuf.shared_memory[%c0{{.*}} : i32] i32 align 4 {bindc_name = "a", isStatic, uniq_name = "_QFshared_staticEa"} -> !fir.ref<i32>      
+// CHECK: cuf.shared_memory[%c0{{.*}} : i32] i32 align 4 {bindc_name = "b", isStatic, uniq_name = "_QFshared_staticEb"} -> !fir.ref<i32>
+// CHECK: cuf.shared_memory[%c0{{.*}} : i32] i32 align 4 {bindc_name = "c", isStatic, uniq_name = "_QFshared_staticEc"} -> !fir.ref<i32>
+// CHECK: cuf.shared_memory[%c0{{.*}} : i32] i32 align 4 {bindc_name = "d", isStatic, uniq_name = "_QFshared_staticEd"} -> !fir.ref<i32>
+// CHECK: cuf.shared_memory[%c0{{.*}} : i32] i64 align 8 {bindc_name = "e", isStatic, uniq_name = "_QFshared_staticEe"} -> !fir.ref<i64>
+// CHECK: cuf.shared_memory[%c0{{.*}} : i32] f32 align 4 {bindc_name = "r", isStatic, uniq_name = "_QFshared_staticEr"} -> !fir.ref<f32>
 // CHECK: gpu.return
 // CHECK: }
-// CHECK: fir.global internal @_QPshared_static__shared_mem(dense<0> : vector<28xi8>) {alignment = 8 : i64, data_attr = #cuf.cuda<shared>} : !fir.array<28xi8>
+// CHECK: fir.global internal @_QPshared_static__shared_mem__a(dense<0> : vector<4xi8>) {alignment = 4 : i64, data_attr = #cuf.cuda<shared>} : !fir.array<4xi8>
+// CHECK: fir.global internal @_QPshared_static__shared_mem__b(dense<0> : vector<4xi8>) {alignment = 4 : i64, data_attr = #cuf.cuda<shared>} : !fir.array<4xi8>
+// CHECK: fir.global internal @_QPshared_static__shared_mem__c(dense<0> : vector<4xi8>) {alignment = 4 : i64, data_attr = #cuf.cuda<shared>} : !fir.array<4xi8>
+// CHECK: fir.global internal @_QPshared_static__shared_mem__d(dense<0> : vector<4xi8>) {alignment = 4 : i64, data_attr = #cuf.cuda<shared>} : !fir.array<4xi8>
+// CHECK: fir.global internal @_QPshared_static__shared_mem__e(dense<0> : vector<8xi8>) {alignment = 8 : i64, data_attr = #cuf.cuda<shared>} : !fir.array<8xi8>
+// CHECK: fir.global internal @_QPshared_static__shared_mem__r(dense<0> : vector<4xi8>) {alignment = 4 : i64, data_attr = #cuf.cuda<shared>} : !fir.array<4xi8>
 // CHECK: }
 // CHECK: }
 
@@ -159,4 +164,4 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<!llvm.ptr, dense<
 // CHECK: %{{.*}} = cuf.shared_memory[%c0{{.*}} : i32] !fir.array<?xf64>, %{{.*}} : index {bindc_name = "dmasks", uniq_name = "_QMmtestsFtestanyEdmasks"} -> !fir.ref<!fir.array<?xf64>>
 // CHECK: %{{.*}} = cuf.shared_memory[%c0{{.*}} : i32] !fir.array<?xf32>, %{{.*}} : index {bindc_name = "smasks", uniq_name = "_QMmtestsFtestanyEsmasks"} -> !fir.ref<!fir.array<?xf32>>
 
-// CHECK: fir.global external @_QMmtestsPtestany__shared_mem {alignment = 8 : i64, data_attr = #cuf.cuda<shared>} : !fir.array<0xi8>
+// CHECK: fir.global external @_QMmtestsPtestany__shared_mem__ {alignment = 8 : i64, data_attr = #cuf.cuda<shared>} : !fir.array<0xi8>
diff --git a/flang/test/Fir/CUDA/cuda-shared-to-llvm.mlir b/flang/test/Fir/CUDA/cuda-shared-to-llvm.mlir
index 26479d1cdd94f..69370613cd348 100644
--- a/flang/test/Fir/CUDA/cuda-shared-to-llvm.mlir
+++ b/flang/test/Fir/CUDA/cuda-shared-to-llvm.mlir
@@ -9,14 +9,14 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<!llvm.ptr, dense<
       %1 = cuf.shared_memory [%c4 : i32] i32 {bindc_name = "b", uniq_name = "_QFshared_staticEb"} -> !fir.ref<i32>
       llvm.return
     }
-    llvm.mlir.global common @_QPshared_static__shared_mem(dense<0> : vector<28xi8>) {addr_space = 3 : i32, alignment = 8 : i64} : !llvm.array<28 x i8>
+    llvm.mlir.global common @_QPshared_static__shared_mem__(dense<0> : vector<28xi8>) {addr_space = 3 : i32, alignment = 8 : i64} : !llvm.array<28 x i8>
   }
 }
 
 // CHECK-LABEL: llvm.func @_QPshared_static()
-// CHECK: %[[ADDR0:.*]] = llvm.mlir.addressof @_QPshared_static__shared_mem : !llvm.ptr<3>
+// CHECK: %[[ADDR0:.*]] = llvm.mlir.addressof @_QPshared_static__shared_mem__ : !llvm.ptr<3>
 // CHECK: %[[ADDRCAST0:.*]] = llvm.addrspacecast %[[ADDR0]] : !llvm.ptr<3> to !llvm.ptr
 // CHECK: %[[A:.*]] = llvm.getelementptr %[[ADDRCAST0]][%c0{{.*}}] : (!llvm.ptr, i32) -> !llvm.ptr, i8
-// CHECK: %[[ADDR1:.*]] = llvm.mlir.addressof @_QPshared_static__shared_mem : !llvm.ptr<3>
+// CHECK: %[[ADDR1:.*]] = llvm.mlir.addressof @_QPshared_static__shared_mem__ : !llvm.ptr<3>
 // CHECK: %[[ADDRCAST1:.*]] = llvm.addrspacecast %[[ADDR1]] : !llvm.ptr<3> to !llvm.ptr
 // CHECK: %[[B:.*]] = llvm.getelementptr %[[ADDRCAST1]][%c4{{.*}}] : (!llvm.ptr, i32) -> !llvm.ptr, i8
diff --git a/flang/test/Fir/OpenACC/pointer-like-interface-load.mlir b/flang/test/Fir/OpenACC/pointer-like-interface-load.mlir
new file mode 100644
index 0000000000000..170ea56b24742
--- /dev/null
+++ b/flang/test/Fir/OpenACC/pointer-like-interface-load.mlir
@@ -0,0 +1,95 @@
+// RUN: fir-opt %s --split-input-file --pass-pipeline="builtin.module(func.func(test-acc-pointer-like-interface{test-mode=load}))" 2>&1 | FileCheck %s
+
+func.func @test_load_scalar_f32() {
+  %ptr = fir.alloca f32 {test.ptr}
+  // CHECK: Successfully generated load for operation: %{{.*}} = fir.alloca f32 {test.ptr}
+  // CHECK: Loaded value type: f32
+  // CHECK: Generated: %{{.*}} = fir.load %{{.*}} : !fir.ref<f32>
+  return
+}
+
+// -----
+
+func.func @test_load_scalar_i32() {
+  %ptr = fir.alloca i32 {test.ptr}
+  // CHECK: Successfully generated load for operation: %{{.*}} = fir.alloca i32 {test.ptr}
+  // CHECK: Loaded value type: i32
+  // CHECK: Generated: %{{.*}} = fir.load %{{.*}} : !fir.ref<i32>
+  return
+}
+
+// -----
+
+func.func @test_load_scalar_i64() {
+  %ptr = fir.alloca i64 {test.ptr}
+  // CHECK: Successfully generated load for operation: %{{.*}} = fir.alloca i64 {test.ptr}
+  // CHECK: Loaded value type: i64
+  // CHECK: Generated: %{{.*}} = fir.load %{{.*}} : !fir.ref<i64>
+  return
+}
+
+// -----
+
+func.func @test_load_heap_scalar() {
+  %ptr = fir.allocmem f64 {test.ptr}
+  // CHECK: Successfully generated load for operation: %{{.*}} = fir.allocmem f64 {test.ptr}
+  // CHECK: Loaded value type: f64
+  // CHECK: Generated: %{{.*}} = fir.load %{{.*}} : !fir.heap<f64>
+  return
+}
+
+// -----
+
+func.func @test_load_logical() {
+  %ptr = fir.alloca !fir.logical<4> {test.ptr}
+  // CHECK: Successfully generated load for operation: %{{.*}} = fir.alloca !fir.logical<4> {test.ptr}
+  // CHECK: Loaded value type: !fir.logical<4>
+  // CHECK: Generated: %{{.*}} = fir.load %{{.*}} : !fir.ref<!fir.logical<4>>
+  return
+}
+
+// -----
+
+func.func @test_load_derived_type() {
+  %ptr = fir.alloca !fir.type<_QTt{i:i32}> {test.ptr}
+  // CHECK: Successfully generated load for operation: %{{.*}} = fir.alloca !fir.type<_QTt{i:i32}> {test.ptr}
+  // CHECK: Loaded value type: !fir.type<_QTt{i:i32}>
+  // CHECK: Generated: %{{.*}} = fir.load %{{.*}} : !fir.ref<!fir.type<_QTt{i:i32}>>
+  return
+}
+
+// -----
+
+func.func @test_load_constant_array() {
+  %ptr = fir.alloca !fir.array<10xf32> {test.ptr}
+  // CHECK: Successfully generated load for operation: %{{.*}} = fir.alloca !fir.array<10xf32> {test.ptr}
+  // CHECK: Loaded value type: !fir.array<10xf32>
+  // CHECK: Generated: %{{.*}} = fir.load %{{.*}} : !fir.ref<!fir.array<10xf32>>
+  return
+}
+
+// -----
+
+func.func @test_load_dynamic_array_fails() {
+  %c10 = arith.constant 10 : index
+  %ptr = fir.alloca !fir.array<?xf32>, %c10 {test.ptr}
+  // CHECK: Failed to generate load for operation: %{{.*}} = fir.alloca !fir.array<?xf32>
+  return
+}
+
+// -----
+
+func.func @test_load_box_fails() {
+  %ptr = fir.alloca !fir.box<!fir.ptr<f32>> {test.ptr}
+  // CHECK: Failed to generate load for operation: %{{.*}} = fir.alloca !fir.box<!fir.ptr<f32>>
+  return
+}
+
+// -----
+
+func.func @test_load_unlimited_polymorphic_fails() {
+  %ptr = fir.alloca !fir.class<none> {test.ptr}
+  // CHECK: Failed to generate load for operation: %{{.*}} = fir.alloca !fir.class<none>
+  return
+}
+
diff --git a/flang/test/Fir/OpenACC/pointer-like-interface-store.mlir b/flang/test/Fir/OpenACC/pointer-like-interface-store.mlir
new file mode 100644
index 0000000000000..5ea4f0e750c65
--- /dev/null
+++ b/flang/test/Fir/OpenACC/pointer-like-interface-store.mlir
@@ -0,0 +1,85 @@
+// RUN: fir-opt %s --split-input-file --pass-pipeline="builtin.module(func.func(test-acc-pointer-like-interface{test-mode=store}))" 2>&1 | FileCheck %s
+
+func.func @test_store_scalar_f32() {
+  %ptr = fir.alloca f32 {test.ptr}
+  // CHECK: Successfully generated store for operation: %{{.*}} = fir.alloca f32 {test.ptr}
+  // CHECK: Generated: %[[VAL:.*]] = arith.constant 4.200000e+01 : f32
+  // CHECK: Generated: fir.store %[[VAL]] to %{{.*}} : !fir.ref<f32>
+  return
+}
+
+// -----
+
+func.func @test_store_scalar_i32() {
+  %ptr = fir.alloca i32 {test.ptr}
+  // CHECK: Successfully generated store for operation: %{{.*}} = fir.alloca i32 {test.ptr}
+  // CHECK: Generated: %[[VAL:.*]] = arith.constant 42 : i32
+  // CHECK: Generated: fir.store %[[VAL]] to %{{.*}} : !fir.ref<i32>
+  return
+}
+
+// -----
+
+func.func @test_store_scalar_i64() {
+  %ptr = fir.alloca i64 {test.ptr}
+  // CHECK: Successfully generated store for operation: %{{.*}} = fir.alloca i64 {test.ptr}
+  // CHECK: Generated: %[[VAL:.*]] = arith.constant 42 : i64
+  // CHECK: Generated: fir.store %[[VAL]] to %{{.*}} : !fir.ref<i64>
+  return
+}
+
+// -----
+
+func.func @test_store_heap_scalar() {
+  %ptr = fir.allocmem f64 {test.ptr}
+  // CHECK: Successfully generated store for operation: %{{.*}} = fir.allocmem f64 {test.ptr}
+  // CHECK: Generated: %[[VAL:.*]] = arith.constant 4.200000e+01 : f64
+  // CHECK: Generated: fir.store %[[VAL]] to %{{.*}} : !fir.heap<f64>
+  return
+}
+
+// -----
+
+func.func @test_store_with_type_conversion() {
+  %ptr = fir.alloca i32 {test.ptr}
+  // CHECK: Successfully generated store for operation: %{{.*}} = fir.alloca i32 {test.ptr}
+  // CHECK: Generated: %[[VAL:.*]] = arith.constant 42 : i32
+  // CHECK: Generated: fir.store %[[VAL]] to %{{.*}} : !fir.ref<i32>
+  return
+}
+
+// -----
+
+func.func @test_store_constant_array() {
+  %val = fir.undefined !fir.array<10xf32> {test.value}
+  %ptr = fir.alloca !fir.array<10xf32> {test.ptr}
+  // CHECK: Successfully generated store for operation: %{{.*}} = fir.alloca !fir.array<10xf32> {test.ptr}
+  // CHECK: Generated: fir.store %{{.*}} to %{{.*}} : !fir.ref<!fir.array<10xf32>>
+  return
+}
+
+// -----
+
+func.func @test_store_dynamic_array_fails() {
+  %c10 = arith.constant 10 : index
+  %ptr = fir.alloca !fir.array<?xf32>, %c10 {test.ptr}
+  // CHECK: Failed to generate store for operation: %{{.*}} = fir.alloca !fir.array<?xf32>
+  return
+}
+
+// -----
+
+func.func @test_store_box_fails() {
+  %ptr = fir.alloca !fir.box<!fir.ptr<f32>> {test.ptr}
+  // CHECK: Failed to generate store for operation: %{{.*}} = fir.alloca !fir.box<!fir.ptr<f32>>
+  return
+}
+
+// -----
+
+func.func @test_store_unlimited_polymorphic_fails() {
+  %ptr = fir.alloca !fir.class<none> {test.ptr}
+  // CHECK: Failed to generate store for operation: %{{.*}} = fir.alloca !fir.class<none>
+  return
+}
+
diff --git a/flang/test/Fir/fir-ops.fir b/flang/test/Fir/fir-ops.fir
index 0892eb9fa0de8..8336b6d89e721 100644
--- a/flang/test/Fir/fir-ops.fir
+++ b/flang/test/Fir/fir-ops.fir
@@ -467,6 +467,13 @@ fir.type_info @cpinfo : !fir.type<cpinfo{comp_i:!fir.array<10x20xi32>}> componen
   fir.dt_component "component_info" lbs [2, 3]
 }
 
+// CHECK-LABEL: fir.type_info @abstract_dispatch_tbl abstract : !fir.type<abstract_dispatch_tbl{i:i32}> dispatch_table {
+// CHECK: fir.dt_entry "deferred_method", @deferred_impl deferred
+// CHECK: }
+fir.type_info @abstract_dispatch_tbl abstract : !fir.type<abstract_dispatch_tbl{i:i32}> dispatch_table {
+  fir.dt_entry "deferred_method", @deferred_impl deferred
+}
+
 // CHECK-LABEL: func @compare_complex(
 // CHECK-SAME: [[VAL_151:%.*]]: complex<f128>, [[VAL_152:%.*]]: complex<f128>) {
 func.func @compare_complex(%a : complex<f128>, %b : complex<f128>) {
diff --git a/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90 b/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
index cf77c46346b7f..fd59d39b552da 100644
--- a/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
+++ b/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
@@ -174,10 +174,13 @@ subroutine worst_case(a, b, c, d)
 ! CHECK-NEXT:    br label %omp.par.pre_finalize
 
 ! CHECK:       omp.par.pre_finalize:                             ; preds = %reduce.finalize
+! CHECK-NEXT:    br label %.fini
+
+! CHECK:       .fini:
 ! CHECK-NEXT:    %{{.*}} = load ptr, ptr
 ! CHECK-NEXT:    br label %omp.reduction.cleanup
 
-! CHECK:       omp.reduction.cleanup:                            ; preds = %omp.par.pre_finalize
+! CHECK:       omp.reduction.cleanup:                            ; preds = %.fini
 !                [null check]
 ! CHECK:         br i1 %{{.*}}, label %omp.reduction.cleanup43, label %omp.reduction.cleanup44
 
diff --git a/flang/test/Lower/CUDA/cuda-device-proc.cuf b/flang/test/Lower/CUDA/cuda-device-proc.cuf
index 434322ea22265..1e3c66307c334 100644
--- a/flang/test/Lower/CUDA/cuda-device-proc.cuf
+++ b/flang/test/Lower/CUDA/cuda-device-proc.cuf
@@ -538,11 +538,12 @@ end subroutine
 ! CHECK-LABEL: func.func @_QPtest_tma_bulk_load_c4
 ! CHECK: %[[BARRIER:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<shared>, uniq_name = "_QFtest_tma_bulk_load_c4Ebarrier1"} : (!fir.ref<i64>) -> (!fir.ref<i64>, !fir.ref<i64>)
 ! CHECK: %[[ELEM_COUNT:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<device>, uniq_name = "_QFtest_tma_bulk_load_c4Eelem_count"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+! CHECK: cuf.shared_memory !fir.array<1024xcomplex<f32>> align 16 {bindc_name = "tmp", uniq_name = "_QFtest_tma_bulk_load_c4Etmp"} -> !fir.ref<!fir.array<1024xcomplex<f32>>>
 ! CHECK: %[[COUNT:.*]] = fir.load %[[ELEM_COUNT]]#0 : !fir.ref<i32>
 ! CHECK: %[[ELEM_SIZE:.*]] = arith.constant 8 : i32
 ! CHECK: %[[SIZE:.*]] = arith.muli %[[COUNT]], %[[ELEM_SIZE]] : i32
 ! CHECK: %[[BARRIER_PTR:.*]] = fir.convert %[[BARRIER]]#0 : (!fir.ref<i64>) -> !llvm.ptr
-! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr, !llvm.ptr, i32, !llvm.ptr)
+! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr<3>, !llvm.ptr<3>, i32, !llvm.ptr)
 ! CHECK: nvvm.inline_ptx "mbarrier.expect_tx.relaxed.cta.shared::cta.b64 [%0], %1;" ro(%[[BARRIER_PTR]], %[[SIZE]] : !llvm.ptr, i32)
 
 attributes(global) subroutine test_tma_bulk_load_c8(a, n)
@@ -557,11 +558,12 @@ end subroutine
 ! CHECK-LABEL: func.func @_QPtest_tma_bulk_load_c8
 ! CHECK: %[[BARRIER:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<shared>, uniq_name = "_QFtest_tma_bulk_load_c8Ebarrier1"} : (!fir.ref<i64>) -> (!fir.ref<i64>, !fir.ref<i64>)
 ! CHECK: %[[ELEM_COUNT:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<device>, uniq_name = "_QFtest_tma_bulk_load_c8Eelem_count"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+! CHECK: cuf.shared_memory !fir.array<1024xcomplex<f64>> align 16 {bindc_name = "tmp", uniq_name = "_QFtest_tma_bulk_load_c8Etmp"} -> !fir.ref<!fir.array<1024xcomplex<f64>>>
 ! CHECK: %[[COUNT:.*]] = fir.load %[[ELEM_COUNT]]#0 : !fir.ref<i32>
 ! CHECK: %[[ELEM_SIZE:.*]] = arith.constant 16 : i32
 ! CHECK: %[[SIZE:.*]] = arith.muli %[[COUNT]], %[[ELEM_SIZE]] : i32
 ! CHECK: %[[BARRIER_PTR:.*]] = fir.convert %[[BARRIER]]#0 : (!fir.ref<i64>) -> !llvm.ptr
-! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr, !llvm.ptr, i32, !llvm.ptr)
+! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr<3>, !llvm.ptr<3>, i32, !llvm.ptr)
 ! CHECK: nvvm.inline_ptx "mbarrier.expect_tx.relaxed.cta.shared::cta.b64 [%0], %1;" ro(%[[BARRIER_PTR]], %[[SIZE]] : !llvm.ptr, i32)
 
 attributes(global) subroutine test_tma_bulk_load_i4(a, n)
@@ -576,11 +578,12 @@ end subroutine
 ! CHECK-LABEL: func.func @_QPtest_tma_bulk_load_i4
 ! CHECK: %[[BARRIER:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<shared>, uniq_name = "_QFtest_tma_bulk_load_i4Ebarrier1"} : (!fir.ref<i64>) -> (!fir.ref<i64>, !fir.ref<i64>)
 ! CHECK: %[[ELEM_COUNT:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<device>, uniq_name = "_QFtest_tma_bulk_load_i4Eelem_count"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+! CHECK: cuf.shared_memory !fir.array<1024xi32> align 16 {bindc_name = "tmp", uniq_name = "_QFtest_tma_bulk_load_i4Etmp"} -> !fir.ref<!fir.array<1024xi32>>
 ! CHECK: %[[COUNT:.*]] = fir.load %[[ELEM_COUNT]]#0 : !fir.ref<i32>
 ! CHECK: %[[ELEM_SIZE:.*]] = arith.constant 4 : i32
 ! CHECK: %[[SIZE:.*]] = arith.muli %[[COUNT]], %[[ELEM_SIZE]] : i32
 ! CHECK: %[[BARRIER_PTR:.*]] = fir.convert %[[BARRIER]]#0 : (!fir.ref<i64>) -> !llvm.ptr
-! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr, !llvm.ptr, i32, !llvm.ptr)
+! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr<3>, !llvm.ptr<3>, i32, !llvm.ptr)
 ! CHECK: nvvm.inline_ptx "mbarrier.expect_tx.relaxed.cta.shared::cta.b64 [%0], %1;" ro(%[[BARRIER_PTR]], %[[SIZE]] : !llvm.ptr, i32)
 
 attributes(global) subroutine test_tma_bulk_load_i8(a, n)
@@ -595,11 +598,12 @@ end subroutine
 ! CHECK-LABEL: func.func @_QPtest_tma_bulk_load_i8
 ! CHECK: %[[BARRIER:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<shared>, uniq_name = "_QFtest_tma_bulk_load_i8Ebarrier1"} : (!fir.ref<i64>) -> (!fir.ref<i64>, !fir.ref<i64>)
 ! CHECK: %[[ELEM_COUNT:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<device>, uniq_name = "_QFtest_tma_bulk_load_i8Eelem_count"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+! CHECK: cuf.shared_memory !fir.array<1024xi64> align 16 {bindc_name = "tmp", uniq_name = "_QFtest_tma_bulk_load_i8Etmp"} -> !fir.ref<!fir.array<1024xi64>>
 ! CHECK: %[[COUNT:.*]] = fir.load %[[ELEM_COUNT]]#0 : !fir.ref<i32>
 ! CHECK: %[[ELEM_SIZE:.*]] = arith.constant 8 : i32
 ! CHECK: %[[SIZE:.*]] = arith.muli %[[COUNT]], %[[ELEM_SIZE]] : i32
 ! CHECK: %[[BARRIER_PTR:.*]] = fir.convert %[[BARRIER]]#0 : (!fir.ref<i64>) -> !llvm.ptr
-! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr, !llvm.ptr, i32, !llvm.ptr)
+! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr<3>, !llvm.ptr<3>, i32, !llvm.ptr)
 ! CHECK: nvvm.inline_ptx "mbarrier.expect_tx.relaxed.cta.shared::cta.b64 [%0], %1;" ro(%[[BARRIER_PTR]], %[[SIZE]] : !llvm.ptr, i32)
 
 attributes(global) subroutine test_tma_bulk_load_r2(a, n)
@@ -614,11 +618,12 @@ end subroutine
 ! CHECK-LABEL: func.func @_QPtest_tma_bulk_load_r2
 ! CHECK: %[[BARRIER:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<shared>, uniq_name = "_QFtest_tma_bulk_load_r2Ebarrier1"} : (!fir.ref<i64>) -> (!fir.ref<i64>, !fir.ref<i64>)
 ! CHECK: %[[ELEM_COUNT:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<device>, uniq_name = "_QFtest_tma_bulk_load_r2Eelem_count"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+! CHECK: cuf.shared_memory !fir.array<1024xf16> align 16 {bindc_name = "tmp", uniq_name = "_QFtest_tma_bulk_load_r2Etmp"} -> !fir.ref<!fir.array<1024xf16>>
 ! CHECK: %[[COUNT:.*]] = fir.load %[[ELEM_COUNT]]#0 : !fir.ref<i32>
 ! CHECK: %[[ELEM_SIZE:.*]] = arith.constant 2 : i32
 ! CHECK: %[[SIZE:.*]] = arith.muli %[[COUNT]], %[[ELEM_SIZE]] : i32
 ! CHECK: %[[BARRIER_PTR:.*]] = fir.convert %[[BARRIER]]#0 : (!fir.ref<i64>) -> !llvm.ptr
-! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr, !llvm.ptr, i32, !llvm.ptr)
+! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr<3>, !llvm.ptr<3>, i32, !llvm.ptr)
 ! CHECK: nvvm.inline_ptx "mbarrier.expect_tx.relaxed.cta.shared::cta.b64 [%0], %1;" ro(%[[BARRIER_PTR]], %[[SIZE]] : !llvm.ptr, i32)
 
 attributes(global) subroutine test_tma_bulk_load_r4(a, n)
@@ -633,11 +638,12 @@ end subroutine
 ! CHECK-LABEL: func.func @_QPtest_tma_bulk_load_r4
 ! CHECK: %[[BARRIER:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<shared>, uniq_name = "_QFtest_tma_bulk_load_r4Ebarrier1"} : (!fir.ref<i64>) -> (!fir.ref<i64>, !fir.ref<i64>)
 ! CHECK: %[[ELEM_COUNT:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<device>, uniq_name = "_QFtest_tma_bulk_load_r4Eelem_count"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+! CHECK: cuf.shared_memory !fir.array<1024xf32> align 16 {bindc_name = "tmp", uniq_name = "_QFtest_tma_bulk_load_r4Etmp"} -> !fir.ref<!fir.array<1024xf32>>
 ! CHECK: %[[COUNT:.*]] = fir.load %[[ELEM_COUNT]]#0 : !fir.ref<i32>
 ! CHECK: %[[ELEM_SIZE:.*]] = arith.constant 4 : i32
 ! CHECK: %[[SIZE:.*]] = arith.muli %[[COUNT]], %[[ELEM_SIZE]] : i32
 ! CHECK: %[[BARRIER_PTR:.*]] = fir.convert %[[BARRIER]]#0 : (!fir.ref<i64>) -> !llvm.ptr
-! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr, !llvm.ptr, i32, !llvm.ptr)
+! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr<3>, !llvm.ptr<3>, i32, !llvm.ptr)
 ! CHECK: nvvm.inline_ptx "mbarrier.expect_tx.relaxed.cta.shared::cta.b64 [%0], %1;" ro(%[[BARRIER_PTR]], %[[SIZE]] : !llvm.ptr, i32)
 
 attributes(global) subroutine test_tma_bulk_load_r8(a, n)
@@ -652,11 +658,12 @@ end subroutine
 ! CHECK-LABEL: func.func @_QPtest_tma_bulk_load_r8
 ! CHECK: %[[BARRIER:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<shared>, uniq_name = "_QFtest_tma_bulk_load_r8Ebarrier1"} : (!fir.ref<i64>) -> (!fir.ref<i64>, !fir.ref<i64>)
 ! CHECK: %[[ELEM_COUNT:.*]]:2 = hlfir.declare %{{.*}} {data_attr = #cuf.cuda<device>, uniq_name = "_QFtest_tma_bulk_load_r8Eelem_count"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+! CHECK: cuf.shared_memory !fir.array<1024xf64> align 16 {bindc_name = "tmp", uniq_name = "_QFtest_tma_bulk_load_r8Etmp"} -> !fir.ref<!fir.array<1024xf64>>
 ! CHECK: %[[COUNT:.*]] = fir.load %[[ELEM_COUNT]]#0 : !fir.ref<i32>
 ! CHECK: %[[ELEM_SIZE:.*]] = arith.constant 8 : i32
 ! CHECK: %[[SIZE:.*]] = arith.muli %[[COUNT]], %[[ELEM_SIZE]] : i32
 ! CHECK: %[[BARRIER_PTR:.*]] = fir.convert %[[BARRIER]]#0 : (!fir.ref<i64>) -> !llvm.ptr
-! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr, !llvm.ptr, i32, !llvm.ptr)
+! CHECK: nvvm.inline_ptx "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];" ro(%{{.*}}, %{{.*}}, %[[SIZE]], %[[BARRIER_PTR]] : !llvm.ptr<3>, !llvm.ptr<3>, i32, !llvm.ptr)
 ! CHECK: nvvm.inline_ptx "mbarrier.expect_tx.relaxed.cta.shared::cta.b64 [%0], %1;" ro(%[[BARRIER_PTR]], %[[SIZE]] : !llvm.ptr, i32)
 
 attributes(global) subroutine test_tma_bulk_store_c4(c, n)
diff --git a/flang/test/Lower/Intrinsics/rand.f90 b/flang/test/Lower/Intrinsics/rand.f90
new file mode 100644
index 0000000000000..1ff5bb6509889
--- /dev/null
+++ b/flang/test/Lower/Intrinsics/rand.f90
@@ -0,0 +1,41 @@
+! RUN: bbc -emit-hlfir %s -o - | FileCheck --check-prefixes=CHECK %s
+! RUN: %flang_fc1 -emit-hlfir %s -o - | FileCheck --check-prefixes=CHECK %s
+
+! CHECK-LABEL: func @_QPtest_srand(
+subroutine test_srand()
+  integer :: seed = 0
+  call srand(seed)
+  ! CHECK: %[[VAL_0:.*]] = fir.address_of(@_QFtest_srandEseed) : !fir.ref<i32> 
+  ! CHECK: %[[VAL_1:.*]]:2 = hlfir.declare %[[VAL_0]] {uniq_name = "_QFtest_srandEseed"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+  ! CHECK: fir.call @_QPsrand(%[[VAL_1]]#0) fastmath<contract> : (!fir.ref<i32>) -> () 
+  ! CHECK: return
+end subroutine test_srand
+
+! CHECK-LABEL: func @_QPtest_irand(
+subroutine test_irand()
+  integer :: seed = 0
+  integer :: result
+  result = irand(seed)
+  ! CHECK: %[[VAL_0:.*]] = fir.alloca i32 {bindc_name = "result", uniq_name = "_QFtest_irandEresult"} 
+  ! CHECK: %[[VAL_1:.*]]:2 = hlfir.declare %[[VAL_0]] {uniq_name = "_QFtest_irandEresult"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+  ! CHECK: %[[VAL_2:.*]] = fir.address_of(@_QFtest_irandEseed) : !fir.ref<i32> 
+  ! CHECK: %[[VAL_3:.*]]:2 = hlfir.declare %[[VAL_2]] {uniq_name = "_QFtest_irandEseed"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+  ! CHECK: %[[VAL_4:.*]] = fir.call @_FortranAIrand(%[[VAL_3]]#0) fastmath<contract> : (!fir.ref<i32>) -> i32
+  ! CHECK: hlfir.assign %[[VAL_4]] to %[[VAL_1]]#0 : i32, !fir.ref<i32>
+  ! CHECK: return
+end subroutine test_irand
+
+! CHECK-LABEL: func @_QPtest_rand(
+subroutine test_rand()
+  integer :: seed = 0
+  real :: result
+  result = rand(seed)
+  ! CHECK: %[[VAL_0:.*]] = fir.alloca f32 {bindc_name = "result", uniq_name = "_QFtest_randEresult"} 
+  ! CHECK: %[[VAL_1:.*]]:2 = hlfir.declare %[[VAL_0]] {uniq_name = "_QFtest_randEresult"} : (!fir.ref<f32>) -> (!fir.ref<f32>, !fir.ref<f32>)
+  ! CHECK: %[[VAL_2:.*]] = fir.address_of(@_QFtest_randEseed) : !fir.ref<i32> 
+  ! CHECK: %[[VAL_3:.*]]:2 = hlfir.declare %[[VAL_2]] {uniq_name = "_QFtest_randEseed"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+  ! CHECK: %[[VAL_4:.*]] = fir.call @_FortranARand(%[[VAL_3]]#0, %[[SOURCE:.*]], %[[LINE:.*]]) fastmath<contract> : (!fir.ref<i32>, !fir.ref<i8>, i32) -> f32
+  ! CHECK: hlfir.assign %[[VAL_4]] to %[[VAL_1]]#0 : f32, !fir.ref<f32>
+  ! CHECK: return
+end subroutine test_rand
+
diff --git a/flang/test/Lower/dispatch-table-abstract.f90 b/flang/test/Lower/dispatch-table-abstract.f90
new file mode 100644
index 0000000000000..cb4eb0cdeb52f
--- /dev/null
+++ b/flang/test/Lower/dispatch-table-abstract.f90
@@ -0,0 +1,21 @@
+! Test lowering of an ABSTRACT derived type to fir.type_info
+! RUN: %flang_fc1 -emit-hlfir %s -o - | FileCheck %s
+
+module m_abstract_info
+  type, abstract :: abstract_type
+    contains
+    procedure(proc_iface), nopass, deferred :: proc
+  end type
+  interface
+    subroutine proc_iface()
+    end subroutine
+  end interface
+end module
+
+subroutine test(x)
+  use m_abstract_info, only : abstract_type
+  class(abstract_type) :: x
+end subroutine
+
+!CHECK-LABEL:  fir.type_info @_QMm_abstract_infoTabstract_type abstract noinit nodestroy nofinal : !fir.type<_QMm_abstract_infoTabstract_type> dispatch_table {
+!CHECK:    fir.dt_entry "proc", @_QPproc_iface deferred
diff --git a/flang/test/Lower/module-debug-file-loc-linux.f90 b/flang/test/Lower/module-debug-file-loc-linux.f90
index 454fad98e9af0..1c67d83492cc2 100644
--- a/flang/test/Lower/module-debug-file-loc-linux.f90
+++ b/flang/test/Lower/module-debug-file-loc-linux.f90
@@ -2,7 +2,7 @@
 
 ! RUN: %flang_fc1 -mmlir --mlir-print-debuginfo -emit-fir -o - %s | FileCheck %s
 
-! REQUIRES: system-linux
+! REQUIRES: system-linux || system-aix
 
 subroutine sb1()
 end subroutine
diff --git a/flang/test/Lower/pause-statement.f90 b/flang/test/Lower/pause-statement.f90
index f4c8f6fbc4385..465d82449c5bc 100644
--- a/flang/test/Lower/pause-statement.f90
+++ b/flang/test/Lower/pause-statement.f90
@@ -2,7 +2,31 @@
 
 ! CHECK-LABEL: pause_test
 subroutine pause_test()
-  ! CHECK: fir.call @_Fortran{{.*}}PauseStatement()
-  ! CHECK-NEXT: return
   pause
+  ! CHECK: fir.call @_FortranA{{.*}}PauseStatement()
+  ! CHECK-NEXT: return
+end subroutine
+
+! CHECK-LABEL: pause_code
+subroutine pause_code()
+  pause 42
+  ! CHECK: %[[c42:.*]] = arith.constant 42 : i32
+  ! CHECK: fir.call @_FortranA{{.*}}PauseStatementInt(%[[c42]])
+  ! CHECK-NEXT: return
 end subroutine
+
+! CHECK-LABEL: pause_msg
+subroutine pause_msg()
+  pause "hello"
+  ! CHECK-DAG: %[[five:.*]] = arith.constant 5 : index
+  ! CHECK-DAG: %[[addr:.*]] = fir.address_of(@_QQ{{.*}}) : !fir.ref<!fir.char<1,5>>
+  ! CHECK-DAG: %[[str:.*]]:2 = hlfir.declare %[[addr]] typeparams %[[five]] {fortran_attrs = #fir.var_attrs<parameter>, uniq_name = "_QQ{{.*}}"} : (!fir.ref<!fir.char<1,5>>, index) -> (!fir.ref<!fir.char<1,5>>, !fir.ref<!fir.char<1,5>>)
+  ! CHECK-DAG: %[[buff:.*]] = fir.convert %[[str]]#0 : (!fir.ref<!fir.char<1,5>>) -> !fir.ref<i8>
+  ! CHECK-DAG: %[[len:.*]] = fir.convert %[[five]] : (index) -> i64
+  ! CHECK: fir.call @_FortranA{{.*}}PauseStatementText(%[[buff]], %[[len]])
+  ! CHECK-NEXT: return
+end subroutine
+
+! CHECK-DAG: func private @_FortranA{{.*}}PauseStatement
+! CHECK-DAG: func private @_FortranA{{.*}}PauseStatementInt
+! CHECK-DAG: func private @_FortranA{{.*}}PauseStatementText
diff --git a/flang/test/Lower/vectorlength.f90 b/flang/test/Lower/vectorlength.f90
new file mode 100644
index 0000000000000..95753c3f78090
--- /dev/null
+++ b/flang/test/Lower/vectorlength.f90
@@ -0,0 +1,67 @@
+! RUN: %flang_fc1 -emit-hlfir -o - %s | FileCheck %s
+
+! CHECK: #[[FIXED:.*]] = #llvm.loop_vectorize<disable = false, scalableEnable = false>
+! CHECK: #[[SCALABLE:.*]] = #llvm.loop_vectorize<disable = false, scalableEnable = true>
+! CHECK: #[[WIDTH2:.*]] = #llvm.loop_vectorize<disable = false, width = 2 : i64>
+! CHECK: #[[FIXED_WIDTH2:.*]] = #llvm.loop_vectorize<disable = false, scalableEnable = false, width = 2 : i64>
+! CHECK: #[[SCALABLE_WIDTH2:.*]] = #llvm.loop_vectorize<disable = false, scalableEnable = true, width = 2 : i64>
+! CHECK: #[[FIXED_TAG:.*]] = #llvm.loop_annotation<vectorize = #[[FIXED]]>
+! CHECK: #[[SCALABLE_TAG:.*]] = #llvm.loop_annotation<vectorize = #[[SCALABLE]]>
+! CHECK: #[[WIDTH2_TAG:.*]]  = #llvm.loop_annotation<vectorize = #[[WIDTH2]]>
+! CHECK: #[[FIXED_WIDTH2_TAG:.*]] = #llvm.loop_annotation<vectorize = #[[FIXED_WIDTH2]]>
+! CHECK: #[[SCALABLE_WIDTH2_TAG:.*]] = #llvm.loop_annotation<vectorize = #[[SCALABLE_WIDTH2]]>
+
+! CHECK-LABEL: func.func @_QPfixed(
+subroutine fixed(a, b, m)
+  integer :: i, m, a(m), b(m)
+
+  !dir$ vector vectorlength(fixed)
+  ! CHECK: fir.do_loop {{.*}} attributes {loopAnnotation = #[[FIXED_TAG]]}
+  do i = 1, m
+    b(i) = a(i) + 1
+  end do
+end subroutine
+
+! CHECK-LABEL: func.func @_QPscalable(
+subroutine scalable(a, b, m)
+  integer :: i, m, a(m), b(m)
+
+  !dir$ vector vectorlength(scalable)
+  ! CHECK: fir.do_loop {{.*}} attributes {loopAnnotation = #[[SCALABLE_TAG]]}
+  do i = 1, m
+    b(i) = a(i) + 1
+  end do
+end subroutine
+
+! CHECK-LABEL: func.func @_QPlen2(
+subroutine len2(a, b, m)
+  integer :: i, m, a(m), b(m)
+
+  !dir$ vector vectorlength(2)
+  ! CHECK: fir.do_loop {{.*}} attributes {loopAnnotation = #[[WIDTH2_TAG]]}
+  do i = 1, m
+    b(i) = a(i) + 1
+  end do
+end subroutine
+
+! CHECK-LABEL: func.func @_QPlen2fixed(
+subroutine len2fixed(a, b, m)
+  integer :: i, m, a(m), b(m)
+
+  !dir$ vector vectorlength(2,fixed)
+  ! CHECK: fir.do_loop {{.*}} attributes {loopAnnotation = #[[FIXED_WIDTH2_TAG]]}
+  do i = 1, m
+    b(i) = a(i) + 1
+  end do
+end subroutine
+
+! CHECK-LABEL: func.func @_QPlen2scalable(
+subroutine len2scalable(a, b, m)
+  integer :: i, m, a(m), b(m)
+
+  !dir$ vector vectorlength(2,scalable)
+  ! CHECK: fir.do_loop {{.*}} attributes {loopAnnotation = #[[SCALABLE_WIDTH2_TAG]]}
+  do i = 1, m
+    b(i) = a(i) + 1
+  end do
+end subroutine
diff --git a/flang/test/Parser/OpenMP/fuse-looprange.f90 b/flang/test/Parser/OpenMP/fuse-looprange.f90
index 75ec15fddd65f..d6e6416175b33 100644
--- a/flang/test/Parser/OpenMP/fuse-looprange.f90
+++ b/flang/test/Parser/OpenMP/fuse-looprange.f90
@@ -28,7 +28,7 @@ subroutine openmp_fuse(x)
 !PARSE-TREE: OpenMPConstruct -> OpenMPLoopConstruct
 !PARSE-TREE: OmpBeginLoopDirective
 !PARSE-TREE: OmpDirectiveName -> llvm::omp::Directive = fuse
-!PARSE-TREE: OmpClauseList -> OmpClause -> Looprange -> OmpLoopRangeClause
+!PARSE-TREE: OmpClauseList -> OmpClause -> Looprange -> OmpLooprangeClause
 !PARSE-TREE: Scalar -> Integer -> Constant -> Expr = '1_4'
 !PARSE-TREE: LiteralConstant -> IntLiteralConstant = '1'
 !PARSE-TREE: Scalar -> Integer -> Constant -> Expr = '2_4'
diff --git a/flang/test/Parser/compiler-directives.f90 b/flang/test/Parser/compiler-directives.f90
index 56a10f9177997..ce592692cfc67 100644
--- a/flang/test/Parser/compiler-directives.f90
+++ b/flang/test/Parser/compiler-directives.f90
@@ -36,6 +36,28 @@ subroutine vector_always
   enddo
 end subroutine
 
+subroutine vector_vectorlength
+  !dir$ vector vectorlength(fixed)
+  ! CHECK: !DIR$ VECTOR VECTORLENGTH (FIXED)
+  do i=1,10
+  enddo
+
+  !dir$ vector vectorlength(scalable)
+  ! CHECK: !DIR$ VECTOR VECTORLENGTH (SCALABLE)
+  do i=1,10
+  enddo
+
+  !dir$ vector vectorlength(8,scalable)
+  ! CHECK: !DIR$ VECTOR VECTORLENGTH (8, SCALABLE)
+  do i=1,10
+  enddo
+
+  !dir$ vector vectorlength(4)
+  ! CHECK: !DIR$ VECTOR VECTORLENGTH (4)
+  do i=1,10
+  enddo
+end subroutine
+
 subroutine unroll
   !dir$ unroll
   ! CHECK: !DIR$ UNROLL
diff --git a/flang/test/Semantics/equiv-kind.f90 b/flang/test/Semantics/equiv-kind.f90
new file mode 100644
index 0000000000000..d54fe62ee77db
--- /dev/null
+++ b/flang/test/Semantics/equiv-kind.f90
@@ -0,0 +1,19 @@
+! RUN: %flang_fc1 -fdebug-unparse %s 2>&1 | FileCheck %s
+module equiv_kind_m
+  implicit none
+  integer, parameter :: knd = kind(42)
+  integer, parameter :: dim_2 = 1_knd
+  integer, parameter :: n = 3_knd
+  integer, parameter :: i_start = 1_knd
+contains
+subroutine test()
+  integer(knd) :: a(n),b(n,n)
+  character(len=5) :: small_ch
+  character(len=20) :: large_ch
+
+  equivalence (a(1_knd),b(1_knd,dim_2))
+  !CHECK: EQUIVALENCE (a(1_4), b(1_4,1_4))
+  equivalence (small_ch, large_ch(i_start:5_knd))
+  !CHECK: EQUIVALENCE (small_ch, large_ch(1_4:5_4))
+end subroutine test
+end module equiv_kind_m
diff --git a/flang/test/Transforms/debug-dwarf-version.fir b/flang/test/Transforms/debug-dwarf-version.fir
index fe2700274ab87..0136d2469d749 100644
--- a/flang/test/Transforms/debug-dwarf-version.fir
+++ b/flang/test/Transforms/debug-dwarf-version.fir
@@ -8,7 +8,7 @@
 // RUN:         | FileCheck --check-prefix=CHECK-DWARF2 %s
 // RUN: fir-opt --add-debug-info= --mlir-print-debuginfo %s \
 // RUN:         | FileCheck --check-prefix=CHECK-WITHOUT-VERSION %s
-// REQUIRES: system-linux
+// REQUIRES: system-linux || system-aix
 
 module {
 } loc(#loc)
diff --git a/flang/test/Transforms/debug-line-table-existing.fir b/flang/test/Transforms/debug-line-table-existing.fir
index 03eefd08a4379..98ca3dcb2a717 100644
--- a/flang/test/Transforms/debug-line-table-existing.fir
+++ b/flang/test/Transforms/debug-line-table-existing.fir
@@ -1,6 +1,6 @@
 
 // RUN: fir-opt --add-debug-info --mlir-print-debuginfo %s | FileCheck %s
-// REQUIRES: system-linux
+// REQUIRES: system-linux || system-aix
 
 // Test that there are no changes to a function with existed fused loc debug
 module {
diff --git a/flang/test/Transforms/debug-line-table-inc-file.fir b/flang/test/Transforms/debug-line-table-inc-file.fir
index 32c9f515ead43..d29e2fd6683b6 100644
--- a/flang/test/Transforms/debug-line-table-inc-file.fir
+++ b/flang/test/Transforms/debug-line-table-inc-file.fir
@@ -1,6 +1,6 @@
 
 // RUN: fir-opt --add-debug-info="debug-level=LineTablesOnly" --mlir-print-debuginfo %s | FileCheck %s
-// REQUIRES: system-linux
+// REQUIRES: system-linux || system-aix
 
 // Test for included functions that have a different debug location than the current file
 module {
diff --git a/flang/test/Transforms/debug-line-table-inc-same-file.fir b/flang/test/Transforms/debug-line-table-inc-same-file.fir
index aaa8d03a76ef0..5265c79e61173 100644
--- a/flang/test/Transforms/debug-line-table-inc-same-file.fir
+++ b/flang/test/Transforms/debug-line-table-inc-same-file.fir
@@ -1,6 +1,6 @@
 
 // RUN: fir-opt --add-debug-info --mlir-print-debuginfo %s | FileCheck %s
-// REQUIRES: system-linux
+// REQUIRES: system-linux || system-aix
 
 // Test that there is only one FileAttribute generated for multiple functions
 // in the same file.
diff --git a/libc/docs/talks.rst b/libc/docs/talks.rst
index e05aaeb53e04b..27164d26386c4 100644
--- a/libc/docs/talks.rst
+++ b/libc/docs/talks.rst
@@ -4,10 +4,61 @@ Talks
 ----
 2025
 ----
+* From proprietary to fully open-source - Arm Toolchain's adoption of LLVM technology - Peter Smith
+
+  * `slides <https://llvm.org/devmtg/2025-10/slides/keynotes/smith.pdf>`__
+  * `video <https://www.youtube.com/watch?v=I7S_Vsnkecg>`__
+
+* How to test and evaluate LLVM libc on embedded applications - William Huynh
+
+  * `slides <https://llvm.org/devmtg/2025-10/slides/lightning_talks/huynh.pdf>`__
+  * `video <https://www.youtube.com/watch?v=Dta6nQLmCOY>`__
+
+* Through the Compiler's Keyhole - Migrating to Clang Without Seeing the Source - Petr Hosek
+
+  * `slides <https://llvm.org/devmtg/2025-10/slides/technical_talks/hosek.pdf>`__
+  * `video <https://www.youtube.com/watch?v=CHbyo0Ux60o>`__
+
+* Building C++ compiler runtimes on demand - Why and how - Brooks Moses
+
+  * `slides <https://llvm.org/devmtg/2025-10/slides/technical_talks/moses.pdf>`__
+  * `video <https://www.youtube.com/watch?v=4oKegT4TWV0>`__
+
+* LT-Uh-Oh - Adventures using LTO with libc - Paul Kirth, Daniel Thornburgh
+
+  * `slides <https://llvm.org/devmtg/2025-10/slides/technical_talks/kirth_thornburgh.pdf>`__
+  * `video <https://www.youtube.com/watch?v=cG278WjmIFs>`__
+
+* Climbing the ladder of Complete - LLVM libc past and future - Michael Jones
+
+  * `slides <https://llvm.org/devmtg/2025-10/slides/technical_talks/jones.pdf>`__
+  * `video <https://www.youtube.com/watch?v=HtCMCL13Grg>`__
+
+* Project Widen Your Char-izons - Adding wchar support to LLVM libc - Uzair Nawaz, Sriya Pratipati
+
+  * `slides <https://llvm.org/devmtg/2025-10/slides/quick_talks/nawaz_pratipati.pdf>`__
+  * `video <https://www.youtube.com/watch?v=YjI9dum74uM>`__
+
+* A problem left unsolved by Jean-Michel (RAIM 2025) - Paul Zimmermann, Tue Ly
+
+  * `slides <https://raim2025.sciencesconf.org/data/program/slides_paul_zimmermann.pdf>`__
+
+* Bfloat16 in LLVM libc (GSoC 2025) - Krishna Pandey
+
+  * `blog <https://blog.llvm.org/posts/2025-09-10-bfloat16-in-llvm-libc/>`__
+
+* Profiling and Testing Math Functions on GPUs (GSoC 2025) - Leandro A. L. Campos
+
+  * `blog <https://blog.llvm.org/posts/2025-08-29-gsoc-profiling-and-testing-math-functions-on-gpus/>`__
+
+* GPU-driven I/O with io_uring (GSoC 2025) - Rodrigo Ceccato
+
+  * `blog <https://blog.llvm.org/posts/2025-08-04-gpu-io-uring/>`__
+
 * An introduction to building and using LLVM libc - Peter Smith
 
   * `slides <https://fosdem.org/2025/events/attachments/fosdem-2025-5456-an-introduction-to-building-and-using-llvm-libc/slides/237989/Fosdem202_76Bilu2.pdf>`__
-  * `videos <https://fosdem.org/2025/schedule/event/fosdem-2025-5456-an-introduction-to-building-and-using-llvm-libc/>`__
+  * `video <https://fosdem.org/2025/schedule/event/fosdem-2025-5456-an-introduction-to-building-and-using-llvm-libc/>`__
 
 ----
 2024
@@ -15,38 +66,39 @@ Talks
 * A C/C++ Toolchain for your GPU - Joseph Huber
 
   * `slides <https://llvm.org/devmtg/2024-10/slides/techtalk/Huber-A-CPlusPlus-Toolchain-for-Your-GPU.pdf>`__
-  * `videos <https://www.youtube.com/watch?v=4TxGWis1mws>`__
+  * `video <https://www.youtube.com/watch?v=4TxGWis1mws>`__
   * `phoronix <https://www.phoronix.com/news/AMD-Standard-C-Code-GPUs>`__
 
 * Modern Embedded Development with LLVM - Petr Hosek
 
   * `slides <https://llvm.org/devmtg/2024-10/slides/techtalk/Hosek-ModernEmbeddedDevelopment-with-LLVM.pdf>`__
-  * `videos <https://www.youtube.com/watch?v=5hHQi-Uj34I>`__
+  * `video <https://www.youtube.com/watch?v=5hHQi-Uj34I>`__
 
 * Using llvm-libc in LLVM Embedded Toolchain for Arm - Peter Smith
 
   * `slides <https://llvm.org/devmtg/2024-10/slides/lightning/Smith-Using-llvm-libc.pdf>`__
-  * `videos <https://www.youtube.com/watch?v=ctgkbaYwT_I>`__
+  * `video <https://www.youtube.com/watch?v=ctgkbaYwT_I>`__
 
 * RISC-V Support into LLVM libc - Challenges and Solutions for 32-bit and 64-bit - Mikhail R. Gadelha
 
   * `slides <https://llvm.org/devmtg/2024-10/slides/quicktalks/Gadelha-RISC-V-SupportIntoLLVM-libc.pdf>`__
-  * `videos <https://www.youtube.com/watch?v=GytmaH64wFo>`__
+  * `video <https://www.youtube.com/watch?v=GytmaH64wFo>`__
 
 * Project Hand-in-Hand - The beginning of a beautiful friendship - Michael Jones & Christopher Di Bella
 
   * `slides <https://llvm.org/devmtg/2024-10/slides/techtalk/Jones-DiBella-hand-in-hand.pdf>`__
-  * `videos <https://www.youtube.com/watch?v=VAEO86YtTHA>`__
+  * `video <https://www.youtube.com/watch?v=VAEO86YtTHA>`__
 
 * LLVM libc math library - Current status and future directions - Tue Ly
 
   * `slides <https://llvm.org/devmtg/2024-10/slides/techtalk/Ly-LLVM-libc-math-library-CurrentStatus.pdf>`__
-  * `videos <https://www.youtube.com/watch?v=-8zb8rHbvcQ>`__
+  * `video <https://www.youtube.com/watch?v=-8zb8rHbvcQ>`__
 
 * Half-precision in LLVM libc - Nicolas Celik
 
   * `slides <https://llvm.org/devmtg/2024-10/slides/studenttalks/Celik-Half-precision-in-LLVM-libc.pdf>`__
-  * `videos <https://www.youtube.com/watch?v=H6aOFSVwSSM>`__
+  * `video <https://www.youtube.com/watch?v=H6aOFSVwSSM>`__
+  * `blog <https://blog.llvm.org/posts/2024-08-31-half-precision-in-llvm-libc/>`__
 
 ----
 2023
diff --git a/libc/fuzzing/__support/freelist_heap_fuzz.cpp b/libc/fuzzing/__support/freelist_heap_fuzz.cpp
index 0b400cb156491..b342b21895a08 100644
--- a/libc/fuzzing/__support/freelist_heap_fuzz.cpp
+++ b/libc/fuzzing/__support/freelist_heap_fuzz.cpp
@@ -147,7 +147,7 @@ extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t remainder) {
 
       // Perform allocation.
       void *ptr = nullptr;
-      size_t alignment = alignof(max_align_t);
+      size_t alignment = Block::MIN_ALIGN;
       switch (alloc_type) {
       case AllocType::MALLOC:
         ptr = heap.allocate(alloc_size);
@@ -172,7 +172,7 @@ extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t remainder) {
                           alloc_size - alloc.size);
           alloc.ptr = ptr;
           alloc.size = alloc_size;
-          alloc.alignment = alignof(max_align_t);
+          alloc.alignment = Block::MIN_ALIGN;
         }
         break;
       }
@@ -194,8 +194,8 @@ extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t remainder) {
 
       if (ptr) {
         // aligned_allocate should automatically apply a minimum alignment.
-        if (alignment < alignof(max_align_t))
-          alignment = alignof(max_align_t);
+        if (alignment < Block::MIN_ALIGN)
+          alignment = Block::MIN_ALIGN;
         // Check alignment.
         if (reinterpret_cast<uintptr_t>(ptr) % alignment)
           __builtin_trap();
diff --git a/libc/src/__support/CMakeLists.txt b/libc/src/__support/CMakeLists.txt
index d33e7ae45c068..c7f127d6934a0 100644
--- a/libc/src/__support/CMakeLists.txt
+++ b/libc/src/__support/CMakeLists.txt
@@ -162,7 +162,6 @@ add_header_library(
     wctype_utils.h
   DEPENDS
     libc.hdr.types.wchar_t
-    libc.hdr.types.wint_t
 )
 
 add_header_library(
diff --git a/libc/src/__support/block.h b/libc/src/__support/block.h
index b0d6576093244..d45af1079a5bc 100644
--- a/libc/src/__support/block.h
+++ b/libc/src/__support/block.h
@@ -43,8 +43,8 @@ using cpp::optional;
 /// The blocks store their offsets to the previous and next blocks. The latter
 /// is also the block's size.
 ///
-/// All blocks have their usable space aligned to some multiple of max_align_t.
-/// This also implies that block outer sizes are aligned to max_align_t.
+/// All blocks have their usable space aligned to some multiple of MIN_ALIGN.
+/// This also implies that block outer sizes are aligned to MIN_ALIGN.
 ///
 /// As an example, the diagram below represents two contiguous `Block`s. The
 /// indices indicate byte offsets:
@@ -97,13 +97,17 @@ class Block {
   static constexpr size_t SIZE_MASK = ~(PREV_FREE_MASK | LAST_MASK);
 
 public:
+  // To ensure block sizes have two lower unused bits, ensure usable space is
+  // always aligned to at least 4 bytes. (The distance between usable spaces,
+  // i.e. the outer size, is then always also 4-aligned.)
+  static constexpr size_t MIN_ALIGN = cpp::max(size_t{4}, alignof(max_align_t));
   // No copy or move.
   Block(const Block &other) = delete;
   Block &operator=(const Block &other) = delete;
 
   /// Initializes a given memory region into a first block and a sentinel last
   /// block. Returns the first block, which has its usable space aligned to
-  /// max_align_t.
+  /// MIN_ALIGN.
   static optional<Block *> init(ByteSpan region);
 
   /// @returns  A pointer to a `Block`, given a pointer to the start of the
@@ -160,17 +164,17 @@ class Block {
 
   /// @returns A pointer to the usable space inside this block.
   ///
-  /// Aligned to some multiple of max_align_t.
+  /// Aligned to some multiple of MIN_ALIGN.
   LIBC_INLINE cpp::byte *usable_space() {
     auto *s = reinterpret_cast<cpp::byte *>(this) + sizeof(Block);
-    LIBC_ASSERT(reinterpret_cast<uintptr_t>(s) % alignof(max_align_t) == 0 &&
-                "usable space must be aligned to a multiple of max_align_t");
+    LIBC_ASSERT(reinterpret_cast<uintptr_t>(s) % MIN_ALIGN == 0 &&
+                "usable space must be aligned to MIN_ALIGN");
     return s;
   }
   LIBC_INLINE const cpp::byte *usable_space() const {
     const auto *s = reinterpret_cast<const cpp::byte *>(this) + sizeof(Block);
-    LIBC_ASSERT(reinterpret_cast<uintptr_t>(s) % alignof(max_align_t) == 0 &&
-                "usable space must be aligned to a multiple of max_align_t");
+    LIBC_ASSERT(reinterpret_cast<uintptr_t>(s) % MIN_ALIGN == 0 &&
+                "usable space must be aligned to MIN_ALIGN");
     return s;
   }
 
@@ -185,9 +189,9 @@ class Block {
   /// `new_inner_size`. The remaining space will be returned as a new block,
   /// with usable space aligned to `usable_space_alignment`. Note that the prev_
   /// field of the next block counts as part of the inner size of the block.
-  /// `usable_space_alignment` must be a multiple of max_align_t.
+  /// `usable_space_alignment` must be a multiple of MIN_ALIGN.
   optional<Block *> split(size_t new_inner_size,
-                          size_t usable_space_alignment = alignof(max_align_t));
+                          size_t usable_space_alignment = MIN_ALIGN);
 
   /// Merges this block with the one that comes after it.
   bool merge_next();
@@ -228,13 +232,11 @@ class Block {
 
   LIBC_INLINE Block(size_t outer_size, bool is_last) : next_(outer_size) {
     // Last blocks are not usable, so they need not have sizes aligned to
-    // max_align_t. Their lower bits must still be free, so they must be aligned
-    // to Block.
-    LIBC_ASSERT(
-        outer_size % (is_last ? alignof(Block) : alignof(max_align_t)) == 0 &&
-        "block sizes must be aligned");
-    LIBC_ASSERT(is_usable_space_aligned(alignof(max_align_t)) &&
-                "usable space must be aligned to a multiple of max_align_t");
+    // MIN_ALIGN.
+    LIBC_ASSERT(outer_size % (is_last ? alignof(Block) : MIN_ALIGN) == 0 &&
+                "block sizes must be aligned");
+    LIBC_ASSERT(is_usable_space_aligned(MIN_ALIGN) &&
+                "usable space must be aligned to a multiple of MIN_ALIGN");
     if (is_last)
       next_ |= LAST_MASK;
   }
@@ -249,11 +251,10 @@ class Block {
   // Returns 0 if there is no such size.
   LIBC_INLINE static size_t min_size_for_allocation(size_t alignment,
                                                     size_t size) {
-    LIBC_ASSERT(alignment >= alignof(max_align_t) &&
-                alignment % alignof(max_align_t) == 0 &&
-                "alignment must be multiple of max_align_t");
+    LIBC_ASSERT(alignment >= MIN_ALIGN && alignment % MIN_ALIGN == 0 &&
+                "alignment must be multiple of MIN_ALIGN");
 
-    if (alignment == alignof(max_align_t))
+    if (alignment == MIN_ALIGN)
       return size;
 
     // We must create a new block inside this one (splitting). This requires a
@@ -274,7 +275,7 @@ class Block {
     // So the maximum distance would be G - L. As a special case, if L is 1
     // (unaligned), the max distance is G - 1.
     //
-    // This block's usable space is aligned to max_align_t >= Block. With zero
+    // This block's usable space is aligned to MIN_ALIGN >= Block. With zero
     // padding, the next block's usable space is sizeof(Block) past it, which is
     // a point aligned to Block. Thus the max padding needed is alignment -
     // alignof(Block).
@@ -309,13 +310,15 @@ class Block {
   static BlockInfo allocate(Block *block, size_t alignment, size_t size);
 
   // These two functions may wrap around.
-  LIBC_INLINE static uintptr_t next_possible_block_start(
-      uintptr_t ptr, size_t usable_space_alignment = alignof(max_align_t)) {
+  LIBC_INLINE static uintptr_t
+  next_possible_block_start(uintptr_t ptr,
+                            size_t usable_space_alignment = MIN_ALIGN) {
     return align_up(ptr + sizeof(Block), usable_space_alignment) -
            sizeof(Block);
   }
-  LIBC_INLINE static uintptr_t prev_possible_block_start(
-      uintptr_t ptr, size_t usable_space_alignment = alignof(max_align_t)) {
+  LIBC_INLINE static uintptr_t
+  prev_possible_block_start(uintptr_t ptr,
+                            size_t usable_space_alignment = MIN_ALIGN) {
     return align_down(ptr, usable_space_alignment) - sizeof(Block);
   }
 
@@ -360,9 +363,6 @@ class Block {
   static constexpr size_t PREV_FIELD_SIZE = sizeof(prev_);
 };
 
-static_assert(alignof(Block) >= 4,
-              "at least 2 bits must be available in block sizes for flags");
-
 LIBC_INLINE
 optional<Block *> Block::init(ByteSpan region) {
   if (!region.data())
@@ -394,8 +394,8 @@ optional<Block *> Block::init(ByteSpan region) {
 
 LIBC_INLINE
 Block::BlockInfo Block::allocate(Block *block, size_t alignment, size_t size) {
-  LIBC_ASSERT(alignment % alignof(max_align_t) == 0 &&
-              "alignment must be a multiple of max_align_t");
+  LIBC_ASSERT(alignment % MIN_ALIGN == 0 &&
+              "alignment must be a multiple of MIN_ALIGN");
 
   BlockInfo info{block, /*prev=*/nullptr, /*next=*/nullptr};
 
@@ -430,8 +430,8 @@ Block::BlockInfo Block::allocate(Block *block, size_t alignment, size_t size) {
 LIBC_INLINE
 optional<Block *> Block::split(size_t new_inner_size,
                                size_t usable_space_alignment) {
-  LIBC_ASSERT(usable_space_alignment % alignof(max_align_t) == 0 &&
-              "alignment must be a multiple of max_align_t");
+  LIBC_ASSERT(usable_space_alignment % MIN_ALIGN == 0 &&
+              "alignment must be a multiple of MIN_ALIGN");
   if (used())
     return {};
 
@@ -445,8 +445,8 @@ optional<Block *> Block::split(size_t new_inner_size,
   if (next_block_start < start)
     return {};
   size_t new_outer_size = next_block_start - start;
-  LIBC_ASSERT(new_outer_size % alignof(max_align_t) == 0 &&
-              "new size must be aligned to max_align_t");
+  LIBC_ASSERT(new_outer_size % MIN_ALIGN == 0 &&
+              "new size must be aligned to MIN_ALIGN");
 
   if (outer_size() < new_outer_size ||
       outer_size() - new_outer_size < sizeof(Block))
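A side note for readers (not part of the patch): the reason a 4-byte minimum alignment is sufficient is that any outer size that is a multiple of 4 has its two low bits clear, so those bits can carry the prev-free/last flags and be masked back out when the size is read. A minimal, self-contained C++ sketch of that bit-packing idea, using simplified stand-in names rather than the real Block internals:

    #include <cassert>
    #include <cstddef>

    // Simplified stand-ins for the flag and size masks Block packs into next_.
    constexpr std::size_t PREV_FREE = std::size_t{1} << 0;
    constexpr std::size_t LAST      = std::size_t{1} << 1;
    constexpr std::size_t SIZE_MASK = ~(PREV_FREE | LAST);
    constexpr std::size_t MIN_ALIGN = 4; // two low bits of any multiple of 4 are free

    int main() {
      std::size_t outer_size = 40;              // always a multiple of MIN_ALIGN
      assert(outer_size % MIN_ALIGN == 0);

      std::size_t word = outer_size | LAST;     // pack a flag into the spare bits
      assert((word & SIZE_MASK) == outer_size); // the size is recovered intact
      assert((word & LAST) != 0);
      return 0;
    }

On targets where alignof(max_align_t) is larger than 4, MIN_ALIGN simply grows with it; that is what the cpp::max in the new definition expresses.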
diff --git a/libc/src/__support/freelist_heap.h b/libc/src/__support/freelist_heap.h
index d58685194aeb8..2e0e371f96e8c 100644
--- a/libc/src/__support/freelist_heap.h
+++ b/libc/src/__support/freelist_heap.h
@@ -108,7 +108,7 @@ LIBC_INLINE void *FreeListHeap::allocate_impl(size_t alignment, size_t size) {
 }
 
 LIBC_INLINE void *FreeListHeap::allocate(size_t size) {
-  return allocate_impl(alignof(max_align_t), size);
+  return allocate_impl(Block::MIN_ALIGN, size);
 }
 
 LIBC_INLINE void *FreeListHeap::aligned_allocate(size_t alignment,
@@ -121,8 +121,8 @@ LIBC_INLINE void *FreeListHeap::aligned_allocate(size_t alignment,
   if (size % alignment != 0)
     return nullptr;
 
-  // The minimum alignment supported by Block is max_align_t.
-  alignment = cpp::max(alignment, alignof(max_align_t));
+  // The minimum alignment supported by Block is MIN_ALIGN.
+  alignment = cpp::max(alignment, Block::MIN_ALIGN);
 
   return allocate_impl(alignment, size);
 }
diff --git a/libc/src/__support/freestore.h b/libc/src/__support/freestore.h
index 09f2479debb36..2dcb4b10b93d5 100644
--- a/libc/src/__support/freestore.h
+++ b/libc/src/__support/freestore.h
@@ -41,11 +41,11 @@ class FreeStore {
 
 private:
   static constexpr size_t MIN_OUTER_SIZE =
-      align_up(sizeof(Block) + sizeof(FreeList::Node), alignof(max_align_t));
+      align_up(sizeof(Block) + sizeof(FreeList::Node), Block::MIN_ALIGN);
   static constexpr size_t MIN_LARGE_OUTER_SIZE =
-      align_up(sizeof(Block) + sizeof(FreeTrie::Node), alignof(max_align_t));
+      align_up(sizeof(Block) + sizeof(FreeTrie::Node), Block::MIN_ALIGN);
   static constexpr size_t NUM_SMALL_SIZES =
-      (MIN_LARGE_OUTER_SIZE - MIN_OUTER_SIZE) / alignof(max_align_t);
+      (MIN_LARGE_OUTER_SIZE - MIN_OUTER_SIZE) / Block::MIN_ALIGN;
 
   LIBC_INLINE static bool too_small(Block *block) {
     return block->outer_size() < MIN_OUTER_SIZE;
@@ -98,8 +98,7 @@ LIBC_INLINE Block *FreeStore::remove_best_fit(size_t size) {
 
 LIBC_INLINE FreeList &FreeStore::small_list(Block *block) {
   LIBC_ASSERT(is_small(block) && "only legal for small blocks");
-  return small_lists[(block->outer_size() - MIN_OUTER_SIZE) /
-                     alignof(max_align_t)];
+  return small_lists[(block->outer_size() - MIN_OUTER_SIZE) / Block::MIN_ALIGN];
 }
 
 LIBC_INLINE FreeList *FreeStore::find_best_small_fit(size_t size) {
diff --git a/libc/src/__support/wctype_utils.h b/libc/src/__support/wctype_utils.h
index 7041470adc2f4..7f17224104ffb 100644
--- a/libc/src/__support/wctype_utils.h
+++ b/libc/src/__support/wctype_utils.h
@@ -10,8 +10,6 @@
 #define LLVM_LIBC_SRC___SUPPORT_WCTYPE_UTILS_H
 
 #include "hdr/types/wchar_t.h"
-#include "hdr/types/wint_t.h"
-#include "src/__support/CPP/optional.h"
 #include "src/__support/macros/attributes.h" // LIBC_INLINE
 #include "src/__support/macros/config.h"
 
@@ -584,30 +582,6 @@ is_char_or_wchar(wchar_t ch, [[maybe_unused]] char, wchar_t wc_value) {
   return (ch == wc_value);
 }
 
-// ------------------------------------------------------
-// Rationale: Since these classification functions are
-// called in other functions, we will avoid the overhead
-// of a function call by inlining them.
-// ------------------------------------------------------
-
-LIBC_INLINE cpp::optional<int> wctob(wint_t c) {
-  // This needs to be translated to EOF at the callsite. This is to avoid
-  // including stdio.h in this file.
-  // The standard states that wint_t may either be an alias of wchar_t or
-  // an alias of an integer type, different platforms define this type with
-  // different signedness. This is equivalent to `(c > 127) || (c < 0)` but also
-  // works without -Wtype-limits warnings when `wint_t` is unsigned.
-  if ((c & ~127) != 0)
-    return cpp::nullopt;
-  return static_cast<int>(c);
-}
-
-LIBC_INLINE cpp::optional<wint_t> btowc(int c) {
-  if (c > 127 || c < 0)
-    return cpp::nullopt;
-  return static_cast<wint_t>(c);
-}
-
 } // namespace internal
 } // namespace LIBC_NAMESPACE_DECL
 
diff --git a/libc/src/wchar/CMakeLists.txt b/libc/src/wchar/CMakeLists.txt
index 9ca7295118a11..ce57199b0837a 100644
--- a/libc/src/wchar/CMakeLists.txt
+++ b/libc/src/wchar/CMakeLists.txt
@@ -40,7 +40,6 @@ add_entrypoint_object(
   DEPENDS
     libc.hdr.types.wint_t
     libc.hdr.stdio_macros
-    libc.src.__support.wctype_utils
 )
 
 add_entrypoint_object(
@@ -52,7 +51,6 @@ add_entrypoint_object(
   DEPENDS
     libc.hdr.types.wint_t
     libc.hdr.wchar_macros
-    libc.src.__support.wctype_utils
 )
 
 add_entrypoint_object(
diff --git a/libc/src/wchar/btowc.cpp b/libc/src/wchar/btowc.cpp
index c69f77d06c5c5..6bc7526fac164 100644
--- a/libc/src/wchar/btowc.cpp
+++ b/libc/src/wchar/btowc.cpp
@@ -9,7 +9,6 @@
 #include "src/wchar/btowc.h"
 #include "src/__support/common.h"
 #include "src/__support/macros/config.h"
-#include "src/__support/wctype_utils.h"
 
 #include "hdr/types/wint_t.h"
 #include "hdr/wchar_macros.h" // for WEOF.
@@ -17,12 +16,9 @@
 namespace LIBC_NAMESPACE_DECL {
 
 LLVM_LIBC_FUNCTION(wint_t, btowc, (int c)) {
-  auto result = internal::btowc(c);
-  if (result.has_value()) {
-    return result.value();
-  } else {
+  if (c > 127 || c < 0)
     return WEOF;
-  }
+  return static_cast<wint_t>(c);
 }
 
 } // namespace LIBC_NAMESPACE_DECL
diff --git a/libc/src/wchar/wctob.cpp b/libc/src/wchar/wctob.cpp
index 45240d6052eb4..9b6773bf6428e 100644
--- a/libc/src/wchar/wctob.cpp
+++ b/libc/src/wchar/wctob.cpp
@@ -9,7 +9,6 @@
 #include "src/wchar/wctob.h"
 #include "src/__support/common.h"
 #include "src/__support/macros/config.h"
-#include "src/__support/wctype_utils.h"
 
 #include "hdr/stdio_macros.h" // for EOF.
 #include "hdr/types/wint_t.h"
@@ -17,12 +16,13 @@
 namespace LIBC_NAMESPACE_DECL {
 
 LLVM_LIBC_FUNCTION(int, wctob, (wint_t c)) {
-  auto result = internal::wctob(c);
-  if (result.has_value()) {
-    return result.value();
-  } else {
+  // The standard states that wint_t may either be an alias of wchar_t or
+  // an alias of an integer type; different platforms define this type with
+  // different signedness. This is equivalent to `(c > 127) || (c < 0)` but also
+  // works without -Wtype-limits warnings when `wint_t` is unsigned.
+  if ((c & ~127) != 0)
     return EOF;
-  }
+  return static_cast<int>(c);
 }
 
 } // namespace LIBC_NAMESPACE_DECL
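For anyone unfamiliar with the -Wtype-limits concern this comment refers to, here is a small stand-alone sketch (not taken from the patch; the helper name outside_ascii is made up) showing that the mask test matches the 0..127 range check for both signed and unsigned argument types, without ever comparing an unsigned value against 0:

    #include <cassert>

    // True iff c lies outside 0..127; the mask form compiles warning-free
    // whether the argument type is signed or unsigned.
    template <class Int>
    constexpr bool outside_ascii(Int c) {
      return (c & ~Int{127}) != 0;
    }

    int main() {
      assert(!outside_ascii(0));
      assert(!outside_ascii(127));
      assert(outside_ascii(128));
      assert(outside_ascii(-1));          // negative signed values are rejected
      assert(outside_ascii(0xFFFFFFFFu)); // so are large unsigned values
      return 0;
    }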
diff --git a/libc/test/src/__support/block_test.cpp b/libc/test/src/__support/block_test.cpp
index 904ac5c66994d..3029cde834a5d 100644
--- a/libc/test/src/__support/block_test.cpp
+++ b/libc/test/src/__support/block_test.cpp
@@ -22,14 +22,14 @@ using LIBC_NAMESPACE::cpp::span;
 
 TEST(LlvmLibcBlockTest, CanCreateSingleAlignedBlock) {
   constexpr size_t kN = 1024;
-  alignas(max_align_t) array<byte, kN> bytes;
+  alignas(Block::MIN_ALIGN) array<byte, kN> bytes;
 
   auto result = Block::init(bytes);
   ASSERT_TRUE(result.has_value());
   Block *block = *result;
 
   EXPECT_EQ(reinterpret_cast<uintptr_t>(block) % alignof(Block), size_t{0});
-  EXPECT_TRUE(block->is_usable_space_aligned(alignof(max_align_t)));
+  EXPECT_TRUE(block->is_usable_space_aligned(Block::MIN_ALIGN));
 
   Block *last = block->next();
   ASSERT_NE(last, static_cast<Block *>(nullptr));
@@ -52,7 +52,7 @@ TEST(LlvmLibcBlockTest, CanCreateUnalignedSingleBlock) {
   constexpr size_t kN = 1024;
 
   // Force alignment, so we can un-force it below
-  alignas(max_align_t) array<byte, kN> bytes;
+  alignas(Block::MIN_ALIGN) array<byte, kN> bytes;
   span<byte> aligned(bytes);
 
   auto result = Block::init(aligned.subspan(1));
@@ -60,7 +60,7 @@ TEST(LlvmLibcBlockTest, CanCreateUnalignedSingleBlock) {
 
   Block *block = *result;
   EXPECT_EQ(reinterpret_cast<uintptr_t>(block) % alignof(Block), size_t{0});
-  EXPECT_TRUE(block->is_usable_space_aligned(alignof(max_align_t)));
+  EXPECT_TRUE(block->is_usable_space_aligned(Block::MIN_ALIGN));
 
   Block *last = block->next();
   ASSERT_NE(last, static_cast<Block *>(nullptr));
@@ -98,7 +98,7 @@ TEST(LlvmLibcBlockTest, CanSplitBlock) {
   EXPECT_EQ(block2->outer_size(), orig_size - block1->outer_size());
   EXPECT_FALSE(block2->used());
   EXPECT_EQ(reinterpret_cast<uintptr_t>(block2) % alignof(Block), size_t{0});
-  EXPECT_TRUE(block2->is_usable_space_aligned(alignof(max_align_t)));
+  EXPECT_TRUE(block2->is_usable_space_aligned(Block::MIN_ALIGN));
 
   EXPECT_EQ(block1->next(), block2);
   EXPECT_EQ(block2->prev_free(), block1);
@@ -124,7 +124,7 @@ TEST(LlvmLibcBlockTest, CanSplitBlockUnaligned) {
   EXPECT_EQ(block2->outer_size(), orig_size - block1->outer_size());
   EXPECT_FALSE(block2->used());
   EXPECT_EQ(reinterpret_cast<uintptr_t>(block2) % alignof(Block), size_t{0});
-  EXPECT_TRUE(block2->is_usable_space_aligned(alignof(max_align_t)));
+  EXPECT_TRUE(block2->is_usable_space_aligned(Block::MIN_ALIGN));
 
   EXPECT_EQ(block1->next(), block2);
   EXPECT_EQ(block2->prev_free(), block1);
@@ -211,7 +211,7 @@ TEST(LlvmLibcBlockTest, CanMakeMinimalSizeFirstBlock) {
 
   result = block->split(0);
   ASSERT_TRUE(result.has_value());
-  EXPECT_LE(block->outer_size(), sizeof(Block) + alignof(max_align_t));
+  EXPECT_LE(block->outer_size(), sizeof(Block) + Block::MIN_ALIGN);
 }
 
 TEST(LlvmLibcBlockTest, CanMakeMinimalSizeSecondBlock) {
@@ -228,7 +228,7 @@ TEST(LlvmLibcBlockTest, CanMakeMinimalSizeSecondBlock) {
                          reinterpret_cast<uintptr_t>(block1->usable_space()) +
                          Block::PREV_FIELD_SIZE);
   ASSERT_TRUE(result.has_value());
-  EXPECT_LE((*result)->outer_size(), sizeof(Block) + alignof(max_align_t));
+  EXPECT_LE((*result)->outer_size(), sizeof(Block) + Block::MIN_ALIGN);
 }
 
 TEST(LlvmLibcBlockTest, CanMarkBlockUsed) {
@@ -361,18 +361,18 @@ TEST(LlvmLibcBlockTest, Allocate) {
     if (i > block->inner_size())
       continue;
 
-    auto info = Block::allocate(block, alignof(max_align_t), i);
+    auto info = Block::allocate(block, Block::MIN_ALIGN, i);
     EXPECT_NE(info.block, static_cast<Block *>(nullptr));
   }
 
   // Ensure we can allocate a byte at every guaranteeable alignment.
-  for (size_t i = 1; i < kN / alignof(max_align_t); ++i) {
+  for (size_t i = 1; i < kN / Block::MIN_ALIGN; ++i) {
     array<byte, kN> bytes;
     auto result = Block::init(bytes);
     ASSERT_TRUE(result.has_value());
     Block *block = *result;
 
-    size_t alignment = i * alignof(max_align_t);
+    size_t alignment = i * Block::MIN_ALIGN;
     if (Block::min_size_for_allocation(alignment, 1) > block->inner_size())
       continue;
 
@@ -393,14 +393,14 @@ TEST(LlvmLibcBlockTest, AllocateAlreadyAligned) {
   constexpr size_t SIZE = Block::PREV_FIELD_SIZE + 1;
 
   auto [aligned_block, prev, next] =
-      Block::allocate(block, alignof(max_align_t), SIZE);
+      Block::allocate(block, Block::MIN_ALIGN, SIZE);
 
   // Since this is already aligned, there should be no previous block.
   EXPECT_EQ(prev, static_cast<Block *>(nullptr));
 
   // Ensure we the block is aligned and large enough.
   EXPECT_NE(aligned_block, static_cast<Block *>(nullptr));
-  EXPECT_TRUE(aligned_block->is_usable_space_aligned(alignof(max_align_t)));
+  EXPECT_TRUE(aligned_block->is_usable_space_aligned(Block::MIN_ALIGN));
   EXPECT_GE(aligned_block->inner_size(), SIZE);
 
   // Check the next block.
@@ -422,9 +422,9 @@ TEST(LlvmLibcBlockTest, AllocateNeedsAlignment) {
   // Now pick an alignment such that the usable space is not already aligned to
   // it. We want to explicitly test that the block will split into one before
   // it.
-  size_t alignment = alignof(max_align_t);
+  size_t alignment = Block::MIN_ALIGN;
   while (block->is_usable_space_aligned(alignment))
-    alignment += alignof(max_align_t);
+    alignment += Block::MIN_ALIGN;
 
   auto [aligned_block, prev, next] = Block::allocate(block, alignment, 10);
 
@@ -464,9 +464,9 @@ TEST(LlvmLibcBlockTest, PreviousBlockMergedIfNotFirst) {
   // Now pick an alignment such that the usable space is not already aligned to
   // it. We want to explicitly test that the block will split into one before
   // it.
-  size_t alignment = alignof(max_align_t);
+  size_t alignment = Block::MIN_ALIGN;
   while (newblock->is_usable_space_aligned(alignment))
-    alignment += alignof(max_align_t);
+    alignment += Block::MIN_ALIGN;
 
   // Ensure we can allocate in the new block.
   auto [aligned_block, prev, next] = Block::allocate(newblock, alignment, 1);
@@ -501,9 +501,9 @@ TEST(LlvmLibcBlockTest, CanRemergeBlockAllocations) {
   // Now pick an alignment such that the usable space is not already aligned to
   // it. We want to explicitly test that the block will split into one before
   // it.
-  size_t alignment = alignof(max_align_t);
+  size_t alignment = Block::MIN_ALIGN;
   while (block->is_usable_space_aligned(alignment))
-    alignment += alignof(max_align_t);
+    alignment += Block::MIN_ALIGN;
 
   auto [aligned_block, prev, next] = Block::allocate(block, alignment, 1);
 
diff --git a/libc/test/src/__support/freelist_heap_test.cpp b/libc/test/src/__support/freelist_heap_test.cpp
index ea7310d1d0756..9d3a6b612555f 100644
--- a/libc/test/src/__support/freelist_heap_test.cpp
+++ b/libc/test/src/__support/freelist_heap_test.cpp
@@ -123,12 +123,12 @@ TEST_FOR_EACH_ALLOCATOR(ReturnedPointersAreAligned, 2048) {
   void *ptr1 = allocator.allocate(1);
 
   uintptr_t ptr1_start = reinterpret_cast<uintptr_t>(ptr1);
-  EXPECT_EQ(ptr1_start % alignof(max_align_t), static_cast<size_t>(0));
+  EXPECT_EQ(ptr1_start % Block::MIN_ALIGN, static_cast<size_t>(0));
 
   void *ptr2 = allocator.allocate(1);
   uintptr_t ptr2_start = reinterpret_cast<uintptr_t>(ptr2);
 
-  EXPECT_EQ(ptr2_start % alignof(max_align_t), static_cast<size_t>(0));
+  EXPECT_EQ(ptr2_start % Block::MIN_ALIGN, static_cast<size_t>(0));
 }
 
 TEST_FOR_EACH_ALLOCATOR(CanRealloc, 2048) {
diff --git a/libc/test/src/__support/freestore_test.cpp b/libc/test/src/__support/freestore_test.cpp
index 39292b6a1211b..7017d6b9ebe93 100644
--- a/libc/test/src/__support/freestore_test.cpp
+++ b/libc/test/src/__support/freestore_test.cpp
@@ -51,8 +51,8 @@ TEST(LlvmLibcFreeStore, RemoveBestFit) {
   ASSERT_TRUE(maybeBlock.has_value());
 
   Block *largest_small = *maybeBlock;
-  maybeBlock = largest_small->split(
-      sizeof(FreeTrie::Node) + Block::PREV_FIELD_SIZE - alignof(max_align_t));
+  maybeBlock = largest_small->split(sizeof(FreeTrie::Node) +
+                                    Block::PREV_FIELD_SIZE - Block::MIN_ALIGN);
   ASSERT_TRUE(maybeBlock.has_value());
   if (largest_small->inner_size() == smallest->inner_size())
     largest_small = smallest;
diff --git a/libclc/opencl/lib/generic/atomic/atomic_def.inc b/libclc/opencl/lib/generic/atomic/atomic_def.inc
index a4ccab5990888..e6b7c831e10d3 100644
--- a/libclc/opencl/lib/generic/atomic/atomic_def.inc
+++ b/libclc/opencl/lib/generic/atomic/atomic_def.inc
@@ -12,7 +12,8 @@
                                  defined(cl_khr_int64_extended_atomics))
 #define __CLC_HAVE_64_ATOMIC
 #endif
-#if defined(__CLC_FPSIZE) && (__CLC_FPSIZE < 64 || defined(__CLC_HAVE_64_ATOMIC)
+#if defined(__CLC_FPSIZE) &&                                                   \
+    (__CLC_FPSIZE < 64 || defined(__CLC_HAVE_64_ATOMIC))
 #define __CLC_HAVE_FP_ATOMIC
 #endif
 #if defined(__CLC_GENSIZE) &&                                                  \
diff --git a/libclc/opencl/lib/generic/integer/bitfield_insert.cl b/libclc/opencl/lib/generic/integer/bitfield_insert.cl
index c165bd756ffef..f6d0aea96d8ec 100644
--- a/libclc/opencl/lib/generic/integer/bitfield_insert.cl
+++ b/libclc/opencl/lib/generic/integer/bitfield_insert.cl
@@ -12,7 +12,7 @@
 #include <clc/opencl/integer/bitfield_insert.h>
 
 #define __CLC_FUNCTION bitfield_insert
-#define __CLC_BODY <clc/integer/clc_bitfield_insert.inc>
+#define __CLC_BODY <bitfield_insert.inc>
 #include <clc/integer/gentype.inc>
 
 #endif // cl_khr_extended_bit_ops
diff --git a/libcxx/docs/AddingNewCIJobs.rst b/libcxx/docs/AddingNewCIJobs.rst
index 9d749c0d866f2..7a12728b98919 100644
--- a/libcxx/docs/AddingNewCIJobs.rst
+++ b/libcxx/docs/AddingNewCIJobs.rst
@@ -28,6 +28,9 @@ An example of a job definition is:
 
   - label: "C++11"
     command: "libcxx/utils/ci/run-buildbot generic-cxx11"
+    env:
+      CC: clang
+      CXX: clang++
     artifact_paths:
       - "**/test-results.xml"
     agents:
diff --git a/libcxx/docs/Contributing.rst b/libcxx/docs/Contributing.rst
index b814ccfd0ac9a..e660daeba7e5b 100644
--- a/libcxx/docs/Contributing.rst
+++ b/libcxx/docs/Contributing.rst
@@ -311,6 +311,9 @@ To do so, you will need to create a PR in the llvm-zorg repository and wait for
 merged. Once that change has been merged, an LLVM premerge maintainer (a Google employee)
 must use terraform to apply the change to the running GKE cluster.
 
+.. note:: When you update the ``libcxx_runner_image``, also make sure to update the
+          ``libcxx/utils/ci/run-buildbot-container`` script to contain the new image.
+
 
 Monitoring premerge testing performance
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/libcxx/include/__functional/bind.h b/libcxx/include/__functional/bind.h
index 328dc3bf3dabc..cbe8660b821c1 100644
--- a/libcxx/include/__functional/bind.h
+++ b/libcxx/include/__functional/bind.h
@@ -81,16 +81,12 @@ inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _Tp& __mu(reference_w
   return __t.get();
 }
 
-template <class _Ti, class... _Uj, size_t... _Indx>
-inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 __invoke_result_t<_Ti&, _Uj...>
-__mu_expand(_Ti& __ti, tuple<_Uj...>& __uj, __index_sequence<_Indx...>) {
-  return __ti(std::forward<_Uj>(std::get<_Indx>(__uj))...);
-}
-
 template <class _Ti, class... _Uj, __enable_if_t<is_bind_expression<_Ti>::value, int> = 0>
 inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 __invoke_result_t<_Ti&, _Uj...>
 __mu(_Ti& __ti, tuple<_Uj...>& __uj) {
-  return std::__mu_expand(__ti, __uj, __make_index_sequence<sizeof...(_Uj)>());
+  return [&]<size_t... _Indices>(__index_sequence<_Indices...>) -> __invoke_result_t<_Ti&, _Uj...> {
+    return __ti(std::forward<_Uj>(std::get<_Indices>(__uj))...);
+  }(__index_sequence_for<_Uj...>{});
 }
 
 template <bool _IsPh, class _Ti, class _Uj>
@@ -217,10 +213,7 @@ class __bind : public __weak_result_type<__decay_t<_Fp> > {
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 typename __bind_return<_Fd, _Td, tuple<_Args&&...> >::type
   operator()(_Args&&... __args) {
     return std::__apply_functor(
-        __f_,
-        __bound_args_,
-        __make_index_sequence<sizeof...(_BoundArgs)>(),
-        tuple<_Args&&...>(std::forward<_Args>(__args)...));
+        __f_, __bound_args_, __index_sequence_for<_BoundArgs...>(), tuple<_Args&&...>(std::forward<_Args>(__args)...));
   }
 
   template <class... _Args>
@@ -228,10 +221,7 @@ class __bind : public __weak_result_type<__decay_t<_Fp> > {
   typename __bind_return<const _Fd, const _Td, tuple<_Args&&...> >::type
   operator()(_Args&&... __args) const {
     return std::__apply_functor(
-        __f_,
-        __bound_args_,
-        __make_index_sequence<sizeof...(_BoundArgs)>(),
-        tuple<_Args&&...>(std::forward<_Args>(__args)...));
+        __f_, __bound_args_, __index_sequence_for<_BoundArgs...>(), tuple<_Args&&...>(std::forward<_Args>(__args)...));
   }
 };
 
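The same refactoring pattern shows up again below in once_flag.h and <future>: the helper that existed only to expand an index pack is folded into an immediately invoked templated lambda, valid since C++20. A small sketch of the pattern using standard-library names instead of libc++'s internal __index_sequence machinery (the helper name apply_like is made up for illustration):

    #include <cstddef>
    #include <tuple>
    #include <type_traits>
    #include <utility>

    // Expands the tuple's indices inside an immediately invoked templated
    // lambda instead of delegating to a separately declared helper function.
    template <class F, class Tuple>
    decltype(auto) apply_like(F&& f, Tuple&& t) {
      return [&]<std::size_t... I>(std::index_sequence<I...>) -> decltype(auto) {
        return std::forward<F>(f)(std::get<I>(std::forward<Tuple>(t))...);
      }(std::make_index_sequence<std::tuple_size_v<std::remove_reference_t<Tuple>>>{});
    }

    int main() {
      auto sum = [](int a, int b, int c) { return a + b + c; };
      return apply_like(sum, std::make_tuple(1, 2, 3)) == 6 ? 0 : 1;
    }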
diff --git a/libcxx/include/__mutex/once_flag.h b/libcxx/include/__mutex/once_flag.h
index 808b1ea99cc0b..ad15b2eb6df68 100644
--- a/libcxx/include/__mutex/once_flag.h
+++ b/libcxx/include/__mutex/once_flag.h
@@ -86,12 +86,10 @@ class __call_once_param {
 public:
   _LIBCPP_HIDE_FROM_ABI explicit __call_once_param(_Fp& __f) : __f_(__f) {}
 
-  _LIBCPP_HIDE_FROM_ABI void operator()() { __execute(__make_index_sequence<tuple_size<_Fp>::value>()); }
-
-private:
-  template <size_t... _Indices>
-  _LIBCPP_HIDE_FROM_ABI void __execute(__index_sequence<_Indices...>) {
-    std::__invoke(std::get<_Indices>(std::move(__f_))...);
+  _LIBCPP_HIDE_FROM_ABI void operator()() {
+    [&]<size_t... _Indices>(__index_sequence<_Indices...>) -> void {
+      std::__invoke(std::get<_Indices>(std::move(__f_))...);
+    }(__make_index_sequence<tuple_size<_Fp>::value>());
   }
 };
 
diff --git a/libcxx/include/__utility/integer_sequence.h b/libcxx/include/__utility/integer_sequence.h
index 329826ae5eda2..fe8773ab73165 100644
--- a/libcxx/include/__utility/integer_sequence.h
+++ b/libcxx/include/__utility/integer_sequence.h
@@ -42,6 +42,9 @@ using __index_sequence _LIBCPP_NODEBUG = __integer_sequence<size_t, _Indices...>
 template <size_t _SequenceSize>
 using __make_index_sequence _LIBCPP_NODEBUG = __make_integer_sequence_impl<__integer_sequence, size_t, _SequenceSize>;
 
+template <class... _Args>
+using __index_sequence_for _LIBCPP_NODEBUG = __make_index_sequence<sizeof...(_Args)>;
+
 #  if _LIBCPP_STD_VER >= 14
 
 template <class _Tp, _Tp... _Indices>
diff --git a/libcxx/include/__utility/pair.h b/libcxx/include/__utility/pair.h
index 61485123114ba..d3914f655f2a6 100644
--- a/libcxx/include/__utility/pair.h
+++ b/libcxx/include/__utility/pair.h
@@ -219,11 +219,7 @@ struct pair
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20
   pair(piecewise_construct_t __pc, tuple<_Args1...> __first_args, tuple<_Args2...> __second_args) noexcept(
       is_nothrow_constructible<first_type, _Args1...>::value && is_nothrow_constructible<second_type, _Args2...>::value)
-      : pair(__pc,
-             __first_args,
-             __second_args,
-             __make_index_sequence<sizeof...(_Args1)>(),
-             __make_index_sequence<sizeof...(_Args2)>()) {}
+      : pair(__pc, __first_args, __second_args, __index_sequence_for<_Args1...>(), __index_sequence_for<_Args2...>()) {}
 
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair&
   operator=(__conditional_t<is_copy_assignable<first_type>::value && is_copy_assignable<second_type>::value,
diff --git a/libcxx/include/ext/hash_map b/libcxx/include/ext/hash_map
index 01ca7498f0cc1..09c981131ff96 100644
--- a/libcxx/include/ext/hash_map
+++ b/libcxx/include/ext/hash_map
@@ -570,10 +570,7 @@ hash_map<_Key, _Tp, _Hash, _Pred, _Alloc>::hash_map(
 }
 
 template <class _Key, class _Tp, class _Hash, class _Pred, class _Alloc>
-hash_map<_Key, _Tp, _Hash, _Pred, _Alloc>::hash_map(const hash_map& __u) : __table_(__u.__table_) {
-  __table_.__rehash_unique(__u.bucket_count());
-  insert(__u.begin(), __u.end());
-}
+hash_map<_Key, _Tp, _Hash, _Pred, _Alloc>::hash_map(const hash_map& __u) : __table_(__u.__table_) {}
 
 template <class _Key, class _Tp, class _Hash, class _Pred, class _Alloc>
 typename hash_map<_Key, _Tp, _Hash, _Pred, _Alloc>::__node_holder
diff --git a/libcxx/include/ext/hash_set b/libcxx/include/ext/hash_set
index 2796774fee24a..56aa4d8a47eeb 100644
--- a/libcxx/include/ext/hash_set
+++ b/libcxx/include/ext/hash_set
@@ -356,10 +356,7 @@ hash_set<_Value, _Hash, _Pred, _Alloc>::hash_set(
 }
 
 template <class _Value, class _Hash, class _Pred, class _Alloc>
-hash_set<_Value, _Hash, _Pred, _Alloc>::hash_set(const hash_set& __u) : __table_(__u.__table_) {
-  __table_.__rehash_unique(__u.bucket_count());
-  insert(__u.begin(), __u.end());
-}
+hash_set<_Value, _Hash, _Pred, _Alloc>::hash_set(const hash_set& __u) : __table_(__u.__table_) {}
 
 template <class _Value, class _Hash, class _Pred, class _Alloc>
 template <class _InputIterator>
diff --git a/libcxx/include/future b/libcxx/include/future
index 0877d66602e6b..c249bc5e7938f 100644
--- a/libcxx/include/future
+++ b/libcxx/include/future
@@ -1833,12 +1833,10 @@ public:
 
   _LIBCPP_HIDE_FROM_ABI __async_func(__async_func&& __f) : __f_(std::move(__f.__f_)) {}
 
-  _LIBCPP_HIDE_FROM_ABI _Rp operator()() { return __execute(__make_index_sequence<sizeof...(_Args) + 1>()); }
-
-private:
-  template <size_t... _Indices>
-  _LIBCPP_HIDE_FROM_ABI _Rp __execute(__index_sequence<_Indices...>) {
-    return std::__invoke(std::move(std::get<_Indices>(__f_))...);
+  _LIBCPP_HIDE_FROM_ABI _Rp operator()() {
+    return [&]<size_t... _Indices>(__index_sequence<_Indices...>) -> _Rp {
+      return std::__invoke(std::move(std::get<_Indices>(__f_))...);
+    }(__index_sequence_for<_Fp, _Args...>{});
   }
 };
 
diff --git a/libcxx/include/optional b/libcxx/include/optional
index 23b21364b1a79..7b979d3d6d577 100644
--- a/libcxx/include/optional
+++ b/libcxx/include/optional
@@ -186,6 +186,71 @@ namespace std {
   template<class T>
     optional(T) -> optional<T>;
 
+  template<class T>
+  class optional<T&> { // since C++26
+  public:
+    using value_type     = T;
+    using iterator       = implementation-defined;              // see [optional.ref.iterators]
+
+  public:
+    // [optional.ref.ctor], constructors
+    constexpr optional() noexcept = default;
+    constexpr optional(nullopt_t) noexcept : optional() {}
+    constexpr optional(const optional& rhs) noexcept = default;
+
+    template<class Arg>
+      constexpr explicit optional(in_place_t, Arg&& arg);
+    template<class U>
+      constexpr explicit(see below) optional(U&& u) noexcept(see below);
+    template<class U>
+      constexpr explicit(see below) optional(optional<U>& rhs) noexcept(see below);
+    template<class U>
+      constexpr explicit(see below) optional(const optional<U>& rhs) noexcept(see below);
+    template<class U>
+      constexpr explicit(see below) optional(optional<U>&& rhs) noexcept(see below);
+    template<class U>
+      constexpr explicit(see below) optional(const optional<U>&& rhs) noexcept(see below);
+
+    constexpr ~optional() = default;
+
+    // [optional.ref.assign], assignment
+    constexpr optional& operator=(nullopt_t) noexcept;
+    constexpr optional& operator=(const optional& rhs) noexcept = default;
+
+    template<class U> constexpr T& emplace(U&& u) noexcept(see below);
+
+    // [optional.ref.swap], swap
+    constexpr void swap(optional& rhs) noexcept;
+
+    // [optional.ref.iterators], iterator support
+    constexpr iterator begin() const noexcept;
+    constexpr iterator end() const noexcept;
+
+    // [optional.ref.observe], observers
+    constexpr T*       operator->() const noexcept;
+    constexpr T&       operator*() const noexcept;
+    constexpr explicit operator bool() const noexcept;
+    constexpr bool     has_value() const noexcept;
+    constexpr T&       value() const;                           // freestanding-deleted
+    template<class U = remove_cv_t<T>>
+      constexpr remove_cv_t<T> value_or(U&& u) const;
+
+    // [optional.ref.monadic], monadic operations
+    template<class F> constexpr auto and_then(F&& f) const;
+    template<class F> constexpr optional<invoke_result_t<F, T&>> transform(F&& f) const;
+    template<class F> constexpr optional or_else(F&& f) const;
+
+    // [optional.ref.mod], modifiers
+    constexpr void reset() noexcept;
+
+  private:
+    T* val = nullptr;                                           // exposition only
+
+    // [optional.ref.expos], exposition only helper functions
+    template<class U>
+      constexpr void convert-ref-init-val(U&& u);               // exposition only
+  };
+
 } // namespace std
 
 */
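Since no released libc++ implements optional<T&> yet, the following is only a toy illustration (not libc++ code; the class name optional_ref is made up) of the reference semantics the synopsis above describes, modeled on the same exposition-only pointer member:

    #include <cassert>

    // Toy, pointer-backed stand-in for C++26 optional<T&>.
    template <class T>
    class optional_ref {
      T* val_ = nullptr; // mirrors the exposition-only "T* val" above
    public:
      constexpr optional_ref() noexcept = default;
      constexpr optional_ref(T& r) noexcept : val_(&r) {} // binds, never copies T
      constexpr bool has_value() const noexcept { return val_ != nullptr; }
      constexpr T& operator*() const noexcept { return *val_; }
      constexpr void reset() noexcept { val_ = nullptr; }
    };

    int main() {
      int x = 1;
      optional_ref<int> o;      // disengaged
      assert(!o.has_value());
      o = optional_ref<int>(x); // copy assignment rebinds the reference
      *o = 5;                   // writes through to x, no copy of the int
      assert(x == 5);
      o.reset();
      assert(!o.has_value());
      return 0;
    }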
diff --git a/libcxx/include/scoped_allocator b/libcxx/include/scoped_allocator
index 74effc547f3e2..c72c470f0c541 100644
--- a/libcxx/include/scoped_allocator
+++ b/libcxx/include/scoped_allocator
@@ -434,10 +434,10 @@ public:
         piecewise_construct,
         __transform_tuple(typename __uses_alloc_ctor< _T1, inner_allocator_type&, _Args1... >::type(),
                           std::move(__x),
-                          __make_index_sequence<sizeof...(_Args1)>()),
+                          __index_sequence_for<_Args1...>()),
         __transform_tuple(typename __uses_alloc_ctor< _T2, inner_allocator_type&, _Args2... >::type(),
                           std::move(__y),
-                          __make_index_sequence<sizeof...(_Args2)>()));
+                          __index_sequence_for<_Args2...>()));
   }
 
   template <class _T1, class _T2>
diff --git a/libcxx/include/tuple b/libcxx/include/tuple
index caa473012a7c4..670b90fc7b3b9 100644
--- a/libcxx/include/tuple
+++ b/libcxx/include/tuple
@@ -576,7 +576,7 @@ __memberwise_forward_assign(_Dest& __dest, _Source&& __source, __type_list<_Up..
 
 template <class... _Tp>
 class _LIBCPP_NO_SPECIALIZATIONS tuple {
-  typedef __tuple_impl<__make_index_sequence<sizeof...(_Tp)>, _Tp...> _BaseT;
+  typedef __tuple_impl<__index_sequence_for<_Tp...>, _Tp...> _BaseT;
 
   _BaseT __base_;
 
@@ -858,7 +858,7 @@ public:
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 tuple&
   operator=(_If<_And<is_copy_assignable<_Tp>...>::value, tuple, __nat> const& __tuple) noexcept(
       _And<is_nothrow_copy_assignable<_Tp>...>::value) {
-    std::__memberwise_copy_assign(*this, __tuple, __make_index_sequence<sizeof...(_Tp)>());
+    std::__memberwise_copy_assign(*this, __tuple, __index_sequence_for<_Tp...>());
     return *this;
   }
 
@@ -866,15 +866,14 @@ public:
   _LIBCPP_HIDE_FROM_ABI constexpr const tuple& operator=(tuple const& __tuple) const
     requires(_And<is_copy_assignable<const _Tp>...>::value)
   {
-    std::__memberwise_copy_assign(*this, __tuple, __make_index_sequence<sizeof...(_Tp)>());
+    std::__memberwise_copy_assign(*this, __tuple, __index_sequence_for<_Tp...>());
     return *this;
   }
 
   _LIBCPP_HIDE_FROM_ABI constexpr const tuple& operator=(tuple&& __tuple) const
     requires(_And<is_assignable<const _Tp&, _Tp>...>::value)
   {
-    std::__memberwise_forward_assign(
-        *this, std::move(__tuple), __type_list<_Tp...>(), __make_index_sequence<sizeof...(_Tp)>());
+    std::__memberwise_forward_assign(*this, std::move(__tuple), __type_list<_Tp...>(), __index_sequence_for<_Tp...>());
     return *this;
   }
 #    endif // _LIBCPP_STD_VER >= 23
@@ -882,8 +881,7 @@ public:
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 tuple&
   operator=(_If<_And<is_move_assignable<_Tp>...>::value, tuple, __nat>&& __tuple) noexcept(
       _And<is_nothrow_move_assignable<_Tp>...>::value) {
-    std::__memberwise_forward_assign(
-        *this, std::move(__tuple), __type_list<_Tp...>(), __make_index_sequence<sizeof...(_Tp)>());
+    std::__memberwise_forward_assign(*this, std::move(__tuple), __type_list<_Tp...>(), __index_sequence_for<_Tp...>());
     return *this;
   }
 
@@ -893,7 +891,7 @@ public:
                      int> = 0>
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 tuple&
   operator=(tuple<_Up...> const& __tuple) noexcept(_And<is_nothrow_assignable<_Tp&, _Up const&>...>::value) {
-    std::__memberwise_copy_assign(*this, __tuple, __make_index_sequence<sizeof...(_Tp)>());
+    std::__memberwise_copy_assign(*this, __tuple, __index_sequence_for<_Tp...>());
     return *this;
   }
 
@@ -902,8 +900,7 @@ public:
                            int> = 0>
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 tuple&
   operator=(tuple<_Up...>&& __tuple) noexcept(_And<is_nothrow_assignable<_Tp&, _Up>...>::value) {
-    std::__memberwise_forward_assign(
-        *this, std::move(__tuple), __type_list<_Up...>(), __make_index_sequence<sizeof...(_Tp)>());
+    std::__memberwise_forward_assign(*this, std::move(__tuple), __type_list<_Up...>(), __index_sequence_for<_Tp...>());
     return *this;
   }
 
@@ -912,7 +909,7 @@ public:
             enable_if_t< _And<_BoolConstant<sizeof...(_Tp) == sizeof...(_UTypes)>,
                               is_assignable<const _Tp&, const _UTypes&>...>::value>* = nullptr>
   _LIBCPP_HIDE_FROM_ABI constexpr const tuple& operator=(const tuple<_UTypes...>& __u) const {
-    std::__memberwise_copy_assign(*this, __u, __make_index_sequence<sizeof...(_Tp)>());
+    std::__memberwise_copy_assign(*this, __u, __index_sequence_for<_Tp...>());
     return *this;
   }
 
@@ -920,7 +917,7 @@ public:
             enable_if_t< _And<_BoolConstant<sizeof...(_Tp) == sizeof...(_UTypes)>,
                               is_assignable<const _Tp&, _UTypes>...>::value>* = nullptr>
   _LIBCPP_HIDE_FROM_ABI constexpr const tuple& operator=(tuple<_UTypes...>&& __u) const {
-    std::__memberwise_forward_assign(*this, __u, __type_list<_UTypes...>(), __make_index_sequence<sizeof...(_Tp)>());
+    std::__memberwise_forward_assign(*this, __u, __type_list<_UTypes...>(), __index_sequence_for<_Tp...>());
     return *this;
   }
 #    endif // _LIBCPP_STD_VER >= 23
@@ -986,7 +983,7 @@ public:
       __enable_if_t< _And< _BoolConstant<_Np == sizeof...(_Tp)>, is_assignable<_Tp&, _Up const&>... >::value, int> = 0>
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 tuple&
   operator=(array<_Up, _Np> const& __array) noexcept(_And<is_nothrow_assignable<_Tp&, _Up const&>...>::value) {
-    std::__memberwise_copy_assign(*this, __array, __make_index_sequence<sizeof...(_Tp)>());
+    std::__memberwise_copy_assign(*this, __array, __index_sequence_for<_Tp...>());
     return *this;
   }
 
@@ -998,7 +995,7 @@ public:
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 tuple&
   operator=(array<_Up, _Np>&& __array) noexcept(_And<is_nothrow_assignable<_Tp&, _Up>...>::value) {
     std::__memberwise_forward_assign(
-        *this, std::move(__array), __type_list<_If<true, _Up, _Tp>...>(), __make_index_sequence<sizeof...(_Tp)>());
+        *this, std::move(__array), __type_list<_If<true, _Up, _Tp>...>(), __index_sequence_for<_Tp...>());
     return *this;
   }
 
diff --git a/libcxx/src/include/from_chars_floating_point.h b/libcxx/src/include/from_chars_floating_point.h
index 19eeeb28fb08d..20f23d2bc267d 100644
--- a/libcxx/src/include/from_chars_floating_point.h
+++ b/libcxx/src/include/from_chars_floating_point.h
@@ -9,11 +9,6 @@
 #ifndef _LIBCPP_SRC_INCLUDE_FROM_CHARS_FLOATING_POINT_H
 #define _LIBCPP_SRC_INCLUDE_FROM_CHARS_FLOATING_POINT_H
 
-// These headers are in the shared LLVM-libc header library.
-#include "shared/fp_bits.h"
-#include "shared/str_to_float.h"
-#include "shared/str_to_integer.h"
-
 #include <__assert>
 #include <__config>
 #include <cctype>
@@ -21,6 +16,15 @@
 #include <concepts>
 #include <limits>
 
+// Make sure we use libc++'s assertion machinery within the shared code we use
+// from LLVM libc.
+#define LIBC_ASSERT(cond) _LIBCPP_ASSERT((cond), _LIBCPP_TOSTRING(cond))
+
+// These headers are in the shared LLVM-libc header library.
+#include "shared/fp_bits.h"
+#include "shared/str_to_float.h"
+#include "shared/str_to_integer.h"
+
 // Included for the _Floating_type_traits class
 #include "to_chars_floating_point.h"
 
diff --git a/libcxx/test/extensions/gnu/hash_map/copy.pass.cpp b/libcxx/test/extensions/gnu/hash_map/copy.pass.cpp
new file mode 100644
index 0000000000000..65b8debda0676
--- /dev/null
+++ b/libcxx/test/extensions/gnu/hash_map/copy.pass.cpp
@@ -0,0 +1,27 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// ADDITIONAL_COMPILE_FLAGS: -Wno-deprecated
+
+// hash_map::hash_map(const hash_map&)
+
+#include <cassert>
+#include <ext/hash_map>
+
+int main(int, char**) {
+  __gnu_cxx::hash_map<int, int> map;
+
+  map.insert(std::make_pair(1, 1));
+  map.insert(std::make_pair(2, 1));
+
+  auto map2 = map;
+
+  assert(map2.size() == 2);
+
+  return 0;
+}
diff --git a/libcxx/test/extensions/gnu/hash_set/copy.pass.cpp b/libcxx/test/extensions/gnu/hash_set/copy.pass.cpp
new file mode 100644
index 0000000000000..95a3579194923
--- /dev/null
+++ b/libcxx/test/extensions/gnu/hash_set/copy.pass.cpp
@@ -0,0 +1,27 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// ADDITIONAL_COMPILE_FLAGS: -Wno-deprecated
+
+// hash_set::hash_set(const hash_set&)
+
+#include <cassert>
+#include <ext/hash_set>
+
+int main(int, char**) {
+  __gnu_cxx::hash_set<int> set;
+
+  set.insert(1);
+  set.insert(2);
+
+  auto set2 = set;
+
+  assert(set2.size() == 2);
+
+  return 0;
+}
diff --git a/libcxx/test/selftest/dsl/lit.local.cfg b/libcxx/test/selftest/dsl/lit.local.cfg
index dc6887ad7e48b..73e1c4db9ca4e 100644
--- a/libcxx/test/selftest/dsl/lit.local.cfg
+++ b/libcxx/test/selftest/dsl/lit.local.cfg
@@ -10,6 +10,6 @@
 # within the test.
 import base64, lit.util, pickle
 
-base64Encode = lambda s: lit.util.to_string(base64.b64encode(lit.util.to_bytes(s)))
+base64Encode = lambda s: base64.b64encode(s).decode("utf-8")
 escapedSubstitutions = base64Encode(pickle.dumps(config.substitutions))
 config.substitutions.append(("%{substitutions}", escapedSubstitutions))
diff --git a/libcxx/utils/ci/buildkite-pipeline.yml b/libcxx/utils/ci/buildkite-pipeline.yml
index 2ac69c38ebffa..1938d9a67af28 100644
--- a/libcxx/utils/ci/buildkite-pipeline.yml
+++ b/libcxx/utils/ci/buildkite-pipeline.yml
@@ -37,6 +37,9 @@ steps:
   steps:
   - label: AArch64
     command: libcxx/utils/ci/run-buildbot aarch64
+    env:
+      CC: cc
+      CXX: c++
     agents:
       queue: libcxx-builders-linaro-arm
       arch: aarch64
@@ -44,6 +47,9 @@ steps:
 
   - label: AArch64 -fno-exceptions
     command: libcxx/utils/ci/run-buildbot aarch64-no-exceptions
+    env:
+      CC: cc
+      CXX: c++
     agents:
       queue: libcxx-builders-linaro-arm
       arch: aarch64
@@ -51,6 +57,9 @@ steps:
 
   - label: Armv8
     command: libcxx/utils/ci/run-buildbot armv8
+    env:
+      CC: cc
+      CXX: c++
     agents:
       queue: libcxx-builders-linaro-arm
       arch: armv8l
@@ -58,6 +67,9 @@ steps:
 
   - label: Armv8 -fno-exceptions
     command: libcxx/utils/ci/run-buildbot armv8-no-exceptions
+    env:
+      CC: cc
+      CXX: c++
     agents:
       queue: libcxx-builders-linaro-arm
       arch: armv8l
@@ -65,6 +77,9 @@ steps:
 
   - label: Armv7
     command: libcxx/utils/ci/run-buildbot armv7
+    env:
+      CC: cc
+      CXX: c++
     agents:
       queue: libcxx-builders-linaro-arm
       arch: armv8l
@@ -72,6 +87,9 @@ steps:
 
   - label: Armv7 -fno-exceptions
     command: libcxx/utils/ci/run-buildbot armv7-no-exceptions
+    env:
+      CC: cc
+      CXX: c++
     agents:
       queue: libcxx-builders-linaro-arm
       arch: armv8l
@@ -79,6 +97,9 @@ steps:
 
   - label: Armv7-M picolibc
     command: libcxx/utils/ci/run-buildbot armv7m-picolibc
+    env:
+      CC: cc
+      CXX: c++
     agents:
       queue: libcxx-builders-linaro-arm
       arch: aarch64
@@ -86,6 +107,9 @@ steps:
 
   - label: Armv7-M picolibc -fno-exceptions
     command: libcxx/utils/ci/run-buildbot armv7m-picolibc-no-exceptions
+    env:
+      CC: cc
+      CXX: c++
     agents:
       queue: libcxx-builders-linaro-arm
       arch: aarch64
@@ -131,6 +155,9 @@ steps:
   steps:
   - label: Android 5.0, x86 NDK
     command: libcxx/utils/ci/run-buildbot android-ndk-21-def-x86
+    env:
+      CC: /opt/android/clang/clang-current/bin/clang
+      CXX: /opt/android/clang/clang-current/bin/clang++
     agents:
       queue: libcxx-builders
       os: android
@@ -138,6 +165,9 @@ steps:
 
   - label: Android 13, x86_64 NDK
     command: libcxx/utils/ci/run-buildbot android-ndk-33-goog-x86_64
+    env:
+      CC: /opt/android/clang/clang-current/bin/clang
+      CXX: /opt/android/clang/clang-current/bin/clang++
     agents:
       queue: libcxx-builders
       os: android
diff --git a/libcxx/utils/ci/run-buildbot b/libcxx/utils/ci/run-buildbot
index 8ab6a94e0255f..b8ff9475b844e 100755
--- a/libcxx/utils/ci/run-buildbot
+++ b/libcxx/utils/ci/run-buildbot
@@ -30,17 +30,41 @@ ${PROGNAME} [options] <BUILDER>
 
 Environment variables
 CC                  The C compiler to use, this value is used by CMake. This
-                    variable is optional.
+                    variable is mandatory.
 
 CXX                 The C++ compiler to use, this value is used by CMake. This
-                    variable is optional.
-
-CLANG_FORMAT        The clang-format binary to use when generating the format
-                    ignore list.
+                    variable is mandatory.
 
+CCACHE              The ccache binary to use. This variable is optional and is only
+                    used by the bootstrapping build.
 EOF
 }
 
+function step() {
+  endstep
+  set +x
+  if [[ ! -z ${GITHUB_ACTIONS+x} ]]; then
+    echo "::group::$1"
+    export IN_GROUP=1
+  else
+    echo "--- $1"
+  fi
+  set -x
+}
+
+function endstep() {
+  set +x
+  if [[ ! -z ${GITHUB_ACTIONS+x} ]] && [[ ! -z ${IN_GROUP+x} ]]; then
+    echo "::endgroup::"
+    unset IN_GROUP
+  fi
+  set -x
+}
+
+function error() {
+    echo "::error::$1"
+}
+
 if [[ $# == 0 ]]; then
    usage
    exit 0
@@ -71,30 +95,22 @@ MONOREPO_ROOT="${MONOREPO_ROOT:="$(git rev-parse --show-toplevel)"}"
 BUILD_DIR="${BUILD_DIR:=${MONOREPO_ROOT}/build/${BUILDER}}"
 INSTALL_DIR="${BUILD_DIR}/install"
 
-function step() {
-  endstep
-  set +x
-  if [[ ! -z ${GITHUB_ACTIONS+x} ]]; then
-    echo "::group::$1"
-    export IN_GROUP=1
-  else
-    echo "--- $1"
-  fi
-  set -x
-}
+if [ -z ${CC+x} ]; then
+    error "Environment variable CC must be defined"
+    exit 1
+fi
 
-function endstep() {
-  set +x
-  if [[ ! -z ${GITHUB_ACTIONS+x} ]] && [[ ! -z ${IN_GROUP+x} ]]; then
-    echo "::endgroup::"
-    unset IN_GROUP
-  fi
-  set -x
-}
+if [ -z ${CXX+x} ]; then
+    error "Environment variable CXX must be defined"
+    exit 1
+fi
 
-function error() {
-    echo "::error::$1"
-}
+# Print the version of a few tools to aid diagnostics in some cases
+step "Diagnose tools in use"
+cmake --version
+ninja --version
+${CC} --version
+${CXX} --version
 
 function clean() {
     rm -rf "${BUILD_DIR}"
@@ -127,11 +143,7 @@ function generate-cmake() {
 }
 
 function generate-cmake-libcxx-win() {
-    generate-cmake-base \
-          -DLLVM_ENABLE_RUNTIMES="libcxx" \
-          -DCMAKE_C_COMPILER=clang-cl \
-          -DCMAKE_CXX_COMPILER=clang-cl \
-          "${@}"
+    generate-cmake-base -DLLVM_ENABLE_RUNTIMES="libcxx" "${@}"
 }
 
 function generate-cmake-android() {
@@ -216,12 +228,6 @@ function test-armv7m-picolibc() {
     check-runtimes
 }
 
-# Print the version of a few tools to aid diagnostics in some cases
-step "Diagnose tools in use"
-cmake --version
-ninja --version
-if [ ! -z "${CXX}" ]; then ${CXX} --version; fi
-
 case "${BUILDER}" in
 check-generated-output)
     # `! foo` doesn't work properly with `set -e`, use `! foo || false` instead.
@@ -358,12 +364,16 @@ generic-ubsan)
 bootstrapping-build)
     clean
 
+    if [ ! -z ${CCACHE+x} ]; then
+        COMPILER_LAUNCHER="-DCMAKE_CXX_COMPILER_LAUNCHER=${CCACHE}"
+    fi
+
     step "Generating CMake"
     cmake \
           -S "${MONOREPO_ROOT}/llvm" \
           -B "${BUILD_DIR}" \
           -GNinja \
-          -DCMAKE_CXX_COMPILER_LAUNCHER="ccache" \
+          ${COMPILER_LAUNCHER} \
           -DCMAKE_BUILD_TYPE=Release \
           -DCMAKE_INSTALL_PREFIX="${INSTALL_DIR}" \
           -DLLVM_ENABLE_PROJECTS="clang;lldb" \
@@ -691,14 +701,6 @@ mingw-static)
           -DLIBUNWIND_ENABLE_SHARED=OFF
     check-runtimes
 ;;
-mingw-dll-i686)
-    clean
-    generate-cmake \
-          -DCMAKE_C_COMPILER=i686-w64-mingw32-clang \
-          -DCMAKE_CXX_COMPILER=i686-w64-mingw32-clang++ \
-          -C "${MONOREPO_ROOT}/libcxx/cmake/caches/MinGW.cmake"
-    check-runtimes
-;;
 mingw-incomplete-sysroot)
     # When bringing up a new cross compiler from scratch, we build
     # libunwind/libcxx in a setup where the toolchain is incomplete and
@@ -743,10 +745,6 @@ android-ndk-*)
     fi
     ARCH=$(arch_of_emu_img ${ANDROID_EMU_IMG})
 
-    # Use the Android compiler by default.
-    export CC=${CC:-/opt/android/clang/clang-current/bin/clang}
-    export CXX=${CXX:-/opt/android/clang/clang-current/bin/clang++}
-
     # The NDK libc++_shared.so is always built against the oldest supported API
     # level. When tests are run against a device with a newer API level, test
     # programs can be built for any supported API level, but building for the
diff --git a/libcxx/utils/ci/run-buildbot-container b/libcxx/utils/ci/run-buildbot-container
index 33a00a9c90671..fa83d1db4f40f 100755
--- a/libcxx/utils/ci/run-buildbot-container
+++ b/libcxx/utils/ci/run-buildbot-container
@@ -26,6 +26,6 @@ if [[ ! -d "${MONOREPO_ROOT}/libcxx/utils/ci" ]]; then
     echo "Was unable to find the root of the LLVM monorepo; are you running from within the monorepo?"
     exit 1
 fi
-docker pull ghcr.io/llvm/libcxx-linux-builder:b060022103f551d8ca1dad84122ef73927c86512
-docker run -it --volume "${MONOREPO_ROOT}:/llvm" --workdir "/llvm" --cap-add=SYS_PTRACE ghcr.io/llvm/libcxx-linux-builder:b060022103f551d8ca1dad84122ef73927c86512 \
+docker pull ghcr.io/llvm/libcxx-linux-builder:d6b22a347f813cf4a9832627323a43074f57bbcf
+docker run -it --volume "${MONOREPO_ROOT}:/llvm" --workdir "/llvm" --cap-add=SYS_PTRACE ghcr.io/llvm/libcxx-linux-builder:d6b22a347f813cf4a9832627323a43074f57bbcf \
     bash -c 'git config --global --add safe.directory /llvm ; exec bash'
diff --git a/libsycl/README.md b/libsycl/README.md
index 9fed3fdc4efa1..1ef6505bf7a03 100644
--- a/libsycl/README.md
+++ b/libsycl/README.md
@@ -17,4 +17,4 @@ TODO
 
 # License
 
-See [LICENSE](./LICENSE.TXT) for details.
+See [LICENSE](./LICENSE.txt) for details.
diff --git a/libunwind/test/aarch64_vg_unwind.pass.cpp b/libunwind/test/aarch64_vg_unwind.pass.cpp
index 1c139a7ae9e41..d0c623b155092 100644
--- a/libunwind/test/aarch64_vg_unwind.pass.cpp
+++ b/libunwind/test/aarch64_vg_unwind.pass.cpp
@@ -6,7 +6,8 @@
 //
 //===----------------------------------------------------------------------===//
 
-// REQUIRES: linux && target={{aarch64-.+}}
+// REQUIRES: target={{aarch64-.+}}
+// UNSUPPORTED: target={{.*-windows.*}}
 
 #include <libunwind.h>
 #include <stdlib.h>
diff --git a/libunwind/test/aarch64_za_unwind.pass.cpp b/libunwind/test/aarch64_za_unwind.pass.cpp
index 2985bb8d298de..9f6b106a21fec 100644
--- a/libunwind/test/aarch64_za_unwind.pass.cpp
+++ b/libunwind/test/aarch64_za_unwind.pass.cpp
@@ -6,7 +6,8 @@
 //
 //===----------------------------------------------------------------------===//
 
-// REQUIRES: linux && target={{aarch64-.+}}
+// REQUIRES: target={{aarch64-.+}}
+// UNSUPPORTED: target={{.*-windows.*}}
 
 #include <libunwind.h>
 #include <stdint.h>
diff --git a/libunwind/test/bad_unwind_info.pass.cpp b/libunwind/test/bad_unwind_info.pass.cpp
index 272a83f64a611..332b661d2e98f 100644
--- a/libunwind/test/bad_unwind_info.pass.cpp
+++ b/libunwind/test/bad_unwind_info.pass.cpp
@@ -10,7 +10,9 @@
 // Ensure that libunwind doesn't crash on invalid info; the Linux aarch64
 // sigreturn frame check would previously attempt to access invalid memory in
 // this scenario.
-// REQUIRES: target={{(aarch64|s390x|x86_64)-.+linux.*}}
+// REQUIRES: target={{(aarch64|s390x|x86_64)-.+}}
+// UNSUPPORTED: target={{.*-windows.*}}
+// UNSUPPORTED: target={{.*-apple.*}}
 
 // GCC doesn't support __attribute__((naked)) on AArch64.
 // UNSUPPORTED: gcc
diff --git a/libunwind/test/eh_frame_fde_pc_range.pass.cpp b/libunwind/test/eh_frame_fde_pc_range.pass.cpp
index 795ce66806f28..32ddb769e6dce 100644
--- a/libunwind/test/eh_frame_fde_pc_range.pass.cpp
+++ b/libunwind/test/eh_frame_fde_pc_range.pass.cpp
@@ -13,15 +13,17 @@
 
 // clang-format off
 
-// REQUIRES: target={{x86_64-.+-linux-gnu}}
+// REQUIRES: target={{x86_64-.+}}
 // REQUIRES: objcopy-available
+// UNSUPPORTED: target={{.*-windows.*}}
+// UNSUPPORTED: target={{.*-apple.*}}
 
 // TODO: Figure out why this fails with Memory Sanitizer.
 // XFAIL: msan
 
 // RUN: %{build}
 // RUN: %{objcopy} --dump-section .eh_frame_hdr=%t_ehf_hdr.bin %t.exe
-// RUN: echo -ne '\xFF' | dd of=%t_ehf_hdr.bin bs=1 seek=2 count=2 conv=notrunc status=none
+// RUN: printf '\377' | dd of=%t_ehf_hdr.bin bs=1 seek=2 count=2 conv=notrunc status=none
 // RUN: %{objcopy} --update-section .eh_frame_hdr=%t_ehf_hdr.bin %t.exe
 // RUN: %{exec} %t.exe
 
diff --git a/libunwind/test/floatregister.pass.cpp b/libunwind/test/floatregister.pass.cpp
index 018b792bd5f1e..6be3e1f3f7385 100644
--- a/libunwind/test/floatregister.pass.cpp
+++ b/libunwind/test/floatregister.pass.cpp
@@ -7,7 +7,8 @@
 //
 //===----------------------------------------------------------------------===//
 
-// REQUIRES: linux && target={{aarch64-.+}}
+// REQUIRES: target={{aarch64-.+}}
+// UNSUPPORTED: target={{.*-windows.*}}
 
 // Basic test for float registers number are accepted.
 
diff --git a/libunwind/test/forceunwind.pass.cpp b/libunwind/test/forceunwind.pass.cpp
index 9e032fc680806..e5437c31a0f65 100644
--- a/libunwind/test/forceunwind.pass.cpp
+++ b/libunwind/test/forceunwind.pass.cpp
@@ -7,7 +7,9 @@
 //
 //===----------------------------------------------------------------------===//
 
-// REQUIRES: linux
+// UNSUPPORTED: target={{.*-apple.*}}
+// UNSUPPORTED: target={{.*-aix.*}}
+// UNSUPPORTED: target={{.*-windows.*}}
 
 // TODO: Figure out why this fails with Memory Sanitizer.
 // XFAIL: msan
diff --git a/libunwind/test/remember_state_leak.pass.sh.s b/libunwind/test/remember_state_leak.pass.sh.s
index d3335cf82290b..69be3f9595515 100644
--- a/libunwind/test/remember_state_leak.pass.sh.s
+++ b/libunwind/test/remember_state_leak.pass.sh.s
@@ -6,7 +6,9 @@
 #
 #===------------------------------------------------------------------------===#
 
-# REQUIRES: target={{x86_64-.+-linux-gnu}}
+# REQUIRES: target={{x86_64-.+}}
+# UNSUPPORTED: target={{.*-windows.*}}
+# UNSUPPORTED: target={{.*-apple.*}}
 
 # Inline assembly isn't supported by Memory Sanitizer
 # UNSUPPORTED: msan
diff --git a/libunwind/test/signal_unwind.pass.cpp b/libunwind/test/signal_unwind.pass.cpp
index 4de271ecb886b..ca50f83964c11 100644
--- a/libunwind/test/signal_unwind.pass.cpp
+++ b/libunwind/test/signal_unwind.pass.cpp
@@ -8,7 +8,9 @@
 //===----------------------------------------------------------------------===//
 
 // Ensure that the unwinder can cope with the signal handler.
-// REQUIRES: target={{(aarch64|loongarch64|riscv64|s390x|x86_64)-.+linux.*}}
+// REQUIRES: target={{(aarch64|loongarch64|riscv64|s390x|x86_64)-.+}}
+// UNSUPPORTED: target={{.*-windows.*}}
+// UNSUPPORTED: target={{.*-apple.*}}
 
 // TODO: Figure out why this fails with Memory Sanitizer.
 // XFAIL: msan
diff --git a/libunwind/test/unwind_leaffunction.pass.cpp b/libunwind/test/unwind_leaffunction.pass.cpp
index d336c159c131b..af791a6b2ed31 100644
--- a/libunwind/test/unwind_leaffunction.pass.cpp
+++ b/libunwind/test/unwind_leaffunction.pass.cpp
@@ -8,7 +8,9 @@
 //===----------------------------------------------------------------------===//
 
 // Ensure that leaf function can be unwund.
-// REQUIRES: target={{(aarch64|loongarch64|riscv64|s390x|x86_64)-.+linux.*}}
+// REQUIRES: target={{(aarch64|loongarch64|riscv64|s390x|x86_64)-.+}}
+// UNSUPPORTED: target={{.*-windows.*}}
+// UNSUPPORTED: target={{.*-apple.*}}
 
 // TODO: Figure out why this fails with Memory Sanitizer.
 // XFAIL: msan
diff --git a/libunwind/test/unwind_scalable_vectors.pass.cpp b/libunwind/test/unwind_scalable_vectors.pass.cpp
index 57ef4d78244c5..38d8bd5e002d1 100644
--- a/libunwind/test/unwind_scalable_vectors.pass.cpp
+++ b/libunwind/test/unwind_scalable_vectors.pass.cpp
@@ -7,7 +7,7 @@
 //
 //===----------------------------------------------------------------------===//
 
-// REQUIRES: linux && target={{riscv64-.+}}
+// REQUIRES: target={{riscv64-.+}}
 
 #undef NDEBUG
 #include <assert.h>
diff --git a/lld/ELF/Options.td b/lld/ELF/Options.td
index 75184de496448..c2111e58c12b9 100644
--- a/lld/ELF/Options.td
+++ b/lld/ELF/Options.td
@@ -154,7 +154,7 @@ def bp_startup_sort: JJ<"bp-startup-sort=">, MetaVarName<"[none,function]">,
 
 // Auxiliary options related to balanced partition
 defm bp_compression_sort_startup_functions: BB<"bp-compression-sort-startup-functions",
-  "When --irpgo-profile is pecified, prioritize function similarity for compression in addition to startup time", "">;
+  "When --irpgo-profile is specified, prioritize function similarity for compression in addition to startup time", "">;
 def verbose_bp_section_orderer: FF<"verbose-bp-section-orderer">,
   HelpText<"Print information on balanced partitioning">;
 
diff --git a/lld/ELF/SyntheticSections.cpp b/lld/ELF/SyntheticSections.cpp
index ea9b87952cd84..1e9d44fa37bea 100644
--- a/lld/ELF/SyntheticSections.cpp
+++ b/lld/ELF/SyntheticSections.cpp
@@ -540,43 +540,6 @@ void EhFrameSection::finalizeContents() {
   this->size = off;
 }
 
-// Returns data for .eh_frame_hdr. .eh_frame_hdr is a binary search table
-// to get an FDE from an address to which FDE is applied. This function
-// returns a list of such pairs.
-SmallVector<EhFrameSection::FdeData, 0> EhFrameSection::getFdeData() const {
-  uint8_t *buf = ctx.bufferStart + getParent()->offset + outSecOff;
-  SmallVector<FdeData, 0> ret;
-
-  uint64_t va = getPartition(ctx).ehFrameHdr->getVA();
-  for (CieRecord *rec : cieRecords) {
-    uint8_t enc = getFdeEncoding(rec->cie);
-    for (EhSectionPiece *fde : rec->fdes) {
-      uint64_t pc = getFdePc(buf, fde->outputOff, enc);
-      uint64_t fdeVA = getParent()->addr + fde->outputOff;
-      if (!isInt<32>(pc - va)) {
-        Err(ctx) << fde->sec << ": PC offset is too large: 0x"
-                 << Twine::utohexstr(pc - va);
-        continue;
-      }
-      ret.push_back({uint32_t(pc - va), uint32_t(fdeVA - va)});
-    }
-  }
-
-  // Sort the FDE list by their PC and uniqueify. Usually there is only
-  // one FDE for a PC (i.e. function), but if ICF merges two functions
-  // into one, there can be more than one FDEs pointing to the address.
-  auto less = [](const FdeData &a, const FdeData &b) {
-    return a.pcRel < b.pcRel;
-  };
-  llvm::stable_sort(ret, less);
-  auto eq = [](const FdeData &a, const FdeData &b) {
-    return a.pcRel == b.pcRel;
-  };
-  ret.erase(llvm::unique(ret, eq), ret.end());
-
-  return ret;
-}
-
 static uint64_t readFdeAddr(Ctx &ctx, uint8_t *buf, int size) {
   switch (size) {
   case DW_EH_PE_udata2:
@@ -630,14 +593,79 @@ void EhFrameSection::writeTo(uint8_t *buf) {
     }
   }
 
-  // Apply relocations. .eh_frame section contents are not contiguous
-  // in the output buffer, but relocateAlloc() still works because
-  // getOffset() takes care of discontiguous section pieces.
+  // Apply relocations to .eh_frame entries. This includes CIE personality
+  // pointers, FDE initial_location fields, and LSDA pointers.
   for (EhInputSection *s : sections)
     ctx.target->relocateEh(*s, buf);
 
-  if (getPartition(ctx).ehFrameHdr && getPartition(ctx).ehFrameHdr->getParent())
-    getPartition(ctx).ehFrameHdr->write();
+  EhFrameHeader *hdr = getPartition(ctx).ehFrameHdr.get();
+  if (!hdr || !hdr->getParent())
+    return;
+
+  // Write the .eh_frame_hdr section, which contains a binary search table of
+  // pointers to FDEs. This must be written after .eh_frame relocation since
+  // the content depends on relocated initial_location fields in FDEs.
+  using FdeData = EhFrameSection::FdeData;
+  SmallVector<FdeData, 0> fdes;
+  uint64_t va = hdr->getVA();
+  for (CieRecord *rec : cieRecords) {
+    uint8_t enc = getFdeEncoding(rec->cie);
+    for (EhSectionPiece *fde : rec->fdes) {
+      uint64_t pc = getFdePc(buf, fde->outputOff, enc);
+      uint64_t fdeVA = getParent()->addr + fde->outputOff;
+      if (!isInt<32>(pc - va)) {
+        Err(ctx) << fde->sec << ": PC offset is too large: 0x"
+                 << Twine::utohexstr(pc - va);
+        continue;
+      }
+      fdes.push_back({uint32_t(pc - va), uint32_t(fdeVA - va)});
+    }
+  }
+
+  // Sort the FDE list by PC and uniquify it. Usually there is only one FDE
+  // per PC (i.e. per function), but if ICF merges two functions into one,
+  // more than one FDE can point to the same address.
+  llvm::stable_sort(fdes, [](const FdeData &a, const FdeData &b) {
+    return a.pcRel < b.pcRel;
+  });
+  fdes.erase(
+      llvm::unique(fdes, [](auto &a, auto &b) { return a.pcRel == b.pcRel; }),
+      fdes.end());
+
+  // Write header.
+  uint8_t *hdrBuf = ctx.bufferStart + hdr->getParent()->offset + hdr->outSecOff;
+  hdrBuf[0] = 1;                                  // version
+  hdrBuf[1] = DW_EH_PE_pcrel | DW_EH_PE_sdata4;   // eh_frame_ptr_enc
+  hdrBuf[2] = DW_EH_PE_udata4;                    // fde_count_enc
+  hdrBuf[3] = DW_EH_PE_datarel | DW_EH_PE_sdata4; // table_enc
+  write32(ctx, hdrBuf + 4,
+          getParent()->addr - hdr->getVA() - 4); // eh_frame_ptr
+  write32(ctx, hdrBuf + 8, fdes.size());         // fde_count
+  hdrBuf += 12;
+
+  // Write binary search table. Each entry describes the starting PC and the FDE
+  // address.
+  for (FdeData &fde : fdes) {
+    write32(ctx, hdrBuf, fde.pcRel);
+    write32(ctx, hdrBuf + 4, fde.fdeVARel);
+    hdrBuf += 8;
+  }
+}
+
+EhFrameHeader::EhFrameHeader(Ctx &ctx)
+    : SyntheticSection(ctx, ".eh_frame_hdr", SHT_PROGBITS, SHF_ALLOC, 4) {}
+
+void EhFrameHeader::writeTo(uint8_t *buf) {
+  // The section content is written during EhFrameSection::writeTo.
+}
+
+size_t EhFrameHeader::getSize() const {
+  // .eh_frame_hdr has a 12-byte header followed by one 8-byte entry per FDE.
+  return 12 + getPartition(ctx).ehFrame->numFdes * 8;
+}
+
+bool EhFrameHeader::isNeeded() const {
+  return isLive() && getPartition(ctx).ehFrame->isNeeded();
 }
 
 GotSection::GotSection(Ctx &ctx)
@@ -3658,51 +3686,6 @@ void GdbIndexSection::writeTo(uint8_t *buf) {
 
 bool GdbIndexSection::isNeeded() const { return !chunks.empty(); }
 
-EhFrameHeader::EhFrameHeader(Ctx &ctx)
-    : SyntheticSection(ctx, ".eh_frame_hdr", SHT_PROGBITS, SHF_ALLOC, 4) {}
-
-void EhFrameHeader::writeTo(uint8_t *buf) {
-  // Unlike most sections, the EhFrameHeader section is written while writing
-  // another section, namely EhFrameSection, which calls the write() function
-  // below from its writeTo() function. This is necessary because the contents
-  // of EhFrameHeader depend on the relocated contents of EhFrameSection and we
-  // don't know which order the sections will be written in.
-}
-
-// .eh_frame_hdr contains a binary search table of pointers to FDEs.
-// Each entry of the search table consists of two values,
-// the starting PC from where FDEs covers, and the FDE's address.
-// It is sorted by PC.
-void EhFrameHeader::write() {
-  uint8_t *buf = ctx.bufferStart + getParent()->offset + outSecOff;
-  using FdeData = EhFrameSection::FdeData;
-  SmallVector<FdeData, 0> fdes = getPartition(ctx).ehFrame->getFdeData();
-
-  buf[0] = 1;
-  buf[1] = DW_EH_PE_pcrel | DW_EH_PE_sdata4;
-  buf[2] = DW_EH_PE_udata4;
-  buf[3] = DW_EH_PE_datarel | DW_EH_PE_sdata4;
-  write32(ctx, buf + 4,
-          getPartition(ctx).ehFrame->getParent()->addr - this->getVA() - 4);
-  write32(ctx, buf + 8, fdes.size());
-  buf += 12;
-
-  for (FdeData &fde : fdes) {
-    write32(ctx, buf, fde.pcRel);
-    write32(ctx, buf + 4, fde.fdeVARel);
-    buf += 8;
-  }
-}
-
-size_t EhFrameHeader::getSize() const {
-  // .eh_frame_hdr has a 12 bytes header followed by an array of FDEs.
-  return 12 + getPartition(ctx).ehFrame->numFdes * 8;
-}
-
-bool EhFrameHeader::isNeeded() const {
-  return isLive() && getPartition(ctx).ehFrame->isNeeded();
-}
-
 VersionDefinitionSection::VersionDefinitionSection(Ctx &ctx)
     : SyntheticSection(ctx, ".gnu.version_d", SHT_GNU_verdef, SHF_ALLOC,
                        sizeof(uint32_t)) {}
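
The rewritten EhFrameSection::writeTo above now emits the .eh_frame_hdr contents inline: a 12-byte header followed by one 8-byte search-table entry per FDE, matching EhFrameHeader::getSize(). For illustration only (not part of this patch; the helper name is an assumption, the DW_EH_PE_* values are the standard DWARF EH encodings), the same layout can be packed like this:

```python
import struct

# Standard DWARF EH pointer-encoding values used in the .eh_frame_hdr header.
DW_EH_PE_udata4 = 0x03
DW_EH_PE_sdata4 = 0x0B
DW_EH_PE_pcrel = 0x10
DW_EH_PE_datarel = 0x30


def build_eh_frame_hdr(eh_frame_ptr_rel, fdes):
    """fdes: (pc_rel, fde_va_rel) pairs, sorted and deduplicated by pc_rel;
    both values are 32-bit offsets relative to the .eh_frame_hdr address."""
    header = struct.pack(
        "<BBBBiI",
        1,                                   # version
        DW_EH_PE_pcrel | DW_EH_PE_sdata4,    # eh_frame_ptr_enc
        DW_EH_PE_udata4,                     # fde_count_enc
        DW_EH_PE_datarel | DW_EH_PE_sdata4,  # table_enc
        eh_frame_ptr_rel,                    # eh_frame_ptr (PC-relative, signed)
        len(fdes),                           # fde_count
    )
    # Binary search table: one signed 32-bit pair per FDE.
    table = b"".join(struct.pack("<ii", pc, fde) for pc, fde in fdes)
    return header + table  # 12 + 8 * len(fdes) bytes, as in getSize()
```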
diff --git a/lld/ELF/SyntheticSections.h b/lld/ELF/SyntheticSections.h
index 66c866d7e8cde..e01a5ad8abc60 100644
--- a/lld/ELF/SyntheticSections.h
+++ b/lld/ELF/SyntheticSections.h
@@ -68,7 +68,6 @@ class EhFrameSection final : public SyntheticSection {
     uint32_t fdeVARel;
   };
 
-  SmallVector<FdeData, 0> getFdeData() const;
   ArrayRef<CieRecord *> getCieRecords() const { return cieRecords; }
   template <class ELFT>
   void iterateFDEWithLSDA(llvm::function_ref<void(InputSection &)> fn);
@@ -95,6 +94,17 @@ class EhFrameSection final : public SyntheticSection {
   llvm::DenseMap<std::pair<ArrayRef<uint8_t>, Symbol *>, CieRecord *> cieMap;
 };
 
+// .eh_frame_hdr contains a binary search table for .eh_frame FDEs. The section
+// is covered by a PT_GNU_EH_FRAME segment, which allows the runtime unwinder to
+// locate it via functions like `dl_iterate_phdr`.
+class EhFrameHeader final : public SyntheticSection {
+public:
+  EhFrameHeader(Ctx &);
+  void writeTo(uint8_t *buf) override;
+  size_t getSize() const override;
+  bool isNeeded() const override;
+};
+
 class GotSection final : public SyntheticSection {
 public:
   GotSection(Ctx &);
@@ -967,24 +977,6 @@ class GdbIndexSection final : public SyntheticSection {
   size_t size;
 };
 
-// --eh-frame-hdr option tells linker to construct a header for all the
-// .eh_frame sections. This header is placed to a section named .eh_frame_hdr
-// and also to a PT_GNU_EH_FRAME segment.
-// At runtime the unwinder then can find all the PT_GNU_EH_FRAME segments by
-// calling dl_iterate_phdr.
-// This section contains a lookup table for quick binary search of FDEs.
-// Detailed info about internals can be found in Ian Lance Taylor's blog:
-// http://www.airs.com/blog/archives/460 (".eh_frame")
-// http://www.airs.com/blog/archives/462 (".eh_frame_hdr")
-class EhFrameHeader final : public SyntheticSection {
-public:
-  EhFrameHeader(Ctx &);
-  void write();
-  void writeTo(uint8_t *buf) override;
-  size_t getSize() const override;
-  bool isNeeded() const override;
-};
-
 // For more information about .gnu.version and .gnu.version_r see:
 // https://www.akkadia.org/drepper/symbol-versioning
 
diff --git a/lld/MachO/UnwindInfoSection.cpp b/lld/MachO/UnwindInfoSection.cpp
index 6e9f6c2aba749..bf01b12d11dfd 100644
--- a/lld/MachO/UnwindInfoSection.cpp
+++ b/lld/MachO/UnwindInfoSection.cpp
@@ -153,8 +153,6 @@ class UnwindInfoSectionImpl final : public UnwindInfoSection {
   // The entries here will be in the same order as their originating symbols
   // in symbolsVec.
   std::vector<CompactUnwindEntry> cuEntries;
-  // Indices into the cuEntries vector.
-  std::vector<size_t> cuIndices;
   std::vector<Symbol *> personalities;
   SmallDenseMap<std::pair<InputSection *, uint64_t /* addend */>, Symbol *>
       personalityTable;
@@ -400,8 +398,7 @@ void UnwindInfoSectionImpl::relocateCompactUnwind(
 // There should only be a handful of unique personality pointers, so we can
 // encode them as 2-bit indices into a small array.
 void UnwindInfoSectionImpl::encodePersonalities() {
-  for (size_t idx : cuIndices) {
-    CompactUnwindEntry &cu = cuEntries[idx];
+  for (CompactUnwindEntry &cu : cuEntries) {
     if (cu.personality == nullptr)
       continue;
     // Linear search is fast enough for a small array.
@@ -467,27 +464,24 @@ void UnwindInfoSectionImpl::finalize() {
   symbolsVec = symbols.takeVector();
   relocateCompactUnwind(cuEntries);
 
-  // Rather than sort & fold the 32-byte entries directly, we create a
-  // vector of indices to entries and sort & fold that instead.
-  cuIndices.resize(cuEntries.size());
-  std::iota(cuIndices.begin(), cuIndices.end(), 0);
-  llvm::sort(cuIndices, [&](size_t a, size_t b) {
-    return cuEntries[a].functionAddress < cuEntries[b].functionAddress;
+  // Sort the entries by address.
+  llvm::sort(cuEntries, [&](auto &a, auto &b) {
+    return a.functionAddress < b.functionAddress;
   });
 
   // Record the ending boundary before we fold the entries.
-  cueEndBoundary = cuEntries[cuIndices.back()].functionAddress +
-                   cuEntries[cuIndices.back()].functionLength;
+  cueEndBoundary =
+      cuEntries.back().functionAddress + cuEntries.back().functionLength;
 
   // Fold adjacent entries with matching encoding+personality and without LSDA
-  // We use three iterators on the same cuIndices to fold in-situ:
+  // We use three iterators to fold in-situ:
   // (1) `foldBegin` is the first of a potential sequence of matching entries
   // (2) `foldEnd` is the first non-matching entry after `foldBegin`.
   // The semi-open interval [ foldBegin .. foldEnd ) contains a range
   // entries that can be folded into a single entry and written to ...
   // (3) `foldWrite`
-  auto foldWrite = cuIndices.begin();
-  for (auto foldBegin = cuIndices.begin(); foldBegin < cuIndices.end();) {
+  auto foldWrite = cuEntries.begin();
+  for (auto foldBegin = cuEntries.begin(); foldBegin != cuEntries.end();) {
     auto foldEnd = foldBegin;
     // Common LSDA encodings (e.g. for C++ and Objective-C) contain offsets from
     // a base address. The base address is normally not contained directly in
@@ -503,9 +497,9 @@ void UnwindInfoSectionImpl::finalize() {
     // directly in the LSDA, two functions at different addresses would
     // necessarily have different LSDAs, so their CU entries would not have been
     // folded anyway.
-    while (++foldEnd < cuIndices.end() &&
-           cuEntries[*foldBegin].encoding == cuEntries[*foldEnd].encoding &&
-           !cuEntries[*foldBegin].lsda && !cuEntries[*foldEnd].lsda &&
+    while (++foldEnd != cuEntries.end() &&
+           foldBegin->encoding == foldEnd->encoding && !foldBegin->lsda &&
+           !foldEnd->lsda &&
            // If we've gotten to this point, we don't have an LSDA, which should
            // also imply that we don't have a personality function, since in all
            // likelihood a personality function needs the LSDA to do anything
@@ -513,21 +507,20 @@ void UnwindInfoSectionImpl::finalize() {
            // and no LSDA though (e.g. the C++ personality __gxx_personality_v0
            // is just a no-op without LSDA), so we still check for personality
            // function equivalence to handle that case.
-           cuEntries[*foldBegin].personality ==
-               cuEntries[*foldEnd].personality &&
-           canFoldEncoding(cuEntries[*foldEnd].encoding))
+           foldBegin->personality == foldEnd->personality &&
+           canFoldEncoding(foldEnd->encoding))
       ;
     *foldWrite++ = *foldBegin;
     foldBegin = foldEnd;
   }
-  cuIndices.erase(foldWrite, cuIndices.end());
+  cuEntries.erase(foldWrite, cuEntries.end());
 
   encodePersonalities();
 
   // Count frequencies of the folded encodings
   EncodingMap encodingFrequencies;
-  for (size_t idx : cuIndices)
-    encodingFrequencies[cuEntries[idx].encoding]++;
+  for (const CompactUnwindEntry &cu : cuEntries)
+    encodingFrequencies[cu.encoding]++;
 
   // Make a vector of encodings, sorted by descending frequency
   for (const auto &frequency : encodingFrequencies)
@@ -558,21 +551,19 @@ void UnwindInfoSectionImpl::finalize() {
   //     and 127..255 references a local per-second-level-page table.
   // First we try the compact format and determine how many entries fit.
   // If more entries fit in the regular format, we use that.
-  for (size_t i = 0; i < cuIndices.size();) {
-    size_t idx = cuIndices[i];
+  for (size_t i = 0; i < cuEntries.size();) {
     secondLevelPages.emplace_back();
     SecondLevelPage &page = secondLevelPages.back();
     page.entryIndex = i;
     uint64_t functionAddressMax =
-        cuEntries[idx].functionAddress + COMPRESSED_ENTRY_FUNC_OFFSET_MASK;
+        cuEntries[i].functionAddress + COMPRESSED_ENTRY_FUNC_OFFSET_MASK;
     size_t n = commonEncodings.size();
     size_t wordsRemaining =
         SECOND_LEVEL_PAGE_WORDS -
         sizeof(unwind_info_compressed_second_level_page_header) /
             sizeof(uint32_t);
-    while (wordsRemaining >= 1 && i < cuIndices.size()) {
-      idx = cuIndices[i];
-      const CompactUnwindEntry *cuPtr = &cuEntries[idx];
+    while (wordsRemaining >= 1 && i < cuEntries.size()) {
+      const CompactUnwindEntry *cuPtr = &cuEntries[i];
       if (cuPtr->functionAddress >= functionAddressMax)
         break;
       if (commonEncodingIndexes.count(cuPtr->encoding) ||
@@ -593,21 +584,21 @@ void UnwindInfoSectionImpl::finalize() {
     // If this is not the final page, see if it's possible to fit more entries
     // by using the regular format. This can happen when there are many unique
     // encodings, and we saturated the local encoding table early.
-    if (i < cuIndices.size() &&
+    if (i < cuEntries.size() &&
         page.entryCount < REGULAR_SECOND_LEVEL_ENTRIES_MAX) {
       page.kind = UNWIND_SECOND_LEVEL_REGULAR;
       page.entryCount = std::min(REGULAR_SECOND_LEVEL_ENTRIES_MAX,
-                                 cuIndices.size() - page.entryIndex);
+                                 cuEntries.size() - page.entryIndex);
       i = page.entryIndex + page.entryCount;
     } else {
       page.kind = UNWIND_SECOND_LEVEL_COMPRESSED;
     }
   }
 
-  for (size_t idx : cuIndices) {
-    lsdaIndex[idx] = entriesWithLsda.size();
-    if (cuEntries[idx].lsda)
-      entriesWithLsda.push_back(idx);
+  for (size_t i = 0; i < cuEntries.size(); ++i) {
+    lsdaIndex[i] = entriesWithLsda.size();
+    if (cuEntries[i].lsda)
+      entriesWithLsda.push_back(i);
   }
 
   // compute size of __TEXT,__unwind_info section
@@ -626,7 +617,7 @@ void UnwindInfoSectionImpl::finalize() {
 // All inputs are relocated and output addresses are known, so write!
 
 void UnwindInfoSectionImpl::writeTo(uint8_t *buf) const {
-  assert(!cuIndices.empty() && "call only if there is unwind info");
+  assert(!cuEntries.empty() && "call only if there is unwind info");
 
   // section header
   auto *uip = reinterpret_cast<unwind_info_section_header *>(buf);
@@ -660,7 +651,7 @@ void UnwindInfoSectionImpl::writeTo(uint8_t *buf) const {
   uint64_t l2PagesOffset = level2PagesOffset;
   auto *iep = reinterpret_cast<unwind_info_section_header_index_entry *>(i32p);
   for (const SecondLevelPage &page : secondLevelPages) {
-    size_t idx = cuIndices[page.entryIndex];
+    size_t idx = page.entryIndex;
     iep->functionOffset = cuEntries[idx].functionAddress - in.header->addr;
     iep->secondLevelPagesSectionOffset = l2PagesOffset;
     iep->lsdaIndexArraySectionOffset =
@@ -695,7 +686,7 @@ void UnwindInfoSectionImpl::writeTo(uint8_t *buf) const {
   for (const SecondLevelPage &page : secondLevelPages) {
     if (page.kind == UNWIND_SECOND_LEVEL_COMPRESSED) {
       uintptr_t functionAddressBase =
-          cuEntries[cuIndices[page.entryIndex]].functionAddress;
+          cuEntries[page.entryIndex].functionAddress;
       auto *p2p =
           reinterpret_cast<unwind_info_compressed_second_level_page_header *>(
               pp);
@@ -708,8 +699,7 @@ void UnwindInfoSectionImpl::writeTo(uint8_t *buf) const {
       p2p->encodingsCount = page.localEncodings.size();
       auto *ep = reinterpret_cast<uint32_t *>(&p2p[1]);
       for (size_t i = 0; i < page.entryCount; i++) {
-        const CompactUnwindEntry &cue =
-            cuEntries[cuIndices[page.entryIndex + i]];
+        const CompactUnwindEntry &cue = cuEntries[page.entryIndex + i];
         auto it = commonEncodingIndexes.find(cue.encoding);
         if (it == commonEncodingIndexes.end())
           it = page.localEncodingIndexes.find(cue.encoding);
@@ -728,8 +718,7 @@ void UnwindInfoSectionImpl::writeTo(uint8_t *buf) const {
       p2p->entryCount = page.entryCount;
       auto *ep = reinterpret_cast<uint32_t *>(&p2p[1]);
       for (size_t i = 0; i < page.entryCount; i++) {
-        const CompactUnwindEntry &cue =
-            cuEntries[cuIndices[page.entryIndex + i]];
+        const CompactUnwindEntry &cue = cuEntries[page.entryIndex + i];
         *ep++ = cue.functionAddress;
         *ep++ = cue.encoding;
       }
diff --git a/lldb/bindings/python/python-wrapper.swig b/lldb/bindings/python/python-wrapper.swig
index ef501fbafc947..0ba152166522b 100644
--- a/lldb/bindings/python/python-wrapper.swig
+++ b/lldb/bindings/python/python-wrapper.swig
@@ -425,6 +425,18 @@ void *lldb_private::python::LLDBSWIGPython_CastPyObjectToSBBreakpoint(PyObject *
   return sb_ptr;
 }
 
+void *lldb_private::python::LLDBSWIGPython_CastPyObjectToSBThread(PyObject * data) {
+  lldb::SBThread *sb_ptr = nullptr;
+
+  int valid_cast =
+      SWIG_ConvertPtr(data, (void **)&sb_ptr, SWIGTYPE_p_lldb__SBThread, 0);
+
+  if (valid_cast == -1)
+    return NULL;
+
+  return sb_ptr;
+}
+
 void *lldb_private::python::LLDBSWIGPython_CastPyObjectToSBFrame(PyObject * data) {
   lldb::SBFrame *sb_ptr = nullptr;
 
diff --git a/lldb/docs/CMakeLists.txt b/lldb/docs/CMakeLists.txt
index 092bdc1c30f5c..bbecf606f1f8f 100644
--- a/lldb/docs/CMakeLists.txt
+++ b/lldb/docs/CMakeLists.txt
@@ -28,6 +28,7 @@ if (LLDB_ENABLE_PYTHON AND SPHINX_FOUND)
     add_custom_target(lldb-python-doc-package
       COMMAND "${CMAKE_COMMAND}" -E copy "${lldb_bindings_dir}/lldb.py" "${CMAKE_CURRENT_BINARY_DIR}/lldb/__init__.py"
       COMMAND "${CMAKE_COMMAND}" -E make_directory "${CMAKE_CURRENT_BINARY_DIR}/lldb/plugins"
+      COMMAND "${CMAKE_COMMAND}" -E copy "${LLDB_SOURCE_DIR}/examples/python/templates/scripted_frame_provider.py" "${CMAKE_CURRENT_BINARY_DIR}/lldb/plugins/"
       COMMAND "${CMAKE_COMMAND}" -E copy "${LLDB_SOURCE_DIR}/examples/python/templates/scripted_process.py" "${CMAKE_CURRENT_BINARY_DIR}/lldb/plugins/"
       COMMAND "${CMAKE_COMMAND}" -E copy "${LLDB_SOURCE_DIR}/examples/python/templates/scripted_platform.py" "${CMAKE_CURRENT_BINARY_DIR}/lldb/plugins/"
       COMMAND "${CMAKE_COMMAND}" -E copy "${LLDB_SOURCE_DIR}/examples/python/templates/operating_system.py" "${CMAKE_CURRENT_BINARY_DIR}/lldb/plugins/"
diff --git a/lldb/docs/dil-expr-lang.ebnf b/lldb/docs/dil-expr-lang.ebnf
index ccd2b00223910..5fabdd445878b 100644
--- a/lldb/docs/dil-expr-lang.ebnf
+++ b/lldb/docs/dil-expr-lang.ebnf
@@ -3,10 +3,13 @@
 (* This is currently a subset of the final DIL Language, matching the current
    DIL implementation. *)
 
-expression = unary_expression ;
+expression = cast_expression;
+
+cast_expression = unary_expression
+                | "(" type_id ")" cast_expression;
 
 unary_expression = postfix_expression
-                 | unary_operator expression ;
+                 | unary_operator cast_expression ;
 
 unary_operator = "*" | "&" | "+" | "-";
 
@@ -44,10 +47,28 @@ nested_name_specifier = type_name "::"
                       | namespace_name '::'
                       | nested_name_specifier identifier "::" ;
 
+type_id = type_specifier_seq [abstract_declarator] ;
+
+type_specifier_seq = type_specifier [type_specifier];
+
+type_specifier = ["::"] [nested_name_specifier] type_name
+               | builtin_typename ;
+
+nested_name_specifier = type_name "::"
+                      | namespace_name "::"
+                      | nested_name_specifier identifier "::" ;
+
+abstract_declarator = ptr_operator [abstract_declarator] ;
+
+ptr_operator = "*"
+             | "&";
+
 type_name = class_name
           | enum_name
           | typedef_name;
 
+builtin_typename = identifier_seq;
+
 class_name = identifier ;
 
 enum_name = identifier ;
@@ -56,6 +77,7 @@ typedef_name = identifier ;
 
 namespace_name = identifier ;
 
-
+identifier_seq = identifier
+               | identifier identifier_seq;
 
 
diff --git a/lldb/docs/python_extensions.rst b/lldb/docs/python_extensions.rst
index 7e5f1ba6879db..8420187efcdcc 100644
--- a/lldb/docs/python_extensions.rst
+++ b/lldb/docs/python_extensions.rst
@@ -14,6 +14,14 @@ Operating System Thread Plugins
     :skip: ScriptedThread
     :no-inheritance-diagram:
 
+Scripted Frame Provider Plugins
+-------------------------------
+
+.. automodapi:: lldb.plugins.scripted_frame_provider
+    :no-heading:
+    :skip: ABCMeta
+    :no-inheritance-diagram:
+
 Scripted Process Plugins
 -------------------------------
 
diff --git a/lldb/docs/resources/lldbgdbremote.md b/lldb/docs/resources/lldbgdbremote.md
index 032edb66690e4..fdd9b057f0b4a 100644
--- a/lldb/docs/resources/lldbgdbremote.md
+++ b/lldb/docs/resources/lldbgdbremote.md
@@ -2508,7 +2508,7 @@ stack traces.
 
 Get the value of a Wasm global variable for the given frame index at the given
 variable index. The indexes are encoded as base 10. The result is a hex-encoded
-address from where to read the value.
+little-endian value of the global.
 
 ```
 send packet: $qWasmGlobal:0;2#cb
@@ -2523,7 +2523,7 @@ variables.
 
 Get the value of a Wasm function argument or local variable for the given frame
 index at the given variable index. The indexes are encoded as base 10. The
-result is a hex-encoded address from where to read the value.
+result is a hex-encoded little-endian value of the local.
 
 
 ```
@@ -2539,7 +2539,7 @@ variables.
 
 Get the value of a Wasm local variable from the Wasm operand stack, for the
 given frame index at the given variable index. The indexes are encoded as base
-10. The result is a hex-encoded address from where to read value.
+10. The result is a hex-encoded little-endian value from the stack at the given index.
 
 ```
 send packet: $qWasmStackValue:0;2#cb
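
With this change, the three qWasm* packets above return the value itself rather than an address to read from. As a rough sketch of the client side (not part of this patch; the helper name and example payload are illustrative assumptions):

```python
def decode_wasm_value(reply_hex: str) -> int:
    """Decode the hex-encoded little-endian payload of a qWasmGlobal,
    qWasmLocal, or qWasmStackValue reply into an integer."""
    return int.from_bytes(bytes.fromhex(reply_hex), byteorder="little")


# e.g. a 4-byte global whose reply payload is "e8030000" holds the value 1000
assert decode_wasm_value("e8030000") == 1000
```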
diff --git a/lldb/examples/python/templates/scripted_frame_provider.py b/lldb/examples/python/templates/scripted_frame_provider.py
index 20f4d76d188c2..7a72f1a24c9da 100644
--- a/lldb/examples/python/templates/scripted_frame_provider.py
+++ b/lldb/examples/python/templates/scripted_frame_provider.py
@@ -31,7 +31,54 @@ class ScriptedFrameProvider(metaclass=ABCMeta):
         )
     """
 
+    @staticmethod
+    def applies_to_thread(thread):
+        """Determine if this frame provider should be used for a given thread.
+
+        This static method is called before creating an instance of the frame
+        provider to determine if it should be applied to a specific thread.
+        Override this method to provide custom filtering logic.
+
+        Args:
+            thread (lldb.SBThread): The thread to check.
+
+        Returns:
+            bool: True if this frame provider should be used for the thread,
+                False otherwise. The default implementation returns True for
+                all threads.
+
+        Example:
+
+        .. code-block:: python
+
+            @staticmethod
+            def applies_to_thread(thread):
+                # Only apply to thread 1
+                return thread.GetIndexID() == 1
+        """
+        return True
+
+    @staticmethod
     @abstractmethod
+    def get_description():
+        """Get a description of this frame provider.
+
+        This method should return a human-readable string describing what
+        this frame provider does. The description is used for debugging
+        and display purposes.
+
+        Returns:
+            str: A description of the frame provider.
+
+        Example:
+
+        .. code-block:: python
+
+            def get_description():
+                return "Crash log frame provider for thread 1"
+        """
+        pass
+
     def __init__(self, input_frames, args):
         """Construct a scripted frame provider.
 
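
For orientation, a minimal sketch of a provider overriding the two static hooks documented above (an assumption for illustration, not part of this patch; the class name is hypothetical and the remaining abstract frame-producing methods are omitted):

```python
from lldb.plugins.scripted_frame_provider import ScriptedFrameProvider


class FirstThreadFrameProvider(ScriptedFrameProvider):
    """Hypothetical provider that only attaches to thread 1."""

    @staticmethod
    def applies_to_thread(thread):
        # `thread` is an lldb.SBThread; only handle the thread with index ID 1.
        return thread.GetIndexID() == 1

    @staticmethod
    def get_description():
        return "Illustrative frame provider restricted to thread 1"

    # The abstract methods of ScriptedFrameProvider that actually produce
    # frames would still need to be implemented; they are omitted here.
```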
diff --git a/lldb/examples/python/templates/scripted_process.py b/lldb/examples/python/templates/scripted_process.py
index b4232f632a30a..24aa9818bb989 100644
--- a/lldb/examples/python/templates/scripted_process.py
+++ b/lldb/examples/python/templates/scripted_process.py
@@ -243,6 +243,7 @@ def __init__(self, process, args):
                 key/value pairs used by the scripted thread.
         """
         self.target = None
+        self.arch = None
         self.originating_process = None
         self.process = None
         self.args = None
@@ -264,6 +265,9 @@ def __init__(self, process, args):
             and process.IsValid()
         ):
             self.target = process.target
+            triple = self.target.triple
+            if triple:
+                self.arch = triple.split("-")[0]
             self.originating_process = process
             self.process = self.target.GetProcess()
             self.get_register_info()
@@ -350,17 +354,14 @@ def get_stackframes(self):
     def get_register_info(self):
         if self.register_info is None:
             self.register_info = dict()
-            if "x86_64" in self.originating_process.arch:
+            if "x86_64" in self.arch:
                 self.register_info["sets"] = ["General Purpose Registers"]
                 self.register_info["registers"] = INTEL64_GPR
-            elif (
-                "arm64" in self.originating_process.arch
-                or self.originating_process.arch == "aarch64"
-            ):
+            elif "arm64" in self.arch or self.arch == "aarch64":
                 self.register_info["sets"] = ["General Purpose Registers"]
                 self.register_info["registers"] = ARM64_GPR
             else:
-                raise ValueError("Unknown architecture", self.originating_process.arch)
+                raise ValueError("Unknown architecture", self.arch)
         return self.register_info
 
     @abstractmethod
@@ -403,11 +404,12 @@ def __init__(self, thread, args):
         """Construct a scripted frame.
 
         Args:
-            thread (ScriptedThread): The thread owning this frame.
+            thread (ScriptedThread/lldb.SBThread): The thread owning this frame.
             args (lldb.SBStructuredData): A Dictionary holding arbitrary
                 key/value pairs used by the scripted frame.
         """
         self.target = None
+        self.arch = None
         self.originating_thread = None
         self.thread = None
         self.args = None
@@ -417,15 +419,17 @@ def __init__(self, thread, args):
         self.register_ctx = {}
         self.variables = []
 
-        if (
-            isinstance(thread, ScriptedThread)
-            or isinstance(thread, lldb.SBThread)
-            and thread.IsValid()
+        if isinstance(thread, ScriptedThread) or (
+            isinstance(thread, lldb.SBThread) and thread.IsValid()
         ):
-            self.target = thread.target
             self.process = thread.process
+            self.target = self.process.target
+            triple = self.target.triple
+            if triple:
+                self.arch = triple.split("-")[0]
+            tid = thread.tid if isinstance(thread, ScriptedThread) else thread.id
             self.originating_thread = thread
-            self.thread = self.process.GetThreadByIndexID(thread.tid)
+            self.thread = self.process.GetThreadByIndexID(tid)
             self.get_register_info()
 
     @abstractmethod
@@ -506,7 +510,18 @@ def get_variables(self, filters):
 
     def get_register_info(self):
         if self.register_info is None:
-            self.register_info = self.originating_thread.get_register_info()
+            if isinstance(self.originating_thread, ScriptedThread):
+                self.register_info = self.originating_thread.get_register_info()
+            elif isinstance(self.originating_thread, lldb.SBThread):
+                self.register_info = dict()
+                if "x86_64" in self.arch:
+                    self.register_info["sets"] = ["General Purpose Registers"]
+                    self.register_info["registers"] = INTEL64_GPR
+                elif "arm64" in self.arch or self.arch == "aarch64":
+                    self.register_info["sets"] = ["General Purpose Registers"]
+                    self.register_info["registers"] = ARM64_GPR
+                else:
+                    raise ValueError("Unknown architecture", self.arch)
         return self.register_info
 
     @abstractmethod
@@ -640,12 +655,12 @@ def get_stop_reason(self):
 
             # TODO: Passthrough stop reason from driving process
             if self.driving_thread.GetStopReason() != lldb.eStopReasonNone:
-                if "arm64" in self.originating_process.arch:
+                if "arm64" in self.arch:
                     stop_reason["type"] = lldb.eStopReasonException
                     stop_reason["data"]["desc"] = (
                         self.driving_thread.GetStopDescription(100)
                     )
-                elif self.originating_process.arch == "x86_64":
+                elif self.arch == "x86_64":
                     stop_reason["type"] = lldb.eStopReasonSignal
                     stop_reason["data"]["signal"] = signal.SIGTRAP
                 else:
diff --git a/lldb/include/lldb/API/SBTarget.h b/lldb/include/lldb/API/SBTarget.h
index ce81ae46a0905..0318492f1054c 100644
--- a/lldb/include/lldb/API/SBTarget.h
+++ b/lldb/include/lldb/API/SBTarget.h
@@ -19,6 +19,7 @@
 #include "lldb/API/SBLaunchInfo.h"
 #include "lldb/API/SBStatisticsOptions.h"
 #include "lldb/API/SBSymbolContextList.h"
+#include "lldb/API/SBThreadCollection.h"
 #include "lldb/API/SBType.h"
 #include "lldb/API/SBValue.h"
 #include "lldb/API/SBWatchpoint.h"
@@ -1003,6 +1004,35 @@ class LLDB_API SBTarget {
 
   lldb::SBMutex GetAPIMutex() const;
 
+  /// Register a scripted frame provider for this target.
+  /// If a scripted frame provider with the same name and same argument
+  /// dictionary is already registered on this target, it will be overwritten.
+  ///
+  /// \param[in] class_name
+  ///     The name of the Python class that implements the frame provider.
+  ///
+  /// \param[in] args_dict
+  ///     A dictionary of arguments to pass to the frame provider class.
+  ///
+  /// \param[out] error
+  ///     An error object indicating success or failure.
+  ///
+  /// \return
+  ///     A unique identifier for the frame provider descriptor that was
+  ///     registered. 0 if the registration failed.
+  uint32_t RegisterScriptedFrameProvider(const char *class_name,
+                                         lldb::SBStructuredData args_dict,
+                                         lldb::SBError &error);
+
+  /// Remove a scripted frame provider from this target by its identifier.
+  ///
+  /// \param[in] provider_id
+  ///     The identifier of the frame provider to remove.
+  ///
+  /// \return
+  ///     An error object indicating success or failure.
+  lldb::SBError RemoveScriptedFrameProvider(uint32_t provider_id);
+
 protected:
   friend class SBAddress;
   friend class SBAddressRange;
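
A rough usage sketch of the two new SBTarget calls from the Python bindings, assuming they mirror the C++ signatures above (the module/class name and argument keys are hypothetical, not from this patch):

```python
import lldb


def install_provider(target: lldb.SBTarget) -> int:
    args = lldb.SBStructuredData()
    args.SetFromJSON('{"max_depth": 8}')  # hypothetical provider arguments

    error = lldb.SBError()
    provider_id = target.RegisterScriptedFrameProvider(
        "my_module.FirstThreadFrameProvider", args, error)
    if error.Fail() or provider_id == 0:
        raise RuntimeError(error.GetCString() or "registration failed")
    return provider_id


def uninstall_provider(target: lldb.SBTarget, provider_id: int) -> None:
    err = target.RemoveScriptedFrameProvider(provider_id)
    if err.Fail():
        raise RuntimeError(err.GetCString())
```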
diff --git a/lldb/include/lldb/API/SBThread.h b/lldb/include/lldb/API/SBThread.h
index f6a6d19935b83..639e7a0a1a5c0 100644
--- a/lldb/include/lldb/API/SBThread.h
+++ b/lldb/include/lldb/API/SBThread.h
@@ -256,6 +256,7 @@ class LLDB_API SBThread {
   friend class SBThreadPlan;
   friend class SBTrace;
 
+  friend class lldb_private::ScriptInterpreter;
   friend class lldb_private::python::SWIGBridge;
 
   SBThread(const lldb::ThreadSP &lldb_object_sp);
diff --git a/lldb/include/lldb/API/SBThreadCollection.h b/lldb/include/lldb/API/SBThreadCollection.h
index 5a052e6246026..d13dea0f11cd2 100644
--- a/lldb/include/lldb/API/SBThreadCollection.h
+++ b/lldb/include/lldb/API/SBThreadCollection.h
@@ -46,6 +46,7 @@ class LLDB_API SBThreadCollection {
   void SetOpaque(const lldb::ThreadCollectionSP &threads);
 
 private:
+  friend class SBTarget;
   friend class SBProcess;
   friend class SBThread;
   friend class SBSaveCoreOptions;
diff --git a/lldb/include/lldb/API/SBTrace.h b/lldb/include/lldb/API/SBTrace.h
index ce95595a423c9..d5368b234dd34 100644
--- a/lldb/include/lldb/API/SBTrace.h
+++ b/lldb/include/lldb/API/SBTrace.h
@@ -39,7 +39,7 @@ class LLDB_API SBTrace {
   SBTraceCursor CreateNewCursor(SBError &error, SBThread &thread);
 
   /// Save the trace to the specified directory, which will be created if
-  /// needed. This will also create a file \a <directory>/trace.json with the
+  /// needed. This will also create a file <directory>/trace.json with the
   /// main properties of the trace session, along with others files which
   /// contain the actual trace data. The trace.json file can be used later as
   /// input for the "trace load" command to load the trace in LLDB, or for the
diff --git a/lldb/include/lldb/Breakpoint/BreakpointSite.h b/lldb/include/lldb/Breakpoint/BreakpointSite.h
index a935b2441c02a..e189ed77e261b 100644
--- a/lldb/include/lldb/Breakpoint/BreakpointSite.h
+++ b/lldb/include/lldb/Breakpoint/BreakpointSite.h
@@ -156,6 +156,10 @@ class BreakpointSite : public std::enable_shared_from_this<BreakpointSite>,
   ///     would be valid for this thread, false otherwise.
   bool ValidForThisThread(Thread &thread);
 
+  /// Returns true if at least one constituent is both public and valid for
+  /// `thread`.
+  bool ContainsUserBreakpointForThread(Thread &thread);
+
   /// Print a description of this breakpoint site to the stream \a s.
   /// GetDescription tells you about the breakpoint site's constituents. Use
   /// BreakpointSite::Dump(Stream *) to get information about the breakpoint
diff --git a/lldb/include/lldb/Core/Disassembler.h b/lldb/include/lldb/Core/Disassembler.h
index 5de314109b0cc..ab0f4ac804a7c 100644
--- a/lldb/include/lldb/Core/Disassembler.h
+++ b/lldb/include/lldb/Core/Disassembler.h
@@ -167,6 +167,8 @@ class Instruction {
 
   virtual bool IsLoad() = 0;
 
+  virtual bool IsBarrier() = 0;
+
   virtual bool IsAuthenticated() = 0;
 
   bool CanSetBreakpoint();
@@ -367,6 +369,8 @@ class PseudoInstruction : public Instruction {
 
   bool IsLoad() override;
 
+  bool IsBarrier() override;
+
   bool IsAuthenticated() override;
 
   void CalculateMnemonicOperandsAndComment(
diff --git a/lldb/include/lldb/Core/FormatEntity.h b/lldb/include/lldb/Core/FormatEntity.h
index 40916dc48a70b..107c30a000979 100644
--- a/lldb/include/lldb/Core/FormatEntity.h
+++ b/lldb/include/lldb/Core/FormatEntity.h
@@ -81,6 +81,7 @@ struct Entry {
     FrameRegisterByName,
     FrameIsArtificial,
     FrameKind,
+    FrameBorrowedInfo,
     ScriptFrame,
     FunctionID,
     FunctionDidChange,
diff --git a/lldb/include/lldb/Host/windows/ProcessLauncherWindows.h b/lldb/include/lldb/Host/windows/ProcessLauncherWindows.h
index 81aea5b2022a5..553263e2f5a72 100644
--- a/lldb/include/lldb/Host/windows/ProcessLauncherWindows.h
+++ b/lldb/include/lldb/Host/windows/ProcessLauncherWindows.h
@@ -11,6 +11,7 @@
 
 #include "lldb/Host/ProcessLauncher.h"
 #include "lldb/Host/windows/windows.h"
+#include "llvm/Support/ErrorOr.h"
 
 namespace lldb_private {
 
@@ -23,6 +24,36 @@ class ProcessLauncherWindows : public ProcessLauncher {
 
 protected:
   HANDLE GetStdioHandle(const ProcessLaunchInfo &launch_info, int fd);
+
+  /// Get the list of Windows handles that should be inherited by the child
+  /// process and update `STARTUPINFOEXW` with the handle list.
+  ///
+  /// If no handles need to be inherited, an empty vector is returned.
+  ///
+  /// Otherwise, the function populates the
+  /// `PROC_THREAD_ATTRIBUTE_HANDLE_LIST` attribute in `startupinfoex` with the
+  /// collected handles using `UpdateProcThreadAttribute`. On success, the
+  /// vector of inherited handles is returned.
+  ///
+  /// \param launch_info
+  ///   The process launch configuration.
+  ///
+  /// \param startupinfoex
+  ///   The extended STARTUPINFO structure for the process being created.
+  ///
+  /// \param stdout_handle
+  /// \param stderr_handle
+  /// \param stdin_handle
+  ///   Optional explicit standard stream handles to use for the child process.
+  ///
+  /// \returns
+  ///   `std::vector<HANDLE>` containing all handles that the child must
+  ///   inherit.
+  llvm::ErrorOr<std::vector<HANDLE>>
+  GetInheritedHandles(const ProcessLaunchInfo &launch_info,
+                      STARTUPINFOEXW &startupinfoex,
+                      HANDLE stdout_handle = NULL, HANDLE stderr_handle = NULL,
+                      HANDLE stdin_handle = NULL);
 };
 }
 
diff --git a/lldb/include/lldb/Interpreter/Interfaces/ScriptedFrameProviderInterface.h b/lldb/include/lldb/Interpreter/Interfaces/ScriptedFrameProviderInterface.h
index 2d9f713676f90..49b60131399d5 100644
--- a/lldb/include/lldb/Interpreter/Interfaces/ScriptedFrameProviderInterface.h
+++ b/lldb/include/lldb/Interpreter/Interfaces/ScriptedFrameProviderInterface.h
@@ -16,11 +16,29 @@
 namespace lldb_private {
 class ScriptedFrameProviderInterface : public ScriptedInterface {
 public:
+  virtual bool AppliesToThread(llvm::StringRef class_name,
+                               lldb::ThreadSP thread_sp) {
+    return true;
+  }
+
   virtual llvm::Expected<StructuredData::GenericSP>
   CreatePluginObject(llvm::StringRef class_name,
                      lldb::StackFrameListSP input_frames,
                      StructuredData::DictionarySP args_sp) = 0;
 
+  /// Get a description string for the frame provider.
+  ///
+  /// This is called by the descriptor to fetch a description from the
+  /// scripted implementation. Implementations should call a static method
+  /// on the scripting class to retrieve the description.
+  ///
+  /// \param class_name The name of the scripting class implementing the
+  /// provider.
+  ///
+  /// \return A string describing what this frame provider does, or an
+  ///         empty string if no description is available.
+  virtual std::string GetDescription(llvm::StringRef class_name) { return {}; }
+
   virtual StructuredData::ObjectSP GetFrameAtIndex(uint32_t index) {
     return {};
   }
diff --git a/lldb/include/lldb/Interpreter/Interfaces/ScriptedInterface.h b/lldb/include/lldb/Interpreter/Interfaces/ScriptedInterface.h
index a3dc52c435561..8ace90927b582 100644
--- a/lldb/include/lldb/Interpreter/Interfaces/ScriptedInterface.h
+++ b/lldb/include/lldb/Interpreter/Interfaces/ScriptedInterface.h
@@ -39,6 +39,10 @@ class ScriptedInterface {
   virtual llvm::SmallVector<AbstractMethodRequirement>
   GetAbstractMethodRequirements() const = 0;
 
+  virtual llvm::Expected<FileSpec> GetScriptedModulePath() {
+    return llvm::make_error<UnimplementedError>();
+  }
+
   llvm::SmallVector<llvm::StringLiteral> const GetAbstractMethods() const {
     llvm::SmallVector<llvm::StringLiteral> abstract_methods;
     llvm::transform(GetAbstractMethodRequirements(), abstract_methods.begin(),
diff --git a/lldb/include/lldb/Interpreter/ScriptInterpreter.h b/lldb/include/lldb/Interpreter/ScriptInterpreter.h
index 7fed4940b85bf..0b91d6756552d 100644
--- a/lldb/include/lldb/Interpreter/ScriptInterpreter.h
+++ b/lldb/include/lldb/Interpreter/ScriptInterpreter.h
@@ -21,6 +21,7 @@
 #include "lldb/API/SBMemoryRegionInfo.h"
 #include "lldb/API/SBStream.h"
 #include "lldb/API/SBSymbolContext.h"
+#include "lldb/API/SBThread.h"
 #include "lldb/Breakpoint/BreakpointOptions.h"
 #include "lldb/Core/PluginInterface.h"
 #include "lldb/Core/SearchFilter.h"
@@ -580,6 +581,8 @@ class ScriptInterpreter : public PluginInterface {
 
   lldb::StreamSP GetOpaqueTypeFromSBStream(const lldb::SBStream &stream) const;
 
+  lldb::ThreadSP GetOpaqueTypeFromSBThread(const lldb::SBThread &thread) const;
+
   lldb::StackFrameSP GetOpaqueTypeFromSBFrame(const lldb::SBFrame &frame) const;
 
   SymbolContext
diff --git a/lldb/include/lldb/Symbol/ObjectFile.h b/lldb/include/lldb/Symbol/ObjectFile.h
index 1de08a8576507..993650b8888f5 100644
--- a/lldb/include/lldb/Symbol/ObjectFile.h
+++ b/lldb/include/lldb/Symbol/ObjectFile.h
@@ -18,6 +18,7 @@
 #include "lldb/Utility/Endian.h"
 #include "lldb/Utility/FileSpec.h"
 #include "lldb/Utility/FileSpecList.h"
+#include "lldb/Utility/NonNullSharedPtr.h"
 #include "lldb/Utility/StructuredData.h"
 #include "lldb/Utility/UUID.h"
 #include "lldb/lldb-private.h"
@@ -418,7 +419,7 @@ class ObjectFile : public std::enable_shared_from_this<ObjectFile>,
   /// Attempts to parse the object header.
   ///
   /// This function is used as a test to see if a given plug-in instance can
-  /// parse the header data already contained in ObjectFile::m_data. If an
+  /// parse the header data already contained in ObjectFile::m_data_nsp. If an
   /// object file parser does not recognize the magic bytes in a header,
   /// false should be returned and the next plug-in can attempt to parse an
   /// object file.
@@ -777,6 +778,8 @@ class ObjectFile : public std::enable_shared_from_this<ObjectFile>,
   std::string GetObjectName() const;
 
 protected:
+  typedef NonNullSharedPtr<lldb_private::DataExtractor> DataExtractorNSP;
+
   // Member variables.
   FileSpec m_file;
   Type m_type;
@@ -786,8 +789,10 @@ class ObjectFile : public std::enable_shared_from_this<ObjectFile>,
   lldb::addr_t m_length; ///< The length of this object file if it is known (can
                          ///be zero if length is unknown or can't be
                          ///determined).
-  DataExtractor
-      m_data; ///< The data for this object file so things can be parsed lazily.
+  DataExtractorNSP m_data_nsp; ///< The data for this object file so things
+                               ///< can be parsed lazily.  This shared pointer
+                               ///< will always have a DataExtractor object,
+                               ///< although it may only be default-constructed.
   lldb::ProcessWP m_process_wp;
   /// Set if the object file only exists in memory.
   const lldb::addr_t m_memory_addr;
diff --git a/lldb/include/lldb/Target/BorrowedStackFrame.h b/lldb/include/lldb/Target/BorrowedStackFrame.h
new file mode 100644
index 0000000000000..72e7777961da7
--- /dev/null
+++ b/lldb/include/lldb/Target/BorrowedStackFrame.h
@@ -0,0 +1,146 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLDB_TARGET_BORROWEDSTACKFRAME_H
+#define LLDB_TARGET_BORROWEDSTACKFRAME_H
+
+#include "lldb/Target/StackFrame.h"
+
+namespace lldb_private {
+
+/// \class BorrowedStackFrame BorrowedStackFrame.h
+/// "lldb/Target/BorrowedStackFrame.h"
+///
+/// A wrapper around an existing StackFrame that supersedes its frame indices.
+///
+/// This class is useful when you need to present an existing stack frame
+/// with a different index, such as when creating synthetic frame views or
+/// renumbering frames without copying all the underlying data.
+///
+/// All methods delegate to the borrowed frame except GetFrameIndex() and
+/// GetConcreteFrameIndex(), which use the overridden indices.
+class BorrowedStackFrame : public StackFrame {
+public:
+  /// Construct a BorrowedStackFrame that wraps an existing frame.
+  ///
+  /// \param [in] borrowed_frame_sp
+  ///   The existing StackFrame to borrow from. This frame's data will be
+  ///   used for all operations except frame index queries.
+  ///
+  /// \param [in] new_frame_index
+  ///   The frame index to report instead of the borrowed frame's index.
+  ///
+  /// \param [in] new_concrete_frame_index
+  ///   Optional concrete frame index. If not provided, defaults to
+  ///   new_frame_index.
+  BorrowedStackFrame(
+      lldb::StackFrameSP borrowed_frame_sp, uint32_t new_frame_index,
+      std::optional<uint32_t> new_concrete_frame_index = std::nullopt);
+
+  ~BorrowedStackFrame() override = default;
+
+  uint32_t GetFrameIndex() const override;
+  void SetFrameIndex(uint32_t index);
+
+  /// Get the concrete frame index for this borrowed frame.
+  ///
+  /// Returns the overridden concrete frame index provided at construction,
+  /// or LLDB_INVALID_FRAME_ID if the borrowed frame represents an inlined
+  /// function, since this would require some computation if we chain inlined
+  /// borrowed stack frames.
+  ///
+  /// \return
+  ///   The concrete frame index, or LLDB_INVALID_FRAME_ID for inline frames.
+  uint32_t GetConcreteFrameIndex() override;
+
+  StackID &GetStackID() override;
+
+  const Address &GetFrameCodeAddress() override;
+
+  Address GetFrameCodeAddressForSymbolication() override;
+
+  bool ChangePC(lldb::addr_t pc) override;
+
+  const SymbolContext &
+  GetSymbolContext(lldb::SymbolContextItem resolve_scope) override;
+
+  llvm::Error GetFrameBaseValue(Scalar &value) override;
+
+  DWARFExpressionList *GetFrameBaseExpression(Status *error_ptr) override;
+
+  Block *GetFrameBlock() override;
+
+  lldb::RegisterContextSP GetRegisterContext() override;
+
+  VariableList *GetVariableList(bool get_file_globals,
+                                Status *error_ptr) override;
+
+  lldb::VariableListSP
+  GetInScopeVariableList(bool get_file_globals,
+                         bool must_have_valid_location = false) override;
+
+  lldb::ValueObjectSP GetValueForVariableExpressionPath(
+      llvm::StringRef var_expr, lldb::DynamicValueType use_dynamic,
+      uint32_t options, lldb::VariableSP &var_sp, Status &error) override;
+
+  bool HasDebugInformation() override;
+
+  const char *Disassemble() override;
+
+  lldb::ValueObjectSP
+  GetValueObjectForFrameVariable(const lldb::VariableSP &variable_sp,
+                                 lldb::DynamicValueType use_dynamic) override;
+
+  bool IsInlined() override;
+
+  bool IsSynthetic() const override;
+
+  bool IsHistorical() const override;
+
+  bool IsArtificial() const override;
+
+  bool IsHidden() override;
+
+  const char *GetFunctionName() override;
+
+  const char *GetDisplayFunctionName() override;
+
+  lldb::ValueObjectSP FindVariable(ConstString name) override;
+
+  SourceLanguage GetLanguage() override;
+
+  SourceLanguage GuessLanguage() override;
+
+  lldb::ValueObjectSP GuessValueForAddress(lldb::addr_t addr) override;
+
+  lldb::ValueObjectSP GuessValueForRegisterAndOffset(ConstString reg,
+                                                     int64_t offset) override;
+
+  StructuredData::ObjectSP GetLanguageSpecificData() override;
+
+  lldb::RecognizedStackFrameSP GetRecognizedFrame() override;
+
+  /// Get the underlying borrowed frame.
+  lldb::StackFrameSP GetBorrowedFrame() const;
+
+  bool isA(const void *ClassID) const override;
+  static bool classof(const StackFrame *obj);
+
+private:
+  lldb::StackFrameSP m_borrowed_frame_sp;
+  uint32_t m_new_frame_index;
+  uint32_t m_new_concrete_frame_index;
+  static char ID;
+
+  BorrowedStackFrame(const BorrowedStackFrame &) = delete;
+  const BorrowedStackFrame &operator=(const BorrowedStackFrame &) = delete;
+};
+
+} // namespace lldb_private
+
+#endif // LLDB_TARGET_BORROWEDSTACKFRAME_H
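
As a rough usage sketch based on the constructor declared above: a caller that wants to re-present an existing frame under a different index could wrap it as shown below. MakeRenumberedFrame is a hypothetical helper for illustration, not part of the patch.

// Minimal illustrative sketch, not part of the patch.
#include "lldb/Target/BorrowedStackFrame.h"

static lldb::StackFrameSP MakeRenumberedFrame(lldb::StackFrameSP real_frame_sp,
                                              uint32_t new_index) {
  // Everything (symbols, variables, registers) still comes from
  // real_frame_sp; only the reported frame index changes.
  return std::make_shared<lldb_private::BorrowedStackFrame>(real_frame_sp,
                                                            new_index);
}
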
diff --git a/lldb/include/lldb/Target/ExecutionContext.h b/lldb/include/lldb/Target/ExecutionContext.h
index fe8bce7f69713..8637234c4fb95 100644
--- a/lldb/include/lldb/Target/ExecutionContext.h
+++ b/lldb/include/lldb/Target/ExecutionContext.h
@@ -268,7 +268,10 @@ class ExecutionContextRef {
     m_tid = LLDB_INVALID_THREAD_ID;
   }
 
-  void ClearFrame() { m_stack_id.Clear(); }
+  void ClearFrame() {
+    m_stack_id.Clear();
+    m_frame_list_wp.reset();
+  }
 
 protected:
   // Member variables
@@ -279,7 +282,14 @@ class ExecutionContextRef {
                                               ///< object refers to in case the
                                               /// backing object changes
   StackID m_stack_id; ///< The stack ID that this object refers to in case the
-                      ///backing object changes
+                      ///< backing object changes
+  mutable lldb::StackFrameListWP
+      m_frame_list_wp; ///< Weak reference to the frame list that contains the
+                       ///< frame this object refers to. If a valid
+                       ///< StackFrameListSP can be created from it, it must be
+                       ///< used to resolve the StackID; otherwise, fall back
+                       ///< to the Thread's StackFrameList.
 };
 
 /// \class ExecutionContext ExecutionContext.h
diff --git a/lldb/include/lldb/Target/StackFrame.h b/lldb/include/lldb/Target/StackFrame.h
index cdbe8ae3c6779..46922448d6e59 100644
--- a/lldb/include/lldb/Target/StackFrame.h
+++ b/lldb/include/lldb/Target/StackFrame.h
@@ -43,6 +43,13 @@ namespace lldb_private {
 class StackFrame : public ExecutionContextScope,
                    public std::enable_shared_from_this<StackFrame> {
 public:
+  /// LLVM RTTI support.
+  /// \{
+  static char ID;
+  virtual bool isA(const void *ClassID) const { return ClassID == &ID; }
+  static bool classof(const StackFrame *obj) { return obj->isA(&ID); }
+  /// \}
+
   enum ExpressionPathOption {
     eExpressionPathOptionCheckPtrVsMember = (1u << 0),
     eExpressionPathOptionsNoFragileObjcIvar = (1u << 1),
@@ -127,7 +134,7 @@ class StackFrame : public ExecutionContextScope,
 
   lldb::ThreadSP GetThread() const { return m_thread_wp.lock(); }
 
-  StackID &GetStackID();
+  virtual StackID &GetStackID();
 
   /// Get an Address for the current pc value in this StackFrame.
   ///
@@ -135,7 +142,7 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \return
   ///   The Address object set to the current PC value.
-  const Address &GetFrameCodeAddress();
+  virtual const Address &GetFrameCodeAddress();
 
   /// Get the current code Address suitable for symbolication,
   /// may not be the same as GetFrameCodeAddress().
@@ -153,7 +160,7 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \return
   ///   The Address object set to the current PC value.
-  Address GetFrameCodeAddressForSymbolication();
+  virtual Address GetFrameCodeAddressForSymbolication();
 
   /// Change the pc value for a given thread.
   ///
@@ -165,7 +172,7 @@ class StackFrame : public ExecutionContextScope,
   /// \return
   ///     true if the pc was changed.  false if this failed -- possibly
   ///     because this frame is not a live StackFrame.
-  bool ChangePC(lldb::addr_t pc);
+  virtual bool ChangePC(lldb::addr_t pc);
 
   /// Provide a SymbolContext for this StackFrame's current pc value.
   ///
@@ -181,7 +188,8 @@ class StackFrame : public ExecutionContextScope,
   /// \return
   ///   A SymbolContext reference which includes the types of information
   ///   requested by resolve_scope, if they are available.
-  const SymbolContext &GetSymbolContext(lldb::SymbolContextItem resolve_scope);
+  virtual const SymbolContext &
+  GetSymbolContext(lldb::SymbolContextItem resolve_scope);
 
   /// Return the Canonical Frame Address (DWARF term) for this frame.
   ///
@@ -199,7 +207,7 @@ class StackFrame : public ExecutionContextScope,
   /// \return
   ///   If there is an error determining the CFA address, return an error
   ///   explaining the failure. Success otherwise.
-  llvm::Error GetFrameBaseValue(Scalar &value);
+  virtual llvm::Error GetFrameBaseValue(Scalar &value);
 
   /// Get the DWARFExpressionList corresponding to the Canonical Frame Address.
   ///
@@ -211,7 +219,7 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \return
   ///   Returns the corresponding DWARF expression, or NULL.
-  DWARFExpressionList *GetFrameBaseExpression(Status *error_ptr);
+  virtual DWARFExpressionList *GetFrameBaseExpression(Status *error_ptr);
 
   /// Get the current lexical scope block for this StackFrame, if possible.
   ///
@@ -221,7 +229,7 @@ class StackFrame : public ExecutionContextScope,
   /// \return
   ///   A pointer to the current Block.  nullptr is returned if this can
   ///   not be provided.
-  Block *GetFrameBlock();
+  virtual Block *GetFrameBlock();
 
   /// Get the RegisterContext for this frame, if possible.
   ///
@@ -235,7 +243,7 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \return
   ///   The RegisterContext shared point for this frame.
-  lldb::RegisterContextSP GetRegisterContext();
+  virtual lldb::RegisterContextSP GetRegisterContext();
 
   const lldb::RegisterContextSP &GetRegisterContextSP() const {
     return m_reg_context_sp;
@@ -261,7 +269,8 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \return
   ///     A pointer to a list of variables.
-  VariableList *GetVariableList(bool get_file_globals, Status *error_ptr);
+  virtual VariableList *GetVariableList(bool get_file_globals,
+                                        Status *error_ptr);
 
   /// Retrieve the list of variables that are in scope at this StackFrame's
   /// pc.
@@ -280,7 +289,7 @@ class StackFrame : public ExecutionContextScope,
   ///     StackFrame's pc.
   /// \return
   ///     A pointer to a list of variables.
-  lldb::VariableListSP
+  virtual lldb::VariableListSP
   GetInScopeVariableList(bool get_file_globals,
                          bool must_have_valid_location = false);
 
@@ -309,7 +318,7 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \return
   ///     A shared pointer to the ValueObject described by var_expr.
-  lldb::ValueObjectSP GetValueForVariableExpressionPath(
+  virtual lldb::ValueObjectSP GetValueForVariableExpressionPath(
       llvm::StringRef var_expr, lldb::DynamicValueType use_dynamic,
       uint32_t options, lldb::VariableSP &var_sp, Status &error);
 
@@ -318,14 +327,14 @@ class StackFrame : public ExecutionContextScope,
   /// \return
   ///    true if debug information is available for this frame (function,
   ///    compilation unit, block, etc.)
-  bool HasDebugInformation();
+  virtual bool HasDebugInformation();
 
   /// Return the disassembly for the instructions of this StackFrame's
   /// function as a single C string.
   ///
   /// \return
   ///    C string with the assembly instructions for this function.
-  const char *Disassemble();
+  virtual const char *Disassemble();
 
   /// Print a description of this frame using the provided frame format.
   ///
@@ -337,9 +346,9 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \return
   ///   \b true if and only if dumping with the given \p format worked.
-  bool DumpUsingFormat(Stream &strm,
-                       const lldb_private::FormatEntity::Entry *format,
-                       llvm::StringRef frame_marker = {});
+  virtual bool DumpUsingFormat(Stream &strm,
+                               const lldb_private::FormatEntity::Entry *format,
+                               llvm::StringRef frame_marker = {});
 
   /// Print a description for this frame using the frame-format formatter
   /// settings. If the current frame-format settings are invalid, then the
@@ -353,8 +362,8 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \param [in] frame_marker
   ///   Optional string that will be prepended to the frame output description.
-  void DumpUsingSettingsFormat(Stream *strm, bool show_unique = false,
-                               const char *frame_marker = nullptr);
+  virtual void DumpUsingSettingsFormat(Stream *strm, bool show_unique = false,
+                                       const char *frame_marker = nullptr);
 
   /// Print a description for this frame using a default format.
   ///
@@ -366,7 +375,7 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \param [in] show_fullpaths
   ///   Whether to print the full source paths or just the file base name.
-  void Dump(Stream *strm, bool show_frame_index, bool show_fullpaths);
+  virtual void Dump(Stream *strm, bool show_frame_index, bool show_fullpaths);
 
   /// Print a description of this stack frame and/or the source
   /// context/assembly for this stack frame.
@@ -389,8 +398,9 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \return
   ///   Returns true if successful.
-  bool GetStatus(Stream &strm, bool show_frame_info, bool show_source,
-                 bool show_unique = false, const char *frame_marker = nullptr);
+  virtual bool GetStatus(Stream &strm, bool show_frame_info, bool show_source,
+                         bool show_unique = false,
+                         const char *frame_marker = nullptr);
 
   /// Query whether this frame is a concrete frame on the call stack, or if it
   /// is an inlined frame derived from the debug information and presented by
@@ -401,10 +411,10 @@ class StackFrame : public ExecutionContextScope,
   virtual bool IsInlined();
 
   /// Query whether this frame is synthetic.
-  bool IsSynthetic() const;
+  virtual bool IsSynthetic() const;
 
   /// Query whether this frame is part of a historical backtrace.
-  bool IsHistorical() const;
+  virtual bool IsHistorical() const;
 
   /// Query whether this frame is artificial (e.g a synthesized result of
   /// inferring missing tail call frames from a backtrace). Artificial frames
@@ -419,7 +429,7 @@ class StackFrame : public ExecutionContextScope,
   /// Language plugins can use this API to report language-specific
   /// runtime information about this compile unit, such as additional
   /// language version details or feature flags.
-  StructuredData::ObjectSP GetLanguageSpecificData();
+  virtual StructuredData::ObjectSP GetLanguageSpecificData();
 
   /// Get the frame's demangled name.
   ///
@@ -439,9 +449,9 @@ class StackFrame : public ExecutionContextScope,
   /// \return
   ///   StackFrame index 0 indicates the currently-executing function.  Inline
   ///   frames are included in this frame index count.
-  uint32_t GetFrameIndex() const;
+  virtual uint32_t GetFrameIndex() const;
 
-  /// Set this frame's synthetic frame index.
+  /// Set this frame's frame index.
   void SetFrameIndex(uint32_t index) { m_frame_index = index; }
 
   /// Query this frame to find what frame it is in this Thread's
@@ -452,7 +462,7 @@ class StackFrame : public ExecutionContextScope,
   ///   frames are not included in this frame index count; their concrete
   ///   frame index will be the same as the concrete frame that they are
   ///   derived from.
-  uint32_t GetConcreteFrameIndex() const { return m_concrete_frame_index; }
+  virtual uint32_t GetConcreteFrameIndex() { return m_concrete_frame_index; }
 
   /// Create a ValueObject for a given Variable in this StackFrame.
   ///
@@ -466,7 +476,7 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \return
   ///     A ValueObject for this variable.
-  lldb::ValueObjectSP
+  virtual lldb::ValueObjectSP
   GetValueObjectForFrameVariable(const lldb::VariableSP &variable_sp,
                                  lldb::DynamicValueType use_dynamic);
 
@@ -474,11 +484,11 @@ class StackFrame : public ExecutionContextScope,
   /// parsing expressions given the execution context.
   ///
   /// \return   The language of the frame if known.
-  SourceLanguage GetLanguage();
+  virtual SourceLanguage GetLanguage();
 
   /// Similar to GetLanguage(), but is allowed to take a potentially incorrect
   /// guess if exact information is not available.
-  SourceLanguage GuessLanguage();
+  virtual SourceLanguage GuessLanguage();
 
   /// Attempt to reconstruct the ValueObject for a given raw address touched by
   /// the current instruction.  The ExpressionPath should indicate how to get
@@ -489,7 +499,7 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \return
   ///   The ValueObject if found.  If valid, it has a valid ExpressionPath.
-  lldb::ValueObjectSP GuessValueForAddress(lldb::addr_t addr);
+  virtual lldb::ValueObjectSP GuessValueForAddress(lldb::addr_t addr);
 
   /// Attempt to reconstruct the ValueObject for the address contained in a
   /// given register plus an offset.  The ExpressionPath should indicate how
@@ -503,8 +513,8 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \return
   ///   The ValueObject if found.  If valid, it has a valid ExpressionPath.
-  lldb::ValueObjectSP GuessValueForRegisterAndOffset(ConstString reg,
-                                                     int64_t offset);
+  virtual lldb::ValueObjectSP GuessValueForRegisterAndOffset(ConstString reg,
+                                                             int64_t offset);
 
   /// Attempt to reconstruct the ValueObject for a variable with a given \a name
   /// from within the current StackFrame, within the current block. The search
@@ -517,7 +527,7 @@ class StackFrame : public ExecutionContextScope,
   ///
   /// \return
   ///   The ValueObject if found.
-  lldb::ValueObjectSP FindVariable(ConstString name);
+  virtual lldb::ValueObjectSP FindVariable(ConstString name);
 
   // lldb::ExecutionContextScope pure virtual functions
   lldb::TargetSP CalculateTarget() override;
@@ -530,10 +540,25 @@ class StackFrame : public ExecutionContextScope,
 
   void CalculateExecutionContext(ExecutionContext &exe_ctx) override;
 
-  lldb::RecognizedStackFrameSP GetRecognizedFrame();
+  virtual lldb::RecognizedStackFrameSP GetRecognizedFrame();
+
+  /// Get the StackFrameList that contains this frame.
+  ///
+  /// Returns the StackFrameList that contains this frame, allowing
+  /// frames to resolve execution contexts without calling
+  /// Thread::GetStackFrameList(), which can cause circular dependencies
+  /// during frame provider initialization.
+  ///
+  /// \return
+  ///   The StackFrameList that contains this frame, or nullptr if not set.
+  virtual lldb::StackFrameListSP GetContainingStackFrameList() const {
+    return m_frame_list_wp.lock();
+  }
 
 protected:
+  friend class BorrowedStackFrame;
   friend class StackFrameList;
+  friend class SyntheticStackFrameList;
 
   void SetSymbolContextScope(SymbolContextScope *symbol_scope);
 
@@ -574,6 +599,7 @@ class StackFrame : public ExecutionContextScope,
   /// well as any other frame with the same trait.
   bool m_behaves_like_zeroth_frame;
   lldb::VariableListSP m_variable_list_sp;
+  lldb::StackFrameListWP m_frame_list_wp;
   /// Value objects for each variable in m_variable_list_sp.
   ValueObjectList m_variable_list_value_objects;
   std::optional<lldb::RecognizedStackFrameSP> m_recognized_frame_sp;
diff --git a/lldb/include/lldb/Target/StackFrameList.h b/lldb/include/lldb/Target/StackFrameList.h
index 5b0df0ddb3e29..539c070ff0f4b 100644
--- a/lldb/include/lldb/Target/StackFrameList.h
+++ b/lldb/include/lldb/Target/StackFrameList.h
@@ -20,13 +20,13 @@ namespace lldb_private {
 
 class ScriptedThread;
 
-class StackFrameList {
+class StackFrameList : public std::enable_shared_from_this<StackFrameList> {
 public:
   // Constructors and Destructors
   StackFrameList(Thread &thread, const lldb::StackFrameListSP &prev_frames_sp,
                  bool show_inline_frames);
 
-  ~StackFrameList();
+  virtual ~StackFrameList();
 
   /// Get the number of visible frames. Frames may be created if \p can_create
   /// is true. Synthetic (inline) frames expanded from the concrete frame #0
@@ -106,6 +106,7 @@ class StackFrameList {
 
 protected:
   friend class Thread;
+  friend class ScriptedFrameProvider;
   friend class ScriptedThread;
 
   /// Use this API to build a stack frame list (used for scripted threads, for
@@ -211,19 +212,23 @@ class StackFrameList {
   /// Whether or not to show synthetic (inline) frames. Immutable.
   const bool m_show_inlined_frames;
 
+  /// Returns true if fetching frames was interrupted, false otherwise.
+  virtual bool FetchFramesUpTo(uint32_t end_idx,
+                               InterruptionControl allow_interrupt);
+
 private:
   uint32_t SetSelectedFrameNoLock(lldb_private::StackFrame *frame);
   lldb::StackFrameSP
   GetFrameAtIndexNoLock(uint32_t idx,
                         std::shared_lock<std::shared_mutex> &guard);
 
+  /// @{
   /// These two Fetch frames APIs and SynthesizeTailCallFrames are called in
   /// GetFramesUpTo, they are the ones that actually add frames.  They must be
   /// called with the writer end of the list mutex held.
-
-  /// Returns true if fetching frames was interrupted, false otherwise.
-  bool FetchFramesUpTo(uint32_t end_idx, InterruptionControl allow_interrupt);
+  ///
   /// Not currently interruptible so returns void.
+  /// @}
   void FetchOnlyConcreteFramesUpTo(uint32_t end_idx);
   void SynthesizeTailCallFrames(StackFrame &next_frame);
 
@@ -231,6 +236,27 @@ class StackFrameList {
   const StackFrameList &operator=(const StackFrameList &) = delete;
 };
 
+/// A StackFrameList that wraps another StackFrameList and uses a
+/// SyntheticFrameProvider to lazily provide frames from either the provider
+/// or the underlying real stack frame list.
+class SyntheticStackFrameList : public StackFrameList {
+public:
+  SyntheticStackFrameList(Thread &thread, lldb::StackFrameListSP input_frames,
+                          const lldb::StackFrameListSP &prev_frames_sp,
+                          bool show_inline_frames);
+
+protected:
+  /// Override FetchFramesUpTo to lazily return frames from the provider
+  /// or from the actual stack frame list.
+  bool FetchFramesUpTo(uint32_t end_idx,
+                       InterruptionControl allow_interrupt) override;
+
+private:
+  /// The input stack frame list that the provider transforms.
+  /// This could be a real StackFrameList or another SyntheticStackFrameList.
+  lldb::StackFrameListSP m_input_frames;
+};
+
 } // namespace lldb_private
 
 #endif // LLDB_TARGET_STACKFRAMELIST_H
diff --git a/lldb/include/lldb/Target/SyntheticFrameProvider.h b/lldb/include/lldb/Target/SyntheticFrameProvider.h
index 61a492f356ece..2d5330cb03105 100644
--- a/lldb/include/lldb/Target/SyntheticFrameProvider.h
+++ b/lldb/include/lldb/Target/SyntheticFrameProvider.h
@@ -24,22 +24,25 @@ namespace lldb_private {
 
 /// This struct contains the metadata needed to instantiate a frame provider
 /// and optional filters to control which threads it applies to.
-struct SyntheticFrameProviderDescriptor {
+struct ScriptedFrameProviderDescriptor {
   /// Metadata for instantiating the provider (e.g. script class name and args).
   lldb::ScriptedMetadataSP scripted_metadata_sp;
 
+  /// Interface for calling static methods on the provider class.
+  lldb::ScriptedFrameProviderInterfaceSP interface_sp;
+
   /// Optional list of thread specifications to which this provider applies.
   /// If empty, the provider applies to all threads. A thread matches if it
   /// satisfies ANY of the specs in this vector (OR logic).
   std::vector<ThreadSpec> thread_specs;
 
-  SyntheticFrameProviderDescriptor() = default;
+  ScriptedFrameProviderDescriptor() = default;
 
-  SyntheticFrameProviderDescriptor(lldb::ScriptedMetadataSP metadata_sp)
+  ScriptedFrameProviderDescriptor(lldb::ScriptedMetadataSP metadata_sp)
       : scripted_metadata_sp(metadata_sp) {}
 
-  SyntheticFrameProviderDescriptor(lldb::ScriptedMetadataSP metadata_sp,
-                                   const std::vector<ThreadSpec> &specs)
+  ScriptedFrameProviderDescriptor(lldb::ScriptedMetadataSP metadata_sp,
+                                  const std::vector<ThreadSpec> &specs)
       : scripted_metadata_sp(metadata_sp), thread_specs(specs) {}
 
   /// Get the name of this descriptor (the scripted class name).
@@ -47,6 +50,12 @@ struct SyntheticFrameProviderDescriptor {
     return scripted_metadata_sp ? scripted_metadata_sp->GetClassName() : "";
   }
 
+  /// Get the description of this frame provider.
+  ///
+  /// \return A string describing what this frame provider does, or an
+  ///         empty string if no description is available.
+  std::string GetDescription() const;
+
   /// Check if this descriptor applies to the given thread.
   bool AppliesToThread(Thread &thread) const {
     // If no thread specs specified, applies to all threads.
@@ -64,6 +73,13 @@ struct SyntheticFrameProviderDescriptor {
   /// Check if this descriptor has valid metadata for script-based providers.
   bool IsValid() const { return scripted_metadata_sp != nullptr; }
 
+  /// Get a unique identifier for this descriptor based on its contents.
+  /// The ID is computed from the class name and arguments dictionary,
+  /// not from the pointer address, so two descriptors with the same
+  /// contents will have the same ID.
+  uint32_t GetID() const;
+
+  /// Dump a description of this descriptor to the given stream.
   void Dump(Stream *s) const;
 };
 
@@ -95,7 +111,7 @@ class SyntheticFrameProvider : public PluginInterface {
   ///     otherwise an \a llvm::Error.
   static llvm::Expected<lldb::SyntheticFrameProviderSP>
   CreateInstance(lldb::StackFrameListSP input_frames,
-                 const SyntheticFrameProviderDescriptor &descriptor);
+                 const ScriptedFrameProviderDescriptor &descriptor);
 
   /// Try to create a SyntheticFrameProvider instance for the given input
   /// frames using a specific C++ plugin.
@@ -125,6 +141,8 @@ class SyntheticFrameProvider : public PluginInterface {
 
   ~SyntheticFrameProvider() override;
 
+  virtual std::string GetDescription() const = 0;
+
   /// Get a single stack frame at the specified index.
   ///
   /// This method is called lazily - frames are only created when requested.
diff --git a/lldb/include/lldb/Target/Target.h b/lldb/include/lldb/Target/Target.h
index c0fcda7c0d960..812a638910b3b 100644
--- a/lldb/include/lldb/Target/Target.h
+++ b/lldb/include/lldb/Target/Target.h
@@ -32,6 +32,7 @@
 #include "lldb/Target/PathMappingList.h"
 #include "lldb/Target/SectionLoadHistory.h"
 #include "lldb/Target/Statistics.h"
+#include "lldb/Target/SyntheticFrameProvider.h"
 #include "lldb/Target/ThreadSpec.h"
 #include "lldb/Utility/ArchSpec.h"
 #include "lldb/Utility/Broadcaster.h"
@@ -745,6 +746,36 @@ class Target : public std::enable_shared_from_this<Target>,
   Status Attach(ProcessAttachInfo &attach_info,
                 Stream *stream); // Optional stream to receive first stop info
 
+  /// Add or update a scripted frame provider descriptor for this target.
+  /// All new threads in this target will check if they match any descriptors
+  /// to create their frame providers.
+  ///
+  /// \param[in] descriptor
+  ///     The descriptor to add or update.
+  ///
+  /// \return
+  ///     The descriptor identifier if the registration succeeded, otherwise an
+  ///     llvm::Error.
+  llvm::Expected<uint32_t> AddScriptedFrameProviderDescriptor(
+      const ScriptedFrameProviderDescriptor &descriptor);
+
+  /// Remove a scripted frame provider descriptor by id.
+  ///
+  /// \param[in] id
+  ///     The id of the descriptor to remove.
+  ///
+  /// \return
+  ///     True if a descriptor was removed, false if no descriptor with that
+  ///     id existed.
+  bool RemoveScriptedFrameProviderDescriptor(uint32_t id);
+
+  /// Clear all scripted frame provider descriptors for this target.
+  void ClearScriptedFrameProviderDescriptors();
+
+  /// Get all scripted frame provider descriptors for this target.
+  const llvm::DenseMap<uint32_t, ScriptedFrameProviderDescriptor> &
+  GetScriptedFrameProviderDescriptors() const;
+
   // This part handles the breakpoints.
 
   BreakpointList &GetBreakpointList(bool internal = false);
@@ -1744,6 +1775,13 @@ class Target : public std::enable_shared_from_this<Target>,
   PathMappingList m_image_search_paths;
   TypeSystemMap m_scratch_type_system_map;
 
+  /// Map of scripted frame provider descriptors for this target.
+  /// Keys are the provider descriptor IDs, values are the descriptors.
+  /// Used to initialize frame providers for new threads.
+  llvm::DenseMap<uint32_t, ScriptedFrameProviderDescriptor>
+      m_frame_provider_descriptors;
+  mutable std::recursive_mutex m_frame_provider_descriptors_mutex;
+
   typedef std::map<lldb::LanguageType, lldb::REPLSP> REPLMap;
   REPLMap m_repl_map;
 
diff --git a/lldb/include/lldb/Target/TargetList.h b/lldb/include/lldb/Target/TargetList.h
index 88272512bcc0f..d7ff639d0d2b6 100644
--- a/lldb/include/lldb/Target/TargetList.h
+++ b/lldb/include/lldb/Target/TargetList.h
@@ -216,6 +216,11 @@ class TargetList : public Broadcaster {
       llvm::StringRef triple_str, LoadDependentFiles load_dependent_files,
       const OptionGroupPlatform *platform_options, lldb::TargetSP &target_sp);
 
+  // CreateTargetInternal does not modify any state directly and should not
+  // be called while holding the target list mutex. Instead, any state changes
+  // should call into methods that are themselves protected by the target list
+  // mutex. This is needed so the locate-module callback does not cause a
+  // re-entrant deadlock when creating the target.
   static Status CreateTargetInternal(Debugger &debugger,
                                      llvm::StringRef user_exe_path,
                                      const ArchSpec &arch,
diff --git a/lldb/include/lldb/Target/Thread.h b/lldb/include/lldb/Target/Thread.h
index 841f80cd1b1eb..46ce192556756 100644
--- a/lldb/include/lldb/Target/Thread.h
+++ b/lldb/include/lldb/Target/Thread.h
@@ -1297,6 +1297,15 @@ class Thread : public std::enable_shared_from_this<Thread>,
 
   lldb::StackFrameListSP GetStackFrameList();
 
+  llvm::Error
+  LoadScriptedFrameProvider(const ScriptedFrameProviderDescriptor &descriptor);
+
+  void ClearScriptedFrameProvider();
+
+  lldb::SyntheticFrameProviderSP GetFrameProvider() const {
+    return m_frame_provider_sp;
+  }
+
 protected:
   friend class ThreadPlan;
   friend class ThreadList;
@@ -1400,6 +1409,9 @@ class Thread : public std::enable_shared_from_this<Thread>,
   /// The Thread backed by this thread, if any.
   lldb::ThreadWP m_backed_thread;
 
+  /// The Scripted Frame Provider, if any.
+  lldb::SyntheticFrameProviderSP m_frame_provider_sp;
+
 private:
   bool m_extended_info_fetched; // Have we tried to retrieve the m_extended_info
                                 // for this thread?
diff --git a/lldb/include/lldb/Target/ThreadSpec.h b/lldb/include/lldb/Target/ThreadSpec.h
index 7c7c832741196..63f8f8b5ec181 100644
--- a/lldb/include/lldb/Target/ThreadSpec.h
+++ b/lldb/include/lldb/Target/ThreadSpec.h
@@ -34,6 +34,8 @@ class ThreadSpec {
 public:
   ThreadSpec();
 
+  ThreadSpec(Thread &thread);
+
   static std::unique_ptr<ThreadSpec>
   CreateFromStructuredData(const StructuredData::Dictionary &data_dict,
                            Status &error);
diff --git a/lldb/include/lldb/Utility/DataExtractor.h b/lldb/include/lldb/Utility/DataExtractor.h
index b4960f5e87c85..db85b44debf43 100644
--- a/lldb/include/lldb/Utility/DataExtractor.h
+++ b/lldb/include/lldb/Utility/DataExtractor.h
@@ -334,7 +334,8 @@ class DataExtractor {
   /// \return
   ///     A pointer to the bytes in this object's data if the offset
   ///     and length are valid, or nullptr otherwise.
-  const void *GetData(lldb::offset_t *offset_ptr, lldb::offset_t length) const {
+  virtual const void *GetData(lldb::offset_t *offset_ptr,
+                              lldb::offset_t length) const {
     const uint8_t *ptr = PeekData(*offset_ptr, length);
     if (ptr)
       *offset_ptr += length;
@@ -609,17 +610,17 @@ class DataExtractor {
   ///     The extracted uint8_t value.
   uint8_t GetU8(lldb::offset_t *offset_ptr) const;
 
-  uint8_t GetU8_unchecked(lldb::offset_t *offset_ptr) const {
+  virtual uint8_t GetU8_unchecked(lldb::offset_t *offset_ptr) const {
     uint8_t val = m_start[*offset_ptr];
     *offset_ptr += 1;
     return val;
   }
 
-  uint16_t GetU16_unchecked(lldb::offset_t *offset_ptr) const;
+  virtual uint16_t GetU16_unchecked(lldb::offset_t *offset_ptr) const;
 
-  uint32_t GetU32_unchecked(lldb::offset_t *offset_ptr) const;
+  virtual uint32_t GetU32_unchecked(lldb::offset_t *offset_ptr) const;
 
-  uint64_t GetU64_unchecked(lldb::offset_t *offset_ptr) const;
+  virtual uint64_t GetU64_unchecked(lldb::offset_t *offset_ptr) const;
   /// Extract \a count uint8_t values from \a *offset_ptr.
   ///
   /// Extract \a count uint8_t values from the binary data at the offset
@@ -829,7 +830,8 @@ class DataExtractor {
   ///     A non-nullptr data pointer if \a offset is a valid offset and
   ///     there are \a length bytes available at that offset, nullptr
   ///     otherwise.
-  const uint8_t *PeekData(lldb::offset_t offset, lldb::offset_t length) const {
+  virtual const uint8_t *PeekData(lldb::offset_t offset,
+                                  lldb::offset_t length) const {
     if (ValidOffsetForDataOfSize(offset, length))
       return m_start + offset;
     return nullptr;
diff --git a/lldb/include/lldb/Utility/RangeMap.h b/lldb/include/lldb/Utility/RangeMap.h
index e701ae1ba96c8..24ed4a55deb68 100644
--- a/lldb/include/lldb/Utility/RangeMap.h
+++ b/lldb/include/lldb/Utility/RangeMap.h
@@ -465,6 +465,10 @@ class RangeDataVector {
 
   RangeDataVector(Compare compare = Compare()) : m_compare(compare) {}
 
+  RangeDataVector(std::initializer_list<AugmentedEntry> entries,
+                  Compare compare = Compare())
+      : m_entries(entries), m_compare(compare) {}
+
   ~RangeDataVector() = default;
 
   void Append(const Entry &entry) { m_entries.emplace_back(entry); }
diff --git a/lldb/include/lldb/Utility/ScriptedMetadata.h b/lldb/include/lldb/Utility/ScriptedMetadata.h
index 69c83edce909a..8523c95429718 100644
--- a/lldb/include/lldb/Utility/ScriptedMetadata.h
+++ b/lldb/include/lldb/Utility/ScriptedMetadata.h
@@ -10,7 +10,9 @@
 #define LLDB_INTERPRETER_SCRIPTEDMETADATA_H
 
 #include "lldb/Utility/ProcessInfo.h"
+#include "lldb/Utility/StreamString.h"
 #include "lldb/Utility/StructuredData.h"
+#include "llvm/ADT/Hashing.h"
 
 namespace lldb_private {
 class ScriptedMetadata {
@@ -27,11 +29,36 @@ class ScriptedMetadata {
     }
   }
 
+  ScriptedMetadata(const ScriptedMetadata &other)
+      : m_class_name(other.m_class_name), m_args_sp(other.m_args_sp) {}
+
   explicit operator bool() const { return !m_class_name.empty(); }
 
   llvm::StringRef GetClassName() const { return m_class_name; }
   StructuredData::DictionarySP GetArgsSP() const { return m_args_sp; }
 
+  /// Get a unique identifier for this metadata based on its contents.
+  /// The ID is computed from the class name and arguments dictionary,
+  /// not from the pointer address, so two metadata objects with the same
+  /// contents will have the same ID.
+  uint32_t GetID() const {
+    if (m_class_name.empty())
+      return 0;
+
+    // Hash the class name.
+    llvm::hash_code hash = llvm::hash_value(m_class_name);
+
+    // Hash the arguments dictionary if present.
+    if (m_args_sp) {
+      StreamString ss;
+      m_args_sp->GetDescription(ss);
+      hash = llvm::hash_combine(hash, llvm::hash_value(ss.GetData()));
+    }
+
+    // Return the lower 32 bits of the hash.
+    return static_cast<uint32_t>(hash);
+  }
+
 private:
   std::string m_class_name;
   StructuredData::DictionarySP m_args_sp;
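
A small illustrative sketch of the content-based identity GetID() enables; SameProvider is a hypothetical helper written for illustration, not part of the patch.

// Minimal illustrative sketch, not part of the patch.
#include "lldb/Utility/ScriptedMetadata.h"

static bool SameProvider(const lldb_private::ScriptedMetadata &lhs,
                         const lldb_private::ScriptedMetadata &rhs) {
  // Two metadata objects built from the same class name and arguments report
  // equal IDs, so GetID() can serve as a stable map key for deduplication.
  return lhs.GetID() == rhs.GetID();
}
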
diff --git a/lldb/include/lldb/Utility/VirtualDataExtractor.h b/lldb/include/lldb/Utility/VirtualDataExtractor.h
new file mode 100644
index 0000000000000..e430dd8628b5f
--- /dev/null
+++ b/lldb/include/lldb/Utility/VirtualDataExtractor.h
@@ -0,0 +1,75 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLDB_UTILITY_VIRTUALDATAEXTRACTOR_H
+#define LLDB_UTILITY_VIRTUALDATAEXTRACTOR_H
+
+#include "lldb/Utility/DataExtractor.h"
+#include "lldb/Utility/RangeMap.h"
+#include "lldb/lldb-types.h"
+
+namespace lldb_private {
+
+/// A DataExtractor subclass that allows reading data at virtual addresses
+/// using a lookup table that maps virtual address ranges to physical offsets.
+///
+/// This class maintains a lookup table where each entry contains:
+/// - base: starting virtual address for this entry
+/// - size: size of this entry in bytes
+/// - data: physical offset in the underlying data buffer
+///
+/// Reads are translated from virtual addresses to physical offsets using
+/// this lookup table. Reads cannot cross entry boundaries and this is
+/// enforced with assertions.
+class VirtualDataExtractor : public DataExtractor {
+public:
+  /// Type alias for the range map used internally.
+  /// Maps virtual addresses (base) to physical offsets (data).
+  using LookupTable =
+      RangeDataVector<lldb::offset_t, lldb::offset_t, lldb::offset_t>;
+
+  VirtualDataExtractor() = default;
+
+  VirtualDataExtractor(const void *data, lldb::offset_t data_length,
+                       lldb::ByteOrder byte_order, uint32_t addr_size,
+                       LookupTable lookup_table);
+
+  VirtualDataExtractor(const lldb::DataBufferSP &data_sp,
+                       lldb::ByteOrder byte_order, uint32_t addr_size,
+                       LookupTable lookup_table);
+
+  const void *GetData(lldb::offset_t *offset_ptr,
+                      lldb::offset_t length) const override;
+
+  const uint8_t *PeekData(lldb::offset_t offset,
+                          lldb::offset_t length) const override;
+
+  /// Unchecked overrides
+  /// @{
+  uint8_t GetU8_unchecked(lldb::offset_t *offset_ptr) const override;
+  uint16_t GetU16_unchecked(lldb::offset_t *offset_ptr) const override;
+  uint32_t GetU32_unchecked(lldb::offset_t *offset_ptr) const override;
+  uint64_t GetU64_unchecked(lldb::offset_t *offset_ptr) const override;
+  /// @}
+
+protected:
+  /// Find the lookup entry that contains the given virtual address.
+  const LookupTable::Entry *FindEntry(lldb::offset_t virtual_addr) const;
+
+  /// Validate that a read at a virtual address is within bounds and
+  /// does not cross entry boundaries.
+  bool ValidateVirtualRead(lldb::offset_t virtual_addr,
+                           lldb::offset_t length) const;
+
+private:
+  LookupTable m_lookup_table;
+};
+
+} // namespace lldb_private
+
+#endif // LLDB_UTILITY_VIRTUALDATAEXTRACTOR_H
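
To make the lookup-table translation concrete, a hypothetical sketch follows; the constants, the Append/Sort usage on the RangeDataVector, and ReadAtVirtual are assumptions made for illustration, not part of the patch.

// Minimal illustrative sketch, not part of the patch.
#include "lldb/Utility/VirtualDataExtractor.h"
#include <utility>

static uint32_t ReadAtVirtual(const lldb::DataBufferSP &buffer_sp) {
  // Map virtual addresses 0x2000..0x200f onto physical offsets 0x10..0x1f of
  // buffer_sp (which must therefore hold at least 0x20 bytes).
  lldb_private::VirtualDataExtractor::LookupTable table;
  table.Append({0x2000, 0x10, 0x10});
  table.Sort();
  lldb_private::VirtualDataExtractor data(buffer_sp, lldb::eByteOrderLittle,
                                          /*addr_size=*/8, std::move(table));
  // The offset passed in is a virtual address; the extractor translates it to
  // physical offset 0x10 before reading four bytes.
  lldb::offset_t offset = 0x2000;
  return data.GetU32_unchecked(&offset);
}
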
diff --git a/lldb/include/lldb/ValueObject/DILAST.h b/lldb/include/lldb/ValueObject/DILAST.h
index 91f8d93c09622..9fda0c798ec4e 100644
--- a/lldb/include/lldb/ValueObject/DILAST.h
+++ b/lldb/include/lldb/ValueObject/DILAST.h
@@ -21,6 +21,7 @@ enum class NodeKind {
   eArraySubscriptNode,
   eBitExtractionNode,
   eBooleanLiteralNode,
+  eCastNode,
   eErrorNode,
   eFloatLiteralNode,
   eIdentifierNode,
@@ -37,6 +38,14 @@ enum class UnaryOpKind {
   Plus,   // "+"
 };
 
+/// The type casts allowed by DIL.
+enum class CastKind {
+  eEnumeration, ///< Casting from a scalar to an enumeration type
+  eNullptr,     ///< Casting to a nullptr type
+  eReference,   ///< Casting to a reference type
+  eNone,        ///< Type promotion casting
+};
+
 /// Forward declaration, for use in DIL AST nodes. Definition is at the very
 /// end of this file.
 class Visitor;
@@ -246,6 +255,29 @@ class BooleanLiteralNode : public ASTNode {
   bool m_value;
 };
 
+class CastNode : public ASTNode {
+public:
+  CastNode(uint32_t location, CompilerType type, ASTNodeUP operand,
+           CastKind kind)
+      : ASTNode(location, NodeKind::eCastNode), m_type(type),
+        m_operand(std::move(operand)), m_cast_kind(kind) {}
+
+  llvm::Expected<lldb::ValueObjectSP> Accept(Visitor *v) const override;
+
+  CompilerType GetType() const { return m_type; }
+  ASTNode *GetOperand() const { return m_operand.get(); }
+  CastKind GetCastKind() const { return m_cast_kind; }
+
+  static bool classof(const ASTNode *node) {
+    return node->GetKind() == NodeKind::eCastNode;
+  }
+
+private:
+  CompilerType m_type;
+  ASTNodeUP m_operand;
+  CastKind m_cast_kind;
+};
+
 /// This class contains one Visit method for each specialized type of
 /// DIL AST node. The Visit methods are used to dispatch a DIL AST node to
 /// the correct function in the DIL expression evaluator for evaluating that
@@ -269,6 +301,7 @@ class Visitor {
   Visit(const FloatLiteralNode *node) = 0;
   virtual llvm::Expected<lldb::ValueObjectSP>
   Visit(const BooleanLiteralNode *node) = 0;
+  virtual llvm::Expected<lldb::ValueObjectSP> Visit(const CastNode *node) = 0;
 };
 
 } // namespace lldb_private::dil
diff --git a/lldb/include/lldb/ValueObject/DILEval.h b/lldb/include/lldb/ValueObject/DILEval.h
index a65edc58cc4e7..2db45a7c37314 100644
--- a/lldb/include/lldb/ValueObject/DILEval.h
+++ b/lldb/include/lldb/ValueObject/DILEval.h
@@ -60,6 +60,7 @@ class Interpreter : Visitor {
   Visit(const FloatLiteralNode *node) override;
   llvm::Expected<lldb::ValueObjectSP>
   Visit(const BooleanLiteralNode *node) override;
+  llvm::Expected<lldb::ValueObjectSP> Visit(const CastNode *node) override;
 
   /// Perform usual unary conversions on a value. At the moment this
   /// includes array-to-pointer and integral promotion for eligible types.
diff --git a/lldb/include/lldb/ValueObject/DILParser.h b/lldb/include/lldb/ValueObject/DILParser.h
index c9d28333ffed1..dd5c3fb028c70 100644
--- a/lldb/include/lldb/ValueObject/DILParser.h
+++ b/lldb/include/lldb/ValueObject/DILParser.h
@@ -101,6 +101,12 @@ class DILParser {
   ASTNodeUP ParseFloatingPointLiteral();
   ASTNodeUP ParseBooleanLiteral();
 
+  ASTNodeUP ParseCastExpression();
+  std::optional<CompilerType> ParseBuiltinType();
+  std::optional<CompilerType> ParseTypeId();
+  CompilerType ResolveTypeDeclarators(CompilerType type,
+                                      const std::vector<Token> &ptr_operators);
+
   void BailOut(const std::string &error, uint32_t loc, uint16_t err_len);
 
   void Expect(Token::Kind kind);
diff --git a/lldb/include/lldb/ValueObject/ValueObjectSynthetic.h b/lldb/include/lldb/ValueObject/ValueObjectSynthetic.h
index 063d796ee4eec..1a82fd78bbba3 100644
--- a/lldb/include/lldb/ValueObject/ValueObjectSynthetic.h
+++ b/lldb/include/lldb/ValueObject/ValueObjectSynthetic.h
@@ -123,6 +123,11 @@ class ValueObjectSynthetic : public ValueObject {
 
   void SetLanguageFlags(uint64_t flags) override;
 
+  void
+  GetExpressionPath(Stream &stream,
+                    GetExpressionPathFormat epformat =
+                        eGetExpressionPathFormatDereferencePointers) override;
+
 protected:
   bool UpdateValue() override;
 
diff --git a/lldb/include/lldb/lldb-enumerations.h b/lldb/include/lldb/lldb-enumerations.h
index 1a7db8faecd94..79f22be1c95d3 100644
--- a/lldb/include/lldb/lldb-enumerations.h
+++ b/lldb/include/lldb/lldb-enumerations.h
@@ -542,6 +542,7 @@ enum InstrumentationRuntimeType {
   eInstrumentationRuntimeTypeMainThreadChecker = 0x0003,
   eInstrumentationRuntimeTypeSwiftRuntimeReporting = 0x0004,
   eInstrumentationRuntimeTypeLibsanitizersAsan = 0x0005,
+  eInstrumentationRuntimeTypeBoundsSafety = 0x0006,
   eNumInstrumentationRuntimeTypes
 };
 
diff --git a/lldb/include/lldb/lldb-forward.h b/lldb/include/lldb/lldb-forward.h
index c8e2e97953aa4..ccfe5efa19e1d 100644
--- a/lldb/include/lldb/lldb-forward.h
+++ b/lldb/include/lldb/lldb-forward.h
@@ -342,6 +342,7 @@ typedef std::shared_ptr<lldb_private::CompileUnit> CompUnitSP;
 typedef std::shared_ptr<lldb_private::DataBuffer> DataBufferSP;
 typedef std::shared_ptr<lldb_private::WritableDataBuffer> WritableDataBufferSP;
 typedef std::shared_ptr<lldb_private::DataExtractor> DataExtractorSP;
+typedef std::unique_ptr<lldb_private::DataExtractor> DataExtractorUP;
 typedef std::shared_ptr<lldb_private::Debugger> DebuggerSP;
 typedef std::weak_ptr<lldb_private::Debugger> DebuggerWP;
 typedef std::shared_ptr<lldb_private::Disassembler> DisassemblerSP;
@@ -439,6 +440,7 @@ typedef std::unique_ptr<lldb_private::SourceManager> SourceManagerUP;
 typedef std::shared_ptr<lldb_private::StackFrame> StackFrameSP;
 typedef std::weak_ptr<lldb_private::StackFrame> StackFrameWP;
 typedef std::shared_ptr<lldb_private::StackFrameList> StackFrameListSP;
+typedef std::weak_ptr<lldb_private::StackFrameList> StackFrameListWP;
 typedef std::shared_ptr<lldb_private::StackFrameRecognizer>
     StackFrameRecognizerSP;
 typedef std::unique_ptr<lldb_private::StackFrameRecognizerManager>
diff --git a/lldb/include/lldb/lldb-private-interfaces.h b/lldb/include/lldb/lldb-private-interfaces.h
index 5fc5c14c52f9e..52806eea190a7 100644
--- a/lldb/include/lldb/lldb-private-interfaces.h
+++ b/lldb/include/lldb/lldb-private-interfaces.h
@@ -26,7 +26,7 @@ class Value;
 
 namespace lldb_private {
 class ScriptedInterfaceUsages;
-struct SyntheticFrameProviderDescriptor;
+struct ScriptedFrameProviderDescriptor;
 typedef lldb::ABISP (*ABICreateInstance)(lldb::ProcessSP process_sp,
                                          const ArchSpec &arch);
 typedef std::unique_ptr<Architecture> (*ArchitectureCreateInstance)(
@@ -91,7 +91,7 @@ typedef lldb::ScriptInterpreterSP (*ScriptInterpreterCreateInstance)(
 typedef llvm::Expected<lldb::SyntheticFrameProviderSP> (
     *ScriptedFrameProviderCreateInstance)(
     lldb::StackFrameListSP input_frames,
-    const lldb_private::SyntheticFrameProviderDescriptor &descriptor);
+    const lldb_private::ScriptedFrameProviderDescriptor &descriptor);
 typedef llvm::Expected<lldb::SyntheticFrameProviderSP> (
     *SyntheticFrameProviderCreateInstance)(
     lldb::StackFrameListSP input_frames,
diff --git a/lldb/packages/Python/lldbsuite/test/gdbclientutils.py b/lldb/packages/Python/lldbsuite/test/gdbclientutils.py
index bd2fdc0a60cb4..4c40299f3256d 100644
--- a/lldb/packages/Python/lldbsuite/test/gdbclientutils.py
+++ b/lldb/packages/Python/lldbsuite/test/gdbclientutils.py
@@ -264,31 +264,31 @@ def _respond_impl(self, packet) -> Union[Response, List[Response]]:
 
         return self.other(packet)
 
-    def qsProcessInfo(self):
+    def qsProcessInfo(self) -> str:
         return "E04"
 
-    def qfProcessInfo(self, packet):
+    def qfProcessInfo(self, packet) -> str:
         return "E04"
 
-    def jGetLoadedDynamicLibrariesInfos(self, packet):
+    def jGetLoadedDynamicLibrariesInfos(self, packet) -> str:
         return ""
 
-    def qGetWorkingDir(self):
+    def qGetWorkingDir(self) -> str:
         return "2f"
 
-    def qOffsets(self):
+    def qOffsets(self) -> str:
         return ""
 
-    def qProcessInfo(self):
+    def qProcessInfo(self) -> str:
         return ""
 
-    def qHostInfo(self):
+    def qHostInfo(self) -> str:
         return "ptrsize:8;endian:little;"
 
-    def qEcho(self, num: int):
+    def qEcho(self, num: int) -> str:
         return "E04"
 
-    def qQueryGDBServer(self):
+    def qQueryGDBServer(self) -> str:
         return "E04"
 
     def interrupt(self):
@@ -300,10 +300,10 @@ def cont(self):
     def vCont(self, packet):
         raise self.UnexpectedPacketException()
 
-    def A(self, packet):
+    def A(self, packet) -> str:
         return ""
 
-    def D(self, packet):
+    def D(self, packet) -> str:
         return "OK"
 
     def readRegisters(self) -> str:
@@ -312,40 +312,40 @@ def readRegisters(self) -> str:
     def readRegister(self, register: int) -> str:
         return "00000000"
 
-    def writeRegisters(self, registers_hex):
+    def writeRegisters(self, registers_hex) -> str:
         return "OK"
 
-    def writeRegister(self, register, value_hex):
+    def writeRegister(self, register, value_hex) -> str:
         return "OK"
 
-    def readMemory(self, addr, length):
+    def readMemory(self, addr, length) -> str:
         return "00" * length
 
-    def x(self, addr, length):
+    def x(self, addr, length) -> str:
         return ""
 
-    def writeMemory(self, addr, data_hex):
+    def writeMemory(self, addr, data_hex) -> str:
         return "OK"
 
-    def qSymbol(self, symbol_args):
+    def qSymbol(self, symbol_args) -> str:
         return "OK"
 
-    def qSupported(self, client_supported):
+    def qSupported(self, client_supported) -> str:
         return "qXfer:features:read+;PacketSize=3fff;QStartNoAckMode+"
 
-    def qfThreadInfo(self):
+    def qfThreadInfo(self) -> str:
         return "l"
 
-    def qsThreadInfo(self):
+    def qsThreadInfo(self) -> str:
         return "l"
 
-    def qC(self):
+    def qC(self) -> str:
         return "QC0"
 
-    def QEnableErrorStrings(self):
+    def QEnableErrorStrings(self) -> str:
         return "OK"
 
-    def haltReason(self):
+    def haltReason(self) -> str:
         # SIGINT is 2, return type is 2 digit hex string
         return "S02"
 
@@ -360,50 +360,50 @@ def _qXferResponse(self, data, has_more):
     def vAttach(self, pid):
         raise self.UnexpectedPacketException()
 
-    def selectThread(self, op, thread_id):
+    def selectThread(self, op, thread_id) -> str:
         return "OK"
 
-    def setBreakpoint(self, packet):
+    def setBreakpoint(self, packet) -> str:
         raise self.UnexpectedPacketException()
 
-    def threadStopInfo(self, threadnum):
+    def threadStopInfo(self, threadnum) -> str:
         return ""
 
-    def other(self, packet):
+    def other(self, packet) -> str:
         # empty string means unsupported
         return ""
 
-    def QThreadSuffixSupported(self):
+    def QThreadSuffixSupported(self) -> str:
         return ""
 
-    def QListThreadsInStopReply(self):
+    def QListThreadsInStopReply(self) -> str:
         return ""
 
-    def qMemoryRegionInfo(self, addr):
+    def qMemoryRegionInfo(self, addr) -> str:
         return ""
 
-    def qPathComplete(self):
+    def qPathComplete(self) -> str:
         return ""
 
-    def vFile(self, packet):
+    def vFile(self, packet) -> str:
         return ""
 
-    def vRun(self, packet):
+    def vRun(self, packet) -> str:
         return ""
 
     def qLaunchGDBServer(self, host):
         raise self.UnexpectedPacketException()
 
-    def qLaunchSuccess(self):
+    def qLaunchSuccess(self) -> str:
         return ""
 
-    def QEnvironment(self, packet):
+    def QEnvironment(self, packet) -> str:
         return "OK"
 
-    def QEnvironmentHexEncoded(self, packet):
+    def QEnvironmentHexEncoded(self, packet) -> str:
         return "OK"
 
-    def qRegisterInfo(self, num):
+    def qRegisterInfo(self, num) -> str:
         return ""
 
     def k(self):
diff --git a/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/dap_server.py b/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/dap_server.py
index 4a7ba78b63993..7a9d5a82983d7 100644
--- a/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/dap_server.py
+++ b/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/dap_server.py
@@ -391,7 +391,7 @@ def _process_recv_packets(self) -> None:
         with self._recv_condition:
             for packet in self._recv_packets:
                 if packet and ("seq" not in packet or packet["seq"] == 0):
-                    warnings.warn(
+                    raise ValueError(
                         f"received a malformed packet, expected 'seq != 0' for {packet!r}"
                     )
                 # Handle events that may modify any stateful properties of
diff --git a/lldb/source/API/SBDebugger.cpp b/lldb/source/API/SBDebugger.cpp
index 7a4bebfdf998e..f939955ba57c8 100644
--- a/lldb/source/API/SBDebugger.cpp
+++ b/lldb/source/API/SBDebugger.cpp
@@ -179,7 +179,7 @@ void SBDebugger::Initialize() {
 lldb::SBError SBDebugger::InitializeWithErrorHandling() {
   LLDB_INSTRUMENT();
 
-  SBError error;
+  SBError error((Status()));
   if (auto e = g_debugger_lifetime->Initialize(
           std::make_unique<SystemInitializerFull>())) {
     error.SetError(Status::FromError(std::move(e)));
diff --git a/lldb/source/API/SBTarget.cpp b/lldb/source/API/SBTarget.cpp
index 578a7bdf7433d..78c2d49d647b5 100644
--- a/lldb/source/API/SBTarget.cpp
+++ b/lldb/source/API/SBTarget.cpp
@@ -23,6 +23,7 @@
 #include "lldb/API/SBStringList.h"
 #include "lldb/API/SBStructuredData.h"
 #include "lldb/API/SBSymbolContextList.h"
+#include "lldb/API/SBThreadCollection.h"
 #include "lldb/API/SBTrace.h"
 #include "lldb/Breakpoint/BreakpointID.h"
 #include "lldb/Breakpoint/BreakpointIDList.h"
@@ -39,6 +40,7 @@
 #include "lldb/Core/Section.h"
 #include "lldb/Core/StructuredDataImpl.h"
 #include "lldb/Host/Host.h"
+#include "lldb/Interpreter/Interfaces/ScriptedFrameProviderInterface.h"
 #include "lldb/Symbol/DeclVendor.h"
 #include "lldb/Symbol/ObjectFile.h"
 #include "lldb/Symbol/SymbolFile.h"
@@ -50,6 +52,7 @@
 #include "lldb/Target/LanguageRuntime.h"
 #include "lldb/Target/Process.h"
 #include "lldb/Target/StackFrame.h"
+#include "lldb/Target/SyntheticFrameProvider.h"
 #include "lldb/Target/Target.h"
 #include "lldb/Target/TargetList.h"
 #include "lldb/Utility/ArchSpec.h"
@@ -59,6 +62,7 @@
 #include "lldb/Utility/LLDBLog.h"
 #include "lldb/Utility/ProcessInfo.h"
 #include "lldb/Utility/RegularExpression.h"
+#include "lldb/Utility/ScriptedMetadata.h"
 #include "lldb/ValueObject/ValueObjectConstResult.h"
 #include "lldb/ValueObject/ValueObjectList.h"
 #include "lldb/ValueObject/ValueObjectVariable.h"
@@ -2435,3 +2439,81 @@ lldb::SBMutex SBTarget::GetAPIMutex() const {
     return lldb::SBMutex(target_sp);
   return lldb::SBMutex();
 }
+
+uint32_t
+SBTarget::RegisterScriptedFrameProvider(const char *class_name,
+                                        lldb::SBStructuredData args_dict,
+                                        lldb::SBError &error) {
+  LLDB_INSTRUMENT_VA(this, class_name, args_dict, error);
+
+  TargetSP target_sp = GetSP();
+  if (!target_sp) {
+    error.SetErrorString("invalid target");
+    return 0;
+  }
+
+  if (!class_name || !class_name[0]) {
+    error.SetErrorString("invalid class name");
+    return 0;
+  }
+
+  // Extract the dictionary from SBStructuredData.
+  StructuredData::DictionarySP dict_sp;
+  if (args_dict.IsValid() && args_dict.m_impl_up) {
+    StructuredData::ObjectSP obj_sp = args_dict.m_impl_up->GetObjectSP();
+    if (obj_sp && obj_sp->GetType() != lldb::eStructuredDataTypeDictionary) {
+      error.SetErrorString("SBStructuredData argument isn't a dictionary");
+      return 0;
+    }
+    dict_sp = std::make_shared<StructuredData::Dictionary>(obj_sp);
+  }
+
+  // Create the ScriptedMetadata.
+  ScriptedMetadataSP metadata_sp =
+      std::make_shared<ScriptedMetadata>(class_name, dict_sp);
+
+  // Create the interface for calling static methods.
+  ScriptedFrameProviderInterfaceSP interface_sp =
+      target_sp->GetDebugger()
+          .GetScriptInterpreter()
+          ->CreateScriptedFrameProviderInterface();
+
+  // Create a descriptor (applies to all threads by default).
+  ScriptedFrameProviderDescriptor descriptor(metadata_sp);
+  descriptor.interface_sp = interface_sp;
+
+  llvm::Expected<uint32_t> descriptor_id_or_err =
+      target_sp->AddScriptedFrameProviderDescriptor(descriptor);
+  if (!descriptor_id_or_err) {
+    error.SetErrorString(
+        llvm::toString(descriptor_id_or_err.takeError()).c_str());
+    return 0;
+  }
+
+  // Return the id assigned to the registered descriptor.
+  return *descriptor_id_or_err;
+}
+
+lldb::SBError SBTarget::RemoveScriptedFrameProvider(uint32_t provider_id) {
+  LLDB_INSTRUMENT_VA(this, provider_id);
+
+  SBError error;
+  TargetSP target_sp = GetSP();
+  if (!target_sp) {
+    error.SetErrorString("invalid target");
+    return error;
+  }
+
+  if (!provider_id) {
+    error.SetErrorString("invalid provider id");
+    return error;
+  }
+
+  if (!target_sp->RemoveScriptedFrameProviderDescriptor(provider_id)) {
+    error.SetErrorStringWithFormat("no frame provider named '%u' found",
+                                   provider_id);
+    return error;
+  }
+
+  return {};
+}
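
The following is a minimal sketch (not part of this patch) of how a C++ SB API
client could drive the two calls added above; the class name
"my_providers.MyFrameProvider" and the JSON arguments are illustrative
assumptions only:

  #include "lldb/API/SBDebugger.h"
  #include "lldb/API/SBError.h"
  #include "lldb/API/SBStream.h"
  #include "lldb/API/SBStructuredData.h"
  #include "lldb/API/SBTarget.h"
  #include <cstdint>

  // Register a hypothetical scripted frame provider on `target`, then remove
  // it again using the id returned by RegisterScriptedFrameProvider.
  static void UseFrameProvider(lldb::SBTarget &target) {
    lldb::SBStream json;
    json.Print("{\"max_depth\": 4}"); // optional arguments for the provider
    lldb::SBStructuredData args;
    args.SetFromJSON(json);           // must be a dictionary, or left invalid

    lldb::SBError error;
    uint32_t id = target.RegisterScriptedFrameProvider(
        "my_providers.MyFrameProvider", args, error);
    if (error.Fail() || id == 0)
      return; // `error` carries the failure reason

    // ... later, unregister using the returned id.
    lldb::SBError remove_error = target.RemoveScriptedFrameProvider(id);
    (void)remove_error;
  }
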
diff --git a/lldb/source/Breakpoint/BreakpointSite.cpp b/lldb/source/Breakpoint/BreakpointSite.cpp
index fd7666be6b1bf..8639379afe1df 100644
--- a/lldb/source/Breakpoint/BreakpointSite.cpp
+++ b/lldb/source/Breakpoint/BreakpointSite.cpp
@@ -168,6 +168,22 @@ bool BreakpointSite::ValidForThisThread(Thread &thread) {
   return m_constituents.ValidForThisThread(thread);
 }
 
+bool BreakpointSite::ContainsUserBreakpointForThread(Thread &thread) {
+  if (ThreadSP backed_thread = thread.GetBackedThread())
+    return ContainsUserBreakpointForThread(*backed_thread);
+
+  std::lock_guard<std::recursive_mutex> guard(m_constituents_mutex);
+  for (const BreakpointLocationSP &bp_loc :
+       m_constituents.BreakpointLocations()) {
+    const Breakpoint &bp = bp_loc->GetBreakpoint();
+    if (bp.IsInternal())
+      continue;
+    if (bp_loc->ValidForThisThread(thread))
+      return true;
+  }
+  return false;
+}
+
 void BreakpointSite::BumpHitCounts() {
   std::lock_guard<std::recursive_mutex> guard(m_constituents_mutex);
   for (BreakpointLocationSP loc_sp : m_constituents.BreakpointLocations()) {
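
A tiny usage sketch for the predicate added above; the helper name
ShouldReportStopFor is an assumption for illustration, not code from this
patch:

  #include "lldb/Breakpoint/BreakpointSite.h"
  #include "lldb/Target/Thread.h"

  // A stop at `site` is user-visible for `thread` only if at least one
  // non-internal breakpoint location at the site is valid for that thread;
  // internal breakpoints are filtered out by the new predicate.
  static bool ShouldReportStopFor(lldb_private::BreakpointSite &site,
                                  lldb_private::Thread &thread) {
    return site.ContainsUserBreakpointForThread(thread);
  }
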
diff --git a/lldb/source/Commands/CommandObjectTarget.cpp b/lldb/source/Commands/CommandObjectTarget.cpp
index 7f880d223d6c3..6e8c94fa234cd 100644
--- a/lldb/source/Commands/CommandObjectTarget.cpp
+++ b/lldb/source/Commands/CommandObjectTarget.cpp
@@ -51,6 +51,7 @@
 #include "lldb/Utility/ConstString.h"
 #include "lldb/Utility/FileSpec.h"
 #include "lldb/Utility/LLDBLog.h"
+#include "lldb/Utility/ScriptedMetadata.h"
 #include "lldb/Utility/State.h"
 #include "lldb/Utility/Stream.h"
 #include "lldb/Utility/StructuredData.h"
@@ -5402,6 +5403,202 @@ class CommandObjectTargetDump : public CommandObjectMultiword {
   ~CommandObjectTargetDump() override = default;
 };
 
+#pragma mark CommandObjectTargetFrameProvider
+
+#define LLDB_OPTIONS_target_frame_provider_register
+#include "CommandOptions.inc"
+
+class CommandObjectTargetFrameProviderRegister : public CommandObjectParsed {
+public:
+  CommandObjectTargetFrameProviderRegister(CommandInterpreter &interpreter)
+      : CommandObjectParsed(
+            interpreter, "target frame-provider register",
+            "Register frame provider for all threads in this target.", nullptr,
+            eCommandRequiresTarget),
+
+        m_class_options("target frame-provider", true, 'C', 'k', 'v', 0) {
+    m_all_options.Append(&m_class_options, LLDB_OPT_SET_1 | LLDB_OPT_SET_2,
+                         LLDB_OPT_SET_ALL);
+    m_all_options.Finalize();
+
+    AddSimpleArgumentList(eArgTypeRunArgs, eArgRepeatOptional);
+  }
+
+  ~CommandObjectTargetFrameProviderRegister() override = default;
+
+  Options *GetOptions() override { return &m_all_options; }
+
+  std::optional<std::string> GetRepeatCommand(Args &current_command_args,
+                                              uint32_t index) override {
+    return std::string("");
+  }
+
+protected:
+  void DoExecute(Args &launch_args, CommandReturnObject &result) override {
+    ScriptedMetadataSP metadata_sp = std::make_shared<ScriptedMetadata>(
+        m_class_options.GetName(), m_class_options.GetStructuredData());
+
+    Target *target = m_exe_ctx.GetTargetPtr();
+    if (!target)
+      target = &GetDebugger().GetDummyTarget();
+
+    // Create the interface for calling static methods.
+    ScriptedFrameProviderInterfaceSP interface_sp =
+        GetDebugger()
+            .GetScriptInterpreter()
+            ->CreateScriptedFrameProviderInterface();
+
+    // Create a descriptor from the metadata (applies to all threads by
+    // default).
+    ScriptedFrameProviderDescriptor descriptor(metadata_sp);
+    descriptor.interface_sp = interface_sp;
+
+    auto id_or_err = target->AddScriptedFrameProviderDescriptor(descriptor);
+    if (!id_or_err) {
+      result.SetError(id_or_err.takeError());
+      return;
+    }
+
+    result.AppendMessageWithFormat(
+        "successfully registered scripted frame provider '%s' for target\n",
+        m_class_options.GetName().c_str());
+  }
+
+  OptionGroupPythonClassWithDict m_class_options;
+  OptionGroupOptions m_all_options;
+};
+
+class CommandObjectTargetFrameProviderClear : public CommandObjectParsed {
+public:
+  CommandObjectTargetFrameProviderClear(CommandInterpreter &interpreter)
+      : CommandObjectParsed(
+            interpreter, "target frame-provider clear",
+            "Clear all registered frame providers from this target.", nullptr,
+            eCommandRequiresTarget) {}
+
+  ~CommandObjectTargetFrameProviderClear() override = default;
+
+protected:
+  void DoExecute(Args &command, CommandReturnObject &result) override {
+    Target *target = m_exe_ctx.GetTargetPtr();
+    if (!target) {
+      result.AppendError("invalid target");
+      return;
+    }
+
+    target->ClearScriptedFrameProviderDescriptors();
+
+    result.SetStatus(eReturnStatusSuccessFinishResult);
+  }
+};
+
+class CommandObjectTargetFrameProviderList : public CommandObjectParsed {
+public:
+  CommandObjectTargetFrameProviderList(CommandInterpreter &interpreter)
+      : CommandObjectParsed(
+            interpreter, "target frame-provider list",
+            "List all registered frame providers for the target.", nullptr,
+            eCommandRequiresTarget) {}
+
+  ~CommandObjectTargetFrameProviderList() override = default;
+
+protected:
+  void DoExecute(Args &command, CommandReturnObject &result) override {
+    Target *target = m_exe_ctx.GetTargetPtr();
+    if (!target)
+      target = &GetDebugger().GetDummyTarget();
+
+    const auto &descriptors = target->GetScriptedFrameProviderDescriptors();
+    if (descriptors.empty()) {
+      result.AppendMessage("no frame providers registered for this target.");
+      result.SetStatus(eReturnStatusSuccessFinishResult);
+      return;
+    }
+
+    result.AppendMessageWithFormat("%u frame provider(s) registered:\n\n",
+                                   descriptors.size());
+
+    for (const auto &entry : descriptors) {
+      const ScriptedFrameProviderDescriptor &descriptor = entry.second;
+      descriptor.Dump(&result.GetOutputStream());
+      result.GetOutputStream().PutChar('\n');
+    }
+
+    result.SetStatus(eReturnStatusSuccessFinishResult);
+  }
+};
+
+class CommandObjectTargetFrameProviderRemove : public CommandObjectParsed {
+public:
+  CommandObjectTargetFrameProviderRemove(CommandInterpreter &interpreter)
+      : CommandObjectParsed(
+            interpreter, "target frame-provider remove",
+            "Remove a registered frame provider from the target by id.",
+            "target frame-provider remove <provider-id>",
+            eCommandRequiresTarget) {
+    AddSimpleArgumentList(eArgTypeUnsignedInteger, eArgRepeatPlus);
+  }
+
+  ~CommandObjectTargetFrameProviderRemove() override = default;
+
+protected:
+  void DoExecute(Args &command, CommandReturnObject &result) override {
+    Target *target = m_exe_ctx.GetTargetPtr();
+    if (!target)
+      target = &GetDebugger().GetDummyTarget();
+
+    std::vector<uint32_t> removed_provider_ids;
+    for (size_t i = 0; i < command.GetArgumentCount(); i++) {
+      uint32_t provider_id = 0;
+      if (!llvm::to_integer(command[i].ref(), provider_id)) {
+        result.AppendError("target frame-provider remove requires integer "
+                           "provider id argument");
+        return;
+      }
+
+      if (!target->RemoveScriptedFrameProviderDescriptor(provider_id)) {
+        result.AppendErrorWithFormat(
+            "no frame provider named '%u' found in target\n", provider_id);
+        return;
+      }
+      removed_provider_ids.push_back(provider_id);
+    }
+
+    if (size_t num_removed_providers = removed_provider_ids.size()) {
+      result.AppendMessageWithFormat(
+          "Successfully removed %zu frame-providers.\n", num_removed_providers);
+      result.SetStatus(eReturnStatusSuccessFinishNoResult);
+    } else {
+      result.AppendError("0 frame providers removed.\n");
+    }
+  }
+};
+
+class CommandObjectTargetFrameProvider : public CommandObjectMultiword {
+public:
+  CommandObjectTargetFrameProvider(CommandInterpreter &interpreter)
+      : CommandObjectMultiword(
+            interpreter, "target frame-provider",
+            "Commands for registering and viewing frame providers for the "
+            "target.",
+            "target frame-provider [<sub-command-options>] ") {
+    LoadSubCommand("register",
+                   CommandObjectSP(new CommandObjectTargetFrameProviderRegister(
+                       interpreter)));
+    LoadSubCommand("clear",
+                   CommandObjectSP(
+                       new CommandObjectTargetFrameProviderClear(interpreter)));
+    LoadSubCommand(
+        "list",
+        CommandObjectSP(new CommandObjectTargetFrameProviderList(interpreter)));
+    LoadSubCommand(
+        "remove", CommandObjectSP(
+                      new CommandObjectTargetFrameProviderRemove(interpreter)));
+  }
+
+  ~CommandObjectTargetFrameProvider() override = default;
+};
+
 #pragma mark CommandObjectMultiwordTarget
 
 // CommandObjectMultiwordTarget
@@ -5417,6 +5614,9 @@ CommandObjectMultiwordTarget::CommandObjectMultiwordTarget(
                  CommandObjectSP(new CommandObjectTargetDelete(interpreter)));
   LoadSubCommand("dump",
                  CommandObjectSP(new CommandObjectTargetDump(interpreter)));
+  LoadSubCommand(
+      "frame-provider",
+      CommandObjectSP(new CommandObjectTargetFrameProvider(interpreter)));
   LoadSubCommand("list",
                  CommandObjectSP(new CommandObjectTargetList(interpreter)));
   LoadSubCommand("select",
diff --git a/lldb/source/Core/Disassembler.cpp b/lldb/source/Core/Disassembler.cpp
index f2eb887986bfb..ed32caf361e0a 100644
--- a/lldb/source/Core/Disassembler.cpp
+++ b/lldb/source/Core/Disassembler.cpp
@@ -1341,6 +1341,11 @@ bool PseudoInstruction::DoesBranch() {
   return false;
 }
 
+bool PseudoInstruction::IsBarrier() {
+  // This is NOT a valid question for a pseudo instruction.
+  return false;
+}
+
 bool PseudoInstruction::HasDelaySlot() {
   // This is NOT a valid question for a pseudo instruction.
   return false;
diff --git a/lldb/source/Core/FormatEntity.cpp b/lldb/source/Core/FormatEntity.cpp
index 491f5c6320d97..c528a14fa76d0 100644
--- a/lldb/source/Core/FormatEntity.cpp
+++ b/lldb/source/Core/FormatEntity.cpp
@@ -27,6 +27,7 @@
 #include "lldb/Symbol/Symbol.h"
 #include "lldb/Symbol/SymbolContext.h"
 #include "lldb/Symbol/VariableList.h"
+#include "lldb/Target/BorrowedStackFrame.h"
 #include "lldb/Target/ExecutionContext.h"
 #include "lldb/Target/ExecutionContextScope.h"
 #include "lldb/Target/Language.h"
@@ -109,6 +110,7 @@ constexpr Definition g_frame_child_entries[] = {
                                   g_string_entry),
     Definition("is-artificial", EntryType::FrameIsArtificial),
     Definition("kind", EntryType::FrameKind),
+    Definition("borrowed-info", EntryType::FrameBorrowedInfo),
 };
 
 constexpr Definition g_function_child_entries[] = {
@@ -382,6 +384,7 @@ const char *FormatEntity::Entry::TypeToCString(Type t) {
     ENUM_TO_CSTR(FrameRegisterByName);
     ENUM_TO_CSTR(FrameIsArtificial);
     ENUM_TO_CSTR(FrameKind);
+    ENUM_TO_CSTR(FrameBorrowedInfo);
     ENUM_TO_CSTR(ScriptFrame);
     ENUM_TO_CSTR(FunctionID);
     ENUM_TO_CSTR(FunctionDidChange);
@@ -1761,6 +1764,22 @@ bool FormatEntity::Format(const Entry &entry, Stream &s,
     return false;
   }
 
+  case Entry::Type::FrameBorrowedInfo: {
+    if (exe_ctx)
+      if (StackFrame *frame = exe_ctx->GetFramePtr()) {
+        if (BorrowedStackFrame *borrowed_frame =
+                llvm::dyn_cast<BorrowedStackFrame>(frame)) {
+          if (lldb::StackFrameSP borrowed_from_sp =
+                  borrowed_frame->GetBorrowedFrame()) {
+            s.Printf(" [borrowed from frame #%u]",
+                     borrowed_from_sp->GetFrameIndex());
+            return true;
+          }
+        }
+      }
+    return false;
+  }
+
   case Entry::Type::ScriptFrame:
     if (exe_ctx) {
       StackFrame *frame = exe_ctx->GetFramePtr();
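
Assuming the new entry follows the usual ${frame.<child>} naming (as
"is-artificial" and "kind" do above), a frame-format tweak along these lines
would append the borrowed-from annotation; the exact setting value is an
illustrative assumption, not part of the patch:

  (lldb) settings set frame-format "frame #${frame.index}: ${function.name-with-args}${frame.borrowed-info}\n"
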
diff --git a/lldb/source/Core/Statusline.cpp b/lldb/source/Core/Statusline.cpp
index bfbd190fba27c..922aada07e979 100644
--- a/lldb/source/Core/Statusline.cpp
+++ b/lldb/source/Core/Statusline.cpp
@@ -91,7 +91,7 @@ void Statusline::UpdateScrollWindow(ScrollWindowMode mode) {
   if (!stream_sp)
     return;
 
-  const unsigned reduced_scroll_window = m_terminal_height - 1;
+  const unsigned reduced_scroll_rows = m_terminal_height - 1;
   LockedStreamFile locked_stream = stream_sp->Lock();
 
   switch (mode) {
@@ -101,13 +101,14 @@ void Statusline::UpdateScrollWindow(ScrollWindowMode mode) {
     locked_stream.Printf(ANSI_UP_ROWS, 1);
     // Reduce the scroll window.
     locked_stream << ANSI_SAVE_CURSOR;
-    locked_stream.Printf(ANSI_SET_SCROLL_ROWS, reduced_scroll_window);
+    locked_stream.Printf(ANSI_SET_SCROLL_ROWS, reduced_scroll_rows);
     locked_stream << ANSI_RESTORE_CURSOR;
     break;
   case DisableStatusline:
     // Reset the scroll window.
     locked_stream << ANSI_SAVE_CURSOR;
-    locked_stream.Printf(ANSI_SET_SCROLL_ROWS, 0);
+    locked_stream.Printf(ANSI_SET_SCROLL_ROWS,
+                         static_cast<unsigned>(m_terminal_height));
     locked_stream << ANSI_RESTORE_CURSOR;
     // Clear the screen below to hide the old statusline.
     locked_stream << ANSI_CLEAR_BELOW;
@@ -116,7 +117,7 @@ void Statusline::UpdateScrollWindow(ScrollWindowMode mode) {
     // Clear the screen and update the scroll window.
     // FIXME: Find a better solution (#146919).
     locked_stream << ANSI_CLEAR_SCREEN;
-    locked_stream.Printf(ANSI_SET_SCROLL_ROWS, reduced_scroll_window);
+    locked_stream.Printf(ANSI_SET_SCROLL_ROWS, reduced_scroll_rows);
     break;
   }
 
diff --git a/lldb/source/Expression/DWARFExpression.cpp b/lldb/source/Expression/DWARFExpression.cpp
index 4f9d6ebf27bf0..364b2ecadadd4 100644
--- a/lldb/source/Expression/DWARFExpression.cpp
+++ b/lldb/source/Expression/DWARFExpression.cpp
@@ -861,32 +861,130 @@ ResolveLoadAddress(ExecutionContext *exe_ctx, lldb::ModuleSP &module_sp,
   return load_addr;
 }
 
-static llvm::Error Evaluate_DW_OP_deref(DWARFExpression::Stack &stack,
-                                        ExecutionContext *exe_ctx,
-                                        lldb::ModuleSP module_sp,
-                                        Process *process) {
+/// Helper function to load sized data from a uint8_t buffer.
+///
+/// \param addr_bytes The buffer containing the raw data.
+/// \param size_addr_bytes The size of the underlying raw data.
+/// \param byte_order The byte order of the underlying data.
+/// \param size How much of the underlying data to use.
+/// \return The underlying data converted into a Scalar.
+static Scalar DerefSizeExtractDataHelper(uint8_t *addr_bytes,
+                                         size_t size_addr_bytes,
+                                         ByteOrder byte_order, size_t size) {
+  DataExtractor addr_data(addr_bytes, size_addr_bytes, byte_order, size);
+
+  lldb::offset_t addr_data_offset = 0;
+  if (size <= 8)
+    return addr_data.GetMaxU64(&addr_data_offset, size);
+  return addr_data.GetAddress(&addr_data_offset);
+}
+
+static llvm::Error Evaluate_DW_OP_deref_size(
+    DWARFExpression::Stack &stack, ExecutionContext *exe_ctx,
+    lldb::ModuleSP module_sp, Process *process, Target *target, uint8_t size,
+    size_t size_addr_bytes,
+    LocationDescriptionKind &dwarf4_location_description_kind) {
   if (stack.empty())
-    return llvm::createStringError("expression stack empty for DW_OP_deref");
+    return llvm::createStringError(
+        "expression stack empty for DW_OP_deref_size");
 
-  const Value::ValueType value_type = stack.back().GetValueType();
+  if (size > 8)
+    return llvm::createStringError(
+        "Invalid address size for DW_OP_deref_size: %d\n", size);
+
+  // Deref a register or implicit location and truncate the value to `size`
+  // bytes. See the corresponding comment in DW_OP_deref for more details on
+  // why we deref these locations this way.
+  if (dwarf4_location_description_kind == Register ||
+      dwarf4_location_description_kind == Implicit) {
+    // Reset context to default values.
+    dwarf4_location_description_kind = Memory;
+    stack.back().ClearContext();
+
+    // Truncate the value on top of the stack to *size* bytes then
+    // extend to the size of an address (e.g. generic type).
+    Scalar scalar = stack.back().GetScalar();
+    scalar.TruncOrExtendTo(size * 8, /*sign=*/false);
+    scalar.TruncOrExtendTo(size_addr_bytes * 8,
+                           /*sign=*/false);
+    stack.back().GetScalar() = scalar;
+    return llvm::Error::success();
+  }
+
+  Value::ValueType value_type = stack.back().GetValueType();
   switch (value_type) {
   case Value::ValueType::HostAddress: {
     void *src = (void *)stack.back().GetScalar().ULongLong();
     intptr_t ptr;
     ::memcpy(&ptr, src, sizeof(void *));
+    // I can't decide whether the size operand should apply to the bytes in
+    // their lldb-host endianness or the target endianness. I doubt this'll
+    // ever come up, but I'll opt for assuming big endian regardless.
+    switch (size) {
+    case 1:
+      ptr = ptr & 0xff;
+      break;
+    case 2:
+      ptr = ptr & 0xffff;
+      break;
+    case 3:
+      ptr = ptr & 0xffffff;
+      break;
+    case 4:
+      ptr = ptr & 0xffffffff;
+      break;
+    // The casts are added to work around the case where intptr_t is a 32-bit
+    // quantity. Presumably we won't hit the 5..7 cases if (void*) is 32-bits in
+    // this program.
+    case 5:
+      ptr = (intptr_t)ptr & 0xffffffffffULL;
+      break;
+    case 6:
+      ptr = (intptr_t)ptr & 0xffffffffffffULL;
+      break;
+    case 7:
+      ptr = (intptr_t)ptr & 0xffffffffffffffULL;
+      break;
+    default:
+      break;
+    }
     stack.back().GetScalar() = ptr;
     stack.back().ClearContext();
   } break;
   case Value::ValueType::FileAddress: {
     auto file_addr = stack.back().GetScalar().ULongLong(LLDB_INVALID_ADDRESS);
     Address so_addr;
-    auto maybe_load_addr = ResolveLoadAddress(exe_ctx, module_sp, "DW_OP_deref",
-                                              file_addr, so_addr);
+    auto maybe_load_addr = ResolveLoadAddress(
+        exe_ctx, module_sp, "DW_OP_deref_size", file_addr, so_addr,
+        /*check_sectionoffset=*/true);
+
     if (!maybe_load_addr)
       return maybe_load_addr.takeError();
-    stack.back().GetScalar() = *maybe_load_addr;
+
+    addr_t load_addr = *maybe_load_addr;
+
+    if (load_addr == LLDB_INVALID_ADDRESS && so_addr.IsSectionOffset()) {
+      uint8_t addr_bytes[8];
+      Status error;
+
+      if (!target || target->ReadMemory(so_addr, &addr_bytes, size, error,
+                                        /*force_live_memory=*/false) != size)
+        return llvm::createStringError(
+            "failed to dereference pointer for DW_OP_deref_size: "
+            "%s\n",
+            error.AsCString());
+
+      ObjectFile *objfile = module_sp->GetObjectFile();
+
+      stack.back().GetScalar() = DerefSizeExtractDataHelper(
+          addr_bytes, size, objfile->GetByteOrder(), size);
+      stack.back().ClearContext();
+      break;
+    }
+    stack.back().GetScalar() = load_addr;
     // Fall through to load address promotion code below.
   }
+
     [[fallthrough]];
   case Value::ValueType::Scalar:
     // Promote Scalar to LoadAddress and fall through.
@@ -894,51 +992,34 @@ static llvm::Error Evaluate_DW_OP_deref(DWARFExpression::Stack &stack,
     [[fallthrough]];
   case Value::ValueType::LoadAddress: {
     if (!exe_ctx)
-      return llvm::createStringError("NULL execution context for DW_OP_deref");
+      return llvm::createStringError(
+          "no execution context for DW_OP_deref_size");
     if (!process)
-      return llvm::createStringError("NULL process for DW_OP_deref");
+      return llvm::createStringError("no process for DW_OP_deref_size");
+
     lldb::addr_t pointer_addr =
         stack.back().GetScalar().ULongLong(LLDB_INVALID_ADDRESS);
+    uint8_t addr_bytes[sizeof(lldb::addr_t)];
     Status error;
-    lldb::addr_t pointer_value =
-        process->ReadPointerFromMemory(pointer_addr, error);
-    if (pointer_value == LLDB_INVALID_ADDRESS)
-      return llvm::joinErrors(
-          llvm::createStringError(
-              "Failed to dereference pointer from 0x%" PRIx64
-              " for DW_OP_deref",
-              pointer_addr),
-          error.takeError());
-    stack.back().GetScalar() = pointer_value;
+
+    if (process->ReadMemory(pointer_addr, &addr_bytes, size, error) != size)
+      return llvm::createStringError(
+          "failed to dereference pointer from 0x%" PRIx64
+          " for DW_OP_deref_size: %s\n",
+          pointer_addr, error.AsCString());
+
+    stack.back().GetScalar() = DerefSizeExtractDataHelper(
+        addr_bytes, sizeof(addr_bytes), process->GetByteOrder(), size);
     stack.back().ClearContext();
   } break;
+
   case Value::ValueType::Invalid:
-    return llvm::createStringError("invalid value type for DW_OP_deref");
+    return llvm::createStringError("invalid value for DW_OP_deref_size");
   }
 
   return llvm::Error::success();
 }
 
-/// Helper function to move common code used to load sized data from a uint8_t
-/// buffer.
-///
-/// \param addr_bytes uint8_t buffer containg raw data
-/// \param size_addr_bytes how large is the underlying raw data
-/// \param byte_order what is the byter order of the underlyig data
-/// \param size How much of the underlying data we want to use
-/// \return The underlying data converted into a Scalar
-static Scalar DerefSizeExtractDataHelper(uint8_t *addr_bytes,
-                                         size_t size_addr_bytes,
-                                         ByteOrder byte_order, size_t size) {
-  DataExtractor addr_data(addr_bytes, size_addr_bytes, byte_order, size);
-
-  lldb::offset_t addr_data_offset = 0;
-  if (size <= 8)
-    return addr_data.GetMaxU64(&addr_data_offset, size);
-  else
-    return addr_data.GetAddress(&addr_data_offset);
-}
-
 llvm::Expected<Value> DWARFExpression::Evaluate(
     ExecutionContext *exe_ctx, RegisterContext *reg_ctx,
     lldb::ModuleSP module_sp, const DataExtractor &opcodes,
@@ -1079,8 +1160,10 @@ llvm::Expected<Value> DWARFExpression::Evaluate(
     // retrieved from the dereferenced address is the size of an address on the
     // target machine.
     case DW_OP_deref: {
-      if (llvm::Error err =
-              Evaluate_DW_OP_deref(stack, exe_ctx, module_sp, process))
+      size_t size = opcodes.GetAddressByteSize();
+      if (llvm::Error err = Evaluate_DW_OP_deref_size(
+              stack, exe_ctx, module_sp, process, target, size, size,
+              dwarf4_location_description_kind))
         return err;
     } break;
 
@@ -1097,131 +1180,11 @@ llvm::Expected<Value> DWARFExpression::Evaluate(
     // the size of an address on the target machine before being pushed on the
     // expression stack.
     case DW_OP_deref_size: {
-      if (stack.empty()) {
-        return llvm::createStringError(
-            "expression stack empty for DW_OP_deref_size");
-      }
-      uint8_t size = opcodes.GetU8(&offset);
-      if (size > 8) {
-        return llvm::createStringError(
-            "Invalid address size for DW_OP_deref_size: %d\n", size);
-      }
-      Value::ValueType value_type = stack.back().GetValueType();
-      switch (value_type) {
-      case Value::ValueType::HostAddress: {
-        void *src = (void *)stack.back().GetScalar().ULongLong();
-        intptr_t ptr;
-        ::memcpy(&ptr, src, sizeof(void *));
-        // I can't decide whether the size operand should apply to the bytes in
-        // their
-        // lldb-host endianness or the target endianness.. I doubt this'll ever
-        // come up but I'll opt for assuming big endian regardless.
-        switch (size) {
-        case 1:
-          ptr = ptr & 0xff;
-          break;
-        case 2:
-          ptr = ptr & 0xffff;
-          break;
-        case 3:
-          ptr = ptr & 0xffffff;
-          break;
-        case 4:
-          ptr = ptr & 0xffffffff;
-          break;
-        // the casts are added to work around the case where intptr_t is a 32
-        // bit quantity;
-        // presumably we won't hit the 5..7 cases if (void*) is 32-bits in this
-        // program.
-        case 5:
-          ptr = (intptr_t)ptr & 0xffffffffffULL;
-          break;
-        case 6:
-          ptr = (intptr_t)ptr & 0xffffffffffffULL;
-          break;
-        case 7:
-          ptr = (intptr_t)ptr & 0xffffffffffffffULL;
-          break;
-        default:
-          break;
-        }
-        stack.back().GetScalar() = ptr;
-        stack.back().ClearContext();
-      } break;
-      case Value::ValueType::FileAddress: {
-        auto file_addr =
-            stack.back().GetScalar().ULongLong(LLDB_INVALID_ADDRESS);
-        Address so_addr;
-        auto maybe_load_addr = ResolveLoadAddress(
-            exe_ctx, module_sp, "DW_OP_deref_size", file_addr, so_addr,
-            /*check_sectionoffset=*/true);
-
-        if (!maybe_load_addr)
-          return maybe_load_addr.takeError();
-
-        addr_t load_addr = *maybe_load_addr;
-
-        if (load_addr == LLDB_INVALID_ADDRESS && so_addr.IsSectionOffset()) {
-          uint8_t addr_bytes[8];
-          Status error;
-
-          if (target &&
-              target->ReadMemory(so_addr, &addr_bytes, size, error,
-                                 /*force_live_memory=*/false) == size) {
-            ObjectFile *objfile = module_sp->GetObjectFile();
-
-            stack.back().GetScalar() = DerefSizeExtractDataHelper(
-                addr_bytes, size, objfile->GetByteOrder(), size);
-            stack.back().ClearContext();
-            break;
-          } else {
-            return llvm::createStringError(
-                "Failed to dereference pointer for DW_OP_deref_size: "
-                "%s\n",
-                error.AsCString());
-          }
-        }
-        stack.back().GetScalar() = load_addr;
-        // Fall through to load address promotion code below.
-      }
-
-        [[fallthrough]];
-      case Value::ValueType::Scalar:
-      case Value::ValueType::LoadAddress:
-        if (exe_ctx) {
-          if (process) {
-            lldb::addr_t pointer_addr =
-                stack.back().GetScalar().ULongLong(LLDB_INVALID_ADDRESS);
-            uint8_t addr_bytes[sizeof(lldb::addr_t)];
-            Status error;
-            if (process->ReadMemory(pointer_addr, &addr_bytes, size, error) ==
-                size) {
-
-              stack.back().GetScalar() =
-                  DerefSizeExtractDataHelper(addr_bytes, sizeof(addr_bytes),
-                                             process->GetByteOrder(), size);
-              stack.back().ClearContext();
-            } else {
-              return llvm::createStringError(
-                  "Failed to dereference pointer from 0x%" PRIx64
-                  " for DW_OP_deref: %s\n",
-                  pointer_addr, error.AsCString());
-            }
-          } else {
-
-            return llvm::createStringError("NULL process for DW_OP_deref_size");
-          }
-        } else {
-          return llvm::createStringError(
-              "NULL execution context for DW_OP_deref_size");
-        }
-        break;
-
-      case Value::ValueType::Invalid:
-
-        return llvm::createStringError("invalid value for DW_OP_deref_size");
-      }
-
+      size_t size = opcodes.GetU8(&offset);
+      if (llvm::Error err = Evaluate_DW_OP_deref_size(
+              stack, exe_ctx, module_sp, process, target, size,
+              opcodes.GetAddressByteSize(), dwarf4_location_description_kind))
+        return err;
     } break;
 
     // OPCODE: DW_OP_xderef_size
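
As a standalone illustration (not code from this patch) of the size handling
that the unified Evaluate_DW_OP_deref_size path relies on: bytes read at the
dereferenced address are zero-extended to the width of the generic type, which
is what DerefSizeExtractDataHelper does for the memory cases.

  #include "lldb/Utility/DataExtractor.h"
  #include "lldb/lldb-enumerations.h"
  #include <cassert>
  #include <cstdint>

  int main() {
    // Two bytes read from the dereferenced address on a little-endian target.
    const uint8_t bytes[] = {0x34, 0x12};
    lldb_private::DataExtractor data(bytes, sizeof(bytes),
                                     lldb::eByteOrderLittle, /*addr_size=*/8);
    lldb::offset_t offset = 0;
    // DW_OP_deref_size with size == 2: the value pushed back on the stack is
    // the two bytes zero-extended, i.e. 0x1234.
    const uint64_t value = data.GetMaxU64(&offset, /*byte_size=*/2);
    assert(value == 0x1234);
    (void)value;
    return 0;
  }
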
diff --git a/lldb/source/Expression/ObjectFileJIT.cpp b/lldb/source/Expression/ObjectFileJIT.cpp
index e4a613551d22e..46ceb75fbc721 100644
--- a/lldb/source/Expression/ObjectFileJIT.cpp
+++ b/lldb/source/Expression/ObjectFileJIT.cpp
@@ -73,8 +73,8 @@ ObjectFileJIT::ObjectFileJIT(const lldb::ModuleSP &module_sp,
     : ObjectFile(module_sp, nullptr, 0, 0, DataBufferSP(), 0), m_delegate_wp() {
   if (delegate_sp) {
     m_delegate_wp = delegate_sp;
-    m_data.SetByteOrder(delegate_sp->GetByteOrder());
-    m_data.SetAddressByteSize(delegate_sp->GetAddressByteSize());
+    m_data_nsp->SetByteOrder(delegate_sp->GetByteOrder());
+    m_data_nsp->SetAddressByteSize(delegate_sp->GetAddressByteSize());
   }
 }
 
@@ -85,12 +85,14 @@ bool ObjectFileJIT::ParseHeader() {
   return false;
 }
 
-ByteOrder ObjectFileJIT::GetByteOrder() const { return m_data.GetByteOrder(); }
+ByteOrder ObjectFileJIT::GetByteOrder() const {
+  return m_data_nsp->GetByteOrder();
+}
 
 bool ObjectFileJIT::IsExecutable() const { return false; }
 
 uint32_t ObjectFileJIT::GetAddressByteSize() const {
-  return m_data.GetAddressByteSize();
+  return m_data_nsp->GetAddressByteSize();
 }
 
 void ObjectFileJIT::ParseSymtab(Symtab &symtab) {
diff --git a/lldb/source/Host/windows/ProcessLauncherWindows.cpp b/lldb/source/Host/windows/ProcessLauncherWindows.cpp
index f5adadaf061bf..2f78ef80f385e 100644
--- a/lldb/source/Host/windows/ProcessLauncherWindows.cpp
+++ b/lldb/source/Host/windows/ProcessLauncherWindows.cpp
@@ -14,6 +14,7 @@
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/Support/ConvertUTF.h"
 #include "llvm/Support/Program.h"
+#include "llvm/Support/WindowsError.h"
 
 #include <string>
 #include <vector>
@@ -21,42 +22,63 @@
 using namespace lldb;
 using namespace lldb_private;
 
-static void CreateEnvironmentBuffer(const Environment &env,
-                                    std::vector<char> &buffer) {
-  // The buffer is a list of null-terminated UTF-16 strings, followed by an
-  // extra L'\0' (two bytes of 0).  An empty environment must have one
-  // empty string, followed by an extra L'\0'.
+/// Create a UTF-16 environment block to use with CreateProcessW.
+///
+/// The buffer is a sequence of null-terminated UTF-16 strings, followed by an
+/// extra L'\0' (two bytes of 0). An empty environment must have one
+/// empty string, followed by an extra L'\0'.
+///
+/// The keys are sorted to comply with the CreateProcess API calling convention.
+///
+/// The resulting buffer must be used with CreateProcessW, and
+/// dwCreationFlags must include CREATE_UNICODE_ENVIRONMENT.
+///
+/// \param env The Environment object to convert.
+/// \returns The sorted sequence of environment variables and their values,
+/// separated by null terminators. The vector is guaranteed to never be empty.
+static std::vector<wchar_t> CreateEnvironmentBufferW(const Environment &env) {
+  std::vector<std::wstring> env_entries;
   for (const auto &KV : env) {
-    std::wstring warg;
-    if (llvm::ConvertUTF8toWide(Environment::compose(KV), warg)) {
-      buffer.insert(
-          buffer.end(), reinterpret_cast<const char *>(warg.c_str()),
-          reinterpret_cast<const char *>(warg.c_str() + warg.size() + 1));
-    }
+    std::wstring wentry;
+    if (llvm::ConvertUTF8toWide(Environment::compose(KV), wentry))
+      env_entries.push_back(std::move(wentry));
   }
-  // One null wchar_t (to end the block) is two null bytes
-  buffer.push_back(0);
-  buffer.push_back(0);
-  // Insert extra two bytes, just in case the environment was empty.
-  buffer.push_back(0);
-  buffer.push_back(0);
+  std::sort(env_entries.begin(), env_entries.end(),
+            [](const std::wstring &a, const std::wstring &b) {
+              return _wcsicmp(a.c_str(), b.c_str()) < 0;
+            });
+
+  std::vector<wchar_t> buffer;
+  for (const auto &env_entry : env_entries) {
+    buffer.insert(buffer.end(), env_entry.begin(), env_entry.end());
+    buffer.push_back(L'\0');
+  }
+
+  if (buffer.empty())
+    buffer.push_back(L'\0'); // If there are no environment variables, we have
+                             // to ensure there are 4 zero bytes in the buffer.
+  buffer.push_back(L'\0');
+
+  return buffer;
 }
 
-static bool GetFlattenedWindowsCommandString(Args args, std::wstring &command) {
+/// Flattens an Args object into a Windows command-line wide string.
+///
+/// Returns an empty string if args is empty.
+///
+/// \param args The Args object to flatten.
+/// \returns A wide string containing the flattened command line.
+static llvm::ErrorOr<std::wstring>
+GetFlattenedWindowsCommandStringW(Args args) {
   if (args.empty())
-    return false;
+    return L"";
 
   std::vector<llvm::StringRef> args_ref;
   for (auto &entry : args.entries())
     args_ref.push_back(entry.ref());
 
-  llvm::ErrorOr<std::wstring> result =
-      llvm::sys::flattenWindowsCommandLine(args_ref);
-  if (result.getError())
-    return false;
-
-  command = *result;
-  return true;
+  return llvm::sys::flattenWindowsCommandLine(args_ref);
 }
 
 HostProcess
@@ -65,11 +87,13 @@ ProcessLauncherWindows::LaunchProcess(const ProcessLaunchInfo &launch_info,
   error.Clear();
 
   std::string executable;
-  std::vector<char> environment;
-  STARTUPINFOEX startupinfoex = {};
-  STARTUPINFO &startupinfo = startupinfoex.StartupInfo;
+  STARTUPINFOEXW startupinfoex = {};
+  STARTUPINFOW &startupinfo = startupinfoex.StartupInfo;
   PROCESS_INFORMATION pi = {};
 
+  startupinfo.cb = sizeof(startupinfoex);
+  startupinfo.dwFlags |= STARTF_USESTDHANDLES;
+
   HANDLE stdin_handle = GetStdioHandle(launch_info, STDIN_FILENO);
   HANDLE stdout_handle = GetStdioHandle(launch_info, STDOUT_FILENO);
   HANDLE stderr_handle = GetStdioHandle(launch_info, STDERR_FILENO);
@@ -82,23 +106,6 @@ ProcessLauncherWindows::LaunchProcess(const ProcessLaunchInfo &launch_info,
       ::CloseHandle(stderr_handle);
   });
 
-  startupinfo.cb = sizeof(startupinfoex);
-  startupinfo.dwFlags |= STARTF_USESTDHANDLES;
-  startupinfo.hStdError =
-      stderr_handle ? stderr_handle : ::GetStdHandle(STD_ERROR_HANDLE);
-  startupinfo.hStdInput =
-      stdin_handle ? stdin_handle : ::GetStdHandle(STD_INPUT_HANDLE);
-  startupinfo.hStdOutput =
-      stdout_handle ? stdout_handle : ::GetStdHandle(STD_OUTPUT_HANDLE);
-
-  std::vector<HANDLE> inherited_handles;
-  if (startupinfo.hStdError)
-    inherited_handles.push_back(startupinfo.hStdError);
-  if (startupinfo.hStdInput)
-    inherited_handles.push_back(startupinfo.hStdInput);
-  if (startupinfo.hStdOutput)
-    inherited_handles.push_back(startupinfo.hStdOutput);
-
   SIZE_T attributelist_size = 0;
   InitializeProcThreadAttributeList(/*lpAttributeList=*/nullptr,
                                     /*dwAttributeCount=*/1, /*dwFlags=*/0,
@@ -116,22 +123,14 @@ ProcessLauncherWindows::LaunchProcess(const ProcessLaunchInfo &launch_info,
   }
   auto delete_attributelist = llvm::make_scope_exit(
       [&] { DeleteProcThreadAttributeList(startupinfoex.lpAttributeList); });
-  for (size_t i = 0; i < launch_info.GetNumFileActions(); ++i) {
-    const FileAction *act = launch_info.GetFileActionAtIndex(i);
-    if (act->GetAction() == FileAction::eFileActionDuplicate &&
-        act->GetFD() == act->GetActionArgument())
-      inherited_handles.push_back(reinterpret_cast<HANDLE>(act->GetFD()));
-  }
-  if (!inherited_handles.empty()) {
-    if (!UpdateProcThreadAttribute(
-            startupinfoex.lpAttributeList, /*dwFlags=*/0,
-            PROC_THREAD_ATTRIBUTE_HANDLE_LIST, inherited_handles.data(),
-            inherited_handles.size() * sizeof(HANDLE),
-            /*lpPreviousValue=*/nullptr, /*lpReturnSize=*/nullptr)) {
-      error = Status(::GetLastError(), eErrorTypeWin32);
-      return HostProcess();
-    }
+
+  auto inherited_handles_or_err = GetInheritedHandles(
+      launch_info, startupinfoex, stdout_handle, stderr_handle, stdin_handle);
+  if (!inherited_handles_or_err) {
+    error = Status(inherited_handles_or_err.getError());
+    return HostProcess();
   }
+  std::vector<HANDLE> inherited_handles = *inherited_handles_or_err;
 
   const char *hide_console_var =
       getenv("LLDB_LAUNCH_INFERIORS_WITHOUT_CONSOLE");
@@ -149,28 +148,32 @@ ProcessLauncherWindows::LaunchProcess(const ProcessLaunchInfo &launch_info,
   if (launch_info.GetFlags().Test(eLaunchFlagDisableSTDIO))
     flags &= ~CREATE_NEW_CONSOLE;
 
-  LPVOID env_block = nullptr;
-  ::CreateEnvironmentBuffer(launch_info.GetEnvironment(), environment);
-  env_block = environment.data();
+  std::vector<wchar_t> environment =
+      CreateEnvironmentBufferW(launch_info.GetEnvironment());
 
-  executable = launch_info.GetExecutableFile().GetPath();
-  std::wstring wcommandLine;
-  GetFlattenedWindowsCommandString(launch_info.GetArguments(), wcommandLine);
-
-  std::wstring wexecutable, wworkingDirectory;
-  llvm::ConvertUTF8toWide(executable, wexecutable);
-  llvm::ConvertUTF8toWide(launch_info.GetWorkingDirectory().GetPath(),
-                          wworkingDirectory);
+  auto wcommandLineOrErr =
+      GetFlattenedWindowsCommandStringW(launch_info.GetArguments());
+  if (!wcommandLineOrErr) {
+    error = Status(wcommandLineOrErr.getError());
+    return HostProcess();
+  }
+  std::wstring wcommandLine = *wcommandLineOrErr;
   // If the command line is empty, it's best to pass a null pointer to tell
   // CreateProcessW to use the executable name as the command line.  If the
   // command line is not empty, its contents may be modified by CreateProcessW.
   WCHAR *pwcommandLine = wcommandLine.empty() ? nullptr : &wcommandLine[0];
 
+  std::wstring wexecutable, wworkingDirectory;
+  llvm::ConvertUTF8toWide(launch_info.GetExecutableFile().GetPath(),
+                          wexecutable);
+  llvm::ConvertUTF8toWide(launch_info.GetWorkingDirectory().GetPath(),
+                          wworkingDirectory);
+
   BOOL result = ::CreateProcessW(
       wexecutable.c_str(), pwcommandLine, NULL, NULL,
-      /*bInheritHandles=*/!inherited_handles.empty(), flags, env_block,
+      /*bInheritHandles=*/!inherited_handles.empty(), flags, environment.data(),
       wworkingDirectory.size() == 0 ? NULL : wworkingDirectory.c_str(),
-      reinterpret_cast<STARTUPINFO *>(&startupinfoex), &pi);
+      reinterpret_cast<STARTUPINFOW *>(&startupinfoex), &pi);
 
   if (!result) {
     // Call GetLastError before we make any other system calls.
@@ -191,6 +194,46 @@ ProcessLauncherWindows::LaunchProcess(const ProcessLaunchInfo &launch_info,
   return HostProcess(pi.hProcess);
 }
 
+llvm::ErrorOr<std::vector<HANDLE>> ProcessLauncherWindows::GetInheritedHandles(
+    const ProcessLaunchInfo &launch_info, STARTUPINFOEXW &startupinfoex,
+    HANDLE stdout_handle, HANDLE stderr_handle, HANDLE stdin_handle) {
+  std::vector<HANDLE> inherited_handles;
+  STARTUPINFOW &startupinfo = startupinfoex.StartupInfo;
+
+  startupinfo.hStdError =
+      stderr_handle ? stderr_handle : GetStdHandle(STD_ERROR_HANDLE);
+  startupinfo.hStdInput =
+      stdin_handle ? stdin_handle : GetStdHandle(STD_INPUT_HANDLE);
+  startupinfo.hStdOutput =
+      stdout_handle ? stdout_handle : GetStdHandle(STD_OUTPUT_HANDLE);
+
+  if (startupinfo.hStdError)
+    inherited_handles.push_back(startupinfo.hStdError);
+  if (startupinfo.hStdInput)
+    inherited_handles.push_back(startupinfo.hStdInput);
+  if (startupinfo.hStdOutput)
+    inherited_handles.push_back(startupinfo.hStdOutput);
+
+  for (size_t i = 0; i < launch_info.GetNumFileActions(); ++i) {
+    const FileAction *act = launch_info.GetFileActionAtIndex(i);
+    if (act->GetAction() == FileAction::eFileActionDuplicate &&
+        act->GetFD() == act->GetActionArgument())
+      inherited_handles.push_back(reinterpret_cast<HANDLE>(act->GetFD()));
+  }
+
+  if (inherited_handles.empty())
+    return inherited_handles;
+
+  if (!UpdateProcThreadAttribute(
+          startupinfoex.lpAttributeList, /*dwFlags=*/0,
+          PROC_THREAD_ATTRIBUTE_HANDLE_LIST, inherited_handles.data(),
+          inherited_handles.size() * sizeof(HANDLE),
+          /*lpPreviousValue=*/nullptr, /*lpReturnSize=*/nullptr))
+    return llvm::mapWindowsError(::GetLastError());
+
+  return inherited_handles;
+}
+
 HANDLE
 ProcessLauncherWindows::GetStdioHandle(const ProcessLaunchInfo &launch_info,
                                        int fd) {
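
For reference, a sketch of the block layout CreateEnvironmentBufferW is
expected to produce; the variable names and values are illustrative:

  // Input entries (any order):   {"TMP=C:\\tmp", "PATH=C:\\bin"}
  // Sorted case-insensitively:   PATH=C:\bin, TMP=C:\tmp
  // Resulting wchar_t sequence:
  //   L'P','A','T','H','=','C',':','\\','b','i','n','\0',
  //   L'T','M','P','=','C',':','\\','t','m','p','\0',
  //   L'\0'
  // An empty environment still yields two L'\0' elements (four bytes), the
  // minimal block CreateProcessW accepts with CREATE_UNICODE_ENVIRONMENT.
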
diff --git a/lldb/source/Interpreter/ScriptInterpreter.cpp b/lldb/source/Interpreter/ScriptInterpreter.cpp
index d2fd372bfe9e3..7bad10ff3ea61 100644
--- a/lldb/source/Interpreter/ScriptInterpreter.cpp
+++ b/lldb/source/Interpreter/ScriptInterpreter.cpp
@@ -106,6 +106,13 @@ ScriptInterpreter::GetStatusFromSBError(const lldb::SBError &error) const {
   return Status();
 }
 
+lldb::ThreadSP ScriptInterpreter::GetOpaqueTypeFromSBThread(
+    const lldb::SBThread &thread) const {
+  if (thread.m_opaque_sp)
+    return thread.m_opaque_sp->GetThreadSP();
+  return nullptr;
+}
+
 lldb::StackFrameSP
 ScriptInterpreter::GetOpaqueTypeFromSBFrame(const lldb::SBFrame &frame) const {
   if (frame.m_opaque_sp)
diff --git a/lldb/source/Plugins/CMakeLists.txt b/lldb/source/Plugins/CMakeLists.txt
index 08f444e7b15e8..b6878b21ff71a 100644
--- a/lldb/source/Plugins/CMakeLists.txt
+++ b/lldb/source/Plugins/CMakeLists.txt
@@ -22,6 +22,7 @@ add_subdirectory(SymbolFile)
 add_subdirectory(SystemRuntime)
 add_subdirectory(SymbolLocator)
 add_subdirectory(SymbolVendor)
+add_subdirectory(SyntheticFrameProvider)
 add_subdirectory(Trace)
 add_subdirectory(TraceExporter)
 add_subdirectory(TypeSystem)
diff --git a/lldb/source/Plugins/Disassembler/LLVMC/DisassemblerLLVMC.cpp b/lldb/source/Plugins/Disassembler/LLVMC/DisassemblerLLVMC.cpp
index 66d0a50985be7..e8bb706f7aab6 100644
--- a/lldb/source/Plugins/Disassembler/LLVMC/DisassemblerLLVMC.cpp
+++ b/lldb/source/Plugins/Disassembler/LLVMC/DisassemblerLLVMC.cpp
@@ -70,6 +70,7 @@ class DisassemblerLLVMC::MCDisasmInstance {
   bool HasDelaySlot(llvm::MCInst &mc_inst) const;
   bool IsCall(llvm::MCInst &mc_inst) const;
   bool IsLoad(llvm::MCInst &mc_inst) const;
+  bool IsBarrier(llvm::MCInst &mc_inst) const;
   bool IsAuthenticated(llvm::MCInst &mc_inst) const;
 
 private:
@@ -436,6 +437,11 @@ class InstructionLLVMC : public lldb_private::Instruction {
     return m_is_load;
   }
 
+  bool IsBarrier() override {
+    VisitInstruction();
+    return m_is_barrier;
+  }
+
   bool IsAuthenticated() override {
     VisitInstruction();
     return m_is_authenticated;
@@ -1195,6 +1201,7 @@ class InstructionLLVMC : public lldb_private::Instruction {
   bool m_is_call = false;
   bool m_is_load = false;
   bool m_is_authenticated = false;
+  bool m_is_barrier = false;
 
   void VisitInstruction() {
     if (m_has_visited_instruction)
@@ -1227,6 +1234,7 @@ class InstructionLLVMC : public lldb_private::Instruction {
     m_is_call = mc_disasm_ptr->IsCall(inst);
     m_is_load = mc_disasm_ptr->IsLoad(inst);
     m_is_authenticated = mc_disasm_ptr->IsAuthenticated(inst);
+    m_is_barrier = mc_disasm_ptr->IsBarrier(inst);
   }
 
 private:
@@ -1432,6 +1440,11 @@ bool DisassemblerLLVMC::MCDisasmInstance::IsLoad(llvm::MCInst &mc_inst) const {
   return m_instr_info_up->get(mc_inst.getOpcode()).mayLoad();
 }
 
+bool DisassemblerLLVMC::MCDisasmInstance::IsBarrier(
+    llvm::MCInst &mc_inst) const {
+  return m_instr_info_up->get(mc_inst.getOpcode()).isBarrier();
+}
+
 bool DisassemblerLLVMC::MCDisasmInstance::IsAuthenticated(
     llvm::MCInst &mc_inst) const {
   const auto &InstrDesc = m_instr_info_up->get(mc_inst.getOpcode());
diff --git a/lldb/source/Plugins/InstrumentationRuntime/BoundsSafety/CMakeLists.txt b/lldb/source/Plugins/InstrumentationRuntime/BoundsSafety/CMakeLists.txt
new file mode 100644
index 0000000000000..adbd6c45e45af
--- /dev/null
+++ b/lldb/source/Plugins/InstrumentationRuntime/BoundsSafety/CMakeLists.txt
@@ -0,0 +1,13 @@
+add_lldb_library(lldbPluginInstrumentationRuntimeBoundsSafety PLUGIN
+  InstrumentationRuntimeBoundsSafety.cpp
+
+  LINK_LIBS
+    lldbBreakpoint
+    lldbCore
+    lldbSymbol
+    lldbTarget
+    lldbPluginInstrumentationRuntimeUtility
+
+  CLANG_LIBS
+    clangCodeGen
+  )
diff --git a/lldb/source/Plugins/InstrumentationRuntime/BoundsSafety/InstrumentationRuntimeBoundsSafety.cpp b/lldb/source/Plugins/InstrumentationRuntime/BoundsSafety/InstrumentationRuntimeBoundsSafety.cpp
new file mode 100644
index 0000000000000..db9b21305e938
--- /dev/null
+++ b/lldb/source/Plugins/InstrumentationRuntime/BoundsSafety/InstrumentationRuntimeBoundsSafety.cpp
@@ -0,0 +1,481 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "InstrumentationRuntimeBoundsSafety.h"
+
+#include "Plugins/Process/Utility/HistoryThread.h"
+#include "lldb/Breakpoint/StoppointCallbackContext.h"
+#include "lldb/Core/Debugger.h"
+#include "lldb/Core/Module.h"
+#include "lldb/Core/PluginManager.h"
+#include "lldb/Symbol/Block.h"
+#include "lldb/Symbol/Symbol.h"
+#include "lldb/Symbol/SymbolContext.h"
+#include "lldb/Symbol/Variable.h"
+#include "lldb/Symbol/VariableList.h"
+#include "lldb/Target/InstrumentationRuntimeStopInfo.h"
+#include "lldb/Target/RegisterContext.h"
+#include "lldb/Target/SectionLoadList.h"
+#include "lldb/Target/StopInfo.h"
+#include "lldb/Target/Target.h"
+#include "lldb/Target/Thread.h"
+#include "lldb/Utility/RegisterValue.h"
+#include "lldb/Utility/RegularExpression.h"
+#include "clang/CodeGen/ModuleBuilder.h"
+
+#include <memory>
+#include <type_traits>
+
+using namespace lldb;
+using namespace lldb_private;
+
+LLDB_PLUGIN_DEFINE(InstrumentationRuntimeBoundsSafety)
+
+constexpr llvm::StringLiteral
+    BoundsSafetySoftTrapMinimal("__bounds_safety_soft_trap");
+constexpr llvm::StringLiteral
+    BoundsSafetySoftTrapStr("__bounds_safety_soft_trap_s");
+
+constexpr std::array<llvm::StringLiteral, 2>
+getBoundsSafetySoftTrapRuntimeFuncs() {
+  return {BoundsSafetySoftTrapMinimal, BoundsSafetySoftTrapStr};
+}
+
+#define SOFT_TRAP_CATEGORY_PREFIX "Soft "
+#define SOFT_TRAP_FALLBACK_CATEGORY                                            \
+  SOFT_TRAP_CATEGORY_PREFIX "Bounds check failed"
+
+using ComputedStopInfo =
+    std::pair<std::optional<std::string>, std::optional<uint32_t>>;
+
+class InstrumentationBoundsSafetyStopInfo : public StopInfo {
+public:
+  ~InstrumentationBoundsSafetyStopInfo() override = default;
+
+  lldb::StopReason GetStopReason() const override {
+    return lldb::eStopReasonInstrumentation;
+  }
+
+  std::optional<uint32_t>
+  GetSuggestedStackFrameIndex(bool inlined_stack) override {
+    return m_value;
+  }
+
+  const char *GetDescription() override { return m_description.c_str(); }
+
+  bool DoShouldNotify(Event *event_ptr) override { return true; }
+
+  static lldb::StopInfoSP
+  CreateInstrumentationBoundsSafetyStopInfo(Thread &thread) {
+    return StopInfoSP(new InstrumentationBoundsSafetyStopInfo(thread));
+  }
+
+private:
+  InstrumentationBoundsSafetyStopInfo(Thread &thread);
+
+  ComputedStopInfo
+  ComputeStopReasonAndSuggestedStackFrame(bool &warning_emitted_for_failure);
+
+  ComputedStopInfo ComputeStopReasonAndSuggestedStackFrameWithDebugInfo(
+      lldb::StackFrameSP parent_sf, lldb::user_id_t debugger_id,
+      bool &warning_emitted_for_failure);
+
+  ComputedStopInfo ComputeStopReasonAndSuggestedStackFrameWithoutDebugInfo(
+      ThreadSP thread_sp, lldb::user_id_t debugger_id,
+      bool &warning_emitted_for_failure);
+};
+
+InstrumentationBoundsSafetyStopInfo::InstrumentationBoundsSafetyStopInfo(
+    Thread &thread)
+    : StopInfo(thread, 0) {
+  // No additional data describing the reason for stopping.
+  m_extended_info = nullptr;
+  m_description = SOFT_TRAP_FALLBACK_CATEGORY;
+
+  bool warning_emitted_for_failure = false;
+  auto [MaybeDescription, MaybeSuggestedStackIndex] =
+      ComputeStopReasonAndSuggestedStackFrame(warning_emitted_for_failure);
+  if (MaybeDescription)
+    m_description = MaybeDescription.value();
+  if (MaybeSuggestedStackIndex)
+    m_value = MaybeSuggestedStackIndex.value();
+
+  // Emit warning about the failure to compute the stop info if one wasn't
+  // already emitted.
+  if ((!MaybeDescription.has_value()) && !warning_emitted_for_failure) {
+    if (ThreadSP thread_sp = GetThread()) {
+      lldb::user_id_t debugger_id =
+          thread_sp->GetProcess()->GetTarget().GetDebugger().GetID();
+      Debugger::ReportWarning(
+          "specific BoundsSafety trap reason could not be computed",
+          debugger_id);
+    }
+  }
+}
+
+// Helper functions to make it convenient to log a failure and then return.
+template <typename T, typename... ArgTys>
+[[nodiscard]] T LogBeforeReturn(ArgTys &&...Args) {
+  LLDB_LOG(GetLog(LLDBLog::InstrumentationRuntime), Args...);
+  return T();
+}
+
+template <typename... ArgTys>
+[[nodiscard]] ComputedStopInfo LogFailedCSI(ArgTys &&...Args) {
+  return LogBeforeReturn<ComputedStopInfo>(Args...);
+}
+
+ComputedStopInfo
+InstrumentationBoundsSafetyStopInfo::ComputeStopReasonAndSuggestedStackFrame(
+    bool &warning_emitted_for_failure) {
+  ThreadSP thread_sp = GetThread();
+  if (!thread_sp)
+    return LogFailedCSI("failed to get thread while stopped");
+
+  lldb::user_id_t debugger_id =
+      thread_sp->GetProcess()->GetTarget().GetDebugger().GetID();
+
+  StackFrameSP parent_sf = thread_sp->GetStackFrameAtIndex(1);
+  if (!parent_sf)
+    return LogFailedCSI("got nullptr when fetching stackframe at index 1");
+
+  if (parent_sf->HasDebugInformation())
+    return ComputeStopReasonAndSuggestedStackFrameWithDebugInfo(
+        parent_sf, debugger_id, warning_emitted_for_failure);
+
+  // If the debug info is missing we can still get some information
+  // from the parameter in the soft trap runtime call.
+  return ComputeStopReasonAndSuggestedStackFrameWithoutDebugInfo(
+      thread_sp, debugger_id, warning_emitted_for_failure);
+}
+
+ComputedStopInfo InstrumentationBoundsSafetyStopInfo::
+    ComputeStopReasonAndSuggestedStackFrameWithDebugInfo(
+        lldb::StackFrameSP parent_sf, lldb::user_id_t debugger_id,
+        bool &warning_emitted_for_failure) {
+  // First try to use debug info to understand the reason for trapping. The
+  // call stack will look something like this:
+  //
+  // ```
+  // frame #0: `__bounds_safety_soft_trap_s(reason="")
+  // frame #1: `__clang_trap_msg$Bounds check failed$<reason>'
+  // frame #2: `bad_read(index=10)
+  // ```
+  // ....
+  const char *TrapReasonFuncName = parent_sf->GetFunctionName();
+
+  auto MaybeTrapReason =
+      clang::CodeGen::DemangleTrapReasonInDebugInfo(TrapReasonFuncName);
+  if (!MaybeTrapReason.has_value())
+    return LogFailedCSI(
+        "clang::CodeGen::DemangleTrapReasonInDebugInfo(\"{0}\") call failed",
+        TrapReasonFuncName);
+
+  llvm::StringRef category = MaybeTrapReason.value().first;
+  llvm::StringRef message = MaybeTrapReason.value().second;
+
+  // TODO: Clang should probably be changed to emit the "Soft " prefix itself
+  std::string stop_reason;
+  llvm::raw_string_ostream ss(stop_reason);
+  ss << SOFT_TRAP_CATEGORY_PREFIX;
+  if (category.empty())
+    ss << "<empty category>";
+  else
+    ss << category;
+  if (message.empty()) {
+    // This is not a failure so leave `warning_emitted_for_failure` untouched.
+    Debugger::ReportWarning(
+        "specific BoundsSafety trap reason is not "
+        "available because the compiler omitted it from the debug info",
+        debugger_id);
+  } else {
+    ss << ": " << message;
+  }
+  // Use the computed stop reason and assume the parent of `parent_sf` is the
+  // place in the user's code where the call to the soft trap runtime
+  // originated.
+  return std::make_pair(stop_reason, parent_sf->GetFrameIndex() + 1);
+}
+
+ComputedStopInfo InstrumentationBoundsSafetyStopInfo::
+    ComputeStopReasonAndSuggestedStackFrameWithoutDebugInfo(
+        ThreadSP thread_sp, lldb::user_id_t debugger_id,
+        bool &warning_emitted_for_failure) {
+
+  StackFrameSP softtrap_sf = thread_sp->GetStackFrameAtIndex(0);
+  if (!softtrap_sf)
+    return LogFailedCSI("got nullptr when fetching stackframe at index 0");
+  llvm::StringRef trap_reason_func_name = softtrap_sf->GetFunctionName();
+
+  if (trap_reason_func_name == BoundsSafetySoftTrapMinimal) {
+    // This function has no arguments so there's no additional information
+    // that would allow us to identify the trap reason.
+    //
+    // Use the fallback stop reason and the current frame.
+    // While we "could" set the suggested frame to our parent (where the
+    // bounds check failed), doing this leads to very misleading output in
+    // LLDB. E.g.:
+    //
+    // ```
+    //     0x100003b40 <+104>: bl  0x100003d64    ; __bounds_safety_soft_trap
+    // ->  0x100003b44 <+108>: b   0x100003b48    ; <+112>
+    // ```
+    //
+    // This makes it look like we stopped after finishing the call to
+    // `__bounds_safety_soft_trap`, but actually we are in the middle of the
+    // call. To avoid this confusion, just use the current frame.
+    std::string warning;
+    llvm::raw_string_ostream ss(warning);
+    ss << "specific BoundsSafety trap reason is not available because debug "
+          "info is missing on the caller of '"
+       << BoundsSafetySoftTrapMinimal << "'";
+    Debugger::ReportWarning(warning.c_str(), debugger_id);
+    warning_emitted_for_failure = true;
+    return {};
+  }
+
+  // __bounds_safety_soft_trap_s has one argument, which is either a pointer
+  // to a string describing the trap or a nullptr.
+  if (trap_reason_func_name != BoundsSafetySoftTrapStr) {
+    assert(0 && "hit breakpoint for unexpected function name");
+    return LogFailedCSI(
+        "unexpected function name. Expected \"{0}\" but got \"{1}\"",
+        BoundsSafetySoftTrapStr.data(), trap_reason_func_name.data());
+  }
+
+  RegisterContextSP rc = thread_sp->GetRegisterContext();
+  if (!rc)
+    return LogFailedCSI("failed to get register context");
+
+  // FIXME: LLDB should have an API that tells us for the current target if
+  // `LLDB_REGNUM_GENERIC_ARG1` can be used.
+  // https://github.com/llvm/llvm-project/issues/168602
+  // Don't try for architectures where examining the first register won't
+  // work.
+  ProcessSP process = thread_sp->GetProcess();
+  if (!process)
+    return LogFailedCSI("failed to get process");
+
+  switch (process->GetTarget().GetArchitecture().GetCore()) {
+  case ArchSpec::eCore_x86_32_i386:
+  case ArchSpec::eCore_x86_32_i486:
+  case ArchSpec::eCore_x86_32_i486sx:
+  case ArchSpec::eCore_x86_32_i686: {
+    // Technically some x86 calling conventions do use a register for
+    // passing the first argument but let's ignore that for now.
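+    // (With the default cdecl-style conventions the first argument is passed
+    // on the stack, so LLDB_REGNUM_GENERIC_ARG1 would not map to a useful
+    // register here.)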
+    std::string warning;
+    llvm::raw_string_ostream ss(warning);
+    ss << "specific BoundsSafety trap reason cannot be inferred on x86 when "
+          "the caller of '"
+       << BoundsSafetySoftTrapStr << "' is missing debug info";
+    Debugger::ReportWarning(warning.c_str(), debugger_id);
+    warning_emitted_for_failure = true;
+    return {};
+  }
+  default:
+    break;
+  }
+
+  // Examine the register for the first argument.
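+  // LLDB_REGNUM_GENERIC_ARG1 resolves to the ABI's first integer-argument
+  // register on targets that pass arguments in registers (e.g. x0 on AArch64).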
+  const RegisterInfo *arg0_info = rc->GetRegisterInfo(
+      lldb::RegisterKind::eRegisterKindGeneric, LLDB_REGNUM_GENERIC_ARG1);
+  if (!arg0_info)
+    return LogFailedCSI(
+        "failed to get register info for LLDB_REGNUM_GENERIC_ARG1");
+  RegisterValue reg_value;
+  if (!rc->ReadRegister(arg0_info, reg_value))
+    return LogFailedCSI("failed to read register {0}", arg0_info->name);
+  uint64_t reg_value_as_int = reg_value.GetAsUInt64(UINT64_MAX);
+  if (reg_value_as_int == UINT64_MAX)
+    return LogFailedCSI("failed to read register {0} as a UInt64",
+                        arg0_info->name);
+
+  if (reg_value_as_int == 0) {
+    // nullptr arg. The compiler will pass that if no trap reason string was
+    // available.
+    Debugger::ReportWarning(
+        "specific BoundsSafety trap reason cannot be inferred because the "
+        "compiler omitted the reason",
+        debugger_id);
+    warning_emitted_for_failure = true;
+    return {};
+  }
+
+  // The first argument to the call is a pointer to a global C string
+  // containing the trap reason.
+  std::string out_string;
+  Status error_status;
+  thread_sp->GetProcess()->ReadCStringFromMemory(reg_value_as_int, out_string,
+                                                 error_status);
+  if (error_status.Fail())
+    return LogFailedCSI("failed to read C string from address {0}",
+                        (void *)reg_value_as_int);
+
+  LLDB_LOG(GetLog(LLDBLog::InstrumentationRuntime),
+           "read C string from {0} found in register {1}: \"{2}\"",
+           (void *)reg_value_as_int, arg0_info->name, out_string.c_str());
+  std::string stop_reason;
+  llvm::raw_string_ostream ss(stop_reason);
+  ss << SOFT_TRAP_FALLBACK_CATEGORY;
+  if (!out_string.empty()) {
+    ss << ": " << out_string;
+  }
+  // Use the current frame as the suggested frame for the same reason as for
+  // `__bounds_safety_soft_trap`.
+  return {stop_reason, 0};
+}
+
+InstrumentationRuntimeBoundsSafety::~InstrumentationRuntimeBoundsSafety() {
+  Deactivate();
+}
+
+lldb::InstrumentationRuntimeSP
+InstrumentationRuntimeBoundsSafety::CreateInstance(
+    const lldb::ProcessSP &process_sp) {
+  return InstrumentationRuntimeSP(
+      new InstrumentationRuntimeBoundsSafety(process_sp));
+}
+
+void InstrumentationRuntimeBoundsSafety::Initialize() {
+  PluginManager::RegisterPlugin(GetPluginNameStatic(),
+                                "BoundsSafety instrumentation runtime plugin.",
+                                CreateInstance, GetTypeStatic);
+}
+
+void InstrumentationRuntimeBoundsSafety::Terminate() {
+  PluginManager::UnregisterPlugin(CreateInstance);
+}
+
+lldb::InstrumentationRuntimeType
+InstrumentationRuntimeBoundsSafety::GetTypeStatic() {
+  return lldb::eInstrumentationRuntimeTypeBoundsSafety;
+}
+
+const RegularExpression &
+InstrumentationRuntimeBoundsSafety::GetPatternForRuntimeLibrary() {
+  static RegularExpression regex;
+  return regex;
+}
+
+bool InstrumentationRuntimeBoundsSafety::CheckIfRuntimeIsValid(
+    const lldb::ModuleSP module_sp) {
+  Log *log_category = GetLog(LLDBLog::InstrumentationRuntime);
+  for (const auto &SoftTrapFunc : getBoundsSafetySoftTrapRuntimeFuncs()) {
+    ConstString test_sym(SoftTrapFunc);
+
+    if (module_sp->FindFirstSymbolWithNameAndType(test_sym,
+                                                  lldb::eSymbolTypeAny)) {
+      LLDB_LOG(log_category, "found \"{0}\" in {1}",
+               test_sym.AsCString("<unknown symbol>"),
+               module_sp->GetObjectName().AsCString("<unknown module>"));
+      return true;
+    }
+  }
+  LLDB_LOG(log_category,
+           "did not find BoundsSafety soft trap functions in module {0}",
+           module_sp->GetObjectName().AsCString("<unknown module>"));
+  return false;
+}
+
+bool InstrumentationRuntimeBoundsSafety::NotifyBreakpointHit(
+    void *baton, StoppointCallbackContext *context, user_id_t break_id,
+    user_id_t break_loc_id) {
+  assert(baton && "null baton");
+  if (!baton)
+    return false; // false => resume execution.
+
+  InstrumentationRuntimeBoundsSafety *const instance =
+      static_cast<InstrumentationRuntimeBoundsSafety *>(baton);
+
+  ProcessSP process_sp = instance->GetProcessSP();
+  if (!process_sp)
+    return LogBeforeReturn<bool>("failed to get process from baton");
+  ThreadSP thread_sp = context->exe_ctx_ref.GetThreadSP();
+  if (!thread_sp)
+    return LogBeforeReturn<bool>(
+        "failed to get thread from StoppointCallbackContext");
+
+  if (process_sp != context->exe_ctx_ref.GetProcessSP())
+    return LogBeforeReturn<bool>(
+        "process from baton ({0}) and StoppointCallbackContext ({1}) do "
+        "not match",
+        (void *)process_sp.get(),
+        (void *)context->exe_ctx_ref.GetProcessSP().get());
+
+  if (process_sp->GetModIDRef().IsLastResumeForUserExpression())
+    return LogBeforeReturn<bool>("IsLastResumeForUserExpression is true");
+
+  // Maybe the stop reason and stackframe selection should be done by
+  // a stackframe recognizer instead?
+  thread_sp->SetStopInfo(
+      InstrumentationBoundsSafetyStopInfo::
+          CreateInstrumentationBoundsSafetyStopInfo(*thread_sp));
+  return true;
+}
+
+void InstrumentationRuntimeBoundsSafety::Activate() {
+  if (IsActive())
+    return;
+
+  ProcessSP process_sp = GetProcessSP();
+  if (!process_sp)
+    return LogBeforeReturn<void>("could not get process during Activate()");
+
+  std::vector<std::string> breakpoints;
+  for (auto &breakpoint_func : getBoundsSafetySoftTrapRuntimeFuncs())
+    breakpoints.emplace_back(breakpoint_func);
+
+  BreakpointSP breakpoint = process_sp->GetTarget().CreateBreakpoint(
+      /*containingModules=*/nullptr,
+      /*containingSourceFiles=*/nullptr, breakpoints, eFunctionNameTypeFull,
+      eLanguageTypeUnknown,
+      /*m_offset=*/0,
+      /*skip_prologue*/ eLazyBoolNo,
+      /*internal=*/true,
+      /*request_hardware*/ false);
+
+  if (!breakpoint)
+    return LogBeforeReturn<void>("failed to create breakpoint");
+
+  if (!breakpoint->HasResolvedLocations()) {
+    assert(0 && "breakpoint has no resolved locations");
+    process_sp->GetTarget().RemoveBreakpointByID(breakpoint->GetID());
+    return LogBeforeReturn<void>(
+        "breakpoint {0} for BoundsSafety soft traps did not resolve to "
+        "any locations",
+        breakpoint->GetID());
+  }
+
+  // Note: When `sync=true` the suggested stackframe is completely ignored. So
+  // we use `sync=false`. Is that a bug?
+  breakpoint->SetCallback(
+      InstrumentationRuntimeBoundsSafety::NotifyBreakpointHit, this,
+      /*sync=*/false);
+  breakpoint->SetBreakpointKind("bounds-safety-soft-trap");
+  SetBreakpointID(breakpoint->GetID());
+  LLDB_LOG(GetLog(LLDBLog::InstrumentationRuntime),
+           "created breakpoint {0} for BoundsSafety soft traps",
+           breakpoint->GetID());
+  SetActive(true);
+}
+
+void InstrumentationRuntimeBoundsSafety::Deactivate() {
+  SetActive(false);
+  Log *log_category = GetLog(LLDBLog::InstrumentationRuntime);
+  if (ProcessSP process_sp = GetProcessSP()) {
+    bool success =
+        process_sp->GetTarget().RemoveBreakpointByID(GetBreakpointID());
+    LLDB_LOG(log_category,
+             "{0}removed breakpoint {1} for BoundsSafety soft traps",
+             success ? "" : "failed to ", GetBreakpointID());
+  } else {
+    LLDB_LOG(log_category, "no process available during Deactivate()");
+  }
+
+  SetBreakpointID(LLDB_INVALID_BREAK_ID);
+}
diff --git a/lldb/source/Plugins/InstrumentationRuntime/BoundsSafety/InstrumentationRuntimeBoundsSafety.h b/lldb/source/Plugins/InstrumentationRuntime/BoundsSafety/InstrumentationRuntimeBoundsSafety.h
new file mode 100644
index 0000000000000..06c30f8febca8
--- /dev/null
+++ b/lldb/source/Plugins/InstrumentationRuntime/BoundsSafety/InstrumentationRuntimeBoundsSafety.h
@@ -0,0 +1,61 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLDB_SOURCE_PLUGINS_INSTRUMENTATIONRUNTIME_BOUNDS_SAFETY_SOFT_TRAP_H
+#define LLDB_SOURCE_PLUGINS_INSTRUMENTATIONRUNTIME_BOUNDS_SAFETY_SOFT_TRAP_H
+
+#include "lldb/Target/ABI.h"
+#include "lldb/Target/InstrumentationRuntime.h"
+#include "lldb/Utility/StructuredData.h"
+#include "lldb/lldb-private.h"
+
+namespace lldb_private {
+
+class InstrumentationRuntimeBoundsSafety
+    : public lldb_private::InstrumentationRuntime {
+public:
+  ~InstrumentationRuntimeBoundsSafety() override;
+
+  static lldb::InstrumentationRuntimeSP
+  CreateInstance(const lldb::ProcessSP &process_sp);
+
+  static void Initialize();
+
+  static void Terminate();
+
+  static llvm::StringRef GetPluginNameStatic() { return "BoundsSafety"; }
+
+  static lldb::InstrumentationRuntimeType GetTypeStatic();
+
+  llvm::StringRef GetPluginName() override { return GetPluginNameStatic(); }
+
+  virtual lldb::InstrumentationRuntimeType GetType() { return GetTypeStatic(); }
+
+private:
+  InstrumentationRuntimeBoundsSafety(const lldb::ProcessSP &process_sp)
+      : lldb_private::InstrumentationRuntime(process_sp) {}
+
+  const RegularExpression &GetPatternForRuntimeLibrary() override;
+
+  bool CheckIfRuntimeIsValid(const lldb::ModuleSP module_sp) override;
+
+  void Activate() override;
+
+  void Deactivate();
+
+  static bool NotifyBreakpointHit(void *baton,
+                                  StoppointCallbackContext *context,
+                                  lldb::user_id_t break_id,
+                                  lldb::user_id_t break_loc_id);
+
+  bool MatchAllModules() override { return true; }
+};
+
+} // namespace lldb_private
+
+#endif
diff --git a/lldb/source/Plugins/InstrumentationRuntime/CMakeLists.txt b/lldb/source/Plugins/InstrumentationRuntime/CMakeLists.txt
index 2a6cf930945d1..b7e1a602f208f 100644
--- a/lldb/source/Plugins/InstrumentationRuntime/CMakeLists.txt
+++ b/lldb/source/Plugins/InstrumentationRuntime/CMakeLists.txt
@@ -2,6 +2,7 @@ set_property(DIRECTORY PROPERTY LLDB_PLUGIN_KIND InstrumentationRuntime)
 
 add_subdirectory(ASan)
 add_subdirectory(ASanLibsanitizers)
+add_subdirectory(BoundsSafety)
 add_subdirectory(MainThreadChecker)
 add_subdirectory(TSan)
 add_subdirectory(UBSan)
diff --git a/lldb/source/Plugins/Language/CPlusPlus/CMakeLists.txt b/lldb/source/Plugins/Language/CPlusPlus/CMakeLists.txt
index ca4fd3f680484..c52d3bdb31284 100644
--- a/lldb/source/Plugins/Language/CPlusPlus/CMakeLists.txt
+++ b/lldb/source/Plugins/Language/CPlusPlus/CMakeLists.txt
@@ -31,6 +31,7 @@ add_lldb_library(lldbPluginCPlusPlusLanguage PLUGIN
   LibCxxValarray.cpp
   LibCxxVector.cpp
   LibStdcpp.cpp
+  LibStdcppSpan.cpp
   LibStdcppTuple.cpp
   LibStdcppUniquePointer.cpp
   MsvcStl.cpp
diff --git a/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp b/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp
index a3624accf9b5a..ae6086ff89d71 100644
--- a/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp
+++ b/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp
@@ -1424,6 +1424,10 @@ static void LoadLibStdcppFormatters(lldb::TypeCategoryImplSP cpp_category_sp) {
           stl_synth_flags,
           "lldb.formatters.cpp.gnu_libstdcpp.StdForwardListSynthProvider")));
 
+  AddCXXSynthetic(cpp_category_sp, LibStdcppSpanSyntheticFrontEndCreator,
+                  "libstdc++ std::span synthetic children", "^std::span<.+>$",
+                  stl_deref_flags, true);
+
   stl_summary_flags.SetDontShowChildren(false);
   stl_summary_flags.SetSkipPointers(false);
 
@@ -1514,6 +1518,11 @@ static void LoadLibStdcppFormatters(lldb::TypeCategoryImplSP cpp_category_sp) {
                 lldb_private::formatters::StdlibCoroutineHandleSummaryProvider,
                 "libstdc++ std::coroutine_handle summary provider",
                 libstdcpp_std_coroutine_handle_regex, stl_summary_flags, true);
+
+  AddCXXSummary(cpp_category_sp,
+                lldb_private::formatters::ContainerSizeSummaryProvider,
+                "libstdc++ std::span summary provider", "^std::span<.+>$",
+                stl_summary_flags, true);
 }
 
 static lldb_private::SyntheticChildrenFrontEnd *
diff --git a/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusNameParser.cpp b/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusNameParser.cpp
index d8c095d6edeb7..4d283bb02e533 100644
--- a/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusNameParser.cpp
+++ b/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusNameParser.cpp
@@ -315,7 +315,7 @@ bool CPlusPlusNameParser::ConsumeAbiTag() {
 
   // Consume the actual tag string (and allow some special characters)
   while (ConsumeToken(tok::raw_identifier, tok::comma, tok::period,
-                      tok::numeric_constant))
+                      tok::numeric_constant, tok::kw_operator))
     ;
 
   if (!ConsumeToken(tok::r_square))
@@ -420,10 +420,11 @@ bool CPlusPlusNameParser::ConsumeOperator() {
     // Make sure we have more tokens before attempting to look ahead one more.
     if (m_next_token_index + 1 < m_tokens.size()) {
       // Look ahead two tokens.
-      clang::Token n_token = m_tokens[m_next_token_index + 1];
-      // If we find ( or < then this is indeed operator<< no need for fix.
-      if (n_token.getKind() != tok::l_paren && n_token.getKind() != tok::less) {
-        clang::Token tmp_tok;
+      const clang::Token n_token = m_tokens[m_next_token_index + 1];
+      // If we find `(`, `<` or `[` then this is indeed operator<<; no need
+      // for the fix.
+      if (!n_token.isOneOf(tok::l_paren, tok::less, tok::l_square)) {
+        clang::Token tmp_tok{};
         tmp_tok.startToken();
         tmp_tok.setLength(1);
         tmp_tok.setLocation(token.getLocation().getLocWithOffset(1));
diff --git a/lldb/source/Plugins/Language/CPlusPlus/LibStdcpp.h b/lldb/source/Plugins/Language/CPlusPlus/LibStdcpp.h
index 429142f63a4bd..8d2c81f2bbcbb 100644
--- a/lldb/source/Plugins/Language/CPlusPlus/LibStdcpp.h
+++ b/lldb/source/Plugins/Language/CPlusPlus/LibStdcpp.h
@@ -37,6 +37,10 @@ SyntheticChildrenFrontEnd *
 LibstdcppMapIteratorSyntheticFrontEndCreator(CXXSyntheticChildren *,
                                              lldb::ValueObjectSP);
 
+SyntheticChildrenFrontEnd *
+LibStdcppSpanSyntheticFrontEndCreator(CXXSyntheticChildren *,
+                                      lldb::ValueObjectSP);
+
 SyntheticChildrenFrontEnd *
 LibStdcppTupleSyntheticFrontEndCreator(CXXSyntheticChildren *,
                                        lldb::ValueObjectSP);
diff --git a/lldb/source/Plugins/Language/CPlusPlus/LibStdcppSpan.cpp b/lldb/source/Plugins/Language/CPlusPlus/LibStdcppSpan.cpp
new file mode 100644
index 0000000000000..5e69792151c87
--- /dev/null
+++ b/lldb/source/Plugins/Language/CPlusPlus/LibStdcppSpan.cpp
@@ -0,0 +1,112 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "LibStdcpp.h"
+
+#include "lldb/DataFormatters/FormattersHelpers.h"
+#include "lldb/Utility/ConstString.h"
+#include "lldb/ValueObject/ValueObject.h"
+#include "llvm/ADT/APSInt.h"
+#include "llvm/Support/Error.h"
+#include <cstddef>
+#include <optional>
+
+using namespace lldb;
+
+namespace lldb_private::formatters {
+
+class LibStdcppSpanSyntheticFrontEnd : public SyntheticChildrenFrontEnd {
+public:
+  LibStdcppSpanSyntheticFrontEnd(const lldb::ValueObjectSP &valobj_sp)
+      : SyntheticChildrenFrontEnd(*valobj_sp) {
+    if (valobj_sp)
+      Update();
+  }
+
+  ~LibStdcppSpanSyntheticFrontEnd() override = default;
+
+  llvm::Expected<uint32_t> CalculateNumChildren() override {
+    return m_num_elements;
+  }
+
+  lldb::ValueObjectSP GetChildAtIndex(uint32_t idx) override {
+    if (!m_start)
+      return {};
+
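+    // The child at [idx] lives at the span's data pointer plus
+    // idx * element size.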
+    uint64_t offset = (static_cast<uint64_t>(idx) * m_element_size);
+    offset += m_start->GetValueAsUnsigned(0);
+    const std::string name = llvm::formatv("[{0}]", idx);
+    return CreateValueObjectFromAddress(
+        name, offset, m_backend.GetExecutionContextRef(), m_element_type);
+  }
+
+  lldb::ChildCacheState Update() override {
+    const ValueObjectSP data_ptr = m_backend.GetChildMemberWithName("_M_ptr");
+    if (!data_ptr)
+      return lldb::ChildCacheState::eRefetch;
+
+    m_element_type = data_ptr->GetCompilerType().GetPointeeType();
+
+    // Get element size.
+    llvm::Expected<uint64_t> size_or_err = m_element_type.GetByteSize(nullptr);
+    if (!size_or_err) {
+      LLDB_LOG_ERRORV(GetLog(LLDBLog::DataFormatters), size_or_err.takeError(),
+                      "{0}");
+      return lldb::ChildCacheState::eReuse;
+    }
+
+    m_element_size = *size_or_err;
+    if (m_element_size > 0) {
+      m_start = data_ptr.get();
+    }
+
+    // Get number of elements.
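+    // Dynamic-extent spans store their size in _M_extent._M_extent_value;
+    // fixed-extent spans have no such member, so fall back to the extent
+    // encoded in the second template argument.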
+    if (const ValueObjectSP size_sp =
+            m_backend.GetChildAtNamePath({"_M_extent", "_M_extent_value"})) {
+      m_num_elements = size_sp->GetValueAsUnsigned(0);
+    } else if (const auto arg =
+                   m_backend.GetCompilerType().GetIntegralTemplateArgument(1)) {
+      m_num_elements = arg->value.GetAPSInt().getLimitedValue();
+    }
+
+    return lldb::ChildCacheState::eReuse;
+  }
+
+  llvm::Expected<size_t> GetIndexOfChildWithName(ConstString name) override {
+    if (!m_start)
+      return llvm::createStringError(
+          llvm::formatv("Type has no child named {0}", name.GetStringRef()));
+
+    auto optional_idx = formatters::ExtractIndexFromString(name.GetCString());
+    if (!optional_idx) {
+      return llvm::createStringError(
+          llvm::formatv("Type has no child named {0}", name.GetStringRef()));
+    }
+    return *optional_idx;
+  }
+
+private:
+  ValueObject *m_start = nullptr; ///< First element of span. Held, not owned.
+  CompilerType m_element_type;    ///< Type of span elements.
+  size_t m_num_elements = 0;      ///< Number of elements in span.
+  uint32_t m_element_size = 0;    ///< Size in bytes of each span element.
+};
+
+SyntheticChildrenFrontEnd *
+LibStdcppSpanSyntheticFrontEndCreator(CXXSyntheticChildren * /*unused*/,
+                                      lldb::ValueObjectSP valobj_sp) {
+  if (!valobj_sp)
+    return nullptr;
+  const CompilerType type = valobj_sp->GetCompilerType();
+  if (!type || type.GetNumTemplateArguments() != 2)
+    return nullptr;
+  return new LibStdcppSpanSyntheticFrontEnd(valobj_sp);
+}
+
+} // namespace lldb_private::formatters
diff --git a/lldb/source/Plugins/ObjectFile/Breakpad/ObjectFileBreakpad.cpp b/lldb/source/Plugins/ObjectFile/Breakpad/ObjectFileBreakpad.cpp
index 33673f139b49a..53ff6ef6613e9 100644
--- a/lldb/source/Plugins/ObjectFile/Breakpad/ObjectFileBreakpad.cpp
+++ b/lldb/source/Plugins/ObjectFile/Breakpad/ObjectFileBreakpad.cpp
@@ -130,13 +130,13 @@ void ObjectFileBreakpad::CreateSections(SectionList &unified_section_list) {
 
   std::optional<Record::Kind> current_section;
   offset_t section_start;
-  llvm::StringRef text = toStringRef(m_data.GetData());
+  llvm::StringRef text = toStringRef(m_data_nsp->GetData());
   uint32_t next_section_id = 1;
   auto maybe_add_section = [&](const uint8_t *end_ptr) {
     if (!current_section)
       return; // We have been called before parsing the first line.
 
-    offset_t end_offset = end_ptr - m_data.GetDataStart();
+    offset_t end_offset = end_ptr - m_data_nsp->GetDataStart();
     auto section_sp = std::make_shared<Section>(
         GetModule(), this, next_section_id++,
         ConstString(toString(*current_section)), eSectionTypeOther,
@@ -162,8 +162,8 @@ void ObjectFileBreakpad::CreateSections(SectionList &unified_section_list) {
     maybe_add_section(line.bytes_begin());
     // And start a new one.
     current_section = next_section;
-    section_start = line.bytes_begin() - m_data.GetDataStart();
+    section_start = line.bytes_begin() - m_data_nsp->GetDataStart();
   }
   // Finally, add the last section.
-  maybe_add_section(m_data.GetDataEnd());
+  maybe_add_section(m_data_nsp->GetDataEnd());
 }
diff --git a/lldb/source/Plugins/ObjectFile/COFF/ObjectFileCOFF.cpp b/lldb/source/Plugins/ObjectFile/COFF/ObjectFileCOFF.cpp
index 1121f696637b6..e78f4f08783d7 100644
--- a/lldb/source/Plugins/ObjectFile/COFF/ObjectFileCOFF.cpp
+++ b/lldb/source/Plugins/ObjectFile/COFF/ObjectFileCOFF.cpp
@@ -300,8 +300,8 @@ bool ObjectFileCOFF::ParseHeader() {
 
   std::lock_guard<std::recursive_mutex> guard(module->GetMutex());
 
-  m_data.SetByteOrder(eByteOrderLittle);
-  m_data.SetAddressByteSize(GetAddressByteSize());
+  m_data_nsp->SetByteOrder(eByteOrderLittle);
+  m_data_nsp->SetAddressByteSize(GetAddressByteSize());
 
   return true;
 }
diff --git a/lldb/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp b/lldb/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp
index 3968715a6d215..5d81b110cc7ae 100644
--- a/lldb/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp
+++ b/lldb/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp
@@ -804,7 +804,7 @@ ByteOrder ObjectFileELF::GetByteOrder() const {
 }
 
 uint32_t ObjectFileELF::GetAddressByteSize() const {
-  return m_data.GetAddressByteSize();
+  return m_data_nsp->GetAddressByteSize();
 }
 
 AddressClass ObjectFileELF::GetAddressClass(addr_t file_addr) {
@@ -845,7 +845,7 @@ size_t ObjectFileELF::SectionIndex(const SectionHeaderCollConstIter &I) const {
 
 bool ObjectFileELF::ParseHeader() {
   lldb::offset_t offset = 0;
-  return m_header.Parse(m_data, &offset);
+  return m_header.Parse(*m_data_nsp.get(), &offset);
 }
 
 UUID ObjectFileELF::GetUUID() {
@@ -881,7 +881,7 @@ UUID ObjectFileELF::GetUUID() {
         return UUID();
 
       core_notes_crc =
-          CalculateELFNotesSegmentsCRC32(m_program_headers, m_data);
+          CalculateELFNotesSegmentsCRC32(m_program_headers, *m_data_nsp.get());
 
       if (core_notes_crc) {
         // Use 8 bytes - first 4 bytes for *magic* prefix, mainly to make it
@@ -892,7 +892,7 @@ UUID ObjectFileELF::GetUUID() {
       }
     } else {
       if (!m_gnu_debuglink_crc)
-        m_gnu_debuglink_crc = calc_crc32(0, m_data);
+        m_gnu_debuglink_crc = calc_crc32(0, *m_data_nsp.get());
       if (m_gnu_debuglink_crc) {
         // Use 4 bytes of crc from the .gnu_debuglink section.
         u32le data(m_gnu_debuglink_crc);
@@ -1078,7 +1078,8 @@ size_t ObjectFileELF::GetProgramHeaderInfo(ProgramHeaderColl &program_headers,
 
 // ParseProgramHeaders
 bool ObjectFileELF::ParseProgramHeaders() {
-  return GetProgramHeaderInfo(m_program_headers, m_data, m_header) != 0;
+  return GetProgramHeaderInfo(m_program_headers, *m_data_nsp.get(), m_header) !=
+         0;
 }
 
 lldb_private::Status
@@ -1668,8 +1669,8 @@ ObjectFileELF::StripLinkerSymbolAnnotations(llvm::StringRef symbol_name) const {
 
 // ParseSectionHeaders
 size_t ObjectFileELF::ParseSectionHeaders() {
-  return GetSectionHeaderInfo(m_section_headers, m_data, m_header, m_uuid,
-                              m_gnu_debuglink_file, m_gnu_debuglink_crc,
+  return GetSectionHeaderInfo(m_section_headers, *m_data_nsp.get(), m_header,
+                              m_uuid, m_gnu_debuglink_file, m_gnu_debuglink_crc,
                               m_arch_spec);
 }
 
@@ -3678,7 +3679,8 @@ ArchSpec ObjectFileELF::GetArchitecture() {
       if (H.p_type != PT_NOTE || H.p_offset == 0 || H.p_filesz == 0)
         continue;
       DataExtractor data;
-      if (data.SetData(m_data, H.p_offset, H.p_filesz) == H.p_filesz) {
+      if (data.SetData(*m_data_nsp.get(), H.p_offset, H.p_filesz) ==
+          H.p_filesz) {
         UUID uuid;
         RefineModuleDetailsFromNote(data, m_arch_spec, uuid);
       }
@@ -3833,10 +3835,10 @@ llvm::ArrayRef<ELFProgramHeader> ObjectFileELF::ProgramHeaders() {
 }
 
 DataExtractor ObjectFileELF::GetSegmentData(const ELFProgramHeader &H) {
-  // Try and read the program header from our cached m_data which can come from
-  // the file on disk being mmap'ed or from the initial part of the ELF file we
-  // read from memory and cached.
-  DataExtractor data = DataExtractor(m_data, H.p_offset, H.p_filesz);
+  // Try and read the program header from our cached m_data_nsp which can come
+  // from the file on disk being mmap'ed or from the initial part of the ELF
+  // file we read from memory and cached.
+  DataExtractor data = DataExtractor(*m_data_nsp.get(), H.p_offset, H.p_filesz);
   if (data.GetByteSize() == H.p_filesz)
     return data;
   if (IsInMemory()) {
diff --git a/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp b/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
index 2218c23db5a95..dff9eab0e24d7 100644
--- a/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
+++ b/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
@@ -1012,35 +1012,35 @@ bool ObjectFileMachO::ParseHeader() {
   std::lock_guard<std::recursive_mutex> guard(module_sp->GetMutex());
   bool can_parse = false;
   lldb::offset_t offset = 0;
-  m_data.SetByteOrder(endian::InlHostByteOrder());
+  m_data_nsp->SetByteOrder(endian::InlHostByteOrder());
   // Leave magic in the original byte order
-  m_header.magic = m_data.GetU32(&offset);
+  m_header.magic = m_data_nsp->GetU32(&offset);
   switch (m_header.magic) {
   case MH_MAGIC:
-    m_data.SetByteOrder(endian::InlHostByteOrder());
-    m_data.SetAddressByteSize(4);
+    m_data_nsp->SetByteOrder(endian::InlHostByteOrder());
+    m_data_nsp->SetAddressByteSize(4);
     can_parse = true;
     break;
 
   case MH_MAGIC_64:
-    m_data.SetByteOrder(endian::InlHostByteOrder());
-    m_data.SetAddressByteSize(8);
+    m_data_nsp->SetByteOrder(endian::InlHostByteOrder());
+    m_data_nsp->SetAddressByteSize(8);
     can_parse = true;
     break;
 
   case MH_CIGAM:
-    m_data.SetByteOrder(endian::InlHostByteOrder() == eByteOrderBig
-                            ? eByteOrderLittle
-                            : eByteOrderBig);
-    m_data.SetAddressByteSize(4);
+    m_data_nsp->SetByteOrder(endian::InlHostByteOrder() == eByteOrderBig
+                                 ? eByteOrderLittle
+                                 : eByteOrderBig);
+    m_data_nsp->SetAddressByteSize(4);
     can_parse = true;
     break;
 
   case MH_CIGAM_64:
-    m_data.SetByteOrder(endian::InlHostByteOrder() == eByteOrderBig
-                            ? eByteOrderLittle
-                            : eByteOrderBig);
-    m_data.SetAddressByteSize(8);
+    m_data_nsp->SetByteOrder(endian::InlHostByteOrder() == eByteOrderBig
+                                 ? eByteOrderLittle
+                                 : eByteOrderBig);
+    m_data_nsp->SetAddressByteSize(8);
     can_parse = true;
     break;
 
@@ -1049,12 +1049,13 @@ bool ObjectFileMachO::ParseHeader() {
   }
 
   if (can_parse) {
-    m_data.GetU32(&offset, &m_header.cputype, 6);
+    m_data_nsp->GetU32(&offset, &m_header.cputype, 6);
 
     ModuleSpecList all_specs;
     ModuleSpec base_spec;
-    GetAllArchSpecs(m_header, m_data, MachHeaderSizeFromMagic(m_header.magic),
-                    base_spec, all_specs);
+    GetAllArchSpecs(m_header, *m_data_nsp.get(),
+                    MachHeaderSizeFromMagic(m_header.magic), base_spec,
+                    all_specs);
 
     for (unsigned i = 0, e = all_specs.GetSize(); i != e; ++i) {
       ArchSpec mach_arch =
@@ -1068,7 +1069,7 @@ bool ObjectFileMachO::ParseHeader() {
       if (SetModulesArchitecture(mach_arch)) {
         const size_t header_and_lc_size =
             m_header.sizeofcmds + MachHeaderSizeFromMagic(m_header.magic);
-        if (m_data.GetByteSize() < header_and_lc_size) {
+        if (m_data_nsp->GetByteSize() < header_and_lc_size) {
           DataBufferSP data_sp;
           ProcessSP process_sp(m_process_wp.lock());
           if (process_sp) {
@@ -1080,7 +1081,7 @@ bool ObjectFileMachO::ParseHeader() {
               continue;
           }
           if (data_sp)
-            m_data.SetData(data_sp);
+            m_data_nsp->SetData(data_sp);
         }
       }
       return true;
@@ -1094,7 +1095,7 @@ bool ObjectFileMachO::ParseHeader() {
 }
 
 ByteOrder ObjectFileMachO::GetByteOrder() const {
-  return m_data.GetByteOrder();
+  return m_data_nsp->GetByteOrder();
 }
 
 bool ObjectFileMachO::IsExecutable() const {
@@ -1114,7 +1115,7 @@ bool ObjectFileMachO::IsKext() const {
 }
 
 uint32_t ObjectFileMachO::GetAddressByteSize() const {
-  return m_data.GetAddressByteSize();
+  return m_data_nsp->GetAddressByteSize();
 }
 
 AddressClass ObjectFileMachO::GetAddressClass(lldb::addr_t file_addr) {
@@ -1297,13 +1298,13 @@ bool ObjectFileMachO::IsStripped() {
         const lldb::offset_t load_cmd_offset = offset;
 
         llvm::MachO::load_command lc = {};
-        if (m_data.GetU32(&offset, &lc.cmd, 2) == nullptr)
+        if (m_data_nsp->GetU32(&offset, &lc.cmd, 2) == nullptr)
           break;
         if (lc.cmd == LC_DYSYMTAB) {
           m_dysymtab.cmd = lc.cmd;
           m_dysymtab.cmdsize = lc.cmdsize;
-          if (m_data.GetU32(&offset, &m_dysymtab.ilocalsym,
-                            (sizeof(m_dysymtab) / sizeof(uint32_t)) - 2) ==
+          if (m_data_nsp->GetU32(&offset, &m_dysymtab.ilocalsym,
+                                 (sizeof(m_dysymtab) / sizeof(uint32_t)) - 2) ==
               nullptr) {
             // Clear m_dysymtab if we were unable to read all items from the
             // load command
@@ -1326,14 +1327,14 @@ ObjectFileMachO::EncryptedFileRanges ObjectFileMachO::GetEncryptedFileRanges() {
   llvm::MachO::encryption_info_command encryption_cmd;
   for (uint32_t i = 0; i < m_header.ncmds; ++i) {
     const lldb::offset_t load_cmd_offset = offset;
-    if (m_data.GetU32(&offset, &encryption_cmd, 2) == nullptr)
+    if (m_data_nsp->GetU32(&offset, &encryption_cmd, 2) == nullptr)
       break;
 
     // LC_ENCRYPTION_INFO and LC_ENCRYPTION_INFO_64 have the same sizes for the
     // 3 fields we care about, so treat them the same.
     if (encryption_cmd.cmd == LC_ENCRYPTION_INFO ||
         encryption_cmd.cmd == LC_ENCRYPTION_INFO_64) {
-      if (m_data.GetU32(&offset, &encryption_cmd.cryptoff, 3)) {
+      if (m_data_nsp->GetU32(&offset, &encryption_cmd.cryptoff, 3)) {
         if (encryption_cmd.cryptid != 0) {
           EncryptedFileRanges::Entry entry;
           entry.SetRangeBase(encryption_cmd.cryptoff);
@@ -1562,7 +1563,7 @@ void ObjectFileMachO::ProcessSegmentCommand(
   llvm::MachO::segment_command_64 load_cmd;
   memcpy(&load_cmd, &load_cmd_, sizeof(load_cmd_));
 
-  if (!m_data.GetU8(&offset, (uint8_t *)load_cmd.segname, 16))
+  if (!m_data_nsp->GetU8(&offset, (uint8_t *)load_cmd.segname, 16))
     return;
 
   ModuleSP module_sp = GetModule();
@@ -1586,11 +1587,11 @@ void ObjectFileMachO::ProcessSegmentCommand(
       add_section = false;
     }
   }
-  load_cmd.vmaddr = m_data.GetAddress(&offset);
-  load_cmd.vmsize = m_data.GetAddress(&offset);
-  load_cmd.fileoff = m_data.GetAddress(&offset);
-  load_cmd.filesize = m_data.GetAddress(&offset);
-  if (!m_data.GetU32(&offset, &load_cmd.maxprot, 4))
+  load_cmd.vmaddr = m_data_nsp->GetAddress(&offset);
+  load_cmd.vmsize = m_data_nsp->GetAddress(&offset);
+  load_cmd.fileoff = m_data_nsp->GetAddress(&offset);
+  load_cmd.filesize = m_data_nsp->GetAddress(&offset);
+  if (!m_data_nsp->GetU32(&offset, &load_cmd.maxprot, 4))
     return;
 
   SanitizeSegmentCommand(load_cmd, cmd_idx);
@@ -1681,16 +1682,16 @@ void ObjectFileMachO::ProcessSegmentCommand(
   const uint32_t num_u32s = load_cmd.cmd == LC_SEGMENT ? 7 : 8;
   for (segment_sect_idx = 0; segment_sect_idx < load_cmd.nsects;
        ++segment_sect_idx) {
-    if (m_data.GetU8(&offset, (uint8_t *)sect64.sectname,
-                     sizeof(sect64.sectname)) == nullptr)
+    if (m_data_nsp->GetU8(&offset, (uint8_t *)sect64.sectname,
+                          sizeof(sect64.sectname)) == nullptr)
       break;
-    if (m_data.GetU8(&offset, (uint8_t *)sect64.segname,
-                     sizeof(sect64.segname)) == nullptr)
+    if (m_data_nsp->GetU8(&offset, (uint8_t *)sect64.segname,
+                          sizeof(sect64.segname)) == nullptr)
       break;
-    sect64.addr = m_data.GetAddress(&offset);
-    sect64.size = m_data.GetAddress(&offset);
+    sect64.addr = m_data_nsp->GetAddress(&offset);
+    sect64.size = m_data_nsp->GetAddress(&offset);
 
-    if (m_data.GetU32(&offset, &sect64.offset, num_u32s) == nullptr)
+    if (m_data_nsp->GetU32(&offset, &sect64.offset, num_u32s) == nullptr)
       break;
 
     if (IsSharedCacheBinary() && !IsInMemory()) {
@@ -1855,8 +1856,8 @@ void ObjectFileMachO::ProcessDysymtabCommand(
     const llvm::MachO::load_command &load_cmd, lldb::offset_t offset) {
   m_dysymtab.cmd = load_cmd.cmd;
   m_dysymtab.cmdsize = load_cmd.cmdsize;
-  m_data.GetU32(&offset, &m_dysymtab.ilocalsym,
-                (sizeof(m_dysymtab) / sizeof(uint32_t)) - 2);
+  m_data_nsp->GetU32(&offset, &m_dysymtab.ilocalsym,
+                     (sizeof(m_dysymtab) / sizeof(uint32_t)) - 2);
 }
 
 void ObjectFileMachO::CreateSections(SectionList &unified_section_list) {
@@ -1875,7 +1876,7 @@ void ObjectFileMachO::CreateSections(SectionList &unified_section_list) {
   llvm::MachO::load_command load_cmd;
   for (uint32_t i = 0; i < m_header.ncmds; ++i) {
     const lldb::offset_t load_cmd_offset = offset;
-    if (m_data.GetU32(&offset, &load_cmd, 2) == nullptr)
+    if (m_data_nsp->GetU32(&offset, &load_cmd, 2) == nullptr)
       break;
 
     if (load_cmd.cmd == LC_SEGMENT || load_cmd.cmd == LC_SEGMENT_64)
@@ -2240,13 +2241,13 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
     const lldb::offset_t cmd_offset = offset;
     // Read in the load command and load command size
     llvm::MachO::load_command lc;
-    if (m_data.GetU32(&offset, &lc, 2) == nullptr)
+    if (m_data_nsp->GetU32(&offset, &lc, 2) == nullptr)
       break;
     // Watch for the symbol table load command
     switch (lc.cmd) {
     case LC_SYMTAB: {
       llvm::MachO::symtab_command lc_obj;
-      if (m_data.GetU32(&offset, &lc_obj.symoff, 4)) {
+      if (m_data_nsp->GetU32(&offset, &lc_obj.symoff, 4)) {
         lc_obj.cmd = lc.cmd;
         lc_obj.cmdsize = lc.cmdsize;
         symtab_load_command = lc_obj;
@@ -2256,7 +2257,7 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
     case LC_DYLD_INFO:
     case LC_DYLD_INFO_ONLY: {
       llvm::MachO::dyld_info_command lc_obj;
-      if (m_data.GetU32(&offset, &lc_obj.rebase_off, 10)) {
+      if (m_data_nsp->GetU32(&offset, &lc_obj.rebase_off, 10)) {
         lc_obj.cmd = lc.cmd;
         lc_obj.cmdsize = lc.cmdsize;
         dyld_info = lc_obj;
@@ -2268,8 +2269,8 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
     case LC_REEXPORT_DYLIB:
     case LC_LOADFVMLIB:
     case LC_LOAD_UPWARD_DYLIB: {
-      uint32_t name_offset = cmd_offset + m_data.GetU32(&offset);
-      const char *path = m_data.PeekCStr(name_offset);
+      uint32_t name_offset = cmd_offset + m_data_nsp->GetU32(&offset);
+      const char *path = m_data_nsp->PeekCStr(name_offset);
       if (path) {
         FileSpec file_spec(path);
         // Strip the path if there is @rpath, @executable, etc so we just use
@@ -2289,19 +2290,19 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
       llvm::MachO::linkedit_data_command lc_obj;
       lc_obj.cmd = lc.cmd;
       lc_obj.cmdsize = lc.cmdsize;
-      if (m_data.GetU32(&offset, &lc_obj.dataoff, 2))
+      if (m_data_nsp->GetU32(&offset, &lc_obj.dataoff, 2))
         exports_trie_load_command = lc_obj;
     } break;
     case LC_FUNCTION_STARTS: {
       llvm::MachO::linkedit_data_command lc_obj;
       lc_obj.cmd = lc.cmd;
       lc_obj.cmdsize = lc.cmdsize;
-      if (m_data.GetU32(&offset, &lc_obj.dataoff, 2))
+      if (m_data_nsp->GetU32(&offset, &lc_obj.dataoff, 2))
         function_starts_load_command = lc_obj;
     } break;
 
     case LC_UUID: {
-      const uint8_t *uuid_bytes = m_data.PeekData(offset, 16);
+      const uint8_t *uuid_bytes = m_data_nsp->PeekData(offset, 16);
 
       if (uuid_bytes)
         image_uuid = UUID(uuid_bytes, 16);
@@ -2321,8 +2322,8 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
   if (section_list == nullptr)
     return;
 
-  const uint32_t addr_byte_size = m_data.GetAddressByteSize();
-  const ByteOrder byte_order = m_data.GetByteOrder();
+  const uint32_t addr_byte_size = m_data_nsp->GetAddressByteSize();
+  const ByteOrder byte_order = m_data_nsp->GetByteOrder();
   bool bit_width_32 = addr_byte_size == 4;
   const size_t nlist_byte_size =
       bit_width_32 ? sizeof(struct nlist) : sizeof(struct nlist_64);
@@ -2487,9 +2488,9 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
       exports_trie_load_command.dataoff += linkedit_slide;
     }
 
-    nlist_data.SetData(m_data, symtab_load_command.symoff,
+    nlist_data.SetData(*m_data_nsp.get(), symtab_load_command.symoff,
                        nlist_data_byte_size);
-    strtab_data.SetData(m_data, symtab_load_command.stroff,
+    strtab_data.SetData(*m_data_nsp.get(), symtab_load_command.stroff,
                         strtab_data_byte_size);
 
     // We shouldn't have exports data from both the LC_DYLD_INFO command
@@ -2497,19 +2498,22 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
     lldbassert(!((dyld_info.export_size > 0)
                  && (exports_trie_load_command.datasize > 0)));
     if (dyld_info.export_size > 0) {
-      dyld_trie_data.SetData(m_data, dyld_info.export_off,
+      dyld_trie_data.SetData(*m_data_nsp.get(), dyld_info.export_off,
                              dyld_info.export_size);
     } else if (exports_trie_load_command.datasize > 0) {
-      dyld_trie_data.SetData(m_data, exports_trie_load_command.dataoff,
+      dyld_trie_data.SetData(*m_data_nsp.get(),
+                             exports_trie_load_command.dataoff,
                              exports_trie_load_command.datasize);
     }
 
     if (dysymtab.nindirectsyms != 0) {
-      indirect_symbol_index_data.SetData(m_data, dysymtab.indirectsymoff,
+      indirect_symbol_index_data.SetData(*m_data_nsp.get(),
+                                         dysymtab.indirectsymoff,
                                          dysymtab.nindirectsyms * 4);
     }
     if (function_starts_load_command.cmd) {
-      function_starts_data.SetData(m_data, function_starts_load_command.dataoff,
+      function_starts_data.SetData(*m_data_nsp.get(),
+                                   function_starts_load_command.dataoff,
                                    function_starts_load_command.datasize);
     }
   }
@@ -4561,8 +4565,9 @@ void ObjectFileMachO::Dump(Stream *s) {
     *s << ", file = '" << m_file;
     ModuleSpecList all_specs;
     ModuleSpec base_spec;
-    GetAllArchSpecs(m_header, m_data, MachHeaderSizeFromMagic(m_header.magic),
-                    base_spec, all_specs);
+    GetAllArchSpecs(m_header, *m_data_nsp.get(),
+                    MachHeaderSizeFromMagic(m_header.magic), base_spec,
+                    all_specs);
     for (unsigned i = 0, e = all_specs.GetSize(); i != e; ++i) {
       *s << "', triple";
       if (e)
@@ -4868,7 +4873,7 @@ UUID ObjectFileMachO::GetUUID() {
   if (module_sp) {
     std::lock_guard<std::recursive_mutex> guard(module_sp->GetMutex());
     lldb::offset_t offset = MachHeaderSizeFromMagic(m_header.magic);
-    return GetUUID(m_header, m_data, offset);
+    return GetUUID(m_header, *m_data_nsp.get(), offset);
   }
   return UUID();
 }
@@ -4888,7 +4893,7 @@ uint32_t ObjectFileMachO::GetDependentModules(FileSpecList &files) {
   uint32_t i;
   for (i = 0; i < m_header.ncmds; ++i) {
     const uint32_t cmd_offset = offset;
-    if (m_data.GetU32(&offset, &load_cmd, 2) == nullptr)
+    if (m_data_nsp->GetU32(&offset, &load_cmd, 2) == nullptr)
       break;
 
     switch (load_cmd.cmd) {
@@ -4899,17 +4904,17 @@ uint32_t ObjectFileMachO::GetDependentModules(FileSpecList &files) {
     case LC_LOAD_DYLINKER:
     case LC_LOADFVMLIB:
     case LC_LOAD_UPWARD_DYLIB: {
-      uint32_t name_offset = cmd_offset + m_data.GetU32(&offset);
+      uint32_t name_offset = cmd_offset + m_data_nsp->GetU32(&offset);
       // For LC_LOAD_DYLIB there is an alternate encoding
       // which adds a uint32_t `flags` field for `DYLD_USE_*`
       // flags.  This can be detected by a timestamp field with
       // the `DYLIB_USE_MARKER` constant value.
       bool is_delayed_init = false;
-      uint32_t use_command_marker = m_data.GetU32(&offset);
+      uint32_t use_command_marker = m_data_nsp->GetU32(&offset);
       if (use_command_marker == 0x1a741800 /* DYLIB_USE_MARKER */) {
         offset += 4; /* uint32_t current_version */
         offset += 4; /* uint32_t compat_version */
-        uint32_t flags = m_data.GetU32(&offset);
+        uint32_t flags = m_data_nsp->GetU32(&offset);
         // If this LC_LOAD_DYLIB is marked delay-init,
         // don't report it as a dependent library -- it
         // may be loaded in the process at some point,
@@ -4917,7 +4922,7 @@ uint32_t ObjectFileMachO::GetDependentModules(FileSpecList &files) {
         if (flags & 0x08 /* DYLIB_USE_DELAYED_INIT */)
           is_delayed_init = true;
       }
-      const char *path = m_data.PeekCStr(name_offset);
+      const char *path = m_data_nsp->PeekCStr(name_offset);
       if (path && !is_delayed_init) {
         if (load_cmd.cmd == LC_RPATH)
           rpath_paths.push_back(path);
@@ -5037,15 +5042,15 @@ lldb_private::Address ObjectFileMachO::GetEntryPointAddress() {
 
     for (i = 0; i < m_header.ncmds; ++i) {
       const lldb::offset_t cmd_offset = offset;
-      if (m_data.GetU32(&offset, &load_cmd, 2) == nullptr)
+      if (m_data_nsp->GetU32(&offset, &load_cmd, 2) == nullptr)
         break;
 
       switch (load_cmd.cmd) {
       case LC_UNIXTHREAD:
       case LC_THREAD: {
         while (offset < cmd_offset + load_cmd.cmdsize) {
-          uint32_t flavor = m_data.GetU32(&offset);
-          uint32_t count = m_data.GetU32(&offset);
+          uint32_t flavor = m_data_nsp->GetU32(&offset);
+          uint32_t count = m_data_nsp->GetU32(&offset);
           if (count == 0) {
             // We've gotten off somehow, log and exit;
             return m_entry_point_address;
@@ -5059,7 +5064,7 @@ lldb_private::Address ObjectFileMachO::GetEntryPointAddress() {
             {
               offset += 60; // This is the offset of pc in the GPR thread state
                             // data structure.
-              start_address = m_data.GetU32(&offset);
+              start_address = m_data_nsp->GetU32(&offset);
               done = true;
             }
             break;
@@ -5069,7 +5074,7 @@ lldb_private::Address ObjectFileMachO::GetEntryPointAddress() {
             {
               offset += 256; // This is the offset of pc in the GPR thread state
                              // data structure.
-              start_address = m_data.GetU64(&offset);
+              start_address = m_data_nsp->GetU64(&offset);
               done = true;
             }
             break;
@@ -5079,7 +5084,7 @@ lldb_private::Address ObjectFileMachO::GetEntryPointAddress() {
             {
               offset += 16 * 8; // This is the offset of rip in the GPR thread
                                 // state data structure.
-              start_address = m_data.GetU64(&offset);
+              start_address = m_data_nsp->GetU64(&offset);
               done = true;
             }
             break;
@@ -5094,7 +5099,7 @@ lldb_private::Address ObjectFileMachO::GetEntryPointAddress() {
         }
       } break;
       case LC_MAIN: {
-        uint64_t entryoffset = m_data.GetU64(&offset);
+        uint64_t entryoffset = m_data_nsp->GetU64(&offset);
         SectionSP text_segment_sp =
             GetSectionList()->FindSectionByName(GetSegmentNameTEXT());
         if (text_segment_sp) {
@@ -5178,7 +5183,7 @@ uint32_t ObjectFileMachO::GetNumThreadContexts() {
       llvm::MachO::thread_command thread_cmd;
       for (uint32_t i = 0; i < m_header.ncmds; ++i) {
         const uint32_t cmd_offset = offset;
-        if (m_data.GetU32(&offset, &thread_cmd, 2) == nullptr)
+        if (m_data_nsp->GetU32(&offset, &thread_cmd, 2) == nullptr)
           break;
 
         if (thread_cmd.cmd == LC_THREAD) {
@@ -5204,17 +5209,17 @@ ObjectFileMachO::FindLC_NOTEByName(std::string name) {
     for (uint32_t i = 0; i < m_header.ncmds; ++i) {
       const uint32_t cmd_offset = offset;
       llvm::MachO::load_command lc = {};
-      if (m_data.GetU32(&offset, &lc.cmd, 2) == nullptr)
+      if (m_data_nsp->GetU32(&offset, &lc.cmd, 2) == nullptr)
         break;
       if (lc.cmd == LC_NOTE) {
         char data_owner[17];
-        m_data.CopyData(offset, 16, data_owner);
+        m_data_nsp->CopyData(offset, 16, data_owner);
         data_owner[16] = '\0';
         offset += 16;
 
         if (name == data_owner) {
-          offset_t payload_offset = m_data.GetU64_unchecked(&offset);
-          offset_t payload_size = m_data.GetU64_unchecked(&offset);
+          offset_t payload_offset = m_data_nsp->GetU64_unchecked(&offset);
+          offset_t payload_size = m_data_nsp->GetU64_unchecked(&offset);
           results.push_back({payload_offset, payload_size});
         }
       }
@@ -5236,11 +5241,11 @@ std::string ObjectFileMachO::GetIdentifierString() {
       offset_t payload_offset = std::get<0>(lc_note);
       offset_t payload_size = std::get<1>(lc_note);
       uint32_t version;
-      if (m_data.GetU32(&payload_offset, &version, 1) != nullptr) {
+      if (m_data_nsp->GetU32(&payload_offset, &version, 1) != nullptr) {
         if (version == 1) {
           uint32_t strsize = payload_size - sizeof(uint32_t);
           std::string result(strsize, '\0');
-          m_data.CopyData(payload_offset, strsize, result.data());
+          m_data_nsp->CopyData(payload_offset, strsize, result.data());
           LLDB_LOGF(log, "LC_NOTE 'kern ver str' found with text '%s'",
                     result.c_str());
           return result;
@@ -5254,12 +5259,12 @@ std::string ObjectFileMachO::GetIdentifierString() {
     for (uint32_t i = 0; i < m_header.ncmds; ++i) {
       const uint32_t cmd_offset = offset;
       llvm::MachO::ident_command ident_command;
-      if (m_data.GetU32(&offset, &ident_command, 2) == nullptr)
+      if (m_data_nsp->GetU32(&offset, &ident_command, 2) == nullptr)
         break;
       if (ident_command.cmd == LC_IDENT && ident_command.cmdsize != 0) {
         std::string result(ident_command.cmdsize, '\0');
-        if (m_data.CopyData(offset, ident_command.cmdsize, result.data()) ==
-            ident_command.cmdsize) {
+        if (m_data_nsp->CopyData(offset, ident_command.cmdsize,
+                                 result.data()) == ident_command.cmdsize) {
           LLDB_LOGF(log, "LC_IDENT found with text '%s'", result.c_str());
           return result;
         }
@@ -5281,9 +5286,10 @@ AddressableBits ObjectFileMachO::GetAddressableBits() {
     for (auto lc_note : lc_notes) {
       offset_t payload_offset = std::get<0>(lc_note);
       uint32_t version;
-      if (m_data.GetU32(&payload_offset, &version, 1) != nullptr) {
+      if (m_data_nsp->GetU32(&payload_offset, &version, 1) != nullptr) {
         if (version == 3) {
-          uint32_t num_addr_bits = m_data.GetU32_unchecked(&payload_offset);
+          uint32_t num_addr_bits =
+              m_data_nsp->GetU32_unchecked(&payload_offset);
           addressable_bits.SetAddressableBits(num_addr_bits);
           LLDB_LOGF(log,
                     "LC_NOTE 'addrable bits' v3 found, value %d "
@@ -5291,8 +5297,8 @@ AddressableBits ObjectFileMachO::GetAddressableBits() {
                     num_addr_bits);
         }
         if (version == 4) {
-          uint32_t lo_addr_bits = m_data.GetU32_unchecked(&payload_offset);
-          uint32_t hi_addr_bits = m_data.GetU32_unchecked(&payload_offset);
+          uint32_t lo_addr_bits = m_data_nsp->GetU32_unchecked(&payload_offset);
+          uint32_t hi_addr_bits = m_data_nsp->GetU32_unchecked(&payload_offset);
 
           if (lo_addr_bits == hi_addr_bits)
             addressable_bits.SetAddressableBits(lo_addr_bits);
@@ -5363,25 +5369,26 @@ bool ObjectFileMachO::GetCorefileMainBinaryInfo(addr_t &value,
       //    uint32_t unused        [ for alignment ]
 
       uint32_t version;
-      if (m_data.GetU32(&payload_offset, &version, 1) != nullptr &&
+      if (m_data_nsp->GetU32(&payload_offset, &version, 1) != nullptr &&
           version <= 2) {
         uint32_t binspec_type = 0;
         uuid_t raw_uuid;
         memset(raw_uuid, 0, sizeof(uuid_t));
 
-        if (!m_data.GetU32(&payload_offset, &binspec_type, 1))
+        if (!m_data_nsp->GetU32(&payload_offset, &binspec_type, 1))
           return false;
-        if (!m_data.GetU64(&payload_offset, &value, 1))
+        if (!m_data_nsp->GetU64(&payload_offset, &value, 1))
           return false;
         uint64_t slide = LLDB_INVALID_ADDRESS;
-        if (version > 1 && !m_data.GetU64(&payload_offset, &slide, 1))
+        if (version > 1 && !m_data_nsp->GetU64(&payload_offset, &slide, 1))
           return false;
         if (value == LLDB_INVALID_ADDRESS && slide != LLDB_INVALID_ADDRESS) {
           value = slide;
           value_is_offset = true;
         }
 
-        if (m_data.CopyData(payload_offset, sizeof(uuid_t), raw_uuid) != 0) {
+        if (m_data_nsp->CopyData(payload_offset, sizeof(uuid_t), raw_uuid) !=
+            0) {
           uuid = UUID(raw_uuid, sizeof(uuid_t));
           // convert the "main bin spec" type into our
           // ObjectFile::BinaryType enum
@@ -5415,9 +5422,9 @@ bool ObjectFileMachO::GetCorefileMainBinaryInfo(addr_t &value,
                     version, type, typestr, value,
                     value_is_offset ? "true" : "false",
                     uuid.GetAsString().c_str());
-          if (!m_data.GetU32(&payload_offset, &log2_pagesize, 1))
+          if (!m_data_nsp->GetU32(&payload_offset, &log2_pagesize, 1))
             return false;
-          if (version > 1 && !m_data.GetU32(&payload_offset, &platform, 1))
+          if (version > 1 && !m_data_nsp->GetU32(&payload_offset, &platform, 1))
             return false;
           return true;
         }
@@ -5497,7 +5504,7 @@ StructuredData::ObjectSP ObjectFileMachO::GetCorefileProcessMetadata() {
 
   auto [payload_offset, strsize] = lc_notes[0];
   std::string buf(strsize, '\0');
-  if (m_data.CopyData(payload_offset, strsize, buf.data()) != strsize) {
+  if (m_data_nsp->CopyData(payload_offset, strsize, buf.data()) != strsize) {
     LLDB_LOGF(log,
               "Unable to read %" PRIu64
               " bytes of 'process metadata' LC_NOTE JSON contents",
@@ -5537,7 +5544,8 @@ ObjectFileMachO::GetThreadContextAtIndex(uint32_t idx,
         m_thread_context_offsets.GetEntryAtIndex(idx);
     if (thread_context_file_range) {
 
-      DataExtractor data(m_data, thread_context_file_range->GetRangeBase(),
+      DataExtractor data(*m_data_nsp.get(),
+                         thread_context_file_range->GetRangeBase(),
                          thread_context_file_range->GetByteSize());
 
       switch (m_header.cputype) {
@@ -5677,13 +5685,13 @@ llvm::VersionTuple ObjectFileMachO::GetVersion() {
     uint32_t i;
     for (i = 0; i < m_header.ncmds; ++i) {
       const lldb::offset_t cmd_offset = offset;
-      if (m_data.GetU32(&offset, &load_cmd, 2) == nullptr)
+      if (m_data_nsp->GetU32(&offset, &load_cmd, 2) == nullptr)
         break;
 
       if (load_cmd.cmd == LC_ID_DYLIB) {
         if (version_cmd == 0) {
           version_cmd = load_cmd.cmd;
-          if (m_data.GetU32(&offset, &load_cmd.dylib, 4) == nullptr)
+          if (m_data_nsp->GetU32(&offset, &load_cmd.dylib, 4) == nullptr)
             break;
           version = load_cmd.dylib.current_version;
         }
@@ -5709,7 +5717,7 @@ ArchSpec ObjectFileMachO::GetArchitecture() {
   if (module_sp) {
     std::lock_guard<std::recursive_mutex> guard(module_sp->GetMutex());
 
-    return GetArchitecture(module_sp, m_header, m_data,
+    return GetArchitecture(module_sp, m_header, *m_data_nsp.get(),
                            MachHeaderSizeFromMagic(m_header.magic));
   }
   return arch;
@@ -5880,14 +5888,16 @@ static llvm::VersionTuple FindMinimumVersionInfo(DataExtractor &data,
 llvm::VersionTuple ObjectFileMachO::GetMinimumOSVersion() {
   if (!m_min_os_version)
     m_min_os_version = FindMinimumVersionInfo(
-        m_data, MachHeaderSizeFromMagic(m_header.magic), m_header.ncmds);
+        *m_data_nsp.get(), MachHeaderSizeFromMagic(m_header.magic),
+        m_header.ncmds);
   return *m_min_os_version;
 }
 
 llvm::VersionTuple ObjectFileMachO::GetSDKVersion() {
   if (!m_sdk_versions)
     m_sdk_versions = FindMinimumVersionInfo(
-        m_data, MachHeaderSizeFromMagic(m_header.magic), m_header.ncmds);
+        *m_data_nsp.get(), MachHeaderSizeFromMagic(m_header.magic),
+        m_header.ncmds);
   return *m_sdk_versions;
 }
 
@@ -6702,12 +6712,12 @@ ObjectFileMachO::GetCorefileAllImageInfos() {
   for (auto lc_note : lc_notes) {
     offset_t payload_offset = std::get<0>(lc_note);
     // Read the struct all_image_infos_header.
-    uint32_t version = m_data.GetU32(&payload_offset);
+    uint32_t version = m_data_nsp->GetU32(&payload_offset);
     if (version != 1) {
       return image_infos;
     }
-    uint32_t imgcount = m_data.GetU32(&payload_offset);
-    uint64_t entries_fileoff = m_data.GetU64(&payload_offset);
+    uint32_t imgcount = m_data_nsp->GetU32(&payload_offset);
+    uint64_t entries_fileoff = m_data_nsp->GetU64(&payload_offset);
     // 'entries_size' is not used, nor is the 'unused' entry.
     //  offset += 4; // uint32_t entries_size;
     //  offset += 4; // uint32_t unused;
@@ -6717,17 +6727,18 @@ ObjectFileMachO::GetCorefileAllImageInfos() {
     payload_offset = entries_fileoff;
     for (uint32_t i = 0; i < imgcount; i++) {
       // Read the struct image_entry.
-      offset_t filepath_offset = m_data.GetU64(&payload_offset);
+      offset_t filepath_offset = m_data_nsp->GetU64(&payload_offset);
       uuid_t uuid;
-      memcpy(&uuid, m_data.GetData(&payload_offset, sizeof(uuid_t)),
+      memcpy(&uuid, m_data_nsp->GetData(&payload_offset, sizeof(uuid_t)),
              sizeof(uuid_t));
-      uint64_t load_address = m_data.GetU64(&payload_offset);
-      offset_t seg_addrs_offset = m_data.GetU64(&payload_offset);
-      uint32_t segment_count = m_data.GetU32(&payload_offset);
-      uint32_t currently_executing = m_data.GetU32(&payload_offset);
+      uint64_t load_address = m_data_nsp->GetU64(&payload_offset);
+      offset_t seg_addrs_offset = m_data_nsp->GetU64(&payload_offset);
+      uint32_t segment_count = m_data_nsp->GetU32(&payload_offset);
+      uint32_t currently_executing = m_data_nsp->GetU32(&payload_offset);
 
       MachOCorefileImageEntry image_entry;
-      image_entry.filename = (const char *)m_data.GetCStr(&filepath_offset);
+      image_entry.filename =
+          (const char *)m_data_nsp->GetCStr(&filepath_offset);
       image_entry.uuid = UUID(uuid, sizeof(uuid_t));
       image_entry.load_address = load_address;
       image_entry.currently_executing = currently_executing;
@@ -6735,10 +6746,10 @@ ObjectFileMachO::GetCorefileAllImageInfos() {
       offset_t seg_vmaddrs_offset = seg_addrs_offset;
       for (uint32_t j = 0; j < segment_count; j++) {
         char segname[17];
-        m_data.CopyData(seg_vmaddrs_offset, 16, segname);
+        m_data_nsp->CopyData(seg_vmaddrs_offset, 16, segname);
         segname[16] = '\0';
         seg_vmaddrs_offset += 16;
-        uint64_t vmaddr = m_data.GetU64(&seg_vmaddrs_offset);
+        uint64_t vmaddr = m_data_nsp->GetU64(&seg_vmaddrs_offset);
         seg_vmaddrs_offset += 8; /* unused */
 
         std::tuple<ConstString, addr_t> new_seg{ConstString(segname), vmaddr};
@@ -6757,14 +6768,14 @@ ObjectFileMachO::GetCorefileAllImageInfos() {
   lc_notes = FindLC_NOTEByName("load binary");
   for (auto lc_note : lc_notes) {
     offset_t payload_offset = std::get<0>(lc_note);
-    uint32_t version = m_data.GetU32(&payload_offset);
+    uint32_t version = m_data_nsp->GetU32(&payload_offset);
     if (version == 1) {
       uuid_t uuid;
-      memcpy(&uuid, m_data.GetData(&payload_offset, sizeof(uuid_t)),
+      memcpy(&uuid, m_data_nsp->GetData(&payload_offset, sizeof(uuid_t)),
              sizeof(uuid_t));
-      uint64_t load_address = m_data.GetU64(&payload_offset);
-      uint64_t slide = m_data.GetU64(&payload_offset);
-      std::string filename = m_data.GetCStr(&payload_offset);
+      uint64_t load_address = m_data_nsp->GetU64(&payload_offset);
+      uint64_t slide = m_data_nsp->GetU64(&payload_offset);
+      std::string filename = m_data_nsp->GetCStr(&payload_offset);
 
       MachOCorefileImageEntry image_entry;
       image_entry.filename = filename;
diff --git a/lldb/source/Plugins/ObjectFile/PECOFF/ObjectFilePECOFF.cpp b/lldb/source/Plugins/ObjectFile/PECOFF/ObjectFilePECOFF.cpp
index 244489ae06d65..f25ed51001474 100644
--- a/lldb/source/Plugins/ObjectFile/PECOFF/ObjectFilePECOFF.cpp
+++ b/lldb/source/Plugins/ObjectFile/PECOFF/ObjectFilePECOFF.cpp
@@ -396,7 +396,7 @@ bool ObjectFilePECOFF::CreateBinary() {
   Log *log = GetLog(LLDBLog::Object);
 
   auto binary = llvm::object::createBinary(llvm::MemoryBufferRef(
-      toStringRef(m_data.GetData()), m_file.GetFilename().GetStringRef()));
+      toStringRef(m_data_nsp->GetData()), m_file.GetFilename().GetStringRef()));
   if (!binary) {
     LLDB_LOG_ERROR(log, binary.takeError(),
                    "Failed to create binary for file ({1}): {0}", m_file);
@@ -442,20 +442,20 @@ bool ObjectFilePECOFF::ParseHeader() {
   if (module_sp) {
     std::lock_guard<std::recursive_mutex> guard(module_sp->GetMutex());
     m_sect_headers.clear();
-    m_data.SetByteOrder(eByteOrderLittle);
+    m_data_nsp->SetByteOrder(eByteOrderLittle);
     lldb::offset_t offset = 0;
 
-    if (ParseDOSHeader(m_data, m_dos_header)) {
+    if (ParseDOSHeader(*m_data_nsp.get(), m_dos_header)) {
       offset = m_dos_header.e_lfanew;
-      uint32_t pe_signature = m_data.GetU32(&offset);
+      uint32_t pe_signature = m_data_nsp->GetU32(&offset);
       if (pe_signature != IMAGE_NT_SIGNATURE)
         return false;
-      if (ParseCOFFHeader(m_data, &offset, m_coff_header)) {
+      if (ParseCOFFHeader(*m_data_nsp.get(), &offset, m_coff_header)) {
         if (m_coff_header.hdrsize > 0)
           ParseCOFFOptionalHeader(&offset);
         ParseSectionHeaders(offset);
       }
-      m_data.SetAddressByteSize(GetAddressByteSize());
+      m_data_nsp->SetAddressByteSize(GetAddressByteSize());
       return true;
     }
   }
@@ -602,57 +602,63 @@ bool ObjectFilePECOFF::ParseCOFFOptionalHeader(lldb::offset_t *offset_ptr) {
   const lldb::offset_t end_offset = *offset_ptr + m_coff_header.hdrsize;
   if (*offset_ptr < end_offset) {
     success = true;
-    m_coff_header_opt.magic = m_data.GetU16(offset_ptr);
-    m_coff_header_opt.major_linker_version = m_data.GetU8(offset_ptr);
-    m_coff_header_opt.minor_linker_version = m_data.GetU8(offset_ptr);
-    m_coff_header_opt.code_size = m_data.GetU32(offset_ptr);
-    m_coff_header_opt.data_size = m_data.GetU32(offset_ptr);
-    m_coff_header_opt.bss_size = m_data.GetU32(offset_ptr);
-    m_coff_header_opt.entry = m_data.GetU32(offset_ptr);
-    m_coff_header_opt.code_offset = m_data.GetU32(offset_ptr);
+    m_coff_header_opt.magic = m_data_nsp->GetU16(offset_ptr);
+    m_coff_header_opt.major_linker_version = m_data_nsp->GetU8(offset_ptr);
+    m_coff_header_opt.minor_linker_version = m_data_nsp->GetU8(offset_ptr);
+    m_coff_header_opt.code_size = m_data_nsp->GetU32(offset_ptr);
+    m_coff_header_opt.data_size = m_data_nsp->GetU32(offset_ptr);
+    m_coff_header_opt.bss_size = m_data_nsp->GetU32(offset_ptr);
+    m_coff_header_opt.entry = m_data_nsp->GetU32(offset_ptr);
+    m_coff_header_opt.code_offset = m_data_nsp->GetU32(offset_ptr);
 
     const uint32_t addr_byte_size = GetAddressByteSize();
 
     if (*offset_ptr < end_offset) {
       if (m_coff_header_opt.magic == OPT_HEADER_MAGIC_PE32) {
         // PE32 only
-        m_coff_header_opt.data_offset = m_data.GetU32(offset_ptr);
+        m_coff_header_opt.data_offset = m_data_nsp->GetU32(offset_ptr);
       } else
         m_coff_header_opt.data_offset = 0;
 
       if (*offset_ptr < end_offset) {
         m_coff_header_opt.image_base =
-            m_data.GetMaxU64(offset_ptr, addr_byte_size);
-        m_coff_header_opt.sect_alignment = m_data.GetU32(offset_ptr);
-        m_coff_header_opt.file_alignment = m_data.GetU32(offset_ptr);
-        m_coff_header_opt.major_os_system_version = m_data.GetU16(offset_ptr);
-        m_coff_header_opt.minor_os_system_version = m_data.GetU16(offset_ptr);
-        m_coff_header_opt.major_image_version = m_data.GetU16(offset_ptr);
-        m_coff_header_opt.minor_image_version = m_data.GetU16(offset_ptr);
-        m_coff_header_opt.major_subsystem_version = m_data.GetU16(offset_ptr);
-        m_coff_header_opt.minor_subsystem_version = m_data.GetU16(offset_ptr);
-        m_coff_header_opt.reserved1 = m_data.GetU32(offset_ptr);
-        m_coff_header_opt.image_size = m_data.GetU32(offset_ptr);
-        m_coff_header_opt.header_size = m_data.GetU32(offset_ptr);
-        m_coff_header_opt.checksum = m_data.GetU32(offset_ptr);
-        m_coff_header_opt.subsystem = m_data.GetU16(offset_ptr);
-        m_coff_header_opt.dll_flags = m_data.GetU16(offset_ptr);
+            m_data_nsp->GetMaxU64(offset_ptr, addr_byte_size);
+        m_coff_header_opt.sect_alignment = m_data_nsp->GetU32(offset_ptr);
+        m_coff_header_opt.file_alignment = m_data_nsp->GetU32(offset_ptr);
+        m_coff_header_opt.major_os_system_version =
+            m_data_nsp->GetU16(offset_ptr);
+        m_coff_header_opt.minor_os_system_version =
+            m_data_nsp->GetU16(offset_ptr);
+        m_coff_header_opt.major_image_version = m_data_nsp->GetU16(offset_ptr);
+        m_coff_header_opt.minor_image_version = m_data_nsp->GetU16(offset_ptr);
+        m_coff_header_opt.major_subsystem_version =
+            m_data_nsp->GetU16(offset_ptr);
+        m_coff_header_opt.minor_subsystem_version =
+            m_data_nsp->GetU16(offset_ptr);
+        m_coff_header_opt.reserved1 = m_data_nsp->GetU32(offset_ptr);
+        m_coff_header_opt.image_size = m_data_nsp->GetU32(offset_ptr);
+        m_coff_header_opt.header_size = m_data_nsp->GetU32(offset_ptr);
+        m_coff_header_opt.checksum = m_data_nsp->GetU32(offset_ptr);
+        m_coff_header_opt.subsystem = m_data_nsp->GetU16(offset_ptr);
+        m_coff_header_opt.dll_flags = m_data_nsp->GetU16(offset_ptr);
         m_coff_header_opt.stack_reserve_size =
-            m_data.GetMaxU64(offset_ptr, addr_byte_size);
+            m_data_nsp->GetMaxU64(offset_ptr, addr_byte_size);
         m_coff_header_opt.stack_commit_size =
-            m_data.GetMaxU64(offset_ptr, addr_byte_size);
+            m_data_nsp->GetMaxU64(offset_ptr, addr_byte_size);
         m_coff_header_opt.heap_reserve_size =
-            m_data.GetMaxU64(offset_ptr, addr_byte_size);
+            m_data_nsp->GetMaxU64(offset_ptr, addr_byte_size);
         m_coff_header_opt.heap_commit_size =
-            m_data.GetMaxU64(offset_ptr, addr_byte_size);
-        m_coff_header_opt.loader_flags = m_data.GetU32(offset_ptr);
-        uint32_t num_data_dir_entries = m_data.GetU32(offset_ptr);
+            m_data_nsp->GetMaxU64(offset_ptr, addr_byte_size);
+        m_coff_header_opt.loader_flags = m_data_nsp->GetU32(offset_ptr);
+        uint32_t num_data_dir_entries = m_data_nsp->GetU32(offset_ptr);
         m_coff_header_opt.data_dirs.clear();
         m_coff_header_opt.data_dirs.resize(num_data_dir_entries);
         uint32_t i;
         for (i = 0; i < num_data_dir_entries; i++) {
-          m_coff_header_opt.data_dirs[i].vmaddr = m_data.GetU32(offset_ptr);
-          m_coff_header_opt.data_dirs[i].vmsize = m_data.GetU32(offset_ptr);
+          m_coff_header_opt.data_dirs[i].vmaddr =
+              m_data_nsp->GetU32(offset_ptr);
+          m_coff_header_opt.data_dirs[i].vmsize =
+              m_data_nsp->GetU32(offset_ptr);
         }
 
         m_image_base = m_coff_header_opt.image_base;
@@ -684,8 +690,8 @@ DataExtractor ObjectFilePECOFF::ReadImageData(uint32_t offset, size_t size) {
   if (!size)
     return {};
 
-  if (m_data.ValidOffsetForDataOfSize(offset, size))
-    return DataExtractor(m_data, offset, size);
+  if (m_data_nsp->ValidOffsetForDataOfSize(offset, size))
+    return DataExtractor(*m_data_nsp.get(), offset, size);
 
   ProcessSP process_sp(m_process_wp.lock());
   DataExtractor data;
@@ -759,7 +765,7 @@ llvm::StringRef ObjectFilePECOFF::GetSectionName(const section_header_t &sect) {
       return "";
     lldb::offset_t string_file_offset =
         m_coff_header.symoff + (m_coff_header.nsyms * 18) + stroff;
-    if (const char *name = m_data.GetCStr(&string_file_offset))
+    if (const char *name = m_data_nsp->GetCStr(&string_file_offset))
       return name;
     return "";
   }
diff --git a/lldb/source/Plugins/ObjectFile/XCOFF/ObjectFileXCOFF.cpp b/lldb/source/Plugins/ObjectFile/XCOFF/ObjectFileXCOFF.cpp
index d2c46edaf28cb..bfe155609f3b1 100644
--- a/lldb/source/Plugins/ObjectFile/XCOFF/ObjectFileXCOFF.cpp
+++ b/lldb/source/Plugins/ObjectFile/XCOFF/ObjectFileXCOFF.cpp
@@ -94,7 +94,7 @@ bool ObjectFileXCOFF::CreateBinary() {
 
   Log *log = GetLog(LLDBLog::Object);
 
-  auto memory_ref = llvm::MemoryBufferRef(toStringRef(m_data.GetData()),
+  auto memory_ref = llvm::MemoryBufferRef(toStringRef(m_data_nsp->GetData()),
                                           m_file.GetFilename().GetStringRef());
   llvm::file_magic magic = llvm::identify_magic(memory_ref.getBuffer());
 
diff --git a/lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp b/lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
index 492b441867205..0bedb5e753b77 100644
--- a/lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
+++ b/lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
@@ -287,7 +287,7 @@ ObjectFileWasm::ObjectFileWasm(const ModuleSP &module_sp, DataBufferSP data_sp,
                                offset_t offset, offset_t length)
     : ObjectFile(module_sp, file, offset, length, data_sp, data_offset),
       m_arch("wasm32-unknown-unknown-wasm") {
-  m_data.SetAddressByteSize(4);
+  m_data_nsp->SetAddressByteSize(4);
 }
 
 ObjectFileWasm::ObjectFileWasm(const lldb::ModuleSP &module_sp,
@@ -719,11 +719,11 @@ DataExtractor ObjectFileWasm::ReadImageData(offset_t offset, uint32_t size) {
         DataBufferSP buffer_sp(data_up.release());
         data.SetData(buffer_sp, 0, buffer_sp->GetByteSize());
       }
-    } else if (offset < m_data.GetByteSize()) {
-      size =
-          std::min(static_cast<uint64_t>(size), m_data.GetByteSize() - offset);
-      return DataExtractor(m_data.GetDataStart() + offset, size, GetByteOrder(),
-                           GetAddressByteSize());
+    } else if (offset < m_data_nsp->GetByteSize()) {
+      size = std::min(static_cast<uint64_t>(size),
+                      m_data_nsp->GetByteSize() - offset);
+      return DataExtractor(m_data_nsp->GetDataStart() + offset, size,
+                           GetByteOrder(), GetAddressByteSize());
     }
   }
   data.SetByteOrder(GetByteOrder());
diff --git a/lldb/source/Plugins/Process/Windows/Common/ProcessWindows.cpp b/lldb/source/Plugins/Process/Windows/Common/ProcessWindows.cpp
index 0fecefe23b88e..4cc39f928ee1e 100644
--- a/lldb/source/Plugins/Process/Windows/Common/ProcessWindows.cpp
+++ b/lldb/source/Plugins/Process/Windows/Common/ProcessWindows.cpp
@@ -79,8 +79,10 @@ namespace lldb_private {
 
 ProcessSP ProcessWindows::CreateInstance(lldb::TargetSP target_sp,
                                          lldb::ListenerSP listener_sp,
-                                         const FileSpec *,
+                                         const FileSpec *crash_file_path,
                                          bool can_connect) {
+  if (crash_file_path)
+    return nullptr; // Cannot create a Windows process from a crash_file.
   return ProcessSP(new ProcessWindows(target_sp, listener_sp));
 }
 
diff --git a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
index 3c4d9a1f1ad37..1ba99d78aea32 100644
--- a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
+++ b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
@@ -210,11 +210,9 @@ void ProcessGDBRemote::Terminate() {
 lldb::ProcessSP ProcessGDBRemote::CreateInstance(
     lldb::TargetSP target_sp, ListenerSP listener_sp,
     const FileSpec *crash_file_path, bool can_connect) {
-  lldb::ProcessSP process_sp;
-  if (crash_file_path == nullptr)
-    process_sp = std::shared_ptr<ProcessGDBRemote>(
-        new ProcessGDBRemote(target_sp, listener_sp));
-  return process_sp;
+  if (crash_file_path)
+    return nullptr; // Cannot create a GDBRemote process from a crash_file.
+  return lldb::ProcessSP(new ProcessGDBRemote(target_sp, listener_sp));
 }
 
 void ProcessGDBRemote::DumpPluginHistory(Stream &s) {
diff --git a/lldb/source/Plugins/Process/scripted/ScriptedFrame.cpp b/lldb/source/Plugins/Process/scripted/ScriptedFrame.cpp
index 6519df9185df0..265bc28a8957f 100644
--- a/lldb/source/Plugins/Process/scripted/ScriptedFrame.cpp
+++ b/lldb/source/Plugins/Process/scripted/ScriptedFrame.cpp
@@ -7,42 +7,72 @@
 //===----------------------------------------------------------------------===//
 
 #include "ScriptedFrame.h"
-
+#include "Plugins/Process/Utility/RegisterContextMemory.h"
+
+#include "lldb/Core/Address.h"
+#include "lldb/Core/Debugger.h"
+#include "lldb/Interpreter/Interfaces/ScriptedFrameInterface.h"
+#include "lldb/Interpreter/Interfaces/ScriptedThreadInterface.h"
+#include "lldb/Interpreter/ScriptInterpreter.h"
+#include "lldb/Symbol/SymbolContext.h"
+#include "lldb/Target/ExecutionContext.h"
+#include "lldb/Target/Process.h"
+#include "lldb/Target/RegisterContext.h"
+#include "lldb/Target/Thread.h"
 #include "lldb/Utility/DataBufferHeap.h"
+#include "lldb/Utility/LLDBLog.h"
+#include "lldb/Utility/Log.h"
+#include "lldb/Utility/StructuredData.h"
 
 using namespace lldb;
 using namespace lldb_private;
 
+char ScriptedFrame::ID;
+
 void ScriptedFrame::CheckInterpreterAndScriptObject() const {
   lldbassert(m_script_object_sp && "Invalid Script Object.");
   lldbassert(GetInterface() && "Invalid Scripted Frame Interface.");
 }
 
 llvm::Expected<std::shared_ptr<ScriptedFrame>>
-ScriptedFrame::Create(ScriptedThread &thread,
+ScriptedFrame::Create(ThreadSP thread_sp,
+                      ScriptedThreadInterfaceSP scripted_thread_interface_sp,
                       StructuredData::DictionarySP args_sp,
                       StructuredData::Generic *script_object) {
-  if (!thread.IsValid())
-    return llvm::createStringError("Invalid scripted thread.");
+  if (!thread_sp || !thread_sp->IsValid())
+    return llvm::createStringError("invalid thread");
+
+  ProcessSP process_sp = thread_sp->GetProcess();
+  if (!process_sp || !process_sp->IsValid())
+    return llvm::createStringError("invalid process");
 
-  thread.CheckInterpreterAndScriptObject();
+  ScriptInterpreter *script_interp =
+      process_sp->GetTarget().GetDebugger().GetScriptInterpreter();
+  if (!script_interp)
+    return llvm::createStringError("no script interpreter");
 
-  auto scripted_frame_interface =
-      thread.GetInterface()->CreateScriptedFrameInterface();
+  auto scripted_frame_interface = script_interp->CreateScriptedFrameInterface();
   if (!scripted_frame_interface)
     return llvm::createStringError("failed to create scripted frame interface");
 
   llvm::StringRef frame_class_name;
   if (!script_object) {
-    std::optional<std::string> class_name =
-        thread.GetInterface()->GetScriptedFramePluginName();
-    if (!class_name || class_name->empty())
+    // If no script object is provided and we have a scripted thread interface,
+    // try to get the frame class name from it.
+    if (scripted_thread_interface_sp) {
+      std::optional<std::string> class_name =
+          scripted_thread_interface_sp->GetScriptedFramePluginName();
+      if (!class_name || class_name->empty())
+        return llvm::createStringError(
+            "failed to get scripted frame class name");
+      frame_class_name = *class_name;
+    } else {
       return llvm::createStringError(
-          "failed to get scripted thread class name");
-    frame_class_name = *class_name;
+          "no script object provided and no scripted thread interface");
+    }
   }
 
-  ExecutionContext exe_ctx(thread);
+  ExecutionContext exe_ctx(thread_sp);
   auto obj_or_err = scripted_frame_interface->CreatePluginObject(
       frame_class_name, exe_ctx, args_sp, script_object);
 
@@ -62,7 +92,7 @@ ScriptedFrame::Create(ScriptedThread &thread,
   SymbolContext sc;
   Address symbol_addr;
   if (pc != LLDB_INVALID_ADDRESS) {
-    symbol_addr.SetLoadAddress(pc, &thread.GetProcess()->GetTarget());
+    symbol_addr.SetLoadAddress(pc, &process_sp->GetTarget());
     symbol_addr.CalculateSymbolContext(&sc);
   }
 
@@ -77,11 +107,11 @@ ScriptedFrame::Create(ScriptedThread &thread,
 
   if (!reg_info)
     return llvm::createStringError(
-        "failed to get scripted thread registers info");
+        "failed to get scripted frame registers info");
 
   std::shared_ptr<DynamicRegisterInfo> register_info_sp =
-      DynamicRegisterInfo::Create(
-          *reg_info, thread.GetProcess()->GetTarget().GetArchitecture());
+      DynamicRegisterInfo::Create(*reg_info,
+                                  process_sp->GetTarget().GetArchitecture());
 
   lldb::RegisterContextSP reg_ctx_sp;
 
@@ -96,32 +126,35 @@ ScriptedFrame::Create(ScriptedThread &thread,
 
     std::shared_ptr<RegisterContextMemory> reg_ctx_memory =
         std::make_shared<RegisterContextMemory>(
-            thread, frame_id, *register_info_sp, LLDB_INVALID_ADDRESS);
+            *thread_sp, frame_id, *register_info_sp, LLDB_INVALID_ADDRESS);
     if (!reg_ctx_memory)
-      return llvm::createStringError("failed to create a register context.");
+      return llvm::createStringError("failed to create a register context");
 
     reg_ctx_memory->SetAllRegisterData(data_sp);
     reg_ctx_sp = reg_ctx_memory;
   }
 
   return std::make_shared<ScriptedFrame>(
-      thread, scripted_frame_interface, frame_id, pc, sc, reg_ctx_sp,
+      thread_sp, scripted_frame_interface, frame_id, pc, sc, reg_ctx_sp,
       register_info_sp, owned_script_object_sp);
 }
 
-ScriptedFrame::ScriptedFrame(ScriptedThread &thread,
+ScriptedFrame::ScriptedFrame(ThreadSP thread_sp,
                              ScriptedFrameInterfaceSP interface_sp,
                              lldb::user_id_t id, lldb::addr_t pc,
                              SymbolContext &sym_ctx,
                              lldb::RegisterContextSP reg_ctx_sp,
                              std::shared_ptr<DynamicRegisterInfo> reg_info_sp,
                              StructuredData::GenericSP script_object_sp)
-    : StackFrame(thread.shared_from_this(), /*frame_idx=*/id,
+    : StackFrame(thread_sp, /*frame_idx=*/id,
                  /*concrete_frame_idx=*/id, /*reg_context_sp=*/reg_ctx_sp,
                  /*cfa=*/0, /*pc=*/pc,
                  /*behaves_like_zeroth_frame=*/!id, /*symbol_ctx=*/&sym_ctx),
       m_scripted_frame_interface_sp(interface_sp),
-      m_script_object_sp(script_object_sp), m_register_info_sp(reg_info_sp) {}
+      m_script_object_sp(script_object_sp), m_register_info_sp(reg_info_sp) {
+  // FIXME: This should be part of the base class constructor.
+  m_stack_frame_kind = StackFrame::Kind::Synthetic;
+}
 
 ScriptedFrame::~ScriptedFrame() {}
 
@@ -164,7 +197,7 @@ std::shared_ptr<DynamicRegisterInfo> ScriptedFrame::GetDynamicRegisterInfo() {
     if (!reg_info)
       return ScriptedInterface::ErrorWithMessage<
           std::shared_ptr<DynamicRegisterInfo>>(
-          LLVM_PRETTY_FUNCTION, "Failed to get scripted frame registers info.",
+          LLVM_PRETTY_FUNCTION, "failed to get scripted frame registers info",
           error, LLDBLog::Thread);
 
     ThreadSP thread_sp = m_thread_wp.lock();
@@ -172,7 +205,7 @@ std::shared_ptr<DynamicRegisterInfo> ScriptedFrame::GetDynamicRegisterInfo() {
       return ScriptedInterface::ErrorWithMessage<
           std::shared_ptr<DynamicRegisterInfo>>(
           LLVM_PRETTY_FUNCTION,
-          "Failed to get scripted frame registers info: invalid thread.", error,
+          "failed to get scripted frame registers info: invalid thread", error,
           LLDBLog::Thread);
 
     ProcessSP process_sp = thread_sp->GetProcess();
@@ -180,8 +213,8 @@ std::shared_ptr<DynamicRegisterInfo> ScriptedFrame::GetDynamicRegisterInfo() {
       return ScriptedInterface::ErrorWithMessage<
           std::shared_ptr<DynamicRegisterInfo>>(
           LLVM_PRETTY_FUNCTION,
-          "Failed to get scripted frame registers info: invalid process.",
-          error, LLDBLog::Thread);
+          "failed to get scripted frame registers info: invalid process", error,
+          LLDBLog::Thread);
 
     m_register_info_sp = DynamicRegisterInfo::Create(
         *reg_info, process_sp->GetTarget().GetArchitecture());
diff --git a/lldb/source/Plugins/Process/scripted/ScriptedFrame.h b/lldb/source/Plugins/Process/scripted/ScriptedFrame.h
index b6b77c4a7d160..d1cbd429d4979 100644
--- a/lldb/source/Plugins/Process/scripted/ScriptedFrame.h
+++ b/lldb/source/Plugins/Process/scripted/ScriptedFrame.h
@@ -10,21 +10,19 @@
 #define LLDB_SOURCE_PLUGINS_SCRIPTED_FRAME_H
 
 #include "ScriptedThread.h"
-#include "lldb/Interpreter/ScriptInterpreter.h"
 #include "lldb/Target/DynamicRegisterInfo.h"
 #include "lldb/Target/StackFrame.h"
+#include "lldb/lldb-forward.h"
+#include "llvm/Support/Error.h"
+#include <memory>
 #include <string>
 
-namespace lldb_private {
-class ScriptedThread;
-}
-
 namespace lldb_private {
 
 class ScriptedFrame : public lldb_private::StackFrame {
 
 public:
-  ScriptedFrame(ScriptedThread &thread,
+  ScriptedFrame(lldb::ThreadSP thread_sp,
                 lldb::ScriptedFrameInterfaceSP interface_sp,
                 lldb::user_id_t frame_idx, lldb::addr_t pc,
                 SymbolContext &sym_ctx, lldb::RegisterContextSP reg_ctx_sp,
@@ -33,8 +31,29 @@ class ScriptedFrame : public lldb_private::StackFrame {
 
   ~ScriptedFrame() override;
 
+  /// Create a ScriptedFrame from an object instantiated in the script
+  /// interpreter.
+  ///
+  /// \param[in] thread_sp
+  ///     The thread this frame belongs to.
+  ///
+  /// \param[in] scripted_thread_interface_sp
+  ///     The scripted thread interface (needed for ScriptedThread
+  ///     compatibility). Can be nullptr for frames on real threads.
+  ///
+  /// \param[in] args_sp
+  ///     Arguments to pass to the frame creation.
+  ///
+  /// \param[in] script_object
+  ///     The optional script object representing this frame.
+  ///
+  /// \return
+  ///     An Expected containing the ScriptedFrame shared pointer if successful,
+  ///     otherwise an error.
   static llvm::Expected<std::shared_ptr<ScriptedFrame>>
-  Create(ScriptedThread &thread, StructuredData::DictionarySP args_sp,
+  Create(lldb::ThreadSP thread_sp,
+         lldb::ScriptedThreadInterfaceSP scripted_thread_interface_sp,
+         StructuredData::DictionarySP args_sp,
          StructuredData::Generic *script_object = nullptr);
 
   bool IsInlined() override;
@@ -43,6 +62,11 @@ class ScriptedFrame : public lldb_private::StackFrame {
   const char *GetFunctionName() override;
   const char *GetDisplayFunctionName() override;
 
+  bool isA(const void *ClassID) const override {
+    return ClassID == &ID || StackFrame::isA(ClassID);
+  }
+  static bool classof(const StackFrame *obj) { return obj->isA(&ID); }
+
 private:
   void CheckInterpreterAndScriptObject() const;
   lldb::ScriptedFrameInterfaceSP GetInterface() const;
@@ -55,6 +79,8 @@ class ScriptedFrame : public lldb_private::StackFrame {
   lldb::ScriptedFrameInterfaceSP m_scripted_frame_interface_sp;
   lldb_private::StructuredData::GenericSP m_script_object_sp;
   std::shared_ptr<DynamicRegisterInfo> m_register_info_sp;
+
+  static char ID;
 };
 
 } // namespace lldb_private
diff --git a/lldb/source/Plugins/Process/scripted/ScriptedThread.cpp b/lldb/source/Plugins/Process/scripted/ScriptedThread.cpp
index 491efac5aadef..1dd9c48f56a59 100644
--- a/lldb/source/Plugins/Process/scripted/ScriptedThread.cpp
+++ b/lldb/source/Plugins/Process/scripted/ScriptedThread.cpp
@@ -210,7 +210,7 @@ bool ScriptedThread::LoadArtificialStackFrames() {
     SymbolContext sc;
     symbol_addr.CalculateSymbolContext(&sc);
 
-    return std::make_shared<StackFrame>(this->shared_from_this(), idx, idx, cfa,
+    return std::make_shared<StackFrame>(shared_from_this(), idx, idx, cfa,
                                         cfa_is_valid, pc,
                                         StackFrame::Kind::Synthetic, artificial,
                                         behaves_like_zeroth_frame, &sc);
@@ -231,8 +231,8 @@ bool ScriptedThread::LoadArtificialStackFrames() {
       return error.ToError();
     }
 
-    auto frame_or_error =
-        ScriptedFrame::Create(*this, nullptr, object_sp->GetAsGeneric());
+    auto frame_or_error = ScriptedFrame::Create(
+        shared_from_this(), GetInterface(), nullptr, object_sp->GetAsGeneric());
 
     if (!frame_or_error) {
       ScriptedInterface::ErrorWithMessage<bool>(
diff --git a/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptInterpreterPythonInterfaces.cpp b/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptInterpreterPythonInterfaces.cpp
index d43036d6fe544..f6c707b2bd168 100644
--- a/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptInterpreterPythonInterfaces.cpp
+++ b/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptInterpreterPythonInterfaces.cpp
@@ -31,6 +31,7 @@ void ScriptInterpreterPythonInterfaces::Initialize() {
   ScriptedStopHookPythonInterface::Initialize();
   ScriptedBreakpointPythonInterface::Initialize();
   ScriptedThreadPlanPythonInterface::Initialize();
+  ScriptedFrameProviderPythonInterface::Initialize();
 }
 
 void ScriptInterpreterPythonInterfaces::Terminate() {
@@ -40,6 +41,7 @@ void ScriptInterpreterPythonInterfaces::Terminate() {
   ScriptedStopHookPythonInterface::Terminate();
   ScriptedBreakpointPythonInterface::Terminate();
   ScriptedThreadPlanPythonInterface::Terminate();
+  ScriptedFrameProviderPythonInterface::Terminate();
 }
 
 #endif
diff --git a/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedFrameProviderPythonInterface.cpp b/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedFrameProviderPythonInterface.cpp
index b866bf332b7b6..3dde5036453f4 100644
--- a/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedFrameProviderPythonInterface.cpp
+++ b/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedFrameProviderPythonInterface.cpp
@@ -6,6 +6,7 @@
 //
 //===----------------------------------------------------------------------===//
 
+#include "lldb/Core/PluginManager.h"
 #include "lldb/Host/Config.h"
 #include "lldb/Target/Thread.h"
 #include "lldb/Utility/Log.h"
@@ -30,18 +31,45 @@ ScriptedFrameProviderPythonInterface::ScriptedFrameProviderPythonInterface(
     ScriptInterpreterPythonImpl &interpreter)
     : ScriptedFrameProviderInterface(), ScriptedPythonInterface(interpreter) {}
 
+bool ScriptedFrameProviderPythonInterface::AppliesToThread(
+    llvm::StringRef class_name, lldb::ThreadSP thread_sp) {
+  // If there is any issue calling this method, just assume the provider also
+  // applies to this thread, which is the default behavior.
+  constexpr bool fail_value = true;
+  Status error;
+  StructuredData::ObjectSP obj =
+      CallStaticMethod(class_name, "applies_to_thread", error, thread_sp);
+  if (!ScriptedInterface::CheckStructuredDataObject(LLVM_PRETTY_FUNCTION, obj,
+                                                    error))
+    return fail_value;
+
+  return obj->GetBooleanValue(fail_value);
+}
+
 llvm::Expected<StructuredData::GenericSP>
 ScriptedFrameProviderPythonInterface::CreatePluginObject(
     const llvm::StringRef class_name, lldb::StackFrameListSP input_frames,
     StructuredData::DictionarySP args_sp) {
   if (!input_frames)
-    return llvm::createStringError("Invalid frame list");
+    return llvm::createStringError("invalid frame list");
 
   StructuredDataImpl sd_impl(args_sp);
   return ScriptedPythonInterface::CreatePluginObject(class_name, nullptr,
                                                      input_frames, sd_impl);
 }
 
+std::string ScriptedFrameProviderPythonInterface::GetDescription(
+    llvm::StringRef class_name) {
+  Status error;
+  StructuredData::ObjectSP obj =
+      CallStaticMethod(class_name, "get_description", error);
+  if (!ScriptedInterface::CheckStructuredDataObject(LLVM_PRETTY_FUNCTION, obj,
+                                                    error))
+    return {};
+
+  return obj->GetStringValue().str();
+}
+
 StructuredData::ObjectSP
 ScriptedFrameProviderPythonInterface::GetFrameAtIndex(uint32_t index) {
   Status error;
@@ -54,4 +82,32 @@ ScriptedFrameProviderPythonInterface::GetFrameAtIndex(uint32_t index) {
   return obj;
 }
 
+bool ScriptedFrameProviderPythonInterface::CreateInstance(
+    lldb::ScriptLanguage language, ScriptedInterfaceUsages usages) {
+  if (language != eScriptLanguagePython)
+    return false;
+
+  return true;
+}
+
+void ScriptedFrameProviderPythonInterface::Initialize() {
+  const std::vector<llvm::StringRef> ci_usages = {
+      "target frame-provider register -C <script-name> [-k key -v value ...]",
+      "target frame-provider list",
+      "target frame-provider remove <provider-name>",
+      "target frame-provider clear"};
+  const std::vector<llvm::StringRef> api_usages = {
+      "SBTarget.RegisterScriptedFrameProvider",
+      "SBTarget.RemoveScriptedFrameProvider",
+      "SBTarget.ClearScriptedFrameProvider"};
+  PluginManager::RegisterPlugin(
+      GetPluginNameStatic(),
+      llvm::StringRef("Provide scripted stack frames for threads"),
+      CreateInstance, eScriptLanguagePython, {ci_usages, api_usages});
+}
+
+void ScriptedFrameProviderPythonInterface::Terminate() {
+  PluginManager::UnregisterPlugin(CreateInstance);
+}
+
 #endif
diff --git a/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedFrameProviderPythonInterface.h b/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedFrameProviderPythonInterface.h
index fd163984028d3..97a5cc7c669ea 100644
--- a/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedFrameProviderPythonInterface.h
+++ b/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedFrameProviderPythonInterface.h
@@ -14,17 +14,22 @@
 #if LLDB_ENABLE_PYTHON
 
 #include "ScriptedPythonInterface.h"
+#include "lldb/Core/PluginInterface.h"
 #include "lldb/Interpreter/Interfaces/ScriptedFrameProviderInterface.h"
 #include <optional>
 
 namespace lldb_private {
 class ScriptedFrameProviderPythonInterface
     : public ScriptedFrameProviderInterface,
-      public ScriptedPythonInterface {
+      public ScriptedPythonInterface,
+      public PluginInterface {
 public:
   ScriptedFrameProviderPythonInterface(
       ScriptInterpreterPythonImpl &interpreter);
 
+  bool AppliesToThread(llvm::StringRef class_name,
+                       lldb::ThreadSP thread_sp) override;
+
   llvm::Expected<StructuredData::GenericSP>
   CreatePluginObject(llvm::StringRef class_name,
                      lldb::StackFrameListSP input_frames,
@@ -33,10 +38,24 @@ class ScriptedFrameProviderPythonInterface
   llvm::SmallVector<AbstractMethodRequirement>
   GetAbstractMethodRequirements() const override {
     return llvm::SmallVector<AbstractMethodRequirement>(
-        {{"get_frame_at_index"}});
+        {{"get_description"}, {"get_frame_at_index"}});
   }
 
+  std::string GetDescription(llvm::StringRef class_name) override;
+
   StructuredData::ObjectSP GetFrameAtIndex(uint32_t index) override;
+
+  static void Initialize();
+  static void Terminate();
+
+  static bool CreateInstance(lldb::ScriptLanguage language,
+                             ScriptedInterfaceUsages usages);
+
+  static llvm::StringRef GetPluginNameStatic() {
+    return "ScriptedFrameProviderPythonInterface";
+  }
+
+  llvm::StringRef GetPluginName() override { return GetPluginNameStatic(); }
 };
 } // namespace lldb_private
 
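For context on how the new hooks above fit together, a rough Python sketch of a frame provider class follows. Only the method names (get_description, get_frame_at_index, and the optional static applies_to_thread) are taken from this patch; the constructor signature and the shape of the dictionary returned by get_frame_at_index are assumptions for illustration.

  # Illustrative sketch only -- not part of this patch.
  class MyFrameProvider:
      def __init__(self, input_frames, args):
          # Assumed to receive the thread's original frame list plus the
          # args dictionary forwarded through CreatePluginObject.
          self.input_frames = input_frames
          self.args = args

      @staticmethod
      def get_description():
          # Resolved and called without an instance via CallStaticMethod.
          return "Example provider that synthesizes a single frame."

      @staticmethod
      def applies_to_thread(thread):
          # Optional filter; any failure on the C++ side defaults to True.
          return True

      def get_frame_at_index(self, index):
          # Returning None is assumed to end the synthetic frame list;
          # the dictionary keys below are assumptions, not part of the patch.
          if index > 0:
              return None
          return {"idx": 0, "pc": 0x100003F80}

Per the CI usage strings registered in Initialize() above, such a class would presumably be hooked up with "target frame-provider register -C <script-name>".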
diff --git a/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedPythonInterface.cpp b/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedPythonInterface.cpp
index af2e0b5df4d22..ba4473cf9ec4d 100644
--- a/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedPythonInterface.cpp
+++ b/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedPythonInterface.cpp
@@ -93,6 +93,19 @@ ScriptedPythonInterface::ExtractValueFromPythonObject<lldb::StackFrameSP>(
   return nullptr;
 }
 
+template <>
+lldb::ThreadSP
+ScriptedPythonInterface::ExtractValueFromPythonObject<lldb::ThreadSP>(
+    python::PythonObject &p, Status &error) {
+  if (lldb::SBThread *sb_thread = reinterpret_cast<lldb::SBThread *>(
+          python::LLDBSWIGPython_CastPyObjectToSBThread(p.get())))
+    return m_interpreter.GetOpaqueTypeFromSBThread(*sb_thread);
+  error = Status::FromErrorString(
+      "Couldn't cast lldb::SBThread to lldb_private::Thread.");
+
+  return nullptr;
+}
+
 template <>
 SymbolContext
 ScriptedPythonInterface::ExtractValueFromPythonObject<SymbolContext>(
diff --git a/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedPythonInterface.h b/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedPythonInterface.h
index af88a69e34a13..53a7ba65f64b7 100644
--- a/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedPythonInterface.h
+++ b/lldb/source/Plugins/ScriptInterpreter/Python/Interfaces/ScriptedPythonInterface.h
@@ -41,7 +41,7 @@ class ScriptedPythonInterface : virtual public ScriptedInterface {
     eValid
   };
 
-  struct AbstrackMethodCheckerPayload {
+  struct AbstractMethodCheckerPayload {
 
     struct InvalidArgumentCountPayload {
       InvalidArgumentCountPayload(size_t required, size_t actual)
@@ -55,13 +55,69 @@ class ScriptedPythonInterface : virtual public ScriptedInterface {
     std::variant<std::monostate, InvalidArgumentCountPayload> payload;
   };
 
-  llvm::Expected<std::map<llvm::StringLiteral, AbstrackMethodCheckerPayload>>
+  llvm::Expected<FileSpec> GetScriptedModulePath() override {
+    using namespace python;
+    using Locker = ScriptInterpreterPythonImpl::Locker;
+
+    Locker py_lock(&m_interpreter, Locker::AcquireLock | Locker::NoSTDIN,
+                   Locker::FreeLock);
+
+    if (!m_object_instance_sp)
+      return llvm::createStringError("scripted Interface has invalid object");
+
+    PythonObject py_obj =
+        PythonObject(PyRefType::Borrowed,
+                     static_cast<PyObject *>(m_object_instance_sp->GetValue()));
+
+    if (!py_obj.IsAllocated())
+      return llvm::createStringError(
+          "scripted Interface has invalid python object");
+
+    PythonObject py_obj_class = py_obj.GetAttributeValue("__class__");
+    if (!py_obj_class.IsValid())
+      return llvm::createStringError(
+          "scripted Interface python object is missing '__class__' attribute");
+
+    PythonObject py_obj_module = py_obj_class.GetAttributeValue("__module__");
+    if (!py_obj_module.IsValid())
+      return llvm::createStringError(
+          "scripted Interface python object '__class__' is missing "
+          "'__module__' attribute");
+
+    PythonString py_obj_module_str = py_obj_module.Str();
+    if (!py_obj_module_str.IsValid())
+      return llvm::createStringError(
+          "scripted Interface python object '__class__.__module__' attribute "
+          "is not a string");
+
+    llvm::StringRef py_obj_module_str_ref = py_obj_module_str.GetString();
+    PythonModule py_module = PythonModule::AddModule(py_obj_module_str_ref);
+    if (!py_module.IsValid())
+      return llvm::createStringError("failed to import '%s' module",
+                                     py_obj_module_str_ref.data());
+
+    PythonObject py_module_file = py_module.GetAttributeValue("__file__");
+    if (!py_module_file.IsValid())
+      return llvm::createStringError(
+          "module '%s' is missing '__file__' attribute",
+          py_obj_module_str_ref.data());
+
+    PythonString py_module_file_str = py_module_file.Str();
+    if (!py_module_file_str.IsValid())
+      return llvm::createStringError(
+          "module '%s.__file__' attribute is not a string",
+          py_obj_module_str_ref.data());
+
+    return FileSpec(py_module_file_str.GetString());
+  }
+
+  llvm::Expected<std::map<llvm::StringLiteral, AbstractMethodCheckerPayload>>
   CheckAbstractMethodImplementation(
       const python::PythonDictionary &class_dict) const {
 
     using namespace python;
 
-    std::map<llvm::StringLiteral, AbstrackMethodCheckerPayload> checker;
+    std::map<llvm::StringLiteral, AbstractMethodCheckerPayload> checker;
 #define SET_CASE_AND_CONTINUE(method_name, case)                               \
   {                                                                            \
     checker[method_name] = {case, {}};                                         \
@@ -74,7 +130,8 @@ class ScriptedPythonInterface : virtual public ScriptedInterface {
       if (!class_dict.HasKey(method_name))
         SET_CASE_AND_CONTINUE(method_name,
                               AbstractMethodCheckerCases::eNotImplemented)
-      auto callable_or_err = class_dict.GetItem(method_name);
+      llvm::Expected<PythonObject> callable_or_err =
+          class_dict.GetItem(method_name);
       if (!callable_or_err) {
         llvm::consumeError(callable_or_err.takeError());
         SET_CASE_AND_CONTINUE(method_name,
@@ -102,7 +159,7 @@ class ScriptedPythonInterface : virtual public ScriptedInterface {
       } else {
         checker[method_name] = {
             AbstractMethodCheckerCases::eInvalidArgumentCount,
-            AbstrackMethodCheckerPayload::InvalidArgumentCountPayload(
+            AbstractMethodCheckerPayload::InvalidArgumentCountPayload(
                 requirement.min_arg_count, arg_info.max_positional_args)};
       }
     }
@@ -291,7 +348,7 @@ class ScriptedPythonInterface : virtual public ScriptedInterface {
       case AbstractMethodCheckerCases::eInvalidArgumentCount: {
         auto &payload_variant = method_checker.second.payload;
         if (!std::holds_alternative<
-                AbstrackMethodCheckerPayload::InvalidArgumentCountPayload>(
+                AbstractMethodCheckerPayload::InvalidArgumentCountPayload>(
                 payload_variant)) {
           abstract_method_errors = llvm::joinErrors(
               std::move(abstract_method_errors),
@@ -300,7 +357,7 @@ class ScriptedPythonInterface : virtual public ScriptedInterface {
                   obj_class_name.GetString(), method_checker.first)));
         } else {
           auto payload = std::get<
-              AbstrackMethodCheckerPayload::InvalidArgumentCountPayload>(
+              AbstractMethodCheckerPayload::InvalidArgumentCountPayload>(
               payload_variant);
           abstract_method_errors = llvm::joinErrors(
               std::move(abstract_method_errors),
@@ -330,6 +387,112 @@ class ScriptedPythonInterface : virtual public ScriptedInterface {
     return m_object_instance_sp;
   }
 
+  /// Call a static method on a Python class without creating an instance.
+  ///
+  /// This method resolves a Python class by name and calls a static method
+  /// on it, returning the result. This is useful for calling class-level
+  /// methods that don't require an instance.
+  ///
+  /// \param class_name The fully-qualified name of the Python class.
+  /// \param method_name The name of the static method to call.
+  /// \param error Output parameter to receive error information if the call
+  /// fails.
+  /// \param args Arguments to pass to the static method.
+  ///
+  /// \return The return value of the static method call, or an error value.
+  template <typename T = StructuredData::ObjectSP, typename... Args>
+  T CallStaticMethod(llvm::StringRef class_name, llvm::StringRef method_name,
+                     Status &error, Args &&...args) {
+    using namespace python;
+    using Locker = ScriptInterpreterPythonImpl::Locker;
+
+    std::string caller_signature =
+        llvm::Twine(LLVM_PRETTY_FUNCTION + llvm::Twine(" (") +
+                    llvm::Twine(class_name) + llvm::Twine(".") +
+                    llvm::Twine(method_name) + llvm::Twine(")"))
+            .str();
+
+    if (class_name.empty())
+      return ErrorWithMessage<T>(caller_signature, "missing script class name",
+                                 error);
+
+    Locker py_lock(&m_interpreter, Locker::AcquireLock | Locker::NoSTDIN,
+                   Locker::FreeLock);
+
+    // Get the interpreter dictionary.
+    auto dict =
+        PythonModule::MainModule().ResolveName<python::PythonDictionary>(
+            m_interpreter.GetDictionaryName());
+    if (!dict.IsAllocated())
+      return ErrorWithMessage<T>(
+          caller_signature,
+          llvm::formatv("could not find interpreter dictionary: {0}",
+                        m_interpreter.GetDictionaryName())
+              .str(),
+          error);
+
+    // Resolve the class.
+    auto class_obj =
+        PythonObject::ResolveNameWithDictionary<python::PythonCallable>(
+            class_name, dict);
+    if (!class_obj.IsAllocated())
+      return ErrorWithMessage<T>(
+          caller_signature,
+          llvm::formatv("could not find script class: {0}", class_name).str(),
+          error);
+
+    // Get the static method from the class.
+    if (!class_obj.HasAttribute(method_name))
+      return ErrorWithMessage<T>(
+          caller_signature,
+          llvm::formatv("class {0} does not have method {1}", class_name,
+                        method_name)
+              .str(),
+          error);
+
+    PythonCallable method =
+        class_obj.GetAttributeValue(method_name).AsType<PythonCallable>();
+    if (!method.IsAllocated())
+      return ErrorWithMessage<T>(caller_signature,
+                                 llvm::formatv("method {0}.{1} is not callable",
+                                               class_name, method_name)
+                                     .str(),
+                                 error);
+
+    // Transform the arguments.
+    std::tuple<Args...> original_args = std::forward_as_tuple(args...);
+    auto transformed_args = TransformArgs(original_args);
+
+    // Call the static method.
+    llvm::Expected<PythonObject> expected_return_object =
+        llvm::make_error<llvm::StringError>("Not initialized.",
+                                            llvm::inconvertibleErrorCode());
+    std::apply(
+        [&method, &expected_return_object](auto &&...args) {
+          llvm::consumeError(expected_return_object.takeError());
+          expected_return_object = method(args...);
+        },
+        transformed_args);
+
+    if (llvm::Error e = expected_return_object.takeError()) {
+      error = Status::FromError(std::move(e));
+      return ErrorWithMessage<T>(
+          caller_signature, "python static method could not be called", error);
+    }
+
+    PythonObject py_return = std::move(expected_return_object.get());
+
+    // Re-assign reference and pointer arguments if needed.
+    if (sizeof...(Args) > 0)
+      if (!ReassignPtrsOrRefsArgs(original_args, transformed_args))
+        return ErrorWithMessage<T>(
+            caller_signature,
+            "couldn't re-assign reference and pointer arguments", error);
+
+    // Extract value from Python object (handles unallocated case).
+    return ExtractValueFromPythonObject<T>(py_return, error);
+  }
+
 protected:
   template <typename T = StructuredData::ObjectSP>
   T ExtractValueFromPythonObject(python::PythonObject &p, Status &error) {
@@ -346,7 +509,7 @@ class ScriptedPythonInterface : virtual public ScriptedInterface {
                     llvm::Twine(method_name) + llvm::Twine(")"))
             .str();
     if (!m_object_instance_sp)
-      return ErrorWithMessage<T>(caller_signature, "Python object ill-formed",
+      return ErrorWithMessage<T>(caller_signature, "python object ill-formed",
                                  error);
 
     Locker py_lock(&m_interpreter, Locker::AcquireLock | Locker::NoSTDIN,
@@ -358,7 +521,7 @@ class ScriptedPythonInterface : virtual public ScriptedInterface {
     if (!implementor.IsAllocated())
       return llvm::is_contained(GetAbstractMethods(), method_name)
                  ? ErrorWithMessage<T>(caller_signature,
-                                       "Python implementor not allocated.",
+                                       "python implementor not allocated",
                                        error)
                  : T{};
 
@@ -379,20 +542,20 @@ class ScriptedPythonInterface : virtual public ScriptedInterface {
     if (llvm::Error e = expected_return_object.takeError()) {
       error = Status::FromError(std::move(e));
       return ErrorWithMessage<T>(caller_signature,
-                                 "Python method could not be called.", error);
+                                 "python method could not be called", error);
     }
 
     PythonObject py_return = std::move(expected_return_object.get());
 
     // Now that we called the python method with the transformed arguments,
-    // we need to interate again over both the original and transformed
+    // we need to iterate again over both the original and transformed
     // parameter pack, and transform back the parameter that were passed in
     // the original parameter pack as references or pointers.
     if (sizeof...(Args) > 0)
       if (!ReassignPtrsOrRefsArgs(original_args, transformed_args))
         return ErrorWithMessage<T>(
             caller_signature,
-            "Couldn't re-assign reference and pointer arguments.", error);
+            "couldn't re-assign reference and pointer arguments", error);
 
     if (!py_return.IsAllocated())
       return {};
@@ -598,6 +761,11 @@ lldb::StreamSP
 ScriptedPythonInterface::ExtractValueFromPythonObject<lldb::StreamSP>(
     python::PythonObject &p, Status &error);
 
+template <>
+lldb::ThreadSP
+ScriptedPythonInterface::ExtractValueFromPythonObject<lldb::ThreadSP>(
+    python::PythonObject &p, Status &error);
+
 template <>
 lldb::StackFrameSP
 ScriptedPythonInterface::ExtractValueFromPythonObject<lldb::StackFrameSP>(
diff --git a/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp b/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp
index a2a287a6714db..d2f795cb5a20a 100644
--- a/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp
+++ b/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp
@@ -810,6 +810,17 @@ bool PythonCallable::Check(PyObject *py_obj) {
   if (!py_obj)
     return false;
 
+  PythonObject python_obj(PyRefType::Borrowed, py_obj);
+
+  // Handle staticmethod/classmethod descriptors by extracting the
+  // `__func__` attribute.
+  if (python_obj.HasAttribute("__func__")) {
+    PythonObject function_obj = python_obj.GetAttributeValue("__func__");
+    if (!function_obj.IsAllocated())
+      return false;
+    return PyCallable_Check(function_obj.release());
+  }
+
   return PyCallable_Check(py_obj);
 }
 
diff --git a/lldb/source/Plugins/ScriptInterpreter/Python/SWIGPythonBridge.h b/lldb/source/Plugins/ScriptInterpreter/Python/SWIGPythonBridge.h
index 2c971262fc34e..32948ffd30023 100644
--- a/lldb/source/Plugins/ScriptInterpreter/Python/SWIGPythonBridge.h
+++ b/lldb/source/Plugins/ScriptInterpreter/Python/SWIGPythonBridge.h
@@ -265,6 +265,7 @@ void *LLDBSWIGPython_CastPyObjectToSBLaunchInfo(PyObject *data);
 void *LLDBSWIGPython_CastPyObjectToSBError(PyObject *data);
 void *LLDBSWIGPython_CastPyObjectToSBEvent(PyObject *data);
 void *LLDBSWIGPython_CastPyObjectToSBStream(PyObject *data);
+void *LLDBSWIGPython_CastPyObjectToSBThread(PyObject *data);
 void *LLDBSWIGPython_CastPyObjectToSBFrame(PyObject *data);
 void *LLDBSWIGPython_CastPyObjectToSBSymbolContext(PyObject *data);
 void *LLDBSWIGPython_CastPyObjectToSBValue(PyObject *data);
diff --git a/lldb/source/Plugins/SymbolFile/DWARF/DWARFASTParserClang.cpp b/lldb/source/Plugins/SymbolFile/DWARF/DWARFASTParserClang.cpp
index 36aa49ac3de95..d65aa40b5be86 100644
--- a/lldb/source/Plugins/SymbolFile/DWARF/DWARFASTParserClang.cpp
+++ b/lldb/source/Plugins/SymbolFile/DWARF/DWARFASTParserClang.cpp
@@ -623,6 +623,7 @@ TypeSP DWARFASTParserClang::ParseTypeFromDWARF(const SymbolContext &sc,
 
     switch (tag) {
     case DW_TAG_typedef:
+    case DW_TAG_template_alias:
     case DW_TAG_base_type:
     case DW_TAG_pointer_type:
     case DW_TAG_reference_type:
@@ -748,7 +749,7 @@ DWARFASTParserClang::ParseTypeModifier(const SymbolContext &sc,
   TypeSP type_sp;
   CompilerType clang_type;
 
-  if (tag == DW_TAG_typedef) {
+  if (tag == DW_TAG_typedef || tag == DW_TAG_template_alias) {
     // DeclContext will be populated when the clang type is materialized in
     // Type::ResolveCompilerType.
     PrepareContextToReceiveMembers(
@@ -836,6 +837,7 @@ DWARFASTParserClang::ParseTypeModifier(const SymbolContext &sc,
     encoding_data_type = Type::eEncodingIsRValueReferenceUID;
     break;
   case DW_TAG_typedef:
+  case DW_TAG_template_alias:
     encoding_data_type = Type::eEncodingIsTypedefUID;
     break;
   case DW_TAG_const_type:
@@ -3707,12 +3709,10 @@ bool DWARFASTParserClang::CopyUniqueClassMethodTypes(
     }
   }
 
-  DWARFASTParserClang *src_dwarf_ast_parser =
-      static_cast<DWARFASTParserClang *>(
-          SymbolFileDWARF::GetDWARFParser(*src_class_die.GetCU()));
-  DWARFASTParserClang *dst_dwarf_ast_parser =
-      static_cast<DWARFASTParserClang *>(
-          SymbolFileDWARF::GetDWARFParser(*dst_class_die.GetCU()));
+  auto *src_dwarf_ast_parser = llvm::cast<DWARFASTParserClang>(
+      SymbolFileDWARF::GetDWARFParser(*src_class_die.GetCU()));
+  auto *dst_dwarf_ast_parser = llvm::cast<DWARFASTParserClang>(
+      SymbolFileDWARF::GetDWARFParser(*dst_class_die.GetCU()));
   auto link = [&](DWARFDIE src, DWARFDIE dst) {
     auto &die_to_type = dst_class_die.GetDWARF()->GetDIEToType();
     clang::DeclContext *dst_decl_ctx =
diff --git a/lldb/source/Plugins/SymbolFile/DWARF/DWARFASTParserClang.h b/lldb/source/Plugins/SymbolFile/DWARF/DWARFASTParserClang.h
index f5f707129d67d..6eb2b6b48787b 100644
--- a/lldb/source/Plugins/SymbolFile/DWARF/DWARFASTParserClang.h
+++ b/lldb/source/Plugins/SymbolFile/DWARF/DWARFASTParserClang.h
@@ -47,6 +47,11 @@ class DWARFASTParserClang : public lldb_private::plugin::dwarf::DWARFASTParser {
 
   ~DWARFASTParserClang() override;
 
+  // LLVM RTTI support
+  static bool classof(const DWARFASTParser *Parser) {
+    return Parser->GetKind() == Kind::DWARFASTParserClang;
+  }
+
   // DWARFASTParser interface.
   lldb::TypeSP
   ParseTypeFromDWARF(const lldb_private::SymbolContext &sc,
@@ -264,10 +269,6 @@ class DWARFASTParserClang : public lldb_private::plugin::dwarf::DWARFASTParser {
   lldb::ModuleSP
   GetModuleForType(const lldb_private::plugin::dwarf::DWARFDIE &die);
 
-  static bool classof(const DWARFASTParser *Parser) {
-    return Parser->GetKind() == Kind::DWARFASTParserClang;
-  }
-
 private:
   struct FieldInfo {
     /// Size in bits that this field occupies. Can but
diff --git a/lldb/source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp b/lldb/source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
index ca8e74337733c..7ba765371c54f 100644
--- a/lldb/source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
+++ b/lldb/source/Plugins/SymbolFile/DWARF/SymbolFileDWARF.cpp
@@ -1560,8 +1560,8 @@ bool SymbolFileDWARF::HasForwardDeclForCompilerType(
   auto clang_type_system = compiler_type.GetTypeSystem<TypeSystemClang>();
   if (!clang_type_system)
     return false;
-  DWARFASTParserClang *ast_parser =
-      static_cast<DWARFASTParserClang *>(clang_type_system->GetDWARFParser());
+  auto *ast_parser =
+      llvm::cast<DWARFASTParserClang>(clang_type_system->GetDWARFParser());
   return ast_parser->GetClangASTImporter().CanImport(compiler_type);
 }
 
@@ -1569,8 +1569,8 @@ bool SymbolFileDWARF::CompleteType(CompilerType &compiler_type) {
   std::lock_guard<std::recursive_mutex> guard(GetModuleMutex());
   auto clang_type_system = compiler_type.GetTypeSystem<TypeSystemClang>();
   if (clang_type_system) {
-    DWARFASTParserClang *ast_parser =
-        static_cast<DWARFASTParserClang *>(clang_type_system->GetDWARFParser());
+    auto *ast_parser =
+        llvm::cast<DWARFASTParserClang>(clang_type_system->GetDWARFParser());
     if (ast_parser &&
         ast_parser->GetClangASTImporter().CanImport(compiler_type))
       return ast_parser->GetClangASTImporter().CompleteType(compiler_type);
@@ -1614,8 +1614,7 @@ bool SymbolFileDWARF::CompleteType(CompilerType &compiler_type) {
 
   if (decl_die != def_die) {
     GetDIEToType()[def_die.GetDIE()] = type;
-    DWARFASTParserClang *ast_parser =
-        static_cast<DWARFASTParserClang *>(dwarf_ast);
+    auto *ast_parser = llvm::cast<DWARFASTParserClang>(dwarf_ast);
     ast_parser->MapDeclDIEToDefDIE(decl_die, def_die);
   }
 
diff --git a/lldb/source/Plugins/SymbolFile/NativePDB/SymbolFileNativePDB.cpp b/lldb/source/Plugins/SymbolFile/NativePDB/SymbolFileNativePDB.cpp
index 40e783f9bad38..3bf113a07d28c 100644
--- a/lldb/source/Plugins/SymbolFile/NativePDB/SymbolFileNativePDB.cpp
+++ b/lldb/source/Plugins/SymbolFile/NativePDB/SymbolFileNativePDB.cpp
@@ -86,6 +86,40 @@ static lldb::LanguageType TranslateLanguage(PDB_Lang lang) {
   }
 }
 
+static std::optional<std::string>
+findMatchingPDBFilePath(llvm::StringRef original_pdb_path,
+                        llvm::StringRef exe_path) {
+  const FileSystem &fs = FileSystem::Instance();
+
+  if (fs.Exists(original_pdb_path))
+    return std::string(original_pdb_path);
+
+  const auto exe_dir = FileSpec(exe_path).CopyByRemovingLastPathComponent();
+  // While the exe_path uses the native style, the exe might be compiled on a
+  // different OS, so try to guess the style used.
+  const FileSpec original_pdb_spec(original_pdb_path,
+                                   FileSpec::GuessPathStyle(original_pdb_path)
+                                       .value_or(FileSpec::Style::native));
+  const llvm::StringRef pdb_filename = original_pdb_spec.GetFilename();
+
+  // If the file doesn't exist, perhaps the path specified at build time
+  // doesn't match the PDB's current location, so check the location of the
+  // executable.
+  const FileSpec local_pdb = exe_dir.CopyByAppendingPathComponent(pdb_filename);
+  if (fs.Exists(local_pdb))
+    return local_pdb.GetPath();
+
+  // Otherwise, search for one in target.debug-file-search-paths
+  FileSpecList search_paths = Target::GetDefaultDebugFileSearchPaths();
+  for (const FileSpec &search_dir : search_paths) {
+    FileSpec pdb_path = search_dir.CopyByAppendingPathComponent(pdb_filename);
+    if (fs.Exists(pdb_path))
+      return pdb_path.GetPath();
+  }
+
+  return std::nullopt;
+}
+
 static std::unique_ptr<PDBFile>
 loadMatchingPDBFile(std::string exe_path, llvm::BumpPtrAllocator &allocator) {
   // Try to find a matching PDB for an EXE.
@@ -113,17 +147,14 @@ loadMatchingPDBFile(std::string exe_path, llvm::BumpPtrAllocator &allocator) {
     return nullptr;
   }
 
-  // If the file doesn't exist, perhaps the path specified at build time
-  // doesn't match the PDB's current location, so check the location of the
-  // executable.
-  if (!FileSystem::Instance().Exists(pdb_file)) {
-    const auto exe_dir = FileSpec(exe_path).CopyByRemovingLastPathComponent();
-    const auto pdb_name = FileSpec(pdb_file).GetFilename().GetCString();
-    pdb_file = exe_dir.CopyByAppendingPathComponent(pdb_name).GetPathAsConstString().GetStringRef();
-  }
+  std::optional<std::string> resolved_pdb_path =
+      findMatchingPDBFilePath(pdb_file, exe_path);
+  if (!resolved_pdb_path)
+    return nullptr;
 
   // If the file is not a PDB or if it doesn't have a matching GUID, fail.
-  auto pdb = ObjectFilePDB::loadPDBFile(std::string(pdb_file), allocator);
+  auto pdb =
+      ObjectFilePDB::loadPDBFile(*std::move(resolved_pdb_path), allocator);
   if (!pdb)
     return nullptr;
 
@@ -137,6 +168,9 @@ loadMatchingPDBFile(std::string exe_path, llvm::BumpPtrAllocator &allocator) {
 
   if (expected_info->getGuid() != guid)
     return nullptr;
+
+  LLDB_LOG(GetLog(LLDBLog::Symbols), "Loading {0} for {1}", pdb->getFilePath(),
+           exe_path);
   return pdb;
 }
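
The findMatchingPDBFilePath helper added above tries three locations in order: the
PDB path recorded in the executable at build time, the directory containing the
executable, and the directories from target.debug-file-search-paths. A rough
standalone equivalent of that search order, written against std::filesystem rather
than the LLDB FileSpec/FileSystem APIs and omitting the path-style guessing, purely
for illustration:

#include <filesystem>
#include <optional>
#include <vector>

namespace fs = std::filesystem;

// Toy equivalent of the lookup order; 'search_paths' stands in for the
// target.debug-file-search-paths setting.
std::optional<fs::path> FindPdb(const fs::path &recorded_pdb,
                                const fs::path &exe_path,
                                const std::vector<fs::path> &search_paths) {
  // 1. The path recorded in the executable at build time.
  if (fs::exists(recorded_pdb))
    return recorded_pdb;

  const fs::path name = recorded_pdb.filename();

  // 2. The same file name next to the executable.
  if (fs::path local = exe_path.parent_path() / name; fs::exists(local))
    return local;

  // 3. Each of the configured debug file search directories.
  for (const fs::path &dir : search_paths)
    if (fs::path candidate = dir / name; fs::exists(candidate))
      return candidate;

  return std::nullopt;
}
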
 
diff --git a/lldb/source/Plugins/SyntheticFrameProvider/CMakeLists.txt b/lldb/source/Plugins/SyntheticFrameProvider/CMakeLists.txt
new file mode 100644
index 0000000000000..85b405e648c1f
--- /dev/null
+++ b/lldb/source/Plugins/SyntheticFrameProvider/CMakeLists.txt
@@ -0,0 +1 @@
+add_subdirectory(ScriptedFrameProvider)
diff --git a/lldb/source/Plugins/SyntheticFrameProvider/ScriptedFrameProvider/CMakeLists.txt b/lldb/source/Plugins/SyntheticFrameProvider/ScriptedFrameProvider/CMakeLists.txt
new file mode 100644
index 0000000000000..fe67d39efdf11
--- /dev/null
+++ b/lldb/source/Plugins/SyntheticFrameProvider/ScriptedFrameProvider/CMakeLists.txt
@@ -0,0 +1,12 @@
+add_lldb_library(lldbPluginScriptedFrameProvider PLUGIN
+  ScriptedFrameProvider.cpp
+
+  LINK_COMPONENTS
+    Support
+
+  LINK_LIBS
+    lldbCore
+    lldbInterpreter
+    lldbTarget
+    lldbUtility
+  )
diff --git a/lldb/source/Plugins/SyntheticFrameProvider/ScriptedFrameProvider/ScriptedFrameProvider.cpp b/lldb/source/Plugins/SyntheticFrameProvider/ScriptedFrameProvider/ScriptedFrameProvider.cpp
new file mode 100644
index 0000000000000..739963e6f0c2f
--- /dev/null
+++ b/lldb/source/Plugins/SyntheticFrameProvider/ScriptedFrameProvider/ScriptedFrameProvider.cpp
@@ -0,0 +1,221 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "ScriptedFrameProvider.h"
+#include "Plugins/Process/scripted/ScriptedFrame.h"
+#include "lldb/Core/Debugger.h"
+#include "lldb/Core/PluginManager.h"
+#include "lldb/Interpreter/Interfaces/ScriptedFrameProviderInterface.h"
+#include "lldb/Interpreter/ScriptInterpreter.h"
+#include "lldb/Target/BorrowedStackFrame.h"
+#include "lldb/Target/Process.h"
+#include "lldb/Target/StackFrame.h"
+#include "lldb/Target/Thread.h"
+#include "lldb/Utility/ScriptedMetadata.h"
+#include "lldb/Utility/Status.h"
+#include "llvm/Support/Error.h"
+#include <cstdint>
+
+using namespace lldb;
+using namespace lldb_private;
+
+void ScriptedFrameProvider::Initialize() {
+  PluginManager::RegisterPlugin(GetPluginNameStatic(),
+                                "Provides synthetic frames via scripting",
+                                nullptr, ScriptedFrameProvider::CreateInstance);
+}
+
+void ScriptedFrameProvider::Terminate() {
+  PluginManager::UnregisterPlugin(ScriptedFrameProvider::CreateInstance);
+}
+
+llvm::Expected<lldb::SyntheticFrameProviderSP>
+ScriptedFrameProvider::CreateInstance(
+    lldb::StackFrameListSP input_frames,
+    const ScriptedFrameProviderDescriptor &descriptor) {
+  if (!input_frames)
+    return llvm::createStringError(
+        "failed to create scripted frame provider: invalid input frames");
+
+  Thread &thread = input_frames->GetThread();
+  ProcessSP process_sp = thread.GetProcess();
+  if (!process_sp)
+    return nullptr;
+
+  if (!descriptor.IsValid())
+    return llvm::createStringError(
+        "failed to create scripted frame provider: invalid scripted metadata");
+
+  if (!descriptor.AppliesToThread(thread))
+    return nullptr;
+
+  ScriptInterpreter *script_interp =
+      process_sp->GetTarget().GetDebugger().GetScriptInterpreter();
+  if (!script_interp)
+    return llvm::createStringError("cannot create scripted frame provider: No "
+                                   "script interpreter installed");
+
+  ScriptedFrameProviderInterfaceSP interface_sp =
+      script_interp->CreateScriptedFrameProviderInterface();
+  if (!interface_sp)
+    return llvm::createStringError(
+        "cannot create scripted frame provider: script interpreter couldn't "
+        "create Scripted Frame Provider Interface");
+
+  const ScriptedMetadataSP scripted_metadata = descriptor.scripted_metadata_sp;
+
+  // If we shouldn't attach a frame provider to this thread, just exit early.
+  if (!interface_sp->AppliesToThread(scripted_metadata->GetClassName(),
+                                     thread.shared_from_this()))
+    return nullptr;
+
+  auto obj_or_err = interface_sp->CreatePluginObject(
+      scripted_metadata->GetClassName(), input_frames,
+      scripted_metadata->GetArgsSP());
+  if (!obj_or_err)
+    return obj_or_err.takeError();
+
+  StructuredData::ObjectSP object_sp = *obj_or_err;
+  if (!object_sp || !object_sp->IsValid())
+    return llvm::createStringError(
+        "cannot create scripted frame provider: failed to create valid scripted"
+        "frame provider object");
+
+  return std::make_shared<ScriptedFrameProvider>(input_frames, interface_sp,
+                                                 descriptor);
+}
+
+ScriptedFrameProvider::ScriptedFrameProvider(
+    StackFrameListSP input_frames,
+    lldb::ScriptedFrameProviderInterfaceSP interface_sp,
+    const ScriptedFrameProviderDescriptor &descriptor)
+    : SyntheticFrameProvider(input_frames), m_interface_sp(interface_sp),
+      m_descriptor(descriptor) {}
+
+ScriptedFrameProvider::~ScriptedFrameProvider() = default;
+
+std::string ScriptedFrameProvider::GetDescription() const {
+  if (!m_interface_sp)
+    return {};
+
+  return m_interface_sp->GetDescription(m_descriptor.GetName());
+}
+
+llvm::Expected<StackFrameSP>
+ScriptedFrameProvider::GetFrameAtIndex(uint32_t idx) {
+  if (!m_interface_sp)
+    return llvm::createStringError(
+        "cannot get stack frame: scripted frame provider not initialized");
+
+  auto create_frame_from_dict =
+      [this](StructuredData::Dictionary *dict,
+             uint32_t index) -> llvm::Expected<StackFrameSP> {
+    lldb::addr_t pc;
+    if (!dict->GetValueForKeyAsInteger("pc", pc))
+      return llvm::createStringError(
+          "missing 'pc' key from scripted frame dictionary");
+
+    Address symbol_addr;
+    symbol_addr.SetLoadAddress(pc, &GetThread().GetProcess()->GetTarget());
+
+    const lldb::addr_t cfa = LLDB_INVALID_ADDRESS;
+    const bool cfa_is_valid = false;
+    const bool artificial = false;
+    const bool behaves_like_zeroth_frame = false;
+    SymbolContext sc;
+    symbol_addr.CalculateSymbolContext(&sc);
+
+    ThreadSP thread_sp = GetThread().shared_from_this();
+    return std::make_shared<StackFrame>(thread_sp, index, index, cfa,
+                                        cfa_is_valid, pc,
+                                        StackFrame::Kind::Synthetic, artificial,
+                                        behaves_like_zeroth_frame, &sc);
+  };
+
+  auto create_frame_from_script_object =
+      [this](
+          StructuredData::ObjectSP object_sp) -> llvm::Expected<StackFrameSP> {
+    Status error;
+    if (!object_sp || !object_sp->GetAsGeneric())
+      return llvm::createStringError("invalid script object");
+
+    ThreadSP thread_sp = GetThread().shared_from_this();
+    auto frame_or_error = ScriptedFrame::Create(thread_sp, nullptr, nullptr,
+                                                object_sp->GetAsGeneric());
+
+    if (!frame_or_error) {
+      ScriptedInterface::ErrorWithMessage<bool>(
+          LLVM_PRETTY_FUNCTION, toString(frame_or_error.takeError()), error);
+      return error.ToError();
+    }
+
+    return *frame_or_error;
+  };
+
+  StructuredData::ObjectSP obj_sp = m_interface_sp->GetFrameAtIndex(idx);
+
+  // A None/null return means there are no more frames, or an error occurred.
+  if (!obj_sp || !obj_sp->IsValid())
+    return llvm::createStringError("invalid script object returned for frame " +
+                                   llvm::Twine(idx));
+
+  StackFrameSP synth_frame_sp = nullptr;
+  if (StructuredData::UnsignedInteger *int_obj =
+          obj_sp->GetAsUnsignedInteger()) {
+    uint32_t real_frame_index = int_obj->GetValue();
+    if (real_frame_index < m_input_frames->GetNumFrames()) {
+      StackFrameSP real_frame_sp =
+          m_input_frames->GetFrameAtIndex(real_frame_index);
+      synth_frame_sp =
+          (real_frame_index == idx)
+              ? real_frame_sp
+              : std::make_shared<BorrowedStackFrame>(real_frame_sp, idx);
+    }
+  } else if (StructuredData::Dictionary *dict = obj_sp->GetAsDictionary()) {
+    // Check if it's a dictionary describing a frame.
+    auto frame_from_dict_or_err = create_frame_from_dict(dict, idx);
+    if (!frame_from_dict_or_err) {
+      return llvm::createStringError(llvm::Twine(
+          "couldn't create frame from dictionary at index " + llvm::Twine(idx) +
+          ": " + toString(frame_from_dict_or_err.takeError())));
+    }
+    synth_frame_sp = *frame_from_dict_or_err;
+  } else if (obj_sp->GetAsGeneric()) {
+    // It's a ScriptedFrame object.
+    auto frame_from_script_obj_or_err = create_frame_from_script_object(obj_sp);
+    if (!frame_from_script_obj_or_err) {
+      return llvm::createStringError(
+          llvm::Twine("couldn't create frame from script object at index " +
+                      llvm::Twine(idx) + ": " +
+                      toString(frame_from_script_obj_or_err.takeError())));
+    }
+    synth_frame_sp = *frame_from_script_obj_or_err;
+  } else {
+    return llvm::createStringError(
+        llvm::Twine("invalid return type from get_frame_at_index at index " +
+                    llvm::Twine(idx)));
+  }
+
+  if (!synth_frame_sp)
+    return llvm::createStringError(
+        llvm::Twine("failed to create frame at index " + llvm::Twine(idx)));
+
+  synth_frame_sp->SetFrameIndex(idx);
+
+  return synth_frame_sp;
+}
+
+namespace lldb_private {
+void lldb_initialize_ScriptedFrameProvider() {
+  ScriptedFrameProvider::Initialize();
+}
+
+void lldb_terminate_ScriptedFrameProvider() {
+  ScriptedFrameProvider::Terminate();
+}
+} // namespace lldb_private
diff --git a/lldb/source/Plugins/SyntheticFrameProvider/ScriptedFrameProvider/ScriptedFrameProvider.h b/lldb/source/Plugins/SyntheticFrameProvider/ScriptedFrameProvider/ScriptedFrameProvider.h
new file mode 100644
index 0000000000000..3434bf26ade24
--- /dev/null
+++ b/lldb/source/Plugins/SyntheticFrameProvider/ScriptedFrameProvider/ScriptedFrameProvider.h
@@ -0,0 +1,53 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLDB_PLUGINS_SYNTHETICFRAMEPROVIDER_SCRIPTEDFRAMEPROVIDER_SCRIPTEDFRAMEPROVIDER_H
+#define LLDB_PLUGINS_SYNTHETICFRAMEPROVIDER_SCRIPTEDFRAMEPROVIDER_SCRIPTEDFRAMEPROVIDER_H
+
+#include "lldb/Target/SyntheticFrameProvider.h"
+#include "lldb/Utility/ScriptedMetadata.h"
+#include "lldb/Utility/Status.h"
+#include "lldb/lldb-forward.h"
+#include "llvm/Support/Error.h"
+
+namespace lldb_private {
+
+class ScriptedFrameProvider : public SyntheticFrameProvider {
+public:
+  static llvm::StringRef GetPluginNameStatic() {
+    return "ScriptedFrameProvider";
+  }
+
+  static llvm::Expected<lldb::SyntheticFrameProviderSP>
+  CreateInstance(lldb::StackFrameListSP input_frames,
+                 const ScriptedFrameProviderDescriptor &descriptor);
+
+  static void Initialize();
+
+  static void Terminate();
+
+  ScriptedFrameProvider(lldb::StackFrameListSP input_frames,
+                        lldb::ScriptedFrameProviderInterfaceSP interface_sp,
+                        const ScriptedFrameProviderDescriptor &descriptor);
+  ~ScriptedFrameProvider() override;
+
+  llvm::StringRef GetPluginName() override { return GetPluginNameStatic(); }
+
+  std::string GetDescription() const override;
+
+  /// Get a single stack frame at the specified index.
+  llvm::Expected<lldb::StackFrameSP> GetFrameAtIndex(uint32_t idx) override;
+
+private:
+  lldb::ScriptedFrameProviderInterfaceSP m_interface_sp;
+  const ScriptedFrameProviderDescriptor &m_descriptor;
+};
+
+} // namespace lldb_private
+
+#endif // LLDB_PLUGINS_SYNTHETICFRAMEPROVIDER_SCRIPTEDFRAMEPROVIDER_SCRIPTEDFRAMEPROVIDER_H
diff --git a/lldb/source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.cpp b/lldb/source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.cpp
index b9d665902dc45..19ae1cf392efa 100644
--- a/lldb/source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.cpp
+++ b/lldb/source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.cpp
@@ -25,6 +25,8 @@
 #include "lldb/Utility/Log.h"
 #include "lldb/Utility/Status.h"
 #include "lldb/Utility/StreamString.h"
+#include "llvm/ADT/SmallSet.h"
+#include <deque>
 
 using namespace lldb;
 using namespace lldb_private;
@@ -150,29 +152,38 @@ bool UnwindAssemblyInstEmulation::GetNonCallSiteUnwindPlanFromAssembly(
   EmulateInstruction::InstructionCondition last_condition =
       EmulateInstruction::UnconditionalCondition;
 
-  for (const InstructionSP &inst : inst_list.Instructions()) {
-    if (!inst)
-      continue;
-    DumpInstToLog(log, *inst, inst_list);
+  std::deque<std::size_t> to_visit = {0};
+  llvm::SmallSet<std::size_t, 0> enqueued = {0};
+
+  // Instructions reachable through jumps are inserted on the front.
+  // The next instruction is inserted on the back.
+  // Pop from the back to ensure non-branching instructions are visited
+  // sequentially.
+  while (!to_visit.empty()) {
+    const std::size_t current_index = to_visit.back();
+    Instruction &inst = *inst_list.GetInstructionAtIndex(current_index);
+    to_visit.pop_back();
+    DumpInstToLog(log, inst, inst_list);
 
     m_curr_row_modified = false;
-    m_forward_branch_offset = 0;
+    m_branch_offset = 0;
 
     lldb::addr_t current_offset =
-        inst->GetAddress().GetFileAddress() - base_addr;
+        inst.GetAddress().GetFileAddress() - base_addr;
     auto it = saved_unwind_states.upper_bound(current_offset);
     assert(it != saved_unwind_states.begin() &&
            "Unwind row for the function entry missing");
     --it; // Move it to the row corresponding to the current offset
 
-    // If the offset of m_state.row doesn't match with the offset we see in
-    // saved_unwind_states then we have to update current unwind state to
-    // the saved values. It is happening after we processed an epilogue and a
-    // return to caller instruction.
+    // When state is forwarded through a branch, the offset of m_state.row is
+    // different from the offset available in saved_unwind_states. Use the
+    // forwarded state in this case, as the previous instruction may have been
+    // an unconditional jump.
+    // FIXME: this assignment could be done unconditionally.
     if (it->second.row.GetOffset() != m_state.row.GetOffset())
       m_state = it->second;
 
-    m_inst_emulator_up->SetInstruction(inst->GetOpcode(), inst->GetAddress(),
+    m_inst_emulator_up->SetInstruction(inst.GetOpcode(), inst.GetAddress(),
                                        nullptr);
     const EmulateInstruction::InstructionCondition new_condition =
         m_inst_emulator_up->GetInstructionCondition();
@@ -199,26 +210,43 @@ bool UnwindAssemblyInstEmulation::GetNonCallSiteUnwindPlanFromAssembly(
 
     // If the current instruction is a branch forward then save the current
     // CFI information for the offset where we are branching.
-    if (m_forward_branch_offset != 0 &&
-        range.ContainsFileAddress(inst->GetAddress().GetFileAddress() +
-                                  m_forward_branch_offset)) {
+    Address branch_address = inst.GetAddress();
+    branch_address.Slide(m_branch_offset);
+    if (m_branch_offset != 0 &&
+        range.ContainsFileAddress(branch_address.GetFileAddress())) {
       if (auto [it, inserted] = saved_unwind_states.emplace(
-              current_offset + m_forward_branch_offset, m_state);
-          inserted)
-        it->second.row.SetOffset(current_offset + m_forward_branch_offset);
+              current_offset + m_branch_offset, m_state);
+          inserted) {
+        it->second.row.SetOffset(current_offset + m_branch_offset);
+        if (std::size_t dest_instr_index =
+                inst_list.GetIndexOfInstructionAtAddress(branch_address);
+            dest_instr_index < inst_list.GetSize()) {
+          to_visit.push_front(dest_instr_index);
+          enqueued.insert(dest_instr_index);
+        }
+      }
     }
 
+    // If inst is a barrier, do not propagate state to the next instruction.
+    if (inst.IsBarrier())
+      continue;
+
     // Were there any changes to the CFI while evaluating this instruction?
     if (m_curr_row_modified) {
       // Save the modified row if we don't already have a CFI row in the
       // current address
       const lldb::addr_t next_inst_offset =
-          current_offset + inst->GetOpcode().GetByteSize();
+          current_offset + inst.GetOpcode().GetByteSize();
       if (saved_unwind_states.count(next_inst_offset) == 0) {
         m_state.row.SetOffset(next_inst_offset);
         saved_unwind_states.emplace(next_inst_offset, m_state);
       }
     }
+
+    const size_t next_idx = current_index + 1;
+    const bool never_enqueued = enqueued.insert(next_idx).second;
+    if (never_enqueued && next_idx < inst_list.GetSize())
+      to_visit.push_back(next_idx);
   }
 
   for (auto &[_, state] : saved_unwind_states)
@@ -506,21 +534,20 @@ bool UnwindAssemblyInstEmulation::WriteRegister(
   case EmulateInstruction::eContextAbsoluteBranchRegister:
   case EmulateInstruction::eContextRelativeBranchImmediate: {
     if (context.GetInfoType() == EmulateInstruction::eInfoTypeISAAndImmediate &&
-        context.info.ISAAndImmediate.unsigned_data32 > 0) {
-      m_forward_branch_offset = context.info.ISAAndImmediate.unsigned_data32;
+        context.info.ISAAndImmediate.unsigned_data32 != 0) {
+      m_branch_offset = context.info.ISAAndImmediate.unsigned_data32;
     } else if (context.GetInfoType() ==
                    EmulateInstruction::eInfoTypeISAAndImmediateSigned &&
-               context.info.ISAAndImmediateSigned.signed_data32 > 0) {
-      m_forward_branch_offset =
-          context.info.ISAAndImmediateSigned.signed_data32;
+               context.info.ISAAndImmediateSigned.signed_data32 != 0) {
+      m_branch_offset = context.info.ISAAndImmediateSigned.signed_data32;
     } else if (context.GetInfoType() ==
                    EmulateInstruction::eInfoTypeImmediate &&
-               context.info.unsigned_immediate > 0) {
-      m_forward_branch_offset = context.info.unsigned_immediate;
+               context.info.unsigned_immediate != 0) {
+      m_branch_offset = context.info.unsigned_immediate;
     } else if (context.GetInfoType() ==
                    EmulateInstruction::eInfoTypeImmediateSigned &&
-               context.info.signed_immediate > 0) {
-      m_forward_branch_offset = context.info.signed_immediate;
+               context.info.signed_immediate != 0) {
+      m_branch_offset = context.info.signed_immediate;
     }
   } break;
 
diff --git a/lldb/source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.h b/lldb/source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.h
index 6c0492f5dfc66..43daf1c9f9fd6 100644
--- a/lldb/source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.h
+++ b/lldb/source/Plugins/UnwindAssembly/InstEmulation/UnwindAssemblyInstEmulation.h
@@ -64,7 +64,7 @@ class UnwindAssemblyInstEmulation : public lldb_private::UnwindAssembly {
                               lldb_private::EmulateInstruction *inst_emulator)
       : UnwindAssembly(arch), m_inst_emulator_up(inst_emulator),
         m_range_ptr(nullptr), m_unwind_plan_ptr(nullptr),
-        m_curr_row_modified(false), m_forward_branch_offset(0) {
+        m_curr_row_modified(false) {
     if (m_inst_emulator_up) {
       m_inst_emulator_up->SetBaton(this);
       m_inst_emulator_up->SetCallbacks(ReadMemory, WriteMemory, ReadRegister,
@@ -152,7 +152,7 @@ class UnwindAssemblyInstEmulation : public lldb_private::UnwindAssembly {
   bool m_curr_row_modified;
   // The instruction is branching forward with the given offset. 0 value means
   // no branching.
-  uint32_t m_forward_branch_offset;
+  int64_t m_branch_offset = 0;
 };
 
 #endif // LLDB_SOURCE_PLUGINS_UNWINDASSEMBLY_INSTEMULATION_UNWINDASSEMBLYINSTEMULATION_H
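
The rewritten loop above replaces the linear instruction walk with a small
worklist: the fall-through successor is pushed on the back and popped from the
back, so straight-line code is still visited in order, while branch targets are
pushed on the front and only visited once the current run is exhausted, and the
enqueued set keeps any instruction from being processed twice. A toy model of that
visiting order, with indices standing in for instructions and a made-up branch map:

#include <cstddef>
#include <deque>
#include <iostream>
#include <map>
#include <set>

int main() {
  const std::size_t num_insts = 6;
  // Hypothetical control flow: instruction 1 also branches to instruction 4.
  const std::map<std::size_t, std::size_t> branch_target = {{1, 4}};

  std::deque<std::size_t> to_visit = {0};
  std::set<std::size_t> enqueued = {0};

  while (!to_visit.empty()) {
    const std::size_t cur = to_visit.back();
    to_visit.pop_back();
    std::cout << cur << ' ';

    // Branch targets go to the front, so they wait until the current
    // straight-line run has been fully processed.
    if (auto it = branch_target.find(cur); it != branch_target.end())
      if (enqueued.insert(it->second).second)
        to_visit.push_front(it->second);

    // The fall-through successor goes to the back and is popped next.
    const std::size_t next = cur + 1;
    if (next < num_insts && enqueued.insert(next).second)
      to_visit.push_back(next);
  }
  std::cout << '\n'; // prints: 0 1 2 3 4 5
}
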
diff --git a/lldb/source/Symbol/ObjectFile.cpp b/lldb/source/Symbol/ObjectFile.cpp
index 6f5348c153030..ab28c17f9732c 100644
--- a/lldb/source/Symbol/ObjectFile.cpp
+++ b/lldb/source/Symbol/ObjectFile.cpp
@@ -254,13 +254,14 @@ ObjectFile::ObjectFile(const lldb::ModuleSP &module_sp,
     : ModuleChild(module_sp),
       m_file(), // This file could be different from the original module's file
       m_type(eTypeInvalid), m_strata(eStrataInvalid),
-      m_file_offset(file_offset), m_length(length), m_data(), m_process_wp(),
+      m_file_offset(file_offset), m_length(length),
+      m_data_nsp(std::make_shared<DataExtractor>()), m_process_wp(),
       m_memory_addr(LLDB_INVALID_ADDRESS), m_sections_up(), m_symtab_up(),
       m_symtab_once_up(new llvm::once_flag()) {
   if (file_spec_ptr)
     m_file = *file_spec_ptr;
   if (data_sp)
-    m_data.SetData(data_sp, data_offset, length);
+    m_data_nsp->SetData(data_sp, data_offset, length);
   Log *log = GetLog(LLDBLog::Object);
   LLDB_LOGF(log,
             "%p ObjectFile::ObjectFile() module = %p (%s), file = %s, "
@@ -275,11 +276,12 @@ ObjectFile::ObjectFile(const lldb::ModuleSP &module_sp,
                        const ProcessSP &process_sp, lldb::addr_t header_addr,
                        DataBufferSP header_data_sp)
     : ModuleChild(module_sp), m_file(), m_type(eTypeInvalid),
-      m_strata(eStrataInvalid), m_file_offset(0), m_length(0), m_data(),
-      m_process_wp(process_sp), m_memory_addr(header_addr), m_sections_up(),
-      m_symtab_up(), m_symtab_once_up(new llvm::once_flag()) {
+      m_strata(eStrataInvalid), m_file_offset(0), m_length(0),
+      m_data_nsp(std::make_shared<DataExtractor>()), m_process_wp(process_sp),
+      m_memory_addr(header_addr), m_sections_up(), m_symtab_up(),
+      m_symtab_once_up(new llvm::once_flag()) {
   if (header_data_sp)
-    m_data.SetData(header_data_sp, 0, header_data_sp->GetByteSize());
+    m_data_nsp->SetData(header_data_sp, 0, header_data_sp->GetByteSize());
   Log *log = GetLog(LLDBLog::Object);
   LLDB_LOGF(log,
             "%p ObjectFile::ObjectFile() module = %p (%s), process = %p, "
@@ -474,16 +476,16 @@ WritableDataBufferSP ObjectFile::ReadMemory(const ProcessSP &process_sp,
 
 size_t ObjectFile::GetData(lldb::offset_t offset, size_t length,
                            DataExtractor &data) const {
-  // The entire file has already been mmap'ed into m_data, so just copy from
+  // The entire file has already been mmap'ed into m_data_nsp, so just copy from
   // there as the back mmap buffer will be shared with shared pointers.
-  return data.SetData(m_data, offset, length);
+  return data.SetData(*m_data_nsp, offset, length);
 }
 
 size_t ObjectFile::CopyData(lldb::offset_t offset, size_t length,
                             void *dst) const {
-  // The entire file has already been mmap'ed into m_data, so just copy from
+  // The entire file has already been mmap'ed into m_data_nsp, so just copy from
   // there Note that the data remains in target byte order.
-  return m_data.CopyData(offset, length, dst);
+  return m_data_nsp->CopyData(offset, length, dst);
 }
 
 size_t ObjectFile::ReadSectionData(Section *section,
diff --git a/lldb/source/Target/BorrowedStackFrame.cpp b/lldb/source/Target/BorrowedStackFrame.cpp
new file mode 100644
index 0000000000000..5afadf21fde03
--- /dev/null
+++ b/lldb/source/Target/BorrowedStackFrame.cpp
@@ -0,0 +1,187 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "lldb/Target/BorrowedStackFrame.h"
+
+using namespace lldb;
+using namespace lldb_private;
+
+char BorrowedStackFrame::ID;
+
+BorrowedStackFrame::BorrowedStackFrame(
+    StackFrameSP borrowed_frame_sp, uint32_t new_frame_index,
+    std::optional<uint32_t> new_concrete_frame_index)
+    : StackFrame(
+          borrowed_frame_sp->GetThread(), new_frame_index,
+          borrowed_frame_sp->GetConcreteFrameIndex(),
+          borrowed_frame_sp->GetRegisterContextSP(),
+          borrowed_frame_sp->GetStackID().GetPC(),
+          borrowed_frame_sp->GetStackID().GetCallFrameAddressWithoutMetadata(),
+          borrowed_frame_sp->m_behaves_like_zeroth_frame,
+          &borrowed_frame_sp->GetSymbolContext(eSymbolContextEverything)),
+      m_borrowed_frame_sp(borrowed_frame_sp),
+      m_new_frame_index(new_frame_index) {
+  if (new_concrete_frame_index)
+    m_new_concrete_frame_index = *new_concrete_frame_index;
+  else
+    m_new_concrete_frame_index =
+        IsInlined() ? LLDB_INVALID_FRAME_ID : new_frame_index;
+}
+
+uint32_t BorrowedStackFrame::GetFrameIndex() const { return m_new_frame_index; }
+
+void BorrowedStackFrame::SetFrameIndex(uint32_t index) {
+  m_new_frame_index = index;
+}
+
+uint32_t BorrowedStackFrame::GetConcreteFrameIndex() {
+  // FIXME: We need to find where the concrete frame into which this frame
+  // was inlined landed in the new stack frame list, as that is the correct
+  // concrete frame index for this stack frame. For now, return the index
+  // computed in the constructor.
+  return m_new_concrete_frame_index;
+}
+
+StackID &BorrowedStackFrame::GetStackID() {
+  return m_borrowed_frame_sp->GetStackID();
+}
+
+const Address &BorrowedStackFrame::GetFrameCodeAddress() {
+  return m_borrowed_frame_sp->GetFrameCodeAddress();
+}
+
+Address BorrowedStackFrame::GetFrameCodeAddressForSymbolication() {
+  return m_borrowed_frame_sp->GetFrameCodeAddressForSymbolication();
+}
+
+bool BorrowedStackFrame::ChangePC(addr_t pc) {
+  return m_borrowed_frame_sp->ChangePC(pc);
+}
+
+const SymbolContext &
+BorrowedStackFrame::GetSymbolContext(SymbolContextItem resolve_scope) {
+  return m_borrowed_frame_sp->GetSymbolContext(resolve_scope);
+}
+
+llvm::Error BorrowedStackFrame::GetFrameBaseValue(Scalar &value) {
+  return m_borrowed_frame_sp->GetFrameBaseValue(value);
+}
+
+DWARFExpressionList *
+BorrowedStackFrame::GetFrameBaseExpression(Status *error_ptr) {
+  return m_borrowed_frame_sp->GetFrameBaseExpression(error_ptr);
+}
+
+Block *BorrowedStackFrame::GetFrameBlock() {
+  return m_borrowed_frame_sp->GetFrameBlock();
+}
+
+RegisterContextSP BorrowedStackFrame::GetRegisterContext() {
+  return m_borrowed_frame_sp->GetRegisterContext();
+}
+
+VariableList *BorrowedStackFrame::GetVariableList(bool get_file_globals,
+                                                  Status *error_ptr) {
+  return m_borrowed_frame_sp->GetVariableList(get_file_globals, error_ptr);
+}
+
+VariableListSP
+BorrowedStackFrame::GetInScopeVariableList(bool get_file_globals,
+                                           bool must_have_valid_location) {
+  return m_borrowed_frame_sp->GetInScopeVariableList(get_file_globals,
+                                                     must_have_valid_location);
+}
+
+ValueObjectSP BorrowedStackFrame::GetValueForVariableExpressionPath(
+    llvm::StringRef var_expr, DynamicValueType use_dynamic, uint32_t options,
+    VariableSP &var_sp, Status &error) {
+  return m_borrowed_frame_sp->GetValueForVariableExpressionPath(
+      var_expr, use_dynamic, options, var_sp, error);
+}
+
+bool BorrowedStackFrame::HasDebugInformation() {
+  return m_borrowed_frame_sp->HasDebugInformation();
+}
+
+const char *BorrowedStackFrame::Disassemble() {
+  return m_borrowed_frame_sp->Disassemble();
+}
+
+ValueObjectSP BorrowedStackFrame::GetValueObjectForFrameVariable(
+    const VariableSP &variable_sp, DynamicValueType use_dynamic) {
+  return m_borrowed_frame_sp->GetValueObjectForFrameVariable(variable_sp,
+                                                             use_dynamic);
+}
+
+bool BorrowedStackFrame::IsInlined() {
+  return m_borrowed_frame_sp->IsInlined();
+}
+
+bool BorrowedStackFrame::IsSynthetic() const {
+  return m_borrowed_frame_sp->IsSynthetic();
+}
+
+bool BorrowedStackFrame::IsHistorical() const {
+  return m_borrowed_frame_sp->IsHistorical();
+}
+
+bool BorrowedStackFrame::IsArtificial() const {
+  return m_borrowed_frame_sp->IsArtificial();
+}
+
+bool BorrowedStackFrame::IsHidden() { return m_borrowed_frame_sp->IsHidden(); }
+
+const char *BorrowedStackFrame::GetFunctionName() {
+  return m_borrowed_frame_sp->GetFunctionName();
+}
+
+const char *BorrowedStackFrame::GetDisplayFunctionName() {
+  return m_borrowed_frame_sp->GetDisplayFunctionName();
+}
+
+ValueObjectSP BorrowedStackFrame::FindVariable(ConstString name) {
+  return m_borrowed_frame_sp->FindVariable(name);
+}
+
+SourceLanguage BorrowedStackFrame::GetLanguage() {
+  return m_borrowed_frame_sp->GetLanguage();
+}
+
+SourceLanguage BorrowedStackFrame::GuessLanguage() {
+  return m_borrowed_frame_sp->GuessLanguage();
+}
+
+ValueObjectSP BorrowedStackFrame::GuessValueForAddress(addr_t addr) {
+  return m_borrowed_frame_sp->GuessValueForAddress(addr);
+}
+
+ValueObjectSP
+BorrowedStackFrame::GuessValueForRegisterAndOffset(ConstString reg,
+                                                   int64_t offset) {
+  return m_borrowed_frame_sp->GuessValueForRegisterAndOffset(reg, offset);
+}
+
+StructuredData::ObjectSP BorrowedStackFrame::GetLanguageSpecificData() {
+  return m_borrowed_frame_sp->GetLanguageSpecificData();
+}
+
+RecognizedStackFrameSP BorrowedStackFrame::GetRecognizedFrame() {
+  return m_borrowed_frame_sp->GetRecognizedFrame();
+}
+
+StackFrameSP BorrowedStackFrame::GetBorrowedFrame() const {
+  return m_borrowed_frame_sp;
+}
+
+bool BorrowedStackFrame::isA(const void *ClassID) const {
+  return ClassID == &ID || StackFrame::isA(ClassID);
+}
+
+bool BorrowedStackFrame::classof(const StackFrame *obj) {
+  return obj->isA(&ID);
+}
diff --git a/lldb/source/Target/CMakeLists.txt b/lldb/source/Target/CMakeLists.txt
index cff59049cdce5..df2ee03860ac0 100644
--- a/lldb/source/Target/CMakeLists.txt
+++ b/lldb/source/Target/CMakeLists.txt
@@ -41,6 +41,7 @@ add_lldb_library(lldbTarget
   SyntheticFrameProvider.cpp
   SectionLoadHistory.cpp
   SectionLoadList.cpp
+  BorrowedStackFrame.cpp
   StackFrame.cpp
   StackFrameList.cpp
   StackFrameRecognizer.cpp
diff --git a/lldb/source/Target/ExecutionContext.cpp b/lldb/source/Target/ExecutionContext.cpp
index a795913047639..b16ff26266c53 100644
--- a/lldb/source/Target/ExecutionContext.cpp
+++ b/lldb/source/Target/ExecutionContext.cpp
@@ -466,10 +466,13 @@ operator=(const ExecutionContext &exe_ctx) {
   else
     m_tid = LLDB_INVALID_THREAD_ID;
   lldb::StackFrameSP frame_sp(exe_ctx.GetFrameSP());
-  if (frame_sp)
+  if (frame_sp) {
     m_stack_id = frame_sp->GetStackID();
-  else
+    m_frame_list_wp = frame_sp->GetContainingStackFrameList();
+  } else {
     m_stack_id.Clear();
+    m_frame_list_wp.reset();
+  }
   return *this;
 }
 
@@ -511,6 +514,7 @@ void ExecutionContextRef::SetThreadSP(const lldb::ThreadSP &thread_sp) {
 void ExecutionContextRef::SetFrameSP(const lldb::StackFrameSP &frame_sp) {
   if (frame_sp) {
     m_stack_id = frame_sp->GetStackID();
+    m_frame_list_wp = frame_sp->GetContainingStackFrameList();
     SetThreadSP(frame_sp->GetThread());
   } else {
     ClearFrame();
@@ -638,6 +642,15 @@ lldb::ThreadSP ExecutionContextRef::GetThreadSP() const {
 
 lldb::StackFrameSP ExecutionContextRef::GetFrameSP() const {
   if (m_stack_id.IsValid()) {
+    // Try the remembered frame list first to avoid circular dependencies
+    // during frame provider initialization.
+    if (auto frame_list_sp = m_frame_list_wp.lock()) {
+      if (auto frame_sp = frame_list_sp->GetFrameWithStackID(m_stack_id))
+        return frame_sp;
+    }
+
+    // Fallback: ask the thread, which might re-trigger the frame provider
+    // initialization.
     lldb::ThreadSP thread_sp(GetThreadSP());
     if (thread_sp)
       return thread_sp->GetFrameWithStackID(m_stack_id);
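
The change above makes ExecutionContextRef remember the StackFrameList a frame came
from (the weak pointer is filled in by the m_frame_list_wp assignments elsewhere in
this patch), so resolving the frame later can skip Thread::GetStackFrameList() and
avoid re-entering frame provider initialization. Reduced to a sketch with made-up
names, the caching shape is:

#include <memory>

struct FrameList; // stands in for lldb_private::StackFrameList

// Try the remembered owner first; only fall back to the path that may
// rebuild the frame list when the owner has already been destroyed.
struct FrameRef {
  std::weak_ptr<FrameList> owner;

  std::shared_ptr<FrameList>
  ResolveOwner(const std::shared_ptr<FrameList> &fallback) {
    if (std::shared_ptr<FrameList> cached = owner.lock())
      return cached; // fast path: no re-entry into frame-list construction
    return fallback; // slow path: the caller may have rebuilt the list
  }
};
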
diff --git a/lldb/source/Target/StackFrame.cpp b/lldb/source/Target/StackFrame.cpp
index 78f67d21d6600..ca3d4a1a29b59 100644
--- a/lldb/source/Target/StackFrame.cpp
+++ b/lldb/source/Target/StackFrame.cpp
@@ -45,6 +45,9 @@
 using namespace lldb;
 using namespace lldb_private;
 
+// LLVM RTTI support.
+char StackFrame::ID;
+
 // The first bits in the flags are reserved for the SymbolContext::Scope bits
 // so we know if we have tried to look up information in our internal symbol
 // context (m_sc) already.
diff --git a/lldb/source/Target/StackFrameList.cpp b/lldb/source/Target/StackFrameList.cpp
index ccf874fc03ebd..5d1a8a8370414 100644
--- a/lldb/source/Target/StackFrameList.cpp
+++ b/lldb/source/Target/StackFrameList.cpp
@@ -20,6 +20,7 @@
 #include "lldb/Target/StackFrame.h"
 #include "lldb/Target/StackFrameRecognizer.h"
 #include "lldb/Target/StopInfo.h"
+#include "lldb/Target/SyntheticFrameProvider.h"
 #include "lldb/Target/Target.h"
 #include "lldb/Target/Thread.h"
 #include "lldb/Target/Unwind.h"
@@ -55,6 +56,44 @@ StackFrameList::~StackFrameList() {
   Clear();
 }
 
+SyntheticStackFrameList::SyntheticStackFrameList(
+    Thread &thread, lldb::StackFrameListSP input_frames,
+    const lldb::StackFrameListSP &prev_frames_sp, bool show_inline_frames)
+    : StackFrameList(thread, prev_frames_sp, show_inline_frames),
+      m_input_frames(std::move(input_frames)) {}
+
+bool SyntheticStackFrameList::FetchFramesUpTo(
+    uint32_t end_idx, InterruptionControl allow_interrupt) {
+  // Check if the thread has a synthetic frame provider.
+  if (auto provider_sp = m_thread.GetFrameProvider()) {
+    // Use the synthetic frame provider to generate frames lazily.
+    // Keep fetching until we reach end_idx or the provider returns an error.
+    for (uint32_t idx = m_frames.size(); idx <= end_idx; idx++) {
+      if (allow_interrupt &&
+          m_thread.GetProcess()->GetTarget().GetDebugger().InterruptRequested())
+        return true;
+      auto frame_or_err = provider_sp->GetFrameAtIndex(idx);
+      if (!frame_or_err) {
+        // Provider returned error - we've reached the end.
+        LLDB_LOG_ERROR(GetLog(LLDBLog::Thread), frame_or_err.takeError(),
+                       "Frame provider reached end at index {0}: {1}", idx);
+        SetAllFramesFetched();
+        break;
+      }
+      StackFrameSP frame_sp = *frame_or_err;
+      // Set the frame list weak pointer so ExecutionContextRef can resolve
+      // the frame without calling Thread::GetStackFrameList().
+      frame_sp->m_frame_list_wp = shared_from_this();
+      m_frames.push_back(frame_sp);
+    }
+
+    return false; // Not interrupted.
+  }
+
+  // If no provider, fall back to the base implementation.
+  return StackFrameList::FetchFramesUpTo(end_idx, allow_interrupt);
+}
+
 void StackFrameList::CalculateCurrentInlinedDepth() {
   uint32_t cur_inlined_depth = GetCurrentInlinedDepth();
   if (cur_inlined_depth == UINT32_MAX) {
@@ -330,6 +369,7 @@ void StackFrameList::SynthesizeTailCallFrames(StackFrame &next_frame) {
         m_thread.shared_from_this(), frame_idx, concrete_frame_idx, cfa,
         cfa_is_valid, pc, StackFrame::Kind::Regular, artificial,
         behaves_like_zeroth_frame, &sc);
+    synth_frame->m_frame_list_wp = shared_from_this();
     m_frames.push_back(synth_frame);
     LLDB_LOG(log, "Pushed frame {0} at {1:x}", callee->GetDisplayName(), pc);
   }
@@ -445,6 +485,7 @@ bool StackFrameList::FetchFramesUpTo(uint32_t end_idx,
           unwind_frame_sp = std::make_shared<StackFrame>(
               m_thread.shared_from_this(), m_frames.size(), idx, reg_ctx_sp,
               cfa, pc, behaves_like_zeroth_frame, nullptr);
+          unwind_frame_sp->m_frame_list_wp = shared_from_this();
           m_frames.push_back(unwind_frame_sp);
         }
       } else {
@@ -479,6 +520,7 @@ bool StackFrameList::FetchFramesUpTo(uint32_t end_idx,
       // although its concrete index will stay the same.
       SynthesizeTailCallFrames(*unwind_frame_sp.get());
 
+      unwind_frame_sp->m_frame_list_wp = shared_from_this();
       m_frames.push_back(unwind_frame_sp);
     }
 
@@ -503,6 +545,7 @@ bool StackFrameList::FetchFramesUpTo(uint32_t end_idx,
             unwind_frame_sp->GetRegisterContextSP(), cfa, next_frame_address,
             behaves_like_zeroth_frame, &next_frame_sc));
 
+        frame_sp->m_frame_list_wp = shared_from_this();
         m_frames.push_back(frame_sp);
         unwind_sc = next_frame_sc;
         curr_frame_address = next_frame_address;
@@ -559,6 +602,7 @@ bool StackFrameList::FetchFramesUpTo(uint32_t end_idx,
       prev_frame->UpdatePreviousFrameFromCurrentFrame(*curr_frame);
       // Now copy the fixed up previous frame into the current frames so the
       // pointer doesn't change.
+      prev_frame_sp->m_frame_list_wp = shared_from_this();
       m_frames[curr_frame_idx] = prev_frame_sp;
 
 #if defined(DEBUG_STACK_FRAMES)
diff --git a/lldb/source/Target/SyntheticFrameProvider.cpp b/lldb/source/Target/SyntheticFrameProvider.cpp
index 241ce82c39be3..97ff42d1ed53e 100644
--- a/lldb/source/Target/SyntheticFrameProvider.cpp
+++ b/lldb/source/Target/SyntheticFrameProvider.cpp
@@ -8,10 +8,12 @@
 
 #include "lldb/Target/SyntheticFrameProvider.h"
 #include "lldb/Core/PluginManager.h"
+#include "lldb/Interpreter/Interfaces/ScriptedFrameProviderInterface.h"
 #include "lldb/Target/Thread.h"
 #include "lldb/Utility/LLDBLog.h"
 #include "lldb/Utility/Log.h"
 #include "lldb/Utility/Status.h"
+#include "lldb/Utility/Stream.h"
 
 using namespace lldb;
 using namespace lldb_private;
@@ -21,12 +23,17 @@ SyntheticFrameProvider::SyntheticFrameProvider(StackFrameListSP input_frames)
 
 SyntheticFrameProvider::~SyntheticFrameProvider() = default;
 
-void SyntheticFrameProviderDescriptor::Dump(Stream *s) const {
+void ScriptedFrameProviderDescriptor::Dump(Stream *s) const {
   if (!s)
     return;
 
+  s->Format("  ID: {0:x}\n", GetID());
   s->Printf("  Name: %s\n", GetName().str().c_str());
 
+  std::string description = GetDescription();
+  if (!description.empty())
+    s->Printf("  Description: %s\n", description.c_str());
+
   // Show thread filter information.
   if (thread_specs.empty()) {
     s->PutCString("  Thread Filter: (applies to all threads)\n");
@@ -41,9 +48,23 @@ void SyntheticFrameProviderDescriptor::Dump(Stream *s) const {
   }
 }
 
+uint32_t ScriptedFrameProviderDescriptor::GetID() const {
+  if (!scripted_metadata_sp)
+    return 0;
+
+  return scripted_metadata_sp->GetID();
+}
+
+std::string ScriptedFrameProviderDescriptor::GetDescription() const {
+  // If we have an interface, call get_description() to fetch it.
+  if (interface_sp && scripted_metadata_sp)
+    return interface_sp->GetDescription(scripted_metadata_sp->GetClassName());
+  return {};
+}
+
 llvm::Expected<SyntheticFrameProviderSP> SyntheticFrameProvider::CreateInstance(
     StackFrameListSP input_frames,
-    const SyntheticFrameProviderDescriptor &descriptor) {
+    const ScriptedFrameProviderDescriptor &descriptor) {
   if (!input_frames)
     return llvm::createStringError(
         "cannot create synthetic frame provider: invalid input frames");
diff --git a/lldb/source/Target/Target.cpp b/lldb/source/Target/Target.cpp
index 3a936b85f6339..b6a662ad3f14d 100644
--- a/lldb/source/Target/Target.cpp
+++ b/lldb/source/Target/Target.cpp
@@ -3720,6 +3720,61 @@ Status Target::Attach(ProcessAttachInfo &attach_info, Stream *stream) {
   return error;
 }
 
+llvm::Expected<uint32_t> Target::AddScriptedFrameProviderDescriptor(
+    const ScriptedFrameProviderDescriptor &descriptor) {
+  if (!descriptor.IsValid())
+    return llvm::createStringError("invalid frame provider descriptor");
+
+  llvm::StringRef name = descriptor.GetName();
+  if (name.empty())
+    return llvm::createStringError(
+        "frame provider descriptor has no class name");
+
+  std::lock_guard<std::recursive_mutex> guard(
+      m_frame_provider_descriptors_mutex);
+
+  uint32_t descriptor_id = descriptor.GetID();
+  m_frame_provider_descriptors[descriptor_id] = descriptor;
+
+  // Clear frame providers on existing threads so they reload with new config.
+  if (ProcessSP process_sp = GetProcessSP())
+    for (ThreadSP thread_sp : process_sp->Threads())
+      thread_sp->ClearScriptedFrameProvider();
+
+  return descriptor_id;
+}
+
+bool Target::RemoveScriptedFrameProviderDescriptor(uint32_t id) {
+  std::lock_guard<std::recursive_mutex> guard(
+      m_frame_provider_descriptors_mutex);
+  bool removed = m_frame_provider_descriptors.erase(id);
+
+  if (removed)
+    if (ProcessSP process_sp = GetProcessSP())
+      for (ThreadSP thread_sp : process_sp->Threads())
+        thread_sp->ClearScriptedFrameProvider();
+
+  return removed;
+}
+
+void Target::ClearScriptedFrameProviderDescriptors() {
+  std::lock_guard<std::recursive_mutex> guard(
+      m_frame_provider_descriptors_mutex);
+
+  m_frame_provider_descriptors.clear();
+
+  if (ProcessSP process_sp = GetProcessSP())
+    for (ThreadSP thread_sp : process_sp->Threads())
+      thread_sp->ClearScriptedFrameProvider();
+}
+
+const llvm::DenseMap<uint32_t, ScriptedFrameProviderDescriptor> &
+Target::GetScriptedFrameProviderDescriptors() const {
+  std::lock_guard<std::recursive_mutex> guard(
+      m_frame_provider_descriptors_mutex);
+  return m_frame_provider_descriptors;
+}
+
 void Target::FinalizeFileActions(ProcessLaunchInfo &info) {
   Log *log = GetLog(LLDBLog::Process);
 
diff --git a/lldb/source/Target/TargetList.cpp b/lldb/source/Target/TargetList.cpp
index 2e03bc1e38ea0..ce04e9c1209b8 100644
--- a/lldb/source/Target/TargetList.cpp
+++ b/lldb/source/Target/TargetList.cpp
@@ -48,7 +48,7 @@ Status TargetList::CreateTarget(Debugger &debugger,
                                 LoadDependentFiles load_dependent_files,
                                 const OptionGroupPlatform *platform_options,
                                 TargetSP &target_sp) {
-  std::lock_guard<std::recursive_mutex> guard(m_target_list_mutex);
+
   auto result = TargetList::CreateTargetInternal(
       debugger, user_exe_path, triple_str, load_dependent_files,
       platform_options, target_sp);
@@ -63,7 +63,7 @@ Status TargetList::CreateTarget(Debugger &debugger,
                                 const ArchSpec &specified_arch,
                                 LoadDependentFiles load_dependent_files,
                                 PlatformSP &platform_sp, TargetSP &target_sp) {
-  std::lock_guard<std::recursive_mutex> guard(m_target_list_mutex);
+
   auto result = TargetList::CreateTargetInternal(
       debugger, user_exe_path, specified_arch, load_dependent_files,
       platform_sp, target_sp);
@@ -521,6 +521,7 @@ uint32_t TargetList::GetIndexOfTarget(lldb::TargetSP target_sp) const {
 }
 
 void TargetList::AddTargetInternal(TargetSP target_sp, bool do_select) {
+  std::lock_guard<std::recursive_mutex> guard(m_target_list_mutex);
   lldbassert(!llvm::is_contained(m_target_list, target_sp) &&
              "target already exists it the list");
   UnregisterInProcessTarget(target_sp);
diff --git a/lldb/source/Target/Thread.cpp b/lldb/source/Target/Thread.cpp
index 8c3e19725f8cb..b40e753aca1e9 100644
--- a/lldb/source/Target/Thread.cpp
+++ b/lldb/source/Target/Thread.cpp
@@ -13,9 +13,12 @@
 #include "lldb/Core/Module.h"
 #include "lldb/Core/StructuredDataImpl.h"
 #include "lldb/Host/Host.h"
+#include "lldb/Interpreter/Interfaces/ScriptedFrameInterface.h"
+#include "lldb/Interpreter/Interfaces/ScriptedFrameProviderInterface.h"
 #include "lldb/Interpreter/OptionValueFileSpecList.h"
 #include "lldb/Interpreter/OptionValueProperties.h"
 #include "lldb/Interpreter/Property.h"
+#include "lldb/Interpreter/ScriptInterpreter.h"
 #include "lldb/Symbol/Function.h"
 #include "lldb/Target/ABI.h"
 #include "lldb/Target/DynamicLoader.h"
@@ -26,6 +29,7 @@
 #include "lldb/Target/ScriptedThreadPlan.h"
 #include "lldb/Target/StackFrameRecognizer.h"
 #include "lldb/Target/StopInfo.h"
+#include "lldb/Target/SyntheticFrameProvider.h"
 #include "lldb/Target/SystemRuntime.h"
 #include "lldb/Target/Target.h"
 #include "lldb/Target/ThreadPlan.h"
@@ -45,6 +49,7 @@
 #include "lldb/Utility/LLDBLog.h"
 #include "lldb/Utility/Log.h"
 #include "lldb/Utility/RegularExpression.h"
+#include "lldb/Utility/ScriptedMetadata.h"
 #include "lldb/Utility/State.h"
 #include "lldb/Utility/Stream.h"
 #include "lldb/Utility/StreamString.h"
@@ -257,6 +262,7 @@ void Thread::DestroyThread() {
   std::lock_guard<std::recursive_mutex> guard(m_frame_mutex);
   m_curr_frames_sp.reset();
   m_prev_frames_sp.reset();
+  m_frame_provider_sp.reset();
   m_prev_framezero_pc.reset();
 }
 
@@ -1439,13 +1445,76 @@ void Thread::CalculateExecutionContext(ExecutionContext &exe_ctx) {
 StackFrameListSP Thread::GetStackFrameList() {
   std::lock_guard<std::recursive_mutex> guard(m_frame_mutex);
 
-  if (!m_curr_frames_sp)
+  if (m_curr_frames_sp)
+    return m_curr_frames_sp;
+
+  // First, try to load a frame provider if we don't have one yet.
+  if (!m_frame_provider_sp) {
+    ProcessSP process_sp = GetProcess();
+    if (process_sp) {
+      Target &target = process_sp->GetTarget();
+      const auto &descriptors = target.GetScriptedFrameProviderDescriptors();
+
+      // Find first descriptor that applies to this thread.
+      for (const auto &entry : descriptors) {
+        const ScriptedFrameProviderDescriptor &descriptor = entry.second;
+        if (descriptor.IsValid() && descriptor.AppliesToThread(*this)) {
+          if (llvm::Error error = LoadScriptedFrameProvider(descriptor)) {
+            LLDB_LOG_ERROR(GetLog(LLDBLog::Thread), std::move(error),
+                           "Failed to load scripted frame provider: {0}");
+          }
+          break; // Use first matching descriptor (success or failure).
+        }
+      }
+    }
+  }
+
+  // Create the frame list based on whether we have a provider.
+  if (m_frame_provider_sp) {
+    // We have a provider - create synthetic frame list.
+    StackFrameListSP input_frames = m_frame_provider_sp->GetInputFrames();
+    m_curr_frames_sp = std::make_shared<SyntheticStackFrameList>(
+        *this, input_frames, m_prev_frames_sp, true);
+  } else {
+    // No provider - use normal unwinder frames.
     m_curr_frames_sp =
         std::make_shared<StackFrameList>(*this, m_prev_frames_sp, true);
+  }
 
   return m_curr_frames_sp;
 }
 
+llvm::Error Thread::LoadScriptedFrameProvider(
+    const ScriptedFrameProviderDescriptor &descriptor) {
+  std::lock_guard<std::recursive_mutex> guard(m_frame_mutex);
+
+  // Note: We don't create input_frames here - it will be created lazily
+  // by SyntheticStackFrameList when frames are first fetched.
+  // Creating them too early can cause crashes during thread initialization.
+
+  // Create a temporary StackFrameList just to get the thread reference for the
+  // provider. The provider won't actually use this - it will get real input
+  // frames from SyntheticStackFrameList later.
+  StackFrameListSP temp_frames =
+      std::make_shared<StackFrameList>(*this, m_prev_frames_sp, true);
+
+  auto provider_or_err =
+      SyntheticFrameProvider::CreateInstance(temp_frames, descriptor);
+  if (!provider_or_err)
+    return provider_or_err.takeError();
+
+  ClearScriptedFrameProvider();
+  m_frame_provider_sp = *provider_or_err;
+  return llvm::Error::success();
+}
+
+void Thread::ClearScriptedFrameProvider() {
+  std::lock_guard<std::recursive_mutex> guard(m_frame_mutex);
+  m_frame_provider_sp.reset();
+  m_curr_frames_sp.reset();
+  m_prev_frames_sp.reset();
+}
+
 std::optional<addr_t> Thread::GetPreviousFrameZeroPC() {
   return m_prev_framezero_pc;
 }
@@ -1466,6 +1535,7 @@ void Thread::ClearStackFrames() {
     m_prev_frames_sp.swap(m_curr_frames_sp);
   m_curr_frames_sp.reset();
 
+  m_frame_provider_sp.reset();
   m_extended_info.reset();
   m_extended_info_fetched = false;
 }
diff --git a/lldb/source/Target/ThreadPlanStepOut.cpp b/lldb/source/Target/ThreadPlanStepOut.cpp
index d49a01bdbcef7..0307b38d7d94b 100644
--- a/lldb/source/Target/ThreadPlanStepOut.cpp
+++ b/lldb/source/Target/ThreadPlanStepOut.cpp
@@ -356,13 +356,10 @@ bool ThreadPlanStepOut::DoPlanExplainsStop(Event *event_ptr) {
           }
         }
 
-        // If there was only one owner, then we're done.  But if we also hit
-        // some user breakpoint on our way out, we should mark ourselves as
-        // done, but also not claim to explain the stop, since it is more
-        // important to report the user breakpoint than the step out
-        // completion.
-
-        if (site_sp->GetNumberOfConstituents() == 1)
+        // If the thread also hit a user breakpoint on its way out, the plan is
+        // done but should not claim to explain the stop. It is more important
+        // to report the user breakpoint than the step out completion.
+        if (!site_sp->ContainsUserBreakpointForThread(GetThread()))
           return true;
       }
       return false;
diff --git a/lldb/source/Target/ThreadSpec.cpp b/lldb/source/Target/ThreadSpec.cpp
index ba4c3aa894553..624f64e3af800 100644
--- a/lldb/source/Target/ThreadSpec.cpp
+++ b/lldb/source/Target/ThreadSpec.cpp
@@ -19,6 +19,10 @@ const char *ThreadSpec::g_option_names[static_cast<uint32_t>(
 
 ThreadSpec::ThreadSpec() : m_name(), m_queue_name() {}
 
+ThreadSpec::ThreadSpec(Thread &thread)
+    : m_index(thread.GetIndexID()), m_tid(thread.GetID()),
+      m_name(thread.GetName()), m_queue_name(thread.GetQueueName()) {}
+
 std::unique_ptr<ThreadSpec> ThreadSpec::CreateFromStructuredData(
     const StructuredData::Dictionary &spec_dict, Status &error) {
   uint32_t index = UINT32_MAX;
diff --git a/lldb/source/Utility/CMakeLists.txt b/lldb/source/Utility/CMakeLists.txt
index 338b8bd8b0ef1..80b53f8c098d2 100644
--- a/lldb/source/Utility/CMakeLists.txt
+++ b/lldb/source/Utility/CMakeLists.txt
@@ -77,6 +77,7 @@ add_lldb_library(lldbUtility NO_INTERNAL_DEPENDENCIES
   UserIDResolver.cpp
   VASprintf.cpp
   VMRange.cpp
+  VirtualDataExtractor.cpp
   XcodeSDK.cpp
   ZipFile.cpp
 
diff --git a/lldb/source/Utility/VirtualDataExtractor.cpp b/lldb/source/Utility/VirtualDataExtractor.cpp
new file mode 100644
index 0000000000000..a23e43b383d25
--- /dev/null
+++ b/lldb/source/Utility/VirtualDataExtractor.cpp
@@ -0,0 +1,139 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "lldb/Utility/VirtualDataExtractor.h"
+#include <cassert>
+
+using namespace lldb;
+using namespace lldb_private;
+
+VirtualDataExtractor::VirtualDataExtractor(const void *data,
+                                           offset_t data_length,
+                                           ByteOrder byte_order,
+                                           uint32_t addr_size,
+                                           LookupTable lookup_table)
+    : DataExtractor(data, data_length, byte_order, addr_size),
+      m_lookup_table(std::move(lookup_table)) {
+  m_lookup_table.Sort();
+}
+
+VirtualDataExtractor::VirtualDataExtractor(const DataBufferSP &data_sp,
+                                           ByteOrder byte_order,
+                                           uint32_t addr_size,
+                                           LookupTable lookup_table)
+    : DataExtractor(data_sp, byte_order, addr_size),
+      m_lookup_table(std::move(lookup_table)) {
+  m_lookup_table.Sort();
+}
+
+const VirtualDataExtractor::LookupTable::Entry *
+VirtualDataExtractor::FindEntry(offset_t virtual_addr) const {
+  // Use RangeDataVector's binary search instead of linear search.
+  return m_lookup_table.FindEntryThatContains(virtual_addr);
+}
+
+bool VirtualDataExtractor::ValidateVirtualRead(offset_t virtual_addr,
+                                               offset_t length) const {
+  const LookupTable::Entry *entry = FindEntry(virtual_addr);
+  if (!entry)
+    return false;
+
+  // Assert that the read does not cross entry boundaries.
+  // RangeData.Contains() checks if a range is fully contained.
+  assert(entry->Contains(LookupTable::Range(virtual_addr, length)) &&
+         "Read crosses lookup table entry boundary");
+
+  // Also validate that the physical offset is within the data buffer.
+  // RangeData.data contains the physical offset.
+  offset_t physical_offset = entry->data + (virtual_addr - entry->base);
+  return ValidOffsetForDataOfSize(physical_offset, length);
+}
+
+const void *VirtualDataExtractor::GetData(offset_t *offset_ptr,
+                                          offset_t length) const {
+  // Override to treat offset as virtual address.
+  if (!offset_ptr)
+    return nullptr;
+
+  offset_t virtual_addr = *offset_ptr;
+
+  if (!ValidateVirtualRead(virtual_addr, length))
+    return nullptr;
+
+  const LookupTable::Entry *entry = FindEntry(virtual_addr);
+  assert(entry && "ValidateVirtualRead should have found an entry");
+
+  offset_t physical_offset = entry->data + (virtual_addr - entry->base);
+  // Use base class PeekData directly to avoid recursion.
+  const void *result = DataExtractor::PeekData(physical_offset, length);
+
+  if (result) {
+    // Advance the virtual offset pointer.
+    *offset_ptr += length;
+  }
+
+  return result;
+}
+
+const uint8_t *VirtualDataExtractor::PeekData(offset_t offset,
+                                              offset_t length) const {
+  // Override to treat offset as virtual address.
+  if (!ValidateVirtualRead(offset, length))
+    return nullptr;
+
+  const LookupTable::Entry *entry = FindEntry(offset);
+  assert(entry && "ValidateVirtualRead should have found an entry");
+
+  offset_t physical_offset = entry->data + (offset - entry->base);
+  // Use the base class PeekData with the physical offset.
+  return DataExtractor::PeekData(physical_offset, length);
+}
+
+uint8_t VirtualDataExtractor::GetU8_unchecked(offset_t *offset_ptr) const {
+  offset_t virtual_addr = *offset_ptr;
+  const LookupTable::Entry *entry = FindEntry(virtual_addr);
+  assert(entry && "Unchecked methods require valid virtual address");
+
+  offset_t physical_offset = entry->data + (virtual_addr - entry->base);
+  uint8_t result = DataExtractor::GetU8_unchecked(&physical_offset);
+  *offset_ptr += 1;
+  return result;
+}
+
+uint16_t VirtualDataExtractor::GetU16_unchecked(offset_t *offset_ptr) const {
+  offset_t virtual_addr = *offset_ptr;
+  const LookupTable::Entry *entry = FindEntry(virtual_addr);
+  assert(entry && "Unchecked methods require valid virtual address");
+
+  offset_t physical_offset = entry->data + (virtual_addr - entry->base);
+  uint16_t result = DataExtractor::GetU16_unchecked(&physical_offset);
+  *offset_ptr += 2;
+  return result;
+}
+
+uint32_t VirtualDataExtractor::GetU32_unchecked(offset_t *offset_ptr) const {
+  offset_t virtual_addr = *offset_ptr;
+  const LookupTable::Entry *entry = FindEntry(virtual_addr);
+  assert(entry && "Unchecked methods require valid virtual address");
+
+  offset_t physical_offset = entry->data + (virtual_addr - entry->base);
+  uint32_t result = DataExtractor::GetU32_unchecked(&physical_offset);
+  *offset_ptr += 4;
+  return result;
+}
+
+uint64_t VirtualDataExtractor::GetU64_unchecked(offset_t *offset_ptr) const {
+  offset_t virtual_addr = *offset_ptr;
+  const LookupTable::Entry *entry = FindEntry(virtual_addr);
+  assert(entry && "Unchecked methods require valid virtual address");
+
+  offset_t physical_offset = entry->data + (virtual_addr - entry->base);
+  uint64_t result = DataExtractor::GetU64_unchecked(&physical_offset);
+  *offset_ptr += 8;
+  return result;
+}
diff --git a/lldb/source/ValueObject/DILAST.cpp b/lldb/source/ValueObject/DILAST.cpp
index 7ed34db6e20df..0b9e1f4d48ac8 100644
--- a/lldb/source/ValueObject/DILAST.cpp
+++ b/lldb/source/ValueObject/DILAST.cpp
@@ -51,4 +51,8 @@ BooleanLiteralNode::Accept(Visitor *v) const {
   return v->Visit(this);
 }
 
+llvm::Expected<lldb::ValueObjectSP> CastNode::Accept(Visitor *v) const {
+  return v->Visit(this);
+}
+
 } // namespace lldb_private::dil
diff --git a/lldb/source/ValueObject/DILEval.cpp b/lldb/source/ValueObject/DILEval.cpp
index 40a05a467f883..dc0d93d242739 100644
--- a/lldb/source/ValueObject/DILEval.cpp
+++ b/lldb/source/ValueObject/DILEval.cpp
@@ -740,4 +740,16 @@ Interpreter::Visit(const BooleanLiteralNode *node) {
   return ValueObject::CreateValueObjectFromBool(m_target, value, "result");
 }
 
+llvm::Expected<lldb::ValueObjectSP> Interpreter::Visit(const CastNode *node) {
+  auto operand_or_err = Evaluate(node->GetOperand());
+  if (!operand_or_err)
+    return operand_or_err;
+
+  lldb::ValueObjectSP operand = *operand_or_err;
+  // The cast itself is not implemented yet -- that code will be added later.
+  // For now, just return an error.
+  return llvm::make_error<DILDiagnosticError>(
+      m_expr, "Type casting is not supported here.", node->GetLocation());
+}
+
 } // namespace lldb_private::dil
diff --git a/lldb/source/ValueObject/DILParser.cpp b/lldb/source/ValueObject/DILParser.cpp
index 072ddff1e28d2..e94ce31f3b979 100644
--- a/lldb/source/ValueObject/DILParser.cpp
+++ b/lldb/source/ValueObject/DILParser.cpp
@@ -13,7 +13,9 @@
 
 #include "lldb/ValueObject/DILParser.h"
 #include "lldb/Host/common/DiagnosticsRendering.h"
+#include "lldb/Symbol/CompileUnit.h"
 #include "lldb/Target/ExecutionContextScope.h"
+#include "lldb/Target/LanguageRuntime.h"
 #include "lldb/ValueObject/DILAST.h"
 #include "lldb/ValueObject/DILEval.h"
 #include "llvm/ADT/StringRef.h"
@@ -80,15 +82,63 @@ ASTNodeUP DILParser::Run() {
 // Parse an expression.
 //
 //  expression:
-//    unary_expression
+//    cast_expression
 //
-ASTNodeUP DILParser::ParseExpression() { return ParseUnaryExpression(); }
+ASTNodeUP DILParser::ParseExpression() { return ParseCastExpression(); }
+
+// Parse a cast_expression.
+//
+//  cast_expression:
+//    unary_expression
+//    "(" type_id ")" cast_expression
+//
+ASTNodeUP DILParser::ParseCastExpression() {
+  if (!CurToken().Is(Token::l_paren))
+    return ParseUnaryExpression();
+
+  // This could be a type cast, try parsing the contents as a type declaration.
+  Token token = CurToken();
+  uint32_t loc = token.GetLocation();
+
+  // Enable lexer backtracking so that we can roll back in case it's not
+  // actually a type declaration.
+
+  // Start tentative parsing (save the token index for a possible rollback).
+  uint32_t save_token_idx = m_dil_lexer.GetCurrentTokenIdx();
+
+  // Consume the token only after enabling the backtracking.
+  m_dil_lexer.Advance();
+
+  // Try parsing the type declaration. If parsing fails, roll back and try
+  // parsing the contents as an expression instead.
+  auto type_id = ParseTypeId();
+  if (type_id) {
+    // Successfully parsed the type declaration. Commit the tentatively
+    // consumed tokens and parse the cast_expression.
+
+    if (!type_id.value().IsValid())
+      return std::make_unique<ErrorNode>();
+
+    Expect(Token::r_paren);
+    m_dil_lexer.Advance();
+    auto rhs = ParseCastExpression();
+
+    return std::make_unique<CastNode>(loc, type_id.value(), std::move(rhs),
+                                      CastKind::eNone);
+  }
+
+  // Failed to parse the contents of the parentheses as a type declaration.
+  // Rollback the lexer and try parsing it as unary_expression.
+  TentativeParsingRollback(save_token_idx);
+
+  return ParseUnaryExpression();
+}
 
 // Parse an unary_expression.
 //
 //  unary_expression:
 //    postfix_expression
-//    unary_operator expression
+//    unary_operator cast_expression
 //
 //  unary_operator:
 //    "&"
@@ -102,7 +152,7 @@ ASTNodeUP DILParser::ParseUnaryExpression() {
     Token token = CurToken();
     uint32_t loc = token.GetLocation();
     m_dil_lexer.Advance();
-    auto rhs = ParseExpression();
+    auto rhs = ParseCastExpression();
     switch (token.GetKind()) {
     case Token::star:
       return std::make_unique<UnaryOpNode>(loc, UnaryOpKind::Deref,
@@ -282,6 +332,81 @@ std::string DILParser::ParseNestedNameSpecifier() {
   }
 }
 
+// Parse a type_id.
+//
+//  type_id:
+//    type_specifier_seq [abstract_declarator]
+//
+//  type_specifier_seq:
+//    type_specifier [type_specifier]
+//
+//  type_specifier:
+//    ["::"] [nested_name_specifier] type_name // not handled for now!
+//    builtin_typename
+//
+std::optional<CompilerType> DILParser::ParseTypeId() {
+  // For now only allow builtin types -- this will be expanded later.
+  auto maybe_builtin_type = ParseBuiltinType();
+  if (!maybe_builtin_type)
+    return {};
+  CompilerType type = *maybe_builtin_type;
+
+  //
+  //  abstract_declarator:
+  //    ptr_operator [abstract_declarator]
+  //
+  std::vector<Token> ptr_operators;
+  while (CurToken().IsOneOf({Token::star, Token::amp})) {
+    Token tok = CurToken();
+    ptr_operators.push_back(std::move(tok));
+    m_dil_lexer.Advance();
+  }
+  type = ResolveTypeDeclarators(type, ptr_operators);
+
+  return type;
+}
+
+// Parse a built-in type.
+//
+//  builtin_typename:
+//    identifier_seq
+//
+//  identifier_seq:
+//    identifier [identifier_seq]
+//
+// A built-in type can be a single identifier or a space-separated
+// list of identifiers (e.g. "short" or "long long").
+std::optional<CompilerType> DILParser::ParseBuiltinType() {
+  std::string type_name;
+  uint32_t save_token_idx = m_dil_lexer.GetCurrentTokenIdx();
+  bool first_word = true;
+  while (CurToken().GetKind() == Token::identifier) {
+    // Skip cv-qualifiers; advance the lexer so the loop makes progress.
+    if (CurToken().GetSpelling() == "const" ||
+        CurToken().GetSpelling() == "volatile") {
+      m_dil_lexer.Advance();
+      continue;
+    }
+    if (!first_word)
+      type_name.push_back(' ');
+    else
+      first_word = false;
+    type_name.append(CurToken().GetSpelling());
+    m_dil_lexer.Advance();
+  }
+
+  if (type_name.size() > 0) {
+    lldb::TargetSP target_sp = m_ctx_scope->CalculateTarget();
+    ConstString const_type_name(type_name.c_str());
+    for (auto type_system_sp : target_sp->GetScratchTypeSystems())
+      if (auto compiler_type =
+              type_system_sp->GetBuiltinTypeByName(const_type_name))
+        return compiler_type;
+  }
+
+  TentativeParsingRollback(save_token_idx);
+  return {};
+}
+
 // Parse an id_expression.
 //
 //  id_expression:
@@ -347,6 +472,40 @@ std::string DILParser::ParseUnqualifiedId() {
   return identifier;
 }
 
+CompilerType
+DILParser::ResolveTypeDeclarators(CompilerType type,
+                                  const std::vector<Token> &ptr_operators) {
+  // Resolve pointers/references.
+  for (Token tk : ptr_operators) {
+    uint32_t loc = tk.GetLocation();
+    if (tk.GetKind() == Token::star) {
+      // Pointers to reference types are forbidden.
+      if (type.IsReferenceType()) {
+        BailOut(llvm::formatv("'type name' declared as a pointer to a "
+                              "reference of type {0}",
+                              type.TypeDescription()),
+                loc, CurToken().GetSpelling().length());
+        return {};
+      }
+      // Get pointer type for the base type: e.g. int* -> int**.
+      type = type.GetPointerType();
+
+    } else if (tk.GetKind() == Token::amp) {
+      // References to references are forbidden.
+      // FIXME: In future we may want to allow rvalue references (i.e. &&).
+      if (type.IsReferenceType()) {
+        BailOut("type name declared as a reference to a reference", loc,
+                CurToken().GetSpelling().length());
+        return {};
+      }
+      // Get reference type for the base type: e.g. int -> int&.
+      type = type.GetLValueReferenceType();
+    }
+  }
+
+  return type;
+}
+
 // Parse an boolean_literal.
 //
 //  boolean_literal:
diff --git a/lldb/source/ValueObject/ValueObjectSynthetic.cpp b/lldb/source/ValueObject/ValueObjectSynthetic.cpp
index f673c51a88412..44e53bd5fd82e 100644
--- a/lldb/source/ValueObject/ValueObjectSynthetic.cpp
+++ b/lldb/source/ValueObject/ValueObjectSynthetic.cpp
@@ -443,3 +443,18 @@ void ValueObjectSynthetic::SetLanguageFlags(uint64_t flags) {
   else
     this->ValueObject::SetLanguageFlags(flags);
 }
+
+void ValueObjectSynthetic::GetExpressionPath(Stream &stream,
+                                             GetExpressionPathFormat epformat) {
+  // A synthetic ValueObject may wrap an underlying Register or RegisterSet
+  // ValueObject, which requires a different approach to generating the
+  // expression path. In such cases, delegate to the non-synthetic value object.
+  if (const lldb::ValueType obj_value_type = GetValueType();
+      IsSynthetic() && (obj_value_type == lldb::eValueTypeRegister ||
+                        obj_value_type == lldb::eValueTypeRegisterSet)) {
+
+    if (const lldb::ValueObjectSP raw_value = GetNonSyntheticValue())
+      return raw_value->GetExpressionPath(stream, epformat);
+  }
+  return ValueObject::GetExpressionPath(stream, epformat);
+}
diff --git a/lldb/test/API/functionalities/data-formatter/data-formatter-stl/generic/span/TestDataFormatterStdSpan.py b/lldb/test/API/functionalities/data-formatter/data-formatter-stl/generic/span/TestDataFormatterStdSpan.py
index a45c0ff551323..f586fb3d698c1 100644
--- a/lldb/test/API/functionalities/data-formatter/data-formatter-stl/generic/span/TestDataFormatterStdSpan.py
+++ b/lldb/test/API/functionalities/data-formatter/data-formatter-stl/generic/span/TestDataFormatterStdSpan.py
@@ -74,7 +74,7 @@ def do_test(self):
             result_summary="item 0 is 1",
         )
 
-        self.runCmd("type summary delete span")
+        self.runCmd("type summary clear")
 
         # New span with strings
         lldbutil.continue_to_breakpoint(process, bkpt)
@@ -155,12 +155,6 @@ def do_test(self):
         )
         self.check_size("nested", 2)
 
-    @skipIf(compiler="clang", compiler_version=["<", "11.0"])
-    @add_test_categories(["libc++"])
-    def test_libcxx(self):
-        self.build(dictionary={"USE_LIBCPP": 1})
-        self.do_test()
-
     def do_test_ref_and_ptr(self):
         """Test that std::span is correctly formatted when passed by ref and ptr"""
         (self.target, process, thread, bkpt) = lldbutil.run_to_source_breakpoint(
@@ -174,8 +168,26 @@ def do_test_ref_and_ptr(self):
 
         self.expect("frame variable ptr", patterns=["ptr = 0x[0-9a-f]+ size=5"])
 
+    @skipIf(compiler="clang", compiler_version=["<", "11.0"])
+    @add_test_categories(["libc++"])
+    def test_libcxx(self):
+        self.build(dictionary={"USE_LIBCPP": 1})
+        self.do_test()
+
     @skipIf(compiler="clang", compiler_version=["<", "11.0"])
     @add_test_categories(["libc++"])
     def test_ref_and_ptr_libcxx(self):
         self.build(dictionary={"USE_LIBCPP": 1})
         self.do_test_ref_and_ptr()
+
+    @skipIf(compiler="clang", compiler_version=["<", "11.0"])
+    @add_test_categories(["libstdcxx"])
+    def test_libstdcxx(self):
+        self.build(dictionary={"USE_LIBSTDCPP": 1})
+        self.do_test()
+
+    @skipIf(compiler="clang", compiler_version=["<", "11.0"])
+    @add_test_categories(["libstdcxx"])
+    def test_ref_and_ptr_libstdcxx(self):
+        self.build(dictionary={"USE_LIBSTDCPP": 1})
+        self.do_test_ref_and_ptr()
diff --git a/lldb/test/API/functionalities/scripted_frame_provider/Makefile b/lldb/test/API/functionalities/scripted_frame_provider/Makefile
new file mode 100644
index 0000000000000..99998b20bcb05
--- /dev/null
+++ b/lldb/test/API/functionalities/scripted_frame_provider/Makefile
@@ -0,0 +1,3 @@
+CXX_SOURCES := main.cpp
+
+include Makefile.rules
diff --git a/lldb/test/API/functionalities/scripted_frame_provider/TestScriptedFrameProvider.py b/lldb/test/API/functionalities/scripted_frame_provider/TestScriptedFrameProvider.py
new file mode 100644
index 0000000000000..caed94f5f93da
--- /dev/null
+++ b/lldb/test/API/functionalities/scripted_frame_provider/TestScriptedFrameProvider.py
@@ -0,0 +1,428 @@
+"""
+Test scripted frame provider functionality.
+"""
+
+import os
+
+import lldb
+import lldbsuite.test.lldbplatformutil as lldbplatformutil
+from lldbsuite.test.decorators import *
+from lldbsuite.test.lldbtest import TestBase
+from lldbsuite.test import lldbutil
+
+@skipIf(oslist=["linux"], archs=["arm$"])
+class ScriptedFrameProviderTestCase(TestBase):
+    NO_DEBUG_INFO_TESTCASE = True
+
+    def setUp(self):
+        TestBase.setUp(self)
+        self.source = "main.cpp"
+
+    def test_replace_all_frames(self):
+        """Test that we can replace the entire stack."""
+        self.build()
+        target, process, thread, bkpt = lldbutil.run_to_source_breakpoint(
+            self, "Break here", lldb.SBFileSpec(self.source), only_one_thread=False
+        )
+
+        # Import the test frame provider
+        script_path = os.path.join(self.getSourceDir(), "test_frame_providers.py")
+        self.runCmd("command script import " + script_path)
+
+        # Attach the Replace provider
+        error = lldb.SBError()
+        provider_id = target.RegisterScriptedFrameProvider(
+            "test_frame_providers.ReplaceFrameProvider",
+            lldb.SBStructuredData(),
+            error,
+        )
+        self.assertTrue(error.Success(), f"Failed to register provider: {error}")
+        self.assertNotEqual(provider_id, 0, "Provider ID should be non-zero")
+
+        # Verify we have exactly 3 synthetic frames
+        self.assertEqual(thread.GetNumFrames(), 3, "Should have 3 synthetic frames")
+
+        # Verify frame indices and PCs (dictionary-based frames don't have custom function names)
+        frame0 = thread.GetFrameAtIndex(0)
+        self.assertIsNotNone(frame0)
+        self.assertEqual(frame0.GetPC(), 0x1000)
+
+        frame1 = thread.GetFrameAtIndex(1)
+        self.assertIsNotNone(frame1)
+        self.assertIn("thread_func", frame1.GetFunctionName())
+
+        frame2 = thread.GetFrameAtIndex(2)
+        self.assertIsNotNone(frame2)
+        self.assertEqual(frame2.GetPC(), 0x3000)
+
+    def test_prepend_frames(self):
+        """Test that we can add frames before real stack."""
+        self.build()
+        target, process, thread, bkpt = lldbutil.run_to_source_breakpoint(
+            self, "Break here", lldb.SBFileSpec(self.source), only_one_thread=False
+        )
+
+        # Get original frame count and PC
+        original_frame_count = thread.GetNumFrames()
+        self.assertGreaterEqual(
+            original_frame_count, 2, "Should have at least 2 real frames"
+        )
+
+        # Import and attach Prepend provider
+        script_path = os.path.join(self.getSourceDir(), "test_frame_providers.py")
+        self.runCmd("command script import " + script_path)
+
+        error = lldb.SBError()
+        provider_id = target.RegisterScriptedFrameProvider(
+            "test_frame_providers.PrependFrameProvider",
+            lldb.SBStructuredData(),
+            error,
+        )
+        self.assertTrue(error.Success(), f"Failed to register provider: {error}")
+        self.assertNotEqual(provider_id, 0, "Provider ID should be non-zero")
+
+        # Verify we have 2 more frames
+        new_frame_count = thread.GetNumFrames()
+        self.assertEqual(new_frame_count, original_frame_count + 2)
+
+        # Verify first 2 frames are synthetic (check PCs, not function names)
+        frame0 = thread.GetFrameAtIndex(0)
+        self.assertEqual(frame0.GetPC(), 0x9000)
+
+        frame1 = thread.GetFrameAtIndex(1)
+        self.assertEqual(frame1.GetPC(), 0xA000)
+
+        # Verify frame 2 is the original real frame 0
+        frame2 = thread.GetFrameAtIndex(2)
+        self.assertIn("thread_func", frame2.GetFunctionName())
+
+    def test_append_frames(self):
+        """Test that we can add frames after real stack."""
+        self.build()
+        target, process, thread, bkpt = lldbutil.run_to_source_breakpoint(
+            self, "Break here", lldb.SBFileSpec(self.source), only_one_thread=False
+        )
+
+        # Get original frame count
+        original_frame_count = thread.GetNumFrames()
+
+        # Import and attach Append provider
+        script_path = os.path.join(self.getSourceDir(), "test_frame_providers.py")
+        self.runCmd("command script import " + script_path)
+
+        error = lldb.SBError()
+        provider_id = target.RegisterScriptedFrameProvider(
+            "test_frame_providers.AppendFrameProvider",
+            lldb.SBStructuredData(),
+            error,
+        )
+        self.assertTrue(error.Success(), f"Failed to register provider: {error}")
+        self.assertNotEqual(provider_id, 0, "Provider ID should be non-zero")
+
+        # Verify we have 1 more frame
+        new_frame_count = thread.GetNumFrames()
+        self.assertEqual(new_frame_count, original_frame_count + 1)
+
+        # Verify first frames are still real
+        frame0 = thread.GetFrameAtIndex(0)
+        self.assertIn("thread_func", frame0.GetFunctionName())
+
+        frame_n_plus_1 = thread.GetFrameAtIndex(new_frame_count - 1)
+        self.assertEqual(frame_n_plus_1.GetPC(), 0x10)
+
+    def test_scripted_frame_objects(self):
+        """Test that provider can return ScriptedFrame objects."""
+        self.build()
+        target, process, thread, bkpt = lldbutil.run_to_source_breakpoint(
+            self, "Break here", lldb.SBFileSpec(self.source), only_one_thread=False
+        )
+
+        # Import the provider that returns ScriptedFrame objects
+        script_path = os.path.join(self.getSourceDir(), "test_frame_providers.py")
+        self.runCmd("command script import " + script_path)
+
+        error = lldb.SBError()
+        provider_id = target.RegisterScriptedFrameProvider(
+            "test_frame_providers.ScriptedFrameObjectProvider",
+            lldb.SBStructuredData(),
+            error,
+        )
+        self.assertTrue(error.Success(), f"Failed to register provider: {error}")
+        self.assertNotEqual(provider_id, 0, "Provider ID should be non-zero")
+
+        # Verify we have 5 frames
+        self.assertEqual(
+            thread.GetNumFrames(), 5, "Should have 5 custom scripted frames"
+        )
+
+        # Verify frame properties from CustomScriptedFrame
+        frame0 = thread.GetFrameAtIndex(0)
+        self.assertIsNotNone(frame0)
+        self.assertEqual(frame0.GetFunctionName(), "custom_scripted_frame_0")
+        self.assertEqual(frame0.GetPC(), 0x5000)
+        self.assertTrue(frame0.IsSynthetic(), "Frame should be marked as synthetic")
+
+        frame1 = thread.GetFrameAtIndex(1)
+        self.assertIsNotNone(frame1)
+        self.assertEqual(frame1.GetPC(), 0x6000)
+
+        frame2 = thread.GetFrameAtIndex(2)
+        self.assertIsNotNone(frame2)
+        self.assertEqual(frame2.GetFunctionName(), "custom_scripted_frame_2")
+        self.assertEqual(frame2.GetPC(), 0x7000)
+        self.assertTrue(frame2.IsSynthetic(), "Frame should be marked as synthetic")
+
+    def test_applies_to_thread(self):
+        """Test that applies_to_thread filters which threads get the provider."""
+        self.build()
+        target, process, thread, bkpt = lldbutil.run_to_source_breakpoint(
+            self, "Break here", lldb.SBFileSpec(self.source), only_one_thread=False
+        )
+
+        # We should have at least 2 threads (worker threads) at the breakpoint
+        num_threads = process.GetNumThreads()
+        self.assertGreaterEqual(
+            num_threads, 2, "Should have at least 2 threads at breakpoint"
+        )
+
+        # Import the test frame provider
+        script_path = os.path.join(self.getSourceDir(), "test_frame_providers.py")
+        self.runCmd("command script import " + script_path)
+
+        # Collect original thread info before applying provider
+        thread_info = {}
+        for i in range(num_threads):
+            t = process.GetThreadAtIndex(i)
+            thread_info[t.GetIndexID()] = {
+                "frame_count": t.GetNumFrames(),
+                "pc": t.GetFrameAtIndex(0).GetPC(),
+            }
+
+        # Register the ThreadFilterFrameProvider which only applies to thread ID 1
+        error = lldb.SBError()
+        provider_id = target.RegisterScriptedFrameProvider(
+            "test_frame_providers.ThreadFilterFrameProvider",
+            lldb.SBStructuredData(),
+            error,
+        )
+        self.assertTrue(error.Success(), f"Failed to register provider: {error}")
+        self.assertNotEqual(provider_id, 0, "Provider ID should be non-zero")
+
+        # Check each thread
+        thread_id_1_found = False
+        # On ARM32, FixCodeAddress clears bit 0, so synthetic PCs get modified
+        is_arm_32bit = lldbplatformutil.getArchitecture() == "arm"
+        expected_synthetic_pc = 0xFFFE if is_arm_32bit else 0xFFFF
+
+        for i in range(num_threads):
+            t = process.GetThreadAtIndex(i)
+            thread_id = t.GetIndexID()
+
+            if thread_id == 1:
+                # Thread with ID 1 should have synthetic frame
+                thread_id_1_found = True
+                self.assertEqual(
+                    t.GetNumFrames(),
+                    1,
+                    f"Thread with ID 1 should have 1 synthetic frame",
+                )
+                self.assertEqual(
+                    t.GetFrameAtIndex(0).GetPC(),
+                    expected_synthetic_pc,
+                    f"Thread with ID 1 should have synthetic PC {expected_synthetic_pc:#x}",
+                )
+            else:
+                # Other threads should keep their original frames
+                self.assertEqual(
+                    t.GetNumFrames(),
+                    thread_info[thread_id]["frame_count"],
+                    f"Thread with ID {thread_id} should not be affected by provider",
+                )
+                self.assertEqual(
+                    t.GetFrameAtIndex(0).GetPC(),
+                    thread_info[thread_id]["pc"],
+                    f"Thread with ID {thread_id} should have its original PC",
+                )
+
+        # We should have found at least one thread with ID 1
+        self.assertTrue(
+            thread_id_1_found,
+            "Should have found a thread with ID 1 to test filtering",
+        )
+
+    def test_remove_frame_provider_by_id(self):
+        """Test that RemoveScriptedFrameProvider removes a specific provider by ID."""
+        self.build()
+        target, process, thread, bkpt = lldbutil.run_to_source_breakpoint(
+            self, "Break here", lldb.SBFileSpec(self.source), only_one_thread=False
+        )
+
+        # Import the test frame providers
+        script_path = os.path.join(self.getSourceDir(), "test_frame_providers.py")
+        self.runCmd("command script import " + script_path)
+
+        # Get original frame count
+        original_frame_count = thread.GetNumFrames()
+        original_pc = thread.GetFrameAtIndex(0).GetPC()
+
+        # Register the first provider and get its ID
+        error = lldb.SBError()
+        provider_id_1 = target.RegisterScriptedFrameProvider(
+            "test_frame_providers.ReplaceFrameProvider",
+            lldb.SBStructuredData(),
+            error,
+        )
+        self.assertTrue(error.Success(), f"Failed to register provider 1: {error}")
+
+        # Verify first provider is active (3 synthetic frames)
+        self.assertEqual(thread.GetNumFrames(), 3, "Should have 3 synthetic frames")
+        self.assertEqual(
+            thread.GetFrameAtIndex(0).GetPC(), 0x1000, "Should have first provider's PC"
+        )
+
+        # Register a second provider and get its ID
+        provider_id_2 = target.RegisterScriptedFrameProvider(
+            "test_frame_providers.PrependFrameProvider",
+            lldb.SBStructuredData(),
+            error,
+        )
+        self.assertTrue(error.Success(), f"Failed to register provider 2: {error}")
+
+        # Verify IDs are different
+        self.assertNotEqual(
+            provider_id_1, provider_id_2, "Provider IDs should be unique"
+        )
+
+        # Now remove the first provider by ID
+        result = target.RemoveScriptedFrameProvider(provider_id_1)
+        self.assertSuccess(
+            result, f"Should successfully remove provider with ID {provider_id_1}"
+        )
+
+        # After removing the first provider, the second provider should still be active
+        # The PrependFrameProvider adds 2 frames before the real stack
+        # Since ReplaceFrameProvider had 3 frames, and we removed it, we should now
+        # have the original frames (from real stack) with PrependFrameProvider applied
+        new_frame_count = thread.GetNumFrames()
+        self.assertEqual(
+            new_frame_count,
+            original_frame_count + 2,
+            "Should have original frames + 2 prepended frames",
+        )
+
+        # First two frames should be from PrependFrameProvider
+        self.assertEqual(
+            thread.GetFrameAtIndex(0).GetPC(),
+            0x9000,
+            "First frame should be from PrependFrameProvider",
+        )
+        self.assertEqual(
+            thread.GetFrameAtIndex(1).GetPC(),
+            0xA000,
+            "Second frame should be from PrependFrameProvider",
+        )
+
+        # Remove the second provider
+        result = target.RemoveScriptedFrameProvider(provider_id_2)
+        self.assertSuccess(
+            result, f"Should successfully remove provider with ID {provider_id_2}"
+        )
+
+        # After removing both providers, frames should be back to original
+        self.assertEqual(
+            thread.GetNumFrames(),
+            original_frame_count,
+            "Should restore original frame count",
+        )
+        self.assertEqual(
+            thread.GetFrameAtIndex(0).GetPC(),
+            original_pc,
+            "Should restore original PC",
+        )
+
+        # Try to remove a provider that doesn't exist
+        result = target.RemoveScriptedFrameProvider(999999)
+        self.assertTrue(result.Fail(), "Should fail to remove non-existent provider")
+
+    def test_circular_dependency_fix(self):
+        """Test that accessing input_frames in __init__ doesn't cause circular dependency.
+
+        This test verifies the fix for the circular dependency issue where:
+        1. Thread::GetStackFrameList() creates the frame provider
+        2. Provider's __init__ accesses input_frames and calls methods on frames
+        3. SBFrame methods trigger ExecutionContextRef::GetFrameSP()
+        4. Before the fix: GetFrameSP() would call Thread::GetStackFrameList() again -> circular dependency!
+        5. After the fix: GetFrameSP() uses the remembered frame list -> no circular dependency
+
+        The fix works by:
+        - StackFrame stores m_frame_list_wp (weak pointer to originating list)
+        - ExecutionContextRef stores m_frame_list_wp when created from a frame
+        - ExecutionContextRef::GetFrameSP() tries the remembered list first before asking the thread
+        """
+        self.build()
+        target, process, thread, bkpt = lldbutil.run_to_source_breakpoint(
+            self, "Break here", lldb.SBFileSpec(self.source), only_one_thread=False
+        )
+
+        # Get original frame count and PC
+        original_frame_count = thread.GetNumFrames()
+        original_pc = thread.GetFrameAtIndex(0).GetPC()
+        self.assertGreaterEqual(
+            original_frame_count, 2, "Should have at least 2 real frames"
+        )
+
+        # Import the provider that accesses input frames in __init__
+        script_path = os.path.join(self.getSourceDir(), "test_frame_providers.py")
+        self.runCmd("command script import " + script_path)
+
+        # Register the CircularDependencyTestProvider
+        # Before the fix, this would crash or hang due to circular dependency
+        error = lldb.SBError()
+        provider_id = target.RegisterScriptedFrameProvider(
+            "test_frame_providers.CircularDependencyTestProvider",
+            lldb.SBStructuredData(),
+            error,
+        )
+
+        # If we get here without crashing, the fix is working!
+        self.assertTrue(error.Success(), f"Failed to register provider: {error}")
+        self.assertNotEqual(provider_id, 0, "Provider ID should be non-zero")
+
+        # Verify the provider worked correctly
+        # Should have 1 synthetic frame + all original frames
+        new_frame_count = thread.GetNumFrames()
+        self.assertEqual(
+            new_frame_count,
+            original_frame_count + 1,
+            "Should have original frames + 1 synthetic frame",
+        )
+
+        # On ARM32, FixCodeAddress clears bit 0, so synthetic PCs get modified
+        is_arm_32bit = lldbplatformutil.getArchitecture() == "arm"
+        expected_synthetic_pc = 0xDEADBEEE if is_arm_32bit else 0xDEADBEEF
+
+        # First frame should be synthetic
+        frame0 = thread.GetFrameAtIndex(0)
+        self.assertIsNotNone(frame0)
+        self.assertEqual(
+            frame0.GetPC(),
+            expected_synthetic_pc,
+            f"First frame should be synthetic frame with PC {expected_synthetic_pc:#x}",
+        )
+
+        # Second frame should be the original first frame
+        frame1 = thread.GetFrameAtIndex(1)
+        self.assertIsNotNone(frame1)
+        self.assertEqual(
+            frame1.GetPC(),
+            original_pc,
+            "Second frame should be original first frame",
+        )
+
+        # Verify we can still call methods on frames (no circular dependency!)
+        for i in range(min(3, new_frame_count)):
+            frame = thread.GetFrameAtIndex(i)
+            self.assertIsNotNone(frame)
+            # These calls should not trigger circular dependency
+            pc = frame.GetPC()
+            self.assertNotEqual(pc, 0, f"Frame {i} should have valid PC")
diff --git a/lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/Makefile b/lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/Makefile
new file mode 100644
index 0000000000000..10495940055b6
--- /dev/null
+++ b/lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/Makefile
@@ -0,0 +1,3 @@
+C_SOURCES := main.c
+
+include Makefile.rules
diff --git a/lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/TestFrameProviderCircularDependency.py b/lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/TestFrameProviderCircularDependency.py
new file mode 100644
index 0000000000000..b15bfb24804b6
--- /dev/null
+++ b/lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/TestFrameProviderCircularDependency.py
@@ -0,0 +1,119 @@
+"""
+Test that a frame provider does not cause a hang due to a circular dependency
+during its initialization.
+"""
+
+import os
+import lldb
+from lldbsuite.test.decorators import *
+from lldbsuite.test.lldbtest import TestBase
+from lldbsuite.test import lldbutil
+
+@skipIf(oslist=["linux"], archs=["arm$"])
+class FrameProviderCircularDependencyTestCase(TestBase):
+    NO_DEBUG_INFO_TESTCASE = True
+
+    def setUp(self):
+        TestBase.setUp(self)
+        self.source = "main.c"
+
+    @expectedFailureAll(oslist=["windows"], bugnumber="llvm.org/pr24778")
+    def test_circular_dependency_with_function_replacement(self):
+        """
+        Test the circular dependency fix with a provider that replaces function names.
+        """
+        self.build()
+
+        target = self.dbg.CreateTarget(self.getBuildArtifact("a.out"))
+        self.assertTrue(target, "Target should be valid")
+
+        bkpt = target.BreakpointCreateBySourceRegex(
+            "break here", lldb.SBFileSpec(self.source)
+        )
+        self.assertTrue(bkpt.IsValid(), "Breakpoint should be valid")
+        self.assertEqual(bkpt.GetNumLocations(), 1, "Should have 1 breakpoint location")
+
+        process = target.LaunchSimple(None, None, self.get_process_working_directory())
+        self.assertTrue(process, "Process should be valid")
+        self.assertEqual(
+            process.GetState(), lldb.eStateStopped, "Process should be stopped"
+        )
+
+        thread = process.GetSelectedThread()
+        self.assertTrue(thread.IsValid(), "Thread should be valid")
+
+        frame0 = thread.GetFrameAtIndex(0)
+        self.assertIn("bar", frame0.GetFunctionName(), "Should be stopped in bar()")
+
+        original_frame_count = thread.GetNumFrames()
+        self.assertGreaterEqual(
+            original_frame_count, 3, "Should have at least 3 frames: bar, foo, main"
+        )
+
+        frame_names = [thread.GetFrameAtIndex(i).GetFunctionName() for i in range(3)]
+        self.assertEqual(frame_names[0], "bar", "Frame 0 should be bar")
+        self.assertEqual(frame_names[1], "foo", "Frame 1 should be foo")
+        self.assertEqual(frame_names[2], "main", "Frame 2 should be main")
+
+        script_path = os.path.join(self.getSourceDir(), "frame_provider.py")
+        self.runCmd("command script import " + script_path)
+
+        # Register the frame provider that accesses input_frames.
+        # Before the fix, this registration would trigger the circular dependency:
+        # - Thread::GetStackFrameList() creates provider
+        # - Provider's get_frame_at_index() accesses input_frames[0]
+        # - Calls frame.GetFunctionName() -> ExecutionContextRef::GetFrameSP()
+        # - Before fix: Calls Thread::GetStackFrameList() again -> CIRCULAR!
+        # - After fix: Uses remembered m_frame_list_wp -> Works!
+        error = lldb.SBError()
+        provider_id = target.RegisterScriptedFrameProvider(
+            "frame_provider.ScriptedFrameObjectProvider",
+            lldb.SBStructuredData(),
+            error,
+        )
+
+        # If we reach here without crashing/hanging, the fix is working!
+        self.assertTrue(
+            error.Success(),
+            f"Should successfully register provider (if this fails, circular dependency!): {error}",
+        )
+        self.assertNotEqual(provider_id, 0, "Provider ID should be non-zero")
+
+        # Verify the provider is working correctly.
+        # Frame count should be unchanged (we're replacing frames, not adding).
+        new_frame_count = thread.GetNumFrames()
+        self.assertEqual(
+            new_frame_count,
+            original_frame_count,
+            "Frame count should be unchanged (replacement, not addition)",
+        )
+
+        # Verify that "bar" was replaced with "baz".
+        frame0_new = thread.GetFrameAtIndex(0)
+        self.assertIsNotNone(frame0_new, "Frame 0 should exist")
+        self.assertEqual(
+            frame0_new.GetFunctionName(),
+            "baz",
+            "Frame 0 function should be replaced: bar -> baz",
+        )
+
+        # Verify other frames are unchanged.
+        frame1_new = thread.GetFrameAtIndex(1)
+        self.assertEqual(
+            frame1_new.GetFunctionName(), "foo", "Frame 1 should still be foo"
+        )
+
+        frame2_new = thread.GetFrameAtIndex(2)
+        self.assertEqual(
+            frame2_new.GetFunctionName(), "main", "Frame 2 should still be main"
+        )
+
+        # Verify we can call methods on all frames (no circular dependency!).
+        for i in range(new_frame_count):
+            frame = thread.GetFrameAtIndex(i)
+            self.assertIsNotNone(frame, f"Frame {i} should exist")
+            # These calls should not trigger circular dependency.
+            pc = frame.GetPC()
+            self.assertNotEqual(pc, 0, f"Frame {i} should have valid PC")
+            func_name = frame.GetFunctionName()
+            self.assertIsNotNone(func_name, f"Frame {i} should have function name")
diff --git a/lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/frame_provider.py b/lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/frame_provider.py
new file mode 100644
index 0000000000000..f27f18cd07b7f
--- /dev/null
+++ b/lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/frame_provider.py
@@ -0,0 +1,102 @@
+"""
+Frame provider that reproduces the circular dependency issue.
+
+This provider accesses input_frames and calls methods on them,
+which before the fix would cause a circular dependency.
+"""
+
+import lldb
+from lldb.plugins.scripted_process import ScriptedFrame
+from lldb.plugins.scripted_frame_provider import ScriptedFrameProvider
+
+
+class CustomScriptedFrame(ScriptedFrame):
+    """Custom scripted frame with full control over frame behavior."""
+
+    def __init__(self, thread, idx, pc, function_name):
+        args = lldb.SBStructuredData()
+        super().__init__(thread, args)
+
+        self.idx = idx
+        self.pc = pc
+        self.function_name = function_name
+
+    def get_id(self):
+        """Return the frame index."""
+        return self.idx
+
+    def get_pc(self):
+        """Return the program counter."""
+        return self.pc
+
+    def get_function_name(self):
+        """Return the function name."""
+        return self.function_name
+
+    def is_artificial(self):
+        """Mark as artificial frame."""
+        return False
+
+    def is_hidden(self):
+        """Not hidden."""
+        return False
+
+    def get_register_context(self):
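+        """No register context for this test."""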
+        return None
+
+
+class ScriptedFrameObjectProvider(ScriptedFrameProvider):
+    """
+    Provider that returns ScriptedFrame objects and accesses input_frames.
+
+    This provider demonstrates the circular dependency bug fix:
+    1. During get_frame_at_index(), we access input_frames[idx]
+    2. We call frame.GetFunctionName() and frame.GetPC() on input frames
+    3. Before the fix: These calls would trigger ExecutionContextRef::GetFrameSP()
+       which would call Thread::GetStackFrameList() -> circular dependency!
+    4. After the fix: ExecutionContextRef uses the remembered frame list -> no circular dependency
+    """
+
+    def __init__(self, input_frames, args):
+        super().__init__(input_frames, args)
+        self.replacement_count = 0
+        # Always define the attribute so get_frame_at_index() can check it safely.
+        self.baz_symbol_ctx = None
+        if self.target.process:
+            symbol_ctx_list = self.target.FindFunctions("baz")
+            if len(symbol_ctx_list) == 1:
+                self.baz_symbol_ctx = symbol_ctx_list[0]
+
+    @staticmethod
+    def get_description():
+        """Return a description of this provider."""
+        return "Provider that replaces 'bar' function with 'baz'"
+
+    def get_frame_at_index(self, idx):
+        """
+        Replace frames named 'bar' with custom frames named 'baz'.
+
+        This accesses input_frames and calls methods on them, which would
+        trigger the circular dependency bug before the fix.
+        """
+        if idx < len(self.input_frames):
+            # This access and method calls would cause circular dependency before fix!
+            frame = self.input_frames[idx]
+
+            # Calling GetFunctionName() triggers ExecutionContextRef resolution.
+            function_name = frame.GetFunctionName()
+
+            if function_name == "bar" and self.baz_symbol_ctx:
+                # Replace "bar" with "baz".
+                baz_func = self.baz_symbol_ctx.GetFunction()
+                new_function_name = baz_func.GetName()
+                pc = baz_func.GetStartAddress().GetLoadAddress(self.target)
+                custom_frame = CustomScriptedFrame(
+                    self.thread, idx, pc, new_function_name
+                )
+                self.replacement_count += 1
+                return custom_frame
+
+            # Pass through other frames by returning their index.
+            return idx
+
+        return None
diff --git a/lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/main.c b/lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/main.c
new file mode 100644
index 0000000000000..bbd1028236f40
--- /dev/null
+++ b/lldb/test/API/functionalities/scripted_frame_provider/circular_dependency/main.c
@@ -0,0 +1,21 @@
+#include <stdio.h>
+
+int baz() {
+  printf("baz\n");
+  return 666;
+}
+
+int bar() {
+  printf("bar\n");
+  return 42; // break here.
+}
+
+int foo() {
+  printf("foo\n");
+  return bar();
+}
+
+int main() {
+  printf("main\n");
+  return foo();
+}
diff --git a/lldb/test/API/functionalities/scripted_frame_provider/main.cpp b/lldb/test/API/functionalities/scripted_frame_provider/main.cpp
new file mode 100644
index 0000000000000..0298e88e4de16
--- /dev/null
+++ b/lldb/test/API/functionalities/scripted_frame_provider/main.cpp
@@ -0,0 +1,53 @@
+// Multi-threaded test program for testing frame providers.
+
+#include <condition_variable>
+#include <iostream>
+#include <mutex>
+#include <thread>
+
+std::mutex mtx;
+std::condition_variable cv;
+int ready_count = 0;
+constexpr int NUM_THREADS = 2;
+
+void thread_func(int thread_num) {
+  std::cout << "Thread " << thread_num << " started\n";
+
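+  // Rendezvous with the other workers and main: the last arrival notifies everyone.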
+  {
+    std::unique_lock<std::mutex> lock(mtx);
+    ready_count++;
+    if (ready_count == NUM_THREADS + 1) {
+      cv.notify_all();
+    } else {
+      cv.wait(lock, [] { return ready_count == NUM_THREADS + 1; });
+    }
+  }
+
+  std::cout << "Thread " << thread_num << " at breakpoint\n"; // Break here.
+}
+
+int main(int argc, char **argv) {
+  std::thread threads[NUM_THREADS];
+
+  for (int i = 0; i < NUM_THREADS; i++) {
+    threads[i] = std::thread(thread_func, i);
+  }
+
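+  // Main participates in the same rendezvous so all threads proceed together.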
+  {
+    std::unique_lock<std::mutex> lock(mtx);
+    ready_count++;
+    if (ready_count == NUM_THREADS + 1) {
+      cv.notify_all();
+    } else {
+      cv.wait(lock, [] { return ready_count == NUM_THREADS + 1; });
+    }
+  }
+
+  std::cout << "Main thread at barrier\n";
+
+  for (int i = 0; i < NUM_THREADS; i++)
+    threads[i].join();
+
+  std::cout << "All threads completed\n";
+  return 0;
+}
diff --git a/lldb/test/API/functionalities/scripted_frame_provider/test_frame_providers.py b/lldb/test/API/functionalities/scripted_frame_provider/test_frame_providers.py
new file mode 100644
index 0000000000000..b9731fdc0a197
--- /dev/null
+++ b/lldb/test/API/functionalities/scripted_frame_provider/test_frame_providers.py
@@ -0,0 +1,222 @@
+"""
+Test frame providers for scripted frame provider functionality.
+
+These providers demonstrate various merge strategies:
+- Replace: Replace entire stack
+- Prepend: Add frames before real stack
+- Append: Add frames after real stack
+
+It also shows the ability to mix dictionaries, ScriptedFrame objects, and
+SBFrame indices to create stack frames.
+"""
+
+import lldb
+from lldb.plugins.scripted_process import ScriptedFrame
+from lldb.plugins.scripted_frame_provider import ScriptedFrameProvider
+
+
+class ReplaceFrameProvider(ScriptedFrameProvider):
+    """Replace entire stack with custom frames."""
+
+    def __init__(self, input_frames, args):
+        super().__init__(input_frames, args)
+        self.frames = [
+            {
+                "idx": 0,
+                "pc": 0x1000,
+            },
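+            # Pass through real frame 0 by index.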
+            0,
+            {
+                "idx": 2,
+                "pc": 0x3000,
+            },
+        ]
+
+    @staticmethod
+    def get_description():
+        """Return a description of this provider."""
+        return "Replace entire stack with 3 custom frames"
+
+    def get_frame_at_index(self, index):
+        if index >= len(self.frames):
+            return None
+        return self.frames[index]
+
+
+class PrependFrameProvider(ScriptedFrameProvider):
+    """Prepend synthetic frames before real stack."""
+
+    def __init__(self, input_frames, args):
+        super().__init__(input_frames, args)
+
+    @staticmethod
+    def get_description():
+        """Return a description of this provider."""
+        return "Prepend 2 synthetic frames before real stack"
+
+    def get_frame_at_index(self, index):
+        if index == 0:
+            return {"pc": 0x9000}
+        elif index == 1:
+            return {"pc": 0xA000}
+        elif index - 2 < len(self.input_frames):
+            return index - 2  # Return real frame index.
+        return None
+
+
+class AppendFrameProvider(ScriptedFrameProvider):
+    """Append synthetic frames after real stack."""
+
+    def __init__(self, input_frames, args):
+        super().__init__(input_frames, args)
+
+    @staticmethod
+    def get_description():
+        """Return a description of this provider."""
+        return "Append 1 synthetic frame after real stack"
+
+    def get_frame_at_index(self, index):
+        if index < len(self.input_frames):
+            return index  # Return real frame index.
+        elif index == len(self.input_frames):
+            return {
+                "idx": 1,
+                "pc": 0x10,
+            }
+        return None
+
+
+class CustomScriptedFrame(ScriptedFrame):
+    """Custom scripted frame with full control over frame behavior."""
+
+    def __init__(self, thread, idx, pc, function_name):
+        args = lldb.SBStructuredData()
+        super().__init__(thread, args)
+
+        self.idx = idx
+        self.pc = pc
+        self.function_name = function_name
+
+    def get_id(self):
+        """Return the frame index."""
+        return self.idx
+
+    def get_pc(self):
+        """Return the program counter."""
+        return self.pc
+
+    def get_function_name(self):
+        """Return the function name."""
+        return self.function_name
+
+    def is_artificial(self):
+        """Mark as artificial frame."""
+        return False
+
+    def is_hidden(self):
+        """Not hidden."""
+        return False
+
+    def get_register_context(self):
+        """No register context for this test."""
+        return None
+
+
+class ScriptedFrameObjectProvider(ScriptedFrameProvider):
+    """Provider that returns ScriptedFrame objects instead of dictionaries."""
+
+    def __init__(self, input_frames, args):
+        super().__init__(input_frames, args)
+
+    @staticmethod
+    def get_description():
+        """Return a description of this provider."""
+        return "Provider returning custom ScriptedFrame objects"
+
+    def get_frame_at_index(self, index):
+        """Return ScriptedFrame objects or dictionaries based on index."""
+        if index == 0:
+            return CustomScriptedFrame(
+                self.thread, 0, 0x5000, "custom_scripted_frame_0"
+            )
+        elif index == 1:
+            return {"pc": 0x6000}
+        elif index == 2:
+            return CustomScriptedFrame(
+                self.thread, 2, 0x7000, "custom_scripted_frame_2"
+            )
+        elif index == 3:
+            return len(self.input_frames) - 2  # Real frame index.
+        elif index == 4:
+            return len(self.input_frames) - 1  # Real frame index.
+        return None
+
+
+class ThreadFilterFrameProvider(ScriptedFrameProvider):
+    """Provider that only applies to thread with ID 1."""
+
+    @staticmethod
+    def applies_to_thread(thread):
+        """Only apply to thread with index ID 1."""
+        return thread.GetIndexID() == 1
+
+    def __init__(self, input_frames, args):
+        super().__init__(input_frames, args)
+
+    @staticmethod
+    def get_description():
+        """Return a description of this provider."""
+        return "Provider that only applies to thread ID 1"
+
+    def get_frame_at_index(self, index):
+        """Return a single synthetic frame."""
+        if index == 0:
+            return {"pc": 0xFFFF}
+        return None
+
+
+class CircularDependencyTestProvider(ScriptedFrameProvider):
+    """
+    Provider that tests the circular dependency fix.
+
+    This provider accesses input_frames during __init__ and calls methods
+    on those frames. Before the fix, this would cause a circular dependency:
+    - Thread::GetStackFrameList() creates provider
+    - Provider's __init__ accesses input_frames[0]
+    - SBFrame::GetPC() tries to resolve ExecutionContextRef
+    - ExecutionContextRef::GetFrameSP() calls Thread::GetStackFrameList()
+    - Re-enters initialization -> circular dependency!
+
+    With the fix, ExecutionContextRef remembers the frame list, so it doesn't
+    re-enter Thread::GetStackFrameList().
+    """
+
+    def __init__(self, input_frames, args):
+        super().__init__(input_frames, args)
+
+        # This would cause circular dependency before the fix!
+        # Accessing frames and calling methods on them during __init__
+        self.original_frame_count = len(input_frames)
+        self.original_pcs = []
+
+        # Call GetPC() on each input frame - this triggers ExecutionContextRef resolution.
+        for i in range(min(3, len(input_frames))):
+            frame = input_frames[i]
+            if frame.IsValid():
+                pc = frame.GetPC()
+                self.original_pcs.append(pc)
+
+    @staticmethod
+    def get_description():
+        """Return a description of this provider."""
+        return "Provider that tests circular dependency fix by accessing frames in __init__"
+
+    def get_frame_at_index(self, index):
+        """Prepend a synthetic frame, then pass through original frames."""
+        if index == 0:
+            # Synthetic frame at index 0.
+            return {"pc": 0xDEADBEEF}
+        elif index - 1 < self.original_frame_count:
+            # Pass through original frames at indices 1, 2, 3, ...
+            return index - 1
+        return None
diff --git a/lldb/test/API/functionalities/statusline/TestStatusline.py b/lldb/test/API/functionalities/statusline/TestStatusline.py
index ca376cc595f30..4ffa864120a5c 100644
--- a/lldb/test/API/functionalities/statusline/TestStatusline.py
+++ b/lldb/test/API/functionalities/statusline/TestStatusline.py
@@ -71,8 +71,10 @@ def test(self):
         )
         self.expect('set set separator "| "')
 
-        # Hide the statusline and check or the control character.
-        self.expect("set set show-statusline false", ["\x1b[1;0r"])
+        # Hide the statusline and check for the control character.
+        self.expect(
+            "set set show-statusline false", ["\x1b[1;{}r".format(self.TERMINAL_HEIGHT)]
+        )
 
     def test_no_color(self):
         """Basic test for the statusline with colors disabled."""
diff --git a/lldb/test/API/lang/BoundsSafety/soft_trap/Makefile b/lldb/test/API/lang/BoundsSafety/soft_trap/Makefile
new file mode 100644
index 0000000000000..5e83e7ac6d93f
--- /dev/null
+++ b/lldb/test/API/lang/BoundsSafety/soft_trap/Makefile
@@ -0,0 +1,10 @@
+# FIXME: mockSoftTrapRuntime.c shouldn't really be built with -fbounds-safety
+C_SOURCES := main.c mockSoftTrapRuntime.c
+
+soft-trap-test-minimal: CFLAGS_EXTRAS := -fbounds-safety -fbounds-safety-soft-traps=call-minimal
+soft-trap-test-minimal: all
+
+soft-trap-test-with-str: CFLAGS_EXTRAS := -fbounds-safety -fbounds-safety-soft-traps=call-with-str
+soft-trap-test-with-str: all
+
+include Makefile.rules
diff --git a/lldb/test/API/lang/BoundsSafety/soft_trap/TestBoundsSafetyInstrumentationPlugin.py b/lldb/test/API/lang/BoundsSafety/soft_trap/TestBoundsSafetyInstrumentationPlugin.py
new file mode 100644
index 0000000000000..535a0bcf7c00e
--- /dev/null
+++ b/lldb/test/API/lang/BoundsSafety/soft_trap/TestBoundsSafetyInstrumentationPlugin.py
@@ -0,0 +1,148 @@
+"""
+Test the BoundsSafety instrumentation plugin
+"""
+
+import lldb
+from lldbsuite.test.lldbtest import *
+from lldbsuite.test.decorators import *
+
+
+STOP_REASON_MAX_LEN = 100
+SOFT_TRAP_FUNC_MINIMAL = "__bounds_safety_soft_trap"
+SOFT_TRAP_FUNC_WITH_STR = "__bounds_safety_soft_trap_s"
+
+
+class BoundsSafetyTestSoftTrapPlugin(TestBase):
+    def _check_stop_reason_impl(
+        self,
+        expected_soft_trap_func: str,
+        expected_stop_reason: str,
+        expected_func_name: str,
+        expected_file_name: str,
+        expected_line_num: int,
+    ):
+        process = self.test_target.process
+        thread = process.GetSelectedThread()
+        self.assertEqual(
+            thread.GetStopReason(),
+            lldb.eStopReasonInstrumentation,
+        )
+
+        stop_reason = thread.GetStopDescription(STOP_REASON_MAX_LEN)
+        self.assertEqual(stop_reason, expected_stop_reason)
+
+        soft_trap_func_frame = thread.GetFrameAtIndex(0)
+        self.assertEqual(soft_trap_func_frame.name, expected_soft_trap_func)
+
+        stop_frame = thread.GetSelectedFrame()
+        self.assertEqual(stop_frame.name, expected_func_name)
+        # The stop frame isn't frame 1 because that frame is the artificial
+        # frame containing the trap reason.
+        self.assertEqual(stop_frame.idx, 2)
+        file_name = stop_frame.GetLineEntry().GetFileSpec().basename
+        self.assertEqual(file_name, expected_file_name)
+        line = stop_frame.GetLineEntry().line
+        self.assertEqual(line, expected_line_num)
+
+    def check_state_soft_trap_minimal(
+        self, stop_reason: str, func_name: str, file_name: str, line_num: int
+    ):
+        """
+        Check the program state is as expected when hitting
+        a soft trap from -fbounds-safety-soft-traps=call-minimal
+        """
+        self._check_stop_reason_impl(
+            SOFT_TRAP_FUNC_MINIMAL,
+            expected_stop_reason=stop_reason,
+            expected_func_name=func_name,
+            expected_file_name=file_name,
+            expected_line_num=line_num,
+        )
+
+    def check_state_soft_trap_with_str(
+        self, stop_reason: str, func_name: str, file_name: str, line_num: int
+    ):
+        """
+        Check the program state is as expected when hitting
+        a soft trap from -fbounds-safety-soft-traps=call-with-str
+        """
+        self._check_stop_reason_impl(
+            SOFT_TRAP_FUNC_WITH_STR,
+            expected_stop_reason=stop_reason,
+            expected_func_name=func_name,
+            expected_file_name=file_name,
+            expected_line_num=line_num,
+        )
+
+    # Skip the tests on Windows because they fail due to the stop reason
+    # being `eStopReasonNone` instead of the expected
+    # `eStopReasonInstrumentation`.
+    @skipIfWindows
+    @skipUnlessBoundsSafety
+    def test_call_minimal(self):
+        """
+        Test the plugin on code built with
+        -fbounds-safety-soft-traps=call-minimal
+        """
+        self.build(make_targets=["soft-trap-test-minimal"])
+        self.test_target = self.createTestTarget()
+        self.runCmd("run")
+
+        process = self.test_target.process
+
+        # First soft trap hit
+        self.check_state_soft_trap_minimal(
+            "Soft Bounds check failed: indexing above upper bound in 'buffer[2]'",
+            "main",
+            "main.c",
+            7,
+        )
+
+        process.Continue()
+
+        # Second soft trap hit
+        self.check_state_soft_trap_minimal(
+            "Soft Bounds check failed: indexing below lower bound in 'buffer[-1]'",
+            "main",
+            "main.c",
+            8,
+        )
+
+        process.Continue()
+        self.assertEqual(process.GetState(), lldb.eStateExited)
+        self.assertEqual(process.GetExitStatus(), 0)
+
+    @skipIfWindows
+    @skipUnlessBoundsSafety
+    def test_call_with_str(self):
+        """
+        Test the plugin on code built with
+        -fbounds-safety-soft-traps=call-with-str
+        """
+        self.build(make_targets=["soft-trap-test-with-str"])
+        self.test_target = self.createTestTarget()
+        self.runCmd("run")
+
+        process = self.test_target.process
+
+        # First soft trap hit
+        self.check_state_soft_trap_with_str(
+            "Soft Bounds check failed: indexing above upper bound in 'buffer[2]'",
+            "main",
+            "main.c",
+            7,
+        )
+
+        process.Continue()
+
+        # Second soft trap hit
+        self.check_state_soft_trap_with_str(
+            "Soft Bounds check failed: indexing below lower bound in 'buffer[-1]'",
+            "main",
+            "main.c",
+            8,
+        )
+
+        process.Continue()
+        self.assertEqual(process.GetState(), lldb.eStateExited)
+        self.assertEqual(process.GetExitStatus(), 0)
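
For reference, here is a minimal standalone sketch of the SB API calls the test above exercises (illustrative only, not part of this patch; the "a.out" binary name is an assumption, and the frame indices mirror what the test expects):

    import lldb

    debugger = lldb.SBDebugger.Create()
    debugger.SetAsync(False)
    # Assumed: a binary built with -fbounds-safety-soft-traps=call-minimal.
    target = debugger.CreateTarget("a.out")
    process = target.LaunchSimple(None, None, ".")

    thread = process.GetSelectedThread()
    if thread.GetStopReason() == lldb.eStopReasonInstrumentation:
        # GetStopDescription takes the maximum length of the returned string.
        print(thread.GetStopDescription(100))
        # Frame 0 is the soft-trap runtime function and frame 1 is the
        # artificial __clang_trap_msg frame, so the user code that failed
        # the bounds check is at frame index 2.
        print(thread.GetFrameAtIndex(2).name)
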
diff --git a/lldb/test/API/lang/BoundsSafety/soft_trap/main.c b/lldb/test/API/lang/BoundsSafety/soft_trap/main.c
new file mode 100644
index 0000000000000..518afaaa02e8c
--- /dev/null
+++ b/lldb/test/API/lang/BoundsSafety/soft_trap/main.c
@@ -0,0 +1,10 @@
+#include <ptrcheck.h>
+
+int main(void) {
+  int pad;
+  int buffer[] = {0, 1};
+  int pad2;
+  int tmp = buffer[2]; // access past upper bound
+  tmp = buffer[-1];    // access below lower bound
+  return 0;
+}
diff --git a/lldb/test/API/lang/BoundsSafety/soft_trap/mockSoftTrapRuntime.c b/lldb/test/API/lang/BoundsSafety/soft_trap/mockSoftTrapRuntime.c
new file mode 100644
index 0000000000000..2cfbd24234eff
--- /dev/null
+++ b/lldb/test/API/lang/BoundsSafety/soft_trap/mockSoftTrapRuntime.c
@@ -0,0 +1,17 @@
+#include <bounds_safety_soft_traps.h>
+#include <ptrcheck.h>
+#include <stdio.h>
+
+#if __CLANG_BOUNDS_SAFETY_SOFT_TRAP_API_VERSION > 0
+#error API version changed
+#endif
+
+// FIXME: The runtimes really shouldn't be built with `-fbounds-safety` in
+// soft trap mode because of the risk of infinite recursion. However,
+// there's currently no way to have source files built with different flags.
+
+void __bounds_safety_soft_trap_s(const char *reason) {
+  printf("BoundsSafety check FAILED: message:\"%s\"\n", reason ? reason : "");
+}
+
+void __bounds_safety_soft_trap(void) { printf("BoundsSafety check FAILED\n"); }
diff --git a/lldb/test/API/python_api/exprpath_register/Makefile b/lldb/test/API/python_api/exprpath_register/Makefile
new file mode 100644
index 0000000000000..10495940055b6
--- /dev/null
+++ b/lldb/test/API/python_api/exprpath_register/Makefile
@@ -0,0 +1,3 @@
+C_SOURCES := main.c
+
+include Makefile.rules
diff --git a/lldb/test/API/python_api/exprpath_register/TestExprPathRegisters.py b/lldb/test/API/python_api/exprpath_register/TestExprPathRegisters.py
new file mode 100644
index 0000000000000..4ffbc5e49fb0d
--- /dev/null
+++ b/lldb/test/API/python_api/exprpath_register/TestExprPathRegisters.py
@@ -0,0 +1,64 @@
+"""
+Test that getting the expression path for registers works correctly.
+"""
+
+import lldb
+from lldbsuite.test import lldbutil
+from lldbsuite.test.lldbtest import TestBase, VALID_BREAKPOINT, VALID_TARGET
+
+
+class TestExprPathRegisters(TestBase):
+    NO_DEBUG_INFO_TESTCASE = True
+
+    def verify_register_path(self, reg_value: lldb.SBValue):
+        stream = lldb.SBStream()
+        reg_name = reg_value.name
+        self.assertTrue(
+            reg_value.GetExpressionPath(stream),
+            f"Expected an expression path for register {reg_name}.",
+        )
+        reg_expr_path = stream.GetData()
+        self.assertEqual(reg_expr_path, f"${reg_name}")
+
+    def test_float_registers(self):
+        """Verify the expression path of the registers is valid."""
+        self.build()
+        _, _, thread, _ = lldbutil.run_to_name_breakpoint(self, "my_foo")
+        frame = thread.GetSelectedFrame()
+        self.assertTrue(frame, "Expected a valid Frame.")
+
+        # Possible floating-point registers on some CPUs.
+        register_names = [
+            "xmm0",
+            "ymm0",
+            "v0",
+            "v1",
+            "f0",
+            "f1",
+            "d0",
+            "d1",
+            "vr0",
+            "vr1",
+            "st0",
+            "st1",
+        ]
+        for name in register_names:
+            reg_value = frame.FindRegister(name)
+            # Some of these registers will not be available on this CPU;
+            # only verify the ones that are valid.
+            if reg_value:
+                self.verify_register_path(reg_value)
+
+    def test_all_registers(self):
+        """Test all the registers that is avaiable on the machine"""
+        self.build()
+        _, _, thread, _ = lldbutil.run_to_name_breakpoint(self, "my_foo")
+        frame = thread.GetSelectedFrame()
+        self.assertTrue(frame, "Expected a valid Frame.")
+
+        register_sets = frame.GetRegisters()
+        self.assertTrue(register_sets.IsValid(), "Expected Frame Registers")
+
+        for register_set in register_sets:
+            for register in register_set.children:
+                self.verify_register_path(register)
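
As a usage note (not part of the patch): for registers, GetExpressionPath simply yields "$<regname>", which can then be re-used in expression evaluation (for example `expr $pc`). A short sketch under the assumption that `frame` is a valid lldb.SBFrame obtained as in the test above:

    import lldb

    # Assumed: `frame` came from lldbutil.run_to_name_breakpoint(...).
    reg = frame.FindRegister("pc")
    stream = lldb.SBStream()
    if reg.IsValid() and reg.GetExpressionPath(stream):
        print(stream.GetData())  # expected to print "$pc"
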
diff --git a/lldb/test/API/python_api/exprpath_register/main.c b/lldb/test/API/python_api/exprpath_register/main.c
new file mode 100644
index 0000000000000..4809a87cdf210
--- /dev/null
+++ b/lldb/test/API/python_api/exprpath_register/main.c
@@ -0,0 +1,10 @@
+
+float my_foo() {
+  float result = 10.0 + 20.0;
+  return result;
+}
+
+int main(void) {
+  float result = my_foo();
+  return (int)result;
+}
diff --git a/lldb/test/Shell/BoundsSafety/Inputs/boundsSafetyMockCallSoftTrapRuntime.c b/lldb/test/Shell/BoundsSafety/Inputs/boundsSafetyMockCallSoftTrapRuntime.c
new file mode 100644
index 0000000000000..698cf272386c2
--- /dev/null
+++ b/lldb/test/Shell/BoundsSafety/Inputs/boundsSafetyMockCallSoftTrapRuntime.c
@@ -0,0 +1,8 @@
+#include <bounds_safety_soft_traps.h>
+#include <stdio.h>
+
+int main(void) {
+  __bounds_safety_soft_trap_s(0);
+  printf("Execution continued\n");
+  return 0;
+}
diff --git a/lldb/test/Shell/BoundsSafety/Inputs/boundsSafetyMockSoftTrapRuntime.c b/lldb/test/Shell/BoundsSafety/Inputs/boundsSafetyMockSoftTrapRuntime.c
new file mode 100644
index 0000000000000..3e680451efba2
--- /dev/null
+++ b/lldb/test/Shell/BoundsSafety/Inputs/boundsSafetyMockSoftTrapRuntime.c
@@ -0,0 +1,15 @@
+#include <bounds_safety_soft_traps.h>
+#include <ptrcheck.h>
+#include <stdio.h>
+
+#if __CLANG_BOUNDS_SAFETY_SOFT_TRAP_API_VERSION > 0
+#error API version changed
+#endif
+
+#if __has_ptrcheck
+#error Do not compile the runtime with -fbounds-safety enabled due to potential for infinite recursion
+#endif
+
+void __bounds_safety_soft_trap_s(const char *reason) { printf("BoundsSafety check FAILED: message:\"%s\"\n", reason ? reason : ""); }
+
+void __bounds_safety_soft_trap(void) { printf("BoundsSafety check FAILED\n"); }
diff --git a/lldb/test/Shell/BoundsSafety/Inputs/boundsSafetySoftTraps.c b/lldb/test/Shell/BoundsSafety/Inputs/boundsSafetySoftTraps.c
new file mode 100644
index 0000000000000..265e57e427b1f
--- /dev/null
+++ b/lldb/test/Shell/BoundsSafety/Inputs/boundsSafetySoftTraps.c
@@ -0,0 +1,12 @@
+#include <stdio.h>
+
+int bad_read(int index) {
+  int array[] = {0, 1, 2};
+  return array[index];
+}
+
+int main(int argc, char **argv) {
+  bad_read(10);
+  printf("Execution continued\n");
+  return 0;
+}
diff --git a/lldb/test/Shell/BoundsSafety/Inputs/boundsSafetySoftTrapsMissingReason.c b/lldb/test/Shell/BoundsSafety/Inputs/boundsSafetySoftTrapsMissingReason.c
new file mode 100644
index 0000000000000..db65c9d93d42f
--- /dev/null
+++ b/lldb/test/Shell/BoundsSafety/Inputs/boundsSafetySoftTrapsMissingReason.c
@@ -0,0 +1,20 @@
+#include <ptrcheck.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+int bad_call(int *__counted_by(count) ptr, int count) { return 0; }
+
+int main(int argc, char **argv) {
+  const int num_bytes = sizeof(int) * 2;
+  int *array = (int *)malloc(num_bytes);
+  memset(array, 0, num_bytes);
+
+  // The count argument is too large and will cause a trap.
+  // This code pattern is currently missing a trap reason (rdar://100346924) and
+  // so we can use it to test how `InstrumentationRuntimeBoundsSafety` handles
+  // this.
+  bad_call(array, 3);
+  printf("Execution continued\n");
+  return 0;
+}
diff --git a/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal.test b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal.test
new file mode 100644
index 0000000000000..7e93f14a2672d
--- /dev/null
+++ b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal.test
@@ -0,0 +1,31 @@
+# UNSUPPORTED: system-windows
+# REQUIRES: clang-bounds-safety
+# RUN: %clang_host -c -fbounds-safety -fbounds-safety-soft-traps=call-minimal -g -O0 %S/Inputs/boundsSafetySoftTraps.c -o %t.o
+# Note: Building the runtime without debug info is intentional because this is the common case
+# RUN: %clang_host -c -O0 %S/Inputs/boundsSafetyMockSoftTrapRuntime.c -o %t.softtrap_runtime.o
+# RUN: %clang_host %t.o %t.softtrap_runtime.o -o %t.out
+# RUN: %lldb -b -s %s %t.out | FileCheck %s
+
+# This test relies on this plugin being enabled
+plugin list instrumentation-runtime.BoundsSafety
+# CHECK: [+] BoundsSafety
+
+# Emit logging so that the code has test coverage
+log enable lldb instrumentation-runtime
+
+run
+
+# CHECK: * thread #{{.*}} stop reason = Soft Bounds check failed: indexing above upper bound in 'array[index]'{{$}}
+
+# Check that the `bad_read` frame is selected
+bt
+# CHECK: frame #{{.*}}`__bounds_safety_soft_trap{{$}}
+# CHECK-NEXT: frame #{{.*}}`__clang_trap_msg$Bounds check failed$indexing above upper bound in 'array[index]'
+# CHECK-NEXT: * frame #{{.*}}`bad_read(index=10) at boundsSafetySoftTraps.c:5
+
+# Resume execution
+c
+# CHECK: BoundsSafety check FAILED
+# CHECK-NEXT: Execution continued
+# CHECK: Process {{[0-9]+}} exited with status = 0
+q
diff --git a/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal_missing_reason.test b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal_missing_reason.test
new file mode 100644
index 0000000000000..a56144a196d63
--- /dev/null
+++ b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal_missing_reason.test
@@ -0,0 +1,34 @@
+# UNSUPPORTED: system-windows
+# REQUIRES: clang-bounds-safety
+# RUN: %clang_host -c -fbounds-safety -fbounds-safety-soft-traps=call-minimal -g -O0 %S/Inputs/boundsSafetySoftTrapsMissingReason.c -o %t.o
+# Note: Building the runtime without debug info is intentional because this is the common case
+# RUN: %clang_host -c -O0 %S/Inputs/boundsSafetyMockSoftTrapRuntime.c -o %t.softtrap_runtime.o
+# RUN: %clang_host %t.o %t.softtrap_runtime.o -o %t.out
+# RUN: %lldb -b -s %s %t.out 2> %t.warnings | FileCheck %s
+# Warnings are checked separately because their order in the output is not guaranteed.
+# RUN: FileCheck --input-file=%t.warnings --check-prefix=WARN %s
+
+# This test relies on this plugin being enabled
+plugin list instrumentation-runtime.BoundsSafety
+# CHECK: [+] BoundsSafety
+
+# Emit logging so that the code has test coverage
+log enable lldb instrumentation-runtime
+
+run
+
+# CHECK: * thread #{{.*}} stop reason = Soft Bounds check failed{{$}}
+# WARN: warning: specific BoundsSafety trap reason is not available because the compiler omitted it from the debug info
+
+# Check that the `bad_read` frame is selected
+bt
+# CHECK: frame #{{.*}}`__bounds_safety_soft_trap{{$}}
+# CHECK-NEXT: frame #{{.*}}`__clang_trap_msg$Bounds check failed$
+# CHECK-NEXT: * frame #{{.*}}`main({{.+}}) at boundsSafetySoftTrapsMissingReason.c:17
+
+# Resume execution
+c
+# CHECK: BoundsSafety check FAILED
+# CHECK-NEXT: Execution continued
+# CHECK: Process {{[0-9]+}} exited with status = 0
+q
diff --git a/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal_no_dbg_info.test b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal_no_dbg_info.test
new file mode 100644
index 0000000000000..dfff65d5b5a8c
--- /dev/null
+++ b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal_no_dbg_info.test
@@ -0,0 +1,33 @@
+# UNSUPPORTED: system-windows
+# REQUIRES: clang-bounds-safety
+# RUN: %clang_host -c -fbounds-safety -fbounds-safety-soft-traps=call-minimal -O0 %S/Inputs/boundsSafetySoftTraps.c -o %t.o
+# Note: Building the runtime without debug info is intentional because this is the common case
+# RUN: %clang_host -c -O0 %S/Inputs/boundsSafetyMockSoftTrapRuntime.c -o %t.softtrap_runtime.o
+# RUN: %clang_host %t.o %t.softtrap_runtime.o -o %t.out
+# RUN: %lldb -b -s %s %t.out 2> %t.warnings | FileCheck %s
+# Warnings are checked separately because their order in the output is not guaranteed.
+# RUN: FileCheck --input-file=%t.warnings --check-prefix=WARN %s
+
+# This test relies on this plugin being enabled
+plugin list instrumentation-runtime.BoundsSafety
+# CHECK: [+] BoundsSafety
+
+# Emit logging so that the code has test coverage
+log enable lldb instrumentation-runtime
+
+run
+
+# CHECK: * thread #{{.*}} stop reason = Soft Bounds check failed{{$}}
+# WARN: warning: specific BoundsSafety trap reason is not available because debug info is missing on the caller of '__bounds_safety_soft_trap'
+
+# Check that the `__bounds_safety_soft_trap` frame is selected
+bt
+# CHECK: * frame #{{.*}}`__bounds_safety_soft_trap{{$}}
+# CHECK-NEXT: frame #{{.*}}`bad_read
+
+# Resume execution
+c
+# CHECK: BoundsSafety check FAILED
+# CHECK-NEXT: Execution continued
+# CHECK: Process {{[0-9]+}} exited with status = 0
+q
diff --git a/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal_no_plugin.test b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal_no_plugin.test
new file mode 100644
index 0000000000000..4acc927b667ed
--- /dev/null
+++ b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_minimal_no_plugin.test
@@ -0,0 +1,30 @@
+# UNSUPPORTED: system-windows
+# REQUIRES: clang-bounds-safety
+# RUN: %clang_host -c -fbounds-safety -fbounds-safety-soft-traps=call-minimal -g -O0 %S/Inputs/boundsSafetySoftTraps.c -o %t.o
+# Note: Building the runtime without debug info is intentional because this is the common case
+# RUN: %clang_host -c -O0 %S/Inputs/boundsSafetyMockSoftTrapRuntime.c -o %t.softtrap_runtime.o
+# RUN: %clang_host %t.o %t.softtrap_runtime.o -o %t.out
+# RUN: %lldb -b -s %s %t.out | FileCheck %s
+
+# Run without the plugin. A user might want to do this so they can set their
+# own custom breakpoint with custom stopping behavior (e.g. stop after n hits).
+plugin disable instrumentation-runtime.BoundsSafety
+# CHECK: [-] BoundsSafety
+
+b __bounds_safety_soft_trap
+run
+
+# CHECK: * thread #{{.*}} stop reason = breakpoint 1.1
+
+# Check that reason for bounds check failing can be seen in the stacktrace
+bt
+# CHECK: * frame #{{.*}}`__bounds_safety_soft_trap{{$}}
+# CHECK-NEXT: frame #{{.*}}`__clang_trap_msg$Bounds check failed$indexing above upper bound in 'array[index]'
+# CHECK-NEXT: frame #{{.*}}`bad_read(index=10) at boundsSafetySoftTraps.c:5
+
+# Resume execution
+c
+# CHECK: BoundsSafety check FAILED
+# CHECK-NEXT: Execution continued
+# CHECK: Process {{[0-9]+}} exited with status = 0
+q
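
The custom-breakpoint workflow mentioned in the comments above could look roughly like this through the Python API (a sketch, not part of the patch, assuming `target` is a valid lldb.SBTarget for the test binary):

    # Stop on the soft-trap runtime entry point, but only after it has
    # already been hit three times (ignore count 3 -> stop on the 4th hit).
    bp = target.BreakpointCreateByName("__bounds_safety_soft_trap")
    bp.SetIgnoreCount(3)
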
diff --git a/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_str.test b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_str.test
new file mode 100644
index 0000000000000..2fdae6d122733
--- /dev/null
+++ b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_str.test
@@ -0,0 +1,31 @@
+# UNSUPPORTED: system-windows
+# REQUIRES: clang-bounds-safety
+# RUN: %clang_host -c -fbounds-safety -fbounds-safety-soft-traps=call-with-str -g -O0 %S/Inputs/boundsSafetySoftTraps.c -o %t.o
+# Note: Building the runtime without debug info is intentional because this is the common case
+# RUN: %clang_host -c -O0 %S/Inputs/boundsSafetyMockSoftTrapRuntime.c -o %t.softtrap_runtime.o
+# RUN: %clang_host %t.o %t.softtrap_runtime.o -o %t.out
+# RUN: %lldb -b -s %s %t.out | FileCheck %s
+
+# This test relies on this plugin being enabled
+plugin list instrumentation-runtime.BoundsSafety
+# CHECK: [+] BoundsSafety
+
+# Emit logging so that the code has test coverage
+log enable lldb instrumentation-runtime
+
+run
+
+# CHECK: * thread #{{.*}} stop reason = Soft Bounds check failed: indexing above upper bound in 'array[index]'{{$}}
+
+# Check that the `bad_read` frame is selected
+bt
+# CHECK: frame #{{.*}}`__bounds_safety_soft_trap_s{{$}}
+# CHECK-NEXT: frame #{{.*}}`__clang_trap_msg$Bounds check failed$indexing above upper bound in 'array[index]'
+# CHECK-NEXT: * frame #{{.*}}`bad_read(index=10) at boundsSafetySoftTraps.c:5
+
+# Resume execution
+c
+# CHECK: BoundsSafety check FAILED
+# CHECK-NEXT: Execution continued
+# CHECK: Process {{[0-9]+}} exited with status = 0
+q
diff --git a/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_missing_reason.test b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_missing_reason.test
new file mode 100644
index 0000000000000..68d10c00280ac
--- /dev/null
+++ b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_missing_reason.test
@@ -0,0 +1,34 @@
+# UNSUPPORTED: system-windows
+# REQUIRES: clang-bounds-safety
+# RUN: %clang_host -c -fbounds-safety -fbounds-safety-soft-traps=call-with-str -g -O0 %S/Inputs/boundsSafetySoftTrapsMissingReason.c -o %t.o
+# Note: Building the runtime without debug info is intentional because this is the common case
+# RUN: %clang_host -c -O0 %S/Inputs/boundsSafetyMockSoftTrapRuntime.c -o %t.softtrap_runtime.o
+# RUN: %clang_host %t.o %t.softtrap_runtime.o -o %t.out
+# RUN: %lldb -b -s %s %t.out 2> %t.warnings | FileCheck %s
+# Warnings are checked separately because their order in the output is not guaranteed.
+# RUN: FileCheck --input-file=%t.warnings --check-prefix=WARN %s
+
+# This test relies on this plugin being enabled
+plugin list instrumentation-runtime.BoundsSafety
+# CHECK: [+] BoundsSafety
+
+# Emit logging so that the code has test coverage
+log enable lldb instrumentation-runtime
+
+run
+
+# CHECK: * thread #{{.*}} stop reason = Soft Bounds check failed{{$}}
+# WARN: warning: specific BoundsSafety trap reason is not available because the compiler omitted it from the debug info
+
+# Check that the `bad_read` frame is selected
+bt
+# CHECK: frame #{{.*}}`__bounds_safety_soft_trap_s{{$}}
+# CHECK-NEXT: frame #{{.*}}`__clang_trap_msg$Bounds check failed$
+# CHECK-NEXT: * frame #{{.*}}`main({{.+}}) at boundsSafetySoftTrapsMissingReason.c:17
+
+# Resume execution
+c
+# CHECK: BoundsSafety check FAILED
+# CHECK-NEXT: Execution continued
+# CHECK: Process {{[0-9]+}} exited with status = 0
+q
diff --git a/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_no_dbg_info.test b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_no_dbg_info.test
new file mode 100644
index 0000000000000..afe6098878822
--- /dev/null
+++ b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_no_dbg_info.test
@@ -0,0 +1,30 @@
+# UNSUPPORTED: system-windows
+# REQUIRES: clang-bounds-safety
+# RUN: %clang_host -c -fbounds-safety -fbounds-safety-soft-traps=call-with-str -O0 %S/Inputs/boundsSafetySoftTraps.c -o %t.o
+# Note: Building the runtime without debug info is intentional because this is the common case
+# RUN: %clang_host -c -O0 %S/Inputs/boundsSafetyMockSoftTrapRuntime.c -o %t.softtrap_runtime.o
+# RUN: %clang_host %t.o %t.softtrap_runtime.o -o %t.out
+# RUN: %lldb -b -s %s %t.out | FileCheck %s
+
+# This test relies on this plugin being enabled
+plugin list instrumentation-runtime.BoundsSafety
+# CHECK: [+] BoundsSafety
+
+# Emit logging so that the code has test coverage
+log enable lldb instrumentation-runtime
+
+run
+
+# CHECK: * thread #{{.*}} stop reason = Soft Bounds check failed: indexing above upper bound in 'array[index]'{{$}}
+
+# Check that the `__bounds_safety_soft_trap_s` frame is selected
+bt
+# CHECK: * frame #{{.*}}`__bounds_safety_soft_trap_s{{$}}
+# CHECK-NEXT: frame #{{.*}}`bad_read
+
+# Resume execution
+c
+# CHECK: BoundsSafety check FAILED
+# CHECK-NEXT: Execution continued
+# CHECK: Process {{[0-9]+}} exited with status = 0
+q
diff --git a/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_no_dbg_info_null_str.test b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_no_dbg_info_null_str.test
new file mode 100644
index 0000000000000..6e0bf325e7665
--- /dev/null
+++ b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_no_dbg_info_null_str.test
@@ -0,0 +1,36 @@
+# UNSUPPORTED: system-windows
+# REQUIRES: clang-bounds-safety
+# RUN: %clang_host -c -O0 %S/Inputs/boundsSafetyMockCallSoftTrapRuntime.c -o %t.o
+# Note: Building the runtime without debug info is intentional because this is the common case
+# RUN: %clang_host -c -O0 %S/Inputs/boundsSafetyMockSoftTrapRuntime.c -o %t.softtrap_runtime.o
+# RUN: %clang_host %t.o %t.softtrap_runtime.o -o %t.out
+# RUN: %lldb -b -s %s %t.out 2> %t.warnings | FileCheck %s
+# Warnings are checked separately because their order in the output is not guaranteed.
+# RUN: FileCheck --input-file=%t.warnings --check-prefix=WARN %s
+
+# This test relies on this plugin being enabled
+plugin list instrumentation-runtime.BoundsSafety
+# CHECK: [+] BoundsSafety
+
+# Emit logging so that the code has test coverage
+log enable lldb instrumentation-runtime
+
+run
+
+# This exists to check that the instrumentation correctly handles
+# `__bounds_safety_soft_trap_s()` being called with a nullptr argument.
+
+# CHECK: * thread #{{.*}} stop reason = Soft Bounds check failed{{$}}
+# WARN: warning: specific BoundsSafety trap reason cannot be inferred because the compiler omitted the reason
+
+# Check that the `__bounds_safety_soft_trap_s` frame is selected
+bt
+# CHECK: * frame #{{.*}}`__bounds_safety_soft_trap_s{{$}}
+# CHECK-NEXT: frame #{{.*}}`main
+
+# Resume execution
+c
+# CHECK: BoundsSafety check FAILED
+# CHECK-NEXT: Execution continued
+# CHECK: Process {{[0-9]+}} exited with status = 0
+q
diff --git a/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_no_plugin.test b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_no_plugin.test
new file mode 100644
index 0000000000000..ca2ca988446e0
--- /dev/null
+++ b/lldb/test/Shell/BoundsSafety/boundssafety_soft_trap_call_with_str_no_plugin.test
@@ -0,0 +1,30 @@
+# UNSUPPORTED: system-windows
+# REQUIRES: clang-bounds-safety
+# RUN: %clang_host -c -fbounds-safety -fbounds-safety-soft-traps=call-with-str -g -O0 %S/Inputs/boundsSafetySoftTraps.c -o %t.o
+# Note: Building the runtime without debug info is intentional because this is the common case
+# RUN: %clang_host -c -O0 %S/Inputs/boundsSafetyMockSoftTrapRuntime.c -o %t.softtrap_runtime.o
+# RUN: %clang_host %t.o %t.softtrap_runtime.o -o %t.out
+# RUN: %lldb -b -s %s %t.out | FileCheck %s
+
+# Run without the plugin. A user might want to do this so they can set their
+# own custom breakpoint with custom stopping behavior (e.g. stop after n hits).
+plugin disable instrumentation-runtime.BoundsSafety
+# CHECK: [-] BoundsSafety
+
+b __bounds_safety_soft_trap_s
+run
+
+# CHECK: * thread #{{.*}} stop reason = breakpoint 1.1
+
+# Check that reason for bounds check failing can be seen in the stacktrace
+bt
+# CHECK: * frame #{{.*}}`__bounds_safety_soft_trap_s
+# CHECK-NEXT: frame #{{.*}}`__clang_trap_msg$Bounds check failed$indexing above upper bound in 'array[index]'
+# CHECK-NEXT: frame #{{.*}}`bad_read(index=10) at boundsSafetySoftTraps.c:5
+
+# Resume execution
+c
+# CHECK: BoundsSafety check FAILED: message:"indexing above upper bound in 'array[index]'"
+# CHECK-NEXT: Execution continued
+# CHECK: Process {{[0-9]+}} exited with status = 0
+q
diff --git a/lldb/test/Shell/SymbolFile/NativePDB/find-pdb-next-to-exe.test b/lldb/test/Shell/SymbolFile/NativePDB/find-pdb-next-to-exe.test
new file mode 100644
index 0000000000000..c35c82ad84d2f
--- /dev/null
+++ b/lldb/test/Shell/SymbolFile/NativePDB/find-pdb-next-to-exe.test
@@ -0,0 +1,76 @@
+# REQUIRES: lld, target-windows
+
+# Test where LLDB looks for PDBs.
+# RUN: split-file %s %t
+
+# RUN: mkdir -p %t/build
+# RUN: mkdir -p %t/dir1
+# RUN: mkdir -p %t/dir2
+# RUN: mkdir -p %t/dir3
+
+# RUN: echo "settings append target.debug-file-search-paths %t/dir2" >> %t/init.input
+# RUN: echo "settings append target.debug-file-search-paths %t/dir3" >> %t/init.input
+
+# RUN: %build --compiler=clang-cl --nodefaultlib --output=%t/build/a.exe %t/main.cpp
+
+# Regular setup - PDB is at the original path
+# RUN: %lldb -S %t/init.input -s %t/check.input %t/build/a.exe | FileCheck --check-prefix=BOTH-ORIG %s
+# BOTH-ORIG: (lldb) target create
+# BOTH-ORIG-NEXT: Loading {{.*[/\\]}}build{{[/\\]}}a.pdb for {{.*[/\\]}}build{{[/\\]}}a.exe
+# BOTH-ORIG: (A) a = (x = 47)
+
+# Move the executable to a different directory but keep the PDB.
+# RUN: mv %t/build/a.exe %t/dir1
+# RUN: %lldb -S %t/init.input -s %t/check.input %t/dir1/a.exe | FileCheck --check-prefix=PDB-ORIG %s
+# PDB-ORIG: (lldb) target create
+# PDB-ORIG-NEXT: Loading {{.*[/\\]}}build{{[/\\]}}a.pdb for {{.*[/\\]}}dir1{{[/\\]}}a.exe
+# PDB-ORIG: (A) a = (x = 47)
+
+# Copy the PDB to the same directory and all search dirs. LLDB should prefer the original PDB.
+# RUN: cp %t/build/a.pdb %t/dir1
+# RUN: cp %t/build/a.pdb %t/dir2
+# RUN: cp %t/build/a.pdb %t/dir3
+# RUN: %lldb -S %t/init.input -s %t/check.input %t/dir1/a.exe | FileCheck --check-prefix=PDB-ORIG %s
+
+# Remove the original PDB. LLDB should now use the one next to the exe.
+# RUN: rm %t/build/a.pdb
+# RUN: %lldb -S %t/init.input -s %t/check.input %t/dir1/a.exe | FileCheck --check-prefix=NEXT-TO-EXE %s
+# NEXT-TO-EXE: (lldb) target create
+# NEXT-TO-EXE-NEXT: Loading {{.*[/\\]}}dir1{{[/\\]}}a.pdb for {{.*[/\\]}}dir1{{[/\\]}}a.exe
+# NEXT-TO-EXE: (A) a = (x = 47)
+
+# Remove the PDB next to the exe. LLDB should now use the one in dir2 (first in list).
+# RUN: rm %t/dir1/a.pdb
+# RUN: %lldb -S %t/init.input -s %t/check.input %t/dir1/a.exe | FileCheck --check-prefix=DIR2 %s
+# DIR2: (lldb) target create
+# DIR2-NEXT: Loading {{.*[/\\]}}dir2{{[/\\]}}a.pdb for {{.*[/\\]}}dir1{{[/\\]}}a.exe
+# DIR2: (A) a = (x = 47)
+
+# Remove the PDB in dir2. LLDB should now use the one in dir3 (second in list).
+# RUN: rm %t/dir2/a.pdb
+# RUN: %lldb -S %t/init.input -s %t/check.input %t/dir1/a.exe | FileCheck --check-prefix=DIR3 %s
+# DIR3: (lldb) target create
+# DIR3-NEXT: Loading {{.*[/\\]}}dir3{{[/\\]}}a.pdb for {{.*[/\\]}}dir1{{[/\\]}}a.exe
+# DIR3: (A) a = (x = 47)
+
+# Remove the last PDB in dir3. Now, there's no matching PDB anymore.
+# RUN: rm %t/dir3/a.pdb
+# RUN: %lldb -S %t/init.input -s %t/check.input -f %t/dir1/a.exe 2>&1 | FileCheck --check-prefix=NOPDB %s
+# NOPDB: error: can't find global variable 'a'
+
+#--- main.cpp
+
+struct A {
+  int x = 47;
+};
+A a;
+int main() {}
+
+#--- init.input
+
+log enable lldb symbol
+
+#--- check.input
+
+target variable a
+q
diff --git a/lldb/test/Shell/helper/toolchain.py b/lldb/test/Shell/helper/toolchain.py
index b9e7dd7c196ab..0c8c39d37e089 100644
--- a/lldb/test/Shell/helper/toolchain.py
+++ b/lldb/test/Shell/helper/toolchain.py
@@ -226,7 +226,7 @@ def use_support_substitutions(config):
         except OSError:
             res = -1
         if res == 0 and out:
-            sdk_path = lit.util.to_string(out)
+            sdk_path = out.decode("utf-8")
             llvm_config.lit_config.note("using SDKROOT: %r" % sdk_path)
             host_flags += ["-isysroot", sdk_path]
     elif sys.platform != "win32":
diff --git a/lldb/tools/debugserver/source/DNB.cpp b/lldb/tools/debugserver/source/DNB.cpp
index 0cd48d91a682a..4d5afcf93a44b 100644
--- a/lldb/tools/debugserver/source/DNB.cpp
+++ b/lldb/tools/debugserver/source/DNB.cpp
@@ -1101,7 +1101,7 @@ DNBGetLibrariesInfoForAddresses(nub_process_t pid,
 JSONGenerator::ObjectSP DNBGetSharedCacheInfo(nub_process_t pid) {
   MachProcessSP procSP;
   if (GetProcessSP(pid, procSP)) {
-    return procSP->GetSharedCacheInfo(pid);
+    return procSP->GetInferiorSharedCacheInfo(pid);
   }
   return JSONGenerator::ObjectSP();
 }
diff --git a/lldb/tools/debugserver/source/MacOSX/MachProcess.h b/lldb/tools/debugserver/source/MacOSX/MachProcess.h
index 56bc9d6c7461e..67b27b9902999 100644
--- a/lldb/tools/debugserver/source/MacOSX/MachProcess.h
+++ b/lldb/tools/debugserver/source/MacOSX/MachProcess.h
@@ -283,7 +283,10 @@ class MachProcess {
   JSONGenerator::ObjectSP
   GetAllLoadedLibrariesInfos(nub_process_t pid,
                              bool fetch_report_load_commands);
-  JSONGenerator::ObjectSP GetSharedCacheInfo(nub_process_t pid);
+  bool GetDebugserverSharedCacheInfo(uuid_t &uuid,
+                                     std::string &shared_cache_path);
+  bool GetInferiorSharedCacheFilepath(std::string &inferior_sc_path);
+  JSONGenerator::ObjectSP GetInferiorSharedCacheInfo(nub_process_t pid);
 
   nub_size_t GetNumThreads() const;
   nub_thread_t GetThreadAtIndex(nub_size_t thread_idx) const;
@@ -474,6 +477,14 @@ class MachProcess {
 
   void *(*m_dyld_process_info_create)(task_t task, uint64_t timestamp,
                                       kern_return_t *kernelError);
+  void *(*m_dyld_process_create_for_task)(task_read_t task, kern_return_t *kr);
+  void *(*m_dyld_process_snapshot_create_for_process)(void *process,
+                                                      kern_return_t *kr);
+  void *(*m_dyld_process_snapshot_get_shared_cache)(void *snapshot);
+  void (*m_dyld_shared_cache_for_each_file)(
+      void *cache, void (^block)(const char *file_path));
+  void (*m_dyld_process_snapshot_dispose)(void *snapshot);
+  void (*m_dyld_process_dispose)(void *process);
   void (*m_dyld_process_info_for_each_image)(
       void *info, void (^callback)(uint64_t machHeaderAddress,
                                    const uuid_t uuid, const char *path));
@@ -481,6 +492,7 @@ class MachProcess {
   void (*m_dyld_process_info_get_cache)(void *info, void *cacheInfo);
   uint32_t (*m_dyld_process_info_get_platform)(void *info);
   void (*m_dyld_process_info_get_state)(void *info, void *stateInfo);
+  const char *(*m_dyld_shared_cache_file_path)();
 };
 
 #endif // LLDB_TOOLS_DEBUGSERVER_SOURCE_MACOSX_MACHPROCESS_H
diff --git a/lldb/tools/debugserver/source/MacOSX/MachProcess.mm b/lldb/tools/debugserver/source/MacOSX/MachProcess.mm
index 3b875e61a268d..10ed8045a9211 100644
--- a/lldb/tools/debugserver/source/MacOSX/MachProcess.mm
+++ b/lldb/tools/debugserver/source/MacOSX/MachProcess.mm
@@ -534,13 +534,35 @@ static bool FBSAddEventDataToOptions(NSMutableDictionary *options,
       m_image_infos_baton(NULL), m_sent_interrupt_signo(0),
       m_auto_resume_signo(0), m_did_exec(false),
       m_dyld_process_info_create(nullptr),
+      m_dyld_process_create_for_task(nullptr),
+      m_dyld_process_snapshot_create_for_process(nullptr),
+      m_dyld_process_snapshot_get_shared_cache(nullptr),
+      m_dyld_shared_cache_for_each_file(nullptr),
+      m_dyld_process_snapshot_dispose(nullptr), m_dyld_process_dispose(nullptr),
       m_dyld_process_info_for_each_image(nullptr),
       m_dyld_process_info_release(nullptr),
       m_dyld_process_info_get_cache(nullptr),
-      m_dyld_process_info_get_state(nullptr) {
+      m_dyld_process_info_get_state(nullptr),
+      m_dyld_shared_cache_file_path(nullptr) {
   m_dyld_process_info_create =
       (void *(*)(task_t task, uint64_t timestamp, kern_return_t * kernelError))
           dlsym(RTLD_DEFAULT, "_dyld_process_info_create");
+
+  m_dyld_process_create_for_task =
+      (void *(*)(task_read_t, kern_return_t *))dlsym(
+          RTLD_DEFAULT, "dyld_process_create_for_task");
+  m_dyld_process_snapshot_create_for_process =
+      (void *(*)(void *, kern_return_t *))dlsym(
+          RTLD_DEFAULT, "dyld_process_snapshot_create_for_process");
+  m_dyld_process_snapshot_get_shared_cache = (void *(*)(void *))dlsym(
+      RTLD_DEFAULT, "dyld_process_snapshot_get_shared_cache");
+  m_dyld_shared_cache_for_each_file =
+      (void (*)(void *, void (^)(const char *)))dlsym(
+          RTLD_DEFAULT, "dyld_shared_cache_for_each_file");
+  m_dyld_process_snapshot_dispose =
+      (void (*)(void *))dlsym(RTLD_DEFAULT, "dyld_process_snapshot_dispose");
+  m_dyld_process_dispose =
+      (void (*)(void *))dlsym(RTLD_DEFAULT, "dyld_process_dispose");
   m_dyld_process_info_for_each_image =
       (void (*)(void *info, void (^)(uint64_t machHeaderAddress,
                                      const uuid_t uuid, const char *path)))
@@ -553,6 +575,8 @@ static bool FBSAddEventDataToOptions(NSMutableDictionary *options,
       RTLD_DEFAULT, "_dyld_process_info_get_platform");
   m_dyld_process_info_get_state = (void (*)(void *info, void *stateInfo))dlsym(
       RTLD_DEFAULT, "_dyld_process_info_get_state");
+  m_dyld_shared_cache_file_path =
+      (const char *(*)())dlsym(RTLD_DEFAULT, "dyld_shared_cache_file_path");
 
   DNBLogThreadedIf(LOG_PROCESS | LOG_VERBOSE, "%s", __PRETTY_FUNCTION__);
 }
@@ -1179,13 +1203,82 @@ static bool mach_header_validity_test(uint32_t magic, uint32_t cputype) {
                                           /* report_load_commands =  */ true);
 }
 
-// From dyld's internal podyld_process_info.h:
+bool MachProcess::GetDebugserverSharedCacheInfo(
+    uuid_t &uuid, std::string &shared_cache_path) {
+  uuid_clear(uuid);
+  shared_cache_path.clear();
+
+  if (m_dyld_process_info_create && m_dyld_process_info_get_cache) {
+    kern_return_t kern_ret;
+    dyld_process_info info =
+        m_dyld_process_info_create(mach_task_self(), 0, &kern_ret);
+    if (info) {
+      struct dyld_process_cache_info shared_cache_info;
+      m_dyld_process_info_get_cache(info, &shared_cache_info);
+      uuid_copy(uuid, shared_cache_info.cacheUUID);
+      m_dyld_process_info_release(info);
+    }
+  }
+  if (m_dyld_shared_cache_file_path) {
+    const char *cache_path = m_dyld_shared_cache_file_path();
+    if (cache_path)
+      shared_cache_path = cache_path;
+  }
+  if (!uuid_is_null(uuid))
+    return true;
+  return false;
+}
+
+bool MachProcess::GetInferiorSharedCacheFilepath(
+    std::string &inferior_sc_path) {
+  inferior_sc_path.clear();
+
+  if (!m_dyld_process_create_for_task ||
+      !m_dyld_process_snapshot_create_for_process ||
+      !m_dyld_process_snapshot_get_shared_cache ||
+      !m_dyld_shared_cache_for_each_file || !m_dyld_process_snapshot_dispose ||
+      !m_dyld_process_dispose)
+    return false;
+
+  __block std::string sc_path;
+  kern_return_t kr;
+  void *process = m_dyld_process_create_for_task(m_task.TaskPort(), &kr);
+  if (kr != KERN_SUCCESS)
+    return false;
+  void *snapshot = m_dyld_process_snapshot_create_for_process(process, &kr);
+  if (kr != KERN_SUCCESS)
+    return false;
+  void *cache = m_dyld_process_snapshot_get_shared_cache(snapshot);
+
+  // The shared cache is a collection of files on disk; this callback
+  // will iterate over all of them.
+  // The first filepath provided is the base filename of the cache.
+  __block bool done = false;
+  m_dyld_shared_cache_for_each_file(cache, ^(const char *path) {
+    if (done) {
+      return;
+    }
+    done = true;
+    sc_path = path;
+  });
+  m_dyld_process_snapshot_dispose(snapshot);
+  m_dyld_process_dispose(process);
+
+  inferior_sc_path = sc_path;
+  if (!sc_path.empty())
+    return true;
+  return false;
+}
+
+// From dyld's internal dyld_process_info.h:
 
-JSONGenerator::ObjectSP MachProcess::GetSharedCacheInfo(nub_process_t pid) {
+JSONGenerator::ObjectSP
+MachProcess::GetInferiorSharedCacheInfo(nub_process_t pid) {
   JSONGenerator::DictionarySP reply_sp(new JSONGenerator::Dictionary());
 
-  kern_return_t kern_ret;
+  uuid_t inferior_sc_uuid;
   if (m_dyld_process_info_create && m_dyld_process_info_get_cache) {
+    kern_return_t kern_ret;
     dyld_process_info info =
         m_dyld_process_info_create(m_task.TaskPort(), 0, &kern_ret);
     if (info) {
@@ -1197,6 +1290,7 @@ static bool mach_header_validity_test(uint32_t magic, uint32_t cputype) {
 
       uuid_string_t uuidstr;
       uuid_unparse_upper(shared_cache_info.cacheUUID, uuidstr);
+      uuid_copy(inferior_sc_uuid, shared_cache_info.cacheUUID);
       reply_sp->AddStringItem("shared_cache_uuid", uuidstr);
 
       reply_sp->AddBooleanItem("no_shared_cache", shared_cache_info.noCache);
@@ -1206,6 +1300,29 @@ static bool mach_header_validity_test(uint32_t magic, uint32_t cputype) {
       m_dyld_process_info_release(info);
     }
   }
+
+  // If debugserver and the inferior have the same cache UUID, use the
+  // simple call to get the filepath to debugserver's shared cache and
+  // return that.
+  uuid_t debugserver_sc_uuid;
+  std::string debugserver_sc_path;
+  bool found_sc_filepath = false;
+  if (GetDebugserverSharedCacheInfo(debugserver_sc_uuid, debugserver_sc_path)) {
+    if (uuid_compare(inferior_sc_uuid, debugserver_sc_uuid) == 0 &&
+        !debugserver_sc_path.empty()) {
+      reply_sp->AddStringItem("shared_cache_path", debugserver_sc_path);
+      found_sc_filepath = true;
+    }
+  }
+
+  // Use SPIs that are only available on newer OSes to fetch the
+  // filepath of the inferior's shared cache, if available.
+  if (!found_sc_filepath) {
+    std::string inferior_sc_path;
+    if (GetInferiorSharedCacheFilepath(inferior_sc_path))
+      reply_sp->AddStringItem("shared_cache_path", inferior_sc_path);
+  }
+
   return reply_sp;
 }
 
diff --git a/lldb/tools/lldb-dap/DAP.cpp b/lldb/tools/lldb-dap/DAP.cpp
index 465d85a07bd34..6971bfea5c128 100644
--- a/lldb/tools/lldb-dap/DAP.cpp
+++ b/lldb/tools/lldb-dap/DAP.cpp
@@ -274,8 +274,11 @@ Id DAP::Send(const Message &message) {
   std::lock_guard<std::mutex> guard(call_mutex);
   Message msg = std::visit(
       [this](auto &&msg) -> Message {
-        if (msg.seq == kCalculateSeq)
-          msg.seq = seq++;
+        if (msg.seq == kCalculateSeq) {
+          seq++;
+          msg.seq = seq;
+        }
+        assert(msg.seq > 0 && "message sequence must be greater than zero.");
         return msg;
       },
       Message(message));
diff --git a/lldb/tools/lldb-dap/Handler/CompileUnitsRequestHandler.cpp b/lldb/tools/lldb-dap/Handler/CompileUnitsRequestHandler.cpp
index cd937116f7380..0e5c2b23d8d67 100644
--- a/lldb/tools/lldb-dap/Handler/CompileUnitsRequestHandler.cpp
+++ b/lldb/tools/lldb-dap/Handler/CompileUnitsRequestHandler.cpp
@@ -60,12 +60,12 @@ void CompileUnitsRequestHandler::operator()(
   llvm::json::Object body;
   llvm::json::Array units;
   const auto *arguments = request.getObject("arguments");
-  const std::string module_id =
-      GetString(arguments, "moduleId").value_or("").str();
+  const llvm::StringRef module_id =
+      GetString(arguments, "moduleId").value_or("");
   int num_modules = dap.target.GetNumModules();
   for (int i = 0; i < num_modules; i++) {
     auto curr_module = dap.target.GetModuleAtIndex(i);
-    if (module_id == curr_module.GetUUIDString()) {
+    if (module_id == llvm::StringRef(curr_module.GetUUIDString())) {
       int num_units = curr_module.GetNumCompileUnits();
       for (int j = 0; j < num_units; j++) {
         auto curr_unit = curr_module.GetCompileUnitAtIndex(j);
diff --git a/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp b/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp
index 53e1810a5b0e0..2d30e089447f1 100644
--- a/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp
+++ b/lldb/tools/lldb-dap/Handler/InitializeRequestHandler.cpp
@@ -23,7 +23,7 @@ llvm::Expected<InitializeResponse> InitializeRequestHandler::Run(
     const InitializeRequestArguments &arguments) const {
   // Store initialization arguments for later use in Launch/Attach.
   dap.clientFeatures = arguments.supportedFeatures;
-  dap.sourceInitFile = arguments.lldbExtSourceInitFile.value_or(true);
+  dap.sourceInitFile = arguments.lldbExtSourceInitFile;
 
   return dap.GetCapabilities();
 }
diff --git a/lldb/tools/lldb-dap/JSONUtils.cpp b/lldb/tools/lldb-dap/JSONUtils.cpp
index 5c4afa3fd2f62..40b4f5b9f7f90 100644
--- a/lldb/tools/lldb-dap/JSONUtils.cpp
+++ b/lldb/tools/lldb-dap/JSONUtils.cpp
@@ -554,9 +554,8 @@ llvm::json::Value CreateStackFrame(DAP &dap, lldb::SBFrame &frame,
 
   lldb::SBModule module = frame.GetModule();
   if (module.IsValid()) {
-    std::string uuid = module.GetUUIDString();
-    if (!uuid.empty())
-      object.try_emplace("moduleId", uuid);
+    if (const llvm::StringRef uuid = module.GetUUIDString(); !uuid.empty())
+      object.try_emplace("moduleId", uuid.str());
   }
 
   return llvm::json::Value(std::move(object));
diff --git a/lldb/tools/lldb-dap/Protocol/ProtocolBase.h b/lldb/tools/lldb-dap/Protocol/ProtocolBase.h
index 42c6c8890af24..09ce6802b17c0 100644
--- a/lldb/tools/lldb-dap/Protocol/ProtocolBase.h
+++ b/lldb/tools/lldb-dap/Protocol/ProtocolBase.h
@@ -31,11 +31,11 @@ namespace lldb_dap::protocol {
 // MARK: Base Protocol
 
 /// Message unique identifier type.
-using Id = int64_t;
+using Id = uint64_t;
 
 /// A unique identifier that indicates the `seq` field should be calculated by
 /// the current session.
-static constexpr Id kCalculateSeq = INT64_MAX;
+static constexpr Id kCalculateSeq = UINT64_MAX;
 
 /// A client or debug adapter initiated request.
 struct Request {
diff --git a/lldb/tools/lldb-dap/Protocol/ProtocolRequests.cpp b/lldb/tools/lldb-dap/Protocol/ProtocolRequests.cpp
index d53a520ade39b..0a1d580bffd68 100644
--- a/lldb/tools/lldb-dap/Protocol/ProtocolRequests.cpp
+++ b/lldb/tools/lldb-dap/Protocol/ProtocolRequests.cpp
@@ -216,12 +216,13 @@ bool fromJSON(const json::Value &Params, InitializeRequestArguments &IRA,
   }
 
   return OM.map("adapterID", IRA.adapterID) &&
-         OM.map("clientID", IRA.clientID) &&
-         OM.map("clientName", IRA.clientName) && OM.map("locale", IRA.locale) &&
-         OM.map("linesStartAt1", IRA.linesStartAt1) &&
-         OM.map("columnsStartAt1", IRA.columnsStartAt1) &&
+         OM.mapOptional("clientID", IRA.clientID) &&
+         OM.mapOptional("clientName", IRA.clientName) &&
+         OM.mapOptional("locale", IRA.locale) &&
+         OM.mapOptional("linesStartAt1", IRA.linesStartAt1) &&
+         OM.mapOptional("columnsStartAt1", IRA.columnsStartAt1) &&
          OM.mapOptional("pathFormat", IRA.pathFormat) &&
-         OM.map("$__lldb_sourceInitFile", IRA.lldbExtSourceInitFile);
+         OM.mapOptional("$__lldb_sourceInitFile", IRA.lldbExtSourceInitFile);
 }
 
 bool fromJSON(const json::Value &Params, Configuration &C, json::Path P) {
diff --git a/lldb/tools/lldb-dap/Protocol/ProtocolRequests.h b/lldb/tools/lldb-dap/Protocol/ProtocolRequests.h
index 37fc2465f6a05..6a85033ae7ef2 100644
--- a/lldb/tools/lldb-dap/Protocol/ProtocolRequests.h
+++ b/lldb/tools/lldb-dap/Protocol/ProtocolRequests.h
@@ -108,23 +108,23 @@ struct InitializeRequestArguments {
   std::string adapterID;
 
   /// The ID of the client using this adapter.
-  std::optional<std::string> clientID;
+  std::string clientID;
 
   /// The human-readable name of the client using this adapter.
-  std::optional<std::string> clientName;
+  std::string clientName;
 
   /// The ISO-639 locale of the client using this adapter, e.g. en-US or de-CH.
-  std::optional<std::string> locale;
+  std::string locale;
 
   /// Determines in what format paths are specified. The default is `path`,
   /// which is the native format.
   PathFormat pathFormat = ePatFormatPath;
 
   /// If true all line numbers are 1-based (default).
-  std::optional<bool> linesStartAt1;
+  bool linesStartAt1 = true;
 
   /// If true all column numbers are 1-based (default).
-  std::optional<bool> columnsStartAt1;
+  bool columnsStartAt1 = true;
 
   /// The set of supported features reported by the client.
   llvm::DenseSet<ClientFeature> supportedFeatures;
@@ -133,7 +133,7 @@ struct InitializeRequestArguments {
   /// @{
 
   /// Source init files when initializing lldb::SBDebugger.
-  std::optional<bool> lldbExtSourceInitFile;
+  bool lldbExtSourceInitFile = true;
 
   /// @}
 };
diff --git a/lldb/unittests/DAP/ProtocolRequestsTest.cpp b/lldb/unittests/DAP/ProtocolRequestsTest.cpp
index ba9aef1e5fcc5..a74c369924b8e 100644
--- a/lldb/unittests/DAP/ProtocolRequestsTest.cpp
+++ b/lldb/unittests/DAP/ProtocolRequestsTest.cpp
@@ -77,7 +77,7 @@ TEST(ProtocolRequestsTest, EvaluateArguments) {
   EXPECT_EQ(expected->expression, "hello world");
   EXPECT_EQ(expected->context, eEvaluateContextRepl);
 
-  // Check required keys;
+  // Check required keys.
   EXPECT_THAT_EXPECTED(parse<EvaluateArguments>(R"({})"),
                        FailedWithMessage("missing value at (root).expression"));
 }
@@ -118,3 +118,67 @@ TEST(ProtocolRequestsTest, EvaluateResponseBody) {
   ASSERT_THAT_EXPECTED(expected_opt, llvm::Succeeded());
   EXPECT_EQ(PrettyPrint(*expected_opt), PrettyPrint(body));
 }
+
+TEST(ProtocolRequestsTest, InitializeRequestArguments) {
+  llvm::Expected<InitializeRequestArguments> expected =
+      parse<InitializeRequestArguments>(R"({"adapterID": "myid"})");
+  ASSERT_THAT_EXPECTED(expected, llvm::Succeeded());
+  EXPECT_EQ(expected->adapterID, "myid");
+
+  // Check optional keys.
+  expected = parse<InitializeRequestArguments>(R"({
+    "adapterID": "myid",
+    "clientID": "myclientid",
+    "clientName": "lldb-dap-unit-tests",
+    "locale": "en-US",
+    "linesStartAt1": true,
+    "columnsStartAt1": true,
+    "pathFormat": "uri",
+    "supportsVariableType": true,
+    "supportsVariablePaging": true,
+    "supportsRunInTerminalRequest": true,
+    "supportsMemoryReferences": true,
+    "supportsProgressReporting": true,
+    "supportsInvalidatedEvent": true,
+    "supportsMemoryEvent": true,
+    "supportsArgsCanBeInterpretedByShell": true,
+    "supportsStartDebuggingRequest": true,
+    "supportsANSIStyling": true
+  })");
+  ASSERT_THAT_EXPECTED(expected, llvm::Succeeded());
+  EXPECT_EQ(expected->adapterID, "myid");
+  EXPECT_EQ(expected->clientID, "myclientid");
+  EXPECT_EQ(expected->clientName, "lldb-dap-unit-tests");
+  EXPECT_EQ(expected->locale, "en-US");
+  EXPECT_EQ(expected->linesStartAt1, true);
+  EXPECT_EQ(expected->columnsStartAt1, true);
+  EXPECT_EQ(expected->pathFormat, ePathFormatURI);
+  EXPECT_EQ(expected->supportedFeatures.contains(eClientFeatureVariableType),
+            true);
+  EXPECT_EQ(
+      expected->supportedFeatures.contains(eClientFeatureRunInTerminalRequest),
+      true);
+  EXPECT_EQ(
+      expected->supportedFeatures.contains(eClientFeatureMemoryReferences),
+      true);
+  EXPECT_EQ(
+      expected->supportedFeatures.contains(eClientFeatureProgressReporting),
+      true);
+  EXPECT_EQ(
+      expected->supportedFeatures.contains(eClientFeatureInvalidatedEvent),
+      true);
+  EXPECT_EQ(expected->supportedFeatures.contains(eClientFeatureMemoryEvent),
+            true);
+  EXPECT_EQ(expected->supportedFeatures.contains(
+                eClientFeatureArgsCanBeInterpretedByShell),
+            true);
+  EXPECT_EQ(
+      expected->supportedFeatures.contains(eClientFeatureStartDebuggingRequest),
+      true);
+  EXPECT_EQ(expected->supportedFeatures.contains(eClientFeatureANSIStyling),
+            true);
+
+  // Check required keys.
+  EXPECT_THAT_EXPECTED(parse<InitializeRequestArguments>(R"({})"),
+                       FailedWithMessage("missing value at (root).adapterID"));
+}
diff --git a/lldb/unittests/DAP/TestBase.cpp b/lldb/unittests/DAP/TestBase.cpp
index 8cb459964f7d8..f4dde9559e9d3 100644
--- a/lldb/unittests/DAP/TestBase.cpp
+++ b/lldb/unittests/DAP/TestBase.cpp
@@ -72,6 +72,7 @@ void DAPTestBase::TearDown() {
 
 void DAPTestBase::SetUpTestSuite() {
   lldb::SBError error = SBDebugger::InitializeWithErrorHandling();
+  EXPECT_TRUE(error.IsValid());
   EXPECT_TRUE(error.Success());
 }
 void DAPTestBase::TeatUpTestSuite() { SBDebugger::Terminate(); }
diff --git a/lldb/unittests/Expression/DWARFExpressionTest.cpp b/lldb/unittests/Expression/DWARFExpressionTest.cpp
index e0c2193d27c36..f264fb3ce94e5 100644
--- a/lldb/unittests/Expression/DWARFExpressionTest.cpp
+++ b/lldb/unittests/Expression/DWARFExpressionTest.cpp
@@ -237,8 +237,9 @@ static llvm::Expected<Value> Evaluate(llvm::ArrayRef<uint8_t> expr,
                                       DWARFExpression::Delegate *unit = nullptr,
                                       ExecutionContext *exe_ctx = nullptr,
                                       RegisterContext *reg_ctx = nullptr) {
-  DataExtractor extractor(expr.data(), expr.size(), lldb::eByteOrderLittle,
-                          /*addr_size*/ 4);
+  DataExtractor extractor(
+      expr.data(), expr.size(), lldb::eByteOrderLittle,
+      /*addr_size*/ exe_ctx ? exe_ctx->GetAddressByteSize() : 4);
 
   return DWARFExpression::Evaluate(exe_ctx, reg_ctx, module_sp, extractor, unit,
                                    lldb::eRegisterKindLLDB,
@@ -1216,3 +1217,107 @@ TEST_F(DWARFExpressionMockProcessTestWithAArch, DW_op_deref_no_ptr_fixing) {
   llvm::Expected<Value> result_deref = evaluate_expr(expr_deref);
   EXPECT_THAT_EXPECTED(result_deref, ExpectLoadAddress(expected_value));
 }
+
+TEST_F(DWARFExpressionMockProcessTest, deref_register) {
+  TestContext test_ctx;
+  constexpr uint32_t reg_r0 = 0x504;
+  MockMemory::Map memory = {
+      {{0x004, 4}, {0x1, 0x2, 0x3, 0x4}},
+      {{0x504, 4}, {0xa, 0xb, 0xc, 0xd}},
+      {{0x505, 4}, {0x5, 0x6, 0x7, 0x8}},
+  };
+  ASSERT_TRUE(CreateTestContext(&test_ctx, "i386-pc-linux",
+                                RegisterValue(reg_r0), memory, memory));
+
+  ExecutionContext exe_ctx(test_ctx.process_sp);
+  MockDwarfDelegate delegate = MockDwarfDelegate::Dwarf5();
+  auto Eval = [&](llvm::ArrayRef<uint8_t> expr_data) {
+    ExecutionContext exe_ctx(test_ctx.process_sp);
+    return Evaluate(expr_data, {}, &delegate, &exe_ctx,
+                    test_ctx.reg_ctx_sp.get());
+  };
+
+  // Reads from the register r0.
+  // Sets the context to RegisterInfo so we know this is a register location.
+  EXPECT_THAT_EXPECTED(Eval({DW_OP_reg0}),
+                       ExpectScalar(reg_r0, Value::ContextType::RegisterInfo));
+
+  // Reads from the location(register r0).
+  // Clears the context so we know this is a value not a location.
+  EXPECT_THAT_EXPECTED(Eval({DW_OP_reg0, DW_OP_deref}),
+                       ExpectLoadAddress(reg_r0, Value::ContextType::Invalid));
+
+  // Reads from the location(register r0) and adds the value to the host buffer.
+  // The evaluator should implicitly convert it to a memory location when
+  // added to a composite value and should add the contents of memory[r0]
+  // to the host buffer.
+  EXPECT_THAT_EXPECTED(Eval({DW_OP_reg0, DW_OP_deref, DW_OP_piece, 4}),
+                       ExpectHostAddress({0xa, 0xb, 0xc, 0xd}));
+
+  // Reads from the location(register r0) and truncates the value to one byte.
+  // Clears the context so we know this is a value not a location.
+  EXPECT_THAT_EXPECTED(
+      Eval({DW_OP_reg0, DW_OP_deref_size, 1}),
+      ExpectLoadAddress(reg_r0 & 0xff, Value::ContextType::Invalid));
+
+  // Reads from the location(register r0) and truncates to one byte then adds
+  // the value to the host buffer. The evaluator should implicitly convert it to
+  // a memory location when added to a composite value and should add the
+  // contents of memory[r0 & 0xff] to the host buffer.
+  EXPECT_THAT_EXPECTED(Eval({DW_OP_reg0, DW_OP_deref_size, 1, DW_OP_piece, 4}),
+                       ExpectHostAddress({0x1, 0x2, 0x3, 0x4}));
+
+  // Reads from the register r0 + 1.
+  EXPECT_THAT_EXPECTED(
+      Eval({DW_OP_breg0, 1}),
+      ExpectLoadAddress(reg_r0 + 1, Value::ContextType::Invalid));
+
+  // Reads from address r0 + 1, which contains the bytes [5,6,7,8].
+  EXPECT_THAT_EXPECTED(
+      Eval({DW_OP_breg0, 1, DW_OP_deref}),
+      ExpectLoadAddress(0x08070605, Value::ContextType::Invalid));
+}
+
+TEST_F(DWARFExpressionMockProcessTest, deref_implicit_value) {
+  TestContext test_ctx;
+  MockMemory::Map memory = {
+      {{0x4, 1}, {0x1}},
+      {{0x4, 4}, {0x1, 0x2, 0x3, 0x4}},
+  };
+  ASSERT_TRUE(CreateTestContext(&test_ctx, "i386-pc-linux", {}, memory));
+
+  ExecutionContext exe_ctx(test_ctx.process_sp);
+  MockDwarfDelegate delegate = MockDwarfDelegate::Dwarf5();
+  auto Eval = [&](llvm::ArrayRef<uint8_t> expr_data) {
+    ExecutionContext exe_ctx(test_ctx.process_sp);
+    return Evaluate(expr_data, {}, &delegate, &exe_ctx,
+                    test_ctx.reg_ctx_sp.get());
+  };
+
+  // Creates an implicit location with a value of 4.
+  EXPECT_THAT_EXPECTED(Eval({DW_OP_lit4, DW_OP_stack_value}),
+                       ExpectScalar(0x4));
+
+  // Creates an implicit location with a value of 4. The deref reads the value
+  // out of the location and implicitly converts it to a load address.
+  EXPECT_THAT_EXPECTED(Eval({DW_OP_lit4, DW_OP_stack_value, DW_OP_deref}),
+                       ExpectLoadAddress(0x4));
+
+  // Creates an implicit location with a value of 0x504 (uleb128(0x504) =
+  // 0xa84). The deref reads the low byte out of the location and implicitly
+  // converts it to a load address.
+  EXPECT_THAT_EXPECTED(
+      Eval({DW_OP_constu, 0x84, 0xa, DW_OP_stack_value, DW_OP_deref_size, 1}),
+      ExpectLoadAddress(0x4));
+
+  // The tests below are similar to the ones above, but there is no implicit
+  // location created by a stack_value operation. They are provided here as a
+  // reference to contrast with the above tests.
+  EXPECT_THAT_EXPECTED(Eval({DW_OP_lit4}), ExpectLoadAddress(0x4));
+
+  EXPECT_THAT_EXPECTED(Eval({DW_OP_lit4, DW_OP_deref}),
+                       ExpectLoadAddress(0x04030201));
+
+  EXPECT_THAT_EXPECTED(Eval({DW_OP_lit4, DW_OP_deref_size, 1}),
+                       ExpectLoadAddress(0x01));
+}
diff --git a/lldb/unittests/Language/CPlusPlus/CPlusPlusLanguageTest.cpp b/lldb/unittests/Language/CPlusPlus/CPlusPlusLanguageTest.cpp
index c05418168e62e..41df35f67a790 100644
--- a/lldb/unittests/Language/CPlusPlus/CPlusPlusLanguageTest.cpp
+++ b/lldb/unittests/Language/CPlusPlus/CPlusPlusLanguageTest.cpp
@@ -69,6 +69,12 @@ TEST(CPlusPlusLanguage, MethodNameParsing) {
        "const",
        "std::__1::ranges::__begin::__fn::operator()[abi:v160000]<char const, "
        "18ul>"},
+      {"bool Ball[abi:BALL]<int>::operator<<[abi:operator]<int>(int)", "bool",
+       "Ball[abi:BALL]<int>", "operator<<[abi:operator]<int>", "(int)", "",
+       "Ball[abi:BALL]<int>::operator<<[abi:operator]<int>"},
+      {"bool Ball[abi:BALL]<int>::operator>>[abi:operator]<int>(int)", "bool",
+       "Ball[abi:BALL]<int>", "operator>>[abi:operator]<int>", "(int)", "",
+       "Ball[abi:BALL]<int>::operator>>[abi:operator]<int>"},
       // Internal classes
       {"operator<<(Cls, Cls)::Subclass::function()", "",
        "operator<<(Cls, Cls)::Subclass", "function", "()", "",
diff --git a/lldb/unittests/ScriptInterpreter/Python/PythonTestSuite.cpp b/lldb/unittests/ScriptInterpreter/Python/PythonTestSuite.cpp
index a63b740d9472f..5694aeeff3e5b 100644
--- a/lldb/unittests/ScriptInterpreter/Python/PythonTestSuite.cpp
+++ b/lldb/unittests/ScriptInterpreter/Python/PythonTestSuite.cpp
@@ -136,6 +136,11 @@ lldb_private::python::LLDBSWIGPython_CastPyObjectToSBStream(PyObject *data) {
   return nullptr;
 }
 
+void *
+lldb_private::python::LLDBSWIGPython_CastPyObjectToSBThread(PyObject *data) {
+  return nullptr;
+}
+
 void *
 lldb_private::python::LLDBSWIGPython_CastPyObjectToSBFrame(PyObject *data) {
   return nullptr;
diff --git a/lldb/unittests/Symbol/TestClangASTImporter.cpp b/lldb/unittests/Symbol/TestClangASTImporter.cpp
index f1b3d7911c4bd..07c42088b9101 100644
--- a/lldb/unittests/Symbol/TestClangASTImporter.cpp
+++ b/lldb/unittests/Symbol/TestClangASTImporter.cpp
@@ -287,7 +287,7 @@ TEST_F(TestClangASTImporter, RecordLayoutFromOrigin) {
   clang_utils::SourceASTWithRecord source;
 
   auto *dwarf_parser =
-      static_cast<DWARFASTParserClang *>(source.ast->GetDWARFParser());
+      llvm::cast<DWARFASTParserClang>(source.ast->GetDWARFParser());
   auto &importer = dwarf_parser->GetClangASTImporter();
 
   // Set the layout for the origin decl in the origin ClangASTImporter.
diff --git a/lldb/unittests/SymbolFile/DWARF/DWARFASTParserClangTests.cpp b/lldb/unittests/SymbolFile/DWARF/DWARFASTParserClangTests.cpp
index cef3a25a4a960..6a753b6b33edf 100644
--- a/lldb/unittests/SymbolFile/DWARF/DWARFASTParserClangTests.cpp
+++ b/lldb/unittests/SymbolFile/DWARF/DWARFASTParserClangTests.cpp
@@ -42,6 +42,52 @@ class DWARFASTParserClangStub : public DWARFASTParserClang {
     return keys;
   }
 };
+
+/// Helper structure for DWARFASTParserClang tests that want to parse DWARF
+/// generated using yaml2obj. On construction parses the supplied YAML data
+/// into a DWARF module and thereafter vends a DWARFASTParserClang and
+/// TypeSystemClang that are guaranteed to live for the duration of this object.
+class DWARFASTParserClangYAMLTester {
+public:
+  DWARFASTParserClangYAMLTester(llvm::StringRef yaml_data)
+      : m_module_tester(yaml_data) {}
+
+  DWARFDIE GetCUDIE() {
+    DWARFUnit *unit = m_module_tester.GetDwarfUnit();
+    assert(unit);
+
+    const DWARFDebugInfoEntry *cu_entry = unit->DIE().GetDIE();
+    assert(cu_entry->Tag() == DW_TAG_compile_unit);
+
+    return DWARFDIE(unit, cu_entry);
+  }
+
+  DWARFASTParserClang &GetParser() {
+    auto *parser = GetTypeSystem().GetDWARFParser();
+
+    assert(llvm::isa_and_nonnull<DWARFASTParserClang>(parser));
+
+    return *llvm::cast<DWARFASTParserClang>(parser);
+  }
+
+  TypeSystemClang &GetTypeSystem() {
+    ModuleSP module_sp = m_module_tester.GetModule();
+    assert(module_sp);
+
+    SymbolFile *symfile = module_sp->GetSymbolFile();
+    assert(symfile);
+
+    TypeSystemSP ts_sp = llvm::cantFail(
+        symfile->GetTypeSystemForLanguage(lldb::eLanguageTypeC_plus_plus));
+
+    assert(llvm::isa_and_nonnull<TypeSystemClang>(ts_sp.get()));
+
+    return llvm::cast<TypeSystemClang>(*ts_sp);
+  }
+
+private:
+  YAMLModuleTester m_module_tester;
+};
 } // namespace
 
 // If your implementation needs to dereference the dummy pointers we are
@@ -99,7 +145,6 @@ TEST_F(DWARFASTParserClangTests,
             - Value:           0x0000000000000001
         - AbbrCode:        0x00000000
 )";
-
   YAMLModuleTester t(yamldata);
   ASSERT_TRUE((bool)t.GetDwarfUnit());
 
@@ -248,17 +293,9 @@ TEST_F(DWARFASTParserClangTests, TestCallingConventionParsing) {
         - AbbrCode:        0x0
 ...
 )";
-  YAMLModuleTester t(yamldata);
+  DWARFASTParserClangYAMLTester tester(yamldata);
 
-  DWARFUnit *unit = t.GetDwarfUnit();
-  ASSERT_NE(unit, nullptr);
-  const DWARFDebugInfoEntry *cu_entry = unit->DIE().GetDIE();
-  ASSERT_EQ(cu_entry->Tag(), DW_TAG_compile_unit);
-  DWARFDIE cu_die(unit, cu_entry);
-
-  auto holder = std::make_unique<clang_utils::TypeSystemClangHolder>("ast");
-  auto &ast_ctx = *holder->GetAST();
-  DWARFASTParserClangStub ast_parser(ast_ctx);
+  DWARFDIE cu_die = tester.GetCUDIE();
 
   std::vector<std::string> found_function_types;
   // The DWARF above is just a list of functions. Parse all of them to
@@ -267,7 +304,8 @@ TEST_F(DWARFASTParserClangTests, TestCallingConventionParsing) {
     ASSERT_EQ(func.Tag(), DW_TAG_subprogram);
     SymbolContext sc;
     bool new_type = false;
-    lldb::TypeSP type = ast_parser.ParseTypeFromDWARF(sc, func, &new_type);
+    lldb::TypeSP type =
+        tester.GetParser().ParseTypeFromDWARF(sc, func, &new_type);
     found_function_types.push_back(
         type->GetForwardCompilerType().GetTypeName().AsCString());
   }
@@ -394,18 +432,9 @@ TEST_F(DWARFASTParserClangTests, TestPtrAuthParsing) {
         - AbbrCode:        0x00 # end of child tags of 0x0c
 ...
 )";
-  YAMLModuleTester t(yamldata);
-
-  DWARFUnit *unit = t.GetDwarfUnit();
-  ASSERT_NE(unit, nullptr);
-  const DWARFDebugInfoEntry *cu_entry = unit->DIE().GetDIE();
-  ASSERT_EQ(cu_entry->Tag(), DW_TAG_compile_unit);
-  DWARFDIE cu_die(unit, cu_entry);
-
-  auto holder = std::make_unique<clang_utils::TypeSystemClangHolder>("ast");
-  auto &ast_ctx = *holder->GetAST();
-  DWARFASTParserClangStub ast_parser(ast_ctx);
+  DWARFASTParserClangYAMLTester tester(yamldata);
 
+  DWARFDIE cu_die = tester.GetCUDIE();
   DWARFDIE ptrauth_variable = cu_die.GetFirstChild();
   ASSERT_EQ(ptrauth_variable.Tag(), DW_TAG_variable);
   DWARFDIE ptrauth_type =
@@ -415,7 +444,7 @@ TEST_F(DWARFASTParserClangTests, TestPtrAuthParsing) {
   SymbolContext sc;
   bool new_type = false;
   lldb::TypeSP type_sp =
-      ast_parser.ParseTypeFromDWARF(sc, ptrauth_type, &new_type);
+      tester.GetParser().ParseTypeFromDWARF(sc, ptrauth_type, &new_type);
   CompilerType compiler_type = type_sp->GetForwardCompilerType();
   ASSERT_EQ(compiler_type.GetPtrAuthKey(), 0U);
   ASSERT_EQ(compiler_type.GetPtrAuthAddressDiversity(), false);
@@ -554,24 +583,17 @@ TEST_F(DWARFASTParserClangTests, TestDefaultTemplateParamParsing) {
   auto BufferOrError = llvm::MemoryBuffer::getFile(
       GetInputFilePath("DW_AT_default_value-test.yaml"), /*IsText=*/true);
   ASSERT_TRUE(BufferOrError);
-  YAMLModuleTester t(BufferOrError.get()->getBuffer());
 
-  DWARFUnit *unit = t.GetDwarfUnit();
-  ASSERT_NE(unit, nullptr);
-  const DWARFDebugInfoEntry *cu_entry = unit->DIE().GetDIE();
-  ASSERT_EQ(cu_entry->Tag(), DW_TAG_compile_unit);
-  DWARFDIE cu_die(unit, cu_entry);
-
-  auto holder = std::make_unique<clang_utils::TypeSystemClangHolder>("ast");
-  auto &ast_ctx = *holder->GetAST();
-  DWARFASTParserClangStub ast_parser(ast_ctx);
+  DWARFASTParserClangYAMLTester tester(BufferOrError.get()->getBuffer());
+  DWARFDIE cu_die = tester.GetCUDIE();
 
   llvm::SmallVector<lldb::TypeSP, 2> types;
   for (DWARFDIE die : cu_die.children()) {
     if (die.Tag() == DW_TAG_class_type) {
       SymbolContext sc;
       bool new_type = false;
-      types.push_back(ast_parser.ParseTypeFromDWARF(sc, die, &new_type));
+      types.push_back(
+          tester.GetParser().ParseTypeFromDWARF(sc, die, &new_type));
     }
   }
 
@@ -605,23 +627,14 @@ TEST_F(DWARFASTParserClangTests, TestSpecDeclExistsError) {
   auto BufferOrError = llvm::MemoryBuffer::getFile(
       GetInputFilePath("DW_AT_spec_decl_exists-test.yaml"), /*IsText=*/true);
   ASSERT_TRUE(BufferOrError);
-  YAMLModuleTester t(BufferOrError.get()->getBuffer());
-
-  DWARFUnit *unit = t.GetDwarfUnit();
-  ASSERT_NE(unit, nullptr);
-  const DWARFDebugInfoEntry *cu_entry = unit->DIE().GetDIE();
-  ASSERT_EQ(cu_entry->Tag(), DW_TAG_compile_unit);
-  DWARFDIE cu_die(unit, cu_entry);
-
-  auto holder = std::make_unique<clang_utils::TypeSystemClangHolder>("ast");
-  auto &ast_ctx = *holder->GetAST();
-  DWARFASTParserClangStub ast_parser(ast_ctx);
+  DWARFASTParserClangYAMLTester tester(BufferOrError.get()->getBuffer());
+  DWARFDIE cu_die = tester.GetCUDIE();
 
   llvm::SmallVector<lldb::TypeSP, 2> specializations;
   for (DWARFDIE die : cu_die.children()) {
     SymbolContext sc;
     bool new_type = false;
-    auto type = ast_parser.ParseTypeFromDWARF(sc, die, &new_type);
+    auto type = tester.GetParser().ParseTypeFromDWARF(sc, die, &new_type);
     llvm::StringRef die_name = llvm::StringRef(die.GetName());
     if (die_name.starts_with("_Optional_payload")) {
       specializations.push_back(std::move(type));
@@ -730,18 +743,8 @@ TEST_F(DWARFASTParserClangTests, TestUniqueDWARFASTTypeMap_CppInsertMapFind) {
         - AbbrCode:        0x00 # end of child tags of 0x0c
 ...
 )";
-  YAMLModuleTester t(yamldata);
-
-  DWARFUnit *unit = t.GetDwarfUnit();
-  ASSERT_NE(unit, nullptr);
-  const DWARFDebugInfoEntry *cu_entry = unit->DIE().GetDIE();
-  ASSERT_EQ(cu_entry->Tag(), DW_TAG_compile_unit);
-  ASSERT_EQ(unit->GetDWARFLanguageType(), DW_LANG_C_plus_plus);
-  DWARFDIE cu_die(unit, cu_entry);
-
-  auto holder = std::make_unique<clang_utils::TypeSystemClangHolder>("ast");
-  auto &ast_ctx = *holder->GetAST();
-  DWARFASTParserClangStub ast_parser(ast_ctx);
+  DWARFASTParserClangYAMLTester tester(yamldata);
+  DWARFDIE cu_die = tester.GetCUDIE();
 
   DWARFDIE decl_die;
   DWARFDIE def_die;
@@ -762,6 +765,8 @@ TEST_F(DWARFASTParserClangTests, TestUniqueDWARFASTTypeMap_CppInsertMapFind) {
   ParsedDWARFTypeAttributes attrs(def_die);
   ASSERT_TRUE(attrs.decl.IsValid());
 
+  DWARFASTParserClang &ast_parser = tester.GetParser();
+
   SymbolContext sc;
   bool new_type = false;
   lldb::TypeSP type_sp = ast_parser.ParseTypeFromDWARF(sc, decl_die, &new_type);
@@ -906,18 +911,8 @@ TEST_F(DWARFASTParserClangTests, TestObjectPointer) {
         - AbbrCode: 0x0
 ...
 )";
-  YAMLModuleTester t(yamldata);
-
-  DWARFUnit *unit = t.GetDwarfUnit();
-  ASSERT_NE(unit, nullptr);
-  const DWARFDebugInfoEntry *cu_entry = unit->DIE().GetDIE();
-  ASSERT_EQ(cu_entry->Tag(), DW_TAG_compile_unit);
-  ASSERT_EQ(unit->GetDWARFLanguageType(), DW_LANG_C_plus_plus);
-  DWARFDIE cu_die(unit, cu_entry);
-
-  auto holder = std::make_unique<clang_utils::TypeSystemClangHolder>("ast");
-  auto &ast_ctx = *holder->GetAST();
-  DWARFASTParserClangStub ast_parser(ast_ctx);
+  DWARFASTParserClangYAMLTester tester(yamldata);
+  DWARFDIE cu_die = tester.GetCUDIE();
 
   auto context_die = cu_die.GetFirstChild();
   ASSERT_TRUE(context_die.IsValid());
@@ -932,7 +927,8 @@ TEST_F(DWARFASTParserClangTests, TestObjectPointer) {
     auto param_die = decl_die.GetFirstChild();
     ASSERT_TRUE(param_die.IsValid());
 
-    EXPECT_EQ(param_die, ast_parser.GetObjectParameter(decl_die, context_die));
+    EXPECT_EQ(param_die,
+              tester.GetParser().GetObjectParameter(decl_die, context_die));
   }
 
   {
@@ -945,8 +941,8 @@ TEST_F(DWARFASTParserClangTests, TestObjectPointer) {
     auto param_die = subprogram_definition.GetFirstChild();
     ASSERT_TRUE(param_die.IsValid());
 
-    EXPECT_EQ(param_die, ast_parser.GetObjectParameter(subprogram_definition,
-                                                       context_die));
+    EXPECT_EQ(param_die, tester.GetParser().GetObjectParameter(
+                             subprogram_definition, context_die));
   }
 }
 
@@ -1076,18 +1072,8 @@ TEST_F(DWARFASTParserClangTests,
         - AbbrCode: 0x0
 ...
 )";
-  YAMLModuleTester t(yamldata);
-
-  DWARFUnit *unit = t.GetDwarfUnit();
-  ASSERT_NE(unit, nullptr);
-  const DWARFDebugInfoEntry *cu_entry = unit->DIE().GetDIE();
-  ASSERT_EQ(cu_entry->Tag(), DW_TAG_compile_unit);
-  ASSERT_EQ(unit->GetDWARFLanguageType(), DW_LANG_C_plus_plus);
-  DWARFDIE cu_die(unit, cu_entry);
-
-  auto holder = std::make_unique<clang_utils::TypeSystemClangHolder>("ast");
-  auto &ast_ctx = *holder->GetAST();
-  DWARFASTParserClangStub ast_parser(ast_ctx);
+  DWARFASTParserClangYAMLTester tester(yamldata);
+  DWARFDIE cu_die = tester.GetCUDIE();
 
   auto context_die = cu_die.GetFirstChild();
   ASSERT_TRUE(context_die.IsValid());
@@ -1105,7 +1091,7 @@ TEST_F(DWARFASTParserClangTests,
   auto param_die = subprogram_definition.GetFirstChild();
   ASSERT_TRUE(param_die.IsValid());
   EXPECT_EQ(param_die,
-            ast_parser.GetObjectParameter(subprogram_definition, {}));
+            tester.GetParser().GetObjectParameter(subprogram_definition, {}));
 }
 
 TEST_F(DWARFASTParserClangTests, TestParseSubroutine_ExplicitObjectParameter) {
@@ -1243,21 +1229,15 @@ TEST_F(DWARFASTParserClangTests, TestParseSubroutine_ExplicitObjectParameter) {
         - AbbrCode: 0x0
 ...
 )";
-  YAMLModuleTester t(yamldata);
-
-  DWARFUnit *unit = t.GetDwarfUnit();
-  ASSERT_NE(unit, nullptr);
-  const DWARFDebugInfoEntry *cu_entry = unit->DIE().GetDIE();
-  ASSERT_EQ(cu_entry->Tag(), DW_TAG_compile_unit);
-  ASSERT_EQ(unit->GetDWARFLanguageType(), DW_LANG_C_plus_plus);
-  DWARFDIE cu_die(unit, cu_entry);
+  DWARFASTParserClangYAMLTester tester(yamldata);
+  DWARFDIE cu_die = tester.GetCUDIE();
 
   auto ts_or_err =
       cu_die.GetDWARF()->GetTypeSystemForLanguage(eLanguageTypeC_plus_plus);
   ASSERT_TRUE(static_cast<bool>(ts_or_err));
   llvm::consumeError(ts_or_err.takeError());
   auto *parser =
-      static_cast<DWARFASTParserClang *>((*ts_or_err)->GetDWARFParser());
+      llvm::cast<DWARFASTParserClang>((*ts_or_err)->GetDWARFParser());
 
   auto context_die = cu_die.GetFirstChild();
   ASSERT_TRUE(context_die.IsValid());
@@ -1419,22 +1399,8 @@ TEST_F(DWARFASTParserClangTests, TestParseSubroutine_ParameterCreation) {
         - AbbrCode: 0x0
 ...
 )";
-  YAMLModuleTester t(yamldata);
-
-  DWARFUnit *unit = t.GetDwarfUnit();
-  ASSERT_NE(unit, nullptr);
-  const DWARFDebugInfoEntry *cu_entry = unit->DIE().GetDIE();
-  ASSERT_EQ(cu_entry->Tag(), DW_TAG_compile_unit);
-  ASSERT_EQ(unit->GetDWARFLanguageType(), DW_LANG_C_plus_plus);
-  DWARFDIE cu_die(unit, cu_entry);
-
-  auto ts_or_err =
-      cu_die.GetDWARF()->GetTypeSystemForLanguage(eLanguageTypeC_plus_plus);
-  ASSERT_TRUE(static_cast<bool>(ts_or_err));
-  llvm::consumeError(ts_or_err.takeError());
-
-  auto *ts = static_cast<TypeSystemClang *>(ts_or_err->get());
-  auto *parser = static_cast<DWARFASTParserClang *>(ts->GetDWARFParser());
+  DWARFASTParserClangYAMLTester tester(yamldata);
+  DWARFDIE cu_die = tester.GetCUDIE();
 
   auto subprogram = cu_die.GetFirstChild();
   ASSERT_TRUE(subprogram.IsValid());
@@ -1442,11 +1408,13 @@ TEST_F(DWARFASTParserClangTests, TestParseSubroutine_ParameterCreation) {
 
   SymbolContext sc;
   bool new_type;
-  auto type_sp = parser->ParseTypeFromDWARF(sc, subprogram, &new_type);
+  auto type_sp =
+      tester.GetParser().ParseTypeFromDWARF(sc, subprogram, &new_type);
   ASSERT_NE(type_sp, nullptr);
 
-  auto result = ts->GetTranslationUnitDecl()->lookup(
-      clang_utils::getDeclarationName(*ts, "func"));
+  TypeSystemClang &ts = tester.GetTypeSystem();
+  auto result = ts.GetTranslationUnitDecl()->lookup(
+      clang_utils::getDeclarationName(ts, "func"));
   ASSERT_TRUE(result.isSingleResult());
 
   auto const *func = llvm::cast<clang::FunctionDecl>(result.front());
@@ -1609,19 +1577,8 @@ TEST_F(DWARFASTParserClangTests, TestObjectPointer_IndexEncoding) {
         - AbbrCode: 0x0
 ...
 )";
-
-  YAMLModuleTester t(yamldata);
-
-  DWARFUnit *unit = t.GetDwarfUnit();
-  ASSERT_NE(unit, nullptr);
-  const DWARFDebugInfoEntry *cu_entry = unit->DIE().GetDIE();
-  ASSERT_EQ(cu_entry->Tag(), DW_TAG_compile_unit);
-  ASSERT_EQ(unit->GetDWARFLanguageType(), DW_LANG_C_plus_plus);
-  DWARFDIE cu_die(unit, cu_entry);
-
-  auto holder = std::make_unique<clang_utils::TypeSystemClangHolder>("ast");
-  auto &ast_ctx = *holder->GetAST();
-  DWARFASTParserClangStub ast_parser(ast_ctx);
+  DWARFASTParserClangYAMLTester tester(yamldata);
+  DWARFDIE cu_die = tester.GetCUDIE();
 
   auto context_die = cu_die.GetFirstChild();
   ASSERT_TRUE(context_die.IsValid());
@@ -1640,7 +1597,8 @@ TEST_F(DWARFASTParserClangTests, TestObjectPointer_IndexEncoding) {
     auto param_die = sub1.GetFirstChild().GetSibling();
     ASSERT_TRUE(param_die.IsValid());
 
-    EXPECT_EQ(param_die, ast_parser.GetObjectParameter(sub1, context_die));
+    EXPECT_EQ(param_die,
+              tester.GetParser().GetObjectParameter(sub1, context_die));
   }
 
   // Object parameter is at constant index 0
@@ -1648,7 +1606,8 @@ TEST_F(DWARFASTParserClangTests, TestObjectPointer_IndexEncoding) {
     auto param_die = sub2.GetFirstChild();
     ASSERT_TRUE(param_die.IsValid());
 
-    EXPECT_EQ(param_die, ast_parser.GetObjectParameter(sub2, context_die));
+    EXPECT_EQ(param_die,
+              tester.GetParser().GetObjectParameter(sub2, context_die));
   }
 }
 
@@ -1711,19 +1670,8 @@ TEST_F(DWARFASTParserClangTests, TestTypeBitSize) {
             - Value: 0x02
 ...
 )";
-
-  YAMLModuleTester t(yamldata);
-
-  DWARFUnit *unit = t.GetDwarfUnit();
-  ASSERT_NE(unit, nullptr);
-  const DWARFDebugInfoEntry *cu_entry = unit->DIE().GetDIE();
-  ASSERT_EQ(cu_entry->Tag(), DW_TAG_compile_unit);
-  ASSERT_EQ(unit->GetDWARFLanguageType(), DW_LANG_C_plus_plus);
-  DWARFDIE cu_die(unit, cu_entry);
-
-  auto holder = std::make_unique<clang_utils::TypeSystemClangHolder>("ast");
-  auto &ast_ctx = *holder->GetAST();
-  DWARFASTParserClangStub ast_parser(ast_ctx);
+  DWARFASTParserClangYAMLTester tester(yamldata);
+  DWARFDIE cu_die = tester.GetCUDIE();
 
   auto type_die = cu_die.GetFirstChild();
   ASSERT_TRUE(type_die.IsValid());
@@ -1734,8 +1682,8 @@ TEST_F(DWARFASTParserClangTests, TestTypeBitSize) {
   EXPECT_EQ(attrs.data_bit_size.value_or(0), 2U);
 
   SymbolContext sc;
-  auto type_sp =
-      ast_parser.ParseTypeFromDWARF(sc, type_die, /*type_is_new_ptr=*/nullptr);
+  auto type_sp = tester.GetParser().ParseTypeFromDWARF(
+      sc, type_die, /*type_is_new_ptr=*/nullptr);
   ASSERT_NE(type_sp, nullptr);
 
   EXPECT_EQ(llvm::expectedToOptional(type_sp->GetByteSize(nullptr)).value_or(0),
@@ -1857,27 +1805,17 @@ TEST_F(DWARFASTParserClangTests, TestBitIntParsing) {
 ...
 
 )";
-
-  YAMLModuleTester t(yamldata);
-
-  DWARFUnit *unit = t.GetDwarfUnit();
-  ASSERT_NE(unit, nullptr);
-  const DWARFDebugInfoEntry *cu_entry = unit->DIE().GetDIE();
-  ASSERT_EQ(cu_entry->Tag(), DW_TAG_compile_unit);
-  ASSERT_EQ(unit->GetDWARFLanguageType(), DW_LANG_C_plus_plus);
-  DWARFDIE cu_die(unit, cu_entry);
-
-  auto holder = std::make_unique<clang_utils::TypeSystemClangHolder>("ast");
-  auto &ast_ctx = *holder->GetAST();
-  DWARFASTParserClangStub ast_parser(ast_ctx);
+  DWARFASTParserClangYAMLTester tester(yamldata);
+  DWARFDIE cu_die = tester.GetCUDIE();
 
   auto type_die = cu_die.GetFirstChild();
   ASSERT_TRUE(type_die.IsValid());
 
   {
     SymbolContext sc;
-    auto type_sp = ast_parser.ParseTypeFromDWARF(sc, type_die,
-                                                 /*type_is_new_ptr=*/nullptr);
+    auto type_sp =
+        tester.GetParser().ParseTypeFromDWARF(sc, type_die,
+                                              /*type_is_new_ptr=*/nullptr);
     ASSERT_NE(type_sp, nullptr);
 
     EXPECT_EQ(
@@ -1891,8 +1829,9 @@ TEST_F(DWARFASTParserClangTests, TestBitIntParsing) {
   {
     type_die = type_die.GetSibling();
     SymbolContext sc;
-    auto type_sp = ast_parser.ParseTypeFromDWARF(sc, type_die,
-                                                 /*type_is_new_ptr=*/nullptr);
+    auto type_sp =
+        tester.GetParser().ParseTypeFromDWARF(sc, type_die,
+                                              /*type_is_new_ptr=*/nullptr);
     ASSERT_NE(type_sp, nullptr);
 
     EXPECT_EQ(
@@ -1906,8 +1845,9 @@ TEST_F(DWARFASTParserClangTests, TestBitIntParsing) {
   {
     type_die = type_die.GetSibling();
     SymbolContext sc;
-    auto type_sp = ast_parser.ParseTypeFromDWARF(sc, type_die,
-                                                 /*type_is_new_ptr=*/nullptr);
+    auto type_sp =
+        tester.GetParser().ParseTypeFromDWARF(sc, type_die,
+                                              /*type_is_new_ptr=*/nullptr);
     ASSERT_NE(type_sp, nullptr);
 
     EXPECT_EQ(
@@ -1922,8 +1862,9 @@ TEST_F(DWARFASTParserClangTests, TestBitIntParsing) {
   {
     type_die = type_die.GetSibling();
     SymbolContext sc;
-    auto type_sp = ast_parser.ParseTypeFromDWARF(sc, type_die,
-                                                 /*type_is_new_ptr=*/nullptr);
+    auto type_sp =
+        tester.GetParser().ParseTypeFromDWARF(sc, type_die,
+                                              /*type_is_new_ptr=*/nullptr);
     ASSERT_NE(type_sp, nullptr);
 
     EXPECT_EQ(
@@ -1938,8 +1879,9 @@ TEST_F(DWARFASTParserClangTests, TestBitIntParsing) {
   {
     type_die = type_die.GetSibling();
     SymbolContext sc;
-    auto type_sp = ast_parser.ParseTypeFromDWARF(sc, type_die,
-                                                 /*type_is_new_ptr=*/nullptr);
+    auto type_sp =
+        tester.GetParser().ParseTypeFromDWARF(sc, type_die,
+                                              /*type_is_new_ptr=*/nullptr);
     ASSERT_NE(type_sp, nullptr);
 
     EXPECT_EQ(
@@ -1953,3 +1895,226 @@ TEST_F(DWARFASTParserClangTests, TestBitIntParsing) {
     EXPECT_EQ(type_sp->GetForwardCompilerType().GetTypeName(), "_BitInt(64)");
   }
 }
+
+TEST_F(DWARFASTParserClangTests, TestTemplateAlias_NoSimpleTemplateNames) {
+  // Tests that we correctly parse the DW_TAG_template_alias generated by
+  // -gno-simple-template-names.
+
+  const char *yamldata = R"(
+--- !ELF
+FileHeader:
+  Class:   ELFCLASS64
+  Data:    ELFDATA2LSB
+  Type:    ET_EXEC
+  Machine: EM_AARCH64
+DWARF:
+  debug_abbrev:
+    - ID:              0
+      Table:
+        - Code:            0x1
+          Tag:             DW_TAG_compile_unit
+          Children:        DW_CHILDREN_yes
+          Attributes:
+            - Attribute:       DW_AT_language
+              Form:            DW_FORM_data2
+        - Code:            0x2
+          Tag:             DW_TAG_base_type
+          Children:        DW_CHILDREN_no
+          Attributes:
+            - Attribute: DW_AT_name
+              Form:      DW_FORM_string
+        - Code:            0x3
+          Tag:             DW_TAG_template_alias
+          Children:        DW_CHILDREN_yes
+          Attributes:
+            - Attribute: DW_AT_name
+              Form:      DW_FORM_string
+            - Attribute: DW_AT_type
+              Form:      DW_FORM_ref4
+        - Code:            0x4
+          Tag:             DW_TAG_template_type_parameter
+          Children:        DW_CHILDREN_no
+          Attributes:
+            - Attribute: DW_AT_name
+              Form:      DW_FORM_string
+            - Attribute: DW_AT_type
+              Form:      DW_FORM_ref4
+
+  debug_info:
+     - Version:  5
+       UnitType: DW_UT_compile
+       AddrSize: 8
+       Entries:
+
+# DW_TAG_compile_unit
+#   DW_AT_language (DW_LANG_C_plus_plus)
+
+        - AbbrCode: 0x1
+          Values:
+            - Value: 0x04
+
+#   DW_TAG_base_type
+#     DW_AT_name ('int')
+
+        - AbbrCode: 0x2
+          Values:
+            - CStr: int
+
+#   DW_TAG_template_alias
+#     DW_AT_name ('Foo<int>')
+#     DW_AT_type ('int')
+#     DW_TAG_template_type_parameter
+#       DW_AT_name ('T')
+#       DW_AT_type ('int')
+
+        - AbbrCode: 0x3
+          Values:
+            - CStr: Foo<int>
+            - Value: 0xf
+
+        - AbbrCode: 0x4
+          Values:
+            - CStr: T
+            - Value: 0xf
+
+        - AbbrCode: 0x0
+        - AbbrCode: 0x0
+...
+)";
+  DWARFASTParserClangYAMLTester tester(yamldata);
+  DWARFDIE cu_die = tester.GetCUDIE();
+
+  auto alias_die = cu_die.GetFirstChild().GetSibling();
+  ASSERT_EQ(alias_die.Tag(), DW_TAG_template_alias);
+
+  SymbolContext sc;
+  auto type_sp =
+      tester.GetParser().ParseTypeFromDWARF(sc, alias_die,
+                                            /*type_is_new_ptr=*/nullptr);
+  ASSERT_NE(type_sp, nullptr);
+
+  EXPECT_TRUE(type_sp->IsTypedef());
+  EXPECT_EQ(type_sp->GetName(), "Foo<int>");
+  EXPECT_EQ(type_sp->GetForwardCompilerType().GetTypeName(), "Foo<int>");
+}
+
+TEST_F(DWARFASTParserClangTests,
+       TestTemplateAlias_InStruct_NoSimpleTemplateNames) {
+  // Tests that we correctly parse the DW_TAG_template_alias scoped inside a
+  // DW_TAG_structure_type *declaration* generated by
+  // -gno-simple-template-names. This tests the codepath that forcefully
+  // completes the context of the alias via PrepareContextToReceiveMembers.
+
+  const char *yamldata = R"(
+--- !ELF
+FileHeader:
+  Class:   ELFCLASS64
+  Data:    ELFDATA2LSB
+  Type:    ET_EXEC
+  Machine: EM_AARCH64
+DWARF:
+  debug_abbrev:
+    - ID:              0
+      Table:
+        - Code:            0x1
+          Tag:             DW_TAG_compile_unit
+          Children:        DW_CHILDREN_yes
+          Attributes:
+            - Attribute:       DW_AT_language
+              Form:            DW_FORM_data2
+        - Code:            0x2
+          Tag:             DW_TAG_base_type
+          Children:        DW_CHILDREN_no
+          Attributes:
+            - Attribute: DW_AT_name
+              Form:      DW_FORM_string
+        - Code:            0x3
+          Tag:             DW_TAG_structure_type
+          Children:        DW_CHILDREN_yes
+          Attributes:
+            - Attribute: DW_AT_name
+              Form:      DW_FORM_string
+            - Attribute: DW_AT_declaration
+              Form:      DW_FORM_flag_present
+        - Code:            0x4
+          Tag:             DW_TAG_template_alias
+          Children:        DW_CHILDREN_yes
+          Attributes:
+            - Attribute: DW_AT_name
+              Form:      DW_FORM_string
+            - Attribute: DW_AT_type
+              Form:      DW_FORM_ref4
+        - Code:            0x5
+          Tag:             DW_TAG_template_type_parameter
+          Children:        DW_CHILDREN_no
+          Attributes:
+            - Attribute: DW_AT_name
+              Form:      DW_FORM_string
+            - Attribute: DW_AT_type
+              Form:      DW_FORM_ref4
+
+  debug_info:
+     - Version:  5
+       UnitType: DW_UT_compile
+       AddrSize: 8
+       Entries:
+
+# DW_TAG_compile_unit
+#   DW_AT_language (DW_LANG_C_plus_plus)
+
+        - AbbrCode: 0x1
+          Values:
+            - Value: 0x04
+
+#   DW_TAG_base_type
+#     DW_AT_name ('int')
+
+        - AbbrCode: 0x2
+          Values:
+            - CStr: int
+
+#   DW_TAG_structure_type
+#     DW_AT_name ('Foo')
+
+        - AbbrCode: 0x3
+          Values:
+            - CStr: Foo
+
+#     DW_TAG_template_alias
+#       DW_AT_name ('Bar<int>')
+#       DW_AT_type ('int')
+#       DW_TAG_template_type_parameter
+#         DW_AT_name ('T')
+#         DW_AT_type ('int')
+
+        - AbbrCode: 0x4
+          Values:
+            - CStr: Bar<int>
+            - Value: 0xf
+
+        - AbbrCode: 0x5
+          Values:
+            - CStr: T
+            - Value: 0xf
+
+        - AbbrCode: 0x0
+        - AbbrCode: 0x0
+        - AbbrCode: 0x0
+...
+)";
+  DWARFASTParserClangYAMLTester tester(yamldata);
+  DWARFDIE cu_die = tester.GetCUDIE();
+
+  auto alias_die = cu_die.GetFirstChild().GetSibling().GetFirstChild();
+  ASSERT_EQ(alias_die.Tag(), DW_TAG_template_alias);
+
+  SymbolContext sc;
+  auto type_sp =
+      tester.GetParser().ParseTypeFromDWARF(sc, alias_die,
+                                            /*type_is_new_ptr=*/nullptr);
+  ASSERT_NE(type_sp, nullptr);
+
+  EXPECT_TRUE(type_sp->IsTypedef());
+  EXPECT_EQ(type_sp->GetName(), "Bar<int>");
+  EXPECT_EQ(type_sp->GetForwardCompilerType().GetTypeName(), "Foo::Bar<int>");
+}
diff --git a/lldb/unittests/UnwindAssembly/ARM64/TestArm64InstEmulation.cpp b/lldb/unittests/UnwindAssembly/ARM64/TestArm64InstEmulation.cpp
index 6c74860971674..b72abc188b4c2 100644
--- a/lldb/unittests/UnwindAssembly/ARM64/TestArm64InstEmulation.cpp
+++ b/lldb/unittests/UnwindAssembly/ARM64/TestArm64InstEmulation.cpp
@@ -987,7 +987,7 @@ TEST_F(TestArm64InstEmulation, TestMidFunctionEpilogueAndBackwardsJump) {
       0xfd, 0x7b, 0x42, 0xa9, // <+20>: ldp    x29, x30, [sp, #0x20]
       0xff, 0xc3, 0x00, 0x91, // <+24>: add    sp, sp, #0x30
       0xc0, 0x03, 0x5f, 0xd6, // <+28>: ret
-      // AFTER_EPILOGUE:  LLDB computes the next 5 unwind states incorrectly.
+      // AFTER_EPILOGUE
       0x37, 0x00, 0x80, 0xd2, // <+32>: mov    x23, #0x1
       0xf6, 0x5f, 0x41, 0xa9, // <+36>: ldp    x22, x23, [sp, #0x10]
       0xfd, 0x7b, 0x42, 0xa9, // <+40>: ldp    x29, x30, [sp, #0x20]
@@ -1054,12 +1054,19 @@ TEST_F(TestArm64InstEmulation, TestMidFunctionEpilogueAndBackwardsJump) {
   EXPECT_TRUE(row->GetCFAValue().GetRegisterNumber() == gpr_sp_arm64);
   EXPECT_EQ(row->GetCFAValue().GetOffset(), 0);
 
-  // FIXME: Row for offset +32 incorrectly inherits the state of the `ret`
-  // instruction, but +32 _never_ executes after the `ret`.
+  // The row for offset +32 should not inherit the state of the `ret`
+  // instruction at +28. Instead, it should inherit the state of the branch
+  // at +64. Check register x22, which is available in the row for +64.
   // <+28>: ret
   // <+32>: mov    x23, #0x1
   row = unwind_plan.GetRowForFunctionOffset(32);
-  // FIXME: EXPECT_NE(28, row->GetOffset());
+  EXPECT_EQ(32, row->GetOffset());
+  {
+    UnwindPlan::Row::AbstractRegisterLocation loc;
+    EXPECT_TRUE(row->GetRegisterInfo(gpr_x22_arm64, loc));
+    EXPECT_TRUE(loc.IsAtCFAPlusOffset());
+    EXPECT_EQ(loc.GetOffset(), -32);
+  }
 
   // Check that the state of this branch
   // <+16>: b.ne   ; <+52> DO_SOMETHING_AND_GOTO_AFTER_EPILOGUE
@@ -1070,4 +1077,12 @@ TEST_F(TestArm64InstEmulation, TestMidFunctionEpilogueAndBackwardsJump) {
   EXPECT_TRUE(row->GetCFAValue().IsRegisterPlusOffset());
   EXPECT_EQ(row->GetCFAValue().GetRegisterNumber(), gpr_fp_arm64);
   EXPECT_EQ(row->GetCFAValue().GetOffset(), 16);
+
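+  // Also check the row at +64 itself, which is the state that +32 is expected
+  // to inherit (see the comment above).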
+  row = unwind_plan.GetRowForFunctionOffset(64);
+  {
+    UnwindPlan::Row::AbstractRegisterLocation loc;
+    EXPECT_TRUE(row->GetRegisterInfo(gpr_x22_arm64, loc));
+    EXPECT_TRUE(loc.IsAtCFAPlusOffset());
+    EXPECT_EQ(loc.GetOffset(), -32);
+  }
 }
diff --git a/lldb/unittests/Utility/CMakeLists.txt b/lldb/unittests/Utility/CMakeLists.txt
index 2bdd50291d2ae..4cbe15bb5b073 100644
--- a/lldb/unittests/Utility/CMakeLists.txt
+++ b/lldb/unittests/Utility/CMakeLists.txt
@@ -47,6 +47,7 @@ add_lldb_unittest(UtilityTests
   UserIDResolverTest.cpp
   UUIDTest.cpp
   VASprintfTest.cpp
+  VirtualDataExtractorTest.cpp
   VMRangeTest.cpp
   XcodeSDKTest.cpp
 
diff --git a/lldb/unittests/Utility/VirtualDataExtractorTest.cpp b/lldb/unittests/Utility/VirtualDataExtractorTest.cpp
new file mode 100644
index 0000000000000..09f3edbecc7ad
--- /dev/null
+++ b/lldb/unittests/Utility/VirtualDataExtractorTest.cpp
@@ -0,0 +1,583 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "lldb/Utility/VirtualDataExtractor.h"
+#include "lldb/Utility/DataBufferHeap.h"
+#include "gtest/gtest.h"
+
+using namespace lldb_private;
+using namespace lldb;
+
+using Table = VirtualDataExtractor::LookupTable;
+using Entry = Table::Entry;
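+
+// Note: judging from the tests below, each Entry appears to describe a mapping
+// of (virtual base address, size in bytes, physical offset into the buffer).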
+
+TEST(VirtualDataExtractorTest, BasicConstruction) {
+  uint8_t buffer[] = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 8, 0)});
+
+  EXPECT_EQ(extractor->GetByteSize(), 8U);
+}
+
+TEST(VirtualDataExtractorTest, GetDataAtVirtualOffset) {
+  uint8_t buffer[] = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 8, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  const void *data = extractor->GetData(&virtual_offset, 4);
+
+  ASSERT_NE(data, nullptr);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+  EXPECT_EQ(memcmp(data, buffer, 4), 0);
+}
+
+TEST(VirtualDataExtractorTest, GetDataAtVirtualOffsetInvalid) {
+  uint8_t buffer[] = {0x01, 0x02, 0x03, 0x04};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 4, 0)});
+
+  // Try to read from an invalid virtual address.
+  offset_t virtual_offset = 0x2000;
+  const void *data = extractor->GetData(&virtual_offset, 4);
+
+  EXPECT_EQ(data, nullptr);
+}
+
+TEST(VirtualDataExtractorTest, GetU8AtVirtualOffset) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 4, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU8(&virtual_offset), 0x12U);
+  EXPECT_EQ(virtual_offset, 0x1001U);
+
+  EXPECT_EQ(extractor->GetU8(&virtual_offset), 0x34U);
+  EXPECT_EQ(virtual_offset, 0x1002U);
+}
+
+TEST(VirtualDataExtractorTest, GetU16AtVirtualOffset) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 4, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU16(&virtual_offset), 0x3412U);
+  EXPECT_EQ(virtual_offset, 0x1002U);
+
+  EXPECT_EQ(extractor->GetU16(&virtual_offset), 0x7856U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+}
+
+TEST(VirtualDataExtractorTest, GetU32AtVirtualOffset) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 8, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU32(&virtual_offset), 0x78563412U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+
+  EXPECT_EQ(extractor->GetU32(&virtual_offset), 0xF0DEBC9AU);
+  EXPECT_EQ(virtual_offset, 0x1008U);
+}
+
+TEST(VirtualDataExtractorTest, GetU64AtVirtualOffset) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 8, Table{Entry(0x1000, 8, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU64(&virtual_offset), 0xF0DEBC9A78563412ULL);
+  EXPECT_EQ(virtual_offset, 0x1008U);
+}
+
+TEST(VirtualDataExtractorTest, GetAddressAtVirtualOffset) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 4, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetAddress(&virtual_offset), 0x78563412U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+}
+
+TEST(VirtualDataExtractorTest, BigEndian) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderBig, 4, Table{Entry(0x1000, 4, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU16(&virtual_offset), 0x1234U);
+  EXPECT_EQ(virtual_offset, 0x1002U);
+
+  EXPECT_EQ(extractor->GetU16(&virtual_offset), 0x5678U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+}
+
+TEST(VirtualDataExtractorTest, MultipleEntries) {
+  // Create a buffer with distinct patterns for each section.
+  uint8_t buffer[] = {
+      0x01, 0x02, 0x03, 0x04, // Physical offset 0-3.
+      0x11, 0x12, 0x13, 0x14, // Physical offset 4-7.
+      0x21, 0x22, 0x23, 0x24  // Physical offset 8-11.
+  };
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4,
+      Table{Entry(0x1000, 4, 0),   // Virt 0x1000-0x1004
+            Entry(0x2000, 4, 4),   // Virt 0x2000-0x2004
+            Entry(0x3000, 4, 8)}); // Virt 0x3000-0x3004
+
+  // Test reading from first virtual range.
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU8(&virtual_offset), 0x01U);
+
+  // Test reading from second virtual range.
+  virtual_offset = 0x2000;
+  EXPECT_EQ(extractor->GetU8(&virtual_offset), 0x11U);
+
+  // Test reading from third virtual range.
+  virtual_offset = 0x3000;
+  EXPECT_EQ(extractor->GetU8(&virtual_offset), 0x21U);
+}
+
+TEST(VirtualDataExtractorTest, NonContiguousVirtualAddresses) {
+  uint8_t buffer[] = {0xAA, 0xBB, 0xCC, 0xDD};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4,
+      Table{Entry(0x1000, 2, 0),   // Virt 0x1000-0x1002
+            Entry(0x5000, 2, 2)}); // Virt 0x5000-0x5002
+
+  // Test reading from first virtual range.
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU16(&virtual_offset), 0xBBAAU);
+
+  // Test reading from second virtual range (non-contiguous).
+  virtual_offset = 0x5000;
+  EXPECT_EQ(extractor->GetU16(&virtual_offset), 0xDDCCU);
+
+  // Test that gap between ranges is invalid.
+  virtual_offset = 0x3000;
+  EXPECT_EQ(extractor->GetU8(&virtual_offset), 0U);
+}
+
+TEST(VirtualDataExtractorTest, SharedDataBuffer) {
+  // Test with shared_ptr to DataBuffer.
+  uint8_t buffer[] = {0x01, 0x02, 0x03, 0x04};
+  auto data_sp = std::make_shared<DataBufferHeap>(buffer, sizeof(buffer));
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      data_sp, eByteOrderLittle, 4, Table{Entry(0x1000, 4, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU32(&virtual_offset), 0x04030201U);
+}
+
+TEST(VirtualDataExtractorTest, NullPointerHandling) {
+  uint8_t buffer[] = {0x01, 0x02, 0x03, 0x04};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 4, 0)});
+
+  // Test that passing nullptr returns default values.
+  EXPECT_EQ(extractor->GetU8(nullptr), 0U);
+  EXPECT_EQ(extractor->GetU16(nullptr), 0U);
+  EXPECT_EQ(extractor->GetU32(nullptr), 0U);
+  EXPECT_EQ(extractor->GetU64(nullptr), 0U);
+  EXPECT_EQ(extractor->GetAddress(nullptr), 0U);
+  EXPECT_EQ(extractor->GetData(nullptr, 4), nullptr);
+}
+
+TEST(VirtualDataExtractorTest, OffsetMapping) {
+  // Test that virtual to physical offset mapping works correctly.
+  uint8_t buffer[] = {0x00, 0x00, 0x00, 0x00, 0xAA, 0xBB, 0xCC, 0xDD};
+
+  // Map virtual address 0x1000 to physical offset 4 (skipping first 4 bytes).
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 4, 4)});
+
+  offset_t virtual_offset = 0x1000;
+  // Should read from physical offset 4, not 0.
+  EXPECT_EQ(extractor->GetU32(&virtual_offset), 0xDDCCBBAAU);
+}
+
+TEST(VirtualDataExtractorTest, GetU8Unchecked) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 4, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU8_unchecked(&virtual_offset), 0x12U);
+  EXPECT_EQ(virtual_offset, 0x1001U);
+
+  EXPECT_EQ(extractor->GetU8_unchecked(&virtual_offset), 0x34U);
+  EXPECT_EQ(virtual_offset, 0x1002U);
+}
+
+TEST(VirtualDataExtractorTest, GetU16Unchecked) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 4, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU16_unchecked(&virtual_offset), 0x3412U);
+  EXPECT_EQ(virtual_offset, 0x1002U);
+
+  EXPECT_EQ(extractor->GetU16_unchecked(&virtual_offset), 0x7856U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+}
+
+TEST(VirtualDataExtractorTest, GetU32Unchecked) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 8, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU32_unchecked(&virtual_offset), 0x78563412U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+
+  EXPECT_EQ(extractor->GetU32_unchecked(&virtual_offset), 0xF0DEBC9AU);
+  EXPECT_EQ(virtual_offset, 0x1008U);
+}
+
+TEST(VirtualDataExtractorTest, GetU64Unchecked) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 8, Table{Entry(0x1000, 8, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU64_unchecked(&virtual_offset),
+            0xF0DEBC9A78563412ULL);
+  EXPECT_EQ(virtual_offset, 0x1008U);
+}
+
+TEST(VirtualDataExtractorTest, GetMaxU64Unchecked) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 8, 0)});
+
+  // Test various byte sizes.
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetMaxU64_unchecked(&virtual_offset, 1), 0x12U);
+  EXPECT_EQ(virtual_offset, 0x1001U);
+
+  virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetMaxU64_unchecked(&virtual_offset, 2), 0x3412U);
+  EXPECT_EQ(virtual_offset, 0x1002U);
+
+  virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetMaxU64_unchecked(&virtual_offset, 4), 0x78563412U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+
+  virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetMaxU64_unchecked(&virtual_offset, 8),
+            0xF0DEBC9A78563412ULL);
+  EXPECT_EQ(virtual_offset, 0x1008U);
+}
+
+TEST(VirtualDataExtractorTest, GetAddressUnchecked) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 4, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetAddress_unchecked(&virtual_offset), 0x78563412U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+}
+
+TEST(VirtualDataExtractorTest, UncheckedWithBigEndian) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderBig, 4, Table{Entry(0x1000, 4, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU16_unchecked(&virtual_offset), 0x1234U);
+  EXPECT_EQ(virtual_offset, 0x1002U);
+
+  EXPECT_EQ(extractor->GetU16_unchecked(&virtual_offset), 0x5678U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+}
+
+TEST(VirtualDataExtractorTest, GetCStr) {
+  // Create buffer with null-terminated strings.
+  uint8_t buffer[] = {'H', 'e', 'l', 'l',  'o', '\0', 'W', 'o',
+                      'r', 'l', 'd', '\0', 'F', 'o',  'o', '\0'};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4,
+      Table{Entry(0x1000, 6, 0), Entry(0x2000, 12, 6)});
+
+  // Test reading first string.
+  offset_t virtual_offset = 0x1000;
+  const char *str1 = extractor->GetCStr(&virtual_offset);
+  ASSERT_NE(str1, nullptr);
+  EXPECT_STREQ(str1, "Hello");
+  EXPECT_EQ(virtual_offset, 0x1006U); // After "Hello\0"
+
+  // Test reading second string.
+  virtual_offset = 0x2000;
+  const char *str2 = extractor->GetCStr(&virtual_offset);
+  ASSERT_NE(str2, nullptr);
+  EXPECT_STREQ(str2, "World");
+  EXPECT_EQ(virtual_offset, 0x2006U); // After "World\0"
+}
+
+TEST(VirtualDataExtractorTest, GetFloat) {
+  // Create buffer with float value (IEEE 754 single precision).
+  // 3.14159f in little endian: 0xDB 0x0F 0x49 0x40
+  uint8_t buffer[] = {0xDB, 0x0F, 0x49, 0x40};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 4, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  float value = extractor->GetFloat(&virtual_offset);
+  EXPECT_NEAR(value, 3.14159f, 0.00001f);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+}
+
+TEST(VirtualDataExtractorTest, GetDouble) {
+  // Create buffer with double value (IEEE 754 double precision).
+  // 3.14159265358979 in little endian
+  uint8_t buffer[] = {0x18, 0x2D, 0x44, 0x54, 0xFB, 0x21, 0x09, 0x40};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 8, Table{Entry(0x1000, 8, 0)});
+
+  offset_t virtual_offset = 0x1000;
+  double value = extractor->GetDouble(&virtual_offset);
+  EXPECT_NEAR(value, 3.14159265358979, 0.00000000000001);
+  EXPECT_EQ(virtual_offset, 0x1008U);
+}
+
+TEST(VirtualDataExtractorTest, GetULEB128) {
+  // ULEB128 encoding: 0x624 (1572 decimal) = 0xA4 0x0C
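+  // (0xA4 has the continuation bit set and contributes 0x24 = 36; 0x0C
+  // contributes 12 << 7 = 1536; 36 + 1536 = 1572.)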
+  uint8_t buffer[] = {0xA4, 0x0C, 0xFF, 0x00, 0x7F, 0x80, 0x01};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 7, 0)});
+
+  // Test reading first ULEB128 value (1572).
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetULEB128(&virtual_offset), 1572U);
+  EXPECT_EQ(virtual_offset, 0x1002U);
+
+  // Test reading second ULEB128 value (127).
+  virtual_offset = 0x1004;
+  EXPECT_EQ(extractor->GetULEB128(&virtual_offset), 127U);
+  EXPECT_EQ(virtual_offset, 0x1005U);
+
+  // Test reading third ULEB128 value (128).
+  EXPECT_EQ(extractor->GetULEB128(&virtual_offset), 128U);
+  EXPECT_EQ(virtual_offset, 0x1007U);
+}
+
+TEST(VirtualDataExtractorTest, GetSLEB128) {
+  // SLEB128 encoding: -123 = 0x85 0x7F, 123 = 0xFB 0x00
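+  // (For -123: 0x85 contributes 5, 0x7F contributes 0x7F << 7 = 16256;
+  // sign-extending the 14-bit result 16261 yields -123. 123 is encoded as
+  // 0xFB 0x00 because bit 6 of 0x7B would otherwise be read as a sign bit.)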
+  uint8_t buffer[] = {0x85, 0x7F, 0xFB, 0x00};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 4, 0)});
+
+  // Test reading negative SLEB128 value (-123).
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetSLEB128(&virtual_offset), -123);
+  EXPECT_EQ(virtual_offset, 0x1002U);
+
+  // Test reading positive SLEB128 value (123).
+  EXPECT_EQ(extractor->GetSLEB128(&virtual_offset), 123);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+}
+
+TEST(VirtualDataExtractorTest, GetU8Array) {
+  uint8_t buffer[] = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 8, 0)});
+
+  // Test reading array of 4 bytes.
+  offset_t virtual_offset = 0x1000;
+  uint8_t dst[4] = {0};
+  void *result = extractor->GetU8(&virtual_offset, dst, 4);
+  ASSERT_NE(result, nullptr);
+  EXPECT_EQ(dst[0], 0x01U);
+  EXPECT_EQ(dst[1], 0x02U);
+  EXPECT_EQ(dst[2], 0x03U);
+  EXPECT_EQ(dst[3], 0x04U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+}
+
+TEST(VirtualDataExtractorTest, GetU16Array) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 8, 0)});
+
+  // Test reading array of 3 uint16_t values.
+  offset_t virtual_offset = 0x1000;
+  uint16_t dst[3] = {0};
+  void *result = extractor->GetU16(&virtual_offset, dst, 3);
+  ASSERT_NE(result, nullptr);
+  EXPECT_EQ(dst[0], 0x3412U);
+  EXPECT_EQ(dst[1], 0x7856U);
+  EXPECT_EQ(dst[2], 0xBC9AU);
+  EXPECT_EQ(virtual_offset, 0x1006U);
+}
+
+TEST(VirtualDataExtractorTest, GetU32Array) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 8, 0)});
+
+  // Test reading array of 2 uint32_t values.
+  offset_t virtual_offset = 0x1000;
+  uint32_t dst[2] = {0};
+  void *result = extractor->GetU32(&virtual_offset, dst, 2);
+  ASSERT_NE(result, nullptr);
+  EXPECT_EQ(dst[0], 0x78563412U);
+  EXPECT_EQ(dst[1], 0xF0DEBC9AU);
+  EXPECT_EQ(virtual_offset, 0x1008U);
+}
+
+TEST(VirtualDataExtractorTest, GetU64Array) {
+  uint8_t buffer[] = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08,
+                      0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 8, Table{Entry(0x1000, 16, 0)});
+
+  // Test reading array of 2 uint64_t values.
+  offset_t virtual_offset = 0x1000;
+  uint64_t dst[2] = {0};
+  void *result = extractor->GetU64(&virtual_offset, dst, 2);
+  ASSERT_NE(result, nullptr);
+  EXPECT_EQ(dst[0], 0x0807060504030201ULL);
+  EXPECT_EQ(dst[1], 0x1817161514131211ULL);
+  EXPECT_EQ(virtual_offset, 0x1010U);
+}
+
+TEST(VirtualDataExtractorTest, GetMaxU64WithVariableSizes) {
+  uint8_t buffer[] = {0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 8, 0)});
+
+  // Test reading 3-byte value.
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetMaxU64(&virtual_offset, 3), 0x563412U);
+  EXPECT_EQ(virtual_offset, 0x1003U);
+
+  // Test reading 5-byte value.
+  virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetMaxU64(&virtual_offset, 5), 0x9A78563412ULL);
+  EXPECT_EQ(virtual_offset, 0x1005U);
+}
+
+TEST(VirtualDataExtractorTest, GetMaxS64) {
+  // Test with negative number (sign extension).
+  uint8_t buffer[] = {0xFF, 0xFF, 0xFF, 0xFF};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 4, 0)});
+
+  // Test reading 1-byte signed value (-1).
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetMaxS64(&virtual_offset, 1), -1);
+  EXPECT_EQ(virtual_offset, 0x1001U);
+
+  // Test reading 2-byte signed value (-1).
+  virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetMaxS64(&virtual_offset, 2), -1);
+  EXPECT_EQ(virtual_offset, 0x1002U);
+}
+
+TEST(VirtualDataExtractorTest, CannotReadAcrossEntryBoundaries) {
+  // Create buffer with two separate regions.
+  uint8_t buffer[] = {0x01, 0x02, 0x03, 0x04, 0x11, 0x12, 0x13, 0x14};
+
+  // First entry: virtual 0x1000-0x1004 maps to physical 0-4.
+  // Second entry: virtual 0x2000-0x2004 maps to physical 4-8.
+  // Note: there's a gap in virtual addresses (0x1004-0x2000).
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4,
+      Table{Entry(0x1000, 4, 0), Entry(0x2000, 4, 4)});
+
+  // Verify we can read within the first entry.
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU32(&virtual_offset), 0x04030201U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+
+  // Verify we can read within the second entry.
+  virtual_offset = 0x2000;
+  EXPECT_EQ(extractor->GetU32(&virtual_offset), 0x14131211U);
+  EXPECT_EQ(virtual_offset, 0x2004U);
+
+  // Verify we CANNOT read in the gap between entries.
+  // This address is not in any lookup table entry.
+  virtual_offset = 0x1500;
+  EXPECT_EQ(extractor->GetU8(&virtual_offset), 0U);
+  EXPECT_EQ(virtual_offset, 0x1500U);
+
+  // Verify we CANNOT read data pointer from the gap.
+  virtual_offset = 0x1800;
+  const void *data = extractor->GetData(&virtual_offset, 1);
+  EXPECT_EQ(data, nullptr);
+  EXPECT_EQ(virtual_offset, 0x1800U); // Offset unchanged.
+
+  // Verify we can read individual bytes within each entry.
+  virtual_offset = 0x1003;
+  EXPECT_EQ(extractor->GetU8(&virtual_offset), 0x04U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+
+  // Verify we CANNOT read past the end of an entry.
+  virtual_offset = 0x1004;
+  EXPECT_EQ(extractor->GetU8(&virtual_offset), 0U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+}
+
+TEST(VirtualDataExtractorTest, ReadExactlyAtEntryEnd) {
+  uint8_t buffer[] = {0x01, 0x02, 0x03, 0x04};
+
+  lldb::DataExtractorSP extractor = std::make_shared<VirtualDataExtractor>(
+      buffer, sizeof(buffer), eByteOrderLittle, 4, Table{Entry(0x1000, 4, 0)});
+
+  // Reading exactly to the end of an entry should work.
+  offset_t virtual_offset = 0x1000;
+  EXPECT_EQ(extractor->GetU32(&virtual_offset), 0x04030201U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+
+  // But reading one byte past the end should fail.
+  virtual_offset = 0x1004;
+  EXPECT_EQ(extractor->GetU8(&virtual_offset), 0U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+
+  // Reading from just before the end should work for smaller sizes.
+  virtual_offset = 0x1003;
+  EXPECT_EQ(extractor->GetU8(&virtual_offset), 0x04U);
+  EXPECT_EQ(virtual_offset, 0x1004U);
+}
diff --git a/llvm/cmake/modules/CrossCompile.cmake b/llvm/cmake/modules/CrossCompile.cmake
index 2a69c5133c56f..6075e6abdde74 100644
--- a/llvm/cmake/modules/CrossCompile.cmake
+++ b/llvm/cmake/modules/CrossCompile.cmake
@@ -69,6 +69,8 @@ function(llvm_create_cross_target project_name target_name toolchain buildtype)
          "${LLVM_EXTERNAL_PROJECTS}")
   string(REPLACE ";" "$<SEMICOLON>" llvm_enable_runtimes_arg
          "${LLVM_ENABLE_RUNTIMES}")
+  string(REPLACE ";" "$<SEMICOLON>" llvm_tablegen_flags
+         "${LLVM_TABLEGEN_FLAGS}")
 
   set(external_project_source_dirs)
   foreach(project ${LLVM_EXTERNAL_PROJECTS})
@@ -100,7 +102,7 @@ function(llvm_create_cross_target project_name target_name toolchain buildtype)
         -DLLVM_TEMPORARILY_ALLOW_OLD_TOOLCHAIN="${LLVM_TEMPORARILY_ALLOW_OLD_TOOLCHAIN}"
         -DLLVM_INCLUDE_BENCHMARKS=OFF
         -DLLVM_INCLUDE_TESTS=OFF
-        -DLLVM_TABLEGEN_FLAGS="${LLVM_TABLEGEN_FLAGS}"
+        -DLLVM_TABLEGEN_FLAGS="${llvm_tablegen_flags}"
         -DPYTHON_EXECUTABLE="${PYTHON_EXECUTABLE}"
         ${build_type_flags} ${linker_flag} ${external_clang_dir} ${libc_flags}
         ${ARGN}
diff --git a/llvm/docs/InstCombineContributorGuide.md b/llvm/docs/InstCombineContributorGuide.md
index 12567fc36f1d1..1c432b9b7446c 100644
--- a/llvm/docs/InstCombineContributorGuide.md
+++ b/llvm/docs/InstCombineContributorGuide.md
@@ -338,7 +338,7 @@ complexity and increasing compile-time overhead.
 
 We do not require explicit proof of real-world usefulness for every transform
 -- in most cases the usefulness is fairly "obvious". However, the question may
-come up for complex or unusual folds. Keep this in mind when chosing what you
+come up for complex or unusual folds. Keep this in mind when choosing what you
 work on.
 
 In particular, fixes for fuzzer-generated missed optimization reports will
diff --git a/llvm/docs/KeyInstructionsDebugInfo.md b/llvm/docs/KeyInstructionsDebugInfo.md
index 305740554c0fe..d93151a236680 100644
--- a/llvm/docs/KeyInstructionsDebugInfo.md
+++ b/llvm/docs/KeyInstructionsDebugInfo.md
@@ -82,7 +82,7 @@ int c =
 ```
 In the current implementation an `is_stmt` won't be generated for the `a + b` instruction, meaning debuggers will likely step over the `add` and stop at the `store` of the result into `c` (which does get `is_stmt`). A user might have wished to edit `a` or `b` on the previous line in order to alter the result stored to `c`, which they now won't have the chance to do (they'd need to edit the variables on a previous line instead). If the expression was all on one line then they would be able to edit the values before the `add`. For these reasons we're choosing to recommend that the feature should not be enabled at O0.
 
-It should be possible to fix this case if we make a few changes: add all the instructions in the statement (i.e., including the loads) to the atom, and tweak the DwarfEmission code to understand this situation (same atom, different line). So there is room to persue this in the future. Though that gets tricky in some cases due to the [other limitation mentioned above](#lack-of-multiple-atom-membership), e.g.:
+It should be possible to fix this case if we make a few changes: add all the instructions in the statement (i.e., including the loads) to the atom, and tweak the DwarfEmission code to understand this situation (same atom, different line). So there is room to pursue this in the future. Though that gets tricky in some cases due to the [other limitation mentioned above](#lack-of-multiple-atom-membership), e.g.:
 ```c
 int e =        // atom 1
     (a + b)    // atom 1
diff --git a/llvm/docs/MIRLangRef.rst b/llvm/docs/MIRLangRef.rst
index f7647c898c1e6..32c96d197a027 100644
--- a/llvm/docs/MIRLangRef.rst
+++ b/llvm/docs/MIRLangRef.rst
@@ -807,6 +807,22 @@ For an int eq predicate ``ICMP_EQ``, the syntax is:
 
    %2:gpr(s32) = G_ICMP intpred(eq), %0, %1
 
+LaneMask Operands
+^^^^^^^^^^^^^^^^^
+
+A LaneMask operand contains a LaneBitmask struct that describes which
+sub-registers of a register are covered. Instructions typically associate a
+LaneMask operand with one or more Register operands and use it to represent
+sub-register-granularity information, such as liveness, for those associated
+Register operands.
+
+For example, the COPY_LANEMASK instruction uses this operand to copy only the
+lanes of the source register that are selected by the mask. The syntax looks
+like:
+
+.. code-block:: text
+
+   $vgpr1 = COPY_LANEMASK $vgpr0, lanemask(0x00000000000000C0)
+
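+In this example the mask value has bits 6 and 7 set, so only the sub-register
+lanes of ``$vgpr0`` corresponding to those bits are copied.
+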
 .. TODO: Describe the parsers default behaviour when optional YAML attributes
    are missing.
 .. TODO: Describe the syntax for virtual register YAML definitions.
diff --git a/llvm/docs/ReleaseNotes.md b/llvm/docs/ReleaseNotes.md
index c6c527d1ae964..2a1e88f1fd17f 100644
--- a/llvm/docs/ReleaseNotes.md
+++ b/llvm/docs/ReleaseNotes.md
@@ -149,6 +149,8 @@ Changes to the RISC-V Backend
 * Adds experimental support for the `Zibi` (Branch with Immediate) extension.
 * Add support for Zvfofp8min (OFP8 conversion extension)
 * Adds assembler support for the Andes `XAndesvsinth` (Andes Vector Small Int Handling Extension).
+* DWARF fission is now compatible with linker relaxations, allowing `-gsplit-dwarf` and `-mrelax`
+  to be used together when building for the RISC-V platform.
 
 Changes to the WebAssembly Backend
 ----------------------------------
diff --git a/llvm/docs/SPIRVUsage.rst b/llvm/docs/SPIRVUsage.rst
index aedb6643cf581..88164e6fa53d8 100644
--- a/llvm/docs/SPIRVUsage.rst
+++ b/llvm/docs/SPIRVUsage.rst
@@ -30,8 +30,8 @@ Static Compiler Commands
    Description: This command compiles an LLVM IL file (`input.ll`) to a SPIR-V binary (`output.spvt`) for a 32-bit architecture.
 
 2. **Compilation with Extensions and Optimization**
-   Command: `llc -O1 -mtriple=spirv64-unknown-unknown --spirv-ext=+SPV_INTEL_arbitrary_precision_integers input.ll -o output.spvt`
-   Description: Compiles an LLVM IL file to SPIR-V with (`-O1`) optimizations, targeting a 64-bit architecture. It enables the SPV_INTEL_arbitrary_precision_integers extension.
+   Command: `llc -O1 -mtriple=spirv64-unknown-unknown --spirv-ext=+SPV_ALTERA_arbitrary_precision_integers input.ll -o output.spvt`
+   Description: Compiles an LLVM IL file to SPIR-V with (`-O1`) optimizations, targeting a 64-bit architecture. It enables the SPV_ALTERA_arbitrary_precision_integers extension.
 
 3. **Compilation with experimental NonSemantic.Shader.DebugInfo.100 support**
    Command: `llc --spv-emit-nonsemantic-debug-info --spirv-ext=+SPV_KHR_non_semantic_info input.ll -o output.spvt`
@@ -136,7 +136,7 @@ extensions to enable or disable, each prefixed with ``+`` or ``-``, respectively
 
 To enable multiple extensions, list them separated by comma. For example, to enable support for atomic operations on floating-point numbers and arbitrary precision integers, use:
 
-``-spirv-ext=+SPV_EXT_shader_atomic_float_add,+SPV_INTEL_arbitrary_precision_integers``
+``-spirv-ext=+SPV_EXT_shader_atomic_float_add,+SPV_ALTERA_arbitrary_precision_integers``
 
 To enable all extensions, use the following option:
 ``-spirv-ext=all``
@@ -145,7 +145,7 @@ To enable all KHR extensions, use the following option:
 ``-spirv-ext=khr``
 
 To enable all extensions except specified, specify ``all`` followed by a list of disallowed extensions. For example:
-``-spirv-ext=all,-SPV_INTEL_arbitrary_precision_integers``
+``-spirv-ext=all,-SPV_ALTERA_arbitrary_precision_integers``
 
 Below is a list of supported SPIR-V extensions, sorted alphabetically by their extension names:
 
@@ -171,7 +171,7 @@ Below is a list of supported SPIR-V extensions, sorted alphabetically by their e
      - Extends the SPV_EXT_shader_atomic_float_add and SPV_EXT_shader_atomic_float_min_max to support addition, minimum and maximum on 16-bit `bfloat16` floating-point numbers in memory.
    * - ``SPV_INTEL_2d_block_io``
      - Adds additional subgroup block prefetch, load, load transposed, load transformed and store instructions to read two-dimensional blocks of data from a two-dimensional region of memory, or to write two-dimensional blocks of data to a two dimensional region of memory.
-   * - ``SPV_INTEL_arbitrary_precision_integers``
+   * - ``SPV_ALTERA_arbitrary_precision_integers``
      - Allows generating arbitrary width integer types.
    * - ``SPV_INTEL_bindless_images``
      - Adds instructions to convert unsigned integer handles to images, samplers and sampled images.
@@ -245,6 +245,9 @@ Below is a list of supported SPIR-V extensions, sorted alphabetically by their e
      - Adds execution mode and capability to enable maximal reconvergence.
    * - ``SPV_ALTERA_blocking_pipes``
      - Adds new pipe read and write functions that have blocking semantics instead of the non-blocking semantics of the existing pipe read/write functions.
+   * - ``SPV_ALTERA_arbitrary_precision_fixed_point``
+     - Adds instructions for fixed-point arithmetic. The extension works without SPV_ALTERA_arbitrary_precision_integers, but together they allow greater flexibility in representing arbitrary-precision data types.
+
 
 SPIR-V representation in LLVM IR
 ================================
diff --git a/llvm/docs/Telemetry.rst b/llvm/docs/Telemetry.rst
index 4f30ae82b5628..c36105c99709f 100644
--- a/llvm/docs/Telemetry.rst
+++ b/llvm/docs/Telemetry.rst
@@ -32,7 +32,7 @@ Important notes
 * There is no concrete implementation of a Telemetry library in upstream LLVM.
   We only provide the abstract API here. Any tool that wants telemetry will
   implement one.
-  
+
   The rationale for this is that all the tools in LLVM are very different in
   what they care about (what/where/when to instrument data). Hence, it might not
   be practical to have a single implementation.
@@ -41,16 +41,16 @@ Important notes
 
 * No implementation of Telemetry in upstream LLVM shall store any of the
   collected data due to privacy and security reasons:
-  
+
   * Different organizations have different privacy models:
-  
+
     * Which data is sensitive, which is not?
     * Whether it is acceptable for instrumented data to be stored anywhere?
       (to a local file, what not?)
-      
+
   * Data ownership and data collection consents are hard to accommodate from
     LLVM developers' point of view:
-  
+
     * E.g., data collected by Telemetry is not necessarily owned by the user
       of an LLVM tool with Telemetry enabled, hence the user's consent to data
       collection is not meaningful. On the other hand, LLVM developers have no
@@ -75,7 +75,7 @@ The framework consists of four important classes:
   It is up to the vendor to decide which pieces of data to forward and where
   to forward them to for their final storage.
 * ``llvm::telemetry::Config``: Configurations for the ``Manager``.
-  
+
 .. image:: llvm_telemetry_design.png
 
 How to implement and interact with the API
@@ -123,7 +123,7 @@ To use Telemetry in your tool, you need to provide a concrete implementation of
     void write(StringRef KeyName, unsigned long Value) override {
       writeHelper(KeyName, Value);
     }
-    
+
     void write(StringRef KeyName, unsigned long long Value) override {
       writeHelper(KeyName, Value);
     }
@@ -131,12 +131,12 @@ To use Telemetry in your tool, you need to provide a concrete implementation of
     void write(StringRef KeyName, StringRef Value) override {
       writeHelper(KeyName, Value);
     }
- 
+
     void beginObject(StringRef KeyName) override {
       Children.push_back(json::Object());
       ChildrenNames.push_back(KeyName.str());
     }
-    
+
     void endObject() override {
       assert(!Children.empty() && !ChildrenNames.empty());
       json::Value Val = json::Value(std::move(Children.back()));
@@ -146,7 +146,7 @@ To use Telemetry in your tool, you need to provide a concrete implementation of
       ChildrenNames.pop_back();
       writeHelper(Name, std::move(Val));
     }
-    
+
     Error finalize() override {
       if (!Started)
         return createStringError("Serializer not currently in use");
@@ -167,10 +167,10 @@ To use Telemetry in your tool, you need to provide a concrete implementation of
     std::vector<json::Object> Children;
     std::vector<std::string> ChildrenNames;
   };
-       
-  class MyManager : public telemery::Manager {
+
+  class MyManager : public telemetry::Manager {
   public:
-  static std::unique_ptr<MyManager> createInstatnce(telemetry::Config *Config) {
+  static std::unique_ptr<MyManager> createInstance(telemetry::Config *Config) {
     // If Telemetry is not enabled, then just return null;
     if (!Config->EnableTelemetry)
       return nullptr;
@@ -182,19 +182,19 @@ To use Telemetry in your tool, you need to provide a concrete implementation of
     Entry->SessionId = SessionId;
     return Error::success();
   }
-  
+
   // You can also define additional instrumentation points.
   void logStartup(TelemetryInfo *Entry) {
     // Add some additional data to entry.
     Entry->Msg = "Some message";
     dispatch(Entry);
   }
-  
+
   void logAdditionalPoint(TelemetryInfo *Entry) {
     // .... code here
   }
-  
-  private:    
+
+  private:
     const std::string SessionId;
   };
 
@@ -203,11 +203,11 @@ To use Telemetry in your tool, you need to provide a concrete implementation of
     Error receiveEntry(const TelemetryInfo *Entry) override {
       if (Error Err = Serializer.init())
         return Err;
-      
+
       Entry->serialize(Serializer);
       if (Error Err = Serializer.finalize())
         return Err;
-      
+
       json::Object Copied = *Serializer.getOutputObject();
       // Send the `Copied` object to wherever.
       return Error::success();
@@ -220,16 +220,16 @@ To use Telemetry in your tool, you need to provide a concrete implementation of
   // This defines a custom TelemetryInfo that has an additional Msg field.
   struct MyTelemetryInfo : public telemetry::TelemetryInfo {
     std::string Msg;
-    
+
     Error serialize(Serializer &Serializer) const override {
       TelemetryInfo::serialize(serializer);
       Serializer.writeString("MyMsg", Msg);
     }
-      
+
     // Note: implement getKind() and classof() to support dyn_cast operations.
   };
 
-    
+
 2) Use the library in your tool.
 
 Logging the tool init-process:
@@ -241,10 +241,10 @@ Logging the tool init-process:
   telemetry::Config MyConfig = makeConfig(); // Build up the appropriate Config struct here.
   auto Manager = MyManager::createInstance(&MyConfig);
 
-  
+
   // Any other tool's init code can go here.
   // ...
-  
+
   // Finally, take a snapshot of the time now so we know how long it took the
   // init process to finish.
   auto EndTime = std::chrono::time_point<std::chrono::steady_clock>::now();
diff --git a/llvm/include/llvm/ADT/MapVector.h b/llvm/include/llvm/ADT/MapVector.h
index 80bcb7e0b7ba4..2b2f098dd3abf 100644
--- a/llvm/include/llvm/ADT/MapVector.h
+++ b/llvm/include/llvm/ADT/MapVector.h
@@ -99,6 +99,11 @@ class MapVector {
     return try_emplace_impl(Key).first->second;
   }
 
+  [[nodiscard]] auto keys() { return make_first_range(Vector); }
+  [[nodiscard]] auto keys() const { return make_first_range(Vector); }
+  [[nodiscard]] auto values() { return make_second_range(Vector); }
+  [[nodiscard]] auto values() const { return make_second_range(Vector); }
+
   // Returns a copy of the value.  Only allowed if ValueT is copyable.
   [[nodiscard]] ValueT lookup(const KeyT &Key) const {
     static_assert(std::is_copy_constructible_v<ValueT>,
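The `keys()`/`values()` accessors added above simply wrap the existing `make_first_range`/`make_second_range` helpers over the underlying vector, so both ranges preserve insertion order. A minimal usage sketch (hypothetical example, not part of the patch):

```cpp
#include "llvm/ADT/MapVector.h"
#include "llvm/Support/raw_ostream.h"
#include <string>

// Hypothetical sketch: iterate the new keys()/values() ranges, which walk the
// entries in insertion order, just like iterating the MapVector itself.
static void dump(const llvm::MapVector<unsigned, std::string> &M) {
  for (unsigned K : M.keys())           // first elements of the stored pairs
    llvm::outs() << K << " ";
  for (const std::string &V : M.values()) // second elements of the stored pairs
    llvm::outs() << V << " ";
  llvm::outs() << "\n";
}
```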
diff --git a/llvm/include/llvm/ADT/SetVector.h b/llvm/include/llvm/ADT/SetVector.h
index 0fde14126c79b..4d0a20f4f95f7 100644
--- a/llvm/include/llvm/ADT/SetVector.h
+++ b/llvm/include/llvm/ADT/SetVector.h
@@ -39,8 +39,7 @@ namespace llvm {
 ///
 /// The key and value types are derived from the Set and Vector types
 /// respectively. This allows the vector-type operations and set-type operations
-/// to have different types. In particular, this is useful when storing pointers
-/// as "Foo *" values but looking them up as "const Foo *" keys.
+/// to have different types.
 ///
 /// No constraint is placed on the key and value types, although it is assumed
 /// that value_type can be converted into key_type for insertion. Users must be
@@ -60,6 +59,9 @@ class SetVector {
   // excessively long linear scans from occuring.
   static_assert(N <= 32, "Small size should be less than or equal to 32!");
 
+  using const_arg_type =
+      typename const_pointer_or_const_ref<typename Set::key_type>::type;
+
 public:
   using value_type = typename Vector::value_type;
   using key_type = typename Set::key_type;
@@ -247,17 +249,17 @@ class SetVector {
   }
 
   /// Check if the SetVector contains the given key.
-  [[nodiscard]] bool contains(const key_type &key) const {
+  [[nodiscard]] bool contains(const_arg_type key) const {
     if constexpr (canBeSmall())
       if (isSmall())
         return is_contained(vector_, key);
 
-    return set_.find(key) != set_.end();
+    return is_contained(set_, key);
   }
 
   /// Count the number of elements of a given key in the SetVector.
   /// \returns 0 if the element is not in the SetVector, 1 if it is.
-  [[nodiscard]] size_type count(const key_type &key) const {
+  [[nodiscard]] size_type count(const_arg_type key) const {
     return contains(key) ? 1 : 0;
   }
 
diff --git a/llvm/include/llvm/Analysis/CFGPrinter.h b/llvm/include/llvm/Analysis/CFGPrinter.h
index ec26da87eb916..aa711642a3a6d 100644
--- a/llvm/include/llvm/Analysis/CFGPrinter.h
+++ b/llvm/include/llvm/Analysis/CFGPrinter.h
@@ -31,6 +31,9 @@
 #include "llvm/Support/DOTGraphTraits.h"
 #include "llvm/Support/FormatVariadic.h"
 
+#include <functional>
+#include <sstream>
+
 namespace llvm {
 class ModuleSlotTracker;
 
@@ -69,13 +72,18 @@ class DOTFuncInfo {
   bool ShowHeat;
   bool EdgeWeights;
   bool RawWeights;
+  using NodeIdFormatterTy =
+      std::function<std::optional<std::string>(const BasicBlock *)>;
+  std::optional<NodeIdFormatterTy> NodeIdFormatter;
 
 public:
   DOTFuncInfo(const Function *F) : DOTFuncInfo(F, nullptr, nullptr, 0) {}
   LLVM_ABI ~DOTFuncInfo();
 
-  LLVM_ABI DOTFuncInfo(const Function *F, const BlockFrequencyInfo *BFI,
-                       const BranchProbabilityInfo *BPI, uint64_t MaxFreq);
+  LLVM_ABI
+  DOTFuncInfo(const Function *F, const BlockFrequencyInfo *BFI,
+              const BranchProbabilityInfo *BPI, uint64_t MaxFreq,
+              std::optional<NodeIdFormatterTy> NodeIdFormatter = std::nullopt);
 
   const BlockFrequencyInfo *getBFI() const { return BFI; }
 
@@ -102,6 +110,10 @@ class DOTFuncInfo {
   void setEdgeWeights(bool EdgeWeights) { this->EdgeWeights = EdgeWeights; }
 
   bool showEdgeWeights() { return EdgeWeights; }
+
+  std::optional<NodeIdFormatterTy> getNodeIdFormatter() {
+    return NodeIdFormatter;
+  }
 };
 
 template <>
@@ -311,21 +323,27 @@ struct DOTGraphTraits<DOTFuncInfo *> : public DefaultDOTGraphTraits {
   }
 
   std::string getNodeAttributes(const BasicBlock *Node, DOTFuncInfo *CFGInfo) {
+    std::stringstream Attrs;
+
+    if (auto NodeIdFmt = CFGInfo->getNodeIdFormatter())
+      if (auto NodeId = (*NodeIdFmt)(Node))
+        Attrs << "id=\"" << *NodeId << "\"";
+
+    if (CFGInfo->showHeatColors()) {
+      uint64_t Freq = CFGInfo->getFreq(Node);
+      std::string Color = getHeatColor(Freq, CFGInfo->getMaxFreq());
+      std::string EdgeColor = (Freq <= (CFGInfo->getMaxFreq() / 2))
+                                  ? (getHeatColor(0))
+                                  : (getHeatColor(1));
+      if (!Attrs.str().empty())
+        Attrs << ",";
+      Attrs << "color=\"" << EdgeColor << "ff\", style=filled, "
+            << "fillcolor=\"" << Color << "70\", " << "fontname=\"Courier\"";
+    }
 
-    if (!CFGInfo->showHeatColors())
-      return "";
-
-    uint64_t Freq = CFGInfo->getFreq(Node);
-    std::string Color = getHeatColor(Freq, CFGInfo->getMaxFreq());
-    std::string EdgeColor = (Freq <= (CFGInfo->getMaxFreq() / 2))
-                                ? (getHeatColor(0))
-                                : (getHeatColor(1));
-
-    std::string Attrs = "color=\"" + EdgeColor + "ff\", style=filled," +
-                        " fillcolor=\"" + Color + "70\"" +
-                        " fontname=\"Courier\"";
-    return Attrs;
+    return Attrs.str();
   }
+
   LLVM_ABI bool isNodeHidden(const BasicBlock *Node,
                              const DOTFuncInfo *CFGInfo);
   LLVM_ABI void computeDeoptOrUnreachablePaths(const Function *F);
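The new optional `NodeIdFormatter` lets a caller attach a DOT `id` attribute to each emitted CFG node. Assuming the constructor parameter behaves as declared in this hunk, a caller might wire it up roughly as below; the function name `emitCFGWithIds` and the choice of using block names as ids are illustrative assumptions, not part of the patch:

```cpp
#include "llvm/Analysis/CFGPrinter.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/GraphWriter.h"
#include <optional>
#include <string>

// Hypothetical sketch: give DOT nodes stable ids derived from block names.
static void emitCFGWithIds(llvm::Function &F, llvm::raw_ostream &OS) {
  auto NameAsId = [](const llvm::BasicBlock *BB) -> std::optional<std::string> {
    if (BB->hasName())
      return BB->getName().str(); // use the block name as the DOT node id
    return std::nullopt;          // unnamed blocks get no id attribute
  };
  llvm::DOTFuncInfo CFGInfo(&F, /*BFI=*/nullptr, /*BPI=*/nullptr,
                            /*MaxFreq=*/0, NameAsId);
  llvm::WriteGraph(OS, &CFGInfo);
}
```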
diff --git a/llvm/include/llvm/Analysis/DependenceAnalysis.h b/llvm/include/llvm/Analysis/DependenceAnalysis.h
index 8286d8e8e45cc..6dec24fc9f104 100644
--- a/llvm/include/llvm/Analysis/DependenceAnalysis.h
+++ b/llvm/include/llvm/Analysis/DependenceAnalysis.h
@@ -506,17 +506,6 @@ class DependenceInfo {
   bool isKnownPredicate(ICmpInst::Predicate Pred, const SCEV *X,
                         const SCEV *Y) const;
 
-  /// isKnownLessThan - Compare to see if S is less than Size
-  /// Another wrapper for isKnownNegative(S - max(Size, 1)) with some extra
-  /// checking if S is an AddRec and we can prove lessthan using the loop
-  /// bounds.
-  bool isKnownLessThan(const SCEV *S, const SCEV *Size) const;
-
-  /// isKnownNonNegative - Compare to see if S is known not to be negative
-  /// Uses the fact that S comes from Ptr, which may be an inbound GEP,
-  /// Proving there is no wrapping going on.
-  bool isKnownNonNegative(const SCEV *S, const Value *Ptr) const;
-
   /// collectUpperBound - All subscripts are the same type (on my machine,
   /// an i64). The loop bound may be a smaller type. collectUpperBound
   /// find the bound, if available, and zero extends it to the Type T.
@@ -554,7 +543,7 @@ class DependenceInfo {
   /// If the dependence isn't proven to exist,
   /// marks the Result as inconsistent.
   bool testSIV(const SCEV *Src, const SCEV *Dst, unsigned &Level,
-               FullDependence &Result) const;
+               FullDependence &Result, bool UnderRuntimeAssumptions);
 
   /// testRDIV - Tests the RDIV subscript pair (Src and Dst) for dependence.
   /// Things of the form [c1 + a1*i] and [c2 + a2*j]
@@ -584,7 +573,7 @@ class DependenceInfo {
   bool strongSIVtest(const SCEV *Coeff, const SCEV *SrcConst,
                      const SCEV *DstConst, const Loop *CurrentSrcLoop,
                      const Loop *CurrentDstLoop, unsigned Level,
-                     FullDependence &Result) const;
+                     FullDependence &Result, bool UnderRuntimeAssumptions);
 
   /// weakCrossingSIVtest - Tests the weak-crossing SIV subscript pair
   /// (Src and Dst) for dependence.
diff --git a/llvm/include/llvm/Analysis/IVDescriptors.h b/llvm/include/llvm/Analysis/IVDescriptors.h
index 2c8484fde5b16..fc141ed6d96fe 100644
--- a/llvm/include/llvm/Analysis/IVDescriptors.h
+++ b/llvm/include/llvm/Analysis/IVDescriptors.h
@@ -95,12 +95,17 @@ class RecurrenceDescriptor {
                        RecurKind K, FastMathFlags FMF, Instruction *ExactFP,
                        Type *RT, bool Signed, bool Ordered,
                        SmallPtrSetImpl<Instruction *> &CI,
-                       unsigned MinWidthCastToRecurTy)
+                       unsigned MinWidthCastToRecurTy,
+                       bool PhiHasUsesOutsideReductionChain = false)
       : IntermediateStore(Store), StartValue(Start), LoopExitInstr(Exit),
         Kind(K), FMF(FMF), ExactFPMathInst(ExactFP), RecurrenceType(RT),
         IsSigned(Signed), IsOrdered(Ordered),
+        PhiHasUsesOutsideReductionChain(PhiHasUsesOutsideReductionChain),
         MinWidthCastToRecurrenceType(MinWidthCastToRecurTy) {
     CastInsts.insert_range(CI);
+    assert(
+        (!PhiHasUsesOutsideReductionChain || isMinMaxRecurrenceKind(K)) &&
+        "Only min/max recurrences are allowed to have multiple uses currently");
   }
 
   /// This POD struct holds information about a potential recurrence operation.
@@ -339,6 +344,13 @@ class RecurrenceDescriptor {
   /// Expose an ordered FP reduction to the instance users.
   bool isOrdered() const { return IsOrdered; }
 
+  /// Returns true if the reduction PHI has any uses outside the reduction
+  /// chain. This is relevant for min/max reductions that are part of a
+  /// FindLastIV pattern.
+  bool hasUsesOutsideReductionChain() const {
+    return PhiHasUsesOutsideReductionChain;
+  }
+
   /// Attempts to find a chain of operations from Phi to LoopExitInst that can
   /// be treated as a set of reductions instructions for in-loop reductions.
   LLVM_ABI SmallVector<Instruction *, 4> getReductionOpChain(PHINode *Phi,
@@ -376,6 +388,10 @@ class RecurrenceDescriptor {
   // Currently only a non-reassociative FAdd can be considered in-order,
   // if it is also the only FAdd in the PHI's use chain.
   bool IsOrdered = false;
+  // True if the reduction PHI has in-loop users outside the reduction chain.
+  // This is relevant for min/max reductions that are part of a FindLastIV
+  // pattern.
+  bool PhiHasUsesOutsideReductionChain = false;
   // Instructions used for type-promoting the recurrence.
   SmallPtrSet<Instruction *, 8> CastInsts;
   // The minimum width used by the recurrence.
diff --git a/llvm/include/llvm/Analysis/RuntimeLibcallInfo.h b/llvm/include/llvm/Analysis/RuntimeLibcallInfo.h
index 28a2ec47f81ad..2d31c8aa6301b 100644
--- a/llvm/include/llvm/Analysis/RuntimeLibcallInfo.h
+++ b/llvm/include/llvm/Analysis/RuntimeLibcallInfo.h
@@ -22,7 +22,12 @@ class LLVM_ABI RuntimeLibraryAnalysis
   RuntimeLibraryAnalysis() = default;
   RuntimeLibraryAnalysis(RTLIB::RuntimeLibcallsInfo &&BaselineInfoImpl)
       : LibcallsInfo(std::move(BaselineInfoImpl)) {}
-  explicit RuntimeLibraryAnalysis(const Triple &T) : LibcallsInfo(T) {}
+  RuntimeLibraryAnalysis(
+      const Triple &TT,
+      ExceptionHandling ExceptionModel = ExceptionHandling::None,
+      FloatABI::ABIType FloatABI = FloatABI::Default,
+      EABI EABIVersion = EABI::Default, StringRef ABIName = "",
+      VectorLibrary VecLib = VectorLibrary::NoLibrary);
 
   LLVM_ABI RTLIB::RuntimeLibcallsInfo run(const Module &M,
                                           ModuleAnalysisManager &);
@@ -41,12 +46,19 @@ class LLVM_ABI RuntimeLibraryInfoWrapper : public ImmutablePass {
 public:
   static char ID;
   RuntimeLibraryInfoWrapper();
-  explicit RuntimeLibraryInfoWrapper(const Triple &T);
-  explicit RuntimeLibraryInfoWrapper(const RTLIB::RuntimeLibcallsInfo &RTLCI);
+  RuntimeLibraryInfoWrapper(
+      const Triple &TT,
+      ExceptionHandling ExceptionModel = ExceptionHandling::None,
+      FloatABI::ABIType FloatABI = FloatABI::Default,
+      EABI EABIVersion = EABI::Default, StringRef ABIName = "",
+      VectorLibrary VecLib = VectorLibrary::NoLibrary);
 
   const RTLIB::RuntimeLibcallsInfo &getRTLCI(const Module &M) {
-    ModuleAnalysisManager DummyMAM;
-    RTLCI = RTLA.run(M, DummyMAM);
+    if (!RTLCI) {
+      ModuleAnalysisManager DummyMAM;
+      RTLCI = RTLA.run(M, DummyMAM);
+    }
+
     return *RTLCI;
   }
 
diff --git a/llvm/include/llvm/Analysis/ScalarEvolution.h b/llvm/include/llvm/Analysis/ScalarEvolution.h
index 04ea769bd06d1..9ce23ab344668 100644
--- a/llvm/include/llvm/Analysis/ScalarEvolution.h
+++ b/llvm/include/llvm/Analysis/ScalarEvolution.h
@@ -427,6 +427,15 @@ class LLVM_ABI SCEVUnionPredicate final : public SCEVPredicate {
 
   ArrayRef<const SCEVPredicate *> getPredicates() const { return Preds; }
 
+  /// Returns a new SCEVUnionPredicate that is the union of this predicate
+  /// and the given predicate \p N.
+  SCEVUnionPredicate getUnionWith(const SCEVPredicate *N,
+                                  ScalarEvolution &SE) const {
+    SCEVUnionPredicate Result(Preds, SE);
+    Result.add(N, SE);
+    return Result;
+  }
+
   /// Implementation of the SCEVPredicate interface
   bool isAlwaysTrue() const override;
   bool implies(const SCEVPredicate *N, ScalarEvolution &SE) const override;
@@ -1078,6 +1087,9 @@ class ScalarEvolution {
   isKnownMultipleOf(const SCEV *S, uint64_t M,
                     SmallVectorImpl<const SCEVPredicate *> &Assumptions);
 
+  /// Return true if we know that S1 and S2 must have the same sign.
+  LLVM_ABI bool haveSameSign(const SCEV *S1, const SCEV *S2);
+
   /// Splits SCEV expression \p S into two SCEVs. One of them is obtained from
   /// \p S by substitution of all AddRec sub-expression related to loop \p L
   /// with initial value of that SCEV. The second is obtained from \p S by
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index e24e22da5681b..99525607f744a 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -157,7 +157,8 @@ class MemIntrinsicCostAttributes {
         Alignment(Alignment) {}
 
   LLVM_ABI MemIntrinsicCostAttributes(Intrinsic::ID Id, Type *DataTy,
-                                      Align Alignment, unsigned AddressSpace)
+                                      Align Alignment,
+                                      unsigned AddressSpace = 0)
       : DataTy(DataTy), IID(Id), AddressSpace(AddressSpace),
         Alignment(Alignment) {}
 
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 624302bc6d0a3..5f1d855621c93 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -852,10 +852,8 @@ class TargetTransformInfoImplBase {
   }
 
   virtual InstructionCost
-  getGatherScatterOpCost(unsigned Opcode, Type *DataTy, const Value *Ptr,
-                         bool VariableMask, Align Alignment,
-                         TTI::TargetCostKind CostKind,
-                         const Instruction *I = nullptr) const {
+  getGatherScatterOpCost(const MemIntrinsicCostAttributes &MICA,
+                         TTI::TargetCostKind CostKind) const {
     return 1;
   }
 
@@ -866,10 +864,8 @@ class TargetTransformInfoImplBase {
   }
 
   virtual InstructionCost
-  getStridedMemoryOpCost(unsigned Opcode, Type *DataTy, const Value *Ptr,
-                         bool VariableMask, Align Alignment,
-                         TTI::TargetCostKind CostKind,
-                         const Instruction *I = nullptr) const {
+  getStridedMemoryOpCost(const MemIntrinsicCostAttributes &MICA,
+                         TTI::TargetCostKind CostKind) const {
     return InstructionCost::getInvalid();
   }
 
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index b1beb68feca46..fceff5f93b765 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -1571,10 +1571,15 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
   }
 
   InstructionCost
-  getGatherScatterOpCost(unsigned Opcode, Type *DataTy, const Value *Ptr,
-                         bool VariableMask, Align Alignment,
-                         TTI::TargetCostKind CostKind,
-                         const Instruction *I = nullptr) const override {
+  getGatherScatterOpCost(const MemIntrinsicCostAttributes &MICA,
+                         TTI::TargetCostKind CostKind) const override {
+    unsigned Opcode = (MICA.getID() == Intrinsic::masked_gather ||
+                       MICA.getID() == Intrinsic::vp_gather)
+                          ? Instruction::Load
+                          : Instruction::Store;
+    Type *DataTy = MICA.getDataType();
+    bool VariableMask = MICA.getVariableMask();
+    Align Alignment = MICA.getAlignment();
     return getCommonMaskedMemoryOpCost(Opcode, DataTy, Alignment, VariableMask,
                                        true, CostKind);
   }
@@ -1594,16 +1599,20 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
                                        /*IsGatherScatter*/ true, CostKind);
   }
 
-  InstructionCost getStridedMemoryOpCost(unsigned Opcode, Type *DataTy,
-                                         const Value *Ptr, bool VariableMask,
-                                         Align Alignment,
-                                         TTI::TargetCostKind CostKind,
-                                         const Instruction *I) const override {
+  InstructionCost
+  getStridedMemoryOpCost(const MemIntrinsicCostAttributes &MICA,
+                         TTI::TargetCostKind CostKind) const override {
     // For a target without strided memory operations (or for an illegal
     // operation type on one which does), assume we lower to a gather/scatter
     // operation.  (Which may in turn be scalarized.)
-    return thisT()->getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
-                                           Alignment, CostKind, I);
+    unsigned IID = MICA.getID() == Intrinsic::experimental_vp_strided_load
+                       ? Intrinsic::masked_gather
+                       : Intrinsic::masked_scatter;
+    return thisT()->getGatherScatterOpCost(
+        MemIntrinsicCostAttributes(IID, MICA.getDataType(), MICA.getPointer(),
+                                   MICA.getVariableMask(), MICA.getAlignment(),
+                                   MICA.getInst()),
+        CostKind);
   }
 
   InstructionCost getInterleavedMemoryOpCost(
@@ -1817,7 +1826,16 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
           }
         }
       }
-
+      if (ICA.getID() == Intrinsic::vp_load_ff) {
+        Type *RetTy = ICA.getReturnType();
+        Type *DataTy = cast<StructType>(RetTy)->getElementType(0);
+        Align Alignment;
+        if (auto *VPI = dyn_cast_or_null<VPIntrinsic>(ICA.getInst()))
+          Alignment = VPI->getPointerAlignment().valueOrOne();
+        return thisT()->getMemIntrinsicInstrCost(
+            MemIntrinsicCostAttributes(ICA.getID(), DataTy, Alignment),
+            CostKind);
+      }
       if (ICA.getID() == Intrinsic::vp_scatter) {
         if (ICA.isTypeBasedOnly()) {
           IntrinsicCostAttributes MaskedScatter(
@@ -3044,38 +3062,24 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
   getMemIntrinsicInstrCost(const MemIntrinsicCostAttributes &MICA,
                            TTI::TargetCostKind CostKind) const override {
     unsigned Id = MICA.getID();
-    Type *DataTy = MICA.getDataType();
-    const Value *Ptr = MICA.getPointer();
-    const Instruction *I = MICA.getInst();
-    bool VariableMask = MICA.getVariableMask();
-    Align Alignment = MICA.getAlignment();
 
     switch (Id) {
     case Intrinsic::experimental_vp_strided_load:
-    case Intrinsic::experimental_vp_strided_store: {
-      unsigned Opcode = Id == Intrinsic::experimental_vp_strided_load
-                            ? Instruction::Load
-                            : Instruction::Store;
-      return thisT()->getStridedMemoryOpCost(Opcode, DataTy, Ptr, VariableMask,
-                                             Alignment, CostKind, I);
-    }
+    case Intrinsic::experimental_vp_strided_store:
+      return thisT()->getStridedMemoryOpCost(MICA, CostKind);
     case Intrinsic::masked_scatter:
     case Intrinsic::masked_gather:
     case Intrinsic::vp_scatter:
-    case Intrinsic::vp_gather: {
-      unsigned Opcode =
-          (Id == Intrinsic::masked_gather || Id == Intrinsic::vp_gather)
-              ? Instruction::Load
-              : Instruction::Store;
-      return thisT()->getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
-                                             Alignment, CostKind, I);
-    }
+    case Intrinsic::vp_gather:
+      return thisT()->getGatherScatterOpCost(MICA, CostKind);
     case Intrinsic::masked_load:
     case Intrinsic::masked_store:
       return thisT()->getMaskedMemoryOpCost(MICA, CostKind);
     case Intrinsic::masked_compressstore:
     case Intrinsic::masked_expandload:
       return thisT()->getExpandCompressMemoryOpCost(MICA, CostKind);
+    case Intrinsic::vp_load_ff:
+      return InstructionCost::getInvalid();
     default:
       llvm_unreachable("unexpected intrinsic");
     }
diff --git a/llvm/include/llvm/CodeGen/GlobalISel/InstructionSelector.h b/llvm/include/llvm/CodeGen/GlobalISel/InstructionSelector.h
index 569407963695e..483afb426fa10 100644
--- a/llvm/include/llvm/CodeGen/GlobalISel/InstructionSelector.h
+++ b/llvm/include/llvm/CodeGen/GlobalISel/InstructionSelector.h
@@ -35,9 +35,6 @@ class LLVM_ABI InstructionSelector : public GIMatchTableExecutor {
   ///       !isPreISelGenericOpcode(I.getOpcode())
   virtual bool select(MachineInstr &I) = 0;
 
-  // FIXME: Eliminate dependency on TargetPassConfig for NewPM transition
-  const TargetPassConfig *TPC = nullptr;
-
   MachineOptimizationRemarkEmitter *MORE = nullptr;
 
   /// Note: InstructionSelect does not track changed instructions.
diff --git a/llvm/include/llvm/CodeGen/GlobalISel/LegalizerInfo.h b/llvm/include/llvm/CodeGen/GlobalISel/LegalizerInfo.h
index 51318c9c2736d..9324bab3fe656 100644
--- a/llvm/include/llvm/CodeGen/GlobalISel/LegalizerInfo.h
+++ b/llvm/include/llvm/CodeGen/GlobalISel/LegalizerInfo.h
@@ -314,6 +314,16 @@ LLVM_ABI LegalityPredicate scalarWiderThan(unsigned TypeIdx, unsigned Size);
 LLVM_ABI LegalityPredicate scalarOrEltNarrowerThan(unsigned TypeIdx,
                                                    unsigned Size);
 
+/// True iff the specified type index is a vector with a number of elements
+/// that's greater than the given size.
+LLVM_ABI LegalityPredicate vectorElementCountIsGreaterThan(unsigned TypeIdx,
+                                                           unsigned Size);
+
+/// True iff the specified type index is a vector with a number of elements
+/// that's less than or equal to the given size.
+LLVM_ABI LegalityPredicate
+vectorElementCountIsLessThanOrEqualTo(unsigned TypeIdx, unsigned Size);
+
 /// True iff the specified type index is a scalar or a vector with an element
 /// type that's wider than the given size.
 LLVM_ABI LegalityPredicate scalarOrEltWiderThan(unsigned TypeIdx,
diff --git a/llvm/include/llvm/CodeGen/GlobalISel/RegBankSelect.h b/llvm/include/llvm/CodeGen/GlobalISel/RegBankSelect.h
index 076c70d21bbdf..6060bb6144c62 100644
--- a/llvm/include/llvm/CodeGen/GlobalISel/RegBankSelect.h
+++ b/llvm/include/llvm/CodeGen/GlobalISel/RegBankSelect.h
@@ -510,9 +510,6 @@ class RegBankSelect : public MachineFunctionPass {
   /// Optimization mode of the pass.
   Mode OptMode;
 
-  /// Current target configuration. Controls how the pass handles errors.
-  const TargetPassConfig *TPC;
-
   /// Assign the register bank of each operand of \p MI.
   /// \return True on success, false otherwise.
   bool assignInstr(MachineInstr &MI);
diff --git a/llvm/include/llvm/CodeGen/GlobalISel/Utils.h b/llvm/include/llvm/CodeGen/GlobalISel/Utils.h
index e1aa8eceefd3f..da2742e089f8f 100644
--- a/llvm/include/llvm/CodeGen/GlobalISel/Utils.h
+++ b/llvm/include/llvm/CodeGen/GlobalISel/Utils.h
@@ -155,12 +155,10 @@ LLVM_ABI bool isTriviallyDead(const MachineInstr &MI,
 /// Report an ISel error as a missed optimization remark to the LLVMContext's
 /// diagnostic stream.  Set the FailedISel MachineFunction property.
 LLVM_ABI void reportGISelFailure(MachineFunction &MF,
-                                 const TargetPassConfig &TPC,
                                  MachineOptimizationRemarkEmitter &MORE,
                                  MachineOptimizationRemarkMissed &R);
 
 LLVM_ABI void reportGISelFailure(MachineFunction &MF,
-                                 const TargetPassConfig &TPC,
                                  MachineOptimizationRemarkEmitter &MORE,
                                  const char *PassName, StringRef Msg,
                                  const MachineInstr &MI);
@@ -168,7 +166,6 @@ LLVM_ABI void reportGISelFailure(MachineFunction &MF,
 /// Report an ISel warning as a missed optimization remark to the LLVMContext's
 /// diagnostic stream.
 LLVM_ABI void reportGISelWarning(MachineFunction &MF,
-                                 const TargetPassConfig &TPC,
                                  MachineOptimizationRemarkEmitter &MORE,
                                  MachineOptimizationRemarkMissed &R);
 
diff --git a/llvm/include/llvm/CodeGen/LibcallLoweringInfo.h b/llvm/include/llvm/CodeGen/LibcallLoweringInfo.h
index 8624fd2403a12..3e0137710e8eb 100644
--- a/llvm/include/llvm/CodeGen/LibcallLoweringInfo.h
+++ b/llvm/include/llvm/CodeGen/LibcallLoweringInfo.h
@@ -9,12 +9,16 @@
 #ifndef LLVM_CODEGEN_LIBCALLLOWERINGINFO_H
 #define LLVM_CODEGEN_LIBCALLLOWERINGINFO_H
 
+#include "llvm/ADT/DenseMap.h"
 #include "llvm/IR/RuntimeLibcalls.h"
+#include "llvm/Pass.h"
 
 namespace llvm {
 
 class TargetSubtargetInfo;
+class TargetMachine;
 
+/// Tracks which library functions to use for a particular subtarget.
 class LibcallLoweringInfo {
 private:
   const RTLIB::RuntimeLibcallsInfo &RTLCI;
@@ -73,6 +77,70 @@ class LibcallLoweringInfo {
   }
 };
 
+/// Record a mapping from subtarget to LibcallLoweringInfo.
+class LibcallLoweringModuleAnalysisResult {
+private:
+  using LibcallLoweringMap =
+      DenseMap<const TargetSubtargetInfo *, LibcallLoweringInfo>;
+  mutable LibcallLoweringMap LoweringMap;
+  const RTLIB::RuntimeLibcallsInfo *RTLCI = nullptr;
+
+public:
+  LibcallLoweringModuleAnalysisResult() = default;
+  LibcallLoweringModuleAnalysisResult(RTLIB::RuntimeLibcallsInfo &RTLCI)
+      : RTLCI(&RTLCI) {}
+
+  void init(const RTLIB::RuntimeLibcallsInfo *RT) { RTLCI = RT; }
+
+  void clear() {
+    RTLCI = nullptr;
+    LoweringMap.clear();
+  }
+
+  LLVM_ABI bool invalidate(Module &, const PreservedAnalyses &,
+                           ModuleAnalysisManager::Invalidator &);
+
+  const LibcallLoweringInfo &
+  getLibcallLowering(const TargetSubtargetInfo &Subtarget) const {
+    return LoweringMap.try_emplace(&Subtarget, *RTLCI, Subtarget).first->second;
+  }
+};
+
+class LibcallLoweringModuleAnalysis
+    : public AnalysisInfoMixin<LibcallLoweringModuleAnalysis> {
+private:
+  friend AnalysisInfoMixin<LibcallLoweringModuleAnalysis>;
+  static AnalysisKey Key;
+
+  LibcallLoweringModuleAnalysisResult LibcallLoweringMap;
+
+public:
+  using Result = LibcallLoweringModuleAnalysisResult;
+
+  LLVM_ABI Result run(Module &M, ModuleAnalysisManager &);
+};
+
+class LLVM_ABI LibcallLoweringInfoWrapper : public ImmutablePass {
+  LibcallLoweringModuleAnalysisResult Result;
+
+public:
+  static char ID;
+  LibcallLoweringInfoWrapper();
+
+  const LibcallLoweringInfo &
+  getLibcallLowering(const TargetSubtargetInfo &Subtarget) const {
+    return Result.getLibcallLowering(Subtarget);
+  }
+
+  const LibcallLoweringModuleAnalysisResult &getResult() const {
+    return Result;
+  }
+
+  bool doInitialization(Module &M) override;
+  void getAnalysisUsage(AnalysisUsage &AU) const override;
+  void releaseMemory() override;
+};
+
 } // end namespace llvm
 
 #endif // LLVM_CODEGEN_LIBCALLLOWERINGINFO_H
diff --git a/llvm/include/llvm/CodeGen/MIR2Vec.h b/llvm/include/llvm/CodeGen/MIR2Vec.h
index c12d0043bc481..9e59b0d3da4f4 100644
--- a/llvm/include/llvm/CodeGen/MIR2Vec.h
+++ b/llvm/include/llvm/CodeGen/MIR2Vec.h
@@ -139,7 +139,7 @@ class MIRVocabulary {
       "FrameIndex",      "ConstantPoolIndex", "TargetIndex",  "JumpTableIndex",
       "ExternalSymbol",  "GlobalAddress",     "BlockAddress", "RegisterMask",
       "RegisterLiveOut", "Metadata",          "MCSymbol",     "CFIIndex",
-      "IntrinsicID",     "Predicate",         "ShuffleMask"};
+      "IntrinsicID",     "Predicate",         "ShuffleMask",  "LaneMask"};
   static_assert(std::size(CommonOperandNames) == MachineOperand::MO_Last - 1 &&
                 "Common operand names size changed, update accordingly");
 
diff --git a/llvm/include/llvm/CodeGen/MachineInstr.h b/llvm/include/llvm/CodeGen/MachineInstr.h
index 077e39b49df6f..8e2574974a82d 100644
--- a/llvm/include/llvm/CodeGen/MachineInstr.h
+++ b/llvm/include/llvm/CodeGen/MachineInstr.h
@@ -1451,6 +1451,10 @@ class MachineInstr
     return getOpcode() == TargetOpcode::COPY;
   }
 
+  bool isCopyLaneMask() const {
+    return getOpcode() == TargetOpcode::COPY_LANEMASK;
+  }
+
   bool isFullCopy() const {
     return isCopy() && !getOperand(0).getSubReg() && !getOperand(1).getSubReg();
   }
@@ -1484,6 +1488,7 @@ class MachineInstr
     case TargetOpcode::PHI:
     case TargetOpcode::G_PHI:
     case TargetOpcode::COPY:
+    case TargetOpcode::COPY_LANEMASK:
     case TargetOpcode::INSERT_SUBREG:
     case TargetOpcode::SUBREG_TO_REG:
     case TargetOpcode::REG_SEQUENCE:
diff --git a/llvm/include/llvm/CodeGen/MachineInstrBuilder.h b/llvm/include/llvm/CodeGen/MachineInstrBuilder.h
index caeb430d6fd1c..060f0c41de73a 100644
--- a/llvm/include/llvm/CodeGen/MachineInstrBuilder.h
+++ b/llvm/include/llvm/CodeGen/MachineInstrBuilder.h
@@ -307,6 +307,11 @@ class MachineInstrBuilder {
     return *this;
   }
 
+  const MachineInstrBuilder &addLaneMask(LaneBitmask LaneMask) const {
+    MI->addOperand(*MF, MachineOperand::CreateLaneMask(LaneMask));
+    return *this;
+  }
+
   const MachineInstrBuilder &addSym(MCSymbol *Sym,
                                     unsigned char TargetFlags = 0) const {
     MI->addOperand(*MF, MachineOperand::CreateMCSymbol(Sym, TargetFlags));
diff --git a/llvm/include/llvm/CodeGen/MachineOperand.h b/llvm/include/llvm/CodeGen/MachineOperand.h
index 9104e93ed9783..d85da5a4997f1 100644
--- a/llvm/include/llvm/CodeGen/MachineOperand.h
+++ b/llvm/include/llvm/CodeGen/MachineOperand.h
@@ -16,6 +16,7 @@
 #include "llvm/ADT/DenseMapInfo.h"
 #include "llvm/CodeGen/Register.h"
 #include "llvm/IR/Intrinsics.h"
+#include "llvm/MC/LaneBitmask.h"
 #include "llvm/Support/Compiler.h"
 #include <cassert>
 
@@ -69,7 +70,8 @@ class MachineOperand {
     MO_Predicate,         ///< Generic predicate for ISel
     MO_ShuffleMask,       ///< Other IR Constant for ISel (shuffle masks)
     MO_DbgInstrRef, ///< Integer indices referring to an instruction+operand
-    MO_Last = MO_DbgInstrRef
+    MO_LaneMask,    ///< Mask to represent active parts of registers
+    MO_Last = MO_LaneMask
   };
 
 private:
@@ -178,6 +180,7 @@ class MachineOperand {
     Intrinsic::ID IntrinsicID; // For MO_IntrinsicID.
     unsigned Pred;           // For MO_Predicate
     ArrayRef<int> ShuffleMask; // For MO_ShuffleMask
+    LaneBitmask LaneMask;      // For MO_LaneMask
 
     struct {                  // For MO_Register.
       // Register number is in SmallContents.RegNo.
@@ -360,6 +363,7 @@ class MachineOperand {
   bool isIntrinsicID() const { return OpKind == MO_IntrinsicID; }
   bool isPredicate() const { return OpKind == MO_Predicate; }
   bool isShuffleMask() const { return OpKind == MO_ShuffleMask; }
+  bool isLaneMask() const { return OpKind == MO_LaneMask; }
   //===--------------------------------------------------------------------===//
   // Accessors for Register Operands
   //===--------------------------------------------------------------------===//
@@ -624,6 +628,11 @@ class MachineOperand {
     return Contents.ShuffleMask;
   }
 
+  LaneBitmask getLaneMask() const {
+    assert(isLaneMask() && "Wrong MachineOperand accessor");
+    return Contents.LaneMask;
+  }
+
   /// Return the offset from the symbol in this operand. This always returns 0
   /// for ExternalSymbol operands.
   int64_t getOffset() const {
@@ -992,6 +1001,12 @@ class MachineOperand {
     return Op;
   }
 
+  static MachineOperand CreateLaneMask(LaneBitmask LaneMask) {
+    MachineOperand Op(MachineOperand::MO_LaneMask);
+    Op.Contents.LaneMask = LaneMask;
+    return Op;
+  }
+
   friend class MachineInstr;
   friend class MachineRegisterInfo;
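Taken together with the `MachineInstrBuilder::addLaneMask` and `COPY_LANEMASK` additions elsewhere in this patch, the new `MO_LaneMask` operand kind would be used roughly as in the sketch below (hypothetical example, not part of the patch; `buildCopyLaneMask` is an illustrative helper name):

```cpp
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/MC/LaneBitmask.h"

// Hypothetical sketch: emit a COPY_LANEMASK carrying the new MO_LaneMask
// operand via the addLaneMask() builder method introduced by this patch.
llvm::MachineInstr *buildCopyLaneMask(llvm::MachineBasicBlock &MBB,
                                      llvm::MachineBasicBlock::iterator I,
                                      const llvm::DebugLoc &DL,
                                      const llvm::TargetInstrInfo &TII,
                                      llvm::Register Dst, llvm::Register Src,
                                      llvm::LaneBitmask Mask) {
  return llvm::BuildMI(MBB, I, DL,
                       TII.get(llvm::TargetOpcode::COPY_LANEMASK), Dst)
      .addReg(Src)
      .addLaneMask(Mask) // lanes of Src that are actually copied
      .getInstr();
}
```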
 
diff --git a/llvm/include/llvm/CodeGen/SelectionDAG.h b/llvm/include/llvm/CodeGen/SelectionDAG.h
index 501cbc947132e..37a43aaf1d0c7 100644
--- a/llvm/include/llvm/CodeGen/SelectionDAG.h
+++ b/llvm/include/llvm/CodeGen/SelectionDAG.h
@@ -1185,11 +1185,17 @@ class SelectionDAG {
   SDValue getPOISON(EVT VT) { return getNode(ISD::POISON, SDLoc(), VT); }
 
   /// Return a node that represents the runtime scaling 'MulImm * RuntimeVL'.
-  LLVM_ABI SDValue getVScale(const SDLoc &DL, EVT VT, APInt MulImm,
-                             bool ConstantFold = true);
+  LLVM_ABI SDValue getVScale(const SDLoc &DL, EVT VT, APInt MulImm);
 
-  LLVM_ABI SDValue getElementCount(const SDLoc &DL, EVT VT, ElementCount EC,
-                                   bool ConstantFold = true);
+  LLVM_ABI SDValue getElementCount(const SDLoc &DL, EVT VT, ElementCount EC);
+
+  LLVM_ABI SDValue getTypeSize(const SDLoc &DL, EVT VT, TypeSize TS);
+
+  /// Return a vector with the first 'Len' lanes set to true and remaining lanes
+  /// set to false. The mask's ValueType is the same as when comparing vectors
+  /// of type VT.
+  LLVM_ABI SDValue getMaskFromElementCount(const SDLoc &DL, EVT VT,
+                                           ElementCount Len);
 
   /// Return a GLOBAL_OFFSET_TABLE node. This does not have a useful SDLoc.
   SDValue getGLOBAL_OFFSET_TABLE(EVT VT) {
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index b2697c81fd825..149366c69bdcc 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -1243,7 +1243,7 @@ class LLVM_ABI TargetLoweringBase {
   /// to a MemIntrinsicNode (touches memory). If this is the case, it returns
   /// true and store the intrinsic information into the IntrinsicInfo that was
   /// passed to the function.
-  virtual bool getTgtMemIntrinsic(IntrinsicInfo &, const CallInst &,
+  virtual bool getTgtMemIntrinsic(IntrinsicInfo &, const CallBase &,
                                   MachineFunction &,
                                   unsigned /*Intrinsic*/) const {
     return false;
diff --git a/llvm/include/llvm/DebugInfo/DWARF/DWARFAcceleratorTable.h b/llvm/include/llvm/DebugInfo/DWARF/DWARFAcceleratorTable.h
index 87586eda90682..086e11a623e9e 100644
--- a/llvm/include/llvm/DebugInfo/DWARF/DWARFAcceleratorTable.h
+++ b/llvm/include/llvm/DebugInfo/DWARF/DWARFAcceleratorTable.h
@@ -153,7 +153,11 @@ class LLVM_ABI AppleAcceleratorTable : public DWARFAcceleratorTable {
   uint64_t getHashBase() const { return getBucketBase() + getNumBuckets() * 4; }
 
   /// Return the offset into the section where the I-th hash is.
-  uint64_t getIthHashBase(uint32_t I) const { return getHashBase() + I * 4; }
+  std::optional<uint64_t> getIthHashBase(uint32_t I) const {
+    if (I < Hdr.HashCount)
+      return getHashBase() + I * 4;
+    return std::nullopt;
+  }
 
   /// Return the offset into the section where the offset list begins.
   uint64_t getOffsetBase() const { return getHashBase() + getNumHashes() * 4; }
@@ -164,8 +168,10 @@ class LLVM_ABI AppleAcceleratorTable : public DWARFAcceleratorTable {
   }
 
   /// Return the offset into the section where the I-th offset is.
-  uint64_t getIthOffsetBase(uint32_t I) const {
-    return getOffsetBase() + I * 4;
+  std::optional<uint64_t> getIthOffsetBase(uint32_t I) const {
+    if (I < Hdr.HashCount)
+      return getOffsetBase() + I * 4;
+    return std::nullopt;
   }
 
   /// Returns the index of the bucket where a hypothetical Hash would be.
@@ -188,14 +194,18 @@ class LLVM_ABI AppleAcceleratorTable : public DWARFAcceleratorTable {
 
   /// Reads the I-th hash in the hash list.
   std::optional<uint32_t> readIthHash(uint32_t I) const {
-    uint64_t Offset = getIthHashBase(I);
-    return readU32FromAccel(Offset);
+    std::optional<uint64_t> OptOffset = getIthHashBase(I);
+    if (OptOffset)
+      return readU32FromAccel(*OptOffset);
+    return std::nullopt;
   }
 
   /// Reads the I-th offset in the offset list.
   std::optional<uint32_t> readIthOffset(uint32_t I) const {
-    uint64_t Offset = getIthOffsetBase(I);
-    return readU32FromAccel(Offset);
+    std::optional<uint64_t> OptOffset = getIthOffsetBase(I);
+    if (OptOffset)
+      return readU32FromAccel(*OptOffset);
+    return std::nullopt;
   }
 
   /// Reads a string offset from the accelerator table at Offset, which is
@@ -282,6 +292,7 @@ class LLVM_ABI AppleAcceleratorTable : public DWARFAcceleratorTable {
     constexpr static auto EndMarker = std::numeric_limits<uint64_t>::max();
 
     EntryWithName Current;
+    uint32_t OffsetIdx = 0;
     uint64_t Offset = EndMarker;
     uint32_t NumEntriesToCome = 0;
 
@@ -298,7 +309,9 @@ class LLVM_ABI AppleAcceleratorTable : public DWARFAcceleratorTable {
     /// Reads the next string pointer and the entry count for that string,
     /// populating `NumEntriesToCome`.
     /// If not possible (e.g. end of the section), becomes the end iterator.
-    /// Assumes `Offset` points to a string reference.
+    /// If `Offset` is zero, then the next valid string offset will be fetched
+    /// from the Offsets array, otherwise it will continue to parse the current
+    /// entry's strings.
     void prepareNextStringOrEnd();
 
   public:
diff --git a/llvm/include/llvm/ExecutionEngine/Orc/CallableTraitsHelper.h b/llvm/include/llvm/ExecutionEngine/Orc/CallableTraitsHelper.h
new file mode 100644
index 0000000000000..11bafa9745693
--- /dev/null
+++ b/llvm/include/llvm/ExecutionEngine/Orc/CallableTraitsHelper.h
@@ -0,0 +1,74 @@
+//===- CallableTraitsHelper.h - Callable arg/ret type extractor -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// CallableTraitsHelper API.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_EXECUTIONENGINE_ORC_CALLABLETRAITSHELPER_H
+#define LLVM_EXECUTIONENGINE_ORC_CALLABLETRAITSHELPER_H
+
+#include <tuple>
+#include <type_traits>
+
+namespace llvm::orc {
+
+/// CallableTraitsHelper takes an implementation class template Impl and some
+/// callable type C and passes the return and argument types of C to the Impl
+/// class template.
+///
+/// This can be used to simplify the implementation of classes that need to
+/// operate on callable types.
+template <template <typename...> typename ImplT, typename C>
+struct CallableTraitsHelper
+    : public CallableTraitsHelper<
+          ImplT,
+          decltype(&std::remove_cv_t<std::remove_reference_t<C>>::operator())> {
+};
+
+template <template <typename...> typename ImplT, typename RetT,
+          typename... ArgTs>
+struct CallableTraitsHelper<ImplT, RetT(ArgTs...)>
+    : public ImplT<RetT, ArgTs...> {};
+
+template <template <typename...> typename ImplT, typename RetT,
+          typename... ArgTs>
+struct CallableTraitsHelper<ImplT, RetT (*)(ArgTs...)>
+    : public CallableTraitsHelper<ImplT, RetT(ArgTs...)> {};
+
+template <template <typename...> typename ImplT, typename RetT,
+          typename... ArgTs>
+struct CallableTraitsHelper<ImplT, RetT (&)(ArgTs...)>
+    : public CallableTraitsHelper<ImplT, RetT(ArgTs...)> {};
+
+template <template <typename...> typename ImplT, typename ClassT, typename RetT,
+          typename... ArgTs>
+struct CallableTraitsHelper<ImplT, RetT (ClassT::*)(ArgTs...)>
+    : public CallableTraitsHelper<ImplT, RetT(ArgTs...)> {};
+
+template <template <typename...> typename ImplT, typename ClassT, typename RetT,
+          typename... ArgTs>
+struct CallableTraitsHelper<ImplT, RetT (ClassT::*)(ArgTs...) const>
+    : public CallableTraitsHelper<ImplT, RetT(ArgTs...)> {};
+
+namespace detail {
+template <typename RetT, typename... ArgTs> struct CallableArgInfoImpl {
+  using ReturnType = RetT;
+  using ArgsTupleType = std::tuple<ArgTs...>;
+};
+} // namespace detail
+
+/// CallableArgInfo provides typedefs for the return type and argument types
+/// (as a tuple) of the given callable type.
+template <typename Callable>
+struct CallableArgInfo
+    : public CallableTraitsHelper<detail::CallableArgInfoImpl, Callable> {};
+
+} // namespace llvm::orc
+
+#endif // LLVM_EXECUTIONENGINE_ORC_CALLABLETRAITSHELPER_H
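A short usage sketch for the new header (hypothetical example, not part of the patch): `CallableArgInfo` exposes the return type and argument tuple of an arbitrary callable, here a lambda, which the static_asserts below verify at compile time.

```cpp
#include "llvm/ExecutionEngine/Orc/CallableTraitsHelper.h"
#include <string>
#include <tuple>
#include <type_traits>

using namespace llvm::orc;

// The traits are deduced from the lambda's call operator.
auto Handler = [](int Id, const std::string &Name) -> bool {
  return Id >= 0 && !Name.empty();
};

using Info = CallableArgInfo<decltype(Handler)>;
static_assert(std::is_same_v<Info::ReturnType, bool>,
              "return type is taken from the call operator");
static_assert(std::is_same_v<Info::ArgsTupleType,
                             std::tuple<int, const std::string &>>,
              "argument types are collected into a tuple");
```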
diff --git a/llvm/include/llvm/Frontend/OpenMP/ClauseT.h b/llvm/include/llvm/Frontend/OpenMP/ClauseT.h
index 67ebafc89cf99..eb0e4a828eb1e 100644
--- a/llvm/include/llvm/Frontend/OpenMP/ClauseT.h
+++ b/llvm/include/llvm/Frontend/OpenMP/ClauseT.h
@@ -1307,7 +1307,7 @@ struct WriteT {
 };
 
 // V6: [6.4.7] Looprange clause
-template <typename T, typename I, typename E> struct LoopRangeT {
+template <typename T, typename I, typename E> struct LooprangeT {
   using Begin = E;
   using End = E;
 
@@ -1346,7 +1346,7 @@ using TupleClausesT =
                  DoacrossT<T, I, E>, DynGroupprivateT<T, I, E>, FromT<T, I, E>,
                  GrainsizeT<T, I, E>, IfT<T, I, E>, InitT<T, I, E>,
                  InReductionT<T, I, E>, LastprivateT<T, I, E>, LinearT<T, I, E>,
-                 LoopRangeT<T, I, E>, MapT<T, I, E>, NumTasksT<T, I, E>,
+                 LooprangeT<T, I, E>, MapT<T, I, E>, NumTasksT<T, I, E>,
                  OrderT<T, I, E>, ReductionT<T, I, E>, ScheduleT<T, I, E>,
                  TaskReductionT<T, I, E>, ToT<T, I, E>>;
 
diff --git a/llvm/include/llvm/Frontend/OpenMP/OMP.td b/llvm/include/llvm/Frontend/OpenMP/OMP.td
index da70048d28c12..d9966068b605a 100644
--- a/llvm/include/llvm/Frontend/OpenMP/OMP.td
+++ b/llvm/include/llvm/Frontend/OpenMP/OMP.td
@@ -297,7 +297,7 @@ def OMPC_Link : Clause<[Spelling<"link">]> {
 }
 def OMPC_LoopRange : Clause<[Spelling<"looprange">]> {
   let clangClass = "OMPLoopRangeClause";
-  let flangClass = "OmpLoopRangeClause";
+  let flangClass = "OmpLooprangeClause";
 }
 def OMPC_Map : Clause<[Spelling<"map">]> {
   let clangClass = "OMPMapClause";
diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
index b801e212ceced..f5eb6222fd58d 100644
--- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
+++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
@@ -576,16 +576,33 @@ class OpenMPIRBuilder {
   using FinalizeCallbackTy = std::function<Error(InsertPointTy CodeGenIP)>;
 
   struct FinalizationInfo {
-    /// The finalization callback provided by the last in-flight invocation of
-    /// createXXXX for the directive of kind DK.
-    FinalizeCallbackTy FiniCB;
-
+    FinalizationInfo(FinalizeCallbackTy FiniCB, omp::Directive DK,
+                     bool IsCancellable)
+        : DK(DK), IsCancellable(IsCancellable), FiniCB(std::move(FiniCB)) {}
     /// The directive kind of the innermost directive that has an associated
     /// region which might require finalization when it is left.
-    omp::Directive DK;
+    const omp::Directive DK;
 
     /// Flag to indicate if the directive is cancellable.
-    bool IsCancellable;
+    const bool IsCancellable;
+
+    /// The basic block to which control should be transferred to
+    /// implement the FiniCB. Memoized to avoid generating finalization
+    /// multiple times.
+    Expected<BasicBlock *> getFiniBB(IRBuilderBase &Builder);
+
+    /// For cases where there is an unavoidable existing finalization block
+    /// (e.g. loop finalization after omp sections). The existing finalization
+    /// block must not contain any non-finalization code.
+    Error mergeFiniBB(IRBuilderBase &Builder, BasicBlock *ExistingFiniBB);
+
+  private:
+    /// Access via getFiniBB.
+    BasicBlock *FiniBB = nullptr;
+
+    /// The finalization callback provided by the last in-flight invocation of
+    /// createXXXX for the directive of kind DK.
+    FinalizeCallbackTy FiniCB;
   };
 
   /// Push a finalization callback on the finalization stack.
@@ -1751,9 +1768,10 @@ class OpenMPIRBuilder {
   ///                  need to be copied to the new function.
   ///
   /// \return The ListToGlobalCopy function.
-  Function *emitListToGlobalCopyFunction(ArrayRef<ReductionInfo> ReductionInfos,
-                                         Type *ReductionsBufferTy,
-                                         AttributeList FuncAttrs);
+  Expected<Function *>
+  emitListToGlobalCopyFunction(ArrayRef<ReductionInfo> ReductionInfos,
+                               Type *ReductionsBufferTy,
+                               AttributeList FuncAttrs, ArrayRef<bool> IsByRef);
 
   /// This function emits a helper that copies all the reduction variables from
   /// the team into the provided global buffer for the reduction variables.
@@ -1768,9 +1786,10 @@ class OpenMPIRBuilder {
   ///                  need to be copied to the new function.
   ///
   /// \return The GlobalToList function.
-  Function *emitGlobalToListCopyFunction(ArrayRef<ReductionInfo> ReductionInfos,
-                                         Type *ReductionsBufferTy,
-                                         AttributeList FuncAttrs);
+  Expected<Function *>
+  emitGlobalToListCopyFunction(ArrayRef<ReductionInfo> ReductionInfos,
+                               Type *ReductionsBufferTy,
+                               AttributeList FuncAttrs, ArrayRef<bool> IsByRef);
 
   /// This function emits a helper that reduces all the reduction variables from
   /// the team into the provided global buffer for the reduction variables.
@@ -1789,10 +1808,11 @@ class OpenMPIRBuilder {
   ///                  need to be copied to the new function.
   ///
   /// \return The ListToGlobalReduce function.
-  Function *
+  Expected<Function *>
   emitListToGlobalReduceFunction(ArrayRef<ReductionInfo> ReductionInfos,
                                  Function *ReduceFn, Type *ReductionsBufferTy,
-                                 AttributeList FuncAttrs);
+                                 AttributeList FuncAttrs,
+                                 ArrayRef<bool> IsByRef);
 
   /// This function emits a helper that reduces all the reduction variables from
   /// the team into the provided global buffer for the reduction variables.
@@ -1811,10 +1831,11 @@ class OpenMPIRBuilder {
   ///                  need to be copied to the new function.
   ///
   /// \return The GlobalToListReduce function.
-  Function *
+  Expected<Function *>
   emitGlobalToListReduceFunction(ArrayRef<ReductionInfo> ReductionInfos,
                                  Function *ReduceFn, Type *ReductionsBufferTy,
-                                 AttributeList FuncAttrs);
+                                 AttributeList FuncAttrs,
+                                 ArrayRef<bool> IsByRef);
 
   /// Get the function name of a reduction function.
   std::string getReductionFuncName(StringRef Name) const;
@@ -2246,8 +2267,7 @@ class OpenMPIRBuilder {
   ///
   /// \return an error, if any were triggered during execution.
   LLVM_ABI Error emitCancelationCheckImpl(Value *CancelFlag,
-                                          omp::Directive CanceledDirective,
-                                          FinalizeCallbackTy ExitCB = {});
+                                          omp::Directive CanceledDirective);
 
   /// Generate a target region entry call.
   ///
@@ -3402,7 +3422,8 @@ class OpenMPIRBuilder {
   /// Common interface to finalize the region
   ///
   /// \param OMPD Directive to generate exiting code for
-  /// \param FinIP Insertion point for emitting Finalization code and exit call
+  /// \param FinIP Insertion point for emitting Finalization code and exit call.
+  ///              This block must not contain any non-finalization code.
   /// \param ExitCall Call to the ending OMP Runtime Function
   /// \param HasFinalize indicate if the directive will require finalization
   ///         and has a finalization callback in the stack that
diff --git a/llvm/include/llvm/IR/Intrinsics.h b/llvm/include/llvm/IR/Intrinsics.h
index c91fc254ebe11..2c86a43e114ea 100644
--- a/llvm/include/llvm/IR/Intrinsics.h
+++ b/llvm/include/llvm/IR/Intrinsics.h
@@ -109,6 +109,21 @@ namespace Intrinsic {
   LLVM_ABI Function *getOrInsertDeclaration(Module *M, ID id,
                                             ArrayRef<Type *> Tys = {});
 
+  /// Look up the Function declaration of the intrinsic \p IID in the Module
+  /// \p M. If it does not exist, add a declaration and return it. Otherwise,
+  /// return the existing declaration.
+  ///
+  /// This overload automatically resolves overloaded intrinsics based on the
+  /// provided return type and argument types. For non-overloaded intrinsics,
+  /// the return type and argument types are ignored.
+  ///
+  /// \param M - The module in which to get or insert the intrinsic declaration.
+  /// \param IID - The intrinsic ID.
+  /// \param RetTy - The return type of the intrinsic.
+  /// \param ArgTys - The argument types of the intrinsic.
+  LLVM_ABI Function *getOrInsertDeclaration(Module *M, ID IID, Type *RetTy,
+                                            ArrayRef<Type *> ArgTys);
+
   /// Look up the Function declaration of the intrinsic \p id in the Module
   /// \p M and return it if it exists. Otherwise, return nullptr. This version
   /// supports non-overloaded intrinsics.
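
A minimal usage sketch of the new overload (illustrative only, not part of the
patch; Ctx and M stand for an existing LLVMContext and Module). For an
overloaded intrinsic such as llvm.umax, the overloaded types are derived from
the supplied return and argument types instead of being passed explicitly:

  Type *I32 = Type::getInt32Ty(Ctx);
  // Existing overload: the caller spells out the overloaded type list.
  Function *A = Intrinsic::getOrInsertDeclaration(&M, Intrinsic::umax, {I32});
  // New overload: the overloaded types are resolved from the call signature.
  Function *B = Intrinsic::getOrInsertDeclaration(&M, Intrinsic::umax,
                                                  /*RetTy=*/I32,
                                                  /*ArgTys=*/{I32, I32});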
diff --git a/llvm/include/llvm/IR/IntrinsicsARM.td b/llvm/include/llvm/IR/IntrinsicsARM.td
index ecadb235bec36..3b475c8d5614d 100644
--- a/llvm/include/llvm/IR/IntrinsicsARM.td
+++ b/llvm/include/llvm/IR/IntrinsicsARM.td
@@ -972,6 +972,13 @@ def int_arm_mve_vmaxnma_predicated: DefaultAttrsIntrinsic<[llvm_anyvector_ty],
    [LLVMMatchType<0>, LLVMMatchType<0>, llvm_anyvector_ty],
     [IntrNoMem]>;
 
+def int_arm_mve_vminnm: DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+   [LLVMMatchType<0>, LLVMMatchType<0>],
+    [IntrNoMem]>;
+def int_arm_mve_vmaxnm: DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+   [LLVMMatchType<0>, LLVMMatchType<0>],
+    [IntrNoMem]>;
+
 multiclass MVEPredicated<list<LLVMType> rets, list<LLVMType> params,
                          LLVMType pred = llvm_anyvector_ty,
                          list<IntrinsicProperty> props = [IntrNoMem],
@@ -1362,6 +1369,9 @@ def int_arm_mve_vqmovn_predicated: DefaultAttrsIntrinsic<[llvm_anyvector_ty],
     llvm_i32_ty /* unsigned output */, llvm_i32_ty /* unsigned input */,
     llvm_i32_ty /* top half */, llvm_anyvector_ty /* pred */], [IntrNoMem]>;
 
+def int_arm_mve_fma: DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+   [LLVMMatchType<0> /* mult op #1 */, LLVMMatchType<0> /* mult op #2 */,
+    LLVMMatchType<0> /* addend */], [IntrNoMem]>;
 // fma_predicated returns the add operand for disabled lanes.
 def int_arm_mve_fma_predicated: DefaultAttrsIntrinsic<[llvm_anyvector_ty],
    [LLVMMatchType<0> /* mult op #1 */, LLVMMatchType<0> /* mult op #2 */,
diff --git a/llvm/include/llvm/IR/IntrinsicsRISCVXCV.td b/llvm/include/llvm/IR/IntrinsicsRISCVXCV.td
index 9f6a9964903ae..465665c838bae 100644
--- a/llvm/include/llvm/IR/IntrinsicsRISCVXCV.td
+++ b/llvm/include/llvm/IR/IntrinsicsRISCVXCV.td
@@ -90,4 +90,8 @@ let TargetPrefix = "riscv" in {
   def int_riscv_cv_mac_machhuRN : ScalarCoreVMacGprGprGprImmIntrinsic;
   def int_riscv_cv_mac_macsRN   : ScalarCoreVMacGprGprGprImmIntrinsic;
   def int_riscv_cv_mac_machhsRN : ScalarCoreVMacGprGprGprImmIntrinsic;
+
+  def int_riscv_cv_elw_elw
+    : Intrinsic<[llvm_i32_ty], [llvm_ptr_ty],
+                [IntrReadMem, IntrArgMemOnly, IntrHasSideEffects]>;
 } // TargetPrefix = "riscv"
diff --git a/llvm/include/llvm/InitializePasses.h b/llvm/include/llvm/InitializePasses.h
index 10a4d8525a9e8..c718e29b99ff4 100644
--- a/llvm/include/llvm/InitializePasses.h
+++ b/llvm/include/llvm/InitializePasses.h
@@ -133,6 +133,7 @@ LLVM_ABI void initializeGlobalMergeFuncPassWrapperPass(PassRegistry &);
 LLVM_ABI void initializeGlobalMergePass(PassRegistry &);
 LLVM_ABI void initializeGlobalsAAWrapperPassPass(PassRegistry &);
 LLVM_ABI void initializeHardwareLoopsLegacyPass(PassRegistry &);
+LLVM_ABI void initializeLibcallLoweringInfoWrapperPass(PassRegistry &);
 LLVM_ABI void initializeMIRProfileLoaderPassPass(PassRegistry &);
 LLVM_ABI void initializeIRSimilarityIdentifierWrapperPassPass(PassRegistry &);
 LLVM_ABI void initializeIRTranslatorPass(PassRegistry &);
diff --git a/llvm/include/llvm/MC/MCObjectFileInfo.h b/llvm/include/llvm/MC/MCObjectFileInfo.h
index ed7f462c3c598..51b7d73d46036 100644
--- a/llvm/include/llvm/MC/MCObjectFileInfo.h
+++ b/llvm/include/llvm/MC/MCObjectFileInfo.h
@@ -29,10 +29,6 @@ class MCSection;
 
 class LLVM_ABI MCObjectFileInfo {
 protected:
-  /// True if target object file supports a weak_definition of constant 0 for an
-  /// omitted EH frame.
-  bool SupportsWeakOmittedEHFrame = false;
-
   /// True if the target object file supports emitting a compact unwind section
   /// without an associated EH frame section.
   bool SupportsCompactUnwindWithoutEHFrame = false;
@@ -260,9 +256,6 @@ class LLVM_ABI MCObjectFileInfo {
   virtual ~MCObjectFileInfo();
   MCContext &getContext() const { return *Ctx; }
 
-  bool getSupportsWeakOmittedEHFrame() const {
-    return SupportsWeakOmittedEHFrame;
-  }
   bool getSupportsCompactUnwindWithoutEHFrame() const {
     return SupportsCompactUnwindWithoutEHFrame;
   }
diff --git a/llvm/include/llvm/ProfileData/SampleProf.h b/llvm/include/llvm/ProfileData/SampleProf.h
index 05f1b568b0643..d63714780afef 100644
--- a/llvm/include/llvm/ProfileData/SampleProf.h
+++ b/llvm/include/llvm/ProfileData/SampleProf.h
@@ -1658,6 +1658,7 @@ class ProfileSymbolList {
   }
 
   unsigned size() { return Syms.size(); }
+  void reserve(size_t Size) { Syms.reserve(Size); }
 
   void setToCompress(bool TC) { ToCompress = TC; }
   bool toCompress() { return ToCompress; }
diff --git a/llvm/include/llvm/Support/DebugCounter.h b/llvm/include/llvm/Support/DebugCounter.h
index 9904a0dd86559..979d6b8e62f23 100644
--- a/llvm/include/llvm/Support/DebugCounter.h
+++ b/llvm/include/llvm/Support/DebugCounter.h
@@ -44,8 +44,8 @@
 
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/MapVector.h"
 #include "llvm/ADT/StringRef.h"
-#include "llvm/ADT/UniqueVector.h"
 #include "llvm/Support/Compiler.h"
 #include "llvm/Support/Debug.h"
 #include <string>
@@ -63,6 +63,29 @@ class DebugCounter {
     bool contains(int64_t Idx) const { return Idx >= Begin && Idx <= End; }
   };
 
+  /// Struct to store counter info.
+  class CounterInfo {
+    friend class DebugCounter;
+
+    /// Whether counting should be enabled, either due to -debug-counter or
+    /// -print-debug-counter.
+    bool Active = false;
+    /// Whether chunks for the counter are set (differs from Active in that
+    /// -print-debug-counter uses Active=true, IsSet=false).
+    bool IsSet = false;
+
+    int64_t Count = 0;
+    uint64_t CurrChunkIdx = 0;
+    StringRef Name;
+    StringRef Desc;
+    SmallVector<Chunk> Chunks;
+
+  public:
+    CounterInfo(StringRef Name, StringRef Desc) : Name(Name), Desc(Desc) {
+      DebugCounter::registerCounter(this);
+    }
+  };
+
   LLVM_ABI static void printChunks(raw_ostream &OS, ArrayRef<Chunk>);
 
   /// Return true on parsing error and print the error message on the
@@ -75,28 +98,26 @@ class DebugCounter {
   // Used by the command line option parser to push a new value it parsed.
   LLVM_ABI void push_back(const std::string &);
 
-  // Register a counter with the specified name.
+  // Register a counter with the specified counter information.
   //
   // FIXME: Currently, counter registration is required to happen before command
   // line option parsing. The main reason to register counters is to produce a
   // nice list of them on the command line, but i'm not sure this is worth it.
-  static unsigned registerCounter(StringRef Name, StringRef Desc) {
-    return instance().addCounter(std::string(Name), std::string(Desc));
+  static void registerCounter(CounterInfo *Info) {
+    instance().addCounter(Info);
   }
-  LLVM_ABI static bool shouldExecuteImpl(unsigned CounterName);
+  LLVM_ABI static bool shouldExecuteImpl(CounterInfo &Counter);
 
-  inline static bool shouldExecute(unsigned CounterName) {
-    if (!isCountingEnabled())
+  inline static bool shouldExecute(CounterInfo &Counter) {
+    if (!Counter.Active)
       return true;
-    return shouldExecuteImpl(CounterName);
+    return shouldExecuteImpl(Counter);
   }
 
   // Return true if a given counter had values set (either programatically or on
   // the command line).  This will return true even if those values are
   // currently in a state where the counter will always execute.
-  static bool isCounterSet(unsigned ID) {
-    return instance().Counters[ID].IsSet;
-  }
+  static bool isCounterSet(CounterInfo &Info) { return Info.IsSet; }
 
   struct CounterState {
     int64_t Count;
@@ -104,19 +125,14 @@ class DebugCounter {
   };
 
   // Return the state of a counter. This only works for set counters.
-  static CounterState getCounterState(unsigned ID) {
-    auto &Us = instance();
-    auto Result = Us.Counters.find(ID);
-    assert(Result != Us.Counters.end() && "Asking about a non-set counter");
-    return {Result->second.Count, Result->second.CurrChunkIdx};
+  static CounterState getCounterState(CounterInfo &Info) {
+    return {Info.Count, Info.CurrChunkIdx};
   }
 
   // Set a registered counter to a given state.
-  static void setCounterState(unsigned ID, CounterState State) {
-    auto &Us = instance();
-    auto &Counter = Us.Counters[ID];
-    Counter.Count = State.Count;
-    Counter.CurrChunkIdx = State.ChunkIdx;
+  static void setCounterState(CounterInfo &Info, CounterState State) {
+    Info.Count = State.Count;
+    Info.CurrChunkIdx = State.ChunkIdx;
   }
 
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
@@ -126,66 +142,38 @@ class DebugCounter {
 
   LLVM_ABI void print(raw_ostream &OS) const;
 
-  // Get the counter ID for a given named counter, or return 0 if none is found.
-  unsigned getCounterId(const std::string &Name) const {
-    return RegisteredCounters.idFor(Name);
+  // Get the counter info for a given named counter,
+  // or return null if none is found.
+  CounterInfo *getCounterInfo(StringRef Name) const {
+    return Counters.lookup(Name);
   }
 
   // Return the number of registered counters.
-  unsigned int getNumCounters() const { return RegisteredCounters.size(); }
+  unsigned int getNumCounters() const { return Counters.size(); }
 
-  // Return the name and description of the counter with the given ID.
-  std::pair<std::string, std::string> getCounterInfo(unsigned ID) const {
-    return {RegisteredCounters[ID], Counters.lookup(ID).Desc};
+  // Return the name and description of the counter with the given info.
+  std::pair<StringRef, StringRef> getCounterDesc(CounterInfo *Info) const {
+    return {Info->Name, Info->Desc};
   }
 
   // Iterate through the registered counters
-  using CounterVector = UniqueVector<std::string>;
-  CounterVector::const_iterator begin() const {
-    return RegisteredCounters.begin();
+  MapVector<StringRef, CounterInfo *>::const_iterator begin() const {
+    return Counters.begin();
+  }
+  MapVector<StringRef, CounterInfo *>::const_iterator end() const {
+    return Counters.end();
   }
-  CounterVector::const_iterator end() const { return RegisteredCounters.end(); }
 
-  // Force-enables counting all DebugCounters.
-  //
-  // Since DebugCounters are incompatible with threading (not only do they not
-  // make sense, but we'll also see data races), this should only be used in
-  // contexts where we're certain we won't spawn threads.
-  static void enableAllCounters() { instance().Enabled = true; }
-
-  static bool isCountingEnabled() {
-// Compile to nothing when debugging is off
-#ifdef NDEBUG
-    return false;
-#else
-    return instance().Enabled || instance().ShouldPrintCounter;
-#endif
+  void activateAllCounters() {
+    for (auto &[_, Counter] : Counters)
+      Counter->Active = true;
   }
 
 protected:
-  unsigned addCounter(const std::string &Name, const std::string &Desc) {
-    unsigned Result = RegisteredCounters.insert(Name);
-    auto &C = Counters[Result];
-    C = {};
-    C.Desc = Desc;
-    return Result;
-  }
-  // Struct to store counter info.
-  struct CounterInfo {
-    int64_t Count = 0;
-    uint64_t CurrChunkIdx = 0;
-    bool IsSet = false;
-    std::string Desc;
-    SmallVector<Chunk> Chunks;
-  };
+  void addCounter(CounterInfo *Info) { Counters[Info->Name] = Info; }
   bool handleCounterIncrement(CounterInfo &Info);
 
-  DenseMap<unsigned, CounterInfo> Counters;
-  CounterVector RegisteredCounters;
-
-  // Whether we should do DebugCounting at all. DebugCounters aren't
-  // thread-safe, so this should always be false in multithreaded scenarios.
-  bool Enabled = false;
+  MapVector<StringRef, CounterInfo *> Counters;
 
   bool ShouldPrintCounter = false;
 
@@ -195,8 +183,7 @@ class DebugCounter {
 };
 
 #define DEBUG_COUNTER(VARNAME, COUNTERNAME, DESC)                              \
-  static const unsigned VARNAME =                                              \
-      DebugCounter::registerCounter(COUNTERNAME, DESC)
+  static DebugCounter::CounterInfo VARNAME(COUNTERNAME, DESC)
 
 } // namespace llvm
 #endif
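
A minimal usage sketch of the reworked registration API (illustrative; the
counter and transform names are hypothetical). DEBUG_COUNTER now defines a
self-registering CounterInfo object, and call sites pass that object instead
of an unsigned ID:

  #include "llvm/Support/DebugCounter.h"
  using namespace llvm;

  DEBUG_COUNTER(FooCounter, "foo-transform",
                "Controls which runs of the hypothetical foo transform fire");

  static bool tryFooTransform() {
    // The CounterInfo object is passed by reference; no lookup by ID.
    if (!DebugCounter::shouldExecute(FooCounter))
      return false; // skipped, e.g. under -debug-counter=foo-transform=...
    // ... perform the transform ...
    return true;
  }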
diff --git a/llvm/include/llvm/Support/LSP/Protocol.h b/llvm/include/llvm/Support/LSP/Protocol.h
index 6a3b0a517819f..30d68bad6691a 100644
--- a/llvm/include/llvm/Support/LSP/Protocol.h
+++ b/llvm/include/llvm/Support/LSP/Protocol.h
@@ -1269,6 +1269,33 @@ struct CodeAction {
 /// Add support for JSON serialization.
 LLVM_ABI_FOR_TEST llvm::json::Value toJSON(const CodeAction &);
 
+//===----------------------------------------------------------------------===//
+//  ShowMessageParams
+//===----------------------------------------------------------------------===//
+
+enum class MessageType { Error = 1, Warning = 2, Info = 3, Log = 4, Debug = 5 };
+
+struct MessageActionItem {
+  /// A short title like 'Retry', 'Open Log' etc.
+  std::string title;
+};
+
+struct ShowMessageParams {
+  ShowMessageParams(MessageType Type, std::string Message)
+      : type(Type), message(Message) {}
+  MessageType type;
+  /// The actual message.
+  std::string message;
+  /// The message action items to present.
+  std::optional<std::vector<MessageActionItem>> actions;
+};
+
+/// Add support for JSON serialization.
+llvm::json::Value toJSON(const MessageActionItem &Params);
+
+/// Add support for JSON serialization.
+llvm::json::Value toJSON(const ShowMessageParams &Params);
+
 } // namespace lsp
 } // namespace llvm
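
A small sketch of how the new params might be constructed and serialized
(illustrative; the message text is made up). The serialized shape follows the
struct above: a numeric type, the message string, and optional actions:

  using namespace llvm::lsp;
  ShowMessageParams Params(MessageType::Warning, "index is out of date");
  llvm::json::Value V = toJSON(Params);
  // V is roughly: {"type": 2, "message": "index is out of date"}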
 
diff --git a/llvm/include/llvm/Support/TargetOpcodes.def b/llvm/include/llvm/Support/TargetOpcodes.def
index 341fc5e50b33c..0d92f50a09d38 100644
--- a/llvm/include/llvm/Support/TargetOpcodes.def
+++ b/llvm/include/llvm/Support/TargetOpcodes.def
@@ -114,6 +114,12 @@ HANDLE_TARGET_OPCODE(REG_SEQUENCE)
 /// used to copy between subregisters of virtual registers.
 HANDLE_TARGET_OPCODE(COPY)
 
+/// COPY_LANEMASK - Target-independent partial register copy. The laneMask
+/// operand indicates which parts of the source register are copied to the
+/// destination. Other parts of the destination are undefined. It does not
+/// support copies between virtual registers that have subregister indices.
+HANDLE_TARGET_OPCODE(COPY_LANEMASK)
+
 /// BUNDLE - This instruction represents an instruction bundle. Instructions
 /// which immediately follow a BUNDLE instruction which are marked with
 /// 'InsideBundle' flag are inside the bundle.
diff --git a/llvm/include/llvm/Target/Target.td b/llvm/include/llvm/Target/Target.td
index ef2ccb0abeb1e..315de55b75510 100644
--- a/llvm/include/llvm/Target/Target.td
+++ b/llvm/include/llvm/Target/Target.td
@@ -1352,6 +1352,13 @@ def COPY : StandardPseudoInstruction {
   let isAsCheapAsAMove = true;
   let hasNoSchedulingInfo = false;
 }
+def COPY_LANEMASK : StandardPseudoInstruction {
+  let OutOperandList = (outs unknown:$dst);
+  let InOperandList = (ins unknown:$src, unknown:$lanemask);
+  let AsmString = "";
+  let hasSideEffects = false;
+  let isAsCheapAsAMove = true;
+}
 def BUNDLE : StandardPseudoInstruction {
   let OutOperandList = (outs);
   let InOperandList = (ins variable_ops);
diff --git a/llvm/include/llvm/Transforms/Instrumentation/AllocToken.h b/llvm/include/llvm/Transforms/Instrumentation/AllocToken.h
index 077703c214745..299fc03c5d96b 100644
--- a/llvm/include/llvm/Transforms/Instrumentation/AllocToken.h
+++ b/llvm/include/llvm/Transforms/Instrumentation/AllocToken.h
@@ -25,7 +25,7 @@ class Module;
 
 struct AllocTokenOptions {
   AllocTokenMode Mode = DefaultAllocTokenMode;
-  std::optional<uint64_t> MaxTokens;
+  uint64_t MaxTokens = 0;
   bool FastABI = false;
   bool Extended = false;
   AllocTokenOptions() = default;
diff --git a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
index ecbd0ef7df5e5..1b37aabaafae8 100644
--- a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
+++ b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
@@ -196,6 +196,15 @@ class LoopVectorizeHints {
 
   /// Interface to emit optimization remarks.
   OptimizationRemarkEmitter &ORE;
+
+  /// Reports a condition where loop vectorization is disallowed: prints
+  /// \p DebugMsg for debugging purposes along with the corresponding
+  /// optimization remark \p RemarkName, with \p RemarkMsg as the user-facing
+  /// message. The loop \p L is used for the location of the remark.
+  void reportDisallowedVectorization(const StringRef DebugMsg,
+                                     const StringRef RemarkName,
+                                     const StringRef RemarkMsg,
+                                     const Loop *L) const;
 };
 
 /// This holds vectorization requirements that must be verified late in
diff --git a/llvm/lib/Analysis/CFGPrinter.cpp b/llvm/lib/Analysis/CFGPrinter.cpp
index 38aad849755be..39108a906f081 100644
--- a/llvm/lib/Analysis/CFGPrinter.cpp
+++ b/llvm/lib/Analysis/CFGPrinter.cpp
@@ -92,8 +92,10 @@ static void viewCFG(Function &F, const BlockFrequencyInfo *BFI,
 }
 
 DOTFuncInfo::DOTFuncInfo(const Function *F, const BlockFrequencyInfo *BFI,
-                         const BranchProbabilityInfo *BPI, uint64_t MaxFreq)
-    : F(F), BFI(BFI), BPI(BPI), MaxFreq(MaxFreq) {
+                         const BranchProbabilityInfo *BPI, uint64_t MaxFreq,
+                         std::optional<NodeIdFormatterTy> NodeIdFormatter)
+    : F(F), BFI(BFI), BPI(BPI), MaxFreq(MaxFreq),
+      NodeIdFormatter(NodeIdFormatter) {
   ShowHeat = false;
   EdgeWeights = !!BPI; // Print EdgeWeights when BPI is available.
   RawWeights = !!BFI;  // Print RawWeights when BFI is available.
diff --git a/llvm/lib/Analysis/ConstantFolding.cpp b/llvm/lib/Analysis/ConstantFolding.cpp
index 916154b465af4..63d12ee585e64 100644
--- a/llvm/lib/Analysis/ConstantFolding.cpp
+++ b/llvm/lib/Analysis/ConstantFolding.cpp
@@ -3348,8 +3348,12 @@ static Constant *ConstantFoldIntrinsicCall2(Intrinsic::ID IntrinsicID, Type *Ty,
       case Intrinsic::copysign:
         return ConstantFP::get(Ty->getContext(), APFloat::copySign(Op1V, Op2V));
       case Intrinsic::minnum:
+        if (Op1V.isSignaling() || Op2V.isSignaling())
+          return nullptr;
         return ConstantFP::get(Ty->getContext(), minnum(Op1V, Op2V));
       case Intrinsic::maxnum:
+        if (Op1V.isSignaling() || Op2V.isSignaling())
+          return nullptr;
         return ConstantFP::get(Ty->getContext(), maxnum(Op1V, Op2V));
       case Intrinsic::minimum:
         return ConstantFP::get(Ty->getContext(), minimum(Op1V, Op2V));
diff --git a/llvm/lib/Analysis/Delinearization.cpp b/llvm/lib/Analysis/Delinearization.cpp
index 0c3b02ae09f47..686622feec477 100644
--- a/llvm/lib/Analysis/Delinearization.cpp
+++ b/llvm/lib/Analysis/Delinearization.cpp
@@ -747,6 +747,26 @@ bool llvm::validateDelinearizationResult(ScalarEvolution &SE,
                                          ArrayRef<const SCEV *> Sizes,
                                          ArrayRef<const SCEV *> Subscripts,
                                          const Value *Ptr) {
+  // Sizes and Subscripts are as follows:
+  //
+  //   Sizes:      [UNK][S_2]...[S_n]
+  //   Subscripts: [I_1][I_2]...[I_n]
+  //
+  // where the size of the outermost dimension is unknown (UNK).
+
+  auto AddOverflow = [&](const SCEV *A, const SCEV *B) -> const SCEV * {
+    if (!SE.willNotOverflow(Instruction::Add, /*IsSigned=*/true, A, B))
+      return nullptr;
+    return SE.getAddExpr(A, B);
+  };
+
+  auto MulOverflow = [&](const SCEV *A, const SCEV *B) -> const SCEV * {
+    if (!SE.willNotOverflow(Instruction::Mul, /*IsSigned=*/true, A, B))
+      return nullptr;
+    return SE.getMulExpr(A, B);
+  };
+
+  // Range check: 0 <= I_k < S_k for k = 2..n.
   for (size_t I = 1; I < Sizes.size(); ++I) {
     const SCEV *Size = Sizes[I - 1];
     const SCEV *Subscript = Subscripts[I];
@@ -755,6 +775,49 @@ bool llvm::validateDelinearizationResult(ScalarEvolution &SE,
     if (!isKnownLessThan(&SE, Subscript, Size))
       return false;
   }
+
+  // The offset computation is as follows:
+  //
+  //   Offset = I_n +
+  //            S_n * I_{n-1} +
+  //            ... +
+  //            (S_2 * ... * S_n) * I_1
+  //
+  // Regarding this as a function from (I_1, I_2, ..., I_n) to integers, it
+  // must be injective. To guarantee it, the above calculation must not
+  // overflow. Since we have already checked that 0 <= I_k < S_k for k = 2..n,
+  // the minimum and maximum values occur in the following cases:
+  //
+  //   Min = [I_1][0]...[0] = S_2 * ... * S_n * I_1
+  //   Max = [I_1][S_2-1]...[S_n-1]
+  //       = (S_2 * ... * S_n) * I_1 +
+  //         (S_2 * ... * S_{n-1}) * (S_2 - 1) +
+  //         ... +
+  //         (S_n - 1)
+  //       = (S_2 * ... * S_n) * I_1 +
+  //         (S_2 * ... * S_n) - 1  (can be proven by induction)
+  //       = Min + (S_2 * ... * S_n) - 1
+  //
+  // NOTE: I_1 can be negative, so Min is not just 0.
+  const SCEV *Prod = SE.getOne(Sizes[0]->getType());
+  for (const SCEV *Size : Sizes) {
+    Prod = MulOverflow(Prod, Size);
+    if (!Prod)
+      return false;
+  }
+  const SCEV *Min = MulOverflow(Prod, Subscripts[0]);
+  if (!Min)
+    return false;
+
+  // We have already checked that Min and Prod don't overflow, so it's enough
+  // to check whether Min + Prod - 1 doesn't overflow.
+  const SCEV *MaxPlusOne = AddOverflow(Min, Prod);
+  if (!MaxPlusOne)
+    return false;
+  if (!SE.willNotOverflow(Instruction::Sub, /*IsSigned=*/true, MaxPlusOne,
+                          SE.getOne(MaxPlusOne->getType())))
+    return false;
+
   return true;
 }
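
A concrete (illustrative) instance of the bound checked above: with
Sizes = [UNK][4][8] and Subscripts = [I_1][I_2][I_3],

  Offset = I_3 + 8*I_2 + 32*I_1
  Min    = 32*I_1
  Max    = 7 + 8*3 + 32*I_1 = Min + 31 = Min + (4*8) - 1

which matches Max = Min + (S_2 * ... * S_n) - 1, so the validation only needs
to prove that Prod = 32, Min = Prod*I_1, Min + Prod, and (Min + Prod) - 1 do
not overflow in the signed sense.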
 
diff --git a/llvm/lib/Analysis/DependenceAnalysis.cpp b/llvm/lib/Analysis/DependenceAnalysis.cpp
index 253f4d1441098..fe07b7edb6713 100644
--- a/llvm/lib/Analysis/DependenceAnalysis.cpp
+++ b/llvm/lib/Analysis/DependenceAnalysis.cpp
@@ -1296,7 +1296,8 @@ bool DependenceInfo::testZIV(const SCEV *Src, const SCEV *Dst,
 bool DependenceInfo::strongSIVtest(const SCEV *Coeff, const SCEV *SrcConst,
                                    const SCEV *DstConst, const Loop *CurSrcLoop,
                                    const Loop *CurDstLoop, unsigned Level,
-                                   FullDependence &Result) const {
+                                   FullDependence &Result,
+                                   bool UnderRuntimeAssumptions) {
   if (!isDependenceTestEnabled(DependenceTestType::StrongSIV))
     return false;
 
@@ -1368,7 +1369,29 @@ bool DependenceInfo::strongSIVtest(const SCEV *Coeff, const SCEV *SrcConst,
       Result.DV[Level].Direction &= Dependence::DVEntry::EQ;
     ++StrongSIVsuccesses;
   } else if (Delta->isZero()) {
-    // since 0/X == 0
+    // Check if coefficient could be zero. If so, 0/0 is undefined and we
+    // cannot conclude that only same-iteration dependencies exist.
+    // When coeff=0, all iterations access the same location.
+    if (SE->isKnownNonZero(Coeff)) {
+      LLVM_DEBUG(
+          dbgs() << "\t    Coefficient proven non-zero by SCEV analysis\n");
+    } else {
+      // Cannot prove this at compile time; a runtime assumption would be needed.
+      if (UnderRuntimeAssumptions) {
+        const SCEVPredicate *Pred = SE->getComparePredicate(
+            ICmpInst::ICMP_NE, Coeff, SE->getZero(Coeff->getType()));
+        Result.Assumptions = Result.Assumptions.getUnionWith(Pred, *SE);
+        LLVM_DEBUG(dbgs() << "\t    Added runtime assumption: " << *Coeff
+                          << " != 0\n");
+      } else {
+        // Cannot add runtime assumptions; this test cannot handle this case.
+        // Let more complex tests try.
+        LLVM_DEBUG(dbgs() << "\t    Would need runtime assumption " << *Coeff
+                          << " != 0, but not allowed. Failing this test.\n");
+        return false;
+      }
+    }
+    // Since 0/X == 0 (where X is known non-zero or assumed non-zero).
     Result.DV[Level].Distance = Delta;
     Result.DV[Level].Direction &= Dependence::DVEntry::EQ;
     ++StrongSIVsuccesses;
@@ -2331,7 +2354,8 @@ bool DependenceInfo::symbolicRDIVtest(const SCEV *A1, const SCEV *A2,
 //
 // Return true if dependence disproved.
 bool DependenceInfo::testSIV(const SCEV *Src, const SCEV *Dst, unsigned &Level,
-                             FullDependence &Result) const {
+                             FullDependence &Result,
+                             bool UnderRuntimeAssumptions) {
   LLVM_DEBUG(dbgs() << "    src = " << *Src << "\n");
   LLVM_DEBUG(dbgs() << "    dst = " << *Dst << "\n");
   const SCEVAddRecExpr *SrcAddRec = dyn_cast<SCEVAddRecExpr>(Src);
@@ -2349,8 +2373,9 @@ bool DependenceInfo::testSIV(const SCEV *Src, const SCEV *Dst, unsigned &Level,
     Level = mapSrcLoop(CurSrcLoop);
     bool disproven;
     if (SrcCoeff == DstCoeff)
-      disproven = strongSIVtest(SrcCoeff, SrcConst, DstConst, CurSrcLoop,
-                                CurDstLoop, Level, Result);
+      disproven =
+          strongSIVtest(SrcCoeff, SrcConst, DstConst, CurSrcLoop, CurDstLoop,
+                        Level, Result, UnderRuntimeAssumptions);
     else if (SrcCoeff == SE->getNegativeSCEV(DstCoeff))
       disproven = weakCrossingSIVtest(SrcCoeff, SrcConst, DstConst, CurSrcLoop,
                                       CurDstLoop, Level, Result);
@@ -2582,33 +2607,17 @@ bool DependenceInfo::gcdMIVtest(const SCEV *Src, const SCEV *Dst,
   const SCEV *DstConst = Coefficients;
 
   APInt ExtraGCD = APInt::getZero(BitWidth);
-  const SCEV *Delta = SE->getMinusSCEV(DstConst, SrcConst);
+  const SCEV *Delta = minusSCEVNoSignedOverflow(DstConst, SrcConst, *SE);
+  if (!Delta)
+    return false;
   LLVM_DEBUG(dbgs() << "    Delta = " << *Delta << "\n");
   const SCEVConstant *Constant = dyn_cast<SCEVConstant>(Delta);
-  if (const SCEVAddExpr *Sum = dyn_cast<SCEVAddExpr>(Delta)) {
-    // If Delta is a sum of products, we may be able to make further progress.
-    for (const SCEV *Operand : Sum->operands()) {
-      if (isa<SCEVConstant>(Operand)) {
-        assert(!Constant && "Surprised to find multiple constants");
-        Constant = cast<SCEVConstant>(Operand);
-      } else if (const SCEVMulExpr *Product = dyn_cast<SCEVMulExpr>(Operand)) {
-        // Search for constant operand to participate in GCD;
-        // If none found; return false.
-        std::optional<APInt> ConstOp = getConstanCoefficient(Product);
-        if (!ConstOp)
-          return false;
-        ExtraGCD = APIntOps::GreatestCommonDivisor(ExtraGCD, ConstOp->abs());
-      } else
-        return false;
-    }
-  }
   if (!Constant)
     return false;
   APInt ConstDelta = cast<SCEVConstant>(Constant)->getAPInt();
   LLVM_DEBUG(dbgs() << "    ConstDelta = " << ConstDelta << "\n");
   if (ConstDelta == 0)
     return false;
-  RunningGCD = APIntOps::GreatestCommonDivisor(RunningGCD, ExtraGCD);
   LLVM_DEBUG(dbgs() << "    RunningGCD = " << RunningGCD << "\n");
   APInt Remainder = ConstDelta.srem(RunningGCD);
   if (Remainder != 0) {
@@ -3479,11 +3488,10 @@ DependenceInfo::depends(Instruction *Src, Instruction *Dst,
                                         SCEVUnionPredicate(Assume, *SE));
   }
 
-  if (!Assume.empty() && !UnderRuntimeAssumptions) {
-    // Runtime assumptions needed but not allowed.
+  // Runtime assumptions needed but not allowed.
+  if (!Assume.empty() && !UnderRuntimeAssumptions)
     return std::make_unique<Dependence>(Src, Dst,
                                         SCEVUnionPredicate(Assume, *SE));
-  }
 
   unsigned Pairs = 1;
   SmallVector<Subscript, 2> Pair(Pairs);
@@ -3583,7 +3591,8 @@ DependenceInfo::depends(Instruction *Src, Instruction *Dst,
     case Subscript::SIV: {
       LLVM_DEBUG(dbgs() << ", SIV\n");
       unsigned Level;
-      if (testSIV(Pair[SI].Src, Pair[SI].Dst, Level, Result))
+      if (testSIV(Pair[SI].Src, Pair[SI].Dst, Level, Result,
+                  UnderRuntimeAssumptions))
         return nullptr;
       break;
     }
@@ -3664,6 +3673,7 @@ DependenceInfo::depends(Instruction *Src, Instruction *Dst,
   } else {
     // On the other hand, if all directions are equal and there's no
     // loop-independent dependence possible, then no dependence exists.
+    // However, if there are runtime assumptions, we must return the result.
     bool AllEqual = true;
     for (unsigned II = 1; II <= CommonLevels; ++II) {
       if (Result.getDirection(II) != Dependence::DVEntry::EQ) {
@@ -3671,7 +3681,7 @@ DependenceInfo::depends(Instruction *Src, Instruction *Dst,
         break;
       }
     }
-    if (AllEqual)
+    if (AllEqual && Result.Assumptions.getPredicates().empty())
       return nullptr;
   }
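
An illustrative source pattern (not taken from the patch) where the new
zero-coefficient handling matters; K is a hypothetical runtime value:

  void scale(int *A, int K, int N) {
    for (int I = 0; I < N; ++I)
      A[K * I] += 1; // if K == 0, every iteration reads and writes A[0]
  }

With UnderRuntimeAssumptions the strong SIV test can record the predicate
K != 0 and still report a dependence distance of 0; without it, the test now
bails out and lets the more general tests handle the pair.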
 
diff --git a/llvm/lib/Analysis/IVDescriptors.cpp b/llvm/lib/Analysis/IVDescriptors.cpp
index 4d21f1c7e2de2..7624e0ed6f2b0 100644
--- a/llvm/lib/Analysis/IVDescriptors.cpp
+++ b/llvm/lib/Analysis/IVDescriptors.cpp
@@ -216,6 +216,52 @@ static bool checkOrderedReduction(RecurKind Kind, Instruction *ExactFPMathInst,
   return true;
 }
 
+/// Returns true if \p Phi is a min/max reduction matching \p Kind where \p Phi
+/// is used outside the reduction chain. This is common for loops selecting the
+/// index of a minimum/maximum value (argmin/argmax).
+static bool isMinMaxReductionPhiWithUsersOutsideReductionChain(
+    PHINode *Phi, RecurKind Kind, Loop *TheLoop, RecurrenceDescriptor &RedDes) {
+  BasicBlock *Latch = TheLoop->getLoopLatch();
+  if (!Latch)
+    return false;
+
+  assert(Phi->getNumIncomingValues() == 2 && "phi must have 2 incoming values");
+  Value *Inc = Phi->getIncomingValueForBlock(Latch);
+  if (Phi->hasOneUse() || !Inc->hasOneUse() ||
+      !RecurrenceDescriptor::isIntMinMaxRecurrenceKind(Kind))
+    return false;
+
+  Value *A, *B;
+  bool IsMinMax = [&]() {
+    switch (Kind) {
+    case RecurKind::UMax:
+      return match(Inc, m_UMax(m_Value(A), m_Value(B)));
+    case RecurKind::UMin:
+      return match(Inc, m_UMin(m_Value(A), m_Value(B)));
+    case RecurKind::SMax:
+      return match(Inc, m_SMax(m_Value(A), m_Value(B)));
+    case RecurKind::SMin:
+      return match(Inc, m_SMin(m_Value(A), m_Value(B)));
+    default:
+      llvm_unreachable("all min/max kinds must be handled");
+    }
+  }();
+  if (!IsMinMax)
+    return false;
+
+  if (A == B || (A != Phi && B != Phi))
+    return false;
+
+  SmallPtrSet<Instruction *, 4> CastInsts;
+  Value *RdxStart = Phi->getIncomingValueForBlock(TheLoop->getLoopPreheader());
+  RedDes =
+      RecurrenceDescriptor(RdxStart, /*Exit=*/nullptr, /*Store=*/nullptr, Kind,
+                           FastMathFlags(), /*ExactFP=*/nullptr, Phi->getType(),
+                           /*Signed=*/false, /*Ordered=*/false, CastInsts,
+                           /*MinWidthCastToRecurTy=*/-1U, /*PhiMultiUse=*/true);
+  return true;
+}
+
 bool RecurrenceDescriptor::AddReductionVar(
     PHINode *Phi, RecurKind Kind, Loop *TheLoop, FastMathFlags FuncFMF,
     RecurrenceDescriptor &RedDes, DemandedBits *DB, AssumptionCache *AC,
@@ -227,6 +273,11 @@ bool RecurrenceDescriptor::AddReductionVar(
   if (Phi->getParent() != TheLoop->getHeader())
     return false;
 
+  // Check for min/max reduction variables that feed other users in the loop.
+  if (isMinMaxReductionPhiWithUsersOutsideReductionChain(Phi, Kind, TheLoop,
+                                                         RedDes))
+    return true;
+
   // Obtain the reduction start value from the value that comes from the loop
   // preheader.
   Value *RdxStart = Phi->getIncomingValueForBlock(TheLoop->getLoopPreheader());
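
An illustrative argmin-style loop (not from the patch) exhibiting the pattern
the new helper recognizes: the min recurrence feeds both its own update and
the index select, so the phi has a user outside the plain reduction chain:

  int argmin(const int *A, int N) {
    int Min = A[0], MinIdx = 0;
    for (int I = 1; I < N; ++I) {
      // 'Min < A[I]' is a second use of the min recurrence, outside the
      // smin chain itself.
      MinIdx = Min < A[I] ? MinIdx : I;
      Min = Min < A[I] ? Min : A[I]; // smin(Min, A[I])
    }
    return MinIdx;
  }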
diff --git a/llvm/lib/Analysis/InstructionSimplify.cpp b/llvm/lib/Analysis/InstructionSimplify.cpp
index 59a213b47825a..bd85444d7d2b0 100644
--- a/llvm/lib/Analysis/InstructionSimplify.cpp
+++ b/llvm/lib/Analysis/InstructionSimplify.cpp
@@ -6620,7 +6620,8 @@ static MinMaxOptResult OptimizeConstMinMax(const Constant *RHSConst,
   assert(OutNewConstVal != nullptr);
 
   bool PropagateNaN = IID == Intrinsic::minimum || IID == Intrinsic::maximum;
-  bool PropagateSNaN = IID == Intrinsic::minnum || IID == Intrinsic::maxnum;
+  bool ReturnsOtherForAllNaNs =
+      IID == Intrinsic::minimumnum || IID == Intrinsic::maximumnum;
   bool IsMin = IID == Intrinsic::minimum || IID == Intrinsic::minnum ||
                IID == Intrinsic::minimumnum;
 
@@ -6637,29 +6638,27 @@ static MinMaxOptResult OptimizeConstMinMax(const Constant *RHSConst,
 
   // minnum(x, qnan) -> x
   // maxnum(x, qnan) -> x
-  // minnum(x, snan) -> qnan
-  // maxnum(x, snan) -> qnan
   // minimum(X, nan) -> qnan
   // maximum(X, nan) -> qnan
   // minimumnum(X, nan) -> x
   // maximumnum(X, nan) -> x
   if (CAPF.isNaN()) {
-    if (PropagateNaN || (PropagateSNaN && CAPF.isSignaling())) {
+    if (PropagateNaN) {
       *OutNewConstVal = ConstantFP::get(CFP->getType(), CAPF.makeQuiet());
       return MinMaxOptResult::UseNewConstVal;
+    } else if (ReturnsOtherForAllNaNs || !CAPF.isSignaling()) {
+      return MinMaxOptResult::UseOtherVal;
     }
-    return MinMaxOptResult::UseOtherVal;
+    return MinMaxOptResult::CannotOptimize;
   }
 
   if (CAPF.isInfinity() || (Call && Call->hasNoInfs() && CAPF.isLargest())) {
-    // minnum(X, -inf) -> -inf (ignoring sNaN -> qNaN propagation)
-    // maxnum(X, +inf) -> +inf (ignoring sNaN -> qNaN propagation)
     // minimum(X, -inf) -> -inf if nnan
     // maximum(X, +inf) -> +inf if nnan
     // minimumnum(X, -inf) -> -inf
     // maximumnum(X, +inf) -> +inf
     if (CAPF.isNegative() == IsMin &&
-        (!PropagateNaN || (Call && Call->hasNoNaNs()))) {
+        (ReturnsOtherForAllNaNs || (Call && Call->hasNoNaNs()))) {
       *OutNewConstVal = const_cast<Constant *>(RHSConst);
       return MinMaxOptResult::UseNewConstVal;
     }
@@ -7004,12 +7003,10 @@ Value *llvm::simplifyBinaryIntrinsic(Intrinsic::ID IID, Type *ReturnType,
   case Intrinsic::minimum:
   case Intrinsic::maximumnum:
   case Intrinsic::minimumnum: {
-    // In several cases here, we deviate from exact IEEE 754 semantics
-    // to enable optimizations (as allowed by the LLVM IR spec).
-    //
-    // For instance, we may return one of the arguments unmodified instead of
-    // inserting an llvm.canonicalize to transform input sNaNs into qNaNs,
-    // or may assume all NaN inputs are qNaNs.
+    // In some cases here, we deviate from exact IEEE-754 semantics to enable
+    // optimizations (as allowed by the LLVM IR spec) by returning one of the
+    // arguments unmodified instead of inserting an llvm.canonicalize to
+    // transform input sNaNs into qNaNs.
 
     // If the arguments are the same, this is a no-op (ignoring NaN quieting)
     if (Op0 == Op1)
diff --git a/llvm/lib/Analysis/MemoryBuiltins.cpp b/llvm/lib/Analysis/MemoryBuiltins.cpp
index 1df4eda2580df..6c7259d2d875c 100644
--- a/llvm/lib/Analysis/MemoryBuiltins.cpp
+++ b/llvm/lib/Analysis/MemoryBuiltins.cpp
@@ -739,29 +739,30 @@ combinePossibleConstantValues(std::optional<APInt> LHS,
 }
 
 static std::optional<APInt> aggregatePossibleConstantValuesImpl(
-    const Value *V, ObjectSizeOpts::Mode EvalMode, unsigned recursionDepth) {
+    const Value *V, ObjectSizeOpts::Mode EvalMode, unsigned BitWidth,
+    unsigned recursionDepth) {
   constexpr unsigned maxRecursionDepth = 4;
   if (recursionDepth == maxRecursionDepth)
     return std::nullopt;
 
   if (const auto *CI = dyn_cast<ConstantInt>(V)) {
-    return CI->getValue();
+    return CI->getValue().sextOrTrunc(BitWidth);
   } else if (const auto *SI = dyn_cast<SelectInst>(V)) {
     return combinePossibleConstantValues(
         aggregatePossibleConstantValuesImpl(SI->getTrueValue(), EvalMode,
-                                            recursionDepth + 1),
+                                            BitWidth, recursionDepth + 1),
         aggregatePossibleConstantValuesImpl(SI->getFalseValue(), EvalMode,
-                                            recursionDepth + 1),
+                                            BitWidth, recursionDepth + 1),
         EvalMode);
   } else if (const auto *PN = dyn_cast<PHINode>(V)) {
     unsigned Count = PN->getNumIncomingValues();
     if (Count == 0)
       return std::nullopt;
     auto Acc = aggregatePossibleConstantValuesImpl(
-        PN->getIncomingValue(0), EvalMode, recursionDepth + 1);
+        PN->getIncomingValue(0), EvalMode, BitWidth, recursionDepth + 1);
     for (unsigned I = 1; Acc && I < Count; ++I) {
       auto Tmp = aggregatePossibleConstantValuesImpl(
-          PN->getIncomingValue(I), EvalMode, recursionDepth + 1);
+          PN->getIncomingValue(I), EvalMode, BitWidth, recursionDepth + 1);
       Acc = combinePossibleConstantValues(Acc, Tmp, EvalMode);
     }
     return Acc;
@@ -771,9 +772,10 @@ static std::optional<APInt> aggregatePossibleConstantValuesImpl(
 }
 
 static std::optional<APInt>
-aggregatePossibleConstantValues(const Value *V, ObjectSizeOpts::Mode EvalMode) {
+aggregatePossibleConstantValues(const Value *V, ObjectSizeOpts::Mode EvalMode,
+                                unsigned BitWidth) {
   if (auto *CI = dyn_cast<ConstantInt>(V))
-    return CI->getValue();
+    return CI->getValue().sextOrTrunc(BitWidth);
 
   if (EvalMode != ObjectSizeOpts::Mode::Min &&
       EvalMode != ObjectSizeOpts::Mode::Max)
@@ -782,7 +784,7 @@ aggregatePossibleConstantValues(const Value *V, ObjectSizeOpts::Mode EvalMode) {
   // Not using computeConstantRange here because we cannot guarantee it's not
   // doing optimization based on UB which we want to avoid when expanding
   // __builtin_object_size.
-  return aggregatePossibleConstantValuesImpl(V, EvalMode, 0u);
+  return aggregatePossibleConstantValuesImpl(V, EvalMode, BitWidth, 0u);
 }
 
 /// Align \p Size according to \p Alignment. If \p Size is greater than
@@ -844,9 +846,14 @@ OffsetSpan ObjectSizeOffsetVisitor::computeImpl(Value *V) {
         Options.EvalMode == ObjectSizeOpts::Mode::Min
             ? ObjectSizeOpts::Mode::Max
             : ObjectSizeOpts::Mode::Min;
-    auto OffsetRangeAnalysis = [EvalMode](Value &VOffset, APInt &Offset) {
+    // For a GEPOperator the indices are first converted to offsets in the
+    // pointer's index type, so we need to provide the index type to make sure
+    // the min/max operations are performed in the correct type.
+    unsigned IdxTyBits = DL.getIndexTypeSizeInBits(V->getType());
+    auto OffsetRangeAnalysis = [EvalMode, IdxTyBits](Value &VOffset,
+                                                     APInt &Offset) {
       if (auto PossibleOffset =
-              aggregatePossibleConstantValues(&VOffset, EvalMode)) {
+              aggregatePossibleConstantValues(&VOffset, EvalMode, IdxTyBits)) {
         Offset = *PossibleOffset;
         return true;
       }
@@ -956,8 +963,9 @@ OffsetSpan ObjectSizeOffsetVisitor::visitAllocaInst(AllocaInst &I) {
     return OffsetSpan(Zero, align(Size, I.getAlign()));
 
   Value *ArraySize = I.getArraySize();
-  if (auto PossibleSize =
-          aggregatePossibleConstantValues(ArraySize, Options.EvalMode)) {
+  if (auto PossibleSize = aggregatePossibleConstantValues(
+          ArraySize, Options.EvalMode,
+          ArraySize->getType()->getScalarSizeInBits())) {
     APInt NumElems = *PossibleSize;
     if (!CheckedZextOrTrunc(NumElems))
       return ObjectSizeOffsetVisitor::unknown();
@@ -988,8 +996,8 @@ OffsetSpan ObjectSizeOffsetVisitor::visitCallBase(CallBase &CB) {
     if (!V->getType()->isIntegerTy())
       return V;
 
-    if (auto PossibleBound =
-            aggregatePossibleConstantValues(V, Options.EvalMode))
+    if (auto PossibleBound = aggregatePossibleConstantValues(
+            V, Options.EvalMode, V->getType()->getScalarSizeInBits()))
       return ConstantInt::get(V->getType(), *PossibleBound);
 
     return V;
diff --git a/llvm/lib/Analysis/RuntimeLibcallInfo.cpp b/llvm/lib/Analysis/RuntimeLibcallInfo.cpp
index 9ea789a4ee45a..1c5a1cc75b7bd 100644
--- a/llvm/lib/Analysis/RuntimeLibcallInfo.cpp
+++ b/llvm/lib/Analysis/RuntimeLibcallInfo.cpp
@@ -13,6 +13,15 @@ using namespace llvm;
 
 AnalysisKey RuntimeLibraryAnalysis::Key;
 
+RuntimeLibraryAnalysis::RuntimeLibraryAnalysis(const Triple &TT,
+                                               ExceptionHandling ExceptionModel,
+                                               FloatABI::ABIType FloatABI,
+                                               EABI EABIVersion,
+                                               StringRef ABIName,
+                                               VectorLibrary VecLib)
+    : LibcallsInfo(std::in_place, TT, ExceptionModel, FloatABI, EABIVersion,
+                   ABIName, VecLib) {}
+
 RTLIB::RuntimeLibcallsInfo
 RuntimeLibraryAnalysis::run(const Module &M, ModuleAnalysisManager &) {
   if (!LibcallsInfo)
@@ -26,6 +35,13 @@ INITIALIZE_PASS(RuntimeLibraryInfoWrapper, "runtime-library-info",
 RuntimeLibraryInfoWrapper::RuntimeLibraryInfoWrapper()
     : ImmutablePass(ID), RTLA(RTLIB::RuntimeLibcallsInfo(Triple())) {}
 
+RuntimeLibraryInfoWrapper::RuntimeLibraryInfoWrapper(
+    const Triple &TT, ExceptionHandling ExceptionModel,
+    FloatABI::ABIType FloatABI, EABI EABIVersion, StringRef ABIName,
+    VectorLibrary VecLib)
+    : ImmutablePass(ID), RTLCI(std::in_place, TT, ExceptionModel, FloatABI,
+                               EABIVersion, ABIName, VecLib) {}
+
 char RuntimeLibraryInfoWrapper::ID = 0;
 
 ModulePass *llvm::createRuntimeLibraryInfoWrapperPass() {
diff --git a/llvm/lib/Analysis/ScalarEvolution.cpp b/llvm/lib/Analysis/ScalarEvolution.cpp
index a31f17b1936d6..1d7a8b981b5ee 100644
--- a/llvm/lib/Analysis/ScalarEvolution.cpp
+++ b/llvm/lib/Analysis/ScalarEvolution.cpp
@@ -3490,19 +3490,30 @@ const SCEV *ScalarEvolution::getUDivExpr(const SCEV *LHS,
           }
           /// Get a canonical UDivExpr for a recurrence.
           /// {X,+,N}/C => {Y,+,N}/C where Y=X-(X%N). Safe when C%N=0.
-          // We can currently only fold X%N if X is constant.
-          const SCEVConstant *StartC = dyn_cast<SCEVConstant>(AR->getStart());
-          if (StartC && !DivInt.urem(StepInt) &&
-              getZeroExtendExpr(AR, ExtTy) ==
-              getAddRecExpr(getZeroExtendExpr(AR->getStart(), ExtTy),
-                            getZeroExtendExpr(Step, ExtTy),
-                            AR->getLoop(), SCEV::FlagAnyWrap)) {
-            const APInt &StartInt = StartC->getAPInt();
-            const APInt &StartRem = StartInt.urem(StepInt);
-            if (StartRem != 0) {
+          const APInt *StartRem;
+          if (!DivInt.urem(StepInt) && match(getURemExpr(AR->getStart(), Step),
+                                             m_scev_APInt(StartRem))) {
+            bool NoWrap =
+                getZeroExtendExpr(AR, ExtTy) ==
+                getAddRecExpr(getZeroExtendExpr(AR->getStart(), ExtTy),
+                              getZeroExtendExpr(Step, ExtTy), AR->getLoop(),
+                              SCEV::FlagAnyWrap);
+
+            // With N <= C and both N and C powers of 2, the transformation
+            // {X,+,N}/C => {(X - X%N),+,N}/C preserves the division results
+            // even if wrapping occurs, as the results remain equivalent for
+            // all offsets in [(X - X%N), X).
+            bool CanFoldWithWrap = StepInt.ule(DivInt) && // N <= C
+                                   StepInt.isPowerOf2() && DivInt.isPowerOf2();
+            // Only fold if the subtraction can be folded in the start
+            // expression.
+            const SCEV *NewStart =
+                getMinusSCEV(AR->getStart(), getConstant(*StartRem));
+            if (*StartRem != 0 && (NoWrap || CanFoldWithWrap) &&
+                !isa<SCEVAddExpr>(NewStart)) {
               const SCEV *NewLHS =
-                  getAddRecExpr(getConstant(StartInt - StartRem), Step,
-                                AR->getLoop(), SCEV::FlagNW);
+                  getAddRecExpr(NewStart, Step, AR->getLoop(),
+                                NoWrap ? SCEV::FlagNW : SCEV::FlagAnyWrap);
               if (LHS != NewLHS) {
                 LHS = NewLHS;
 
@@ -11107,6 +11118,11 @@ bool ScalarEvolution::isKnownMultipleOf(
   return true;
 }
 
+bool ScalarEvolution::haveSameSign(const SCEV *S1, const SCEV *S2) {
+  return ((isKnownNonNegative(S1) && isKnownNonNegative(S2)) ||
+          (isKnownNegative(S1) && isKnownNegative(S2)));
+}
+
 std::pair<const SCEV *, const SCEV *>
 ScalarEvolution::SplitIntoInitAndPostInc(const Loop *L, const SCEV *S) {
   // Compute SCEV on entry of loop L.
@@ -12026,8 +12042,7 @@ bool ScalarEvolution::isImpliedCondBalancedTypes(
   if (IsSignFlippedPredicate(Pred, FoundPred)) {
     // Unsigned comparison is the same as signed comparison when both the
     // operands are non-negative or negative.
-    if ((isKnownNonNegative(FoundLHS) && isKnownNonNegative(FoundRHS)) ||
-        (isKnownNegative(FoundLHS) && isKnownNegative(FoundRHS)))
+    if (haveSameSign(FoundLHS, FoundRHS))
       return isImpliedCondOperands(Pred, LHS, RHS, FoundLHS, FoundRHS, CtxI);
     // Create local copies that we can freely swap and canonicalize our
     // conditions to "le/lt".
@@ -13990,7 +14005,15 @@ static void PrintLoopInfo(raw_ostream &OS, ScalarEvolution *SE,
 }
 
 namespace llvm {
-raw_ostream &operator<<(raw_ostream &OS, ScalarEvolution::LoopDisposition LD) {
+// Note: these overloaded operators need to be in the llvm namespace for them
+// to be resolved correctly. If we put them outside the llvm namespace, the
+//
+// OS << ": " << SE.getLoopDisposition(SV, InnerL);
+//
+// code below "breaks" and starts printing raw enum values as opposed to the
+// string values.
+static raw_ostream &operator<<(raw_ostream &OS,
+                               ScalarEvolution::LoopDisposition LD) {
   switch (LD) {
   case ScalarEvolution::LoopVariant:
     OS << "Variant";
@@ -14005,7 +14028,8 @@ raw_ostream &operator<<(raw_ostream &OS, ScalarEvolution::LoopDisposition LD) {
   return OS;
 }
 
-raw_ostream &operator<<(raw_ostream &OS, ScalarEvolution::BlockDisposition BD) {
+static raw_ostream &operator<<(raw_ostream &OS,
+                               llvm::ScalarEvolution::BlockDisposition BD) {
   switch (BD) {
   case ScalarEvolution::DoesNotDominateBlock:
     OS << "DoesNotDominate";
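
A small worked instance (illustrative) of the wrapping case described in the
getUDivExpr change above, with N = 4, C = 8, and a start X = 6 (X % N = 2):

  {6,+,4} /u 8 :  6, 10, 14, 18, 22, 26, 30, ...  ->  0, 1, 1, 2, 2, 3, 3, ...
  {4,+,4} /u 8 :  4,  8, 12, 16, 20, 24, 28, ...  ->  0, 1, 1, 2, 2, 3, 3, ...

Rebasing the start from 6 to 6 - (6 % 4) = 4 leaves every quotient unchanged;
the comment in the patch argues this stays true for all offsets in
[(X - X%N), X) even when the recurrence wraps, provided N <= C and both are
powers of two.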
diff --git a/llvm/lib/Analysis/ScalarEvolutionDivision.cpp b/llvm/lib/Analysis/ScalarEvolutionDivision.cpp
index bce41f9f5329e..4e422539ff9f6 100644
--- a/llvm/lib/Analysis/ScalarEvolutionDivision.cpp
+++ b/llvm/lib/Analysis/ScalarEvolutionDivision.cpp
@@ -29,8 +29,6 @@ class Type;
 
 using namespace llvm;
 
-namespace {
-
 static inline int sizeOfSCEV(const SCEV *S) {
   struct FindSCEVSize {
     int Size = 0;
@@ -52,8 +50,6 @@ static inline int sizeOfSCEV(const SCEV *S) {
   return F.Size;
 }
 
-} // namespace
-
 // Computes the Quotient and Remainder of the division of Numerator by
 // Denominator.
 void SCEVDivision::divide(ScalarEvolution &SE, const SCEV *Numerator,
diff --git a/llvm/lib/Analysis/ValueTracking.cpp b/llvm/lib/Analysis/ValueTracking.cpp
index dbceb8e557849..9cb6f19b9340c 100644
--- a/llvm/lib/Analysis/ValueTracking.cpp
+++ b/llvm/lib/Analysis/ValueTracking.cpp
@@ -2249,6 +2249,11 @@ static void computeKnownBitsFromOperator(const Operator *I,
     break;
   }
   case Instruction::ShuffleVector: {
+    if (auto *Splat = getSplatValue(I)) {
+      computeKnownBits(Splat, Known, Q, Depth + 1);
+      break;
+    }
+
     auto *Shuf = dyn_cast<ShuffleVectorInst>(I);
     // FIXME: Do we need to handle ConstantExpr involving shufflevectors?
     if (!Shuf) {
@@ -5877,6 +5882,12 @@ void computeKnownFPClass(const Value *V, const APInt &DemandedElts,
     break;
   }
   case Instruction::ShuffleVector: {
+    // Handle vector splat idiom
+    if (Value *Splat = getSplatValue(V)) {
+      computeKnownFPClass(Splat, Known, InterestedClasses, Q, Depth + 1);
+      break;
+    }
+
     // For undef elements, we don't know anything about the common state of
     // the shuffle result.
     APInt DemandedLHS, DemandedRHS;
diff --git a/llvm/lib/AsmParser/LLParser.cpp b/llvm/lib/AsmParser/LLParser.cpp
index c3678d37607d5..2a0246074a462 100644
--- a/llvm/lib/AsmParser/LLParser.cpp
+++ b/llvm/lib/AsmParser/LLParser.cpp
@@ -9994,25 +9994,26 @@ bool LLParser::parseAliasSummary(std::string Name, GlobalValue::GUID GUID,
 
   ValueInfo AliaseeVI;
   unsigned GVId;
-  if (parseGVReference(AliaseeVI, GVId))
-    return true;
-
-  if (parseToken(lltok::rparen, "expected ')' here"))
-    return true;
-
   auto AS = std::make_unique<AliasSummary>(GVFlags);
-
   AS->setModulePath(ModulePath);
 
-  // Record forward reference if the aliasee is not parsed yet.
-  if (AliaseeVI.getRef() == FwdVIRef) {
-    ForwardRefAliasees[GVId].emplace_back(AS.get(), Loc);
-  } else {
-    auto Summary = Index->findSummaryInModule(AliaseeVI, ModulePath);
-    assert(Summary && "Aliasee must be a definition");
-    AS->setAliasee(AliaseeVI, Summary);
+  if (!EatIfPresent(lltok::kw_null)) {
+    if (parseGVReference(AliaseeVI, GVId))
+      return true;
+
+    // Record forward reference if the aliasee is not parsed yet.
+    if (AliaseeVI.getRef() == FwdVIRef) {
+      ForwardRefAliasees[GVId].emplace_back(AS.get(), Loc);
+    } else {
+      auto Summary = Index->findSummaryInModule(AliaseeVI, ModulePath);
+      assert(Summary && "Aliasee must be a definition");
+      AS->setAliasee(AliaseeVI, Summary);
+    }
   }
 
+  if (parseToken(lltok::rparen, "expected ')' here"))
+    return true;
+
   return addGlobalValueToIndex(Name, GUID,
                                (GlobalValue::LinkageTypes)GVFlags.Linkage, ID,
                                std::move(AS), Loc);
diff --git a/llvm/lib/Bitcode/Reader/BitcodeReader.cpp b/llvm/lib/Bitcode/Reader/BitcodeReader.cpp
index 04cb0a699ebbf..90e6ffbf8992b 100644
--- a/llvm/lib/Bitcode/Reader/BitcodeReader.cpp
+++ b/llvm/lib/Bitcode/Reader/BitcodeReader.cpp
@@ -7200,9 +7200,11 @@ template <bool AllowNullValueInfo>
 std::pair<ValueInfo, GlobalValue::GUID>
 ModuleSummaryIndexBitcodeReader::getValueInfoFromValueId(unsigned ValueId) {
   auto VGI = ValueIdToValueInfoMap[ValueId];
-  // We can have a null value info for memprof callsite info records in
-  // distributed ThinLTO index files when the callee function summary is not
-  // included in the index. The bitcode writer records 0 in that case,
+  // We can have a null value info in distributed ThinLTO index files:
+  // - For memprof callsite info records when the callee function summary is not
+  //   included in the index.
+  // - For an alias summary when its aliasee summary is not included in the index.
+  // The bitcode writer records 0 in these cases,
   // and the caller of this helper will set AllowNullValueInfo to true.
   assert(AllowNullValueInfo || std::get<0>(VGI));
   return VGI;
@@ -7990,10 +7992,13 @@ Error ModuleSummaryIndexBitcodeReader::parseEntireSummary(unsigned ID) {
       LastSeenSummary = AS.get();
       AS->setModulePath(ModuleIdMap[ModuleId]);
 
-      auto AliaseeVI = std::get<0>(getValueInfoFromValueId(AliaseeValueId));
-      auto AliaseeInModule = TheIndex.findSummaryInModule(AliaseeVI, AS->modulePath());
-      AS->setAliasee(AliaseeVI, AliaseeInModule);
-
+      auto AliaseeVI = std::get<0>(
+          getValueInfoFromValueId</*AllowNullValueInfo*/ true>(AliaseeValueId));
+      if (AliaseeVI) {
+        auto AliaseeInModule =
+            TheIndex.findSummaryInModule(AliaseeVI, AS->modulePath());
+        AS->setAliasee(AliaseeVI, AliaseeInModule);
+      }
       ValueInfo VI = std::get<0>(getValueInfoFromValueId(ValueID));
       LastSeenGUID = VI.getGUID();
       TheIndex.addGlobalValueSummary(VI, std::move(AS));
diff --git a/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp b/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
index 0dd3fa3361fee..845ecb4672cf2 100644
--- a/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
+++ b/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
@@ -5248,8 +5248,10 @@ void IndexBitcodeWriter::writeCombinedGlobalValueSummary() {
     NameVals.push_back(ModuleIdMap[AS->modulePath()]);
     NameVals.push_back(
         getEncodedGVSummaryFlags(AS->flags(), shouldImportValueAsDecl(AS)));
-    auto AliaseeValueId = SummaryToValueIdMap[&AS->getAliasee()];
-    assert(AliaseeValueId);
+    // Set value id to 0 when an alias is imported but the aliasee summary is
+    // not contained in the index.
+    auto AliaseeValueId =
+        AS->hasAliasee() ? SummaryToValueIdMap[&AS->getAliasee()] : 0;
     NameVals.push_back(AliaseeValueId);
 
     // Emit the finished record.
@@ -5257,8 +5259,9 @@ void IndexBitcodeWriter::writeCombinedGlobalValueSummary() {
     NameVals.clear();
     MaybeEmitOriginalName(*AS);
 
-    if (auto *FS = dyn_cast<FunctionSummary>(&AS->getAliasee()))
-      getReferencedTypeIds(FS, ReferencedTypeIds);
+    if (AS->hasAliasee())
+      if (auto *FS = dyn_cast<FunctionSummary>(&AS->getAliasee()))
+        getReferencedTypeIds(FS, ReferencedTypeIds);
   }
 
   SmallVector<StringRef, 4> Functions;
diff --git a/llvm/lib/CodeGen/CodeGen.cpp b/llvm/lib/CodeGen/CodeGen.cpp
index 9795a0b707fd3..fe293c63fa762 100644
--- a/llvm/lib/CodeGen/CodeGen.cpp
+++ b/llvm/lib/CodeGen/CodeGen.cpp
@@ -57,6 +57,7 @@ void llvm::initializeCodeGen(PassRegistry &Registry) {
   initializeInterleavedLoadCombinePass(Registry);
   initializeInterleavedAccessPass(Registry);
   initializeJMCInstrumenterPass(Registry);
+  initializeLibcallLoweringInfoWrapperPass(Registry);
   initializeLiveDebugValuesLegacyPass(Registry);
   initializeLiveDebugVariablesWrapperLegacyPass(Registry);
   initializeLiveIntervalsWrapperPassPass(Registry);
diff --git a/llvm/lib/CodeGen/ExpandFp.cpp b/llvm/lib/CodeGen/ExpandFp.cpp
index f44eb227133ae..13ed4846d2bf7 100644
--- a/llvm/lib/CodeGen/ExpandFp.cpp
+++ b/llvm/lib/CodeGen/ExpandFp.cpp
@@ -975,11 +975,12 @@ static RTLIB::Libcall fremToLibcall(Type *Ty) {
 /* Return true if, according to \p LibInfo, the target either directly
    supports the frem instruction for the \p Ty, has a custom lowering,
    or uses a libcall. */
-static bool targetSupportsFrem(const TargetLowering &TLI, Type *Ty) {
+static bool targetSupportsFrem(const TargetLowering &TLI,
+                               const LibcallLoweringInfo &Libcalls, Type *Ty) {
   if (!TLI.isOperationExpand(ISD::FREM, EVT::getEVT(Ty)))
     return true;
 
-  return TLI.getLibcallName(fremToLibcall(Ty->getScalarType()));
+  return Libcalls.getLibcallName(fremToLibcall(Ty->getScalarType()));
 }
 
 static void addToWorklist(Instruction &I,
@@ -991,7 +992,7 @@ static void addToWorklist(Instruction &I,
 }
 
 static bool runImpl(Function &F, const TargetLowering &TLI,
-                    AssumptionCache *AC) {
+                    const LibcallLoweringInfo &Libcalls, AssumptionCache *AC) {
   SmallVector<Instruction *, 4> Worklist;
 
   unsigned MaxLegalFpConvertBitWidth =
@@ -1010,7 +1011,7 @@ static bool runImpl(Function &F, const TargetLowering &TLI,
 
     switch (I.getOpcode()) {
     case Instruction::FRem:
-      return !targetSupportsFrem(TLI, Ty) &&
+      return !targetSupportsFrem(TLI, Libcalls, Ty) &&
              FRemExpander::canExpandType(Ty->getScalarType());
 
     case Instruction::FPToUI:
@@ -1090,20 +1091,27 @@ class ExpandFpLegacyPass : public FunctionPass {
 
   bool runOnFunction(Function &F) override {
     auto *TM = &getAnalysis<TargetPassConfig>().getTM<TargetMachine>();
-    auto *TLI = TM->getSubtargetImpl(F)->getTargetLowering();
+    const TargetSubtargetInfo *Subtarget = TM->getSubtargetImpl(F);
+    auto *TLI = Subtarget->getTargetLowering();
     AssumptionCache *AC = nullptr;
 
+    const LibcallLoweringInfo &Libcalls =
+        getAnalysis<LibcallLoweringInfoWrapper>().getLibcallLowering(
+            *Subtarget);
+
     if (OptLevel != CodeGenOptLevel::None && !F.hasOptNone())
       AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
-    return runImpl(F, *TLI, AC);
+    return runImpl(F, *TLI, Libcalls, AC);
   }
 
   void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<LibcallLoweringInfoWrapper>();
     AU.addRequired<TargetPassConfig>();
     if (OptLevel != CodeGenOptLevel::None)
       AU.addRequired<AssumptionCacheTracker>();
     AU.addPreserved<AAResultsWrapperPass>();
     AU.addPreserved<GlobalsAAWrapperPass>();
+    AU.addRequired<LibcallLoweringInfoWrapper>();
   }
 };
 } // namespace
@@ -1126,13 +1134,29 @@ PreservedAnalyses ExpandFpPass::run(Function &F, FunctionAnalysisManager &FAM) {
   AssumptionCache *AC = nullptr;
   if (OptLevel != CodeGenOptLevel::None)
     AC = &FAM.getResult<AssumptionAnalysis>(F);
-  return runImpl(F, TLI, AC) ? PreservedAnalyses::none()
-                             : PreservedAnalyses::all();
+
+  auto &MAMProxy = FAM.getResult<ModuleAnalysisManagerFunctionProxy>(F);
+
+  const LibcallLoweringModuleAnalysisResult *LibcallLowering =
+      MAMProxy.getCachedResult<LibcallLoweringModuleAnalysis>(*F.getParent());
+
+  if (!LibcallLowering) {
+    F.getContext().emitError("'" + LibcallLoweringModuleAnalysis::name() +
+                             "' analysis required");
+    return PreservedAnalyses::all();
+  }
+
+  const LibcallLoweringInfo &Libcalls =
+      LibcallLowering->getLibcallLowering(*STI);
+
+  return runImpl(F, TLI, Libcalls, AC) ? PreservedAnalyses::none()
+                                       : PreservedAnalyses::all();
 }
 
 char ExpandFpLegacyPass::ID = 0;
 INITIALIZE_PASS_BEGIN(ExpandFpLegacyPass, "expand-fp",
                       "Expand certain fp instructions", false, false)
+INITIALIZE_PASS_DEPENDENCY(LibcallLoweringInfoWrapper)
 INITIALIZE_PASS_END(ExpandFpLegacyPass, "expand-fp", "Expand fp", false, false)
 
 FunctionPass *llvm::createExpandFpPass(CodeGenOptLevel OptLevel) {
diff --git a/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp b/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
index e0665d99a891d..63ed7d598455f 100644
--- a/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
@@ -111,17 +111,18 @@ INITIALIZE_PASS_END(IRTranslator, DEBUG_TYPE, "IRTranslator LLVM IR -> MI",
                 false, false)
 
 static void reportTranslationError(MachineFunction &MF,
-                                   const TargetPassConfig &TPC,
                                    OptimizationRemarkEmitter &ORE,
                                    OptimizationRemarkMissed &R) {
   MF.getProperties().setFailedISel();
+  bool IsGlobalISelAbortEnabled =
+      MF.getTarget().Options.GlobalISelAbort == GlobalISelAbortMode::Enable;
 
   // Print the function name explicitly if we don't have a debug location (which
   // makes the diagnostic less useful) or if we're going to emit a raw error.
-  if (!R.getLocation().isValid() || TPC.isGlobalISelAbortEnabled())
+  if (!R.getLocation().isValid() || IsGlobalISelAbortEnabled)
     R << (" (in function: " + MF.getName() + ")").str();
 
-  if (TPC.isGlobalISelAbortEnabled())
+  if (IsGlobalISelAbortEnabled)
     report_fatal_error(Twine(R.getMsg()));
   else
     ORE.emit(R);
@@ -242,7 +243,7 @@ ArrayRef<Register> IRTranslator::getOrCreateVRegs(const Value &Val) {
                                  MF->getFunction().getSubprogram(),
                                  &MF->getFunction().getEntryBlock());
       R << "unable to translate constant: " << ore::NV("Type", Val.getType());
-      reportTranslationError(*MF, *TPC, *ORE, R);
+      reportTranslationError(*MF, *ORE, R);
       return *VRegs;
     }
   }
@@ -279,7 +280,7 @@ Align IRTranslator::getMemOpAlign(const Instruction &I) {
 
   OptimizationRemarkMissed R("gisel-irtranslator", "", &I);
   R << "unable to translate memop: " << ore::NV("Opcode", &I);
-  reportTranslationError(*MF, *TPC, *ORE, R);
+  reportTranslationError(*MF, *ORE, R);
   return Align(1);
 }
 
@@ -4147,7 +4148,7 @@ bool IRTranslator::runOnMachineFunction(MachineFunction &CurMF) {
     OptimizationRemarkMissed R("gisel-irtranslator", "GISelFailure",
                                F.getSubprogram(), &F.getEntryBlock());
     R << "unable to translate in big endian mode";
-    reportTranslationError(*MF, *TPC, *ORE, R);
+    reportTranslationError(*MF, *ORE, R);
     return false;
   }
 
@@ -4191,7 +4192,7 @@ bool IRTranslator::runOnMachineFunction(MachineFunction &CurMF) {
                                F.getSubprogram(), &F.getEntryBlock());
     R << "unable to lower function: "
       << ore::NV("Prototype", F.getFunctionType());
-    reportTranslationError(*MF, *TPC, *ORE, R);
+    reportTranslationError(*MF, *ORE, R);
     return false;
   }
 
@@ -4214,7 +4215,7 @@ bool IRTranslator::runOnMachineFunction(MachineFunction &CurMF) {
                                F.getSubprogram(), &F.getEntryBlock());
     R << "unable to lower arguments: "
       << ore::NV("Prototype", F.getFunctionType());
-    reportTranslationError(*MF, *TPC, *ORE, R);
+    reportTranslationError(*MF, *ORE, R);
     return false;
   }
 
@@ -4265,7 +4266,7 @@ bool IRTranslator::runOnMachineFunction(MachineFunction &CurMF) {
           R << ": '" << InstStrStorage << "'";
         }
 
-        reportTranslationError(*MF, *TPC, *ORE, R);
+        reportTranslationError(*MF, *ORE, R);
         return false;
       }
 
@@ -4273,7 +4274,7 @@ bool IRTranslator::runOnMachineFunction(MachineFunction &CurMF) {
         OptimizationRemarkMissed R("gisel-irtranslator", "GISelFailure",
                                    BB->getTerminator()->getDebugLoc(), BB);
         R << "unable to translate basic block";
-        reportTranslationError(*MF, *TPC, *ORE, R);
+        reportTranslationError(*MF, *ORE, R);
         return false;
       }
     }
diff --git a/llvm/lib/CodeGen/GlobalISel/InstructionSelect.cpp b/llvm/lib/CodeGen/GlobalISel/InstructionSelect.cpp
index 2dd22c8a7e8ba..1d281ab83aacc 100644
--- a/llvm/lib/CodeGen/GlobalISel/InstructionSelect.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/InstructionSelect.cpp
@@ -137,7 +137,6 @@ bool InstructionSelect::runOnMachineFunction(MachineFunction &MF) {
     return false;
 
   ISel = MF.getSubtarget().getInstructionSelector();
-  ISel->TPC = &getAnalysis<TargetPassConfig>();
 
   // FIXME: Properly override OptLevel in TargetMachine. See OptLevelChanger
   CodeGenOptLevel OldOptLevel = OptLevel;
@@ -159,7 +158,6 @@ bool InstructionSelect::selectMachineFunction(MachineFunction &MF) {
   LLVM_DEBUG(dbgs() << "Selecting function: " << MF.getName() << '\n');
   assert(ISel && "Cannot work without InstructionSelector");
 
-  const TargetPassConfig &TPC = *ISel->TPC;
   CodeGenCoverage CoverageInfo;
   ISel->setupMF(MF, VT, &CoverageInfo, PSI, BFI);
 
@@ -177,8 +175,8 @@ bool InstructionSelect::selectMachineFunction(MachineFunction &MF) {
   // property check already is.
   if (!DisableGISelLegalityCheck)
     if (const MachineInstr *MI = machineFunctionIsIllegal(MF)) {
-      reportGISelFailure(MF, TPC, MORE, "gisel-select",
-                         "instruction is not legal", *MI);
+      reportGISelFailure(MF, MORE, "gisel-select", "instruction is not legal",
+                         *MI);
       return false;
     }
   // FIXME: We could introduce new blocks and will need to fix the outer loop.
@@ -215,8 +213,7 @@ bool InstructionSelect::selectMachineFunction(MachineFunction &MF) {
         if (!selectInstr(MI)) {
           LLVM_DEBUG(dbgs() << "Selection failed!\n";
                      MIIMaintainer.reportFullyCreatedInstrs());
-          reportGISelFailure(MF, TPC, MORE, "gisel-select", "cannot select",
-                             MI);
+          reportGISelFailure(MF, MORE, "gisel-select", "cannot select", MI);
           return false;
         }
         LLVM_DEBUG(MIIMaintainer.reportFullyCreatedInstrs());
@@ -279,7 +276,7 @@ bool InstructionSelect::selectMachineFunction(MachineFunction &MF) {
 
     const TargetRegisterClass *RC = MRI.getRegClassOrNull(VReg);
     if (!RC) {
-      reportGISelFailure(MF, TPC, MORE, "gisel-select",
+      reportGISelFailure(MF, MORE, "gisel-select",
                          "VReg has no regclass after selection", *MI);
       return false;
     }
@@ -288,7 +285,7 @@ bool InstructionSelect::selectMachineFunction(MachineFunction &MF) {
     if (Ty.isValid() &&
         TypeSize::isKnownGT(Ty.getSizeInBits(), TRI.getRegSizeInBits(*RC))) {
       reportGISelFailure(
-          MF, TPC, MORE, "gisel-select",
+          MF, MORE, "gisel-select",
           "VReg's low-level type and register class have different sizes", *MI);
       return false;
     }
@@ -299,7 +296,7 @@ bool InstructionSelect::selectMachineFunction(MachineFunction &MF) {
                                       MF.getFunction().getSubprogram(),
                                       /*MBB=*/nullptr);
     R << "inserting blocks is not supported yet";
-    reportGISelFailure(MF, TPC, MORE, R);
+    reportGISelFailure(MF, MORE, R);
     return false;
   }
 #endif
diff --git a/llvm/lib/CodeGen/GlobalISel/LegalityPredicates.cpp b/llvm/lib/CodeGen/GlobalISel/LegalityPredicates.cpp
index 30c2d089c3121..5e7cd5fd5d9ad 100644
--- a/llvm/lib/CodeGen/GlobalISel/LegalityPredicates.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/LegalityPredicates.cpp
@@ -155,6 +155,26 @@ LegalityPredicate LegalityPredicates::scalarOrEltNarrowerThan(unsigned TypeIdx,
   };
 }
 
+LegalityPredicate
+LegalityPredicates::vectorElementCountIsGreaterThan(unsigned TypeIdx,
+                                                    unsigned Size) {
+
+  return [=](const LegalityQuery &Query) {
+    const LLT QueryTy = Query.Types[TypeIdx];
+    return QueryTy.isFixedVector() && QueryTy.getNumElements() > Size;
+  };
+}
+
+LegalityPredicate
+LegalityPredicates::vectorElementCountIsLessThanOrEqualTo(unsigned TypeIdx,
+                                                          unsigned Size) {
+
+  return [=](const LegalityQuery &Query) {
+    const LLT QueryTy = Query.Types[TypeIdx];
+    return QueryTy.isFixedVector() && QueryTy.getNumElements() <= Size;
+  };
+}
+
 LegalityPredicate LegalityPredicates::scalarOrEltWiderThan(unsigned TypeIdx,
                                                            unsigned Size) {
   return [=](const LegalityQuery &Query) {
diff --git a/llvm/lib/CodeGen/GlobalISel/Legalizer.cpp b/llvm/lib/CodeGen/GlobalISel/Legalizer.cpp
index aef16b5f33af4..0f0656aaa4f45 100644
--- a/llvm/lib/CodeGen/GlobalISel/Legalizer.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/Legalizer.cpp
@@ -348,7 +348,7 @@ bool Legalizer::runOnMachineFunction(MachineFunction &MF) {
                                             *MIRBuilder, VT);
 
   if (Result.FailedOn) {
-    reportGISelFailure(MF, TPC, MORE, "gisel-legalize",
+    reportGISelFailure(MF, MORE, "gisel-legalize",
                        "unable to legalize instruction", *Result.FailedOn);
     return false;
   }
@@ -360,7 +360,7 @@ bool Legalizer::runOnMachineFunction(MachineFunction &MF) {
     R << "lost "
       << ore::NV("NumLostDebugLocs", LocObserver.getNumLostDebugLocs())
       << " debug locations during pass";
-    reportGISelWarning(MF, TPC, MORE, R);
+    reportGISelWarning(MF, MORE, R);
     // Example remark:
     // --- !Missed
     // Pass:            gisel-legalize
diff --git a/llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp b/llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
index 120c38ab8404c..1aa1d465d8da6 100644
--- a/llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
@@ -6684,13 +6684,24 @@ LegalizerHelper::moreElementsVector(MachineInstr &MI, unsigned TypeIdx,
   case TargetOpcode::G_FMAXIMUMNUM:
   case TargetOpcode::G_STRICT_FADD:
   case TargetOpcode::G_STRICT_FSUB:
-  case TargetOpcode::G_STRICT_FMUL:
+  case TargetOpcode::G_STRICT_FMUL: {
+    Observer.changingInstr(MI);
+    moreElementsVectorSrc(MI, MoreTy, 1);
+    moreElementsVectorSrc(MI, MoreTy, 2);
+    moreElementsVectorDst(MI, MoreTy, 0);
+    Observer.changedInstr(MI);
+    return Legalized;
+  }
   case TargetOpcode::G_SHL:
   case TargetOpcode::G_ASHR:
   case TargetOpcode::G_LSHR: {
     Observer.changingInstr(MI);
     moreElementsVectorSrc(MI, MoreTy, 1);
-    moreElementsVectorSrc(MI, MoreTy, 2);
+    // The shift operand may have a different scalar type from the source and
+    // destination operands.
+    LLT ShiftMoreTy = MoreTy.changeElementType(
+        MRI.getType(MI.getOperand(2).getReg()).getElementType());
+    moreElementsVectorSrc(MI, ShiftMoreTy, 2);
     moreElementsVectorDst(MI, MoreTy, 0);
     Observer.changedInstr(MI);
     return Legalized;
@@ -6806,12 +6817,10 @@ LegalizerHelper::moreElementsVector(MachineInstr &MI, unsigned TypeIdx,
     LLT DstExtTy;
     if (TypeIdx == 0) {
       DstExtTy = MoreTy;
-      SrcExtTy = LLT::fixed_vector(
-          MoreTy.getNumElements(),
+      SrcExtTy = MoreTy.changeElementType(
           MRI.getType(MI.getOperand(1).getReg()).getElementType());
     } else {
-      DstExtTy = LLT::fixed_vector(
-          MoreTy.getNumElements(),
+      DstExtTy = MoreTy.changeElementType(
           MRI.getType(MI.getOperand(0).getReg()).getElementType());
       SrcExtTy = MoreTy;
     }
diff --git a/llvm/lib/CodeGen/GlobalISel/RegBankSelect.cpp b/llvm/lib/CodeGen/GlobalISel/RegBankSelect.cpp
index bcb4f1c551cfd..5db631be32acd 100644
--- a/llvm/lib/CodeGen/GlobalISel/RegBankSelect.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/RegBankSelect.cpp
@@ -39,6 +39,7 @@
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/raw_ostream.h"
+#include "llvm/Target/TargetMachine.h"
 #include <algorithm>
 #include <cassert>
 #include <cstdint>
@@ -83,7 +84,6 @@ void RegBankSelect::init(MachineFunction &MF) {
   assert(RBI && "Cannot work without RegisterBankInfo");
   MRI = &MF.getRegInfo();
   TRI = MF.getSubtarget().getRegisterInfo();
-  TPC = &getAnalysis<TargetPassConfig>();
   if (OptMode != Mode::Fast) {
     MBFI = &getAnalysis<MachineBlockFrequencyInfoWrapperPass>().getMBFI();
     MBPI = &getAnalysis<MachineBranchProbabilityInfoWrapperPass>().getMBPI();
@@ -308,7 +308,8 @@ const RegisterBankInfo::InstructionMapping &RegBankSelect::findBestMapping(
         RepairPts.emplace_back(std::move(RepairPt));
     }
   }
-  if (!BestMapping && !TPC->isGlobalISelAbortEnabled()) {
+  if (!BestMapping && MI.getMF()->getTarget().Options.GlobalISelAbort !=
+                          GlobalISelAbortMode::Enable) {
     // If none of the mapping worked that means they are all impossible.
     // Thus, pick the first one and set an impossible repairing point.
     // It will trigger the failed isel mode.
@@ -708,7 +709,7 @@ bool RegBankSelect::assignRegisterBanks(MachineFunction &MF) {
         continue;
 
       if (!assignInstr(MI)) {
-        reportGISelFailure(MF, *TPC, *MORE, "gisel-regbankselect",
+        reportGISelFailure(MF, *MORE, "gisel-regbankselect",
                            "unable to map instruction", MI);
         return false;
       }
@@ -722,7 +723,7 @@ bool RegBankSelect::checkFunctionIsLegal(MachineFunction &MF) const {
 #ifndef NDEBUG
   if (!DisableGISelLegalityCheck) {
     if (const MachineInstr *MI = machineFunctionIsIllegal(MF)) {
-      reportGISelFailure(MF, *TPC, *MORE, "gisel-regbankselect",
+      reportGISelFailure(MF, *MORE, "gisel-regbankselect",
                          "instruction is not legal", *MI);
       return false;
     }
diff --git a/llvm/lib/CodeGen/GlobalISel/Utils.cpp b/llvm/lib/CodeGen/GlobalISel/Utils.cpp
index e8954a3d9899b..15e81f5773b69 100644
--- a/llvm/lib/CodeGen/GlobalISel/Utils.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/Utils.cpp
@@ -234,11 +234,11 @@ bool llvm::isTriviallyDead(const MachineInstr &MI,
 
 static void reportGISelDiagnostic(DiagnosticSeverity Severity,
                                   MachineFunction &MF,
-                                  const TargetPassConfig &TPC,
                                   MachineOptimizationRemarkEmitter &MORE,
                                   MachineOptimizationRemarkMissed &R) {
-  bool IsFatal = Severity == DS_Error &&
-                 TPC.isGlobalISelAbortEnabled();
+  bool IsGlobalISelAbortEnabled =
+      MF.getTarget().Options.GlobalISelAbort == GlobalISelAbortMode::Enable;
+  bool IsFatal = Severity == DS_Error && IsGlobalISelAbortEnabled;
   // Print the function name explicitly if we don't have a debug location (which
   // makes the diagnostic less useful) or if we're going to emit a raw error.
   if (!R.getLocation().isValid() || IsFatal)
@@ -250,20 +250,20 @@ static void reportGISelDiagnostic(DiagnosticSeverity Severity,
     MORE.emit(R);
 }
 
-void llvm::reportGISelWarning(MachineFunction &MF, const TargetPassConfig &TPC,
+void llvm::reportGISelWarning(MachineFunction &MF,
                               MachineOptimizationRemarkEmitter &MORE,
                               MachineOptimizationRemarkMissed &R) {
-  reportGISelDiagnostic(DS_Warning, MF, TPC, MORE, R);
+  reportGISelDiagnostic(DS_Warning, MF, MORE, R);
 }
 
-void llvm::reportGISelFailure(MachineFunction &MF, const TargetPassConfig &TPC,
+void llvm::reportGISelFailure(MachineFunction &MF,
                               MachineOptimizationRemarkEmitter &MORE,
                               MachineOptimizationRemarkMissed &R) {
   MF.getProperties().setFailedISel();
-  reportGISelDiagnostic(DS_Error, MF, TPC, MORE, R);
+  reportGISelDiagnostic(DS_Error, MF, MORE, R);
 }
 
-void llvm::reportGISelFailure(MachineFunction &MF, const TargetPassConfig &TPC,
+void llvm::reportGISelFailure(MachineFunction &MF,
                               MachineOptimizationRemarkEmitter &MORE,
                               const char *PassName, StringRef Msg,
                               const MachineInstr &MI) {
@@ -271,9 +271,10 @@ void llvm::reportGISelFailure(MachineFunction &MF, const TargetPassConfig &TPC,
                                     MI.getDebugLoc(), MI.getParent());
   R << Msg;
   // Printing MI is expensive;  only do it if expensive remarks are enabled.
-  if (TPC.isGlobalISelAbortEnabled() || MORE.allowExtraAnalysis(PassName))
+  if (MF.getTarget().Options.GlobalISelAbort == GlobalISelAbortMode::Enable ||
+      MORE.allowExtraAnalysis(PassName))
     R << ": " << ore::MNV("Inst", MI);
-  reportGISelFailure(MF, TPC, MORE, R);
+  reportGISelFailure(MF, MORE, R);
 }
 
 unsigned llvm::getInverseGMinMaxOpcode(unsigned MinMaxOpc) {
@@ -768,8 +769,12 @@ llvm::ConstantFoldFPBinOp(unsigned Opcode, const Register Op1,
     C1.copySign(C2);
     return C1;
   case TargetOpcode::G_FMINNUM:
+    if (C1.isSignaling() || C2.isSignaling())
+      return std::nullopt;
     return minnum(C1, C2);
   case TargetOpcode::G_FMAXNUM:
+    if (C1.isSignaling() || C2.isSignaling())
+      return std::nullopt;
     return maxnum(C1, C2);
   case TargetOpcode::G_FMINIMUM:
     return minimum(C1, C2);
diff --git a/llvm/lib/CodeGen/LibcallLoweringInfo.cpp b/llvm/lib/CodeGen/LibcallLoweringInfo.cpp
index 6f3607e8db824..0d54fac2422e2 100644
--- a/llvm/lib/CodeGen/LibcallLoweringInfo.cpp
+++ b/llvm/lib/CodeGen/LibcallLoweringInfo.cpp
@@ -7,7 +7,10 @@
 //===----------------------------------------------------------------------===//
 
 #include "llvm/CodeGen/LibcallLoweringInfo.h"
+#include "llvm/Analysis/RuntimeLibcallInfo.h"
 #include "llvm/CodeGen/TargetSubtargetInfo.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Target/TargetMachine.h"
 
 using namespace llvm;
 
@@ -28,3 +31,42 @@ LibcallLoweringInfo::LibcallLoweringInfo(
 
   Subtarget.initLibcallLoweringInfo(*this);
 }
+
+AnalysisKey LibcallLoweringModuleAnalysis::Key;
+
+bool LibcallLoweringModuleAnalysisResult::invalidate(
+    Module &, const PreservedAnalyses &PA,
+    ModuleAnalysisManager::Invalidator &) {
+  // Passes that change the runtime libcall set must explicitly invalidate
+  // this analysis.
+  auto PAC = PA.getChecker<LibcallLoweringModuleAnalysis>();
+  return !PAC.preservedWhenStateless();
+}
+
+LibcallLoweringModuleAnalysisResult
+LibcallLoweringModuleAnalysis::run(Module &M, ModuleAnalysisManager &MAM) {
+  LibcallLoweringMap.init(&MAM.getResult<RuntimeLibraryAnalysis>(M));
+  return LibcallLoweringMap;
+}
+
+INITIALIZE_PASS_BEGIN(LibcallLoweringInfoWrapper, "libcall-lowering-info",
+                      "Library Function Lowering Analysis", false, true)
+INITIALIZE_PASS_DEPENDENCY(RuntimeLibraryInfoWrapper)
+INITIALIZE_PASS_END(LibcallLoweringInfoWrapper, "libcall-lowering-info",
+                    "Library Function Lowering Analysis", false, true)
+
+char LibcallLoweringInfoWrapper::ID = 0;
+
+LibcallLoweringInfoWrapper::LibcallLoweringInfoWrapper() : ImmutablePass(ID) {}
+
+bool LibcallLoweringInfoWrapper::doInitialization(Module &M) {
+  Result.init(&getAnalysis<RuntimeLibraryInfoWrapper>().getRTLCI(M));
+  return false;
+}
+
+void LibcallLoweringInfoWrapper::getAnalysisUsage(AnalysisUsage &AU) const {
+  AU.addRequired<RuntimeLibraryInfoWrapper>();
+  AU.setPreservesAll();
+}
+
+void LibcallLoweringInfoWrapper::releaseMemory() { Result.clear(); }
diff --git a/llvm/lib/CodeGen/MIRParser/MILexer.cpp b/llvm/lib/CodeGen/MIRParser/MILexer.cpp
index dbd56c7414f38..aa79aad781ee8 100644
--- a/llvm/lib/CodeGen/MIRParser/MILexer.cpp
+++ b/llvm/lib/CodeGen/MIRParser/MILexer.cpp
@@ -266,6 +266,7 @@ static MIToken::TokenKind getIdentifierKind(StringRef Identifier) {
       .Case("constant-pool", MIToken::kw_constant_pool)
       .Case("call-entry", MIToken::kw_call_entry)
       .Case("custom", MIToken::kw_custom)
+      .Case("lanemask", MIToken::kw_lanemask)
       .Case("liveout", MIToken::kw_liveout)
       .Case("landing-pad", MIToken::kw_landing_pad)
       .Case("inlineasm-br-indirect-target",
diff --git a/llvm/lib/CodeGen/MIRParser/MILexer.h b/llvm/lib/CodeGen/MIRParser/MILexer.h
index 0407a0e7540d7..ff4c0c6f6d59d 100644
--- a/llvm/lib/CodeGen/MIRParser/MILexer.h
+++ b/llvm/lib/CodeGen/MIRParser/MILexer.h
@@ -122,6 +122,7 @@ struct MIToken {
     kw_constant_pool,
     kw_call_entry,
     kw_custom,
+    kw_lanemask,
     kw_liveout,
     kw_landing_pad,
     kw_inlineasm_br_indirect_target,
diff --git a/llvm/lib/CodeGen/MIRParser/MIParser.cpp b/llvm/lib/CodeGen/MIRParser/MIParser.cpp
index f35274d4e2edf..51862b501cc47 100644
--- a/llvm/lib/CodeGen/MIRParser/MIParser.cpp
+++ b/llvm/lib/CodeGen/MIRParser/MIParser.cpp
@@ -496,6 +496,7 @@ class MIParser {
   bool parseTargetIndexOperand(MachineOperand &Dest);
   bool parseDbgInstrRefOperand(MachineOperand &Dest);
   bool parseCustomRegisterMaskOperand(MachineOperand &Dest);
+  bool parseLaneMaskOperand(MachineOperand &Dest);
   bool parseLiveoutRegisterMaskOperand(MachineOperand &Dest);
   bool parseMachineOperand(const unsigned OpCode, const unsigned OpIdx,
                            MachineOperand &Dest,
@@ -2886,6 +2887,31 @@ bool MIParser::parseCustomRegisterMaskOperand(MachineOperand &Dest) {
   return false;
 }
 
+bool MIParser::parseLaneMaskOperand(MachineOperand &Dest) {
+  assert(Token.is(MIToken::kw_lanemask));
+
+  lex();
+  if (expectAndConsume(MIToken::lparen))
+    return true;
+
+  // Parse lanemask.
+  if (Token.isNot(MIToken::IntegerLiteral) && Token.isNot(MIToken::HexLiteral))
+    return error("expected a valid lane mask value");
+  static_assert(sizeof(LaneBitmask::Type) == sizeof(uint64_t),
+                "Use correct get-function for lane mask.");
+  LaneBitmask::Type V;
+  if (getUint64(V))
+    return true;
+  LaneBitmask LaneMask(V);
+  lex();
+
+  if (expectAndConsume(MIToken::rparen))
+    return true;
+
+  Dest = MachineOperand::CreateLaneMask(LaneMask);
+  return false;
+}
+
 bool MIParser::parseLiveoutRegisterMaskOperand(MachineOperand &Dest) {
   assert(Token.is(MIToken::kw_liveout));
   uint32_t *Mask = MF.allocateRegMask();
@@ -2986,6 +3012,8 @@ bool MIParser::parseMachineOperand(const unsigned OpCode, const unsigned OpIdx,
     return parseIntrinsicOperand(Dest);
   case MIToken::kw_target_index:
     return parseTargetIndexOperand(Dest);
+  case MIToken::kw_lanemask:
+    return parseLaneMaskOperand(Dest);
   case MIToken::kw_liveout:
     return parseLiveoutRegisterMaskOperand(Dest);
   case MIToken::kw_floatpred:
diff --git a/llvm/lib/CodeGen/MIRPrinter.cpp b/llvm/lib/CodeGen/MIRPrinter.cpp
index c0554497653f8..02f07b474c048 100644
--- a/llvm/lib/CodeGen/MIRPrinter.cpp
+++ b/llvm/lib/CodeGen/MIRPrinter.cpp
@@ -965,7 +965,8 @@ static void printMIOperand(raw_ostream &OS, MFPrintState &State,
   case MachineOperand::MO_Predicate:
   case MachineOperand::MO_BlockAddress:
   case MachineOperand::MO_DbgInstrRef:
-  case MachineOperand::MO_ShuffleMask: {
+  case MachineOperand::MO_ShuffleMask:
+  case MachineOperand::MO_LaneMask: {
     unsigned TiedOperandIdx = 0;
     if (ShouldPrintRegisterTies && Op.isReg() && Op.isTied() && !Op.isDef())
       TiedOperandIdx = Op.getParent()->findTiedOperandIdx(OpIdx);
diff --git a/llvm/lib/CodeGen/MIRVRegNamerUtils.cpp b/llvm/lib/CodeGen/MIRVRegNamerUtils.cpp
index a22cc91b90542..884c625f94cc7 100644
--- a/llvm/lib/CodeGen/MIRVRegNamerUtils.cpp
+++ b/llvm/lib/CodeGen/MIRVRegNamerUtils.cpp
@@ -93,11 +93,12 @@ std::string VRegRenamer::getInstructionOpcodeHash(MachineInstr &MI) {
     // is contributing to a hash collision but there's enough information
     // (Opcodes,other registers etc) that this will likely not be a problem.
 
-    // TODO: Handle the following Index/ID/Predicate cases. They can
+    // TODO: Handle the following Index/ID/Predicate/LaneMask cases. They can
     // be hashed on in a stable manner.
     case MachineOperand::MO_CFIIndex:
     case MachineOperand::MO_IntrinsicID:
     case MachineOperand::MO_Predicate:
+    case MachineOperand::MO_LaneMask:
 
     // In the cases below we havn't found a way to produce an artifact that will
     // result in a stable hash, in most cases because they are pointers. We want
diff --git a/llvm/lib/CodeGen/MachineBasicBlock.cpp b/llvm/lib/CodeGen/MachineBasicBlock.cpp
index ba0b025167307..be94e1e6d25b6 100644
--- a/llvm/lib/CodeGen/MachineBasicBlock.cpp
+++ b/llvm/lib/CodeGen/MachineBasicBlock.cpp
@@ -1425,14 +1425,14 @@ bool MachineBasicBlock::canSplitCriticalEdge(const MachineBasicBlock *Succ,
   // where both sides of the branches are always executed.
 
   if (MF->getTarget().requiresStructuredCFG()) {
-    // If `Succ` is a loop header, splitting the critical edge will not
-    // break structured CFG.
-    if (MLI) {
-      const MachineLoop *L = MLI->getLoopFor(Succ);
-      return L && L->getHeader() == Succ;
-    }
-
-    return false;
+    if (!MLI)
+      return false;
+    const MachineLoop *L = MLI->getLoopFor(Succ);
+    // Splitting the critical edge will not break structured CFG only if
+    // `Succ` is a loop header. Fall through to check whether this block's
+    // terminator is analyzable.
+    if (!L || L->getHeader() != Succ)
+      return false;
   }
 
   // Do we have an Indirect jump with a jumptable that we can rewrite?
diff --git a/llvm/lib/CodeGen/MachineOperand.cpp b/llvm/lib/CodeGen/MachineOperand.cpp
index 8c6d2194433d0..ac1f201bc8b83 100644
--- a/llvm/lib/CodeGen/MachineOperand.cpp
+++ b/llvm/lib/CodeGen/MachineOperand.cpp
@@ -394,6 +394,8 @@ bool MachineOperand::isIdenticalTo(const MachineOperand &Other) const {
     return getPredicate() == Other.getPredicate();
   case MachineOperand::MO_ShuffleMask:
     return getShuffleMask() == Other.getShuffleMask();
+  case MachineOperand::MO_LaneMask:
+    return getLaneMask() == Other.getLaneMask();
   }
   llvm_unreachable("Invalid machine operand type");
 }
@@ -460,6 +462,9 @@ hash_code llvm::hash_value(const MachineOperand &MO) {
     return hash_combine(MO.getType(), MO.getTargetFlags(), MO.getPredicate());
   case MachineOperand::MO_ShuffleMask:
     return hash_combine(MO.getType(), MO.getTargetFlags(), MO.getShuffleMask());
+  case MachineOperand::MO_LaneMask:
+    return hash_combine(MO.getType(), MO.getTargetFlags(),
+                        MO.getLaneMask().getAsInteger());
   }
   llvm_unreachable("Invalid machine operand type");
 }
@@ -1019,11 +1024,11 @@ void MachineOperand::print(raw_ostream &OS, ModuleSlotTracker &MST,
   }
   case MachineOperand::MO_Predicate: {
     auto Pred = static_cast<CmpInst::Predicate>(getPredicate());
-    OS << (CmpInst::isIntPredicate(Pred) ? "int" : "float") << "pred("
-       << Pred << ')';
+    OS << (CmpInst::isIntPredicate(Pred) ? "int" : "float") << "pred(" << Pred
+       << ')';
     break;
   }
-  case MachineOperand::MO_ShuffleMask:
+  case MachineOperand::MO_ShuffleMask: {
     OS << "shufflemask(";
     ArrayRef<int> Mask = getShuffleMask();
     StringRef Separator;
@@ -1038,6 +1043,14 @@ void MachineOperand::print(raw_ostream &OS, ModuleSlotTracker &MST,
     OS << ')';
     break;
   }
+  case MachineOperand::MO_LaneMask: {
+    OS << "lanemask(";
+    LaneBitmask LaneMask = getLaneMask();
+    OS << "0x" << PrintLaneMask(LaneMask);
+    OS << ')';
+    break;
+  }
+  }
 }
 
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
diff --git a/llvm/lib/CodeGen/MachineStableHash.cpp b/llvm/lib/CodeGen/MachineStableHash.cpp
index 6da708d51b95f..2f5f5aeccb2e4 100644
--- a/llvm/lib/CodeGen/MachineStableHash.cpp
+++ b/llvm/lib/CodeGen/MachineStableHash.cpp
@@ -165,6 +165,10 @@ stable_hash llvm::stableHashValue(const MachineOperand &MO) {
     return stable_hash_combine(MO.getType(), MO.getTargetFlags(),
                                stable_hash_name(SymbolName));
   }
+  case MachineOperand::MO_LaneMask: {
+    return stable_hash_combine(MO.getType(), MO.getTargetFlags(),
+                               MO.getLaneMask().getAsInteger());
+  }
   case MachineOperand::MO_CFIIndex:
     return stable_hash_combine(MO.getType(), MO.getTargetFlags(),
                                MO.getCFIIndex());
diff --git a/llvm/lib/CodeGen/MachineVerifier.cpp b/llvm/lib/CodeGen/MachineVerifier.cpp
index a2a66d6128348..9be741d96e456 100644
--- a/llvm/lib/CodeGen/MachineVerifier.cpp
+++ b/llvm/lib/CodeGen/MachineVerifier.cpp
@@ -2428,6 +2428,46 @@ void MachineVerifier::visitMachineInstrBefore(const MachineInstr *MI) {
     }
     break;
   }
+  case TargetOpcode::COPY_LANEMASK: {
+    const MachineOperand &DstOp = MI->getOperand(0);
+    const MachineOperand &SrcOp = MI->getOperand(1);
+    const MachineOperand &LaneMaskOp = MI->getOperand(2);
+    const Register SrcReg = SrcOp.getReg();
+    const LaneBitmask LaneMask = LaneMaskOp.getLaneMask();
+    LaneBitmask SrcMaxLaneMask = LaneBitmask::getAll();
+
+    if (DstOp.getSubReg())
+      report("COPY_LANEMASK must not use a subregister index", &DstOp, 0);
+
+    if (SrcOp.getSubReg())
+      report("COPY_LANEMASK must not use a subregister index", &SrcOp, 1);
+
+    if (LaneMask.none())
+      report("COPY_LANEMASK must read at least one lane", MI);
+
+    if (SrcReg.isPhysical()) {
+      const TargetRegisterClass *SrcRC = TRI->getMinimalPhysRegClass(SrcReg);
+      if (SrcRC)
+        SrcMaxLaneMask = SrcRC->getLaneMask();
+    } else {
+      SrcMaxLaneMask = MRI->getMaxLaneMaskForVReg(SrcReg);
+    }
+
+    // COPY_LANEMASK should be used only for partial copies. For a full
+    // copy, use the COPY instruction instead.
+    if (SrcMaxLaneMask == LaneMask)
+      report("COPY_LANEMASK cannot be used to do full copy", MI);
+
+    // If LaneMask is greater than the SrcMaxLaneMask, it implies
+    // COPY_LANEMASK is attempting to read from the lanes that
+    // don't exist in the source register.
+    if (SrcMaxLaneMask < LaneMask)
+      report("COPY_LANEMASK attempts to read from the lanes that "
+             "don't exist in the source register",
+             MI);
+
+    break;
+  }
   case TargetOpcode::STATEPOINT: {
     StatepointOpers SO(MI);
     if (!MI->getOperand(SO.getIDPos()).isImm() ||
diff --git a/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp b/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
index d738dc4eea36d..72c3c566163e2 100644
--- a/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
+++ b/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
@@ -17,6 +17,7 @@
 #include "llvm/Analysis/TargetLibraryInfo.h"
 #include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/CodeGen/ExpandVectorPredication.h"
+#include "llvm/CodeGen/LibcallLoweringInfo.h"
 #include "llvm/CodeGen/Passes.h"
 #include "llvm/CodeGen/TargetLowering.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
@@ -51,6 +52,7 @@ namespace {
 
 struct PreISelIntrinsicLowering {
   const TargetMachine *TM;
+  const LibcallLoweringModuleAnalysisResult &ModuleLibcalls;
   const function_ref<TargetTransformInfo &(Function &)> LookupTTI;
   const function_ref<TargetLibraryInfo &(Function &)> LookupTLI;
 
@@ -61,11 +63,13 @@ struct PreISelIntrinsicLowering {
 
   explicit PreISelIntrinsicLowering(
       const TargetMachine *TM_,
+      const LibcallLoweringModuleAnalysisResult &ModuleLibcalls_,
       function_ref<TargetTransformInfo &(Function &)> LookupTTI_,
       function_ref<TargetLibraryInfo &(Function &)> LookupTLI_,
       bool UseMemIntrinsicLibFunc_ = true)
-      : TM(TM_), LookupTTI(LookupTTI_), LookupTLI(LookupTLI_),
-        UseMemIntrinsicLibFunc(UseMemIntrinsicLibFunc_) {}
+      : TM(TM_), ModuleLibcalls(ModuleLibcalls_), LookupTTI(LookupTTI_),
+        LookupTLI(LookupTLI_), UseMemIntrinsicLibFunc(UseMemIntrinsicLibFunc_) {
+  }
 
   static bool shouldExpandMemIntrinsicWithSize(Value *Size,
                                                const TargetTransformInfo &TTI);
@@ -230,21 +234,26 @@ bool PreISelIntrinsicLowering::shouldExpandMemIntrinsicWithSize(
   return SizeVal > Threshold || Threshold == 0;
 }
 
-static bool canEmitLibcall(const TargetMachine *TM, Function *F,
-                           RTLIB::Libcall LC) {
+static bool
+canEmitLibcall(const LibcallLoweringModuleAnalysisResult &ModuleLowering,
+               const TargetMachine *TM, Function *F, RTLIB::Libcall LC) {
   // TODO: Should this consider the address space of the memcpy?
   if (!TM)
     return true;
-  const TargetLowering *TLI = TM->getSubtargetImpl(*F)->getTargetLowering();
-  return TLI->getLibcallName(LC) != nullptr;
+  const LibcallLoweringInfo &Lowering =
+      ModuleLowering.getLibcallLowering(*TM->getSubtargetImpl(*F));
+  return Lowering.getLibcallImpl(LC) != RTLIB::Unsupported;
 }
 
-static bool canEmitMemcpy(const TargetMachine *TM, Function *F) {
+static bool
+canEmitMemcpy(const LibcallLoweringModuleAnalysisResult &ModuleLowering,
+              const TargetMachine *TM, Function *F) {
   // TODO: Should this consider the address space of the memcpy?
   if (!TM)
     return true;
-  const TargetLowering *TLI = TM->getSubtargetImpl(*F)->getTargetLowering();
-  return TLI->getMemcpyImpl() != RTLIB::Unsupported;
+  const LibcallLoweringInfo &Lowering =
+      ModuleLowering.getLibcallLowering(*TM->getSubtargetImpl(*F));
+  return Lowering.getMemcpyImpl() != RTLIB::Unsupported;
 }
 
 // Return a value appropriate for use with the memset_pattern16 libcall, if
@@ -317,7 +326,8 @@ bool PreISelIntrinsicLowering::expandMemIntrinsicUses(
       Function *ParentFunc = Memcpy->getFunction();
       const TargetTransformInfo &TTI = LookupTTI(*ParentFunc);
       if (shouldExpandMemIntrinsicWithSize(Memcpy->getLength(), TTI)) {
-        if (UseMemIntrinsicLibFunc && canEmitMemcpy(TM, ParentFunc))
+        if (UseMemIntrinsicLibFunc &&
+            canEmitMemcpy(ModuleLibcalls, TM, ParentFunc))
           break;
 
         // TODO: For optsize, emit the loop into a separate function
@@ -349,7 +359,7 @@ bool PreISelIntrinsicLowering::expandMemIntrinsicUses(
       const TargetTransformInfo &TTI = LookupTTI(*ParentFunc);
       if (shouldExpandMemIntrinsicWithSize(Memmove->getLength(), TTI)) {
         if (UseMemIntrinsicLibFunc &&
-            canEmitLibcall(TM, ParentFunc, RTLIB::MEMMOVE))
+            canEmitLibcall(ModuleLibcalls, TM, ParentFunc, RTLIB::MEMMOVE))
           break;
 
         if (expandMemMoveAsLoop(Memmove, TTI)) {
@@ -366,7 +376,7 @@ bool PreISelIntrinsicLowering::expandMemIntrinsicUses(
       const TargetTransformInfo &TTI = LookupTTI(*ParentFunc);
       if (shouldExpandMemIntrinsicWithSize(Memset->getLength(), TTI)) {
         if (UseMemIntrinsicLibFunc &&
-            canEmitLibcall(TM, ParentFunc, RTLIB::MEMSET))
+            canEmitLibcall(ModuleLibcalls, TM, ParentFunc, RTLIB::MEMSET))
           break;
 
         expandMemSetAsLoop(Memset);
@@ -619,10 +629,14 @@ class PreISelIntrinsicLoweringLegacyPass : public ModulePass {
   void getAnalysisUsage(AnalysisUsage &AU) const override {
     AU.addRequired<TargetTransformInfoWrapperPass>();
     AU.addRequired<TargetLibraryInfoWrapperPass>();
+    AU.addRequired<LibcallLoweringInfoWrapper>();
     AU.addRequired<TargetPassConfig>();
   }
 
   bool runOnModule(Module &M) override {
+    const LibcallLoweringModuleAnalysisResult &ModuleLibcalls =
+        getAnalysis<LibcallLoweringInfoWrapper>().getResult();
+
     auto LookupTTI = [this](Function &F) -> TargetTransformInfo & {
       return this->getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
     };
@@ -631,7 +645,7 @@ class PreISelIntrinsicLoweringLegacyPass : public ModulePass {
     };
 
     const auto *TM = &getAnalysis<TargetPassConfig>().getTM<TargetMachine>();
-    PreISelIntrinsicLowering Lowering(TM, LookupTTI, LookupTLI);
+    PreISelIntrinsicLowering Lowering(TM, ModuleLibcalls, LookupTTI, LookupTLI);
     return Lowering.lowerIntrinsics(M);
   }
 };
@@ -643,6 +657,8 @@ char PreISelIntrinsicLoweringLegacyPass::ID;
 INITIALIZE_PASS_BEGIN(PreISelIntrinsicLoweringLegacyPass,
                       "pre-isel-intrinsic-lowering",
                       "Pre-ISel Intrinsic Lowering", false, false)
+INITIALIZE_PASS_DEPENDENCY(LibcallLoweringInfoWrapper)
+INITIALIZE_PASS_DEPENDENCY(RuntimeLibraryInfoWrapper)
 INITIALIZE_PASS_DEPENDENCY(TargetLibraryInfoWrapperPass)
 INITIALIZE_PASS_DEPENDENCY(TargetPassConfig)
 INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
@@ -654,9 +670,12 @@ ModulePass *llvm::createPreISelIntrinsicLoweringPass() {
   return new PreISelIntrinsicLoweringLegacyPass();
 }
 
-PreservedAnalyses PreISelIntrinsicLoweringPass::run(Module &M,
-                                                    ModuleAnalysisManager &AM) {
-  auto &FAM = AM.getResult<FunctionAnalysisManagerModuleProxy>(M).getManager();
+PreservedAnalyses
+PreISelIntrinsicLoweringPass::run(Module &M, ModuleAnalysisManager &MAM) {
+  const LibcallLoweringModuleAnalysisResult &LibcallLowering =
+      MAM.getResult<LibcallLoweringModuleAnalysis>(M);
+
+  auto &FAM = MAM.getResult<FunctionAnalysisManagerModuleProxy>(M).getManager();
 
   auto LookupTTI = [&FAM](Function &F) -> TargetTransformInfo & {
     return FAM.getResult<TargetIRAnalysis>(F);
@@ -665,7 +684,7 @@ PreservedAnalyses PreISelIntrinsicLoweringPass::run(Module &M,
     return FAM.getResult<TargetLibraryAnalysis>(F);
   };
 
-  PreISelIntrinsicLowering Lowering(TM, LookupTTI, LookupTLI);
+  PreISelIntrinsicLowering Lowering(TM, LibcallLowering, LookupTTI, LookupTLI);
   if (!Lowering.lowerIntrinsics(M))
     return PreservedAnalyses::all();
   else
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 6b79dbb46cadc..5377f22e5c61f 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -1065,8 +1065,9 @@ static bool isConstantSplatVectorMaskForType(SDNode *N, EVT ScalarTy) {
 
 // Determines if it is a constant integer or a splat/build vector of constant
 // integers (and undefs).
-// Do not permit build vector implicit truncation.
-static bool isConstantOrConstantVector(SDValue N, bool NoOpaques = false) {
+// Do not permit build vector implicit truncation unless AllowTruncation is set.
+static bool isConstantOrConstantVector(SDValue N, bool NoOpaques = false,
+                                       bool AllowTruncation = false) {
   if (ConstantSDNode *Const = dyn_cast<ConstantSDNode>(N))
     return !(Const->isOpaque() && NoOpaques);
   if (N.getOpcode() != ISD::BUILD_VECTOR && N.getOpcode() != ISD::SPLAT_VECTOR)
@@ -1076,8 +1077,13 @@ static bool isConstantOrConstantVector(SDValue N, bool NoOpaques = false) {
     if (Op.isUndef())
       continue;
     ConstantSDNode *Const = dyn_cast<ConstantSDNode>(Op);
-    if (!Const || Const->getAPIntValue().getBitWidth() != BitWidth ||
-        (Const->isOpaque() && NoOpaques))
+    if (!Const || (Const->isOpaque() && NoOpaques))
+      return false;
+    // When AllowTruncation is true, allow constants that have been promoted
+    // during type legalization as long as the value fits in the target type.
+    if ((AllowTruncation &&
+         Const->getAPIntValue().getActiveBits() > BitWidth) ||
+        (!AllowTruncation && Const->getAPIntValue().getBitWidth() != BitWidth))
       return false;
   }
   return true;
@@ -3288,6 +3294,9 @@ static SDValue getAsCarry(const TargetLowering &TLI, SDValue V,
 
   // First, peel away TRUNCATE/ZERO_EXTEND/AND nodes due to legalization.
   while (true) {
+    if (ForceCarryReconstruction && V.getValueType() == MVT::i1)
+      return V;
+
     if (V.getOpcode() == ISD::TRUNCATE || V.getOpcode() == ISD::ZERO_EXTEND) {
       V = V.getOperand(0);
       continue;
@@ -3302,9 +3311,6 @@ static SDValue getAsCarry(const TargetLowering &TLI, SDValue V,
       continue;
     }
 
-    if (ForceCarryReconstruction && V.getValueType() == MVT::i1)
-      return V;
-
     break;
   }
 
@@ -4983,7 +4989,7 @@ static bool isDivRemLibcallAvailable(SDNode *Node, bool isSigned,
   case MVT::i128: LC= isSigned ? RTLIB::SDIVREM_I128:RTLIB::UDIVREM_I128; break;
   }
 
-  return TLI.getLibcallName(LC) != nullptr;
+  return TLI.getLibcallImpl(LC) != RTLIB::Unsupported;
 }
 
 /// Issue divrem if both quotient and remainder are needed.
@@ -5322,7 +5328,8 @@ SDValue DAGCombiner::visitUDIVLike(SDValue N0, SDValue N1, SDNode *N) {
   EVT VT = N->getValueType(0);
 
   // fold (udiv x, (1 << c)) -> x >>u c
-  if (isConstantOrConstantVector(N1, /*NoOpaques*/ true)) {
+  if (isConstantOrConstantVector(N1, /*NoOpaques=*/true,
+                                 /*AllowTruncation=*/true)) {
     if (SDValue LogBase2 = BuildLogBase2(N1, DL)) {
       AddToWorklist(LogBase2.getNode());
 
@@ -5336,7 +5343,8 @@ SDValue DAGCombiner::visitUDIVLike(SDValue N0, SDValue N1, SDNode *N) {
   // fold (udiv x, (shl c, y)) -> x >>u (log2(c)+y) iff c is power of 2
   if (N1.getOpcode() == ISD::SHL) {
     SDValue N10 = N1.getOperand(0);
-    if (isConstantOrConstantVector(N10, /*NoOpaques*/ true)) {
+    if (isConstantOrConstantVector(N10, /*NoOpaques=*/true,
+                                   /*AllowTruncation=*/true)) {
       if (SDValue LogBase2 = BuildLogBase2(N10, DL)) {
         AddToWorklist(LogBase2.getNode());
 
@@ -5352,7 +5360,8 @@ SDValue DAGCombiner::visitUDIVLike(SDValue N0, SDValue N1, SDNode *N) {
 
   // fold (udiv x, c) -> alternate
   AttributeList Attr = DAG.getMachineFunction().getFunction().getAttributes();
-  if (isConstantOrConstantVector(N1) &&
+  if (isConstantOrConstantVector(N1, /*NoOpaques=*/false,
+                                 /*AllowTruncation=*/true) &&
       !TLI.isIntDivCheap(N->getValueType(0), Attr))
     if (SDValue Op = BuildUDIV(N))
       return Op;
@@ -19505,7 +19514,8 @@ SDValue DAGCombiner::visitFMinMax(SDNode *N) {
   const SDNodeFlags Flags = N->getFlags();
   unsigned Opc = N->getOpcode();
   bool PropAllNaNsToQNaNs = Opc == ISD::FMINIMUM || Opc == ISD::FMAXIMUM;
-  bool PropOnlySNaNsToQNaNs = Opc == ISD::FMINNUM || Opc == ISD::FMAXNUM;
+  bool ReturnsOtherForAllNaNs =
+      Opc == ISD::FMINIMUMNUM || Opc == ISD::FMAXIMUMNUM;
   bool IsMin =
       Opc == ISD::FMINNUM || Opc == ISD::FMINIMUM || Opc == ISD::FMINIMUMNUM;
   SelectionDAG::FlagInserter FlagsInserter(DAG, N);
@@ -19524,32 +19534,30 @@ SDValue DAGCombiner::visitFMinMax(SDNode *N) {
 
     // minnum(X, qnan) -> X
     // maxnum(X, qnan) -> X
-    // minnum(X, snan) -> qnan
-    // maxnum(X, snan) -> qnan
     // minimum(X, nan) -> qnan
     // maximum(X, nan) -> qnan
     // minimumnum(X, nan) -> X
     // maximumnum(X, nan) -> X
     if (AF.isNaN()) {
-      if (PropAllNaNsToQNaNs || (AF.isSignaling() && PropOnlySNaNsToQNaNs)) {
+      if (PropAllNaNsToQNaNs) {
         if (AF.isSignaling())
           return DAG.getConstantFP(AF.makeQuiet(), SDLoc(N), VT);
         return N->getOperand(1);
+      } else if (ReturnsOtherForAllNaNs || !AF.isSignaling()) {
+        return N->getOperand(0);
       }
-      return N->getOperand(0);
+      return SDValue();
     }
 
     // In the following folds, inf can be replaced with the largest finite
     // float, if the ninf flag is set.
     if (AF.isInfinity() || (Flags.hasNoInfs() && AF.isLargest())) {
-      // minnum(X, -inf) -> -inf (ignoring sNaN -> qNaN propagation)
-      // maxnum(X, +inf) -> +inf (ignoring sNaN -> qNaN propagation)
       // minimum(X, -inf) -> -inf if nnan
       // maximum(X, +inf) -> +inf if nnan
       // minimumnum(X, -inf) -> -inf
       // maximumnum(X, +inf) -> +inf
       if (IsMin == AF.isNegative() &&
-          (!PropAllNaNsToQNaNs || Flags.hasNoNaNs()))
+          (ReturnsOtherForAllNaNs || Flags.hasNoNaNs()))
         return N->getOperand(1);
 
       // minnum(X, +inf) -> X if nnan
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
index 99d14a60c6ed1..8336e1d1f4134 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
@@ -2381,8 +2381,19 @@ SelectionDAGLegalize::ExpandDivRemLibCall(SDNode *Node,
   Entry.IsZExt = !isSigned;
   Args.push_back(Entry);
 
-  SDValue Callee = DAG.getExternalSymbol(TLI.getLibcallName(LC),
-                                         TLI.getPointerTy(DAG.getDataLayout()));
+  RTLIB::LibcallImpl LibcallImpl = TLI.getLibcallImpl(LC);
+  if (LibcallImpl == RTLIB::Unsupported) {
+    DAG.getContext()->emitError(Twine("no libcall available for ") +
+                                Node->getOperationName(&DAG));
+    SDValue Poison = DAG.getPOISON(RetVT);
+    Results.push_back(Poison);
+    Results.push_back(Poison);
+    return;
+  }
+
+  SDValue Callee =
+      DAG.getExternalSymbol(TLI.getLibcallImplName(LibcallImpl).data(),
+                            TLI.getPointerTy(DAG.getDataLayout()));
 
   SDLoc dl(Node);
   TargetLowering::CallLoweringInfo CLI(DAG);
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index 4274e951446b8..6e1e02f38113e 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -1702,10 +1702,8 @@ void DAGTypeLegalizer::SplitVecRes_LOOP_DEPENDENCE_MASK(SDNode *N, SDValue &Lo,
   Lo = DAG.getNode(N->getOpcode(), DL, LoVT, PtrA, PtrB, N->getOperand(2));
 
   unsigned EltSize = N->getConstantOperandVal(2);
-  unsigned Offset = EltSize * HiVT.getVectorMinNumElements();
-  SDValue Addend = HiVT.isScalableVT()
-                       ? DAG.getVScale(DL, MVT::i64, APInt(64, Offset))
-                       : DAG.getConstant(Offset, DL, MVT::i64);
+  ElementCount Offset = HiVT.getVectorElementCount() * EltSize;
+  SDValue Addend = DAG.getElementCount(DL, MVT::i64, Offset);
 
   PtrA = DAG.getNode(ISD::ADD, DL, MVT::i64, PtrA, Addend);
   Hi = DAG.getNode(N->getOpcode(), DL, HiVT, PtrA, PtrB, N->getOperand(2));
@@ -3940,43 +3938,55 @@ SDValue DAGTypeLegalizer::SplitVecOp_EXTRACT_SUBVECTOR(SDNode *N) {
 
   GetSplitVector(N->getOperand(0), Lo, Hi);
 
-  uint64_t LoEltsMin = Lo.getValueType().getVectorMinNumElements();
-  uint64_t IdxVal = Idx->getAsZExtVal();
+  ElementCount LoElts = Lo.getValueType().getVectorElementCount();
+  // Note: For scalable vectors, the index is scaled by vscale.
+  ElementCount IdxVal =
+      ElementCount::get(Idx->getAsZExtVal(), SubVT.isScalableVector());
+  uint64_t IdxValMin = IdxVal.getKnownMinValue();
 
-  unsigned NumResultElts = SubVT.getVectorMinNumElements();
+  EVT SrcVT = N->getOperand(0).getValueType();
+  ElementCount NumResultElts = SubVT.getVectorElementCount();
 
-  if (IdxVal < LoEltsMin) {
-    // If the extracted elements are all in the low half, do a simple extract.
-    if (IdxVal + NumResultElts <= LoEltsMin)
-      return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, SubVT, Lo, Idx);
+  // If the extracted elements are all in the low half, do a simple extract.
+  if (ElementCount::isKnownLE(IdxVal + NumResultElts, LoElts))
+    return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, SubVT, Lo, Idx);
 
+  unsigned LoEltsMin = LoElts.getKnownMinValue();
+  if (IdxValMin < LoEltsMin && SubVT.isFixedLengthVector() &&
+      SrcVT.isFixedLengthVector()) {
     // Extracted subvector crosses vector split, so we need to blend the two
     // halves.
     // TODO: May be able to emit partial extract_subvector.
     SmallVector<SDValue, 8> Elts;
-    Elts.reserve(NumResultElts);
+    Elts.reserve(NumResultElts.getFixedValue());
 
-    DAG.ExtractVectorElements(Lo, Elts, /*Start=*/IdxVal,
-                              /*Count=*/LoEltsMin - IdxVal);
+    // This is not valid for scalable vectors. If SubVT is scalable, this is the
+    // same as unrolling a scalable dimension (invalid). If SrcVT is scalable,
+    // `Lo[LoEltsMin]` may not be the last element of `Lo`.
+    DAG.ExtractVectorElements(Lo, Elts, /*Start=*/IdxValMin,
+                              /*Count=*/LoEltsMin - IdxValMin);
     DAG.ExtractVectorElements(Hi, Elts, /*Start=*/0,
                               /*Count=*/SubVT.getVectorNumElements() -
                                   Elts.size());
     return DAG.getBuildVector(SubVT, dl, Elts);
   }
 
-  EVT SrcVT = N->getOperand(0).getValueType();
   if (SubVT.isScalableVector() == SrcVT.isScalableVector()) {
-    uint64_t ExtractIdx = IdxVal - LoEltsMin;
-    if (ExtractIdx % NumResultElts == 0)
-      return DAG.getExtractSubvector(dl, SubVT, Hi, ExtractIdx);
+    ElementCount ExtractIdx = IdxVal - LoElts;
+    if (ExtractIdx.isKnownMultipleOf(NumResultElts))
+      return DAG.getExtractSubvector(dl, SubVT, Hi,
+                                     ExtractIdx.getKnownMinValue());
 
-    // We cannot create an extract_subvector that isn't a multiple of the result
-    // size, which may go out of bounds for the last elements. Shuffle the
-    // desired elements down to 0 and do a simple 0 extract.
     EVT HiVT = Hi.getValueType();
+    assert(HiVT.isFixedLengthVector() &&
+           "Only fixed-vector extracts are supported in this case");
+
+    // We cannot create an extract_subvector that isn't a multiple of the
+    // result size, which may go out of bounds for the last elements. Shuffle
+    // the desired elements down to 0 and do a simple 0 extract.
     SmallVector<int, 8> Mask(HiVT.getVectorNumElements(), -1);
-    for (int I = 0; I != static_cast<int>(NumResultElts); ++I)
-      Mask[I] = ExtractIdx + I;
+    for (int I = 0; I != int(NumResultElts.getFixedValue()); ++I)
+      Mask[I] = int(ExtractIdx.getFixedValue()) + I;
 
     SDValue Shuffle =
         DAG.getVectorShuffle(HiVT, dl, Hi, DAG.getPOISON(HiVT), Mask);
@@ -6218,8 +6228,33 @@ SDValue DAGTypeLegalizer::WidenVecRes_EXTRACT_SUBVECTOR(SDNode *N) {
       return DAG.getNode(ISD::CONCAT_VECTORS, dl, WidenVT, Parts);
     }
 
-    report_fatal_error("Don't know how to widen the result of "
-                       "EXTRACT_SUBVECTOR for scalable vectors");
+    // Fall back to extracting through memory.
+
+    Align Alignment = DAG.getReducedAlign(InVT, /*UseABI=*/false);
+    SDValue StackPtr = DAG.CreateStackTemporary(InVT.getStoreSize(), Alignment);
+    MachineFunction &MF = DAG.getMachineFunction();
+    int FrameIndex = cast<FrameIndexSDNode>(StackPtr.getNode())->getIndex();
+    auto PtrInfo = MachinePointerInfo::getFixedStack(MF, FrameIndex);
+
+    MachineMemOperand *StoreMMO = MF.getMachineMemOperand(
+        PtrInfo, MachineMemOperand::MOStore,
+        LocationSize::beforeOrAfterPointer(), Alignment);
+    MachineMemOperand *LoadMMO = MF.getMachineMemOperand(
+        PtrInfo, MachineMemOperand::MOLoad,
+        LocationSize::beforeOrAfterPointer(), Alignment);
+
+    // Write out the input vector.
+    SDValue Ch = DAG.getStore(DAG.getEntryNode(), dl, InOp, StackPtr, StoreMMO);
+
+    // Build a mask to match the length of the non-widened result.
+    SDValue Mask =
+        DAG.getMaskFromElementCount(dl, WidenVT, VT.getVectorElementCount());
+
+    // Read back the sub-vector setting the remaining lanes to poison.
+    StackPtr = TLI.getVectorSubVecPointer(DAG, StackPtr, InVT, VT, Idx);
+    return DAG.getMaskedLoad(
+        WidenVT, dl, Ch, StackPtr, DAG.getUNDEF(StackPtr.getValueType()), Mask,
+        DAG.getPOISON(WidenVT), VT, LoadMMO, ISD::UNINDEXED, ISD::NON_EXTLOAD);
   }
 
   // We could try widening the input to the right length but for now, extract
@@ -6323,11 +6358,8 @@ SDValue DAGTypeLegalizer::WidenVecRes_LOAD(SDNode *N) {
   if (VT.isVector()) {
     // If all else fails replace the load with a wide masked load.
     SDLoc DL(N);
-    EVT IdxVT = TLI.getVectorIdxTy(DAG.getDataLayout());
-
-    SDValue Len = DAG.getElementCount(DL, IdxVT, VT.getVectorElementCount());
-    SDValue Mask = DAG.getNode(ISD::GET_ACTIVE_LANE_MASK, DL, WideMaskVT,
-                               DAG.getConstant(0, DL, IdxVT), Len);
+    SDValue Mask =
+        DAG.getMaskFromElementCount(DL, WideVT, VT.getVectorElementCount());
 
     SDValue NewLoad = DAG.getMaskedLoad(
         WideVT, DL, LD->getChain(), LD->getBasePtr(), LD->getOffset(), Mask,
@@ -7464,9 +7496,7 @@ SDValue DAGTypeLegalizer::WidenVecOp_INSERT_SUBVECTOR(SDNode *N) {
   SDValue InVec = N->getOperand(0);
 
   EVT OrigVT = SubVec.getValueType();
-  if (getTypeAction(SubVec.getValueType()) == TargetLowering::TypeWidenVector)
-    SubVec = GetWidenedVector(SubVec);
-
+  SubVec = GetWidenedVector(SubVec);
   EVT SubVT = SubVec.getValueType();
 
   // Whether or not all the elements of the widened SubVec will be inserted into
@@ -7488,17 +7518,52 @@ SDValue DAGTypeLegalizer::WidenVecOp_INSERT_SUBVECTOR(SDNode *N) {
     }
   }
 
+  if (!IndicesValid)
+    report_fatal_error(
+        "Don't know how to widen the operands for INSERT_SUBVECTOR");
+
   SDLoc DL(N);
 
   // We need to make sure that the indices are still valid, otherwise we might
   // widen what was previously well-defined to something undefined.
-  if (IndicesValid && InVec.isUndef() && N->getConstantOperandVal(2) == 0)
+  if (InVec.isUndef() && N->getConstantOperandVal(2) == 0)
     return DAG.getNode(ISD::INSERT_SUBVECTOR, DL, VT, InVec, SubVec,
                        N->getOperand(2));
 
-  if (!IndicesValid || OrigVT.isScalableVector())
-    report_fatal_error(
-        "Don't know how to widen the operands for INSERT_SUBVECTOR");
+  if (OrigVT.isScalableVector()) {
+    // Fall back to inserting through memory.
+
+    Align Alignment = DAG.getReducedAlign(VT, /*UseABI=*/false);
+    SDValue StackPtr = DAG.CreateStackTemporary(VT.getStoreSize(), Alignment);
+    MachineFunction &MF = DAG.getMachineFunction();
+    int FrameIndex = cast<FrameIndexSDNode>(StackPtr.getNode())->getIndex();
+    auto PtrInfo = MachinePointerInfo::getFixedStack(MF, FrameIndex);
+
+    MachineMemOperand *StoreMMO = MF.getMachineMemOperand(
+        PtrInfo, MachineMemOperand::MOStore,
+        LocationSize::beforeOrAfterPointer(), Alignment);
+    MachineMemOperand *LoadMMO = MF.getMachineMemOperand(
+        PtrInfo, MachineMemOperand::MOLoad,
+        LocationSize::beforeOrAfterPointer(), Alignment);
+
+    // Write out the vector being inserted into.
+    SDValue Ch =
+        DAG.getStore(DAG.getEntryNode(), DL, InVec, StackPtr, StoreMMO);
+
+    // Build a mask to match the length of the sub-vector.
+    SDValue Mask =
+        DAG.getMaskFromElementCount(DL, SubVT, OrigVT.getVectorElementCount());
+
+    // Overwrite the sub-vector at the required offset.
+    SDValue SubVecPtr =
+        TLI.getVectorSubVecPointer(DAG, StackPtr, VT, OrigVT, N->getOperand(2));
+    Ch = DAG.getMaskedStore(Ch, DL, SubVec, SubVecPtr,
+                            DAG.getUNDEF(SubVecPtr.getValueType()), Mask, VT,
+                            StoreMMO, ISD::UNINDEXED, ISD::NON_EXTLOAD);
+
+    // Read back the result.
+    return DAG.getLoad(VT, DL, Ch, StackPtr, LoadMMO);
+  }
 
   // If the operands can't be widened legally, just replace the INSERT_SUBVECTOR
   // with a series of INSERT_VECTOR_ELT
@@ -7577,12 +7642,9 @@ SDValue DAGTypeLegalizer::WidenVecOp_STORE(SDNode *N) {
   if (StVT.isVector()) {
     // If all else fails replace the store with a wide masked store.
     SDLoc DL(N);
-    EVT IdxVT = TLI.getVectorIdxTy(DAG.getDataLayout());
-
     SDValue WideStVal = GetWidenedVector(StVal);
-    SDValue Len = DAG.getElementCount(DL, IdxVT, StVT.getVectorElementCount());
-    SDValue Mask = DAG.getNode(ISD::GET_ACTIVE_LANE_MASK, DL, WideMaskVT,
-                               DAG.getConstant(0, DL, IdxVT), Len);
+    SDValue Mask =
+        DAG.getMaskFromElementCount(DL, WideVT, StVT.getVectorElementCount());
 
     return DAG.getMaskedStore(ST->getChain(), DL, WideStVal, ST->getBasePtr(),
                               ST->getOffset(), Mask, ST->getMemoryVT(),
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index 06735708d5369..b009e6a3d5f5f 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -2098,32 +2098,51 @@ SDValue SelectionDAG::getCondCode(ISD::CondCode Cond) {
   return SDValue(CondCodeNodes[Cond], 0);
 }
 
-SDValue SelectionDAG::getVScale(const SDLoc &DL, EVT VT, APInt MulImm,
-                                bool ConstantFold) {
+SDValue SelectionDAG::getVScale(const SDLoc &DL, EVT VT, APInt MulImm) {
   assert(MulImm.getBitWidth() == VT.getSizeInBits() &&
          "APInt size does not match type size!");
 
   if (MulImm == 0)
     return getConstant(0, DL, VT);
 
-  if (ConstantFold) {
-    const MachineFunction &MF = getMachineFunction();
-    const Function &F = MF.getFunction();
-    ConstantRange CR = getVScaleRange(&F, 64);
-    if (const APInt *C = CR.getSingleElement())
-      return getConstant(MulImm * C->getZExtValue(), DL, VT);
-  }
+  const MachineFunction &MF = getMachineFunction();
+  const Function &F = MF.getFunction();
+  ConstantRange CR = getVScaleRange(&F, 64);
+  if (const APInt *C = CR.getSingleElement())
+    return getConstant(MulImm * C->getZExtValue(), DL, VT);
 
   return getNode(ISD::VSCALE, DL, VT, getConstant(MulImm, DL, VT));
 }
 
-SDValue SelectionDAG::getElementCount(const SDLoc &DL, EVT VT, ElementCount EC,
-                                      bool ConstantFold) {
-  if (EC.isScalable())
-    return getVScale(DL, VT,
-                     APInt(VT.getSizeInBits(), EC.getKnownMinValue()));
+/// \returns a value of type \p VT that represents the runtime value of \p
+/// Quantity, i.e. scaled by vscale if it's scalable, or a fixed constant
+/// otherwise. Quantity should be a FixedOrScalableQuantity, i.e. ElementCount
+/// or TypeSize.
+template <typename Ty>
+static SDValue getFixedOrScalableQuantity(SelectionDAG &DAG, const SDLoc &DL,
+                                          EVT VT, Ty Quantity) {
+  if (Quantity.isScalable())
+    return DAG.getVScale(
+        DL, VT, APInt(VT.getSizeInBits(), Quantity.getKnownMinValue()));
+
+  return DAG.getConstant(Quantity.getKnownMinValue(), DL, VT);
+}
+
+SDValue SelectionDAG::getElementCount(const SDLoc &DL, EVT VT,
+                                      ElementCount EC) {
+  return getFixedOrScalableQuantity(*this, DL, VT, EC);
+}
 
-  return getConstant(EC.getKnownMinValue(), DL, VT);
+SDValue SelectionDAG::getTypeSize(const SDLoc &DL, EVT VT, TypeSize TS) {
+  return getFixedOrScalableQuantity(*this, DL, VT, TS);
+}
+
+SDValue SelectionDAG::getMaskFromElementCount(const SDLoc &DL, EVT DataVT,
+                                              ElementCount EC) {
+  EVT IdxVT = TLI->getVectorIdxTy(getDataLayout());
+  EVT MaskVT = TLI->getSetCCResultType(getDataLayout(), *getContext(), DataVT);
+  return getNode(ISD::GET_ACTIVE_LANE_MASK, DL, MaskVT,
+                 getConstant(0, DL, IdxVT), getElementCount(DL, IdxVT, EC));
 }
 
 SDValue SelectionDAG::getStepVector(const SDLoc &DL, EVT ResVT) {
@@ -7371,8 +7390,12 @@ SDValue SelectionDAG::foldConstantFPMath(unsigned Opcode, const SDLoc &DL,
       C1.copySign(C2);
       return getConstantFP(C1, DL, VT);
     case ISD::FMINNUM:
+      if (C1.isSignaling() || C2.isSignaling())
+        return SDValue();
       return getConstantFP(minnum(C1, C2), DL, VT);
     case ISD::FMAXNUM:
+      if (C1.isSignaling() || C2.isSignaling())
+        return SDValue();
       return getConstantFP(maxnum(C1, C2), DL, VT);
     case ISD::FMINIMUM:
       return getConstantFP(minimum(C1, C2), DL, VT);
@@ -8500,16 +8523,7 @@ static SDValue getMemsetStringVal(EVT VT, const SDLoc &dl, SelectionDAG &DAG,
 SDValue SelectionDAG::getMemBasePlusOffset(SDValue Base, TypeSize Offset,
                                            const SDLoc &DL,
                                            const SDNodeFlags Flags) {
-  EVT VT = Base.getValueType();
-  SDValue Index;
-
-  if (Offset.isScalable())
-    Index = getVScale(DL, Base.getValueType(),
-                      APInt(Base.getValueSizeInBits().getFixedValue(),
-                            Offset.getKnownMinValue()));
-  else
-    Index = getConstant(Offset.getFixedValue(), DL, VT);
-
+  SDValue Index = getTypeSize(DL, Base.getValueType(), Offset);
   return getMemBasePlusOffset(Base, Index, DL, Flags);
 }
 
@@ -9047,8 +9061,8 @@ static bool isInTailCallPositionWrapper(const CallInst *CI,
 std::pair<SDValue, SDValue>
 SelectionDAG::getMemcmp(SDValue Chain, const SDLoc &dl, SDValue Mem0,
                         SDValue Mem1, SDValue Size, const CallInst *CI) {
-  const char *LibCallName = TLI->getLibcallName(RTLIB::MEMCMP);
-  if (!LibCallName)
+  RTLIB::LibcallImpl MemcmpImpl = TLI->getLibcallImpl(RTLIB::MEMCMP);
+  if (MemcmpImpl == RTLIB::Unsupported)
     return {};
 
   PointerType *PT = PointerType::getUnqual(*getContext());
@@ -9061,13 +9075,14 @@ SelectionDAG::getMemcmp(SDValue Chain, const SDLoc &dl, SDValue Mem0,
   bool IsTailCall =
       isInTailCallPositionWrapper(CI, this, /*AllowReturnsFirstArg*/ true);
 
+  StringRef LibCallName = TLI->getLibcallImplName(MemcmpImpl);
   CLI.setDebugLoc(dl)
       .setChain(Chain)
-      .setLibCallee(
-          TLI->getLibcallCallingConv(RTLIB::MEMCMP),
-          Type::getInt32Ty(*getContext()),
-          getExternalSymbol(LibCallName, TLI->getPointerTy(getDataLayout())),
-          std::move(Args))
+      .setLibCallee(TLI->getLibcallImplCallingConv(MemcmpImpl),
+                    Type::getInt32Ty(*getContext()),
+                    getExternalSymbol(LibCallName.data(),
+                                      TLI->getPointerTy(getDataLayout())),
+                    std::move(Args))
       .setTailCall(IsTailCall);
 
   return TLI->LowerCallTo(CLI);
@@ -9077,8 +9092,8 @@ std::pair<SDValue, SDValue> SelectionDAG::getStrlen(SDValue Chain,
                                                     const SDLoc &dl,
                                                     SDValue Src,
                                                     const CallInst *CI) {
-  const char *LibCallName = TLI->getLibcallName(RTLIB::STRLEN);
-  if (!LibCallName)
+  RTLIB::LibcallImpl StrlenImpl = TLI->getLibcallImpl(RTLIB::STRLEN);
+  if (StrlenImpl == RTLIB::Unsupported)
     return {};
 
   // Emit a library call.
@@ -9088,13 +9103,15 @@ std::pair<SDValue, SDValue> SelectionDAG::getStrlen(SDValue Chain,
   TargetLowering::CallLoweringInfo CLI(*this);
   bool IsTailCall =
       isInTailCallPositionWrapper(CI, this, /*AllowReturnsFirstArg*/ true);
+  StringRef LibcallName = TLI->getLibcallImplName(StrlenImpl);
 
   CLI.setDebugLoc(dl)
       .setChain(Chain)
-      .setLibCallee(TLI->getLibcallCallingConv(RTLIB::STRLEN), CI->getType(),
-                    getExternalSymbol(
-                        LibCallName, TLI->getProgramPointerTy(getDataLayout())),
-                    std::move(Args))
+      .setLibCallee(
+          TLI->getLibcallImplCallingConv(StrlenImpl), CI->getType(),
+          getExternalSymbol(LibcallName.data(),
+                            TLI->getProgramPointerTy(getDataLayout())),
+          std::move(Args))
       .setTailCall(IsTailCall);
 
   return TLI->LowerCallTo(CLI);
@@ -9197,17 +9214,19 @@ SDValue SelectionDAG::getAtomicMemcpy(SDValue Chain, const SDLoc &dl,
 
   RTLIB::Libcall LibraryCall =
       RTLIB::getMEMCPY_ELEMENT_UNORDERED_ATOMIC(ElemSz);
-  if (LibraryCall == RTLIB::UNKNOWN_LIBCALL)
+  RTLIB::LibcallImpl LibcallImpl = TLI->getLibcallImpl(LibraryCall);
+  if (LibcallImpl == RTLIB::Unsupported)
     report_fatal_error("Unsupported element size");
 
   TargetLowering::CallLoweringInfo CLI(*this);
   CLI.setDebugLoc(dl)
       .setChain(Chain)
-      .setLibCallee(TLI->getLibcallCallingConv(LibraryCall),
-                    Type::getVoidTy(*getContext()),
-                    getExternalSymbol(TLI->getLibcallName(LibraryCall),
-                                      TLI->getPointerTy(getDataLayout())),
-                    std::move(Args))
+      .setLibCallee(
+          TLI->getLibcallImplCallingConv(LibcallImpl),
+          Type::getVoidTy(*getContext()),
+          getExternalSymbol(TLI->getLibcallImplName(LibcallImpl).data(),
+                            TLI->getPointerTy(getDataLayout())),
+          std::move(Args))
       .setDiscardResult()
       .setTailCall(isTailCall);
 
@@ -9303,17 +9322,19 @@ SDValue SelectionDAG::getAtomicMemmove(SDValue Chain, const SDLoc &dl,
 
   RTLIB::Libcall LibraryCall =
       RTLIB::getMEMMOVE_ELEMENT_UNORDERED_ATOMIC(ElemSz);
-  if (LibraryCall == RTLIB::UNKNOWN_LIBCALL)
+  RTLIB::LibcallImpl LibcallImpl = TLI->getLibcallImpl(LibraryCall);
+  if (LibcallImpl == RTLIB::Unsupported)
     report_fatal_error("Unsupported element size");
 
   TargetLowering::CallLoweringInfo CLI(*this);
   CLI.setDebugLoc(dl)
       .setChain(Chain)
-      .setLibCallee(TLI->getLibcallCallingConv(LibraryCall),
-                    Type::getVoidTy(*getContext()),
-                    getExternalSymbol(TLI->getLibcallName(LibraryCall),
-                                      TLI->getPointerTy(getDataLayout())),
-                    std::move(Args))
+      .setLibCallee(
+          TLI->getLibcallImplCallingConv(LibcallImpl),
+          Type::getVoidTy(*getContext()),
+          getExternalSymbol(TLI->getLibcallImplName(LibcallImpl).data(),
+                            TLI->getPointerTy(getDataLayout())),
+          std::move(Args))
       .setDiscardResult()
       .setTailCall(isTailCall);
 
@@ -9374,27 +9395,32 @@ SDValue SelectionDAG::getMemset(SDValue Chain, const SDLoc &dl, SDValue Dst,
   // FIXME: pass in SDLoc
   CLI.setDebugLoc(dl).setChain(Chain);
 
-  const char *BzeroName = getTargetLoweringInfo().getLibcallName(RTLIB::BZERO);
+  RTLIB::LibcallImpl BzeroImpl = TLI->getLibcallImpl(RTLIB::BZERO);
+  bool UseBZero = BzeroImpl != RTLIB::Unsupported && isNullConstant(Src);
 
-  bool UseBZero = isNullConstant(Src) && BzeroName;
   // If zeroing out and bzero is present, use it.
   if (UseBZero) {
     TargetLowering::ArgListTy Args;
     Args.emplace_back(Dst, PointerType::getUnqual(Ctx));
     Args.emplace_back(Size, DL.getIntPtrType(Ctx));
     CLI.setLibCallee(
-        TLI->getLibcallCallingConv(RTLIB::BZERO), Type::getVoidTy(Ctx),
-        getExternalSymbol(BzeroName, TLI->getPointerTy(DL)), std::move(Args));
+        TLI->getLibcallImplCallingConv(BzeroImpl), Type::getVoidTy(Ctx),
+        getExternalSymbol(TLI->getLibcallImplName(BzeroImpl).data(),
+                          TLI->getPointerTy(DL)),
+        std::move(Args));
   } else {
+    RTLIB::LibcallImpl MemsetImpl = TLI->getLibcallImpl(RTLIB::MEMSET);
+
     TargetLowering::ArgListTy Args;
     Args.emplace_back(Dst, PointerType::getUnqual(Ctx));
     Args.emplace_back(Src, Src.getValueType().getTypeForEVT(Ctx));
     Args.emplace_back(Size, DL.getIntPtrType(Ctx));
-    CLI.setLibCallee(TLI->getLibcallCallingConv(RTLIB::MEMSET),
-                     Dst.getValueType().getTypeForEVT(Ctx),
-                     getExternalSymbol(TLI->getLibcallName(RTLIB::MEMSET),
-                                       TLI->getPointerTy(DL)),
-                     std::move(Args));
+    CLI.setLibCallee(
+        TLI->getLibcallImplCallingConv(MemsetImpl),
+        Dst.getValueType().getTypeForEVT(Ctx),
+        getExternalSymbol(TLI->getLibcallImplName(MemsetImpl).data(),
+                          TLI->getPointerTy(DL)),
+        std::move(Args));
   }
 
   RTLIB::LibcallImpl MemsetImpl = TLI->getLibcallImpl(RTLIB::MEMSET);
@@ -9426,17 +9452,19 @@ SDValue SelectionDAG::getAtomicMemset(SDValue Chain, const SDLoc &dl,
 
   RTLIB::Libcall LibraryCall =
       RTLIB::getMEMSET_ELEMENT_UNORDERED_ATOMIC(ElemSz);
-  if (LibraryCall == RTLIB::UNKNOWN_LIBCALL)
+  RTLIB::LibcallImpl LibcallImpl = TLI->getLibcallImpl(LibraryCall);
+  if (LibcallImpl == RTLIB::Unsupported)
     report_fatal_error("Unsupported element size");
 
   TargetLowering::CallLoweringInfo CLI(*this);
   CLI.setDebugLoc(dl)
       .setChain(Chain)
-      .setLibCallee(TLI->getLibcallCallingConv(LibraryCall),
-                    Type::getVoidTy(*getContext()),
-                    getExternalSymbol(TLI->getLibcallName(LibraryCall),
-                                      TLI->getPointerTy(getDataLayout())),
-                    std::move(Args))
+      .setLibCallee(
+          TLI->getLibcallImplCallingConv(LibcallImpl),
+          Type::getVoidTy(*getContext()),
+          getExternalSymbol(TLI->getLibcallImplName(LibcallImpl).data(),
+                            TLI->getPointerTy(getDataLayout())),
+          std::move(Args))
       .setDiscardResult()
       .setTailCall(isTailCall);
 
@@ -13585,11 +13613,8 @@ std::pair<SDValue, SDValue> SelectionDAG::SplitEVL(SDValue N, EVT VecVT,
   EVT VT = N.getValueType();
   assert(VecVT.getVectorElementCount().isKnownEven() &&
          "Expecting the mask to be an evenly-sized vector");
-  unsigned HalfMinNumElts = VecVT.getVectorMinNumElements() / 2;
-  SDValue HalfNumElts =
-      VecVT.isFixedLengthVector()
-          ? getConstant(HalfMinNumElts, DL, VT)
-          : getVScale(DL, VT, APInt(VT.getScalarSizeInBits(), HalfMinNumElts));
+  SDValue HalfNumElts = getElementCount(
+      DL, VT, VecVT.getVectorElementCount().divideCoefficientBy(2));
   SDValue Lo = getNode(ISD::UMIN, DL, VT, N, HalfNumElts);
   SDValue Hi = getNode(ISD::USUBSAT, DL, VT, N, HalfNumElts);
   return std::make_pair(Lo, Hi);
@@ -14166,13 +14191,18 @@ SDValue SelectionDAG::makeStateFunctionCall(unsigned LibFunc, SDValue Ptr,
   assert(InChain.getValueType() == MVT::Other && "Expected token chain");
   TargetLowering::ArgListTy Args;
   Args.emplace_back(Ptr, Ptr.getValueType().getTypeForEVT(*getContext()));
-  RTLIB::Libcall LC = static_cast<RTLIB::Libcall>(LibFunc);
-  SDValue Callee = getExternalSymbol(TLI->getLibcallName(LC),
-                                     TLI->getPointerTy(getDataLayout()));
+  RTLIB::LibcallImpl LibcallImpl =
+      TLI->getLibcallImpl(static_cast<RTLIB::Libcall>(LibFunc));
+  if (LibcallImpl == RTLIB::Unsupported)
+    reportFatalUsageError("emitting call to unsupported libcall");
+
+  SDValue Callee =
+      getExternalSymbol(TLI->getLibcallImplName(LibcallImpl).data(),
+                        TLI->getPointerTy(getDataLayout()));
   TargetLowering::CallLoweringInfo CLI(*this);
   CLI.setDebugLoc(DLoc).setChain(InChain).setLibCallee(
-      TLI->getLibcallCallingConv(LC), Type::getVoidTy(*getContext()), Callee,
-      std::move(Args));
+      TLI->getLibcallImplCallingConv(LibcallImpl),
+      Type::getVoidTy(*getContext()), Callee, std::move(Args));
   return TLI->LowerCallTo(CLI).second;
 }
 
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 53d73ad618bd1..2caf847370383 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -4584,17 +4584,9 @@ void SelectionDAGBuilder::visitAlloca(const AllocaInst &I) {
   if (AllocSize.getValueType() != IntPtr)
     AllocSize = DAG.getZExtOrTrunc(AllocSize, dl, IntPtr);
 
-  if (TySize.isScalable())
-    AllocSize = DAG.getNode(ISD::MUL, dl, IntPtr, AllocSize,
-                            DAG.getVScale(dl, IntPtr,
-                                          APInt(IntPtr.getScalarSizeInBits(),
-                                                TySize.getKnownMinValue())));
-  else {
-    SDValue TySizeValue =
-        DAG.getConstant(TySize.getFixedValue(), dl, MVT::getIntegerVT(64));
-    AllocSize = DAG.getNode(ISD::MUL, dl, IntPtr, AllocSize,
-                            DAG.getZExtOrTrunc(TySizeValue, dl, IntPtr));
-  }
+  AllocSize = DAG.getNode(
+      ISD::MUL, dl, IntPtr, AllocSize,
+      DAG.getZExtOrTrunc(DAG.getTypeSize(dl, MVT::i64, TySize), dl, IntPtr));
 
   // Handle alignment.  If the requested alignment is less than or equal to
   // the stack alignment, ignore it.  If the size is greater than or equal to
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index 521d8f07434e6..1e71937372159 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -185,12 +185,12 @@ TargetLowering::makeLibCall(SelectionDAG &DAG, RTLIB::Libcall LC, EVT RetVT,
     Args.push_back(Entry);
   }
 
-  const char *LibcallName = getLibcallName(LC);
-  if (LC == RTLIB::UNKNOWN_LIBCALL || !LibcallName)
+  RTLIB::LibcallImpl LibcallImpl = getLibcallImpl(LC);
+  if (LibcallImpl == RTLIB::Unsupported)
     reportFatalInternalError("unsupported library call operation");
 
-  SDValue Callee =
-      DAG.getExternalSymbol(LibcallName, getPointerTy(DAG.getDataLayout()));
+  SDValue Callee = DAG.getExternalSymbol(getLibcallImplName(LibcallImpl).data(),
+                                         getPointerTy(DAG.getDataLayout()));
 
   Type *RetTy = RetVT.getTypeForEVT(*DAG.getContext());
   Type *OrigRetTy = RetTy;
@@ -206,8 +206,8 @@ TargetLowering::makeLibCall(SelectionDAG &DAG, RTLIB::Libcall LC, EVT RetVT,
 
   CLI.setDebugLoc(dl)
       .setChain(InChain)
-      .setLibCallee(getLibcallCallingConv(LC), RetTy, OrigRetTy, Callee,
-                    std::move(Args))
+      .setLibCallee(getLibcallImplCallingConv(LibcallImpl), RetTy, OrigRetTy,
+                    Callee, std::move(Args))
       .setNoReturn(CallOptions.DoesNotReturn)
       .setDiscardResult(!CallOptions.IsReturnValueUsed)
       .setIsPostTypeLegalization(CallOptions.IsPostTypeLegalization)
@@ -6738,7 +6738,9 @@ SDValue TargetLowering::BuildUDIV(SDNode *N, SelectionDAG &DAG,
   auto BuildUDIVPattern = [&](ConstantSDNode *C) {
     if (C->isZero())
       return false;
-    const APInt& Divisor = C->getAPIntValue();
+    // Truncate the divisor to the target scalar type in case it was promoted
+    // during type legalization.
+    APInt Divisor = C->getAPIntValue().trunc(EltBits);
 
     SDValue PreShift, MagicFactor, NPQFactor, PostShift;
 
@@ -6779,7 +6781,8 @@ SDValue TargetLowering::BuildUDIV(SDNode *N, SelectionDAG &DAG,
   };
 
   // Collect the shifts/magic values from each element.
-  if (!ISD::matchUnaryPredicate(N1, BuildUDIVPattern))
+  if (!ISD::matchUnaryPredicate(N1, BuildUDIVPattern, /*AllowUndefs=*/false,
+                                /*AllowTruncation=*/true))
     return SDValue();
 
   SDValue PreShift, PostShift, MagicFactor, NPQFactor;
@@ -10628,12 +10631,8 @@ TargetLowering::IncrementMemoryAddress(SDValue Addr, SDValue Mask,
                                     AddrVT);
     Increment = DAG.getZExtOrTrunc(Increment, DL, AddrVT);
     Increment = DAG.getNode(ISD::MUL, DL, AddrVT, Increment, Scale);
-  } else if (DataVT.isScalableVector()) {
-    Increment = DAG.getVScale(DL, AddrVT,
-                              APInt(AddrVT.getFixedSizeInBits(),
-                                    DataVT.getStoreSize().getKnownMinValue()));
   } else
-    Increment = DAG.getConstant(DataVT.getStoreSize(), DL, AddrVT);
+    Increment = DAG.getTypeSize(DL, AddrVT, DataVT.getStoreSize());
 
   return DAG.getNode(ISD::ADD, DL, AddrVT, Addr, Increment);
 }
@@ -11125,7 +11124,8 @@ void TargetLowering::forceExpandWideMUL(SelectionDAG &DAG, const SDLoc &dl,
   else if (WideVT == MVT::i128)
     LC = RTLIB::MUL_I128;
 
-  if (LC == RTLIB::UNKNOWN_LIBCALL || !getLibcallName(LC)) {
+  RTLIB::LibcallImpl LibcallImpl = getLibcallImpl(LC);
+  if (LibcallImpl == RTLIB::Unsupported) {
     forceExpandMultiply(DAG, dl, Signed, Lo, Hi, LHS, RHS);
     return;
   }
@@ -11926,10 +11926,8 @@ SDValue TargetLowering::expandVectorSplice(SDNode *Node,
   // Store the lo part of CONCAT_VECTORS(V1, V2)
   SDValue StoreV1 = DAG.getStore(DAG.getEntryNode(), DL, V1, StackPtr, PtrInfo);
   // Store the hi part of CONCAT_VECTORS(V1, V2)
-  SDValue OffsetToV2 = DAG.getVScale(
-      DL, PtrVT,
-      APInt(PtrVT.getFixedSizeInBits(), VT.getStoreSize().getKnownMinValue()));
-  SDValue StackPtr2 = DAG.getNode(ISD::ADD, DL, PtrVT, StackPtr, OffsetToV2);
+  SDValue VTBytes = DAG.getTypeSize(DL, PtrVT, VT.getStoreSize());
+  SDValue StackPtr2 = DAG.getNode(ISD::ADD, DL, PtrVT, StackPtr, VTBytes);
   SDValue StoreV2 = DAG.getStore(StoreV1, DL, V2, StackPtr2, PtrInfo);
 
   if (Imm >= 0) {
@@ -11948,13 +11946,8 @@ SDValue TargetLowering::expandVectorSplice(SDNode *Node,
   SDValue TrailingBytes =
       DAG.getConstant(TrailingElts * EltByteSize, DL, PtrVT);
 
-  if (TrailingElts > VT.getVectorMinNumElements()) {
-    SDValue VLBytes =
-        DAG.getVScale(DL, PtrVT,
-                      APInt(PtrVT.getFixedSizeInBits(),
-                            VT.getStoreSize().getKnownMinValue()));
-    TrailingBytes = DAG.getNode(ISD::UMIN, DL, PtrVT, TrailingBytes, VLBytes);
-  }
+  if (TrailingElts > VT.getVectorMinNumElements())
+    TrailingBytes = DAG.getNode(ISD::UMIN, DL, PtrVT, TrailingBytes, VTBytes);
 
   // Calculate the start address of the spliced result.
   StackPtr2 = DAG.getNode(ISD::SUB, DL, PtrVT, StackPtr2, TrailingBytes);
diff --git a/llvm/lib/DebugInfo/DWARF/DWARFAcceleratorTable.cpp b/llvm/lib/DebugInfo/DWARF/DWARFAcceleratorTable.cpp
index ea336378bebb3..cf5b7fb650b43 100644
--- a/llvm/lib/DebugInfo/DWARF/DWARFAcceleratorTable.cpp
+++ b/llvm/lib/DebugInfo/DWARF/DWARFAcceleratorTable.cpp
@@ -321,17 +321,29 @@ void AppleAcceleratorTable::Iterator::prepareNextEntryOrEnd() {
 }
 
 void AppleAcceleratorTable::Iterator::prepareNextStringOrEnd() {
-  std::optional<uint32_t> StrOffset = getTable().readStringOffsetAt(Offset);
+  const AppleAcceleratorTable &Table = getTable();
+  if (Offset == 0) {
+    // Always start looking for strings using a valid offset from the Offsets
+    // table. Entries are not always consecutive.
+    std::optional<uint64_t> OptOffset = Table.readIthOffset(OffsetIdx++);
+    if (!OptOffset)
+      return setToEnd();
+    Offset = *OptOffset;
+  }
+  std::optional<uint32_t> StrOffset = Table.readStringOffsetAt(Offset);
   if (!StrOffset)
     return setToEnd();
 
-  // A zero denotes the end of the collision list. Read the next string
-  // again.
-  if (*StrOffset == 0)
+  // A zero denotes the end of the collision list. Reset Offset to zero so
+  // that the recursive call below grabs the next entry from the offsets
+  // table.
+  if (*StrOffset == 0) {
+    Offset = 0;
     return prepareNextStringOrEnd();
+  }
   Current.StrOffset = *StrOffset;
 
-  std::optional<uint32_t> MaybeNumEntries = getTable().readU32FromAccel(Offset);
+  std::optional<uint32_t> MaybeNumEntries = Table.readU32FromAccel(Offset);
   if (!MaybeNumEntries || *MaybeNumEntries == 0)
     return setToEnd();
   NumEntriesToCome = *MaybeNumEntries;
@@ -339,7 +351,7 @@ void AppleAcceleratorTable::Iterator::prepareNextStringOrEnd() {
 
 AppleAcceleratorTable::Iterator::Iterator(const AppleAcceleratorTable &Table,
                                           bool SetEnd)
-    : Current(Table), Offset(Table.getEntriesBase()), NumEntriesToCome(0) {
+    : Current(Table), Offset(0), NumEntriesToCome(0) {
   if (SetEnd)
     setToEnd();
   else
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index cf88c4309974f..6f73d0c8dbfa2 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -682,6 +682,47 @@ OpenMPIRBuilder::getOrCreateRuntimeFunction(Module &M, RuntimeFunction FnID) {
   return {FnTy, Fn};
 }
 
+Expected<BasicBlock *>
+OpenMPIRBuilder::FinalizationInfo::getFiniBB(IRBuilderBase &Builder) {
+  if (!FiniBB) {
+    Function *ParentFunc = Builder.GetInsertBlock()->getParent();
+    IRBuilderBase::InsertPointGuard Guard(Builder);
+    FiniBB = BasicBlock::Create(Builder.getContext(), ".fini", ParentFunc);
+    Builder.SetInsertPoint(FiniBB);
+    // FiniCB adds the branch to the exit stub.
+    if (Error Err = FiniCB(Builder.saveIP()))
+      return Err;
+  }
+  return FiniBB;
+}
+
+Error OpenMPIRBuilder::FinalizationInfo::mergeFiniBB(IRBuilderBase &Builder,
+                                                     BasicBlock *OtherFiniBB) {
+  // Simple case: FiniBB does not exist yet; re-use OtherFiniBB.
+  if (!FiniBB) {
+    FiniBB = OtherFiniBB;
+
+    Builder.SetInsertPoint(FiniBB->getFirstNonPHIIt());
+    if (Error Err = FiniCB(Builder.saveIP()))
+      return Err;
+
+    return Error::success();
+  }
+
+  // Move instructions from FiniBB to the start of OtherFiniBB.
+  auto EndIt = FiniBB->end();
+  if (FiniBB->size() >= 1)
+    if (auto Prev = std::prev(EndIt); Prev->isTerminator())
+      EndIt = Prev;
+  OtherFiniBB->splice(OtherFiniBB->getFirstNonPHIIt(), FiniBB, FiniBB->begin(),
+                      EndIt);
+
+  FiniBB->replaceAllUsesWith(OtherFiniBB);
+  FiniBB->eraseFromParent();
+  FiniBB = OtherFiniBB;
+  return Error::success();
+}
+
 Function *OpenMPIRBuilder::getOrCreateRuntimeFunctionPtr(RuntimeFunction FnID) {
   FunctionCallee RTLFn = getOrCreateRuntimeFunction(M, FnID);
   auto *Fn = dyn_cast<llvm::Function>(RTLFn.getCallee());
@@ -1108,8 +1149,20 @@ OpenMPIRBuilder::createCancel(const LocationDescription &Loc,
   auto *UI = Builder.CreateUnreachable();
 
   Instruction *ThenTI = UI, *ElseTI = nullptr;
-  if (IfCondition)
+  if (IfCondition) {
     SplitBlockAndInsertIfThenElse(IfCondition, UI, &ThenTI, &ElseTI);
+
+    // Even if the if condition evaluates to false, this should still count
+    // as a cancellation point.
+    Builder.SetInsertPoint(ElseTI);
+    auto ElseIP = Builder.saveIP();
+
+    InsertPointOrErrorTy IPOrErr = createCancellationPoint(
+        LocationDescription{ElseIP, Loc.DL}, CanceledDirective);
+    if (!IPOrErr)
+      return IPOrErr;
+  }
+
   Builder.SetInsertPoint(ThenTI);
 
   Value *CancelKind = nullptr;
@@ -1129,21 +1182,9 @@ OpenMPIRBuilder::createCancel(const LocationDescription &Loc,
   Value *Args[] = {Ident, getOrCreateThreadID(Ident), CancelKind};
   Value *Result = createRuntimeFunctionCall(
       getOrCreateRuntimeFunctionPtr(OMPRTL___kmpc_cancel), Args);
-  auto ExitCB = [this, CanceledDirective, Loc](InsertPointTy IP) -> Error {
-    if (CanceledDirective == OMPD_parallel) {
-      IRBuilder<>::InsertPointGuard IPG(Builder);
-      Builder.restoreIP(IP);
-      return createBarrier(LocationDescription(Builder.saveIP(), Loc.DL),
-                           omp::Directive::OMPD_unknown,
-                           /* ForceSimpleCall */ false,
-                           /* CheckCancelFlag */ false)
-          .takeError();
-    }
-    return Error::success();
-  };
 
   // The actual cancel logic is shared with others, e.g., cancel_barriers.
-  if (Error Err = emitCancelationCheckImpl(Result, CanceledDirective, ExitCB))
+  if (Error Err = emitCancelationCheckImpl(Result, CanceledDirective))
     return Err;
 
   // Update the insertion point and remove the terminator we introduced.
@@ -1180,21 +1221,9 @@ OpenMPIRBuilder::createCancellationPoint(const LocationDescription &Loc,
   Value *Args[] = {Ident, getOrCreateThreadID(Ident), CancelKind};
   Value *Result = createRuntimeFunctionCall(
       getOrCreateRuntimeFunctionPtr(OMPRTL___kmpc_cancellationpoint), Args);
-  auto ExitCB = [this, CanceledDirective, Loc](InsertPointTy IP) -> Error {
-    if (CanceledDirective == OMPD_parallel) {
-      IRBuilder<>::InsertPointGuard IPG(Builder);
-      Builder.restoreIP(IP);
-      return createBarrier(LocationDescription(Builder.saveIP(), Loc.DL),
-                           omp::Directive::OMPD_unknown,
-                           /* ForceSimpleCall */ false,
-                           /* CheckCancelFlag */ false)
-          .takeError();
-    }
-    return Error::success();
-  };
 
   // The actual cancel logic is shared with others, e.g., cancel_barriers.
-  if (Error Err = emitCancelationCheckImpl(Result, CanceledDirective, ExitCB))
+  if (Error Err = emitCancelationCheckImpl(Result, CanceledDirective))
     return Err;
 
   // Update the insertion point and remove the terminator we introduced.
@@ -1298,8 +1327,7 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::emitKernelLaunch(
 }
 
 Error OpenMPIRBuilder::emitCancelationCheckImpl(
-    Value *CancelFlag, omp::Directive CanceledDirective,
-    FinalizeCallbackTy ExitCB) {
+    Value *CancelFlag, omp::Directive CanceledDirective) {
   assert(isLastFinalizationInfoCancellable(CanceledDirective) &&
          "Unexpected cancellation!");
 
@@ -1326,13 +1354,12 @@ Error OpenMPIRBuilder::emitCancelationCheckImpl(
 
   // From the cancellation block we finalize all variables and go to the
   // post finalization block that is known to the FiniCB callback.
-  Builder.SetInsertPoint(CancellationBlock);
-  if (ExitCB)
-    if (Error Err = ExitCB(Builder.saveIP()))
-      return Err;
   auto &FI = FinalizationStack.back();
-  if (Error Err = FI.FiniCB(Builder.saveIP()))
-    return Err;
+  Expected<BasicBlock *> FiniBBOrErr = FI.getFiniBB(Builder);
+  if (!FiniBBOrErr)
+    return FiniBBOrErr.takeError();
+  Builder.SetInsertPoint(CancellationBlock);
+  Builder.CreateBr(*FiniBBOrErr);
 
   // The continuation block is where code generation continues.
   Builder.SetInsertPoint(NonCancellationBlock, NonCancellationBlock->begin());
@@ -1821,8 +1848,18 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::createParallel(
   Instruction *PRegPreFiniTI = PRegPreFiniBB->getTerminator();
 
   InsertPointTy PreFiniIP(PRegPreFiniBB, PRegPreFiniTI->getIterator());
-  if (Error Err = FiniCB(PreFiniIP))
-    return Err;
+  Expected<BasicBlock *> FiniBBOrErr = FiniInfo.getFiniBB(Builder);
+  if (!FiniBBOrErr)
+    return FiniBBOrErr.takeError();
+  {
+    IRBuilderBase::InsertPointGuard Guard(Builder);
+    Builder.restoreIP(PreFiniIP);
+    Builder.CreateBr(*FiniBBOrErr);
+    // There's currently a branch to omp.par.exit. Delete it; we will get
+    // there via the fini block instead.
+    if (Instruction *Term = Builder.GetInsertBlock()->getTerminator())
+      Term->eraseFromParent();
+  }
 
   // Register the outlined info.
   addOutlineInfo(std::move(OI));
@@ -2258,23 +2295,7 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::createSections(
   if (!updateToLocation(Loc))
     return Loc.IP;
 
-  // FiniCBWrapper needs to create a branch to the loop finalization block, but
-  // this has not been created yet at some times when this callback runs.
-  SmallVector<BranchInst *> CancellationBranches;
-  auto FiniCBWrapper = [&](InsertPointTy IP) {
-    if (IP.getBlock()->end() != IP.getPoint())
-      return FiniCB(IP);
-    // This must be done otherwise any nested constructs using FinalizeOMPRegion
-    // will fail because that function requires the Finalization Basic Block to
-    // have a terminator, which is already removed by EmitOMPRegionBody.
-    // IP is currently at cancelation block.
-    BranchInst *DummyBranch = Builder.CreateBr(IP.getBlock());
-    IP = InsertPointTy(DummyBranch->getParent(), DummyBranch->getIterator());
-    CancellationBranches.push_back(DummyBranch);
-    return FiniCB(IP);
-  };
-
-  FinalizationStack.push_back({FiniCBWrapper, OMPD_sections, IsCancellable});
+  FinalizationStack.push_back({FiniCB, OMPD_sections, IsCancellable});
 
   // Each section is emitted as a switch case
   // Each finalization callback is handled from clang.EmitOMPSectionDirective()
@@ -2340,20 +2361,8 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::createSections(
   auto FiniInfo = FinalizationStack.pop_back_val();
   assert(FiniInfo.DK == OMPD_sections &&
          "Unexpected finalization stack state!");
-  if (FinalizeCallbackTy &CB = FiniInfo.FiniCB) {
-    Builder.restoreIP(AfterIP);
-    BasicBlock *FiniBB =
-        splitBBWithSuffix(Builder, /*CreateBranch=*/true, "sections.fini");
-    if (Error Err = CB(Builder.saveIP()))
-      return Err;
-    AfterIP = {FiniBB, FiniBB->begin()};
-  }
-
-  // Now we can fix the dummy branch to point to the right place
-  for (BranchInst *DummyBranch : CancellationBranches) {
-    assert(DummyBranch->getNumSuccessors() == 1);
-    DummyBranch->setSuccessor(0, LoopFini);
-  }
+  if (Error Err = FiniInfo.mergeFiniBB(Builder, LoopFini))
+    return Err;
 
   return AfterIP;
 }
@@ -3158,9 +3167,9 @@ Expected<Function *> OpenMPIRBuilder::emitShuffleAndReduceFunction(
   return SarFunc;
 }
 
-Function *OpenMPIRBuilder::emitListToGlobalCopyFunction(
+Expected<Function *> OpenMPIRBuilder::emitListToGlobalCopyFunction(
     ArrayRef<ReductionInfo> ReductionInfos, Type *ReductionsBufferTy,
-    AttributeList FuncAttrs) {
+    AttributeList FuncAttrs, ArrayRef<bool> IsByRef) {
   OpenMPIRBuilder::InsertPointTy OldIP = Builder.saveIP();
   LLVMContext &Ctx = M.getContext();
   FunctionType *FuncTy = FunctionType::get(
@@ -3230,7 +3239,21 @@ Function *OpenMPIRBuilder::emitListToGlobalCopyFunction(
 
     switch (RI.EvaluationKind) {
     case EvalKind::Scalar: {
-      Value *TargetElement = Builder.CreateLoad(RI.ElementType, ElemPtr);
+      Value *TargetElement;
+
+      if (IsByRef.empty() || !IsByRef[En.index()]) {
+        TargetElement = Builder.CreateLoad(RI.ElementType, ElemPtr);
+      } else {
+        InsertPointOrErrorTy GenResult =
+            RI.DataPtrPtrGen(Builder.saveIP(), ElemPtr, ElemPtr);
+
+        if (!GenResult)
+          return GenResult.takeError();
+
+        ElemPtr = Builder.CreateLoad(Builder.getPtrTy(), ElemPtr);
+        TargetElement = Builder.CreateLoad(RI.ByRefElementType, ElemPtr);
+      }
+
       Builder.CreateStore(TargetElement, GlobVal);
       break;
     }
@@ -3268,9 +3291,9 @@ Function *OpenMPIRBuilder::emitListToGlobalCopyFunction(
   return LtGCFunc;
 }
 
-Function *OpenMPIRBuilder::emitListToGlobalReduceFunction(
+Expected<Function *> OpenMPIRBuilder::emitListToGlobalReduceFunction(
     ArrayRef<ReductionInfo> ReductionInfos, Function *ReduceFn,
-    Type *ReductionsBufferTy, AttributeList FuncAttrs) {
+    Type *ReductionsBufferTy, AttributeList FuncAttrs, ArrayRef<bool> IsByRef) {
   OpenMPIRBuilder::InsertPointTy OldIP = Builder.saveIP();
   LLVMContext &Ctx = M.getContext();
   FunctionType *FuncTy = FunctionType::get(
@@ -3309,6 +3332,8 @@ Function *OpenMPIRBuilder::emitListToGlobalReduceFunction(
   Value *LocalReduceList =
       Builder.CreateAlloca(RedListArrayTy, nullptr, ".omp.reduction.red_list");
 
+  InsertPointTy AllocaIP{EntryBlock, EntryBlock->begin()};
+
   Value *BufferArgAddrCast = Builder.CreatePointerBitCastOrAddrSpaceCast(
       BufferArgAlloca, Builder.getPtrTy(),
       BufferArgAlloca->getName() + ".ascast");
@@ -3330,6 +3355,20 @@ Function *OpenMPIRBuilder::emitListToGlobalReduceFunction(
   Type *IndexTy = Builder.getIndexTy(
       M.getDataLayout(), M.getDataLayout().getDefaultGlobalsAddressSpace());
   for (auto En : enumerate(ReductionInfos)) {
+    const ReductionInfo &RI = En.value();
+    Value *ByRefAlloc;
+
+    if (!IsByRef.empty() && IsByRef[En.index()]) {
+      InsertPointTy OldIP = Builder.saveIP();
+      Builder.restoreIP(AllocaIP);
+
+      ByRefAlloc = Builder.CreateAlloca(RI.ByRefAllocatedType);
+      ByRefAlloc = Builder.CreatePointerBitCastOrAddrSpaceCast(
+          ByRefAlloc, Builder.getPtrTy(), ByRefAlloc->getName() + ".ascast");
+
+      Builder.restoreIP(OldIP);
+    }
+
     Value *TargetElementPtrPtr = Builder.CreateInBoundsGEP(
         RedListArrayTy, LocalReduceListAddrCast,
         {ConstantInt::get(IndexTy, 0), ConstantInt::get(IndexTy, En.index())});
@@ -3338,7 +3377,21 @@ Function *OpenMPIRBuilder::emitListToGlobalReduceFunction(
     // Global = Buffer.VD[Idx];
     Value *GlobValPtr = Builder.CreateConstInBoundsGEP2_32(
         ReductionsBufferTy, BufferVD, 0, En.index());
-    Builder.CreateStore(GlobValPtr, TargetElementPtrPtr);
+
+    if (!IsByRef.empty() && IsByRef[En.index()]) {
+      Value *ByRefDataPtr;
+
+      InsertPointOrErrorTy GenResult =
+          RI.DataPtrPtrGen(Builder.saveIP(), ByRefAlloc, ByRefDataPtr);
+
+      if (!GenResult)
+        return GenResult.takeError();
+
+      Builder.CreateStore(GlobValPtr, ByRefDataPtr);
+      Builder.CreateStore(ByRefAlloc, TargetElementPtrPtr);
+    } else {
+      Builder.CreateStore(GlobValPtr, TargetElementPtrPtr);
+    }
   }
 
   // Call reduce_function(GlobalReduceList, ReduceList)
@@ -3351,32 +3404,32 @@ Function *OpenMPIRBuilder::emitListToGlobalReduceFunction(
   return LtGRFunc;
 }
 
-Function *OpenMPIRBuilder::emitGlobalToListCopyFunction(
+Expected<Function *> OpenMPIRBuilder::emitGlobalToListCopyFunction(
     ArrayRef<ReductionInfo> ReductionInfos, Type *ReductionsBufferTy,
-    AttributeList FuncAttrs) {
+    AttributeList FuncAttrs, ArrayRef<bool> IsByRef) {
   OpenMPIRBuilder::InsertPointTy OldIP = Builder.saveIP();
   LLVMContext &Ctx = M.getContext();
   FunctionType *FuncTy = FunctionType::get(
       Builder.getVoidTy(),
       {Builder.getPtrTy(), Builder.getInt32Ty(), Builder.getPtrTy()},
       /* IsVarArg */ false);
-  Function *LtGCFunc =
+  Function *GtLCFunc =
       Function::Create(FuncTy, GlobalVariable::InternalLinkage,
                        "_omp_reduction_global_to_list_copy_func", &M);
-  LtGCFunc->setAttributes(FuncAttrs);
-  LtGCFunc->addParamAttr(0, Attribute::NoUndef);
-  LtGCFunc->addParamAttr(1, Attribute::NoUndef);
-  LtGCFunc->addParamAttr(2, Attribute::NoUndef);
+  GtLCFunc->setAttributes(FuncAttrs);
+  GtLCFunc->addParamAttr(0, Attribute::NoUndef);
+  GtLCFunc->addParamAttr(1, Attribute::NoUndef);
+  GtLCFunc->addParamAttr(2, Attribute::NoUndef);
 
-  BasicBlock *EntryBlock = BasicBlock::Create(Ctx, "entry", LtGCFunc);
+  BasicBlock *EntryBlock = BasicBlock::Create(Ctx, "entry", GtLCFunc);
   Builder.SetInsertPoint(EntryBlock);
 
   // Buffer: global reduction buffer.
-  Argument *BufferArg = LtGCFunc->getArg(0);
+  Argument *BufferArg = GtLCFunc->getArg(0);
   // Idx: index of the buffer.
-  Argument *IdxArg = LtGCFunc->getArg(1);
+  Argument *IdxArg = GtLCFunc->getArg(1);
   // ReduceList: thread local Reduce list.
-  Argument *ReduceListArg = LtGCFunc->getArg(2);
+  Argument *ReduceListArg = GtLCFunc->getArg(2);
 
   Value *BufferArgAlloca = Builder.CreateAlloca(Builder.getPtrTy(), nullptr,
                                                 BufferArg->getName() + ".addr");
@@ -3420,7 +3473,20 @@ Function *OpenMPIRBuilder::emitGlobalToListCopyFunction(
 
     switch (RI.EvaluationKind) {
     case EvalKind::Scalar: {
-      Value *TargetElement = Builder.CreateLoad(RI.ElementType, GlobValPtr);
+      Type *ElemType = RI.ElementType;
+
+      if (!IsByRef.empty() && IsByRef[En.index()]) {
+        ElemType = RI.ByRefElementType;
+        InsertPointOrErrorTy GenResult =
+            RI.DataPtrPtrGen(Builder.saveIP(), ElemPtr, ElemPtr);
+
+        if (!GenResult)
+          return GenResult.takeError();
+
+        ElemPtr = Builder.CreateLoad(Builder.getPtrTy(), ElemPtr);
+      }
+
+      Value *TargetElement = Builder.CreateLoad(ElemType, GlobValPtr);
       Builder.CreateStore(TargetElement, ElemPtr);
       break;
     }
@@ -3456,35 +3522,35 @@ Function *OpenMPIRBuilder::emitGlobalToListCopyFunction(
 
   Builder.CreateRetVoid();
   Builder.restoreIP(OldIP);
-  return LtGCFunc;
+  return GtLCFunc;
 }
 
-Function *OpenMPIRBuilder::emitGlobalToListReduceFunction(
+Expected<Function *> OpenMPIRBuilder::emitGlobalToListReduceFunction(
     ArrayRef<ReductionInfo> ReductionInfos, Function *ReduceFn,
-    Type *ReductionsBufferTy, AttributeList FuncAttrs) {
+    Type *ReductionsBufferTy, AttributeList FuncAttrs, ArrayRef<bool> IsByRef) {
   OpenMPIRBuilder::InsertPointTy OldIP = Builder.saveIP();
   LLVMContext &Ctx = M.getContext();
   auto *FuncTy = FunctionType::get(
       Builder.getVoidTy(),
       {Builder.getPtrTy(), Builder.getInt32Ty(), Builder.getPtrTy()},
       /* IsVarArg */ false);
-  Function *LtGRFunc =
+  Function *GtLRFunc =
       Function::Create(FuncTy, GlobalVariable::InternalLinkage,
                        "_omp_reduction_global_to_list_reduce_func", &M);
-  LtGRFunc->setAttributes(FuncAttrs);
-  LtGRFunc->addParamAttr(0, Attribute::NoUndef);
-  LtGRFunc->addParamAttr(1, Attribute::NoUndef);
-  LtGRFunc->addParamAttr(2, Attribute::NoUndef);
+  GtLRFunc->setAttributes(FuncAttrs);
+  GtLRFunc->addParamAttr(0, Attribute::NoUndef);
+  GtLRFunc->addParamAttr(1, Attribute::NoUndef);
+  GtLRFunc->addParamAttr(2, Attribute::NoUndef);
 
-  BasicBlock *EntryBlock = BasicBlock::Create(Ctx, "entry", LtGRFunc);
+  BasicBlock *EntryBlock = BasicBlock::Create(Ctx, "entry", GtLRFunc);
   Builder.SetInsertPoint(EntryBlock);
 
   // Buffer: global reduction buffer.
-  Argument *BufferArg = LtGRFunc->getArg(0);
+  Argument *BufferArg = GtLRFunc->getArg(0);
   // Idx: index of the buffer.
-  Argument *IdxArg = LtGRFunc->getArg(1);
+  Argument *IdxArg = GtLRFunc->getArg(1);
   // ReduceList: thread local Reduce list.
-  Argument *ReduceListArg = LtGRFunc->getArg(2);
+  Argument *ReduceListArg = GtLRFunc->getArg(2);
 
   Value *BufferArgAlloca = Builder.CreateAlloca(Builder.getPtrTy(), nullptr,
                                                 BufferArg->getName() + ".addr");
@@ -3500,6 +3566,8 @@ Function *OpenMPIRBuilder::emitGlobalToListReduceFunction(
   Value *LocalReduceList =
       Builder.CreateAlloca(RedListArrayTy, nullptr, ".omp.reduction.red_list");
 
+  InsertPointTy AllocaIP{EntryBlock, EntryBlock->begin()};
+
   Value *BufferArgAddrCast = Builder.CreatePointerBitCastOrAddrSpaceCast(
       BufferArgAlloca, Builder.getPtrTy(),
       BufferArgAlloca->getName() + ".ascast");
@@ -3521,6 +3589,20 @@ Function *OpenMPIRBuilder::emitGlobalToListReduceFunction(
   Type *IndexTy = Builder.getIndexTy(
       M.getDataLayout(), M.getDataLayout().getDefaultGlobalsAddressSpace());
   for (auto En : enumerate(ReductionInfos)) {
+    const ReductionInfo &RI = En.value();
+    Value *ByRefAlloc;
+
+    if (!IsByRef.empty() && IsByRef[En.index()]) {
+      InsertPointTy OldIP = Builder.saveIP();
+      Builder.restoreIP(AllocaIP);
+
+      ByRefAlloc = Builder.CreateAlloca(RI.ByRefAllocatedType);
+      ByRefAlloc = Builder.CreatePointerBitCastOrAddrSpaceCast(
+          ByRefAlloc, Builder.getPtrTy(), ByRefAlloc->getName() + ".ascast");
+
+      Builder.restoreIP(OldIP);
+    }
+
     Value *TargetElementPtrPtr = Builder.CreateInBoundsGEP(
         RedListArrayTy, ReductionList,
         {ConstantInt::get(IndexTy, 0), ConstantInt::get(IndexTy, En.index())});
@@ -3529,7 +3611,19 @@ Function *OpenMPIRBuilder::emitGlobalToListReduceFunction(
         Builder.CreateInBoundsGEP(ReductionsBufferTy, BufferVal, Idxs);
     Value *GlobValPtr = Builder.CreateConstInBoundsGEP2_32(
         ReductionsBufferTy, BufferVD, 0, En.index());
-    Builder.CreateStore(GlobValPtr, TargetElementPtrPtr);
+
+    if (!IsByRef.empty() && IsByRef[En.index()]) {
+      Value *ByRefDataPtr;
+      InsertPointOrErrorTy GenResult =
+          RI.DataPtrPtrGen(Builder.saveIP(), ByRefAlloc, ByRefDataPtr);
+      if (!GenResult)
+        return GenResult.takeError();
+
+      Builder.CreateStore(GlobValPtr, ByRefDataPtr);
+      Builder.CreateStore(ByRefAlloc, TargetElementPtrPtr);
+    } else {
+      Builder.CreateStore(GlobValPtr, TargetElementPtrPtr);
+    }
   }
 
   // Call reduce_function(ReduceList, GlobalReduceList)
@@ -3539,7 +3633,7 @@ Function *OpenMPIRBuilder::emitGlobalToListReduceFunction(
       ->addFnAttr(Attribute::NoUnwind);
   Builder.CreateRetVoid();
   Builder.restoreIP(OldIP);
-  return LtGRFunc;
+  return GtLRFunc;
 }
 
 std::string OpenMPIRBuilder::getReductionFuncName(StringRef Name) const {
@@ -3795,7 +3889,10 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::createReductionsGPU(
     auto Size = M.getDataLayout().getTypeStoreSize(En.value().ElementType);
     if (Size > MaxDataSize)
       MaxDataSize = Size;
-    ReductionTypeArgs.emplace_back(En.value().ElementType);
+    Type *RedTypeArg = (!IsByRef.empty() && IsByRef[En.index()])
+                           ? En.value().ByRefElementType
+                           : En.value().ElementType;
+    ReductionTypeArgs.emplace_back(RedTypeArg);
   }
   Value *ReductionDataSize =
       Builder.getInt64(MaxDataSize * ReductionInfos.size());
@@ -3813,20 +3910,33 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::createReductionsGPU(
     CodeGenIP = Builder.saveIP();
     StructType *ReductionsBufferTy = StructType::create(
         Ctx, ReductionTypeArgs, "struct._globalized_locals_ty");
-    Function *RedFixedBuferFn = getOrCreateRuntimeFunctionPtr(
+    Function *RedFixedBufferFn = getOrCreateRuntimeFunctionPtr(
         RuntimeFunction::OMPRTL___kmpc_reduction_get_fixed_buffer);
-    Function *LtGCFunc = emitListToGlobalCopyFunction(
-        ReductionInfos, ReductionsBufferTy, FuncAttrs);
-    Function *LtGRFunc = emitListToGlobalReduceFunction(
-        ReductionInfos, ReductionFunc, ReductionsBufferTy, FuncAttrs);
-    Function *GtLCFunc = emitGlobalToListCopyFunction(
-        ReductionInfos, ReductionsBufferTy, FuncAttrs);
-    Function *GtLRFunc = emitGlobalToListReduceFunction(
-        ReductionInfos, ReductionFunc, ReductionsBufferTy, FuncAttrs);
+
+    Expected<Function *> LtGCFunc = emitListToGlobalCopyFunction(
+        ReductionInfos, ReductionsBufferTy, FuncAttrs, IsByRef);
+    if (!LtGCFunc)
+      return LtGCFunc.takeError();
+
+    Expected<Function *> LtGRFunc = emitListToGlobalReduceFunction(
+        ReductionInfos, ReductionFunc, ReductionsBufferTy, FuncAttrs, IsByRef);
+    if (!LtGRFunc)
+      return LtGRFunc.takeError();
+
+    Expected<Function *> GtLCFunc = emitGlobalToListCopyFunction(
+        ReductionInfos, ReductionsBufferTy, FuncAttrs, IsByRef);
+    if (!GtLCFunc)
+      return GtLCFunc.takeError();
+
+    Expected<Function *> GtLRFunc = emitGlobalToListReduceFunction(
+        ReductionInfos, ReductionFunc, ReductionsBufferTy, FuncAttrs, IsByRef);
+    if (!GtLRFunc)
+      return GtLRFunc.takeError();
+
     Builder.restoreIP(CodeGenIP);
 
     Value *KernelTeamsReductionPtr = createRuntimeFunctionCall(
-        RedFixedBuferFn, {}, "_openmp_teams_reductions_buffer_$_$ptr");
+        RedFixedBufferFn, {}, "_openmp_teams_reductions_buffer_$_$ptr");
 
     Value *Args3[] = {SrcLocInfo,
                       KernelTeamsReductionPtr,
@@ -3835,10 +3945,10 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::createReductionsGPU(
                       RL,
                       *SarFunc,
                       WcFunc,
-                      LtGCFunc,
-                      LtGRFunc,
-                      GtLCFunc,
-                      GtLRFunc};
+                      *LtGCFunc,
+                      *LtGRFunc,
+                      *GtLCFunc,
+                      *GtLRFunc};
 
     Function *TeamsReduceFn = getOrCreateRuntimeFunctionPtr(
         RuntimeFunction::OMPRTL___kmpc_nvptx_teams_reduce_nowait_v2);
@@ -6718,9 +6828,6 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::EmitOMPInlinedRegion(
       emitCommonDirectiveExit(OMPD, FinIP, ExitCall, HasFinalize);
   if (!AfterIP)
     return AfterIP.takeError();
-  assert(FiniBB->getUniquePredecessor()->getUniqueSuccessor() == FiniBB &&
-         "Unexpected Control Flow State!");
-  MergeBlockIntoPredecessor(FiniBB);
 
   // If we are skipping the region of a non conditional, remove the exit
   // block, and clear the builder's insertion point.
@@ -6780,14 +6887,12 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::emitCommonDirectiveExit(
     FinalizationInfo Fi = FinalizationStack.pop_back_val();
     assert(Fi.DK == OMPD && "Unexpected Directive for Finalization call!");
 
-    if (Error Err = Fi.FiniCB(FinIP))
-      return Err;
-
-    BasicBlock *FiniBB = FinIP.getBlock();
-    Instruction *FiniBBTI = FiniBB->getTerminator();
+    if (Error Err = Fi.mergeFiniBB(Builder, FinIP.getBlock()))
+      return std::move(Err);
 
-    // set Builder IP for call creation
-    Builder.SetInsertPoint(FiniBBTI);
+    // Exit condition: the insertion point is before the terminator of the
+    // new Fini block.
+    Builder.SetInsertPoint(FinIP.getBlock()->getTerminator());
   }
 
   if (!ExitCall)
diff --git a/llvm/lib/IR/ConstantRange.cpp b/llvm/lib/IR/ConstantRange.cpp
index b454c9a4cd3ae..9beaee60d0bc1 100644
--- a/llvm/lib/IR/ConstantRange.cpp
+++ b/llvm/lib/IR/ConstantRange.cpp
@@ -841,6 +841,8 @@ ConstantRange ConstantRange::zeroExtend(uint32_t DstTySize) const {
   if (isEmptySet()) return getEmpty(DstTySize);
 
   unsigned SrcTySize = getBitWidth();
+  if (DstTySize == SrcTySize)
+    return *this;
   assert(SrcTySize < DstTySize && "Not a value extension");
   if (isFullSet() || isUpperWrapped()) {
     // Change into [0, 1 << src bit width)
@@ -858,6 +860,8 @@ ConstantRange ConstantRange::signExtend(uint32_t DstTySize) const {
   if (isEmptySet()) return getEmpty(DstTySize);
 
   unsigned SrcTySize = getBitWidth();
+  if (DstTySize == SrcTySize)
+    return *this;
   assert(SrcTySize < DstTySize && "Not a value extension");
 
   // special case: [X, INT_MIN) -- not really wrapping around
@@ -874,6 +878,8 @@ ConstantRange ConstantRange::signExtend(uint32_t DstTySize) const {
 
 ConstantRange ConstantRange::truncate(uint32_t DstTySize,
                                       unsigned NoWrapKind) const {
+  if (DstTySize == getBitWidth())
+    return *this;
   assert(getBitWidth() > DstTySize && "Not a value truncation");
   if (isEmptySet())
     return getEmpty(DstTySize);
diff --git a/llvm/lib/IR/IRBuilder.cpp b/llvm/lib/IR/IRBuilder.cpp
index 95edb2e8e56d8..8e1707ac98a51 100644
--- a/llvm/lib/IR/IRBuilder.cpp
+++ b/llvm/lib/IR/IRBuilder.cpp
@@ -858,24 +858,12 @@ CallInst *IRBuilderBase::CreateIntrinsic(Type *RetTy, Intrinsic::ID ID,
                                          const Twine &Name) {
   Module *M = BB->getModule();
 
-  SmallVector<Intrinsic::IITDescriptor> Table;
-  Intrinsic::getIntrinsicInfoTableEntries(ID, Table);
-  ArrayRef<Intrinsic::IITDescriptor> TableRef(Table);
-
   SmallVector<Type *> ArgTys;
   ArgTys.reserve(Args.size());
   for (auto &I : Args)
     ArgTys.push_back(I->getType());
-  FunctionType *FTy = FunctionType::get(RetTy, ArgTys, false);
-  SmallVector<Type *> OverloadTys;
-  Intrinsic::MatchIntrinsicTypesResult Res =
-      matchIntrinsicSignature(FTy, TableRef, OverloadTys);
-  (void)Res;
-  assert(Res == Intrinsic::MatchIntrinsicTypes_Match && TableRef.empty() &&
-         "Wrong types for intrinsic!");
-  // TODO: Handle varargs intrinsics.
-
-  Function *Fn = Intrinsic::getOrInsertDeclaration(M, ID, OverloadTys);
+
+  Function *Fn = Intrinsic::getOrInsertDeclaration(M, ID, RetTy, ArgTys);
   return createCallHelper(Fn, Args, Name, FMFSource);
 }
 
diff --git a/llvm/lib/IR/Intrinsics.cpp b/llvm/lib/IR/Intrinsics.cpp
index 859689b9cf168..f46d3e5063e43 100644
--- a/llvm/lib/IR/Intrinsics.cpp
+++ b/llvm/lib/IR/Intrinsics.cpp
@@ -727,14 +727,14 @@ Intrinsic::ID Intrinsic::lookupIntrinsicID(StringRef Name) {
 #include "llvm/IR/IntrinsicImpl.inc"
 #undef GET_INTRINSIC_ATTRIBUTES
 
-Function *Intrinsic::getOrInsertDeclaration(Module *M, ID id,
-                                            ArrayRef<Type *> Tys) {
-  // There can never be multiple globals with the same name of different types,
-  // because intrinsics must be a specific type.
-  auto *FT = getType(M->getContext(), id, Tys);
+static Function *getOrInsertIntrinsicDeclarationImpl(Module *M,
+                                                     Intrinsic::ID id,
+                                                     ArrayRef<Type *> Tys,
+                                                     FunctionType *FT) {
   Function *F = cast<Function>(
-      M->getOrInsertFunction(
-           Tys.empty() ? getName(id) : getName(id, Tys, M, FT), FT)
+      M->getOrInsertFunction(Tys.empty() ? Intrinsic::getName(id)
+                                         : Intrinsic::getName(id, Tys, M, FT),
+                             FT)
           .getCallee());
   if (F->getFunctionType() == FT)
     return F;
@@ -746,11 +746,49 @@ Function *Intrinsic::getOrInsertDeclaration(Module *M, ID id,
   // invalid declaration will get upgraded later.
   F->setName(F->getName() + ".invalid");
   return cast<Function>(
-      M->getOrInsertFunction(
-           Tys.empty() ? getName(id) : getName(id, Tys, M, FT), FT)
+      M->getOrInsertFunction(Tys.empty() ? Intrinsic::getName(id)
+                                         : Intrinsic::getName(id, Tys, M, FT),
+                             FT)
           .getCallee());
 }
 
+Function *Intrinsic::getOrInsertDeclaration(Module *M, ID id,
+                                            ArrayRef<Type *> Tys) {
+  // There can never be multiple globals with the same name of different types,
+  // because intrinsics must be a specific type.
+  FunctionType *FT = getType(M->getContext(), id, Tys);
+  return getOrInsertIntrinsicDeclarationImpl(M, id, Tys, FT);
+}
+
+Function *Intrinsic::getOrInsertDeclaration(Module *M, ID id, Type *RetTy,
+                                            ArrayRef<Type *> ArgTys) {
+  // If the intrinsic is not overloaded, use the non-overloaded version.
+  if (!Intrinsic::isOverloaded(id))
+    return getOrInsertDeclaration(M, id);
+
+  // Get the intrinsic signature metadata.
+  SmallVector<Intrinsic::IITDescriptor, 8> Table;
+  getIntrinsicInfoTableEntries(id, Table);
+  ArrayRef<Intrinsic::IITDescriptor> TableRef = Table;
+
+  FunctionType *FTy = FunctionType::get(RetTy, ArgTys, /*isVarArg=*/false);
+
+  // Automatically determine the overloaded types.
+  SmallVector<Type *, 4> OverloadTys;
+  [[maybe_unused]] Intrinsic::MatchIntrinsicTypesResult Res =
+      matchIntrinsicSignature(FTy, TableRef, OverloadTys);
+  assert(Res == Intrinsic::MatchIntrinsicTypes_Match &&
+         "intrinsic signature mismatch");
+
+  // If the intrinsic is variadic, recreate the FunctionType accordingly.
+  if (!matchIntrinsicVarArg(/*isVarArg=*/true, TableRef))
+    FTy = FunctionType::get(RetTy, ArgTys, /*isVarArg=*/true);
+
+  assert(TableRef.empty() && "Unprocessed descriptors remain");
+
+  return getOrInsertIntrinsicDeclarationImpl(M, id, OverloadTys, FTy);
+}
+
 Function *Intrinsic::getDeclarationIfExists(const Module *M, ID id) {
   return M->getFunction(getName(id));
 }
diff --git a/llvm/lib/LTO/LTO.cpp b/llvm/lib/LTO/LTO.cpp
index a02af59600c44..4e242311e290f 100644
--- a/llvm/lib/LTO/LTO.cpp
+++ b/llvm/lib/LTO/LTO.cpp
@@ -19,7 +19,6 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/Analysis/OptimizationRemarkEmitter.h"
 #include "llvm/Analysis/StackSafetyAnalysis.h"
-#include "llvm/Analysis/TargetLibraryInfo.h"
 #include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/Bitcode/BitcodeReader.h"
 #include "llvm/Bitcode/BitcodeWriter.h"
diff --git a/llvm/lib/LTO/LTOBackend.cpp b/llvm/lib/LTO/LTOBackend.cpp
index 93118becedbac..f9cde383ce32d 100644
--- a/llvm/lib/LTO/LTOBackend.cpp
+++ b/llvm/lib/LTO/LTOBackend.cpp
@@ -17,6 +17,7 @@
 #include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/Analysis/CGSCCPassManager.h"
 #include "llvm/Analysis/ModuleSummaryAnalysis.h"
+#include "llvm/Analysis/RuntimeLibcallInfo.h"
 #include "llvm/Analysis/TargetLibraryInfo.h"
 #include "llvm/Bitcode/BitcodeReader.h"
 #include "llvm/Bitcode/BitcodeWriter.h"
@@ -446,6 +447,11 @@ static void codegen(const Config &Conf, TargetMachine *TM,
     legacy::PassManager CodeGenPasses;
     TargetLibraryInfoImpl TLII(Mod.getTargetTriple());
     CodeGenPasses.add(new TargetLibraryInfoWrapperPass(TLII));
+    CodeGenPasses.add(new RuntimeLibraryInfoWrapper(
+        Mod.getTargetTriple(), TM->Options.ExceptionModel,
+        TM->Options.FloatABIType, TM->Options.EABIVersion,
+        TM->Options.MCOptions.ABIName, TM->Options.VecLib));
+
     // No need to make index available if the module is empty.
     // In theory these passes should not use the index for an empty
     // module, however, this guards against doing any unnecessary summary-based
diff --git a/llvm/lib/MC/MCObjectFileInfo.cpp b/llvm/lib/MC/MCObjectFileInfo.cpp
index b2f500083f5d8..5afe00eee2242 100644
--- a/llvm/lib/MC/MCObjectFileInfo.cpp
+++ b/llvm/lib/MC/MCObjectFileInfo.cpp
@@ -61,9 +61,6 @@ static bool useCompactUnwind(const Triple &T) {
 }
 
 void MCObjectFileInfo::initMachOMCObjectFileInfo(const Triple &T) {
-  // MachO
-  SupportsWeakOmittedEHFrame = false;
-
   EHFrameSection = Ctx->getMachOSection(
       "__TEXT", "__eh_frame",
       MachO::S_COALESCED | MachO::S_ATTR_NO_TOC |
@@ -1090,7 +1087,6 @@ void MCObjectFileInfo::initMCObjectFileInfo(MCContext &MCCtx, bool PIC,
   Ctx = &MCCtx;
 
   // Common.
-  SupportsWeakOmittedEHFrame = true;
   SupportsCompactUnwindWithoutEHFrame = false;
   OmitDwarfIfHaveCompactUnwind = false;
 
diff --git a/llvm/lib/MC/MCWin64EH.cpp b/llvm/lib/MC/MCWin64EH.cpp
index 6d146f6cedd6e..6835ba73ffab8 100644
--- a/llvm/lib/MC/MCWin64EH.cpp
+++ b/llvm/lib/MC/MCWin64EH.cpp
@@ -673,7 +673,7 @@ static void ARM64EmitUnwindCode(MCStreamer &streamer,
     break;
   case Win64EH::UOP_SaveFPLRX:
     b = 0x80;
-    b |= ((inst.Offset - 1) >> 3) & 0x3F;
+    b |= ((inst.Offset >> 3) - 1) & 0x3F;
     streamer.emitInt8(b);
     break;
   case Win64EH::UOP_SaveFPLR:
@@ -1051,7 +1051,9 @@ static bool tryARM64PackedUnwind(WinEH::FrameInfo *info, uint32_t FuncLength,
   // the order - that would work fine when unwinding from within
   // functions, but not be exactly right if unwinding happens within
   // prologs/epilogs.
-  for (const WinEH::Instruction &Inst : info->Instructions) {
+  for (auto It = info->Instructions.begin(), EndIt = info->Instructions.end();
+       It != EndIt; It++) {
+    const WinEH::Instruction &Inst = *It;
     switch (Inst.Operation) {
     case Win64EH::UOP_End:
       if (Location != Start)
@@ -1169,6 +1171,28 @@ static bool tryARM64PackedUnwind(WinEH::FrameInfo *info, uint32_t FuncLength,
           Location != FloatRegs && Location != InputArgs &&
           Location != StackAdjust)
         return false;
+      // Because there's no save_lrpair_x opcode, the case of CR=01,
+      // RegI=1 is handled as a special case with a pair of instructions: an
+      // alloc followed by a regular save_lrpair. So when encountering an
+      // alloc here, check if this is the start of such an instruction pair.
+      if (Location == Start2) { // Can't have this at Start3, after PACSignLR
+        auto NextIt = It + 1;
+        if (NextIt != EndIt) {
+          const WinEH::Instruction &NextInst = *NextIt;
+          if (NextInst.Operation == Win64EH::UOP_SaveLRPair &&
+              NextInst.Offset == 0 && NextInst.Register == 19) {
+            assert(Predecrement == 0);
+            assert(RegI == 0);
+            assert(!StandaloneLR);
+            Predecrement = Inst.Offset;
+            RegI = 1;
+            StandaloneLR = true;
+            Location = FloatRegs;
+            It++; // Consume both the Alloc and the SaveLRPair
+            continue;
+          }
+        }
+      }
       // Can have either a single decrement, or a pair of decrements with
       // 4080 and another decrement.
       if (StackOffset == 0)
@@ -1252,16 +1276,32 @@ static bool tryARM64PackedUnwind(WinEH::FrameInfo *info, uint32_t FuncLength,
   if (PAC && !FPLRPair)
     return false;
   int H = Nops == 4;
-  // There's an inconsistency regarding packed unwind info with homed
-  // parameters; according to the documentation, the epilog shouldn't have
-  // the same corresponding nops (and thus, to set the H bit, we should
-  // require an epilog which isn't exactly symmetrical - we shouldn't accept
-  // an exact mirrored epilog for those cases), but in practice,
-  // RtlVirtualUnwind behaves as if it does expect the epilogue to contain
-  // the same nops. See https://github.com/llvm/llvm-project/issues/54879.
-  // To play it safe, don't produce packed unwind info with homed parameters.
+  // For packed unwind info with the H bit set, the prolog and epilog
+  // actually shouldn't be symmetrical; the epilog shouldn't have any
+  // nop instructions/opcodes while the prolog has them. We currently
+  // require exactly symmetrical prologs/epilogs, which is wrong for this
+  // case - therefore, don't emit packed unwind info for this case.
+  // See https://github.com/llvm/llvm-project/issues/54879 for details.
+  //
+  // Additionally, older versions of Windows (at least up until
+  // 10.0.22000.2176) also deviated from the documentation here and
+  // incorrectly assumed that the epilog has matching nop instructions.
+  // This is fixed at least in version 10.0.26100.6899. As long as we
+  // can't assume that the generated code will always run on a new enough
+  // version, don't emit the packed format here, even if our own
+  // implementation were fixed to handle the asymmetrical form described
+  // in the documentation.
   if (H)
     return false;
+  // Older versions of Windows (at least in 10.0.22000.2176) incorrectly
+  // unwind packed unwind info with CR=01, RegI=1, RegF>0, see
+  // https://github.com/llvm/llvm-project/issues/169588#issuecomment-3584907886.
+  // This issue only exists in older versions; current versions
+  // (10.0.26100.6899) do handle it correctly. As long as we can't be sure
+  // that we won't run on older versions, avoid producing the packed form
+  // here.
+  if (StandaloneLR && RegI == 1 && RegF > 0)
+    return false;
   int IntSZ = 8 * RegI;
   if (StandaloneLR)
     IntSZ += 8;
diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index dd73c04959732..c6beb3fdf09bd 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -73,6 +73,7 @@
 #include "llvm/Transforms/IPO/SampleProfileProbe.h"
 #include "llvm/Transforms/IPO/WholeProgramDevirt.h"
 #include "llvm/Transforms/InstCombine/InstCombine.h"
+#include "llvm/Transforms/Instrumentation/AllocToken.h"
 #include "llvm/Transforms/Instrumentation/CGProfile.h"
 #include "llvm/Transforms/Instrumentation/ControlHeightReduction.h"
 #include "llvm/Transforms/Instrumentation/InstrProfiling.h"
@@ -1615,6 +1616,11 @@ PassBuilder::buildModuleOptimizationPipeline(OptimizationLevel Level,
   MPM.addPass(createModuleToFunctionPassAdaptor(std::move(OptimizePM),
                                                 PTO.EagerlyInvalidateAnalyses));
 
+  // AllocToken transforms heap allocation calls; this needs to run late after
+  // other allocation call transformations (such as those in InstCombine).
+  if (!LTOPreLink)
+    MPM.addPass(AllocTokenPass());
+
   invokeOptimizerLastEPCallbacks(MPM, Level, LTOPhase);
 
   // Split out cold code. Splitting is done late to avoid hiding context from
@@ -1853,6 +1859,11 @@ ModulePassManager PassBuilder::buildThinLTODefaultPipeline(
     MPM.addPass(LowerTypeTestsPass(nullptr, nullptr,
                                    lowertypetests::DropTestKind::Assume));
     MPM.addPass(buildCoroWrapper(ThinOrFullLTOPhase::ThinLTOPostLink));
+
+    // AllocToken transforms heap allocation calls; this needs to run late after
+    // other allocation call transformations (such as those in InstCombine).
+    MPM.addPass(AllocTokenPass());
+
     // Drop available_externally and unreferenced globals. This is necessary
     // with ThinLTO in order to avoid leaving undefined references to dead
     // globals in the object file.
@@ -1914,6 +1925,10 @@ PassBuilder::buildLTODefaultPipeline(OptimizationLevel Level,
 
     MPM.addPass(buildCoroWrapper(ThinOrFullLTOPhase::FullLTOPostLink));
 
+    // AllocToken transforms heap allocation calls; this needs to run late after
+    // other allocation call transformations (such as those in InstCombine).
+    MPM.addPass(AllocTokenPass());
+
     invokeFullLinkTimeOptimizationLastEPCallbacks(MPM, Level);
 
     // Emit annotation remarks.
@@ -2001,6 +2016,10 @@ PassBuilder::buildLTODefaultPipeline(OptimizationLevel Level,
 
     MPM.addPass(buildCoroWrapper(ThinOrFullLTOPhase::FullLTOPostLink));
 
+    // AllocToken transforms heap allocation calls; this needs to run late after
+    // other allocation call transformations (such as those in InstCombine).
+    MPM.addPass(AllocTokenPass());
+
     invokeFullLinkTimeOptimizationLastEPCallbacks(MPM, Level);
 
     // Emit annotation remarks.
@@ -2235,6 +2254,10 @@ PassBuilder::buildLTODefaultPipeline(OptimizationLevel Level,
 
   MPM.addPass(CoroCleanupPass());
 
+  // AllocToken transforms heap allocation calls; this needs to run late after
+  // other allocation call transformations (such as those in InstCombine).
+  MPM.addPass(AllocTokenPass());
+
   invokeFullLinkTimeOptimizationLastEPCallbacks(MPM, Level);
 
   // Emit annotation remarks.
@@ -2351,6 +2374,11 @@ PassBuilder::buildO0DefaultPipeline(OptimizationLevel Level,
 
   MPM.addPass(buildCoroWrapper(Phase));
 
+  // AllocToken transforms heap allocation calls; this needs to run late after
+  // other allocation call transformations (such as those in InstCombine).
+  if (!isLTOPreLink(Phase))
+    MPM.addPass(AllocTokenPass());
+
   invokeOptimizerLastEPCallbacks(MPM, Level, Phase);
 
   if (isLTOPreLink(Phase))
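
The AllocTokenPass additions above only change where the pass sits in the default pipelines, so clients that build pipelines through PassBuilder pick it up automatically. A minimal sketch of driving the O2 module pipeline over an existing Module M (standard new-PM boilerplate, nothing here is specific to AllocToken):

    #include "llvm/IR/Module.h"
    #include "llvm/IR/PassManager.h"
    #include "llvm/Passes/PassBuilder.h"
    using namespace llvm;

    void runDefaultO2(Module &M) {
      LoopAnalysisManager LAM;
      FunctionAnalysisManager FAM;
      CGSCCAnalysisManager CGAM;
      ModuleAnalysisManager MAM;

      PassBuilder PB;
      PB.registerModuleAnalyses(MAM);
      PB.registerCGSCCAnalyses(CGAM);
      PB.registerFunctionAnalyses(FAM);
      PB.registerLoopAnalyses(LAM);
      PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

      // With this patch, the non-LTO O2 pipeline now runs AllocTokenPass after
      // InstCombine and friends, just before the optimizer-last EP callbacks.
      ModulePassManager MPM =
          PB.buildPerModuleDefaultPipeline(OptimizationLevel::O2);
      MPM.run(M, MAM);
    }
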
diff --git a/llvm/lib/Passes/PassRegistry.def b/llvm/lib/Passes/PassRegistry.def
index 074c328ef0931..e9874ecd553ee 100644
--- a/llvm/lib/Passes/PassRegistry.def
+++ b/llvm/lib/Passes/PassRegistry.def
@@ -30,6 +30,7 @@ MODULE_ANALYSIS("ir2vec-vocab", IR2VecVocabAnalysis())
 MODULE_ANALYSIS("ir-similarity", IRSimilarityAnalysis())
 MODULE_ANALYSIS("last-run-tracking", LastRunTrackingAnalysis())
 MODULE_ANALYSIS("lcg", LazyCallGraphAnalysis())
+MODULE_ANALYSIS("libcall-lowering-info", LibcallLoweringModuleAnalysis())
 MODULE_ANALYSIS("module-summary", ModuleSummaryIndexAnalysis())
 MODULE_ANALYSIS("no-op-module", NoOpModuleAnalysis())
 MODULE_ANALYSIS("pass-instrumentation", PassInstrumentationAnalysis(PIC))
diff --git a/llvm/lib/ProfileData/SampleProf.cpp b/llvm/lib/ProfileData/SampleProf.cpp
index ac7513ef2cb49..a7d784a72e380 100644
--- a/llvm/lib/ProfileData/SampleProf.cpp
+++ b/llvm/lib/ProfileData/SampleProf.cpp
@@ -22,6 +22,8 @@
 #include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/LEB128.h"
 #include "llvm/Support/raw_ostream.h"
+#include <algorithm>
+#include <cstdint>
 #include <string>
 #include <system_error>
 
@@ -398,6 +400,10 @@ LLVM_DUMP_METHOD void FunctionSamples::dump() const { print(dbgs(), 0); }
 
 std::error_code ProfileSymbolList::read(const uint8_t *Data,
                                         uint64_t ListSize) {
+  // Scan forward to see how many elements we expect.
+  reserve(std::min<uint64_t>(ProfileSymbolListCutOff,
+                             std::count(Data, Data + ListSize, 0)));
+
   const char *ListStart = reinterpret_cast<const char *>(Data);
   uint64_t Size = 0;
   uint64_t StrNum = 0;
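
The reserve() call added above pre-sizes the symbol list by counting NUL terminators before parsing, capped at ProfileSymbolListCutOff. A standalone sketch of the same counting idea; the buffer contents are made up for illustration:

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
      // Three NUL-terminated strings packed back to back, as in the on-disk list.
      const char Data[] = "foo\0bar\0baz\0";
      uint64_t ListSize = sizeof(Data) - 1; // Drop the string literal's own NUL.

      // One terminator per symbol, so counting NULs predicts the element count.
      uint64_t NumStrings = std::count(Data, Data + ListSize, '\0');
      std::vector<std::string> Symbols;
      Symbols.reserve(NumStrings); // One reserve instead of repeated growth.
      std::cout << "expected symbols: " << NumStrings << "\n"; // Prints 3.
      return 0;
    }
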
diff --git a/llvm/lib/Support/DebugCounter.cpp b/llvm/lib/Support/DebugCounter.cpp
index 5ab1def43313b..6c69c884a684d 100644
--- a/llvm/lib/Support/DebugCounter.cpp
+++ b/llvm/lib/Support/DebugCounter.cpp
@@ -4,6 +4,7 @@
 
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/Format.h"
+#include "llvm/Support/ManagedStatic.h"
 
 using namespace llvm;
 
@@ -110,12 +111,11 @@ class DebugCounterList : public cl::list<std::string, DebugCounter> {
     // width, so we do the same.
     Option::printHelpStr(HelpStr, GlobalWidth, ArgStr.size() + 6);
     const auto &CounterInstance = DebugCounter::instance();
-    for (const auto &Name : CounterInstance) {
-      const auto Info =
-          CounterInstance.getCounterInfo(CounterInstance.getCounterId(Name));
-      size_t NumSpaces = GlobalWidth - Info.first.size() - 8;
-      outs() << "    =" << Info.first;
-      outs().indent(NumSpaces) << " -   " << Info.second << '\n';
+    for (const auto &Entry : CounterInstance) {
+      const auto &[Name, Desc] = CounterInstance.getCounterDesc(Entry.second);
+      size_t NumSpaces = GlobalWidth - Name.size() - 8;
+      outs() << "    =" << Name;
+      outs().indent(NumSpaces) << " -   " << Desc << '\n';
     }
   }
 };
@@ -135,7 +135,11 @@ struct DebugCounterOwner : DebugCounter {
       cl::Optional,
       cl::location(this->ShouldPrintCounter),
       cl::init(false),
-      cl::desc("Print out debug counter info after all counters accumulated")};
+      cl::desc("Print out debug counter info after all counters accumulated"),
+      cl::callback([&](const bool &Value) {
+        if (Value)
+          activateAllCounters();
+      })};
   cl::opt<bool, true> PrintDebugCounterQueries{
       "print-debug-counter-queries",
       cl::Hidden,
@@ -167,23 +171,20 @@ struct DebugCounterOwner : DebugCounter {
 
 } // anonymous namespace
 
+// Use ManagedStatic instead of a function-local static variable to ensure
+// the destructor (which accesses counters and streams) runs during
+// llvm_shutdown() rather than at some unspecified point.
+static ManagedStatic<DebugCounterOwner> Owner;
+
 void llvm::initDebugCounterOptions() { (void)DebugCounter::instance(); }
 
-DebugCounter &DebugCounter::instance() {
-  static DebugCounterOwner O;
-  return O;
-}
+DebugCounter &DebugCounter::instance() { return *Owner; }
 
 // This is called by the command line parser when it sees a value for the
 // debug-counter option defined above.
 void DebugCounter::push_back(const std::string &Val) {
   if (Val.empty())
     return;
-#ifdef NDEBUG
-  // isCountingEnabled is hardcoded to false in NDEBUG.
-  errs() << "Requested --debug-counter in LLVM build without assertions. This "
-            "is a no-op.\n";
-#endif
 
   // The strings should come in as counter=chunk_list
   auto CounterPair = StringRef(Val).split('=');
@@ -198,32 +199,26 @@ void DebugCounter::push_back(const std::string &Val) {
     return;
   }
 
-  unsigned CounterID = getCounterId(std::string(CounterName));
-  if (!CounterID) {
+  CounterInfo *Counter = getCounterInfo(CounterName);
+  if (!Counter) {
     errs() << "DebugCounter Error: " << CounterName
            << " is not a registered counter\n";
     return;
   }
-  enableAllCounters();
 
-  CounterInfo &Counter = Counters[CounterID];
-  Counter.IsSet = true;
-  Counter.Chunks = std::move(Chunks);
+  Counter->Active = Counter->IsSet = true;
+  Counter->Chunks = std::move(Chunks);
 }
 
 void DebugCounter::print(raw_ostream &OS) const {
-  SmallVector<StringRef, 16> CounterNames(RegisteredCounters.begin(),
-                                          RegisteredCounters.end());
+  SmallVector<StringRef, 16> CounterNames(Counters.keys());
   sort(CounterNames);
 
-  auto &Us = instance();
   OS << "Counters and values:\n";
-  for (auto &CounterName : CounterNames) {
-    unsigned CounterID = getCounterId(std::string(CounterName));
-    const CounterInfo &C = Us.Counters[CounterID];
-    OS << left_justify(RegisteredCounters[CounterID], 32) << ": {" << C.Count
-       << ",";
-    printChunks(OS, C.Chunks);
+  for (StringRef CounterName : CounterNames) {
+    const CounterInfo *C = getCounterInfo(CounterName);
+    OS << left_justify(C->Name, 32) << ": {" << C->Count << ",";
+    printChunks(OS, C->Chunks);
     OS << "}\n";
   }
 }
@@ -253,20 +248,14 @@ bool DebugCounter::handleCounterIncrement(CounterInfo &Info) {
   return Res;
 }
 
-bool DebugCounter::shouldExecuteImpl(unsigned CounterName) {
+bool DebugCounter::shouldExecuteImpl(CounterInfo &Counter) {
   auto &Us = instance();
-  auto Result = Us.Counters.find(CounterName);
-  if (Result != Us.Counters.end()) {
-    auto &CounterInfo = Result->second;
-    bool Res = Us.handleCounterIncrement(CounterInfo);
-    if (Us.ShouldPrintCounterQueries && CounterInfo.IsSet) {
-      dbgs() << "DebugCounter " << Us.RegisteredCounters[CounterName] << "="
-             << (CounterInfo.Count - 1) << (Res ? " execute" : " skip") << "\n";
-    }
-    return Res;
+  bool Res = Us.handleCounterIncrement(Counter);
+  if (Us.ShouldPrintCounterQueries && Counter.IsSet) {
+    dbgs() << "DebugCounter " << Counter.Name << "=" << (Counter.Count - 1)
+           << (Res ? " execute" : " skip") << "\n";
   }
-  // Didn't find the counter, should we warn?
-  return true;
+  return Res;
 }
 
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
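
For context on what the reworked plumbing above feeds: a counter is declared once with DEBUG_COUNTER and queried with shouldExecute, and -debug-counter=<name>=<chunks> then gates which of those queries "execute". A minimal sketch of the existing public usage pattern; the counter name and the transform it guards are made up:

    #include "llvm/Support/DebugCounter.h"
    using namespace llvm;

    // Registers a counter reachable as -debug-counter=my-xform-counter=...
    DEBUG_COUNTER(MyXformCounter, "my-xform-counter",
                  "Controls which candidate sites get transformed");

    bool maybeTransform() {
      // Returns false (skip) for queries outside the selected chunks.
      if (!DebugCounter::shouldExecute(MyXformCounter))
        return false;
      // ... perform the transform ...
      return true;
    }
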
diff --git a/llvm/lib/Support/LSP/Protocol.cpp b/llvm/lib/Support/LSP/Protocol.cpp
index f8eeb32db02ab..c2957bdc0dacb 100644
--- a/llvm/lib/Support/LSP/Protocol.cpp
+++ b/llvm/lib/Support/LSP/Protocol.cpp
@@ -1037,3 +1037,21 @@ llvm::json::Value llvm::lsp::toJSON(const CodeAction &Value) {
     CodeAction["edit"] = *Value.edit;
   return std::move(CodeAction);
 }
+
+//===----------------------------------------------------------------------===//
+// ShowMessageParams
+//===----------------------------------------------------------------------===//
+
+llvm::json::Value llvm::lsp::toJSON(const ShowMessageParams &Params) {
+  auto Out = llvm::json::Object{
+      {"type", static_cast<int>(Params.type)},
+      {"message", Params.message},
+  };
+  if (Params.actions)
+    Out["actions"] = *Params.actions;
+  return Out;
+}
+
+llvm::json::Value llvm::lsp::toJSON(const MessageActionItem &Params) {
+  return llvm::json::Object{{"title", Params.title}};
+}
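
The new toJSON overloads above serialize a window/showMessage(Request) payload. A hedged sketch of the resulting JSON shape, built directly with llvm::json rather than through ShowMessageParams; the message text, action title, and the numeric type value are illustrative assumptions:

    #include "llvm/Support/JSON.h"
    #include "llvm/Support/raw_ostream.h"

    int main() {
      // Mirrors the structure emitted by the toJSON overloads above.
      llvm::json::Object Msg{
          {"type", 3}, // Assumed numeric value for an informational message.
          {"message", "Reload the file?"},
          {"actions",
           llvm::json::Array{llvm::json::Object{{"title", "Reload"}}}},
      };
      llvm::outs() << llvm::json::Value(std::move(Msg)) << "\n";
      return 0;
    }
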
diff --git a/llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp b/llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp
index 34d74d04c4419..60e6a82d41cc8 100644
--- a/llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp
@@ -1717,6 +1717,7 @@ bool AArch64ExpandPseudo::expandMI(MachineBasicBlock &MBB,
    }
    case AArch64::InOutZAUsePseudo:
    case AArch64::RequiresZASavePseudo:
+   case AArch64::RequiresZT0SavePseudo:
    case AArch64::SMEStateAllocPseudo:
    case AArch64::COALESCER_BARRIER_FPR16:
    case AArch64::COALESCER_BARRIER_FPR32:
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 6072fd9d8f242..4be371bd4e67a 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -530,7 +530,6 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
 
   setOperationAction(ISD::FREM, MVT::f32, Expand);
   setOperationAction(ISD::FREM, MVT::f64, Expand);
-  setOperationAction(ISD::FREM, MVT::f80, Expand);
 
   setOperationAction(ISD::BUILD_PAIR, MVT::i64, Expand);
 
@@ -1591,6 +1590,10 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
       setOperationAction(ISD::AVGCEILS, VT, Custom);
       setOperationAction(ISD::AVGCEILU, VT, Custom);
 
+      setOperationAction(ISD::ANY_EXTEND_VECTOR_INREG, VT, Custom);
+      setOperationAction(ISD::SIGN_EXTEND_VECTOR_INREG, VT, Custom);
+      setOperationAction(ISD::ZERO_EXTEND_VECTOR_INREG, VT, Custom);
+
       if (!Subtarget->isLittleEndian())
         setOperationAction(ISD::BITCAST, VT, Custom);
 
@@ -4554,6 +4557,26 @@ static SDValue lowerADDSUBO_CARRY(SDValue Op, SelectionDAG &DAG,
   return DAG.getMergeValues({Sum, OutFlag}, DL);
 }
 
+static SDValue lowerIntNeonIntrinsic(SDValue Op, unsigned Opcode,
+                                     SelectionDAG &DAG) {
+  SDLoc DL(Op);
+  auto getFloatVT = [](EVT VT) {
+    assert((VT == MVT::i32 || VT == MVT::i64) && "Unexpected VT");
+    return VT == MVT::i32 ? MVT::f32 : MVT::f64;
+  };
+  auto bitcastToFloat = [&](SDValue Val) {
+    return DAG.getBitcast(getFloatVT(Val.getValueType()), Val);
+  };
+  SmallVector<SDValue, 2> NewOps;
+  NewOps.reserve(Op.getNumOperands() - 1);
+
+  for (unsigned I = 1, E = Op.getNumOperands(); I < E; ++I)
+    NewOps.push_back(bitcastToFloat(Op.getOperand(I)));
+  EVT OrigVT = Op.getValueType();
+  SDValue OpNode = DAG.getNode(Opcode, DL, getFloatVT(OrigVT), NewOps);
+  return DAG.getBitcast(OrigVT, OpNode);
+}
+
 static SDValue LowerXALUO(SDValue Op, SelectionDAG &DAG) {
   // Let legalize expand this if it isn't a legal type yet.
   if (!DAG.getTargetLoweringInfo().isTypeLegal(Op.getValueType()))
@@ -6400,26 +6423,46 @@ SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
                                      Op.getOperand(1).getValueType(),
                                      Op.getOperand(1), Op.getOperand(2)));
     return SDValue();
+  case Intrinsic::aarch64_neon_sqrshl:
+    if (Op.getValueType().isVector())
+      return SDValue();
+    return lowerIntNeonIntrinsic(Op, AArch64ISD::SQRSHL, DAG);
+  case Intrinsic::aarch64_neon_sqshl:
+    if (Op.getValueType().isVector())
+      return SDValue();
+    return lowerIntNeonIntrinsic(Op, AArch64ISD::SQSHL, DAG);
+  case Intrinsic::aarch64_neon_uqrshl:
+    if (Op.getValueType().isVector())
+      return SDValue();
+    return lowerIntNeonIntrinsic(Op, AArch64ISD::UQRSHL, DAG);
+  case Intrinsic::aarch64_neon_uqshl:
+    if (Op.getValueType().isVector())
+      return SDValue();
+    return lowerIntNeonIntrinsic(Op, AArch64ISD::UQSHL, DAG);
   case Intrinsic::aarch64_neon_sqadd:
     if (Op.getValueType().isVector())
       return DAG.getNode(ISD::SADDSAT, DL, Op.getValueType(), Op.getOperand(1),
                          Op.getOperand(2));
-    return SDValue();
+    return lowerIntNeonIntrinsic(Op, AArch64ISD::SQADD, DAG);
+
   case Intrinsic::aarch64_neon_sqsub:
     if (Op.getValueType().isVector())
       return DAG.getNode(ISD::SSUBSAT, DL, Op.getValueType(), Op.getOperand(1),
                          Op.getOperand(2));
-    return SDValue();
+    return lowerIntNeonIntrinsic(Op, AArch64ISD::SQSUB, DAG);
+
   case Intrinsic::aarch64_neon_uqadd:
     if (Op.getValueType().isVector())
       return DAG.getNode(ISD::UADDSAT, DL, Op.getValueType(), Op.getOperand(1),
                          Op.getOperand(2));
-    return SDValue();
+    return lowerIntNeonIntrinsic(Op, AArch64ISD::UQADD, DAG);
   case Intrinsic::aarch64_neon_uqsub:
     if (Op.getValueType().isVector())
       return DAG.getNode(ISD::USUBSAT, DL, Op.getValueType(), Op.getOperand(1),
                          Op.getOperand(2));
-    return SDValue();
+    return lowerIntNeonIntrinsic(Op, AArch64ISD::UQSUB, DAG);
+  case Intrinsic::aarch64_neon_sqdmulls_scalar:
+    return lowerIntNeonIntrinsic(Op, AArch64ISD::SQDMULL, DAG);
   case Intrinsic::aarch64_sve_whilelt:
     return optimizeIncrementingWhile(Op.getNode(), DAG, /*IsSigned=*/true,
                                      /*IsEqual=*/false);
@@ -7859,6 +7902,9 @@ SDValue AArch64TargetLowering::LowerOperation(SDValue Op,
     return LowerEXTRACT_VECTOR_ELT(Op, DAG);
   case ISD::BUILD_VECTOR:
     return LowerBUILD_VECTOR(Op, DAG);
+  case ISD::ANY_EXTEND_VECTOR_INREG:
+  case ISD::SIGN_EXTEND_VECTOR_INREG:
+    return LowerEXTEND_VECTOR_INREG(Op, DAG);
   case ISD::ZERO_EXTEND_VECTOR_INREG:
     return LowerZERO_EXTEND_VECTOR_INREG(Op, DAG);
   case ISD::VECTOR_SHUFFLE:
@@ -8647,7 +8693,7 @@ SDValue AArch64TargetLowering::LowerFormalArguments(
               Subtarget->isWindowsArm64EC()) &&
              "Indirect arguments should be scalable on most subtargets");
 
-      uint64_t PartSize = VA.getValVT().getStoreSize().getKnownMinValue();
+      TypeSize PartSize = VA.getValVT().getStoreSize();
       unsigned NumParts = 1;
       if (Ins[i].Flags.isInConsecutiveRegs()) {
         while (!Ins[i + NumParts - 1].Flags.isInConsecutiveRegsLast())
@@ -8664,16 +8710,8 @@ SDValue AArch64TargetLowering::LowerFormalArguments(
         InVals.push_back(ArgValue);
         NumParts--;
         if (NumParts > 0) {
-          SDValue BytesIncrement;
-          if (PartLoad.isScalableVector()) {
-            BytesIncrement = DAG.getVScale(
-                DL, Ptr.getValueType(),
-                APInt(Ptr.getValueSizeInBits().getFixedValue(), PartSize));
-          } else {
-            BytesIncrement = DAG.getConstant(
-                APInt(Ptr.getValueSizeInBits().getFixedValue(), PartSize), DL,
-                Ptr.getValueType());
-          }
+          SDValue BytesIncrement =
+              DAG.getTypeSize(DL, Ptr.getValueType(), PartSize);
           Ptr = DAG.getNode(ISD::ADD, DL, Ptr.getValueType(), Ptr,
                             BytesIncrement, SDNodeFlags::NoUnsignedWrap);
           ExtraArgLocs++;
@@ -9642,6 +9680,8 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
     if (CallAttrs.requiresLazySave() ||
         CallAttrs.requiresPreservingAllZAState())
       ZAMarkerNode = AArch64ISD::REQUIRES_ZA_SAVE;
+    else if (CallAttrs.requiresPreservingZT0())
+      ZAMarkerNode = AArch64ISD::REQUIRES_ZT0_SAVE;
     else if (CallAttrs.caller().hasZAState() ||
              CallAttrs.caller().hasZT0State())
       ZAMarkerNode = AArch64ISD::INOUT_ZA_USE;
@@ -9761,7 +9801,8 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
 
   SDValue ZTFrameIdx;
   MachineFrameInfo &MFI = MF.getFrameInfo();
-  bool ShouldPreserveZT0 = CallAttrs.requiresPreservingZT0();
+  bool ShouldPreserveZT0 =
+      !UseNewSMEABILowering && CallAttrs.requiresPreservingZT0();
 
   // If the caller has ZT0 state which will not be preserved by the callee,
   // spill ZT0 before the call.
@@ -9774,7 +9815,8 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
 
   // If caller shares ZT0 but the callee is not shared ZA, we need to stop
   // PSTATE.ZA before the call if there is no lazy-save active.
-  bool DisableZA = CallAttrs.requiresDisablingZABeforeCall();
+  bool DisableZA =
+      !UseNewSMEABILowering && CallAttrs.requiresDisablingZABeforeCall();
   assert((!DisableZA || !RequiresLazySave) &&
          "Lazy-save should have PSTATE.SM=1 on entry to the function");
 
@@ -9876,8 +9918,8 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
       assert((isScalable || Subtarget->isWindowsArm64EC()) &&
              "Indirect arguments should be scalable on most subtargets");
 
-      uint64_t StoreSize = VA.getValVT().getStoreSize().getKnownMinValue();
-      uint64_t PartSize = StoreSize;
+      TypeSize StoreSize = VA.getValVT().getStoreSize();
+      TypeSize PartSize = StoreSize;
       unsigned NumParts = 1;
       if (Outs[i].Flags.isInConsecutiveRegs()) {
         while (!Outs[i + NumParts - 1].Flags.isInConsecutiveRegsLast())
@@ -9888,7 +9930,8 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
       Type *Ty = EVT(VA.getValVT()).getTypeForEVT(*DAG.getContext());
       Align Alignment = DAG.getDataLayout().getPrefTypeAlign(Ty);
       MachineFrameInfo &MFI = MF.getFrameInfo();
-      int FI = MFI.CreateStackObject(StoreSize, Alignment, false);
+      int FI =
+          MFI.CreateStackObject(StoreSize.getKnownMinValue(), Alignment, false);
       if (isScalable) {
         bool IsPred = VA.getValVT() == MVT::aarch64svcount ||
                       VA.getValVT().getVectorElementType() == MVT::i1;
@@ -9909,16 +9952,8 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
 
         NumParts--;
         if (NumParts > 0) {
-          SDValue BytesIncrement;
-          if (isScalable) {
-            BytesIncrement = DAG.getVScale(
-                DL, Ptr.getValueType(),
-                APInt(Ptr.getValueSizeInBits().getFixedValue(), PartSize));
-          } else {
-            BytesIncrement = DAG.getConstant(
-                APInt(Ptr.getValueSizeInBits().getFixedValue(), PartSize), DL,
-                Ptr.getValueType());
-          }
+          SDValue BytesIncrement =
+              DAG.getTypeSize(DL, Ptr.getValueType(), PartSize);
           MPI = MachinePointerInfo(MPI.getAddrSpace());
           Ptr = DAG.getNode(ISD::ADD, DL, Ptr.getValueType(), Ptr,
                             BytesIncrement, SDNodeFlags::NoUnsignedWrap);
@@ -10263,7 +10298,8 @@ AArch64TargetLowering::LowerCall(CallLoweringInfo &CLI,
         getSMToggleCondition(CallAttrs));
   }
 
-  if (RequiresLazySave || CallAttrs.requiresEnablingZAAfterCall())
+  if (!UseNewSMEABILowering &&
+      (RequiresLazySave || CallAttrs.requiresEnablingZAAfterCall()))
     // Unconditionally resume ZA.
     Result = DAG.getNode(
         AArch64ISD::SMSTART, DL, DAG.getVTList(MVT::Other, MVT::Glue), Result,
@@ -14701,6 +14737,40 @@ static SDValue tryToConvertShuffleOfTbl2ToTbl4(SDValue Op,
                       Tbl2->getOperand(1), Tbl2->getOperand(2), TBLMask});
 }
 
+SDValue
+AArch64TargetLowering::LowerEXTEND_VECTOR_INREG(SDValue Op,
+                                                SelectionDAG &DAG) const {
+  SDLoc DL(Op);
+  EVT VT = Op.getValueType();
+  assert(VT.isScalableVector() && "Unexpected result type!");
+
+  bool Signed = Op.getOpcode() == ISD::SIGN_EXTEND_VECTOR_INREG;
+  unsigned UnpackOpcode = Signed ? AArch64ISD::SUNPKLO : AArch64ISD::UUNPKLO;
+
+  // Repeatedly unpack Val until the result is of the desired type.
+  SDValue Val = Op.getOperand(0);
+  switch (Val.getSimpleValueType().SimpleTy) {
+  default:
+    return SDValue();
+  case MVT::nxv16i8:
+    Val = DAG.getNode(UnpackOpcode, DL, MVT::nxv8i16, Val);
+    if (VT == MVT::nxv8i16)
+      break;
+    [[fallthrough]];
+  case MVT::nxv8i16:
+    Val = DAG.getNode(UnpackOpcode, DL, MVT::nxv4i32, Val);
+    if (VT == MVT::nxv4i32)
+      break;
+    [[fallthrough]];
+  case MVT::nxv4i32:
+    Val = DAG.getNode(UnpackOpcode, DL, MVT::nxv2i64, Val);
+    assert(VT == MVT::nxv2i64 && "Unexpected result type!");
+    break;
+  }
+
+  return Val;
+}
+
 // Baseline legalization for ZERO_EXTEND_VECTOR_INREG will blend-in zeros,
 // but we don't have an appropriate instruction,
 // so custom-lower it as ZIP1-with-zeros.
@@ -14709,6 +14779,10 @@ AArch64TargetLowering::LowerZERO_EXTEND_VECTOR_INREG(SDValue Op,
                                                      SelectionDAG &DAG) const {
   SDLoc DL(Op);
   EVT VT = Op.getValueType();
+
+  if (VT.isScalableVector())
+    return LowerEXTEND_VECTOR_INREG(Op, DAG);
+
   SDValue SrcOp = Op.getOperand(0);
   EVT SrcVT = SrcOp.getValueType();
   assert(VT.getScalarSizeInBits() % SrcVT.getScalarSizeInBits() == 0 &&
@@ -17193,7 +17267,7 @@ SDValue AArch64TargetLowering::LowerVSCALE(SDValue Op,
 template <unsigned NumVecs>
 static bool
 setInfoSVEStN(const AArch64TargetLowering &TLI, const DataLayout &DL,
-              AArch64TargetLowering::IntrinsicInfo &Info, const CallInst &CI) {
+              AArch64TargetLowering::IntrinsicInfo &Info, const CallBase &CI) {
   Info.opc = ISD::INTRINSIC_VOID;
   // Retrieve EC from first vector argument.
   const EVT VT = TLI.getMemValueType(DL, CI.getArgOperand(0)->getType());
@@ -17218,7 +17292,7 @@ setInfoSVEStN(const AArch64TargetLowering &TLI, const DataLayout &DL,
 /// MemIntrinsicNodes.  The associated MachineMemOperands record the alignment
 /// specified in the intrinsic calls.
 bool AArch64TargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
-                                               const CallInst &I,
+                                               const CallBase &I,
                                                MachineFunction &MF,
                                                unsigned Intrinsic) const {
   auto &DL = I.getDataLayout();
@@ -26949,22 +27023,25 @@ static SDValue performSelectCombine(SDNode *N,
   assert((N0.getValueType() == MVT::i1 || N0.getValueType() == MVT::i32) &&
          "Scalar-SETCC feeding SELECT has unexpected result type!");
 
-  // If NumMaskElts == 0, the comparison is larger than select result. The
-  // largest real NEON comparison is 64-bits per lane, which means the result is
-  // at most 32-bits and an illegal vector. Just bail out for now.
-  EVT SrcVT = N0.getOperand(0).getValueType();
-
   // Don't try to do this optimization when the setcc itself has i1 operands.
   // There are no legal vectors of i1, so this would be pointless. v1f16 is
   // ruled out to prevent the creation of setcc that need to be scalarized.
+  EVT SrcVT = N0.getOperand(0).getValueType();
   if (SrcVT == MVT::i1 ||
       (SrcVT.isFloatingPoint() && SrcVT.getSizeInBits() <= 16))
     return SDValue();
 
-  int NumMaskElts = ResVT.getSizeInBits() / SrcVT.getSizeInBits();
+  // If NumMaskElts == 0, the comparison is larger than select result. The
+  // largest real NEON comparison is 64-bits per lane, which means the result is
+  // at most 32-bits and an illegal vector. Just bail out for now.
+  unsigned NumMaskElts = ResVT.getSizeInBits() / SrcVT.getSizeInBits();
   if (!ResVT.isVector() || NumMaskElts == 0)
     return SDValue();
 
+  // Avoid creating vectors with excessive VFs before legalization.
+  if (DCI.isBeforeLegalize() && NumMaskElts != ResVT.getVectorNumElements())
+    return SDValue();
+
   SrcVT = EVT::getVectorVT(*DAG.getContext(), SrcVT, NumMaskElts);
   EVT CCVT = SrcVT.changeVectorElementTypeToInteger();
 
@@ -28887,7 +28964,8 @@ void AArch64TargetLowering::ReplaceExtractSubVectorResults(
   if ((Index != 0) && (Index != ResEC.getKnownMinValue()))
     return;
 
-  unsigned Opcode = (Index == 0) ? AArch64ISD::UUNPKLO : AArch64ISD::UUNPKHI;
+  unsigned Opcode = (Index == 0) ? (unsigned)ISD::ANY_EXTEND_VECTOR_INREG
+                                 : (unsigned)AArch64ISD::UUNPKHI;
   EVT ExtendedHalfVT = VT.widenIntegerVectorElementType(*DAG.getContext());
 
   SDValue Half = DAG.getNode(Opcode, DL, ExtendedHalfVT, N->getOperand(0));
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index 32aa913181a21..1d4446d287462 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -206,7 +206,7 @@ class AArch64TargetLowering : public TargetLowering {
   EmitInstrWithCustomInserter(MachineInstr &MI,
                               MachineBasicBlock *MBB) const override;
 
-  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallInst &I,
+  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallBase &I,
                           MachineFunction &MF,
                           unsigned Intrinsic) const override;
 
@@ -714,6 +714,7 @@ class AArch64TargetLowering : public TargetLowering {
   SDValue LowerINSERT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerEXTRACT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG) const;
+  SDValue LowerEXTEND_VECTOR_INREG(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerZERO_EXTEND_VECTOR_INREG(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerSPLAT_VECTOR(SDValue Op, SelectionDAG &DAG) const;
diff --git a/llvm/lib/Target/AArch64/AArch64InstrFormats.td b/llvm/lib/Target/AArch64/AArch64InstrFormats.td
index 61a8f764e39ed..4d2e740779961 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrFormats.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrFormats.td
@@ -7700,16 +7700,21 @@ multiclass SIMDThreeScalarD<bit U, bits<5> opc, string asm,
 }
 
 multiclass SIMDThreeScalarBHSD<bit U, bits<5> opc, string asm,
-                               SDPatternOperator OpNode, SDPatternOperator SatOp> {
+                               SDPatternOperator OpNode, SDPatternOperator G_OpNode, SDPatternOperator SatOp> {
   def v1i64  : BaseSIMDThreeScalar<U, 0b111, opc, FPR64, asm,
     [(set (v1i64 FPR64:$Rd), (SatOp (v1i64 FPR64:$Rn), (v1i64 FPR64:$Rm)))]>;
   def v1i32  : BaseSIMDThreeScalar<U, 0b101, opc, FPR32, asm, []>;
   def v1i16  : BaseSIMDThreeScalar<U, 0b011, opc, FPR16, asm, []>;
   def v1i8   : BaseSIMDThreeScalar<U, 0b001, opc, FPR8 , asm, []>;
 
-  def : Pat<(i64 (OpNode (i64 FPR64:$Rn), (i64 FPR64:$Rm))),
+  def : Pat<(i64 (G_OpNode (i64 FPR64:$Rn), (i64 FPR64:$Rm))),
             (!cast<Instruction>(NAME#"v1i64") FPR64:$Rn, FPR64:$Rm)>;
-  def : Pat<(i32 (OpNode (i32 FPR32:$Rn), (i32 FPR32:$Rm))),
+  def : Pat<(i32 (G_OpNode (i32 FPR32:$Rn), (i32 FPR32:$Rm))),
+            (!cast<Instruction>(NAME#"v1i32") FPR32:$Rn, FPR32:$Rm)>;
+
+  def : Pat<(f64 (OpNode FPR64:$Rn, FPR64:$Rm)),
+            (!cast<Instruction>(NAME#"v1i64") FPR64:$Rn, FPR64:$Rm)>;
+  def : Pat<(f32 (OpNode FPR32:$Rn, FPR32:$Rm)),
             (!cast<Instruction>(NAME#"v1i32") FPR32:$Rn, FPR32:$Rm)>;
 }
 
@@ -7795,7 +7800,7 @@ multiclass SIMDThreeScalarMixedHS<bit U, bits<5> opc, string asm,
   def i32  : BaseSIMDThreeScalarMixed<U, 0b10, opc,
                                       (outs FPR64:$Rd),
                                       (ins FPR32:$Rn, FPR32:$Rm), asm, "",
-            [(set (i64 FPR64:$Rd), (OpNode (i32 FPR32:$Rn), (i32 FPR32:$Rm)))]>;
+            [(set (f64 FPR64:$Rd), (OpNode FPR32:$Rn, FPR32:$Rm))]>;
 }
 
 let mayLoad = 0, mayStore = 0, hasSideEffects = 0 in
@@ -9800,7 +9805,8 @@ multiclass SIMDIndexedLongSD<bit U, bits<4> opc, string asm,
 
 multiclass SIMDIndexedLongSQDMLXSDTied<bit U, bits<4> opc, string asm,
                                        SDPatternOperator VecAcc,
-                                       SDPatternOperator ScalAcc> {
+                                       SDPatternOperator ScalAcc,
+                                       SDPatternOperator G_ScalAcc> {
   def v4i16_indexed : BaseSIMDIndexedTied<0, U, 0, 0b01, opc,
                                       V128, V64,
                                       V128_lo, VectorIndexH,
@@ -9869,7 +9875,7 @@ multiclass SIMDIndexedLongSQDMLXSDTied<bit U, bits<4> opc, string asm,
     let Inst{20} = idx{0};
   }
 
-  def : Pat<(i32 (ScalAcc (i32 FPR32Op:$Rd),
+  def : Pat<(i32 (G_ScalAcc (i32 FPR32Op:$Rd),
                           (i32 (vector_extract
                                     (v4i32 (int_aarch64_neon_sqdmull
                                                 (v4i16 V64:$Rn),
@@ -9881,7 +9887,19 @@ multiclass SIMDIndexedLongSQDMLXSDTied<bit U, bits<4> opc, string asm,
                         (INSERT_SUBREG (IMPLICIT_DEF), V64:$Rm, dsub),
                         (i64 0))>;
 
-  def : Pat<(i32 (ScalAcc (i32 FPR32Op:$Rd),
+  def : Pat<(f32 (ScalAcc FPR32Op:$Rd,
+                    (bitconvert (i32 (vector_extract
+                                      (v4i32 (int_aarch64_neon_sqdmull
+                                                (v4i16 V64:$Rn),
+                                                (v4i16 V64:$Rm))),
+                                      (i64 0)))))),
+            (!cast<Instruction>(NAME # v1i32_indexed)
+                        FPR32Op:$Rd,
+                        (f16 (EXTRACT_SUBREG V64:$Rn, hsub)),
+                        (INSERT_SUBREG (IMPLICIT_DEF), V64:$Rm, dsub),
+                        (i64 0))>;
+
+  def : Pat<(i32 (G_ScalAcc (i32 FPR32Op:$Rd),
                           (i32 (vector_extract
                                     (v4i32 (int_aarch64_neon_sqdmull
                                                 (v4i16 V64:$Rn),
@@ -9894,15 +9912,27 @@ multiclass SIMDIndexedLongSQDMLXSDTied<bit U, bits<4> opc, string asm,
                         V128_lo:$Rm,
                         VectorIndexH:$idx)>;
 
+  def : Pat<(f32 (ScalAcc FPR32Op:$Rd,
+                          (bitconvert (i32 (vector_extract
+                                    (v4i32 (int_aarch64_neon_sqdmull
+                                                (v4i16 V64:$Rn),
+                                                (dup_v8i16 (v8i16 V128_lo:$Rm),
+                                                            VectorIndexH:$idx))),
+                                    (i64 0)))))),
+            (!cast<Instruction>(NAME # v1i32_indexed)
+                        FPR32Op:$Rd,
+                        (f16 (EXTRACT_SUBREG V64:$Rn, hsub)),
+                        V128_lo:$Rm,
+                        VectorIndexH:$idx)>;
+
   def v1i64_indexed : BaseSIMDIndexedTied<1, U, 1, 0b10, opc,
                                       FPR64Op, FPR32Op, V128, VectorIndexS,
                                       asm, ".s", "", "", ".s",
-    [(set (i64 FPR64Op:$dst),
-          (ScalAcc (i64 FPR64Op:$Rd),
-                   (i64 (int_aarch64_neon_sqdmulls_scalar
-                            (i32 FPR32Op:$Rn),
-                            (i32 (vector_extract (v4i32 V128:$Rm),
-                                                 VectorIndexS:$idx))))))]> {
+    [(set (f64 FPR64Op:$dst),
+          (ScalAcc FPR64Op:$Rd,
+                   (AArch64sqdmull FPR32Op:$Rn,
+                            (bitconvert (i32 (vector_extract (v4i32 V128:$Rm),
+                                                VectorIndexS:$idx))))))]> {
 
     bits<2> idx;
     let Inst{11} = idx{1};
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index da93a2b13fc11..64017d7cafca3 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -1024,6 +1024,18 @@ def AArch64fcvtnu_half : SDNode<"AArch64ISD::FCVTNU_HALF", SDTFPExtendOp>;
 def AArch64fcvtps_half : SDNode<"AArch64ISD::FCVTPS_HALF", SDTFPExtendOp>;
 def AArch64fcvtpu_half : SDNode<"AArch64ISD::FCVTPU_HALF", SDTFPExtendOp>;
 
+def AArch64sqadd:   SDNode<"AArch64ISD::SQADD",   SDTFPBinOp>;
+def AArch64sqrshl:  SDNode<"AArch64ISD::SQRSHL",  SDTFPBinOp>;
+def AArch64sqshl:   SDNode<"AArch64ISD::SQSHL",   SDTFPBinOp>;
+def AArch64sqsub:   SDNode<"AArch64ISD::SQSUB",   SDTFPBinOp>;
+def AArch64uqadd:   SDNode<"AArch64ISD::UQADD",   SDTFPBinOp>;
+def AArch64uqrshl:  SDNode<"AArch64ISD::UQRSHL",  SDTFPBinOp>;
+def AArch64uqshl:   SDNode<"AArch64ISD::UQSHL",   SDTFPBinOp>;
+def AArch64uqsub:   SDNode<"AArch64ISD::UQSUB",   SDTFPBinOp>;
+def AArch64sqdmull: SDNode<"AArch64ISD::SQDMULL", 
+                           SDTypeProfile<1, 2, [ SDTCisSameAs<1, 2>, 
+                           SDTCisFP<0>, SDTCisFP<1>]>>;
+
 //def Aarch64softf32tobf16v8: SDNode<"AArch64ISD::", SDTFPRoundOp>;
 
 // Vector immediate ops
@@ -6433,19 +6445,19 @@ defm FCMGT    : SIMDThreeScalarFPCmp<1, 1, 0b100, "fcmgt", AArch64fcmgt>;
 defm FMULX    : SIMDFPThreeScalar<0, 0, 0b011, "fmulx", int_aarch64_neon_fmulx, HasNEONandIsStreamingSafe>;
 defm FRECPS   : SIMDFPThreeScalar<0, 0, 0b111, "frecps", int_aarch64_neon_frecps, HasNEONandIsStreamingSafe>;
 defm FRSQRTS  : SIMDFPThreeScalar<0, 1, 0b111, "frsqrts", int_aarch64_neon_frsqrts, HasNEONandIsStreamingSafe>;
-defm SQADD    : SIMDThreeScalarBHSD<0, 0b00001, "sqadd", int_aarch64_neon_sqadd, saddsat>;
+defm SQADD    : SIMDThreeScalarBHSD<0, 0b00001, "sqadd", AArch64sqadd, int_aarch64_neon_sqadd, saddsat>;
 defm SQDMULH  : SIMDThreeScalarHS<  0, 0b10110, "sqdmulh", int_aarch64_neon_sqdmulh>;
 defm SQRDMULH : SIMDThreeScalarHS<  1, 0b10110, "sqrdmulh", int_aarch64_neon_sqrdmulh>;
-defm SQRSHL   : SIMDThreeScalarBHSD<0, 0b01011, "sqrshl", int_aarch64_neon_sqrshl, int_aarch64_neon_sqrshl>;
-defm SQSHL    : SIMDThreeScalarBHSD<0, 0b01001, "sqshl", int_aarch64_neon_sqshl, int_aarch64_neon_sqshl>;
-defm SQSUB    : SIMDThreeScalarBHSD<0, 0b00101, "sqsub", int_aarch64_neon_sqsub, ssubsat>;
+defm SQRSHL   : SIMDThreeScalarBHSD<0, 0b01011, "sqrshl", AArch64sqrshl, int_aarch64_neon_sqrshl, int_aarch64_neon_sqrshl>;
+defm SQSHL    : SIMDThreeScalarBHSD<0, 0b01001, "sqshl", AArch64sqshl, int_aarch64_neon_sqshl, int_aarch64_neon_sqshl>;
+defm SQSUB    : SIMDThreeScalarBHSD<0, 0b00101, "sqsub", AArch64sqsub, int_aarch64_neon_sqsub, ssubsat>;
 defm SRSHL    : SIMDThreeScalarD<   0, 0b01010, "srshl", int_aarch64_neon_srshl>;
 defm SSHL     : SIMDThreeScalarD<   0, 0b01000, "sshl", int_aarch64_neon_sshl>;
 defm SUB      : SIMDThreeScalarD<   1, 0b10000, "sub", sub>;
-defm UQADD    : SIMDThreeScalarBHSD<1, 0b00001, "uqadd", int_aarch64_neon_uqadd, uaddsat>;
-defm UQRSHL   : SIMDThreeScalarBHSD<1, 0b01011, "uqrshl", int_aarch64_neon_uqrshl, int_aarch64_neon_uqrshl>;
-defm UQSHL    : SIMDThreeScalarBHSD<1, 0b01001, "uqshl", int_aarch64_neon_uqshl, int_aarch64_neon_uqshl>;
-defm UQSUB    : SIMDThreeScalarBHSD<1, 0b00101, "uqsub", int_aarch64_neon_uqsub, usubsat>;
+defm UQADD    : SIMDThreeScalarBHSD<1, 0b00001, "uqadd", AArch64uqadd, int_aarch64_neon_uqadd, uaddsat>;
+defm UQRSHL   : SIMDThreeScalarBHSD<1, 0b01011, "uqrshl", AArch64uqrshl, int_aarch64_neon_uqrshl, int_aarch64_neon_uqrshl>;
+defm UQSHL    : SIMDThreeScalarBHSD<1, 0b01001, "uqshl", AArch64uqshl, int_aarch64_neon_uqshl, int_aarch64_neon_uqshl>;
+defm UQSUB    : SIMDThreeScalarBHSD<1, 0b00101, "uqsub", AArch64uqsub, int_aarch64_neon_uqsub, usubsat>;
 defm URSHL    : SIMDThreeScalarD<   1, 0b01010, "urshl", int_aarch64_neon_urshl>;
 defm USHL     : SIMDThreeScalarD<   1, 0b01000, "ushl", int_aarch64_neon_ushl>;
 let Predicates = [HasRDM] in {
@@ -6496,17 +6508,16 @@ def : InstAlias<"faclt $dst, $src1, $src2",
 // Advanced SIMD three scalar instructions (mixed operands).
 //===----------------------------------------------------------------------===//
 defm SQDMULL  : SIMDThreeScalarMixedHS<0, 0b11010, "sqdmull",
-                                       int_aarch64_neon_sqdmulls_scalar>;
+                                       AArch64sqdmull>;
 defm SQDMLAL  : SIMDThreeScalarMixedTiedHS<0, 0b10010, "sqdmlal">;
 defm SQDMLSL  : SIMDThreeScalarMixedTiedHS<0, 0b10110, "sqdmlsl">;
 
-def : Pat<(i64 (int_aarch64_neon_sqadd (i64 FPR64:$Rd),
-                   (i64 (int_aarch64_neon_sqdmulls_scalar (i32 FPR32:$Rn),
-                                                        (i32 FPR32:$Rm))))),
+def : Pat<(f64 (AArch64sqadd FPR64:$Rd,
+                (AArch64sqdmull FPR32:$Rn, FPR32:$Rm))),
           (SQDMLALi32 FPR64:$Rd, FPR32:$Rn, FPR32:$Rm)>;
-def : Pat<(i64 (int_aarch64_neon_sqsub (i64 FPR64:$Rd),
-                   (i64 (int_aarch64_neon_sqdmulls_scalar (i32 FPR32:$Rn),
-                                                        (i32 FPR32:$Rm))))),
+
+def : Pat<(f64 (AArch64sqsub FPR64:$Rd,
+               (AArch64sqdmull FPR32:$Rn, FPR32:$Rm))),
           (SQDMLSLi32 FPR64:$Rd, FPR32:$Rn, FPR32:$Rm)>;
 
 //===----------------------------------------------------------------------===//
@@ -8734,9 +8745,9 @@ defm SMLSL : SIMDVectorIndexedLongSDTied<0, 0b0110, "smlsl",
     TriOpFrag<(sub node:$LHS, (AArch64smull node:$MHS, node:$RHS))>>;
 defm SMULL : SIMDVectorIndexedLongSD<0, 0b1010, "smull", AArch64smull>;
 defm SQDMLAL : SIMDIndexedLongSQDMLXSDTied<0, 0b0011, "sqdmlal", saddsat,
-                                           int_aarch64_neon_sqadd>;
+                                           AArch64sqadd, int_aarch64_neon_sqadd>;
 defm SQDMLSL : SIMDIndexedLongSQDMLXSDTied<0, 0b0111, "sqdmlsl", ssubsat,
-                                           int_aarch64_neon_sqsub>;
+                                           AArch64sqsub, int_aarch64_neon_sqsub>;
 defm SQRDMLAH : SIMDIndexedSQRDMLxHSDTied<1, 0b1101, "sqrdmlah",
                                           int_aarch64_neon_sqrdmlah>;
 defm SQRDMLSH : SIMDIndexedSQRDMLxHSDTied<1, 0b1111, "sqrdmlsh",
diff --git a/llvm/lib/Target/AArch64/AArch64SMEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SMEInstrInfo.td
index 737169253ddb3..b099f15ecf7e3 100644
--- a/llvm/lib/Target/AArch64/AArch64SMEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SMEInstrInfo.td
@@ -102,6 +102,7 @@ def : Pat<(i64 (AArch64AllocateSMESaveBuffer GPR64:$size)),
 let hasSideEffects = 1, isMeta = 1 in {
   def InOutZAUsePseudo : Pseudo<(outs), (ins), []>, Sched<[]>;
   def RequiresZASavePseudo : Pseudo<(outs), (ins), []>, Sched<[]>;
+  def RequiresZT0SavePseudo : Pseudo<(outs), (ins), []>, Sched<[]>;
 }
 
 def SMEStateAllocPseudo : Pseudo<(outs), (ins), []>, Sched<[]>;
@@ -122,6 +123,11 @@ def AArch64_requires_za_save
            [SDNPHasChain, SDNPInGlue, SDNPOutGlue]>;
 def : Pat<(AArch64_requires_za_save), (RequiresZASavePseudo)>;
 
+def AArch64_requires_zt0_save
+  : SDNode<"AArch64ISD::REQUIRES_ZT0_SAVE", SDTypeProfile<0, 0, []>,
+           [SDNPHasChain, SDNPInGlue, SDNPOutGlue]>;
+def : Pat<(AArch64_requires_zt0_save), (RequiresZT0SavePseudo)>;
+
 def AArch64_sme_state_alloc
   : SDNode<"AArch64ISD::SME_STATE_ALLOC", SDTypeProfile<0, 0,[]>,
            [SDNPHasChain]>;
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 3a5f1499f9d2d..8b08b30388cc2 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -4794,12 +4794,21 @@ static unsigned getSVEGatherScatterOverhead(unsigned Opcode,
   }
 }
 
-InstructionCost AArch64TTIImpl::getGatherScatterOpCost(
-    unsigned Opcode, Type *DataTy, const Value *Ptr, bool VariableMask,
-    Align Alignment, TTI::TargetCostKind CostKind, const Instruction *I) const {
+InstructionCost
+AArch64TTIImpl::getGatherScatterOpCost(const MemIntrinsicCostAttributes &MICA,
+                                       TTI::TargetCostKind CostKind) const {
+
+  unsigned Opcode = (MICA.getID() == Intrinsic::masked_gather ||
+                     MICA.getID() == Intrinsic::vp_gather)
+                        ? Instruction::Load
+                        : Instruction::Store;
+
+  Type *DataTy = MICA.getDataType();
+  Align Alignment = MICA.getAlignment();
+  const Instruction *I = MICA.getInst();
+
   if (useNeonVector(DataTy) || !isLegalMaskedGatherScatter(DataTy))
-    return BaseT::getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
-                                         Alignment, CostKind, I);
+    return BaseT::getGatherScatterOpCost(MICA, CostKind);
   auto *VT = cast<VectorType>(DataTy);
   auto LT = getTypeLegalizationCost(DataTy);
   if (!LT.first.isValid())
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index fe3bb5e7981d2..c4f402b7c3b8e 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -192,10 +192,8 @@ class AArch64TTIImpl final : public BasicTTIImplBase<AArch64TTIImpl> {
                         TTI::TargetCostKind CostKind) const override;
 
   InstructionCost
-  getGatherScatterOpCost(unsigned Opcode, Type *DataTy, const Value *Ptr,
-                         bool VariableMask, Align Alignment,
-                         TTI::TargetCostKind CostKind,
-                         const Instruction *I = nullptr) const override;
+  getGatherScatterOpCost(const MemIntrinsicCostAttributes &MICA,
+                         TTI::TargetCostKind CostKind) const override;
 
   bool isExtPartOfAvgExpr(const Instruction *ExtUser, Type *Dst,
                           Type *Src) const;
diff --git a/llvm/lib/Target/AArch64/MCTargetDesc/AArch64AsmBackend.cpp b/llvm/lib/Target/AArch64/MCTargetDesc/AArch64AsmBackend.cpp
index 7a2b6790f8a5b..1f9694cf98fec 100644
--- a/llvm/lib/Target/AArch64/MCTargetDesc/AArch64AsmBackend.cpp
+++ b/llvm/lib/Target/AArch64/MCTargetDesc/AArch64AsmBackend.cpp
@@ -586,6 +586,11 @@ class DarwinAArch64AsmBackend : public AArch64AsmBackend {
   /// Generate the compact unwind encoding from the CFI directives.
   uint64_t generateCompactUnwindEncoding(const MCDwarfFrameInfo *FI,
                                          const MCContext *Ctxt) const override {
+    // MTE-tagged frames must use DWARF unwinding because compact unwind
+    // doesn't handle MTE tags.
+    if (FI->IsMTETaggedFrame)
+      return CU::UNWIND_ARM64_MODE_DWARF;
+
     ArrayRef<MCCFIInstruction> Instrs = FI->Instructions;
     if (Instrs.empty())
       return CU::UNWIND_ARM64_MODE_FRAMELESS;
diff --git a/llvm/lib/Target/AArch64/MachineSMEABIPass.cpp b/llvm/lib/Target/AArch64/MachineSMEABIPass.cpp
index ead1dfceb96a0..b3e1ddbb91f79 100644
--- a/llvm/lib/Target/AArch64/MachineSMEABIPass.cpp
+++ b/llvm/lib/Target/AArch64/MachineSMEABIPass.cpp
@@ -72,20 +72,34 @@ using namespace llvm;
 
 namespace {
 
-enum ZAState {
+// Note: For agnostic ZA, we assume the function is always entered/exited in the
+// "ACTIVE" state -- this _may_ not be the case (OFF is also a possibility),
+// but for the purpose of placing ZA saves/restores, that does not
+// matter.
+enum ZAState : uint8_t {
   // Any/unknown state (not valid)
   ANY = 0,
 
   // ZA is in use and active (i.e. within the accumulator)
   ACTIVE,
 
+  // ZA is active, but ZT0 has been saved.
+  // This handles the edge case of sharedZA && !sharesZT0.
+  ACTIVE_ZT0_SAVED,
+
   // A ZA save has been set up or committed (i.e. ZA is dormant or off)
+  // If the function uses ZT0 it must also be saved.
   LOCAL_SAVED,
 
+  // ZA has been committed to the lazy save buffer of the current function.
+  // If the function uses ZT0 it must also be saved.
+  // ZA is off.
+  LOCAL_COMMITTED,
+
   // The ZA/ZT0 state on entry to the function.
   ENTRY,
 
-  // ZA is off
+  // ZA is off.
   OFF,
 
   // The number of ZA states (not a valid state)
@@ -164,6 +178,14 @@ class EmitContext {
     return AgnosticZABufferPtr;
   }
 
+  int getZT0SaveSlot(MachineFunction &MF) {
+    if (ZT0SaveFI)
+      return *ZT0SaveFI;
+    MachineFrameInfo &MFI = MF.getFrameInfo();
+    ZT0SaveFI = MFI.CreateSpillStackObject(64, Align(16));
+    return *ZT0SaveFI;
+  }
+
   /// Returns true if the function must allocate a ZA save buffer on entry. This
   /// will be the case if, at any point in the function, a ZA save was emitted.
   bool needsSaveBuffer() const {
@@ -173,6 +195,7 @@ class EmitContext {
   }
 
 private:
+  std::optional<int> ZT0SaveFI;
   std::optional<int> TPIDR2BlockFI;
   Register AgnosticZABufferPtr = AArch64::NoRegister;
 };
@@ -184,8 +207,10 @@ class EmitContext {
 /// state would not be legal, as transitioning to it drops the content of ZA.
 static bool isLegalEdgeBundleZAState(ZAState State) {
   switch (State) {
-  case ZAState::ACTIVE:      // ZA state within the accumulator/ZT0.
-  case ZAState::LOCAL_SAVED: // ZA state is saved on the stack.
+  case ZAState::ACTIVE:           // ZA state within the accumulator/ZT0.
+  case ZAState::ACTIVE_ZT0_SAVED: // ZT0 is saved (ZA is active).
+  case ZAState::LOCAL_SAVED:      // ZA state may be saved on the stack.
+  case ZAState::LOCAL_COMMITTED:  // ZA state is saved on the stack.
     return true;
   default:
     return false;
@@ -199,7 +224,9 @@ StringRef getZAStateString(ZAState State) {
   switch (State) {
     MAKE_CASE(ZAState::ANY)
     MAKE_CASE(ZAState::ACTIVE)
+    MAKE_CASE(ZAState::ACTIVE_ZT0_SAVED)
     MAKE_CASE(ZAState::LOCAL_SAVED)
+    MAKE_CASE(ZAState::LOCAL_COMMITTED)
     MAKE_CASE(ZAState::ENTRY)
     MAKE_CASE(ZAState::OFF)
   default:
@@ -221,18 +248,39 @@ static bool isZAorZTRegOp(const TargetRegisterInfo &TRI,
 /// Returns the required ZA state needed before \p MI and an iterator pointing
 /// to where any code required to change the ZA state should be inserted.
 static std::pair<ZAState, MachineBasicBlock::iterator>
-getZAStateBeforeInst(const TargetRegisterInfo &TRI, MachineInstr &MI,
-                     bool ZAOffAtReturn) {
+getInstNeededZAState(const TargetRegisterInfo &TRI, MachineInstr &MI,
+                     SMEAttrs SMEFnAttrs) {
   MachineBasicBlock::iterator InsertPt(MI);
 
+  // Note: InOutZAUsePseudo, RequiresZASavePseudo, and RequiresZT0SavePseudo are
+  // intended to mark the position immediately before a call. Due to
+  // SelectionDAG constraints, these markers occur after the ADJCALLSTACKDOWN,
+  // so we use std::prev(InsertPt) to get the position before the call.
+
   if (MI.getOpcode() == AArch64::InOutZAUsePseudo)
     return {ZAState::ACTIVE, std::prev(InsertPt)};
 
+  // Note: If we need to save both ZA and ZT0 we use RequiresZASavePseudo.
   if (MI.getOpcode() == AArch64::RequiresZASavePseudo)
     return {ZAState::LOCAL_SAVED, std::prev(InsertPt)};
 
-  if (MI.isReturn())
+  // If we only need to save ZT0, there are two cases to consider:
+  //   1. The function has ZA state (that we don't need to save).
+  //      - In this case we switch to the "ACTIVE_ZT0_SAVED" state.
+  //        This only saves ZT0.
+  //   2. The function does not have ZA state.
+  //      - In this case we switch to the "LOCAL_COMMITTED" state.
+  //        This saves ZT0 and turns ZA off.
+  if (MI.getOpcode() == AArch64::RequiresZT0SavePseudo) {
+    return {SMEFnAttrs.hasZAState() ? ZAState::ACTIVE_ZT0_SAVED
+                                    : ZAState::LOCAL_COMMITTED,
+            std::prev(InsertPt)};
+  }
+
+  if (MI.isReturn()) {
+    bool ZAOffAtReturn = SMEFnAttrs.hasPrivateZAInterface();
     return {ZAOffAtReturn ? ZAState::OFF : ZAState::ACTIVE, InsertPt};
+  }
 
   for (auto &MO : MI.operands()) {
     if (isZAorZTRegOp(TRI, MO))
@@ -280,6 +328,9 @@ struct MachineSMEABI : public MachineFunctionPass {
   /// predecessors).
   void propagateDesiredStates(FunctionInfo &FnInfo, bool Forwards = true);
 
+  void emitZT0SaveRestore(EmitContext &, MachineBasicBlock &MBB,
+                          MachineBasicBlock::iterator MBBI, bool IsSave);
+
   // Emission routines for private and shared ZA functions (using lazy saves).
   void emitSMEPrologue(MachineBasicBlock &MBB,
                        MachineBasicBlock::iterator MBBI);
@@ -290,8 +341,8 @@ struct MachineSMEABI : public MachineFunctionPass {
                          MachineBasicBlock::iterator MBBI);
   void emitAllocateLazySaveBuffer(EmitContext &, MachineBasicBlock &MBB,
                                   MachineBasicBlock::iterator MBBI);
-  void emitZAOff(MachineBasicBlock &MBB, MachineBasicBlock::iterator MBBI,
-                 bool ClearTPIDR2);
+  void emitZAMode(MachineBasicBlock &MBB, MachineBasicBlock::iterator MBBI,
+                  bool ClearTPIDR2, bool On);
 
   // Emission routines for agnostic ZA functions.
   void emitSetupFullZASave(MachineBasicBlock &MBB,
@@ -409,7 +460,7 @@ FunctionInfo MachineSMEABI::collectNeededZAStates(SMEAttrs SMEFnAttrs) {
       Block.FixedEntryState = ZAState::ENTRY;
     } else if (MBB.isEHPad()) {
       // EH entry block:
-      Block.FixedEntryState = ZAState::LOCAL_SAVED;
+      Block.FixedEntryState = ZAState::LOCAL_COMMITTED;
     }
 
     LiveRegUnits LiveUnits(*TRI);
@@ -431,8 +482,7 @@ FunctionInfo MachineSMEABI::collectNeededZAStates(SMEAttrs SMEFnAttrs) {
         PhysLiveRegsAfterSMEPrologue = PhysLiveRegs;
       }
       // Note: We treat Agnostic ZA as inout_za with an alternate save/restore.
-      auto [NeededState, InsertPt] = getZAStateBeforeInst(
-          *TRI, MI, /*ZAOffAtReturn=*/SMEFnAttrs.hasPrivateZAInterface());
+      auto [NeededState, InsertPt] = getInstNeededZAState(*TRI, MI, SMEFnAttrs);
       assert((InsertPt == MBBI || isCallStartOpcode(InsertPt->getOpcode())) &&
              "Unexpected state change insertion point!");
       // TODO: Do something to avoid state changes where NZCV is live.
@@ -582,8 +632,8 @@ MachineSMEABI::findStateChangeInsertionPoint(
     PhysLiveRegs = Block.PhysLiveRegsAtExit;
   }
 
-  if (!(PhysLiveRegs & LiveRegs::NZCV))
-    return {InsertPt, PhysLiveRegs}; // Nothing to do (no live flags).
+  if (PhysLiveRegs == LiveRegs::None)
+    return {InsertPt, PhysLiveRegs}; // Nothing to do (no live regs).
 
   // Find the previous state change. We can not move before this point.
   MachineBasicBlock::iterator PrevStateChangeI;
@@ -600,15 +650,21 @@ MachineSMEABI::findStateChangeInsertionPoint(
   // Note: LiveUnits will only accurately track X0 and NZCV.
   LiveRegUnits LiveUnits(*TRI);
   setPhysLiveRegs(LiveUnits, PhysLiveRegs);
+  auto BestCandidate = std::make_pair(InsertPt, PhysLiveRegs);
   for (MachineBasicBlock::iterator I = InsertPt; I != PrevStateChangeI; --I) {
     // Don't move before/into a call (which may have a state change before it).
     if (I->getOpcode() == TII->getCallFrameDestroyOpcode() || I->isCall())
       break;
     LiveUnits.stepBackward(*I);
-    if (LiveUnits.available(AArch64::NZCV))
-      return {I, getPhysLiveRegs(LiveUnits)};
+    LiveRegs CurrentPhysLiveRegs = getPhysLiveRegs(LiveUnits);
+    // Find places where NZCV is available, but keep looking for locations where
+    // both NZCV and X0 are available, which can avoid some copies.
+    if (!(CurrentPhysLiveRegs & LiveRegs::NZCV))
+      BestCandidate = {I, CurrentPhysLiveRegs};
+    if (CurrentPhysLiveRegs == LiveRegs::None)
+      break;
   }
-  return {InsertPt, PhysLiveRegs};
+  return BestCandidate;
 }
 
 void MachineSMEABI::insertStateChanges(EmitContext &Context,
@@ -752,9 +808,9 @@ void MachineSMEABI::emitRestoreLazySave(EmitContext &Context,
   restorePhyRegSave(RegSave, MBB, MBBI, DL);
 }
 
-void MachineSMEABI::emitZAOff(MachineBasicBlock &MBB,
-                              MachineBasicBlock::iterator MBBI,
-                              bool ClearTPIDR2) {
+void MachineSMEABI::emitZAMode(MachineBasicBlock &MBB,
+                               MachineBasicBlock::iterator MBBI,
+                               bool ClearTPIDR2, bool On) {
   DebugLoc DL = getDebugLoc(MBB, MBBI);
 
   if (ClearTPIDR2)
@@ -765,7 +821,7 @@ void MachineSMEABI::emitZAOff(MachineBasicBlock &MBB,
   // Disable ZA.
   BuildMI(MBB, MBBI, DL, TII->get(AArch64::MSRpstatesvcrImm1))
       .addImm(AArch64SVCR::SVCRZA)
-      .addImm(0);
+      .addImm(On ? 1 : 0);
 }
 
 void MachineSMEABI::emitAllocateLazySaveBuffer(
@@ -891,6 +947,28 @@ void MachineSMEABI::emitFullZASaveRestore(EmitContext &Context,
   restorePhyRegSave(RegSave, MBB, MBBI, DL);
 }
 
+void MachineSMEABI::emitZT0SaveRestore(EmitContext &Context,
+                                       MachineBasicBlock &MBB,
+                                       MachineBasicBlock::iterator MBBI,
+                                       bool IsSave) {
+  DebugLoc DL = getDebugLoc(MBB, MBBI);
+  Register ZT0Save = MRI->createVirtualRegister(&AArch64::GPR64spRegClass);
+
+  BuildMI(MBB, MBBI, DL, TII->get(AArch64::ADDXri), ZT0Save)
+      .addFrameIndex(Context.getZT0SaveSlot(*MF))
+      .addImm(0)
+      .addImm(0);
+
+  if (IsSave) {
+    BuildMI(MBB, MBBI, DL, TII->get(AArch64::STR_TX))
+        .addReg(AArch64::ZT0)
+        .addReg(ZT0Save);
+  } else {
+    BuildMI(MBB, MBBI, DL, TII->get(AArch64::LDR_TX), AArch64::ZT0)
+        .addReg(ZT0Save);
+  }
+}
+
 void MachineSMEABI::emitAllocateFullZASaveBuffer(
     EmitContext &Context, MachineBasicBlock &MBB,
     MachineBasicBlock::iterator MBBI, LiveRegs PhysLiveRegs) {
@@ -935,6 +1013,17 @@ void MachineSMEABI::emitAllocateFullZASaveBuffer(
   restorePhyRegSave(RegSave, MBB, MBBI, DL);
 }
 
+struct FromState {
+  ZAState From;
+
+  constexpr uint8_t to(ZAState To) const {
+    static_assert(NUM_ZA_STATE < 16, "expected ZAState to fit in 4 bits");
+    return uint8_t(From) << 4 | uint8_t(To);
+  }
+};
+
+constexpr FromState transitionFrom(ZAState From) { return FromState{From}; }
+
 void MachineSMEABI::emitStateChange(EmitContext &Context,
                                     MachineBasicBlock &MBB,
                                     MachineBasicBlock::iterator InsertPt,
@@ -949,8 +1038,6 @@ void MachineSMEABI::emitStateChange(EmitContext &Context,
   if (From == ZAState::ENTRY && To == ZAState::OFF)
     return;
 
-  [[maybe_unused]] SMEAttrs SMEFnAttrs = AFI->getSMEFnAttrs();
-
   // TODO: Avoid setting up the save buffer if there's no transition to
   // LOCAL_SAVED.
   if (From == ZAState::ENTRY) {
@@ -966,17 +1053,67 @@ void MachineSMEABI::emitStateChange(EmitContext &Context,
     From = ZAState::ACTIVE;
   }
 
-  if (From == ZAState::ACTIVE && To == ZAState::LOCAL_SAVED)
-    emitZASave(Context, MBB, InsertPt, PhysLiveRegs);
-  else if (From == ZAState::LOCAL_SAVED && To == ZAState::ACTIVE)
-    emitZARestore(Context, MBB, InsertPt, PhysLiveRegs);
-  else if (To == ZAState::OFF) {
-    assert(From != ZAState::ENTRY &&
-           "ENTRY to OFF should have already been handled");
-    assert(!SMEFnAttrs.hasAgnosticZAInterface() &&
-           "Should not turn ZA off in agnostic ZA function");
-    emitZAOff(MBB, InsertPt, /*ClearTPIDR2=*/From == ZAState::LOCAL_SAVED);
-  } else {
+  SMEAttrs SMEFnAttrs = AFI->getSMEFnAttrs();
+  bool IsAgnosticZA = SMEFnAttrs.hasAgnosticZAInterface();
+  bool HasZT0State = SMEFnAttrs.hasZT0State();
+  bool HasZAState = IsAgnosticZA || SMEFnAttrs.hasZAState();
+
+  switch (transitionFrom(From).to(To)) {
+  // This section handles: ACTIVE <-> ACTIVE_ZT0_SAVED
+  case transitionFrom(ZAState::ACTIVE).to(ZAState::ACTIVE_ZT0_SAVED):
+    emitZT0SaveRestore(Context, MBB, InsertPt, /*IsSave=*/true);
+    break;
+  case transitionFrom(ZAState::ACTIVE_ZT0_SAVED).to(ZAState::ACTIVE):
+    emitZT0SaveRestore(Context, MBB, InsertPt, /*IsSave=*/false);
+    break;
+
+  // This section handles: ACTIVE[_ZT0_SAVED] -> LOCAL_SAVED
+  case transitionFrom(ZAState::ACTIVE).to(ZAState::LOCAL_SAVED):
+  case transitionFrom(ZAState::ACTIVE_ZT0_SAVED).to(ZAState::LOCAL_SAVED):
+    if (HasZT0State && From == ZAState::ACTIVE)
+      emitZT0SaveRestore(Context, MBB, InsertPt, /*IsSave=*/true);
+    if (HasZAState)
+      emitZASave(Context, MBB, InsertPt, PhysLiveRegs);
+    break;
+
+  // This section handles: ACTIVE -> LOCAL_COMMITTED
+  case transitionFrom(ZAState::ACTIVE).to(ZAState::LOCAL_COMMITTED):
+    // TODO: We could support ZA state here, but this transition is currently
+    // only possible when we _don't_ have ZA state.
+    assert(HasZT0State && !HasZAState && "Expect to only have ZT0 state.");
+    emitZT0SaveRestore(Context, MBB, InsertPt, /*IsSave=*/true);
+    emitZAMode(MBB, InsertPt, /*ClearTPIDR2=*/false, /*On=*/false);
+    break;
+
+  // This section handles: LOCAL_COMMITTED -> (OFF|LOCAL_SAVED)
+  case transitionFrom(ZAState::LOCAL_COMMITTED).to(ZAState::OFF):
+  case transitionFrom(ZAState::LOCAL_COMMITTED).to(ZAState::LOCAL_SAVED):
+    // These transitions are no-ops.
+    break;
+
+  // This section handles: LOCAL_(SAVED|COMMITTED) -> ACTIVE[_ZT0_SAVED]
+  case transitionFrom(ZAState::LOCAL_COMMITTED).to(ZAState::ACTIVE):
+  case transitionFrom(ZAState::LOCAL_COMMITTED).to(ZAState::ACTIVE_ZT0_SAVED):
+  case transitionFrom(ZAState::LOCAL_SAVED).to(ZAState::ACTIVE):
+    if (HasZAState)
+      emitZARestore(Context, MBB, InsertPt, PhysLiveRegs);
+    else
+      emitZAMode(MBB, InsertPt, /*ClearTPIDR2=*/false, /*On=*/true);
+    if (HasZT0State && To == ZAState::ACTIVE)
+      emitZT0SaveRestore(Context, MBB, InsertPt, /*IsSave=*/false);
+    break;
+
+  // This section handles transitions to OFF (not previously covered)
+  case transitionFrom(ZAState::ACTIVE).to(ZAState::OFF):
+  case transitionFrom(ZAState::ACTIVE_ZT0_SAVED).to(ZAState::OFF):
+  case transitionFrom(ZAState::LOCAL_SAVED).to(ZAState::OFF):
+    assert(SMEFnAttrs.hasPrivateZAInterface() &&
+           "Did not expect to turn ZA off in shared/agnostic ZA function");
+    emitZAMode(MBB, InsertPt, /*ClearTPIDR2=*/From == ZAState::LOCAL_SAVED,
+               /*On=*/false);
+    break;
+
+  default:
     dbgs() << "Error: Transition from " << getZAStateString(From) << " to "
            << getZAStateString(To) << '\n';
     llvm_unreachable("Unimplemented state transition");
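
The transitionFrom(From).to(To) helper added above only packs the two state enums into one byte so that every (From, To) pair becomes a distinct case label in emitStateChange. Below is a minimal standalone sketch of the same idiom; "State" is a simplified stand-in enum, not the real ZAState, and the transition bodies are just strings.

  // Standalone illustration of the nibble-packed transition dispatch used in
  // emitStateChange above. "State" is a simplified stand-in, not ZAState.
  #include <cassert>
  #include <cstdint>

  enum class State : uint8_t { Active, ActiveZT0Saved, LocalSaved, Off, NumStates };

  struct FromState {
    State From;
    constexpr uint8_t to(State To) const {
      static_assert(uint8_t(State::NumStates) < 16, "states must fit in 4 bits");
      return uint8_t(From) << 4 | uint8_t(To);
    }
  };

  constexpr FromState transitionFrom(State From) { return FromState{From}; }

  const char *describeTransition(State From, State To) {
    switch (transitionFrom(From).to(To)) {
    case transitionFrom(State::Active).to(State::ActiveZT0Saved):
      return "save ZT0 only";
    case transitionFrom(State::Active).to(State::LocalSaved):
      return "save ZA (and ZT0 if present)";
    case transitionFrom(State::LocalSaved).to(State::Off):
      return "clear TPIDR2 and turn ZA off";
    default:
      return "unhandled transition";
    }
  }

  int main() {
    assert(transitionFrom(State::Active).to(State::Active) == 0);
    return describeTransition(State::Active, State::LocalSaved) != nullptr ? 0 : 1;
  }

The static_assert mirrors the one in the patch: with four bits per state, the packed key stays unambiguous as long as the state enum has fewer than 16 enumerators.
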
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index 5dea64844e64e..ed8ae2b16c5d4 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -905,25 +905,25 @@ def FeatureCubeInsts : SubtargetFeature<"cube-insts",
   "HasCubeInsts",
   "true",
   "Has v_cube* instructions"
->; 
+>;
 
 def FeatureLerpInst : SubtargetFeature<"lerp-inst",
   "HasLerpInst",
   "true",
   "Has v_lerp_u8 instruction"
->; 
+>;
 
 def FeatureSadInsts : SubtargetFeature<"sad-insts",
   "HasSadInsts",
   "true",
   "Has v_sad* instructions"
->; 
+>;
 
 def FeatureQsadInsts : SubtargetFeature<"qsad-insts",
   "HasQsadInsts",
   "true",
   "Has v_qsad* instructions"
->; 
+>;
 
 def FeatureCvtNormInsts : SubtargetFeature<"cvt-norm-insts",
   "HasCvtNormInsts",
@@ -1568,8 +1568,8 @@ def FeatureVolcanicIslands : GCNSubtargetFeatureGeneration<"VOLCANIC_ISLANDS",
    FeatureGFX7GFX8GFX9Insts, FeatureSMemTimeInst, FeatureMadMacF32Insts,
    FeatureDsSrc2Insts, FeatureExtendedImageInsts, FeatureFastDenormalF32,
    FeatureUnalignedBufferAccess, FeatureImageInsts, FeatureGDS, FeatureGWS,
-   FeatureDefaultComponentZero, FeatureVmemWriteVgprInOrder, FeatureCubeInsts, 
-   FeatureLerpInst, FeatureSadInsts, FeatureQsadInsts, 
+   FeatureDefaultComponentZero, FeatureVmemWriteVgprInOrder, FeatureCubeInsts,
+   FeatureLerpInst, FeatureSadInsts, FeatureQsadInsts,
    FeatureCvtPkNormVOP2Insts
   ]
 >;
@@ -1590,8 +1590,8 @@ def FeatureGFX9 : GCNSubtargetFeatureGeneration<"GFX9",
    FeatureUnalignedBufferAccess, FeatureUnalignedScratchAccess,
    FeatureUnalignedDSAccess, FeatureNegativeScratchOffsetBug, FeatureGWS,
    FeatureDefaultComponentZero,FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad,
-   FeatureCubeInsts, FeatureLerpInst, FeatureSadInsts, FeatureQsadInsts, 
-   FeatureCvtNormInsts, FeatureCvtPkNormVOP2Insts, 
+   FeatureCubeInsts, FeatureLerpInst, FeatureSadInsts, FeatureQsadInsts,
+   FeatureCvtNormInsts, FeatureCvtPkNormVOP2Insts,
    FeatureCvtPkNormVOP3Insts
   ]
 >;
@@ -1616,9 +1616,9 @@ def FeatureGFX10 : GCNSubtargetFeatureGeneration<"GFX10",
    FeatureDefaultComponentZero, FeatureMaxHardClauseLength63,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF64GlobalInsts,
    FeatureAtomicFMinFMaxF32FlatInsts, FeatureAtomicFMinFMaxF64FlatInsts,
-   FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad, FeatureCubeInsts, 
-   FeatureLerpInst, FeatureSadInsts, FeatureQsadInsts, 
-   FeatureCvtNormInsts, FeatureCvtPkNormVOP2Insts, 
+   FeatureVmemWriteVgprInOrder, FeatureMemToLDSLoad, FeatureCubeInsts,
+   FeatureLerpInst, FeatureSadInsts, FeatureQsadInsts,
+   FeatureCvtNormInsts, FeatureCvtPkNormVOP2Insts,
    FeatureCvtPkNormVOP3Insts
   ]
 >;
@@ -1643,7 +1643,7 @@ def FeatureGFX11 : GCNSubtargetFeatureGeneration<"GFX11",
    FeatureDefaultComponentZero, FeatureMaxHardClauseLength32,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF32FlatInsts,
    FeatureVmemWriteVgprInOrder, FeatureCubeInsts, FeatureLerpInst,
-   FeatureSadInsts, FeatureQsadInsts, FeatureCvtNormInsts, 
+   FeatureSadInsts, FeatureQsadInsts, FeatureCvtNormInsts,
    FeatureCvtPkNormVOP2Insts, FeatureCvtPkNormVOP3Insts
   ]
 >;
@@ -2124,12 +2124,12 @@ def FeatureISAVersion12 : FeatureSet<
    FeatureBVHDualAndBVH8Insts,
    FeatureWaitsBeforeSystemScopeStores,
    FeatureD16Writes32BitVgpr,
-   FeatureCubeInsts, 
-   FeatureLerpInst, 
+   FeatureCubeInsts,
+   FeatureLerpInst,
    FeatureSadInsts,
-   FeatureQsadInsts, 
-   FeatureCvtNormInsts, 
-   FeatureCvtPkNormVOP2Insts, 
+   FeatureQsadInsts,
+   FeatureCvtNormInsts,
+   FeatureCvtPkNormVOP2Insts,
    FeatureCvtPkNormVOP3Insts
    ]>;
 
@@ -2137,7 +2137,6 @@ def FeatureISAVersion12_50_Common : FeatureSet<
   [FeatureGFX12,
    FeatureGFX1250Insts,
    FeatureRequiresAlignedVGPRs,
-   FeatureAddressableLocalMemorySize327680,
    FeatureCuMode,
    Feature1024AddressableVGPRs,
    Feature64BitLiterals,
@@ -2206,17 +2205,18 @@ def FeatureISAVersion12_50_Common : FeatureSet<
    FeatureXNACK,
    FeatureClusters,
    FeatureD16Writes32BitVgpr,
+   FeatureCubeInsts,
+   FeatureLerpInst,
+   FeatureSadInsts,
+   FeatureQsadInsts,
+   FeatureCvtNormInsts,
+   FeatureCvtPkNormVOP2Insts,
+   FeatureCvtPkNormVOP3Insts
 ]>;
 
 def FeatureISAVersion12_50 : FeatureSet<
   !listconcat(FeatureISAVersion12_50_Common.Features,
-  [FeatureCubeInsts, 
-   FeatureLerpInst, 
-   FeatureSadInsts, 
-   FeatureQsadInsts, 
-   FeatureCvtNormInsts, 
-   FeatureCvtPkNormVOP2Insts, 
-   FeatureCvtPkNormVOP3Insts])>;
+  [FeatureAddressableLocalMemorySize327680])>;
 
 def FeatureISAVersion12_51 : FeatureSet<
   !listconcat(FeatureISAVersion12_50.Features,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
index 7afaddea164f8..682f1aa1f46e1 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
@@ -21,6 +21,7 @@
 #include "llvm/CodeGen/FunctionLoweringInfo.h"
 #include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
+#include "llvm/CodeGen/PseudoSourceValueManager.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
 
 #define DEBUG_TYPE "amdgpu-call-lowering"
@@ -414,7 +415,8 @@ void AMDGPUCallLowering::lowerParameter(MachineIRBuilder &B, ArgInfo &OrigArg,
   MachineFunction &MF = B.getMF();
   const Function &F = MF.getFunction();
   const DataLayout &DL = F.getDataLayout();
-  MachinePointerInfo PtrInfo(AMDGPUAS::CONSTANT_ADDRESS);
+  const SITargetLowering &TLI = *getTLI<SITargetLowering>();
+  MachinePointerInfo PtrInfo = TLI.getKernargSegmentPtrInfo(MF);
 
   LLT PtrTy = LLT::pointer(AMDGPUAS::CONSTANT_ADDRESS, 64);
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 78a3ec7f0c266..8698e816ddbb9 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -4451,16 +4451,14 @@ bool AMDGPUDAGToDAGISel::isUniformLoad(const SDNode *N) const {
   const auto *Ld = cast<LoadSDNode>(N);
   const MachineMemOperand *MMO = Ld->getMemOperand();
 
-  if (Ld->isDivergent()) {
-    // FIXME: We ought to able able to take the direct isDivergent result. We
-    // cannot rely on the MMO for a uniformity check, and should stop using
-    // it. This is a hack for 2 ways that the IR divergence analysis is superior
-    // to the DAG divergence: Recognizing shift-of-workitem-id as always
-    // uniform, and isSingleLaneExecution. These should be handled in the DAG
-    // version, and then this can be dropped.
-    if (!MMO->getValue() || !AMDGPU::isUniformMMO(MMO))
-      return false;
-  }
+  // FIXME: We ought to be able to take the direct isDivergent result. We
+  // cannot rely on the MMO for a uniformity check, and should stop using
+  // it. This is a hack for 2 ways that the IR divergence analysis is superior
+  // to the DAG divergence: Recognizing shift-of-workitem-id as always
+  // uniform, and isSingleLaneExecution. These should be handled in the DAG
+  // version, and then this can be dropped.
+  if (Ld->isDivergent() && !AMDGPU::isUniformMMO(MMO))
+    return false;
 
   return MMO->getSize().hasValue() &&
          Ld->getAlign() >=
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.cpp
index 7caafa16f9043..2b1f4048947bf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstrInfo.cpp
@@ -28,13 +28,20 @@ Intrinsic::ID AMDGPU::getIntrinsicID(const MachineInstr &I) {
 
 // TODO: Should largely merge with AMDGPUTTIImpl::isSourceOfDivergence.
 bool AMDGPU::isUniformMMO(const MachineMemOperand *MMO) {
-  // FIXME: null value is should be treated as unknown, not as uniform.
   const Value *Ptr = MMO->getValue();
+  if (!Ptr) {
+    if (const PseudoSourceValue *PSV = MMO->getPseudoValue()) {
+      return PSV->isConstantPool() || PSV->isStack() || PSV->isGOT() ||
+             PSV->isJumpTable();
+    }
+
+    // Unknown value.
+    return false;
+  }
+
   // UndefValue means this is a load of a kernel input.  These are uniform.
   // Sometimes LDS instructions have constant pointers.
-  // If Ptr is null, then that means this mem operand contains a
-  // PseudoSourceValue like GOT.
-  if (!Ptr || isa<UndefValue, Constant, GlobalValue>(Ptr))
+  if (isa<UndefValue, Constant, GlobalValue>(Ptr))
     return true;
 
   if (MMO->getAddrSpace() == AMDGPUAS::CONSTANT_ADDRESS_32BIT)
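
The net effect of the isUniformMMO change is a flipped default: a memory operand with no IR value is no longer presumed uniform, and only counts as uniform when its pseudo source value is one of the known-uniform kinds (constant pool, stack, GOT, jump table). The standalone sketch below mirrors that decision order; MemOperand and PseudoKind are simplified stand-ins, not the MachineMemOperand/PseudoSourceValue API.

  #include <cassert>
  #include <optional>

  // Simplified stand-ins for MachineMemOperand and PseudoSourceValue.
  enum class PseudoKind { ConstantPool, Stack, GOT, JumpTable, Other };

  struct MemOperand {
    bool HasIRValue = false;            // models MMO->getValue() != nullptr
    bool IRValueIsConstantLike = false; // models isa<UndefValue, Constant, GlobalValue>
    std::optional<PseudoKind> Pseudo;   // models MMO->getPseudoValue()
  };

  // Mirrors the new decision order: with no IR value, consult the pseudo
  // source value and default to "not uniform" when it is unknown.
  bool isUniform(const MemOperand &MMO) {
    if (!MMO.HasIRValue) {
      if (MMO.Pseudo)
        return *MMO.Pseudo == PseudoKind::ConstantPool ||
               *MMO.Pseudo == PseudoKind::Stack ||
               *MMO.Pseudo == PseudoKind::GOT ||
               *MMO.Pseudo == PseudoKind::JumpTable;
      return false; // Unknown value: no longer assumed uniform.
    }
    return MMO.IRValueIsConstantLike; // The real check continues past this point.
  }

  int main() {
    assert(!isUniform({}));                                      // unknown -> divergent
    assert(isUniform({false, false, PseudoKind::ConstantPool})); // kernarg-style PSV
    assert(isUniform({true, true, std::nullopt}));               // constant IR value
    return 0;
  }

This is what makes the kernarg changes further down work: those loads now carry a constant-pool pseudo source value instead of a bare address-space-only MachinePointerInfo.
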
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index 8b4396cd63e9a..ae62dbe1cc706 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -30,6 +30,7 @@
 #include "llvm/CodeGen/GlobalISel/MIPatternMatch.h"
 #include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
 #include "llvm/CodeGen/GlobalISel/Utils.h"
+#include "llvm/CodeGen/PseudoSourceValueManager.h"
 #include "llvm/CodeGen/TargetOpcodes.h"
 #include "llvm/IR/DiagnosticInfo.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
@@ -2321,14 +2322,14 @@ Register AMDGPULegalizerInfo::getSegmentAperture(
     return B.buildUnmerge(S32, Dst).getReg(1);
   }
 
-  // TODO: can we be smarter about machine pointer info?
-  MachinePointerInfo PtrInfo(AMDGPUAS::CONSTANT_ADDRESS);
   Register LoadAddr = MRI.createGenericVirtualRegister(
     LLT::pointer(AMDGPUAS::CONSTANT_ADDRESS, 64));
   // For code object version 5, private_base and shared_base are passed through
   // implicit kernargs.
   if (AMDGPU::getAMDHSACodeObjectVersion(*MF.getFunction().getParent()) >=
       AMDGPU::AMDHSA_COV5) {
+    MachinePointerInfo PtrInfo = getKernargSegmentPtrInfo(B.getMF());
+
     AMDGPUTargetLowering::ImplicitParameter Param =
         AS == AMDGPUAS::LOCAL_ADDRESS ? AMDGPUTargetLowering::SHARED_BASE
                                       : AMDGPUTargetLowering::PRIVATE_BASE;
@@ -2343,7 +2344,7 @@ Register AMDGPULegalizerInfo::getSegmentAperture(
       return Register();
 
     MachineMemOperand *MMO = MF.getMachineMemOperand(
-        PtrInfo,
+        PtrInfo.getWithOffset(Offset),
         MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
             MachineMemOperand::MOInvariant,
         LLT::scalar(32), commonAlignment(Align(64), Offset));
@@ -2361,6 +2362,9 @@ Register AMDGPULegalizerInfo::getSegmentAperture(
   if (!loadInputValue(QueuePtr, B, AMDGPUFunctionArgInfo::QUEUE_PTR))
     return Register();
 
+  // TODO: Use custom PseudoSourceValue
+  MachinePointerInfo PtrInfo(AMDGPUAS::CONSTANT_ADDRESS);
+
   // Offset into amd_queue_t for group_segment_aperture_base_hi /
   // private_segment_aperture_base_hi.
   uint32_t StructOffset = (AS == AMDGPUAS::LOCAL_ADDRESS) ? 0x40 : 0x44;
@@ -2560,8 +2564,14 @@ bool AMDGPULegalizerInfo::legalizeAddrSpaceCast(
     const SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();
     uint32_t AddrHiVal = Info->get32BitAddressHighBits();
     auto PtrLo = B.buildPtrToInt(S32, Src);
-    auto HighAddr = B.buildConstant(S32, AddrHiVal);
-    B.buildMergeLikeInstr(Dst, {PtrLo, HighAddr});
+    if (AddrHiVal == 0) {
+      auto Zext = B.buildZExt(LLT::scalar(64), PtrLo);
+      B.buildIntToPtr(Dst, Zext);
+    } else {
+      auto HighAddr = B.buildConstant(S32, AddrHiVal);
+      B.buildMergeLikeInstr(Dst, {PtrLo, HighAddr});
+    }
+
     MI.eraseFromParent();
     return true;
   }
@@ -4709,6 +4719,14 @@ bool AMDGPULegalizerInfo::legalizeWorkitemIDIntrinsic(
   return true;
 }
 
+MachinePointerInfo
+AMDGPULegalizerInfo::getKernargSegmentPtrInfo(MachineFunction &MF) const {
+  // This isn't really a constant pool but close enough.
+  MachinePointerInfo PtrInfo(MF.getPSVManager().getConstantPool());
+  PtrInfo.AddrSpace = AMDGPUAS::CONSTANT_ADDRESS;
+  return PtrInfo;
+}
+
 Register AMDGPULegalizerInfo::getKernargParameterPtr(MachineIRBuilder &B,
                                                      int64_t Offset) const {
   LLT PtrTy = LLT::pointer(AMDGPUAS::CONSTANT_ADDRESS, 64);
@@ -4736,8 +4754,8 @@ bool AMDGPULegalizerInfo::legalizeKernargMemParameter(MachineInstr &MI,
          "unexpected kernarg parameter type");
 
   Register Ptr = getKernargParameterPtr(B, Offset);
-  MachinePointerInfo PtrInfo(AMDGPUAS::CONSTANT_ADDRESS);
-  B.buildLoad(DstReg, Ptr, PtrInfo, Align(4),
+  MachinePointerInfo PtrInfo = getKernargSegmentPtrInfo(B.getMF());
+  B.buildLoad(DstReg, Ptr, PtrInfo.getWithOffset(Offset), Align(4),
               MachineMemOperand::MODereferenceable |
                   MachineMemOperand::MOInvariant);
   MI.eraseFromParent();
@@ -7260,9 +7278,9 @@ bool AMDGPULegalizerInfo::legalizeTrapHsaQueuePtr(
       return false;
 
     // TODO: can we be smarter about machine pointer info?
-    MachinePointerInfo PtrInfo(AMDGPUAS::CONSTANT_ADDRESS);
+    MachinePointerInfo PtrInfo = getKernargSegmentPtrInfo(MF);
     MachineMemOperand *MMO = MF.getMachineMemOperand(
-        PtrInfo,
+        PtrInfo.getWithOffset(Offset),
         MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
             MachineMemOperand::MOInvariant,
         LLT::scalar(64), commonAlignment(Align(64), Offset));
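
The recurring edit in this file is a two-step pattern: build the kernarg pointer info once from the constant-pool pseudo source value via getKernargSegmentPtrInfo, then derive a per-access info with getWithOffset so each MMO records where in the segment it reads. A rough sketch of a call site follows, assuming the surrounding AMDGPULegalizerInfo member-function context from the hunks above; Offset and B (the MachineIRBuilder) are taken from the caller, and error handling is elided.

  // Sketch of the per-load pattern, reusing only calls that appear in the
  // hunks above; not a complete function.
  MachineFunction &MF = B.getMF();
  MachinePointerInfo PtrInfo = getKernargSegmentPtrInfo(MF);
  MachineMemOperand *MMO = MF.getMachineMemOperand(
      PtrInfo.getWithOffset(Offset),
      MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
          MachineMemOperand::MOInvariant,
      LLT::scalar(64), commonAlignment(Align(64), Offset));

Carrying the offset (rather than a bare MachinePointerInfo(AMDGPUAS::CONSTANT_ADDRESS)) keeps the MMOs distinguishable per argument, and the pseudo source value is what lets AMDGPU::isUniformMMO classify these loads as uniform even though they have no IR value.
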
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h
index cd44a9ba0807c..31db548d2af88 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h
@@ -132,6 +132,7 @@ class AMDGPULegalizerInfo final : public LegalizerInfo {
       MachineInstr &MI, MachineRegisterInfo &MRI, MachineIRBuilder &B,
       unsigned Dim, AMDGPUFunctionArgInfo::PreloadedValue ArgType) const;
 
+  MachinePointerInfo getKernargSegmentPtrInfo(MachineFunction &MF) const;
   Register getKernargParameterPtr(MachineIRBuilder &B, int64_t Offset) const;
   bool legalizeKernargMemParameter(MachineInstr &MI, MachineIRBuilder &B,
                                    uint64_t Offset,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp b/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
index dee3dff3bf575..bf9b4297bd435 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
@@ -229,7 +229,8 @@ void AMDGPUMCInstLower::lower(const MachineInstr *MI, MCInst &OutMI) const {
     OutMI.addOperand(Src);
     return;
   } else if (Opcode == AMDGPU::SI_TCRETURN ||
-             Opcode == AMDGPU::SI_TCRETURN_GFX) {
+             Opcode == AMDGPU::SI_TCRETURN_GFX ||
+             Opcode == AMDGPU::SI_TCRETURN_CHAIN) {
     // TODO: How to use branch immediate and avoid register+add?
     Opcode = AMDGPU::S_SETPC_B64;
   } else if (AMDGPU::getT16D16Helper(Opcode)) {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalize.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalize.cpp
index 396d64625fb5c..839120da89711 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalize.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalize.cpp
@@ -435,7 +435,8 @@ bool AMDGPURegBankLegalize::runOnMachineFunction(MachineFunction &MF) {
     unsigned Opc = MI->getOpcode();
     // Insert point for use operands needs some calculation.
     if (Opc == AMDGPU::G_PHI) {
-      RBLHelper.applyMappingPHI(*MI);
+      if (!RBLHelper.applyMappingPHI(*MI))
+        return false;
       continue;
     }
 
@@ -466,7 +467,8 @@ bool AMDGPURegBankLegalize::runOnMachineFunction(MachineFunction &MF) {
       // S1 rules are in RegBankLegalizeRules.
     }
 
-    RBLHelper.findRuleAndApplyMapping(*MI);
+    if (!RBLHelper.findRuleAndApplyMapping(*MI))
+      return false;
   }
 
   // Sgpr S1 clean up combines:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeHelper.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeHelper.cpp
index 123fc5bf37a19..cc31d7d5c55ac 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeHelper.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeHelper.cpp
@@ -32,28 +32,48 @@ using namespace AMDGPU;
 RegBankLegalizeHelper::RegBankLegalizeHelper(
     MachineIRBuilder &B, const MachineUniformityInfo &MUI,
     const RegisterBankInfo &RBI, const RegBankLegalizeRules &RBLRules)
-    : ST(B.getMF().getSubtarget<GCNSubtarget>()), B(B), MRI(*B.getMRI()),
-      MUI(MUI), RBI(RBI), RBLRules(RBLRules), IsWave32(ST.isWave32()),
+    : MF(B.getMF()), ST(MF.getSubtarget<GCNSubtarget>()), B(B),
+      MRI(*B.getMRI()), MUI(MUI), RBI(RBI), MORE(MF, nullptr),
+      RBLRules(RBLRules), IsWave32(ST.isWave32()),
       SgprRB(&RBI.getRegBank(AMDGPU::SGPRRegBankID)),
       VgprRB(&RBI.getRegBank(AMDGPU::VGPRRegBankID)),
       VccRB(&RBI.getRegBank(AMDGPU::VCCRegBankID)) {}
 
-void RegBankLegalizeHelper::findRuleAndApplyMapping(MachineInstr &MI) {
-  const SetOfRulesForOpcode &RuleSet = RBLRules.getRulesForOpc(MI);
-  const RegBankLLTMapping &Mapping = RuleSet.findMappingForMI(MI, MRI, MUI);
+bool RegBankLegalizeHelper::findRuleAndApplyMapping(MachineInstr &MI) {
+  const SetOfRulesForOpcode *RuleSet = RBLRules.getRulesForOpc(MI);
+  if (!RuleSet) {
+    reportGISelFailure(MF, MORE, "amdgpu-regbanklegalize",
+                       "No AMDGPU RegBankLegalize rules defined for opcode",
+                       MI);
+    return false;
+  }
+
+  const RegBankLLTMapping *Mapping = RuleSet->findMappingForMI(MI, MRI, MUI);
+  if (!Mapping) {
+    reportGISelFailure(MF, MORE, "amdgpu-regbanklegalize",
+                       "AMDGPU RegBankLegalize: none of the rules defined with "
+                       "'Any' for MI's opcode matched MI",
+                       MI);
+    return false;
+  }
 
   SmallSet<Register, 4> WaterfallSgprs;
   unsigned OpIdx = 0;
-  if (Mapping.DstOpMapping.size() > 0) {
+  if (Mapping->DstOpMapping.size() > 0) {
     B.setInsertPt(*MI.getParent(), std::next(MI.getIterator()));
-    applyMappingDst(MI, OpIdx, Mapping.DstOpMapping);
+    if (!applyMappingDst(MI, OpIdx, Mapping->DstOpMapping))
+      return false;
   }
-  if (Mapping.SrcOpMapping.size() > 0) {
+  if (Mapping->SrcOpMapping.size() > 0) {
     B.setInstr(MI);
-    applyMappingSrc(MI, OpIdx, Mapping.SrcOpMapping, WaterfallSgprs);
+    if (!applyMappingSrc(MI, OpIdx, Mapping->SrcOpMapping, WaterfallSgprs))
+      return false;
   }
 
-  lower(MI, Mapping, WaterfallSgprs);
+  if (!lower(MI, *Mapping, WaterfallSgprs))
+    return false;
+
+  return true;
 }
 
 bool RegBankLegalizeHelper::executeInWaterfallLoop(
@@ -274,7 +294,7 @@ bool RegBankLegalizeHelper::executeInWaterfallLoop(
   return true;
 }
 
-void RegBankLegalizeHelper::splitLoad(MachineInstr &MI,
+bool RegBankLegalizeHelper::splitLoad(MachineInstr &MI,
                                       ArrayRef<LLT> LLTBreakdown, LLT MergeTy) {
   MachineFunction &MF = B.getMF();
   assert(MI.getNumMemOperands() == 1);
@@ -322,9 +342,10 @@ void RegBankLegalizeHelper::splitLoad(MachineInstr &MI,
     B.buildMergeLikeInstr(Dst, MergeTyParts);
   }
   MI.eraseFromParent();
+  return true;
 }
 
-void RegBankLegalizeHelper::widenLoad(MachineInstr &MI, LLT WideTy,
+bool RegBankLegalizeHelper::widenLoad(MachineInstr &MI, LLT WideTy,
                                       LLT MergeTy) {
   MachineFunction &MF = B.getMF();
   assert(MI.getNumMemOperands() == 1);
@@ -350,9 +371,10 @@ void RegBankLegalizeHelper::widenLoad(MachineInstr &MI, LLT WideTy,
     B.buildMergeLikeInstr(Dst, MergeTyParts);
   }
   MI.eraseFromParent();
+  return true;
 }
 
-void RegBankLegalizeHelper::widenMMOToS32(GAnyLoad &MI) const {
+bool RegBankLegalizeHelper::widenMMOToS32(GAnyLoad &MI) const {
   Register Dst = MI.getDstReg();
   Register Ptr = MI.getPointerReg();
   MachineMemOperand &MMO = MI.getMMO();
@@ -376,9 +398,10 @@ void RegBankLegalizeHelper::widenMMOToS32(GAnyLoad &MI) const {
   }
 
   MI.eraseFromParent();
+  return true;
 }
 
-void RegBankLegalizeHelper::lowerVccExtToSel(MachineInstr &MI) {
+bool RegBankLegalizeHelper::lowerVccExtToSel(MachineInstr &MI) {
   Register Dst = MI.getOperand(0).getReg();
   LLT Ty = MRI.getType(Dst);
   Register Src = MI.getOperand(1).getReg();
@@ -404,15 +427,22 @@ void RegBankLegalizeHelper::lowerVccExtToSel(MachineInstr &MI) {
       Hi = B.buildUndef({VgprRB_S32});
       break;
     default:
-      llvm_unreachable("Opcode not supported");
+      reportGISelFailure(
+          MF, MORE, "amdgpu-regbanklegalize",
+          "AMDGPU RegBankLegalize: lowerVccExtToSel, Opcode not supported", MI);
+      return false;
     }
 
     B.buildMergeValues(Dst, {Lo.getReg(0), Hi.getReg(0)});
   } else {
-    llvm_unreachable("Type not supported");
+    reportGISelFailure(
+        MF, MORE, "amdgpu-regbanklegalize",
+        "AMDGPU RegBankLegalize: lowerVccExtToSel, Type not supported", MI);
+    return false;
   }
 
   MI.eraseFromParent();
+  return true;
 }
 
 std::pair<Register, Register> RegBankLegalizeHelper::unpackZExt(Register Reg) {
@@ -444,7 +474,7 @@ RegBankLegalizeHelper::unpackAExtTruncS16(Register Reg) {
           B.buildTrunc(SgprRB_S16, Hi32).getReg(0)};
 }
 
-void RegBankLegalizeHelper::lowerUnpackBitShift(MachineInstr &MI) {
+bool RegBankLegalizeHelper::lowerUnpackBitShift(MachineInstr &MI) {
   Register Lo, Hi;
   switch (MI.getOpcode()) {
   case AMDGPU::G_SHL: {
@@ -469,13 +499,18 @@ void RegBankLegalizeHelper::lowerUnpackBitShift(MachineInstr &MI) {
     break;
   }
   default:
-    llvm_unreachable("Unpack lowering not implemented");
+    reportGISelFailure(
+        MF, MORE, "amdgpu-regbanklegalize",
+        "AMDGPU RegBankLegalize: lowerUnpackBitShift, case not implemented",
+        MI);
+    return false;
   }
   B.buildBuildVectorTrunc(MI.getOperand(0).getReg(), {Lo, Hi});
   MI.eraseFromParent();
+  return true;
 }
 
-void RegBankLegalizeHelper::lowerUnpackMinMax(MachineInstr &MI) {
+bool RegBankLegalizeHelper::lowerUnpackMinMax(MachineInstr &MI) {
   Register Lo, Hi;
   switch (MI.getOpcode()) {
   case AMDGPU::G_SMIN:
@@ -501,13 +536,17 @@ void RegBankLegalizeHelper::lowerUnpackMinMax(MachineInstr &MI) {
     break;
   }
   default:
-    llvm_unreachable("Unpack min/max lowering not implemented");
+    reportGISelFailure(
+        MF, MORE, "amdgpu-regbanklegalize",
+        "AMDGPU RegBankLegalize: lowerUnpackMinMax, case not implemented", MI);
+    return false;
   }
   B.buildBuildVectorTrunc(MI.getOperand(0).getReg(), {Lo, Hi});
   MI.eraseFromParent();
+  return true;
 }
 
-void RegBankLegalizeHelper::lowerUnpackAExt(MachineInstr &MI) {
+bool RegBankLegalizeHelper::lowerUnpackAExt(MachineInstr &MI) {
   auto [Op1Lo, Op1Hi] = unpackAExt(MI.getOperand(1).getReg());
   auto [Op2Lo, Op2Hi] = unpackAExt(MI.getOperand(2).getReg());
   auto ResLo = B.buildInstr(MI.getOpcode(), {SgprRB_S32}, {Op1Lo, Op2Lo});
@@ -515,6 +554,7 @@ void RegBankLegalizeHelper::lowerUnpackAExt(MachineInstr &MI) {
   B.buildBuildVectorTrunc(MI.getOperand(0).getReg(),
                           {ResLo.getReg(0), ResHi.getReg(0)});
   MI.eraseFromParent();
+  return true;
 }
 
 static bool isSignedBFE(MachineInstr &MI) {
@@ -524,7 +564,7 @@ static bool isSignedBFE(MachineInstr &MI) {
   return MI.getOpcode() == AMDGPU::G_SBFX;
 }
 
-void RegBankLegalizeHelper::lowerV_BFE(MachineInstr &MI) {
+bool RegBankLegalizeHelper::lowerV_BFE(MachineInstr &MI) {
   Register Dst = MI.getOperand(0).getReg();
   assert(MRI.getType(Dst) == LLT::scalar(64));
   bool Signed = isSignedBFE(MI);
@@ -551,7 +591,7 @@ void RegBankLegalizeHelper::lowerV_BFE(MachineInstr &MI) {
     auto SignBit = B.buildShl({VgprRB, S64}, SHRSrc, Amt);
     B.buildInstr(SHROpc, {Dst}, {SignBit, Amt});
     MI.eraseFromParent();
-    return;
+    return true;
   }
 
   uint64_t WidthImm = ConstWidth->Value.getZExtValue();
@@ -581,9 +621,10 @@ void RegBankLegalizeHelper::lowerV_BFE(MachineInstr &MI) {
   }
 
   MI.eraseFromParent();
+  return true;
 }
 
-void RegBankLegalizeHelper::lowerS_BFE(MachineInstr &MI) {
+bool RegBankLegalizeHelper::lowerS_BFE(MachineInstr &MI) {
   Register DstReg = MI.getOperand(0).getReg();
   LLT Ty = MRI.getType(DstReg);
   bool Signed = isSignedBFE(MI);
@@ -609,14 +650,19 @@ void RegBankLegalizeHelper::lowerS_BFE(MachineInstr &MI) {
   auto S_BFE = B.buildInstr(Opc, {{SgprRB, Ty}},
                             {B.buildCopy(Ty, Src), B.buildCopy(S32, Src1)});
   if (!constrainSelectedInstRegOperands(*S_BFE, *ST.getInstrInfo(),
-                                        *ST.getRegisterInfo(), RBI))
-    llvm_unreachable("failed to constrain BFE");
+                                        *ST.getRegisterInfo(), RBI)) {
+    reportGISelFailure(
+        MF, MORE, "amdgpu-regbanklegalize",
+        "AMDGPU RegBankLegalize: lowerS_BFE, failed to constrain BFE", MI);
+    return false;
+  }
 
   B.buildCopy(DstReg, S_BFE->getOperand(0).getReg());
   MI.eraseFromParent();
+  return true;
 }
 
-void RegBankLegalizeHelper::lowerSplitTo32(MachineInstr &MI) {
+bool RegBankLegalizeHelper::lowerSplitTo32(MachineInstr &MI) {
   Register Dst = MI.getOperand(0).getReg();
   LLT DstTy = MRI.getType(Dst);
   assert(DstTy == V4S16 || DstTy == V2S32 || DstTy == S64);
@@ -631,9 +677,10 @@ void RegBankLegalizeHelper::lowerSplitTo32(MachineInstr &MI) {
       B.buildInstr(Opc, {{VgprRB, Ty}}, {Op1.getReg(1), Op2.getReg(1)}, Flags);
   B.buildMergeLikeInstr(Dst, {Lo, Hi});
   MI.eraseFromParent();
+  return true;
 }
 
-void RegBankLegalizeHelper::lowerSplitTo16(MachineInstr &MI) {
+bool RegBankLegalizeHelper::lowerSplitTo16(MachineInstr &MI) {
   Register Dst = MI.getOperand(0).getReg();
   assert(MRI.getType(Dst) == V2S16);
   unsigned Opc = MI.getOpcode();
@@ -645,7 +692,7 @@ void RegBankLegalizeHelper::lowerSplitTo16(MachineInstr &MI) {
     auto Hi = B.buildInstr(Opc, {SgprRB_S16}, {Op1Hi}, Flags);
     B.buildMergeLikeInstr(Dst, {Lo, Hi});
     MI.eraseFromParent();
-    return;
+    return true;
   }
 
   assert(MI.getNumOperands() == 3);
@@ -655,9 +702,10 @@ void RegBankLegalizeHelper::lowerSplitTo16(MachineInstr &MI) {
   auto Hi = B.buildInstr(Opc, {SgprRB_S16}, {Op1Hi, Op2Hi}, Flags);
   B.buildMergeLikeInstr(Dst, {Lo, Hi});
   MI.eraseFromParent();
+  return true;
 }
 
-void RegBankLegalizeHelper::lowerSplitTo32Select(MachineInstr &MI) {
+bool RegBankLegalizeHelper::lowerSplitTo32Select(MachineInstr &MI) {
   Register Dst = MI.getOperand(0).getReg();
   LLT DstTy = MRI.getType(Dst);
   assert(DstTy == V4S16 || DstTy == V2S32 || DstTy == S64 ||
@@ -674,9 +722,10 @@ void RegBankLegalizeHelper::lowerSplitTo32Select(MachineInstr &MI) {
 
   B.buildMergeLikeInstr(Dst, {Lo, Hi});
   MI.eraseFromParent();
+  return true;
 }
 
-void RegBankLegalizeHelper::lowerSplitTo32SExtInReg(MachineInstr &MI) {
+bool RegBankLegalizeHelper::lowerSplitTo32SExtInReg(MachineInstr &MI) {
   auto Op1 = B.buildUnmerge(VgprRB_S32, MI.getOperand(1).getReg());
   int Amt = MI.getOperand(2).getImm();
   Register Lo, Hi;
@@ -701,9 +750,10 @@ void RegBankLegalizeHelper::lowerSplitTo32SExtInReg(MachineInstr &MI) {
 
   B.buildMergeLikeInstr(MI.getOperand(0).getReg(), {Lo, Hi});
   MI.eraseFromParent();
+  return true;
 }
 
-void RegBankLegalizeHelper::lower(MachineInstr &MI,
+bool RegBankLegalizeHelper::lower(MachineInstr &MI,
                                   const RegBankLLTMapping &Mapping,
                                   SmallSet<Register, 4> &WaterfallSgprs) {
 
@@ -723,7 +773,7 @@ void RegBankLegalizeHelper::lower(MachineInstr &MI,
     B.buildSelect(MI.getOperand(0).getReg(), MI.getOperand(1).getReg(), True,
                   False);
     MI.eraseFromParent();
-    return;
+    return true;
   }
   case UnpackBitShift:
     return lowerUnpackBitShift(MI);
@@ -750,20 +800,23 @@ void RegBankLegalizeHelper::lower(MachineInstr &MI,
       break;
     }
     default:
-      llvm_unreachable("Unsuported Opcode in Ext32To64");
+      reportGISelFailure(MF, MORE, "amdgpu-regbanklegalize",
+                         "AMDGPU RegBankLegalize: Ext32To64, unsupported opcode",
+                         MI);
+      return false;
     }
 
     B.buildMergeLikeInstr(MI.getOperand(0).getReg(),
                           {MI.getOperand(1).getReg(), Hi});
     MI.eraseFromParent();
-    return;
+    return true;
   }
   case UniCstExt: {
     uint64_t ConstVal = MI.getOperand(1).getCImm()->getZExtValue();
     B.buildConstant(MI.getOperand(0).getReg(), ConstVal);
 
     MI.eraseFromParent();
-    return;
+    return true;
   }
   case VgprToVccCopy: {
     Register Src = MI.getOperand(1).getReg();
@@ -787,7 +840,7 @@ void RegBankLegalizeHelper::lower(MachineInstr &MI,
     auto Zero = B.buildConstant({VgprRB, Ty}, 0);
     B.buildICmp(CmpInst::ICMP_NE, MI.getOperand(0).getReg(), BoolSrc, Zero);
     MI.eraseFromParent();
-    return;
+    return true;
   }
   case V_BFE:
     return lowerV_BFE(MI);
@@ -816,8 +869,10 @@ void RegBankLegalizeHelper::lower(MachineInstr &MI,
       else if (Size / 128 == 4)
         splitLoad(MI, {B128, B128, B128, B128});
       else {
-        LLVM_DEBUG(dbgs() << "MI: "; MI.dump(););
-        llvm_unreachable("SplitLoad type not supported for MI");
+        reportGISelFailure(MF, MORE, "amdgpu-regbanklegalize",
+                           "AMDGPU RegBankLegalize: SplitLoad, unsupported type",
+                           MI);
+        return false;
       }
     }
     // 64 and 32 bit load
@@ -828,10 +883,12 @@ void RegBankLegalizeHelper::lower(MachineInstr &MI,
     else if (DstTy == V6S16)
       splitLoad(MI, {V4S16, V2S16}, V2S16);
     else {
-      LLVM_DEBUG(dbgs() << "MI: "; MI.dump(););
-      llvm_unreachable("SplitLoad type not supported for MI");
+      reportGISelFailure(MF, MORE, "amdgpu-regbanklegalize",
+                         "AMDGPU RegBankLegalize: SplitLoad, unsupported type",
+                         MI);
+      return false;
     }
-    break;
+    return true;
   }
   case WidenLoad: {
     LLT DstTy = MRI.getType(MI.getOperand(0).getReg());
@@ -842,10 +899,12 @@ void RegBankLegalizeHelper::lower(MachineInstr &MI,
     else if (DstTy == V6S16)
       widenLoad(MI, V8S16, V2S16);
     else {
-      LLVM_DEBUG(dbgs() << "MI: "; MI.dump(););
-      llvm_unreachable("WidenLoad type not supported for MI");
+      reportGISelFailure(MF, MORE, "amdgpu-regbanklegalize",
+                         "AMDGPU RegBankLegalize: WidenLoad, unsupported type",
+                         MI);
+      return false;
     }
-    break;
+    return true;
   }
   case UnpackAExt:
     return lowerUnpackAExt(MI);
@@ -855,8 +914,10 @@ void RegBankLegalizeHelper::lower(MachineInstr &MI,
 
   if (!WaterfallSgprs.empty()) {
     MachineBasicBlock::iterator I = MI.getIterator();
-    executeInWaterfallLoop(B, make_range(I, std::next(I)), WaterfallSgprs);
+    if (!executeInWaterfallLoop(B, make_range(I, std::next(I)), WaterfallSgprs))
+      return false;
   }
+  return true;
 }
 
 LLT RegBankLegalizeHelper::getTyFromID(RegBankLLTMappingApplyID ID) {
@@ -1055,7 +1116,7 @@ RegBankLegalizeHelper::getRegBankFromID(RegBankLLTMappingApplyID ID) {
   }
 }
 
-void RegBankLegalizeHelper::applyMappingDst(
+bool RegBankLegalizeHelper::applyMappingDst(
     MachineInstr &MI, unsigned &OpIdx,
     const SmallVectorImpl<RegBankLLTMappingApplyID> &MethodIDs) {
   // Defs start from operand 0
@@ -1180,16 +1241,23 @@ void RegBankLegalizeHelper::applyMappingDst(
       break;
     }
     case InvalidMapping: {
-      LLVM_DEBUG(dbgs() << "Instruction with Invalid mapping: "; MI.dump(););
-      llvm_unreachable("missing fast rule for MI");
+      reportGISelFailure(
+          MF, MORE, "amdgpu-regbanklegalize",
+          "AMDGPU RegBankLegalize: missing fast rule ('Div' or 'Uni') for", MI);
+      return false;
     }
     default:
-      llvm_unreachable("ID not supported");
+      reportGISelFailure(
+          MF, MORE, "amdgpu-regbanklegalize",
+          "AMDGPU RegBankLegalize: applyMappingDst, ID not supported", MI);
+      return false;
     }
   }
+
+  return true;
 }
 
-void RegBankLegalizeHelper::applyMappingSrc(
+bool RegBankLegalizeHelper::applyMappingSrc(
     MachineInstr &MI, unsigned &OpIdx,
     const SmallVectorImpl<RegBankLLTMappingApplyID> &MethodIDs,
     SmallSet<Register, 4> &SgprWaterfallOperandRegs) {
@@ -1343,12 +1411,16 @@ void RegBankLegalizeHelper::applyMappingSrc(
       break;
     }
     default:
-      llvm_unreachable("ID not supported");
+      reportGISelFailure(
+          MF, MORE, "amdgpu-regbanklegalize",
+          "AMDGPU RegBankLegalize: applyMappingSrc, ID not supported", MI);
+      return false;
     }
   }
+  return true;
 }
 
-void RegBankLegalizeHelper::applyMappingPHI(MachineInstr &MI) {
+bool RegBankLegalizeHelper::applyMappingPHI(MachineInstr &MI) {
   Register Dst = MI.getOperand(0).getReg();
   LLT Ty = MRI.getType(Dst);
 
@@ -1371,16 +1443,17 @@ void RegBankLegalizeHelper::applyMappingPHI(MachineInstr &MI) {
       MI.getOperand(i).setReg(NewUse.getReg(0));
     }
 
-    return;
+    return true;
   }
 
-  // ALL divergent i1 phis should be already lowered and inst-selected into PHI
-  // with sgpr reg class and S1 LLT.
+  // ALL divergent i1 phis should have been lowered and inst-selected into PHI
+  // with sgpr reg class and S1 LLT in AMDGPUGlobalISelDivergenceLowering pass.
   // Note: this includes divergent phis that don't require lowering.
   if (Ty == LLT::scalar(1) && MUI.isDivergent(Dst)) {
-    LLVM_DEBUG(dbgs() << "Divergent S1 G_PHI: "; MI.dump(););
-    llvm_unreachable("Make sure to run AMDGPUGlobalISelDivergenceLowering "
-                     "before RegBankLegalize to lower lane mask(vcc) phis");
+    reportGISelFailure(MF, MORE, "amdgpu-regbanklegalize",
+                       "AMDGPU RegBankLegalize: Can't lower divergent S1 G_PHI",
+                       MI);
+    return false;
   }
 
   // We accept all types that can fit in some register class.
@@ -1388,11 +1461,13 @@ void RegBankLegalizeHelper::applyMappingPHI(MachineInstr &MI) {
   // Divergent G_PHIs have vgpr dst but inputs can be sgpr or vgpr.
   if (Ty == LLT::scalar(32) || Ty == LLT::pointer(1, 64) ||
       Ty == LLT::pointer(4, 64)) {
-    return;
+    return true;
   }
 
-  LLVM_DEBUG(dbgs() << "G_PHI not handled: "; MI.dump(););
-  llvm_unreachable("type not supported");
+  reportGISelFailure(MF, MORE, "amdgpu-regbanklegalize",
+                     "AMDGPU RegBankLegalize: type not supported for G_PHI",
+                     MI);
+  return false;
 }
 
 [[maybe_unused]] static bool verifyRegBankOnOperands(MachineInstr &MI,
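
Mechanically, every helper in this file follows the same conversion: what used to be a void function ending in llvm_unreachable becomes a bool function that reports the failure and returns false, with each caller forwarding that result up to the pass. A standalone sketch of the contract is below; Instr, report, lowerOne, and runOnAll are stand-in names, not the real MachineInstr/reportGISelFailure API.

  #include <cstdio>
  #include <vector>

  struct Instr { int Opcode; }; // stand-in for MachineInstr

  // Stand-in for the remark emission done via reportGISelFailure above.
  static void report(const char *Pass, const char *Msg, const Instr &I) {
    std::fprintf(stderr, "%s: %s (opcode %d)\n", Pass, Msg, I.Opcode);
  }

  // Before: void lowerOne(...) ending in llvm_unreachable("unsupported").
  // After: report the problem and return false so the caller can bail out.
  static bool lowerOne(Instr &I) {
    switch (I.Opcode) {
    case 0:
    case 1:
      return true; // supported: lowering would be emitted here
    default:
      report("example-regbanklegalize", "unsupported opcode", I);
      return false;
    }
  }

  // Mirrors how the pass now propagates failure instead of asserting.
  static bool runOnAll(std::vector<Instr> &Instrs) {
    for (Instr &I : Instrs)
      if (!lowerOne(I))
        return false;
    return true;
  }

  int main() {
    std::vector<Instr> Good = {{0}, {1}};
    std::vector<Instr> Bad = {{0}, {7}};
    return (runOnAll(Good) && !runOnAll(Bad)) ? 0 : 1;
  }

This keeps the decision about whether a failure is fatal with the pass driver rather than with each individual lowering helper.
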
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeHelper.h b/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeHelper.h
index 4f1c3c02fa5d6..1dc0278d6d90d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeHelper.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeHelper.h
@@ -12,6 +12,7 @@
 #include "AMDGPURegBankLegalizeRules.h"
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/CodeGen/GlobalISel/GenericMachineInstrs.h"
+#include "llvm/CodeGen/MachineOptimizationRemarkEmitter.h"
 #include "llvm/CodeGen/MachineRegisterInfo.h"
 
 namespace llvm {
@@ -27,11 +28,13 @@ namespace AMDGPU {
 // to replace instruction. In other case InstApplyMethod will create new
 // instruction(s).
 class RegBankLegalizeHelper {
+  MachineFunction &MF;
   const GCNSubtarget &ST;
   MachineIRBuilder &B;
   MachineRegisterInfo &MRI;
   const MachineUniformityInfo &MUI;
   const RegisterBankInfo &RBI;
+  MachineOptimizationRemarkEmitter MORE;
   const RegBankLegalizeRules &RBLRules;
   const bool IsWave32;
   const RegisterBank *SgprRB;
@@ -81,10 +84,10 @@ class RegBankLegalizeHelper {
                         const RegisterBankInfo &RBI,
                         const RegBankLegalizeRules &RBLRules);
 
-  void findRuleAndApplyMapping(MachineInstr &MI);
+  bool findRuleAndApplyMapping(MachineInstr &MI);
 
   // Manual apply helpers.
-  void applyMappingPHI(MachineInstr &MI);
+  bool applyMappingPHI(MachineInstr &MI);
   void applyMappingTrivial(MachineInstr &MI);
 
 private:
@@ -97,37 +100,37 @@ class RegBankLegalizeHelper {
 
   const RegisterBank *getRegBankFromID(RegBankLLTMappingApplyID ID);
 
-  void
+  bool
   applyMappingDst(MachineInstr &MI, unsigned &OpIdx,
                   const SmallVectorImpl<RegBankLLTMappingApplyID> &MethodIDs);
 
-  void
+  bool
   applyMappingSrc(MachineInstr &MI, unsigned &OpIdx,
                   const SmallVectorImpl<RegBankLLTMappingApplyID> &MethodIDs,
                   SmallSet<Register, 4> &SgprWaterfallOperandRegs);
 
-  void splitLoad(MachineInstr &MI, ArrayRef<LLT> LLTBreakdown,
+  bool splitLoad(MachineInstr &MI, ArrayRef<LLT> LLTBreakdown,
                  LLT MergeTy = LLT());
-  void widenLoad(MachineInstr &MI, LLT WideTy, LLT MergeTy = LLT());
-  void widenMMOToS32(GAnyLoad &MI) const;
+  bool widenLoad(MachineInstr &MI, LLT WideTy, LLT MergeTy = LLT());
+  bool widenMMOToS32(GAnyLoad &MI) const;
 
-  void lower(MachineInstr &MI, const RegBankLLTMapping &Mapping,
+  bool lower(MachineInstr &MI, const RegBankLLTMapping &Mapping,
              SmallSet<Register, 4> &SgprWaterfallOperandRegs);
 
-  void lowerVccExtToSel(MachineInstr &MI);
+  bool lowerVccExtToSel(MachineInstr &MI);
   std::pair<Register, Register> unpackZExt(Register Reg);
   std::pair<Register, Register> unpackSExt(Register Reg);
   std::pair<Register, Register> unpackAExt(Register Reg);
   std::pair<Register, Register> unpackAExtTruncS16(Register Reg);
-  void lowerUnpackBitShift(MachineInstr &MI);
-  void lowerV_BFE(MachineInstr &MI);
-  void lowerS_BFE(MachineInstr &MI);
-  void lowerSplitTo32(MachineInstr &MI);
-  void lowerSplitTo16(MachineInstr &MI);
-  void lowerSplitTo32Select(MachineInstr &MI);
-  void lowerSplitTo32SExtInReg(MachineInstr &MI);
-  void lowerUnpackMinMax(MachineInstr &MI);
-  void lowerUnpackAExt(MachineInstr &MI);
+  bool lowerUnpackBitShift(MachineInstr &MI);
+  bool lowerV_BFE(MachineInstr &MI);
+  bool lowerS_BFE(MachineInstr &MI);
+  bool lowerSplitTo32(MachineInstr &MI);
+  bool lowerSplitTo16(MachineInstr &MI);
+  bool lowerSplitTo32Select(MachineInstr &MI);
+  bool lowerSplitTo32SExtInReg(MachineInstr &MI);
+  bool lowerUnpackMinMax(MachineInstr &MI);
+  bool lowerUnpackAExt(MachineInstr &MI);
 };
 
 } // end namespace AMDGPU
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.cpp
index 6ec51e1be8aca..9de309279a247 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.cpp
@@ -243,7 +243,7 @@ UniformityLLTOpPredicateID LLTToBId(LLT Ty) {
   return _;
 }
 
-const RegBankLLTMapping &
+const RegBankLLTMapping *
 SetOfRulesForOpcode::findMappingForMI(const MachineInstr &MI,
                                       const MachineRegisterInfo &MRI,
                                       const MachineUniformityInfo &MUI) const {
@@ -260,17 +260,16 @@ SetOfRulesForOpcode::findMappingForMI(const MachineInstr &MI,
       Slot = getFastPredicateSlot(LLTToId(MRI.getType(Reg)));
 
     if (Slot != -1)
-      return MUI.isUniform(Reg) ? Uni[Slot] : Div[Slot];
+      return MUI.isUniform(Reg) ? &Uni[Slot] : &Div[Slot];
   }
 
   // Slow search for more complex rules.
   for (const RegBankLegalizeRule &Rule : Rules) {
     if (Rule.Predicate.match(MI, MUI, MRI))
-      return Rule.OperandMapping;
+      return &Rule.OperandMapping;
   }
 
-  LLVM_DEBUG(dbgs() << "MI: "; MI.dump(););
-  llvm_unreachable("None of the rules defined for MI's opcode matched MI");
+  return nullptr;
 }
 
 void SetOfRulesForOpcode::addRule(RegBankLegalizeRule Rule) {
@@ -353,7 +352,7 @@ RegBankLegalizeRules::addRulesForIOpcs(std::initializer_list<unsigned> OpcList,
   return RuleSetInitializer(OpcList, IRulesAlias, IRules, FastTypes);
 }
 
-const SetOfRulesForOpcode &
+const SetOfRulesForOpcode *
 RegBankLegalizeRules::getRulesForOpc(MachineInstr &MI) const {
   unsigned Opc = MI.getOpcode();
   if (Opc == AMDGPU::G_INTRINSIC || Opc == AMDGPU::G_INTRINSIC_CONVERGENT ||
@@ -361,19 +360,15 @@ RegBankLegalizeRules::getRulesForOpc(MachineInstr &MI) const {
       Opc == AMDGPU::G_INTRINSIC_CONVERGENT_W_SIDE_EFFECTS) {
     unsigned IntrID = cast<GIntrinsic>(MI).getIntrinsicID();
     auto IRAIt = IRulesAlias.find(IntrID);
-    if (IRAIt == IRulesAlias.end()) {
-      LLVM_DEBUG(dbgs() << "MI: "; MI.dump(););
-      llvm_unreachable("No rules defined for intrinsic opcode");
-    }
-    return IRules.at(IRAIt->second);
+    if (IRAIt == IRulesAlias.end())
+      return nullptr;
+    return &IRules.at(IRAIt->second);
   }
 
   auto GRAIt = GRulesAlias.find(Opc);
-  if (GRAIt == GRulesAlias.end()) {
-    LLVM_DEBUG(dbgs() << "MI: "; MI.dump(););
-    llvm_unreachable("No rules defined for generic opcode");
-  }
-  return GRules.at(GRAIt->second);
+  if (GRAIt == GRulesAlias.end())
+    return nullptr;
+  return &GRules.at(GRAIt->second);
 }
 
 // Syntactic sugar wrapper for predicate lambda that enables '&&', '||' and '!'.
@@ -939,7 +934,7 @@ RegBankLegalizeRules::RegBankLegalizeRules(const GCNSubtarget &_ST,
 
   bool hasSALUFloat = ST->hasSALUFloatInsts();
 
-  addRulesForGOpcs({G_FADD, G_FMUL}, Standard)
+  addRulesForGOpcs({G_FADD, G_FMUL, G_STRICT_FADD, G_STRICT_FMUL}, Standard)
       .Uni(S16, {{UniInVgprS16}, {Vgpr16, Vgpr16}}, !hasSALUFloat)
       .Uni(S16, {{Sgpr16}, {Sgpr16, Sgpr16}}, hasSALUFloat)
       .Div(S16, {{Vgpr16}, {Vgpr16, Vgpr16}})
@@ -951,9 +946,7 @@ RegBankLegalizeRules::RegBankLegalizeRules(const GCNSubtarget &_ST,
       .Uni(V2S16, {{UniInVgprV2S16}, {VgprV2S16, VgprV2S16}}, !hasSALUFloat)
       .Uni(V2S16, {{SgprV2S16}, {SgprV2S16, SgprV2S16}, ScalarizeToS16},
            hasSALUFloat)
-      .Div(V2S16, {{VgprV2S16}, {VgprV2S16, VgprV2S16}})
-      .Any({{UniV2S32}, {{UniInVgprV2S32}, {VgprV2S32, VgprV2S32}}})
-      .Any({{DivV2S32}, {{VgprV2S32}, {VgprV2S32, VgprV2S32}}});
+      .Div(V2S16, {{VgprV2S16}, {VgprV2S16, VgprV2S16}});
 
   // FNEG and FABS are either folded as source modifiers or can be selected as
   // bitwise XOR and AND with Mask. XOR and AND are available on SALU but for
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.h b/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.h
index 7e4ce7b43dc3b..1ac117304b76f 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.h
@@ -287,7 +287,7 @@ class SetOfRulesForOpcode {
   SetOfRulesForOpcode();
   SetOfRulesForOpcode(FastRulesTypes FastTypes);
 
-  const RegBankLLTMapping &
+  const RegBankLLTMapping *
   findMappingForMI(const MachineInstr &MI, const MachineRegisterInfo &MRI,
                    const MachineUniformityInfo &MUI) const;
 
@@ -385,7 +385,7 @@ class RegBankLegalizeRules {
     MRI = &_MRI;
   };
 
-  const SetOfRulesForOpcode &getRulesForOpc(MachineInstr &MI) const;
+  const SetOfRulesForOpcode *getRulesForOpc(MachineInstr &MI) const;
 };
 
 } // end namespace AMDGPU
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURewriteAGPRCopyMFMA.cpp b/llvm/lib/Target/AMDGPU/AMDGPURewriteAGPRCopyMFMA.cpp
index 89c16dadb4b41..ffbb1c183ca9e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURewriteAGPRCopyMFMA.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURewriteAGPRCopyMFMA.cpp
@@ -32,6 +32,7 @@
 #include "llvm/CodeGen/LiveStacks.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/SlotIndexes.h"
 #include "llvm/CodeGen/VirtRegMap.h"
 #include "llvm/InitializePasses.h"
 
@@ -96,8 +97,8 @@ class AMDGPURewriteAGPRCopyMFMAImpl {
 
   /// Compute the register class constraints based on the uses of \p Reg,
   /// excluding MFMA uses from which can be rewritten to change the register
-  /// class constraint. This should be nearly identical to
-  /// MachineRegisterInfo::recomputeRegClass.
+  /// class constraint. MFMA scale operands still need their constraints checked.
+  /// This should be nearly identical to MachineRegisterInfo::recomputeRegClass.
 
   /// \p RewriteCandidates will collect the set of MFMA instructions that need
   /// to have the opcode mutated to perform the replacement.
@@ -151,9 +152,16 @@ bool AMDGPURewriteAGPRCopyMFMAImpl::recomputeRegClassExceptRewritable(
 
       // We can swap the classes of dst + src2 as a pair to AGPR, so ignore the
       // effects of rewrite candidates. It just so happens that we can use
-      // either AGPR or VGPR in src0/src1, so don't bother checking the
-      // constraint effects of the individual operands.
+      // either AGPR or VGPR in src0/src1. We still need to check constraint
+      // effects for the scale variant, which does not allow AGPRs.
       if (isRewriteCandidate(*MI)) {
+        int AGPROp = AMDGPU::getMFMASrcCVDstAGPROp(MI->getOpcode());
+        const MCInstrDesc &AGPRDesc = TII.get(AGPROp);
+        const TargetRegisterClass *NewRC =
+            TII.getRegClass(AGPRDesc, MO.getOperandNo());
+        if (!TRI.hasAGPRs(NewRC))
+          return false;
+
         const MachineOperand *VDst =
             TII.getNamedOperand(*MI, AMDGPU::OpName::vdst);
         const MachineOperand *Src2 =
@@ -659,7 +667,11 @@ AMDGPURewriteAGPRCopyMFMAPass::run(MachineFunction &MF,
   if (!Impl.run(MF))
     return PreservedAnalyses::all();
   auto PA = getMachineFunctionPassPreservedAnalyses();
-  PA.preserveSet<CFGAnalyses>();
-  PA.preserve<LiveStacksAnalysis>();
+  PA.preserveSet<CFGAnalyses>()
+      .preserve<LiveStacksAnalysis>()
+      .preserve<VirtRegMapAnalysis>()
+      .preserve<SlotIndexesAnalysis>()
+      .preserve<LiveIntervalsAnalysis>()
+      .preserve<LiveRegMatrixAnalysis>();
   return PA;
 }
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
index 7fbf520c670ae..6f1a5210fb7e0 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
@@ -190,6 +190,12 @@ GCNHazardRecognizer::getHazardType(SUnit *SU, int Stalls) {
   if (checkFPAtomicToDenormModeHazard(MI) > 0)
     return HazardType;
 
+  // Hazards which cannot be mitigated with S_NOPs.
+  if (!IsHazardRecognizerMode) {
+    if (checkWMMACoexecutionHazards(MI) > 0)
+      return Hazard;
+  }
+
   if (ST.hasNoDataDepHazard())
     return NoHazard;
 
@@ -437,9 +443,6 @@ void GCNHazardRecognizer::RecedeCycle() {
 
 enum HazardFnResult { HazardFound, HazardExpired, NoHazardFound };
 
-using IsExpiredFn = function_ref<bool(const MachineInstr &, int WaitStates)>;
-using GetNumWaitStatesFn = function_ref<unsigned int(const MachineInstr &)>;
-
 // Search for a hazard in a block and its predecessors.
 template <typename StateT>
 static bool
@@ -546,11 +549,14 @@ hasHazard(StateT InitialState,
 // Returns a minimum wait states since \p I walking all predecessors.
 // Only scans until \p IsExpired does not return true.
 // Can only be run in a hazard recognizer mode.
-static int getWaitStatesSince(
-    GCNHazardRecognizer::IsHazardFn IsHazard, const MachineBasicBlock *MBB,
-    MachineBasicBlock::const_reverse_instr_iterator I, int WaitStates,
-    IsExpiredFn IsExpired, DenseSet<const MachineBasicBlock *> &Visited,
-    GetNumWaitStatesFn GetNumWaitStates = SIInstrInfo::getNumWaitStates) {
+static int
+getWaitStatesSince(GCNHazardRecognizer::IsHazardFn IsHazard,
+                   const MachineBasicBlock *MBB,
+                   MachineBasicBlock::const_reverse_instr_iterator I,
+                   int WaitStates, GCNHazardRecognizer::IsExpiredFn IsExpired,
+                   DenseSet<const MachineBasicBlock *> &Visited,
+                   GCNHazardRecognizer::GetNumWaitStatesFn GetNumWaitStates =
+                       SIInstrInfo::getNumWaitStates) {
   for (auto E = MBB->instr_rend(); I != E; ++I) {
     // Don't add WaitStates for parent BUNDLE instructions.
     if (I->isBundle())
@@ -582,20 +588,26 @@ static int getWaitStatesSince(
   return MinWaitStates;
 }
 
-static int getWaitStatesSince(GCNHazardRecognizer::IsHazardFn IsHazard,
-                              const MachineInstr *MI, IsExpiredFn IsExpired) {
+static int
+getWaitStatesSince(GCNHazardRecognizer::IsHazardFn IsHazard,
+                   const MachineInstr *MI,
+                   GCNHazardRecognizer::IsExpiredFn IsExpired,
+                   GCNHazardRecognizer::GetNumWaitStatesFn GetNumWaitStates =
+                       SIInstrInfo::getNumWaitStates) {
   DenseSet<const MachineBasicBlock *> Visited;
   return getWaitStatesSince(IsHazard, MI->getParent(),
                             std::next(MI->getReverseIterator()), 0, IsExpired,
-                            Visited, SIInstrInfo::getNumWaitStates);
+                            Visited, GetNumWaitStates);
 }
 
-int GCNHazardRecognizer::getWaitStatesSince(IsHazardFn IsHazard, int Limit) {
+int GCNHazardRecognizer::getWaitStatesSince(
+    IsHazardFn IsHazard, int Limit, GetNumWaitStatesFn GetNumWaitStates) {
   if (IsHazardRecognizerMode) {
     auto IsExpiredFn = [Limit](const MachineInstr &, int WaitStates) {
       return WaitStates >= Limit;
     };
-    return ::getWaitStatesSince(IsHazard, CurrCycleInstr, IsExpiredFn);
+    return ::getWaitStatesSince(IsHazard, CurrCycleInstr, IsExpiredFn,
+                                GetNumWaitStates);
   }
 
   int WaitStates = 0;
@@ -607,7 +619,7 @@ int GCNHazardRecognizer::getWaitStatesSince(IsHazardFn IsHazard, int Limit) {
       if (MI->isInlineAsm())
         continue;
     }
-    ++WaitStates;
+    WaitStates += MI ? GetNumWaitStates(*MI) : 1;
 
     if (WaitStates >= Limit)
       break;
@@ -615,6 +627,10 @@ int GCNHazardRecognizer::getWaitStatesSince(IsHazardFn IsHazard, int Limit) {
   return std::numeric_limits<int>::max();
 }
 
+int GCNHazardRecognizer::getWaitStatesSince(IsHazardFn IsHazard, int Limit) {
+  return getWaitStatesSince(IsHazard, Limit, SIInstrInfo::getNumWaitStates);
+}
+
 int GCNHazardRecognizer::getWaitStatesSinceDef(unsigned Reg,
                                                IsHazardFn IsHazardDef,
                                                int Limit) {
@@ -1243,6 +1259,20 @@ int GCNHazardRecognizer::checkReadM0Hazards(MachineInstr *MI) {
          getWaitStatesSinceDef(AMDGPU::M0, IsHazardFn, ReadM0WaitStates);
 }
 
+// Emit V_NOP instructions. \p WaitStatesNeeded is the number of V_NOPs we need
+// to insert; a negative value means none are needed.
+bool GCNHazardRecognizer::emitVNops(MachineInstr *MI, int WaitStatesNeeded) {
+  if (WaitStatesNeeded <= 0)
+    return false;
+
+  const SIInstrInfo *TII = ST.getInstrInfo();
+  for (int I = 0; I < WaitStatesNeeded; ++I)
+    BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
+            TII->get(AMDGPU::V_NOP_e32));
+
+  return true;
+}
+
 void GCNHazardRecognizer::fixHazards(MachineInstr *MI) {
   fixVMEMtoScalarWriteHazards(MI);
   fixVcmpxPermlaneHazards(MI);
@@ -1257,7 +1287,7 @@ void GCNHazardRecognizer::fixHazards(MachineInstr *MI) {
   fixVALUTransUseHazard(MI);
   fixVALUTransCoexecutionHazards(MI);
   fixWMMAHazards(MI); // fall-through if co-execution is enabled.
-  fixWMMACoexecutionHazards(MI);
+  emitVNops(MI, checkWMMACoexecutionHazards(MI));
   fixShift64HighRegBug(MI);
   fixVALUMaskWriteHazard(MI);
   fixRequiredExportPriority(MI);
@@ -2045,13 +2075,13 @@ static bool IsWMMAHazardInstInCategory(const MachineInstr &MI,
   return false;
 }
 
-bool GCNHazardRecognizer::fixWMMACoexecutionHazards(MachineInstr *MI) {
+int GCNHazardRecognizer::checkWMMACoexecutionHazards(MachineInstr *MI) {
   if (!AMDGPU::isGFX1250(ST))
-    return false;
+    return 0;
 
   const SIInstrInfo *TII = ST.getInstrInfo();
   if (!TII->isXDLWMMA(*MI) && !isCoexecutableVALUInst(*MI))
-    return false;
+    return 0;
 
   const SIRegisterInfo *TRI = ST.getRegisterInfo();
 
@@ -2129,9 +2159,6 @@ bool GCNHazardRecognizer::fixWMMACoexecutionHazards(MachineInstr *MI) {
   };
 
   int Limit = 0;
-  auto IsExpiredFn = [&Limit](const MachineInstr &, int WaitStates) {
-    return WaitStates >= Limit;
-  };
 
   auto GetWaitStatesFn = [](const MachineInstr &I) {
     return SIInstrInfo::isVALU(I) ? 1 : 0;
@@ -2141,38 +2168,26 @@ bool GCNHazardRecognizer::fixWMMACoexecutionHazards(MachineInstr *MI) {
   if (TII->isXDLWMMA(*MI)) {
     for (Category = 0; WaitStatesNeeded < 0 && Category < 4; Category++) {
       Limit = WMMAWaitStates[Category]; // for IsExpiredFn.
-      DenseSet<const MachineBasicBlock *> Visited;
-      // '::getWaitStatesSince' returns the number of VALUs in between if hazard
+      // 'getWaitStatesSince' returns the number of VALUs in between if hazard
       // exists, and INT_MAX if there is no hazard. As a result, a negative
       // WaitStatesNeeded here means no hazard, and we will continue to search
       // for other categories.
       WaitStatesNeeded =
-          Limit - ::getWaitStatesSince(IsWMMAHazardFn, MI->getParent(),
-                                       std::next(MI->getReverseIterator()), 0,
-                                       IsExpiredFn, Visited, GetWaitStatesFn);
+          Limit - getWaitStatesSince(IsWMMAHazardFn, Limit, GetWaitStatesFn);
     }
   } else { // Must be a co-executable VALU.
     for (Category = 0; WaitStatesNeeded < 0 && Category < 4; Category++) {
       Limit = VALUWaitStates[Category]; // for IsExpiredFn.
-      DenseSet<const MachineBasicBlock *> Visited;
-      // '::getWaitStatesSince' returns the number of VALUs in between if hazard
+      // 'getWaitStatesSince' returns the number of VALUs in between if hazard
       // exists, and INT_MAX if there is no hazard. As a result, a negative
       // WaitStatesNeeded here means no hazard, and we will continue to search
       // for other categories.
       WaitStatesNeeded =
-          Limit - ::getWaitStatesSince(IsVALUHazardFn, MI->getParent(),
-                                       std::next(MI->getReverseIterator()), 0,
-                                       IsExpiredFn, Visited, GetWaitStatesFn);
+          Limit - getWaitStatesSince(IsVALUHazardFn, Limit, GetWaitStatesFn);
     }
   }
 
-  // WaitStatesNeeded now is the number of V_NOPs we need to insert, negative
-  // means not needed.
-  for (int i = 0; i < WaitStatesNeeded; i++)
-    BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
-            TII->get(AMDGPU::V_NOP_e32));
-
-  return true;
+  return WaitStatesNeeded;
 }
 
 bool GCNHazardRecognizer::fixShift64HighRegBug(MachineInstr *MI) {
diff --git a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
index 67beffadc0913..d725134639cfe 100644
--- a/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
+++ b/llvm/lib/Target/AMDGPU/GCNHazardRecognizer.h
@@ -32,6 +32,8 @@ class GCNSubtarget;
 class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
 public:
   typedef function_ref<bool(const MachineInstr &)> IsHazardFn;
+  typedef function_ref<bool(const MachineInstr &, int WaitStates)> IsExpiredFn;
+  typedef function_ref<unsigned int(const MachineInstr &)> GetNumWaitStatesFn;
 
 private:
   // Distinguish if we are called from scheduler or hazard recognizer
@@ -74,6 +76,8 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
   // used on a newly inserted instruction before returning from PreEmitNoops.
   void runOnInstruction(MachineInstr *MI);
 
+  int getWaitStatesSince(IsHazardFn IsHazard, int Limit,
+                         GetNumWaitStatesFn GetNumWaitStates);
   int getWaitStatesSince(IsHazardFn IsHazard, int Limit);
   int getWaitStatesSinceDef(unsigned Reg, IsHazardFn IsHazardDef, int Limit);
   int getWaitStatesSinceSetReg(IsHazardFn IsHazard, int Limit);
@@ -94,6 +98,9 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
   int checkReadM0Hazards(MachineInstr *SMovRel);
   int checkNSAtoVMEMHazard(MachineInstr *MI);
   int checkFPAtomicToDenormModeHazard(MachineInstr *MI);
+  // Emit V_NOP instructions. \p WaitStatesNeeded is the number of V_NOPs we
+  // need to insert; a negative value means none are needed.
+  bool emitVNops(MachineInstr *MI, int WaitStatesNeeded);
   void fixHazards(MachineInstr *MI);
   bool fixVcmpxPermlaneHazards(MachineInstr *MI);
   bool fixVMEMtoScalarWriteHazards(MachineInstr *MI);
@@ -106,7 +113,7 @@ class GCNHazardRecognizer final : public ScheduleHazardRecognizer {
   bool fixVALUTransUseHazard(MachineInstr *MI);
   bool fixVALUTransCoexecutionHazards(MachineInstr *MI);
   bool fixWMMAHazards(MachineInstr *MI);
-  bool fixWMMACoexecutionHazards(MachineInstr *MI);
+  int checkWMMACoexecutionHazards(MachineInstr *MI);
   bool fixShift64HighRegBug(MachineInstr *MI);
   bool fixVALUMaskWriteHazard(MachineInstr *MI);
   bool fixRequiredExportPriority(MachineInstr *MI);
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index e567176e658b3..34eb8b2266311 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -1868,6 +1868,12 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
     return GFX1250Insts && getGeneration() == GFX12;
   }
 
+  // src_flat_scratch_hi cannot be used as a source in a SALU instruction
+  // producing a 64-bit result.
+  bool hasFlatScratchHiInB64InstHazard() const {
+    return GFX1250Insts && getGeneration() == GFX12;
+  }
+
   /// \returns true if the subtarget supports clusters of workgroups.
   bool hasClusters() const { return HasClusters; }
 
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 3aef0bd31debe..301f2fc8dab45 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -35,6 +35,7 @@
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineLoopInfo.h"
+#include "llvm/CodeGen/PseudoSourceValueManager.h"
 #include "llvm/CodeGen/SDPatternMatch.h"
 #include "llvm/IR/DiagnosticInfo.h"
 #include "llvm/IR/IRBuilder.h"
@@ -1308,7 +1309,7 @@ static unsigned getIntrMemWidth(unsigned IntrID) {
   }
 }
 
-static void getCoopAtomicOperandsInfo(const CallInst &CI, bool IsLoad,
+static void getCoopAtomicOperandsInfo(const CallBase &CI, bool IsLoad,
                                       TargetLoweringBase::IntrinsicInfo &Info) {
   Value *OrderingArg = CI.getArgOperand(IsLoad ? 1 : 2);
   unsigned Ord = cast<ConstantInt>(OrderingArg)->getZExtValue();
@@ -1338,7 +1339,7 @@ static void getCoopAtomicOperandsInfo(const CallInst &CI, bool IsLoad,
 }
 
 bool SITargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
-                                          const CallInst &CI,
+                                          const CallBase &CI,
                                           MachineFunction &MF,
                                           unsigned IntrID) const {
   Info.flags = MachineMemOperand::MONone;
@@ -2265,6 +2266,14 @@ bool SITargetLowering::isTypeDesirableForOp(unsigned Op, EVT VT) const {
   return TargetLowering::isTypeDesirableForOp(Op, VT);
 }
 
+MachinePointerInfo
+SITargetLowering::getKernargSegmentPtrInfo(MachineFunction &MF) const {
+  // This isn't really a constant pool but close enough.
+  MachinePointerInfo PtrInfo(MF.getPSVManager().getConstantPool());
+  PtrInfo.AddrSpace = AMDGPUAS::CONSTANT_ADDRESS;
+  return PtrInfo;
+}
+
 SDValue SITargetLowering::lowerKernArgParameterPtr(SelectionDAG &DAG,
                                                    const SDLoc &SL,
                                                    SDValue Chain,
@@ -2341,7 +2350,9 @@ SDValue SITargetLowering::lowerKernargMemParameter(
     SelectionDAG &DAG, EVT VT, EVT MemVT, const SDLoc &SL, SDValue Chain,
     uint64_t Offset, Align Alignment, bool Signed,
     const ISD::InputArg *Arg) const {
-  MachinePointerInfo PtrInfo(AMDGPUAS::CONSTANT_ADDRESS);
+
+  MachinePointerInfo PtrInfo =
+      getKernargSegmentPtrInfo(DAG.getMachineFunction());
 
   // Try to avoid using an extload by loading earlier than the argument address,
   // and extracting the relevant bits. The load should hopefully be merged with
@@ -2356,7 +2367,8 @@ SDValue SITargetLowering::lowerKernargMemParameter(
     // TODO: If we passed in the base kernel offset we could have a better
     // alignment than 4, but we don't really need it.
     SDValue Ptr = lowerKernArgParameterPtr(DAG, SL, Chain, AlignDownOffset);
-    SDValue Load = DAG.getLoad(MVT::i32, SL, Chain, Ptr, PtrInfo, Align(4),
+    SDValue Load = DAG.getLoad(MVT::i32, SL, Chain, Ptr,
+                               PtrInfo.getWithOffset(AlignDownOffset), Align(4),
                                MachineMemOperand::MODereferenceable |
                                    MachineMemOperand::MOInvariant);
 
@@ -2371,9 +2383,9 @@ SDValue SITargetLowering::lowerKernargMemParameter(
   }
 
   SDValue Ptr = lowerKernArgParameterPtr(DAG, SL, Chain, Offset);
-  SDValue Load = DAG.getLoad(MemVT, SL, Chain, Ptr, PtrInfo, Alignment,
-                             MachineMemOperand::MODereferenceable |
-                                 MachineMemOperand::MOInvariant);
+  SDValue Load = DAG.getLoad(
+      MemVT, SL, Chain, Ptr, PtrInfo.getWithOffset(Offset), Alignment,
+      MachineMemOperand::MODereferenceable | MachineMemOperand::MOInvariant);
 
   SDValue Val = convertArgType(DAG, VT, MemVT, SL, Load, Signed, Arg);
   return DAG.getMergeValues({Val, Load.getValue(1)}, SL);
@@ -8143,10 +8155,11 @@ SITargetLowering::loadImplicitKernelArgument(SelectionDAG &DAG, MVT VT,
   MachineFunction &MF = DAG.getMachineFunction();
   uint64_t Offset = getImplicitParameterOffset(MF, Param);
   SDValue Ptr = lowerKernArgParameterPtr(DAG, DL, DAG.getEntryNode(), Offset);
-  MachinePointerInfo PtrInfo(AMDGPUAS::CONSTANT_ADDRESS);
-  return DAG.getLoad(VT, DL, DAG.getEntryNode(), Ptr, PtrInfo, Alignment,
-                     MachineMemOperand::MODereferenceable |
-                         MachineMemOperand::MOInvariant);
+  MachinePointerInfo PtrInfo =
+      getKernargSegmentPtrInfo(DAG.getMachineFunction());
+  return DAG.getLoad(
+      VT, DL, DAG.getEntryNode(), Ptr, PtrInfo.getWithOffset(Offset), Alignment,
+      MachineMemOperand::MODereferenceable | MachineMemOperand::MOInvariant);
 }
 
 SDValue SITargetLowering::lowerTrapHsaQueuePtr(SDValue Op,
@@ -8408,6 +8421,9 @@ SDValue SITargetLowering::lowerADDRSPACECAST(SDValue Op,
       Op.getValueType() == MVT::i64) {
     const SIMachineFunctionInfo *Info =
         DAG.getMachineFunction().getInfo<SIMachineFunctionInfo>();
+    if (Info->get32BitAddressHighBits() == 0)
+      return DAG.getNode(ISD::ZERO_EXTEND, SL, MVT::i64, Src);
+
     SDValue Hi = DAG.getConstant(Info->get32BitAddressHighBits(), SL, MVT::i32);
     SDValue Vec = DAG.getNode(ISD::BUILD_VECTOR, SL, MVT::v2i32, Src, Hi);
     return DAG.getNode(ISD::BITCAST, SL, MVT::i64, Vec);
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index 74e58f4272e10..fb162948caf4c 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -45,6 +45,8 @@ class SITargetLowering final : public AMDGPUTargetLowering {
     LLVMContext &Context, CallingConv::ID CC, EVT VT, EVT &IntermediateVT,
     unsigned &NumIntermediates, MVT &RegisterVT) const override;
 
+  MachinePointerInfo getKernargSegmentPtrInfo(MachineFunction &MF) const;
+
 private:
   SDValue lowerKernArgParameterPtr(SelectionDAG &DAG, const SDLoc &SL,
                                    SDValue Chain, uint64_t Offset) const;
@@ -332,7 +334,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {
   MVT getPointerTy(const DataLayout &DL, unsigned AS) const override;
   MVT getPointerMemTy(const DataLayout &DL, unsigned AS) const override;
 
-  bool getTgtMemIntrinsic(IntrinsicInfo &, const CallInst &,
+  bool getTgtMemIntrinsic(IntrinsicInfo &, const CallBase &,
                           MachineFunction &MF,
                           unsigned IntrinsicID) const override;
 
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 70db7b4918515..64a29e1841245 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -552,9 +552,7 @@ class SIInsertWaitcnts {
       return VMEM_ACCESS;
     if (Inst.mayStore() &&
         (!Inst.mayLoad() || SIInstrInfo::isAtomicNoRet(Inst))) {
-      // FLAT and SCRATCH instructions may access scratch. Other VMEM
-      // instructions do not.
-      if (TII->mayAccessScratchThroughFlat(Inst))
+      if (TII->mayAccessScratch(Inst))
         return SCRATCH_WRITE_ACCESS;
       return VMEM_WRITE_ACCESS;
     }
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index da019b6e476df..7535407741f1f 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -4399,8 +4399,9 @@ bool SIInstrInfo::isAlwaysGDS(uint16_t Opcode) const {
          Opcode == AMDGPU::DS_SUB_GS_REG_RTN || isGWS(Opcode);
 }
 
-bool SIInstrInfo::mayAccessScratchThroughFlat(const MachineInstr &MI) const {
-  if (!isFLAT(MI) || isFLATGlobal(MI))
+bool SIInstrInfo::mayAccessScratch(const MachineInstr &MI) const {
+  // Instructions that access scratch use FLAT or BUF encodings.
+  if ((!isFLAT(MI) || isFLATGlobal(MI)) && !isBUF(MI))
     return false;
 
   // If scratch is not initialized, we can never access it.
@@ -6256,6 +6257,18 @@ bool SIInstrInfo::isLegalRegOperand(const MachineInstr &MI, unsigned OpIdx,
       (int)OpIdx == AMDGPU::getNamedOperandIdx(Opc, AMDGPU::OpName::src0) &&
       RI.isSGPRReg(MRI, MO.getReg()))
     return false;
+
+  if (ST.hasFlatScratchHiInB64InstHazard() &&
+      MO.getReg() == AMDGPU::SRC_FLAT_SCRATCH_BASE_HI && isSALU(MI)) {
+    if (const MachineOperand *Dst = getNamedOperand(MI, AMDGPU::OpName::sdst)) {
+      if (AMDGPU::getRegBitWidth(*RI.getRegClassForReg(MRI, Dst->getReg())) ==
+          64)
+        return false;
+    }
+    if (Opc == AMDGPU::S_BITCMP0_B64 || Opc == AMDGPU::S_BITCMP1_B64)
+      return false;
+  }
+
   return true;
 }
 
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index 3174bfafb4154..b1d6563bf3c0b 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -694,11 +694,11 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
     return get(Opcode).TSFlags & SIInstrFlags::FLAT;
   }
 
-  /// \returns true for SCRATCH_ instructions, or FLAT_ instructions with
-  /// SCRATCH_ memory operands.
+  /// \returns true for SCRATCH_ instructions, or for FLAT/BUF instructions
+  /// unless their MMOs show they do not access scratch.
   /// Conservatively correct; will return true if \p MI cannot be proven
   /// to not hit scratch.
-  bool mayAccessScratchThroughFlat(const MachineInstr &MI) const;
+  bool mayAccessScratch(const MachineInstr &MI) const;
 
   /// \returns true for FLAT instructions that can access VMEM.
   bool mayAccessVMEMThroughFlat(const MachineInstr &MI) const;
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index c5f5b7d53cfb1..ca5a4d7301bda 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -820,9 +820,8 @@ def SI_CALL : SPseudoInstSI <
   let isConvergent = 1;
 }
 
-class SI_TCRETURN_Pseudo<RegisterClass rc, SDNode sd> : SPseudoInstSI <(outs),
-  (ins rc:$src0, unknown:$callee, i32imm:$fpdiff),
-  [(sd i64:$src0, tglobaladdr:$callee, i32:$fpdiff)]> {
+class SI_TCRETURN_Pseudo<RegisterClass rc, list<dag> pattern = []>
+  : SPseudoInstSI <(outs), (ins rc:$src0, unknown:$callee, i32imm:$fpdiff), pattern> {
   let Size = 4;
   let FixedSize = 1;
   let isCall = 1;
@@ -836,8 +835,15 @@ class SI_TCRETURN_Pseudo<RegisterClass rc, SDNode sd> : SPseudoInstSI <(outs),
 }
 
 // Tail call handling pseudo
-def SI_TCRETURN :     SI_TCRETURN_Pseudo<CCR_SGPR_64, AMDGPUtc_return>;
-def SI_TCRETURN_GFX : SI_TCRETURN_Pseudo<Gfx_CCR_SGPR_64, AMDGPUtc_return_gfx>;
+def SI_TCRETURN : SI_TCRETURN_Pseudo<CCR_SGPR_64,
+  [(AMDGPUtc_return i64:$src0, tglobaladdr:$callee, i32:$fpdiff)]>;
+def SI_TCRETURN_GFX : SI_TCRETURN_Pseudo<Gfx_CCR_SGPR_64,
+  [(AMDGPUtc_return_gfx i64:$src0, tglobaladdr:$callee, i32:$fpdiff)]>;
+
+// Tail call for chain calling conventions.
+// Uses unrestricted SGPR_64 instead of CCR_SGPR_64 because chain calls
+// never return and don't need to preserve any SGPRs.
+def SI_TCRETURN_CHAIN : SI_TCRETURN_Pseudo<SGPR_64>;
 
 // Handle selecting indirect tail calls
 def : GCNPat<
@@ -867,13 +873,13 @@ multiclass SI_CS_CHAIN_TC<
     // This is essentially a tail call, but it also takes a mask to put in EXEC
     // right before jumping to the callee.
     def NAME: SPseudoInstSI <(outs),
-        (ins CCR_SGPR_64:$src0, unknown:$callee, i32imm:$fpdiff, execrc:$exec)>;
+        (ins SGPR_64:$src0, unknown:$callee, i32imm:$fpdiff, execrc:$exec)>;
 
     // Same as above, but it will first try to reallocate the VGPRs, and choose an
     // EXEC mask and a callee depending on the success of the reallocation attempt.
     def _DVGPR : SPseudoInstSI <(outs),
-        (ins CCR_SGPR_64:$src0, i64imm:$callee, i32imm:$fpdiff, execrc:$exec,
-             SSrc_b32:$numvgprs, execrc:$fbexec, CCR_SGPR_64:$fbcallee)>;
+        (ins SGPR_64:$src0, i64imm:$callee, i32imm:$fpdiff, execrc:$exec,
+             SSrc_b32:$numvgprs, execrc:$fbexec, SGPR_64:$fbcallee)>;
   } // End FixedSize = 0 etc
 }
 
@@ -885,7 +891,7 @@ multiclass si_cs_chain_tc_pattern<
   dag callee, ValueType execvt, RegisterOperand execrc, Instruction tc> {
   def : GCNPat<
     (AMDGPUtc_return_chain i64:$src0, callee, (i32 timm:$fpdiff), execvt:$exec),
-    (tc CCR_SGPR_64:$src0, callee, i32imm:$fpdiff, execrc:$exec)
+    (tc SGPR_64:$src0, callee, i32imm:$fpdiff, execrc:$exec)
   >;
 }
 
@@ -912,8 +918,8 @@ multiclass si_cs_chain_tc_dvgpr_patterns<
     (AMDGPUtc_return_chain_dvgpr i64:$src0, callee, (i32 timm:$fpdiff),
                                  execvt:$exec, i32:$numvgprs,
                                  execvt:$fbexec, i64:$fbcallee),
-    (tc CCR_SGPR_64:$src0, (i64 0), i32imm:$fpdiff, execrc:$exec,
-        SSrc_b32:$numvgprs, execrc:$fbexec, CCR_SGPR_64:$fbcallee)
+    (tc SGPR_64:$src0, (i64 0), i32imm:$fpdiff, execrc:$exec,
+        SSrc_b32:$numvgprs, execrc:$fbexec, SGPR_64:$fbcallee)
   >;
   }
 }
diff --git a/llvm/lib/Target/AMDGPU/SILateBranchLowering.cpp b/llvm/lib/Target/AMDGPU/SILateBranchLowering.cpp
index 6537b79d58021..340c9f682971c 100644
--- a/llvm/lib/Target/AMDGPU/SILateBranchLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SILateBranchLowering.cpp
@@ -186,7 +186,7 @@ void SILateBranchLowering::expandChainCall(MachineInstr &MI,
   for (int OpIdx = MI.getNumExplicitOperands() - 1; OpIdx >= ExecIdx; --OpIdx)
     MI.removeOperand(OpIdx);
 
-  MI.setDesc(TII->get(AMDGPU::SI_TCRETURN));
+  MI.setDesc(TII->get(AMDGPU::SI_TCRETURN_CHAIN));
 }
 
 void SILateBranchLowering::earlyTerm(MachineInstr &MI,
diff --git a/llvm/lib/Target/ARC/ARC.td b/llvm/lib/Target/ARC/ARC.td
index 142ce7f747919..71b3bb61639f8 100644
--- a/llvm/lib/Target/ARC/ARC.td
+++ b/llvm/lib/Target/ARC/ARC.td
@@ -24,6 +24,8 @@ include "ARCRegisterInfo.td"
 include "ARCInstrInfo.td"
 include "ARCCallingConv.td"
 
+defm : RemapAllTargetPseudoPointerOperands<GPR32>;
+
 def ARCInstrInfo : InstrInfo;
 
 class Proc<string Name, list<SubtargetFeature> Features>
diff --git a/llvm/lib/Target/ARM/ARMExpandPseudoInsts.cpp b/llvm/lib/Target/ARM/ARMExpandPseudoInsts.cpp
index fffb63738166d..d69c09fcb39db 100644
--- a/llvm/lib/Target/ARM/ARMExpandPseudoInsts.cpp
+++ b/llvm/lib/Target/ARM/ARMExpandPseudoInsts.cpp
@@ -932,6 +932,7 @@ static bool IsAnAddressOperand(const MachineOperand &MO) {
     return true;
   case MachineOperand::MO_RegisterMask:
   case MachineOperand::MO_RegisterLiveOut:
+  case MachineOperand::MO_LaneMask:
     return false;
   case MachineOperand::MO_Metadata:
   case MachineOperand::MO_MCSymbol:
diff --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 32f3e5fa3c842..2d26c67a8077a 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -546,16 +546,24 @@ ARMTargetLowering::ARMTargetLowering(const TargetMachine &TM_,
       for (auto Op : {ISD::STRICT_FADD, ISD::STRICT_FSUB, ISD::STRICT_FMUL,
                       ISD::STRICT_FDIV, ISD::STRICT_FMA, ISD::STRICT_FSQRT})
         setOperationAction(Op, MVT::f64, Legal);
+
+      setOperationAction(ISD::STRICT_FP_ROUND, MVT::f32, Legal);
     }
   }
 
   if (Subtarget->hasFullFP16()) {
+    for (auto Op : {ISD::STRICT_FADD, ISD::STRICT_FSUB, ISD::STRICT_FMUL,
+                    ISD::STRICT_FDIV, ISD::STRICT_FMA, ISD::STRICT_FSQRT})
+      setOperationAction(Op, MVT::f16, Legal);
+
     addRegisterClass(MVT::f16, &ARM::HPRRegClass);
     setOperationAction(ISD::BITCAST, MVT::i16, Custom);
     setOperationAction(ISD::BITCAST, MVT::f16, Custom);
 
     setOperationAction(ISD::FMINNUM, MVT::f16, Legal);
     setOperationAction(ISD::FMAXNUM, MVT::f16, Legal);
+    setOperationAction(ISD::STRICT_FMINNUM, MVT::f16, Legal);
+    setOperationAction(ISD::STRICT_FMAXNUM, MVT::f16, Legal);
   }
 
   if (Subtarget->hasBF16()) {
@@ -865,13 +873,14 @@ ARMTargetLowering::ARMTargetLowering(const TargetMachine &TM_,
     setOperationAction(ISD::FP_TO_SINT, MVT::f64, Custom);
     setOperationAction(ISD::FP_TO_UINT, MVT::f64, Custom);
     setOperationAction(ISD::FP_ROUND,   MVT::f32, Custom);
-    setOperationAction(ISD::STRICT_FP_TO_SINT, MVT::i32, Custom);
-    setOperationAction(ISD::STRICT_FP_TO_UINT, MVT::i32, Custom);
     setOperationAction(ISD::STRICT_FP_TO_SINT, MVT::f64, Custom);
     setOperationAction(ISD::STRICT_FP_TO_UINT, MVT::f64, Custom);
     setOperationAction(ISD::STRICT_FP_ROUND,   MVT::f32, Custom);
   }
 
+  setOperationAction(ISD::STRICT_FP_TO_SINT, MVT::i32, Custom);
+  setOperationAction(ISD::STRICT_FP_TO_UINT, MVT::i32, Custom);
+
   if (!Subtarget->hasFP64() || !Subtarget->hasFPARMv8Base()) {
     setOperationAction(ISD::FP_EXTEND,  MVT::f64, Custom);
     setOperationAction(ISD::STRICT_FP_EXTEND, MVT::f64, Custom);
@@ -879,11 +888,16 @@ ARMTargetLowering::ARMTargetLowering(const TargetMachine &TM_,
       setOperationAction(ISD::FP_ROUND,  MVT::f16, Custom);
       setOperationAction(ISD::STRICT_FP_ROUND, MVT::f16, Custom);
     }
+  } else {
+    setOperationAction(ISD::STRICT_FP_EXTEND, MVT::f64, Legal);
   }
 
   if (!Subtarget->hasFP16()) {
     setOperationAction(ISD::FP_EXTEND,  MVT::f32, Custom);
     setOperationAction(ISD::STRICT_FP_EXTEND, MVT::f32, Custom);
+  } else {
+    setOperationAction(ISD::STRICT_FP_EXTEND, MVT::f32, Legal);
+    setOperationAction(ISD::STRICT_FP_ROUND, MVT::f16, Legal);
   }
 
   computeRegisterProperties(Subtarget->getRegisterInfo());
@@ -1223,16 +1237,16 @@ ARMTargetLowering::ARMTargetLowering(const TargetMachine &TM_,
     if (!Subtarget->hasFPARMv8Base() || !Subtarget->hasFP64()) {
       setOperationAction(ISD::FP16_TO_FP, MVT::f64, Expand);
       setOperationAction(ISD::FP_TO_FP16, MVT::f64, Expand);
-      setOperationAction(ISD::STRICT_FP16_TO_FP, MVT::f64, LibCall);
-      setOperationAction(ISD::STRICT_FP_TO_FP16, MVT::f64, LibCall);
+      setOperationAction(ISD::STRICT_FP16_TO_FP, MVT::f64, Expand);
+      setOperationAction(ISD::STRICT_FP_TO_FP16, MVT::f64, Expand);
     }
 
     // fp16 is a special v7 extension that adds f16 <-> f32 conversions.
     if (!Subtarget->hasFP16()) {
       setOperationAction(ISD::FP16_TO_FP, MVT::f32, Expand);
       setOperationAction(ISD::FP_TO_FP16, MVT::f32, Expand);
-      setOperationAction(ISD::STRICT_FP16_TO_FP, MVT::f32, LibCall);
-      setOperationAction(ISD::STRICT_FP_TO_FP16, MVT::f32, LibCall);
+      setOperationAction(ISD::STRICT_FP16_TO_FP, MVT::f32, Expand);
+      setOperationAction(ISD::STRICT_FP_TO_FP16, MVT::f32, Expand);
     }
 
     // Strict floating-point comparisons need custom lowering.
@@ -1248,34 +1262,26 @@ ARMTargetLowering::ARMTargetLowering(const TargetMachine &TM_,
   setOperationAction(ISD::FSINCOS, MVT::f32, Expand);
 
   // FP-ARMv8 implements a lot of rounding-like FP operations.
-  if (Subtarget->hasFPARMv8Base()) {
-    setOperationAction(ISD::FFLOOR, MVT::f32, Legal);
-    setOperationAction(ISD::FCEIL, MVT::f32, Legal);
-    setOperationAction(ISD::FROUND, MVT::f32, Legal);
-    setOperationAction(ISD::FTRUNC, MVT::f32, Legal);
-    setOperationAction(ISD::FNEARBYINT, MVT::f32, Legal);
-    setOperationAction(ISD::FRINT, MVT::f32, Legal);
-    setOperationAction(ISD::FROUNDEVEN, MVT::f32, Legal);
-    setOperationAction(ISD::FMINNUM, MVT::f32, Legal);
-    setOperationAction(ISD::FMAXNUM, MVT::f32, Legal);
+  if (Subtarget->hasFPARMv8Base()) {
+    for (auto Op :
+         {ISD::FFLOOR,            ISD::FCEIL,             ISD::FROUND,
+          ISD::FTRUNC,            ISD::FNEARBYINT,        ISD::FRINT,
+          ISD::FROUNDEVEN,        ISD::FMINNUM,           ISD::FMAXNUM,
+          ISD::STRICT_FFLOOR,     ISD::STRICT_FCEIL,      ISD::STRICT_FROUND,
+          ISD::STRICT_FTRUNC,     ISD::STRICT_FNEARBYINT, ISD::STRICT_FRINT,
+          ISD::STRICT_FROUNDEVEN, ISD::STRICT_FMINNUM,    ISD::STRICT_FMAXNUM}) {
+      setOperationAction(Op, MVT::f32, Legal);
+
+      if (Subtarget->hasFP64())
+        setOperationAction(Op, MVT::f64, Legal);
+    }
+
     if (Subtarget->hasNEON()) {
       setOperationAction(ISD::FMINNUM, MVT::v2f32, Legal);
       setOperationAction(ISD::FMAXNUM, MVT::v2f32, Legal);
       setOperationAction(ISD::FMINNUM, MVT::v4f32, Legal);
       setOperationAction(ISD::FMAXNUM, MVT::v4f32, Legal);
     }
-
-    if (Subtarget->hasFP64()) {
-      setOperationAction(ISD::FFLOOR, MVT::f64, Legal);
-      setOperationAction(ISD::FCEIL, MVT::f64, Legal);
-      setOperationAction(ISD::FROUND, MVT::f64, Legal);
-      setOperationAction(ISD::FTRUNC, MVT::f64, Legal);
-      setOperationAction(ISD::FNEARBYINT, MVT::f64, Legal);
-      setOperationAction(ISD::FRINT, MVT::f64, Legal);
-      setOperationAction(ISD::FROUNDEVEN, MVT::f64, Legal);
-      setOperationAction(ISD::FMINNUM, MVT::f64, Legal);
-      setOperationAction(ISD::FMAXNUM, MVT::f64, Legal);
-    }
   }
 
   // FP16 often need to be promoted to call lib functions
@@ -1430,6 +1436,8 @@ ARMTargetLowering::ARMTargetLowering(const TargetMachine &TM_,
       Align(1ULL << Subtarget->getPreferBranchLogAlignment()));
 
   setMinFunctionAlignment(Subtarget->isThumb() ? Align(2) : Align(4));
+
+  IsStrictFPEnabled = true;
 }
 
 bool ARMTargetLowering::useSoftFloat() const {
@@ -20657,7 +20665,7 @@ bool ARMTargetLowering::isFPImmLegal(const APFloat &Imm, EVT VT,
 /// MemIntrinsicNodes.  The associated MachineMemOperands record the alignment
 /// specified in the intrinsic calls.
 bool ARMTargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
-                                           const CallInst &I,
+                                           const CallBase &I,
                                            MachineFunction &MF,
                                            unsigned Intrinsic) const {
   switch (Intrinsic) {
diff --git a/llvm/lib/Target/ARM/ARMISelLowering.h b/llvm/lib/Target/ARM/ARMISelLowering.h
index 8191eb40a712a..d0fb58c764edd 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.h
+++ b/llvm/lib/Target/ARM/ARMISelLowering.h
@@ -315,8 +315,7 @@ class VectorType;
     bool isFPImmLegal(const APFloat &Imm, EVT VT,
                       bool ForCodeSize = false) const override;
 
-    bool getTgtMemIntrinsic(IntrinsicInfo &Info,
-                            const CallInst &I,
+    bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallBase &I,
                             MachineFunction &MF,
                             unsigned Intrinsic) const override;
 
diff --git a/llvm/lib/Target/ARM/ARMInstrMVE.td b/llvm/lib/Target/ARM/ARMInstrMVE.td
index 6da04c4ac6f18..097318711d137 100644
--- a/llvm/lib/Target/ARM/ARMInstrMVE.td
+++ b/llvm/lib/Target/ARM/ARMInstrMVE.td
@@ -393,6 +393,12 @@ def vsub : PatFrags<(ops node:$lhs, node:$rhs),
 def vmul : PatFrags<(ops node:$lhs, node:$rhs),
                     [(fmul node:$lhs, node:$rhs),
                      (int_arm_mve_vmul node:$lhs, node:$rhs)]>;
+def vminnm : PatFrags<(ops node:$lhs, node:$rhs),
+                    [(fminnum node:$lhs, node:$rhs),
+                     (int_arm_mve_vminnm node:$lhs, node:$rhs)]>;
+def vmaxnm : PatFrags<(ops node:$lhs, node:$rhs),
+                    [(fmaxnum node:$lhs, node:$rhs),
+                     (int_arm_mve_vmaxnm node:$lhs, node:$rhs)]>;
 
 // --------- Start of base classes for the instructions themselves
 
@@ -1489,7 +1495,7 @@ class MVE_VMINMAXNM<string iname, string suffix, bits<2> sz, bit bit_21,
   let validForTailPredication = 1;
 }
 
-multiclass MVE_VMINMAXNM_m<string iname, bit bit_4, MVEVectorVTInfo VTI, SDNode Op, Intrinsic PredInt> {
+multiclass MVE_VMINMAXNM_m<string iname, bit bit_4, MVEVectorVTInfo VTI, SDPatternOperator Op, Intrinsic PredInt> {
   def "" : MVE_VMINMAXNM<iname, VTI.Suffix, VTI.Size, bit_4>;
 
   let Predicates = [HasMVEFloat] in {
@@ -1497,10 +1503,10 @@ multiclass MVE_VMINMAXNM_m<string iname, bit bit_4, MVEVectorVTInfo VTI, SDNode
   }
 }
 
-defm MVE_VMAXNMf32 : MVE_VMINMAXNM_m<"vmaxnm", 0b0, MVE_v4f32, fmaxnum, int_arm_mve_max_predicated>;
-defm MVE_VMAXNMf16 : MVE_VMINMAXNM_m<"vmaxnm", 0b0, MVE_v8f16, fmaxnum, int_arm_mve_max_predicated>;
-defm MVE_VMINNMf32 : MVE_VMINMAXNM_m<"vminnm", 0b1, MVE_v4f32, fminnum, int_arm_mve_min_predicated>;
-defm MVE_VMINNMf16 : MVE_VMINMAXNM_m<"vminnm", 0b1, MVE_v8f16, fminnum, int_arm_mve_min_predicated>;
+defm MVE_VMAXNMf32 : MVE_VMINMAXNM_m<"vmaxnm", 0b0, MVE_v4f32, vmaxnm, int_arm_mve_max_predicated>;
+defm MVE_VMAXNMf16 : MVE_VMINMAXNM_m<"vmaxnm", 0b0, MVE_v8f16, vmaxnm, int_arm_mve_max_predicated>;
+defm MVE_VMINNMf32 : MVE_VMINMAXNM_m<"vminnm", 0b1, MVE_v4f32, vminnm, int_arm_mve_min_predicated>;
+defm MVE_VMINNMf16 : MVE_VMINMAXNM_m<"vminnm", 0b1, MVE_v8f16, vminnm, int_arm_mve_min_predicated>;
 
 
 class MVE_VMINMAX<string iname, string suffix, bit U, bits<2> size,
@@ -3723,6 +3729,10 @@ multiclass MVE_VFMA_fp_multi<string iname, bit fms, MVEVectorVTInfo VTI> {
     if fms then {
       def : Pat<(VTI.Vec (fma (fneg m1), m2, add)),
                 (Inst $add, $m1, $m2)>;
+      def : Pat<(VTI.Vec (int_arm_mve_fma (fneg m1), m2, add)),
+                (Inst $add, $m1, $m2)>;
+      def : Pat<(VTI.Vec (int_arm_mve_fma m1, (fneg m2), add)),
+                (Inst $add, $m1, $m2)>;
       def : Pat<(VTI.Vec (vselect (VTI.Pred VCCR:$pred),
                                   (VTI.Vec (fma (fneg m1), m2, add)),
                                   add)),
@@ -3734,6 +3744,8 @@ multiclass MVE_VFMA_fp_multi<string iname, bit fms, MVEVectorVTInfo VTI> {
     } else {
       def : Pat<(VTI.Vec (fma m1, m2, add)),
                 (Inst $add, $m1, $m2)>;
+      def : Pat<(VTI.Vec (int_arm_mve_fma m1, m2, add)),
+                (Inst $add, $m1, $m2)>;
       def : Pat<(VTI.Vec (vselect (VTI.Pred VCCR:$pred),
                                   (VTI.Vec (fma m1, m2, add)),
                                   add)),
@@ -4142,7 +4154,7 @@ class MVE_VMAXMINNMA<string iname, string suffix, bits<2> size, bit bit_12,
 }
 
 multiclass MVE_VMAXMINNMA_m<string iname, MVEVectorVTInfo VTI,
-                      SDNode unpred_op, Intrinsic pred_int,
+                      SDPatternOperator unpred_op, Intrinsic pred_int,
                       bit bit_12> {
   def "" : MVE_VMAXMINNMA<iname, VTI.Suffix, VTI.Size, bit_12>;
   defvar Inst = !cast<Instruction>(NAME);
@@ -4162,13 +4174,13 @@ multiclass MVE_VMAXMINNMA_m<string iname, MVEVectorVTInfo VTI,
 }
 
 multiclass MVE_VMAXNMA<MVEVectorVTInfo VTI, bit bit_12>
-  : MVE_VMAXMINNMA_m<"vmaxnma", VTI, fmaxnum, int_arm_mve_vmaxnma_predicated, bit_12>;
+  : MVE_VMAXMINNMA_m<"vmaxnma", VTI, vmaxnm, int_arm_mve_vmaxnma_predicated, bit_12>;
 
 defm MVE_VMAXNMAf32 : MVE_VMAXNMA<MVE_v4f32, 0b0>;
 defm MVE_VMAXNMAf16 : MVE_VMAXNMA<MVE_v8f16, 0b0>;
 
 multiclass MVE_VMINNMA<MVEVectorVTInfo VTI, bit bit_12>
-  : MVE_VMAXMINNMA_m<"vminnma", VTI, fminnum, int_arm_mve_vminnma_predicated, bit_12>;
+  : MVE_VMAXMINNMA_m<"vminnma", VTI, vminnm, int_arm_mve_vminnma_predicated, bit_12>;
 
 defm MVE_VMINNMAf32 : MVE_VMINNMA<MVE_v4f32, 0b1>;
 defm MVE_VMINNMAf16 : MVE_VMINNMA<MVE_v8f16, 0b1>;
@@ -5672,6 +5684,8 @@ multiclass MVE_VFMA_qr_multi<string iname, MVEVectorVTInfo VTI,
     if scalar_addend then {
       def : Pat<(VTI.Vec (fma v1, v2, vs)),
                 (VTI.Vec (Inst v1, v2, is))>;
+      def : Pat<(VTI.Vec (int_arm_mve_fma v1, v2, vs)),
+                (VTI.Vec (Inst v1, v2, is))>;
       def : Pat<(VTI.Vec (vselect (VTI.Pred VCCR:$pred),
                                   (VTI.Vec (fma v1, v2, vs)),
                                   v1)),
@@ -5681,6 +5695,10 @@ multiclass MVE_VFMA_qr_multi<string iname, MVEVectorVTInfo VTI,
                 (VTI.Vec (Inst v2, v1, is))>;
       def : Pat<(VTI.Vec (fma vs, v1, v2)),
                 (VTI.Vec (Inst v2, v1, is))>;
+      def : Pat<(VTI.Vec (int_arm_mve_fma v1, vs, v2)),
+                (VTI.Vec (Inst v2, v1, is))>;
+      def : Pat<(VTI.Vec (int_arm_mve_fma vs, v1, v2)),
+                (VTI.Vec (Inst v2, v1, is))>;
       def : Pat<(VTI.Vec (vselect (VTI.Pred VCCR:$pred),
                                   (VTI.Vec (fma vs, v2, v1)),
                                   v1)),
diff --git a/llvm/lib/Target/ARM/ARMInstrVFP.td b/llvm/lib/Target/ARM/ARMInstrVFP.td
index 65c61c259d465..5f5f703fbabf1 100644
--- a/llvm/lib/Target/ARM/ARMInstrVFP.td
+++ b/llvm/lib/Target/ARM/ARMInstrVFP.td
@@ -814,7 +814,7 @@ def VCVTBHS: ASuI<0b11101, 0b11, 0b0010, 0b01, 0, (outs SPR:$Sd), (ins SPR:$Sm),
 
 def : FP16Pat<(f32 (any_fpextend (f16 HPR:$Sm))),
               (VCVTBHS (COPY_TO_REGCLASS (f16 HPR:$Sm), SPR))>;
-def : FP16Pat<(f16_to_fp GPR:$a),
+def : FP16Pat<(any_f16_to_fp GPR:$a),
               (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>;
 
 let hasSideEffects = 0, mayRaiseFPException = 1, Uses = [FPSCR_RM] in
@@ -826,7 +826,7 @@ def VCVTBSH: ASuI<0b11101, 0b11, 0b0011, 0b01, 0, (outs SPR:$Sd), (ins SPR:$Sda,
 
 def : FP16Pat<(f16 (any_fpround SPR:$Sm)),
               (COPY_TO_REGCLASS (VCVTBSH (IMPLICIT_DEF), SPR:$Sm), HPR)>;
-def : FP16Pat<(fp_to_f16 SPR:$a),
+def : FP16Pat<(any_fp_to_f16 SPR:$a),
               (i32 (COPY_TO_REGCLASS (VCVTBSH (IMPLICIT_DEF), SPR:$a), GPR))>;
 def : FP16Pat<(insertelt (v8f16 MQPR:$src1), (f16 (any_fpround (f32 SPR:$src2))), imm_even:$lane),
               (v8f16 (INSERT_SUBREG (v8f16 MQPR:$src1),
@@ -891,7 +891,7 @@ def VCVTBHD : ADuI<0b11101, 0b11, 0b0010, 0b01, 0,
 def : FullFP16Pat<(f64 (any_fpextend (f16 HPR:$Sm))),
                   (VCVTBHD (COPY_TO_REGCLASS (f16 HPR:$Sm), SPR))>,
                   Requires<[HasFPARMv8, HasDPVFP]>;
-def : FP16Pat<(f64 (f16_to_fp GPR:$a)),
+def : FP16Pat<(f64 (any_f16_to_fp GPR:$a)),
               (VCVTBHD (COPY_TO_REGCLASS GPR:$a, SPR))>,
               Requires<[HasFPARMv8, HasDPVFP]>;
 
@@ -917,7 +917,7 @@ def VCVTBDH : ADuI<0b11101, 0b11, 0b0011, 0b01, 0,
 def : FullFP16Pat<(f16 (any_fpround DPR:$Dm)),
                   (COPY_TO_REGCLASS (VCVTBDH (IMPLICIT_DEF), DPR:$Dm), HPR)>,
                   Requires<[HasFPARMv8, HasDPVFP]>;
-def : FP16Pat<(fp_to_f16 (f64 DPR:$a)),
+def : FP16Pat<(any_fp_to_f16 (f64 DPR:$a)),
               (i32 (COPY_TO_REGCLASS (VCVTBDH (IMPLICIT_DEF), DPR:$a), GPR))>,
                    Requires<[HasFPARMv8, HasDPVFP]>;
 
diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
index fdb0ec40cb41f..bdaf9c5e7105c 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
@@ -66,6 +66,11 @@ extern cl::opt<bool> EnableMaskedGatherScatters;
 
 extern cl::opt<unsigned> MVEMaxSupportedInterleaveFactor;
 
+static cl::opt<int> ArmForceUnrollThreshold(
+    "arm-force-unroll-threshold", cl::init(12), cl::Hidden,
+    cl::desc(
+        "Threshold for forced unrolling of small loops in Arm architecture"));
+
 /// Convert a vector load intrinsic into a simple llvm load instruction.
 /// This is beneficial when the underlying object being addressed comes
 /// from a constant, since we get constant-folding for free.
@@ -1694,13 +1699,19 @@ InstructionCost ARMTTIImpl::getInterleavedMemoryOpCost(
                                            UseMaskForCond, UseMaskForGaps);
 }
 
-InstructionCost ARMTTIImpl::getGatherScatterOpCost(
-    unsigned Opcode, Type *DataTy, const Value *Ptr, bool VariableMask,
-    Align Alignment, TTI::TargetCostKind CostKind, const Instruction *I) const {
+InstructionCost
+ARMTTIImpl::getGatherScatterOpCost(const MemIntrinsicCostAttributes &MICA,
+                                   TTI::TargetCostKind CostKind) const {
+
+  Type *DataTy = MICA.getDataType();
+  const Value *Ptr = MICA.getPointer();
+  bool VariableMask = MICA.getVariableMask();
+  Align Alignment = MICA.getAlignment();
+  const Instruction *I = MICA.getInst();
+
   using namespace PatternMatch;
   if (!ST->hasMVEIntegerOps() || !EnableMaskedGatherScatters)
-    return BaseT::getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
-                                         Alignment, CostKind, I);
+    return BaseT::getGatherScatterOpCost(MICA, CostKind);
 
   assert(DataTy->isVectorTy() && "Can't do gather/scatters on scalar!");
   auto *VTy = cast<FixedVectorType>(DataTy);
@@ -2731,7 +2742,7 @@ void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
 
   // Force unrolling small loops can be very useful because of the branch
   // taken cost of the backedge.
-  if (Cost < 12)
+  if (Cost < ArmForceUnrollThreshold)
     UP.Force = true;
 }
 
diff --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
index 30f2151b41239..e5f9dd5fe26d9 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.h
@@ -288,10 +288,8 @@ class ARMTTIImpl final : public BasicTTIImplBase<ARMTTIImpl> {
       bool UseMaskForCond = false, bool UseMaskForGaps = false) const override;
 
   InstructionCost
-  getGatherScatterOpCost(unsigned Opcode, Type *DataTy, const Value *Ptr,
-                         bool VariableMask, Align Alignment,
-                         TTI::TargetCostKind CostKind,
-                         const Instruction *I = nullptr) const override;
+  getGatherScatterOpCost(const MemIntrinsicCostAttributes &MICA,
+                         TTI::TargetCostKind CostKind) const override;
 
   InstructionCost
   getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,
diff --git a/llvm/lib/Target/BPF/BPF.td b/llvm/lib/Target/BPF/BPF.td
index 50f9793fb29a7..1fc364dad9988 100644
--- a/llvm/lib/Target/BPF/BPF.td
+++ b/llvm/lib/Target/BPF/BPF.td
@@ -34,10 +34,6 @@ def MisalignedMemAccess : SubtargetFeature<"allows-misaligned-mem-access",
                                            "AllowsMisalignedMemAccess", "true",
                                            "Allows misaligned memory access">;
 
-def AllowBuiltinCall : SubtargetFeature<"allow-builtin-calls",
-                                        "AllowBuiltinCalls", "true",
-                                        "Allow calls to builtin functions">;
-
 def : Proc<"generic", []>;
 def : Proc<"v1", []>;
 def : Proc<"v2", []>;
diff --git a/llvm/lib/Target/BPF/BPFISelLowering.cpp b/llvm/lib/Target/BPF/BPFISelLowering.cpp
index 4485c41b4c0fa..a8d1faa85116b 100644
--- a/llvm/lib/Target/BPF/BPFISelLowering.cpp
+++ b/llvm/lib/Target/BPF/BPFISelLowering.cpp
@@ -208,7 +208,6 @@ BPFTargetLowering::BPFTargetLowering(const TargetMachine &TM,
   HasMovsx = STI.hasMovsx();
 
   AllowsMisalignedMemAccess = STI.getAllowsMisalignedMemAccess();
-  AllowBuiltinCalls = STI.getAllowBuiltinCalls();
 }
 
 bool BPFTargetLowering::allowsMisalignedMemoryAccesses(EVT VT, unsigned, Align,
@@ -568,10 +567,9 @@ SDValue BPFTargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
   } else if (ExternalSymbolSDNode *E = dyn_cast<ExternalSymbolSDNode>(Callee)) {
     if (StringRef(E->getSymbol()) != BPF_TRAP) {
       Callee = DAG.getTargetExternalSymbol(E->getSymbol(), PtrVT, 0);
-      if (!AllowBuiltinCalls)
-        fail(CLI.DL, DAG,
-             Twine("A call to built-in function '" + StringRef(E->getSymbol()) +
-                   "' is not supported."));
+      fail(CLI.DL, DAG,
+           Twine("A call to built-in function '" + StringRef(E->getSymbol()) +
+                 "' is not supported."));
     }
   }
 
@@ -1198,18 +1196,3 @@ bool BPFTargetLowering::isLegalAddressingMode(const DataLayout &DL,
 
   return true;
 }
-
-bool BPFTargetLowering::shouldSignExtendTypeInLibCall(Type *Ty,
-                                                      bool IsSigned) const {
-  return IsSigned || Ty->isIntegerTy(32);
-}
-
-bool BPFTargetLowering::CanLowerReturn(
-    CallingConv::ID CallConv, MachineFunction &MF, bool IsVarArg,
-    const SmallVectorImpl<ISD::OutputArg> &Outs, LLVMContext &Context,
-    const Type *RetTy) const {
-  // At minimal return Outs.size() <= 1, or check valid types in CC.
-  SmallVector<CCValAssign, 16> RVLocs;
-  CCState CCInfo(CallConv, IsVarArg, MF, RVLocs, Context);
-  return CCInfo.CheckReturn(Outs, getHasAlu32() ? RetCC_BPF32 : RetCC_BPF64);
-}
\ No newline at end of file
diff --git a/llvm/lib/Target/BPF/BPFISelLowering.h b/llvm/lib/Target/BPF/BPFISelLowering.h
index a5036e31cb61d..8607e4f8c9e69 100644
--- a/llvm/lib/Target/BPF/BPFISelLowering.h
+++ b/llvm/lib/Target/BPF/BPFISelLowering.h
@@ -68,8 +68,6 @@ class BPFTargetLowering : public TargetLowering {
   // Allows Misalignment
   bool AllowsMisalignedMemAccess;
 
-  bool AllowBuiltinCalls;
-
   SDValue LowerSDIVSREM(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerDYNAMIC_STACKALLOC(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerBR_CC(SDValue Op, SelectionDAG &DAG) const;
@@ -165,14 +163,6 @@ class BPFTargetLowering : public TargetLowering {
   MachineBasicBlock *
   EmitInstrWithCustomInserterLDimm64(MachineInstr &MI,
                                      MachineBasicBlock *BB) const;
-
-  // Returns true if arguments should be sign-extended in lib calls.
-  bool shouldSignExtendTypeInLibCall(Type *Ty, bool IsSigned) const override;
-
-  bool CanLowerReturn(CallingConv::ID CallConv, MachineFunction &MF,
-                      bool IsVarArg,
-                      const SmallVectorImpl<ISD::OutputArg> &Outs,
-                      LLVMContext &Context, const Type *RetTy) const override;
 };
 }
 
diff --git a/llvm/lib/Target/BPF/BPFSubtarget.cpp b/llvm/lib/Target/BPF/BPFSubtarget.cpp
index 77a1a5fe7444c..726f8f4b39827 100644
--- a/llvm/lib/Target/BPF/BPFSubtarget.cpp
+++ b/llvm/lib/Target/BPF/BPFSubtarget.cpp
@@ -70,7 +70,6 @@ void BPFSubtarget::initializeEnvironment() {
   HasLoadAcqStoreRel = false;
   HasGotox = false;
   AllowsMisalignedMemAccess = false;
-  AllowBuiltinCalls = false;
 }
 
 void BPFSubtarget::initSubtargetFeatures(StringRef CPU, StringRef FS) {
diff --git a/llvm/lib/Target/BPF/BPFSubtarget.h b/llvm/lib/Target/BPF/BPFSubtarget.h
index 40751fc9b7454..24eff862224b0 100644
--- a/llvm/lib/Target/BPF/BPFSubtarget.h
+++ b/llvm/lib/Target/BPF/BPFSubtarget.h
@@ -70,8 +70,6 @@ class BPFSubtarget : public BPFGenSubtargetInfo {
   bool HasLdsx, HasMovsx, HasBswap, HasSdivSmod, HasGotol, HasStoreImm,
       HasLoadAcqStoreRel, HasGotox;
 
-  bool AllowBuiltinCalls;
-
   std::unique_ptr<CallLowering> CallLoweringInfo;
   std::unique_ptr<InstructionSelector> InstSelector;
   std::unique_ptr<LegalizerInfo> Legalizer;
@@ -103,7 +101,6 @@ class BPFSubtarget : public BPFGenSubtargetInfo {
   bool hasStoreImm() const { return HasStoreImm; }
   bool hasLoadAcqStoreRel() const { return HasLoadAcqStoreRel; }
   bool hasGotox() const { return HasGotox; }
-  bool getAllowBuiltinCalls() const { return AllowBuiltinCalls; }
 
   bool isLittleEndian() const { return IsLittleEndian; }
 
diff --git a/llvm/lib/Target/Hexagon/HexagonISelLowering.cpp b/llvm/lib/Target/Hexagon/HexagonISelLowering.cpp
index 5767a74513e8d..bae9d705f5a7a 100644
--- a/llvm/lib/Target/Hexagon/HexagonISelLowering.cpp
+++ b/llvm/lib/Target/Hexagon/HexagonISelLowering.cpp
@@ -2115,7 +2115,7 @@ static Value *getUnderLyingObjectForBrevLdIntr(Value *V) {
 /// true and store the intrinsic information into the IntrinsicInfo that was
 /// passed to the function.
 bool HexagonTargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
-                                               const CallInst &I,
+                                               const CallBase &I,
                                                MachineFunction &MF,
                                                unsigned Intrinsic) const {
   switch (Intrinsic) {
diff --git a/llvm/lib/Target/Hexagon/HexagonISelLowering.h b/llvm/lib/Target/Hexagon/HexagonISelLowering.h
index f4d2a79051c10..cde8b5ba8d8a7 100644
--- a/llvm/lib/Target/Hexagon/HexagonISelLowering.h
+++ b/llvm/lib/Target/Hexagon/HexagonISelLowering.h
@@ -145,7 +145,7 @@ class HexagonTargetLowering : public TargetLowering {
       const SmallVectorImpl<SDValue> &OutVals,
       const SmallVectorImpl<ISD::InputArg> &Ins, SelectionDAG& DAG) const;
 
-  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallInst &I,
+  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallBase &I,
                           MachineFunction &MF,
                           unsigned Intrinsic) const override;
 
diff --git a/llvm/lib/Target/Hexagon/HexagonISelLoweringHVX.cpp b/llvm/lib/Target/Hexagon/HexagonISelLoweringHVX.cpp
index 212a57bc7cde5..0b782d79237da 100644
--- a/llvm/lib/Target/Hexagon/HexagonISelLoweringHVX.cpp
+++ b/llvm/lib/Target/Hexagon/HexagonISelLoweringHVX.cpp
@@ -31,6 +31,10 @@ static cl::opt<unsigned> HvxWidenThreshold("hexagon-hvx-widen",
   cl::Hidden, cl::init(16),
   cl::desc("Lower threshold (in bytes) for widening to HVX vectors"));
 
+static cl::opt<bool>
+    EnableFpFastConvert("hexagon-fp-fast-convert", cl::Hidden, cl::init(false),
+                        cl::desc("Enable FP fast conversion routine."));
+
 static const MVT LegalV64[] =  { MVT::v64i8,  MVT::v32i16,  MVT::v16i32 };
 static const MVT LegalW64[] =  { MVT::v128i8, MVT::v64i16,  MVT::v32i32 };
 static const MVT LegalV128[] = { MVT::v128i8, MVT::v64i16,  MVT::v32i32 };
@@ -2970,6 +2974,32 @@ HexagonTargetLowering::ExpandHvxFpToInt(SDValue Op, SelectionDAG &DAG) const {
   MVT ResTy = ty(Op);
   assert(InpTy.changeTypeToInteger() == ResTy);
 
+  // For now, this is an experiment behind a flag.
+  // On architectures before V81 the rounding mode is round-to-nearest.
+  // The C and C++ standards require rounding toward zero:
+  // C (C99 and later): ISO/IEC 9899:2018 (C18), section 6.3.1.4 — "When a
+  // finite value of real floating type is converted to an integer type, the
+  // fractional part is discarded (i.e., the value is truncated toward zero)."
+  // C++: ISO/IEC 14882:2020 (C++20), section 7.3.7 — "A prvalue of a
+  // floating-point type can be converted to a prvalue of an integer type. The
+  // conversion truncates; that is, the fractional part is discarded."
+  if (InpTy == MVT::v64f16) {
+    if (Subtarget.useHVXV81Ops()) {
+      // This is c/c++ compliant
+      SDValue ConvVec =
+          getInstr(Hexagon::V6_vconv_h_hf_rnd, dl, ResTy, {Op0}, DAG);
+      return ConvVec;
+    } else if (EnableFpFastConvert) {
+      // Vd32.h=Vu32.hf same as Q6_Vh_equals_Vhf
+      SDValue ConvVec = getInstr(Hexagon::V6_vconv_h_hf, dl, ResTy, {Op0}, DAG);
+      return ConvVec;
+    }
+  } else if (EnableFpFastConvert && InpTy == MVT::v32f32) {
+    // Vd32.w=Vu32.sf same as Q6_Vw_equals_Vsf
+    SDValue ConvVec = getInstr(Hexagon::V6_vconv_w_sf, dl, ResTy, {Op0}, DAG);
+    return ConvVec;
+  }
+
   // int32_t conv_f32_to_i32(uint32_t inp) {
   //   // s | exp8 | frac23
   //
diff --git a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp
index 3f84cbb6555ed..59c6201e07081 100644
--- a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp
+++ b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.cpp
@@ -223,12 +223,6 @@ InstructionCost HexagonTTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src,
                                 OpInfo, I);
 }
 
-InstructionCost
-HexagonTTIImpl::getMaskedMemoryOpCost(const MemIntrinsicCostAttributes &MICA,
-                                      TTI::TargetCostKind CostKind) const {
-  return BaseT::getMaskedMemoryOpCost(MICA, CostKind);
-}
-
 InstructionCost
 HexagonTTIImpl::getShuffleCost(TTI::ShuffleKind Kind, VectorType *DstTy,
                                VectorType *SrcTy, ArrayRef<int> Mask,
@@ -238,13 +232,6 @@ HexagonTTIImpl::getShuffleCost(TTI::ShuffleKind Kind, VectorType *DstTy,
   return 1;
 }
 
-InstructionCost HexagonTTIImpl::getGatherScatterOpCost(
-    unsigned Opcode, Type *DataTy, const Value *Ptr, bool VariableMask,
-    Align Alignment, TTI::TargetCostKind CostKind, const Instruction *I) const {
-  return BaseT::getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
-                                       Alignment, CostKind, I);
-}
-
 InstructionCost HexagonTTIImpl::getInterleavedMemoryOpCost(
     unsigned Opcode, Type *VecTy, unsigned Factor, ArrayRef<unsigned> Indices,
     Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h
index 67388984bb3e3..edf88cf476f6d 100644
--- a/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h
+++ b/llvm/lib/Target/Hexagon/HexagonTargetTransformInfo.h
@@ -120,18 +120,10 @@ class HexagonTTIImpl final : public BasicTTIImplBase<HexagonTTIImpl> {
       TTI::OperandValueInfo OpInfo = {TTI::OK_AnyValue, TTI::OP_None},
       const Instruction *I = nullptr) const override;
   InstructionCost
-  getMaskedMemoryOpCost(const MemIntrinsicCostAttributes &MICA,
-                        TTI::TargetCostKind CostKind) const override;
-  InstructionCost
   getShuffleCost(TTI::ShuffleKind Kind, VectorType *DstTy, VectorType *SrcTy,
                  ArrayRef<int> Mask, TTI::TargetCostKind CostKind, int Index,
                  VectorType *SubTp, ArrayRef<const Value *> Args = {},
                  const Instruction *CxtI = nullptr) const override;
-  InstructionCost getGatherScatterOpCost(unsigned Opcode, Type *DataTy,
-                                         const Value *Ptr, bool VariableMask,
-                                         Align Alignment,
-                                         TTI::TargetCostKind CostKind,
-                                         const Instruction *I) const override;
   InstructionCost getInterleavedMemoryOpCost(
       unsigned Opcode, Type *VecTy, unsigned Factor, ArrayRef<unsigned> Indices,
       Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp b/llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
index ba9d0682b26dd..32ea2198f7898 100644
--- a/llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
+++ b/llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
@@ -8912,7 +8912,7 @@ bool LoongArchTargetLowering::hasAndNot(SDValue Y) const {
 }
 
 bool LoongArchTargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
-                                                 const CallInst &I,
+                                                 const CallBase &I,
                                                  MachineFunction &MF,
                                                  unsigned Intrinsic) const {
   switch (Intrinsic) {
diff --git a/llvm/lib/Target/LoongArch/LoongArchISelLowering.h b/llvm/lib/Target/LoongArch/LoongArchISelLowering.h
index 0c09fb6afd2d1..5277e7e3e74ca 100644
--- a/llvm/lib/Target/LoongArch/LoongArchISelLowering.h
+++ b/llvm/lib/Target/LoongArch/LoongArchISelLowering.h
@@ -78,7 +78,7 @@ class LoongArchTargetLowering : public TargetLowering {
                                           Value *NewVal, Value *Mask,
                                           AtomicOrdering Ord) const override;
 
-  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallInst &I,
+  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallBase &I,
                           MachineFunction &MF,
                           unsigned Intrinsic) const override;
 
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 8b72b1e1f3a52..5081a093d4c34 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -4077,9 +4077,10 @@ void NVPTXTargetLowering::LowerAsmOperandForConstraint(
 // because we need the information that is only available in the "Value" type
 // of destination
 // pointer. In particular, the address space information.
-bool NVPTXTargetLowering::getTgtMemIntrinsic(
-    IntrinsicInfo &Info, const CallInst &I,
-    MachineFunction &MF, unsigned Intrinsic) const {
+bool NVPTXTargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
+                                             const CallBase &I,
+                                             MachineFunction &MF,
+                                             unsigned Intrinsic) const {
   switch (Intrinsic) {
   default:
     return false;
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
index dd8e49de7aa6a..cb0a1aa5dc892 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -32,7 +32,7 @@ class NVPTXTargetLowering : public TargetLowering {
                                const NVPTXSubtarget &STI);
   SDValue LowerOperation(SDValue Op, SelectionDAG &DAG) const override;
 
-  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallInst &I,
+  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallBase &I,
                           MachineFunction &MF,
                           unsigned Intrinsic) const override;
 
diff --git a/llvm/lib/Target/PowerPC/MCTargetDesc/PPCAsmBackend.cpp b/llvm/lib/Target/PowerPC/MCTargetDesc/PPCAsmBackend.cpp
index 558351b515a2e..72a5e60a01d87 100644
--- a/llvm/lib/Target/PowerPC/MCTargetDesc/PPCAsmBackend.cpp
+++ b/llvm/lib/Target/PowerPC/MCTargetDesc/PPCAsmBackend.cpp
@@ -25,7 +25,25 @@
 #include "llvm/Support/ErrorHandling.h"
 using namespace llvm;
 
-static uint64_t adjustFixupValue(unsigned Kind, uint64_t Value) {
+static uint64_t adjustFixupValue(MCContext &Ctx, const MCFixup &Fixup,
+                                 unsigned Kind, uint64_t Value) {
+  auto checkBrFixup = [&](unsigned Bits) {
+    int64_t SVal = int64_t(Value);
+    if ((Value & 3) != 0) {
+      Ctx.reportError(Fixup.getLoc(), "branch target not a multiple of four (" +
+                                          Twine(SVal) + ")");
+      return;
+    }
+
+    // Low two bits are not encoded.
+    if (!isIntN(Bits + 2, Value)) {
+      Ctx.reportError(Fixup.getLoc(), "branch target out of range (" +
+                                          Twine(SVal) + " not between " +
+                                          Twine(minIntN(Bits) * 4) + " and " +
+                                          Twine(maxIntN(Bits) * 4) + ")");
+    }
+  };
+
   switch (Kind) {
   default:
     llvm_unreachable("Unknown fixup kind!");
@@ -37,10 +55,12 @@ static uint64_t adjustFixupValue(unsigned Kind, uint64_t Value) {
     return Value;
   case PPC::fixup_ppc_brcond14:
   case PPC::fixup_ppc_brcond14abs:
+    checkBrFixup(14);
     return Value & 0xfffc;
   case PPC::fixup_ppc_br24:
   case PPC::fixup_ppc_br24abs:
   case PPC::fixup_ppc_br24_notoc:
+    checkBrFixup(24);
     return Value & 0x3fffffc;
   case PPC::fixup_ppc_half16:
     return Value & 0xffff;
@@ -211,7 +231,7 @@ void PPCAsmBackend::applyFixup(const MCFragment &F, const MCFixup &Fixup,
   MCFixupKind Kind = Fixup.getKind();
   if (mc::isRelocation(Kind))
     return;
-  Value = adjustFixupValue(Kind, Value);
+  Value = adjustFixupValue(getContext(), Fixup, Kind, Value);
   if (!Value)
     return; // Doesn't change encoding.
 
diff --git a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
index 1a9310c46cd1d..51212837fbb17 100644
--- a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
+++ b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
@@ -18495,7 +18495,7 @@ PPCTargetLowering::isOffsetFoldingLegal(const GlobalAddressSDNode *GA) const {
 }
 
 bool PPCTargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
-                                           const CallInst &I,
+                                           const CallBase &I,
                                            MachineFunction &MF,
                                            unsigned Intrinsic) const {
   switch (Intrinsic) {
diff --git a/llvm/lib/Target/PowerPC/PPCISelLowering.h b/llvm/lib/Target/PowerPC/PPCISelLowering.h
index 74af055ed5d30..daae839479c3c 100644
--- a/llvm/lib/Target/PowerPC/PPCISelLowering.h
+++ b/llvm/lib/Target/PowerPC/PPCISelLowering.h
@@ -492,8 +492,7 @@ namespace llvm {
 
     bool isOffsetFoldingLegal(const GlobalAddressSDNode *GA) const override;
 
-    bool getTgtMemIntrinsic(IntrinsicInfo &Info,
-                            const CallInst &I,
+    bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallBase &I,
                             MachineFunction &MF,
                             unsigned Intrinsic) const override;
 
diff --git a/llvm/lib/Target/RISCV/AsmParser/RISCVAsmParser.cpp b/llvm/lib/Target/RISCV/AsmParser/RISCVAsmParser.cpp
index 75ce1b144a2e7..9bb3724c96c11 100644
--- a/llvm/lib/Target/RISCV/AsmParser/RISCVAsmParser.cpp
+++ b/llvm/lib/Target/RISCV/AsmParser/RISCVAsmParser.cpp
@@ -4082,6 +4082,9 @@ bool RISCVAsmParser::processInstruction(MCInst &Inst, SMLoc IDLoc,
 
     return false;
   }
+  case RISCV::PseudoCV_ELW:
+    emitLoadStoreSymbol(Inst, RISCV::CV_ELW, IDLoc, Out, /*HasTmpReg=*/false);
+    return false;
   }
 
   emitToStreamer(Out, Inst);
diff --git a/llvm/lib/Target/RISCV/GISel/RISCVInstructionSelector.cpp b/llvm/lib/Target/RISCV/GISel/RISCVInstructionSelector.cpp
index 3d5a55c631301..1e5d0a4297465 100644
--- a/llvm/lib/Target/RISCV/GISel/RISCVInstructionSelector.cpp
+++ b/llvm/lib/Target/RISCV/GISel/RISCVInstructionSelector.cpp
@@ -1569,7 +1569,7 @@ bool RISCVInstructionSelector::selectAddr(MachineInstr &MI,
 
   switch (TM.getCodeModel()) {
   default: {
-    reportGISelFailure(*MF, *TPC, *MORE, getName(),
+    reportGISelFailure(*MF, *MORE, getName(),
                        "Unsupported code model for lowering", MI);
     return false;
   }
diff --git a/llvm/lib/Target/RISCV/GISel/RISCVLegalizerInfo.cpp b/llvm/lib/Target/RISCV/GISel/RISCVLegalizerInfo.cpp
index 1fba16d3d51c2..2cc594a33eb0d 100644
--- a/llvm/lib/Target/RISCV/GISel/RISCVLegalizerInfo.cpp
+++ b/llvm/lib/Target/RISCV/GISel/RISCVLegalizerInfo.cpp
@@ -1211,7 +1211,7 @@ bool RISCVLegalizerInfo::legalizeExtractSubvector(MachineInstr &MI,
   // to place the desired subvector starting at element 0.
   const LLT XLenTy(STI.getXLenVT());
   auto SlidedownAmt = MIB.buildVScale(XLenTy, RemIdx);
-  auto [Mask, VL] = buildDefaultVLOps(LitTy, MIB, MRI);
+  auto [Mask, VL] = buildDefaultVLOps(InterLitTy, MIB, MRI);
   uint64_t Policy = RISCVVType::TAIL_AGNOSTIC | RISCVVType::MASK_AGNOSTIC;
   auto Slidedown = MIB.buildInstr(
       RISCV::G_VSLIDEDOWN_VL, {InterLitTy},
diff --git a/llvm/lib/Target/RISCV/RISCVFeatures.td b/llvm/lib/Target/RISCV/RISCVFeatures.td
index bf1caafc2f9ba..0c75312847c87 100644
--- a/llvm/lib/Target/RISCV/RISCVFeatures.td
+++ b/llvm/lib/Target/RISCV/RISCVFeatures.td
@@ -1791,6 +1791,45 @@ def FeatureUnalignedVectorMem
                       "true", "Has reasonably performant unaligned vector "
                       "loads and stores">;
 
+// Assume that lock-free native-width atomics are available, even if the target
+// and operating system combination would not usually provide them. The user
+// is responsible for providing any necessary __sync implementations. Code
+// built with this feature is not ABI-compatible with code built without this
+// feature, if atomic variables are exposed across the ABI boundary.
+def FeatureForcedAtomics : SubtargetFeature<
+    "forced-atomics", "HasForcedAtomics", "true",
+    "Assume that lock-free native-width atomics are available">;
+def HasAtomicLdSt
+    : Predicate<"Subtarget->hasStdExtZalrsc() || Subtarget->hasForcedAtomics()">;
+
+// The RISC-V Unprivileged Architecture - ISA Volume 1 (Version: 20250508)
+// [https://docs.riscv.org/reference/isa/_attachments/riscv-unprivileged.pdf]
+// in section 13.3. Eventual Success of Store-Conditional Instructions, defines
+// _constrained_ LR/SC loops:
+//   The dynamic code executed between the LR and SC instructions can only
+//   contain instructions from the base ''I'' instruction set, excluding loads,
+//   stores, backward jumps, taken backward branches, JALR, FENCE, and SYSTEM
+//   instructions. Compressed forms of the aforementioned ''I'' instructions in
+//   the Zca and Zcb extensions are also permitted.
+// LR/SC loops that do not adhere to the above are _unconstrained_ LR/SC loops,
+// and success is implementation specific. For implementations which know that
+// non-base instructions (such as the ''B'' extension) will not violate any
+// forward progress guarantees, using these instructions to reduce the LR/SC
+// sequence length is desirable.
+def FeaturePermissiveZalrsc
+    : SubtargetFeature<
+          "permissive-zalrsc", "HasPermissiveZalrsc", "true",
+          "Implementation permits non-base instructions between LR/SC pairs">;
+
+def FeatureTaggedGlobals : SubtargetFeature<"tagged-globals",
+    "AllowTaggedGlobals",
+    "true", "Use an instruction sequence for taking the address of a global "
+    "that allows a memory tag in the upper address bits">;
+
+//===----------------------------------------------------------------------===//
+// Tuning features
+//===----------------------------------------------------------------------===//
+
 def TuneNLogNVRGather
    : SubtargetFeature<"log-vrgather", "RISCVVRGatherCostModel", "NLog2N",
                       "Has vrgather.vv with LMUL*log2(LMUL) latency">;
@@ -1850,23 +1889,44 @@ def TuneNoDefaultUnroll
     : SubtargetFeature<"no-default-unroll", "EnableDefaultUnroll", "false",
                        "Disable default unroll preference.">;
 
-// SiFive 7 is able to fuse integer ALU operations with a preceding branch
-// instruction.
-def TuneShortForwardBranchOpt
-    : SubtargetFeature<"short-forward-branch-opt", "HasShortForwardBranchOpt",
-                       "true", "Enable short forward branch optimization">;
-def HasShortForwardBranchOpt : Predicate<"Subtarget->hasShortForwardBranchOpt()">;
-def NoShortForwardBranchOpt : Predicate<"!Subtarget->hasShortForwardBranchOpt()">;
+// Many microarchitectures are able to fuse a branch over a single instruction
+// with the branched-over instruction. We call this fusion "short forward
+// branches".
+//
+// This can be done for a variety of instruction groups, depending on the
+// microarchitecture. We broadly group these by their scheduler class:
+// - IALU: RVI integer instructions, plus ANDN/ORN/XNOR (Zbb/Zbkb)
+// - IMinMax: Zbb MIN(U)/MAX(U)
+// - IMul: MUL
+//
+// We make the simplifying assumption that any microarchitecture implementing
+// any "short forward branches" can do the IALU fusions, and can opt into
+// the other fusions it implements.
+//
+// The main pseudo instruction used by all of these fusions requires the IALU
+// short forward branches.
+//
+// Vendor-specific short-forward-branch optimizations may be added under
+// IALU, since the vendor-specific instructions should only be enabled for
+// the corresponding vendor cores.
+def TuneShortForwardBranchIALU
+    : SubtargetFeature<"short-forward-branch-ialu", "HasShortForwardBranchIALU",
+                       "true", "Enable short forward branch optimization for RVI base instructions">;
+def HasShortForwardBranchIALU : Predicate<"Subtarget->hasShortForwardBranchIALU()">;
+def NoShortForwardBranch : Predicate<"!Subtarget->hasShortForwardBranchIALU()">;
 
 def TuneShortForwardBranchIMinMax
-    : SubtargetFeature<"short-forward-branch-i-minmax", "HasShortForwardBranchIMinMax",
-                       "true", "Enable short forward branch optimization for min,max instructions in Zbb",
-                       [TuneShortForwardBranchOpt]>;
+    : SubtargetFeature<"short-forward-branch-iminmax", "HasShortForwardBranchIMinMax",
+                       "true", "Enable short forward branch optimization for MIN,MAX instructions in Zbb",
+                       [TuneShortForwardBranchIALU]>;
+def HasShortForwardBranchIMinMax : Predicate<"Subtarget->hasShortForwardBranchIMinMax()">;
 
 def TuneShortForwardBranchIMul
-    : SubtargetFeature<"short-forward-branch-i-mul", "HasShortForwardBranchIMul",
-                       "true", "Enable short forward branch optimization for mul instruction",
-                       [TuneShortForwardBranchOpt]>;
+    : SubtargetFeature<"short-forward-branch-imul", "HasShortForwardBranchIMul",
+                       "true", "Enable short forward branch optimization for MUL instruction",
+                       [TuneShortForwardBranchIALU]>;
+def HasShortForwardBranchIMul : Predicate<"Subtarget->hasShortForwardBranchIMul()">;
+
 
 // Some subtargets require a S2V transfer buffer to move scalars into vectors.
 // FIXME: Forming .vx/.vf/.wx/.wf can reduce register pressure.
@@ -1890,19 +1950,6 @@ def TuneHasSingleElementVecFP64
                        "Certain vector FP64 operations produce a single result "
                        "element per cycle">;
 
-def TuneMIPSP8700
-    : SubtargetFeature<"mips-p8700", "RISCVProcFamily", "MIPSP8700",
-                       "MIPS p8700 processor">;
-
-def TuneSiFive7 : SubtargetFeature<"sifive7", "RISCVProcFamily", "SiFive7",
-                                   "SiFive 7-Series processors">;
-
-def TuneVentanaVeyron : SubtargetFeature<"ventana-veyron", "RISCVProcFamily", "VentanaVeyron",
-                                         "Ventana Veyron-Series processors">;
-
-def TuneAndes45 : SubtargetFeature<"andes45", "RISCVProcFamily", "Andes45",
-                                   "Andes 45-Series processors">;
-
 def TuneVXRMPipelineFlush : SubtargetFeature<"vxrm-pipeline-flush", "HasVXRMPipelineFlush",
                                              "true", "VXRM writes causes pipeline flush">;
 
@@ -1912,37 +1959,20 @@ def TunePreferVsetvliOverReadVLENB
                        "true",
                        "Prefer vsetvli over read vlenb CSR to calculate VLEN">;
 
-// Assume that lock-free native-width atomics are available, even if the target
-// and operating system combination would not usually provide them. The user
-// is responsible for providing any necessary __sync implementations. Code
-// built with this feature is not ABI-compatible with code built without this
-// feature, if atomic variables are exposed across the ABI boundary.
-def FeatureForcedAtomics : SubtargetFeature<
-    "forced-atomics", "HasForcedAtomics", "true",
-    "Assume that lock-free native-width atomics are available">;
-def HasAtomicLdSt
-    : Predicate<"Subtarget->hasStdExtZalrsc() || Subtarget->hasForcedAtomics()">;
+//===----------------------------------------------------------------------===//
+// CPU Families (alphabetized by vendor).
+//===----------------------------------------------------------------------===//
 
-// The RISC-V Unprivileged Architecture - ISA Volume 1 (Version: 20250508)
-// [https://docs.riscv.org/reference/isa/_attachments/riscv-unprivileged.pdf]
-// in section 13.3. Eventual Success of Store-Conditional Instructions, defines
-// _constrained_ LR/SC loops:
-//   The dynamic code executed between the LR and SC instructions can only
-//   contain instructions from the base ''I'' instruction set, excluding loads,
-//   stores, backward jumps, taken backward branches, JALR, FENCE, and SYSTEM
-//   instructions. Compressed forms of the aforementioned ''I'' instructions in
-//   the Zca and Zcb extensions are also permitted.
-// LR/SC loops that do not adhere to the above are _unconstrained_ LR/SC loops,
-// and success is implementation specific. For implementations which know that
-// non-base instructions (such as the ''B'' extension) will not violate any
-// forward progress guarantees, using these instructions to reduce the LR/SC
-// sequence length is desirable.
-def FeaturePermissiveZalrsc
-    : SubtargetFeature<
-          "permissive-zalrsc", "HasPermissiveZalrsc", "true",
-          "Implementation permits non-base instructions between LR/SC pairs">;
+def TuneAndes45 : SubtargetFeature<"andes45", "RISCVProcFamily", "Andes45",
+                                   "Andes 45-Series processors">;
+
+def TuneMIPSP8700
+    : SubtargetFeature<"mips-p8700", "RISCVProcFamily", "MIPSP8700",
+                       "MIPS p8700 processor">;
+
+def TuneSiFive7 : SubtargetFeature<"sifive7", "RISCVProcFamily", "SiFive7",
+                                   "SiFive 7-Series processors">;
+
+def TuneVentanaVeyron : SubtargetFeature<"ventana-veyron", "RISCVProcFamily", "VentanaVeyron",
+                                         "Ventana Veyron-Series processors">;
 
-def FeatureTaggedGlobals : SubtargetFeature<"tagged-globals",
-    "AllowTaggedGlobals",
-    "true", "Use an instruction sequence for taking the address of a global "
-    "that allows a memory tag in the upper address bits">;
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index be53f51afe79f..ab2652eac3823 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -453,7 +453,7 @@ RISCVTargetLowering::RISCVTargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::ABS, XLenVT, Legal);
     if (Subtarget.is64Bit())
       setOperationAction(ISD::ABS, MVT::i32, Custom);
-  } else if (Subtarget.hasShortForwardBranchOpt()) {
+  } else if (Subtarget.hasShortForwardBranchIALU()) {
     // We can use PseudoCCSUB to implement ABS.
     setOperationAction(ISD::ABS, XLenVT, Legal);
   } else if (Subtarget.is64Bit()) {
@@ -1868,7 +1868,7 @@ bool RISCVTargetLowering::shouldExpandCttzElements(EVT VT) const {
 }
 
 bool RISCVTargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
-                                             const CallInst &I,
+                                             const CallBase &I,
                                              MachineFunction &MF,
                                              unsigned Intrinsic) const {
   auto &DL = I.getDataLayout();
@@ -9509,7 +9509,7 @@ static SDValue lowerSelectToBinOp(SDNode *N, SelectionDAG &DAG,
 static SDValue
 foldBinOpIntoSelectIfProfitable(SDNode *BO, SelectionDAG &DAG,
                                 const RISCVSubtarget &Subtarget) {
-  if (Subtarget.hasShortForwardBranchOpt())
+  if (Subtarget.hasShortForwardBranchIALU())
     return SDValue();
 
   unsigned SelOpNo = 0;
@@ -9584,6 +9584,50 @@ SDValue RISCVTargetLowering::lowerSELECT(SDValue Op, SelectionDAG &DAG) const {
   if (SDValue V = lowerSelectToBinOp(Op.getNode(), DAG, Subtarget))
     return V;
 
+  // When moves between GPRs and FPRs are free (Zfinx), we can use a Zicond
+  // select for floating-point values, since CondV is an integer.
+  bool FPinGPR = Subtarget.hasStdExtZfinx();
+
+  // We can only handle FP values that fit in a single GPR (no hi/lo split).
+  bool FitsInGPR = TypeSize::isKnownLE(VT.getSizeInBits(),
+                                       Subtarget.getXLenVT().getSizeInBits());
+
+  bool UseZicondForFPSel = Subtarget.hasStdExtZicond() && FPinGPR &&
+                           VT.isFloatingPoint() && FitsInGPR;
+
+  if (UseZicondForFPSel) {
+
+    auto CastToInt = [&](SDValue V) -> SDValue {
+      // Treat +0.0 as int 0 to enable single 'czero' instruction generation.
+      if (isNullFPConstant(V))
+        return DAG.getConstant(0, DL, XLenVT);
+
+      if (VT == MVT::f16)
+        return DAG.getNode(RISCVISD::FMV_X_ANYEXTH, DL, XLenVT, V);
+
+      if (VT == MVT::f32 && Subtarget.is64Bit())
+        return DAG.getNode(RISCVISD::FMV_X_ANYEXTW_RV64, DL, XLenVT, V);
+
+      return DAG.getBitcast(XLenVT, V);
+    };
+
+    SDValue TrueVInt = CastToInt(TrueV);
+    SDValue FalseVInt = CastToInt(FalseV);
+
+    // Emit integer SELECT (lowers to Zicond)
+    SDValue ResultInt =
+        DAG.getNode(ISD::SELECT, DL, XLenVT, CondV, TrueVInt, FalseVInt);
+
+    // Convert back to floating VT
+    if (VT == MVT::f32 && Subtarget.is64Bit())
+      return DAG.getNode(RISCVISD::FMV_W_X_RV64, DL, VT, ResultInt);
+
+    if (VT == MVT::f16)
+      return DAG.getNode(RISCVISD::FMV_H_X, DL, VT, ResultInt);
+
+    return DAG.getBitcast(VT, ResultInt);
+  }
+
   // When Zicond or XVentanaCondOps is present, emit CZERO_EQZ and CZERO_NEZ
   // nodes to implement the SELECT. Performing the lowering here allows for
   // greater control over when CZERO_{EQZ/NEZ} are used vs another branchless
@@ -10699,7 +10743,7 @@ SDValue RISCVTargetLowering::lowerEXTRACT_VECTOR_ELT(SDValue Op,
         VecVT != MVT::v4i8 && VecVT != MVT::v2i32)
       return SDValue();
     SDValue Extracted = DAG.getBitcast(XLenVT, Vec);
-    unsigned ElemWidth = EltVT.getSizeInBits();
+    unsigned ElemWidth = VecVT.getVectorElementType().getSizeInBits();
     SDValue Shamt = DAG.getNode(ISD::MUL, DL, XLenVT, Idx,
                                 DAG.getConstant(ElemWidth, DL, XLenVT));
     return DAG.getNode(ISD::SRL, DL, XLenVT, Extracted, Shamt);
@@ -12739,10 +12783,7 @@ SDValue RISCVTargetLowering::lowerVECTOR_INTERLEAVE(SDValue Op,
 
     SmallVector<SDValue, 8> Loads(Factor);
 
-    SDValue Increment =
-        DAG.getVScale(DL, PtrVT,
-                      APInt(PtrVT.getFixedSizeInBits(),
-                            VecVT.getStoreSize().getKnownMinValue()));
+    SDValue Increment = DAG.getTypeSize(DL, PtrVT, VecVT.getStoreSize());
     for (unsigned i = 0; i != Factor; ++i) {
       if (i != 0)
         StackPtr = DAG.getNode(ISD::ADD, DL, PtrVT, StackPtr, Increment);
@@ -14140,9 +14181,8 @@ RISCVTargetLowering::lowerVPReverseExperimental(SDValue Op,
 
       // Slide off any elements from past EVL that were reversed into the low
       // elements.
-      unsigned MinElts = GatherVT.getVectorMinNumElements();
       SDValue VLMax =
-          DAG.getVScale(DL, XLenVT, APInt(XLenVT.getSizeInBits(), MinElts));
+          DAG.getElementCount(DL, XLenVT, GatherVT.getVectorElementCount());
       SDValue Diff = DAG.getNode(ISD::SUB, DL, XLenVT, VLMax, EVL);
 
       Result = getVSlidedown(DAG, Subtarget, DL, GatherVT,
@@ -20973,7 +21013,7 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
 
     // (select (x < 0), y, z)  -> x >> (XLEN - 1) & (y - z) + z
     // (select (x >= 0), y, z) -> x >> (XLEN - 1) & (z - y) + y
-    if (!Subtarget.hasShortForwardBranchOpt() && isa<ConstantSDNode>(TrueV) &&
+    if (!Subtarget.hasShortForwardBranchIALU() && isa<ConstantSDNode>(TrueV) &&
         isa<ConstantSDNode>(FalseV) && isNullConstant(RHS) &&
         (CCVal == ISD::CondCode::SETLT || CCVal == ISD::CondCode::SETGE)) {
       if (CCVal == ISD::CondCode::SETGE)
@@ -25364,6 +25404,22 @@ bool RISCVTargetLowering::isLegalStridedLoadStore(EVT DataType,
   return true;
 }
 
+bool RISCVTargetLowering::isLegalFirstFaultLoad(EVT DataType,
+                                                Align Alignment) const {
+  if (!Subtarget.hasVInstructions())
+    return false;
+
+  EVT ScalarType = DataType.getScalarType();
+  if (!isLegalElementTypeForRVV(ScalarType))
+    return false;
+
+  if (!Subtarget.enableUnalignedVectorMem() &&
+      Alignment < ScalarType.getStoreSize())
+    return false;
+
+  return true;
+}
+
 MachineInstr *
 RISCVTargetLowering::EmitKCFICheck(MachineBasicBlock &MBB,
                                    MachineBasicBlock::instr_iterator &MBBI,
@@ -25551,7 +25607,7 @@ RISCVTargetLowering::BuildSDIVPow2(SDNode *N, const APInt &Divisor,
     return SDValue(N, 0); // Lower SDIV as SDIV
 
   // Only perform this transform if short forward branch opt is supported.
-  if (!Subtarget.hasShortForwardBranchOpt())
+  if (!Subtarget.hasShortForwardBranchIALU())
     return SDValue();
   EVT VT = N->getValueType(0);
   if (!(VT == MVT::i32 || (VT == MVT::i64 && Subtarget.is64Bit())))
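
A conceptual sketch, not the patch's actual DAG nodes, of what the new Zicond floating-point select path above does for an f32 value on rv64 with Zicond and Zfinx:

  // Illustration only, assuming rv64 with Zicond + Zfinx: the f32 value
  // already lives in a GPR, so the select is performed on its bit pattern
  // with an integer SELECT (lowered to czero.eqz/czero.nez) and the result
  // is reinterpreted as f32 again, avoiding a branch. +0.0 is treated as
  // integer 0 so a single czero instruction suffices.
  float select_or_zero(bool Cond, float X) {
    return Cond ? X : 0.0f;
  }
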
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.h b/llvm/lib/Target/RISCV/RISCVISelLowering.h
index 9b46936f195e6..8a55a5634452c 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.h
@@ -35,7 +35,7 @@ class RISCVTargetLowering : public TargetLowering {
 
   const RISCVSubtarget &getSubtarget() const { return Subtarget; }
 
-  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallInst &I,
+  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallBase &I,
                           MachineFunction &MF,
                           unsigned Intrinsic) const override;
   bool isLegalAddressingMode(const DataLayout &DL, const AddrMode &AM, Type *Ty,
@@ -429,6 +429,10 @@ class RISCVTargetLowering : public TargetLowering {
   /// alignment is legal.
   bool isLegalStridedLoadStore(EVT DataType, Align Alignment) const;
 
+  /// Return true if a fault-only-first load of the given result type and
+  /// alignment is legal.
+  bool isLegalFirstFaultLoad(EVT DataType, Align Alignment) const;
+
   unsigned getMaxSupportedInterleaveFactor() const override { return 8; }
 
   bool fallBackToDAGISel(const Instruction &Inst) const override;
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp
index 9d3663cb72ecd..89ec4a2a4a3e1 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp
@@ -1821,7 +1821,7 @@ bool RISCVInstrInfo::analyzeSelect(const MachineInstr &MI,
   Cond.push_back(MI.getOperand(2));
   Cond.push_back(MI.getOperand(3));
   // We can only fold when we support short forward branch opt.
-  Optimizable = STI.hasShortForwardBranchOpt();
+  Optimizable = STI.hasShortForwardBranchIALU();
   return false;
 }
 
@@ -1831,7 +1831,7 @@ RISCVInstrInfo::optimizeSelect(MachineInstr &MI,
                                bool PreferFalse) const {
   assert(MI.getOpcode() == RISCV::PseudoCCMOVGPR &&
          "Unknown select instruction");
-  if (!STI.hasShortForwardBranchOpt())
+  if (!STI.hasShortForwardBranchIALU())
     return nullptr;
 
   MachineRegisterInfo &MRI = MI.getParent()->getParent()->getRegInfo();
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoSFB.td b/llvm/lib/Target/RISCV/RISCVInstrInfoSFB.td
index 5b1c13493bbf2..6563cc27ecb76 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoSFB.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoSFB.td
@@ -10,7 +10,7 @@
 //
 //===----------------------------------------------------------------------===//
 
-let Predicates = [HasShortForwardBranchOpt], isSelect = 1,
+let Predicates = [HasShortForwardBranchIALU], isSelect = 1,
     Constraints = "$dst = $falsev", isCommutable = 1, Size = 8 in {
 // This instruction moves $truev to $dst when the condition is true. It will
 // be expanded to control flow in RISCVExpandPseudoInsts.
@@ -28,7 +28,7 @@ def PseudoCCMOVGPR : Pseudo<(outs GPR:$dst),
 
 // This should always expand to a branch+c.mv so the size is 6 or 4 if the
 // branch is compressible.
-let Predicates = [HasConditionalMoveFusion, NoShortForwardBranchOpt],
+let Predicates = [HasConditionalMoveFusion, NoShortForwardBranch],
     Constraints = "$dst = $falsev", isCommutable = 1, Size = 6 in {
 // This instruction moves $truev to $dst when the condition is true. It will
 // be expanded to control flow in RISCVExpandPseudoInsts.
@@ -108,7 +108,7 @@ class SFBShiftW_ri
 // is true. Returns $falsev otherwise. Selected by optimizeSelect.
 // TODO: Can we use DefaultOperands on the regular binop to accomplish this more
 // like how ARM does predication?
-let Predicates = [HasShortForwardBranchOpt] in {
+let Predicates = [HasShortForwardBranchIALU] in {
 def PseudoCCADD : SFBALU_rr;
 def PseudoCCSUB : SFBALU_rr;
 def PseudoCCSLL : SFBALU_rr;
@@ -117,11 +117,6 @@ def PseudoCCSRA : SFBALU_rr;
 def PseudoCCAND : SFBALU_rr;
 def PseudoCCOR  : SFBALU_rr;
 def PseudoCCXOR : SFBALU_rr;
-def PseudoCCMAX : SFBALU_rr;
-def PseudoCCMIN : SFBALU_rr;
-def PseudoCCMAXU : SFBALU_rr;
-def PseudoCCMINU : SFBALU_rr;
-def PseudoCCMUL : SFBALU_rr;
 
 def PseudoCCADDI : SFBALU_ri;
 def PseudoCCANDI : SFBALU_ri;
@@ -153,11 +148,21 @@ def PseudoCCORN  : SFBALU_rr;
 def PseudoCCXNOR : SFBALU_rr;
 }
 
-let Predicates = [HasShortForwardBranchOpt] in
+let Predicates = [HasShortForwardBranchIALU] in
 def : Pat<(XLenVT (abs GPR:$rs1)),
           (PseudoCCSUB (XLenVT GPR:$rs1), (XLenVT X0), /* COND_LT */ 2,
            (XLenVT GPR:$rs1), (XLenVT X0), (XLenVT GPR:$rs1))>;
-let Predicates = [HasShortForwardBranchOpt, IsRV64] in
+let Predicates = [HasShortForwardBranchIALU, IsRV64] in
 def : Pat<(sext_inreg (abs 33signbits_node:$rs1), i32),
           (PseudoCCSUBW (i64 GPR:$rs1), (i64 X0), /* COND_LT */ 2,
            (i64 GPR:$rs1), (i64 X0), (i64 GPR:$rs1))>;
+
+let Predicates = [HasShortForwardBranchIMinMax] in {
+def PseudoCCMAX : SFBALU_rr;
+def PseudoCCMIN : SFBALU_rr;
+def PseudoCCMAXU : SFBALU_rr;
+def PseudoCCMINU : SFBALU_rr;
+}
+
+let Predicates = [HasShortForwardBranchIMul] in
+def PseudoCCMUL : SFBALU_rr;
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td b/llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td
index eb3c9b0defccb..e36204c536c0d 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoVPseudos.td
@@ -2982,21 +2982,21 @@ multiclass VPseudoVFWALU_WV_WF_RM {
 multiclass VPseudoVMRG_VM_XM_IM {
   foreach m = MxList in {
     defvar mx = m.MX;
-    def "_VVM" # "_" # m.MX:
-      VPseudoTiedBinaryCarryIn<GetVRegNoV0<m.vrclass>.R,
-                               m.vrclass, m.vrclass, m>,
-      SchedBinary<"WriteVIMergeV", "ReadVIMergeV", "ReadVIMergeV", mx,
-                          forcePassthruRead=true>;
-    def "_VXM" # "_" # m.MX:
-      VPseudoTiedBinaryCarryIn<GetVRegNoV0<m.vrclass>.R,
-                               m.vrclass, GPR, m>,
-      SchedBinary<"WriteVIMergeX", "ReadVIMergeV", "ReadVIMergeX", mx,
-                          forcePassthruRead=true>;
-    def "_VIM" # "_" # m.MX:
-      VPseudoTiedBinaryCarryIn<GetVRegNoV0<m.vrclass>.R,
-                               m.vrclass, simm5, m>,
-      SchedUnary<"WriteVIMergeI", "ReadVIMergeV", mx,
-                          forcePassthruRead=true>;
+    def "_VVM"#"_"#m.MX : VPseudoTiedBinaryCarryIn<GetVRegNoV0<m.vrclass>.R,
+                                                   GetVRegNoV0<m.vrclass>.R,
+                                                   GetVRegNoV0<m.vrclass>.R, m>,
+        SchedBinary<"WriteVIMergeV", "ReadVIMergeV", "ReadVIMergeV", mx,
+                    forcePassthruRead = true>;
+    def "_VXM"#"_"#m.MX
+        : VPseudoTiedBinaryCarryIn<GetVRegNoV0<m.vrclass>.R,
+                                   GetVRegNoV0<m.vrclass>.R, GPR, m>,
+        SchedBinary<"WriteVIMergeX", "ReadVIMergeV", "ReadVIMergeX", mx,
+                    forcePassthruRead = true>;
+    def "_VIM"#"_"#m.MX
+        : VPseudoTiedBinaryCarryIn<GetVRegNoV0<m.vrclass>.R,
+                                   GetVRegNoV0<m.vrclass>.R, simm5, m>,
+        SchedUnary<"WriteVIMergeI", "ReadVIMergeV", mx,
+                   forcePassthruRead = true>;
   }
 }
 
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoXAndes.td b/llvm/lib/Target/RISCV/RISCVInstrInfoXAndes.td
index bbe3baef36bab..80aded388ae65 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoXAndes.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoXAndes.td
@@ -912,7 +912,7 @@ defm : VPatTernaryVD4DOT_VV<"int_riscv_nds_vd4dotsu", "PseudoNDS_VD4DOTSU",
 // Pseudo-instructions for SFB (Short Forward Branch)
 //===----------------------------------------------------------------------===//
 
-let Predicates = [HasShortForwardBranchOpt], hasSideEffects = 0,
+let Predicates = [HasShortForwardBranchIALU], hasSideEffects = 0,
     mayLoad = 0, mayStore = 0, Size = 8, Constraints = "$dst = $falsev" in {
 def PseudoCCNDS_BFOS : Pseudo<(outs GPR:$dst),
                               (ins GPR:$lhs, GPR:$rhs, cond_code:$cc,
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoXCV.td b/llvm/lib/Target/RISCV/RISCVInstrInfoXCV.td
index aa8f1a1108b6b..7abc616f03141 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoXCV.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoXCV.td
@@ -633,8 +633,9 @@ let Predicates = [HasVendorXCVmem, IsRV32] in {
   def CV_SW_rr : CVStore_rr<0b011, 0b0010110, "cv.sw">;
 }
 
-let Predicates = [HasVendorXCVelw, IsRV32], hasSideEffects = 0,
+let Predicates = [HasVendorXCVelw, IsRV32], hasSideEffects = 1,
     mayLoad = 1, mayStore = 0 in {
+  def PseudoCV_ELW : PseudoLoad<"cv.elw">;
   // Event load
   def CV_ELW : CVLoad_ri<0b011, "cv.elw">;
 }
@@ -706,6 +707,12 @@ let Predicates = [HasVendorXCVmem, IsRV32], AddedComplexity = 1 in {
   def : CVStrrPat<store, CV_SW_rr>;
 }
 
+let Predicates = [HasVendorXCVelw, IsRV32] in {
+  def : Pat<(int_riscv_cv_elw_elw (XLenVT GPR:$rs1)), (PseudoCV_ELW GPR:$rs1)>;
+  def : Pat<(int_riscv_cv_elw_elw (AddrRegImm (XLenVT GPR:$rs1), simm12_lo:$imm12)),
+            (CV_ELW GPR:$rs1, simm12_lo:$imm12)>;
+}
+
 multiclass PatCoreVBitManip<Intrinsic intr> {
   def : PatGprGpr<intr, !cast<RVInst>("CV_" # NAME # "R")>;
   def : Pat<(intr GPR:$rs1, cv_uimm10:$imm),
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoXqci.td b/llvm/lib/Target/RISCV/RISCVInstrInfoXqci.td
index 8a38fe2f5ae16..13ceead2d28b4 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoXqci.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoXqci.td
@@ -1330,7 +1330,7 @@ def PseudoQC_E_SH : PseudoStore<"qc.e.sh">;
 def PseudoQC_E_SW : PseudoStore<"qc.e.sw">;
 } // Predicates = [HasVendorXqcilo, IsRV32]
 
-let Predicates = [HasShortForwardBranchOpt] in {
+let Predicates = [HasShortForwardBranchIALU] in {
 def PseudoCCQC_LI : SFBQC_LI;
 def PseudoCCQC_E_LI : SFBQC_E_LI;
 }
@@ -1571,7 +1571,7 @@ def: Pat<(i32 (ctlz (not (i32 GPR:$rs1)))), (QC_CLO GPR:$rs1)>;
 let Predicates = [HasVendorXqciint, IsRV32] in
 def : Pat<(riscv_mileaveret_glue), (QC_C_MILEAVERET)>;
 
-let Predicates = [HasVendorXqcicm, NoShortForwardBranchOpt, IsRV32] in {
+let Predicates = [HasVendorXqcicm, NoShortForwardBranch, IsRV32] in {
 def : QCIMVCCPat<SETEQ,  QC_MVEQ>;
 def : QCIMVCCPat<SETNE,  QC_MVNE>;
 def : QCIMVCCPat<SETLT,  QC_MVLT>;
diff --git a/llvm/lib/Target/RISCV/RISCVProcessors.td b/llvm/lib/Target/RISCV/RISCVProcessors.td
index 07f6a38c77897..5becfd2ad502b 100644
--- a/llvm/lib/Target/RISCV/RISCVProcessors.td
+++ b/llvm/lib/Target/RISCV/RISCVProcessors.td
@@ -141,7 +141,7 @@ def ROCKET : RISCVTuneProcessorModel<"rocket",
                                      RocketModel>;
 
 defvar SiFive7TuneFeatures = [TuneSiFive7, TuneNoDefaultUnroll,
-                              TuneShortForwardBranchOpt,
+                              TuneShortForwardBranchIALU,
                               TunePostRAScheduler];
 def SIFIVE_7 : RISCVTuneProcessorModel<"sifive-7-series",
                                        SiFive7Model, SiFive7TuneFeatures>;
@@ -805,7 +805,7 @@ def ANDES_AX25 : RISCVProcessorModel<"andes-ax25",
 
 defvar Andes45TuneFeatures = [TuneAndes45,
                               TuneNoDefaultUnroll,
-                              TuneShortForwardBranchOpt,
+                              TuneShortForwardBranchIALU,
                               TunePostRAScheduler];
 
 def ANDES_45 : RISCVTuneProcessorModel<"andes-45-series",
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.h b/llvm/lib/Target/RISCV/RISCVSubtarget.h
index b659bb96f2f11..c16b23e290df1 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.h
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.h
@@ -208,7 +208,7 @@ class RISCVSubtarget : public RISCVGenSubtargetInfo {
   bool hasConditionalMoveFusion() const {
     // Do we support fusing a branch+mv or branch+c.mv as a conditional move.
     return (hasConditionalCompressedMoveFusion() && hasStdExtZca()) ||
-           hasShortForwardBranchOpt();
+           hasShortForwardBranchIALU();
   }
 
   bool hasShlAdd(int64_t ShAmt) const {
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 4788a428d7e64..74c2c896a8a88 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -1007,6 +1007,25 @@ InstructionCost RISCVTTIImpl::getScalarizationOverhead(
   return Cost;
 }
 
+InstructionCost
+RISCVTTIImpl::getMemIntrinsicInstrCost(const MemIntrinsicCostAttributes &MICA,
+                                       TTI::TargetCostKind CostKind) const {
+  Type *DataTy = MICA.getDataType();
+  Align Alignment = MICA.getAlignment();
+  switch (MICA.getID()) {
+  case Intrinsic::vp_load_ff: {
+    EVT DataTypeVT = TLI->getValueType(DL, DataTy);
+    if (!TLI->isLegalFirstFaultLoad(DataTypeVT, Alignment))
+      return BaseT::getMemIntrinsicInstrCost(MICA, CostKind);
+
+    unsigned AS = MICA.getAddressSpace();
+    return getMemoryOpCost(Instruction::Load, DataTy, Alignment, AS, CostKind,
+                           {TTI::OK_AnyValue, TTI::OP_None}, nullptr);
+  }
+  }
+  return BaseT::getMemIntrinsicInstrCost(MICA, CostKind);
+}
+
 InstructionCost
 RISCVTTIImpl::getMaskedMemoryOpCost(const MemIntrinsicCostAttributes &MICA,
                                     TTI::TargetCostKind CostKind) const {
@@ -1120,19 +1139,24 @@ InstructionCost RISCVTTIImpl::getInterleavedMemoryOpCost(
   return MemCost + ShuffleCost;
 }
 
-InstructionCost RISCVTTIImpl::getGatherScatterOpCost(
-    unsigned Opcode, Type *DataTy, const Value *Ptr, bool VariableMask,
-    Align Alignment, TTI::TargetCostKind CostKind, const Instruction *I) const {
+InstructionCost
+RISCVTTIImpl::getGatherScatterOpCost(const MemIntrinsicCostAttributes &MICA,
+                                     TTI::TargetCostKind CostKind) const {
+
+  bool IsLoad = MICA.getID() == Intrinsic::masked_gather ||
+                MICA.getID() == Intrinsic::vp_gather;
+  unsigned Opcode = IsLoad ? Instruction::Load : Instruction::Store;
+  Type *DataTy = MICA.getDataType();
+  Align Alignment = MICA.getAlignment();
+  const Instruction *I = MICA.getInst();
   if (CostKind != TTI::TCK_RecipThroughput)
-    return BaseT::getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
-                                         Alignment, CostKind, I);
+    return BaseT::getGatherScatterOpCost(MICA, CostKind);
 
   if ((Opcode == Instruction::Load &&
        !isLegalMaskedGather(DataTy, Align(Alignment))) ||
       (Opcode == Instruction::Store &&
        !isLegalMaskedScatter(DataTy, Align(Alignment))))
-    return BaseT::getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
-                                         Alignment, CostKind, I);
+    return BaseT::getGatherScatterOpCost(MICA, CostKind);
 
   // Cost is proportional to the number of memory operations implied.  For
   // scalable vectors, we use an estimate on that number since we don't
@@ -1188,14 +1212,20 @@ InstructionCost RISCVTTIImpl::getExpandCompressMemoryOpCost(
          LT.first * getRISCVInstructionCost(Opcodes, LT.second, CostKind);
 }
 
-InstructionCost RISCVTTIImpl::getStridedMemoryOpCost(
-    unsigned Opcode, Type *DataTy, const Value *Ptr, bool VariableMask,
-    Align Alignment, TTI::TargetCostKind CostKind, const Instruction *I) const {
-  if (((Opcode == Instruction::Load || Opcode == Instruction::Store) &&
-       !isLegalStridedLoadStore(DataTy, Alignment)) ||
-      (Opcode != Instruction::Load && Opcode != Instruction::Store))
-    return BaseT::getStridedMemoryOpCost(Opcode, DataTy, Ptr, VariableMask,
-                                         Alignment, CostKind, I);
+InstructionCost
+RISCVTTIImpl::getStridedMemoryOpCost(const MemIntrinsicCostAttributes &MICA,
+                                     TTI::TargetCostKind CostKind) const {
+
+  unsigned Opcode = MICA.getID() == Intrinsic::experimental_vp_strided_load
+                        ? Instruction::Load
+                        : Instruction::Store;
+
+  Type *DataTy = MICA.getDataType();
+  Align Alignment = MICA.getAlignment();
+  const Instruction *I = MICA.getInst();
+
+  if (!isLegalStridedLoadStore(DataTy, Alignment))
+    return BaseT::getStridedMemoryOpCost(MICA, CostKind);
 
   if (CostKind == TTI::TCK_CodeSize)
     return TTI::TCC_Basic;
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index 5efa330b3ad71..c1746e6d13166 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -143,6 +143,10 @@ class RISCVTTIImpl final : public BasicTTIImplBase<RISCVTTIImpl> {
 
   bool shouldConsiderVectorizationRegPressure() const override { return true; }
 
+  InstructionCost
+  getMemIntrinsicInstrCost(const MemIntrinsicCostAttributes &MICA,
+                           TTI::TargetCostKind CostKind) const override;
+
   InstructionCost
   getMaskedMemoryOpCost(const MemIntrinsicCostAttributes &MICA,
                         TTI::TargetCostKind CostKind) const override;
@@ -190,21 +194,17 @@ class RISCVTTIImpl final : public BasicTTIImplBase<RISCVTTIImpl> {
       Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
       bool UseMaskForCond = false, bool UseMaskForGaps = false) const override;
 
-  InstructionCost getGatherScatterOpCost(unsigned Opcode, Type *DataTy,
-                                         const Value *Ptr, bool VariableMask,
-                                         Align Alignment,
-                                         TTI::TargetCostKind CostKind,
-                                         const Instruction *I) const override;
+  InstructionCost
+  getGatherScatterOpCost(const MemIntrinsicCostAttributes &MICA,
+                         TTI::TargetCostKind CostKind) const override;
 
   InstructionCost
   getExpandCompressMemoryOpCost(const MemIntrinsicCostAttributes &MICA,
                                 TTI::TargetCostKind CostKind) const override;
 
-  InstructionCost getStridedMemoryOpCost(unsigned Opcode, Type *DataTy,
-                                         const Value *Ptr, bool VariableMask,
-                                         Align Alignment,
-                                         TTI::TargetCostKind CostKind,
-                                         const Instruction *I) const override;
+  InstructionCost
+  getStridedMemoryOpCost(const MemIntrinsicCostAttributes &MICA,
+                         TTI::TargetCostKind CostKind) const override;
 
   InstructionCost
   getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) const override;
diff --git a/llvm/lib/Target/RISCV/RISCVVectorPeephole.cpp b/llvm/lib/Target/RISCV/RISCVVectorPeephole.cpp
index e1ff243bb1a47..6ddca4a3e0909 100644
--- a/llvm/lib/Target/RISCV/RISCVVectorPeephole.cpp
+++ b/llvm/lib/Target/RISCV/RISCVVectorPeephole.cpp
@@ -73,7 +73,7 @@ class RISCVVectorPeephole : public MachineFunctionPass {
   bool isAllOnesMask(const MachineInstr *MaskDef) const;
   std::optional<unsigned> getConstant(const MachineOperand &VL) const;
   bool ensureDominates(const MachineOperand &Use, MachineInstr &Src) const;
-  bool isKnownSameDefs(Register A, Register B) const;
+  Register lookThruCopies(Register Reg, bool OneUseOnly = false) const;
 };
 
 } // namespace
@@ -387,23 +387,21 @@ bool RISCVVectorPeephole::convertAllOnesVMergeToVMv(MachineInstr &MI) const {
   return true;
 }
 
-bool RISCVVectorPeephole::isKnownSameDefs(Register A, Register B) const {
-  if (A.isPhysical() || B.isPhysical())
-    return false;
-
-  auto LookThruVirtRegCopies = [this](Register Reg) {
-    while (MachineInstr *Def = MRI->getUniqueVRegDef(Reg)) {
-      if (!Def->isFullCopy())
-        break;
-      Register Src = Def->getOperand(1).getReg();
-      if (!Src.isVirtual())
-        break;
-      Reg = Src;
-    }
-    return Reg;
-  };
-
-  return LookThruVirtRegCopies(A) == LookThruVirtRegCopies(B);
+// If \p Reg is defined by a chain of full COPYs of virtual registers, look
+// through the chain and return the root non-COPY source register.
+Register RISCVVectorPeephole::lookThruCopies(Register Reg,
+                                             bool OneUseOnly) const {
+  while (MachineInstr *Def = MRI->getUniqueVRegDef(Reg)) {
+    if (!Def->isFullCopy())
+      break;
+    Register Src = Def->getOperand(1).getReg();
+    if (!Src.isVirtual())
+      break;
+    if (OneUseOnly && !MRI->hasOneNonDBGUse(Reg))
+      break;
+    Reg = Src;
+  }
+  return Reg;
 }
 
 /// If a PseudoVMERGE_VVM's true operand is a masked pseudo and both have the
@@ -428,10 +426,11 @@ bool RISCVVectorPeephole::convertSameMaskVMergeToVMv(MachineInstr &MI) {
   if (!TrueMaskedInfo || !hasSameEEW(MI, *True))
     return false;
 
-  const MachineOperand &TrueMask =
-      True->getOperand(TrueMaskedInfo->MaskOpIdx + True->getNumExplicitDefs());
-  const MachineOperand &MIMask = MI.getOperand(4);
-  if (!isKnownSameDefs(TrueMask.getReg(), MIMask.getReg()))
+  Register TrueMaskReg = lookThruCopies(
+      True->getOperand(TrueMaskedInfo->MaskOpIdx + True->getNumExplicitDefs())
+          .getReg());
+  Register MIMaskReg = lookThruCopies(MI.getOperand(4).getReg());
+  if (!TrueMaskReg.isVirtual() || TrueMaskReg != MIMaskReg)
     return false;
 
   // Masked off lanes past TrueVL will come from False, and converting to vmv
@@ -717,9 +716,10 @@ bool RISCVVectorPeephole::foldVMergeToMask(MachineInstr &MI) const {
   if (RISCV::getRVVMCOpcode(MI.getOpcode()) != RISCV::VMERGE_VVM)
     return false;
 
-  Register PassthruReg = MI.getOperand(1).getReg();
-  Register FalseReg = MI.getOperand(2).getReg();
-  Register TrueReg = MI.getOperand(3).getReg();
+  Register PassthruReg = lookThruCopies(MI.getOperand(1).getReg());
+  Register FalseReg = lookThruCopies(MI.getOperand(2).getReg());
+  Register TrueReg =
+      lookThruCopies(MI.getOperand(3).getReg(), /*OneUseOnly=*/true);
   if (!TrueReg.isVirtual() || !MRI->hasOneUse(TrueReg))
     return false;
   MachineInstr &True = *MRI->getUniqueVRegDef(TrueReg);
@@ -740,16 +740,17 @@ bool RISCVVectorPeephole::foldVMergeToMask(MachineInstr &MI) const {
 
   // We require that either passthru and false are the same, or that passthru
   // is undefined.
-  if (PassthruReg && !isKnownSameDefs(PassthruReg, FalseReg))
+  if (PassthruReg && !(PassthruReg.isVirtual() && PassthruReg == FalseReg))
     return false;
 
   std::optional<std::pair<unsigned, unsigned>> NeedsCommute;
 
   // If True has a passthru operand then it needs to be the same as vmerge's
   // False, since False will be used for the result's passthru operand.
-  Register TruePassthru = True.getOperand(True.getNumExplicitDefs()).getReg();
+  Register TruePassthru =
+      lookThruCopies(True.getOperand(True.getNumExplicitDefs()).getReg());
   if (RISCVII::isFirstDefTiedToFirstUse(True.getDesc()) && TruePassthru &&
-      !isKnownSameDefs(TruePassthru, FalseReg)) {
+      !(TruePassthru.isVirtual() && TruePassthru == FalseReg)) {
     // If True's passthru != False, check if it uses False in another operand
     // and try to commute it.
     int OtherIdx = True.findRegisterUseOperandIdx(FalseReg, TRI);
@@ -837,6 +838,8 @@ bool RISCVVectorPeephole::foldVMergeToMask(MachineInstr &MI) const {
     MRI->constrainRegClass(
         MO.getReg(), True.getRegClassConstraint(MO.getOperandNo(), TII, TRI));
   }
+  // Clear kill flags on FalseReg since it now has an additional use.
+  MRI->clearKillFlags(FalseReg);
   MI.eraseFromParent();
 
   return true;
diff --git a/llvm/lib/Target/SPIRV/SPIRVBuiltins.cpp b/llvm/lib/Target/SPIRV/SPIRVBuiltins.cpp
index 709f49b0fecc1..87ebee6a14eac 100644
--- a/llvm/lib/Target/SPIRV/SPIRVBuiltins.cpp
+++ b/llvm/lib/Target/SPIRV/SPIRVBuiltins.cpp
@@ -2399,6 +2399,77 @@ static bool generateBlockingPipesInst(const SPIRV::IncomingCall *Call,
   return buildOpFromWrapper(MIRBuilder, Opcode, Call, Register(0));
 }
 
+static bool buildAPFixedPointInst(const SPIRV::IncomingCall *Call,
+                                  unsigned Opcode, MachineIRBuilder &MIRBuilder,
+                                  SPIRVGlobalRegistry *GR) {
+  MachineRegisterInfo *MRI = MIRBuilder.getMRI();
+  SmallVector<uint32_t, 1> ImmArgs;
+  Register InputReg = Call->Arguments[0];
+  const Type *RetTy = GR->getTypeForSPIRVType(Call->ReturnType);
+  bool IsSRet = RetTy->isVoidTy();
+
+  if (IsSRet) {
+    const LLT ValTy = MRI->getType(InputReg);
+    Register ActualRetValReg = MRI->createGenericVirtualRegister(ValTy);
+    SPIRVType *InstructionType =
+        GR->getPointeeType(GR->getSPIRVTypeForVReg(InputReg));
+    InputReg = Call->Arguments[1];
+    auto InputType = GR->getTypeForSPIRVType(GR->getSPIRVTypeForVReg(InputReg));
+    Register PtrInputReg;
+    if (InputType->getTypeID() == llvm::Type::TypeID::TypedPointerTyID) {
+      LLT InputLLT = MRI->getType(InputReg);
+      PtrInputReg = MRI->createGenericVirtualRegister(InputLLT);
+      SPIRVType *PtrType =
+          GR->getPointeeType(GR->getSPIRVTypeForVReg(InputReg));
+      MachineMemOperand *MMO1 = MIRBuilder.getMF().getMachineMemOperand(
+          MachinePointerInfo(), MachineMemOperand::MOLoad,
+          InputLLT.getSizeInBytes(), Align(4));
+      MIRBuilder.buildLoad(PtrInputReg, InputReg, *MMO1);
+      MRI->setRegClass(PtrInputReg, &SPIRV::iIDRegClass);
+      GR->assignSPIRVTypeToVReg(PtrType, PtrInputReg, MIRBuilder.getMF());
+    }
+
+    for (unsigned Index = 2; Index < 7; Index++) {
+      ImmArgs.push_back(getConstFromIntrinsic(Call->Arguments[Index], MRI));
+    }
+
+    // Emit the instruction
+    auto MIB = MIRBuilder.buildInstr(Opcode)
+                   .addDef(ActualRetValReg)
+                   .addUse(GR->getSPIRVTypeID(InstructionType));
+    if (PtrInputReg)
+      MIB.addUse(PtrInputReg);
+    else
+      MIB.addUse(InputReg);
+
+    for (uint32_t Imm : ImmArgs)
+      MIB.addImm(Imm);
+    unsigned Size = ValTy.getSizeInBytes();
+    // Store result to the pointer passed in Arg[0]
+    MachineMemOperand *MMO = MIRBuilder.getMF().getMachineMemOperand(
+        MachinePointerInfo(), MachineMemOperand::MOStore, Size, Align(4));
+    MRI->setRegClass(ActualRetValReg, &SPIRV::pIDRegClass);
+    MIRBuilder.buildStore(ActualRetValReg, Call->Arguments[0], *MMO);
+    return true;
+  } else {
+    for (unsigned Index = 1; Index < 6; Index++)
+      ImmArgs.push_back(getConstFromIntrinsic(Call->Arguments[Index], MRI));
+
+    return buildOpFromWrapper(MIRBuilder, Opcode, Call,
+                              GR->getSPIRVTypeID(Call->ReturnType), ImmArgs);
+  }
+}
+
+static bool generateAPFixedPointInst(const SPIRV::IncomingCall *Call,
+                                     MachineIRBuilder &MIRBuilder,
+                                     SPIRVGlobalRegistry *GR) {
+  const SPIRV::DemangledBuiltin *Builtin = Call->Builtin;
+  unsigned Opcode =
+      SPIRV::lookupNativeBuiltin(Builtin->Name, Builtin->Set)->Opcode;
+
+  return buildAPFixedPointInst(Call, Opcode, MIRBuilder, GR);
+}
+
 static bool
 generateTernaryBitwiseFunctionINTELInst(const SPIRV::IncomingCall *Call,
                                         MachineIRBuilder &MIRBuilder,
@@ -3061,6 +3132,8 @@ std::optional<bool> lowerBuiltin(const StringRef DemangledCall,
     return generatePredicatedLoadStoreInst(Call.get(), MIRBuilder, GR);
   case SPIRV::BlockingPipes:
     return generateBlockingPipesInst(Call.get(), MIRBuilder, GR);
+  case SPIRV::ArbitraryPrecisionFixedPoint:
+    return generateAPFixedPointInst(Call.get(), MIRBuilder, GR);
   }
   return false;
 }
diff --git a/llvm/lib/Target/SPIRV/SPIRVBuiltins.td b/llvm/lib/Target/SPIRV/SPIRVBuiltins.td
index 492a98e1995fe..98440856387c9 100644
--- a/llvm/lib/Target/SPIRV/SPIRVBuiltins.td
+++ b/llvm/lib/Target/SPIRV/SPIRVBuiltins.td
@@ -71,6 +71,7 @@ def TernaryBitwiseINTEL : BuiltinGroup;
 def Block2DLoadStore : BuiltinGroup;
 def Pipe : BuiltinGroup;
 def PredicatedLoadStore : BuiltinGroup;
+def ArbitraryPrecisionFixedPoint : BuiltinGroup;
 def BlockingPipes : BuiltinGroup;
 
 //===----------------------------------------------------------------------===//
@@ -1181,6 +1182,19 @@ defm : DemangledNativeBuiltin<"__spirv_WritePipeBlockingINTEL", OpenCL_std, Bloc
 defm : DemangledNativeBuiltin<"__spirv_ReadPipeBlockingINTEL", OpenCL_std, BlockingPipes, 0, 0, OpReadPipeBlockingALTERA>;
 defm : DemangledNativeBuiltin<"__spirv_ReadClockKHR", OpenCL_std, KernelClock, 1, 1, OpReadClockKHR>;
 
+// SPV_ALTERA_arbitrary_precision_fixed_point
+defm : DemangledNativeBuiltin<"__spirv_FixedSqrtINTEL", OpenCL_std, ArbitraryPrecisionFixedPoint, 6, 8, OpFixedSqrtALTERA>;
+defm : DemangledNativeBuiltin<"__spirv_FixedRecipINTEL", OpenCL_std, ArbitraryPrecisionFixedPoint, 6, 8, OpFixedRecipALTERA>;
+defm : DemangledNativeBuiltin<"__spirv_FixedRsqrtINTEL", OpenCL_std, ArbitraryPrecisionFixedPoint, 6, 8, OpFixedRsqrtALTERA>;
+defm : DemangledNativeBuiltin<"__spirv_FixedSinINTEL", OpenCL_std, ArbitraryPrecisionFixedPoint, 6, 8, OpFixedSinALTERA>;
+defm : DemangledNativeBuiltin<"__spirv_FixedCosINTEL", OpenCL_std, ArbitraryPrecisionFixedPoint, 6, 8, OpFixedCosALTERA>;
+defm : DemangledNativeBuiltin<"__spirv_FixedSinCosINTEL", OpenCL_std, ArbitraryPrecisionFixedPoint, 6, 8, OpFixedSinCosALTERA>;
+defm : DemangledNativeBuiltin<"__spirv_FixedSinPiINTEL", OpenCL_std, ArbitraryPrecisionFixedPoint, 6, 8, OpFixedSinPiALTERA>;
+defm : DemangledNativeBuiltin<"__spirv_FixedCosPiINTEL", OpenCL_std, ArbitraryPrecisionFixedPoint, 6, 8, OpFixedCosPiALTERA>;
+defm : DemangledNativeBuiltin<"__spirv_FixedSinCosPiINTEL", OpenCL_std, ArbitraryPrecisionFixedPoint, 6, 8, OpFixedSinCosPiALTERA>;
+defm : DemangledNativeBuiltin<"__spirv_FixedLogINTEL", OpenCL_std, ArbitraryPrecisionFixedPoint, 6, 8, OpFixedLogALTERA>;
+defm : DemangledNativeBuiltin<"__spirv_FixedExpINTEL", OpenCL_std, ArbitraryPrecisionFixedPoint, 6, 8, OpFixedExpALTERA>;
+
 //===----------------------------------------------------------------------===//
 // Class defining an atomic instruction on floating-point numbers.
 //
diff --git a/llvm/lib/Target/SPIRV/SPIRVCommandLine.cpp b/llvm/lib/Target/SPIRV/SPIRVCommandLine.cpp
index d394b3ac243a9..146384f4bf08c 100644
--- a/llvm/lib/Target/SPIRV/SPIRVCommandLine.cpp
+++ b/llvm/lib/Target/SPIRV/SPIRVCommandLine.cpp
@@ -53,8 +53,8 @@ static const std::map<std::string, SPIRV::Extension::Extension, std::less<>>
          SPIRV::Extension::Extension::SPV_GOOGLE_hlsl_functionality1},
         {"SPV_GOOGLE_user_type",
          SPIRV::Extension::Extension::SPV_GOOGLE_user_type},
-        {"SPV_INTEL_arbitrary_precision_integers",
-         SPIRV::Extension::Extension::SPV_INTEL_arbitrary_precision_integers},
+        {"SPV_ALTERA_arbitrary_precision_integers",
+         SPIRV::Extension::Extension::SPV_ALTERA_arbitrary_precision_integers},
         {"SPV_INTEL_cache_controls",
          SPIRV::Extension::Extension::SPV_INTEL_cache_controls},
         {"SPV_INTEL_float_controls2",
@@ -163,7 +163,11 @@ static const std::map<std::string, SPIRV::Extension::Extension, std::less<>>
         {"SPV_INTEL_kernel_attributes",
          SPIRV::Extension::Extension::SPV_INTEL_kernel_attributes},
         {"SPV_ALTERA_blocking_pipes",
-         SPIRV::Extension::Extension::SPV_ALTERA_blocking_pipes}};
+         SPIRV::Extension::Extension::SPV_ALTERA_blocking_pipes},
+        {"SPV_INTEL_int4", SPIRV::Extension::Extension::SPV_INTEL_int4},
+        {"SPV_ALTERA_arbitrary_precision_fixed_point",
+         SPIRV::Extension::Extension::
+             SPV_ALTERA_arbitrary_precision_fixed_point}};
 
 bool SPIRVExtensionsParser::parse(cl::Option &O, StringRef ArgName,
                                   StringRef ArgValue,
diff --git a/llvm/lib/Target/SPIRV/SPIRVGlobalRegistry.cpp b/llvm/lib/Target/SPIRV/SPIRVGlobalRegistry.cpp
index 8b1a09caf907d..0fb44052527f0 100644
--- a/llvm/lib/Target/SPIRV/SPIRVGlobalRegistry.cpp
+++ b/llvm/lib/Target/SPIRV/SPIRVGlobalRegistry.cpp
@@ -155,7 +155,7 @@ unsigned SPIRVGlobalRegistry::adjustOpTypeIntWidth(unsigned Width) const {
     report_fatal_error("Unsupported integer width!");
   const SPIRVSubtarget &ST = cast<SPIRVSubtarget>(CurMF->getSubtarget());
   if (ST.canUseExtension(
-          SPIRV::Extension::SPV_INTEL_arbitrary_precision_integers) ||
+          SPIRV::Extension::SPV_ALTERA_arbitrary_precision_integers) ||
       ST.canUseExtension(SPIRV::Extension::SPV_INTEL_int4))
     return Width;
   if (Width <= 8)
@@ -183,11 +183,11 @@ SPIRVType *SPIRVGlobalRegistry::getOpTypeInt(unsigned Width,
           .addImm(SPIRV::Capability::Int4TypeINTEL);
     } else if ((!isPowerOf2_32(Width) || Width < 8) &&
                ST.canUseExtension(
-                   SPIRV::Extension::SPV_INTEL_arbitrary_precision_integers)) {
+                   SPIRV::Extension::SPV_ALTERA_arbitrary_precision_integers)) {
       MIRBuilder.buildInstr(SPIRV::OpExtension)
-          .addImm(SPIRV::Extension::SPV_INTEL_arbitrary_precision_integers);
+          .addImm(SPIRV::Extension::SPV_ALTERA_arbitrary_precision_integers);
       MIRBuilder.buildInstr(SPIRV::OpCapability)
-          .addImm(SPIRV::Capability::ArbitraryPrecisionIntegersINTEL);
+          .addImm(SPIRV::Capability::ArbitraryPrecisionIntegersALTERA);
     }
     return MIRBuilder.buildInstr(SPIRV::OpTypeInt)
         .addDef(createTypeVReg(MIRBuilder))
@@ -883,10 +883,12 @@ SPIRVType *SPIRVGlobalRegistry::getOpTypeArray(uint32_t NumElems,
           .addUse(NumElementsVReg);
     });
   } else {
-    assert(ST.isShader() && "Runtime arrays are not allowed in non-shader "
-                            "SPIR-V modules.");
-    if (!ST.isShader())
+    if (!ST.isShader()) {
+      llvm::reportFatalUsageError(
+          "Runtime arrays are not allowed in non-shader "
+          "SPIR-V modules");
       return nullptr;
+    }
     ArrayType = createOpType(MIRBuilder, [&](MachineIRBuilder &MIRBuilder) {
       return MIRBuilder.buildInstr(SPIRV::OpTypeRuntimeArray)
           .addDef(createTypeVReg(MIRBuilder))
diff --git a/llvm/lib/Target/SPIRV/SPIRVISelLowering.cpp b/llvm/lib/Target/SPIRV/SPIRVISelLowering.cpp
index 0ba6589c68944..36fa5fa9a70cb 100644
--- a/llvm/lib/Target/SPIRV/SPIRVISelLowering.cpp
+++ b/llvm/lib/Target/SPIRV/SPIRVISelLowering.cpp
@@ -94,7 +94,7 @@ MVT SPIRVTargetLowering::getRegisterTypeForCallingConv(LLVMContext &Context,
 }
 
 bool SPIRVTargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
-                                             const CallInst &I,
+                                             const CallBase &I,
                                              MachineFunction &MF,
                                              unsigned Intrinsic) const {
   unsigned AlignIdx = 3;
diff --git a/llvm/lib/Target/SPIRV/SPIRVISelLowering.h b/llvm/lib/Target/SPIRV/SPIRVISelLowering.h
index 3d31a116bad4a..5746832c8fd95 100644
--- a/llvm/lib/Target/SPIRV/SPIRVISelLowering.h
+++ b/llvm/lib/Target/SPIRV/SPIRVISelLowering.h
@@ -48,7 +48,7 @@ class SPIRVTargetLowering : public TargetLowering {
                                          EVT VT) const override;
   MVT getRegisterTypeForCallingConv(LLVMContext &Context, CallingConv::ID CC,
                                     EVT VT) const override;
-  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallInst &I,
+  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallBase &I,
                           MachineFunction &MF,
                           unsigned Intrinsic) const override;
 
diff --git a/llvm/lib/Target/SPIRV/SPIRVInstrInfo.td b/llvm/lib/Target/SPIRV/SPIRVInstrInfo.td
index 03bd61bdf2cf6..815d2d7ed854b 100644
--- a/llvm/lib/Target/SPIRV/SPIRVInstrInfo.td
+++ b/llvm/lib/Target/SPIRV/SPIRVInstrInfo.td
@@ -999,3 +999,27 @@ def OpReadPipeBlockingALTERA :Op<5946, (outs), (ins ID:$pipe, ID:$pointer, ID:$p
                    "OpReadPipeBlockingALTERA $pipe $pointer $packetSize $packetAlignment">;
 def OpWritePipeBlockingALTERA :Op<5946, (outs), (ins ID:$pipe, ID:$pointer, ID:$packetSize, ID:$packetAlignment),
                    "OpWritePipeBlockingALTERA $pipe $pointer $packetSize $packetAlignment">;
+
+// SPV_ALTERA_arbitrary_precision_fixed_point
+def OpFixedSqrtALTERA: Op<5923, (outs ID:$res), (ins TYPE:$result_type, ID:$input, i32imm:$sign, i32imm:$l, i32imm:$rl, i32imm:$q, i32imm:$o),
+      "$res = OpFixedSqrtALTERA $result_type $input $sign $l $rl $q $o">;
+def OpFixedRecipALTERA: Op<5924, (outs ID:$res), (ins TYPE:$result_type, ID:$input, i32imm:$sign, i32imm:$l, i32imm:$rl, i32imm:$q, i32imm:$o),
+      "$res = OpFixedRecipALTERA $result_type $input $sign $l $rl $q $o">;
+def OpFixedRsqrtALTERA: Op<5925, (outs ID:$res), (ins TYPE:$result_type, ID:$input, i32imm:$sign, i32imm:$l, i32imm:$rl, i32imm:$q, i32imm:$o),
+      "$res = OpFixedRsqrtALTERA $result_type $input $sign $l $rl $q $o">;
+def OpFixedSinALTERA: Op<5926, (outs ID:$res), (ins TYPE:$result_type, ID:$input, i32imm:$sign, i32imm:$l, i32imm:$rl, i32imm:$q, i32imm:$o),
+      "$res = OpFixedSinALTERA $result_type $input $sign $l $rl $q $o">;
+def OpFixedCosALTERA: Op<5927, (outs ID:$res), (ins TYPE:$result_type, ID:$input, i32imm:$sign, i32imm:$l, i32imm:$rl, i32imm:$q, i32imm:$o),
+      "$res = OpFixedCosALTERA $result_type $input $sign $l $rl $q $o">;
+def OpFixedSinCosALTERA: Op<5928, (outs ID:$res), (ins TYPE:$result_type, ID:$input, i32imm:$sign, i32imm:$l, i32imm:$rl, i32imm:$q, i32imm:$o),
+      "$res = OpFixedSinCosALTERA $result_type $input $sign $l $rl $q $o">;
+def OpFixedSinPiALTERA: Op<5929, (outs ID:$res), (ins TYPE:$result_type, ID:$input, i32imm:$sign, i32imm:$l, i32imm:$rl, i32imm:$q, i32imm:$o),
+      "$res = OpFixedSinPiALTERA $result_type $input $sign $l $rl $q $o">;
+def OpFixedCosPiALTERA: Op<5930, (outs ID:$res), (ins TYPE:$result_type, ID:$input, i32imm:$sign, i32imm:$l, i32imm:$rl, i32imm:$q, i32imm:$o),
+      "$res = OpFixedCosPiALTERA $result_type $input $sign $l $rl $q $o">;
+def OpFixedSinCosPiALTERA: Op<5931, (outs ID:$res), (ins TYPE:$result_type, ID:$input, i32imm:$sign, i32imm:$l, i32imm:$rl, i32imm:$q, i32imm:$o),
+      "$res = OpFixedSinCosPiALTERA $result_type $input $sign $l $rl $q $o">;
+def OpFixedLogALTERA: Op<5932, (outs ID:$res), (ins TYPE:$result_type, ID:$input, i32imm:$sign, i32imm:$l, i32imm:$rl, i32imm:$q, i32imm:$o),
+      "$res = OpFixedLogALTERA $result_type $input $sign $l $rl $q $o">;
+def OpFixedExpALTERA: Op<5933, (outs ID:$res), (ins TYPE:$result_type, ID:$input, i32imm:$sign, i32imm:$l, i32imm:$rl, i32imm:$q, i32imm:$o),
+      "$res = OpFixedExpALTERA $result_type $input $sign $l $rl $q $o">;
diff --git a/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp b/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp
index 2c27289e759eb..a2e29366dc4cc 100644
--- a/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp
+++ b/llvm/lib/Target/SPIRV/SPIRVInstructionSelector.cpp
@@ -1781,33 +1781,57 @@ bool SPIRVInstructionSelector::selectUnmergeValues(MachineInstr &I) const {
   unsigned ArgI = I.getNumOperands() - 1;
   Register SrcReg =
       I.getOperand(ArgI).isReg() ? I.getOperand(ArgI).getReg() : Register(0);
-  SPIRVType *DefType =
+  SPIRVType *SrcType =
       SrcReg.isValid() ? GR.getSPIRVTypeForVReg(SrcReg) : nullptr;
-  if (!DefType || DefType->getOpcode() != SPIRV::OpTypeVector)
+  if (!SrcType || SrcType->getOpcode() != SPIRV::OpTypeVector)
     report_fatal_error(
         "cannot select G_UNMERGE_VALUES with a non-vector argument");
 
   SPIRVType *ScalarType =
-      GR.getSPIRVTypeForVReg(DefType->getOperand(1).getReg());
+      GR.getSPIRVTypeForVReg(SrcType->getOperand(1).getReg());
   MachineBasicBlock &BB = *I.getParent();
   bool Res = false;
+  unsigned CurrentIndex = 0;
   for (unsigned i = 0; i < I.getNumDefs(); ++i) {
     Register ResVReg = I.getOperand(i).getReg();
     SPIRVType *ResType = GR.getSPIRVTypeForVReg(ResVReg);
     if (!ResType) {
-      // There was no "assign type" actions, let's fix this now
-      ResType = ScalarType;
+      LLT ResLLT = MRI->getType(ResVReg);
+      assert(ResLLT.isValid());
+      if (ResLLT.isVector()) {
+        ResType = GR.getOrCreateSPIRVVectorType(
+            ScalarType, ResLLT.getNumElements(), I, TII);
+      } else {
+        ResType = ScalarType;
+      }
       MRI->setRegClass(ResVReg, GR.getRegClass(ResType));
-      MRI->setType(ResVReg, LLT::scalar(GR.getScalarOrVectorBitWidth(ResType)));
       GR.assignSPIRVTypeToVReg(ResType, ResVReg, *GR.CurMF);
     }
-    auto MIB =
-        BuildMI(BB, I, I.getDebugLoc(), TII.get(SPIRV::OpCompositeExtract))
-            .addDef(ResVReg)
-            .addUse(GR.getSPIRVTypeID(ResType))
-            .addUse(SrcReg)
-            .addImm(static_cast<int64_t>(i));
-    Res |= MIB.constrainAllUses(TII, TRI, RBI);
+
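+    // A vector destination is filled with OpVectorShuffle, pulling a
+    // contiguous run of lanes out of the source; scalar destinations use
+    // OpCompositeExtract. CurrentIndex tracks the next unconsumed source lane.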
+    if (ResType->getOpcode() == SPIRV::OpTypeVector) {
+      Register UndefReg = GR.getOrCreateUndef(I, SrcType, TII);
+      auto MIB =
+          BuildMI(BB, I, I.getDebugLoc(), TII.get(SPIRV::OpVectorShuffle))
+              .addDef(ResVReg)
+              .addUse(GR.getSPIRVTypeID(ResType))
+              .addUse(SrcReg)
+              .addUse(UndefReg);
+      unsigned NumElements = GR.getScalarOrVectorComponentCount(ResType);
+      for (unsigned j = 0; j < NumElements; ++j) {
+        MIB.addImm(CurrentIndex + j);
+      }
+      CurrentIndex += NumElements;
+      Res |= MIB.constrainAllUses(TII, TRI, RBI);
+    } else {
+      auto MIB =
+          BuildMI(BB, I, I.getDebugLoc(), TII.get(SPIRV::OpCompositeExtract))
+              .addDef(ResVReg)
+              .addUse(GR.getSPIRVTypeID(ResType))
+              .addUse(SrcReg)
+              .addImm(CurrentIndex);
+      CurrentIndex++;
+      Res |= MIB.constrainAllUses(TII, TRI, RBI);
+    }
   }
   return Res;
 }
diff --git a/llvm/lib/Target/SPIRV/SPIRVLegalizerInfo.cpp b/llvm/lib/Target/SPIRV/SPIRVLegalizerInfo.cpp
index 53074ea3b2597..b5912c27316c9 100644
--- a/llvm/lib/Target/SPIRV/SPIRVLegalizerInfo.cpp
+++ b/llvm/lib/Target/SPIRV/SPIRVLegalizerInfo.cpp
@@ -14,16 +14,22 @@
 #include "SPIRV.h"
 #include "SPIRVGlobalRegistry.h"
 #include "SPIRVSubtarget.h"
+#include "llvm/CodeGen/GlobalISel/GenericMachineInstrs.h"
 #include "llvm/CodeGen/GlobalISel/LegalizerHelper.h"
 #include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
 #include "llvm/CodeGen/MachineInstr.h"
 #include "llvm/CodeGen/MachineRegisterInfo.h"
 #include "llvm/CodeGen/TargetOpcodes.h"
+#include "llvm/IR/IntrinsicsSPIRV.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/MathExtras.h"
 
 using namespace llvm;
 using namespace llvm::LegalizeActions;
 using namespace llvm::LegalityPredicates;
 
+#define DEBUG_TYPE "spirv-legalizer"
+
 LegalityPredicate typeOfExtendedScalars(unsigned TypeIdx, bool IsExtendedInts) {
   return [IsExtendedInts, TypeIdx](const LegalityQuery &Query) {
     const LLT Ty = Query.Types[TypeIdx];
@@ -84,23 +90,29 @@ SPIRVLegalizerInfo::SPIRVLegalizerInfo(const SPIRVSubtarget &ST) {
   const LLT p6 = LLT::pointer(6, PSize); // SPV_INTEL_usm_storage_classes (Host)
   const LLT p7 = LLT::pointer(7, PSize); // Input
   const LLT p8 = LLT::pointer(8, PSize); // Output
+  const LLT p9 =
+      LLT::pointer(9, PSize); // CodeSectionINTEL, SPV_INTEL_function_pointers
   const LLT p10 = LLT::pointer(10, PSize); // Private
   const LLT p11 = LLT::pointer(11, PSize); // StorageBuffer
   const LLT p12 = LLT::pointer(12, PSize); // Uniform
 
   // TODO: remove copy-pasting here by using concatenation in some way.
   auto allPtrsScalarsAndVectors = {
-      p0,    p1,    p2,    p3,     p4,     p5,    p6,    p7,    p8,
-      p10,   p11,   p12,   s1,     s8,     s16,   s32,   s64,   v2s1,
-      v2s8,  v2s16, v2s32, v2s64,  v3s1,   v3s8,  v3s16, v3s32, v3s64,
-      v4s1,  v4s8,  v4s16, v4s32,  v4s64,  v8s1,  v8s8,  v8s16, v8s32,
-      v8s64, v16s1, v16s8, v16s16, v16s32, v16s64};
+      p0,    p1,    p2,    p3,    p4,     p5,     p6,    p7,    p8,
+      p9,    p10,   p11,   p12,   s1,     s8,     s16,   s32,   s64,
+      v2s1,  v2s8,  v2s16, v2s32, v2s64,  v3s1,   v3s8,  v3s16, v3s32,
+      v3s64, v4s1,  v4s8,  v4s16, v4s32,  v4s64,  v8s1,  v8s8,  v8s16,
+      v8s32, v8s64, v16s1, v16s8, v16s16, v16s32, v16s64};
 
   auto allVectors = {v2s1,  v2s8,   v2s16,  v2s32, v2s64, v3s1,  v3s8,
                      v3s16, v3s32,  v3s64,  v4s1,  v4s8,  v4s16, v4s32,
                      v4s64, v8s1,   v8s8,   v8s16, v8s32, v8s64, v16s1,
                      v16s8, v16s16, v16s32, v16s64};
 
+  auto allShaderVectors = {v2s1, v2s8, v2s16, v2s32, v2s64,
+                           v3s1, v3s8, v3s16, v3s32, v3s64,
+                           v4s1, v4s8, v4s16, v4s32, v4s64};
+
   auto allScalarsAndVectors = {
       s1,   s8,   s16,   s32,   s64,   v2s1,  v2s8,  v2s16,  v2s32,  v2s64,
       v3s1, v3s8, v3s16, v3s32, v3s64, v4s1,  v4s8,  v4s16,  v4s32,  v4s64,
@@ -121,14 +133,16 @@ SPIRVLegalizerInfo::SPIRVLegalizerInfo(const SPIRVSubtarget &ST) {
       s16,   s32,   s64,   v2s16, v2s32, v2s64, v3s16,  v3s32,  v3s64,
       v4s16, v4s32, v4s64, v8s16, v8s32, v8s64, v16s16, v16s32, v16s64};
 
-  auto allFloatAndIntScalarsAndPtrs = {s8, s16, s32, s64, p0, p1,  p2,  p3,
-                                       p4, p5,  p6,  p7,  p8, p10, p11, p12};
+  auto allFloatAndIntScalarsAndPtrs = {s8, s16, s32, s64, p0, p1,  p2,  p3, p4,
+                                       p5, p6,  p7,  p8,  p9, p10, p11, p12};
+
+  auto allPtrs = {p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12};
 
-  auto allPtrs = {p0, p1, p2, p3, p4, p5, p6, p7, p8, p10, p11, p12};
+  auto &allowedVectorTypes = ST.isShader() ? allShaderVectors : allVectors;
 
   bool IsExtendedInts =
       ST.canUseExtension(
-          SPIRV::Extension::SPV_INTEL_arbitrary_precision_integers) ||
+          SPIRV::Extension::SPV_ALTERA_arbitrary_precision_integers) ||
       ST.canUseExtension(SPIRV::Extension::SPV_KHR_bit_instructions) ||
       ST.canUseExtension(SPIRV::Extension::SPV_INTEL_int4);
   auto extendedScalarsAndVectors =
@@ -148,14 +162,70 @@ SPIRVLegalizerInfo::SPIRVLegalizerInfo(const SPIRVSubtarget &ST) {
         return IsExtendedInts && Ty.isValid();
       };
 
-  for (auto Opc : getTypeFoldingSupportedOpcodes())
-    getActionDefinitionsBuilder(Opc).custom();
+  // The universal validation rules in the SPIR-V specification state that
+  // vector sizes are typically limited to 2, 3, or 4. However, larger vector
+  // sizes (8 and 16) are enabled when the Kernel capability is present. For
+  // shader execution models, vector sizes are strictly limited to 4. In
+  // non-shader contexts, vector sizes of 8 and 16 are also permitted, but
+  // arbitrary sizes (e.g., 6 or 11) are not.
+  uint32_t MaxVectorSize = ST.isShader() ? 4 : 16;
+
+  for (auto Opc : getTypeFoldingSupportedOpcodes()) {
+    if (Opc != G_EXTRACT_VECTOR_ELT)
+      getActionDefinitionsBuilder(Opc).custom();
+  }
 
-  getActionDefinitionsBuilder(G_GLOBAL_VALUE).alwaysLegal();
+  getActionDefinitionsBuilder(G_INTRINSIC_W_SIDE_EFFECTS).custom();
 
-  // TODO: add proper rules for vectors legalization.
-  getActionDefinitionsBuilder(
-      {G_BUILD_VECTOR, G_SHUFFLE_VECTOR, G_SPLAT_VECTOR})
+  getActionDefinitionsBuilder(G_SHUFFLE_VECTOR)
+      .legalForCartesianProduct(allowedVectorTypes, allowedVectorTypes)
+      .moreElementsToNextPow2(0)
+      .lowerIf(vectorElementCountIsGreaterThan(0, MaxVectorSize))
+      .moreElementsToNextPow2(1)
+      .lowerIf(vectorElementCountIsGreaterThan(1, MaxVectorSize))
+      .alwaysLegal();
+
+  getActionDefinitionsBuilder(G_EXTRACT_VECTOR_ELT)
+      .moreElementsToNextPow2(1)
+      .fewerElementsIf(vectorElementCountIsGreaterThan(1, MaxVectorSize),
+                       LegalizeMutations::changeElementCountTo(
+                           1, ElementCount::getFixed(MaxVectorSize)))
+      .custom();
+
+  // Illegal G_UNMERGE_VALUES instructions should be handled
+  // during the combine phase.
+  getActionDefinitionsBuilder(G_BUILD_VECTOR)
+      .legalIf(vectorElementCountIsLessThanOrEqualTo(0, MaxVectorSize))
+      .fewerElementsIf(vectorElementCountIsGreaterThan(0, MaxVectorSize),
+                       LegalizeMutations::changeElementCountTo(
+                           0, ElementCount::getFixed(MaxVectorSize)));
+
+  // When entering the legalizer, there should be no G_BITCAST instructions.
+  // They should all be calls to the `spv_bitcast` intrinsic. The call to
+  // the intrinsic will be converted to a G_BITCAST during legalization if
+  // the vectors are not legal. After using the rules to legalize a G_BITCAST,
+  // we turn it back into a call to the intrinsic with a custom rule to avoid
+  // potential machine verifier failures.
+  getActionDefinitionsBuilder(G_BITCAST)
+      .moreElementsToNextPow2(0)
+      .moreElementsToNextPow2(1)
+      .fewerElementsIf(vectorElementCountIsGreaterThan(0, MaxVectorSize),
+                       LegalizeMutations::changeElementCountTo(
+                           0, ElementCount::getFixed(MaxVectorSize)))
+      .lowerIf(vectorElementCountIsGreaterThan(1, MaxVectorSize))
+      .custom();
+
+  getActionDefinitionsBuilder(G_CONCAT_VECTORS)
+      .legalIf(vectorElementCountIsLessThanOrEqualTo(0, MaxVectorSize))
+      .moreElementsToNextPow2(0)
+      .lowerIf(vectorElementCountIsGreaterThan(0, MaxVectorSize))
+      .alwaysLegal();
+
+  getActionDefinitionsBuilder(G_SPLAT_VECTOR)
+      .legalIf(vectorElementCountIsLessThanOrEqualTo(0, MaxVectorSize))
+      .moreElementsToNextPow2(0)
+      .fewerElementsIf(vectorElementCountIsGreaterThan(0, MaxVectorSize),
+                       LegalizeMutations::changeElementSizeTo(0, MaxVectorSize))
       .alwaysLegal();
 
   // Vector Reduction Operations
@@ -164,7 +234,7 @@ SPIRVLegalizerInfo::SPIRVLegalizerInfo(const SPIRVSubtarget &ST) {
        G_VECREDUCE_ADD, G_VECREDUCE_MUL, G_VECREDUCE_FMUL, G_VECREDUCE_FMIN,
        G_VECREDUCE_FMAX, G_VECREDUCE_FMINIMUM, G_VECREDUCE_FMAXIMUM,
        G_VECREDUCE_OR, G_VECREDUCE_AND, G_VECREDUCE_XOR})
-      .legalFor(allVectors)
+      .legalFor(allowedVectorTypes)
       .scalarize(1)
       .lower();
 
@@ -172,20 +242,28 @@ SPIRVLegalizerInfo::SPIRVLegalizerInfo(const SPIRVSubtarget &ST) {
       .scalarize(2)
       .lower();
 
-  // Merge/Unmerge
-  // TODO: add proper legalization rules.
-  getActionDefinitionsBuilder(G_UNMERGE_VALUES).alwaysLegal();
+  // Illegal G_UNMERGE_VALUES instructions should be handled
+  // during the combine phase.
+  getActionDefinitionsBuilder(G_UNMERGE_VALUES)
+      .legalIf(vectorElementCountIsLessThanOrEqualTo(1, MaxVectorSize));
 
   getActionDefinitionsBuilder({G_MEMCPY, G_MEMMOVE})
+      .unsupportedIf(LegalityPredicates::any(typeIs(0, p9), typeIs(1, p9)))
       .legalIf(all(typeInSet(0, allPtrs), typeInSet(1, allPtrs)));
 
-  getActionDefinitionsBuilder(G_MEMSET).legalIf(
-      all(typeInSet(0, allPtrs), typeInSet(1, allIntScalars)));
+  getActionDefinitionsBuilder(G_MEMSET)
+      .unsupportedIf(typeIs(0, p9))
+      .legalIf(all(typeInSet(0, allPtrs), typeInSet(1, allIntScalars)));
 
   getActionDefinitionsBuilder(G_ADDRSPACE_CAST)
+      .unsupportedIf(
+          LegalityPredicates::any(all(typeIs(0, p9), typeIsNot(1, p9)),
+                                  all(typeIsNot(0, p9), typeIs(1, p9))))
       .legalForCartesianProduct(allPtrs, allPtrs);
 
-  getActionDefinitionsBuilder({G_LOAD, G_STORE}).legalIf(typeInSet(1, allPtrs));
+  getActionDefinitionsBuilder({G_LOAD, G_STORE})
+      .unsupportedIf(typeIs(1, p9))
+      .legalIf(typeInSet(1, allPtrs));
 
   getActionDefinitionsBuilder({G_SMIN, G_SMAX, G_UMIN, G_UMAX, G_ABS,
                                G_BITREVERSE, G_SADDSAT, G_UADDSAT, G_SSUBSAT,
@@ -228,7 +306,14 @@ SPIRVLegalizerInfo::SPIRVLegalizerInfo(const SPIRVSubtarget &ST) {
       all(typeInSet(0, allPtrsScalarsAndVectors),
           typeInSet(1, allPtrsScalarsAndVectors)));
 
-  getActionDefinitionsBuilder({G_IMPLICIT_DEF, G_FREEZE}).alwaysLegal();
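+  // G_IMPLICIT_DEF and G_FREEZE are legal for scalars, pointers, and the
+  // supported vector types; wider vectors are narrowed to at most
+  // MaxVectorSize elements.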
+  getActionDefinitionsBuilder({G_IMPLICIT_DEF, G_FREEZE})
+      .legalFor({s1})
+      .legalFor(allFloatAndIntScalarsAndPtrs)
+      .legalFor(allowedVectorTypes)
+      .moreElementsToNextPow2(0)
+      .fewerElementsIf(vectorElementCountIsGreaterThan(0, MaxVectorSize),
+                       LegalizeMutations::changeElementCountTo(
+                           0, ElementCount::getFixed(MaxVectorSize)));
 
   getActionDefinitionsBuilder({G_STACKSAVE, G_STACKRESTORE}).alwaysLegal();
 
@@ -247,9 +332,12 @@ SPIRVLegalizerInfo::SPIRVLegalizerInfo(const SPIRVSubtarget &ST) {
 
   // ST.canDirectlyComparePointers() for pointer args is supported in
   // legalizeCustom().
-  getActionDefinitionsBuilder(G_ICMP).customIf(
-      all(typeInSet(0, allBoolScalarsAndVectors),
-          typeInSet(1, allPtrsScalarsAndVectors)));
+  getActionDefinitionsBuilder(G_ICMP)
+      .unsupportedIf(LegalityPredicates::any(
+          all(typeIs(0, p9), typeInSet(1, allPtrs), typeIsNot(1, p9)),
+          all(typeInSet(0, allPtrs), typeIsNot(0, p9), typeIs(1, p9))))
+      .customIf(all(typeInSet(0, allBoolScalarsAndVectors),
+                    typeInSet(1, allPtrsScalarsAndVectors)));
 
   getActionDefinitionsBuilder(G_FCMP).legalIf(
       all(typeInSet(0, allBoolScalarsAndVectors),
@@ -287,6 +375,8 @@ SPIRVLegalizerInfo::SPIRVLegalizerInfo(const SPIRVSubtarget &ST) {
   // Pointer-handling.
   getActionDefinitionsBuilder(G_FRAME_INDEX).legalFor({p0});
 
+  getActionDefinitionsBuilder(G_GLOBAL_VALUE).legalFor(allPtrs);
+
   // Control-flow. In some cases (e.g. constants) s1 may be promoted to s32.
   getActionDefinitionsBuilder(G_BRCOND).legalFor({s1, s32});
 
@@ -353,6 +443,21 @@ SPIRVLegalizerInfo::SPIRVLegalizerInfo(const SPIRVSubtarget &ST) {
   verify(*ST.getInstrInfo());
 }
 
+static bool legalizeExtractVectorElt(LegalizerHelper &Helper, MachineInstr &MI,
+                                     SPIRVGlobalRegistry *GR) {
+  MachineIRBuilder &MIRBuilder = Helper.MIRBuilder;
+  Register DstReg = MI.getOperand(0).getReg();
+  Register SrcReg = MI.getOperand(1).getReg();
+  Register IdxReg = MI.getOperand(2).getReg();
+
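+  // Rewrite the generic extract as a call to the spv_extractelt intrinsic,
+  // which the SPIR-V instruction selector already handles, and drop the
+  // original instruction.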
+  MIRBuilder
+      .buildIntrinsic(Intrinsic::spv_extractelt, ArrayRef<Register>{DstReg})
+      .addUse(SrcReg)
+      .addUse(IdxReg);
+  MI.eraseFromParent();
+  return true;
+}
+
 static Register convertPtrToInt(Register Reg, LLT ConvTy, SPIRVType *SpvType,
                                 LegalizerHelper &Helper,
                                 MachineRegisterInfo &MRI,
@@ -374,6 +479,13 @@ bool SPIRVLegalizerInfo::legalizeCustom(
   default:
     // TODO: implement legalization for other opcodes.
     return true;
+  case TargetOpcode::G_BITCAST:
+    return legalizeBitcast(Helper, MI);
+  case TargetOpcode::G_EXTRACT_VECTOR_ELT:
+    return legalizeExtractVectorElt(Helper, MI, GR);
+  case TargetOpcode::G_INTRINSIC:
+  case TargetOpcode::G_INTRINSIC_W_SIDE_EFFECTS:
+    return legalizeIntrinsic(Helper, MI);
   case TargetOpcode::G_IS_FPCLASS:
     return legalizeIsFPClass(Helper, MI, LocObserver);
   case TargetOpcode::G_ICMP: {
@@ -400,6 +512,76 @@ bool SPIRVLegalizerInfo::legalizeCustom(
   }
 }
 
+bool SPIRVLegalizerInfo::legalizeIntrinsic(LegalizerHelper &Helper,
+                                           MachineInstr &MI) const {
+  LLVM_DEBUG(dbgs() << "legalizeIntrinsic: " << MI);
+
+  MachineIRBuilder &MIRBuilder = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *MIRBuilder.getMRI();
+  const SPIRVSubtarget &ST = MI.getMF()->getSubtarget<SPIRVSubtarget>();
+
+  auto IntrinsicID = cast<GIntrinsic>(MI).getIntrinsicID();
+  if (IntrinsicID == Intrinsic::spv_bitcast) {
+    LLVM_DEBUG(dbgs() << "Found a bitcast instruction\n");
+    Register DstReg = MI.getOperand(0).getReg();
+    Register SrcReg = MI.getOperand(2).getReg();
+    LLT DstTy = MRI.getType(DstReg);
+    LLT SrcTy = MRI.getType(SrcReg);
+
+    int32_t MaxVectorSize = ST.isShader() ? 4 : 16;
+
+    bool DstNeedsLegalization = false;
+    bool SrcNeedsLegalization = false;
+
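+    // A vector operand needs fixing if its element count is not a valid
+    // SPIR-V size: more than 4 elements and not a power of two, or more
+    // elements than this execution model allows.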
+    if (DstTy.isVector()) {
+      if (DstTy.getNumElements() > 4 &&
+          !isPowerOf2_32(DstTy.getNumElements())) {
+        DstNeedsLegalization = true;
+      }
+
+      if (DstTy.getNumElements() > MaxVectorSize) {
+        DstNeedsLegalization = true;
+      }
+    }
+
+    if (SrcTy.isVector()) {
+      if (SrcTy.getNumElements() > 4 &&
+          !isPowerOf2_32(SrcTy.getNumElements())) {
+        SrcNeedsLegalization = true;
+      }
+
+      if (SrcTy.getNumElements() > MaxVectorSize) {
+        SrcNeedsLegalization = true;
+      }
+    }
+
+    // If an spv_bitcast needs to be legalized, we convert it to G_BITCAST to
+    // allow using the generic legalization rules.
+    if (DstNeedsLegalization || SrcNeedsLegalization) {
+      LLVM_DEBUG(dbgs() << "Replacing with a G_BITCAST\n");
+      MIRBuilder.buildBitcast(DstReg, SrcReg);
+      MI.eraseFromParent();
+    }
+    return true;
+  }
+  return true;
+}
+
+bool SPIRVLegalizerInfo::legalizeBitcast(LegalizerHelper &Helper,
+                                         MachineInstr &MI) const {
+  // Once the G_BITCAST is using vectors that are allowed, we turn it back into
+  // an spv_bitcast to avoid verifier problems when the register types are the
+  // same for the source and the result. Note that the SPIR-V types associated
+  // with the bitcast can be different even if the register types are the same.
+  MachineIRBuilder &MIRBuilder = Helper.MIRBuilder;
+  Register DstReg = MI.getOperand(0).getReg();
+  Register SrcReg = MI.getOperand(1).getReg();
+  SmallVector<Register, 1> DstRegs = {DstReg};
+  MIRBuilder.buildIntrinsic(Intrinsic::spv_bitcast, DstRegs).addUse(SrcReg);
+  MI.eraseFromParent();
+  return true;
+}
+
 // Note this code was copied from LegalizerHelper::lowerISFPCLASS and adjusted
 // to ensure that all instructions created during the lowering have SPIR-V types
 // assigned to them.
diff --git a/llvm/lib/Target/SPIRV/SPIRVLegalizerInfo.h b/llvm/lib/Target/SPIRV/SPIRVLegalizerInfo.h
index eeefa4239c778..86e7e711caa60 100644
--- a/llvm/lib/Target/SPIRV/SPIRVLegalizerInfo.h
+++ b/llvm/lib/Target/SPIRV/SPIRVLegalizerInfo.h
@@ -29,11 +29,15 @@ class SPIRVLegalizerInfo : public LegalizerInfo {
 public:
   bool legalizeCustom(LegalizerHelper &Helper, MachineInstr &MI,
                       LostDebugLocObserver &LocObserver) const override;
+  bool legalizeIntrinsic(LegalizerHelper &Helper,
+                         MachineInstr &MI) const override;
+
   SPIRVLegalizerInfo(const SPIRVSubtarget &ST);
 
 private:
   bool legalizeIsFPClass(LegalizerHelper &Helper, MachineInstr &MI,
                          LostDebugLocObserver &LocObserver) const;
+  bool legalizeBitcast(LegalizerHelper &Helper, MachineInstr &MI) const;
 };
 } // namespace llvm
 #endif // LLVM_LIB_TARGET_SPIRV_SPIRVMACHINELEGALIZER_H
diff --git a/llvm/lib/Target/SPIRV/SPIRVModuleAnalysis.cpp b/llvm/lib/Target/SPIRV/SPIRVModuleAnalysis.cpp
index 00f750b88a608..2feb73d8dedfa 100644
--- a/llvm/lib/Target/SPIRV/SPIRVModuleAnalysis.cpp
+++ b/llvm/lib/Target/SPIRV/SPIRVModuleAnalysis.cpp
@@ -1692,6 +1692,27 @@ void addInstrRequirements(const MachineInstr &MI,
     Reqs.addCapability(SPIRV::Capability::GroupNonUniformRotateKHR);
     Reqs.addCapability(SPIRV::Capability::GroupNonUniform);
     break;
+  case SPIRV::OpFixedCosALTERA:
+  case SPIRV::OpFixedSinALTERA:
+  case SPIRV::OpFixedCosPiALTERA:
+  case SPIRV::OpFixedSinPiALTERA:
+  case SPIRV::OpFixedExpALTERA:
+  case SPIRV::OpFixedLogALTERA:
+  case SPIRV::OpFixedRecipALTERA:
+  case SPIRV::OpFixedSqrtALTERA:
+  case SPIRV::OpFixedSinCosALTERA:
+  case SPIRV::OpFixedSinCosPiALTERA:
+  case SPIRV::OpFixedRsqrtALTERA:
+    if (!ST.canUseExtension(
+            SPIRV::Extension::SPV_ALTERA_arbitrary_precision_fixed_point))
+      report_fatal_error("This instruction requires the "
+                         "following SPIR-V extension: "
+                         "SPV_ALTERA_arbitrary_precision_fixed_point",
+                         false);
+    Reqs.addExtension(
+        SPIRV::Extension::SPV_ALTERA_arbitrary_precision_fixed_point);
+    Reqs.addCapability(SPIRV::Capability::ArbitraryPrecisionFixedPointALTERA);
+    break;
   case SPIRV::OpGroupIMulKHR:
   case SPIRV::OpGroupFMulKHR:
   case SPIRV::OpGroupBitwiseAndKHR:
diff --git a/llvm/lib/Target/SPIRV/SPIRVPreLegalizer.cpp b/llvm/lib/Target/SPIRV/SPIRVPreLegalizer.cpp
index 0f4b3d59b904a..7ca463460ffad 100644
--- a/llvm/lib/Target/SPIRV/SPIRVPreLegalizer.cpp
+++ b/llvm/lib/Target/SPIRV/SPIRVPreLegalizer.cpp
@@ -509,7 +509,7 @@ generateAssignInstrs(MachineFunction &MF, SPIRVGlobalRegistry *GR,
 
   bool IsExtendedInts =
       ST->canUseExtension(
-          SPIRV::Extension::SPV_INTEL_arbitrary_precision_integers) ||
+          SPIRV::Extension::SPV_ALTERA_arbitrary_precision_integers) ||
       ST->canUseExtension(SPIRV::Extension::SPV_KHR_bit_instructions) ||
       ST->canUseExtension(SPIRV::Extension::SPV_INTEL_int4);
 
diff --git a/llvm/lib/Target/SPIRV/SPIRVSymbolicOperands.td b/llvm/lib/Target/SPIRV/SPIRVSymbolicOperands.td
index f02a587013856..94e0138c66487 100644
--- a/llvm/lib/Target/SPIRV/SPIRVSymbolicOperands.td
+++ b/llvm/lib/Target/SPIRV/SPIRVSymbolicOperands.td
@@ -318,7 +318,7 @@ defm SPV_INTEL_io_pipes : ExtensionOperand<63, [EnvOpenCL]>;
 defm SPV_KHR_ray_tracing : ExtensionOperand<64, [EnvVulkan]>;
 defm SPV_KHR_ray_query : ExtensionOperand<65, [EnvVulkan]>;
 defm SPV_INTEL_fpga_memory_accesses : ExtensionOperand<66, [EnvOpenCL]>;
-defm SPV_INTEL_arbitrary_precision_integers : ExtensionOperand<67, [EnvOpenCL]>;
+defm SPV_ALTERA_arbitrary_precision_integers : ExtensionOperand<67, [EnvOpenCL]>;
 defm SPV_EXT_shader_atomic_float_add
     : ExtensionOperand<68, [EnvVulkan, EnvOpenCL]>;
 defm SPV_KHR_terminate_invocation : ExtensionOperand<69, [EnvVulkan]>;
@@ -390,6 +390,7 @@ defm SPV_KHR_maximal_reconvergence : ExtensionOperand<128, [EnvVulkan]>;
 defm SPV_INTEL_bfloat16_arithmetic
     : ExtensionOperand<129, [EnvVulkan, EnvOpenCL]>;
 defm SPV_INTEL_16bit_atomics : ExtensionOperand<130, [EnvVulkan, EnvOpenCL]>;
+defm SPV_ALTERA_arbitrary_precision_fixed_point : ExtensionOperand<131, [EnvOpenCL, EnvVulkan]>;
 
 //===----------------------------------------------------------------------===//
 // Multiclass used to define Capabilities enum values and at the same time
@@ -549,7 +550,7 @@ defm ComputeDerivativeGroupLinearNV : CapabilityOperand<5350, 0, 0, [], []>;
 defm FragmentDensityEXT : CapabilityOperand<5291, 0, 0, [], [Shader]>;
 defm PhysicalStorageBufferAddressesEXT : CapabilityOperand<5347, 0, 0, [], [Shader]>;
 defm CooperativeMatrixNV : CapabilityOperand<5357, 0, 0, [], [Shader]>;
-defm ArbitraryPrecisionIntegersINTEL : CapabilityOperand<5844, 0, 0, [SPV_INTEL_arbitrary_precision_integers], [Int8, Int16]>;
+defm ArbitraryPrecisionIntegersALTERA : CapabilityOperand<5844, 0, 0, [SPV_ALTERA_arbitrary_precision_integers], [Int8, Int16]>;
 defm OptNoneINTEL : CapabilityOperand<6094, 0, 0, [SPV_INTEL_optnone], []>;
 defm OptNoneEXT : CapabilityOperand<6094, 0, 0, [SPV_EXT_optnone], []>;
 defm BitInstructions : CapabilityOperand<6025, 0, 0, [SPV_KHR_bit_instructions], []>;
@@ -615,6 +616,7 @@ defm BFloat16TypeKHR : CapabilityOperand<5116, 0, 0, [SPV_KHR_bfloat16], []>;
 defm BFloat16DotProductKHR : CapabilityOperand<5117, 0, 0, [SPV_KHR_bfloat16], [BFloat16TypeKHR]>;
 defm BFloat16CooperativeMatrixKHR : CapabilityOperand<5118, 0, 0, [SPV_KHR_bfloat16], [BFloat16TypeKHR, CooperativeMatrixKHR]>;
 defm BlockingPipesALTERA : CapabilityOperand<5945, 0, 0, [SPV_ALTERA_blocking_pipes], []>;
+defm ArbitraryPrecisionFixedPointALTERA : CapabilityOperand<5922, 0, 0, [SPV_ALTERA_arbitrary_precision_fixed_point], []>;
 
 //===----------------------------------------------------------------------===//
 // Multiclass used to define SourceLanguage enum values and at the same time
diff --git a/llvm/lib/Target/Sparc/SparcCallingConv.td b/llvm/lib/Target/Sparc/SparcCallingConv.td
index 8afd0a7fc09ad..6214000ddce5b 100644
--- a/llvm/lib/Target/Sparc/SparcCallingConv.td
+++ b/llvm/lib/Target/Sparc/SparcCallingConv.td
@@ -17,6 +17,9 @@
 def CC_Sparc32 : CallingConv<[
   // Custom assign SRet to [sp+64].
   CCIfSRet<CCCustom<"CC_Sparc_Assign_SRet">>,
+  // f128 arguments are passed indirectly, using i32 pointers.
+  // FIXME: GCC in soft-float mode passes f128 as if it were two i64 values.
+  CCIfType<[f128], CCPassIndirect<i32>>,
   // i32 f32 arguments get passed in integer registers if there is space.
   CCIfType<[i32, f32], CCAssignToReg<[I0, I1, I2, I3, I4, I5]>>,
   // f64 arguments are split and passed through registers or through stack.
@@ -24,20 +27,20 @@ def CC_Sparc32 : CallingConv<[
   // As are v2i32 arguments (this would be the default behavior for
   // v2i32 if it wasn't allocated to the IntPair register-class)
   CCIfType<[v2i32], CCCustom<"CC_Sparc_Assign_Split_64">>,
-
-
   // Alternatively, they are assigned to the stack in 4-byte aligned units.
   CCAssignToStack<4, 4>
 ]>;
 
+
 def RetCC_Sparc32 : CallingConv<[
   CCIfType<[i32], CCAssignToReg<[I0, I1, I2, I3, I4, I5]>>,
   CCIfType<[f32], CCAssignToReg<[F0, F1, F2, F3]>>,
   CCIfType<[f64], CCAssignToReg<[D0, D1]>>,
+  // FIXME: GCC in soft-float mode passes f128 as if it were two i64 values.
+  CCIfType<[f128], CCIfInReg<CCAssignToReg<[Q0, Q1]>>>,
   CCIfType<[v2i32], CCCustom<"CC_Sparc_Assign_Ret_Split_64">>
 ]>;
 
-
 //===----------------------------------------------------------------------===//
 // SPARC v9 64-bit.
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/Sparc/SparcISelLowering.cpp b/llvm/lib/Target/Sparc/SparcISelLowering.cpp
index a4a9eafd52ffe..de8768a7cdbca 100644
--- a/llvm/lib/Target/Sparc/SparcISelLowering.cpp
+++ b/llvm/lib/Target/Sparc/SparcISelLowering.cpp
@@ -440,6 +440,7 @@ SDValue SparcTargetLowering::LowerFormalArguments_32(
   MachineFunction &MF = DAG.getMachineFunction();
   MachineRegisterInfo &RegInfo = MF.getRegInfo();
   SparcMachineFunctionInfo *FuncInfo = MF.getInfo<SparcMachineFunctionInfo>();
+  EVT PtrVT = getPointerTy(DAG.getDataLayout());
 
   // Assign locations to all of the incoming arguments.
   SmallVector<CCValAssign, 16> ArgLocs;
@@ -453,6 +454,7 @@ SDValue SparcTargetLowering::LowerFormalArguments_32(
   unsigned InIdx = 0;
   for (unsigned i = 0, e = ArgLocs.size(); i != e; ++i, ++InIdx) {
     CCValAssign &VA = ArgLocs[i];
+    EVT LocVT = VA.getLocVT();
 
     if (Ins[InIdx].Flags.isSRet()) {
       if (InIdx != 0)
@@ -466,6 +468,7 @@ SDValue SparcTargetLowering::LowerFormalArguments_32(
       continue;
     }
 
+    SDValue Arg;
     if (VA.isRegLoc()) {
       if (VA.needsCustom()) {
         assert(VA.getLocVT() == MVT::f64 || VA.getLocVT() == MVT::v2i32);
@@ -500,76 +503,85 @@ SDValue SparcTargetLowering::LowerFormalArguments_32(
       }
       Register VReg = RegInfo.createVirtualRegister(&SP::IntRegsRegClass);
       MF.getRegInfo().addLiveIn(VA.getLocReg(), VReg);
-      SDValue Arg = DAG.getCopyFromReg(Chain, dl, VReg, MVT::i32);
-      if (VA.getLocVT() == MVT::f32)
-        Arg = DAG.getNode(ISD::BITCAST, dl, MVT::f32, Arg);
-      else if (VA.getLocVT() != MVT::i32) {
-        Arg = DAG.getNode(ISD::AssertSext, dl, MVT::i32, Arg,
-                          DAG.getValueType(VA.getLocVT()));
-        Arg = DAG.getNode(ISD::TRUNCATE, dl, VA.getLocVT(), Arg);
+      Arg = DAG.getCopyFromReg(Chain, dl, VReg, MVT::i32);
+      if (VA.getLocInfo() != CCValAssign::Indirect) {
+        if (VA.getLocVT() == MVT::f32)
+          Arg = DAG.getNode(ISD::BITCAST, dl, MVT::f32, Arg);
+        else if (VA.getLocVT() != MVT::i32) {
+          Arg = DAG.getNode(ISD::AssertSext, dl, MVT::i32, Arg,
+                            DAG.getValueType(VA.getLocVT()));
+          Arg = DAG.getNode(ISD::TRUNCATE, dl, VA.getLocVT(), Arg);
+        }
+        InVals.push_back(Arg);
+        continue;
       }
-      InVals.push_back(Arg);
-      continue;
-    }
+    } else {
+      assert(VA.isMemLoc());
 
-    assert(VA.isMemLoc());
+      unsigned Offset = VA.getLocMemOffset() + StackOffset;
 
-    unsigned Offset = VA.getLocMemOffset()+StackOffset;
-    auto PtrVT = getPointerTy(DAG.getDataLayout());
+      if (VA.needsCustom()) {
+        assert(VA.getValVT() == MVT::f64 || VA.getValVT() == MVT::v2i32);
+        // If it is double-word aligned, just load.
+        if (Offset % 8 == 0) {
+          int FI = MF.getFrameInfo().CreateFixedObject(8, Offset, true);
+          SDValue FIPtr = DAG.getFrameIndex(FI, PtrVT);
+          SDValue Load = DAG.getLoad(VA.getValVT(), dl, Chain, FIPtr,
+                                     MachinePointerInfo());
+          InVals.push_back(Load);
+          continue;
+        }
 
-    if (VA.needsCustom()) {
-      assert(VA.getValVT() == MVT::f64 || VA.getValVT() == MVT::v2i32);
-      // If it is double-word aligned, just load.
-      if (Offset % 8 == 0) {
-        int FI = MF.getFrameInfo().CreateFixedObject(8,
-                                                     Offset,
-                                                     true);
+        int FI = MF.getFrameInfo().CreateFixedObject(4, Offset, true);
         SDValue FIPtr = DAG.getFrameIndex(FI, PtrVT);
-        SDValue Load =
-            DAG.getLoad(VA.getValVT(), dl, Chain, FIPtr, MachinePointerInfo());
-        InVals.push_back(Load);
-        continue;
-      }
+        SDValue HiVal =
+            DAG.getLoad(MVT::i32, dl, Chain, FIPtr, MachinePointerInfo());
+        int FI2 = MF.getFrameInfo().CreateFixedObject(4, Offset + 4, true);
+        SDValue FIPtr2 = DAG.getFrameIndex(FI2, PtrVT);
 
-      int FI = MF.getFrameInfo().CreateFixedObject(4,
-                                                   Offset,
-                                                   true);
-      SDValue FIPtr = DAG.getFrameIndex(FI, PtrVT);
-      SDValue HiVal =
-          DAG.getLoad(MVT::i32, dl, Chain, FIPtr, MachinePointerInfo());
-      int FI2 = MF.getFrameInfo().CreateFixedObject(4,
-                                                    Offset+4,
-                                                    true);
-      SDValue FIPtr2 = DAG.getFrameIndex(FI2, PtrVT);
+        SDValue LoVal =
+            DAG.getLoad(MVT::i32, dl, Chain, FIPtr2, MachinePointerInfo());
 
-      SDValue LoVal =
-          DAG.getLoad(MVT::i32, dl, Chain, FIPtr2, MachinePointerInfo());
+        if (IsLittleEndian)
+          std::swap(LoVal, HiVal);
 
-      if (IsLittleEndian)
-        std::swap(LoVal, HiVal);
+        SDValue WholeValue =
+            DAG.getNode(ISD::BUILD_PAIR, dl, MVT::i64, LoVal, HiVal);
+        WholeValue = DAG.getNode(ISD::BITCAST, dl, VA.getValVT(), WholeValue);
+        InVals.push_back(WholeValue);
+        continue;
+      }
 
-      SDValue WholeValue =
-        DAG.getNode(ISD::BUILD_PAIR, dl, MVT::i64, LoVal, HiVal);
-      WholeValue = DAG.getNode(ISD::BITCAST, dl, VA.getValVT(), WholeValue);
-      InVals.push_back(WholeValue);
-      continue;
+      int FI = MF.getFrameInfo().CreateFixedObject(LocVT.getSizeInBits() / 8,
+                                                   Offset, true);
+      SDValue FIPtr = DAG.getFrameIndex(FI, PtrVT);
+      SDValue Load = DAG.getLoad(LocVT, dl, Chain, FIPtr,
+                                 MachinePointerInfo::getFixedStack(MF, FI));
+      if (VA.getLocInfo() != CCValAssign::Indirect) {
+        InVals.push_back(Load);
+        continue;
+      }
+      Arg = Load;
     }
 
-    int FI = MF.getFrameInfo().CreateFixedObject(4,
-                                                 Offset,
-                                                 true);
-    SDValue FIPtr = DAG.getFrameIndex(FI, PtrVT);
-    SDValue Load ;
-    if (VA.getValVT() == MVT::i32 || VA.getValVT() == MVT::f32) {
-      Load = DAG.getLoad(VA.getValVT(), dl, Chain, FIPtr, MachinePointerInfo());
-    } else if (VA.getValVT() == MVT::f128) {
-      report_fatal_error("SPARCv8 does not handle f128 in calls; "
-                         "pass indirectly");
-    } else {
-      // We shouldn't see any other value types here.
-      llvm_unreachable("Unexpected ValVT encountered in frame lowering.");
+    assert(VA.getLocInfo() == CCValAssign::Indirect);
+
+    SDValue ArgValue =
+        DAG.getLoad(VA.getValVT(), dl, Chain, Arg, MachinePointerInfo());
+    InVals.push_back(ArgValue);
+
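+    // The indirectly passed argument may have been split into several parts
+    // at the call site; load each remaining part of the same original
+    // argument from its part offset within the spill slot.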
+    unsigned ArgIndex = Ins[InIdx].OrigArgIndex;
+    assert(Ins[InIdx].PartOffset == 0);
+    while (i + 1 != e && Ins[InIdx + 1].OrigArgIndex == ArgIndex) {
+      CCValAssign &PartVA = ArgLocs[i + 1];
+      unsigned PartOffset = Ins[InIdx + 1].PartOffset;
+      SDValue Address = DAG.getMemBasePlusOffset(
+          ArgValue, TypeSize::getFixed(PartOffset), dl);
+      InVals.push_back(DAG.getLoad(PartVA.getValVT(), dl, Chain, Address,
+                                   MachinePointerInfo()));
+      ++i;
+      ++InIdx;
     }
-    InVals.push_back(Load);
   }
 
   if (MF.getFunction().hasStructRetAttr()) {
@@ -836,6 +848,8 @@ SparcTargetLowering::LowerCall_32(TargetLowering::CallLoweringInfo &CLI,
   CallingConv::ID CallConv              = CLI.CallConv;
   bool isVarArg                         = CLI.IsVarArg;
   MachineFunction &MF = DAG.getMachineFunction();
+  LLVMContext &Ctx = *DAG.getContext();
+  EVT PtrVT = getPointerTy(MF.getDataLayout());
 
   // Analyze operands of the call, assigning locations to each operand.
   SmallVector<CCValAssign, 16> ArgLocs;
@@ -914,7 +928,9 @@ SparcTargetLowering::LowerCall_32(TargetLowering::CallLoweringInfo &CLI,
     // Promote the value if needed.
     switch (VA.getLocInfo()) {
     default: llvm_unreachable("Unknown loc info!");
-    case CCValAssign::Full: break;
+    case CCValAssign::Full:
+    case CCValAssign::Indirect:
+      break;
     case CCValAssign::SExt:
       Arg = DAG.getNode(ISD::SIGN_EXTEND, dl, VA.getLocVT(), Arg);
       break;
@@ -1013,6 +1029,49 @@ SparcTargetLowering::LowerCall_32(TargetLowering::CallLoweringInfo &CLI,
       continue;
     }
 
+    if (VA.getLocInfo() == CCValAssign::Indirect) {
+      // Store the argument in a stack slot and pass its address.
+      unsigned ArgIndex = Outs[realArgIdx].OrigArgIndex;
+      assert(Outs[realArgIdx].PartOffset == 0);
+
+      EVT SlotVT;
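+      // Size the stack slot for the whole original argument: if it was split
+      // into several register parts, account for all of them; otherwise the
+      // single part's type is enough.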
+      if (i + 1 != e && Outs[realArgIdx + 1].OrigArgIndex == ArgIndex) {
+        Type *OrigArgType = CLI.Args[ArgIndex].Ty;
+        EVT OrigArgVT = getValueType(MF.getDataLayout(), OrigArgType);
+        MVT PartVT =
+            getRegisterTypeForCallingConv(Ctx, CLI.CallConv, OrigArgVT);
+        unsigned N =
+            getNumRegistersForCallingConv(Ctx, CLI.CallConv, OrigArgVT);
+        SlotVT = EVT::getIntegerVT(Ctx, PartVT.getSizeInBits() * N);
+      } else {
+        SlotVT = Outs[realArgIdx].VT;
+      }
+
+      SDValue SpillSlot = DAG.CreateStackTemporary(SlotVT);
+      int FI = cast<FrameIndexSDNode>(SpillSlot)->getIndex();
+      MemOpChains.push_back(
+          DAG.getStore(Chain, dl, Arg, SpillSlot,
+                       MachinePointerInfo::getFixedStack(MF, FI)));
+      // If the original argument was split (e.g. f128), we need
+      // to store all parts of it here (and pass just one address).
+      while (i + 1 != e && Outs[realArgIdx + 1].OrigArgIndex == ArgIndex) {
+        SDValue PartValue = OutVals[realArgIdx + 1];
+        unsigned PartOffset = Outs[realArgIdx + 1].PartOffset;
+        SDValue Address = DAG.getMemBasePlusOffset(
+            DAG.getFrameIndex(FI, PtrVT), TypeSize::getFixed(PartOffset), dl);
+        MemOpChains.push_back(
+            DAG.getStore(Chain, dl, PartValue, Address,
+                         MachinePointerInfo::getFixedStack(MF, FI)));
+        assert((PartOffset + PartValue.getValueType().getStoreSize() <=
+                SlotVT.getStoreSize()) &&
+               "Not enough space for argument part!");
+        ++i;
+        ++realArgIdx;
+      }
+
+      Arg = SpillSlot;
+    }
+
     // Arguments that can be passed on register must be kept at
     // RegsToPass vector
     if (VA.isRegLoc()) {
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
index 98cb7aba562c4..e0c527b9b2581 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
@@ -1060,7 +1060,7 @@ EVT WebAssemblyTargetLowering::getSetCCResultType(const DataLayout &DL,
 }
 
 bool WebAssemblyTargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
-                                                   const CallInst &I,
+                                                   const CallBase &I,
                                                    MachineFunction &MF,
                                                    unsigned Intrinsic) const {
   switch (Intrinsic) {
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.h b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.h
index f7052989b3c75..c37970f458e36 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.h
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.h
@@ -58,7 +58,7 @@ class WebAssemblyTargetLowering final : public TargetLowering {
   bool isOffsetFoldingLegal(const GlobalAddressSDNode *GA) const override;
   EVT getSetCCResultType(const DataLayout &DL, LLVMContext &Context,
                          EVT VT) const override;
-  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallInst &I,
+  bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallBase &I,
                           MachineFunction &MF,
                           unsigned Intrinsic) const override;
 
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyInstrInfo.td b/llvm/lib/Target/WebAssembly/WebAssemblyInstrInfo.td
index 13d048a98d6ea..ce4db2e112fa0 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyInstrInfo.td
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyInstrInfo.td
@@ -460,8 +460,8 @@ def : Pat<(i64 (WebAssemblyWrapperREL texternalsym:$addr)),
 include "WebAssemblyInstrMemory.td"
 include "WebAssemblyInstrCall.td"
 include "WebAssemblyInstrControl.td"
-include "WebAssemblyInstrInteger.td"
 include "WebAssemblyInstrConv.td"
+include "WebAssemblyInstrInteger.td"
 include "WebAssemblyInstrFloat.td"
 include "WebAssemblyInstrAtomics.td"
 include "WebAssemblyInstrSIMD.td"
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyInstrInteger.td b/llvm/lib/Target/WebAssembly/WebAssemblyInstrInteger.td
index d4c8f92c883e7..991507e883f28 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyInstrInteger.td
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyInstrInteger.td
@@ -107,6 +107,13 @@ def : Pat<(rotr I32:$lhs, (and I32:$rhs, 31)), (ROTR_I32 I32:$lhs, I32:$rhs)>;
 def : Pat<(rotl I64:$lhs, (and I64:$rhs, 63)), (ROTL_I64 I64:$lhs, I64:$rhs)>;
 def : Pat<(rotr I64:$lhs, (and I64:$rhs, 63)), (ROTR_I64 I64:$lhs, I64:$rhs)>;
 
+def : Pat<(shl I64:$lhs, (zext (and I32:$rhs, 63))),
+                               (SHL_I64 I64:$lhs, (I64_EXTEND_U_I32 I32:$rhs))>;
+def : Pat<(sra I64:$lhs, (zext (and I32:$rhs, 63))),
+                               (SHR_S_I64 I64:$lhs, (I64_EXTEND_U_I32 I32:$rhs))>;
+def : Pat<(srl I64:$lhs, (zext (and I32:$rhs, 63))),
+                               (SHR_U_I64 I64:$lhs, (I64_EXTEND_U_I32 I32:$rhs))>;
+
 defm SELECT_I32 : I<(outs I32:$dst), (ins I32:$lhs, I32:$rhs, I32:$cond),
                     (outs), (ins),
                     [(set I32:$dst, (select I32:$cond, I32:$lhs, I32:$rhs))],
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 50df19b3e6e47..6e16bb148b5df 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -2073,8 +2073,8 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
 
     if (Subtarget.hasVBMI2()) {
       for (auto VT : {MVT::v32i16, MVT::v16i32, MVT::v8i64}) {
-        setOperationAction(ISD::FSHL, VT, Custom);
-        setOperationAction(ISD::FSHR, VT, Custom);
+        setOperationAction(ISD::FSHL, VT, Legal);
+        setOperationAction(ISD::FSHR, VT, Legal);
       }
 
       setOperationAction(ISD::ROTL, MVT::v32i16, Custom);
@@ -2089,8 +2089,8 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
   if (!Subtarget.useSoftFloat() && Subtarget.hasVBMI2()) {
     for (auto VT : {MVT::v8i16, MVT::v4i32, MVT::v2i64, MVT::v16i16, MVT::v8i32,
                     MVT::v4i64}) {
-      setOperationAction(ISD::FSHL, VT, Custom);
-      setOperationAction(ISD::FSHR, VT, Custom);
+      setOperationAction(ISD::FSHL, VT, Subtarget.hasVLX() ? Legal : Custom);
+      setOperationAction(ISD::FSHR, VT, Subtarget.hasVLX() ? Legal : Custom);
     }
   }
 
@@ -2709,6 +2709,8 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
                        ISD::STRICT_FP_EXTEND,
                        ISD::FP_ROUND,
                        ISD::STRICT_FP_ROUND,
+                       ISD::FSHL,
+                       ISD::FSHR,
                        ISD::INTRINSIC_VOID,
                        ISD::INTRINSIC_WO_CHAIN,
                        ISD::INTRINSIC_W_CHAIN});
@@ -3102,7 +3104,7 @@ static bool useVPTERNLOG(const X86Subtarget &Subtarget, MVT VT) {
 }
 
 bool X86TargetLowering::getTgtMemIntrinsic(IntrinsicInfo &Info,
-                                           const CallInst &I,
+                                           const CallBase &I,
                                            MachineFunction &MF,
                                            unsigned Intrinsic) const {
   Info.flags = MachineMemOperand::MONone;
@@ -31322,19 +31324,15 @@ static SDValue LowerFunnelShift(SDValue Op, const X86Subtarget &Subtarget,
     bool IsCstSplat = X86::isConstantSplat(Amt, APIntShiftAmt);
     unsigned NumElts = VT.getVectorNumElements();
 
-    if (Subtarget.hasVBMI2() && EltSizeInBits > 8) {
-
-      if (IsCstSplat) {
-        if (IsFSHR)
-          std::swap(Op0, Op1);
-        uint64_t ShiftAmt = APIntShiftAmt.urem(EltSizeInBits);
-        SDValue Imm = DAG.getTargetConstant(ShiftAmt, DL, MVT::i8);
-        return getAVX512Node(IsFSHR ? X86ISD::VSHRD : X86ISD::VSHLD, DL, VT,
-                             {Op0, Op1, Imm}, DAG, Subtarget);
-      }
+    // For non-VLX VBMI2 targets, widen 128/256-bit to 512-bit so
+    // the rest of the lowering/isel can select the VBMI2 forms.
+    // Only Custom types (v8i16, v4i32, v2i64, v16i16, v8i32, v4i64) can
+    // reach LowerFunnelShift with VBMI2 but no VLX, so no type check needed.
+    if (Subtarget.hasVBMI2() && !Subtarget.hasVLX() && EltSizeInBits > 8) {
       return getAVX512Node(IsFSHR ? ISD::FSHR : ISD::FSHL, DL, VT,
                            {Op0, Op1, Amt}, DAG, Subtarget);
     }
+
     assert((VT == MVT::v16i8 || VT == MVT::v32i8 || VT == MVT::v64i8 ||
             VT == MVT::v8i16 || VT == MVT::v16i16 || VT == MVT::v32i16 ||
             VT == MVT::v4i32 || VT == MVT::v8i32 || VT == MVT::v16i32) &&
@@ -53525,18 +53523,48 @@ static SDValue combineMaskedStore(SDNode *N, SelectionDAG &DAG,
   if (Mst->isCompressingStore())
     return SDValue();
 
-  EVT VT = Mst->getValue().getValueType();
+  if (SDValue ScalarStore = reduceMaskedStoreToScalarStore(Mst, DAG, Subtarget))
+    return ScalarStore;
+
   const TargetLowering &TLI = DAG.getTargetLoweringInfo();
+  SDLoc DL(N);
 
-  if (Mst->isTruncatingStore())
-    return SDValue();
+  SDValue Mask = Mst->getMask();
+  SDValue Value = Mst->getValue();
+  EVT MemVT = Mst->getMemoryVT();
+  EVT VT = Value.getValueType();
 
-  if (SDValue ScalarStore = reduceMaskedStoreToScalarStore(Mst, DAG, Subtarget))
-    return ScalarStore;
+  // See if the truncating store can be a saturating truncated store.
+  if (Mst->isTruncatingStore()) {
+    if (VT.isVector() && MemVT.isVector() && VT.getScalarType().isInteger() &&
+        MemVT.getScalarType().isInteger() &&
+        VT.getVectorNumElements() == MemVT.getVectorNumElements() &&
+        Subtarget.hasBWI() && Subtarget.hasVLX()) {
+
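+      // Try to peel a signed or unsigned saturation pattern off the stored
+      // value; if one is found, emit the corresponding masked saturating
+      // truncating-store node instead.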
+      SDValue SatSrc;
+      unsigned Opc;
+      if (SDValue SVal = detectSSatPattern(Value, MemVT)) {
+        SatSrc = SVal;
+        Opc = X86ISD::VMTRUNCSTORES;
+      } else if (SDValue UVal = detectUSatPattern(Value, MemVT, DAG, DL)) {
+        SatSrc = UVal;
+        Opc = X86ISD::VMTRUNCSTOREUS;
+      } else {
+        return SDValue();
+      }
+
+      SDVTList VTs = DAG.getVTList(MVT::Other);
+      SDValue Ops[] = {Mst->getChain(), SatSrc, Mst->getBasePtr(), Mask};
+      MachineMemOperand *MMO = Mst->getMemOperand();
+      return DAG.getMemIntrinsicNode(Opc, DL, VTs, Ops, MemVT, MMO);
+    }
+
+    // Otherwise don't combine if this store already truncates.
+    return SDValue();
+  }
 
   // If the mask value has been legalized to a non-boolean vector, try to
   // simplify ops leading up to it. We only demand the MSB of each lane.
-  SDValue Mask = Mst->getMask();
   if (Mask.getScalarValueSizeInBits() != 1) {
     APInt DemandedBits(APInt::getSignMask(VT.getScalarSizeInBits()));
     if (TLI.SimplifyDemandedBits(Mask, DemandedBits, DCI)) {
@@ -53552,14 +53580,12 @@ static SDValue combineMaskedStore(SDNode *N, SelectionDAG &DAG,
                                 Mst->getAddressingMode());
   }
 
-  SDValue Value = Mst->getValue();
   if (Value.getOpcode() == ISD::TRUNCATE && Value.getNode()->hasOneUse() &&
-      TLI.isTruncStoreLegal(Value.getOperand(0).getValueType(),
-                            Mst->getMemoryVT())) {
-    return DAG.getMaskedStore(Mst->getChain(), SDLoc(N), Value.getOperand(0),
-                              Mst->getBasePtr(), Mst->getOffset(), Mask,
-                              Mst->getMemoryVT(), Mst->getMemOperand(),
-                              Mst->getAddressingMode(), true);
+      TLI.isTruncStoreLegal(Value.getOperand(0).getValueType(), MemVT)) {
+    return DAG.getMaskedStore(Mst->getChain(), DL, Value.getOperand(0),
+                              Mst->getBasePtr(), Mst->getOffset(), Mask, MemVT,
+                              Mst->getMemOperand(), Mst->getAddressingMode(),
+                              true);
   }
 
   return SDValue();
@@ -57637,6 +57663,40 @@ static SDValue combineFP_TO_xINT_SAT(SDNode *N, SelectionDAG &DAG,
   return SDValue();
 }
 
+// Combiner: turn uniform-constant splat funnel shifts into VSHLD/VSHRD
+static SDValue combineFunnelShift(SDNode *N, SelectionDAG &DAG,
+                                  TargetLowering::DAGCombinerInfo &DCI,
+                                  const X86Subtarget &Subtarget) {
+  SDLoc DL(N);
+  SDValue Op0 = N->getOperand(0);
+  SDValue Op1 = N->getOperand(1);
+  SDValue Amt = N->getOperand(2);
+  EVT VT = Op0.getValueType();
+
+  if (!VT.isVector())
+    return SDValue();
+
+  // Only combine if the operation is legal for this type.
+  // This ensures we don't try to convert types that need to be
+  // widened/promoted.
+  if (!DAG.getTargetLoweringInfo().isOperationLegal(N->getOpcode(), VT))
+    return SDValue();
+
+  unsigned EltSize = VT.getScalarSizeInBits();
+  APInt ShiftVal;
+  if (!X86::isConstantSplat(Amt, ShiftVal))
+    return SDValue();
+
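+  // Funnel-shift amounts are taken modulo the element width, so the splat
+  // amount folds to an 8-bit immediate for VSHLD/VSHRD.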
+  uint64_t ModAmt = ShiftVal.urem(EltSize);
+  SDValue Imm = DAG.getTargetConstant(ModAmt, DL, MVT::i8);
+  bool IsFSHR = N->getOpcode() == ISD::FSHR;
+
+  if (IsFSHR)
+    std::swap(Op0, Op1);
+  unsigned Opcode = IsFSHR ? X86ISD::VSHRD : X86ISD::VSHLD;
+  return DAG.getNode(Opcode, DL, VT, {Op0, Op1, Imm});
+}
+
 static bool needCarryOrOverflowFlag(SDValue Flags) {
   assert(Flags.getValueType() == MVT::i32 && "Unexpected VT!");
 
@@ -59323,7 +59383,8 @@ static SDValue combineConcatVectorOps(const SDLoc &DL, MVT VT,
     case X86ISD::ANDNP:
       // TODO: AVX512 targets should only use CombineSubOperand like AVX1/2.
       if (!IsSplat && (VT.is256BitVector() ||
-                       (VT.is512BitVector() && Subtarget.useAVX512Regs()))) {
+                       (VT.is512BitVector() && Subtarget.useAVX512Regs()) ||
+                       (EltSizeInBits == 1 && TLI.isTypeLegal(VT)))) {
         // Don't concatenate root AVX1 NOT patterns.
         // TODO: Allow NOT folding if Concat0 succeeds.
         if (Opcode == ISD::XOR && Depth == 0 && !Subtarget.hasInt256() &&
@@ -59333,7 +59394,8 @@ static SDValue combineConcatVectorOps(const SDLoc &DL, MVT VT,
           break;
         SDValue Concat0 = CombineSubOperand(VT, Ops, 0);
         SDValue Concat1 = CombineSubOperand(VT, Ops, 1);
-        if (Concat0 || Concat1 || Subtarget.useAVX512Regs())
+        if (Concat0 || Concat1 ||
+            (EltSizeInBits != 1 && Subtarget.useAVX512Regs()))
           return DAG.getNode(Opcode, DL, VT,
                              Concat0 ? Concat0 : ConcatSubOperand(VT, Ops, 0),
                              Concat1 ? Concat1 : ConcatSubOperand(VT, Ops, 1));
@@ -59393,6 +59455,31 @@ static SDValue combineConcatVectorOps(const SDLoc &DL, MVT VT,
         }
       }
       break;
+    case ISD::SETCC:
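+      // Concatenate vXi1 compare results by concatenating the compares'
+      // (wider) source operands instead, when both the result type and the
+      // widened source type are legal.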
+      if (!IsSplat && EltSizeInBits == 1 &&
+          llvm::all_of(Ops, [Op0](SDValue Op) {
+            return Op0.getOperand(0).getValueType() ==
+                       Op.getOperand(0).getValueType() &&
+                   Op0.getOperand(2) == Op.getOperand(2);
+          })) {
+        EVT SrcVT = Op0.getOperand(0).getValueType();
+        EVT NewSrcVT = EVT::getVectorVT(Ctx, SrcVT.getScalarType(),
+                                        NumOps * SrcVT.getVectorNumElements());
+        unsigned SrcSizeInBits = SrcVT.getScalarSizeInBits();
+        if (TLI.isTypeLegal(VT) && TLI.isTypeLegal(NewSrcVT) &&
+            (NewSrcVT.is256BitVector() ||
+             (NewSrcVT.is512BitVector() && Subtarget.useAVX512Regs() &&
+              (SrcSizeInBits >= 32 || Subtarget.useBWIRegs())))) {
+          SDValue LHS = CombineSubOperand(NewSrcVT.getSimpleVT(), Ops, 0);
+          SDValue RHS = CombineSubOperand(NewSrcVT.getSimpleVT(), Ops, 1);
+          if (LHS || RHS)
+            return DAG.getNode(Opcode, DL, VT,
+                               LHS ? LHS : ConcatSubOperand(NewSrcVT, Ops, 0),
+                               RHS ? RHS : ConcatSubOperand(NewSrcVT, Ops, 1),
+                               Op0.getOperand(2));
+        }
+      }
+      break;
     case ISD::CTPOP:
     case ISD::CTTZ:
     case ISD::CTLZ:
@@ -59456,6 +59543,36 @@ static SDValue combineConcatVectorOps(const SDLoc &DL, MVT VT,
                            ConcatSubOperand(VT, Ops, 1));
       }
       break;
+    case ISD::FSQRT:
+    case ISD::FCEIL:
+    case ISD::FTRUNC:
+    case ISD::FRINT:
+    case ISD::FNEARBYINT:
+    case ISD::FROUND:
+    case ISD::FROUNDEVEN:
+    case ISD::FFLOOR:
+      if (!IsSplat && (VT.is256BitVector() ||
+                       (VT.is512BitVector() && Subtarget.useAVX512Regs()))) {
+        return DAG.getNode(Opcode, DL, VT, ConcatSubOperand(VT, Ops, 0));
+      }
+      break;
+    case X86ISD::FRCP:
+    case X86ISD::FRSQRT:
+      if (!IsSplat && VT.is256BitVector()) {
+        return DAG.getNode(Opcode, DL, VT, ConcatSubOperand(VT, Ops, 0));
+      }
+      break;
+    case X86ISD::VRNDSCALE:
+      if (!IsSplat &&
+          (VT.is256BitVector() ||
+           (VT.is512BitVector() && Subtarget.useAVX512Regs())) &&
+          llvm::all_of(Ops, [Op0](SDValue Op) {
+            return Op0.getOperand(1) == Op.getOperand(1);
+          })) {
+        return DAG.getNode(Opcode, DL, VT, ConcatSubOperand(VT, Ops, 0),
+                           Op0.getOperand(1));
+      }
+      break;
     case X86ISD::HADD:
     case X86ISD::HSUB:
     case X86ISD::FHADD:
@@ -59727,6 +59844,17 @@ static SDValue combineCONCAT_VECTORS(SDNode *N, SelectionDAG &DAG,
       }
     }
 
+    // Attempt to merge comparison/logic ops if the type is legal.
+    if (TLI.isTypeLegal(VT) &&
+        (all_of(Ops, [](SDValue Op) { return Op.getOpcode() == ISD::SETCC; }) ||
+         all_of(Ops, [](SDValue Op) {
+           return ISD::isBitwiseLogicOp(Op.getOpcode());
+         }))) {
+      if (SDValue R = combineConcatVectorOps(SDLoc(N), VT.getSimpleVT(), Ops,
+                                             DAG, Subtarget))
+        return R;
+    }
+
     // Don't do anything else for i1 vectors.
     return SDValue();
   }
@@ -61239,6 +61367,8 @@ SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,
   case ISD::INTRINSIC_VOID:  return combineINTRINSIC_VOID(N, DAG, DCI);
   case ISD::FP_TO_SINT_SAT:
   case ISD::FP_TO_UINT_SAT: return combineFP_TO_xINT_SAT(N, DAG, Subtarget);
+  case ISD::FSHL:
+  case ISD::FSHR: return combineFunnelShift(N, DAG, DCI, Subtarget);
     // clang-format on
   }
 
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index c5085299716ed..848fe4bf86d2c 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1482,7 +1482,7 @@ namespace llvm {
     /// to a MemIntrinsicNode (touches memory). If this is the case, it returns
     /// true and stores the intrinsic information into the IntrinsicInfo that was
     /// passed to the function.
-    bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallInst &I,
+    bool getTgtMemIntrinsic(IntrinsicInfo &Info, const CallBase &I,
                             MachineFunction &MF,
                             unsigned Intrinsic) const override;
 
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index a9dc96b53d530..aa75d2c60803e 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -6258,10 +6258,15 @@ InstructionCost X86TTIImpl::getGSVectorCost(unsigned Opcode,
 }
 
 /// Calculate the cost of Gather / Scatter operation
-InstructionCost X86TTIImpl::getGatherScatterOpCost(
-    unsigned Opcode, Type *SrcVTy, const Value *Ptr, bool VariableMask,
-    Align Alignment, TTI::TargetCostKind CostKind,
-    const Instruction *I = nullptr) const {
+InstructionCost
+X86TTIImpl::getGatherScatterOpCost(const MemIntrinsicCostAttributes &MICA,
+                                   TTI::TargetCostKind CostKind) const {
+  bool IsLoad = MICA.getID() == Intrinsic::masked_gather ||
+                MICA.getID() == Intrinsic::vp_gather;
+  unsigned Opcode = IsLoad ? Instruction::Load : Instruction::Store;
+  Type *SrcVTy = MICA.getDataType();
+  const Value *Ptr = MICA.getPointer();
+  Align Alignment = MICA.getAlignment();
   if ((Opcode == Instruction::Load &&
        (!isLegalMaskedGather(SrcVTy, Align(Alignment)) ||
         forceScalarizeMaskedGather(cast<VectorType>(SrcVTy),
@@ -6270,8 +6275,7 @@ InstructionCost X86TTIImpl::getGatherScatterOpCost(
        (!isLegalMaskedScatter(SrcVTy, Align(Alignment)) ||
         forceScalarizeMaskedScatter(cast<VectorType>(SrcVTy),
                                     Align(Alignment)))))
-    return BaseT::getGatherScatterOpCost(Opcode, SrcVTy, Ptr, VariableMask,
-                                         Alignment, CostKind, I);
+    return BaseT::getGatherScatterOpCost(MICA, CostKind);
 
   assert(SrcVTy->isVectorTy() && "Unexpected data type for Gather/Scatter");
   PointerType *PtrTy = dyn_cast<PointerType>(Ptr->getType());
diff --git a/llvm/lib/Target/X86/X86TargetTransformInfo.h b/llvm/lib/Target/X86/X86TargetTransformInfo.h
index d6dea9427990b..d35911965d8b5 100644
--- a/llvm/lib/Target/X86/X86TargetTransformInfo.h
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.h
@@ -185,11 +185,9 @@ class X86TTIImpl final : public BasicTTIImplBase<X86TTIImpl> {
   InstructionCost
   getMaskedMemoryOpCost(const MemIntrinsicCostAttributes &MICA,
                         TTI::TargetCostKind CostKind) const override;
-  InstructionCost getGatherScatterOpCost(unsigned Opcode, Type *DataTy,
-                                         const Value *Ptr, bool VariableMask,
-                                         Align Alignment,
-                                         TTI::TargetCostKind CostKind,
-                                         const Instruction *I) const override;
+  InstructionCost
+  getGatherScatterOpCost(const MemIntrinsicCostAttributes &MICA,
+                         TTI::TargetCostKind CostKind) const override;
   InstructionCost
   getPointersChainCost(ArrayRef<const Value *> Ptrs, const Value *Base,
                        const TTI::PointersChainInfo &Info, Type *AccessTy,
diff --git a/llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp b/llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
index 7ed8fb68f107e..2397133fa61ef 100644
--- a/llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
+++ b/llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
@@ -710,9 +710,17 @@ static bool foldLoadsRecursive(Value *V, LoadOps &LOps, const DataLayout &DL,
   MemoryLocation Loc;
   if (!Start->comesBefore(End)) {
     std::swap(Start, End);
-    Loc = MemoryLocation::get(End);
+    // If LOps.RootInsert comes after LI2, then since we use LI2 as the new
+    // insert point, we must check that the memory region accessed by LOps is
+    // not modified before it.
     if (LOps.FoundRoot)
-      Loc = Loc.getWithNewSize(LOps.LoadSize);
+      Loc = MemoryLocation(
+          LOps.Root->getPointerOperand(),
+          LocationSize::precise(DL.getTypeStoreSize(
+              IntegerType::get(LI1->getContext(), LOps.LoadSize))),
+          LOps.AATags);
+    else
+      Loc = MemoryLocation::get(End);
   } else
     Loc = MemoryLocation::get(End);
   unsigned NumScanned = 0;
diff --git a/llvm/lib/Transforms/IPO/AttributorAttributes.cpp b/llvm/lib/Transforms/IPO/AttributorAttributes.cpp
index e806a02a1f58f..a6ac7610a2c7a 100644
--- a/llvm/lib/Transforms/IPO/AttributorAttributes.cpp
+++ b/llvm/lib/Transforms/IPO/AttributorAttributes.cpp
@@ -665,10 +665,7 @@ static void followUsesInMBEC(AAType &AA, Attributor &A, StateType &S,
     return;
 
   SmallVector<const BranchInst *, 4> BrInsts;
-  SmallPtrSet<const Instruction *, 16> Visited;
   auto Pred = [&](const Instruction *I) {
-    if (!Visited.insert(I).second)
-      return false;
     if (const BranchInst *Br = dyn_cast<BranchInst>(I))
       if (Br->isConditional())
         BrInsts.push_back(Br);
@@ -687,10 +684,28 @@ static void followUsesInMBEC(AAType &AA, Attributor &A, StateType &S,
   // ParentS_m = ChildS_{m, 1} /\ ChildS_{m, 2} /\ ... /\ ChildS_{m, n_m}
   //
   // Known State |= ParentS_1 \/ ParentS_2 \/... \/ ParentS_m
+  //
+  // FIXME: Currently, recursive branches are not handled. For example, we
+  // cannot deduce that ptr must be dereferenced in the function below.
+  //
+  // void f(int a, int b, int *ptr) {
+  //    if (a)
+  //      if (b) {
+  //        *ptr = 0;
+  //      } else {
+  //        *ptr = 1;
+  //      }
+  //    else {
+  //      if (b) {
+  //        *ptr = 0;
+  //      } else {
+  //        *ptr = 1;
+  //      }
+  //    }
+  // }
 
   Explorer->checkForAllContext(&CtxI, Pred);
-  while (!BrInsts.empty()) {
-    const BranchInst *Br = BrInsts.pop_back_val();
+  for (const BranchInst *Br : BrInsts) {
     StateType ParentState;
 
     // The known state of the parent state is a conjunction of children's
@@ -699,18 +714,15 @@ static void followUsesInMBEC(AAType &AA, Attributor &A, StateType &S,
 
     for (const BasicBlock *BB : Br->successors()) {
       StateType ChildState;
+
       size_t BeforeSize = Uses.size();
-      const Instruction *I = &BB->front();
-      followUsesInContext(AA, A, *Explorer, I, Uses, ChildState);
+      followUsesInContext(AA, A, *Explorer, &BB->front(), Uses, ChildState);
 
       // Erase uses which only appear in the child.
       for (auto It = Uses.begin() + BeforeSize; It != Uses.end();)
         It = Uses.erase(It);
 
       ParentState &= ChildState;
-
-      // Check for recursive conditional branches.
-      Explorer->checkForAllContext(I, Pred);
     }
 
     // Use only known state.
diff --git a/llvm/lib/Transforms/IPO/GlobalOpt.cpp b/llvm/lib/Transforms/IPO/GlobalOpt.cpp
index 6add1e5c092d3..939071725253f 100644
--- a/llvm/lib/Transforms/IPO/GlobalOpt.cpp
+++ b/llvm/lib/Transforms/IPO/GlobalOpt.cpp
@@ -99,6 +99,11 @@ static cl::opt<bool>
                                    "functions from non-versioned callers."),
                           cl::init(true), cl::Hidden);
 
+static cl::opt<unsigned> MaxIFuncVersions(
+    "max-ifunc-versions", cl::Hidden, cl::init(5),
+    cl::desc("Maximum number of caller/callee versions that is allowed for "
+             "using the expensive (cubic) static resolution algorithm."));
+
 static cl::opt<bool>
     EnableColdCCStressTest("enable-coldcc-stress-test",
                            cl::desc("Enable stress test of coldcc by adding "
@@ -2632,31 +2637,56 @@ static bool OptimizeNonTrivialIFuncs(
     LLVM_DEBUG(dbgs() << "Statically resolving calls to function "
                       << CalleeIF->getResolverFunction()->getName() << "\n");
 
-    // The complexity of this algorithm is linear: O(NumCallers + NumCallees).
-    // TODO
-    // A limitation it has is that we are not using information about the
-    // current caller to deduce why an earlier caller of higher priority was
-    // skipped. For example let's say the current caller is aes+sve2 and a
-    // previous caller was mops+sve2. Knowing that sve2 is available we could
-    // infer that mops is unavailable. This would allow us to skip callee
-    // versions which depend on mops. I tried implementing this but the
-    // complexity was cubic :/
+    // The complexity of this algorithm is linear, O(NumCallers + NumCallees),
+    // when NumCallers > MaxIFuncVersions or NumCallees > MaxIFuncVersions;
+    // otherwise it is cubic, O(NumCallers^2 x NumCallees).
     auto staticallyResolveCalls = [&](ArrayRef<Function *> Callers,
                                       ArrayRef<Function *> Callees,
                                       bool CallerIsFMV) {
+      bool AllowExpensiveChecks = CallerIsFMV &&
+                                  Callers.size() <= MaxIFuncVersions &&
+                                  Callees.size() <= MaxIFuncVersions;
       // Index to the highest callee candidate.
-      unsigned I = 0;
+      unsigned J = 0;
 
-      for (Function *const &Caller : Callers) {
-        if (I == Callees.size())
+      for (unsigned I = 0, E = Callers.size(); I < E; ++I) {
+        // There are no callee candidates left.
+        if (J == Callees.size())
           break;
 
+        Function *Caller = Callers[I];
+        APInt CallerBits = FeatureMask[Caller];
+
+        // Compare the feature bits of the best callee candidate with all the
+        // caller versions preceding the current one. For each prior caller,
+        // discard feature bits that are known to be available in the current
+        // caller. As long as the known missing feature bits are a subset of
+        // the callee feature bits, advance to the next callee and start over.
+        auto eliminateAvailableFeatures = [&](unsigned BestCandidate) {
+          unsigned K = 0;
+          while (K < I && BestCandidate < Callees.size()) {
+            APInt MissingBits = FeatureMask[Callers[K]] & ~CallerBits;
+            if (MissingBits.isSubsetOf(FeatureMask[Callees[BestCandidate]])) {
+              ++BestCandidate;
+              // Start over.
+              K = 0;
+            } else
+              ++K;
+          }
+          return BestCandidate;
+        };
+
+        unsigned BestCandidate =
+            AllowExpensiveChecks ? eliminateAvailableFeatures(J) : J;
+        // No callee candidate was found for this caller.
+        if (BestCandidate == Callees.size())
+          continue;
+
         LLVM_DEBUG(dbgs() << "   Examining "
                           << (CallerIsFMV ? "FMV" : "regular") << " caller "
                           << Caller->getName() << "\n");
 
-        Function *Callee = Callees[I];
-        APInt CallerBits = FeatureMask[Caller];
+        Function *Callee = Callees[BestCandidate];
         APInt CalleeBits = FeatureMask[Callee];
 
         // Statically resolve calls from the current caller to the current
@@ -2682,8 +2712,8 @@ static bool OptimizeNonTrivialIFuncs(
         // the callee feature bits, advance to the next callee. This effectively
         // prevents considering the current callee as a candidate for static
         // resolution by following callers.
-        while (CallerBits.isSubsetOf(FeatureMask[Callees[I]]) &&
-               ++I < Callees.size())
+        while (CallerBits.isSubsetOf(FeatureMask[Callees[J]]) &&
+               ++J < Callees.size())
           ;
       }
     };
diff --git a/llvm/lib/Transforms/IPO/LowerTypeTests.cpp b/llvm/lib/Transforms/IPO/LowerTypeTests.cpp
index fa35eef2c00f5..f7aeda95e41b3 100644
--- a/llvm/lib/Transforms/IPO/LowerTypeTests.cpp
+++ b/llvm/lib/Transforms/IPO/LowerTypeTests.cpp
@@ -1554,12 +1554,7 @@ void LowerTypeTestsModule::createJumpTable(
 
   // Align the whole table by entry size.
   F->setAlignment(Align(getJumpTableEntrySize(JumpTableArch)));
-  // Skip prologue.
-  // Disabled on win32 due to https://llvm.org/bugs/show_bug.cgi?id=28641#c3.
-  // Luckily, this function does not get any prologue even without the
-  // attribute.
-  if (OS != Triple::Win32)
-    F->addFnAttr(Attribute::Naked);
+  F->addFnAttr(Attribute::Naked);
   if (JumpTableArch == Triple::arm)
     F->addFnAttr("target-features", "-thumb-mode");
   if (JumpTableArch == Triple::thumb) {
diff --git a/llvm/lib/Transforms/IPO/WholeProgramDevirt.cpp b/llvm/lib/Transforms/IPO/WholeProgramDevirt.cpp
index 2dd0fde6b34d6..4642da0abdc13 100644
--- a/llvm/lib/Transforms/IPO/WholeProgramDevirt.cpp
+++ b/llvm/lib/Transforms/IPO/WholeProgramDevirt.cpp
@@ -99,6 +99,7 @@
 #include "llvm/IR/ProfDataUtils.h"
 #include "llvm/Support/Casting.h"
 #include "llvm/Support/CommandLine.h"
+#include "llvm/Support/DebugCounter.h"
 #include "llvm/Support/Errc.h"
 #include "llvm/Support/Error.h"
 #include "llvm/Support/FileSystem.h"
@@ -130,6 +131,8 @@ STATISTIC(NumUniqueRetVal, "Number of unique return value optimizations");
 STATISTIC(NumVirtConstProp1Bit,
           "Number of 1 bit virtual constant propagations");
 STATISTIC(NumVirtConstProp, "Number of virtual constant propagations");
+DEBUG_COUNTER(CallsToDevirt, "calls-to-devirt",
+              "Controls how many calls should be devirtualized.");
 
 namespace llvm {
 
@@ -219,14 +222,6 @@ static cl::opt<bool> WholeProgramDevirtKeepUnreachableFunction(
     cl::desc("Regard unreachable functions as possible devirtualize targets."),
     cl::Hidden, cl::init(true));
 
-/// If explicitly specified, the devirt module pass will stop transformation
-/// once the total number of devirtualizations reach the cutoff value. Setting
-/// this option to 0 explicitly will do 0 devirtualization.
-static cl::opt<unsigned> WholeProgramDevirtCutoff(
-    "wholeprogramdevirt-cutoff",
-    cl::desc("Max number of devirtualizations for devirt module pass"),
-    cl::init(0));
-
 /// Mechanism to add runtime checking of devirtualization decisions, optionally
 /// trapping or falling back to indirect call on any that are not correct.
 /// Trapping mode is useful for debugging undefined behavior leading to failures
@@ -377,9 +372,6 @@ VirtualCallTarget::VirtualCallTarget(GlobalValue *Fn, const TypeMemberInfo *TM)
 
 namespace {
 
-// Tracks the number of devirted calls in the IR transformation.
-static unsigned NumDevirtCalls = 0;
-
 // A slot in a set of virtual tables. The TypeID identifies the set of virtual
 // tables, and the ByteOffset is the offset in bytes from the address point to
 // the virtual function pointer.
@@ -1216,15 +1208,13 @@ void DevirtModule::applySingleImplDevirt(VTableSlotInfo &SlotInfo,
         continue;
 
       // Stop when the number of devirted calls reaches the cutoff.
-      if (WholeProgramDevirtCutoff.getNumOccurrences() > 0 &&
-          NumDevirtCalls >= WholeProgramDevirtCutoff)
-        return;
+      if (!DebugCounter::shouldExecute(CallsToDevirt))
+        continue;
 
       if (RemarksEnabled)
         VCallSite.emitRemark("single-impl",
                              TheFn->stripPointerCasts()->getName(), OREGetter);
       NumSingleImpl++;
-      NumDevirtCalls++;
       auto &CB = VCallSite.CB;
       assert(!CB.getCalledFunction() && "devirtualizing direct call?");
       IRBuilder<> Builder(&CB);
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp b/llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
index e7dc366b13798..c9f51e4b294b1 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
@@ -1163,6 +1163,7 @@ static Value *canonicalizeSaturatedAddSigned(ICmpInst *Cmp, Value *TVal,
   // (X >= Y) ? INT_MAX : (X + C) --> sadd.sat(X, C)
   // where Y is INT_MAX - C or INT_MAX - C - 1, and C > 0
   if ((Pred == ICmpInst::ICMP_SGT || Pred == ICmpInst::ICMP_SGE) &&
+      isa<Constant>(Cmp1) &&
       match(FVal, m_Add(m_Specific(Cmp0), m_StrictlyPositive(C)))) {
     APInt IntMax =
         APInt::getSignedMaxValue(Cmp1->getType()->getScalarSizeInBits());
diff --git a/llvm/lib/Transforms/Instrumentation/AllocToken.cpp b/llvm/lib/Transforms/Instrumentation/AllocToken.cpp
index cf74354cb438f..38eeee287b94e 100644
--- a/llvm/lib/Transforms/Instrumentation/AllocToken.cpp
+++ b/llvm/lib/Transforms/Instrumentation/AllocToken.cpp
@@ -234,12 +234,31 @@ class TypeHashPointerSplitMode : public TypeHashMode {
   }
 };
 
-// Apply opt overrides.
-AllocTokenOptions transformOptionsFromCl(AllocTokenOptions Opts) {
-  if (!Opts.MaxTokens.has_value())
+// Apply opt overrides and module flags.
+static AllocTokenOptions resolveOptions(AllocTokenOptions Opts,
+                                        const Module &M) {
+  auto IntModuleFlagOrNull = [&](StringRef Key) {
+    return mdconst::extract_or_null<ConstantInt>(M.getModuleFlag(Key));
+  };
+
+  if (auto *S = dyn_cast_or_null<MDString>(M.getModuleFlag("alloc-token-mode")))
+    if (auto Mode = getAllocTokenModeFromString(S->getString()))
+      Opts.Mode = *Mode;
+  if (auto *Val = IntModuleFlagOrNull("alloc-token-max"))
+    Opts.MaxTokens = Val->getZExtValue();
+  if (auto *Val = IntModuleFlagOrNull("alloc-token-fast-abi"))
+    Opts.FastABI |= Val->isOne();
+  if (auto *Val = IntModuleFlagOrNull("alloc-token-extended"))
+    Opts.Extended |= Val->isOne();
+
+  // Allow overriding options via command-line options.
+  if (ClMaxTokens.getNumOccurrences())
     Opts.MaxTokens = ClMaxTokens;
-  Opts.FastABI |= ClFastABI;
-  Opts.Extended |= ClExtended;
+  if (ClFastABI.getNumOccurrences())
+    Opts.FastABI = ClFastABI;
+  if (ClExtended.getNumOccurrences())
+    Opts.Extended = ClExtended;
+
   return Opts;
 }
 
@@ -247,21 +266,21 @@ class AllocToken {
 public:
   explicit AllocToken(AllocTokenOptions Opts, Module &M,
                       ModuleAnalysisManager &MAM)
-      : Options(transformOptionsFromCl(std::move(Opts))), Mod(M),
+      : Options(resolveOptions(std::move(Opts), M)), Mod(M),
         FAM(MAM.getResult<FunctionAnalysisManagerModuleProxy>(M).getManager()),
-        Mode(IncrementMode(*IntPtrTy, *Options.MaxTokens)) {
+        Mode(IncrementMode(*IntPtrTy, Options.MaxTokens)) {
     switch (Options.Mode) {
     case TokenMode::Increment:
       break;
     case TokenMode::Random:
-      Mode.emplace<RandomMode>(*IntPtrTy, *Options.MaxTokens,
+      Mode.emplace<RandomMode>(*IntPtrTy, Options.MaxTokens,
                                M.createRNG(DEBUG_TYPE));
       break;
     case TokenMode::TypeHash:
-      Mode.emplace<TypeHashMode>(*IntPtrTy, *Options.MaxTokens);
+      Mode.emplace<TypeHashMode>(*IntPtrTy, Options.MaxTokens);
       break;
     case TokenMode::TypeHashPointerSplit:
-      Mode.emplace<TypeHashPointerSplitMode>(*IntPtrTy, *Options.MaxTokens);
+      Mode.emplace<TypeHashPointerSplitMode>(*IntPtrTy, Options.MaxTokens);
       break;
     }
   }
@@ -318,8 +337,6 @@ bool AllocToken::instrumentFunction(Function &F) {
   if (F.getLinkage() == GlobalValue::AvailableExternallyLinkage)
     return false;
 
-  auto &ORE = FAM.getResult<OptimizationRemarkEmitterAnalysis>(F);
-  auto &TLI = FAM.getResult<TargetLibraryAnalysis>(F);
   SmallVector<std::pair<CallBase *, LibFunc>, 4> AllocCalls;
   SmallVector<IntrinsicInst *, 4> IntrinsicInsts;
 
@@ -328,6 +345,10 @@ bool AllocToken::instrumentFunction(Function &F) {
       F.hasFnAttribute(Attribute::SanitizeAllocToken) &&
       !F.hasFnAttribute(Attribute::DisableSanitizerInstrumentation);
 
+  // Get TLI only when required.
+  const TargetLibraryInfo *TLI =
+      InstrumentFunction ? &FAM.getResult<TargetLibraryAnalysis>(F) : nullptr;
+
   // Collect all allocation calls to avoid iterator invalidation.
   for (Instruction &I : instructions(F)) {
     // Collect all alloc_token_* intrinsics.
@@ -343,26 +364,28 @@ bool AllocToken::instrumentFunction(Function &F) {
     auto *CB = dyn_cast<CallBase>(&I);
     if (!CB)
       continue;
-    if (std::optional<LibFunc> Func = shouldInstrumentCall(*CB, TLI))
+    if (std::optional<LibFunc> Func = shouldInstrumentCall(*CB, *TLI))
       AllocCalls.emplace_back(CB, Func.value());
   }
 
+  // Return early to avoid unnecessarily instantiating the ORE.
+  if (AllocCalls.empty() && IntrinsicInsts.empty())
+    return false;
+
+  auto &ORE = FAM.getResult<OptimizationRemarkEmitterAnalysis>(F);
   bool Modified = false;
 
-  if (!AllocCalls.empty()) {
-    for (auto &[CB, Func] : AllocCalls)
-      Modified |= replaceAllocationCall(CB, Func, ORE, TLI);
-    if (Modified)
-      NumFunctionsModified++;
-  }
+  for (auto &[CB, Func] : AllocCalls)
+    Modified |= replaceAllocationCall(CB, Func, ORE, *TLI);
 
-  if (!IntrinsicInsts.empty()) {
-    for (auto *II : IntrinsicInsts)
-      replaceIntrinsicInst(II, ORE);
+  for (auto *II : IntrinsicInsts) {
+    replaceIntrinsicInst(II, ORE);
     Modified = true;
-    NumFunctionsModified++;
   }
 
+  if (Modified)
+    NumFunctionsModified++;
+
   return Modified;
 }
 
diff --git a/llvm/lib/Transforms/Instrumentation/RealtimeSanitizer.cpp b/llvm/lib/Transforms/Instrumentation/RealtimeSanitizer.cpp
index 5ef6ffb58a7c1..667fdb746175f 100644
--- a/llvm/lib/Transforms/Instrumentation/RealtimeSanitizer.cpp
+++ b/llvm/lib/Transforms/Instrumentation/RealtimeSanitizer.cpp
@@ -90,6 +90,9 @@ PreservedAnalyses RealtimeSanitizerPass::run(Module &M,
       [&](Function *Ctor, FunctionCallee) { appendToGlobalCtors(M, Ctor, 0); });
 
   for (Function &F : M) {
+    if (F.empty())
+      continue;
+
     if (F.hasFnAttribute(Attribute::SanitizeRealtime))
       runSanitizeRealtime(F);
 
diff --git a/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp b/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
index 001215abcfb26..63b228efe3b11 100644
--- a/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
@@ -2195,8 +2195,8 @@ class LSRInstance {
   SmallSetVector<Instruction *, 4> InsertedNonLCSSAInsts;
 
   void OptimizeShadowIV();
-  bool FindIVUserForCond(ICmpInst *Cond, IVStrideUse *&CondUse);
-  ICmpInst *OptimizeMax(ICmpInst *Cond, IVStrideUse* &CondUse);
+  bool FindIVUserForCond(Instruction *Cond, IVStrideUse *&CondUse);
+  Instruction *OptimizeMax(ICmpInst *Cond, IVStrideUse *&CondUse);
   void OptimizeLoopTermCond();
 
   void ChainInstruction(Instruction *UserInst, Instruction *IVOper,
@@ -2431,7 +2431,7 @@ void LSRInstance::OptimizeShadowIV() {
 
 /// If Cond has an operand that is an expression of an IV, set the IV user and
 /// stride information and return true, otherwise return false.
-bool LSRInstance::FindIVUserForCond(ICmpInst *Cond, IVStrideUse *&CondUse) {
+bool LSRInstance::FindIVUserForCond(Instruction *Cond, IVStrideUse *&CondUse) {
   for (IVStrideUse &U : IU)
     if (U.getUser() == Cond) {
       // NOTE: we could handle setcc instructions with multiple uses here, but
@@ -2491,7 +2491,7 @@ bool LSRInstance::FindIVUserForCond(ICmpInst *Cond, IVStrideUse *&CondUse) {
 /// This function solves this problem by detecting this type of loop and
 /// rewriting their conditions from ICMP_NE back to ICMP_SLT, and deleting
 /// the instructions for the maximum computation.
-ICmpInst *LSRInstance::OptimizeMax(ICmpInst *Cond, IVStrideUse* &CondUse) {
+Instruction *LSRInstance::OptimizeMax(ICmpInst *Cond, IVStrideUse *&CondUse) {
   // Check that the loop matches the pattern we're looking for.
   if (Cond->getPredicate() != CmpInst::ICMP_EQ &&
       Cond->getPredicate() != CmpInst::ICMP_NE)
@@ -2635,15 +2635,22 @@ LSRInstance::OptimizeLoopTermCond() {
     // one register value.
 
     BranchInst *TermBr = dyn_cast<BranchInst>(ExitingBlock->getTerminator());
-    if (!TermBr)
+    if (!TermBr || TermBr->isUnconditional())
       continue;
-    // FIXME: Overly conservative, termination condition could be an 'or' etc..
-    if (TermBr->isUnconditional() || !isa<ICmpInst>(TermBr->getCondition()))
+
+    Instruction *Cond = dyn_cast<Instruction>(TermBr->getCondition());
+    // If the argument to TermBr is an extractelement, then the source of that
+    // instruction is what generated the condition.
+    auto *Extract = dyn_cast_or_null<ExtractElementInst>(Cond);
+    if (Extract)
+      Cond = dyn_cast<Instruction>(Extract->getVectorOperand());
+    // FIXME: We could do more here, like handling logical operations where one
+    // side is a cmp that uses an induction variable.
+    if (!Cond)
       continue;
 
     // Search IVUsesByStride to find Cond's IVUse if there is one.
     IVStrideUse *CondUse = nullptr;
-    ICmpInst *Cond = cast<ICmpInst>(TermBr->getCondition());
     if (!FindIVUserForCond(Cond, CondUse))
       continue;
 
@@ -2653,7 +2660,8 @@ LSRInstance::OptimizeLoopTermCond() {
     // One consequence of doing this now is that it disrupts the count-down
     // optimization. That's not always a bad thing though, because in such
     // cases it may still be worthwhile to avoid a max.
-    Cond = OptimizeMax(Cond, CondUse);
+    if (auto *Cmp = dyn_cast<ICmpInst>(Cond))
+      Cond = OptimizeMax(Cmp, CondUse);
 
     // If this exiting block dominates the latch block, it may also use
     // the post-inc value if it won't be shared with other uses.
@@ -2718,13 +2726,14 @@ LSRInstance::OptimizeLoopTermCond() {
     // It's possible for the setcc instruction to be anywhere in the loop, and
     // possible for it to have multiple users.  If it is not immediately before
     // the exiting block branch, move it.
-    if (Cond->getNextNode() != TermBr) {
+    if (isa_and_nonnull<CmpInst>(Cond) && Cond->getNextNode() != TermBr &&
+        !Extract) {
       if (Cond->hasOneUse()) {
         Cond->moveBefore(TermBr->getIterator());
       } else {
         // Clone the terminating condition and insert into the loopend.
-        ICmpInst *OldCond = Cond;
-        Cond = cast<ICmpInst>(Cond->clone());
+        Instruction *OldCond = Cond;
+        Cond = Cond->clone();
         Cond->setName(L->getHeader()->getName() + ".termcond");
         Cond->insertInto(ExitingBlock, TermBr->getIterator());
 
@@ -6024,33 +6033,34 @@ void LSRInstance::Rewrite(const LSRUse &LU, const LSRFixup &LF,
     DeadInsts.emplace_back(OperandIsInstr);
 }
 
-// Trying to hoist the IVInc to loop header if all IVInc users are in
-// the loop header. It will help backend to generate post index load/store
-// when the latch block is different from loop header block.
-static bool canHoistIVInc(const TargetTransformInfo &TTI, const LSRFixup &Fixup,
-                          const LSRUse &LU, Instruction *IVIncInsertPos,
-                          Loop *L) {
+// Determine where to insert the transformed IV increment instruction for this
+// fixup. Normally this is IVIncInsertPos, but if this is a postincrement
+// opportunity we try to insert it in the same block as the fixup's user
+// instruction, since that is required for a postincrement instruction to be
+// generated.
+static Instruction *getFixupInsertPos(const TargetTransformInfo &TTI,
+                                      const LSRFixup &Fixup, const LSRUse &LU,
+                                      Instruction *IVIncInsertPos,
+                                      DominatorTree &DT) {
+  // Only address uses can be postincremented
   if (LU.Kind != LSRUse::Address)
-    return false;
-
-  // For now this code do the conservative optimization, only work for
-  // the header block. Later we can hoist the IVInc to the block post
-  // dominate all users.
-  BasicBlock *LHeader = L->getHeader();
-  if (IVIncInsertPos->getParent() == LHeader)
-    return false;
-
-  if (!Fixup.OperandValToReplace ||
-      any_of(Fixup.OperandValToReplace->users(), [&LHeader](User *U) {
-        Instruction *UI = cast<Instruction>(U);
-        return UI->getParent() != LHeader;
-      }))
-    return false;
+    return IVIncInsertPos;
 
+  // Don't try to postincrement if it's not legal
   Instruction *I = Fixup.UserInst;
   Type *Ty = I->getType();
-  return (isa<LoadInst>(I) && TTI.isIndexedLoadLegal(TTI.MIM_PostInc, Ty)) ||
-         (isa<StoreInst>(I) && TTI.isIndexedStoreLegal(TTI.MIM_PostInc, Ty));
+  if (!(isa<LoadInst>(I) && TTI.isIndexedLoadLegal(TTI.MIM_PostInc, Ty)) &&
+      !(isa<StoreInst>(I) && TTI.isIndexedStoreLegal(TTI.MIM_PostInc, Ty)))
+    return IVIncInsertPos;
+
+  // It's only legal to hoist to the user block if it dominates the default
+  // insert position.
+  BasicBlock *HoistBlock = I->getParent();
+  BasicBlock *IVIncBlock = IVIncInsertPos->getParent();
+  if (!DT.dominates(I, IVIncBlock))
+    return IVIncInsertPos;
+
+  return HoistBlock->getTerminator();
 }
 
 /// Rewrite all the fixup locations with new values, following the chosen
@@ -6071,9 +6081,7 @@ void LSRInstance::ImplementSolution(
   for (size_t LUIdx = 0, NumUses = Uses.size(); LUIdx != NumUses; ++LUIdx)
     for (const LSRFixup &Fixup : Uses[LUIdx].Fixups) {
       Instruction *InsertPos =
-          canHoistIVInc(TTI, Fixup, Uses[LUIdx], IVIncInsertPos, L)
-              ? L->getHeader()->getTerminator()
-              : IVIncInsertPos;
+          getFixupInsertPos(TTI, Fixup, Uses[LUIdx], IVIncInsertPos, DT);
       Rewriter.setIVIncInsertPos(L, InsertPos);
       Rewrite(Uses[LUIdx], Fixup, *Solution[LUIdx], DeadInsts);
       Changed = true;
diff --git a/llvm/lib/Transforms/Scalar/SROA.cpp b/llvm/lib/Transforms/Scalar/SROA.cpp
index 5c60fad6f91aa..3a70830cf8c0e 100644
--- a/llvm/lib/Transforms/Scalar/SROA.cpp
+++ b/llvm/lib/Transforms/Scalar/SROA.cpp
@@ -2178,35 +2178,6 @@ static bool isVectorPromotionViableForSlice(Partition &P, const Slice &S,
   return true;
 }
 
-/// Test whether a vector type is viable for promotion.
-///
-/// This implements the necessary checking for \c checkVectorTypesForPromotion
-/// (and thus isVectorPromotionViable) over all slices of the alloca for the
-/// given VectorType.
-static bool checkVectorTypeForPromotion(Partition &P, VectorType *VTy,
-                                        const DataLayout &DL, unsigned VScale) {
-  uint64_t ElementSize =
-      DL.getTypeSizeInBits(VTy->getElementType()).getFixedValue();
-
-  // While the definition of LLVM vectors is bitpacked, we don't support sizes
-  // that aren't byte sized.
-  if (ElementSize % 8)
-    return false;
-  assert((DL.getTypeSizeInBits(VTy).getFixedValue() % 8) == 0 &&
-         "vector size not a multiple of element size?");
-  ElementSize /= 8;
-
-  for (const Slice &S : P)
-    if (!isVectorPromotionViableForSlice(P, S, VTy, ElementSize, DL, VScale))
-      return false;
-
-  for (const Slice *S : P.splitSliceTails())
-    if (!isVectorPromotionViableForSlice(P, *S, VTy, ElementSize, DL, VScale))
-      return false;
-
-  return true;
-}
-
 /// Test whether any vector type in \p CandidateTys is viable for promotion.
 ///
 /// This implements the necessary checking for \c isVectorPromotionViable over
@@ -2291,11 +2262,30 @@ checkVectorTypesForPromotion(Partition &P, const DataLayout &DL,
            std::numeric_limits<unsigned short>::max();
   });
 
-  for (VectorType *VTy : CandidateTys)
-    if (checkVectorTypeForPromotion(P, VTy, DL, VScale))
-      return VTy;
+  // Find a vector type viable for promotion by iterating over all slices.
+  auto *VTy = llvm::find_if(CandidateTys, [&](VectorType *VTy) -> bool {
+    uint64_t ElementSize =
+        DL.getTypeSizeInBits(VTy->getElementType()).getFixedValue();
 
-  return nullptr;
+    // While the definition of LLVM vectors is bitpacked, we don't support sizes
+    // that aren't byte sized.
+    if (ElementSize % 8)
+      return false;
+    assert((DL.getTypeSizeInBits(VTy).getFixedValue() % 8) == 0 &&
+           "vector size not a multiple of element size?");
+    ElementSize /= 8;
+
+    for (const Slice &S : P)
+      if (!isVectorPromotionViableForSlice(P, S, VTy, ElementSize, DL, VScale))
+        return false;
+
+    for (const Slice *S : P.splitSliceTails())
+      if (!isVectorPromotionViableForSlice(P, *S, VTy, ElementSize, DL, VScale))
+        return false;
+
+    return true;
+  });
+  return VTy != CandidateTys.end() ? *VTy : nullptr;
 }
 
 static VectorType *createAndCheckVectorTypesForPromotion(
@@ -3150,7 +3140,6 @@ class AllocaSliceRewriter : public InstVisitor<AllocaSliceRewriter, bool> {
     assert(IsSplit || BeginOffset == NewBeginOffset);
     uint64_t Offset = NewBeginOffset - NewAllocaBeginOffset;
 
-#ifndef NDEBUG
     StringRef OldName = OldPtr->getName();
     // Skip through the last '.sroa.' component of the name.
     size_t LastSROAPrefix = OldName.rfind(".sroa.");
@@ -3169,17 +3158,10 @@ class AllocaSliceRewriter : public InstVisitor<AllocaSliceRewriter, bool> {
     }
     // Strip any SROA suffixes as well.
     OldName = OldName.substr(0, OldName.find(".sroa_"));
-#endif
 
     return getAdjustedPtr(IRB, DL, &NewAI,
                           APInt(DL.getIndexTypeSizeInBits(PointerTy), Offset),
-                          PointerTy,
-#ifndef NDEBUG
-                          Twine(OldName) + "."
-#else
-                          Twine()
-#endif
-    );
+                          PointerTy, Twine(OldName) + ".");
   }
 
   /// Compute suitable alignment to access this slice of the *new*
@@ -5213,7 +5195,6 @@ AllocaInst *SROA::rewritePartition(AllocaInst &AI, AllocaSlices &AS,
   // won't always succeed, in which case we fall back to a legal integer type
   // or an i8 array of an appropriate size.
   Type *SliceTy = nullptr;
-  VectorType *SliceVecTy = nullptr;
   const DataLayout &DL = AI.getDataLayout();
   unsigned VScale = AI.getFunction()->getVScaleValue();
 
@@ -5222,10 +5203,8 @@ AllocaInst *SROA::rewritePartition(AllocaInst &AI, AllocaSlices &AS,
   // Do all uses operate on the same type?
   if (CommonUseTy.first) {
     TypeSize CommonUseSize = DL.getTypeAllocSize(CommonUseTy.first);
-    if (CommonUseSize.isFixed() && CommonUseSize.getFixedValue() >= P.size()) {
+    if (CommonUseSize.isFixed() && CommonUseSize.getFixedValue() >= P.size())
       SliceTy = CommonUseTy.first;
-      SliceVecTy = dyn_cast<VectorType>(SliceTy);
-    }
   }
   // If not, can we find an appropriate subtype in the original allocated type?
   if (!SliceTy)
@@ -5235,27 +5214,14 @@ AllocaInst *SROA::rewritePartition(AllocaInst &AI, AllocaSlices &AS,
 
   // If still not, can we use the largest bitwidth integer type used?
   if (!SliceTy && CommonUseTy.second)
-    if (DL.getTypeAllocSize(CommonUseTy.second).getFixedValue() >= P.size()) {
+    if (DL.getTypeAllocSize(CommonUseTy.second).getFixedValue() >= P.size())
       SliceTy = CommonUseTy.second;
-      SliceVecTy = dyn_cast<VectorType>(SliceTy);
-    }
   if ((!SliceTy || (SliceTy->isArrayTy() &&
                     SliceTy->getArrayElementType()->isIntegerTy())) &&
       DL.isLegalInteger(P.size() * 8)) {
     SliceTy = Type::getIntNTy(*C, P.size() * 8);
   }
 
-  // If the common use types are not viable for promotion then attempt to find
-  // another type that is viable.
-  if (SliceVecTy && !checkVectorTypeForPromotion(P, SliceVecTy, DL, VScale))
-    if (Type *TypePartitionTy = getTypePartition(DL, AI.getAllocatedType(),
-                                                 P.beginOffset(), P.size())) {
-      VectorType *TypePartitionVecTy = dyn_cast<VectorType>(TypePartitionTy);
-      if (TypePartitionVecTy &&
-          checkVectorTypeForPromotion(P, TypePartitionVecTy, DL, VScale))
-        SliceTy = TypePartitionTy;
-    }
-
   if (!SliceTy)
     SliceTy = ArrayType::get(Type::getInt8Ty(*C), P.size());
   assert(DL.getTypeAllocSize(SliceTy).getFixedValue() >= P.size());
diff --git a/llvm/lib/Transforms/Utils/BasicBlockUtils.cpp b/llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
index 11db0ec487328..076c5da4393fc 100644
--- a/llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
+++ b/llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
@@ -92,6 +92,15 @@ emptyAndDetachBlock(BasicBlock *BB,
          "applying corresponding DTU updates.");
 }
 
+static bool HasLoopOrEntryConvergenceToken(const BasicBlock *BB) {
+  for (const Instruction &I : *BB) {
+    const ConvergenceControlInst *CCI = dyn_cast<ConvergenceControlInst>(&I);
+    if (CCI && (CCI->isLoop() || CCI->isEntry()))
+      return true;
+  }
+  return false;
+}
+
 void llvm::detachDeadBlocks(ArrayRef<BasicBlock *> BBs,
                             SmallVectorImpl<DominatorTree::UpdateType> *Updates,
                             bool KeepOneInputPHIs) {
@@ -259,6 +268,13 @@ bool llvm::MergeBlockIntoPredecessor(BasicBlock *BB, DomTreeUpdater *DTU,
     if (llvm::is_contained(PN.incoming_values(), &PN))
       return false;
 
+  // Don't merge if both the basic block and the predecessor contain loop or
+  // entry convergence control intrinsics, since there may be only one
+  // convergence token per block.
+  if (HasLoopOrEntryConvergenceToken(BB) &&
+      HasLoopOrEntryConvergenceToken(PredBB))
+    return false;
+
   LLVM_DEBUG(dbgs() << "Merging: " << BB->getName() << " into "
                     << PredBB->getName() << "\n");
 
diff --git a/llvm/lib/Transforms/Utils/LoopUnroll.cpp b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
index 5f1db9c54b291..0f256398e5b1e 100644
--- a/llvm/lib/Transforms/Utils/LoopUnroll.cpp
+++ b/llvm/lib/Transforms/Utils/LoopUnroll.cpp
@@ -1254,6 +1254,8 @@ llvm::canParallelizeReductionWhenUnrolling(PHINode &Phi, Loop *L,
                                             /*DemandedBits=*/nullptr,
                                             /*AC=*/nullptr, /*DT=*/nullptr, SE))
     return std::nullopt;
+  if (RdxDesc.hasUsesOutsideReductionChain())
+    return std::nullopt;
   RecurKind RK = RdxDesc.getRecurrenceKind();
   // Skip unsupported reductions.
   // TODO: Handle additional reductions, including min-max reductions.
diff --git a/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp b/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp
index 18b0f617ca232..4ab99edd64baa 100644
--- a/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp
+++ b/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp
@@ -21,6 +21,218 @@
 
 using namespace llvm;
 
+/// \returns \p Len urem \p OpSize, checking for optimization opportunities.
+/// \p OpSizeVal must be the integer value of the \c ConstantInt \p OpSize.
+static Value *getRuntimeLoopRemainder(IRBuilderBase &B, Value *Len,
+                                      Value *OpSize, unsigned OpSizeVal) {
+  // For powers of 2, we can and by (OpSizeVal - 1) instead of using urem.
+  if (isPowerOf2_32(OpSizeVal))
+    return B.CreateAnd(Len, OpSizeVal - 1);
+  return B.CreateURem(Len, OpSize);
+}
+
+/// \returns (\p Len udiv \p OpSize) mul \p OpSize, checking for optimization
+/// opportunities.
+/// If \p RTLoopRemainder is provided, it must be the result of
+/// \c getRuntimeLoopRemainder() with the same arguments.
+static Value *getRuntimeLoopUnits(IRBuilderBase &B, Value *Len, Value *OpSize,
+                                  unsigned OpSizeVal,
+                                  Value *RTLoopRemainder = nullptr) {
+  if (!RTLoopRemainder)
+    RTLoopRemainder = getRuntimeLoopRemainder(B, Len, OpSize, OpSizeVal);
+  return B.CreateSub(Len, RTLoopRemainder);
+}
+
+namespace {
+/// Container for the return values of insertLoopExpansion.
+struct LoopExpansionInfo {
+  /// The instruction at the end of the main loop body.
+  Instruction *MainLoopIP = nullptr;
+
+  /// The unit index in the main loop body.
+  Value *MainLoopIndex = nullptr;
+
+  /// The instruction at the end of the residual loop body. Can be nullptr if no
+  /// residual is required.
+  Instruction *ResidualLoopIP = nullptr;
+
+  /// The unit index in the residual loop body. Can be nullptr if no residual is
+  /// required.
+  Value *ResidualLoopIndex = nullptr;
+};
+} // namespace
+
+/// Insert the control flow and loop counters for a memcpy/memset loop
+/// expansion.
+///
+/// This function inserts IR corresponding to the following C code before
+/// \p InsertBefore:
+/// \code
+/// LoopUnits = (Len / MainLoopStep) * MainLoopStep;
+/// ResidualUnits = Len - LoopUnits;
+/// MainLoopIndex = 0;
+/// if (LoopUnits > 0) {
+///   do {
+///     // MainLoopIP
+///     MainLoopIndex += MainLoopStep;
+///   } while (MainLoopIndex < LoopUnits);
+/// }
+/// for (size_t i = 0; i < ResidualUnits; i += ResidualLoopStep) {
+///   ResidualLoopIndex = LoopUnits + i;
+///   // ResidualLoopIP
+/// }
+/// \endcode
+///
+/// \p MainLoopStep and \p ResidualLoopStep determine by how many "units" the
+/// loop index is increased in each iteration of the main and residual loops,
+/// respectively. In most cases, the "unit" will be bytes, but larger units are
+/// useful for lowering memset.pattern.
+///
+/// The computation of \c LoopUnits and \c ResidualUnits is performed at compile
+/// time if \p Len is a \c ConstantInt.
+/// The second (residual) loop is omitted if \p ResidualLoopStep is 0 or equal
+/// to \p MainLoopStep.
+/// The generated \c MainLoopIP, \c MainLoopIndex, \c ResidualLoopIP, and
+/// \c ResidualLoopIndex are returned in a \c LoopExpansionInfo object.
+static LoopExpansionInfo insertLoopExpansion(Instruction *InsertBefore,
+                                             Value *Len, unsigned MainLoopStep,
+                                             unsigned ResidualLoopStep,
+                                             StringRef BBNamePrefix) {
+  assert((ResidualLoopStep == 0 || MainLoopStep % ResidualLoopStep == 0) &&
+         "ResidualLoopStep must divide MainLoopStep if specified");
+  assert(ResidualLoopStep <= MainLoopStep &&
+         "ResidualLoopStep cannot be larger than MainLoopStep");
+  assert(MainLoopStep > 0 && "MainLoopStep must be non-zero");
+  LoopExpansionInfo LEI;
+  BasicBlock *PreLoopBB = InsertBefore->getParent();
+  BasicBlock *PostLoopBB = PreLoopBB->splitBasicBlock(
+      InsertBefore, BBNamePrefix + "-post-expansion");
+  Function *ParentFunc = PreLoopBB->getParent();
+  LLVMContext &Ctx = PreLoopBB->getContext();
+  IRBuilder<> PreLoopBuilder(PreLoopBB->getTerminator());
+
+  // Calculate the number of units covered by the main loop and the remaining
+  // units to cover after it.
+  Type *LenType = Len->getType();
+  IntegerType *ILenType = cast<IntegerType>(LenType);
+  ConstantInt *CIMainLoopStep = ConstantInt::get(ILenType, MainLoopStep);
+
+  Value *LoopUnits = Len;
+  Value *ResidualUnits = nullptr;
+  // We can make a conditional branch unconditional if we know that the
+  // MainLoop must be executed at least once.
+  bool MustTakeMainLoop = false;
+  if (MainLoopStep != 1) {
+    if (auto *CLen = dyn_cast<ConstantInt>(Len)) {
+      uint64_t TotalUnits = CLen->getZExtValue();
+      uint64_t LoopEndCount = alignDown(TotalUnits, MainLoopStep);
+      uint64_t ResidualCount = TotalUnits - LoopEndCount;
+      LoopUnits = ConstantInt::get(LenType, LoopEndCount);
+      ResidualUnits = ConstantInt::get(LenType, ResidualCount);
+      MustTakeMainLoop = LoopEndCount > 0;
+      // As an optimization, we could skip generating the residual loop if
+      // ResidualCount is known to be 0. However, current uses of this function
+      // don't request a residual loop if the length is constant (they generate
+      // a (potentially empty) sequence of loads and stores instead), so this
+      // optimization would have no effect here.
+    } else {
+      ResidualUnits = getRuntimeLoopRemainder(PreLoopBuilder, Len,
+                                              CIMainLoopStep, MainLoopStep);
+      LoopUnits = getRuntimeLoopUnits(PreLoopBuilder, Len, CIMainLoopStep,
+                                      MainLoopStep, ResidualUnits);
+    }
+  } else if (auto *CLen = dyn_cast<ConstantInt>(Len)) {
+    MustTakeMainLoop = CLen->getZExtValue() > 0;
+  }
+
+  BasicBlock *MainLoopBB = BasicBlock::Create(
+      Ctx, BBNamePrefix + "-expansion-main-body", ParentFunc, PostLoopBB);
+  IRBuilder<> LoopBuilder(MainLoopBB);
+
+  PHINode *LoopIndex = LoopBuilder.CreatePHI(LenType, 2, "loop-index");
+  LEI.MainLoopIndex = LoopIndex;
+  LoopIndex->addIncoming(ConstantInt::get(LenType, 0U), PreLoopBB);
+
+  Value *NewIndex =
+      LoopBuilder.CreateAdd(LoopIndex, ConstantInt::get(LenType, MainLoopStep));
+  LoopIndex->addIncoming(NewIndex, MainLoopBB);
+
+  // One argument of the addition is a loop-variant PHI, so it must be an
+  // Instruction (i.e., it cannot be a Constant).
+  LEI.MainLoopIP = cast<Instruction>(NewIndex);
+
+  if (ResidualLoopStep > 0 && ResidualLoopStep < MainLoopStep) {
+    // Loop body for the residual accesses.
+    BasicBlock *ResLoopBB =
+        BasicBlock::Create(Ctx, BBNamePrefix + "-expansion-residual-body",
+                           PreLoopBB->getParent(), PostLoopBB);
+    // BB to check if the residual loop is needed.
+    BasicBlock *ResidualCondBB =
+        BasicBlock::Create(Ctx, BBNamePrefix + "-expansion-residual-cond",
+                           PreLoopBB->getParent(), ResLoopBB);
+
+    // Enter the MainLoop unless no main loop iteration is required.
+    ConstantInt *Zero = ConstantInt::get(ILenType, 0U);
+    if (MustTakeMainLoop)
+      PreLoopBuilder.CreateBr(MainLoopBB);
+    else
+      PreLoopBuilder.CreateCondBr(PreLoopBuilder.CreateICmpNE(LoopUnits, Zero),
+                                  MainLoopBB, ResidualCondBB);
+    PreLoopBB->getTerminator()->eraseFromParent();
+
+    // Stay in the MainLoop until we have handled all the LoopUnits. Then go to
+    // the residual condition BB.
+    LoopBuilder.CreateCondBr(LoopBuilder.CreateICmpULT(NewIndex, LoopUnits),
+                             MainLoopBB, ResidualCondBB);
+
+    // Determine if we need to branch to the residual loop or bypass it.
+    IRBuilder<> RCBuilder(ResidualCondBB);
+    RCBuilder.CreateCondBr(RCBuilder.CreateICmpNE(ResidualUnits, Zero),
+                           ResLoopBB, PostLoopBB);
+
+    IRBuilder<> ResBuilder(ResLoopBB);
+    PHINode *ResidualIndex =
+        ResBuilder.CreatePHI(LenType, 2, "residual-loop-index");
+    ResidualIndex->addIncoming(Zero, ResidualCondBB);
+
+    // Add the offset at the end of the main loop to the loop counter of the
+    // residual loop to get the proper index.
+    Value *FullOffset = ResBuilder.CreateAdd(LoopUnits, ResidualIndex);
+    LEI.ResidualLoopIndex = FullOffset;
+
+    Value *ResNewIndex = ResBuilder.CreateAdd(
+        ResidualIndex, ConstantInt::get(LenType, ResidualLoopStep));
+    ResidualIndex->addIncoming(ResNewIndex, ResLoopBB);
+
+    // One argument of the addition is a loop-variant PHI, so it must be an
+    // Instruction (i.e., it cannot be a Constant).
+    LEI.ResidualLoopIP = cast<Instruction>(ResNewIndex);
+
+    // Stay in the residual loop until all ResidualUnits are handled.
+    ResBuilder.CreateCondBr(
+        ResBuilder.CreateICmpULT(ResNewIndex, ResidualUnits), ResLoopBB,
+        PostLoopBB);
+  } else {
+    // There is no need for a residual loop after the main loop. We do however
+    // need to patch up the control flow by creating the terminators for the
+    // preloop block and the main loop.
+
+    // Enter the MainLoop unless no main loop iteration is required.
+    if (MustTakeMainLoop) {
+      PreLoopBuilder.CreateBr(MainLoopBB);
+    } else {
+      ConstantInt *Zero = ConstantInt::get(ILenType, 0U);
+      PreLoopBuilder.CreateCondBr(PreLoopBuilder.CreateICmpNE(LoopUnits, Zero),
+                                  MainLoopBB, PostLoopBB);
+    }
+    PreLoopBB->getTerminator()->eraseFromParent();
+    // Stay in the MainLoop until we have handled all the LoopUnits.
+    LoopBuilder.CreateCondBr(LoopBuilder.CreateICmpULT(NewIndex, LoopUnits),
+                             MainLoopBB, PostLoopBB);
+  }
+  return LEI;
+}
+
 void llvm::createMemCpyLoopKnownSize(
     Instruction *InsertBefore, Value *SrcAddr, Value *DstAddr,
     ConstantInt *CopyLen, Align SrcAlign, Align DstAlign, bool SrcIsVolatile,
@@ -31,7 +243,6 @@ void llvm::createMemCpyLoopKnownSize(
     return;
 
   BasicBlock *PreLoopBB = InsertBefore->getParent();
-  BasicBlock *PostLoopBB = nullptr;
   Function *ParentFunc = PreLoopBB->getParent();
   LLVMContext &Ctx = PreLoopBB->getContext();
   const DataLayout &DL = ParentFunc->getDataLayout();
@@ -56,37 +267,32 @@ void llvm::createMemCpyLoopKnownSize(
 
   uint64_t LoopEndCount = alignDown(CopyLen->getZExtValue(), LoopOpSize);
 
+  // Skip the loop expansion entirely if the loop would never be taken.
   if (LoopEndCount != 0) {
-    // Split
-    PostLoopBB = PreLoopBB->splitBasicBlock(InsertBefore, "memcpy-split");
-    BasicBlock *LoopBB =
-        BasicBlock::Create(Ctx, "load-store-loop", ParentFunc, PostLoopBB);
-    PreLoopBB->getTerminator()->setSuccessor(0, LoopBB);
-
-    IRBuilder<> PLBuilder(PreLoopBB->getTerminator());
+    LoopExpansionInfo LEI = insertLoopExpansion(InsertBefore, CopyLen,
+                                                LoopOpSize, 0, "static-memcpy");
 
+    // Fill MainLoopBB
+    IRBuilder<> MainLoopBuilder(LEI.MainLoopIP);
     Align PartDstAlign(commonAlignment(DstAlign, LoopOpSize));
     Align PartSrcAlign(commonAlignment(SrcAlign, LoopOpSize));
 
-    IRBuilder<> LoopBuilder(LoopBB);
-    PHINode *LoopIndex = LoopBuilder.CreatePHI(TypeOfCopyLen, 2, "loop-index");
-    LoopIndex->addIncoming(ConstantInt::get(TypeOfCopyLen, 0U), PreLoopBB);
-    // Loop Body
-
     // If we used LoopOpType as GEP element type, we would iterate over the
     // buffers in TypeStoreSize strides while copying TypeAllocSize bytes, i.e.,
     // we would miss bytes if TypeStoreSize != TypeAllocSize. Therefore, use
     // byte offsets computed from the TypeStoreSize.
-    Value *SrcGEP = LoopBuilder.CreateInBoundsGEP(Int8Type, SrcAddr, LoopIndex);
-    LoadInst *Load = LoopBuilder.CreateAlignedLoad(LoopOpType, SrcGEP,
-                                                   PartSrcAlign, SrcIsVolatile);
+    Value *SrcGEP =
+        MainLoopBuilder.CreateInBoundsGEP(Int8Type, SrcAddr, LEI.MainLoopIndex);
+    LoadInst *Load = MainLoopBuilder.CreateAlignedLoad(
+        LoopOpType, SrcGEP, PartSrcAlign, SrcIsVolatile);
     if (!CanOverlap) {
       // Set alias scope for loads.
       Load->setMetadata(LLVMContext::MD_alias_scope,
                         MDNode::get(Ctx, NewScope));
     }
-    Value *DstGEP = LoopBuilder.CreateInBoundsGEP(Int8Type, DstAddr, LoopIndex);
-    StoreInst *Store = LoopBuilder.CreateAlignedStore(
+    Value *DstGEP =
+        MainLoopBuilder.CreateInBoundsGEP(Int8Type, DstAddr, LEI.MainLoopIndex);
+    StoreInst *Store = MainLoopBuilder.CreateAlignedStore(
         Load, DstGEP, PartDstAlign, DstIsVolatile);
     if (!CanOverlap) {
       // Indicate that stores don't overlap loads.
@@ -96,96 +302,63 @@ void llvm::createMemCpyLoopKnownSize(
       Load->setAtomic(AtomicOrdering::Unordered);
       Store->setAtomic(AtomicOrdering::Unordered);
     }
-    Value *NewIndex = LoopBuilder.CreateAdd(
-        LoopIndex, ConstantInt::get(TypeOfCopyLen, LoopOpSize));
-    LoopIndex->addIncoming(NewIndex, LoopBB);
-
-    // Create the loop branch condition.
-    Constant *LoopEndCI = ConstantInt::get(TypeOfCopyLen, LoopEndCount);
-    LoopBuilder.CreateCondBr(LoopBuilder.CreateICmpULT(NewIndex, LoopEndCI),
-                             LoopBB, PostLoopBB);
+    assert(!LEI.ResidualLoopIP && !LEI.ResidualLoopIndex &&
+           "No residual loop was requested");
   }
 
+  // Copy the remaining bytes with straight-line code.
   uint64_t BytesCopied = LoopEndCount;
   uint64_t RemainingBytes = CopyLen->getZExtValue() - BytesCopied;
-  if (RemainingBytes) {
-    BasicBlock::iterator InsertIt = PostLoopBB ? PostLoopBB->getFirstNonPHIIt()
-                                               : InsertBefore->getIterator();
-    IRBuilder<> RBuilder(InsertIt->getParent(), InsertIt);
+  if (RemainingBytes == 0)
+    return;
 
-    SmallVector<Type *, 5> RemainingOps;
-    TTI.getMemcpyLoopResidualLoweringType(RemainingOps, Ctx, RemainingBytes,
-                                          SrcAS, DstAS, SrcAlign, DstAlign,
-                                          AtomicElementSize);
+  IRBuilder<> RBuilder(InsertBefore);
+  SmallVector<Type *, 5> RemainingOps;
+  TTI.getMemcpyLoopResidualLoweringType(RemainingOps, Ctx, RemainingBytes,
+                                        SrcAS, DstAS, SrcAlign, DstAlign,
+                                        AtomicElementSize);
 
-    for (auto *OpTy : RemainingOps) {
-      Align PartSrcAlign(commonAlignment(SrcAlign, BytesCopied));
-      Align PartDstAlign(commonAlignment(DstAlign, BytesCopied));
-
-      unsigned OperandSize = DL.getTypeStoreSize(OpTy);
-      assert(
-          (!AtomicElementSize || OperandSize % *AtomicElementSize == 0) &&
-          "Atomic memcpy lowering is not supported for selected operand size");
-
-      Value *SrcGEP = RBuilder.CreateInBoundsGEP(
-          Int8Type, SrcAddr, ConstantInt::get(TypeOfCopyLen, BytesCopied));
-      LoadInst *Load =
-          RBuilder.CreateAlignedLoad(OpTy, SrcGEP, PartSrcAlign, SrcIsVolatile);
-      if (!CanOverlap) {
-        // Set alias scope for loads.
-        Load->setMetadata(LLVMContext::MD_alias_scope,
-                          MDNode::get(Ctx, NewScope));
-      }
-      Value *DstGEP = RBuilder.CreateInBoundsGEP(
-          Int8Type, DstAddr, ConstantInt::get(TypeOfCopyLen, BytesCopied));
-      StoreInst *Store = RBuilder.CreateAlignedStore(Load, DstGEP, PartDstAlign,
-                                                     DstIsVolatile);
-      if (!CanOverlap) {
-        // Indicate that stores don't overlap loads.
-        Store->setMetadata(LLVMContext::MD_noalias, MDNode::get(Ctx, NewScope));
-      }
-      if (AtomicElementSize) {
-        Load->setAtomic(AtomicOrdering::Unordered);
-        Store->setAtomic(AtomicOrdering::Unordered);
-      }
-      BytesCopied += OperandSize;
+  for (auto *OpTy : RemainingOps) {
+    Align PartSrcAlign(commonAlignment(SrcAlign, BytesCopied));
+    Align PartDstAlign(commonAlignment(DstAlign, BytesCopied));
+
+    unsigned OperandSize = DL.getTypeStoreSize(OpTy);
+    assert((!AtomicElementSize || OperandSize % *AtomicElementSize == 0) &&
+           "Atomic memcpy lowering is not supported for selected operand size");
+
+    Value *SrcGEP = RBuilder.CreateInBoundsGEP(
+        Int8Type, SrcAddr, ConstantInt::get(TypeOfCopyLen, BytesCopied));
+    LoadInst *Load =
+        RBuilder.CreateAlignedLoad(OpTy, SrcGEP, PartSrcAlign, SrcIsVolatile);
+    if (!CanOverlap) {
+      // Set alias scope for loads.
+      Load->setMetadata(LLVMContext::MD_alias_scope,
+                        MDNode::get(Ctx, NewScope));
+    }
+    Value *DstGEP = RBuilder.CreateInBoundsGEP(
+        Int8Type, DstAddr, ConstantInt::get(TypeOfCopyLen, BytesCopied));
+    StoreInst *Store =
+        RBuilder.CreateAlignedStore(Load, DstGEP, PartDstAlign, DstIsVolatile);
+    if (!CanOverlap) {
+      // Indicate that stores don't overlap loads.
+      Store->setMetadata(LLVMContext::MD_noalias, MDNode::get(Ctx, NewScope));
     }
+    if (AtomicElementSize) {
+      Load->setAtomic(AtomicOrdering::Unordered);
+      Store->setAtomic(AtomicOrdering::Unordered);
+    }
+    BytesCopied += OperandSize;
   }
   assert(BytesCopied == CopyLen->getZExtValue() &&
          "Bytes copied should match size in the call!");
 }
 
-// \returns \p Len urem \p OpSize, checking for optimization opportunities.
-static Value *getRuntimeLoopRemainder(const DataLayout &DL, IRBuilderBase &B,
-                                      Value *Len, Value *OpSize,
-                                      unsigned OpSizeVal) {
-  // For powers of 2, we can and by (OpSizeVal - 1) instead of using urem.
-  if (isPowerOf2_32(OpSizeVal))
-    return B.CreateAnd(Len, OpSizeVal - 1);
-  return B.CreateURem(Len, OpSize);
-}
-
-// \returns (\p Len udiv \p OpSize) mul \p OpSize, checking for optimization
-// opportunities.
-// If RTLoopRemainder is provided, it must be the result of
-// getRuntimeLoopRemainder() with the same arguments.
-static Value *getRuntimeLoopBytes(const DataLayout &DL, IRBuilderBase &B,
-                                  Value *Len, Value *OpSize, unsigned OpSizeVal,
-                                  Value *RTLoopRemainder = nullptr) {
-  if (!RTLoopRemainder)
-    RTLoopRemainder = getRuntimeLoopRemainder(DL, B, Len, OpSize, OpSizeVal);
-  return B.CreateSub(Len, RTLoopRemainder);
-}
-
 void llvm::createMemCpyLoopUnknownSize(
     Instruction *InsertBefore, Value *SrcAddr, Value *DstAddr, Value *CopyLen,
     Align SrcAlign, Align DstAlign, bool SrcIsVolatile, bool DstIsVolatile,
     bool CanOverlap, const TargetTransformInfo &TTI,
     std::optional<uint32_t> AtomicElementSize) {
   BasicBlock *PreLoopBB = InsertBefore->getParent();
-  BasicBlock *PostLoopBB =
-      PreLoopBB->splitBasicBlock(InsertBefore, "post-loop-memcpy-expansion");
-
   Function *ParentFunc = PreLoopBB->getParent();
   const DataLayout &DL = ParentFunc->getDataLayout();
   LLVMContext &Ctx = PreLoopBB->getContext();
@@ -205,50 +378,39 @@ void llvm::createMemCpyLoopUnknownSize(
   assert((!AtomicElementSize || LoopOpSize % *AtomicElementSize == 0) &&
          "Atomic memcpy lowering is not supported for selected operand size");
 
-  IRBuilder<> PLBuilder(PreLoopBB->getTerminator());
-
-  // Calculate the loop trip count, and remaining bytes to copy after the loop.
-  Type *CopyLenType = CopyLen->getType();
-  IntegerType *ILengthType = dyn_cast<IntegerType>(CopyLenType);
-  assert(ILengthType &&
-         "expected size argument to memcpy to be an integer type!");
   Type *Int8Type = Type::getInt8Ty(Ctx);
-  bool LoopOpIsInt8 = LoopOpType == Int8Type;
-  ConstantInt *CILoopOpSize = ConstantInt::get(ILengthType, LoopOpSize);
 
-  Value *RuntimeLoopBytes = CopyLen;
-  Value *RuntimeResidualBytes = nullptr;
-  if (!LoopOpIsInt8) {
-    RuntimeResidualBytes = getRuntimeLoopRemainder(DL, PLBuilder, CopyLen,
-                                                   CILoopOpSize, LoopOpSize);
-    RuntimeLoopBytes = getRuntimeLoopBytes(DL, PLBuilder, CopyLen, CILoopOpSize,
-                                           LoopOpSize, RuntimeResidualBytes);
-  }
+  Type *ResidualLoopOpType = AtomicElementSize
+                                 ? Type::getIntNTy(Ctx, *AtomicElementSize * 8)
+                                 : Int8Type;
+  unsigned ResidualLoopOpSize = DL.getTypeStoreSize(ResidualLoopOpType);
+  assert(ResidualLoopOpSize == (AtomicElementSize ? *AtomicElementSize : 1) &&
+         "Store size is expected to match type size");
 
-  BasicBlock *LoopBB =
-      BasicBlock::Create(Ctx, "loop-memcpy-expansion", ParentFunc, PostLoopBB);
-  IRBuilder<> LoopBuilder(LoopBB);
+  LoopExpansionInfo LEI = insertLoopExpansion(
+      InsertBefore, CopyLen, LoopOpSize, ResidualLoopOpSize, "dynamic-memcpy");
 
+  // Fill MainLoopBB.
+  IRBuilder<> MainLoopBuilder(LEI.MainLoopIP);
   Align PartSrcAlign(commonAlignment(SrcAlign, LoopOpSize));
   Align PartDstAlign(commonAlignment(DstAlign, LoopOpSize));
 
-  PHINode *LoopIndex = LoopBuilder.CreatePHI(CopyLenType, 2, "loop-index");
-  LoopIndex->addIncoming(ConstantInt::get(CopyLenType, 0U), PreLoopBB);
-
   // If we used LoopOpType as GEP element type, we would iterate over the
   // buffers in TypeStoreSize strides while copying TypeAllocSize bytes, i.e.,
   // we would miss bytes if TypeStoreSize != TypeAllocSize. Therefore, use byte
   // offsets computed from the TypeStoreSize.
-  Value *SrcGEP = LoopBuilder.CreateInBoundsGEP(Int8Type, SrcAddr, LoopIndex);
-  LoadInst *Load = LoopBuilder.CreateAlignedLoad(LoopOpType, SrcGEP,
-                                                 PartSrcAlign, SrcIsVolatile);
+  Value *SrcGEP =
+      MainLoopBuilder.CreateInBoundsGEP(Int8Type, SrcAddr, LEI.MainLoopIndex);
+  LoadInst *Load = MainLoopBuilder.CreateAlignedLoad(
+      LoopOpType, SrcGEP, PartSrcAlign, SrcIsVolatile);
   if (!CanOverlap) {
     // Set alias scope for loads.
     Load->setMetadata(LLVMContext::MD_alias_scope, MDNode::get(Ctx, NewScope));
   }
-  Value *DstGEP = LoopBuilder.CreateInBoundsGEP(Int8Type, DstAddr, LoopIndex);
-  StoreInst *Store =
-      LoopBuilder.CreateAlignedStore(Load, DstGEP, PartDstAlign, DstIsVolatile);
+  Value *DstGEP =
+      MainLoopBuilder.CreateInBoundsGEP(Int8Type, DstAddr, LEI.MainLoopIndex);
+  StoreInst *Store = MainLoopBuilder.CreateAlignedStore(
+      Load, DstGEP, PartDstAlign, DstIsVolatile);
   if (!CanOverlap) {
     // Indicate that stores don't overlap loads.
     Store->setMetadata(LLVMContext::MD_noalias, MDNode::get(Ctx, NewScope));
@@ -257,95 +419,35 @@ void llvm::createMemCpyLoopUnknownSize(
     Load->setAtomic(AtomicOrdering::Unordered);
     Store->setAtomic(AtomicOrdering::Unordered);
   }
-  Value *NewIndex = LoopBuilder.CreateAdd(
-      LoopIndex, ConstantInt::get(CopyLenType, LoopOpSize));
-  LoopIndex->addIncoming(NewIndex, LoopBB);
-
-  bool RequiresResidual =
-      !LoopOpIsInt8 && !(AtomicElementSize && LoopOpSize == AtomicElementSize);
-  if (RequiresResidual) {
-    Type *ResLoopOpType = AtomicElementSize
-                              ? Type::getIntNTy(Ctx, *AtomicElementSize * 8)
-                              : Int8Type;
-    unsigned ResLoopOpSize = DL.getTypeStoreSize(ResLoopOpType);
-    assert((ResLoopOpSize == AtomicElementSize ? *AtomicElementSize : 1) &&
-           "Store size is expected to match type size");
-
-    Align ResSrcAlign(commonAlignment(PartSrcAlign, ResLoopOpSize));
-    Align ResDstAlign(commonAlignment(PartDstAlign, ResLoopOpSize));
-
-    // Loop body for the residual copy.
-    BasicBlock *ResLoopBB = BasicBlock::Create(
-        Ctx, "loop-memcpy-residual", PreLoopBB->getParent(), PostLoopBB);
-    // Residual loop header.
-    BasicBlock *ResHeaderBB = BasicBlock::Create(
-        Ctx, "loop-memcpy-residual-header", PreLoopBB->getParent(), nullptr);
-
-    // Need to update the pre-loop basic block to branch to the correct place.
-    // branch to the main loop if the count is non-zero, branch to the residual
-    // loop if the copy size is smaller then 1 iteration of the main loop but
-    // non-zero and finally branch to after the residual loop if the memcpy
-    //  size is zero.
-    ConstantInt *Zero = ConstantInt::get(ILengthType, 0U);
-    PLBuilder.CreateCondBr(PLBuilder.CreateICmpNE(RuntimeLoopBytes, Zero),
-                           LoopBB, ResHeaderBB);
-    PreLoopBB->getTerminator()->eraseFromParent();
 
-    LoopBuilder.CreateCondBr(
-        LoopBuilder.CreateICmpULT(NewIndex, RuntimeLoopBytes), LoopBB,
-        ResHeaderBB);
-
-    // Determine if we need to branch to the residual loop or bypass it.
-    IRBuilder<> RHBuilder(ResHeaderBB);
-    RHBuilder.CreateCondBr(RHBuilder.CreateICmpNE(RuntimeResidualBytes, Zero),
-                           ResLoopBB, PostLoopBB);
+  // Fill ResidualLoopBB.
+  if (!LEI.ResidualLoopIP)
+    return;
 
-    // Copy the residual with single byte load/store loop.
-    IRBuilder<> ResBuilder(ResLoopBB);
-    PHINode *ResidualIndex =
-        ResBuilder.CreatePHI(CopyLenType, 2, "residual-loop-index");
-    ResidualIndex->addIncoming(Zero, ResHeaderBB);
+  Align ResSrcAlign(commonAlignment(PartSrcAlign, ResidualLoopOpSize));
+  Align ResDstAlign(commonAlignment(PartDstAlign, ResidualLoopOpSize));
 
-    Value *FullOffset = ResBuilder.CreateAdd(RuntimeLoopBytes, ResidualIndex);
-    Value *SrcGEP = ResBuilder.CreateInBoundsGEP(Int8Type, SrcAddr, FullOffset);
-    LoadInst *Load = ResBuilder.CreateAlignedLoad(ResLoopOpType, SrcGEP,
-                                                  ResSrcAlign, SrcIsVolatile);
-    if (!CanOverlap) {
-      // Set alias scope for loads.
-      Load->setMetadata(LLVMContext::MD_alias_scope,
-                        MDNode::get(Ctx, NewScope));
-    }
-    Value *DstGEP = ResBuilder.CreateInBoundsGEP(Int8Type, DstAddr, FullOffset);
-    StoreInst *Store =
-        ResBuilder.CreateAlignedStore(Load, DstGEP, ResDstAlign, DstIsVolatile);
-    if (!CanOverlap) {
-      // Indicate that stores don't overlap loads.
-      Store->setMetadata(LLVMContext::MD_noalias, MDNode::get(Ctx, NewScope));
-    }
-    if (AtomicElementSize) {
-      Load->setAtomic(AtomicOrdering::Unordered);
-      Store->setAtomic(AtomicOrdering::Unordered);
-    }
-    Value *ResNewIndex = ResBuilder.CreateAdd(
-        ResidualIndex, ConstantInt::get(CopyLenType, ResLoopOpSize));
-    ResidualIndex->addIncoming(ResNewIndex, ResLoopBB);
-
-    // Create the loop branch condition.
-    ResBuilder.CreateCondBr(
-        ResBuilder.CreateICmpULT(ResNewIndex, RuntimeResidualBytes), ResLoopBB,
-        PostLoopBB);
-  } else {
-    // In this case the loop operand type was a byte, and there is no need for a
-    // residual loop to copy the remaining memory after the main loop.
-    // We do however need to patch up the control flow by creating the
-    // terminators for the preloop block and the memcpy loop.
-    ConstantInt *Zero = ConstantInt::get(ILengthType, 0U);
-    PLBuilder.CreateCondBr(PLBuilder.CreateICmpNE(RuntimeLoopBytes, Zero),
-                           LoopBB, PostLoopBB);
-    PreLoopBB->getTerminator()->eraseFromParent();
-    LoopBuilder.CreateCondBr(
-        LoopBuilder.CreateICmpULT(NewIndex, RuntimeLoopBytes), LoopBB,
-        PostLoopBB);
+  IRBuilder<> ResLoopBuilder(LEI.ResidualLoopIP);
+  Value *ResSrcGEP = ResLoopBuilder.CreateInBoundsGEP(Int8Type, SrcAddr,
+                                                      LEI.ResidualLoopIndex);
+  LoadInst *ResLoad = ResLoopBuilder.CreateAlignedLoad(
+      ResidualLoopOpType, ResSrcGEP, ResSrcAlign, SrcIsVolatile);
+  if (!CanOverlap) {
+    // Set alias scope for loads.
+    ResLoad->setMetadata(LLVMContext::MD_alias_scope,
+                         MDNode::get(Ctx, NewScope));
+  }
+  Value *ResDstGEP = ResLoopBuilder.CreateInBoundsGEP(Int8Type, DstAddr,
+                                                      LEI.ResidualLoopIndex);
+  StoreInst *ResStore = ResLoopBuilder.CreateAlignedStore(
+      ResLoad, ResDstGEP, ResDstAlign, DstIsVolatile);
+  if (!CanOverlap) {
+    // Indicate that stores don't overlap loads.
+    ResStore->setMetadata(LLVMContext::MD_noalias, MDNode::get(Ctx, NewScope));
+  }
+  if (AtomicElementSize) {
+    ResLoad->setAtomic(AtomicOrdering::Unordered);
+    ResStore->setAtomic(AtomicOrdering::Unordered);
   }
 }
 
@@ -439,9 +541,9 @@ static void createMemMoveLoopUnknownSize(Instruction *InsertBefore,
   Value *RuntimeLoopRemainder = nullptr;
   Value *SkipResidualCondition = nullptr;
   if (RequiresResidual) {
-    RuntimeLoopRemainder = getRuntimeLoopRemainder(DL, PLBuilder, CopyLen,
-                                                   CILoopOpSize, LoopOpSize);
-    RuntimeLoopBytes = getRuntimeLoopBytes(DL, PLBuilder, CopyLen, CILoopOpSize,
+    RuntimeLoopRemainder =
+        getRuntimeLoopRemainder(PLBuilder, CopyLen, CILoopOpSize, LoopOpSize);
+    RuntimeLoopBytes = getRuntimeLoopUnits(PLBuilder, CopyLen, CILoopOpSize,
                                            LoopOpSize, RuntimeLoopRemainder);
     SkipResidualCondition =
         PLBuilder.CreateICmpEQ(RuntimeLoopRemainder, Zero, "skip_residual");
diff --git a/llvm/lib/Transforms/Utils/SCCPSolver.cpp b/llvm/lib/Transforms/Utils/SCCPSolver.cpp
index 4947d03a2dc66..021bf0618754a 100644
--- a/llvm/lib/Transforms/Utils/SCCPSolver.cpp
+++ b/llvm/lib/Transforms/Utils/SCCPSolver.cpp
@@ -2098,6 +2098,38 @@ void SCCPInstVisitor::handleCallResult(CallBase &CB) {
       return (void)mergeInValue(ValueState[II], II,
                                 ValueLatticeElement::getRange(Result));
     }
+    if (II->getIntrinsicID() == Intrinsic::experimental_get_vector_length) {
+      Value *CountArg = II->getArgOperand(0);
+      Value *VF = II->getArgOperand(1);
+      bool Scalable = cast<ConstantInt>(II->getArgOperand(2))->isOne();
+
+      // Computation happens in the larger type.
+      unsigned BitWidth = std::max(CountArg->getType()->getScalarSizeInBits(),
+                                   VF->getType()->getScalarSizeInBits());
+
+      ConstantRange Count = getValueState(CountArg)
+                                .asConstantRange(CountArg->getType(), false)
+                                .zeroExtend(BitWidth);
+      ConstantRange MaxLanes = getValueState(VF)
+                                   .asConstantRange(VF->getType(), false)
+                                   .zeroExtend(BitWidth);
+      if (Scalable)
+        MaxLanes =
+            MaxLanes.multiply(getVScaleRange(II->getFunction(), BitWidth));
+
+      // The result is always less than both Count and MaxLanes.
+      ConstantRange Result(
+          APInt::getZero(BitWidth),
+          APIntOps::umin(Count.getUpper(), MaxLanes.getUpper()));
+
+      // If Count <= MaxLanes, getvectorlength(Count, MaxLanes) = Count
+      if (Count.icmp(CmpInst::ICMP_ULE, MaxLanes))
+        Result = Count;
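+      // Illustrative example of the above: if Count is known to be in
+      // [0, 100] and VF is the fixed constant 8 (Scalable == false), then
+      // MaxLanes is {8} and Result is [0, 9), i.e. 0..8; since Count is not
+      // known to be <= MaxLanes, Result is not narrowed to Count.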
+
+      Result = Result.truncate(II->getType()->getScalarSizeInBits());
+      return (void)mergeInValue(ValueState[II], II,
+                                ValueLatticeElement::getRange(Result));
+    }
 
     if (ConstantRange::isIntrinsicSupported(II->getIntrinsicID())) {
       // Compute result range for intrinsics supported by ConstantRange.
diff --git a/llvm/lib/Transforms/Utils/SimplifyIndVar.cpp b/llvm/lib/Transforms/Utils/SimplifyIndVar.cpp
index 43264cce73719..61acf3ab61a22 100644
--- a/llvm/lib/Transforms/Utils/SimplifyIndVar.cpp
+++ b/llvm/lib/Transforms/Utils/SimplifyIndVar.cpp
@@ -43,7 +43,9 @@ STATISTIC(
 STATISTIC(
     NumSimplifiedSRem,
     "Number of IV signed remainder operations converted to unsigned remainder");
-STATISTIC(NumElimCmp     , "Number of IV comparisons eliminated");
+STATISTIC(NumElimCmp, "Number of IV comparisons eliminated");
+STATISTIC(NumInvariantCmp, "Number of IV comparisons made loop invariant");
+STATISTIC(NumSameSign, "Number of IV comparisons with new samesign flags");
 
 namespace {
   /// This is a utility for simplifying induction variables
@@ -275,25 +277,33 @@ void SimplifyIndvar::eliminateIVComparison(ICmpInst *ICmp,
     ICmp->replaceAllUsesWith(ConstantInt::getBool(ICmp->getContext(), *Ev));
     DeadInsts.emplace_back(ICmp);
     LLVM_DEBUG(dbgs() << "INDVARS: Eliminated comparison: " << *ICmp << '\n');
-  } else if (makeIVComparisonInvariant(ICmp, IVOperand)) {
-    // fallthrough to end of function
-  } else if (ICmpInst::isSigned(OriginalPred) &&
-             SE->isKnownNonNegative(S) && SE->isKnownNonNegative(X)) {
-    // If we were unable to make anything above, all we can is to canonicalize
-    // the comparison hoping that it will open the doors for other
-    // optimizations. If we find out that we compare two non-negative values,
-    // we turn the instruction's predicate to its unsigned version. Note that
-    // we cannot rely on Pred here unless we check if we have swapped it.
+    ++NumElimCmp;
+    Changed = true;
+    return;
+  }
+
+  if (makeIVComparisonInvariant(ICmp, IVOperand)) {
+    ++NumInvariantCmp;
+    Changed = true;
+    return;
+  }
+
+  if ((ICmpInst::isSigned(OriginalPred) ||
+       (ICmpInst::isUnsigned(OriginalPred) && !ICmp->hasSameSign())) &&
+      SE->haveSameSign(S, X)) {
+    // Set the samesign flag on the compare if legal, and canonicalize to
+    // the unsigned variant (for signed compares) hoping that it will open
+    // the doors for other optimizations.  Note that we cannot rely on Pred
+    // here unless we check if we have swapped it.
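+    // For example (illustrative): if %a and %b are proven to have the same
+    // sign, "icmp slt i32 %a, %b" becomes "icmp samesign ult i32 %a, %b";
+    // an already-unsigned compare just gains the samesign flag.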
     assert(ICmp->getPredicate() == OriginalPred && "Predicate changed?");
-    LLVM_DEBUG(dbgs() << "INDVARS: Turn to unsigned comparison: " << *ICmp
+    LLVM_DEBUG(dbgs() << "INDVARS: Marking comparison samesign: " << *ICmp
                       << '\n');
     ICmp->setPredicate(ICmpInst::getUnsignedPredicate(OriginalPred));
     ICmp->setSameSign();
-  } else
+    NumSameSign++;
+    Changed = true;
     return;
-
-  ++NumElimCmp;
-  Changed = true;
+  }
 }
 
 bool SimplifyIndvar::eliminateSDiv(BinaryOperator *SDiv) {
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index 379f4e6602a7d..26e2d44bdc9e6 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -179,17 +179,37 @@ void LoopVectorizeHints::setAlreadyVectorized() {
   IsVectorized.Value = 1;
 }
 
+void LoopVectorizeHints::reportDisallowedVectorization(
+    const StringRef DebugMsg, const StringRef RemarkName,
+    const StringRef RemarkMsg, const Loop *L) const {
+  LLVM_DEBUG(dbgs() << "LV: Not vectorizing: " << DebugMsg << ".\n");
+  ORE.emit(OptimizationRemarkMissed(LV_NAME, RemarkName, L->getStartLoc(),
+                                    L->getHeader())
+           << "loop not vectorized: " << RemarkMsg);
+}
+
 bool LoopVectorizeHints::allowVectorization(
     Function *F, Loop *L, bool VectorizeOnlyWhenForced) const {
   if (getForce() == LoopVectorizeHints::FK_Disabled) {
-    LLVM_DEBUG(dbgs() << "LV: Not vectorizing: #pragma vectorize disable.\n");
-    emitRemarkWithHints();
+    if (Force.Value == LoopVectorizeHints::FK_Disabled) {
+      reportDisallowedVectorization("#pragma vectorize disable",
+                                    "MissedExplicitlyDisabled",
+                                    "vectorization is explicitly disabled", L);
+    } else if (hasDisableAllTransformsHint(L)) {
+      reportDisallowedVectorization("loop hasDisableAllTransformsHint",
+                                    "MissedTransformsDisabled",
+                                    "loop transformations are disabled", L);
+    } else {
+      llvm_unreachable("loop vect disabled for an unknown reason");
+    }
     return false;
   }
 
   if (VectorizeOnlyWhenForced && getForce() != LoopVectorizeHints::FK_Enabled) {
-    LLVM_DEBUG(dbgs() << "LV: Not vectorizing: No #pragma vectorize enable.\n");
-    emitRemarkWithHints();
+    reportDisallowedVectorization(
+        "VectorizeOnlyWhenForced is set, and no #pragma vectorize enable",
+        "MissedForceOnly", "only vectorizing loops that explicitly request it",
+        L);
     return false;
   }
 
@@ -877,6 +897,11 @@ bool LoopVectorizationLegality::canVectorizeInstr(Instruction &I) {
       Requirements->addExactFPMathInst(RedDes.getExactFPMathInst());
       AllowedExit.insert(RedDes.getLoopExitInstr());
       Reductions[Phi] = RedDes;
+      assert((!RedDes.hasUsesOutsideReductionChain() ||
+              RecurrenceDescriptor::isMinMaxRecurrenceKind(
+                  RedDes.getRecurrenceKind())) &&
+             "Only min/max recurrences are allowed to have multiple uses "
+             "currently");
       return true;
     }
 
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 62b68232925d9..9a94d29ba3307 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5122,8 +5122,18 @@ InstructionCost LoopVectorizationCostModel::expectedCost(ElementCount VF) {
       InstructionCost C = getInstructionCost(&I, VF);
 
       // Check if we should override the cost.
-      if (C.isValid() && ForceTargetInstructionCost.getNumOccurrences() > 0)
-        C = InstructionCost(ForceTargetInstructionCost);
+      if (C.isValid() && ForceTargetInstructionCost.getNumOccurrences() > 0) {
+        // For interleave groups, use ForceTargetInstructionCost once for the
+        // whole group.
+        if (VF.isVector() && getWideningDecision(&I, VF) == CM_Interleave) {
+          if (getInterleavedAccessGroup(&I)->getInsertPos() == &I)
+            C = InstructionCost(ForceTargetInstructionCost);
+          else
+            C = InstructionCost(0);
+        } else {
+          C = InstructionCost(ForceTargetInstructionCost);
+        }
+      }
 
       BlockCost += C;
       LLVM_DEBUG(dbgs() << "LV: Found an estimated cost of " << C << " for VF "
@@ -6593,6 +6603,11 @@ void LoopVectorizationCostModel::collectInLoopReductions() {
     PHINode *Phi = Reduction.first;
     const RecurrenceDescriptor &RdxDesc = Reduction.second;
 
+    // Multi-use reductions (e.g., used in FindLastIV patterns) are handled
+    // separately and should not be considered for in-loop reductions.
+    if (RdxDesc.hasUsesOutsideReductionChain())
+      continue;
+
     // We don't collect reductions that are type promoted (yet).
     if (RdxDesc.getRecurrenceType() != Phi->getType())
       continue;
@@ -7182,17 +7197,29 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
   VPCostContext CostCtx(CM.TTI, *CM.TLI, BestPlan, CM, CM.CostKind,
                         *CM.PSE.getSE(), OrigLoop);
   precomputeCosts(BestPlan, BestFactor.Width, CostCtx);
-  // Verify that the VPlan-based and legacy cost models agree, except for VPlans
-  // with early exits and plans with additional VPlan simplifications. The
-  // legacy cost model doesn't properly model costs for such loops.
-  assert((BestFactor.Width == LegacyVF.Width || BestPlan.hasEarlyExit() ||
-          !Legal->getLAI()->getSymbolicStrides().empty() ||
-          planContainsAdditionalSimplifications(getPlanFor(BestFactor.Width),
-                                                CostCtx, OrigLoop,
-                                                BestFactor.Width) ||
-          planContainsAdditionalSimplifications(
-              getPlanFor(LegacyVF.Width), CostCtx, OrigLoop, LegacyVF.Width)) &&
-         " VPlan cost model and legacy cost model disagreed");
+  // Verify that the VPlan-based and legacy cost models agree, except for
+  // * VPlans with early exits,
+  // * VPlans with additional VPlan simplifications,
+  // * EVL-based VPlans with gather/scatters (the VPlan-based cost model uses
+  //   vp_scatter/vp_gather).
+  // The legacy cost model doesn't properly model costs for such loops.
+  bool UsesEVLGatherScatter =
+      any_of(VPBlockUtils::blocksOnly<VPBasicBlock>(vp_depth_first_shallow(
+                 BestPlan.getVectorLoopRegion()->getEntry())),
+             [](VPBasicBlock *VPBB) {
+               return any_of(*VPBB, [](VPRecipeBase &R) {
+                 return isa<VPWidenLoadEVLRecipe, VPWidenStoreEVLRecipe>(&R) &&
+                        !cast<VPWidenMemoryRecipe>(&R)->isConsecutive();
+               });
+             });
+  assert(
+      (BestFactor.Width == LegacyVF.Width || BestPlan.hasEarlyExit() ||
+       !Legal->getLAI()->getSymbolicStrides().empty() || UsesEVLGatherScatter ||
+       planContainsAdditionalSimplifications(
+           getPlanFor(BestFactor.Width), CostCtx, OrigLoop, BestFactor.Width) ||
+       planContainsAdditionalSimplifications(
+           getPlanFor(LegacyVF.Width), CostCtx, OrigLoop, LegacyVF.Width)) &&
+      " VPlan cost model and legacy cost model disagreed");
   assert((BestFactor.Width.isScalar() || BestFactor.ScalarCost > 0) &&
          "when vectorizing, the scalar cost must be computed.");
 #endif
@@ -7998,9 +8025,10 @@ void VPRecipeBuilder::collectScaledReductions(VFRange &Range) {
   MapVector<Instruction *,
             SmallVector<std::pair<PartialReductionChain, unsigned>>>
       ChainsByPhi;
-  for (const auto &[Phi, RdxDesc] : Legal->getReductionVars())
-    getScaledReductions(Phi, RdxDesc.getLoopExitInstr(), Range,
-                        ChainsByPhi[Phi]);
+  for (const auto &[Phi, RdxDesc] : Legal->getReductionVars()) {
+    if (Instruction *RdxExitInstr = RdxDesc.getLoopExitInstr())
+      getScaledReductions(Phi, RdxExitInstr, Range, ChainsByPhi[Phi]);
+  }
 
   // A partial reduction is invalid if any of its extends are used by
   // something that isn't another partial reduction. This is because the
@@ -8221,7 +8249,8 @@ VPRecipeBase *VPRecipeBuilder::tryToCreateWidenRecipe(VPSingleDefRecipe *R,
       PhiRecipe = new VPReductionPHIRecipe(
           Phi, RdxDesc.getRecurrenceKind(), *StartV,
           getReductionStyle(UseInLoopReduction, UseOrderedReductions,
-                            ScaleFactor));
+                            ScaleFactor),
+          RdxDesc.hasUsesOutsideReductionChain());
     } else {
       // TODO: Currently fixed-order recurrences are modeled as chains of
       // first-order recurrences. If there are no users of the intermediate
@@ -8350,6 +8379,7 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
             std::unique_ptr<VPlan>(VPlan0->duplicate()), SubRange, &LVer)) {
       // Now optimize the initial VPlan.
       VPlanTransforms::hoistPredicatedLoads(*Plan, *PSE.getSE(), OrigLoop);
+      VPlanTransforms::sinkPredicatedStores(*Plan, *PSE.getSE(), OrigLoop);
       VPlanTransforms::runPass(VPlanTransforms::truncateToMinimalBitwidths,
                                *Plan, CM.getMinimalBitwidths());
       VPlanTransforms::runPass(VPlanTransforms::optimize, *Plan);
@@ -8555,6 +8585,11 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(
   // Adjust the recipes for any inloop reductions.
   adjustRecipesForReductions(Plan, RecipeBuilder, Range.Start);
 
+  // Apply mandatory transformation to handle reductions with multiple in-loop
+  // uses if possible, bail out otherwise.
+  if (!VPlanTransforms::runPass(VPlanTransforms::handleMultiUseReductions,
+                                *Plan))
+    return nullptr;
   // Apply mandatory transformation to handle FP maxnum/minnum reduction with
   // NaNs if possible, bail out otherwise.
   if (!VPlanTransforms::runPass(VPlanTransforms::handleMaxMinNumReductions,
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 0eb8ad8d3c93d..5ca89766dc837 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -95,6 +95,7 @@
 #include <cassert>
 #include <cstdint>
 #include <iterator>
+#include <map>
 #include <memory>
 #include <optional>
 #include <set>
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index a464d019754ba..6ca750fc53279 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -2071,6 +2071,9 @@ class LLVM_ABI_FOR_TEST VPHeaderPHIRecipe : public VPSingleDefRecipe,
   static inline bool classof(const VPValue *V) {
     return isa<VPHeaderPHIRecipe>(V->getDefiningRecipe());
   }
+  static inline bool classof(const VPSingleDefRecipe *R) {
+    return isa<VPHeaderPHIRecipe>(static_cast<const VPRecipeBase *>(R));
+  }
 
   /// Generate the phi nodes.
   void execute(VPTransformState &State) override = 0;
@@ -2136,7 +2139,7 @@ class VPWidenInductionRecipe : public VPHeaderPHIRecipe {
     return R && classof(R);
   }
 
-  static inline bool classof(const VPHeaderPHIRecipe *R) {
+  static inline bool classof(const VPSingleDefRecipe *R) {
     return classof(static_cast<const VPRecipeBase *>(R));
   }
 
@@ -2432,19 +2435,27 @@ class VPReductionPHIRecipe : public VPHeaderPHIRecipe,
 
   ReductionStyle Style;
 
+  /// The phi is part of a multi-use reduction (e.g., used in FindLastIV
+  /// patterns for argmin/argmax).
+  /// TODO: Also support cases where the phi itself has a single use, but its
+  /// compare has multiple uses.
+  bool HasUsesOutsideReductionChain;
+
 public:
   /// Create a new VPReductionPHIRecipe for the reduction \p Phi.
   VPReductionPHIRecipe(PHINode *Phi, RecurKind Kind, VPValue &Start,
-                       ReductionStyle Style)
+                       ReductionStyle Style,
+                       bool HasUsesOutsideReductionChain = false)
       : VPHeaderPHIRecipe(VPDef::VPReductionPHISC, Phi, &Start), Kind(Kind),
-        Style(Style) {}
+        Style(Style),
+        HasUsesOutsideReductionChain(HasUsesOutsideReductionChain) {}
 
   ~VPReductionPHIRecipe() override = default;
 
   VPReductionPHIRecipe *clone() override {
     auto *R = new VPReductionPHIRecipe(
         dyn_cast_or_null<PHINode>(getUnderlyingValue()), getRecurrenceKind(),
-        *getOperand(0), Style);
+        *getOperand(0), Style, HasUsesOutsideReductionChain);
     R->addOperand(getBackedgeValue());
     return R;
   }
@@ -2481,6 +2492,11 @@ class VPReductionPHIRecipe : public VPHeaderPHIRecipe,
   /// Returns true if the reduction outputs a vector with a scaled down VF.
   bool isPartialReduction() const { return getVFScaleFactor() > 1; }
 
+  /// Returns true if the phi is part of a multi-use reduction.
+  bool hasUsesOutsideReductionChain() const {
+    return HasUsesOutsideReductionChain;
+  }
+
   /// Returns true if the recipe only uses the first lane of operand \p Op.
   bool usesFirstLaneOnly(const VPValue *Op) const override {
     assert(is_contained(operands(), Op) &&
diff --git a/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp b/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp
index 92969c8ed9ec0..329b62cee4fce 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp
@@ -22,6 +22,7 @@
 #include "llvm/Analysis/ScalarEvolution.h"
 #include "llvm/IR/InstrTypes.h"
 #include "llvm/IR/MDBuilder.h"
+#include "llvm/Transforms/Utils/LoopUtils.h"
 #include "llvm/Transforms/Utils/LoopVersioning.h"
 
 #define DEBUG_TYPE "vplan"
@@ -827,15 +828,18 @@ void VPlanTransforms::addMinimumVectorEpilogueIterationCheck(
   Branch->setMetadata(LLVMContext::MD_prof, BranchWeights);
 }
 
-/// If \p RedPhiR is used by a ComputeReductionResult recipe, return it.
-/// Otherwise return nullptr.
-static VPInstruction *
-findComputeReductionResult(VPReductionPHIRecipe *RedPhiR) {
-  auto It = find_if(RedPhiR->users(), [](VPUser *U) {
-    auto *VPI = dyn_cast<VPInstruction>(U);
-    return VPI && VPI->getOpcode() == VPInstruction::ComputeReductionResult;
-  });
-  return It == RedPhiR->user_end() ? nullptr : cast<VPInstruction>(*It);
+/// If \p V is used by a recipe matching pattern \p P, return it. Otherwise
+/// return nullptr.
+template <typename MatchT>
+static VPRecipeBase *findUserOf(VPValue *V, const MatchT &P) {
+  auto It = find_if(V->users(), match_fn(P));
+  return It == V->user_end() ? nullptr : cast<VPRecipeBase>(*It);
+}
+
+/// If \p V is used by a VPInstruction with \p Opcode, return it. Otherwise
+/// return nullptr.
+template <unsigned Opcode> static VPInstruction *findUserOf(VPValue *V) {
+  return cast_or_null<VPInstruction>(findUserOf(V, m_VPInstruction<Opcode>()));
 }
 
 bool VPlanTransforms::handleMaxMinNumReductions(VPlan &Plan) {
@@ -932,7 +936,8 @@ bool VPlanTransforms::handleMaxMinNumReductions(VPlan &Plan) {
 
     // If we exit early due to NaNs, compute the final reduction result based on
     // the reduction phi at the beginning of the last vector iteration.
-    auto *RdxResult = findComputeReductionResult(RedPhiR);
+    auto *RdxResult =
+        findUserOf<VPInstruction::ComputeReductionResult>(RedPhiR);
 
     auto *NewSel = MiddleBuilder.createSelect(AnyNaNLane, RedPhiR,
                                               RdxResult->getOperand(1));
@@ -991,3 +996,155 @@ bool VPlanTransforms::handleMaxMinNumReductions(VPlan &Plan) {
   MiddleTerm->setOperand(0, NewCond);
   return true;
 }
+
+bool VPlanTransforms::handleMultiUseReductions(VPlan &Plan) {
+  for (auto &PhiR : make_early_inc_range(
+           Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis())) {
+    auto *MinMaxPhiR = dyn_cast<VPReductionPHIRecipe>(&PhiR);
+    // TODO: check for multi-uses in VPlan directly.
+    if (!MinMaxPhiR || !MinMaxPhiR->hasUsesOutsideReductionChain())
+      continue;
+
+    // MinMaxPhiR has users outside the reduction cycle in the loop. Check if
+    // the only other user is a FindLastIV reduction. MinMaxPhiR must have
+    // exactly 3 users: 1) the min/max operation, 2) the compare of a
+    // FindLastIV reduction, and 3) ComputeReductionResult. The comparison
+    // must compare MinMaxPhiR against the min/max operand used for the
+    // min/max reduction
+    // and only be used by the select of the FindLastIV reduction.
+    RecurKind RdxKind = MinMaxPhiR->getRecurrenceKind();
+    assert(
+        RecurrenceDescriptor::isIntMinMaxRecurrenceKind(RdxKind) &&
+        "only min/max recurrences support users outside the reduction chain");
+
+    auto *MinMaxOp =
+        dyn_cast<VPRecipeWithIRFlags>(MinMaxPhiR->getBackedgeValue());
+    if (!MinMaxOp)
+      return false;
+
+    // Check that MinMaxOp is a VPWidenIntrinsicRecipe or VPReplicateRecipe
+    // with an intrinsic that matches the reduction kind.
+    Intrinsic::ID ExpectedIntrinsicID = getMinMaxReductionIntrinsicOp(RdxKind);
+    if (!match(MinMaxOp, m_Intrinsic(ExpectedIntrinsicID)))
+      return false;
+
+    // MinMaxOp must have 2 users: 1) MinMaxPhiR and 2) ComputeReductionResult
+    // (asserted below).
+    assert(MinMaxOp->getNumUsers() == 2 &&
+           "MinMaxOp must have exactly 2 users");
+    VPValue *MinMaxOpValue = MinMaxOp->getOperand(0);
+    if (MinMaxOpValue == MinMaxPhiR)
+      MinMaxOpValue = MinMaxOp->getOperand(1);
+
+    VPValue *CmpOpA;
+    VPValue *CmpOpB;
+    CmpPredicate Pred;
+    auto *Cmp = dyn_cast_or_null<VPRecipeWithIRFlags>(findUserOf(
+        MinMaxPhiR, m_Cmp(Pred, m_VPValue(CmpOpA), m_VPValue(CmpOpB))));
+    if (!Cmp || Cmp->getNumUsers() != 1 ||
+        (CmpOpA != MinMaxOpValue && CmpOpB != MinMaxOpValue))
+      return false;
+
+    if (MinMaxOpValue != CmpOpB)
+      Pred = CmpInst::getSwappedPredicate(Pred);
+
+    // MinMaxPhiR must have exactly 3 users:
+    // * MinMaxOp,
+    // * Cmp (that's part of a FindLastIV chain),
+    // * ComputeReductionResult.
+    if (MinMaxPhiR->getNumUsers() != 3)
+      return false;
+
+    VPInstruction *MinMaxResult =
+        findUserOf<VPInstruction::ComputeReductionResult>(MinMaxPhiR);
+    assert(is_contained(MinMaxPhiR->users(), MinMaxOp) &&
+           "one user must be MinMaxOp");
+    assert(MinMaxResult && "MinMaxResult must be a user of MinMaxPhiR");
+    assert(is_contained(MinMaxOp->users(), MinMaxResult) &&
+           "MinMaxResult must be a user of MinMaxOp (and of MinMaxPhiR");
+
+    // Cmp must be used by the select of a FindLastIV chain.
+    VPValue *Sel = dyn_cast<VPSingleDefRecipe>(Cmp->getSingleUser());
+    VPValue *IVOp, *FindIV;
+    if (!Sel || Sel->getNumUsers() != 2 ||
+        !match(Sel,
+               m_Select(m_Specific(Cmp), m_VPValue(IVOp), m_VPValue(FindIV))))
+      return false;
+
+    if (!isa<VPReductionPHIRecipe>(FindIV)) {
+      std::swap(FindIV, IVOp);
+      Pred = CmpInst::getInversePredicate(Pred);
+    }
+
+    auto *FindIVPhiR = dyn_cast<VPReductionPHIRecipe>(FindIV);
+    if (!FindIVPhiR || !RecurrenceDescriptor::isFindLastIVRecurrenceKind(
+                           FindIVPhiR->getRecurrenceKind()))
+      return false;
+
+    // TODO: Support cases where IVOp is the IV increment.
+    if (!match(IVOp, m_TruncOrSelf(m_VPValue(IVOp))) ||
+        !isa<VPWidenIntOrFpInductionRecipe>(IVOp))
+      return false;
+
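+    // Illustrative note: for a UMin reduction the expected FindLastIV
+    // pattern is "%c = icmp uge %min.phi, %val; %idx = select %c, %iv,
+    // %idx.phi", i.e. the index is updated whenever the current value ties
+    // or improves the running minimum, so the last matching IV is kept.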
+    CmpInst::Predicate RdxPredicate = [RdxKind]() {
+      switch (RdxKind) {
+      case RecurKind::UMin:
+        return CmpInst::ICMP_UGE;
+      case RecurKind::UMax:
+        return CmpInst::ICMP_ULE;
+      case RecurKind::SMax:
+        return CmpInst::ICMP_SLE;
+      case RecurKind::SMin:
+        return CmpInst::ICMP_SGE;
+      default:
+        llvm_unreachable("unhandled recurrence kind");
+      }
+    }();
+
+    // TODO: Strict predicates need to find the first IV value for which the
+    // predicate holds, not the last.
+    if (Pred != RdxPredicate)
+      return false;
+
+    assert(!FindIVPhiR->isInLoop() && !FindIVPhiR->isOrdered() &&
+           "cannot handle inloop/ordered reductions yet");
+
+    // The reduction using MinMaxPhiR needs adjusting to compute the correct
+    // result:
+    //  1. We need to find the last IV for which the condition based on the
+    //     min/max recurrence is true,
+    //  2. Compare the partial min/max reduction result to its final value and,
+    //  3. Select the lanes of the partial FindLastIV reductions which
+    //     correspond to the lanes matching the min/max reduction result.
+    //
+    // For example, this transforms
+    // vp<%min.result> = compute-reduction-result ir<%min.val>,
+    //                                            ir<%min.val.next>
+    // vp<%find.iv.result> = compute-find-iv-result ir<%min.idx>, ir<0>,
+    //                                             SENTINEL, vp<%min.idx.next>
+    //
+    // into:
+    //
+    // vp<%min.result> = compute-reduction-result ir<%min.val>, ir<%min.val.next>
+    // vp<%final.min.cmp> = icmp eq ir<%min.val.next>, vp<%min.result>
+    // vp<%final.iv> = select vp<%final.min.cmp>, ir<%min.idx.next>, SENTINEL
+    // vp<%find.iv.result> = compute-find-iv-result ir<%min.idx>, ir<0>,
+    //                                             SENTINEL, vp<%final.iv>
+    VPInstruction *FindIVResult =
+        findUserOf<VPInstruction::ComputeFindIVResult>(FindIVPhiR);
+    assert(FindIVResult->getParent() == MinMaxResult->getParent() &&
+           "both results must be computed in the same block");
+    MinMaxResult->moveBefore(*FindIVResult->getParent(),
+                             FindIVResult->getIterator());
+
+    VPBuilder B(FindIVResult);
+    VPValue *MinMaxExiting = MinMaxResult->getOperand(1);
+    auto *FinalMinMaxCmp =
+        B.createICmp(CmpInst::ICMP_EQ, MinMaxExiting, MinMaxResult);
+    VPValue *Sentinel = FindIVResult->getOperand(2);
+    VPValue *LastIVExiting = FindIVResult->getOperand(3);
+    auto *FinalIVSelect =
+        B.createSelect(FinalMinMaxCmp, LastIVExiting, Sentinel);
+    FindIVResult->setOperand(3, FinalIVSelect);
+  }
+  return true;
+}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h b/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
index 07dfe31eea46d..750ef8edd94bb 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
@@ -468,6 +468,12 @@ inline AllRecipe_match<Instruction::Trunc, Op0_t> m_Trunc(const Op0_t &Op0) {
   return m_Unary<Instruction::Trunc, Op0_t>(Op0);
 }
 
+template <typename Op0_t>
+inline match_combine_or<AllRecipe_match<Instruction::Trunc, Op0_t>, Op0_t>
+m_TruncOrSelf(const Op0_t &Op0) {
+  return m_CombineOr(m_Trunc(Op0), Op0);
+}
+
 template <typename Op0_t>
 inline AllRecipe_match<Instruction::ZExt, Op0_t> m_ZExt(const Op0_t &Op0) {
   return m_Unary<Instruction::ZExt, Op0_t>(Op0);
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 0baf7172e4443..4d46478aa7373 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -280,7 +280,6 @@ InstructionCost VPRecipeBase::cost(ElementCount VF, VPCostContext &Ctx) {
   if (UI && Ctx.skipCostComputation(UI, VF.isVector())) {
     RecipeCost = 0;
   } else {
-    RecipeCost = computeCost(VF, Ctx);
     RecipeCost = computeCost(VF, Ctx);
     if (ForceTargetInstructionCost.getNumOccurrences() > 0 &&
         RecipeCost.isValid()) {
@@ -594,10 +593,9 @@ Value *VPInstruction::generate(VPTransformState &State) {
       return Builder.CreateCmp(CmpInst::Predicate::ICMP_ULT, VIVElem0, ScalarTC,
                                Name);
 
-    auto *Int1Ty = Type::getInt1Ty(Builder.getContext());
-    auto PredTy = VectorType::get(
-        Int1Ty, State.VF * cast<ConstantInt>(getOperand(2)->getLiveInIRValue())
-                               ->getZExtValue());
+    ElementCount EC = State.VF.multiplyCoefficientBy(
+        cast<ConstantInt>(getOperand(2)->getLiveInIRValue())->getZExtValue());
+    auto *PredTy = VectorType::get(Builder.getInt1Ty(), EC);
     return Builder.CreateIntrinsic(Intrinsic::get_active_lane_mask,
                                    {PredTy, ScalarTC->getType()},
                                    {VIVElem0, ScalarTC}, nullptr, Name);
@@ -627,7 +625,7 @@ Value *VPInstruction::generate(VPTransformState &State) {
     Value *Step = createStepForVF(Builder, ScalarTC->getType(), State.VF, UF);
     Value *Sub = Builder.CreateSub(ScalarTC, Step);
     Value *Cmp = Builder.CreateICmp(CmpInst::Predicate::ICMP_UGT, ScalarTC, Step);
-    Value *Zero = ConstantInt::get(ScalarTC->getType(), 0);
+    Value *Zero = ConstantInt::getNullValue(ScalarTC->getType());
     return Builder.CreateSelect(Cmp, Sub, Zero);
   }
   case VPInstruction::ExplicitVectorLength: {
@@ -639,11 +637,11 @@ Value *VPInstruction::generate(VPTransformState &State) {
            "Requested vector length should be an integer.");
 
     assert(State.VF.isScalable() && "Expected scalable vector factor.");
-    Value *VFArg = State.Builder.getInt32(State.VF.getKnownMinValue());
+    Value *VFArg = Builder.getInt32(State.VF.getKnownMinValue());
 
-    Value *EVL = State.Builder.CreateIntrinsic(
-        State.Builder.getInt32Ty(), Intrinsic::experimental_get_vector_length,
-        {AVL, VFArg, State.Builder.getTrue()});
+    Value *EVL = Builder.CreateIntrinsic(
+        Builder.getInt32Ty(), Intrinsic::experimental_get_vector_length,
+        {AVL, VFArg, Builder.getTrue()});
     return EVL;
   }
   case VPInstruction::CanonicalIVIncrementForPart: {
@@ -697,8 +695,8 @@ Value *VPInstruction::generate(VPTransformState &State) {
     auto NumOfElements = ElementCount::getFixed(getNumOperands());
     Value *Res = PoisonValue::get(toVectorizedTy(ScalarTy, NumOfElements));
     for (const auto &[Idx, Op] : enumerate(operands()))
-      Res = State.Builder.CreateInsertElement(Res, State.get(Op, true),
-                                              State.Builder.getInt32(Idx));
+      Res = Builder.CreateInsertElement(Res, State.get(Op, true),
+                                        Builder.getInt32(Idx));
     return Res;
   }
   case VPInstruction::ReductionStartVector: {
@@ -711,9 +709,8 @@ Value *VPInstruction::generate(VPTransformState &State) {
     ElementCount VF = State.VF.divideCoefficientBy(
         cast<ConstantInt>(getOperand(2)->getLiveInIRValue())->getZExtValue());
     auto *Iden = Builder.CreateVectorSplat(VF, State.get(getOperand(1), true));
-    Constant *Zero = Builder.getInt32(0);
     return Builder.CreateInsertElement(Iden, State.get(getOperand(0), true),
-                                       Zero);
+                                       Builder.getInt32(0));
   }
   case VPInstruction::ComputeAnyOfResult: {
     // FIXME: The cross-recipe dependency on VPReductionPHIRecipe is temporary
@@ -790,14 +787,12 @@ Value *VPInstruction::generate(VPTransformState &State) {
         if (RecurrenceDescriptor::isMinMaxRecurrenceKind(RK))
           ReducedPartRdx = createMinMaxOp(Builder, RK, ReducedPartRdx, RdxPart);
         else {
-          Instruction::BinaryOps Opcode;
           // For sub-recurrences, each UF's reduction variable is already
           // negative, we need to do: reduce.add(-acc_uf0 + -acc_uf1)
-          if (RK == RecurKind::Sub)
-            Opcode = Instruction::Add;
-          else
-            Opcode =
-                (Instruction::BinaryOps)RecurrenceDescriptor::getOpcode(RK);
+          Instruction::BinaryOps Opcode =
+              RK == RecurKind::Sub
+                  ? Instruction::Add
+                  : (Instruction::BinaryOps)RecurrenceDescriptor::getOpcode(RK);
           ReducedPartRdx =
               Builder.CreateBinOp(Opcode, RdxPart, ReducedPartRdx, "bin.rdx");
         }
@@ -862,7 +857,7 @@ Value *VPInstruction::generate(VPTransformState &State) {
     Value *LaneToExtract = State.get(getOperand(0), true);
     Type *IdxTy = State.TypeAnalysis.inferScalarType(getOperand(0));
     Value *Res = nullptr;
-    Value *RuntimeVF = getRuntimeVF(State.Builder, IdxTy, State.VF);
+    Value *RuntimeVF = getRuntimeVF(Builder, IdxTy, State.VF);
 
     for (unsigned Idx = 1; Idx != getNumOperands(); ++Idx) {
       Value *VectorStart =
@@ -892,8 +887,7 @@ Value *VPInstruction::generate(VPTransformState &State) {
     // If there are multiple operands, create a chain of selects to pick the
     // first operand with an active lane and add the number of lanes of the
     // preceding operands.
-    Value *RuntimeVF =
-        getRuntimeVF(State.Builder, State.Builder.getInt64Ty(), State.VF);
+    Value *RuntimeVF = getRuntimeVF(Builder, Builder.getInt64Ty(), State.VF);
     unsigned LastOpIdx = getNumOperands() - 1;
     Value *Res = nullptr;
     for (int Idx = LastOpIdx; Idx >= 0; --Idx) {
@@ -1013,9 +1007,8 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
 
   switch (getOpcode()) {
   case Instruction::Select: {
-    // TODO: It may be possible to improve this by analyzing where the
-    // condition operand comes from.
-    CmpInst::Predicate Pred = CmpInst::BAD_ICMP_PREDICATE;
+    llvm::CmpPredicate Pred = CmpInst::BAD_ICMP_PREDICATE;
+    match(getOperand(0), m_Cmp(Pred, m_VPValue(), m_VPValue()));
     auto *CondTy = Ctx.Types.inferScalarType(getOperand(0));
     auto *VecTy = Ctx.Types.inferScalarType(getOperand(1));
     if (!vputils::onlyFirstLaneUsed(this)) {
@@ -2367,15 +2360,6 @@ void VPScalarIVStepsRecipe::execute(VPTransformState &State) {
   // Compute the scalar steps and save the results in State.
   Type *IntStepTy =
       IntegerType::get(BaseIVTy->getContext(), BaseIVTy->getScalarSizeInBits());
-  Type *VecIVTy = nullptr;
-  Value *UnitStepVec = nullptr, *SplatStep = nullptr, *SplatIV = nullptr;
-  if (!FirstLaneOnly && State.VF.isScalable()) {
-    VecIVTy = VectorType::get(BaseIVTy, State.VF);
-    UnitStepVec =
-        Builder.CreateStepVector(VectorType::get(IntStepTy, State.VF));
-    SplatStep = Builder.CreateVectorSplat(State.VF, Step);
-    SplatIV = Builder.CreateVectorSplat(State.VF, BaseIV);
-  }
 
   unsigned StartLane = 0;
   unsigned EndLane = FirstLaneOnly ? 1 : State.VF.getKnownMinValue();
@@ -2396,19 +2380,6 @@ void VPScalarIVStepsRecipe::execute(VPTransformState &State) {
     StartIdx0 = Builder.CreateSExtOrTrunc(StartIdx0, IntStepTy);
   }
 
-  if (!FirstLaneOnly && State.VF.isScalable()) {
-    auto *SplatStartIdx = Builder.CreateVectorSplat(State.VF, StartIdx0);
-    auto *InitVec = Builder.CreateAdd(SplatStartIdx, UnitStepVec);
-    if (BaseIVTy->isFloatingPointTy())
-      InitVec = Builder.CreateSIToFP(InitVec, VecIVTy);
-    auto *Mul = Builder.CreateBinOp(MulOp, InitVec, SplatStep);
-    auto *Add = Builder.CreateBinOp(AddOp, SplatIV, Mul);
-    State.set(this, Add);
-    // It's useful to record the lane values too for the known minimum number
-    // of elements so we do those below. This improves the code quality when
-    // trying to extract the first element, for example.
-  }
-
   if (BaseIVTy->isFloatingPointTy())
     StartIdx0 = Builder.CreateSIToFP(StartIdx0, BaseIVTy);
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index b12f8ccc73c7e..d2f9263e32213 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -139,35 +139,51 @@ bool VPlanTransforms::tryToConvertVPInstructionsToVPRecipes(
   return true;
 }
 
-// Check if a load can be hoisted by verifying it doesn't alias with any stores
-// in blocks between FirstBB and LastBB using scoped noalias metadata.
-static bool canHoistLoadWithNoAliasCheck(VPReplicateRecipe *Load,
-                                         VPBasicBlock *FirstBB,
-                                         VPBasicBlock *LastBB) {
-  // Get the load's memory location and check if it aliases with any stores
-  // using scoped noalias metadata.
-  auto LoadLoc = vputils::getMemoryLocation(*Load);
-  if (!LoadLoc || !LoadLoc->AATags.Scope)
+// Check if a memory operation doesn't alias with memory operations in blocks
+// between FirstBB and LastBB using scoped noalias metadata.
+// For load hoisting, we only check writes in one direction.
+// For store sinking, we check both reads and writes bidirectionally.
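+// For example, a store cannot be sunk past a load that may read the stored
+// location, whereas hoisting a load only needs to account for intervening
+// writes.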
+static bool canHoistOrSinkWithNoAliasCheck(
+    const MemoryLocation &MemLoc, VPBasicBlock *FirstBB, VPBasicBlock *LastBB,
+    bool CheckReads,
+    const SmallPtrSetImpl<VPRecipeBase *> *ExcludeRecipes = nullptr) {
+  if (!MemLoc.AATags.Scope)
     return false;
 
-  const AAMDNodes &LoadAA = LoadLoc->AATags;
+  const AAMDNodes &MemAA = MemLoc.AATags;
+
   for (VPBlockBase *Block = FirstBB; Block;
        Block = Block->getSingleSuccessor()) {
-    // This function assumes a simple linear chain of blocks. If there are
-    // multiple successors, we would need more complex analysis.
     assert(Block->getNumSuccessors() <= 1 &&
            "Expected at most one successor in block chain");
     auto *VPBB = cast<VPBasicBlock>(Block);
     for (VPRecipeBase &R : *VPBB) {
-      if (R.mayWriteToMemory()) {
-        auto Loc = vputils::getMemoryLocation(R);
-        // Bail out if we can't get the location or if the scoped noalias
-        // metadata indicates potential aliasing.
-        if (!Loc || ScopedNoAliasAAResult::mayAliasInScopes(
-                        LoadAA.Scope, Loc->AATags.NoAlias))
-          return false;
-      }
+      if (ExcludeRecipes && ExcludeRecipes->contains(&R))
+        continue;
+
+      // Skip recipes that don't need checking.
+      if (!R.mayWriteToMemory() && !(CheckReads && R.mayReadFromMemory()))
+        continue;
+
+      auto Loc = vputils::getMemoryLocation(R);
+      if (!Loc)
+        // Conservatively assume aliasing for memory operations without
+        // location.
+        return false;
+
+      // For reads, skip them when the scoped metadata proves they cannot
+      // alias in the reverse direction.
+      if (CheckReads && R.mayReadFromMemory() &&
+          !ScopedNoAliasAAResult::mayAliasInScopes(Loc->AATags.Scope,
+                                                   MemAA.NoAlias))
+        continue;
+
+      // Check if the memory operations may alias in the forward direction.
+      if (ScopedNoAliasAAResult::mayAliasInScopes(MemAA.Scope,
+                                                  Loc->AATags.NoAlias))
+        return false;
     }
+
     if (Block == LastBB)
       break;
   }
@@ -741,35 +757,11 @@ static void legalizeAndOptimizeInductions(VPlan &Plan) {
     if (!PhiR)
       continue;
 
-    // Try to narrow wide and replicating recipes to uniform recipes, based on
-    // VPlan analysis.
-    // TODO: Apply to all recipes in the future, to replace legacy uniformity
-    // analysis.
-    auto Users = collectUsersRecursively(PhiR);
-    for (VPUser *U : reverse(Users)) {
-      auto *Def = dyn_cast<VPRecipeWithIRFlags>(U);
-      auto *RepR = dyn_cast<VPReplicateRecipe>(U);
-      // Skip recipes that shouldn't be narrowed.
-      if (!Def || !isa<VPReplicateRecipe, VPWidenRecipe>(Def) ||
-          Def->getNumUsers() == 0 || !Def->getUnderlyingValue() ||
-          (RepR && (RepR->isSingleScalar() || RepR->isPredicated())))
-        continue;
-
-      // Skip recipes that may have other lanes than their first used.
-      if (!vputils::isSingleScalar(Def) && !vputils::onlyFirstLaneUsed(Def))
-        continue;
-
-      auto *Clone = new VPReplicateRecipe(Def->getUnderlyingInstr(),
-                                          Def->operands(), /*IsUniform*/ true,
-                                          /*Mask*/ nullptr, /*Flags*/ *Def);
-      Clone->insertAfter(Def);
-      Def->replaceAllUsesWith(Clone);
-    }
-
     // Replace wide pointer inductions which have only their scalars used by
     // PtrAdd(IndStart, ScalarIVSteps (0, Step)).
     if (auto *PtrIV = dyn_cast<VPWidenPointerInductionRecipe>(&Phi)) {
-      if (!PtrIV->onlyScalarsGenerated(Plan.hasScalableVF()))
+      if (!Plan.hasScalarVFOnly() &&
+          !PtrIV->onlyScalarsGenerated(Plan.hasScalableVF()))
         continue;
 
       VPValue *PtrAdd = scalarizeVPWidenPointerInduction(PtrIV, Plan, Builder);
@@ -793,12 +785,19 @@ static void legalizeAndOptimizeInductions(VPlan &Plan) {
         WideIV->getDebugLoc(), Builder);
 
     // Update scalar users of IV to use Step instead.
-    if (!HasOnlyVectorVFs)
+    if (!HasOnlyVectorVFs) {
+      assert(!Plan.hasScalableVF() &&
+             "plans containing a scalar VF cannot also include scalable VFs");
       WideIV->replaceAllUsesWith(Steps);
-    else
-      WideIV->replaceUsesWithIf(Steps, [WideIV](VPUser &U, unsigned) {
-        return U.usesScalars(WideIV);
-      });
+    } else {
+      bool HasScalableVF = Plan.hasScalableVF();
+      WideIV->replaceUsesWithIf(Steps,
+                                [WideIV, HasScalableVF](VPUser &U, unsigned) {
+                                  if (HasScalableVF)
+                                    return U.usesFirstLaneOnly(WideIV);
+                                  return U.usesScalars(WideIV);
+                                });
+    }
   }
 }
 
@@ -1522,8 +1521,11 @@ static void narrowToSingleScalarRecipes(VPlan &Plan) {
         continue;
       }
 
-      // Skip recipes that aren't single scalars.
-      if (!vputils::isSingleScalar(RepOrWidenR))
+      // Skip recipes that aren't single scalars and don't just have their first
+      // lane used.
+      if (!vputils::isSingleScalar(RepOrWidenR) &&
+          (!vputils::onlyFirstLaneUsed(RepOrWidenR) ||
+           RepOrWidenR->getNumUsers() == 0))
         continue;
 
       // Skip recipes for which conversion to single-scalar does introduce
@@ -4127,119 +4129,217 @@ void VPlanTransforms::hoistInvariantLoads(VPlan &Plan) {
   }
 }
 
-// Returns the intersection of metadata from a group of loads.
-static VPIRMetadata getCommonLoadMetadata(ArrayRef<VPReplicateRecipe *> Loads) {
-  VPIRMetadata CommonMetadata = *Loads.front();
-  for (VPReplicateRecipe *Load : drop_begin(Loads))
-    CommonMetadata.intersect(*Load);
+// Collect common metadata from a group of replicate recipes by intersecting
+// metadata from all recipes in the group.
+static VPIRMetadata getCommonMetadata(ArrayRef<VPReplicateRecipe *> Recipes) {
+  VPIRMetadata CommonMetadata = *Recipes.front();
+  for (VPReplicateRecipe *Recipe : drop_begin(Recipes))
+    CommonMetadata.intersect(*Recipe);
   return CommonMetadata;
 }
 
-void VPlanTransforms::hoistPredicatedLoads(VPlan &Plan, ScalarEvolution &SE,
-                                           const Loop *L) {
+template <unsigned Opcode>
+static SmallVector<SmallVector<VPReplicateRecipe *, 4>>
+collectComplementaryPredicatedMemOps(VPlan &Plan, ScalarEvolution &SE,
+                                     const Loop *L) {
+  static_assert(Opcode == Instruction::Load || Opcode == Instruction::Store,
+                "Only Load and Store opcodes supported");
+  constexpr bool IsLoad = (Opcode == Instruction::Load);
   VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion();
   VPTypeAnalysis TypeInfo(Plan);
-  VPDominatorTree VPDT(Plan);
 
-  // Group predicated loads by their address SCEV.
-  DenseMap<const SCEV *, SmallVector<VPReplicateRecipe *>> LoadsByAddress;
+  // Group predicated operations by their address SCEV.
+  DenseMap<const SCEV *, SmallVector<VPReplicateRecipe *>> RecipesByAddress;
   for (VPBlockBase *Block : vp_depth_first_shallow(LoopRegion->getEntry())) {
     auto *VPBB = cast<VPBasicBlock>(Block);
     for (VPRecipeBase &R : *VPBB) {
       auto *RepR = dyn_cast<VPReplicateRecipe>(&R);
-      if (!RepR || RepR->getOpcode() != Instruction::Load ||
-          !RepR->isPredicated())
+      if (!RepR || RepR->getOpcode() != Opcode || !RepR->isPredicated())
         continue;
 
-      VPValue *Addr = RepR->getOperand(0);
+      // For loads, operand 0 is address; for stores, operand 1 is address.
+      VPValue *Addr = RepR->getOperand(IsLoad ? 0 : 1);
       const SCEV *AddrSCEV = vputils::getSCEVExprForVPValue(Addr, SE, L);
       if (!isa<SCEVCouldNotCompute>(AddrSCEV))
-        LoadsByAddress[AddrSCEV].push_back(RepR);
+        RecipesByAddress[AddrSCEV].push_back(RepR);
     }
   }
 
-  // For each address, collect loads with complementary masks, sort by
-  // dominance, and use the earliest load.
-  for (auto &[Addr, Loads] : LoadsByAddress) {
-    if (Loads.size() < 2)
+  // For each address, collect operations with the same or complementary masks.
+  SmallVector<SmallVector<VPReplicateRecipe *, 4>> AllGroups;
+  auto GetLoadStoreValueType = [&](VPReplicateRecipe *Recipe) {
+    return TypeInfo.inferScalarType(IsLoad ? Recipe : Recipe->getOperand(0));
+  };
+  for (auto &[Addr, Recipes] : RecipesByAddress) {
+    if (Recipes.size() < 2)
       continue;
 
-    // Collect groups of loads with complementary masks.
-    SmallVector<SmallVector<VPReplicateRecipe *, 4>> LoadGroups;
-    for (VPReplicateRecipe *&LoadI : Loads) {
-      if (!LoadI)
+    // Collect groups with the same or complementary masks.
+    for (VPReplicateRecipe *&RecipeI : Recipes) {
+      if (!RecipeI)
         continue;
 
-      VPValue *MaskI = LoadI->getMask();
-      Type *TypeI = TypeInfo.inferScalarType(LoadI);
+      VPValue *MaskI = RecipeI->getMask();
+      Type *TypeI = GetLoadStoreValueType(RecipeI);
       SmallVector<VPReplicateRecipe *, 4> Group;
-      Group.push_back(LoadI);
-      LoadI = nullptr;
+      Group.push_back(RecipeI);
+      RecipeI = nullptr;
 
-      // Find all loads with the same type.
-      for (VPReplicateRecipe *&LoadJ : Loads) {
-        if (!LoadJ)
+      // Find all operations with the same or complementary masks.
+      bool HasComplementaryMask = false;
+      for (VPReplicateRecipe *&RecipeJ : Recipes) {
+        if (!RecipeJ)
           continue;
 
-        Type *TypeJ = TypeInfo.inferScalarType(LoadJ);
+        VPValue *MaskJ = RecipeJ->getMask();
+        Type *TypeJ = GetLoadStoreValueType(RecipeJ);
         if (TypeI == TypeJ) {
-          Group.push_back(LoadJ);
-          LoadJ = nullptr;
+          // Check if any operation in the group has a complementary mask with
+          // another, that is M1 == NOT(M2) or M2 == NOT(M1).
+          HasComplementaryMask |= match(MaskI, m_Not(m_Specific(MaskJ))) ||
+                                  match(MaskJ, m_Not(m_Specific(MaskI)));
+          Group.push_back(RecipeJ);
+          RecipeJ = nullptr;
         }
       }
 
-      // Check if any load in the group has a complementary mask with another,
-      // that is M1 == NOT(M2) or M2 == NOT(M1).
-      bool HasComplementaryMask =
-          any_of(drop_begin(Group), [MaskI](VPReplicateRecipe *Load) {
-            VPValue *MaskJ = Load->getMask();
-            return match(MaskI, m_Not(m_Specific(MaskJ))) ||
-                   match(MaskJ, m_Not(m_Specific(MaskI)));
-          });
+      if (HasComplementaryMask) {
+        assert(Group.size() >= 2 && "must have at least 2 entries");
+        AllGroups.push_back(std::move(Group));
+      }
+    }
+  }
+
+  return AllGroups;
+}
 
-      if (HasComplementaryMask)
-        LoadGroups.push_back(std::move(Group));
+// Find the recipe with minimum alignment in the group.
+template <typename InstType>
+static VPReplicateRecipe *
+findRecipeWithMinAlign(ArrayRef<VPReplicateRecipe *> Group) {
+  return *min_element(Group, [](VPReplicateRecipe *A, VPReplicateRecipe *B) {
+    return cast<InstType>(A->getUnderlyingInstr())->getAlign() <
+           cast<InstType>(B->getUnderlyingInstr())->getAlign();
+  });
+}
+
+void VPlanTransforms::hoistPredicatedLoads(VPlan &Plan, ScalarEvolution &SE,
+                                           const Loop *L) {
+  auto Groups =
+      collectComplementaryPredicatedMemOps<Instruction::Load>(Plan, SE, L);
+  if (Groups.empty())
+    return;
+
+  VPDominatorTree VPDT(Plan);
+
+  // Process each group of loads.
+  for (auto &Group : Groups) {
+    // Sort loads by dominance order, with earliest (most dominating) first.
+    sort(Group, [&VPDT](VPReplicateRecipe *A, VPReplicateRecipe *B) {
+      return VPDT.properlyDominates(A, B);
+    });
+
+    // Try to use the earliest (most dominating) load to replace all others.
+    VPReplicateRecipe *EarliestLoad = Group[0];
+    VPBasicBlock *FirstBB = EarliestLoad->getParent();
+    VPBasicBlock *LastBB = Group.back()->getParent();
+
+    // Check that the load doesn't alias with stores between first and last.
+    auto LoadLoc = vputils::getMemoryLocation(*EarliestLoad);
+    if (!LoadLoc || !canHoistOrSinkWithNoAliasCheck(*LoadLoc, FirstBB, LastBB,
+                                                    /*CheckReads=*/false))
+      continue;
+
+    // Collect common metadata from all loads in the group.
+    VPIRMetadata CommonMetadata = getCommonMetadata(Group);
+
+    // Find the load with minimum alignment to use.
+    auto *LoadWithMinAlign = findRecipeWithMinAlign<LoadInst>(Group);
+
+    // Create an unpredicated version of the earliest load with common
+    // metadata.
+    auto *UnpredicatedLoad = new VPReplicateRecipe(
+        LoadWithMinAlign->getUnderlyingInstr(), {EarliestLoad->getOperand(0)},
+        /*IsSingleScalar=*/false, /*Mask=*/nullptr, *EarliestLoad,
+        CommonMetadata);
+
+    UnpredicatedLoad->insertBefore(EarliestLoad);
+
+    // Replace all loads in the group with the unpredicated load.
+    for (VPReplicateRecipe *Load : Group) {
+      Load->replaceAllUsesWith(UnpredicatedLoad);
+      Load->eraseFromParent();
     }
+  }
+}
 
-    // For each group, check memory dependencies and hoist the earliest load.
-    for (auto &Group : LoadGroups) {
-      // Sort loads by dominance order, with earliest (most dominating) first.
-      sort(Group, [&VPDT](VPReplicateRecipe *A, VPReplicateRecipe *B) {
-        return VPDT.properlyDominates(A, B);
-      });
+static bool
+canSinkStoreWithNoAliasCheck(ArrayRef<VPReplicateRecipe *> StoresToSink) {
+  auto StoreLoc = vputils::getMemoryLocation(*StoresToSink.front());
+  if (!StoreLoc || !StoreLoc->AATags.Scope)
+    return false;
 
-      VPReplicateRecipe *EarliestLoad = Group.front();
-      VPBasicBlock *FirstBB = EarliestLoad->getParent();
-      VPBasicBlock *LastBB = Group.back()->getParent();
+  // When sinking a group of stores, all members of the group alias each other.
+  // Skip them during the alias checks.
+  SmallPtrSet<VPRecipeBase *, 4> StoresToSinkSet(StoresToSink.begin(),
+                                                 StoresToSink.end());
 
-      // Check that the load doesn't alias with stores between first and last.
-      if (!canHoistLoadWithNoAliasCheck(EarliestLoad, FirstBB, LastBB))
-        continue;
+  VPBasicBlock *FirstBB = StoresToSink.front()->getParent();
+  VPBasicBlock *LastBB = StoresToSink.back()->getParent();
+  return canHoistOrSinkWithNoAliasCheck(*StoreLoc, FirstBB, LastBB,
+                                        /*CheckReads=*/true, &StoresToSinkSet);
+}
 
-      // Find the load with minimum alignment to use.
-      auto *LoadWithMinAlign =
-          *min_element(Group, [](VPReplicateRecipe *A, VPReplicateRecipe *B) {
-            return cast<LoadInst>(A->getUnderlyingInstr())->getAlign() <
-                   cast<LoadInst>(B->getUnderlyingInstr())->getAlign();
-          });
+void VPlanTransforms::sinkPredicatedStores(VPlan &Plan, ScalarEvolution &SE,
+                                           const Loop *L) {
+  auto Groups =
+      collectComplementaryPredicatedMemOps<Instruction::Store>(Plan, SE, L);
+  if (Groups.empty())
+    return;
 
-      // Collect common metadata from all loads in the group.
-      VPIRMetadata CommonMetadata = getCommonLoadMetadata(Group);
-
-      // Create an unpredicated load with minimum alignment using the earliest
-      // dominating address and common metadata.
-      auto *UnpredicatedLoad = new VPReplicateRecipe(
-          LoadWithMinAlign->getUnderlyingInstr(), EarliestLoad->getOperand(0),
-          /*IsSingleScalar=*/false, /*Mask=*/nullptr, /*Flags=*/{},
-          CommonMetadata);
-      UnpredicatedLoad->insertBefore(EarliestLoad);
-
-      // Replace all loads in the group with the unpredicated load.
-      for (VPReplicateRecipe *Load : Group) {
-        Load->replaceAllUsesWith(UnpredicatedLoad);
-        Load->eraseFromParent();
-      }
+  VPDominatorTree VPDT(Plan);
+
+  for (auto &Group : Groups) {
+    sort(Group, [&VPDT](VPReplicateRecipe *A, VPReplicateRecipe *B) {
+      return VPDT.properlyDominates(A, B);
+    });
+
+    if (!canSinkStoreWithNoAliasCheck(Group))
+      continue;
+
+    // Use the last (most dominated) store's location for the unconditional
+    // store.
+    VPReplicateRecipe *LastStore = Group.back();
+    VPBasicBlock *InsertBB = LastStore->getParent();
+
+    // Collect common alias metadata from all stores in the group.
+    VPIRMetadata CommonMetadata = getCommonMetadata(Group);
+
+    // Build select chain for stored values.
+    VPValue *SelectedValue = Group[0]->getOperand(0);
+    VPBuilder Builder(InsertBB, LastStore->getIterator());
+
+    for (unsigned I = 1; I < Group.size(); ++I) {
+      VPValue *Mask = Group[I]->getMask();
+      VPValue *Value = Group[I]->getOperand(0);
+      SelectedValue = Builder.createSelect(Mask, Value, SelectedValue,
+                                           Group[I]->getDebugLoc());
     }
+
+    // Find the store with minimum alignment to use.
+    auto *StoreWithMinAlign = findRecipeWithMinAlign<StoreInst>(Group);
+
+    // Create unconditional store with selected value and common metadata.
+    auto *UnpredicatedStore =
+        new VPReplicateRecipe(StoreWithMinAlign->getUnderlyingInstr(),
+                              {SelectedValue, LastStore->getOperand(1)},
+                              /*IsSingleScalar=*/false,
+                              /*Mask=*/nullptr, *LastStore, CommonMetadata);
+    UnpredicatedStore->insertBefore(*InsertBB, LastStore->getIterator());
+
+    // Remove all predicated stores from the group.
+    for (VPReplicateRecipe *Store : Group)
+      Store->eraseFromParent();
   }
 }
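As a small aside on the transform added above: the select chain built by sinkPredicatedStores rests on the fact that two predicated stores to the same address under complementary masks (P and NOT P) together store on every path, so they can be folded into one unconditional store of a selected value. The snippet below is a minimal standalone scalar sketch of that equivalence, not part of the patch; the function and variable names (predicatedForm, sunkForm, Addr, V0, V1) are illustrative only.

#include <cassert>

// Predicated form: two stores to the same address guarded by
// complementary masks P and NOT P.
static void predicatedForm(bool P, int V0, int V1, int *Addr) {
  if (!P)
    *Addr = V0; // mask = NOT P
  if (P)
    *Addr = V1; // mask = P
}

// Sunk form: a single unconditional store of a selected value,
// mirroring the select chain built over the group's stored values.
static void sunkForm(bool P, int V0, int V1, int *Addr) {
  *Addr = P ? V1 : V0;
}

int main() {
  for (int Pi = 0; Pi < 2; ++Pi) {
    bool P = (Pi != 0);
    int A = 0, B = 0;
    predicatedForm(P, 7, 42, &A);
    sunkForm(P, 7, 42, &B);
    assert(A == B); // both forms leave the same value at the address
  }
  return 0;
}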
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 6245a5107a5d0..afdf1655b4622 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -145,6 +145,11 @@ struct VPlanTransforms {
           GetIntOrFpInductionDescriptor,
       const TargetLibraryInfo &TLI);
 
+  /// Try to legalize reductions with multiple in-loop uses. Currently only
+  /// min/max reductions used by FindLastIV reductions are supported; return
+  /// false otherwise.
+  static bool handleMultiUseReductions(VPlan &Plan);
+
   /// Try to have all users of fixed-order recurrences appear after the recipe
   /// defining their previous value, by either sinking users or hoisting recipes
   /// defining their previous value (and its operands). Then introduce
@@ -320,6 +325,13 @@ struct VPlanTransforms {
   static void hoistPredicatedLoads(VPlan &Plan, ScalarEvolution &SE,
                                    const Loop *L);
 
+  /// Sink predicated stores to the same address with complementary predicates
+  /// (P and NOT P) into a single unconditional store that selects among the
+  /// stored values. This eliminates branching overhead when every path
+  /// stores to the same location.
+  static void sinkPredicatedStores(VPlan &Plan, ScalarEvolution &SE,
+                                   const Loop *L);
+
   // Materialize vector trip counts for constants early if it can simply be
   // computed as (Original TC / VF * UF) * VF * UF.
   static void
diff --git a/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp b/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp
index c7a0fd7407a4e..d36975699c4a8 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanUtils.cpp
@@ -197,7 +197,8 @@ bool vputils::isSingleScalar(const VPValue *VPV) {
             all_of(VPI->operands(), isSingleScalar));
   if (auto *RR = dyn_cast<VPReductionRecipe>(VPV))
     return !RR->isPartialReduction();
-  if (isa<VPVectorPointerRecipe, VPVectorEndPointerRecipe>(VPV))
+  if (isa<VPCanonicalIVPHIRecipe, VPVectorPointerRecipe,
+          VPVectorEndPointerRecipe>(VPV))
     return true;
   if (auto *Expr = dyn_cast<VPExpressionRecipe>(VPV))
     return Expr->isSingleScalar();
diff --git a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
index f1890e4f5fb95..e34a70c54ee4a 100644
--- a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
+++ b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
@@ -2924,8 +2924,9 @@ bool VectorCombine::foldShuffleOfIntrinsics(Instruction &I) {
       auto *ArgTy = FixedVectorType::get(VecTy->getElementType(),
                                          ShuffleDstTy->getNumElements());
       NewArgsTy.push_back(ArgTy);
-      NewCost += TTI.getShuffleCost(TargetTransformInfo::SK_PermuteTwoSrc,
-                                    ArgTy, VecTy, OldMask, CostKind);
+      NewCost += TTI.getShuffleCost(
+          TargetTransformInfo::SK_PermuteTwoSrc, ArgTy, VecTy, OldMask,
+          CostKind, 0, nullptr, {II0->getArgOperand(I), II1->getArgOperand(I)});
     }
   }
   IntrinsicCostAttributes NewAttr(IID, ShuffleDstTy, NewArgsTy);
diff --git a/llvm/test/Analysis/CostModel/RISCV/cmp-select.ll b/llvm/test/Analysis/CostModel/RISCV/cmp-select.ll
index dc0810b128698..58848cc4a97ef 100644
--- a/llvm/test/Analysis/CostModel/RISCV/cmp-select.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/cmp-select.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
-; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+f,+short-forward-branch-opt -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s --check-prefixes=SFB64
+; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+f,+short-forward-branch-ialu -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s --check-prefixes=SFB64
 ; RUN: opt < %s -mtriple=riscv64 -mattr=+v,+f -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output | FileCheck %s --check-prefixes=RV64
 
 define i32 @icmp-iselect(i64 %ca, i64 %cb, i32 %a, i32 %b) {
diff --git a/llvm/test/Analysis/CostModel/RISCV/vp-intrinsics.ll b/llvm/test/Analysis/CostModel/RISCV/vp-intrinsics.ll
index 71746caf35f2e..ba792d8f0955b 100644
--- a/llvm/test/Analysis/CostModel/RISCV/vp-intrinsics.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/vp-intrinsics.ll
@@ -836,74 +836,74 @@ define void @abs() {
   ret void
 }
 
-define void @load() {
+define void @load(ptr %src) {
 ; CHECK-LABEL: 'load'
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t0 = call <2 x i8> @llvm.vp.load.v2i8.p0(ptr undef, <2 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t1 = load <2 x i8>, ptr undef, align 2
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t2 = call <4 x i8> @llvm.vp.load.v4i8.p0(ptr undef, <4 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t3 = load <4 x i8>, ptr undef, align 4
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t4 = call <8 x i8> @llvm.vp.load.v8i8.p0(ptr undef, <8 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t5 = load <8 x i8>, ptr undef, align 8
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t6 = call <16 x i8> @llvm.vp.load.v16i8.p0(ptr undef, <16 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t7 = load <16 x i8>, ptr undef, align 16
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t8 = call <2 x i64> @llvm.vp.load.v2i64.p0(ptr undef, <2 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t9 = load <2 x i64>, ptr undef, align 16
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %t10 = call <4 x i64> @llvm.vp.load.v4i64.p0(ptr undef, <4 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %t12 = load <4 x i64>, ptr undef, align 32
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %t13 = call <8 x i64> @llvm.vp.load.v8i64.p0(ptr undef, <8 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %t14 = load <8 x i64>, ptr undef, align 64
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %t15 = call <16 x i64> @llvm.vp.load.v16i64.p0(ptr undef, <16 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %t16 = load <16 x i64>, ptr undef, align 128
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t17 = call <vscale x 2 x i8> @llvm.vp.load.nxv2i8.p0(ptr undef, <vscale x 2 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t18 = load <vscale x 2 x i8>, ptr undef, align 2
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t19 = call <vscale x 4 x i8> @llvm.vp.load.nxv4i8.p0(ptr undef, <vscale x 4 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t20 = load <vscale x 4 x i8>, ptr undef, align 4
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t21 = call <vscale x 8 x i8> @llvm.vp.load.nxv8i8.p0(ptr undef, <vscale x 8 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t22 = load <vscale x 8 x i8>, ptr undef, align 8
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %t23 = call <vscale x 16 x i8> @llvm.vp.load.nxv16i8.p0(ptr undef, <vscale x 16 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %t24 = load <vscale x 16 x i8>, ptr undef, align 16
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %t25 = call <vscale x 2 x i64> @llvm.vp.load.nxv2i64.p0(ptr undef, <vscale x 2 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %t26 = load <vscale x 2 x i64>, ptr undef, align 16
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %t27 = call <vscale x 4 x i64> @llvm.vp.load.nxv4i64.p0(ptr undef, <vscale x 4 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %t28 = load <vscale x 4 x i64>, ptr undef, align 32
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %t29 = call <vscale x 8 x i64> @llvm.vp.load.nxv8i64.p0(ptr undef, <vscale x 8 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %t30 = load <vscale x 8 x i64>, ptr undef, align 64
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %t31 = call <vscale x 16 x i64> @llvm.vp.load.nxv16i64.p0(ptr undef, <vscale x 16 x i1> undef, i32 undef)
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %t32 = load <vscale x 16 x i64>, ptr undef, align 128
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t0 = call <2 x i8> @llvm.vp.load.v2i8.p0(ptr %src, <2 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t1 = call { <2 x i8>, i32 } @llvm.vp.load.ff.v2i8.p0(ptr align 1 %src, <2 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t2 = call <4 x i8> @llvm.vp.load.v4i8.p0(ptr %src, <4 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t3 = call { <4 x i8>, i32 } @llvm.vp.load.ff.v4i8.p0(ptr align 1 %src, <4 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t4 = call <8 x i8> @llvm.vp.load.v8i8.p0(ptr %src, <8 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t5 = call { <8 x i8>, i32 } @llvm.vp.load.ff.v8i8.p0(ptr align 1 %src, <8 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t6 = call <16 x i8> @llvm.vp.load.v16i8.p0(ptr %src, <16 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t7 = call { <16 x i8>, i32 } @llvm.vp.load.ff.v16i8.p0(ptr align 1 %src, <16 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t8 = call <2 x i64> @llvm.vp.load.v2i64.p0(ptr %src, <2 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t9 = call { <2 x i64>, i32 } @llvm.vp.load.ff.v2i64.p0(ptr align 8 %src, <2 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %t10 = call <4 x i64> @llvm.vp.load.v4i64.p0(ptr %src, <4 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %t11 = call { <4 x i64>, i32 } @llvm.vp.load.ff.v4i64.p0(ptr align 8 %src, <4 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %t12 = call <8 x i64> @llvm.vp.load.v8i64.p0(ptr %src, <8 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %t13 = call { <8 x i64>, i32 } @llvm.vp.load.ff.v8i64.p0(ptr align 8 %src, <8 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %t14 = call <16 x i64> @llvm.vp.load.v16i64.p0(ptr %src, <16 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %t15 = call { <16 x i64>, i32 } @llvm.vp.load.ff.v16i64.p0(ptr align 8 %src, <16 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t16 = call <vscale x 2 x i8> @llvm.vp.load.nxv2i8.p0(ptr %src, <vscale x 2 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t17 = call { <vscale x 2 x i8>, i32 } @llvm.vp.load.ff.nxv2i8.p0(ptr align 1 %src, <vscale x 2 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t18 = call <vscale x 4 x i8> @llvm.vp.load.nxv4i8.p0(ptr %src, <vscale x 4 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t19 = call { <vscale x 4 x i8>, i32 } @llvm.vp.load.ff.nxv4i8.p0(ptr align 1 %src, <vscale x 4 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t20 = call <vscale x 8 x i8> @llvm.vp.load.nxv8i8.p0(ptr %src, <vscale x 8 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %t21 = call { <vscale x 8 x i8>, i32 } @llvm.vp.load.ff.nxv8i8.p0(ptr align 1 %src, <vscale x 8 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %t22 = call <vscale x 16 x i8> @llvm.vp.load.nxv16i8.p0(ptr %src, <vscale x 16 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %t23 = call { <vscale x 16 x i8>, i32 } @llvm.vp.load.ff.nxv16i8.p0(ptr align 1 %src, <vscale x 16 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %t24 = call <vscale x 2 x i64> @llvm.vp.load.nxv2i64.p0(ptr %src, <vscale x 2 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %t25 = call { <vscale x 2 x i64>, i32 } @llvm.vp.load.ff.nxv2i64.p0(ptr align 8 %src, <vscale x 2 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %t26 = call <vscale x 4 x i64> @llvm.vp.load.nxv4i64.p0(ptr %src, <vscale x 4 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %t27 = call { <vscale x 4 x i64>, i32 } @llvm.vp.load.ff.nxv4i64.p0(ptr align 8 %src, <vscale x 4 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %t28 = call <vscale x 8 x i64> @llvm.vp.load.nxv8i64.p0(ptr %src, <vscale x 8 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %t29 = call { <vscale x 8 x i64>, i32 } @llvm.vp.load.ff.nxv8i64.p0(ptr align 8 %src, <vscale x 8 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %t30 = call <vscale x 16 x i64> @llvm.vp.load.nxv16i64.p0(ptr %src, <vscale x 16 x i1> undef, i32 undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 16 for instruction: %t31 = call { <vscale x 16 x i64>, i32 } @llvm.vp.load.ff.nxv16i64.p0(ptr align 8 %src, <vscale x 16 x i1> undef, i32 undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
-  %t0 = call <2 x i8> @llvm.vp.load.v2i8(ptr undef, <2 x i1> undef, i32 undef)
-  %t1 = load <2 x i8>, ptr undef
-  %t2 = call <4 x i8> @llvm.vp.load.v4i8(ptr undef, <4 x i1> undef, i32 undef)
-  %t3 = load <4 x i8>, ptr undef
-  %t4 = call <8 x i8> @llvm.vp.load.v8i8(ptr undef, <8 x i1> undef, i32 undef)
-  %t5 = load <8 x i8>, ptr undef
-  %t6 = call <16 x i8> @llvm.vp.load.v16i8(ptr undef, <16 x i1> undef, i32 undef)
-  %t7 = load <16 x i8>, ptr undef
-  %t8 = call <2 x i64> @llvm.vp.load.v2i64(ptr undef, <2 x i1> undef, i32 undef)
-  %t9 = load <2 x i64>, ptr undef
-  %t10 = call <4 x i64> @llvm.vp.load.v4i64(ptr undef, <4 x i1> undef, i32 undef)
-  %t12 = load <4 x i64>, ptr undef
-  %t13 = call <8 x i64> @llvm.vp.load.v8i64(ptr undef, <8 x i1> undef, i32 undef)
-  %t14 = load <8 x i64>, ptr undef
-  %t15 = call <16 x i64> @llvm.vp.load.v16i64(ptr undef, <16 x i1> undef, i32 undef)
-  %t16 = load <16 x i64>, ptr undef
-  %t17 = call <vscale x 2 x i8> @llvm.vp.load.nxv2i8(ptr undef, <vscale x 2 x i1> undef, i32 undef)
-  %t18 = load <vscale x 2 x i8>, ptr undef
-  %t19 = call <vscale x 4 x i8> @llvm.vp.load.nxv4i8(ptr undef, <vscale x 4 x i1> undef, i32 undef)
-  %t20 = load <vscale x 4 x i8>, ptr undef
-  %t21 = call <vscale x 8 x i8> @llvm.vp.load.nxv8i8(ptr undef, <vscale x 8 x i1> undef, i32 undef)
-  %t22 = load <vscale x 8 x i8>, ptr undef
-  %t23 = call <vscale x 16 x i8> @llvm.vp.load.nxv16i8(ptr undef, <vscale x 16 x i1> undef, i32 undef)
-  %t24 = load <vscale x 16 x i8>, ptr undef
-  %t25 = call <vscale x 2 x i64> @llvm.vp.load.nxv2i64(ptr undef, <vscale x 2 x i1> undef, i32 undef)
-  %t26 = load <vscale x 2 x i64>, ptr undef
-  %t27 = call <vscale x 4 x i64> @llvm.vp.load.nxv4i64(ptr undef, <vscale x 4 x i1> undef, i32 undef)
-  %t28 = load <vscale x 4 x i64>, ptr undef
-  %t29 = call <vscale x 8 x i64> @llvm.vp.load.nxv8i64(ptr undef, <vscale x 8 x i1> undef, i32 undef)
-  %t30 = load <vscale x 8 x i64>, ptr undef
-  %t31 = call <vscale x 16 x i64> @llvm.vp.load.nxv16i64(ptr undef, <vscale x 16 x i1> undef, i32 undef)
-  %t32 = load <vscale x 16 x i64>, ptr undef
+  %t0 = call <2 x i8> @llvm.vp.load.v2i8(ptr %src, <2 x i1> undef, i32 undef)
+  %t1 = call { <2 x i8>, i32 } @llvm.vp.load.ff.v2i8.p0(ptr align 1 %src, <2 x i1> undef, i32 undef)
+  %t2 = call <4 x i8> @llvm.vp.load.v4i8(ptr %src, <4 x i1> undef, i32 undef)
+  %t3 = call { <4 x i8>, i32 } @llvm.vp.load.ff.v4i8.p0(ptr align 1 %src, <4 x i1> undef, i32 undef)
+  %t4 = call <8 x i8> @llvm.vp.load.v8i8(ptr %src, <8 x i1> undef, i32 undef)
+  %t5 = call { <8 x i8>, i32 } @llvm.vp.load.ff.v8i8.p0(ptr align 1 %src, <8 x i1> undef, i32 undef)
+  %t6 = call <16 x i8> @llvm.vp.load.v16i8(ptr %src, <16 x i1> undef, i32 undef)
+  %t7 = call { <16 x i8>, i32 } @llvm.vp.load.ff.v16i8.p0(ptr align 1 %src, <16 x i1> undef, i32 undef)
+  %t8 = call <2 x i64> @llvm.vp.load.v2i64(ptr %src, <2 x i1> undef, i32 undef)
+  %t9 = call { <2 x i64>, i32 } @llvm.vp.load.ff.v2i64.p0(ptr align 8 %src, <2 x i1> undef, i32 undef)
+  %t10 = call <4 x i64> @llvm.vp.load.v4i64(ptr %src, <4 x i1> undef, i32 undef)
+  %t11 = call { <4 x i64>, i32 } @llvm.vp.load.ff.v4i64.p0(ptr align 8 %src, <4 x i1> undef, i32 undef)
+  %t12 = call <8 x i64> @llvm.vp.load.v8i64(ptr %src, <8 x i1> undef, i32 undef)
+  %t13 = call { <8 x i64>, i32 } @llvm.vp.load.ff.v8i64.p0(ptr align 8 %src, <8 x i1> undef, i32 undef)
+  %t14 = call <16 x i64> @llvm.vp.load.v16i64(ptr %src, <16 x i1> undef, i32 undef)
+  %t15 = call { <16 x i64>, i32 } @llvm.vp.load.ff.v16i64.p0(ptr align 8 %src, <16 x i1> undef, i32 undef)
+  %t16 = call <vscale x 2 x i8> @llvm.vp.load.nxv2i8(ptr %src, <vscale x 2 x i1> undef, i32 undef)
+  %t17 = call { <vscale x 2 x i8>, i32 } @llvm.vp.load.ff.nxv2i8.p0(ptr align 1 %src, <vscale x 2 x i1> undef, i32 undef)
+  %t18 = call <vscale x 4 x i8> @llvm.vp.load.nxv4i8(ptr %src, <vscale x 4 x i1> undef, i32 undef)
+  %t19 = call { <vscale x 4 x i8>, i32 } @llvm.vp.load.ff.nxv4i8.p0(ptr align 1 %src, <vscale x 4 x i1> undef, i32 undef)
+  %t20 = call <vscale x 8 x i8> @llvm.vp.load.nxv8i8(ptr %src, <vscale x 8 x i1> undef, i32 undef)
+  %t21 = call { <vscale x 8 x i8>, i32 } @llvm.vp.load.ff.nxv8i8.p0(ptr align 1 %src, <vscale x 8 x i1> undef, i32 undef)
+  %t22 = call <vscale x 16 x i8> @llvm.vp.load.nxv16i8(ptr %src, <vscale x 16 x i1> undef, i32 undef)
+  %t23 = call { <vscale x 16 x i8>, i32 } @llvm.vp.load.ff.nxv16i8.p0(ptr align 1 %src, <vscale x 16 x i1> undef, i32 undef)
+  %t24 = call <vscale x 2 x i64> @llvm.vp.load.nxv2i64(ptr %src, <vscale x 2 x i1> undef, i32 undef)
+  %t25 = call { <vscale x 2 x i64>, i32 } @llvm.vp.load.ff.nxv2i64.p0(ptr align 8 %src, <vscale x 2 x i1> undef, i32 undef)
+  %t26 = call <vscale x 4 x i64> @llvm.vp.load.nxv4i64(ptr %src, <vscale x 4 x i1> undef, i32 undef)
+  %t27 = call { <vscale x 4 x i64>, i32 } @llvm.vp.load.ff.nxv4i64.p0(ptr align 8 %src, <vscale x 4 x i1> undef, i32 undef)
+  %t28 = call <vscale x 8 x i64> @llvm.vp.load.nxv8i64(ptr %src, <vscale x 8 x i1> undef, i32 undef)
+  %t29 = call { <vscale x 8 x i64>, i32 } @llvm.vp.load.ff.nxv8i64.p0(ptr align 8 %src, <vscale x 8 x i1> undef, i32 undef)
+  %t30 = call <vscale x 16 x i64> @llvm.vp.load.nxv16i64(ptr %src, <vscale x 16 x i1> undef, i32 undef)
+  %t31 = call { <vscale x 16 x i64>, i32 } @llvm.vp.load.ff.nxv16i64.p0(ptr align 8 %src, <vscale x 16 x i1> undef, i32 undef)
   ret void
 }
 
diff --git a/llvm/test/Analysis/Delinearization/constant_functions_multi_dim.ll b/llvm/test/Analysis/Delinearization/constant_functions_multi_dim.ll
index 9e6a4221f8eda..7e5c5142dccbc 100644
--- a/llvm/test/Analysis/Delinearization/constant_functions_multi_dim.ll
+++ b/llvm/test/Analysis/Delinearization/constant_functions_multi_dim.ll
@@ -11,7 +11,7 @@ define void @mat_mul(ptr %C, ptr %A, ptr %B, i64 %N) !kernel_arg_addr_space !2 !
 ; CHECK-NEXT:  Base offset: %A
 ; CHECK-NEXT:  ArrayDecl[UnknownSize][%N] with elements of 4 bytes.
 ; CHECK-NEXT:  ArrayRef[%call][{0,+,1}<nuw><nsw><%for.inc>]
-; CHECK-NEXT:  Delinearization validation: Succeeded
+; CHECK-NEXT:  Delinearization validation: Failed
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  Inst: %tmp5 = load float, ptr %arrayidx4, align 4
 ; CHECK-NEXT:  AccessFunction: {(4 * %call1),+,(4 * %N)}<%for.inc>
diff --git a/llvm/test/Analysis/Delinearization/multidim_only_ivs_2d.ll b/llvm/test/Analysis/Delinearization/multidim_only_ivs_2d.ll
index e1ad1c55313a4..e5d2806101926 100644
--- a/llvm/test/Analysis/Delinearization/multidim_only_ivs_2d.ll
+++ b/llvm/test/Analysis/Delinearization/multidim_only_ivs_2d.ll
@@ -16,14 +16,14 @@ define void @foo(i64 %n, i64 %m, ptr %A) {
 ; CHECK-NEXT:  Base offset: %A
 ; CHECK-NEXT:  ArrayDecl[UnknownSize][%m] with elements of 8 bytes.
 ; CHECK-NEXT:  ArrayRef[{0,+,1}<nuw><nsw><%for.i>][{0,+,1}<nuw><nsw><%for.j>]
-; CHECK-NEXT:  Delinearization validation: Succeeded
+; CHECK-NEXT:  Delinearization validation: Failed
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  Inst: store double %val, ptr %arrayidx, align 8
 ; CHECK-NEXT:  AccessFunction: {{\{\{}}0,+,(8 * %m)}<%for.i>,+,8}<%for.j>
 ; CHECK-NEXT:  Base offset: %A
 ; CHECK-NEXT:  ArrayDecl[UnknownSize][%m] with elements of 8 bytes.
 ; CHECK-NEXT:  ArrayRef[{0,+,1}<nuw><nsw><%for.i>][{0,+,1}<nuw><nsw><%for.j>]
-; CHECK-NEXT:  Delinearization validation: Succeeded
+; CHECK-NEXT:  Delinearization validation: Failed
 ;
 entry:
   br label %for.i
diff --git a/llvm/test/Analysis/Delinearization/multidim_only_ivs_3d.ll b/llvm/test/Analysis/Delinearization/multidim_only_ivs_3d.ll
index d5213e5afb33c..f5f0628ede937 100644
--- a/llvm/test/Analysis/Delinearization/multidim_only_ivs_3d.ll
+++ b/llvm/test/Analysis/Delinearization/multidim_only_ivs_3d.ll
@@ -16,7 +16,7 @@ define void @foo(i64 %n, i64 %m, i64 %o, ptr %A) {
 ; CHECK-NEXT:  Base offset: %A
 ; CHECK-NEXT:  ArrayDecl[UnknownSize][%m][%o] with elements of 8 bytes.
 ; CHECK-NEXT:  ArrayRef[{0,+,1}<nuw><nsw><%for.i>][{0,+,1}<nuw><nsw><%for.j>][{0,+,1}<nuw><nsw><%for.k>]
-; CHECK-NEXT:  Delinearization validation: Succeeded
+; CHECK-NEXT:  Delinearization validation: Failed
 ;
 entry:
   br label %for.i
diff --git a/llvm/test/Analysis/Delinearization/multidim_two_accesses_different_delinearization.ll b/llvm/test/Analysis/Delinearization/multidim_two_accesses_different_delinearization.ll
index 011dc40697cb5..f768002dd9e41 100644
--- a/llvm/test/Analysis/Delinearization/multidim_two_accesses_different_delinearization.ll
+++ b/llvm/test/Analysis/Delinearization/multidim_two_accesses_different_delinearization.ll
@@ -19,14 +19,14 @@ define void @foo(i64 %n, i64 %m, ptr %A) {
 ; CHECK-NEXT:  Base offset: %A
 ; CHECK-NEXT:  ArrayDecl[UnknownSize][%m] with elements of 8 bytes.
 ; CHECK-NEXT:  ArrayRef[{0,+,1}<nuw><nsw><%for.i>][{0,+,1}<nuw><nsw><%for.j>]
-; CHECK-NEXT:  Delinearization validation: Succeeded
+; CHECK-NEXT:  Delinearization validation: Failed
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  Inst: store double 1.000000e+00, ptr %arrayidx1, align 8
 ; CHECK-NEXT:  AccessFunction: {{\{\{}}0,+,8}<%for.i>,+,(8 * %n)}<%for.j>
 ; CHECK-NEXT:  Base offset: %A
 ; CHECK-NEXT:  ArrayDecl[UnknownSize][%n] with elements of 8 bytes.
 ; CHECK-NEXT:  ArrayRef[{0,+,1}<nuw><nsw><%for.j>][{0,+,1}<nuw><nsw><%for.i>]
-; CHECK-NEXT:  Delinearization validation: Succeeded
+; CHECK-NEXT:  Delinearization validation: Failed
 ;
 entry:
   br label %for.i
diff --git a/llvm/test/Analysis/Delinearization/validation_large_size.ll b/llvm/test/Analysis/Delinearization/validation_large_size.ll
new file mode 100644
index 0000000000000..ad36d84b8d914
--- /dev/null
+++ b/llvm/test/Analysis/Delinearization/validation_large_size.ll
@@ -0,0 +1,180 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 6
+; RUN: opt < %s -passes='print<delinearization>' --delinearize-use-fixed-size-array-heuristic -disable-output 2>&1 | FileCheck %s
+
+; For array accesses, the following property should hold (assuming no
+; out-of-bounds accesses):
+;
+;   &A[I_1][I_2]...[I_n] == &A[J_1][J_2]...[J_n] iff
+;   (I_1, I_2, ..., I_n) == (J_1, J_2, ..., J_n)
+;
+; This property may not hold if the inferred array size is very large and the
+; offset calculation can overflow. The delinearization validation should
+; consider such cases as invalid.
+
+; for (i = 0; i < (1ULL << 60); i++)
+;   for (j = 0; j < 256; j++)
+;     A[i*256 + j] = 0;
+;
+; The store will be delinearized to `A[i][j]` with its size `[][256]`. Since
+; `i` can be very large, the mapping from subscripts to addresses is not
+; injective. E.g., `&A[0][j] = &A[2^56][j] = ...`.
+;
+define void @large_size_fixed(ptr %A) {
+; CHECK-LABEL: 'large_size_fixed'
+; CHECK-NEXT:  Inst: store i8 0, ptr %gep, align 1
+; CHECK-NEXT:  AccessFunction: {{\{\{}}0,+,256}<%for.i.header>,+,1}<nsw><%for.j>
+; CHECK-NEXT:  Base offset: %A
+; CHECK-NEXT:  ArrayDecl[UnknownSize][256] with elements of 1 bytes.
+; CHECK-NEXT:  ArrayRef[{0,+,1}<nuw><nsw><%for.i.header>][{0,+,1}<nuw><nsw><%for.j>]
+; CHECK-NEXT:  Delinearization validation: Failed
+;
+entry:
+  br label %for.i.header
+
+for.i.header:
+  %i = phi i64 [ 0, %entry ], [ %i.inc, %for.i.latch ]
+  %i.mul = mul i64 %i, 256
+  br label %for.j
+
+for.j:
+  %j = phi i64 [ 0, %for.i.header ], [ %j.inc, %for.j ]
+  %offset = add nsw i64 %i.mul, %j
+  %gep = getelementptr i8, ptr %A, i64 %offset
+  store i8 0, ptr %gep
+  %j.inc = add i64 %j, 1
+  %ec.j = icmp eq i64 %j.inc, 256
+  br i1 %ec.j, label %for.i.latch, label %for.j
+
+for.i.latch:
+  %i.inc = add i64 %i, 1
+  %ec.i = icmp eq i64 %i.inc, 1152921504606846976
+  br i1 %ec.i, label %exit, label %for.i.header
+
+exit:
+  ret void
+}
+
+; for (i = 0; i < n; i++)
+;   for (j = 0; j < m; j++)
+;     for (k = 0; k < o; k++)
+;       if (i < 5 && j < 5 && k < 5)
+;         A[i*m*o + j*o + k] = 0;
+;
+; The product (%m * %o) can overflow, e.g., (%m, %o) = (2^32 - 1, 2^32). In
+; this case, the delinearization `A[%i][%j][%k]` with its size `[][%m][%o]`
+; should be considered invalid, because the address calculation will be:
+;
+; A[%i][%j][%k] = %A + %i*%m*%o + %j*%o + %k
+;               = %A - 2^32*%i + %j*2^32 + %k
+;               = %A + 2^32*(%j - %i) + %k
+;
+; It means `&A[0][0][%k]` = `&A[1][1][%k]` = ..., which implies that the
+; mapping from subscripts to addresses is not injective.
+;
+define void @large_size_parametric(i64 %n, i64 %m, i64 %o, ptr %A) {
+; CHECK-LABEL: 'large_size_parametric'
+; CHECK-NEXT:  Inst: store i8 0, ptr %gep, align 1
+; CHECK-NEXT:  AccessFunction: {{\{\{\{}}0,+,(%m * %o)}<%for.i.header>,+,%o}<%for.j.header>,+,1}<nw><%for.k.header>
+; CHECK-NEXT:  Base offset: %A
+; CHECK-NEXT:  ArrayDecl[UnknownSize][%m][%o] with elements of 1 bytes.
+; CHECK-NEXT:  ArrayRef[{0,+,1}<nuw><nsw><%for.i.header>][{0,+,1}<nuw><nsw><%for.j.header>][{0,+,1}<nuw><nsw><%for.k.header>]
+; CHECK-NEXT:  Delinearization validation: Failed
+;
+entry:
+  %guard.i = icmp sgt i64 %n, 0
+  %m_o = mul i64 %m, %o
+  br i1 %guard.i, label %for.i.header, label %exit
+
+for.i.header:
+  %i = phi i64 [ 0, %entry ], [ %i.inc, %for.i.latch ]
+  %i_m_o = mul i64 %i, %m_o
+  br label %for.j.preheader
+
+for.j.preheader:
+  %guard.j = icmp sgt i64 %m, 0
+  br i1 %guard.j, label %for.j.header, label %for.i.latch
+
+for.j.header:
+  %j = phi i64 [ 0, %for.j.preheader ], [ %j.inc, %for.j.latch ]
+  %j_o = mul i64 %j, %o
+  br label %for.k.preheader
+
+for.k.preheader:
+  %guard.k = icmp sgt i64 %o, 0
+  br i1 %guard.k, label %for.k.header, label %for.j.latch
+
+for.k.header:
+  %k = phi i64 [ 0, %for.k.preheader ], [ %k.inc, %for.k.latch ]
+  %cond.i = icmp slt i64 %i, 5
+  %cond.j = icmp slt i64 %j, 5
+  %cond.k = icmp slt i64 %k, 5
+  %cond.ij = and i1 %cond.i, %cond.j
+  %cond = and i1 %cond.ij, %cond.k
+  br i1 %cond, label %if.then, label %for.k.latch
+
+if.then:
+  %offset.tmp = add i64 %i_m_o, %j_o
+  %offset = add i64 %offset.tmp, %k
+  %gep = getelementptr i8, ptr %A, i64 %offset
+  store i8 0, ptr %gep, align 1
+  br label %for.k.latch
+
+for.k.latch:
+  %k.inc = add nsw i64 %k, 1
+  %ec.k = icmp eq i64 %k.inc, %o
+  br i1 %ec.k, label %for.j.latch, label %for.k.header
+
+for.j.latch:
+  %j.inc = add nsw i64 %j, 1
+  %ec.j = icmp eq i64 %j.inc, %m
+  br i1 %ec.j, label %for.i.latch, label %for.j.header
+
+for.i.latch:
+  %i.inc = add nsw i64 %i, 1
+  %ec.i = icmp eq i64 %i.inc, %n
+  br i1 %ec.i, label %exit, label %for.i.header
+
+exit:
+  ret void
+}
+
+; for (i = 0; i < (1 << 54); i++)
+;   for (j = 0; j < 256; j++)
+;     A[i*256 + j] = 0;
+;
+; We also need to take the element size into account during validation.
+;
+define void @elementsize_cause_ovfl(ptr %A) {
+; CHECK-LABEL: 'elementsize_cause_ovfl'
+; CHECK-NEXT:  Inst: store i64 0, ptr %gep, align 4
+; CHECK-NEXT:  AccessFunction: {{\{\{}}0,+,2048}<%for.i.header>,+,8}<%for.j>
+; CHECK-NEXT:  Base offset: %A
+; CHECK-NEXT:  ArrayDecl[UnknownSize][256] with elements of 8 bytes.
+; CHECK-NEXT:  ArrayRef[{0,+,1}<nuw><nsw><%for.i.header>][{0,+,1}<nuw><nsw><%for.j>]
+; CHECK-NEXT:  Delinearization validation: Failed
+;
+entry:
+  br label %for.i.header
+
+for.i.header:
+  %i = phi i64 [ 0, %entry ], [ %i.inc, %for.i.latch ]
+  %i.mul = mul i64 %i, 256
+  br label %for.j
+
+for.j:
+  %j = phi i64 [ 0, %for.i.header ], [ %j.inc, %for.j ]
+  %offset = add i64 %i.mul, %j
+  %gep = getelementptr i64, ptr %A, i64 %offset
+  store i64 0, ptr %gep
+  %j.inc = add i64 %j, 1
+  %ec.j = icmp eq i64 %j.inc, 256
+  br i1 %ec.j, label %for.i.latch, label %for.j
+
+for.i.latch:
+  %i.inc = add i64 %i, 1
+  %ec.i = icmp eq i64 %i.inc, 18014398509481984
+  br i1 %ec.i, label %exit, label %for.i.header
+
+exit:
+  ret void
+}
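To make the wraparound described in the comments of large_size_fixed concrete, here is a small standalone sketch, not part of the patch, that reproduces the offset collision with wrapping 64-bit arithmetic; the constant names (ElemSize, InnerDim) and the choice of j = 3 are illustrative assumptions, while the element size (1 byte) and inner dimension (256) come from the test itself.

#include <cassert>
#include <cstdint>

int main() {
  // In large_size_fixed the elements are i8 (1 byte) and the inner
  // dimension is 256, so the byte offset of A[i][j] is i * 256 + j,
  // computed modulo 2^64.
  const uint64_t ElemSize = 1;
  const uint64_t InnerDim = 256;
  uint64_t J = 3;
  uint64_t OffsetLow = 0 * InnerDim * ElemSize + J * ElemSize;
  uint64_t OffsetHigh =
      (uint64_t(1) << 56) * InnerDim * ElemSize + J * ElemSize;
  // 2^56 * 256 == 2^64 wraps to 0, so distinct subscripts map to the
  // same address and the delinearization is not injective.
  assert(OffsetLow == OffsetHigh);
  return 0;
}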
diff --git a/llvm/test/Analysis/DependenceAnalysis/DADelin.ll b/llvm/test/Analysis/DependenceAnalysis/DADelin.ll
index 8f94a455d3724..429e37de0a453 100644
--- a/llvm/test/Analysis/DependenceAnalysis/DADelin.ll
+++ b/llvm/test/Analysis/DependenceAnalysis/DADelin.ll
@@ -13,11 +13,11 @@ target triple = "thumbv8m.main-arm-none-eabi"
 define void @t1(i32 %n, i32 %m, i32 %o, ptr nocapture %A) {
 ; CHECK-LABEL: 't1'
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: %0 = load i32, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - input [* * *]!
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: store i32 %add12, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - consistent anti [0 0 0|<]!
+; CHECK-NEXT:    da analyze - anti [* * *|<]!
 ; CHECK-NEXT:  Src: store i32 %add12, ptr %arrayidx, align 4 --> Dst: store i32 %add12, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - output [* * *]!
 ;
 entry:
   %cmp49 = icmp sgt i32 %n, 0
@@ -78,7 +78,7 @@ for.cond.cleanup:                                 ; preds = %for.cond.cleanup3,
 define void @t2(i32 %n, i32 %m, i32 %o, ptr nocapture %A) {
 ; CHECK-LABEL: 't2'
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: %0 = load i32, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - input [* * *]!
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
 ; CHECK-NEXT:    da analyze - anti [* * *|<]!
 ; CHECK-NEXT:  Src: store i32 %add12, ptr %arrayidx2, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
@@ -145,7 +145,7 @@ for.cond.cleanup:                                 ; preds = %for.cond.cleanup3,
 define void @t3(i32 %n, i32 %m, i32 %o, ptr nocapture %A) {
 ; CHECK-LABEL: 't3'
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: %0 = load i32, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - input [* * *]!
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
 ; CHECK-NEXT:    da analyze - anti [* * *|<]!
 ; CHECK-NEXT:  Src: store i32 %add12, ptr %arrayidx2, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
@@ -212,7 +212,7 @@ for.cond.cleanup:                                 ; preds = %for.cond.cleanup3,
 define void @t4(i32 %n, i32 %m, i32 %o, ptr nocapture %A) {
 ; CHECK-LABEL: 't4'
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: %0 = load i32, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - input [* * *]!
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
 ; CHECK-NEXT:    da analyze - anti [* * *|<]!
 ; CHECK-NEXT:  Src: store i32 %add12, ptr %arrayidx2, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
@@ -279,7 +279,7 @@ for.cond.cleanup:                                 ; preds = %for.cond.cleanup3,
 define void @t5(i32 %n, i32 %m, i32 %o, ptr nocapture %A) {
 ; CHECK-LABEL: 't5'
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: %0 = load i32, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - input [* * *]!
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
 ; CHECK-NEXT:    da analyze - anti [* * *|<]!
 ; CHECK-NEXT:  Src: store i32 %add12, ptr %arrayidx2, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
@@ -346,11 +346,11 @@ for.cond.cleanup:                                 ; preds = %for.cond.cleanup3,
 define void @t6(i32 %n, i32 %m, i32 %o, ptr nocapture %A) {
 ; CHECK-LABEL: 't6'
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: %0 = load i32, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - input [* * *]!
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
-; CHECK-NEXT:    da analyze - consistent anti [-1 0 0]!
+; CHECK-NEXT:    da analyze - anti [* * *|<]!
 ; CHECK-NEXT:  Src: store i32 %add12, ptr %arrayidx2, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - output [* * *]!
 ;
 entry:
   %cmp49 = icmp sgt i32 %n, 0
@@ -414,11 +414,11 @@ for.cond.cleanup:                                 ; preds = %for.cond.cleanup3,
 define void @t7(i32 %n, i32 %m, i32 %o, ptr nocapture %A) {
 ; CHECK-LABEL: 't7'
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: %0 = load i32, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - input [* * *]!
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
-; CHECK-NEXT:    da analyze - consistent anti [1 0 0]!
+; CHECK-NEXT:    da analyze - anti [* * *|<]!
 ; CHECK-NEXT:  Src: store i32 %add12, ptr %arrayidx2, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - output [* * *]!
 ;
 entry:
   %cmp49 = icmp sgt i32 %n, 0
@@ -482,11 +482,11 @@ for.cond.cleanup:                                 ; preds = %for.cond.cleanup3,
 define void @t8(i32 %n, i32 %m, i32 %o, ptr nocapture %A) {
 ; CHECK-LABEL: 't8'
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: %0 = load i32, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - input [* * *]!
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
-; CHECK-NEXT:    da analyze - consistent anti [0 0 1]!
+; CHECK-NEXT:    da analyze - anti [* * *|<]!
 ; CHECK-NEXT:  Src: store i32 %add12, ptr %arrayidx2, align 4 --> Dst: store i32 %add12, ptr %arrayidx2, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - output [* * *]!
 ;
 entry:
   %cmp49 = icmp sgt i32 %n, 0
@@ -646,11 +646,15 @@ exit:
 define void @coeff_may_negative(ptr %a, i32 %k) {
 ; CHECK-LABEL: 'coeff_may_negative'
 ; CHECK-NEXT:  Src: store i8 42, ptr %idx.0, align 1 --> Dst: store i8 42, ptr %idx.0, align 1
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: %k ne) 0
 ; CHECK-NEXT:  Src: store i8 42, ptr %idx.0, align 1 --> Dst: store i8 42, ptr %idx.1, align 1
 ; CHECK-NEXT:    da analyze - output [*|<]!
 ; CHECK-NEXT:  Src: store i8 42, ptr %idx.1, align 1 --> Dst: store i8 42, ptr %idx.1, align 1
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: %k ne) 0
 ;
 entry:
   br label %loop
@@ -685,11 +689,15 @@ exit:
 define void @coeff_positive(ptr %a, i32 %k) {
 ; CHECK-LABEL: 'coeff_positive'
 ; CHECK-NEXT:  Src: store i8 42, ptr %idx.0, align 1 --> Dst: store i8 42, ptr %idx.0, align 1
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: %k ne) 0
 ; CHECK-NEXT:  Src: store i8 42, ptr %idx.0, align 1 --> Dst: store i8 42, ptr %idx.1, align 1
 ; CHECK-NEXT:    da analyze - output [*|<]!
 ; CHECK-NEXT:  Src: store i8 42, ptr %idx.1, align 1 --> Dst: store i8 42, ptr %idx.1, align 1
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: %k ne) 0
 ;
 entry:
   br label %loop
diff --git a/llvm/test/Analysis/DependenceAnalysis/DifferentOffsets.ll b/llvm/test/Analysis/DependenceAnalysis/DifferentOffsets.ll
index 91d127cfc09d6..3360e603eb406 100644
--- a/llvm/test/Analysis/DependenceAnalysis/DifferentOffsets.ll
+++ b/llvm/test/Analysis/DependenceAnalysis/DifferentOffsets.ll
@@ -27,7 +27,9 @@ entry:
 define i32 @alias_with_parametric_offset(ptr nocapture %A, i64 %n) {
 ; CHECK-LABEL: 'alias_with_parametric_offset'
 ; CHECK-NEXT:  Src: store i32 2, ptr %arrayidx, align 1 --> Dst: store i32 2, ptr %arrayidx, align 1
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output []!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Equal predicate: (zext i2 (trunc i64 %n to i2) to i64) == 0
 ; CHECK-NEXT:  Src: store i32 2, ptr %arrayidx, align 1 --> Dst: %0 = load i32, ptr %A, align 1
 ; CHECK-NEXT:    da analyze - flow [|<]!
 ; CHECK-NEXT:    Runtime Assumptions:
@@ -45,14 +47,18 @@ entry:
 define i32 @alias_with_parametric_expr(ptr nocapture %A, i64 %n, i64 %m) {
 ; CHECK-LABEL: 'alias_with_parametric_expr'
 ; CHECK-NEXT:  Src: store i32 2, ptr %arrayidx, align 1 --> Dst: store i32 2, ptr %arrayidx, align 1
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output []!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Equal predicate: (zext i2 ((trunc i64 %m to i2) + (-2 * (trunc i64 %n to i2))) to i64) == 0
 ; CHECK-NEXT:  Src: store i32 2, ptr %arrayidx, align 1 --> Dst: %0 = load i32, ptr %arrayidx1, align 1
 ; CHECK-NEXT:    da analyze - flow [|<]!
 ; CHECK-NEXT:    Runtime Assumptions:
 ; CHECK-NEXT:    Equal predicate: (zext i2 ((trunc i64 %m to i2) + (-2 * (trunc i64 %n to i2))) to i64) == 0
 ; CHECK-NEXT:    Equal predicate: (zext i2 (-2 + (trunc i64 %m to i2)) to i64) == 0
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx1, align 1 --> Dst: %0 = load i32, ptr %arrayidx1, align 1
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent input []!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Equal predicate: (zext i2 (-2 + (trunc i64 %m to i2)) to i64) == 0
 ;
 entry:
   %mul = mul nsw i64 %n, 10
@@ -69,7 +75,9 @@ entry:
 define i32 @gep_i8_vs_i32(ptr nocapture %A, i64 %n, i64 %m) {
 ; CHECK-LABEL: 'gep_i8_vs_i32'
 ; CHECK-NEXT:  Src: store i32 42, ptr %arrayidx0, align 1 --> Dst: store i32 42, ptr %arrayidx0, align 1
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output []!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Equal predicate: (zext i2 (trunc i64 %n to i2) to i64) == 0
 ; CHECK-NEXT:  Src: store i32 42, ptr %arrayidx0, align 1 --> Dst: store i32 42, ptr %arrayidx1, align 4
 ; CHECK-NEXT:    da analyze - output [|<]!
 ; CHECK-NEXT:    Runtime Assumptions:
@@ -93,7 +101,7 @@ define void @linearized_accesses(i64 %n, i64 %m, i64 %o, ptr %A) {
 ; CHECK-NEXT:  Src: store i32 1, ptr %idx0, align 4 --> Dst: store i32 1, ptr %idx1, align 4
 ; CHECK-NEXT:    da analyze - output [* * *|<]!
 ; CHECK-NEXT:  Src: store i32 1, ptr %idx1, align 4 --> Dst: store i32 1, ptr %idx1, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - output [* * *]!
 ;
 entry:
   br label %for.i
diff --git a/llvm/test/Analysis/DependenceAnalysis/MIVCheckConst.ll b/llvm/test/Analysis/DependenceAnalysis/MIVCheckConst.ll
index bcf73683e4dab..d91a916d785c2 100644
--- a/llvm/test/Analysis/DependenceAnalysis/MIVCheckConst.ll
+++ b/llvm/test/Analysis/DependenceAnalysis/MIVCheckConst.ll
@@ -43,6 +43,7 @@ define void @test(ptr %A, ptr %B, i1 %arg, i32 %n, i32 %m) align 2 {
 ; CHECK-NEXT:    Runtime Assumptions:
 ; CHECK-NEXT:    Equal predicate: (zext i7 (4 * (trunc i32 %v1 to i7) * (1 + (trunc i32 %n to i7))) to i32) == 0
 ; CHECK-NEXT:    Equal predicate: (8 * (zext i4 (trunc i32 %v1 to i4) to i32))<nuw><nsw> == 0
+; CHECK-NEXT:    Compare predicate: (8 * %v1) ne) 0
 ; CHECK-NEXT:  Src: %v27 = load <32 x i32>, ptr %v25, align 256 --> Dst: %v32 = load <32 x i32>, ptr %v30, align 128
 ; CHECK-NEXT:    da analyze - input [* S S|<]!
 ; CHECK-NEXT:    Runtime Assumptions:
diff --git a/llvm/test/Analysis/DependenceAnalysis/PR148435.ll b/llvm/test/Analysis/DependenceAnalysis/PR148435.ll
new file mode 100644
index 0000000000000..54aa9c9399985
--- /dev/null
+++ b/llvm/test/Analysis/DependenceAnalysis/PR148435.ll
@@ -0,0 +1,86 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -disable-output "-passes=print<da>" 2>&1 | FileCheck %s
+
+define void @_Z1cb(ptr %a) {
+; CHECK-LABEL: '_Z1cb'
+; CHECK-NEXT:  Src: store i8 0, ptr %arrayidx9, align 1 --> Dst: store i8 0, ptr %arrayidx9, align 1
+; CHECK-NEXT:    da analyze - output [*]!
+;
+entry:
+  br label %for.body
+
+for.cond.cleanup.loopexit:                        ; preds = %for.body
+  ret void
+
+for.body:                                         ; preds = %for.body, %entry
+  %indvars.iv23 = phi i64 [ 0, %entry ], [ %indvars.iv.next24, %for.body ]
+  %idxprom = and i64 %indvars.iv23, 1
+  %arrayidx9 = getelementptr inbounds [0 x [12 x [12 x i8]]], ptr %a, i64 0, i64 %idxprom, i64 0, i64 %indvars.iv23
+  store i8 0, ptr %arrayidx9, align 1
+  %indvars.iv.next24 = add i64 %indvars.iv23, 1
+  %exitcond.not = icmp eq i64 %indvars.iv.next24, 0
+  br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body
+}
+
+ at a = external global [0 x [12 x [12 x i8]]], align 1
+
+define void @test_siv_no_addrec(i1 %d, i32 %b) {
+; CHECK-LABEL: 'test_siv_no_addrec'
+; CHECK-NEXT:  Src: store i8 0, ptr %arrayidx7, align 1 --> Dst: store i8 0, ptr %arrayidx7, align 1
+; CHECK-NEXT:    da analyze - output [* *]!
+;
+entry:
+  %conv.val = select i1 %d, i16 1, i16 0
+  br label %for.cond
+
+for.cond:                                         ; preds = %for.inc8, %entry
+  %e.0 = phi i32 [ %b, %entry ], [ %inc9, %for.inc8 ]
+  %cmp = icmp ult i32 %e.0, 10
+  br i1 %cmp, label %for.cond1, label %for.end10
+
+for.cond1:                                        ; preds = %for.inc, %for.cond
+  %f.0 = phi i16 [ %conv.val, %for.cond ], [ %add, %for.inc ]
+  %cmp2 = icmp slt i16 %f.0, 10
+  br i1 %cmp2, label %for.body4, label %for.inc8
+
+for.body4:                                        ; preds = %for.cond1
+  %sub = add i32 %e.0, -3
+  %idxprom = zext i32 %sub to i64
+  %idxprom5 = sext i16 %f.0 to i64
+  %idxprom6 = zext i32 %e.0 to i64
+  %arrayidx7 = getelementptr inbounds [0 x [12 x [12 x i8]]], ptr @a, i64 0, i64 %idxprom, i64 %idxprom5, i64 %idxprom6
+  store i8 0, ptr %arrayidx7, align 1
+  br label %for.inc
+
+for.inc:                                          ; preds = %for.body4
+  %add = add i16 %f.0, 2
+  br label %for.cond1
+
+for.inc8:                                         ; preds = %for.cond1
+  %inc9 = add i32 %e.0, 1
+  br label %for.cond
+
+for.end10:                                        ; preds = %for.cond
+  ret void
+}
+
+define void @f1(ptr %a) {
+; CHECK-LABEL: 'f1'
+; CHECK-NEXT:  Src: store i8 0, ptr %idx, align 1 --> Dst: store i8 0, ptr %idx, align 1
+; CHECK-NEXT:    da analyze - output [*]!
+;
+entry:
+  br label %loop
+
+loop:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
+  %and = and i64 %i, 1
+  %idx = getelementptr inbounds [4 x [4 x i8]], ptr %a, i64 0, i64 %and, i64 %and
+  store i8 0, ptr %idx
+  %i.next = add i64 %i, 1
+  %exitcond.not = icmp slt i64 %i.next, 8
+  br i1 %exitcond.not, label %loop, label %exit
+
+exit:
+  ret void
+}
diff --git a/llvm/test/Analysis/DependenceAnalysis/StrongSIV.ll b/llvm/test/Analysis/DependenceAnalysis/StrongSIV.ll
index 19cef4537a769..02fbaf2910c70 100644
--- a/llvm/test/Analysis/DependenceAnalysis/StrongSIV.ll
+++ b/llvm/test/Analysis/DependenceAnalysis/StrongSIV.ll
@@ -492,13 +492,19 @@ for.end:                                          ; preds = %for.end.loopexit, %
 define void @strong10(ptr %A, ptr %B, i64 %n) nounwind uwtable ssp {
 ; CHECK-LABEL: 'strong10'
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: store i32 %conv, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (4 * %n) ne) 0
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: %0 = load i32, ptr %arrayidx3, align 4
 ; CHECK-NEXT:    da analyze - consistent flow [0|<]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (4 * %n) ne) 0
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: store i32 %0, ptr %B.addr.01, align 4
 ; CHECK-NEXT:    da analyze - confused!
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx3, align 4 --> Dst: %0 = load i32, ptr %arrayidx3, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent input [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (4 * %n) ne) 0
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx3, align 4 --> Dst: store i32 %0, ptr %B.addr.01, align 4
 ; CHECK-NEXT:    da analyze - confused!
 ; CHECK-NEXT:  Src: store i32 %0, ptr %B.addr.01, align 4 --> Dst: store i32 %0, ptr %B.addr.01, align 4
@@ -536,9 +542,13 @@ for.end:                                          ; preds = %for.body
 ;;        A[i] = 0;
 
 define void @strong11(ptr %A) nounwind uwtable ssp {
-; CHECK-LABEL: 'strong11'
-; CHECK-NEXT:  Src: store i32 0, ptr %arrayidx, align 4 --> Dst: store i32 0, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - consistent output [0 S]!
+; CHECK-ALL-LABEL: 'strong11'
+; CHECK-ALL-NEXT:  Src: store i32 0, ptr %arrayidx, align 4 --> Dst: store i32 0, ptr %arrayidx, align 4
+; CHECK-ALL-NEXT:    da analyze - none!
+;
+; CHECK-STRONG-SIV-LABEL: 'strong11'
+; CHECK-STRONG-SIV-NEXT:  Src: store i32 0, ptr %arrayidx, align 4 --> Dst: store i32 0, ptr %arrayidx, align 4
+; CHECK-STRONG-SIV-NEXT:    da analyze - consistent output [0 S]!
 ;
 entry:
   br label %for.cond1.preheader
diff --git a/llvm/test/Analysis/DependenceAnalysis/SymbolicSIV.ll b/llvm/test/Analysis/DependenceAnalysis/SymbolicSIV.ll
index e43aa0f407b50..a33a10b5e1c2d 100644
--- a/llvm/test/Analysis/DependenceAnalysis/SymbolicSIV.ll
+++ b/llvm/test/Analysis/DependenceAnalysis/SymbolicSIV.ll
@@ -381,13 +381,17 @@ for.end:                                          ; preds = %for.end.loopexit, %
 define void @symbolicsiv6(ptr %A, ptr %B, i64 %n, i64 %N, i64 %M) nounwind uwtable ssp {
 ; CHECK-LABEL: 'symbolicsiv6'
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: store i32 %conv, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (16 * %N) ne) 0
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: %0 = load i32, ptr %arrayidx7, align 4
 ; CHECK-NEXT:    da analyze - flow [*|<]!
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: store i32 %0, ptr %B.addr.02, align 4
 ; CHECK-NEXT:    da analyze - confused!
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx7, align 4 --> Dst: %0 = load i32, ptr %arrayidx7, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent input [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (16 * %N) ne) 0
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx7, align 4 --> Dst: store i32 %0, ptr %B.addr.02, align 4
 ; CHECK-NEXT:    da analyze - confused!
 ; CHECK-NEXT:  Src: store i32 %0, ptr %B.addr.02, align 4 --> Dst: store i32 %0, ptr %B.addr.02, align 4
@@ -437,13 +441,17 @@ for.end:                                          ; preds = %for.end.loopexit, %
 define void @symbolicsiv7(ptr %A, ptr %B, i64 %n, i64 %N, i64 %M) nounwind uwtable ssp {
 ; CHECK-LABEL: 'symbolicsiv7'
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: store i32 %conv, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (8 * %N) ne) 0
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: %1 = load i32, ptr %arrayidx6, align 4
 ; CHECK-NEXT:    da analyze - flow [*|<]!
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: store i32 %1, ptr %B.addr.02, align 4
 ; CHECK-NEXT:    da analyze - confused!
 ; CHECK-NEXT:  Src: %1 = load i32, ptr %arrayidx6, align 4 --> Dst: %1 = load i32, ptr %arrayidx6, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent input [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (8 * %N) ne) 0
 ; CHECK-NEXT:  Src: %1 = load i32, ptr %arrayidx6, align 4 --> Dst: store i32 %1, ptr %B.addr.02, align 4
 ; CHECK-NEXT:    da analyze - confused!
 ; CHECK-NEXT:  Src: store i32 %1, ptr %B.addr.02, align 4 --> Dst: store i32 %1, ptr %B.addr.02, align 4
diff --git a/llvm/test/Analysis/DependenceAnalysis/WeakCrossingSIV.ll b/llvm/test/Analysis/DependenceAnalysis/WeakCrossingSIV.ll
index 906505eef51b9..c7accfd46a4d7 100644
--- a/llvm/test/Analysis/DependenceAnalysis/WeakCrossingSIV.ll
+++ b/llvm/test/Analysis/DependenceAnalysis/WeakCrossingSIV.ll
@@ -14,13 +14,17 @@ target triple = "x86_64-apple-macosx10.6.0"
 define void @weakcrossing0(ptr %A, ptr %B, i64 %n) nounwind uwtable ssp {
 ; CHECK-LABEL: 'weakcrossing0'
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: store i32 %conv, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (4 * %n) ne) 0
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: %0 = load i32, ptr %arrayidx2, align 4
 ; CHECK-NEXT:    da analyze - flow [0|<]!
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: store i32 %0, ptr %B.addr.02, align 4
 ; CHECK-NEXT:    da analyze - confused!
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx2, align 4 --> Dst: %0 = load i32, ptr %arrayidx2, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent input [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (-4 * %n) ne) 0
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx2, align 4 --> Dst: store i32 %0, ptr %B.addr.02, align 4
 ; CHECK-NEXT:    da analyze - confused!
 ; CHECK-NEXT:  Src: store i32 %0, ptr %B.addr.02, align 4 --> Dst: store i32 %0, ptr %B.addr.02, align 4
diff --git a/llvm/test/Analysis/DependenceAnalysis/WeakZeroDstSIV.ll b/llvm/test/Analysis/DependenceAnalysis/WeakZeroDstSIV.ll
index 71a5ed62b8e21..f8a045c425029 100644
--- a/llvm/test/Analysis/DependenceAnalysis/WeakZeroDstSIV.ll
+++ b/llvm/test/Analysis/DependenceAnalysis/WeakZeroDstSIV.ll
@@ -90,7 +90,9 @@ for.end:                                          ; preds = %for.body
 define void @weakzerodst1(ptr %A, ptr %B, i64 %n) nounwind uwtable ssp {
 ; CHECK-LABEL: 'weakzerodst1'
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: store i32 %conv, ptr %arrayidx, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (4 * %n) ne) 0
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: %0 = load i32, ptr %arrayidx1, align 4
 ; CHECK-NEXT:    da analyze - flow [p<=|<]!
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: store i32 %0, ptr %B.addr.02, align 4
diff --git a/llvm/test/Analysis/DependenceAnalysis/WeakZeroSrcSIV.ll b/llvm/test/Analysis/DependenceAnalysis/WeakZeroSrcSIV.ll
index 2ee1a37775f46..4ed0abd8d98a9 100644
--- a/llvm/test/Analysis/DependenceAnalysis/WeakZeroSrcSIV.ll
+++ b/llvm/test/Analysis/DependenceAnalysis/WeakZeroSrcSIV.ll
@@ -94,7 +94,9 @@ define void @weakzerosrc1(ptr %A, ptr %B, i64 %n) nounwind uwtable ssp {
 ; CHECK-NEXT:  Src: store i32 %conv, ptr %arrayidx, align 4 --> Dst: store i32 %0, ptr %B.addr.02, align 4
 ; CHECK-NEXT:    da analyze - confused!
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx1, align 4 --> Dst: %0 = load i32, ptr %arrayidx1, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent input [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (4 * %n) ne) 0
 ; CHECK-NEXT:  Src: %0 = load i32, ptr %arrayidx1, align 4 --> Dst: store i32 %0, ptr %B.addr.02, align 4
 ; CHECK-NEXT:    da analyze - confused!
 ; CHECK-NEXT:  Src: store i32 %0, ptr %B.addr.02, align 4 --> Dst: store i32 %0, ptr %B.addr.02, align 4
diff --git a/llvm/test/Analysis/DependenceAnalysis/becount-couldnotcompute.ll b/llvm/test/Analysis/DependenceAnalysis/becount-couldnotcompute.ll
index 49fbad3510ae6..1674badd4d6b9 100644
--- a/llvm/test/Analysis/DependenceAnalysis/becount-couldnotcompute.ll
+++ b/llvm/test/Analysis/DependenceAnalysis/becount-couldnotcompute.ll
@@ -7,7 +7,9 @@
 define void @test(i64 %conv, ptr %a) {
 ; CHECK-LABEL: 'test'
 ; CHECK-NEXT:  Src: %ld = load i32, ptr %arrayidx12, align 4 --> Dst: %ld = load i32, ptr %arrayidx12, align 4
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent input [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (4 + (4 * %conv)) ne) 0
 ;
 entry:
   %sub = add i64 %conv, 1
diff --git a/llvm/test/Analysis/DependenceAnalysis/bounds-check.ll b/llvm/test/Analysis/DependenceAnalysis/bounds-check.ll
new file mode 100644
index 0000000000000..86086f77d2a47
--- /dev/null
+++ b/llvm/test/Analysis/DependenceAnalysis/bounds-check.ll
@@ -0,0 +1,27 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -disable-output "-passes=print<da>" 2>&1 | FileCheck %s
+
+; Test case for SmallBitVector bounds checking bug in DependenceAnalysis.
+; This test ensures that loop index mapping functions don't cause out-of-bounds
+; access to SmallBitVector when loop depths exceed MaxLevels.
+
+define void @bounds_check_test(ptr %a) {
+; CHECK-LABEL: 'bounds_check_test'
+; CHECK-NEXT:  Src: store i8 0, ptr %idx, align 1 --> Dst: store i8 0, ptr %idx, align 1
+; CHECK-NEXT:    da analyze - output [*]!
+;
+entry:
+  br label %loop
+
+loop:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
+  %and = and i64 %i, 1  ; Creates index 0 or 1
+  %idx = getelementptr inbounds [4 x [4 x i8]], ptr %a, i64 0, i64 %and, i64 %i
+  store i8 0, ptr %idx
+  %i.next = add i64 %i, 1
+  %exitcond.not = icmp slt i64 %i.next, 4
+  br i1 %exitcond.not, label %loop, label %exit
+
+exit:
+  ret void
+}
diff --git a/llvm/test/Analysis/DependenceAnalysis/compute-absolute-value.ll b/llvm/test/Analysis/DependenceAnalysis/compute-absolute-value.ll
index 783150af2cd13..f3ce956545482 100644
--- a/llvm/test/Analysis/DependenceAnalysis/compute-absolute-value.ll
+++ b/llvm/test/Analysis/DependenceAnalysis/compute-absolute-value.ll
@@ -16,11 +16,15 @@
 define void @unknown_sign(ptr %a, i64 %k) {
 ; CHECK-LABEL: 'unknown_sign'
 ; CHECK-NEXT:  Src: store i8 1, ptr %idx.0, align 1 --> Dst: store i8 1, ptr %idx.0, align 1
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (-1 * %k) ne) 0
 ; CHECK-NEXT:  Src: store i8 1, ptr %idx.0, align 1 --> Dst: store i8 2, ptr %idx.1, align 1
 ; CHECK-NEXT:    da analyze - output [*|<]!
 ; CHECK-NEXT:  Src: store i8 2, ptr %idx.1, align 1 --> Dst: store i8 2, ptr %idx.1, align 1
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (-1 * %k) ne) 0
 ;
 entry:
   %k.neg = sub nsw i64 0, %k
diff --git a/llvm/test/Analysis/DependenceAnalysis/gcd-miv-overflow.ll b/llvm/test/Analysis/DependenceAnalysis/gcd-miv-overflow.ll
index 9169ac323d834..ff8f32b9c8276 100644
--- a/llvm/test/Analysis/DependenceAnalysis/gcd-miv-overflow.ll
+++ b/llvm/test/Analysis/DependenceAnalysis/gcd-miv-overflow.ll
@@ -24,7 +24,9 @@
 define void @gcdmiv_coef_ovfl(ptr %A, i64 %m) {
 ; CHECK-ALL-LABEL: 'gcdmiv_coef_ovfl'
 ; CHECK-ALL-NEXT:  Src: store i8 1, ptr %gep.0, align 1 --> Dst: store i8 1, ptr %gep.0, align 1
-; CHECK-ALL-NEXT:    da analyze - none!
+; CHECK-ALL-NEXT:    da analyze - consistent output [0]!
+; CHECK-ALL-NEXT:    Runtime Assumptions:
+; CHECK-ALL-NEXT:    Compare predicate: (3 * %m) ne) 0
 ; CHECK-ALL-NEXT:  Src: store i8 1, ptr %gep.0, align 1 --> Dst: store i8 2, ptr %gep.1, align 1
 ; CHECK-ALL-NEXT:    da analyze - output [*|<]!
 ; CHECK-ALL-NEXT:  Src: store i8 2, ptr %gep.1, align 1 --> Dst: store i8 2, ptr %gep.1, align 1
@@ -56,6 +58,124 @@ loop:
   %ec = icmp eq i64 %i.inc, 100
   br i1 %ec, label %exit, label %loop
 
+exit:
+  ret void
+}
+
+; max_i = INT64_MAX/6  // 1537228672809129301
+; for (long long i = 0; i <= max_i; i++) {
+;   A[-6*i + INT64_MAX] = 0;
+;   if (i)
+;     A[3*i - 2] = 1;
+; }
+;
+; FIXME: DependenceAnalysis currently detects no dependency between the two
+; stores, but it does exist. For example,
+;
+;  memory access       | i == 1 | i == (max_i + 1) / 2 | i == max_i
+; ---------------------|--------|----------------------|-------------------
+;  A[-6*i + INT64_MAX] |        | A[3*max_i - 2]       | A[1]
+;  A[3*i - 2]          | A[1]   |                      | A[3*max_i - 2]
+;
+;
+define void @gcdmiv_delta_ovfl(ptr %A) {
+; CHECK-ALL-LABEL: 'gcdmiv_delta_ovfl'
+; CHECK-ALL-NEXT:  Src: store i8 0, ptr %idx.0, align 1 --> Dst: store i8 0, ptr %idx.0, align 1
+; CHECK-ALL-NEXT:    da analyze - none!
+; CHECK-ALL-NEXT:  Src: store i8 0, ptr %idx.0, align 1 --> Dst: store i8 1, ptr %idx.1, align 1
+; CHECK-ALL-NEXT:    da analyze - none!
+; CHECK-ALL-NEXT:  Src: store i8 1, ptr %idx.1, align 1 --> Dst: store i8 1, ptr %idx.1, align 1
+; CHECK-ALL-NEXT:    da analyze - none!
+;
+; CHECK-GCD-MIV-LABEL: 'gcdmiv_delta_ovfl'
+; CHECK-GCD-MIV-NEXT:  Src: store i8 0, ptr %idx.0, align 1 --> Dst: store i8 0, ptr %idx.0, align 1
+; CHECK-GCD-MIV-NEXT:    da analyze - consistent output [*]!
+; CHECK-GCD-MIV-NEXT:  Src: store i8 0, ptr %idx.0, align 1 --> Dst: store i8 1, ptr %idx.1, align 1
+; CHECK-GCD-MIV-NEXT:    da analyze - consistent output [*|<]!
+; CHECK-GCD-MIV-NEXT:  Src: store i8 1, ptr %idx.1, align 1 --> Dst: store i8 1, ptr %idx.1, align 1
+; CHECK-GCD-MIV-NEXT:    da analyze - consistent output [*]!
+;
+entry:
+  br label %loop.header
+
+loop.header:
+  %i = phi i64 [ 0, %entry ], [ %i.inc, %loop.latch ]
+  %subscript.0 = phi i64 [ 9223372036854775807, %entry ], [ %subscript.0.next, %loop.latch ]
+  %subscript.1 = phi i64 [ -2, %entry ], [ %subscript.1.next, %loop.latch ]
+  %idx.0 = getelementptr inbounds i8, ptr %A, i64 %subscript.0
+  store i8 0, ptr %idx.0
+  %cond.store = icmp ne i64 %i, 0
+  br i1 %cond.store, label %if.store, label %loop.latch
+
+if.store:
+  %idx.1 = getelementptr i8, ptr %A, i64 %subscript.1
+  store i8 1, ptr %idx.1
+  br label %loop.latch
+
+loop.latch:
+  %i.inc = add nuw nsw i64 %i, 1
+  %subscript.0.next = add nsw i64 %subscript.0, -6
+  %subscript.1.next = add nsw i64 %subscript.1, 3
+  %exitcond = icmp sgt i64 %i.inc, 1537228672809129301
+  br i1 %exitcond, label %exit, label %loop.header
+
+exit:
+  ret void
+}
+
+; max_i = INT64_MAX/3  // 3074457345618258602
+; for (i = 0; i < max_i; i++) {
+;   A[3*i] = 0;
+;   A[-3*i - 3*m - INT64_MAX] = 1;
+; }
+;
+; Dependency may exist between the two stores. For example, consider `m = 1`.
+; Then, `-3*m - INT64_MAX` is `INT64_MAX - 1`. So `-3*i - 3*m - INT64_MAX` is
+; effectively `-3*i + (INT64_MAX - 1)`. Thus, accesses will be as follows:
+;
+;  memory access             | i == 1           | i == max_i - 1
+; ---------------------------|------------------|------------------
+;  A[3*i]                    | A[3]             | A[INT64_MAX - 4]
+;  A[-3*i - 3*m - INT64_MAX] | A[INT64_MAX - 4] | A[3]
+;
+
+define void @gcdmiv_const_ovfl(ptr %A, i64 %m) {
+; CHECK-ALL-LABEL: 'gcdmiv_const_ovfl'
+; CHECK-ALL-NEXT:  Src: store i8 0, ptr %gep.0, align 1 --> Dst: store i8 0, ptr %gep.0, align 1
+; CHECK-ALL-NEXT:    da analyze - none!
+; CHECK-ALL-NEXT:  Src: store i8 0, ptr %gep.0, align 1 --> Dst: store i8 1, ptr %gep.1, align 1
+; CHECK-ALL-NEXT:    da analyze - output [*|<]!
+; CHECK-ALL-NEXT:  Src: store i8 1, ptr %gep.1, align 1 --> Dst: store i8 1, ptr %gep.1, align 1
+; CHECK-ALL-NEXT:    da analyze - none!
+;
+; CHECK-GCD-MIV-LABEL: 'gcdmiv_const_ovfl'
+; CHECK-GCD-MIV-NEXT:  Src: store i8 0, ptr %gep.0, align 1 --> Dst: store i8 0, ptr %gep.0, align 1
+; CHECK-GCD-MIV-NEXT:    da analyze - consistent output [*]!
+; CHECK-GCD-MIV-NEXT:  Src: store i8 0, ptr %gep.0, align 1 --> Dst: store i8 1, ptr %gep.1, align 1
+; CHECK-GCD-MIV-NEXT:    da analyze - consistent output [*|<]!
+; CHECK-GCD-MIV-NEXT:  Src: store i8 1, ptr %gep.1, align 1 --> Dst: store i8 1, ptr %gep.1, align 1
+; CHECK-GCD-MIV-NEXT:    da analyze - consistent output [*]!
+;
+entry:
+  %m.3 = mul nsw i64 -3, %m
+  %const = sub i64 %m.3, 9223372036854775807
+  %guard = icmp ne i64 %const, 0
+  br i1 %guard, label %loop, label %exit
+
+loop:
+  %i = phi i64 [ 0, %entry ], [ %i.inc, %loop ]
+  %offset.0 = phi i64 [ 0, %entry ], [ %offset.0.next, %loop ]
+  %offset.1 = phi i64 [ %const, %entry ], [ %offset.1.next, %loop ]
+  %gep.0 = getelementptr inbounds i8, ptr %A, i64 %offset.0
+  %gep.1 = getelementptr inbounds i8, ptr %A, i64 %offset.1
+  store i8 0, ptr %gep.0
+  store i8 1, ptr %gep.1
+  %i.inc = add nuw nsw i64 %i, 1
+  %offset.0.next = add nsw i64 %offset.0, 3
+  %offset.1.next = add nsw i64 %offset.1, -3
+  %ec = icmp eq i64 %i.inc, 3074457345618258602
+  br i1 %ec, label %exit, label %loop
+
 exit:
   ret void
 }
diff --git a/llvm/test/Analysis/DependenceAnalysis/monotonicity-cast.ll b/llvm/test/Analysis/DependenceAnalysis/monotonicity-cast.ll
index e43d00d0bf651..966e4462fb887 100644
--- a/llvm/test/Analysis/DependenceAnalysis/monotonicity-cast.ll
+++ b/llvm/test/Analysis/DependenceAnalysis/monotonicity-cast.ll
@@ -14,7 +14,9 @@ define void @sext_nsw(ptr %a, i8 %start, i8 %step) {
 ; CHECK-NEXT:      Monotonicity: MultivariateSignedMonotonic
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  Src: store i8 0, ptr %idx, align 1 --> Dst: store i8 0, ptr %idx, align 1
-; CHECK-NEXT:    da analyze - none!
+; CHECK-NEXT:    da analyze - consistent output [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: (sext i8 %step to i64) ne) 0
 ;
 entry:
   br label %loop
diff --git a/llvm/test/Analysis/DependenceAnalysis/wrapping-addrec-1.ll b/llvm/test/Analysis/DependenceAnalysis/wrapping-addrec-1.ll
new file mode 100644
index 0000000000000..00ce2e91809c5
--- /dev/null
+++ b/llvm/test/Analysis/DependenceAnalysis/wrapping-addrec-1.ll
@@ -0,0 +1,35 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -disable-output "-passes=print<da>" 2>&1 | FileCheck %s
+
+; Test case for bug #148435 - SIV test assertion failure
+
+define void @f(ptr %a) {
+; CHECK-LABEL: 'f'
+; CHECK-NEXT:  Src: store i8 42, ptr %idx, align 1 --> Dst: store i8 42, ptr %idx, align 1
+; CHECK-NEXT:    da analyze - output [* *]!
+;
+entry:
+  br label %loop.i.header
+
+loop.i.header:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %loop.i.latch ]
+  %and.i = and i64 %i, 1
+  br label %loop.j
+
+loop.j:
+  %j = phi i64 [ 0, %loop.i.header ], [ %j.next, %loop.j ]
+  %and.j = and i64 %j, 1
+  %idx = getelementptr [2 x [2 x i8]], ptr %a, i64 0, i64 %and.i, i64 %and.j
+  store i8 42, ptr %idx
+  %j.next = add i64 %j, 1
+  %exitcond.j = icmp eq i64 %j.next, 100
+  br i1 %exitcond.j, label %loop.i.latch, label %loop.j
+
+loop.i.latch:
+  %i.next = add i64 %i, 1
+  %exitcond.i = icmp eq i64 %i.next, 100
+  br i1 %exitcond.i, label %exit, label %loop.i.header
+
+exit:
+  ret void
+}
diff --git a/llvm/test/Analysis/DependenceAnalysis/wrapping-addrec.ll b/llvm/test/Analysis/DependenceAnalysis/wrapping-addrec.ll
new file mode 100644
index 0000000000000..a1693e9d66575
--- /dev/null
+++ b/llvm/test/Analysis/DependenceAnalysis/wrapping-addrec.ll
@@ -0,0 +1,33 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -disable-output "-passes=print<da>" 2>&1 | FileCheck %s
+
+; Test case for wrapping AddRec detection in DependenceAnalysis.
+; This ensures that AddRec expressions that wrap (creating cyclic rather than
+; linear patterns) are rejected from SIV analysis and treated conservatively.
+
+; This test case has a clear dependence pattern that was incorrectly reported as "none!"
+; The issue: {false,+,true} in i1 arithmetic creates pattern (0,1,0,1,0,1,...).
+; - i=0: a[0][0][0], i=1: a[0][1][1], i=2: a[0][0][0], i=3: a[0][1][1], ...
+; - Clear dependencies at distances 2, 4, 6 between iterations accessing same locations.
+; - Strong SIV test was missing these due to treating wrapping pattern as linear.
+
+define void @test_wrapping_i1_addrec(ptr %a) {
+; CHECK-LABEL: 'test_wrapping_i1_addrec'
+; CHECK-NEXT:  Src: store i8 0, ptr %idx, align 1 --> Dst: store i8 0, ptr %idx, align 1
+; CHECK-NEXT:    da analyze - output [*]!
+;
+entry:
+  br label %loop
+
+loop:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
+  %and = and i64 %i, 1
+  %idx = getelementptr inbounds [4 x [4 x i8]], ptr %a, i64 0, i64 %and, i64 %and
+  store i8 0, ptr %idx
+  %i.next = add i64 %i, 1
+  %exitcond.not = icmp slt i64 %i.next, 8
+  br i1 %exitcond.not, label %loop, label %exit
+
+exit:
+  ret void
+}
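
For reference, a rough C analogue of the access pattern exercised by @test_wrapping_i1_addrec above (the array declaration and trip count are illustrative only, not taken from the test):

/* i & 1 cycles through 0,1,0,1,..., so iterations i and i+2 write the
 * same element a[i & 1][i & 1]; the loop therefore carries an output
 * dependence, and reporting "none!" for it would be unsound. */
char a[4][4];
for (long long i = 0; i < 8; ++i)
  a[i & 1][i & 1] = 0;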
diff --git a/llvm/test/Analysis/DependenceAnalysis/zero-coefficient.ll b/llvm/test/Analysis/DependenceAnalysis/zero-coefficient.ll
new file mode 100644
index 0000000000000..55f0ecc123e3a
--- /dev/null
+++ b/llvm/test/Analysis/DependenceAnalysis/zero-coefficient.ll
@@ -0,0 +1,34 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -disable-output "-passes=print<da>" -aa-pipeline=basic-aa 2>&1 \
+; RUN: | FileCheck %s
+
+; Test case for bug #149991 where Strong SIV test incorrectly concludes "no dependence"
+; when the coefficient is symbolic (unknown at compile time) and delta is zero.
+;
+; In this case, the array access is A[k*i] with both src and dst at the same
+; location in the same iteration. If k=0, then all iterations access the same
+; element, meaning there IS a dependence between different iterations.
+; The Strong SIV test should add a runtime assumption that k != 0.
+
+define void @test_zero_coefficient(ptr noalias %A, i64 %k) {
+; CHECK-LABEL: 'test_zero_coefficient'
+; CHECK-NEXT:  Src: store i8 42, ptr %idx, align 1 --> Dst: store i8 42, ptr %idx, align 1
+; CHECK-NEXT:    da analyze - consistent output [0]!
+; CHECK-NEXT:    Runtime Assumptions:
+; CHECK-NEXT:    Compare predicate: %k ne) 0
+;
+entry:
+  br label %loop
+
+loop:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
+  %off = mul nsw i64 %i, %k
+  %idx = getelementptr inbounds i8, ptr %A, i64 %off
+  store i8 42, ptr %idx, align 1
+  %i.next = add nsw i64 %i, 1
+  %cmp = icmp slt i64 %i.next, 100
+  br i1 %cmp, label %loop, label %exit
+
+exit:
+  ret void
+}
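
Loosely, the situation @test_zero_coefficient covers corresponds to the following C loop (function name and trip count here are illustrative only):

/* If k happens to be 0 at run time, every iteration stores to A[0], so a
 * loop-carried output dependence exists. Concluding "no dependence" is
 * only sound under the runtime assumption k != 0, which is the Compare
 * predicate the CHECK lines above expect. */
void store_strided(char *restrict A, long long k) {
  for (long long i = 0; i < 100; ++i)
    A[k * i] = 42;
}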
diff --git a/llvm/test/Analysis/ScalarEvolution/addrec-may-wrap-udiv-canonicalize.ll b/llvm/test/Analysis/ScalarEvolution/addrec-may-wrap-udiv-canonicalize.ll
index 0a6ef0dad4569..e041c96371762 100644
--- a/llvm/test/Analysis/ScalarEvolution/addrec-may-wrap-udiv-canonicalize.ll
+++ b/llvm/test/Analysis/ScalarEvolution/addrec-may-wrap-udiv-canonicalize.ll
@@ -13,7 +13,7 @@ define void @test_step2_div4(i64 %n) {
 ; CHECK-NEXT:    %iv.1 = add i64 %iv, 1
 ; CHECK-NEXT:    --> {1,+,2}<%loop> U: full-set S: full-set Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %div.1 = udiv i64 %iv.1, 4
-; CHECK-NEXT:    --> ({1,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
+; CHECK-NEXT:    --> ({0,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %iv.2 = add i64 %iv, 2
 ; CHECK-NEXT:    --> {2,+,2}<%loop> U: [0,-1) S: [-9223372036854775808,9223372036854775807) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %div.2 = udiv i64 %iv.2, 4
@@ -21,7 +21,7 @@ define void @test_step2_div4(i64 %n) {
 ; CHECK-NEXT:    %iv.neg.1 = add i64 %iv, -1
 ; CHECK-NEXT:    --> {-1,+,2}<%loop> U: full-set S: full-set Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %div.neg.1 = udiv i64 %iv.neg.1, 4
-; CHECK-NEXT:    --> ({-1,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
+; CHECK-NEXT:    --> ({-2,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %iv.next = add i64 %iv, 2
 ; CHECK-NEXT:    --> {2,+,2}<%loop> U: [0,-1) S: [-9223372036854775808,9223372036854775807) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:  Determining loop execution counts for: @test_step2_div4
@@ -114,15 +114,15 @@ define void @test_step4_div4(i64 %n) {
 ; CHECK-NEXT:    %iv.1 = add i64 %iv, 1
 ; CHECK-NEXT:    --> {1,+,4}<%loop> U: full-set S: full-set Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %div.1 = udiv i64 %iv.1, 4
-; CHECK-NEXT:    --> ({1,+,4}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
+; CHECK-NEXT:    --> ({0,+,4}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %iv.2 = add i64 %iv, 2
 ; CHECK-NEXT:    --> {2,+,4}<%loop> U: [0,-1) S: [-9223372036854775808,9223372036854775807) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %div.2 = udiv i64 %iv.2, 4
-; CHECK-NEXT:    --> ({2,+,4}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
+; CHECK-NEXT:    --> ({0,+,4}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %iv.3 = add i64 %iv, 3
 ; CHECK-NEXT:    --> {3,+,4}<%loop> U: full-set S: full-set Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %div.3 = udiv i64 %iv.3, 4
-; CHECK-NEXT:    --> ({3,+,4}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
+; CHECK-NEXT:    --> ({0,+,4}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %iv.4 = add i64 %iv, 4
 ; CHECK-NEXT:    --> {4,+,4}<%loop> U: [0,-3) S: [-9223372036854775808,9223372036854775805) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %div.4 = udiv i64 %iv.4, 4
@@ -130,7 +130,7 @@ define void @test_step4_div4(i64 %n) {
 ; CHECK-NEXT:    %iv.5 = add i64 %iv, 5
 ; CHECK-NEXT:    --> {5,+,4}<%loop> U: full-set S: full-set Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %div.5 = udiv i64 %iv.5, 4
-; CHECK-NEXT:    --> ({5,+,4}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
+; CHECK-NEXT:    --> ({4,+,4}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:    %iv.next = add i64 %iv, 4
 ; CHECK-NEXT:    --> {4,+,4}<%loop> U: [0,-3) S: [-9223372036854775808,9223372036854775805) Exits: <<Unknown>> LoopDispositions: { %loop: Computable }
 ; CHECK-NEXT:  Determining loop execution counts for: @test_step4_div4
@@ -167,3 +167,236 @@ loop:
 exit:
   ret void
 }
+
+define void @test_step2_start_outer_add_rec_step_16(i64 %n, i64 %m) {
+; CHECK-LABEL: 'test_step2_start_outer_add_rec_step_16'
+; CHECK-NEXT:  Classifying expressions for: @test_step2_start_outer_add_rec_step_16
+; CHECK-NEXT:    %outer.iv = phi i64 [ 0, %entry ], [ %outer.iv.next, %outer.latch ]
+; CHECK-NEXT:    --> {0,+,16}<%outer.header> U: [0,-15) S: [-9223372036854775808,9223372036854775793) Exits: <<Unknown>> LoopDispositions: { %outer.header: Computable, %loop: Invariant }
+; CHECK-NEXT:    %iv = phi i64 [ %outer.iv, %outer.header ], [ %iv.next, %loop ]
+; CHECK-NEXT:    --> {{\{\{}}0,+,16}<%outer.header>,+,2}<%loop> U: [0,-1) S: [-9223372036854775808,9223372036854775807) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.0 = udiv i64 %iv, 4
+; CHECK-NEXT:    --> ({{\{\{}}0,+,16}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.1 = add i64 %iv, 1
+; CHECK-NEXT:    --> {{\{\{}}1,+,16}<%outer.header>,+,2}<%loop> U: full-set S: full-set Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.1 = udiv i64 %iv.1, 4
+; CHECK-NEXT:    --> ({{\{\{}}0,+,16}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.2 = add i64 %iv, 2
+; CHECK-NEXT:    --> {{\{\{}}2,+,16}<%outer.header>,+,2}<%loop> U: [0,-1) S: [-9223372036854775808,9223372036854775807) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.2 = udiv i64 %iv.2, 4
+; CHECK-NEXT:    --> ({{\{\{}}2,+,16}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.3 = add i64 %iv, 3
+; CHECK-NEXT:    --> {{\{\{}}3,+,16}<%outer.header>,+,2}<%loop> U: full-set S: full-set Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.3 = udiv i64 %iv.3, 4
+; CHECK-NEXT:    --> ({{\{\{}}2,+,16}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.4 = add i64 %iv, 4
+; CHECK-NEXT:    --> {{\{\{}}4,+,16}<%outer.header>,+,2}<%loop> U: [0,-1) S: [-9223372036854775808,9223372036854775807) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.4 = udiv i64 %iv.4, 4
+; CHECK-NEXT:    --> ({{\{\{}}4,+,16}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.5 = add i64 %iv, 5
+; CHECK-NEXT:    --> {{\{\{}}5,+,16}<%outer.header>,+,2}<%loop> U: full-set S: full-set Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.5 = udiv i64 %iv.5, 4
+; CHECK-NEXT:    --> ({{\{\{}}4,+,16}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.neg.1 = add i64 %iv, -1
+; CHECK-NEXT:    --> {{\{\{}}-1,+,16}<%outer.header>,+,2}<%loop> U: full-set S: full-set Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.neg.1 = udiv i64 %iv.neg.1, 4
+; CHECK-NEXT:    --> ({{\{\{}}-2,+,16}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div3.0 = udiv i64 %iv, 3
+; CHECK-NEXT:    --> ({{\{\{}}0,+,16}<%outer.header>,+,2}<%loop> /u 3) U: [0,6148914691236517205) S: [0,6148914691236517206) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div3.1 = udiv i64 %iv.1, 3
+; CHECK-NEXT:    --> ({{\{\{}}1,+,16}<%outer.header>,+,2}<%loop> /u 3) U: [0,6148914691236517206) S: [0,6148914691236517206) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div3.2 = udiv i64 %iv.2, 3
+; CHECK-NEXT:    --> ({{\{\{}}2,+,16}<%outer.header>,+,2}<%loop> /u 3) U: [0,6148914691236517205) S: [0,6148914691236517206) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div3.4 = udiv i64 %iv.4, 3
+; CHECK-NEXT:    --> ({{\{\{}}4,+,16}<%outer.header>,+,2}<%loop> /u 3) U: [0,6148914691236517205) S: [0,6148914691236517206) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div3.5 = udiv i64 %iv.5, 3
+; CHECK-NEXT:    --> ({{\{\{}}5,+,16}<%outer.header>,+,2}<%loop> /u 3) U: [0,6148914691236517206) S: [0,6148914691236517206) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.next = add i64 %iv, 2
+; CHECK-NEXT:    --> {{\{\{}}2,+,16}<%outer.header>,+,2}<%loop> U: [0,-1) S: [-9223372036854775808,9223372036854775807) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %outer.iv.next = add i64 %outer.iv, 16
+; CHECK-NEXT:    --> {16,+,16}<%outer.header> U: [0,-15) S: [-9223372036854775808,9223372036854775793) Exits: <<Unknown>> LoopDispositions: { %outer.header: Computable, %loop: Invariant }
+; CHECK-NEXT:  Determining loop execution counts for: @test_step2_start_outer_add_rec_step_16
+; CHECK-NEXT:  Loop %loop: Unpredictable backedge-taken count.
+; CHECK-NEXT:  Loop %loop: Unpredictable constant max backedge-taken count.
+; CHECK-NEXT:  Loop %loop: Unpredictable symbolic max backedge-taken count.
+; CHECK-NEXT:  Loop %outer.header: Unpredictable backedge-taken count.
+; CHECK-NEXT:  Loop %outer.header: Unpredictable constant max backedge-taken count.
+; CHECK-NEXT:  Loop %outer.header: Unpredictable symbolic max backedge-taken count.
+; CHECK-NEXT:  Loop %outer.header: Predicated backedge-taken count is (%m /u 16)
+; CHECK-NEXT:   Predicates:
+; CHECK-NEXT:      Equal predicate: (zext i4 (trunc i64 %m to i4) to i64) == 0
+; CHECK-NEXT:  Loop %outer.header: Predicated constant max backedge-taken count is i64 1152921504606846975
+; CHECK-NEXT:   Predicates:
+; CHECK-NEXT:      Equal predicate: (zext i4 (trunc i64 %m to i4) to i64) == 0
+; CHECK-NEXT:  Loop %outer.header: Predicated symbolic max backedge-taken count is (%m /u 16)
+; CHECK-NEXT:   Predicates:
+; CHECK-NEXT:      Equal predicate: (zext i4 (trunc i64 %m to i4) to i64) == 0
+;
+entry:
+  br label %outer.header
+
+outer.header:
+  %outer.iv = phi i64 [ 0, %entry ], [ %outer.iv.next, %outer.latch ]
+  br label %loop
+
+loop:
+  %iv = phi i64 [ %outer.iv, %outer.header ], [ %iv.next, %loop ]
+  %div.0 = udiv i64 %iv, 4
+  call void @use(i64 %div.0)
+  %iv.1 = add i64 %iv, 1
+  %div.1 = udiv i64 %iv.1, 4
+  call void @use(i64 %div.1)
+  %iv.2 = add i64 %iv, 2
+  %div.2 = udiv i64 %iv.2, 4
+  call void @use(i64 %div.2)
+  %iv.3 = add i64 %iv, 3
+  %div.3 = udiv i64 %iv.3, 4
+  call void @use(i64 %div.3)
+  %iv.4 = add i64 %iv, 4
+  %div.4 = udiv i64 %iv.4, 4
+  call void @use(i64 %div.4)
+  %iv.5 = add i64 %iv, 5
+  %div.5 = udiv i64 %iv.5, 4
+  call void @use(i64 %div.5)
+  %iv.neg.1 = add i64 %iv, -1
+  %div.neg.1 = udiv i64 %iv.neg.1, 4
+  call void @use(i64 %div.neg.1)
+  %div3.0 = udiv i64 %iv, 3
+  call void @use(i64 %div3.0)
+  %div3.1 = udiv i64 %iv.1, 3
+  call void @use(i64 %div3.1)
+  %div3.2 = udiv i64 %iv.2, 3
+  call void @use(i64 %div3.2)
+  %div3.4 = udiv i64 %iv.4, 3
+  call void @use(i64 %div3.4)
+  %div3.5 = udiv i64 %iv.5, 3
+  call void @use(i64 %div3.5)
+  %iv.next = add i64 %iv, 2
+  %cond = icmp slt i64 %iv, %n
+  br i1 %cond, label %loop, label %outer.latch
+
+outer.latch:
+  %outer.iv.next = add i64 %outer.iv, 16
+  %outer.ec = icmp eq i64 %outer.iv, %m
+  br i1 %outer.ec, label %exit, label %outer.header
+
+exit:
+  ret void
+}
+
+define void @test_step2_div4_start_outer_add_rec_step_2(i64 %n, i64 %m) {
+; CHECK-LABEL: 'test_step2_div4_start_outer_add_rec_step_2'
+; CHECK-NEXT:  Classifying expressions for: @test_step2_div4_start_outer_add_rec_step_2
+; CHECK-NEXT:    %outer.iv = phi i64 [ 0, %entry ], [ %outer.iv.next, %outer.latch ]
+; CHECK-NEXT:    --> {0,+,2}<%outer.header> U: [0,-1) S: [-9223372036854775808,9223372036854775807) Exits: <<Unknown>> LoopDispositions: { %outer.header: Computable, %loop: Invariant }
+; CHECK-NEXT:    %iv = phi i64 [ %outer.iv, %outer.header ], [ %iv.next, %loop ]
+; CHECK-NEXT:    --> {{\{\{}}0,+,2}<%outer.header>,+,2}<%loop> U: [0,-1) S: [-9223372036854775808,9223372036854775807) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.0 = udiv i64 %iv, 4
+; CHECK-NEXT:    --> ({{\{\{}}0,+,2}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.1 = add i64 %iv, 1
+; CHECK-NEXT:    --> {{\{\{}}1,+,2}<%outer.header>,+,2}<%loop> U: full-set S: full-set Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.1 = udiv i64 %iv.1, 4
+; CHECK-NEXT:    --> ({{\{\{}}0,+,2}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.2 = add i64 %iv, 2
+; CHECK-NEXT:    --> {{\{\{}}2,+,2}<%outer.header>,+,2}<%loop> U: [0,-1) S: [-9223372036854775808,9223372036854775807) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.2 = udiv i64 %iv.2, 4
+; CHECK-NEXT:    --> ({{\{\{}}2,+,2}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.3 = add i64 %iv, 3
+; CHECK-NEXT:    --> {{\{\{}}3,+,2}<%outer.header>,+,2}<%loop> U: full-set S: full-set Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.3 = udiv i64 %iv.3, 4
+; CHECK-NEXT:    --> ({{\{\{}}2,+,2}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.4 = add i64 %iv, 4
+; CHECK-NEXT:    --> {{\{\{}}4,+,2}<%outer.header>,+,2}<%loop> U: [0,-1) S: [-9223372036854775808,9223372036854775807) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.4 = udiv i64 %iv.4, 4
+; CHECK-NEXT:    --> ({{\{\{}}4,+,2}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.5 = add i64 %iv, 5
+; CHECK-NEXT:    --> {{\{\{}}5,+,2}<%outer.header>,+,2}<%loop> U: full-set S: full-set Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.5 = udiv i64 %iv.5, 4
+; CHECK-NEXT:    --> ({{\{\{}}4,+,2}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.neg.1 = add i64 %iv, -1
+; CHECK-NEXT:    --> {{\{\{}}-1,+,2}<%outer.header>,+,2}<%loop> U: full-set S: full-set Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div.neg.1 = udiv i64 %iv.neg.1, 4
+; CHECK-NEXT:    --> ({{\{\{}}-2,+,2}<%outer.header>,+,2}<%loop> /u 4) U: [0,4611686018427387904) S: [0,4611686018427387904) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div3.0 = udiv i64 %iv, 3
+; CHECK-NEXT:    --> ({{\{\{}}0,+,2}<%outer.header>,+,2}<%loop> /u 3) U: [0,6148914691236517205) S: [0,6148914691236517206) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div3.1 = udiv i64 %iv.1, 3
+; CHECK-NEXT:    --> ({{\{\{}}1,+,2}<%outer.header>,+,2}<%loop> /u 3) U: [0,6148914691236517206) S: [0,6148914691236517206) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div3.2 = udiv i64 %iv.2, 3
+; CHECK-NEXT:    --> ({{\{\{}}2,+,2}<%outer.header>,+,2}<%loop> /u 3) U: [0,6148914691236517205) S: [0,6148914691236517206) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div3.4 = udiv i64 %iv.4, 3
+; CHECK-NEXT:    --> ({{\{\{}}4,+,2}<%outer.header>,+,2}<%loop> /u 3) U: [0,6148914691236517205) S: [0,6148914691236517206) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %div3.5 = udiv i64 %iv.5, 3
+; CHECK-NEXT:    --> ({{\{\{}}5,+,2}<%outer.header>,+,2}<%loop> /u 3) U: [0,6148914691236517206) S: [0,6148914691236517206) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %iv.next = add i64 %iv, 2
+; CHECK-NEXT:    --> {{\{\{}}2,+,2}<%outer.header>,+,2}<%loop> U: [0,-1) S: [-9223372036854775808,9223372036854775807) Exits: <<Unknown>> LoopDispositions: { %loop: Computable, %outer.header: Variant }
+; CHECK-NEXT:    %outer.iv.next = add i64 %outer.iv, 2
+; CHECK-NEXT:    --> {2,+,2}<%outer.header> U: [0,-1) S: [-9223372036854775808,9223372036854775807) Exits: <<Unknown>> LoopDispositions: { %outer.header: Computable, %loop: Invariant }
+; CHECK-NEXT:  Determining loop execution counts for: @test_step2_div4_start_outer_add_rec_step_2
+; CHECK-NEXT:  Loop %loop: Unpredictable backedge-taken count.
+; CHECK-NEXT:  Loop %loop: Unpredictable constant max backedge-taken count.
+; CHECK-NEXT:  Loop %loop: Unpredictable symbolic max backedge-taken count.
+; CHECK-NEXT:  Loop %outer.header: Unpredictable backedge-taken count.
+; CHECK-NEXT:  Loop %outer.header: Unpredictable constant max backedge-taken count.
+; CHECK-NEXT:  Loop %outer.header: Unpredictable symbolic max backedge-taken count.
+; CHECK-NEXT:  Loop %outer.header: Predicated backedge-taken count is (%m /u 2)
+; CHECK-NEXT:   Predicates:
+; CHECK-NEXT:      Equal predicate: (zext i1 (trunc i64 %m to i1) to i64) == 0
+; CHECK-NEXT:  Loop %outer.header: Predicated constant max backedge-taken count is i64 9223372036854775807
+; CHECK-NEXT:   Predicates:
+; CHECK-NEXT:      Equal predicate: (zext i1 (trunc i64 %m to i1) to i64) == 0
+; CHECK-NEXT:  Loop %outer.header: Predicated symbolic max backedge-taken count is (%m /u 2)
+; CHECK-NEXT:   Predicates:
+; CHECK-NEXT:      Equal predicate: (zext i1 (trunc i64 %m to i1) to i64) == 0
+;
+entry:
+  br label %outer.header
+
+outer.header:
+  %outer.iv = phi i64 [ 0, %entry ], [ %outer.iv.next, %outer.latch ]
+  br label %loop
+
+loop:
+  %iv = phi i64 [ %outer.iv, %outer.header ], [ %iv.next, %loop ]
+  %div.0 = udiv i64 %iv, 4
+  call void @use(i64 %div.0)
+  %iv.1 = add i64 %iv, 1
+  %div.1 = udiv i64 %iv.1, 4
+  call void @use(i64 %div.1)
+  %iv.2 = add i64 %iv, 2
+  %div.2 = udiv i64 %iv.2, 4
+  call void @use(i64 %div.2)
+  %iv.3 = add i64 %iv, 3
+  %div.3 = udiv i64 %iv.3, 4
+  call void @use(i64 %div.3)
+  %iv.4 = add i64 %iv, 4
+  %div.4 = udiv i64 %iv.4, 4
+  call void @use(i64 %div.4)
+  %iv.5 = add i64 %iv, 5
+  %div.5 = udiv i64 %iv.5, 4
+  call void @use(i64 %div.5)
+  %iv.neg.1 = add i64 %iv, -1
+  %div.neg.1 = udiv i64 %iv.neg.1, 4
+  call void @use(i64 %div.neg.1)
+  %div3.0 = udiv i64 %iv, 3
+  call void @use(i64 %div3.0)
+  %div3.1 = udiv i64 %iv.1, 3
+  call void @use(i64 %div3.1)
+  %div3.2 = udiv i64 %iv.2, 3
+  call void @use(i64 %div3.2)
+  %div3.4 = udiv i64 %iv.4, 3
+  call void @use(i64 %div3.4)
+  %div3.5 = udiv i64 %iv.5, 3
+  call void @use(i64 %div3.5)
+  call void @use(i64 %div.neg.1)
+  %iv.next = add i64 %iv, 2
+  %cond = icmp slt i64 %iv, %n
+  br i1 %cond, label %loop, label %outer.latch
+
+outer.latch:
+  %outer.iv.next = add i64 %outer.iv, 2
+  %outer.ec = icmp eq i64 %outer.iv, %m
+  br i1 %outer.ec, label %exit, label %outer.header
+
+exit:
+  ret void
+}
diff --git a/llvm/test/Analysis/ScalarEvolution/pr44605.ll b/llvm/test/Analysis/ScalarEvolution/pr44605.ll
index ca068d3a6f801..e6f3b6bbeefa2 100644
--- a/llvm/test/Analysis/ScalarEvolution/pr44605.ll
+++ b/llvm/test/Analysis/ScalarEvolution/pr44605.ll
@@ -21,12 +21,12 @@ define i32 @test() {
 ; CHECK-NEXT:    [[TMP1:%.*]] = sub i32 [[TMP0]], [[LOCAL_3_4]]
 ; CHECK-NEXT:    [[TMP2]] = add i32 [[TMP1]], [[LOCAL_3_31]]
 ; CHECK-NEXT:    [[TMP3]] = add nuw nsw i32 [[LOCAL_7_3]], 1
-; CHECK-NEXT:    [[TMP4:%.*]] = icmp ugt i32 [[LOCAL_7_3]], 4
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp samesign ugt i32 [[LOCAL_7_3]], 4
 ; CHECK-NEXT:    br i1 [[TMP4]], label [[LATCH]], label [[INNER]]
 ; CHECK:       latch:
 ; CHECK-NEXT:    [[DOTLCSSA:%.*]] = phi i32 [ [[TMP2]], [[INNER]] ]
 ; CHECK-NEXT:    [[TMP5]] = add nuw nsw i32 [[LOCAL_6_6]], 1
-; CHECK-NEXT:    [[TMP6:%.*]] = icmp ugt i32 [[LOCAL_6_6]], 276
+; CHECK-NEXT:    [[TMP6:%.*]] = icmp samesign ugt i32 [[LOCAL_6_6]], 276
 ; CHECK-NEXT:    br i1 [[TMP6]], label [[RETURN:%.*]], label [[OUTER]]
 ; CHECK:       return:
 ; CHECK-NEXT:    [[DOTLCSSA_LCSSA:%.*]] = phi i32 [ [[DOTLCSSA]], [[LATCH]] ]
diff --git a/llvm/test/Analysis/ValueTracking/assume-queries-counter.ll b/llvm/test/Analysis/ValueTracking/assume-queries-counter.ll
index 0223ab8c4b677..754e43d327f50 100644
--- a/llvm/test/Analysis/ValueTracking/assume-queries-counter.ll
+++ b/llvm/test/Analysis/ValueTracking/assume-queries-counter.ll
@@ -1,5 +1,4 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
-; REQUIRES: asserts
 
 ; RUN: opt < %s -passes=instcombine --debug-counter=assume-queries-counter=0 -S | FileCheck %s --check-prefixes=COUNTER1
 ; RUN: opt < %s -passes=instcombine --debug-counter=assume-queries-counter=1-2 -S | FileCheck %s --check-prefixes=COUNTER2
diff --git a/llvm/test/Assembler/thinlto-summary.ll b/llvm/test/Assembler/thinlto-summary.ll
index e0d866da0d8a2..4c6e47af183e0 100644
--- a/llvm/test/Assembler/thinlto-summary.ll
+++ b/llvm/test/Assembler/thinlto-summary.ll
@@ -46,32 +46,35 @@
 ^18 = gv: (guid: 17, summaries: (alias: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 1), aliasee: ^14)))
 
 ; Test all types of TypeIdInfo on function summaries.
-^19 = gv: (guid: 18, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0), insts: 4, typeIdInfo: (typeTests: (^26, ^28)))))
-^20 = gv: (guid: 19, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0), insts: 8, typeIdInfo: (typeTestAssumeVCalls: (vFuncId: (^29, offset: 16))))))
-^21 = gv: (guid: 20, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0), insts: 5, typeIdInfo: (typeCheckedLoadVCalls: (vFuncId: (^27, offset: 16))))))
-^22 = gv: (guid: 21, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0), insts: 15, typeIdInfo: (typeTestAssumeConstVCalls: ((vFuncId: (^29, offset: 16), args: (42)), (vFuncId: (^29, offset: 24)))))))
-^23 = gv: (guid: 22, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0), insts: 5, typeIdInfo: (typeCheckedLoadConstVCalls: ((vFuncId: (^30, offset: 16), args: (42)))))))
+^19 = gv: (guid: 18, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0), insts: 4, typeIdInfo: (typeTests: (^27, ^29)))))
+^20 = gv: (guid: 19, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0), insts: 8, typeIdInfo: (typeTestAssumeVCalls: (vFuncId: (^30, offset: 16))))))
+^21 = gv: (guid: 20, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0), insts: 5, typeIdInfo: (typeCheckedLoadVCalls: (vFuncId: (^28, offset: 16))))))
+^22 = gv: (guid: 21, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0), insts: 15, typeIdInfo: (typeTestAssumeConstVCalls: ((vFuncId: (^30, offset: 16), args: (42)), (vFuncId: (^30, offset: 24)))))))
+^23 = gv: (guid: 22, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0), insts: 5, typeIdInfo: (typeCheckedLoadConstVCalls: ((vFuncId: (^31, offset: 16), args: (42)))))))
 
 ; Function summary with an import type of declaration
 ^24 = gv: (guid: 23, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, importType: declaration), insts: 5)))
 
+; Alias summary with null aliasee.
+^25 = gv: (guid: 24, summaries: (alias: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 1), aliasee: null)))
+
 ; GUID that are 64-bit
 
-^25 = gv: (guid: 9123456789101112131, summaries: (function: (module: ^0, flags: (linkage: internal, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 1, importType: definition), insts: 1)))
+^26 = gv: (guid: 9123456789101112131, summaries: (function: (module: ^0, flags: (linkage: internal, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 1, importType: definition), insts: 1)))
 
 ; Test TypeId summaries:
 
-^26 = typeid: (name: "_ZTS1C", summary: (typeTestRes: (kind: single, sizeM1BitWidth: 0)))
+^27 = typeid: (name: "_ZTS1C", summary: (typeTestRes: (kind: single, sizeM1BitWidth: 0)))
 ; Test TypeId with other optional fields (alignLog2/sizeM1/bitMask/inlineBits)
-^27 = typeid: (name: "_ZTS1B", summary: (typeTestRes: (kind: inline, sizeM1BitWidth: 0, alignLog2: 1, sizeM1: 2, bitMask: 3, inlineBits: 4)))
+^28 = typeid: (name: "_ZTS1B", summary: (typeTestRes: (kind: inline, sizeM1BitWidth: 0, alignLog2: 1, sizeM1: 2, bitMask: 3, inlineBits: 4)))
 ; Test the AllOnes resolution, and all kinds of WholeProgramDevirtResolution
 ; types, including all optional resolution by argument kinds.
-^28 = typeid: (name: "_ZTS1A", summary: (typeTestRes: (kind: allOnes, sizeM1BitWidth: 7), wpdResolutions: ((offset: 0, wpdRes: (kind: branchFunnel)), (offset: 8, wpdRes: (kind: singleImpl, singleImplName: "_ZN1A1nEi")), (offset: 16, wpdRes: (kind: indir, resByArg: (args: (1, 2), byArg: (kind: indir, byte: 2, bit: 3), args: (3), byArg: (kind: uniformRetVal, info: 1), args: (4), byArg: (kind: uniqueRetVal, info: 1), args: (5), byArg: (kind: virtualConstProp)))))))
+^29 = typeid: (name: "_ZTS1A", summary: (typeTestRes: (kind: allOnes, sizeM1BitWidth: 7), wpdResolutions: ((offset: 0, wpdRes: (kind: branchFunnel)), (offset: 8, wpdRes: (kind: singleImpl, singleImplName: "_ZN1A1nEi")), (offset: 16, wpdRes: (kind: indir, resByArg: (args: (1, 2), byArg: (kind: indir, byte: 2, bit: 3), args: (3), byArg: (kind: uniformRetVal, info: 1), args: (4), byArg: (kind: uniqueRetVal, info: 1), args: (5), byArg: (kind: virtualConstProp)))))))
 ; Test the other kinds of type test resoultions
-^29 = typeid: (name: "_ZTS1D", summary: (typeTestRes: (kind: byteArray, sizeM1BitWidth: 0)))
-^30 = typeid: (name: "_ZTS1E", summary: (typeTestRes: (kind: unsat, sizeM1BitWidth: 0)))
-^31 = flags: 8
-^32 = blockcount: 1888
+^30 = typeid: (name: "_ZTS1D", summary: (typeTestRes: (kind: byteArray, sizeM1BitWidth: 0)))
+^31 = typeid: (name: "_ZTS1E", summary: (typeTestRes: (kind: unsat, sizeM1BitWidth: 0)))
+^32 = flags: 8
+^33 = blockcount: 1888
 
 ; Make sure we get back from llvm-dis essentially what we put in via llvm-as.
 ; CHECK: ^0 = module: (path: "thinlto-summary1.o", hash: (1369602428, 2747878711, 259090915, 2507395659, 1141468049))
@@ -95,20 +98,21 @@
 ; CHECK: ^16 = gv: (guid: 15, summaries: (function: (module: ^1, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, canAutoHide: 0, importType: definition), insts: 1, funcFlags: (readNone: 1, readOnly: 0, noRecurse: 1, returnDoesNotAlias: 0, noInline: 0, alwaysInline: 1, noUnwind: 1, mayThrow: 1, hasUnknownCall: 1, mustBeUnreachable: 0))))
 ; CHECK: ^17 = gv: (guid: 16, summaries: (function: (module: ^1, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, canAutoHide: 0, importType: definition), insts: 1, funcFlags: (readNone: 0, readOnly: 1, noRecurse: 0, returnDoesNotAlias: 1, noInline: 0, alwaysInline: 0, noUnwind: 0, mayThrow: 0, hasUnknownCall: 0, mustBeUnreachable: 1), calls: ((callee: ^15)))))
 ; CHECK: ^18 = gv: (guid: 17, summaries: (alias: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 1, canAutoHide: 0, importType: definition), aliasee: ^14)))
-; CHECK: ^19 = gv: (guid: 18, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, canAutoHide: 0, importType: definition), insts: 4, typeIdInfo: (typeTests: (^26, ^28)))))
-; CHECK: ^20 = gv: (guid: 19, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, canAutoHide: 0, importType: definition), insts: 8, typeIdInfo: (typeTestAssumeVCalls: (vFuncId: (^29, offset: 16))))))
-; CHECK: ^21 = gv: (guid: 20, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, canAutoHide: 0, importType: definition), insts: 5, typeIdInfo: (typeCheckedLoadVCalls: (vFuncId: (^27, offset: 16))))))
-; CHECK: ^22 = gv: (guid: 21, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, canAutoHide: 0, importType: definition), insts: 15, typeIdInfo: (typeTestAssumeConstVCalls: ((vFuncId: (^29, offset: 16), args: (42)), (vFuncId: (^29, offset: 24)))))))
-; CHECK: ^23 = gv: (guid: 22, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, canAutoHide: 0, importType: definition), insts: 5, typeIdInfo: (typeCheckedLoadConstVCalls: ((vFuncId: (^30, offset: 16), args: (42)))))))
+; CHECK: ^19 = gv: (guid: 18, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, canAutoHide: 0, importType: definition), insts: 4, typeIdInfo: (typeTests: (^27, ^29)))))
+; CHECK: ^20 = gv: (guid: 19, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, canAutoHide: 0, importType: definition), insts: 8, typeIdInfo: (typeTestAssumeVCalls: (vFuncId: (^30, offset: 16))))))
+; CHECK: ^21 = gv: (guid: 20, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, canAutoHide: 0, importType: definition), insts: 5, typeIdInfo: (typeCheckedLoadVCalls: (vFuncId: (^28, offset: 16))))))
+; CHECK: ^22 = gv: (guid: 21, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, canAutoHide: 0, importType: definition), insts: 15, typeIdInfo: (typeTestAssumeConstVCalls: ((vFuncId: (^30, offset: 16), args: (42)), (vFuncId: (^30, offset: 24)))))))
+; CHECK: ^23 = gv: (guid: 22, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, canAutoHide: 0, importType: definition), insts: 5, typeIdInfo: (typeCheckedLoadConstVCalls: ((vFuncId: (^31, offset: 16), args: (42)))))))
 ; CHECK: ^24 = gv: (guid: 23, summaries: (function: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 0, canAutoHide: 0, importType: declaration), insts: 5)))
-; CHECK: ^25 = gv: (guid: 9123456789101112131, summaries: (function: (module: ^0, flags: (linkage: internal, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 1, canAutoHide: 0, importType: definition), insts: 1)))
-; CHECK: ^26 = typeid: (name: "_ZTS1C", summary: (typeTestRes: (kind: single, sizeM1BitWidth: 0))) ; guid = 1884921850105019584
-; CHECK: ^27 = typeid: (name: "_ZTS1B", summary: (typeTestRes: (kind: inline, sizeM1BitWidth: 0, alignLog2: 1, sizeM1: 2, bitMask: 3, inlineBits: 4))) ; guid = 6203814149063363976
-; CHECK: ^28 = typeid: (name: "_ZTS1A", summary: (typeTestRes: (kind: allOnes, sizeM1BitWidth: 7), wpdResolutions: ((offset: 0, wpdRes: (kind: branchFunnel)), (offset: 8, wpdRes: (kind: singleImpl, singleImplName: "_ZN1A1nEi")), (offset: 16, wpdRes: (kind: indir, resByArg: (args: (1, 2), byArg: (kind: indir, byte: 2, bit: 3), args: (3), byArg: (kind: uniformRetVal, info: 1), args: (4), byArg: (kind: uniqueRetVal, info: 1), args: (5), byArg: (kind: virtualConstProp))))))) ; guid = 7004155349499253778
-; CHECK: ^29 = typeid: (name: "_ZTS1D", summary: (typeTestRes: (kind: byteArray, sizeM1BitWidth: 0))) ; guid = 9614786172484273522
-; CHECK: ^30 = typeid: (name: "_ZTS1E", summary: (typeTestRes: (kind: unsat, sizeM1BitWidth: 0))) ; guid = 17437243864166745132
-; CHECK: ^31 = flags: 8
-; CHECK: ^32 = blockcount: 1888
+; CHECK: ^25 = gv: (guid: 24, summaries: (alias: (module: ^0, flags: (linkage: external, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 1, canAutoHide: 0, importType: definition), aliasee: null)))
+; CHECK: ^26 = gv: (guid: 9123456789101112131, summaries: (function: (module: ^0, flags: (linkage: internal, visibility: default, notEligibleToImport: 0, live: 0, dsoLocal: 1, canAutoHide: 0, importType: definition), insts: 1)))
+; CHECK: ^27 = typeid: (name: "_ZTS1C", summary: (typeTestRes: (kind: single, sizeM1BitWidth: 0))) ; guid = 1884921850105019584
+; CHECK: ^28 = typeid: (name: "_ZTS1B", summary: (typeTestRes: (kind: inline, sizeM1BitWidth: 0, alignLog2: 1, sizeM1: 2, bitMask: 3, inlineBits: 4))) ; guid = 6203814149063363976
+; CHECK: ^29 = typeid: (name: "_ZTS1A", summary: (typeTestRes: (kind: allOnes, sizeM1BitWidth: 7), wpdResolutions: ((offset: 0, wpdRes: (kind: branchFunnel)), (offset: 8, wpdRes: (kind: singleImpl, singleImplName: "_ZN1A1nEi")), (offset: 16, wpdRes: (kind: indir, resByArg: (args: (1, 2), byArg: (kind: indir, byte: 2, bit: 3), args: (3), byArg: (kind: uniformRetVal, info: 1), args: (4), byArg: (kind: uniqueRetVal, info: 1), args: (5), byArg: (kind: virtualConstProp))))))) ; guid = 7004155349499253778
+; CHECK: ^30 = typeid: (name: "_ZTS1D", summary: (typeTestRes: (kind: byteArray, sizeM1BitWidth: 0))) ; guid = 9614786172484273522
+; CHECK: ^31 = typeid: (name: "_ZTS1E", summary: (typeTestRes: (kind: unsat, sizeM1BitWidth: 0))) ; guid = 17437243864166745132
+; CHECK: ^32 = flags: 8
+; CHECK: ^33 = blockcount: 1888
 
 ; Make sure parsing of a non-summary entry containing a ":" does not fail
 ; after summary parsing, which handles colons differently.
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/counter-fallback.ll b/llvm/test/CodeGen/AArch64/GlobalISel/counter-fallback.ll
index 72c8103de875d..aecb939d4ffe4 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/counter-fallback.ll
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/counter-fallback.ll
@@ -2,8 +2,6 @@
 ; RUN: llc -mtriple=aarch64-- -global-isel -global-isel-abort=0 -debug-counter=globalisel=0 %s -o - 2>/dev/null | FileCheck %s
 ; RUN: llc -mtriple=aarch64-- -global-isel -global-isel-abort=0 -debug-counter=globalisel=0 %s -o /dev/null 2>&1 | FileCheck %s --check-prefix=DEBUG
 
-; REQUIRES: asserts
-
 ; DEBUG-NOT: Falling back for function test1
 ; DEBUG: Falling back for function test2
 
diff --git a/llvm/test/CodeGen/AArch64/O0-pipeline.ll b/llvm/test/CodeGen/AArch64/O0-pipeline.ll
index abc67eec32391..96f5e5a4afb3e 100644
--- a/llvm/test/CodeGen/AArch64/O0-pipeline.ll
+++ b/llvm/test/CodeGen/AArch64/O0-pipeline.ll
@@ -5,9 +5,11 @@
 
 ; CHECK-LABEL: Pass Arguments:
 ; CHECK-NEXT: Target Library Information
+; CHECK-NEXT: Runtime Library Function Analysis
 ; CHECK-NEXT: Target Pass Configuration
 ; CHECK-NEXT: Machine Module Information
 ; CHECK-NEXT: Target Transform Information
+; CHECK-NEXT: Library Function Lowering Analysis
 ; CHECK-NEXT: Create Garbage Collector Module Metadata
 ; CHECK-NEXT: Profile summary info
 ; CHECK-NEXT: Assumption Cache Tracker
diff --git a/llvm/test/CodeGen/AArch64/O3-pipeline.ll b/llvm/test/CodeGen/AArch64/O3-pipeline.ll
index e1481667a4ab7..2102029e608ab 100644
--- a/llvm/test/CodeGen/AArch64/O3-pipeline.ll
+++ b/llvm/test/CodeGen/AArch64/O3-pipeline.ll
@@ -5,9 +5,11 @@
 
 ; CHECK-LABEL: Pass Arguments:
 ; CHECK-NEXT: Target Library Information
+; CHECK-NEXT: Runtime Library Function Analysis
 ; CHECK-NEXT: Target Pass Configuration
 ; CHECK-NEXT: Machine Module Information
 ; CHECK-NEXT: Target Transform Information
+; CHECK-NEXT: Library Function Lowering Analysis
 ; CHECK-NEXT: Assumption Cache Tracker
 ; CHECK-NEXT: Profile summary info
 ; CHECK-NEXT: Type-Based Alias Analysis
diff --git a/llvm/test/CodeGen/AArch64/arm64-int-neon.ll b/llvm/test/CodeGen/AArch64/arm64-int-neon.ll
new file mode 100644
index 0000000000000..f33d41b0dd6ef
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/arm64-int-neon.ll
@@ -0,0 +1,238 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple aarch64-unknown-unknown -mattr=+fprcvt,+fullfp16 | FileCheck %s --check-prefixes=CHECK
+; RUN: llc < %s -mtriple aarch64-unknown-unknown -global-isel -global-isel-abort=2 -mattr=+fprcvt,+fullfp16 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-GI
+
+
+; CHECK-GI:       warning: Instruction selection used fallback path for test_sqrshl_s32
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for test_sqrshl_s64
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for test_sqshl_s32
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for test_sqshl_s64
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for test_uqrshl_s32
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for test_uqrshl_s64
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for test_uqshl_s32
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for test_uqshl_s64
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for test_uqadd_s32
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for test_uqadd_s64
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for test_uqsub_s32
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for test_uqsub_s64
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for test_sqdmulls_scalar
+
+define i32 @test_sqrshl_s32(float noundef %a){
+; CHECK-LABEL: test_sqrshl_s32:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs s0, s0
+; CHECK-NEXT:    sqrshl s0, s0, s0
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i32 @llvm.aarch64.neon.fcvtzs.i32.f32(float %a)
+  %res = tail call i32 @llvm.aarch64.neon.sqrshl.i32(i32 %cvt, i32 %cvt)
+  ret i32 %res
+}
+
+define i64 @test_sqrshl_s64(float noundef %a){
+; CHECK-LABEL: test_sqrshl_s64:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs d0, s0
+; CHECK-NEXT:    sqrshl d0, d0, d0
+; CHECK-NEXT:    fmov x0, d0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i64 @llvm.aarch64.neon.fcvtzs.i64.f32(float %a)
+  %res = tail call i64 @llvm.aarch64.neon.sqrshl.i64(i64 %cvt, i64 %cvt)
+  ret i64 %res
+}
+
+define i32 @test_sqshl_s32(float noundef %a) {
+; CHECK-LABEL: test_sqshl_s32:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs s0, s0
+; CHECK-NEXT:    sqshl s0, s0, s0
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i32 @llvm.aarch64.neon.fcvtzs.i32.f32(float %a)
+  %res = tail call i32 @llvm.aarch64.neon.sqshl.i32(i32 %cvt, i32 %cvt)
+  ret i32 %res
+}
+
+define i64 @test_sqshl_s64(float noundef %a) {
+; CHECK-LABEL: test_sqshl_s64:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs d0, s0
+; CHECK-NEXT:    sqshl d0, d0, d0
+; CHECK-NEXT:    fmov x0, d0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i64 @llvm.aarch64.neon.fcvtzs.i64.f32(float %a)
+  %res = tail call i64 @llvm.aarch64.neon.sqshl.i64(i64 %cvt, i64 %cvt)
+  ret i64 %res
+}
+
+define i32 @test_uqrshl_s32(float noundef %a) {
+; CHECK-LABEL: test_uqrshl_s32:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs s0, s0
+; CHECK-NEXT:    uqrshl s0, s0, s0
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i32 @llvm.aarch64.neon.fcvtzs.i32.f32(float %a)
+  %res = tail call i32 @llvm.aarch64.neon.uqrshl.i32(i32 %cvt, i32 %cvt)
+  ret i32 %res
+}
+
+define i64 @test_uqrshl_s64(float noundef %a) {
+; CHECK-LABEL: test_uqrshl_s64:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs d0, s0
+; CHECK-NEXT:    uqrshl d0, d0, d0
+; CHECK-NEXT:    fmov x0, d0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i64 @llvm.aarch64.neon.fcvtzs.i64.f32(float %a)
+  %res = tail call i64 @llvm.aarch64.neon.uqrshl.i64(i64 %cvt, i64 %cvt)
+  ret i64 %res
+}
+
+define i32 @test_uqshl_s32(float noundef %a) {
+; CHECK-LABEL: test_uqshl_s32:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs s0, s0
+; CHECK-NEXT:    uqshl s0, s0, s0
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i32 @llvm.aarch64.neon.fcvtzs.i32.f32(float %a)
+  %res = tail call i32 @llvm.aarch64.neon.uqshl.i32(i32 %cvt, i32 %cvt)
+  ret i32 %res
+}
+
+define i64 @test_uqshl_s64(float noundef %a) {
+; CHECK-LABEL: test_uqshl_s64:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs d0, s0
+; CHECK-NEXT:    uqshl d0, d0, d0
+; CHECK-NEXT:    fmov x0, d0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i64 @llvm.aarch64.neon.fcvtzs.i64.f32(float %a)
+  %res = tail call i64 @llvm.aarch64.neon.uqshl.i64(i64 %cvt, i64 %cvt)
+  ret i64 %res
+}
+
+define i32 @test_sqadd_s32(float noundef %a) {
+; CHECK-LABEL: test_sqadd_s32:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs s0, s0
+; CHECK-NEXT:    sqadd s0, s0, s0
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i32 @llvm.aarch64.neon.fcvtzs.i32.f32(float %a)
+  %res = tail call i32 @llvm.aarch64.neon.sqadd.i32(i32 %cvt, i32 %cvt)
+  ret i32 %res
+}
+
+define i64 @test_sqadd_s64(float noundef %a) {
+; CHECK-LABEL: test_sqadd_s64:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs d0, s0
+; CHECK-NEXT:    sqadd d0, d0, d0
+; CHECK-NEXT:    fmov x0, d0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i64 @llvm.aarch64.neon.fcvtzs.i64.f32(float %a)
+  %res = tail call i64 @llvm.aarch64.neon.sqadd.i64(i64 %cvt, i64 %cvt)
+  ret i64 %res
+}
+
+define i32 @test_sqsub_s32(float noundef %a) {
+; CHECK-LABEL: test_sqsub_s32:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs s0, s0
+; CHECK-NEXT:    sqsub s0, s0, s0
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i32 @llvm.aarch64.neon.fcvtzs.i32.f32(float %a)
+  %res = tail call i32 @llvm.aarch64.neon.sqsub.i32(i32 %cvt, i32 %cvt)
+  ret i32 %res
+}
+
+define i64 @test_sqsub_s64(float noundef %a) {
+; CHECK-LABEL: test_sqsub_s64:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs d0, s0
+; CHECK-NEXT:    sqsub d0, d0, d0
+; CHECK-NEXT:    fmov x0, d0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i64 @llvm.aarch64.neon.fcvtzs.i64.f32(float %a)
+  %res = tail call i64 @llvm.aarch64.neon.sqsub.i64(i64 %cvt, i64 %cvt)
+  ret i64 %res
+}
+
+define i32 @test_uqadd_s32(float noundef %a) {
+; CHECK-LABEL: test_uqadd_s32:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs s0, s0
+; CHECK-NEXT:    uqadd s0, s0, s0
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i32 @llvm.aarch64.neon.fcvtzs.i32.f32(float %a)
+  %res = tail call i32 @llvm.aarch64.neon.uqadd.i32(i32 %cvt, i32 %cvt)
+  ret i32 %res
+}
+
+define i64 @test_uqadd_s64(float noundef %a) {
+; CHECK-LABEL: test_uqadd_s64:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs d0, s0
+; CHECK-NEXT:    uqadd d0, d0, d0
+; CHECK-NEXT:    fmov x0, d0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i64 @llvm.aarch64.neon.fcvtzs.i64.f32(float %a)
+  %res = tail call i64 @llvm.aarch64.neon.uqadd.i64(i64 %cvt, i64 %cvt)
+  ret i64 %res
+}
+
+define i32 @test_uqsub_s32(float noundef %a) {
+; CHECK-LABEL: test_uqsub_s32:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs s0, s0
+; CHECK-NEXT:    uqsub s0, s0, s0
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i32 @llvm.aarch64.neon.fcvtzs.i32.f32(float %a)
+  %res = tail call i32 @llvm.aarch64.neon.uqsub.i32(i32 %cvt, i32 %cvt)
+  ret i32 %res
+}
+
+define i64 @test_uqsub_s64(float noundef %a) {
+; CHECK-LABEL: test_uqsub_s64:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvtzs d0, s0
+; CHECK-NEXT:    uqsub d0, d0, d0
+; CHECK-NEXT:    fmov x0, d0
+; CHECK-NEXT:    ret
+entry:
+  %cvt = tail call i64 @llvm.aarch64.neon.fcvtzs.i64.f32(float %a)
+  %res = tail call i64 @llvm.aarch64.neon.uqsub.i64(i64 %cvt, i64 %cvt)
+  ret i64 %res
+}
+
+define i64 @test_sqdmulls_scalar(float %A){
+; CHECK-LABEL: test_sqdmulls_scalar:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    fcvtzs s0, s0
+; CHECK-NEXT:    sqdmull d0, s0, s0
+; CHECK-NEXT:    fmov x0, d0
+; CHECK-NEXT:    ret
+  %cvt = tail call i32 @llvm.aarch64.neon.fcvtzs.i32.f32(float %A)
+  %prod = call i64 @llvm.aarch64.neon.sqdmulls.scalar(i32  %cvt, i32  %cvt)
+  ret i64 %prod
+}
diff --git a/llvm/test/CodeGen/AArch64/arm64-vmul.ll b/llvm/test/CodeGen/AArch64/arm64-vmul.ll
index 90abc7d389c13..712452c70aab1 100644
--- a/llvm/test/CodeGen/AArch64/arm64-vmul.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-vmul.ll
@@ -1721,14 +1721,23 @@ define <2 x i64> @sqdmlal2_lane_2d(<4 x i32> %A, <4 x i32> %B, <2 x i64> %C) nou
 }
 
 define i32 @sqdmlal_lane_1s(i32 %A, i16 %B, <4 x i16> %C) nounwind {
-; CHECK-LABEL: sqdmlal_lane_1s:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    fmov s1, w1
-; CHECK-NEXT:    fmov s2, w0
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $q0
-; CHECK-NEXT:    sqdmlal s2, h1, v0.h[1]
-; CHECK-NEXT:    fmov w0, s2
-; CHECK-NEXT:    ret
+; CHECK-SD-LABEL: sqdmlal_lane_1s:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    fmov s1, w0
+; CHECK-SD-NEXT:    fmov s2, w1
+; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 def $q0
+; CHECK-SD-NEXT:    sqdmlal s1, h2, v0.h[1]
+; CHECK-SD-NEXT:    fmov w0, s1
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: sqdmlal_lane_1s:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    fmov s1, w1
+; CHECK-GI-NEXT:    fmov s2, w0
+; CHECK-GI-NEXT:    // kill: def $d0 killed $d0 def $q0
+; CHECK-GI-NEXT:    sqdmlal s2, h1, v0.h[1]
+; CHECK-GI-NEXT:    fmov w0, s2
+; CHECK-GI-NEXT:    ret
   %lhs = insertelement <4 x i16> undef, i16 %B, i32 0
   %rhs = shufflevector <4 x i16> %C, <4 x i16> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
   %prod.vec = call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %lhs, <4 x i16> %rhs)
@@ -1739,14 +1748,23 @@ define i32 @sqdmlal_lane_1s(i32 %A, i16 %B, <4 x i16> %C) nounwind {
 declare i32 @llvm.aarch64.neon.sqadd.i32(i32, i32)
 
 define i32 @sqdmlsl_lane_1s(i32 %A, i16 %B, <4 x i16> %C) nounwind {
-; CHECK-LABEL: sqdmlsl_lane_1s:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    fmov s1, w1
-; CHECK-NEXT:    fmov s2, w0
-; CHECK-NEXT:    // kill: def $d0 killed $d0 def $q0
-; CHECK-NEXT:    sqdmlsl s2, h1, v0.h[1]
-; CHECK-NEXT:    fmov w0, s2
-; CHECK-NEXT:    ret
+; CHECK-SD-LABEL: sqdmlsl_lane_1s:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    fmov s1, w0
+; CHECK-SD-NEXT:    fmov s2, w1
+; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 def $q0
+; CHECK-SD-NEXT:    sqdmlsl s1, h2, v0.h[1]
+; CHECK-SD-NEXT:    fmov w0, s1
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: sqdmlsl_lane_1s:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    fmov s1, w1
+; CHECK-GI-NEXT:    fmov s2, w0
+; CHECK-GI-NEXT:    // kill: def $d0 killed $d0 def $q0
+; CHECK-GI-NEXT:    sqdmlsl s2, h1, v0.h[1]
+; CHECK-GI-NEXT:    fmov w0, s2
+; CHECK-GI-NEXT:    ret
   %lhs = insertelement <4 x i16> undef, i16 %B, i32 0
   %rhs = shufflevector <4 x i16> %C, <4 x i16> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
   %prod.vec = call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %lhs, <4 x i16> %rhs)
@@ -1757,24 +1775,14 @@ define i32 @sqdmlsl_lane_1s(i32 %A, i16 %B, <4 x i16> %C) nounwind {
 declare i32 @llvm.aarch64.neon.sqsub.i32(i32, i32)
 
 define i32 @sqadd_lane1_sqdmull4s(i32 %A, <4 x i16> %B, <4 x i16> %C) nounwind {
-; CHECK-SD-LABEL: sqadd_lane1_sqdmull4s:
-; CHECK-SD:       // %bb.0:
-; CHECK-SD-NEXT:    sqdmull v0.4s, v0.4h, v1.4h
-; CHECK-SD-NEXT:    mov w8, v0.s[1]
-; CHECK-SD-NEXT:    fmov s0, w0
-; CHECK-SD-NEXT:    fmov s1, w8
-; CHECK-SD-NEXT:    sqadd s0, s0, s1
-; CHECK-SD-NEXT:    fmov w0, s0
-; CHECK-SD-NEXT:    ret
-;
-; CHECK-GI-LABEL: sqadd_lane1_sqdmull4s:
-; CHECK-GI:       // %bb.0:
-; CHECK-GI-NEXT:    sqdmull v0.4s, v0.4h, v1.4h
-; CHECK-GI-NEXT:    fmov s1, w0
-; CHECK-GI-NEXT:    mov s0, v0.s[1]
-; CHECK-GI-NEXT:    sqadd s0, s1, s0
-; CHECK-GI-NEXT:    fmov w0, s0
-; CHECK-GI-NEXT:    ret
+; CHECK-LABEL: sqadd_lane1_sqdmull4s:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sqdmull v0.4s, v0.4h, v1.4h
+; CHECK-NEXT:    fmov s1, w0
+; CHECK-NEXT:    mov s0, v0.s[1]
+; CHECK-NEXT:    sqadd s0, s1, s0
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %prod.vec = call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %B, <4 x i16> %C)
   %prod = extractelement <4 x i32> %prod.vec, i32 1
   %res = call i32 @llvm.aarch64.neon.sqadd.i32(i32 %A, i32 %prod)
@@ -1782,24 +1790,14 @@ define i32 @sqadd_lane1_sqdmull4s(i32 %A, <4 x i16> %B, <4 x i16> %C) nounwind {
 }
 
 define i32 @sqsub_lane1_sqdmull4s(i32 %A, <4 x i16> %B, <4 x i16> %C) nounwind {
-; CHECK-SD-LABEL: sqsub_lane1_sqdmull4s:
-; CHECK-SD:       // %bb.0:
-; CHECK-SD-NEXT:    sqdmull v0.4s, v0.4h, v1.4h
-; CHECK-SD-NEXT:    mov w8, v0.s[1]
-; CHECK-SD-NEXT:    fmov s0, w0
-; CHECK-SD-NEXT:    fmov s1, w8
-; CHECK-SD-NEXT:    sqsub s0, s0, s1
-; CHECK-SD-NEXT:    fmov w0, s0
-; CHECK-SD-NEXT:    ret
-;
-; CHECK-GI-LABEL: sqsub_lane1_sqdmull4s:
-; CHECK-GI:       // %bb.0:
-; CHECK-GI-NEXT:    sqdmull v0.4s, v0.4h, v1.4h
-; CHECK-GI-NEXT:    fmov s1, w0
-; CHECK-GI-NEXT:    mov s0, v0.s[1]
-; CHECK-GI-NEXT:    sqsub s0, s1, s0
-; CHECK-GI-NEXT:    fmov w0, s0
-; CHECK-GI-NEXT:    ret
+; CHECK-LABEL: sqsub_lane1_sqdmull4s:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sqdmull v0.4s, v0.4h, v1.4h
+; CHECK-NEXT:    fmov s1, w0
+; CHECK-NEXT:    mov s0, v0.s[1]
+; CHECK-NEXT:    sqsub s0, s1, s0
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    ret
   %prod.vec = call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %B, <4 x i16> %C)
   %prod = extractelement <4 x i32> %prod.vec, i32 1
   %res = call i32 @llvm.aarch64.neon.sqsub.i32(i32 %A, i32 %prod)
@@ -1809,11 +1807,11 @@ define i32 @sqsub_lane1_sqdmull4s(i32 %A, <4 x i16> %B, <4 x i16> %C) nounwind {
 define i64 @sqdmlal_lane_1d(i64 %A, i32 %B, <2 x i32> %C) nounwind {
 ; CHECK-LABEL: sqdmlal_lane_1d:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    fmov d1, x0
-; CHECK-NEXT:    fmov s2, w1
+; CHECK-NEXT:    fmov s1, w1
+; CHECK-NEXT:    fmov d2, x0
 ; CHECK-NEXT:    // kill: def $d0 killed $d0 def $q0
-; CHECK-NEXT:    sqdmlal d1, s2, v0.s[1]
-; CHECK-NEXT:    fmov x0, d1
+; CHECK-NEXT:    sqdmlal d2, s1, v0.s[1]
+; CHECK-NEXT:    fmov x0, d2
 ; CHECK-NEXT:    ret
   %rhs = extractelement <2 x i32> %C, i32 1
   %prod = call i64 @llvm.aarch64.neon.sqdmulls.scalar(i32 %B, i32 %rhs)
@@ -1826,11 +1824,11 @@ declare i64 @llvm.aarch64.neon.sqadd.i64(i64, i64)
 define i64 @sqdmlsl_lane_1d(i64 %A, i32 %B, <2 x i32> %C) nounwind {
 ; CHECK-LABEL: sqdmlsl_lane_1d:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    fmov d1, x0
-; CHECK-NEXT:    fmov s2, w1
+; CHECK-NEXT:    fmov s1, w1
+; CHECK-NEXT:    fmov d2, x0
 ; CHECK-NEXT:    // kill: def $d0 killed $d0 def $q0
-; CHECK-NEXT:    sqdmlsl d1, s2, v0.s[1]
-; CHECK-NEXT:    fmov x0, d1
+; CHECK-NEXT:    sqdmlsl d2, s1, v0.s[1]
+; CHECK-NEXT:    fmov x0, d2
 ; CHECK-NEXT:    ret
   %rhs = extractelement <2 x i32> %C, i32 1
   %prod = call i64 @llvm.aarch64.neon.sqdmulls.scalar(i32 %B, i32 %rhs)
@@ -3189,14 +3187,23 @@ define <1 x double> @test_fdiv_v1f64(<1 x double> %L, <1 x double> %R) nounwind
 }
 
 define i32 @sqdmlal_s(i16 %A, i16 %B, i32 %C) nounwind {
-; CHECK-LABEL: sqdmlal_s:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    fmov s0, w0
-; CHECK-NEXT:    fmov s1, w1
-; CHECK-NEXT:    fmov s2, w2
-; CHECK-NEXT:    sqdmlal s2, h0, v1.h[0]
-; CHECK-NEXT:    fmov w0, s2
-; CHECK-NEXT:    ret
+; CHECK-SD-LABEL: sqdmlal_s:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    fmov s0, w2
+; CHECK-SD-NEXT:    fmov s1, w0
+; CHECK-SD-NEXT:    fmov s2, w1
+; CHECK-SD-NEXT:    sqdmlal s0, h1, v2.h[0]
+; CHECK-SD-NEXT:    fmov w0, s0
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: sqdmlal_s:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    fmov s0, w0
+; CHECK-GI-NEXT:    fmov s1, w1
+; CHECK-GI-NEXT:    fmov s2, w2
+; CHECK-GI-NEXT:    sqdmlal s2, h0, v1.h[0]
+; CHECK-GI-NEXT:    fmov w0, s2
+; CHECK-GI-NEXT:    ret
   %tmp1 = insertelement <4 x i16> undef, i16 %A, i64 0
   %tmp2 = insertelement <4 x i16> undef, i16 %B, i64 0
   %tmp3 = tail call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %tmp1, <4 x i16> %tmp2)
@@ -3208,11 +3215,11 @@ define i32 @sqdmlal_s(i16 %A, i16 %B, i32 %C) nounwind {
 define i64 @sqdmlal_d(i32 %A, i32 %B, i64 %C) nounwind {
 ; CHECK-LABEL: sqdmlal_d:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    fmov d0, x2
+; CHECK-NEXT:    fmov s0, w1
 ; CHECK-NEXT:    fmov s1, w0
-; CHECK-NEXT:    fmov s2, w1
-; CHECK-NEXT:    sqdmlal d0, s1, s2
-; CHECK-NEXT:    fmov x0, d0
+; CHECK-NEXT:    fmov d2, x2
+; CHECK-NEXT:    sqdmlal d2, s1, s0
+; CHECK-NEXT:    fmov x0, d2
 ; CHECK-NEXT:    ret
   %tmp4 = call i64 @llvm.aarch64.neon.sqdmulls.scalar(i32 %A, i32 %B)
   %tmp5 = call i64 @llvm.aarch64.neon.sqadd.i64(i64 %C, i64 %tmp4)
@@ -3220,14 +3227,23 @@ define i64 @sqdmlal_d(i32 %A, i32 %B, i64 %C) nounwind {
 }
 
 define i32 @sqdmlsl_s(i16 %A, i16 %B, i32 %C) nounwind {
-; CHECK-LABEL: sqdmlsl_s:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    fmov s0, w0
-; CHECK-NEXT:    fmov s1, w1
-; CHECK-NEXT:    fmov s2, w2
-; CHECK-NEXT:    sqdmlsl s2, h0, v1.h[0]
-; CHECK-NEXT:    fmov w0, s2
-; CHECK-NEXT:    ret
+; CHECK-SD-LABEL: sqdmlsl_s:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    fmov s0, w2
+; CHECK-SD-NEXT:    fmov s1, w0
+; CHECK-SD-NEXT:    fmov s2, w1
+; CHECK-SD-NEXT:    sqdmlsl s0, h1, v2.h[0]
+; CHECK-SD-NEXT:    fmov w0, s0
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: sqdmlsl_s:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    fmov s0, w0
+; CHECK-GI-NEXT:    fmov s1, w1
+; CHECK-GI-NEXT:    fmov s2, w2
+; CHECK-GI-NEXT:    sqdmlsl s2, h0, v1.h[0]
+; CHECK-GI-NEXT:    fmov w0, s2
+; CHECK-GI-NEXT:    ret
   %tmp1 = insertelement <4 x i16> undef, i16 %A, i64 0
   %tmp2 = insertelement <4 x i16> undef, i16 %B, i64 0
   %tmp3 = tail call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %tmp1, <4 x i16> %tmp2)
@@ -3239,11 +3255,11 @@ define i32 @sqdmlsl_s(i16 %A, i16 %B, i32 %C) nounwind {
 define i64 @sqdmlsl_d(i32 %A, i32 %B, i64 %C) nounwind {
 ; CHECK-LABEL: sqdmlsl_d:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    fmov d0, x2
+; CHECK-NEXT:    fmov s0, w1
 ; CHECK-NEXT:    fmov s1, w0
-; CHECK-NEXT:    fmov s2, w1
-; CHECK-NEXT:    sqdmlsl d0, s1, s2
-; CHECK-NEXT:    fmov x0, d0
+; CHECK-NEXT:    fmov d2, x2
+; CHECK-NEXT:    sqdmlsl d2, s1, s0
+; CHECK-NEXT:    fmov x0, d2
 ; CHECK-NEXT:    ret
   %tmp4 = call i64 @llvm.aarch64.neon.sqdmulls.scalar(i32 %A, i32 %B)
   %tmp5 = call i64 @llvm.aarch64.neon.sqsub.i64(i64 %C, i64 %tmp4)
diff --git a/llvm/test/CodeGen/AArch64/arm64-vshift.ll b/llvm/test/CodeGen/AArch64/arm64-vshift.ll
index 8ec5434085d6a..d27e2e69f8605 100644
--- a/llvm/test/CodeGen/AArch64/arm64-vshift.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-vshift.ll
@@ -168,10 +168,8 @@ define <1 x i64> @sqshl1d_constant(ptr %A) nounwind {
 define i64 @sqshl_scalar(ptr %A, ptr %B) nounwind {
 ; CHECK-LABEL: sqshl_scalar:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldr x8, [x0]
-; CHECK-NEXT:    ldr x9, [x1]
-; CHECK-NEXT:    fmov d0, x8
-; CHECK-NEXT:    fmov d1, x9
+; CHECK-NEXT:    ldr d0, [x0]
+; CHECK-NEXT:    ldr d1, [x1]
 ; CHECK-NEXT:    sqshl d0, d0, d1
 ; CHECK-NEXT:    fmov x0, d0
 ; CHECK-NEXT:    ret
@@ -363,10 +361,8 @@ define <1 x i64> @uqshl1d_constant(ptr %A) nounwind {
 define i64 @uqshl_scalar(ptr %A, ptr %B) nounwind {
 ; CHECK-LABEL: uqshl_scalar:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldr x8, [x0]
-; CHECK-NEXT:    ldr x9, [x1]
-; CHECK-NEXT:    fmov d0, x8
-; CHECK-NEXT:    fmov d1, x9
+; CHECK-NEXT:    ldr d0, [x0]
+; CHECK-NEXT:    ldr d1, [x1]
 ; CHECK-NEXT:    uqshl d0, d0, d1
 ; CHECK-NEXT:    fmov x0, d0
 ; CHECK-NEXT:    ret
@@ -888,10 +884,8 @@ define <1 x i64> @sqrshl1d_constant(ptr %A) nounwind {
 define i64 @sqrshl_scalar(ptr %A, ptr %B) nounwind {
 ; CHECK-LABEL: sqrshl_scalar:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldr x8, [x0]
-; CHECK-NEXT:    ldr x9, [x1]
-; CHECK-NEXT:    fmov d0, x8
-; CHECK-NEXT:    fmov d1, x9
+; CHECK-NEXT:    ldr d0, [x0]
+; CHECK-NEXT:    ldr d1, [x1]
 ; CHECK-NEXT:    sqrshl d0, d0, d1
 ; CHECK-NEXT:    fmov x0, d0
 ; CHECK-NEXT:    ret
@@ -904,10 +898,9 @@ define i64 @sqrshl_scalar(ptr %A, ptr %B) nounwind {
 define i64 @sqrshl_scalar_constant(ptr %A) nounwind {
 ; CHECK-LABEL: sqrshl_scalar_constant:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldr x9, [x0]
-; CHECK-NEXT:    mov w8, #1 // =0x1
+; CHECK-NEXT:    mov x8, #1 // =0x1
+; CHECK-NEXT:    ldr d0, [x0]
 ; CHECK-NEXT:    fmov d1, x8
-; CHECK-NEXT:    fmov d0, x9
 ; CHECK-NEXT:    sqrshl d0, d0, d1
 ; CHECK-NEXT:    fmov x0, d0
 ; CHECK-NEXT:    ret
@@ -997,10 +990,8 @@ define <1 x i64> @uqrshl1d_constant(ptr %A) nounwind {
 define i64 @uqrshl_scalar(ptr %A, ptr %B) nounwind {
 ; CHECK-LABEL: uqrshl_scalar:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldr x8, [x0]
-; CHECK-NEXT:    ldr x9, [x1]
-; CHECK-NEXT:    fmov d0, x8
-; CHECK-NEXT:    fmov d1, x9
+; CHECK-NEXT:    ldr d0, [x0]
+; CHECK-NEXT:    ldr d1, [x1]
 ; CHECK-NEXT:    uqrshl d0, d0, d1
 ; CHECK-NEXT:    fmov x0, d0
 ; CHECK-NEXT:    ret
@@ -1013,10 +1004,9 @@ define i64 @uqrshl_scalar(ptr %A, ptr %B) nounwind {
 define i64 @uqrshl_scalar_constant(ptr %A) nounwind {
 ; CHECK-LABEL: uqrshl_scalar_constant:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ldr x9, [x0]
-; CHECK-NEXT:    mov w8, #1 // =0x1
+; CHECK-NEXT:    mov x8, #1 // =0x1
+; CHECK-NEXT:    ldr d0, [x0]
 ; CHECK-NEXT:    fmov d1, x8
-; CHECK-NEXT:    fmov d0, x9
 ; CHECK-NEXT:    uqrshl d0, d0, d1
 ; CHECK-NEXT:    fmov x0, d0
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-predicated-scalable.ll b/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-predicated-scalable.ll
index d67aa08125f74..79f0cd345f95c 100644
--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-predicated-scalable.ll
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-reductions-predicated-scalable.ll
@@ -16,32 +16,32 @@ define %"class.std::complex" @complex_mul_v2f64(ptr %a, ptr %b) {
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    movi v0.2d, #0000000000000000
 ; CHECK-NEXT:    movi v1.2d, #0000000000000000
-; CHECK-NEXT:    mov w8, #100 // =0x64
-; CHECK-NEXT:    whilelo p1.d, xzr, x8
-; CHECK-NEXT:    cntd x9
-; CHECK-NEXT:    rdvl x10, #2
+; CHECK-NEXT:    mov w9, #100 // =0x64
+; CHECK-NEXT:    whilelo p1.d, xzr, x9
+; CHECK-NEXT:    mov x8, xzr
+; CHECK-NEXT:    cntd x10
 ; CHECK-NEXT:    ptrue p0.d
-; CHECK-NEXT:    mov x11, x9
+; CHECK-NEXT:    rdvl x11, #2
 ; CHECK-NEXT:  .LBB0_1: // %vector.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    zip2 p2.d, p1.d, p1.d
 ; CHECK-NEXT:    mov z6.d, z0.d
 ; CHECK-NEXT:    mov z7.d, z1.d
 ; CHECK-NEXT:    zip1 p1.d, p1.d, p1.d
+; CHECK-NEXT:    add x8, x8, x10
 ; CHECK-NEXT:    ld1d { z2.d }, p2/z, [x0, #1, mul vl]
 ; CHECK-NEXT:    ld1d { z4.d }, p2/z, [x1, #1, mul vl]
 ; CHECK-NEXT:    ld1d { z3.d }, p1/z, [x0]
 ; CHECK-NEXT:    ld1d { z5.d }, p1/z, [x1]
-; CHECK-NEXT:    add x1, x1, x10
-; CHECK-NEXT:    add x0, x0, x10
+; CHECK-NEXT:    add x1, x1, x11
+; CHECK-NEXT:    add x0, x0, x11
 ; CHECK-NEXT:    fcmla z7.d, p0/m, z4.d, z2.d, #0
 ; CHECK-NEXT:    fcmla z6.d, p0/m, z5.d, z3.d, #0
 ; CHECK-NEXT:    fcmla z7.d, p0/m, z4.d, z2.d, #90
 ; CHECK-NEXT:    fcmla z6.d, p0/m, z5.d, z3.d, #90
 ; CHECK-NEXT:    mov z1.d, p2/m, z7.d
 ; CHECK-NEXT:    mov z0.d, p1/m, z6.d
-; CHECK-NEXT:    whilelo p1.d, x11, x8
-; CHECK-NEXT:    add x11, x11, x9
+; CHECK-NEXT:    whilelo p1.d, x8, x9
 ; CHECK-NEXT:    b.mi .LBB0_1
 ; CHECK-NEXT:  // %bb.2: // %exit.block
 ; CHECK-NEXT:    uzp1 z2.d, z0.d, z1.d
@@ -213,19 +213,18 @@ define %"class.std::complex" @complex_mul_predicated_x2_v2f64(ptr %a, ptr %b, pt
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    movi v0.2d, #0000000000000000
 ; CHECK-NEXT:    movi v1.2d, #0000000000000000
-; CHECK-NEXT:    mov w8, #100 // =0x64
-; CHECK-NEXT:    whilelo p1.d, xzr, x8
-; CHECK-NEXT:    cntd x9
-; CHECK-NEXT:    rdvl x10, #2
+; CHECK-NEXT:    mov w9, #100 // =0x64
+; CHECK-NEXT:    whilelo p1.d, xzr, x9
+; CHECK-NEXT:    mov x8, xzr
+; CHECK-NEXT:    cntd x10
 ; CHECK-NEXT:    ptrue p0.d
-; CHECK-NEXT:    cnth x11
-; CHECK-NEXT:    mov x12, x9
+; CHECK-NEXT:    rdvl x11, #2
 ; CHECK-NEXT:  .LBB2_1: // %vector.body
 ; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    ld1w { z2.d }, p1/z, [x2]
+; CHECK-NEXT:    ld1w { z2.d }, p1/z, [x2, x8, lsl #2]
 ; CHECK-NEXT:    mov z6.d, z0.d
 ; CHECK-NEXT:    mov z7.d, z1.d
-; CHECK-NEXT:    add x2, x2, x11
+; CHECK-NEXT:    add x8, x8, x10
 ; CHECK-NEXT:    and z2.d, z2.d, #0xffffffff
 ; CHECK-NEXT:    cmpne p1.d, p1/z, z2.d, #0
 ; CHECK-NEXT:    zip2 p2.d, p1.d, p1.d
@@ -234,16 +233,15 @@ define %"class.std::complex" @complex_mul_predicated_x2_v2f64(ptr %a, ptr %b, pt
 ; CHECK-NEXT:    ld1d { z4.d }, p2/z, [x1, #1, mul vl]
 ; CHECK-NEXT:    ld1d { z3.d }, p1/z, [x0]
 ; CHECK-NEXT:    ld1d { z5.d }, p1/z, [x1]
-; CHECK-NEXT:    add x1, x1, x10
-; CHECK-NEXT:    add x0, x0, x10
+; CHECK-NEXT:    add x1, x1, x11
+; CHECK-NEXT:    add x0, x0, x11
 ; CHECK-NEXT:    fcmla z7.d, p0/m, z4.d, z2.d, #0
 ; CHECK-NEXT:    fcmla z6.d, p0/m, z5.d, z3.d, #0
 ; CHECK-NEXT:    fcmla z7.d, p0/m, z4.d, z2.d, #90
 ; CHECK-NEXT:    fcmla z6.d, p0/m, z5.d, z3.d, #90
 ; CHECK-NEXT:    mov z1.d, p2/m, z7.d
 ; CHECK-NEXT:    mov z0.d, p1/m, z6.d
-; CHECK-NEXT:    whilelo p1.d, x12, x8
-; CHECK-NEXT:    add x12, x12, x9
+; CHECK-NEXT:    whilelo p1.d, x8, x9
 ; CHECK-NEXT:    b.mi .LBB2_1
 ; CHECK-NEXT:  // %bb.2: // %exit.block
 ; CHECK-NEXT:    uzp1 z2.d, z0.d, z1.d
diff --git a/llvm/test/CodeGen/AArch64/expand-select.ll b/llvm/test/CodeGen/AArch64/expand-select.ll
index 1ca4719d9b6bf..8ad9ea3b7a8d5 100644
--- a/llvm/test/CodeGen/AArch64/expand-select.ll
+++ b/llvm/test/CodeGen/AArch64/expand-select.ll
@@ -4,20 +4,15 @@
 define void @foo(i32 %In1, <2 x i128> %In2, <2 x i128> %In3, ptr %Out) {
 ; CHECK-LABEL: foo:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    movi d0, #0000000000000000
-; CHECK-NEXT:    and w8, w0, #0x1
-; CHECK-NEXT:    ldr x11, [sp]
-; CHECK-NEXT:    fmov s1, w8
-; CHECK-NEXT:    ldp x8, x10, [sp, #8]
-; CHECK-NEXT:    cmeq v0.4s, v1.4s, v0.4s
-; CHECK-NEXT:    fmov w9, s0
-; CHECK-NEXT:    tst w9, #0x1
-; CHECK-NEXT:    csel x8, x5, x8, ne
-; CHECK-NEXT:    csel x9, x4, x11, ne
-; CHECK-NEXT:    stp x9, x8, [x10, #16]
-; CHECK-NEXT:    csel x8, x3, x7, ne
-; CHECK-NEXT:    csel x9, x2, x6, ne
-; CHECK-NEXT:    stp x9, x8, [x10]
+; CHECK-NEXT:    ldp x8, x9, [sp, #8]
+; CHECK-NEXT:    tst w0, #0x1
+; CHECK-NEXT:    ldr x10, [sp]
+; CHECK-NEXT:    csel x8, x5, x8, eq
+; CHECK-NEXT:    csel x10, x4, x10, eq
+; CHECK-NEXT:    stp x10, x8, [x9, #16]
+; CHECK-NEXT:    csel x8, x3, x7, eq
+; CHECK-NEXT:    csel x10, x2, x6, eq
+; CHECK-NEXT:    stp x10, x8, [x9]
 ; CHECK-NEXT:    ret
   %cond = and i32 %In1, 1
   %cbool = icmp eq i32 %cond, 0
@@ -31,22 +26,17 @@ define void @foo(i32 %In1, <2 x i128> %In2, <2 x i128> %In3, ptr %Out) {
 define void @bar(i32 %In1, <2 x i96> %In2, <2 x i96> %In3, ptr %Out) {
 ; CHECK-LABEL: bar:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    movi d0, #0000000000000000
-; CHECK-NEXT:    and w8, w0, #0x1
-; CHECK-NEXT:    ldr x10, [sp, #16]
-; CHECK-NEXT:    fmov s1, w8
-; CHECK-NEXT:    cmeq v0.4s, v1.4s, v0.4s
-; CHECK-NEXT:    fmov w9, s0
-; CHECK-NEXT:    tst w9, #0x1
-; CHECK-NEXT:    ldp x8, x9, [sp]
-; CHECK-NEXT:    csel x11, x2, x6, ne
-; CHECK-NEXT:    str x11, [x10]
-; CHECK-NEXT:    csel x8, x4, x8, ne
-; CHECK-NEXT:    stur x8, [x10, #12]
-; CHECK-NEXT:    csel x8, x5, x9, ne
-; CHECK-NEXT:    csel x9, x3, x7, ne
-; CHECK-NEXT:    str w8, [x10, #20]
-; CHECK-NEXT:    str w9, [x10, #8]
+; CHECK-NEXT:    ldp x8, x10, [sp]
+; CHECK-NEXT:    tst w0, #0x1
+; CHECK-NEXT:    ldr x9, [sp, #16]
+; CHECK-NEXT:    csel x11, x2, x6, eq
+; CHECK-NEXT:    csel x8, x4, x8, eq
+; CHECK-NEXT:    str x11, [x9]
+; CHECK-NEXT:    stur x8, [x9, #12]
+; CHECK-NEXT:    csel x8, x5, x10, eq
+; CHECK-NEXT:    csel x10, x3, x7, eq
+; CHECK-NEXT:    str w8, [x9, #20]
+; CHECK-NEXT:    str w10, [x9, #8]
 ; CHECK-NEXT:    ret
   %cond = and i32 %In1, 1
   %cbool = icmp eq i32 %cond, 0
diff --git a/llvm/test/CodeGen/AArch64/lrint-conv-fp16-win.ll b/llvm/test/CodeGen/AArch64/lrint-conv-fp16-win.ll
index ec9a8b2be8745..0fc8b9a9f57ad 100644
--- a/llvm/test/CodeGen/AArch64/lrint-conv-fp16-win.ll
+++ b/llvm/test/CodeGen/AArch64/lrint-conv-fp16-win.ll
@@ -1,36 +1,49 @@
-; RUN: llc < %s -mtriple=aarch64-windows -mattr=+fullfp16 | FileCheck %s
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=aarch64 -mattr=+neon | FileCheck %s --check-prefixes=CHECK,CHECK-SD
+; RUN: llc < %s -mtriple=aarch64 -mattr=+neon -global-isel -global-isel-abort=2 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-GI
+
+; CHECK-GI:       warning: Instruction selection used fallback path for testmhhs
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for testmhws
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for testmhxs
 
-; CHECK-LABEL: testmhhs:
-; CHECK:       frintx  h0, h0
-; CHECK-NEXT:  fcvtzs  w0, h0
-; CHECK-NEXT:  ret
 define i16 @testmhhs(half %x) {
+; CHECK-LABEL: testmhhs:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvt s0, h0
+; CHECK-NEXT:    frintx s0, s0
+; CHECK-NEXT:    fcvtzs w0, s0
+; CHECK-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lrint.i32.f16(half %x)
   %conv = trunc i32 %0 to i16
   ret i16 %conv
 }
 
-; CHECK-LABEL: testmhws:
-; CHECK:       frintx  h0, h0
-; CHECK-NEXT:  fcvtzs  w0, h0
-; CHECK-NEXT:  ret
 define i32 @testmhws(half %x) {
+; CHECK-LABEL: testmhws:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvt s0, h0
+; CHECK-NEXT:    frintx s0, s0
+; CHECK-NEXT:    fcvtzs w0, s0
+; CHECK-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lrint.i32.f16(half %x)
   ret i32 %0
 }
 
-; CHECK-LABEL: testmhxs:
-; CHECK:       frintx  h0, h0
-; CHECK-NEXT:  fcvtzs  w8, h0
-; CHECK-NEXT:  sxtw    x0, w8
-; CHECK-NEXT:  ret
 define i64 @testmhxs(half %x) {
+; CHECK-LABEL: testmhxs:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    fcvt s0, h0
+; CHECK-NEXT:    frintx s0, s0
+; CHECK-NEXT:    fcvtzs w8, s0
+; CHECK-NEXT:    sxtw x0, w8
+; CHECK-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lrint.i32.f16(half %x)
   %conv = sext i32 %0 to i64
   ret i64 %conv
 }
-
-declare i32 @llvm.lrint.i32.f16(half) nounwind readnone
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; CHECK-GI: {{.*}}
+; CHECK-SD: {{.*}}
diff --git a/llvm/test/CodeGen/AArch64/lrint-conv-win.ll b/llvm/test/CodeGen/AArch64/lrint-conv-win.ll
index 490f009c3fbab..164dbd854173c 100644
--- a/llvm/test/CodeGen/AArch64/lrint-conv-win.ll
+++ b/llvm/test/CodeGen/AArch64/lrint-conv-win.ll
@@ -1,48 +1,59 @@
-; RUN: llc < %s -mtriple=aarch64-windows -mattr=+neon | FileCheck %s
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=aarch64 -mattr=+neon | FileCheck %s --check-prefixes=CHECK,CHECK-SD
+; RUN: llc < %s -mtriple=aarch64 -mattr=+neon -global-isel -global-isel-abort=2 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-GI
+
+; CHECK-GI:       warning: Instruction selection used fallback path for testmsxs
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for testmsws
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for testmsxd
+; CHECK-GI-NEXT:  warning: Instruction selection used fallback path for testmswd
 
-; CHECK-LABEL: testmsxs:
-; CHECK:       frintx  [[SREG:s[0-9]+]], s0
-; CHECK-NEXT:  fcvtzs  [[WREG:w[0-9]+]], [[SREG]]
-; CHECK-NEXT:  sxtw    x0, [[WREG]]
-; CHECK-NEXT:  ret
 define i64 @testmsxs(float %x) {
+; CHECK-LABEL: testmsxs:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    frintx s0, s0
+; CHECK-NEXT:    fcvtzs w8, s0
+; CHECK-NEXT:    sxtw x0, w8
+; CHECK-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lrint.i32.f32(float %x)
   %conv = sext i32 %0 to i64
   ret i64 %conv
 }
 
-; CHECK-LABEL: testmsws:
-; CHECK:       frintx  [[SREG:s[0-9]+]], s0
-; CHECK-NEXT:  fcvtzs  [[WREG:w[0-9]+]], [[SREG]]
-; CHECK-NEXT:  ret
 define i32 @testmsws(float %x) {
+; CHECK-LABEL: testmsws:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    frintx s0, s0
+; CHECK-NEXT:    fcvtzs w0, s0
+; CHECK-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lrint.i32.f32(float %x)
   ret i32 %0
 }
 
-; CHECK-LABEL: testmsxd:
-; CHECK:       frintx  [[DREG:d[0-9]+]], d0
-; CHECK-NEXT:  fcvtzs  [[WREG:w[0-9]+]], [[DREG]]
-; CHECK-NEXT:  sxtw    x0, [[WREG]]
-; CHECK-NEXT:  ret
 define i64 @testmsxd(double %x) {
+; CHECK-LABEL: testmsxd:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    frintx d0, d0
+; CHECK-NEXT:    fcvtzs w8, d0
+; CHECK-NEXT:    sxtw x0, w8
+; CHECK-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lrint.i32.f64(double %x)
   %conv = sext i32 %0 to i64
   ret i64 %conv
 }
 
-; CHECK-LABEL: testmswd:
-; CHECK:       frintx  [[DREG:d[0-9]+]], d0
-; CHECK-NEXT:  fcvtzs  [[WREG:w[0-9]+]], [[DREG]]
-; CHECK-NEXT:  ret
 define i32 @testmswd(double %x) {
+; CHECK-LABEL: testmswd:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    frintx d0, d0
+; CHECK-NEXT:    fcvtzs w0, d0
+; CHECK-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lrint.i32.f64(double %x)
   ret i32 %0
 }
-
-declare i32 @llvm.lrint.i32.f32(float) nounwind readnone
-declare i32 @llvm.lrint.i32.f64(double) nounwind readnone
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; CHECK-GI: {{.*}}
+; CHECK-SD: {{.*}}
diff --git a/llvm/test/CodeGen/AArch64/lround-conv-fp16-win.ll b/llvm/test/CodeGen/AArch64/lround-conv-fp16-win.ll
index 5eabc2a4f4630..e5390169c51d6 100644
--- a/llvm/test/CodeGen/AArch64/lround-conv-fp16-win.ll
+++ b/llvm/test/CodeGen/AArch64/lround-conv-fp16-win.ll
@@ -1,33 +1,62 @@
-; RUN: llc < %s -mtriple=aarch64-windows -mattr=+fullfp16 | FileCheck %s
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=aarch64 -mattr=+neon | FileCheck %s --check-prefixes=CHECK,CHECK-SD
+; RUN: llc < %s -mtriple=aarch64 -mattr=+neon -global-isel | FileCheck %s --check-prefixes=CHECK,CHECK-GI
 
-; CHECK-LABEL: testmhhs:
-; CHECK:       fcvtas  w0, h0
-; CHECK:       ret
 define i16 @testmhhs(half %x) {
+; CHECK-SD-LABEL: testmhhs:
+; CHECK-SD:       // %bb.0: // %entry
+; CHECK-SD-NEXT:    fcvt s0, h0
+; CHECK-SD-NEXT:    fcvtas w0, s0
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: testmhhs:
+; CHECK-GI:       // %bb.0: // %entry
+; CHECK-GI-NEXT:    fcvt s0, h0
+; CHECK-GI-NEXT:    fcvtas x0, s0
+; CHECK-GI-NEXT:    // kill: def $w0 killed $w0 killed $x0
+; CHECK-GI-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lround.i32.f16(half %x)
   %conv = trunc i32 %0 to i16
   ret i16 %conv
 }
 
-; CHECK-LABEL: testmhws:
-; CHECK:       fcvtas  w0, h0
-; CHECK:       ret
 define i32 @testmhws(half %x) {
+; CHECK-SD-LABEL: testmhws:
+; CHECK-SD:       // %bb.0: // %entry
+; CHECK-SD-NEXT:    fcvt s0, h0
+; CHECK-SD-NEXT:    fcvtas w0, s0
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: testmhws:
+; CHECK-GI:       // %bb.0: // %entry
+; CHECK-GI-NEXT:    fcvt s0, h0
+; CHECK-GI-NEXT:    fcvtas x0, s0
+; CHECK-GI-NEXT:    // kill: def $w0 killed $w0 killed $x0
+; CHECK-GI-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lround.i32.f16(half %x)
   ret i32 %0
 }
 
-; CHECK-LABEL: testmhxs:
-; CHECK:       fcvtas  w8, h0
-; CHECK-NEXT:  sxtw    x0, w8
-; CHECK-NEXT:  ret
 define i64 @testmhxs(half %x) {
+; CHECK-SD-LABEL: testmhxs:
+; CHECK-SD:       // %bb.0: // %entry
+; CHECK-SD-NEXT:    fcvt s0, h0
+; CHECK-SD-NEXT:    fcvtas w8, s0
+; CHECK-SD-NEXT:    sxtw x0, w8
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: testmhxs:
+; CHECK-GI:       // %bb.0: // %entry
+; CHECK-GI-NEXT:    fcvt s0, h0
+; CHECK-GI-NEXT:    fcvtas x8, s0
+; CHECK-GI-NEXT:    sxtw x0, w8
+; CHECK-GI-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lround.i32.f16(half %x)
   %conv = sext i32 %0 to i64
   ret i64 %conv
 }
-
-declare i32 @llvm.lround.i32.f16(half) nounwind readnone
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; CHECK: {{.*}}
diff --git a/llvm/test/CodeGen/AArch64/lround-conv-win.ll b/llvm/test/CodeGen/AArch64/lround-conv-win.ll
index 8bc9213fdcedf..02c1e9381eb06 100644
--- a/llvm/test/CodeGen/AArch64/lround-conv-win.ll
+++ b/llvm/test/CodeGen/AArch64/lround-conv-win.ll
@@ -1,44 +1,74 @@
-; RUN: llc < %s -mtriple=aarch64-windows -mattr=+neon | FileCheck %s
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=aarch64 -mattr=+neon | FileCheck %s --check-prefixes=CHECK,CHECK-SD
+; RUN: llc < %s -mtriple=aarch64 -mattr=+neon -global-isel | FileCheck %s --check-prefixes=CHECK,CHECK-GI
 
-; CHECK-LABEL: testmsxs:
-; CHECK:       fcvtas  w8, s0
-; CHECK-NEXT:  sxtw    x0, w8
-; CHECK-NEXT:  ret
 define i64 @testmsxs(float %x) {
+; CHECK-SD-LABEL: testmsxs:
+; CHECK-SD:       // %bb.0: // %entry
+; CHECK-SD-NEXT:    fcvtas w8, s0
+; CHECK-SD-NEXT:    sxtw x0, w8
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: testmsxs:
+; CHECK-GI:       // %bb.0: // %entry
+; CHECK-GI-NEXT:    fcvtas x8, s0
+; CHECK-GI-NEXT:    sxtw x0, w8
+; CHECK-GI-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lround.i32.f32(float %x)
   %conv = sext i32 %0 to i64
   ret i64 %conv
 }
 
-; CHECK-LABEL: testmsws:
-; CHECK:       fcvtas  w0, s0
-; CHECK-NEXT:  ret
 define i32 @testmsws(float %x) {
+; CHECK-SD-LABEL: testmsws:
+; CHECK-SD:       // %bb.0: // %entry
+; CHECK-SD-NEXT:    fcvtas w0, s0
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: testmsws:
+; CHECK-GI:       // %bb.0: // %entry
+; CHECK-GI-NEXT:    fcvtas x0, s0
+; CHECK-GI-NEXT:    // kill: def $w0 killed $w0 killed $x0
+; CHECK-GI-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lround.i32.f32(float %x)
   ret i32 %0
 }
 
-; CHECK-LABEL: testmsxd:
-; CHECK:       fcvtas  w8, d0
-; CHECK-NEXT:  sxtw    x0, w8
-; CHECK-NEXT:  ret
 define i64 @testmsxd(double %x) {
+; CHECK-SD-LABEL: testmsxd:
+; CHECK-SD:       // %bb.0: // %entry
+; CHECK-SD-NEXT:    fcvtas w8, d0
+; CHECK-SD-NEXT:    sxtw x0, w8
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: testmsxd:
+; CHECK-GI:       // %bb.0: // %entry
+; CHECK-GI-NEXT:    fcvtas x8, d0
+; CHECK-GI-NEXT:    sxtw x0, w8
+; CHECK-GI-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lround.i32.f64(double %x)
   %conv = sext i32 %0 to i64
   ret i64 %conv
 }
 
-; CHECK-LABEL: testmswd:
-; CHECK:       fcvtas  w0, d0
-; CHECK-NEXT:  ret
 define i32 @testmswd(double %x) {
+; CHECK-SD-LABEL: testmswd:
+; CHECK-SD:       // %bb.0: // %entry
+; CHECK-SD-NEXT:    fcvtas w0, d0
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: testmswd:
+; CHECK-GI:       // %bb.0: // %entry
+; CHECK-GI-NEXT:    fcvtas x0, d0
+; CHECK-GI-NEXT:    // kill: def $w0 killed $w0 killed $x0
+; CHECK-GI-NEXT:    ret
 entry:
   %0 = tail call i32 @llvm.lround.i32.f64(double %x)
   ret i32 %0
 }
 
-declare i32 @llvm.lround.i32.f32(float) nounwind readnone
-declare i32 @llvm.lround.i32.f64(double) nounwind readnone
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; CHECK: {{.*}}
diff --git a/llvm/test/CodeGen/AArch64/machine-sme-abi-find-insert-pt.mir b/llvm/test/CodeGen/AArch64/machine-sme-abi-find-insert-pt.mir
index 3f174a62128a8..ed768dec77998 100644
--- a/llvm/test/CodeGen/AArch64/machine-sme-abi-find-insert-pt.mir
+++ b/llvm/test/CodeGen/AArch64/machine-sme-abi-find-insert-pt.mir
@@ -79,14 +79,12 @@ body:             |
     ; CHECK-NEXT: RequiresZASavePseudo
     ; CHECK-NEXT: BL @clobber, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit-def $sp
     ; CHECK-NEXT: ADJCALLSTACKUP 0, 0, implicit-def dead $sp, implicit $sp
-    ; CHECK-NEXT: $x0 = IMPLICIT_DEF
-    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gpr64 = COPY $x0
     ; CHECK-NEXT: MSRpstatesvcrImm1 2, 1, implicit-def $nzcv
     ; CHECK-NEXT: [[MRS:%[0-9]+]]:gpr64 = MRS 56965, implicit-def $nzcv
     ; CHECK-NEXT: $x0 = ADDXri %stack.0, 0, 0
     ; CHECK-NEXT: RestoreZAPseudo [[MRS]], $x0, &__arm_tpidr2_restore, csr_aarch64_sme_abi_support_routines_preservemost_from_x0
     ; CHECK-NEXT: MSR 56965, $xzr
-    ; CHECK-NEXT: $x0 = COPY [[COPY2]]
+    ; CHECK-NEXT: $x0 = IMPLICIT_DEF
     ; CHECK-NEXT: $nzcv = IMPLICIT_DEF
     ; CHECK-NEXT: FAKE_USE $x0
     ; CHECK-NEXT: $zab0 = IMPLICIT_DEF
diff --git a/llvm/test/CodeGen/AArch64/memtag-compact-unwind.ll b/llvm/test/CodeGen/AArch64/memtag-compact-unwind.ll
new file mode 100644
index 0000000000000..50cda8d285a42
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/memtag-compact-unwind.ll
@@ -0,0 +1,27 @@
+; RUN: llc -mtriple=arm64-apple-macosx -mattr=+mte %s -filetype=obj -o %t.o
+; RUN: llvm-objdump --unwind-info %t.o | FileCheck %s
+
+; Frames with MTE stack tagging must use DWARF unwinding because the compact
+; unwind format cannot describe untagging the stack during exception unwinding.
+
+; MTE-tagged frame should use DWARF mode (0x03000000)
+; CHECK-LABEL: Contents of __compact_unwind section:
+; CHECK:       compact encoding: 0x03000000
+
+; Normal frame should NOT use DWARF mode
+; CHECK-NOT:   compact encoding: 0x03000000
+; CHECK:       compact encoding: 0x{{[0-9a-f]+}}
+
+define void @mte_tagged_frame() sanitize_memtag "frame-pointer"="all" {
+  %x = alloca i32, align 4
+  store i32 42, ptr %x
+  call void asm sideeffect "", "r"(ptr %x)
+  ret void
+}
+
+define void @normal_frame() "frame-pointer"="all" {
+  %x = alloca i32, align 4
+  store i32 42, ptr %x
+  call void asm sideeffect "", "r"(ptr %x)
+  ret void
+}
diff --git a/llvm/test/CodeGen/AArch64/neon-anyof-splat.ll b/llvm/test/CodeGen/AArch64/neon-anyof-splat.ll
new file mode 100644
index 0000000000000..dedd4323f1519
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/neon-anyof-splat.ll
@@ -0,0 +1,67 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc %s -o - | FileCheck %s
+target triple = "aarch64-linux-gnu"
+
+;; An 'AnyOf' reduction (vector.reduce.or) is instcombined to a bitcast to an
+;; integer of a bitwidth equal to the number of lanes being reduced, then
+;; compared against zero. To select between vectors for NEON, we then need to
+;; broadcast the result, but we must be careful when the bitwidth of the scalar
+;; result is smaller than the element size of the vectors being selected. We
+;; don't want to end up with scalarization.
+
+define <4 x i32> @any_of_select_vf4(<4 x i32> %mask, <4 x i32> %a, <4 x i32> %b) {
+; CHECK-LABEL: any_of_select_vf4:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    cmlt v0.4s, v0.4s, #0
+; CHECK-NEXT:    umaxv s0, v0.4s
+; CHECK-NEXT:    fmov w8, s0
+; CHECK-NEXT:    tst w8, #0x1
+; CHECK-NEXT:    csetm w8, ne
+; CHECK-NEXT:    dup v0.4s, w8
+; CHECK-NEXT:    bsl v0.16b, v2.16b, v1.16b
+; CHECK-NEXT:    ret
+  %cmp = icmp slt <4 x i32> %mask, zeroinitializer
+  %cmp.bc = bitcast <4 x i1> %cmp to i4
+  %cmp.bc.not = icmp eq i4 %cmp.bc, 0
+  %res = select i1 %cmp.bc.not, <4 x i32> %a, <4 x i32> %b
+  ret <4 x i32> %res
+}
+
+define <2 x i64> @any_of_select_vf2(<2 x i64> %mask, <2 x i64> %a, <2 x i64> %b) {
+; CHECK-LABEL: any_of_select_vf2:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    cmlt v0.2d, v0.2d, #0
+; CHECK-NEXT:    umaxv s0, v0.4s
+; CHECK-NEXT:    fmov w8, s0
+; CHECK-NEXT:    tst w8, #0x1
+; CHECK-NEXT:    csetm x8, ne
+; CHECK-NEXT:    dup v0.2d, x8
+; CHECK-NEXT:    bsl v0.16b, v2.16b, v1.16b
+; CHECK-NEXT:    ret
+  %cmp = icmp slt <2 x i64> %mask, zeroinitializer
+  %cmp.bc = bitcast <2 x i1> %cmp to i2
+  %cmp.bc.not = icmp eq i2 %cmp.bc, 0
+  %res = select i1 %cmp.bc.not, <2 x i64> %a, <2 x i64> %b
+  ret <2 x i64> %res
+}
+
+define <32 x i8> @any_of_select_vf32(<32 x i8> %mask, <32 x i8> %a, <32 x i8> %b) {
+; CHECK-LABEL: any_of_select_vf32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    orr v0.16b, v0.16b, v1.16b
+; CHECK-NEXT:    cmlt v0.16b, v0.16b, #0
+; CHECK-NEXT:    umaxv b0, v0.16b
+; CHECK-NEXT:    fmov w8, s0
+; CHECK-NEXT:    tst w8, #0x1
+; CHECK-NEXT:    csetm w8, ne
+; CHECK-NEXT:    dup v1.16b, w8
+; CHECK-NEXT:    mov v0.16b, v1.16b
+; CHECK-NEXT:    bsl v1.16b, v5.16b, v3.16b
+; CHECK-NEXT:    bsl v0.16b, v4.16b, v2.16b
+; CHECK-NEXT:    ret
+  %cmp = icmp slt <32 x i8> %mask, zeroinitializer
+  %cmp.bc = bitcast <32 x i1> %cmp to i32
+  %cmp.bc.not = icmp eq i32 %cmp.bc, 0
+  %res = select i1 %cmp.bc.not, <32 x i8> %a, <32 x i8> %b
+  ret <32 x i8> %res
+}
diff --git a/llvm/test/CodeGen/AArch64/print-pipeline-passes.ll b/llvm/test/CodeGen/AArch64/print-pipeline-passes.ll
index 5852f97a63798..86090324c770c 100644
--- a/llvm/test/CodeGen/AArch64/print-pipeline-passes.ll
+++ b/llvm/test/CodeGen/AArch64/print-pipeline-passes.ll
@@ -2,7 +2,7 @@
 ; RUN: opt -mtriple=aarch64 -S -passes='default<O2>' -print-pipeline-passes < %s | FileCheck %s
 
 ; CHECK: loop-idiom-vectorize
-; O0: {{^}}function(ee-instrument<>),always-inline,coro-cond(coro-early,cgscc(coro-split),coro-cleanup,globaldce),function(annotation-remarks),verify,print{{$}}
+; O0: {{^}}function(ee-instrument<>),always-inline,coro-cond(coro-early,cgscc(coro-split),coro-cleanup,globaldce),alloc-token,function(annotation-remarks),verify,print{{$}}
 
 define void @foo() {
 entry:
diff --git a/llvm/test/CodeGen/AArch64/rem-by-const.ll b/llvm/test/CodeGen/AArch64/rem-by-const.ll
index a55aaeb62830f..ffaf045fa45c2 100644
--- a/llvm/test/CodeGen/AArch64/rem-by-const.ll
+++ b/llvm/test/CodeGen/AArch64/rem-by-const.ll
@@ -1433,35 +1433,13 @@ entry:
 define <4 x i8> @uv4i8_7(<4 x i8> %d, <4 x i8> %e) {
 ; CHECK-SD-LABEL: uv4i8_7:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 def $q0
-; CHECK-SD-NEXT:    mov w8, #18725 // =0x4925
+; CHECK-SD-NEXT:    mov w8, #9363 // =0x2493
 ; CHECK-SD-NEXT:    bic v0.4h, #255, lsl #8
-; CHECK-SD-NEXT:    movk w8, #9362, lsl #16
-; CHECK-SD-NEXT:    umov w9, v0.h[0]
-; CHECK-SD-NEXT:    umov w10, v0.h[1]
-; CHECK-SD-NEXT:    umov w13, v0.h[2]
-; CHECK-SD-NEXT:    umov w15, v0.h[3]
-; CHECK-SD-NEXT:    umull x11, w9, w8
-; CHECK-SD-NEXT:    umull x12, w10, w8
-; CHECK-SD-NEXT:    umull x14, w13, w8
-; CHECK-SD-NEXT:    lsr x11, x11, #32
-; CHECK-SD-NEXT:    umull x8, w15, w8
-; CHECK-SD-NEXT:    lsr x12, x12, #32
-; CHECK-SD-NEXT:    sub w11, w11, w11, lsl #3
-; CHECK-SD-NEXT:    sub w12, w12, w12, lsl #3
-; CHECK-SD-NEXT:    lsr x8, x8, #32
-; CHECK-SD-NEXT:    add w9, w9, w11
-; CHECK-SD-NEXT:    fmov s0, w9
-; CHECK-SD-NEXT:    add w10, w10, w12
-; CHECK-SD-NEXT:    lsr x9, x14, #32
-; CHECK-SD-NEXT:    sub w8, w8, w8, lsl #3
-; CHECK-SD-NEXT:    sub w9, w9, w9, lsl #3
-; CHECK-SD-NEXT:    mov v0.h[1], w10
-; CHECK-SD-NEXT:    add w8, w15, w8
-; CHECK-SD-NEXT:    add w9, w13, w9
-; CHECK-SD-NEXT:    mov v0.h[2], w9
-; CHECK-SD-NEXT:    mov v0.h[3], w8
-; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 killed $q0
+; CHECK-SD-NEXT:    movi v2.4h, #7
+; CHECK-SD-NEXT:    dup v1.4h, w8
+; CHECK-SD-NEXT:    umull v1.4s, v0.4h, v1.4h
+; CHECK-SD-NEXT:    shrn v1.4h, v1.4s, #16
+; CHECK-SD-NEXT:    mls v0.4h, v1.4h, v2.4h
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: uv4i8_7:
@@ -1508,32 +1486,13 @@ entry:
 define <4 x i8> @uv4i8_100(<4 x i8> %d, <4 x i8> %e) {
 ; CHECK-SD-LABEL: uv4i8_100:
 ; CHECK-SD:       // %bb.0: // %entry
-; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 def $q0
-; CHECK-SD-NEXT:    mov w8, #23593 // =0x5c29
-; CHECK-SD-NEXT:    mov w14, #100 // =0x64
+; CHECK-SD-NEXT:    mov w8, #656 // =0x290
 ; CHECK-SD-NEXT:    bic v0.4h, #255, lsl #8
-; CHECK-SD-NEXT:    movk w8, #655, lsl #16
-; CHECK-SD-NEXT:    umov w9, v0.h[0]
-; CHECK-SD-NEXT:    umov w10, v0.h[1]
-; CHECK-SD-NEXT:    umov w12, v0.h[2]
-; CHECK-SD-NEXT:    umov w15, v0.h[3]
-; CHECK-SD-NEXT:    umull x11, w9, w8
-; CHECK-SD-NEXT:    umull x13, w10, w8
-; CHECK-SD-NEXT:    lsr x11, x11, #32
-; CHECK-SD-NEXT:    lsr x13, x13, #32
-; CHECK-SD-NEXT:    msub w9, w11, w14, w9
-; CHECK-SD-NEXT:    umull x11, w12, w8
-; CHECK-SD-NEXT:    msub w10, w13, w14, w10
-; CHECK-SD-NEXT:    fmov s0, w9
-; CHECK-SD-NEXT:    umull x8, w15, w8
-; CHECK-SD-NEXT:    lsr x9, x11, #32
-; CHECK-SD-NEXT:    mov v0.h[1], w10
-; CHECK-SD-NEXT:    msub w9, w9, w14, w12
-; CHECK-SD-NEXT:    lsr x8, x8, #32
-; CHECK-SD-NEXT:    msub w8, w8, w14, w15
-; CHECK-SD-NEXT:    mov v0.h[2], w9
-; CHECK-SD-NEXT:    mov v0.h[3], w8
-; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 killed $q0
+; CHECK-SD-NEXT:    movi v2.4h, #100
+; CHECK-SD-NEXT:    dup v1.4h, w8
+; CHECK-SD-NEXT:    umull v1.4s, v0.4h, v1.4h
+; CHECK-SD-NEXT:    shrn v1.4h, v1.4s, #16
+; CHECK-SD-NEXT:    mls v0.4h, v1.4h, v2.4h
 ; CHECK-SD-NEXT:    ret
 ;
 ; CHECK-GI-LABEL: uv4i8_100:
diff --git a/llvm/test/CodeGen/AArch64/shift.ll b/llvm/test/CodeGen/AArch64/shift.ll
index 9827cb3526f99..98c7f673ecd01 100644
--- a/llvm/test/CodeGen/AArch64/shift.ll
+++ b/llvm/test/CodeGen/AArch64/shift.ll
@@ -1033,6 +1033,37 @@ define <2 x i128> @lshr_v2i128(<2 x i128> %0, <2 x i128> %1){
     ret <2 x i128> %3
 }
 
+define <2 x i8> @pr168848(<2 x i1> %shift) {
+; CHECK-SD-LABEL: pr168848:
+; CHECK-SD:       // %bb.0: // %entry
+; CHECK-SD-NEXT:    movi v1.2s, #1
+; CHECK-SD-NEXT:    and v0.8b, v0.8b, v1.8b
+; CHECK-SD-NEXT:    ushl v0.2s, v1.2s, v0.2s
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: pr168848:
+; CHECK-GI:       // %bb.0: // %entry
+; CHECK-GI-NEXT:    movi v1.2s, #1
+; CHECK-GI-NEXT:    mov w8, #1 // =0x1
+; CHECK-GI-NEXT:    and v0.8b, v0.8b, v1.8b
+; CHECK-GI-NEXT:    fmov s1, w8
+; CHECK-GI-NEXT:    uzp1 v0.4h, v0.4h, v0.4h
+; CHECK-GI-NEXT:    mov v1.b[1], w8
+; CHECK-GI-NEXT:    uzp1 v0.8b, v0.8b, v0.8b
+; CHECK-GI-NEXT:    ushl v0.8b, v1.8b, v0.8b
+; CHECK-GI-NEXT:    umov w8, v0.b[0]
+; CHECK-GI-NEXT:    umov w9, v0.b[1]
+; CHECK-GI-NEXT:    fmov s0, w8
+; CHECK-GI-NEXT:    mov v0.s[1], w9
+; CHECK-GI-NEXT:    // kill: def $d0 killed $d0 killed $q0
+; CHECK-GI-NEXT:    ret
+entry:
+  %shift.zext = zext <2 x i1> %shift to <2 x i32>
+  %ones = shl <2 x i32> splat (i32 1), %shift.zext
+  %ones.trunc = trunc <2 x i32> %ones to <2 x i8>
+  ret <2 x i8> %ones.trunc
+}
+
 ; ===== Vector with Non-Pow 2 Width =====
 
 define <3 x i8> @shl_v3i8(<3 x i8> %0, <3 x i8> %1){
diff --git a/llvm/test/CodeGen/AArch64/sme-agnostic-za.ll b/llvm/test/CodeGen/AArch64/sme-agnostic-za.ll
index 30dbd1cb34667..0906e10b551b7 100644
--- a/llvm/test/CodeGen/AArch64/sme-agnostic-za.ll
+++ b/llvm/test/CodeGen/AArch64/sme-agnostic-za.ll
@@ -67,10 +67,10 @@ define i64 @agnostic_caller_private_za_callee(i64 %v) nounwind "aarch64_za_state
 ; CHECK-NEWLOWERING-NEXT:    mov x0, x8
 ; CHECK-NEWLOWERING-NEXT:    bl private_za_decl
 ; CHECK-NEWLOWERING-NEXT:    bl private_za_decl
-; CHECK-NEWLOWERING-NEXT:    mov x8, x0
+; CHECK-NEWLOWERING-NEXT:    mov x1, x0
 ; CHECK-NEWLOWERING-NEXT:    mov x0, x19
 ; CHECK-NEWLOWERING-NEXT:    bl __arm_sme_restore
-; CHECK-NEWLOWERING-NEXT:    mov x0, x8
+; CHECK-NEWLOWERING-NEXT:    mov x0, x1
 ; CHECK-NEWLOWERING-NEXT:    mov sp, x29
 ; CHECK-NEWLOWERING-NEXT:    ldr x19, [sp, #16] // 8-byte Reload
 ; CHECK-NEWLOWERING-NEXT:    ldp x29, x30, [sp], #32 // 16-byte Folded Reload
@@ -170,11 +170,11 @@ define i64 @streaming_agnostic_caller_nonstreaming_private_za_callee(i64 %v) nou
 ; CHECK-NEWLOWERING-NEXT:    mov x0, x8
 ; CHECK-NEWLOWERING-NEXT:    bl private_za_decl
 ; CHECK-NEWLOWERING-NEXT:    bl private_za_decl
+; CHECK-NEWLOWERING-NEXT:    mov x1, x0
 ; CHECK-NEWLOWERING-NEXT:    smstart sm
-; CHECK-NEWLOWERING-NEXT:    mov x8, x0
 ; CHECK-NEWLOWERING-NEXT:    mov x0, x20
 ; CHECK-NEWLOWERING-NEXT:    bl __arm_sme_restore
-; CHECK-NEWLOWERING-NEXT:    mov x0, x8
+; CHECK-NEWLOWERING-NEXT:    mov x0, x1
 ; CHECK-NEWLOWERING-NEXT:    sub sp, x29, #64
 ; CHECK-NEWLOWERING-NEXT:    ldp x20, x19, [sp, #80] // 16-byte Folded Reload
 ; CHECK-NEWLOWERING-NEXT:    ldp x29, x30, [sp, #64] // 16-byte Folded Reload
@@ -267,14 +267,14 @@ define i64 @streaming_compatible_agnostic_caller_nonstreaming_private_za_callee(
 ; CHECK-NEWLOWERING-NEXT:    mov x0, x8
 ; CHECK-NEWLOWERING-NEXT:    bl private_za_decl
 ; CHECK-NEWLOWERING-NEXT:    bl private_za_decl
+; CHECK-NEWLOWERING-NEXT:    mov x1, x0
 ; CHECK-NEWLOWERING-NEXT:    tbz w20, #0, .LBB5_4
 ; CHECK-NEWLOWERING-NEXT:  // %bb.3:
 ; CHECK-NEWLOWERING-NEXT:    smstart sm
 ; CHECK-NEWLOWERING-NEXT:  .LBB5_4:
-; CHECK-NEWLOWERING-NEXT:    mov x8, x0
 ; CHECK-NEWLOWERING-NEXT:    mov x0, x19
 ; CHECK-NEWLOWERING-NEXT:    bl __arm_sme_restore
-; CHECK-NEWLOWERING-NEXT:    mov x0, x8
+; CHECK-NEWLOWERING-NEXT:    mov x0, x1
 ; CHECK-NEWLOWERING-NEXT:    sub sp, x29, #64
 ; CHECK-NEWLOWERING-NEXT:    ldp x20, x19, [sp, #80] // 16-byte Folded Reload
 ; CHECK-NEWLOWERING-NEXT:    ldp x29, x30, [sp, #64] // 16-byte Folded Reload
@@ -336,10 +336,10 @@ define i64  @test_many_callee_arguments(
 ; CHECK-NEWLOWERING-NEXT:    mov x0, x8
 ; CHECK-NEWLOWERING-NEXT:    bl many_args_private_za_callee
 ; CHECK-NEWLOWERING-NEXT:    add sp, sp, #16
-; CHECK-NEWLOWERING-NEXT:    mov x8, x0
+; CHECK-NEWLOWERING-NEXT:    mov x1, x0
 ; CHECK-NEWLOWERING-NEXT:    mov x0, x19
 ; CHECK-NEWLOWERING-NEXT:    bl __arm_sme_restore
-; CHECK-NEWLOWERING-NEXT:    mov x0, x8
+; CHECK-NEWLOWERING-NEXT:    mov x0, x1
 ; CHECK-NEWLOWERING-NEXT:    mov sp, x29
 ; CHECK-NEWLOWERING-NEXT:    ldr x19, [sp, #16] // 8-byte Reload
 ; CHECK-NEWLOWERING-NEXT:    ldp x29, x30, [sp], #32 // 16-byte Folded Reload
diff --git a/llvm/test/CodeGen/AArch64/sme-dynamic-tls.ll b/llvm/test/CodeGen/AArch64/sme-dynamic-tls.ll
index 0c886c643c5fb..87a63fed0546c 100644
--- a/llvm/test/CodeGen/AArch64/sme-dynamic-tls.ll
+++ b/llvm/test/CodeGen/AArch64/sme-dynamic-tls.ll
@@ -87,8 +87,7 @@ define i32 @load_tls_shared_za() nounwind "aarch64_inout_za" {
 ; CHECK-NEXT:    .tlsdesccall x
 ; CHECK-NEXT:    blr x1
 ; CHECK-NEXT:    mrs x8, TPIDR_EL0
-; CHECK-NEXT:    ldr w0, [x8, x0]
-; CHECK-NEXT:    mov w8, w0
+; CHECK-NEXT:    ldr w8, [x8, x0]
 ; CHECK-NEXT:    smstart za
 ; CHECK-NEXT:    mrs x9, TPIDR2_EL0
 ; CHECK-NEXT:    sub x0, x29, #16
@@ -133,8 +132,7 @@ define i32 @load_tls_streaming_shared_za() nounwind "aarch64_inout_za" "aarch64_
 ; CHECK-NEXT:    blr x1
 ; CHECK-NEXT:    smstart sm
 ; CHECK-NEXT:    mrs x8, TPIDR_EL0
-; CHECK-NEXT:    ldr w0, [x8, x0]
-; CHECK-NEXT:    mov w8, w0
+; CHECK-NEXT:    ldr w8, [x8, x0]
 ; CHECK-NEXT:    smstart za
 ; CHECK-NEXT:    mrs x9, TPIDR2_EL0
 ; CHECK-NEXT:    sub x0, x29, #80
diff --git a/llvm/test/CodeGen/AArch64/sme-lazy-save-call.ll b/llvm/test/CodeGen/AArch64/sme-lazy-save-call.ll
index 50dd0c699284c..e672f777703a6 100644
--- a/llvm/test/CodeGen/AArch64/sme-lazy-save-call.ll
+++ b/llvm/test/CodeGen/AArch64/sme-lazy-save-call.ll
@@ -621,15 +621,15 @@ define i64  @test_many_callee_arguments(
 ; CHECK-NEWLOWERING-NEXT:    stp x10, x11, [sp, #-16]!
 ; CHECK-NEWLOWERING-NEXT:    bl many_args_private_za_callee
 ; CHECK-NEWLOWERING-NEXT:    add sp, sp, #16
-; CHECK-NEWLOWERING-NEXT:    mov x8, x0
+; CHECK-NEWLOWERING-NEXT:    mov x1, x0
 ; CHECK-NEWLOWERING-NEXT:    smstart za
-; CHECK-NEWLOWERING-NEXT:    mrs x9, TPIDR2_EL0
+; CHECK-NEWLOWERING-NEXT:    mrs x8, TPIDR2_EL0
 ; CHECK-NEWLOWERING-NEXT:    sub x0, x29, #16
-; CHECK-NEWLOWERING-NEXT:    cbnz x9, .LBB9_2
+; CHECK-NEWLOWERING-NEXT:    cbnz x8, .LBB9_2
 ; CHECK-NEWLOWERING-NEXT:  // %bb.1:
 ; CHECK-NEWLOWERING-NEXT:    bl __arm_tpidr2_restore
 ; CHECK-NEWLOWERING-NEXT:  .LBB9_2:
-; CHECK-NEWLOWERING-NEXT:    mov x0, x8
+; CHECK-NEWLOWERING-NEXT:    mov x0, x1
 ; CHECK-NEWLOWERING-NEXT:    msr TPIDR2_EL0, xzr
 ; CHECK-NEWLOWERING-NEXT:    mov sp, x29
 ; CHECK-NEWLOWERING-NEXT:    ldr x19, [sp, #16] // 8-byte Reload
diff --git a/llvm/test/CodeGen/AArch64/sme-peephole-opts.ll b/llvm/test/CodeGen/AArch64/sme-peephole-opts.ll
index a3027f01e73cf..ea1341186ddfa 100644
--- a/llvm/test/CodeGen/AArch64/sme-peephole-opts.ll
+++ b/llvm/test/CodeGen/AArch64/sme-peephole-opts.ll
@@ -230,10 +230,6 @@ define void @test7() nounwind "aarch64_inout_zt0" {
 ; CHECK-NEXT:    str zt0, [x19]
 ; CHECK-NEXT:    smstop za
 ; CHECK-NEXT:    bl callee
-; CHECK-NEXT:    smstart za
-; CHECK-NEXT:    ldr zt0, [x19]
-; CHECK-NEXT:    str zt0, [x19]
-; CHECK-NEXT:    smstop za
 ; CHECK-NEXT:    bl callee
 ; CHECK-NEXT:    smstart za
 ; CHECK-NEXT:    ldr zt0, [x19]
diff --git a/llvm/test/CodeGen/AArch64/sme-za-exceptions.ll b/llvm/test/CodeGen/AArch64/sme-za-exceptions.ll
index ef74825e02881..3947127c47844 100644
--- a/llvm/test/CodeGen/AArch64/sme-za-exceptions.ll
+++ b/llvm/test/CodeGen/AArch64/sme-za-exceptions.ll
@@ -511,7 +511,6 @@ exit:
 ;
 ; This code may require reloading ZT0 in the cleanup for ~ZT0Resource().
 ;
-; FIXME: Codegen with `-aarch64-new-sme-abi` is broken with ZT0 (as it is not implemented).
 define void @try_catch_shared_zt0_callee() "aarch64_inout_zt0" personality ptr @__gxx_personality_v0 {
 ; CHECK-LABEL: try_catch_shared_zt0_callee:
 ; CHECK:       .Lfunc_begin3:
@@ -519,52 +518,37 @@ define void @try_catch_shared_zt0_callee() "aarch64_inout_zt0" personality ptr @
 ; CHECK-NEXT:    .cfi_personality 156, DW.ref.__gxx_personality_v0
 ; CHECK-NEXT:    .cfi_lsda 28, .Lexception3
 ; CHECK-NEXT:  // %bb.0:
-; CHECK-NEXT:    stp x29, x30, [sp, #-32]! // 16-byte Folded Spill
-; CHECK-NEXT:    stp x20, x19, [sp, #16] // 16-byte Folded Spill
-; CHECK-NEXT:    mov x29, sp
-; CHECK-NEXT:    sub sp, sp, #80
-; CHECK-NEXT:    .cfi_def_cfa w29, 32
+; CHECK-NEXT:    sub sp, sp, #96
+; CHECK-NEXT:    str x30, [sp, #64] // 8-byte Spill
+; CHECK-NEXT:    stp x20, x19, [sp, #80] // 16-byte Folded Spill
+; CHECK-NEXT:    .cfi_def_cfa_offset 96
 ; CHECK-NEXT:    .cfi_offset w19, -8
 ; CHECK-NEXT:    .cfi_offset w20, -16
-; CHECK-NEXT:    .cfi_offset w30, -24
-; CHECK-NEXT:    .cfi_offset w29, -32
-; CHECK-NEXT:    rdsvl x8, #1
-; CHECK-NEXT:    mov x9, sp
-; CHECK-NEXT:    msub x9, x8, x8, x9
-; CHECK-NEXT:    mov sp, x9
-; CHECK-NEXT:    stp x9, x8, [x29, #-80]
+; CHECK-NEXT:    .cfi_offset w30, -32
 ; CHECK-NEXT:  .Ltmp9: // EH_LABEL
-; CHECK-NEXT:    sub x19, x29, #64
+; CHECK-NEXT:    mov x19, sp
 ; CHECK-NEXT:    str zt0, [x19]
 ; CHECK-NEXT:    smstop za
 ; CHECK-NEXT:    bl may_throw
+; CHECK-NEXT:  .Ltmp10: // EH_LABEL
 ; CHECK-NEXT:    smstart za
 ; CHECK-NEXT:    ldr zt0, [x19]
-; CHECK-NEXT:  .Ltmp10: // EH_LABEL
 ; CHECK-NEXT:  // %bb.1: // %return_normally
-; CHECK-NEXT:    mov sp, x29
-; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
-; CHECK-NEXT:    ldp x29, x30, [sp], #32 // 16-byte Folded Reload
+; CHECK-NEXT:    ldp x20, x19, [sp, #80] // 16-byte Folded Reload
+; CHECK-NEXT:    ldr x30, [sp, #64] // 8-byte Reload
+; CHECK-NEXT:    add sp, sp, #96
 ; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB3_2: // %unwind_dtors
 ; CHECK-NEXT:  .Ltmp11: // EH_LABEL
-; CHECK-NEXT:    sub x20, x29, #64
+; CHECK-NEXT:    mov x20, sp
 ; CHECK-NEXT:    mov x19, x0
 ; CHECK-NEXT:    smstart za
-; CHECK-NEXT:    mrs x8, TPIDR2_EL0
-; CHECK-NEXT:    sub x0, x29, #80
-; CHECK-NEXT:    cbnz x8, .LBB3_4
-; CHECK-NEXT:  // %bb.3: // %unwind_dtors
-; CHECK-NEXT:    bl __arm_tpidr2_restore
-; CHECK-NEXT:  .LBB3_4: // %unwind_dtors
-; CHECK-NEXT:    msr TPIDR2_EL0, xzr
+; CHECK-NEXT:    ldr zt0, [x20]
 ; CHECK-NEXT:    bl shared_zt0_call
 ; CHECK-NEXT:    str zt0, [x20]
 ; CHECK-NEXT:    smstop za
 ; CHECK-NEXT:    mov x0, x19
 ; CHECK-NEXT:    bl _Unwind_Resume
-; CHECK-NEXT:    smstart za
-; CHECK-NEXT:    ldr zt0, [x20]
 ;
 ; CHECK-SDAG-LABEL: try_catch_shared_zt0_callee:
 ; CHECK-SDAG:       .Lfunc_begin3:
@@ -965,6 +949,239 @@ exit:
   ret void
 }
 
+define void @try_catch_inout_zt0() "aarch64_inout_zt0" personality ptr @__gxx_personality_v0 {
+; CHECK-LABEL: try_catch_inout_zt0:
+; CHECK:       .Lfunc_begin7:
+; CHECK-NEXT:    .cfi_startproc
+; CHECK-NEXT:    .cfi_personality 156, DW.ref.__gxx_personality_v0
+; CHECK-NEXT:    .cfi_lsda 28, .Lexception7
+; CHECK-NEXT:  // %bb.0: // %entry
+; CHECK-NEXT:    sub sp, sp, #80
+; CHECK-NEXT:    stp x30, x19, [sp, #64] // 16-byte Folded Spill
+; CHECK-NEXT:    .cfi_def_cfa_offset 80
+; CHECK-NEXT:    .cfi_offset w19, -8
+; CHECK-NEXT:    .cfi_offset w30, -16
+; CHECK-NEXT:  .Ltmp21: // EH_LABEL
+; CHECK-NEXT:    mov x19, sp
+; CHECK-NEXT:    str zt0, [x19]
+; CHECK-NEXT:    smstop za
+; CHECK-NEXT:    bl may_throw
+; CHECK-NEXT:  .Ltmp22: // EH_LABEL
+; CHECK-NEXT:  .LBB7_1: // %exit
+; CHECK-NEXT:    smstart za
+; CHECK-NEXT:    ldr zt0, [x19]
+; CHECK-NEXT:    ldp x30, x19, [sp, #64] // 16-byte Folded Reload
+; CHECK-NEXT:    add sp, sp, #80
+; CHECK-NEXT:    ret
+; CHECK-NEXT:  .LBB7_2: // %catch
+; CHECK-NEXT:  .Ltmp23: // EH_LABEL
+; CHECK-NEXT:    bl __cxa_begin_catch
+; CHECK-NEXT:    bl __cxa_end_catch
+; CHECK-NEXT:    b .LBB7_1
+;
+; CHECK-SDAG-LABEL: try_catch_inout_zt0:
+; CHECK-SDAG:       .Lfunc_begin7:
+; CHECK-SDAG-NEXT:    .cfi_startproc
+; CHECK-SDAG-NEXT:    .cfi_personality 156, DW.ref.__gxx_personality_v0
+; CHECK-SDAG-NEXT:    .cfi_lsda 28, .Lexception7
+; CHECK-SDAG-NEXT:  // %bb.0: // %entry
+; CHECK-SDAG-NEXT:    sub sp, sp, #80
+; CHECK-SDAG-NEXT:    stp x30, x19, [sp, #64] // 16-byte Folded Spill
+; CHECK-SDAG-NEXT:    .cfi_def_cfa_offset 80
+; CHECK-SDAG-NEXT:    .cfi_offset w19, -8
+; CHECK-SDAG-NEXT:    .cfi_offset w30, -16
+; CHECK-SDAG-NEXT:  .Ltmp21: // EH_LABEL
+; CHECK-SDAG-NEXT:    mov x19, sp
+; CHECK-SDAG-NEXT:    str zt0, [x19]
+; CHECK-SDAG-NEXT:    smstop za
+; CHECK-SDAG-NEXT:    bl may_throw
+; CHECK-SDAG-NEXT:    smstart za
+; CHECK-SDAG-NEXT:    ldr zt0, [x19]
+; CHECK-SDAG-NEXT:  .Ltmp22: // EH_LABEL
+; CHECK-SDAG-NEXT:  .LBB7_1: // %exit
+; CHECK-SDAG-NEXT:    ldp x30, x19, [sp, #64] // 16-byte Folded Reload
+; CHECK-SDAG-NEXT:    add sp, sp, #80
+; CHECK-SDAG-NEXT:    ret
+; CHECK-SDAG-NEXT:  .LBB7_2: // %catch
+; CHECK-SDAG-NEXT:  .Ltmp23: // EH_LABEL
+; CHECK-SDAG-NEXT:    smstart za
+; CHECK-SDAG-NEXT:    ldr zt0, [x19]
+; CHECK-SDAG-NEXT:    str zt0, [x19]
+; CHECK-SDAG-NEXT:    smstop za
+; CHECK-SDAG-NEXT:    bl __cxa_begin_catch
+; CHECK-SDAG-NEXT:    smstart za
+; CHECK-SDAG-NEXT:    ldr zt0, [x19]
+; CHECK-SDAG-NEXT:    str zt0, [x19]
+; CHECK-SDAG-NEXT:    smstop za
+; CHECK-SDAG-NEXT:    bl __cxa_end_catch
+; CHECK-SDAG-NEXT:    smstart za
+; CHECK-SDAG-NEXT:    ldr zt0, [x19]
+; CHECK-SDAG-NEXT:    b .LBB7_1
+entry:
+  invoke void @may_throw()
+          to label %exit unwind label %catch
+
+catch:
+  %eh_info = landingpad { ptr, i32 }
+          catch ptr null
+  %exception_ptr = extractvalue { ptr, i32 } %eh_info, 0
+  tail call ptr @__cxa_begin_catch(ptr %exception_ptr)
+  tail call void @__cxa_end_catch()
+  br label %exit
+
+exit:
+  ret void
+}
+
+define void @try_catch_shared_za_callee_zt0_saved(ptr %callee) "aarch64_inout_za" "aarch64_in_zt0" personality ptr @__gxx_personality_v0 {
+; CHECK-LABEL: try_catch_shared_za_callee_zt0_saved:
+; CHECK:       .Lfunc_begin8:
+; CHECK-NEXT:    .cfi_startproc
+; CHECK-NEXT:    .cfi_personality 156, DW.ref.__gxx_personality_v0
+; CHECK-NEXT:    .cfi_lsda 28, .Lexception8
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    stp x29, x30, [sp, #-32]! // 16-byte Folded Spill
+; CHECK-NEXT:    stp x20, x19, [sp, #16] // 16-byte Folded Spill
+; CHECK-NEXT:    mov x29, sp
+; CHECK-NEXT:    sub sp, sp, #80
+; CHECK-NEXT:    .cfi_def_cfa w29, 32
+; CHECK-NEXT:    .cfi_offset w19, -8
+; CHECK-NEXT:    .cfi_offset w20, -16
+; CHECK-NEXT:    .cfi_offset w30, -24
+; CHECK-NEXT:    .cfi_offset w29, -32
+; CHECK-NEXT:    rdsvl x8, #1
+; CHECK-NEXT:    mov x9, sp
+; CHECK-NEXT:    msub x9, x8, x8, x9
+; CHECK-NEXT:    mov sp, x9
+; CHECK-NEXT:    mov x19, x0
+; CHECK-NEXT:    stp x9, x8, [x29, #-80]
+; CHECK-NEXT:  .Ltmp24: // EH_LABEL
+; CHECK-NEXT:    sub x20, x29, #64
+; CHECK-NEXT:    sub x8, x29, #80
+; CHECK-NEXT:    str zt0, [x20]
+; CHECK-NEXT:    msr TPIDR2_EL0, x8
+; CHECK-NEXT:    bl may_throw
+; CHECK-NEXT:  .Ltmp25: // EH_LABEL
+; CHECK-NEXT:    smstart za
+; CHECK-NEXT:    mrs x8, TPIDR2_EL0
+; CHECK-NEXT:    sub x0, x29, #80
+; CHECK-NEXT:    cbnz x8, .LBB8_2
+; CHECK-NEXT:  // %bb.1:
+; CHECK-NEXT:    bl __arm_tpidr2_restore
+; CHECK-NEXT:  .LBB8_2:
+; CHECK-NEXT:    msr TPIDR2_EL0, xzr
+; CHECK-NEXT:    ldr zt0, [x20]
+; CHECK-NEXT:  // %bb.3: // %return_normally
+; CHECK-NEXT:    mov sp, x29
+; CHECK-NEXT:    ldp x20, x19, [sp, #16] // 16-byte Folded Reload
+; CHECK-NEXT:    ldp x29, x30, [sp], #32 // 16-byte Folded Reload
+; CHECK-NEXT:    ret
+; CHECK-NEXT:  .LBB8_4: // %unwind_dtors
+; CHECK-NEXT:  .Ltmp26: // EH_LABEL
+; CHECK-NEXT:    mov x20, x0
+; CHECK-NEXT:    smstart za
+; CHECK-NEXT:    mrs x8, TPIDR2_EL0
+; CHECK-NEXT:    sub x0, x29, #80
+; CHECK-NEXT:    cbnz x8, .LBB8_6
+; CHECK-NEXT:  // %bb.5: // %unwind_dtors
+; CHECK-NEXT:    bl __arm_tpidr2_restore
+; CHECK-NEXT:  .LBB8_6: // %unwind_dtors
+; CHECK-NEXT:    msr TPIDR2_EL0, xzr
+; CHECK-NEXT:    blr x19
+; CHECK-NEXT:    sub x8, x29, #80
+; CHECK-NEXT:    mov x0, x20
+; CHECK-NEXT:    msr TPIDR2_EL0, x8
+; CHECK-NEXT:    bl _Unwind_Resume
+;
+; CHECK-SDAG-LABEL: try_catch_shared_za_callee_zt0_saved:
+; CHECK-SDAG:       .Lfunc_begin8:
+; CHECK-SDAG-NEXT:    .cfi_startproc
+; CHECK-SDAG-NEXT:    .cfi_personality 156, DW.ref.__gxx_personality_v0
+; CHECK-SDAG-NEXT:    .cfi_lsda 28, .Lexception8
+; CHECK-SDAG-NEXT:  // %bb.0:
+; CHECK-SDAG-NEXT:    stp x29, x30, [sp, #-48]! // 16-byte Folded Spill
+; CHECK-SDAG-NEXT:    stp x22, x21, [sp, #16] // 16-byte Folded Spill
+; CHECK-SDAG-NEXT:    mov x29, sp
+; CHECK-SDAG-NEXT:    stp x20, x19, [sp, #32] // 16-byte Folded Spill
+; CHECK-SDAG-NEXT:    sub sp, sp, #80
+; CHECK-SDAG-NEXT:    .cfi_def_cfa w29, 48
+; CHECK-SDAG-NEXT:    .cfi_offset w19, -8
+; CHECK-SDAG-NEXT:    .cfi_offset w20, -16
+; CHECK-SDAG-NEXT:    .cfi_offset w21, -24
+; CHECK-SDAG-NEXT:    .cfi_offset w22, -32
+; CHECK-SDAG-NEXT:    .cfi_offset w30, -40
+; CHECK-SDAG-NEXT:    .cfi_offset w29, -48
+; CHECK-SDAG-NEXT:    rdsvl x8, #1
+; CHECK-SDAG-NEXT:    mov x9, sp
+; CHECK-SDAG-NEXT:    mov x19, x0
+; CHECK-SDAG-NEXT:    msub x9, x8, x8, x9
+; CHECK-SDAG-NEXT:    mov sp, x9
+; CHECK-SDAG-NEXT:    stp x9, x8, [x29, #-16]
+; CHECK-SDAG-NEXT:  .Ltmp24: // EH_LABEL
+; CHECK-SDAG-NEXT:    sub x8, x29, #16
+; CHECK-SDAG-NEXT:    sub x20, x29, #80
+; CHECK-SDAG-NEXT:    msr TPIDR2_EL0, x8
+; CHECK-SDAG-NEXT:    str zt0, [x20]
+; CHECK-SDAG-NEXT:    bl may_throw
+; CHECK-SDAG-NEXT:    smstart za
+; CHECK-SDAG-NEXT:    ldr zt0, [x20]
+; CHECK-SDAG-NEXT:    mrs x8, TPIDR2_EL0
+; CHECK-SDAG-NEXT:    sub x0, x29, #16
+; CHECK-SDAG-NEXT:    cbnz x8, .LBB8_2
+; CHECK-SDAG-NEXT:  // %bb.1:
+; CHECK-SDAG-NEXT:    bl __arm_tpidr2_restore
+; CHECK-SDAG-NEXT:  .LBB8_2:
+; CHECK-SDAG-NEXT:    msr TPIDR2_EL0, xzr
+; CHECK-SDAG-NEXT:  .Ltmp25: // EH_LABEL
+; CHECK-SDAG-NEXT:  // %bb.3: // %return_normally
+; CHECK-SDAG-NEXT:    mov sp, x29
+; CHECK-SDAG-NEXT:    ldp x20, x19, [sp, #32] // 16-byte Folded Reload
+; CHECK-SDAG-NEXT:    ldp x22, x21, [sp, #16] // 16-byte Folded Reload
+; CHECK-SDAG-NEXT:    ldp x29, x30, [sp], #48 // 16-byte Folded Reload
+; CHECK-SDAG-NEXT:    ret
+; CHECK-SDAG-NEXT:  .LBB8_4: // %unwind_dtors
+; CHECK-SDAG-NEXT:  .Ltmp26: // EH_LABEL
+; CHECK-SDAG-NEXT:    sub x21, x29, #80
+; CHECK-SDAG-NEXT:    sub x22, x29, #16
+; CHECK-SDAG-NEXT:    mov x20, x0
+; CHECK-SDAG-NEXT:    smstart za
+; CHECK-SDAG-NEXT:    ldr zt0, [x21]
+; CHECK-SDAG-NEXT:    mrs x8, TPIDR2_EL0
+; CHECK-SDAG-NEXT:    sub x0, x29, #16
+; CHECK-SDAG-NEXT:    cbnz x8, .LBB8_6
+; CHECK-SDAG-NEXT:  // %bb.5: // %unwind_dtors
+; CHECK-SDAG-NEXT:    bl __arm_tpidr2_restore
+; CHECK-SDAG-NEXT:  .LBB8_6: // %unwind_dtors
+; CHECK-SDAG-NEXT:    msr TPIDR2_EL0, xzr
+; CHECK-SDAG-NEXT:    str zt0, [x21]
+; CHECK-SDAG-NEXT:    blr x19
+; CHECK-SDAG-NEXT:    ldr zt0, [x21]
+; CHECK-SDAG-NEXT:    mov x0, x20
+; CHECK-SDAG-NEXT:    msr TPIDR2_EL0, x22
+; CHECK-SDAG-NEXT:    str zt0, [x21]
+; CHECK-SDAG-NEXT:    bl _Unwind_Resume
+; CHECK-SDAG-NEXT:    smstart za
+; CHECK-SDAG-NEXT:    ldr zt0, [x21]
+; CHECK-SDAG-NEXT:    mrs x8, TPIDR2_EL0
+; CHECK-SDAG-NEXT:    sub x0, x29, #16
+; CHECK-SDAG-NEXT:    cbnz x8, .LBB8_8
+; CHECK-SDAG-NEXT:  // %bb.7: // %unwind_dtors
+; CHECK-SDAG-NEXT:    bl __arm_tpidr2_restore
+; CHECK-SDAG-NEXT:  .LBB8_8: // %unwind_dtors
+; CHECK-SDAG-NEXT:    msr TPIDR2_EL0, xzr
+  invoke void @may_throw()
+          to label %return_normally unwind label %unwind_dtors
+
+unwind_dtors:
+  %5 = landingpad { ptr, i32 }
+          cleanup
+  call void %callee() "aarch64_inout_za"
+  resume { ptr, i32 } %5
+
+return_normally:
+  ret void
+}
+
 declare ptr @__cxa_allocate_exception(i64)
 declare void @__cxa_throw(ptr, ptr, ptr)
 declare ptr @__cxa_begin_catch(ptr)
diff --git a/llvm/test/CodeGen/AArch64/sme-zt0-state.ll b/llvm/test/CodeGen/AArch64/sme-zt0-state.ll
index 69c69f027a33f..0d4a39b2eeb2f 100644
--- a/llvm/test/CodeGen/AArch64/sme-zt0-state.ll
+++ b/llvm/test/CodeGen/AArch64/sme-zt0-state.ll
@@ -193,7 +193,7 @@ define void @zt0_new_caller_zt0_new_callee(ptr %callee) "aarch64_new_zt0" nounwi
 ; CHECK-NEWLOWERING-LABEL: zt0_new_caller_zt0_new_callee:
 ; CHECK-NEWLOWERING:       // %bb.0:
 ; CHECK-NEWLOWERING-NEXT:    sub sp, sp, #80
-; CHECK-NEWLOWERING-NEXT:    stp x30, x19, [sp, #64] // 16-byte Folded Spill
+; CHECK-NEWLOWERING-NEXT:    str x30, [sp, #64] // 8-byte Spill
 ; CHECK-NEWLOWERING-NEXT:    mrs x8, TPIDR2_EL0
 ; CHECK-NEWLOWERING-NEXT:    cbz x8, .LBB6_2
 ; CHECK-NEWLOWERING-NEXT:  // %bb.1:
@@ -202,14 +202,11 @@ define void @zt0_new_caller_zt0_new_callee(ptr %callee) "aarch64_new_zt0" nounwi
 ; CHECK-NEWLOWERING-NEXT:    zero { zt0 }
 ; CHECK-NEWLOWERING-NEXT:  .LBB6_2:
 ; CHECK-NEWLOWERING-NEXT:    smstart za
-; CHECK-NEWLOWERING-NEXT:    mov x19, sp
-; CHECK-NEWLOWERING-NEXT:    str zt0, [x19]
+; CHECK-NEWLOWERING-NEXT:    mov x8, sp
+; CHECK-NEWLOWERING-NEXT:    str zt0, [x8]
 ; CHECK-NEWLOWERING-NEXT:    smstop za
 ; CHECK-NEWLOWERING-NEXT:    blr x0
-; CHECK-NEWLOWERING-NEXT:    smstart za
-; CHECK-NEWLOWERING-NEXT:    ldr zt0, [x19]
-; CHECK-NEWLOWERING-NEXT:    smstop za
-; CHECK-NEWLOWERING-NEXT:    ldp x30, x19, [sp, #64] // 16-byte Folded Reload
+; CHECK-NEWLOWERING-NEXT:    ldr x30, [sp, #64] // 8-byte Reload
 ; CHECK-NEWLOWERING-NEXT:    add sp, sp, #80
 ; CHECK-NEWLOWERING-NEXT:    ret
   call void %callee() "aarch64_new_zt0";
@@ -246,7 +243,7 @@ define i64 @zt0_new_caller_abi_routine_callee() "aarch64_new_zt0" nounwind {
 ; CHECK-NEWLOWERING-LABEL: zt0_new_caller_abi_routine_callee:
 ; CHECK-NEWLOWERING:       // %bb.0:
 ; CHECK-NEWLOWERING-NEXT:    sub sp, sp, #80
-; CHECK-NEWLOWERING-NEXT:    stp x30, x19, [sp, #64] // 16-byte Folded Spill
+; CHECK-NEWLOWERING-NEXT:    str x30, [sp, #64] // 8-byte Spill
 ; CHECK-NEWLOWERING-NEXT:    mrs x8, TPIDR2_EL0
 ; CHECK-NEWLOWERING-NEXT:    cbz x8, .LBB7_2
 ; CHECK-NEWLOWERING-NEXT:  // %bb.1:
@@ -255,12 +252,11 @@ define i64 @zt0_new_caller_abi_routine_callee() "aarch64_new_zt0" nounwind {
 ; CHECK-NEWLOWERING-NEXT:    zero { zt0 }
 ; CHECK-NEWLOWERING-NEXT:  .LBB7_2:
 ; CHECK-NEWLOWERING-NEXT:    smstart za
-; CHECK-NEWLOWERING-NEXT:    mov x19, sp
-; CHECK-NEWLOWERING-NEXT:    str zt0, [x19]
-; CHECK-NEWLOWERING-NEXT:    bl __arm_sme_state
-; CHECK-NEWLOWERING-NEXT:    ldr zt0, [x19]
+; CHECK-NEWLOWERING-NEXT:    mov x8, sp
+; CHECK-NEWLOWERING-NEXT:    str zt0, [x8]
 ; CHECK-NEWLOWERING-NEXT:    smstop za
-; CHECK-NEWLOWERING-NEXT:    ldp x30, x19, [sp, #64] // 16-byte Folded Reload
+; CHECK-NEWLOWERING-NEXT:    bl __arm_sme_state
+; CHECK-NEWLOWERING-NEXT:    ldr x30, [sp, #64] // 8-byte Reload
 ; CHECK-NEWLOWERING-NEXT:    add sp, sp, #80
 ; CHECK-NEWLOWERING-NEXT:    ret
   %res = call {i64, i64} @__arm_sme_state()
@@ -382,37 +378,57 @@ define void @shared_za_new_zt0(ptr %callee) "aarch64_inout_za" "aarch64_new_zt0"
 
 
 define void @zt0_multiple_private_za_calls(ptr %callee) "aarch64_in_zt0" nounwind {
-; CHECK-COMMON-LABEL: zt0_multiple_private_za_calls:
-; CHECK-COMMON:       // %bb.0:
-; CHECK-COMMON-NEXT:    sub sp, sp, #96
-; CHECK-COMMON-NEXT:    stp x20, x19, [sp, #80] // 16-byte Folded Spill
-; CHECK-COMMON-NEXT:    mov x20, sp
-; CHECK-COMMON-NEXT:    mov x19, x0
-; CHECK-COMMON-NEXT:    str x30, [sp, #64] // 8-byte Spill
-; CHECK-COMMON-NEXT:    str zt0, [x20]
-; CHECK-COMMON-NEXT:    smstop za
-; CHECK-COMMON-NEXT:    blr x0
-; CHECK-COMMON-NEXT:    smstart za
-; CHECK-COMMON-NEXT:    ldr zt0, [x20]
-; CHECK-COMMON-NEXT:    str zt0, [x20]
-; CHECK-COMMON-NEXT:    smstop za
-; CHECK-COMMON-NEXT:    blr x19
-; CHECK-COMMON-NEXT:    smstart za
-; CHECK-COMMON-NEXT:    ldr zt0, [x20]
-; CHECK-COMMON-NEXT:    str zt0, [x20]
-; CHECK-COMMON-NEXT:    smstop za
-; CHECK-COMMON-NEXT:    blr x19
-; CHECK-COMMON-NEXT:    smstart za
-; CHECK-COMMON-NEXT:    ldr zt0, [x20]
-; CHECK-COMMON-NEXT:    str zt0, [x20]
-; CHECK-COMMON-NEXT:    smstop za
-; CHECK-COMMON-NEXT:    blr x19
-; CHECK-COMMON-NEXT:    smstart za
-; CHECK-COMMON-NEXT:    ldr zt0, [x20]
-; CHECK-COMMON-NEXT:    ldp x20, x19, [sp, #80] // 16-byte Folded Reload
-; CHECK-COMMON-NEXT:    ldr x30, [sp, #64] // 8-byte Reload
-; CHECK-COMMON-NEXT:    add sp, sp, #96
-; CHECK-COMMON-NEXT:    ret
+; CHECK-LABEL: zt0_multiple_private_za_calls:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sub sp, sp, #96
+; CHECK-NEXT:    stp x20, x19, [sp, #80] // 16-byte Folded Spill
+; CHECK-NEXT:    mov x20, sp
+; CHECK-NEXT:    mov x19, x0
+; CHECK-NEXT:    str x30, [sp, #64] // 8-byte Spill
+; CHECK-NEXT:    str zt0, [x20]
+; CHECK-NEXT:    smstop za
+; CHECK-NEXT:    blr x0
+; CHECK-NEXT:    smstart za
+; CHECK-NEXT:    ldr zt0, [x20]
+; CHECK-NEXT:    str zt0, [x20]
+; CHECK-NEXT:    smstop za
+; CHECK-NEXT:    blr x19
+; CHECK-NEXT:    smstart za
+; CHECK-NEXT:    ldr zt0, [x20]
+; CHECK-NEXT:    str zt0, [x20]
+; CHECK-NEXT:    smstop za
+; CHECK-NEXT:    blr x19
+; CHECK-NEXT:    smstart za
+; CHECK-NEXT:    ldr zt0, [x20]
+; CHECK-NEXT:    str zt0, [x20]
+; CHECK-NEXT:    smstop za
+; CHECK-NEXT:    blr x19
+; CHECK-NEXT:    smstart za
+; CHECK-NEXT:    ldr zt0, [x20]
+; CHECK-NEXT:    ldp x20, x19, [sp, #80] // 16-byte Folded Reload
+; CHECK-NEXT:    ldr x30, [sp, #64] // 8-byte Reload
+; CHECK-NEXT:    add sp, sp, #96
+; CHECK-NEXT:    ret
+;
+; CHECK-NEWLOWERING-LABEL: zt0_multiple_private_za_calls:
+; CHECK-NEWLOWERING:       // %bb.0:
+; CHECK-NEWLOWERING-NEXT:    sub sp, sp, #96
+; CHECK-NEWLOWERING-NEXT:    stp x20, x19, [sp, #80] // 16-byte Folded Spill
+; CHECK-NEWLOWERING-NEXT:    mov x20, sp
+; CHECK-NEWLOWERING-NEXT:    mov x19, x0
+; CHECK-NEWLOWERING-NEXT:    str x30, [sp, #64] // 8-byte Spill
+; CHECK-NEWLOWERING-NEXT:    str zt0, [x20]
+; CHECK-NEWLOWERING-NEXT:    smstop za
+; CHECK-NEWLOWERING-NEXT:    blr x0
+; CHECK-NEWLOWERING-NEXT:    blr x19
+; CHECK-NEWLOWERING-NEXT:    blr x19
+; CHECK-NEWLOWERING-NEXT:    blr x19
+; CHECK-NEWLOWERING-NEXT:    smstart za
+; CHECK-NEWLOWERING-NEXT:    ldr zt0, [x20]
+; CHECK-NEWLOWERING-NEXT:    ldp x20, x19, [sp, #80] // 16-byte Folded Reload
+; CHECK-NEWLOWERING-NEXT:    ldr x30, [sp, #64] // 8-byte Reload
+; CHECK-NEWLOWERING-NEXT:    add sp, sp, #96
+; CHECK-NEWLOWERING-NEXT:    ret
   call void %callee()
   call void %callee()
   call void %callee()
diff --git a/llvm/test/CodeGen/AArch64/sve-extract-scalable-vector.ll b/llvm/test/CodeGen/AArch64/sve-extract-scalable-vector.ll
index 8fc27248abac3..1cfff7e239de4 100644
--- a/llvm/test/CodeGen/AArch64/sve-extract-scalable-vector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-extract-scalable-vector.ll
@@ -3,12 +3,199 @@
 
 ; Extracting illegal subvectors
 
-define <vscale x 1 x i32> @extract_nxv1i32_nxv4i32(<vscale x 4 x i32> %vec) nounwind {
-; CHECK-LABEL: extract_nxv1i32_nxv4i32:
+; NOTE: Insert sub-vector into a legal type to avoid relying on an undefined
+; calling convention.
+define <vscale x 4 x i32> @extract_nxv1i32_nxv4i32_0(<vscale x 4 x i32> %vec) nounwind {
+; CHECK-LABEL: extract_nxv1i32_nxv4i32_0:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ret
-  %retval = call <vscale x 1 x i32> @llvm.vector.extract.nxv1i32.nxv4i32(<vscale x 4 x i32> %vec, i64 0)
-  ret <vscale x 1 x i32> %retval
+  %e = call <vscale x 1 x i32> @llvm.vector.extract.nxv1i32.nxv4i32(<vscale x 4 x i32> %vec, i64 0)
+  %retval = call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.nxv1i32(<vscale x 4 x i32> poison, <vscale x 1 x i32> %e, i64 0)
+  ret <vscale x 4 x i32> %retval
+}
+
+; NOTE: Insert sub-vector into a legal type to avoid relying on an undefined
+; calling convention.
+define <vscale x 4 x i32> @extract_nxv1i32_nxv4i32_1(<vscale x 4 x i32> %vec) nounwind {
+; CHECK-LABEL: extract_nxv1i32_nxv4i32_1:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    mov x9, sp
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    whilelo p0.s, xzr, x8
+; CHECK-NEXT:    cntw x8
+; CHECK-NEXT:    add x8, x9, x8
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %e = call <vscale x 1 x i32> @llvm.vector.extract.nxv1i32.nxv4i32(<vscale x 4 x i32> %vec, i64 1)
+  %retval = call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.nxv1i32(<vscale x 4 x i32> poison, <vscale x 1 x i32> %e, i64 0)
+  ret <vscale x 4 x i32> %retval
+}
+
+; NOTE: Insert sub-vector into a legal type to avoid relying on an undefined
+; calling convention.
+define <vscale x 4 x i32> @extract_nxv1i32_nxv4i32_2(<vscale x 4 x i32> %vec) nounwind {
+; CHECK-LABEL: extract_nxv1i32_nxv4i32_2:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    mov x9, sp
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    whilelo p0.s, xzr, x8
+; CHECK-NEXT:    cnth x8
+; CHECK-NEXT:    add x8, x9, x8
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %e = call <vscale x 1 x i32> @llvm.vector.extract.nxv1i32.nxv4i32(<vscale x 4 x i32> %vec, i64 2)
+  %retval = call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.nxv1i32(<vscale x 4 x i32> poison, <vscale x 1 x i32> %e, i64 0)
+  ret <vscale x 4 x i32> %retval
+}
+
+; NOTE: Insert sub-vector into a legal type to avoid relying on an undefined
+; calling convention.
+define <vscale x 4 x i32> @extract_nxv1i32_nxv4i32_3(<vscale x 4 x i32> %vec) nounwind {
+; CHECK-LABEL: extract_nxv1i32_nxv4i32_3:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    mov x9, sp
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    whilelo p0.s, xzr, x8
+; CHECK-NEXT:    cntw x8, all, mul #3
+; CHECK-NEXT:    add x8, x9, x8
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %e = call <vscale x 1 x i32> @llvm.vector.extract.nxv1i32.nxv4i32(<vscale x 4 x i32> %vec, i64 3)
+  %retval = call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.nxv1i32(<vscale x 4 x i32> poison, <vscale x 1 x i32> %e, i64 0)
+  ret <vscale x 4 x i32> %retval
+}
+
+; NOTE: Insert sub-vector into a legal type to avoid relying on an undefined
+; calling convention.
+define <vscale x 2 x float> @extract_nxv1f32_nxv2f32_0(<vscale x 2 x float> %vec) nounwind {
+; CHECK-LABEL: extract_nxv1f32_nxv2f32_0:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ret
+  %e = call <vscale x 1 x float> @llvm.vector.extract.nxv1f32.nxv2f32(<vscale x 2 x float> %vec, i64 0)
+  %retval = call <vscale x 2 x float> @llvm.vector.insert.nxv2f32.nxv1f32(<vscale x 2 x float> poison, <vscale x 1 x float> %e, i64 0)
+  ret <vscale x 2 x float> %retval
+}
+
+; NOTE: Insert sub-vector into a legal type to avoid relying on an undefined
+; calling convention.
+define <vscale x 2 x float> @extract_nxv1f32_nxv2f32_1(<vscale x 2 x float> %vec) nounwind {
+; CHECK-LABEL: extract_nxv1f32_nxv2f32_1:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    ptrue p0.d
+; CHECK-NEXT:    addpl x9, sp, #4
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    st1w { z0.d }, p0, [sp, #1, mul vl]
+; CHECK-NEXT:    whilelo p1.d, xzr, x8
+; CHECK-NEXT:    cntw x8
+; CHECK-NEXT:    add x8, x9, x8
+; CHECK-NEXT:    ld1w { z0.d }, p1/z, [x8]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %e = call <vscale x 1 x float> @llvm.vector.extract.nxv1f32.nxv2f32(<vscale x 2 x float> %vec, i64 1)
+  %retval = call <vscale x 2 x float> @llvm.vector.insert.nxv2f32.nxv1f32(<vscale x 2 x float> poison, <vscale x 1 x float> %e, i64 0)
+  ret <vscale x 2 x float> %retval
+}
+
+; NOTE: Insert sub-vector into a legal type to avoid relying on an undefined
+; calling convention.
+define <vscale x 4 x float> @extract_nxv1f32_nxv4f32_0(<vscale x 4 x float> %vec) nounwind {
+; CHECK-LABEL: extract_nxv1f32_nxv4f32_0:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ret
+  %e = call <vscale x 1 x float> @llvm.vector.extract.nxv1f32.nxv4f32(<vscale x 4 x float> %vec, i64 0)
+  %retval = call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.nxv1f32(<vscale x 4 x float> poison, <vscale x 1 x float> %e, i64 0)
+  ret <vscale x 4 x float> %retval
+}
+
+; NOTE: Insert sub-vector into a legal type to avoid relying on an undefined
+; calling convention.
+define <vscale x 4 x float> @extract_nxv1f32_nxv4f32_1(<vscale x 4 x float> %vec) nounwind {
+; CHECK-LABEL: extract_nxv1f32_nxv4f32_1:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    mov x9, sp
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    whilelo p0.d, xzr, x8
+; CHECK-NEXT:    cntw x8
+; CHECK-NEXT:    add x8, x9, x8
+; CHECK-NEXT:    ld1w { z0.d }, p0/z, [x8]
+; CHECK-NEXT:    uzp1 z0.s, z0.s, z0.s
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %e = call <vscale x 1 x float> @llvm.vector.extract.nxv1f32.nxv4f32(<vscale x 4 x float> %vec, i64 1)
+  %retval = call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.nxv1f32(<vscale x 4 x float> poison, <vscale x 1 x float> %e, i64 0)
+  ret <vscale x 4 x float> %retval
+}
+
+; NOTE: Insert sub-vector into a legal type to avoid relying on an undefined
+; calling convention.
+define <vscale x 4 x float> @extract_nxv1f32_nxv4f32_2(<vscale x 4 x float> %vec) nounwind {
+; CHECK-LABEL: extract_nxv1f32_nxv4f32_2:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    whilelo p0.d, xzr, x8
+; CHECK-NEXT:    ld1w { z0.d }, p0/z, [sp, #1, mul vl]
+; CHECK-NEXT:    uzp1 z0.s, z0.s, z0.s
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %e = call <vscale x 1 x float> @llvm.vector.extract.nxv1f32.nxv4f32(<vscale x 4 x float> %vec, i64 2)
+  %retval = call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.nxv1f32(<vscale x 4 x float> poison, <vscale x 1 x float> %e, i64 0)
+  ret <vscale x 4 x float> %retval
+}
+
+; NOTE: Insert sub-vector into a legal type to avoid relying on an undefined
+; calling convention.
+define <vscale x 4 x float> @extract_nxv1f32_nxv4f32_3(<vscale x 4 x float> %vec) nounwind {
+; CHECK-LABEL: extract_nxv1f32_nxv4f32_3:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    mov x9, sp
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    whilelo p0.d, xzr, x8
+; CHECK-NEXT:    cntw x8, all, mul #3
+; CHECK-NEXT:    add x8, x9, x8
+; CHECK-NEXT:    ld1w { z0.d }, p0/z, [x8]
+; CHECK-NEXT:    uzp1 z0.s, z0.s, z0.s
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %e = call <vscale x 1 x float> @llvm.vector.extract.nxv1f32.nxv4f32(<vscale x 4 x float> %vec, i64 3)
+  %retval = call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.nxv1f32(<vscale x 4 x float> poison, <vscale x 1 x float> %e, i64 0)
+  ret <vscale x 4 x float> %retval
 }
 
 define <vscale x 1 x i16> @extract_nxv1i16_nxv6i16(<vscale x 6 x i16> %vec) nounwind {
@@ -19,9 +206,6 @@ define <vscale x 1 x i16> @extract_nxv1i16_nxv6i16(<vscale x 6 x i16> %vec) noun
   ret <vscale x 1 x i16> %retval
 }
 
-declare <vscale x 1 x i32> @llvm.vector.extract.nxv1i32.nxv4i32(<vscale x 4 x i32>, i64)
-declare <vscale x 1 x i16> @llvm.vector.extract.nxv1i16.nxv6i16(<vscale x 6 x i16>, i64)
-
 ;
 ; Extract half i1 vector that needs promotion from legal type.
 ;
@@ -43,8 +227,6 @@ define <vscale x 8 x i1> @extract_nxv8i1_nxv16i1_8(<vscale x 16 x i1> %in) {
   ret <vscale x 8 x i1> %res
 }
 
-declare <vscale x 8 x i1> @llvm.vector.extract.nxv8i1.nxv16i1(<vscale x 16 x i1>, i64)
-
 ;
 ; Extract i1 vector that needs widening from one that needs widening.
 ;
@@ -99,8 +281,6 @@ define <vscale x 14 x i1> @extract_nxv14i1_nxv28i1_14(<vscale x 28 x i1> %in) uw
   ret <vscale x 14 x i1> %res
 }
 
-declare <vscale x 14 x i1> @llvm.vector.extract.nxv14i1.nxv28i1(<vscale x 28 x i1>, i64)
-
 ;
 ; Extract half i1 vector that needs promotion from one that needs splitting.
 ;
@@ -140,8 +320,6 @@ define <vscale x 8 x i1> @extract_nxv8i1_nxv32i1_24(<vscale x 32 x i1> %in) {
   ret <vscale x 8 x i1> %res
 }
 
-declare <vscale x 8 x i1> @llvm.vector.extract.nxv8i1.nxv32i1(<vscale x 32 x i1>, i64)
-
 ;
 ; Extract 1/4th i1 vector that needs promotion from legal type.
 ;
@@ -185,8 +363,6 @@ define <vscale x 4 x i1> @extract_nxv4i1_nxv16i1_12(<vscale x 16 x i1> %in) {
   ret <vscale x 4 x i1> %res
 }
 
-declare <vscale x 4 x i1> @llvm.vector.extract.nxv4i1.nxv16i1(<vscale x 16 x i1>, i64)
-
 ;
 ; Extract 1/8th i1 vector that needs promotion from legal type.
 ;
@@ -278,8 +454,6 @@ define <vscale x 2 x i1> @extract_nxv2i1_nxv16i1_14(<vscale x 16 x i1> %in) {
   ret <vscale x 2 x i1> %res
 }
 
-declare <vscale x 2 x i1> @llvm.vector.extract.nxv2i1.nxv16i1(<vscale x 16 x i1>, i64)
-
 ;
 ; Extract i1 vector that needs promotion from one that needs widening.
 ;
@@ -313,8 +487,6 @@ define <vscale x 4 x i1> @extract_nxv4i1_nxv12i1_8(<vscale x 12 x i1> %in) {
   ret <vscale x 4 x i1> %res
 }
 
-declare <vscale x 4 x i1> @llvm.vector.extract.nxv4i1.nxv12i1(<vscale x 12 x i1>, i64)
-
 ;
 ; Extract 1/8th i8 vector that needs promotion from legal type.
 ;
@@ -406,8 +578,6 @@ define <vscale x 2 x i8> @extract_nxv2i8_nxv16i8_14(<vscale x 16 x i8> %in) {
   ret <vscale x 2 x i8> %res
 }
 
-declare <vscale x 2 x i8> @llvm.vector.extract.nxv2i8.nxv16i8(<vscale x 16 x i8>, i64)
-
 ;
 ; Extract i8 vector that needs promotion from one that needs widening.
 ;
@@ -441,8 +611,6 @@ define <vscale x 4 x i8> @extract_nxv4i8_nxv12i8_8(<vscale x 12 x i8> %in) {
   ret <vscale x 4 x i8> %res
 }
 
-declare <vscale x 4 x i8> @llvm.vector.extract.nxv4i8.nxv12i8(<vscale x 12 x i8>, i64)
-
 ;
 ; Extract i8 vector that needs both widening + promotion from one that needs widening.
 ; (nxv6i8 -> nxv8i8 -> nxv8i16)
@@ -474,8 +642,6 @@ define <vscale x 6 x i8> @extract_nxv6i8_nxv12i8_6(<vscale x 12 x i8> %in) {
   ret <vscale x 6 x i8> %res
 }
 
-declare <vscale x 6 x i8> @llvm.vector.extract.nxv6i8.nxv12i8(<vscale x 12 x i8>, i64)
-
 ;
 ; Extract half i8 vector that needs promotion from one that needs splitting.
 ;
@@ -515,8 +681,6 @@ define <vscale x 8 x i8> @extract_nxv8i8_nxv32i8_24(<vscale x 32 x i8> %in) {
   ret <vscale x 8 x i8> %res
 }
 
-declare <vscale x 8 x i8> @llvm.vector.extract.nxv8i8.nxv32i8(<vscale x 32 x i8>, i64)
-
 ;
 ; Extract half i8 vector that needs promotion from legal type.
 ;
@@ -538,8 +702,6 @@ define <vscale x 8 x i8> @extract_nxv8i8_nxv16i8_8(<vscale x 16 x i8> %in) {
   ret <vscale x 8 x i8> %res
 }
 
-declare <vscale x 8 x i8> @llvm.vector.extract.nxv8i8.nxv16i8(<vscale x 16 x i8>, i64)
-
 ;
 ; Extract i8 vector that needs widening from one that needs widening.
 ;
@@ -618,8 +780,6 @@ define <vscale x 14 x i8> @extract_nxv14i8_nxv28i8_14(<vscale x 28 x i8> %in) {
   ret <vscale x 14 x i8> %res
 }
 
-declare <vscale x 14 x i8> @llvm.vector.extract.nxv14i8.nxv28i8(<vscale x 28 x i8>, i64)
-
 ;
 ; Extract 1/4th i8 vector that needs promotion from legal type.
 ;
@@ -663,8 +823,6 @@ define <vscale x 4 x i8> @extract_nxv4i8_nxv16i8_12(<vscale x 16 x i8> %in) {
   ret <vscale x 4 x i8> %res
 }
 
-declare <vscale x 4 x i8> @llvm.vector.extract.nxv4i8.nxv16i8(<vscale x 16 x i8>, i64)
-
 ;
 ; Extract f16 vector that needs promotion from one that needs widening.
 ;
@@ -698,8 +856,6 @@ define <vscale x 2 x half> @extract_nxv2f16_nxv6f16_4(<vscale x 6 x half> %in) {
   ret <vscale x 2 x half> %res
 }
 
-declare <vscale x 2 x half> @llvm.vector.extract.nxv2f16.nxv6f16(<vscale x 6 x half>, i64)
-
 ;
 ; Extract half f16 vector that needs promotion from legal type.
 ;
@@ -721,8 +877,6 @@ define <vscale x 4 x half> @extract_nxv4f16_nxv8f16_4(<vscale x 8 x half> %in) {
   ret <vscale x 4 x half> %res
 }
 
-declare <vscale x 4 x half> @llvm.vector.extract.nxv4f16.nxv8f16(<vscale x 8 x half>, i64)
-
 ;
 ; Extract f16 vector that needs widening from one that needs widening.
 ;
@@ -750,8 +904,6 @@ define <vscale x 6 x half> @extract_nxv6f16_nxv12f16_6(<vscale x 12 x half> %in)
   ret <vscale x 6 x half> %res
 }
 
-declare <vscale x 6 x half> @llvm.vector.extract.nxv6f16.nxv12f16(<vscale x 12 x half>, i64)
-
 ;
 ; Extract half f16 vector that needs promotion from one that needs splitting.
 ;
@@ -791,8 +943,6 @@ define <vscale x 4 x half> @extract_nxv4f16_nxv16f16_12(<vscale x 16 x half> %in
   ret <vscale x 4 x half> %res
 }
 
-declare <vscale x 4 x half> @llvm.vector.extract.nxv4f16.nxv16f16(<vscale x 16 x half>, i64)
-
 ;
 ; Extract 1/4th f16 vector that needs promotion from legal type.
 ;
@@ -836,8 +986,6 @@ define <vscale x 2 x half> @extract_nxv2f16_nxv8f16_6(<vscale x 8 x half> %in) {
   ret <vscale x 2 x half> %res
 }
 
-declare <vscale x 2 x half> @llvm.vector.extract.nxv2f16.nxv8f16(<vscale x 8 x half>, i64)
-
 ;
 ; Extract half bf16 vector that needs promotion from legal type.
 ;
@@ -859,8 +1007,6 @@ define <vscale x 4 x bfloat> @extract_nxv4bf16_nxv8bf16_4(<vscale x 8 x bfloat>
   ret <vscale x 4 x bfloat> %res
 }
 
-declare <vscale x 4 x bfloat> @llvm.vector.extract.nxv4bf16.nxv8bf16(<vscale x 8 x bfloat>, i64)
-
 ;
 ; Extract bf16 vector that needs widening from one that needs widening.
 ;
@@ -888,8 +1034,6 @@ define <vscale x 6 x bfloat> @extract_nxv6bf16_nxv12bf16_6(<vscale x 12 x bfloat
   ret <vscale x 6 x bfloat> %res
 }
 
-declare <vscale x 6 x bfloat> @llvm.vector.extract.nxv6bf16.nxv12bf16(<vscale x 12 x bfloat>, i64)
-
 ;
 ; Extract bf16 vector that needs promotion from one that needs widening.
 ;
@@ -923,8 +1067,6 @@ define <vscale x 2 x bfloat> @extract_nxv2bf16_nxv6bf16_4(<vscale x 6 x bfloat>
   ret <vscale x 2 x bfloat> %res
 }
 
-declare <vscale x 2 x bfloat> @llvm.vector.extract.nxv2bf16.nxv6bf16(<vscale x 6 x bfloat>, i64)
-
 ;
 ; Extract 1/4th bf16 vector that needs promotion from legal type.
 ;
@@ -968,8 +1110,6 @@ define <vscale x 2 x bfloat> @extract_nxv2bf16_nxv8bf16_6(<vscale x 8 x bfloat>
   ret <vscale x 2 x bfloat> %res
 }
 
-declare <vscale x 2 x bfloat> @llvm.vector.extract.nxv2bf16.nxv8bf16(<vscale x 8 x bfloat>, i64)
-
 ;
 ; Extract half bf16 vector that needs promotion from one that needs splitting.
 ;
@@ -1009,9 +1149,6 @@ define <vscale x 4 x bfloat> @extract_nxv4bf16_nxv16bf16_12(<vscale x 16 x bfloa
   ret <vscale x 4 x bfloat> %res
 }
 
-declare <vscale x 4 x bfloat> @llvm.vector.extract.nxv4bf16.nxv16bf16(<vscale x 16 x bfloat>, i64)
-
-
 ;
 ; Extract from a splat
 ;
@@ -1063,9 +1200,6 @@ define <vscale x 2 x i1> @extract_nxv2i1_nxv16i1_all_zero() {
   ret <vscale x 2 x i1> %ext
 }
 
-declare <vscale x 2 x float> @llvm.vector.extract.nxv2f32.nxv4f32(<vscale x 4 x float>, i64)
-declare <vscale x 4 x i32> @llvm.vector.extract.nxv4i32.nxv8i32(<vscale x 8 x i32>, i64)
-
 ;
 ; Extract nxv1i1 type from: nxv2i1
 ;
@@ -1420,8 +1554,3 @@ define <vscale x 1 x i1> @extract_nxv1i1_nxv16i1_15(<vscale x 16 x i1> %in) {
   %res = call <vscale x 1 x i1> @llvm.vector.extract.nxv1i1.nxv16i1(<vscale x 16 x i1> %in, i64 15)
   ret <vscale x 1 x i1> %res
 }
-
-declare <vscale x 1 x i1> @llvm.vector.extract.nxv1i1.nxv2i1(<vscale x 2 x i1>, i64)
-declare <vscale x 1 x i1> @llvm.vector.extract.nxv1i1.nxv4i1(<vscale x 4 x i1>, i64)
-declare <vscale x 1 x i1> @llvm.vector.extract.nxv1i1.nxv8i1(<vscale x 8 x i1>, i64)
-declare <vscale x 1 x i1> @llvm.vector.extract.nxv1i1.nxv16i1(<vscale x 16 x i1>, i64)
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-vector-extract-256-bits.ll b/llvm/test/CodeGen/AArch64/sve-fixed-vector-extract-256-bits.ll
new file mode 100644
index 0000000000000..71c9a941807a4
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-vector-extract-256-bits.ll
@@ -0,0 +1,23 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve -aarch64-sve-vector-bits-min=256 -aarch64-sve-vector-bits-max=256 < %s -o - | FileCheck %s
+
+; Note: This test case is reduced from: https://github.com/llvm/llvm-project/pull/166748#issuecomment-3600498185
+
+define i32 @test_extract_v8i32_from_nxv8i32(<vscale x 8 x i32> %vec) nounwind {
+; CHECK-LABEL: test_extract_v8i32_from_nxv8i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-2
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    ldr z0, [sp]
+; CHECK-NEXT:    str z1, [sp, #1, mul vl]
+; CHECK-NEXT:    uaddv d0, p0, z0.s
+; CHECK-NEXT:    fmov w0, s0
+; CHECK-NEXT:    addvl sp, sp, #2
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %1 = tail call <8 x i32> @llvm.vector.extract.v8i32.nxv8i32(<vscale x 8 x i32> %vec, i64 0)
+  %2 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %1)
+  ret i32 %2
+}
diff --git a/llvm/test/CodeGen/AArch64/sve-insert-vector.ll b/llvm/test/CodeGen/AArch64/sve-insert-vector.ll
index 00a08e505b943..19827e24184ac 100644
--- a/llvm/test/CodeGen/AArch64/sve-insert-vector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-insert-vector.ll
@@ -1322,49 +1322,238 @@ define <vscale x 16 x i1> @insert_nxv1i1_nxv16i1_15(<vscale x 16 x i1> %vec, <vs
   ret <vscale x 16 x i1> %res
 }
 
-attributes #0 = { vscale_range(2,2) }
+; NOTE: Extract input sub-vector from a legal type to avoid relying on an
+; undefined calling convention.
+define <vscale x 4 x i32> @insert_nxv1i32_nxv4i32_0(<vscale x 4 x i32> %vec, <vscale x 4 x i32> %subvec) nounwind {
+; CHECK-LABEL: insert_nxv1i32_nxv4i32_0:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    whilelo p0.s, xzr, x8
+; CHECK-NEXT:    st1w { z1.s }, p0, [sp]
+; CHECK-NEXT:    ldr z0, [sp]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %i = call <vscale x 1 x i32> @llvm.vector.extract.nxv1i32.nxv4i32(<vscale x 4 x i32> %subvec, i64 0)
+  %retval = call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.nxv1i32(<vscale x 4 x i32> %vec, <vscale x 1 x i32> %i, i64 0)
+  ret <vscale x 4 x i32> %retval
+}
+
+; NOTE: Extract input sub-vector from a legal type to avoid relying on an
+; undefined calling convention.
+define <vscale x 4 x i32> @insert_nxv1i32_nxv4i32_1(<vscale x 4 x i32> %vec, <vscale x 4 x i32> %subvec) nounwind {
+; CHECK-LABEL: insert_nxv1i32_nxv4i32_1:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    mov x9, sp
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    whilelo p0.s, xzr, x8
+; CHECK-NEXT:    cntw x8
+; CHECK-NEXT:    add x8, x9, x8
+; CHECK-NEXT:    st1w { z1.s }, p0, [x8]
+; CHECK-NEXT:    ldr z0, [sp]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %i = call <vscale x 1 x i32> @llvm.vector.extract.nxv1i32.nxv4i32(<vscale x 4 x i32> %subvec, i64 0)
+  %retval = call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.nxv1i32(<vscale x 4 x i32> %vec, <vscale x 1 x i32> %i, i64 1)
+  ret <vscale x 4 x i32> %retval
+}
+
+; NOTE: Extract input sub-vector from a legal type to avoid relying on an
+; undefined calling convention.
+define <vscale x 4 x i32> @insert_nxv1i32_nxv4i32_2(<vscale x 4 x i32> %vec, <vscale x 4 x i32> %subvec) nounwind {
+; CHECK-LABEL: insert_nxv1i32_nxv4i32_2:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    mov x9, sp
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    whilelo p0.s, xzr, x8
+; CHECK-NEXT:    cnth x8
+; CHECK-NEXT:    add x8, x9, x8
+; CHECK-NEXT:    st1w { z1.s }, p0, [x8]
+; CHECK-NEXT:    ldr z0, [sp]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %i = call <vscale x 1 x i32> @llvm.vector.extract.nxv1i32.nxv4i32(<vscale x 4 x i32> %subvec, i64 0)
+  %retval = call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.nxv1i32(<vscale x 4 x i32> %vec, <vscale x 1 x i32> %i, i64 2)
+  ret <vscale x 4 x i32> %retval
+}
+
+; NOTE: Extract input sub-vector from a legal type to avoid relying on an
+; undefined calling convention.
+define <vscale x 4 x i32> @insert_nxv1i32_nxv4i32_3(<vscale x 4 x i32> %vec, <vscale x 4 x i32> %subvec) nounwind {
+; CHECK-LABEL: insert_nxv1i32_nxv4i32_3:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    mov x9, sp
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    whilelo p0.s, xzr, x8
+; CHECK-NEXT:    cntw x8, all, mul #3
+; CHECK-NEXT:    add x8, x9, x8
+; CHECK-NEXT:    st1w { z1.s }, p0, [x8]
+; CHECK-NEXT:    ldr z0, [sp]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %i = call <vscale x 1 x i32> @llvm.vector.extract.nxv1i32.nxv4i32(<vscale x 4 x i32> %subvec, i64 0)
+  %retval = call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.nxv1i32(<vscale x 4 x i32> %vec, <vscale x 1 x i32> %i, i64 3)
+  ret <vscale x 4 x i32> %retval
+}
+
+; NOTE: Extract input sub-vector from a legal type to avoid relying on an
+; undefined calling convention.
+define <vscale x 2 x float> @insert_nxv1f32_nxv2f32_0(<vscale x 2 x float> %vec, <vscale x 2 x float> %subvec) nounwind {
+; CHECK-LABEL: insert_nxv1f32_nxv2f32_0:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    ptrue p0.d
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    st1w { z0.d }, p0, [sp, #1, mul vl]
+; CHECK-NEXT:    whilelo p1.d, xzr, x8
+; CHECK-NEXT:    st1w { z1.d }, p1, [sp, #1, mul vl]
+; CHECK-NEXT:    ld1w { z0.d }, p0/z, [sp, #1, mul vl]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %i = call <vscale x 1 x float> @llvm.vector.extract.nxv1f32.nxv2f32(<vscale x 2 x float> %subvec, i64 0)
+  %retval = call <vscale x 2 x float> @llvm.vector.insert.nxv2f32.nxv1f32(<vscale x 2 x float> %vec, <vscale x 1 x float> %i, i64 0)
+  ret <vscale x 2 x float> %retval
+}
 
-declare <vscale x 16 x i8> @llvm.vector.insert.nxv16i8.v16i8(<vscale x 16 x i8>, <16 x i8>, i64)
-
-declare <vscale x 6 x i16> @llvm.vector.insert.nxv6i16.nxv1i16(<vscale x 6 x i16>, <vscale x 1 x i16>, i64)
-declare <vscale x 8 x i16> @llvm.vector.insert.nxv8i16.nxv2i16(<vscale x 8 x i16>, <vscale x 2 x i16>, i64)
-declare <vscale x 8 x i16> @llvm.vector.insert.nxv8i16.v8i16(<vscale x 8 x i16>, <8 x i16>, i64)
-
-declare <vscale x 3 x i32> @llvm.vector.insert.nxv3i32.nxv2i32(<vscale x 3 x i32>, <vscale x 2 x i32>, i64)
-declare <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.nxv1i32(<vscale x 4 x i32>, <vscale x 1 x i32>, i64)
-declare <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.v4i32(<vscale x 4 x i32>, <4 x i32>, i64)
-declare <vscale x 12 x i32> @llvm.vector.insert.nxv4i32.nxv12i32(<vscale x 12 x i32>, <vscale x 4 x i32>, i64)
-declare <vscale x 6 x i32> @llvm.vector.insert.nxv6i32.nxv2i32(<vscale x 6 x i32>, <vscale x 2 x i32>, i64)
-declare <vscale x 6 x i32> @llvm.vector.insert.nxv6i32.nxv3i32(<vscale x 6 x i32>, <vscale x 3 x i32>, i64)
-
-declare <vscale x 2 x bfloat> @llvm.vector.insert.nxv2bf16.nxv2bf16(<vscale x 2 x bfloat>, <vscale x 2 x bfloat>, i64)
-declare <vscale x 4 x bfloat> @llvm.vector.insert.nxv4bf16.nxv2bf16(<vscale x 4 x bfloat>, <vscale x 2 x bfloat>, i64)
-declare <vscale x 4 x bfloat> @llvm.vector.insert.nxv4bf16.nxv4bf16(<vscale x 4 x bfloat>, <vscale x 4 x bfloat>, i64)
-declare <vscale x 4 x bfloat> @llvm.vector.insert.nxv4bf16.v4bf16(<vscale x 4 x bfloat>, <4 x bfloat>, i64)
-declare <vscale x 8 x bfloat> @llvm.vector.insert.nxv8bf16.nxv8bf16(<vscale x 8 x bfloat>, <vscale x 8 x bfloat>, i64)
-declare <vscale x 8 x bfloat> @llvm.vector.insert.nxv8bf16.nxv4bf16(<vscale x 8 x bfloat>, <vscale x 4 x bfloat>, i64)
-declare <vscale x 8 x bfloat> @llvm.vector.insert.nxv8bf16.v8bf16(<vscale x 8 x bfloat>, <8 x bfloat>, i64)
-
-declare <vscale x 2 x i64> @llvm.vector.insert.nxv2i64.v2i64(<vscale x 2 x i64>, <2 x i64>, i64)
-declare <vscale x 2 x i64> @llvm.vector.insert.nxv2i64.v4i64(<vscale x 2 x i64>, <4 x i64>, i64)
-declare <vscale x 16 x i64> @llvm.vector.insert.nxv8i64.nxv16i64(<vscale x 16 x i64>, <vscale x 8 x i64>, i64)
-declare <vscale x 16 x i64> @llvm.vector.insert.v2i64.nxv16i64(<vscale x 16 x i64>, <2 x i64>, i64)
-
-declare <vscale x 4 x half> @llvm.vector.insert.nxv4f16.nxv2f16(<vscale x 4 x half>, <vscale x 2 x half>, i64)
-declare <vscale x 8 x half> @llvm.vector.insert.nxv8f16.nxv2f16(<vscale x 8 x half>, <vscale x 2 x half>, i64)
-declare <vscale x 8 x half> @llvm.vector.insert.nxv8f16.nxv4f16(<vscale x 8 x half>, <vscale x 4 x half>, i64)
-
-declare <vscale x 3 x float> @llvm.vector.insert.nxv3f32.nxv2f32(<vscale x 3 x float>, <vscale x 2 x float>, i64)
-declare <vscale x 4 x float> @llvm.vector.insert.nxv4f32.nxv1f32(<vscale x 4 x float>, <vscale x 1 x float>, i64)
-declare <vscale x 4 x float> @llvm.vector.insert.nxv4f32.nxv2f32(<vscale x 4 x float>, <vscale x 2 x float>, i64)
-
-declare <vscale x 2 x i1> @llvm.vector.insert.nxv2i1.v8i1(<vscale x 2 x i1>, <8 x i1>, i64)
-declare <vscale x 4 x i1> @llvm.vector.insert.nxv4i1.v16i1(<vscale x 4 x i1>, <16 x i1>, i64)
-declare <vscale x 8 x i1> @llvm.vector.insert.nxv8i1.v32i1(<vscale x 8 x i1>, <32 x i1>, i64)
-declare <vscale x 16 x i1> @llvm.vector.insert.nxv16i1.nxv1i1(<vscale x 16 x i1>, <vscale x 1 x i1>, i64)
-declare <vscale x 8 x i1> @llvm.vector.insert.nxv8i1.nxv1i1(<vscale x 8 x i1>, <vscale x 1 x i1>, i64)
-declare <vscale x 4 x i1> @llvm.vector.insert.nxv4i1.nxv1i1(<vscale x 4 x i1>, <vscale x 1 x i1>, i64)
-declare <vscale x 2 x i1> @llvm.vector.insert.nxv2i1.nxv1i1(<vscale x 2 x i1>, <vscale x 1 x i1>, i64)
-declare <vscale x 16 x i1> @llvm.vector.insert.nxv16i1.nxv4i1(<vscale x 16 x i1>, <vscale x 4 x i1>, i64)
-declare <vscale x 16 x i1> @llvm.vector.insert.nxv16i1.nxv8i1(<vscale x 16 x i1>, <vscale x 8 x i1>, i64)
-declare <vscale x 16 x i1> @llvm.vector.insert.nxv16i1.v64i1(<vscale x 16 x i1>, <64 x i1>, i64)
+; NOTE: Extract input sub-vector from a legal type to avoid relying on an
+; undefined calling convention.
+define <vscale x 2 x float> @insert_nxv1f32_nxv2f32_1(<vscale x 2 x float> %vec, <vscale x 2 x float> %subvec) nounwind {
+; CHECK-LABEL: insert_nxv1f32_nxv2f32_1:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    ptrue p0.d
+; CHECK-NEXT:    addpl x9, sp, #4
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    st1w { z0.d }, p0, [sp, #1, mul vl]
+; CHECK-NEXT:    whilelo p1.d, xzr, x8
+; CHECK-NEXT:    cntw x8
+; CHECK-NEXT:    add x8, x9, x8
+; CHECK-NEXT:    st1w { z1.d }, p1, [x8]
+; CHECK-NEXT:    ld1w { z0.d }, p0/z, [sp, #1, mul vl]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %i = call <vscale x 1 x float> @llvm.vector.extract.nxv1f32.nxv2f32(<vscale x 2 x float> %subvec, i64 0)
+  %retval = call <vscale x 2 x float> @llvm.vector.insert.nxv2f32.nxv1f32(<vscale x 2 x float> %vec, <vscale x 1 x float> %i, i64 1)
+  ret <vscale x 2 x float> %retval
+}
+
+; NOTE: Extract input sub-vector from a legal type to avoid relying on an
+; undefined calling convention.
+define <vscale x 4 x float> @insert_nxv1f32_nxv4f32_0(<vscale x 4 x float> %vec, <vscale x 4 x float> %subvec) nounwind {
+; CHECK-LABEL: insert_nxv1f32_nxv4f32_0:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    uunpklo z1.d, z1.s
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    whilelo p0.d, xzr, x8
+; CHECK-NEXT:    st1w { z1.d }, p0, [sp]
+; CHECK-NEXT:    ldr z0, [sp]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %i = call <vscale x 1 x float> @llvm.vector.extract.nxv1f32.nxv4f32(<vscale x 4 x float> %subvec, i64 0)
+  %retval = call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.nxv1f32(<vscale x 4 x float> %vec, <vscale x 1 x float> %i, i64 0)
+  ret <vscale x 4 x float> %retval
+}
+
+; NOTE: Extract input sub-vector from a legal type to avoid relying on an
+; undefined calling convention.
+define <vscale x 4 x float> @insert_nxv1f32_nxv4f32_1(<vscale x 4 x float> %vec, <vscale x 4 x float> %subvec) nounwind {
+; CHECK-LABEL: insert_nxv1f32_nxv4f32_1:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    uunpklo z1.d, z1.s
+; CHECK-NEXT:    mov x9, sp
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    whilelo p0.d, xzr, x8
+; CHECK-NEXT:    cntw x8
+; CHECK-NEXT:    add x8, x9, x8
+; CHECK-NEXT:    st1w { z1.d }, p0, [x8]
+; CHECK-NEXT:    ldr z0, [sp]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %i = call <vscale x 1 x float> @llvm.vector.extract.nxv1f32.nxv4f32(<vscale x 4 x float> %subvec, i64 0)
+  %retval = call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.nxv1f32(<vscale x 4 x float> %vec, <vscale x 1 x float> %i, i64 1)
+  ret <vscale x 4 x float> %retval
+}
+
+; NOTE: Extract input sub-vector from a legal type to avoid relying on an
+; undefined calling convention.
+define <vscale x 4 x float> @insert_nxv1f32_nxv4f32_2(<vscale x 4 x float> %vec, <vscale x 4 x float> %subvec) nounwind {
+; CHECK-LABEL: insert_nxv1f32_nxv4f32_2:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    uunpklo z1.d, z1.s
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    whilelo p0.d, xzr, x8
+; CHECK-NEXT:    st1w { z1.d }, p0, [sp, #1, mul vl]
+; CHECK-NEXT:    ldr z0, [sp]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %i = call <vscale x 1 x float> @llvm.vector.extract.nxv1f32.nxv4f32(<vscale x 4 x float> %subvec, i64 0)
+  %retval = call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.nxv1f32(<vscale x 4 x float> %vec, <vscale x 1 x float> %i, i64 2)
+  ret <vscale x 4 x float> %retval
+}
+
+; NOTE: Extract input sub-vector from a legal type to avoid relying on an
+; undefined calling convention.
+define <vscale x 4 x float> @insert_nxv1f32_nxv4f32_3(<vscale x 4 x float> %vec, <vscale x 4 x float> %subvec) nounwind {
+; CHECK-LABEL: insert_nxv1f32_nxv4f32_3:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    addvl sp, sp, #-1
+; CHECK-NEXT:    rdvl x8, #1
+; CHECK-NEXT:    uunpklo z1.d, z1.s
+; CHECK-NEXT:    mov x9, sp
+; CHECK-NEXT:    lsr x8, x8, #4
+; CHECK-NEXT:    str z0, [sp]
+; CHECK-NEXT:    whilelo p0.d, xzr, x8
+; CHECK-NEXT:    cntw x8, all, mul #3
+; CHECK-NEXT:    add x8, x9, x8
+; CHECK-NEXT:    st1w { z1.d }, p0, [x8]
+; CHECK-NEXT:    ldr z0, [sp]
+; CHECK-NEXT:    addvl sp, sp, #1
+; CHECK-NEXT:    ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
+  %i = call <vscale x 1 x float> @llvm.vector.extract.nxv1f32.nxv4f32(<vscale x 4 x float> %subvec, i64 0)
+  %retval = call <vscale x 4 x float> @llvm.vector.insert.nxv4f32.nxv1f32(<vscale x 4 x float> %vec, <vscale x 1 x float> %i, i64 3)
+  ret <vscale x 4 x float> %retval
+}
+
+attributes #0 = { vscale_range(2,2) }
diff --git a/llvm/test/CodeGen/AArch64/sve-int-mulh-pred.ll b/llvm/test/CodeGen/AArch64/sve-int-mulh-pred.ll
index 32760caa524ec..146720febf486 100644
--- a/llvm/test/CodeGen/AArch64/sve-int-mulh-pred.ll
+++ b/llvm/test/CodeGen/AArch64/sve-int-mulh-pred.ll
@@ -1,11 +1,11 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s
 
 ;
 ; SMULH
 ;
 
-define <vscale x 16 x i8> @smulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
+define <vscale x 16 x i8> @smulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {
 ; CHECK-LABEL: smulh_i8:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ptrue p0.b
@@ -19,7 +19,7 @@ define <vscale x 16 x i8> @smulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b
   ret <vscale x 16 x i8> %tr
 }
 
-define <vscale x 8 x i16> @smulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
+define <vscale x 8 x i16> @smulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) {
 ; CHECK-LABEL: smulh_i16:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ptrue p0.h
@@ -33,7 +33,7 @@ define <vscale x 8 x i16> @smulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %
   ret <vscale x 8 x i16> %tr
 }
 
-define <vscale x 4 x i32> @smulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
+define <vscale x 4 x i32> @smulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) {
 ; CHECK-LABEL: smulh_i32:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ptrue p0.s
@@ -47,7 +47,7 @@ define <vscale x 4 x i32> @smulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %
   ret <vscale x 4 x i32> %tr
 }
 
-define <vscale x 2 x i64> @smulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
+define <vscale x 2 x i64> @smulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) {
 ; CHECK-LABEL: smulh_i64:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ptrue p0.d
@@ -65,7 +65,7 @@ define <vscale x 2 x i64> @smulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %
 ; UMULH
 ;
 
-define <vscale x 16 x i8> @umulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
+define <vscale x 16 x i8> @umulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {
 ; CHECK-LABEL: umulh_i8:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ptrue p0.b
@@ -79,7 +79,7 @@ define <vscale x 16 x i8> @umulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b
   ret <vscale x 16 x i8> %tr
 }
 
-define <vscale x 8 x i16> @umulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
+define <vscale x 8 x i16> @umulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) {
 ; CHECK-LABEL: umulh_i16:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ptrue p0.h
@@ -93,7 +93,7 @@ define <vscale x 8 x i16> @umulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %
   ret <vscale x 8 x i16> %tr
 }
 
-define <vscale x 4 x i32> @umulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
+define <vscale x 4 x i32> @umulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) {
 ; CHECK-LABEL: umulh_i32:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ptrue p0.s
@@ -107,7 +107,7 @@ define <vscale x 4 x i32> @umulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %
   ret <vscale x 4 x i32> %tr
 }
 
-define <vscale x 2 x i64> @umulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
+define <vscale x 2 x i64> @umulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) {
 ; CHECK-LABEL: umulh_i64:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    ptrue p0.d
@@ -121,4 +121,262 @@ define <vscale x 2 x i64> @umulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %
   ret <vscale x 2 x i64> %tr
 }
 
-attributes #0 = { "target-features"="+sve" }
+
+; Fixed-length 128 bits
+
+define <16 x i8> @smulh_v16i8(<16 x i8> %a, <16 x i8> %b) {
+; CHECK-LABEL: smulh_v16i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    smull2 v2.8h, v0.16b, v1.16b
+; CHECK-NEXT:    smull v0.8h, v0.8b, v1.8b
+; CHECK-NEXT:    uzp2 v0.16b, v0.16b, v2.16b
+; CHECK-NEXT:    ret
+  %1 = sext <16 x i8> %a to <16 x i16>
+  %2 = sext <16 x i8> %b to <16 x i16>
+  %mul = mul <16 x i16> %1, %2
+  %shr = lshr <16 x i16> %mul, splat(i16 8)
+  %tr = trunc <16 x i16> %shr to <16 x i8>
+  ret <16 x i8> %tr
+}
+
+define <8 x i16> @smulh_v8i16(<8 x i16> %a, <8 x i16> %b) {
+; CHECK-LABEL: smulh_v8i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    smull2 v2.4s, v0.8h, v1.8h
+; CHECK-NEXT:    smull v0.4s, v0.4h, v1.4h
+; CHECK-NEXT:    uzp2 v0.8h, v0.8h, v2.8h
+; CHECK-NEXT:    ret
+  %1 = sext <8 x i16> %a to <8 x i32>
+  %2 = sext <8 x i16> %b to <8 x i32>
+  %mul = mul <8 x i32> %1, %2
+  %shr = lshr <8 x i32> %mul, splat(i32 16)
+  %tr = trunc <8 x i32> %shr to <8 x i16>
+  ret <8 x i16> %tr
+}
+
+define <4 x i32> @smulh_v4i32(<4 x i32> %a, <4 x i32> %b) {
+; CHECK-LABEL: smulh_v4i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    smull2 v2.2d, v0.4s, v1.4s
+; CHECK-NEXT:    smull v0.2d, v0.2s, v1.2s
+; CHECK-NEXT:    uzp2 v0.4s, v0.4s, v2.4s
+; CHECK-NEXT:    ret
+  %1 = sext <4 x i32> %a to <4 x i64>
+  %2 = sext <4 x i32> %b to <4 x i64>
+  %mul = mul <4 x i64> %1, %2
+  %shr = lshr <4 x i64> %mul, splat(i64 32)
+  %tr = trunc <4 x i64> %shr to <4 x i32>
+  ret <4 x i32> %tr
+}
+
+define <2 x i64> @smulh_v2i64(<2 x i64> %a, <2 x i64> %b) {
+; CHECK-LABEL: smulh_v2i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov x8, v0.d[1]
+; CHECK-NEXT:    mov x9, v1.d[1]
+; CHECK-NEXT:    fmov x10, d0
+; CHECK-NEXT:    fmov x11, d1
+; CHECK-NEXT:    smulh x10, x10, x11
+; CHECK-NEXT:    smulh x8, x8, x9
+; CHECK-NEXT:    fmov d0, x10
+; CHECK-NEXT:    fmov d1, x8
+; CHECK-NEXT:    mov v0.d[1], v1.d[0]
+; CHECK-NEXT:    ret
+  %1 = sext <2 x i64> %a to <2 x i128>
+  %2 = sext <2 x i64> %b to <2 x i128>
+  %mul = mul <2 x i128> %1, %2
+  %shr = lshr <2 x i128> %mul, splat(i128 64)
+  %tr = trunc <2 x i128> %shr to <2 x i64>
+  ret <2 x i64> %tr
+}
+
+define <16 x i8> @umulh_v16i8(<16 x i8> %a, <16 x i8> %b) {
+; CHECK-LABEL: umulh_v16i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    umull2 v2.8h, v0.16b, v1.16b
+; CHECK-NEXT:    umull v0.8h, v0.8b, v1.8b
+; CHECK-NEXT:    uzp2 v0.16b, v0.16b, v2.16b
+; CHECK-NEXT:    ret
+  %1 = zext <16 x i8> %a to <16 x i16>
+  %2 = zext <16 x i8> %b to <16 x i16>
+  %mul = mul <16 x i16> %1, %2
+  %shr = lshr <16 x i16> %mul, splat(i16 8)
+  %tr = trunc <16 x i16> %shr to <16 x i8>
+  ret <16 x i8> %tr
+}
+
+define <8 x i16> @umulh_v8i16(<8 x i16> %a, <8 x i16> %b) {
+; CHECK-LABEL: umulh_v8i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    umull2 v2.4s, v0.8h, v1.8h
+; CHECK-NEXT:    umull v0.4s, v0.4h, v1.4h
+; CHECK-NEXT:    uzp2 v0.8h, v0.8h, v2.8h
+; CHECK-NEXT:    ret
+  %1 = zext <8 x i16> %a to <8 x i32>
+  %2 = zext <8 x i16> %b to <8 x i32>
+  %mul = mul <8 x i32> %1, %2
+  %shr = lshr <8 x i32> %mul, splat(i32 16)
+  %tr = trunc <8 x i32> %shr to <8 x i16>
+  ret <8 x i16> %tr
+}
+
+define <4 x i32> @umulh_v4i32(<4 x i32> %a, <4 x i32> %b) {
+; CHECK-LABEL: umulh_v4i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    umull2 v2.2d, v0.4s, v1.4s
+; CHECK-NEXT:    umull v0.2d, v0.2s, v1.2s
+; CHECK-NEXT:    uzp2 v0.4s, v0.4s, v2.4s
+; CHECK-NEXT:    ret
+  %1 = zext <4 x i32> %a to <4 x i64>
+  %2 = zext <4 x i32> %b to <4 x i64>
+  %mul = mul <4 x i64> %1, %2
+  %shr = lshr <4 x i64> %mul, splat(i64 32)
+  %tr = trunc <4 x i64> %shr to <4 x i32>
+  ret <4 x i32> %tr
+}
+
+define <2 x i64> @umulh_v2i64(<2 x i64> %a, <2 x i64> %b) {
+; CHECK-LABEL: umulh_v2i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov x8, v0.d[1]
+; CHECK-NEXT:    mov x9, v1.d[1]
+; CHECK-NEXT:    fmov x10, d0
+; CHECK-NEXT:    fmov x11, d1
+; CHECK-NEXT:    umulh x10, x10, x11
+; CHECK-NEXT:    umulh x8, x8, x9
+; CHECK-NEXT:    fmov d0, x10
+; CHECK-NEXT:    fmov d1, x8
+; CHECK-NEXT:    mov v0.d[1], v1.d[0]
+; CHECK-NEXT:    ret
+  %1 = zext <2 x i64> %a to <2 x i128>
+  %2 = zext <2 x i64> %b to <2 x i128>
+  %mul = mul <2 x i128> %1, %2
+  %shr = lshr <2 x i128> %mul, splat(i128 64)
+  %tr = trunc <2 x i128> %shr to <2 x i64>
+  ret <2 x i64> %tr
+}
+
+
+
+; Fixed-length 64 bits
+
+define <8 x i8> @smulh_v8i8(<8 x i8> %a, <8 x i8> %b) {
+; CHECK-LABEL: smulh_v8i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    smull v0.8h, v0.8b, v1.8b
+; CHECK-NEXT:    shrn v0.8b, v0.8h, #8
+; CHECK-NEXT:    ret
+  %1 = sext <8 x i8> %a to <8 x i16>
+  %2 = sext <8 x i8> %b to <8 x i16>
+  %mul = mul <8 x i16> %1, %2
+  %shr = lshr <8 x i16> %mul, splat(i16 8)
+  %tr = trunc <8 x i16> %shr to <8 x i8>
+  ret <8 x i8> %tr
+}
+
+define <4 x i16> @smulh_v4i16(<4 x i16> %a, <4 x i16> %b) {
+; CHECK-LABEL: smulh_v4i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    smull v0.4s, v0.4h, v1.4h
+; CHECK-NEXT:    shrn v0.4h, v0.4s, #16
+; CHECK-NEXT:    ret
+  %1 = sext <4 x i16> %a to <4 x i32>
+  %2 = sext <4 x i16> %b to <4 x i32>
+  %mul = mul <4 x i32> %1, %2
+  %shr = lshr <4 x i32> %mul, splat(i32 16)
+  %tr = trunc <4 x i32> %shr to <4 x i16>
+  ret <4 x i16> %tr
+}
+
+define <2 x i32> @smulh_v2i32(<2 x i32> %a, <2 x i32> %b) {
+; CHECK-LABEL: smulh_v2i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    smull v0.2d, v0.2s, v1.2s
+; CHECK-NEXT:    shrn v0.2s, v0.2d, #32
+; CHECK-NEXT:    ret
+  %1 = sext <2 x i32> %a to <2 x i64>
+  %2 = sext <2 x i32> %b to <2 x i64>
+  %mul = mul <2 x i64> %1, %2
+  %shr = lshr <2 x i64> %mul, splat(i64 32)
+  %tr = trunc <2 x i64> %shr to <2 x i32>
+  ret <2 x i32> %tr
+}
+
+define <1 x i64> @smulh_v1i64(<1 x i64> %a, <1 x i64> %b) {
+; CHECK-LABEL: smulh_v1i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    // kill: def $d1 killed $d1 def $q1
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $q0
+; CHECK-NEXT:    fmov x8, d0
+; CHECK-NEXT:    fmov x9, d1
+; CHECK-NEXT:    smulh x8, x8, x9
+; CHECK-NEXT:    fmov d0, x8
+; CHECK-NEXT:    ret
+  %1 = sext <1 x i64> %a to <1 x i128>
+  %2 = sext <1 x i64> %b to <1 x i128>
+  %mul = mul <1 x i128> %1, %2
+  %shr = lshr <1 x i128> %mul, splat(i128 64)
+  %tr = trunc <1 x i128> %shr to <1 x i64>
+  ret <1 x i64> %tr
+}
+
+define <8 x i8> @umulh_v8i8(<8 x i8> %a, <8 x i8> %b) {
+; CHECK-LABEL: umulh_v8i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    umull v0.8h, v0.8b, v1.8b
+; CHECK-NEXT:    shrn v0.8b, v0.8h, #8
+; CHECK-NEXT:    ret
+  %1 = zext <8 x i8> %a to <8 x i16>
+  %2 = zext <8 x i8> %b to <8 x i16>
+  %mul = mul <8 x i16> %1, %2
+  %shr = lshr <8 x i16> %mul, splat(i16 8)
+  %tr = trunc <8 x i16> %shr to <8 x i8>
+  ret <8 x i8> %tr
+}
+
+define <4 x i16> @umulh_v4i16(<4 x i16> %a, <4 x i16> %b) {
+; CHECK-LABEL: umulh_v4i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    umull v0.4s, v0.4h, v1.4h
+; CHECK-NEXT:    shrn v0.4h, v0.4s, #16
+; CHECK-NEXT:    ret
+  %1 = zext <4 x i16> %a to <4 x i32>
+  %2 = zext <4 x i16> %b to <4 x i32>
+  %mul = mul <4 x i32> %1, %2
+  %shr = lshr <4 x i32> %mul, splat(i32 16)
+  %tr = trunc <4 x i32> %shr to <4 x i16>
+  ret <4 x i16> %tr
+}
+
+define <2 x i32> @umulh_v2i32(<2 x i32> %a, <2 x i32> %b) {
+; CHECK-LABEL: umulh_v2i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    umull v0.2d, v0.2s, v1.2s
+; CHECK-NEXT:    shrn v0.2s, v0.2d, #32
+; CHECK-NEXT:    ret
+  %1 = zext <2 x i32> %a to <2 x i64>
+  %2 = zext <2 x i32> %b to <2 x i64>
+  %mul = mul <2 x i64> %1, %2
+  %shr = lshr <2 x i64> %mul, splat(i64 32)
+  %tr = trunc <2 x i64> %shr to <2 x i32>
+  ret <2 x i32> %tr
+}
+
+define <1 x i64> @umulh_v1i64(<1 x i64> %a, <1 x i64> %b) {
+; CHECK-LABEL: umulh_v1i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    // kill: def $d1 killed $d1 def $q1
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $q0
+; CHECK-NEXT:    fmov x8, d0
+; CHECK-NEXT:    fmov x9, d1
+; CHECK-NEXT:    umulh x8, x8, x9
+; CHECK-NEXT:    fmov d0, x8
+; CHECK-NEXT:    ret
+  %1 = zext <1 x i64> %a to <1 x i128>
+  %2 = zext <1 x i64> %b to <1 x i128>
+  %mul = mul <1 x i128> %1, %2
+  %shr = lshr <1 x i128> %mul, splat(i128 64)
+  %tr = trunc <1 x i128> %shr to <1 x i64>
+  ret <1 x i64> %tr
+}
+
diff --git a/llvm/test/CodeGen/AArch64/sve-sext-zext.ll b/llvm/test/CodeGen/AArch64/sve-sext-zext.ll
index 88e13ea1e0fa4..845628a91498b 100644
--- a/llvm/test/CodeGen/AArch64/sve-sext-zext.ll
+++ b/llvm/test/CodeGen/AArch64/sve-sext-zext.ll
@@ -456,3 +456,131 @@ define <vscale x 2 x i64> @zext_i18_i64(<vscale x 2 x i18> %a) {
   %r = zext <vscale x 2 x i18> %a to <vscale x 2 x i64>
   ret <vscale x 2 x i64> %r
 }
+
+define <vscale x 8 x i16> @sext_inreg_i16_from_i8(<vscale x 16 x i8> %a) {
+; CHECK-LABEL: sext_inreg_i16_from_i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sunpklo z0.h, z0.b
+; CHECK-NEXT:    ret
+  %subvec = call <vscale x 8 x i8> @llvm.vector.extract.nxv8i8.nxv16i8(<vscale x 16 x i8> %a, i64 0)
+  %sext = sext <vscale x 8 x i8> %subvec to <vscale x 8 x i16>
+  ret <vscale x 8 x i16> %sext
+}
+
+define <vscale x 4 x i32> @sext_inreg_i32_from_i8(<vscale x 16 x i8> %a) {
+; CHECK-LABEL: sext_inreg_i32_from_i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sunpklo z0.h, z0.b
+; CHECK-NEXT:    sunpklo z0.s, z0.h
+; CHECK-NEXT:    ret
+  %subvec = call <vscale x 4 x i8> @llvm.vector.extract.nxv4i8.nxv16i8(<vscale x 16 x i8> %a, i64 0)
+  %sext = sext <vscale x 4 x i8> %subvec to <vscale x 4 x i32>
+  ret <vscale x 4 x i32> %sext
+}
+
+define <vscale x 4 x i32> @sext_inreg_i32_from_i16(<vscale x 8 x i16> %a) {
+; CHECK-LABEL: sext_inreg_i32_from_i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sunpklo z0.s, z0.h
+; CHECK-NEXT:    ret
+  %subvec = call <vscale x 4 x i16> @llvm.vector.extract.nxv4i16.nxv8i16(<vscale x 8 x i16> %a, i64 0)
+  %sext = sext <vscale x 4 x i16> %subvec to <vscale x 4 x i32>
+  ret <vscale x 4 x i32> %sext
+}
+
+define <vscale x 2 x i64> @sext_inreg_i64_from_i8(<vscale x 16 x i8> %a) {
+; CHECK-LABEL: sext_inreg_i64_from_i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sunpklo z0.h, z0.b
+; CHECK-NEXT:    sunpklo z0.s, z0.h
+; CHECK-NEXT:    sunpklo z0.d, z0.s
+; CHECK-NEXT:    ret
+  %subvec = call <vscale x 2 x i8> @llvm.vector.extract.nxv2i8.nxv16i8(<vscale x 16 x i8> %a, i64 0)
+  %sext = sext <vscale x 2 x i8> %subvec to <vscale x 2 x i64>
+  ret <vscale x 2 x i64> %sext
+}
+
+define <vscale x 2 x i64> @sext_inreg_i64_from_i16(<vscale x 8 x i16> %a) {
+; CHECK-LABEL: sext_inreg_i64_from_i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sunpklo z0.s, z0.h
+; CHECK-NEXT:    sunpklo z0.d, z0.s
+; CHECK-NEXT:    ret
+  %subvec = call <vscale x 2 x i16> @llvm.vector.extract.nxv2i16.nxv8i16(<vscale x 8 x i16> %a, i64 0)
+  %sext = sext <vscale x 2 x i16> %subvec to <vscale x 2 x i64>
+  ret <vscale x 2 x i64> %sext
+}
+
+define <vscale x 2 x i64> @sext_inreg_i64_from_i32(<vscale x 4 x i32> %a) {
+; CHECK-LABEL: sext_inreg_i64_from_i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    sunpklo z0.d, z0.s
+; CHECK-NEXT:    ret
+  %subvec = call <vscale x 2 x i32> @llvm.vector.extract.nxv2i32.nxv4i32(<vscale x 4 x i32> %a, i64 0)
+  %sext = sext <vscale x 2 x i32> %subvec to <vscale x 2 x i64>
+  ret <vscale x 2 x i64> %sext
+}
+
+define <vscale x 8 x i16> @zext_inreg_i16_from_i8(<vscale x 16 x i8> %a) {
+; CHECK-LABEL: zext_inreg_i16_from_i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    uunpklo z0.h, z0.b
+; CHECK-NEXT:    ret
+  %subvec = call <vscale x 8 x i8> @llvm.vector.extract.nxv8i8.nxv16i8(<vscale x 16 x i8> %a, i64 0)
+  %zext = zext <vscale x 8 x i8> %subvec to <vscale x 8 x i16>
+  ret <vscale x 8 x i16> %zext
+}
+
+define <vscale x 4 x i32> @zext_inreg_i32_from_i8(<vscale x 16 x i8> %a) {
+; CHECK-LABEL: zext_inreg_i32_from_i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    uunpklo z0.h, z0.b
+; CHECK-NEXT:    uunpklo z0.s, z0.h
+; CHECK-NEXT:    ret
+  %subvec = call <vscale x 4 x i8> @llvm.vector.extract.nxv4i8.nxv16i8(<vscale x 16 x i8> %a, i64 0)
+  %zext = zext <vscale x 4 x i8> %subvec to <vscale x 4 x i32>
+  ret <vscale x 4 x i32> %zext
+}
+
+define <vscale x 4 x i32> @zext_inreg_i32_from_i16(<vscale x 8 x i16> %a) {
+; CHECK-LABEL: zext_inreg_i32_from_i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    uunpklo z0.s, z0.h
+; CHECK-NEXT:    ret
+  %subvec = call <vscale x 4 x i16> @llvm.vector.extract.nxv4i16.nxv8i16(<vscale x 8 x i16> %a, i64 0)
+  %zext = zext <vscale x 4 x i16> %subvec to <vscale x 4 x i32>
+  ret <vscale x 4 x i32> %zext
+}
+
+define <vscale x 2 x i64> @zext_inreg_i64_from_i8(<vscale x 16 x i8> %a) {
+; CHECK-LABEL: zext_inreg_i64_from_i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    uunpklo z0.h, z0.b
+; CHECK-NEXT:    uunpklo z0.s, z0.h
+; CHECK-NEXT:    uunpklo z0.d, z0.s
+; CHECK-NEXT:    ret
+  %subvec = call <vscale x 2 x i8> @llvm.vector.extract.nxv2i8.nxv16i8(<vscale x 16 x i8> %a, i64 0)
+  %zext = zext <vscale x 2 x i8> %subvec to <vscale x 2 x i64>
+  ret <vscale x 2 x i64> %zext
+}
+
+define <vscale x 2 x i64> @zext_inreg_i64_from_i16(<vscale x 8 x i16> %a) {
+; CHECK-LABEL: zext_inreg_i64_from_i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    uunpklo z0.s, z0.h
+; CHECK-NEXT:    uunpklo z0.d, z0.s
+; CHECK-NEXT:    ret
+  %subvec = call <vscale x 2 x i16> @llvm.vector.extract.nxv2i16.nxv8i16(<vscale x 8 x i16> %a, i64 0)
+  %zext = zext <vscale x 2 x i16> %subvec to <vscale x 2 x i64>
+  ret <vscale x 2 x i64> %zext
+}
+
+define <vscale x 2 x i64> @zext_inreg_i64_from_i32(<vscale x 4 x i32> %a) {
+; CHECK-LABEL: zext_inreg_i64_from_i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    uunpklo z0.d, z0.s
+; CHECK-NEXT:    ret
+  %subvec = call <vscale x 2 x i32> @llvm.vector.extract.nxv2i32.nxv4i32(<vscale x 4 x i32> %a, i64 0)
+  %zext = zext <vscale x 2 x i32> %subvec to <vscale x 2 x i64>
+  ret <vscale x 2 x i64> %zext
+}
diff --git a/llvm/test/CodeGen/AArch64/sve-stack-frame-layout.ll b/llvm/test/CodeGen/AArch64/sve-stack-frame-layout.ll
index 3aaae5e73ff23..37adfb89e4762 100644
--- a/llvm/test/CodeGen/AArch64/sve-stack-frame-layout.ll
+++ b/llvm/test/CodeGen/AArch64/sve-stack-frame-layout.ll
@@ -33,7 +33,7 @@ define i32 @csr_d8_allocnxv4i32i32f64(double %d) "aarch64_pstate_sm_compatible"
 ; CHECK-COMMON-NEXT:    ldr x29, [sp, #8] // 8-byte Reload
 ; CHECK-COMMON-NEXT:    ldr d8, [sp], #16 // 8-byte Folded Reload
 ; CHECK-COMMON-NEXT:    ret
-; CHECK-COMMON-NE
+; CHECK-NE
 entry:
   %a = alloca <vscale x 4 x i32>
   %b = alloca i32
@@ -626,23 +626,21 @@ define i32 @vastate(i32 %x) "aarch64_inout_za" "aarch64_pstate_sm_enabled" "targ
 ; CHECK-NEWLOWERING-NEXT:    mov x9, sp
 ; CHECK-NEWLOWERING-NEXT:    msub x9, x8, x8, x9
 ; CHECK-NEWLOWERING-NEXT:    mov sp, x9
-; CHECK-NEWLOWERING-NEXT:    sub x10, x29, #80
 ; CHECK-NEWLOWERING-NEXT:    mov w20, w0
+; CHECK-NEWLOWERING-NEXT:    sub x10, x29, #80
 ; CHECK-NEWLOWERING-NEXT:    stp x9, x8, [x29, #-80]
 ; CHECK-NEWLOWERING-NEXT:    msr TPIDR2_EL0, x10
 ; CHECK-NEWLOWERING-NEXT:    smstop sm
 ; CHECK-NEWLOWERING-NEXT:    bl other
 ; CHECK-NEWLOWERING-NEXT:    smstart sm
-; CHECK-NEWLOWERING-NEXT:    mov w0, w20
-; CHECK-NEWLOWERING-NEXT:    mov w8, w0
 ; CHECK-NEWLOWERING-NEXT:    smstart za
-; CHECK-NEWLOWERING-NEXT:    mrs x9, TPIDR2_EL0
+; CHECK-NEWLOWERING-NEXT:    mrs x8, TPIDR2_EL0
 ; CHECK-NEWLOWERING-NEXT:    sub x0, x29, #80
-; CHECK-NEWLOWERING-NEXT:    cbnz x9, .LBB8_2
+; CHECK-NEWLOWERING-NEXT:    cbnz x8, .LBB8_2
 ; CHECK-NEWLOWERING-NEXT:  // %bb.1: // %entry
 ; CHECK-NEWLOWERING-NEXT:    bl __arm_tpidr2_restore
 ; CHECK-NEWLOWERING-NEXT:  .LBB8_2: // %entry
-; CHECK-NEWLOWERING-NEXT:    mov w0, w8
+; CHECK-NEWLOWERING-NEXT:    mov w0, w20
 ; CHECK-NEWLOWERING-NEXT:    msr TPIDR2_EL0, xzr
 ; CHECK-NEWLOWERING-NEXT:    sub sp, x29, #64
 ; CHECK-NEWLOWERING-NEXT:    .cfi_def_cfa wsp, 112
@@ -671,4 +669,4 @@ entry:
   tail call void @other()
   ret i32 %x
 }
-declare void @other()
\ No newline at end of file
+declare void @other()
diff --git a/llvm/test/CodeGen/AArch64/sve2-int-mulh.ll b/llvm/test/CodeGen/AArch64/sve2-int-mulh.ll
index bcf76d5b13d62..d7534712b53a0 100644
--- a/llvm/test/CodeGen/AArch64/sve2-int-mulh.ll
+++ b/llvm/test/CodeGen/AArch64/sve2-int-mulh.ll
@@ -1,11 +1,11 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve2 < %s | FileCheck %s
 
 ;
 ; SMULH
 ;
 
-define <vscale x 16 x i8> @smulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
+define <vscale x 16 x i8> @smulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {
 ; CHECK-LABEL: smulh_i8:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    smulh z0.b, z0.b, z1.b
@@ -18,7 +18,7 @@ define <vscale x 16 x i8> @smulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b
   ret <vscale x 16 x i8> %tr
 }
 
-define <vscale x 8 x i16> @smulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
+define <vscale x 8 x i16> @smulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) {
 ; CHECK-LABEL: smulh_i16:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    smulh z0.h, z0.h, z1.h
@@ -31,7 +31,7 @@ define <vscale x 8 x i16> @smulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %
   ret <vscale x 8 x i16> %tr
 }
 
-define <vscale x 4 x i32> @smulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
+define <vscale x 4 x i32> @smulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) {
 ; CHECK-LABEL: smulh_i32:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    smulh z0.s, z0.s, z1.s
@@ -44,7 +44,7 @@ define <vscale x 4 x i32> @smulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %
   ret <vscale x 4 x i32> %tr
 }
 
-define <vscale x 2 x i64> @smulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
+define <vscale x 2 x i64> @smulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) {
 ; CHECK-LABEL: smulh_i64:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    smulh z0.d, z0.d, z1.d
@@ -61,7 +61,7 @@ define <vscale x 2 x i64> @smulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %
 ; UMULH
 ;
 
-define <vscale x 16 x i8> @umulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
+define <vscale x 16 x i8> @umulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {
 ; CHECK-LABEL: umulh_i8:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    umulh z0.b, z0.b, z1.b
@@ -74,7 +74,7 @@ define <vscale x 16 x i8> @umulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b
   ret <vscale x 16 x i8> %tr
 }
 
-define <vscale x 8 x i16> @umulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
+define <vscale x 8 x i16> @umulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) {
 ; CHECK-LABEL: umulh_i16:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    umulh z0.h, z0.h, z1.h
@@ -87,7 +87,7 @@ define <vscale x 8 x i16> @umulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %
   ret <vscale x 8 x i16> %tr
 }
 
-define <vscale x 4 x i32> @umulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
+define <vscale x 4 x i32> @umulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) {
 ; CHECK-LABEL: umulh_i32:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    umulh z0.s, z0.s, z1.s
@@ -100,7 +100,7 @@ define <vscale x 4 x i32> @umulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %
   ret <vscale x 4 x i32> %tr
 }
 
-define <vscale x 2 x i64> @umulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
+define <vscale x 2 x i64> @umulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) {
 ; CHECK-LABEL: umulh_i64:
 ; CHECK:       // %bb.0:
 ; CHECK-NEXT:    umulh z0.d, z0.d, z1.d
@@ -113,4 +113,261 @@ define <vscale x 2 x i64> @umulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %
   ret <vscale x 2 x i64> %tr
 }
 
-attributes #0 = { "target-features"="+sve2" }
+
+; Fixed-length 128 bits
+
+define <16 x i8> @smulh_v16i8(<16 x i8> %a, <16 x i8> %b) {
+; CHECK-LABEL: smulh_v16i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    smull2 v2.8h, v0.16b, v1.16b
+; CHECK-NEXT:    smull v0.8h, v0.8b, v1.8b
+; CHECK-NEXT:    uzp2 v0.16b, v0.16b, v2.16b
+; CHECK-NEXT:    ret
+  %1 = sext <16 x i8> %a to <16 x i16>
+  %2 = sext <16 x i8> %b to <16 x i16>
+  %mul = mul <16 x i16> %1, %2
+  %shr = lshr <16 x i16> %mul, splat(i16 8)
+  %tr = trunc <16 x i16> %shr to <16 x i8>
+  ret <16 x i8> %tr
+}
+
+define <8 x i16> @smulh_v8i16(<8 x i16> %a, <8 x i16> %b) {
+; CHECK-LABEL: smulh_v8i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    smull2 v2.4s, v0.8h, v1.8h
+; CHECK-NEXT:    smull v0.4s, v0.4h, v1.4h
+; CHECK-NEXT:    uzp2 v0.8h, v0.8h, v2.8h
+; CHECK-NEXT:    ret
+  %1 = sext <8 x i16> %a to <8 x i32>
+  %2 = sext <8 x i16> %b to <8 x i32>
+  %mul = mul <8 x i32> %1, %2
+  %shr = lshr <8 x i32> %mul, splat(i32 16)
+  %tr = trunc <8 x i32> %shr to <8 x i16>
+  ret <8 x i16> %tr
+}
+
+define <4 x i32> @smulh_v4i32(<4 x i32> %a, <4 x i32> %b) {
+; CHECK-LABEL: smulh_v4i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    smull2 v2.2d, v0.4s, v1.4s
+; CHECK-NEXT:    smull v0.2d, v0.2s, v1.2s
+; CHECK-NEXT:    uzp2 v0.4s, v0.4s, v2.4s
+; CHECK-NEXT:    ret
+  %1 = sext <4 x i32> %a to <4 x i64>
+  %2 = sext <4 x i32> %b to <4 x i64>
+  %mul = mul <4 x i64> %1, %2
+  %shr = lshr <4 x i64> %mul, splat(i64 32)
+  %tr = trunc <4 x i64> %shr to <4 x i32>
+  ret <4 x i32> %tr
+}
+
+define <2 x i64> @smulh_v2i64(<2 x i64> %a, <2 x i64> %b) {
+; CHECK-LABEL: smulh_v2i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov x8, v0.d[1]
+; CHECK-NEXT:    mov x9, v1.d[1]
+; CHECK-NEXT:    fmov x10, d0
+; CHECK-NEXT:    fmov x11, d1
+; CHECK-NEXT:    smulh x10, x10, x11
+; CHECK-NEXT:    smulh x8, x8, x9
+; CHECK-NEXT:    fmov d0, x10
+; CHECK-NEXT:    fmov d1, x8
+; CHECK-NEXT:    mov v0.d[1], v1.d[0]
+; CHECK-NEXT:    ret
+  %1 = sext <2 x i64> %a to <2 x i128>
+  %2 = sext <2 x i64> %b to <2 x i128>
+  %mul = mul <2 x i128> %1, %2
+  %shr = lshr <2 x i128> %mul, splat(i128 64)
+  %tr = trunc <2 x i128> %shr to <2 x i64>
+  ret <2 x i64> %tr
+}
+
+define <16 x i8> @umulh_v16i8(<16 x i8> %a, <16 x i8> %b) {
+; CHECK-LABEL: umulh_v16i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    umull2 v2.8h, v0.16b, v1.16b
+; CHECK-NEXT:    umull v0.8h, v0.8b, v1.8b
+; CHECK-NEXT:    uzp2 v0.16b, v0.16b, v2.16b
+; CHECK-NEXT:    ret
+  %1 = zext <16 x i8> %a to <16 x i16>
+  %2 = zext <16 x i8> %b to <16 x i16>
+  %mul = mul <16 x i16> %1, %2
+  %shr = lshr <16 x i16> %mul, splat(i16 8)
+  %tr = trunc <16 x i16> %shr to <16 x i8>
+  ret <16 x i8> %tr
+}
+
+define <8 x i16> @umulh_v8i16(<8 x i16> %a, <8 x i16> %b) {
+; CHECK-LABEL: umulh_v8i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    umull2 v2.4s, v0.8h, v1.8h
+; CHECK-NEXT:    umull v0.4s, v0.4h, v1.4h
+; CHECK-NEXT:    uzp2 v0.8h, v0.8h, v2.8h
+; CHECK-NEXT:    ret
+  %1 = zext <8 x i16> %a to <8 x i32>
+  %2 = zext <8 x i16> %b to <8 x i32>
+  %mul = mul <8 x i32> %1, %2
+  %shr = lshr <8 x i32> %mul, splat(i32 16)
+  %tr = trunc <8 x i32> %shr to <8 x i16>
+  ret <8 x i16> %tr
+}
+
+define <4 x i32> @umulh_v4i32(<4 x i32> %a, <4 x i32> %b) {
+; CHECK-LABEL: umulh_v4i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    umull2 v2.2d, v0.4s, v1.4s
+; CHECK-NEXT:    umull v0.2d, v0.2s, v1.2s
+; CHECK-NEXT:    uzp2 v0.4s, v0.4s, v2.4s
+; CHECK-NEXT:    ret
+  %1 = zext <4 x i32> %a to <4 x i64>
+  %2 = zext <4 x i32> %b to <4 x i64>
+  %mul = mul <4 x i64> %1, %2
+  %shr = lshr <4 x i64> %mul, splat(i64 32)
+  %tr = trunc <4 x i64> %shr to <4 x i32>
+  ret <4 x i32> %tr
+}
+
+define <2 x i64> @umulh_v2i64(<2 x i64> %a, <2 x i64> %b) {
+; CHECK-LABEL: umulh_v2i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov x8, v0.d[1]
+; CHECK-NEXT:    mov x9, v1.d[1]
+; CHECK-NEXT:    fmov x10, d0
+; CHECK-NEXT:    fmov x11, d1
+; CHECK-NEXT:    umulh x10, x10, x11
+; CHECK-NEXT:    umulh x8, x8, x9
+; CHECK-NEXT:    fmov d0, x10
+; CHECK-NEXT:    fmov d1, x8
+; CHECK-NEXT:    mov v0.d[1], v1.d[0]
+; CHECK-NEXT:    ret
+  %1 = zext <2 x i64> %a to <2 x i128>
+  %2 = zext <2 x i64> %b to <2 x i128>
+  %mul = mul <2 x i128> %1, %2
+  %shr = lshr <2 x i128> %mul, splat(i128 64)
+  %tr = trunc <2 x i128> %shr to <2 x i64>
+  ret <2 x i64> %tr
+}
+
+
+
+; Fixed-length 64 bits
+
+define <8 x i8> @smulh_v8i8(<8 x i8> %a, <8 x i8> %b) {
+; CHECK-LABEL: smulh_v8i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    smull v0.8h, v0.8b, v1.8b
+; CHECK-NEXT:    shrn v0.8b, v0.8h, #8
+; CHECK-NEXT:    ret
+  %1 = sext <8 x i8> %a to <8 x i16>
+  %2 = sext <8 x i8> %b to <8 x i16>
+  %mul = mul <8 x i16> %1, %2
+  %shr = lshr <8 x i16> %mul, splat(i16 8)
+  %tr = trunc <8 x i16> %shr to <8 x i8>
+  ret <8 x i8> %tr
+}
+
+define <4 x i16> @smulh_v4i16(<4 x i16> %a, <4 x i16> %b) {
+; CHECK-LABEL: smulh_v4i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    smull v0.4s, v0.4h, v1.4h
+; CHECK-NEXT:    shrn v0.4h, v0.4s, #16
+; CHECK-NEXT:    ret
+  %1 = sext <4 x i16> %a to <4 x i32>
+  %2 = sext <4 x i16> %b to <4 x i32>
+  %mul = mul <4 x i32> %1, %2
+  %shr = lshr <4 x i32> %mul, splat(i32 16)
+  %tr = trunc <4 x i32> %shr to <4 x i16>
+  ret <4 x i16> %tr
+}
+
+define <2 x i32> @smulh_v2i32(<2 x i32> %a, <2 x i32> %b) {
+; CHECK-LABEL: smulh_v2i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    smull v0.2d, v0.2s, v1.2s
+; CHECK-NEXT:    shrn v0.2s, v0.2d, #32
+; CHECK-NEXT:    ret
+  %1 = sext <2 x i32> %a to <2 x i64>
+  %2 = sext <2 x i32> %b to <2 x i64>
+  %mul = mul <2 x i64> %1, %2
+  %shr = lshr <2 x i64> %mul, splat(i64 32)
+  %tr = trunc <2 x i64> %shr to <2 x i32>
+  ret <2 x i32> %tr
+}
+
+define <1 x i64> @smulh_v1i64(<1 x i64> %a, <1 x i64> %b) {
+; CHECK-LABEL: smulh_v1i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    // kill: def $d1 killed $d1 def $q1
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $q0
+; CHECK-NEXT:    fmov x8, d0
+; CHECK-NEXT:    fmov x9, d1
+; CHECK-NEXT:    smulh x8, x8, x9
+; CHECK-NEXT:    fmov d0, x8
+; CHECK-NEXT:    ret
+  %1 = sext <1 x i64> %a to <1 x i128>
+  %2 = sext <1 x i64> %b to <1 x i128>
+  %mul = mul <1 x i128> %1, %2
+  %shr = lshr <1 x i128> %mul, splat(i128 64)
+  %tr = trunc <1 x i128> %shr to <1 x i64>
+  ret <1 x i64> %tr
+}
+
+define <8 x i8> @umulh_v8i8(<8 x i8> %a, <8 x i8> %b) {
+; CHECK-LABEL: umulh_v8i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    umull v0.8h, v0.8b, v1.8b
+; CHECK-NEXT:    shrn v0.8b, v0.8h, #8
+; CHECK-NEXT:    ret
+  %1 = zext <8 x i8> %a to <8 x i16>
+  %2 = zext <8 x i8> %b to <8 x i16>
+  %mul = mul <8 x i16> %1, %2
+  %shr = lshr <8 x i16> %mul, splat(i16 8)
+  %tr = trunc <8 x i16> %shr to <8 x i8>
+  ret <8 x i8> %tr
+}
+
+define <4 x i16> @umulh_v4i16(<4 x i16> %a, <4 x i16> %b) {
+; CHECK-LABEL: umulh_v4i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    umull v0.4s, v0.4h, v1.4h
+; CHECK-NEXT:    shrn v0.4h, v0.4s, #16
+; CHECK-NEXT:    ret
+  %1 = zext <4 x i16> %a to <4 x i32>
+  %2 = zext <4 x i16> %b to <4 x i32>
+  %mul = mul <4 x i32> %1, %2
+  %shr = lshr <4 x i32> %mul, splat(i32 16)
+  %tr = trunc <4 x i32> %shr to <4 x i16>
+  ret <4 x i16> %tr
+}
+
+define <2 x i32> @umulh_v2i32(<2 x i32> %a, <2 x i32> %b) {
+; CHECK-LABEL: umulh_v2i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    umull v0.2d, v0.2s, v1.2s
+; CHECK-NEXT:    shrn v0.2s, v0.2d, #32
+; CHECK-NEXT:    ret
+  %1 = zext <2 x i32> %a to <2 x i64>
+  %2 = zext <2 x i32> %b to <2 x i64>
+  %mul = mul <2 x i64> %1, %2
+  %shr = lshr <2 x i64> %mul, splat(i64 32)
+  %tr = trunc <2 x i64> %shr to <2 x i32>
+  ret <2 x i32> %tr
+}
+
+define <1 x i64> @umulh_v1i64(<1 x i64> %a, <1 x i64> %b) {
+; CHECK-LABEL: umulh_v1i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    // kill: def $d1 killed $d1 def $q1
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $q0
+; CHECK-NEXT:    fmov x8, d0
+; CHECK-NEXT:    fmov x9, d1
+; CHECK-NEXT:    umulh x8, x8, x9
+; CHECK-NEXT:    fmov d0, x8
+; CHECK-NEXT:    ret
+  %1 = zext <1 x i64> %a to <1 x i128>
+  %2 = zext <1 x i64> %b to <1 x i128>
+  %mul = mul <1 x i128> %1, %2
+  %shr = lshr <1 x i128> %mul, splat(i128 64)
+  %tr = trunc <1 x i128> %shr to <1 x i64>
+  ret <1 x i64> %tr
+}
diff --git a/llvm/test/CodeGen/AArch64/udiv-by-const-promoted-ops.ll b/llvm/test/CodeGen/AArch64/udiv-by-const-promoted-ops.ll
new file mode 100644
index 0000000000000..cdd238cdd81ff
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/udiv-by-const-promoted-ops.ll
@@ -0,0 +1,78 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-none-linux-gnu < %s | FileCheck %s
+
+; This test verifies that udiv by constant works correctly even when type
+; legalization promotes constant operands (e.g., i16 -> i32 in BUILD_VECTOR).
+; This is a regression test for a bug where v16i16 would be split into two
+; v8i16 operations during legalization, the i16 constants would be promoted
+; to i32, and then the second DAGCombine round would fail to recognize the
+; promoted constants when trying to convert udiv into mul+shift.
+
+define <8 x i16> @udiv_v8i16_by_255(<8 x i16> %x) {
+; CHECK-LABEL: udiv_v8i16_by_255:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #32897 // =0x8081
+; CHECK-NEXT:    dup v1.8h, w8
+; CHECK-NEXT:    umull2 v2.4s, v0.8h, v1.8h
+; CHECK-NEXT:    umull v0.4s, v0.4h, v1.4h
+; CHECK-NEXT:    uzp2 v0.8h, v0.8h, v2.8h
+; CHECK-NEXT:    ushr v0.8h, v0.8h, #7
+; CHECK-NEXT:    ret
+  %div = udiv <8 x i16> %x, splat (i16 255)
+  ret <8 x i16> %div
+}
+
+define <16 x i16> @udiv_v16i16_by_255(<16 x i16> %x) {
+; CHECK-LABEL: udiv_v16i16_by_255:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #32897 // =0x8081
+; CHECK-NEXT:    dup v2.8h, w8
+; CHECK-NEXT:    umull2 v3.4s, v0.8h, v2.8h
+; CHECK-NEXT:    umull v0.4s, v0.4h, v2.4h
+; CHECK-NEXT:    umull2 v4.4s, v1.8h, v2.8h
+; CHECK-NEXT:    umull v1.4s, v1.4h, v2.4h
+; CHECK-NEXT:    uzp2 v0.8h, v0.8h, v3.8h
+; CHECK-NEXT:    uzp2 v1.8h, v1.8h, v4.8h
+; CHECK-NEXT:    ushr v0.8h, v0.8h, #7
+; CHECK-NEXT:    ushr v1.8h, v1.8h, #7
+; CHECK-NEXT:    ret
+  %div = udiv <16 x i16> %x, splat (i16 255)
+  ret <16 x i16> %div
+}
+
+define <8 x i16> @urem_v8i16_by_255(<8 x i16> %x) {
+; CHECK-LABEL: urem_v8i16_by_255:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #32897 // =0x8081
+; CHECK-NEXT:    dup v1.8h, w8
+; CHECK-NEXT:    umull2 v2.4s, v0.8h, v1.8h
+; CHECK-NEXT:    umull v1.4s, v0.4h, v1.4h
+; CHECK-NEXT:    uzp2 v1.8h, v1.8h, v2.8h
+; CHECK-NEXT:    movi v2.2d, #0xff00ff00ff00ff
+; CHECK-NEXT:    ushr v1.8h, v1.8h, #7
+; CHECK-NEXT:    mls v0.8h, v1.8h, v2.8h
+; CHECK-NEXT:    ret
+  %rem = urem <8 x i16> %x, splat (i16 255)
+  ret <8 x i16> %rem
+}
+
+define <16 x i16> @urem_v16i16_by_255(<16 x i16> %x) {
+; CHECK-LABEL: urem_v16i16_by_255:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #32897 // =0x8081
+; CHECK-NEXT:    dup v2.8h, w8
+; CHECK-NEXT:    umull2 v3.4s, v0.8h, v2.8h
+; CHECK-NEXT:    umull v4.4s, v0.4h, v2.4h
+; CHECK-NEXT:    umull2 v5.4s, v1.8h, v2.8h
+; CHECK-NEXT:    umull v2.4s, v1.4h, v2.4h
+; CHECK-NEXT:    uzp2 v3.8h, v4.8h, v3.8h
+; CHECK-NEXT:    movi v4.2d, #0xff00ff00ff00ff
+; CHECK-NEXT:    uzp2 v2.8h, v2.8h, v5.8h
+; CHECK-NEXT:    ushr v3.8h, v3.8h, #7
+; CHECK-NEXT:    ushr v2.8h, v2.8h, #7
+; CHECK-NEXT:    mls v0.8h, v3.8h, v4.8h
+; CHECK-NEXT:    mls v1.8h, v2.8h, v4.8h
+; CHECK-NEXT:    ret
+  %rem = urem <16 x i16> %x, splat (i16 255)
+  ret <16 x i16> %rem
+}
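
As a quick cross-check of the lowering these CHECK lines encode: for unsigned 16-bit inputs, dividing by 255 is equivalent to multiplying by the magic constant 0x8081 (32897) and shifting right by 23, which is exactly what the umull/uzp2 (high half, i.e. >>16) plus ushr #7 sequence computes. A minimal standalone C++ sketch that verifies this identity exhaustively (not part of the patch; assumes only the standard library):

// Exhaustively check the u16 divide-by-255 magic used in the tests above:
// floor(x / 255) == (x * 0x8081) >> 23 for every 16-bit x.
#include <cassert>
#include <cstdint>

int main() {
  for (uint32_t x = 0; x <= 0xFFFF; ++x) {
    uint32_t ref = x / 255;                // reference udiv
    uint32_t magic = (x * 0x8081u) >> 23;  // umull, take high half (>>16), ushr #7
    assert(ref == magic);
  }
  return 0;
}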
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll
index 405861d791169..9dfd0a47d1e1e 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll
@@ -10,41 +10,75 @@ define amdgpu_ps i128 @extractelement_sgpr_v4i128_sgpr_idx(ptr addrspace(4) inre
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_and_b32 s0, s4, 3
 ; GFX9-NEXT:    s_lshl_b32 s0, s0, 4
-; GFX9-NEXT:    s_load_dwordx4 s[0:3], s[2:3], s0 offset:0x0
-; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    v_mov_b32_e32 v0, s0
+; GFX9-NEXT:    global_load_dwordx4 v[0:3], v0, s[2:3]
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX9-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX9-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX9-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX9-NEXT:    ; return to shader part epilog
 ;
 ; GFX8-LABEL: extractelement_sgpr_v4i128_sgpr_idx:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_and_b32 s0, s4, 3
 ; GFX8-NEXT:    s_lshl_b32 s0, s0, 4
-; GFX8-NEXT:    s_load_dwordx4 s[0:3], s[2:3], s0
-; GFX8-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8-NEXT:    s_add_u32 s0, s2, s0
+; GFX8-NEXT:    s_addc_u32 s1, s3, 0
+; GFX8-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8-NEXT:    flat_load_dwordx4 v[0:3], v[0:1]
+; GFX8-NEXT:    s_waitcnt vmcnt(0)
+; GFX8-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX8-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX8-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX8-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX8-NEXT:    ; return to shader part epilog
 ;
 ; GFX7-LABEL: extractelement_sgpr_v4i128_sgpr_idx:
 ; GFX7:       ; %bb.0:
-; GFX7-NEXT:    s_and_b32 s0, s4, 3
-; GFX7-NEXT:    s_lshl_b32 s0, s0, 4
-; GFX7-NEXT:    s_load_dwordx4 s[0:3], s[2:3], s0
-; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX7-NEXT:    s_mov_b32 s0, s2
+; GFX7-NEXT:    s_and_b32 s2, s4, 3
+; GFX7-NEXT:    s_lshl_b32 s4, s2, 4
+; GFX7-NEXT:    s_mov_b32 s5, 0
+; GFX7-NEXT:    v_mov_b32_e32 v0, s4
+; GFX7-NEXT:    s_mov_b32 s1, s3
+; GFX7-NEXT:    s_mov_b32 s3, 0xf000
+; GFX7-NEXT:    s_mov_b32 s2, s5
+; GFX7-NEXT:    v_mov_b32_e32 v1, s5
+; GFX7-NEXT:    buffer_load_dwordx4 v[0:3], v[0:1], s[0:3], 0 addr64
+; GFX7-NEXT:    s_waitcnt vmcnt(0)
+; GFX7-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX7-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX7-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX7-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX7-NEXT:    ; return to shader part epilog
 ;
 ; GFX10-LABEL: extractelement_sgpr_v4i128_sgpr_idx:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_and_b32 s0, s4, 3
 ; GFX10-NEXT:    s_lshl_b32 s0, s0, 4
-; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[2:3], s0 offset:0x0
-; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    v_mov_b32_e32 v0, s0
+; GFX10-NEXT:    global_load_dwordx4 v[0:3], v0, s[2:3]
+; GFX10-NEXT:    s_waitcnt vmcnt(0)
+; GFX10-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX10-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX10-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX10-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX10-NEXT:    ; return to shader part epilog
 ;
 ; GFX11-LABEL: extractelement_sgpr_v4i128_sgpr_idx:
 ; GFX11:       ; %bb.0:
 ; GFX11-NEXT:    s_and_b32 s0, s4, 3
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
 ; GFX11-NEXT:    s_lshl_b32 s0, s0, 4
-; GFX11-NEXT:    s_load_b128 s[0:3], s[2:3], s0 offset:0x0
-; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_mov_b32_e32 v0, s0
+; GFX11-NEXT:    global_load_b128 v[0:3], v0, s[2:3]
+; GFX11-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX11-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX11-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX11-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX11-NEXT:    ; return to shader part epilog
   %vector = load <4 x i128>, ptr addrspace(4) %ptr
   %element = extractelement <4 x i128> %vector, i32 %idx
@@ -281,22 +315,63 @@ define amdgpu_ps i128 @extractelement_sgpr_v4i128_vgpr_idx(ptr addrspace(4) inre
 }
 
 define amdgpu_ps i128 @extractelement_sgpr_v4i128_idx0(ptr addrspace(4) inreg %ptr) {
-; GCN-LABEL: extractelement_sgpr_v4i128_idx0:
-; GCN:       ; %bb.0:
-; GCN-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0x0
-; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-NEXT:    ; return to shader part epilog
+; GFX9-LABEL: extractelement_sgpr_v4i128_idx0:
+; GFX9:       ; %bb.0:
+; GFX9-NEXT:    v_mov_b32_e32 v0, 0
+; GFX9-NEXT:    global_load_dwordx4 v[0:3], v0, s[2:3]
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX9-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX9-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX9-NEXT:    v_readfirstlane_b32 s3, v3
+; GFX9-NEXT:    ; return to shader part epilog
+;
+; GFX8-LABEL: extractelement_sgpr_v4i128_idx0:
+; GFX8:       ; %bb.0:
+; GFX8-NEXT:    v_mov_b32_e32 v0, s2
+; GFX8-NEXT:    v_mov_b32_e32 v1, s3
+; GFX8-NEXT:    flat_load_dwordx4 v[0:3], v[0:1]
+; GFX8-NEXT:    s_waitcnt vmcnt(0)
+; GFX8-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX8-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX8-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX8-NEXT:    v_readfirstlane_b32 s3, v3
+; GFX8-NEXT:    ; return to shader part epilog
+;
+; GFX7-LABEL: extractelement_sgpr_v4i128_idx0:
+; GFX7:       ; %bb.0:
+; GFX7-NEXT:    s_mov_b32 s0, s2
+; GFX7-NEXT:    s_mov_b32 s1, s3
+; GFX7-NEXT:    s_mov_b32 s2, -1
+; GFX7-NEXT:    s_mov_b32 s3, 0xf000
+; GFX7-NEXT:    buffer_load_dwordx4 v[0:3], off, s[0:3], 0
+; GFX7-NEXT:    s_waitcnt vmcnt(0)
+; GFX7-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX7-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX7-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX7-NEXT:    v_readfirstlane_b32 s3, v3
+; GFX7-NEXT:    ; return to shader part epilog
 ;
 ; GFX10-LABEL: extractelement_sgpr_v4i128_idx0:
 ; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0x0
-; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    v_mov_b32_e32 v0, 0
+; GFX10-NEXT:    global_load_dwordx4 v[0:3], v0, s[2:3]
+; GFX10-NEXT:    s_waitcnt vmcnt(0)
+; GFX10-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX10-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX10-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX10-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX10-NEXT:    ; return to shader part epilog
 ;
 ; GFX11-LABEL: extractelement_sgpr_v4i128_idx0:
 ; GFX11:       ; %bb.0:
-; GFX11-NEXT:    s_load_b128 s[0:3], s[2:3], 0x0
-; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    global_load_b128 v[0:3], v0, s[2:3]
+; GFX11-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX11-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX11-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX11-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX11-NEXT:    ; return to shader part epilog
   %vector = load <4 x i128>, ptr addrspace(4) %ptr
   %element = extractelement <4 x i128> %vector, i32 0
@@ -306,32 +381,63 @@ define amdgpu_ps i128 @extractelement_sgpr_v4i128_idx0(ptr addrspace(4) inreg %p
 define amdgpu_ps i128 @extractelement_sgpr_v4i128_idx1(ptr addrspace(4) inreg %ptr) {
 ; GFX9-LABEL: extractelement_sgpr_v4i128_idx1:
 ; GFX9:       ; %bb.0:
-; GFX9-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0x10
-; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    v_mov_b32_e32 v0, 0
+; GFX9-NEXT:    global_load_dwordx4 v[0:3], v0, s[2:3] offset:16
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX9-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX9-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX9-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX9-NEXT:    ; return to shader part epilog
 ;
 ; GFX8-LABEL: extractelement_sgpr_v4i128_idx1:
 ; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0x10
-; GFX8-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8-NEXT:    s_add_u32 s0, s2, 16
+; GFX8-NEXT:    s_addc_u32 s1, s3, 0
+; GFX8-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8-NEXT:    flat_load_dwordx4 v[0:3], v[0:1]
+; GFX8-NEXT:    s_waitcnt vmcnt(0)
+; GFX8-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX8-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX8-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX8-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX8-NEXT:    ; return to shader part epilog
 ;
 ; GFX7-LABEL: extractelement_sgpr_v4i128_idx1:
 ; GFX7:       ; %bb.0:
-; GFX7-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0x4
-; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX7-NEXT:    s_mov_b32 s0, s2
+; GFX7-NEXT:    s_mov_b32 s1, s3
+; GFX7-NEXT:    s_mov_b32 s2, -1
+; GFX7-NEXT:    s_mov_b32 s3, 0xf000
+; GFX7-NEXT:    buffer_load_dwordx4 v[0:3], off, s[0:3], 0 offset:16
+; GFX7-NEXT:    s_waitcnt vmcnt(0)
+; GFX7-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX7-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX7-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX7-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX7-NEXT:    ; return to shader part epilog
 ;
 ; GFX10-LABEL: extractelement_sgpr_v4i128_idx1:
 ; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0x10
-; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    v_mov_b32_e32 v0, 0
+; GFX10-NEXT:    global_load_dwordx4 v[0:3], v0, s[2:3] offset:16
+; GFX10-NEXT:    s_waitcnt vmcnt(0)
+; GFX10-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX10-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX10-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX10-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX10-NEXT:    ; return to shader part epilog
 ;
 ; GFX11-LABEL: extractelement_sgpr_v4i128_idx1:
 ; GFX11:       ; %bb.0:
-; GFX11-NEXT:    s_load_b128 s[0:3], s[2:3], 0x10
-; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    global_load_b128 v[0:3], v0, s[2:3] offset:16
+; GFX11-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX11-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX11-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX11-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX11-NEXT:    ; return to shader part epilog
   %vector = load <4 x i128>, ptr addrspace(4) %ptr
   %element = extractelement <4 x i128> %vector, i32 1
@@ -341,32 +447,63 @@ define amdgpu_ps i128 @extractelement_sgpr_v4i128_idx1(ptr addrspace(4) inreg %p
 define amdgpu_ps i128 @extractelement_sgpr_v4i128_idx2(ptr addrspace(4) inreg %ptr) {
 ; GFX9-LABEL: extractelement_sgpr_v4i128_idx2:
 ; GFX9:       ; %bb.0:
-; GFX9-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0x20
-; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    v_mov_b32_e32 v0, 0
+; GFX9-NEXT:    global_load_dwordx4 v[0:3], v0, s[2:3] offset:32
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX9-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX9-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX9-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX9-NEXT:    ; return to shader part epilog
 ;
 ; GFX8-LABEL: extractelement_sgpr_v4i128_idx2:
 ; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0x20
-; GFX8-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8-NEXT:    s_add_u32 s0, s2, 32
+; GFX8-NEXT:    s_addc_u32 s1, s3, 0
+; GFX8-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8-NEXT:    flat_load_dwordx4 v[0:3], v[0:1]
+; GFX8-NEXT:    s_waitcnt vmcnt(0)
+; GFX8-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX8-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX8-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX8-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX8-NEXT:    ; return to shader part epilog
 ;
 ; GFX7-LABEL: extractelement_sgpr_v4i128_idx2:
 ; GFX7:       ; %bb.0:
-; GFX7-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0x8
-; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX7-NEXT:    s_mov_b32 s0, s2
+; GFX7-NEXT:    s_mov_b32 s1, s3
+; GFX7-NEXT:    s_mov_b32 s2, -1
+; GFX7-NEXT:    s_mov_b32 s3, 0xf000
+; GFX7-NEXT:    buffer_load_dwordx4 v[0:3], off, s[0:3], 0 offset:32
+; GFX7-NEXT:    s_waitcnt vmcnt(0)
+; GFX7-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX7-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX7-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX7-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX7-NEXT:    ; return to shader part epilog
 ;
 ; GFX10-LABEL: extractelement_sgpr_v4i128_idx2:
 ; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0x20
-; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    v_mov_b32_e32 v0, 0
+; GFX10-NEXT:    global_load_dwordx4 v[0:3], v0, s[2:3] offset:32
+; GFX10-NEXT:    s_waitcnt vmcnt(0)
+; GFX10-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX10-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX10-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX10-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX10-NEXT:    ; return to shader part epilog
 ;
 ; GFX11-LABEL: extractelement_sgpr_v4i128_idx2:
 ; GFX11:       ; %bb.0:
-; GFX11-NEXT:    s_load_b128 s[0:3], s[2:3], 0x20
-; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    global_load_b128 v[0:3], v0, s[2:3] offset:32
+; GFX11-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX11-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX11-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX11-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX11-NEXT:    ; return to shader part epilog
   %vector = load <4 x i128>, ptr addrspace(4) %ptr
   %element = extractelement <4 x i128> %vector, i32 2
@@ -376,32 +513,63 @@ define amdgpu_ps i128 @extractelement_sgpr_v4i128_idx2(ptr addrspace(4) inreg %p
 define amdgpu_ps i128 @extractelement_sgpr_v4i128_idx3(ptr addrspace(4) inreg %ptr) {
 ; GFX9-LABEL: extractelement_sgpr_v4i128_idx3:
 ; GFX9:       ; %bb.0:
-; GFX9-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0x30
-; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    v_mov_b32_e32 v0, 0
+; GFX9-NEXT:    global_load_dwordx4 v[0:3], v0, s[2:3] offset:48
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX9-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX9-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX9-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX9-NEXT:    ; return to shader part epilog
 ;
 ; GFX8-LABEL: extractelement_sgpr_v4i128_idx3:
 ; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0x30
-; GFX8-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8-NEXT:    s_add_u32 s0, s2, 48
+; GFX8-NEXT:    s_addc_u32 s1, s3, 0
+; GFX8-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8-NEXT:    flat_load_dwordx4 v[0:3], v[0:1]
+; GFX8-NEXT:    s_waitcnt vmcnt(0)
+; GFX8-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX8-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX8-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX8-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX8-NEXT:    ; return to shader part epilog
 ;
 ; GFX7-LABEL: extractelement_sgpr_v4i128_idx3:
 ; GFX7:       ; %bb.0:
-; GFX7-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0xc
-; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX7-NEXT:    s_mov_b32 s0, s2
+; GFX7-NEXT:    s_mov_b32 s1, s3
+; GFX7-NEXT:    s_mov_b32 s2, -1
+; GFX7-NEXT:    s_mov_b32 s3, 0xf000
+; GFX7-NEXT:    buffer_load_dwordx4 v[0:3], off, s[0:3], 0 offset:48
+; GFX7-NEXT:    s_waitcnt vmcnt(0)
+; GFX7-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX7-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX7-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX7-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX7-NEXT:    ; return to shader part epilog
 ;
 ; GFX10-LABEL: extractelement_sgpr_v4i128_idx3:
 ; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[2:3], 0x30
-; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    v_mov_b32_e32 v0, 0
+; GFX10-NEXT:    global_load_dwordx4 v[0:3], v0, s[2:3] offset:48
+; GFX10-NEXT:    s_waitcnt vmcnt(0)
+; GFX10-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX10-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX10-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX10-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX10-NEXT:    ; return to shader part epilog
 ;
 ; GFX11-LABEL: extractelement_sgpr_v4i128_idx3:
 ; GFX11:       ; %bb.0:
-; GFX11-NEXT:    s_load_b128 s[0:3], s[2:3], 0x30
-; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0
+; GFX11-NEXT:    global_load_b128 v[0:3], v0, s[2:3] offset:48
+; GFX11-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX11-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX11-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX11-NEXT:    v_readfirstlane_b32 s3, v3
 ; GFX11-NEXT:    ; return to shader part epilog
   %vector = load <4 x i128>, ptr addrspace(4) %ptr
   %element = extractelement <4 x i128> %vector, i32 3
@@ -585,3 +753,5 @@ define i128 @extractelement_vgpr_v4i128_idx3(ptr addrspace(1) %ptr) {
   %element = extractelement <4 x i128> %vector, i32 3
   ret i128 %element
 }
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; GCN: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll
index 9539ec465e02f..91ee7642790fc 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/implicit-kernarg-backend-usage-global-isel.ll
@@ -11,28 +11,40 @@ define amdgpu_kernel void @addrspacecast(ptr addrspace(5) %ptr.private, ptr addr
 ; GFX8V4-LABEL: addrspacecast:
 ; GFX8V4:       ; %bb.0:
 ; GFX8V4-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
-; GFX8V4-NEXT:    s_load_dwordx2 s[2:3], s[6:7], 0x40
 ; GFX8V4-NEXT:    s_add_i32 s12, s12, s17
 ; GFX8V4-NEXT:    s_lshr_b32 flat_scratch_hi, s12, 8
-; GFX8V4-NEXT:    s_mov_b32 flat_scratch_lo, s13
+; GFX8V4-NEXT:    s_add_u32 s2, s6, 0x44
+; GFX8V4-NEXT:    s_addc_u32 s3, s7, 0
+; GFX8V4-NEXT:    v_mov_b32_e32 v0, s2
 ; GFX8V4-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8V4-NEXT:    s_mov_b32 s4, s0
-; GFX8V4-NEXT:    s_mov_b32 s5, s3
 ; GFX8V4-NEXT:    s_cmp_lg_u32 s0, -1
-; GFX8V4-NEXT:    s_cselect_b64 s[4:5], s[4:5], 0
-; GFX8V4-NEXT:    s_mov_b32 s6, s1
-; GFX8V4-NEXT:    s_mov_b32 s7, s2
+; GFX8V4-NEXT:    v_mov_b32_e32 v1, s3
+; GFX8V4-NEXT:    s_cselect_b32 s2, 1, 0
+; GFX8V4-NEXT:    s_and_b32 s4, 1, s2
+; GFX8V4-NEXT:    s_mov_b32 flat_scratch_lo, s13
+; GFX8V4-NEXT:    s_add_u32 s2, s6, 64
+; GFX8V4-NEXT:    flat_load_dword v3, v[0:1]
+; GFX8V4-NEXT:    s_addc_u32 s3, s7, 0
+; GFX8V4-NEXT:    v_mov_b32_e32 v0, s2
+; GFX8V4-NEXT:    v_mov_b32_e32 v1, s3
+; GFX8V4-NEXT:    flat_load_dword v4, v[0:1]
 ; GFX8V4-NEXT:    s_cmp_lg_u32 s1, -1
-; GFX8V4-NEXT:    v_mov_b32_e32 v0, s4
-; GFX8V4-NEXT:    s_cselect_b64 s[0:1], s[6:7], 0
-; GFX8V4-NEXT:    v_mov_b32_e32 v2, 1
-; GFX8V4-NEXT:    v_mov_b32_e32 v1, s5
-; GFX8V4-NEXT:    flat_store_dword v[0:1], v2
-; GFX8V4-NEXT:    s_waitcnt vmcnt(0)
 ; GFX8V4-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8V4-NEXT:    v_mov_b32_e32 v2, 2
+; GFX8V4-NEXT:    s_cselect_b32 s0, 1, 0
+; GFX8V4-NEXT:    s_and_b32 s0, 1, s0
 ; GFX8V4-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8V4-NEXT:    flat_store_dword v[0:1], v2
+; GFX8V4-NEXT:    v_cmp_ne_u32_e64 vcc, 0, s4
+; GFX8V4-NEXT:    v_cmp_ne_u32_e64 s[0:1], 0, s0
+; GFX8V4-NEXT:    v_mov_b32_e32 v5, 1
+; GFX8V4-NEXT:    v_cndmask_b32_e32 v0, 0, v0, vcc
+; GFX8V4-NEXT:    v_cndmask_b32_e64 v2, 0, v1, s[0:1]
+; GFX8V4-NEXT:    s_waitcnt vmcnt(1)
+; GFX8V4-NEXT:    v_cndmask_b32_e32 v1, 0, v3, vcc
+; GFX8V4-NEXT:    flat_store_dword v[0:1], v5
+; GFX8V4-NEXT:    s_waitcnt vmcnt(0)
+; GFX8V4-NEXT:    v_mov_b32_e32 v0, 2
+; GFX8V4-NEXT:    v_cndmask_b32_e64 v3, 0, v4, s[0:1]
+; GFX8V4-NEXT:    flat_store_dword v[2:3], v0
 ; GFX8V4-NEXT:    s_waitcnt vmcnt(0)
 ; GFX8V4-NEXT:    s_endpgm
 ;
@@ -124,13 +136,15 @@ define amdgpu_kernel void @addrspacecast(ptr addrspace(5) %ptr.private, ptr addr
 define amdgpu_kernel void @llvm_amdgcn_is_shared(ptr %ptr) #0 {
 ; GFX8V4-LABEL: llvm_amdgcn_is_shared:
 ; GFX8V4:       ; %bb.0:
-; GFX8V4-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
-; GFX8V4-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8V4-NEXT:    s_load_dword s0, s[6:7], 0x40
-; GFX8V4-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8V4-NEXT:    s_cmp_eq_u32 s1, s0
-; GFX8V4-NEXT:    s_cselect_b32 s0, 1, 0
+; GFX8V4-NEXT:    s_add_u32 s0, s6, 64
+; GFX8V4-NEXT:    s_addc_u32 s1, s7, 0
 ; GFX8V4-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8V4-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8V4-NEXT:    flat_load_dword v0, v[0:1]
+; GFX8V4-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
+; GFX8V4-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX8V4-NEXT:    v_cmp_eq_u32_e32 vcc, s1, v0
+; GFX8V4-NEXT:    v_cndmask_b32_e64 v0, 0, 1, vcc
 ; GFX8V4-NEXT:    flat_store_dword v[0:1], v0
 ; GFX8V4-NEXT:    s_waitcnt vmcnt(0)
 ; GFX8V4-NEXT:    s_endpgm
@@ -180,13 +194,15 @@ define amdgpu_kernel void @llvm_amdgcn_is_shared(ptr %ptr) #0 {
 define amdgpu_kernel void @llvm_amdgcn_is_private(ptr %ptr) #0 {
 ; GFX8V4-LABEL: llvm_amdgcn_is_private:
 ; GFX8V4:       ; %bb.0:
-; GFX8V4-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
-; GFX8V4-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8V4-NEXT:    s_load_dword s0, s[6:7], 0x44
-; GFX8V4-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8V4-NEXT:    s_cmp_eq_u32 s1, s0
-; GFX8V4-NEXT:    s_cselect_b32 s0, 1, 0
+; GFX8V4-NEXT:    s_add_u32 s0, s6, 0x44
+; GFX8V4-NEXT:    s_addc_u32 s1, s7, 0
 ; GFX8V4-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8V4-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8V4-NEXT:    flat_load_dword v0, v[0:1]
+; GFX8V4-NEXT:    s_load_dwordx2 s[0:1], s[8:9], 0x0
+; GFX8V4-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
+; GFX8V4-NEXT:    v_cmp_eq_u32_e32 vcc, s1, v0
+; GFX8V4-NEXT:    v_cndmask_b32_e64 v0, 0, 1, vcc
 ; GFX8V4-NEXT:    flat_store_dword v[0:1], v0
 ; GFX8V4-NEXT:    s_waitcnt vmcnt(0)
 ; GFX8V4-NEXT:    s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-amdgcn-cs-chain.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-amdgcn-cs-chain.ll
index d4b485a379184..3043484b48717 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-amdgcn-cs-chain.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-amdgcn-cs-chain.ll
@@ -22,7 +22,7 @@ define amdgpu_cs_chain void @chain_call(<3 x i32> inreg %sgpr, { i32, ptr addrsp
   ; GFX11-NEXT:   [[GV:%[0-9]+]]:_(p0) = G_GLOBAL_VALUE @callee
   ; GFX11-NEXT:   [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 -1
   ; GFX11-NEXT:   [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-  ; GFX11-NEXT:   [[GV1:%[0-9]+]]:ccr_sgpr_64(p0) = G_GLOBAL_VALUE @callee
+  ; GFX11-NEXT:   [[GV1:%[0-9]+]]:sgpr_64(p0) = G_GLOBAL_VALUE @callee
   ; GFX11-NEXT:   [[UV:%[0-9]+]]:_(s32), [[UV1:%[0-9]+]]:_(s32), [[UV2:%[0-9]+]]:_(s32) = G_UNMERGE_VALUES [[BUILD_VECTOR]](<3 x s32>)
   ; GFX11-NEXT:   [[INTRINSIC_CONVERGENT:%[0-9]+]]:_(s32) = G_INTRINSIC_CONVERGENT intrinsic(@llvm.amdgcn.readfirstlane), [[UV]](s32)
   ; GFX11-NEXT:   $sgpr0 = COPY [[INTRINSIC_CONVERGENT]](s32)
@@ -51,7 +51,7 @@ define amdgpu_cs_chain void @chain_call(<3 x i32> inreg %sgpr, { i32, ptr addrsp
   ; GFX10-NEXT:   [[GV:%[0-9]+]]:_(p0) = G_GLOBAL_VALUE @callee
   ; GFX10-NEXT:   [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 -1
   ; GFX10-NEXT:   [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-  ; GFX10-NEXT:   [[GV1:%[0-9]+]]:ccr_sgpr_64(p0) = G_GLOBAL_VALUE @callee
+  ; GFX10-NEXT:   [[GV1:%[0-9]+]]:sgpr_64(p0) = G_GLOBAL_VALUE @callee
   ; GFX10-NEXT:   [[UV:%[0-9]+]]:_(s32), [[UV1:%[0-9]+]]:_(s32), [[UV2:%[0-9]+]]:_(s32) = G_UNMERGE_VALUES [[BUILD_VECTOR]](<3 x s32>)
   ; GFX10-NEXT:   [[INTRINSIC_CONVERGENT:%[0-9]+]]:_(s32) = G_INTRINSIC_CONVERGENT intrinsic(@llvm.amdgcn.readfirstlane), [[UV]](s32)
   ; GFX10-NEXT:   $sgpr0 = COPY [[INTRINSIC_CONVERGENT]](s32)
@@ -86,7 +86,7 @@ define amdgpu_cs_chain void @chain_preserve_call(<3 x i32> inreg %sgpr, { i32, p
   ; GFX11-NEXT:   [[GV:%[0-9]+]]:_(p0) = G_GLOBAL_VALUE @callee_preserve
   ; GFX11-NEXT:   [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 -1
   ; GFX11-NEXT:   [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-  ; GFX11-NEXT:   [[GV1:%[0-9]+]]:ccr_sgpr_64(p0) = G_GLOBAL_VALUE @callee_preserve
+  ; GFX11-NEXT:   [[GV1:%[0-9]+]]:sgpr_64(p0) = G_GLOBAL_VALUE @callee_preserve
   ; GFX11-NEXT:   [[UV:%[0-9]+]]:_(s32), [[UV1:%[0-9]+]]:_(s32), [[UV2:%[0-9]+]]:_(s32) = G_UNMERGE_VALUES [[BUILD_VECTOR]](<3 x s32>)
   ; GFX11-NEXT:   [[INTRINSIC_CONVERGENT:%[0-9]+]]:_(s32) = G_INTRINSIC_CONVERGENT intrinsic(@llvm.amdgcn.readfirstlane), [[UV]](s32)
   ; GFX11-NEXT:   $sgpr0 = COPY [[INTRINSIC_CONVERGENT]](s32)
@@ -115,7 +115,7 @@ define amdgpu_cs_chain void @chain_preserve_call(<3 x i32> inreg %sgpr, { i32, p
   ; GFX10-NEXT:   [[GV:%[0-9]+]]:_(p0) = G_GLOBAL_VALUE @callee_preserve
   ; GFX10-NEXT:   [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 -1
   ; GFX10-NEXT:   [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-  ; GFX10-NEXT:   [[GV1:%[0-9]+]]:ccr_sgpr_64(p0) = G_GLOBAL_VALUE @callee_preserve
+  ; GFX10-NEXT:   [[GV1:%[0-9]+]]:sgpr_64(p0) = G_GLOBAL_VALUE @callee_preserve
   ; GFX10-NEXT:   [[UV:%[0-9]+]]:_(s32), [[UV1:%[0-9]+]]:_(s32), [[UV2:%[0-9]+]]:_(s32) = G_UNMERGE_VALUES [[BUILD_VECTOR]](<3 x s32>)
   ; GFX10-NEXT:   [[INTRINSIC_CONVERGENT:%[0-9]+]]:_(s32) = G_INTRINSIC_CONVERGENT intrinsic(@llvm.amdgcn.readfirstlane), [[UV]](s32)
   ; GFX10-NEXT:   $sgpr0 = COPY [[INTRINSIC_CONVERGENT]](s32)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-amdgpu_kernel.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-amdgpu_kernel.ll
index 11153bbbba612..333207a24fe7d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-amdgpu_kernel.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-amdgpu_kernel.ll
@@ -10,10 +10,10 @@ define amdgpu_kernel void @i8_arg(ptr addrspace(1) nocapture %out, i8 %in) nounw
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   [[ZEXT:%[0-9]+]]:_(s32) = G_ZEXT [[LOAD1]](s8)
   ; HSA-VI-NEXT:   G_STORE [[ZEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -25,10 +25,10 @@ define amdgpu_kernel void @i8_arg(ptr addrspace(1) nocapture %out, i8 %in) nounw
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[ZEXT:%[0-9]+]]:_(s32) = G_ZEXT [[LOAD1]](s8)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[ZEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -45,10 +45,10 @@ define amdgpu_kernel void @i8_zext_arg(ptr addrspace(1) nocapture %out, i8 zeroe
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   [[ZEXT:%[0-9]+]]:_(s32) = G_ZEXT [[LOAD1]](s8)
   ; HSA-VI-NEXT:   G_STORE [[ZEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -60,10 +60,10 @@ define amdgpu_kernel void @i8_zext_arg(ptr addrspace(1) nocapture %out, i8 zeroe
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[ZEXT:%[0-9]+]]:_(s32) = G_ZEXT [[LOAD1]](s8)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[ZEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -80,10 +80,10 @@ define amdgpu_kernel void @i8_sext_arg(ptr addrspace(1) nocapture %out, i8 signe
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   [[SEXT:%[0-9]+]]:_(s32) = G_SEXT [[LOAD1]](s8)
   ; HSA-VI-NEXT:   G_STORE [[SEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -95,10 +95,10 @@ define amdgpu_kernel void @i8_sext_arg(ptr addrspace(1) nocapture %out, i8 signe
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[SEXT:%[0-9]+]]:_(s32) = G_SEXT [[LOAD1]](s8)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[SEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -115,10 +115,10 @@ define amdgpu_kernel void @i16_arg(ptr addrspace(1) nocapture %out, i16 %in) nou
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   [[ZEXT:%[0-9]+]]:_(s32) = G_ZEXT [[LOAD1]](s16)
   ; HSA-VI-NEXT:   G_STORE [[ZEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -130,10 +130,10 @@ define amdgpu_kernel void @i16_arg(ptr addrspace(1) nocapture %out, i16 %in) nou
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[ZEXT:%[0-9]+]]:_(s32) = G_ZEXT [[LOAD1]](s16)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[ZEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -150,10 +150,10 @@ define amdgpu_kernel void @i16_zext_arg(ptr addrspace(1) nocapture %out, i16 zer
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   [[ZEXT:%[0-9]+]]:_(s32) = G_ZEXT [[LOAD1]](s16)
   ; HSA-VI-NEXT:   G_STORE [[ZEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -165,10 +165,10 @@ define amdgpu_kernel void @i16_zext_arg(ptr addrspace(1) nocapture %out, i16 zer
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[ZEXT:%[0-9]+]]:_(s32) = G_ZEXT [[LOAD1]](s16)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[ZEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -185,10 +185,10 @@ define amdgpu_kernel void @i16_sext_arg(ptr addrspace(1) nocapture %out, i16 sig
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   [[SEXT:%[0-9]+]]:_(s32) = G_SEXT [[LOAD1]](s16)
   ; HSA-VI-NEXT:   G_STORE [[SEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -200,10 +200,10 @@ define amdgpu_kernel void @i16_sext_arg(ptr addrspace(1) nocapture %out, i16 sig
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[SEXT:%[0-9]+]]:_(s32) = G_SEXT [[LOAD1]](s16)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[SEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -220,10 +220,10 @@ define amdgpu_kernel void @i32_arg(ptr addrspace(1) nocapture %out, i32 %in) nou
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -234,10 +234,10 @@ define amdgpu_kernel void @i32_arg(ptr addrspace(1) nocapture %out, i32 %in) nou
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -253,10 +253,10 @@ define amdgpu_kernel void @f32_arg(ptr addrspace(1) nocapture %out, float %in) n
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -267,10 +267,10 @@ define amdgpu_kernel void @f32_arg(ptr addrspace(1) nocapture %out, float %in) n
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -286,10 +286,10 @@ define amdgpu_kernel void @v2i8_arg(ptr addrspace(1) %out, <2 x i8> %in) {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s8>), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s8>) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<2 x s8>), [[LOAD]](p1) :: (store (<2 x s8>) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -300,10 +300,10 @@ define amdgpu_kernel void @v2i8_arg(ptr addrspace(1) %out, <2 x i8> %in) {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s8>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s8>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<2 x s8>), [[LOAD]](p1) :: (store (<2 x s8>) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -319,10 +319,10 @@ define amdgpu_kernel void @v2i16_arg(ptr addrspace(1) %out, <2 x i16> %in) {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s16>), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s16>) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<2 x s16>), [[LOAD]](p1) :: (store (<2 x s16>) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -333,10 +333,10 @@ define amdgpu_kernel void @v2i16_arg(ptr addrspace(1) %out, <2 x i16> %in) {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s16>), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s16>) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<2 x s16>), [[LOAD]](p1) :: (store (<2 x s16>) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -352,10 +352,10 @@ define amdgpu_kernel void @v2i32_arg(ptr addrspace(1) nocapture %out, <2 x i32>
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s32>), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s32>) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<2 x s32>), [[LOAD]](p1) :: (store (<2 x s32>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -366,10 +366,10 @@ define amdgpu_kernel void @v2i32_arg(ptr addrspace(1) nocapture %out, <2 x i32>
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s32>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s32>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<2 x s32>), [[LOAD]](p1) :: (store (<2 x s32>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -385,10 +385,10 @@ define amdgpu_kernel void @v2f32_arg(ptr addrspace(1) nocapture %out, <2 x float
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s32>), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s32>) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<2 x s32>), [[LOAD]](p1) :: (store (<2 x s32>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -399,10 +399,10 @@ define amdgpu_kernel void @v2f32_arg(ptr addrspace(1) nocapture %out, <2 x float
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s32>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s32>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<2 x s32>), [[LOAD]](p1) :: (store (<2 x s32>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -418,10 +418,10 @@ define amdgpu_kernel void @v3i8_arg(ptr addrspace(1) nocapture %out, <3 x i8> %i
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s8>), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s8>) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<3 x s8>), [[LOAD]](p1) :: (store (<3 x s8>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -432,10 +432,10 @@ define amdgpu_kernel void @v3i8_arg(ptr addrspace(1) nocapture %out, <3 x i8> %i
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s8>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s8>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<3 x s8>), [[LOAD]](p1) :: (store (<3 x s8>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -451,10 +451,10 @@ define amdgpu_kernel void @v3i16_arg(ptr addrspace(1) nocapture %out, <3 x i16>
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s16>), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s16>) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<3 x s16>), [[LOAD]](p1) :: (store (<3 x s16>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -465,10 +465,10 @@ define amdgpu_kernel void @v3i16_arg(ptr addrspace(1) nocapture %out, <3 x i16>
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s16>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s16>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<3 x s16>), [[LOAD]](p1) :: (store (<3 x s16>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -484,10 +484,10 @@ define amdgpu_kernel void @v3i32_arg(ptr addrspace(1) nocapture %out, <3 x i32>
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 16
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s32>), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s32>) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<3 x s32>), [[LOAD]](p1) :: (store (<3 x s32>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -498,10 +498,10 @@ define amdgpu_kernel void @v3i32_arg(ptr addrspace(1) nocapture %out, <3 x i32>
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 52
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s32>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s32>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<3 x s32>), [[LOAD]](p1) :: (store (<3 x s32>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -517,10 +517,10 @@ define amdgpu_kernel void @v3f32_arg(ptr addrspace(1) nocapture %out, <3 x float
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 16
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s32>), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s32>) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<3 x s32>), [[LOAD]](p1) :: (store (<3 x s32>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -531,10 +531,10 @@ define amdgpu_kernel void @v3f32_arg(ptr addrspace(1) nocapture %out, <3 x float
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 52
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s32>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<3 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<3 x s32>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<3 x s32>), [[LOAD]](p1) :: (store (<3 x s32>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -550,10 +550,10 @@ define amdgpu_kernel void @v4i8_arg(ptr addrspace(1) %out, <4 x i8> %in) {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s8>), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s8>) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<4 x s8>), [[LOAD]](p1) :: (store (<4 x s8>) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -564,10 +564,10 @@ define amdgpu_kernel void @v4i8_arg(ptr addrspace(1) %out, <4 x i8> %in) {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s8>), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s8>) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<4 x s8>), [[LOAD]](p1) :: (store (<4 x s8>) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -583,10 +583,10 @@ define amdgpu_kernel void @v4i16_arg(ptr addrspace(1) %out, <4 x i16> %in) {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s16>), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s16>) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<4 x s16>), [[LOAD]](p1) :: (store (<4 x s16>) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -597,10 +597,10 @@ define amdgpu_kernel void @v4i16_arg(ptr addrspace(1) %out, <4 x i16> %in) {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s16>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s16>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<4 x s16>), [[LOAD]](p1) :: (store (<4 x s16>) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -616,10 +616,10 @@ define amdgpu_kernel void @v4i32_arg(ptr addrspace(1) nocapture %out, <4 x i32>
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 16
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s32>), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s32>) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<4 x s32>), [[LOAD]](p1) :: (store (<4 x s32>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -630,10 +630,10 @@ define amdgpu_kernel void @v4i32_arg(ptr addrspace(1) nocapture %out, <4 x i32>
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 52
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s32>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s32>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<4 x s32>), [[LOAD]](p1) :: (store (<4 x s32>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -649,10 +649,10 @@ define amdgpu_kernel void @v4f32_arg(ptr addrspace(1) nocapture %out, <4 x float
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 16
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s32>), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s32>) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<4 x s32>), [[LOAD]](p1) :: (store (<4 x s32>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -663,10 +663,10 @@ define amdgpu_kernel void @v4f32_arg(ptr addrspace(1) nocapture %out, <4 x float
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 52
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s32>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<4 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s32>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<4 x s32>), [[LOAD]](p1) :: (store (<4 x s32>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -682,10 +682,10 @@ define amdgpu_kernel void @v8i8_arg(ptr addrspace(1) %out, <8 x i8> %in) {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s8>), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s8>) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<8 x s8>), [[LOAD]](p1) :: (store (<8 x s8>) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -696,10 +696,10 @@ define amdgpu_kernel void @v8i8_arg(ptr addrspace(1) %out, <8 x i8> %in) {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s8>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s8>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<8 x s8>), [[LOAD]](p1) :: (store (<8 x s8>) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -715,10 +715,10 @@ define amdgpu_kernel void @v8i16_arg(ptr addrspace(1) %out, <8 x i16> %in) {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 16
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s16>), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s16>) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<8 x s16>), [[LOAD]](p1) :: (store (<8 x s16>) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -729,10 +729,10 @@ define amdgpu_kernel void @v8i16_arg(ptr addrspace(1) %out, <8 x i16> %in) {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 52
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s16>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s16>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<8 x s16>), [[LOAD]](p1) :: (store (<8 x s16>) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -748,10 +748,10 @@ define amdgpu_kernel void @v8i32_arg(ptr addrspace(1) nocapture %out, <8 x i32>
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 32
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s32>), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s32>) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<8 x s32>), [[LOAD]](p1) :: (store (<8 x s32>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -762,10 +762,10 @@ define amdgpu_kernel void @v8i32_arg(ptr addrspace(1) nocapture %out, <8 x i32>
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 68
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s32>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s32>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<8 x s32>), [[LOAD]](p1) :: (store (<8 x s32>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -781,10 +781,10 @@ define amdgpu_kernel void @v8f32_arg(ptr addrspace(1) nocapture %out, <8 x float
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 32
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s32>), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s32>) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<8 x s32>), [[LOAD]](p1) :: (store (<8 x s32>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -795,10 +795,10 @@ define amdgpu_kernel void @v8f32_arg(ptr addrspace(1) nocapture %out, <8 x float
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 68
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s32>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<8 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<8 x s32>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<8 x s32>), [[LOAD]](p1) :: (store (<8 x s32>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -814,10 +814,10 @@ define amdgpu_kernel void @v16i8_arg(ptr addrspace(1) %out, <16 x i8> %in) {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 16
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s8>), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s8>) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<16 x s8>), [[LOAD]](p1) :: (store (<16 x s8>) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -828,10 +828,10 @@ define amdgpu_kernel void @v16i8_arg(ptr addrspace(1) %out, <16 x i8> %in) {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 52
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s8>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s8>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s8>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<16 x s8>), [[LOAD]](p1) :: (store (<16 x s8>) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -847,10 +847,10 @@ define amdgpu_kernel void @v16i16_arg(ptr addrspace(1) %out, <16 x i16> %in) {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 32
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s16>), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s16>) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<16 x s16>), [[LOAD]](p1) :: (store (<16 x s16>) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -861,10 +861,10 @@ define amdgpu_kernel void @v16i16_arg(ptr addrspace(1) %out, <16 x i16> %in) {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 68
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s16>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s16>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s16>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<16 x s16>), [[LOAD]](p1) :: (store (<16 x s16>) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -880,10 +880,10 @@ define amdgpu_kernel void @v16i32_arg(ptr addrspace(1) nocapture %out, <16 x i32
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 64
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s32>), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s32>) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<16 x s32>), [[LOAD]](p1) :: (store (<16 x s32>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -894,10 +894,10 @@ define amdgpu_kernel void @v16i32_arg(ptr addrspace(1) nocapture %out, <16 x i32
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 100
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s32>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s32>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<16 x s32>), [[LOAD]](p1) :: (store (<16 x s32>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -913,10 +913,10 @@ define amdgpu_kernel void @v16f32_arg(ptr addrspace(1) nocapture %out, <16 x flo
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 64
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s32>), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s32>) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](<16 x s32>), [[LOAD]](p1) :: (store (<16 x s32>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -927,10 +927,10 @@ define amdgpu_kernel void @v16f32_arg(ptr addrspace(1) nocapture %out, <16 x flo
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 100
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s32>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<16 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s32>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](<16 x s32>), [[LOAD]](p1) :: (store (<16 x s32>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -946,10 +946,10 @@ define amdgpu_kernel void @kernel_arg_i64(ptr addrspace(1) %out, i64 %a) nounwin
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](s64), [[LOAD]](p1) :: (store (s64) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -960,10 +960,10 @@ define amdgpu_kernel void @kernel_arg_i64(ptr addrspace(1) %out, i64 %a) nounwin
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](s64), [[LOAD]](p1) :: (store (s64) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
   store i64 %a, ptr addrspace(1) %out, align 8
@@ -978,10 +978,10 @@ define amdgpu_kernel void @f64_kernel_arg(ptr addrspace(1) %out, double  %in) {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](s64), [[LOAD]](p1) :: (store (s64) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -992,10 +992,10 @@ define amdgpu_kernel void @f64_kernel_arg(ptr addrspace(1) %out, double  %in) {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](s64), [[LOAD]](p1) :: (store (s64) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
 entry:
@@ -1011,10 +1011,10 @@ define amdgpu_kernel void @i1_arg(ptr addrspace(1) %out, i1 %x) nounwind {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](s1), [[LOAD]](p1) :: (store (s1) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
   ;
@@ -1025,10 +1025,10 @@ define amdgpu_kernel void @i1_arg(ptr addrspace(1) %out, i1 %x) nounwind {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](s1), [[LOAD]](p1) :: (store (s1) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
   store i1 %x, ptr addrspace(1) %out, align 1
@@ -1043,10 +1043,10 @@ define amdgpu_kernel void @i1_arg_zext_i32(ptr addrspace(1) %out, i1 %x) nounwin
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   [[ZEXT:%[0-9]+]]:_(s32) = G_ZEXT [[LOAD1]](s1)
   ; HSA-VI-NEXT:   G_STORE [[ZEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -1058,10 +1058,10 @@ define amdgpu_kernel void @i1_arg_zext_i32(ptr addrspace(1) %out, i1 %x) nounwin
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[ZEXT:%[0-9]+]]:_(s32) = G_ZEXT [[LOAD1]](s1)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[ZEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -1078,10 +1078,10 @@ define amdgpu_kernel void @i1_arg_zext_i64(ptr addrspace(1) %out, i1 %x) nounwin
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[LOAD1]](s1)
   ; HSA-VI-NEXT:   G_STORE [[ZEXT]](s64), [[LOAD]](p1) :: (store (s64) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -1093,10 +1093,10 @@ define amdgpu_kernel void @i1_arg_zext_i64(ptr addrspace(1) %out, i1 %x) nounwin
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[LOAD1]](s1)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[ZEXT]](s64), [[LOAD]](p1) :: (store (s64) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -1113,10 +1113,10 @@ define amdgpu_kernel void @i1_arg_sext_i32(ptr addrspace(1) %out, i1 %x) nounwin
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   [[SEXT:%[0-9]+]]:_(s32) = G_SEXT [[LOAD1]](s1)
   ; HSA-VI-NEXT:   G_STORE [[SEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -1128,10 +1128,10 @@ define amdgpu_kernel void @i1_arg_sext_i32(ptr addrspace(1) %out, i1 %x) nounwin
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[SEXT:%[0-9]+]]:_(s32) = G_SEXT [[LOAD1]](s1)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[SEXT]](s32), [[LOAD]](p1) :: (store (s32) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -1148,10 +1148,10 @@ define amdgpu_kernel void @i1_arg_sext_i64(ptr addrspace(1) %out, i1 %x) nounwin
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   [[SEXT:%[0-9]+]]:_(s64) = G_SEXT [[LOAD1]](s1)
   ; HSA-VI-NEXT:   G_STORE [[SEXT]](s64), [[LOAD]](p1) :: (store (s64) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -1163,10 +1163,10 @@ define amdgpu_kernel void @i1_arg_sext_i64(ptr addrspace(1) %out, i1 %x) nounwin
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[SEXT:%[0-9]+]]:_(s64) = G_SEXT [[LOAD1]](s1)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[SEXT]](s64), [[LOAD]](p1) :: (store (s64) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -1185,7 +1185,7 @@ define amdgpu_kernel void @empty_struct_arg({} %arg0, i32 %arg1) nounwind {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[DEF:%[0-9]+]]:_(p1) = G_IMPLICIT_DEF
   ; HSA-VI-NEXT:   G_STORE [[LOAD]](s32), [[DEF]](p1) :: (store (s32) into `ptr addrspace(1) poison`, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -1197,7 +1197,7 @@ define amdgpu_kernel void @empty_struct_arg({} %arg0, i32 %arg1) nounwind {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[DEF:%[0-9]+]]:_(p1) = G_IMPLICIT_DEF
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD]](s32), [[DEF]](p1) :: (store (s32) into `ptr addrspace(1) poison`, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -1213,7 +1213,7 @@ define amdgpu_kernel void @empty_array_arg([0 x i8] %arg0, i32 %arg1) nounwind {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[DEF:%[0-9]+]]:_(p1) = G_IMPLICIT_DEF
   ; HSA-VI-NEXT:   G_STORE [[LOAD]](s32), [[DEF]](p1) :: (store (s32) into `ptr addrspace(1) poison`, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -1225,7 +1225,7 @@ define amdgpu_kernel void @empty_array_arg([0 x i8] %arg0, i32 %arg1) nounwind {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[DEF:%[0-9]+]]:_(p1) = G_IMPLICIT_DEF
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD]](s32), [[DEF]](p1) :: (store (s32) into `ptr addrspace(1) poison`, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -1249,19 +1249,19 @@ define amdgpu_kernel void @struct_argument_alignment({i32, i64} %arg0, i8 %pad,
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 16
   ; HSA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; HSA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s8), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s8) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 24
   ; HSA-VI-NEXT:   [[PTR_ADD3:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C3]](s64)
-  ; HSA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s32), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   [[C4:%[0-9]+]]:_(s64) = G_CONSTANT i64 32
   ; HSA-VI-NEXT:   [[PTR_ADD4:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C4]](s64)
-  ; HSA-VI-NEXT:   [[LOAD4:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD4]](p4) :: (dereferenceable invariant load (s64), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD4:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD4]](p4) :: (dereferenceable invariant load (s64) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   [[C5:%[0-9]+]]:_(p1) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   G_STORE [[LOAD]](s32), [[C5]](p1) :: (volatile store (s32) into `ptr addrspace(1) null`, addrspace 1)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](s64), [[C5]](p1) :: (volatile store (s64) into `ptr addrspace(1) null`, addrspace 1)
@@ -1277,19 +1277,19 @@ define amdgpu_kernel void @struct_argument_alignment({i32, i64} %arg0, i8 %pad,
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 52
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s8), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s8) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 60
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD3:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C3]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C4:%[0-9]+]]:_(s64) = G_CONSTANT i64 68
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD4:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C4]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD4:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD4]](p4) :: (dereferenceable invariant load (s64), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD4:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD4]](p4) :: (dereferenceable invariant load (s64) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C5:%[0-9]+]]:_(p1) = G_CONSTANT i64 0
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD]](s32), [[C5]](p1) :: (volatile store (s32) into `ptr addrspace(1) null`, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](s64), [[C5]](p1) :: (volatile store (s64) into `ptr addrspace(1) null`, addrspace 1)
@@ -1317,19 +1317,19 @@ define amdgpu_kernel void @pointer_in_struct_argument({ptr addrspace(3), ptr add
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p3) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p3) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 16
   ; HSA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; HSA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s8), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s8) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 24
   ; HSA-VI-NEXT:   [[PTR_ADD3:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C3]](s64)
-  ; HSA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(p3) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s32), align 8, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(p3) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 8, addrspace 4)
   ; HSA-VI-NEXT:   [[C4:%[0-9]+]]:_(s64) = G_CONSTANT i64 32
   ; HSA-VI-NEXT:   [[PTR_ADD4:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C4]](s64)
-  ; HSA-VI-NEXT:   [[LOAD4:%[0-9]+]]:_(p1234) = G_LOAD [[PTR_ADD4]](p4) :: (dereferenceable invariant load (s64), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD4:%[0-9]+]]:_(p1234) = G_LOAD [[PTR_ADD4]](p4) :: (dereferenceable invariant load (s64) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   [[C5:%[0-9]+]]:_(p1) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   G_STORE [[LOAD]](p3), [[C5]](p1) :: (volatile store (p3) into `ptr addrspace(1) null`, addrspace 1)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](p1), [[C5]](p1) :: (volatile store (p1) into `ptr addrspace(1) null`, addrspace 1)
@@ -1345,19 +1345,19 @@ define amdgpu_kernel void @pointer_in_struct_argument({ptr addrspace(3), ptr add
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p3) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p3) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 52
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s8), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s8) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 60
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD3:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C3]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(p3) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(p3) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C4:%[0-9]+]]:_(s64) = G_CONSTANT i64 68
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD4:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C4]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD4:%[0-9]+]]:_(p1234) = G_LOAD [[PTR_ADD4]](p4) :: (dereferenceable invariant load (s64), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD4:%[0-9]+]]:_(p1234) = G_LOAD [[PTR_ADD4]](p4) :: (dereferenceable invariant load (s64) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C5:%[0-9]+]]:_(p1) = G_CONSTANT i64 0
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD]](p3), [[C5]](p1) :: (volatile store (p3) into `ptr addrspace(1) null`, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](p1), [[C5]](p1) :: (volatile store (p1) into `ptr addrspace(1) null`, addrspace 1)
@@ -1387,16 +1387,16 @@ define amdgpu_kernel void @packed_struct_argument_alignment(<{i32, i64}> %arg0,
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 4
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64), align 4, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64) from constant-pool, align 4, addrspace 4)
   ; HSA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 13
   ; HSA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; HSA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32), align 1, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 1, addrspace 4)
   ; HSA-VI-NEXT:   [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 17
   ; HSA-VI-NEXT:   [[PTR_ADD3:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C3]](s64)
-  ; HSA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s64), align 1, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s64) from constant-pool, align 1, addrspace 4)
   ; HSA-VI-NEXT:   [[C4:%[0-9]+]]:_(p1) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   G_STORE [[LOAD]](s32), [[C4]](p1) :: (volatile store (s32) into `ptr addrspace(1) null`, addrspace 1)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](s64), [[C4]](p1) :: (volatile store (s64) into `ptr addrspace(1) null`, addrspace 1)
@@ -1411,16 +1411,16 @@ define amdgpu_kernel void @packed_struct_argument_alignment(<{i32, i64}> %arg0,
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 40
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s64) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 49
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32), align 1, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 1, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 53
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD3:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C3]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s64), align 1, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(s64) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s64) from constant-pool, align 1, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C4:%[0-9]+]]:_(p1) = G_CONSTANT i64 0
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD]](s32), [[C4]](p1) :: (volatile store (s32) into `ptr addrspace(1) null`, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](s64), [[C4]](p1) :: (volatile store (s64) into `ptr addrspace(1) null`, addrspace 1)
@@ -1465,7 +1465,7 @@ define amdgpu_kernel void @byref_constant_i8_arg(ptr addrspace(1) nocapture %out
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8) from %ir.in.byref, addrspace 4)
@@ -1480,7 +1480,7 @@ define amdgpu_kernel void @byref_constant_i8_arg(ptr addrspace(1) nocapture %out
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s8) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s8) from %ir.in.byref, addrspace 4)
@@ -1501,7 +1501,7 @@ define amdgpu_kernel void @byref_constant_i16_arg(ptr addrspace(1) nocapture %ou
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16) from %ir.in.byref, addrspace 4)
@@ -1516,7 +1516,7 @@ define amdgpu_kernel void @byref_constant_i16_arg(ptr addrspace(1) nocapture %ou
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s16) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s16) from %ir.in.byref, addrspace 4)
@@ -1537,12 +1537,12 @@ define amdgpu_kernel void @byref_constant_i32_arg(ptr addrspace(1) nocapture %ou
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; HSA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 12
   ; HSA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32) from %ir.in.byref, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD2]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
@@ -1555,12 +1555,12 @@ define amdgpu_kernel void @byref_constant_i32_arg(ptr addrspace(1) nocapture %ou
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; LEGACY-MESA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 48
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32), align 16, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 16, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32) from %ir.in.byref, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD2]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
@@ -1579,12 +1579,12 @@ define amdgpu_kernel void @byref_constant_v4i32_arg(ptr addrspace(1) nocapture %
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 16
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; HSA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 32
   ; HSA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(<4 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s32>) from %ir.in.byref, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD2]](<4 x s32>), [[LOAD]](p1) :: (volatile store (<4 x s32>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
@@ -1597,12 +1597,12 @@ define amdgpu_kernel void @byref_constant_v4i32_arg(ptr addrspace(1) nocapture %
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 52
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; LEGACY-MESA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 68
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(<4 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<4 x s32>) from %ir.in.byref, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD2]](<4 x s32>), [[LOAD]](p1) :: (volatile store (<4 x s32>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
@@ -1621,12 +1621,12 @@ define amdgpu_kernel void @byref_align_constant_i32_arg(ptr addrspace(1) nocaptu
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 256
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; HSA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 260
   ; HSA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32) from %ir.in.byref, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD2]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
@@ -1639,12 +1639,12 @@ define amdgpu_kernel void @byref_align_constant_i32_arg(ptr addrspace(1) nocaptu
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 292
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; LEGACY-MESA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 296
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32), align 8, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 8, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32) from %ir.in.byref, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD2]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
@@ -1663,12 +1663,12 @@ define amdgpu_kernel void @byref_natural_align_constant_v16i32_arg(ptr addrspace
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 64
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; HSA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 128
   ; HSA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(<16 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s32>) from %ir.in.byref, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD2]](<16 x s32>), [[LOAD]](p1) :: (volatile store (<16 x s32>) into %ir.out, align 4, addrspace 1)
   ; HSA-VI-NEXT:   G_STORE [[LOAD1]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
@@ -1681,12 +1681,12 @@ define amdgpu_kernel void @byref_natural_align_constant_v16i32_arg(ptr addrspace
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 100
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; LEGACY-MESA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 164
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(<16 x s32>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<16 x s32>) from %ir.in.byref, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD2]](<16 x s32>), [[LOAD]](p1) :: (volatile store (<16 x s32>) into %ir.out, align 4, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD1]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
@@ -1706,7 +1706,7 @@ define amdgpu_kernel void @byref_global_i32_arg(ptr addrspace(1) nocapture %out,
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; HSA-VI-NEXT:   [[ADDRSPACE_CAST:%[0-9]+]]:_(p1) = G_ADDRSPACE_CAST [[PTR_ADD1]](p4)
@@ -1721,7 +1721,7 @@ define amdgpu_kernel void @byref_global_i32_arg(ptr addrspace(1) nocapture %out,
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; LEGACY-MESA-VI-NEXT:   [[ADDRSPACE_CAST:%[0-9]+]]:_(p1) = G_ADDRSPACE_CAST [[PTR_ADD1]](p4)
@@ -1741,7 +1741,7 @@ define amdgpu_kernel void @byref_flat_i32_arg(ptr addrspace(1) nocapture %out, p
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; HSA-VI-NEXT:   [[ADDRSPACE_CAST:%[0-9]+]]:_(p0) = G_ADDRSPACE_CAST [[PTR_ADD1]](p4)
@@ -1756,7 +1756,7 @@ define amdgpu_kernel void @byref_flat_i32_arg(ptr addrspace(1) nocapture %out, p
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; LEGACY-MESA-VI-NEXT:   [[ADDRSPACE_CAST:%[0-9]+]]:_(p0) = G_ADDRSPACE_CAST [[PTR_ADD1]](p4)
@@ -1776,7 +1776,7 @@ define amdgpu_kernel void @byref_constant_32bit_i32_arg(ptr addrspace(1) nocaptu
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; HSA-VI-NEXT:   [[ADDRSPACE_CAST:%[0-9]+]]:_(p6) = G_ADDRSPACE_CAST [[PTR_ADD1]](p4)
@@ -1791,7 +1791,7 @@ define amdgpu_kernel void @byref_constant_32bit_i32_arg(ptr addrspace(1) nocaptu
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; LEGACY-MESA-VI-NEXT:   [[ADDRSPACE_CAST:%[0-9]+]]:_(p6) = G_ADDRSPACE_CAST [[PTR_ADD1]](p4)
@@ -1811,7 +1811,7 @@ define amdgpu_kernel void @byref_unknown_as_i32_arg(ptr addrspace(1) nocapture %
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; HSA-VI-NEXT:   [[ADDRSPACE_CAST:%[0-9]+]]:_(p999) = G_ADDRSPACE_CAST [[PTR_ADD1]](p4)
@@ -1826,7 +1826,7 @@ define amdgpu_kernel void @byref_unknown_as_i32_arg(ptr addrspace(1) nocapture %
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; LEGACY-MESA-VI-NEXT:   [[ADDRSPACE_CAST:%[0-9]+]]:_(p999) = G_ADDRSPACE_CAST [[PTR_ADD1]](p4)
@@ -1847,7 +1847,7 @@ define amdgpu_kernel void @byref_local_i32_arg(ptr addrspace(1) nocapture %out,
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; HSA-VI-NEXT:   [[ADDRSPACE_CAST:%[0-9]+]]:_(p3) = G_ADDRSPACE_CAST [[PTR_ADD1]](p4)
@@ -1862,7 +1862,7 @@ define amdgpu_kernel void @byref_local_i32_arg(ptr addrspace(1) nocapture %out,
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; LEGACY-MESA-VI-NEXT:   [[ADDRSPACE_CAST:%[0-9]+]]:_(p3) = G_ADDRSPACE_CAST [[PTR_ADD1]](p4)
@@ -1882,14 +1882,14 @@ define amdgpu_kernel void @multi_byref_constant_i32_arg(ptr addrspace(1) nocaptu
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 8
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; HSA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 12
   ; HSA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
   ; HSA-VI-NEXT:   [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 16
   ; HSA-VI-NEXT:   [[PTR_ADD3:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C3]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s32), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s32) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32) from %ir.in0.byref, addrspace 4)
   ; HSA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32) from %ir.in1.byref, addrspace 4)
   ; HSA-VI-NEXT:   G_STORE [[LOAD2]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
@@ -1904,14 +1904,14 @@ define amdgpu_kernel void @multi_byref_constant_i32_arg(ptr addrspace(1) nocaptu
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p1) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p1) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 44
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
   ; LEGACY-MESA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 48
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD2:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C2]](s64)
   ; LEGACY-MESA-VI-NEXT:   [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 52
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD3:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C3]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s32), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD3]](p4) :: (dereferenceable invariant load (s32) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[LOAD2:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (s32) from %ir.in0.byref, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[LOAD3:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (dereferenceable invariant load (s32) from %ir.in1.byref, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD2]](s32), [[LOAD]](p1) :: (volatile store (s32) into %ir.out, addrspace 1)
@@ -1963,7 +1963,7 @@ define amdgpu_kernel void @p3i8_arg(ptr addrspace(3) %arg) nounwind {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p3) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p3), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p3) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p3) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s8) = G_CONSTANT i8 9
   ; HSA-VI-NEXT:   G_STORE [[C1]](s8), [[LOAD]](p3) :: (store (s8) into %ir.arg, align 4, addrspace 3)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -1975,7 +1975,7 @@ define amdgpu_kernel void @p3i8_arg(ptr addrspace(3) %arg) nounwind {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p3) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p3), addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(p3) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (p3) from constant-pool, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s8) = G_CONSTANT i8 9
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[C1]](s8), [[LOAD]](p3) :: (store (s8) into %ir.arg, align 4, addrspace 3)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -2015,7 +2015,7 @@ define amdgpu_kernel void @v2p1i8_arg(<2 x ptr addrspace(1)> %arg) nounwind {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(<2 x p1>) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (<2 x p1>), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(<2 x p1>) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (<2 x p1>) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   [[DEF:%[0-9]+]]:_(p1) = G_IMPLICIT_DEF
   ; HSA-VI-NEXT:   G_STORE [[LOAD]](<2 x p1>), [[DEF]](p1) :: (store (<2 x p1>) into `ptr addrspace(1) poison`, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -2027,7 +2027,7 @@ define amdgpu_kernel void @v2p1i8_arg(<2 x ptr addrspace(1)> %arg) nounwind {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(<2 x p1>) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (<2 x p1>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(<2 x p1>) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (<2 x p1>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[DEF:%[0-9]+]]:_(p1) = G_IMPLICIT_DEF
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD]](<2 x p1>), [[DEF]](p1) :: (store (<2 x p1>) into `ptr addrspace(1) poison`, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -2043,7 +2043,7 @@ define amdgpu_kernel void @v2p3i8_arg(<2 x ptr addrspace(3)> %arg) nounwind {
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(<2 x p3>) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (<2 x p3>), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(<2 x p3>) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (<2 x p3>) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[DEF:%[0-9]+]]:_(p1) = G_IMPLICIT_DEF
   ; HSA-VI-NEXT:   G_STORE [[LOAD]](<2 x p3>), [[DEF]](p1) :: (store (<2 x p3>) into `ptr addrspace(1) poison`, addrspace 1)
   ; HSA-VI-NEXT:   S_ENDPGM 0
@@ -2055,7 +2055,7 @@ define amdgpu_kernel void @v2p3i8_arg(<2 x ptr addrspace(3)> %arg) nounwind {
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(<2 x p3>) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (<2 x p3>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(<2 x p3>) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (<2 x p3>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[DEF:%[0-9]+]]:_(p1) = G_IMPLICIT_DEF
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD]](<2 x p3>), [[DEF]](p1) :: (store (<2 x p3>) into `ptr addrspace(1) poison`, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   S_ENDPGM 0
@@ -2071,10 +2071,10 @@ define amdgpu_kernel void @v2p1i8_in_struct_arg({ <2 x ptr addrspace(1)>, <2 x p
   ; HSA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; HSA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
   ; HSA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(<2 x p1>) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (<2 x s64>), addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(<2 x p1>) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (<2 x s64>) from constant-pool, addrspace 4)
   ; HSA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 16
   ; HSA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x p3>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s32>), align 16, addrspace 4)
+  ; HSA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x p3>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s32>) from constant-pool, align 16, addrspace 4)
   ; HSA-VI-NEXT:   [[DEF:%[0-9]+]]:_(p1) = G_IMPLICIT_DEF
   ; HSA-VI-NEXT:   G_STORE [[LOAD]](<2 x p1>), [[DEF]](p1) :: (store (<2 x p1>) into `ptr addrspace(1) poison`, addrspace 1)
   ; HSA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 16
@@ -2089,10 +2089,10 @@ define amdgpu_kernel void @v2p1i8_in_struct_arg({ <2 x ptr addrspace(1)>, <2 x p
   ; LEGACY-MESA-VI-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr4_sgpr5
   ; LEGACY-MESA-VI-NEXT:   [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 36
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(<2 x p1>) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (<2 x s64>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD:%[0-9]+]]:_(<2 x p1>) = G_LOAD [[PTR_ADD]](p4) :: (dereferenceable invariant load (<2 x s64>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 52
   ; LEGACY-MESA-VI-NEXT:   [[PTR_ADD1:%[0-9]+]]:_(p4) = G_PTR_ADD [[COPY]], [[C1]](s64)
-  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x p3>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s32>), align 4, addrspace 4)
+  ; LEGACY-MESA-VI-NEXT:   [[LOAD1:%[0-9]+]]:_(<2 x p3>) = G_LOAD [[PTR_ADD1]](p4) :: (dereferenceable invariant load (<2 x s32>) from constant-pool, align 4, addrspace 4)
   ; LEGACY-MESA-VI-NEXT:   [[DEF:%[0-9]+]]:_(p1) = G_IMPLICIT_DEF
   ; LEGACY-MESA-VI-NEXT:   G_STORE [[LOAD]](<2 x p1>), [[DEF]](p1) :: (store (<2 x p1>) into `ptr addrspace(1) poison`, addrspace 1)
   ; LEGACY-MESA-VI-NEXT:   [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 16
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-addrspacecast.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-addrspacecast.mir
index 4471980c1ba1c..e83b4eabd5dc8 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-addrspacecast.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-addrspacecast.mir
@@ -428,9 +428,9 @@ body: |
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $vgpr0
     ; GCN-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; GCN-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; GCN-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; GCN-NEXT: $vgpr0_vgpr1 = COPY [[MV]](p4)
+    ; GCN-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; GCN-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; GCN-NEXT: $vgpr0_vgpr1 = COPY [[INTTOPTR]](p4)
     %0:_(p6) = COPY $vgpr0
     %1:_(p4) = G_ADDRSPACE_CAST %0
     $vgpr0_vgpr1 = COPY %1
@@ -485,9 +485,9 @@ body: |
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $vgpr0
     ; GCN-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; GCN-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; GCN-NEXT: [[MV:%[0-9]+]]:_(p0) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; GCN-NEXT: $vgpr0_vgpr1 = COPY [[MV]](p0)
+    ; GCN-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; GCN-NEXT: [[INTTOPTR:%[0-9]+]]:_(p0) = G_INTTOPTR [[ZEXT]](s64)
+    ; GCN-NEXT: $vgpr0_vgpr1 = COPY [[INTTOPTR]](p0)
     %0:_(p6) = COPY $vgpr0
     %1:_(p0) = G_ADDRSPACE_CAST %0
     $vgpr0_vgpr1 = COPY %1
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-constant-32bit.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-constant-32bit.mir
index b91f1f408dc58..b9c0217aa591f 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-constant-32bit.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-constant-32bit.mir
@@ -12,24 +12,24 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $vgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[MV]](p4) :: (load (s8), addrspace 6)
-    ; CI-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
-    ; CI-NEXT: [[PTR_ADD:%[0-9]+]]:_(p4) = nuw inbounds G_PTR_ADD [[MV]], [[C1]](s64)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[INTTOPTR]](p4) :: (load (s8), addrspace 6)
+    ; CI-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
+    ; CI-NEXT: [[PTR_ADD:%[0-9]+]]:_(p4) = nuw inbounds G_PTR_ADD [[INTTOPTR]], [[C]](s64)
     ; CI-NEXT: [[ZEXTLOAD1:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[PTR_ADD]](p4) :: (load (s8) from unknown-address + 1, addrspace 6)
-    ; CI-NEXT: [[C2:%[0-9]+]]:_(s32) = G_CONSTANT i32 8
-    ; CI-NEXT: [[SHL:%[0-9]+]]:_(s32) = G_SHL [[ZEXTLOAD1]], [[C2]](s32)
+    ; CI-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 8
+    ; CI-NEXT: [[SHL:%[0-9]+]]:_(s32) = G_SHL [[ZEXTLOAD1]], [[C1]](s32)
     ; CI-NEXT: [[OR:%[0-9]+]]:_(s32) = G_OR [[SHL]], [[ZEXTLOAD]]
-    ; CI-NEXT: [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
-    ; CI-NEXT: [[PTR_ADD1:%[0-9]+]]:_(p4) = nuw inbounds G_PTR_ADD [[MV]], [[C3]](s64)
+    ; CI-NEXT: [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
+    ; CI-NEXT: [[PTR_ADD1:%[0-9]+]]:_(p4) = nuw inbounds G_PTR_ADD [[INTTOPTR]], [[C2]](s64)
     ; CI-NEXT: [[ZEXTLOAD2:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[PTR_ADD1]](p4) :: (load (s8) from unknown-address + 2, addrspace 6)
-    ; CI-NEXT: [[PTR_ADD2:%[0-9]+]]:_(p4) = nuw inbounds G_PTR_ADD [[PTR_ADD1]], [[C1]](s64)
+    ; CI-NEXT: [[PTR_ADD2:%[0-9]+]]:_(p4) = nuw inbounds G_PTR_ADD [[PTR_ADD1]], [[C]](s64)
     ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD2]](p4) :: (load (s8) from unknown-address + 3, addrspace 6)
-    ; CI-NEXT: [[SHL1:%[0-9]+]]:_(s32) = G_SHL [[LOAD]], [[C2]](s32)
+    ; CI-NEXT: [[SHL1:%[0-9]+]]:_(s32) = G_SHL [[LOAD]], [[C1]](s32)
     ; CI-NEXT: [[OR1:%[0-9]+]]:_(s32) = G_OR [[SHL1]], [[ZEXTLOAD2]]
-    ; CI-NEXT: [[C4:%[0-9]+]]:_(s32) = G_CONSTANT i32 16
-    ; CI-NEXT: [[SHL2:%[0-9]+]]:_(s32) = G_SHL [[OR1]], [[C4]](s32)
+    ; CI-NEXT: [[C3:%[0-9]+]]:_(s32) = G_CONSTANT i32 16
+    ; CI-NEXT: [[SHL2:%[0-9]+]]:_(s32) = G_SHL [[OR1]], [[C3]](s32)
     ; CI-NEXT: [[OR2:%[0-9]+]]:_(s32) = G_OR [[SHL2]], [[OR]]
     ; CI-NEXT: $vgpr0 = COPY [[OR2]](s32)
     %0:_(p6) = COPY $vgpr0
@@ -48,9 +48,9 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $vgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[MV]](p4) :: (load (s32), addrspace 6)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[INTTOPTR]](p4) :: (load (s32), addrspace 6)
     ; CI-NEXT: $vgpr0 = COPY [[LOAD]](s32)
     %0:_(p6) = COPY $vgpr0
     %1:_(s32) = G_LOAD %0 :: (load (s32), align 4, addrspace 6)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-sextload-constant-32bit.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-sextload-constant-32bit.mir
index d87212d64d625..067844d506ef5 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-sextload-constant-32bit.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-sextload-constant-32bit.mir
@@ -13,9 +13,9 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $sgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[MV]](p4) :: (load (s32), addrspace 6)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[INTTOPTR]](p4) :: (load (s32), addrspace 6)
     ; CI-NEXT: [[SEXT:%[0-9]+]]:_(s64) = G_SEXT [[LOAD]](s32)
     ; CI-NEXT: $vgpr0_vgpr1 = COPY [[SEXT]](s64)
     %0:_(p6) = COPY $sgpr0
@@ -34,9 +34,9 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $sgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[MV]](p4) :: (load (s32), align 2, addrspace 6)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[INTTOPTR]](p4) :: (load (s32), align 2, addrspace 6)
     ; CI-NEXT: [[SEXT:%[0-9]+]]:_(s64) = G_SEXT [[LOAD]](s32)
     ; CI-NEXT: $vgpr0_vgpr1 = COPY [[SEXT]](s64)
     %0:_(p6) = COPY $sgpr0
@@ -55,9 +55,9 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $sgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[MV]](p4) :: (load (s32), align 1, addrspace 6)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[INTTOPTR]](p4) :: (load (s32), align 1, addrspace 6)
     ; CI-NEXT: [[SEXT:%[0-9]+]]:_(s64) = G_SEXT [[LOAD]](s32)
     ; CI-NEXT: $vgpr0_vgpr1 = COPY [[SEXT]](s64)
     %0:_(p6) = COPY $sgpr0
@@ -76,9 +76,9 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $sgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[SEXTLOAD:%[0-9]+]]:_(s32) = G_SEXTLOAD [[MV]](p4) :: (load (s8), addrspace 6)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[SEXTLOAD:%[0-9]+]]:_(s32) = G_SEXTLOAD [[INTTOPTR]](p4) :: (load (s8), addrspace 6)
     ; CI-NEXT: $vgpr0 = COPY [[SEXTLOAD]](s32)
     %0:_(p6) = COPY $sgpr0
     %1:_(s32) = G_SEXTLOAD %0 :: (load (s8), align 1, addrspace 6)
@@ -96,9 +96,9 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $sgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[SEXTLOAD:%[0-9]+]]:_(s32) = G_SEXTLOAD [[MV]](p4) :: (load (s16), addrspace 6)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[SEXTLOAD:%[0-9]+]]:_(s32) = G_SEXTLOAD [[INTTOPTR]](p4) :: (load (s16), addrspace 6)
     ; CI-NEXT: $vgpr0 = COPY [[SEXTLOAD]](s32)
     %0:_(p6) = COPY $sgpr0
     %1:_(s32) = G_SEXTLOAD %0 :: (load (s16), align 2, addrspace 6)
@@ -116,9 +116,9 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $sgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[SEXTLOAD:%[0-9]+]]:_(s32) = G_SEXTLOAD [[MV]](p4) :: (load (s16), align 1, addrspace 6)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[SEXTLOAD:%[0-9]+]]:_(s32) = G_SEXTLOAD [[INTTOPTR]](p4) :: (load (s16), align 1, addrspace 6)
     ; CI-NEXT: $vgpr0 = COPY [[SEXTLOAD]](s32)
     %0:_(p6) = COPY $sgpr0
     %1:_(s32) = G_SEXTLOAD %0 :: (load (s16), align 1, addrspace 6)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-zextload-constant-32bit.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-zextload-constant-32bit.mir
index a4971e94e75f6..c72cdd5dbc8be 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-zextload-constant-32bit.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-zextload-constant-32bit.mir
@@ -14,11 +14,11 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $sgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[MV]](p4) :: (load (s32), addrspace 6)
-    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[LOAD]](s32)
-    ; CI-NEXT: $vgpr0_vgpr1 = COPY [[ZEXT]](s64)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[INTTOPTR]](p4) :: (load (s32), addrspace 6)
+    ; CI-NEXT: [[ZEXT1:%[0-9]+]]:_(s64) = G_ZEXT [[LOAD]](s32)
+    ; CI-NEXT: $vgpr0_vgpr1 = COPY [[ZEXT1]](s64)
     %0:_(p6) = COPY $sgpr0
     %1:_(s64) = G_ZEXTLOAD %0 :: (load (s32), align 4, addrspace 6)
     $vgpr0_vgpr1 = COPY %1
@@ -35,11 +35,11 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $sgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[MV]](p4) :: (load (s32), align 2, addrspace 6)
-    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[LOAD]](s32)
-    ; CI-NEXT: $vgpr0_vgpr1 = COPY [[ZEXT]](s64)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[INTTOPTR]](p4) :: (load (s32), align 2, addrspace 6)
+    ; CI-NEXT: [[ZEXT1:%[0-9]+]]:_(s64) = G_ZEXT [[LOAD]](s32)
+    ; CI-NEXT: $vgpr0_vgpr1 = COPY [[ZEXT1]](s64)
     %0:_(p6) = COPY $sgpr0
     %1:_(s64) = G_ZEXTLOAD %0 :: (load (s32), align 2, addrspace 6)
     $vgpr0_vgpr1 = COPY %1
@@ -56,11 +56,11 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $sgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[MV]](p4) :: (load (s32), align 1, addrspace 6)
-    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[LOAD]](s32)
-    ; CI-NEXT: $vgpr0_vgpr1 = COPY [[ZEXT]](s64)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[INTTOPTR]](p4) :: (load (s32), align 1, addrspace 6)
+    ; CI-NEXT: [[ZEXT1:%[0-9]+]]:_(s64) = G_ZEXT [[LOAD]](s32)
+    ; CI-NEXT: $vgpr0_vgpr1 = COPY [[ZEXT1]](s64)
     %0:_(p6) = COPY $sgpr0
     %1:_(s64) = G_ZEXTLOAD %0 :: (load (s32), align 1, addrspace 6)
     $vgpr0_vgpr1 = COPY %1
@@ -77,9 +77,9 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $sgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[MV]](p4) :: (load (s8), addrspace 6)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[INTTOPTR]](p4) :: (load (s8), addrspace 6)
     ; CI-NEXT: $vgpr0 = COPY [[ZEXTLOAD]](s32)
     %0:_(p6) = COPY $sgpr0
     %1:_(s32) = G_ZEXTLOAD %0 :: (load (s8), align 1, addrspace 6)
@@ -97,9 +97,9 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $sgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[MV]](p4) :: (load (s16), addrspace 6)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[INTTOPTR]](p4) :: (load (s16), addrspace 6)
     ; CI-NEXT: $vgpr0 = COPY [[ZEXTLOAD]](s32)
     %0:_(p6) = COPY $sgpr0
     %1:_(s32) = G_ZEXTLOAD %0 :: (load (s16), align 2, addrspace 6)
@@ -117,9 +117,9 @@ body: |
     ; CI-NEXT: {{  $}}
     ; CI-NEXT: [[COPY:%[0-9]+]]:_(p6) = COPY $sgpr0
     ; CI-NEXT: [[PTRTOINT:%[0-9]+]]:_(s32) = G_PTRTOINT [[COPY]](p6)
-    ; CI-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
-    ; CI-NEXT: [[MV:%[0-9]+]]:_(p4) = G_MERGE_VALUES [[PTRTOINT]](s32), [[C]](s32)
-    ; CI-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[MV]](p4) :: (load (s16), align 1, addrspace 6)
+    ; CI-NEXT: [[ZEXT:%[0-9]+]]:_(s64) = G_ZEXT [[PTRTOINT]](s32)
+    ; CI-NEXT: [[INTTOPTR:%[0-9]+]]:_(p4) = G_INTTOPTR [[ZEXT]](s64)
+    ; CI-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[INTTOPTR]](p4) :: (load (s16), align 1, addrspace 6)
     ; CI-NEXT: $vgpr0 = COPY [[ZEXTLOAD]](s32)
     %0:_(p6) = COPY $sgpr0
     %1:_(s32) = G_ZEXTLOAD %0 :: (load (s16), align 1, addrspace 6)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
index e0016b0a5a64d..993fb7eeb3aa9 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
@@ -12,7 +12,7 @@ define amdgpu_cs void @memcpy_p1i8(ptr addrspace(1) %dst, ptr addrspace(1) %src)
 ; LOOP-NEXT:    s_mov_b32 s3, 0xf000
 ; LOOP-NEXT:    v_mov_b32_e32 v5, s1
 ; LOOP-NEXT:    v_mov_b32_e32 v4, s0
-; LOOP-NEXT:  .LBB0_1: ; %load-store-loop
+; LOOP-NEXT:  .LBB0_1: ; %static-memcpy-expansion-main-body
 ; LOOP-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; LOOP-NEXT:    v_add_i32_e32 v6, vcc, v2, v4
 ; LOOP-NEXT:    v_addc_u32_e32 v7, vcc, v3, v5, vcc
@@ -177,7 +177,7 @@ define amdgpu_cs void @memcpy_p1i8(ptr addrspace(1) %dst, ptr addrspace(1) %src)
 ; LOOP-NEXT:    buffer_store_byte v30, v[6:7], s[0:3], 0 addr64 offset:30
 ; LOOP-NEXT:    buffer_store_byte v13, v[6:7], s[0:3], 0 addr64 offset:31
 ; LOOP-NEXT:    s_cbranch_vccnz .LBB0_1
-; LOOP-NEXT:  ; %bb.2: ; %memcpy-split
+; LOOP-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; LOOP-NEXT:    s_mov_b32 s2, 0
 ; LOOP-NEXT:    s_mov_b32 s3, 0xf000
 ; LOOP-NEXT:    s_mov_b64 s[0:1], 0
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/load-constant32bit.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/load-constant32bit.ll
index 0038a097174c6..2e1b853ff8c58 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/load-constant32bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/load-constant32bit.ll
@@ -11,8 +11,8 @@ define amdgpu_ps float @load_constant32bit_vgpr_offset(i32 %arg) {
 ; GFX6-LABEL: load_constant32bit_vgpr_offset:
 ; GFX6:       ; %bb.0: ; %entry
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
-; GFX6-NEXT:    s_mov_b32 s2, 0
 ; GFX6-NEXT:    v_mov_b32_e32 v1, 0
+; GFX6-NEXT:    s_mov_b32 s2, 0
 ; GFX6-NEXT:    s_mov_b32 s3, 0xf000
 ; GFX6-NEXT:    s_mov_b64 s[0:1], 0
 ; GFX6-NEXT:    buffer_load_dword v0, v[0:1], s[0:3], 0 addr64
@@ -59,8 +59,8 @@ define amdgpu_ps <8 x float> @load_constant32bit_vgpr_v8f32(ptr addrspace(6) %ar
 ; GFX6-LABEL: load_constant32bit_vgpr_v8f32:
 ; GFX6:       ; %bb.0: ; %entry
 ; GFX6-NEXT:    v_mov_b32_e32 v4, v0
-; GFX6-NEXT:    s_mov_b32 s2, 0
 ; GFX6-NEXT:    v_mov_b32_e32 v5, 0
+; GFX6-NEXT:    s_mov_b32 s2, 0
 ; GFX6-NEXT:    s_mov_b32 s3, 0xf000
 ; GFX6-NEXT:    s_mov_b64 s[0:1], 0
 ; GFX6-NEXT:    buffer_load_dwordx4 v[0:3], v[4:5], s[0:3], 0 addr64
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-load.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-load.mir
index d52b5fe9df247..9034a94535005 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-load.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-load.mir
@@ -296,9 +296,9 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<8 x s32>) = G_LOAD [[COPY]](p1) :: (invariant load (<8 x s32>), addrspace 1)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<8 x s32>) = G_LOAD [[COPY]](p1) :: (invariant load (<8 x s32>) from constant-pool, addrspace 1)
     %0:_(p1) = COPY $sgpr0_sgpr1
-    %1:_(<8 x s32>) = G_LOAD %0 :: (invariant load (<8 x s32>), addrspace 1)
+    %1:_(<8 x s32>) = G_LOAD %0 :: (invariant load (<8 x s32>) from constant-pool, addrspace 1)
 ...
 
 ---
@@ -313,9 +313,9 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<4 x s64>) = G_LOAD [[COPY]](p1) :: (invariant load (<4 x s64>), addrspace 1)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<4 x s64>) = G_LOAD [[COPY]](p1) :: (invariant load (<4 x s64>) from constant-pool, addrspace 1)
     %0:_(p1) = COPY $sgpr0_sgpr1
-    %1:_(<4 x s64>) = G_LOAD %0 :: (invariant load (<4 x s64>), addrspace 1)
+    %1:_(<4 x s64>) = G_LOAD %0 :: (invariant load (<4 x s64>) from constant-pool, addrspace 1)
 ...
 
 ---
@@ -330,9 +330,9 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<16 x s32>) = G_LOAD [[COPY]](p1) :: (invariant load (<16 x s32>), addrspace 1)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<16 x s32>) = G_LOAD [[COPY]](p1) :: (invariant load (<16 x s32>) from constant-pool, addrspace 1)
     %0:_(p1) = COPY $sgpr0_sgpr1
-    %1:_(<16 x s32>) = G_LOAD %0 :: (invariant load (<16 x s32>), addrspace 1)
+    %1:_(<16 x s32>) = G_LOAD %0 :: (invariant load (<16 x s32>) from constant-pool, addrspace 1)
 ...
 
 ---
@@ -347,9 +347,9 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<8 x s64>) = G_LOAD [[COPY]](p1) :: (invariant load (<8 x s64>), addrspace 1)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<8 x s64>) = G_LOAD [[COPY]](p1) :: (invariant load (<8 x s64>) from constant-pool, addrspace 1)
     %0:_(p1) = COPY $sgpr0_sgpr1
-    %1:_(<8 x s64>) = G_LOAD %0 :: (invariant load (<8 x s64>), addrspace 1)
+    %1:_(<8 x s64>) = G_LOAD %0 :: (invariant load (<8 x s64>) from constant-pool, addrspace 1)
 ...
 
 ---
@@ -603,9 +603,9 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<8 x s32>) = G_LOAD [[COPY]](p4) :: (load (<8 x s32>), addrspace 4)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<8 x s32>) = G_LOAD [[COPY]](p4) :: (load (<8 x s32>) from constant-pool, addrspace 4)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(<8 x s32>) = G_LOAD %0 :: (load (<8 x s32>), addrspace 4)
+    %1:_(<8 x s32>) = G_LOAD %0 :: (load (<8 x s32>) from constant-pool, addrspace 4)
 ...
 
 ---
@@ -620,9 +620,9 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<16 x s16>) = G_LOAD [[COPY]](p4) :: (load (<16 x s16>), addrspace 4)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<16 x s16>) = G_LOAD [[COPY]](p4) :: (load (<16 x s16>) from constant-pool, addrspace 4)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(<16 x s16>) = G_LOAD %0 :: (load (<16 x s16>), addrspace 4)
+    %1:_(<16 x s16>) = G_LOAD %0 :: (load (<16 x s16>) from constant-pool, addrspace 4)
 ...
 
 ---
@@ -637,9 +637,9 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<4 x s64>) = G_LOAD [[COPY]](p4) :: (load (<4 x s64>), addrspace 4)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<4 x s64>) = G_LOAD [[COPY]](p4) :: (load (<4 x s64>) from constant-pool, addrspace 4)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(<4 x s64>) = G_LOAD %0 :: (load (<4 x s64>), addrspace 4)
+    %1:_(<4 x s64>) = G_LOAD %0 :: (load (<4 x s64>) from constant-pool, addrspace 4)
 ...
 
 ---
@@ -654,9 +654,9 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<16 x s32>) = G_LOAD [[COPY]](p4) :: (load (<16 x s32>), addrspace 4)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<16 x s32>) = G_LOAD [[COPY]](p4) :: (load (<16 x s32>) from constant-pool, addrspace 4)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(<16 x s32>) = G_LOAD %0 :: (load (<16 x s32>), addrspace 4)
+    %1:_(<16 x s32>) = G_LOAD %0 :: (load (<16 x s32>) from constant-pool, addrspace 4)
 ...
 
 ---
@@ -671,9 +671,9 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<8 x s64>) = G_LOAD [[COPY]](p4) :: (load (<8 x s64>), addrspace 4)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(<8 x s64>) = G_LOAD [[COPY]](p4) :: (load (<8 x s64>) from constant-pool, addrspace 4)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(<8 x s64>) = G_LOAD %0 :: (load (<8 x s64>), addrspace 4)
+    %1:_(<8 x s64>) = G_LOAD %0 :: (load (<8 x s64>) from constant-pool, addrspace 4)
 ...
 
 ---
@@ -726,16 +726,16 @@ body: |
     ; GFX7: liveins: $sgpr0_sgpr1
     ; GFX7-NEXT: {{  $}}
     ; GFX7-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s8), addrspace 4)
+    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s8) from constant-pool, addrspace 4)
     ; GFX7-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     ;
     ; GFX12-LABEL: name: extload_constant_i8_to_i32_uniform
     ; GFX12: liveins: $sgpr0_sgpr1
     ; GFX12-NEXT: {{  $}}
     ; GFX12-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s8), addrspace 4)
+    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s8) from constant-pool, addrspace 4)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_LOAD %0 :: (load (s8), addrspace 4, align 1)
+    %1:_(s32) = G_LOAD %0 :: (load (s8) from constant-pool, addrspace 4, align 1)
 ...
 
 ---
@@ -751,10 +751,10 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s8), addrspace 1)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s8) from constant-pool, addrspace 1)
     ; GCN-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_LOAD %0 :: (load (s8), addrspace 1, align 1)
+    %1:_(s32) = G_LOAD %0 :: (load (s8) from constant-pool, addrspace 1, align 1)
 ...
 
 ---
@@ -770,16 +770,16 @@ body: |
     ; GFX7: liveins: $sgpr0_sgpr1
     ; GFX7-NEXT: {{  $}}
     ; GFX7-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s16), addrspace 4)
+    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s16) from constant-pool, addrspace 4)
     ; GFX7-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     ;
     ; GFX12-LABEL: name: extload_constant_i16_to_i32_uniform
     ; GFX12: liveins: $sgpr0_sgpr1
     ; GFX12-NEXT: {{  $}}
     ; GFX12-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s16), addrspace 4)
+    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s16) from constant-pool, addrspace 4)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_LOAD %0 :: (load (s16), addrspace 4, align 2)
+    %1:_(s32) = G_LOAD %0 :: (load (s16) from constant-pool, addrspace 4, align 2)
 ...
 
 ---
@@ -795,10 +795,10 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s16), addrspace 1)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s16) from constant-pool, addrspace 1)
     ; GCN-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_LOAD %0 :: (load (s16), addrspace 1, align 2)
+    %1:_(s32) = G_LOAD %0 :: (load (s16) from constant-pool, addrspace 1, align 2)
 ...
 
 ---
@@ -813,9 +813,9 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s32), addrspace 4)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s32) from constant-pool, addrspace 4)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_LOAD %0 :: (load (s32), addrspace 4, align 4)
+    %1:_(s32) = G_LOAD %0 :: (load (s32) from constant-pool, addrspace 4, align 4)
 ...
 
 ---
@@ -831,10 +831,10 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s32), align 2, addrspace 4)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s32) from constant-pool, align 2, addrspace 4)
     ; GCN-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_LOAD %0 :: (load (s32), addrspace 4, align 2)
+    %1:_(s32) = G_LOAD %0 :: (load (s32) from constant-pool, addrspace 4, align 2)
 ...
 
 ---
@@ -850,10 +850,10 @@ body: |
     ; GCN: liveins: $sgpr0_sgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s32), align 1, addrspace 4)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p4) :: (load (s32) from constant-pool, align 1, addrspace 4)
     ; GCN-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_LOAD %0 :: (load (s32), addrspace 4, align 1)
+    %1:_(s32) = G_LOAD %0 :: (load (s32) from constant-pool, addrspace 4, align 1)
 ...
 
 ---
@@ -888,13 +888,13 @@ body: |
     ; GCN: liveins: $vgpr0_vgpr1
     ; GCN-NEXT: {{  $}}
     ; GCN-NEXT: [[COPY:%[0-9]+]]:vgpr(p4) = COPY $vgpr0_vgpr1
-    ; GCN-NEXT: [[LOAD:%[0-9]+]]:vgpr(<4 x s32>) = G_LOAD [[COPY]](p4) :: (load (<4 x s32>), align 32, addrspace 4)
+    ; GCN-NEXT: [[LOAD:%[0-9]+]]:vgpr(<4 x s32>) = G_LOAD [[COPY]](p4) :: (load (<4 x s32>) from constant-pool, align 32, addrspace 4)
     ; GCN-NEXT: [[C:%[0-9]+]]:vgpr(s64) = G_CONSTANT i64 16
     ; GCN-NEXT: [[PTR_ADD:%[0-9]+]]:vgpr(p4) = nuw inbounds G_PTR_ADD [[COPY]], [[C]](s64)
-    ; GCN-NEXT: [[LOAD1:%[0-9]+]]:vgpr(<4 x s32>) = G_LOAD [[PTR_ADD]](p4) :: (load (<4 x s32>) from unknown-address + 16, addrspace 4)
+    ; GCN-NEXT: [[LOAD1:%[0-9]+]]:vgpr(<4 x s32>) = G_LOAD [[PTR_ADD]](p4) :: (load (<4 x s32>) from constant-pool + 16, basealign 32, addrspace 4)
     ; GCN-NEXT: [[CONCAT_VECTORS:%[0-9]+]]:vgpr(<8 x s32>) = G_CONCAT_VECTORS [[LOAD]](<4 x s32>), [[LOAD1]](<4 x s32>)
     %0:_(p4) = COPY $vgpr0_vgpr1
-    %1:_(<8 x s32>) = G_LOAD %0 :: (load (<8 x s32>), addrspace 4)
+    %1:_(<8 x s32>) = G_LOAD %0 :: (load (<8 x s32>) from constant-pool, addrspace 4)
 ...
 
 ---
@@ -916,10 +916,10 @@ body: |
   ; GCN-NEXT:   successors: %bb.1(0x80000000)
   ; GCN-NEXT: {{  $}}
   ; GCN-NEXT:   [[PHI:%[0-9]+]]:vgpr(p4) = G_PHI [[COPY]](p4), %bb.0, %3(p4), %bb.1
-  ; GCN-NEXT:   [[LOAD:%[0-9]+]]:vgpr(<4 x s32>) = G_LOAD [[PHI]](p4) :: (load (<4 x s32>), align 32, addrspace 4)
+  ; GCN-NEXT:   [[LOAD:%[0-9]+]]:vgpr(<4 x s32>) = G_LOAD [[PHI]](p4) :: (load (<4 x s32>) from constant-pool, align 32, addrspace 4)
   ; GCN-NEXT:   [[C:%[0-9]+]]:vgpr(s64) = G_CONSTANT i64 16
   ; GCN-NEXT:   [[PTR_ADD:%[0-9]+]]:vgpr(p4) = nuw inbounds G_PTR_ADD [[PHI]], [[C]](s64)
-  ; GCN-NEXT:   [[LOAD1:%[0-9]+]]:vgpr(<4 x s32>) = G_LOAD [[PTR_ADD]](p4) :: (load (<4 x s32>) from unknown-address + 16, addrspace 4)
+  ; GCN-NEXT:   [[LOAD1:%[0-9]+]]:vgpr(<4 x s32>) = G_LOAD [[PTR_ADD]](p4) :: (load (<4 x s32>) from constant-pool + 16, basealign 32, addrspace 4)
   ; GCN-NEXT:   [[CONCAT_VECTORS:%[0-9]+]]:vgpr(<8 x s32>) = G_CONCAT_VECTORS [[LOAD]](<4 x s32>), [[LOAD1]](<4 x s32>)
   ; GCN-NEXT:   [[COPY2:%[0-9]+]]:vgpr(p4) = COPY [[COPY1]](p4)
   ; GCN-NEXT:   G_BR %bb.1
@@ -933,7 +933,7 @@ body: |
 
   bb.1:
     %2:_(p4) = G_PHI %0, %bb.0, %4, %bb.1
-    %3:_(<8 x s32>) = G_LOAD %2 :: (load (<8 x s32>), addrspace 4)
+    %3:_(<8 x s32>) = G_LOAD %2 :: (load (<8 x s32>) from constant-pool, addrspace 4)
     %4:_(p4) = COPY %1
     G_BR %bb.1
 ...
@@ -950,10 +950,10 @@ body: |
     ; GFX7: liveins: $sgpr0_sgpr1
     ; GFX7-NEXT: {{  $}}
     ; GFX7-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<2 x s32>) = G_LOAD [[COPY]](p4) :: (invariant load (<2 x s32>), align 4, addrspace 4)
+    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<2 x s32>) = G_LOAD [[COPY]](p4) :: (invariant load (<2 x s32>) from constant-pool, align 4, addrspace 4)
     ; GFX7-NEXT: [[C:%[0-9]+]]:sgpr(s64) = G_CONSTANT i64 8
     ; GFX7-NEXT: [[PTR_ADD:%[0-9]+]]:sgpr(p4) = nuw inbounds G_PTR_ADD [[COPY]], [[C]](s64)
-    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(s32) = G_LOAD [[PTR_ADD]](p4) :: (invariant load (s32) from unknown-address + 8, addrspace 4)
+    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(s32) = G_LOAD [[PTR_ADD]](p4) :: (invariant load (s32) from constant-pool + 8, addrspace 4)
     ; GFX7-NEXT: [[UV:%[0-9]+]]:sgpr(s32), [[UV1:%[0-9]+]]:sgpr(s32) = G_UNMERGE_VALUES [[LOAD]](<2 x s32>)
     ; GFX7-NEXT: [[BUILD_VECTOR:%[0-9]+]]:sgpr(<3 x s32>) = G_BUILD_VECTOR [[UV]](s32), [[UV1]](s32), [[LOAD1]](s32)
     ; GFX7-NEXT: S_ENDPGM 0, implicit [[BUILD_VECTOR]](<3 x s32>)
@@ -962,10 +962,10 @@ body: |
     ; GFX12: liveins: $sgpr0_sgpr1
     ; GFX12-NEXT: {{  $}}
     ; GFX12-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<3 x s32>) = G_LOAD [[COPY]](p4) :: (invariant load (<3 x s32>), align 4, addrspace 4)
+    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<3 x s32>) = G_LOAD [[COPY]](p4) :: (invariant load (<3 x s32>) from constant-pool, align 4, addrspace 4)
     ; GFX12-NEXT: S_ENDPGM 0, implicit [[LOAD]](<3 x s32>)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(<3 x s32>) = G_LOAD %0 :: (invariant load (<3 x s32>), addrspace 4, align 4)
+    %1:_(<3 x s32>) = G_LOAD %0 :: (invariant load (<3 x s32>) from constant-pool, addrspace 4, align 4)
     S_ENDPGM 0, implicit %1
 ...
 
@@ -981,10 +981,10 @@ body: |
     ; GFX7: liveins: $sgpr0_sgpr1
     ; GFX7-NEXT: {{  $}}
     ; GFX7-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<2 x s32>) = G_LOAD [[COPY]](p4) :: (invariant load (<2 x s32>), addrspace 4)
+    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<2 x s32>) = G_LOAD [[COPY]](p4) :: (invariant load (<2 x s32>) from constant-pool, addrspace 4)
     ; GFX7-NEXT: [[C:%[0-9]+]]:sgpr(s64) = G_CONSTANT i64 8
     ; GFX7-NEXT: [[PTR_ADD:%[0-9]+]]:sgpr(p4) = nuw inbounds G_PTR_ADD [[COPY]], [[C]](s64)
-    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(s32) = G_LOAD [[PTR_ADD]](p4) :: (invariant load (s32) from unknown-address + 8, align 8, addrspace 4)
+    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(s32) = G_LOAD [[PTR_ADD]](p4) :: (invariant load (s32) from constant-pool + 8, align 8, addrspace 4)
     ; GFX7-NEXT: [[UV:%[0-9]+]]:sgpr(s32), [[UV1:%[0-9]+]]:sgpr(s32) = G_UNMERGE_VALUES [[LOAD]](<2 x s32>)
     ; GFX7-NEXT: [[BUILD_VECTOR:%[0-9]+]]:sgpr(<3 x s32>) = G_BUILD_VECTOR [[UV]](s32), [[UV1]](s32), [[LOAD1]](s32)
     ; GFX7-NEXT: S_ENDPGM 0, implicit [[BUILD_VECTOR]](<3 x s32>)
@@ -993,10 +993,10 @@ body: |
     ; GFX12: liveins: $sgpr0_sgpr1
     ; GFX12-NEXT: {{  $}}
     ; GFX12-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<3 x s32>) = G_LOAD [[COPY]](p4) :: (invariant load (<3 x s32>), align 8, addrspace 4)
+    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<3 x s32>) = G_LOAD [[COPY]](p4) :: (invariant load (<3 x s32>) from constant-pool, align 8, addrspace 4)
     ; GFX12-NEXT: S_ENDPGM 0, implicit [[LOAD]](<3 x s32>)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(<3 x s32>) = G_LOAD %0 :: (invariant load (<3 x s32>), addrspace 4, align 8)
+    %1:_(<3 x s32>) = G_LOAD %0 :: (invariant load (<3 x s32>) from constant-pool, addrspace 4, align 8)
     S_ENDPGM 0, implicit %1
 ...
 
@@ -1012,7 +1012,7 @@ body: |
     ; GFX7: liveins: $sgpr0_sgpr1
     ; GFX7-NEXT: {{  $}}
     ; GFX7-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<4 x s32>) = G_LOAD [[COPY]](p4) :: (invariant load (<4 x s32>), addrspace 4)
+    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<4 x s32>) = G_LOAD [[COPY]](p4) :: (invariant load (<4 x s32>) from constant-pool, addrspace 4)
     ; GFX7-NEXT: [[UV:%[0-9]+]]:sgpr(s32), [[UV1:%[0-9]+]]:sgpr(s32), [[UV2:%[0-9]+]]:sgpr(s32), [[UV3:%[0-9]+]]:sgpr(s32) = G_UNMERGE_VALUES [[LOAD]](<4 x s32>)
     ; GFX7-NEXT: [[BUILD_VECTOR:%[0-9]+]]:sgpr(<3 x s32>) = G_BUILD_VECTOR [[UV]](s32), [[UV1]](s32), [[UV2]](s32)
     ; GFX7-NEXT: S_ENDPGM 0, implicit [[BUILD_VECTOR]](<3 x s32>)
@@ -1021,10 +1021,10 @@ body: |
     ; GFX12: liveins: $sgpr0_sgpr1
     ; GFX12-NEXT: {{  $}}
     ; GFX12-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<3 x s32>) = G_LOAD [[COPY]](p4) :: (invariant load (<3 x s32>), align 16, addrspace 4)
+    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<3 x s32>) = G_LOAD [[COPY]](p4) :: (invariant load (<3 x s32>) from constant-pool, align 16, addrspace 4)
     ; GFX12-NEXT: S_ENDPGM 0, implicit [[LOAD]](<3 x s32>)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(<3 x s32>) = G_LOAD %0 :: (invariant load (<3 x s32>), addrspace 4, align 16)
+    %1:_(<3 x s32>) = G_LOAD %0 :: (invariant load (<3 x s32>) from constant-pool, addrspace 4, align 16)
     S_ENDPGM 0, implicit %1
 ...
 
@@ -1040,10 +1040,10 @@ body: |
     ; GFX7: liveins: $sgpr0_sgpr1
     ; GFX7-NEXT: {{  $}}
     ; GFX7-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<4 x s16>) = G_LOAD [[COPY]](p4) :: (invariant load (<4 x s16>), align 4, addrspace 4)
+    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<4 x s16>) = G_LOAD [[COPY]](p4) :: (invariant load (<4 x s16>) from constant-pool, align 4, addrspace 4)
     ; GFX7-NEXT: [[C:%[0-9]+]]:sgpr(s64) = G_CONSTANT i64 8
     ; GFX7-NEXT: [[PTR_ADD:%[0-9]+]]:sgpr(p4) = nuw inbounds G_PTR_ADD [[COPY]], [[C]](s64)
-    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(<2 x s16>) = G_LOAD [[PTR_ADD]](p4) :: (invariant load (<2 x s16>) from unknown-address + 8, addrspace 4)
+    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(<2 x s16>) = G_LOAD [[PTR_ADD]](p4) :: (invariant load (<2 x s16>) from constant-pool + 8, addrspace 4)
     ; GFX7-NEXT: [[UV:%[0-9]+]]:sgpr(<2 x s16>), [[UV1:%[0-9]+]]:sgpr(<2 x s16>) = G_UNMERGE_VALUES [[LOAD]](<4 x s16>)
     ; GFX7-NEXT: [[CONCAT_VECTORS:%[0-9]+]]:sgpr(<6 x s16>) = G_CONCAT_VECTORS [[UV]](<2 x s16>), [[UV1]](<2 x s16>), [[LOAD1]](<2 x s16>)
     ; GFX7-NEXT: S_ENDPGM 0, implicit [[CONCAT_VECTORS]](<6 x s16>)
@@ -1052,10 +1052,10 @@ body: |
     ; GFX12: liveins: $sgpr0_sgpr1
     ; GFX12-NEXT: {{  $}}
     ; GFX12-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<6 x s16>) = G_LOAD [[COPY]](p4) :: (invariant load (<6 x s16>), align 4, addrspace 4)
+    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<6 x s16>) = G_LOAD [[COPY]](p4) :: (invariant load (<6 x s16>) from constant-pool, align 4, addrspace 4)
     ; GFX12-NEXT: S_ENDPGM 0, implicit [[LOAD]](<6 x s16>)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(<6 x s16>) = G_LOAD %0 :: (invariant load (<6 x s16>), addrspace 4, align 4)
+    %1:_(<6 x s16>) = G_LOAD %0 :: (invariant load (<6 x s16>) from constant-pool, addrspace 4, align 4)
     S_ENDPGM 0, implicit %1
 ...
 
@@ -1071,10 +1071,10 @@ body: |
     ; GFX7: liveins: $sgpr0_sgpr1
     ; GFX7-NEXT: {{  $}}
     ; GFX7-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<4 x s16>) = G_LOAD [[COPY]](p4) :: (invariant load (<4 x s16>), addrspace 4)
+    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<4 x s16>) = G_LOAD [[COPY]](p4) :: (invariant load (<4 x s16>) from constant-pool, addrspace 4)
     ; GFX7-NEXT: [[C:%[0-9]+]]:sgpr(s64) = G_CONSTANT i64 8
     ; GFX7-NEXT: [[PTR_ADD:%[0-9]+]]:sgpr(p4) = nuw inbounds G_PTR_ADD [[COPY]], [[C]](s64)
-    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(<2 x s16>) = G_LOAD [[PTR_ADD]](p4) :: (invariant load (<2 x s16>) from unknown-address + 8, align 8, addrspace 4)
+    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(<2 x s16>) = G_LOAD [[PTR_ADD]](p4) :: (invariant load (<2 x s16>) from constant-pool + 8, align 8, addrspace 4)
     ; GFX7-NEXT: [[UV:%[0-9]+]]:sgpr(<2 x s16>), [[UV1:%[0-9]+]]:sgpr(<2 x s16>) = G_UNMERGE_VALUES [[LOAD]](<4 x s16>)
     ; GFX7-NEXT: [[CONCAT_VECTORS:%[0-9]+]]:sgpr(<6 x s16>) = G_CONCAT_VECTORS [[UV]](<2 x s16>), [[UV1]](<2 x s16>), [[LOAD1]](<2 x s16>)
     ; GFX7-NEXT: S_ENDPGM 0, implicit [[CONCAT_VECTORS]](<6 x s16>)
@@ -1083,10 +1083,10 @@ body: |
     ; GFX12: liveins: $sgpr0_sgpr1
     ; GFX12-NEXT: {{  $}}
     ; GFX12-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<6 x s16>) = G_LOAD [[COPY]](p4) :: (invariant load (<6 x s16>), align 8, addrspace 4)
+    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<6 x s16>) = G_LOAD [[COPY]](p4) :: (invariant load (<6 x s16>) from constant-pool, align 8, addrspace 4)
     ; GFX12-NEXT: S_ENDPGM 0, implicit [[LOAD]](<6 x s16>)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(<6 x s16>) = G_LOAD %0 :: (invariant load (<6 x s16>), addrspace 4, align 8)
+    %1:_(<6 x s16>) = G_LOAD %0 :: (invariant load (<6 x s16>) from constant-pool, addrspace 4, align 8)
     S_ENDPGM 0, implicit %1
 ...
 
@@ -1102,7 +1102,7 @@ body: |
     ; GFX7: liveins: $sgpr0_sgpr1
     ; GFX7-NEXT: {{  $}}
     ; GFX7-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<8 x s16>) = G_LOAD [[COPY]](p4) :: (invariant load (<8 x s16>), addrspace 4)
+    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<8 x s16>) = G_LOAD [[COPY]](p4) :: (invariant load (<8 x s16>) from constant-pool, addrspace 4)
     ; GFX7-NEXT: [[UV:%[0-9]+]]:sgpr(<2 x s16>), [[UV1:%[0-9]+]]:sgpr(<2 x s16>), [[UV2:%[0-9]+]]:sgpr(<2 x s16>), [[UV3:%[0-9]+]]:sgpr(<2 x s16>) = G_UNMERGE_VALUES [[LOAD]](<8 x s16>)
     ; GFX7-NEXT: [[CONCAT_VECTORS:%[0-9]+]]:sgpr(<6 x s16>) = G_CONCAT_VECTORS [[UV]](<2 x s16>), [[UV1]](<2 x s16>), [[UV2]](<2 x s16>)
     ; GFX7-NEXT: S_ENDPGM 0, implicit [[CONCAT_VECTORS]](<6 x s16>)
@@ -1111,10 +1111,10 @@ body: |
     ; GFX12: liveins: $sgpr0_sgpr1
     ; GFX12-NEXT: {{  $}}
     ; GFX12-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<6 x s16>) = G_LOAD [[COPY]](p4) :: (invariant load (<6 x s16>), align 16, addrspace 4)
+    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<6 x s16>) = G_LOAD [[COPY]](p4) :: (invariant load (<6 x s16>) from constant-pool, align 16, addrspace 4)
     ; GFX12-NEXT: S_ENDPGM 0, implicit [[LOAD]](<6 x s16>)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(<6 x s16>) = G_LOAD %0 :: (invariant load (<6 x s16>), addrspace 4, align 16)
+    %1:_(<6 x s16>) = G_LOAD %0 :: (invariant load (<6 x s16>) from constant-pool, addrspace 4, align 16)
     S_ENDPGM 0, implicit %1
 ...
 
@@ -1130,10 +1130,10 @@ body: |
     ; GFX7: liveins: $sgpr0_sgpr1
     ; GFX7-NEXT: {{  $}}
     ; GFX7-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(s64) = G_LOAD [[COPY]](p4) :: (invariant load (s64), align 4, addrspace 4)
+    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(s64) = G_LOAD [[COPY]](p4) :: (invariant load (s64) from constant-pool, align 4, addrspace 4)
     ; GFX7-NEXT: [[C:%[0-9]+]]:sgpr(s64) = G_CONSTANT i64 8
     ; GFX7-NEXT: [[PTR_ADD:%[0-9]+]]:sgpr(p4) = nuw inbounds G_PTR_ADD [[COPY]], [[C]](s64)
-    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(s32) = G_LOAD [[PTR_ADD]](p4) :: (invariant load (s32) from unknown-address + 8, addrspace 4)
+    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(s32) = G_LOAD [[PTR_ADD]](p4) :: (invariant load (s32) from constant-pool + 8, addrspace 4)
     ; GFX7-NEXT: [[UV:%[0-9]+]]:sgpr(s32), [[UV1:%[0-9]+]]:sgpr(s32) = G_UNMERGE_VALUES [[LOAD]](s64)
     ; GFX7-NEXT: [[MV:%[0-9]+]]:sgpr(s96) = G_MERGE_VALUES [[UV]](s32), [[UV1]](s32), [[LOAD1]](s32)
     ; GFX7-NEXT: S_ENDPGM 0, implicit [[MV]](s96)
@@ -1142,10 +1142,10 @@ body: |
     ; GFX12: liveins: $sgpr0_sgpr1
     ; GFX12-NEXT: {{  $}}
     ; GFX12-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(s96) = G_LOAD [[COPY]](p4) :: (invariant load (s96), align 4, addrspace 4)
+    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(s96) = G_LOAD [[COPY]](p4) :: (invariant load (s96) from constant-pool, align 4, addrspace 4)
     ; GFX12-NEXT: S_ENDPGM 0, implicit [[LOAD]](s96)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(s96) = G_LOAD %0 :: (invariant load (s96), addrspace 4, align 4)
+    %1:_(s96) = G_LOAD %0 :: (invariant load (s96) from constant-pool, addrspace 4, align 4)
     S_ENDPGM 0, implicit %1
 ...
 
@@ -1161,10 +1161,10 @@ body: |
     ; GFX7: liveins: $sgpr0_sgpr1
     ; GFX7-NEXT: {{  $}}
     ; GFX7-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(s64) = G_LOAD [[COPY]](p4) :: (invariant load (s64), addrspace 4)
+    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(s64) = G_LOAD [[COPY]](p4) :: (invariant load (s64) from constant-pool, addrspace 4)
     ; GFX7-NEXT: [[C:%[0-9]+]]:sgpr(s64) = G_CONSTANT i64 8
     ; GFX7-NEXT: [[PTR_ADD:%[0-9]+]]:sgpr(p4) = nuw inbounds G_PTR_ADD [[COPY]], [[C]](s64)
-    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(s32) = G_LOAD [[PTR_ADD]](p4) :: (invariant load (s32) from unknown-address + 8, align 8, addrspace 4)
+    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(s32) = G_LOAD [[PTR_ADD]](p4) :: (invariant load (s32) from constant-pool + 8, align 8, addrspace 4)
     ; GFX7-NEXT: [[UV:%[0-9]+]]:sgpr(s32), [[UV1:%[0-9]+]]:sgpr(s32) = G_UNMERGE_VALUES [[LOAD]](s64)
     ; GFX7-NEXT: [[MV:%[0-9]+]]:sgpr(s96) = G_MERGE_VALUES [[UV]](s32), [[UV1]](s32), [[LOAD1]](s32)
     ; GFX7-NEXT: S_ENDPGM 0, implicit [[MV]](s96)
@@ -1173,10 +1173,10 @@ body: |
     ; GFX12: liveins: $sgpr0_sgpr1
     ; GFX12-NEXT: {{  $}}
     ; GFX12-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(s96) = G_LOAD [[COPY]](p4) :: (invariant load (s96), align 8, addrspace 4)
+    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(s96) = G_LOAD [[COPY]](p4) :: (invariant load (s96) from constant-pool, align 8, addrspace 4)
     ; GFX12-NEXT: S_ENDPGM 0, implicit [[LOAD]](s96)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(s96) = G_LOAD %0 :: (invariant load (s96), addrspace 4, align 8)
+    %1:_(s96) = G_LOAD %0 :: (invariant load (s96) from constant-pool, addrspace 4, align 8)
     S_ENDPGM 0, implicit %1
 ...
 
@@ -1192,7 +1192,7 @@ body: |
     ; GFX7: liveins: $sgpr0_sgpr1
     ; GFX7-NEXT: {{  $}}
     ; GFX7-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(s128) = G_LOAD [[COPY]](p4) :: (invariant load (s128), addrspace 4)
+    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(s128) = G_LOAD [[COPY]](p4) :: (invariant load (s128) from constant-pool, addrspace 4)
     ; GFX7-NEXT: [[TRUNC:%[0-9]+]]:sgpr(s96) = G_TRUNC [[LOAD]](s128)
     ; GFX7-NEXT: S_ENDPGM 0, implicit [[TRUNC]](s96)
     ;
@@ -1200,9 +1200,9 @@ body: |
     ; GFX12: liveins: $sgpr0_sgpr1
     ; GFX12-NEXT: {{  $}}
     ; GFX12-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(s96) = G_LOAD [[COPY]](p4) :: (invariant load (s96), align 16, addrspace 4)
+    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(s96) = G_LOAD [[COPY]](p4) :: (invariant load (s96) from constant-pool, align 16, addrspace 4)
     ; GFX12-NEXT: S_ENDPGM 0, implicit [[LOAD]](s96)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(s96) = G_LOAD %0 :: (invariant load (s96), addrspace 4, align 16)
+    %1:_(s96) = G_LOAD %0 :: (invariant load (s96) from constant-pool, addrspace 4, align 16)
     S_ENDPGM 0, implicit %1
 ...
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-split-scalar-load-metadata.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-split-scalar-load-metadata.mir
index b2ff0995ce578..cdc673ea6804f 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-split-scalar-load-metadata.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-split-scalar-load-metadata.mir
@@ -35,10 +35,10 @@ body: |
     ; GFX7: liveins: $sgpr0_sgpr1
     ; GFX7-NEXT: {{  $}}
     ; GFX7-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<2 x s32>) = G_LOAD [[COPY]](p4) :: (load (<2 x s32>), addrspace 4)
+    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<2 x s32>) = G_LOAD [[COPY]](p4) :: (load (<2 x s32>) from constant-pool, addrspace 4)
     ; GFX7-NEXT: [[C:%[0-9]+]]:sgpr(s64) = G_CONSTANT i64 8
     ; GFX7-NEXT: [[PTR_ADD:%[0-9]+]]:sgpr(p4) = nuw inbounds G_PTR_ADD [[COPY]], [[C]](s64)
-    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(s32) = G_LOAD [[PTR_ADD]](p4) :: (load (s32) from unknown-address + 8, align 8, addrspace 4)
+    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(s32) = G_LOAD [[PTR_ADD]](p4) :: (load (s32) from constant-pool + 8, align 8, addrspace 4)
     ; GFX7-NEXT: [[UV:%[0-9]+]]:sgpr(s32), [[UV1:%[0-9]+]]:sgpr(s32) = G_UNMERGE_VALUES [[LOAD]](<2 x s32>)
     ; GFX7-NEXT: [[BUILD_VECTOR:%[0-9]+]]:sgpr(<3 x s32>) = G_BUILD_VECTOR [[UV]](s32), [[UV1]](s32), [[LOAD1]](s32)
     ; GFX7-NEXT: $sgpr0_sgpr1_sgpr2 = COPY [[BUILD_VECTOR]](<3 x s32>)
@@ -47,10 +47,10 @@ body: |
     ; GFX12: liveins: $sgpr0_sgpr1
     ; GFX12-NEXT: {{  $}}
     ; GFX12-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<3 x s32>) = G_LOAD [[COPY]](p4) :: (load (<3 x s32>), align 8
+    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<3 x s32>) = G_LOAD [[COPY]](p4) :: (load (<3 x s32>) from constant-pool, align 8, !range {{.+}}, addrspace 4)
     ; GFX12-NEXT: $sgpr0_sgpr1_sgpr2 = COPY [[LOAD]](<3 x s32>)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(<3 x s32>) = G_LOAD %0 :: (load (<3 x s32>), align 8, addrspace 4, !range !3)
+    %1:_(<3 x s32>) = G_LOAD %0 :: (load (<3 x s32>) from constant-pool, align 8, addrspace 4, !range !3)
     $sgpr0_sgpr1_sgpr2 = COPY %1
 
 ...
@@ -66,10 +66,10 @@ body: |
     ; GFX7: liveins: $sgpr0_sgpr1
     ; GFX7-NEXT: {{  $}}
     ; GFX7-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<2 x s32>) = G_LOAD [[COPY]](p4) :: (load (<2 x s32>), !tbaa !2, addrspace 4)
+    ; GFX7-NEXT: [[LOAD:%[0-9]+]]:sgpr(<2 x s32>) = G_LOAD [[COPY]](p4) :: (load (<2 x s32>) from constant-pool, !tbaa !2, addrspace 4)
     ; GFX7-NEXT: [[C:%[0-9]+]]:sgpr(s64) = G_CONSTANT i64 8
     ; GFX7-NEXT: [[PTR_ADD:%[0-9]+]]:sgpr(p4) = nuw inbounds G_PTR_ADD [[COPY]], [[C]](s64)
-    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(s32) = G_LOAD [[PTR_ADD]](p4) :: (load (s32) from unknown-address + 8, align 8, !tbaa !2, addrspace 4)
+    ; GFX7-NEXT: [[LOAD1:%[0-9]+]]:sgpr(s32) = G_LOAD [[PTR_ADD]](p4) :: (load (s32) from constant-pool + 8, align 8, !tbaa !2, addrspace 4)
     ; GFX7-NEXT: [[UV:%[0-9]+]]:sgpr(s32), [[UV1:%[0-9]+]]:sgpr(s32) = G_UNMERGE_VALUES [[LOAD]](<2 x s32>)
     ; GFX7-NEXT: [[BUILD_VECTOR:%[0-9]+]]:sgpr(<3 x s32>) = G_BUILD_VECTOR [[UV]](s32), [[UV1]](s32), [[LOAD1]](s32)
     ; GFX7-NEXT: $sgpr0_sgpr1_sgpr2 = COPY [[BUILD_VECTOR]](<3 x s32>)
@@ -78,10 +78,10 @@ body: |
     ; GFX12: liveins: $sgpr0_sgpr1
     ; GFX12-NEXT: {{  $}}
     ; GFX12-NEXT: [[COPY:%[0-9]+]]:sgpr(p4) = COPY $sgpr0_sgpr1
-    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<3 x s32>) = G_LOAD [[COPY]](p4) :: (load (<3 x s32>), align 8, !tbaa !2, addrspace 4)
+    ; GFX12-NEXT: [[LOAD:%[0-9]+]]:sgpr(<3 x s32>) = G_LOAD [[COPY]](p4) :: (load (<3 x s32>) from constant-pool, align 8, !tbaa !2, addrspace 4)
     ; GFX12-NEXT: $sgpr0_sgpr1_sgpr2 = COPY [[LOAD]](<3 x s32>)
     %0:_(p4) = COPY $sgpr0_sgpr1
-    %1:_(<3 x s32>) = G_LOAD %0 :: (load (<3 x s32>), align 8, addrspace 4, !tbaa !1)
+    %1:_(<3 x s32>) = G_LOAD %0 :: (load (<3 x s32>) from constant-pool, align 8, addrspace 4, !tbaa !1)
     $sgpr0_sgpr1_sgpr2 = COPY %1
 
 ...
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-widen-scalar-loads.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-widen-scalar-loads.mir
index 7838e979befef..70a2ddffe87bb 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-widen-scalar-loads.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-widen-scalar-loads.mir
@@ -14,24 +14,24 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), align 8, addrspace 4)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, align 8, addrspace 4)
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
     ;
     ; GFX9-LABEL: name: constant_load_i8_align8
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), align 8, addrspace 4)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, align 8, addrspace 4)
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
     ;
     ; GFX10-LABEL: name: constant_load_i8_align8
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), align 8, addrspace 4)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, align 8, addrspace 4)
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
    %0:_(p1) = COPY $sgpr0_sgpr1
-   %1:_(s32) = G_LOAD %0 :: (invariant load (s8), align 8, addrspace 4)
+   %1:_(s32) = G_LOAD %0 :: (invariant load (s8) from constant-pool, align 8, addrspace 4)
    S_ENDPGM 0, implicit %1
 ...
 ---
@@ -45,24 +45,24 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
     ;
     ; GFX9-LABEL: name: constant_load_i8_align4
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
     ;
     ; GFX10-LABEL: name: constant_load_i8_align4
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
    %0:_(p1) = COPY $sgpr0_sgpr1
-   %1:_(s32) = G_LOAD %0 :: (invariant load (s8), align 4, addrspace 4)
+   %1:_(s32) = G_LOAD %0 :: (invariant load (s8) from constant-pool, align 4, addrspace 4)
    S_ENDPGM 0, implicit %1
 ...
 ---
@@ -76,24 +76,24 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
     ;
     ; GFX9-LABEL: name: constant_load_i16_align4
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
     ;
     ; GFX10-LABEL: name: constant_load_i16_align4
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
    %0:_(p1) = COPY $sgpr0_sgpr1
-   %1:_(s32) = G_LOAD %0 :: (invariant load (s16), align 4, addrspace 4)
+   %1:_(s32) = G_LOAD %0 :: (invariant load (s16) from constant-pool, align 4, addrspace 4)
    S_ENDPGM 0, implicit %1
 ...
 ---
@@ -107,7 +107,7 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX8-NEXT: [[SEXT_INREG:%[0-9]+]]:sgpr(s32) = G_SEXT_INREG [[LOAD]], 8
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[SEXT_INREG]](s32)
     ;
@@ -115,7 +115,7 @@ body: |
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX9-NEXT: [[SEXT_INREG:%[0-9]+]]:sgpr(s32) = G_SEXT_INREG [[LOAD]], 8
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[SEXT_INREG]](s32)
     ;
@@ -123,11 +123,11 @@ body: |
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX10-NEXT: [[SEXT_INREG:%[0-9]+]]:sgpr(s32) = G_SEXT_INREG [[LOAD]], 8
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[SEXT_INREG]](s32)
    %0:_(p1) = COPY $sgpr0_sgpr1
-   %1:_(s32) = G_SEXTLOAD %0 :: (invariant load (s8), align 4, addrspace 4)
+   %1:_(s32) = G_SEXTLOAD %0 :: (invariant load (s8) from constant-pool, align 4, addrspace 4)
    S_ENDPGM 0, implicit %1
 ...
 ---
@@ -141,7 +141,7 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX8-NEXT: [[SEXT_INREG:%[0-9]+]]:sgpr(s32) = G_SEXT_INREG [[LOAD]], 16
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[SEXT_INREG]](s32)
     ;
@@ -149,7 +149,7 @@ body: |
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX9-NEXT: [[SEXT_INREG:%[0-9]+]]:sgpr(s32) = G_SEXT_INREG [[LOAD]], 16
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[SEXT_INREG]](s32)
     ;
@@ -157,11 +157,11 @@ body: |
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX10-NEXT: [[SEXT_INREG:%[0-9]+]]:sgpr(s32) = G_SEXT_INREG [[LOAD]], 16
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[SEXT_INREG]](s32)
    %0:_(p1) = COPY $sgpr0_sgpr1
-   %1:_(s32) = G_SEXTLOAD %0 :: (invariant load (s16), align 4, addrspace 4)
+   %1:_(s32) = G_SEXTLOAD %0 :: (invariant load (s16) from constant-pool, align 4, addrspace 4)
    S_ENDPGM 0, implicit %1
 ...
 
@@ -176,7 +176,7 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX8-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 255
     ; GFX8-NEXT: [[AND:%[0-9]+]]:sgpr(s32) = G_AND [[LOAD]], [[C]]
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[AND]](s32)
@@ -185,7 +185,7 @@ body: |
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX9-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 255
     ; GFX9-NEXT: [[AND:%[0-9]+]]:sgpr(s32) = G_AND [[LOAD]], [[C]]
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[AND]](s32)
@@ -194,12 +194,12 @@ body: |
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX10-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 255
     ; GFX10-NEXT: [[AND:%[0-9]+]]:sgpr(s32) = G_AND [[LOAD]], [[C]]
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[AND]](s32)
    %0:_(p1) = COPY $sgpr0_sgpr1
-   %1:_(s32) = G_ZEXTLOAD %0 :: (invariant load (s8), align 4, addrspace 4)
+   %1:_(s32) = G_ZEXTLOAD %0 :: (invariant load (s8) from constant-pool, align 4, addrspace 4)
    S_ENDPGM 0, implicit %1
 ...
 ---
@@ -213,7 +213,7 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX8-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 65535
     ; GFX8-NEXT: [[AND:%[0-9]+]]:sgpr(s32) = G_AND [[LOAD]], [[C]]
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[AND]](s32)
@@ -222,7 +222,7 @@ body: |
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX9-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 65535
     ; GFX9-NEXT: [[AND:%[0-9]+]]:sgpr(s32) = G_AND [[LOAD]], [[C]]
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[AND]](s32)
@@ -231,12 +231,12 @@ body: |
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 4)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX10-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 65535
     ; GFX10-NEXT: [[AND:%[0-9]+]]:sgpr(s32) = G_AND [[LOAD]], [[C]]
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[AND]](s32)
    %0:_(p1) = COPY $sgpr0_sgpr1
-   %1:_(s32) = G_ZEXTLOAD %0 :: (invariant load (s16), align 4, addrspace 4)
+   %1:_(s32) = G_ZEXTLOAD %0 :: (invariant load (s16) from constant-pool, align 4, addrspace 4)
    S_ENDPGM 0, implicit %1
 ...
 ---
@@ -250,24 +250,24 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 1)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
     ;
     ; GFX9-LABEL: name: global_load_i8_align4
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 1)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
     ;
     ; GFX10-LABEL: name: global_load_i8_align4
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 1)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
    %0:_(p1) = COPY $sgpr0_sgpr1
-   %1:_(s32) = G_LOAD %0 :: (invariant load (s8), align 4, addrspace 1)
+   %1:_(s32) = G_LOAD %0 :: (invariant load (s8) from constant-pool, align 4, addrspace 1)
    S_ENDPGM 0, implicit %1
 ...
 ---
@@ -281,24 +281,24 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 1)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
     ;
     ; GFX9-LABEL: name: global_load_i16_align4
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 1)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
     ;
     ; GFX10-LABEL: name: global_load_i16_align4
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 1)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
    %0:_(p1) = COPY $sgpr0_sgpr1
-   %1:_(s32) = G_LOAD %0 :: (invariant load (s16), align 4, addrspace 1)
+   %1:_(s32) = G_LOAD %0 :: (invariant load (s16) from constant-pool, align 4, addrspace 1)
    S_ENDPGM 0, implicit %1
 ...
 ---
@@ -312,7 +312,7 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 1)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX8-NEXT: [[SEXT_INREG:%[0-9]+]]:sgpr(s32) = G_SEXT_INREG [[LOAD]], 8
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[SEXT_INREG]](s32)
     ;
@@ -320,7 +320,7 @@ body: |
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 1)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX9-NEXT: [[SEXT_INREG:%[0-9]+]]:sgpr(s32) = G_SEXT_INREG [[LOAD]], 8
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[SEXT_INREG]](s32)
     ;
@@ -328,11 +328,11 @@ body: |
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 1)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX10-NEXT: [[SEXT_INREG:%[0-9]+]]:sgpr(s32) = G_SEXT_INREG [[LOAD]], 8
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[SEXT_INREG]](s32)
    %0:_(p1) = COPY $sgpr0_sgpr1
-   %1:_(s32) = G_SEXTLOAD %0 :: (invariant load (s8), align 4, addrspace 1)
+   %1:_(s32) = G_SEXTLOAD %0 :: (invariant load (s8) from constant-pool, align 4, addrspace 1)
    S_ENDPGM 0, implicit %1
 ...
 ---
@@ -346,7 +346,7 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 1)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX8-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 65535
     ; GFX8-NEXT: [[AND:%[0-9]+]]:sgpr(s32) = G_AND [[LOAD]], [[C]]
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[AND]](s32)
@@ -355,7 +355,7 @@ body: |
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 1)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX9-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 65535
     ; GFX9-NEXT: [[AND:%[0-9]+]]:sgpr(s32) = G_AND [[LOAD]], [[C]]
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[AND]](s32)
@@ -364,12 +364,12 @@ body: |
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32), addrspace 1)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:sgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s32) from constant-pool, addrspace 4)
     ; GFX10-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 65535
     ; GFX10-NEXT: [[AND:%[0-9]+]]:sgpr(s32) = G_AND [[LOAD]], [[C]]
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[AND]](s32)
    %0:_(p1) = COPY $sgpr0_sgpr1
-   %1:_(s32) = G_ZEXTLOAD %0 :: (invariant load (s16), align 4, addrspace 1)
+   %1:_(s32) = G_ZEXTLOAD %0 :: (invariant load (s16) from constant-pool, align 4, addrspace 1)
    S_ENDPGM 0, implicit %1
 ...
 # Some negative test cases
@@ -383,7 +383,7 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s8), align 2, addrspace 4)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s8) from constant-pool, align 2, addrspace 4)
     ; GFX8-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -391,7 +391,7 @@ body: |
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s8), align 2, addrspace 4)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s8) from constant-pool, align 2, addrspace 4)
     ; GFX9-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -399,11 +399,11 @@ body: |
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s8), align 2, addrspace 4)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s8) from constant-pool, align 2, addrspace 4)
     ; GFX10-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     %0:_(p1) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_LOAD %0 :: (invariant load (s8), align 2, addrspace 4)
+    %1:_(s32) = G_LOAD %0 :: (invariant load (s8) from constant-pool, align 2, addrspace 4)
     S_ENDPGM 0, implicit %1
 ...
 ---
@@ -417,7 +417,7 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s16), addrspace 4)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s16) from constant-pool, addrspace 4)
     ; GFX8-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -425,7 +425,7 @@ body: |
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s16), addrspace 4)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s16) from constant-pool, addrspace 4)
     ; GFX9-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -433,11 +433,11 @@ body: |
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s16), addrspace 4)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (invariant load (s16) from constant-pool, addrspace 4)
     ; GFX10-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     %0:_(p1) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_LOAD %0 :: (invariant load (s16), align 2, addrspace 4)
+    %1:_(s32) = G_LOAD %0 :: (invariant load (s16) from constant-pool, align 2, addrspace 4)
     S_ENDPGM 0, implicit %1
 ...
 ---
@@ -451,7 +451,7 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[SEXTLOAD:%[0-9]+]]:vgpr(s32) = G_SEXTLOAD [[COPY]](p1) :: (invariant load (s8), align 2, addrspace 4)
+    ; GFX8-NEXT: [[SEXTLOAD:%[0-9]+]]:vgpr(s32) = G_SEXTLOAD [[COPY]](p1) :: (invariant load (s8) from constant-pool, align 2, addrspace 4)
     ; GFX8-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[SEXTLOAD]]
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -459,7 +459,7 @@ body: |
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[SEXTLOAD:%[0-9]+]]:vgpr(s32) = G_SEXTLOAD [[COPY]](p1) :: (invariant load (s8), align 2, addrspace 4)
+    ; GFX9-NEXT: [[SEXTLOAD:%[0-9]+]]:vgpr(s32) = G_SEXTLOAD [[COPY]](p1) :: (invariant load (s8) from constant-pool, align 2, addrspace 4)
     ; GFX9-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[SEXTLOAD]]
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -467,11 +467,11 @@ body: |
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[SEXTLOAD:%[0-9]+]]:vgpr(s32) = G_SEXTLOAD [[COPY]](p1) :: (invariant load (s8), align 2, addrspace 4)
+    ; GFX10-NEXT: [[SEXTLOAD:%[0-9]+]]:vgpr(s32) = G_SEXTLOAD [[COPY]](p1) :: (invariant load (s8) from constant-pool, align 2, addrspace 4)
     ; GFX10-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[SEXTLOAD]]
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     %0:_(p1) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_SEXTLOAD %0 :: (invariant load (s8), align 2, addrspace 4)
+    %1:_(s32) = G_SEXTLOAD %0 :: (invariant load (s8) from constant-pool, align 2, addrspace 4)
     S_ENDPGM 0, implicit %1
 ...
 ---
@@ -485,7 +485,7 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[SEXTLOAD:%[0-9]+]]:vgpr(s32) = G_SEXTLOAD [[COPY]](p1) :: (invariant load (s16), addrspace 4)
+    ; GFX8-NEXT: [[SEXTLOAD:%[0-9]+]]:vgpr(s32) = G_SEXTLOAD [[COPY]](p1) :: (invariant load (s16) from constant-pool, addrspace 4)
     ; GFX8-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[SEXTLOAD]]
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -493,7 +493,7 @@ body: |
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[SEXTLOAD:%[0-9]+]]:vgpr(s32) = G_SEXTLOAD [[COPY]](p1) :: (invariant load (s16), addrspace 4)
+    ; GFX9-NEXT: [[SEXTLOAD:%[0-9]+]]:vgpr(s32) = G_SEXTLOAD [[COPY]](p1) :: (invariant load (s16) from constant-pool, addrspace 4)
     ; GFX9-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[SEXTLOAD]]
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -501,11 +501,11 @@ body: |
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[SEXTLOAD:%[0-9]+]]:vgpr(s32) = G_SEXTLOAD [[COPY]](p1) :: (invariant load (s16), addrspace 4)
+    ; GFX10-NEXT: [[SEXTLOAD:%[0-9]+]]:vgpr(s32) = G_SEXTLOAD [[COPY]](p1) :: (invariant load (s16) from constant-pool, addrspace 4)
     ; GFX10-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[SEXTLOAD]]
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     %0:_(p1) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_SEXTLOAD %0 :: (invariant load (s16), align 2, addrspace 4)
+    %1:_(s32) = G_SEXTLOAD %0 :: (invariant load (s16) from constant-pool, align 2, addrspace 4)
     S_ENDPGM 0, implicit %1
 ...
 ---
@@ -519,7 +519,7 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[ZEXTLOAD:%[0-9]+]]:vgpr(s32) = G_ZEXTLOAD [[COPY]](p1) :: (invariant load (s8), align 2, addrspace 4)
+    ; GFX8-NEXT: [[ZEXTLOAD:%[0-9]+]]:vgpr(s32) = G_ZEXTLOAD [[COPY]](p1) :: (invariant load (s8) from constant-pool, align 2, addrspace 4)
     ; GFX8-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[ZEXTLOAD]]
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -527,7 +527,7 @@ body: |
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[ZEXTLOAD:%[0-9]+]]:vgpr(s32) = G_ZEXTLOAD [[COPY]](p1) :: (invariant load (s8), align 2, addrspace 4)
+    ; GFX9-NEXT: [[ZEXTLOAD:%[0-9]+]]:vgpr(s32) = G_ZEXTLOAD [[COPY]](p1) :: (invariant load (s8) from constant-pool, align 2, addrspace 4)
     ; GFX9-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[ZEXTLOAD]]
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -535,11 +535,11 @@ body: |
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[ZEXTLOAD:%[0-9]+]]:vgpr(s32) = G_ZEXTLOAD [[COPY]](p1) :: (invariant load (s8), align 2, addrspace 4)
+    ; GFX10-NEXT: [[ZEXTLOAD:%[0-9]+]]:vgpr(s32) = G_ZEXTLOAD [[COPY]](p1) :: (invariant load (s8) from constant-pool, align 2, addrspace 4)
     ; GFX10-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[ZEXTLOAD]]
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     %0:_(p1) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_ZEXTLOAD %0 :: (invariant load (s8), align 2, addrspace 4)
+    %1:_(s32) = G_ZEXTLOAD %0 :: (invariant load (s8) from constant-pool, align 2, addrspace 4)
     S_ENDPGM 0, implicit %1
 ...
 ---
@@ -553,7 +553,7 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[ZEXTLOAD:%[0-9]+]]:vgpr(s32) = G_ZEXTLOAD [[COPY]](p1) :: (invariant load (s16), addrspace 4)
+    ; GFX8-NEXT: [[ZEXTLOAD:%[0-9]+]]:vgpr(s32) = G_ZEXTLOAD [[COPY]](p1) :: (invariant load (s16) from constant-pool, addrspace 4)
     ; GFX8-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[ZEXTLOAD]]
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -561,7 +561,7 @@ body: |
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[ZEXTLOAD:%[0-9]+]]:vgpr(s32) = G_ZEXTLOAD [[COPY]](p1) :: (invariant load (s16), addrspace 4)
+    ; GFX9-NEXT: [[ZEXTLOAD:%[0-9]+]]:vgpr(s32) = G_ZEXTLOAD [[COPY]](p1) :: (invariant load (s16) from constant-pool, addrspace 4)
     ; GFX9-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[ZEXTLOAD]]
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -569,11 +569,11 @@ body: |
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[ZEXTLOAD:%[0-9]+]]:vgpr(s32) = G_ZEXTLOAD [[COPY]](p1) :: (invariant load (s16), addrspace 4)
+    ; GFX10-NEXT: [[ZEXTLOAD:%[0-9]+]]:vgpr(s32) = G_ZEXTLOAD [[COPY]](p1) :: (invariant load (s16) from constant-pool, addrspace 4)
     ; GFX10-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[ZEXTLOAD]]
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     %0:_(p1) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_ZEXTLOAD %0 :: (invariant load (s16), align 2, addrspace 4)
+    %1:_(s32) = G_ZEXTLOAD %0 :: (invariant load (s16) from constant-pool, align 2, addrspace 4)
     S_ENDPGM 0, implicit %1
 ...
 ---
@@ -587,7 +587,7 @@ body: |
     ; GFX8: liveins: $sgpr0_sgpr1
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (load (s8), align 4, addrspace 3)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (load (s8) from constant-pool, align 4, addrspace 3)
     ; GFX8-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -595,7 +595,7 @@ body: |
     ; GFX9: liveins: $sgpr0_sgpr1
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (load (s8), align 4, addrspace 3)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (load (s8) from constant-pool, align 4, addrspace 3)
     ; GFX9-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     ;
@@ -603,11 +603,11 @@ body: |
     ; GFX10: liveins: $sgpr0_sgpr1
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (load (s8), align 4, addrspace 3)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY]](p1) :: (load (s8) from constant-pool, align 4, addrspace 3)
     ; GFX10-NEXT: [[AMDGPU_READANYLANE:%[0-9]+]]:sgpr(s32) = G_AMDGPU_READANYLANE [[LOAD]]
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[AMDGPU_READANYLANE]](s32)
     %0:_(p1) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_LOAD %0 :: (load (s8), align 4, addrspace 3)
+    %1:_(s32) = G_LOAD %0 :: (load (s8) from constant-pool, align 4, addrspace 3)
     S_ENDPGM 0, implicit %1
 ...
 ---
@@ -622,7 +622,7 @@ body: |
     ; GFX8-NEXT: {{  $}}
     ; GFX8-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
     ; GFX8-NEXT: [[COPY1:%[0-9]+]]:vgpr(p1) = COPY [[COPY]](p1)
-    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY1]](p1) :: (load (s8), align 4, addrspace 5)
+    ; GFX8-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY1]](p1) :: (load (s8) from constant-pool, align 4, addrspace 5)
     ; GFX8-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
     ;
     ; GFX9-LABEL: name: private_load_i8_align4
@@ -630,7 +630,7 @@ body: |
     ; GFX9-NEXT: {{  $}}
     ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
     ; GFX9-NEXT: [[COPY1:%[0-9]+]]:vgpr(p1) = COPY [[COPY]](p1)
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY1]](p1) :: (load (s8), align 4, addrspace 5)
+    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY1]](p1) :: (load (s8) from constant-pool, align 4, addrspace 5)
     ; GFX9-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
     ;
     ; GFX10-LABEL: name: private_load_i8_align4
@@ -638,9 +638,9 @@ body: |
     ; GFX10-NEXT: {{  $}}
     ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr(p1) = COPY $sgpr0_sgpr1
     ; GFX10-NEXT: [[COPY1:%[0-9]+]]:vgpr(p1) = COPY [[COPY]](p1)
-    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY1]](p1) :: (load (s8), align 4, addrspace 5)
+    ; GFX10-NEXT: [[LOAD:%[0-9]+]]:vgpr(s32) = G_LOAD [[COPY1]](p1) :: (load (s8) from constant-pool, align 4, addrspace 5)
     ; GFX10-NEXT: S_ENDPGM 0, implicit [[LOAD]](s32)
     %0:_(p1) = COPY $sgpr0_sgpr1
-    %1:_(s32) = G_LOAD %0 :: (load (s8), align 4, addrspace 5)
+    %1:_(s32) = G_LOAD %0 :: (load (s8) from constant-pool, align 4, addrspace 5)
     S_ENDPGM 0, implicit %1
 ...
diff --git a/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll b/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll
index 5d8bebe89dd94..74b5639d902a4 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll
@@ -4,7 +4,6 @@
 
 declare amdgpu_cs_chain void @callee(<3 x i32> inreg, { i32, ptr addrspace(5), i32, i32 })
 declare amdgpu_cs_chain_preserve void @callee_preserve(<3 x i32> inreg, { i32, ptr addrspace(5), i32, i32 })
-declare void @llvm.amdgcn.cs.chain(ptr, i32, <3 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) noreturn
 
 define amdgpu_cs_chain void @dynamic_vgprs(i32 inreg %exec, <3 x i32> inreg %sgpr, { i32, ptr addrspace(5), i32, i32 } %vgpr, i32 inreg %num_vgpr) {
 ; GISEL-GFX12-LABEL: dynamic_vgprs:
@@ -94,4 +93,45 @@ define amdgpu_cs_chain void @constants(<3 x i32> inreg %sgpr, { i32, ptr addrspa
   unreachable
 }
 
+define amdgpu_cs_chain void @high_sgpr_pressure(<30 x i32> inreg %sgpr, { i32, ptr addrspace(5), i32, i32 } %vgpr) {
+; GISEL-GFX12-LABEL: high_sgpr_pressure:
+; GISEL-GFX12:       ; %bb.0:
+; GISEL-GFX12-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GISEL-GFX12-NEXT:    s_wait_expcnt 0x0
+; GISEL-GFX12-NEXT:    s_wait_samplecnt 0x0
+; GISEL-GFX12-NEXT:    s_wait_bvhcnt 0x0
+; GISEL-GFX12-NEXT:    s_wait_kmcnt 0x0
+; GISEL-GFX12-NEXT:    s_mov_b32 s30, callee_high_sgpr@abs32@lo
+; GISEL-GFX12-NEXT:    s_mov_b32 s31, callee_high_sgpr@abs32@hi
+; GISEL-GFX12-NEXT:    s_mov_b32 s34, retry_vgpr_alloc@abs32@lo
+; GISEL-GFX12-NEXT:    s_mov_b32 s35, retry_vgpr_alloc@abs32@hi
+; GISEL-GFX12-NEXT:    s_alloc_vgpr 64
+; GISEL-GFX12-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GISEL-GFX12-NEXT:    s_cselect_b64 s[30:31], s[30:31], s[34:35]
+; GISEL-GFX12-NEXT:    s_cselect_b32 exec_lo, 7, -1
+; GISEL-GFX12-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GISEL-GFX12-NEXT:    s_setpc_b64 s[30:31]
+;
+; DAGISEL-GFX12-LABEL: high_sgpr_pressure:
+; DAGISEL-GFX12:       ; %bb.0:
+; DAGISEL-GFX12-NEXT:    s_wait_loadcnt_dscnt 0x0
+; DAGISEL-GFX12-NEXT:    s_wait_expcnt 0x0
+; DAGISEL-GFX12-NEXT:    s_wait_samplecnt 0x0
+; DAGISEL-GFX12-NEXT:    s_wait_bvhcnt 0x0
+; DAGISEL-GFX12-NEXT:    s_wait_kmcnt 0x0
+; DAGISEL-GFX12-NEXT:    s_mov_b32 s31, retry_vgpr_alloc@abs32@hi
+; DAGISEL-GFX12-NEXT:    s_mov_b32 s30, retry_vgpr_alloc@abs32@lo
+; DAGISEL-GFX12-NEXT:    s_mov_b32 s35, callee_high_sgpr@abs32@hi
+; DAGISEL-GFX12-NEXT:    s_mov_b32 s34, callee_high_sgpr@abs32@lo
+; DAGISEL-GFX12-NEXT:    s_alloc_vgpr 64
+; DAGISEL-GFX12-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; DAGISEL-GFX12-NEXT:    s_cselect_b64 s[34:35], s[34:35], s[30:31]
+; DAGISEL-GFX12-NEXT:    s_cselect_b32 exec_lo, 7, -1
+; DAGISEL-GFX12-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; DAGISEL-GFX12-NEXT:    s_setpc_b64 s[34:35]
+  call void(ptr, i32, <30 x i32>, { i32, ptr addrspace(5), i32, i32 }, i32, ...) @llvm.amdgcn.cs.chain(ptr @callee_high_sgpr, i32 7, <30 x i32> inreg %sgpr, { i32, ptr addrspace(5), i32, i32 } %vgpr, i32 1, i32 inreg 64, i32 inreg -1, ptr @retry_vgpr_alloc)
+  unreachable
+}
+
+declare amdgpu_cs_chain void @callee_high_sgpr(<30 x i32> inreg, { i32, ptr addrspace(5), i32, i32 })
 declare amdgpu_cs_chain_preserve void @retry_vgpr_alloc(<3 x i32> inreg %sgpr)
diff --git a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll
index 931a62298812f..1e79c4f63cd42 100644
--- a/llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll
+++ b/llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll
@@ -255,7 +255,7 @@ define amdgpu_kernel void @memcpy_known(ptr addrspace(7) %src, ptr addrspace(7)
 ; SDAG-GFX942-NEXT:    s_mov_b32 s17, s10
 ; SDAG-GFX942-NEXT:    s_mov_b32 s2, s9
 ; SDAG-GFX942-NEXT:    s_or_b64 s[12:13], s[2:3], s[16:17]
-; SDAG-GFX942-NEXT:  .LBB0_1: ; %load-store-loop
+; SDAG-GFX942-NEXT:  .LBB0_1: ; %static-memcpy-expansion-main-body
 ; SDAG-GFX942-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; SDAG-GFX942-NEXT:    s_add_i32 s1, s0, s16
 ; SDAG-GFX942-NEXT:    v_mov_b32_e32 v0, s1
@@ -312,7 +312,7 @@ define amdgpu_kernel void @memcpy_known(ptr addrspace(7) %src, ptr addrspace(7)
 ; SDAG-GFX942-NEXT:    s_waitcnt vmcnt(15)
 ; SDAG-GFX942-NEXT:    buffer_store_dwordx4 a[0:3], v0, s[12:15], 0 offen offset:240
 ; SDAG-GFX942-NEXT:    s_cbranch_scc1 .LBB0_1
-; SDAG-GFX942-NEXT:  ; %bb.2: ; %memcpy-split
+; SDAG-GFX942-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; SDAG-GFX942-NEXT:    s_endpgm
 ;
 ; SDAG-GFX1100-LABEL: memcpy_known:
@@ -341,7 +341,7 @@ define amdgpu_kernel void @memcpy_known(ptr addrspace(7) %src, ptr addrspace(7)
 ; SDAG-GFX1100-NEXT:    s_mov_b32 s3, s16
 ; SDAG-GFX1100-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; SDAG-GFX1100-NEXT:    s_or_b64 s[12:13], s[2:3], s[16:17]
-; SDAG-GFX1100-NEXT:  .LBB0_1: ; %load-store-loop
+; SDAG-GFX1100-NEXT:  .LBB0_1: ; %static-memcpy-expansion-main-body
 ; SDAG-GFX1100-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; SDAG-GFX1100-NEXT:    s_add_i32 s1, s0, s16
 ; SDAG-GFX1100-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -400,7 +400,7 @@ define amdgpu_kernel void @memcpy_known(ptr addrspace(7) %src, ptr addrspace(7)
 ; SDAG-GFX1100-NEXT:    s_waitcnt vmcnt(0)
 ; SDAG-GFX1100-NEXT:    buffer_store_b128 v[60:63], v64, s[12:15], 0 offen offset:240
 ; SDAG-GFX1100-NEXT:    s_cbranch_scc1 .LBB0_1
-; SDAG-GFX1100-NEXT:  ; %bb.2: ; %memcpy-split
+; SDAG-GFX1100-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; SDAG-GFX1100-NEXT:    s_endpgm
 ;
 ; GISEL-GFX942-LABEL: memcpy_known:
@@ -419,7 +419,7 @@ define amdgpu_kernel void @memcpy_known(ptr addrspace(7) %src, ptr addrspace(7)
 ; GISEL-GFX942-NEXT:    s_mov_b32 s5, s14
 ; GISEL-GFX942-NEXT:    s_mov_b32 s6, s15
 ; GISEL-GFX942-NEXT:    v_mov_b32_e32 v1, s16
-; GISEL-GFX942-NEXT:  .LBB0_1: ; %load-store-loop
+; GISEL-GFX942-NEXT:  .LBB0_1: ; %static-memcpy-expansion-main-body
 ; GISEL-GFX942-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GISEL-GFX942-NEXT:    v_add_u32_e32 v62, s0, v1
 ; GISEL-GFX942-NEXT:    buffer_load_dwordx4 v[2:5], v62, s[8:11], 0 offen
@@ -477,7 +477,7 @@ define amdgpu_kernel void @memcpy_known(ptr addrspace(7) %src, ptr addrspace(7)
 ; GISEL-GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GISEL-GFX942-NEXT:    buffer_store_dwordx4 v[2:5], v63, s[4:7], 0 offen offset:240
 ; GISEL-GFX942-NEXT:    s_cbranch_vccnz .LBB0_1
-; GISEL-GFX942-NEXT:  ; %bb.2: ; %memcpy-split
+; GISEL-GFX942-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; GISEL-GFX942-NEXT:    s_endpgm
 ;
 ; GISEL-GFX1100-LABEL: memcpy_known:
@@ -497,7 +497,7 @@ define amdgpu_kernel void @memcpy_known(ptr addrspace(7) %src, ptr addrspace(7)
 ; GISEL-GFX1100-NEXT:    s_mov_b32 s12, s9
 ; GISEL-GFX1100-NEXT:    s_mov_b32 s13, s10
 ; GISEL-GFX1100-NEXT:    s_mov_b32 s14, s11
-; GISEL-GFX1100-NEXT:  .LBB0_1: ; %load-store-loop
+; GISEL-GFX1100-NEXT:  .LBB0_1: ; %static-memcpy-expansion-main-body
 ; GISEL-GFX1100-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GISEL-GFX1100-NEXT:    v_add_nc_u32_e32 v61, s0, v0
 ; GISEL-GFX1100-NEXT:    v_add_nc_u32_e32 v65, s8, v0
@@ -553,7 +553,7 @@ define amdgpu_kernel void @memcpy_known(ptr addrspace(7) %src, ptr addrspace(7)
 ; GISEL-GFX1100-NEXT:    buffer_store_b128 v[61:64], v65, s[12:15], 0 offen offset:240
 ; GISEL-GFX1100-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 0x2000, v0
 ; GISEL-GFX1100-NEXT:    s_cbranch_vccnz .LBB0_1
-; GISEL-GFX1100-NEXT:  ; %bb.2: ; %memcpy-split
+; GISEL-GFX1100-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; GISEL-GFX1100-NEXT:    s_endpgm
   call void @llvm.memcpy.p7.p7.i32(ptr addrspace(7) noundef nonnull align 16 %dst, ptr addrspace(7) noundef nonnull align 16 %src, i32 8192, i1 false)
   ret void
@@ -787,7 +787,7 @@ define amdgpu_kernel void @memcpy_known_medium(ptr addrspace(7) %src, ptr addrsp
 ; SDAG-GFX942-NEXT:    s_mov_b32 s17, s10
 ; SDAG-GFX942-NEXT:    s_mov_b32 s2, s9
 ; SDAG-GFX942-NEXT:    s_or_b64 s[12:13], s[2:3], s[16:17]
-; SDAG-GFX942-NEXT:  .LBB1_1: ; %load-store-loop
+; SDAG-GFX942-NEXT:  .LBB1_1: ; %static-memcpy-expansion-main-body
 ; SDAG-GFX942-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; SDAG-GFX942-NEXT:    s_add_i32 s1, s0, s16
 ; SDAG-GFX942-NEXT:    v_mov_b32_e32 v0, s1
@@ -844,7 +844,7 @@ define amdgpu_kernel void @memcpy_known_medium(ptr addrspace(7) %src, ptr addrsp
 ; SDAG-GFX942-NEXT:    s_waitcnt vmcnt(15)
 ; SDAG-GFX942-NEXT:    buffer_store_dwordx4 a[0:3], v0, s[12:15], 0 offen offset:240
 ; SDAG-GFX942-NEXT:    s_cbranch_scc1 .LBB1_1
-; SDAG-GFX942-NEXT:  ; %bb.2: ; %memcpy-split
+; SDAG-GFX942-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; SDAG-GFX942-NEXT:    s_endpgm
 ;
 ; SDAG-GFX1100-LABEL: memcpy_known_medium:
@@ -873,7 +873,7 @@ define amdgpu_kernel void @memcpy_known_medium(ptr addrspace(7) %src, ptr addrsp
 ; SDAG-GFX1100-NEXT:    s_mov_b32 s3, s16
 ; SDAG-GFX1100-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; SDAG-GFX1100-NEXT:    s_or_b64 s[12:13], s[2:3], s[16:17]
-; SDAG-GFX1100-NEXT:  .LBB1_1: ; %load-store-loop
+; SDAG-GFX1100-NEXT:  .LBB1_1: ; %static-memcpy-expansion-main-body
 ; SDAG-GFX1100-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; SDAG-GFX1100-NEXT:    s_add_i32 s1, s0, s16
 ; SDAG-GFX1100-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
@@ -932,7 +932,7 @@ define amdgpu_kernel void @memcpy_known_medium(ptr addrspace(7) %src, ptr addrsp
 ; SDAG-GFX1100-NEXT:    s_waitcnt vmcnt(0)
 ; SDAG-GFX1100-NEXT:    buffer_store_b128 v[60:63], v64, s[12:15], 0 offen offset:240
 ; SDAG-GFX1100-NEXT:    s_cbranch_scc1 .LBB1_1
-; SDAG-GFX1100-NEXT:  ; %bb.2: ; %memcpy-split
+; SDAG-GFX1100-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; SDAG-GFX1100-NEXT:    s_endpgm
 ;
 ; GISEL-GFX942-LABEL: memcpy_known_medium:
@@ -951,7 +951,7 @@ define amdgpu_kernel void @memcpy_known_medium(ptr addrspace(7) %src, ptr addrsp
 ; GISEL-GFX942-NEXT:    s_mov_b32 s5, s14
 ; GISEL-GFX942-NEXT:    s_mov_b32 s6, s15
 ; GISEL-GFX942-NEXT:    v_mov_b32_e32 v1, s16
-; GISEL-GFX942-NEXT:  .LBB1_1: ; %load-store-loop
+; GISEL-GFX942-NEXT:  .LBB1_1: ; %static-memcpy-expansion-main-body
 ; GISEL-GFX942-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GISEL-GFX942-NEXT:    v_add_u32_e32 v62, s0, v1
 ; GISEL-GFX942-NEXT:    buffer_load_dwordx4 v[2:5], v62, s[8:11], 0 offen
@@ -1009,7 +1009,7 @@ define amdgpu_kernel void @memcpy_known_medium(ptr addrspace(7) %src, ptr addrsp
 ; GISEL-GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GISEL-GFX942-NEXT:    buffer_store_dwordx4 v[2:5], v63, s[4:7], 0 offen offset:240
 ; GISEL-GFX942-NEXT:    s_cbranch_vccnz .LBB1_1
-; GISEL-GFX942-NEXT:  ; %bb.2: ; %memcpy-split
+; GISEL-GFX942-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; GISEL-GFX942-NEXT:    s_endpgm
 ;
 ; GISEL-GFX1100-LABEL: memcpy_known_medium:
@@ -1029,7 +1029,7 @@ define amdgpu_kernel void @memcpy_known_medium(ptr addrspace(7) %src, ptr addrsp
 ; GISEL-GFX1100-NEXT:    s_mov_b32 s12, s9
 ; GISEL-GFX1100-NEXT:    s_mov_b32 s13, s10
 ; GISEL-GFX1100-NEXT:    s_mov_b32 s14, s11
-; GISEL-GFX1100-NEXT:  .LBB1_1: ; %load-store-loop
+; GISEL-GFX1100-NEXT:  .LBB1_1: ; %static-memcpy-expansion-main-body
 ; GISEL-GFX1100-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; GISEL-GFX1100-NEXT:    v_add_nc_u32_e32 v61, s0, v0
 ; GISEL-GFX1100-NEXT:    v_add_nc_u32_e32 v65, s8, v0
@@ -1085,7 +1085,7 @@ define amdgpu_kernel void @memcpy_known_medium(ptr addrspace(7) %src, ptr addrsp
 ; GISEL-GFX1100-NEXT:    buffer_store_b128 v[61:64], v65, s[12:15], 0 offen offset:240
 ; GISEL-GFX1100-NEXT:    v_cmp_gt_u32_e32 vcc_lo, 0x100, v0
 ; GISEL-GFX1100-NEXT:    s_cbranch_vccnz .LBB1_1
-; GISEL-GFX1100-NEXT:  ; %bb.2: ; %memcpy-split
+; GISEL-GFX1100-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; GISEL-GFX1100-NEXT:    s_endpgm
   call void @llvm.memcpy.p7.p7.i32(ptr addrspace(7) noundef nonnull align 16 %dst, ptr addrspace(7) noundef nonnull align 16 %src, i32 256, i1 false)
   ret void
diff --git a/llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll b/llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll
index 66d99b14e282d..a99aab7a23a3b 100644
--- a/llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll
+++ b/llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll
@@ -474,15 +474,24 @@ define i32 @cast_private_to_flat_to_global(ptr addrspace(6) %const32.ptr) {
 ; OPT-NEXT:    [[LOAD:%.*]] = load volatile i32, ptr addrspace(3) [[LOCAL_PTR]], align 4
 ; OPT-NEXT:    ret i32 [[LOAD]]
 ;
-; ASM-LABEL: cast_private_to_flat_to_global:
-; ASM:       ; %bb.0:
-; ASM-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; ASM-NEXT:    v_mov_b32_e32 v1, 0
-; ASM-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[0:1]
-; ASM-NEXT:    v_cndmask_b32_e32 v0, -1, v0, vcc
-; ASM-NEXT:    ds_read_b32 v0, v0
-; ASM-NEXT:    s_waitcnt lgkmcnt(0)
-; ASM-NEXT:    s_setpc_b64 s[30:31]
+; DAGISEL-ASM-LABEL: cast_private_to_flat_to_global:
+; DAGISEL-ASM:       ; %bb.0:
+; DAGISEL-ASM-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; DAGISEL-ASM-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v0
+; DAGISEL-ASM-NEXT:    v_cndmask_b32_e32 v0, -1, v0, vcc
+; DAGISEL-ASM-NEXT:    ds_read_b32 v0, v0
+; DAGISEL-ASM-NEXT:    s_waitcnt lgkmcnt(0)
+; DAGISEL-ASM-NEXT:    s_setpc_b64 s[30:31]
+;
+; GISEL-ASM-LABEL: cast_private_to_flat_to_global:
+; GISEL-ASM:       ; %bb.0:
+; GISEL-ASM-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GISEL-ASM-NEXT:    v_mov_b32_e32 v1, 0
+; GISEL-ASM-NEXT:    v_cmp_ne_u64_e32 vcc, 0, v[0:1]
+; GISEL-ASM-NEXT:    v_cndmask_b32_e32 v0, -1, v0, vcc
+; GISEL-ASM-NEXT:    ds_read_b32 v0, v0
+; GISEL-ASM-NEXT:    s_waitcnt lgkmcnt(0)
+; GISEL-ASM-NEXT:    s_setpc_b64 s[30:31]
   %flat.ptr = addrspacecast ptr addrspace(6) %const32.ptr to ptr
   %local.ptr = addrspacecast ptr %flat.ptr to ptr addrspace(3)
   %load = load volatile i32, ptr addrspace(3) %local.ptr
diff --git a/llvm/test/CodeGen/AMDGPU/fcanonicalize-elimination.ll b/llvm/test/CodeGen/AMDGPU/fcanonicalize-elimination.ll
index 05d3e9c381910..e7685d53b2d10 100644
--- a/llvm/test/CodeGen/AMDGPU/fcanonicalize-elimination.ll
+++ b/llvm/test/CodeGen/AMDGPU/fcanonicalize-elimination.ll
@@ -497,10 +497,12 @@ define amdgpu_kernel void @test_fold_canonicalize_minnum_value_f32(ptr addrspace
   ret void
 }
 
-; FIXME: Should there be more checks here? minnum with sNaN operand is simplified to qNaN.
+; FIXME: Should there be more checks here? minnum with sNaN operand might get simplified away.
 
 ; GCN-LABEL: test_fold_canonicalize_sNaN_value_f32:
-; GCN: v_mov_b32_e32 v{{.+}}, 0x7fc00000
+; GCN: {{flat|global}}_load_dword [[LOAD:v[0-9]+]]
+; VI: v_mul_f32_e32 v{{[0-9]+}}, 1.0, [[LOAD]]
+; GFX9: v_max_f32_e32 v{{[0-9]+}}, [[LOAD]], [[LOAD]]
 define amdgpu_kernel void @test_fold_canonicalize_sNaN_value_f32(ptr addrspace(1) %arg) {
   %id = tail call i32 @llvm.amdgcn.workitem.id.x()
   %gep = getelementptr inbounds float, ptr addrspace(1) %arg, i32 %id
diff --git a/llvm/test/CodeGen/AMDGPU/hazard-gfx1250-flat-scr-hi.mir b/llvm/test/CodeGen/AMDGPU/hazard-gfx1250-flat-scr-hi.mir
new file mode 100644
index 0000000000000..e98c08248af75
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/hazard-gfx1250-flat-scr-hi.mir
@@ -0,0 +1,183 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 6
+# RUN: llc -mtriple=amdgcn -mcpu=gfx1250 -run-pass si-fold-operands %s -o - | FileCheck -check-prefix=GCN %s
+
+---
+name:            s_ashr_i64
+tracksRegLiveness: true
+body:             |
+  bb.0:
+
+    ; GCN-LABEL: name: s_ashr_i64
+    ; GCN: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
+    ; GCN-NEXT: [[COPY:%[0-9]+]]:sreg_32 = COPY $src_flat_scratch_base_hi
+    ; GCN-NEXT: [[S_ASHR_I64_:%[0-9]+]]:sreg_64 = S_ASHR_I64 [[DEF]], [[COPY]], implicit-def $scc
+    %1:sreg_64 = IMPLICIT_DEF
+    %0:sreg_32 = COPY $src_flat_scratch_base_hi
+    %2:sreg_64 = S_ASHR_I64 %1:sreg_64, %0, implicit-def $scc
+...
+
+---
+name:            s_lshl_b64
+tracksRegLiveness: true
+body:             |
+  bb.0:
+
+    ; GCN-LABEL: name: s_lshl_b64
+    ; GCN: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
+    ; GCN-NEXT: [[COPY:%[0-9]+]]:sreg_32 = COPY $src_flat_scratch_base_hi
+    ; GCN-NEXT: [[S_LSHL_B64_:%[0-9]+]]:sreg_64 = S_LSHL_B64 [[DEF]], [[COPY]], implicit-def $scc
+    %1:sreg_64 = IMPLICIT_DEF
+    %0:sreg_32 = COPY $src_flat_scratch_base_hi
+    %2:sreg_64 = S_LSHL_B64 %1:sreg_64, %0, implicit-def $scc
+...
+
+---
+name:            s_lshr_b64
+tracksRegLiveness: true
+body:             |
+  bb.0:
+
+    ; GCN-LABEL: name: s_lshr_b64
+    ; GCN: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
+    ; GCN-NEXT: [[COPY:%[0-9]+]]:sreg_32 = COPY $src_flat_scratch_base_hi
+    ; GCN-NEXT: [[S_LSHR_B64_:%[0-9]+]]:sreg_64 = S_LSHR_B64 [[DEF]], [[COPY]], implicit-def $scc
+    %1:sreg_64 = IMPLICIT_DEF
+    %0:sreg_32 = COPY $src_flat_scratch_base_hi
+    %2:sreg_64 = S_LSHR_B64 %1:sreg_64, %0, implicit-def $scc
+...
+
+---
+name:            s_bfe_i64
+tracksRegLiveness: true
+body:             |
+  bb.0:
+
+    ; GCN-LABEL: name: s_bfe_i64
+    ; GCN: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
+    ; GCN-NEXT: [[COPY:%[0-9]+]]:sreg_32 = COPY $src_flat_scratch_base_hi
+    ; GCN-NEXT: [[S_BFE_I64_:%[0-9]+]]:sreg_64 = S_BFE_I64 [[DEF]], [[COPY]], implicit-def $scc
+    %1:sreg_64 = IMPLICIT_DEF
+    %0:sreg_32 = COPY $src_flat_scratch_base_hi
+    %2:sreg_64 = S_BFE_I64 %1:sreg_64, %0, implicit-def $scc
+...
+
+---
+name:            s_bfe_u64
+tracksRegLiveness: true
+body:             |
+  bb.0:
+
+    ; GCN-LABEL: name: s_bfe_u64
+    ; GCN: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
+    ; GCN-NEXT: [[COPY:%[0-9]+]]:sreg_32 = COPY $src_flat_scratch_base_hi
+    ; GCN-NEXT: [[S_BFE_U64_:%[0-9]+]]:sreg_64 = S_BFE_U64 [[DEF]], [[COPY]], implicit-def $scc
+    %1:sreg_64 = IMPLICIT_DEF
+    %0:sreg_32 = COPY $src_flat_scratch_base_hi
+    %2:sreg_64 = S_BFE_U64 %1:sreg_64, %0, implicit-def $scc
+...
+
+---
+name:            s_bfm_b64
+tracksRegLiveness: true
+body:             |
+  bb.0:
+
+    ; GCN-LABEL: name: s_bfm_b64
+    ; GCN: [[COPY:%[0-9]+]]:sreg_32 = COPY $src_flat_scratch_base_hi
+    ; GCN-NEXT: [[S_BFM_B64_:%[0-9]+]]:sreg_64 = S_BFM_B64 [[COPY]], 1, implicit-def $scc
+    %0:sreg_32 = COPY $src_flat_scratch_base_hi
+    %1:sreg_64 = S_BFM_B64 %0, 1, implicit-def $scc
+...
+
+---
+name:            s_bitcmp0_b64
+tracksRegLiveness: true
+body:             |
+  bb.0:
+
+    ; GCN-LABEL: name: s_bitcmp0_b64
+    ; GCN: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
+    ; GCN-NEXT: $scc = IMPLICIT_DEF
+    ; GCN-NEXT: [[COPY:%[0-9]+]]:sreg_32 = COPY $src_flat_scratch_base_hi
+    ; GCN-NEXT: S_BITCMP0_B64 [[DEF]], [[COPY]], implicit $scc, implicit-def $scc
+    %1:sreg_64 = IMPLICIT_DEF
+    $scc = IMPLICIT_DEF
+    %0:sreg_32 = COPY $src_flat_scratch_base_hi
+    S_BITCMP0_B64 %1:sreg_64, %0, implicit $scc, implicit-def $scc
+...
+
+---
+name:            s_bitcmp1_b64
+tracksRegLiveness: true
+body:             |
+  bb.0:
+
+    ; GCN-LABEL: name: s_bitcmp1_b64
+    ; GCN: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
+    ; GCN-NEXT: $scc = IMPLICIT_DEF
+    ; GCN-NEXT: [[COPY:%[0-9]+]]:sreg_32 = COPY $src_flat_scratch_base_hi
+    ; GCN-NEXT: S_BITCMP1_B64 [[DEF]], [[COPY]], implicit $scc, implicit-def $scc
+    %1:sreg_64 = IMPLICIT_DEF
+    $scc = IMPLICIT_DEF
+    %0:sreg_32 = COPY $src_flat_scratch_base_hi
+    S_BITCMP1_B64 %1:sreg_64, %0, implicit $scc, implicit-def $scc
+...
+
+---
+name:            s_bitreplicate_b64_b32
+tracksRegLiveness: true
+body:             |
+  bb.0:
+
+    ; GCN-LABEL: name: s_bitreplicate_b64_b32
+    ; GCN: [[COPY:%[0-9]+]]:sreg_32 = COPY $src_flat_scratch_base_hi
+    ; GCN-NEXT: [[S_BITREPLICATE_B64_B32_:%[0-9]+]]:sreg_64 = S_BITREPLICATE_B64_B32 [[COPY]], implicit-def $scc
+    %0:sreg_32 = COPY $src_flat_scratch_base_hi
+    %2:sreg_64 = S_BITREPLICATE_B64_B32 %0, implicit-def $scc
+...
+
+---
+name:            s_bitset0_b64
+tracksRegLiveness: true
+body:             |
+  bb.0:
+
+    ; GCN-LABEL: name: s_bitset0_b64
+    ; GCN: $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; GCN-NEXT: $sgpr2 = S_MOV_B32 $src_flat_scratch_base_hi
+    ; GCN-NEXT: $sgpr0_sgpr1 = S_BITSET0_B64 $sgpr2, $sgpr0_sgpr1, implicit-def $scc
+    $sgpr0_sgpr1 = IMPLICIT_DEF
+    $sgpr2 = S_MOV_B32 $src_flat_scratch_base_hi
+    $sgpr0_sgpr1 = S_BITSET0_B64 $sgpr2, $sgpr0_sgpr1, implicit-def $scc
+...
+
+---
+name:            s_bitset1_b64
+tracksRegLiveness: true
+body:             |
+  bb.0:
+
+    ; GCN-LABEL: name: s_bitset1_b64
+    ; GCN: $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; GCN-NEXT: $sgpr2 = S_MOV_B32 $src_flat_scratch_base_hi
+    ; GCN-NEXT: $sgpr0_sgpr1 = S_BITSET1_B64 $sgpr2, $sgpr0_sgpr1, implicit-def $scc
+    $sgpr0_sgpr1 = IMPLICIT_DEF
+    $sgpr2 = S_MOV_B32 $src_flat_scratch_base_hi
+    $sgpr0_sgpr1 = S_BITSET1_B64 $sgpr2, $sgpr0_sgpr1, implicit-def $scc
+...
+
+---
+name:            s_ashr_i64_phys_dst
+tracksRegLiveness: true
+body:             |
+  bb.0:
+
+    ; GCN-LABEL: name: s_ashr_i64_phys_dst
+    ; GCN: $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; GCN-NEXT: $sgpr2 = COPY $src_flat_scratch_base_hi
+    ; GCN-NEXT: $sgpr0_sgpr1 = S_ASHR_I64 $sgpr0_sgpr1, $sgpr2, implicit-def $scc
+    $sgpr0_sgpr1 = IMPLICIT_DEF
+    $sgpr2 = COPY $src_flat_scratch_base_hi
+    %0:sreg_32 = COPY $src_flat_scratch_base_hi
+    $sgpr0_sgpr1 = S_ASHR_I64 $sgpr0_sgpr1, $sgpr2, implicit-def $scc
+...
diff --git a/llvm/test/CodeGen/AMDGPU/isel-amdgcn-cs-chain-intrinsic-w32.ll b/llvm/test/CodeGen/AMDGPU/isel-amdgcn-cs-chain-intrinsic-w32.ll
index ece86627cbd92..43ba2925914a0 100644
--- a/llvm/test/CodeGen/AMDGPU/isel-amdgcn-cs-chain-intrinsic-w32.ll
+++ b/llvm/test/CodeGen/AMDGPU/isel-amdgcn-cs-chain-intrinsic-w32.ll
@@ -35,7 +35,7 @@ define amdgpu_cs_chain void @chain_to_chain(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX11-NEXT:   SI_CS_CHAIN_TC_W32 [[COPY10]], @callee, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11
   ;
   ; GISEL-GFX10-LABEL: name: chain_to_chain
@@ -67,7 +67,7 @@ define amdgpu_cs_chain void @chain_to_chain(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX10-NEXT:   SI_CS_CHAIN_TC_W32 [[COPY11]], @callee, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11, implicit $sgpr48_sgpr49_sgpr50_sgpr51
   ;
   ; DAGISEL-GFX11-LABEL: name: chain_to_chain
@@ -83,7 +83,7 @@ define amdgpu_cs_chain void @chain_to_chain(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; DAGISEL-GFX11-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -112,7 +112,7 @@ define amdgpu_cs_chain void @chain_to_chain(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; DAGISEL-GFX10-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -161,7 +161,7 @@ define amdgpu_cs void @cs_to_chain(<3 x i32> inreg %sgpr, { i32, ptr addrspace(5
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX11-NEXT:   SI_CS_CHAIN_TC_W32 [[COPY10]], @callee, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11
   ;
   ; GISEL-GFX10-LABEL: name: cs_to_chain
@@ -193,7 +193,7 @@ define amdgpu_cs void @cs_to_chain(<3 x i32> inreg %sgpr, { i32, ptr addrspace(5
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX10-NEXT:   SI_CS_CHAIN_TC_W32 [[COPY11]], @callee, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11, implicit $sgpr48_sgpr49_sgpr50_sgpr51
   ;
   ; DAGISEL-GFX11-LABEL: name: cs_to_chain
@@ -209,7 +209,7 @@ define amdgpu_cs void @cs_to_chain(<3 x i32> inreg %sgpr, { i32, ptr addrspace(5
   ; DAGISEL-GFX11-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -238,7 +238,7 @@ define amdgpu_cs void @cs_to_chain(<3 x i32> inreg %sgpr, { i32, ptr addrspace(5
   ; DAGISEL-GFX10-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -287,7 +287,7 @@ define amdgpu_cs_chain void @chain_to_chain_preserve(<3 x i32> inreg %sgpr, { i3
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX11-NEXT:   SI_CS_CHAIN_TC_W32 [[COPY10]], @callee_preserve, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11
   ;
   ; GISEL-GFX10-LABEL: name: chain_to_chain_preserve
@@ -319,7 +319,7 @@ define amdgpu_cs_chain void @chain_to_chain_preserve(<3 x i32> inreg %sgpr, { i3
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX10-NEXT:   SI_CS_CHAIN_TC_W32 [[COPY11]], @callee_preserve, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11, implicit $sgpr48_sgpr49_sgpr50_sgpr51
   ;
   ; DAGISEL-GFX11-LABEL: name: chain_to_chain_preserve
@@ -335,7 +335,7 @@ define amdgpu_cs_chain void @chain_to_chain_preserve(<3 x i32> inreg %sgpr, { i3
   ; DAGISEL-GFX11-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -364,7 +364,7 @@ define amdgpu_cs_chain void @chain_to_chain_preserve(<3 x i32> inreg %sgpr, { i3
   ; DAGISEL-GFX10-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -413,7 +413,7 @@ define amdgpu_cs void @cs_to_chain_preserve(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX11-NEXT:   SI_CS_CHAIN_TC_W32 [[COPY10]], @callee_preserve, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11
   ;
   ; GISEL-GFX10-LABEL: name: cs_to_chain_preserve
@@ -445,7 +445,7 @@ define amdgpu_cs void @cs_to_chain_preserve(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX10-NEXT:   SI_CS_CHAIN_TC_W32 [[COPY11]], @callee_preserve, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11, implicit $sgpr48_sgpr49_sgpr50_sgpr51
   ;
   ; DAGISEL-GFX11-LABEL: name: cs_to_chain_preserve
@@ -461,7 +461,7 @@ define amdgpu_cs void @cs_to_chain_preserve(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; DAGISEL-GFX11-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -490,7 +490,7 @@ define amdgpu_cs void @cs_to_chain_preserve(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; DAGISEL-GFX10-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -518,7 +518,7 @@ define amdgpu_cs_chain void @indirect(ptr inreg %callee, <3 x i32> inreg %sgpr,
   ; GISEL-GFX11-NEXT: {{  $}}
   ; GISEL-GFX11-NEXT:   [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX11-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
-  ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
+  ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
   ; GISEL-GFX11-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY $sgpr2
   ; GISEL-GFX11-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr3
   ; GISEL-GFX11-NEXT:   [[COPY4:%[0-9]+]]:sreg_32 = COPY $sgpr4
@@ -547,7 +547,7 @@ define amdgpu_cs_chain void @indirect(ptr inreg %callee, <3 x i32> inreg %sgpr,
   ; GISEL-GFX10-NEXT: {{  $}}
   ; GISEL-GFX10-NEXT:   [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX10-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
-  ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
+  ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
   ; GISEL-GFX10-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY $sgpr2
   ; GISEL-GFX10-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr3
   ; GISEL-GFX10-NEXT:   [[COPY4:%[0-9]+]]:sreg_32 = COPY $sgpr4
@@ -592,7 +592,7 @@ define amdgpu_cs_chain void @indirect(ptr inreg %callee, <3 x i32> inreg %sgpr,
   ; DAGISEL-GFX11-NEXT:   [[COPY11:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE]].sub0
   ; DAGISEL-GFX11-NEXT:   [[COPY12:%[0-9]+]]:vgpr_32 = COPY [[COPY11]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY12]], implicit $exec
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY13:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY13]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -628,7 +628,7 @@ define amdgpu_cs_chain void @indirect(ptr inreg %callee, <3 x i32> inreg %sgpr,
   ; DAGISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE]].sub0
   ; DAGISEL-GFX10-NEXT:   [[COPY12:%[0-9]+]]:vgpr_32 = COPY [[COPY11]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY12]], implicit $exec
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY13:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY13]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -658,7 +658,7 @@ define amdgpu_cs_chain void @nonuniform_callee(ptr %callee, i32 inreg %sgpr, i32
   ; GISEL-GFX11-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr8
   ; GISEL-GFX11-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr9
   ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
-  ; GISEL-GFX11-NEXT:   [[COPY2:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX11-NEXT:   [[COPY2:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX11-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX11-NEXT:   [[COPY4:%[0-9]+]]:vgpr_32 = COPY $vgpr10
   ; GISEL-GFX11-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY3]]
@@ -674,7 +674,7 @@ define amdgpu_cs_chain void @nonuniform_callee(ptr %callee, i32 inreg %sgpr, i32
   ; GISEL-GFX10-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr8
   ; GISEL-GFX10-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr9
   ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
-  ; GISEL-GFX10-NEXT:   [[COPY2:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX10-NEXT:   [[COPY2:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX10-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX10-NEXT:   [[COPY4:%[0-9]+]]:vgpr_32 = COPY $vgpr10
   ; GISEL-GFX10-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY3]]
@@ -698,7 +698,7 @@ define amdgpu_cs_chain void @nonuniform_callee(ptr %callee, i32 inreg %sgpr, i32
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY4]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub0
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY5]], implicit $exec
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY6:%[0-9]+]]:vgpr_32 = COPY [[COPY1]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY6]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   $sgpr0 = COPY [[V_READFIRSTLANE_B32_2]]
@@ -718,7 +718,7 @@ define amdgpu_cs_chain void @nonuniform_callee(ptr %callee, i32 inreg %sgpr, i32
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY4]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub0
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY5]], implicit $exec
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY6:%[0-9]+]]:vgpr_32 = COPY [[COPY1]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY6]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY7:%[0-9]+]]:sgpr_128 = COPY $sgpr48_sgpr49_sgpr50_sgpr51
@@ -759,7 +759,7 @@ define amdgpu_cs_chain void @non_imm_exec(i32 inreg %exec, <3 x i32> inreg %sgpr
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX11-NEXT:   [[COPY11:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX11-NEXT:   [[COPY11:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX11-NEXT:   SI_CS_CHAIN_TC_W32 [[COPY11]], @callee, 0, [[COPY]], amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11
   ;
   ; GISEL-GFX10-LABEL: name: non_imm_exec
@@ -792,7 +792,7 @@ define amdgpu_cs_chain void @non_imm_exec(i32 inreg %exec, <3 x i32> inreg %sgpr
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX10-NEXT:   [[COPY12:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX10-NEXT:   [[COPY12:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX10-NEXT:   SI_CS_CHAIN_TC_W32 [[COPY12]], @callee, 0, [[COPY]], amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11, implicit $sgpr48_sgpr49_sgpr50_sgpr51
   ;
   ; DAGISEL-GFX11-LABEL: name: non_imm_exec
@@ -809,7 +809,7 @@ define amdgpu_cs_chain void @non_imm_exec(i32 inreg %exec, <3 x i32> inreg %sgpr
   ; DAGISEL-GFX11-NEXT:   [[COPY7:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY8]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY9:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -839,7 +839,7 @@ define amdgpu_cs_chain void @non_imm_exec(i32 inreg %exec, <3 x i32> inreg %sgpr
   ; DAGISEL-GFX10-NEXT:   [[COPY7:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY8]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY9:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -867,7 +867,7 @@ define amdgpu_cs_chain void @indirect_with_non_imm_exec(ptr inreg %callee, i32 i
   ; GISEL-GFX11-NEXT: {{  $}}
   ; GISEL-GFX11-NEXT:   [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX11-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
-  ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
+  ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
   ; GISEL-GFX11-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY $sgpr2
   ; GISEL-GFX11-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr3
   ; GISEL-GFX11-NEXT:   [[COPY4:%[0-9]+]]:sreg_32 = COPY $sgpr4
@@ -897,7 +897,7 @@ define amdgpu_cs_chain void @indirect_with_non_imm_exec(ptr inreg %callee, i32 i
   ; GISEL-GFX10-NEXT: {{  $}}
   ; GISEL-GFX10-NEXT:   [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX10-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
-  ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
+  ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
   ; GISEL-GFX10-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY $sgpr2
   ; GISEL-GFX10-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr3
   ; GISEL-GFX10-NEXT:   [[COPY4:%[0-9]+]]:sreg_32 = COPY $sgpr4
@@ -944,7 +944,7 @@ define amdgpu_cs_chain void @indirect_with_non_imm_exec(ptr inreg %callee, i32 i
   ; DAGISEL-GFX11-NEXT:   [[COPY12:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE]].sub0
   ; DAGISEL-GFX11-NEXT:   [[COPY13:%[0-9]+]]:vgpr_32 = COPY [[COPY12]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY13]], implicit $exec
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY14]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY15:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -981,7 +981,7 @@ define amdgpu_cs_chain void @indirect_with_non_imm_exec(ptr inreg %callee, i32 i
   ; DAGISEL-GFX10-NEXT:   [[COPY12:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE]].sub0
   ; DAGISEL-GFX10-NEXT:   [[COPY13:%[0-9]+]]:vgpr_32 = COPY [[COPY12]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY13]], implicit $exec
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY14]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY15:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
diff --git a/llvm/test/CodeGen/AMDGPU/isel-amdgcn-cs-chain-intrinsic-w64.ll b/llvm/test/CodeGen/AMDGPU/isel-amdgcn-cs-chain-intrinsic-w64.ll
index 6c9c7a4a06fa6..00731126b4b86 100644
--- a/llvm/test/CodeGen/AMDGPU/isel-amdgcn-cs-chain-intrinsic-w64.ll
+++ b/llvm/test/CodeGen/AMDGPU/isel-amdgcn-cs-chain-intrinsic-w64.ll
@@ -35,7 +35,7 @@ define amdgpu_cs_chain void @chain_to_chain(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX11-NEXT:   SI_CS_CHAIN_TC_W64 [[COPY10]], @callee, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11
   ;
   ; GISEL-GFX10-LABEL: name: chain_to_chain
@@ -67,7 +67,7 @@ define amdgpu_cs_chain void @chain_to_chain(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX10-NEXT:   SI_CS_CHAIN_TC_W64 [[COPY11]], @callee, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11, implicit $sgpr48_sgpr49_sgpr50_sgpr51
   ;
   ; DAGISEL-GFX11-LABEL: name: chain_to_chain
@@ -83,7 +83,7 @@ define amdgpu_cs_chain void @chain_to_chain(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; DAGISEL-GFX11-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -112,7 +112,7 @@ define amdgpu_cs_chain void @chain_to_chain(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; DAGISEL-GFX10-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -161,7 +161,7 @@ define amdgpu_cs void @cs_to_chain(<3 x i32> inreg %sgpr, { i32, ptr addrspace(5
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX11-NEXT:   SI_CS_CHAIN_TC_W64 [[COPY10]], @callee, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11
   ;
   ; GISEL-GFX10-LABEL: name: cs_to_chain
@@ -193,7 +193,7 @@ define amdgpu_cs void @cs_to_chain(<3 x i32> inreg %sgpr, { i32, ptr addrspace(5
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX10-NEXT:   SI_CS_CHAIN_TC_W64 [[COPY11]], @callee, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11, implicit $sgpr48_sgpr49_sgpr50_sgpr51
   ;
   ; DAGISEL-GFX11-LABEL: name: cs_to_chain
@@ -209,7 +209,7 @@ define amdgpu_cs void @cs_to_chain(<3 x i32> inreg %sgpr, { i32, ptr addrspace(5
   ; DAGISEL-GFX11-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -238,7 +238,7 @@ define amdgpu_cs void @cs_to_chain(<3 x i32> inreg %sgpr, { i32, ptr addrspace(5
   ; DAGISEL-GFX10-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -287,7 +287,7 @@ define amdgpu_cs_chain void @chain_to_chain_preserve(<3 x i32> inreg %sgpr, { i3
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX11-NEXT:   SI_CS_CHAIN_TC_W64 [[COPY10]], @callee_preserve, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11
   ;
   ; GISEL-GFX10-LABEL: name: chain_to_chain_preserve
@@ -319,7 +319,7 @@ define amdgpu_cs_chain void @chain_to_chain_preserve(<3 x i32> inreg %sgpr, { i3
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX10-NEXT:   SI_CS_CHAIN_TC_W64 [[COPY11]], @callee_preserve, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11, implicit $sgpr48_sgpr49_sgpr50_sgpr51
   ;
   ; DAGISEL-GFX11-LABEL: name: chain_to_chain_preserve
@@ -335,7 +335,7 @@ define amdgpu_cs_chain void @chain_to_chain_preserve(<3 x i32> inreg %sgpr, { i3
   ; DAGISEL-GFX11-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -364,7 +364,7 @@ define amdgpu_cs_chain void @chain_to_chain_preserve(<3 x i32> inreg %sgpr, { i3
   ; DAGISEL-GFX10-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -413,7 +413,7 @@ define amdgpu_cs void @cs_to_chain_preserve(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX11-NEXT:   SI_CS_CHAIN_TC_W64 [[COPY10]], @callee_preserve, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11
   ;
   ; GISEL-GFX10-LABEL: name: cs_to_chain_preserve
@@ -445,7 +445,7 @@ define amdgpu_cs void @cs_to_chain_preserve(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX10-NEXT:   SI_CS_CHAIN_TC_W64 [[COPY11]], @callee_preserve, 0, -1, amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11, implicit $sgpr48_sgpr49_sgpr50_sgpr51
   ;
   ; DAGISEL-GFX11-LABEL: name: cs_to_chain_preserve
@@ -461,7 +461,7 @@ define amdgpu_cs void @cs_to_chain_preserve(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; DAGISEL-GFX11-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -490,7 +490,7 @@ define amdgpu_cs void @cs_to_chain_preserve(<3 x i32> inreg %sgpr, { i32, ptr ad
   ; DAGISEL-GFX10-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee_preserve
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee_preserve
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -518,7 +518,7 @@ define amdgpu_cs_chain void @indirect(ptr inreg %callee, <3 x i32> inreg %sgpr,
   ; GISEL-GFX11-NEXT: {{  $}}
   ; GISEL-GFX11-NEXT:   [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX11-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
-  ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
+  ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
   ; GISEL-GFX11-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY $sgpr2
   ; GISEL-GFX11-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr3
   ; GISEL-GFX11-NEXT:   [[COPY4:%[0-9]+]]:sreg_32 = COPY $sgpr4
@@ -547,7 +547,7 @@ define amdgpu_cs_chain void @indirect(ptr inreg %callee, <3 x i32> inreg %sgpr,
   ; GISEL-GFX10-NEXT: {{  $}}
   ; GISEL-GFX10-NEXT:   [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX10-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
-  ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
+  ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
   ; GISEL-GFX10-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY $sgpr2
   ; GISEL-GFX10-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr3
   ; GISEL-GFX10-NEXT:   [[COPY4:%[0-9]+]]:sreg_32 = COPY $sgpr4
@@ -592,7 +592,7 @@ define amdgpu_cs_chain void @indirect(ptr inreg %callee, <3 x i32> inreg %sgpr,
   ; DAGISEL-GFX11-NEXT:   [[COPY11:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE]].sub0
   ; DAGISEL-GFX11-NEXT:   [[COPY12:%[0-9]+]]:vgpr_32 = COPY [[COPY11]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY12]], implicit $exec
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY13:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY13]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -628,7 +628,7 @@ define amdgpu_cs_chain void @indirect(ptr inreg %callee, <3 x i32> inreg %sgpr,
   ; DAGISEL-GFX10-NEXT:   [[COPY11:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE]].sub0
   ; DAGISEL-GFX10-NEXT:   [[COPY12:%[0-9]+]]:vgpr_32 = COPY [[COPY11]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY12]], implicit $exec
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY13:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY13]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -658,7 +658,7 @@ define amdgpu_cs_chain void @nonuniform_callee(ptr %callee, i32 inreg %sgpr, i32
   ; GISEL-GFX11-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr8
   ; GISEL-GFX11-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr9
   ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
-  ; GISEL-GFX11-NEXT:   [[COPY2:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX11-NEXT:   [[COPY2:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX11-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX11-NEXT:   [[COPY4:%[0-9]+]]:vgpr_32 = COPY $vgpr10
   ; GISEL-GFX11-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY3]]
@@ -674,7 +674,7 @@ define amdgpu_cs_chain void @nonuniform_callee(ptr %callee, i32 inreg %sgpr, i32
   ; GISEL-GFX10-NEXT:   [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr8
   ; GISEL-GFX10-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY $vgpr9
   ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:vreg_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
-  ; GISEL-GFX10-NEXT:   [[COPY2:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX10-NEXT:   [[COPY2:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX10-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX10-NEXT:   [[COPY4:%[0-9]+]]:vgpr_32 = COPY $vgpr10
   ; GISEL-GFX10-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY3]]
@@ -698,7 +698,7 @@ define amdgpu_cs_chain void @nonuniform_callee(ptr %callee, i32 inreg %sgpr, i32
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY4]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub0
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY5]], implicit $exec
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY6:%[0-9]+]]:vgpr_32 = COPY [[COPY1]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY6]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   $sgpr0 = COPY [[V_READFIRSTLANE_B32_2]]
@@ -718,7 +718,7 @@ define amdgpu_cs_chain void @nonuniform_callee(ptr %callee, i32 inreg %sgpr, i32
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY4]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[REG_SEQUENCE]].sub0
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY5]], implicit $exec
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY6:%[0-9]+]]:vgpr_32 = COPY [[COPY1]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY6]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY7:%[0-9]+]]:sgpr_128 = COPY $sgpr48_sgpr49_sgpr50_sgpr51
@@ -760,7 +760,7 @@ define amdgpu_cs_chain void @non_imm_exec(i64 inreg %exec, <3 x i32> inreg %sgpr
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX11-NEXT:   [[COPY12:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE1]]
+  ; GISEL-GFX11-NEXT:   [[COPY12:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE1]]
   ; GISEL-GFX11-NEXT:   SI_CS_CHAIN_TC_W64 [[COPY12]], @callee, 0, [[REG_SEQUENCE]], amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11
   ;
   ; GISEL-GFX10-LABEL: name: non_imm_exec
@@ -795,7 +795,7 @@ define amdgpu_cs_chain void @non_imm_exec(i64 inreg %exec, <3 x i32> inreg %sgpr
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX10-NEXT:   [[COPY13:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE1]]
+  ; GISEL-GFX10-NEXT:   [[COPY13:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE1]]
   ; GISEL-GFX10-NEXT:   SI_CS_CHAIN_TC_W64 [[COPY13]], @callee, 0, [[REG_SEQUENCE]], amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11, implicit $sgpr48_sgpr49_sgpr50_sgpr51
   ;
   ; DAGISEL-GFX11-LABEL: name: non_imm_exec
@@ -814,7 +814,7 @@ define amdgpu_cs_chain void @non_imm_exec(i64 inreg %exec, <3 x i32> inreg %sgpr
   ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY8]], %subreg.sub0, [[COPY7]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX11-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY9:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY9]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY10:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -846,7 +846,7 @@ define amdgpu_cs_chain void @non_imm_exec(i64 inreg %exec, <3 x i32> inreg %sgpr
   ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY8]], %subreg.sub0, [[COPY7]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX10-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY9:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY9]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY10:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -874,7 +874,7 @@ define amdgpu_cs_chain void @indirect_with_non_imm_exec(ptr inreg %callee, i64 i
   ; GISEL-GFX11-NEXT: {{  $}}
   ; GISEL-GFX11-NEXT:   [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX11-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
-  ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
+  ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
   ; GISEL-GFX11-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY $sgpr2
   ; GISEL-GFX11-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr3
   ; GISEL-GFX11-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[COPY2]], %subreg.sub0, [[COPY3]], %subreg.sub1
@@ -906,7 +906,7 @@ define amdgpu_cs_chain void @indirect_with_non_imm_exec(ptr inreg %callee, i64 i
   ; GISEL-GFX10-NEXT: {{  $}}
   ; GISEL-GFX10-NEXT:   [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX10-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
-  ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
+  ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY]], %subreg.sub0, [[COPY1]], %subreg.sub1
   ; GISEL-GFX10-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY $sgpr2
   ; GISEL-GFX10-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr3
   ; GISEL-GFX10-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[COPY2]], %subreg.sub0, [[COPY3]], %subreg.sub1
@@ -957,7 +957,7 @@ define amdgpu_cs_chain void @indirect_with_non_imm_exec(ptr inreg %callee, i64 i
   ; DAGISEL-GFX11-NEXT:   [[COPY13:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE1]].sub0
   ; DAGISEL-GFX11-NEXT:   [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY13]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY14]], implicit $exec
-  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE2:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX11-NEXT:   [[REG_SEQUENCE2:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX11-NEXT:   [[COPY15:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX11-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY15]], implicit $exec
   ; DAGISEL-GFX11-NEXT:   [[COPY16:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -996,7 +996,7 @@ define amdgpu_cs_chain void @indirect_with_non_imm_exec(ptr inreg %callee, i64 i
   ; DAGISEL-GFX10-NEXT:   [[COPY13:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE1]].sub0
   ; DAGISEL-GFX10-NEXT:   [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY13]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY14]], implicit $exec
-  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE2:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX10-NEXT:   [[REG_SEQUENCE2:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX10-NEXT:   [[COPY15:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX10-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY15]], implicit $exec
   ; DAGISEL-GFX10-NEXT:   [[COPY16:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
diff --git a/llvm/test/CodeGen/AMDGPU/isel-amdgpu-cs-chain-intrinsic-dyn-vgpr-w32.ll b/llvm/test/CodeGen/AMDGPU/isel-amdgpu-cs-chain-intrinsic-dyn-vgpr-w32.ll
index 9fe26ec97d580..b723ea8f92a87 100644
--- a/llvm/test/CodeGen/AMDGPU/isel-amdgpu-cs-chain-intrinsic-dyn-vgpr-w32.ll
+++ b/llvm/test/CodeGen/AMDGPU/isel-amdgpu-cs-chain-intrinsic-dyn-vgpr-w32.ll
@@ -35,11 +35,11 @@ define amdgpu_cs_chain void @direct_callee_direct_fallback(<3 x i32> inreg %sgpr
   ; GISEL-GFX12-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX12-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX12-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX12-NEXT:   [[COPY10:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE]]
+  ; GISEL-GFX12-NEXT:   [[COPY10:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE]]
   ; GISEL-GFX12-NEXT:   [[S_MOV_B32_2:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @retry_vgpr_alloc
   ; GISEL-GFX12-NEXT:   [[S_MOV_B32_3:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @retry_vgpr_alloc
   ; GISEL-GFX12-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_2]], %subreg.sub0, [[S_MOV_B32_3]], %subreg.sub1
-  ; GISEL-GFX12-NEXT:   [[COPY11:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE1]]
+  ; GISEL-GFX12-NEXT:   [[COPY11:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE1]]
   ; GISEL-GFX12-NEXT:   SI_CS_CHAIN_TC_W32_DVGPR [[COPY10]], 0, 0, 15, 64, -1, [[COPY11]], amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11
   ;
   ; DAGISEL-GFX12-LABEL: name: direct_callee_direct_fallback
@@ -55,10 +55,10 @@ define amdgpu_cs_chain void @direct_callee_direct_fallback(<3 x i32> inreg %sgpr
   ; DAGISEL-GFX12-NEXT:   [[COPY6:%[0-9]+]]:sgpr_32 = COPY $sgpr0
   ; DAGISEL-GFX12-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @retry_vgpr_alloc
   ; DAGISEL-GFX12-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @retry_vgpr_alloc
-  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX12-NEXT:   [[S_MOV_B32_2:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX12-NEXT:   [[S_MOV_B32_3:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_3]], %subreg.sub0, killed [[S_MOV_B32_2]], %subreg.sub1
+  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_3]], %subreg.sub0, killed [[S_MOV_B32_2]], %subreg.sub1
   ; DAGISEL-GFX12-NEXT:   [[COPY7:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
   ; DAGISEL-GFX12-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY7]], implicit $exec
   ; DAGISEL-GFX12-NEXT:   [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY5]]
@@ -85,7 +85,7 @@ define amdgpu_cs_chain void @indirect_callee_direct_fallback(i32 inreg %exec, pt
   ; GISEL-GFX12-NEXT:   [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX12-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
   ; GISEL-GFX12-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY $sgpr2
-  ; GISEL-GFX12-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY1]], %subreg.sub0, [[COPY2]], %subreg.sub1
+  ; GISEL-GFX12-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY1]], %subreg.sub0, [[COPY2]], %subreg.sub1
   ; GISEL-GFX12-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr3
   ; GISEL-GFX12-NEXT:   [[COPY4:%[0-9]+]]:sreg_32 = COPY $sgpr4
   ; GISEL-GFX12-NEXT:   [[COPY5:%[0-9]+]]:sreg_32 = COPY $sgpr5
@@ -110,7 +110,7 @@ define amdgpu_cs_chain void @indirect_callee_direct_fallback(i32 inreg %exec, pt
   ; GISEL-GFX12-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @retry_vgpr_alloc
   ; GISEL-GFX12-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @retry_vgpr_alloc
   ; GISEL-GFX12-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX12-NEXT:   [[COPY14:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE1]]
+  ; GISEL-GFX12-NEXT:   [[COPY14:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE1]]
   ; GISEL-GFX12-NEXT:   SI_CS_CHAIN_TC_W32_DVGPR [[REG_SEQUENCE]], 0, 0, [[COPY]], [[COPY10]], -1, [[COPY14]], amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11
   ;
   ; DAGISEL-GFX12-LABEL: name: indirect_callee_direct_fallback
@@ -135,10 +135,10 @@ define amdgpu_cs_chain void @indirect_callee_direct_fallback(i32 inreg %exec, pt
   ; DAGISEL-GFX12-NEXT:   [[COPY13:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE]].sub0
   ; DAGISEL-GFX12-NEXT:   [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY13]]
   ; DAGISEL-GFX12-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY14]], implicit $exec
-  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX12-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @retry_vgpr_alloc
   ; DAGISEL-GFX12-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @retry_vgpr_alloc
-  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE2:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE2:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX12-NEXT:   [[COPY15:%[0-9]+]]:vgpr_32 = COPY [[COPY7]]
   ; DAGISEL-GFX12-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY15]], implicit $exec
   ; DAGISEL-GFX12-NEXT:   [[COPY16:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
@@ -165,7 +165,7 @@ define amdgpu_cs_chain void @direct_callee_indirect_fallback(i32 inreg %exec, pt
   ; GISEL-GFX12-NEXT:   [[COPY:%[0-9]+]]:sreg_32 = COPY $sgpr0
   ; GISEL-GFX12-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
   ; GISEL-GFX12-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY $sgpr2
-  ; GISEL-GFX12-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY1]], %subreg.sub0, [[COPY2]], %subreg.sub1
+  ; GISEL-GFX12-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY1]], %subreg.sub0, [[COPY2]], %subreg.sub1
   ; GISEL-GFX12-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr3
   ; GISEL-GFX12-NEXT:   [[COPY4:%[0-9]+]]:sreg_32 = COPY $sgpr4
   ; GISEL-GFX12-NEXT:   [[COPY5:%[0-9]+]]:sreg_32 = COPY $sgpr5
@@ -190,7 +190,7 @@ define amdgpu_cs_chain void @direct_callee_indirect_fallback(i32 inreg %exec, pt
   ; GISEL-GFX12-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
   ; GISEL-GFX12-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; GISEL-GFX12-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_]], %subreg.sub0, [[S_MOV_B32_1]], %subreg.sub1
-  ; GISEL-GFX12-NEXT:   [[COPY14:%[0-9]+]]:ccr_sgpr_64 = COPY [[REG_SEQUENCE1]]
+  ; GISEL-GFX12-NEXT:   [[COPY14:%[0-9]+]]:sgpr_64 = COPY [[REG_SEQUENCE1]]
   ; GISEL-GFX12-NEXT:   SI_CS_CHAIN_TC_W32_DVGPR [[COPY14]], 0, 0, [[COPY]], [[COPY10]], -1, [[REG_SEQUENCE]], amdgpu_allvgprs, implicit $sgpr0, implicit $sgpr1, implicit $sgpr2, implicit $vgpr8, implicit $vgpr9, implicit $vgpr10, implicit $vgpr11
   ;
   ; DAGISEL-GFX12-LABEL: name: direct_callee_indirect_fallback
@@ -208,10 +208,10 @@ define amdgpu_cs_chain void @direct_callee_indirect_fallback(i32 inreg %exec, pt
   ; DAGISEL-GFX12-NEXT:   [[COPY8:%[0-9]+]]:sgpr_32 = COPY $sgpr2
   ; DAGISEL-GFX12-NEXT:   [[COPY9:%[0-9]+]]:sgpr_32 = COPY $sgpr1
   ; DAGISEL-GFX12-NEXT:   [[COPY10:%[0-9]+]]:sgpr_32 = COPY $sgpr0
-  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY9]], %subreg.sub0, [[COPY8]], %subreg.sub1
+  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY9]], %subreg.sub0, [[COPY8]], %subreg.sub1
   ; DAGISEL-GFX12-NEXT:   [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-hi) @callee
   ; DAGISEL-GFX12-NEXT:   [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 target-flags(amdgpu-abs32-lo) @callee
-  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
+  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[S_MOV_B32_1]], %subreg.sub0, killed [[S_MOV_B32_]], %subreg.sub1
   ; DAGISEL-GFX12-NEXT:   [[COPY11:%[0-9]+]]:vgpr_32 = COPY [[COPY7]]
   ; DAGISEL-GFX12-NEXT:   [[V_READFIRSTLANE_B32_:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY11]], implicit $exec
   ; DAGISEL-GFX12-NEXT:   [[COPY12:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
@@ -240,10 +240,10 @@ define amdgpu_cs_chain void @indirect_callee_indirect_fallback(i32 inreg %exec,
   ; GISEL-GFX12-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY $sgpr1
   ; GISEL-GFX12-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY $sgpr2
   ; GISEL-GFX12-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY $sgpr3
-  ; GISEL-GFX12-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY2]], %subreg.sub0, [[COPY3]], %subreg.sub1
+  ; GISEL-GFX12-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY2]], %subreg.sub0, [[COPY3]], %subreg.sub1
   ; GISEL-GFX12-NEXT:   [[COPY4:%[0-9]+]]:sreg_32 = COPY $sgpr4
   ; GISEL-GFX12-NEXT:   [[COPY5:%[0-9]+]]:sreg_32 = COPY $sgpr5
-  ; GISEL-GFX12-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY5]], %subreg.sub1
+  ; GISEL-GFX12-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY5]], %subreg.sub1
   ; GISEL-GFX12-NEXT:   [[COPY6:%[0-9]+]]:sreg_32 = COPY $sgpr6
   ; GISEL-GFX12-NEXT:   [[COPY7:%[0-9]+]]:sreg_32 = COPY $sgpr7
   ; GISEL-GFX12-NEXT:   [[COPY8:%[0-9]+]]:sreg_32 = COPY $sgpr8
@@ -285,7 +285,7 @@ define amdgpu_cs_chain void @indirect_callee_indirect_fallback(i32 inreg %exec,
   ; DAGISEL-GFX12-NEXT:   [[COPY11:%[0-9]+]]:sgpr_32 = COPY $sgpr2
   ; DAGISEL-GFX12-NEXT:   [[COPY12:%[0-9]+]]:sgpr_32 = COPY $sgpr1
   ; DAGISEL-GFX12-NEXT:   [[COPY13:%[0-9]+]]:sgpr_32 = COPY $sgpr0
-  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE [[COPY9]], %subreg.sub0, [[COPY8]], %subreg.sub1
+  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY9]], %subreg.sub0, [[COPY8]], %subreg.sub1
   ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE1:%[0-9]+]]:sgpr_64 = REG_SEQUENCE [[COPY11]], %subreg.sub0, [[COPY10]], %subreg.sub1
   ; DAGISEL-GFX12-NEXT:   [[COPY14:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE1]].sub1
   ; DAGISEL-GFX12-NEXT:   [[COPY15:%[0-9]+]]:vgpr_32 = COPY [[COPY14]]
@@ -293,7 +293,7 @@ define amdgpu_cs_chain void @indirect_callee_indirect_fallback(i32 inreg %exec,
   ; DAGISEL-GFX12-NEXT:   [[COPY16:%[0-9]+]]:sreg_32 = COPY [[REG_SEQUENCE1]].sub0
   ; DAGISEL-GFX12-NEXT:   [[COPY17:%[0-9]+]]:vgpr_32 = COPY [[COPY16]]
   ; DAGISEL-GFX12-NEXT:   [[V_READFIRSTLANE_B32_1:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 killed [[COPY17]], implicit $exec
-  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE2:%[0-9]+]]:ccr_sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
+  ; DAGISEL-GFX12-NEXT:   [[REG_SEQUENCE2:%[0-9]+]]:sgpr_64 = REG_SEQUENCE killed [[V_READFIRSTLANE_B32_1]], %subreg.sub0, killed [[V_READFIRSTLANE_B32_]], %subreg.sub1
   ; DAGISEL-GFX12-NEXT:   [[COPY18:%[0-9]+]]:vgpr_32 = COPY [[COPY7]]
   ; DAGISEL-GFX12-NEXT:   [[V_READFIRSTLANE_B32_2:%[0-9]+]]:sreg_32_xm0 = V_READFIRSTLANE_B32 [[COPY18]], implicit $exec
   ; DAGISEL-GFX12-NEXT:   [[COPY19:%[0-9]+]]:vgpr_32 = COPY [[COPY6]]
diff --git a/llvm/test/CodeGen/AMDGPU/lds-dma-waits.ll b/llvm/test/CodeGen/AMDGPU/lds-dma-waits.ll
index e4e40159e185d..2df8be55de3a8 100644
--- a/llvm/test/CodeGen/AMDGPU/lds-dma-waits.ll
+++ b/llvm/test/CodeGen/AMDGPU/lds-dma-waits.ll
@@ -1,5 +1,6 @@
-; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck %s --check-prefixes=GCN,GFX9
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck %s --check-prefixes=GCN,GFX10
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck %s --check-prefixes=GFX9
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck %s --check-prefixes=GFX10
 
 @lds.0 = internal addrspace(3) global [64 x float] poison, align 16
 @lds.1 = internal addrspace(3) global [64 x float] poison, align 16
@@ -15,13 +16,60 @@
 declare void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) nocapture, i32 %size, i32 %voffset, i32 %soffset, i32 %offset, i32 %aux)
 declare void @llvm.amdgcn.global.load.lds(ptr addrspace(1) nocapture %gptr, ptr addrspace(3) nocapture %lptr, i32 %size, i32 %offset, i32 %aux)
 
-; GCN-LABEL: {{^}}buffer_load_lds_dword_2_arrays:
-; GCN-COUNT-4: buffer_load_dword
-; GCN: s_waitcnt vmcnt(2)
-; GCN: ds_read_b32
-; GCN: s_waitcnt vmcnt(0)
-; GCN: ds_read_b32
 define amdgpu_kernel void @buffer_load_lds_dword_2_arrays(<4 x i32> %rsrc, i32 %i1, i32 %i2, ptr addrspace(1) %out) {
+; GFX9-LABEL: buffer_load_lds_dword_2_arrays:
+; GFX9:       ; %bb.0: ; %main_body
+; GFX9-NEXT:    s_load_dwordx8 s[8:15], s[4:5], 0x24
+; GFX9-NEXT:    v_mov_b32_e32 v0, 4
+; GFX9-NEXT:    s_mov_b32 m0, 0
+; GFX9-NEXT:    v_mov_b32_e32 v1, 8
+; GFX9-NEXT:    v_mov_b32_e32 v2, 0
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    buffer_load_dword off, s[8:11], 0 lds
+; GFX9-NEXT:    buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX9-NEXT:    s_movk_i32 m0, 0x100
+; GFX9-NEXT:    v_mov_b32_e32 v0, 12
+; GFX9-NEXT:    buffer_load_dword v1, s[8:11], 0 offen lds
+; GFX9-NEXT:    buffer_load_dword v0, s[8:11], 0 offen lds
+; GFX9-NEXT:    s_lshl_b32 s0, s12, 2
+; GFX9-NEXT:    s_lshl_b32 s1, s13, 2
+; GFX9-NEXT:    v_mov_b32_e32 v0, s0
+; GFX9-NEXT:    v_mov_b32_e32 v1, s1
+; GFX9-NEXT:    s_waitcnt vmcnt(2)
+; GFX9-NEXT:    ds_read_b32 v0, v0
+; GFX9-NEXT:    ; wave barrier
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    ds_read_b32 v1, v1 offset:256
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    global_store_dwordx2 v2, v[0:1], s[14:15]
+; GFX9-NEXT:    s_endpgm
+;
+; GFX10-LABEL: buffer_load_lds_dword_2_arrays:
+; GFX10:       ; %bb.0: ; %main_body
+; GFX10-NEXT:    s_load_dwordx8 s[0:7], s[4:5], 0x24
+; GFX10-NEXT:    v_mov_b32_e32 v0, 4
+; GFX10-NEXT:    v_mov_b32_e32 v1, 8
+; GFX10-NEXT:    v_mov_b32_e32 v2, 12
+; GFX10-NEXT:    s_mov_b32 m0, 0
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX10-NEXT:    buffer_load_dword v0, s[0:3], 0 offen lds
+; GFX10-NEXT:    s_movk_i32 m0, 0x100
+; GFX10-NEXT:    buffer_load_dword v1, s[0:3], 0 offen lds
+; GFX10-NEXT:    buffer_load_dword v2, s[0:3], 0 offen lds
+; GFX10-NEXT:    s_lshl_b32 s0, s4, 2
+; GFX10-NEXT:    s_lshl_b32 s1, s5, 2
+; GFX10-NEXT:    v_mov_b32_e32 v0, s0
+; GFX10-NEXT:    v_mov_b32_e32 v1, s1
+; GFX10-NEXT:    v_mov_b32_e32 v2, 0
+; GFX10-NEXT:    s_waitcnt vmcnt(2)
+; GFX10-NEXT:    ds_read_b32 v0, v0
+; GFX10-NEXT:    ; wave barrier
+; GFX10-NEXT:    s_waitcnt vmcnt(0)
+; GFX10-NEXT:    ds_read_b32 v1, v1 offset:256
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    global_store_dwordx2 v2, v[0:1], s[6:7]
+; GFX10-NEXT:    s_endpgm
 main_body:
   call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0, i32 0, i32 0)
   call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 4, i32 0, i32 0, i32 0)
@@ -41,15 +89,56 @@ main_body:
 ; On gfx9 if there is a pending FLAT operation, and this is a VMem or LGKM
 ; waitcnt and the target can report early completion, then we need to force a waitcnt 0.
 
-; GCN-LABEL: {{^}}global_load_lds_dword_2_arrays:
-; GCN-COUNT-4: global_load_dword
-; GFX9: s_waitcnt vmcnt(0)
-; GFX9-COUNT-2: ds_read_b32
-; GFX10: s_waitcnt vmcnt(2)
-; GFX10: ds_read_b32
-; GFX10: s_waitcnt vmcnt(0)
-; GFX10: ds_read_b32
 define amdgpu_kernel void @global_load_lds_dword_2_arrays(ptr addrspace(1) nocapture %gptr, i32 %i1, i32 %i2, ptr addrspace(1) %out) {
+; GFX9-LABEL: global_load_lds_dword_2_arrays:
+; GFX9:       ; %bb.0: ; %main_body
+; GFX9-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX9-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
+; GFX9-NEXT:    v_mov_b32_e32 v2, 0
+; GFX9-NEXT:    s_mov_b32 m0, 0
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    global_load_dword v2, s[0:1] lds
+; GFX9-NEXT:    global_load_dword v2, s[0:1] offset:4 lds
+; GFX9-NEXT:    s_movk_i32 m0, 0x100
+; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    global_load_dword v2, s[0:1] offset:8 lds
+; GFX9-NEXT:    global_load_dword v2, s[0:1] offset:12 lds
+; GFX9-NEXT:    s_lshl_b32 s0, s2, 2
+; GFX9-NEXT:    s_lshl_b32 s1, s3, 2
+; GFX9-NEXT:    v_mov_b32_e32 v0, s0
+; GFX9-NEXT:    v_mov_b32_e32 v1, s1
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    ds_read_b32 v0, v0
+; GFX9-NEXT:    ; wave barrier
+; GFX9-NEXT:    ds_read_b32 v1, v1 offset:256
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    global_store_dwordx2 v2, v[0:1], s[6:7]
+; GFX9-NEXT:    s_endpgm
+;
+; GFX10-LABEL: global_load_lds_dword_2_arrays:
+; GFX10:       ; %bb.0: ; %main_body
+; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX10-NEXT:    v_mov_b32_e32 v2, 0
+; GFX10-NEXT:    s_mov_b32 m0, 0
+; GFX10-NEXT:    s_load_dwordx2 s[4:5], s[4:5], 0x34
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    global_load_dword v2, s[0:1] lds
+; GFX10-NEXT:    global_load_dword v2, s[0:1] offset:4 lds
+; GFX10-NEXT:    s_movk_i32 m0, 0x100
+; GFX10-NEXT:    global_load_dword v2, s[0:1] offset:8 lds
+; GFX10-NEXT:    global_load_dword v2, s[0:1] offset:12 lds
+; GFX10-NEXT:    s_lshl_b32 s0, s2, 2
+; GFX10-NEXT:    s_lshl_b32 s1, s3, 2
+; GFX10-NEXT:    v_mov_b32_e32 v0, s0
+; GFX10-NEXT:    v_mov_b32_e32 v1, s1
+; GFX10-NEXT:    s_waitcnt vmcnt(2)
+; GFX10-NEXT:    ds_read_b32 v0, v0
+; GFX10-NEXT:    ; wave barrier
+; GFX10-NEXT:    s_waitcnt vmcnt(0)
+; GFX10-NEXT:    ds_read_b32 v1, v1 offset:256
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    global_store_dwordx2 v2, v[0:1], s[4:5]
+; GFX10-NEXT:    s_endpgm
 main_body:
   call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0)
   call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.0, i32 4, i32 4, i32 0)
@@ -68,25 +157,144 @@ main_body:
 
 ; There are 8 pseudo registers defined to track LDS DMA dependencies.
 
-; GCN-LABEL: {{^}}buffer_load_lds_dword_10_arrays:
-; GCN-COUNT-10: buffer_load_dword
-; GCN: s_waitcnt vmcnt(8)
-; GCN: ds_read_b32
-; GCN: s_waitcnt vmcnt(7)
-; GCN: ds_read_b32
-; GCN: s_waitcnt vmcnt(6)
-; GCN: ds_read_b32
-; GCN: s_waitcnt vmcnt(5)
-; GCN: ds_read_b32
-; GCN: s_waitcnt vmcnt(4)
-; GCN: ds_read_b32
-; GCN: s_waitcnt vmcnt(3)
-; GCN: ds_read_b32
-; GCN: s_waitcnt vmcnt(2)
-; GCN-NOT: s_waitcnt vmcnt
-; GCN: ds_read_b32
-; GCN: ds_read_b32
 define amdgpu_kernel void @buffer_load_lds_dword_10_arrays(<4 x i32> %rsrc, i32 %i1, i32 %i2, i32 %i3, i32 %i4, i32 %i5, i32 %i6, i32 %i7, i32 %i8, i32 %i9, ptr addrspace(1) %out) {
+; GFX9-LABEL: buffer_load_lds_dword_10_arrays:
+; GFX9:       ; %bb.0: ; %main_body
+; GFX9-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX9-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
+; GFX9-NEXT:    s_mov_b32 m0, 0
+; GFX9-NEXT:    v_mov_b32_e32 v10, 0
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX9-NEXT:    s_movk_i32 m0, 0x100
+; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX9-NEXT:    s_movk_i32 m0, 0x200
+; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX9-NEXT:    s_movk_i32 m0, 0x300
+; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX9-NEXT:    s_movk_i32 m0, 0x400
+; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX9-NEXT:    s_movk_i32 m0, 0x500
+; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX9-NEXT:    s_movk_i32 m0, 0x600
+; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX9-NEXT:    s_movk_i32 m0, 0x700
+; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX9-NEXT:    s_movk_i32 m0, 0x800
+; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX9-NEXT:    s_movk_i32 m0, 0x900
+; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX9-NEXT:    s_lshl_b32 s2, s6, 2
+; GFX9-NEXT:    s_lshl_b32 s3, s7, 2
+; GFX9-NEXT:    v_mov_b32_e32 v0, s2
+; GFX9-NEXT:    v_mov_b32_e32 v9, s3
+; GFX9-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x5c
+; GFX9-NEXT:    s_waitcnt vmcnt(9)
+; GFX9-NEXT:    ds_read_b32 v0, v0
+; GFX9-NEXT:    ; wave barrier
+; GFX9-NEXT:    s_waitcnt vmcnt(8)
+; GFX9-NEXT:    ds_read_b32 v1, v9 offset:256
+; GFX9-NEXT:    ; wave barrier
+; GFX9-NEXT:    s_waitcnt vmcnt(7)
+; GFX9-NEXT:    ds_read_b32 v2, v9 offset:512
+; GFX9-NEXT:    ; wave barrier
+; GFX9-NEXT:    s_waitcnt vmcnt(6)
+; GFX9-NEXT:    ds_read_b32 v3, v9 offset:768
+; GFX9-NEXT:    ; wave barrier
+; GFX9-NEXT:    s_waitcnt vmcnt(5)
+; GFX9-NEXT:    ds_read_b32 v4, v9 offset:1024
+; GFX9-NEXT:    ; wave barrier
+; GFX9-NEXT:    s_waitcnt vmcnt(4)
+; GFX9-NEXT:    ds_read_b32 v5, v9 offset:1280
+; GFX9-NEXT:    ; wave barrier
+; GFX9-NEXT:    s_waitcnt vmcnt(3)
+; GFX9-NEXT:    ds_read_b32 v6, v9 offset:1536
+; GFX9-NEXT:    ; wave barrier
+; GFX9-NEXT:    s_waitcnt vmcnt(2)
+; GFX9-NEXT:    ds_read_b32 v7, v9 offset:1792
+; GFX9-NEXT:    ; wave barrier
+; GFX9-NEXT:    ds_read_b32 v8, v9 offset:2048
+; GFX9-NEXT:    ; wave barrier
+; GFX9-NEXT:    ds_read_b32 v9, v9 offset:2304
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    global_store_dwordx4 v10, v[0:3], s[0:1]
+; GFX9-NEXT:    global_store_dwordx4 v10, v[4:7], s[0:1] offset:16
+; GFX9-NEXT:    global_store_dwordx2 v10, v[8:9], s[0:1] offset:32
+; GFX9-NEXT:    s_endpgm
+;
+; GFX10-LABEL: buffer_load_lds_dword_10_arrays:
+; GFX10:       ; %bb.0: ; %main_body
+; GFX10-NEXT:    s_clause 0x1
+; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX10-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
+; GFX10-NEXT:    s_mov_b32 m0, 0
+; GFX10-NEXT:    v_mov_b32_e32 v10, 0
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX10-NEXT:    s_movk_i32 m0, 0x100
+; GFX10-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX10-NEXT:    s_movk_i32 m0, 0x200
+; GFX10-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX10-NEXT:    s_movk_i32 m0, 0x300
+; GFX10-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX10-NEXT:    s_movk_i32 m0, 0x400
+; GFX10-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX10-NEXT:    s_movk_i32 m0, 0x500
+; GFX10-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX10-NEXT:    s_movk_i32 m0, 0x600
+; GFX10-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX10-NEXT:    s_movk_i32 m0, 0x700
+; GFX10-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX10-NEXT:    s_movk_i32 m0, 0x800
+; GFX10-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX10-NEXT:    s_movk_i32 m0, 0x900
+; GFX10-NEXT:    buffer_load_dword off, s[0:3], 0 lds
+; GFX10-NEXT:    s_lshl_b32 s0, s6, 2
+; GFX10-NEXT:    s_lshl_b32 s1, s7, 2
+; GFX10-NEXT:    v_mov_b32_e32 v0, s0
+; GFX10-NEXT:    v_mov_b32_e32 v9, s1
+; GFX10-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x5c
+; GFX10-NEXT:    s_waitcnt vmcnt(9)
+; GFX10-NEXT:    ds_read_b32 v0, v0
+; GFX10-NEXT:    ; wave barrier
+; GFX10-NEXT:    s_waitcnt vmcnt(8)
+; GFX10-NEXT:    ds_read_b32 v1, v9 offset:256
+; GFX10-NEXT:    ; wave barrier
+; GFX10-NEXT:    s_waitcnt vmcnt(7)
+; GFX10-NEXT:    ds_read_b32 v2, v9 offset:512
+; GFX10-NEXT:    ; wave barrier
+; GFX10-NEXT:    s_waitcnt vmcnt(6)
+; GFX10-NEXT:    ds_read_b32 v3, v9 offset:768
+; GFX10-NEXT:    ; wave barrier
+; GFX10-NEXT:    s_waitcnt vmcnt(5)
+; GFX10-NEXT:    ds_read_b32 v4, v9 offset:1024
+; GFX10-NEXT:    ; wave barrier
+; GFX10-NEXT:    s_waitcnt vmcnt(4)
+; GFX10-NEXT:    ds_read_b32 v5, v9 offset:1280
+; GFX10-NEXT:    ; wave barrier
+; GFX10-NEXT:    s_waitcnt vmcnt(3)
+; GFX10-NEXT:    ds_read_b32 v6, v9 offset:1536
+; GFX10-NEXT:    ; wave barrier
+; GFX10-NEXT:    s_waitcnt vmcnt(2)
+; GFX10-NEXT:    ds_read_b32 v7, v9 offset:1792
+; GFX10-NEXT:    ; wave barrier
+; GFX10-NEXT:    ds_read_b32 v8, v9 offset:2048
+; GFX10-NEXT:    ; wave barrier
+; GFX10-NEXT:    ds_read_b32 v9, v9 offset:2304
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    global_store_dwordx4 v10, v[0:3], s[0:1]
+; GFX10-NEXT:    global_store_dwordx4 v10, v[4:7], s[0:1] offset:16
+; GFX10-NEXT:    global_store_dwordx2 v10, v[8:9], s[0:1] offset:32
+; GFX10-NEXT:    s_endpgm
 main_body:
   call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0, i32 0, i32 0)
   call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) @lds.1, i32 4, i32 0, i32 0, i32 0, i32 0)
@@ -151,14 +359,49 @@ main_body:
 
 define amdgpu_kernel void @global_load_lds_no_alias_ds_read(ptr addrspace(1) nocapture %gptr, i32 %i1, i32 %i2, ptr addrspace(1) %out) {
 ; GFX9-LABEL: global_load_lds_no_alias_ds_read:
-; GFX9: global_load_dword
-; GFX9: global_load_dword
-; GFX9: s_waitcnt vmcnt(1)
-; GFX9-NOT: s_waitcnt vmcnt(0)
-; GFX9: ds_read_b32
-; GFX9: s_waitcnt vmcnt(0)
-; GFX9: ds_read_b32
-; GFX9: s_endpgm
+; GFX9:       ; %bb.0: ; %body
+; GFX9-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX9-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
+; GFX9-NEXT:    v_mov_b32_e32 v2, 0
+; GFX9-NEXT:    s_mov_b32 m0, 0
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    global_load_dword v2, s[0:1] lds
+; GFX9-NEXT:    s_movk_i32 m0, 0x100
+; GFX9-NEXT:    s_nop 0
+; GFX9-NEXT:    global_load_dword v2, s[0:1] offset:4 lds
+; GFX9-NEXT:    s_lshl_b32 s0, s2, 2
+; GFX9-NEXT:    v_mov_b32_e32 v0, s0
+; GFX9-NEXT:    s_lshl_b32 s0, s3, 2
+; GFX9-NEXT:    v_mov_b32_e32 v1, s0
+; GFX9-NEXT:    s_waitcnt vmcnt(1)
+; GFX9-NEXT:    ds_read_b32 v0, v0 offset:512
+; GFX9-NEXT:    s_waitcnt vmcnt(0)
+; GFX9-NEXT:    ds_read_b32 v1, v1 offset:768
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    global_store_dwordx2 v2, v[0:1], s[6:7]
+; GFX9-NEXT:    s_endpgm
+;
+; GFX10-LABEL: global_load_lds_no_alias_ds_read:
+; GFX10:       ; %bb.0: ; %body
+; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
+; GFX10-NEXT:    v_mov_b32_e32 v2, 0
+; GFX10-NEXT:    s_mov_b32 m0, 0
+; GFX10-NEXT:    s_load_dwordx2 s[4:5], s[4:5], 0x34
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    global_load_dword v2, s[0:1] lds
+; GFX10-NEXT:    s_movk_i32 m0, 0x100
+; GFX10-NEXT:    global_load_dword v2, s[0:1] offset:4 lds
+; GFX10-NEXT:    s_lshl_b32 s0, s2, 2
+; GFX10-NEXT:    s_waitcnt vmcnt(1) lgkmcnt(15)
+; GFX10-NEXT:    v_mov_b32_e32 v0, s0
+; GFX10-NEXT:    s_lshl_b32 s0, s3, 2
+; GFX10-NEXT:    v_mov_b32_e32 v1, s0
+; GFX10-NEXT:    ds_read_b32 v0, v0 offset:512
+; GFX10-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(15)
+; GFX10-NEXT:    ds_read_b32 v1, v1 offset:768
+; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-NEXT:    global_store_dwordx2 v2, v[0:1], s[4:5]
+; GFX10-NEXT:    s_endpgm
 body:
   call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.0, i32 4, i32 0, i32 0)
   call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) @lds.1, i32 4, i32 4, i32 0)
diff --git a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
index 1f7888a633d62..8364e680bc8c7 100644
--- a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
+++ b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
@@ -14,9 +14,11 @@
 ; REQUIRES: asserts
 
 ; GCN-O0:Target Library Information
+; GCN-O0-NEXT:Runtime Library Function Analysis
 ; GCN-O0-NEXT:Target Pass Configuration
 ; GCN-O0-NEXT:Machine Module Information
 ; GCN-O0-NEXT:Target Transform Information
+; GCN-O0-NEXT:Library Function Lowering Analysis
 ; GCN-O0-NEXT:Assumption Cache Tracker
 ; GCN-O0-NEXT:Profile summary info
 ; GCN-O0-NEXT:Argument Register Usage Information Storage
@@ -161,9 +163,11 @@
 ; GCN-O0-NEXT:        Free MachineFunction
 
 ; GCN-O1:Target Library Information
+; GCN-O1-NEXT:Runtime Library Function Analysis
 ; GCN-O1-NEXT:Target Pass Configuration
 ; GCN-O1-NEXT:Machine Module Information
 ; GCN-O1-NEXT:Target Transform Information
+; GCN-O1-NEXT:Library Function Lowering Analysis
 ; GCN-O1-NEXT:Assumption Cache Tracker
 ; GCN-O1-NEXT:Profile summary info
 ; GCN-O1-NEXT:AMDGPU Address space based Alias Analysis
@@ -453,9 +457,11 @@
 ; GCN-O1-NEXT:        Free MachineFunction
 
 ; GCN-O1-OPTS:Target Library Information
+; GCN-O1-OPTS-NEXT:Runtime Library Function Analysis
 ; GCN-O1-OPTS-NEXT:Target Pass Configuration
 ; GCN-O1-OPTS-NEXT:Machine Module Information
 ; GCN-O1-OPTS-NEXT:Target Transform Information
+; GCN-O1-OPTS-NEXT:Library Function Lowering Analysis
 ; GCN-O1-OPTS-NEXT:Assumption Cache Tracker
 ; GCN-O1-OPTS-NEXT:Profile summary info
 ; GCN-O1-OPTS-NEXT:AMDGPU Address space based Alias Analysis
@@ -773,9 +779,11 @@
 ; GCN-O1-OPTS-NEXT:        Free MachineFunction
 
 ; GCN-O2:Target Library Information
+; GCN-O2-NEXT:Runtime Library Function Analysis
 ; GCN-O2-NEXT:Target Pass Configuration
 ; GCN-O2-NEXT:Machine Module Information
 ; GCN-O2-NEXT:Target Transform Information
+; GCN-O2-NEXT:Library Function Lowering Analysis
 ; GCN-O2-NEXT:Assumption Cache Tracker
 ; GCN-O2-NEXT:Profile summary info
 ; GCN-O2-NEXT:AMDGPU Address space based Alias Analysis
@@ -1098,9 +1106,11 @@
 ; GCN-O2-NEXT:        Free MachineFunction
 
 ; GCN-O3:Target Library Information
+; GCN-O3-NEXT:Runtime Library Function Analysis
 ; GCN-O3-NEXT:Target Pass Configuration
 ; GCN-O3-NEXT:Machine Module Information
 ; GCN-O3-NEXT:Target Transform Information
+; GCN-O3-NEXT:Library Function Lowering Analysis
 ; GCN-O3-NEXT:Assumption Cache Tracker
 ; GCN-O3-NEXT:Profile summary info
 ; GCN-O3-NEXT:AMDGPU Address space based Alias Analysis
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.add.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.add.ll
index 83b0847cd2589..e58bf6280a1f2 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.add.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.add.ll
@@ -2019,1007 +2019,6 @@ endif:
   store i64 %combine, ptr addrspace(1) %out
   ret void
 }
-
-define amdgpu_kernel void @uniform_value_float(ptr addrspace(1) %out, float %in) {
-; GFX8DAGISEL-LABEL: uniform_value_float:
-; GFX8DAGISEL:       ; %bb.0: ; %entry
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX8DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX8DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
-; GFX8DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8DAGISEL-NEXT:    s_endpgm
-;
-; GFX8GISEL-LABEL: uniform_value_float:
-; GFX8GISEL:       ; %bb.0: ; %entry
-; GFX8GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX8GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX8GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
-; GFX8GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8GISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8GISEL-NEXT:    s_endpgm
-;
-; GFX9DAGISEL-LABEL: uniform_value_float:
-; GFX9DAGISEL:       ; %bb.0: ; %entry
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX9DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX9DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
-; GFX9DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX9DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX9DAGISEL-NEXT:    s_endpgm
-;
-; GFX9GISEL-LABEL: uniform_value_float:
-; GFX9GISEL:       ; %bb.0: ; %entry
-; GFX9GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX9GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX9GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
-; GFX9GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9GISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX9GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX9GISEL-NEXT:    s_endpgm
-;
-; GFX1064DAGISEL-LABEL: uniform_value_float:
-; GFX1064DAGISEL:       ; %bb.0: ; %entry
-; GFX1064DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1064DAGISEL-NEXT:    s_bcnt1_i32_b64 s3, s[0:1]
-; GFX1064DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1064DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
-; GFX1064DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
-; GFX1064DAGISEL-NEXT:    s_endpgm
-;
-; GFX1064GISEL-LABEL: uniform_value_float:
-; GFX1064GISEL:       ; %bb.0: ; %entry
-; GFX1064GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX1064GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1064GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
-; GFX1064GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1064GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064GISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX1064GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX1064GISEL-NEXT:    s_endpgm
-;
-; GFX1032DAGISEL-LABEL: uniform_value_float:
-; GFX1032DAGISEL:       ; %bb.0: ; %entry
-; GFX1032DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1032DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s0
-; GFX1032DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1032DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
-; GFX1032DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
-; GFX1032DAGISEL-NEXT:    s_endpgm
-;
-; GFX1032GISEL-LABEL: uniform_value_float:
-; GFX1032GISEL:       ; %bb.0: ; %entry
-; GFX1032GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX1032GISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1032GISEL-NEXT:    s_bcnt1_i32_b32 s0, s0
-; GFX1032GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1032GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032GISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX1032GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX1032GISEL-NEXT:    s_endpgm
-;
-; GFX1164DAGISEL-LABEL: uniform_value_float:
-; GFX1164DAGISEL:       ; %bb.0: ; %entry
-; GFX1164DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    s_bcnt1_i32_b64 s3, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1164DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
-; GFX1164DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_endpgm
-;
-; GFX1164GISEL-LABEL: uniform_value_float:
-; GFX1164GISEL:       ; %bb.0: ; %entry
-; GFX1164GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1164GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164GISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX1164GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1164GISEL-NEXT:    s_endpgm
-;
-; GFX1132DAGISEL-LABEL: uniform_value_float:
-; GFX1132DAGISEL:       ; %bb.0: ; %entry
-; GFX1132DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s0
-; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1132DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1132DAGISEL-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
-; GFX1132DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
-; GFX1132DAGISEL-NEXT:    s_endpgm
-;
-; GFX1132GISEL-LABEL: uniform_value_float:
-; GFX1132GISEL:       ; %bb.0: ; %entry
-; GFX1132GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1132GISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1132GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1132GISEL-NEXT:    s_bcnt1_i32_b32 s0, s0
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132GISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1132GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX1132GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1132GISEL-NEXT:    s_endpgm
-;
-; GFX12DAGISEL-LABEL: uniform_value_float:
-; GFX12DAGISEL:       ; %bb.0: ; %entry
-; GFX12DAGISEL-NEXT:    s_load_b96 s[0:2], s[4:5], 0x24
-; GFX12DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX12DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
-; GFX12DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
-; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_va_sdst(0)
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
-; GFX12DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
-; GFX12DAGISEL-NEXT:    s_endpgm
-entry:
-  %result = call float @llvm.amdgcn.wave.reduce.fadd(float %in, i32 1)
-  store float %result, ptr addrspace(1) %out
-  ret void
-}
-
-define void @divergent_value_float(ptr addrspace(1) %out, float %id.x) {
-; GFX8DAGISEL-LABEL: divergent_value_float:
-; GFX8DAGISEL:       ; %bb.0: ; %entry
-; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX8DAGISEL-NEXT:    s_brev_b32 s6, 1
-; GFX8DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX8DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX8DAGISEL-NEXT:    v_add_f32_e32 v3, s6, v3
-; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX8DAGISEL-NEXT:  ; %bb.2:
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX8DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8GISEL-LABEL: divergent_value_float:
-; GFX8GISEL:       ; %bb.0: ; %entry
-; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8GISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX8GISEL-NEXT:    s_brev_b32 s6, 1
-; GFX8GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX8GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX8GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX8GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX8GISEL-NEXT:    v_add_f32_e32 v3, s6, v3
-; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX8GISEL-NEXT:  ; %bb.2:
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX8GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX9DAGISEL-LABEL: divergent_value_float:
-; GFX9DAGISEL:       ; %bb.0: ; %entry
-; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX9DAGISEL-NEXT:    s_brev_b32 s6, 1
-; GFX9DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX9DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX9DAGISEL-NEXT:    v_add_f32_e32 v3, s6, v3
-; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX9DAGISEL-NEXT:  ; %bb.2:
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX9DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX9DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX9GISEL-LABEL: divergent_value_float:
-; GFX9GISEL:       ; %bb.0: ; %entry
-; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9GISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX9GISEL-NEXT:    s_brev_b32 s6, 1
-; GFX9GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX9GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX9GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX9GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX9GISEL-NEXT:    v_add_f32_e32 v3, s6, v3
-; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX9GISEL-NEXT:  ; %bb.2:
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX9GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX9GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1064DAGISEL-LABEL: divergent_value_float:
-; GFX1064DAGISEL:       ; %bb.0: ; %entry
-; GFX1064DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX1064DAGISEL-NEXT:    s_brev_b32 s6, 1
-; GFX1064DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX1064DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX1064DAGISEL-NEXT:    v_add_f32_e64 v3, s6, s8
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1064DAGISEL-NEXT:  ; %bb.2:
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX1064DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1064DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1064GISEL-LABEL: divergent_value_float:
-; GFX1064GISEL:       ; %bb.0: ; %entry
-; GFX1064GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1064GISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX1064GISEL-NEXT:    s_brev_b32 s6, 1
-; GFX1064GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX1064GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX1064GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX1064GISEL-NEXT:    v_add_f32_e64 v3, s6, s8
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1064GISEL-NEXT:  ; %bb.2:
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX1064GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1064GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1032DAGISEL-LABEL: divergent_value_float:
-; GFX1032DAGISEL:       ; %bb.0: ; %entry
-; GFX1032DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s4, exec_lo
-; GFX1032DAGISEL-NEXT:    s_brev_b32 s5, 1
-; GFX1032DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s6, s4
-; GFX1032DAGISEL-NEXT:    v_readlane_b32 s7, v2, s6
-; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s4, s6
-; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s4, 0
-; GFX1032DAGISEL-NEXT:    v_add_f32_e64 v3, s5, s7
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s5, v3
-; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1032DAGISEL-NEXT:  ; %bb.2:
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v2, s5
-; GFX1032DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1032DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1032GISEL-LABEL: divergent_value_float:
-; GFX1032GISEL:       ; %bb.0: ; %entry
-; GFX1032GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1032GISEL-NEXT:    s_mov_b32 s4, exec_lo
-; GFX1032GISEL-NEXT:    s_brev_b32 s5, 1
-; GFX1032GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s6, s4
-; GFX1032GISEL-NEXT:    v_readlane_b32 s7, v2, s6
-; GFX1032GISEL-NEXT:    s_bitset0_b32 s4, s6
-; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s4, 0
-; GFX1032GISEL-NEXT:    v_add_f32_e64 v3, s5, s7
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s5, v3
-; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1032GISEL-NEXT:  ; %bb.2:
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v2, s5
-; GFX1032GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1032GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1164DAGISEL-LABEL: divergent_value_float:
-; GFX1164DAGISEL:       ; %bb.0: ; %entry
-; GFX1164DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164DAGISEL-NEXT:    s_brev_b32 s2, 1
-; GFX1164DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164DAGISEL-NEXT:    v_readlane_b32 s4, v2, s3
-; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[0:1], s3
-; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
-; GFX1164DAGISEL-NEXT:    v_add_f32_e64 v3, s2, s4
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s2, v3
-; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1164DAGISEL-NEXT:  ; %bb.2:
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX1164DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1164DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1164GISEL-LABEL: divergent_value_float:
-; GFX1164GISEL:       ; %bb.0: ; %entry
-; GFX1164GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164GISEL-NEXT:    s_brev_b32 s2, 1
-; GFX1164GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164GISEL-NEXT:    v_readlane_b32 s4, v2, s3
-; GFX1164GISEL-NEXT:    s_bitset0_b64 s[0:1], s3
-; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
-; GFX1164GISEL-NEXT:    v_add_f32_e64 v3, s2, s4
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s2, v3
-; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1164GISEL-NEXT:  ; %bb.2:
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX1164GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1164GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1132DAGISEL-LABEL: divergent_value_float:
-; GFX1132DAGISEL:       ; %bb.0: ; %entry
-; GFX1132DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1132DAGISEL-NEXT:    s_brev_b32 s1, 1
-; GFX1132DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s2, s0
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    v_readlane_b32 s3, v2, s2
-; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s0, s2
-; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s0, 0
-; GFX1132DAGISEL-NEXT:    v_add_f32_e64 v3, s1, s3
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s1, v3
-; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1132DAGISEL-NEXT:  ; %bb.2:
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
-; GFX1132DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1132DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1132GISEL-LABEL: divergent_value_float:
-; GFX1132GISEL:       ; %bb.0: ; %entry
-; GFX1132GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1132GISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1132GISEL-NEXT:    s_brev_b32 s1, 1
-; GFX1132GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s2, s0
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132GISEL-NEXT:    v_readlane_b32 s3, v2, s2
-; GFX1132GISEL-NEXT:    s_bitset0_b32 s0, s2
-; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s0, 0
-; GFX1132GISEL-NEXT:    v_add_f32_e64 v3, s1, s3
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s1, v3
-; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1132GISEL-NEXT:  ; %bb.2:
-; GFX1132GISEL-NEXT:    v_mov_b32_e32 v2, s1
-; GFX1132GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1132GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX12DAGISEL-LABEL: divergent_value_float:
-; GFX12DAGISEL:       ; %bb.0: ; %entry
-; GFX12DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
-; GFX12DAGISEL-NEXT:    s_wait_expcnt 0x0
-; GFX12DAGISEL-NEXT:    s_wait_samplecnt 0x0
-; GFX12DAGISEL-NEXT:    s_wait_bvhcnt 0x0
-; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
-; GFX12DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX12DAGISEL-NEXT:    s_brev_b32 s1, 1
-; GFX12DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; GFX12DAGISEL-NEXT:    s_ctz_i32_b32 s2, s0
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; GFX12DAGISEL-NEXT:    v_readlane_b32 s3, v2, s2
-; GFX12DAGISEL-NEXT:    s_bitset0_b32 s0, s2
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; GFX12DAGISEL-NEXT:    s_cmp_lg_u32 s0, 0
-; GFX12DAGISEL-NEXT:    v_add_f32_e64 v3, s1, s3
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s1, v3
-; GFX12DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX12DAGISEL-NEXT:  ; %bb.2:
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_va_sdst(0)
-; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
-; GFX12DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX12DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-entry:
-  %result = call float @llvm.amdgcn.wave.reduce.fadd(float %id.x, i32 1)
-  store float %result, ptr addrspace(1) %out
-  ret void
-}
-
-define amdgpu_kernel void @divergent_cfg_float(ptr addrspace(1) %out, float %in, float %in2) {
-; GFX8DAGISEL-LABEL: divergent_cfg_float:
-; GFX8DAGISEL:       ; %bb.0: ; %entry
-; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX8DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v0
-; GFX8DAGISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX8DAGISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
-; GFX8DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX8DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX8DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX8DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX8DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX8DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
-; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX8DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX8DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX8DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX8DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v1, s0
-; GFX8DAGISEL-NEXT:    flat_store_dword v[1:2], v0
-; GFX8DAGISEL-NEXT:    s_endpgm
-;
-; GFX8GISEL-LABEL: divergent_cfg_float:
-; GFX8GISEL:       ; %bb.0: ; %entry
-; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX8GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v0
-; GFX8GISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX8GISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
-; GFX8GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX8GISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX8GISEL-NEXT:  ; %bb.1: ; %else
-; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX8GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8GISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX8GISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX8GISEL-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
-; GFX8GISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX8GISEL-NEXT:  ; %bb.3: ; %if
-; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX8GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX8GISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX8GISEL-NEXT:  .LBB8_4: ; %endif
-; GFX8GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8GISEL-NEXT:    s_endpgm
-;
-; GFX9DAGISEL-LABEL: divergent_cfg_float:
-; GFX9DAGISEL:       ; %bb.0: ; %entry
-; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX9DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v0
-; GFX9DAGISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX9DAGISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
-; GFX9DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX9DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX9DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX9DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX9DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX9DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
-; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX9DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX9DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX9DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX9DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX9DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX9DAGISEL-NEXT:    s_endpgm
-;
-; GFX9GISEL-LABEL: divergent_cfg_float:
-; GFX9GISEL:       ; %bb.0: ; %entry
-; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX9GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v0
-; GFX9GISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX9GISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
-; GFX9GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX9GISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX9GISEL-NEXT:  ; %bb.1: ; %else
-; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX9GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9GISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX9GISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX9GISEL-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
-; GFX9GISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX9GISEL-NEXT:  ; %bb.3: ; %if
-; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX9GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX9GISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX9GISEL-NEXT:  .LBB8_4: ; %endif
-; GFX9GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX9GISEL-NEXT:    s_endpgm
-;
-; GFX1064DAGISEL-LABEL: divergent_cfg_float:
-; GFX1064DAGISEL:       ; %bb.0: ; %entry
-; GFX1064DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX1064DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v0
-; GFX1064DAGISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX1064DAGISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
-; GFX1064DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1064DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX1064DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX1064DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1064DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX1064DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
-; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1064DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX1064DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1064DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX1064DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1064DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX1064DAGISEL-NEXT:    s_endpgm
-;
-; GFX1064GISEL-LABEL: divergent_cfg_float:
-; GFX1064GISEL:       ; %bb.0: ; %entry
-; GFX1064GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX1064GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v0
-; GFX1064GISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX1064GISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
-; GFX1064GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1064GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX1064GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064GISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX1064GISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1064GISEL-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
-; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1064GISEL-NEXT:  ; %bb.3: ; %if
-; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX1064GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1064GISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX1064GISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1064GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX1064GISEL-NEXT:    s_endpgm
-;
-; GFX1032DAGISEL-LABEL: divergent_cfg_float:
-; GFX1032DAGISEL:       ; %bb.0: ; %entry
-; GFX1032DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX1032DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc_lo, 15, v0
-; GFX1032DAGISEL-NEXT:    ; implicit-def: $sgpr3
-; GFX1032DAGISEL-NEXT:    s_and_saveexec_b32 s2, vcc_lo
-; GFX1032DAGISEL-NEXT:    s_xor_b32 s2, exec_lo, s2
-; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1032DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
-; GFX1032DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
-; GFX1032DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s3, v0
-; GFX1032DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    s_or_saveexec_b32 s0, s2
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v0, s3
-; GFX1032DAGISEL-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
-; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1032DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1032DAGISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX1032DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX1032DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s1, v0
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v0, s1
-; GFX1032DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1032DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; GFX1032DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX1032DAGISEL-NEXT:    s_endpgm
-;
-; GFX1032GISEL-LABEL: divergent_cfg_float:
-; GFX1032GISEL:       ; %bb.0: ; %entry
-; GFX1032GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX1032GISEL-NEXT:    v_cmp_le_u32_e32 vcc_lo, 16, v0
-; GFX1032GISEL-NEXT:    ; implicit-def: $sgpr2
-; GFX1032GISEL-NEXT:    s_and_saveexec_b32 s3, vcc_lo
-; GFX1032GISEL-NEXT:    s_xor_b32 s3, exec_lo, s3
-; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1032GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1032GISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1032GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX1032GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032GISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1032GISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032GISEL-NEXT:    s_andn2_saveexec_b32 s0, s3
-; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1032GISEL-NEXT:  ; %bb.3: ; %if
-; GFX1032GISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1032GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX1032GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX1032GISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1032GISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1032GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; GFX1032GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX1032GISEL-NEXT:    s_endpgm
-;
-; GFX1164DAGISEL-LABEL: divergent_cfg_float:
-; GFX1164DAGISEL:       ; %bb.0: ; %entry
-; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
-; GFX1164DAGISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
-; GFX1164DAGISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_cmpx_lt_u32_e32 15, v0
-; GFX1164DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1164DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1164DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX1164DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX1164DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1164DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX1164DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
-; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1164DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1164DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX1164DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1164DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_endpgm
-;
-; GFX1164GISEL-LABEL: divergent_cfg_float:
-; GFX1164GISEL:       ; %bb.0: ; %entry
-; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
-; GFX1164GISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
-; GFX1164GISEL-NEXT:    s_mov_b64 s[2:3], exec
-; GFX1164GISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_cmpx_le_u32_e32 16, v0
-; GFX1164GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1164GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1164GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1164GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX1164GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX1164GISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1164GISEL-NEXT:    s_and_not1_saveexec_b64 s[2:3], s[2:3]
-; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1164GISEL-NEXT:  ; %bb.3: ; %if
-; GFX1164GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1164GISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX1164GISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1164GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1164GISEL-NEXT:    s_endpgm
-;
-; GFX1132DAGISEL-LABEL: divergent_cfg_float:
-; GFX1132DAGISEL:       ; %bb.0: ; %entry
-; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
-; GFX1132DAGISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1132DAGISEL-NEXT:    ; implicit-def: $sgpr3
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_cmpx_lt_u32_e32 15, v0
-; GFX1132DAGISEL-NEXT:    s_xor_b32 s2, exec_lo, s2
-; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1132DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
-; GFX1132DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s3, v0
-; GFX1132DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    s_or_saveexec_b32 s0, s2
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v0, s3
-; GFX1132DAGISEL-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
-; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1132DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX1132DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s1, v0
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v0, s1
-; GFX1132DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1132DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1132DAGISEL-NEXT:    s_endpgm
-;
-; GFX1132GISEL-LABEL: divergent_cfg_float:
-; GFX1132GISEL:       ; %bb.0: ; %entry
-; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
-; GFX1132GISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
-; GFX1132GISEL-NEXT:    s_mov_b32 s3, exec_lo
-; GFX1132GISEL-NEXT:    ; implicit-def: $sgpr2
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_cmpx_le_u32_e32 16, v0
-; GFX1132GISEL-NEXT:    s_xor_b32 s3, exec_lo, s3
-; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1132GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1132GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX1132GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1132GISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132GISEL-NEXT:    s_and_not1_saveexec_b32 s0, s3
-; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1132GISEL-NEXT:  ; %bb.3: ; %if
-; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1132GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX1132GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1132GISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1132GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_dual_mov_b32 v0, s2 :: v_dual_mov_b32 v1, 0
-; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1132GISEL-NEXT:    s_endpgm
-;
-; GFX12DAGISEL-LABEL: divergent_cfg_float:
-; GFX12DAGISEL:       ; %bb.0: ; %entry
-; GFX12DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
-; GFX12DAGISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
-; GFX12DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX12DAGISEL-NEXT:    ; implicit-def: $sgpr3
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_cmpx_lt_u32_e32 15, v0
-; GFX12DAGISEL-NEXT:    s_xor_b32 s2, exec_lo, s2
-; GFX12DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX12DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX12DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX12DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
-; GFX12DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
-; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s3, v0
-; GFX12DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
-; GFX12DAGISEL-NEXT:    s_or_saveexec_b32 s0, s2
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v0, s3
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; GFX12DAGISEL-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
-; GFX12DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX12DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX12DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; GFX12DAGISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; GFX12DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
-; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s1, v0
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_va_sdst(0)
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v0, s1
-; GFX12DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX12DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; GFX12DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
-; GFX12DAGISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX12DAGISEL-NEXT:    s_endpgm
-entry:
-  %tid = call i32 @llvm.amdgcn.workitem.id.x()
-  %d_cmp = icmp ult i32 %tid, 16
-  br i1 %d_cmp, label %if, label %else
-
-if:
-  %reducedValTid = call float @llvm.amdgcn.wave.reduce.fadd(float %in2, i32 1)
-  br label %endif
-
-else:
-  %reducedValIn = call float @llvm.amdgcn.wave.reduce.fadd(float %in, i32 1)
-  br label %endif
-
-endif:
-  %combine = phi float [%reducedValTid, %if], [%reducedValIn, %else]
-  store float %combine, ptr addrspace(1) %out
-  ret void
-}
 ;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
 ; GFX10DAGISEL: {{.*}}
 ; GFX10GISEL: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fadd.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fadd.ll
new file mode 100644
index 0000000000000..5d408dc65d68b
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fadd.ll
@@ -0,0 +1,1021 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=amdgcn -mcpu=tonga -global-isel=0 < %s | FileCheck  -check-prefixes=GFX8DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga -global-isel=1 < %s | FileCheck  -check-prefixes=GFX8GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=0 < %s | FileCheck  -check-prefixes=GFX9DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=1 < %s | FileCheck  -check-prefixes=GFX9GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=0 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX10DAGISEL,GFX1064DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=1 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX10GISEL,GFX1064GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=0 < %s | FileCheck -check-prefixes=GFX10DAGISEL,GFX1032DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=1 < %s | FileCheck -check-prefixes=GFX10GISEL,GFX1032GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX11DAGISEL,GFX1164DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX11GISEL,GFX1164GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 < %s | FileCheck -check-prefixes=GFX11DAGISEL,GFX1132DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 < %s | FileCheck -check-prefixes=GFX11GISEL,GFX1132GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -global-isel=0 < %s | FileCheck -check-prefixes=GFX12DAGISEL %s
+
+
+define amdgpu_kernel void @uniform_value_float(ptr addrspace(1) %out, float %in) {
+; GFX8DAGISEL-LABEL: uniform_value_float:
+; GFX8DAGISEL:       ; %bb.0: ; %entry
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX8DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX8DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
+; GFX8DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8DAGISEL-NEXT:    s_endpgm
+;
+; GFX8GISEL-LABEL: uniform_value_float:
+; GFX8GISEL:       ; %bb.0: ; %entry
+; GFX8GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX8GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX8GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
+; GFX8GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8GISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8GISEL-NEXT:    s_endpgm
+;
+; GFX9DAGISEL-LABEL: uniform_value_float:
+; GFX9DAGISEL:       ; %bb.0: ; %entry
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX9DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX9DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
+; GFX9DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX9DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX9DAGISEL-NEXT:    s_endpgm
+;
+; GFX9GISEL-LABEL: uniform_value_float:
+; GFX9GISEL:       ; %bb.0: ; %entry
+; GFX9GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX9GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX9GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
+; GFX9GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9GISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX9GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX9GISEL-NEXT:    s_endpgm
+;
+; GFX1064DAGISEL-LABEL: uniform_value_float:
+; GFX1064DAGISEL:       ; %bb.0: ; %entry
+; GFX1064DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1064DAGISEL-NEXT:    s_bcnt1_i32_b64 s3, s[0:1]
+; GFX1064DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1064DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX1064DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX1064DAGISEL-NEXT:    s_endpgm
+;
+; GFX1064GISEL-LABEL: uniform_value_float:
+; GFX1064GISEL:       ; %bb.0: ; %entry
+; GFX1064GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX1064GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1064GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
+; GFX1064GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1064GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064GISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1064GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1064GISEL-NEXT:    s_endpgm
+;
+; GFX1032DAGISEL-LABEL: uniform_value_float:
+; GFX1032DAGISEL:       ; %bb.0: ; %entry
+; GFX1032DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1032DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s0
+; GFX1032DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1032DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX1032DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX1032DAGISEL-NEXT:    s_endpgm
+;
+; GFX1032GISEL-LABEL: uniform_value_float:
+; GFX1032GISEL:       ; %bb.0: ; %entry
+; GFX1032GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX1032GISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1032GISEL-NEXT:    s_bcnt1_i32_b32 s0, s0
+; GFX1032GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1032GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032GISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1032GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1032GISEL-NEXT:    s_endpgm
+;
+; GFX1164DAGISEL-LABEL: uniform_value_float:
+; GFX1164DAGISEL:       ; %bb.0: ; %entry
+; GFX1164DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    s_bcnt1_i32_b64 s3, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX1164DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_endpgm
+;
+; GFX1164GISEL-LABEL: uniform_value_float:
+; GFX1164GISEL:       ; %bb.0: ; %entry
+; GFX1164GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1164GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164GISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1164GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1164GISEL-NEXT:    s_endpgm
+;
+; GFX1132DAGISEL-LABEL: uniform_value_float:
+; GFX1132DAGISEL:       ; %bb.0: ; %entry
+; GFX1132DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s0
+; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1132DAGISEL-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
+; GFX1132DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
+; GFX1132DAGISEL-NEXT:    s_endpgm
+;
+; GFX1132GISEL-LABEL: uniform_value_float:
+; GFX1132GISEL:       ; %bb.0: ; %entry
+; GFX1132GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1132GISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1132GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1132GISEL-NEXT:    s_bcnt1_i32_b32 s0, s0
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132GISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1132GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1132GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1132GISEL-NEXT:    s_endpgm
+;
+; GFX12DAGISEL-LABEL: uniform_value_float:
+; GFX12DAGISEL:       ; %bb.0: ; %entry
+; GFX12DAGISEL-NEXT:    s_load_b96 s[0:2], s[4:5], 0x24
+; GFX12DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
+; GFX12DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_mul_f32_e32 v0, s2, v0
+; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_va_sdst(0)
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
+; GFX12DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
+; GFX12DAGISEL-NEXT:    s_endpgm
+entry:
+  %result = call float @llvm.amdgcn.wave.reduce.fadd(float %in, i32 1)
+  store float %result, ptr addrspace(1) %out
+  ret void
+}
+
+define void @divergent_value_float(ptr addrspace(1) %out, float %id.x) {
+; GFX8DAGISEL-LABEL: divergent_value_float:
+; GFX8DAGISEL:       ; %bb.0: ; %entry
+; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX8DAGISEL-NEXT:    s_brev_b32 s6, 1
+; GFX8DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX8DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX8DAGISEL-NEXT:    v_add_f32_e32 v3, s6, v3
+; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX8DAGISEL-NEXT:  ; %bb.2:
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX8DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX8GISEL-LABEL: divergent_value_float:
+; GFX8GISEL:       ; %bb.0: ; %entry
+; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8GISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX8GISEL-NEXT:    s_brev_b32 s6, 1
+; GFX8GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX8GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX8GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX8GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX8GISEL-NEXT:    v_add_f32_e32 v3, s6, v3
+; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX8GISEL-NEXT:  ; %bb.2:
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX8GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9DAGISEL-LABEL: divergent_value_float:
+; GFX9DAGISEL:       ; %bb.0: ; %entry
+; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX9DAGISEL-NEXT:    s_brev_b32 s6, 1
+; GFX9DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX9DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX9DAGISEL-NEXT:    v_add_f32_e32 v3, s6, v3
+; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX9DAGISEL-NEXT:  ; %bb.2:
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX9DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX9DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9GISEL-LABEL: divergent_value_float:
+; GFX9GISEL:       ; %bb.0: ; %entry
+; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9GISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX9GISEL-NEXT:    s_brev_b32 s6, 1
+; GFX9GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX9GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX9GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX9GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX9GISEL-NEXT:    v_add_f32_e32 v3, s6, v3
+; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX9GISEL-NEXT:  ; %bb.2:
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX9GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX9GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1064DAGISEL-LABEL: divergent_value_float:
+; GFX1064DAGISEL:       ; %bb.0: ; %entry
+; GFX1064DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX1064DAGISEL-NEXT:    s_brev_b32 s6, 1
+; GFX1064DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX1064DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX1064DAGISEL-NEXT:    v_add_f32_e64 v3, s6, s8
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1064DAGISEL-NEXT:  ; %bb.2:
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX1064DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1064DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1064GISEL-LABEL: divergent_value_float:
+; GFX1064GISEL:       ; %bb.0: ; %entry
+; GFX1064GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1064GISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX1064GISEL-NEXT:    s_brev_b32 s6, 1
+; GFX1064GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX1064GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX1064GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX1064GISEL-NEXT:    v_add_f32_e64 v3, s6, s8
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1064GISEL-NEXT:  ; %bb.2:
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX1064GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1064GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1032DAGISEL-LABEL: divergent_value_float:
+; GFX1032DAGISEL:       ; %bb.0: ; %entry
+; GFX1032DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s4, exec_lo
+; GFX1032DAGISEL-NEXT:    s_brev_b32 s5, 1
+; GFX1032DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s6, s4
+; GFX1032DAGISEL-NEXT:    v_readlane_b32 s7, v2, s6
+; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s4, s6
+; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s4, 0
+; GFX1032DAGISEL-NEXT:    v_add_f32_e64 v3, s5, s7
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s5, v3
+; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1032DAGISEL-NEXT:  ; %bb.2:
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v2, s5
+; GFX1032DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1032DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1032GISEL-LABEL: divergent_value_float:
+; GFX1032GISEL:       ; %bb.0: ; %entry
+; GFX1032GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1032GISEL-NEXT:    s_mov_b32 s4, exec_lo
+; GFX1032GISEL-NEXT:    s_brev_b32 s5, 1
+; GFX1032GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s6, s4
+; GFX1032GISEL-NEXT:    v_readlane_b32 s7, v2, s6
+; GFX1032GISEL-NEXT:    s_bitset0_b32 s4, s6
+; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s4, 0
+; GFX1032GISEL-NEXT:    v_add_f32_e64 v3, s5, s7
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s5, v3
+; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1032GISEL-NEXT:  ; %bb.2:
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v2, s5
+; GFX1032GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1032GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1164DAGISEL-LABEL: divergent_value_float:
+; GFX1164DAGISEL:       ; %bb.0: ; %entry
+; GFX1164DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164DAGISEL-NEXT:    s_brev_b32 s2, 1
+; GFX1164DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164DAGISEL-NEXT:    v_readlane_b32 s4, v2, s3
+; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[0:1], s3
+; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
+; GFX1164DAGISEL-NEXT:    v_add_f32_e64 v3, s2, s4
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s2, v3
+; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1164DAGISEL-NEXT:  ; %bb.2:
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX1164DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1164DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1164GISEL-LABEL: divergent_value_float:
+; GFX1164GISEL:       ; %bb.0: ; %entry
+; GFX1164GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164GISEL-NEXT:    s_brev_b32 s2, 1
+; GFX1164GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164GISEL-NEXT:    v_readlane_b32 s4, v2, s3
+; GFX1164GISEL-NEXT:    s_bitset0_b64 s[0:1], s3
+; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
+; GFX1164GISEL-NEXT:    v_add_f32_e64 v3, s2, s4
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s2, v3
+; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1164GISEL-NEXT:  ; %bb.2:
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX1164GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1164GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1132DAGISEL-LABEL: divergent_value_float:
+; GFX1132DAGISEL:       ; %bb.0: ; %entry
+; GFX1132DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1132DAGISEL-NEXT:    s_brev_b32 s1, 1
+; GFX1132DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s2, s0
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    v_readlane_b32 s3, v2, s2
+; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s0, s2
+; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s0, 0
+; GFX1132DAGISEL-NEXT:    v_add_f32_e64 v3, s1, s3
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s1, v3
+; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1132DAGISEL-NEXT:  ; %bb.2:
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
+; GFX1132DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1132DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1132GISEL-LABEL: divergent_value_float:
+; GFX1132GISEL:       ; %bb.0: ; %entry
+; GFX1132GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1132GISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1132GISEL-NEXT:    s_brev_b32 s1, 1
+; GFX1132GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s2, s0
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132GISEL-NEXT:    v_readlane_b32 s3, v2, s2
+; GFX1132GISEL-NEXT:    s_bitset0_b32 s0, s2
+; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s0, 0
+; GFX1132GISEL-NEXT:    v_add_f32_e64 v3, s1, s3
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s1, v3
+; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1132GISEL-NEXT:  ; %bb.2:
+; GFX1132GISEL-NEXT:    v_mov_b32_e32 v2, s1
+; GFX1132GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1132GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX12DAGISEL-LABEL: divergent_value_float:
+; GFX12DAGISEL:       ; %bb.0: ; %entry
+; GFX12DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12DAGISEL-NEXT:    s_wait_expcnt 0x0
+; GFX12DAGISEL-NEXT:    s_wait_samplecnt 0x0
+; GFX12DAGISEL-NEXT:    s_wait_bvhcnt 0x0
+; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; GFX12DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX12DAGISEL-NEXT:    s_brev_b32 s1, 1
+; GFX12DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12DAGISEL-NEXT:    s_ctz_i32_b32 s2, s0
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12DAGISEL-NEXT:    v_readlane_b32 s3, v2, s2
+; GFX12DAGISEL-NEXT:    s_bitset0_b32 s0, s2
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12DAGISEL-NEXT:    s_cmp_lg_u32 s0, 0
+; GFX12DAGISEL-NEXT:    v_add_f32_e64 v3, s1, s3
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s1, v3
+; GFX12DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX12DAGISEL-NEXT:  ; %bb.2:
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_va_sdst(0)
+; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
+; GFX12DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX12DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+entry:
+  %result = call float @llvm.amdgcn.wave.reduce.fadd(float %id.x, i32 1)
+  store float %result, ptr addrspace(1) %out
+  ret void
+}
+
+define amdgpu_kernel void @divergent_cfg_float(ptr addrspace(1) %out, float %in, float %in2) {
+; GFX8DAGISEL-LABEL: divergent_cfg_float:
+; GFX8DAGISEL:       ; %bb.0: ; %entry
+; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX8DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v0
+; GFX8DAGISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX8DAGISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX8DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX8DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX8DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX8DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX8DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX8DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
+; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX8DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX8DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX8DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX8DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v1, s0
+; GFX8DAGISEL-NEXT:    flat_store_dword v[1:2], v0
+; GFX8DAGISEL-NEXT:    s_endpgm
+;
+; GFX8GISEL-LABEL: divergent_cfg_float:
+; GFX8GISEL:       ; %bb.0: ; %entry
+; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX8GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v0
+; GFX8GISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX8GISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX8GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX8GISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX8GISEL-NEXT:  ; %bb.1: ; %else
+; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX8GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8GISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX8GISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX8GISEL-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
+; GFX8GISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX8GISEL-NEXT:  ; %bb.3: ; %if
+; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX8GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX8GISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX8GISEL-NEXT:  .LBB2_4: ; %endif
+; GFX8GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8GISEL-NEXT:    s_endpgm
+;
+; GFX9DAGISEL-LABEL: divergent_cfg_float:
+; GFX9DAGISEL:       ; %bb.0: ; %entry
+; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX9DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v0
+; GFX9DAGISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX9DAGISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX9DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX9DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX9DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX9DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX9DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX9DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
+; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX9DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX9DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX9DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX9DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX9DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX9DAGISEL-NEXT:    s_endpgm
+;
+; GFX9GISEL-LABEL: divergent_cfg_float:
+; GFX9GISEL:       ; %bb.0: ; %entry
+; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX9GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v0
+; GFX9GISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX9GISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX9GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX9GISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX9GISEL-NEXT:  ; %bb.1: ; %else
+; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX9GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9GISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX9GISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX9GISEL-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
+; GFX9GISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX9GISEL-NEXT:  ; %bb.3: ; %if
+; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX9GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX9GISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX9GISEL-NEXT:  .LBB2_4: ; %endif
+; GFX9GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX9GISEL-NEXT:    s_endpgm
+;
+; GFX1064DAGISEL-LABEL: divergent_cfg_float:
+; GFX1064DAGISEL:       ; %bb.0: ; %entry
+; GFX1064DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX1064DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v0
+; GFX1064DAGISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX1064DAGISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX1064DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1064DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX1064DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX1064DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1064DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX1064DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
+; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1064DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX1064DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1064DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX1064DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1064DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1064DAGISEL-NEXT:    s_endpgm
+;
+; GFX1064GISEL-LABEL: divergent_cfg_float:
+; GFX1064GISEL:       ; %bb.0: ; %entry
+; GFX1064GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX1064GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v0
+; GFX1064GISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX1064GISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX1064GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1064GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX1064GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064GISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX1064GISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1064GISEL-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
+; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1064GISEL-NEXT:  ; %bb.3: ; %if
+; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX1064GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1064GISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX1064GISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1064GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1064GISEL-NEXT:    s_endpgm
+;
+; GFX1032DAGISEL-LABEL: divergent_cfg_float:
+; GFX1032DAGISEL:       ; %bb.0: ; %entry
+; GFX1032DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX1032DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc_lo, 15, v0
+; GFX1032DAGISEL-NEXT:    ; implicit-def: $sgpr3
+; GFX1032DAGISEL-NEXT:    s_and_saveexec_b32 s2, vcc_lo
+; GFX1032DAGISEL-NEXT:    s_xor_b32 s2, exec_lo, s2
+; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1032DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX1032DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
+; GFX1032DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s3, v0
+; GFX1032DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    s_or_saveexec_b32 s0, s2
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v0, s3
+; GFX1032DAGISEL-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
+; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1032DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1032DAGISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX1032DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX1032DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s1, v0
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v0, s1
+; GFX1032DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1032DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX1032DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1032DAGISEL-NEXT:    s_endpgm
+;
+; GFX1032GISEL-LABEL: divergent_cfg_float:
+; GFX1032GISEL:       ; %bb.0: ; %entry
+; GFX1032GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX1032GISEL-NEXT:    v_cmp_le_u32_e32 vcc_lo, 16, v0
+; GFX1032GISEL-NEXT:    ; implicit-def: $sgpr2
+; GFX1032GISEL-NEXT:    s_and_saveexec_b32 s3, vcc_lo
+; GFX1032GISEL-NEXT:    s_xor_b32 s3, exec_lo, s3
+; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1032GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1032GISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1032GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX1032GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032GISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1032GISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032GISEL-NEXT:    s_andn2_saveexec_b32 s0, s3
+; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1032GISEL-NEXT:  ; %bb.3: ; %if
+; GFX1032GISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1032GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX1032GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX1032GISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1032GISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1032GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX1032GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1032GISEL-NEXT:    s_endpgm
+;
+; GFX1164DAGISEL-LABEL: divergent_cfg_float:
+; GFX1164DAGISEL:       ; %bb.0: ; %entry
+; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
+; GFX1164DAGISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164DAGISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_cmpx_lt_u32_e32 15, v0
+; GFX1164DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1164DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX1164DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX1164DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX1164DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1164DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX1164DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
+; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1164DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1164DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX1164DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1164DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_endpgm
+;
+; GFX1164GISEL-LABEL: divergent_cfg_float:
+; GFX1164GISEL:       ; %bb.0: ; %entry
+; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
+; GFX1164GISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX1164GISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164GISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_cmpx_le_u32_e32 16, v0
+; GFX1164GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1164GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1164GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX1164GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX1164GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX1164GISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1164GISEL-NEXT:    s_and_not1_saveexec_b64 s[2:3], s[2:3]
+; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1164GISEL-NEXT:  ; %bb.3: ; %if
+; GFX1164GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1164GISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX1164GISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1164GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1164GISEL-NEXT:    s_endpgm
+;
+; GFX1132DAGISEL-LABEL: divergent_cfg_float:
+; GFX1132DAGISEL:       ; %bb.0: ; %entry
+; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
+; GFX1132DAGISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1132DAGISEL-NEXT:    ; implicit-def: $sgpr3
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_cmpx_lt_u32_e32 15, v0
+; GFX1132DAGISEL-NEXT:    s_xor_b32 s2, exec_lo, s2
+; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1132DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
+; GFX1132DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s3, v0
+; GFX1132DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    s_or_saveexec_b32 s0, s2
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v0, s3
+; GFX1132DAGISEL-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
+; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1132DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX1132DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s1, v0
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v0, s1
+; GFX1132DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1132DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1132DAGISEL-NEXT:    s_endpgm
+;
+; GFX1132GISEL-LABEL: divergent_cfg_float:
+; GFX1132GISEL:       ; %bb.0: ; %entry
+; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
+; GFX1132GISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX1132GISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX1132GISEL-NEXT:    ; implicit-def: $sgpr2
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_cmpx_le_u32_e32 16, v0
+; GFX1132GISEL-NEXT:    s_xor_b32 s3, exec_lo, s3
+; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1132GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX1132GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX1132GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1132GISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132GISEL-NEXT:    s_and_not1_saveexec_b32 s0, s3
+; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1132GISEL-NEXT:  ; %bb.3: ; %if
+; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX1132GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX1132GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1132GISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1132GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_dual_mov_b32 v0, s2 :: v_dual_mov_b32 v1, 0
+; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1132GISEL-NEXT:    s_endpgm
+;
+; GFX12DAGISEL-LABEL: divergent_cfg_float:
+; GFX12DAGISEL:       ; %bb.0: ; %entry
+; GFX12DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
+; GFX12DAGISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX12DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX12DAGISEL-NEXT:    ; implicit-def: $sgpr3
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_cmpx_lt_u32_e32 15, v0
+; GFX12DAGISEL-NEXT:    s_xor_b32 s2, exec_lo, s2
+; GFX12DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX12DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX12DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
+; GFX12DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_mul_f32_e32 v0, s0, v0
+; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s3, v0
+; GFX12DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; GFX12DAGISEL-NEXT:    s_or_saveexec_b32 s0, s2
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v0, s3
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12DAGISEL-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
+; GFX12DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX12DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX12DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12DAGISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_mul_f32_e32 v0, s1, v0
+; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s1, v0
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_va_sdst(0)
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v0, s1
+; GFX12DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX12DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX12DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; GFX12DAGISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX12DAGISEL-NEXT:    s_endpgm
+entry:
+  %tid = call i32 @llvm.amdgcn.workitem.id.x()
+  %d_cmp = icmp ult i32 %tid, 16
+  br i1 %d_cmp, label %if, label %else
+
+if:
+  %reducedValTid = call float @llvm.amdgcn.wave.reduce.fadd(float %in2, i32 1)
+  br label %endif
+
+else:
+  %reducedValIn = call float @llvm.amdgcn.wave.reduce.fadd(float %in, i32 1)
+  br label %endif
+
+endif:
+  %combine = phi float [%reducedValTid, %if], [%reducedValIn, %else]
+  store float %combine, ptr addrspace(1) %out
+  ret void
+}
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; GFX10DAGISEL: {{.*}}
+; GFX10GISEL: {{.*}}
+; GFX11DAGISEL: {{.*}}
+; GFX11GISEL: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fmax.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fmax.ll
new file mode 100644
index 0000000000000..f02fd876f1aac
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fmax.ll
@@ -0,0 +1,928 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=amdgcn -mcpu=tonga -global-isel=0 < %s | FileCheck  -check-prefixes=GFX8DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga -global-isel=1 < %s | FileCheck  -check-prefixes=GFX8GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=0 < %s | FileCheck  -check-prefixes=GFX9DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=1 < %s | FileCheck  -check-prefixes=GFX9GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=0 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX10DAGISEL,GFX1064DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=1 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX10GISEL,GFX1064GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=0 < %s | FileCheck -check-prefixes=GFX10DAGISEL,GFX1032DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=1 < %s | FileCheck -check-prefixes=GFX10GISEL,GFX1032GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX11DAGISEL,GFX1164DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX11GISEL,GFX1164GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 < %s | FileCheck -check-prefixes=GFX11DAGISEL,GFX1132DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 < %s | FileCheck -check-prefixes=GFX11GISEL,GFX1132GISEL %s
+
+
+define amdgpu_kernel void @uniform_value_float(ptr addrspace(1) %out, float %in) {
+; GFX8DAGISEL-LABEL: uniform_value_float:
+; GFX8DAGISEL:       ; %bb.0: ; %entry
+; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8DAGISEL-NEXT:    s_endpgm
+;
+; GFX8GISEL-LABEL: uniform_value_float:
+; GFX8GISEL:       ; %bb.0: ; %entry
+; GFX8GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8GISEL-NEXT:    s_endpgm
+;
+; GFX9DAGISEL-LABEL: uniform_value_float:
+; GFX9DAGISEL:       ; %bb.0: ; %entry
+; GFX9DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX9DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX9DAGISEL-NEXT:    s_endpgm
+;
+; GFX9GISEL-LABEL: uniform_value_float:
+; GFX9GISEL:       ; %bb.0: ; %entry
+; GFX9GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX9GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX9GISEL-NEXT:    s_endpgm
+;
+; GFX10DAGISEL-LABEL: uniform_value_float:
+; GFX10DAGISEL:       ; %bb.0: ; %entry
+; GFX10DAGISEL-NEXT:    s_clause 0x1
+; GFX10DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX10DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX10DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX10DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX10DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX10DAGISEL-NEXT:    s_endpgm
+;
+; GFX10GISEL-LABEL: uniform_value_float:
+; GFX10GISEL:       ; %bb.0: ; %entry
+; GFX10GISEL-NEXT:    s_clause 0x1
+; GFX10GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX10GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX10GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX10GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX10GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX10GISEL-NEXT:    s_endpgm
+;
+; GFX1164DAGISEL-LABEL: uniform_value_float:
+; GFX1164DAGISEL:       ; %bb.0: ; %entry
+; GFX1164DAGISEL-NEXT:    s_clause 0x1
+; GFX1164DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX1164DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_endpgm
+;
+; GFX1164GISEL-LABEL: uniform_value_float:
+; GFX1164GISEL:       ; %bb.0: ; %entry
+; GFX1164GISEL-NEXT:    s_clause 0x1
+; GFX1164GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1164GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1164GISEL-NEXT:    s_endpgm
+;
+; GFX1132DAGISEL-LABEL: uniform_value_float:
+; GFX1132DAGISEL:       ; %bb.0: ; %entry
+; GFX1132DAGISEL-NEXT:    s_clause 0x1
+; GFX1132DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
+; GFX1132DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
+; GFX1132DAGISEL-NEXT:    s_endpgm
+;
+; GFX1132GISEL-LABEL: uniform_value_float:
+; GFX1132GISEL:       ; %bb.0: ; %entry
+; GFX1132GISEL-NEXT:    s_clause 0x1
+; GFX1132GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132GISEL-NEXT:    v_dual_mov_b32 v1, 0 :: v_dual_mov_b32 v0, s2
+; GFX1132GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1132GISEL-NEXT:    s_endpgm
+entry:
+  %result = call float @llvm.amdgcn.wave.reduce.fmax(float %in, i32 1)
+  store float %result, ptr addrspace(1) %out
+  ret void
+}
+
+define void @divergent_value_float(ptr addrspace(1) %out, float %in) {
+; GFX8DAGISEL-LABEL: divergent_value_float:
+; GFX8DAGISEL:       ; %bb.0: ; %entry
+; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX8DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX8DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX8DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX8DAGISEL-NEXT:    v_max_f32_e32 v3, s6, v3
+; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX8DAGISEL-NEXT:  ; %bb.2:
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX8DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX8GISEL-LABEL: divergent_value_float:
+; GFX8GISEL:       ; %bb.0: ; %entry
+; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8GISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX8GISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX8GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX8GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX8GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX8GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX8GISEL-NEXT:    v_max_f32_e32 v3, s6, v3
+; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX8GISEL-NEXT:  ; %bb.2:
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX8GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9DAGISEL-LABEL: divergent_value_float:
+; GFX9DAGISEL:       ; %bb.0: ; %entry
+; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX9DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX9DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX9DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX9DAGISEL-NEXT:    v_max_f32_e32 v3, s6, v3
+; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX9DAGISEL-NEXT:  ; %bb.2:
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX9DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX9DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9GISEL-LABEL: divergent_value_float:
+; GFX9GISEL:       ; %bb.0: ; %entry
+; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9GISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX9GISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX9GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX9GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX9GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX9GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX9GISEL-NEXT:    v_max_f32_e32 v3, s6, v3
+; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX9GISEL-NEXT:  ; %bb.2:
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX9GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX9GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1064DAGISEL-LABEL: divergent_value_float:
+; GFX1064DAGISEL:       ; %bb.0: ; %entry
+; GFX1064DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX1064DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX1064DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX1064DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX1064DAGISEL-NEXT:    v_max_f32_e64 v3, s6, s8
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1064DAGISEL-NEXT:  ; %bb.2:
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX1064DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1064DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1064GISEL-LABEL: divergent_value_float:
+; GFX1064GISEL:       ; %bb.0: ; %entry
+; GFX1064GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1064GISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX1064GISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX1064GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX1064GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX1064GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX1064GISEL-NEXT:    v_max_f32_e64 v3, s6, s8
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1064GISEL-NEXT:  ; %bb.2:
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX1064GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1064GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1032DAGISEL-LABEL: divergent_value_float:
+; GFX1032DAGISEL:       ; %bb.0: ; %entry
+; GFX1032DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s4, exec_lo
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, 0x7fc00000
+; GFX1032DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s6, s4
+; GFX1032DAGISEL-NEXT:    v_readlane_b32 s7, v2, s6
+; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s4, s6
+; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s4, 0
+; GFX1032DAGISEL-NEXT:    v_max_f32_e64 v3, s5, s7
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s5, v3
+; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1032DAGISEL-NEXT:  ; %bb.2:
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v2, s5
+; GFX1032DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1032DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1032GISEL-LABEL: divergent_value_float:
+; GFX1032GISEL:       ; %bb.0: ; %entry
+; GFX1032GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1032GISEL-NEXT:    s_mov_b32 s4, exec_lo
+; GFX1032GISEL-NEXT:    s_mov_b32 s5, 0x7fc00000
+; GFX1032GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s6, s4
+; GFX1032GISEL-NEXT:    v_readlane_b32 s7, v2, s6
+; GFX1032GISEL-NEXT:    s_bitset0_b32 s4, s6
+; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s4, 0
+; GFX1032GISEL-NEXT:    v_max_f32_e64 v3, s5, s7
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s5, v3
+; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1032GISEL-NEXT:  ; %bb.2:
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v2, s5
+; GFX1032GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1032GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1164DAGISEL-LABEL: divergent_value_float:
+; GFX1164DAGISEL:       ; %bb.0: ; %entry
+; GFX1164DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164DAGISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
+; GFX1164DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164DAGISEL-NEXT:    v_readlane_b32 s4, v2, s3
+; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[0:1], s3
+; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
+; GFX1164DAGISEL-NEXT:    v_max_f32_e64 v3, s2, s4
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s2, v3
+; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1164DAGISEL-NEXT:  ; %bb.2:
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX1164DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1164DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1164GISEL-LABEL: divergent_value_float:
+; GFX1164GISEL:       ; %bb.0: ; %entry
+; GFX1164GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164GISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
+; GFX1164GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164GISEL-NEXT:    v_readlane_b32 s4, v2, s3
+; GFX1164GISEL-NEXT:    s_bitset0_b64 s[0:1], s3
+; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
+; GFX1164GISEL-NEXT:    v_max_f32_e64 v3, s2, s4
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s2, v3
+; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1164GISEL-NEXT:  ; %bb.2:
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX1164GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1164GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1132DAGISEL-LABEL: divergent_value_float:
+; GFX1132DAGISEL:       ; %bb.0: ; %entry
+; GFX1132DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, 0x7fc00000
+; GFX1132DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s2, s0
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    v_readlane_b32 s3, v2, s2
+; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s0, s2
+; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s0, 0
+; GFX1132DAGISEL-NEXT:    v_max_f32_e64 v3, s1, s3
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s1, v3
+; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1132DAGISEL-NEXT:  ; %bb.2:
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
+; GFX1132DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1132DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1132GISEL-LABEL: divergent_value_float:
+; GFX1132GISEL:       ; %bb.0: ; %entry
+; GFX1132GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1132GISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1132GISEL-NEXT:    s_mov_b32 s1, 0x7fc00000
+; GFX1132GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s2, s0
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132GISEL-NEXT:    v_readlane_b32 s3, v2, s2
+; GFX1132GISEL-NEXT:    s_bitset0_b32 s0, s2
+; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s0, 0
+; GFX1132GISEL-NEXT:    v_max_f32_e64 v3, s1, s3
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s1, v3
+; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1132GISEL-NEXT:  ; %bb.2:
+; GFX1132GISEL-NEXT:    v_mov_b32_e32 v2, s1
+; GFX1132GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1132GISEL-NEXT:    s_setpc_b64 s[30:31]
+entry:
+  %result = call float @llvm.amdgcn.wave.reduce.fmax(float %in, i32 1)
+  store float %result, ptr addrspace(1) %out
+  ret void
+}
+
+define void @divergent_cfg_float(ptr addrspace(1) %out, float %in, float %in2) {
+; GFX8DAGISEL-LABEL: divergent_cfg_float:
+; GFX8DAGISEL:       ; %bb.0: ; %entry
+; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX8DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
+; GFX8DAGISEL-NEXT:    ; implicit-def: $vgpr4
+; GFX8DAGISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
+; GFX8DAGISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
+; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX8DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX8DAGISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX8DAGISEL-NEXT:    v_readlane_b32 s10, v2, s9
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v3, s10
+; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX8DAGISEL-NEXT:    v_max_f32_e32 v3, s8, v3
+; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s8, v3
+; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX8DAGISEL-NEXT:  ; %bb.3:
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
+; GFX8DAGISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX8DAGISEL-NEXT:  .LBB2_4: ; %Flow
+; GFX8DAGISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
+; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB2_8
+; GFX8DAGISEL-NEXT:  ; %bb.5: ; %if
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX8DAGISEL-NEXT:  .LBB2_6: ; =>This Inner Loop Header: Depth=1
+; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX8DAGISEL-NEXT:    v_readlane_b32 s10, v3, s9
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s10
+; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX8DAGISEL-NEXT:    v_max_f32_e32 v2, s8, v2
+; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s8, v2
+; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_6
+; GFX8DAGISEL-NEXT:  ; %bb.7:
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
+; GFX8DAGISEL-NEXT:  .LBB2_8: ; %endif
+; GFX8DAGISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v4
+; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX8DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX8GISEL-LABEL: divergent_cfg_float:
+; GFX8GISEL:       ; %bb.0: ; %entry
+; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX8GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v4
+; GFX8GISEL-NEXT:    ; implicit-def: $sgpr8
+; GFX8GISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
+; GFX8GISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
+; GFX8GISEL-NEXT:    s_cbranch_execz .LBB2_3
+; GFX8GISEL-NEXT:  ; %bb.1: ; %else
+; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX8GISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX8GISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX8GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX8GISEL-NEXT:    v_readlane_b32 s10, v2, s9
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v4, s10
+; GFX8GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX8GISEL-NEXT:    v_max_f32_e32 v4, s8, v4
+; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s8, v4
+; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX8GISEL-NEXT:  .LBB2_3: ; %Flow
+; GFX8GISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
+; GFX8GISEL-NEXT:    s_cbranch_execz .LBB2_6
+; GFX8GISEL-NEXT:  ; %bb.4: ; %if
+; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX8GISEL-NEXT:  .LBB2_5: ; =>This Inner Loop Header: Depth=1
+; GFX8GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX8GISEL-NEXT:    v_readlane_b32 s10, v3, s9
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s10
+; GFX8GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX8GISEL-NEXT:    v_max_f32_e32 v2, s8, v2
+; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s8, v2
+; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB2_5
+; GFX8GISEL-NEXT:  .LBB2_6: ; %endif
+; GFX8GISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s8
+; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX8GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9DAGISEL-LABEL: divergent_cfg_float:
+; GFX9DAGISEL:       ; %bb.0: ; %entry
+; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX9DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
+; GFX9DAGISEL-NEXT:    ; implicit-def: $vgpr4
+; GFX9DAGISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
+; GFX9DAGISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
+; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX9DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX9DAGISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX9DAGISEL-NEXT:    v_readlane_b32 s10, v2, s9
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v3, s10
+; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX9DAGISEL-NEXT:    v_max_f32_e32 v3, s8, v3
+; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s8, v3
+; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX9DAGISEL-NEXT:  ; %bb.3:
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
+; GFX9DAGISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX9DAGISEL-NEXT:  .LBB2_4: ; %Flow
+; GFX9DAGISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
+; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB2_8
+; GFX9DAGISEL-NEXT:  ; %bb.5: ; %if
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX9DAGISEL-NEXT:  .LBB2_6: ; =>This Inner Loop Header: Depth=1
+; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX9DAGISEL-NEXT:    v_readlane_b32 s10, v3, s9
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v2, s10
+; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX9DAGISEL-NEXT:    v_max_f32_e32 v2, s8, v2
+; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s8, v2
+; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_6
+; GFX9DAGISEL-NEXT:  ; %bb.7:
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
+; GFX9DAGISEL-NEXT:  .LBB2_8: ; %endif
+; GFX9DAGISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX9DAGISEL-NEXT:    global_store_dword v[0:1], v4, off
+; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX9DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9GISEL-LABEL: divergent_cfg_float:
+; GFX9GISEL:       ; %bb.0: ; %entry
+; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX9GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v4
+; GFX9GISEL-NEXT:    ; implicit-def: $sgpr8
+; GFX9GISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
+; GFX9GISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
+; GFX9GISEL-NEXT:    s_cbranch_execz .LBB2_3
+; GFX9GISEL-NEXT:  ; %bb.1: ; %else
+; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX9GISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX9GISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX9GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX9GISEL-NEXT:    v_readlane_b32 s10, v2, s9
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v4, s10
+; GFX9GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX9GISEL-NEXT:    v_max_f32_e32 v4, s8, v4
+; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s8, v4
+; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX9GISEL-NEXT:  .LBB2_3: ; %Flow
+; GFX9GISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
+; GFX9GISEL-NEXT:    s_cbranch_execz .LBB2_6
+; GFX9GISEL-NEXT:  ; %bb.4: ; %if
+; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX9GISEL-NEXT:  .LBB2_5: ; =>This Inner Loop Header: Depth=1
+; GFX9GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX9GISEL-NEXT:    v_readlane_b32 s10, v3, s9
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s10
+; GFX9GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX9GISEL-NEXT:    v_max_f32_e32 v2, s8, v2
+; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s8, v2
+; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB2_5
+; GFX9GISEL-NEXT:  .LBB2_6: ; %endif
+; GFX9GISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s8
+; GFX9GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX9GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1064DAGISEL-LABEL: divergent_cfg_float:
+; GFX1064DAGISEL:       ; %bb.0: ; %entry
+; GFX1064DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1064DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
+; GFX1064DAGISEL-NEXT:    ; implicit-def: $vgpr4
+; GFX1064DAGISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
+; GFX1064DAGISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
+; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1064DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX1064DAGISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX1064DAGISEL-NEXT:    v_readlane_b32 s10, v2, s9
+; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX1064DAGISEL-NEXT:    v_max_f32_e64 v3, s8, s10
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s8, v3
+; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1064DAGISEL-NEXT:  ; %bb.3:
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
+; GFX1064DAGISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1064DAGISEL-NEXT:  .LBB2_4: ; %Flow
+; GFX1064DAGISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
+; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB2_8
+; GFX1064DAGISEL-NEXT:  ; %bb.5: ; %if
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX1064DAGISEL-NEXT:  .LBB2_6: ; =>This Inner Loop Header: Depth=1
+; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX1064DAGISEL-NEXT:    v_readlane_b32 s10, v3, s9
+; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX1064DAGISEL-NEXT:    v_max_f32_e64 v2, s8, s10
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s8, v2
+; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_6
+; GFX1064DAGISEL-NEXT:  ; %bb.7:
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
+; GFX1064DAGISEL-NEXT:  .LBB2_8: ; %endif
+; GFX1064DAGISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX1064DAGISEL-NEXT:    global_store_dword v[0:1], v4, off
+; GFX1064DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1064GISEL-LABEL: divergent_cfg_float:
+; GFX1064GISEL:       ; %bb.0: ; %entry
+; GFX1064GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1064GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1064GISEL-NEXT:    ; implicit-def: $sgpr8
+; GFX1064GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v4
+; GFX1064GISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
+; GFX1064GISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
+; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB2_3
+; GFX1064GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX1064GISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX1064GISEL-NEXT:    v_readlane_b32 s10, v2, s9
+; GFX1064GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX1064GISEL-NEXT:    v_max_f32_e64 v3, s8, s10
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s8, v3
+; GFX1064GISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1064GISEL-NEXT:  .LBB2_3: ; %Flow
+; GFX1064GISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
+; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB2_6
+; GFX1064GISEL-NEXT:  ; %bb.4: ; %if
+; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX1064GISEL-NEXT:  .LBB2_5: ; =>This Inner Loop Header: Depth=1
+; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX1064GISEL-NEXT:    v_readlane_b32 s10, v3, s9
+; GFX1064GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX1064GISEL-NEXT:    v_max_f32_e64 v2, s8, s10
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s8, v2
+; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB2_5
+; GFX1064GISEL-NEXT:  .LBB2_6: ; %endif
+; GFX1064GISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v2, s8
+; GFX1064GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1064GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1032DAGISEL-LABEL: divergent_cfg_float:
+; GFX1032DAGISEL:       ; %bb.0: ; %entry
+; GFX1032DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1032DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc_lo, 15, v4
+; GFX1032DAGISEL-NEXT:    ; implicit-def: $vgpr4
+; GFX1032DAGISEL-NEXT:    s_and_saveexec_b32 s4, vcc_lo
+; GFX1032DAGISEL-NEXT:    s_xor_b32 s4, exec_lo, s4
+; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1032DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, exec_lo
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX1032DAGISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s7, s5
+; GFX1032DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s5, s7
+; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s5, 0
+; GFX1032DAGISEL-NEXT:    v_max_f32_e64 v3, s6, s8
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1032DAGISEL-NEXT:  ; %bb.3:
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v4, s6
+; GFX1032DAGISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1032DAGISEL-NEXT:  .LBB2_4: ; %Flow
+; GFX1032DAGISEL-NEXT:    s_andn2_saveexec_b32 s4, s4
+; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB2_8
+; GFX1032DAGISEL-NEXT:  ; %bb.5: ; %if
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, exec_lo
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX1032DAGISEL-NEXT:  .LBB2_6: ; =>This Inner Loop Header: Depth=1
+; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s7, s5
+; GFX1032DAGISEL-NEXT:    v_readlane_b32 s8, v3, s7
+; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s5, s7
+; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s5, 0
+; GFX1032DAGISEL-NEXT:    v_max_f32_e64 v2, s6, s8
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s6, v2
+; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_6
+; GFX1032DAGISEL-NEXT:  ; %bb.7:
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v4, s6
+; GFX1032DAGISEL-NEXT:  .LBB2_8: ; %endif
+; GFX1032DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s4
+; GFX1032DAGISEL-NEXT:    global_store_dword v[0:1], v4, off
+; GFX1032DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1032GISEL-LABEL: divergent_cfg_float:
+; GFX1032GISEL:       ; %bb.0: ; %entry
+; GFX1032GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1032GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1032GISEL-NEXT:    ; implicit-def: $sgpr4
+; GFX1032GISEL-NEXT:    v_cmp_le_u32_e32 vcc_lo, 16, v4
+; GFX1032GISEL-NEXT:    s_and_saveexec_b32 s5, vcc_lo
+; GFX1032GISEL-NEXT:    s_xor_b32 s5, exec_lo, s5
+; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB2_3
+; GFX1032GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1032GISEL-NEXT:    s_mov_b32 s6, exec_lo
+; GFX1032GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
+; GFX1032GISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s7, s6
+; GFX1032GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX1032GISEL-NEXT:    s_bitset0_b32 s6, s7
+; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s6, 0
+; GFX1032GISEL-NEXT:    v_max_f32_e64 v3, s4, s8
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s4, v3
+; GFX1032GISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1032GISEL-NEXT:  .LBB2_3: ; %Flow
+; GFX1032GISEL-NEXT:    s_andn2_saveexec_b32 s5, s5
+; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB2_6
+; GFX1032GISEL-NEXT:  ; %bb.4: ; %if
+; GFX1032GISEL-NEXT:    s_mov_b32 s6, exec_lo
+; GFX1032GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
+; GFX1032GISEL-NEXT:  .LBB2_5: ; =>This Inner Loop Header: Depth=1
+; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s7, s6
+; GFX1032GISEL-NEXT:    v_readlane_b32 s8, v3, s7
+; GFX1032GISEL-NEXT:    s_bitset0_b32 s6, s7
+; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s6, 0
+; GFX1032GISEL-NEXT:    v_max_f32_e64 v2, s4, s8
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s4, v2
+; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB2_5
+; GFX1032GISEL-NEXT:  .LBB2_6: ; %endif
+; GFX1032GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s5
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v2, s4
+; GFX1032GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1032GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1164DAGISEL-LABEL: divergent_cfg_float:
+; GFX1164DAGISEL:       ; %bb.0: ; %entry
+; GFX1164DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
+; GFX1164DAGISEL-NEXT:    ; implicit-def: $vgpr4
+; GFX1164DAGISEL-NEXT:    s_and_saveexec_b64 s[0:1], vcc
+; GFX1164DAGISEL-NEXT:    s_xor_b64 s[0:1], exec, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1164DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164DAGISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
+; GFX1164DAGISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164DAGISEL-NEXT:    v_readlane_b32 s6, v2, s5
+; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[2:3], s5
+; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
+; GFX1164DAGISEL-NEXT:    v_max_f32_e64 v3, s4, s6
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s4, v3
+; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1164DAGISEL-NEXT:  ; %bb.3:
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v4, s4
+; GFX1164DAGISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1164DAGISEL-NEXT:  .LBB2_4: ; %Flow
+; GFX1164DAGISEL-NEXT:    s_and_not1_saveexec_b64 s[0:1], s[0:1]
+; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB2_8
+; GFX1164DAGISEL-NEXT:  ; %bb.5: ; %if
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164DAGISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
+; GFX1164DAGISEL-NEXT:  .LBB2_6: ; =>This Inner Loop Header: Depth=1
+; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164DAGISEL-NEXT:    v_readlane_b32 s6, v3, s5
+; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[2:3], s5
+; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
+; GFX1164DAGISEL-NEXT:    v_max_f32_e64 v2, s4, s6
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s4, v2
+; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_6
+; GFX1164DAGISEL-NEXT:  ; %bb.7:
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v4, s4
+; GFX1164DAGISEL-NEXT:  .LBB2_8: ; %endif
+; GFX1164DAGISEL-NEXT:    s_or_b64 exec, exec, s[0:1]
+; GFX1164DAGISEL-NEXT:    global_store_b32 v[0:1], v4, off
+; GFX1164DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1164GISEL-LABEL: divergent_cfg_float:
+; GFX1164GISEL:       ; %bb.0: ; %entry
+; GFX1164GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1164GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164GISEL-NEXT:    ; implicit-def: $sgpr4
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_cmpx_le_u32_e32 16, v4
+; GFX1164GISEL-NEXT:    s_xor_b64 s[0:1], exec, s[0:1]
+; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB2_3
+; GFX1164GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1164GISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
+; GFX1164GISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164GISEL-NEXT:    v_readlane_b32 s6, v2, s5
+; GFX1164GISEL-NEXT:    s_bitset0_b64 s[2:3], s5
+; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
+; GFX1164GISEL-NEXT:    v_max_f32_e64 v3, s4, s6
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s4, v3
+; GFX1164GISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1164GISEL-NEXT:  .LBB2_3: ; %Flow
+; GFX1164GISEL-NEXT:    s_and_not1_saveexec_b64 s[0:1], s[0:1]
+; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB2_6
+; GFX1164GISEL-NEXT:  ; %bb.4: ; %if
+; GFX1164GISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
+; GFX1164GISEL-NEXT:  .LBB2_5: ; =>This Inner Loop Header: Depth=1
+; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164GISEL-NEXT:    v_readlane_b32 s6, v3, s5
+; GFX1164GISEL-NEXT:    s_bitset0_b64 s[2:3], s5
+; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
+; GFX1164GISEL-NEXT:    v_max_f32_e64 v2, s4, s6
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s4, v2
+; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB2_5
+; GFX1164GISEL-NEXT:  .LBB2_6: ; %endif
+; GFX1164GISEL-NEXT:    s_or_b64 exec, exec, s[0:1]
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v2, s4
+; GFX1164GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1164GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1132DAGISEL-LABEL: divergent_cfg_float:
+; GFX1132DAGISEL:       ; %bb.0: ; %entry
+; GFX1132DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc_lo, 15, v4
+; GFX1132DAGISEL-NEXT:    ; implicit-def: $vgpr4
+; GFX1132DAGISEL-NEXT:    s_and_saveexec_b32 s0, vcc_lo
+; GFX1132DAGISEL-NEXT:    s_xor_b32 s0, exec_lo, s0
+; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1132DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, exec_lo
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
+; GFX1132DAGISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s3, s1
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    v_readlane_b32 s4, v2, s3
+; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s1, s3
+; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s1, 0
+; GFX1132DAGISEL-NEXT:    v_max_f32_e64 v3, s2, s4
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s2, v3
+; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1132DAGISEL-NEXT:  ; %bb.3:
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v4, s2
+; GFX1132DAGISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1132DAGISEL-NEXT:  .LBB2_4: ; %Flow
+; GFX1132DAGISEL-NEXT:    s_and_not1_saveexec_b32 s0, s0
+; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB2_8
+; GFX1132DAGISEL-NEXT:  ; %bb.5: ; %if
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, exec_lo
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
+; GFX1132DAGISEL-NEXT:  .LBB2_6: ; =>This Inner Loop Header: Depth=1
+; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s3, s1
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    v_readlane_b32 s4, v3, s3
+; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s1, s3
+; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s1, 0
+; GFX1132DAGISEL-NEXT:    v_max_f32_e64 v2, s2, s4
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_6
+; GFX1132DAGISEL-NEXT:  ; %bb.7:
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v4, s2
+; GFX1132DAGISEL-NEXT:  .LBB2_8: ; %endif
+; GFX1132DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX1132DAGISEL-NEXT:    global_store_b32 v[0:1], v4, off
+; GFX1132DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1132GISEL-LABEL: divergent_cfg_float:
+; GFX1132GISEL:       ; %bb.0: ; %entry
+; GFX1132GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1132GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1132GISEL-NEXT:    s_mov_b32 s1, exec_lo
+; GFX1132GISEL-NEXT:    ; implicit-def: $sgpr0
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_cmpx_le_u32_e32 16, v4
+; GFX1132GISEL-NEXT:    s_xor_b32 s1, exec_lo, s1
+; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB2_3
+; GFX1132GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1132GISEL-NEXT:    s_mov_b32 s0, 0x7fc00000
+; GFX1132GISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s3, s2
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132GISEL-NEXT:    v_readlane_b32 s4, v2, s3
+; GFX1132GISEL-NEXT:    s_bitset0_b32 s2, s3
+; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s2, 0
+; GFX1132GISEL-NEXT:    v_max_f32_e64 v3, s0, s4
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s0, v3
+; GFX1132GISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1132GISEL-NEXT:  .LBB2_3: ; %Flow
+; GFX1132GISEL-NEXT:    s_and_not1_saveexec_b32 s1, s1
+; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB2_6
+; GFX1132GISEL-NEXT:  ; %bb.4: ; %if
+; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1132GISEL-NEXT:    s_mov_b32 s0, 0x7fc00000
+; GFX1132GISEL-NEXT:  .LBB2_5: ; =>This Inner Loop Header: Depth=1
+; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s3, s2
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132GISEL-NEXT:    v_readlane_b32 s4, v3, s3
+; GFX1132GISEL-NEXT:    s_bitset0_b32 s2, s3
+; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s2, 0
+; GFX1132GISEL-NEXT:    v_max_f32_e64 v2, s0, s4
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s0, v2
+; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB2_5
+; GFX1132GISEL-NEXT:  .LBB2_6: ; %endif
+; GFX1132GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s1
+; GFX1132GISEL-NEXT:    v_mov_b32_e32 v2, s0
+; GFX1132GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1132GISEL-NEXT:    s_setpc_b64 s[30:31]
+entry:
+  %tid = call i32 @llvm.amdgcn.workitem.id.x()
+  %d_cmp = icmp ult i32 %tid, 16
+  br i1 %d_cmp, label %if, label %else
+
+if:
+  %reducedValTid = call float @llvm.amdgcn.wave.reduce.fmax(float %in2, i32 1)
+  br label %endif
+
+else:
+  %reducedValIn = call float @llvm.amdgcn.wave.reduce.fmax(float %in, i32 1)
+  br label %endif
+
+endif:
+  %combine = phi float [%reducedValTid, %if], [%reducedValIn, %else]
+  store float %combine, ptr addrspace(1) %out
+  ret void
+}
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; GFX11DAGISEL: {{.*}}
+; GFX11GISEL: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fmin.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fmin.ll
new file mode 100644
index 0000000000000..cb093cb14c4b5
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fmin.ll
@@ -0,0 +1,928 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=amdgcn -mcpu=tonga -global-isel=0 < %s | FileCheck  -check-prefixes=GFX8DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga -global-isel=1 < %s | FileCheck  -check-prefixes=GFX8GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=0 < %s | FileCheck  -check-prefixes=GFX9DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=1 < %s | FileCheck  -check-prefixes=GFX9GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=0 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX10DAGISEL,GFX1064DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=1 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX10GISEL,GFX1064GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=0 < %s | FileCheck -check-prefixes=GFX10DAGISEL,GFX1032DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=1 < %s | FileCheck -check-prefixes=GFX10GISEL,GFX1032GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX11DAGISEL,GFX1164DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX11GISEL,GFX1164GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 < %s | FileCheck -check-prefixes=GFX11DAGISEL,GFX1132DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 < %s | FileCheck -check-prefixes=GFX11GISEL,GFX1132GISEL %s
+
+
+define amdgpu_kernel void @uniform_value_float(ptr addrspace(1) %out, float %in) {
+; GFX8DAGISEL-LABEL: uniform_value_float:
+; GFX8DAGISEL:       ; %bb.0: ; %entry
+; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8DAGISEL-NEXT:    s_endpgm
+;
+; GFX8GISEL-LABEL: uniform_value_float:
+; GFX8GISEL:       ; %bb.0: ; %entry
+; GFX8GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8GISEL-NEXT:    s_endpgm
+;
+; GFX9DAGISEL-LABEL: uniform_value_float:
+; GFX9DAGISEL:       ; %bb.0: ; %entry
+; GFX9DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX9DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX9DAGISEL-NEXT:    s_endpgm
+;
+; GFX9GISEL-LABEL: uniform_value_float:
+; GFX9GISEL:       ; %bb.0: ; %entry
+; GFX9GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX9GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX9GISEL-NEXT:    s_endpgm
+;
+; GFX10DAGISEL-LABEL: uniform_value_float:
+; GFX10DAGISEL:       ; %bb.0: ; %entry
+; GFX10DAGISEL-NEXT:    s_clause 0x1
+; GFX10DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX10DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX10DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX10DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX10DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX10DAGISEL-NEXT:    s_endpgm
+;
+; GFX10GISEL-LABEL: uniform_value_float:
+; GFX10GISEL:       ; %bb.0: ; %entry
+; GFX10GISEL-NEXT:    s_clause 0x1
+; GFX10GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX10GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX10GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX10GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX10GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX10GISEL-NEXT:    s_endpgm
+;
+; GFX1164DAGISEL-LABEL: uniform_value_float:
+; GFX1164DAGISEL:       ; %bb.0: ; %entry
+; GFX1164DAGISEL-NEXT:    s_clause 0x1
+; GFX1164DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX1164DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_endpgm
+;
+; GFX1164GISEL-LABEL: uniform_value_float:
+; GFX1164GISEL:       ; %bb.0: ; %entry
+; GFX1164GISEL-NEXT:    s_clause 0x1
+; GFX1164GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1164GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1164GISEL-NEXT:    s_endpgm
+;
+; GFX1132DAGISEL-LABEL: uniform_value_float:
+; GFX1132DAGISEL:       ; %bb.0: ; %entry
+; GFX1132DAGISEL-NEXT:    s_clause 0x1
+; GFX1132DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
+; GFX1132DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
+; GFX1132DAGISEL-NEXT:    s_endpgm
+;
+; GFX1132GISEL-LABEL: uniform_value_float:
+; GFX1132GISEL:       ; %bb.0: ; %entry
+; GFX1132GISEL-NEXT:    s_clause 0x1
+; GFX1132GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132GISEL-NEXT:    v_dual_mov_b32 v1, 0 :: v_dual_mov_b32 v0, s2
+; GFX1132GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1132GISEL-NEXT:    s_endpgm
+entry:
+  %result = call float @llvm.amdgcn.wave.reduce.fmin(float %in, i32 1)
+  store float %result, ptr addrspace(1) %out
+  ret void
+}
+
+define void @divergent_value_float(ptr addrspace(1) %out, float %in) {
+; GFX8DAGISEL-LABEL: divergent_value_float:
+; GFX8DAGISEL:       ; %bb.0: ; %entry
+; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX8DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX8DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX8DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX8DAGISEL-NEXT:    v_min_f32_e32 v3, s6, v3
+; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX8DAGISEL-NEXT:  ; %bb.2:
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX8DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX8GISEL-LABEL: divergent_value_float:
+; GFX8GISEL:       ; %bb.0: ; %entry
+; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8GISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX8GISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX8GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX8GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX8GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX8GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX8GISEL-NEXT:    v_min_f32_e32 v3, s6, v3
+; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX8GISEL-NEXT:  ; %bb.2:
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX8GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9DAGISEL-LABEL: divergent_value_float:
+; GFX9DAGISEL:       ; %bb.0: ; %entry
+; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX9DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX9DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX9DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX9DAGISEL-NEXT:    v_min_f32_e32 v3, s6, v3
+; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX9DAGISEL-NEXT:  ; %bb.2:
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX9DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX9DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9GISEL-LABEL: divergent_value_float:
+; GFX9GISEL:       ; %bb.0: ; %entry
+; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9GISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX9GISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX9GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX9GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX9GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX9GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX9GISEL-NEXT:    v_min_f32_e32 v3, s6, v3
+; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX9GISEL-NEXT:  ; %bb.2:
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX9GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX9GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1064DAGISEL-LABEL: divergent_value_float:
+; GFX1064DAGISEL:       ; %bb.0: ; %entry
+; GFX1064DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX1064DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX1064DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX1064DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX1064DAGISEL-NEXT:    v_min_f32_e64 v3, s6, s8
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1064DAGISEL-NEXT:  ; %bb.2:
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX1064DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1064DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1064GISEL-LABEL: divergent_value_float:
+; GFX1064GISEL:       ; %bb.0: ; %entry
+; GFX1064GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1064GISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX1064GISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX1064GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX1064GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX1064GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX1064GISEL-NEXT:    v_min_f32_e64 v3, s6, s8
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1064GISEL-NEXT:  ; %bb.2:
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX1064GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1064GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1032DAGISEL-LABEL: divergent_value_float:
+; GFX1032DAGISEL:       ; %bb.0: ; %entry
+; GFX1032DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s4, exec_lo
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, 0x7fc00000
+; GFX1032DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s6, s4
+; GFX1032DAGISEL-NEXT:    v_readlane_b32 s7, v2, s6
+; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s4, s6
+; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s4, 0
+; GFX1032DAGISEL-NEXT:    v_min_f32_e64 v3, s5, s7
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s5, v3
+; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1032DAGISEL-NEXT:  ; %bb.2:
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v2, s5
+; GFX1032DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1032DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1032GISEL-LABEL: divergent_value_float:
+; GFX1032GISEL:       ; %bb.0: ; %entry
+; GFX1032GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1032GISEL-NEXT:    s_mov_b32 s4, exec_lo
+; GFX1032GISEL-NEXT:    s_mov_b32 s5, 0x7fc00000
+; GFX1032GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s6, s4
+; GFX1032GISEL-NEXT:    v_readlane_b32 s7, v2, s6
+; GFX1032GISEL-NEXT:    s_bitset0_b32 s4, s6
+; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s4, 0
+; GFX1032GISEL-NEXT:    v_min_f32_e64 v3, s5, s7
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s5, v3
+; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1032GISEL-NEXT:  ; %bb.2:
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v2, s5
+; GFX1032GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1032GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1164DAGISEL-LABEL: divergent_value_float:
+; GFX1164DAGISEL:       ; %bb.0: ; %entry
+; GFX1164DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164DAGISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
+; GFX1164DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164DAGISEL-NEXT:    v_readlane_b32 s4, v2, s3
+; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[0:1], s3
+; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
+; GFX1164DAGISEL-NEXT:    v_min_f32_e64 v3, s2, s4
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s2, v3
+; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1164DAGISEL-NEXT:  ; %bb.2:
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX1164DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1164DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1164GISEL-LABEL: divergent_value_float:
+; GFX1164GISEL:       ; %bb.0: ; %entry
+; GFX1164GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164GISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
+; GFX1164GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164GISEL-NEXT:    v_readlane_b32 s4, v2, s3
+; GFX1164GISEL-NEXT:    s_bitset0_b64 s[0:1], s3
+; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
+; GFX1164GISEL-NEXT:    v_min_f32_e64 v3, s2, s4
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s2, v3
+; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1164GISEL-NEXT:  ; %bb.2:
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX1164GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1164GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1132DAGISEL-LABEL: divergent_value_float:
+; GFX1132DAGISEL:       ; %bb.0: ; %entry
+; GFX1132DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, 0x7fc00000
+; GFX1132DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s2, s0
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    v_readlane_b32 s3, v2, s2
+; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s0, s2
+; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s0, 0
+; GFX1132DAGISEL-NEXT:    v_min_f32_e64 v3, s1, s3
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s1, v3
+; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1132DAGISEL-NEXT:  ; %bb.2:
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
+; GFX1132DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1132DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1132GISEL-LABEL: divergent_value_float:
+; GFX1132GISEL:       ; %bb.0: ; %entry
+; GFX1132GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1132GISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1132GISEL-NEXT:    s_mov_b32 s1, 0x7fc00000
+; GFX1132GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s2, s0
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132GISEL-NEXT:    v_readlane_b32 s3, v2, s2
+; GFX1132GISEL-NEXT:    s_bitset0_b32 s0, s2
+; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s0, 0
+; GFX1132GISEL-NEXT:    v_min_f32_e64 v3, s1, s3
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s1, v3
+; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1132GISEL-NEXT:  ; %bb.2:
+; GFX1132GISEL-NEXT:    v_mov_b32_e32 v2, s1
+; GFX1132GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1132GISEL-NEXT:    s_setpc_b64 s[30:31]
+entry:
+  %result = call float @llvm.amdgcn.wave.reduce.fmin(float %in, i32 1)
+  store float %result, ptr addrspace(1) %out
+  ret void
+}
+
+define void @divergent_cfg_float(ptr addrspace(1) %out, float %in, float %in2) {
+; GFX8DAGISEL-LABEL: divergent_cfg_float:
+; GFX8DAGISEL:       ; %bb.0: ; %entry
+; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX8DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
+; GFX8DAGISEL-NEXT:    ; implicit-def: $vgpr4
+; GFX8DAGISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
+; GFX8DAGISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
+; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX8DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX8DAGISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX8DAGISEL-NEXT:    v_readlane_b32 s10, v2, s9
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v3, s10
+; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX8DAGISEL-NEXT:    v_min_f32_e32 v3, s8, v3
+; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s8, v3
+; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX8DAGISEL-NEXT:  ; %bb.3:
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
+; GFX8DAGISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX8DAGISEL-NEXT:  .LBB2_4: ; %Flow
+; GFX8DAGISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
+; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB2_8
+; GFX8DAGISEL-NEXT:  ; %bb.5: ; %if
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX8DAGISEL-NEXT:  .LBB2_6: ; =>This Inner Loop Header: Depth=1
+; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX8DAGISEL-NEXT:    v_readlane_b32 s10, v3, s9
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s10
+; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX8DAGISEL-NEXT:    v_min_f32_e32 v2, s8, v2
+; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s8, v2
+; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_6
+; GFX8DAGISEL-NEXT:  ; %bb.7:
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
+; GFX8DAGISEL-NEXT:  .LBB2_8: ; %endif
+; GFX8DAGISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v4
+; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX8DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX8GISEL-LABEL: divergent_cfg_float:
+; GFX8GISEL:       ; %bb.0: ; %entry
+; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX8GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v4
+; GFX8GISEL-NEXT:    ; implicit-def: $sgpr8
+; GFX8GISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
+; GFX8GISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
+; GFX8GISEL-NEXT:    s_cbranch_execz .LBB2_3
+; GFX8GISEL-NEXT:  ; %bb.1: ; %else
+; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX8GISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX8GISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX8GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX8GISEL-NEXT:    v_readlane_b32 s10, v2, s9
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v4, s10
+; GFX8GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX8GISEL-NEXT:    v_min_f32_e32 v4, s8, v4
+; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s8, v4
+; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX8GISEL-NEXT:  .LBB2_3: ; %Flow
+; GFX8GISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
+; GFX8GISEL-NEXT:    s_cbranch_execz .LBB2_6
+; GFX8GISEL-NEXT:  ; %bb.4: ; %if
+; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX8GISEL-NEXT:  .LBB2_5: ; =>This Inner Loop Header: Depth=1
+; GFX8GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX8GISEL-NEXT:    v_readlane_b32 s10, v3, s9
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s10
+; GFX8GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX8GISEL-NEXT:    v_min_f32_e32 v2, s8, v2
+; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s8, v2
+; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB2_5
+; GFX8GISEL-NEXT:  .LBB2_6: ; %endif
+; GFX8GISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s8
+; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX8GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9DAGISEL-LABEL: divergent_cfg_float:
+; GFX9DAGISEL:       ; %bb.0: ; %entry
+; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX9DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
+; GFX9DAGISEL-NEXT:    ; implicit-def: $vgpr4
+; GFX9DAGISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
+; GFX9DAGISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
+; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX9DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX9DAGISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX9DAGISEL-NEXT:    v_readlane_b32 s10, v2, s9
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v3, s10
+; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX9DAGISEL-NEXT:    v_min_f32_e32 v3, s8, v3
+; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s8, v3
+; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX9DAGISEL-NEXT:  ; %bb.3:
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
+; GFX9DAGISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX9DAGISEL-NEXT:  .LBB2_4: ; %Flow
+; GFX9DAGISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
+; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB2_8
+; GFX9DAGISEL-NEXT:  ; %bb.5: ; %if
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX9DAGISEL-NEXT:  .LBB2_6: ; =>This Inner Loop Header: Depth=1
+; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX9DAGISEL-NEXT:    v_readlane_b32 s10, v3, s9
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v2, s10
+; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX9DAGISEL-NEXT:    v_min_f32_e32 v2, s8, v2
+; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s8, v2
+; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_6
+; GFX9DAGISEL-NEXT:  ; %bb.7:
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
+; GFX9DAGISEL-NEXT:  .LBB2_8: ; %endif
+; GFX9DAGISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX9DAGISEL-NEXT:    global_store_dword v[0:1], v4, off
+; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX9DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9GISEL-LABEL: divergent_cfg_float:
+; GFX9GISEL:       ; %bb.0: ; %entry
+; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX9GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v4
+; GFX9GISEL-NEXT:    ; implicit-def: $sgpr8
+; GFX9GISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
+; GFX9GISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
+; GFX9GISEL-NEXT:    s_cbranch_execz .LBB2_3
+; GFX9GISEL-NEXT:  ; %bb.1: ; %else
+; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX9GISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX9GISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX9GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX9GISEL-NEXT:    v_readlane_b32 s10, v2, s9
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v4, s10
+; GFX9GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX9GISEL-NEXT:    v_min_f32_e32 v4, s8, v4
+; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s8, v4
+; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX9GISEL-NEXT:  .LBB2_3: ; %Flow
+; GFX9GISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
+; GFX9GISEL-NEXT:    s_cbranch_execz .LBB2_6
+; GFX9GISEL-NEXT:  ; %bb.4: ; %if
+; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX9GISEL-NEXT:  .LBB2_5: ; =>This Inner Loop Header: Depth=1
+; GFX9GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX9GISEL-NEXT:    v_readlane_b32 s10, v3, s9
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s10
+; GFX9GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX9GISEL-NEXT:    v_min_f32_e32 v2, s8, v2
+; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s8, v2
+; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB2_5
+; GFX9GISEL-NEXT:  .LBB2_6: ; %endif
+; GFX9GISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s8
+; GFX9GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX9GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1064DAGISEL-LABEL: divergent_cfg_float:
+; GFX1064DAGISEL:       ; %bb.0: ; %entry
+; GFX1064DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1064DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
+; GFX1064DAGISEL-NEXT:    ; implicit-def: $vgpr4
+; GFX1064DAGISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
+; GFX1064DAGISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
+; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1064DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX1064DAGISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX1064DAGISEL-NEXT:    v_readlane_b32 s10, v2, s9
+; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX1064DAGISEL-NEXT:    v_min_f32_e64 v3, s8, s10
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s8, v3
+; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1064DAGISEL-NEXT:  ; %bb.3:
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
+; GFX1064DAGISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1064DAGISEL-NEXT:  .LBB2_4: ; %Flow
+; GFX1064DAGISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
+; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB2_8
+; GFX1064DAGISEL-NEXT:  ; %bb.5: ; %if
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX1064DAGISEL-NEXT:  .LBB2_6: ; =>This Inner Loop Header: Depth=1
+; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX1064DAGISEL-NEXT:    v_readlane_b32 s10, v3, s9
+; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX1064DAGISEL-NEXT:    v_min_f32_e64 v2, s8, s10
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s8, v2
+; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_6
+; GFX1064DAGISEL-NEXT:  ; %bb.7:
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
+; GFX1064DAGISEL-NEXT:  .LBB2_8: ; %endif
+; GFX1064DAGISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX1064DAGISEL-NEXT:    global_store_dword v[0:1], v4, off
+; GFX1064DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1064GISEL-LABEL: divergent_cfg_float:
+; GFX1064GISEL:       ; %bb.0: ; %entry
+; GFX1064GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1064GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1064GISEL-NEXT:    ; implicit-def: $sgpr8
+; GFX1064GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v4
+; GFX1064GISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
+; GFX1064GISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
+; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB2_3
+; GFX1064GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX1064GISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX1064GISEL-NEXT:    v_readlane_b32 s10, v2, s9
+; GFX1064GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX1064GISEL-NEXT:    v_min_f32_e64 v3, s8, s10
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s8, v3
+; GFX1064GISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1064GISEL-NEXT:  .LBB2_3: ; %Flow
+; GFX1064GISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
+; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB2_6
+; GFX1064GISEL-NEXT:  ; %bb.4: ; %if
+; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
+; GFX1064GISEL-NEXT:  .LBB2_5: ; =>This Inner Loop Header: Depth=1
+; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
+; GFX1064GISEL-NEXT:    v_readlane_b32 s10, v3, s9
+; GFX1064GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
+; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
+; GFX1064GISEL-NEXT:    v_min_f32_e64 v2, s8, s10
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s8, v2
+; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB2_5
+; GFX1064GISEL-NEXT:  .LBB2_6: ; %endif
+; GFX1064GISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v2, s8
+; GFX1064GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1064GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1032DAGISEL-LABEL: divergent_cfg_float:
+; GFX1032DAGISEL:       ; %bb.0: ; %entry
+; GFX1032DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1032DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc_lo, 15, v4
+; GFX1032DAGISEL-NEXT:    ; implicit-def: $vgpr4
+; GFX1032DAGISEL-NEXT:    s_and_saveexec_b32 s4, vcc_lo
+; GFX1032DAGISEL-NEXT:    s_xor_b32 s4, exec_lo, s4
+; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1032DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, exec_lo
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX1032DAGISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s7, s5
+; GFX1032DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s5, s7
+; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s5, 0
+; GFX1032DAGISEL-NEXT:    v_min_f32_e64 v3, s6, s8
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1032DAGISEL-NEXT:  ; %bb.3:
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v4, s6
+; GFX1032DAGISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1032DAGISEL-NEXT:  .LBB2_4: ; %Flow
+; GFX1032DAGISEL-NEXT:    s_andn2_saveexec_b32 s4, s4
+; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB2_8
+; GFX1032DAGISEL-NEXT:  ; %bb.5: ; %if
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, exec_lo
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
+; GFX1032DAGISEL-NEXT:  .LBB2_6: ; =>This Inner Loop Header: Depth=1
+; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s7, s5
+; GFX1032DAGISEL-NEXT:    v_readlane_b32 s8, v3, s7
+; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s5, s7
+; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s5, 0
+; GFX1032DAGISEL-NEXT:    v_min_f32_e64 v2, s6, s8
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s6, v2
+; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_6
+; GFX1032DAGISEL-NEXT:  ; %bb.7:
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v4, s6
+; GFX1032DAGISEL-NEXT:  .LBB2_8: ; %endif
+; GFX1032DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s4
+; GFX1032DAGISEL-NEXT:    global_store_dword v[0:1], v4, off
+; GFX1032DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1032GISEL-LABEL: divergent_cfg_float:
+; GFX1032GISEL:       ; %bb.0: ; %entry
+; GFX1032GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1032GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1032GISEL-NEXT:    ; implicit-def: $sgpr4
+; GFX1032GISEL-NEXT:    v_cmp_le_u32_e32 vcc_lo, 16, v4
+; GFX1032GISEL-NEXT:    s_and_saveexec_b32 s5, vcc_lo
+; GFX1032GISEL-NEXT:    s_xor_b32 s5, exec_lo, s5
+; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB2_3
+; GFX1032GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1032GISEL-NEXT:    s_mov_b32 s6, exec_lo
+; GFX1032GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
+; GFX1032GISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s7, s6
+; GFX1032GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX1032GISEL-NEXT:    s_bitset0_b32 s6, s7
+; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s6, 0
+; GFX1032GISEL-NEXT:    v_min_f32_e64 v3, s4, s8
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s4, v3
+; GFX1032GISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1032GISEL-NEXT:  .LBB2_3: ; %Flow
+; GFX1032GISEL-NEXT:    s_andn2_saveexec_b32 s5, s5
+; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB2_6
+; GFX1032GISEL-NEXT:  ; %bb.4: ; %if
+; GFX1032GISEL-NEXT:    s_mov_b32 s6, exec_lo
+; GFX1032GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
+; GFX1032GISEL-NEXT:  .LBB2_5: ; =>This Inner Loop Header: Depth=1
+; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s7, s6
+; GFX1032GISEL-NEXT:    v_readlane_b32 s8, v3, s7
+; GFX1032GISEL-NEXT:    s_bitset0_b32 s6, s7
+; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s6, 0
+; GFX1032GISEL-NEXT:    v_min_f32_e64 v2, s4, s8
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s4, v2
+; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB2_5
+; GFX1032GISEL-NEXT:  .LBB2_6: ; %endif
+; GFX1032GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s5
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v2, s4
+; GFX1032GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1032GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1164DAGISEL-LABEL: divergent_cfg_float:
+; GFX1164DAGISEL:       ; %bb.0: ; %entry
+; GFX1164DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
+; GFX1164DAGISEL-NEXT:    ; implicit-def: $vgpr4
+; GFX1164DAGISEL-NEXT:    s_and_saveexec_b64 s[0:1], vcc
+; GFX1164DAGISEL-NEXT:    s_xor_b64 s[0:1], exec, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1164DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164DAGISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
+; GFX1164DAGISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164DAGISEL-NEXT:    v_readlane_b32 s6, v2, s5
+; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[2:3], s5
+; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
+; GFX1164DAGISEL-NEXT:    v_min_f32_e64 v3, s4, s6
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s4, v3
+; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1164DAGISEL-NEXT:  ; %bb.3:
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v4, s4
+; GFX1164DAGISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1164DAGISEL-NEXT:  .LBB2_4: ; %Flow
+; GFX1164DAGISEL-NEXT:    s_and_not1_saveexec_b64 s[0:1], s[0:1]
+; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB2_8
+; GFX1164DAGISEL-NEXT:  ; %bb.5: ; %if
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164DAGISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
+; GFX1164DAGISEL-NEXT:  .LBB2_6: ; =>This Inner Loop Header: Depth=1
+; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164DAGISEL-NEXT:    v_readlane_b32 s6, v3, s5
+; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[2:3], s5
+; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
+; GFX1164DAGISEL-NEXT:    v_min_f32_e64 v2, s4, s6
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s4, v2
+; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_6
+; GFX1164DAGISEL-NEXT:  ; %bb.7:
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v4, s4
+; GFX1164DAGISEL-NEXT:  .LBB2_8: ; %endif
+; GFX1164DAGISEL-NEXT:    s_or_b64 exec, exec, s[0:1]
+; GFX1164DAGISEL-NEXT:    global_store_b32 v[0:1], v4, off
+; GFX1164DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1164GISEL-LABEL: divergent_cfg_float:
+; GFX1164GISEL:       ; %bb.0: ; %entry
+; GFX1164GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1164GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164GISEL-NEXT:    ; implicit-def: $sgpr4
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_cmpx_le_u32_e32 16, v4
+; GFX1164GISEL-NEXT:    s_xor_b64 s[0:1], exec, s[0:1]
+; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB2_3
+; GFX1164GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1164GISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
+; GFX1164GISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164GISEL-NEXT:    v_readlane_b32 s6, v2, s5
+; GFX1164GISEL-NEXT:    s_bitset0_b64 s[2:3], s5
+; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
+; GFX1164GISEL-NEXT:    v_min_f32_e64 v3, s4, s6
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s4, v3
+; GFX1164GISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1164GISEL-NEXT:  .LBB2_3: ; %Flow
+; GFX1164GISEL-NEXT:    s_and_not1_saveexec_b64 s[0:1], s[0:1]
+; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB2_6
+; GFX1164GISEL-NEXT:  ; %bb.4: ; %if
+; GFX1164GISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
+; GFX1164GISEL-NEXT:  .LBB2_5: ; =>This Inner Loop Header: Depth=1
+; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164GISEL-NEXT:    v_readlane_b32 s6, v3, s5
+; GFX1164GISEL-NEXT:    s_bitset0_b64 s[2:3], s5
+; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
+; GFX1164GISEL-NEXT:    v_min_f32_e64 v2, s4, s6
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s4, v2
+; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB2_5
+; GFX1164GISEL-NEXT:  .LBB2_6: ; %endif
+; GFX1164GISEL-NEXT:    s_or_b64 exec, exec, s[0:1]
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v2, s4
+; GFX1164GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1164GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1132DAGISEL-LABEL: divergent_cfg_float:
+; GFX1132DAGISEL:       ; %bb.0: ; %entry
+; GFX1132DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc_lo, 15, v4
+; GFX1132DAGISEL-NEXT:    ; implicit-def: $vgpr4
+; GFX1132DAGISEL-NEXT:    s_and_saveexec_b32 s0, vcc_lo
+; GFX1132DAGISEL-NEXT:    s_xor_b32 s0, exec_lo, s0
+; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1132DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, exec_lo
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
+; GFX1132DAGISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s3, s1
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    v_readlane_b32 s4, v2, s3
+; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s1, s3
+; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s1, 0
+; GFX1132DAGISEL-NEXT:    v_min_f32_e64 v3, s2, s4
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s2, v3
+; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1132DAGISEL-NEXT:  ; %bb.3:
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v4, s2
+; GFX1132DAGISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1132DAGISEL-NEXT:  .LBB2_4: ; %Flow
+; GFX1132DAGISEL-NEXT:    s_and_not1_saveexec_b32 s0, s0
+; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB2_8
+; GFX1132DAGISEL-NEXT:  ; %bb.5: ; %if
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, exec_lo
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
+; GFX1132DAGISEL-NEXT:  .LBB2_6: ; =>This Inner Loop Header: Depth=1
+; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s3, s1
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    v_readlane_b32 s4, v3, s3
+; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s1, s3
+; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s1, 0
+; GFX1132DAGISEL-NEXT:    v_min_f32_e64 v2, s2, s4
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s2, v2
+; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB2_6
+; GFX1132DAGISEL-NEXT:  ; %bb.7:
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v4, s2
+; GFX1132DAGISEL-NEXT:  .LBB2_8: ; %endif
+; GFX1132DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX1132DAGISEL-NEXT:    global_store_b32 v[0:1], v4, off
+; GFX1132DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1132GISEL-LABEL: divergent_cfg_float:
+; GFX1132GISEL:       ; %bb.0: ; %entry
+; GFX1132GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1132GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
+; GFX1132GISEL-NEXT:    s_mov_b32 s1, exec_lo
+; GFX1132GISEL-NEXT:    ; implicit-def: $sgpr0
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_cmpx_le_u32_e32 16, v4
+; GFX1132GISEL-NEXT:    s_xor_b32 s1, exec_lo, s1
+; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB2_3
+; GFX1132GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1132GISEL-NEXT:    s_mov_b32 s0, 0x7fc00000
+; GFX1132GISEL-NEXT:  .LBB2_2: ; =>This Inner Loop Header: Depth=1
+; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s3, s2
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132GISEL-NEXT:    v_readlane_b32 s4, v2, s3
+; GFX1132GISEL-NEXT:    s_bitset0_b32 s2, s3
+; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s2, 0
+; GFX1132GISEL-NEXT:    v_min_f32_e64 v3, s0, s4
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s0, v3
+; GFX1132GISEL-NEXT:    ; implicit-def: $vgpr3
+; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB2_2
+; GFX1132GISEL-NEXT:  .LBB2_3: ; %Flow
+; GFX1132GISEL-NEXT:    s_and_not1_saveexec_b32 s1, s1
+; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB2_6
+; GFX1132GISEL-NEXT:  ; %bb.4: ; %if
+; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1132GISEL-NEXT:    s_mov_b32 s0, 0x7fc00000
+; GFX1132GISEL-NEXT:  .LBB2_5: ; =>This Inner Loop Header: Depth=1
+; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s3, s2
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132GISEL-NEXT:    v_readlane_b32 s4, v3, s3
+; GFX1132GISEL-NEXT:    s_bitset0_b32 s2, s3
+; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s2, 0
+; GFX1132GISEL-NEXT:    v_min_f32_e64 v2, s0, s4
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s0, v2
+; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB2_5
+; GFX1132GISEL-NEXT:  .LBB2_6: ; %endif
+; GFX1132GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s1
+; GFX1132GISEL-NEXT:    v_mov_b32_e32 v2, s0
+; GFX1132GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1132GISEL-NEXT:    s_setpc_b64 s[30:31]
+entry:
+  %tid = call i32 @llvm.amdgcn.workitem.id.x()
+  %d_cmp = icmp ult i32 %tid, 16
+  br i1 %d_cmp, label %if, label %else
+
+if:
+  %reducedValTid = call float @llvm.amdgcn.wave.reduce.fmin(float %in2, i32 1)
+  br label %endif
+
+else:
+  %reducedValIn = call float @llvm.amdgcn.wave.reduce.fmin(float %in, i32 1)
+  br label %endif
+
+endif:
+  %combine = phi float [%reducedValTid, %if], [%reducedValIn, %else]
+  store float %combine, ptr addrspace(1) %out
+  ret void
+}
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; GFX11DAGISEL: {{.*}}
+; GFX11GISEL: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fsub.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fsub.ll
new file mode 100644
index 0000000000000..29dfb0b504f81
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.fsub.ll
@@ -0,0 +1,1021 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=amdgcn -mcpu=tonga -global-isel=0 < %s | FileCheck  -check-prefixes=GFX8DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=tonga -global-isel=1 < %s | FileCheck  -check-prefixes=GFX8GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=0 < %s | FileCheck  -check-prefixes=GFX9DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -global-isel=1 < %s | FileCheck  -check-prefixes=GFX9GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=0 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX10DAGISEL,GFX1064DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=1 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX10GISEL,GFX1064GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=0 < %s | FileCheck -check-prefixes=GFX10DAGISEL,GFX1032DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -global-isel=1 < %s | FileCheck -check-prefixes=GFX10GISEL,GFX1032GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX11DAGISEL,GFX1164DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefixes=GFX11GISEL,GFX1164GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 < %s | FileCheck -check-prefixes=GFX11DAGISEL,GFX1132DAGISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 < %s | FileCheck -check-prefixes=GFX11GISEL,GFX1132GISEL %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -global-isel=0 < %s | FileCheck -check-prefixes=GFX12DAGISEL %s
+
+
+define amdgpu_kernel void @uniform_value_float(ptr addrspace(1) %out, float %in) {
+; GFX8DAGISEL-LABEL: uniform_value_float:
+; GFX8DAGISEL:       ; %bb.0: ; %entry
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX8DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX8DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
+; GFX8DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8DAGISEL-NEXT:    s_endpgm
+;
+; GFX8GISEL-LABEL: uniform_value_float:
+; GFX8GISEL:       ; %bb.0: ; %entry
+; GFX8GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX8GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX8GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
+; GFX8GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8GISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8GISEL-NEXT:    s_endpgm
+;
+; GFX9DAGISEL-LABEL: uniform_value_float:
+; GFX9DAGISEL:       ; %bb.0: ; %entry
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX9DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX9DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
+; GFX9DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX9DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX9DAGISEL-NEXT:    s_endpgm
+;
+; GFX9GISEL-LABEL: uniform_value_float:
+; GFX9GISEL:       ; %bb.0: ; %entry
+; GFX9GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX9GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX9GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
+; GFX9GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9GISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX9GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX9GISEL-NEXT:    s_endpgm
+;
+; GFX1064DAGISEL-LABEL: uniform_value_float:
+; GFX1064DAGISEL:       ; %bb.0: ; %entry
+; GFX1064DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1064DAGISEL-NEXT:    s_bcnt1_i32_b64 s3, s[0:1]
+; GFX1064DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1064DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX1064DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX1064DAGISEL-NEXT:    s_endpgm
+;
+; GFX1064GISEL-LABEL: uniform_value_float:
+; GFX1064GISEL:       ; %bb.0: ; %entry
+; GFX1064GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX1064GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1064GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
+; GFX1064GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1064GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064GISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1064GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1064GISEL-NEXT:    s_endpgm
+;
+; GFX1032DAGISEL-LABEL: uniform_value_float:
+; GFX1032DAGISEL:       ; %bb.0: ; %entry
+; GFX1032DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1032DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s0
+; GFX1032DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1032DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX1032DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
+; GFX1032DAGISEL-NEXT:    s_endpgm
+;
+; GFX1032GISEL-LABEL: uniform_value_float:
+; GFX1032GISEL:       ; %bb.0: ; %entry
+; GFX1032GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
+; GFX1032GISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1032GISEL-NEXT:    s_bcnt1_i32_b32 s0, s0
+; GFX1032GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1032GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032GISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1032GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1032GISEL-NEXT:    s_endpgm
+;
+; GFX1164DAGISEL-LABEL: uniform_value_float:
+; GFX1164DAGISEL:       ; %bb.0: ; %entry
+; GFX1164DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    s_bcnt1_i32_b64 s3, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
+; GFX1164DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_endpgm
+;
+; GFX1164GISEL-LABEL: uniform_value_float:
+; GFX1164GISEL:       ; %bb.0: ; %entry
+; GFX1164GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1164GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164GISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1164GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1164GISEL-NEXT:    s_endpgm
+;
+; GFX1132DAGISEL-LABEL: uniform_value_float:
+; GFX1132DAGISEL:       ; %bb.0: ; %entry
+; GFX1132DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s0
+; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1132DAGISEL-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
+; GFX1132DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
+; GFX1132DAGISEL-NEXT:    s_endpgm
+;
+; GFX1132GISEL-LABEL: uniform_value_float:
+; GFX1132GISEL:       ; %bb.0: ; %entry
+; GFX1132GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
+; GFX1132GISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1132GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1132GISEL-NEXT:    s_bcnt1_i32_b32 s0, s0
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132GISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1132GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1132GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1132GISEL-NEXT:    s_endpgm
+;
+; GFX12DAGISEL-LABEL: uniform_value_float:
+; GFX12DAGISEL:       ; %bb.0: ; %entry
+; GFX12DAGISEL-NEXT:    s_load_b96 s[0:2], s[4:5], 0x24
+; GFX12DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
+; GFX12DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
+; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_va_sdst(0)
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
+; GFX12DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
+; GFX12DAGISEL-NEXT:    s_endpgm
+entry:
+  %result = call float @llvm.amdgcn.wave.reduce.fsub(float %in, i32 1)
+  store float %result, ptr addrspace(1) %out
+  ret void
+}
+
+define void @divergent_value_float(ptr addrspace(1) %out, float %id.x) {
+; GFX8DAGISEL-LABEL: divergent_value_float:
+; GFX8DAGISEL:       ; %bb.0: ; %entry
+; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX8DAGISEL-NEXT:    s_mov_b32 s6, 0
+; GFX8DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX8DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX8DAGISEL-NEXT:    v_sub_f32_e32 v3, s6, v3
+; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX8DAGISEL-NEXT:  ; %bb.2:
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX8DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX8GISEL-LABEL: divergent_value_float:
+; GFX8GISEL:       ; %bb.0: ; %entry
+; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8GISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX8GISEL-NEXT:    s_mov_b32 s6, 0
+; GFX8GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX8GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX8GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX8GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX8GISEL-NEXT:    v_sub_f32_e32 v3, s6, v3
+; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX8GISEL-NEXT:  ; %bb.2:
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX8GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9DAGISEL-LABEL: divergent_value_float:
+; GFX9DAGISEL:       ; %bb.0: ; %entry
+; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX9DAGISEL-NEXT:    s_mov_b32 s6, 0
+; GFX9DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX9DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX9DAGISEL-NEXT:    v_sub_f32_e32 v3, s6, v3
+; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX9DAGISEL-NEXT:  ; %bb.2:
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX9DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX9DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9GISEL-LABEL: divergent_value_float:
+; GFX9GISEL:       ; %bb.0: ; %entry
+; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9GISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX9GISEL-NEXT:    s_mov_b32 s6, 0
+; GFX9GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX9GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX9GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v3, s8
+; GFX9GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX9GISEL-NEXT:    v_sub_f32_e32 v3, s6, v3
+; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX9GISEL-NEXT:  ; %bb.2:
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX9GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0)
+; GFX9GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1064DAGISEL-LABEL: divergent_value_float:
+; GFX1064DAGISEL:       ; %bb.0: ; %entry
+; GFX1064DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX1064DAGISEL-NEXT:    s_mov_b32 s6, 0
+; GFX1064DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX1064DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX1064DAGISEL-NEXT:    v_sub_f32_e64 v3, s6, s8
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1064DAGISEL-NEXT:  ; %bb.2:
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX1064DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1064DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1064GISEL-LABEL: divergent_value_float:
+; GFX1064GISEL:       ; %bb.0: ; %entry
+; GFX1064GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1064GISEL-NEXT:    s_mov_b64 s[4:5], exec
+; GFX1064GISEL-NEXT:    s_mov_b32 s6, 0
+; GFX1064GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
+; GFX1064GISEL-NEXT:    v_readlane_b32 s8, v2, s7
+; GFX1064GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
+; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
+; GFX1064GISEL-NEXT:    v_sub_f32_e64 v3, s6, s8
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v3
+; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1064GISEL-NEXT:  ; %bb.2:
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX1064GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1064GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1032DAGISEL-LABEL: divergent_value_float:
+; GFX1032DAGISEL:       ; %bb.0: ; %entry
+; GFX1032DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s4, exec_lo
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, 0
+; GFX1032DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s6, s4
+; GFX1032DAGISEL-NEXT:    v_readlane_b32 s7, v2, s6
+; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s4, s6
+; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s4, 0
+; GFX1032DAGISEL-NEXT:    v_sub_f32_e64 v3, s5, s7
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s5, v3
+; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1032DAGISEL-NEXT:  ; %bb.2:
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v2, s5
+; GFX1032DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1032DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1032GISEL-LABEL: divergent_value_float:
+; GFX1032GISEL:       ; %bb.0: ; %entry
+; GFX1032GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1032GISEL-NEXT:    s_mov_b32 s4, exec_lo
+; GFX1032GISEL-NEXT:    s_mov_b32 s5, 0
+; GFX1032GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s6, s4
+; GFX1032GISEL-NEXT:    v_readlane_b32 s7, v2, s6
+; GFX1032GISEL-NEXT:    s_bitset0_b32 s4, s6
+; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s4, 0
+; GFX1032GISEL-NEXT:    v_sub_f32_e64 v3, s5, s7
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s5, v3
+; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1032GISEL-NEXT:  ; %bb.2:
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v2, s5
+; GFX1032GISEL-NEXT:    global_store_dword v[0:1], v2, off
+; GFX1032GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1164DAGISEL-LABEL: divergent_value_float:
+; GFX1164DAGISEL:       ; %bb.0: ; %entry
+; GFX1164DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164DAGISEL-NEXT:    s_mov_b32 s2, 0
+; GFX1164DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164DAGISEL-NEXT:    v_readlane_b32 s4, v2, s3
+; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[0:1], s3
+; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
+; GFX1164DAGISEL-NEXT:    v_sub_f32_e64 v3, s2, s4
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s2, v3
+; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1164DAGISEL-NEXT:  ; %bb.2:
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX1164DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1164DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1164GISEL-LABEL: divergent_value_float:
+; GFX1164GISEL:       ; %bb.0: ; %entry
+; GFX1164GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
+; GFX1164GISEL-NEXT:    s_mov_b32 s2, 0
+; GFX1164GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1164GISEL-NEXT:    v_readlane_b32 s4, v2, s3
+; GFX1164GISEL-NEXT:    s_bitset0_b64 s[0:1], s3
+; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
+; GFX1164GISEL-NEXT:    v_sub_f32_e64 v3, s2, s4
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s2, v3
+; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1164GISEL-NEXT:  ; %bb.2:
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v2, s2
+; GFX1164GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1164GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1132DAGISEL-LABEL: divergent_value_float:
+; GFX1132DAGISEL:       ; %bb.0: ; %entry
+; GFX1132DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, 0
+; GFX1132DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s2, s0
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    v_readlane_b32 s3, v2, s2
+; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s0, s2
+; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s0, 0
+; GFX1132DAGISEL-NEXT:    v_sub_f32_e64 v3, s1, s3
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s1, v3
+; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1132DAGISEL-NEXT:  ; %bb.2:
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
+; GFX1132DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1132DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1132GISEL-LABEL: divergent_value_float:
+; GFX1132GISEL:       ; %bb.0: ; %entry
+; GFX1132GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1132GISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX1132GISEL-NEXT:    s_mov_b32 s1, 0
+; GFX1132GISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s2, s0
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
+; GFX1132GISEL-NEXT:    v_readlane_b32 s3, v2, s2
+; GFX1132GISEL-NEXT:    s_bitset0_b32 s0, s2
+; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s0, 0
+; GFX1132GISEL-NEXT:    v_sub_f32_e64 v3, s1, s3
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s1, v3
+; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX1132GISEL-NEXT:  ; %bb.2:
+; GFX1132GISEL-NEXT:    v_mov_b32_e32 v2, s1
+; GFX1132GISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX1132GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX12DAGISEL-LABEL: divergent_value_float:
+; GFX12DAGISEL:       ; %bb.0: ; %entry
+; GFX12DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX12DAGISEL-NEXT:    s_wait_expcnt 0x0
+; GFX12DAGISEL-NEXT:    s_wait_samplecnt 0x0
+; GFX12DAGISEL-NEXT:    s_wait_bvhcnt 0x0
+; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; GFX12DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
+; GFX12DAGISEL-NEXT:    s_mov_b32 s1, 0
+; GFX12DAGISEL-NEXT:  .LBB1_1: ; =>This Inner Loop Header: Depth=1
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12DAGISEL-NEXT:    s_ctz_i32_b32 s2, s0
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12DAGISEL-NEXT:    v_readlane_b32 s3, v2, s2
+; GFX12DAGISEL-NEXT:    s_bitset0_b32 s0, s2
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12DAGISEL-NEXT:    s_cmp_lg_u32 s0, 0
+; GFX12DAGISEL-NEXT:    v_sub_f32_e64 v3, s1, s3
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s1, v3
+; GFX12DAGISEL-NEXT:    s_cbranch_scc1 .LBB1_1
+; GFX12DAGISEL-NEXT:  ; %bb.2:
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_va_sdst(0)
+; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
+; GFX12DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
+; GFX12DAGISEL-NEXT:    s_setpc_b64 s[30:31]
+entry:
+  %result = call float @llvm.amdgcn.wave.reduce.fsub(float %id.x, i32 1)
+  store float %result, ptr addrspace(1) %out
+  ret void
+}
+
+define amdgpu_kernel void @divergent_cfg_float(ptr addrspace(1) %out, float %in, float %in2) {
+; GFX8DAGISEL-LABEL: divergent_cfg_float:
+; GFX8DAGISEL:       ; %bb.0: ; %entry
+; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX8DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v0
+; GFX8DAGISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX8DAGISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX8DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX8DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX8DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX8DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX8DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX8DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
+; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX8DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX8DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX8DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
+; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX8DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
+; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v1, s0
+; GFX8DAGISEL-NEXT:    flat_store_dword v[1:2], v0
+; GFX8DAGISEL-NEXT:    s_endpgm
+;
+; GFX8GISEL-LABEL: divergent_cfg_float:
+; GFX8GISEL:       ; %bb.0: ; %entry
+; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX8GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v0
+; GFX8GISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX8GISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX8GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX8GISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX8GISEL-NEXT:  ; %bb.1: ; %else
+; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX8GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8GISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX8GISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX8GISEL-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
+; GFX8GISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX8GISEL-NEXT:  ; %bb.3: ; %if
+; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX8GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX8GISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
+; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX8GISEL-NEXT:  .LBB2_4: ; %endif
+; GFX8GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s6
+; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX8GISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
+; GFX8GISEL-NEXT:    s_endpgm
+;
+; GFX9DAGISEL-LABEL: divergent_cfg_float:
+; GFX9DAGISEL:       ; %bb.0: ; %entry
+; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX9DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v0
+; GFX9DAGISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX9DAGISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX9DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX9DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX9DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX9DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX9DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX9DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
+; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX9DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX9DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX9DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
+; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX9DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX9DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX9DAGISEL-NEXT:    s_endpgm
+;
+; GFX9GISEL-LABEL: divergent_cfg_float:
+; GFX9GISEL:       ; %bb.0: ; %entry
+; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX9GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v0
+; GFX9GISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX9GISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX9GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX9GISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX9GISEL-NEXT:  ; %bb.1: ; %else
+; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX9GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9GISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX9GISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX9GISEL-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
+; GFX9GISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX9GISEL-NEXT:  ; %bb.3: ; %if
+; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX9GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX9GISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
+; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX9GISEL-NEXT:  .LBB2_4: ; %endif
+; GFX9GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX9GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX9GISEL-NEXT:    s_endpgm
+;
+; GFX1064DAGISEL-LABEL: divergent_cfg_float:
+; GFX1064DAGISEL:       ; %bb.0: ; %entry
+; GFX1064DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX1064DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v0
+; GFX1064DAGISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX1064DAGISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX1064DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1064DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX1064DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX1064DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1064DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX1064DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
+; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1064DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX1064DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1064DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
+; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX1064DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1064DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1064DAGISEL-NEXT:    s_endpgm
+;
+; GFX1064GISEL-LABEL: divergent_cfg_float:
+; GFX1064GISEL:       ; %bb.0: ; %entry
+; GFX1064GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX1064GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v0
+; GFX1064GISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX1064GISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
+; GFX1064GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1064GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX1064GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064GISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX1064GISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1064GISEL-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
+; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1064GISEL-NEXT:  ; %bb.3: ; %if
+; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX1064GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1064GISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
+; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX1064GISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1064GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX1064GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1064GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1064GISEL-NEXT:    s_endpgm
+;
+; GFX1032DAGISEL-LABEL: divergent_cfg_float:
+; GFX1032DAGISEL:       ; %bb.0: ; %entry
+; GFX1032DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX1032DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc_lo, 15, v0
+; GFX1032DAGISEL-NEXT:    ; implicit-def: $sgpr3
+; GFX1032DAGISEL-NEXT:    s_and_saveexec_b32 s2, vcc_lo
+; GFX1032DAGISEL-NEXT:    s_xor_b32 s2, exec_lo, s2
+; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1032DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX1032DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
+; GFX1032DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s3, v0
+; GFX1032DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    s_or_saveexec_b32 s0, s2
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v0, s3
+; GFX1032DAGISEL-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
+; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1032DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX1032DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1032DAGISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX1032DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX1032DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
+; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s1, v0
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v0, s1
+; GFX1032DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1032DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX1032DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1032DAGISEL-NEXT:    s_endpgm
+;
+; GFX1032GISEL-LABEL: divergent_cfg_float:
+; GFX1032GISEL:       ; %bb.0: ; %entry
+; GFX1032GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
+; GFX1032GISEL-NEXT:    v_cmp_le_u32_e32 vcc_lo, 16, v0
+; GFX1032GISEL-NEXT:    ; implicit-def: $sgpr2
+; GFX1032GISEL-NEXT:    s_and_saveexec_b32 s3, vcc_lo
+; GFX1032GISEL-NEXT:    s_xor_b32 s3, exec_lo, s3
+; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1032GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1032GISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1032GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX1032GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032GISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1032GISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032GISEL-NEXT:    s_andn2_saveexec_b32 s0, s3
+; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1032GISEL-NEXT:  ; %bb.3: ; %if
+; GFX1032GISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1032GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX1032GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX1032GISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
+; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1032GISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1032GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX1032GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v0, s2
+; GFX1032GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1032GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
+; GFX1032GISEL-NEXT:    s_endpgm
+;
+; GFX1164DAGISEL-LABEL: divergent_cfg_float:
+; GFX1164DAGISEL:       ; %bb.0: ; %entry
+; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
+; GFX1164DAGISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164DAGISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_cmpx_lt_u32_e32 15, v0
+; GFX1164DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1164DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX1164DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX1164DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX1164DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1164DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX1164DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
+; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1164DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX1164DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1164DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
+; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX1164DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1164DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164DAGISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1164DAGISEL-NEXT:    s_endpgm
+;
+; GFX1164GISEL-LABEL: divergent_cfg_float:
+; GFX1164GISEL:       ; %bb.0: ; %entry
+; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
+; GFX1164GISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX1164GISEL-NEXT:    s_mov_b64 s[2:3], exec
+; GFX1164GISEL-NEXT:    ; implicit-def: $sgpr6
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_cmpx_le_u32_e32 16, v0
+; GFX1164GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
+; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1164GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1164GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX1164GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
+; GFX1164GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
+; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX1164GISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1164GISEL-NEXT:    s_and_not1_saveexec_b64 s[2:3], s[2:3]
+; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1164GISEL-NEXT:  ; %bb.3: ; %if
+; GFX1164GISEL-NEXT:    s_mov_b64 s[6:7], exec
+; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
+; GFX1164GISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s6, v0
+; GFX1164GISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1164GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
+; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v0, s6
+; GFX1164GISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1164GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1164GISEL-NEXT:    s_endpgm
+;
+; GFX1132DAGISEL-LABEL: divergent_cfg_float:
+; GFX1132DAGISEL:       ; %bb.0: ; %entry
+; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
+; GFX1132DAGISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1132DAGISEL-NEXT:    ; implicit-def: $sgpr3
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_cmpx_lt_u32_e32 15, v0
+; GFX1132DAGISEL-NEXT:    s_xor_b32 s2, exec_lo, s2
+; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1132DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
+; GFX1132DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s3, v0
+; GFX1132DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    s_or_saveexec_b32 s0, s2
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v0, s3
+; GFX1132DAGISEL-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
+; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1132DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX1132DAGISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX1132DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
+; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s1, v0
+; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v0, s1
+; GFX1132DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1132DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132DAGISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1132DAGISEL-NEXT:    s_endpgm
+;
+; GFX1132GISEL-LABEL: divergent_cfg_float:
+; GFX1132GISEL:       ; %bb.0: ; %entry
+; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
+; GFX1132GISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX1132GISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX1132GISEL-NEXT:    ; implicit-def: $sgpr2
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_cmpx_le_u32_e32 16, v0
+; GFX1132GISEL-NEXT:    s_xor_b32 s3, exec_lo, s3
+; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX1132GISEL-NEXT:  ; %bb.1: ; %else
+; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX1132GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX1132GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1132GISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132GISEL-NEXT:    s_and_not1_saveexec_b32 s0, s3
+; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX1132GISEL-NEXT:  ; %bb.3: ; %if
+; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX1132GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX1132GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
+; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX1132GISEL-NEXT:  .LBB2_4: ; %endif
+; GFX1132GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1132GISEL-NEXT:    v_dual_mov_b32 v0, s2 :: v_dual_mov_b32 v1, 0
+; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX1132GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX1132GISEL-NEXT:    s_endpgm
+;
+; GFX12DAGISEL-LABEL: divergent_cfg_float:
+; GFX12DAGISEL:       ; %bb.0: ; %entry
+; GFX12DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
+; GFX12DAGISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
+; GFX12DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX12DAGISEL-NEXT:    ; implicit-def: $sgpr3
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_cmpx_lt_u32_e32 15, v0
+; GFX12DAGISEL-NEXT:    s_xor_b32 s2, exec_lo, s2
+; GFX12DAGISEL-NEXT:    s_cbranch_execz .LBB2_2
+; GFX12DAGISEL-NEXT:  ; %bb.1: ; %else
+; GFX12DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
+; GFX12DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
+; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
+; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s3, v0
+; GFX12DAGISEL-NEXT:  .LBB2_2: ; %Flow
+; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; GFX12DAGISEL-NEXT:    s_or_saveexec_b32 s0, s2
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v0, s3
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12DAGISEL-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
+; GFX12DAGISEL-NEXT:    s_cbranch_execz .LBB2_4
+; GFX12DAGISEL-NEXT:  ; %bb.3: ; %if
+; GFX12DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12DAGISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
+; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s1, v0
+; GFX12DAGISEL-NEXT:    s_wait_alu depctr_va_sdst(0)
+; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v0, s1
+; GFX12DAGISEL-NEXT:  .LBB2_4: ; %endif
+; GFX12DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
+; GFX12DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
+; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
+; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
+; GFX12DAGISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
+; GFX12DAGISEL-NEXT:    s_endpgm
+entry:
+  %tid = call i32 @llvm.amdgcn.workitem.id.x()
+  %d_cmp = icmp ult i32 %tid, 16
+  br i1 %d_cmp, label %if, label %else
+
+if:
+  %reducedValTid = call float @llvm.amdgcn.wave.reduce.fsub(float %in2, i32 1)
+  br label %endif
+
+else:
+  %reducedValIn = call float @llvm.amdgcn.wave.reduce.fsub(float %in, i32 1)
+  br label %endif
+
+endif:
+  %combine = phi float [%reducedValTid, %if], [%reducedValIn, %else]
+  store float %combine, ptr addrspace(1) %out
+  ret void
+}
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; GFX10DAGISEL: {{.*}}
+; GFX10GISEL: {{.*}}
+; GFX11DAGISEL: {{.*}}
+; GFX11GISEL: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.max.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.max.ll
index 55b8e8541ecfb..6f299ab8bb9cf 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.max.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.max.ll
@@ -1534,917 +1534,6 @@ endif:
   store i64 %combine, ptr addrspace(1) %out
   ret void
 }
-
-define amdgpu_kernel void @uniform_value_float(ptr addrspace(1) %out, float %in) {
-; GFX8DAGISEL-LABEL: uniform_value_float:
-; GFX8DAGISEL:       ; %bb.0: ; %entry
-; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX8DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8DAGISEL-NEXT:    s_endpgm
-;
-; GFX8GISEL-LABEL: uniform_value_float:
-; GFX8GISEL:       ; %bb.0: ; %entry
-; GFX8GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8GISEL-NEXT:    s_endpgm
-;
-; GFX9DAGISEL-LABEL: uniform_value_float:
-; GFX9DAGISEL:       ; %bb.0: ; %entry
-; GFX9DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
-; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
-; GFX9DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
-; GFX9DAGISEL-NEXT:    s_endpgm
-;
-; GFX9GISEL-LABEL: uniform_value_float:
-; GFX9GISEL:       ; %bb.0: ; %entry
-; GFX9GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX9GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX9GISEL-NEXT:    s_endpgm
-;
-; GFX10DAGISEL-LABEL: uniform_value_float:
-; GFX10DAGISEL:       ; %bb.0: ; %entry
-; GFX10DAGISEL-NEXT:    s_clause 0x1
-; GFX10DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX10DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX10DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
-; GFX10DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX10DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
-; GFX10DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
-; GFX10DAGISEL-NEXT:    s_endpgm
-;
-; GFX10GISEL-LABEL: uniform_value_float:
-; GFX10GISEL:       ; %bb.0: ; %entry
-; GFX10GISEL-NEXT:    s_clause 0x1
-; GFX10GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX10GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX10GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX10GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX10GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX10GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX10GISEL-NEXT:    s_endpgm
-;
-; GFX1164DAGISEL-LABEL: uniform_value_float:
-; GFX1164DAGISEL:       ; %bb.0: ; %entry
-; GFX1164DAGISEL-NEXT:    s_clause 0x1
-; GFX1164DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
-; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
-; GFX1164DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_endpgm
-;
-; GFX1164GISEL-LABEL: uniform_value_float:
-; GFX1164GISEL:       ; %bb.0: ; %entry
-; GFX1164GISEL-NEXT:    s_clause 0x1
-; GFX1164GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX1164GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1164GISEL-NEXT:    s_endpgm
-;
-; GFX1132DAGISEL-LABEL: uniform_value_float:
-; GFX1132DAGISEL:       ; %bb.0: ; %entry
-; GFX1132DAGISEL-NEXT:    s_clause 0x1
-; GFX1132DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
-; GFX1132DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
-; GFX1132DAGISEL-NEXT:    s_endpgm
-;
-; GFX1132GISEL-LABEL: uniform_value_float:
-; GFX1132GISEL:       ; %bb.0: ; %entry
-; GFX1132GISEL-NEXT:    s_clause 0x1
-; GFX1132GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132GISEL-NEXT:    v_dual_mov_b32 v1, 0 :: v_dual_mov_b32 v0, s2
-; GFX1132GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1132GISEL-NEXT:    s_endpgm
-entry:
-  %result = call float @llvm.amdgcn.wave.reduce.fmax(float %in, i32 1)
-  store float %result, ptr addrspace(1) %out
-  ret void
-}
-
-define void @divergent_value_float(ptr addrspace(1) %out, float %in) {
-; GFX8DAGISEL-LABEL: divergent_value_float:
-; GFX8DAGISEL:       ; %bb.0: ; %entry
-; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX8DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX8DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX8DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX8DAGISEL-NEXT:    v_max_f32_e32 v3, s6, v3
-; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX8DAGISEL-NEXT:  ; %bb.2:
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX8DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8GISEL-LABEL: divergent_value_float:
-; GFX8GISEL:       ; %bb.0: ; %entry
-; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8GISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX8GISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX8GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX8GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX8GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX8GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX8GISEL-NEXT:    v_max_f32_e32 v3, s6, v3
-; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX8GISEL-NEXT:  ; %bb.2:
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX8GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX9DAGISEL-LABEL: divergent_value_float:
-; GFX9DAGISEL:       ; %bb.0: ; %entry
-; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX9DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX9DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX9DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX9DAGISEL-NEXT:    v_max_f32_e32 v3, s6, v3
-; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX9DAGISEL-NEXT:  ; %bb.2:
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX9DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX9DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX9GISEL-LABEL: divergent_value_float:
-; GFX9GISEL:       ; %bb.0: ; %entry
-; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9GISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX9GISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX9GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX9GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX9GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX9GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX9GISEL-NEXT:    v_max_f32_e32 v3, s6, v3
-; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX9GISEL-NEXT:  ; %bb.2:
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX9GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX9GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1064DAGISEL-LABEL: divergent_value_float:
-; GFX1064DAGISEL:       ; %bb.0: ; %entry
-; GFX1064DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX1064DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX1064DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX1064DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX1064DAGISEL-NEXT:    v_max_f32_e64 v3, s6, s8
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1064DAGISEL-NEXT:  ; %bb.2:
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX1064DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1064DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1064GISEL-LABEL: divergent_value_float:
-; GFX1064GISEL:       ; %bb.0: ; %entry
-; GFX1064GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1064GISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX1064GISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX1064GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX1064GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX1064GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX1064GISEL-NEXT:    v_max_f32_e64 v3, s6, s8
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1064GISEL-NEXT:  ; %bb.2:
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX1064GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1064GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1032DAGISEL-LABEL: divergent_value_float:
-; GFX1032DAGISEL:       ; %bb.0: ; %entry
-; GFX1032DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s4, exec_lo
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, 0x7fc00000
-; GFX1032DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s6, s4
-; GFX1032DAGISEL-NEXT:    v_readlane_b32 s7, v2, s6
-; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s4, s6
-; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s4, 0
-; GFX1032DAGISEL-NEXT:    v_max_f32_e64 v3, s5, s7
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s5, v3
-; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1032DAGISEL-NEXT:  ; %bb.2:
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v2, s5
-; GFX1032DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1032DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1032GISEL-LABEL: divergent_value_float:
-; GFX1032GISEL:       ; %bb.0: ; %entry
-; GFX1032GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1032GISEL-NEXT:    s_mov_b32 s4, exec_lo
-; GFX1032GISEL-NEXT:    s_mov_b32 s5, 0x7fc00000
-; GFX1032GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s6, s4
-; GFX1032GISEL-NEXT:    v_readlane_b32 s7, v2, s6
-; GFX1032GISEL-NEXT:    s_bitset0_b32 s4, s6
-; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s4, 0
-; GFX1032GISEL-NEXT:    v_max_f32_e64 v3, s5, s7
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s5, v3
-; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1032GISEL-NEXT:  ; %bb.2:
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v2, s5
-; GFX1032GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1032GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1164DAGISEL-LABEL: divergent_value_float:
-; GFX1164DAGISEL:       ; %bb.0: ; %entry
-; GFX1164DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164DAGISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
-; GFX1164DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164DAGISEL-NEXT:    v_readlane_b32 s4, v2, s3
-; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[0:1], s3
-; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
-; GFX1164DAGISEL-NEXT:    v_max_f32_e64 v3, s2, s4
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s2, v3
-; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1164DAGISEL-NEXT:  ; %bb.2:
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX1164DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1164DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1164GISEL-LABEL: divergent_value_float:
-; GFX1164GISEL:       ; %bb.0: ; %entry
-; GFX1164GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164GISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
-; GFX1164GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164GISEL-NEXT:    v_readlane_b32 s4, v2, s3
-; GFX1164GISEL-NEXT:    s_bitset0_b64 s[0:1], s3
-; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
-; GFX1164GISEL-NEXT:    v_max_f32_e64 v3, s2, s4
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s2, v3
-; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1164GISEL-NEXT:  ; %bb.2:
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX1164GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1164GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1132DAGISEL-LABEL: divergent_value_float:
-; GFX1132DAGISEL:       ; %bb.0: ; %entry
-; GFX1132DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, 0x7fc00000
-; GFX1132DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s2, s0
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    v_readlane_b32 s3, v2, s2
-; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s0, s2
-; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s0, 0
-; GFX1132DAGISEL-NEXT:    v_max_f32_e64 v3, s1, s3
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s1, v3
-; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1132DAGISEL-NEXT:  ; %bb.2:
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
-; GFX1132DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1132DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1132GISEL-LABEL: divergent_value_float:
-; GFX1132GISEL:       ; %bb.0: ; %entry
-; GFX1132GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1132GISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1132GISEL-NEXT:    s_mov_b32 s1, 0x7fc00000
-; GFX1132GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s2, s0
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132GISEL-NEXT:    v_readlane_b32 s3, v2, s2
-; GFX1132GISEL-NEXT:    s_bitset0_b32 s0, s2
-; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s0, 0
-; GFX1132GISEL-NEXT:    v_max_f32_e64 v3, s1, s3
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s1, v3
-; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1132GISEL-NEXT:  ; %bb.2:
-; GFX1132GISEL-NEXT:    v_mov_b32_e32 v2, s1
-; GFX1132GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1132GISEL-NEXT:    s_setpc_b64 s[30:31]
-entry:
-  %result = call float @llvm.amdgcn.wave.reduce.fmax(float %in, i32 1)
-  store float %result, ptr addrspace(1) %out
-  ret void
-}
-
-define void @divergent_cfg_float(ptr addrspace(1) %out, float %in, float %in2) {
-; GFX8DAGISEL-LABEL: divergent_cfg_float:
-; GFX8DAGISEL:       ; %bb.0: ; %entry
-; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX8DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
-; GFX8DAGISEL-NEXT:    ; implicit-def: $vgpr4
-; GFX8DAGISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
-; GFX8DAGISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
-; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX8DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX8DAGISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX8DAGISEL-NEXT:    v_readlane_b32 s10, v2, s9
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v3, s10
-; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX8DAGISEL-NEXT:    v_max_f32_e32 v3, s8, v3
-; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s8, v3
-; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX8DAGISEL-NEXT:  ; %bb.3:
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
-; GFX8DAGISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX8DAGISEL-NEXT:  .LBB8_4: ; %Flow
-; GFX8DAGISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB8_8
-; GFX8DAGISEL-NEXT:  ; %bb.5: ; %if
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX8DAGISEL-NEXT:  .LBB8_6: ; =>This Inner Loop Header: Depth=1
-; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX8DAGISEL-NEXT:    v_readlane_b32 s10, v3, s9
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s10
-; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX8DAGISEL-NEXT:    v_max_f32_e32 v2, s8, v2
-; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s8, v2
-; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_6
-; GFX8DAGISEL-NEXT:  ; %bb.7:
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
-; GFX8DAGISEL-NEXT:  .LBB8_8: ; %endif
-; GFX8DAGISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v4
-; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX8DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8GISEL-LABEL: divergent_cfg_float:
-; GFX8GISEL:       ; %bb.0: ; %entry
-; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX8GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v4
-; GFX8GISEL-NEXT:    ; implicit-def: $sgpr8
-; GFX8GISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
-; GFX8GISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
-; GFX8GISEL-NEXT:    s_cbranch_execz .LBB8_3
-; GFX8GISEL-NEXT:  ; %bb.1: ; %else
-; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX8GISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX8GISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX8GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX8GISEL-NEXT:    v_readlane_b32 s10, v2, s9
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v4, s10
-; GFX8GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX8GISEL-NEXT:    v_max_f32_e32 v4, s8, v4
-; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s8, v4
-; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX8GISEL-NEXT:  .LBB8_3: ; %Flow
-; GFX8GISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX8GISEL-NEXT:    s_cbranch_execz .LBB8_6
-; GFX8GISEL-NEXT:  ; %bb.4: ; %if
-; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX8GISEL-NEXT:  .LBB8_5: ; =>This Inner Loop Header: Depth=1
-; GFX8GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX8GISEL-NEXT:    v_readlane_b32 s10, v3, s9
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s10
-; GFX8GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX8GISEL-NEXT:    v_max_f32_e32 v2, s8, v2
-; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s8, v2
-; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB8_5
-; GFX8GISEL-NEXT:  .LBB8_6: ; %endif
-; GFX8GISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s8
-; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX8GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX9DAGISEL-LABEL: divergent_cfg_float:
-; GFX9DAGISEL:       ; %bb.0: ; %entry
-; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX9DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
-; GFX9DAGISEL-NEXT:    ; implicit-def: $vgpr4
-; GFX9DAGISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
-; GFX9DAGISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
-; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX9DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX9DAGISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX9DAGISEL-NEXT:    v_readlane_b32 s10, v2, s9
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v3, s10
-; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX9DAGISEL-NEXT:    v_max_f32_e32 v3, s8, v3
-; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s8, v3
-; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX9DAGISEL-NEXT:  ; %bb.3:
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
-; GFX9DAGISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX9DAGISEL-NEXT:  .LBB8_4: ; %Flow
-; GFX9DAGISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB8_8
-; GFX9DAGISEL-NEXT:  ; %bb.5: ; %if
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX9DAGISEL-NEXT:  .LBB8_6: ; =>This Inner Loop Header: Depth=1
-; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX9DAGISEL-NEXT:    v_readlane_b32 s10, v3, s9
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v2, s10
-; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX9DAGISEL-NEXT:    v_max_f32_e32 v2, s8, v2
-; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s8, v2
-; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_6
-; GFX9DAGISEL-NEXT:  ; %bb.7:
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
-; GFX9DAGISEL-NEXT:  .LBB8_8: ; %endif
-; GFX9DAGISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GFX9DAGISEL-NEXT:    global_store_dword v[0:1], v4, off
-; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX9DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX9GISEL-LABEL: divergent_cfg_float:
-; GFX9GISEL:       ; %bb.0: ; %entry
-; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX9GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v4
-; GFX9GISEL-NEXT:    ; implicit-def: $sgpr8
-; GFX9GISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
-; GFX9GISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
-; GFX9GISEL-NEXT:    s_cbranch_execz .LBB8_3
-; GFX9GISEL-NEXT:  ; %bb.1: ; %else
-; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX9GISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX9GISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX9GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX9GISEL-NEXT:    v_readlane_b32 s10, v2, s9
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v4, s10
-; GFX9GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX9GISEL-NEXT:    v_max_f32_e32 v4, s8, v4
-; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s8, v4
-; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX9GISEL-NEXT:  .LBB8_3: ; %Flow
-; GFX9GISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX9GISEL-NEXT:    s_cbranch_execz .LBB8_6
-; GFX9GISEL-NEXT:  ; %bb.4: ; %if
-; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX9GISEL-NEXT:  .LBB8_5: ; =>This Inner Loop Header: Depth=1
-; GFX9GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX9GISEL-NEXT:    v_readlane_b32 s10, v3, s9
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s10
-; GFX9GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX9GISEL-NEXT:    v_max_f32_e32 v2, s8, v2
-; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s8, v2
-; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB8_5
-; GFX9GISEL-NEXT:  .LBB8_6: ; %endif
-; GFX9GISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s8
-; GFX9GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX9GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1064DAGISEL-LABEL: divergent_cfg_float:
-; GFX1064DAGISEL:       ; %bb.0: ; %entry
-; GFX1064DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1064DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
-; GFX1064DAGISEL-NEXT:    ; implicit-def: $vgpr4
-; GFX1064DAGISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
-; GFX1064DAGISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
-; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1064DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX1064DAGISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX1064DAGISEL-NEXT:    v_readlane_b32 s10, v2, s9
-; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX1064DAGISEL-NEXT:    v_max_f32_e64 v3, s8, s10
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s8, v3
-; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1064DAGISEL-NEXT:  ; %bb.3:
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
-; GFX1064DAGISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1064DAGISEL-NEXT:  .LBB8_4: ; %Flow
-; GFX1064DAGISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB8_8
-; GFX1064DAGISEL-NEXT:  ; %bb.5: ; %if
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX1064DAGISEL-NEXT:  .LBB8_6: ; =>This Inner Loop Header: Depth=1
-; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX1064DAGISEL-NEXT:    v_readlane_b32 s10, v3, s9
-; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX1064DAGISEL-NEXT:    v_max_f32_e64 v2, s8, s10
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s8, v2
-; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_6
-; GFX1064DAGISEL-NEXT:  ; %bb.7:
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
-; GFX1064DAGISEL-NEXT:  .LBB8_8: ; %endif
-; GFX1064DAGISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GFX1064DAGISEL-NEXT:    global_store_dword v[0:1], v4, off
-; GFX1064DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1064GISEL-LABEL: divergent_cfg_float:
-; GFX1064GISEL:       ; %bb.0: ; %entry
-; GFX1064GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1064GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1064GISEL-NEXT:    ; implicit-def: $sgpr8
-; GFX1064GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v4
-; GFX1064GISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
-; GFX1064GISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
-; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB8_3
-; GFX1064GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX1064GISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX1064GISEL-NEXT:    v_readlane_b32 s10, v2, s9
-; GFX1064GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX1064GISEL-NEXT:    v_max_f32_e64 v3, s8, s10
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s8, v3
-; GFX1064GISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1064GISEL-NEXT:  .LBB8_3: ; %Flow
-; GFX1064GISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB8_6
-; GFX1064GISEL-NEXT:  ; %bb.4: ; %if
-; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX1064GISEL-NEXT:  .LBB8_5: ; =>This Inner Loop Header: Depth=1
-; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX1064GISEL-NEXT:    v_readlane_b32 s10, v3, s9
-; GFX1064GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX1064GISEL-NEXT:    v_max_f32_e64 v2, s8, s10
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s8, v2
-; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB8_5
-; GFX1064GISEL-NEXT:  .LBB8_6: ; %endif
-; GFX1064GISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v2, s8
-; GFX1064GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1064GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1032DAGISEL-LABEL: divergent_cfg_float:
-; GFX1032DAGISEL:       ; %bb.0: ; %entry
-; GFX1032DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1032DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc_lo, 15, v4
-; GFX1032DAGISEL-NEXT:    ; implicit-def: $vgpr4
-; GFX1032DAGISEL-NEXT:    s_and_saveexec_b32 s4, vcc_lo
-; GFX1032DAGISEL-NEXT:    s_xor_b32 s4, exec_lo, s4
-; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1032DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, exec_lo
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX1032DAGISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s7, s5
-; GFX1032DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s5, s7
-; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s5, 0
-; GFX1032DAGISEL-NEXT:    v_max_f32_e64 v3, s6, s8
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1032DAGISEL-NEXT:  ; %bb.3:
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v4, s6
-; GFX1032DAGISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1032DAGISEL-NEXT:  .LBB8_4: ; %Flow
-; GFX1032DAGISEL-NEXT:    s_andn2_saveexec_b32 s4, s4
-; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB8_8
-; GFX1032DAGISEL-NEXT:  ; %bb.5: ; %if
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, exec_lo
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX1032DAGISEL-NEXT:  .LBB8_6: ; =>This Inner Loop Header: Depth=1
-; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s7, s5
-; GFX1032DAGISEL-NEXT:    v_readlane_b32 s8, v3, s7
-; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s5, s7
-; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s5, 0
-; GFX1032DAGISEL-NEXT:    v_max_f32_e64 v2, s6, s8
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s6, v2
-; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_6
-; GFX1032DAGISEL-NEXT:  ; %bb.7:
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v4, s6
-; GFX1032DAGISEL-NEXT:  .LBB8_8: ; %endif
-; GFX1032DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s4
-; GFX1032DAGISEL-NEXT:    global_store_dword v[0:1], v4, off
-; GFX1032DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1032GISEL-LABEL: divergent_cfg_float:
-; GFX1032GISEL:       ; %bb.0: ; %entry
-; GFX1032GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1032GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1032GISEL-NEXT:    ; implicit-def: $sgpr4
-; GFX1032GISEL-NEXT:    v_cmp_le_u32_e32 vcc_lo, 16, v4
-; GFX1032GISEL-NEXT:    s_and_saveexec_b32 s5, vcc_lo
-; GFX1032GISEL-NEXT:    s_xor_b32 s5, exec_lo, s5
-; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB8_3
-; GFX1032GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1032GISEL-NEXT:    s_mov_b32 s6, exec_lo
-; GFX1032GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
-; GFX1032GISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s7, s6
-; GFX1032GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX1032GISEL-NEXT:    s_bitset0_b32 s6, s7
-; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s6, 0
-; GFX1032GISEL-NEXT:    v_max_f32_e64 v3, s4, s8
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s4, v3
-; GFX1032GISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1032GISEL-NEXT:  .LBB8_3: ; %Flow
-; GFX1032GISEL-NEXT:    s_andn2_saveexec_b32 s5, s5
-; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB8_6
-; GFX1032GISEL-NEXT:  ; %bb.4: ; %if
-; GFX1032GISEL-NEXT:    s_mov_b32 s6, exec_lo
-; GFX1032GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
-; GFX1032GISEL-NEXT:  .LBB8_5: ; =>This Inner Loop Header: Depth=1
-; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s7, s6
-; GFX1032GISEL-NEXT:    v_readlane_b32 s8, v3, s7
-; GFX1032GISEL-NEXT:    s_bitset0_b32 s6, s7
-; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s6, 0
-; GFX1032GISEL-NEXT:    v_max_f32_e64 v2, s4, s8
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s4, v2
-; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB8_5
-; GFX1032GISEL-NEXT:  .LBB8_6: ; %endif
-; GFX1032GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s5
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v2, s4
-; GFX1032GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1032GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1164DAGISEL-LABEL: divergent_cfg_float:
-; GFX1164DAGISEL:       ; %bb.0: ; %entry
-; GFX1164DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
-; GFX1164DAGISEL-NEXT:    ; implicit-def: $vgpr4
-; GFX1164DAGISEL-NEXT:    s_and_saveexec_b64 s[0:1], vcc
-; GFX1164DAGISEL-NEXT:    s_xor_b64 s[0:1], exec, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1164DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
-; GFX1164DAGISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
-; GFX1164DAGISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164DAGISEL-NEXT:    v_readlane_b32 s6, v2, s5
-; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[2:3], s5
-; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
-; GFX1164DAGISEL-NEXT:    v_max_f32_e64 v3, s4, s6
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s4, v3
-; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1164DAGISEL-NEXT:  ; %bb.3:
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v4, s4
-; GFX1164DAGISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1164DAGISEL-NEXT:  .LBB8_4: ; %Flow
-; GFX1164DAGISEL-NEXT:    s_and_not1_saveexec_b64 s[0:1], s[0:1]
-; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB8_8
-; GFX1164DAGISEL-NEXT:  ; %bb.5: ; %if
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
-; GFX1164DAGISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
-; GFX1164DAGISEL-NEXT:  .LBB8_6: ; =>This Inner Loop Header: Depth=1
-; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164DAGISEL-NEXT:    v_readlane_b32 s6, v3, s5
-; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[2:3], s5
-; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
-; GFX1164DAGISEL-NEXT:    v_max_f32_e64 v2, s4, s6
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s4, v2
-; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_6
-; GFX1164DAGISEL-NEXT:  ; %bb.7:
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v4, s4
-; GFX1164DAGISEL-NEXT:  .LBB8_8: ; %endif
-; GFX1164DAGISEL-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX1164DAGISEL-NEXT:    global_store_b32 v[0:1], v4, off
-; GFX1164DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1164GISEL-LABEL: divergent_cfg_float:
-; GFX1164GISEL:       ; %bb.0: ; %entry
-; GFX1164GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1164GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164GISEL-NEXT:    ; implicit-def: $sgpr4
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_cmpx_le_u32_e32 16, v4
-; GFX1164GISEL-NEXT:    s_xor_b64 s[0:1], exec, s[0:1]
-; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB8_3
-; GFX1164GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1164GISEL-NEXT:    s_mov_b64 s[2:3], exec
-; GFX1164GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
-; GFX1164GISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164GISEL-NEXT:    v_readlane_b32 s6, v2, s5
-; GFX1164GISEL-NEXT:    s_bitset0_b64 s[2:3], s5
-; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
-; GFX1164GISEL-NEXT:    v_max_f32_e64 v3, s4, s6
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s4, v3
-; GFX1164GISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1164GISEL-NEXT:  .LBB8_3: ; %Flow
-; GFX1164GISEL-NEXT:    s_and_not1_saveexec_b64 s[0:1], s[0:1]
-; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB8_6
-; GFX1164GISEL-NEXT:  ; %bb.4: ; %if
-; GFX1164GISEL-NEXT:    s_mov_b64 s[2:3], exec
-; GFX1164GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
-; GFX1164GISEL-NEXT:  .LBB8_5: ; =>This Inner Loop Header: Depth=1
-; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164GISEL-NEXT:    v_readlane_b32 s6, v3, s5
-; GFX1164GISEL-NEXT:    s_bitset0_b64 s[2:3], s5
-; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
-; GFX1164GISEL-NEXT:    v_max_f32_e64 v2, s4, s6
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s4, v2
-; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB8_5
-; GFX1164GISEL-NEXT:  .LBB8_6: ; %endif
-; GFX1164GISEL-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v2, s4
-; GFX1164GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1164GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1132DAGISEL-LABEL: divergent_cfg_float:
-; GFX1132DAGISEL:       ; %bb.0: ; %entry
-; GFX1132DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc_lo, 15, v4
-; GFX1132DAGISEL-NEXT:    ; implicit-def: $vgpr4
-; GFX1132DAGISEL-NEXT:    s_and_saveexec_b32 s0, vcc_lo
-; GFX1132DAGISEL-NEXT:    s_xor_b32 s0, exec_lo, s0
-; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1132DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, exec_lo
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
-; GFX1132DAGISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s3, s1
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    v_readlane_b32 s4, v2, s3
-; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s1, s3
-; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s1, 0
-; GFX1132DAGISEL-NEXT:    v_max_f32_e64 v3, s2, s4
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s2, v3
-; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1132DAGISEL-NEXT:  ; %bb.3:
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v4, s2
-; GFX1132DAGISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1132DAGISEL-NEXT:  .LBB8_4: ; %Flow
-; GFX1132DAGISEL-NEXT:    s_and_not1_saveexec_b32 s0, s0
-; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB8_8
-; GFX1132DAGISEL-NEXT:  ; %bb.5: ; %if
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, exec_lo
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
-; GFX1132DAGISEL-NEXT:  .LBB8_6: ; =>This Inner Loop Header: Depth=1
-; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s3, s1
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    v_readlane_b32 s4, v3, s3
-; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s1, s3
-; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s1, 0
-; GFX1132DAGISEL-NEXT:    v_max_f32_e64 v2, s2, s4
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s2, v2
-; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_6
-; GFX1132DAGISEL-NEXT:  ; %bb.7:
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v4, s2
-; GFX1132DAGISEL-NEXT:  .LBB8_8: ; %endif
-; GFX1132DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; GFX1132DAGISEL-NEXT:    global_store_b32 v[0:1], v4, off
-; GFX1132DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1132GISEL-LABEL: divergent_cfg_float:
-; GFX1132GISEL:       ; %bb.0: ; %entry
-; GFX1132GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1132GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1132GISEL-NEXT:    s_mov_b32 s1, exec_lo
-; GFX1132GISEL-NEXT:    ; implicit-def: $sgpr0
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_cmpx_le_u32_e32 16, v4
-; GFX1132GISEL-NEXT:    s_xor_b32 s1, exec_lo, s1
-; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB8_3
-; GFX1132GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1132GISEL-NEXT:    s_mov_b32 s0, 0x7fc00000
-; GFX1132GISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s3, s2
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132GISEL-NEXT:    v_readlane_b32 s4, v2, s3
-; GFX1132GISEL-NEXT:    s_bitset0_b32 s2, s3
-; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s2, 0
-; GFX1132GISEL-NEXT:    v_max_f32_e64 v3, s0, s4
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s0, v3
-; GFX1132GISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1132GISEL-NEXT:  .LBB8_3: ; %Flow
-; GFX1132GISEL-NEXT:    s_and_not1_saveexec_b32 s1, s1
-; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB8_6
-; GFX1132GISEL-NEXT:  ; %bb.4: ; %if
-; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1132GISEL-NEXT:    s_mov_b32 s0, 0x7fc00000
-; GFX1132GISEL-NEXT:  .LBB8_5: ; =>This Inner Loop Header: Depth=1
-; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s3, s2
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132GISEL-NEXT:    v_readlane_b32 s4, v3, s3
-; GFX1132GISEL-NEXT:    s_bitset0_b32 s2, s3
-; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s2, 0
-; GFX1132GISEL-NEXT:    v_max_f32_e64 v2, s0, s4
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s0, v2
-; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB8_5
-; GFX1132GISEL-NEXT:  .LBB8_6: ; %endif
-; GFX1132GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s1
-; GFX1132GISEL-NEXT:    v_mov_b32_e32 v2, s0
-; GFX1132GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1132GISEL-NEXT:    s_setpc_b64 s[30:31]
-entry:
-  %tid = call i32 @llvm.amdgcn.workitem.id.x()
-  %d_cmp = icmp ult i32 %tid, 16
-  br i1 %d_cmp, label %if, label %else
-
-if:
-  %reducedValTid = call float @llvm.amdgcn.wave.reduce.fmax(float %in2, i32 1)
-  br label %endif
-
-else:
-  %reducedValIn = call float @llvm.amdgcn.wave.reduce.fmax(float %in, i32 1)
-  br label %endif
-
-endif:
-  %combine = phi float [%reducedValTid, %if], [%reducedValIn, %else]
-  store float %combine, ptr addrspace(1) %out
-  ret void
-}
 ;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
 ; GFX11DAGISEL: {{.*}}
 ; GFX11GISEL: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.min.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.min.ll
index e2ee20dd96c62..3c4cbc74aedc1 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.min.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.min.ll
@@ -1534,917 +1534,6 @@ endif:
   store i64 %combine, ptr addrspace(1) %out
   ret void
 }
-
-define amdgpu_kernel void @uniform_value_float(ptr addrspace(1) %out, float %in) {
-; GFX8DAGISEL-LABEL: uniform_value_float:
-; GFX8DAGISEL:       ; %bb.0: ; %entry
-; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX8DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8DAGISEL-NEXT:    s_endpgm
-;
-; GFX8GISEL-LABEL: uniform_value_float:
-; GFX8GISEL:       ; %bb.0: ; %entry
-; GFX8GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8GISEL-NEXT:    s_endpgm
-;
-; GFX9DAGISEL-LABEL: uniform_value_float:
-; GFX9DAGISEL:       ; %bb.0: ; %entry
-; GFX9DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
-; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
-; GFX9DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
-; GFX9DAGISEL-NEXT:    s_endpgm
-;
-; GFX9GISEL-LABEL: uniform_value_float:
-; GFX9GISEL:       ; %bb.0: ; %entry
-; GFX9GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX9GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX9GISEL-NEXT:    s_endpgm
-;
-; GFX10DAGISEL-LABEL: uniform_value_float:
-; GFX10DAGISEL:       ; %bb.0: ; %entry
-; GFX10DAGISEL-NEXT:    s_clause 0x1
-; GFX10DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX10DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX10DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
-; GFX10DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX10DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
-; GFX10DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
-; GFX10DAGISEL-NEXT:    s_endpgm
-;
-; GFX10GISEL-LABEL: uniform_value_float:
-; GFX10GISEL:       ; %bb.0: ; %entry
-; GFX10GISEL-NEXT:    s_clause 0x1
-; GFX10GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX10GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX10GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX10GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX10GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX10GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX10GISEL-NEXT:    s_endpgm
-;
-; GFX1164DAGISEL-LABEL: uniform_value_float:
-; GFX1164DAGISEL:       ; %bb.0: ; %entry
-; GFX1164DAGISEL-NEXT:    s_clause 0x1
-; GFX1164DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
-; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
-; GFX1164DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_endpgm
-;
-; GFX1164GISEL-LABEL: uniform_value_float:
-; GFX1164GISEL:       ; %bb.0: ; %entry
-; GFX1164GISEL-NEXT:    s_clause 0x1
-; GFX1164GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX1164GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1164GISEL-NEXT:    s_endpgm
-;
-; GFX1132DAGISEL-LABEL: uniform_value_float:
-; GFX1132DAGISEL:       ; %bb.0: ; %entry
-; GFX1132DAGISEL-NEXT:    s_clause 0x1
-; GFX1132DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
-; GFX1132DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
-; GFX1132DAGISEL-NEXT:    s_endpgm
-;
-; GFX1132GISEL-LABEL: uniform_value_float:
-; GFX1132GISEL:       ; %bb.0: ; %entry
-; GFX1132GISEL-NEXT:    s_clause 0x1
-; GFX1132GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132GISEL-NEXT:    v_dual_mov_b32 v1, 0 :: v_dual_mov_b32 v0, s2
-; GFX1132GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1132GISEL-NEXT:    s_endpgm
-entry:
-  %result = call float @llvm.amdgcn.wave.reduce.fmin(float %in, i32 1)
-  store float %result, ptr addrspace(1) %out
-  ret void
-}
-
-define void @divergent_value_float(ptr addrspace(1) %out, float %in) {
-; GFX8DAGISEL-LABEL: divergent_value_float:
-; GFX8DAGISEL:       ; %bb.0: ; %entry
-; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX8DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX8DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX8DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX8DAGISEL-NEXT:    v_min_f32_e32 v3, s6, v3
-; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX8DAGISEL-NEXT:  ; %bb.2:
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX8DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8GISEL-LABEL: divergent_value_float:
-; GFX8GISEL:       ; %bb.0: ; %entry
-; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8GISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX8GISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX8GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX8GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX8GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX8GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX8GISEL-NEXT:    v_min_f32_e32 v3, s6, v3
-; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX8GISEL-NEXT:  ; %bb.2:
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX8GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX9DAGISEL-LABEL: divergent_value_float:
-; GFX9DAGISEL:       ; %bb.0: ; %entry
-; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX9DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX9DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX9DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX9DAGISEL-NEXT:    v_min_f32_e32 v3, s6, v3
-; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX9DAGISEL-NEXT:  ; %bb.2:
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX9DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX9DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX9GISEL-LABEL: divergent_value_float:
-; GFX9GISEL:       ; %bb.0: ; %entry
-; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9GISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX9GISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX9GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX9GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX9GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX9GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX9GISEL-NEXT:    v_min_f32_e32 v3, s6, v3
-; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX9GISEL-NEXT:  ; %bb.2:
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX9GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX9GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1064DAGISEL-LABEL: divergent_value_float:
-; GFX1064DAGISEL:       ; %bb.0: ; %entry
-; GFX1064DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX1064DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX1064DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX1064DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX1064DAGISEL-NEXT:    v_min_f32_e64 v3, s6, s8
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1064DAGISEL-NEXT:  ; %bb.2:
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX1064DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1064DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1064GISEL-LABEL: divergent_value_float:
-; GFX1064GISEL:       ; %bb.0: ; %entry
-; GFX1064GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1064GISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX1064GISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX1064GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX1064GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX1064GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX1064GISEL-NEXT:    v_min_f32_e64 v3, s6, s8
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1064GISEL-NEXT:  ; %bb.2:
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX1064GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1064GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1032DAGISEL-LABEL: divergent_value_float:
-; GFX1032DAGISEL:       ; %bb.0: ; %entry
-; GFX1032DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s4, exec_lo
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, 0x7fc00000
-; GFX1032DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s6, s4
-; GFX1032DAGISEL-NEXT:    v_readlane_b32 s7, v2, s6
-; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s4, s6
-; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s4, 0
-; GFX1032DAGISEL-NEXT:    v_min_f32_e64 v3, s5, s7
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s5, v3
-; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1032DAGISEL-NEXT:  ; %bb.2:
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v2, s5
-; GFX1032DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1032DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1032GISEL-LABEL: divergent_value_float:
-; GFX1032GISEL:       ; %bb.0: ; %entry
-; GFX1032GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1032GISEL-NEXT:    s_mov_b32 s4, exec_lo
-; GFX1032GISEL-NEXT:    s_mov_b32 s5, 0x7fc00000
-; GFX1032GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s6, s4
-; GFX1032GISEL-NEXT:    v_readlane_b32 s7, v2, s6
-; GFX1032GISEL-NEXT:    s_bitset0_b32 s4, s6
-; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s4, 0
-; GFX1032GISEL-NEXT:    v_min_f32_e64 v3, s5, s7
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s5, v3
-; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1032GISEL-NEXT:  ; %bb.2:
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v2, s5
-; GFX1032GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1032GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1164DAGISEL-LABEL: divergent_value_float:
-; GFX1164DAGISEL:       ; %bb.0: ; %entry
-; GFX1164DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164DAGISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
-; GFX1164DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164DAGISEL-NEXT:    v_readlane_b32 s4, v2, s3
-; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[0:1], s3
-; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
-; GFX1164DAGISEL-NEXT:    v_min_f32_e64 v3, s2, s4
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s2, v3
-; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1164DAGISEL-NEXT:  ; %bb.2:
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX1164DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1164DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1164GISEL-LABEL: divergent_value_float:
-; GFX1164GISEL:       ; %bb.0: ; %entry
-; GFX1164GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164GISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
-; GFX1164GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164GISEL-NEXT:    v_readlane_b32 s4, v2, s3
-; GFX1164GISEL-NEXT:    s_bitset0_b64 s[0:1], s3
-; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
-; GFX1164GISEL-NEXT:    v_min_f32_e64 v3, s2, s4
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s2, v3
-; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1164GISEL-NEXT:  ; %bb.2:
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX1164GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1164GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1132DAGISEL-LABEL: divergent_value_float:
-; GFX1132DAGISEL:       ; %bb.0: ; %entry
-; GFX1132DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, 0x7fc00000
-; GFX1132DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s2, s0
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    v_readlane_b32 s3, v2, s2
-; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s0, s2
-; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s0, 0
-; GFX1132DAGISEL-NEXT:    v_min_f32_e64 v3, s1, s3
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s1, v3
-; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1132DAGISEL-NEXT:  ; %bb.2:
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
-; GFX1132DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1132DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1132GISEL-LABEL: divergent_value_float:
-; GFX1132GISEL:       ; %bb.0: ; %entry
-; GFX1132GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1132GISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1132GISEL-NEXT:    s_mov_b32 s1, 0x7fc00000
-; GFX1132GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s2, s0
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132GISEL-NEXT:    v_readlane_b32 s3, v2, s2
-; GFX1132GISEL-NEXT:    s_bitset0_b32 s0, s2
-; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s0, 0
-; GFX1132GISEL-NEXT:    v_min_f32_e64 v3, s1, s3
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s1, v3
-; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1132GISEL-NEXT:  ; %bb.2:
-; GFX1132GISEL-NEXT:    v_mov_b32_e32 v2, s1
-; GFX1132GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1132GISEL-NEXT:    s_setpc_b64 s[30:31]
-entry:
-  %result = call float @llvm.amdgcn.wave.reduce.fmin(float %in, i32 1)
-  store float %result, ptr addrspace(1) %out
-  ret void
-}
-
-define void @divergent_cfg_float(ptr addrspace(1) %out, float %in, float %in2) {
-; GFX8DAGISEL-LABEL: divergent_cfg_float:
-; GFX8DAGISEL:       ; %bb.0: ; %entry
-; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX8DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
-; GFX8DAGISEL-NEXT:    ; implicit-def: $vgpr4
-; GFX8DAGISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
-; GFX8DAGISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
-; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX8DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX8DAGISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX8DAGISEL-NEXT:    v_readlane_b32 s10, v2, s9
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v3, s10
-; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX8DAGISEL-NEXT:    v_min_f32_e32 v3, s8, v3
-; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s8, v3
-; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX8DAGISEL-NEXT:  ; %bb.3:
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
-; GFX8DAGISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX8DAGISEL-NEXT:  .LBB8_4: ; %Flow
-; GFX8DAGISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB8_8
-; GFX8DAGISEL-NEXT:  ; %bb.5: ; %if
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX8DAGISEL-NEXT:  .LBB8_6: ; =>This Inner Loop Header: Depth=1
-; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX8DAGISEL-NEXT:    v_readlane_b32 s10, v3, s9
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s10
-; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX8DAGISEL-NEXT:    v_min_f32_e32 v2, s8, v2
-; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s8, v2
-; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_6
-; GFX8DAGISEL-NEXT:  ; %bb.7:
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
-; GFX8DAGISEL-NEXT:  .LBB8_8: ; %endif
-; GFX8DAGISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v4
-; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX8DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8GISEL-LABEL: divergent_cfg_float:
-; GFX8GISEL:       ; %bb.0: ; %entry
-; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX8GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v4
-; GFX8GISEL-NEXT:    ; implicit-def: $sgpr8
-; GFX8GISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
-; GFX8GISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
-; GFX8GISEL-NEXT:    s_cbranch_execz .LBB8_3
-; GFX8GISEL-NEXT:  ; %bb.1: ; %else
-; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX8GISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX8GISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX8GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX8GISEL-NEXT:    v_readlane_b32 s10, v2, s9
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v4, s10
-; GFX8GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX8GISEL-NEXT:    v_min_f32_e32 v4, s8, v4
-; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s8, v4
-; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX8GISEL-NEXT:  .LBB8_3: ; %Flow
-; GFX8GISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX8GISEL-NEXT:    s_cbranch_execz .LBB8_6
-; GFX8GISEL-NEXT:  ; %bb.4: ; %if
-; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX8GISEL-NEXT:  .LBB8_5: ; =>This Inner Loop Header: Depth=1
-; GFX8GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX8GISEL-NEXT:    v_readlane_b32 s10, v3, s9
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s10
-; GFX8GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX8GISEL-NEXT:    v_min_f32_e32 v2, s8, v2
-; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s8, v2
-; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB8_5
-; GFX8GISEL-NEXT:  .LBB8_6: ; %endif
-; GFX8GISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s8
-; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX8GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX9DAGISEL-LABEL: divergent_cfg_float:
-; GFX9DAGISEL:       ; %bb.0: ; %entry
-; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX9DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
-; GFX9DAGISEL-NEXT:    ; implicit-def: $vgpr4
-; GFX9DAGISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
-; GFX9DAGISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
-; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX9DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX9DAGISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX9DAGISEL-NEXT:    v_readlane_b32 s10, v2, s9
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v3, s10
-; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX9DAGISEL-NEXT:    v_min_f32_e32 v3, s8, v3
-; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s8, v3
-; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX9DAGISEL-NEXT:  ; %bb.3:
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
-; GFX9DAGISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX9DAGISEL-NEXT:  .LBB8_4: ; %Flow
-; GFX9DAGISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB8_8
-; GFX9DAGISEL-NEXT:  ; %bb.5: ; %if
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX9DAGISEL-NEXT:  .LBB8_6: ; =>This Inner Loop Header: Depth=1
-; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX9DAGISEL-NEXT:    v_readlane_b32 s10, v3, s9
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v2, s10
-; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX9DAGISEL-NEXT:    v_min_f32_e32 v2, s8, v2
-; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s8, v2
-; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_6
-; GFX9DAGISEL-NEXT:  ; %bb.7:
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
-; GFX9DAGISEL-NEXT:  .LBB8_8: ; %endif
-; GFX9DAGISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GFX9DAGISEL-NEXT:    global_store_dword v[0:1], v4, off
-; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX9DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX9GISEL-LABEL: divergent_cfg_float:
-; GFX9GISEL:       ; %bb.0: ; %entry
-; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX9GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v4
-; GFX9GISEL-NEXT:    ; implicit-def: $sgpr8
-; GFX9GISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
-; GFX9GISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
-; GFX9GISEL-NEXT:    s_cbranch_execz .LBB8_3
-; GFX9GISEL-NEXT:  ; %bb.1: ; %else
-; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX9GISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX9GISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX9GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX9GISEL-NEXT:    v_readlane_b32 s10, v2, s9
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v4, s10
-; GFX9GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX9GISEL-NEXT:    v_min_f32_e32 v4, s8, v4
-; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s8, v4
-; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX9GISEL-NEXT:  .LBB8_3: ; %Flow
-; GFX9GISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX9GISEL-NEXT:    s_cbranch_execz .LBB8_6
-; GFX9GISEL-NEXT:  ; %bb.4: ; %if
-; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX9GISEL-NEXT:  .LBB8_5: ; =>This Inner Loop Header: Depth=1
-; GFX9GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX9GISEL-NEXT:    v_readlane_b32 s10, v3, s9
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s10
-; GFX9GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX9GISEL-NEXT:    v_min_f32_e32 v2, s8, v2
-; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s8, v2
-; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB8_5
-; GFX9GISEL-NEXT:  .LBB8_6: ; %endif
-; GFX9GISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s8
-; GFX9GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX9GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1064DAGISEL-LABEL: divergent_cfg_float:
-; GFX1064DAGISEL:       ; %bb.0: ; %entry
-; GFX1064DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1064DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
-; GFX1064DAGISEL-NEXT:    ; implicit-def: $vgpr4
-; GFX1064DAGISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
-; GFX1064DAGISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
-; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1064DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX1064DAGISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX1064DAGISEL-NEXT:    v_readlane_b32 s10, v2, s9
-; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX1064DAGISEL-NEXT:    v_min_f32_e64 v3, s8, s10
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s8, v3
-; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1064DAGISEL-NEXT:  ; %bb.3:
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
-; GFX1064DAGISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1064DAGISEL-NEXT:  .LBB8_4: ; %Flow
-; GFX1064DAGISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB8_8
-; GFX1064DAGISEL-NEXT:  ; %bb.5: ; %if
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064DAGISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX1064DAGISEL-NEXT:  .LBB8_6: ; =>This Inner Loop Header: Depth=1
-; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX1064DAGISEL-NEXT:    v_readlane_b32 s10, v3, s9
-; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX1064DAGISEL-NEXT:    v_min_f32_e64 v2, s8, s10
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s8, v2
-; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_6
-; GFX1064DAGISEL-NEXT:  ; %bb.7:
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v4, s8
-; GFX1064DAGISEL-NEXT:  .LBB8_8: ; %endif
-; GFX1064DAGISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GFX1064DAGISEL-NEXT:    global_store_dword v[0:1], v4, off
-; GFX1064DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1064GISEL-LABEL: divergent_cfg_float:
-; GFX1064GISEL:       ; %bb.0: ; %entry
-; GFX1064GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1064GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1064GISEL-NEXT:    ; implicit-def: $sgpr8
-; GFX1064GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v4
-; GFX1064GISEL-NEXT:    s_and_saveexec_b64 s[4:5], vcc
-; GFX1064GISEL-NEXT:    s_xor_b64 s[4:5], exec, s[4:5]
-; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB8_3
-; GFX1064GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX1064GISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX1064GISEL-NEXT:    v_readlane_b32 s10, v2, s9
-; GFX1064GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX1064GISEL-NEXT:    v_min_f32_e64 v3, s8, s10
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s8, v3
-; GFX1064GISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1064GISEL-NEXT:  .LBB8_3: ; %Flow
-; GFX1064GISEL-NEXT:    s_andn2_saveexec_b64 s[4:5], s[4:5]
-; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB8_6
-; GFX1064GISEL-NEXT:  ; %bb.4: ; %if
-; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064GISEL-NEXT:    s_mov_b32 s8, 0x7fc00000
-; GFX1064GISEL-NEXT:  .LBB8_5: ; =>This Inner Loop Header: Depth=1
-; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s9, s[6:7]
-; GFX1064GISEL-NEXT:    v_readlane_b32 s10, v3, s9
-; GFX1064GISEL-NEXT:    s_bitset0_b64 s[6:7], s9
-; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[6:7], 0
-; GFX1064GISEL-NEXT:    v_min_f32_e64 v2, s8, s10
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s8, v2
-; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB8_5
-; GFX1064GISEL-NEXT:  .LBB8_6: ; %endif
-; GFX1064GISEL-NEXT:    s_or_b64 exec, exec, s[4:5]
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v2, s8
-; GFX1064GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1064GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1032DAGISEL-LABEL: divergent_cfg_float:
-; GFX1032DAGISEL:       ; %bb.0: ; %entry
-; GFX1032DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1032DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc_lo, 15, v4
-; GFX1032DAGISEL-NEXT:    ; implicit-def: $vgpr4
-; GFX1032DAGISEL-NEXT:    s_and_saveexec_b32 s4, vcc_lo
-; GFX1032DAGISEL-NEXT:    s_xor_b32 s4, exec_lo, s4
-; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1032DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, exec_lo
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX1032DAGISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s7, s5
-; GFX1032DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s5, s7
-; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s5, 0
-; GFX1032DAGISEL-NEXT:    v_min_f32_e64 v3, s6, s8
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1032DAGISEL-NEXT:  ; %bb.3:
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v4, s6
-; GFX1032DAGISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1032DAGISEL-NEXT:  .LBB8_4: ; %Flow
-; GFX1032DAGISEL-NEXT:    s_andn2_saveexec_b32 s4, s4
-; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB8_8
-; GFX1032DAGISEL-NEXT:  ; %bb.5: ; %if
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, exec_lo
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s6, 0x7fc00000
-; GFX1032DAGISEL-NEXT:  .LBB8_6: ; =>This Inner Loop Header: Depth=1
-; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s7, s5
-; GFX1032DAGISEL-NEXT:    v_readlane_b32 s8, v3, s7
-; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s5, s7
-; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s5, 0
-; GFX1032DAGISEL-NEXT:    v_min_f32_e64 v2, s6, s8
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s6, v2
-; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_6
-; GFX1032DAGISEL-NEXT:  ; %bb.7:
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v4, s6
-; GFX1032DAGISEL-NEXT:  .LBB8_8: ; %endif
-; GFX1032DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s4
-; GFX1032DAGISEL-NEXT:    global_store_dword v[0:1], v4, off
-; GFX1032DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1032GISEL-LABEL: divergent_cfg_float:
-; GFX1032GISEL:       ; %bb.0: ; %entry
-; GFX1032GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1032GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1032GISEL-NEXT:    ; implicit-def: $sgpr4
-; GFX1032GISEL-NEXT:    v_cmp_le_u32_e32 vcc_lo, 16, v4
-; GFX1032GISEL-NEXT:    s_and_saveexec_b32 s5, vcc_lo
-; GFX1032GISEL-NEXT:    s_xor_b32 s5, exec_lo, s5
-; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB8_3
-; GFX1032GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1032GISEL-NEXT:    s_mov_b32 s6, exec_lo
-; GFX1032GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
-; GFX1032GISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s7, s6
-; GFX1032GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX1032GISEL-NEXT:    s_bitset0_b32 s6, s7
-; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s6, 0
-; GFX1032GISEL-NEXT:    v_min_f32_e64 v3, s4, s8
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s4, v3
-; GFX1032GISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1032GISEL-NEXT:  .LBB8_3: ; %Flow
-; GFX1032GISEL-NEXT:    s_andn2_saveexec_b32 s5, s5
-; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB8_6
-; GFX1032GISEL-NEXT:  ; %bb.4: ; %if
-; GFX1032GISEL-NEXT:    s_mov_b32 s6, exec_lo
-; GFX1032GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
-; GFX1032GISEL-NEXT:  .LBB8_5: ; =>This Inner Loop Header: Depth=1
-; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s7, s6
-; GFX1032GISEL-NEXT:    v_readlane_b32 s8, v3, s7
-; GFX1032GISEL-NEXT:    s_bitset0_b32 s6, s7
-; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s6, 0
-; GFX1032GISEL-NEXT:    v_min_f32_e64 v2, s4, s8
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s4, v2
-; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB8_5
-; GFX1032GISEL-NEXT:  .LBB8_6: ; %endif
-; GFX1032GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s5
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v2, s4
-; GFX1032GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1032GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1164DAGISEL-LABEL: divergent_cfg_float:
-; GFX1164DAGISEL:       ; %bb.0: ; %entry
-; GFX1164DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v4
-; GFX1164DAGISEL-NEXT:    ; implicit-def: $vgpr4
-; GFX1164DAGISEL-NEXT:    s_and_saveexec_b64 s[0:1], vcc
-; GFX1164DAGISEL-NEXT:    s_xor_b64 s[0:1], exec, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1164DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
-; GFX1164DAGISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
-; GFX1164DAGISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164DAGISEL-NEXT:    v_readlane_b32 s6, v2, s5
-; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[2:3], s5
-; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
-; GFX1164DAGISEL-NEXT:    v_min_f32_e64 v3, s4, s6
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s4, v3
-; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1164DAGISEL-NEXT:  ; %bb.3:
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v4, s4
-; GFX1164DAGISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1164DAGISEL-NEXT:  .LBB8_4: ; %Flow
-; GFX1164DAGISEL-NEXT:    s_and_not1_saveexec_b64 s[0:1], s[0:1]
-; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB8_8
-; GFX1164DAGISEL-NEXT:  ; %bb.5: ; %if
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
-; GFX1164DAGISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
-; GFX1164DAGISEL-NEXT:  .LBB8_6: ; =>This Inner Loop Header: Depth=1
-; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164DAGISEL-NEXT:    v_readlane_b32 s6, v3, s5
-; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[2:3], s5
-; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
-; GFX1164DAGISEL-NEXT:    v_min_f32_e64 v2, s4, s6
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s4, v2
-; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_6
-; GFX1164DAGISEL-NEXT:  ; %bb.7:
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v4, s4
-; GFX1164DAGISEL-NEXT:  .LBB8_8: ; %endif
-; GFX1164DAGISEL-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX1164DAGISEL-NEXT:    global_store_b32 v[0:1], v4, off
-; GFX1164DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1164GISEL-LABEL: divergent_cfg_float:
-; GFX1164GISEL:       ; %bb.0: ; %entry
-; GFX1164GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1164GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164GISEL-NEXT:    ; implicit-def: $sgpr4
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_cmpx_le_u32_e32 16, v4
-; GFX1164GISEL-NEXT:    s_xor_b64 s[0:1], exec, s[0:1]
-; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB8_3
-; GFX1164GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1164GISEL-NEXT:    s_mov_b64 s[2:3], exec
-; GFX1164GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
-; GFX1164GISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164GISEL-NEXT:    v_readlane_b32 s6, v2, s5
-; GFX1164GISEL-NEXT:    s_bitset0_b64 s[2:3], s5
-; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
-; GFX1164GISEL-NEXT:    v_min_f32_e64 v3, s4, s6
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s4, v3
-; GFX1164GISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1164GISEL-NEXT:  .LBB8_3: ; %Flow
-; GFX1164GISEL-NEXT:    s_and_not1_saveexec_b64 s[0:1], s[0:1]
-; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB8_6
-; GFX1164GISEL-NEXT:  ; %bb.4: ; %if
-; GFX1164GISEL-NEXT:    s_mov_b64 s[2:3], exec
-; GFX1164GISEL-NEXT:    s_mov_b32 s4, 0x7fc00000
-; GFX1164GISEL-NEXT:  .LBB8_5: ; =>This Inner Loop Header: Depth=1
-; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s5, s[2:3]
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164GISEL-NEXT:    v_readlane_b32 s6, v3, s5
-; GFX1164GISEL-NEXT:    s_bitset0_b64 s[2:3], s5
-; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[2:3], 0
-; GFX1164GISEL-NEXT:    v_min_f32_e64 v2, s4, s6
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s4, v2
-; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB8_5
-; GFX1164GISEL-NEXT:  .LBB8_6: ; %endif
-; GFX1164GISEL-NEXT:    s_or_b64 exec, exec, s[0:1]
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v2, s4
-; GFX1164GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1164GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1132DAGISEL-LABEL: divergent_cfg_float:
-; GFX1132DAGISEL:       ; %bb.0: ; %entry
-; GFX1132DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc_lo, 15, v4
-; GFX1132DAGISEL-NEXT:    ; implicit-def: $vgpr4
-; GFX1132DAGISEL-NEXT:    s_and_saveexec_b32 s0, vcc_lo
-; GFX1132DAGISEL-NEXT:    s_xor_b32 s0, exec_lo, s0
-; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1132DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, exec_lo
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
-; GFX1132DAGISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s3, s1
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    v_readlane_b32 s4, v2, s3
-; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s1, s3
-; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s1, 0
-; GFX1132DAGISEL-NEXT:    v_min_f32_e64 v3, s2, s4
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s2, v3
-; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1132DAGISEL-NEXT:  ; %bb.3:
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v4, s2
-; GFX1132DAGISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1132DAGISEL-NEXT:  .LBB8_4: ; %Flow
-; GFX1132DAGISEL-NEXT:    s_and_not1_saveexec_b32 s0, s0
-; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB8_8
-; GFX1132DAGISEL-NEXT:  ; %bb.5: ; %if
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, exec_lo
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, 0x7fc00000
-; GFX1132DAGISEL-NEXT:  .LBB8_6: ; =>This Inner Loop Header: Depth=1
-; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s3, s1
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    v_readlane_b32 s4, v3, s3
-; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s1, s3
-; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s1, 0
-; GFX1132DAGISEL-NEXT:    v_min_f32_e64 v2, s2, s4
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s2, v2
-; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB8_6
-; GFX1132DAGISEL-NEXT:  ; %bb.7:
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v4, s2
-; GFX1132DAGISEL-NEXT:  .LBB8_8: ; %endif
-; GFX1132DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; GFX1132DAGISEL-NEXT:    global_store_b32 v[0:1], v4, off
-; GFX1132DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1132GISEL-LABEL: divergent_cfg_float:
-; GFX1132GISEL:       ; %bb.0: ; %entry
-; GFX1132GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1132GISEL-NEXT:    v_and_b32_e32 v4, 0x3ff, v31
-; GFX1132GISEL-NEXT:    s_mov_b32 s1, exec_lo
-; GFX1132GISEL-NEXT:    ; implicit-def: $sgpr0
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_cmpx_le_u32_e32 16, v4
-; GFX1132GISEL-NEXT:    s_xor_b32 s1, exec_lo, s1
-; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB8_3
-; GFX1132GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1132GISEL-NEXT:    s_mov_b32 s0, 0x7fc00000
-; GFX1132GISEL-NEXT:  .LBB8_2: ; =>This Inner Loop Header: Depth=1
-; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s3, s2
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132GISEL-NEXT:    v_readlane_b32 s4, v2, s3
-; GFX1132GISEL-NEXT:    s_bitset0_b32 s2, s3
-; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s2, 0
-; GFX1132GISEL-NEXT:    v_min_f32_e64 v3, s0, s4
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s0, v3
-; GFX1132GISEL-NEXT:    ; implicit-def: $vgpr3
-; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB8_2
-; GFX1132GISEL-NEXT:  .LBB8_3: ; %Flow
-; GFX1132GISEL-NEXT:    s_and_not1_saveexec_b32 s1, s1
-; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB8_6
-; GFX1132GISEL-NEXT:  ; %bb.4: ; %if
-; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1132GISEL-NEXT:    s_mov_b32 s0, 0x7fc00000
-; GFX1132GISEL-NEXT:  .LBB8_5: ; =>This Inner Loop Header: Depth=1
-; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s3, s2
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132GISEL-NEXT:    v_readlane_b32 s4, v3, s3
-; GFX1132GISEL-NEXT:    s_bitset0_b32 s2, s3
-; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s2, 0
-; GFX1132GISEL-NEXT:    v_min_f32_e64 v2, s0, s4
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s0, v2
-; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB8_5
-; GFX1132GISEL-NEXT:  .LBB8_6: ; %endif
-; GFX1132GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s1
-; GFX1132GISEL-NEXT:    v_mov_b32_e32 v2, s0
-; GFX1132GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1132GISEL-NEXT:    s_setpc_b64 s[30:31]
-entry:
-  %tid = call i32 @llvm.amdgcn.workitem.id.x()
-  %d_cmp = icmp ult i32 %tid, 16
-  br i1 %d_cmp, label %if, label %else
-
-if:
-  %reducedValTid = call float @llvm.amdgcn.wave.reduce.fmin(float %in2, i32 1)
-  br label %endif
-
-else:
-  %reducedValIn = call float @llvm.amdgcn.wave.reduce.fmin(float %in, i32 1)
-  br label %endif
-
-endif:
-  %combine = phi float [%reducedValTid, %if], [%reducedValIn, %else]
-  store float %combine, ptr addrspace(1) %out
-  ret void
-}
 ;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
 ; GFX11DAGISEL: {{.*}}
 ; GFX11GISEL: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.sub.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.sub.ll
index e9a163c214011..f094213731684 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.sub.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.sub.ll
@@ -2220,1007 +2220,6 @@ endif:
   store i64 %combine, ptr addrspace(1) %out
   ret void
 }
-
-define amdgpu_kernel void @uniform_value_float(ptr addrspace(1) %out, float %in) {
-; GFX8DAGISEL-LABEL: uniform_value_float:
-; GFX8DAGISEL:       ; %bb.0: ; %entry
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX8DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX8DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
-; GFX8DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8DAGISEL-NEXT:    s_endpgm
-;
-; GFX8GISEL-LABEL: uniform_value_float:
-; GFX8GISEL:       ; %bb.0: ; %entry
-; GFX8GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX8GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX8GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
-; GFX8GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8GISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8GISEL-NEXT:    s_endpgm
-;
-; GFX9DAGISEL-LABEL: uniform_value_float:
-; GFX9DAGISEL:       ; %bb.0: ; %entry
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX9DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX9DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
-; GFX9DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX9DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX9DAGISEL-NEXT:    s_endpgm
-;
-; GFX9GISEL-LABEL: uniform_value_float:
-; GFX9GISEL:       ; %bb.0: ; %entry
-; GFX9GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX9GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX9GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
-; GFX9GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9GISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX9GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX9GISEL-NEXT:    s_endpgm
-;
-; GFX1064DAGISEL-LABEL: uniform_value_float:
-; GFX1064DAGISEL:       ; %bb.0: ; %entry
-; GFX1064DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1064DAGISEL-NEXT:    s_bcnt1_i32_b64 s3, s[0:1]
-; GFX1064DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1064DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
-; GFX1064DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
-; GFX1064DAGISEL-NEXT:    s_endpgm
-;
-; GFX1064GISEL-LABEL: uniform_value_float:
-; GFX1064GISEL:       ; %bb.0: ; %entry
-; GFX1064GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX1064GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1064GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
-; GFX1064GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1064GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064GISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX1064GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX1064GISEL-NEXT:    s_endpgm
-;
-; GFX1032DAGISEL-LABEL: uniform_value_float:
-; GFX1032DAGISEL:       ; %bb.0: ; %entry
-; GFX1032DAGISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1032DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s0
-; GFX1032DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1032DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
-; GFX1032DAGISEL-NEXT:    global_store_dword v0, v1, s[0:1]
-; GFX1032DAGISEL-NEXT:    s_endpgm
-;
-; GFX1032GISEL-LABEL: uniform_value_float:
-; GFX1032GISEL:       ; %bb.0: ; %entry
-; GFX1032GISEL-NEXT:    s_load_dword s2, s[4:5], 0x2c
-; GFX1032GISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1032GISEL-NEXT:    s_bcnt1_i32_b32 s0, s0
-; GFX1032GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1032GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032GISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX1032GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX1032GISEL-NEXT:    s_endpgm
-;
-; GFX1164DAGISEL-LABEL: uniform_value_float:
-; GFX1164DAGISEL:       ; %bb.0: ; %entry
-; GFX1164DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    s_bcnt1_i32_b64 s3, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1164DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, 0
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v1, s2
-; GFX1164DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_endpgm
-;
-; GFX1164GISEL-LABEL: uniform_value_float:
-; GFX1164GISEL:       ; %bb.0: ; %entry
-; GFX1164GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1164GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[0:1]
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164GISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX1164GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1164GISEL-NEXT:    s_endpgm
-;
-; GFX1132DAGISEL-LABEL: uniform_value_float:
-; GFX1132DAGISEL:       ; %bb.0: ; %entry
-; GFX1132DAGISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s0
-; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1132DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1132DAGISEL-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
-; GFX1132DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
-; GFX1132DAGISEL-NEXT:    s_endpgm
-;
-; GFX1132GISEL-LABEL: uniform_value_float:
-; GFX1132GISEL:       ; %bb.0: ; %entry
-; GFX1132GISEL-NEXT:    s_load_b32 s2, s[4:5], 0x2c
-; GFX1132GISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1132GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1132GISEL-NEXT:    s_bcnt1_i32_b32 s0, s0
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_2) | instid1(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132GISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1132GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX1132GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1132GISEL-NEXT:    s_endpgm
-;
-; GFX12DAGISEL-LABEL: uniform_value_float:
-; GFX12DAGISEL:       ; %bb.0: ; %entry
-; GFX12DAGISEL-NEXT:    s_load_b96 s[0:2], s[4:5], 0x24
-; GFX12DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX12DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
-; GFX12DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_mul_f32_e64 v0, -s2, v0
-; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_va_sdst(0)
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_dual_mov_b32 v0, 0 :: v_dual_mov_b32 v1, s2
-; GFX12DAGISEL-NEXT:    global_store_b32 v0, v1, s[0:1]
-; GFX12DAGISEL-NEXT:    s_endpgm
-entry:
-  %result = call float @llvm.amdgcn.wave.reduce.fsub(float %in, i32 1)
-  store float %result, ptr addrspace(1) %out
-  ret void
-}
-
-define void @divergent_value_float(ptr addrspace(1) %out, float %id.x) {
-; GFX8DAGISEL-LABEL: divergent_value_float:
-; GFX8DAGISEL:       ; %bb.0: ; %entry
-; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX8DAGISEL-NEXT:    s_mov_b32 s6, 0
-; GFX8DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX8DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX8DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX8DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX8DAGISEL-NEXT:    v_sub_f32_e32 v3, s6, v3
-; GFX8DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX8DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX8DAGISEL-NEXT:  ; %bb.2:
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX8DAGISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8DAGISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX8DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8GISEL-LABEL: divergent_value_float:
-; GFX8GISEL:       ; %bb.0: ; %entry
-; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8GISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX8GISEL-NEXT:    s_mov_b32 s6, 0
-; GFX8GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX8GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX8GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX8GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX8GISEL-NEXT:    v_sub_f32_e32 v3, s6, v3
-; GFX8GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX8GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX8GISEL-NEXT:  ; %bb.2:
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8GISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX8GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX9DAGISEL-LABEL: divergent_value_float:
-; GFX9DAGISEL:       ; %bb.0: ; %entry
-; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX9DAGISEL-NEXT:    s_mov_b32 s6, 0
-; GFX9DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX9DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX9DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX9DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX9DAGISEL-NEXT:    v_sub_f32_e32 v3, s6, v3
-; GFX9DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX9DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX9DAGISEL-NEXT:  ; %bb.2:
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX9DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX9DAGISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX9DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX9GISEL-LABEL: divergent_value_float:
-; GFX9GISEL:       ; %bb.0: ; %entry
-; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9GISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX9GISEL-NEXT:    s_mov_b32 s6, 0
-; GFX9GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX9GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX9GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v3, s8
-; GFX9GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX9GISEL-NEXT:    v_sub_f32_e32 v3, s6, v3
-; GFX9GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX9GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX9GISEL-NEXT:  ; %bb.2:
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX9GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX9GISEL-NEXT:    s_waitcnt vmcnt(0)
-; GFX9GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1064DAGISEL-LABEL: divergent_value_float:
-; GFX1064DAGISEL:       ; %bb.0: ; %entry
-; GFX1064DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX1064DAGISEL-NEXT:    s_mov_b32 s6, 0
-; GFX1064DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1064DAGISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX1064DAGISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX1064DAGISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX1064DAGISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX1064DAGISEL-NEXT:    v_sub_f32_e64 v3, s6, s8
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX1064DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1064DAGISEL-NEXT:  ; %bb.2:
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX1064DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1064DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1064GISEL-LABEL: divergent_value_float:
-; GFX1064GISEL:       ; %bb.0: ; %entry
-; GFX1064GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1064GISEL-NEXT:    s_mov_b64 s[4:5], exec
-; GFX1064GISEL-NEXT:    s_mov_b32 s6, 0
-; GFX1064GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1064GISEL-NEXT:    s_ff1_i32_b64 s7, s[4:5]
-; GFX1064GISEL-NEXT:    v_readlane_b32 s8, v2, s7
-; GFX1064GISEL-NEXT:    s_bitset0_b64 s[4:5], s7
-; GFX1064GISEL-NEXT:    s_cmp_lg_u64 s[4:5], 0
-; GFX1064GISEL-NEXT:    v_sub_f32_e64 v3, s6, s8
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v3
-; GFX1064GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1064GISEL-NEXT:  ; %bb.2:
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX1064GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1064GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1032DAGISEL-LABEL: divergent_value_float:
-; GFX1032DAGISEL:       ; %bb.0: ; %entry
-; GFX1032DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s4, exec_lo
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s5, 0
-; GFX1032DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1032DAGISEL-NEXT:    s_ff1_i32_b32 s6, s4
-; GFX1032DAGISEL-NEXT:    v_readlane_b32 s7, v2, s6
-; GFX1032DAGISEL-NEXT:    s_bitset0_b32 s4, s6
-; GFX1032DAGISEL-NEXT:    s_cmp_lg_u32 s4, 0
-; GFX1032DAGISEL-NEXT:    v_sub_f32_e64 v3, s5, s7
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s5, v3
-; GFX1032DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1032DAGISEL-NEXT:  ; %bb.2:
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v2, s5
-; GFX1032DAGISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1032DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1032GISEL-LABEL: divergent_value_float:
-; GFX1032GISEL:       ; %bb.0: ; %entry
-; GFX1032GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1032GISEL-NEXT:    s_mov_b32 s4, exec_lo
-; GFX1032GISEL-NEXT:    s_mov_b32 s5, 0
-; GFX1032GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1032GISEL-NEXT:    s_ff1_i32_b32 s6, s4
-; GFX1032GISEL-NEXT:    v_readlane_b32 s7, v2, s6
-; GFX1032GISEL-NEXT:    s_bitset0_b32 s4, s6
-; GFX1032GISEL-NEXT:    s_cmp_lg_u32 s4, 0
-; GFX1032GISEL-NEXT:    v_sub_f32_e64 v3, s5, s7
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s5, v3
-; GFX1032GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1032GISEL-NEXT:  ; %bb.2:
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v2, s5
-; GFX1032GISEL-NEXT:    global_store_dword v[0:1], v2, off
-; GFX1032GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1164DAGISEL-LABEL: divergent_value_float:
-; GFX1164DAGISEL:       ; %bb.0: ; %entry
-; GFX1164DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164DAGISEL-NEXT:    s_mov_b32 s2, 0
-; GFX1164DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1164DAGISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164DAGISEL-NEXT:    v_readlane_b32 s4, v2, s3
-; GFX1164DAGISEL-NEXT:    s_bitset0_b64 s[0:1], s3
-; GFX1164DAGISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
-; GFX1164DAGISEL-NEXT:    v_sub_f32_e64 v3, s2, s4
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s2, v3
-; GFX1164DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1164DAGISEL-NEXT:  ; %bb.2:
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX1164DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1164DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1164GISEL-LABEL: divergent_value_float:
-; GFX1164GISEL:       ; %bb.0: ; %entry
-; GFX1164GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1164GISEL-NEXT:    s_mov_b64 s[0:1], exec
-; GFX1164GISEL-NEXT:    s_mov_b32 s2, 0
-; GFX1164GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1164GISEL-NEXT:    s_ctz_i32_b64 s3, s[0:1]
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1164GISEL-NEXT:    v_readlane_b32 s4, v2, s3
-; GFX1164GISEL-NEXT:    s_bitset0_b64 s[0:1], s3
-; GFX1164GISEL-NEXT:    s_cmp_lg_u64 s[0:1], 0
-; GFX1164GISEL-NEXT:    v_sub_f32_e64 v3, s2, s4
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s2, v3
-; GFX1164GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1164GISEL-NEXT:  ; %bb.2:
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v2, s2
-; GFX1164GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1164GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1132DAGISEL-LABEL: divergent_value_float:
-; GFX1132DAGISEL:       ; %bb.0: ; %entry
-; GFX1132DAGISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s1, 0
-; GFX1132DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1132DAGISEL-NEXT:    s_ctz_i32_b32 s2, s0
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    v_readlane_b32 s3, v2, s2
-; GFX1132DAGISEL-NEXT:    s_bitset0_b32 s0, s2
-; GFX1132DAGISEL-NEXT:    s_cmp_lg_u32 s0, 0
-; GFX1132DAGISEL-NEXT:    v_sub_f32_e64 v3, s1, s3
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s1, v3
-; GFX1132DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1132DAGISEL-NEXT:  ; %bb.2:
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
-; GFX1132DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1132DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX1132GISEL-LABEL: divergent_value_float:
-; GFX1132GISEL:       ; %bb.0: ; %entry
-; GFX1132GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX1132GISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX1132GISEL-NEXT:    s_mov_b32 s1, 0
-; GFX1132GISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX1132GISEL-NEXT:    s_ctz_i32_b32 s2, s0
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
-; GFX1132GISEL-NEXT:    v_readlane_b32 s3, v2, s2
-; GFX1132GISEL-NEXT:    s_bitset0_b32 s0, s2
-; GFX1132GISEL-NEXT:    s_cmp_lg_u32 s0, 0
-; GFX1132GISEL-NEXT:    v_sub_f32_e64 v3, s1, s3
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s1, v3
-; GFX1132GISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX1132GISEL-NEXT:  ; %bb.2:
-; GFX1132GISEL-NEXT:    v_mov_b32_e32 v2, s1
-; GFX1132GISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX1132GISEL-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX12DAGISEL-LABEL: divergent_value_float:
-; GFX12DAGISEL:       ; %bb.0: ; %entry
-; GFX12DAGISEL-NEXT:    s_wait_loadcnt_dscnt 0x0
-; GFX12DAGISEL-NEXT:    s_wait_expcnt 0x0
-; GFX12DAGISEL-NEXT:    s_wait_samplecnt 0x0
-; GFX12DAGISEL-NEXT:    s_wait_bvhcnt 0x0
-; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
-; GFX12DAGISEL-NEXT:    s_mov_b32 s0, exec_lo
-; GFX12DAGISEL-NEXT:    s_mov_b32 s1, 0
-; GFX12DAGISEL-NEXT:  .LBB7_1: ; =>This Inner Loop Header: Depth=1
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; GFX12DAGISEL-NEXT:    s_ctz_i32_b32 s2, s0
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; GFX12DAGISEL-NEXT:    v_readlane_b32 s3, v2, s2
-; GFX12DAGISEL-NEXT:    s_bitset0_b32 s0, s2
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; GFX12DAGISEL-NEXT:    s_cmp_lg_u32 s0, 0
-; GFX12DAGISEL-NEXT:    v_sub_f32_e64 v3, s1, s3
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s1, v3
-; GFX12DAGISEL-NEXT:    s_cbranch_scc1 .LBB7_1
-; GFX12DAGISEL-NEXT:  ; %bb.2:
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_va_sdst(0)
-; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
-; GFX12DAGISEL-NEXT:    global_store_b32 v[0:1], v2, off
-; GFX12DAGISEL-NEXT:    s_setpc_b64 s[30:31]
-entry:
-  %result = call float @llvm.amdgcn.wave.reduce.fsub(float %id.x, i32 1)
-  store float %result, ptr addrspace(1) %out
-  ret void
-}
-
-define amdgpu_kernel void @divergent_cfg_float(ptr addrspace(1) %out, float %in, float %in2) {
-; GFX8DAGISEL-LABEL: divergent_cfg_float:
-; GFX8DAGISEL:       ; %bb.0: ; %entry
-; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX8DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v0
-; GFX8DAGISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX8DAGISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
-; GFX8DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX8DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX8DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX8DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX8DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX8DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
-; GFX8DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX8DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX8DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX8DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX8DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
-; GFX8DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX8DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX8DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v2, s1
-; GFX8DAGISEL-NEXT:    v_mov_b32_e32 v1, s0
-; GFX8DAGISEL-NEXT:    flat_store_dword v[1:2], v0
-; GFX8DAGISEL-NEXT:    s_endpgm
-;
-; GFX8GISEL-LABEL: divergent_cfg_float:
-; GFX8GISEL:       ; %bb.0: ; %entry
-; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX8GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v0
-; GFX8GISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX8GISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
-; GFX8GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX8GISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX8GISEL-NEXT:  ; %bb.1: ; %else
-; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX8GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8GISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX8GISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX8GISEL-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
-; GFX8GISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX8GISEL-NEXT:  ; %bb.3: ; %if
-; GFX8GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX8GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX8GISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
-; GFX8GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX8GISEL-NEXT:  .LBB8_4: ; %endif
-; GFX8GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v2, s6
-; GFX8GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX8GISEL-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8GISEL-NEXT:    flat_store_dword v[0:1], v2
-; GFX8GISEL-NEXT:    s_endpgm
-;
-; GFX9DAGISEL-LABEL: divergent_cfg_float:
-; GFX9DAGISEL:       ; %bb.0: ; %entry
-; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX9DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v0
-; GFX9DAGISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX9DAGISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
-; GFX9DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX9DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX9DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX9DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX9DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX9DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
-; GFX9DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX9DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX9DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX9DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX9DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
-; GFX9DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX9DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX9DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX9DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX9DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX9DAGISEL-NEXT:    s_endpgm
-;
-; GFX9GISEL-LABEL: divergent_cfg_float:
-; GFX9GISEL:       ; %bb.0: ; %entry
-; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX9GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v0
-; GFX9GISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX9GISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
-; GFX9GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX9GISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX9GISEL-NEXT:  ; %bb.1: ; %else
-; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX9GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9GISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX9GISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX9GISEL-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
-; GFX9GISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX9GISEL-NEXT:  ; %bb.3: ; %if
-; GFX9GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX9GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX9GISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
-; GFX9GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX9GISEL-NEXT:  .LBB8_4: ; %endif
-; GFX9GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX9GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX9GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX9GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX9GISEL-NEXT:    s_endpgm
-;
-; GFX1064DAGISEL-LABEL: divergent_cfg_float:
-; GFX1064DAGISEL:       ; %bb.0: ; %entry
-; GFX1064DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX1064DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc, 15, v0
-; GFX1064DAGISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX1064DAGISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
-; GFX1064DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1064DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX1064DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX1064DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1064DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX1064DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
-; GFX1064DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1064DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX1064DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX1064DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1064DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
-; GFX1064DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX1064DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1064DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1064DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1064DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX1064DAGISEL-NEXT:    s_endpgm
-;
-; GFX1064GISEL-LABEL: divergent_cfg_float:
-; GFX1064GISEL:       ; %bb.0: ; %entry
-; GFX1064GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX1064GISEL-NEXT:    v_cmp_le_u32_e32 vcc, 16, v0
-; GFX1064GISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX1064GISEL-NEXT:    s_and_saveexec_b64 s[2:3], vcc
-; GFX1064GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1064GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX1064GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064GISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX1064GISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1064GISEL-NEXT:    s_andn2_saveexec_b64 s[2:3], s[2:3]
-; GFX1064GISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1064GISEL-NEXT:  ; %bb.3: ; %if
-; GFX1064GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX1064GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1064GISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
-; GFX1064GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX1064GISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1064GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX1064GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1064GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1064GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX1064GISEL-NEXT:    s_endpgm
-;
-; GFX1032DAGISEL-LABEL: divergent_cfg_float:
-; GFX1032DAGISEL:       ; %bb.0: ; %entry
-; GFX1032DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX1032DAGISEL-NEXT:    v_cmp_lt_u32_e32 vcc_lo, 15, v0
-; GFX1032DAGISEL-NEXT:    ; implicit-def: $sgpr3
-; GFX1032DAGISEL-NEXT:    s_and_saveexec_b32 s2, vcc_lo
-; GFX1032DAGISEL-NEXT:    s_xor_b32 s2, exec_lo, s2
-; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1032DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
-; GFX1032DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
-; GFX1032DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s3, v0
-; GFX1032DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    s_or_saveexec_b32 s0, s2
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v0, s3
-; GFX1032DAGISEL-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
-; GFX1032DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1032DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX1032DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1032DAGISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX1032DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX1032DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
-; GFX1032DAGISEL-NEXT:    v_readfirstlane_b32 s1, v0
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v0, s1
-; GFX1032DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1032DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; GFX1032DAGISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1032DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1032DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032DAGISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX1032DAGISEL-NEXT:    s_endpgm
-;
-; GFX1032GISEL-LABEL: divergent_cfg_float:
-; GFX1032GISEL:       ; %bb.0: ; %entry
-; GFX1032GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x2c
-; GFX1032GISEL-NEXT:    v_cmp_le_u32_e32 vcc_lo, 16, v0
-; GFX1032GISEL-NEXT:    ; implicit-def: $sgpr2
-; GFX1032GISEL-NEXT:    s_and_saveexec_b32 s3, vcc_lo
-; GFX1032GISEL-NEXT:    s_xor_b32 s3, exec_lo, s3
-; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1032GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1032GISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1032GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX1032GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032GISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1032GISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032GISEL-NEXT:    s_andn2_saveexec_b32 s0, s3
-; GFX1032GISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1032GISEL-NEXT:  ; %bb.3: ; %if
-; GFX1032GISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1032GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX1032GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX1032GISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
-; GFX1032GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1032GISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1032GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; GFX1032GISEL-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x24
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v0, s2
-; GFX1032GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1032GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1032GISEL-NEXT:    global_store_dword v1, v0, s[0:1]
-; GFX1032GISEL-NEXT:    s_endpgm
-;
-; GFX1164DAGISEL-LABEL: divergent_cfg_float:
-; GFX1164DAGISEL:       ; %bb.0: ; %entry
-; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
-; GFX1164DAGISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[2:3], exec
-; GFX1164DAGISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_cmpx_lt_u32_e32 15, v0
-; GFX1164DAGISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1164DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1164DAGISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX1164DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX1164DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1164DAGISEL-NEXT:    s_or_saveexec_b64 s[2:3], s[2:3]
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX1164DAGISEL-NEXT:    s_xor_b64 exec, exec, s[2:3]
-; GFX1164DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1164DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX1164DAGISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1164DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
-; GFX1164DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164DAGISEL-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v0, s0
-; GFX1164DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1164DAGISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1164DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1164DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164DAGISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1164DAGISEL-NEXT:    s_endpgm
-;
-; GFX1164GISEL-LABEL: divergent_cfg_float:
-; GFX1164GISEL:       ; %bb.0: ; %entry
-; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
-; GFX1164GISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
-; GFX1164GISEL-NEXT:    s_mov_b64 s[2:3], exec
-; GFX1164GISEL-NEXT:    ; implicit-def: $sgpr6
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_cmpx_le_u32_e32 16, v0
-; GFX1164GISEL-NEXT:    s_xor_b64 s[2:3], exec, s[2:3]
-; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1164GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1164GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1164GISEL-NEXT:    s_bcnt1_i32_b64 s6, s[6:7]
-; GFX1164GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s6
-; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX1164GISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1164GISEL-NEXT:    s_and_not1_saveexec_b64 s[2:3], s[2:3]
-; GFX1164GISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1164GISEL-NEXT:  ; %bb.3: ; %if
-; GFX1164GISEL-NEXT:    s_mov_b64 s[6:7], exec
-; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164GISEL-NEXT:    s_bcnt1_i32_b64 s0, s[6:7]
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s0
-; GFX1164GISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_readfirstlane_b32 s6, v0
-; GFX1164GISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1164GISEL-NEXT:    s_or_b64 exec, exec, s[2:3]
-; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1164GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v0, s6
-; GFX1164GISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1164GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1164GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1164GISEL-NEXT:    s_endpgm
-;
-; GFX1132DAGISEL-LABEL: divergent_cfg_float:
-; GFX1132DAGISEL:       ; %bb.0: ; %entry
-; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
-; GFX1132DAGISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1132DAGISEL-NEXT:    ; implicit-def: $sgpr3
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_cmpx_lt_u32_e32 15, v0
-; GFX1132DAGISEL-NEXT:    s_xor_b32 s2, exec_lo, s2
-; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1132DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
-; GFX1132DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s3, v0
-; GFX1132DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    s_or_saveexec_b32 s0, s2
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v0, s3
-; GFX1132DAGISEL-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
-; GFX1132DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1132DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX1132DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1132DAGISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX1132DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
-; GFX1132DAGISEL-NEXT:    v_readfirstlane_b32 s1, v0
-; GFX1132DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v0, s1
-; GFX1132DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1132DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; GFX1132DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1132DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX1132DAGISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132DAGISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1132DAGISEL-NEXT:    s_endpgm
-;
-; GFX1132GISEL-LABEL: divergent_cfg_float:
-; GFX1132GISEL:       ; %bb.0: ; %entry
-; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
-; GFX1132GISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
-; GFX1132GISEL-NEXT:    s_mov_b32 s3, exec_lo
-; GFX1132GISEL-NEXT:    ; implicit-def: $sgpr2
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_cmpx_le_u32_e32 16, v0
-; GFX1132GISEL-NEXT:    s_xor_b32 s3, exec_lo, s3
-; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX1132GISEL-NEXT:  ; %bb.1: ; %else
-; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1132GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX1132GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1132GISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132GISEL-NEXT:    s_and_not1_saveexec_b32 s0, s3
-; GFX1132GISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX1132GISEL-NEXT:  ; %bb.3: ; %if
-; GFX1132GISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX1132GISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX1132GISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
-; GFX1132GISEL-NEXT:    v_readfirstlane_b32 s2, v0
-; GFX1132GISEL-NEXT:  .LBB8_4: ; %endif
-; GFX1132GISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; GFX1132GISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX1132GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1132GISEL-NEXT:    v_dual_mov_b32 v0, s2 :: v_dual_mov_b32 v1, 0
-; GFX1132GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX1132GISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX1132GISEL-NEXT:    s_endpgm
-;
-; GFX12DAGISEL-LABEL: divergent_cfg_float:
-; GFX12DAGISEL:       ; %bb.0: ; %entry
-; GFX12DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x2c
-; GFX12DAGISEL-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
-; GFX12DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX12DAGISEL-NEXT:    ; implicit-def: $sgpr3
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_cmpx_lt_u32_e32 15, v0
-; GFX12DAGISEL-NEXT:    s_xor_b32 s2, exec_lo, s2
-; GFX12DAGISEL-NEXT:    s_cbranch_execz .LBB8_2
-; GFX12DAGISEL-NEXT:  ; %bb.1: ; %else
-; GFX12DAGISEL-NEXT:    s_mov_b32 s3, exec_lo
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX12DAGISEL-NEXT:    s_bcnt1_i32_b32 s3, s3
-; GFX12DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s3
-; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_mul_f32_e64 v0, -s0, v0
-; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s3, v0
-; GFX12DAGISEL-NEXT:  .LBB8_2: ; %Flow
-; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
-; GFX12DAGISEL-NEXT:    s_or_saveexec_b32 s0, s2
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v0, s3
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; GFX12DAGISEL-NEXT:    s_xor_b32 exec_lo, exec_lo, s0
-; GFX12DAGISEL-NEXT:    s_cbranch_execz .LBB8_4
-; GFX12DAGISEL-NEXT:  ; %bb.3: ; %if
-; GFX12DAGISEL-NEXT:    s_mov_b32 s2, exec_lo
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; GFX12DAGISEL-NEXT:    s_bcnt1_i32_b32 s2, s2
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_sa_sdst(0)
-; GFX12DAGISEL-NEXT:    v_cvt_f32_i32_e32 v0, s2
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_mul_f32_e64 v0, -s1, v0
-; GFX12DAGISEL-NEXT:    v_readfirstlane_b32 s1, v0
-; GFX12DAGISEL-NEXT:    s_wait_alu depctr_va_sdst(0)
-; GFX12DAGISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v0, s1
-; GFX12DAGISEL-NEXT:  .LBB8_4: ; %endif
-; GFX12DAGISEL-NEXT:    s_or_b32 exec_lo, exec_lo, s0
-; GFX12DAGISEL-NEXT:    s_load_b64 s[0:1], s[4:5], 0x24
-; GFX12DAGISEL-NEXT:    v_mov_b32_e32 v1, 0
-; GFX12DAGISEL-NEXT:    s_wait_kmcnt 0x0
-; GFX12DAGISEL-NEXT:    global_store_b32 v1, v0, s[0:1]
-; GFX12DAGISEL-NEXT:    s_endpgm
-entry:
-  %tid = call i32 @llvm.amdgcn.workitem.id.x()
-  %d_cmp = icmp ult i32 %tid, 16
-  br i1 %d_cmp, label %if, label %else
-
-if:
-  %reducedValTid = call float @llvm.amdgcn.wave.reduce.fsub(float %in2, i32 1)
-  br label %endif
-
-else:
-  %reducedValIn = call float @llvm.amdgcn.wave.reduce.fsub(float %in, i32 1)
-  br label %endif
-
-endif:
-  %combine = phi float [%reducedValTid, %if], [%reducedValIn, %else]
-  store float %combine, ptr addrspace(1) %out
-  ret void
-}
 ;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
 ; GFX10DAGISEL: {{.*}}
 ; GFX10GISEL: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-mem-transfer.ll b/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-mem-transfer.ll
index ad0b4fd8d902e..83d6f4f5882b4 100644
--- a/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-mem-transfer.ll
+++ b/llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-mem-transfer.ll
@@ -19,9 +19,9 @@ define void @memcpy_known(ptr addrspace(7) inreg %src, ptr addrspace(7) inreg %d
 ; CHECK-NEXT:    [[DST_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[DST]], 1
 ; CHECK-NEXT:    [[SRC_RSRC:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 0
 ; CHECK-NEXT:    [[SRC_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 1
-; CHECK-NEXT:    br label %[[LOAD_STORE_LOOP:.*]]
-; CHECK:       [[LOAD_STORE_LOOP]]:
-; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[LOAD_STORE_LOOP]] ]
+; CHECK-NEXT:    br label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY:.*]]
+; CHECK:       [[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]]:
+; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = add i32 [[SRC_OFF]], [[LOOP_INDEX]]
 ; CHECK-NEXT:    [[DOTOFF_0:%.*]] = call <4 x i32> @llvm.amdgcn.raw.ptr.buffer.load.v4i32(ptr addrspace(8) align 16 [[SRC_RSRC]], i32 [[TMP1]], i32 0, i32 0)
 ; CHECK-NEXT:    [[DOTEXT_0:%.*]] = shufflevector <4 x i32> [[DOTOFF_0]], <4 x i32> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
@@ -135,8 +135,8 @@ define void @memcpy_known(ptr addrspace(7) inreg %src, ptr addrspace(7) inreg %d
 ; CHECK-NEXT:    call void @llvm.amdgcn.raw.ptr.buffer.store.v4i32(<4 x i32> [[DOTOFF_240]], ptr addrspace(8) align 16 [[DST_RSRC]], i32 [[DOTPART_60]], i32 0, i32 0)
 ; CHECK-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; CHECK-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 8192
-; CHECK-NEXT:    br i1 [[TMP5]], label %[[LOAD_STORE_LOOP]], label %[[MEMCPY_SPLIT:.*]]
-; CHECK:       [[MEMCPY_SPLIT]]:
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]], label %[[STATIC_MEMCPY_POST_LOOP_EXPANSION:.*]]
+; CHECK:       [[STATIC_MEMCPY_POST_LOOP_EXPANSION]]:
 ; CHECK-NEXT:    ret void
 ;
   call void @llvm.memcpy.p7.p7.i32(ptr addrspace(7) noundef nonnull align 16 %dst, ptr addrspace(7) noundef nonnull align 16 %src, i32 8192, i1 false)
@@ -211,9 +211,9 @@ define void @memcpy_known_i64(ptr addrspace(7) inreg %src, ptr addrspace(7) inre
 ; CHECK-NEXT:    [[DST_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[DST]], 1
 ; CHECK-NEXT:    [[SRC_RSRC:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 0
 ; CHECK-NEXT:    [[SRC_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 1
-; CHECK-NEXT:    br label %[[LOAD_STORE_LOOP:.*]]
-; CHECK:       [[LOAD_STORE_LOOP]]:
-; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[LOAD_STORE_LOOP]] ]
+; CHECK-NEXT:    br label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY:.*]]
+; CHECK:       [[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]]:
+; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]] ]
 ; CHECK-NEXT:    [[LOOP_INDEX_C:%.*]] = trunc i64 [[LOOP_INDEX]] to i32
 ; CHECK-NEXT:    [[TMP1:%.*]] = add i32 [[SRC_OFF]], [[LOOP_INDEX_C]]
 ; CHECK-NEXT:    [[DOTOFF_0:%.*]] = call <4 x i32> @llvm.amdgcn.raw.ptr.buffer.load.v4i32(ptr addrspace(8) align 1 [[SRC_RSRC]], i32 [[TMP1]], i32 0, i32 0)
@@ -329,8 +329,8 @@ define void @memcpy_known_i64(ptr addrspace(7) inreg %src, ptr addrspace(7) inre
 ; CHECK-NEXT:    call void @llvm.amdgcn.raw.ptr.buffer.store.v4i32(<4 x i32> [[DOTOFF_240]], ptr addrspace(8) align 1 [[DST_RSRC]], i32 [[DOTPART_60]], i32 0, i32 0)
 ; CHECK-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; CHECK-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 8192
-; CHECK-NEXT:    br i1 [[TMP5]], label %[[LOAD_STORE_LOOP]], label %[[MEMCPY_SPLIT:.*]]
-; CHECK:       [[MEMCPY_SPLIT]]:
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]], label %[[STATIC_MEMCPY_POST_LOOP_EXPANSION:.*]]
+; CHECK:       [[STATIC_MEMCPY_POST_LOOP_EXPANSION]]:
 ; CHECK-NEXT:    ret void
 ;
   call void @llvm.memcpy.p7.p7.i64(ptr addrspace(7) %dst, ptr addrspace(7) %src, i64 8192, i1 false)
@@ -366,18 +366,21 @@ define void @memcpy_unknown(ptr addrspace(7) inreg %src, ptr addrspace(7) inreg
 ; CHECK-NEXT:    [[TMP1:%.*]] = and i32 [[LENGTH]], 15
 ; CHECK-NEXT:    [[TMP2:%.*]] = sub i32 [[LENGTH]], [[TMP1]]
 ; CHECK-NEXT:    [[TMP3:%.*]] = icmp ne i32 [[TMP2]], 0
-; CHECK-NEXT:    br i1 [[TMP3]], label %[[LOOP_MEMCPY_EXPANSION:.*]], label %[[LOOP_MEMCPY_RESIDUAL_HEADER:.*]]
-; CHECK:       [[LOOP_MEMCPY_EXPANSION]]:
-; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP7:%.*]], %[[LOOP_MEMCPY_EXPANSION]] ]
+; CHECK-NEXT:    br i1 [[TMP3]], label %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY:.*]], label %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_COND:.*]]
+; CHECK:       [[DYNAMIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]]:
+; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP7:%.*]], %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]] ]
 ; CHECK-NEXT:    [[TMP4:%.*]] = add i32 [[SRC_OFF]], [[LOOP_INDEX]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call <4 x i32> @llvm.amdgcn.raw.ptr.buffer.load.v4i32(ptr addrspace(8) align 1 [[SRC_RSRC]], i32 [[TMP4]], i32 0, i32 0)
 ; CHECK-NEXT:    [[TMP6:%.*]] = add i32 [[DST_OFF]], [[LOOP_INDEX]]
 ; CHECK-NEXT:    call void @llvm.amdgcn.raw.ptr.buffer.store.v4i32(<4 x i32> [[TMP5]], ptr addrspace(8) align 1 [[DST_RSRC]], i32 [[TMP6]], i32 0, i32 0)
 ; CHECK-NEXT:    [[TMP7]] = add i32 [[LOOP_INDEX]], 16
 ; CHECK-NEXT:    [[TMP8:%.*]] = icmp ult i32 [[TMP7]], [[TMP2]]
-; CHECK-NEXT:    br i1 [[TMP8]], label %[[LOOP_MEMCPY_EXPANSION]], label %[[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; CHECK:       [[LOOP_MEMCPY_RESIDUAL:.*]]:
-; CHECK-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, %[[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP13:%.*]], %[[LOOP_MEMCPY_RESIDUAL]] ]
+; CHECK-NEXT:    br i1 [[TMP8]], label %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]], label %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_COND]]
+; CHECK:       [[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_COND]]:
+; CHECK-NEXT:    [[TMP15:%.*]] = icmp ne i32 [[TMP1]], 0
+; CHECK-NEXT:    br i1 [[TMP15]], label %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_BODY:.*]], label %[[DYNAMIC_MEMCPY_POST_LOOP_EXPANSION:.*]]
+; CHECK:       [[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_BODY]]:
+; CHECK-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_COND]] ], [ [[TMP13:%.*]], %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_BODY]] ]
 ; CHECK-NEXT:    [[TMP9:%.*]] = add i32 [[TMP2]], [[RESIDUAL_LOOP_INDEX]]
 ; CHECK-NEXT:    [[TMP10:%.*]] = add i32 [[SRC_OFF]], [[TMP9]]
 ; CHECK-NEXT:    [[TMP11:%.*]] = call i8 @llvm.amdgcn.raw.ptr.buffer.load.i8(ptr addrspace(8) align 1 [[SRC_RSRC]], i32 [[TMP10]], i32 0, i32 0)
@@ -385,12 +388,9 @@ define void @memcpy_unknown(ptr addrspace(7) inreg %src, ptr addrspace(7) inreg
 ; CHECK-NEXT:    call void @llvm.amdgcn.raw.ptr.buffer.store.i8(i8 [[TMP11]], ptr addrspace(8) align 1 [[DST_RSRC]], i32 [[TMP12]], i32 0, i32 0)
 ; CHECK-NEXT:    [[TMP13]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
 ; CHECK-NEXT:    [[TMP14:%.*]] = icmp ult i32 [[TMP13]], [[TMP1]]
-; CHECK-NEXT:    br i1 [[TMP14]], label %[[LOOP_MEMCPY_RESIDUAL]], label %[[POST_LOOP_MEMCPY_EXPANSION:.*]]
-; CHECK:       [[POST_LOOP_MEMCPY_EXPANSION]]:
+; CHECK-NEXT:    br i1 [[TMP14]], label %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_BODY]], label %[[DYNAMIC_MEMCPY_POST_LOOP_EXPANSION]]
+; CHECK:       [[DYNAMIC_MEMCPY_POST_LOOP_EXPANSION]]:
 ; CHECK-NEXT:    ret void
-; CHECK:       [[LOOP_MEMCPY_RESIDUAL_HEADER]]:
-; CHECK-NEXT:    [[TMP15:%.*]] = icmp ne i32 [[TMP1]], 0
-; CHECK-NEXT:    br i1 [[TMP15]], label %[[LOOP_MEMCPY_RESIDUAL]], label %[[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p7.p7.i32(ptr addrspace(7) %dst, ptr addrspace(7) %src, i32 %length, i1 false)
   ret void
@@ -401,9 +401,9 @@ define void @memcpy_known_p1_to_p7(ptr addrspace(1) inreg %src, ptr addrspace(7)
 ; CHECK-SAME: ptr addrspace(1) inreg [[SRC:%.*]], { ptr addrspace(8), i32 } inreg [[DST:%.*]]) #[[ATTR0]] {
 ; CHECK-NEXT:    [[DST_RSRC:%.*]] = extractvalue { ptr addrspace(8), i32 } [[DST]], 0
 ; CHECK-NEXT:    [[DST_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[DST]], 1
-; CHECK-NEXT:    br label %[[LOAD_STORE_LOOP:.*]]
-; CHECK:       [[LOAD_STORE_LOOP]]:
-; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[LOAD_STORE_LOOP]] ]
+; CHECK-NEXT:    br label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY:.*]]
+; CHECK:       [[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]]:
+; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i32 [[LOOP_INDEX]]
 ; CHECK-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 16
 ; CHECK-NEXT:    [[TMP3:%.*]] = add i32 [[DST_OFF]], [[LOOP_INDEX]]
@@ -456,8 +456,8 @@ define void @memcpy_known_p1_to_p7(ptr addrspace(1) inreg %src, ptr addrspace(7)
 ; CHECK-NEXT:    call void @llvm.amdgcn.raw.ptr.buffer.store.v4i32(<4 x i32> [[DOTSLICE_60]], ptr addrspace(8) align 16 [[DST_RSRC]], i32 [[DOTPART_60]], i32 0, i32 0)
 ; CHECK-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; CHECK-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 8192
-; CHECK-NEXT:    br i1 [[TMP5]], label %[[LOAD_STORE_LOOP]], label %[[MEMCPY_SPLIT:.*]]
-; CHECK:       [[MEMCPY_SPLIT]]:
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]], label %[[STATIC_MEMCPY_POST_LOOP_EXPANSION:.*]]
+; CHECK:       [[STATIC_MEMCPY_POST_LOOP_EXPANSION]]:
 ; CHECK-NEXT:    ret void
 ;
   call void @llvm.memcpy.p7.p1.i32(ptr addrspace(7) noundef nonnull align 16 %dst, ptr addrspace(1) noundef nonnull align 16 %src, i32 8192, i1 false)
@@ -469,9 +469,9 @@ define void @memcpy_known_p7_to_p1(ptr addrspace(7) inreg %src, ptr addrspace(1)
 ; CHECK-SAME: { ptr addrspace(8), i32 } inreg [[SRC:%.*]], ptr addrspace(1) inreg [[DST:%.*]]) #[[ATTR0]] {
 ; CHECK-NEXT:    [[SRC_RSRC:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 0
 ; CHECK-NEXT:    [[SRC_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 1
-; CHECK-NEXT:    br label %[[LOAD_STORE_LOOP:.*]]
-; CHECK:       [[LOAD_STORE_LOOP]]:
-; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[LOAD_STORE_LOOP]] ]
+; CHECK-NEXT:    br label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY:.*]]
+; CHECK:       [[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]]:
+; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = add i32 [[SRC_OFF]], [[LOOP_INDEX]]
 ; CHECK-NEXT:    [[DOTOFF_0:%.*]] = call <4 x i32> @llvm.amdgcn.raw.ptr.buffer.load.v4i32(ptr addrspace(8) align 16 [[SRC_RSRC]], i32 [[TMP1]], i32 0, i32 0)
 ; CHECK-NEXT:    [[DOTEXT_0:%.*]] = shufflevector <4 x i32> [[DOTOFF_0]], <4 x i32> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
@@ -540,8 +540,8 @@ define void @memcpy_known_p7_to_p1(ptr addrspace(7) inreg %src, ptr addrspace(1)
 ; CHECK-NEXT:    store <64 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 16
 ; CHECK-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; CHECK-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 8192
-; CHECK-NEXT:    br i1 [[TMP5]], label %[[LOAD_STORE_LOOP]], label %[[MEMCPY_SPLIT:.*]]
-; CHECK:       [[MEMCPY_SPLIT]]:
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]], label %[[STATIC_MEMCPY_POST_LOOP_EXPANSION:.*]]
+; CHECK:       [[STATIC_MEMCPY_POST_LOOP_EXPANSION]]:
 ; CHECK-NEXT:    ret void
 ;
   call void @llvm.memcpy.p1.p7.i32(ptr addrspace(1) noundef nonnull align 16 %dst, ptr addrspace(7) noundef nonnull align 16 %src, i32 8192, i1 false)
@@ -582,9 +582,9 @@ define void @memcpy_known_p7_to_p3_long(ptr addrspace(7) inreg %src, ptr addrspa
 ; CHECK-SAME: { ptr addrspace(8), i32 } inreg [[SRC:%.*]], ptr addrspace(3) inreg [[DST:%.*]]) #[[ATTR0]] {
 ; CHECK-NEXT:    [[SRC_RSRC:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 0
 ; CHECK-NEXT:    [[SRC_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 1
-; CHECK-NEXT:    br label %[[LOAD_STORE_LOOP:.*]]
-; CHECK:       [[LOAD_STORE_LOOP]]:
-; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[LOAD_STORE_LOOP]] ]
+; CHECK-NEXT:    br label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY:.*]]
+; CHECK:       [[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]]:
+; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = add i32 [[SRC_OFF]], [[LOOP_INDEX]]
 ; CHECK-NEXT:    [[DOTOFF_0:%.*]] = call <4 x i32> @llvm.amdgcn.raw.ptr.buffer.load.v4i32(ptr addrspace(8) align 16 [[SRC_RSRC]], i32 [[TMP1]], i32 0, i32 0)
 ; CHECK-NEXT:    [[DOTEXT_0:%.*]] = shufflevector <4 x i32> [[DOTOFF_0]], <4 x i32> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
@@ -653,8 +653,8 @@ define void @memcpy_known_p7_to_p3_long(ptr addrspace(7) inreg %src, ptr addrspa
 ; CHECK-NEXT:    store <64 x i32> [[TMP2]], ptr addrspace(3) [[TMP3]], align 16
 ; CHECK-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; CHECK-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 8192
-; CHECK-NEXT:    br i1 [[TMP5]], label %[[LOAD_STORE_LOOP]], label %[[MEMCPY_SPLIT:.*]]
-; CHECK:       [[MEMCPY_SPLIT]]:
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]], label %[[STATIC_MEMCPY_POST_LOOP_EXPANSION:.*]]
+; CHECK:       [[STATIC_MEMCPY_POST_LOOP_EXPANSION]]:
 ; CHECK-NEXT:    ret void
 ;
   call void @llvm.memcpy.p3.p7.i32(ptr addrspace(3) noundef nonnull align 16 %dst, ptr addrspace(7) noundef nonnull align 16 %src, i32 8192, i1 false)
@@ -676,9 +676,9 @@ define void @memcpy.inline_known(ptr addrspace(7) inreg %src, ptr addrspace(7) i
 ; CHECK-NEXT:    [[DST_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[DST]], 1
 ; CHECK-NEXT:    [[SRC_RSRC:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 0
 ; CHECK-NEXT:    [[SRC_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 1
-; CHECK-NEXT:    br label %[[LOAD_STORE_LOOP:.*]]
-; CHECK:       [[LOAD_STORE_LOOP]]:
-; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[LOAD_STORE_LOOP]] ]
+; CHECK-NEXT:    br label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY:.*]]
+; CHECK:       [[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]]:
+; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = add i32 [[SRC_OFF]], [[LOOP_INDEX]]
 ; CHECK-NEXT:    [[DOTOFF_0:%.*]] = call <4 x i32> @llvm.amdgcn.raw.ptr.buffer.load.v4i32(ptr addrspace(8) align 16 [[SRC_RSRC]], i32 [[TMP1]], i32 0, i32 0)
 ; CHECK-NEXT:    [[DOTEXT_0:%.*]] = shufflevector <4 x i32> [[DOTOFF_0]], <4 x i32> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
@@ -792,8 +792,8 @@ define void @memcpy.inline_known(ptr addrspace(7) inreg %src, ptr addrspace(7) i
 ; CHECK-NEXT:    call void @llvm.amdgcn.raw.ptr.buffer.store.v4i32(<4 x i32> [[DOTOFF_240]], ptr addrspace(8) align 16 [[DST_RSRC]], i32 [[DOTPART_60]], i32 0, i32 0)
 ; CHECK-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; CHECK-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 8192
-; CHECK-NEXT:    br i1 [[TMP5]], label %[[LOAD_STORE_LOOP]], label %[[MEMCPY_SPLIT:.*]]
-; CHECK:       [[MEMCPY_SPLIT]]:
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]], label %[[STATIC_MEMCPY_POST_LOOP_EXPANSION:.*]]
+; CHECK:       [[STATIC_MEMCPY_POST_LOOP_EXPANSION]]:
 ; CHECK-NEXT:    ret void
 ;
   call void @llvm.memcpy.inline.p7.p7.i32(ptr addrspace(7) noundef nonnull align 16 %dst, ptr addrspace(7) noundef nonnull align 16 %src, i32 8192, i1 false)
@@ -868,9 +868,9 @@ define void @memcpy.inline_known_i64(ptr addrspace(7) inreg %src, ptr addrspace(
 ; CHECK-NEXT:    [[DST_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[DST]], 1
 ; CHECK-NEXT:    [[SRC_RSRC:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 0
 ; CHECK-NEXT:    [[SRC_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 1
-; CHECK-NEXT:    br label %[[LOAD_STORE_LOOP:.*]]
-; CHECK:       [[LOAD_STORE_LOOP]]:
-; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[LOAD_STORE_LOOP]] ]
+; CHECK-NEXT:    br label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY:.*]]
+; CHECK:       [[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]]:
+; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]] ]
 ; CHECK-NEXT:    [[LOOP_INDEX_C:%.*]] = trunc i64 [[LOOP_INDEX]] to i32
 ; CHECK-NEXT:    [[TMP1:%.*]] = add i32 [[SRC_OFF]], [[LOOP_INDEX_C]]
 ; CHECK-NEXT:    [[DOTOFF_0:%.*]] = call <4 x i32> @llvm.amdgcn.raw.ptr.buffer.load.v4i32(ptr addrspace(8) align 1 [[SRC_RSRC]], i32 [[TMP1]], i32 0, i32 0)
@@ -986,8 +986,8 @@ define void @memcpy.inline_known_i64(ptr addrspace(7) inreg %src, ptr addrspace(
 ; CHECK-NEXT:    call void @llvm.amdgcn.raw.ptr.buffer.store.v4i32(<4 x i32> [[DOTOFF_240]], ptr addrspace(8) align 1 [[DST_RSRC]], i32 [[DOTPART_60]], i32 0, i32 0)
 ; CHECK-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; CHECK-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 8192
-; CHECK-NEXT:    br i1 [[TMP5]], label %[[LOAD_STORE_LOOP]], label %[[MEMCPY_SPLIT:.*]]
-; CHECK:       [[MEMCPY_SPLIT]]:
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]], label %[[STATIC_MEMCPY_POST_LOOP_EXPANSION:.*]]
+; CHECK:       [[STATIC_MEMCPY_POST_LOOP_EXPANSION]]:
 ; CHECK-NEXT:    ret void
 ;
   call void @llvm.memcpy.inline.p7.p7.i64(ptr addrspace(7) %dst, ptr addrspace(7) %src, i64 8192, i1 false)
@@ -1023,18 +1023,21 @@ define void @memcpy.inline_unknown(ptr addrspace(7) inreg %src, ptr addrspace(7)
 ; CHECK-NEXT:    [[TMP1:%.*]] = and i32 [[LENGTH]], 15
 ; CHECK-NEXT:    [[TMP2:%.*]] = sub i32 [[LENGTH]], [[TMP1]]
 ; CHECK-NEXT:    [[TMP3:%.*]] = icmp ne i32 [[TMP2]], 0
-; CHECK-NEXT:    br i1 [[TMP3]], label %[[LOOP_MEMCPY_EXPANSION:.*]], label %[[LOOP_MEMCPY_RESIDUAL_HEADER:.*]]
-; CHECK:       [[LOOP_MEMCPY_EXPANSION]]:
-; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP7:%.*]], %[[LOOP_MEMCPY_EXPANSION]] ]
+; CHECK-NEXT:    br i1 [[TMP3]], label %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY:.*]], label %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_COND:.*]]
+; CHECK:       [[DYNAMIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]]:
+; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP7:%.*]], %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]] ]
 ; CHECK-NEXT:    [[TMP4:%.*]] = add i32 [[SRC_OFF]], [[LOOP_INDEX]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call <4 x i32> @llvm.amdgcn.raw.ptr.buffer.load.v4i32(ptr addrspace(8) align 1 [[SRC_RSRC]], i32 [[TMP4]], i32 0, i32 0)
 ; CHECK-NEXT:    [[TMP6:%.*]] = add i32 [[DST_OFF]], [[LOOP_INDEX]]
 ; CHECK-NEXT:    call void @llvm.amdgcn.raw.ptr.buffer.store.v4i32(<4 x i32> [[TMP5]], ptr addrspace(8) align 1 [[DST_RSRC]], i32 [[TMP6]], i32 0, i32 0)
 ; CHECK-NEXT:    [[TMP7]] = add i32 [[LOOP_INDEX]], 16
 ; CHECK-NEXT:    [[TMP8:%.*]] = icmp ult i32 [[TMP7]], [[TMP2]]
-; CHECK-NEXT:    br i1 [[TMP8]], label %[[LOOP_MEMCPY_EXPANSION]], label %[[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; CHECK:       [[LOOP_MEMCPY_RESIDUAL:.*]]:
-; CHECK-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, %[[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP13:%.*]], %[[LOOP_MEMCPY_RESIDUAL]] ]
+; CHECK-NEXT:    br i1 [[TMP8]], label %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]], label %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_COND]]
+; CHECK:       [[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_COND]]:
+; CHECK-NEXT:    [[TMP15:%.*]] = icmp ne i32 [[TMP1]], 0
+; CHECK-NEXT:    br i1 [[TMP15]], label %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_BODY:.*]], label %[[DYNAMIC_MEMCPY_POST_LOOP_EXPANSION:.*]]
+; CHECK:       [[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_BODY]]:
+; CHECK-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_COND]] ], [ [[TMP13:%.*]], %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_BODY]] ]
 ; CHECK-NEXT:    [[TMP9:%.*]] = add i32 [[TMP2]], [[RESIDUAL_LOOP_INDEX]]
 ; CHECK-NEXT:    [[TMP10:%.*]] = add i32 [[SRC_OFF]], [[TMP9]]
 ; CHECK-NEXT:    [[TMP11:%.*]] = call i8 @llvm.amdgcn.raw.ptr.buffer.load.i8(ptr addrspace(8) align 1 [[SRC_RSRC]], i32 [[TMP10]], i32 0, i32 0)
@@ -1042,12 +1045,9 @@ define void @memcpy.inline_unknown(ptr addrspace(7) inreg %src, ptr addrspace(7)
 ; CHECK-NEXT:    call void @llvm.amdgcn.raw.ptr.buffer.store.i8(i8 [[TMP11]], ptr addrspace(8) align 1 [[DST_RSRC]], i32 [[TMP12]], i32 0, i32 0)
 ; CHECK-NEXT:    [[TMP13]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
 ; CHECK-NEXT:    [[TMP14:%.*]] = icmp ult i32 [[TMP13]], [[TMP1]]
-; CHECK-NEXT:    br i1 [[TMP14]], label %[[LOOP_MEMCPY_RESIDUAL]], label %[[POST_LOOP_MEMCPY_EXPANSION:.*]]
-; CHECK:       [[POST_LOOP_MEMCPY_EXPANSION]]:
+; CHECK-NEXT:    br i1 [[TMP14]], label %[[DYNAMIC_MEMCPY_LOOP_EXPANSION_RESIDUAL_BODY]], label %[[DYNAMIC_MEMCPY_POST_LOOP_EXPANSION]]
+; CHECK:       [[DYNAMIC_MEMCPY_POST_LOOP_EXPANSION]]:
 ; CHECK-NEXT:    ret void
-; CHECK:       [[LOOP_MEMCPY_RESIDUAL_HEADER]]:
-; CHECK-NEXT:    [[TMP15:%.*]] = icmp ne i32 [[TMP1]], 0
-; CHECK-NEXT:    br i1 [[TMP15]], label %[[LOOP_MEMCPY_RESIDUAL]], label %[[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.inline.p7.p7.i32(ptr addrspace(7) %dst, ptr addrspace(7) %src, i32 %length, i1 false)
   ret void
@@ -1058,9 +1058,9 @@ define void @memcpy.inline_known_p1_to_p7(ptr addrspace(1) inreg %src, ptr addrs
 ; CHECK-SAME: ptr addrspace(1) inreg [[SRC:%.*]], { ptr addrspace(8), i32 } inreg [[DST:%.*]]) #[[ATTR0]] {
 ; CHECK-NEXT:    [[DST_RSRC:%.*]] = extractvalue { ptr addrspace(8), i32 } [[DST]], 0
 ; CHECK-NEXT:    [[DST_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[DST]], 1
-; CHECK-NEXT:    br label %[[LOAD_STORE_LOOP:.*]]
-; CHECK:       [[LOAD_STORE_LOOP]]:
-; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[LOAD_STORE_LOOP]] ]
+; CHECK-NEXT:    br label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY:.*]]
+; CHECK:       [[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]]:
+; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i32 [[LOOP_INDEX]]
 ; CHECK-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 16
 ; CHECK-NEXT:    [[TMP3:%.*]] = add i32 [[DST_OFF]], [[LOOP_INDEX]]
@@ -1113,8 +1113,8 @@ define void @memcpy.inline_known_p1_to_p7(ptr addrspace(1) inreg %src, ptr addrs
 ; CHECK-NEXT:    call void @llvm.amdgcn.raw.ptr.buffer.store.v4i32(<4 x i32> [[DOTSLICE_60]], ptr addrspace(8) align 16 [[DST_RSRC]], i32 [[DOTPART_60]], i32 0, i32 0)
 ; CHECK-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; CHECK-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 8192
-; CHECK-NEXT:    br i1 [[TMP5]], label %[[LOAD_STORE_LOOP]], label %[[MEMCPY_SPLIT:.*]]
-; CHECK:       [[MEMCPY_SPLIT]]:
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]], label %[[STATIC_MEMCPY_POST_LOOP_EXPANSION:.*]]
+; CHECK:       [[STATIC_MEMCPY_POST_LOOP_EXPANSION]]:
 ; CHECK-NEXT:    ret void
 ;
   call void @llvm.memcpy.inline.p7.p1.i32(ptr addrspace(7) noundef nonnull align 16 %dst, ptr addrspace(1) noundef nonnull align 16 %src, i32 8192, i1 false)
@@ -1126,9 +1126,9 @@ define void @memcpy.inline_known_p7_to_p1(ptr addrspace(7) inreg %src, ptr addrs
 ; CHECK-SAME: { ptr addrspace(8), i32 } inreg [[SRC:%.*]], ptr addrspace(1) inreg [[DST:%.*]]) #[[ATTR0]] {
 ; CHECK-NEXT:    [[SRC_RSRC:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 0
 ; CHECK-NEXT:    [[SRC_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 1
-; CHECK-NEXT:    br label %[[LOAD_STORE_LOOP:.*]]
-; CHECK:       [[LOAD_STORE_LOOP]]:
-; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[LOAD_STORE_LOOP]] ]
+; CHECK-NEXT:    br label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY:.*]]
+; CHECK:       [[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]]:
+; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = add i32 [[SRC_OFF]], [[LOOP_INDEX]]
 ; CHECK-NEXT:    [[DOTOFF_0:%.*]] = call <4 x i32> @llvm.amdgcn.raw.ptr.buffer.load.v4i32(ptr addrspace(8) align 16 [[SRC_RSRC]], i32 [[TMP1]], i32 0, i32 0)
 ; CHECK-NEXT:    [[DOTEXT_0:%.*]] = shufflevector <4 x i32> [[DOTOFF_0]], <4 x i32> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
@@ -1197,8 +1197,8 @@ define void @memcpy.inline_known_p7_to_p1(ptr addrspace(7) inreg %src, ptr addrs
 ; CHECK-NEXT:    store <64 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 16
 ; CHECK-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; CHECK-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 8192
-; CHECK-NEXT:    br i1 [[TMP5]], label %[[LOAD_STORE_LOOP]], label %[[MEMCPY_SPLIT:.*]]
-; CHECK:       [[MEMCPY_SPLIT]]:
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]], label %[[STATIC_MEMCPY_POST_LOOP_EXPANSION:.*]]
+; CHECK:       [[STATIC_MEMCPY_POST_LOOP_EXPANSION]]:
 ; CHECK-NEXT:    ret void
 ;
   call void @llvm.memcpy.inline.p1.p7.i32(ptr addrspace(1) noundef nonnull align 16 %dst, ptr addrspace(7) noundef nonnull align 16 %src, i32 8192, i1 false)
@@ -1239,9 +1239,9 @@ define void @memcpy.inline_known_p7_to_p3_long(ptr addrspace(7) inreg %src, ptr
 ; CHECK-SAME: { ptr addrspace(8), i32 } inreg [[SRC:%.*]], ptr addrspace(3) inreg [[DST:%.*]]) #[[ATTR0]] {
 ; CHECK-NEXT:    [[SRC_RSRC:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 0
 ; CHECK-NEXT:    [[SRC_OFF:%.*]] = extractvalue { ptr addrspace(8), i32 } [[SRC]], 1
-; CHECK-NEXT:    br label %[[LOAD_STORE_LOOP:.*]]
-; CHECK:       [[LOAD_STORE_LOOP]]:
-; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[LOAD_STORE_LOOP]] ]
+; CHECK-NEXT:    br label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY:.*]]
+; CHECK:       [[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]]:
+; CHECK-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = add i32 [[SRC_OFF]], [[LOOP_INDEX]]
 ; CHECK-NEXT:    [[DOTOFF_0:%.*]] = call <4 x i32> @llvm.amdgcn.raw.ptr.buffer.load.v4i32(ptr addrspace(8) align 16 [[SRC_RSRC]], i32 [[TMP1]], i32 0, i32 0)
 ; CHECK-NEXT:    [[DOTEXT_0:%.*]] = shufflevector <4 x i32> [[DOTOFF_0]], <4 x i32> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
@@ -1310,8 +1310,8 @@ define void @memcpy.inline_known_p7_to_p3_long(ptr addrspace(7) inreg %src, ptr
 ; CHECK-NEXT:    store <64 x i32> [[TMP2]], ptr addrspace(3) [[TMP3]], align 16
 ; CHECK-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; CHECK-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 8192
-; CHECK-NEXT:    br i1 [[TMP5]], label %[[LOAD_STORE_LOOP]], label %[[MEMCPY_SPLIT:.*]]
-; CHECK:       [[MEMCPY_SPLIT]]:
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[STATIC_MEMCPY_LOOP_EXPANSION_MAIN_BODY]], label %[[STATIC_MEMCPY_POST_LOOP_EXPANSION:.*]]
+; CHECK:       [[STATIC_MEMCPY_POST_LOOP_EXPANSION]]:
 ; CHECK-NEXT:    ret void
 ;
   call void @llvm.memcpy.inline.p3.p7.i32(ptr addrspace(3) noundef nonnull align 16 %dst, ptr addrspace(7) noundef nonnull align 16 %src, i32 8192, i1 false)
diff --git a/llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll b/llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll
index 5a9f53ec0077d..20a34dc997bbc 100644
--- a/llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll
+++ b/llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll
@@ -43,7 +43,7 @@ define amdgpu_kernel void @max_size_small_static_memcpy_caller0(ptr addrspace(1)
 ;
 ; ALL-LABEL: @max_size_small_static_memcpy_caller0(
 ; ALL-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; ALL:       load-store-loop:
+; ALL:       static-memcpy-expansion-main-body:
 ; ALL-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; ALL-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; ALL-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 1
@@ -52,7 +52,7 @@ define amdgpu_kernel void @max_size_small_static_memcpy_caller0(ptr addrspace(1)
 ; ALL-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; ALL-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; ALL-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; ALL:       memcpy-split:
+; ALL:       static-memcpy-post-expansion:
 ; ALL-NEXT:    ret void
 ;
   call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst, ptr addrspace(1) %src, i64 1024, i1 false)
@@ -63,7 +63,7 @@ define amdgpu_kernel void @max_size_small_static_memcpy_caller0(ptr addrspace(1)
 define amdgpu_kernel void @min_size_large_static_memcpy_caller0(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
 ; OPT-LABEL: @min_size_large_static_memcpy_caller0(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 1
@@ -72,7 +72,7 @@ define amdgpu_kernel void @min_size_large_static_memcpy_caller0(ptr addrspace(1)
 ; OPT-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
 ; OPT-NEXT:    [[TMP7:%.*]] = load i8, ptr addrspace(1) [[TMP6]], align 1
 ; OPT-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST]], i64 1024
@@ -201,7 +201,7 @@ define amdgpu_kernel void @variable_memcpy_caller0(ptr addrspace(1) %dst, ptr ad
 ; OPT-NEXT:    [[TMP3:%.*]] = sub i64 [[N]], [[TMP2]]
 ; OPT-NEXT:    [[TMP4:%.*]] = icmp ne i64 [[TMP3]], 0
 ; OPT-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; OPT:       loop-memcpy-expansion:
+; OPT:       dynamic-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; OPT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP5]], align 1
@@ -210,8 +210,11 @@ define amdgpu_kernel void @variable_memcpy_caller0(ptr addrspace(1) %dst, ptr ad
 ; OPT-NEXT:    [[TMP8]] = add i64 [[LOOP_INDEX]], 16
 ; OPT-NEXT:    [[TMP9:%.*]] = icmp ult i64 [[TMP8]], [[TMP3]]
 ; OPT-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; OPT:       loop-memcpy-residual:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond:
+; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
+; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; OPT-NEXT:    [[TMP10:%.*]] = add i64 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 [[TMP10]]
 ; OPT-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(1) [[TMP11]], align 1
@@ -219,12 +222,9 @@ define amdgpu_kernel void @variable_memcpy_caller0(ptr addrspace(1) %dst, ptr ad
 ; OPT-NEXT:    store i8 [[TMP12]], ptr addrspace(1) [[TMP13]], align 1
 ; OPT-NEXT:    [[TMP14]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
 ; OPT-NEXT:    [[TMP15:%.*]] = icmp ult i64 [[TMP14]], [[TMP2]]
-; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; OPT:       post-loop-memcpy-expansion:
+; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT:       dynamic-memcpy-post-expansion:
 ; OPT-NEXT:    ret void
-; OPT:       loop-memcpy-residual-header:
-; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
-; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst, ptr addrspace(1) %src, i64 %n, i1 false)
   ret void
@@ -236,7 +236,7 @@ define amdgpu_kernel void @variable_memcpy_caller1(ptr addrspace(1) %dst, ptr ad
 ; OPT-NEXT:    [[TMP3:%.*]] = sub i64 [[N]], [[TMP2]]
 ; OPT-NEXT:    [[TMP4:%.*]] = icmp ne i64 [[TMP3]], 0
 ; OPT-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; OPT:       loop-memcpy-expansion:
+; OPT:       dynamic-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; OPT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP5]], align 1
@@ -245,8 +245,11 @@ define amdgpu_kernel void @variable_memcpy_caller1(ptr addrspace(1) %dst, ptr ad
 ; OPT-NEXT:    [[TMP8]] = add i64 [[LOOP_INDEX]], 16
 ; OPT-NEXT:    [[TMP9:%.*]] = icmp ult i64 [[TMP8]], [[TMP3]]
 ; OPT-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; OPT:       loop-memcpy-residual:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond:
+; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
+; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; OPT-NEXT:    [[TMP10:%.*]] = add i64 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 [[TMP10]]
 ; OPT-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(1) [[TMP11]], align 1
@@ -254,12 +257,9 @@ define amdgpu_kernel void @variable_memcpy_caller1(ptr addrspace(1) %dst, ptr ad
 ; OPT-NEXT:    store i8 [[TMP12]], ptr addrspace(1) [[TMP13]], align 1
 ; OPT-NEXT:    [[TMP14]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
 ; OPT-NEXT:    [[TMP15:%.*]] = icmp ult i64 [[TMP14]], [[TMP2]]
-; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; OPT:       post-loop-memcpy-expansion:
+; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT:       dynamic-memcpy-post-expansion:
 ; OPT-NEXT:    ret void
-; OPT:       loop-memcpy-residual-header:
-; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
-; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst, ptr addrspace(1) %src, i64 %n, i1 false)
   ret void
@@ -271,7 +271,7 @@ define amdgpu_kernel void @memcpy_multi_use_one_function(ptr addrspace(1) %dst0,
 ; OPT-NEXT:    [[TMP3:%.*]] = sub i64 [[N]], [[TMP2]]
 ; OPT-NEXT:    [[TMP4:%.*]] = icmp ne i64 [[TMP3]], 0
 ; OPT-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION2:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER5:%.*]]
-; OPT:       loop-memcpy-expansion2:
+; OPT:       dynamic-memcpy-expansion-main-body2:
 ; OPT-NEXT:    [[LOOP_INDEX3:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION2]] ]
 ; OPT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX3]]
 ; OPT-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP5]], align 1
@@ -280,8 +280,11 @@ define amdgpu_kernel void @memcpy_multi_use_one_function(ptr addrspace(1) %dst0,
 ; OPT-NEXT:    [[TMP8]] = add i64 [[LOOP_INDEX3]], 16
 ; OPT-NEXT:    [[TMP9:%.*]] = icmp ult i64 [[TMP8]], [[TMP3]]
 ; OPT-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION2]], label [[LOOP_MEMCPY_RESIDUAL_HEADER5]]
-; OPT:       loop-memcpy-residual4:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX6:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER5]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL4:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond5:
+; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
+; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL4:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION1:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body4:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX6:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER5]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL4]] ]
 ; OPT-NEXT:    [[TMP10:%.*]] = add i64 [[TMP3]], [[RESIDUAL_LOOP_INDEX6]]
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 [[TMP10]]
 ; OPT-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(1) [[TMP11]], align 1
@@ -289,13 +292,13 @@ define amdgpu_kernel void @memcpy_multi_use_one_function(ptr addrspace(1) %dst0,
 ; OPT-NEXT:    store i8 [[TMP12]], ptr addrspace(1) [[TMP13]], align 1
 ; OPT-NEXT:    [[TMP14]] = add i64 [[RESIDUAL_LOOP_INDEX6]], 1
 ; OPT-NEXT:    [[TMP15:%.*]] = icmp ult i64 [[TMP14]], [[TMP2]]
-; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL4]], label [[POST_LOOP_MEMCPY_EXPANSION1:%.*]]
-; OPT:       post-loop-memcpy-expansion1:
+; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL4]], label [[POST_LOOP_MEMCPY_EXPANSION1]]
+; OPT:       dynamic-memcpy-post-expansion1:
 ; OPT-NEXT:    [[TMP17:%.*]] = and i64 [[M:%.*]], 15
 ; OPT-NEXT:    [[TMP18:%.*]] = sub i64 [[M]], [[TMP17]]
 ; OPT-NEXT:    [[TMP19:%.*]] = icmp ne i64 [[TMP18]], 0
 ; OPT-NEXT:    br i1 [[TMP19]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; OPT:       loop-memcpy-expansion:
+; OPT:       dynamic-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[POST_LOOP_MEMCPY_EXPANSION1]] ], [ [[TMP23:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; OPT-NEXT:    [[TMP20:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP21:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP20]], align 1
@@ -304,8 +307,11 @@ define amdgpu_kernel void @memcpy_multi_use_one_function(ptr addrspace(1) %dst0,
 ; OPT-NEXT:    [[TMP23]] = add i64 [[LOOP_INDEX]], 16
 ; OPT-NEXT:    [[TMP24:%.*]] = icmp ult i64 [[TMP23]], [[TMP18]]
 ; OPT-NEXT:    br i1 [[TMP24]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; OPT:       loop-memcpy-residual:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP29:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond:
+; OPT-NEXT:    [[TMP31:%.*]] = icmp ne i64 [[TMP17]], 0
+; OPT-NEXT:    br i1 [[TMP31]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP29:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; OPT-NEXT:    [[TMP25:%.*]] = add i64 [[TMP18]], [[RESIDUAL_LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP26:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 [[TMP25]]
 ; OPT-NEXT:    [[TMP27:%.*]] = load i8, ptr addrspace(1) [[TMP26]], align 1
@@ -313,15 +319,9 @@ define amdgpu_kernel void @memcpy_multi_use_one_function(ptr addrspace(1) %dst0,
 ; OPT-NEXT:    store i8 [[TMP27]], ptr addrspace(1) [[TMP28]], align 1
 ; OPT-NEXT:    [[TMP29]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
 ; OPT-NEXT:    [[TMP30:%.*]] = icmp ult i64 [[TMP29]], [[TMP17]]
-; OPT-NEXT:    br i1 [[TMP30]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; OPT:       post-loop-memcpy-expansion:
+; OPT-NEXT:    br i1 [[TMP30]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT:       dynamic-memcpy-post-expansion:
 ; OPT-NEXT:    ret void
-; OPT:       loop-memcpy-residual-header:
-; OPT-NEXT:    [[TMP31:%.*]] = icmp ne i64 [[TMP17]], 0
-; OPT-NEXT:    br i1 [[TMP31]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
-; OPT:       loop-memcpy-residual-header5:
-; OPT-NEXT:    [[TMP32:%.*]] = icmp ne i64 [[TMP2]], 0
-; OPT-NEXT:    br i1 [[TMP32]], label [[LOOP_MEMCPY_RESIDUAL4]], label [[POST_LOOP_MEMCPY_EXPANSION1]]
 ;
   call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst0, ptr addrspace(1) %src, i64 %n, i1 false)
   call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst1, ptr addrspace(1) %src, i64 %m, i1 false)
@@ -334,7 +334,7 @@ define amdgpu_kernel void @memcpy_alt_type(ptr addrspace(1) %dst, ptr addrspace(
 ; OPT-NEXT:    [[TMP3:%.*]] = sub i32 [[N]], [[TMP2]]
 ; OPT-NEXT:    [[TMP4:%.*]] = icmp ne i32 [[TMP3]], 0
 ; OPT-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; OPT:       loop-memcpy-expansion:
+; OPT:       dynamic-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; OPT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(3) [[TMP5]], align 1
@@ -343,8 +343,11 @@ define amdgpu_kernel void @memcpy_alt_type(ptr addrspace(1) %dst, ptr addrspace(
 ; OPT-NEXT:    [[TMP8]] = add i32 [[LOOP_INDEX]], 16
 ; OPT-NEXT:    [[TMP9:%.*]] = icmp ult i32 [[TMP8]], [[TMP3]]
 ; OPT-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; OPT:       loop-memcpy-residual:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond:
+; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
+; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; OPT-NEXT:    [[TMP10:%.*]] = add i32 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC]], i32 [[TMP10]]
 ; OPT-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(3) [[TMP11]], align 1
@@ -352,12 +355,9 @@ define amdgpu_kernel void @memcpy_alt_type(ptr addrspace(1) %dst, ptr addrspace(
 ; OPT-NEXT:    store i8 [[TMP12]], ptr addrspace(1) [[TMP13]], align 1
 ; OPT-NEXT:    [[TMP14]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
 ; OPT-NEXT:    [[TMP15:%.*]] = icmp ult i32 [[TMP14]], [[TMP2]]
-; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; OPT:       post-loop-memcpy-expansion:
+; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT:       dynamic-memcpy-post-expansion:
 ; OPT-NEXT:    ret void
-; OPT:       loop-memcpy-residual-header:
-; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
-; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p1.p3.i32(ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 %n, i1 false)
   ret void
@@ -370,7 +370,7 @@ define amdgpu_kernel void @memcpy_multi_use_one_function_keep_small(ptr addrspac
 ; MAX1024-NEXT:    [[TMP3:%.*]] = sub i64 [[N]], [[TMP2]]
 ; MAX1024-NEXT:    [[TMP4:%.*]] = icmp ne i64 [[TMP3]], 0
 ; MAX1024-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; MAX1024:       loop-memcpy-expansion:
+; MAX1024:       dynamic-memcpy-expansion-main-body:
 ; MAX1024-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; MAX1024-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; MAX1024-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP5]], align 1
@@ -379,8 +379,11 @@ define amdgpu_kernel void @memcpy_multi_use_one_function_keep_small(ptr addrspac
 ; MAX1024-NEXT:    [[TMP8]] = add i64 [[LOOP_INDEX]], 16
 ; MAX1024-NEXT:    [[TMP9:%.*]] = icmp ult i64 [[TMP8]], [[TMP3]]
 ; MAX1024-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; MAX1024:       loop-memcpy-residual:
-; MAX1024-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; MAX1024:       dynamic-memcpy-expansion-residual-cond:
+; MAX1024-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
+; MAX1024-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; MAX1024:       dynamic-memcpy-expansion-residual-body:
+; MAX1024-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; MAX1024-NEXT:    [[TMP10:%.*]] = add i64 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; MAX1024-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 [[TMP10]]
 ; MAX1024-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(1) [[TMP11]], align 1
@@ -388,20 +391,17 @@ define amdgpu_kernel void @memcpy_multi_use_one_function_keep_small(ptr addrspac
 ; MAX1024-NEXT:    store i8 [[TMP12]], ptr addrspace(1) [[TMP13]], align 1
 ; MAX1024-NEXT:    [[TMP14]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
 ; MAX1024-NEXT:    [[TMP15:%.*]] = icmp ult i64 [[TMP14]], [[TMP2]]
-; MAX1024-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; MAX1024:       post-loop-memcpy-expansion:
+; MAX1024-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; MAX1024:       dynamic-memcpy-post-expansion:
 ; MAX1024-NEXT:    call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) [[DST1:%.*]], ptr addrspace(1) [[SRC]], i64 102, i1 false)
 ; MAX1024-NEXT:    ret void
-; MAX1024:       loop-memcpy-residual-header:
-; MAX1024-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
-; MAX1024-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
 ; ALL-LABEL: @memcpy_multi_use_one_function_keep_small(
 ; ALL-NEXT:    [[TMP2:%.*]] = and i64 [[N:%.*]], 15
 ; ALL-NEXT:    [[TMP3:%.*]] = sub i64 [[N]], [[TMP2]]
 ; ALL-NEXT:    [[TMP4:%.*]] = icmp ne i64 [[TMP3]], 0
 ; ALL-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; ALL:       loop-memcpy-expansion:
+; ALL:       dynamic-memcpy-expansion-main-body:
 ; ALL-NEXT:    [[LOOP_INDEX1:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; ALL-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX1]]
 ; ALL-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP5]], align 1
@@ -410,8 +410,11 @@ define amdgpu_kernel void @memcpy_multi_use_one_function_keep_small(ptr addrspac
 ; ALL-NEXT:    [[TMP8]] = add i64 [[LOOP_INDEX1]], 16
 ; ALL-NEXT:    [[TMP9:%.*]] = icmp ult i64 [[TMP8]], [[TMP3]]
 ; ALL-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; ALL:       loop-memcpy-residual:
-; ALL-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; ALL:       dynamic-memcpy-expansion-residual-cond:
+; ALL-NEXT:    [[TMP27:%.*]] = icmp ne i64 [[TMP2]], 0
+; ALL-NEXT:    br i1 [[TMP27]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; ALL:       dynamic-memcpy-expansion-residual-body:
+; ALL-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; ALL-NEXT:    [[TMP10:%.*]] = add i64 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; ALL-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 [[TMP10]]
 ; ALL-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(1) [[TMP11]], align 1
@@ -419,8 +422,8 @@ define amdgpu_kernel void @memcpy_multi_use_one_function_keep_small(ptr addrspac
 ; ALL-NEXT:    store i8 [[TMP12]], ptr addrspace(1) [[TMP13]], align 1
 ; ALL-NEXT:    [[TMP14]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
 ; ALL-NEXT:    [[TMP15:%.*]] = icmp ult i64 [[TMP14]], [[TMP2]]
-; ALL-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; ALL:       post-loop-memcpy-expansion:
+; ALL-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; ALL:       dynamic-memcpy-post-expansion:
 ; ALL-NEXT:    [[TMP16:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 0
 ; ALL-NEXT:    [[TMP17:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP16]], align 1
 ; ALL-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST1:%.*]], i64 0
@@ -454,9 +457,6 @@ define amdgpu_kernel void @memcpy_multi_use_one_function_keep_small(ptr addrspac
 ; ALL-NEXT:    [[TMP26:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST1]], i64 100
 ; ALL-NEXT:    store i16 [[TMP25]], ptr addrspace(1) [[TMP26]], align 1
 ; ALL-NEXT:    ret void
-; ALL:       loop-memcpy-residual-header:
-; ALL-NEXT:    [[TMP27:%.*]] = icmp ne i64 [[TMP2]], 0
-; ALL-NEXT:    br i1 [[TMP27]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst0, ptr addrspace(1) %src, i64 %n, i1 false)
   call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst1, ptr addrspace(1) %src, i64 102, i1 false)
@@ -466,7 +466,7 @@ define amdgpu_kernel void @memcpy_multi_use_one_function_keep_small(ptr addrspac
 define amdgpu_kernel void @memcpy_global_align4_global_align4_1028(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
 ; OPT-LABEL: @memcpy_global_align4_global_align4_1028(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
@@ -475,7 +475,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1028(ptr addrspace
 ; OPT-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
 ; OPT-NEXT:    [[TMP7:%.*]] = load i32, ptr addrspace(1) [[TMP6]], align 4
 ; OPT-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST]], i64 1024
@@ -489,7 +489,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1028(ptr addrspace
 define amdgpu_kernel void @memcpy_global_align4_global_align4_1025(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
 ; OPT-LABEL: @memcpy_global_align4_global_align4_1025(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
@@ -498,7 +498,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1025(ptr addrspace
 ; OPT-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
 ; OPT-NEXT:    [[TMP7:%.*]] = load i8, ptr addrspace(1) [[TMP6]], align 4
 ; OPT-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST]], i64 1024
@@ -512,7 +512,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1025(ptr addrspace
 define amdgpu_kernel void @memcpy_global_align4_global_align4_1026(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
 ; OPT-LABEL: @memcpy_global_align4_global_align4_1026(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
@@ -521,7 +521,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1026(ptr addrspace
 ; OPT-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
 ; OPT-NEXT:    [[TMP7:%.*]] = load i16, ptr addrspace(1) [[TMP6]], align 4
 ; OPT-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST]], i64 1024
@@ -535,7 +535,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1026(ptr addrspace
 define amdgpu_kernel void @memcpy_global_align4_global_align4_1032(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
 ; OPT-LABEL: @memcpy_global_align4_global_align4_1032(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
@@ -544,7 +544,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1032(ptr addrspace
 ; OPT-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
 ; OPT-NEXT:    [[TMP7:%.*]] = load i64, ptr addrspace(1) [[TMP6]], align 4
 ; OPT-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST]], i64 1024
@@ -558,7 +558,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1032(ptr addrspace
 define amdgpu_kernel void @memcpy_global_align4_global_align4_1034(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
 ; OPT-LABEL: @memcpy_global_align4_global_align4_1034(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
@@ -567,7 +567,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1034(ptr addrspace
 ; OPT-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
 ; OPT-NEXT:    [[TMP7:%.*]] = load i64, ptr addrspace(1) [[TMP6]], align 4
 ; OPT-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST]], i64 1024
@@ -585,7 +585,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1034(ptr addrspace
 define amdgpu_kernel void @memcpy_global_align4_global_align4_1035(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
 ; OPT-LABEL: @memcpy_global_align4_global_align4_1035(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
@@ -594,7 +594,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1035(ptr addrspace
 ; OPT-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
 ; OPT-NEXT:    [[TMP7:%.*]] = load i64, ptr addrspace(1) [[TMP6]], align 4
 ; OPT-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST]], i64 1024
@@ -616,7 +616,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1035(ptr addrspace
 define amdgpu_kernel void @memcpy_global_align4_global_align4_1036(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
 ; OPT-LABEL: @memcpy_global_align4_global_align4_1036(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
@@ -625,7 +625,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1036(ptr addrspace
 ; OPT-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
 ; OPT-NEXT:    [[TMP7:%.*]] = load i64, ptr addrspace(1) [[TMP6]], align 4
 ; OPT-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST]], i64 1024
@@ -643,7 +643,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1036(ptr addrspace
 define amdgpu_kernel void @memcpy_global_align4_global_align4_1039(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
 ; OPT-LABEL: @memcpy_global_align4_global_align4_1039(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
@@ -652,7 +652,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1039(ptr addrspace
 ; OPT-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
 ; OPT-NEXT:    [[TMP7:%.*]] = load i64, ptr addrspace(1) [[TMP6]], align 4
 ; OPT-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST]], i64 1024
@@ -678,7 +678,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1039(ptr addrspace
 define amdgpu_kernel void @memcpy_global_align2_global_align2_1039(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
 ; OPT-LABEL: @memcpy_global_align2_global_align2_1039(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 2
@@ -687,7 +687,7 @@ define amdgpu_kernel void @memcpy_global_align2_global_align2_1039(ptr addrspace
 ; OPT-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
 ; OPT-NEXT:    [[TMP16:%.*]] = load i64, ptr addrspace(1) [[TMP15]], align 2
 ; OPT-NEXT:    [[TMP17:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST]], i64 1024
@@ -713,7 +713,7 @@ define amdgpu_kernel void @memcpy_global_align2_global_align2_1039(ptr addrspace
 define amdgpu_kernel void @memcpy_global_align4_global_align4_1027(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
 ; OPT-LABEL: @memcpy_global_align4_global_align4_1027(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
@@ -722,7 +722,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1027(ptr addrspace
 ; OPT-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
 ; OPT-NEXT:    [[TMP7:%.*]] = load i16, ptr addrspace(1) [[TMP6]], align 4
 ; OPT-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST]], i64 1024
@@ -740,7 +740,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1027(ptr addrspace
 define amdgpu_kernel void @memcpy_global_align2_global_align4_1027(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
 ; OPT-LABEL: @memcpy_global_align2_global_align4_1027(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
@@ -749,7 +749,7 @@ define amdgpu_kernel void @memcpy_global_align2_global_align4_1027(ptr addrspace
 ; OPT-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
 ; OPT-NEXT:    [[TMP10:%.*]] = load i16, ptr addrspace(1) [[TMP9]], align 4
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST]], i64 1024
@@ -767,7 +767,7 @@ define amdgpu_kernel void @memcpy_global_align2_global_align4_1027(ptr addrspace
 define amdgpu_kernel void @memcpy_global_align4_global_align2_1027(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
 ; OPT-LABEL: @memcpy_global_align4_global_align2_1027(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 2
@@ -776,7 +776,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align2_1027(ptr addrspace
 ; OPT-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
 ; OPT-NEXT:    [[TMP10:%.*]] = load i16, ptr addrspace(1) [[TMP9]], align 2
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[DST]], i64 1024
@@ -794,7 +794,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align2_1027(ptr addrspace
 define amdgpu_kernel void @memcpy_private_align4_private_align4_1027(ptr addrspace(5) %dst, ptr addrspace(5) %src) #0 {
 ; OPT-LABEL: @memcpy_private_align4_private_align4_1027(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(5) [[TMP1]], align 4
@@ -803,7 +803,7 @@ define amdgpu_kernel void @memcpy_private_align4_private_align4_1027(ptr addrspa
 ; OPT-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC]], i32 1024
 ; OPT-NEXT:    [[TMP7:%.*]] = load i16, ptr addrspace(5) [[TMP6]], align 4
 ; OPT-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[DST]], i32 1024
@@ -821,7 +821,7 @@ define amdgpu_kernel void @memcpy_private_align4_private_align4_1027(ptr addrspa
 define amdgpu_kernel void @memcpy_private_align2_private_align4_1027(ptr addrspace(5) %dst, ptr addrspace(5) %src) #0 {
 ; OPT-LABEL: @memcpy_private_align2_private_align4_1027(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(5) [[TMP1]], align 4
@@ -830,7 +830,7 @@ define amdgpu_kernel void @memcpy_private_align2_private_align4_1027(ptr addrspa
 ; OPT-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC]], i32 1024
 ; OPT-NEXT:    [[TMP10:%.*]] = load i16, ptr addrspace(5) [[TMP9]], align 4
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[DST]], i32 1024
@@ -848,7 +848,7 @@ define amdgpu_kernel void @memcpy_private_align2_private_align4_1027(ptr addrspa
 define amdgpu_kernel void @memcpy_private_align1_private_align4_1027(ptr addrspace(5) %dst, ptr addrspace(5) %src) #0 {
 ; OPT-LABEL: @memcpy_private_align1_private_align4_1027(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(5) [[TMP1]], align 4
@@ -857,7 +857,7 @@ define amdgpu_kernel void @memcpy_private_align1_private_align4_1027(ptr addrspa
 ; OPT-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC]], i32 1024
 ; OPT-NEXT:    [[TMP7:%.*]] = load i16, ptr addrspace(5) [[TMP6]], align 4
 ; OPT-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[DST]], i32 1024
@@ -875,7 +875,7 @@ define amdgpu_kernel void @memcpy_private_align1_private_align4_1027(ptr addrspa
 define amdgpu_kernel void @memcpy_private_align4_private_align2_1027(ptr addrspace(5) %dst, ptr addrspace(5) %src) #0 {
 ; OPT-LABEL: @memcpy_private_align4_private_align2_1027(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(5) [[TMP1]], align 2
@@ -884,7 +884,7 @@ define amdgpu_kernel void @memcpy_private_align4_private_align2_1027(ptr addrspa
 ; OPT-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC]], i32 1024
 ; OPT-NEXT:    [[TMP10:%.*]] = load i16, ptr addrspace(5) [[TMP9]], align 2
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[DST]], i32 1024
@@ -902,7 +902,7 @@ define amdgpu_kernel void @memcpy_private_align4_private_align2_1027(ptr addrspa
 define amdgpu_kernel void @memcpy_private_align4_private_align1_1027(ptr addrspace(5) %dst, ptr addrspace(5) %src) #0 {
 ; OPT-LABEL: @memcpy_private_align4_private_align1_1027(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(5) [[TMP1]], align 1
@@ -911,7 +911,7 @@ define amdgpu_kernel void @memcpy_private_align4_private_align1_1027(ptr addrspa
 ; OPT-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC]], i32 1024
 ; OPT-NEXT:    [[TMP7:%.*]] = load i16, ptr addrspace(5) [[TMP6]], align 1
 ; OPT-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[DST]], i32 1024
@@ -929,7 +929,7 @@ define amdgpu_kernel void @memcpy_private_align4_private_align1_1027(ptr addrspa
 define amdgpu_kernel void @memcpy_private_align2_private_align2_1027(ptr addrspace(5) %dst, ptr addrspace(5) %src) #0 {
 ; OPT-LABEL: @memcpy_private_align2_private_align2_1027(
 ; OPT-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; OPT:       load-store-loop:
+; OPT:       static-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; OPT-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(5) [[TMP1]], align 2
@@ -938,7 +938,7 @@ define amdgpu_kernel void @memcpy_private_align2_private_align2_1027(ptr addrspa
 ; OPT-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; OPT-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 1024
 ; OPT-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; OPT:       memcpy-split:
+; OPT:       static-memcpy-post-expansion:
 ; OPT-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC]], i32 1024
 ; OPT-NEXT:    [[TMP10:%.*]] = load i16, ptr addrspace(5) [[TMP9]], align 2
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[DST]], i32 1024
@@ -959,7 +959,7 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_variable(ptr addrs
 ; OPT-NEXT:    [[TMP3:%.*]] = sub i64 [[N]], [[TMP2]]
 ; OPT-NEXT:    [[TMP4:%.*]] = icmp ne i64 [[TMP3]], 0
 ; OPT-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; OPT:       loop-memcpy-expansion:
+; OPT:       dynamic-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; OPT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP5]], align 4
@@ -968,8 +968,11 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_variable(ptr addrs
 ; OPT-NEXT:    [[TMP8]] = add i64 [[LOOP_INDEX]], 16
 ; OPT-NEXT:    [[TMP9:%.*]] = icmp ult i64 [[TMP8]], [[TMP3]]
 ; OPT-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; OPT:       loop-memcpy-residual:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond:
+; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
+; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; OPT-NEXT:    [[TMP10:%.*]] = add i64 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 [[TMP10]]
 ; OPT-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(1) [[TMP11]], align 1
@@ -977,12 +980,9 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_variable(ptr addrs
 ; OPT-NEXT:    store i8 [[TMP12]], ptr addrspace(1) [[TMP13]], align 1
 ; OPT-NEXT:    [[TMP14]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
 ; OPT-NEXT:    [[TMP15:%.*]] = icmp ult i64 [[TMP14]], [[TMP2]]
-; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; OPT:       post-loop-memcpy-expansion:
+; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT:       dynamic-memcpy-post-expansion:
 ; OPT-NEXT:    ret void
-; OPT:       loop-memcpy-residual-header:
-; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
-; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) align 4 %dst, ptr addrspace(1) align 4 %src, i64 %n, i1 false)
   ret void
@@ -994,7 +994,7 @@ define amdgpu_kernel void @memcpy_global_align2_global_align2_variable(ptr addrs
 ; OPT-NEXT:    [[TMP3:%.*]] = sub i64 [[N]], [[TMP2]]
 ; OPT-NEXT:    [[TMP4:%.*]] = icmp ne i64 [[TMP3]], 0
 ; OPT-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; OPT:       loop-memcpy-expansion:
+; OPT:       dynamic-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; OPT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP5]], align 2
@@ -1003,8 +1003,11 @@ define amdgpu_kernel void @memcpy_global_align2_global_align2_variable(ptr addrs
 ; OPT-NEXT:    [[TMP8]] = add i64 [[LOOP_INDEX]], 16
 ; OPT-NEXT:    [[TMP9:%.*]] = icmp ult i64 [[TMP8]], [[TMP3]]
 ; OPT-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; OPT:       loop-memcpy-residual:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond:
+; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
+; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; OPT-NEXT:    [[TMP10:%.*]] = add i64 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 [[TMP10]]
 ; OPT-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(1) [[TMP11]], align 1
@@ -1012,12 +1015,9 @@ define amdgpu_kernel void @memcpy_global_align2_global_align2_variable(ptr addrs
 ; OPT-NEXT:    store i8 [[TMP12]], ptr addrspace(1) [[TMP13]], align 1
 ; OPT-NEXT:    [[TMP14]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
 ; OPT-NEXT:    [[TMP15:%.*]] = icmp ult i64 [[TMP14]], [[TMP2]]
-; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; OPT:       post-loop-memcpy-expansion:
+; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT:       dynamic-memcpy-post-expansion:
 ; OPT-NEXT:    ret void
-; OPT:       loop-memcpy-residual-header:
-; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
-; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) align 2 %dst, ptr addrspace(1) align 2 %src, i64 %n, i1 false)
   ret void
@@ -1029,7 +1029,7 @@ define amdgpu_kernel void @memcpy_global_align1_global_align1_variable(ptr addrs
 ; OPT-NEXT:    [[TMP3:%.*]] = sub i64 [[N]], [[TMP2]]
 ; OPT-NEXT:    [[TMP4:%.*]] = icmp ne i64 [[TMP3]], 0
 ; OPT-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; OPT:       loop-memcpy-expansion:
+; OPT:       dynamic-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; OPT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP5]], align 1
@@ -1038,8 +1038,11 @@ define amdgpu_kernel void @memcpy_global_align1_global_align1_variable(ptr addrs
 ; OPT-NEXT:    [[TMP8]] = add i64 [[LOOP_INDEX]], 16
 ; OPT-NEXT:    [[TMP9:%.*]] = icmp ult i64 [[TMP8]], [[TMP3]]
 ; OPT-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; OPT:       loop-memcpy-residual:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond:
+; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
+; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; OPT-NEXT:    [[TMP10:%.*]] = add i64 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 [[TMP10]]
 ; OPT-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(1) [[TMP11]], align 1
@@ -1047,12 +1050,9 @@ define amdgpu_kernel void @memcpy_global_align1_global_align1_variable(ptr addrs
 ; OPT-NEXT:    store i8 [[TMP12]], ptr addrspace(1) [[TMP13]], align 1
 ; OPT-NEXT:    [[TMP14]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
 ; OPT-NEXT:    [[TMP15:%.*]] = icmp ult i64 [[TMP14]], [[TMP2]]
-; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; OPT:       post-loop-memcpy-expansion:
+; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT:       dynamic-memcpy-post-expansion:
 ; OPT-NEXT:    ret void
-; OPT:       loop-memcpy-residual-header:
-; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
-; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) align 1 %dst, ptr addrspace(1) align 1 %src, i64 %n, i1 false)
   ret void
@@ -1064,7 +1064,7 @@ define amdgpu_kernel void @memcpy_local_align4_local_align4_variable(ptr addrspa
 ; OPT-NEXT:    [[TMP3:%.*]] = sub i32 [[N]], [[TMP2]]
 ; OPT-NEXT:    [[TMP4:%.*]] = icmp ne i32 [[TMP3]], 0
 ; OPT-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; OPT:       loop-memcpy-expansion:
+; OPT:       dynamic-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; OPT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(3) [[TMP5]], align 4
@@ -1073,8 +1073,11 @@ define amdgpu_kernel void @memcpy_local_align4_local_align4_variable(ptr addrspa
 ; OPT-NEXT:    [[TMP8]] = add i32 [[LOOP_INDEX]], 16
 ; OPT-NEXT:    [[TMP9:%.*]] = icmp ult i32 [[TMP8]], [[TMP3]]
 ; OPT-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; OPT:       loop-memcpy-residual:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond:
+; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
+; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; OPT-NEXT:    [[TMP10:%.*]] = add i32 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC]], i32 [[TMP10]]
 ; OPT-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(3) [[TMP11]], align 1
@@ -1082,12 +1085,9 @@ define amdgpu_kernel void @memcpy_local_align4_local_align4_variable(ptr addrspa
 ; OPT-NEXT:    store i8 [[TMP12]], ptr addrspace(3) [[TMP13]], align 1
 ; OPT-NEXT:    [[TMP14]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
 ; OPT-NEXT:    [[TMP15:%.*]] = icmp ult i32 [[TMP14]], [[TMP2]]
-; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; OPT:       post-loop-memcpy-expansion:
+; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT:       dynamic-memcpy-post-expansion:
 ; OPT-NEXT:    ret void
-; OPT:       loop-memcpy-residual-header:
-; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
-; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p3.p3.i32(ptr addrspace(3) align 4 %dst, ptr addrspace(3) align 4 %src, i32 %n, i1 false)
   ret void
@@ -1099,7 +1099,7 @@ define amdgpu_kernel void @memcpy_local_align2_local_align2_variable(ptr addrspa
 ; OPT-NEXT:    [[TMP3:%.*]] = sub i32 [[N]], [[TMP2]]
 ; OPT-NEXT:    [[TMP4:%.*]] = icmp ne i32 [[TMP3]], 0
 ; OPT-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; OPT:       loop-memcpy-expansion:
+; OPT:       dynamic-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; OPT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(3) [[TMP5]], align 2
@@ -1108,8 +1108,11 @@ define amdgpu_kernel void @memcpy_local_align2_local_align2_variable(ptr addrspa
 ; OPT-NEXT:    [[TMP8]] = add i32 [[LOOP_INDEX]], 16
 ; OPT-NEXT:    [[TMP9:%.*]] = icmp ult i32 [[TMP8]], [[TMP3]]
 ; OPT-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; OPT:       loop-memcpy-residual:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond:
+; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
+; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; OPT-NEXT:    [[TMP10:%.*]] = add i32 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC]], i32 [[TMP10]]
 ; OPT-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(3) [[TMP11]], align 1
@@ -1117,12 +1120,9 @@ define amdgpu_kernel void @memcpy_local_align2_local_align2_variable(ptr addrspa
 ; OPT-NEXT:    store i8 [[TMP12]], ptr addrspace(3) [[TMP13]], align 1
 ; OPT-NEXT:    [[TMP14]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
 ; OPT-NEXT:    [[TMP15:%.*]] = icmp ult i32 [[TMP14]], [[TMP2]]
-; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; OPT:       post-loop-memcpy-expansion:
+; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT:       dynamic-memcpy-post-expansion:
 ; OPT-NEXT:    ret void
-; OPT:       loop-memcpy-residual-header:
-; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
-; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p3.p3.i32(ptr addrspace(3) align 2 %dst, ptr addrspace(3) align 2 %src, i32 %n, i1 false)
   ret void
@@ -1134,7 +1134,7 @@ define amdgpu_kernel void @memcpy_local_align1_local_align1_variable(ptr addrspa
 ; OPT-NEXT:    [[TMP3:%.*]] = sub i32 [[N]], [[TMP2]]
 ; OPT-NEXT:    [[TMP4:%.*]] = icmp ne i32 [[TMP3]], 0
 ; OPT-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; OPT:       loop-memcpy-expansion:
+; OPT:       dynamic-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; OPT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(3) [[TMP5]], align 1
@@ -1143,8 +1143,11 @@ define amdgpu_kernel void @memcpy_local_align1_local_align1_variable(ptr addrspa
 ; OPT-NEXT:    [[TMP8]] = add i32 [[LOOP_INDEX]], 16
 ; OPT-NEXT:    [[TMP9:%.*]] = icmp ult i32 [[TMP8]], [[TMP3]]
 ; OPT-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; OPT:       loop-memcpy-residual:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond:
+; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
+; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; OPT-NEXT:    [[TMP10:%.*]] = add i32 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC]], i32 [[TMP10]]
 ; OPT-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(3) [[TMP11]], align 1
@@ -1152,12 +1155,9 @@ define amdgpu_kernel void @memcpy_local_align1_local_align1_variable(ptr addrspa
 ; OPT-NEXT:    store i8 [[TMP12]], ptr addrspace(3) [[TMP13]], align 1
 ; OPT-NEXT:    [[TMP14]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
 ; OPT-NEXT:    [[TMP15:%.*]] = icmp ult i32 [[TMP14]], [[TMP2]]
-; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; OPT:       post-loop-memcpy-expansion:
+; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT:       dynamic-memcpy-post-expansion:
 ; OPT-NEXT:    ret void
-; OPT:       loop-memcpy-residual-header:
-; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
-; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p3.p3.i32(ptr addrspace(3) align 1 %dst, ptr addrspace(3) align 1 %src, i32 %n, i1 false)
   ret void
@@ -1169,7 +1169,7 @@ define amdgpu_kernel void @memcpy_local_align4_global_align4_variable(ptr addrsp
 ; OPT-NEXT:    [[TMP3:%.*]] = sub i32 [[N]], [[TMP2]]
 ; OPT-NEXT:    [[TMP4:%.*]] = icmp ne i32 [[TMP3]], 0
 ; OPT-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; OPT:       loop-memcpy-expansion:
+; OPT:       dynamic-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; OPT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP5]], align 4
@@ -1178,8 +1178,11 @@ define amdgpu_kernel void @memcpy_local_align4_global_align4_variable(ptr addrsp
 ; OPT-NEXT:    [[TMP8]] = add i32 [[LOOP_INDEX]], 16
 ; OPT-NEXT:    [[TMP9:%.*]] = icmp ult i32 [[TMP8]], [[TMP3]]
 ; OPT-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; OPT:       loop-memcpy-residual:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond:
+; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
+; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; OPT-NEXT:    [[TMP10:%.*]] = add i32 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i32 [[TMP10]]
 ; OPT-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(1) [[TMP11]], align 1
@@ -1187,12 +1190,9 @@ define amdgpu_kernel void @memcpy_local_align4_global_align4_variable(ptr addrsp
 ; OPT-NEXT:    store i8 [[TMP12]], ptr addrspace(3) [[TMP13]], align 1
 ; OPT-NEXT:    [[TMP14]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
 ; OPT-NEXT:    [[TMP15:%.*]] = icmp ult i32 [[TMP14]], [[TMP2]]
-; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; OPT:       post-loop-memcpy-expansion:
+; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT:       dynamic-memcpy-post-expansion:
 ; OPT-NEXT:    ret void
-; OPT:       loop-memcpy-residual-header:
-; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
-; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p3.p1.i32(ptr addrspace(3) align 4 %dst, ptr addrspace(1) align 4 %src, i32 %n, i1 false)
   ret void
@@ -1204,7 +1204,7 @@ define amdgpu_kernel void @memcpy_global_align4_local_align4_variable(ptr addrsp
 ; OPT-NEXT:    [[TMP3:%.*]] = sub i32 [[N]], [[TMP2]]
 ; OPT-NEXT:    [[TMP4:%.*]] = icmp ne i32 [[TMP3]], 0
 ; OPT-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; OPT:       loop-memcpy-expansion:
+; OPT:       dynamic-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; OPT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr addrspace(3) [[TMP5]], align 4
@@ -1213,8 +1213,11 @@ define amdgpu_kernel void @memcpy_global_align4_local_align4_variable(ptr addrsp
 ; OPT-NEXT:    [[TMP8]] = add i32 [[LOOP_INDEX]], 16
 ; OPT-NEXT:    [[TMP9:%.*]] = icmp ult i32 [[TMP8]], [[TMP3]]
 ; OPT-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; OPT:       loop-memcpy-residual:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond:
+; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
+; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; OPT-NEXT:    [[TMP10:%.*]] = add i32 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC]], i32 [[TMP10]]
 ; OPT-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(3) [[TMP11]], align 1
@@ -1222,12 +1225,9 @@ define amdgpu_kernel void @memcpy_global_align4_local_align4_variable(ptr addrsp
 ; OPT-NEXT:    store i8 [[TMP12]], ptr addrspace(1) [[TMP13]], align 1
 ; OPT-NEXT:    [[TMP14]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
 ; OPT-NEXT:    [[TMP15:%.*]] = icmp ult i32 [[TMP14]], [[TMP2]]
-; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; OPT:       post-loop-memcpy-expansion:
+; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT:       dynamic-memcpy-post-expansion:
 ; OPT-NEXT:    ret void
-; OPT:       loop-memcpy-residual-header:
-; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
-; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memcpy.p1.p3.i32(ptr addrspace(1) align 4 %dst, ptr addrspace(3) align 4 %src, i32 %n, i1 false)
   ret void
@@ -1496,7 +1496,7 @@ define amdgpu_kernel void @memmove_private_align1_global_align1(ptr addrspace(5)
 ;
 ; ALL-LABEL: @memmove_private_align1_global_align1(
 ; ALL-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; ALL:       load-store-loop:
+; ALL:       static-memcpy-expansion-main-body:
 ; ALL-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; ALL-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; ALL-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 1, !alias.scope [[META0:![0-9]+]]
@@ -1505,7 +1505,7 @@ define amdgpu_kernel void @memmove_private_align1_global_align1(ptr addrspace(5)
 ; ALL-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; ALL-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 256
 ; ALL-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; ALL:       memcpy-split:
+; ALL:       static-memcpy-post-expansion:
 ; ALL-NEXT:    ret void
 ;
   call void @llvm.memmove.p5.p1.i64(ptr addrspace(5) %dst, ptr addrspace(1) %src, i64 256, i1 false)
@@ -1519,7 +1519,7 @@ define amdgpu_kernel void @memmove_global_align1_private_align1(ptr addrspace(1)
 ;
 ; ALL-LABEL: @memmove_global_align1_private_align1(
 ; ALL-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; ALL:       load-store-loop:
+; ALL:       static-memcpy-expansion-main-body:
 ; ALL-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; ALL-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; ALL-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(5) [[TMP1]], align 1, !alias.scope [[META3:![0-9]+]]
@@ -1528,7 +1528,7 @@ define amdgpu_kernel void @memmove_global_align1_private_align1(ptr addrspace(1)
 ; ALL-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; ALL-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 256
 ; ALL-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; ALL:       memcpy-split:
+; ALL:       static-memcpy-post-expansion:
 ; ALL-NEXT:    ret void
 ;
   call void @llvm.memmove.p1.p5.i64(ptr addrspace(1) %dst, ptr addrspace(5) %src, i64 256, i1 false)
@@ -1722,7 +1722,7 @@ define amdgpu_kernel void @memmove_local_align1_private_align1(ptr addrspace(3)
 ;
 ; ALL-LABEL: @memmove_local_align1_private_align1(
 ; ALL-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; ALL:       load-store-loop:
+; ALL:       static-memcpy-expansion-main-body:
 ; ALL-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; ALL-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; ALL-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(5) [[TMP1]], align 1, !alias.scope [[META6:![0-9]+]]
@@ -1731,7 +1731,7 @@ define amdgpu_kernel void @memmove_local_align1_private_align1(ptr addrspace(3)
 ; ALL-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; ALL-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 256
 ; ALL-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; ALL:       memcpy-split:
+; ALL:       static-memcpy-post-expansion:
 ; ALL-NEXT:    ret void
 ;
   call void @llvm.memmove.p3.p5.i32(ptr addrspace(3) %dst, ptr addrspace(5) %src, i32 256, i1 false)
@@ -1744,7 +1744,7 @@ define amdgpu_kernel void @memmove_local_align1_private_align1_unknown_size(ptr
 ; MAX1024-NEXT:    [[TMP3:%.*]] = sub i32 [[SIZE]], [[TMP2]]
 ; MAX1024-NEXT:    [[TMP4:%.*]] = icmp ne i32 [[TMP3]], 0
 ; MAX1024-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; MAX1024:       loop-memcpy-expansion:
+; MAX1024:       dynamic-memcpy-expansion-main-body:
 ; MAX1024-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; MAX1024-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; MAX1024-NEXT:    [[TMP5:%.*]] = load <4 x i32>, ptr addrspace(5) [[TMP7]], align 1, !alias.scope [[META0:![0-9]+]]
@@ -1753,8 +1753,11 @@ define amdgpu_kernel void @memmove_local_align1_private_align1_unknown_size(ptr
 ; MAX1024-NEXT:    [[TMP8]] = add i32 [[LOOP_INDEX]], 16
 ; MAX1024-NEXT:    [[TMP9:%.*]] = icmp ult i32 [[TMP8]], [[TMP3]]
 ; MAX1024-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; MAX1024:       loop-memcpy-residual:
-; MAX1024-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; MAX1024:       dynamic-memcpy-expansion-residual-cond:
+; MAX1024-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
+; MAX1024-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; MAX1024:       dynamic-memcpy-expansion-residual-body:
+; MAX1024-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; MAX1024-NEXT:    [[TMP10:%.*]] = add i32 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; MAX1024-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC]], i32 [[TMP10]]
 ; MAX1024-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(5) [[TMP11]], align 1, !alias.scope [[META0]]
@@ -1762,19 +1765,16 @@ define amdgpu_kernel void @memmove_local_align1_private_align1_unknown_size(ptr
 ; MAX1024-NEXT:    store i8 [[TMP12]], ptr addrspace(3) [[TMP13]], align 1, !noalias [[META0]]
 ; MAX1024-NEXT:    [[TMP14]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
 ; MAX1024-NEXT:    [[TMP15:%.*]] = icmp ult i32 [[TMP14]], [[TMP2]]
-; MAX1024-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; MAX1024:       post-loop-memcpy-expansion:
+; MAX1024-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; MAX1024:       dynamic-memcpy-post-expansion:
 ; MAX1024-NEXT:    ret void
-; MAX1024:       loop-memcpy-residual-header:
-; MAX1024-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
-; MAX1024-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
 ; ALL-LABEL: @memmove_local_align1_private_align1_unknown_size(
 ; ALL-NEXT:    [[TMP2:%.*]] = and i32 [[SIZE:%.*]], 15
 ; ALL-NEXT:    [[TMP3:%.*]] = sub i32 [[SIZE]], [[TMP2]]
 ; ALL-NEXT:    [[TMP4:%.*]] = icmp ne i32 [[TMP3]], 0
 ; ALL-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; ALL:       loop-memcpy-expansion:
+; ALL:       dynamic-memcpy-expansion-main-body:
 ; ALL-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; ALL-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; ALL-NEXT:    [[TMP5:%.*]] = load <4 x i32>, ptr addrspace(5) [[TMP7]], align 1, !alias.scope [[META9:![0-9]+]]
@@ -1783,8 +1783,11 @@ define amdgpu_kernel void @memmove_local_align1_private_align1_unknown_size(ptr
 ; ALL-NEXT:    [[TMP8]] = add i32 [[LOOP_INDEX]], 16
 ; ALL-NEXT:    [[TMP9:%.*]] = icmp ult i32 [[TMP8]], [[TMP3]]
 ; ALL-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; ALL:       loop-memcpy-residual:
-; ALL-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; ALL:       dynamic-memcpy-expansion-residual-cond:
+; ALL-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
+; ALL-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; ALL:       dynamic-memcpy-expansion-residual-body:
+; ALL-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; ALL-NEXT:    [[TMP10:%.*]] = add i32 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; ALL-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(5) [[SRC]], i32 [[TMP10]]
 ; ALL-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(5) [[TMP11]], align 1, !alias.scope [[META9]]
@@ -1792,12 +1795,9 @@ define amdgpu_kernel void @memmove_local_align1_private_align1_unknown_size(ptr
 ; ALL-NEXT:    store i8 [[TMP12]], ptr addrspace(3) [[TMP13]], align 1, !noalias [[META9]]
 ; ALL-NEXT:    [[TMP14]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
 ; ALL-NEXT:    [[TMP15:%.*]] = icmp ult i32 [[TMP14]], [[TMP2]]
-; ALL-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; ALL:       post-loop-memcpy-expansion:
+; ALL-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; ALL:       dynamic-memcpy-post-expansion:
 ; ALL-NEXT:    ret void
-; ALL:       loop-memcpy-residual-header:
-; ALL-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
-; ALL-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memmove.p3.p5.i32(ptr addrspace(3) %dst, ptr addrspace(5) %src, i32 %size, i1 false)
   ret void
@@ -1810,7 +1810,7 @@ define amdgpu_kernel void @memmove_private_align1_local_align1(ptr addrspace(5)
 ;
 ; ALL-LABEL: @memmove_private_align1_local_align1(
 ; ALL-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; ALL:       load-store-loop:
+; ALL:       static-memcpy-expansion-main-body:
 ; ALL-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; ALL-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; ALL-NEXT:    [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(3) [[TMP1]], align 1, !alias.scope [[META12:![0-9]+]]
@@ -1819,7 +1819,7 @@ define amdgpu_kernel void @memmove_private_align1_local_align1(ptr addrspace(5)
 ; ALL-NEXT:    [[TMP4]] = add i32 [[LOOP_INDEX]], 256
 ; ALL-NEXT:    [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 256
 ; ALL-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; ALL:       memcpy-split:
+; ALL:       static-memcpy-post-expansion:
 ; ALL-NEXT:    ret void
 ;
   call void @llvm.memmove.p5.p3.i32(ptr addrspace(5) %dst, ptr addrspace(3) %src, i32 256, i1 false)
@@ -1832,7 +1832,7 @@ define amdgpu_kernel void @memmove_private_align1_local_align1_unknown_size(ptr
 ; MAX1024-NEXT:    [[TMP3:%.*]] = sub i32 [[SIZE]], [[TMP2]]
 ; MAX1024-NEXT:    [[TMP4:%.*]] = icmp ne i32 [[TMP3]], 0
 ; MAX1024-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; MAX1024:       loop-memcpy-expansion:
+; MAX1024:       dynamic-memcpy-expansion-main-body:
 ; MAX1024-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; MAX1024-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; MAX1024-NEXT:    [[TMP5:%.*]] = load <4 x i32>, ptr addrspace(3) [[TMP7]], align 1, !alias.scope [[META3:![0-9]+]]
@@ -1841,8 +1841,11 @@ define amdgpu_kernel void @memmove_private_align1_local_align1_unknown_size(ptr
 ; MAX1024-NEXT:    [[TMP8]] = add i32 [[LOOP_INDEX]], 16
 ; MAX1024-NEXT:    [[TMP9:%.*]] = icmp ult i32 [[TMP8]], [[TMP3]]
 ; MAX1024-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; MAX1024:       loop-memcpy-residual:
-; MAX1024-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; MAX1024:       dynamic-memcpy-expansion-residual-cond:
+; MAX1024-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
+; MAX1024-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; MAX1024:       dynamic-memcpy-expansion-residual-body:
+; MAX1024-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; MAX1024-NEXT:    [[TMP10:%.*]] = add i32 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; MAX1024-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC]], i32 [[TMP10]]
 ; MAX1024-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(3) [[TMP11]], align 1, !alias.scope [[META3]]
@@ -1850,19 +1853,16 @@ define amdgpu_kernel void @memmove_private_align1_local_align1_unknown_size(ptr
 ; MAX1024-NEXT:    store i8 [[TMP12]], ptr addrspace(5) [[TMP13]], align 1, !noalias [[META3]]
 ; MAX1024-NEXT:    [[TMP14]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
 ; MAX1024-NEXT:    [[TMP15:%.*]] = icmp ult i32 [[TMP14]], [[TMP2]]
-; MAX1024-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; MAX1024:       post-loop-memcpy-expansion:
+; MAX1024-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; MAX1024:       dynamic-memcpy-post-expansion:
 ; MAX1024-NEXT:    ret void
-; MAX1024:       loop-memcpy-residual-header:
-; MAX1024-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
-; MAX1024-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
 ; ALL-LABEL: @memmove_private_align1_local_align1_unknown_size(
 ; ALL-NEXT:    [[TMP2:%.*]] = and i32 [[SIZE:%.*]], 15
 ; ALL-NEXT:    [[TMP3:%.*]] = sub i32 [[SIZE]], [[TMP2]]
 ; ALL-NEXT:    [[TMP4:%.*]] = icmp ne i32 [[TMP3]], 0
 ; ALL-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; ALL:       loop-memcpy-expansion:
+; ALL:       dynamic-memcpy-expansion-main-body:
 ; ALL-NEXT:    [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; ALL-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC:%.*]], i32 [[LOOP_INDEX]]
 ; ALL-NEXT:    [[TMP5:%.*]] = load <4 x i32>, ptr addrspace(3) [[TMP7]], align 1, !alias.scope [[META15:![0-9]+]]
@@ -1871,8 +1871,11 @@ define amdgpu_kernel void @memmove_private_align1_local_align1_unknown_size(ptr
 ; ALL-NEXT:    [[TMP8]] = add i32 [[LOOP_INDEX]], 16
 ; ALL-NEXT:    [[TMP9:%.*]] = icmp ult i32 [[TMP8]], [[TMP3]]
 ; ALL-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; ALL:       loop-memcpy-residual:
-; ALL-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; ALL:       dynamic-memcpy-expansion-residual-cond:
+; ALL-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
+; ALL-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; ALL:       dynamic-memcpy-expansion-residual-body:
+; ALL-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i32 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; ALL-NEXT:    [[TMP10:%.*]] = add i32 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; ALL-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr addrspace(3) [[SRC]], i32 [[TMP10]]
 ; ALL-NEXT:    [[TMP12:%.*]] = load i8, ptr addrspace(3) [[TMP11]], align 1, !alias.scope [[META15]]
@@ -1880,12 +1883,9 @@ define amdgpu_kernel void @memmove_private_align1_local_align1_unknown_size(ptr
 ; ALL-NEXT:    store i8 [[TMP12]], ptr addrspace(5) [[TMP13]], align 1, !noalias [[META15]]
 ; ALL-NEXT:    [[TMP14]] = add i32 [[RESIDUAL_LOOP_INDEX]], 1
 ; ALL-NEXT:    [[TMP15:%.*]] = icmp ult i32 [[TMP14]], [[TMP2]]
-; ALL-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; ALL:       post-loop-memcpy-expansion:
+; ALL-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; ALL:       dynamic-memcpy-post-expansion:
 ; ALL-NEXT:    ret void
-; ALL:       loop-memcpy-residual-header:
-; ALL-NEXT:    [[TMP16:%.*]] = icmp ne i32 [[TMP2]], 0
-; ALL-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
   call void @llvm.memmove.p5.p3.i32(ptr addrspace(5) %dst, ptr addrspace(3) %src, i32 %size, i1 false)
   ret void
@@ -2367,7 +2367,7 @@ define void @test_umin(i64 %0, i64 %idxprom, ptr %x, ptr %y) {
 ; OPT-NEXT:    [[TMP3:%.*]] = sub i64 [[SPEC_SELECT]], [[TMP2]]
 ; OPT-NEXT:    [[TMP4:%.*]] = icmp ne i64 [[TMP3]], 0
 ; OPT-NEXT:    br i1 [[TMP4]], label [[LOOP_MEMCPY_EXPANSION:%.*]], label [[LOOP_MEMCPY_RESIDUAL_HEADER:%.*]]
-; OPT:       loop-memcpy-expansion:
+; OPT:       dynamic-memcpy-expansion-main-body:
 ; OPT-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[TMP8:%.*]], [[LOOP_MEMCPY_EXPANSION]] ]
 ; OPT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[X:%.*]], i64 [[LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP6:%.*]] = load <4 x i32>, ptr [[TMP5]], align 1
@@ -2376,8 +2376,11 @@ define void @test_umin(i64 %0, i64 %idxprom, ptr %x, ptr %y) {
 ; OPT-NEXT:    [[TMP8]] = add i64 [[LOOP_INDEX]], 16
 ; OPT-NEXT:    [[TMP9:%.*]] = icmp ult i64 [[TMP8]], [[TMP3]]
 ; OPT-NEXT:    br i1 [[TMP9]], label [[LOOP_MEMCPY_EXPANSION]], label [[LOOP_MEMCPY_RESIDUAL_HEADER]]
-; OPT:       loop-memcpy-residual:
-; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL:%.*]] ]
+; OPT:       dynamic-memcpy-expansion-residual-cond:
+; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
+; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL:%.*]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
+; OPT:       dynamic-memcpy-expansion-residual-body:
+; OPT-NEXT:    [[RESIDUAL_LOOP_INDEX:%.*]] = phi i64 [ 0, [[LOOP_MEMCPY_RESIDUAL_HEADER]] ], [ [[TMP14:%.*]], [[LOOP_MEMCPY_RESIDUAL]] ]
 ; OPT-NEXT:    [[TMP10:%.*]] = add i64 [[TMP3]], [[RESIDUAL_LOOP_INDEX]]
 ; OPT-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[X]], i64 [[TMP10]]
 ; OPT-NEXT:    [[TMP12:%.*]] = load i8, ptr [[TMP11]], align 1
@@ -2385,12 +2388,9 @@ define void @test_umin(i64 %0, i64 %idxprom, ptr %x, ptr %y) {
 ; OPT-NEXT:    store i8 [[TMP12]], ptr [[TMP13]], align 1
 ; OPT-NEXT:    [[TMP14]] = add i64 [[RESIDUAL_LOOP_INDEX]], 1
 ; OPT-NEXT:    [[TMP15:%.*]] = icmp ult i64 [[TMP14]], [[TMP2]]
-; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
-; OPT:       post-loop-memcpy-expansion:
+; OPT-NEXT:    br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
+; OPT:       dynamic-memcpy-post-expansion:
 ; OPT-NEXT:    ret void
-; OPT:       loop-memcpy-residual-header:
-; OPT-NEXT:    [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
-; OPT-NEXT:    br i1 [[TMP16]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
 ;
 entry:
   %arrayidx = getelementptr [32 x [8 x i64]], ptr %y, i64 0, i64 %idxprom
@@ -2439,7 +2439,7 @@ define amdgpu_kernel void @memcpy_volatile(ptr addrspace(1) %dst, ptr addrspace(
 ;
 ; ALL-LABEL: @memcpy_volatile(
 ; ALL-NEXT:    br label [[LOAD_STORE_LOOP:%.*]]
-; ALL:       load-store-loop:
+; ALL:       static-memcpy-expansion-main-body:
 ; ALL-NEXT:    [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
 ; ALL-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
 ; ALL-NEXT:    [[TMP2:%.*]] = load volatile <64 x i32>, ptr addrspace(1) [[TMP1]], align 1
@@ -2448,7 +2448,7 @@ define amdgpu_kernel void @memcpy_volatile(ptr addrspace(1) %dst, ptr addrspace(
 ; ALL-NEXT:    [[TMP4]] = add i64 [[LOOP_INDEX]], 256
 ; ALL-NEXT:    [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 512
 ; ALL-NEXT:    br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
-; ALL:       memcpy-split:
+; ALL:       static-memcpy-post-expansion:
 ; ALL-NEXT:    ret void
 ;
   call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst, ptr addrspace(1) %src, i64 512, i1 true)
diff --git a/llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll b/llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll
index 43752c22b1f3e..faf70f55876f7 100644
--- a/llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll
+++ b/llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll
@@ -12,7 +12,7 @@ define void @issue63986(i64 %0, i64 %idxprom, ptr inreg %ptr) {
 ; CHECK-NEXT:    v_add_co_u32_e32 v8, vcc, s16, v4
 ; CHECK-NEXT:    v_addc_co_u32_e32 v9, vcc, v6, v5, vcc
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
-; CHECK-NEXT:  .LBB0_1: ; %loop-memcpy-expansion
+; CHECK-NEXT:  .LBB0_1: ; %dynamic-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    v_mov_b32_e32 v7, s5
 ; CHECK-NEXT:    v_mov_b32_e32 v6, s4
@@ -20,28 +20,28 @@ define void @issue63986(i64 %0, i64 %idxprom, ptr inreg %ptr) {
 ; CHECK-NEXT:    v_add_co_u32_e32 v6, vcc, s4, v8
 ; CHECK-NEXT:    s_add_u32 s4, s4, 16
 ; CHECK-NEXT:    s_addc_u32 s5, s5, 0
-; CHECK-NEXT:    v_cmp_ge_u64_e64 s[6:7], s[4:5], 32
+; CHECK-NEXT:    v_cmp_lt_u64_e64 s[6:7], s[4:5], 32
 ; CHECK-NEXT:    v_addc_co_u32_e32 v7, vcc, v9, v7, vcc
 ; CHECK-NEXT:    s_and_b64 vcc, exec, s[6:7]
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    flat_store_dwordx4 v[6:7], v[10:13]
-; CHECK-NEXT:    s_cbranch_vccz .LBB0_1
-; CHECK-NEXT:  ; %bb.2: ; %loop-memcpy-residual-header
+; CHECK-NEXT:    s_cbranch_vccnz .LBB0_1
+; CHECK-NEXT:  ; %bb.2: ; %dynamic-memcpy-expansion-residual-cond
 ; CHECK-NEXT:    s_branch .LBB0_4
 ; CHECK-NEXT:  ; %bb.3:
 ; CHECK-NEXT:    ; implicit-def: $vgpr6_vgpr7
 ; CHECK-NEXT:    s_branch .LBB0_5
-; CHECK-NEXT:  .LBB0_4: ; %loop-memcpy-residual-header.post-loop-memcpy-expansion_crit_edge
+; CHECK-NEXT:  .LBB0_4: ; %dynamic-memcpy-expansion-residual-cond.dynamic-memcpy-post-expansion_crit_edge
 ; CHECK-NEXT:    v_lshlrev_b64 v[6:7], 6, v[2:3]
 ; CHECK-NEXT:    s_cbranch_execnz .LBB0_8
-; CHECK-NEXT:  .LBB0_5: ; %loop-memcpy-residual.preheader
+; CHECK-NEXT:  .LBB0_5: ; %dynamic-memcpy-expansion-residual-body.preheader
 ; CHECK-NEXT:    s_add_u32 s4, s16, 32
 ; CHECK-NEXT:    s_addc_u32 s5, s17, 0
 ; CHECK-NEXT:    v_mov_b32_e32 v3, s5
 ; CHECK-NEXT:    v_add_co_u32_e32 v2, vcc, s4, v4
 ; CHECK-NEXT:    v_addc_co_u32_e32 v3, vcc, v3, v5, vcc
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
-; CHECK-NEXT:  ; %bb.6: ; %loop-memcpy-residual
+; CHECK-NEXT:  ; %bb.6: ; %dynamic-memcpy-expansion-residual-body
 ; CHECK-NEXT:    s_add_u32 s6, 32, s4
 ; CHECK-NEXT:    s_addc_u32 s7, 0, s5
 ; CHECK-NEXT:    v_mov_b32_e32 v6, s6
@@ -57,7 +57,7 @@ define void @issue63986(i64 %0, i64 %idxprom, ptr inreg %ptr) {
 ; CHECK-NEXT:  ; %bb.7:
 ; CHECK-NEXT:    v_mov_b32_e32 v7, v5
 ; CHECK-NEXT:    v_mov_b32_e32 v6, v4
-; CHECK-NEXT:  .LBB0_8: ; %post-loop-memcpy-expansion
+; CHECK-NEXT:  .LBB0_8: ; %dynamic-memcpy-post-expansion
 ; CHECK-NEXT:    v_and_b32_e32 v2, 15, v0
 ; CHECK-NEXT:    v_and_b32_e32 v0, -16, v0
 ; CHECK-NEXT:    v_add_co_u32_e32 v4, vcc, v6, v0
@@ -76,18 +76,18 @@ define void @issue63986(i64 %0, i64 %idxprom, ptr inreg %ptr) {
 ; CHECK-NEXT:  .LBB0_10: ; %Flow16
 ; CHECK-NEXT:    ; in Loop: Header=BB0_11 Depth=1
 ; CHECK-NEXT:    s_andn2_b64 vcc, exec, s[8:9]
-; CHECK-NEXT:    s_cbranch_vccz .LBB0_19
+; CHECK-NEXT:    s_cbranch_vccz .LBB0_18
 ; CHECK-NEXT:  .LBB0_11: ; %while.cond
 ; CHECK-NEXT:    ; =>This Loop Header: Depth=1
 ; CHECK-NEXT:    ; Child Loop BB0_13 Depth 2
 ; CHECK-NEXT:    ; Child Loop BB0_17 Depth 2
 ; CHECK-NEXT:    s_and_saveexec_b64 s[8:9], s[4:5]
 ; CHECK-NEXT:    s_cbranch_execz .LBB0_14
-; CHECK-NEXT:  ; %bb.12: ; %loop-memcpy-expansion2.preheader
+; CHECK-NEXT:  ; %bb.12: ; %dynamic-memcpy-expansion-main-body2.preheader
 ; CHECK-NEXT:    ; in Loop: Header=BB0_11 Depth=1
 ; CHECK-NEXT:    s_mov_b64 s[10:11], 0
 ; CHECK-NEXT:    s_mov_b64 s[12:13], 0
-; CHECK-NEXT:  .LBB0_13: ; %loop-memcpy-expansion2
+; CHECK-NEXT:  .LBB0_13: ; %dynamic-memcpy-expansion-main-body2
 ; CHECK-NEXT:    ; Parent Loop BB0_11 Depth=1
 ; CHECK-NEXT:    ; => This Inner Loop Header: Depth=2
 ; CHECK-NEXT:    v_mov_b32_e32 v6, s10
@@ -108,37 +108,33 @@ define void @issue63986(i64 %0, i64 %idxprom, ptr inreg %ptr) {
 ; CHECK-NEXT:    s_or_b64 exec, exec, s[8:9]
 ; CHECK-NEXT:    s_mov_b64 s[8:9], -1
 ; CHECK-NEXT:    s_cbranch_execz .LBB0_10
-; CHECK-NEXT:  ; %bb.15: ; %loop-memcpy-residual-header5
+; CHECK-NEXT:  ; %bb.15: ; %dynamic-memcpy-expansion-residual-cond5
 ; CHECK-NEXT:    ; in Loop: Header=BB0_11 Depth=1
-; CHECK-NEXT:    s_and_saveexec_b64 s[8:9], s[6:7]
-; CHECK-NEXT:    s_xor_b64 s[10:11], exec, s[8:9]
+; CHECK-NEXT:    s_and_saveexec_b64 s[10:11], s[6:7]
 ; CHECK-NEXT:    s_cbranch_execz .LBB0_9
-; CHECK-NEXT:  ; %bb.16: ; %loop-memcpy-residual4.preheader
+; CHECK-NEXT:  ; %bb.16: ; %dynamic-memcpy-expansion-residual-body4.preheader
 ; CHECK-NEXT:    ; in Loop: Header=BB0_11 Depth=1
-; CHECK-NEXT:    s_mov_b64 s[14:15], 0
 ; CHECK-NEXT:    s_mov_b64 s[12:13], 0
-; CHECK-NEXT:  .LBB0_17: ; %loop-memcpy-residual4
+; CHECK-NEXT:    s_mov_b64 s[14:15], 0
+; CHECK-NEXT:  .LBB0_17: ; %dynamic-memcpy-expansion-residual-body4
 ; CHECK-NEXT:    ; Parent Loop BB0_11 Depth=1
 ; CHECK-NEXT:    ; => This Inner Loop Header: Depth=2
-; CHECK-NEXT:    v_mov_b32_e32 v10, s15
-; CHECK-NEXT:    v_add_co_u32_e32 v6, vcc, s14, v0
+; CHECK-NEXT:    v_mov_b32_e32 v10, s13
+; CHECK-NEXT:    v_add_co_u32_e32 v6, vcc, s12, v0
 ; CHECK-NEXT:    v_addc_co_u32_e32 v7, vcc, v1, v10, vcc
 ; CHECK-NEXT:    flat_load_ubyte v11, v[6:7]
-; CHECK-NEXT:    v_add_co_u32_e32 v6, vcc, s14, v4
-; CHECK-NEXT:    s_add_u32 s14, s14, 1
-; CHECK-NEXT:    s_addc_u32 s15, s15, 0
-; CHECK-NEXT:    v_cmp_ge_u64_e64 s[8:9], s[14:15], v[2:3]
+; CHECK-NEXT:    v_add_co_u32_e32 v6, vcc, s12, v4
+; CHECK-NEXT:    s_add_u32 s12, s12, 1
+; CHECK-NEXT:    s_addc_u32 s13, s13, 0
+; CHECK-NEXT:    v_cmp_ge_u64_e64 s[8:9], s[12:13], v[2:3]
 ; CHECK-NEXT:    v_addc_co_u32_e32 v7, vcc, v5, v10, vcc
-; CHECK-NEXT:    s_or_b64 s[12:13], s[8:9], s[12:13]
+; CHECK-NEXT:    s_or_b64 s[14:15], s[8:9], s[14:15]
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    flat_store_byte v[6:7], v11
-; CHECK-NEXT:    s_andn2_b64 exec, exec, s[12:13]
+; CHECK-NEXT:    s_andn2_b64 exec, exec, s[14:15]
 ; CHECK-NEXT:    s_cbranch_execnz .LBB0_17
-; CHECK-NEXT:  ; %bb.18: ; %Flow
-; CHECK-NEXT:    ; in Loop: Header=BB0_11 Depth=1
-; CHECK-NEXT:    s_or_b64 exec, exec, s[12:13]
 ; CHECK-NEXT:    s_branch .LBB0_9
-; CHECK-NEXT:  .LBB0_19: ; %DummyReturnBlock
+; CHECK-NEXT:  .LBB0_18: ; %DummyReturnBlock
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 entry:
diff --git a/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll b/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
index cb68a987c243b..4f2816538b1ff 100644
--- a/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
+++ b/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
@@ -14,7 +14,7 @@ define void @memcpy_p0_p0_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(0)
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
-; CHECK-NEXT:  .LBB0_1: ; %load-store-loop
+; CHECK-NEXT:  .LBB0_1: ; %static-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    v_add_co_u32 v24, vcc_lo, v2, s4
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v25, null, s5, v3, vcc_lo
@@ -83,7 +83,7 @@ define void @memcpy_p0_p0_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(0)
 ; CHECK-NEXT:    flat_store_dwordx4 v[102:103], v[96:99]
 ; CHECK-NEXT:    s_and_b32 vcc_lo, exec_lo, s6
 ; CHECK-NEXT:    s_cbranch_vccnz .LBB0_1
-; CHECK-NEXT:  ; %bb.2: ; %memcpy-split
+; CHECK-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -108,7 +108,7 @@ define void @memcpy_p0_p0_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(0)
 ; ALIGNED-NEXT:    buffer_store_dword v62, off, s[0:3], s32 offset:8 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v63, off, s[0:3], s32 offset:4 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v72, off, s[0:3], s32 ; 4-byte Folded Spill
-; ALIGNED-NEXT:  .LBB0_1: ; %load-store-loop
+; ALIGNED-NEXT:  .LBB0_1: ; %static-memcpy-expansion-main-body
 ; ALIGNED-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; ALIGNED-NEXT:    v_add_co_u32 v4, vcc_lo, v2, s4
 ; ALIGNED-NEXT:    v_add_co_ci_u32_e64 v5, null, s5, v3, vcc_lo
@@ -757,7 +757,7 @@ define void @memcpy_p0_p0_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(0)
 ; ALIGNED-NEXT:    flat_store_byte v[20:21], v11 offset:48
 ; ALIGNED-NEXT:    flat_store_byte v[20:21], v4 offset:46
 ; ALIGNED-NEXT:    s_cbranch_vccnz .LBB0_1
-; ALIGNED-NEXT:  ; %bb.2: ; %memcpy-split
+; ALIGNED-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; ALIGNED-NEXT:    s_clause 0x10 ; 68-byte Folded Reload
 ; ALIGNED-NEXT:    buffer_load_dword v72, off, s[0:3], s32
 ; ALIGNED-NEXT:    buffer_load_dword v63, off, s[0:3], s32 offset:4
@@ -784,7 +784,7 @@ define void @memcpy_p0_p0_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(0)
 ; UNROLL3-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; UNROLL3-NEXT:    s_mov_b64 s[4:5], 0
 ; UNROLL3-NEXT:    .p2align 6
-; UNROLL3-NEXT:  .LBB0_1: ; %load-store-loop
+; UNROLL3-NEXT:  .LBB0_1: ; %static-memcpy-expansion-main-body
 ; UNROLL3-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; UNROLL3-NEXT:    v_add_co_u32 v12, vcc_lo, v2, s4
 ; UNROLL3-NEXT:    v_add_co_ci_u32_e64 v13, null, s5, v3, vcc_lo
@@ -805,7 +805,7 @@ define void @memcpy_p0_p0_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(0)
 ; UNROLL3-NEXT:    v_cmp_gt_u64_e64 s6, 0x7e0, s[4:5]
 ; UNROLL3-NEXT:    s_and_b32 vcc_lo, exec_lo, s6
 ; UNROLL3-NEXT:    s_cbranch_vccnz .LBB0_1
-; UNROLL3-NEXT:  ; %bb.2: ; %memcpy-split
+; UNROLL3-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; UNROLL3-NEXT:    flat_load_dwordx4 v[4:7], v[2:3] offset:2016
 ; UNROLL3-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; UNROLL3-NEXT:    flat_store_dwordx4 v[0:1], v[4:7] offset:2016
@@ -824,7 +824,7 @@ define void @memcpy_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1)
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
-; CHECK-NEXT:  .LBB1_1: ; %load-store-loop
+; CHECK-NEXT:  .LBB1_1: ; %static-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    v_add_co_u32 v96, vcc_lo, v2, s4
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v97, null, s5, v3, vcc_lo
@@ -884,7 +884,7 @@ define void @memcpy_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1)
 ; CHECK-NEXT:    v_cmp_gt_u64_e64 s6, 0x800, s[4:5]
 ; CHECK-NEXT:    s_and_b32 vcc_lo, exec_lo, s6
 ; CHECK-NEXT:    s_cbranch_vccnz .LBB1_1
-; CHECK-NEXT:  ; %bb.2: ; %memcpy-split
+; CHECK-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; ALIGNED-LABEL: memcpy_p1_p1_sz2048:
@@ -899,7 +899,7 @@ define void @memcpy_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1)
 ; ALIGNED-NEXT:    buffer_store_dword v45, off, s[0:3], s32 offset:8 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v46, off, s[0:3], s32 offset:4 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v47, off, s[0:3], s32 ; 4-byte Folded Spill
-; ALIGNED-NEXT:  .LBB1_1: ; %load-store-loop
+; ALIGNED-NEXT:  .LBB1_1: ; %static-memcpy-expansion-main-body
 ; ALIGNED-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; ALIGNED-NEXT:    v_add_co_u32 v24, vcc_lo, v2, s4
 ; ALIGNED-NEXT:    v_add_co_ci_u32_e64 v25, null, s5, v3, vcc_lo
@@ -1520,7 +1520,7 @@ define void @memcpy_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1)
 ; ALIGNED-NEXT:    global_store_byte v[16:17], v11, off offset:3
 ; ALIGNED-NEXT:    global_store_byte v[16:17], v4, off offset:1
 ; ALIGNED-NEXT:    s_cbranch_vccnz .LBB1_1
-; ALIGNED-NEXT:  ; %bb.2: ; %memcpy-split
+; ALIGNED-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; ALIGNED-NEXT:    s_clause 0x7 ; 32-byte Folded Reload
 ; ALIGNED-NEXT:    buffer_load_dword v47, off, s[0:3], s32
 ; ALIGNED-NEXT:    buffer_load_dword v46, off, s[0:3], s32 offset:4
@@ -1538,7 +1538,7 @@ define void @memcpy_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1)
 ; UNROLL3-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; UNROLL3-NEXT:    s_mov_b64 s[4:5], 0
 ; UNROLL3-NEXT:    .p2align 6
-; UNROLL3-NEXT:  .LBB1_1: ; %load-store-loop
+; UNROLL3-NEXT:  .LBB1_1: ; %static-memcpy-expansion-main-body
 ; UNROLL3-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; UNROLL3-NEXT:    v_add_co_u32 v12, vcc_lo, v2, s4
 ; UNROLL3-NEXT:    v_add_co_ci_u32_e64 v13, null, s5, v3, vcc_lo
@@ -1559,7 +1559,7 @@ define void @memcpy_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1)
 ; UNROLL3-NEXT:    v_cmp_gt_u64_e64 s6, 0x7e0, s[4:5]
 ; UNROLL3-NEXT:    s_and_b32 vcc_lo, exec_lo, s6
 ; UNROLL3-NEXT:    s_cbranch_vccnz .LBB1_1
-; UNROLL3-NEXT:  ; %bb.2: ; %memcpy-split
+; UNROLL3-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; UNROLL3-NEXT:    global_load_dwordx4 v[4:7], v[2:3], off offset:2016
 ; UNROLL3-NEXT:    s_waitcnt vmcnt(0)
 ; UNROLL3-NEXT:    global_store_dwordx4 v[0:1], v[4:7], off offset:2016
@@ -1577,7 +1577,7 @@ define void @memcpy_p0_p4_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(4)
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
-; CHECK-NEXT:  .LBB2_1: ; %load-store-loop
+; CHECK-NEXT:  .LBB2_1: ; %static-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    v_add_co_u32 v96, vcc_lo, v2, s4
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v97, null, s5, v3, vcc_lo
@@ -1639,7 +1639,7 @@ define void @memcpy_p0_p4_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(4)
 ; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[96:99]
 ; CHECK-NEXT:    s_and_b32 vcc_lo, exec_lo, s6
 ; CHECK-NEXT:    s_cbranch_vccnz .LBB2_1
-; CHECK-NEXT:  ; %bb.2: ; %memcpy-split
+; CHECK-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -1647,7 +1647,7 @@ define void @memcpy_p0_p4_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(4)
 ; ALIGNED:       ; %bb.0: ; %entry
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; ALIGNED-NEXT:    s_mov_b64 s[4:5], 0
-; ALIGNED-NEXT:  .LBB2_1: ; %load-store-loop
+; ALIGNED-NEXT:  .LBB2_1: ; %static-memcpy-expansion-main-body
 ; ALIGNED-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; ALIGNED-NEXT:    v_add_co_u32 v8, vcc_lo, v2, s4
 ; ALIGNED-NEXT:    v_add_co_ci_u32_e64 v9, null, s5, v3, vcc_lo
@@ -2141,7 +2141,7 @@ define void @memcpy_p0_p4_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(4)
 ; ALIGNED-NEXT:    flat_store_byte v[84:85], v65 offset:1
 ; ALIGNED-NEXT:    flat_store_byte v[84:85], v4
 ; ALIGNED-NEXT:    s_cbranch_vccnz .LBB2_1
-; ALIGNED-NEXT:  ; %bb.2: ; %memcpy-split
+; ALIGNED-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; ALIGNED-NEXT:    s_waitcnt lgkmcnt(0)
 ; ALIGNED-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -2150,7 +2150,7 @@ define void @memcpy_p0_p4_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(4)
 ; UNROLL3-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; UNROLL3-NEXT:    s_mov_b64 s[4:5], 0
 ; UNROLL3-NEXT:    .p2align 6
-; UNROLL3-NEXT:  .LBB2_1: ; %load-store-loop
+; UNROLL3-NEXT:  .LBB2_1: ; %static-memcpy-expansion-main-body
 ; UNROLL3-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; UNROLL3-NEXT:    v_add_co_u32 v12, vcc_lo, v2, s4
 ; UNROLL3-NEXT:    v_add_co_ci_u32_e64 v13, null, s5, v3, vcc_lo
@@ -2171,7 +2171,7 @@ define void @memcpy_p0_p4_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(4)
 ; UNROLL3-NEXT:    v_cmp_gt_u64_e64 s6, 0x7e0, s[4:5]
 ; UNROLL3-NEXT:    s_and_b32 vcc_lo, exec_lo, s6
 ; UNROLL3-NEXT:    s_cbranch_vccnz .LBB2_1
-; UNROLL3-NEXT:  ; %bb.2: ; %memcpy-split
+; UNROLL3-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; UNROLL3-NEXT:    s_clause 0x1
 ; UNROLL3-NEXT:    global_load_dwordx4 v[4:7], v[2:3], off offset:2016
 ; UNROLL3-NEXT:    global_load_dwordx4 v[8:11], v[2:3], off offset:2032
@@ -2191,7 +2191,7 @@ define void @memcpy_p5_p5_sz2048(ptr addrspace(5) align 1 %dst, ptr addrspace(5)
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
-; CHECK-NEXT:  .LBB3_1: ; %load-store-loop
+; CHECK-NEXT:  .LBB3_1: ; %static-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    s_clause 0x3e
 ; CHECK-NEXT:    buffer_load_dword v2, v1, s[0:3], 0 offen offset:252
@@ -2392,7 +2392,7 @@ define void @memcpy_p5_p5_sz2048(ptr addrspace(5) align 1 %dst, ptr addrspace(5)
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, 0x100, v0
 ; CHECK-NEXT:    s_and_b32 vcc_lo, exec_lo, s6
 ; CHECK-NEXT:    s_cbranch_vccnz .LBB3_1
-; CHECK-NEXT:  ; %bb.2: ; %memcpy-split
+; CHECK-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; ALIGNED-LABEL: memcpy_p5_p5_sz2048:
@@ -2447,7 +2447,7 @@ define void @memcpy_p5_p5_sz2048(ptr addrspace(5) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_store_dword v125, off, s[0:3], s32 offset:8 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v126, off, s[0:3], s32 offset:4 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v127, off, s[0:3], s32 ; 4-byte Folded Spill
-; ALIGNED-NEXT:  .LBB3_1: ; %load-store-loop
+; ALIGNED-NEXT:  .LBB3_1: ; %static-memcpy-expansion-main-body
 ; ALIGNED-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; ALIGNED-NEXT:    s_clause 0x34
 ; ALIGNED-NEXT:    buffer_load_ubyte v116, v1, s[0:3], 0 offen offset:255
@@ -3495,7 +3495,7 @@ define void @memcpy_p5_p5_sz2048(ptr addrspace(5) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_store_byte v2, v0, s[0:3], 0 offen
 ; ALIGNED-NEXT:    v_add_nc_u32_e32 v0, 0x100, v0
 ; ALIGNED-NEXT:    s_cbranch_vccnz .LBB3_1
-; ALIGNED-NEXT:  ; %bb.2: ; %memcpy-split
+; ALIGNED-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; ALIGNED-NEXT:    s_clause 0x2f ; 192-byte Folded Reload
 ; ALIGNED-NEXT:    buffer_load_dword v127, off, s[0:3], s32
 ; ALIGNED-NEXT:    buffer_load_dword v126, off, s[0:3], s32 offset:4
@@ -3554,7 +3554,7 @@ define void @memcpy_p5_p5_sz2048(ptr addrspace(5) align 1 %dst, ptr addrspace(5)
 ; UNROLL3-NEXT:    v_mov_b32_e32 v2, v1
 ; UNROLL3-NEXT:    v_mov_b32_e32 v3, v0
 ; UNROLL3-NEXT:    s_mov_b64 s[4:5], 0
-; UNROLL3-NEXT:  .LBB3_1: ; %load-store-loop
+; UNROLL3-NEXT:  .LBB3_1: ; %static-memcpy-expansion-main-body
 ; UNROLL3-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; UNROLL3-NEXT:    s_clause 0xb
 ; UNROLL3-NEXT:    buffer_load_dword v4, v2, s[0:3], 0 offen offset:44
@@ -3600,7 +3600,7 @@ define void @memcpy_p5_p5_sz2048(ptr addrspace(5) align 1 %dst, ptr addrspace(5)
 ; UNROLL3-NEXT:    v_add_nc_u32_e32 v3, 48, v3
 ; UNROLL3-NEXT:    s_and_b32 vcc_lo, exec_lo, s6
 ; UNROLL3-NEXT:    s_cbranch_vccnz .LBB3_1
-; UNROLL3-NEXT:  ; %bb.2: ; %memcpy-split
+; UNROLL3-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; UNROLL3-NEXT:    s_clause 0x3
 ; UNROLL3-NEXT:    buffer_load_dword v2, v1, s[0:3], 0 offen offset:2028
 ; UNROLL3-NEXT:    buffer_load_dword v3, v1, s[0:3], 0 offen offset:2024
@@ -3638,7 +3638,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
-; CHECK-NEXT:  .LBB4_1: ; %load-store-loop
+; CHECK-NEXT:  .LBB4_1: ; %static-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    s_clause 0x3e
 ; CHECK-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen offset:32
@@ -3741,7 +3741,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; CHECK-NEXT:    flat_store_dwordx4 v[100:101], v[84:87]
 ; CHECK-NEXT:    s_and_b32 vcc_lo, exec_lo, s6
 ; CHECK-NEXT:    s_cbranch_vccnz .LBB4_1
-; CHECK-NEXT:  ; %bb.2: ; %memcpy-split
+; CHECK-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 ;
@@ -3799,7 +3799,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    buffer_store_dword v127, off, s[0:3], s32 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v1, off, s[0:3], s32 offset:1224 ; 4-byte Folded Spill
 ; ALIGNED-NEXT:    buffer_store_dword v0, off, s[0:3], s32 offset:1228 ; 4-byte Folded Spill
-; ALIGNED-NEXT:  .LBB4_1: ; %load-store-loop
+; ALIGNED-NEXT:  .LBB4_1: ; %static-memcpy-expansion-main-body
 ; ALIGNED-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; ALIGNED-NEXT:    s_clause 0x3e
 ; ALIGNED-NEXT:    buffer_load_ubyte v0, v2, s[0:3], 0 offen offset:20
@@ -5282,7 +5282,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; ALIGNED-NEXT:    s_waitcnt vmcnt(0)
 ; ALIGNED-NEXT:    flat_store_byte v[3:4], v0
 ; ALIGNED-NEXT:    s_cbranch_vccnz .LBB4_1
-; ALIGNED-NEXT:  ; %bb.2: ; %memcpy-split
+; ALIGNED-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; ALIGNED-NEXT:    s_clause 0x2f ; 192-byte Folded Reload
 ; ALIGNED-NEXT:    buffer_load_dword v127, off, s[0:3], s32
 ; ALIGNED-NEXT:    buffer_load_dword v126, off, s[0:3], s32 offset:4
@@ -5342,7 +5342,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; UNROLL3-NEXT:    s_mov_b64 s[4:5], 0
 ; UNROLL3-NEXT:    s_inst_prefetch 0x1
 ; UNROLL3-NEXT:    .p2align 6
-; UNROLL3-NEXT:  .LBB4_1: ; %load-store-loop
+; UNROLL3-NEXT:  .LBB4_1: ; %static-memcpy-expansion-main-body
 ; UNROLL3-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; UNROLL3-NEXT:    s_clause 0xb
 ; UNROLL3-NEXT:    buffer_load_dword v4, v3, s[0:3], 0 offen
@@ -5370,7 +5370,7 @@ define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5)
 ; UNROLL3-NEXT:    flat_store_dwordx4 v[16:17], v[12:15] offset:32
 ; UNROLL3-NEXT:    s_and_b32 vcc_lo, exec_lo, s6
 ; UNROLL3-NEXT:    s_cbranch_vccnz .LBB4_1
-; UNROLL3-NEXT:  ; %bb.2: ; %memcpy-split
+; UNROLL3-NEXT:  ; %bb.2: ; %static-memcpy-post-expansion
 ; UNROLL3-NEXT:    s_inst_prefetch 0x2
 ; UNROLL3-NEXT:    s_clause 0x3
 ; UNROLL3-NEXT:    buffer_load_dword v3, v2, s[0:3], 0 offen offset:2016
diff --git a/llvm/test/CodeGen/AMDGPU/memmove-var-size.ll b/llvm/test/CodeGen/AMDGPU/memmove-var-size.ll
index 953511db10b29..d95965caa81ab 100644
--- a/llvm/test/CodeGen/AMDGPU/memmove-var-size.ll
+++ b/llvm/test/CodeGen/AMDGPU/memmove-var-size.ll
@@ -1051,10 +1051,10 @@ define void @memmove_p1_p3(ptr addrspace(1) align 1 %dst, ptr addrspace(3) align
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[7:8]
 ; CHECK-NEXT:    s_cbranch_execz .LBB7_3
-; CHECK-NEXT:  ; %bb.1: ; %loop-memcpy-expansion.preheader
+; CHECK-NEXT:  ; %bb.1: ; %dynamic-memcpy-expansion-main-body.preheader
 ; CHECK-NEXT:    v_mov_b32_e32 v9, v2
 ; CHECK-NEXT:    s_mov_b32 s7, 0
-; CHECK-NEXT:  .LBB7_2: ; %loop-memcpy-expansion
+; CHECK-NEXT:  .LBB7_2: ; %dynamic-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    ds_read_b128 v[10:13], v9
 ; CHECK-NEXT:    v_add_co_u32 v14, vcc_lo, v0, s4
@@ -1073,15 +1073,14 @@ define void @memmove_p1_p3(ptr addrspace(1) align 1 %dst, ptr addrspace(3) align
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[5:6]
-; CHECK-NEXT:    s_xor_b32 s6, exec_lo, s6
-; CHECK-NEXT:    s_cbranch_execz .LBB7_7
-; CHECK-NEXT:  ; %bb.4: ; %loop-memcpy-residual.preheader
+; CHECK-NEXT:    s_cbranch_execz .LBB7_6
+; CHECK-NEXT:  ; %bb.4: ; %dynamic-memcpy-expansion-residual-body.preheader
 ; CHECK-NEXT:    v_and_b32_e32 v3, -16, v3
 ; CHECK-NEXT:    s_mov_b32 s7, 0
 ; CHECK-NEXT:    v_add_co_u32 v0, vcc_lo, v0, v3
 ; CHECK-NEXT:    v_add_nc_u32_e32 v2, v2, v3
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v1, null, v1, v4, vcc_lo
-; CHECK-NEXT:  .LBB7_5: ; %loop-memcpy-residual
+; CHECK-NEXT:  .LBB7_5: ; %dynamic-memcpy-expansion-residual-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    ds_read_u8 v7, v2
 ; CHECK-NEXT:    v_add_co_u32 v3, vcc_lo, v0, s4
@@ -1095,9 +1094,7 @@ define void @memmove_p1_p3(ptr addrspace(1) align 1 %dst, ptr addrspace(3) align
 ; CHECK-NEXT:    global_store_byte v[3:4], v7, off
 ; CHECK-NEXT:    s_andn2_b32 exec_lo, exec_lo, s7
 ; CHECK-NEXT:    s_cbranch_execnz .LBB7_5
-; CHECK-NEXT:  ; %bb.6: ; %Flow
-; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s7
-; CHECK-NEXT:  .LBB7_7: ; %Flow7
+; CHECK-NEXT:  .LBB7_6: ; %Flow7
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s6
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 entry:
@@ -1263,11 +1260,11 @@ define void @memmove_p1_p5(ptr addrspace(1) align 1 %dst, ptr addrspace(5) align
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[7:8]
 ; CHECK-NEXT:    s_cbranch_execz .LBB9_3
-; CHECK-NEXT:  ; %bb.1: ; %loop-memcpy-expansion.preheader
+; CHECK-NEXT:  ; %bb.1: ; %dynamic-memcpy-expansion-main-body.preheader
 ; CHECK-NEXT:    v_mov_b32_e32 v9, v2
 ; CHECK-NEXT:    s_mov_b32 s7, 0
 ; CHECK-NEXT:    .p2align 6
-; CHECK-NEXT:  .LBB9_2: ; %loop-memcpy-expansion
+; CHECK-NEXT:  .LBB9_2: ; %dynamic-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    s_clause 0x3
 ; CHECK-NEXT:    buffer_load_dword v10, v9, s[0:3], 0 offen
@@ -1290,15 +1287,14 @@ define void @memmove_p1_p5(ptr addrspace(1) align 1 %dst, ptr addrspace(5) align
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[5:6]
-; CHECK-NEXT:    s_xor_b32 s6, exec_lo, s6
-; CHECK-NEXT:    s_cbranch_execz .LBB9_7
-; CHECK-NEXT:  ; %bb.4: ; %loop-memcpy-residual.preheader
+; CHECK-NEXT:    s_cbranch_execz .LBB9_6
+; CHECK-NEXT:  ; %bb.4: ; %dynamic-memcpy-expansion-residual-body.preheader
 ; CHECK-NEXT:    v_and_b32_e32 v3, -16, v3
 ; CHECK-NEXT:    s_mov_b32 s7, 0
 ; CHECK-NEXT:    v_add_co_u32 v0, vcc_lo, v0, v3
 ; CHECK-NEXT:    v_add_nc_u32_e32 v2, v2, v3
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v1, null, v1, v4, vcc_lo
-; CHECK-NEXT:  .LBB9_5: ; %loop-memcpy-residual
+; CHECK-NEXT:  .LBB9_5: ; %dynamic-memcpy-expansion-residual-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    buffer_load_ubyte v7, v2, s[0:3], 0 offen
 ; CHECK-NEXT:    v_add_co_u32 v3, vcc_lo, v0, s4
@@ -1312,9 +1308,7 @@ define void @memmove_p1_p5(ptr addrspace(1) align 1 %dst, ptr addrspace(5) align
 ; CHECK-NEXT:    global_store_byte v[3:4], v7, off
 ; CHECK-NEXT:    s_andn2_b32 exec_lo, exec_lo, s7
 ; CHECK-NEXT:    s_cbranch_execnz .LBB9_5
-; CHECK-NEXT:  ; %bb.6: ; %Flow
-; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s7
-; CHECK-NEXT:  .LBB9_7: ; %Flow7
+; CHECK-NEXT:  .LBB9_6: ; %Flow7
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s6
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 entry:
@@ -1479,10 +1473,10 @@ define void @memmove_p3_p1(ptr addrspace(3) align 1 %dst, ptr addrspace(1) align
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[7:8]
 ; CHECK-NEXT:    s_cbranch_execz .LBB11_3
-; CHECK-NEXT:  ; %bb.1: ; %loop-memcpy-expansion.preheader
+; CHECK-NEXT:  ; %bb.1: ; %dynamic-memcpy-expansion-main-body.preheader
 ; CHECK-NEXT:    v_mov_b32_e32 v9, v0
 ; CHECK-NEXT:    s_mov_b32 s7, 0
-; CHECK-NEXT:  .LBB11_2: ; %loop-memcpy-expansion
+; CHECK-NEXT:  .LBB11_2: ; %dynamic-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    v_add_co_u32 v10, vcc_lo, v1, s4
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v11, null, s5, v2, vcc_lo
@@ -1501,15 +1495,14 @@ define void @memmove_p3_p1(ptr addrspace(3) align 1 %dst, ptr addrspace(1) align
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[5:6]
-; CHECK-NEXT:    s_xor_b32 s6, exec_lo, s6
-; CHECK-NEXT:    s_cbranch_execz .LBB11_7
-; CHECK-NEXT:  ; %bb.4: ; %loop-memcpy-residual.preheader
+; CHECK-NEXT:    s_cbranch_execz .LBB11_6
+; CHECK-NEXT:  ; %bb.4: ; %dynamic-memcpy-expansion-residual-body.preheader
 ; CHECK-NEXT:    v_and_b32_e32 v3, -16, v3
 ; CHECK-NEXT:    s_mov_b32 s7, 0
 ; CHECK-NEXT:    v_add_co_u32 v1, vcc_lo, v1, v3
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, v0, v3
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v2, null, v2, v4, vcc_lo
-; CHECK-NEXT:  .LBB11_5: ; %loop-memcpy-residual
+; CHECK-NEXT:  .LBB11_5: ; %dynamic-memcpy-expansion-residual-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    v_add_co_u32 v3, vcc_lo, v1, s4
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v4, null, s5, v2, vcc_lo
@@ -1523,9 +1516,7 @@ define void @memmove_p3_p1(ptr addrspace(3) align 1 %dst, ptr addrspace(1) align
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, 1, v0
 ; CHECK-NEXT:    s_andn2_b32 exec_lo, exec_lo, s7
 ; CHECK-NEXT:    s_cbranch_execnz .LBB11_5
-; CHECK-NEXT:  ; %bb.6: ; %Flow
-; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s7
-; CHECK-NEXT:  .LBB11_7: ; %Flow7
+; CHECK-NEXT:  .LBB11_6: ; %Flow7
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s6
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
@@ -1673,10 +1664,10 @@ define void @memmove_p3_p4(ptr addrspace(3) align 1 %dst, ptr addrspace(4) align
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[7:8]
 ; CHECK-NEXT:    s_cbranch_execz .LBB13_3
-; CHECK-NEXT:  ; %bb.1: ; %loop-memcpy-expansion.preheader
+; CHECK-NEXT:  ; %bb.1: ; %dynamic-memcpy-expansion-main-body.preheader
 ; CHECK-NEXT:    v_mov_b32_e32 v9, v0
 ; CHECK-NEXT:    s_mov_b32 s7, 0
-; CHECK-NEXT:  .LBB13_2: ; %loop-memcpy-expansion
+; CHECK-NEXT:  .LBB13_2: ; %dynamic-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    v_add_co_u32 v10, vcc_lo, v1, s4
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v11, null, s5, v2, vcc_lo
@@ -1695,15 +1686,14 @@ define void @memmove_p3_p4(ptr addrspace(3) align 1 %dst, ptr addrspace(4) align
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[5:6]
-; CHECK-NEXT:    s_xor_b32 s6, exec_lo, s6
-; CHECK-NEXT:    s_cbranch_execz .LBB13_7
-; CHECK-NEXT:  ; %bb.4: ; %loop-memcpy-residual.preheader
+; CHECK-NEXT:    s_cbranch_execz .LBB13_6
+; CHECK-NEXT:  ; %bb.4: ; %dynamic-memcpy-expansion-residual-body.preheader
 ; CHECK-NEXT:    v_and_b32_e32 v3, -16, v3
 ; CHECK-NEXT:    s_mov_b32 s7, 0
 ; CHECK-NEXT:    v_add_co_u32 v1, vcc_lo, v1, v3
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, v0, v3
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v2, null, v2, v4, vcc_lo
-; CHECK-NEXT:  .LBB13_5: ; %loop-memcpy-residual
+; CHECK-NEXT:  .LBB13_5: ; %dynamic-memcpy-expansion-residual-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    v_add_co_u32 v3, vcc_lo, v1, s4
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v4, null, s5, v2, vcc_lo
@@ -1717,9 +1707,7 @@ define void @memmove_p3_p4(ptr addrspace(3) align 1 %dst, ptr addrspace(4) align
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, 1, v0
 ; CHECK-NEXT:    s_andn2_b32 exec_lo, exec_lo, s7
 ; CHECK-NEXT:    s_cbranch_execnz .LBB13_5
-; CHECK-NEXT:  ; %bb.6: ; %Flow
-; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s7
-; CHECK-NEXT:  .LBB13_7: ; %Flow7
+; CHECK-NEXT:  .LBB13_6: ; %Flow7
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s6
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
@@ -1740,12 +1728,12 @@ define void @memmove_p3_p5(ptr addrspace(3) align 1 %dst, ptr addrspace(5) align
 ; CHECK-NEXT:    v_and_b32_e32 v5, 15, v4
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[2:3]
 ; CHECK-NEXT:    s_cbranch_execz .LBB14_3
-; CHECK-NEXT:  ; %bb.1: ; %loop-memcpy-expansion.preheader
+; CHECK-NEXT:  ; %bb.1: ; %dynamic-memcpy-expansion-main-body.preheader
 ; CHECK-NEXT:    v_mov_b32_e32 v7, v1
 ; CHECK-NEXT:    v_mov_b32_e32 v8, v0
 ; CHECK-NEXT:    s_mov_b32 s7, 0
 ; CHECK-NEXT:    .p2align 6
-; CHECK-NEXT:  .LBB14_2: ; %loop-memcpy-expansion
+; CHECK-NEXT:  .LBB14_2: ; %dynamic-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    s_clause 0x3
 ; CHECK-NEXT:    buffer_load_dword v9, v7, s[0:3], 0 offen
@@ -1767,14 +1755,13 @@ define void @memmove_p3_p5(ptr addrspace(3) align 1 %dst, ptr addrspace(5) align
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[5:6]
-; CHECK-NEXT:    s_xor_b32 s6, exec_lo, s6
-; CHECK-NEXT:    s_cbranch_execz .LBB14_7
-; CHECK-NEXT:  ; %bb.4: ; %loop-memcpy-residual.preheader
+; CHECK-NEXT:    s_cbranch_execz .LBB14_6
+; CHECK-NEXT:  ; %bb.4: ; %dynamic-memcpy-expansion-residual-body.preheader
 ; CHECK-NEXT:    v_and_b32_e32 v2, -16, v4
 ; CHECK-NEXT:    s_mov_b32 s7, 0
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, v0, v2
 ; CHECK-NEXT:    v_add_nc_u32_e32 v1, v1, v2
-; CHECK-NEXT:  .LBB14_5: ; %loop-memcpy-residual
+; CHECK-NEXT:  .LBB14_5: ; %dynamic-memcpy-expansion-residual-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    buffer_load_ubyte v2, v1, s[0:3], 0 offen
 ; CHECK-NEXT:    s_add_u32 s4, s4, 1
@@ -1787,9 +1774,7 @@ define void @memmove_p3_p5(ptr addrspace(3) align 1 %dst, ptr addrspace(5) align
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, 1, v0
 ; CHECK-NEXT:    s_andn2_b32 exec_lo, exec_lo, s7
 ; CHECK-NEXT:    s_cbranch_execnz .LBB14_5
-; CHECK-NEXT:  ; %bb.6: ; %Flow
-; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s7
-; CHECK-NEXT:  .LBB14_7: ; %Flow12
+; CHECK-NEXT:  .LBB14_6: ; %Flow12
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s6
 ; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
@@ -1959,11 +1944,11 @@ define void @memmove_p5_p1(ptr addrspace(5) align 1 %dst, ptr addrspace(1) align
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[7:8]
 ; CHECK-NEXT:    s_cbranch_execz .LBB16_3
-; CHECK-NEXT:  ; %bb.1: ; %loop-memcpy-expansion.preheader
+; CHECK-NEXT:  ; %bb.1: ; %dynamic-memcpy-expansion-main-body.preheader
 ; CHECK-NEXT:    v_mov_b32_e32 v9, v0
 ; CHECK-NEXT:    s_mov_b32 s7, 0
 ; CHECK-NEXT:    .p2align 6
-; CHECK-NEXT:  .LBB16_2: ; %loop-memcpy-expansion
+; CHECK-NEXT:  .LBB16_2: ; %dynamic-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    v_add_co_u32 v10, vcc_lo, v1, s4
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v11, null, s5, v2, vcc_lo
@@ -1985,15 +1970,14 @@ define void @memmove_p5_p1(ptr addrspace(5) align 1 %dst, ptr addrspace(1) align
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[5:6]
-; CHECK-NEXT:    s_xor_b32 s6, exec_lo, s6
-; CHECK-NEXT:    s_cbranch_execz .LBB16_7
-; CHECK-NEXT:  ; %bb.4: ; %loop-memcpy-residual.preheader
+; CHECK-NEXT:    s_cbranch_execz .LBB16_6
+; CHECK-NEXT:  ; %bb.4: ; %dynamic-memcpy-expansion-residual-body.preheader
 ; CHECK-NEXT:    v_and_b32_e32 v3, -16, v3
 ; CHECK-NEXT:    s_mov_b32 s7, 0
 ; CHECK-NEXT:    v_add_co_u32 v1, vcc_lo, v1, v3
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, v0, v3
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v2, null, v2, v4, vcc_lo
-; CHECK-NEXT:  .LBB16_5: ; %loop-memcpy-residual
+; CHECK-NEXT:  .LBB16_5: ; %dynamic-memcpy-expansion-residual-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    v_add_co_u32 v3, vcc_lo, v1, s4
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v4, null, s5, v2, vcc_lo
@@ -2007,9 +1991,7 @@ define void @memmove_p5_p1(ptr addrspace(5) align 1 %dst, ptr addrspace(1) align
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, 1, v0
 ; CHECK-NEXT:    s_andn2_b32 exec_lo, exec_lo, s7
 ; CHECK-NEXT:    s_cbranch_execnz .LBB16_5
-; CHECK-NEXT:  ; %bb.6: ; %Flow
-; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s7
-; CHECK-NEXT:  .LBB16_7: ; %Flow7
+; CHECK-NEXT:  .LBB16_6: ; %Flow7
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s6
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 entry:
@@ -2029,12 +2011,12 @@ define void @memmove_p5_p3(ptr addrspace(5) align 1 %dst, ptr addrspace(3) align
 ; CHECK-NEXT:    v_and_b32_e32 v5, 15, v4
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[2:3]
 ; CHECK-NEXT:    s_cbranch_execz .LBB17_3
-; CHECK-NEXT:  ; %bb.1: ; %loop-memcpy-expansion.preheader
+; CHECK-NEXT:  ; %bb.1: ; %dynamic-memcpy-expansion-main-body.preheader
 ; CHECK-NEXT:    v_mov_b32_e32 v7, v1
 ; CHECK-NEXT:    v_mov_b32_e32 v8, v0
 ; CHECK-NEXT:    s_mov_b32 s7, 0
 ; CHECK-NEXT:    .p2align 6
-; CHECK-NEXT:  .LBB17_2: ; %loop-memcpy-expansion
+; CHECK-NEXT:  .LBB17_2: ; %dynamic-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    ds_read_b128 v[9:12], v7
 ; CHECK-NEXT:    s_add_u32 s4, s4, 16
@@ -2055,14 +2037,13 @@ define void @memmove_p5_p3(ptr addrspace(5) align 1 %dst, ptr addrspace(3) align
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[5:6]
-; CHECK-NEXT:    s_xor_b32 s6, exec_lo, s6
-; CHECK-NEXT:    s_cbranch_execz .LBB17_7
-; CHECK-NEXT:  ; %bb.4: ; %loop-memcpy-residual.preheader
+; CHECK-NEXT:    s_cbranch_execz .LBB17_6
+; CHECK-NEXT:  ; %bb.4: ; %dynamic-memcpy-expansion-residual-body.preheader
 ; CHECK-NEXT:    v_and_b32_e32 v2, -16, v4
 ; CHECK-NEXT:    s_mov_b32 s7, 0
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, v0, v2
 ; CHECK-NEXT:    v_add_nc_u32_e32 v1, v1, v2
-; CHECK-NEXT:  .LBB17_5: ; %loop-memcpy-residual
+; CHECK-NEXT:  .LBB17_5: ; %dynamic-memcpy-expansion-residual-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    ds_read_u8 v2, v1
 ; CHECK-NEXT:    s_add_u32 s4, s4, 1
@@ -2075,9 +2056,7 @@ define void @memmove_p5_p3(ptr addrspace(5) align 1 %dst, ptr addrspace(3) align
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, 1, v0
 ; CHECK-NEXT:    s_andn2_b32 exec_lo, exec_lo, s7
 ; CHECK-NEXT:    s_cbranch_execnz .LBB17_5
-; CHECK-NEXT:  ; %bb.6: ; %Flow
-; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s7
-; CHECK-NEXT:  .LBB17_7: ; %Flow12
+; CHECK-NEXT:  .LBB17_6: ; %Flow12
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s6
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 entry:
@@ -2097,11 +2076,11 @@ define void @memmove_p5_p4(ptr addrspace(5) align 1 %dst, ptr addrspace(4) align
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[7:8]
 ; CHECK-NEXT:    s_cbranch_execz .LBB18_3
-; CHECK-NEXT:  ; %bb.1: ; %loop-memcpy-expansion.preheader
+; CHECK-NEXT:  ; %bb.1: ; %dynamic-memcpy-expansion-main-body.preheader
 ; CHECK-NEXT:    v_mov_b32_e32 v9, v0
 ; CHECK-NEXT:    s_mov_b32 s7, 0
 ; CHECK-NEXT:    .p2align 6
-; CHECK-NEXT:  .LBB18_2: ; %loop-memcpy-expansion
+; CHECK-NEXT:  .LBB18_2: ; %dynamic-memcpy-expansion-main-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    v_add_co_u32 v10, vcc_lo, v1, s4
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v11, null, s5, v2, vcc_lo
@@ -2123,15 +2102,14 @@ define void @memmove_p5_p4(ptr addrspace(5) align 1 %dst, ptr addrspace(4) align
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
 ; CHECK-NEXT:    s_mov_b32 s6, exec_lo
 ; CHECK-NEXT:    v_cmpx_ne_u64_e32 0, v[5:6]
-; CHECK-NEXT:    s_xor_b32 s6, exec_lo, s6
-; CHECK-NEXT:    s_cbranch_execz .LBB18_7
-; CHECK-NEXT:  ; %bb.4: ; %loop-memcpy-residual.preheader
+; CHECK-NEXT:    s_cbranch_execz .LBB18_6
+; CHECK-NEXT:  ; %bb.4: ; %dynamic-memcpy-expansion-residual-body.preheader
 ; CHECK-NEXT:    v_and_b32_e32 v3, -16, v3
 ; CHECK-NEXT:    s_mov_b32 s7, 0
 ; CHECK-NEXT:    v_add_co_u32 v1, vcc_lo, v1, v3
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, v0, v3
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v2, null, v2, v4, vcc_lo
-; CHECK-NEXT:  .LBB18_5: ; %loop-memcpy-residual
+; CHECK-NEXT:  .LBB18_5: ; %dynamic-memcpy-expansion-residual-body
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    v_add_co_u32 v3, vcc_lo, v1, s4
 ; CHECK-NEXT:    v_add_co_ci_u32_e64 v4, null, s5, v2, vcc_lo
@@ -2145,9 +2123,7 @@ define void @memmove_p5_p4(ptr addrspace(5) align 1 %dst, ptr addrspace(4) align
 ; CHECK-NEXT:    v_add_nc_u32_e32 v0, 1, v0
 ; CHECK-NEXT:    s_andn2_b32 exec_lo, exec_lo, s7
 ; CHECK-NEXT:    s_cbranch_execnz .LBB18_5
-; CHECK-NEXT:  ; %bb.6: ; %Flow
-; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s7
-; CHECK-NEXT:  .LBB18_7: ; %Flow7
+; CHECK-NEXT:  .LBB18_6: ; %Flow7
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s6
 ; CHECK-NEXT:    s_setpc_b64 s[30:31]
 entry:
diff --git a/llvm/test/CodeGen/AMDGPU/misched-into-wmma-hazard-shadow.mir b/llvm/test/CodeGen/AMDGPU/misched-into-wmma-hazard-shadow.mir
new file mode 100644
index 0000000000000..e3c8acc837f09
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/misched-into-wmma-hazard-shadow.mir
@@ -0,0 +1,56 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 6
+# RUN: llc -march=amdgcn -mcpu=gfx1250 -run-pass=postmisched,post-RA-hazard-rec %s -o - | FileCheck --check-prefix=GCN %s
+
+# Bring all independent V_LSHL_ADD_U32_e64 instructions into the shadow
+# of the WMMA so that the hazard recognizer only needs to insert 4 V_NOP_e32
+# instructions instead of 8.
+
+---
+name:            test_wmma_scale_f32_16x16x128_f8f6f4_shadow_sched
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6, $vgpr7, $vgpr8, $vgpr9, $vgpr10, $vgpr11, $vgpr12, $vgpr13, $vgpr14, $vgpr15, $vgpr16, $vgpr17, $vgpr18, $vgpr19, $vgpr20, $vgpr21, $vgpr22, $vgpr23, $vgpr24, $vgpr25, $vgpr26, $vgpr27, $vgpr28, $vgpr29, $vgpr30, $vgpr31, $vgpr32, $vgpr33, $vgpr34, $vgpr35, $vgpr36, $vgpr37, $vgpr38, $vgpr39, $vgpr40, $vgpr41, $vgpr42, $vgpr43, $vgpr44, $vgpr45
+
+    ; GCN-LABEL: name: test_wmma_scale_f32_16x16x128_f8f6f4_shadow_sched
+    ; GCN: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6, $vgpr7, $vgpr8, $vgpr9, $vgpr10, $vgpr11, $vgpr12, $vgpr13, $vgpr14, $vgpr15, $vgpr16, $vgpr17, $vgpr18, $vgpr19, $vgpr20, $vgpr21, $vgpr22, $vgpr23, $vgpr24, $vgpr25, $vgpr26, $vgpr27, $vgpr28, $vgpr29, $vgpr30, $vgpr31, $vgpr32, $vgpr33, $vgpr34, $vgpr35, $vgpr36, $vgpr37, $vgpr38, $vgpr39, $vgpr40, $vgpr41, $vgpr42, $vgpr43, $vgpr44, $vgpr45
+    ; GCN-NEXT: {{  $}}
+    ; GCN-NEXT: early-clobber $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39 = V_WMMA_SCALE_F32_16X16X128_F8F6F4_f8_f8_w32_twoaddr $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, killed $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, 8, killed $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39, 0, 0, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, implicit $exec
+    ; GCN-NEXT: $vgpr46 = V_LSHL_ADD_U32_e64 killed $vgpr43, 1, $vgpr43, implicit $exec
+    ; GCN-NEXT: $vgpr47 = V_LSHL_ADD_U32_e64 killed $vgpr42, 1, $vgpr42, implicit $exec
+    ; GCN-NEXT: $vgpr48 = V_LSHL_ADD_U32_e64 killed $vgpr41, 1, $vgpr41, implicit $exec
+    ; GCN-NEXT: $vgpr49 = V_LSHL_ADD_U32_e64 killed $vgpr40, 1, $vgpr40, implicit $exec
+    ; GCN-NEXT: V_NOP_e32 implicit $exec
+    ; GCN-NEXT: V_NOP_e32 implicit $exec
+    ; GCN-NEXT: V_NOP_e32 implicit $exec
+    ; GCN-NEXT: V_NOP_e32 implicit $exec
+    ; GCN-NEXT: $vgpr4 = V_ADD_F32_e32 1065353216, killed $vgpr4, implicit $mode, implicit $exec
+    ; GCN-NEXT: $vgpr5 = V_ADD_F32_e32 1065353216, killed $vgpr5, implicit $mode, implicit $exec
+    ; GCN-NEXT: $vgpr6 = V_ADD_F32_e32 1065353216, killed $vgpr6, implicit $mode, implicit $exec
+    ; GCN-NEXT: $vgpr7 = V_ADD_F32_e32 1065353216, killed $vgpr7, implicit $mode, implicit $exec
+    ; GCN-NEXT: $vgpr8 = V_ADD_F32_e32 1065353216, killed $vgpr8, implicit $mode, implicit $exec
+    ; GCN-NEXT: $vgpr9 = V_ADD_F32_e32 1065353216, killed $vgpr9, implicit $mode, implicit $exec
+    ; GCN-NEXT: $vgpr10 = V_ADD_F32_e32 1065353216, killed $vgpr10, implicit $mode, implicit $exec
+    ; GCN-NEXT: $vgpr11 = V_ADD_F32_e32 1065353216, killed $vgpr11, implicit $mode, implicit $exec
+    ; GCN-NEXT: GLOBAL_STORE_DWORDX4 renamable $vgpr44_vgpr45, killed renamable $vgpr4_vgpr5_vgpr6_vgpr7, 0, 0, implicit $exec
+    ; GCN-NEXT: GLOBAL_STORE_DWORDX4 renamable $vgpr44_vgpr45, killed renamable $vgpr8_vgpr9_vgpr10_vgpr11, 32, 0, implicit $exec
+    ; GCN-NEXT: GLOBAL_STORE_DWORDX4 killed renamable $vgpr44_vgpr45, killed renamable $vgpr46_vgpr47_vgpr48_vgpr49, 64, 0, implicit $exec
+    ; GCN-NEXT: S_ENDPGM 0
+    early-clobber $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39 = V_WMMA_SCALE_F32_16X16X128_F8F6F4_f8_f8_w32_twoaddr $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, 8, $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39, 0, 0, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, implicit $exec
+    $vgpr4 = V_ADD_F32_e32 1065353216, $vgpr4, implicit $mode, implicit $exec
+    $vgpr5 = V_ADD_F32_e32 1065353216, $vgpr5, implicit $mode, implicit $exec
+    $vgpr6 = V_ADD_F32_e32 1065353216, $vgpr6, implicit $mode, implicit $exec
+    $vgpr7 = V_ADD_F32_e32 1065353216, $vgpr7, implicit $mode, implicit $exec
+    $vgpr8 = V_ADD_F32_e32 1065353216, $vgpr8, implicit $mode, implicit $exec
+    $vgpr9 = V_ADD_F32_e32 1065353216, $vgpr9, implicit $mode, implicit $exec
+    $vgpr10 = V_ADD_F32_e32 1065353216, $vgpr10, implicit $mode, implicit $exec
+    $vgpr11 = V_ADD_F32_e32 1065353216, $vgpr11, implicit $mode, implicit $exec
+    $vgpr46 = V_LSHL_ADD_U32_e64 $vgpr43, 1, $vgpr43, implicit $exec
+    $vgpr47 = V_LSHL_ADD_U32_e64 $vgpr42, 1, $vgpr42, implicit $exec
+    $vgpr48 = V_LSHL_ADD_U32_e64 $vgpr41, 1, $vgpr41, implicit $exec
+    $vgpr49 = V_LSHL_ADD_U32_e64 $vgpr40, 1, $vgpr40, implicit $exec
+    GLOBAL_STORE_DWORDX4 renamable $vgpr44_vgpr45, killed renamable $vgpr4_vgpr5_vgpr6_vgpr7, 0, 0, implicit $exec
+    GLOBAL_STORE_DWORDX4 renamable $vgpr44_vgpr45, killed renamable $vgpr8_vgpr9_vgpr10_vgpr11, 32, 0, implicit $exec
+    GLOBAL_STORE_DWORDX4 renamable $vgpr44_vgpr45, killed renamable $vgpr46_vgpr47_vgpr48_vgpr49, 64, 0, implicit $exec
+    S_ENDPGM 0
+...
diff --git a/llvm/test/CodeGen/AMDGPU/occupancy-levels.ll b/llvm/test/CodeGen/AMDGPU/occupancy-levels.ll
index d1ab92e1d48ff..9278c024f8905 100644
--- a/llvm/test/CodeGen/AMDGPU/occupancy-levels.ll
+++ b/llvm/test/CodeGen/AMDGPU/occupancy-levels.ll
@@ -1,4 +1,5 @@
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -mattr=-xnack < %s | FileCheck --check-prefixes=GCN,GFX9 %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx950 -mattr=-xnack < %s | FileCheck --check-prefixes=GCN,GFX950 %s
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=-xnack < %s | FileCheck --check-prefixes=GCN,GFX10,GFX10W32,GFX1010,GFX1010W32 %s
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=-xnack -mattr=+wavefrontsize64 < %s | FileCheck --check-prefixes=GCN,GFX10,GFX10W64,GFX1010,GFX1010W64 %s
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck --check-prefixes=GCN,GFX10,GFX10W32,GFX1030,GFX1030W32 %s
@@ -19,49 +20,60 @@
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1153 -mattr=+wavefrontsize64 < %s | FileCheck --check-prefixes=GCN,GFX1030,GFX1030W64 %s
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 < %s | FileCheck --check-prefixes=GCN,GFX1100,GFX1100W32 %s
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1200 -mattr=+wavefrontsize64 < %s | FileCheck --check-prefixes=GCN,GFX1100,GFX1100W64 %s
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1250 < %s | FileCheck --check-prefixes=GCN,GFX1250 %s
 
 ; GCN-LABEL: {{^}}max_occupancy:
 ; GFX9:       ; Occupancy: 10
+; GFX950:     ; Occupancy: 8
 ; GFX1010:    ; Occupancy: 20
 ; GFX1030:    ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @max_occupancy() #10 {
   ret void
 }
 
 ; GCN-LABEL: {{^}}limited_occupancy_3:
 ; GFX9:       ; Occupancy: 3
+; GFX950:     ; Occupancy: 3
 ; GFX10W64:   ; Occupancy: 3
 ; GFX10W32:   ; Occupancy: 4
 ; GFX1100W64: ; Occupancy: 3
 ; GFX1100W32: ; Occupancy: 5
+; GFX1250:    ; Occupancy: 3
 define amdgpu_kernel void @limited_occupancy_3() #0 {
   ret void
 }
 
 ; GCN-LABEL: {{^}}limited_occupancy_18:
 ; GFX9:       ; Occupancy: 10
+; GFX950:     ; Occupancy: 8
 ; GFX1010:    ; Occupancy: 18
 ; GFX1030:    ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @limited_occupancy_18() #1 {
   ret void
 }
 
 ; GCN-LABEL: {{^}}limited_occupancy_19:
 ; GFX9:       ; Occupancy: 10
+; GFX950:     ; Occupancy: 8
 ; GFX1010:    ; Occupancy: 20
 ; GFX1030:    ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @limited_occupancy_19() #2 {
   ret void
 }
 
 ; GCN-LABEL: {{^}}used_24_vgprs:
 ; GFX9:       ; Occupancy: 10
+; GFX950:     ; Occupancy: 8
 ; GFX1010:    ; Occupancy: 20
 ; GFX1030:    ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @used_24_vgprs() #10 {
   call void asm sideeffect "", "~{v23}" ()
   ret void
@@ -69,10 +81,12 @@ define amdgpu_kernel void @used_24_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_28_vgprs:
 ; GFX9:       ; Occupancy: 9
+; GFX950:     ; Occupancy: 8
 ; GFX1010W64: ; Occupancy: 18
 ; GFX1010W32: ; Occupancy: 20
 ; GFX1030:    ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @used_28_vgprs() #10 {
   call void asm sideeffect "", "~{v27}" ()
   ret void
@@ -80,10 +94,12 @@ define amdgpu_kernel void @used_28_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_32_vgprs:
 ; GFX9:       ; Occupancy: 8
+; GFX950:     ; Occupancy: 8
 ; GFX10W64:   ; Occupancy: 16
 ; GFX1010W32: ; Occupancy: 20
 ; GFX1030W32: ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @used_32_vgprs() #10 {
   call void asm sideeffect "", "~{v31}" ()
   ret void
@@ -91,11 +107,13 @@ define amdgpu_kernel void @used_32_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_36_vgprs:
 ; GFX9:       ; Occupancy: 7
+; GFX950:     ; Occupancy: 8
 ; GFX1010W64: ; Occupancy: 14
 ; GFX1010W32: ; Occupancy: 20
 ; GFX1030W64: ; Occupancy: 12
 ; GFX1030W32: ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @used_36_vgprs() #10 {
   call void asm sideeffect "", "~{v35}" ()
   ret void
@@ -103,10 +121,12 @@ define amdgpu_kernel void @used_36_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_40_vgprs:
 ; GFX9:       ; Occupancy: 6
+; GFX950:     ; Occupancy: 8
 ; GFX10W64:   ; Occupancy: 12
 ; GFX1010W32: ; Occupancy: 20
 ; GFX1030W32: ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @used_40_vgprs() #10 {
   call void asm sideeffect "", "~{v39}" ()
   ret void
@@ -114,11 +134,13 @@ define amdgpu_kernel void @used_40_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_44_vgprs:
 ; GFX9:       ; Occupancy: 5
+; GFX950:     ; Occupancy: 8
 ; GFX1010W64: ; Occupancy: 11
 ; GFX1010W32: ; Occupancy: 20
 ; GFX1030W64: ; Occupancy: 10
 ; GFX1030W32: ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @used_44_vgprs() #10 {
   call void asm sideeffect "", "~{v43}" ()
   ret void
@@ -126,10 +148,12 @@ define amdgpu_kernel void @used_44_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_48_vgprs:
 ; GFX9:       ; Occupancy: 5
+; GFX950:     ; Occupancy: 8
 ; GFX10W64:   ; Occupancy: 10
 ; GFX1010W32: ; Occupancy: 20
 ; GFX1030W32: ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @used_48_vgprs() #10 {
   call void asm sideeffect "", "~{v47}" ()
   ret void
@@ -137,11 +161,13 @@ define amdgpu_kernel void @used_48_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_56_vgprs:
 ; GFX9:       ; Occupancy: 4
+; GFX950:     ; Occupancy: 8
 ; GFX10W64:   ; Occupancy: 9
 ; GFX1010W32: ; Occupancy: 18
 ; GFX1030W32: ; Occupancy: 16
 ; GFX1100W64: ; Occupancy: 12
 ; GFX1100W32: ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @used_56_vgprs() #10 {
   call void asm sideeffect "", "~{v55}" ()
   ret void
@@ -149,10 +175,12 @@ define amdgpu_kernel void @used_56_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_64_vgprs:
 ; GFX9:       ; Occupancy: 4
+; GFX950:     ; Occupancy: 8
 ; GFX10W64:   ; Occupancy: 8
 ; GFX10W32:   ; Occupancy: 16
 ; GFX1100W64: ; Occupancy: 10
 ; GFX1100W32: ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @used_64_vgprs() #10 {
   call void asm sideeffect "", "~{v63}" ()
   ret void
@@ -160,11 +188,13 @@ define amdgpu_kernel void @used_64_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_72_vgprs:
 ; GFX9:       ; Occupancy: 3
+; GFX950:     ; Occupancy: 7
 ; GFX10W64:   ; Occupancy: 7
 ; GFX1010W32: ; Occupancy: 14
 ; GFX1030W32: ; Occupancy: 12
 ; GFX1100W64: ; Occupancy: 10
 ; GFX1100W32: ; Occupancy: 16
+; GFX1250:    ; Occupancy: 12
 define amdgpu_kernel void @used_72_vgprs() #10 {
   call void asm sideeffect "", "~{v71}" ()
   ret void
@@ -172,10 +202,12 @@ define amdgpu_kernel void @used_72_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_80_vgprs:
 ; GFX9:       ; Occupancy: 3
+; GFX950:     ; Occupancy: 6
 ; GFX10W64:   ; Occupancy: 6
 ; GFX10W32:   ; Occupancy: 12
 ; GFX1100W64: ; Occupancy: 9
 ; GFX1100W32: ; Occupancy: 16
+; GFX1250:    ; Occupancy: 12
 define amdgpu_kernel void @used_80_vgprs() #10 {
   call void asm sideeffect "", "~{v79}" ()
   ret void
@@ -183,12 +215,14 @@ define amdgpu_kernel void @used_80_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_84_vgprs:
 ; GFX9:       ; Occupancy: 3
+; GFX950:     ; Occupancy: 5
 ; GFX1010W64: ; Occupancy: 6
 ; GFX1010W32: ; Occupancy: 11
 ; GFX1030W64: ; Occupancy: 5
 ; GFX1030W32: ; Occupancy: 10
 ; GFX1100W64: ; Occupancy: 9
 ; GFX1100W32: ; Occupancy: 16
+; GFX1250:    ; Occupancy: 10
 define amdgpu_kernel void @used_84_vgprs() #10 {
   call void asm sideeffect "", "~{v83}" ()
   ret void
@@ -196,11 +230,13 @@ define amdgpu_kernel void @used_84_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_88_vgprs:
 ; GFX9:       ; Occupancy: 2
+; GFX950:     ; Occupancy: 5
 ; GFX10W64:   ; Occupancy: 5
 ; GFX1010W32: ; Occupancy: 11
 ; GFX1030W32: ; Occupancy: 10
 ; GFX1100W64: ; Occupancy: 8
 ; GFX1100W32: ; Occupancy: 16
+; GFX1250:    ; Occupancy: 10
 define amdgpu_kernel void @used_88_vgprs() #10 {
   call void asm sideeffect "", "~{v87}" ()
   ret void
@@ -208,10 +244,12 @@ define amdgpu_kernel void @used_88_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_96_vgprs:
 ; GFX9:       ; Occupancy: 2
+; GFX950:     ; Occupancy: 5
 ; GFX10W64:   ; Occupancy: 5
 ; GFX10W32:   ; Occupancy: 10
 ; GFX1100W64: ; Occupancy: 8
 ; GFX1100W32: ; Occupancy: 16
+; GFX1250:    ; Occupancy: 10
 define amdgpu_kernel void @used_96_vgprs() #10 {
   call void asm sideeffect "", "~{v95}" ()
   ret void
@@ -219,11 +257,13 @@ define amdgpu_kernel void @used_96_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_100_vgprs:
 ; GFX9:       ; Occupancy: 2
+; GFX950:     ; Occupancy: 4
 ; GFX1010W64: ; Occupancy: 5
 ; GFX1030W64: ; Occupancy: 4
 ; GFX10W32:   ; Occupancy: 9
 ; GFX1100W64: ; Occupancy: 7
 ; GFX1100W32: ; Occupancy: 12
+; GFX1250:    ; Occupancy: 9
 define amdgpu_kernel void @used_100_vgprs() #10 {
   call void asm sideeffect "", "~{v99}" ()
   ret void
@@ -231,10 +271,12 @@ define amdgpu_kernel void @used_100_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_112_vgprs:
 ; GFX9:       ; Occupancy: 2
+; GFX950:     ; Occupancy: 4
 ; GFX10W64:   ; Occupancy: 4
 ; GFX10W32:   ; Occupancy: 9
 ; GFX1100W64: ; Occupancy: 6
 ; GFX1100W32: ; Occupancy: 12
+; GFX1250:    ; Occupancy: 9
 define amdgpu_kernel void @used_112_vgprs() #10 {
   call void asm sideeffect "", "~{v111}" ()
   ret void
@@ -242,10 +284,12 @@ define amdgpu_kernel void @used_112_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_128_vgprs:
 ; GFX9:       ; Occupancy: 2
+; GFX950:     ; Occupancy: 4
 ; GFX10W64:   ; Occupancy: 4
 ; GFX10W32:   ; Occupancy: 8
 ; GFX1100W64: ; Occupancy: 5
 ; GFX1100W32: ; Occupancy: 10
+; GFX1250:    ; Occupancy: 8
 define amdgpu_kernel void @used_128_vgprs() #10 {
   call void asm sideeffect "", "~{v127}" ()
   ret void
@@ -253,10 +297,12 @@ define amdgpu_kernel void @used_128_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_144_vgprs:
 ; GFX9:       ; Occupancy: 1
+; GFX950:     ; Occupancy: 3
 ; GFX10W64:   ; Occupancy: 3
 ; GFX10W32:   ; Occupancy: 7
 ; GFX1100W64: ; Occupancy: 5
 ; GFX1100W32: ; Occupancy: 10
+; GFX1250:    ; Occupancy: 7
 define amdgpu_kernel void @used_144_vgprs() #10 {
   call void asm sideeffect "", "~{v143}" ()
   ret void
@@ -264,11 +310,13 @@ define amdgpu_kernel void @used_144_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_168_vgprs:
 ; GFX9:       ; Occupancy: 1
+; GFX950:     ; Occupancy: 3
 ; GFX10W64:   ; Occupancy: 3
 ; GFX1010W32: ; Occupancy: 6
 ; GFX1030W32: ; Occupancy: 5
 ; GFX1100W64: ; Occupancy: 4
 ; GFX1100W32: ; Occupancy: 9
+; GFX1250:    ; Occupancy: 5
 define amdgpu_kernel void @used_168_vgprs() #10 {
   call void asm sideeffect "", "~{v167}" ()
   ret void
@@ -276,11 +324,13 @@ define amdgpu_kernel void @used_168_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_200_vgprs:
 ; GFX9:       ; Occupancy: 1
+; GFX950:     ; Occupancy: 2
 ; GFX10W64:   ; Occupancy: 2
 ; GFX1010W32: ; Occupancy: 5
 ; GFX1030W32: ; Occupancy: 4
 ; GFX1100W64: ; Occupancy: 3
 ; GFX1100W32: ; Occupancy: 7
+; GFX1250:    ; Occupancy: 4
 define amdgpu_kernel void @used_200_vgprs() #10 {
   call void asm sideeffect "", "~{v199}" ()
   ret void
@@ -288,10 +338,12 @@ define amdgpu_kernel void @used_200_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_256_vgprs:
 ; GFX9:       ; Occupancy: 1
+; GFX950:     ; Occupancy: 2
 ; GFX10W64:   ; Occupancy: 2
 ; GFX10W32:   ; Occupancy: 4
 ; GFX1100W64: ; Occupancy: 2
 ; GFX1100W32: ; Occupancy: 5
+; GFX1250:    ; Occupancy: 4
 define amdgpu_kernel void @used_256_vgprs() #10 {
   call void asm sideeffect "", "~{v255}" ()
   ret void
@@ -299,9 +351,11 @@ define amdgpu_kernel void @used_256_vgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_80_sgprs:
 ; GFX9:       ; Occupancy: 10
+; GFX950:     ; Occupancy: 8
 ; GFX1010:    ; Occupancy: 20
 ; GFX1030:    ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @used_80_sgprs() #10 {
   call void asm sideeffect "", "~{s79}" ()
   ret void
@@ -309,9 +363,11 @@ define amdgpu_kernel void @used_80_sgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_88_sgprs:
 ; GFX9:       ; Occupancy: 9
+; GFX950:     ; Occupancy: 8
 ; GFX1010:    ; Occupancy: 20
 ; GFX1030:    ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @used_88_sgprs() #10 {
   call void asm sideeffect "", "~{s87}" ()
   ret void
@@ -319,9 +375,11 @@ define amdgpu_kernel void @used_88_sgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_100_sgprs:
 ; GFX9:       ; Occupancy: 8
+; GFX950:     ; Occupancy: 7
 ; GFX1010:    ; Occupancy: 20
 ; GFX1030:    ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @used_100_sgprs() #10 {
   call void asm sideeffect "", "~{s99}" ()
   ret void
@@ -329,9 +387,11 @@ define amdgpu_kernel void @used_100_sgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_101_sgprs:
 ; GFX9:       ; Occupancy: 7
+; GFX950:     ; Occupancy: 7
 ; GFX1010:    ; Occupancy: 20
 ; GFX1030:    ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 define amdgpu_kernel void @used_101_sgprs() #10 {
   call void asm sideeffect "", "~{s100}" ()
   ret void
@@ -339,10 +399,12 @@ define amdgpu_kernel void @used_101_sgprs() #10 {
 
 ; GCN-LABEL: {{^}}used_lds_6552:
 ; GFX9:       ; Occupancy: 8
+; GFX950:     ; Occupancy: 8
 ; GFX1010W64: ; Occupancy: 20
 ; GFX1030W64: ; Occupancy: 16
 ; GFX10W32:   ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 @lds6552 = internal addrspace(3) global [6552 x i8] poison, align 4
 define amdgpu_kernel void @used_lds_6552() {
   store volatile i8 1, ptr addrspace(3) @lds6552
@@ -351,10 +413,12 @@ define amdgpu_kernel void @used_lds_6552() {
 
 ; GCN-LABEL: {{^}}used_lds_6556:
 ; GFX9:       ; Occupancy: 8
+; GFX950:     ; Occupancy: 8
 ; GFX1010W64: ; Occupancy: 20
 ; GFX1030W64: ; Occupancy: 16
 ; GFX10W32:   ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 @lds6556 = internal addrspace(3) global [6556 x i8] poison, align 4
 define amdgpu_kernel void @used_lds_6556() {
   store volatile i8 1, ptr addrspace(3) @lds6556
@@ -363,10 +427,12 @@ define amdgpu_kernel void @used_lds_6556() {
 
 ; GCN-LABEL: {{^}}used_lds_13112:
 ; GFX9:       ; Occupancy: 8
+; GFX950:     ; Occupancy: 8
 ; GFX1010W64: ; Occupancy: 20
 ; GFX1030W64: ; Occupancy: 16
 ; GFX10W32:   ; Occupancy: 16
 ; GFX1100:    ; Occupancy: 16
+; GFX1250:    ; Occupancy: 16
 @lds13112 = internal addrspace(3) global [13112 x i8] poison, align 4
 define amdgpu_kernel void @used_lds_13112() {
   store volatile i8 1, ptr addrspace(3) @lds13112
@@ -375,10 +441,12 @@ define amdgpu_kernel void @used_lds_13112() {
 
 ; GCN-LABEL: {{^}}used_lds_8252_max_group_size_64:
 ; GFX9:       ; Occupancy: 2{{$}}
+; GFX950:     ; Occupancy: 5{{$}}
 ; GFX10W64:   ; Occupancy: 4{{$}}
 ; GFX10W32:   ; Occupancy: 8{{$}}
 ; GFX1100W64: ; Occupancy: 4{{$}}
 ; GFX1100W32: ; Occupancy: 8{{$}}
+; GFX1250:    ; Occupancy: 10{{$}}
 @lds8252 = internal addrspace(3) global [8252 x i8] poison, align 4
 define amdgpu_kernel void @used_lds_8252_max_group_size_64() #3 {
   store volatile i8 1, ptr addrspace(3) @lds8252
@@ -387,10 +455,12 @@ define amdgpu_kernel void @used_lds_8252_max_group_size_64() #3 {
 
 ; GCN-LABEL: {{^}}used_lds_8252_max_group_size_96:
 ; GFX9:       ; Occupancy: 4{{$}}
+; GFX950:     ; Occupancy: 8{{$}}
 ; GFX10W64:   ; Occupancy: 8{{$}}
 ; GFX10W32:   ; Occupancy: 12{{$}}
 ; GFX1100W64: ; Occupancy: 8{{$}}
 ; GFX1100W32: ; Occupancy: 12{{$}}
+; GFX1250:    ; Occupancy: 12{{$}}
 define amdgpu_kernel void @used_lds_8252_max_group_size_96() #4 {
   store volatile i8 1, ptr addrspace(3) @lds8252
   ret void
@@ -398,10 +468,12 @@ define amdgpu_kernel void @used_lds_8252_max_group_size_96() #4 {
 
 ; GCN-LABEL: {{^}}used_lds_8252_max_group_size_128:
 ; GFX9:       ; Occupancy: 4{{$}}
+; GFX950:     ; Occupancy: 8{{$}}
 ; GFX10W64:   ; Occupancy: 8{{$}}
 ; GFX10W32:   ; Occupancy: 15{{$}}
 ; GFX1100W64: ; Occupancy: 8{{$}}
 ; GFX1100W32: ; Occupancy: 15{{$}}
+; GFX1250:    ; Occupancy: 16{{$}}
 define amdgpu_kernel void @used_lds_8252_max_group_size_128() #5 {
   store volatile i8 1, ptr addrspace(3) @lds8252
   ret void
@@ -409,11 +481,13 @@ define amdgpu_kernel void @used_lds_8252_max_group_size_128() #5 {
 
 ; GCN-LABEL: {{^}}used_lds_8252_max_group_size_192:
 ; GFX9:       ; Occupancy: 6{{$}}
+; GFX950:     ; Occupancy: 8{{$}}
 ; GFX10W64:   ; Occupancy: 12{{$}}
 ; GFX1010W32: ; Occupancy: 20{{$}}
 ; GFX1030W32: ; Occupancy: 15{{$}}
 ; GFX1100W64: ; Occupancy: 12{{$}}
 ; GFX1100W32: ; Occupancy: 15{{$}}
+; GFX1250:    ; Occupancy: 15{{$}}
 define amdgpu_kernel void @used_lds_8252_max_group_size_192() #6 {
   store volatile i8 1, ptr addrspace(3) @lds8252
   ret void
@@ -421,11 +495,13 @@ define amdgpu_kernel void @used_lds_8252_max_group_size_192() #6 {
 
 ; GCN-LABEL: {{^}}used_lds_8252_max_group_size_256:
 ; GFX9:       ; Occupancy: 7{{$}}
+; GFX950:     ; Occupancy: 8{{$}}
 ; GFX10W64:   ; Occupancy: 15{{$}}
 ; GFX1010W32: ; Occupancy: 20{{$}}
 ; GFX1030W32: ; Occupancy: 16{{$}}
 ; GFX1100W64: ; Occupancy: 15{{$}}
 ; GFX1100W32: ; Occupancy: 16{{$}}
+; GFX1250:    ; Occupancy: 16{{$}}
 define amdgpu_kernel void @used_lds_8252_max_group_size_256() #7 {
   store volatile i8 1, ptr addrspace(3) @lds8252
   ret void
@@ -433,9 +509,11 @@ define amdgpu_kernel void @used_lds_8252_max_group_size_256() #7 {
 
 ; GCN-LABEL: {{^}}used_lds_8252_max_group_size_512:
 ; GFX9:       ; Occupancy: 10{{$}}
+; GFX950:     ; Occupancy: 8{{$}}
 ; GFX1010:    ; Occupancy: 20{{$}}
 ; GFX1030:    ; Occupancy: 16{{$}}
 ; GFX1100:    ; Occupancy: 16{{$}}
+; GFX1250:    ; Occupancy: 16{{$}}
 define amdgpu_kernel void @used_lds_8252_max_group_size_512() #8 {
   store volatile i8 1, ptr addrspace(3) @lds8252
   ret void
@@ -443,10 +521,12 @@ define amdgpu_kernel void @used_lds_8252_max_group_size_512() #8 {
 
 ; GCN-LABEL: {{^}}used_lds_8252_max_group_size_1024:
 ; GFX9:       ; Occupancy: 8{{$}}
+; GFX950:     ; Occupancy: 8{{$}}
 ; GFX1010W32: ; Occupancy: 16{{$}}
 ; GFX1010W64: ; Occupancy: 20{{$}}
 ; GFX1030:    ; Occupancy: 16{{$}}
 ; GFX1100:    ; Occupancy: 16{{$}}
+; GFX1250:    ; Occupancy: 16{{$}}
 define amdgpu_kernel void @used_lds_8252_max_group_size_1024() #9 {
   store volatile i8 1, ptr addrspace(3) @lds8252
   ret void
@@ -454,8 +534,10 @@ define amdgpu_kernel void @used_lds_8252_max_group_size_1024() #9 {
 
 ; GCN-LABEL: {{^}}used_lds_8252_max_group_size_32:
 ; GFX9:       ; Occupancy: 2{{$}}
+; GFX950:     ; Occupancy: 5{{$}}
 ; GFX10:      ; Occupancy: 4{{$}}
 ; GFX1100:    ; Occupancy: 4{{$}}
+; GFX1250:    ; Occupancy: 10{{$}}
 define amdgpu_kernel void @used_lds_8252_max_group_size_32() #10 {
   store volatile i8 1, ptr addrspace(3) @lds8252
   ret void
diff --git a/llvm/test/CodeGen/AMDGPU/release-vgprs.mir b/llvm/test/CodeGen/AMDGPU/release-vgprs.mir
index c845a4c82b9cc..57db12e6f41d8 100644
--- a/llvm/test/CodeGen/AMDGPU/release-vgprs.mir
+++ b/llvm/test/CodeGen/AMDGPU/release-vgprs.mir
@@ -25,6 +25,8 @@
   define amdgpu_cs void @with_calls() { ret void }
   define fastcc void @with_tail_calls() { ret void }
   define amdgpu_cs void @waveslot_limited() { ret void }
+  define amdgpu_ps void @tbuffer_without_mmo_may_hit_scratch() { ret void }
+  define amdgpu_ps void @buffer_without_mmo_may_hit_scratch() { ret void }
 ...
 
 ---
@@ -34,15 +36,15 @@ machineFunctionInfo:
 body:             |
   bb.0:
     ; OPT-LABEL: name: tbuffer_store1
-    ; OPT: TBUFFER_STORE_FORMAT_XYZW_OFFSET_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 42, 117, 0, 0, implicit $exec
+    ; OPT: TBUFFER_STORE_FORMAT_XYZW_OFFSET_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 42, 117, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
     ; OPT-NEXT: S_NOP 0
     ; OPT-NEXT: S_SENDMSG 3, implicit $exec, implicit $m0
     ; OPT-NEXT: S_ENDPGM 0, implicit $vgpr97
     ;
     ; NOOPT-LABEL: name: tbuffer_store1
-    ; NOOPT: TBUFFER_STORE_FORMAT_XYZW_OFFSET_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 42, 117, 0, 0, implicit $exec
+    ; NOOPT: TBUFFER_STORE_FORMAT_XYZW_OFFSET_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 42, 117, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
     ; NOOPT-NEXT: S_ENDPGM 0, implicit $vgpr97
-    TBUFFER_STORE_FORMAT_XYZW_OFFSET_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 42, 117, 0, 0, implicit $exec
+    TBUFFER_STORE_FORMAT_XYZW_OFFSET_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 42, 117, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
     S_ENDPGM 0, implicit $vgpr97
 ...
 
@@ -107,15 +109,15 @@ machineFunctionInfo:
 body:             |
   bb.0:
     ; OPT-LABEL: name: buffer_store_format
-    ; OPT: BUFFER_STORE_FORMAT_D16_X_OFFEN_exact killed renamable $vgpr0, killed renamable $vgpr1, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 0, 0, 0, implicit $exec
+    ; OPT: BUFFER_STORE_FORMAT_D16_X_OFFEN_exact killed renamable $vgpr0, killed renamable $vgpr1, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 0, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
     ; OPT-NEXT: S_NOP 0
     ; OPT-NEXT: S_SENDMSG 3, implicit $exec, implicit $m0
     ; OPT-NEXT: S_ENDPGM 0, implicit $vgpr97
     ;
     ; NOOPT-LABEL: name: buffer_store_format
-    ; NOOPT: BUFFER_STORE_FORMAT_D16_X_OFFEN_exact killed renamable $vgpr0, killed renamable $vgpr1, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 0, 0, 0, implicit $exec
+    ; NOOPT: BUFFER_STORE_FORMAT_D16_X_OFFEN_exact killed renamable $vgpr0, killed renamable $vgpr1, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 0, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
     ; NOOPT-NEXT: S_ENDPGM 0, implicit $vgpr97
-    BUFFER_STORE_FORMAT_D16_X_OFFEN_exact killed renamable $vgpr0, killed renamable $vgpr1, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 0, 0, 0, implicit $exec
+    BUFFER_STORE_FORMAT_D16_X_OFFEN_exact killed renamable $vgpr0, killed renamable $vgpr1, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 0, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
     S_ENDPGM 0, implicit $vgpr97
 ...
 
@@ -218,7 +220,7 @@ body:             |
   ; OPT: bb.0:
   ; OPT-NEXT:   successors: %bb.2(0x80000000)
   ; OPT-NEXT: {{  $}}
-  ; OPT-NEXT:   TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec
+  ; OPT-NEXT:   TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
   ; OPT-NEXT:   $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr2, implicit $exec
   ; OPT-NEXT:   S_BRANCH %bb.2
   ; OPT-NEXT: {{  $}}
@@ -226,7 +228,7 @@ body:             |
   ; OPT-NEXT:   successors: %bb.2(0x80000000)
   ; OPT-NEXT: {{  $}}
   ; OPT-NEXT:   $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr2, implicit $exec
-  ; OPT-NEXT:   TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec
+  ; OPT-NEXT:   TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
   ; OPT-NEXT:   S_BRANCH %bb.2
   ; OPT-NEXT: {{  $}}
   ; OPT-NEXT: bb.2:
@@ -238,7 +240,7 @@ body:             |
   ; NOOPT: bb.0:
   ; NOOPT-NEXT:   successors: %bb.2(0x80000000)
   ; NOOPT-NEXT: {{  $}}
-  ; NOOPT-NEXT:   TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec
+  ; NOOPT-NEXT:   TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
   ; NOOPT-NEXT:   $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr2, implicit $exec
   ; NOOPT-NEXT:   S_BRANCH %bb.2
   ; NOOPT-NEXT: {{  $}}
@@ -246,7 +248,7 @@ body:             |
   ; NOOPT-NEXT:   successors: %bb.2(0x80000000)
   ; NOOPT-NEXT: {{  $}}
   ; NOOPT-NEXT:   $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr2, implicit $exec
-  ; NOOPT-NEXT:   TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec
+  ; NOOPT-NEXT:   TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
   ; NOOPT-NEXT:   S_BRANCH %bb.2
   ; NOOPT-NEXT: {{  $}}
   ; NOOPT-NEXT: bb.2:
@@ -254,7 +256,7 @@ body:             |
   bb.0:
     successors: %bb.2
 
-    TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec
+    TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
     $vgpr1 = V_ADD_U32_e32 renamable $vgpr0, renamable $vgpr2, implicit $exec
     S_BRANCH %bb.2
 
@@ -262,7 +264,7 @@ body:             |
     successors: %bb.2
 
     $vgpr1 = V_ADD_U32_e32 renamable $vgpr0, renamable $vgpr2, implicit $exec
-    TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec
+    TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
     S_BRANCH %bb.2
 
   bb.2:
@@ -281,7 +283,7 @@ body:             |
   ; OPT-NEXT:   successors: %bb.2(0x80000000)
   ; OPT-NEXT: {{  $}}
   ; OPT-NEXT:   $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr2, implicit $exec
-  ; OPT-NEXT:   TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec
+  ; OPT-NEXT:   TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
   ; OPT-NEXT:   S_BRANCH %bb.2
   ; OPT-NEXT: {{  $}}
   ; OPT-NEXT: bb.1:
@@ -311,7 +313,7 @@ body:             |
   ; NOOPT-NEXT:   successors: %bb.2(0x80000000)
   ; NOOPT-NEXT: {{  $}}
   ; NOOPT-NEXT:   $vgpr1 = V_ADD_U32_e32 $vgpr0, $vgpr2, implicit $exec
-  ; NOOPT-NEXT:   TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec
+  ; NOOPT-NEXT:   TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
   ; NOOPT-NEXT:   S_BRANCH %bb.2
   ; NOOPT-NEXT: {{  $}}
   ; NOOPT-NEXT: bb.1:
@@ -337,7 +339,7 @@ body:             |
     successors: %bb.2
 
     $vgpr1 = V_ADD_U32_e32 renamable $vgpr0, renamable $vgpr2, implicit $exec
-    TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec
+    TBUFFER_STORE_FORMAT_X_OFFSET_exact killed renamable $vgpr0, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 125, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
     S_BRANCH %bb.2
 
   bb.1:
@@ -408,14 +410,14 @@ body:             |
   ; OPT: bb.0:
   ; OPT-NEXT:   successors: %bb.1(0x80000000)
   ; OPT-NEXT: {{  $}}
-  ; OPT-NEXT:   renamable $vgpr0 = BUFFER_LOAD_FORMAT_X_IDXEN killed renamable $vgpr0, renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 0, 0, implicit $exec
+  ; OPT-NEXT:   renamable $vgpr0 = BUFFER_LOAD_FORMAT_X_IDXEN killed renamable $vgpr0, renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 0, 0, implicit $exec :: (volatile load (s32), addrspace 1)
   ; OPT-NEXT:   S_BRANCH %bb.1
   ; OPT-NEXT: {{  $}}
   ; OPT-NEXT: bb.1:
   ; OPT-NEXT:   successors: %bb.1(0x40000000), %bb.2(0x40000000)
   ; OPT-NEXT: {{  $}}
   ; OPT-NEXT:   S_WAITCNT 1015
-  ; OPT-NEXT:   TBUFFER_STORE_FORMAT_XYZW_OFFEN_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $vgpr4, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 115, 0, 0, implicit $exec
+  ; OPT-NEXT:   TBUFFER_STORE_FORMAT_XYZW_OFFEN_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $vgpr4, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 115, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
   ; OPT-NEXT:   S_CMP_LG_U32 killed renamable $sgpr3, renamable $sgpr4, implicit-def $scc
   ; OPT-NEXT:   S_CBRANCH_SCC1 %bb.1, implicit killed $scc
   ; OPT-NEXT:   S_BRANCH %bb.2
@@ -429,14 +431,14 @@ body:             |
   ; NOOPT: bb.0:
   ; NOOPT-NEXT:   successors: %bb.1(0x80000000)
   ; NOOPT-NEXT: {{  $}}
-  ; NOOPT-NEXT:   renamable $vgpr0 = BUFFER_LOAD_FORMAT_X_IDXEN killed renamable $vgpr0, renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 0, 0, implicit $exec
+  ; NOOPT-NEXT:   renamable $vgpr0 = BUFFER_LOAD_FORMAT_X_IDXEN killed renamable $vgpr0, renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 0, 0, implicit $exec :: (volatile load (s32), addrspace 1)
   ; NOOPT-NEXT:   S_BRANCH %bb.1
   ; NOOPT-NEXT: {{  $}}
   ; NOOPT-NEXT: bb.1:
   ; NOOPT-NEXT:   successors: %bb.1(0x40000000), %bb.2(0x40000000)
   ; NOOPT-NEXT: {{  $}}
   ; NOOPT-NEXT:   S_WAITCNT 1015
-  ; NOOPT-NEXT:   TBUFFER_STORE_FORMAT_XYZW_OFFEN_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $vgpr4, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 115, 0, 0, implicit $exec
+  ; NOOPT-NEXT:   TBUFFER_STORE_FORMAT_XYZW_OFFEN_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $vgpr4, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 115, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
   ; NOOPT-NEXT:   S_CMP_LG_U32 killed renamable $sgpr3, renamable $sgpr4, implicit-def $scc
   ; NOOPT-NEXT:   S_CBRANCH_SCC1 %bb.1, implicit killed $scc
   ; NOOPT-NEXT:   S_BRANCH %bb.2
@@ -446,13 +448,13 @@ body:             |
   bb.0:
     successors: %bb.1
 
-    renamable $vgpr0 = BUFFER_LOAD_FORMAT_X_IDXEN killed renamable $vgpr0, renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 0, 0, implicit $exec
+    renamable $vgpr0 = BUFFER_LOAD_FORMAT_X_IDXEN killed renamable $vgpr0, renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 0, 0, implicit $exec :: (volatile load (s32), addrspace 1)
     S_BRANCH %bb.1
 
   bb.1:
     successors: %bb.1, %bb.2
 
-    TBUFFER_STORE_FORMAT_XYZW_OFFEN_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $vgpr4, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 115, 0, 0, implicit $exec
+    TBUFFER_STORE_FORMAT_XYZW_OFFEN_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $vgpr4, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 115, 0, 0, implicit $exec :: (volatile store (s32), addrspace 1)
     S_CMP_LG_U32 killed renamable $sgpr3, renamable $sgpr4, implicit-def $scc
     S_CBRANCH_SCC1 %bb.1, implicit killed $scc
     S_BRANCH %bb.2
@@ -619,3 +621,29 @@ body:             |
     GLOBAL_STORE_DWORD undef renamable $vgpr0_vgpr1, killed renamable $vgpr96, 0, 4, implicit $exec
     S_ENDPGM 0
 ...
+
+---
+name:            tbuffer_without_mmo_may_hit_scratch
+machineFunctionInfo:
+  isEntryFunction: true
+body:             |
+  bb.0:
+    ; CHECK-LABEL: name: tbuffer_without_mmo_may_hit_scratch
+    ; CHECK: TBUFFER_STORE_FORMAT_XYZW_OFFSET_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 42, 117, 0, 0, implicit $exec
+    ; CHECK-NEXT: S_ENDPGM 0, implicit $vgpr97
+    TBUFFER_STORE_FORMAT_XYZW_OFFSET_exact killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, killed renamable $sgpr4, 42, 117, 0, 0, implicit $exec
+    S_ENDPGM 0, implicit $vgpr97
+...
+
+---
+name:            buffer_without_mmo_may_hit_scratch
+machineFunctionInfo:
+  isEntryFunction: true
+body:             |
+  bb.0:
+    ; CHECK-LABEL: name: buffer_without_mmo_may_hit_scratch
+    ; CHECK: BUFFER_ATOMIC_ADD_F32_OFFEN killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 0, implicit $exec
+    ; CHECK-NEXT: S_ENDPGM 0, implicit $vgpr97
+    BUFFER_ATOMIC_ADD_F32_OFFEN killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr0_sgpr1_sgpr2_sgpr3, 0, 0, 0, implicit $exec
+    S_ENDPGM 0, implicit $vgpr97
+...
diff --git a/llvm/test/CodeGen/AMDGPU/remove-register-flags.mir b/llvm/test/CodeGen/AMDGPU/remove-register-flags.mir
index d9dc449501203..67e00119b1f98 100644
--- a/llvm/test/CodeGen/AMDGPU/remove-register-flags.mir
+++ b/llvm/test/CodeGen/AMDGPU/remove-register-flags.mir
@@ -14,6 +14,6 @@ body: |
     ; CHECK-NEXT: S_ALLOC_VGPR $sgpr19, implicit-def $scc
     ; CHECK-NEXT: $sgpr20_sgpr21 = S_CSELECT_B64 $sgpr20_sgpr21, $sgpr22_sgpr23, implicit $scc
     ; CHECK-NEXT: $exec_lo = S_CSELECT_B32 $sgpr18, -1, implicit $scc
-    ; CHECK-NEXT: SI_TCRETURN killed renamable $sgpr20_sgpr21, 0, 0, amdgpu_allvgprs, implicit killed $sgpr0, implicit killed $sgpr1, implicit killed $sgpr2, implicit killed $sgpr3, implicit killed $sgpr4, implicit killed $sgpr5, implicit killed $sgpr6, implicit killed $sgpr7, implicit killed $sgpr8, implicit killed $sgpr9, implicit killed $sgpr10, implicit killed $sgpr11, implicit killed $sgpr12, implicit killed $sgpr13, implicit killed $sgpr14, implicit killed $sgpr15, implicit killed $sgpr16, implicit killed $sgpr17, implicit $sgpr18, implicit $sgpr19
+    ; CHECK-NEXT: SI_TCRETURN_CHAIN killed renamable $sgpr20_sgpr21, 0, 0, amdgpu_allvgprs, implicit killed $sgpr0, implicit killed $sgpr1, implicit killed $sgpr2, implicit killed $sgpr3, implicit killed $sgpr4, implicit killed $sgpr5, implicit killed $sgpr6, implicit killed $sgpr7, implicit killed $sgpr8, implicit killed $sgpr9, implicit killed $sgpr10, implicit killed $sgpr11, implicit killed $sgpr12, implicit killed $sgpr13, implicit killed $sgpr14, implicit killed $sgpr15, implicit killed $sgpr16, implicit killed $sgpr17, implicit $sgpr18, implicit $sgpr19
     SI_CS_CHAIN_TC_W32_DVGPR killed renamable $sgpr20_sgpr21, 0, 0, killed renamable $sgpr18, killed renamable $sgpr19, -1, killed renamable $sgpr22_sgpr23, amdgpu_allvgprs, implicit killed $sgpr0, implicit killed $sgpr1, implicit killed $sgpr2, implicit killed $sgpr3, implicit killed $sgpr4, implicit killed $sgpr5, implicit killed $sgpr6, implicit killed $sgpr7, implicit killed $sgpr8, implicit killed $sgpr9, implicit killed $sgpr10, implicit killed $sgpr11, implicit killed $sgpr12, implicit killed $sgpr13, implicit killed $sgpr14, implicit killed $sgpr15, implicit killed $sgpr16, implicit killed $sgpr17, implicit $sgpr18, implicit $sgpr19
 ...
diff --git a/llvm/test/CodeGen/AMDGPU/rewrite-vgpr-mfma-scale-to-agpr.mir b/llvm/test/CodeGen/AMDGPU/rewrite-vgpr-mfma-scale-to-agpr.mir
index 999ea42910d92..e35927e8bf00d 100644
--- a/llvm/test/CodeGen/AMDGPU/rewrite-vgpr-mfma-scale-to-agpr.mir
+++ b/llvm/test/CodeGen/AMDGPU/rewrite-vgpr-mfma-scale-to-agpr.mir
@@ -1,7 +1,9 @@
-# RUN: not --crash llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx950 -run-pass=greedy,amdgpu-rewrite-agpr-copy-mfma -verify-machineinstrs -o - %s 2>&1 | FileCheck %s
-# CHECK: Illegal virtual register for instruction
-# CHECK: Expected a VGPR_32 register, but got a AGPR_32 register
- 
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx950 -run-pass=greedy,amdgpu-rewrite-agpr-copy-mfma -verify-machineinstrs -o - %s | FileCheck %s
+# CHECK: bb.1:
+# CHECK: dead %{{[0-9]+}}:vreg_128_align2 = V_MFMA_SCALE_F32_16X16X128_F8F6F4_f4_f4_vgprcd_e64 %{{[0-9]+}}, %{{[0-9]+}}, %{{[0-9]+}}, 4, 4, %{{[0-9]+}}, %[[REG:[0-9]+]], 4, 0, implicit $mode, implicit $exec
+# CHECK: %{{[0-9]+}}:agpr_32 = IMPLICIT_DEF
+# CHECK: %[[REG]]:vgpr_32 = COPY %{{[0-9]+}}
+
 # Test for issue in amdgpu-rewrite-agpr-copy-mfma, which reassigns scale operand
 # in vgpr_32 register to agpr_32, not permitted by instruction format.
 ---
diff --git a/llvm/test/CodeGen/AMDGPU/strict_fadd.f16.ll b/llvm/test/CodeGen/AMDGPU/strict_fadd.f16.ll
index e9e4d5ebed41c..c68a0e6f43578 100644
--- a/llvm/test/CodeGen/AMDGPU/strict_fadd.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/strict_fadd.f16.ll
@@ -1,41 +1,29 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GFX9,GFX9-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GFX9,GFX9-GISEL %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN,GFX9,GFX9-SDAG %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN,GFX9,GFX9-GISEL %s
 
-; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=fiji < %s | FileCheck -check-prefixes=GFX8,GFX8-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=fiji < %s | FileCheck -check-prefixes=GFX8,GFX8-GISEL %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=fiji < %s | FileCheck -check-prefixes=GCN,GFX8,GFX8-SDAG %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=fiji < %s | FileCheck -check-prefixes=GCN,GFX8,GFX8-GISEL %s
 
-; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10,GFX10-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10,GFX10-GISEL %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX10-SDAG %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX10-GISEL %s
 
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=+real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX11,GFX11-SDAG-TRUE16 %s
-; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=-real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX11,GFX11-SDAG-FAKE16 %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=+real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX11,GFX11-GISEL,GFX11-GISEL-TRUE16 %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=-real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX11,GFX11-GISEL,GFX11-GISEL-FAKE16 %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=-real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GCN,GFX11,GFX11-SDAG-FAKE16 %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=+real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX11,GFX11-GISEL,GFX11-GISEL-TRUE16 %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=-real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GCN,GFX11,GFX11-GISEL,GFX11-GISEL-FAKE16 %s
 
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 < %s | FileCheck -check-prefixes=GFX12,GFX12-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 < %s | FileCheck -check-prefixes=GFX12,GFX12-GISEL %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 < %s | FileCheck -check-prefixes=GFX12,GFX12-GISEL %s
 
 ; FIXME: promotion not handled without f16 insts
 
 define half @v_constained_fadd_f16_fpexcept_strict(half %x, half %y) #0 {
-; GFX9-LABEL: v_constained_fadd_f16_fpexcept_strict:
-; GFX9:       ; %bb.0:
-; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX9-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: v_constained_fadd_f16_fpexcept_strict:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX8-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: v_constained_fadd_f16_fpexcept_strict:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX10-NEXT:    s_setpc_b64 s[30:31]
+; GCN-LABEL: v_constained_fadd_f16_fpexcept_strict:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    v_add_f16_e32 v0, v0, v1
+; GCN-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-SDAG-TRUE16-LABEL: v_constained_fadd_f16_fpexcept_strict:
 ; GFX11-SDAG-TRUE16:       ; %bb.0:
@@ -43,24 +31,12 @@ define half @v_constained_fadd_f16_fpexcept_strict(half %x, half %y) #0 {
 ; GFX11-SDAG-TRUE16-NEXT:    v_add_f16_e32 v0.l, v0.l, v1.l
 ; GFX11-SDAG-TRUE16-NEXT:    s_setpc_b64 s[30:31]
 ;
-; GFX11-SDAG-FAKE16-LABEL: v_constained_fadd_f16_fpexcept_strict:
-; GFX11-SDAG-FAKE16:       ; %bb.0:
-; GFX11-SDAG-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-SDAG-FAKE16-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX11-SDAG-FAKE16-NEXT:    s_setpc_b64 s[30:31]
-;
 ; GFX11-GISEL-TRUE16-LABEL: v_constained_fadd_f16_fpexcept_strict:
 ; GFX11-GISEL-TRUE16:       ; %bb.0:
 ; GFX11-GISEL-TRUE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX11-GISEL-TRUE16-NEXT:    v_add_f16_e32 v0.l, v0.l, v1.l
 ; GFX11-GISEL-TRUE16-NEXT:    s_setpc_b64 s[30:31]
 ;
-; GFX11-GISEL-FAKE16-LABEL: v_constained_fadd_f16_fpexcept_strict:
-; GFX11-GISEL-FAKE16:       ; %bb.0:
-; GFX11-GISEL-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-GISEL-FAKE16-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX11-GISEL-FAKE16-NEXT:    s_setpc_b64 s[30:31]
-;
 ; GFX12-LABEL: v_constained_fadd_f16_fpexcept_strict:
 ; GFX12:       ; %bb.0:
 ; GFX12-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -75,23 +51,11 @@ define half @v_constained_fadd_f16_fpexcept_strict(half %x, half %y) #0 {
 }
 
 define half @v_constained_fadd_f16_fpexcept_ignore(half %x, half %y) #0 {
-; GFX9-LABEL: v_constained_fadd_f16_fpexcept_ignore:
-; GFX9:       ; %bb.0:
-; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX9-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: v_constained_fadd_f16_fpexcept_ignore:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX8-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: v_constained_fadd_f16_fpexcept_ignore:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX10-NEXT:    s_setpc_b64 s[30:31]
+; GCN-LABEL: v_constained_fadd_f16_fpexcept_ignore:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    v_add_f16_e32 v0, v0, v1
+; GCN-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-SDAG-TRUE16-LABEL: v_constained_fadd_f16_fpexcept_ignore:
 ; GFX11-SDAG-TRUE16:       ; %bb.0:
@@ -99,24 +63,12 @@ define half @v_constained_fadd_f16_fpexcept_ignore(half %x, half %y) #0 {
 ; GFX11-SDAG-TRUE16-NEXT:    v_add_f16_e32 v0.l, v0.l, v1.l
 ; GFX11-SDAG-TRUE16-NEXT:    s_setpc_b64 s[30:31]
 ;
-; GFX11-SDAG-FAKE16-LABEL: v_constained_fadd_f16_fpexcept_ignore:
-; GFX11-SDAG-FAKE16:       ; %bb.0:
-; GFX11-SDAG-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-SDAG-FAKE16-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX11-SDAG-FAKE16-NEXT:    s_setpc_b64 s[30:31]
-;
 ; GFX11-GISEL-TRUE16-LABEL: v_constained_fadd_f16_fpexcept_ignore:
 ; GFX11-GISEL-TRUE16:       ; %bb.0:
 ; GFX11-GISEL-TRUE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX11-GISEL-TRUE16-NEXT:    v_add_f16_e32 v0.l, v0.l, v1.l
 ; GFX11-GISEL-TRUE16-NEXT:    s_setpc_b64 s[30:31]
 ;
-; GFX11-GISEL-FAKE16-LABEL: v_constained_fadd_f16_fpexcept_ignore:
-; GFX11-GISEL-FAKE16:       ; %bb.0:
-; GFX11-GISEL-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-GISEL-FAKE16-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX11-GISEL-FAKE16-NEXT:    s_setpc_b64 s[30:31]
-;
 ; GFX12-LABEL: v_constained_fadd_f16_fpexcept_ignore:
 ; GFX12:       ; %bb.0:
 ; GFX12-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -126,38 +78,16 @@ define half @v_constained_fadd_f16_fpexcept_ignore(half %x, half %y) #0 {
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-NEXT:    v_add_f16_e32 v0, v0, v1
 ; GFX12-NEXT:    s_setpc_b64 s[30:31]
-; GFX11-TRUE16-LABEL: v_constained_fadd_f16_fpexcept_ignore:
-; GFX11-TRUE16:       ; %bb.0:
-; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT:    v_add_f16_e32 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT:    s_setpc_b64 s[30:31]
-; GFX11-FAKE16-LABEL: v_constained_fadd_f16_fpexcept_ignore:
-; GFX11-FAKE16:       ; %bb.0:
-; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-FAKE16-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX11-FAKE16-NEXT:    s_setpc_b64 s[30:31]
   %val = call half @llvm.experimental.constrained.fadd.f16(half %x, half %y, metadata !"round.tonearest", metadata !"fpexcept.ignore")
   ret half %val
 }
 
 define half @v_constained_fadd_f16_fpexcept_maytrap(half %x, half %y) #0 {
-; GFX9-LABEL: v_constained_fadd_f16_fpexcept_maytrap:
-; GFX9:       ; %bb.0:
-; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX9-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX8-LABEL: v_constained_fadd_f16_fpexcept_maytrap:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX8-NEXT:    s_setpc_b64 s[30:31]
-;
-; GFX10-LABEL: v_constained_fadd_f16_fpexcept_maytrap:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX10-NEXT:    s_setpc_b64 s[30:31]
+; GCN-LABEL: v_constained_fadd_f16_fpexcept_maytrap:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    v_add_f16_e32 v0, v0, v1
+; GCN-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-SDAG-TRUE16-LABEL: v_constained_fadd_f16_fpexcept_maytrap:
 ; GFX11-SDAG-TRUE16:       ; %bb.0:
@@ -165,24 +95,12 @@ define half @v_constained_fadd_f16_fpexcept_maytrap(half %x, half %y) #0 {
 ; GFX11-SDAG-TRUE16-NEXT:    v_add_f16_e32 v0.l, v0.l, v1.l
 ; GFX11-SDAG-TRUE16-NEXT:    s_setpc_b64 s[30:31]
 ;
-; GFX11-SDAG-FAKE16-LABEL: v_constained_fadd_f16_fpexcept_maytrap:
-; GFX11-SDAG-FAKE16:       ; %bb.0:
-; GFX11-SDAG-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-SDAG-FAKE16-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX11-SDAG-FAKE16-NEXT:    s_setpc_b64 s[30:31]
-;
 ; GFX11-GISEL-TRUE16-LABEL: v_constained_fadd_f16_fpexcept_maytrap:
 ; GFX11-GISEL-TRUE16:       ; %bb.0:
 ; GFX11-GISEL-TRUE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX11-GISEL-TRUE16-NEXT:    v_add_f16_e32 v0.l, v0.l, v1.l
 ; GFX11-GISEL-TRUE16-NEXT:    s_setpc_b64 s[30:31]
 ;
-; GFX11-GISEL-FAKE16-LABEL: v_constained_fadd_f16_fpexcept_maytrap:
-; GFX11-GISEL-FAKE16:       ; %bb.0:
-; GFX11-GISEL-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-GISEL-FAKE16-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX11-GISEL-FAKE16-NEXT:    s_setpc_b64 s[30:31]
-;
 ; GFX12-LABEL: v_constained_fadd_f16_fpexcept_maytrap:
 ; GFX12:       ; %bb.0:
 ; GFX12-NEXT:    s_wait_loadcnt_dscnt 0x0
@@ -192,16 +110,6 @@ define half @v_constained_fadd_f16_fpexcept_maytrap(half %x, half %y) #0 {
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-NEXT:    v_add_f16_e32 v0, v0, v1
 ; GFX12-NEXT:    s_setpc_b64 s[30:31]
-; GFX11-TRUE16-LABEL: v_constained_fadd_f16_fpexcept_maytrap:
-; GFX11-TRUE16:       ; %bb.0:
-; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT:    v_add_f16_e32 v0.l, v0.l, v1.l
-; GFX11-TRUE16-NEXT:    s_setpc_b64 s[30:31]
-; GFX11-FAKE16-LABEL: v_constained_fadd_f16_fpexcept_maytrap:
-; GFX11-FAKE16:       ; %bb.0:
-; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-FAKE16-NEXT:    v_add_f16_e32 v0, v0, v1
-; GFX11-FAKE16-NEXT:    s_setpc_b64 s[30:31]
   %val = call half @llvm.experimental.constrained.fadd.f16(half %x, half %y, metadata !"round.tonearest", metadata !"fpexcept.maytrap")
   ret half %val
 }
@@ -439,18 +347,6 @@ define <3 x half> @v_constained_fadd_v3f16_fpexcept_strict(<3 x half> %x, <3 x h
 ; GFX12-GISEL-NEXT:    v_pk_add_f16 v0, v0, v2
 ; GFX12-GISEL-NEXT:    v_pk_add_f16 v1, v1, v3
 ; GFX12-GISEL-NEXT:    s_setpc_b64 s[30:31]
-; GFX11-TRUE16-LABEL: v_constained_fadd_v3f16_fpexcept_strict:
-; GFX11-TRUE16:       ; %bb.0:
-; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT:    v_pk_add_f16 v0, v0, v2
-; GFX11-TRUE16-NEXT:    v_add_f16_e32 v1.l, v1.l, v3.l
-; GFX11-TRUE16-NEXT:    s_setpc_b64 s[30:31]
-; GFX11-FAKE16-LABEL: v_constained_fadd_v3f16_fpexcept_strict:
-; GFX11-FAKE16:       ; %bb.0:
-; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-FAKE16-NEXT:    v_pk_add_f16 v0, v0, v2
-; GFX11-FAKE16-NEXT:    v_add_f16_e32 v1, v1, v3
-; GFX11-FAKE16-NEXT:    s_setpc_b64 s[30:31]
   %val = call <3 x half> @llvm.experimental.constrained.fadd.v3f16(<3 x half> %x, <3 x half> %y, metadata !"round.tonearest", metadata !"fpexcept.strict")
   ret <3 x half> %val
 }
@@ -578,28 +474,6 @@ define <4 x half> @v_constained_fadd_v4f16_fpexcept_strict(<4 x half> %x, <4 x h
 ; GFX12-GISEL-NEXT:    v_pk_add_f16 v0, v0, v2
 ; GFX12-GISEL-NEXT:    v_pk_add_f16 v1, v1, v3
 ; GFX12-GISEL-NEXT:    s_setpc_b64 s[30:31]
-; GFX11-TRUE16-LABEL: v_constained_fadd_v4f16_fpexcept_strict:
-; GFX11-TRUE16:       ; %bb.0:
-; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-TRUE16-NEXT:    v_add_f16_e32 v1.h, v1.h, v3.h
-; GFX11-TRUE16-NEXT:    v_add_f16_e32 v0.h, v0.h, v2.h
-; GFX11-TRUE16-NEXT:    v_add_f16_e32 v0.l, v0.l, v2.l
-; GFX11-TRUE16-NEXT:    v_add_f16_e32 v1.l, v1.l, v3.l
-; GFX11-TRUE16-NEXT:    s_setpc_b64 s[30:31]
-; GFX11-FAKE16-LABEL: v_constained_fadd_v4f16_fpexcept_strict:
-; GFX11-FAKE16:       ; %bb.0:
-; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v4, 16, v3
-; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v5, 16, v2
-; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v6, 16, v0
-; GFX11-FAKE16-NEXT:    v_lshrrev_b32_e32 v7, 16, v1
-; GFX11-FAKE16-NEXT:    v_add_f16_e32 v1, v1, v3
-; GFX11-FAKE16-NEXT:    v_add_f16_e32 v0, v0, v2
-; GFX11-FAKE16-NEXT:    v_add_f16_e32 v2, v6, v5
-; GFX11-FAKE16-NEXT:    v_add_f16_e32 v3, v7, v4
-; GFX11-FAKE16-NEXT:    v_perm_b32 v0, v2, v0, 0x5040100
-; GFX11-FAKE16-NEXT:    v_perm_b32 v1, v3, v1, 0x5040100
-; GFX11-FAKE16-NEXT:    s_setpc_b64 s[30:31]
   %val = call <4 x half> @llvm.experimental.constrained.fadd.v4f16(<4 x half> %x, <4 x half> %y, metadata !"round.tonearest", metadata !"fpexcept.strict")
   ret <4 x half> %val
 }
@@ -648,14 +522,6 @@ define amdgpu_ps half @s_constained_fadd_f16_fpexcept_strict(half inreg %x, half
 ; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_3)
 ; GFX12-NEXT:    v_mov_b32_e32 v0, s0
 ; GFX12-NEXT:    ; return to shader part epilog
-; GFX11-TRUE16-LABEL: s_constained_fadd_f16_fpexcept_strict:
-; GFX11-TRUE16:       ; %bb.0:
-; GFX11-TRUE16-NEXT:    v_add_f16_e64 v0.l, s2, s3
-; GFX11-TRUE16-NEXT:    ; return to shader part epilog
-; GFX11-FAKE16-LABEL: s_constained_fadd_f16_fpexcept_strict:
-; GFX11-FAKE16:       ; %bb.0:
-; GFX11-FAKE16-NEXT:    v_add_f16_e64 v0, s2, s3
-; GFX11-FAKE16-NEXT:    ; return to shader part epilog
   %val = call half @llvm.experimental.constrained.fadd.f16(half %x, half %y, metadata !"round.tonearest", metadata !"fpexcept.strict")
   ret half %val
 }
@@ -681,14 +547,19 @@ define amdgpu_ps <2 x half> @s_constained_fadd_v2f16_fpexcept_strict(<2 x half>
 ;
 ; GFX8-GISEL-LABEL: s_constained_fadd_v2f16_fpexcept_strict:
 ; GFX8-GISEL:       ; %bb.0:
-; GFX8-GISEL-NEXT:    s_lshr_b32 s0, s2, 16
-; GFX8-GISEL-NEXT:    s_lshr_b32 s1, s3, 16
 ; GFX8-GISEL-NEXT:    v_mov_b32_e32 v0, s3
-; GFX8-GISEL-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8-GISEL-NEXT:    v_mov_b32_e32 v2, s0
+; GFX8-GISEL-NEXT:    s_lshr_b32 s1, s3, 16
 ; GFX8-GISEL-NEXT:    v_add_f16_e32 v0, s2, v0
-; GFX8-GISEL-NEXT:    v_add_f16_sdwa v1, v2, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX8-GISEL-NEXT:    v_or_b32_e32 v0, v0, v1
+; GFX8-GISEL-NEXT:    s_lshr_b32 s0, s2, 16
+; GFX8-GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX8-GISEL-NEXT:    v_mov_b32_e32 v0, s1
+; GFX8-GISEL-NEXT:    v_add_f16_e32 v0, s0, v0
+; GFX8-GISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX8-GISEL-NEXT:    s_and_b32 s0, 0xffff, s0
+; GFX8-GISEL-NEXT:    s_and_b32 s1, 0xffff, s2
+; GFX8-GISEL-NEXT:    s_lshl_b32 s0, s0, 16
+; GFX8-GISEL-NEXT:    s_or_b32 s0, s1, s0
+; GFX8-GISEL-NEXT:    v_mov_b32_e32 v0, s0
 ; GFX8-GISEL-NEXT:    ; return to shader part epilog
 ;
 ; GFX10-LABEL: s_constained_fadd_v2f16_fpexcept_strict:
@@ -701,10 +572,21 @@ define amdgpu_ps <2 x half> @s_constained_fadd_v2f16_fpexcept_strict(<2 x half>
 ; GFX11-NEXT:    v_pk_add_f16 v0, s2, s3
 ; GFX11-NEXT:    ; return to shader part epilog
 ;
-; GFX12-LABEL: s_constained_fadd_v2f16_fpexcept_strict:
-; GFX12:       ; %bb.0:
-; GFX12-NEXT:    v_pk_add_f16 v0, s2, s3
-; GFX12-NEXT:    ; return to shader part epilog
+; GFX12-SDAG-LABEL: s_constained_fadd_v2f16_fpexcept_strict:
+; GFX12-SDAG:       ; %bb.0:
+; GFX12-SDAG-NEXT:    v_pk_add_f16 v0, s2, s3
+; GFX12-SDAG-NEXT:    ; return to shader part epilog
+;
+; GFX12-GISEL-LABEL: s_constained_fadd_v2f16_fpexcept_strict:
+; GFX12-GISEL:       ; %bb.0:
+; GFX12-GISEL-NEXT:    s_lshr_b32 s0, s2, 16
+; GFX12-GISEL-NEXT:    s_lshr_b32 s1, s3, 16
+; GFX12-GISEL-NEXT:    s_add_f16 s2, s2, s3
+; GFX12-GISEL-NEXT:    s_add_f16 s0, s0, s1
+; GFX12-GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_3) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-GISEL-NEXT:    s_pack_ll_b32_b16 s0, s2, s0
+; GFX12-GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX12-GISEL-NEXT:    ; return to shader part epilog
   %val = call <2 x half> @llvm.experimental.constrained.fadd.v2f16(<2 x half> %x, <2 x half> %y, metadata !"round.tonearest", metadata !"fpexcept.strict")
   ret <2 x half> %val
 }
diff --git a/llvm/test/CodeGen/AMDGPU/strict_fadd.f32.ll b/llvm/test/CodeGen/AMDGPU/strict_fadd.f32.ll
index a039c2629c395..52eef3e2a10f8 100644
--- a/llvm/test/CodeGen/AMDGPU/strict_fadd.f32.ll
+++ b/llvm/test/CodeGen/AMDGPU/strict_fadd.f32.ll
@@ -1,15 +1,15 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GFX9,GFX9-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GFX9,GFX9-GISEL %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GFX9,GFX9-GISEL %s
 
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX10-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX10-GISEL %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX10-GISEL %s
 
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11-GISEL %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11-GISEL %s
 
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 < %s | FileCheck -check-prefixes=GFX12,GFX12-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 < %s | FileCheck -check-prefixes=GFX12,GFX12-GISEL %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 < %s | FileCheck -check-prefixes=GFX12,GFX12-GISEL %s
 
 define float @v_constained_fadd_f32_fpexcept_strict(float %x, float %y) #0 {
 ; GFX9-LABEL: v_constained_fadd_f32_fpexcept_strict:
diff --git a/llvm/test/CodeGen/AMDGPU/strict_fadd.f64.ll b/llvm/test/CodeGen/AMDGPU/strict_fadd.f64.ll
index 5469fc8330971..2e5268da9aa49 100644
--- a/llvm/test/CodeGen/AMDGPU/strict_fadd.f64.ll
+++ b/llvm/test/CodeGen/AMDGPU/strict_fadd.f64.ll
@@ -1,12 +1,12 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN,GCN-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN,GCN-GISEL %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN,GCN-GISEL %s
 
-; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GCN,GFX10PLUS,GFX10 %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GCN,GFX10PLUS,GFX10 %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GCN,GFX10-SDAG %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GCN,GFX10-GISEL %s
 
-; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GCN,GFX10PLUS,GFX11 %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GCN,GFX10PLUS,GFX11 %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GCN,GFX11-SDAG %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GCN,GFX11-GISEL %s
 
 define double @v_constained_fadd_f64_fpexcept_strict(double %x, double %y) #0 {
 ; GCN-LABEL: v_constained_fadd_f64_fpexcept_strict:
@@ -96,12 +96,38 @@ define amdgpu_ps <2 x float> @s_constained_fadd_f64_fpexcept_strict(double inreg
 ; GCN-GISEL-NEXT:    v_mov_b32_e32 v0, s4
 ; GCN-GISEL-NEXT:    v_mov_b32_e32 v1, s5
 ; GCN-GISEL-NEXT:    v_add_f64 v[0:1], s[2:3], v[0:1]
+; GCN-GISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GCN-GISEL-NEXT:    v_readfirstlane_b32 s1, v1
+; GCN-GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GCN-GISEL-NEXT:    v_mov_b32_e32 v1, s1
 ; GCN-GISEL-NEXT:    ; return to shader part epilog
 ;
-; GFX10PLUS-LABEL: s_constained_fadd_f64_fpexcept_strict:
-; GFX10PLUS:       ; %bb.0:
-; GFX10PLUS-NEXT:    v_add_f64 v[0:1], s[2:3], s[4:5]
-; GFX10PLUS-NEXT:    ; return to shader part epilog
+; GFX10-SDAG-LABEL: s_constained_fadd_f64_fpexcept_strict:
+; GFX10-SDAG:       ; %bb.0:
+; GFX10-SDAG-NEXT:    v_add_f64 v[0:1], s[2:3], s[4:5]
+; GFX10-SDAG-NEXT:    ; return to shader part epilog
+;
+; GFX10-GISEL-LABEL: s_constained_fadd_f64_fpexcept_strict:
+; GFX10-GISEL:       ; %bb.0:
+; GFX10-GISEL-NEXT:    v_add_f64 v[0:1], s[2:3], s[4:5]
+; GFX10-GISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX10-GISEL-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX10-GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX10-GISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX10-GISEL-NEXT:    ; return to shader part epilog
+;
+; GFX11-SDAG-LABEL: s_constained_fadd_f64_fpexcept_strict:
+; GFX11-SDAG:       ; %bb.0:
+; GFX11-SDAG-NEXT:    v_add_f64 v[0:1], s[2:3], s[4:5]
+; GFX11-SDAG-NEXT:    ; return to shader part epilog
+;
+; GFX11-GISEL-LABEL: s_constained_fadd_f64_fpexcept_strict:
+; GFX11-GISEL:       ; %bb.0:
+; GFX11-GISEL-NEXT:    v_add_f64 v[0:1], s[2:3], s[4:5]
+; GFX11-GISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX11-GISEL-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX11-GISEL-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
+; GFX11-GISEL-NEXT:    ; return to shader part epilog
   %val = call double @llvm.experimental.constrained.fadd.f64(double %x, double %y, metadata !"round.tonearest", metadata !"fpexcept.strict")
   %cast = bitcast double %val to <2 x float>
   ret <2 x float> %cast
@@ -113,6 +139,3 @@ declare <3 x double> @llvm.experimental.constrained.fadd.v3f64(<3 x double>, <3
 
 attributes #0 = { strictfp }
 attributes #1 = { inaccessiblememonly nounwind willreturn }
-;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
-; GFX10: {{.*}}
-; GFX11: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/strict_fmul.f16.ll b/llvm/test/CodeGen/AMDGPU/strict_fmul.f16.ll
index 79154d0db16ec..bdb2128bf609b 100644
--- a/llvm/test/CodeGen/AMDGPU/strict_fmul.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/strict_fmul.f16.ll
@@ -1,20 +1,20 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN,GFX9,GFX9-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN,GFX9,GFX9-GISEL %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN,GFX9,GFX9-GISEL %s
 
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=fiji < %s | FileCheck -check-prefixes=GCN,GFX8,GFX8-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=fiji < %s | FileCheck -check-prefixes=GCN,GFX8,GFX8-GISEL %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=fiji < %s | FileCheck -check-prefixes=GCN,GFX8,GFX8-GISEL %s
 
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX10,GFX10-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX10,GFX10-GISEL %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX10,GFX10-GISEL %s
 
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=+real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11,GFX11-SDAG,GFX11-SDAG-TRUE16 %s
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=-real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11,GFX11-SDAG,GFX11-SDAG-FAKE16 %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=+real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11,GFX11-GISEL,GFX11-GISEL-TRUE16 %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=-real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11,GFX11-GISEL,GFX11-GISEL-FAKE16 %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=+real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11,GFX11-GISEL,GFX11-GISEL-TRUE16 %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=-real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11,GFX11-GISEL,GFX11-GISEL-FAKE16 %s
 
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 < %s | FileCheck -check-prefixes=GFX12,GFX12-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 < %s | FileCheck -check-prefixes=GFX12,GFX12-GISEL %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 < %s | FileCheck -check-prefixes=GFX12,GFX12-GISEL %s
 
 ; FIXME: promotion not handled without f16 insts
 
@@ -627,14 +627,19 @@ define amdgpu_ps <2 x half> @s_constained_fmul_v2f16_fpexcept_strict(<2 x half>
 ;
 ; GFX8-GISEL-LABEL: s_constained_fmul_v2f16_fpexcept_strict:
 ; GFX8-GISEL:       ; %bb.0:
-; GFX8-GISEL-NEXT:    s_lshr_b32 s0, s2, 16
-; GFX8-GISEL-NEXT:    s_lshr_b32 s1, s3, 16
 ; GFX8-GISEL-NEXT:    v_mov_b32_e32 v0, s3
-; GFX8-GISEL-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8-GISEL-NEXT:    v_mov_b32_e32 v2, s0
+; GFX8-GISEL-NEXT:    s_lshr_b32 s1, s3, 16
 ; GFX8-GISEL-NEXT:    v_mul_f16_e32 v0, s2, v0
-; GFX8-GISEL-NEXT:    v_mul_f16_sdwa v1, v2, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
-; GFX8-GISEL-NEXT:    v_or_b32_e32 v0, v0, v1
+; GFX8-GISEL-NEXT:    s_lshr_b32 s0, s2, 16
+; GFX8-GISEL-NEXT:    v_readfirstlane_b32 s2, v0
+; GFX8-GISEL-NEXT:    v_mov_b32_e32 v0, s1
+; GFX8-GISEL-NEXT:    v_mul_f16_e32 v0, s0, v0
+; GFX8-GISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX8-GISEL-NEXT:    s_and_b32 s0, 0xffff, s0
+; GFX8-GISEL-NEXT:    s_and_b32 s1, 0xffff, s2
+; GFX8-GISEL-NEXT:    s_lshl_b32 s0, s0, 16
+; GFX8-GISEL-NEXT:    s_or_b32 s0, s1, s0
+; GFX8-GISEL-NEXT:    v_mov_b32_e32 v0, s0
 ; GFX8-GISEL-NEXT:    ; return to shader part epilog
 ;
 ; GFX10PLUS-LABEL: s_constained_fmul_v2f16_fpexcept_strict:
@@ -642,10 +647,21 @@ define amdgpu_ps <2 x half> @s_constained_fmul_v2f16_fpexcept_strict(<2 x half>
 ; GFX10PLUS-NEXT:    v_pk_mul_f16 v0, s2, s3
 ; GFX10PLUS-NEXT:    ; return to shader part epilog
 ;
-; GFX12-LABEL: s_constained_fmul_v2f16_fpexcept_strict:
-; GFX12:       ; %bb.0:
-; GFX12-NEXT:    v_pk_mul_f16 v0, s2, s3
-; GFX12-NEXT:    ; return to shader part epilog
+; GFX12-SDAG-LABEL: s_constained_fmul_v2f16_fpexcept_strict:
+; GFX12-SDAG:       ; %bb.0:
+; GFX12-SDAG-NEXT:    v_pk_mul_f16 v0, s2, s3
+; GFX12-SDAG-NEXT:    ; return to shader part epilog
+;
+; GFX12-GISEL-LABEL: s_constained_fmul_v2f16_fpexcept_strict:
+; GFX12-GISEL:       ; %bb.0:
+; GFX12-GISEL-NEXT:    s_lshr_b32 s0, s2, 16
+; GFX12-GISEL-NEXT:    s_lshr_b32 s1, s3, 16
+; GFX12-GISEL-NEXT:    s_mul_f16 s2, s2, s3
+; GFX12-GISEL-NEXT:    s_mul_f16 s0, s0, s1
+; GFX12-GISEL-NEXT:    s_delay_alu instid0(SALU_CYCLE_3) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-GISEL-NEXT:    s_pack_ll_b32_b16 s0, s2, s0
+; GFX12-GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX12-GISEL-NEXT:    ; return to shader part epilog
   %val = call <2 x half> @llvm.experimental.constrained.fmul.v2f16(<2 x half> %x, <2 x half> %y, metadata !"round.tonearest", metadata !"fpexcept.strict")
   ret <2 x half> %val
 }
diff --git a/llvm/test/CodeGen/AMDGPU/strict_fmul.f32.ll b/llvm/test/CodeGen/AMDGPU/strict_fmul.f32.ll
index 4c1df046a6684..742c9c0e49f3d 100644
--- a/llvm/test/CodeGen/AMDGPU/strict_fmul.f32.ll
+++ b/llvm/test/CodeGen/AMDGPU/strict_fmul.f32.ll
@@ -1,15 +1,15 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefix=GCN %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefix=GCN %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN %s
 
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX10 %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX10 %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX10 %s
 
 ; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11 %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11 %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11 %s
 
-; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 < %s | FileCheck -check-prefixes=GFX12,GFX12-SDAG %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 < %s | FileCheck -check-prefixes=GFX12,GFX12-GISEL %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 < %s | FileCheck -check-prefixes=GFX12 %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 < %s | FileCheck -check-prefixes=GFX12 %s
 
 define float @v_constained_fmul_f32_fpexcept_strict(float %x, float %y) #0 {
 ; GCN-LABEL: v_constained_fmul_f32_fpexcept_strict:
@@ -339,6 +339,3 @@ declare <2 x float> @llvm.experimental.constrained.fmul.v2f32(<2 x float>, <2 x
 declare <3 x float> @llvm.experimental.constrained.fmul.v3f32(<3 x float>, <3 x float>, metadata, metadata)
 
 attributes #0 = { strictfp }
-;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
-; GFX12-GISEL: {{.*}}
-; GFX12-SDAG: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/strict_fmul.f64.ll b/llvm/test/CodeGen/AMDGPU/strict_fmul.f64.ll
index 4d2a93397e0c3..e7f5b54c9d54d 100644
--- a/llvm/test/CodeGen/AMDGPU/strict_fmul.f64.ll
+++ b/llvm/test/CodeGen/AMDGPU/strict_fmul.f64.ll
@@ -1,12 +1,12 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefix=GCN %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefix=GCN %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN,GCN-SDAG %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN,GCN-GISEL %s
 
-; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10,GFX10-SDAG %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10,GFX10-GISEL %s
 
-; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefix=GFX11 %s
-; RUN: llc -global-isel=1 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefix=GFX11 %s
+; RUN: llc -global-isel=0 -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX11,GFX11-SDAG %s
+; RUN: llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX11,GFX11-GISEL %s
 
 define double @v_constained_fmul_f64_fpexcept_strict(double %x, double %y) #0 {
 ; GCN-LABEL: v_constained_fmul_f64_fpexcept_strict:
@@ -178,22 +178,50 @@ define <3 x double> @v_constained_fmul_v3f64_fpexcept_strict(<3 x double> %x, <3
 }
 
 define amdgpu_ps <2 x float> @s_constained_fmul_f64_fpexcept_strict(double inreg %x, double inreg %y) #0 {
-; GCN-LABEL: s_constained_fmul_f64_fpexcept_strict:
-; GCN:       ; %bb.0:
-; GCN-NEXT:    v_mov_b32_e32 v0, s4
-; GCN-NEXT:    v_mov_b32_e32 v1, s5
-; GCN-NEXT:    v_mul_f64 v[0:1], s[2:3], v[0:1]
-; GCN-NEXT:    ; return to shader part epilog
-;
-; GFX10-LABEL: s_constained_fmul_f64_fpexcept_strict:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    v_mul_f64 v[0:1], s[2:3], s[4:5]
-; GFX10-NEXT:    ; return to shader part epilog
-;
-; GFX11-LABEL: s_constained_fmul_f64_fpexcept_strict:
-; GFX11:       ; %bb.0:
-; GFX11-NEXT:    v_mul_f64 v[0:1], s[2:3], s[4:5]
-; GFX11-NEXT:    ; return to shader part epilog
+; GCN-SDAG-LABEL: s_constained_fmul_f64_fpexcept_strict:
+; GCN-SDAG:       ; %bb.0:
+; GCN-SDAG-NEXT:    v_mov_b32_e32 v0, s4
+; GCN-SDAG-NEXT:    v_mov_b32_e32 v1, s5
+; GCN-SDAG-NEXT:    v_mul_f64 v[0:1], s[2:3], v[0:1]
+; GCN-SDAG-NEXT:    ; return to shader part epilog
+;
+; GCN-GISEL-LABEL: s_constained_fmul_f64_fpexcept_strict:
+; GCN-GISEL:       ; %bb.0:
+; GCN-GISEL-NEXT:    v_mov_b32_e32 v0, s4
+; GCN-GISEL-NEXT:    v_mov_b32_e32 v1, s5
+; GCN-GISEL-NEXT:    v_mul_f64 v[0:1], s[2:3], v[0:1]
+; GCN-GISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GCN-GISEL-NEXT:    v_readfirstlane_b32 s1, v1
+; GCN-GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GCN-GISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GCN-GISEL-NEXT:    ; return to shader part epilog
+;
+; GFX10-SDAG-LABEL: s_constained_fmul_f64_fpexcept_strict:
+; GFX10-SDAG:       ; %bb.0:
+; GFX10-SDAG-NEXT:    v_mul_f64 v[0:1], s[2:3], s[4:5]
+; GFX10-SDAG-NEXT:    ; return to shader part epilog
+;
+; GFX10-GISEL-LABEL: s_constained_fmul_f64_fpexcept_strict:
+; GFX10-GISEL:       ; %bb.0:
+; GFX10-GISEL-NEXT:    v_mul_f64 v[0:1], s[2:3], s[4:5]
+; GFX10-GISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX10-GISEL-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX10-GISEL-NEXT:    v_mov_b32_e32 v0, s0
+; GFX10-GISEL-NEXT:    v_mov_b32_e32 v1, s1
+; GFX10-GISEL-NEXT:    ; return to shader part epilog
+;
+; GFX11-SDAG-LABEL: s_constained_fmul_f64_fpexcept_strict:
+; GFX11-SDAG:       ; %bb.0:
+; GFX11-SDAG-NEXT:    v_mul_f64 v[0:1], s[2:3], s[4:5]
+; GFX11-SDAG-NEXT:    ; return to shader part epilog
+;
+; GFX11-GISEL-LABEL: s_constained_fmul_f64_fpexcept_strict:
+; GFX11-GISEL:       ; %bb.0:
+; GFX11-GISEL-NEXT:    v_mul_f64 v[0:1], s[2:3], s[4:5]
+; GFX11-GISEL-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX11-GISEL-NEXT:    v_readfirstlane_b32 s1, v1
+; GFX11-GISEL-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
+; GFX11-GISEL-NEXT:    ; return to shader part epilog
   %val = call double @llvm.experimental.constrained.fmul.f64(double %x, double %y, metadata !"round.tonearest", metadata !"fpexcept.strict")
   %cast = bitcast double %val to <2 x float>
   ret <2 x float> %cast
diff --git a/llvm/test/CodeGen/AMDGPU/vector-alloca-atomic.ll b/llvm/test/CodeGen/AMDGPU/vector-alloca-atomic.ll
index 8e4cc2b0236c0..a7090960518af 100644
--- a/llvm/test/CodeGen/AMDGPU/vector-alloca-atomic.ll
+++ b/llvm/test/CodeGen/AMDGPU/vector-alloca-atomic.ll
@@ -1,11 +1,11 @@
-; RUN: opt -S -mtriple=amdgcn-- -data-layout=A5 -passes='amdgpu-promote-alloca,sroa,instcombine' < %s | FileCheck -check-prefix=OPT %s
+; RUN: opt -S -mtriple=amdgcn-- -passes='amdgpu-promote-alloca,sroa,instcombine' < %s | FileCheck -check-prefix=OPT %s
 
 ; Show that what the alloca promotion pass will do for non-atomic load/store.
 
 ; OPT-LABEL: @vector_alloca_not_atomic(
 ;
-; OPT: extractelement <3 x i32> <i32 0, i32 1, i32 2>, i64 %index
-define amdgpu_kernel void @vector_alloca_not_atomic(ptr addrspace(1) %out, i64 %index) {
+; OPT: extractelement <3 x i32> <i32 0, i32 1, i32 2>, i32 %index
+define amdgpu_kernel void @vector_alloca_not_atomic(ptr addrspace(1) %out, i32 %index) {
 entry:
   %alloca = alloca [3 x i32], addrspace(5)
   %a1 = getelementptr [3 x i32], ptr addrspace(5) %alloca, i32 0, i32 1
@@ -13,7 +13,7 @@ entry:
   store i32 0, ptr addrspace(5) %alloca
   store i32 1, ptr addrspace(5) %a1
   store i32 2, ptr addrspace(5) %a2
-  %tmp = getelementptr [3 x i32], ptr addrspace(5) %alloca, i64 0, i64 %index
+  %tmp = getelementptr [3 x i32], ptr addrspace(5) %alloca, i32 0, i32 %index
   %data = load i32, ptr addrspace(5) %tmp
   store i32 %data, ptr addrspace(1) %out
   ret void
@@ -26,7 +26,7 @@ entry:
 ; OPT: store i32 1, ptr addrspace(5)
 ; OPT: store i32 2, ptr addrspace(5)
 ; OPT: load atomic i32, ptr addrspace(5)
-define amdgpu_kernel void @vector_alloca_atomic_read(ptr addrspace(1) %out, i64 %index) {
+define amdgpu_kernel void @vector_alloca_atomic_read(ptr addrspace(1) %out, i32 %index) {
 entry:
   %alloca = alloca [3 x i32], addrspace(5)
   %a1 = getelementptr [3 x i32], ptr addrspace(5) %alloca, i32 0, i32 1
@@ -34,7 +34,7 @@ entry:
   store i32 0, ptr addrspace(5) %alloca
   store i32 1, ptr addrspace(5) %a1
   store i32 2, ptr addrspace(5) %a2
-  %tmp = getelementptr [3 x i32], ptr addrspace(5) %alloca, i64 0, i64 %index
+  %tmp = getelementptr [3 x i32], ptr addrspace(5) %alloca, i32 0, i32 %index
   %data = load atomic i32, ptr addrspace(5) %tmp acquire, align 4
   store i32 %data, ptr addrspace(1) %out
   ret void
@@ -47,7 +47,7 @@ entry:
 ; OPT: store atomic i32 1, ptr addrspace(5)
 ; OPT: store atomic i32 2, ptr addrspace(5)
 ; OPT: load i32, ptr addrspace(5)
-define amdgpu_kernel void @vector_alloca_atomic_write(ptr addrspace(1) %out, i64 %index) {
+define amdgpu_kernel void @vector_alloca_atomic_write(ptr addrspace(1) %out, i32 %index) {
 entry:
   %alloca = alloca [3 x i32], addrspace(5)
   %a1 = getelementptr [3 x i32], ptr addrspace(5) %alloca, i32 0, i32 1
@@ -55,7 +55,7 @@ entry:
   store atomic i32 0, ptr addrspace(5) %alloca release, align 4
   store atomic i32 1, ptr addrspace(5) %a1 release, align 4
   store atomic i32 2, ptr addrspace(5) %a2  release, align 4
-  %tmp = getelementptr [3 x i32], ptr addrspace(5) %alloca, i64 0, i64 %index
+  %tmp = getelementptr [3 x i32], ptr addrspace(5) %alloca, i32 0, i32 %index
   %data = load i32, ptr addrspace(5) %tmp
   store i32 %data, ptr addrspace(1) %out
   ret void
diff --git a/llvm/test/CodeGen/AMDGPU/vector-alloca-bitcast.ll b/llvm/test/CodeGen/AMDGPU/vector-alloca-bitcast.ll
index 9c05f4d16cb4e..4a29f7e53e93a 100644
--- a/llvm/test/CodeGen/AMDGPU/vector-alloca-bitcast.ll
+++ b/llvm/test/CodeGen/AMDGPU/vector-alloca-bitcast.ll
@@ -72,7 +72,8 @@ entry:
 ; OPT-NOT:   alloca
 ; OPT: bb2:
 ; OPT:  %promotealloca = phi <6 x float> [ zeroinitializer, %bb ], [ %0, %bb2 ]
-; OPT:  %0 = insertelement <6 x float> %promotealloca, float %tmp71, i32 %tmp10
+; OPT: [[TMP:%tmp7.*]] = load float, ptr addrspace(1) %tmp5, align 4
+; OPT:  %0 = insertelement <6 x float> %promotealloca, float [[TMP]], i32 %tmp10
 ; OPT: .preheader:
 ; OPT:  %bc = bitcast <6 x float> %0 to <6 x i32>
 ; OPT:  %1 = extractelement <6 x i32> %bc, i32 %tmp20
@@ -132,7 +133,8 @@ bb15:                                             ; preds = %.preheader
 ; OPT-NOT:   alloca
 ; OPT: bb2:
 ; OPT:  %promotealloca = phi <6 x double> [ zeroinitializer, %bb ], [ %0, %bb2 ]
-; OPT:  %0 = insertelement <6 x double> %promotealloca, double %tmp71, i32 %tmp10
+; OPT:  [[TMP:%tmp7.*]] = load double, ptr addrspace(1) %tmp5, align 8
+; OPT:  %0 = insertelement <6 x double> %promotealloca, double [[TMP]], i32 %tmp10
 ; OPT: .preheader:
 ; OPT:  %bc = bitcast <6 x double> %0 to <6 x i64>
 ; OPT:  %1 = extractelement <6 x i64> %bc, i32 %tmp20
diff --git a/llvm/test/CodeGen/AMDGPU/vopd-combine-gfx1250.mir b/llvm/test/CodeGen/AMDGPU/vopd-combine-gfx1250.mir
index fa6c34cf07730..b05edd046b874 100644
--- a/llvm/test/CodeGen/AMDGPU/vopd-combine-gfx1250.mir
+++ b/llvm/test/CodeGen/AMDGPU/vopd-combine-gfx1250.mir
@@ -1,6 +1,7 @@
 # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 2
 # RUN: llc -mtriple=amdgcn -mcpu=gfx1250 -run-pass=postmisched %s -o - | FileCheck -check-prefix=SCHED %s
 # RUN: llc -mtriple=amdgcn -mcpu=gfx1250 -run-pass=postmisched,gcn-create-vopd %s -o - | FileCheck -check-prefix=PAIR %s
+# RUN: llc -mtriple=amdgcn -mcpu=gfx1250 -run-pass=postmisched,gcn-create-vopd,amdgpu-lower-vgpr-encoding %s -o - | FileCheck -check-prefix=LOWER %s
 
 ---
 name:            vopd_combine_low_vgprs
@@ -20,6 +21,12 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3, $vgpr6 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 $vgpr1, $vgpr1, $vgpr0, $vgpr0, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_low_vgprs
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3, $vgpr6 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 $vgpr1, $vgpr1, $vgpr0, $vgpr0, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = V_SUB_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
@@ -45,6 +52,15 @@ body:             |
     ; PAIR-NEXT: $vgpr301 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr303, $vgpr306 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 $vgpr301, $vgpr301, $vgpr300, $vgpr300, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr300, $vgpr301, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_hi_vgprs
+    ; LOWER: $vgpr300 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr301 = IMPLICIT_DEF
+    ; LOWER-NEXT: S_SET_VGPR_MSB 69, implicit-def $mode
+    ; LOWER-NEXT: $vgpr303, $vgpr306 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 $vgpr301, $vgpr301, $vgpr300, $vgpr300, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 17669, implicit-def $mode
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr300, $vgpr301, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 1280, implicit-def $mode
     $vgpr300 = IMPLICIT_DEF
     $vgpr301 = IMPLICIT_DEF
     $vgpr303 = V_SUB_F32_e32 $vgpr301, $vgpr301, implicit $mode, implicit $exec
@@ -70,6 +86,15 @@ body:             |
     ; PAIR-NEXT: $vgpr813 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr559, $vgpr562 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 $vgpr813, $vgpr813, $vgpr812, $vgpr812, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr812, $vgpr813, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_hi_vgprs_above_512
+    ; LOWER: $vgpr812 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr813 = IMPLICIT_DEF
+    ; LOWER-NEXT: S_SET_VGPR_MSB 143, implicit-def $mode
+    ; LOWER-NEXT: $vgpr559, $vgpr562 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 $vgpr813, $vgpr813, $vgpr812, $vgpr812, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 36623, implicit-def $mode
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr812, $vgpr813, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 3840, implicit-def $mode
     $vgpr812 = IMPLICIT_DEF
     $vgpr813 = IMPLICIT_DEF
     $vgpr559 = V_SUB_F32_e32 $vgpr813, $vgpr813, implicit $mode, implicit $exec
@@ -96,6 +121,15 @@ body:             |
     ; PAIR-NEXT: $vgpr303 = V_SUB_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr6 = V_MUL_F32_e32 killed $vgpr0, $vgpr0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: mixed_vgprs_low_and_hi_dst
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: S_SET_VGPR_MSB 64, implicit-def $mode
+    ; LOWER-NEXT: $vgpr303 = V_SUB_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 16384, implicit-def $mode
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr6 = V_MUL_F32_e32 killed $vgpr0, $vgpr0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr303 = V_SUB_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
@@ -124,6 +158,16 @@ body:             |
     ; PAIR-NEXT: $vgpr3 = V_SUB_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr6 = V_MUL_F32_e32 $vgpr300, killed $vgpr0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: mixed_vgprs_low_and_hi_scr0
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr300 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = V_SUB_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 1, implicit-def $mode
+    ; LOWER-NEXT: $vgpr6 = V_MUL_F32_e32 $vgpr300, killed $vgpr0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 256, implicit-def $mode
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr300 = IMPLICIT_DEF
@@ -153,6 +197,18 @@ body:             |
     ; PAIR-NEXT: $vgpr3 = V_SUB_F32_e32 $vgpr1, $vgpr301, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr6 = V_MUL_F32_e32 $vgpr300, killed $vgpr0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: mixed_vgprs_low_and_hi_scr1
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr300 = IMPLICIT_DEF
+    ; LOWER-NEXT: S_SET_VGPR_MSB 4, implicit-def $mode
+    ; LOWER-NEXT: $vgpr3 = V_SUB_F32_e32 $vgpr1, $vgpr301, implicit $mode, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 1024, implicit-def $mode
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 1, implicit-def $mode
+    ; LOWER-NEXT: $vgpr6 = V_MUL_F32_e32 $vgpr300, killed $vgpr0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 256, implicit-def $mode
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr300 = IMPLICIT_DEF
@@ -178,6 +234,15 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr559 = V_SUB_F32_e32 killed $vgpr1, $vgpr1, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr303 = V_MUL_F32_e32 killed $vgpr0, $vgpr0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: mixed_vgprs_hi_and_hi_dst_different_msb
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: S_SET_VGPR_MSB 128, implicit-def $mode
+    ; LOWER-NEXT: $vgpr559 = V_SUB_F32_e32 killed $vgpr1, $vgpr1, implicit $mode, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 32832, implicit-def $mode
+    ; LOWER-NEXT: $vgpr303 = V_MUL_F32_e32 killed $vgpr0, $vgpr0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 16384, implicit-def $mode
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr559 = V_SUB_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
@@ -205,6 +270,17 @@ body:             |
     ; PAIR-NEXT: $vgpr812 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3 = V_SUB_F32_e32 $vgpr513, killed $vgpr1, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr6 = V_MUL_F32_e32 $vgpr812, killed $vgpr0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: mixed_vgprs_low_and_hi_scr0_different_msb
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr513 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr812 = IMPLICIT_DEF
+    ; LOWER-NEXT: S_SET_VGPR_MSB 2, implicit-def $mode
+    ; LOWER-NEXT: $vgpr3 = V_SUB_F32_e32 $vgpr513, killed $vgpr1, implicit $mode, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 515, implicit-def $mode
+    ; LOWER-NEXT: $vgpr6 = V_MUL_F32_e32 $vgpr812, killed $vgpr0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 768, implicit-def $mode
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr513 = IMPLICIT_DEF
@@ -235,6 +311,16 @@ body:             |
     ; PAIR-NEXT: $sgpr0 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3, $vgpr6 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 killed $sgpr0, $vgpr1, $vgpr300, $vgpr0, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_sgpr_src0
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr300 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: S_SET_VGPR_MSB 1, implicit-def $mode
+    ; LOWER-NEXT: $vgpr3, $vgpr6 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 killed $sgpr0, $vgpr1, $vgpr300, $vgpr0, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 256, implicit-def $mode
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr300 = IMPLICIT_DEF
@@ -264,6 +350,15 @@ body:             |
     ; PAIR-NEXT: $vgpr300 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3, $vgpr6 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 0, $vgpr1, $vgpr300, $vgpr0, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_imm_src0
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr300 = IMPLICIT_DEF
+    ; LOWER-NEXT: S_SET_VGPR_MSB 1, implicit-def $mode
+    ; LOWER-NEXT: $vgpr3, $vgpr6 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 0, $vgpr1, $vgpr300, $vgpr0, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 256, implicit-def $mode
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr300 = IMPLICIT_DEF
@@ -290,6 +385,12 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr2, $vgpr3 = V_DUAL_MOV_B32_e32_X_MAX_I32_e32_gfx1250 $vgpr0, $vgpr1, $vgpr1, implicit $exec, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_mov_max_i32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2, $vgpr3 = V_DUAL_MOV_B32_e32_X_MAX_I32_e32_gfx1250 $vgpr0, $vgpr1, $vgpr1, implicit $exec, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = V_MOV_B32_e32 $vgpr0, implicit $exec
@@ -315,6 +416,12 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr2, $vgpr3 = V_DUAL_MOV_B32_e32_X_MIN_I32_e32_gfx1250 $vgpr0, $vgpr1, $vgpr1, implicit $exec, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_mov_min_i32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2, $vgpr3 = V_DUAL_MOV_B32_e32_X_MIN_I32_e32_gfx1250 $vgpr0, $vgpr1, $vgpr1, implicit $exec, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = V_MOV_B32_e32 $vgpr0, implicit $exec
@@ -339,6 +446,12 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr2 = V_MAX_I32_e32 killed $vgpr0, $vgpr0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr3 = V_MAX_I32_e32 killed $vgpr1, $vgpr1, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_max_i32_max_i32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = V_MAX_I32_e32 killed $vgpr0, $vgpr0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr3 = V_MAX_I32_e32 killed $vgpr1, $vgpr1, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = V_MAX_I32_e32 $vgpr0, $vgpr0, implicit $mode, implicit $exec
@@ -362,6 +475,12 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr2 = V_MIN_I32_e32 killed $vgpr0, $vgpr0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr3 = V_MIN_I32_e32 killed $vgpr1, $vgpr1, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_min_i32_min_i32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = V_MIN_I32_e32 killed $vgpr0, $vgpr0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr3 = V_MIN_I32_e32 killed $vgpr1, $vgpr1, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = V_MIN_I32_e32 $vgpr0, $vgpr0, implicit $mode, implicit $exec
@@ -386,6 +505,12 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr2, $vgpr3 = V_DUAL_MOV_B32_e32_X_SUB_U32_e32_gfx1250 $vgpr0, $vgpr1, $vgpr1, implicit $exec, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_mov_sub_nc_i32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2, $vgpr3 = V_DUAL_MOV_B32_e32_X_SUB_U32_e32_gfx1250 $vgpr0, $vgpr1, $vgpr1, implicit $exec, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = V_MOV_B32_e32 $vgpr0, implicit $exec
@@ -411,6 +536,12 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr2, $vgpr3 = V_DUAL_MOV_B32_e32_X_LSHRREV_B32_e32_gfx1250 $vgpr0, $vgpr1, $vgpr1, implicit $exec, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_mov_lshrrev_b32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2, $vgpr3 = V_DUAL_MOV_B32_e32_X_LSHRREV_B32_e32_gfx1250 $vgpr0, $vgpr1, $vgpr1, implicit $exec, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = V_MOV_B32_e32 $vgpr0, implicit $exec
@@ -436,6 +567,12 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr2, $vgpr3 = V_DUAL_MOV_B32_e32_X_ASHRREV_I32_e32_gfx1250 $vgpr0, $vgpr1, $vgpr1, implicit $exec, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_mov_ashrrev_i32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2, $vgpr3 = V_DUAL_MOV_B32_e32_X_ASHRREV_I32_e32_gfx1250 $vgpr0, $vgpr1, $vgpr1, implicit $exec, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = V_MOV_B32_e32 $vgpr0, implicit $exec
@@ -464,6 +601,14 @@ body:             |
     ; PAIR-NEXT: $vgpr3 = V_SUB_F32_e32 $vgpr0, $vgpr1, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr6 = V_MUL_F32_e32 killed $vgpr0, killed $vgpr5, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_same_vgprs_banks
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = V_SUB_F32_e32 $vgpr0, $vgpr1, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr6 = V_MUL_F32_e32 killed $vgpr0, killed $vgpr5, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr5 = IMPLICIT_DEF
@@ -490,6 +635,12 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3, $vgpr6 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 $vgpr0, $vgpr1, $vgpr0, $vgpr1, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_same_vgprs
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3, $vgpr6 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 $vgpr0, $vgpr1, $vgpr0, $vgpr1, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = V_SUB_F32_e32 $vgpr0, $vgpr1, implicit $mode, implicit $exec
@@ -515,6 +666,12 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3, $vgpr5 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_e96_gfx1250 0, $vgpr1, 0, $vgpr1, 0, $vgpr0, 0, $vgpr0, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_same_dst_parity
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3, $vgpr5 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_e96_gfx1250 0, $vgpr1, 0, $vgpr1, 0, $vgpr0, 0, $vgpr0, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = V_SUB_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
@@ -540,6 +697,12 @@ body:             |
     ; PAIR-NEXT: $sgpr0 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr1, $vgpr2 = V_DUAL_FMAAK_F32_X_MOV_B32_e32_gfx1250 killed $sgpr0, $vgpr0, 981467136, $vgpr0, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, $vgpr0, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_x_fmaak
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1, $vgpr2 = V_DUAL_FMAAK_F32_X_MOV_B32_e32_gfx1250 killed $sgpr0, $vgpr0, 981467136, $vgpr0, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, $vgpr0, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $sgpr0 = IMPLICIT_DEF
     $vgpr1 = V_FMAAK_F32 $sgpr0, $vgpr0, 981467136, implicit $mode, implicit $exec
@@ -565,6 +728,12 @@ body:             |
     ; PAIR-NEXT: $sgpr0 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr1, $vgpr2 = V_DUAL_MOV_B32_e32_X_FMAAK_F32_gfx1250 $vgpr0, killed $sgpr0, $vgpr0, 981467136, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, $vgpr0, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_y_fmaak
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1, $vgpr2 = V_DUAL_MOV_B32_e32_X_FMAAK_F32_gfx1250 $vgpr0, killed $sgpr0, $vgpr0, 981467136, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, $vgpr0, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $sgpr0 = IMPLICIT_DEF
     $vgpr1 = V_MOV_B32_e32 $vgpr0, implicit $exec
@@ -591,6 +760,13 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = V_FMAAK_F32 killed $sgpr0, $vgpr0, 981467136, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, $vgpr0, implicit $exec
     ; PAIR-NEXT: $vgpr3 = V_MOV_B32_e32 killed $vgpr0, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_x_fmaak_same_dst_parity
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = V_FMAAK_F32 killed $sgpr0, $vgpr0, 981467136, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; LOWER-NEXT: $vgpr3 = V_MOV_B32_e32 killed $vgpr0, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $sgpr0 = IMPLICIT_DEF
     $vgpr1 = V_FMAAK_F32 $sgpr0, $vgpr0, 981467136, implicit $mode, implicit $exec
@@ -617,6 +793,13 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = V_MOV_B32_e32 $vgpr0, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, $vgpr0, implicit $exec
     ; PAIR-NEXT: $vgpr3 = V_FMAAK_F32 killed $sgpr0, killed $vgpr0, 981467136, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_y_fmaak_same_dst_parity
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = V_MOV_B32_e32 $vgpr0, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, $vgpr0, implicit $exec
+    ; LOWER-NEXT: $vgpr3 = V_FMAAK_F32 killed $sgpr0, killed $vgpr0, 981467136, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $sgpr0 = IMPLICIT_DEF
     $vgpr1 = V_MOV_B32_e32 $vgpr0, implicit $exec
@@ -642,6 +825,12 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3, $vgpr6 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 12345, $vgpr1, $vgpr0, $vgpr0, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_literal_x
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3, $vgpr6 = V_DUAL_SUB_F32_e32_X_MUL_F32_e32_gfx1250 12345, $vgpr1, $vgpr0, $vgpr0, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = V_SUB_F32_e32 12345, $vgpr1, implicit $mode, implicit $exec
@@ -667,6 +856,12 @@ body:             |
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3, $vgpr6 = V_DUAL_MUL_F32_e32_X_SUB_F32_e32_gfx1250 $vgpr0, $vgpr0, 12345, $vgpr1, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_literal_y
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3, $vgpr6 = V_DUAL_MUL_F32_e32_X_SUB_F32_e32_gfx1250 $vgpr0, $vgpr0, 12345, $vgpr1, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = V_MUL_F32_e32 $vgpr0, $vgpr0, implicit $mode, implicit $exec
@@ -695,6 +890,13 @@ body:             |
     ; PAIR-NEXT: $vgpr3 = V_SUB_F32_e32 12345, $vgpr1, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_MUL_F32_e32 killed $vgpr0, $vgpr0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_literal_x
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = V_SUB_F32_e32 12345, $vgpr1, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_MUL_F32_e32 killed $vgpr0, $vgpr0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = V_SUB_F32_e32 12345, $vgpr1, implicit $mode, implicit $exec
@@ -721,6 +923,13 @@ body:             |
     ; PAIR-NEXT: $vgpr3 = V_MUL_F32_e32 $vgpr0, $vgpr0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_SUB_F32_e32 12345, killed $vgpr1, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_literal_y
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = V_MUL_F32_e32 $vgpr0, $vgpr0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e32 killed $vgpr0, $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_SUB_F32_e32 12345, killed $vgpr1, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = V_MUL_F32_e32 $vgpr0, $vgpr0, implicit $mode, implicit $exec
@@ -750,6 +959,14 @@ body:             |
     ; PAIR-NEXT: $vgpr3 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr4, $vgpr6 = V_DUAL_ADD_U32_e32_X_ADD_F32_e32_e96_gfx1250 $vgpr0, $vgpr1, 0, killed $vgpr2, 0, killed $vgpr3, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_add_u32_add_f32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4, $vgpr6 = V_DUAL_ADD_U32_e32_X_ADD_F32_e32_e96_gfx1250 $vgpr0, $vgpr1, 0, killed $vgpr2, 0, killed $vgpr3, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -781,6 +998,14 @@ body:             |
     ; PAIR-NEXT: $vgpr3 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr4 = V_DUAL_ADD_F32_e32_X_ADD_U32_e32_e96_gfx1250 0, killed $vgpr2, 0, killed $vgpr3, $vgpr0, $vgpr1, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_add_f32_add_u32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr4 = V_DUAL_ADD_F32_e32_X_ADD_U32_e32_e96_gfx1250 0, killed $vgpr2, 0, killed $vgpr3, $vgpr0, $vgpr1, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -810,6 +1035,13 @@ body:             |
     ; PAIR-NEXT: $vgpr2 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr5, $vgpr4 = V_DUAL_ADD_F32_e32_X_ADD_U32_e32_gfx1250 killed $vgpr2, killed $vgpr3, killed $vgpr0, killed $vgpr1, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_add_u32_add_f32_same_dst_parity
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5, $vgpr4 = V_DUAL_ADD_F32_e32_X_ADD_U32_e32_gfx1250 killed $vgpr2, killed $vgpr3, killed $vgpr0, killed $vgpr1, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -838,6 +1070,13 @@ body:             |
     ; PAIR-NEXT: $vgpr2 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr5, $vgpr4 = V_DUAL_ADD_F32_e32_X_ADD_U32_e32_gfx1250 killed $vgpr2, killed $vgpr3, killed $vgpr0, killed $vgpr1, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_add_f32_add_u32_same_dst_parity
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5, $vgpr4 = V_DUAL_ADD_F32_e32_X_ADD_U32_e32_gfx1250 killed $vgpr2, killed $vgpr3, killed $vgpr0, killed $vgpr1, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -866,6 +1105,13 @@ body:             |
     ; PAIR-NEXT: $vgpr2 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr4, $vgpr6 = V_DUAL_LSHLREV_B32_e32_X_LSHLREV_B32_e32_e96_gfx1250 killed $vgpr0, killed $vgpr1, killed $vgpr2, killed $vgpr3, implicit $exec, implicit $exec, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_lshl_lshl
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4, $vgpr6 = V_DUAL_LSHLREV_B32_e32_X_LSHLREV_B32_e32_e96_gfx1250 killed $vgpr0, killed $vgpr1, killed $vgpr2, killed $vgpr3, implicit $exec, implicit $exec, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -894,6 +1140,13 @@ body:             |
     ; PAIR-NEXT: $vgpr2 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr4, $vgpr5 = V_DUAL_ASHRREV_I32_e32_X_ASHRREV_I32_e32_e96_gfx1250 killed $vgpr0, killed $vgpr1, killed $vgpr2, killed $vgpr3, implicit $exec, implicit $exec, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_ashr_ashr
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4, $vgpr5 = V_DUAL_ASHRREV_I32_e32_X_ASHRREV_I32_e32_e96_gfx1250 killed $vgpr0, killed $vgpr1, killed $vgpr2, killed $vgpr3, implicit $exec, implicit $exec, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -922,6 +1175,15 @@ body:             |
     ; PAIR-NEXT: $vgpr302 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr303 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr304, $vgpr305 = V_DUAL_LSHRREV_B32_e32_X_LSHRREV_B32_e32_e96_gfx1250 $vgpr300, $vgpr301, $vgpr302, $vgpr303, implicit $exec, implicit $exec, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_lshr_lshr
+    ; LOWER: $vgpr300 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr301 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr302 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr303 = IMPLICIT_DEF
+    ; LOWER-NEXT: S_SET_VGPR_MSB 69, implicit-def $mode
+    ; LOWER-NEXT: $vgpr304, $vgpr305 = V_DUAL_LSHRREV_B32_e32_X_LSHRREV_B32_e32_e96_gfx1250 $vgpr300, $vgpr301, $vgpr302, $vgpr303, implicit $exec, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: S_SET_VGPR_MSB 17664, implicit-def $mode
     $vgpr300 = IMPLICIT_DEF
     $vgpr301 = IMPLICIT_DEF
     $vgpr302 = IMPLICIT_DEF
@@ -950,6 +1212,13 @@ body:             |
     ; PAIR-NEXT: $vgpr2 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr4, $vgpr5 = V_DUAL_SUB_U32_e32_X_SUB_U32_e32_e96_gfx1250 killed $vgpr0, killed $vgpr1, killed $vgpr2, killed $vgpr3, implicit $exec, implicit $exec, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_sub_u32_sub_u32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4, $vgpr5 = V_DUAL_SUB_U32_e32_X_SUB_U32_e32_e96_gfx1250 killed $vgpr0, killed $vgpr1, killed $vgpr2, killed $vgpr3, implicit $exec, implicit $exec, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -977,6 +1246,13 @@ body:             |
     ; PAIR-NEXT: $vgpr2 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr4 = V_SUB_U32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_SUB_U32_e32 300, killed $vgpr2, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_sub_u32_sub_u32_lit
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = V_SUB_U32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_SUB_U32_e32 300, killed $vgpr2, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1004,6 +1280,13 @@ body:             |
     ; PAIR-NEXT: $vgpr0 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr2, $vgpr3 = V_DUAL_FMAC_F32_e32_X_FMAC_F32_e32_gfx1250 $vgpr1, $vgpr1, killed $vgpr2, killed $vgpr1, $vgpr1, killed $vgpr3, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fmac_fmac
+    ; LOWER: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2, $vgpr3 = V_DUAL_FMAC_F32_e32_X_FMAC_F32_e32_gfx1250 $vgpr1, $vgpr1, killed $vgpr2, killed $vgpr1, $vgpr1, killed $vgpr3, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1032,6 +1315,13 @@ body:             |
     ; PAIR-NEXT: $vgpr0 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr2, $vgpr4 = V_DUAL_FMAC_F32_e32_X_FMAC_F32_e32_e96_gfx1250 0, $vgpr1, 0, $vgpr1, killed $vgpr2, 0, killed $vgpr1, 0, $vgpr1, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fmac_fmac_same_dst_parity
+    ; LOWER: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2, $vgpr4 = V_DUAL_FMAC_F32_e32_X_FMAC_F32_e32_e96_gfx1250 0, $vgpr1, 0, $vgpr1, killed $vgpr2, 0, killed $vgpr1, 0, $vgpr1, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1059,6 +1349,13 @@ body:             |
     ; PAIR-NEXT: $vgpr0 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr2 = V_FMAC_F32_e32 $vgpr1, $vgpr1, killed $vgpr2, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr2 = V_FMAC_F32_e32 killed $vgpr1, $vgpr1, killed $vgpr2, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_fmac_fmac_same_dst
+    ; LOWER: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = V_FMAC_F32_e32 $vgpr1, $vgpr1, killed $vgpr2, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr2 = V_FMAC_F32_e32 killed $vgpr1, $vgpr1, killed $vgpr2, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1083,6 +1380,12 @@ body:             |
     ; PAIR-NEXT: $vgpr2 = V_ADD_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr0 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr2 = V_ADD_F32_e32 killed $vgpr1, $vgpr1, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_add_f32_fadd_f32_same_dst
+    ; LOWER: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = V_ADD_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = V_ADD_F32_e32 killed $vgpr1, $vgpr1, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = V_ADD_F32_e32 $vgpr1, $vgpr1, implicit $mode, implicit $exec
@@ -1113,6 +1416,15 @@ body:             |
     ; PAIR-NEXT: $vgpr8_vgpr9 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr4_vgpr5, $vgpr6 = V_DUAL_ADD_F64_pseudo_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr8_vgpr9, 0, killed $vgpr2, 0, killed $vgpr3, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_add_f64_add_f32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4_vgpr5, $vgpr6 = V_DUAL_ADD_F64_pseudo_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr8_vgpr9, 0, killed $vgpr2, 0, killed $vgpr3, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1147,6 +1459,15 @@ body:             |
     ; PAIR-NEXT: $vgpr8_vgpr9 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr4_vgpr5, $vgpr6 = V_DUAL_ADD_F64_pseudo_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr8_vgpr9, 0, killed $vgpr2, 0, killed $vgpr3, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_add_f32_add_f64
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4_vgpr5, $vgpr6 = V_DUAL_ADD_F64_pseudo_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr8_vgpr9, 0, killed $vgpr2, 0, killed $vgpr3, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1184,6 +1505,17 @@ body:             |
     ; PAIR-NEXT: $vgpr4_vgpr5 = V_ADD_F64_pseudo_e32 $vgpr0_vgpr1, killed $vgpr8_vgpr9, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e32 $vgpr0, $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr6_vgpr7 = V_ADD_F64_pseudo_e32 killed $vgpr0_vgpr1, killed $vgpr10_vgpr11, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_add_f64_add_f64
+    ; LOWER: $vgpr8_vgpr9 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr10_vgpr11 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4_vgpr5 = V_ADD_F64_pseudo_e32 $vgpr0_vgpr1, killed $vgpr8_vgpr9, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e32 $vgpr0, $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr6_vgpr7 = V_ADD_F64_pseudo_e32 killed $vgpr0_vgpr1, killed $vgpr10_vgpr11, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1220,6 +1552,16 @@ body:             |
     ; PAIR-NEXT: $vgpr3 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr7 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_ADD_F32_e32 killed $vgpr2, killed $vgpr3, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_add_f64_add_f32_overlapping_dst
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4_vgpr5 = V_ADD_F64_pseudo_e32 $vgpr0_vgpr1, killed $vgpr8_vgpr9, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr7 = V_BFM_B32_e32 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_ADD_F32_e32 killed $vgpr2, killed $vgpr3, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1251,6 +1593,14 @@ body:             |
     ; PAIR-NEXT: $vgpr4_vgpr5 = V_ADD_F64_pseudo_e32 killed $vgpr0_vgpr1, killed $vgpr10_vgpr11, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr2 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6 = V_ADD_F32_e32 killed $vgpr2, killed $vgpr5, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_add_f64_add_f32_overlapping_src_sub1
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr10_vgpr11 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4_vgpr5 = V_ADD_F64_pseudo_e32 killed $vgpr0_vgpr1, killed $vgpr10_vgpr11, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_ADD_F32_e32 killed $vgpr2, killed $vgpr5, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1280,6 +1630,14 @@ body:             |
     ; PAIR-NEXT: $vgpr4_vgpr5 = V_ADD_F64_pseudo_e32 killed $vgpr0_vgpr1, killed $vgpr10_vgpr11, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr2 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6 = V_ADD_F32_e32 killed $vgpr2, killed $vgpr4, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_add_f64_add_f32_overlapping_src_sub0
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr10_vgpr11 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4_vgpr5 = V_ADD_F64_pseudo_e32 killed $vgpr0_vgpr1, killed $vgpr10_vgpr11, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_ADD_F32_e32 killed $vgpr2, killed $vgpr4, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1314,6 +1672,16 @@ body:             |
     ; PAIR-NEXT: $vgpr5 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_FMA_F32_e64_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_fma
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_FMA_F32_e64_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1352,6 +1720,17 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr10, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_fma_fma_bank_conflict_src2
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr10 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr10, 0, 0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1387,6 +1766,15 @@ body:             |
     ; PAIR-NEXT: $vgpr4 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_FMA_F32_e64_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, killed $vgpr3, 0, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_add_f32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_FMA_F32_e64_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, killed $vgpr3, 0, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1421,6 +1809,15 @@ body:             |
     ; PAIR-NEXT: $vgpr4 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr7, $vgpr6 = V_DUAL_ADD_F32_e32_X_FMA_F32_e64_e96_gfx1250 0, killed $vgpr3, 0, killed $vgpr4, 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_add_f32_fma
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr7, $vgpr6 = V_DUAL_ADD_F32_e32_X_FMA_F32_e64_e96_gfx1250 0, killed $vgpr3, 0, killed $vgpr4, 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1455,6 +1852,15 @@ body:             |
     ; PAIR-NEXT: $vgpr10_vgpr11 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr8_vgpr9, $vgpr6 = V_DUAL_ADD_F64_pseudo_e32_X_FMA_F32_e64_e96_gfx1250 0, killed $vgpr2_vgpr3, 0, killed $vgpr10_vgpr11, 0, $vgpr0, 0, $vgpr1, 0, $vgpr2, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_add_f64
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr10_vgpr11 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9, $vgpr6 = V_DUAL_ADD_F64_pseudo_e32_X_FMA_F32_e64_e96_gfx1250 0, killed $vgpr2_vgpr3, 0, killed $vgpr10_vgpr11, 0, $vgpr0, 0, $vgpr1, 0, $vgpr2, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1492,6 +1898,17 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = V_FMA_F32_e64 3, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_fma_src0_mod_fma
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_FMA_F32_e64 3, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1530,6 +1947,17 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 2, killed $vgpr4, 0, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_fma_fma_src1_mod
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 2, killed $vgpr4, 0, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1568,6 +1996,17 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 3, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_fma_fma_src2_mod
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 3, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1606,6 +2045,17 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 1, 0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_fma_clamp_fma
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 1, 0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1644,6 +2094,17 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, 0, 1, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_fma_fma_omod
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, 0, 1, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1681,6 +2142,16 @@ body:             |
     ; PAIR-NEXT: $vgpr5 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_FMA_F32_e64_X_FMA_F32_e64_e96_gfx1250 1, $vgpr0, 1, $vgpr1, 1, killed $vgpr2, 1, killed $vgpr3, 1, killed $vgpr4, 1, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_fma_neg
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_FMA_F32_e64_X_FMA_F32_e64_e96_gfx1250 1, $vgpr0, 1, $vgpr1, 1, killed $vgpr2, 1, killed $vgpr3, 1, killed $vgpr4, 1, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1718,6 +2189,16 @@ body:             |
     ; PAIR-NEXT: $vgpr5 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_FMA_F32_e64_X_FMA_F32_e64_e96_gfx1250 1, $sgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $sgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_fma_src0_neg
+    ; LOWER: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_FMA_F32_e64_X_FMA_F32_e64_e96_gfx1250 1, $sgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $sgpr0, killed $vgpr1, implicit $exec
     $sgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1755,6 +2236,16 @@ body:             |
     ; PAIR-NEXT: $vgpr5 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_FMA_F32_e64_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, killed $vgpr3, 1, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_fma_src1_neg
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_FMA_F32_e64_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, killed $vgpr3, 1, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1792,6 +2283,16 @@ body:             |
     ; PAIR-NEXT: $vgpr5 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_FMA_F32_e64_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0, 0, $vgpr1, 1, killed $vgpr2, 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_fma_src2_neg
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_FMA_F32_e64_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0, 0, $vgpr1, 1, killed $vgpr2, 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1829,6 +2330,16 @@ body:             |
     ; PAIR-NEXT: $vgpr8 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr10_vgpr11, $vgpr9 = V_DUAL_FMA_F64_e64_X_FMA_F32_e64_e96_gfx1250 1, $vgpr0_vgpr1, 1, killed $vgpr2_vgpr3, 1, killed $vgpr4_vgpr5, 0, killed $vgpr6, 0, killed $vgpr8, 0, killed $vgpr7, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr12 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_f64_fma_f32_neg
+    ; LOWER: $vgpr0_vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4_vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr7 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr10_vgpr11, $vgpr9 = V_DUAL_FMA_F64_e64_X_FMA_F32_e64_e96_gfx1250 1, $vgpr0_vgpr1, 1, killed $vgpr2_vgpr3, 1, killed $vgpr4_vgpr5, 0, killed $vgpr6, 0, killed $vgpr8, 0, killed $vgpr7, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr12 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0_vgpr1 = IMPLICIT_DEF
     $vgpr2_vgpr3 = IMPLICIT_DEF
     $vgpr4_vgpr5 = IMPLICIT_DEF
@@ -1867,6 +2378,17 @@ body:             |
     ; PAIR-NEXT: $vgpr6_vgpr7 = V_LSHL_ADD_U64_e64 $vgpr0_vgpr1, $vgpr1, $vgpr2_vgpr3, implicit $exec
     ; PAIR-NEXT: $vgpr9 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr2, 0, killed $vgpr4, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_lshl_add_u64_fma
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6_vgpr7 = V_LSHL_ADD_U64_e64 $vgpr0_vgpr1, $vgpr1, $vgpr2_vgpr3, implicit $exec
+    ; LOWER-NEXT: $vgpr9 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr2, 0, killed $vgpr4, 0, 0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1905,6 +2427,17 @@ body:             |
     ; PAIR-NEXT: $vgpr8 = V_FMA_F32_e64 0, $vgpr3, 0, $vgpr2, 0, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr9 = V_BFM_B32_e64 $vgpr0, $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr6_vgpr7 = V_LSHL_ADD_U64_e64 killed $vgpr0_vgpr1, $vgpr1, killed $vgpr2_vgpr3, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_lshl_add_u64
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8 = V_FMA_F32_e64 0, $vgpr3, 0, $vgpr2, 0, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr9 = V_BFM_B32_e64 $vgpr0, $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr6_vgpr7 = V_LSHL_ADD_U64_e64 killed $vgpr0_vgpr1, $vgpr1, killed $vgpr2_vgpr3, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1941,6 +2474,16 @@ body:             |
     ; PAIR-NEXT: $vgpr6_vgpr7 = V_LSHL_ADD_U64_e64 $vgpr0_vgpr1, $vgpr1, $vgpr2_vgpr3, implicit $exec
     ; PAIR-NEXT: $vgpr9 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, $vgpr3, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_lshl_add_u64_fma_overlapping_src2
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6_vgpr7 = V_LSHL_ADD_U64_e64 $vgpr0_vgpr1, $vgpr1, $vgpr2_vgpr3, implicit $exec
+    ; LOWER-NEXT: $vgpr9 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, $vgpr3, 0, 0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -1978,6 +2521,17 @@ body:             |
     ; PAIR-NEXT: $vgpr6_vgpr7 = V_LSHL_ADD_U64_e64 $vgpr0_vgpr1, killed $vgpr5, $vgpr2_vgpr3, implicit $exec
     ; PAIR-NEXT: $vgpr9 = V_BFM_B32_e64 killed $vgpr0, $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_FMA_F32_e64 0, killed $vgpr1, 0, killed $vgpr3, 0, killed $vgpr4, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_lshl_add_u64_fma_src0_conflict
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6_vgpr7 = V_LSHL_ADD_U64_e64 $vgpr0_vgpr1, killed $vgpr5, $vgpr2_vgpr3, implicit $exec
+    ; LOWER-NEXT: $vgpr9 = V_BFM_B32_e64 killed $vgpr0, $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_FMA_F32_e64 0, killed $vgpr1, 0, killed $vgpr3, 0, killed $vgpr4, 0, 0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2016,6 +2570,17 @@ body:             |
     ; PAIR-NEXT: $vgpr6_vgpr7 = V_LSHL_ADD_U64_e64 $vgpr0_vgpr1, $vgpr5, $vgpr2_vgpr3, implicit $exec
     ; PAIR-NEXT: $vgpr9 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr5, 0, killed $vgpr4, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_lshl_add_u64_fma_src1_conflict
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6_vgpr7 = V_LSHL_ADD_U64_e64 $vgpr0_vgpr1, $vgpr5, $vgpr2_vgpr3, implicit $exec
+    ; LOWER-NEXT: $vgpr9 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr5, 0, killed $vgpr4, 0, 0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2053,6 +2618,16 @@ body:             |
     ; PAIR-NEXT: $vgpr8 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr10_vgpr11, $vgpr9 = V_DUAL_FMA_F64_e64_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 0, killed $vgpr4_vgpr5, 0, killed $vgpr6, 0, killed $vgpr8, 0, killed $vgpr7, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr12 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_f64_fma_f32
+    ; LOWER: $vgpr0_vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4_vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr7 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr10_vgpr11, $vgpr9 = V_DUAL_FMA_F64_e64_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 0, killed $vgpr4_vgpr5, 0, killed $vgpr6, 0, killed $vgpr8, 0, killed $vgpr7, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr12 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0_vgpr1 = IMPLICIT_DEF
     $vgpr2_vgpr3 = IMPLICIT_DEF
     $vgpr4_vgpr5 = IMPLICIT_DEF
@@ -2089,6 +2664,16 @@ body:             |
     ; PAIR-NEXT: $vgpr10_vgpr11 = V_FMA_F64_e64 0, $vgpr0_vgpr1, 0, $vgpr2_vgpr3, 0, killed $vgpr4_vgpr5, 0, 0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr12 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr9 = V_FMA_F32_e64 0, killed $vgpr6, 0, killed $vgpr3, 0, killed $vgpr7, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_fma_f64_fma_f32_overlapping_src1
+    ; LOWER: $vgpr0_vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4_vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr7 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr10_vgpr11 = V_FMA_F64_e64 0, $vgpr0_vgpr1, 0, $vgpr2_vgpr3, 0, killed $vgpr4_vgpr5, 0, 0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr12 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr9 = V_FMA_F32_e64 0, killed $vgpr6, 0, killed $vgpr3, 0, killed $vgpr7, 0, 0, implicit $mode, implicit $exec
     $vgpr0_vgpr1 = IMPLICIT_DEF
     $vgpr2_vgpr3 = IMPLICIT_DEF
     $vgpr4_vgpr5 = IMPLICIT_DEF
@@ -2123,6 +2708,15 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_ADD_F64_pseudo_e32_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 0, killed $vgpr6, 0, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr10 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_f32_add_f64_e32
+    ; LOWER: $vgpr0_vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_ADD_F64_pseudo_e32_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 0, killed $vgpr6, 0, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr10 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0_vgpr1 = IMPLICIT_DEF
     $vgpr2_vgpr3 = IMPLICIT_DEF
     $vgpr4 = IMPLICIT_DEF
@@ -2157,6 +2751,15 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_ADD_F64_pseudo_e32_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 0, killed $vgpr6, 0, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr10 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_f32_add_f64_e64
+    ; LOWER: $vgpr0_vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_ADD_F64_pseudo_e32_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 0, killed $vgpr6, 0, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr10 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0_vgpr1 = IMPLICIT_DEF
     $vgpr2_vgpr3 = IMPLICIT_DEF
     $vgpr4 = IMPLICIT_DEF
@@ -2191,6 +2794,15 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_ADD_F64_pseudo_e32_X_FMA_F32_e64_e96_gfx1250 1, $vgpr0_vgpr1, 1, killed $vgpr2_vgpr3, 0, killed $vgpr6, 0, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr10 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_f32_add_f64_e64_neg
+    ; LOWER: $vgpr0_vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_ADD_F64_pseudo_e32_X_FMA_F32_e64_e96_gfx1250 1, $vgpr0_vgpr1, 1, killed $vgpr2_vgpr3, 0, killed $vgpr6, 0, killed $vgpr4, 0, killed $vgpr5, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr10 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0_vgpr1 = IMPLICIT_DEF
     $vgpr2_vgpr3 = IMPLICIT_DEF
     $vgpr4 = IMPLICIT_DEF
@@ -2225,6 +2837,15 @@ body:             |
     ; PAIR-NEXT: $vgpr4 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr5, $vgpr6 = V_DUAL_FMA_F32_e64_X_BITOP2_B32_e64_e96_gfx1250 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, killed $vgpr3, killed $vgpr4, 123, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_bitop
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5, $vgpr6 = V_DUAL_FMA_F32_e64_X_BITOP2_B32_e64_e96_gfx1250 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, killed $vgpr3, killed $vgpr4, 123, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2260,6 +2881,15 @@ body:             |
     ; PAIR-NEXT: $vgpr4 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr5, $vgpr6 = V_DUAL_FMA_F32_e64_X_BITOP2_B32_e64_e96_gfx1250 0, $sgpr0, 0, $vgpr1, 0, killed $vgpr2, killed $sgpr3, killed $vgpr4, 123, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_BFM_B32_e64 killed $sgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fma_bitop_2_scalar_src
+    ; LOWER: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5, $vgpr6 = V_DUAL_FMA_F32_e64_X_BITOP2_B32_e64_e96_gfx1250 0, $sgpr0, 0, $vgpr1, 0, killed $vgpr2, killed $sgpr3, killed $vgpr4, 123, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_BFM_B32_e64 killed $sgpr0, killed $vgpr1, implicit $exec
     $sgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2290,6 +2920,13 @@ body:             |
     ; PAIR-NEXT: $vgpr2 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr5, $vgpr3 = V_DUAL_MOV_B32_e32_X_BITOP2_B32_e64_e96_gfx1250 killed $vgpr2, $vgpr0, $vgpr1, 20, implicit $exec, implicit $exec, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_bitop_mov_b32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5, $vgpr3 = V_DUAL_MOV_B32_e32_X_BITOP2_B32_e64_e96_gfx1250 killed $vgpr2, $vgpr0, $vgpr1, 20, implicit $exec, implicit $exec, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2319,6 +2956,14 @@ body:             |
     ; PAIR-NEXT: $vgpr5 = V_MOV_B32_e32 $vgpr2, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e64 $vgpr0, $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr3 = V_BITOP3_B32_e64 killed $vgpr0, killed $vgpr1, killed $vgpr2, 20, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_mov_b32_bitop_non_imm_src2
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = V_MOV_B32_e32 $vgpr2, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e64 $vgpr0, $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr3 = V_BITOP3_B32_e64 killed $vgpr0, killed $vgpr1, killed $vgpr2, 20, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2348,6 +2993,14 @@ body:             |
     ; PAIR-NEXT: $vgpr5 = V_MOV_B32_e32 killed $vgpr2, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e64 $vgpr0, $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr3 = V_BITOP3_B32_e64 killed $vgpr0, killed $vgpr1, 1, 20, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_mov_b32_bitop_non_zero_src2
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = V_MOV_B32_e32 killed $vgpr2, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e64 $vgpr0, $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr3 = V_BITOP3_B32_e64 killed $vgpr0, killed $vgpr1, 1, 20, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2376,6 +3029,14 @@ body:             |
     ; PAIR-NEXT: renamable $vgpr1 = V_MOV_B32_dpp killed $vgpr1, $vgpr3, 258, 15, 15, 0, implicit $exec
     ; PAIR-NEXT: renamable $vgpr1 = V_BITOP3_B32_e64 killed $vgpr3, killed $vgpr4, killed $vgpr1, 128, implicit $exec
     ; PAIR-NEXT: renamable $vgpr3 = V_MOV_B32_e32 -1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_bitop3_mov_dpp_vgpr_src2
+    ; LOWER: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: renamable $vgpr1 = V_MOV_B32_dpp killed $vgpr1, $vgpr3, 258, 15, 15, 0, implicit $exec
+    ; LOWER-NEXT: renamable $vgpr1 = V_BITOP3_B32_e64 killed $vgpr3, killed $vgpr4, killed $vgpr1, 128, implicit $exec
+    ; LOWER-NEXT: renamable $vgpr3 = V_MOV_B32_e32 -1, implicit $exec
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = IMPLICIT_DEF
     $vgpr4 = IMPLICIT_DEF
@@ -2405,6 +3066,13 @@ body:             |
     ; PAIR-NEXT: $vgpr2 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3, $vgpr5 = V_DUAL_MOV_B32_e32_X_BITOP2_B32_e64_e96_gfx1250 $vgpr0, $vgpr1, killed $vgpr2, 84, implicit $exec, implicit $exec, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_mov_or
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3, $vgpr5 = V_DUAL_MOV_B32_e32_X_BITOP2_B32_e64_e96_gfx1250 $vgpr0, $vgpr1, killed $vgpr2, 84, implicit $exec, implicit $exec, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2433,6 +3101,13 @@ body:             |
     ; PAIR-NEXT: $vgpr2 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3, $vgpr5 = V_DUAL_MOV_B32_e32_X_BITOP2_B32_e64_e96_gfx1250 $vgpr0, $vgpr1, killed $vgpr2, 64, implicit $exec, implicit $exec, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_mov_and
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3, $vgpr5 = V_DUAL_MOV_B32_e32_X_BITOP2_B32_e64_e96_gfx1250 $vgpr0, $vgpr1, killed $vgpr2, 64, implicit $exec, implicit $exec, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2461,6 +3136,13 @@ body:             |
     ; PAIR-NEXT: $vgpr2 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3, $vgpr5 = V_DUAL_MOV_B32_e32_X_BITOP2_B32_e64_e96_gfx1250 $vgpr0, $vgpr1, killed $vgpr2, 20, implicit $exec, implicit $exec, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_mov_xor
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3, $vgpr5 = V_DUAL_MOV_B32_e32_X_BITOP2_B32_e64_e96_gfx1250 $vgpr0, $vgpr1, killed $vgpr2, 20, implicit $exec, implicit $exec, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2489,6 +3171,13 @@ body:             |
     ; PAIR-NEXT: $vgpr2 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr3, $vgpr5 = V_DUAL_MOV_B32_e32_X_BITOP2_B32_e64_e96_gfx1250 $vgpr0, $vgpr1, killed $vgpr2, 65, implicit $exec, implicit $exec, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_mov_xnor
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3, $vgpr5 = V_DUAL_MOV_B32_e32_X_BITOP2_B32_e64_e96_gfx1250 $vgpr0, $vgpr1, killed $vgpr2, 65, implicit $exec, implicit $exec, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2519,6 +3208,13 @@ body:             |
     ; PAIR-NEXT: $vgpr3 = V_MOV_B32_e32 $vgpr0, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_NOT_B32_e32 killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_mov_not
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = V_MOV_B32_e32 $vgpr0, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_NOT_B32_e32 killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = V_MOV_B32_e32 $vgpr0, implicit $exec
@@ -2547,6 +3243,14 @@ body:             |
     ; PAIR-NEXT: $vgpr3 = V_ADD_F32_e32 $vgpr0, $vgpr1, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_NOT_B32_e32 killed $vgpr2, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fadd_not
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = V_ADD_F32_e32 $vgpr0, $vgpr1, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr4 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_NOT_B32_e32 killed $vgpr2, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2576,6 +3280,14 @@ body:             |
     ; PAIR-NEXT: $vgpr8_vgpr9 = V_ADD_F64_pseudo_e32 $vgpr0_vgpr1, killed $vgpr2_vgpr3, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr10 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr11 = V_NOT_B32_e32 killed $vgpr6, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fadd_f64_not
+    ; LOWER: $vgpr0_vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9 = V_ADD_F64_pseudo_e32 $vgpr0_vgpr1, killed $vgpr2_vgpr3, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr10 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr11 = V_NOT_B32_e32 killed $vgpr6, implicit $exec
     $vgpr0_vgpr1 = IMPLICIT_DEF
     $vgpr2_vgpr3 = IMPLICIT_DEF
     $vgpr6 = IMPLICIT_DEF
@@ -2611,6 +3323,17 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, 1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_src1_imm
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, 1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2649,6 +3372,17 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, 1, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_src2_imm
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, 1, 0, 0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -2689,6 +3423,18 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, killed $sgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_src1_sgpr
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, killed $sgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, 0, 0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $sgpr1 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
@@ -2730,6 +3476,18 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $sgpr1, 0, 0, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_src2_sgpr
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_FMA_F32_e64 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr2, 0, 0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_FMA_F32_e64 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $sgpr1, 0, 0, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $sgpr1 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
@@ -2769,6 +3527,16 @@ body:             |
     ; PAIR-NEXT: $vgpr4 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_ADD_F32_e32_gfx1250 $vgpr0, $vgpr1, killed $vgpr3, killed $vgpr4, implicit $vcc_lo, implicit $exec, implicit $mode, implicit $exec, implicit killed $vcc_lo, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_cndmask_fadd
+    ; LOWER: liveins: $vcc_lo
+    ; LOWER-NEXT: {{  $}}
+    ; LOWER-NEXT: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_ADD_F32_e32_gfx1250 $vgpr0, $vgpr1, killed $vgpr3, killed $vgpr4, implicit $vcc_lo, implicit $exec, implicit $mode, implicit $exec, implicit killed $vcc_lo, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = IMPLICIT_DEF
@@ -2807,6 +3575,17 @@ body:             |
     ; PAIR-NEXT: $vgpr5 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0, 0, $vgpr1, $vcc_lo, 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, implicit $exec, implicit $mode, implicit $exec, implicit killed $vcc_lo, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_cndmask_fma
+    ; LOWER: liveins: $vcc_lo
+    ; LOWER-NEXT: {{  $}}
+    ; LOWER-NEXT: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr5 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_FMA_F32_e64_e96_gfx1250 0, $vgpr0, 0, $vgpr1, $vcc_lo, 0, killed $vgpr3, 0, killed $vgpr4, 0, killed $vgpr5, implicit $exec, implicit $mode, implicit $exec, implicit killed $vcc_lo, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = IMPLICIT_DEF
@@ -2841,6 +3620,15 @@ body:             |
     ; PAIR-NEXT: $vcc = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, killed $vcc_lo, 0, killed $vgpr3, 0, killed $vgpr4, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_cndmask_e64_vcc_fadd
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vcc = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, killed $vcc_lo, 0, killed $vgpr3, 0, killed $vgpr4, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = IMPLICIT_DEF
@@ -2875,6 +3663,15 @@ body:             |
     ; PAIR-NEXT: $sgpr0 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, killed $sgpr0, 0, killed $vgpr3, 0, killed $vgpr4, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_cndmask_e64_sgpr_fadd
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, killed $sgpr0, 0, killed $vgpr3, 0, killed $vgpr4, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = IMPLICIT_DEF
@@ -2909,6 +3706,15 @@ body:             |
     ; PAIR-NEXT: $vcc = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0, 1, $vgpr1, killed $vcc_lo, 0, killed $vgpr3, 0, killed $vgpr4, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_cndmask_e64_neg_vcc_fadd
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vcc = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0, 1, $vgpr1, killed $vcc_lo, 0, killed $vgpr3, 0, killed $vgpr4, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = IMPLICIT_DEF
@@ -2946,6 +3752,17 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = V_CNDMASK_B32_e64 0, killed $sgpr0, 0, $vgpr1, killed $vcc_lo, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_ADD_F32_e32 killed $sgpr3, killed $vgpr4, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_cndmask_e64_vcc_fadd_constant_bus_limit
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vcc = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_CNDMASK_B32_e64 0, killed $sgpr0, 0, $vgpr1, killed $vcc_lo, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_ADD_F32_e32 killed $sgpr3, killed $vgpr4, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $sgpr3 = IMPLICIT_DEF
@@ -2984,6 +3801,17 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = V_CNDMASK_B32_e64 0, $vgpr0, 0, killed $sgpr0, killed $vcc_lo, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_ADD_F32_e32 killed $sgpr3, killed $vgpr4, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_cndmask_e64_vcc_fadd_sgpr_src1
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vcc = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_CNDMASK_B32_e64 0, $vgpr0, 0, killed $sgpr0, killed $vcc_lo, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_ADD_F32_e32 killed $sgpr3, killed $vgpr4, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $sgpr3 = IMPLICIT_DEF
@@ -3024,6 +3852,17 @@ body:             |
     ; PAIR-NEXT: $sgpr0 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_CNDMASK_B32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, killed $sgpr0, 0, killed $vgpr3, 0, killed $vgpr4, $vcc_lo, implicit $exec, implicit $exec, implicit killed $vcc_lo, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_cndmask_e64_cndmask_e32
+    ; LOWER: liveins: $vcc_lo
+    ; LOWER-NEXT: {{  $}}
+    ; LOWER-NEXT: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_CNDMASK_B32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, killed $sgpr0, 0, killed $vgpr3, 0, killed $vgpr4, $vcc_lo, implicit $exec, implicit $exec, implicit killed $vcc_lo, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = IMPLICIT_DEF
@@ -3063,6 +3902,17 @@ body:             |
     ; PAIR-NEXT: $sgpr0 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_CNDMASK_B32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, $vcc_lo, 0, killed $vgpr3, 0, killed $vgpr4, killed $sgpr0, implicit $exec, implicit killed $vcc_lo, implicit $exec, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_cndmask_e32_cndmask_e64
+    ; LOWER: liveins: $vcc_lo
+    ; LOWER-NEXT: {{  $}}
+    ; LOWER-NEXT: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_CNDMASK_B32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, $vcc_lo, 0, killed $vgpr3, 0, killed $vgpr4, killed $sgpr0, implicit $exec, implicit killed $vcc_lo, implicit $exec, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = IMPLICIT_DEF
@@ -3102,6 +3952,17 @@ body:             |
     ; PAIR-NEXT: $sgpr0 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_CNDMASK_B32_e32_gfx1250 $vgpr0, $vgpr1, killed $vgpr3, killed $vgpr4, implicit $vcc_lo, implicit $exec, implicit $vcc_lo, implicit $exec, implicit killed $vcc_lo, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_cndmask_e32_cndmask_e32
+    ; LOWER: liveins: $vcc_lo
+    ; LOWER-NEXT: {{  $}}
+    ; LOWER-NEXT: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_CNDMASK_B32_e32_gfx1250 $vgpr0, $vgpr1, killed $vgpr3, killed $vgpr4, implicit $vcc_lo, implicit $exec, implicit $vcc_lo, implicit $exec, implicit killed $vcc_lo, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = IMPLICIT_DEF
@@ -3138,6 +3999,16 @@ body:             |
     ; PAIR-NEXT: $sgpr1 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_CNDMASK_B32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, killed $sgpr0, 0, killed $vgpr3, 0, killed $vgpr4, killed $sgpr1, implicit $exec, implicit $exec, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_cndmask_e64_cndmask_e64
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $sgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_CNDMASK_B32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, killed $sgpr0, 0, killed $vgpr3, 0, killed $vgpr4, killed $sgpr1, implicit $exec, implicit $exec, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = IMPLICIT_DEF
@@ -3171,6 +4042,14 @@ body:             |
     ; PAIR-NEXT: $vgpr3 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_ADD_F32_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr3, 0, killed $vgpr2, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fadd_e64_fadd_e64
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_ADD_F32_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0, 0, $vgpr1, 0, killed $vgpr3, 0, killed $vgpr2, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -3202,6 +4081,14 @@ body:             |
     ; PAIR-NEXT: $vgpr3 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_ADD_F32_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0, 1, $vgpr1, 0, killed $vgpr3, 0, killed $vgpr2, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_fadd_e64_neg_fadd_e32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_ADD_F32_e32_X_ADD_F32_e32_e96_gfx1250 0, $vgpr0, 1, $vgpr1, 0, killed $vgpr3, 0, killed $vgpr2, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -3234,6 +4121,15 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = V_ADD_F32_e64 0, $vgpr0, 3, $vgpr1, 0, 0, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     ; PAIR-NEXT: $vgpr7 = V_ADD_F32_e32 killed $vgpr3, killed $vgpr2, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_fadd_e64_abs_neg_fadd_e32
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = V_ADD_F32_e64 0, $vgpr0, 3, $vgpr1, 0, 0, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr8 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ; LOWER-NEXT: $vgpr7 = V_ADD_F32_e32 killed $vgpr3, killed $vgpr2, implicit $mode, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr2 = IMPLICIT_DEF
@@ -3265,6 +4161,14 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_MUL_F64_pseudo_e32_X_SUB_F32_e32_e96_gfx1250 1, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 0, killed $vgpr6, 1, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_mul_f64_e64_sub_f32_neg
+    ; LOWER: $vgpr0_vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_MUL_F64_pseudo_e32_X_SUB_F32_e32_e96_gfx1250 1, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 0, killed $vgpr6, 1, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0_vgpr1 = IMPLICIT_DEF
     $vgpr2_vgpr3 = IMPLICIT_DEF
     $vgpr4 = IMPLICIT_DEF
@@ -3296,6 +4200,14 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_MUL_F64_pseudo_e32_X_SUBREV_F32_e32_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 1, killed $vgpr6, 0, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_mul_f64_e32_subrev_f32_neg
+    ; LOWER: $vgpr0_vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_MUL_F64_pseudo_e32_X_SUBREV_F32_e32_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 1, killed $vgpr6, 0, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0_vgpr1 = IMPLICIT_DEF
     $vgpr2_vgpr3 = IMPLICIT_DEF
     $vgpr4 = IMPLICIT_DEF
@@ -3327,6 +4239,14 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_MIN_NUM_F64_e32_X_MUL_F32_e32_e96_gfx1250 1, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 0, killed $vgpr6, 1, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_min_num_f64_e64_mul_f32_neg
+    ; LOWER: $vgpr0_vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_MIN_NUM_F64_e32_X_MUL_F32_e32_e96_gfx1250 1, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 0, killed $vgpr6, 1, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0_vgpr1 = IMPLICIT_DEF
     $vgpr2_vgpr3 = IMPLICIT_DEF
     $vgpr4 = IMPLICIT_DEF
@@ -3358,6 +4278,14 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_MIN_NUM_F64_e32_X_MUL_LEGACY_F32_e32_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 1, killed $vgpr6, 0, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_min_num_f64_e32_mul_legacy_f32_neg
+    ; LOWER: $vgpr0_vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_MIN_NUM_F64_e32_X_MUL_LEGACY_F32_e32_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 1, killed $vgpr6, 0, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0_vgpr1 = IMPLICIT_DEF
     $vgpr2_vgpr3 = IMPLICIT_DEF
     $vgpr4 = IMPLICIT_DEF
@@ -3389,6 +4317,14 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_MAX_NUM_F64_e32_X_MIN_F32_e32_e96_gfx1250 0, $vgpr0_vgpr1, 1, killed $vgpr2_vgpr3, 1, killed $vgpr6, 0, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_max_num_f64_e64_min_f32_neg
+    ; LOWER: $vgpr0_vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_MAX_NUM_F64_e32_X_MIN_F32_e32_e96_gfx1250 0, $vgpr0_vgpr1, 1, killed $vgpr2_vgpr3, 1, killed $vgpr6, 0, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0_vgpr1 = IMPLICIT_DEF
     $vgpr2_vgpr3 = IMPLICIT_DEF
     $vgpr4 = IMPLICIT_DEF
@@ -3420,6 +4356,14 @@ body:             |
     ; PAIR-NEXT: $vgpr6 = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_MAX_NUM_F64_e32_X_MAX_F32_e32_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 0, killed $vgpr6, 1, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_max_num_f64_e32_max_f32_neg
+    ; LOWER: $vgpr0_vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr8_vgpr9, $vgpr7 = V_DUAL_MAX_NUM_F64_e32_X_MAX_F32_e32_e96_gfx1250 0, $vgpr0_vgpr1, 0, killed $vgpr2_vgpr3, 0, killed $vgpr6, 1, killed $vgpr4, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0_vgpr1 = IMPLICIT_DEF
     $vgpr2_vgpr3 = IMPLICIT_DEF
     $vgpr4 = IMPLICIT_DEF
@@ -3453,6 +4397,15 @@ body:             |
     ; PAIR-NEXT: $vgpr2_vgpr3, $vgpr6 = V_DUAL_ADD_F64_pseudo_e32_X_FMAC_F32_e32_e96_gfx1250 0, 10, 0, killed $vgpr10_vgpr11, 0, $vgpr0, 1, $vgpr1, killed $vgpr6, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
     ; PAIR-NEXT: $vgpr2_vgpr3 = V_MOV_B64_e32 10, implicit $exec
     ; PAIR-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_add_f64_fmac_f32_e64_neg
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr10_vgpr11 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr2_vgpr3, $vgpr6 = V_DUAL_ADD_F64_pseudo_e32_X_FMAC_F32_e32_e96_gfx1250 0, 10, 0, killed $vgpr10_vgpr11, 0, $vgpr0, 1, $vgpr1, killed $vgpr6, implicit $mode, implicit $exec, implicit $mode, implicit $exec, implicit $mode, implicit $exec
+    ; LOWER-NEXT: $vgpr2_vgpr3 = V_MOV_B64_e32 10, implicit $exec
+    ; LOWER-NEXT: $vgpr5 = V_BFM_B32_e64 killed $vgpr0, killed $vgpr1, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr6 = IMPLICIT_DEF
@@ -3485,6 +4438,14 @@ body:             |
     ; PAIR-NEXT: $vgpr4 = IMPLICIT_DEF
     ; PAIR-NEXT: $vcc = IMPLICIT_DEF
     ; PAIR-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_CNDMASK_B32_e32_e96_gfx1250 1, killed $vgpr0, 0, killed $vgpr1, $vcc_lo, 1, killed $vgpr3, 0, killed $vgpr4, killed $vcc_lo, implicit $exec, implicit $exec, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_combine_cndmask_e64_neg_cndmask_e64_neg
+    ; LOWER: $vgpr0 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr1 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr3 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr4 = IMPLICIT_DEF
+    ; LOWER-NEXT: $vcc = IMPLICIT_DEF
+    ; LOWER-NEXT: $vgpr6, $vgpr7 = V_DUAL_CNDMASK_B32_e32_X_CNDMASK_B32_e32_e96_gfx1250 1, killed $vgpr0, 0, killed $vgpr1, $vcc_lo, 1, killed $vgpr3, 0, killed $vgpr4, killed $vcc_lo, implicit $exec, implicit $exec, implicit $exec
     $vgpr0 = IMPLICIT_DEF
     $vgpr1 = IMPLICIT_DEF
     $vgpr3 = IMPLICIT_DEF
@@ -3511,6 +4472,12 @@ body: |
     ; PAIR-NEXT: {{  $}}
     ; PAIR-NEXT: $vgpr3 = V_MOV_B32_e32 0, implicit $exec
     ; PAIR-NEXT: $vgpr0 = V_ADD_F32_e64_dpp killed $vgpr0, 0, killed $vgpr2, 0, killed $vgpr1, 0, 1, 1, 15, 15, 1, implicit $mode, implicit $exec
+    ;
+    ; LOWER-LABEL: name: vopd_no_combine_dpp
+    ; LOWER: liveins: $vgpr0, $vgpr1, $vgpr2
+    ; LOWER-NEXT: {{  $}}
+    ; LOWER-NEXT: $vgpr3 = V_MOV_B32_e32 0, implicit $exec
+    ; LOWER-NEXT: $vgpr0 = V_ADD_F32_e64_dpp killed $vgpr0, 0, killed $vgpr2, 0, killed $vgpr1, 0, 1, 1, 15, 15, 1, implicit $mode, implicit $exec
       $vgpr3 = V_MOV_B32_e32 0, implicit $exec
       $vgpr0 = V_ADD_F32_e64_dpp $vgpr0, 0, $vgpr2, 0, $vgpr1, 0, 1, 1, 15, 15, 1, implicit $mode, implicit $exec
 ...
diff --git a/llvm/test/CodeGen/AMDGPU/waitcnt-debug.mir b/llvm/test/CodeGen/AMDGPU/waitcnt-debug.mir
index 0e3656b498d33..eeca220b9e260 100644
--- a/llvm/test/CodeGen/AMDGPU/waitcnt-debug.mir
+++ b/llvm/test/CodeGen/AMDGPU/waitcnt-debug.mir
@@ -1,4 +1,3 @@
-# REQUIRES: asserts
 # RUN: llc -mtriple=amdgcn -verify-machineinstrs -run-pass si-insert-waitcnts -debug-counter=si-insert-waitcnts-forcelgkm=0 -o - %s | FileCheck -check-prefixes=GCN,LGKM %s
 # RUN: llc -mtriple=amdgcn -verify-machineinstrs -run-pass si-insert-waitcnts -debug-counter=si-insert-waitcnts-forceexp=0-1 -o - %s | FileCheck -check-prefixes=GCN,EXP %s
 # RUN: llc -mtriple=amdgcn -verify-machineinstrs -run-pass si-insert-waitcnts -debug-counter=si-insert-waitcnts-forcevm=0-2 -o - %s | FileCheck -check-prefixes=GCN,VM %s
diff --git a/llvm/test/CodeGen/ARM/fminmax-folds.ll b/llvm/test/CodeGen/ARM/fminmax-folds.ll
index b13426c7c0500..ca3d7f9c3be7c 100644
--- a/llvm/test/CodeGen/ARM/fminmax-folds.ll
+++ b/llvm/test/CodeGen/ARM/fminmax-folds.ll
@@ -65,9 +65,15 @@ define float @test_minnum_const_inf(float %x) {
 define float @test_maxnum_const_inf(float %x) {
 ; CHECK-LABEL: test_maxnum_const_inf:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    movw r0, #0
-; CHECK-NEXT:    movt r0, #32640
+; CHECK-NEXT:    vldr s0, .LCPI5_0
+; CHECK-NEXT:    vmov s2, r0
+; CHECK-NEXT:    vmaxnm.f32 s0, s2, s0
+; CHECK-NEXT:    vmov r0, s0
 ; CHECK-NEXT:    bx lr
+; CHECK-NEXT:    .p2align 2
+; CHECK-NEXT:  @ %bb.1:
+; CHECK-NEXT:  .LCPI5_0:
+; CHECK-NEXT:    .long 0x7f800000 @ float +Inf
   %r = call float @llvm.maxnum.f32(float %x, float 0x7ff0000000000000)
   ret float %r
 }
@@ -99,9 +105,15 @@ define float @test_minimum_const_inf(float %x) {
 define float @test_minnum_const_neg_inf(float %x) {
 ; CHECK-LABEL: test_minnum_const_neg_inf:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    movw r0, #0
-; CHECK-NEXT:    movt r0, #65408
+; CHECK-NEXT:    vldr s0, .LCPI8_0
+; CHECK-NEXT:    vmov s2, r0
+; CHECK-NEXT:    vminnm.f32 s0, s2, s0
+; CHECK-NEXT:    vmov r0, s0
 ; CHECK-NEXT:    bx lr
+; CHECK-NEXT:    .p2align 2
+; CHECK-NEXT:  @ %bb.1:
+; CHECK-NEXT:  .LCPI8_0:
+; CHECK-NEXT:    .long 0xff800000 @ float -Inf
   %r = call float @llvm.minnum.f32(float %x, float 0xfff0000000000000)
   ret float %r
 }
@@ -447,9 +459,15 @@ define float @test_minnum_const_max_ninf(float %x) {
 define float @test_maxnum_const_max_ninf(float %x) {
 ; CHECK-LABEL: test_maxnum_const_max_ninf:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    movw r0, #65535
-; CHECK-NEXT:    movt r0, #32639
+; CHECK-NEXT:    vldr s0, .LCPI37_0
+; CHECK-NEXT:    vmov s2, r0
+; CHECK-NEXT:    vmaxnm.f32 s0, s2, s0
+; CHECK-NEXT:    vmov r0, s0
 ; CHECK-NEXT:    bx lr
+; CHECK-NEXT:    .p2align 2
+; CHECK-NEXT:  @ %bb.1:
+; CHECK-NEXT:  .LCPI37_0:
+; CHECK-NEXT:    .long 0x7f7fffff @ float 3.40282347E+38
   %r = call ninf float @llvm.maxnum.f32(float %x, float 0x47efffffe0000000)
   ret float %r
 }
@@ -481,8 +499,15 @@ define float @test_minimum_const_max_ninf(float %x) {
 define float @test_minnum_const_neg_max_ninf(float %x) {
 ; CHECK-LABEL: test_minnum_const_neg_max_ninf:
 ; CHECK:       @ %bb.0:
-; CHECK-NEXT:    mvn r0, #8388608
+; CHECK-NEXT:    vldr s0, .LCPI40_0
+; CHECK-NEXT:    vmov s2, r0
+; CHECK-NEXT:    vminnm.f32 s0, s2, s0
+; CHECK-NEXT:    vmov r0, s0
 ; CHECK-NEXT:    bx lr
+; CHECK-NEXT:    .p2align 2
+; CHECK-NEXT:  @ %bb.1:
+; CHECK-NEXT:  .LCPI40_0:
+; CHECK-NEXT:    .long 0xff7fffff @ float -3.40282347E+38
   %r = call ninf float @llvm.minnum.f32(float %x, float 0xc7efffffe0000000)
   ret float %r
 }
diff --git a/llvm/test/CodeGen/ARM/fp-intrinsics-vector.ll b/llvm/test/CodeGen/ARM/fp-intrinsics-vector.ll
new file mode 100644
index 0000000000000..d4b94b97acad8
--- /dev/null
+++ b/llvm/test/CodeGen/ARM/fp-intrinsics-vector.ll
@@ -0,0 +1,1499 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=armv7a-none-eabihf -mattr=+neon,+vfp4 %s -o - | FileCheck %s
+
+define <4 x float> @add_v4f32(<4 x float> %x, <4 x float> %y) #0 {
+; CHECK-LABEL: add_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vadd.f32 s11, s3, s7
+; CHECK-NEXT:    vadd.f32 s10, s2, s6
+; CHECK-NEXT:    vadd.f32 s9, s1, s5
+; CHECK-NEXT:    vadd.f32 s8, s0, s4
+; CHECK-NEXT:    vorr q0, q2, q2
+; CHECK-NEXT:    bx lr
+  %val = call <4 x float> @llvm.experimental.constrained.fadd.v4f32(<4 x float> %x, <4 x float> %y, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @sub_v4f32(<4 x float> %x, <4 x float> %y) #0 {
+; CHECK-LABEL: sub_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vsub.f32 s11, s3, s7
+; CHECK-NEXT:    vsub.f32 s10, s2, s6
+; CHECK-NEXT:    vsub.f32 s9, s1, s5
+; CHECK-NEXT:    vsub.f32 s8, s0, s4
+; CHECK-NEXT:    vorr q0, q2, q2
+; CHECK-NEXT:    bx lr
+  %val = call <4 x float> @llvm.experimental.constrained.fsub.v4f32(<4 x float> %x, <4 x float> %y, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @mul_v4f32(<4 x float> %x, <4 x float> %y) #0 {
+; CHECK-LABEL: mul_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vmul.f32 s11, s3, s7
+; CHECK-NEXT:    vmul.f32 s10, s2, s6
+; CHECK-NEXT:    vmul.f32 s9, s1, s5
+; CHECK-NEXT:    vmul.f32 s8, s0, s4
+; CHECK-NEXT:    vorr q0, q2, q2
+; CHECK-NEXT:    bx lr
+  %val = call <4 x float> @llvm.experimental.constrained.fmul.v4f32(<4 x float> %x, <4 x float> %y, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @div_v4f32(<4 x float> %x, <4 x float> %y) #0 {
+; CHECK-LABEL: div_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vdiv.f32 s11, s3, s7
+; CHECK-NEXT:    vdiv.f32 s10, s2, s6
+; CHECK-NEXT:    vdiv.f32 s9, s1, s5
+; CHECK-NEXT:    vdiv.f32 s8, s0, s4
+; CHECK-NEXT:    vorr q0, q2, q2
+; CHECK-NEXT:    bx lr
+  %val = call <4 x float> @llvm.experimental.constrained.fdiv.v4f32(<4 x float> %x, <4 x float> %y, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @fma_v4f32(<4 x float> %x, <4 x float> %y, <4 x float> %z) #0 {
+; CHECK-LABEL: fma_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vfma.f32 s11, s3, s7
+; CHECK-NEXT:    vfma.f32 s10, s2, s6
+; CHECK-NEXT:    vfma.f32 s9, s1, s5
+; CHECK-NEXT:    vfma.f32 s8, s0, s4
+; CHECK-NEXT:    vorr q0, q2, q2
+; CHECK-NEXT:    bx lr
+  %val = call <4 x float> @llvm.experimental.constrained.fma.v4f32(<4 x float> %x, <4 x float> %y, <4 x float> %z, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x i32> @fptosi_v4i32_v4f32(<4 x float> %x) #0 {
+; CHECK-LABEL: fptosi_v4i32_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vcvt.s32.f32 s4, s2
+; CHECK-NEXT:    vcvt.s32.f32 s6, s0
+; CHECK-NEXT:    vcvt.s32.f32 s0, s1
+; CHECK-NEXT:    vmov r0, s4
+; CHECK-NEXT:    vcvt.s32.f32 s4, s3
+; CHECK-NEXT:    vmov.32 d17[0], r0
+; CHECK-NEXT:    vmov r0, s6
+; CHECK-NEXT:    vmov.32 d16[0], r0
+; CHECK-NEXT:    vmov r0, s4
+; CHECK-NEXT:    vmov.32 d17[1], r0
+; CHECK-NEXT:    vmov r0, s0
+; CHECK-NEXT:    vmov.32 d16[1], r0
+; CHECK-NEXT:    vorr q0, q8, q8
+; CHECK-NEXT:    bx lr
+  %val = call <4 x i32> @llvm.experimental.constrained.fptosi.v4i32.v4f32(<4 x float> %x, metadata !"fpexcept.strict") #0
+  ret <4 x i32> %val
+}
+
+define <4 x i32> @fptoui_v4i32_v4f32(<4 x float> %x) #0 {
+; CHECK-LABEL: fptoui_v4i32_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vcvt.u32.f32 s4, s2
+; CHECK-NEXT:    vcvt.u32.f32 s6, s0
+; CHECK-NEXT:    vcvt.u32.f32 s0, s1
+; CHECK-NEXT:    vmov r0, s4
+; CHECK-NEXT:    vcvt.u32.f32 s4, s3
+; CHECK-NEXT:    vmov.32 d17[0], r0
+; CHECK-NEXT:    vmov r0, s6
+; CHECK-NEXT:    vmov.32 d16[0], r0
+; CHECK-NEXT:    vmov r0, s4
+; CHECK-NEXT:    vmov.32 d17[1], r0
+; CHECK-NEXT:    vmov r0, s0
+; CHECK-NEXT:    vmov.32 d16[1], r0
+; CHECK-NEXT:    vorr q0, q8, q8
+; CHECK-NEXT:    bx lr
+  %val = call <4 x i32> @llvm.experimental.constrained.fptoui.v4i32.v4f32(<4 x float> %x, metadata !"fpexcept.strict") #0
+  ret <4 x i32> %val
+}
+
+define <4 x i64> @fptosi_v4i64_v4f32(<4 x float> %x) #0 {
+; CHECK-LABEL: fptosi_v4i64_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r4, r5, r6, r7, r11, lr}
+; CHECK-NEXT:    push {r4, r5, r6, r7, r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q4, q0, q0
+; CHECK-NEXT:    vmov r0, s19
+; CHECK-NEXT:    bl __aeabi_f2lz
+; CHECK-NEXT:    mov r4, r1
+; CHECK-NEXT:    vmov r1, s16
+; CHECK-NEXT:    vmov r5, s17
+; CHECK-NEXT:    vmov r6, s18
+; CHECK-NEXT:    vmov.32 d9[0], r0
+; CHECK-NEXT:    mov r0, r1
+; CHECK-NEXT:    bl __aeabi_f2lz
+; CHECK-NEXT:    vmov.32 d10[0], r0
+; CHECK-NEXT:    mov r0, r5
+; CHECK-NEXT:    mov r7, r1
+; CHECK-NEXT:    bl __aeabi_f2lz
+; CHECK-NEXT:    vmov.32 d11[0], r0
+; CHECK-NEXT:    mov r0, r6
+; CHECK-NEXT:    mov r5, r1
+; CHECK-NEXT:    bl __aeabi_f2lz
+; CHECK-NEXT:    vmov.32 d8[0], r0
+; CHECK-NEXT:    vmov.32 d11[1], r5
+; CHECK-NEXT:    vmov.32 d9[1], r4
+; CHECK-NEXT:    vmov.32 d10[1], r7
+; CHECK-NEXT:    vmov.32 d8[1], r1
+; CHECK-NEXT:    vorr q0, q5, q5
+; CHECK-NEXT:    vorr q1, q4, q4
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r4, r5, r6, r7, r11, pc}
+  %val = call <4 x i64> @llvm.experimental.constrained.fptosi.v4i64.v4f32(<4 x float> %x, metadata !"fpexcept.strict") #0
+  ret <4 x i64> %val
+}
+
+define <4 x i64> @fptoui_v4i64_v4f32(<4 x float> %x) #0 {
+; CHECK-LABEL: fptoui_v4i64_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r4, r5, r6, r7, r11, lr}
+; CHECK-NEXT:    push {r4, r5, r6, r7, r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q4, q0, q0
+; CHECK-NEXT:    vmov r0, s19
+; CHECK-NEXT:    bl __aeabi_f2ulz
+; CHECK-NEXT:    mov r4, r1
+; CHECK-NEXT:    vmov r1, s16
+; CHECK-NEXT:    vmov r5, s17
+; CHECK-NEXT:    vmov r6, s18
+; CHECK-NEXT:    vmov.32 d9[0], r0
+; CHECK-NEXT:    mov r0, r1
+; CHECK-NEXT:    bl __aeabi_f2ulz
+; CHECK-NEXT:    vmov.32 d10[0], r0
+; CHECK-NEXT:    mov r0, r5
+; CHECK-NEXT:    mov r7, r1
+; CHECK-NEXT:    bl __aeabi_f2ulz
+; CHECK-NEXT:    vmov.32 d11[0], r0
+; CHECK-NEXT:    mov r0, r6
+; CHECK-NEXT:    mov r5, r1
+; CHECK-NEXT:    bl __aeabi_f2ulz
+; CHECK-NEXT:    vmov.32 d8[0], r0
+; CHECK-NEXT:    vmov.32 d11[1], r5
+; CHECK-NEXT:    vmov.32 d9[1], r4
+; CHECK-NEXT:    vmov.32 d10[1], r7
+; CHECK-NEXT:    vmov.32 d8[1], r1
+; CHECK-NEXT:    vorr q0, q5, q5
+; CHECK-NEXT:    vorr q1, q4, q4
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r4, r5, r6, r7, r11, pc}
+  %val = call <4 x i64> @llvm.experimental.constrained.fptoui.v4i64.v4f32(<4 x float> %x, metadata !"fpexcept.strict") #0
+  ret <4 x i64> %val
+}
+
+define <4 x float> @sitofp_v4f32_v4i32(<4 x i32> %x) #0 {
+; CHECK-LABEL: sitofp_v4f32_v4i32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .pad #32
+; CHECK-NEXT:    sub sp, sp, #32
+; CHECK-NEXT:    vmov r12, r1, d0
+; CHECK-NEXT:    movw r0, #0
+; CHECK-NEXT:    vmov r2, r3, d1
+; CHECK-NEXT:    movt r0, #17200
+; CHECK-NEXT:    str r0, [sp, #20]
+; CHECK-NEXT:    vldr d16, .LCPI9_0
+; CHECK-NEXT:    eor r1, r1, #-2147483648
+; CHECK-NEXT:    str r1, [sp, #16]
+; CHECK-NEXT:    str r0, [sp, #12]
+; CHECK-NEXT:    eor r1, r2, #-2147483648
+; CHECK-NEXT:    vldr d17, [sp, #16]
+; CHECK-NEXT:    stmib sp, {r0, r1}
+; CHECK-NEXT:    eor r1, r3, #-2147483648
+; CHECK-NEXT:    vsub.f64 d17, d17, d16
+; CHECK-NEXT:    vldr d18, [sp, #8]
+; CHECK-NEXT:    str r1, [sp]
+; CHECK-NEXT:    str r0, [sp, #28]
+; CHECK-NEXT:    eor r0, r12, #-2147483648
+; CHECK-NEXT:    vldr d19, [sp]
+; CHECK-NEXT:    str r0, [sp, #24]
+; CHECK-NEXT:    vsub.f64 d18, d18, d16
+; CHECK-NEXT:    vsub.f64 d19, d19, d16
+; CHECK-NEXT:    vldr d20, [sp, #24]
+; CHECK-NEXT:    vcvt.f32.f64 s3, d19
+; CHECK-NEXT:    vsub.f64 d16, d20, d16
+; CHECK-NEXT:    vcvt.f32.f64 s2, d18
+; CHECK-NEXT:    vcvt.f32.f64 s1, d17
+; CHECK-NEXT:    vcvt.f32.f64 s0, d16
+; CHECK-NEXT:    add sp, sp, #32
+; CHECK-NEXT:    bx lr
+; CHECK-NEXT:    .p2align 3
+; CHECK-NEXT:  @ %bb.1:
+; CHECK-NEXT:  .LCPI9_0:
+; CHECK-NEXT:    .long 2147483648 @ double 4503601774854144
+; CHECK-NEXT:    .long 1127219200
+  %val = call <4 x float> @llvm.experimental.constrained.sitofp.v4f32.v4i32(<4 x i32> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @uitofp_v4f32_v4i32(<4 x i32> %x) #0 {
+; CHECK-LABEL: uitofp_v4f32_v4i32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .pad #32
+; CHECK-NEXT:    sub sp, sp, #32
+; CHECK-NEXT:    vmov r0, r1, d1
+; CHECK-NEXT:    movw r2, #0
+; CHECK-NEXT:    vmov r12, r3, d0
+; CHECK-NEXT:    movt r2, #17200
+; CHECK-NEXT:    stm sp, {r1, r2}
+; CHECK-NEXT:    vldr d17, [sp]
+; CHECK-NEXT:    vldr d16, .LCPI10_0
+; CHECK-NEXT:    str r2, [sp, #12]
+; CHECK-NEXT:    vsub.f64 d17, d17, d16
+; CHECK-NEXT:    vcvt.f32.f64 s3, d17
+; CHECK-NEXT:    str r0, [sp, #8]
+; CHECK-NEXT:    vldr d18, [sp, #8]
+; CHECK-NEXT:    str r2, [sp, #20]
+; CHECK-NEXT:    str r3, [sp, #16]
+; CHECK-NEXT:    vsub.f64 d18, d18, d16
+; CHECK-NEXT:    vldr d19, [sp, #16]
+; CHECK-NEXT:    str r2, [sp, #28]
+; CHECK-NEXT:    vcvt.f32.f64 s2, d18
+; CHECK-NEXT:    str r12, [sp, #24]
+; CHECK-NEXT:    vldr d20, [sp, #24]
+; CHECK-NEXT:    vsub.f64 d19, d19, d16
+; CHECK-NEXT:    vsub.f64 d16, d20, d16
+; CHECK-NEXT:    vcvt.f32.f64 s1, d19
+; CHECK-NEXT:    vcvt.f32.f64 s0, d16
+; CHECK-NEXT:    add sp, sp, #32
+; CHECK-NEXT:    bx lr
+; CHECK-NEXT:    .p2align 3
+; CHECK-NEXT:  @ %bb.1:
+; CHECK-NEXT:  .LCPI10_0:
+; CHECK-NEXT:    .long 0 @ double 4503599627370496
+; CHECK-NEXT:    .long 1127219200
+  %val = call <4 x float> @llvm.experimental.constrained.uitofp.v4f32.v4i32(<4 x i32> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @sitofp_v4f32_v4i64(<4 x i64> %x) #0 {
+; CHECK-LABEL: sitofp_v4f32_v4i64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r4, r5, r6, lr}
+; CHECK-NEXT:    push {r4, r5, r6, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q4, q1, q1
+; CHECK-NEXT:    vorr q5, q0, q0
+; CHECK-NEXT:    vmov r0, r1, d8
+; CHECK-NEXT:    bl __aeabi_l2f
+; CHECK-NEXT:    mov r4, r0
+; CHECK-NEXT:    vmov r0, r1, d9
+; CHECK-NEXT:    bl __aeabi_l2f
+; CHECK-NEXT:    vmov r2, r1, d11
+; CHECK-NEXT:    vmov s19, r0
+; CHECK-NEXT:    vmov r5, r6, d10
+; CHECK-NEXT:    vmov s18, r4
+; CHECK-NEXT:    mov r0, r2
+; CHECK-NEXT:    bl __aeabi_l2f
+; CHECK-NEXT:    vmov s17, r0
+; CHECK-NEXT:    mov r0, r5
+; CHECK-NEXT:    mov r1, r6
+; CHECK-NEXT:    bl __aeabi_l2f
+; CHECK-NEXT:    vmov s16, r0
+; CHECK-NEXT:    vorr q0, q4, q4
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r4, r5, r6, pc}
+  %val = call <4 x float> @llvm.experimental.constrained.sitofp.v4f32.v4i64(<4 x i64> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @uitofp_v4f32_v4i64(<4 x i64> %x) #0 {
+; CHECK-LABEL: uitofp_v4f32_v4i64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r4, r5, r6, lr}
+; CHECK-NEXT:    push {r4, r5, r6, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q4, q1, q1
+; CHECK-NEXT:    vorr q5, q0, q0
+; CHECK-NEXT:    vmov r0, r1, d8
+; CHECK-NEXT:    bl __aeabi_ul2f
+; CHECK-NEXT:    mov r4, r0
+; CHECK-NEXT:    vmov r0, r1, d9
+; CHECK-NEXT:    bl __aeabi_ul2f
+; CHECK-NEXT:    vmov r2, r1, d11
+; CHECK-NEXT:    vmov s19, r0
+; CHECK-NEXT:    vmov r5, r6, d10
+; CHECK-NEXT:    vmov s18, r4
+; CHECK-NEXT:    mov r0, r2
+; CHECK-NEXT:    bl __aeabi_ul2f
+; CHECK-NEXT:    vmov s17, r0
+; CHECK-NEXT:    mov r0, r5
+; CHECK-NEXT:    mov r1, r6
+; CHECK-NEXT:    bl __aeabi_ul2f
+; CHECK-NEXT:    vmov s16, r0
+; CHECK-NEXT:    vorr q0, q4, q4
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r4, r5, r6, pc}
+  %val = call <4 x float> @llvm.experimental.constrained.uitofp.v4f32.v4i64(<4 x i64> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @sqrt_v4f32(<4 x float> %x) #0 {
+; CHECK-LABEL: sqrt_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vsqrt.f32 s7, s3
+; CHECK-NEXT:    vsqrt.f32 s6, s2
+; CHECK-NEXT:    vsqrt.f32 s5, s1
+; CHECK-NEXT:    vsqrt.f32 s4, s0
+; CHECK-NEXT:    vorr q0, q1, q1
+; CHECK-NEXT:    bx lr
+  %val = call <4 x float> @llvm.experimental.constrained.sqrt.v4f32(<4 x float> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @rint_v4f32(<4 x float> %x) #0 {
+; CHECK-LABEL: rint_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q5, q0, q0
+; CHECK-NEXT:    vmov.f32 s0, s23
+; CHECK-NEXT:    bl rintf
+; CHECK-NEXT:    vmov.f32 s19, s0
+; CHECK-NEXT:    vmov.f32 s0, s22
+; CHECK-NEXT:    bl rintf
+; CHECK-NEXT:    vmov.f32 s18, s0
+; CHECK-NEXT:    vmov.f32 s0, s21
+; CHECK-NEXT:    bl rintf
+; CHECK-NEXT:    vmov.f32 s17, s0
+; CHECK-NEXT:    vmov.f32 s0, s20
+; CHECK-NEXT:    bl rintf
+; CHECK-NEXT:    vmov.f32 s16, s0
+; CHECK-NEXT:    vorr q0, q4, q4
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <4 x float> @llvm.experimental.constrained.rint.v4f32(<4 x float> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @nearbyint_v4f32(<4 x float> %x) #0 {
+; CHECK-LABEL: nearbyint_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q5, q0, q0
+; CHECK-NEXT:    vmov.f32 s0, s23
+; CHECK-NEXT:    bl nearbyintf
+; CHECK-NEXT:    vmov.f32 s19, s0
+; CHECK-NEXT:    vmov.f32 s0, s22
+; CHECK-NEXT:    bl nearbyintf
+; CHECK-NEXT:    vmov.f32 s18, s0
+; CHECK-NEXT:    vmov.f32 s0, s21
+; CHECK-NEXT:    bl nearbyintf
+; CHECK-NEXT:    vmov.f32 s17, s0
+; CHECK-NEXT:    vmov.f32 s0, s20
+; CHECK-NEXT:    bl nearbyintf
+; CHECK-NEXT:    vmov.f32 s16, s0
+; CHECK-NEXT:    vorr q0, q4, q4
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <4 x float> @llvm.experimental.constrained.nearbyint.v4f32(<4 x float> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @maxnum_v4f32(<4 x float> %x, <4 x float> %y) #0 {
+; CHECK-LABEL: maxnum_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vorr q5, q0, q0
+; CHECK-NEXT:    vorr q4, q1, q1
+; CHECK-NEXT:    vmov.f32 s0, s23
+; CHECK-NEXT:    vmov.f32 s1, s19
+; CHECK-NEXT:    bl fmaxf
+; CHECK-NEXT:    vmov.f32 s27, s0
+; CHECK-NEXT:    vmov.f32 s0, s22
+; CHECK-NEXT:    vmov.f32 s1, s18
+; CHECK-NEXT:    bl fmaxf
+; CHECK-NEXT:    vmov.f32 s26, s0
+; CHECK-NEXT:    vmov.f32 s0, s21
+; CHECK-NEXT:    vmov.f32 s1, s17
+; CHECK-NEXT:    bl fmaxf
+; CHECK-NEXT:    vmov.f32 s25, s0
+; CHECK-NEXT:    vmov.f32 s0, s20
+; CHECK-NEXT:    vmov.f32 s1, s16
+; CHECK-NEXT:    bl fmaxf
+; CHECK-NEXT:    vmov.f32 s24, s0
+; CHECK-NEXT:    vorr q0, q6, q6
+; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <4 x float> @llvm.experimental.constrained.maxnum.v4f32(<4 x float> %x, <4 x float> %y, metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @minnum_v4f32(<4 x float> %x, <4 x float> %y) #0 {
+; CHECK-LABEL: minnum_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vorr q5, q0, q0
+; CHECK-NEXT:    vorr q4, q1, q1
+; CHECK-NEXT:    vmov.f32 s0, s23
+; CHECK-NEXT:    vmov.f32 s1, s19
+; CHECK-NEXT:    bl fminf
+; CHECK-NEXT:    vmov.f32 s27, s0
+; CHECK-NEXT:    vmov.f32 s0, s22
+; CHECK-NEXT:    vmov.f32 s1, s18
+; CHECK-NEXT:    bl fminf
+; CHECK-NEXT:    vmov.f32 s26, s0
+; CHECK-NEXT:    vmov.f32 s0, s21
+; CHECK-NEXT:    vmov.f32 s1, s17
+; CHECK-NEXT:    bl fminf
+; CHECK-NEXT:    vmov.f32 s25, s0
+; CHECK-NEXT:    vmov.f32 s0, s20
+; CHECK-NEXT:    vmov.f32 s1, s16
+; CHECK-NEXT:    bl fminf
+; CHECK-NEXT:    vmov.f32 s24, s0
+; CHECK-NEXT:    vorr q0, q6, q6
+; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <4 x float> @llvm.experimental.constrained.minnum.v4f32(<4 x float> %x, <4 x float> %y, metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @ceil_v4f32(<4 x float> %x) #0 {
+; CHECK-LABEL: ceil_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q5, q0, q0
+; CHECK-NEXT:    vmov.f32 s0, s23
+; CHECK-NEXT:    bl ceilf
+; CHECK-NEXT:    vmov.f32 s19, s0
+; CHECK-NEXT:    vmov.f32 s0, s22
+; CHECK-NEXT:    bl ceilf
+; CHECK-NEXT:    vmov.f32 s18, s0
+; CHECK-NEXT:    vmov.f32 s0, s21
+; CHECK-NEXT:    bl ceilf
+; CHECK-NEXT:    vmov.f32 s17, s0
+; CHECK-NEXT:    vmov.f32 s0, s20
+; CHECK-NEXT:    bl ceilf
+; CHECK-NEXT:    vmov.f32 s16, s0
+; CHECK-NEXT:    vorr q0, q4, q4
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <4 x float> @llvm.experimental.constrained.ceil.v4f32(<4 x float> %x, metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @floor_v4f32(<4 x float> %x) #0 {
+; CHECK-LABEL: floor_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q5, q0, q0
+; CHECK-NEXT:    vmov.f32 s0, s23
+; CHECK-NEXT:    bl floorf
+; CHECK-NEXT:    vmov.f32 s19, s0
+; CHECK-NEXT:    vmov.f32 s0, s22
+; CHECK-NEXT:    bl floorf
+; CHECK-NEXT:    vmov.f32 s18, s0
+; CHECK-NEXT:    vmov.f32 s0, s21
+; CHECK-NEXT:    bl floorf
+; CHECK-NEXT:    vmov.f32 s17, s0
+; CHECK-NEXT:    vmov.f32 s0, s20
+; CHECK-NEXT:    bl floorf
+; CHECK-NEXT:    vmov.f32 s16, s0
+; CHECK-NEXT:    vorr q0, q4, q4
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <4 x float> @llvm.experimental.constrained.floor.v4f32(<4 x float> %x, metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @round_v4f32(<4 x float> %x) #0 {
+; CHECK-LABEL: round_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q5, q0, q0
+; CHECK-NEXT:    vmov.f32 s0, s23
+; CHECK-NEXT:    bl roundf
+; CHECK-NEXT:    vmov.f32 s19, s0
+; CHECK-NEXT:    vmov.f32 s0, s22
+; CHECK-NEXT:    bl roundf
+; CHECK-NEXT:    vmov.f32 s18, s0
+; CHECK-NEXT:    vmov.f32 s0, s21
+; CHECK-NEXT:    bl roundf
+; CHECK-NEXT:    vmov.f32 s17, s0
+; CHECK-NEXT:    vmov.f32 s0, s20
+; CHECK-NEXT:    bl roundf
+; CHECK-NEXT:    vmov.f32 s16, s0
+; CHECK-NEXT:    vorr q0, q4, q4
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <4 x float> @llvm.experimental.constrained.round.v4f32(<4 x float> %x, metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @roundeven_v4f32(<4 x float> %x) #0 {
+; CHECK-LABEL: roundeven_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q5, q0, q0
+; CHECK-NEXT:    vmov.f32 s0, s23
+; CHECK-NEXT:    bl roundevenf
+; CHECK-NEXT:    vmov.f32 s19, s0
+; CHECK-NEXT:    vmov.f32 s0, s22
+; CHECK-NEXT:    bl roundevenf
+; CHECK-NEXT:    vmov.f32 s18, s0
+; CHECK-NEXT:    vmov.f32 s0, s21
+; CHECK-NEXT:    bl roundevenf
+; CHECK-NEXT:    vmov.f32 s17, s0
+; CHECK-NEXT:    vmov.f32 s0, s20
+; CHECK-NEXT:    bl roundevenf
+; CHECK-NEXT:    vmov.f32 s16, s0
+; CHECK-NEXT:    vorr q0, q4, q4
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <4 x float> @llvm.experimental.constrained.roundeven.v4f32(<4 x float> %x, metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x float> @trunc_v4f32(<4 x float> %x) #0 {
+; CHECK-LABEL: trunc_v4f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q5, q0, q0
+; CHECK-NEXT:    vmov.f32 s0, s23
+; CHECK-NEXT:    bl truncf
+; CHECK-NEXT:    vmov.f32 s19, s0
+; CHECK-NEXT:    vmov.f32 s0, s22
+; CHECK-NEXT:    bl truncf
+; CHECK-NEXT:    vmov.f32 s18, s0
+; CHECK-NEXT:    vmov.f32 s0, s21
+; CHECK-NEXT:    bl truncf
+; CHECK-NEXT:    vmov.f32 s17, s0
+; CHECK-NEXT:    vmov.f32 s0, s20
+; CHECK-NEXT:    bl truncf
+; CHECK-NEXT:    vmov.f32 s16, s0
+; CHECK-NEXT:    vorr q0, q4, q4
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <4 x float> @llvm.experimental.constrained.trunc.v4f32(<4 x float> %x, metadata !"fpexcept.strict") #0
+  ret <4 x float> %val
+}
+
+define <4 x i1> @fcmp_v4f32(<4 x float> %x, <4 x float> %y) #0 {
+; CHECK-LABEL: fcmp_v4f32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vcmp.f32 s3, s7
+; CHECK-NEXT:    mov r1, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vcmp.f32 s2, s6
+; CHECK-NEXT:    mov r2, #0
+; CHECK-NEXT:    mov r3, #0
+; CHECK-NEXT:    mov r0, #0
+; CHECK-NEXT:    movweq r1, #1
+; CHECK-NEXT:    cmp r1, #0
+; CHECK-NEXT:    mvnne r1, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vcmp.f32 s0, s4
+; CHECK-NEXT:    movweq r2, #1
+; CHECK-NEXT:    cmp r2, #0
+; CHECK-NEXT:    mvnne r2, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vcmp.f32 s1, s5
+; CHECK-NEXT:    vmov.32 d17[0], r2
+; CHECK-NEXT:    movweq r3, #1
+; CHECK-NEXT:    cmp r3, #0
+; CHECK-NEXT:    mvnne r3, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vmov.32 d16[0], r3
+; CHECK-NEXT:    vmov.32 d17[1], r1
+; CHECK-NEXT:    movweq r0, #1
+; CHECK-NEXT:    cmp r0, #0
+; CHECK-NEXT:    mvnne r0, #0
+; CHECK-NEXT:    vmov.32 d16[1], r0
+; CHECK-NEXT:    vmovn.i32 d0, q8
+; CHECK-NEXT:    bx lr
+entry:
+  %val = call <4 x i1> @llvm.experimental.constrained.fcmp.v4f32(<4 x float> %x, <4 x float> %y, metadata !"oeq", metadata !"fpexcept.strict")
+  ret <4 x i1> %val
+}
+
+define <4 x i1> @fcmps_v4f32(<4 x float> %x, <4 x float> %y) #0 {
+; CHECK-LABEL: fcmps_v4f32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vcmpe.f32 s3, s7
+; CHECK-NEXT:    mov r1, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vcmpe.f32 s2, s6
+; CHECK-NEXT:    mov r2, #0
+; CHECK-NEXT:    mov r3, #0
+; CHECK-NEXT:    mov r0, #0
+; CHECK-NEXT:    movweq r1, #1
+; CHECK-NEXT:    cmp r1, #0
+; CHECK-NEXT:    mvnne r1, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vcmpe.f32 s0, s4
+; CHECK-NEXT:    movweq r2, #1
+; CHECK-NEXT:    cmp r2, #0
+; CHECK-NEXT:    mvnne r2, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vcmpe.f32 s1, s5
+; CHECK-NEXT:    vmov.32 d17[0], r2
+; CHECK-NEXT:    movweq r3, #1
+; CHECK-NEXT:    cmp r3, #0
+; CHECK-NEXT:    mvnne r3, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vmov.32 d16[0], r3
+; CHECK-NEXT:    vmov.32 d17[1], r1
+; CHECK-NEXT:    movweq r0, #1
+; CHECK-NEXT:    cmp r0, #0
+; CHECK-NEXT:    mvnne r0, #0
+; CHECK-NEXT:    vmov.32 d16[1], r0
+; CHECK-NEXT:    vmovn.i32 d0, q8
+; CHECK-NEXT:    bx lr
+entry:
+  %val = call <4 x i1> @llvm.experimental.constrained.fcmps.v4f32(<4 x float> %x, <4 x float> %y, metadata !"oeq", metadata !"fpexcept.strict")
+  ret <4 x i1> %val
+}
+
+
+
+define <2 x double> @add_v2f64(<2 x double> %x, <2 x double> %y) #0 {
+; CHECK-LABEL: add_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vadd.f64 d17, d1, d3
+; CHECK-NEXT:    vadd.f64 d16, d0, d2
+; CHECK-NEXT:    vorr q0, q8, q8
+; CHECK-NEXT:    bx lr
+  %val = call <2 x double> @llvm.experimental.constrained.fadd.v2f64(<2 x double> %x, <2 x double> %y, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @sub_v2f64(<2 x double> %x, <2 x double> %y) #0 {
+; CHECK-LABEL: sub_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vsub.f64 d17, d1, d3
+; CHECK-NEXT:    vsub.f64 d16, d0, d2
+; CHECK-NEXT:    vorr q0, q8, q8
+; CHECK-NEXT:    bx lr
+  %val = call <2 x double> @llvm.experimental.constrained.fsub.v2f64(<2 x double> %x, <2 x double> %y, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @mul_v2f64(<2 x double> %x, <2 x double> %y) #0 {
+; CHECK-LABEL: mul_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vmul.f64 d17, d1, d3
+; CHECK-NEXT:    vmul.f64 d16, d0, d2
+; CHECK-NEXT:    vorr q0, q8, q8
+; CHECK-NEXT:    bx lr
+  %val = call <2 x double> @llvm.experimental.constrained.fmul.v2f64(<2 x double> %x, <2 x double> %y, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @div_v2f64(<2 x double> %x, <2 x double> %y) #0 {
+; CHECK-LABEL: div_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vdiv.f64 d17, d1, d3
+; CHECK-NEXT:    vdiv.f64 d16, d0, d2
+; CHECK-NEXT:    vorr q0, q8, q8
+; CHECK-NEXT:    bx lr
+  %val = call <2 x double> @llvm.experimental.constrained.fdiv.v2f64(<2 x double> %x, <2 x double> %y, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @fma_v2f64(<2 x double> %x, <2 x double> %y, <2 x double> %z) #0 {
+; CHECK-LABEL: fma_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vfma.f64 d5, d1, d3
+; CHECK-NEXT:    vfma.f64 d4, d0, d2
+; CHECK-NEXT:    vorr q0, q2, q2
+; CHECK-NEXT:    bx lr
+  %val = call <2 x double> @llvm.experimental.constrained.fma.v2f64(<2 x double> %x, <2 x double> %y, <2 x double> %z, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x i32> @fptosi_v2i32_v2f64(<2 x double> %x) #0 {
+; CHECK-LABEL: fptosi_v2i32_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vcvt.s32.f64 s4, d0
+; CHECK-NEXT:    vmov r0, s4
+; CHECK-NEXT:    vcvt.s32.f64 s2, d1
+; CHECK-NEXT:    vmov.32 d0[0], r0
+; CHECK-NEXT:    vmov r0, s2
+; CHECK-NEXT:    vmov.32 d0[1], r0
+; CHECK-NEXT:    bx lr
+  %val = call <2 x i32> @llvm.experimental.constrained.fptosi.v2i32.v2f64(<2 x double> %x, metadata !"fpexcept.strict") #0
+  ret <2 x i32> %val
+}
+
+define <2 x i32> @fptoui_v2i32_v2f64(<2 x double> %x) #0 {
+; CHECK-LABEL: fptoui_v2i32_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vcvt.u32.f64 s4, d0
+; CHECK-NEXT:    vmov r0, s4
+; CHECK-NEXT:    vcvt.u32.f64 s2, d1
+; CHECK-NEXT:    vmov.32 d0[0], r0
+; CHECK-NEXT:    vmov r0, s2
+; CHECK-NEXT:    vmov.32 d0[1], r0
+; CHECK-NEXT:    bx lr
+  %val = call <2 x i32> @llvm.experimental.constrained.fptoui.v2i32.v2f64(<2 x double> %x, metadata !"fpexcept.strict") #0
+  ret <2 x i32> %val
+}
+
+define <2 x i64> @fptosi_v2i64_v2f64(<2 x double> %x) #0 {
+; CHECK-LABEL: fptosi_v2i64_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r4, lr}
+; CHECK-NEXT:    push {r4, lr}
+; CHECK-NEXT:    .vsave {d8, d9}
+; CHECK-NEXT:    vpush {d8, d9}
+; CHECK-NEXT:    vorr q4, q0, q0
+; CHECK-NEXT:    vmov r0, r1, d9
+; CHECK-NEXT:    bl __aeabi_d2lz
+; CHECK-NEXT:    mov r4, r1
+; CHECK-NEXT:    vmov r2, r1, d8
+; CHECK-NEXT:    vmov.32 d9[0], r0
+; CHECK-NEXT:    mov r0, r2
+; CHECK-NEXT:    bl __aeabi_d2lz
+; CHECK-NEXT:    vmov.32 d8[0], r0
+; CHECK-NEXT:    vmov.32 d9[1], r4
+; CHECK-NEXT:    vmov.32 d8[1], r1
+; CHECK-NEXT:    vorr q0, q4, q4
+; CHECK-NEXT:    vpop {d8, d9}
+; CHECK-NEXT:    pop {r4, pc}
+  %val = call <2 x i64> @llvm.experimental.constrained.fptosi.v2i64.v2f64(<2 x double> %x, metadata !"fpexcept.strict") #0
+  ret <2 x i64> %val
+}
+
+define <2 x i64> @fptoui_v2i64_v2f64(<2 x double> %x) #0 {
+; CHECK-LABEL: fptoui_v2i64_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r4, lr}
+; CHECK-NEXT:    push {r4, lr}
+; CHECK-NEXT:    .vsave {d8, d9}
+; CHECK-NEXT:    vpush {d8, d9}
+; CHECK-NEXT:    vorr q4, q0, q0
+; CHECK-NEXT:    vmov r0, r1, d9
+; CHECK-NEXT:    bl __aeabi_d2ulz
+; CHECK-NEXT:    mov r4, r1
+; CHECK-NEXT:    vmov r2, r1, d8
+; CHECK-NEXT:    vmov.32 d9[0], r0
+; CHECK-NEXT:    mov r0, r2
+; CHECK-NEXT:    bl __aeabi_d2ulz
+; CHECK-NEXT:    vmov.32 d8[0], r0
+; CHECK-NEXT:    vmov.32 d9[1], r4
+; CHECK-NEXT:    vmov.32 d8[1], r1
+; CHECK-NEXT:    vorr q0, q4, q4
+; CHECK-NEXT:    vpop {d8, d9}
+; CHECK-NEXT:    pop {r4, pc}
+  %val = call <2 x i64> @llvm.experimental.constrained.fptoui.v2i64.v2f64(<2 x double> %x, metadata !"fpexcept.strict") #0
+  ret <2 x i64> %val
+}
+
+define <2 x double> @sitofp_v2f64_v2i32(<2 x i32> %x) #0 {
+; CHECK-LABEL: sitofp_v2f64_v2i32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .pad #16
+; CHECK-NEXT:    sub sp, sp, #16
+; CHECK-NEXT:    vmov.32 r0, d0[1]
+; CHECK-NEXT:    movw r2, #0
+; CHECK-NEXT:    vmov.32 r1, d0[0]
+; CHECK-NEXT:    movt r2, #17200
+; CHECK-NEXT:    str r2, [sp, #4]
+; CHECK-NEXT:    vldr d16, .LCPI34_0
+; CHECK-NEXT:    eor r0, r0, #-2147483648
+; CHECK-NEXT:    str r0, [sp]
+; CHECK-NEXT:    str r2, [sp, #12]
+; CHECK-NEXT:    eor r0, r1, #-2147483648
+; CHECK-NEXT:    vldr d17, [sp]
+; CHECK-NEXT:    str r0, [sp, #8]
+; CHECK-NEXT:    vldr d18, [sp, #8]
+; CHECK-NEXT:    vsub.f64 d1, d17, d16
+; CHECK-NEXT:    vsub.f64 d0, d18, d16
+; CHECK-NEXT:    add sp, sp, #16
+; CHECK-NEXT:    bx lr
+; CHECK-NEXT:    .p2align 3
+; CHECK-NEXT:  @ %bb.1:
+; CHECK-NEXT:  .LCPI34_0:
+; CHECK-NEXT:    .long 2147483648 @ double 4503601774854144
+; CHECK-NEXT:    .long 1127219200
+  %val = call <2 x double> @llvm.experimental.constrained.sitofp.v2f64.v2i32(<2 x i32> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @uitofp_v2f64_v2i32(<2 x i32> %x) #0 {
+; CHECK-LABEL: uitofp_v2f64_v2i32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .pad #16
+; CHECK-NEXT:    sub sp, sp, #16
+; CHECK-NEXT:    movw r0, #0
+; CHECK-NEXT:    mov r1, sp
+; CHECK-NEXT:    movt r0, #17200
+; CHECK-NEXT:    vst1.32 {d0[1]}, [r1:32]
+; CHECK-NEXT:    add r1, sp, #8
+; CHECK-NEXT:    str r0, [sp, #4]
+; CHECK-NEXT:    vldr d17, [sp]
+; CHECK-NEXT:    vst1.32 {d0[0]}, [r1:32]
+; CHECK-NEXT:    vldr d16, .LCPI35_0
+; CHECK-NEXT:    str r0, [sp, #12]
+; CHECK-NEXT:    vldr d18, [sp, #8]
+; CHECK-NEXT:    vsub.f64 d1, d17, d16
+; CHECK-NEXT:    vsub.f64 d0, d18, d16
+; CHECK-NEXT:    add sp, sp, #16
+; CHECK-NEXT:    bx lr
+; CHECK-NEXT:    .p2align 3
+; CHECK-NEXT:  @ %bb.1:
+; CHECK-NEXT:  .LCPI35_0:
+; CHECK-NEXT:    .long 0 @ double 4503599627370496
+; CHECK-NEXT:    .long 1127219200
+  %val = call <2 x double> @llvm.experimental.constrained.uitofp.v2f64.v2i32(<2 x i32> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @sitofp_v2f64_v2i64(<2 x i64> %x) #0 {
+; CHECK-LABEL: sitofp_v2f64_v2i64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9}
+; CHECK-NEXT:    vpush {d8, d9}
+; CHECK-NEXT:    vorr q4, q0, q0
+; CHECK-NEXT:    vmov r0, r1, d9
+; CHECK-NEXT:    bl __aeabi_l2d
+; CHECK-NEXT:    vmov r2, r3, d8
+; CHECK-NEXT:    vmov d9, r0, r1
+; CHECK-NEXT:    mov r0, r2
+; CHECK-NEXT:    mov r1, r3
+; CHECK-NEXT:    bl __aeabi_l2d
+; CHECK-NEXT:    vmov d8, r0, r1
+; CHECK-NEXT:    vorr q0, q4, q4
+; CHECK-NEXT:    vpop {d8, d9}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <2 x double> @llvm.experimental.constrained.sitofp.v2f64.v2i64(<2 x i64> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @uitofp_v2f64_v2i64(<2 x i64> %x) #0 {
+; CHECK-LABEL: uitofp_v2f64_v2i64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9}
+; CHECK-NEXT:    vpush {d8, d9}
+; CHECK-NEXT:    vorr q4, q0, q0
+; CHECK-NEXT:    vmov r0, r1, d9
+; CHECK-NEXT:    bl __aeabi_ul2d
+; CHECK-NEXT:    vmov r2, r3, d8
+; CHECK-NEXT:    vmov d9, r0, r1
+; CHECK-NEXT:    mov r0, r2
+; CHECK-NEXT:    mov r1, r3
+; CHECK-NEXT:    bl __aeabi_ul2d
+; CHECK-NEXT:    vmov d8, r0, r1
+; CHECK-NEXT:    vorr q0, q4, q4
+; CHECK-NEXT:    vpop {d8, d9}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <2 x double> @llvm.experimental.constrained.uitofp.v2f64.v2i64(<2 x i64> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @sqrt_v2f64(<2 x double> %x) #0 {
+; CHECK-LABEL: sqrt_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vsqrt.f64 d17, d1
+; CHECK-NEXT:    vsqrt.f64 d16, d0
+; CHECK-NEXT:    vorr q0, q8, q8
+; CHECK-NEXT:    bx lr
+  %val = call <2 x double> @llvm.experimental.constrained.sqrt.v2f64(<2 x double> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @rint_v2f64(<2 x double> %x) #0 {
+; CHECK-LABEL: rint_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q4, q0, q0
+; CHECK-NEXT:    vorr d0, d9, d9
+; CHECK-NEXT:    bl rint
+; CHECK-NEXT:    vorr d11, d0, d0
+; CHECK-NEXT:    vorr d0, d8, d8
+; CHECK-NEXT:    bl rint
+; CHECK-NEXT:    vorr d10, d0, d0
+; CHECK-NEXT:    vorr q0, q5, q5
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <2 x double> @llvm.experimental.constrained.rint.v2f64(<2 x double> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @nearbyint_v2f64(<2 x double> %x) #0 {
+; CHECK-LABEL: nearbyint_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q4, q0, q0
+; CHECK-NEXT:    vorr d0, d9, d9
+; CHECK-NEXT:    bl nearbyint
+; CHECK-NEXT:    vorr d11, d0, d0
+; CHECK-NEXT:    vorr d0, d8, d8
+; CHECK-NEXT:    bl nearbyint
+; CHECK-NEXT:    vorr d10, d0, d0
+; CHECK-NEXT:    vorr q0, q5, q5
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <2 x double> @llvm.experimental.constrained.nearbyint.v2f64(<2 x double> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @maxnum_v2f64(<2 x double> %x, <2 x double> %y) #0 {
+; CHECK-LABEL: maxnum_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vorr q5, q0, q0
+; CHECK-NEXT:    vorr q4, q1, q1
+; CHECK-NEXT:    vorr d0, d11, d11
+; CHECK-NEXT:    vorr d1, d9, d9
+; CHECK-NEXT:    bl fmax
+; CHECK-NEXT:    vorr d13, d0, d0
+; CHECK-NEXT:    vorr d0, d10, d10
+; CHECK-NEXT:    vorr d1, d8, d8
+; CHECK-NEXT:    bl fmax
+; CHECK-NEXT:    vorr d12, d0, d0
+; CHECK-NEXT:    vorr q0, q6, q6
+; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <2 x double> @llvm.experimental.constrained.maxnum.v2f64(<2 x double> %x, <2 x double> %y, metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @minnum_v2f64(<2 x double> %x, <2 x double> %y) #0 {
+; CHECK-LABEL: minnum_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    vorr q5, q0, q0
+; CHECK-NEXT:    vorr q4, q1, q1
+; CHECK-NEXT:    vorr d0, d11, d11
+; CHECK-NEXT:    vorr d1, d9, d9
+; CHECK-NEXT:    bl fmin
+; CHECK-NEXT:    vorr d13, d0, d0
+; CHECK-NEXT:    vorr d0, d10, d10
+; CHECK-NEXT:    vorr d1, d8, d8
+; CHECK-NEXT:    bl fmin
+; CHECK-NEXT:    vorr d12, d0, d0
+; CHECK-NEXT:    vorr q0, q6, q6
+; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <2 x double> @llvm.experimental.constrained.minnum.v2f64(<2 x double> %x, <2 x double> %y, metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @ceil_v2f64(<2 x double> %x) #0 {
+; CHECK-LABEL: ceil_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q4, q0, q0
+; CHECK-NEXT:    vorr d0, d9, d9
+; CHECK-NEXT:    bl ceil
+; CHECK-NEXT:    vorr d11, d0, d0
+; CHECK-NEXT:    vorr d0, d8, d8
+; CHECK-NEXT:    bl ceil
+; CHECK-NEXT:    vorr d10, d0, d0
+; CHECK-NEXT:    vorr q0, q5, q5
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <2 x double> @llvm.experimental.constrained.ceil.v2f64(<2 x double> %x, metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @floor_v2f64(<2 x double> %x) #0 {
+; CHECK-LABEL: floor_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q4, q0, q0
+; CHECK-NEXT:    vorr d0, d9, d9
+; CHECK-NEXT:    bl floor
+; CHECK-NEXT:    vorr d11, d0, d0
+; CHECK-NEXT:    vorr d0, d8, d8
+; CHECK-NEXT:    bl floor
+; CHECK-NEXT:    vorr d10, d0, d0
+; CHECK-NEXT:    vorr q0, q5, q5
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <2 x double> @llvm.experimental.constrained.floor.v2f64(<2 x double> %x, metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @round_v2f64(<2 x double> %x) #0 {
+; CHECK-LABEL: round_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q4, q0, q0
+; CHECK-NEXT:    vorr d0, d9, d9
+; CHECK-NEXT:    bl round
+; CHECK-NEXT:    vorr d11, d0, d0
+; CHECK-NEXT:    vorr d0, d8, d8
+; CHECK-NEXT:    bl round
+; CHECK-NEXT:    vorr d10, d0, d0
+; CHECK-NEXT:    vorr q0, q5, q5
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <2 x double> @llvm.experimental.constrained.round.v2f64(<2 x double> %x, metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @roundeven_v2f64(<2 x double> %x) #0 {
+; CHECK-LABEL: roundeven_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q4, q0, q0
+; CHECK-NEXT:    vorr d0, d9, d9
+; CHECK-NEXT:    bl roundeven
+; CHECK-NEXT:    vorr d11, d0, d0
+; CHECK-NEXT:    vorr d0, d8, d8
+; CHECK-NEXT:    bl roundeven
+; CHECK-NEXT:    vorr d10, d0, d0
+; CHECK-NEXT:    vorr q0, q5, q5
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <2 x double> @llvm.experimental.constrained.roundeven.v2f64(<2 x double> %x, metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x double> @trunc_v2f64(<2 x double> %x) #0 {
+; CHECK-LABEL: trunc_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    vorr q4, q0, q0
+; CHECK-NEXT:    vorr d0, d9, d9
+; CHECK-NEXT:    bl trunc
+; CHECK-NEXT:    vorr d11, d0, d0
+; CHECK-NEXT:    vorr d0, d8, d8
+; CHECK-NEXT:    bl trunc
+; CHECK-NEXT:    vorr d10, d0, d0
+; CHECK-NEXT:    vorr q0, q5, q5
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <2 x double> @llvm.experimental.constrained.trunc.v2f64(<2 x double> %x, metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+define <2 x i1> @fcmp_v2f64(<2 x double> %x, <2 x double> %y) #0 {
+; CHECK-LABEL: fcmp_v2f64:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vcmp.f64 d0, d2
+; CHECK-NEXT:    mov r1, #0
+; CHECK-NEXT:    mov r0, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vcmp.f64 d1, d3
+; CHECK-NEXT:    movweq r1, #1
+; CHECK-NEXT:    cmp r1, #0
+; CHECK-NEXT:    mvnne r1, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vmov.32 d0[0], r1
+; CHECK-NEXT:    movweq r0, #1
+; CHECK-NEXT:    cmp r0, #0
+; CHECK-NEXT:    mvnne r0, #0
+; CHECK-NEXT:    vmov.32 d0[1], r0
+; CHECK-NEXT:    bx lr
+entry:
+  %val = call <2 x i1> @llvm.experimental.constrained.fcmp.v2f64(<2 x double> %x, <2 x double> %y, metadata !"oeq", metadata !"fpexcept.strict")
+  ret <2 x i1> %val
+}
+
+define <2 x i1> @fcmps_v2f64(<2 x double> %x, <2 x double> %y) #0 {
+; CHECK-LABEL: fcmps_v2f64:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vcmpe.f64 d0, d2
+; CHECK-NEXT:    mov r1, #0
+; CHECK-NEXT:    mov r0, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vcmpe.f64 d1, d3
+; CHECK-NEXT:    movweq r1, #1
+; CHECK-NEXT:    cmp r1, #0
+; CHECK-NEXT:    mvnne r1, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    vmov.32 d0[0], r1
+; CHECK-NEXT:    movweq r0, #1
+; CHECK-NEXT:    cmp r0, #0
+; CHECK-NEXT:    mvnne r0, #0
+; CHECK-NEXT:    vmov.32 d0[1], r0
+; CHECK-NEXT:    bx lr
+entry:
+  %val = call <2 x i1> @llvm.experimental.constrained.fcmps.v2f64(<2 x double> %x, <2 x double> %y, metadata !"oeq", metadata !"fpexcept.strict")
+  ret <2 x i1> %val
+}
+
+
+
+define <1 x double> @add_v1f64(<1 x double> %x, <1 x double> %y) #0 {
+; CHECK-LABEL: add_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vadd.f64 d0, d0, d1
+; CHECK-NEXT:    bx lr
+  %val = call <1 x double> @llvm.experimental.constrained.fadd.v1f64(<1 x double> %x, <1 x double> %y, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @sub_v1f64(<1 x double> %x, <1 x double> %y) #0 {
+; CHECK-LABEL: sub_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vsub.f64 d0, d0, d1
+; CHECK-NEXT:    bx lr
+  %val = call <1 x double> @llvm.experimental.constrained.fsub.v1f64(<1 x double> %x, <1 x double> %y, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @mul_v1f64(<1 x double> %x, <1 x double> %y) #0 {
+; CHECK-LABEL: mul_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vmul.f64 d0, d0, d1
+; CHECK-NEXT:    bx lr
+  %val = call <1 x double> @llvm.experimental.constrained.fmul.v1f64(<1 x double> %x, <1 x double> %y, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @div_v1f64(<1 x double> %x, <1 x double> %y) #0 {
+; CHECK-LABEL: div_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vdiv.f64 d0, d0, d1
+; CHECK-NEXT:    bx lr
+  %val = call <1 x double> @llvm.experimental.constrained.fdiv.v1f64(<1 x double> %x, <1 x double> %y, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @fma_v1f64(<1 x double> %x, <1 x double> %y, <1 x double> %z) #0 {
+; CHECK-LABEL: fma_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vfma.f64 d2, d0, d1
+; CHECK-NEXT:    vmov.f64 d0, d2
+; CHECK-NEXT:    bx lr
+  %val = call <1 x double> @llvm.experimental.constrained.fma.v1f64(<1 x double> %x, <1 x double> %y, <1 x double> %z, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x i32> @fptosi_v1i32_v1f64(<1 x double> %x) #0 {
+; CHECK-LABEL: fptosi_v1i32_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vcvt.s32.f64 s0, d0
+; CHECK-NEXT:    vmov r0, s0
+; CHECK-NEXT:    bx lr
+  %val = call <1 x i32> @llvm.experimental.constrained.fptosi.v1i32.v1f64(<1 x double> %x, metadata !"fpexcept.strict") #0
+  ret <1 x i32> %val
+}
+
+define <1 x i32> @fptoui_v1i32_v1f64(<1 x double> %x) #0 {
+; CHECK-LABEL: fptoui_v1i32_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vcvt.u32.f64 s0, d0
+; CHECK-NEXT:    vmov r0, s0
+; CHECK-NEXT:    bx lr
+  %val = call <1 x i32> @llvm.experimental.constrained.fptoui.v1i32.v1f64(<1 x double> %x, metadata !"fpexcept.strict") #0
+  ret <1 x i32> %val
+}
+
+define <1 x i64> @fptosi_v1i64_v1f64(<1 x double> %x) #0 {
+; CHECK-LABEL: fptosi_v1i64_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    vmov r0, r1, d0
+; CHECK-NEXT:    bl __aeabi_d2lz
+; CHECK-NEXT:    vmov.32 d0[0], r0
+; CHECK-NEXT:    vmov.32 d0[1], r1
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <1 x i64> @llvm.experimental.constrained.fptosi.v1i64.v1f64(<1 x double> %x, metadata !"fpexcept.strict") #0
+  ret <1 x i64> %val
+}
+
+define <1 x i64> @fptoui_v1i64_v1f64(<1 x double> %x) #0 {
+; CHECK-LABEL: fptoui_v1i64_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    vmov r0, r1, d0
+; CHECK-NEXT:    bl __aeabi_d2ulz
+; CHECK-NEXT:    vmov.32 d0[0], r0
+; CHECK-NEXT:    vmov.32 d0[1], r1
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <1 x i64> @llvm.experimental.constrained.fptoui.v1i64.v1f64(<1 x double> %x, metadata !"fpexcept.strict") #0
+  ret <1 x i64> %val
+}
+
+define <1 x double> @sitofp_v1f64_v1i32(<1 x i32> %x) #0 {
+; CHECK-LABEL: sitofp_v1f64_v1i32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .pad #8
+; CHECK-NEXT:    sub sp, sp, #8
+; CHECK-NEXT:    movw r1, #0
+; CHECK-NEXT:    eor r0, r0, #-2147483648
+; CHECK-NEXT:    movt r1, #17200
+; CHECK-NEXT:    str r0, [sp]
+; CHECK-NEXT:    str r1, [sp, #4]
+; CHECK-NEXT:    vldr d16, .LCPI59_0
+; CHECK-NEXT:    vldr d17, [sp]
+; CHECK-NEXT:    vsub.f64 d0, d17, d16
+; CHECK-NEXT:    add sp, sp, #8
+; CHECK-NEXT:    bx lr
+; CHECK-NEXT:    .p2align 3
+; CHECK-NEXT:  @ %bb.1:
+; CHECK-NEXT:  .LCPI59_0:
+; CHECK-NEXT:    .long 2147483648 @ double 4503601774854144
+; CHECK-NEXT:    .long 1127219200
+  %val = call <1 x double> @llvm.experimental.constrained.sitofp.v1f64.v1i32(<1 x i32> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @uitofp_v1f64_v1i32(<1 x i32> %x) #0 {
+; CHECK-LABEL: uitofp_v1f64_v1i32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .pad #8
+; CHECK-NEXT:    sub sp, sp, #8
+; CHECK-NEXT:    movw r1, #0
+; CHECK-NEXT:    str r0, [sp]
+; CHECK-NEXT:    movt r1, #17200
+; CHECK-NEXT:    vldr d16, .LCPI60_0
+; CHECK-NEXT:    str r1, [sp, #4]
+; CHECK-NEXT:    vldr d17, [sp]
+; CHECK-NEXT:    vsub.f64 d0, d17, d16
+; CHECK-NEXT:    add sp, sp, #8
+; CHECK-NEXT:    bx lr
+; CHECK-NEXT:    .p2align 3
+; CHECK-NEXT:  @ %bb.1:
+; CHECK-NEXT:  .LCPI60_0:
+; CHECK-NEXT:    .long 0 @ double 4503599627370496
+; CHECK-NEXT:    .long 1127219200
+  %val = call <1 x double> @llvm.experimental.constrained.uitofp.v1f64.v1i32(<1 x i32> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @sitofp_v1f64_v1i64(<1 x i64> %x) #0 {
+; CHECK-LABEL: sitofp_v1f64_v1i64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    vmov.32 r0, d0[0]
+; CHECK-NEXT:    vmov.32 r1, d0[1]
+; CHECK-NEXT:    bl __aeabi_l2d
+; CHECK-NEXT:    vmov d0, r0, r1
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <1 x double> @llvm.experimental.constrained.sitofp.v1f64.v1i64(<1 x i64> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @uitofp_v1f64_v1i64(<1 x i64> %x) #0 {
+; CHECK-LABEL: uitofp_v1f64_v1i64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    vmov.32 r0, d0[0]
+; CHECK-NEXT:    vmov.32 r1, d0[1]
+; CHECK-NEXT:    bl __aeabi_ul2d
+; CHECK-NEXT:    vmov d0, r0, r1
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <1 x double> @llvm.experimental.constrained.uitofp.v1f64.v1i64(<1 x i64> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @sqrt_v1f64(<1 x double> %x) #0 {
+; CHECK-LABEL: sqrt_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vsqrt.f64 d0, d0
+; CHECK-NEXT:    bx lr
+  %val = call <1 x double> @llvm.experimental.constrained.sqrt.v1f64(<1 x double> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @rint_v1f64(<1 x double> %x) #0 {
+; CHECK-LABEL: rint_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    bl rint
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <1 x double> @llvm.experimental.constrained.rint.v1f64(<1 x double> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @nearbyint_v1f64(<1 x double> %x) #0 {
+; CHECK-LABEL: nearbyint_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    bl nearbyint
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <1 x double> @llvm.experimental.constrained.nearbyint.v1f64(<1 x double> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @maxnum_v1f64(<1 x double> %x, <1 x double> %y) #0 {
+; CHECK-LABEL: maxnum_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    bl fmax
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <1 x double> @llvm.experimental.constrained.maxnum.v1f64(<1 x double> %x, <1 x double> %y, metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @minnum_v1f64(<1 x double> %x, <1 x double> %y) #0 {
+; CHECK-LABEL: minnum_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    bl fmin
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <1 x double> @llvm.experimental.constrained.minnum.v1f64(<1 x double> %x, <1 x double> %y, metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @ceil_v1f64(<1 x double> %x) #0 {
+; CHECK-LABEL: ceil_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    bl ceil
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <1 x double> @llvm.experimental.constrained.ceil.v1f64(<1 x double> %x, metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @floor_v1f64(<1 x double> %x) #0 {
+; CHECK-LABEL: floor_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    bl floor
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <1 x double> @llvm.experimental.constrained.floor.v1f64(<1 x double> %x, metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @round_v1f64(<1 x double> %x) #0 {
+; CHECK-LABEL: round_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    bl round
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <1 x double> @llvm.experimental.constrained.round.v1f64(<1 x double> %x, metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @roundeven_v1f64(<1 x double> %x) #0 {
+; CHECK-LABEL: roundeven_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    bl roundeven
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <1 x double> @llvm.experimental.constrained.roundeven.v1f64(<1 x double> %x, metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x double> @trunc_v1f64(<1 x double> %x) #0 {
+; CHECK-LABEL: trunc_v1f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    .save {r11, lr}
+; CHECK-NEXT:    push {r11, lr}
+; CHECK-NEXT:    bl trunc
+; CHECK-NEXT:    pop {r11, pc}
+  %val = call <1 x double> @llvm.experimental.constrained.trunc.v1f64(<1 x double> %x, metadata !"fpexcept.strict") #0
+  ret <1 x double> %val
+}
+
+define <1 x i1> @fcmp_v1f64(<1 x double> %x, <1 x double> %y) #0 {
+; CHECK-LABEL: fcmp_v1f64:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vcmp.f64 d0, d1
+; CHECK-NEXT:    mov r0, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    movweq r0, #1
+; CHECK-NEXT:    bx lr
+entry:
+  %val = call <1 x i1> @llvm.experimental.constrained.fcmp.v1f64(<1 x double> %x, <1 x double> %y, metadata !"oeq", metadata !"fpexcept.strict")
+  ret <1 x i1> %val
+}
+
+define <1 x i1> @fcmps_v1f64(<1 x double> %x, <1 x double> %y) #0 {
+; CHECK-LABEL: fcmps_v1f64:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vcmpe.f64 d0, d1
+; CHECK-NEXT:    mov r0, #0
+; CHECK-NEXT:    vmrs APSR_nzcv, fpscr
+; CHECK-NEXT:    movweq r0, #1
+; CHECK-NEXT:    bx lr
+entry:
+  %val = call <1 x i1> @llvm.experimental.constrained.fcmps.v1f64(<1 x double> %x, <1 x double> %y, metadata !"oeq", metadata !"fpexcept.strict")
+  ret <1 x i1> %val
+}
+
+
+
+define <2 x float> @fptrunc_v2f32_v2f64(<2 x double> %x) #0 {
+; CHECK-LABEL: fptrunc_v2f32_v2f64:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vcvt.f32.f64 s5, d1
+; CHECK-NEXT:    vcvt.f32.f64 s4, d0
+; CHECK-NEXT:    vmov.f64 d0, d2
+; CHECK-NEXT:    bx lr
+  %val = call <2 x float> @llvm.experimental.constrained.fptrunc.v2f32.v2f64(<2 x double> %x, metadata !"round.tonearest", metadata !"fpexcept.strict") #0
+  ret <2 x float> %val
+}
+
+define <2 x double> @fpext_v2f64_v2f32(<2 x float> %x) #0 {
+; CHECK-LABEL: fpext_v2f64_v2f32:
+; CHECK:       @ %bb.0:
+; CHECK-NEXT:    vcvt.f64.f32 d17, s1
+; CHECK-NEXT:    vcvt.f64.f32 d16, s0
+; CHECK-NEXT:    vorr q0, q8, q8
+; CHECK-NEXT:    bx lr
+  %val = call <2 x double> @llvm.experimental.constrained.fpext.v2f64.v2f32(<2 x float> %x, metadata !"fpexcept.strict") #0
+  ret <2 x double> %val
+}
+
+attributes #0 = { strictfp }
diff --git a/llvm/test/CodeGen/ARM/fp16-fullfp16.ll b/llvm/test/CodeGen/ARM/fp16-fullfp16.ll
index b4060d5fdb574..7b9474313e5bf 100644
--- a/llvm/test/CodeGen/ARM/fp16-fullfp16.ll
+++ b/llvm/test/CodeGen/ARM/fp16-fullfp16.ll
@@ -675,8 +675,8 @@ define half @frem_f16(half %x, half %y) #0 {
 ; CHECK-LABEL: frem_f16:
 ; CHECK:         .save {r11, lr}
 ; CHECK-NEXT:    push {r11, lr}
-; CHECK-NEXT:    vcvtb.f32.f16 s0, s0
 ; CHECK-NEXT:    vcvtb.f32.f16 s1, s1
+; CHECK-NEXT:    vcvtb.f32.f16 s0, s0
 ; CHECK-NEXT:    bl fmodf
 ; CHECK-NEXT:    vcvtb.f16.f32 s0, s0
 ; CHECK-NEXT:    pop {r11, pc}
@@ -713,7 +713,7 @@ define i32 @fptosi_i32_f16(half %x) #0 {
 
 define i32 @fptoui_i32_f16(half %x) #0 {
 ; CHECK-LABEL: fptoui_i32_f16:
-; CHECK:         vcvt.s32.f16 s0, s0
+; CHECK:         vcvt.u32.f16 s0, s0
 ; CHECK-NEXT:    vmov r0, s0
 ; CHECK-NEXT:    bx lr
   %val = call i32 @llvm.experimental.constrained.fptoui.i32.f16(half %x, metadata !"fpexcept.strict") #0
@@ -925,8 +925,8 @@ define half @atan2_f16(half %x, half %y) #0 {
 ; CHECK-LABEL: atan2_f16:
 ; CHECK:         .save {r11, lr}
 ; CHECK-NEXT:    push {r11, lr}
-; CHECK-NEXT:    vcvtb.f32.f16 s0, s0
 ; CHECK-NEXT:    vcvtb.f32.f16 s1, s1
+; CHECK-NEXT:    vcvtb.f32.f16 s0, s0
 ; CHECK-NEXT:    bl atan2f
 ; CHECK-NEXT:    vcvtb.f16.f32 s0, s0
 ; CHECK-NEXT:    pop {r11, pc}
@@ -974,8 +974,8 @@ define half @pow_f16(half %x, half %y) #0 {
 ; CHECK-LABEL: pow_f16:
 ; CHECK:         .save {r11, lr}
 ; CHECK-NEXT:    push {r11, lr}
-; CHECK-NEXT:    vcvtb.f32.f16 s0, s0
 ; CHECK-NEXT:    vcvtb.f32.f16 s1, s1
+; CHECK-NEXT:    vcvtb.f32.f16 s0, s0
 ; CHECK-NEXT:    bl powf
 ; CHECK-NEXT:    vcvtb.f16.f32 s0, s0
 ; CHECK-NEXT:    pop {r11, pc}
diff --git a/llvm/test/CodeGen/BPF/atomic-oversize.ll b/llvm/test/CodeGen/BPF/atomic-oversize.ll
index 6dc49398f091d..187f0964d4fb8 100644
--- a/llvm/test/CodeGen/BPF/atomic-oversize.ll
+++ b/llvm/test/CodeGen/BPF/atomic-oversize.ll
@@ -1,4 +1,6 @@
 ; RUN: llc -mtriple=bpf < %s | FileCheck %s
+; XFAIL: *
+; Doesn't currently compile; llc fails with error 'only small returns supported'.
 
 define void @test(ptr %a) nounwind {
 ; CHECK-LABEL: test:
diff --git a/llvm/test/CodeGen/BPF/builtin_calls.ll b/llvm/test/CodeGen/BPF/builtin_calls.ll
deleted file mode 100644
index 18199eba7222a..0000000000000
--- a/llvm/test/CodeGen/BPF/builtin_calls.ll
+++ /dev/null
@@ -1,39 +0,0 @@
-; RUN: llc -march=bpfel -mattr=+allow-builtin-calls < %s | FileCheck %s
-;
-; C code for this test case:
-;
-; long func(long a, long b) {
-;     long x;
-;     return __builtin_mul_overflow(a, b, &x);
-; }
-
-
-declare { i64, i1 } @llvm.smul.with.overflow.i64(i64, i64)
-
-define noundef range(i64 0, 2) i64 @func(i64 noundef %a, i64 noundef %b) local_unnamed_addr {
-entry:
-  %0 = tail call { i64, i1 } @llvm.smul.with.overflow.i64(i64 %a, i64 %b)
-  %1 = extractvalue { i64, i1 } %0, 1
-  %conv = zext i1 %1 to i64
-  ret i64 %conv
-}
-
-; CHECK-LABEL: func
-; CHECK: r4 = r2
-; CHECK: r2 = r1
-; CHECK: r3 = r2
-; CHECK: r3 s>>= 63
-; CHECK: r5 = r4
-; CHECK: r5 s>>= 63
-; CHECK: r1 = r10
-; CHECK: r1 += -16
-; CHECK: call __multi3
-; CHECK: r1 = *(u64 *)(r10 - 16)
-; CHECK: r1 s>>= 63
-; CHECK: w0 = 1
-; CHECK: r2 = *(u64 *)(r10 - 8)
-; CHECK: if r2 != r1 goto LBB0_2
-; CHECK:  # %bb.1:                                # %entry
-; CHECK: w0 = 0
-; CHECK:  LBB0_2:                                 # %entry
-; CHECK: exit
\ No newline at end of file
diff --git a/llvm/test/CodeGen/BPF/struct_ret1.ll b/llvm/test/CodeGen/BPF/struct_ret1.ll
index eb66a7deacb91..40d17ec514c48 100644
--- a/llvm/test/CodeGen/BPF/struct_ret1.ll
+++ b/llvm/test/CodeGen/BPF/struct_ret1.ll
@@ -1,6 +1,6 @@
 ; RUN: not llc -mtriple=bpf < %s 2> %t1
 ; RUN: FileCheck %s < %t1
-; CHECK: error: <unknown>:0:0: in function bar { i64, i32 } (i32, i32, i32, i32, i32): stack arguments are not supported
+; CHECK: error: <unknown>:0:0: in function bar { i64, i32 } (i32, i32, i32, i32, i32): aggregate returns are not supported
 
 %struct.S = type { i32, i32, i32 }
 
diff --git a/llvm/test/CodeGen/BPF/struct_ret2.ll b/llvm/test/CodeGen/BPF/struct_ret2.ll
index a20280949215e..170d55cc29df0 100644
--- a/llvm/test/CodeGen/BPF/struct_ret2.ll
+++ b/llvm/test/CodeGen/BPF/struct_ret2.ll
@@ -1,6 +1,6 @@
 ; RUN: not llc -mtriple=bpf < %s 2> %t1
 ; RUN: FileCheck %s < %t1
-; CHECK: too many arguments
+; CHECK: only small returns
 
 ; Function Attrs: nounwind uwtable
 define { i64, i32 } @foo(i32 %a, i32 %b, i32 %c) #0 {
diff --git a/llvm/test/CodeGen/Hexagon/autohvx/fp-to-int_2.ll b/llvm/test/CodeGen/Hexagon/autohvx/fp-to-int_2.ll
new file mode 100644
index 0000000000000..03e484a6721a7
--- /dev/null
+++ b/llvm/test/CodeGen/Hexagon/autohvx/fp-to-int_2.ll
@@ -0,0 +1,75 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=hexagon -hexagon-hvx-widen=32 -hexagon-fp-fast-convert=true -mattr=+hvxv68,+hvx-length128b,+hvx-qfloat < %s | FileCheck %s --check-prefix=CHECK-V68
+; RUN: llc -mtriple=hexagon -hexagon-hvx-widen=32 -hexagon-fp-fast-convert=true -mattr=+hvxv81,+hvx-length128b,+hvx-qfloat < %s | FileCheck %s --check-prefix=CHECK-V81
+
+; ----------------------------
+; V68 Tests
+; ----------------------------
+
+; f16 -> s16 (No widening)
+define void @f16s16_0(ptr %a0, ptr %a1) #0 {
+; CHECK-V68-LABEL: f16s16_0:
+; CHECK-V68:       {
+; CHECK-V68:       [[DST:v[0-9]+]].h = [[SRC:v[0-9]+]].hf
+; CHECK-V68-NEXT:  jumpr r31
+; CHECK-V68:       vmem(r1+#0) = [[DST]].new
+; CHECK-V68-NEXT:  }
+  %v0 = load <64 x half>, ptr %a0, align 128
+  %v1 = fptosi <64 x half> %v0 to <64 x i16>
+  store <64 x i16> %v1, ptr %a1, align 128
+  ret void
+}
+
+; f32 -> s8 (Triggers V6_vconv_w_sf)
+define void @f32s8_2(ptr %a0, ptr %a1) {
+; CHECK-V68-LABEL: f32s8_2:
+; CHECK-V68:       {
+; CHECK-V68:       [[SRC:v[0-9]+]] = vmem(r0+#0)
+; CHECK-V68:       [[SRC]].w = [[SRC]].sf
+; CHECK-V68:       vpack
+; CHECK-V68:       vpack
+; CHECK-V68:       vpack
+; CHECK-V68:       jumpr r31
+; CHECK-V68:       vmem(r1+#0) = [[DST:v[0-9]+]]
+; CHECK-V68-NEXT:  }
+  %v0 = load <32 x float>, ptr %a0, align 128
+  %v1 = fptosi <32 x float> %v0 to <32 x i8>
+  store <32 x i8> %v1, ptr %a1, align 128
+  ret void
+}
+
+; ----------------------------
+; V81 Tests
+; ----------------------------
+
+; f16 -> s16 with rounding (V6_vconv_h_hf_rnd)
+define void @f16s16_v81(ptr %a0, ptr %a1) {
+; CHECK-V81-LABEL: f16s16_v81:
+; CHECK-V81:       {
+; CHECK-V81:       [[DST:v[0-9]+]].h = [[SRC:v[0-9]+]].hf:rnd
+; CHECK-V81-NEXT:  jumpr r31
+; CHECK-V81:       vmem(r1+#0) = [[DST]].new
+; CHECK-V81-NEXT:  }
+  %v0 = load <64 x half>, ptr %a0, align 128
+  %v1 = fptosi <64 x half> %v0 to <64 x i16>
+  store <64 x i16> %v1, ptr %a1, align 128
+  ret void
+}
+
+; f32 -> s8 with V81 (still uses V6_vconv_w_sf)
+define void @f32s8_v81(ptr %a0, ptr %a1) {
+; CHECK-V81-LABEL: f32s8_v81:
+; CHECK-V81:       {
+; CHECK-V81:       [[SRC:v[0-9]+]] = vmem(r0+#0)
+; CHECK-V81:       [[SRC]].w = [[SRC]].sf
+; CHECK-V81:       vpack
+; CHECK-V81:       vpack
+; CHECK-V81:       vpack
+; CHECK-V81:       jumpr r31
+; CHECK-V81:       vmem(r1+#0) = [[DST:v[0-9]+]]
+; CHECK-V81-NEXT:  }
+  %v0 = load <32 x float>, ptr %a0, align 128
+  %v1 = fptosi <32 x float> %v0 to <32 x i8>
+  store <32 x i8> %v1, ptr %a1, align 128
+  ret void
+}
diff --git a/llvm/test/CodeGen/LoongArch/O0-pipeline.ll b/llvm/test/CodeGen/LoongArch/O0-pipeline.ll
index 9006b5c8d6fe1..5f4fccdd72b12 100644
--- a/llvm/test/CodeGen/LoongArch/O0-pipeline.ll
+++ b/llvm/test/CodeGen/LoongArch/O0-pipeline.ll
@@ -9,9 +9,11 @@
 
 ; CHECK-LABEL: Pass Arguments:
 ; CHECK-NEXT: Target Library Information
+; CHECK-NEXT: Runtime Library Function Analysis
 ; CHECK-NEXT: Target Pass Configuration
 ; CHECK-NEXT: Machine Module Information
 ; CHECK-NEXT: Target Transform Information
+; CHECK-NEXT: Library Function Lowering Analysis
 ; CHECK-NEXT: Create Garbage Collector Module Metadata
 ; CHECK-NEXT: Assumption Cache Tracker
 ; CHECK-NEXT: Profile summary info
diff --git a/llvm/test/CodeGen/LoongArch/opt-pipeline.ll b/llvm/test/CodeGen/LoongArch/opt-pipeline.ll
index 661f67d4989c4..546ed6cec5c4a 100644
--- a/llvm/test/CodeGen/LoongArch/opt-pipeline.ll
+++ b/llvm/test/CodeGen/LoongArch/opt-pipeline.ll
@@ -17,9 +17,11 @@
 
 ; LAXX-LABEL: Pass Arguments:
 ; LAXX-NEXT: Target Library Information
+; LAXX-NEXT: Runtime Library Function Analysis
 ; LAXX-NEXT: Target Pass Configuration
 ; LAXX-NEXT: Machine Module Information
 ; LAXX-NEXT: Target Transform Information
+; LAXX-NEXT: Library Function Lowering Analysis
 ; LAXX-NEXT: Assumption Cache Tracker
 ; LAXX-NEXT: Type-Based Alias Analysis
 ; LAXX-NEXT: Scoped NoAlias Alias Analysis
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-empty-lanemask.mir b/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-empty-lanemask.mir
new file mode 100644
index 0000000000000..68324f1b2f90e
--- /dev/null
+++ b/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-empty-lanemask.mir
@@ -0,0 +1,13 @@
+# RUN: not llc -mtriple=amdgcn-amd-amdhsa -run-pass=none -filetype=null %s 2>&1 | FileCheck %s
+
+---
+name: test_empty_lanemask_type
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0
+
+    ; CHECK: [[@LINE+1]]:45: expected a valid lane mask value
+    $vgpr1 = COPY_LANEMASK $vgpr0, lanemask()
+    S_ENDPGM 0
+...
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-invalid-lanemask.mir b/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-invalid-lanemask.mir
new file mode 100644
index 0000000000000..647f6116f18f7
--- /dev/null
+++ b/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-invalid-lanemask.mir
@@ -0,0 +1,13 @@
+# RUN: not llc -mtriple=amdgcn-amd-amdhsa -run-pass=none -filetype=null %s 2>&1 | FileCheck %s
+
+---
+name: test_wrong_lanemask_type
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0
+
+    ; CHECK: [[@LINE+1]]:45: expected a valid lane mask value
+    $vgpr1 = COPY_LANEMASK $vgpr0, lanemask(undef)
+    S_ENDPGM 0
+...
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-missing-lparen.mir b/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-missing-lparen.mir
new file mode 100644
index 0000000000000..3382572f67213
--- /dev/null
+++ b/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-missing-lparen.mir
@@ -0,0 +1,13 @@
+# RUN: not llc -mtriple=amdgcn-amd-amdhsa -run-pass=none -filetype=null %s 2>&1 | FileCheck %s
+
+---
+name: test_missing_lparen
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0
+
+    ; CHECK: [[@LINE+1]]:45: expected '('
+    $vgpr1 = COPY_LANEMASK $vgpr0, lanemask 14)
+    S_ENDPGM 0
+...
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-missing-rparen.mir b/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-missing-rparen.mir
new file mode 100644
index 0000000000000..052305d9f9c36
--- /dev/null
+++ b/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand-missing-rparen.mir
@@ -0,0 +1,13 @@
+# RUN: not llc -mtriple=amdgcn-amd-amdhsa -run-pass=none -filetype=null %s 2>&1 | FileCheck %s
+
+---
+name: test_missing_rparen
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0
+
+    ; CHECK: [[@LINE+1]]:47: expected ')'
+    $vgpr1 = COPY_LANEMASK $vgpr0, lanemask(16
+    S_ENDPGM 0
+...
diff --git a/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand.mir b/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand.mir
new file mode 100644
index 0000000000000..066bc8e79a56e
--- /dev/null
+++ b/llvm/test/CodeGen/MIR/AMDGPU/parse-lanemask-operand.mir
@@ -0,0 +1,17 @@
+# RUN: llc -mtriple=amdgcn-amd-amdhsa -run-pass=none -verify-machineinstrs -o - %s | FileCheck %s
+
+# This test checks that the MIR parser correctly parses lanemask operands.
+
+# CHECK-LABEL: name: test_lanemask_operand
+# CHECK: COPY_LANEMASK $vgpr0, lanemask(0x0000000000000002)
+---
+name: test_lanemask_operand
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0
+
+    $vgpr1 = COPY_LANEMASK $vgpr0, lanemask(2)
+    S_ENDPGM 0
+...
+
diff --git a/llvm/test/CodeGen/MIR2Vec/Inputs/reference_x86_vocab_print.txt b/llvm/test/CodeGen/MIR2Vec/Inputs/reference_x86_vocab_print.txt
index 74ef1e608d4ba..62e07445ad12e 100644
--- a/llvm/test/CodeGen/MIR2Vec/Inputs/reference_x86_vocab_print.txt
+++ b/llvm/test/CodeGen/MIR2Vec/Inputs/reference_x86_vocab_print.txt
@@ -186,6 +186,7 @@ Key: CONVERGENCECTRL_ENTRY:  [ 0.00  0.00 ]
 Key: CONVERGENCECTRL_GLUE:  [ 0.00  0.00 ]
 Key: CONVERGENCECTRL_LOOP:  [ 0.00  0.00 ]
 Key: COPY:  [ 0.00  0.00 ]
+Key: COPY_LANEMASK:  [ 0.00  0.00 ]
 Key: COPY_TO_REGCLASS:  [ 0.00  0.00 ]
 Key: CPUID:  [ 0.00  0.00 ]
 Key: CQO:  [ 0.00  0.00 ]
@@ -6884,6 +6885,7 @@ Key: CFIIndex:  [ 0.00  0.00 ]
 Key: IntrinsicID:  [ 0.00  0.00 ]
 Key: Predicate:  [ 0.00  0.00 ]
 Key: ShuffleMask:  [ 0.00  0.00 ]
+Key: LaneMask:  [ 0.00  0.00 ]
 Key: PhyReg_GR8:  [ 0.00  0.00 ]
 Key: PhyReg_GRH8:  [ 0.00  0.00 ]
 Key: PhyReg_GR8_NOREX2:  [ 0.00  0.00 ]
diff --git a/llvm/test/CodeGen/MIR2Vec/Inputs/reference_x86_vocab_wo=0.5_print.txt b/llvm/test/CodeGen/MIR2Vec/Inputs/reference_x86_vocab_wo=0.5_print.txt
index 1ba4f13e69c92..03a3fafc6b801 100644
--- a/llvm/test/CodeGen/MIR2Vec/Inputs/reference_x86_vocab_wo=0.5_print.txt
+++ b/llvm/test/CodeGen/MIR2Vec/Inputs/reference_x86_vocab_wo=0.5_print.txt
@@ -186,6 +186,7 @@ Key: CONVERGENCECTRL_ENTRY:  [ 0.00  0.00 ]
 Key: CONVERGENCECTRL_GLUE:  [ 0.00  0.00 ]
 Key: CONVERGENCECTRL_LOOP:  [ 0.00  0.00 ]
 Key: COPY:  [ 0.00  0.00 ]
+Key: COPY_LANEMASK:  [ 0.00  0.00 ]
 Key: COPY_TO_REGCLASS:  [ 0.00  0.00 ]
 Key: CPUID:  [ 0.00  0.00 ]
 Key: CQO:  [ 0.00  0.00 ]
@@ -6884,6 +6885,7 @@ Key: CFIIndex:  [ 0.00  0.00 ]
 Key: IntrinsicID:  [ 0.00  0.00 ]
 Key: Predicate:  [ 0.00  0.00 ]
 Key: ShuffleMask:  [ 0.00  0.00 ]
+Key: LaneMask:  [ 0.00  0.00 ]
 Key: PhyReg_GR8:  [ 0.00  0.00 ]
 Key: PhyReg_GRH8:  [ 0.00  0.00 ]
 Key: PhyReg_GR8_NOREX2:  [ 0.00  0.00 ]
diff --git a/llvm/test/CodeGen/NVPTX/lower-aggr-copies.ll b/llvm/test/CodeGen/NVPTX/lower-aggr-copies.ll
index 297b2b984cdae..ad78e0fe7438b 100644
--- a/llvm/test/CodeGen/NVPTX/lower-aggr-copies.ll
+++ b/llvm/test/CodeGen/NVPTX/lower-aggr-copies.ll
@@ -20,19 +20,19 @@ entry:
 ; IR-LABEL:   @memcpy_caller
 ; IR:         entry:
 ; IR:         [[Cond:%[0-9]+]] = icmp ne i64 %n, 0
-; IR:         br i1 [[Cond]], label %loop-memcpy-expansion, label %post-loop-memcpy-expansion
+; IR:         br i1 [[Cond]], label %dynamic-memcpy-expansion-main-body, label %dynamic-memcpy-post-expansion
 
-; IR:         loop-memcpy-expansion:
-; IR:         %loop-index = phi i64 [ 0, %entry ], [ [[IndexInc:%[0-9]+]], %loop-memcpy-expansion ]
+; IR:         dynamic-memcpy-expansion-main-body:
+; IR:         %loop-index = phi i64 [ 0, %entry ], [ [[IndexInc:%[0-9]+]], %dynamic-memcpy-expansion-main-body ]
 ; IR:         [[SrcGep:%[0-9]+]] = getelementptr inbounds i8, ptr %src, i64 %loop-index
 ; IR:         [[Load:%[0-9]+]] = load i8, ptr [[SrcGep]]
 ; IR:         [[DstGep:%[0-9]+]] = getelementptr inbounds i8, ptr %dst, i64 %loop-index
 ; IR:         store i8 [[Load]], ptr [[DstGep]]
 ; IR:         [[IndexInc]] = add i64 %loop-index, 1
 ; IR:         [[Cond2:%[0-9]+]] = icmp ult i64 [[IndexInc]], %n
-; IR:         br i1 [[Cond2]], label %loop-memcpy-expansion, label %post-loop-memcpy-expansion
+; IR:         br i1 [[Cond2]], label %dynamic-memcpy-expansion-main-body, label %dynamic-memcpy-post-expansion
 
-; IR-LABEL:   post-loop-memcpy-expansion:
+; IR-LABEL:   dynamic-memcpy-post-expansion:
 ; IR:         ret ptr %dst
 
 ; PTX-LABEL:  .visible .func (.param .b64 func_retval0) memcpy_caller
@@ -53,19 +53,19 @@ entry:
 ; IR-LABEL:   @memcpy_volatile_caller
 ; IR:         entry:
 ; IR:         [[Cond:%[0-9]+]] = icmp ne i64 %n, 0
-; IR:         br i1 [[Cond]], label %loop-memcpy-expansion, label %post-loop-memcpy-expansion
+; IR:         br i1 [[Cond]], label %dynamic-memcpy-expansion-main-body, label %dynamic-memcpy-post-expansion
 
-; IR:         loop-memcpy-expansion:
-; IR:         %loop-index = phi i64 [ 0, %entry ], [ [[IndexInc:%[0-9]+]], %loop-memcpy-expansion ]
+; IR:         dynamic-memcpy-expansion-main-body:
+; IR:         %loop-index = phi i64 [ 0, %entry ], [ [[IndexInc:%[0-9]+]], %dynamic-memcpy-expansion-main-body ]
 ; IR:         [[SrcGep:%[0-9]+]] = getelementptr inbounds i8, ptr %src, i64 %loop-index
 ; IR:         [[Load:%[0-9]+]] = load volatile i8, ptr [[SrcGep]]
 ; IR:         [[DstGep:%[0-9]+]] = getelementptr inbounds i8, ptr %dst, i64 %loop-index
 ; IR:         store volatile i8 [[Load]], ptr [[DstGep]]
 ; IR:         [[IndexInc]] = add i64 %loop-index, 1
 ; IR:         [[Cond2:%[0-9]+]] = icmp ult i64 [[IndexInc]], %n
-; IR:         br i1 [[Cond2]], label %loop-memcpy-expansion, label %post-loop-memcpy-expansion
+; IR:         br i1 [[Cond2]], label %dynamic-memcpy-expansion-main-body, label %dynamic-memcpy-post-expansion
 
-; IR-LABEL:   post-loop-memcpy-expansion:
+; IR-LABEL:   dynamic-memcpy-post-expansion:
 ; IR:         ret ptr %dst
 
 
@@ -97,16 +97,16 @@ entry:
 ; Check that calls with compile-time constant size are handled correctly
 ; IR-LABEL:    @memcpy_known_size
 ; IR:          entry:
-; IR:          br label %load-store-loop
-; IR:          load-store-loop:
-; IR:          %loop-index = phi i64 [ 0, %entry ], [ [[IndexInc:%[0-9]+]], %load-store-loop ]
+; IR:          br label %static-memcpy-expansion-main-body
+; IR:          static-memcpy-expansion-main-body:
+; IR:          %loop-index = phi i64 [ 0, %entry ], [ [[IndexInc:%[0-9]+]], %static-memcpy-expansion-main-body ]
 ; IR:          [[SrcGep:%[0-9]+]] = getelementptr inbounds i8, ptr %src, i64 %loop-index
 ; IR:          [[Load:%[0-9]+]] = load i8, ptr [[SrcGep]]
 ; IR:          [[DstGep:%[0-9]+]] = getelementptr inbounds i8, ptr %dst, i64 %loop-index
 ; IR:          store i8 [[Load]], ptr [[DstGep]]
 ; IR:          [[IndexInc]] = add i64 %loop-index, 1
 ; IR:          [[Cond:%[0-9]+]] = icmp ult i64 %3, 144
-; IR:          br i1 [[Cond]], label %load-store-loop, label %memcpy-split
+; IR:          br i1 [[Cond]], label %static-memcpy-expansion-main-body, label %static-memcpy-post-expansion
 }
 
 define ptr @memset_caller(ptr %dst, i32 %c, i64 %n) #0 {
diff --git a/llvm/test/CodeGen/NVPTX/param-add.ll b/llvm/test/CodeGen/NVPTX/param-add.ll
index 06d7384200696..b220450b61ce9 100644
--- a/llvm/test/CodeGen/NVPTX/param-add.ll
+++ b/llvm/test/CodeGen/NVPTX/param-add.ll
@@ -2,11 +2,6 @@
 ; RUN: llc < %s -march=nvptx64 --debug-counter=dagcombine=0 | FileCheck %s
 ; RUN: %if ptxas %{ llc < %s -march=nvptx64 --debug-counter=dagcombine=0 | %ptxas-verify %}
 
-; REQUIRES: asserts
-; asserts are required for --debug-counter=dagcombine=0 to have the intended
-; effect of disabling DAG combines, which exposes the bug. When combines are
-; enabled the bug does not occur.
-
 %struct.1float = type <{ [1 x float] }>
 
 declare i32 @callee(%struct.1float %a)
diff --git a/llvm/test/CodeGen/NVPTX/switch-loop-header.mir b/llvm/test/CodeGen/NVPTX/switch-loop-header.mir
new file mode 100644
index 0000000000000..4d86bb879f18f
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/switch-loop-header.mir
@@ -0,0 +1,182 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 6
+# RUN: llc -o - %s -passes="require<machine-loops>,require<live-vars>,phi-node-elimination" | FileCheck %s
+
+--- |
+  target datalayout = "e-p6:32:32-i64:64-i128:128-i256:256-v16:16-v32:32-n16:32:64"
+  target triple = "nvptx64-unknown-nvidiacl"
+
+  define void @func_26(i32 %BS_COND_16.0.BS_COND_16.0.BS_COND_16.0.BS_COND_16.0.) {
+  entry:
+    br label %for.cond
+
+  for.cond:                                         ; preds = %BS_LABEL_1, %BS_LABEL_1, %entry
+    %p_2218_0.1 = phi i32 [ 0, %entry ], [ %p_2218_0.3, %BS_LABEL_1 ], [ %p_2218_0.3, %BS_LABEL_1 ]
+    br label %BS_LABEL_1
+
+  BS_LABEL_2:                                       ; preds = %BS_LABEL_1
+    %sub = or i32 %p_2218_0.3, 1
+    br label %for.cond4
+
+  for.cond4:                                        ; preds = %BS_LABEL_1, %BS_LABEL_2
+    %p_2218_0.2 = phi i32 [ %BS_COND_16.0.BS_COND_16.0.BS_COND_16.0.BS_COND_16.0., %BS_LABEL_1 ], [ %sub, %BS_LABEL_2 ]
+    br label %BS_LABEL_1
+
+  BS_LABEL_1:                                       ; preds = %for.cond4, %for.cond
+    %p_2218_0.3 = phi i32 [ %p_2218_0.2, %for.cond4 ], [ %p_2218_0.1, %for.cond ]
+    switch i32 %BS_COND_16.0.BS_COND_16.0.BS_COND_16.0.BS_COND_16.0., label %unreachable [
+      i32 0, label %for.cond4
+      i32 4, label %BS_LABEL_2
+      i32 1, label %for.cond
+      i32 6, label %for.cond
+    ]
+
+  unreachable:                                      ; preds = %BS_LABEL_1
+    call void asm sideeffect "exit;", ""()
+    unreachable
+  }
+...
+---
+name:            func_26
+alignment:       1
+exposesReturnsTwice: false
+legalized:       false
+regBankSelected: false
+selected:        false
+failedISel:      false
+tracksRegLiveness: true
+hasWinCFI:       false
+noPhis:          false
+isSSA:           true
+noVRegs:         false
+hasFakeUses:     false
+callsEHReturn:   false
+callsUnwindInit: false
+hasEHContTarget: false
+hasEHScopes:     false
+hasEHFunclets:   false
+isOutlined:      false
+debugInstrRef:   false
+failsVerification: false
+tracksDebugUserValues: false
+registers:
+  - { id: 0, class: b32, preferred-register: '', flags: [  ] }
+  - { id: 1, class: b32, preferred-register: '', flags: [  ] }
+  - { id: 2, class: b32, preferred-register: '', flags: [  ] }
+  - { id: 3, class: b32, preferred-register: '', flags: [  ] }
+  - { id: 4, class: b32, preferred-register: '', flags: [  ] }
+  - { id: 5, class: b32, preferred-register: '', flags: [  ] }
+  - { id: 6, class: b32, preferred-register: '', flags: [  ] }
+  - { id: 7, class: b1, preferred-register: '', flags: [  ] }
+  - { id: 8, class: b32, preferred-register: '', flags: [  ] }
+  - { id: 9, class: b1, preferred-register: '', flags: [  ] }
+  - { id: 10, class: b32, preferred-register: '', flags: [  ] }
+  - { id: 11, class: b1, preferred-register: '', flags: [  ] }
+liveins:         []
+frameInfo:
+  isFrameAddressTaken: false
+  isReturnAddressTaken: false
+  hasStackMap:     false
+  hasPatchPoint:   false
+  stackSize:       0
+  offsetAdjustment: 0
+  maxAlignment:    1
+  adjustsStack:    false
+  hasCalls:        false
+  stackProtector:  ''
+  functionContext: ''
+  maxCallFrameSize: 4294967295
+  cvBytesOfCalleeSavedRegisters: 0
+  hasOpaqueSPAdjustment: false
+  hasVAStart:      false
+  hasMustTailInVarArgFunc: false
+  hasTailCall:     false
+  isCalleeSavedInfoValid: false
+  localFrameSize:  0
+fixedStack:      []
+stack:           []
+entry_values:    []
+callSites:       []
+debugValueSubstitutions: []
+constants:       []
+machineFunctionInfo: {}
+jumpTable:
+  kind:            inline
+  entries:
+    - id:              0
+      blocks:          [ '%bb.3', '%bb.1', '%bb.6', '%bb.6', '%bb.2', '%bb.6',
+                         '%bb.1' ]
+body:             |
+  ; CHECK-LABEL: name: func_26
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   dead [[DEF:%[0-9]+]]:b32 = IMPLICIT_DEF
+  ; CHECK-NEXT:   dead [[DEF1:%[0-9]+]]:b1 = IMPLICIT_DEF
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   successors: %bb.4(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   dead [[DEF2:%[0-9]+]]:b32 = IMPLICIT_DEF
+  ; CHECK-NEXT:   GOTO %bb.4
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.2:
+  ; CHECK-NEXT:   successors: %bb.3(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.3:
+  ; CHECK-NEXT:   successors: %bb.4(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.4:
+  ; CHECK-NEXT:   successors: %bb.6(0x00000000), %bb.5(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   CBranch undef [[DEF1]], %bb.6
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.5:
+  ; CHECK-NEXT:   successors: %bb.3(0x3e000000), %bb.1(0x04000000), %bb.6(0x00000000), %bb.2(0x3e000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   BRX_START 0
+  ; CHECK-NEXT:   BRX_ITEM %bb.3
+  ; CHECK-NEXT:   BRX_ITEM %bb.1
+  ; CHECK-NEXT:   BRX_ITEM %bb.6
+  ; CHECK-NEXT:   BRX_ITEM %bb.6
+  ; CHECK-NEXT:   BRX_ITEM %bb.2
+  ; CHECK-NEXT:   BRX_ITEM %bb.6
+  ; CHECK-NEXT:   BRX_END %bb.1, undef [[DEF]], 0
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.6:
+  bb.0:
+    successors: %bb.1(0x80000000)
+
+    %10:b32 = IMPLICIT_DEF
+    %11:b1 = IMPLICIT_DEF
+
+  bb.1:
+    successors: %bb.4(0x80000000)
+
+    %0:b32 = PHI undef %10, %bb.0, undef %0, %bb.5
+    GOTO %bb.4
+
+  bb.2:
+    successors: %bb.3(0x80000000)
+
+  bb.3:
+    successors: %bb.4(0x80000000)
+
+  bb.4:
+    successors: %bb.6(0x00000000), %bb.5(0x80000000)
+
+    CBranch undef %11, %bb.6
+
+  bb.5:
+    successors: %bb.3(0x3e000000), %bb.1(0x04000000), %bb.6(0x00000000), %bb.2(0x3e000000)
+
+    BRX_START 0
+    BRX_ITEM %bb.3
+    BRX_ITEM %bb.1
+    BRX_ITEM %bb.6
+    BRX_ITEM %bb.6
+    BRX_ITEM %bb.2
+    BRX_ITEM %bb.6
+    BRX_END %bb.1, undef %10, 0
+
+  bb.6:
+...
diff --git a/llvm/test/CodeGen/NVPTX/switch.ll b/llvm/test/CodeGen/NVPTX/switch.ll
new file mode 100644
index 0000000000000..7fcfcfbb85d00
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/switch.ll
@@ -0,0 +1,73 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mcpu=sm_20 -verify-machineinstrs | FileCheck %s
+
+target triple = "nvptx64-unknown-nvidiacl"
+
+define void @pr170051(i32 %cond) {
+; CHECK-LABEL: pr170051(
+; CHECK:       {
+; CHECK-NEXT:    .reg .pred %p<2>;
+; CHECK-NEXT:    .reg .b32 %r<4>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0: // %entry
+; CHECK-NEXT:    mov.b32 %r2, 0;
+; CHECK-NEXT:    ld.param.b32 %r1, [pr170051_param_0];
+; CHECK-NEXT:    setp.gt.u32 %p1, %r1, 6;
+; CHECK-NEXT:    bra.uni $L__BB0_3;
+; CHECK-NEXT:  $L__BB0_1: // %BS_LABEL_2
+; CHECK-NEXT:    // in Loop: Header=BB0_3 Depth=1
+; CHECK-NEXT:    or.b32 %r3, %r2, 1;
+; CHECK-NEXT:  $L__BB0_2: // %for.cond4
+; CHECK-NEXT:    // in Loop: Header=BB0_3 Depth=1
+; CHECK-NEXT:    mov.b32 %r2, %r3;
+; CHECK-NEXT:  $L__BB0_3: // %BS_LABEL_1
+; CHECK-NEXT:    // =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:    @%p1 bra $L__BB0_5;
+; CHECK-NEXT:  // %bb.4: // %BS_LABEL_1
+; CHECK-NEXT:    // in Loop: Header=BB0_3 Depth=1
+; CHECK-NEXT:    mov.b32 %r3, %r1;
+; CHECK-NEXT:    $L_brx_0: .branchtargets
+; CHECK-NEXT:     $L__BB0_2,
+; CHECK-NEXT:     $L__BB0_3,
+; CHECK-NEXT:     $L__BB0_5,
+; CHECK-NEXT:     $L__BB0_5,
+; CHECK-NEXT:     $L__BB0_1,
+; CHECK-NEXT:     $L__BB0_5,
+; CHECK-NEXT:     $L__BB0_3;
+; CHECK-NEXT:    brx.idx %r1, $L_brx_0;
+; CHECK-NEXT:  $L__BB0_5: // %unreachable
+; CHECK-NEXT:    // begin inline asm
+; CHECK-NEXT:    exit;
+; CHECK-NEXT:    // end inline asm
+entry:
+  br label %for.cond
+
+for.cond:                                         ; preds = %for.cond4.for.cond_crit_edge, %BS_LABEL_1, %BS_LABEL_1, %entry
+  %p_2218_0.1 = phi i32 [ 0, %entry ], [ %p_2218_0.3, %BS_LABEL_1 ], [ %p_2218_0.3, %BS_LABEL_1 ], [ poison, %for.cond4.for.cond_crit_edge ]
+  br label %BS_LABEL_1
+
+BS_LABEL_2:                                       ; preds = %BS_LABEL_1
+  %sub = or i32 %p_2218_0.3, 1
+  br label %for.cond4
+
+for.cond4:                                        ; preds = %BS_LABEL_1, %BS_LABEL_2
+  %p_2218_0.2 = phi i32 [ 0, %BS_LABEL_1 ], [ %sub, %BS_LABEL_2 ]
+  br i1 false, label %for.cond4.for.cond_crit_edge, label %BS_LABEL_1
+
+for.cond4.for.cond_crit_edge:                     ; preds = %for.cond4
+  br label %for.cond
+
+BS_LABEL_1:                                       ; preds = %for.cond4, %for.cond
+  %p_2218_0.3 = phi i32 [ %p_2218_0.2, %for.cond4 ], [ %p_2218_0.1, %for.cond ]
+  switch i32 %cond, label %unreachable [
+    i32 0, label %for.cond4
+    i32 4, label %BS_LABEL_2
+    i32 1, label %for.cond
+    i32 6, label %for.cond
+  ]
+
+unreachable:                                      ; preds = %BS_LABEL_1
+  unreachable
+}
+
+
diff --git a/llvm/test/CodeGen/PowerPC/O0-pipeline.ll b/llvm/test/CodeGen/PowerPC/O0-pipeline.ll
index 38b1074e55d22..ac04be436f6a1 100644
--- a/llvm/test/CodeGen/PowerPC/O0-pipeline.ll
+++ b/llvm/test/CodeGen/PowerPC/O0-pipeline.ll
@@ -6,9 +6,11 @@
 
 ; CHECK-LABEL: Pass Arguments:
 ; CHECK-NEXT: Target Library Information
+; CHECK-NEXT: Runtime Library Function Analysis
 ; CHECK-NEXT: Target Pass Configuration
 ; CHECK-NEXT: Machine Module Information
 ; CHECK-NEXT: Target Transform Information
+; CHECK-NEXT: Library Function Lowering Analysis
 ; CHECK-NEXT: Create Garbage Collector Module Metadata
 ; CHECK-NEXT: Assumption Cache Tracker
 ; CHECK-NEXT: Profile summary info
diff --git a/llvm/test/CodeGen/PowerPC/O3-pipeline.ll b/llvm/test/CodeGen/PowerPC/O3-pipeline.ll
index 7cbb1a1c98873..fd8fd5fa34a17 100644
--- a/llvm/test/CodeGen/PowerPC/O3-pipeline.ll
+++ b/llvm/test/CodeGen/PowerPC/O3-pipeline.ll
@@ -5,9 +5,11 @@
 ; REQUIRES: asserts
 ; CHECK-LABEL: Pass Arguments:
 ; CHECK-NEXT: Target Library Information
+; CHECK-NEXT: Runtime Library Function Analysis
 ; CHECK-NEXT: Target Pass Configuration
 ; CHECK-NEXT: Machine Module Information
 ; CHECK-NEXT: Target Transform Information
+; CHECK-NEXT: Library Function Lowering Analysis
 ; CHECK-NEXT: Assumption Cache Tracker
 ; CHECK-NEXT: Type-Based Alias Analysis
 ; CHECK-NEXT: Scoped NoAlias Alias Analysis
diff --git a/llvm/test/CodeGen/PowerPC/peephole-counter-XToI.mir b/llvm/test/CodeGen/PowerPC/peephole-counter-XToI.mir
index dc20a1577aa5b..638b533a32a0d 100644
--- a/llvm/test/CodeGen/PowerPC/peephole-counter-XToI.mir
+++ b/llvm/test/CodeGen/PowerPC/peephole-counter-XToI.mir
@@ -1,5 +1,4 @@
 # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
-# REQUIRES: asserts
 # RUN: llc -mtriple=powerpc64le-unknown-linux-gnu -verify-machineinstrs \
 # RUN:   -run-pass ppc-mi-peepholes %s -o - | FileCheck %s --check-prefix=ALL
 # RUN: llc -mtriple=powerpc64le-unknown-linux-gnu -verify-machineinstrs \
diff --git a/llvm/test/CodeGen/RISCV/GlobalISel/instruction-select/rvv/select.mir b/llvm/test/CodeGen/RISCV/GlobalISel/instruction-select/rvv/select.mir
index f8061462c6220..ada76a43639d7 100644
--- a/llvm/test/CodeGen/RISCV/GlobalISel/instruction-select/rvv/select.mir
+++ b/llvm/test/CodeGen/RISCV/GlobalISel/instruction-select/rvv/select.mir
@@ -11,7 +11,7 @@ body:             |
   bb.0.entry:
     ; RV32I-LABEL: name: select_nxv1i8
     ; RV32I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vr = IMPLICIT_DEF
+    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[DEF2:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[PseudoVMERGE_VVM_MF4_:%[0-9]+]]:vrnov0 = PseudoVMERGE_VVM_MF4 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 3 /* e8 */
     ; RV32I-NEXT: $v8 = COPY [[PseudoVMERGE_VVM_MF4_]]
@@ -19,7 +19,7 @@ body:             |
     ;
     ; RV64I-LABEL: name: select_nxv1i8
     ; RV64I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vr = IMPLICIT_DEF
+    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[DEF2:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[PseudoVMERGE_VVM_MF4_:%[0-9]+]]:vrnov0 = PseudoVMERGE_VVM_MF4 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 3 /* e8 */
     ; RV64I-NEXT: $v8 = COPY [[PseudoVMERGE_VVM_MF4_]]
@@ -40,7 +40,7 @@ body:             |
   bb.0.entry:
     ; RV32I-LABEL: name: select_nxv4i8
     ; RV32I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vr = IMPLICIT_DEF
+    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[DEF2:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[PseudoVMERGE_VVM_M1_:%[0-9]+]]:vrnov0 = PseudoVMERGE_VVM_M1 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 3 /* e8 */
     ; RV32I-NEXT: $v8 = COPY [[PseudoVMERGE_VVM_M1_]]
@@ -48,7 +48,7 @@ body:             |
     ;
     ; RV64I-LABEL: name: select_nxv4i8
     ; RV64I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vr = IMPLICIT_DEF
+    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[DEF2:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[PseudoVMERGE_VVM_M1_:%[0-9]+]]:vrnov0 = PseudoVMERGE_VVM_M1 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 3 /* e8 */
     ; RV64I-NEXT: $v8 = COPY [[PseudoVMERGE_VVM_M1_]]
@@ -69,7 +69,7 @@ body:             |
   bb.0.entry:
     ; RV32I-LABEL: name: select_nxv16i8
     ; RV32I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrm4 = IMPLICIT_DEF
+    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrm4nov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[DEF2:%[0-9]+]]:vrm4nov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[PseudoVMERGE_VVM_M4_:%[0-9]+]]:vrm4nov0 = PseudoVMERGE_VVM_M4 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 3 /* e8 */
     ; RV32I-NEXT: $v8m4 = COPY [[PseudoVMERGE_VVM_M4_]]
@@ -77,7 +77,7 @@ body:             |
     ;
     ; RV64I-LABEL: name: select_nxv16i8
     ; RV64I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrm4 = IMPLICIT_DEF
+    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrm4nov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[DEF2:%[0-9]+]]:vrm4nov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[PseudoVMERGE_VVM_M4_:%[0-9]+]]:vrm4nov0 = PseudoVMERGE_VVM_M4 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 3 /* e8 */
     ; RV64I-NEXT: $v8m4 = COPY [[PseudoVMERGE_VVM_M4_]]
@@ -98,7 +98,7 @@ body:             |
   bb.0.entry:
     ; RV32I-LABEL: name: select_nxv64i8
     ; RV32I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vr = IMPLICIT_DEF
+    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[DEF2:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[PseudoVMERGE_VVM_MF4_:%[0-9]+]]:vrnov0 = PseudoVMERGE_VVM_MF4 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 4 /* e16 */
     ; RV32I-NEXT: $v8 = COPY [[PseudoVMERGE_VVM_MF4_]]
@@ -106,7 +106,7 @@ body:             |
     ;
     ; RV64I-LABEL: name: select_nxv64i8
     ; RV64I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vr = IMPLICIT_DEF
+    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[DEF2:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[PseudoVMERGE_VVM_MF4_:%[0-9]+]]:vrnov0 = PseudoVMERGE_VVM_MF4 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 4 /* e16 */
     ; RV64I-NEXT: $v8 = COPY [[PseudoVMERGE_VVM_MF4_]]
@@ -127,7 +127,7 @@ body:             |
   bb.0.entry:
     ; RV32I-LABEL: name: select_nxv2i16
     ; RV32I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vr = IMPLICIT_DEF
+    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[DEF2:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[PseudoVMERGE_VVM_M1_:%[0-9]+]]:vrnov0 = PseudoVMERGE_VVM_M1 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 4 /* e16 */
     ; RV32I-NEXT: $v8 = COPY [[PseudoVMERGE_VVM_M1_]]
@@ -135,7 +135,7 @@ body:             |
     ;
     ; RV64I-LABEL: name: select_nxv2i16
     ; RV64I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vr = IMPLICIT_DEF
+    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[DEF2:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[PseudoVMERGE_VVM_M1_:%[0-9]+]]:vrnov0 = PseudoVMERGE_VVM_M1 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 4 /* e16 */
     ; RV64I-NEXT: $v8 = COPY [[PseudoVMERGE_VVM_M1_]]
@@ -156,7 +156,7 @@ body:             |
   bb.0.entry:
     ; RV32I-LABEL: name: select_nxv8i16
     ; RV32I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrm4 = IMPLICIT_DEF
+    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrm4nov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[DEF2:%[0-9]+]]:vrm4nov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[PseudoVMERGE_VVM_M4_:%[0-9]+]]:vrm4nov0 = PseudoVMERGE_VVM_M4 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 4 /* e16 */
     ; RV32I-NEXT: $v8m4 = COPY [[PseudoVMERGE_VVM_M4_]]
@@ -164,7 +164,7 @@ body:             |
     ;
     ; RV64I-LABEL: name: select_nxv8i16
     ; RV64I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrm4 = IMPLICIT_DEF
+    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrm4nov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[DEF2:%[0-9]+]]:vrm4nov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[PseudoVMERGE_VVM_M4_:%[0-9]+]]:vrm4nov0 = PseudoVMERGE_VVM_M4 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 4 /* e16 */
     ; RV64I-NEXT: $v8m4 = COPY [[PseudoVMERGE_VVM_M4_]]
@@ -185,7 +185,7 @@ body:             |
   bb.0.entry:
     ; RV32I-LABEL: name: select_nxv32i16
     ; RV32I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vr = IMPLICIT_DEF
+    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[DEF2:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[PseudoVMERGE_VVM_MF2_:%[0-9]+]]:vrnov0 = PseudoVMERGE_VVM_MF2 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 5 /* e32 */
     ; RV32I-NEXT: $v8 = COPY [[PseudoVMERGE_VVM_MF2_]]
@@ -193,7 +193,7 @@ body:             |
     ;
     ; RV64I-LABEL: name: select_nxv32i16
     ; RV64I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vr = IMPLICIT_DEF
+    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[DEF2:%[0-9]+]]:vrnov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[PseudoVMERGE_VVM_MF2_:%[0-9]+]]:vrnov0 = PseudoVMERGE_VVM_MF2 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 5 /* e32 */
     ; RV64I-NEXT: $v8 = COPY [[PseudoVMERGE_VVM_MF2_]]
@@ -214,7 +214,7 @@ body:             |
   bb.0.entry:
     ; RV32I-LABEL: name: select_nxv2i32
     ; RV32I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrm2 = IMPLICIT_DEF
+    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrm2nov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[DEF2:%[0-9]+]]:vrm2nov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[PseudoVMERGE_VVM_M2_:%[0-9]+]]:vrm2nov0 = PseudoVMERGE_VVM_M2 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 5 /* e32 */
     ; RV32I-NEXT: $v8m2 = COPY [[PseudoVMERGE_VVM_M2_]]
@@ -222,7 +222,7 @@ body:             |
     ;
     ; RV64I-LABEL: name: select_nxv2i32
     ; RV64I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrm2 = IMPLICIT_DEF
+    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrm2nov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[DEF2:%[0-9]+]]:vrm2nov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[PseudoVMERGE_VVM_M2_:%[0-9]+]]:vrm2nov0 = PseudoVMERGE_VVM_M2 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 5 /* e32 */
     ; RV64I-NEXT: $v8m2 = COPY [[PseudoVMERGE_VVM_M2_]]
@@ -243,7 +243,7 @@ body:             |
   bb.0.entry:
     ; RV32I-LABEL: name: select_nxv8i32
     ; RV32I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrm8 = IMPLICIT_DEF
+    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrm8nov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[DEF2:%[0-9]+]]:vrm8nov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[PseudoVMERGE_VVM_M8_:%[0-9]+]]:vrm8nov0 = PseudoVMERGE_VVM_M8 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 5 /* e32 */
     ; RV32I-NEXT: $v8m8 = COPY [[PseudoVMERGE_VVM_M8_]]
@@ -251,7 +251,7 @@ body:             |
     ;
     ; RV64I-LABEL: name: select_nxv8i32
     ; RV64I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrm8 = IMPLICIT_DEF
+    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrm8nov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[DEF2:%[0-9]+]]:vrm8nov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[PseudoVMERGE_VVM_M8_:%[0-9]+]]:vrm8nov0 = PseudoVMERGE_VVM_M8 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 5 /* e32 */
     ; RV64I-NEXT: $v8m8 = COPY [[PseudoVMERGE_VVM_M8_]]
@@ -272,7 +272,7 @@ body:             |
   bb.0.entry:
     ; RV32I-LABEL: name: select_nxv1i64
     ; RV32I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrm2 = IMPLICIT_DEF
+    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrm2nov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[DEF2:%[0-9]+]]:vrm2nov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[PseudoVMERGE_VVM_M2_:%[0-9]+]]:vrm2nov0 = PseudoVMERGE_VVM_M2 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 6 /* e64 */
     ; RV32I-NEXT: $v8m2 = COPY [[PseudoVMERGE_VVM_M2_]]
@@ -280,7 +280,7 @@ body:             |
     ;
     ; RV64I-LABEL: name: select_nxv1i64
     ; RV64I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrm2 = IMPLICIT_DEF
+    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrm2nov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[DEF2:%[0-9]+]]:vrm2nov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[PseudoVMERGE_VVM_M2_:%[0-9]+]]:vrm2nov0 = PseudoVMERGE_VVM_M2 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 6 /* e64 */
     ; RV64I-NEXT: $v8m2 = COPY [[PseudoVMERGE_VVM_M2_]]
@@ -301,7 +301,7 @@ body:             |
   bb.0.entry:
     ; RV32I-LABEL: name: select_nxv4i64
     ; RV32I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrm8 = IMPLICIT_DEF
+    ; RV32I-NEXT: [[DEF1:%[0-9]+]]:vrm8nov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[DEF2:%[0-9]+]]:vrm8nov0 = IMPLICIT_DEF
     ; RV32I-NEXT: [[PseudoVMERGE_VVM_M8_:%[0-9]+]]:vrm8nov0 = PseudoVMERGE_VVM_M8 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 6 /* e64 */
     ; RV32I-NEXT: $v8m8 = COPY [[PseudoVMERGE_VVM_M8_]]
@@ -309,7 +309,7 @@ body:             |
     ;
     ; RV64I-LABEL: name: select_nxv4i64
     ; RV64I: [[DEF:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrm8 = IMPLICIT_DEF
+    ; RV64I-NEXT: [[DEF1:%[0-9]+]]:vrm8nov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[DEF2:%[0-9]+]]:vrm8nov0 = IMPLICIT_DEF
     ; RV64I-NEXT: [[PseudoVMERGE_VVM_M8_:%[0-9]+]]:vrm8nov0 = PseudoVMERGE_VVM_M8 [[DEF2]], [[DEF1]], [[DEF1]], [[DEF]], -1, 6 /* e64 */
     ; RV64I-NEXT: $v8m8 = COPY [[PseudoVMERGE_VVM_M8_]]
diff --git a/llvm/test/CodeGen/RISCV/GlobalISel/legalizer/rvv/legalize-extract-subvector.mir b/llvm/test/CodeGen/RISCV/GlobalISel/legalizer/rvv/legalize-extract-subvector.mir
index dcee71432f4c3..ab352553b4b55 100644
--- a/llvm/test/CodeGen/RISCV/GlobalISel/legalizer/rvv/legalize-extract-subvector.mir
+++ b/llvm/test/CodeGen/RISCV/GlobalISel/legalizer/rvv/legalize-extract-subvector.mir
@@ -20,9 +20,9 @@ body:             |
     ; RV32-NEXT: [[C2:%[0-9]+]]:_(s32) = G_CONSTANT i32 2
     ; RV32-NEXT: [[LSHR:%[0-9]+]]:_(s32) = G_LSHR [[READ_VLENB]], [[C2]](s32)
     ; RV32-NEXT: [[C3:%[0-9]+]]:_(s32) = G_CONSTANT i32 -1
-    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 2 x s1>) = G_VMSET_VL [[C3]](s32)
+    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 4 x s1>) = G_VMSET_VL [[C3]](s32)
     ; RV32-NEXT: [[DEF1:%[0-9]+]]:_(<vscale x 4 x s8>) = G_IMPLICIT_DEF
-    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 4 x s8>) = G_VSLIDEDOWN_VL [[DEF1]], [[SELECT]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 2 x s1>), [[C3]], 3
+    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 4 x s8>) = G_VSLIDEDOWN_VL [[DEF1]], [[SELECT]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 4 x s1>), [[C3]], 3
     ; RV32-NEXT: [[EXTRACT_SUBVECTOR:%[0-9]+]]:_(<vscale x 2 x s8>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 4 x s8>), 0
     ; RV32-NEXT: [[C4:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
     ; RV32-NEXT: [[SPLAT_VECTOR2:%[0-9]+]]:_(<vscale x 2 x s8>) = G_SPLAT_VECTOR [[C4]](s32)
@@ -41,9 +41,9 @@ body:             |
     ; RV64-NEXT: [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
     ; RV64-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[READ_VLENB]], [[C2]](s64)
     ; RV64-NEXT: [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 -1
-    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 2 x s1>) = G_VMSET_VL [[C3]](s64)
+    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 4 x s1>) = G_VMSET_VL [[C3]](s64)
     ; RV64-NEXT: [[DEF1:%[0-9]+]]:_(<vscale x 4 x s8>) = G_IMPLICIT_DEF
-    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 4 x s8>) = G_VSLIDEDOWN_VL [[DEF1]], [[SELECT]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 2 x s1>), [[C3]], 3
+    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 4 x s8>) = G_VSLIDEDOWN_VL [[DEF1]], [[SELECT]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 4 x s1>), [[C3]], 3
     ; RV64-NEXT: [[EXTRACT_SUBVECTOR:%[0-9]+]]:_(<vscale x 2 x s8>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 4 x s8>), 0
     ; RV64-NEXT: [[C4:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
     ; RV64-NEXT: [[SPLAT_VECTOR2:%[0-9]+]]:_(<vscale x 2 x s8>) = G_SPLAT_VECTOR [[C4]](s64)
@@ -72,9 +72,9 @@ body:             |
     ; RV32-NEXT: [[C2:%[0-9]+]]:_(s32) = G_CONSTANT i32 2
     ; RV32-NEXT: [[LSHR:%[0-9]+]]:_(s32) = G_LSHR [[READ_VLENB]], [[C2]](s32)
     ; RV32-NEXT: [[C3:%[0-9]+]]:_(s32) = G_CONSTANT i32 -1
-    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 2 x s1>) = G_VMSET_VL [[C3]](s32)
+    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 8 x s1>) = G_VMSET_VL [[C3]](s32)
     ; RV32-NEXT: [[DEF1:%[0-9]+]]:_(<vscale x 8 x s8>) = G_IMPLICIT_DEF
-    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 8 x s8>) = G_VSLIDEDOWN_VL [[DEF1]], [[SELECT]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 2 x s1>), [[C3]], 3
+    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 8 x s8>) = G_VSLIDEDOWN_VL [[DEF1]], [[SELECT]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 8 x s1>), [[C3]], 3
     ; RV32-NEXT: [[EXTRACT_SUBVECTOR:%[0-9]+]]:_(<vscale x 2 x s8>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 8 x s8>), 0
     ; RV32-NEXT: [[C4:%[0-9]+]]:_(s32) = G_CONSTANT i32 0
     ; RV32-NEXT: [[SPLAT_VECTOR2:%[0-9]+]]:_(<vscale x 2 x s8>) = G_SPLAT_VECTOR [[C4]](s32)
@@ -93,9 +93,9 @@ body:             |
     ; RV64-NEXT: [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
     ; RV64-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[READ_VLENB]], [[C2]](s64)
     ; RV64-NEXT: [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 -1
-    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 2 x s1>) = G_VMSET_VL [[C3]](s64)
+    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 8 x s1>) = G_VMSET_VL [[C3]](s64)
     ; RV64-NEXT: [[DEF1:%[0-9]+]]:_(<vscale x 8 x s8>) = G_IMPLICIT_DEF
-    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 8 x s8>) = G_VSLIDEDOWN_VL [[DEF1]], [[SELECT]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 2 x s1>), [[C3]], 3
+    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 8 x s8>) = G_VSLIDEDOWN_VL [[DEF1]], [[SELECT]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 8 x s1>), [[C3]], 3
     ; RV64-NEXT: [[EXTRACT_SUBVECTOR:%[0-9]+]]:_(<vscale x 2 x s8>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 8 x s8>), 0
     ; RV64-NEXT: [[C4:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
     ; RV64-NEXT: [[SPLAT_VECTOR2:%[0-9]+]]:_(<vscale x 2 x s8>) = G_SPLAT_VECTOR [[C4]](s64)
@@ -158,9 +158,9 @@ body:             |
     ; RV32-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
     ; RV32-NEXT: [[LSHR:%[0-9]+]]:_(s32) = G_LSHR [[READ_VLENB]], [[C]](s32)
     ; RV32-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 -1
-    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 4 x s1>) = G_VMSET_VL [[C1]](s32)
+    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 8 x s1>) = G_VMSET_VL [[C1]](s32)
     ; RV32-NEXT: [[DEF1:%[0-9]+]]:_(<vscale x 8 x s8>) = G_IMPLICIT_DEF
-    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 8 x s8>) = G_VSLIDEDOWN_VL [[DEF1]], [[BITCAST]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 4 x s1>), [[C1]], 3
+    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 8 x s8>) = G_VSLIDEDOWN_VL [[DEF1]], [[BITCAST]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 8 x s1>), [[C1]], 3
     ; RV32-NEXT: [[EXTRACT_SUBVECTOR:%[0-9]+]]:_(<vscale x 4 x s8>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 8 x s8>), 0
     ; RV32-NEXT: [[BITCAST1:%[0-9]+]]:_(<vscale x 32 x s1>) = G_BITCAST [[EXTRACT_SUBVECTOR]](<vscale x 4 x s8>)
     ; RV32-NEXT: $v8 = COPY [[BITCAST1]](<vscale x 32 x s1>)
@@ -173,9 +173,9 @@ body:             |
     ; RV64-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
     ; RV64-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[READ_VLENB]], [[C]](s64)
     ; RV64-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 -1
-    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 4 x s1>) = G_VMSET_VL [[C1]](s64)
+    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 8 x s1>) = G_VMSET_VL [[C1]](s64)
     ; RV64-NEXT: [[DEF1:%[0-9]+]]:_(<vscale x 8 x s8>) = G_IMPLICIT_DEF
-    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 8 x s8>) = G_VSLIDEDOWN_VL [[DEF1]], [[BITCAST]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 4 x s1>), [[C1]], 3
+    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 8 x s8>) = G_VSLIDEDOWN_VL [[DEF1]], [[BITCAST]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 8 x s1>), [[C1]], 3
     ; RV64-NEXT: [[EXTRACT_SUBVECTOR:%[0-9]+]]:_(<vscale x 4 x s8>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 8 x s8>), 0
     ; RV64-NEXT: [[BITCAST1:%[0-9]+]]:_(<vscale x 32 x s1>) = G_BITCAST [[EXTRACT_SUBVECTOR]](<vscale x 4 x s8>)
     ; RV64-NEXT: $v8 = COPY [[BITCAST1]](<vscale x 32 x s1>)
@@ -317,8 +317,8 @@ body:             |
     ; RV32-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 3
     ; RV32-NEXT: [[LSHR:%[0-9]+]]:_(s32) = G_LSHR [[READ_VLENB]], [[C]](s32)
     ; RV32-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 -1
-    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 1 x s1>) = G_VMSET_VL [[C1]](s32)
-    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 2 x s8>) = G_VSLIDEDOWN_VL [[DEF]], [[DEF]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 1 x s1>), [[C1]], 3
+    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 2 x s1>) = G_VMSET_VL [[C1]](s32)
+    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 2 x s8>) = G_VSLIDEDOWN_VL [[DEF]], [[DEF]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 2 x s1>), [[C1]], 3
     ; RV32-NEXT: [[EXTRACT_SUBVECTOR:%[0-9]+]]:_(<vscale x 1 x s8>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 2 x s8>), 0
     ; RV32-NEXT: $v8 = COPY [[EXTRACT_SUBVECTOR]](<vscale x 1 x s8>)
     ; RV32-NEXT: PseudoRET implicit $v8
@@ -329,8 +329,8 @@ body:             |
     ; RV64-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 3
     ; RV64-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[READ_VLENB]], [[C]](s64)
     ; RV64-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 -1
-    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 1 x s1>) = G_VMSET_VL [[C1]](s64)
-    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 2 x s8>) = G_VSLIDEDOWN_VL [[DEF]], [[DEF]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 1 x s1>), [[C1]], 3
+    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 2 x s1>) = G_VMSET_VL [[C1]](s64)
+    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 2 x s8>) = G_VSLIDEDOWN_VL [[DEF]], [[DEF]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 2 x s1>), [[C1]], 3
     ; RV64-NEXT: [[EXTRACT_SUBVECTOR:%[0-9]+]]:_(<vscale x 1 x s8>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 2 x s8>), 0
     ; RV64-NEXT: $v8 = COPY [[EXTRACT_SUBVECTOR]](<vscale x 1 x s8>)
     ; RV64-NEXT: PseudoRET implicit $v8
@@ -351,8 +351,8 @@ body:             |
     ; RV32-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 2
     ; RV32-NEXT: [[LSHR:%[0-9]+]]:_(s32) = G_LSHR [[READ_VLENB]], [[C]](s32)
     ; RV32-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 -1
-    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 1 x s1>) = G_VMSET_VL [[C1]](s32)
-    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 4 x s16>) = G_VSLIDEDOWN_VL [[DEF]], [[DEF]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 1 x s1>), [[C1]], 3
+    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 4 x s1>) = G_VMSET_VL [[C1]](s32)
+    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 4 x s16>) = G_VSLIDEDOWN_VL [[DEF]], [[DEF]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 4 x s1>), [[C1]], 3
     ; RV32-NEXT: [[EXTRACT_SUBVECTOR:%[0-9]+]]:_(<vscale x 1 x s16>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 4 x s16>), 0
     ; RV32-NEXT: $v8 = COPY [[EXTRACT_SUBVECTOR]](<vscale x 1 x s16>)
     ; RV32-NEXT: PseudoRET implicit $v8
@@ -363,8 +363,8 @@ body:             |
     ; RV64-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
     ; RV64-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[READ_VLENB]], [[C]](s64)
     ; RV64-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 -1
-    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 1 x s1>) = G_VMSET_VL [[C1]](s64)
-    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 4 x s16>) = G_VSLIDEDOWN_VL [[DEF]], [[DEF]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 1 x s1>), [[C1]], 3
+    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 4 x s1>) = G_VMSET_VL [[C1]](s64)
+    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 4 x s16>) = G_VSLIDEDOWN_VL [[DEF]], [[DEF]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 4 x s1>), [[C1]], 3
     ; RV64-NEXT: [[EXTRACT_SUBVECTOR:%[0-9]+]]:_(<vscale x 1 x s16>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 4 x s16>), 0
     ; RV64-NEXT: $v8 = COPY [[EXTRACT_SUBVECTOR]](<vscale x 1 x s16>)
     ; RV64-NEXT: PseudoRET implicit $v8
@@ -418,9 +418,9 @@ body:             |
     ; RV32-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 3
     ; RV32-NEXT: [[LSHR:%[0-9]+]]:_(s32) = G_LSHR [[READ_VLENB]], [[C]](s32)
     ; RV32-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 -1
-    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 1 x s1>) = G_VMSET_VL [[C1]](s32)
+    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 2 x s1>) = G_VMSET_VL [[C1]](s32)
     ; RV32-NEXT: [[DEF1:%[0-9]+]]:_(<vscale x 2 x s32>) = G_IMPLICIT_DEF
-    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 2 x s32>) = G_VSLIDEDOWN_VL [[DEF1]], [[EXTRACT_SUBVECTOR]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 1 x s1>), [[C1]], 3
+    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 2 x s32>) = G_VSLIDEDOWN_VL [[DEF1]], [[EXTRACT_SUBVECTOR]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 2 x s1>), [[C1]], 3
     ; RV32-NEXT: [[EXTRACT_SUBVECTOR1:%[0-9]+]]:_(<vscale x 1 x s32>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 2 x s32>), 0
     ; RV32-NEXT: $v8 = COPY [[EXTRACT_SUBVECTOR1]](<vscale x 1 x s32>)
     ; RV32-NEXT: PseudoRET implicit $v8
@@ -432,9 +432,9 @@ body:             |
     ; RV64-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 3
     ; RV64-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[READ_VLENB]], [[C]](s64)
     ; RV64-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 -1
-    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 1 x s1>) = G_VMSET_VL [[C1]](s64)
+    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 2 x s1>) = G_VMSET_VL [[C1]](s64)
     ; RV64-NEXT: [[DEF1:%[0-9]+]]:_(<vscale x 2 x s32>) = G_IMPLICIT_DEF
-    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 2 x s32>) = G_VSLIDEDOWN_VL [[DEF1]], [[EXTRACT_SUBVECTOR]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 1 x s1>), [[C1]], 3
+    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 2 x s32>) = G_VSLIDEDOWN_VL [[DEF1]], [[EXTRACT_SUBVECTOR]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 2 x s1>), [[C1]], 3
     ; RV64-NEXT: [[EXTRACT_SUBVECTOR1:%[0-9]+]]:_(<vscale x 1 x s32>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 2 x s32>), 0
     ; RV64-NEXT: $v8 = COPY [[EXTRACT_SUBVECTOR1]](<vscale x 1 x s32>)
     ; RV64-NEXT: PseudoRET implicit $v8
@@ -456,9 +456,9 @@ body:             |
     ; RV32-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 3
     ; RV32-NEXT: [[LSHR:%[0-9]+]]:_(s32) = G_LSHR [[READ_VLENB]], [[C]](s32)
     ; RV32-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 -1
-    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 1 x s1>) = G_VMSET_VL [[C1]](s32)
+    ; RV32-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 2 x s1>) = G_VMSET_VL [[C1]](s32)
     ; RV32-NEXT: [[DEF1:%[0-9]+]]:_(<vscale x 2 x s32>) = G_IMPLICIT_DEF
-    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 2 x s32>) = G_VSLIDEDOWN_VL [[DEF1]], [[EXTRACT_SUBVECTOR]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 1 x s1>), [[C1]], 3
+    ; RV32-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 2 x s32>) = G_VSLIDEDOWN_VL [[DEF1]], [[EXTRACT_SUBVECTOR]], [[LSHR]](s32), [[VMSET_VL]](<vscale x 2 x s1>), [[C1]], 3
     ; RV32-NEXT: [[EXTRACT_SUBVECTOR1:%[0-9]+]]:_(<vscale x 1 x s32>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 2 x s32>), 0
     ; RV32-NEXT: $v8 = COPY [[EXTRACT_SUBVECTOR1]](<vscale x 1 x s32>)
     ; RV32-NEXT: PseudoRET implicit $v8
@@ -470,9 +470,9 @@ body:             |
     ; RV64-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 3
     ; RV64-NEXT: [[LSHR:%[0-9]+]]:_(s64) = G_LSHR [[READ_VLENB]], [[C]](s64)
     ; RV64-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 -1
-    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 1 x s1>) = G_VMSET_VL [[C1]](s64)
+    ; RV64-NEXT: [[VMSET_VL:%[0-9]+]]:_(<vscale x 2 x s1>) = G_VMSET_VL [[C1]](s64)
     ; RV64-NEXT: [[DEF1:%[0-9]+]]:_(<vscale x 2 x s32>) = G_IMPLICIT_DEF
-    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 2 x s32>) = G_VSLIDEDOWN_VL [[DEF1]], [[EXTRACT_SUBVECTOR]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 1 x s1>), [[C1]], 3
+    ; RV64-NEXT: [[VSLIDEDOWN_VL:%[0-9]+]]:_(<vscale x 2 x s32>) = G_VSLIDEDOWN_VL [[DEF1]], [[EXTRACT_SUBVECTOR]], [[LSHR]](s64), [[VMSET_VL]](<vscale x 2 x s1>), [[C1]], 3
     ; RV64-NEXT: [[EXTRACT_SUBVECTOR1:%[0-9]+]]:_(<vscale x 1 x s32>) = G_EXTRACT_SUBVECTOR [[VSLIDEDOWN_VL]](<vscale x 2 x s32>), 0
     ; RV64-NEXT: $v8 = COPY [[EXTRACT_SUBVECTOR1]](<vscale x 1 x s32>)
     ; RV64-NEXT: PseudoRET implicit $v8
diff --git a/llvm/test/CodeGen/RISCV/O0-pipeline.ll b/llvm/test/CodeGen/RISCV/O0-pipeline.ll
index 8714b286374a5..42d30fcef2a9b 100644
--- a/llvm/test/CodeGen/RISCV/O0-pipeline.ll
+++ b/llvm/test/CodeGen/RISCV/O0-pipeline.ll
@@ -9,9 +9,11 @@
 
 ; CHECK-LABEL: Pass Arguments:
 ; CHECK-NEXT: Target Library Information
+; CHECK-NEXT: Runtime Library Function Analysis
 ; CHECK-NEXT: Target Pass Configuration
 ; CHECK-NEXT: Machine Module Information
 ; CHECK-NEXT: Target Transform Information
+; CHECK-NEXT: Library Function Lowering Analysis
 ; CHECK-NEXT: Create Garbage Collector Module Metadata
 ; CHECK-NEXT: Assumption Cache Tracker
 ; CHECK-NEXT: Profile summary info
diff --git a/llvm/test/CodeGen/RISCV/O3-pipeline.ll b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
index 3e2de780524b6..85027a56a1348 100644
--- a/llvm/test/CodeGen/RISCV/O3-pipeline.ll
+++ b/llvm/test/CodeGen/RISCV/O3-pipeline.ll
@@ -9,9 +9,11 @@
 
 ; CHECK-LABEL: Pass Arguments:
 ; CHECK-NEXT: Target Library Information
+; CHECK-NEXT: Runtime Library Function Analysis
 ; CHECK-NEXT: Target Pass Configuration
 ; CHECK-NEXT: Machine Module Information
 ; CHECK-NEXT: Target Transform Information
+; CHECK-NEXT: Library Function Lowering Analysis
 ; CHECK-NEXT: Assumption Cache Tracker
 ; CHECK-NEXT: Profile summary info
 ; CHECK-NEXT: Type-Based Alias Analysis
diff --git a/llvm/test/CodeGen/RISCV/cmov-branch-opt.ll b/llvm/test/CodeGen/RISCV/cmov-branch-opt.ll
index 1957019f055a2..a03a53215be38 100644
--- a/llvm/test/CodeGen/RISCV/cmov-branch-opt.ll
+++ b/llvm/test/CodeGen/RISCV/cmov-branch-opt.ll
@@ -5,13 +5,13 @@
 ; RUN:   | FileCheck -check-prefixes=CMOV,CMOV-NOZICOND %s
 ; RUN: llc -mtriple=riscv64 -mattr=+conditional-cmv-fusion,+c,+zicond -verify-machineinstrs < %s \
 ; RUN:   | FileCheck -check-prefixes=CMOV,CMOV-ZICOND %s
-; RUN: llc -mtriple=riscv64 -mattr=+short-forward-branch-opt -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv64 -mattr=+short-forward-branch-ialu -verify-machineinstrs < %s \
 ; RUN:   | FileCheck -check-prefixes=SHORT_FORWARD,SFB-NOZICOND,SFB-NOZICOND-NOC %s
-; RUN: llc -mtriple=riscv64 -mattr=+short-forward-branch-opt,+c -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv64 -mattr=+short-forward-branch-ialu,+c -verify-machineinstrs < %s \
 ; RUN:   | FileCheck -check-prefixes=SHORT_FORWARD,SFB-NOZICOND,SFB-NOZICOND-C %s
-; RUN: llc -mtriple=riscv64 -mattr=+short-forward-branch-opt,+zicond -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv64 -mattr=+short-forward-branch-ialu,+zicond -verify-machineinstrs < %s \
 ; RUN:   | FileCheck -check-prefixes=SHORT_FORWARD,SFB-ZICOND %s
-; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-opt,+conditional-cmv-fusion -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-ialu,+conditional-cmv-fusion -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCI
 
 ; The conditional move optimization in sifive-p450 requires that only a
diff --git a/llvm/test/CodeGen/RISCV/features-info.ll b/llvm/test/CodeGen/RISCV/features-info.ll
index 010d3c68b5ef1..b3fa871c859a0 100644
--- a/llvm/test/CodeGen/RISCV/features-info.ll
+++ b/llvm/test/CodeGen/RISCV/features-info.ll
@@ -137,9 +137,9 @@
 ; CHECK-NEXT:   shgatpa                          - 'Shgatpa' (SvNNx4 mode supported for all modes supported by satp, as well as Bare).
 ; CHECK-NEXT:   shifted-zextw-fusion             - Enable SLLI+SRLI to be fused when computing (shifted) word zero extension.
 ; CHECK-NEXT:   shlcofideleg                     - 'Shlcofideleg' (Delegating LCOFI Interrupts to VS-mode).
-; CHECK-NEXT:   short-forward-branch-i-minmax    - Enable short forward branch optimization for min,max instructions in Zbb.
-; CHECK-NEXT:   short-forward-branch-i-mul       - Enable short forward branch optimization for mul instruction.
-; CHECK-NEXT:   short-forward-branch-opt         - Enable short forward branch optimization.
+; CHECK-NEXT:   short-forward-branch-ialu        - Enable short forward branch optimization for RVI base instructions.
+; CHECK-NEXT:   short-forward-branch-iminmax     - Enable short forward branch optimization for MIN,MAX instructions in Zbb.
+; CHECK-NEXT:   short-forward-branch-imul        - Enable short forward branch optimization for MUL instruction.
 ; CHECK-NEXT:   shtvala                          - 'Shtvala' (htval provides all needed values).
 ; CHECK-NEXT:   shvsatpa                         - 'Shvsatpa' (vsatp supports all modes supported by satp).
 ; CHECK-NEXT:   shvstvala                        - 'Shvstvala' (vstval provides all needed values).
diff --git a/llvm/test/CodeGen/RISCV/min-max.ll b/llvm/test/CodeGen/RISCV/min-max.ll
index 71859431de923..316f626b4bc11 100644
--- a/llvm/test/CodeGen/RISCV/min-max.ll
+++ b/llvm/test/CodeGen/RISCV/min-max.ll
@@ -5,11 +5,11 @@
 ; RUN:   FileCheck %s --check-prefixes=ZBB,RV32ZBB
 ; RUN: llc < %s -mtriple=riscv64 -mattr=+zbb | \
 ; RUN:   FileCheck %s --check-prefixes=ZBB,RV64ZBB
-; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-opt,+conditional-cmv-fusion -verify-machineinstrs < %s | \
+; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-ialu,+conditional-cmv-fusion -verify-machineinstrs < %s | \
 ; RUN:   FileCheck %s --check-prefixes=XQCI
-; RUN: llc < %s -mtriple=riscv32 -mattr=+short-forward-branch-opt | \
+; RUN: llc < %s -mtriple=riscv32 -mattr=+short-forward-branch-ialu | \
 ; RUN:   FileCheck %s --check-prefixes=RV32I-SFB
-; RUN: llc < %s -mtriple=riscv64 -mattr=+short-forward-branch-opt | \
+; RUN: llc < %s -mtriple=riscv64 -mattr=+short-forward-branch-ialu | \
 ; RUN:   FileCheck %s --check-prefixes=RV64I-SFB
 
 ; Basic tests.
diff --git a/llvm/test/CodeGen/RISCV/rvp-ext-rv32.ll b/llvm/test/CodeGen/RISCV/rvp-ext-rv32.ll
index d4ea9e6c3def0..f803f6aa09652 100644
--- a/llvm/test/CodeGen/RISCV/rvp-ext-rv32.ll
+++ b/llvm/test/CodeGen/RISCV/rvp-ext-rv32.ll
@@ -484,6 +484,25 @@ define void @test_extract_vector_16(ptr %ret_ptr, ptr %a_ptr) {
   ret void
 }
 
+define void @test_extract_vector_16_elem1(ptr %ret_ptr, ptr %a_ptr) {
+; CHECK-RV32-LABEL: test_extract_vector_16_elem1:
+; CHECK-RV32:       # %bb.0:
+; CHECK-RV32-NEXT:    lhu a1, 2(a1)
+; CHECK-RV32-NEXT:    sh a1, 0(a0)
+; CHECK-RV32-NEXT:    ret
+;
+; CHECK-RV64-LABEL: test_extract_vector_16_elem1:
+; CHECK-RV64:       # %bb.0:
+; CHECK-RV64-NEXT:    lw a1, 0(a1)
+; CHECK-RV64-NEXT:    srli a1, a1, 16
+; CHECK-RV64-NEXT:    sh a1, 0(a0)
+; CHECK-RV64-NEXT:    ret
+  %a = load <2 x i16>, ptr %a_ptr
+  %extracted = extractelement <2 x i16> %a, i32 1
+  store i16 %extracted, ptr %ret_ptr
+  ret void
+}
+
 define void @test_extract_vector_8(ptr %ret_ptr, ptr %a_ptr) {
 ; CHECK-LABEL: test_extract_vector_8:
 ; CHECK:       # %bb.0:
@@ -496,6 +515,19 @@ define void @test_extract_vector_8(ptr %ret_ptr, ptr %a_ptr) {
   ret void
 }
 
+define void @test_extract_vector_8_elem1(ptr %ret_ptr, ptr %a_ptr) {
+; CHECK-LABEL: test_extract_vector_8_elem1:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    lw a1, 0(a1)
+; CHECK-NEXT:    srli a1, a1, 8
+; CHECK-NEXT:    sb a1, 0(a0)
+; CHECK-NEXT:    ret
+  %a = load <4 x i8>, ptr %a_ptr
+  %extracted = extractelement <4 x i8> %a, i32 1
+  store i8 %extracted, ptr %ret_ptr
+  ret void
+}
+
 ; Test for splat
 define void @test_non_const_splat_i8(ptr %ret_ptr, ptr %a_ptr, i8 %elt) {
 ; CHECK-LABEL: test_non_const_splat_i8:
diff --git a/llvm/test/CodeGen/RISCV/rvp-ext-rv64.ll b/llvm/test/CodeGen/RISCV/rvp-ext-rv64.ll
index b39b807d43154..9b021df8dd452 100644
--- a/llvm/test/CodeGen/RISCV/rvp-ext-rv64.ll
+++ b/llvm/test/CodeGen/RISCV/rvp-ext-rv64.ll
@@ -495,6 +495,18 @@ define void @test_extract_vector_32(ptr %ret_ptr, ptr %a_ptr) {
   ret void
 }
 
+define void @test_extract_vector_32_elem1(ptr %ret_ptr, ptr %a_ptr) {
+; CHECK-LABEL: test_extract_vector_32_elem1:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    lw a1, 4(a1)
+; CHECK-NEXT:    sw a1, 0(a0)
+; CHECK-NEXT:    ret
+  %a = load <2 x i32>, ptr %a_ptr
+  %extracted = extractelement <2 x i32> %a, i32 1
+  store i32 %extracted, ptr %ret_ptr
+  ret void
+}
+
 ; Test basic add/sub operations for v2i32 (RV64 only)
 define void @test_padd_w(ptr %ret_ptr, ptr %a_ptr, ptr %b_ptr) {
 ; CHECK-LABEL: test_padd_w:
diff --git a/llvm/test/CodeGen/RISCV/rvv/combine-reduce-add-to-vcpop.ll b/llvm/test/CodeGen/RISCV/rvv/combine-reduce-add-to-vcpop.ll
index 2d4fce68f9545..96252f070a580 100644
--- a/llvm/test/CodeGen/RISCV/rvv/combine-reduce-add-to-vcpop.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/combine-reduce-add-to-vcpop.ll
@@ -311,10 +311,10 @@ define i32 @test_nxv128i1(<vscale x 128 x i1> %x) {
 ; CHECK-NEXT:    vmerge.vim v16, v16, 1, v0
 ; CHECK-NEXT:    vsetvli a2, zero, e8, mf2, ta, ma
 ; CHECK-NEXT:    vslidedown.vx v0, v6, a0
+; CHECK-NEXT:    vsetvli a2, zero, e32, m8, ta, ma
+; CHECK-NEXT:    vmerge.vim v8, v8, 1, v0
 ; CHECK-NEXT:    vsetvli a2, zero, e8, m1, ta, ma
 ; CHECK-NEXT:    vslidedown.vx v6, v7, a1
-; CHECK-NEXT:    vsetvli a1, zero, e32, m8, ta, ma
-; CHECK-NEXT:    vmerge.vim v8, v8, 1, v0
 ; CHECK-NEXT:    vsetvli a1, zero, e8, mf2, ta, ma
 ; CHECK-NEXT:    vslidedown.vx v0, v7, a0
 ; CHECK-NEXT:    vslidedown.vx v5, v6, a0
diff --git a/llvm/test/CodeGen/RISCV/rvv/copyprop.mir b/llvm/test/CodeGen/RISCV/rvv/copyprop.mir
index 31e79e58f44c5..aba75ffe29d33 100644
--- a/llvm/test/CodeGen/RISCV/rvv/copyprop.mir
+++ b/llvm/test/CodeGen/RISCV/rvv/copyprop.mir
@@ -43,7 +43,7 @@ body:             |
     %2:gpr = COPY $x11
     %1:gpr = COPY $x10
     %3:vr = COPY $v8
-    %17:vr = PseudoVSLL_VI_M1 undef $noreg, %3, 5, 1, 6 /* e64 */, 0
+    %17:vrnov0 = PseudoVSLL_VI_M1 undef $noreg, %3, 5, 1, 6 /* e64 */, 0
     %22:vr = PseudoVMSNE_VI_M1 %3, 0, 1, 6 /* e64 */
     %23:vmv0 = COPY %22
     %25:vrnov0 = PseudoVMERGE_VIM_M1 undef $noreg, %17, -1, %23, 1, 6 /* e64 */
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll
index 5567310bb2a61..9b35860904f11 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll
@@ -530,290 +530,267 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    addi sp, sp, -16
 ; RV32-NEXT:    .cfi_def_cfa_offset 16
 ; RV32-NEXT:    csrr a2, vlenb
-; RV32-NEXT:    li a3, 100
+; RV32-NEXT:    li a3, 84
 ; RV32-NEXT:    mul a2, a2, a3
 ; RV32-NEXT:    sub sp, sp, a2
-; RV32-NEXT:    .cfi_escape 0x0f, 0x0e, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0xe4, 0x00, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 100 * vlenb
+; RV32-NEXT:    .cfi_escape 0x0f, 0x0e, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0xd4, 0x00, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 84 * vlenb
 ; RV32-NEXT:    addi a4, a1, 128
 ; RV32-NEXT:    addi a5, a1, 256
 ; RV32-NEXT:    li a2, 32
 ; RV32-NEXT:    lui a3, 12
-; RV32-NEXT:    lui a6, 12291
-; RV32-NEXT:    lui a7, %hi(.LCPI27_0)
-; RV32-NEXT:    addi a7, a7, %lo(.LCPI27_0)
 ; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
 ; RV32-NEXT:    vle32.v v24, (a5)
-; RV32-NEXT:    vmv.s.x v0, a3
+; RV32-NEXT:    lui a5, 12291
+; RV32-NEXT:    vmv.s.x v3, a3
 ; RV32-NEXT:    vle32.v v8, (a1)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 6
+; RV32-NEXT:    li a6, 76
+; RV32-NEXT:    mul a1, a1, a6
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs8r.v v8, (a1) # vscale x 64-byte Folded Spill
-; RV32-NEXT:    addi a6, a6, 3
 ; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
 ; RV32-NEXT:    vslideup.vi v16, v24, 4
 ; RV32-NEXT:    vsetivli zero, 16, e32, m8, ta, ma
 ; RV32-NEXT:    vslidedown.vi v8, v24, 16
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a5, 76
-; RV32-NEXT:    mul a1, a1, a5
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs8r.v v24, (a1) # vscale x 64-byte Folded Spill
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a5, 92
-; RV32-NEXT:    mul a1, a1, a5
+; RV32-NEXT:    li a6, 60
+; RV32-NEXT:    mul a1, a1, a6
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs8r.v v8, (a1) # vscale x 64-byte Folded Spill
-; RV32-NEXT:    vmv1r.v v30, v0
+; RV32-NEXT:    vmv1r.v v0, v3
 ; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, mu
 ; RV32-NEXT:    vslideup.vi v16, v8, 10, v0.t
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a5, 72
-; RV32-NEXT:    mul a1, a1, a5
+; RV32-NEXT:    li a6, 56
+; RV32-NEXT:    mul a1, a1, a6
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs4r.v v16, (a1) # vscale x 32-byte Folded Spill
 ; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
 ; RV32-NEXT:    vle32.v v8, (a4)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a4, 84
+; RV32-NEXT:    li a4, 68
 ; RV32-NEXT:    mul a1, a1, a4
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs8r.v v8, (a1) # vscale x 64-byte Folded Spill
+; RV32-NEXT:    addi a5, a5, 3
+; RV32-NEXT:    vmv.s.x v0, a5
 ; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; RV32-NEXT:    vle16.v v28, (a7)
-; RV32-NEXT:    vmv.s.x v0, a6
+; RV32-NEXT:    vslideup.vi v28, v24, 2
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 6
+; RV32-NEXT:    li a4, 76
+; RV32-NEXT:    mul a1, a1, a4
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a4, 84
+; RV32-NEXT:    li a4, 68
 ; RV32-NEXT:    mul a1, a1, a4
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v8, (a1) # vscale x 64-byte Folded Reload
 ; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
-; RV32-NEXT:    vmerge.vvm v16, v8, v16, v0
-; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v0, v16, v28
+; RV32-NEXT:    vmerge.vvm v8, v8, v16, v0
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a4, 52
+; RV32-NEXT:    li a4, 44
 ; RV32-NEXT:    mul a1, a1, a4
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs8r.v v0, (a1) # vscale x 64-byte Folded Spill
-; RV32-NEXT:    vsetvli zero, zero, e32, m4, ta, mu
-; RV32-NEXT:    vslideup.vi v8, v24, 2
-; RV32-NEXT:    vmv1r.v v0, v30
+; RV32-NEXT:    vs8r.v v8, (a1) # vscale x 64-byte Folded Spill
+; RV32-NEXT:    vmv1r.v v0, v3
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a4, 92
+; RV32-NEXT:    li a4, 60
 ; RV32-NEXT:    mul a1, a1, a4
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vslideup.vi v8, v16, 8, v0.t
+; RV32-NEXT:    vl8r.v v8, (a1) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, mu
+; RV32-NEXT:    vslideup.vi v28, v8, 8, v0.t
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a4, 60
+; RV32-NEXT:    li a4, 52
 ; RV32-NEXT:    mul a1, a1, a4
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs4r.v v8, (a1) # vscale x 32-byte Folded Spill
-; RV32-NEXT:    lui a7, 49164
-; RV32-NEXT:    lui a1, %hi(.LCPI27_1)
-; RV32-NEXT:    addi a1, a1, %lo(.LCPI27_1)
-; RV32-NEXT:    lui t2, 3
-; RV32-NEXT:    lui t1, 196656
-; RV32-NEXT:    lui a4, %hi(.LCPI27_3)
-; RV32-NEXT:    addi a4, a4, %lo(.LCPI27_3)
-; RV32-NEXT:    lui t0, 786624
-; RV32-NEXT:    li a5, 48
-; RV32-NEXT:    lui a6, 768
-; RV32-NEXT:    addi a7, a7, 12
-; RV32-NEXT:    vmv.s.x v0, a7
-; RV32-NEXT:    addi t2, t2, 3
-; RV32-NEXT:    csrr a7, vlenb
-; RV32-NEXT:    li t3, 84
-; RV32-NEXT:    mul a7, a7, t3
-; RV32-NEXT:    add a7, sp, a7
-; RV32-NEXT:    addi a7, a7, 16
-; RV32-NEXT:    vl8r.v v16, (a7) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    csrr a7, vlenb
-; RV32-NEXT:    slli a7, a7, 6
-; RV32-NEXT:    add a7, sp, a7
-; RV32-NEXT:    addi a7, a7, 16
-; RV32-NEXT:    vl8r.v v8, (a7) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
-; RV32-NEXT:    vmerge.vvm v8, v16, v8, v0
-; RV32-NEXT:    csrr a7, vlenb
-; RV32-NEXT:    slli a7, a7, 5
-; RV32-NEXT:    add a7, sp, a7
-; RV32-NEXT:    addi a7, a7, 16
-; RV32-NEXT:    vs8r.v v8, (a7) # vscale x 64-byte Folded Spill
-; RV32-NEXT:    vmv.s.x v0, t2
-; RV32-NEXT:    addi a7, t1, 48
-; RV32-NEXT:    csrr t1, vlenb
-; RV32-NEXT:    li t2, 92
-; RV32-NEXT:    mul t1, t1, t2
-; RV32-NEXT:    add t1, sp, t1
-; RV32-NEXT:    addi t1, t1, 16
-; RV32-NEXT:    vl8r.v v24, (t1) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    csrr t1, vlenb
+; RV32-NEXT:    vs4r.v v28, (a1) # vscale x 32-byte Folded Spill
+; RV32-NEXT:    lui a1, %hi(.LCPI27_0)
+; RV32-NEXT:    addi a1, a1, %lo(.LCPI27_0)
+; RV32-NEXT:    lui a6, 49164
+; RV32-NEXT:    lui t1, 3
+; RV32-NEXT:    lui t0, 196656
+; RV32-NEXT:    lui a7, 786624
+; RV32-NEXT:    li a4, 48
+; RV32-NEXT:    lui a5, 768
+; RV32-NEXT:    addi a6, a6, 12
+; RV32-NEXT:    vmv.s.x v0, a6
+; RV32-NEXT:    addi t1, t1, 3
+; RV32-NEXT:    csrr a6, vlenb
 ; RV32-NEXT:    li t2, 76
-; RV32-NEXT:    mul t1, t1, t2
-; RV32-NEXT:    add t1, sp, t1
-; RV32-NEXT:    addi t1, t1, 16
-; RV32-NEXT:    vl8r.v v8, (t1) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    mul a6, a6, t2
+; RV32-NEXT:    add a6, sp, a6
+; RV32-NEXT:    addi a6, a6, 16
+; RV32-NEXT:    vl8r.v v16, (a6) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    csrr a6, vlenb
+; RV32-NEXT:    li t2, 68
+; RV32-NEXT:    mul a6, a6, t2
+; RV32-NEXT:    add a6, sp, a6
+; RV32-NEXT:    addi a6, a6, 16
+; RV32-NEXT:    vl8r.v v8, (a6) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
+; RV32-NEXT:    vmerge.vvm v8, v8, v16, v0
+; RV32-NEXT:    csrr a6, vlenb
+; RV32-NEXT:    li t2, 28
+; RV32-NEXT:    mul a6, a6, t2
+; RV32-NEXT:    add a6, sp, a6
+; RV32-NEXT:    addi a6, a6, 16
+; RV32-NEXT:    vs8r.v v8, (a6) # vscale x 64-byte Folded Spill
+; RV32-NEXT:    vmv.s.x v0, t1
+; RV32-NEXT:    addi a6, t0, 48
+; RV32-NEXT:    csrr t0, vlenb
+; RV32-NEXT:    li t1, 60
+; RV32-NEXT:    mul t0, t0, t1
+; RV32-NEXT:    add t0, sp, t0
+; RV32-NEXT:    addi t0, t0, 16
+; RV32-NEXT:    vl8r.v v8, (t0) # vscale x 64-byte Folded Reload
 ; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; RV32-NEXT:    vmerge.vvm v8, v24, v8, v0
-; RV32-NEXT:    csrr t1, vlenb
-; RV32-NEXT:    li t2, 44
-; RV32-NEXT:    mul t1, t1, t2
-; RV32-NEXT:    add t1, sp, t1
-; RV32-NEXT:    addi t1, t1, 16
-; RV32-NEXT:    vs4r.v v8, (t1) # vscale x 32-byte Folded Spill
-; RV32-NEXT:    vmv.s.x v0, a7
+; RV32-NEXT:    vmerge.vvm v8, v8, v24, v0
+; RV32-NEXT:    csrr t0, vlenb
+; RV32-NEXT:    li t1, 36
+; RV32-NEXT:    mul t0, t0, t1
+; RV32-NEXT:    add t0, sp, t0
+; RV32-NEXT:    addi t0, t0, 16
+; RV32-NEXT:    vs4r.v v8, (t0) # vscale x 32-byte Folded Spill
+; RV32-NEXT:    vmv.s.x v0, a6
 ; RV32-NEXT:    addi a3, a3, 12
-; RV32-NEXT:    csrr a7, vlenb
-; RV32-NEXT:    slli a7, a7, 6
-; RV32-NEXT:    add a7, sp, a7
-; RV32-NEXT:    addi a7, a7, 16
-; RV32-NEXT:    vl8r.v v24, (a7) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
-; RV32-NEXT:    vmerge.vvm v8, v16, v24, v0
-; RV32-NEXT:    csrr a7, vlenb
-; RV32-NEXT:    slli a7, a7, 4
-; RV32-NEXT:    add a7, sp, a7
-; RV32-NEXT:    addi a7, a7, 16
-; RV32-NEXT:    vs8r.v v8, (a7) # vscale x 64-byte Folded Spill
-; RV32-NEXT:    vmv8r.v v16, v24
-; RV32-NEXT:    vmv.s.x v0, a3
-; RV32-NEXT:    addi a3, t0, 192
-; RV32-NEXT:    csrr a7, vlenb
-; RV32-NEXT:    li t0, 92
-; RV32-NEXT:    mul a7, a7, t0
-; RV32-NEXT:    add a7, sp, a7
-; RV32-NEXT:    addi a7, a7, 16
-; RV32-NEXT:    vl8r.v v24, (a7) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    csrr a7, vlenb
+; RV32-NEXT:    csrr a6, vlenb
 ; RV32-NEXT:    li t0, 76
-; RV32-NEXT:    mul a7, a7, t0
-; RV32-NEXT:    add a7, sp, a7
-; RV32-NEXT:    addi a7, a7, 16
-; RV32-NEXT:    vl8r.v v8, (a7) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; RV32-NEXT:    vmerge.vvm v8, v24, v8, v0
-; RV32-NEXT:    csrr a7, vlenb
-; RV32-NEXT:    li t0, 48
-; RV32-NEXT:    mul a7, a7, t0
-; RV32-NEXT:    add a7, sp, a7
-; RV32-NEXT:    addi a7, a7, 16
-; RV32-NEXT:    vs4r.v v8, (a7) # vscale x 32-byte Folded Spill
-; RV32-NEXT:    vmv.s.x v0, a3
-; RV32-NEXT:    li a3, 192
-; RV32-NEXT:    csrr a7, vlenb
-; RV32-NEXT:    li t0, 84
-; RV32-NEXT:    mul a7, a7, t0
-; RV32-NEXT:    add a7, sp, a7
-; RV32-NEXT:    addi a7, a7, 16
-; RV32-NEXT:    vl8r.v v8, (a7) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    mul a6, a6, t0
+; RV32-NEXT:    add a6, sp, a6
+; RV32-NEXT:    addi a6, a6, 16
+; RV32-NEXT:    vl8r.v v16, (a6) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    csrr a6, vlenb
+; RV32-NEXT:    li t0, 68
+; RV32-NEXT:    mul a6, a6, t0
+; RV32-NEXT:    add a6, sp, a6
+; RV32-NEXT:    addi a6, a6, 16
+; RV32-NEXT:    vl8r.v v8, (a6) # vscale x 64-byte Folded Reload
 ; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
 ; RV32-NEXT:    vmerge.vvm v8, v8, v16, v0
-; RV32-NEXT:    csrr a7, vlenb
-; RV32-NEXT:    li t0, 24
-; RV32-NEXT:    mul a7, a7, t0
-; RV32-NEXT:    add a7, sp, a7
-; RV32-NEXT:    addi a7, a7, 16
-; RV32-NEXT:    vs8r.v v8, (a7) # vscale x 64-byte Folded Spill
-; RV32-NEXT:    vmv.s.x v0, a5
-; RV32-NEXT:    addi a5, a6, 768
 ; RV32-NEXT:    csrr a6, vlenb
-; RV32-NEXT:    li a7, 92
-; RV32-NEXT:    mul a6, a6, a7
+; RV32-NEXT:    li t0, 20
+; RV32-NEXT:    mul a6, a6, t0
 ; RV32-NEXT:    add a6, sp, a6
 ; RV32-NEXT:    addi a6, a6, 16
-; RV32-NEXT:    vl8r.v v24, (a6) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vs8r.v v8, (a6) # vscale x 64-byte Folded Spill
+; RV32-NEXT:    vmv.s.x v0, a3
+; RV32-NEXT:    addi a3, a7, 192
 ; RV32-NEXT:    csrr a6, vlenb
-; RV32-NEXT:    li a7, 76
+; RV32-NEXT:    li a7, 60
 ; RV32-NEXT:    mul a6, a6, a7
 ; RV32-NEXT:    add a6, sp, a6
 ; RV32-NEXT:    addi a6, a6, 16
 ; RV32-NEXT:    vl8r.v v8, (a6) # vscale x 64-byte Folded Reload
 ; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; RV32-NEXT:    vmerge.vvm v8, v24, v8, v0
+; RV32-NEXT:    vmerge.vvm v8, v8, v24, v0
 ; RV32-NEXT:    csrr a6, vlenb
 ; RV32-NEXT:    li a7, 40
 ; RV32-NEXT:    mul a6, a6, a7
 ; RV32-NEXT:    add a6, sp, a6
 ; RV32-NEXT:    addi a6, a6, 16
 ; RV32-NEXT:    vs4r.v v8, (a6) # vscale x 32-byte Folded Spill
-; RV32-NEXT:    vmv.s.x v0, a5
-; RV32-NEXT:    vle16.v v6, (a1)
-; RV32-NEXT:    vle16.v v2, (a4)
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a4, 84
-; RV32-NEXT:    mul a1, a1, a4
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v8, (a1) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vmv.s.x v0, a3
+; RV32-NEXT:    li a3, 192
+; RV32-NEXT:    csrr a6, vlenb
+; RV32-NEXT:    li a7, 76
+; RV32-NEXT:    mul a6, a6, a7
+; RV32-NEXT:    add a6, sp, a6
+; RV32-NEXT:    addi a6, a6, 16
+; RV32-NEXT:    vl8r.v v16, (a6) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    csrr a6, vlenb
+; RV32-NEXT:    li a7, 68
+; RV32-NEXT:    mul a6, a6, a7
+; RV32-NEXT:    add a6, sp, a6
+; RV32-NEXT:    addi a6, a6, 16
+; RV32-NEXT:    vl8r.v v8, (a6) # vscale x 64-byte Folded Reload
 ; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
 ; RV32-NEXT:    vmerge.vvm v8, v8, v16, v0
+; RV32-NEXT:    csrr a6, vlenb
+; RV32-NEXT:    slli a6, a6, 3
+; RV32-NEXT:    add a6, sp, a6
+; RV32-NEXT:    addi a6, a6, 16
+; RV32-NEXT:    vs8r.v v8, (a6) # vscale x 64-byte Folded Spill
+; RV32-NEXT:    vmv.s.x v0, a4
+; RV32-NEXT:    addi a4, a5, 768
+; RV32-NEXT:    csrr a5, vlenb
+; RV32-NEXT:    li a6, 60
+; RV32-NEXT:    mul a5, a5, a6
+; RV32-NEXT:    add a5, sp, a5
+; RV32-NEXT:    addi a5, a5, 16
+; RV32-NEXT:    vl8r.v v8, (a5) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
+; RV32-NEXT:    vmerge.vvm v8, v8, v24, v0
+; RV32-NEXT:    csrr a5, vlenb
+; RV32-NEXT:    slli a5, a5, 4
+; RV32-NEXT:    add a5, sp, a5
+; RV32-NEXT:    addi a5, a5, 16
+; RV32-NEXT:    vs4r.v v8, (a5) # vscale x 32-byte Folded Spill
+; RV32-NEXT:    vmv.s.x v0, a4
+; RV32-NEXT:    vle16.v v2, (a1)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 3
+; RV32-NEXT:    li a4, 76
+; RV32-NEXT:    mul a1, a1, a4
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs8r.v v8, (a1) # vscale x 64-byte Folded Spill
-; RV32-NEXT:    vmv.s.x v0, a3
+; RV32-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 5
+; RV32-NEXT:    li a4, 68
+; RV32-NEXT:    mul a1, a1, a4
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v8, (a1) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v24, v8, v6
+; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
+; RV32-NEXT:    vmerge.vvm v8, v8, v16, v0
 ; RV32-NEXT:    addi a1, sp, 16
-; RV32-NEXT:    vs8r.v v24, (a1) # vscale x 64-byte Folded Spill
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 92
-; RV32-NEXT:    mul a1, a1, a3
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v24, (a1) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vs8r.v v8, (a1) # vscale x 64-byte Folded Spill
+; RV32-NEXT:    vmv.s.x v0, a3
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 76
+; RV32-NEXT:    li a3, 60
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v8, (a1) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vsetvli zero, zero, e32, m4, ta, ma
-; RV32-NEXT:    vmerge.vvm v8, v24, v8, v0
+; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
+; RV32-NEXT:    vmerge.vvm v8, v8, v24, v0
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 92
+; RV32-NEXT:    li a3, 60
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vs4r.v v8, (a1) # vscale x 32-byte Folded Spill
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 4
+; RV32-NEXT:    li a3, 44
+; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v8, (a1) # vscale x 64-byte Folded Reload
 ; RV32-NEXT:    vsetvli zero, zero, e64, m8, ta, ma
 ; RV32-NEXT:    vrgatherei16.vv v24, v8, v2
-; RV32-NEXT:    lui a1, %hi(.LCPI27_2)
-; RV32-NEXT:    addi a1, a1, %lo(.LCPI27_2)
+; RV32-NEXT:    lui a1, %hi(.LCPI27_1)
+; RV32-NEXT:    addi a1, a1, %lo(.LCPI27_1)
 ; RV32-NEXT:    lui a3, 3073
 ; RV32-NEXT:    addi a3, a3, -1024
 ; RV32-NEXT:    vmv.s.x v0, a3
-; RV32-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
-; RV32-NEXT:    vle16.v v3, (a1)
+; RV32-NEXT:    vle16.v v30, (a1)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a3, 84
+; RV32-NEXT:    li a3, 76
+; RV32-NEXT:    mul a1, a1, a3
+; RV32-NEXT:    add a1, sp, a1
+; RV32-NEXT:    addi a1, a1, 16
+; RV32-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    csrr a1, vlenb
+; RV32-NEXT:    li a3, 68
 ; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
@@ -821,179 +798,191 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV32-NEXT:    vsetvli zero, a2, e32, m8, ta, ma
 ; RV32-NEXT:    vmerge.vvm v8, v8, v16, v0
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 72
+; RV32-NEXT:    li a2, 76
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl4r.v v28, (a1) # vscale x 32-byte Folded Reload
+; RV32-NEXT:    vs8r.v v8, (a1) # vscale x 64-byte Folded Spill
+; RV32-NEXT:    lui a1, %hi(.LCPI27_3)
+; RV32-NEXT:    addi a1, a1, %lo(.LCPI27_3)
+; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
+; RV32-NEXT:    vle16.v v28, (a1)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 52
+; RV32-NEXT:    li a2, 28
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vsetivli zero, 12, e32, m4, tu, ma
-; RV32-NEXT:    vmv.v.v v28, v16
+; RV32-NEXT:    vl8r.v v8, (a1) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vrgatherei16.vv v16, v8, v30
+; RV32-NEXT:    lui a1, %hi(.LCPI27_2)
+; RV32-NEXT:    addi a1, a1, %lo(.LCPI27_2)
+; RV32-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
+; RV32-NEXT:    vle16.v v20, (a1)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 72
+; RV32-NEXT:    li a2, 20
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs4r.v v28, (a1) # vscale x 32-byte Folded Spill
-; RV32-NEXT:    addi a1, sp, 16
-; RV32-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vl8r.v v0, (a1) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
+; RV32-NEXT:    vrgatherei16.vv v8, v0, v28
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 60
+; RV32-NEXT:    li a2, 56
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl4r.v v20, (a1) # vscale x 32-byte Folded Reload
-; RV32-NEXT:    vmv.v.v v20, v16
+; RV32-NEXT:    vl4r.v v12, (a1) # vscale x 32-byte Folded Reload
+; RV32-NEXT:    vsetivli zero, 12, e32, m4, tu, ma
+; RV32-NEXT:    vmv.v.v v12, v24
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 60
+; RV32-NEXT:    li a2, 56
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs4r.v v20, (a1) # vscale x 32-byte Folded Spill
+; RV32-NEXT:    vs4r.v v12, (a1) # vscale x 32-byte Folded Spill
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 44
+; RV32-NEXT:    li a2, 52
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl4r.v v16, (a1) # vscale x 32-byte Folded Reload
-; RV32-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v20, v16, v3
-; RV32-NEXT:    vsetivli zero, 10, e32, m4, tu, ma
-; RV32-NEXT:    vmv.v.v v20, v24
+; RV32-NEXT:    vl4r.v v12, (a1) # vscale x 32-byte Folded Reload
+; RV32-NEXT:    vmv.v.v v12, v16
+; RV32-NEXT:    csrr a1, vlenb
+; RV32-NEXT:    li a2, 52
+; RV32-NEXT:    mul a1, a1, a2
+; RV32-NEXT:    add a1, sp, a1
+; RV32-NEXT:    addi a1, a1, 16
+; RV32-NEXT:    vs4r.v v12, (a1) # vscale x 32-byte Folded Spill
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 6
+; RV32-NEXT:    li a2, 36
+; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs4r.v v20, (a1) # vscale x 32-byte Folded Spill
+; RV32-NEXT:    vl4r.v v12, (a1) # vscale x 32-byte Folded Reload
+; RV32-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
+; RV32-NEXT:    vrgatherei16.vv v24, v12, v20
+; RV32-NEXT:    vsetivli zero, 10, e32, m4, tu, ma
+; RV32-NEXT:    vmv.v.v v24, v8
 ; RV32-NEXT:    lui a1, %hi(.LCPI27_4)
 ; RV32-NEXT:    addi a1, a1, %lo(.LCPI27_4)
 ; RV32-NEXT:    lui a2, %hi(.LCPI27_5)
 ; RV32-NEXT:    addi a2, a2, %lo(.LCPI27_5)
 ; RV32-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; RV32-NEXT:    vle16.v v24, (a2)
+; RV32-NEXT:    vle16.v v28, (a2)
 ; RV32-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
-; RV32-NEXT:    vle16.v v16, (a1)
-; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 84
-; RV32-NEXT:    mul a1, a1, a2
-; RV32-NEXT:    add a1, sp, a1
-; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs1r.v v16, (a1) # vscale x 8-byte Folded Spill
+; RV32-NEXT:    vle16.v v1, (a1)
 ; RV32-NEXT:    lui a1, %hi(.LCPI27_7)
 ; RV32-NEXT:    addi a1, a1, %lo(.LCPI27_7)
 ; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV32-NEXT:    vle16.v v16, (a1)
+; RV32-NEXT:    vle16.v v2, (a1)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 76
-; RV32-NEXT:    mul a1, a1, a2
+; RV32-NEXT:    slli a1, a1, 3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vs2r.v v16, (a1) # vscale x 16-byte Folded Spill
+; RV32-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vrgatherei16.vv v8, v16, v28
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 24
+; RV32-NEXT:    li a2, 40
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl8r.v v0, (a1) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vrgatherei16.vv v16, v0, v24
+; RV32-NEXT:    vl4r.v v12, (a1) # vscale x 32-byte Folded Reload
+; RV32-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
+; RV32-NEXT:    vrgatherei16.vv v28, v12, v1
+; RV32-NEXT:    vsetivli zero, 10, e32, m4, tu, ma
+; RV32-NEXT:    vmv.v.v v28, v8
+; RV32-NEXT:    addi a1, sp, 16
+; RV32-NEXT:    vl8r.v v8, (a1) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
+; RV32-NEXT:    vrgatherei16.vv v16, v8, v2
+; RV32-NEXT:    lui a1, %hi(.LCPI27_6)
+; RV32-NEXT:    addi a1, a1, %lo(.LCPI27_6)
+; RV32-NEXT:    lui a2, %hi(.LCPI27_8)
+; RV32-NEXT:    addi a2, a2, %lo(.LCPI27_8)
+; RV32-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
+; RV32-NEXT:    vle16.v v8, (a1)
+; RV32-NEXT:    lui a1, %hi(.LCPI27_9)
+; RV32-NEXT:    addi a1, a1, %lo(.LCPI27_9)
+; RV32-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
+; RV32-NEXT:    vle16.v v10, (a1)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 48
-; RV32-NEXT:    mul a1, a1, a2
+; RV32-NEXT:    li a3, 44
+; RV32-NEXT:    mul a1, a1, a3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl4r.v v20, (a1) # vscale x 32-byte Folded Reload
+; RV32-NEXT:    vs2r.v v10, (a1) # vscale x 16-byte Folded Spill
+; RV32-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
+; RV32-NEXT:    vle16.v v9, (a2)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 84
+; RV32-NEXT:    li a2, 68
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl1r.v v7, (a1) # vscale x 8-byte Folded Reload
-; RV32-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v24, v20, v7
+; RV32-NEXT:    vs1r.v v9, (a1) # vscale x 8-byte Folded Spill
+; RV32-NEXT:    csrr a1, vlenb
+; RV32-NEXT:    slli a1, a1, 4
+; RV32-NEXT:    add a1, sp, a1
+; RV32-NEXT:    addi a1, a1, 16
+; RV32-NEXT:    vl4r.v v12, (a1) # vscale x 32-byte Folded Reload
+; RV32-NEXT:    vrgatherei16.vv v20, v12, v8
 ; RV32-NEXT:    vsetivli zero, 10, e32, m4, tu, ma
-; RV32-NEXT:    vmv.v.v v24, v16
+; RV32-NEXT:    vmv.v.v v20, v16
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 3
+; RV32-NEXT:    li a2, 76
+; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v0, (a1) # vscale x 64-byte Folded Reload
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 76
+; RV32-NEXT:    li a2, 44
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl2r.v v28, (a1) # vscale x 16-byte Folded Reload
+; RV32-NEXT:    vl2r.v v16, (a1) # vscale x 16-byte Folded Reload
 ; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v16, v0, v28
-; RV32-NEXT:    lui a1, %hi(.LCPI27_6)
-; RV32-NEXT:    addi a1, a1, %lo(.LCPI27_6)
-; RV32-NEXT:    lui a2, %hi(.LCPI27_8)
-; RV32-NEXT:    addi a2, a2, %lo(.LCPI27_8)
-; RV32-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
-; RV32-NEXT:    vle16.v v4, (a1)
-; RV32-NEXT:    lui a1, %hi(.LCPI27_9)
-; RV32-NEXT:    addi a1, a1, %lo(.LCPI27_9)
-; RV32-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; RV32-NEXT:    vle16.v v6, (a1)
-; RV32-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
-; RV32-NEXT:    vle16.v v5, (a2)
+; RV32-NEXT:    vrgatherei16.vv v8, v0, v16
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 40
+; RV32-NEXT:    li a2, 60
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl4r.v v20, (a1) # vscale x 32-byte Folded Reload
-; RV32-NEXT:    vrgatherei16.vv v0, v20, v4
-; RV32-NEXT:    vsetivli zero, 10, e32, m4, tu, ma
-; RV32-NEXT:    vmv.v.v v0, v16
-; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v16, v8, v6
+; RV32-NEXT:    vl4r.v v16, (a1) # vscale x 32-byte Folded Reload
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 92
+; RV32-NEXT:    li a2, 68
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
-; RV32-NEXT:    vl4r.v v12, (a1) # vscale x 32-byte Folded Reload
+; RV32-NEXT:    vl1r.v v7, (a1) # vscale x 8-byte Folded Reload
 ; RV32-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
-; RV32-NEXT:    vrgatherei16.vv v8, v12, v5
+; RV32-NEXT:    vrgatherei16.vv v12, v16, v7
 ; RV32-NEXT:    vsetivli zero, 10, e32, m4, tu, ma
-; RV32-NEXT:    vmv.v.v v8, v16
+; RV32-NEXT:    vmv.v.v v12, v8
 ; RV32-NEXT:    addi a1, a0, 320
 ; RV32-NEXT:    vsetivli zero, 16, e32, m4, ta, ma
-; RV32-NEXT:    vse32.v v8, (a1)
+; RV32-NEXT:    vse32.v v12, (a1)
 ; RV32-NEXT:    addi a1, a0, 256
-; RV32-NEXT:    vse32.v v0, (a1)
+; RV32-NEXT:    vse32.v v20, (a1)
 ; RV32-NEXT:    addi a1, a0, 192
-; RV32-NEXT:    vse32.v v24, (a1)
+; RV32-NEXT:    vse32.v v28, (a1)
 ; RV32-NEXT:    addi a1, a0, 128
-; RV32-NEXT:    csrr a2, vlenb
-; RV32-NEXT:    slli a2, a2, 6
-; RV32-NEXT:    add a2, sp, a2
-; RV32-NEXT:    addi a2, a2, 16
-; RV32-NEXT:    vl4r.v v8, (a2) # vscale x 32-byte Folded Reload
-; RV32-NEXT:    vse32.v v8, (a1)
+; RV32-NEXT:    vse32.v v24, (a1)
 ; RV32-NEXT:    addi a1, a0, 64
 ; RV32-NEXT:    csrr a2, vlenb
-; RV32-NEXT:    li a3, 60
+; RV32-NEXT:    li a3, 52
 ; RV32-NEXT:    mul a2, a2, a3
 ; RV32-NEXT:    add a2, sp, a2
 ; RV32-NEXT:    addi a2, a2, 16
 ; RV32-NEXT:    vl4r.v v8, (a2) # vscale x 32-byte Folded Reload
 ; RV32-NEXT:    vse32.v v8, (a1)
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    li a2, 72
+; RV32-NEXT:    li a2, 56
 ; RV32-NEXT:    mul a1, a1, a2
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl4r.v v8, (a1) # vscale x 32-byte Folded Reload
 ; RV32-NEXT:    vse32.v v8, (a0)
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    li a1, 100
+; RV32-NEXT:    li a1, 84
 ; RV32-NEXT:    mul a0, a0, a1
 ; RV32-NEXT:    add sp, sp, a0
 ; RV32-NEXT:    .cfi_def_cfa sp, 16
@@ -1013,60 +1002,48 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
 ; RV64-NEXT:    vle64.v v8, (a1)
 ; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    li a3, 53
+; RV64-NEXT:    li a3, 85
 ; RV64-NEXT:    mul a2, a2, a3
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
 ; RV64-NEXT:    vs8r.v v8, (a2) # vscale x 64-byte Folded Spill
 ; RV64-NEXT:    addi a2, a1, 128
-; RV64-NEXT:    addi a3, a1, 256
-; RV64-NEXT:    li a4, 128
+; RV64-NEXT:    addi a1, a1, 256
+; RV64-NEXT:    vle64.v v8, (a1)
+; RV64-NEXT:    li a3, 128
 ; RV64-NEXT:    lui a1, 1
-; RV64-NEXT:    vle64.v v8, (a3)
-; RV64-NEXT:    lui a3, %hi(.LCPI27_0)
-; RV64-NEXT:    addi a3, a3, %lo(.LCPI27_0)
-; RV64-NEXT:    vmv.s.x v0, a4
-; RV64-NEXT:    csrr a4, vlenb
-; RV64-NEXT:    li a5, 61
-; RV64-NEXT:    mul a4, a4, a5
-; RV64-NEXT:    add a4, sp, a4
-; RV64-NEXT:    addi a4, a4, 16
-; RV64-NEXT:    vs1r.v v0, (a4) # vscale x 8-byte Folded Spill
-; RV64-NEXT:    addi a4, a1, 65
+; RV64-NEXT:    vmv.s.x v3, a3
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
 ; RV64-NEXT:    vslideup.vi v24, v8, 2
 ; RV64-NEXT:    vsetivli zero, 8, e64, m8, ta, ma
 ; RV64-NEXT:    vslidedown.vi v16, v8, 8
-; RV64-NEXT:    csrr a5, vlenb
-; RV64-NEXT:    li a6, 77
-; RV64-NEXT:    mul a5, a5, a6
-; RV64-NEXT:    add a5, sp, a5
-; RV64-NEXT:    addi a5, a5, 16
-; RV64-NEXT:    vs8r.v v16, (a5) # vscale x 64-byte Folded Spill
-; RV64-NEXT:    csrr a5, vlenb
-; RV64-NEXT:    li a6, 77
-; RV64-NEXT:    mul a5, a5, a6
-; RV64-NEXT:    add a5, sp, a5
-; RV64-NEXT:    addi a5, a5, 16
-; RV64-NEXT:    vl8r.v v16, (a5) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    csrr a3, vlenb
+; RV64-NEXT:    li a4, 45
+; RV64-NEXT:    mul a3, a3, a4
+; RV64-NEXT:    add a3, sp, a3
+; RV64-NEXT:    addi a3, a3, 16
+; RV64-NEXT:    vs8r.v v16, (a3) # vscale x 64-byte Folded Spill
+; RV64-NEXT:    vmv1r.v v0, v3
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, mu
 ; RV64-NEXT:    vslideup.vi v24, v16, 5, v0.t
-; RV64-NEXT:    csrr a5, vlenb
-; RV64-NEXT:    li a6, 73
-; RV64-NEXT:    mul a5, a5, a6
-; RV64-NEXT:    add a5, sp, a5
-; RV64-NEXT:    addi a5, a5, 16
-; RV64-NEXT:    vs4r.v v24, (a5) # vscale x 32-byte Folded Spill
+; RV64-NEXT:    csrr a3, vlenb
+; RV64-NEXT:    li a4, 73
+; RV64-NEXT:    mul a3, a3, a4
+; RV64-NEXT:    add a3, sp, a3
+; RV64-NEXT:    addi a3, a3, 16
+; RV64-NEXT:    vs4r.v v24, (a3) # vscale x 32-byte Folded Spill
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT:    vle64.v v24, (a2)
+; RV64-NEXT:    vle64.v v16, (a2)
 ; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    li a5, 85
-; RV64-NEXT:    mul a2, a2, a5
+; RV64-NEXT:    li a3, 77
+; RV64-NEXT:    mul a2, a2, a3
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
-; RV64-NEXT:    vs8r.v v24, (a2) # vscale x 64-byte Folded Spill
-; RV64-NEXT:    vle16.v v12, (a3)
-; RV64-NEXT:    vmv.s.x v0, a4
+; RV64-NEXT:    vs8r.v v16, (a2) # vscale x 64-byte Folded Spill
+; RV64-NEXT:    addi a2, a1, 65
+; RV64-NEXT:    vmv.s.x v0, a2
+; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
+; RV64-NEXT:    vslideup.vi v12, v8, 1
 ; RV64-NEXT:    csrr a2, vlenb
 ; RV64-NEXT:    li a3, 85
 ; RV64-NEXT:    mul a2, a2, a3
@@ -1074,35 +1051,28 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    addi a2, a2, 16
 ; RV64-NEXT:    vl8r.v v24, (a2) # vscale x 64-byte Folded Reload
 ; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    li a3, 53
+; RV64-NEXT:    li a3, 77
 ; RV64-NEXT:    mul a2, a2, a3
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
 ; RV64-NEXT:    vl8r.v v16, (a2) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vmerge.vvm v24, v24, v16, v0
-; RV64-NEXT:    vrgatherei16.vv v0, v24, v12
+; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
+; RV64-NEXT:    vmerge.vvm v16, v16, v24, v0
 ; RV64-NEXT:    csrr a2, vlenb
 ; RV64-NEXT:    li a3, 37
 ; RV64-NEXT:    mul a2, a2, a3
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
-; RV64-NEXT:    vs8r.v v0, (a2) # vscale x 64-byte Folded Spill
-; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, mu
-; RV64-NEXT:    vslideup.vi v12, v8, 1
-; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    li a3, 61
-; RV64-NEXT:    mul a2, a2, a3
-; RV64-NEXT:    add a2, sp, a2
-; RV64-NEXT:    addi a2, a2, 16
-; RV64-NEXT:    vl1r.v v7, (a2) # vscale x 8-byte Folded Reload
-; RV64-NEXT:    vmv1r.v v0, v7
+; RV64-NEXT:    vs8r.v v16, (a2) # vscale x 64-byte Folded Spill
+; RV64-NEXT:    vmv1r.v v0, v3
 ; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    li a3, 77
+; RV64-NEXT:    li a3, 45
 ; RV64-NEXT:    mul a2, a2, a3
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
-; RV64-NEXT:    vl8r.v v24, (a2) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vslideup.vi v12, v24, 4, v0.t
+; RV64-NEXT:    vl8r.v v16, (a2) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, mu
+; RV64-NEXT:    vslideup.vi v12, v16, 4, v0.t
 ; RV64-NEXT:    csrr a2, vlenb
 ; RV64-NEXT:    li a3, 69
 ; RV64-NEXT:    mul a2, a2, a3
@@ -1115,17 +1085,23 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    addi a2, a2, 130
 ; RV64-NEXT:    vmv.s.x v0, a2
 ; RV64-NEXT:    addi a2, a3, 260
-; RV64-NEXT:    vmv8r.v v24, v16
 ; RV64-NEXT:    csrr a3, vlenb
 ; RV64-NEXT:    li a5, 85
 ; RV64-NEXT:    mul a3, a3, a5
 ; RV64-NEXT:    add a3, sp, a3
 ; RV64-NEXT:    addi a3, a3, 16
+; RV64-NEXT:    vl8r.v v24, (a3) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    csrr a3, vlenb
+; RV64-NEXT:    li a5, 77
+; RV64-NEXT:    mul a3, a3, a5
+; RV64-NEXT:    add a3, sp, a3
+; RV64-NEXT:    addi a3, a3, 16
 ; RV64-NEXT:    vl8r.v v16, (a3) # vscale x 64-byte Folded Reload
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
 ; RV64-NEXT:    vmerge.vvm v16, v16, v24, v0
 ; RV64-NEXT:    csrr a3, vlenb
-; RV64-NEXT:    slli a3, a3, 3
+; RV64-NEXT:    li a5, 21
+; RV64-NEXT:    mul a3, a3, a5
 ; RV64-NEXT:    add a3, sp, a3
 ; RV64-NEXT:    addi a3, a3, 16
 ; RV64-NEXT:    vs8r.v v16, (a3) # vscale x 64-byte Folded Spill
@@ -1137,6 +1113,12 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    mul a2, a2, a3
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
+; RV64-NEXT:    vl8r.v v24, (a2) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    csrr a2, vlenb
+; RV64-NEXT:    li a3, 77
+; RV64-NEXT:    mul a2, a2, a3
+; RV64-NEXT:    add a2, sp, a2
+; RV64-NEXT:    addi a2, a2, 16
 ; RV64-NEXT:    vl8r.v v16, (a2) # vscale x 64-byte Folded Reload
 ; RV64-NEXT:    vmerge.vvm v16, v16, v24, v0
 ; RV64-NEXT:    csrr a2, vlenb
@@ -1147,21 +1129,21 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    vs8r.v v16, (a2) # vscale x 64-byte Folded Spill
 ; RV64-NEXT:    vmv1r.v v0, v2
 ; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    li a3, 45
+; RV64-NEXT:    li a3, 53
 ; RV64-NEXT:    mul a2, a2, a3
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
 ; RV64-NEXT:    vs8r.v v8, (a2) # vscale x 64-byte Folded Spill
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, mu
 ; RV64-NEXT:    vslideup.vi v12, v8, 5, v0.t
-; RV64-NEXT:    vmv1r.v v0, v7
+; RV64-NEXT:    vmv1r.v v0, v3
 ; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    li a3, 77
+; RV64-NEXT:    li a3, 45
 ; RV64-NEXT:    mul a2, a2, a3
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
-; RV64-NEXT:    vl8r.v v24, (a2) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vrgather.vi v12, v24, 4, v0.t
+; RV64-NEXT:    vl8r.v v16, (a2) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vrgather.vi v12, v16, 4, v0.t
 ; RV64-NEXT:    csrr a2, vlenb
 ; RV64-NEXT:    slli a3, a2, 6
 ; RV64-NEXT:    add a2, a3, a2
@@ -1171,84 +1153,65 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    vslidedown.vi v12, v8, 1
 ; RV64-NEXT:    vmv1r.v v0, v2
 ; RV64-NEXT:    vslideup.vi v12, v8, 4, v0.t
-; RV64-NEXT:    vmv1r.v v0, v7
-; RV64-NEXT:    vrgather.vi v12, v24, 5, v0.t
+; RV64-NEXT:    vmv1r.v v0, v3
+; RV64-NEXT:    vmv4r.v v8, v16
+; RV64-NEXT:    vrgather.vi v12, v16, 5, v0.t
 ; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    li a3, 25
-; RV64-NEXT:    mul a2, a2, a3
+; RV64-NEXT:    slli a3, a2, 4
+; RV64-NEXT:    add a2, a3, a2
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
 ; RV64-NEXT:    vs4r.v v12, (a2) # vscale x 32-byte Folded Spill
 ; RV64-NEXT:    lui a2, 8
 ; RV64-NEXT:    addi a2, a2, 520
 ; RV64-NEXT:    vmv.s.x v0, a2
-; RV64-NEXT:    vslideup.vi v12, v24, 6
+; RV64-NEXT:    vslideup.vi v4, v16, 6
 ; RV64-NEXT:    csrr a2, vlenb
 ; RV64-NEXT:    li a3, 85
 ; RV64-NEXT:    mul a2, a2, a3
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
-; RV64-NEXT:    vl8r.v v16, (a2) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vl8r.v v24, (a2) # vscale x 64-byte Folded Reload
 ; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    li a3, 53
+; RV64-NEXT:    li a3, 77
 ; RV64-NEXT:    mul a2, a2, a3
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
-; RV64-NEXT:    vl8r.v v24, (a2) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vl8r.v v16, (a2) # vscale x 64-byte Folded Reload
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
 ; RV64-NEXT:    vmerge.vvm v16, v16, v24, v0
 ; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    slli a3, a2, 4
+; RV64-NEXT:    slli a3, a2, 3
 ; RV64-NEXT:    add a2, a3, a2
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
 ; RV64-NEXT:    vs8r.v v16, (a2) # vscale x 64-byte Folded Spill
-; RV64-NEXT:    vmv1r.v v0, v7
-; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    li a3, 77
-; RV64-NEXT:    mul a2, a2, a3
-; RV64-NEXT:    add a2, sp, a2
-; RV64-NEXT:    addi a2, a2, 16
-; RV64-NEXT:    vl8r.v v16, (a2) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vmv1r.v v0, v3
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, mu
-; RV64-NEXT:    vslideup.vi v12, v16, 1, v0.t
-; RV64-NEXT:    lui a2, %hi(.LCPI27_1)
-; RV64-NEXT:    addi a2, a2, %lo(.LCPI27_1)
-; RV64-NEXT:    li a3, 192
-; RV64-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; RV64-NEXT:    vle16.v v6, (a2)
-; RV64-NEXT:    vmv.s.x v0, a3
+; RV64-NEXT:    vslideup.vi v4, v8, 1, v0.t
+; RV64-NEXT:    li a2, 192
+; RV64-NEXT:    vmv.s.x v0, a2
 ; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    slli a2, a2, 4
+; RV64-NEXT:    slli a2, a2, 3
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
 ; RV64-NEXT:    vs1r.v v0, (a2) # vscale x 8-byte Folded Spill
 ; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    li a3, 45
+; RV64-NEXT:    li a3, 53
 ; RV64-NEXT:    mul a2, a2, a3
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
 ; RV64-NEXT:    vl8r.v v16, (a2) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
-; RV64-NEXT:    vrgather.vi v28, v16, 2
-; RV64-NEXT:    vmerge.vvm v16, v28, v12, v0
+; RV64-NEXT:    vrgather.vi v12, v16, 2
+; RV64-NEXT:    vmerge.vvm v12, v12, v4, v0
 ; RV64-NEXT:    csrr a2, vlenb
 ; RV64-NEXT:    li a3, 61
 ; RV64-NEXT:    mul a2, a2, a3
 ; RV64-NEXT:    add a2, sp, a2
 ; RV64-NEXT:    addi a2, a2, 16
-; RV64-NEXT:    vs4r.v v16, (a2) # vscale x 32-byte Folded Spill
-; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    slli a2, a2, 3
-; RV64-NEXT:    add a2, sp, a2
-; RV64-NEXT:    addi a2, a2, 16
-; RV64-NEXT:    vl8r.v v16, (a2) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT:    vrgatherei16.vv v24, v16, v6
-; RV64-NEXT:    addi a2, sp, 16
-; RV64-NEXT:    vs8r.v v24, (a2) # vscale x 64-byte Folded Spill
-; RV64-NEXT:    lui a2, %hi(.LCPI27_2)
-; RV64-NEXT:    addi a2, a2, %lo(.LCPI27_2)
+; RV64-NEXT:    vs4r.v v12, (a2) # vscale x 32-byte Folded Spill
+; RV64-NEXT:    lui a2, %hi(.LCPI27_0)
+; RV64-NEXT:    addi a2, a2, %lo(.LCPI27_0)
 ; RV64-NEXT:    li a3, 1040
 ; RV64-NEXT:    vmv.s.x v0, a3
 ; RV64-NEXT:    addi a1, a1, -2016
@@ -1259,41 +1222,77 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    addi a3, a3, 16
 ; RV64-NEXT:    vl8r.v v24, (a3) # vscale x 64-byte Folded Reload
 ; RV64-NEXT:    csrr a3, vlenb
-; RV64-NEXT:    li a4, 53
+; RV64-NEXT:    li a4, 77
 ; RV64-NEXT:    mul a3, a3, a4
 ; RV64-NEXT:    add a3, sp, a3
 ; RV64-NEXT:    addi a3, a3, 16
 ; RV64-NEXT:    vl8r.v v16, (a3) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vmerge.vvm v8, v24, v16, v0
-; RV64-NEXT:    csrr a3, vlenb
-; RV64-NEXT:    slli a3, a3, 3
-; RV64-NEXT:    add a3, sp, a3
-; RV64-NEXT:    addi a3, a3, 16
-; RV64-NEXT:    vs8r.v v8, (a3) # vscale x 64-byte Folded Spill
-; RV64-NEXT:    vmv.s.x v0, a1
-; RV64-NEXT:    vle16.v v6, (a2)
-; RV64-NEXT:    li a1, 64
-; RV64-NEXT:    vmerge.vvm v8, v24, v16, v0
-; RV64-NEXT:    csrr a2, vlenb
-; RV64-NEXT:    li a3, 85
-; RV64-NEXT:    mul a2, a2, a3
-; RV64-NEXT:    add a2, sp, a2
-; RV64-NEXT:    addi a2, a2, 16
-; RV64-NEXT:    vs8r.v v8, (a2) # vscale x 64-byte Folded Spill
+; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
+; RV64-NEXT:    vmerge.vvm v16, v16, v24, v0
+; RV64-NEXT:    addi a3, sp, 16
+; RV64-NEXT:    vs8r.v v16, (a3) # vscale x 64-byte Folded Spill
 ; RV64-NEXT:    vmv.s.x v0, a1
+; RV64-NEXT:    vle16.v v12, (a2)
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a2, 29
+; RV64-NEXT:    li a2, 85
+; RV64-NEXT:    mul a1, a1, a2
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vl8r.v v24, (a1) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    li a2, 77
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vrgatherei16.vv v24, v16, v6
+; RV64-NEXT:    vmerge.vvm v16, v16, v24, v0
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    li a2, 85
+; RV64-NEXT:    mul a1, a1, a2
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vs8r.v v16, (a1) # vscale x 64-byte Folded Spill
+; RV64-NEXT:    lui a1, %hi(.LCPI27_1)
+; RV64-NEXT:    addi a1, a1, %lo(.LCPI27_1)
+; RV64-NEXT:    vle16.v v24, (a1)
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    li a2, 37
+; RV64-NEXT:    mul a1, a1, a2
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vrgatherei16.vv v0, v16, v12
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    li a2, 77
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v8, (a1) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vs8r.v v0, (a1) # vscale x 64-byte Folded Spill
+; RV64-NEXT:    lui a1, %hi(.LCPI27_2)
+; RV64-NEXT:    addi a1, a1, %lo(.LCPI27_2)
+; RV64-NEXT:    vle16.v v12, (a1)
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    li a2, 21
+; RV64-NEXT:    mul a1, a1, a2
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vrgatherei16.vv v0, v16, v24
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    li a2, 37
+; RV64-NEXT:    mul a1, a1, a2
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vs8r.v v0, (a1) # vscale x 64-byte Folded Spill
+; RV64-NEXT:    li a1, 64
+; RV64-NEXT:    vmv.s.x v0, a1
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    li a2, 29
+; RV64-NEXT:    mul a1, a1, a2
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
+; RV64-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vrgatherei16.vv v24, v16, v12
 ; RV64-NEXT:    vmv4r.v v28, v8
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, mu
 ; RV64-NEXT:    vslideup.vi v28, v8, 5, v0.t
@@ -1304,13 +1303,13 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vl4r.v v8, (a1) # vscale x 32-byte Folded Reload
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a2, 37
+; RV64-NEXT:    li a2, 77
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v0, (a1) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
 ; RV64-NEXT:    vsetivli zero, 6, e64, m4, tu, ma
-; RV64-NEXT:    vmv.v.v v8, v0
+; RV64-NEXT:    vmv.v.v v8, v16
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    li a2, 73
 ; RV64-NEXT:    mul a1, a1, a2
@@ -1323,7 +1322,11 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vl4r.v v8, (a1) # vscale x 32-byte Folded Reload
-; RV64-NEXT:    addi a1, sp, 16
+; RV64-NEXT:    csrr a1, vlenb
+; RV64-NEXT:    li a2, 37
+; RV64-NEXT:    mul a1, a1, a2
+; RV64-NEXT:    add a1, sp, a1
+; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
 ; RV64-NEXT:    vmv.v.v v8, v16
 ; RV64-NEXT:    csrr a1, vlenb
@@ -1335,62 +1338,59 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    lui a1, %hi(.LCPI27_3)
 ; RV64-NEXT:    addi a1, a1, %lo(.LCPI27_3)
 ; RV64-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; RV64-NEXT:    vle16.v v20, (a1)
+; RV64-NEXT:    vle16.v v8, (a1)
 ; RV64-NEXT:    lui a1, %hi(.LCPI27_4)
 ; RV64-NEXT:    addi a1, a1, %lo(.LCPI27_4)
-; RV64-NEXT:    vle16.v v8, (a1)
+; RV64-NEXT:    vle16.v v10, (a1)
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    li a2, 77
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs2r.v v8, (a1) # vscale x 16-byte Folded Spill
+; RV64-NEXT:    vs2r.v v10, (a1) # vscale x 16-byte Folded Spill
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    slli a2, a1, 6
 ; RV64-NEXT:    add a1, a2, a1
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl4r.v v8, (a1) # vscale x 32-byte Folded Reload
+; RV64-NEXT:    vl4r.v v12, (a1) # vscale x 32-byte Folded Reload
 ; RV64-NEXT:    vsetivli zero, 5, e64, m4, tu, ma
-; RV64-NEXT:    vmv.v.v v8, v24
+; RV64-NEXT:    vmv.v.v v12, v24
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    slli a2, a1, 6
 ; RV64-NEXT:    add a1, a2, a1
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vs4r.v v8, (a1) # vscale x 32-byte Folded Spill
+; RV64-NEXT:    vs4r.v v12, (a1) # vscale x 32-byte Folded Spill
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a2, a1, 4
+; RV64-NEXT:    slli a2, a1, 3
 ; RV64-NEXT:    add a1, a2, a1
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v8, (a1) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT:    vrgatherei16.vv v0, v8, v20
+; RV64-NEXT:    vrgatherei16.vv v0, v16, v8
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a2, 25
-; RV64-NEXT:    mul a1, a1, a2
+; RV64-NEXT:    slli a2, a1, 4
+; RV64-NEXT:    add a1, a2, a1
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl4r.v v12, (a1) # vscale x 32-byte Folded Reload
+; RV64-NEXT:    vl4r.v v20, (a1) # vscale x 32-byte Folded Reload
 ; RV64-NEXT:    vsetivli zero, 5, e64, m4, tu, ma
-; RV64-NEXT:    vmv.v.v v12, v0
-; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 3
-; RV64-NEXT:    add a1, sp, a1
-; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl8r.v v16, (a1) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vmv.v.v v20, v0
+; RV64-NEXT:    addi a1, sp, 16
+; RV64-NEXT:    vl8r.v v8, (a1) # vscale x 64-byte Folded Reload
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    li a2, 77
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
-; RV64-NEXT:    vl2r.v v8, (a1) # vscale x 16-byte Folded Reload
+; RV64-NEXT:    vl2r.v v16, (a1) # vscale x 16-byte Folded Reload
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT:    vrgatherei16.vv v0, v16, v8
+; RV64-NEXT:    vrgatherei16.vv v0, v8, v16
 ; RV64-NEXT:    lui a1, %hi(.LCPI27_5)
 ; RV64-NEXT:    addi a1, a1, %lo(.LCPI27_5)
-; RV64-NEXT:    vle16.v v20, (a1)
+; RV64-NEXT:    vle16.v v12, (a1)
 ; RV64-NEXT:    csrr a1, vlenb
 ; RV64-NEXT:    li a2, 61
 ; RV64-NEXT:    mul a1, a1, a2
@@ -1406,7 +1406,7 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vs4r.v v8, (a1) # vscale x 32-byte Folded Spill
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    li a2, 45
+; RV64-NEXT:    li a2, 53
 ; RV64-NEXT:    mul a1, a1, a2
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
@@ -1414,7 +1414,7 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    vsetivli zero, 8, e64, m4, ta, ma
 ; RV64-NEXT:    vrgather.vi v8, v0, 3
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 4
+; RV64-NEXT:    slli a1, a1, 3
 ; RV64-NEXT:    add a1, sp, a1
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vl1r.v v0, (a1) # vscale x 8-byte Folded Reload
@@ -1426,7 +1426,7 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    addi a1, a1, 16
 ; RV64-NEXT:    vl8r.v v0, (a1) # vscale x 64-byte Folded Reload
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT:    vrgatherei16.vv v24, v0, v20
+; RV64-NEXT:    vrgatherei16.vv v24, v0, v12
 ; RV64-NEXT:    vsetivli zero, 5, e64, m4, tu, ma
 ; RV64-NEXT:    vmv.v.v v8, v24
 ; RV64-NEXT:    addi a1, a0, 320
@@ -1441,7 +1441,7 @@ define {<8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>, <8 x i64>} @load_
 ; RV64-NEXT:    vl4r.v v8, (a2) # vscale x 32-byte Folded Reload
 ; RV64-NEXT:    vse64.v v8, (a1)
 ; RV64-NEXT:    addi a1, a0, 192
-; RV64-NEXT:    vse64.v v12, (a1)
+; RV64-NEXT:    vse64.v v20, (a1)
 ; RV64-NEXT:    addi a1, a0, 128
 ; RV64-NEXT:    csrr a2, vlenb
 ; RV64-NEXT:    slli a3, a2, 6
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll
index ffbf1c7a548e1..1ccc52be36215 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll
@@ -1874,57 +1874,77 @@ define float @vreduce_fminimum_v128f32(ptr %x) {
 ; CHECK-NEXT:    addi sp, sp, -16
 ; CHECK-NEXT:    .cfi_def_cfa_offset 16
 ; CHECK-NEXT:    csrr a1, vlenb
-; CHECK-NEXT:    slli a1, a1, 4
+; CHECK-NEXT:    slli a1, a1, 3
+; CHECK-NEXT:    mv a2, a1
+; CHECK-NEXT:    slli a1, a1, 1
+; CHECK-NEXT:    add a1, a1, a2
 ; CHECK-NEXT:    sub sp, sp, a1
-; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
+; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x18, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 24 * vlenb
 ; CHECK-NEXT:    li a1, 32
 ; CHECK-NEXT:    addi a2, a0, 128
 ; CHECK-NEXT:    vsetvli zero, a1, e32, m8, ta, ma
 ; CHECK-NEXT:    vle32.v v24, (a2)
 ; CHECK-NEXT:    addi a1, a0, 384
-; CHECK-NEXT:    vle32.v v16, (a1)
+; CHECK-NEXT:    vle32.v v8, (a1)
 ; CHECK-NEXT:    addi a1, a0, 256
-; CHECK-NEXT:    vle32.v v8, (a0)
+; CHECK-NEXT:    vle32.v v16, (a1)
+; CHECK-NEXT:    csrr a1, vlenb
+; CHECK-NEXT:    slli a1, a1, 4
+; CHECK-NEXT:    add a1, sp, a1
+; CHECK-NEXT:    addi a1, a1, 16
+; CHECK-NEXT:    vs8r.v v16, (a1) # vscale x 64-byte Folded Spill
+; CHECK-NEXT:    vmfeq.vv v0, v24, v24
+; CHECK-NEXT:    vmfeq.vv v7, v8, v8
+; CHECK-NEXT:    vle32.v v16, (a0)
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; CHECK-NEXT:    vmerge.vvm v16, v24, v8, v0
 ; CHECK-NEXT:    csrr a0, vlenb
 ; CHECK-NEXT:    slli a0, a0, 3
 ; CHECK-NEXT:    add a0, sp, a0
 ; CHECK-NEXT:    addi a0, a0, 16
-; CHECK-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; CHECK-NEXT:    vmfeq.vv v0, v24, v24
-; CHECK-NEXT:    vmfeq.vv v7, v16, v16
-; CHECK-NEXT:    vmerge.vvm v8, v24, v16, v0
-; CHECK-NEXT:    addi a0, sp, 16
-; CHECK-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; CHECK-NEXT:    vle32.v v8, (a1)
+; CHECK-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
 ; CHECK-NEXT:    vmv1r.v v0, v7
-; CHECK-NEXT:    vmerge.vvm v16, v16, v24, v0
-; CHECK-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
-; CHECK-NEXT:    vfmin.vv v24, v16, v24
+; CHECK-NEXT:    vmerge.vvm v8, v8, v24, v0
 ; CHECK-NEXT:    csrr a0, vlenb
 ; CHECK-NEXT:    slli a0, a0, 3
 ; CHECK-NEXT:    add a0, sp, a0
 ; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vfmin.vv v24, v8, v24
+; CHECK-NEXT:    addi a0, sp, 16
 ; CHECK-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
 ; CHECK-NEXT:    vmfeq.vv v0, v16, v16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
 ; CHECK-NEXT:    vmfeq.vv v7, v8, v8
 ; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vmerge.vvm v8, v16, v8, v0
+; CHECK-NEXT:    csrr a0, vlenb
 ; CHECK-NEXT:    slli a0, a0, 3
 ; CHECK-NEXT:    add a0, sp, a0
 ; CHECK-NEXT:    addi a0, a0, 16
-; CHECK-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; CHECK-NEXT:    vmerge.vvm v16, v16, v8, v0
-; CHECK-NEXT:    addi a0, sp, 16
-; CHECK-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; CHECK-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
 ; CHECK-NEXT:    vmv1r.v v0, v7
 ; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vmerge.vvm v16, v8, v16, v0
+; CHECK-NEXT:    csrr a0, vlenb
 ; CHECK-NEXT:    slli a0, a0, 3
 ; CHECK-NEXT:    add a0, sp, a0
 ; CHECK-NEXT:    addi a0, a0, 16
-; CHECK-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; CHECK-NEXT:    vmerge.vvm v8, v8, v16, v0
-; CHECK-NEXT:    addi a0, sp, 16
-; CHECK-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; CHECK-NEXT:    vfmin.vv v16, v8, v16
+; CHECK-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vfmin.vv v16, v16, v8
 ; CHECK-NEXT:    vmfeq.vv v0, v16, v16
 ; CHECK-NEXT:    vmfeq.vv v7, v24, v24
 ; CHECK-NEXT:    vmerge.vvm v8, v16, v24, v0
@@ -1943,7 +1963,10 @@ define float @vreduce_fminimum_v128f32(ptr %x) {
 ; CHECK-NEXT:    vfmv.f.s fa0, v8
 ; CHECK-NEXT:  .LBB121_3:
 ; CHECK-NEXT:    csrr a0, vlenb
-; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    mv a1, a0
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    add a0, a0, a1
 ; CHECK-NEXT:    add sp, sp, a0
 ; CHECK-NEXT:    .cfi_def_cfa sp, 16
 ; CHECK-NEXT:    addi sp, sp, 16
@@ -2257,56 +2280,76 @@ define double @vreduce_fminimum_v64f64(ptr %x) {
 ; RV32-NEXT:    addi sp, sp, -16
 ; RV32-NEXT:    .cfi_def_cfa_offset 16
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 4
+; RV32-NEXT:    slli a1, a1, 3
+; RV32-NEXT:    mv a2, a1
+; RV32-NEXT:    slli a1, a1, 1
+; RV32-NEXT:    add a1, a1, a2
 ; RV32-NEXT:    sub sp, sp, a1
-; RV32-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
+; RV32-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x18, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 24 * vlenb
 ; RV32-NEXT:    addi a1, a0, 128
 ; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
 ; RV32-NEXT:    vle64.v v24, (a1)
 ; RV32-NEXT:    addi a1, a0, 384
-; RV32-NEXT:    vle64.v v16, (a1)
+; RV32-NEXT:    vle64.v v8, (a1)
 ; RV32-NEXT:    addi a1, a0, 256
-; RV32-NEXT:    vle64.v v8, (a0)
+; RV32-NEXT:    vle64.v v16, (a0)
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    slli a0, a0, 3
+; RV32-NEXT:    slli a0, a0, 4
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 16
-; RV32-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
+; RV32-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
 ; RV32-NEXT:    vmfeq.vv v0, v24, v24
-; RV32-NEXT:    vmfeq.vv v7, v16, v16
-; RV32-NEXT:    vmerge.vvm v8, v24, v16, v0
+; RV32-NEXT:    vmfeq.vv v7, v8, v8
+; RV32-NEXT:    vle64.v v16, (a1)
 ; RV32-NEXT:    addi a0, sp, 16
-; RV32-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; RV32-NEXT:    vle64.v v8, (a1)
+; RV32-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; RV32-NEXT:    vmerge.vvm v16, v24, v8, v0
+; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    slli a0, a0, 3
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 16
+; RV32-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
 ; RV32-NEXT:    vmv1r.v v0, v7
-; RV32-NEXT:    vmerge.vvm v16, v16, v24, v0
-; RV32-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vfmin.vv v24, v16, v24
+; RV32-NEXT:    vmerge.vvm v8, v8, v24, v0
 ; RV32-NEXT:    csrr a0, vlenb
 ; RV32-NEXT:    slli a0, a0, 3
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 16
+; RV32-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vfmin.vv v24, v8, v24
+; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    slli a0, a0, 4
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 16
+; RV32-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vmfeq.vv v0, v8, v8
+; RV32-NEXT:    addi a0, sp, 16
 ; RV32-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vmfeq.vv v0, v16, v16
-; RV32-NEXT:    vmfeq.vv v7, v8, v8
+; RV32-NEXT:    vmfeq.vv v7, v16, v16
+; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    slli a0, a0, 4
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 16
+; RV32-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vmerge.vvm v8, v8, v16, v0
 ; RV32-NEXT:    csrr a0, vlenb
 ; RV32-NEXT:    slli a0, a0, 3
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 16
-; RV32-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vmerge.vvm v16, v16, v8, v0
-; RV32-NEXT:    addi a0, sp, 16
-; RV32-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; RV32-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
 ; RV32-NEXT:    vmv1r.v v0, v7
 ; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    slli a0, a0, 4
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 16
+; RV32-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vmerge.vvm v16, v16, v8, v0
+; RV32-NEXT:    csrr a0, vlenb
 ; RV32-NEXT:    slli a0, a0, 3
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 16
-; RV32-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vmerge.vvm v8, v8, v16, v0
-; RV32-NEXT:    addi a0, sp, 16
-; RV32-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vfmin.vv v16, v8, v16
+; RV32-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vfmin.vv v16, v16, v8
 ; RV32-NEXT:    vmfeq.vv v0, v16, v16
 ; RV32-NEXT:    vmfeq.vv v7, v24, v24
 ; RV32-NEXT:    vmerge.vvm v8, v16, v24, v0
@@ -2325,7 +2368,10 @@ define double @vreduce_fminimum_v64f64(ptr %x) {
 ; RV32-NEXT:    vfmv.f.s fa0, v8
 ; RV32-NEXT:  .LBB133_3:
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    slli a0, a0, 4
+; RV32-NEXT:    slli a0, a0, 3
+; RV32-NEXT:    mv a1, a0
+; RV32-NEXT:    slli a0, a0, 1
+; RV32-NEXT:    add a0, a0, a1
 ; RV32-NEXT:    add sp, sp, a0
 ; RV32-NEXT:    .cfi_def_cfa sp, 16
 ; RV32-NEXT:    addi sp, sp, 16
@@ -2337,56 +2383,76 @@ define double @vreduce_fminimum_v64f64(ptr %x) {
 ; RV64-NEXT:    addi sp, sp, -16
 ; RV64-NEXT:    .cfi_def_cfa_offset 16
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 4
+; RV64-NEXT:    slli a1, a1, 3
+; RV64-NEXT:    mv a2, a1
+; RV64-NEXT:    slli a1, a1, 1
+; RV64-NEXT:    add a1, a1, a2
 ; RV64-NEXT:    sub sp, sp, a1
-; RV64-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
+; RV64-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x18, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 24 * vlenb
 ; RV64-NEXT:    addi a1, a0, 128
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
 ; RV64-NEXT:    vle64.v v24, (a1)
 ; RV64-NEXT:    addi a1, a0, 384
-; RV64-NEXT:    vle64.v v16, (a1)
+; RV64-NEXT:    vle64.v v8, (a1)
 ; RV64-NEXT:    addi a1, a0, 256
-; RV64-NEXT:    vle64.v v8, (a0)
+; RV64-NEXT:    vle64.v v16, (a0)
 ; RV64-NEXT:    csrr a0, vlenb
-; RV64-NEXT:    slli a0, a0, 3
+; RV64-NEXT:    slli a0, a0, 4
 ; RV64-NEXT:    add a0, sp, a0
 ; RV64-NEXT:    addi a0, a0, 16
-; RV64-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
+; RV64-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
 ; RV64-NEXT:    vmfeq.vv v0, v24, v24
-; RV64-NEXT:    vmfeq.vv v7, v16, v16
-; RV64-NEXT:    vmerge.vvm v8, v24, v16, v0
+; RV64-NEXT:    vmfeq.vv v7, v8, v8
+; RV64-NEXT:    vle64.v v16, (a1)
 ; RV64-NEXT:    addi a0, sp, 16
-; RV64-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; RV64-NEXT:    vle64.v v8, (a1)
+; RV64-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; RV64-NEXT:    vmerge.vvm v16, v24, v8, v0
+; RV64-NEXT:    csrr a0, vlenb
+; RV64-NEXT:    slli a0, a0, 3
+; RV64-NEXT:    add a0, sp, a0
+; RV64-NEXT:    addi a0, a0, 16
+; RV64-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
 ; RV64-NEXT:    vmv1r.v v0, v7
-; RV64-NEXT:    vmerge.vvm v16, v16, v24, v0
-; RV64-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vfmin.vv v24, v16, v24
+; RV64-NEXT:    vmerge.vvm v8, v8, v24, v0
 ; RV64-NEXT:    csrr a0, vlenb
 ; RV64-NEXT:    slli a0, a0, 3
 ; RV64-NEXT:    add a0, sp, a0
 ; RV64-NEXT:    addi a0, a0, 16
+; RV64-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vfmin.vv v24, v8, v24
+; RV64-NEXT:    csrr a0, vlenb
+; RV64-NEXT:    slli a0, a0, 4
+; RV64-NEXT:    add a0, sp, a0
+; RV64-NEXT:    addi a0, a0, 16
+; RV64-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vmfeq.vv v0, v8, v8
+; RV64-NEXT:    addi a0, sp, 16
 ; RV64-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vmfeq.vv v0, v16, v16
-; RV64-NEXT:    vmfeq.vv v7, v8, v8
+; RV64-NEXT:    vmfeq.vv v7, v16, v16
+; RV64-NEXT:    csrr a0, vlenb
+; RV64-NEXT:    slli a0, a0, 4
+; RV64-NEXT:    add a0, sp, a0
+; RV64-NEXT:    addi a0, a0, 16
+; RV64-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vmerge.vvm v8, v8, v16, v0
 ; RV64-NEXT:    csrr a0, vlenb
 ; RV64-NEXT:    slli a0, a0, 3
 ; RV64-NEXT:    add a0, sp, a0
 ; RV64-NEXT:    addi a0, a0, 16
-; RV64-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vmerge.vvm v16, v16, v8, v0
-; RV64-NEXT:    addi a0, sp, 16
-; RV64-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; RV64-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
 ; RV64-NEXT:    vmv1r.v v0, v7
 ; RV64-NEXT:    csrr a0, vlenb
+; RV64-NEXT:    slli a0, a0, 4
+; RV64-NEXT:    add a0, sp, a0
+; RV64-NEXT:    addi a0, a0, 16
+; RV64-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vmerge.vvm v16, v16, v8, v0
+; RV64-NEXT:    csrr a0, vlenb
 ; RV64-NEXT:    slli a0, a0, 3
 ; RV64-NEXT:    add a0, sp, a0
 ; RV64-NEXT:    addi a0, a0, 16
-; RV64-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vmerge.vvm v8, v8, v16, v0
-; RV64-NEXT:    addi a0, sp, 16
-; RV64-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vfmin.vv v16, v8, v16
+; RV64-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vfmin.vv v16, v16, v8
 ; RV64-NEXT:    vmfeq.vv v0, v16, v16
 ; RV64-NEXT:    vmfeq.vv v7, v24, v24
 ; RV64-NEXT:    vmerge.vvm v8, v16, v24, v0
@@ -2406,7 +2472,10 @@ define double @vreduce_fminimum_v64f64(ptr %x) {
 ; RV64-NEXT:    vfmv.f.s fa0, v8
 ; RV64-NEXT:  .LBB133_3:
 ; RV64-NEXT:    csrr a0, vlenb
-; RV64-NEXT:    slli a0, a0, 4
+; RV64-NEXT:    slli a0, a0, 3
+; RV64-NEXT:    mv a1, a0
+; RV64-NEXT:    slli a0, a0, 1
+; RV64-NEXT:    add a0, a0, a1
 ; RV64-NEXT:    add sp, sp, a0
 ; RV64-NEXT:    .cfi_def_cfa sp, 16
 ; RV64-NEXT:    addi sp, sp, 16
@@ -2702,57 +2771,77 @@ define float @vreduce_fmaximum_v128f32(ptr %x) {
 ; CHECK-NEXT:    addi sp, sp, -16
 ; CHECK-NEXT:    .cfi_def_cfa_offset 16
 ; CHECK-NEXT:    csrr a1, vlenb
-; CHECK-NEXT:    slli a1, a1, 4
+; CHECK-NEXT:    slli a1, a1, 3
+; CHECK-NEXT:    mv a2, a1
+; CHECK-NEXT:    slli a1, a1, 1
+; CHECK-NEXT:    add a1, a1, a2
 ; CHECK-NEXT:    sub sp, sp, a1
-; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
+; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x18, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 24 * vlenb
 ; CHECK-NEXT:    li a1, 32
 ; CHECK-NEXT:    addi a2, a0, 128
 ; CHECK-NEXT:    vsetvli zero, a1, e32, m8, ta, ma
 ; CHECK-NEXT:    vle32.v v24, (a2)
 ; CHECK-NEXT:    addi a1, a0, 384
-; CHECK-NEXT:    vle32.v v16, (a1)
+; CHECK-NEXT:    vle32.v v8, (a1)
 ; CHECK-NEXT:    addi a1, a0, 256
-; CHECK-NEXT:    vle32.v v8, (a0)
+; CHECK-NEXT:    vle32.v v16, (a1)
+; CHECK-NEXT:    csrr a1, vlenb
+; CHECK-NEXT:    slli a1, a1, 4
+; CHECK-NEXT:    add a1, sp, a1
+; CHECK-NEXT:    addi a1, a1, 16
+; CHECK-NEXT:    vs8r.v v16, (a1) # vscale x 64-byte Folded Spill
+; CHECK-NEXT:    vmfeq.vv v0, v24, v24
+; CHECK-NEXT:    vmfeq.vv v7, v8, v8
+; CHECK-NEXT:    vle32.v v16, (a0)
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; CHECK-NEXT:    vmerge.vvm v16, v24, v8, v0
 ; CHECK-NEXT:    csrr a0, vlenb
 ; CHECK-NEXT:    slli a0, a0, 3
 ; CHECK-NEXT:    add a0, sp, a0
 ; CHECK-NEXT:    addi a0, a0, 16
-; CHECK-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; CHECK-NEXT:    vmfeq.vv v0, v24, v24
-; CHECK-NEXT:    vmfeq.vv v7, v16, v16
-; CHECK-NEXT:    vmerge.vvm v8, v24, v16, v0
-; CHECK-NEXT:    addi a0, sp, 16
-; CHECK-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; CHECK-NEXT:    vle32.v v8, (a1)
+; CHECK-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
 ; CHECK-NEXT:    vmv1r.v v0, v7
-; CHECK-NEXT:    vmerge.vvm v16, v16, v24, v0
-; CHECK-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
-; CHECK-NEXT:    vfmax.vv v24, v16, v24
+; CHECK-NEXT:    vmerge.vvm v8, v8, v24, v0
 ; CHECK-NEXT:    csrr a0, vlenb
 ; CHECK-NEXT:    slli a0, a0, 3
 ; CHECK-NEXT:    add a0, sp, a0
 ; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vfmax.vv v24, v8, v24
+; CHECK-NEXT:    addi a0, sp, 16
 ; CHECK-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
 ; CHECK-NEXT:    vmfeq.vv v0, v16, v16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
 ; CHECK-NEXT:    vmfeq.vv v7, v8, v8
 ; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vmerge.vvm v8, v16, v8, v0
+; CHECK-NEXT:    csrr a0, vlenb
 ; CHECK-NEXT:    slli a0, a0, 3
 ; CHECK-NEXT:    add a0, sp, a0
 ; CHECK-NEXT:    addi a0, a0, 16
-; CHECK-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; CHECK-NEXT:    vmerge.vvm v16, v16, v8, v0
-; CHECK-NEXT:    addi a0, sp, 16
-; CHECK-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; CHECK-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
 ; CHECK-NEXT:    vmv1r.v v0, v7
 ; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vmerge.vvm v16, v8, v16, v0
+; CHECK-NEXT:    csrr a0, vlenb
 ; CHECK-NEXT:    slli a0, a0, 3
 ; CHECK-NEXT:    add a0, sp, a0
 ; CHECK-NEXT:    addi a0, a0, 16
-; CHECK-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; CHECK-NEXT:    vmerge.vvm v8, v8, v16, v0
-; CHECK-NEXT:    addi a0, sp, 16
-; CHECK-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; CHECK-NEXT:    vfmax.vv v16, v8, v16
+; CHECK-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vfmax.vv v16, v16, v8
 ; CHECK-NEXT:    vmfeq.vv v0, v16, v16
 ; CHECK-NEXT:    vmfeq.vv v7, v24, v24
 ; CHECK-NEXT:    vmerge.vvm v8, v16, v24, v0
@@ -2771,7 +2860,10 @@ define float @vreduce_fmaximum_v128f32(ptr %x) {
 ; CHECK-NEXT:    vfmv.f.s fa0, v8
 ; CHECK-NEXT:  .LBB149_3:
 ; CHECK-NEXT:    csrr a0, vlenb
-; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    mv a1, a0
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    add a0, a0, a1
 ; CHECK-NEXT:    add sp, sp, a0
 ; CHECK-NEXT:    .cfi_def_cfa sp, 16
 ; CHECK-NEXT:    addi sp, sp, 16
@@ -3085,56 +3177,76 @@ define double @vreduce_fmaximum_v64f64(ptr %x) {
 ; RV32-NEXT:    addi sp, sp, -16
 ; RV32-NEXT:    .cfi_def_cfa_offset 16
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 4
+; RV32-NEXT:    slli a1, a1, 3
+; RV32-NEXT:    mv a2, a1
+; RV32-NEXT:    slli a1, a1, 1
+; RV32-NEXT:    add a1, a1, a2
 ; RV32-NEXT:    sub sp, sp, a1
-; RV32-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
+; RV32-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x18, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 24 * vlenb
 ; RV32-NEXT:    addi a1, a0, 128
 ; RV32-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
 ; RV32-NEXT:    vle64.v v24, (a1)
 ; RV32-NEXT:    addi a1, a0, 384
-; RV32-NEXT:    vle64.v v16, (a1)
+; RV32-NEXT:    vle64.v v8, (a1)
 ; RV32-NEXT:    addi a1, a0, 256
-; RV32-NEXT:    vle64.v v8, (a0)
+; RV32-NEXT:    vle64.v v16, (a0)
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    slli a0, a0, 3
+; RV32-NEXT:    slli a0, a0, 4
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 16
-; RV32-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
+; RV32-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
 ; RV32-NEXT:    vmfeq.vv v0, v24, v24
-; RV32-NEXT:    vmfeq.vv v7, v16, v16
-; RV32-NEXT:    vmerge.vvm v8, v24, v16, v0
+; RV32-NEXT:    vmfeq.vv v7, v8, v8
+; RV32-NEXT:    vle64.v v16, (a1)
 ; RV32-NEXT:    addi a0, sp, 16
-; RV32-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; RV32-NEXT:    vle64.v v8, (a1)
+; RV32-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; RV32-NEXT:    vmerge.vvm v16, v24, v8, v0
+; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    slli a0, a0, 3
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 16
+; RV32-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
 ; RV32-NEXT:    vmv1r.v v0, v7
-; RV32-NEXT:    vmerge.vvm v16, v16, v24, v0
-; RV32-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vfmax.vv v24, v16, v24
+; RV32-NEXT:    vmerge.vvm v8, v8, v24, v0
 ; RV32-NEXT:    csrr a0, vlenb
 ; RV32-NEXT:    slli a0, a0, 3
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 16
+; RV32-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vfmax.vv v24, v8, v24
+; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    slli a0, a0, 4
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 16
+; RV32-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vmfeq.vv v0, v8, v8
+; RV32-NEXT:    addi a0, sp, 16
 ; RV32-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vmfeq.vv v0, v16, v16
-; RV32-NEXT:    vmfeq.vv v7, v8, v8
+; RV32-NEXT:    vmfeq.vv v7, v16, v16
+; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    slli a0, a0, 4
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 16
+; RV32-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vmerge.vvm v8, v8, v16, v0
 ; RV32-NEXT:    csrr a0, vlenb
 ; RV32-NEXT:    slli a0, a0, 3
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 16
-; RV32-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vmerge.vvm v16, v16, v8, v0
-; RV32-NEXT:    addi a0, sp, 16
-; RV32-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; RV32-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
 ; RV32-NEXT:    vmv1r.v v0, v7
 ; RV32-NEXT:    csrr a0, vlenb
+; RV32-NEXT:    slli a0, a0, 4
+; RV32-NEXT:    add a0, sp, a0
+; RV32-NEXT:    addi a0, a0, 16
+; RV32-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vmerge.vvm v16, v16, v8, v0
+; RV32-NEXT:    csrr a0, vlenb
 ; RV32-NEXT:    slli a0, a0, 3
 ; RV32-NEXT:    add a0, sp, a0
 ; RV32-NEXT:    addi a0, a0, 16
-; RV32-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vmerge.vvm v8, v8, v16, v0
-; RV32-NEXT:    addi a0, sp, 16
-; RV32-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV32-NEXT:    vfmax.vv v16, v8, v16
+; RV32-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV32-NEXT:    vfmax.vv v16, v16, v8
 ; RV32-NEXT:    vmfeq.vv v0, v16, v16
 ; RV32-NEXT:    vmfeq.vv v7, v24, v24
 ; RV32-NEXT:    vmerge.vvm v8, v16, v24, v0
@@ -3153,7 +3265,10 @@ define double @vreduce_fmaximum_v64f64(ptr %x) {
 ; RV32-NEXT:    vfmv.f.s fa0, v8
 ; RV32-NEXT:  .LBB161_3:
 ; RV32-NEXT:    csrr a0, vlenb
-; RV32-NEXT:    slli a0, a0, 4
+; RV32-NEXT:    slli a0, a0, 3
+; RV32-NEXT:    mv a1, a0
+; RV32-NEXT:    slli a0, a0, 1
+; RV32-NEXT:    add a0, a0, a1
 ; RV32-NEXT:    add sp, sp, a0
 ; RV32-NEXT:    .cfi_def_cfa sp, 16
 ; RV32-NEXT:    addi sp, sp, 16
@@ -3165,56 +3280,76 @@ define double @vreduce_fmaximum_v64f64(ptr %x) {
 ; RV64-NEXT:    addi sp, sp, -16
 ; RV64-NEXT:    .cfi_def_cfa_offset 16
 ; RV64-NEXT:    csrr a1, vlenb
-; RV64-NEXT:    slli a1, a1, 4
+; RV64-NEXT:    slli a1, a1, 3
+; RV64-NEXT:    mv a2, a1
+; RV64-NEXT:    slli a1, a1, 1
+; RV64-NEXT:    add a1, a1, a2
 ; RV64-NEXT:    sub sp, sp, a1
-; RV64-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
+; RV64-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x18, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 24 * vlenb
 ; RV64-NEXT:    addi a1, a0, 128
 ; RV64-NEXT:    vsetivli zero, 16, e64, m8, ta, ma
 ; RV64-NEXT:    vle64.v v24, (a1)
 ; RV64-NEXT:    addi a1, a0, 384
-; RV64-NEXT:    vle64.v v16, (a1)
+; RV64-NEXT:    vle64.v v8, (a1)
 ; RV64-NEXT:    addi a1, a0, 256
-; RV64-NEXT:    vle64.v v8, (a0)
+; RV64-NEXT:    vle64.v v16, (a0)
 ; RV64-NEXT:    csrr a0, vlenb
-; RV64-NEXT:    slli a0, a0, 3
+; RV64-NEXT:    slli a0, a0, 4
 ; RV64-NEXT:    add a0, sp, a0
 ; RV64-NEXT:    addi a0, a0, 16
-; RV64-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
+; RV64-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
 ; RV64-NEXT:    vmfeq.vv v0, v24, v24
-; RV64-NEXT:    vmfeq.vv v7, v16, v16
-; RV64-NEXT:    vmerge.vvm v8, v24, v16, v0
+; RV64-NEXT:    vmfeq.vv v7, v8, v8
+; RV64-NEXT:    vle64.v v16, (a1)
 ; RV64-NEXT:    addi a0, sp, 16
-; RV64-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; RV64-NEXT:    vle64.v v8, (a1)
+; RV64-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; RV64-NEXT:    vmerge.vvm v16, v24, v8, v0
+; RV64-NEXT:    csrr a0, vlenb
+; RV64-NEXT:    slli a0, a0, 3
+; RV64-NEXT:    add a0, sp, a0
+; RV64-NEXT:    addi a0, a0, 16
+; RV64-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
 ; RV64-NEXT:    vmv1r.v v0, v7
-; RV64-NEXT:    vmerge.vvm v16, v16, v24, v0
-; RV64-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vfmax.vv v24, v16, v24
+; RV64-NEXT:    vmerge.vvm v8, v8, v24, v0
 ; RV64-NEXT:    csrr a0, vlenb
 ; RV64-NEXT:    slli a0, a0, 3
 ; RV64-NEXT:    add a0, sp, a0
 ; RV64-NEXT:    addi a0, a0, 16
+; RV64-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vfmax.vv v24, v8, v24
+; RV64-NEXT:    csrr a0, vlenb
+; RV64-NEXT:    slli a0, a0, 4
+; RV64-NEXT:    add a0, sp, a0
+; RV64-NEXT:    addi a0, a0, 16
+; RV64-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vmfeq.vv v0, v8, v8
+; RV64-NEXT:    addi a0, sp, 16
 ; RV64-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vmfeq.vv v0, v16, v16
-; RV64-NEXT:    vmfeq.vv v7, v8, v8
+; RV64-NEXT:    vmfeq.vv v7, v16, v16
+; RV64-NEXT:    csrr a0, vlenb
+; RV64-NEXT:    slli a0, a0, 4
+; RV64-NEXT:    add a0, sp, a0
+; RV64-NEXT:    addi a0, a0, 16
+; RV64-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vmerge.vvm v8, v8, v16, v0
 ; RV64-NEXT:    csrr a0, vlenb
 ; RV64-NEXT:    slli a0, a0, 3
 ; RV64-NEXT:    add a0, sp, a0
 ; RV64-NEXT:    addi a0, a0, 16
-; RV64-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vmerge.vvm v16, v16, v8, v0
-; RV64-NEXT:    addi a0, sp, 16
-; RV64-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; RV64-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
 ; RV64-NEXT:    vmv1r.v v0, v7
 ; RV64-NEXT:    csrr a0, vlenb
+; RV64-NEXT:    slli a0, a0, 4
+; RV64-NEXT:    add a0, sp, a0
+; RV64-NEXT:    addi a0, a0, 16
+; RV64-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vmerge.vvm v16, v16, v8, v0
+; RV64-NEXT:    csrr a0, vlenb
 ; RV64-NEXT:    slli a0, a0, 3
 ; RV64-NEXT:    add a0, sp, a0
 ; RV64-NEXT:    addi a0, a0, 16
-; RV64-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vmerge.vvm v8, v8, v16, v0
-; RV64-NEXT:    addi a0, sp, 16
-; RV64-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; RV64-NEXT:    vfmax.vv v16, v8, v16
+; RV64-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
+; RV64-NEXT:    vfmax.vv v16, v16, v8
 ; RV64-NEXT:    vmfeq.vv v0, v16, v16
 ; RV64-NEXT:    vmfeq.vv v7, v24, v24
 ; RV64-NEXT:    vmerge.vvm v8, v16, v24, v0
@@ -3234,7 +3369,10 @@ define double @vreduce_fmaximum_v64f64(ptr %x) {
 ; RV64-NEXT:    vfmv.f.s fa0, v8
 ; RV64-NEXT:  .LBB161_3:
 ; RV64-NEXT:    csrr a0, vlenb
-; RV64-NEXT:    slli a0, a0, 4
+; RV64-NEXT:    slli a0, a0, 3
+; RV64-NEXT:    mv a1, a0
+; RV64-NEXT:    slli a0, a0, 1
+; RV64-NEXT:    add a0, a0, a1
 ; RV64-NEXT:    add sp, sp, a0
 ; RV64-NEXT:    .cfi_def_cfa sp, 16
 ; RV64-NEXT:    addi sp, sp, 16
diff --git a/llvm/test/CodeGen/RISCV/rvv/fmaximum-sdnode.ll b/llvm/test/CodeGen/RISCV/rvv/fmaximum-sdnode.ll
index 25a4eb74eeba7..fd70f95ed53c6 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fmaximum-sdnode.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fmaximum-sdnode.ll
@@ -121,155 +121,68 @@ define <vscale x 16 x bfloat> @vfmax_nxv16bf16_vv(<vscale x 16 x bfloat> %a, <vs
 }
 
 define <vscale x 32 x bfloat> @vfmax_nxv32bf16_vv(<vscale x 32 x bfloat> %a, <vscale x 32 x bfloat> %b) nounwind {
-; ZVFH-LABEL: vfmax_nxv32bf16_vv:
-; ZVFH:       # %bb.0:
-; ZVFH-NEXT:    addi sp, sp, -16
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 3
-; ZVFH-NEXT:    mv a1, a0
-; ZVFH-NEXT:    slli a0, a0, 1
-; ZVFH-NEXT:    add a0, a0, a1
-; ZVFH-NEXT:    sub sp, sp, a0
-; ZVFH-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
-; ZVFH-NEXT:    vmv8r.v v24, v16
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 3
-; ZVFH-NEXT:    add a0, sp, a0
-; ZVFH-NEXT:    addi a0, a0, 16
-; ZVFH-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
-; ZVFH-NEXT:    vmv8r.v v0, v8
-; ZVFH-NEXT:    vfwcvtbf16.f.f.v v16, v24
-; ZVFH-NEXT:    vfwcvtbf16.f.f.v v8, v0
-; ZVFH-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFH-NEXT:    vmfeq.vv v0, v8, v8
-; ZVFH-NEXT:    vmfeq.vv v3, v16, v16
-; ZVFH-NEXT:    vmerge.vvm v24, v8, v16, v0
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 4
-; ZVFH-NEXT:    add a0, sp, a0
-; ZVFH-NEXT:    addi a0, a0, 16
-; ZVFH-NEXT:    vs8r.v v24, (a0) # vscale x 64-byte Folded Spill
-; ZVFH-NEXT:    vmv1r.v v0, v3
-; ZVFH-NEXT:    vmerge.vvm v8, v16, v8, v0
-; ZVFH-NEXT:    addi a0, sp, 16
-; ZVFH-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 3
-; ZVFH-NEXT:    add a0, sp, a0
-; ZVFH-NEXT:    addi a0, a0, 16
-; ZVFH-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
-; ZVFH-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
-; ZVFH-NEXT:    vfwcvtbf16.f.f.v v24, v12
-; ZVFH-NEXT:    vfwcvtbf16.f.f.v v8, v4
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 4
-; ZVFH-NEXT:    add a0, sp, a0
-; ZVFH-NEXT:    addi a0, a0, 16
-; ZVFH-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; ZVFH-NEXT:    addi a0, sp, 16
-; ZVFH-NEXT:    vl8r.v v0, (a0) # vscale x 64-byte Folded Reload
-; ZVFH-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFH-NEXT:    vfmax.vv v16, v0, v16
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 4
-; ZVFH-NEXT:    add a0, sp, a0
-; ZVFH-NEXT:    addi a0, a0, 16
-; ZVFH-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
-; ZVFH-NEXT:    vmfeq.vv v0, v8, v8
-; ZVFH-NEXT:    vmfeq.vv v7, v24, v24
-; ZVFH-NEXT:    vmerge.vvm v16, v8, v24, v0
-; ZVFH-NEXT:    vmv1r.v v0, v7
-; ZVFH-NEXT:    vmerge.vvm v8, v24, v8, v0
-; ZVFH-NEXT:    vfmax.vv v16, v8, v16
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 4
-; ZVFH-NEXT:    add a0, sp, a0
-; ZVFH-NEXT:    addi a0, a0, 16
-; ZVFH-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
-; ZVFH-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
-; ZVFH-NEXT:    vfncvtbf16.f.f.w v8, v24
-; ZVFH-NEXT:    vfncvtbf16.f.f.w v12, v16
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 3
-; ZVFH-NEXT:    mv a1, a0
-; ZVFH-NEXT:    slli a0, a0, 1
-; ZVFH-NEXT:    add a0, a0, a1
-; ZVFH-NEXT:    add sp, sp, a0
-; ZVFH-NEXT:    addi sp, sp, 16
-; ZVFH-NEXT:    ret
-;
-; ZVFHMIN-LABEL: vfmax_nxv32bf16_vv:
-; ZVFHMIN:       # %bb.0:
-; ZVFHMIN-NEXT:    addi sp, sp, -16
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    li a1, 24
-; ZVFHMIN-NEXT:    mul a0, a0, a1
-; ZVFHMIN-NEXT:    sub sp, sp, a0
-; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
-; ZVFHMIN-NEXT:    vmv8r.v v24, v16
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 3
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
-; ZVFHMIN-NEXT:    vmv8r.v v0, v8
-; ZVFHMIN-NEXT:    vfwcvtbf16.f.f.v v16, v24
-; ZVFHMIN-NEXT:    vfwcvtbf16.f.f.v v8, v0
-; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFHMIN-NEXT:    vmfeq.vv v0, v8, v8
-; ZVFHMIN-NEXT:    vmfeq.vv v3, v16, v16
-; ZVFHMIN-NEXT:    vmerge.vvm v24, v8, v16, v0
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vs8r.v v24, (a0) # vscale x 64-byte Folded Spill
-; ZVFHMIN-NEXT:    vmv1r.v v0, v3
-; ZVFHMIN-NEXT:    vmerge.vvm v8, v16, v8, v0
-; ZVFHMIN-NEXT:    addi a0, sp, 16
-; ZVFHMIN-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 3
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
-; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
-; ZVFHMIN-NEXT:    vfwcvtbf16.f.f.v v24, v12
-; ZVFHMIN-NEXT:    vfwcvtbf16.f.f.v v8, v4
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; ZVFHMIN-NEXT:    addi a0, sp, 16
-; ZVFHMIN-NEXT:    vl8r.v v0, (a0) # vscale x 64-byte Folded Reload
-; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vv v16, v0, v16
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
-; ZVFHMIN-NEXT:    vmfeq.vv v0, v8, v8
-; ZVFHMIN-NEXT:    vmfeq.vv v7, v24, v24
-; ZVFHMIN-NEXT:    vmerge.vvm v16, v8, v24, v0
-; ZVFHMIN-NEXT:    vmv1r.v v0, v7
-; ZVFHMIN-NEXT:    vmerge.vvm v8, v24, v8, v0
-; ZVFHMIN-NEXT:    vfmax.vv v16, v8, v16
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
-; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
-; ZVFHMIN-NEXT:    vfncvtbf16.f.f.w v8, v24
-; ZVFHMIN-NEXT:    vfncvtbf16.f.f.w v12, v16
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    li a1, 24
-; ZVFHMIN-NEXT:    mul a0, a0, a1
-; ZVFHMIN-NEXT:    add sp, sp, a0
-; ZVFHMIN-NEXT:    addi sp, sp, 16
-; ZVFHMIN-NEXT:    ret
+; CHECK-LABEL: vfmax_nxv32bf16_vv:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi sp, sp, -16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    sub sp, sp, a0
+; CHECK-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
+; CHECK-NEXT:    vmv8r.v v0, v16
+; CHECK-NEXT:    vmv8r.v v24, v8
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
+; CHECK-NEXT:    vfwcvtbf16.f.f.v v16, v0
+; CHECK-NEXT:    vfwcvtbf16.f.f.v v8, v24
+; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
+; CHECK-NEXT:    vmfeq.vv v0, v8, v8
+; CHECK-NEXT:    vmfeq.vv v3, v16, v16
+; CHECK-NEXT:    vmerge.vvm v24, v8, v16, v0
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vs8r.v v24, (a0) # vscale x 64-byte Folded Spill
+; CHECK-NEXT:    vmv1r.v v0, v3
+; CHECK-NEXT:    vmerge.vvm v8, v16, v8, v0
+; CHECK-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
+; CHECK-NEXT:    vfwcvtbf16.f.f.v v24, v4
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
+; CHECK-NEXT:    vfmax.vv v8, v8, v16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
+; CHECK-NEXT:    vfwcvtbf16.f.f.v v8, v20
+; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
+; CHECK-NEXT:    vmfeq.vv v0, v8, v8
+; CHECK-NEXT:    vmfeq.vv v7, v24, v24
+; CHECK-NEXT:    vmerge.vvm v16, v8, v24, v0
+; CHECK-NEXT:    vmv1r.v v0, v7
+; CHECK-NEXT:    vmerge.vvm v8, v24, v8, v0
+; CHECK-NEXT:    vfmax.vv v16, v8, v16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
+; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v24
+; CHECK-NEXT:    vfncvtbf16.f.f.w v12, v16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    add sp, sp, a0
+; CHECK-NEXT:    addi sp, sp, 16
+; CHECK-NEXT:    ret
   %v = call <vscale x 32 x bfloat> @llvm.maximum.nxv32bf16(<vscale x 32 x bfloat> %a, <vscale x 32 x bfloat> %b)
   ret <vscale x 32 x bfloat> %v
 }
@@ -444,54 +357,45 @@ define <vscale x 32 x half> @vfmax_nxv32f16_vv(<vscale x 32 x half> %a, <vscale
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    addi sp, sp, -16
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    li a1, 24
-; ZVFHMIN-NEXT:    mul a0, a0, a1
+; ZVFHMIN-NEXT:    slli a0, a0, 4
 ; ZVFHMIN-NEXT:    sub sp, sp, a0
 ; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
-; ZVFHMIN-NEXT:    vmv8r.v v24, v16
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 3
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
-; ZVFHMIN-NEXT:    vmv8r.v v0, v8
-; ZVFHMIN-NEXT:    vfwcvt.f.f.v v16, v24
-; ZVFHMIN-NEXT:    vfwcvt.f.f.v v8, v0
+; ZVFHMIN-NEXT:    vmv8r.v v0, v16
+; ZVFHMIN-NEXT:    vmv8r.v v24, v8
+; ZVFHMIN-NEXT:    addi a0, sp, 16
+; ZVFHMIN-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
+; ZVFHMIN-NEXT:    vfwcvt.f.f.v v16, v0
+; ZVFHMIN-NEXT:    vfwcvt.f.f.v v8, v24
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
 ; ZVFHMIN-NEXT:    vmfeq.vv v0, v8, v8
 ; ZVFHMIN-NEXT:    vmfeq.vv v3, v16, v16
 ; ZVFHMIN-NEXT:    vmerge.vvm v24, v8, v16, v0
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
+; ZVFHMIN-NEXT:    slli a0, a0, 3
 ; ZVFHMIN-NEXT:    add a0, sp, a0
 ; ZVFHMIN-NEXT:    addi a0, a0, 16
 ; ZVFHMIN-NEXT:    vs8r.v v24, (a0) # vscale x 64-byte Folded Spill
 ; ZVFHMIN-NEXT:    vmv1r.v v0, v3
 ; ZVFHMIN-NEXT:    vmerge.vvm v8, v16, v8, v0
-; ZVFHMIN-NEXT:    addi a0, sp, 16
-; ZVFHMIN-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 3
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
-; ZVFHMIN-NEXT:    vfwcvt.f.f.v v24, v12
-; ZVFHMIN-NEXT:    vfwcvt.f.f.v v8, v4
+; ZVFHMIN-NEXT:    vfwcvt.f.f.v v24, v4
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
+; ZVFHMIN-NEXT:    slli a0, a0, 3
 ; ZVFHMIN-NEXT:    add a0, sp, a0
 ; ZVFHMIN-NEXT:    addi a0, a0, 16
 ; ZVFHMIN-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; ZVFHMIN-NEXT:    addi a0, sp, 16
-; ZVFHMIN-NEXT:    vl8r.v v0, (a0) # vscale x 64-byte Folded Reload
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vv v16, v0, v16
+; ZVFHMIN-NEXT:    vfmax.vv v8, v8, v16
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
+; ZVFHMIN-NEXT:    slli a0, a0, 3
 ; ZVFHMIN-NEXT:    add a0, sp, a0
 ; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; ZVFHMIN-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
+; ZVFHMIN-NEXT:    addi a0, sp, 16
+; ZVFHMIN-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
+; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
+; ZVFHMIN-NEXT:    vfwcvt.f.f.v v8, v20
+; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
 ; ZVFHMIN-NEXT:    vmfeq.vv v0, v8, v8
 ; ZVFHMIN-NEXT:    vmfeq.vv v7, v24, v24
 ; ZVFHMIN-NEXT:    vmerge.vvm v16, v8, v24, v0
@@ -499,7 +403,7 @@ define <vscale x 32 x half> @vfmax_nxv32f16_vv(<vscale x 32 x half> %a, <vscale
 ; ZVFHMIN-NEXT:    vmerge.vvm v8, v24, v8, v0
 ; ZVFHMIN-NEXT:    vfmax.vv v16, v8, v16
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
+; ZVFHMIN-NEXT:    slli a0, a0, 3
 ; ZVFHMIN-NEXT:    add a0, sp, a0
 ; ZVFHMIN-NEXT:    addi a0, a0, 16
 ; ZVFHMIN-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
@@ -507,8 +411,7 @@ define <vscale x 32 x half> @vfmax_nxv32f16_vv(<vscale x 32 x half> %a, <vscale
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v24
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v12, v16
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    li a1, 24
-; ZVFHMIN-NEXT:    mul a0, a0, a1
+; ZVFHMIN-NEXT:    slli a0, a0, 4
 ; ZVFHMIN-NEXT:    add sp, sp, a0
 ; ZVFHMIN-NEXT:    addi sp, sp, 16
 ; ZVFHMIN-NEXT:    ret
diff --git a/llvm/test/CodeGen/RISCV/rvv/fminimum-sdnode.ll b/llvm/test/CodeGen/RISCV/rvv/fminimum-sdnode.ll
index 6ffa71c6c908b..339f97a73ee52 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fminimum-sdnode.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fminimum-sdnode.ll
@@ -121,155 +121,68 @@ define <vscale x 16 x bfloat> @vfmin_nxv16bf16_vv(<vscale x 16 x bfloat> %a, <vs
 }
 
 define <vscale x 32 x bfloat> @vfmin_nxv32bf16_vv(<vscale x 32 x bfloat> %a, <vscale x 32 x bfloat> %b) nounwind {
-; ZVFH-LABEL: vfmin_nxv32bf16_vv:
-; ZVFH:       # %bb.0:
-; ZVFH-NEXT:    addi sp, sp, -16
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 3
-; ZVFH-NEXT:    mv a1, a0
-; ZVFH-NEXT:    slli a0, a0, 1
-; ZVFH-NEXT:    add a0, a0, a1
-; ZVFH-NEXT:    sub sp, sp, a0
-; ZVFH-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
-; ZVFH-NEXT:    vmv8r.v v24, v16
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 3
-; ZVFH-NEXT:    add a0, sp, a0
-; ZVFH-NEXT:    addi a0, a0, 16
-; ZVFH-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
-; ZVFH-NEXT:    vmv8r.v v0, v8
-; ZVFH-NEXT:    vfwcvtbf16.f.f.v v16, v24
-; ZVFH-NEXT:    vfwcvtbf16.f.f.v v8, v0
-; ZVFH-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFH-NEXT:    vmfeq.vv v0, v8, v8
-; ZVFH-NEXT:    vmfeq.vv v3, v16, v16
-; ZVFH-NEXT:    vmerge.vvm v24, v8, v16, v0
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 4
-; ZVFH-NEXT:    add a0, sp, a0
-; ZVFH-NEXT:    addi a0, a0, 16
-; ZVFH-NEXT:    vs8r.v v24, (a0) # vscale x 64-byte Folded Spill
-; ZVFH-NEXT:    vmv1r.v v0, v3
-; ZVFH-NEXT:    vmerge.vvm v8, v16, v8, v0
-; ZVFH-NEXT:    addi a0, sp, 16
-; ZVFH-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 3
-; ZVFH-NEXT:    add a0, sp, a0
-; ZVFH-NEXT:    addi a0, a0, 16
-; ZVFH-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
-; ZVFH-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
-; ZVFH-NEXT:    vfwcvtbf16.f.f.v v24, v12
-; ZVFH-NEXT:    vfwcvtbf16.f.f.v v8, v4
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 4
-; ZVFH-NEXT:    add a0, sp, a0
-; ZVFH-NEXT:    addi a0, a0, 16
-; ZVFH-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; ZVFH-NEXT:    addi a0, sp, 16
-; ZVFH-NEXT:    vl8r.v v0, (a0) # vscale x 64-byte Folded Reload
-; ZVFH-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFH-NEXT:    vfmin.vv v16, v0, v16
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 4
-; ZVFH-NEXT:    add a0, sp, a0
-; ZVFH-NEXT:    addi a0, a0, 16
-; ZVFH-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
-; ZVFH-NEXT:    vmfeq.vv v0, v8, v8
-; ZVFH-NEXT:    vmfeq.vv v7, v24, v24
-; ZVFH-NEXT:    vmerge.vvm v16, v8, v24, v0
-; ZVFH-NEXT:    vmv1r.v v0, v7
-; ZVFH-NEXT:    vmerge.vvm v8, v24, v8, v0
-; ZVFH-NEXT:    vfmin.vv v16, v8, v16
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 4
-; ZVFH-NEXT:    add a0, sp, a0
-; ZVFH-NEXT:    addi a0, a0, 16
-; ZVFH-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
-; ZVFH-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
-; ZVFH-NEXT:    vfncvtbf16.f.f.w v8, v24
-; ZVFH-NEXT:    vfncvtbf16.f.f.w v12, v16
-; ZVFH-NEXT:    csrr a0, vlenb
-; ZVFH-NEXT:    slli a0, a0, 3
-; ZVFH-NEXT:    mv a1, a0
-; ZVFH-NEXT:    slli a0, a0, 1
-; ZVFH-NEXT:    add a0, a0, a1
-; ZVFH-NEXT:    add sp, sp, a0
-; ZVFH-NEXT:    addi sp, sp, 16
-; ZVFH-NEXT:    ret
-;
-; ZVFHMIN-LABEL: vfmin_nxv32bf16_vv:
-; ZVFHMIN:       # %bb.0:
-; ZVFHMIN-NEXT:    addi sp, sp, -16
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    li a1, 24
-; ZVFHMIN-NEXT:    mul a0, a0, a1
-; ZVFHMIN-NEXT:    sub sp, sp, a0
-; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
-; ZVFHMIN-NEXT:    vmv8r.v v24, v16
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 3
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
-; ZVFHMIN-NEXT:    vmv8r.v v0, v8
-; ZVFHMIN-NEXT:    vfwcvtbf16.f.f.v v16, v24
-; ZVFHMIN-NEXT:    vfwcvtbf16.f.f.v v8, v0
-; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFHMIN-NEXT:    vmfeq.vv v0, v8, v8
-; ZVFHMIN-NEXT:    vmfeq.vv v3, v16, v16
-; ZVFHMIN-NEXT:    vmerge.vvm v24, v8, v16, v0
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vs8r.v v24, (a0) # vscale x 64-byte Folded Spill
-; ZVFHMIN-NEXT:    vmv1r.v v0, v3
-; ZVFHMIN-NEXT:    vmerge.vvm v8, v16, v8, v0
-; ZVFHMIN-NEXT:    addi a0, sp, 16
-; ZVFHMIN-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 3
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
-; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
-; ZVFHMIN-NEXT:    vfwcvtbf16.f.f.v v24, v12
-; ZVFHMIN-NEXT:    vfwcvtbf16.f.f.v v8, v4
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; ZVFHMIN-NEXT:    addi a0, sp, 16
-; ZVFHMIN-NEXT:    vl8r.v v0, (a0) # vscale x 64-byte Folded Reload
-; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFHMIN-NEXT:    vfmin.vv v16, v0, v16
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
-; ZVFHMIN-NEXT:    vmfeq.vv v0, v8, v8
-; ZVFHMIN-NEXT:    vmfeq.vv v7, v24, v24
-; ZVFHMIN-NEXT:    vmerge.vvm v16, v8, v24, v0
-; ZVFHMIN-NEXT:    vmv1r.v v0, v7
-; ZVFHMIN-NEXT:    vmerge.vvm v8, v24, v8, v0
-; ZVFHMIN-NEXT:    vfmin.vv v16, v8, v16
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
-; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
-; ZVFHMIN-NEXT:    vfncvtbf16.f.f.w v8, v24
-; ZVFHMIN-NEXT:    vfncvtbf16.f.f.w v12, v16
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    li a1, 24
-; ZVFHMIN-NEXT:    mul a0, a0, a1
-; ZVFHMIN-NEXT:    add sp, sp, a0
-; ZVFHMIN-NEXT:    addi sp, sp, 16
-; ZVFHMIN-NEXT:    ret
+; CHECK-LABEL: vfmin_nxv32bf16_vv:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi sp, sp, -16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    sub sp, sp, a0
+; CHECK-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
+; CHECK-NEXT:    vmv8r.v v0, v16
+; CHECK-NEXT:    vmv8r.v v24, v8
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
+; CHECK-NEXT:    vfwcvtbf16.f.f.v v16, v0
+; CHECK-NEXT:    vfwcvtbf16.f.f.v v8, v24
+; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
+; CHECK-NEXT:    vmfeq.vv v0, v8, v8
+; CHECK-NEXT:    vmfeq.vv v3, v16, v16
+; CHECK-NEXT:    vmerge.vvm v24, v8, v16, v0
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vs8r.v v24, (a0) # vscale x 64-byte Folded Spill
+; CHECK-NEXT:    vmv1r.v v0, v3
+; CHECK-NEXT:    vmerge.vvm v8, v16, v8, v0
+; CHECK-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
+; CHECK-NEXT:    vfwcvtbf16.f.f.v v24, v4
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
+; CHECK-NEXT:    vfmin.vv v8, v8, v16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
+; CHECK-NEXT:    vfwcvtbf16.f.f.v v8, v20
+; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
+; CHECK-NEXT:    vmfeq.vv v0, v8, v8
+; CHECK-NEXT:    vmfeq.vv v7, v24, v24
+; CHECK-NEXT:    vmerge.vvm v16, v8, v24, v0
+; CHECK-NEXT:    vmv1r.v v0, v7
+; CHECK-NEXT:    vmerge.vvm v8, v24, v8, v0
+; CHECK-NEXT:    vfmin.vv v16, v8, v16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 3
+; CHECK-NEXT:    add a0, sp, a0
+; CHECK-NEXT:    addi a0, a0, 16
+; CHECK-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
+; CHECK-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
+; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v24
+; CHECK-NEXT:    vfncvtbf16.f.f.w v12, v16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 4
+; CHECK-NEXT:    add sp, sp, a0
+; CHECK-NEXT:    addi sp, sp, 16
+; CHECK-NEXT:    ret
   %v = call <vscale x 32 x bfloat> @llvm.minimum.nxv32bf16(<vscale x 32 x bfloat> %a, <vscale x 32 x bfloat> %b)
   ret <vscale x 32 x bfloat> %v
 }
@@ -444,54 +357,45 @@ define <vscale x 32 x half> @vfmin_nxv32f16_vv(<vscale x 32 x half> %a, <vscale
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    addi sp, sp, -16
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    li a1, 24
-; ZVFHMIN-NEXT:    mul a0, a0, a1
+; ZVFHMIN-NEXT:    slli a0, a0, 4
 ; ZVFHMIN-NEXT:    sub sp, sp, a0
 ; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
-; ZVFHMIN-NEXT:    vmv8r.v v24, v16
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 3
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
-; ZVFHMIN-NEXT:    vmv8r.v v0, v8
-; ZVFHMIN-NEXT:    vfwcvt.f.f.v v16, v24
-; ZVFHMIN-NEXT:    vfwcvt.f.f.v v8, v0
+; ZVFHMIN-NEXT:    vmv8r.v v0, v16
+; ZVFHMIN-NEXT:    vmv8r.v v24, v8
+; ZVFHMIN-NEXT:    addi a0, sp, 16
+; ZVFHMIN-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
+; ZVFHMIN-NEXT:    vfwcvt.f.f.v v16, v0
+; ZVFHMIN-NEXT:    vfwcvt.f.f.v v8, v24
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
 ; ZVFHMIN-NEXT:    vmfeq.vv v0, v8, v8
 ; ZVFHMIN-NEXT:    vmfeq.vv v3, v16, v16
 ; ZVFHMIN-NEXT:    vmerge.vvm v24, v8, v16, v0
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
+; ZVFHMIN-NEXT:    slli a0, a0, 3
 ; ZVFHMIN-NEXT:    add a0, sp, a0
 ; ZVFHMIN-NEXT:    addi a0, a0, 16
 ; ZVFHMIN-NEXT:    vs8r.v v24, (a0) # vscale x 64-byte Folded Spill
 ; ZVFHMIN-NEXT:    vmv1r.v v0, v3
 ; ZVFHMIN-NEXT:    vmerge.vvm v8, v16, v8, v0
-; ZVFHMIN-NEXT:    addi a0, sp, 16
-; ZVFHMIN-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
-; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 3
-; ZVFHMIN-NEXT:    add a0, sp, a0
-; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
-; ZVFHMIN-NEXT:    vfwcvt.f.f.v v24, v12
-; ZVFHMIN-NEXT:    vfwcvt.f.f.v v8, v4
+; ZVFHMIN-NEXT:    vfwcvt.f.f.v v24, v4
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
+; ZVFHMIN-NEXT:    slli a0, a0, 3
 ; ZVFHMIN-NEXT:    add a0, sp, a0
 ; ZVFHMIN-NEXT:    addi a0, a0, 16
 ; ZVFHMIN-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
-; ZVFHMIN-NEXT:    addi a0, sp, 16
-; ZVFHMIN-NEXT:    vl8r.v v0, (a0) # vscale x 64-byte Folded Reload
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFHMIN-NEXT:    vfmin.vv v16, v0, v16
+; ZVFHMIN-NEXT:    vfmin.vv v8, v8, v16
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
+; ZVFHMIN-NEXT:    slli a0, a0, 3
 ; ZVFHMIN-NEXT:    add a0, sp, a0
 ; ZVFHMIN-NEXT:    addi a0, a0, 16
-; ZVFHMIN-NEXT:    vs8r.v v16, (a0) # vscale x 64-byte Folded Spill
+; ZVFHMIN-NEXT:    vs8r.v v8, (a0) # vscale x 64-byte Folded Spill
+; ZVFHMIN-NEXT:    addi a0, sp, 16
+; ZVFHMIN-NEXT:    vl8r.v v16, (a0) # vscale x 64-byte Folded Reload
+; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
+; ZVFHMIN-NEXT:    vfwcvt.f.f.v v8, v20
+; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
 ; ZVFHMIN-NEXT:    vmfeq.vv v0, v8, v8
 ; ZVFHMIN-NEXT:    vmfeq.vv v7, v24, v24
 ; ZVFHMIN-NEXT:    vmerge.vvm v16, v8, v24, v0
@@ -499,7 +403,7 @@ define <vscale x 32 x half> @vfmin_nxv32f16_vv(<vscale x 32 x half> %a, <vscale
 ; ZVFHMIN-NEXT:    vmerge.vvm v8, v24, v8, v0
 ; ZVFHMIN-NEXT:    vfmin.vv v16, v8, v16
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    slli a0, a0, 4
+; ZVFHMIN-NEXT:    slli a0, a0, 3
 ; ZVFHMIN-NEXT:    add a0, sp, a0
 ; ZVFHMIN-NEXT:    addi a0, a0, 16
 ; ZVFHMIN-NEXT:    vl8r.v v24, (a0) # vscale x 64-byte Folded Reload
@@ -507,8 +411,7 @@ define <vscale x 32 x half> @vfmin_nxv32f16_vv(<vscale x 32 x half> %a, <vscale
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v24
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v12, v16
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
-; ZVFHMIN-NEXT:    li a1, 24
-; ZVFHMIN-NEXT:    mul a0, a0, a1
+; ZVFHMIN-NEXT:    slli a0, a0, 4
 ; ZVFHMIN-NEXT:    add sp, sp, a0
 ; ZVFHMIN-NEXT:    addi sp, sp, 16
 ; ZVFHMIN-NEXT:    ret
diff --git a/llvm/test/CodeGen/RISCV/rvv/fminimumnum-sdnode.ll b/llvm/test/CodeGen/RISCV/rvv/fminimumnum-sdnode.ll
index fcb8ad82342d5..a52625d9e8ef4 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fminimumnum-sdnode.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fminimumnum-sdnode.ll
@@ -12,185 +12,185 @@
 ; RUN:     -target-abi=lp64d -verify-machineinstrs < %s | FileCheck %s \
 ; RUN:     --check-prefixes=CHECK,ZVFHMIN
 
-define <vscale x 1 x bfloat> @vfadd_vv_nxv1bf16(<vscale x 1 x bfloat> %va, <vscale x 1 x bfloat> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv1bf16:
+define <vscale x 1 x bfloat> @vfmin_vv_nxv1bf16(<vscale x 1 x bfloat> %va, <vscale x 1 x bfloat> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv1bf16:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e16, mf4, ta, ma
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v10, v9
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v9, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
-; CHECK-NEXT:    vfmax.vv v9, v9, v10
+; CHECK-NEXT:    vfmin.vv v9, v9, v10
 ; CHECK-NEXT:    vsetvli zero, zero, e16, mf4, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v9
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 1 x bfloat> @llvm.maximumnum.nxv1bf16(<vscale x 1 x bfloat> %va, <vscale x 1 x bfloat> %vb)
+  %vc = call <vscale x 1 x bfloat> @llvm.minimumnum.nxv1bf16(<vscale x 1 x bfloat> %va, <vscale x 1 x bfloat> %vb)
   ret <vscale x 1 x bfloat> %vc
 }
 
-define <vscale x 1 x bfloat> @vfadd_vf_nxv1bf16(<vscale x 1 x bfloat> %va, bfloat %b) {
-; CHECK-LABEL: vfadd_vf_nxv1bf16:
+define <vscale x 1 x bfloat> @vfmin_vf_nxv1bf16(<vscale x 1 x bfloat> %va, bfloat %b) {
+; CHECK-LABEL: vfmin_vf_nxv1bf16:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    fcvt.s.bf16 fa5, fa0
 ; CHECK-NEXT:    vsetvli a0, zero, e16, mf4, ta, ma
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v9, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
-; CHECK-NEXT:    vfmax.vf v9, v9, fa5
+; CHECK-NEXT:    vfmin.vf v9, v9, fa5
 ; CHECK-NEXT:    vsetvli zero, zero, e16, mf4, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v9
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 1 x bfloat> poison, bfloat %b, i32 0
   %splat = shufflevector <vscale x 1 x bfloat> %head, <vscale x 1 x bfloat> poison, <vscale x 1 x i32> zeroinitializer
-  %vc = call <vscale x 1 x bfloat> @llvm.maximumnum.nxv1bf16(<vscale x 1 x bfloat> %va, <vscale x 1 x bfloat> %splat)
+  %vc = call <vscale x 1 x bfloat> @llvm.minimumnum.nxv1bf16(<vscale x 1 x bfloat> %va, <vscale x 1 x bfloat> %splat)
   ret <vscale x 1 x bfloat> %vc
 }
 
-define <vscale x 2 x bfloat> @vfadd_vv_nxv2bf16(<vscale x 2 x bfloat> %va, <vscale x 2 x bfloat> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv2bf16:
+define <vscale x 2 x bfloat> @vfmin_vv_nxv2bf16(<vscale x 2 x bfloat> %va, <vscale x 2 x bfloat> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv2bf16:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e16, mf2, ta, ma
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v10, v9
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v9, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m1, ta, ma
-; CHECK-NEXT:    vfmax.vv v9, v9, v10
+; CHECK-NEXT:    vfmin.vv v9, v9, v10
 ; CHECK-NEXT:    vsetvli zero, zero, e16, mf2, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v9
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 2 x bfloat> @llvm.maximumnum.nxv2bf16(<vscale x 2 x bfloat> %va, <vscale x 2 x bfloat> %vb)
+  %vc = call <vscale x 2 x bfloat> @llvm.minimumnum.nxv2bf16(<vscale x 2 x bfloat> %va, <vscale x 2 x bfloat> %vb)
   ret <vscale x 2 x bfloat> %vc
 }
 
-define <vscale x 2 x bfloat> @vfadd_vf_nxv2bf16(<vscale x 2 x bfloat> %va, bfloat %b) {
-; CHECK-LABEL: vfadd_vf_nxv2bf16:
+define <vscale x 2 x bfloat> @vfmin_vf_nxv2bf16(<vscale x 2 x bfloat> %va, bfloat %b) {
+; CHECK-LABEL: vfmin_vf_nxv2bf16:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    fcvt.s.bf16 fa5, fa0
 ; CHECK-NEXT:    vsetvli a0, zero, e16, mf2, ta, ma
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v9, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m1, ta, ma
-; CHECK-NEXT:    vfmax.vf v9, v9, fa5
+; CHECK-NEXT:    vfmin.vf v9, v9, fa5
 ; CHECK-NEXT:    vsetvli zero, zero, e16, mf2, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v9
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 2 x bfloat> poison, bfloat %b, i32 0
   %splat = shufflevector <vscale x 2 x bfloat> %head, <vscale x 2 x bfloat> poison, <vscale x 2 x i32> zeroinitializer
-  %vc = call <vscale x 2 x bfloat> @llvm.maximumnum.nxv2bf16(<vscale x 2 x bfloat> %va, <vscale x 2 x bfloat> %splat)
+  %vc = call <vscale x 2 x bfloat> @llvm.minimumnum.nxv2bf16(<vscale x 2 x bfloat> %va, <vscale x 2 x bfloat> %splat)
   ret <vscale x 2 x bfloat> %vc
 }
 
-define <vscale x 4 x bfloat> @vfadd_vv_nxv4bf16(<vscale x 4 x bfloat> %va, <vscale x 4 x bfloat> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv4bf16:
+define <vscale x 4 x bfloat> @vfmin_vv_nxv4bf16(<vscale x 4 x bfloat> %va, <vscale x 4 x bfloat> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv4bf16:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e16, m1, ta, ma
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v10, v9
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v12, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m2, ta, ma
-; CHECK-NEXT:    vfmax.vv v10, v12, v10
+; CHECK-NEXT:    vfmin.vv v10, v12, v10
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m1, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v10
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 4 x bfloat> @llvm.maximumnum.nxv4bf16(<vscale x 4 x bfloat> %va, <vscale x 4 x bfloat> %vb)
+  %vc = call <vscale x 4 x bfloat> @llvm.minimumnum.nxv4bf16(<vscale x 4 x bfloat> %va, <vscale x 4 x bfloat> %vb)
   ret <vscale x 4 x bfloat> %vc
 }
 
-define <vscale x 4 x bfloat> @vfadd_vf_nxv4bf16(<vscale x 4 x bfloat> %va, bfloat %b) {
-; CHECK-LABEL: vfadd_vf_nxv4bf16:
+define <vscale x 4 x bfloat> @vfmin_vf_nxv4bf16(<vscale x 4 x bfloat> %va, bfloat %b) {
+; CHECK-LABEL: vfmin_vf_nxv4bf16:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    fcvt.s.bf16 fa5, fa0
 ; CHECK-NEXT:    vsetvli a0, zero, e16, m1, ta, ma
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v10, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m2, ta, ma
-; CHECK-NEXT:    vfmax.vf v10, v10, fa5
+; CHECK-NEXT:    vfmin.vf v10, v10, fa5
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m1, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v10
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 4 x bfloat> poison, bfloat %b, i32 0
   %splat = shufflevector <vscale x 4 x bfloat> %head, <vscale x 4 x bfloat> poison, <vscale x 4 x i32> zeroinitializer
-  %vc = call <vscale x 4 x bfloat> @llvm.maximumnum.nxv4bf16(<vscale x 4 x bfloat> %va, <vscale x 4 x bfloat> %splat)
+  %vc = call <vscale x 4 x bfloat> @llvm.minimumnum.nxv4bf16(<vscale x 4 x bfloat> %va, <vscale x 4 x bfloat> %splat)
   ret <vscale x 4 x bfloat> %vc
 }
 
-define <vscale x 8 x bfloat> @vfadd_vv_nxv8bf16(<vscale x 8 x bfloat> %va, <vscale x 8 x bfloat> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv8bf16:
+define <vscale x 8 x bfloat> @vfmin_vv_nxv8bf16(<vscale x 8 x bfloat> %va, <vscale x 8 x bfloat> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv8bf16:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v12, v10
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v16, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m4, ta, ma
-; CHECK-NEXT:    vfmax.vv v12, v16, v12
+; CHECK-NEXT:    vfmin.vv v12, v16, v12
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v12
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 8 x bfloat> @llvm.maximumnum.nxv8bf16(<vscale x 8 x bfloat> %va, <vscale x 8 x bfloat> %vb)
+  %vc = call <vscale x 8 x bfloat> @llvm.minimumnum.nxv8bf16(<vscale x 8 x bfloat> %va, <vscale x 8 x bfloat> %vb)
   ret <vscale x 8 x bfloat> %vc
 }
 
-define <vscale x 8 x bfloat> @vfadd_vf_nxv8bf16(<vscale x 8 x bfloat> %va, bfloat %b) {
-; CHECK-LABEL: vfadd_vf_nxv8bf16:
+define <vscale x 8 x bfloat> @vfmin_vf_nxv8bf16(<vscale x 8 x bfloat> %va, bfloat %b) {
+; CHECK-LABEL: vfmin_vf_nxv8bf16:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    fcvt.s.bf16 fa5, fa0
 ; CHECK-NEXT:    vsetvli a0, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v12, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m4, ta, ma
-; CHECK-NEXT:    vfmax.vf v12, v12, fa5
+; CHECK-NEXT:    vfmin.vf v12, v12, fa5
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v12
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 8 x bfloat> poison, bfloat %b, i32 0
   %splat = shufflevector <vscale x 8 x bfloat> %head, <vscale x 8 x bfloat> poison, <vscale x 8 x i32> zeroinitializer
-  %vc = call <vscale x 8 x bfloat> @llvm.maximumnum.nxv8bf16(<vscale x 8 x bfloat> %va, <vscale x 8 x bfloat> %splat)
+  %vc = call <vscale x 8 x bfloat> @llvm.minimumnum.nxv8bf16(<vscale x 8 x bfloat> %va, <vscale x 8 x bfloat> %splat)
   ret <vscale x 8 x bfloat> %vc
 }
 
-define <vscale x 8 x bfloat> @vfadd_fv_nxv8bf16(<vscale x 8 x bfloat> %va, bfloat %b) {
-; CHECK-LABEL: vfadd_fv_nxv8bf16:
+define <vscale x 8 x bfloat> @vfmin_fv_nxv8bf16(<vscale x 8 x bfloat> %va, bfloat %b) {
+; CHECK-LABEL: vfmin_fv_nxv8bf16:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    fcvt.s.bf16 fa5, fa0
 ; CHECK-NEXT:    vsetvli a0, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v12, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m4, ta, ma
-; CHECK-NEXT:    vfmax.vf v12, v12, fa5
+; CHECK-NEXT:    vfmin.vf v12, v12, fa5
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v12
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 8 x bfloat> poison, bfloat %b, i32 0
   %splat = shufflevector <vscale x 8 x bfloat> %head, <vscale x 8 x bfloat> poison, <vscale x 8 x i32> zeroinitializer
-  %vc = call <vscale x 8 x bfloat> @llvm.maximumnum.nxv8bf16(<vscale x 8 x bfloat> %splat, <vscale x 8 x bfloat> %va)
+  %vc = call <vscale x 8 x bfloat> @llvm.minimumnum.nxv8bf16(<vscale x 8 x bfloat> %splat, <vscale x 8 x bfloat> %va)
   ret <vscale x 8 x bfloat> %vc
 }
 
-define <vscale x 16 x bfloat> @vfadd_vv_nxv16bf16(<vscale x 16 x bfloat> %va, <vscale x 16 x bfloat> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv16bf16:
+define <vscale x 16 x bfloat> @vfmin_vv_nxv16bf16(<vscale x 16 x bfloat> %va, <vscale x 16 x bfloat> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv16bf16:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v16, v12
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v24, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; CHECK-NEXT:    vfmax.vv v16, v24, v16
+; CHECK-NEXT:    vfmin.vv v16, v24, v16
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v16
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 16 x bfloat> @llvm.maximumnum.nxv16bf16(<vscale x 16 x bfloat> %va, <vscale x 16 x bfloat> %vb)
+  %vc = call <vscale x 16 x bfloat> @llvm.minimumnum.nxv16bf16(<vscale x 16 x bfloat> %va, <vscale x 16 x bfloat> %vb)
   ret <vscale x 16 x bfloat> %vc
 }
 
-define <vscale x 16 x bfloat> @vfadd_vf_nxv16bf16(<vscale x 16 x bfloat> %va, bfloat %b) {
-; CHECK-LABEL: vfadd_vf_nxv16bf16:
+define <vscale x 16 x bfloat> @vfmin_vf_nxv16bf16(<vscale x 16 x bfloat> %va, bfloat %b) {
+; CHECK-LABEL: vfmin_vf_nxv16bf16:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    fcvt.s.bf16 fa5, fa0
 ; CHECK-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v16, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; CHECK-NEXT:    vfmax.vf v16, v16, fa5
+; CHECK-NEXT:    vfmin.vf v16, v16, fa5
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v16
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 16 x bfloat> poison, bfloat %b, i32 0
   %splat = shufflevector <vscale x 16 x bfloat> %head, <vscale x 16 x bfloat> poison, <vscale x 16 x i32> zeroinitializer
-  %vc = call <vscale x 16 x bfloat> @llvm.maximumnum.nxv16bf16(<vscale x 16 x bfloat> %va, <vscale x 16 x bfloat> %splat)
+  %vc = call <vscale x 16 x bfloat> @llvm.minimumnum.nxv16bf16(<vscale x 16 x bfloat> %va, <vscale x 16 x bfloat> %splat)
   ret <vscale x 16 x bfloat> %vc
 }
 
-define <vscale x 32 x bfloat> @vfadd_vv_nxv32bf16(<vscale x 32 x bfloat> %va, <vscale x 32 x bfloat> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv32bf16:
+define <vscale x 32 x bfloat> @vfmin_vv_nxv32bf16(<vscale x 32 x bfloat> %va, <vscale x 32 x bfloat> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv32bf16:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    addi sp, sp, -16
 ; CHECK-NEXT:    .cfi_def_cfa_offset 16
@@ -207,11 +207,11 @@ define <vscale x 32 x bfloat> @vfadd_vv_nxv32bf16(<vscale x 32 x bfloat> %va, <v
 ; CHECK-NEXT:    vfwcvtbf16.f.f.v v16, v12
 ; CHECK-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; CHECK-NEXT:    vfmax.vv v0, v0, v8
+; CHECK-NEXT:    vfmin.vv v0, v0, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v0
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; CHECK-NEXT:    vfmax.vv v16, v16, v24
+; CHECK-NEXT:    vfmin.vv v16, v16, v24
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v12, v16
 ; CHECK-NEXT:    csrr a0, vlenb
@@ -221,12 +221,12 @@ define <vscale x 32 x bfloat> @vfadd_vv_nxv32bf16(<vscale x 32 x bfloat> %va, <v
 ; CHECK-NEXT:    addi sp, sp, 16
 ; CHECK-NEXT:    .cfi_def_cfa_offset 0
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 32 x bfloat> @llvm.maximumnum.nxv32bf16(<vscale x 32 x bfloat> %va, <vscale x 32 x bfloat> %vb)
+  %vc = call <vscale x 32 x bfloat> @llvm.minimumnum.nxv32bf16(<vscale x 32 x bfloat> %va, <vscale x 32 x bfloat> %vb)
   ret <vscale x 32 x bfloat> %vc
 }
 
-define <vscale x 32 x bfloat> @vfadd_vf_nxv32bf16(<vscale x 32 x bfloat> %va, bfloat %b) {
-; CHECK-LABEL: vfadd_vf_nxv32bf16:
+define <vscale x 32 x bfloat> @vfmin_vf_nxv32bf16(<vscale x 32 x bfloat> %va, bfloat %b) {
+; CHECK-LABEL: vfmin_vf_nxv32bf16:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    addi sp, sp, -16
 ; CHECK-NEXT:    .cfi_def_cfa_offset 16
@@ -248,11 +248,11 @@ define <vscale x 32 x bfloat> @vfadd_vf_nxv32bf16(<vscale x 32 x bfloat> %va, bf
 ; CHECK-NEXT:    addi a0, sp, 16
 ; CHECK-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; CHECK-NEXT:    vfmax.vv v0, v8, v0
+; CHECK-NEXT:    vfmin.vv v0, v8, v0
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v8, v0
 ; CHECK-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; CHECK-NEXT:    vfmax.vv v16, v24, v16
+; CHECK-NEXT:    vfmin.vv v16, v24, v16
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
 ; CHECK-NEXT:    vfncvtbf16.f.f.w v12, v16
 ; CHECK-NEXT:    csrr a0, vlenb
@@ -264,261 +264,261 @@ define <vscale x 32 x bfloat> @vfadd_vf_nxv32bf16(<vscale x 32 x bfloat> %va, bf
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 32 x bfloat> poison, bfloat %b, i32 0
   %splat = shufflevector <vscale x 32 x bfloat> %head, <vscale x 32 x bfloat> poison, <vscale x 32 x i32> zeroinitializer
-  %vc = call <vscale x 32 x bfloat> @llvm.maximumnum.nxv32bf16(<vscale x 32 x bfloat> %va, <vscale x 32 x bfloat> %splat)
+  %vc = call <vscale x 32 x bfloat> @llvm.minimumnum.nxv32bf16(<vscale x 32 x bfloat> %va, <vscale x 32 x bfloat> %splat)
   ret <vscale x 32 x bfloat> %vc
 }
 
-define <vscale x 1 x half> @vfadd_vv_nxv1f16(<vscale x 1 x half> %va, <vscale x 1 x half> %vb) {
-; ZVFH-LABEL: vfadd_vv_nxv1f16:
+define <vscale x 1 x half> @vfmin_vv_nxv1f16(<vscale x 1 x half> %va, <vscale x 1 x half> %vb) {
+; ZVFH-LABEL: vfmin_vv_nxv1f16:
 ; ZVFH:       # %bb.0:
 ; ZVFH-NEXT:    vsetvli a0, zero, e16, mf4, ta, ma
-; ZVFH-NEXT:    vfmax.vv v8, v8, v9
+; ZVFH-NEXT:    vfmin.vv v8, v8, v9
 ; ZVFH-NEXT:    ret
 ;
-; ZVFHMIN-LABEL: vfadd_vv_nxv1f16:
+; ZVFHMIN-LABEL: vfmin_vv_nxv1f16:
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, mf4, ta, ma
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v10, v9
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v9, v8
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vv v9, v9, v10
+; ZVFHMIN-NEXT:    vfmin.vv v9, v9, v10
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, mf4, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v9
 ; ZVFHMIN-NEXT:    ret
-  %vc = call <vscale x 1 x half> @llvm.maximumnum.nxv1f16(<vscale x 1 x half> %va, <vscale x 1 x half> %vb)
+  %vc = call <vscale x 1 x half> @llvm.minimumnum.nxv1f16(<vscale x 1 x half> %va, <vscale x 1 x half> %vb)
   ret <vscale x 1 x half> %vc
 }
 
-define <vscale x 1 x half> @vfadd_vf_nxv1f16(<vscale x 1 x half> %va, half %b) {
-; ZVFH-LABEL: vfadd_vf_nxv1f16:
+define <vscale x 1 x half> @vfmin_vf_nxv1f16(<vscale x 1 x half> %va, half %b) {
+; ZVFH-LABEL: vfmin_vf_nxv1f16:
 ; ZVFH:       # %bb.0:
 ; ZVFH-NEXT:    vsetvli a0, zero, e16, mf4, ta, ma
-; ZVFH-NEXT:    vfmax.vf v8, v8, fa0
+; ZVFH-NEXT:    vfmin.vf v8, v8, fa0
 ; ZVFH-NEXT:    ret
 ;
-; ZVFHMIN-LABEL: vfadd_vf_nxv1f16:
+; ZVFHMIN-LABEL: vfmin_vf_nxv1f16:
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    fcvt.s.h fa5, fa0
 ; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, mf4, ta, ma
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v9, v8
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vf v9, v9, fa5
+; ZVFHMIN-NEXT:    vfmin.vf v9, v9, fa5
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, mf4, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v9
 ; ZVFHMIN-NEXT:    ret
   %head = insertelement <vscale x 1 x half> poison, half %b, i32 0
   %splat = shufflevector <vscale x 1 x half> %head, <vscale x 1 x half> poison, <vscale x 1 x i32> zeroinitializer
-  %vc = call <vscale x 1 x half> @llvm.maximumnum.nxv1f16(<vscale x 1 x half> %va, <vscale x 1 x half> %splat)
+  %vc = call <vscale x 1 x half> @llvm.minimumnum.nxv1f16(<vscale x 1 x half> %va, <vscale x 1 x half> %splat)
   ret <vscale x 1 x half> %vc
 }
 
-define <vscale x 2 x half> @vfadd_vv_nxv2f16(<vscale x 2 x half> %va, <vscale x 2 x half> %vb) {
-; ZVFH-LABEL: vfadd_vv_nxv2f16:
+define <vscale x 2 x half> @vfmin_vv_nxv2f16(<vscale x 2 x half> %va, <vscale x 2 x half> %vb) {
+; ZVFH-LABEL: vfmin_vv_nxv2f16:
 ; ZVFH:       # %bb.0:
 ; ZVFH-NEXT:    vsetvli a0, zero, e16, mf2, ta, ma
-; ZVFH-NEXT:    vfmax.vv v8, v8, v9
+; ZVFH-NEXT:    vfmin.vv v8, v8, v9
 ; ZVFH-NEXT:    ret
 ;
-; ZVFHMIN-LABEL: vfadd_vv_nxv2f16:
+; ZVFHMIN-LABEL: vfmin_vv_nxv2f16:
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, mf2, ta, ma
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v10, v9
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v9, v8
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m1, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vv v9, v9, v10
+; ZVFHMIN-NEXT:    vfmin.vv v9, v9, v10
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, mf2, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v9
 ; ZVFHMIN-NEXT:    ret
-  %vc = call <vscale x 2 x half> @llvm.maximumnum.nxv2f16(<vscale x 2 x half> %va, <vscale x 2 x half> %vb)
+  %vc = call <vscale x 2 x half> @llvm.minimumnum.nxv2f16(<vscale x 2 x half> %va, <vscale x 2 x half> %vb)
   ret <vscale x 2 x half> %vc
 }
 
-define <vscale x 2 x half> @vfadd_vf_nxv2f16(<vscale x 2 x half> %va, half %b) {
-; ZVFH-LABEL: vfadd_vf_nxv2f16:
+define <vscale x 2 x half> @vfmin_vf_nxv2f16(<vscale x 2 x half> %va, half %b) {
+; ZVFH-LABEL: vfmin_vf_nxv2f16:
 ; ZVFH:       # %bb.0:
 ; ZVFH-NEXT:    vsetvli a0, zero, e16, mf2, ta, ma
-; ZVFH-NEXT:    vfmax.vf v8, v8, fa0
+; ZVFH-NEXT:    vfmin.vf v8, v8, fa0
 ; ZVFH-NEXT:    ret
 ;
-; ZVFHMIN-LABEL: vfadd_vf_nxv2f16:
+; ZVFHMIN-LABEL: vfmin_vf_nxv2f16:
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    fcvt.s.h fa5, fa0
 ; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, mf2, ta, ma
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v9, v8
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m1, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vf v9, v9, fa5
+; ZVFHMIN-NEXT:    vfmin.vf v9, v9, fa5
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, mf2, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v9
 ; ZVFHMIN-NEXT:    ret
   %head = insertelement <vscale x 2 x half> poison, half %b, i32 0
   %splat = shufflevector <vscale x 2 x half> %head, <vscale x 2 x half> poison, <vscale x 2 x i32> zeroinitializer
-  %vc = call <vscale x 2 x half> @llvm.maximumnum.nxv2f16(<vscale x 2 x half> %va, <vscale x 2 x half> %splat)
+  %vc = call <vscale x 2 x half> @llvm.minimumnum.nxv2f16(<vscale x 2 x half> %va, <vscale x 2 x half> %splat)
   ret <vscale x 2 x half> %vc
 }
 
-define <vscale x 4 x half> @vfadd_vv_nxv4f16(<vscale x 4 x half> %va, <vscale x 4 x half> %vb) {
-; ZVFH-LABEL: vfadd_vv_nxv4f16:
+define <vscale x 4 x half> @vfmin_vv_nxv4f16(<vscale x 4 x half> %va, <vscale x 4 x half> %vb) {
+; ZVFH-LABEL: vfmin_vv_nxv4f16:
 ; ZVFH:       # %bb.0:
 ; ZVFH-NEXT:    vsetvli a0, zero, e16, m1, ta, ma
-; ZVFH-NEXT:    vfmax.vv v8, v8, v9
+; ZVFH-NEXT:    vfmin.vv v8, v8, v9
 ; ZVFH-NEXT:    ret
 ;
-; ZVFHMIN-LABEL: vfadd_vv_nxv4f16:
+; ZVFHMIN-LABEL: vfmin_vv_nxv4f16:
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, m1, ta, ma
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v10, v9
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v12, v8
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m2, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vv v10, v12, v10
+; ZVFHMIN-NEXT:    vfmin.vv v10, v12, v10
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m1, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v10
 ; ZVFHMIN-NEXT:    ret
-  %vc = call <vscale x 4 x half> @llvm.maximumnum.nxv4f16(<vscale x 4 x half> %va, <vscale x 4 x half> %vb)
+  %vc = call <vscale x 4 x half> @llvm.minimumnum.nxv4f16(<vscale x 4 x half> %va, <vscale x 4 x half> %vb)
   ret <vscale x 4 x half> %vc
 }
 
-define <vscale x 4 x half> @vfadd_vf_nxv4f16(<vscale x 4 x half> %va, half %b) {
-; ZVFH-LABEL: vfadd_vf_nxv4f16:
+define <vscale x 4 x half> @vfmin_vf_nxv4f16(<vscale x 4 x half> %va, half %b) {
+; ZVFH-LABEL: vfmin_vf_nxv4f16:
 ; ZVFH:       # %bb.0:
 ; ZVFH-NEXT:    vsetvli a0, zero, e16, m1, ta, ma
-; ZVFH-NEXT:    vfmax.vf v8, v8, fa0
+; ZVFH-NEXT:    vfmin.vf v8, v8, fa0
 ; ZVFH-NEXT:    ret
 ;
-; ZVFHMIN-LABEL: vfadd_vf_nxv4f16:
+; ZVFHMIN-LABEL: vfmin_vf_nxv4f16:
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    fcvt.s.h fa5, fa0
 ; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, m1, ta, ma
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v10, v8
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m2, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vf v10, v10, fa5
+; ZVFHMIN-NEXT:    vfmin.vf v10, v10, fa5
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m1, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v10
 ; ZVFHMIN-NEXT:    ret
   %head = insertelement <vscale x 4 x half> poison, half %b, i32 0
   %splat = shufflevector <vscale x 4 x half> %head, <vscale x 4 x half> poison, <vscale x 4 x i32> zeroinitializer
-  %vc = call <vscale x 4 x half> @llvm.maximumnum.nxv4f16(<vscale x 4 x half> %va, <vscale x 4 x half> %splat)
+  %vc = call <vscale x 4 x half> @llvm.minimumnum.nxv4f16(<vscale x 4 x half> %va, <vscale x 4 x half> %splat)
   ret <vscale x 4 x half> %vc
 }
 
-define <vscale x 8 x half> @vfadd_vv_nxv8f16(<vscale x 8 x half> %va, <vscale x 8 x half> %vb) {
-; ZVFH-LABEL: vfadd_vv_nxv8f16:
+define <vscale x 8 x half> @vfmin_vv_nxv8f16(<vscale x 8 x half> %va, <vscale x 8 x half> %vb) {
+; ZVFH-LABEL: vfmin_vv_nxv8f16:
 ; ZVFH:       # %bb.0:
 ; ZVFH-NEXT:    vsetvli a0, zero, e16, m2, ta, ma
-; ZVFH-NEXT:    vfmax.vv v8, v8, v10
+; ZVFH-NEXT:    vfmin.vv v8, v8, v10
 ; ZVFH-NEXT:    ret
 ;
-; ZVFHMIN-LABEL: vfadd_vv_nxv8f16:
+; ZVFHMIN-LABEL: vfmin_vv_nxv8f16:
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, m2, ta, ma
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v12, v10
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v16, v8
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m4, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vv v12, v16, v12
+; ZVFHMIN-NEXT:    vfmin.vv v12, v16, v12
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v12
 ; ZVFHMIN-NEXT:    ret
-  %vc = call <vscale x 8 x half> @llvm.maximumnum.nxv8f16(<vscale x 8 x half> %va, <vscale x 8 x half> %vb)
+  %vc = call <vscale x 8 x half> @llvm.minimumnum.nxv8f16(<vscale x 8 x half> %va, <vscale x 8 x half> %vb)
   ret <vscale x 8 x half> %vc
 }
 
-define <vscale x 8 x half> @vfadd_vf_nxv8f16(<vscale x 8 x half> %va, half %b) {
-; ZVFH-LABEL: vfadd_vf_nxv8f16:
+define <vscale x 8 x half> @vfmin_vf_nxv8f16(<vscale x 8 x half> %va, half %b) {
+; ZVFH-LABEL: vfmin_vf_nxv8f16:
 ; ZVFH:       # %bb.0:
 ; ZVFH-NEXT:    vsetvli a0, zero, e16, m2, ta, ma
-; ZVFH-NEXT:    vfmax.vf v8, v8, fa0
+; ZVFH-NEXT:    vfmin.vf v8, v8, fa0
 ; ZVFH-NEXT:    ret
 ;
-; ZVFHMIN-LABEL: vfadd_vf_nxv8f16:
+; ZVFHMIN-LABEL: vfmin_vf_nxv8f16:
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    fcvt.s.h fa5, fa0
 ; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, m2, ta, ma
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v12, v8
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m4, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vf v12, v12, fa5
+; ZVFHMIN-NEXT:    vfmin.vf v12, v12, fa5
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v12
 ; ZVFHMIN-NEXT:    ret
   %head = insertelement <vscale x 8 x half> poison, half %b, i32 0
   %splat = shufflevector <vscale x 8 x half> %head, <vscale x 8 x half> poison, <vscale x 8 x i32> zeroinitializer
-  %vc = call <vscale x 8 x half> @llvm.maximumnum.nxv8f16(<vscale x 8 x half> %va, <vscale x 8 x half> %splat)
+  %vc = call <vscale x 8 x half> @llvm.minimumnum.nxv8f16(<vscale x 8 x half> %va, <vscale x 8 x half> %splat)
   ret <vscale x 8 x half> %vc
 }
 
-define <vscale x 8 x half> @vfadd_fv_nxv8f16(<vscale x 8 x half> %va, half %b) {
-; ZVFH-LABEL: vfadd_fv_nxv8f16:
+define <vscale x 8 x half> @vfmin_fv_nxv8f16(<vscale x 8 x half> %va, half %b) {
+; ZVFH-LABEL: vfmin_fv_nxv8f16:
 ; ZVFH:       # %bb.0:
 ; ZVFH-NEXT:    vsetvli a0, zero, e16, m2, ta, ma
-; ZVFH-NEXT:    vfmax.vf v8, v8, fa0
+; ZVFH-NEXT:    vfmin.vf v8, v8, fa0
 ; ZVFH-NEXT:    ret
 ;
-; ZVFHMIN-LABEL: vfadd_fv_nxv8f16:
+; ZVFHMIN-LABEL: vfmin_fv_nxv8f16:
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    fcvt.s.h fa5, fa0
 ; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, m2, ta, ma
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v12, v8
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m4, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vf v12, v12, fa5
+; ZVFHMIN-NEXT:    vfmin.vf v12, v12, fa5
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m2, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v12
 ; ZVFHMIN-NEXT:    ret
   %head = insertelement <vscale x 8 x half> poison, half %b, i32 0
   %splat = shufflevector <vscale x 8 x half> %head, <vscale x 8 x half> poison, <vscale x 8 x i32> zeroinitializer
-  %vc = call <vscale x 8 x half> @llvm.maximumnum.nxv8f16(<vscale x 8 x half> %splat, <vscale x 8 x half> %va)
+  %vc = call <vscale x 8 x half> @llvm.minimumnum.nxv8f16(<vscale x 8 x half> %splat, <vscale x 8 x half> %va)
   ret <vscale x 8 x half> %vc
 }
 
-define <vscale x 16 x half> @vfadd_vv_nxv16f16(<vscale x 16 x half> %va, <vscale x 16 x half> %vb) {
-; ZVFH-LABEL: vfadd_vv_nxv16f16:
+define <vscale x 16 x half> @vfmin_vv_nxv16f16(<vscale x 16 x half> %va, <vscale x 16 x half> %vb) {
+; ZVFH-LABEL: vfmin_vv_nxv16f16:
 ; ZVFH:       # %bb.0:
 ; ZVFH-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
-; ZVFH-NEXT:    vfmax.vv v8, v8, v12
+; ZVFH-NEXT:    vfmin.vv v8, v8, v12
 ; ZVFH-NEXT:    ret
 ;
-; ZVFHMIN-LABEL: vfadd_vv_nxv16f16:
+; ZVFHMIN-LABEL: vfmin_vv_nxv16f16:
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v16, v12
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v24, v8
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vv v16, v24, v16
+; ZVFHMIN-NEXT:    vfmin.vv v16, v24, v16
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v16
 ; ZVFHMIN-NEXT:    ret
-  %vc = call <vscale x 16 x half> @llvm.maximumnum.nxv16f16(<vscale x 16 x half> %va, <vscale x 16 x half> %vb)
+  %vc = call <vscale x 16 x half> @llvm.minimumnum.nxv16f16(<vscale x 16 x half> %va, <vscale x 16 x half> %vb)
   ret <vscale x 16 x half> %vc
 }
 
-define <vscale x 16 x half> @vfadd_vf_nxv16f16(<vscale x 16 x half> %va, half %b) {
-; ZVFH-LABEL: vfadd_vf_nxv16f16:
+define <vscale x 16 x half> @vfmin_vf_nxv16f16(<vscale x 16 x half> %va, half %b) {
+; ZVFH-LABEL: vfmin_vf_nxv16f16:
 ; ZVFH:       # %bb.0:
 ; ZVFH-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
-; ZVFH-NEXT:    vfmax.vf v8, v8, fa0
+; ZVFH-NEXT:    vfmin.vf v8, v8, fa0
 ; ZVFH-NEXT:    ret
 ;
-; ZVFHMIN-LABEL: vfadd_vf_nxv16f16:
+; ZVFHMIN-LABEL: vfmin_vf_nxv16f16:
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    fcvt.s.h fa5, fa0
 ; ZVFHMIN-NEXT:    vsetvli a0, zero, e16, m4, ta, ma
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v16, v8
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vf v16, v16, fa5
+; ZVFHMIN-NEXT:    vfmin.vf v16, v16, fa5
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v16
 ; ZVFHMIN-NEXT:    ret
   %head = insertelement <vscale x 16 x half> poison, half %b, i32 0
   %splat = shufflevector <vscale x 16 x half> %head, <vscale x 16 x half> poison, <vscale x 16 x i32> zeroinitializer
-  %vc = call <vscale x 16 x half> @llvm.maximumnum.nxv16f16(<vscale x 16 x half> %va, <vscale x 16 x half> %splat)
+  %vc = call <vscale x 16 x half> @llvm.minimumnum.nxv16f16(<vscale x 16 x half> %va, <vscale x 16 x half> %splat)
   ret <vscale x 16 x half> %vc
 }
 
-define <vscale x 32 x half> @vfadd_vv_nxv32f16(<vscale x 32 x half> %va, <vscale x 32 x half> %vb) {
-; ZVFH-LABEL: vfadd_vv_nxv32f16:
+define <vscale x 32 x half> @vfmin_vv_nxv32f16(<vscale x 32 x half> %va, <vscale x 32 x half> %vb) {
+; ZVFH-LABEL: vfmin_vv_nxv32f16:
 ; ZVFH:       # %bb.0:
 ; ZVFH-NEXT:    vsetvli a0, zero, e16, m8, ta, ma
-; ZVFH-NEXT:    vfmax.vv v8, v8, v16
+; ZVFH-NEXT:    vfmin.vv v8, v8, v16
 ; ZVFH-NEXT:    ret
 ;
-; ZVFHMIN-LABEL: vfadd_vv_nxv32f16:
+; ZVFHMIN-LABEL: vfmin_vv_nxv32f16:
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    addi sp, sp, -16
 ; ZVFHMIN-NEXT:    .cfi_def_cfa_offset 16
@@ -535,11 +535,11 @@ define <vscale x 32 x half> @vfadd_vv_nxv32f16(<vscale x 32 x half> %va, <vscale
 ; ZVFHMIN-NEXT:    vfwcvt.f.f.v v16, v12
 ; ZVFHMIN-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vv v0, v0, v8
+; ZVFHMIN-NEXT:    vfmin.vv v0, v0, v8
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v0
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vv v16, v16, v24
+; ZVFHMIN-NEXT:    vfmin.vv v16, v16, v24
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v12, v16
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
@@ -549,18 +549,18 @@ define <vscale x 32 x half> @vfadd_vv_nxv32f16(<vscale x 32 x half> %va, <vscale
 ; ZVFHMIN-NEXT:    addi sp, sp, 16
 ; ZVFHMIN-NEXT:    .cfi_def_cfa_offset 0
 ; ZVFHMIN-NEXT:    ret
-  %vc = call <vscale x 32 x half> @llvm.maximumnum.nxv32f16(<vscale x 32 x half> %va, <vscale x 32 x half> %vb)
+  %vc = call <vscale x 32 x half> @llvm.minimumnum.nxv32f16(<vscale x 32 x half> %va, <vscale x 32 x half> %vb)
   ret <vscale x 32 x half> %vc
 }
 
-define <vscale x 32 x half> @vfadd_vf_nxv32f16(<vscale x 32 x half> %va, half %b) {
-; ZVFH-LABEL: vfadd_vf_nxv32f16:
+define <vscale x 32 x half> @vfmin_vf_nxv32f16(<vscale x 32 x half> %va, half %b) {
+; ZVFH-LABEL: vfmin_vf_nxv32f16:
 ; ZVFH:       # %bb.0:
 ; ZVFH-NEXT:    vsetvli a0, zero, e16, m8, ta, ma
-; ZVFH-NEXT:    vfmax.vf v8, v8, fa0
+; ZVFH-NEXT:    vfmin.vf v8, v8, fa0
 ; ZVFH-NEXT:    ret
 ;
-; ZVFHMIN-LABEL: vfadd_vf_nxv32f16:
+; ZVFHMIN-LABEL: vfmin_vf_nxv32f16:
 ; ZVFHMIN:       # %bb.0:
 ; ZVFHMIN-NEXT:    addi sp, sp, -16
 ; ZVFHMIN-NEXT:    .cfi_def_cfa_offset 16
@@ -582,11 +582,11 @@ define <vscale x 32 x half> @vfadd_vf_nxv32f16(<vscale x 32 x half> %va, half %b
 ; ZVFHMIN-NEXT:    addi a0, sp, 16
 ; ZVFHMIN-NEXT:    vl8r.v v8, (a0) # vscale x 64-byte Folded Reload
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vv v0, v8, v0
+; ZVFHMIN-NEXT:    vfmin.vv v0, v8, v0
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v8, v0
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e32, m8, ta, ma
-; ZVFHMIN-NEXT:    vfmax.vv v16, v24, v16
+; ZVFHMIN-NEXT:    vfmin.vv v16, v24, v16
 ; ZVFHMIN-NEXT:    vsetvli zero, zero, e16, m4, ta, ma
 ; ZVFHMIN-NEXT:    vfncvt.f.f.w v12, v16
 ; ZVFHMIN-NEXT:    csrr a0, vlenb
@@ -598,256 +598,256 @@ define <vscale x 32 x half> @vfadd_vf_nxv32f16(<vscale x 32 x half> %va, half %b
 ; ZVFHMIN-NEXT:    ret
   %head = insertelement <vscale x 32 x half> poison, half %b, i32 0
   %splat = shufflevector <vscale x 32 x half> %head, <vscale x 32 x half> poison, <vscale x 32 x i32> zeroinitializer
-  %vc = call <vscale x 32 x half> @llvm.maximumnum.nxv32f16(<vscale x 32 x half> %va, <vscale x 32 x half> %splat)
+  %vc = call <vscale x 32 x half> @llvm.minimumnum.nxv32f16(<vscale x 32 x half> %va, <vscale x 32 x half> %splat)
   ret <vscale x 32 x half> %vc
 }
 
-define <vscale x 1 x float> @vfadd_vv_nxv1f32(<vscale x 1 x float> %va, <vscale x 1 x float> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv1f32:
+define <vscale x 1 x float> @vfmin_vv_nxv1f32(<vscale x 1 x float> %va, <vscale x 1 x float> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv1f32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e32, mf2, ta, ma
-; CHECK-NEXT:    vfmax.vv v8, v8, v9
+; CHECK-NEXT:    vfmin.vv v8, v8, v9
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 1 x float> @llvm.maximumnum.nxv1f32(<vscale x 1 x float> %va, <vscale x 1 x float> %vb)
+  %vc = call <vscale x 1 x float> @llvm.minimumnum.nxv1f32(<vscale x 1 x float> %va, <vscale x 1 x float> %vb)
   ret <vscale x 1 x float> %vc
 }
 
-define <vscale x 1 x float> @vfadd_vf_nxv1f32(<vscale x 1 x float> %va, float %b) {
-; CHECK-LABEL: vfadd_vf_nxv1f32:
+define <vscale x 1 x float> @vfmin_vf_nxv1f32(<vscale x 1 x float> %va, float %b) {
+; CHECK-LABEL: vfmin_vf_nxv1f32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e32, mf2, ta, ma
-; CHECK-NEXT:    vfmax.vf v8, v8, fa0
+; CHECK-NEXT:    vfmin.vf v8, v8, fa0
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 1 x float> poison, float %b, i32 0
   %splat = shufflevector <vscale x 1 x float> %head, <vscale x 1 x float> poison, <vscale x 1 x i32> zeroinitializer
-  %vc = call <vscale x 1 x float> @llvm.maximumnum.nxv1f32(<vscale x 1 x float> %va, <vscale x 1 x float> %splat)
+  %vc = call <vscale x 1 x float> @llvm.minimumnum.nxv1f32(<vscale x 1 x float> %va, <vscale x 1 x float> %splat)
   ret <vscale x 1 x float> %vc
 }
 
-define <vscale x 2 x float> @vfadd_vv_nxv2f32(<vscale x 2 x float> %va, <vscale x 2 x float> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv2f32:
+define <vscale x 2 x float> @vfmin_vv_nxv2f32(<vscale x 2 x float> %va, <vscale x 2 x float> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv2f32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e32, m1, ta, ma
-; CHECK-NEXT:    vfmax.vv v8, v8, v9
+; CHECK-NEXT:    vfmin.vv v8, v8, v9
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 2 x float> @llvm.maximumnum.nxv2f32(<vscale x 2 x float> %va, <vscale x 2 x float> %vb)
+  %vc = call <vscale x 2 x float> @llvm.minimumnum.nxv2f32(<vscale x 2 x float> %va, <vscale x 2 x float> %vb)
   ret <vscale x 2 x float> %vc
 }
 
-define <vscale x 2 x float> @vfadd_vf_nxv2f32(<vscale x 2 x float> %va, float %b) {
-; CHECK-LABEL: vfadd_vf_nxv2f32:
+define <vscale x 2 x float> @vfmin_vf_nxv2f32(<vscale x 2 x float> %va, float %b) {
+; CHECK-LABEL: vfmin_vf_nxv2f32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e32, m1, ta, ma
-; CHECK-NEXT:    vfmax.vf v8, v8, fa0
+; CHECK-NEXT:    vfmin.vf v8, v8, fa0
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 2 x float> poison, float %b, i32 0
   %splat = shufflevector <vscale x 2 x float> %head, <vscale x 2 x float> poison, <vscale x 2 x i32> zeroinitializer
-  %vc = call <vscale x 2 x float> @llvm.maximumnum.nxv2f32(<vscale x 2 x float> %va, <vscale x 2 x float> %splat)
+  %vc = call <vscale x 2 x float> @llvm.minimumnum.nxv2f32(<vscale x 2 x float> %va, <vscale x 2 x float> %splat)
   ret <vscale x 2 x float> %vc
 }
 
-define <vscale x 4 x float> @vfadd_vv_nxv4f32(<vscale x 4 x float> %va, <vscale x 4 x float> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv4f32:
+define <vscale x 4 x float> @vfmin_vv_nxv4f32(<vscale x 4 x float> %va, <vscale x 4 x float> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv4f32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e32, m2, ta, ma
-; CHECK-NEXT:    vfmax.vv v8, v8, v10
+; CHECK-NEXT:    vfmin.vv v8, v8, v10
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 4 x float> @llvm.maximumnum.nxv4f32(<vscale x 4 x float> %va, <vscale x 4 x float> %vb)
+  %vc = call <vscale x 4 x float> @llvm.minimumnum.nxv4f32(<vscale x 4 x float> %va, <vscale x 4 x float> %vb)
   ret <vscale x 4 x float> %vc
 }
 
-define <vscale x 4 x float> @vfadd_vf_nxv4f32(<vscale x 4 x float> %va, float %b) {
-; CHECK-LABEL: vfadd_vf_nxv4f32:
+define <vscale x 4 x float> @vfmin_vf_nxv4f32(<vscale x 4 x float> %va, float %b) {
+; CHECK-LABEL: vfmin_vf_nxv4f32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e32, m2, ta, ma
-; CHECK-NEXT:    vfmax.vf v8, v8, fa0
+; CHECK-NEXT:    vfmin.vf v8, v8, fa0
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 4 x float> poison, float %b, i32 0
   %splat = shufflevector <vscale x 4 x float> %head, <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer
-  %vc = call <vscale x 4 x float> @llvm.maximumnum.nxv4f32(<vscale x 4 x float> %va, <vscale x 4 x float> %splat)
+  %vc = call <vscale x 4 x float> @llvm.minimumnum.nxv4f32(<vscale x 4 x float> %va, <vscale x 4 x float> %splat)
   ret <vscale x 4 x float> %vc
 }
 
-define <vscale x 8 x float> @vfadd_vv_nxv8f32(<vscale x 8 x float> %va, <vscale x 8 x float> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv8f32:
+define <vscale x 8 x float> @vfmin_vv_nxv8f32(<vscale x 8 x float> %va, <vscale x 8 x float> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv8f32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e32, m4, ta, ma
-; CHECK-NEXT:    vfmax.vv v8, v8, v12
+; CHECK-NEXT:    vfmin.vv v8, v8, v12
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 8 x float> @llvm.maximumnum.nxv8f32(<vscale x 8 x float> %va, <vscale x 8 x float> %vb)
+  %vc = call <vscale x 8 x float> @llvm.minimumnum.nxv8f32(<vscale x 8 x float> %va, <vscale x 8 x float> %vb)
   ret <vscale x 8 x float> %vc
 }
 
-define <vscale x 8 x float> @vfadd_vf_nxv8f32(<vscale x 8 x float> %va, float %b) {
-; CHECK-LABEL: vfadd_vf_nxv8f32:
+define <vscale x 8 x float> @vfmin_vf_nxv8f32(<vscale x 8 x float> %va, float %b) {
+; CHECK-LABEL: vfmin_vf_nxv8f32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e32, m4, ta, ma
-; CHECK-NEXT:    vfmax.vf v8, v8, fa0
+; CHECK-NEXT:    vfmin.vf v8, v8, fa0
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 8 x float> poison, float %b, i32 0
   %splat = shufflevector <vscale x 8 x float> %head, <vscale x 8 x float> poison, <vscale x 8 x i32> zeroinitializer
-  %vc = call <vscale x 8 x float> @llvm.maximumnum.nxv8f32(<vscale x 8 x float> %va, <vscale x 8 x float> %splat)
+  %vc = call <vscale x 8 x float> @llvm.minimumnum.nxv8f32(<vscale x 8 x float> %va, <vscale x 8 x float> %splat)
   ret <vscale x 8 x float> %vc
 }
 
-define <vscale x 8 x float> @vfadd_fv_nxv8f32(<vscale x 8 x float> %va, float %b) {
-; CHECK-LABEL: vfadd_fv_nxv8f32:
+define <vscale x 8 x float> @vfmin_fv_nxv8f32(<vscale x 8 x float> %va, float %b) {
+; CHECK-LABEL: vfmin_fv_nxv8f32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e32, m4, ta, ma
-; CHECK-NEXT:    vfmax.vf v8, v8, fa0
+; CHECK-NEXT:    vfmin.vf v8, v8, fa0
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 8 x float> poison, float %b, i32 0
   %splat = shufflevector <vscale x 8 x float> %head, <vscale x 8 x float> poison, <vscale x 8 x i32> zeroinitializer
-  %vc = call <vscale x 8 x float> @llvm.maximumnum.nxv8f32(<vscale x 8 x float> %splat, <vscale x 8 x float> %va)
+  %vc = call <vscale x 8 x float> @llvm.minimumnum.nxv8f32(<vscale x 8 x float> %splat, <vscale x 8 x float> %va)
   ret <vscale x 8 x float> %vc
 }
 
-define <vscale x 16 x float> @vfadd_vv_nxv16f32(<vscale x 16 x float> %va, <vscale x 16 x float> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv16f32:
+define <vscale x 16 x float> @vfmin_vv_nxv16f32(<vscale x 16 x float> %va, <vscale x 16 x float> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv16f32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e32, m8, ta, ma
-; CHECK-NEXT:    vfmax.vv v8, v8, v16
+; CHECK-NEXT:    vfmin.vv v8, v8, v16
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 16 x float> @llvm.maximumnum.nxv16f32(<vscale x 16 x float> %va, <vscale x 16 x float> %vb)
+  %vc = call <vscale x 16 x float> @llvm.minimumnum.nxv16f32(<vscale x 16 x float> %va, <vscale x 16 x float> %vb)
   ret <vscale x 16 x float> %vc
 }
 
-define <vscale x 16 x float> @vfadd_vf_nxv16f32(<vscale x 16 x float> %va, float %b) {
-; CHECK-LABEL: vfadd_vf_nxv16f32:
+define <vscale x 16 x float> @vfmin_vf_nxv16f32(<vscale x 16 x float> %va, float %b) {
+; CHECK-LABEL: vfmin_vf_nxv16f32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e32, m8, ta, ma
-; CHECK-NEXT:    vfmax.vf v8, v8, fa0
+; CHECK-NEXT:    vfmin.vf v8, v8, fa0
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 16 x float> poison, float %b, i32 0
   %splat = shufflevector <vscale x 16 x float> %head, <vscale x 16 x float> poison, <vscale x 16 x i32> zeroinitializer
-  %vc = call <vscale x 16 x float> @llvm.maximumnum.nxv16f32(<vscale x 16 x float> %va, <vscale x 16 x float> %splat)
+  %vc = call <vscale x 16 x float> @llvm.minimumnum.nxv16f32(<vscale x 16 x float> %va, <vscale x 16 x float> %splat)
   ret <vscale x 16 x float> %vc
 }
 
-define <vscale x 1 x double> @vfadd_vv_nxv1f64(<vscale x 1 x double> %va, <vscale x 1 x double> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv1f64:
+define <vscale x 1 x double> @vfmin_vv_nxv1f64(<vscale x 1 x double> %va, <vscale x 1 x double> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv1f64:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vfmax.vv v8, v8, v9
+; CHECK-NEXT:    vfmin.vv v8, v8, v9
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 1 x double> @llvm.maximumnum.nxv1f64(<vscale x 1 x double> %va, <vscale x 1 x double> %vb)
+  %vc = call <vscale x 1 x double> @llvm.minimumnum.nxv1f64(<vscale x 1 x double> %va, <vscale x 1 x double> %vb)
   ret <vscale x 1 x double> %vc
 }
 
-define <vscale x 1 x double> @vfadd_vf_nxv1f64(<vscale x 1 x double> %va, double %b) {
-; CHECK-LABEL: vfadd_vf_nxv1f64:
+define <vscale x 1 x double> @vfmin_vf_nxv1f64(<vscale x 1 x double> %va, double %b) {
+; CHECK-LABEL: vfmin_vf_nxv1f64:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
-; CHECK-NEXT:    vfmax.vf v8, v8, fa0
+; CHECK-NEXT:    vfmin.vf v8, v8, fa0
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 1 x double> poison, double %b, i32 0
   %splat = shufflevector <vscale x 1 x double> %head, <vscale x 1 x double> poison, <vscale x 1 x i32> zeroinitializer
-  %vc = call <vscale x 1 x double> @llvm.maximumnum.nxv1f64(<vscale x 1 x double> %va, <vscale x 1 x double> %splat)
+  %vc = call <vscale x 1 x double> @llvm.minimumnum.nxv1f64(<vscale x 1 x double> %va, <vscale x 1 x double> %splat)
   ret <vscale x 1 x double> %vc
 }
 
-define <vscale x 2 x double> @vfadd_vv_nxv2f64(<vscale x 2 x double> %va, <vscale x 2 x double> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv2f64:
+define <vscale x 2 x double> @vfmin_vv_nxv2f64(<vscale x 2 x double> %va, <vscale x 2 x double> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv2f64:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m2, ta, ma
-; CHECK-NEXT:    vfmax.vv v8, v8, v10
+; CHECK-NEXT:    vfmin.vv v8, v8, v10
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 2 x double> @llvm.maximumnum.nxv2f64(<vscale x 2 x double> %va, <vscale x 2 x double> %vb)
+  %vc = call <vscale x 2 x double> @llvm.minimumnum.nxv2f64(<vscale x 2 x double> %va, <vscale x 2 x double> %vb)
   ret <vscale x 2 x double> %vc
 }
 
-define <vscale x 2 x double> @vfadd_vf_nxv2f64(<vscale x 2 x double> %va, double %b) {
-; CHECK-LABEL: vfadd_vf_nxv2f64:
+define <vscale x 2 x double> @vfmin_vf_nxv2f64(<vscale x 2 x double> %va, double %b) {
+; CHECK-LABEL: vfmin_vf_nxv2f64:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m2, ta, ma
-; CHECK-NEXT:    vfmax.vf v8, v8, fa0
+; CHECK-NEXT:    vfmin.vf v8, v8, fa0
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 2 x double> poison, double %b, i32 0
   %splat = shufflevector <vscale x 2 x double> %head, <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer
-  %vc = call <vscale x 2 x double> @llvm.maximumnum.nxv2f64(<vscale x 2 x double> %va, <vscale x 2 x double> %splat)
+  %vc = call <vscale x 2 x double> @llvm.minimumnum.nxv2f64(<vscale x 2 x double> %va, <vscale x 2 x double> %splat)
   ret <vscale x 2 x double> %vc
 }
 
-define <vscale x 4 x double> @vfadd_vv_nxv4f64(<vscale x 4 x double> %va, <vscale x 4 x double> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv4f64:
+define <vscale x 4 x double> @vfmin_vv_nxv4f64(<vscale x 4 x double> %va, <vscale x 4 x double> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv4f64:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m4, ta, ma
-; CHECK-NEXT:    vfmax.vv v8, v8, v12
+; CHECK-NEXT:    vfmin.vv v8, v8, v12
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 4 x double> @llvm.maximumnum.nxv4f64(<vscale x 4 x double> %va, <vscale x 4 x double> %vb)
+  %vc = call <vscale x 4 x double> @llvm.minimumnum.nxv4f64(<vscale x 4 x double> %va, <vscale x 4 x double> %vb)
   ret <vscale x 4 x double> %vc
 }
 
-define <vscale x 4 x double> @vfadd_vf_nxv4f64(<vscale x 4 x double> %va, double %b) {
-; CHECK-LABEL: vfadd_vf_nxv4f64:
+define <vscale x 4 x double> @vfmin_vf_nxv4f64(<vscale x 4 x double> %va, double %b) {
+; CHECK-LABEL: vfmin_vf_nxv4f64:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m4, ta, ma
-; CHECK-NEXT:    vfmax.vf v8, v8, fa0
+; CHECK-NEXT:    vfmin.vf v8, v8, fa0
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 4 x double> poison, double %b, i32 0
   %splat = shufflevector <vscale x 4 x double> %head, <vscale x 4 x double> poison, <vscale x 4 x i32> zeroinitializer
-  %vc = call <vscale x 4 x double> @llvm.maximumnum.nxv4f64(<vscale x 4 x double> %va, <vscale x 4 x double> %splat)
+  %vc = call <vscale x 4 x double> @llvm.minimumnum.nxv4f64(<vscale x 4 x double> %va, <vscale x 4 x double> %splat)
   ret <vscale x 4 x double> %vc
 }
 
-define <vscale x 8 x double> @vfadd_vv_nxv8f64(<vscale x 8 x double> %va, <vscale x 8 x double> %vb) {
-; CHECK-LABEL: vfadd_vv_nxv8f64:
+define <vscale x 8 x double> @vfmin_vv_nxv8f64(<vscale x 8 x double> %va, <vscale x 8 x double> %vb) {
+; CHECK-LABEL: vfmin_vv_nxv8f64:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m8, ta, ma
-; CHECK-NEXT:    vfmax.vv v8, v8, v16
+; CHECK-NEXT:    vfmin.vv v8, v8, v16
 ; CHECK-NEXT:    ret
-  %vc = call <vscale x 8 x double> @llvm.maximumnum.nxv8f64(<vscale x 8 x double> %va, <vscale x 8 x double> %vb)
+  %vc = call <vscale x 8 x double> @llvm.minimumnum.nxv8f64(<vscale x 8 x double> %va, <vscale x 8 x double> %vb)
   ret <vscale x 8 x double> %vc
 }
 
-define <vscale x 8 x double> @vfadd_vf_nxv8f64(<vscale x 8 x double> %va, double %b) {
-; CHECK-LABEL: vfadd_vf_nxv8f64:
+define <vscale x 8 x double> @vfmin_vf_nxv8f64(<vscale x 8 x double> %va, double %b) {
+; CHECK-LABEL: vfmin_vf_nxv8f64:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m8, ta, ma
-; CHECK-NEXT:    vfmax.vf v8, v8, fa0
+; CHECK-NEXT:    vfmin.vf v8, v8, fa0
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 8 x double> poison, double %b, i32 0
   %splat = shufflevector <vscale x 8 x double> %head, <vscale x 8 x double> poison, <vscale x 8 x i32> zeroinitializer
-  %vc = call <vscale x 8 x double> @llvm.maximumnum.nxv8f64(<vscale x 8 x double> %va, <vscale x 8 x double> %splat)
+  %vc = call <vscale x 8 x double> @llvm.minimumnum.nxv8f64(<vscale x 8 x double> %va, <vscale x 8 x double> %splat)
   ret <vscale x 8 x double> %vc
 }
 
-define <vscale x 8 x double> @vfadd_fv_nxv8f64(<vscale x 8 x double> %va, double %b) {
-; CHECK-LABEL: vfadd_fv_nxv8f64:
+define <vscale x 8 x double> @vfmin_fv_nxv8f64(<vscale x 8 x double> %va, double %b) {
+; CHECK-LABEL: vfmin_fv_nxv8f64:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e64, m8, ta, ma
-; CHECK-NEXT:    vfmax.vf v8, v8, fa0
+; CHECK-NEXT:    vfmin.vf v8, v8, fa0
 ; CHECK-NEXT:    ret
   %head = insertelement <vscale x 8 x double> poison, double %b, i32 0
   %splat = shufflevector <vscale x 8 x double> %head, <vscale x 8 x double> poison, <vscale x 8 x i32> zeroinitializer
-  %vc = call <vscale x 8 x double> @llvm.maximumnum.nxv8f64(<vscale x 8 x double> %splat, <vscale x 8 x double> %va)
+  %vc = call <vscale x 8 x double> @llvm.minimumnum.nxv8f64(<vscale x 8 x double> %splat, <vscale x 8 x double> %va)
   ret <vscale x 8 x double> %vc
 }
 
-define <vscale x 8 x float> @vfadd_vv_mask_nxv8f32(<vscale x 8 x float> %va, <vscale x 8 x float> %vb, <vscale x 8 x i1> %mask) {
-; CHECK-LABEL: vfadd_vv_mask_nxv8f32:
+define <vscale x 8 x float> @vfmin_vv_mask_nxv8f32(<vscale x 8 x float> %va, <vscale x 8 x float> %vb, <vscale x 8 x i1> %mask) {
+; CHECK-LABEL: vfmin_vv_mask_nxv8f32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e32, m4, ta, ma
 ; CHECK-NEXT:    vmv.v.i v16, 0
 ; CHECK-NEXT:    vmerge.vvm v12, v16, v12, v0
-; CHECK-NEXT:    vfmax.vv v8, v8, v12
+; CHECK-NEXT:    vfmin.vv v8, v8, v12
 ; CHECK-NEXT:    ret
   %vs = select <vscale x 8 x i1> %mask, <vscale x 8 x float> %vb, <vscale x 8 x float> splat (float 0.0)
-  %vc = call fast <vscale x 8 x float> @llvm.maximumnum.nxv8f32(<vscale x 8 x float> %va, <vscale x 8 x float> %vs)
+  %vc = call fast <vscale x 8 x float> @llvm.minimumnum.nxv8f32(<vscale x 8 x float> %va, <vscale x 8 x float> %vs)
   ret <vscale x 8 x float> %vc
 }
 
-define <vscale x 8 x float> @vfadd_vf_mask_nxv8f32(<vscale x 8 x float> %va, float %b, <vscale x 8 x i1> %mask) {
-; CHECK-LABEL: vfadd_vf_mask_nxv8f32:
+define <vscale x 8 x float> @vfmin_vf_mask_nxv8f32(<vscale x 8 x float> %va, float %b, <vscale x 8 x i1> %mask) {
+; CHECK-LABEL: vfmin_vf_mask_nxv8f32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e32, m4, ta, ma
 ; CHECK-NEXT:    vmv.v.i v12, 0
 ; CHECK-NEXT:    vfmerge.vfm v12, v12, fa0, v0
-; CHECK-NEXT:    vfmax.vv v8, v8, v12
+; CHECK-NEXT:    vfmin.vv v8, v8, v12
 ; CHECK-NEXT:    ret
   %head1 = insertelement <vscale x 8 x float> poison, float %b, i32 0
   %splat1 = shufflevector <vscale x 8 x float> %head1, <vscale x 8 x float> poison, <vscale x 8 x i32> zeroinitializer
   %vs = select <vscale x 8 x i1> %mask, <vscale x 8 x float> %splat1, <vscale x 8 x float> splat (float 0.0)
-  %vc = call fast <vscale x 8 x float> @llvm.maximumnum.nxv8f32(<vscale x 8 x float> %va, <vscale x 8 x float> %vs)
+  %vc = call fast <vscale x 8 x float> @llvm.minimumnum.nxv8f32(<vscale x 8 x float> %va, <vscale x 8 x float> %vs)
   ret <vscale x 8 x float> %vc
 }
diff --git a/llvm/test/CodeGen/RISCV/rvv/mask-reg-alloc.mir b/llvm/test/CodeGen/RISCV/rvv/mask-reg-alloc.mir
index a967f86f5b930..ba5a05e89c506 100644
--- a/llvm/test/CodeGen/RISCV/rvv/mask-reg-alloc.mir
+++ b/llvm/test/CodeGen/RISCV/rvv/mask-reg-alloc.mir
@@ -24,8 +24,8 @@ body:             |
     ; CHECK-NEXT: PseudoRET implicit $v0
     %0:vr = COPY $v0
     %1:vr = COPY $v1
-    %2:vr = COPY $v2
-    %3:vr = COPY $v3
+    %2:vrnov0 = COPY $v2
+    %3:vrnov0 = COPY $v3
     %4:vmv0 = COPY %0
     %5:vrnov0 = PseudoVMERGE_VIM_M1 undef $noreg, killed %2, 1, %4, 1, 3
     %6:vmv0 = COPY %1
diff --git a/llvm/test/CodeGen/RISCV/rvv/pr88576.ll b/llvm/test/CodeGen/RISCV/rvv/pr88576.ll
index dd7debd3ab046..a094e35ffde97 100644
--- a/llvm/test/CodeGen/RISCV/rvv/pr88576.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/pr88576.ll
@@ -31,9 +31,9 @@ define i1 @foo(<vscale x 16 x i8> %x, i64 %y) {
 ; CHECK-NEXT:    add a0, a2, a0
 ; CHECK-NEXT:    add a1, a2, a1
 ; CHECK-NEXT:    vmerge.vim v24, v16, 1, v0
-; CHECK-NEXT:    vs8r.v v24, (a1)
 ; CHECK-NEXT:    vmv1r.v v0, v8
 ; CHECK-NEXT:    vmerge.vim v8, v16, 1, v0
+; CHECK-NEXT:    vs8r.v v24, (a1)
 ; CHECK-NEXT:    vs8r.v v8, (a2)
 ; CHECK-NEXT:    lbu a0, 0(a0)
 ; CHECK-NEXT:    addi sp, s0, -80
diff --git a/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-to-vmv.mir b/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-to-vmv.mir
index a7eaf39793236..c73c2004834db 100644
--- a/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-to-vmv.mir
+++ b/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-to-vmv.mir
@@ -10,13 +10,13 @@ body: |
     ; CHECK-LABEL: name: undef_passthru
     ; CHECK: liveins: $x1, $v8, $v9
     ; CHECK-NEXT: {{  $}}
-    ; CHECK-NEXT: %false:vr = COPY $v8
-    ; CHECK-NEXT: %true:vr = COPY $v9
+    ; CHECK-NEXT: %false:vrnov0 = COPY $v8
+    ; CHECK-NEXT: %true:vrnov0 = COPY $v9
     ; CHECK-NEXT: %avl:gprnox0 = COPY $x1
     ; CHECK-NEXT: %mask:vmv0 = PseudoVMSET_M_B8 %avl, 0 /* e8 */
     ; CHECK-NEXT: $v0 = COPY %mask
-    %false:vr = COPY $v8
-    %true:vr = COPY $v9
+    %false:vrnov0 = COPY $v8
+    %true:vrnov0 = COPY $v9
     %avl:gprnox0 = COPY $x1
     %mask:vmv0 = PseudoVMSET_M_B8 %avl, 0
     $v0 = COPY %mask
@@ -31,15 +31,15 @@ body: |
     ; CHECK: liveins: $x1, $v8, $v9
     ; CHECK-NEXT: {{  $}}
     ; CHECK-NEXT: %pt:vr = COPY $v8
-    ; CHECK-NEXT: %false:vr = COPY $noreg
-    ; CHECK-NEXT: %true:vr = COPY $v9
+    ; CHECK-NEXT: %false:vrnov0 = COPY $noreg
+    ; CHECK-NEXT: %true:vrnov0 = COPY $v9
     ; CHECK-NEXT: %avl:gprnox0 = COPY $x1
     ; CHECK-NEXT: %mask:vmv0 = PseudoVMSET_M_B8 %avl, 0 /* e8 */
     ; CHECK-NEXT: $v0 = COPY %mask
     ; CHECK-NEXT: %x:vr = PseudoVMV_V_V_M1 %pt, %true, %avl, 5 /* e32 */, 0 /* tu, mu */
     %pt:vrnov0 = COPY $v8
-    %false:vr = COPY $noreg
-    %true:vr = COPY $v9
+    %false:vrnov0 = COPY $noreg
+    %true:vrnov0 = COPY $v9
     %avl:gprnox0 = COPY $x1
     %mask:vmv0 = PseudoVMSET_M_B8 %avl, 0
     $v0 = COPY %mask
@@ -53,15 +53,15 @@ body: |
     ; CHECK-LABEL: name: equal_passthru_false
     ; CHECK: liveins: $x1, $v8, $v9
     ; CHECK-NEXT: {{  $}}
-    ; CHECK-NEXT: %false:vr = COPY $v8
+    ; CHECK-NEXT: %false:vrnov0 = COPY $v8
     ; CHECK-NEXT: %pt:vr = COPY $v8
-    ; CHECK-NEXT: %true:vr = COPY $v9
+    ; CHECK-NEXT: %true:vrnov0 = COPY $v9
     ; CHECK-NEXT: %avl:gprnox0 = COPY $x1
     ; CHECK-NEXT: %mask:vmv0 = PseudoVMSET_M_B8 %avl, 0 /* e8 */
     ; CHECK-NEXT: %x:vr = PseudoVMV_V_V_M1 %pt, %true, %avl, 5 /* e32 */, 0 /* tu, mu */
-    %false:vr = COPY $v8
+    %false:vrnov0 = COPY $v8
     %pt:vrnov0 = COPY $v8
-    %true:vr = COPY $v9
+    %true:vrnov0 = COPY $v9
     %avl:gprnox0 = COPY $x1
     %mask:vmv0 = PseudoVMSET_M_B8 %avl, 0
     %x:vrnov0 = PseudoVMERGE_VVM_M1 %pt, %false, %true, %mask, %avl, 5
@@ -136,7 +136,7 @@ body: |
     ; CHECK-NEXT: %false:vrnov0 = COPY $v8
     ; CHECK-NEXT: %mask:vmv0 = COPY $v0
     ; CHECK-NEXT: %true:vrnov0 = PseudoVADD_VV_M1_MASK %false, $noreg, $noreg, %mask, 4, 5 /* e32 */, 1 /* ta, mu */
-    %false:vr = COPY $v8
+    %false:vrnov0 = COPY $v8
     %mask:vmv0 = COPY $v0
     %true:vrnov0 = PseudoVADD_VV_M1_MASK $noreg, $noreg, $noreg, %mask, 4, 5 /* e32 */, 0 /* tu, mu */
     %x:vrnov0 = PseudoVMERGE_VVM_M1 $noreg, %false, %true, %mask, 4, 5 /* e32 */
@@ -150,7 +150,7 @@ body: |
   ; CHECK-NEXT:   successors: %bb.1(0x80000000)
   ; CHECK-NEXT:   liveins: $v8, $v0
   ; CHECK-NEXT: {{  $}}
-  ; CHECK-NEXT:   %false:vr = COPY $v8
+  ; CHECK-NEXT:   %false:vrnov0 = COPY $v8
   ; CHECK-NEXT:   %mask:vmv0 = COPY $v0
   ; CHECK-NEXT:   %true:vrnov0 = PseudoVADD_VV_M1_MASK $noreg, $noreg, $noreg, %mask, 4, 5 /* e32 */, 0 /* tu, mu */
   ; CHECK-NEXT: {{  $}}
@@ -158,7 +158,7 @@ body: |
   ; CHECK-NEXT:   [[PseudoVMERGE_VVM_M1_:%[0-9]+]]:vrnov0 = PseudoVMERGE_VVM_M1 $noreg, %false, %true, %mask, 4, 5 /* e32 */
   bb.0:
     liveins: $v8, $v0
-    %false:vr = COPY $v8
+    %false:vrnov0 = COPY $v8
     %mask:vmv0 = COPY $v0
     %true:vrnov0 = PseudoVADD_VV_M1_MASK $noreg, $noreg, $noreg, %mask, 4, 5 /* e32 */, 0 /* tu, mu */
   bb.1:
@@ -174,14 +174,14 @@ body: |
     ; CHECK: liveins: $v8, $v9, $v0, $x8, $x9
     ; CHECK-NEXT: {{  $}}
     ; CHECK-NEXT: %pt:vrnov0 = COPY $v8
-    ; CHECK-NEXT: %false:vr = COPY $v9
+    ; CHECK-NEXT: %false:vrnov0 = COPY $v9
     ; CHECK-NEXT: %mask:vmv0 = COPY $v0
     ; CHECK-NEXT: %avl1:gprnox0 = COPY $x8
     ; CHECK-NEXT: %avl2:gprnox0 = COPY $x9
     ; CHECK-NEXT: %true:vrnov0 = PseudoVADD_VV_M1_MASK $noreg, $noreg, $noreg, %mask, %avl1, 5 /* e32 */, 3 /* ta, ma */
     ; CHECK-NEXT: [[PseudoVMERGE_VVM_M1_:%[0-9]+]]:vrnov0 = PseudoVMERGE_VVM_M1 %pt, %false, %true, %mask, %avl2, 5 /* e32 */
     %pt:vrnov0 = COPY $v8
-    %false:vr = COPY $v9
+    %false:vrnov0 = COPY $v9
     %mask:vmv0 = COPY $v0
     %avl1:gprnox0 = COPY $x8
     %avl2:gprnox0 = COPY $x9
@@ -203,7 +203,7 @@ body: |
     ; CHECK-NEXT: %true:vrnov0 = PseudoVADD_VV_M1_MASK %false, $noreg, $noreg, %mask, 1, 5 /* e32 */, 3 /* ta, ma */
     ; CHECK-NEXT: [[PseudoVMV_V_V_M1_:%[0-9]+]]:vr = PseudoVMV_V_V_M1 %pt, %true, 1, 5 /* e32 */, 0 /* tu, mu */
     %pt:vrnov0 = COPY $v8
-    %false:vr = COPY $v9
+    %false:vrnov0 = COPY $v9
     %mask:vmv0 = COPY $v0
     %true:vrnov0 = PseudoVADD_VV_M1_MASK $noreg, $noreg, $noreg, %mask, 2, 5 /* e32 */, 3 /* ta, ma */
     %5:vrnov0 = PseudoVMERGE_VVM_M1 %pt, %false, %true, %mask, 1, 5 /* e32 */
diff --git a/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll b/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll
index acd9519bb5a8e..5be32cc35fe37 100644
--- a/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/rvv-peephole-vmerge-vops.ll
@@ -867,8 +867,9 @@ define void @test_dag_loop() {
 ; CHECK-NEXT:    vmseq.vv v0, v12, v8
 ; CHECK-NEXT:    vsetvli zero, zero, e16, m8, ta, ma
 ; CHECK-NEXT:    vmv.v.i v8, 0
-; CHECK-NEXT:    vsetvli zero, zero, e16, m8, tu, mu
+; CHECK-NEXT:    vsetivli zero, 1, e16, m8, tu, mu
 ; CHECK-NEXT:    vle16.v v8, (zero), v0.t
+; CHECK-NEXT:    vsetivli zero, 0, e16, m8, ta, ma
 ; CHECK-NEXT:    vse16.v v8, (zero)
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/RISCV/rvv/vector-splice.ll b/llvm/test/CodeGen/RISCV/rvv/vector-splice.ll
index e3f43cd904198..cc389236df3ff 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vector-splice.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vector-splice.ll
@@ -516,15 +516,15 @@ define <vscale x 64 x i1> @splice_nxv64i1_offset_negone(<vscale x 64 x i1> %a, <
 ; NOVLDEP-NEXT:    vsetvli a0, zero, e8, m8, ta, ma
 ; NOVLDEP-NEXT:    vmv1r.v v9, v0
 ; NOVLDEP-NEXT:    vmv1r.v v0, v8
-; NOVLDEP-NEXT:    vmv.v.i v24, 0
+; NOVLDEP-NEXT:    vmv.v.i v16, 0
 ; NOVLDEP-NEXT:    csrr a0, vlenb
-; NOVLDEP-NEXT:    vmerge.vim v16, v24, 1, v0
+; NOVLDEP-NEXT:    vmerge.vim v24, v16, 1, v0
 ; NOVLDEP-NEXT:    vmv1r.v v0, v9
-; NOVLDEP-NEXT:    vmerge.vim v8, v24, 1, v0
+; NOVLDEP-NEXT:    vmerge.vim v8, v16, 1, v0
 ; NOVLDEP-NEXT:    slli a0, a0, 3
 ; NOVLDEP-NEXT:    addi a0, a0, -1
 ; NOVLDEP-NEXT:    vslidedown.vx v8, v8, a0
-; NOVLDEP-NEXT:    vslideup.vi v8, v16, 1
+; NOVLDEP-NEXT:    vslideup.vi v8, v24, 1
 ; NOVLDEP-NEXT:    vand.vi v8, v8, 1
 ; NOVLDEP-NEXT:    vmsne.vi v0, v8, 0
 ; NOVLDEP-NEXT:    ret
@@ -534,17 +534,17 @@ define <vscale x 64 x i1> @splice_nxv64i1_offset_negone(<vscale x 64 x i1> %a, <
 ; VLDEP-NEXT:    vsetvli a0, zero, e8, m8, ta, ma
 ; VLDEP-NEXT:    vmv1r.v v9, v0
 ; VLDEP-NEXT:    vmv1r.v v0, v8
-; VLDEP-NEXT:    vmv.v.i v24, 0
+; VLDEP-NEXT:    vmv.v.i v16, 0
 ; VLDEP-NEXT:    csrr a0, vlenb
-; VLDEP-NEXT:    vmerge.vim v16, v24, 1, v0
+; VLDEP-NEXT:    vmerge.vim v24, v16, 1, v0
 ; VLDEP-NEXT:    vmv1r.v v0, v9
-; VLDEP-NEXT:    vmerge.vim v8, v24, 1, v0
+; VLDEP-NEXT:    vmerge.vim v8, v16, 1, v0
 ; VLDEP-NEXT:    slli a0, a0, 3
 ; VLDEP-NEXT:    addi a0, a0, -1
 ; VLDEP-NEXT:    vsetivli zero, 1, e8, m8, ta, ma
 ; VLDEP-NEXT:    vslidedown.vx v8, v8, a0
 ; VLDEP-NEXT:    vsetvli a0, zero, e8, m8, ta, ma
-; VLDEP-NEXT:    vslideup.vi v8, v16, 1
+; VLDEP-NEXT:    vslideup.vi v8, v24, 1
 ; VLDEP-NEXT:    vand.vi v8, v8, 1
 ; VLDEP-NEXT:    vmsne.vi v0, v8, 0
 ; VLDEP-NEXT:    ret
@@ -558,16 +558,16 @@ define <vscale x 64 x i1> @splice_nxv64i1_offset_max(<vscale x 64 x i1> %a, <vsc
 ; NOVLDEP-NEXT:    vsetvli a0, zero, e8, m8, ta, ma
 ; NOVLDEP-NEXT:    vmv1r.v v9, v0
 ; NOVLDEP-NEXT:    vmv1r.v v0, v8
-; NOVLDEP-NEXT:    vmv.v.i v24, 0
+; NOVLDEP-NEXT:    vmv.v.i v16, 0
 ; NOVLDEP-NEXT:    li a0, 127
-; NOVLDEP-NEXT:    vmerge.vim v16, v24, 1, v0
+; NOVLDEP-NEXT:    vmerge.vim v24, v16, 1, v0
 ; NOVLDEP-NEXT:    vmv1r.v v0, v9
-; NOVLDEP-NEXT:    vmerge.vim v8, v24, 1, v0
+; NOVLDEP-NEXT:    vmerge.vim v8, v16, 1, v0
 ; NOVLDEP-NEXT:    vslidedown.vx v8, v8, a0
 ; NOVLDEP-NEXT:    csrr a0, vlenb
 ; NOVLDEP-NEXT:    slli a0, a0, 3
 ; NOVLDEP-NEXT:    addi a0, a0, -127
-; NOVLDEP-NEXT:    vslideup.vx v8, v16, a0
+; NOVLDEP-NEXT:    vslideup.vx v8, v24, a0
 ; NOVLDEP-NEXT:    vand.vi v8, v8, 1
 ; NOVLDEP-NEXT:    vmsne.vi v0, v8, 0
 ; NOVLDEP-NEXT:    ret
diff --git a/llvm/test/CodeGen/RISCV/rvv/vl-opt-op-info.mir b/llvm/test/CodeGen/RISCV/rvv/vl-opt-op-info.mir
index cd85853c2d12c..81a2388421cee 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vl-opt-op-info.mir
+++ b/llvm/test/CodeGen/RISCV/rvv/vl-opt-op-info.mir
@@ -1346,10 +1346,10 @@ name: vmerge_vim
 body: |
   bb.0:
     ; CHECK-LABEL: name: vmerge_vim
-    ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK: %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
     ; CHECK-NEXT: %y:vrnov0 = PseudoVMERGE_VIM_M1 $noreg, %x, 9, $v0, 1, 3 /* e8 */
     ; CHECK-NEXT: $v8 = COPY %y
-    %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
+    %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
     %y:vrnov0 = PseudoVMERGE_VIM_M1 $noreg, %x, 9, $v0, 1, 3 /* e8 */
     $v8 = COPY %y
 ...
@@ -1358,10 +1358,10 @@ name: vmerge_vim_incompatible_eew
 body: |
   bb.0:
     ; CHECK-LABEL: name: vmerge_vim_incompatible_eew
-    ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 4 /* e16 */, 0 /* tu, mu */
+    ; CHECK: %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 4 /* e16 */, 0 /* tu, mu */
     ; CHECK-NEXT: %y:vrnov0 = PseudoVMERGE_VIM_M1 $noreg, %x, 9, $v0, 1, 3 /* e8 */
     ; CHECK-NEXT: $v8 = COPY %y
-    %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 4 /* e16 */, 0
+    %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 4 /* e16 */, 0
     %y:vrnov0 = PseudoVMERGE_VIM_M1 $noreg, %x, 9, $v0, 1, 3 /* e8 */
     $v8 = COPY %y
 ...
@@ -1370,10 +1370,10 @@ name: vmerge_vim_incompatible_emul
 body: |
   bb.0:
     ; CHECK-LABEL: name: vmerge_vim_incompatible_emul
-    ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK: %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
     ; CHECK-NEXT: %y:vrnov0 = PseudoVMERGE_VIM_MF2 $noreg, %x, 9, $v0, 1, 3 /* e8 */
     ; CHECK-NEXT: $v8 = COPY %y
-    %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
+    %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
     %y:vrnov0 = PseudoVMERGE_VIM_MF2 $noreg, %x, 9, $v0, 1, 3 /* e8 */
     $v8 = COPY %y
 ...
@@ -1382,10 +1382,10 @@ name: vmerge_vxm
 body: |
   bb.0:
     ; CHECK-LABEL: name: vmerge_vxm
-    ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK: %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
     ; CHECK-NEXT: %y:vrnov0 = PseudoVMERGE_VXM_M1 $noreg, %x, $noreg, $v0, 1, 3 /* e8 */
     ; CHECK-NEXT: $v8 = COPY %y
-    %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
+    %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
     %y:vrnov0 = PseudoVMERGE_VXM_M1 $noreg, %x, $noreg, $v0, 1, 3 /* e8 */
     $v8 = COPY %y
 ...
@@ -1394,10 +1394,10 @@ name: vmerge_vxm_incompatible_eew
 body: |
   bb.0:
     ; CHECK-LABEL: name: vmerge_vxm_incompatible_eew
-    ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 4 /* e16 */, 0 /* tu, mu */
+    ; CHECK: %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 4 /* e16 */, 0 /* tu, mu */
     ; CHECK-NEXT: %y:vrnov0 = PseudoVMERGE_VXM_M1 $noreg, %x, $noreg, $v0, 1, 3 /* e8 */
     ; CHECK-NEXT: $v8 = COPY %y
-    %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 4 /* e16 */, 0
+    %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 4 /* e16 */, 0
     %y:vrnov0 = PseudoVMERGE_VXM_M1 $noreg, %x, $noreg, $v0, 1, 3 /* e8 */
     $v8 = COPY %y
 ...
@@ -1406,10 +1406,10 @@ name: vmerge_vxm_incompatible_emul
 body: |
   bb.0:
     ; CHECK-LABEL: name: vmerge_vxm_incompatible_emul
-    ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK: %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
     ; CHECK-NEXT: %y:vrnov0 = PseudoVMERGE_VXM_MF2 $noreg, %x, $noreg, $v0, 1, 3 /* e8 */
     ; CHECK-NEXT: $v8 = COPY %y
-    %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
+    %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
     %y:vrnov0 = PseudoVMERGE_VXM_MF2 $noreg, %x, $noreg, $v0, 1, 3 /* e8 */
     $v8 = COPY %y
 ...
@@ -1418,10 +1418,10 @@ name: vmerge_vvm
 body: |
   bb.0:
     ; CHECK-LABEL: name: vmerge_vvm
-    ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK: %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, 1, 3 /* e8 */, 0 /* tu, mu */
     ; CHECK-NEXT: %y:vrnov0 = PseudoVMERGE_VVM_M1 $noreg, $noreg, %x, $v0, 1, 3 /* e8 */
     ; CHECK-NEXT: $v8 = COPY %y
-    %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
+    %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
     %y:vrnov0 = PseudoVMERGE_VVM_M1 $noreg, $noreg, %x, $v0, 1, 3 /* e8 */
     $v8 = COPY %y
 ...
@@ -1430,10 +1430,10 @@ name: vmerge_vvm_incompatible_eew
 body: |
   bb.0:
     ; CHECK-LABEL: name: vmerge_vvm_incompatible_eew
-    ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 4 /* e16 */, 0 /* tu, mu */
+    ; CHECK: %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 4 /* e16 */, 0 /* tu, mu */
     ; CHECK-NEXT: %y:vrnov0 = PseudoVMERGE_VVM_M1 $noreg, $noreg, %x, $v0, 1, 3 /* e8 */
     ; CHECK-NEXT: $v8 = COPY %y
-    %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 4 /* e16 */, 0
+    %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 4 /* e16 */, 0
     %y:vrnov0 = PseudoVMERGE_VVM_M1 $noreg, $noreg, %x, $v0, 1, 3 /* e8 */
     $v8 = COPY %y
 ...
@@ -1442,10 +1442,10 @@ name: vmerge_vvm_incompatible_emul
 body: |
   bb.0:
     ; CHECK-LABEL: name: vmerge_vvm_incompatible_emul
-    ; CHECK: %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
+    ; CHECK: %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0 /* tu, mu */
     ; CHECK-NEXT: %y:vrnov0 = PseudoVMERGE_VVM_MF2 $noreg, $noreg, %x, $v0, 1, 3 /* e8 */
     ; CHECK-NEXT: $v8 = COPY %y
-    %x:vr = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
+    %x:vrnov0 = PseudoVADD_VV_M1 $noreg, $noreg, $noreg, -1, 3 /* e8 */, 0
     %y:vrnov0 = PseudoVMERGE_VVM_MF2 $noreg, $noreg, %x, $v0, 1, 3 /* e8 */
     $v8 = COPY %y
 ...
diff --git a/llvm/test/CodeGen/RISCV/rvv/vl-optimizer-subreg-assert.mir b/llvm/test/CodeGen/RISCV/rvv/vl-optimizer-subreg-assert.mir
index b816741285b43..7525bf70e62d8 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vl-optimizer-subreg-assert.mir
+++ b/llvm/test/CodeGen/RISCV/rvv/vl-optimizer-subreg-assert.mir
@@ -12,17 +12,17 @@ body:             |
     ; CHECK-LABEL: name: vl_optimizer_subreg_assert
     ; CHECK: liveins: $v8m2
     ; CHECK-NEXT: {{  $}}
-    ; CHECK-NEXT: [[DEF:%[0-9]+]]:vrm8 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:vrm8nov0 = IMPLICIT_DEF
     ; CHECK-NEXT: [[DEF1:%[0-9]+]]:vmv0 = IMPLICIT_DEF
-    ; CHECK-NEXT: [[DEF2:%[0-9]+]]:vrm8 = IMPLICIT_DEF
+    ; CHECK-NEXT: [[DEF2:%[0-9]+]]:vrm8nov0 = IMPLICIT_DEF
     ; CHECK-NEXT: [[PseudoVMERGE_VVM_M8_:%[0-9]+]]:vrm8nov0 = PseudoVMERGE_VVM_M8 $noreg, killed [[DEF2]], [[DEF]], [[DEF1]], -1, 6 /* e64 */
     ; CHECK-NEXT: [[PseudoVREDMAXU_VS_M8_E64_:%[0-9]+]]:vr = PseudoVREDMAXU_VS_M8_E64 $noreg, [[PseudoVMERGE_VVM_M8_]], [[PseudoVMERGE_VVM_M8_]].sub_vrm1_0, -1, 6 /* e64 */, 1 /* ta, mu */
     ; CHECK-NEXT: [[PseudoVMV_X_S:%[0-9]+]]:gpr = PseudoVMV_X_S killed [[PseudoVREDMAXU_VS_M8_E64_]], 6 /* e64 */
     ; CHECK-NEXT: $x10 = COPY [[PseudoVMV_X_S]]
     ; CHECK-NEXT: PseudoRET implicit $x10
-    %0:vrm8 = IMPLICIT_DEF
+    %0:vrm8nov0 = IMPLICIT_DEF
     %1:vmv0 = IMPLICIT_DEF
-    %2:vrm8 = IMPLICIT_DEF
+    %2:vrm8nov0 = IMPLICIT_DEF
     %3:vrm8nov0 = PseudoVMERGE_VVM_M8 $noreg, killed %2, %0, %1, -1, 6 /* e64 */
     %4:vr = PseudoVREDMAXU_VS_M8_E64 $noreg, %3, %3.sub_vrm1_0, -1, 6 /* e64 */, 1 /* ta, mu */
     %5:gpr = PseudoVMV_X_S killed %4, 6 /* e64 */
diff --git a/llvm/test/CodeGen/RISCV/rvv/vmerge-peephole.mir b/llvm/test/CodeGen/RISCV/rvv/vmerge-peephole.mir
index 374afa3aafdea..81a271bd975e3 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vmerge-peephole.mir
+++ b/llvm/test/CodeGen/RISCV/rvv/vmerge-peephole.mir
@@ -15,7 +15,7 @@ body: |
     ; CHECK-NEXT: %y:vrnov0 = PseudoVLE32_V_M1_MASK %passthru, $noreg, %mask, %avl, 5 /* e32 */, 0 /* tu, mu */ :: (load unknown-size, align 1)
     %avl:gprnox0 = COPY $x8
     %passthru:vrnov0 = COPY $v8
-    %x:vr = PseudoVLE32_V_M1 $noreg, $noreg, %avl, 5 /* e32 */, 2 /* tu, ma */ :: (load unknown-size)
+    %x:vrnov0 = PseudoVLE32_V_M1 $noreg, $noreg, %avl, 5 /* e32 */, 2 /* tu, ma */ :: (load unknown-size)
     %mask:vmv0 = COPY $v0
     %y:vrnov0 = PseudoVMERGE_VVM_M1 %passthru, %passthru, %x, %mask, %avl, 5 /* e32 */
 ...
@@ -32,8 +32,8 @@ body: |
     ; CHECK-NEXT: %mask:vmv0 = COPY $v0
     ; CHECK-NEXT: %y:vrnov0 = PseudoVLE32_V_M1_MASK %false, $noreg, %mask, %avl, 5 /* e32 */, 1 /* ta, mu */ :: (load unknown-size, align 1)
     %avl:gprnox0 = COPY $x8
-    %false:vr = COPY $v8
-    %x:vr = PseudoVLE32_V_M1 $noreg, $noreg, %avl, 5 /* e32 */, 2 /* tu, ma */ :: (load unknown-size)
+    %false:vrnov0 = COPY $v8
+    %x:vrnov0 = PseudoVLE32_V_M1 $noreg, $noreg, %avl, 5 /* e32 */, 2 /* tu, ma */ :: (load unknown-size)
     %mask:vmv0 = COPY $v0
     %y:vrnov0 = PseudoVMERGE_VVM_M1 $noreg, %false, %x, %mask, %avl, 5 /* e32 */
 ...
@@ -50,7 +50,7 @@ body: |
     ; CHECK-NEXT: %mask:vmv0 = COPY $v0
     ; CHECK-NEXT: %y:vrnov0 = PseudoVLE32_V_M1_MASK %passthru, $noreg, %mask, %avl, 5 /* e32 */, 0 /* tu, mu */ :: (load unknown-size, align 1)
     %avl:gprnox0 = COPY $x8
-    %x:vr = PseudoVLE32_V_M1 $noreg, $noreg, %avl, 5 /* e32 */, 2 /* tu, ma */ :: (load unknown-size)
+    %x:vrnov0 = PseudoVLE32_V_M1 $noreg, $noreg, %avl, 5 /* e32 */, 2 /* tu, ma */ :: (load unknown-size)
     %passthru:vrnov0 = COPY $v8
     %mask:vmv0 = COPY $v0
     %y:vrnov0 = PseudoVMERGE_VVM_M1 %passthru, %passthru, %x, %mask, %avl, 5 /* e32 */
@@ -68,7 +68,7 @@ body: |
     ; CHECK-NEXT: %mask:vmv0 = COPY $v0
     ; CHECK-NEXT: %y:vrnov0 = PseudoVNCLIPU_WV_MF2_MASK %passthru, $noreg, $noreg, %mask, 0, %avl, 5 /* e32 */, 0 /* tu, mu */, implicit-def $vxsat
     %avl:gprnox0 = COPY $x8
-    %x:vr = PseudoVNCLIPU_WV_MF2 $noreg, $noreg, $noreg, 0, -1, 5, 3, implicit-def $vxsat
+    %x:vrnov0 = PseudoVNCLIPU_WV_MF2 $noreg, $noreg, $noreg, 0, -1, 5, 3, implicit-def $vxsat
     %passthru:vrnov0 = COPY $v8
     %mask:vmv0 = COPY $v0
     %y:vrnov0 = PseudoVMERGE_VVM_M1 %passthru, %passthru, %x, %mask, %avl, 5 /* e32 */
@@ -82,13 +82,13 @@ body: |
     ; CHECK: liveins: $x8, $v0, $v8
     ; CHECK-NEXT: {{  $}}
     ; CHECK-NEXT: %avl:gprnox0 = COPY $x8
-    ; CHECK-NEXT: %x:vr = PseudoVNCLIPU_WV_MF2 $noreg, $noreg, $noreg, 0, -1, 5 /* e32 */, 3 /* ta, ma */, implicit-def $vxsat
+    ; CHECK-NEXT: %x:vrnov0 = PseudoVNCLIPU_WV_MF2 $noreg, $noreg, $noreg, 0, -1, 5 /* e32 */, 3 /* ta, ma */, implicit-def $vxsat
     ; CHECK-NEXT: %vxsat:gpr = COPY $vxsat
     ; CHECK-NEXT: %passthru:vrnov0 = COPY $v8
     ; CHECK-NEXT: %mask:vmv0 = COPY $v0
     ; CHECK-NEXT: %y:vrnov0 = PseudoVMERGE_VVM_M1 %passthru, %passthru, %x, %mask, %avl, 5 /* e32 */
     %avl:gprnox0 = COPY $x8
-    %x:vr = PseudoVNCLIPU_WV_MF2 $noreg, $noreg, $noreg, 0, -1, 5, 3, implicit-def $vxsat
+    %x:vrnov0 = PseudoVNCLIPU_WV_MF2 $noreg, $noreg, $noreg, 0, -1, 5, 3, implicit-def $vxsat
     %vxsat:gpr = COPY $vxsat
     %passthru:vrnov0 = COPY $v8
     %mask:vmv0 = COPY $v0
@@ -116,3 +116,68 @@ body: |
     %vfmadd:vrnov0 = nofpexcept PseudoVFMADD_VV_M1_E32 %x, %y, %passthru, 7, -1, 5 /* e32 */, 3 /* ta, ma */, implicit $frm
     %vmerge:vrnov0 = PseudoVMERGE_VVM_M1 %passthru, %passthru, %vfmadd, %mask, %avl, 5
 ...
+---
+name: true_copy
+body: |
+  bb.0:
+    liveins: $x8, $v0, $v8
+    ; CHECK-LABEL: name: true_copy
+    ; CHECK: liveins: $x8, $v0, $v8
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: %avl:gprnox0 = COPY $x8
+    ; CHECK-NEXT: %passthru:vrnov0 = COPY $v8
+    ; CHECK-NEXT: %mask:vmv0 = COPY $v0
+    ; CHECK-NEXT: %z:vrnov0 = PseudoVLE32_V_M1_MASK %passthru, $noreg, %mask, %avl, 5 /* e32 */, 0 /* tu, mu */ :: (load unknown-size, align 1)
+    ; CHECK-NEXT: %y:vrnov0 = COPY %z
+    %avl:gprnox0 = COPY $x8
+    %passthru:vrnov0 = COPY $v8
+    %x:vr = PseudoVLE32_V_M1 $noreg, $noreg, %avl, 5 /* e32 */, 2 /* tu, ma */ :: (load unknown-size)
+    %mask:vmv0 = COPY $v0
+    %y:vrnov0 = COPY %x
+    %z:vrnov0 = PseudoVMERGE_VVM_M1 %passthru, %passthru, %y, %mask, %avl, 5 /* e32 */
+...
+---
+name: copy_is_killed
+body: |
+  bb.0:
+    liveins: $v0, $v8, $v9
+    ; CHECK-LABEL: name: copy_is_killed
+    ; CHECK: liveins: $v0, $v8, $v9
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: %x:vr = COPY $v8
+    ; CHECK-NEXT: %y:vr = COPY $v9
+    ; CHECK-NEXT: %mask:vmv0 = COPY $v0
+    ; CHECK-NEXT: %add0:vr = PseudoVADD_VV_M1 $noreg, %x, %y, -1, 5 /* e32 */, 3 /* ta, ma */
+    ; CHECK-NEXT: %add1:vrnov0 = COPY %add:vrnov0
+    ; CHECK-NEXT: %merge:vrnov0 = PseudoVOR_VV_M1_MASK %add:vrnov0, %add1, %y, %mask, -1, 5 /* e32 */, 1 /* ta, mu */
+    %x:vr = COPY $v8
+    %y:vr = COPY $v9
+    %mask:vmv0 = COPY $v0
+    %add0:vr = PseudoVADD_VV_M1 $noreg, %x:vr, %y:vr, -1, 5, 3
+    %add1:vrnov0 = COPY killed %add:vr
+    %or:vrnov0 = PseudoVOR_VV_M1 $noreg, %add1:vrnov0, %y:vr, -1, 5, 3
+    %merge:vrnov0 = PseudoVMERGE_VVM_M1 $noreg, %add1:vrnov0, killed %or:vrnov0, killed %mask:vmv0, -1, 5 /* e32 */
+
+...
+---
+name: copy_multiple_use
+body: |
+  bb.0:
+    liveins: $x8, $v0, $v8
+    ; CHECK-LABEL: name: copy_multiple_use
+    ; CHECK: liveins: $x8, $v0, $v8
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: %avl:gprnox0 = COPY $x8
+    ; CHECK-NEXT: %passthru:vrnov0 = COPY $v8
+    ; CHECK-NEXT: %x:vrnov0 = PseudoVLE32_V_M1 $noreg, $noreg, %avl, 5 /* e32 */, 2 /* tu, ma */ :: (load unknown-size, align 1)
+    ; CHECK-NEXT: %copy:vrnov0 = COPY %x
+    ; CHECK-NEXT: %mask:vmv0 = COPY $v0
+    ; CHECK-NEXT: PseudoVSE8_V_M1 %copy, $noreg, %avl, 5 /* e32 */
+    ; CHECK-NEXT: %y:vrnov0 = PseudoVMERGE_VVM_M1 %passthru, %passthru, %copy, %mask, %avl, 5 /* e32 */
+    %avl:gprnox0 = COPY $x8
+    %passthru:vrnov0 = COPY $v8
+    %x:vrnov0 = PseudoVLE32_V_M1 $noreg, $noreg, %avl, 5 /* e32 */, 2 /* tu, ma */ :: (load unknown-size)
+    %copy:vrnov0 = COPY %x
+    %mask:vmv0 = COPY $v0
+    PseudoVSE8_V_M1 %copy, $noreg, %avl, 5 /* e8 */
+    %y:vrnov0 = PseudoVMERGE_VVM_M1 %passthru, %passthru, %copy, %mask, %avl, 5 /* e32 */
diff --git a/llvm/test/CodeGen/RISCV/rvv/vmv.v.v-peephole.mir b/llvm/test/CodeGen/RISCV/rvv/vmv.v.v-peephole.mir
index 9c3e96d818556..95232e734bb18 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vmv.v.v-peephole.mir
+++ b/llvm/test/CodeGen/RISCV/rvv/vmv.v.v-peephole.mir
@@ -163,7 +163,7 @@ body: |
     ; CHECK-NEXT: %passthru:vrnov0 = COPY $v8
     ; CHECK-NEXT: %mask:vmv0 = COPY $v0
     ; CHECK-NEXT: %x:vrnov0 = PseudoVMERGE_VVM_M1 %passthru, %passthru, $noreg, %mask, 4, 5 /* e32 */
-    %passthru:vr = COPY $v8
+    %passthru:vrnov0 = COPY $v8
     %mask:vmv0 = COPY $v0
     %x:vrnov0 = PseudoVMERGE_VVM_M1 $noreg, %passthru, $noreg, %mask, 4, 5 /* e32 */
     %z:vr = PseudoVMV_V_V_M1 %passthru, %x, 4, 5 /* e32 */, 0 /* tu, mu */
diff --git a/llvm/test/CodeGen/RISCV/select-bare.ll b/llvm/test/CodeGen/RISCV/select-bare.ll
index 550eb94724ff2..fe0f74f5f2fa3 100644
--- a/llvm/test/CodeGen/RISCV/select-bare.ll
+++ b/llvm/test/CodeGen/RISCV/select-bare.ll
@@ -3,7 +3,7 @@
 ; RUN:   | FileCheck %s -check-prefix=RV32I
 ; RUN: llc -mtriple=riscv64 -mattr=+xmipscmov -verify-machineinstrs < %s \
 ; RUN:   | FileCheck -check-prefix=RV64I-CCMOV %s
-; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-opt,+conditional-cmv-fusion -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-ialu,+conditional-cmv-fusion -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCI
 
 define i32 @bare_select(i1 %a, i32 %b, i32 %c) nounwind {
diff --git a/llvm/test/CodeGen/RISCV/select-cc.ll b/llvm/test/CodeGen/RISCV/select-cc.ll
index 95f5a9d0925de..a215f893837a8 100644
--- a/llvm/test/CodeGen/RISCV/select-cc.ll
+++ b/llvm/test/CodeGen/RISCV/select-cc.ll
@@ -1,7 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -mtriple=riscv32 -disable-block-placement -verify-machineinstrs < %s \
 ; RUN:   | FileCheck -check-prefixes=RV32I %s
-; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-opt,+conditional-cmv-fusion -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-ialu,+conditional-cmv-fusion -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCI
 ; RUN: llc -mtriple=riscv64 -disable-block-placement -verify-machineinstrs < %s \
 ; RUN:   | FileCheck -check-prefixes=RV64I %s
diff --git a/llvm/test/CodeGen/RISCV/select-cond.ll b/llvm/test/CodeGen/RISCV/select-cond.ll
index a3c48737edc3c..ab3306e4e78e3 100644
--- a/llvm/test/CodeGen/RISCV/select-cond.ll
+++ b/llvm/test/CodeGen/RISCV/select-cond.ll
@@ -7,7 +7,7 @@
 ; RUN:   | FileCheck %s --check-prefixes=RV32-XQCICM
 ; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcics -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV32-XQCICS
-; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-opt,+conditional-cmv-fusion -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-ialu,+conditional-cmv-fusion -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCI
 ; RUN: llc -mtriple=riscv64 -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV64
diff --git a/llvm/test/CodeGen/RISCV/select-const.ll b/llvm/test/CodeGen/RISCV/select-const.ll
index f2924bb364adb..1c6cc6dc97900 100644
--- a/llvm/test/CodeGen/RISCV/select-const.ll
+++ b/llvm/test/CodeGen/RISCV/select-const.ll
@@ -5,7 +5,7 @@
 ; RUN:   | FileCheck -check-prefixes=RV32,RV32IF %s
 ; RUN: llc -mtriple=riscv32 -mattr=+zicond -target-abi=ilp32 -verify-machineinstrs < %s \
 ; RUN:   | FileCheck -check-prefixes=RV32,RV32ZICOND %s
-; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-opt,+conditional-cmv-fusion -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-ialu,+conditional-cmv-fusion -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCI
 ; RUN: llc -mtriple=riscv64 -target-abi=lp64 -verify-machineinstrs < %s \
 ; RUN:   | FileCheck -check-prefixes=RV64,RV64I %s
diff --git a/llvm/test/CodeGen/RISCV/select.ll b/llvm/test/CodeGen/RISCV/select.ll
index 1eb47e4c0ede2..d7a6056a4feb4 100644
--- a/llvm/test/CodeGen/RISCV/select.ll
+++ b/llvm/test/CodeGen/RISCV/select.ll
@@ -4,7 +4,7 @@
 ; RUN: llc -mtriple=riscv64 -mattr=+m,+xventanacondops -verify-machineinstrs < %s | FileCheck --check-prefixes=CHECK,RV64IMXVTCONDOPS %s
 ; RUN: llc -mtriple=riscv32 -mattr=+m,+zicond -verify-machineinstrs < %s | FileCheck --check-prefixes=CHECK,CHECKZICOND,RV32IMZICOND %s
 ; RUN: llc -mtriple=riscv64 -mattr=+m,+zicond -verify-machineinstrs < %s | FileCheck --check-prefixes=CHECK,CHECKZICOND,RV64IMZICOND %s
-; RUN: llc -mtriple=riscv32 -mattr=+m,+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-opt,+conditional-cmv-fusion -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv32 -mattr=+m,+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-ialu,+conditional-cmv-fusion -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCI
 
 define i16 @select_xor_1(i16 %A, i8 %cond) {
diff --git a/llvm/test/CodeGen/RISCV/short-forward-branch-load-imm.ll b/llvm/test/CodeGen/RISCV/short-forward-branch-load-imm.ll
index 6aae6cd0e82ee..4f51e602d1346 100644
--- a/llvm/test/CodeGen/RISCV/short-forward-branch-load-imm.ll
+++ b/llvm/test/CodeGen/RISCV/short-forward-branch-load-imm.ll
@@ -1,9 +1,9 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
 ; RUN: llc < %s -verify-machineinstrs -mtriple=riscv32 -mattr=+experimental-xqcili | FileCheck %s --check-prefixes=RV32I
 ; RUN: llc < %s -verify-machineinstrs -mtriple=riscv64 | FileCheck %s --check-prefixes=RV64I
-; RUN: llc < %s -verify-machineinstrs -mtriple=riscv32 -mattr=+experimental-xqcili,+short-forward-branch-opt | \
+; RUN: llc < %s -verify-machineinstrs -mtriple=riscv32 -mattr=+experimental-xqcili,+short-forward-branch-ialu | \
 ; RUN:   FileCheck %s --check-prefixes=RV32I-SFB
-; RUN: llc < %s -verify-machineinstrs -mtriple=riscv64 -mattr=+short-forward-branch-opt | \
+; RUN: llc < %s -verify-machineinstrs -mtriple=riscv64 -mattr=+short-forward-branch-ialu | \
 ; RUN:   FileCheck %s --check-prefixes=RV64I-SFB
 
 define i32 @select_example_1(i32 %a, i32 %b, i1 zeroext %x, i32 %y) {
diff --git a/llvm/test/CodeGen/RISCV/short-forward-branch-opt-min-max.ll b/llvm/test/CodeGen/RISCV/short-forward-branch-opt-min-max.ll
index 05e06cea9967a..3d32ce9b9ada6 100644
--- a/llvm/test/CodeGen/RISCV/short-forward-branch-opt-min-max.ll
+++ b/llvm/test/CodeGen/RISCV/short-forward-branch-opt-min-max.ll
@@ -1,13 +1,13 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
 ; RUN: llc < %s -mtriple=riscv32 -mattr=+zbb | FileCheck %s --check-prefixes=RV32I-ZBB
 ; RUN: llc < %s -mtriple=riscv64 -mattr=+zbb | FileCheck %s --check-prefixes=RV64I-ZBB
-; RUN: llc < %s -mtriple=riscv32 -mattr=+zbb,+short-forward-branch-opt | \
+; RUN: llc < %s -mtriple=riscv32 -mattr=+zbb,+short-forward-branch-ialu | \
 ; RUN:   FileCheck %s --check-prefixes=RV32I-SFB-ZBB
-; RUN: llc < %s -mtriple=riscv64 -mattr=+zbb,+short-forward-branch-opt | \
+; RUN: llc < %s -mtriple=riscv64 -mattr=+zbb,+short-forward-branch-ialu | \
 ; RUN:   FileCheck %s --check-prefixes=RV64I-SFB-ZBB
-; RUN: llc < %s -mtriple=riscv32 -mattr=+zbb,+short-forward-branch-i-minmax | \
+; RUN: llc < %s -mtriple=riscv32 -mattr=+zbb,+short-forward-branch-iminmax | \
 ; RUN:   FileCheck %s --check-prefixes=RV32I-SFBIMinMax-ZBB
-; RUN: llc < %s -mtriple=riscv64 -mattr=+zbb,+short-forward-branch-i-minmax | \
+; RUN: llc < %s -mtriple=riscv64 -mattr=+zbb,+short-forward-branch-iminmax | \
 ; RUN:   FileCheck %s --check-prefixes=RV64I-SFBIMinMax-ZBB
 
 define i32 @select_example_smax(i32 %a, i32 %b, i1 zeroext %x, i32 %y) {
diff --git a/llvm/test/CodeGen/RISCV/short-forward-branch-opt-mul.ll b/llvm/test/CodeGen/RISCV/short-forward-branch-opt-mul.ll
index 3f780fddafcce..47ad9755154b6 100644
--- a/llvm/test/CodeGen/RISCV/short-forward-branch-opt-mul.ll
+++ b/llvm/test/CodeGen/RISCV/short-forward-branch-opt-mul.ll
@@ -1,13 +1,13 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
 ; RUN: llc < %s -mtriple=riscv32 -mattr=+m | FileCheck %s --check-prefixes=RV32I-M
 ; RUN: llc < %s -mtriple=riscv64 -mattr=+m | FileCheck %s --check-prefixes=RV64I-M
-; RUN: llc < %s -mtriple=riscv32 -mattr=+m,+short-forward-branch-opt | \
+; RUN: llc < %s -mtriple=riscv32 -mattr=+m,+short-forward-branch-ialu | \
 ; RUN:   FileCheck %s --check-prefixes=RV32I-SFB-M
-; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+short-forward-branch-opt | \
+; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+short-forward-branch-ialu | \
 ; RUN:   FileCheck %s --check-prefixes=RV64I-SFB-M
-; RUN: llc < %s -mtriple=riscv32 -mattr=+m,+short-forward-branch-i-mul | \
+; RUN: llc < %s -mtriple=riscv32 -mattr=+m,+short-forward-branch-imul | \
 ; RUN:   FileCheck %s --check-prefixes=RV32I-SFBIMul-M
-; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+short-forward-branch-i-mul | \
+; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+short-forward-branch-imul | \
 ; RUN:   FileCheck %s --check-prefixes=RV64I-SFBIMul-M
 
 define i32 @select_example_mul_i32(i32 %a, i32 %b, i1 zeroext %x, i32 %y) {
diff --git a/llvm/test/CodeGen/RISCV/xcvelw.ll b/llvm/test/CodeGen/RISCV/xcvelw.ll
new file mode 100644
index 0000000000000..4ff8a5b38494f
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/xcvelw.ll
@@ -0,0 +1,27 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -O0 -mtriple=riscv32 -mattr=+xcvelw -verify-machineinstrs < %s \
+; RUN:   | FileCheck %s
+
+declare i32 @llvm.riscv.cv.elw.elw(i8*)
+
+define i32 @test.cv.elw.elw(i8* %a) {
+; CHECK-LABEL: test.cv.elw.elw:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    cv.elw a0, 0(a0)
+; CHECK-NEXT:    ret
+  %1 = call i32 @llvm.riscv.cv.elw.elw(i8* %a)
+  ret i32 %1
+}
+
+define i32 @test.cv.elw.elw2(i8* %a, i32 %b) {
+; CHECK-LABEL: test.cv.elw.elw2:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    add a0, a1, a0
+; CHECK-NEXT:    cv.elw a0, 7(a0)
+; CHECK-NEXT:    ret
+  %c = add i32 %b, 4
+  %d = add i32 %c, 3
+  %e = getelementptr i8, i8* %a, i32 %d
+  %1 = call i32 @llvm.riscv.cv.elw.elw(i8* %e)
+  ret i32 %1
+}
\ No newline at end of file
diff --git a/llvm/test/CodeGen/RISCV/xqcicli.ll b/llvm/test/CodeGen/RISCV/xqcicli.ll
index cdb1947339736..229ef67e208fb 100644
--- a/llvm/test/CodeGen/RISCV/xqcicli.ll
+++ b/llvm/test/CodeGen/RISCV/xqcicli.ll
@@ -4,7 +4,7 @@
 ; RUN:   | FileCheck %s --check-prefixes=RV32I
 ; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicli -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCICLI
-; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-opt,+conditional-cmv-fusion -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-ialu,+conditional-cmv-fusion -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCI
 
 define i32 @select_cc_example_eq(i32 %a, i32 %b, i32 %x, i32 %y) {
diff --git a/llvm/test/CodeGen/RISCV/xqcicm.ll b/llvm/test/CodeGen/RISCV/xqcicm.ll
index 8e934963c258b..dbfbaa7c033a2 100644
--- a/llvm/test/CodeGen/RISCV/xqcicm.ll
+++ b/llvm/test/CodeGen/RISCV/xqcicm.ll
@@ -6,7 +6,7 @@
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCICM
 ; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCICM
-; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-opt,+conditional-cmv-fusion -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-ialu,+conditional-cmv-fusion -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCI
 
 define i32 @select_example(i32 %cond, i32 %x, i32 %y) {
diff --git a/llvm/test/CodeGen/RISCV/xqcics.ll b/llvm/test/CodeGen/RISCV/xqcics.ll
index 7656a0c0e78e0..123226655de3f 100644
--- a/llvm/test/CodeGen/RISCV/xqcics.ll
+++ b/llvm/test/CodeGen/RISCV/xqcics.ll
@@ -6,7 +6,7 @@
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCICS
 ; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcics,+experimental-xqcicm -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCICM
-; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-opt,+conditional-cmv-fusion -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv32 -mattr=+experimental-xqcicm,+experimental-xqcics,+experimental-xqcicli,+zca,+short-forward-branch-ialu,+conditional-cmv-fusion -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s --check-prefixes=RV32IXQCI
 
 define i32 @select_cc_example_eq_s1(i32 %a, i32 %b, i32 %x, i32 %y) {
diff --git a/llvm/test/CodeGen/RISCV/zicond-fp-select-zfinx.ll b/llvm/test/CodeGen/RISCV/zicond-fp-select-zfinx.ll
new file mode 100644
index 0000000000000..0e8a0c704207d
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/zicond-fp-select-zfinx.ll
@@ -0,0 +1,703 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; Zicond with zfinx (implied by zdinx)
+; RUN: llc -mtriple=riscv64 -mattr=+zdinx,+zicond -verify-machineinstrs < %s | FileCheck %s --check-prefixes=ZDINX_ZICOND,RV64ZDINX_ZICOND
+; RUN: llc -mtriple=riscv64 -mattr=+zdinx         -verify-machineinstrs < %s | FileCheck %s --check-prefixes=ZDINX_NOZICOND,RV64ZDINX_NOZICOND
+
+; Zicond with zfinx (implied by zhinx)
+; RUN: llc -mtriple=riscv64 -mattr=+zhinx,+zicond -verify-machineinstrs < %s | FileCheck %s --check-prefixes=ZHINX_ZICOND,RV64ZHINX_ZICOND
+
+; Baseline with classic FP registers (no *inx); zicond select should NOT trigger
+; RUN: llc -mtriple=riscv64 -mattr=+f,+d          -verify-machineinstrs < %s | FileCheck %s --check-prefix=RV64FD
+
+; Check that the same optimization works on 32-bit targets
+; RUN: llc -mtriple=riscv32 -mattr=+zfinx,+zicond -verify-machineinstrs < %s | FileCheck %s --check-prefixes=ZHINX_ZICOND,RV32ZFINX_ZICOND
+; RUN: llc -mtriple=riscv32 -mattr=+zfinx         -verify-machineinstrs < %s | FileCheck %s --check-prefix=RV32ZFINX_NOZICOND
+; RUN: llc -mtriple=riscv32 -mattr=+zdinx,+zicond -verify-machineinstrs < %s | FileCheck %s --check-prefixes=ZDINX_ZICOND,RV32ZDINX_ZICOND
+; RUN: llc -mtriple=riscv32 -mattr=+zdinx         -verify-machineinstrs < %s | FileCheck %s --check-prefixes=ZDINX_NOZICOND,RV32ZDINX_NOZICOND
+
+; This test checks that floating-point SELECT is lowered through integer
+; SELECT (and thus to a Zicond czero.* sequence) when FP values live in GPRs
+; (Zfinx/Zdinx) and Zicond is enabled. When Zicond is disabled, we expect
+; a branch-based lowering instead.
+
+; -----------------------------------------------------------------------------
+; float select with i1 condition (Zfinx)
+; -----------------------------------------------------------------------------
+
+define float @select_f32_i1(i1 %cond, float %t, float %f) nounwind {
+; ZDINX_ZICOND-LABEL: select_f32_i1:
+; ZDINX_ZICOND:       # %bb.0: # %entry
+; ZDINX_ZICOND-NEXT:    # kill: def $x12_w killed $x12_w def $x12
+; ZDINX_ZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; ZDINX_ZICOND-NEXT:    andi a0, a0, 1
+; ZDINX_ZICOND-NEXT:    czero.nez a2, a2, a0
+; ZDINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; ZDINX_ZICOND-NEXT:    or a0, a0, a2
+; ZDINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; ZDINX_ZICOND-NEXT:    ret
+;
+; ZDINX_NOZICOND-LABEL: select_f32_i1:
+; ZDINX_NOZICOND:       # %bb.0: # %entry
+; ZDINX_NOZICOND-NEXT:    andi a3, a0, 1
+; ZDINX_NOZICOND-NEXT:    mv a0, a1
+; ZDINX_NOZICOND-NEXT:    bnez a3, .LBB0_2
+; ZDINX_NOZICOND-NEXT:  # %bb.1: # %entry
+; ZDINX_NOZICOND-NEXT:    mv a0, a2
+; ZDINX_NOZICOND-NEXT:  .LBB0_2: # %entry
+; ZDINX_NOZICOND-NEXT:    ret
+;
+; ZHINX_ZICOND-LABEL: select_f32_i1:
+; ZHINX_ZICOND:       # %bb.0: # %entry
+; ZHINX_ZICOND-NEXT:    # kill: def $x12_w killed $x12_w def $x12
+; ZHINX_ZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; ZHINX_ZICOND-NEXT:    andi a0, a0, 1
+; ZHINX_ZICOND-NEXT:    czero.nez a2, a2, a0
+; ZHINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; ZHINX_ZICOND-NEXT:    or a0, a0, a2
+; ZHINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; ZHINX_ZICOND-NEXT:    ret
+;
+; RV64FD-LABEL: select_f32_i1:
+; RV64FD:       # %bb.0: # %entry
+; RV64FD-NEXT:    andi a0, a0, 1
+; RV64FD-NEXT:    bnez a0, .LBB0_2
+; RV64FD-NEXT:  # %bb.1: # %entry
+; RV64FD-NEXT:    fmv.s fa0, fa1
+; RV64FD-NEXT:  .LBB0_2: # %entry
+; RV64FD-NEXT:    ret
+;
+; RV32ZFINX_NOZICOND-LABEL: select_f32_i1:
+; RV32ZFINX_NOZICOND:       # %bb.0: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    andi a3, a0, 1
+; RV32ZFINX_NOZICOND-NEXT:    mv a0, a1
+; RV32ZFINX_NOZICOND-NEXT:    bnez a3, .LBB0_2
+; RV32ZFINX_NOZICOND-NEXT:  # %bb.1: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    mv a0, a2
+; RV32ZFINX_NOZICOND-NEXT:  .LBB0_2: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    ret
+entry:
+  %sel = select i1 %cond, float %t, float %f
+  ret float %sel
+}
+
+; -----------------------------------------------------------------------------
+; double select with i1 condition (Zdinx)
+; -----------------------------------------------------------------------------
+
+define double @select_f64_i1(i1 %cond, double %t, double %f) nounwind {
+; RV64ZDINX_ZICOND-LABEL: select_f64_i1:
+; RV64ZDINX_ZICOND:       # %bb.0: # %entry
+; RV64ZDINX_ZICOND-NEXT:    andi a0, a0, 1
+; RV64ZDINX_ZICOND-NEXT:    czero.nez a2, a2, a0
+; RV64ZDINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; RV64ZDINX_ZICOND-NEXT:    or a0, a0, a2
+; RV64ZDINX_ZICOND-NEXT:    ret
+;
+; RV64ZDINX_NOZICOND-LABEL: select_f64_i1:
+; RV64ZDINX_NOZICOND:       # %bb.0: # %entry
+; RV64ZDINX_NOZICOND-NEXT:    andi a3, a0, 1
+; RV64ZDINX_NOZICOND-NEXT:    mv a0, a1
+; RV64ZDINX_NOZICOND-NEXT:    bnez a3, .LBB1_2
+; RV64ZDINX_NOZICOND-NEXT:  # %bb.1: # %entry
+; RV64ZDINX_NOZICOND-NEXT:    mv a0, a2
+; RV64ZDINX_NOZICOND-NEXT:  .LBB1_2: # %entry
+; RV64ZDINX_NOZICOND-NEXT:    ret
+;
+; RV64ZHINX_ZICOND-LABEL: select_f64_i1:
+; RV64ZHINX_ZICOND:       # %bb.0: # %entry
+; RV64ZHINX_ZICOND-NEXT:    andi a0, a0, 1
+; RV64ZHINX_ZICOND-NEXT:    czero.nez a2, a2, a0
+; RV64ZHINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; RV64ZHINX_ZICOND-NEXT:    or a0, a0, a2
+; RV64ZHINX_ZICOND-NEXT:    ret
+;
+; RV64FD-LABEL: select_f64_i1:
+; RV64FD:       # %bb.0: # %entry
+; RV64FD-NEXT:    andi a0, a0, 1
+; RV64FD-NEXT:    bnez a0, .LBB1_2
+; RV64FD-NEXT:  # %bb.1: # %entry
+; RV64FD-NEXT:    fmv.d fa0, fa1
+; RV64FD-NEXT:  .LBB1_2: # %entry
+; RV64FD-NEXT:    ret
+;
+; RV32ZFINX_ZICOND-LABEL: select_f64_i1:
+; RV32ZFINX_ZICOND:       # %bb.0: # %entry
+; RV32ZFINX_ZICOND-NEXT:    andi a0, a0, 1
+; RV32ZFINX_ZICOND-NEXT:    czero.nez a3, a3, a0
+; RV32ZFINX_ZICOND-NEXT:    czero.eqz a1, a1, a0
+; RV32ZFINX_ZICOND-NEXT:    czero.nez a4, a4, a0
+; RV32ZFINX_ZICOND-NEXT:    czero.eqz a2, a2, a0
+; RV32ZFINX_ZICOND-NEXT:    or a0, a1, a3
+; RV32ZFINX_ZICOND-NEXT:    or a1, a2, a4
+; RV32ZFINX_ZICOND-NEXT:    ret
+;
+; RV32ZFINX_NOZICOND-LABEL: select_f64_i1:
+; RV32ZFINX_NOZICOND:       # %bb.0: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    andi a5, a0, 1
+; RV32ZFINX_NOZICOND-NEXT:    mv a0, a1
+; RV32ZFINX_NOZICOND-NEXT:    bnez a5, .LBB1_2
+; RV32ZFINX_NOZICOND-NEXT:  # %bb.1: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    mv a0, a3
+; RV32ZFINX_NOZICOND-NEXT:    mv a2, a4
+; RV32ZFINX_NOZICOND-NEXT:  .LBB1_2: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    mv a1, a2
+; RV32ZFINX_NOZICOND-NEXT:    ret
+;
+; RV32ZDINX_ZICOND-LABEL: select_f64_i1:
+; RV32ZDINX_ZICOND:       # %bb.0: # %entry
+; RV32ZDINX_ZICOND-NEXT:    andi a0, a0, 1
+; RV32ZDINX_ZICOND-NEXT:    bnez a0, .LBB1_2
+; RV32ZDINX_ZICOND-NEXT:  # %bb.1: # %entry
+; RV32ZDINX_ZICOND-NEXT:    mv a7, a4
+; RV32ZDINX_ZICOND-NEXT:    mv a6, a3
+; RV32ZDINX_ZICOND-NEXT:    fmv.d a4, a6
+; RV32ZDINX_ZICOND-NEXT:    j .LBB1_3
+; RV32ZDINX_ZICOND-NEXT:  .LBB1_2:
+; RV32ZDINX_ZICOND-NEXT:    mv a5, a2
+; RV32ZDINX_ZICOND-NEXT:    mv a4, a1
+; RV32ZDINX_ZICOND-NEXT:  .LBB1_3: # %entry
+; RV32ZDINX_ZICOND-NEXT:    mv a0, a4
+; RV32ZDINX_ZICOND-NEXT:    mv a1, a5
+; RV32ZDINX_ZICOND-NEXT:    ret
+;
+; RV32ZDINX_NOZICOND-LABEL: select_f64_i1:
+; RV32ZDINX_NOZICOND:       # %bb.0: # %entry
+; RV32ZDINX_NOZICOND-NEXT:    andi a0, a0, 1
+; RV32ZDINX_NOZICOND-NEXT:    bnez a0, .LBB1_2
+; RV32ZDINX_NOZICOND-NEXT:  # %bb.1: # %entry
+; RV32ZDINX_NOZICOND-NEXT:    mv a7, a4
+; RV32ZDINX_NOZICOND-NEXT:    mv a6, a3
+; RV32ZDINX_NOZICOND-NEXT:    fmv.d a4, a6
+; RV32ZDINX_NOZICOND-NEXT:    j .LBB1_3
+; RV32ZDINX_NOZICOND-NEXT:  .LBB1_2:
+; RV32ZDINX_NOZICOND-NEXT:    mv a5, a2
+; RV32ZDINX_NOZICOND-NEXT:    mv a4, a1
+; RV32ZDINX_NOZICOND-NEXT:  .LBB1_3: # %entry
+; RV32ZDINX_NOZICOND-NEXT:    mv a0, a4
+; RV32ZDINX_NOZICOND-NEXT:    mv a1, a5
+; RV32ZDINX_NOZICOND-NEXT:    ret
+entry:
+  %sel = select i1 %cond, double %t, double %f
+  ret double %sel
+}
+
+; -----------------------------------------------------------------------------
+; double select with floating-point compare condition (a > b ? c : d), Zdinx
+; -----------------------------------------------------------------------------
+
+define double @select_f64_fcmp(double %a, double %b, double %c, double %d) nounwind {
+; RV64ZDINX_ZICOND-LABEL: select_f64_fcmp:
+; RV64ZDINX_ZICOND:       # %bb.0: # %entry
+; RV64ZDINX_ZICOND-NEXT:    flt.d a0, a1, a0
+; RV64ZDINX_ZICOND-NEXT:    czero.nez a1, a3, a0
+; RV64ZDINX_ZICOND-NEXT:    czero.eqz a0, a2, a0
+; RV64ZDINX_ZICOND-NEXT:    or a0, a0, a1
+; RV64ZDINX_ZICOND-NEXT:    ret
+;
+; RV64ZDINX_NOZICOND-LABEL: select_f64_fcmp:
+; RV64ZDINX_NOZICOND:       # %bb.0: # %entry
+; RV64ZDINX_NOZICOND-NEXT:    flt.d a1, a1, a0
+; RV64ZDINX_NOZICOND-NEXT:    mv a0, a2
+; RV64ZDINX_NOZICOND-NEXT:    bnez a1, .LBB2_2
+; RV64ZDINX_NOZICOND-NEXT:  # %bb.1: # %entry
+; RV64ZDINX_NOZICOND-NEXT:    mv a0, a3
+; RV64ZDINX_NOZICOND-NEXT:  .LBB2_2: # %entry
+; RV64ZDINX_NOZICOND-NEXT:    ret
+;
+; RV64ZHINX_ZICOND-LABEL: select_f64_fcmp:
+; RV64ZHINX_ZICOND:       # %bb.0: # %entry
+; RV64ZHINX_ZICOND-NEXT:    addi sp, sp, -32
+; RV64ZHINX_ZICOND-NEXT:    sd ra, 24(sp) # 8-byte Folded Spill
+; RV64ZHINX_ZICOND-NEXT:    sd s0, 16(sp) # 8-byte Folded Spill
+; RV64ZHINX_ZICOND-NEXT:    sd s1, 8(sp) # 8-byte Folded Spill
+; RV64ZHINX_ZICOND-NEXT:    mv s0, a3
+; RV64ZHINX_ZICOND-NEXT:    mv s1, a2
+; RV64ZHINX_ZICOND-NEXT:    call __gtdf2
+; RV64ZHINX_ZICOND-NEXT:    sgtz a0, a0
+; RV64ZHINX_ZICOND-NEXT:    czero.nez a1, s0, a0
+; RV64ZHINX_ZICOND-NEXT:    czero.eqz a0, s1, a0
+; RV64ZHINX_ZICOND-NEXT:    or a0, a0, a1
+; RV64ZHINX_ZICOND-NEXT:    ld ra, 24(sp) # 8-byte Folded Reload
+; RV64ZHINX_ZICOND-NEXT:    ld s0, 16(sp) # 8-byte Folded Reload
+; RV64ZHINX_ZICOND-NEXT:    ld s1, 8(sp) # 8-byte Folded Reload
+; RV64ZHINX_ZICOND-NEXT:    addi sp, sp, 32
+; RV64ZHINX_ZICOND-NEXT:    ret
+;
+; RV64FD-LABEL: select_f64_fcmp:
+; RV64FD:       # %bb.0: # %entry
+; RV64FD-NEXT:    flt.d a0, fa1, fa0
+; RV64FD-NEXT:    fmv.d fa0, fa2
+; RV64FD-NEXT:    bnez a0, .LBB2_2
+; RV64FD-NEXT:  # %bb.1: # %entry
+; RV64FD-NEXT:    fmv.d fa0, fa3
+; RV64FD-NEXT:  .LBB2_2: # %entry
+; RV64FD-NEXT:    ret
+;
+; RV32ZFINX_ZICOND-LABEL: select_f64_fcmp:
+; RV32ZFINX_ZICOND:       # %bb.0: # %entry
+; RV32ZFINX_ZICOND-NEXT:    addi sp, sp, -32
+; RV32ZFINX_ZICOND-NEXT:    sw ra, 28(sp) # 4-byte Folded Spill
+; RV32ZFINX_ZICOND-NEXT:    sw s0, 24(sp) # 4-byte Folded Spill
+; RV32ZFINX_ZICOND-NEXT:    sw s1, 20(sp) # 4-byte Folded Spill
+; RV32ZFINX_ZICOND-NEXT:    sw s2, 16(sp) # 4-byte Folded Spill
+; RV32ZFINX_ZICOND-NEXT:    sw s3, 12(sp) # 4-byte Folded Spill
+; RV32ZFINX_ZICOND-NEXT:    mv s0, a7
+; RV32ZFINX_ZICOND-NEXT:    mv s1, a6
+; RV32ZFINX_ZICOND-NEXT:    mv s2, a5
+; RV32ZFINX_ZICOND-NEXT:    mv s3, a4
+; RV32ZFINX_ZICOND-NEXT:    call __gtdf2
+; RV32ZFINX_ZICOND-NEXT:    sgtz a0, a0
+; RV32ZFINX_ZICOND-NEXT:    czero.nez a1, s1, a0
+; RV32ZFINX_ZICOND-NEXT:    czero.eqz a2, s3, a0
+; RV32ZFINX_ZICOND-NEXT:    czero.nez a3, s0, a0
+; RV32ZFINX_ZICOND-NEXT:    czero.eqz a4, s2, a0
+; RV32ZFINX_ZICOND-NEXT:    or a0, a2, a1
+; RV32ZFINX_ZICOND-NEXT:    or a1, a4, a3
+; RV32ZFINX_ZICOND-NEXT:    lw ra, 28(sp) # 4-byte Folded Reload
+; RV32ZFINX_ZICOND-NEXT:    lw s0, 24(sp) # 4-byte Folded Reload
+; RV32ZFINX_ZICOND-NEXT:    lw s1, 20(sp) # 4-byte Folded Reload
+; RV32ZFINX_ZICOND-NEXT:    lw s2, 16(sp) # 4-byte Folded Reload
+; RV32ZFINX_ZICOND-NEXT:    lw s3, 12(sp) # 4-byte Folded Reload
+; RV32ZFINX_ZICOND-NEXT:    addi sp, sp, 32
+; RV32ZFINX_ZICOND-NEXT:    ret
+;
+; RV32ZFINX_NOZICOND-LABEL: select_f64_fcmp:
+; RV32ZFINX_NOZICOND:       # %bb.0: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    addi sp, sp, -32
+; RV32ZFINX_NOZICOND-NEXT:    sw ra, 28(sp) # 4-byte Folded Spill
+; RV32ZFINX_NOZICOND-NEXT:    sw s0, 24(sp) # 4-byte Folded Spill
+; RV32ZFINX_NOZICOND-NEXT:    sw s1, 20(sp) # 4-byte Folded Spill
+; RV32ZFINX_NOZICOND-NEXT:    sw s2, 16(sp) # 4-byte Folded Spill
+; RV32ZFINX_NOZICOND-NEXT:    sw s3, 12(sp) # 4-byte Folded Spill
+; RV32ZFINX_NOZICOND-NEXT:    mv s1, a7
+; RV32ZFINX_NOZICOND-NEXT:    mv s3, a6
+; RV32ZFINX_NOZICOND-NEXT:    mv s0, a5
+; RV32ZFINX_NOZICOND-NEXT:    mv s2, a4
+; RV32ZFINX_NOZICOND-NEXT:    call __gtdf2
+; RV32ZFINX_NOZICOND-NEXT:    bgtz a0, .LBB2_2
+; RV32ZFINX_NOZICOND-NEXT:  # %bb.1: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    mv s2, s3
+; RV32ZFINX_NOZICOND-NEXT:    mv s0, s1
+; RV32ZFINX_NOZICOND-NEXT:  .LBB2_2: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    mv a0, s2
+; RV32ZFINX_NOZICOND-NEXT:    mv a1, s0
+; RV32ZFINX_NOZICOND-NEXT:    lw ra, 28(sp) # 4-byte Folded Reload
+; RV32ZFINX_NOZICOND-NEXT:    lw s0, 24(sp) # 4-byte Folded Reload
+; RV32ZFINX_NOZICOND-NEXT:    lw s1, 20(sp) # 4-byte Folded Reload
+; RV32ZFINX_NOZICOND-NEXT:    lw s2, 16(sp) # 4-byte Folded Reload
+; RV32ZFINX_NOZICOND-NEXT:    lw s3, 12(sp) # 4-byte Folded Reload
+; RV32ZFINX_NOZICOND-NEXT:    addi sp, sp, 32
+; RV32ZFINX_NOZICOND-NEXT:    ret
+;
+; RV32ZDINX_ZICOND-LABEL: select_f64_fcmp:
+; RV32ZDINX_ZICOND:       # %bb.0: # %entry
+; RV32ZDINX_ZICOND-NEXT:    flt.d a0, a2, a0
+; RV32ZDINX_ZICOND-NEXT:    bnez a0, .LBB2_2
+; RV32ZDINX_ZICOND-NEXT:  # %bb.1: # %entry
+; RV32ZDINX_ZICOND-NEXT:    fmv.d a4, a6
+; RV32ZDINX_ZICOND-NEXT:  .LBB2_2: # %entry
+; RV32ZDINX_ZICOND-NEXT:    mv a0, a4
+; RV32ZDINX_ZICOND-NEXT:    mv a1, a5
+; RV32ZDINX_ZICOND-NEXT:    ret
+;
+; RV32ZDINX_NOZICOND-LABEL: select_f64_fcmp:
+; RV32ZDINX_NOZICOND:       # %bb.0: # %entry
+; RV32ZDINX_NOZICOND-NEXT:    flt.d a0, a2, a0
+; RV32ZDINX_NOZICOND-NEXT:    bnez a0, .LBB2_2
+; RV32ZDINX_NOZICOND-NEXT:  # %bb.1: # %entry
+; RV32ZDINX_NOZICOND-NEXT:    fmv.d a4, a6
+; RV32ZDINX_NOZICOND-NEXT:  .LBB2_2: # %entry
+; RV32ZDINX_NOZICOND-NEXT:    mv a0, a4
+; RV32ZDINX_NOZICOND-NEXT:    mv a1, a5
+; RV32ZDINX_NOZICOND-NEXT:    ret
+entry:
+  %cmp = fcmp ogt double %a, %b
+  %sel = select i1 %cmp, double %c, double %d
+  ret double %sel
+}
+
+; -----------------------------------------------------------------------------
+; half select with i1 condition (cond ? a : b), Zfinx
+; -----------------------------------------------------------------------------
+
+define dso_local noundef half @select_half_i1(i1 %cond, half %a, half %b) nounwind {
+; ZDINX_ZICOND-LABEL: select_half_i1:
+; ZDINX_ZICOND:       # %bb.0: # %entry
+; ZDINX_ZICOND-NEXT:    # kill: def $x12_w killed $x12_w def $x12
+; ZDINX_ZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; ZDINX_ZICOND-NEXT:    andi a0, a0, 1
+; ZDINX_ZICOND-NEXT:    czero.nez a2, a2, a0
+; ZDINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; ZDINX_ZICOND-NEXT:    or a0, a0, a2
+; ZDINX_ZICOND-NEXT:    lui a1, 1048560
+; ZDINX_ZICOND-NEXT:    or a0, a0, a1
+; ZDINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; ZDINX_ZICOND-NEXT:    ret
+;
+; ZDINX_NOZICOND-LABEL: select_half_i1:
+; ZDINX_NOZICOND:       # %bb.0: # %entry
+; ZDINX_NOZICOND-NEXT:    # kill: def $x12_w killed $x12_w def $x12
+; ZDINX_NOZICOND-NEXT:    andi a0, a0, 1
+; ZDINX_NOZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; ZDINX_NOZICOND-NEXT:    bnez a0, .LBB3_2
+; ZDINX_NOZICOND-NEXT:  # %bb.1: # %entry
+; ZDINX_NOZICOND-NEXT:    mv a1, a2
+; ZDINX_NOZICOND-NEXT:  .LBB3_2: # %entry
+; ZDINX_NOZICOND-NEXT:    lui a0, 1048560
+; ZDINX_NOZICOND-NEXT:    or a0, a1, a0
+; ZDINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; ZDINX_NOZICOND-NEXT:    ret
+;
+; RV64ZHINX_ZICOND-LABEL: select_half_i1:
+; RV64ZHINX_ZICOND:       # %bb.0: # %entry
+; RV64ZHINX_ZICOND-NEXT:    # kill: def $x12_h killed $x12_h def $x12
+; RV64ZHINX_ZICOND-NEXT:    # kill: def $x11_h killed $x11_h def $x11
+; RV64ZHINX_ZICOND-NEXT:    andi a0, a0, 1
+; RV64ZHINX_ZICOND-NEXT:    czero.nez a2, a2, a0
+; RV64ZHINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; RV64ZHINX_ZICOND-NEXT:    or a0, a0, a2
+; RV64ZHINX_ZICOND-NEXT:    # kill: def $x10_h killed $x10_h killed $x10
+; RV64ZHINX_ZICOND-NEXT:    ret
+;
+; RV64FD-LABEL: select_half_i1:
+; RV64FD:       # %bb.0: # %entry
+; RV64FD-NEXT:    andi a0, a0, 1
+; RV64FD-NEXT:    bnez a0, .LBB3_2
+; RV64FD-NEXT:  # %bb.1: # %entry
+; RV64FD-NEXT:    fmv.x.w a0, fa1
+; RV64FD-NEXT:    j .LBB3_3
+; RV64FD-NEXT:  .LBB3_2:
+; RV64FD-NEXT:    fmv.x.w a0, fa0
+; RV64FD-NEXT:  .LBB3_3: # %entry
+; RV64FD-NEXT:    lui a1, 1048560
+; RV64FD-NEXT:    or a0, a0, a1
+; RV64FD-NEXT:    fmv.w.x fa0, a0
+; RV64FD-NEXT:    ret
+;
+; RV32ZFINX_ZICOND-LABEL: select_half_i1:
+; RV32ZFINX_ZICOND:       # %bb.0: # %entry
+; RV32ZFINX_ZICOND-NEXT:    # kill: def $x12_w killed $x12_w def $x12
+; RV32ZFINX_ZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; RV32ZFINX_ZICOND-NEXT:    andi a0, a0, 1
+; RV32ZFINX_ZICOND-NEXT:    czero.nez a2, a2, a0
+; RV32ZFINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; RV32ZFINX_ZICOND-NEXT:    or a0, a0, a2
+; RV32ZFINX_ZICOND-NEXT:    lui a1, 1048560
+; RV32ZFINX_ZICOND-NEXT:    or a0, a0, a1
+; RV32ZFINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV32ZFINX_ZICOND-NEXT:    ret
+;
+; RV32ZFINX_NOZICOND-LABEL: select_half_i1:
+; RV32ZFINX_NOZICOND:       # %bb.0: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    # kill: def $x12_w killed $x12_w def $x12
+; RV32ZFINX_NOZICOND-NEXT:    andi a0, a0, 1
+; RV32ZFINX_NOZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; RV32ZFINX_NOZICOND-NEXT:    bnez a0, .LBB3_2
+; RV32ZFINX_NOZICOND-NEXT:  # %bb.1: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    mv a1, a2
+; RV32ZFINX_NOZICOND-NEXT:  .LBB3_2: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    lui a0, 1048560
+; RV32ZFINX_NOZICOND-NEXT:    or a0, a1, a0
+; RV32ZFINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV32ZFINX_NOZICOND-NEXT:    ret
+entry:
+  %sel = select i1 %cond, half %a, half %b
+  ret half %sel
+}
+
+; -----------------------------------------------------------------------------
+; Test select with i1 condition and zero ret val (cond ? a : 0), Zfinx
+; -----------------------------------------------------------------------------
+define dso_local noundef float @select_i1_f32_0(i1 %cond, float %t) nounwind {
+; ZDINX_ZICOND-LABEL: select_i1_f32_0:
+; ZDINX_ZICOND:       # %bb.0: # %entry
+; ZDINX_ZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; ZDINX_ZICOND-NEXT:    andi a0, a0, 1
+; ZDINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; ZDINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; ZDINX_ZICOND-NEXT:    ret
+;
+; ZDINX_NOZICOND-LABEL: select_i1_f32_0:
+; ZDINX_NOZICOND:       # %bb.0: # %entry
+; ZDINX_NOZICOND-NEXT:    andi a2, a0, 1
+; ZDINX_NOZICOND-NEXT:    mv a0, a1
+; ZDINX_NOZICOND-NEXT:    bnez a2, .LBB4_2
+; ZDINX_NOZICOND-NEXT:  # %bb.1: # %entry
+; ZDINX_NOZICOND-NEXT:    li a0, 0
+; ZDINX_NOZICOND-NEXT:  .LBB4_2: # %entry
+; ZDINX_NOZICOND-NEXT:    ret
+;
+; ZHINX_ZICOND-LABEL: select_i1_f32_0:
+; ZHINX_ZICOND:       # %bb.0: # %entry
+; ZHINX_ZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; ZHINX_ZICOND-NEXT:    andi a0, a0, 1
+; ZHINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; ZHINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; ZHINX_ZICOND-NEXT:    ret
+;
+; RV64FD-LABEL: select_i1_f32_0:
+; RV64FD:       # %bb.0: # %entry
+; RV64FD-NEXT:    andi a0, a0, 1
+; RV64FD-NEXT:    bnez a0, .LBB4_2
+; RV64FD-NEXT:  # %bb.1: # %entry
+; RV64FD-NEXT:    fmv.w.x fa0, zero
+; RV64FD-NEXT:  .LBB4_2: # %entry
+; RV64FD-NEXT:    ret
+;
+; RV32ZFINX_NOZICOND-LABEL: select_i1_f32_0:
+; RV32ZFINX_NOZICOND:       # %bb.0: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    andi a2, a0, 1
+; RV32ZFINX_NOZICOND-NEXT:    mv a0, a1
+; RV32ZFINX_NOZICOND-NEXT:    bnez a2, .LBB4_2
+; RV32ZFINX_NOZICOND-NEXT:  # %bb.1: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    li a0, 0
+; RV32ZFINX_NOZICOND-NEXT:  .LBB4_2: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    ret
+entry:
+  %sel = select i1 %cond, float %t, float 0.000000e+00
+  ret float %sel
+}
+
+; -----------------------------------------------------------------------------
+; Test select with i1 condition and zero ret val for half fp (cond ? a : 0)
+; -----------------------------------------------------------------------------
+define dso_local noundef half @select_i1_half_0(i1 %cond, half %val) nounwind {
+; ZDINX_ZICOND-LABEL: select_i1_half_0:
+; ZDINX_ZICOND:       # %bb.0: # %entry
+; ZDINX_ZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; ZDINX_ZICOND-NEXT:    andi a0, a0, 1
+; ZDINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; ZDINX_ZICOND-NEXT:    lui a1, 1048560
+; ZDINX_ZICOND-NEXT:    or a0, a0, a1
+; ZDINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; ZDINX_ZICOND-NEXT:    ret
+;
+; RV64ZDINX_NOZICOND-LABEL: select_i1_half_0:
+; RV64ZDINX_NOZICOND:       # %bb.0: # %entry
+; RV64ZDINX_NOZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; RV64ZDINX_NOZICOND-NEXT:    slli a0, a0, 63
+; RV64ZDINX_NOZICOND-NEXT:    srai a0, a0, 63
+; RV64ZDINX_NOZICOND-NEXT:    and a0, a0, a1
+; RV64ZDINX_NOZICOND-NEXT:    lui a1, 1048560
+; RV64ZDINX_NOZICOND-NEXT:    or a0, a0, a1
+; RV64ZDINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV64ZDINX_NOZICOND-NEXT:    ret
+;
+; RV64ZHINX_ZICOND-LABEL: select_i1_half_0:
+; RV64ZHINX_ZICOND:       # %bb.0: # %entry
+; RV64ZHINX_ZICOND-NEXT:    # kill: def $x11_h killed $x11_h def $x11
+; RV64ZHINX_ZICOND-NEXT:    andi a0, a0, 1
+; RV64ZHINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; RV64ZHINX_ZICOND-NEXT:    # kill: def $x10_h killed $x10_h killed $x10
+; RV64ZHINX_ZICOND-NEXT:    ret
+;
+; RV64FD-LABEL: select_i1_half_0:
+; RV64FD:       # %bb.0: # %entry
+; RV64FD-NEXT:    fmv.x.w a1, fa0
+; RV64FD-NEXT:    slli a0, a0, 63
+; RV64FD-NEXT:    srai a0, a0, 63
+; RV64FD-NEXT:    and a0, a0, a1
+; RV64FD-NEXT:    lui a1, 1048560
+; RV64FD-NEXT:    or a0, a0, a1
+; RV64FD-NEXT:    fmv.w.x fa0, a0
+; RV64FD-NEXT:    ret
+;
+; RV32ZFINX_ZICOND-LABEL: select_i1_half_0:
+; RV32ZFINX_ZICOND:       # %bb.0: # %entry
+; RV32ZFINX_ZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; RV32ZFINX_ZICOND-NEXT:    andi a0, a0, 1
+; RV32ZFINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; RV32ZFINX_ZICOND-NEXT:    lui a1, 1048560
+; RV32ZFINX_ZICOND-NEXT:    or a0, a0, a1
+; RV32ZFINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV32ZFINX_ZICOND-NEXT:    ret
+;
+; RV32ZFINX_NOZICOND-LABEL: select_i1_half_0:
+; RV32ZFINX_NOZICOND:       # %bb.0: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; RV32ZFINX_NOZICOND-NEXT:    slli a0, a0, 31
+; RV32ZFINX_NOZICOND-NEXT:    srai a0, a0, 31
+; RV32ZFINX_NOZICOND-NEXT:    and a0, a0, a1
+; RV32ZFINX_NOZICOND-NEXT:    lui a1, 1048560
+; RV32ZFINX_NOZICOND-NEXT:    or a0, a0, a1
+; RV32ZFINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV32ZFINX_NOZICOND-NEXT:    ret
+;
+; RV32ZDINX_NOZICOND-LABEL: select_i1_half_0:
+; RV32ZDINX_NOZICOND:       # %bb.0: # %entry
+; RV32ZDINX_NOZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; RV32ZDINX_NOZICOND-NEXT:    slli a0, a0, 31
+; RV32ZDINX_NOZICOND-NEXT:    srai a0, a0, 31
+; RV32ZDINX_NOZICOND-NEXT:    and a0, a0, a1
+; RV32ZDINX_NOZICOND-NEXT:    lui a1, 1048560
+; RV32ZDINX_NOZICOND-NEXT:    or a0, a0, a1
+; RV32ZDINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV32ZDINX_NOZICOND-NEXT:    ret
+entry:
+  %sel = select i1 %cond, half %val, half 0xH0000
+  ret half %sel
+}
+
+; -----------------------------------------------------------------------------
+; Test select with i1 condition and zero value for half fp, feeding into fadd ((cond ? a : 0) + 1.0)
+; -----------------------------------------------------------------------------
+define half @select_i1_half_0_add(i1 %cond, half %val) nounwind {
+; RV64ZDINX_ZICOND-LABEL: select_i1_half_0_add:
+; RV64ZDINX_ZICOND:       # %bb.0: # %entry
+; RV64ZDINX_ZICOND-NEXT:    addi sp, sp, -16
+; RV64ZDINX_ZICOND-NEXT:    sd ra, 8(sp) # 8-byte Folded Spill
+; RV64ZDINX_ZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; RV64ZDINX_ZICOND-NEXT:    andi a0, a0, 1
+; RV64ZDINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; RV64ZDINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV64ZDINX_ZICOND-NEXT:    call __extendhfsf2
+; RV64ZDINX_ZICOND-NEXT:    lui a1, 260096
+; RV64ZDINX_ZICOND-NEXT:    fadd.s a0, a0, a1
+; RV64ZDINX_ZICOND-NEXT:    call __truncsfhf2
+; RV64ZDINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w def $x10
+; RV64ZDINX_ZICOND-NEXT:    lui a1, 1048560
+; RV64ZDINX_ZICOND-NEXT:    or a0, a0, a1
+; RV64ZDINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV64ZDINX_ZICOND-NEXT:    ld ra, 8(sp) # 8-byte Folded Reload
+; RV64ZDINX_ZICOND-NEXT:    addi sp, sp, 16
+; RV64ZDINX_ZICOND-NEXT:    ret
+;
+; RV64ZDINX_NOZICOND-LABEL: select_i1_half_0_add:
+; RV64ZDINX_NOZICOND:       # %bb.0: # %entry
+; RV64ZDINX_NOZICOND-NEXT:    addi sp, sp, -16
+; RV64ZDINX_NOZICOND-NEXT:    sd ra, 8(sp) # 8-byte Folded Spill
+; RV64ZDINX_NOZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; RV64ZDINX_NOZICOND-NEXT:    slli a0, a0, 63
+; RV64ZDINX_NOZICOND-NEXT:    srai a0, a0, 63
+; RV64ZDINX_NOZICOND-NEXT:    and a0, a0, a1
+; RV64ZDINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV64ZDINX_NOZICOND-NEXT:    call __extendhfsf2
+; RV64ZDINX_NOZICOND-NEXT:    lui a1, 260096
+; RV64ZDINX_NOZICOND-NEXT:    fadd.s a0, a0, a1
+; RV64ZDINX_NOZICOND-NEXT:    call __truncsfhf2
+; RV64ZDINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w def $x10
+; RV64ZDINX_NOZICOND-NEXT:    lui a1, 1048560
+; RV64ZDINX_NOZICOND-NEXT:    or a0, a0, a1
+; RV64ZDINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV64ZDINX_NOZICOND-NEXT:    ld ra, 8(sp) # 8-byte Folded Reload
+; RV64ZDINX_NOZICOND-NEXT:    addi sp, sp, 16
+; RV64ZDINX_NOZICOND-NEXT:    ret
+;
+; RV64ZHINX_ZICOND-LABEL: select_i1_half_0_add:
+; RV64ZHINX_ZICOND:       # %bb.0: # %entry
+; RV64ZHINX_ZICOND-NEXT:    # kill: def $x11_h killed $x11_h def $x11
+; RV64ZHINX_ZICOND-NEXT:    andi a0, a0, 1
+; RV64ZHINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; RV64ZHINX_ZICOND-NEXT:    li a1, 15
+; RV64ZHINX_ZICOND-NEXT:    slli a1, a1, 10
+; RV64ZHINX_ZICOND-NEXT:    fadd.h a0, a0, a1
+; RV64ZHINX_ZICOND-NEXT:    ret
+;
+; RV64FD-LABEL: select_i1_half_0_add:
+; RV64FD:       # %bb.0: # %entry
+; RV64FD-NEXT:    addi sp, sp, -16
+; RV64FD-NEXT:    sd ra, 8(sp) # 8-byte Folded Spill
+; RV64FD-NEXT:    fmv.x.w a1, fa0
+; RV64FD-NEXT:    slli a0, a0, 63
+; RV64FD-NEXT:    srai a0, a0, 63
+; RV64FD-NEXT:    and a0, a0, a1
+; RV64FD-NEXT:    fmv.w.x fa0, a0
+; RV64FD-NEXT:    call __extendhfsf2
+; RV64FD-NEXT:    lui a0, 260096
+; RV64FD-NEXT:    fmv.w.x fa5, a0
+; RV64FD-NEXT:    fadd.s fa0, fa0, fa5
+; RV64FD-NEXT:    call __truncsfhf2
+; RV64FD-NEXT:    fmv.x.w a0, fa0
+; RV64FD-NEXT:    lui a1, 1048560
+; RV64FD-NEXT:    or a0, a0, a1
+; RV64FD-NEXT:    fmv.w.x fa0, a0
+; RV64FD-NEXT:    ld ra, 8(sp) # 8-byte Folded Reload
+; RV64FD-NEXT:    addi sp, sp, 16
+; RV64FD-NEXT:    ret
+;
+; RV32ZFINX_ZICOND-LABEL: select_i1_half_0_add:
+; RV32ZFINX_ZICOND:       # %bb.0: # %entry
+; RV32ZFINX_ZICOND-NEXT:    addi sp, sp, -16
+; RV32ZFINX_ZICOND-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
+; RV32ZFINX_ZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; RV32ZFINX_ZICOND-NEXT:    andi a0, a0, 1
+; RV32ZFINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; RV32ZFINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV32ZFINX_ZICOND-NEXT:    call __extendhfsf2
+; RV32ZFINX_ZICOND-NEXT:    lui a1, 260096
+; RV32ZFINX_ZICOND-NEXT:    fadd.s a0, a0, a1
+; RV32ZFINX_ZICOND-NEXT:    call __truncsfhf2
+; RV32ZFINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w def $x10
+; RV32ZFINX_ZICOND-NEXT:    lui a1, 1048560
+; RV32ZFINX_ZICOND-NEXT:    or a0, a0, a1
+; RV32ZFINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV32ZFINX_ZICOND-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
+; RV32ZFINX_ZICOND-NEXT:    addi sp, sp, 16
+; RV32ZFINX_ZICOND-NEXT:    ret
+;
+; RV32ZFINX_NOZICOND-LABEL: select_i1_half_0_add:
+; RV32ZFINX_NOZICOND:       # %bb.0: # %entry
+; RV32ZFINX_NOZICOND-NEXT:    addi sp, sp, -16
+; RV32ZFINX_NOZICOND-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
+; RV32ZFINX_NOZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; RV32ZFINX_NOZICOND-NEXT:    slli a0, a0, 31
+; RV32ZFINX_NOZICOND-NEXT:    srai a0, a0, 31
+; RV32ZFINX_NOZICOND-NEXT:    and a0, a0, a1
+; RV32ZFINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV32ZFINX_NOZICOND-NEXT:    call __extendhfsf2
+; RV32ZFINX_NOZICOND-NEXT:    lui a1, 260096
+; RV32ZFINX_NOZICOND-NEXT:    fadd.s a0, a0, a1
+; RV32ZFINX_NOZICOND-NEXT:    call __truncsfhf2
+; RV32ZFINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w def $x10
+; RV32ZFINX_NOZICOND-NEXT:    lui a1, 1048560
+; RV32ZFINX_NOZICOND-NEXT:    or a0, a0, a1
+; RV32ZFINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV32ZFINX_NOZICOND-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
+; RV32ZFINX_NOZICOND-NEXT:    addi sp, sp, 16
+; RV32ZFINX_NOZICOND-NEXT:    ret
+;
+; RV32ZDINX_ZICOND-LABEL: select_i1_half_0_add:
+; RV32ZDINX_ZICOND:       # %bb.0: # %entry
+; RV32ZDINX_ZICOND-NEXT:    addi sp, sp, -16
+; RV32ZDINX_ZICOND-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
+; RV32ZDINX_ZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; RV32ZDINX_ZICOND-NEXT:    andi a0, a0, 1
+; RV32ZDINX_ZICOND-NEXT:    czero.eqz a0, a1, a0
+; RV32ZDINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV32ZDINX_ZICOND-NEXT:    call __extendhfsf2
+; RV32ZDINX_ZICOND-NEXT:    lui a1, 260096
+; RV32ZDINX_ZICOND-NEXT:    fadd.s a0, a0, a1
+; RV32ZDINX_ZICOND-NEXT:    call __truncsfhf2
+; RV32ZDINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w def $x10
+; RV32ZDINX_ZICOND-NEXT:    lui a1, 1048560
+; RV32ZDINX_ZICOND-NEXT:    or a0, a0, a1
+; RV32ZDINX_ZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV32ZDINX_ZICOND-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
+; RV32ZDINX_ZICOND-NEXT:    addi sp, sp, 16
+; RV32ZDINX_ZICOND-NEXT:    ret
+;
+; RV32ZDINX_NOZICOND-LABEL: select_i1_half_0_add:
+; RV32ZDINX_NOZICOND:       # %bb.0: # %entry
+; RV32ZDINX_NOZICOND-NEXT:    addi sp, sp, -16
+; RV32ZDINX_NOZICOND-NEXT:    sw ra, 12(sp) # 4-byte Folded Spill
+; RV32ZDINX_NOZICOND-NEXT:    # kill: def $x11_w killed $x11_w def $x11
+; RV32ZDINX_NOZICOND-NEXT:    slli a0, a0, 31
+; RV32ZDINX_NOZICOND-NEXT:    srai a0, a0, 31
+; RV32ZDINX_NOZICOND-NEXT:    and a0, a0, a1
+; RV32ZDINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV32ZDINX_NOZICOND-NEXT:    call __extendhfsf2
+; RV32ZDINX_NOZICOND-NEXT:    lui a1, 260096
+; RV32ZDINX_NOZICOND-NEXT:    fadd.s a0, a0, a1
+; RV32ZDINX_NOZICOND-NEXT:    call __truncsfhf2
+; RV32ZDINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w def $x10
+; RV32ZDINX_NOZICOND-NEXT:    lui a1, 1048560
+; RV32ZDINX_NOZICOND-NEXT:    or a0, a0, a1
+; RV32ZDINX_NOZICOND-NEXT:    # kill: def $x10_w killed $x10_w killed $x10
+; RV32ZDINX_NOZICOND-NEXT:    lw ra, 12(sp) # 4-byte Folded Reload
+; RV32ZDINX_NOZICOND-NEXT:    addi sp, sp, 16
+; RV32ZDINX_NOZICOND-NEXT:    ret
+entry:
+  %sel = select i1 %cond, half %val, half 0xH0000
+  %add = fadd half %sel, 1.0
+  ret half %add
+}
diff --git a/llvm/test/CodeGen/SPARC/fp128-abi.ll b/llvm/test/CodeGen/SPARC/fp128-abi.ll
new file mode 100644
index 0000000000000..b598d1b004832
--- /dev/null
+++ b/llvm/test/CodeGen/SPARC/fp128-abi.ll
@@ -0,0 +1,164 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=sparc   | FileCheck %s --check-prefix=SPARC32
+; RUN: llc < %s -mtriple=sparcv9 | FileCheck %s --check-prefix=SPARC64
+
+define fp128 @f128_direct(fp128 %num) nounwind {
+; SPARC32-LABEL: f128_direct:
+; SPARC32:       ! %bb.0:
+; SPARC32-NEXT:    save %sp, -144, %sp
+; SPARC32-NEXT:    ldd [%i0], %f0
+; SPARC32-NEXT:    ldd [%i0+8], %f4
+; SPARC32-NEXT:    ld [%fp+64], %i0
+; SPARC32-NEXT:    add %fp, -16, %i1
+; SPARC32-NEXT:    st %i1, [%sp+64]
+; SPARC32-NEXT:    std %f4, [%fp+-40]
+; SPARC32-NEXT:    std %f0, [%fp+-48]
+; SPARC32-NEXT:    std %f4, [%fp+-24]
+; SPARC32-NEXT:    add %fp, -32, %o0
+; SPARC32-NEXT:    add %fp, -48, %o1
+; SPARC32-NEXT:    call f128_callee
+; SPARC32-NEXT:    std %f0, [%fp+-32]
+; SPARC32-NEXT:    unimp 16
+; SPARC32-NEXT:    ldd [%fp+-8], %f0
+; SPARC32-NEXT:    ldd [%fp+-16], %f4
+; SPARC32-NEXT:    std %f0, [%i0+8]
+; SPARC32-NEXT:    std %f4, [%i0]
+; SPARC32-NEXT:    ret
+; SPARC32-NEXT:    restore
+;
+; SPARC64-LABEL: f128_direct:
+; SPARC64:       ! %bb.0:
+; SPARC64-NEXT:    save %sp, -176, %sp
+; SPARC64-NEXT:    fmovd %f0, %f4
+; SPARC64-NEXT:    fmovd %f2, %f6
+; SPARC64-NEXT:    call f128_callee
+; SPARC64-NEXT:    nop
+; SPARC64-NEXT:    ret
+; SPARC64-NEXT:    restore
+    %ret = call fp128 @f128_callee(fp128 %num, fp128 %num)
+    ret fp128 %ret
+}
+declare fp128 @f128_callee(fp128 %a, fp128 %b)
+
+define fp128 @f128_direct_spill(i32 %o0, i32 %o1, i32 %o2, i32 %o3, i32 %o4, i32 %o5, i32 %ss0, fp128 %num) nounwind {
+; SPARC32-LABEL: f128_direct_spill:
+; SPARC32:       ! %bb.0:
+; SPARC32-NEXT:    save %sp, -136, %sp
+; SPARC32-NEXT:    ld [%fp+96], %g2
+; SPARC32-NEXT:    ldd [%g2], %f0
+; SPARC32-NEXT:    ldd [%g2+8], %f4
+; SPARC32-NEXT:    ld [%fp+64], %l0
+; SPARC32-NEXT:    mov %i5, %o5
+; SPARC32-NEXT:    mov %i4, %o4
+; SPARC32-NEXT:    mov %i3, %o3
+; SPARC32-NEXT:    mov %i2, %o2
+; SPARC32-NEXT:    mov %i1, %o1
+; SPARC32-NEXT:    mov %i0, %o0
+; SPARC32-NEXT:    add %fp, -32, %i0
+; SPARC32-NEXT:    st %i0, [%sp+92]
+; SPARC32-NEXT:    add %fp, -16, %i0
+; SPARC32-NEXT:    st %i0, [%sp+64]
+; SPARC32-NEXT:    std %f4, [%fp+-24]
+; SPARC32-NEXT:    call f128_callee_spill
+; SPARC32-NEXT:    std %f0, [%fp+-32]
+; SPARC32-NEXT:    unimp 16
+; SPARC32-NEXT:    ldd [%fp+-8], %f0
+; SPARC32-NEXT:    ldd [%fp+-16], %f4
+; SPARC32-NEXT:    std %f0, [%l0+8]
+; SPARC32-NEXT:    std %f4, [%l0]
+; SPARC32-NEXT:    ret
+; SPARC32-NEXT:    restore
+;
+; SPARC64-LABEL: f128_direct_spill:
+; SPARC64:       ! %bb.0:
+; SPARC64-NEXT:    save %sp, -192, %sp
+; SPARC64-NEXT:    fmovd %f16, %f12
+; SPARC64-NEXT:    fmovd %f18, %f14
+; SPARC64-NEXT:    mov %i5, %o5
+; SPARC64-NEXT:    mov %i4, %o4
+; SPARC64-NEXT:    mov %i3, %o3
+; SPARC64-NEXT:    mov %i2, %o2
+; SPARC64-NEXT:    mov %i1, %o1
+; SPARC64-NEXT:    call f128_callee_spill
+; SPARC64-NEXT:    mov %i0, %o0
+; SPARC64-NEXT:    ret
+; SPARC64-NEXT:    restore
+    %ret = call fp128 @f128_callee_spill(i32 %o0, i32 %o1, i32 %o2, i32 %o3, i32 %o4, i32 %o5, fp128 %num)
+    ret fp128 %ret
+}
+declare fp128 @f128_callee_spill(i32 %o0, i32 %o1, i32 %o2, i32 %o3, i32 %o4, i32 %o5, fp128 %a)
+
+define inreg { fp128, fp128 } @f128_complex(fp128 %num) nounwind {
+; SPARC32-LABEL: f128_complex:
+; SPARC32:       ! %bb.0:
+; SPARC32-NEXT:    save %sp, -192, %sp
+; SPARC32-NEXT:    ldd [%i0], %f0
+; SPARC32-NEXT:    ldd [%i0+8], %f4
+; SPARC32-NEXT:    std %f4, [%fp+-24]
+; SPARC32-NEXT:    std %f0, [%fp+-32]
+; SPARC32-NEXT:    std %f4, [%fp+-8]
+; SPARC32-NEXT:    add %fp, -16, %o0
+; SPARC32-NEXT:    add %fp, -32, %o1
+; SPARC32-NEXT:    call f128_complex_callee
+; SPARC32-NEXT:    std %f0, [%fp+-16]
+; SPARC32-NEXT:    sethi %hi(.LCPI2_0), %i0
+; SPARC32-NEXT:    ldd [%i0+%lo(.LCPI2_0)], %f8
+; SPARC32-NEXT:    add %i0, %lo(.LCPI2_0), %i0
+; SPARC32-NEXT:    ldd [%i0+8], %f12
+; SPARC32-NEXT:    std %f4, [%fp+-96]
+; SPARC32-NEXT:    std %f6, [%fp+-88] ! 16-byte Folded Spill
+; SPARC32-NEXT:    std %f8, [%fp+-80]
+; SPARC32-NEXT:    std %f12, [%fp+-72]
+; SPARC32-NEXT:    std %f2, [%fp+-56]
+; SPARC32-NEXT:    std %f0, [%fp+-64]
+; SPARC32-NEXT:    add %fp, -48, %i0
+; SPARC32-NEXT:    add %fp, -64, %o0
+; SPARC32-NEXT:    add %fp, -80, %o1
+; SPARC32-NEXT:    call _Q_add
+; SPARC32-NEXT:    st %i0, [%sp+64]
+; SPARC32-NEXT:    unimp 16
+; SPARC32-NEXT:    ldd [%fp+-48], %f0
+; SPARC32-NEXT:    ldd [%fp+-40], %f2
+; SPARC32-NEXT:    ldd [%fp+-96], %f4
+; SPARC32-NEXT:    ldd [%fp+-88], %f6 ! 16-byte Folded Reload
+; SPARC32-NEXT:    ret
+; SPARC32-NEXT:    restore
+;
+; SPARC64-LABEL: f128_complex:
+; SPARC64:       ! %bb.0:
+; SPARC64-NEXT:    save %sp, -240, %sp
+; SPARC64-NEXT:    fmovd %f0, %f4
+; SPARC64-NEXT:    fmovd %f2, %f6
+; SPARC64-NEXT:    call f128_complex_callee
+; SPARC64-NEXT:    nop
+; SPARC64-NEXT:    std %f4, [%fp+1983]
+; SPARC64-NEXT:    std %f6, [%fp+1991] ! 16-byte Folded Spill
+; SPARC64-NEXT:    sethi %h44(.LCPI2_0), %i0
+; SPARC64-NEXT:    add %i0, %m44(.LCPI2_0), %i0
+; SPARC64-NEXT:    sllx %i0, 12, %i0
+; SPARC64-NEXT:    ldd [%i0+%l44(.LCPI2_0)], %f4
+; SPARC64-NEXT:    add %i0, %l44(.LCPI2_0), %i0
+; SPARC64-NEXT:    ldd [%i0+8], %f8
+; SPARC64-NEXT:    std %f2, [%fp+2023]
+; SPARC64-NEXT:    std %f0, [%fp+2015]
+; SPARC64-NEXT:    std %f4, [%fp+1999]
+; SPARC64-NEXT:    std %f8, [%fp+2007]
+; SPARC64-NEXT:    add %fp, 2031, %o0
+; SPARC64-NEXT:    add %fp, 2015, %o1
+; SPARC64-NEXT:    call _Qp_add
+; SPARC64-NEXT:    add %fp, 1999, %o2
+; SPARC64-NEXT:    ldd [%fp+2031], %f0
+; SPARC64-NEXT:    ldd [%fp+2039], %f2
+; SPARC64-NEXT:    ldd [%fp+1983], %f4
+; SPARC64-NEXT:    ldd [%fp+1991], %f6 ! 16-byte Folded Reload
+; SPARC64-NEXT:    ret
+; SPARC64-NEXT:    restore
+    %call = call inreg { fp128, fp128 } @f128_complex_callee(fp128 %num, fp128 %num)
+    %real = extractvalue { fp128, fp128 } %call, 0
+    %imag = extractvalue { fp128, fp128 } %call, 1
+    %add  = fadd fp128 %real, 0xL00000000000000003FFF000000000000
+    %tmp = insertvalue { fp128, fp128 } poison, fp128 %add, 0
+    %ret = insertvalue { fp128, fp128 } %tmp, fp128 %imag, 1
+    ret { fp128, fp128 } %ret
+}
+declare inreg { fp128, fp128 } @f128_complex_callee(fp128 %a, fp128 %b)
diff --git a/llvm/test/CodeGen/SPARC/fp16-promote.ll b/llvm/test/CodeGen/SPARC/fp16-promote.ll
index 64873b744de50..4e46fd073923e 100644
--- a/llvm/test/CodeGen/SPARC/fp16-promote.ll
+++ b/llvm/test/CodeGen/SPARC/fp16-promote.ll
@@ -268,19 +268,20 @@ define void @test_fptrunc_double(double %d, ptr %p) nounwind {
 define void @test_fptrunc_fp128(ptr %dp, ptr %p) nounwind {
 ; V8-OPT-LABEL: test_fptrunc_fp128:
 ; V8-OPT:       ! %bb.0:
-; V8-OPT-NEXT:    save %sp, -104, %sp
+; V8-OPT-NEXT:    save %sp, -112, %sp
 ; V8-OPT-NEXT:    ldd [%i0], %f0
 ; V8-OPT-NEXT:    ldd [%i0+8], %f4
-; V8-OPT-NEXT:    std %f4, [%sp+100]
+; V8-OPT-NEXT:    std %f4, [%fp+-8]
+; V8-OPT-NEXT:    add %fp, -16, %o0
 ; V8-OPT-NEXT:    call __trunctfhf2
-; V8-OPT-NEXT:    std %f0, [%sp+92]
+; V8-OPT-NEXT:    std %f0, [%fp+-16]
 ; V8-OPT-NEXT:    sth %o0, [%i1]
 ; V8-OPT-NEXT:    ret
 ; V8-OPT-NEXT:    restore
 ;
 ; V8-UNOPT-LABEL: test_fptrunc_fp128:
 ; V8-UNOPT:       ! %bb.0:
-; V8-UNOPT-NEXT:    save %sp, -104, %sp
+; V8-UNOPT-NEXT:    save %sp, -112, %sp
 ; V8-UNOPT-NEXT:    ldd [%i0], %f4
 ; V8-UNOPT-NEXT:    ! implicit-def: $q0
 ; V8-UNOPT-NEXT:    fmovs %f4, %f0
@@ -290,22 +291,24 @@ define void @test_fptrunc_fp128(ptr %dp, ptr %p) nounwind {
 ; V8-UNOPT-NEXT:    fmovs %f5, %f3
 ; V8-UNOPT-NEXT:    fmovs %f2, %f4
 ; V8-UNOPT-NEXT:    fmovs %f3, %f5
-; V8-UNOPT-NEXT:    std %f4, [%sp+100]
+; V8-UNOPT-NEXT:    std %f4, [%fp+-8]
 ; V8-UNOPT-NEXT:    ! kill: def $d0 killed $d0 killed $q0
+; V8-UNOPT-NEXT:    std %f0, [%fp+-16]
 ; V8-UNOPT-NEXT:    call __trunctfhf2
-; V8-UNOPT-NEXT:    std %f0, [%sp+92]
+; V8-UNOPT-NEXT:    add %fp, -16, %o0
 ; V8-UNOPT-NEXT:    sth %o0, [%i1]
 ; V8-UNOPT-NEXT:    ret
 ; V8-UNOPT-NEXT:    restore
 ;
 ; V9-LABEL: test_fptrunc_fp128:
 ; V9:       ! %bb.0:
-; V9-NEXT:    save %sp, -104, %sp
+; V9-NEXT:    save %sp, -112, %sp
 ; V9-NEXT:    ldd [%i0], %f0
 ; V9-NEXT:    ldd [%i0+8], %f4
-; V9-NEXT:    std %f4, [%sp+100]
+; V9-NEXT:    std %f4, [%fp+-8]
+; V9-NEXT:    add %fp, -16, %o0
 ; V9-NEXT:    call __trunctfhf2
-; V9-NEXT:    std %f0, [%sp+92]
+; V9-NEXT:    std %f0, [%fp+-16]
 ; V9-NEXT:    sth %o0, [%i1]
 ; V9-NEXT:    ret
 ; V9-NEXT:    restore
diff --git a/llvm/test/CodeGen/SPARC/llvm.sincos.ll b/llvm/test/CodeGen/SPARC/llvm.sincos.ll
index 8d0d50f67e3f5..ea5de64607042 100644
--- a/llvm/test/CodeGen/SPARC/llvm.sincos.ll
+++ b/llvm/test/CodeGen/SPARC/llvm.sincos.ll
@@ -943,42 +943,38 @@ define { <2 x double>, <2 x double> } @test_sincos_v2f64(<2 x double> %a) #0 {
 define void @test_sincos_f128(ptr sret({ fp128, fp128 }) %ret, ptr %in) #0 {
 ; SPARC32-LABEL: test_sincos_f128:
 ; SPARC32:       ! %bb.0:
-; SPARC32-NEXT:    save %sp, -168, %sp
+; SPARC32-NEXT:    save %sp, -184, %sp
 ; SPARC32-NEXT:    ld [%fp+64], %i1
 ; SPARC32-NEXT:    ldd [%i0], %f0
-; SPARC32-NEXT:    std %f0, [%fp+-64]
-; SPARC32-NEXT:    std %f2, [%fp+-56] ! 16-byte Folded Spill
+; SPARC32-NEXT:    std %f0, [%fp+-88]
+; SPARC32-NEXT:    std %f2, [%fp+-80] ! 16-byte Folded Spill
 ; SPARC32-NEXT:    ldd [%i0+8], %f4
-; SPARC32-NEXT:    std %f4, [%fp+-48] ! 8-byte Folded Spill
-; SPARC32-NEXT:    add %fp, -32, %i0
+; SPARC32-NEXT:    std %f4, [%fp+-72] ! 8-byte Folded Spill
+; SPARC32-NEXT:    add %fp, -48, %i0
 ; SPARC32-NEXT:    st %i0, [%sp+64]
-; SPARC32-NEXT:    std %f4, [%sp+100]
+; SPARC32-NEXT:    std %f4, [%fp+-56]
+; SPARC32-NEXT:    add %fp, -64, %o0
 ; SPARC32-NEXT:    call sinl
-; SPARC32-NEXT:    std %f0, [%sp+92]
+; SPARC32-NEXT:    std %f0, [%fp+-64]
 ; SPARC32-NEXT:    unimp 16
 ; SPARC32-NEXT:    add %fp, -16, %i0
 ; SPARC32-NEXT:    st %i0, [%sp+64]
-; SPARC32-NEXT:    ldd [%fp+-48], %f0 ! 8-byte Folded Reload
-; SPARC32-NEXT:    std %f0, [%sp+100]
-; SPARC32-NEXT:    ldd [%fp+-64], %f0
-; SPARC32-NEXT:    ldd [%fp+-56], %f2 ! 16-byte Folded Reload
-; SPARC32-NEXT:    std %f0, [%sp+92]
-; SPARC32-NEXT:    ldd [%fp+-32], %f0
-; SPARC32-NEXT:    std %f0, [%fp+-48]
-; SPARC32-NEXT:    std %f2, [%fp+-40] ! 16-byte Folded Spill
-; SPARC32-NEXT:    ldd [%fp+-24], %f0
+; SPARC32-NEXT:    ldd [%fp+-72], %f0 ! 8-byte Folded Reload
+; SPARC32-NEXT:    std %f0, [%fp+-24]
+; SPARC32-NEXT:    add %fp, -32, %o0
+; SPARC32-NEXT:    ldd [%fp+-88], %f0
+; SPARC32-NEXT:    ldd [%fp+-80], %f2 ! 16-byte Folded Reload
 ; SPARC32-NEXT:    call cosl
-; SPARC32-NEXT:    std %f0, [%fp+-64]
+; SPARC32-NEXT:    std %f0, [%fp+-32]
 ; SPARC32-NEXT:    unimp 16
 ; SPARC32-NEXT:    ldd [%fp+-8], %f0
 ; SPARC32-NEXT:    ldd [%fp+-16], %f4
+; SPARC32-NEXT:    ldd [%fp+-40], %f2
+; SPARC32-NEXT:    ldd [%fp+-48], %f8
 ; SPARC32-NEXT:    std %f0, [%i1+24]
 ; SPARC32-NEXT:    std %f4, [%i1+16]
-; SPARC32-NEXT:    ldd [%fp+-64], %f0 ! 8-byte Folded Reload
-; SPARC32-NEXT:    std %f0, [%i1+8]
-; SPARC32-NEXT:    ldd [%fp+-48], %f0
-; SPARC32-NEXT:    ldd [%fp+-40], %f2 ! 16-byte Folded Reload
-; SPARC32-NEXT:    std %f0, [%i1]
+; SPARC32-NEXT:    std %f2, [%i1+8]
+; SPARC32-NEXT:    std %f8, [%i1]
 ; SPARC32-NEXT:    jmp %i7+12
 ; SPARC32-NEXT:    restore %g0, %i1, %o0
 ;
@@ -1006,15 +1002,16 @@ define void @test_sincos_f128(ptr sret({ fp128, fp128 }) %ret, ptr %in) #0 {
 ;
 ; GNU32-LABEL: test_sincos_f128:
 ; GNU32:       ! %bb.0:
-; GNU32-NEXT:    save %sp, -136, %sp
+; GNU32-NEXT:    save %sp, -144, %sp
 ; GNU32-NEXT:    ld [%fp+64], %i1
 ; GNU32-NEXT:    ldd [%i0], %f0
 ; GNU32-NEXT:    ldd [%i0+8], %f4
-; GNU32-NEXT:    std %f4, [%sp+100]
-; GNU32-NEXT:    add %fp, -16, %o0
-; GNU32-NEXT:    add %fp, -32, %o1
+; GNU32-NEXT:    std %f4, [%fp+-40]
+; GNU32-NEXT:    add %fp, -48, %o0
+; GNU32-NEXT:    add %fp, -16, %o1
+; GNU32-NEXT:    add %fp, -32, %o2
 ; GNU32-NEXT:    call sincosl
-; GNU32-NEXT:    std %f0, [%sp+92]
+; GNU32-NEXT:    std %f0, [%fp+-48]
 ; GNU32-NEXT:    ldd [%fp+-24], %f0
 ; GNU32-NEXT:    ldd [%fp+-32], %f4
 ; GNU32-NEXT:    ldd [%fp+-8], %f2
@@ -1057,85 +1054,71 @@ define void @test_sincos_f128(ptr sret({ fp128, fp128 }) %ret, ptr %in) #0 {
 define void @test_sincos_v2f128(ptr sret({ <2 x fp128>, <2 x fp128> }) %ret, ptr %in) #0 {
 ; SPARC32-LABEL: test_sincos_v2f128:
 ; SPARC32:       ! %bb.0:
-; SPARC32-NEXT:    save %sp, -248, %sp
+; SPARC32-NEXT:    save %sp, -272, %sp
 ; SPARC32-NEXT:    mov %i0, %i1
 ; SPARC32-NEXT:    ld [%fp+64], %i0
 ; SPARC32-NEXT:    ldd [%i1], %f0
-; SPARC32-NEXT:    std %f0, [%fp+-80]
-; SPARC32-NEXT:    std %f2, [%fp+-72] ! 16-byte Folded Spill
+; SPARC32-NEXT:    std %f0, [%fp+-144]
+; SPARC32-NEXT:    std %f2, [%fp+-136] ! 16-byte Folded Spill
 ; SPARC32-NEXT:    ldd [%i1+8], %f0
-; SPARC32-NEXT:    std %f0, [%fp+-88] ! 8-byte Folded Spill
+; SPARC32-NEXT:    std %f0, [%fp+-152] ! 8-byte Folded Spill
 ; SPARC32-NEXT:    ldd [%i1+16], %f0
-; SPARC32-NEXT:    std %f0, [%fp+-120]
-; SPARC32-NEXT:    std %f2, [%fp+-112] ! 16-byte Folded Spill
+; SPARC32-NEXT:    std %f0, [%fp+-176]
+; SPARC32-NEXT:    std %f2, [%fp+-168] ! 16-byte Folded Spill
 ; SPARC32-NEXT:    ldd [%i1+24], %f4
-; SPARC32-NEXT:    std %f4, [%fp+-104] ! 8-byte Folded Spill
-; SPARC32-NEXT:    add %fp, -64, %i1
+; SPARC32-NEXT:    std %f4, [%fp+-160] ! 8-byte Folded Spill
+; SPARC32-NEXT:    add %fp, -112, %i1
 ; SPARC32-NEXT:    st %i1, [%sp+64]
-; SPARC32-NEXT:    std %f4, [%sp+100]
+; SPARC32-NEXT:    std %f4, [%fp+-120]
+; SPARC32-NEXT:    add %fp, -128, %o0
 ; SPARC32-NEXT:    call sinl
-; SPARC32-NEXT:    std %f0, [%sp+92]
+; SPARC32-NEXT:    std %f0, [%fp+-128]
 ; SPARC32-NEXT:    unimp 16
 ; SPARC32-NEXT:    add %fp, -16, %i1
 ; SPARC32-NEXT:    st %i1, [%sp+64]
-; SPARC32-NEXT:    ldd [%fp+-88], %f0 ! 8-byte Folded Reload
-; SPARC32-NEXT:    std %f0, [%sp+100]
-; SPARC32-NEXT:    ldd [%fp+-80], %f0
-; SPARC32-NEXT:    ldd [%fp+-72], %f2 ! 16-byte Folded Reload
+; SPARC32-NEXT:    ldd [%fp+-152], %f0 ! 8-byte Folded Reload
+; SPARC32-NEXT:    std %f0, [%fp+-24]
+; SPARC32-NEXT:    add %fp, -32, %o0
+; SPARC32-NEXT:    ldd [%fp+-144], %f0
+; SPARC32-NEXT:    ldd [%fp+-136], %f2 ! 16-byte Folded Reload
 ; SPARC32-NEXT:    call cosl
-; SPARC32-NEXT:    std %f0, [%sp+92]
+; SPARC32-NEXT:    std %f0, [%fp+-32]
 ; SPARC32-NEXT:    unimp 16
-; SPARC32-NEXT:    add %fp, -32, %i1
+; SPARC32-NEXT:    add %fp, -48, %i1
 ; SPARC32-NEXT:    st %i1, [%sp+64]
-; SPARC32-NEXT:    ldd [%fp+-88], %f0 ! 8-byte Folded Reload
-; SPARC32-NEXT:    std %f0, [%sp+100]
-; SPARC32-NEXT:    ldd [%fp+-80], %f0
-; SPARC32-NEXT:    ldd [%fp+-72], %f2 ! 16-byte Folded Reload
+; SPARC32-NEXT:    ldd [%fp+-152], %f0 ! 8-byte Folded Reload
+; SPARC32-NEXT:    std %f0, [%fp+-56]
+; SPARC32-NEXT:    add %fp, -64, %o0
+; SPARC32-NEXT:    ldd [%fp+-144], %f0
+; SPARC32-NEXT:    ldd [%fp+-136], %f2 ! 16-byte Folded Reload
 ; SPARC32-NEXT:    call sinl
-; SPARC32-NEXT:    std %f0, [%sp+92]
+; SPARC32-NEXT:    std %f0, [%fp+-64]
 ; SPARC32-NEXT:    unimp 16
-; SPARC32-NEXT:    add %fp, -48, %i1
+; SPARC32-NEXT:    add %fp, -80, %i1
 ; SPARC32-NEXT:    st %i1, [%sp+64]
-; SPARC32-NEXT:    ldd [%fp+-104], %f0 ! 8-byte Folded Reload
-; SPARC32-NEXT:    std %f0, [%sp+100]
-; SPARC32-NEXT:    ldd [%fp+-120], %f0
-; SPARC32-NEXT:    ldd [%fp+-112], %f2 ! 16-byte Folded Reload
-; SPARC32-NEXT:    std %f0, [%sp+92]
-; SPARC32-NEXT:    ldd [%fp+-32], %f0
-; SPARC32-NEXT:    std %f0, [%fp+-80]
-; SPARC32-NEXT:    std %f2, [%fp+-72] ! 16-byte Folded Spill
-; SPARC32-NEXT:    ldd [%fp+-24], %f0
-; SPARC32-NEXT:    std %f0, [%fp+-88] ! 8-byte Folded Spill
-; SPARC32-NEXT:    ldd [%fp+-64], %f0
-; SPARC32-NEXT:    std %f0, [%fp+-104]
-; SPARC32-NEXT:    std %f2, [%fp+-96] ! 16-byte Folded Spill
-; SPARC32-NEXT:    ldd [%fp+-56], %f0
-; SPARC32-NEXT:    std %f0, [%fp+-120] ! 8-byte Folded Spill
-; SPARC32-NEXT:    ldd [%fp+-16], %f0
-; SPARC32-NEXT:    std %f0, [%fp+-136]
-; SPARC32-NEXT:    std %f2, [%fp+-128] ! 16-byte Folded Spill
-; SPARC32-NEXT:    ldd [%fp+-8], %f0
+; SPARC32-NEXT:    ldd [%fp+-160], %f0 ! 8-byte Folded Reload
+; SPARC32-NEXT:    std %f0, [%fp+-88]
+; SPARC32-NEXT:    add %fp, -96, %o0
+; SPARC32-NEXT:    ldd [%fp+-176], %f0
+; SPARC32-NEXT:    ldd [%fp+-168], %f2 ! 16-byte Folded Reload
 ; SPARC32-NEXT:    call cosl
-; SPARC32-NEXT:    std %f0, [%fp+-144]
+; SPARC32-NEXT:    std %f0, [%fp+-96]
 ; SPARC32-NEXT:    unimp 16
-; SPARC32-NEXT:    ldd [%fp+-40], %f0
-; SPARC32-NEXT:    ldd [%fp+-48], %f4
-; SPARC32-NEXT:    std %f0, [%i0+56]
-; SPARC32-NEXT:    std %f4, [%i0+48]
-; SPARC32-NEXT:    ldd [%fp+-144], %f0 ! 8-byte Folded Reload
-; SPARC32-NEXT:    std %f0, [%i0+40]
-; SPARC32-NEXT:    ldd [%fp+-136], %f0
-; SPARC32-NEXT:    ldd [%fp+-128], %f2 ! 16-byte Folded Reload
-; SPARC32-NEXT:    std %f0, [%i0+32]
-; SPARC32-NEXT:    ldd [%fp+-120], %f0 ! 8-byte Folded Reload
-; SPARC32-NEXT:    std %f0, [%i0+24]
-; SPARC32-NEXT:    ldd [%fp+-104], %f0
-; SPARC32-NEXT:    ldd [%fp+-96], %f2 ! 16-byte Folded Reload
-; SPARC32-NEXT:    std %f0, [%i0+16]
-; SPARC32-NEXT:    ldd [%fp+-88], %f0 ! 8-byte Folded Reload
-; SPARC32-NEXT:    std %f0, [%i0+8]
-; SPARC32-NEXT:    ldd [%fp+-80], %f0
-; SPARC32-NEXT:    ldd [%fp+-72], %f2 ! 16-byte Folded Reload
+; SPARC32-NEXT:    ldd [%fp+-48], %f0
+; SPARC32-NEXT:    ldd [%fp+-40], %f8
+; SPARC32-NEXT:    ldd [%fp+-112], %f4
+; SPARC32-NEXT:    ldd [%fp+-104], %f10
+; SPARC32-NEXT:    ldd [%fp+-72], %f12
+; SPARC32-NEXT:    ldd [%fp+-80], %f16
+; SPARC32-NEXT:    ldd [%fp+-8], %f14
+; SPARC32-NEXT:    ldd [%fp+-16], %f20
+; SPARC32-NEXT:    std %f12, [%i0+56]
+; SPARC32-NEXT:    std %f16, [%i0+48]
+; SPARC32-NEXT:    std %f14, [%i0+40]
+; SPARC32-NEXT:    std %f20, [%i0+32]
+; SPARC32-NEXT:    std %f10, [%i0+24]
+; SPARC32-NEXT:    std %f4, [%i0+16]
+; SPARC32-NEXT:    std %f8, [%i0+8]
 ; SPARC32-NEXT:    std %f0, [%i0]
 ; SPARC32-NEXT:    jmp %i7+12
 ; SPARC32-NEXT:    restore
@@ -1186,37 +1169,39 @@ define void @test_sincos_v2f128(ptr sret({ <2 x fp128>, <2 x fp128> }) %ret, ptr
 ;
 ; GNU32-LABEL: test_sincos_v2f128:
 ; GNU32:       ! %bb.0:
-; GNU32-NEXT:    save %sp, -192, %sp
+; GNU32-NEXT:    save %sp, -216, %sp
 ; GNU32-NEXT:    mov %i0, %i1
 ; GNU32-NEXT:    ld [%fp+64], %i0
 ; GNU32-NEXT:    ldd [%i1+16], %f0
-; GNU32-NEXT:    std %f0, [%fp+-80]
-; GNU32-NEXT:    std %f2, [%fp+-72] ! 16-byte Folded Spill
+; GNU32-NEXT:    std %f0, [%fp+-112]
+; GNU32-NEXT:    std %f2, [%fp+-104] ! 16-byte Folded Spill
 ; GNU32-NEXT:    ldd [%i1+24], %f0
-; GNU32-NEXT:    std %f0, [%fp+-88] ! 8-byte Folded Spill
+; GNU32-NEXT:    std %f0, [%fp+-120] ! 8-byte Folded Spill
 ; GNU32-NEXT:    ldd [%i1], %f0
 ; GNU32-NEXT:    ldd [%i1+8], %f4
-; GNU32-NEXT:    std %f4, [%sp+100]
-; GNU32-NEXT:    add %fp, -48, %o0
+; GNU32-NEXT:    std %f4, [%fp+-88]
+; GNU32-NEXT:    add %fp, -96, %o0
 ; GNU32-NEXT:    add %fp, -64, %o1
+; GNU32-NEXT:    add %fp, -80, %o2
 ; GNU32-NEXT:    call sincosl
-; GNU32-NEXT:    std %f0, [%sp+92]
-; GNU32-NEXT:    ldd [%fp+-88], %f0 ! 8-byte Folded Reload
-; GNU32-NEXT:    std %f0, [%sp+100]
-; GNU32-NEXT:    add %fp, -16, %o0
-; GNU32-NEXT:    add %fp, -32, %o1
-; GNU32-NEXT:    ldd [%fp+-80], %f0
-; GNU32-NEXT:    ldd [%fp+-72], %f2 ! 16-byte Folded Reload
+; GNU32-NEXT:    std %f0, [%fp+-96]
+; GNU32-NEXT:    ldd [%fp+-120], %f0 ! 8-byte Folded Reload
+; GNU32-NEXT:    std %f0, [%fp+-40]
+; GNU32-NEXT:    add %fp, -48, %o0
+; GNU32-NEXT:    add %fp, -16, %o1
+; GNU32-NEXT:    add %fp, -32, %o2
+; GNU32-NEXT:    ldd [%fp+-112], %f0
+; GNU32-NEXT:    ldd [%fp+-104], %f2 ! 16-byte Folded Reload
 ; GNU32-NEXT:    call sincosl
-; GNU32-NEXT:    std %f0, [%sp+92]
-; GNU32-NEXT:    ldd [%fp+-48], %f0
-; GNU32-NEXT:    ldd [%fp+-40], %f8
+; GNU32-NEXT:    std %f0, [%fp+-48]
+; GNU32-NEXT:    ldd [%fp+-64], %f0
+; GNU32-NEXT:    ldd [%fp+-56], %f8
 ; GNU32-NEXT:    ldd [%fp+-16], %f4
 ; GNU32-NEXT:    ldd [%fp+-8], %f10
 ; GNU32-NEXT:    ldd [%fp+-24], %f12
 ; GNU32-NEXT:    ldd [%fp+-32], %f16
-; GNU32-NEXT:    ldd [%fp+-56], %f14
-; GNU32-NEXT:    ldd [%fp+-64], %f20
+; GNU32-NEXT:    ldd [%fp+-72], %f14
+; GNU32-NEXT:    ldd [%fp+-80], %f20
 ; GNU32-NEXT:    std %f12, [%i0+56]
 ; GNU32-NEXT:    std %f16, [%i0+48]
 ; GNU32-NEXT:    std %f14, [%i0+40]
diff --git a/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-addrspacecast.ll b/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-addrspacecast.ll
new file mode 100644
index 0000000000000..58638578bb3f0
--- /dev/null
+++ b/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-addrspacecast.ll
@@ -0,0 +1,8 @@
+; RUN: not llc --global-isel %s -filetype=null 2>&1 | FileCheck %s
+target triple = "spirv64"
+
+define void @addrspacecast(ptr addrspace(9) %a) {
+; CHECK: unable to legalize instruction: %{{.*}}:pid(p4) = G_ADDRSPACE_CAST %{{.*}}:pid(p9)
+  %res1 = addrspacecast ptr addrspace(9) %a to ptr addrspace(4)
+  ret void
+}
diff --git a/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-load.ll b/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-load.ll
new file mode 100644
index 0000000000000..229f2234220ab
--- /dev/null
+++ b/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-load.ll
@@ -0,0 +1,8 @@
+; RUN: not llc --global-isel %s -filetype=null 2>&1 | FileCheck %s
+target triple = "spirv64"
+
+define void @do_load(ptr addrspace(9) %a) {
+; CHECK: unable to legalize instruction: %{{.*}}:iid(s32) = G_LOAD %{{.*}}:pid(p9) 
+  %val = load i32, ptr addrspace(9) %a
+  ret void
+}
diff --git a/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-memcpy.ll b/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-memcpy.ll
new file mode 100644
index 0000000000000..f72b69b4fec81
--- /dev/null
+++ b/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-memcpy.ll
@@ -0,0 +1,8 @@
+; RUN: not llc --global-isel %s -filetype=null 2>&1 | FileCheck %s
+target triple = "spirv64"
+
+define void @memcpy(ptr addrspace(9) %a) {
+; CHECK: unable to legalize instruction: G_MEMCPY %{{.*}}:pid(p9), %{{.*}}:pid(p0), %{{.*}}:iid(s64), 0
+  call void @llvm.memcpy.p9.p0.i64(ptr addrspace(9) %a, ptr null, i64 1, i1 0)
+  ret void
+}
diff --git a/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-memset.ll b/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-memset.ll
new file mode 100644
index 0000000000000..b8102582bba74
--- /dev/null
+++ b/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-memset.ll
@@ -0,0 +1,8 @@
+; RUN: not llc --global-isel %s -filetype=null 2>&1 | FileCheck %s
+target triple = "spirv64"
+
+define void @memset(ptr addrspace(9) %a) {
+; CHECK: unable to legalize instruction: G_MEMSET %{{.*}}:pid(p9), %{{.*}}:iid(s8), %{{.*}}:iid(s64)
+  call void @llvm.memset.p9.i64(ptr addrspace(9) %a, i8 0, i64 1, i1 0)
+  ret void
+}
diff --git a/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-store.ll b/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-store.ll
new file mode 100644
index 0000000000000..c00f15e82f2fe
--- /dev/null
+++ b/llvm/test/CodeGen/SPIRV/GlobalISel/fn-ptr-store.ll
@@ -0,0 +1,8 @@
+; RUN: not llc --global-isel %s -filetype=null 2>&1 | FileCheck %s
+target triple = "spirv64"
+
+define void @do_store(ptr addrspace(9) %a) {
+; CHECK: unable to legalize instruction: G_STORE %{{.*}}:iid(s32), %{{.*}}:pid(p9) 
+  store i32 5, ptr addrspace(9) %a
+  ret void
+}
diff --git a/llvm/test/CodeGen/SPIRV/extensions/SPV_ALTERA_arbitrary_precision_fixed_point/capability-arbitrary-precision-fixed-point-numbers.ll b/llvm/test/CodeGen/SPIRV/extensions/SPV_ALTERA_arbitrary_precision_fixed_point/capability-arbitrary-precision-fixed-point-numbers.ll
new file mode 100644
index 0000000000000..e8bc48ec100b1
--- /dev/null
+++ b/llvm/test/CodeGen/SPIRV/extensions/SPV_ALTERA_arbitrary_precision_fixed_point/capability-arbitrary-precision-fixed-point-numbers.ll
@@ -0,0 +1,254 @@
+; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv64-unknown-unknown --spirv-ext=+SPV_ALTERA_arbitrary_precision_fixed_point,+SPV_ALTERA_arbitrary_precision_integers %s -o - | FileCheck %s 
+; TODO: %if spirv-tools %{ llc -O0 -mtriple=spirv64-unknown-unknown --spirv-ext=+SPV_ALTERA_arbitrary_precision_fixed_point,+SPV_ALTERA_arbitrary_precision_integers %s -o - -filetype=obj | spirv-val %}
+
+; CHECK-DAG: OpCapability Kernel
+; CHECK-DAG: OpCapability ArbitraryPrecisionIntegersALTERA
+; CHECK-DAG: OpCapability ArbitraryPrecisionFixedPointALTERA
+; CHECK-DAG: OpExtension "SPV_ALTERA_arbitrary_precision_fixed_point"
+; CHECK-DAG: OpExtension "SPV_ALTERA_arbitrary_precision_integers"
+
+; CHECK-DAG: %[[Ty_8:[0-9]+]] = OpTypeInt 8 0
+; CHECK-DAG: %[[Ty_13:[0-9]+]] = OpTypeInt 13 0
+; CHECK-DAG: %[[Ty_5:[0-9]+]] = OpTypeInt 5 0
+; CHECK-DAG: %[[Ty_3:[0-9]+]] = OpTypeInt 3 0
+; CHECK-DAG: %[[Ty_11:[0-9]+]] = OpTypeInt 11 0
+; CHECK-DAG: %[[Ty_10:[0-9]+]] = OpTypeInt 10 0
+; CHECK-DAG: %[[Ty_17:[0-9]+]] = OpTypeInt 17 0
+; CHECK-DAG: %[[Ty_35:[0-9]+]] = OpTypeInt 35 0
+; CHECK-DAG: %[[Ty_28:[0-9]+]] = OpTypeInt 28 0
+; CHECK-DAG: %[[Ty_31:[0-9]+]] = OpTypeInt 31 0
+; CHECK-DAG: %[[Ty_40:[0-9]+]] = OpTypeInt 40 0
+; CHECK-DAG: %[[Ty_60:[0-9]+]] = OpTypeInt 60 0
+; CHECK-DAG: %[[Ty_16:[0-9]+]] = OpTypeInt 16 0
+; CHECK-DAG: %[[Ty_64:[0-9]+]] = OpTypeInt 64 0
+; CHECK-DAG: %[[Ty_44:[0-9]+]] = OpTypeInt 44 0
+; CHECK-DAG: %[[Ty_34:[0-9]+]] = OpTypeInt 34 0
+; CHECK-DAG: %[[Ty_51:[0-9]+]] = OpTypeInt 51 0
+
+; CHECK:        %[[Sqrt_InId:[0-9]+]] = OpLoad %[[Ty_13]]
+; CHECK-NEXT:  %[[#]] = OpFixedSqrtALTERA %[[Ty_5]] %[[Sqrt_InId]] 0 2 2 0 0
+
+; CHECK:        %[[Recip_InId:[0-9]+]] = OpLoad %[[Ty_3]]
+; CHECK-NEXT:  %[[#]] = OpFixedRecipALTERA %[[Ty_8]] %[[Recip_InId]] 1 4 4 0 0
+
+; CHECK:        %[[Rsqrt_InId:[0-9]+]] = OpLoad %[[Ty_11]]
+; CHECK-NEXT:  %[[#]] = OpFixedRsqrtALTERA %[[Ty_10]] %[[Rsqrt_InId]] 0 8 6 0 0
+
+; CHECK:        %[[Sin_InId:[0-9]+]] = OpLoad %[[Ty_17]]
+; CHECK-NEXT:  %[[#]] = OpFixedSinALTERA %[[Ty_11]] %[[Sin_InId]] 1 7 5 0 0
+
+; CHECK:        %[[Cos_InId:[0-9]+]] = OpLoad %[[Ty_35]]
+; CHECK-NEXT:  %[[#]] = OpFixedCosALTERA %[[Ty_28]] %[[Cos_InId]] 0 9 3 0 0
+
+; CHECK:        %[[SinCos_InId:[0-9]+]] = OpLoad %[[Ty_31]]
+; CHECK-NEXT:  %[[#]] = OpFixedSinCosALTERA %[[Ty_40]] %[[SinCos_InId]] 1 10 12 0 0
+
+; CHECK:        %[[SinPi_InId:[0-9]+]] = OpLoad %[[Ty_60]]
+; CHECK-NEXT:  %[[#]] = OpFixedSinPiALTERA %[[Ty_5]] %[[SinPi_InId]] 0 2 2 0 0
+
+; CHECK:        %[[CosPi_InId:[0-9]+]] = OpLoad %[[Ty_28]]
+; CHECK-NEXT:  %[[#]] = OpFixedCosPiALTERA %[[Ty_16]] %[[CosPi_InId]] 0 8 5 0 0
+
+; CHECK:        %[[SinCosPi_InId:[0-9]+]] = OpLoad %[[Ty_13]]
+; CHECK-NEXT:  %[[#]] = OpFixedSinCosPiALTERA %[[Ty_10]] %[[SinCosPi_InId]] 0 2 2 0 0
+
+; CHECK:        %[[Log_InId:[0-9]+]] = OpLoad %[[Ty_64]]
+; CHECK-NEXT:  %[[#]] = OpFixedLogALTERA %[[Ty_44]] %[[Log_InId]] 1 24 22 0 0
+
+; CHECK:        %[[Exp_InId:[0-9]+]] = OpLoad %[[Ty_44]]
+; CHECK-NEXT:  %[[#]] = OpFixedExpALTERA %[[Ty_34]] %[[Exp_InId]] 0 20 20 0 0
+
+; CHECK:        %[[SinCos_InId:[0-9]+]] = OpLoad %[[Ty_34]]
+; CHECK-NEXT:  %[[SinCos_ResultId:[0-9]+]] = OpFixedSinCosALTERA %[[Ty_51]] %[[SinCos_InId]] 1 3 2 0 0
+; CHECK-NEXT:        OpStore %[[#]] %[[SinCos_ResultId]]
+
+; CHECK:       %[[ResId:[0-9]+]] = OpLoad %[[Ty_51]]
+; CHECK-NEXT:  OpStore %[[PtrId:[0-9]+]] %[[ResId]]
+; CHECK-NEXT:  %[[ExpInId2:[0-9]+]] = OpLoad %[[Ty_51]] %[[PtrId]]
+; CHECK-NEXT:  %[[#]] = OpFixedExpALTERA %[[Ty_51]] %[[ExpInId2]] 0 20 20 0 0
+
+%"class._ZTSZ4mainE3$_0.anon" = type { i8 }
+
+define dso_local spir_kernel void @_ZTSZ4mainE15kernel_function() !kernel_arg_addr_space !{} !kernel_arg_access_qual !{} !kernel_arg_type !{} !kernel_arg_base_type !{} !kernel_arg_type_qual !{} {
+entry:
+  %0 = alloca %"class._ZTSZ4mainE3$_0.anon", align 1
+  %1 = addrspacecast ptr %0 to ptr addrspace(4)
+  call spir_func void @"_ZZ4mainENK3$_0clEv"(ptr addrspace(4) %1)
+  ret void
+}
+
+define internal spir_func void @"_ZZ4mainENK3$_0clEv"(ptr addrspace(4) %this)  align 2 {
+entry:
+  %this.addr = alloca ptr addrspace(4), align 8
+  store ptr addrspace(4) %this, ptr %this.addr, align 8
+  call spir_func void @_Z4sqrtILi13ELi5ELb0ELi2ELi2EEvv()
+  call spir_func void @_Z5recipILi3ELi8ELb1ELi4ELi4EEvv()
+  call spir_func void @_Z5rsqrtILi11ELi10ELb0ELi8ELi6EEvv()
+  call spir_func void @_Z3sinILi17ELi11ELb1ELi7ELi5EEvv()
+  call spir_func void @_Z3cosILi35ELi28ELb0ELi9ELi3EEvv()
+  call spir_func void @_Z7sin_cosILi31ELi20ELb1ELi10ELi12EEvv()
+  call spir_func void @_Z6sin_piILi60ELi5ELb0ELi2ELi2EEvv()
+  call spir_func void @_Z6cos_piILi28ELi16ELb0ELi8ELi5EEvv()
+  call spir_func void @_Z10sin_cos_piILi13ELi5ELb0ELi2ELi2EEvv()
+  call spir_func void @_Z3logILi64ELi44ELb1ELi24ELi22EEvv()
+  call spir_func void @_Z3expILi44ELi34ELb0ELi20ELi20EEvv()
+  call spir_func void @_Z7sin_cosILi31ELi20ELb1ELi10ELi12EEvv_()
+  call spir_func void @_Z3expILi51ELi51ELb0ELi20ELi20EEvv()
+  ret void
+}
+
+define linkonce_odr dso_local spir_func void @_Z4sqrtILi13ELi5ELb0ELi2ELi2EEvv() {
+entry:
+  %in_ptr  = alloca i13, align 2
+  %out_ptr = alloca i5,  align 1
+  %in_val  = load i13, ptr %in_ptr, align 2
+  %res     = call spir_func signext i5 @_Z22__spirv_FixedSqrtINTELILi13ELi5EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i13 signext %in_val, i1 zeroext false, i32 2, i32 2, i32 0, i32 0)
+  store i5 %res, ptr %out_ptr, align 1
+  ret void
+}
+
+define linkonce_odr dso_local spir_func void @_Z5recipILi3ELi8ELb1ELi4ELi4EEvv() {
+entry:
+  %in_ptr  = alloca i3, align 1
+  %out_ptr = alloca i8, align 1
+  %in_val  = load i3, ptr %in_ptr, align 1
+  %res     = call spir_func signext i8 @_Z23__spirv_FixedRecipINTELILi3ELi8EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i3 signext %in_val, i1 zeroext true, i32 4, i32 4, i32 0, i32 0)
+  store i8 %res, ptr %out_ptr, align 1
+  ret void
+}
+
+define linkonce_odr dso_local spir_func void @_Z5rsqrtILi11ELi10ELb0ELi8ELi6EEvv() {
+entry:
+  %in_ptr  = alloca i11, align 2
+  %out_ptr = alloca i10, align 2
+  %in_val  = load i11, ptr %in_ptr, align 2
+  %res     = call spir_func signext i10 @_Z23__spirv_FixedRsqrtINTELILi11ELi10EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i11 signext %in_val, i1 zeroext false, i32 8, i32 6, i32 0, i32 0)
+  store i10 %res, ptr %out_ptr, align 2
+  ret void
+}
+
+define linkonce_odr dso_local spir_func void @_Z3sinILi17ELi11ELb1ELi7ELi5EEvv() {
+entry:
+  %in_ptr  = alloca i17, align 4
+  %out_ptr = alloca i11, align 2
+  %in_val  = load i17, ptr %in_ptr, align 4
+  %res     = call spir_func signext i11 @_Z21__spirv_FixedSinINTELILi17ELi11EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i17 signext %in_val, i1 zeroext true, i32 7, i32 5, i32 0, i32 0)
+  store i11 %res, ptr %out_ptr, align 2
+  ret void
+}
+
+define linkonce_odr dso_local spir_func void @_Z3cosILi35ELi28ELb0ELi9ELi3EEvv() {
+entry:
+  %in_ptr  = alloca i35, align 8
+  %out_ptr = alloca i28, align 4
+  %in_val  = load i35, ptr %in_ptr, align 8
+  %res     = call spir_func signext i28 @_Z21__spirv_FixedCosINTELILi35ELi28EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i35 signext %in_val, i1 zeroext false, i32 9, i32 3, i32 0, i32 0)
+  store i28 %res, ptr %out_ptr, align 4
+  ret void
+}
+
+define linkonce_odr dso_local spir_func void @_Z7sin_cosILi31ELi20ELb1ELi10ELi12EEvv() {
+entry:
+  %in_ptr  = alloca i31, align 4
+  %out_ptr = alloca i40, align 8
+  %in_val  = load i31, ptr %in_ptr, align 4
+  %res     = call spir_func i40 @_Z24__spirv_FixedSinCosINTELILi31ELi20EEU7_ExtIntIXmlLi2ET0_EEiU7_ExtIntIXT_EEibiiii(i31 signext %in_val, i1 zeroext true, i32 10, i32 12, i32 0, i32 0)
+  store i40 %res, ptr %out_ptr, align 8
+  ret void
+}
+
+define linkonce_odr dso_local spir_func void @_Z6sin_piILi60ELi5ELb0ELi2ELi2EEvv() {
+entry:
+  %in_ptr  = alloca i60, align 8
+  %out_ptr = alloca i5,  align 1
+  %in_val  = load i60, ptr %in_ptr, align 8
+  %res     = call spir_func signext i5 @_Z23__spirv_FixedSinPiINTELILi60ELi5EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i60 signext %in_val, i1 zeroext false, i32 2, i32 2, i32 0, i32 0)
+  store i5 %res, ptr %out_ptr, align 1
+  ret void
+}
+
+define linkonce_odr dso_local spir_func void @_Z6cos_piILi28ELi16ELb0ELi8ELi5EEvv() {
+entry:
+  %in_ptr  = alloca i28, align 4
+  %out_ptr = alloca i16, align 2
+  %in_val  = load i28, ptr %in_ptr, align 4
+  %res     = call spir_func signext i16 @_Z23__spirv_FixedCosPiINTELILi28ELi16EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i28 signext %in_val, i1 zeroext false, i32 8, i32 5, i32 0, i32 0)
+  store i16 %res, ptr %out_ptr, align 2
+  ret void
+}
+
+define linkonce_odr dso_local spir_func void @_Z10sin_cos_piILi13ELi5ELb0ELi2ELi2EEvv() {
+entry:
+  %in_ptr  = alloca i13, align 2
+  %out_ptr = alloca i10, align 2
+  %in_val  = load i13, ptr %in_ptr, align 2
+  %res     = call spir_func signext i10 @_Z26__spirv_FixedSinCosPiINTELILi13ELi5EEU7_ExtIntIXmlLi2ET0_EEiU7_ExtIntIXT_EEibiiii(i13 signext %in_val, i1 zeroext false, i32 2, i32 2, i32 0, i32 0)
+  store i10 %res, ptr %out_ptr, align 2
+  ret void
+}
+
+define linkonce_odr dso_local spir_func void @_Z3logILi64ELi44ELb1ELi24ELi22EEvv() {
+entry:
+  %in_ptr  = alloca i64, align 8
+  %out_ptr = alloca i44, align 8
+  %in_val  = load i64, ptr %in_ptr, align 8
+  %res     = call spir_func i44 @_Z21__spirv_FixedLogINTELILi64ELi44EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i64 %in_val, i1 zeroext true, i32 24, i32 22, i32 0, i32 0)
+  store i44 %res, ptr %out_ptr, align 8
+  ret void
+}
+
+define linkonce_odr dso_local spir_func void @_Z3expILi44ELi34ELb0ELi20ELi20EEvv() {
+entry:
+  %in_ptr  = alloca i44, align 8
+  %out_ptr = alloca i34, align 8
+  %in_val  = load i44, ptr %in_ptr, align 8
+  %res     = call spir_func i34 @_Z21__spirv_FixedExpINTELILi44ELi34EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i44 %in_val, i1 zeroext false, i32 20, i32 20, i32 0, i32 0)
+  store i34 %res, ptr %out_ptr, align 8
+  ret void
+}
+
+define linkonce_odr dso_local spir_func void @_Z7sin_cosILi31ELi20ELb1ELi10ELi12EEvv_() {
+entry:
+  %tmp     = alloca i34, align 8
+  %out_ptr = alloca i51, align 8
+  %in_ptr  = addrspacecast ptr %tmp to ptr addrspace(4)
+  %out_s   = addrspacecast ptr %out_ptr to ptr addrspace(4)
+  %in_val  = load i34, ptr addrspace(4) %in_ptr, align 8
+  call spir_func void @_Z24__spirv_FixedSinCosINTELILi34ELi51EEU7_ExtIntIXmlLi2ET0_EEiU7_ExtIntIXT_EEibiiii(ptr addrspace(4) sret(i51) align 8 %out_s, i34 %in_val, i1 zeroext true, i32 3, i32 2, i32 0, i32 0)
+  ret void
+}
+
+define linkonce_odr dso_local spir_func void @_Z3expILi51ELi51ELb0ELi20ELi20EEvv() {
+entry:
+  %a = alloca i51, align 8
+  %a.ascast = addrspacecast ptr %a to ptr addrspace(4)
+  %ap_fixed_Exp = alloca i51, align 8
+  %ap_fixed_Exp.ascast = addrspacecast ptr %ap_fixed_Exp to ptr addrspace(4)
+  %tmp = alloca i51, align 8
+  %tmp.ascast = addrspacecast ptr %tmp to ptr addrspace(4)
+  %indirect-arg-temp = alloca i51, align 8
+  %0 = load i51, ptr addrspace(4) %a.ascast, align 8
+  store i51 %0, ptr %indirect-arg-temp, align 8
+  call spir_func void @_Z21__spirv_FixedExpINTELILi51ELi51EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(
+      ptr addrspace(4) sret(i51) align 8 %tmp.ascast,
+      ptr addrspace(4) byval(i51) align 8 %indirect-arg-temp,
+      i1 zeroext false, i32 20, i32 20, i32 0, i32 0)
+  %1 = load i51, ptr addrspace(4) %tmp.ascast, align 8
+  store i51 %1, ptr addrspace(4) %ap_fixed_Exp.ascast, align 8
+  ret void
+}
+
+declare dso_local spir_func signext i5 @_Z22__spirv_FixedSqrtINTELILi13ELi5EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i13 signext, i1 zeroext, i32, i32, i32, i32)
+declare dso_local spir_func signext i13 @_Z22__spirv_FixedSqrtINTELILi5ELi13EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i5 signext, i1 zeroext, i32, i32, i32, i32)
+declare dso_local spir_func signext i8 @_Z23__spirv_FixedRecipINTELILi3ELi8EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i3 signext, i1 zeroext, i32, i32, i32, i32)
+declare dso_local spir_func signext i10 @_Z23__spirv_FixedRsqrtINTELILi11ELi10EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i11 signext, i1 zeroext, i32, i32, i32, i32)
+declare dso_local spir_func signext i11 @_Z21__spirv_FixedSinINTELILi17ELi11EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i17 signext, i1 zeroext, i32, i32, i32, i32)
+declare dso_local spir_func signext i28 @_Z21__spirv_FixedCosINTELILi35ELi28EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i35, i1 zeroext, i32, i32, i32, i32)
+declare dso_local spir_func i40 @_Z24__spirv_FixedSinCosINTELILi31ELi20EEU7_ExtIntIXmlLi2ET0_EEiU7_ExtIntIXT_EEibiiii(i31 signext, i1 zeroext, i32, i32, i32, i32)
+declare dso_local spir_func signext i5 @_Z23__spirv_FixedSinPiINTELILi60ELi5EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i60, i1 zeroext, i32, i32, i32, i32)
+declare dso_local spir_func signext i16 @_Z23__spirv_FixedCosPiINTELILi28ELi16EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i28 signext, i1 zeroext, i32, i32, i32, i32)
+declare dso_local spir_func signext i10 @_Z26__spirv_FixedSinCosPiINTELILi13ELi5EEU7_ExtIntIXmlLi2ET0_EEiU7_ExtIntIXT_EEibiiii(i13 signext, i1 zeroext, i32, i32, i32, i32)
+declare dso_local spir_func i44 @_Z21__spirv_FixedLogINTELILi64ELi44EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i64, i1 zeroext, i32, i32, i32, i32)
+declare dso_local spir_func i34 @_Z21__spirv_FixedExpINTELILi44ELi34EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(i44, i1 zeroext, i32, i32, i32, i32)
+declare dso_local spir_func void @_Z24__spirv_FixedSinCosINTELILi34ELi51EEU7_ExtIntIXmlLi2ET0_EEiU7_ExtIntIXT_EEibiiii(ptr addrspace(4) sret(i51) align 8, i34, i1 zeroext, i32, i32, i32, i32)
+declare dso_local spir_func void @_Z21__spirv_FixedExpINTELILi51ELi51EEU7_ExtIntIXT0_EEiU7_ExtIntIXT_EEibiiii(ptr addrspace(4) sret(i51) align 8, ptr byval(i51) align 8, i1 zeroext, i32, i32, i32, i32)
diff --git a/llvm/test/CodeGen/SPIRV/extensions/SPV_INTEL_arbitrary_precision_integers.ll b/llvm/test/CodeGen/SPIRV/extensions/SPV_INTEL_arbitrary_precision_integers.ll
index 41d4b58ed1157..9ea8a5709154c 100644
--- a/llvm/test/CodeGen/SPIRV/extensions/SPV_INTEL_arbitrary_precision_integers.ll
+++ b/llvm/test/CodeGen/SPIRV/extensions/SPV_INTEL_arbitrary_precision_integers.ll
@@ -1,4 +1,4 @@
-; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=+SPV_INTEL_arbitrary_precision_integers %s -o - | FileCheck %s
+; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=+SPV_ALTERA_arbitrary_precision_integers %s -o - | FileCheck %s
 
 define i6 @getConstantI6() {
   ret i6 2
@@ -9,8 +9,8 @@ define i13 @getConstantI13() {
 }
 
 ;; Capabilities:
-; CHECK-DAG: OpExtension "SPV_INTEL_arbitrary_precision_integers"
-; CHECK-DAG: OpCapability ArbitraryPrecisionIntegersINTEL
+; CHECK-DAG: OpExtension "SPV_ALTERA_arbitrary_precision_integers"
+; CHECK-DAG: OpCapability ArbitraryPrecisionIntegersALTERA
 
 ; CHECK-NOT: DAG-FENCE
 
diff --git a/llvm/test/CodeGen/SPIRV/extensions/SPV_INTEL_function_pointers/fp_cmp.ll b/llvm/test/CodeGen/SPIRV/extensions/SPV_INTEL_function_pointers/fp_cmp.ll
new file mode 100644
index 0000000000000..da9771914b819
--- /dev/null
+++ b/llvm/test/CodeGen/SPIRV/extensions/SPV_INTEL_function_pointers/fp_cmp.ll
@@ -0,0 +1,27 @@
+; RUN: llc -verify-machineinstrs -O0 --spirv-ext=+SPV_INTEL_function_pointers %s -o - | FileCheck %s
+; TODO: %if spirv-tools %{ llc -O0 %s -o - -filetype=obj | spirv-val %}
+
+; CHECK-DAG: OpCapability FunctionPointersINTEL
+; CHECK: OpExtension "SPV_INTEL_function_pointers"
+
+; CHECK: OpName %[[F1:.*]] "f1"
+; CHECK: OpName %[[ARG:.*]] "arg"
+
+; CHECK: %[[TyBool:.*]] = OpTypeBool
+
+; CHECK: %[[F1Ptr:.*]] = OpConstantFunctionPointerINTEL %{{.*}} %[[F1]]
+
+; CHECK: OpPtrEqual %[[TyBool]] %[[F1Ptr]] %[[ARG]]
+
+target triple = "spirv64"
+
+define spir_func void @f1() addrspace(9) {
+entry:
+  ret void
+}
+
+define spir_func i1 @foo(ptr addrspace(9) %arg) addrspace(9) {
+entry:
+  %a = icmp eq ptr addrspace(9) @f1, %arg
+  ret i1 %a
+}
diff --git a/llvm/test/CodeGen/SPIRV/extensions/SPV_INTEL_int4/negative.ll b/llvm/test/CodeGen/SPIRV/extensions/SPV_INTEL_int4/negative.ll
index 4d5fa52a166f2..fdb2776a7e2ec 100644
--- a/llvm/test/CodeGen/SPIRV/extensions/SPV_INTEL_int4/negative.ll
+++ b/llvm/test/CodeGen/SPIRV/extensions/SPV_INTEL_int4/negative.ll
@@ -1,11 +1,11 @@
-; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=+SPV_INTEL_arbitrary_precision_integers %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-INT-4
+; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=+SPV_ALTERA_arbitrary_precision_integers %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-INT-4
 
 ; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv32-unknown-unknown %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-INT-8
 ; No error would be reported in comparison to Khronos llvm-spirv, because type adjustments to integer size are made 
 ; in case no appropriate extension is enabled. Here we expect that the type is adjusted to 8 bits.
 
-; CHECK-SPIRV: Capability ArbitraryPrecisionIntegersINTEL
-; CHECK-SPIRV: Extension "SPV_INTEL_arbitrary_precision_integers"
+; CHECK-SPIRV: Capability ArbitraryPrecisionIntegersALTERA
+; CHECK-SPIRV: Extension "SPV_ALTERA_arbitrary_precision_integers"
 ; CHECK-INT-4: %[[#Int4:]] = OpTypeInt 4 0
 ; CHECK-INT-8: %[[#Int4:]] = OpTypeInt 8 0
 ; CHECK: OpTypeFunction %[[#]] %[[#Int4]]
diff --git a/llvm/test/CodeGen/SPIRV/extensions/both-allowed-disallowed-extension-error.ll b/llvm/test/CodeGen/SPIRV/extensions/both-allowed-disallowed-extension-error.ll
index fc07cca4dd240..96dca53b8ba59 100644
--- a/llvm/test/CodeGen/SPIRV/extensions/both-allowed-disallowed-extension-error.ll
+++ b/llvm/test/CodeGen/SPIRV/extensions/both-allowed-disallowed-extension-error.ll
@@ -1,6 +1,6 @@
-; RUN: not llc -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=+SPV_INTEL_arbitrary_precision_integers,-SPV_INTEL_arbitrary_precision_integers %s -o %t.spvt 2>&1 | FileCheck %s
-; RUN: not llc -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=-SPV_INTEL_arbitrary_precision_integers,+SPV_INTEL_arbitrary_precision_integers %s -o %t.spvt 2>&1 | FileCheck %s
-; CHECK: Extension cannot be allowed and disallowed at the same time: SPV_INTEL_arbitrary_precision_integers
+; RUN: not llc -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=+SPV_ALTERA_arbitrary_precision_integers,-SPV_ALTERA_arbitrary_precision_integers %s -o %t.spvt 2>&1 | FileCheck %s
+; RUN: not llc -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=-SPV_ALTERA_arbitrary_precision_integers,+SPV_ALTERA_arbitrary_precision_integers %s -o %t.spvt 2>&1 | FileCheck %s
+; CHECK: Extension cannot be allowed and disallowed at the same time: SPV_ALTERA_arbitrary_precision_integers
 
 define i8 @foo() {
   ret i8 2
diff --git a/llvm/test/CodeGen/SPIRV/extensions/enable-all-extensions-but-one.ll b/llvm/test/CodeGen/SPIRV/extensions/enable-all-extensions-but-one.ll
index face4a9f5e615..5ddfc85702540 100644
--- a/llvm/test/CodeGen/SPIRV/extensions/enable-all-extensions-but-one.ll
+++ b/llvm/test/CodeGen/SPIRV/extensions/enable-all-extensions-but-one.ll
@@ -1,4 +1,4 @@
-; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=all,-SPV_INTEL_arbitrary_precision_integers %s -o - | FileCheck %s
+; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=all,-SPV_ALTERA_arbitrary_precision_integers %s -o - | FileCheck %s
 ; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=KHR %s -o - | FileCheck %s
 ; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=khr %s -o - | FileCheck %s
 
@@ -10,7 +10,7 @@ define i6 @foo() {
   ret i6 2
 }
 
-; CHECK-NOT: OpExtension "SPV_INTEL_arbitrary_precision_integers"
+; CHECK-NOT: OpExtension "SPV_ALTERA_arbitrary_precision_integers"
 ; CHECK-DAG: OpExtension "SPV_KHR_bit_instructions"
 
 declare i32 @llvm.bitreverse.i32(i32)
diff --git a/llvm/test/CodeGen/SPIRV/extensions/enable-all-extensions.ll b/llvm/test/CodeGen/SPIRV/extensions/enable-all-extensions.ll
index 15905dd1894e2..80b094f462a70 100644
--- a/llvm/test/CodeGen/SPIRV/extensions/enable-all-extensions.ll
+++ b/llvm/test/CodeGen/SPIRV/extensions/enable-all-extensions.ll
@@ -5,4 +5,4 @@ define i6 @getConstantI6() {
   ret i6 2
 }
 
-; CHECK: OpExtension "SPV_INTEL_arbitrary_precision_integers"
+; CHECK: OpExtension "SPV_ALTERA_arbitrary_precision_integers"
diff --git a/llvm/test/CodeGen/SPIRV/extensions/unused-but-allowed-SPV_INTEL_arbitrary_precision_integers.ll b/llvm/test/CodeGen/SPIRV/extensions/unused-but-allowed-SPV_INTEL_arbitrary_precision_integers.ll
index 2c1257471d159..cc3f1ae29a681 100644
--- a/llvm/test/CodeGen/SPIRV/extensions/unused-but-allowed-SPV_INTEL_arbitrary_precision_integers.ll
+++ b/llvm/test/CodeGen/SPIRV/extensions/unused-but-allowed-SPV_INTEL_arbitrary_precision_integers.ll
@@ -1,4 +1,4 @@
-; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=+SPV_INTEL_arbitrary_precision_integers %s -o - | FileCheck %s
+; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv32-unknown-unknown --spirv-ext=+SPV_ALTERA_arbitrary_precision_integers %s -o - | FileCheck %s
 
 define i8 @getConstantI8() {
   ret i8 2
@@ -15,5 +15,5 @@ define i64 @getConstantI64() {
 }
 
 ;; Capabilities:
-; CHECK-NOT: OpExtension "SPV_INTEL_arbitrary_precision_integers"
-; CHECK-NOT: OpCapability ArbitraryPrecisionIntegersINTEL
+; CHECK-NOT: OpExtension "SPV_ALTERA_arbitrary_precision_integers"
+; CHECK-NOT: OpCapability ArbitraryPrecisionIntegersALTERA
diff --git a/llvm/test/CodeGen/SPIRV/legalization/vector-legalization-kernel.ll b/llvm/test/CodeGen/SPIRV/legalization/vector-legalization-kernel.ll
new file mode 100644
index 0000000000000..4fe6f217dd40f
--- /dev/null
+++ b/llvm/test/CodeGen/SPIRV/legalization/vector-legalization-kernel.ll
@@ -0,0 +1,69 @@
+; RUN: llc -O0 -verify-machineinstrs -mtriple=spirv64-unknown-unknown %s -o - | FileCheck %s
+; RUN: %if spirv-tools %{ llc -O0 -mtriple=spirv64-unknown-unknown %s -o - -filetype=obj | spirv-val %}
+
+; CHECK-DAG: OpName %[[#test_int32_double_conversion:]] "test_int32_double_conversion"
+; CHECK-DAG: %[[#int:]] = OpTypeInt 32 0
+; CHECK-DAG: %[[#v8i32:]] = OpTypeVector %[[#int]] 8
+; CHECK-DAG: %[[#v4i32:]] = OpTypeVector %[[#int]] 4
+; CHECK-DAG: %[[#ptr_func_v8i32:]] = OpTypePointer Function %[[#v8i32]]
+
+; CHECK-DAG: OpName %[[#test_v3f64_conversion:]] "test_v3f64_conversion"
+; CHECK-DAG: %[[#double:]] = OpTypeFloat 64
+; CHECK-DAG: %[[#v3f64:]] = OpTypeVector %[[#double]] 3
+; CHECK-DAG: %[[#ptr_func_v3f64:]] = OpTypePointer Function %[[#v3f64]]
+; CHECK-DAG: %[[#v4f64:]] = OpTypeVector %[[#double]] 4
+
+define spir_kernel void @test_int32_double_conversion(ptr %G_vec) {
+; CHECK: %[[#test_int32_double_conversion]] = OpFunction
+; CHECK: %[[#param:]] = OpFunctionParameter %[[#ptr_func_v8i32]]
+entry:
+  ; CHECK: %[[#LOAD:]] = OpLoad %[[#v8i32]] %[[#param]]
+  ; CHECK: %[[#SHUF1:]] = OpVectorShuffle %[[#v4i32]] %[[#LOAD]] %{{[a-zA-Z0-9_]+}} 0 2 4 6
+  ; CHECK: %[[#SHUF2:]] = OpVectorShuffle %[[#v4i32]] %[[#LOAD]] %{{[a-zA-Z0-9_]+}} 1 3 5 7
+  ; CHECK: %[[#SHUF3:]] = OpVectorShuffle %[[#v8i32]] %[[#SHUF1]] %[[#SHUF2]] 0 4 1 5 2 6 3 7
+  ; CHECK: OpStore %[[#param]] %[[#SHUF3]]
+
+  %0 = load <8 x i32>, ptr %G_vec
+  %1 = shufflevector <8 x i32> %0, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+  %2 = shufflevector <8 x i32> %0, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+  %3 = shufflevector <4 x i32> %1, <4 x i32> %2, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
+  store <8 x i32> %3, ptr %G_vec
+  ret void
+}
+
+define spir_kernel void @test_v3f64_conversion(ptr %G_vec) {
+; CHECK: %[[#test_v3f64_conversion:]] = OpFunction
+; CHECK: %[[#param_v3f64:]] = OpFunctionParameter %[[#ptr_func_v3f64]]
+entry:
+  ; CHECK: %[[#LOAD:]] = OpLoad %[[#v3f64]] %[[#param_v3f64]]
+  %0 = load <3 x double>, ptr %G_vec
+
+  ; The 6-element vector is not legal. It gets expanded to 8.
+  ; CHECK: %[[#EXTRACT1:]] = OpCompositeExtract %[[#double]] %[[#LOAD]] 0
+  ; CHECK: %[[#EXTRACT2:]] = OpCompositeExtract %[[#double]] %[[#LOAD]] 1
+  ; CHECK: %[[#EXTRACT3:]] = OpCompositeExtract %[[#double]] %[[#LOAD]] 2
+  ; CHECK: %[[#CONSTRUCT1:]] = OpCompositeConstruct %[[#v4f64]] %[[#EXTRACT1]] %[[#EXTRACT2]] %[[#EXTRACT3]] %{{[a-zA-Z0-9_]+}}
+  ; CHECK: %[[#BITCAST1:]] = OpBitcast %[[#v8i32]] %[[#CONSTRUCT1]]
+  %1 = bitcast <3 x double> %0 to <6 x i32>
+
+  ; CHECK: %[[#SHUFFLE1:]] = OpVectorShuffle %[[#v8i32]] %[[#BITCAST1]] %{{[a-zA-Z0-9_]+}} 0 2 4 0xFFFFFFFF 0xFFFFFFFF 0xFFFFFFFF 0xFFFFFFFF 0xFFFFFFFF
+  %2 = shufflevector <6 x i32> %1, <6 x i32> poison, <3 x i32> <i32 0, i32 2, i32 4>
+
+  ; CHECK: %[[#SHUFFLE2:]] = OpVectorShuffle %[[#v8i32]] %[[#BITCAST1]] %{{[a-zA-Z0-9_]+}} 1 3 5 0xFFFFFFFF 0xFFFFFFFF 0xFFFFFFFF 0xFFFFFFFF 0xFFFFFFFF
+  %3 = shufflevector <6 x i32> %1, <6 x i32> poison, <3 x i32> <i32 1, i32 3, i32 5>
+
+  ; CHECK: %[[#SHUFFLE3:]] = OpVectorShuffle %[[#v8i32]] %[[#SHUFFLE1]] %[[#SHUFFLE2]] 0 8 1 9 2 10 0xFFFFFFFF 0xFFFFFFFF
+  %4 = shufflevector <3 x i32> %2, <3 x i32> %3, <6 x i32> <i32 0, i32 3, i32 1, i32 4, i32 2, i32 5>
+
+  ; CHECK: %[[#BITCAST2:]] = OpBitcast %[[#v4f64]] %[[#SHUFFLE3]]
+  ; CHECK: %[[#EXTRACT10:]] = OpCompositeExtract %[[#double]] %[[#BITCAST2]] 0
+  ; CHECK: %[[#EXTRACT11:]] = OpCompositeExtract %[[#double]] %[[#BITCAST2]] 1
+  ; CHECK: %[[#EXTRACT12:]] = OpCompositeExtract %[[#double]] %[[#BITCAST2]] 2
+  ; CHECK: %[[#CONSTRUCT3:]] = OpCompositeConstruct %[[#v3f64]] %[[#EXTRACT10]] %[[#EXTRACT11]] %[[#EXTRACT12]]
+  %5 = bitcast <6 x i32> %4 to <3 x double>
+
+  ; CHECK: OpStore %[[#param_v3f64]] %[[#CONSTRUCT3]]
+  store <3 x double> %5, ptr %G_vec
+  ret void
+}
+
diff --git a/llvm/test/CodeGen/SPIRV/legalization/vector-legalization-shader.ll b/llvm/test/CodeGen/SPIRV/legalization/vector-legalization-shader.ll
new file mode 100644
index 0000000000000..438d7ae21283a
--- /dev/null
+++ b/llvm/test/CodeGen/SPIRV/legalization/vector-legalization-shader.ll
@@ -0,0 +1,133 @@
+; RUN: llc -O0 -verify-machineinstrs -mtriple=spirv-unknown-vulkan %s -o - | FileCheck %s
+; RUN: %if spirv-tools %{ llc -O0 -mtriple=spirv-unknown-vulkan %s -o - -filetype=obj | spirv-val --target-env vulkan1.3 %}
+
+; CHECK-DAG: %[[#int:]] = OpTypeInt 32 0
+; CHECK-DAG: %[[#double:]] = OpTypeFloat 64
+; CHECK-DAG: %[[#v4int:]] = OpTypeVector %[[#int]] 4
+; CHECK-DAG: %[[#v4double:]] = OpTypeVector %[[#double]] 4
+; CHECK-DAG: %[[#v2int:]] = OpTypeVector %[[#int]] 2
+; CHECK-DAG: %[[#v2double:]] = OpTypeVector %[[#double]] 2
+; CHECK-DAG: %[[#v3int:]] = OpTypeVector %[[#int]] 3
+; CHECK-DAG: %[[#v3double:]] = OpTypeVector %[[#double]] 3
+; CHECK-DAG: %[[#ptr_v4double:]] = OpTypePointer Private %[[#v4double]]
+; CHECK-DAG: %[[#ptr_v4int:]] = OpTypePointer Private %[[#v4int]]
+; CHECK-DAG: %[[#ptr_v3double:]] = OpTypePointer Private %[[#v3double]]
+; CHECK-DAG: %[[#ptr_v3int:]] = OpTypePointer Private %[[#v3int]]
+; CHECK-DAG: %[[#GVec4:]] = OpVariable %[[#ptr_v4double]] Private
+; CHECK-DAG: %[[#Lows:]] = OpVariable %[[#ptr_v4int]] Private
+; CHECK-DAG: %[[#Highs:]] = OpVariable %[[#ptr_v4int]] Private
+; CHECK-DAG: %[[#GVec3:]] = OpVariable %[[#ptr_v3double]] Private
+; CHECK-DAG: %[[#Lows3:]] = OpVariable %[[#ptr_v3int]] Private
+; CHECK-DAG: %[[#Highs3:]] = OpVariable %[[#ptr_v3int]] Private
+
+ at GVec4 = internal addrspace(10) global <4 x double> zeroinitializer
+ at Lows = internal addrspace(10) global <4 x i32> zeroinitializer
+ at Highs = internal addrspace(10) global <4 x i32> zeroinitializer
+ at GVec3 = internal addrspace(10) global <3 x double> zeroinitializer
+ at Lows3 = internal addrspace(10) global <3 x i32> zeroinitializer
+ at Highs3 = internal addrspace(10) global <3 x i32> zeroinitializer
+
+; Test splitting a vector of size 8.
+define internal void @test_split() {
+entry:
+  ; CHECK: %[[#load_v4double:]] = OpLoad %[[#v4double]] %[[#GVec4]]
+  ; CHECK: %[[#v2double_01:]] = OpVectorShuffle %[[#v2double]] %[[#load_v4double]] %{{[a-zA-Z0-9_]+}} 0 1
+  ; CHECK: %[[#v2double_23:]] = OpVectorShuffle %[[#v2double]] %[[#load_v4double]] %{{[a-zA-Z0-9_]+}} 2 3
+  ; CHECK: %[[#v4int_01:]] = OpBitcast %[[#v4int]] %[[#v2double_01]]
+  ; CHECK: %[[#v4int_23:]] = OpBitcast %[[#v4int]] %[[#v2double_23]]
+  %0 = load <8 x i32>, ptr addrspace(10) @GVec4, align 32
+
+  ; CHECK: %[[#l0:]] = OpCompositeExtract %[[#int]] %[[#v4int_01]] 0
+  ; CHECK: %[[#l1:]] = OpCompositeExtract %[[#int]] %[[#v4int_01]] 2
+  ; CHECK: %[[#l2:]] = OpCompositeExtract %[[#int]] %[[#v4int_23]] 0
+  ; CHECK: %[[#l3:]] = OpCompositeExtract %[[#int]] %[[#v4int_23]] 2
+  ; CHECK: %[[#res_low:]] = OpCompositeConstruct %[[#v4int]] %[[#l0]] %[[#l1]] %[[#l2]] %[[#l3]]
+  %1 = shufflevector <8 x i32> %0, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+  
+  ; CHECK: %[[#h0:]] = OpCompositeExtract %[[#int]] %[[#v4int_01]] 1
+  ; CHECK: %[[#h1:]] = OpCompositeExtract %[[#int]] %[[#v4int_01]] 3
+  ; CHECK: %[[#h2:]] = OpCompositeExtract %[[#int]] %[[#v4int_23]] 1
+  ; CHECK: %[[#h3:]] = OpCompositeExtract %[[#int]] %[[#v4int_23]] 3
+  ; CHECK: %[[#res_high:]] = OpCompositeConstruct %[[#v4int]] %[[#h0]] %[[#h1]] %[[#h2]] %[[#h3]]
+  %2 = shufflevector <8 x i32> %0, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+
+  store <4 x i32> %1, ptr addrspace(10) @Lows, align 16
+  store <4 x i32> %2, ptr addrspace(10) @Highs, align 16
+  ret void
+}
+
+define internal void @test_recombine() {
+entry:
+  ; CHECK: %[[#l:]] = OpLoad %[[#v4int]] %[[#Lows]]
+  %0 = load <4 x i32>, ptr addrspace(10) @Lows, align 16
+  ; CHECK: %[[#h:]] = OpLoad %[[#v4int]] %[[#Highs]]
+  %1 = load <4 x i32>, ptr addrspace(10) @Highs, align 16
+
+  ; CHECK-DAG: %[[#l0:]] = OpCompositeExtract %[[#int]] %[[#l]] 0
+  ; CHECK-DAG: %[[#l1:]] = OpCompositeExtract %[[#int]] %[[#l]] 1
+  ; CHECK-DAG: %[[#l2:]] = OpCompositeExtract %[[#int]] %[[#l]] 2
+  ; CHECK-DAG: %[[#l3:]] = OpCompositeExtract %[[#int]] %[[#l]] 3
+  ; CHECK-DAG: %[[#h0:]] = OpCompositeExtract %[[#int]] %[[#h]] 0
+  ; CHECK-DAG: %[[#h1:]] = OpCompositeExtract %[[#int]] %[[#h]] 1
+  ; CHECK-DAG: %[[#h2:]] = OpCompositeExtract %[[#int]] %[[#h]] 2
+  ; CHECK-DAG: %[[#h3:]] = OpCompositeExtract %[[#int]] %[[#h]] 3
+  ; CHECK-DAG: %[[#v2i0:]] = OpCompositeConstruct %[[#v2int]] %[[#l0]] %[[#h0]]
+  ; CHECK-DAG: %[[#d0:]] = OpBitcast %[[#double]] %[[#v2i0]]
+  ; CHECK-DAG: %[[#v2i1:]] = OpCompositeConstruct %[[#v2int]] %[[#l1]] %[[#h1]]
+  ; CHECK-DAG: %[[#d1:]] = OpBitcast %[[#double]] %[[#v2i1]]
+  ; CHECK-DAG: %[[#v2i2:]] = OpCompositeConstruct %[[#v2int]] %[[#l2]] %[[#h2]]
+  ; CHECK-DAG: %[[#d2:]] = OpBitcast %[[#double]] %[[#v2i2]]
+  ; CHECK-DAG: %[[#v2i3:]] = OpCompositeConstruct %[[#v2int]] %[[#l3]] %[[#h3]]
+  ; CHECK-DAG: %[[#d3:]] = OpBitcast %[[#double]] %[[#v2i3]]
+  ; CHECK-DAG: %[[#res:]] = OpCompositeConstruct %[[#v4double]] %[[#d0]] %[[#d1]] %[[#d2]] %[[#d3]]
+  %2 = shufflevector <4 x i32> %0, <4 x i32> %1, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
+
+  ; CHECK: OpStore %[[#GVec4]] %[[#res]]
+  store <8 x i32> %2, ptr addrspace(10) @GVec4, align 32
+  ret void
+}
+
+; Test splitting a vector of size 6. It must be expanded to 8, and then split.
+define internal void @test_bitcast_expand() {
+entry:
+  ; CHECK: %[[#load:]] = OpLoad %[[#v3double]] %[[#GVec3]]
+  %0 = load <3 x double>, ptr addrspace(10) @GVec3, align 32
+
+  ; CHECK: %[[#d0:]] = OpCompositeExtract %[[#double]] %[[#load]] 0
+  ; CHECK: %[[#d1:]] = OpCompositeExtract %[[#double]] %[[#load]] 1
+  ; CHECK: %[[#d2:]] = OpCompositeExtract %[[#double]] %[[#load]] 2
+  ; CHECK: %[[#v2d0:]] = OpCompositeConstruct %[[#v2double]] %[[#d0]] %[[#d1]]
+  ; CHECK: %[[#v2d1:]] = OpCompositeConstruct %[[#v2double]] %[[#d2]] %[[#]]
+  ; CHECK: %[[#v4i0:]] = OpBitcast %[[#v4int]] %[[#v2d0]]
+  ; CHECK: %[[#v4i1:]] = OpBitcast %[[#v4int]] %[[#v2d1]]
+  %1 = bitcast <3 x double> %0 to <6 x i32>
+  
+  ; CHECK: %[[#l0:]] = OpCompositeExtract %[[#int]] %[[#v4i0]] 0
+  ; CHECK: %[[#l1:]] = OpCompositeExtract %[[#int]] %[[#v4i0]] 2
+  ; CHECK: %[[#l2:]] = OpCompositeExtract %[[#int]] %[[#v4i1]] 0
+  ; CHECK: %[[#res_low:]] = OpCompositeConstruct %[[#v3int]] %[[#l0]] %[[#l1]] %[[#l2]]
+  %2 = shufflevector <6 x i32> %1, <6 x i32> poison, <3 x i32> <i32 0, i32 2, i32 4>
+  
+  ; CHECK: %[[#h0:]] = OpCompositeExtract %[[#int]] %[[#v4i0]] 1
+  ; CHECK: %[[#h1:]] = OpCompositeExtract %[[#int]] %[[#v4i0]] 3
+  ; CHECK: %[[#h2:]] = OpCompositeExtract %[[#int]] %[[#v4i1]] 1
+  ; CHECK: %[[#res_high:]] = OpCompositeConstruct %[[#v3int]] %[[#h0]] %[[#h1]] %[[#h2]]
+  %3 = shufflevector <6 x i32> %1, <6 x i32> poison, <3 x i32> <i32 1, i32 3, i32 5>
+
+  ; CHECK: OpStore %[[#Lows3]] %[[#res_low]]
+  store <3 x i32> %2, ptr addrspace(10) @Lows3, align 16
+
+  ; CHECK: OpStore %[[#Highs3]] %[[#res_high]]
+  store <3 x i32> %3, ptr addrspace(10) @Highs3, align 16
+  ret void
+}
+
+define void @main() local_unnamed_addr #0 {
+entry:
+  call void @test_split()
+  call void @test_recombine()
+  call void @test_bitcast_expand()
+  ret void
+}
+
+attributes #0 = { "hlsl.numthreads"="1,1,1" "hlsl.shader"="compute" }
diff --git a/llvm/test/CodeGen/SPIRV/llc-pipeline.ll b/llvm/test/CodeGen/SPIRV/llc-pipeline.ll
index 6db375445e4a3..3a1d0f7b5d218 100644
--- a/llvm/test/CodeGen/SPIRV/llc-pipeline.ll
+++ b/llvm/test/CodeGen/SPIRV/llc-pipeline.ll
@@ -11,9 +11,11 @@
 ; REQUIRES:asserts
 
 ; SPIRV-O0:Target Library Information
+; SPIRV-O0-NEXT:Runtime Library Function Analysis
 ; SPIRV-O0-NEXT:Target Pass Configuration
 ; SPIRV-O0-NEXT:Machine Module Information
 ; SPIRV-O0-NEXT:Target Transform Information
+; SPIRV-O0-NEXT:Library Function Lowering Analysis
 ; SPIRV-O0-NEXT:Create Garbage Collector Module Metadata
 ; SPIRV-O0-NEXT:Assumption Cache Tracker
 ; SPIRV-O0-NEXT:Profile summary info
@@ -83,9 +85,11 @@
 ; SPIRV-O0-NEXT:      Free MachineFunction
 
 ; SPIRV-Opt:Target Library Information
+; SPIRV-Opt-NEXT:Runtime Library Function Analysis
 ; SPIRV-Opt-NEXT:Target Pass Configuration
 ; SPIRV-Opt-NEXT:Machine Module Information
 ; SPIRV-Opt-NEXT:Target Transform Information
+; SPIRV-Opt-NEXT:Library Function Lowering Analysis
 ; SPIRV-Opt-NEXT:Assumption Cache Tracker
 ; SPIRV-Opt-NEXT:Type-Based Alias Analysis
 ; SPIRV-Opt-NEXT:Scoped NoAlias Alias Analysis
diff --git a/llvm/test/CodeGen/SPIRV/llvm-intrinsics/bitreverse_small_type.ll b/llvm/test/CodeGen/SPIRV/llvm-intrinsics/bitreverse_small_type.ll
index 18856147896bb..d4b1592a044bc 100644
--- a/llvm/test/CodeGen/SPIRV/llvm-intrinsics/bitreverse_small_type.ll
+++ b/llvm/test/CodeGen/SPIRV/llvm-intrinsics/bitreverse_small_type.ll
@@ -1,11 +1,11 @@
 ;; Check that llvm.bitreverse.* intrinsics are lowered for
 ;; 2/4-bit scalar and vector types.
 
-; RUN: llc -O0 -verify-machineinstrs -mtriple=spirv64-unknown-unknown --spirv-ext=+SPV_INTEL_arbitrary_precision_integers,+SPV_KHR_bit_instructions %s -o - | FileCheck %s
-; TODO: %if spirv-tools %{ llc -O0 -mtriple=spirv64-unknown-unknown --spirv-ext=+SPV_INTEL_arbitrary_precision_integers,+SPV_KHR_bit_instructions %s -o - -filetype=obj | spirv-val %}
+; RUN: llc -O0 -verify-machineinstrs -mtriple=spirv64-unknown-unknown --spirv-ext=+SPV_ALTERA_arbitrary_precision_integers,+SPV_KHR_bit_instructions %s -o - | FileCheck %s
+; TODO: %if spirv-tools %{ llc -O0 -mtriple=spirv64-unknown-unknown --spirv-ext=+SPV_ALTERA_arbitrary_precision_integers,+SPV_KHR_bit_instructions %s -o - -filetype=obj | spirv-val %}
 
-; CHECK: OpCapability ArbitraryPrecisionIntegersINTEL
-; CHECK: OpExtension "SPV_INTEL_arbitrary_precision_integers"
+; CHECK: OpCapability ArbitraryPrecisionIntegersALTERA
+; CHECK: OpExtension "SPV_ALTERA_arbitrary_precision_integers"
 
 ; CHECK-DAG: %[[#I4:]] = OpTypeInt 4 0
 ; CHECK-DAG: %[[#I2:]] = OpTypeInt 2 0
diff --git a/llvm/test/CodeGen/SPIRV/semantics/target.ps.ll b/llvm/test/CodeGen/SPIRV/semantics/target.ps.ll
new file mode 100644
index 0000000000000..249ffc078f158
--- /dev/null
+++ b/llvm/test/CodeGen/SPIRV/semantics/target.ps.ll
@@ -0,0 +1,33 @@
+; RUN: llc -O0 -verify-machineinstrs -mtriple=spirv-vulkan-unknown %s -o - | FileCheck %s
+; RUN: %if spirv-tools %{ llc -O0 -mtriple=spirv-vulkan-unknown %s -o - -filetype=obj | spirv-val %}
+
+; CHECK-DAG:        OpDecorate %[[#INPUT:]] BuiltIn FragCoord
+; CHECK-DAG:        OpDecorate %[[#OUTPUT:]] Location 0
+
+; CHECK-DAG:   %[[#float:]] = OpTypeFloat 32
+; CHECK-DAG:      %[[#v4:]] = OpTypeVector %[[#float]] 4
+; CHECK-DAG:   %[[#ptr_i:]] = OpTypePointer Input %[[#v4]]
+; CHECK-DAG:   %[[#ptr_o:]] = OpTypePointer Output %[[#v4]]
+
+; CHECK-DAG:      %[[#INPUT]] = OpVariable %[[#ptr_i]] Input
+; CHECK-DAG:      %[[#OUTPUT]] = OpVariable %[[#ptr_o]] Output
+
+ at SV_Position = external hidden thread_local addrspace(7) externally_initialized constant <4 x float>, !spirv.Decorations !0
+ at SV_Target0 = external hidden thread_local addrspace(8) global <4 x float>, !spirv.Decorations !2
+
+define void @main() #1 {
+entry:
+  %0 = load <4 x float>, ptr addrspace(7) @SV_Position, align 16
+  store <4 x float> %0, ptr addrspace(8) @SV_Target0, align 16
+  ret void
+
+; CHECK: %[[#TMP:]] = OpLoad %[[#v4]] %[[#INPUT]] Aligned 16
+; CHECK:              OpStore %[[#OUTPUT]] %[[#TMP]] Aligned 16
+}
+
+!0 = !{!1}
+!1 = !{i32 11, i32 15}
+!2 = !{!3}
+!3 = !{i32 30, i32 0}
+
+
diff --git a/llvm/test/CodeGen/SPIRV/trunc-nonstd-bitwidth.ll b/llvm/test/CodeGen/SPIRV/trunc-nonstd-bitwidth.ll
index 79c2824c3dde1..16cd00b7180a7 100644
--- a/llvm/test/CodeGen/SPIRV/trunc-nonstd-bitwidth.ll
+++ b/llvm/test/CodeGen/SPIRV/trunc-nonstd-bitwidth.ll
@@ -1,12 +1,12 @@
 ; RUN: llc -O0 -mtriple=spirv64-unknown-unknown %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-NOEXT
 ; RUN: %if spirv-tools %{ llc -O0 -mtriple=spirv64-unknown-unknown %s -o - -filetype=obj | spirv-val %}
 
-; RUN: llc -O0 -mtriple=spirv64-unknown-unknown %s --spirv-ext=+SPV_INTEL_arbitrary_precision_integers -o - | FileCheck %s --check-prefixes=CHECK,CHECK-EXT
+; RUN: llc -O0 -mtriple=spirv64-unknown-unknown %s --spirv-ext=+SPV_ALTERA_arbitrary_precision_integers -o - | FileCheck %s --check-prefixes=CHECK,CHECK-EXT
 
 ; RUN: llc -O0 -mtriple=spirv32-unknown-unknown %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-NOEXT
 ; RUN: %if spirv-tools %{ llc -O0 -mtriple=spirv32-unknown-unknown %s -o - -filetype=obj | spirv-val %}
 
-; RUN: llc -O0 -mtriple=spirv32-unknown-unknown %s --spirv-ext=+SPV_INTEL_arbitrary_precision_integers -o - | FileCheck %s --check-prefixes=CHECK,CHECK-EXT
+; RUN: llc -O0 -mtriple=spirv32-unknown-unknown %s --spirv-ext=+SPV_ALTERA_arbitrary_precision_integers -o - | FileCheck %s --check-prefixes=CHECK,CHECK-EXT
 
 ; TODO: This test currently fails with LLVM_ENABLE_EXPENSIVE_CHECKS enabled
 ; XFAIL: expensive_checks
diff --git a/llvm/test/CodeGen/SPIRV/zero-length-array.ll b/llvm/test/CodeGen/SPIRV/zero-length-array.ll
index 5fd94d25dfd87..cb34529ebfecd 100644
--- a/llvm/test/CodeGen/SPIRV/zero-length-array.ll
+++ b/llvm/test/CodeGen/SPIRV/zero-length-array.ll
@@ -1,7 +1,9 @@
-; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv-unknown-vulkan-compute %s -o - | FileCheck %s
+; RUN: llc -verify-machineinstrs -O0 -mtriple=spirv-unknown-vulkan-compute < %s | FileCheck %s
 ; RUN: %if spirv-tools %{ llc -O0 -mtriple=spirv-unknown-vulkan-compute %s -o - -filetype=obj | spirv-val %}
 
-; Nothing is generated, but compilation doesn't crash.
+; RUN: not llc -verify-machineinstrs -O0 -mtriple=spirv64-unknown-unknown < %s 2>&1 | FileCheck -check-prefix=CHECK-ERR %s
+
+; For compute, nothing is generated, but compilation doesn't crash.
 ; CHECK: OpName %[[#FOO:]] "foo"
 ; CHECK: OpName %[[#RTM:]] "reg2mem alloca point"
 ; CHECK: %[[#INT:]] = OpTypeInt 32 0
@@ -11,6 +13,10 @@
 ; CHECK-NEXT: OpReturn
 ; CHECK-NEXT: OpFunctionEnd
 
+
+; For non-compute targets, an error is emitted.
+; CHECK-ERR: LLVM ERROR: Runtime arrays are not allowed in non-shader SPIR-V modules
+
 define spir_func void @foo() {
 entry:
   %i = alloca [0 x i32], align 4
diff --git a/llvm/test/CodeGen/Thumb2/mve-blockplacement.ll b/llvm/test/CodeGen/Thumb2/mve-blockplacement.ll
index d076cb00ad7e0..706a7c34c3df5 100644
--- a/llvm/test/CodeGen/Thumb2/mve-blockplacement.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-blockplacement.ll
@@ -66,9 +66,8 @@ define i32 @test(i8 zeroext %var_2, i16 signext %var_15, ptr %arr_60) {
 ; CHECK-NEXT:    cset r6, ne
 ; CHECK-NEXT:    strb r6, [r5]
 ; CHECK-NEXT:    add.w r2, r2, #792
-; CHECK-NEXT:    ldrb r6, [r3]
+; CHECK-NEXT:    ldrb r6, [r3], #2
 ; CHECK-NEXT:    adds r4, #8
-; CHECK-NEXT:    adds r3, #2
 ; CHECK-NEXT:    cmp r6, #0
 ; CHECK-NEXT:    ite ne
 ; CHECK-NEXT:    sxthne r6, r1
@@ -101,8 +100,7 @@ define i32 @test(i8 zeroext %var_2, i16 signext %var_15, ptr %arr_60) {
 ; CHECK-NEXT:    cset r6, ne
 ; CHECK-NEXT:    adds r4, #8
 ; CHECK-NEXT:    strb r6, [r5]
-; CHECK-NEXT:    ldrb r6, [r3]
-; CHECK-NEXT:    adds r3, #2
+; CHECK-NEXT:    ldrb r6, [r3], #2
 ; CHECK-NEXT:    cmp r6, #0
 ; CHECK-NEXT:    ite ne
 ; CHECK-NEXT:    sxthne r6, r1
@@ -134,8 +132,7 @@ define i32 @test(i8 zeroext %var_2, i16 signext %var_15, ptr %arr_60) {
 ; CHECK-NEXT:    cset r4, ne
 ; CHECK-NEXT:    add.w r11, r11, #8
 ; CHECK-NEXT:    strb r4, [r5]
-; CHECK-NEXT:    ldrb r4, [r3]
-; CHECK-NEXT:    adds r3, #2
+; CHECK-NEXT:    ldrb r4, [r3], #2
 ; CHECK-NEXT:    cmp r4, #0
 ; CHECK-NEXT:    ite ne
 ; CHECK-NEXT:    sxthne r4, r1
diff --git a/llvm/test/CodeGen/Thumb2/mve-intrinsics/strict-intrinsics.ll b/llvm/test/CodeGen/Thumb2/mve-intrinsics/strict-intrinsics.ll
index 9c3a921ba2540..9e42f3984c24d 100644
--- a/llvm/test/CodeGen/Thumb2/mve-intrinsics/strict-intrinsics.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-intrinsics/strict-intrinsics.ll
@@ -1,7 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve.fp -o - %s | FileCheck %s
 
-define arm_aapcs_vfpcc <8 x half> @test_vaddq_f16(<8 x half> %a, <8 x half> %b) {
+define arm_aapcs_vfpcc <8 x half> @test_vaddq_f16(<8 x half> %a, <8 x half> %b) #0 {
 ; CHECK-LABEL: test_vaddq_f16:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    vadd.f16 q0, q0, q1
@@ -11,7 +11,7 @@ entry:
   ret <8 x half> %0
 }
 
-define arm_aapcs_vfpcc <4 x float> @test_vaddq_f32(<4 x float> %a, <4 x float> %b) {
+define arm_aapcs_vfpcc <4 x float> @test_vaddq_f32(<4 x float> %a, <4 x float> %b) #0 {
 ; CHECK-LABEL: test_vaddq_f32:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    vadd.f32 q0, q0, q1
@@ -21,7 +21,7 @@ entry:
   ret <4 x float> %0
 }
 
-define arm_aapcs_vfpcc <8 x half> @test_vsubq_f16(<8 x half> %a, <8 x half> %b) {
+define arm_aapcs_vfpcc <8 x half> @test_vsubq_f16(<8 x half> %a, <8 x half> %b) #0 {
 ; CHECK-LABEL: test_vsubq_f16:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    vsub.f16 q0, q0, q1
@@ -31,7 +31,7 @@ entry:
   ret <8 x half> %0
 }
 
-define arm_aapcs_vfpcc <4 x float> @test_vsubq_f32(<4 x float> %a, <4 x float> %b) {
+define arm_aapcs_vfpcc <4 x float> @test_vsubq_f32(<4 x float> %a, <4 x float> %b) #0 {
 ; CHECK-LABEL: test_vsubq_f32:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    vsub.f32 q0, q0, q1
@@ -41,7 +41,7 @@ entry:
   ret <4 x float> %0
 }
 
-define arm_aapcs_vfpcc <8 x half> @test_vmulq_f16(<8 x half> %a, <8 x half> %b) {
+define arm_aapcs_vfpcc <8 x half> @test_vmulq_f16(<8 x half> %a, <8 x half> %b) #0 {
 ; CHECK-LABEL: test_vmulq_f16:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    vmul.f16 q0, q0, q1
@@ -51,7 +51,7 @@ entry:
   ret <8 x half> %0
 }
 
-define arm_aapcs_vfpcc <4 x float> @test_vmulq_f32(<4 x float> %a, <4 x float> %b) {
+define arm_aapcs_vfpcc <4 x float> @test_vmulq_f32(<4 x float> %a, <4 x float> %b) #0 {
 ; CHECK-LABEL: test_vmulq_f32:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    vmul.f32 q0, q0, q1
@@ -64,7 +64,7 @@ entry:
 
 
 
-define arm_aapcs_vfpcc <8 x half> @test_vaddq_f16_splat(<8 x half> %a, half %b) {
+define arm_aapcs_vfpcc <8 x half> @test_vaddq_f16_splat(<8 x half> %a, half %b) #0 {
 ; CHECK-LABEL: test_vaddq_f16_splat:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    vmov.f16 r0, s4
@@ -77,7 +77,7 @@ entry:
   ret <8 x half> %0
 }
 
-define arm_aapcs_vfpcc <4 x float> @test_vaddq_f32_splat(<4 x float> %a, float %b) {
+define arm_aapcs_vfpcc <4 x float> @test_vaddq_f32_splat(<4 x float> %a, float %b) #0 {
 ; CHECK-LABEL: test_vaddq_f32_splat:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    vmov r0, s4
@@ -90,7 +90,7 @@ entry:
   ret <4 x float> %0
 }
 
-define arm_aapcs_vfpcc <8 x half> @test_vsubq_f16_splat(<8 x half> %a, half %b) {
+define arm_aapcs_vfpcc <8 x half> @test_vsubq_f16_splat(<8 x half> %a, half %b) #0 {
 ; CHECK-LABEL: test_vsubq_f16_splat:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    vmov.f16 r0, s4
@@ -103,7 +103,7 @@ entry:
   ret <8 x half> %0
 }
 
-define arm_aapcs_vfpcc <4 x float> @test_vsubq_f32_splat(<4 x float> %a, float %b) {
+define arm_aapcs_vfpcc <4 x float> @test_vsubq_f32_splat(<4 x float> %a, float %b) #0 {
 ; CHECK-LABEL: test_vsubq_f32_splat:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    vmov r0, s4
@@ -116,7 +116,7 @@ entry:
   ret <4 x float> %0
 }
 
-define arm_aapcs_vfpcc <8 x half> @test_vmulq_f16_splat(<8 x half> %a, half %b) {
+define arm_aapcs_vfpcc <8 x half> @test_vmulq_f16_splat(<8 x half> %a, half %b) #0 {
 ; CHECK-LABEL: test_vmulq_f16_splat:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    vmov.f16 r0, s4
@@ -129,7 +129,7 @@ entry:
   ret <8 x half> %0
 }
 
-define arm_aapcs_vfpcc <4 x float> @test_vmulq_f32_splat(<4 x float> %a, float %b) {
+define arm_aapcs_vfpcc <4 x float> @test_vmulq_f32_splat(<4 x float> %a, float %b) #0 {
 ; CHECK-LABEL: test_vmulq_f32_splat:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    vmov r0, s4
@@ -141,3 +141,192 @@ entry:
   %0 = tail call <4 x float> @llvm.arm.mve.vmul.v4f32(<4 x float> %a, <4 x float> %s)
   ret <4 x float> %0
 }
+
+define arm_aapcs_vfpcc <4 x float> @fma_v4f32(<4 x float> %dst, <4 x float> %s1, <4 x float> %s2) #0 {
+; CHECK-LABEL: fma_v4f32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vfma.f32 q0, q1, q2
+; CHECK-NEXT:    bx lr
+entry:
+  %0 = tail call <4 x float> @llvm.arm.mve.fma.v4f32(<4 x float> %s1, <4 x float> %s2, <4 x float> %dst)
+  ret <4 x float> %0
+}
+
+define arm_aapcs_vfpcc <8 x half> @fma_v8f16(<8 x half> %dst, <8 x half> %s1, <8 x half> %s2) #0 {
+; CHECK-LABEL: fma_v8f16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vfma.f16 q0, q1, q2
+; CHECK-NEXT:    bx lr
+entry:
+  %0 = tail call <8 x half> @llvm.arm.mve.fma.v8f16(<8 x half> %s1, <8 x half> %s2, <8 x half> %dst)
+  ret <8 x half> %0
+}
+
+define arm_aapcs_vfpcc <4 x float> @fma_n_v8f16(<4 x float> %s1, <4 x float> %s2, float %s3) #0 {
+; CHECK-LABEL: fma_n_v8f16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmov r0, s8
+; CHECK-NEXT:    vfma.f32 q0, q1, r0
+; CHECK-NEXT:    bx lr
+entry:
+  %i = insertelement <4 x float> poison, float %s3, i32 0
+  %sp = shufflevector <4 x float> %i, <4 x float> poison, <4 x i32> zeroinitializer
+  %0 = tail call <4 x float> @llvm.arm.mve.fma.v4f32(<4 x float> %s2, <4 x float> %sp, <4 x float> %s1)
+  ret <4 x float> %0
+}
+
+define arm_aapcs_vfpcc <8 x half> @fma_n_v4f32(<8 x half> %s1, <8 x half> %s2, half %s3) #0 {
+; CHECK-LABEL: fma_n_v4f32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmov.f16 r0, s8
+; CHECK-NEXT:    vfma.f16 q0, q1, r0
+; CHECK-NEXT:    bx lr
+entry:
+  %i = insertelement <8 x half> poison, half %s3, i32 0
+  %sp = shufflevector <8 x half> %i, <8 x half> poison, <8 x i32> zeroinitializer
+  %0 = tail call <8 x half> @llvm.arm.mve.fma.v8f16(<8 x half> %s2, <8 x half> %sp, <8 x half> %s1)
+  ret <8 x half> %0
+}
+
+define arm_aapcs_vfpcc <4 x float> @fms_v4f32(<4 x float> %dst, <4 x float> %s1, <4 x float> %s2) #0 {
+; CHECK-LABEL: fms_v4f32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vfms.f32 q0, q1, q2
+; CHECK-NEXT:    bx lr
+entry:
+  %c = fneg <4 x float> %s1
+  %0 = tail call <4 x float> @llvm.arm.mve.fma.v4f32(<4 x float> %c, <4 x float> %s2, <4 x float> %dst)
+  ret <4 x float> %0
+}
+
+define arm_aapcs_vfpcc <8 x half> @fms_v8f16(<8 x half> %dst, <8 x half> %s1, <8 x half> %s2) #0 {
+; CHECK-LABEL: fms_v8f16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vfms.f16 q0, q1, q2
+; CHECK-NEXT:    bx lr
+entry:
+  %c = fneg <8 x half> %s1
+  %0 = tail call <8 x half> @llvm.arm.mve.fma.v8f16(<8 x half> %c, <8 x half> %s2, <8 x half> %dst)
+  ret <8 x half> %0
+}
+
+define arm_aapcs_vfpcc <4 x float> @fms_n_v8f16(<4 x float> %s1, <4 x float> %s2, float %s3) #0 {
+; CHECK-LABEL: fms_n_v8f16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmov r0, s8
+; CHECK-NEXT:    vdup.32 q2, r0
+; CHECK-NEXT:    vfms.f32 q0, q1, q2
+; CHECK-NEXT:    bx lr
+entry:
+  %c = fneg <4 x float> %s2
+  %i = insertelement <4 x float> poison, float %s3, i32 0
+  %sp = shufflevector <4 x float> %i, <4 x float> poison, <4 x i32> zeroinitializer
+  %0 = tail call <4 x float> @llvm.arm.mve.fma.v4f32(<4 x float> %c, <4 x float> %sp, <4 x float> %s1)
+  ret <4 x float> %0
+}
+
+define arm_aapcs_vfpcc <8 x half> @fms_n_v4f32(<8 x half> %s1, <8 x half> %s2, half %s3) #0 {
+; CHECK-LABEL: fms_n_v4f32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmov.f16 r0, s8
+; CHECK-NEXT:    vdup.16 q2, r0
+; CHECK-NEXT:    vfms.f16 q0, q1, q2
+; CHECK-NEXT:    bx lr
+entry:
+  %c = fneg <8 x half> %s2
+  %i = insertelement <8 x half> poison, half %s3, i32 0
+  %sp = shufflevector <8 x half> %i, <8 x half> poison, <8 x i32> zeroinitializer
+  %0 = tail call <8 x half> @llvm.arm.mve.fma.v8f16(<8 x half> %c, <8 x half> %sp, <8 x half> %s1)
+  ret <8 x half> %0
+}
+
+
+define arm_aapcs_vfpcc <8 x half> @test_vminnmq_f16(<8 x half> %a, <8 x half> %b) #0 {
+; CHECK-LABEL: test_vminnmq_f16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmaxnm.f16 q0, q0, q1
+; CHECK-NEXT:    bx lr
+entry:
+  %2 = tail call <8 x half> @llvm.arm.mve.vmaxnm.v8f16(<8 x half> %a, <8 x half> %b)
+  ret <8 x half> %2
+}
+
+define arm_aapcs_vfpcc <4 x float> @test_vminnmq_f32(<4 x float> %a, <4 x float> %b) #0 {
+; CHECK-LABEL: test_vminnmq_f32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmaxnm.f32 q0, q0, q1
+; CHECK-NEXT:    bx lr
+entry:
+  %2 = tail call <4 x float> @llvm.arm.mve.vmaxnm.v4f32(<4 x float> %a, <4 x float> %b)
+  ret <4 x float> %2
+}
+
+define arm_aapcs_vfpcc <8 x half> @test_vmaxnmq_f16(<8 x half> %a, <8 x half> %b) #0 {
+; CHECK-LABEL: test_vmaxnmq_f16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmaxnm.f16 q0, q0, q1
+; CHECK-NEXT:    bx lr
+entry:
+  %2 = tail call <8 x half> @llvm.arm.mve.vmaxnm.v8f16(<8 x half> %a, <8 x half> %b)
+  ret <8 x half> %2
+}
+
+define arm_aapcs_vfpcc <4 x float> @test_vmaxnmq_f32(<4 x float> %a, <4 x float> %b) #0 {
+; CHECK-LABEL: test_vmaxnmq_f32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmaxnm.f32 q0, q0, q1
+; CHECK-NEXT:    bx lr
+entry:
+  %2 = tail call <4 x float> @llvm.arm.mve.vmaxnm.v4f32(<4 x float> %a, <4 x float> %b)
+  ret <4 x float> %2
+}
+
+define arm_aapcs_vfpcc <8 x half> @test_vminnmaq_f16(<8 x half> %a, <8 x half> %b) #0 {
+; CHECK-LABEL: test_vminnmaq_f16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmaxnma.f16 q0, q1
+; CHECK-NEXT:    bx lr
+entry:
+  %0 = tail call <8 x half> @llvm.fabs.v8f16(<8 x half> %a)
+  %1 = tail call <8 x half> @llvm.fabs.v8f16(<8 x half> %b)
+  %2 = tail call <8 x half> @llvm.arm.mve.vmaxnm.v8f16(<8 x half> %0, <8 x half> %1)
+  ret <8 x half> %2
+}
+
+define arm_aapcs_vfpcc <4 x float> @test_vminnmaq_f32(<4 x float> %a, <4 x float> %b) #0 {
+; CHECK-LABEL: test_vminnmaq_f32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmaxnma.f32 q0, q1
+; CHECK-NEXT:    bx lr
+entry:
+  %0 = tail call <4 x float> @llvm.fabs.v4f32(<4 x float> %a)
+  %1 = tail call <4 x float> @llvm.fabs.v4f32(<4 x float> %b)
+  %2 = tail call <4 x float> @llvm.arm.mve.vmaxnm.v4f32(<4 x float> %0, <4 x float> %1)
+  ret <4 x float> %2
+}
+
+define arm_aapcs_vfpcc <8 x half> @test_vmaxnmaq_f16(<8 x half> %a, <8 x half> %b) #0 {
+; CHECK-LABEL: test_vmaxnmaq_f16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmaxnma.f16 q0, q1
+; CHECK-NEXT:    bx lr
+entry:
+  %0 = tail call <8 x half> @llvm.fabs.v8f16(<8 x half> %a)
+  %1 = tail call <8 x half> @llvm.fabs.v8f16(<8 x half> %b)
+  %2 = tail call <8 x half> @llvm.arm.mve.vmaxnm.v8f16(<8 x half> %0, <8 x half> %1)
+  ret <8 x half> %2
+}
+
+define arm_aapcs_vfpcc <4 x float> @test_vmaxnmaq_f32(<4 x float> %a, <4 x float> %b) #0 {
+; CHECK-LABEL: test_vmaxnmaq_f32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmaxnma.f32 q0, q1
+; CHECK-NEXT:    bx lr
+entry:
+  %0 = tail call <4 x float> @llvm.fabs.v4f32(<4 x float> %a)
+  %1 = tail call <4 x float> @llvm.fabs.v4f32(<4 x float> %b)
+  %2 = tail call <4 x float> @llvm.arm.mve.vmaxnm.v4f32(<4 x float> %0, <4 x float> %1)
+  ret <4 x float> %2
+}
+
+attributes #0 = { strictfp }
diff --git a/llvm/test/CodeGen/Thumb2/mve-vmulh.ll b/llvm/test/CodeGen/Thumb2/mve-vmulh.ll
index 32648b6b449a8..37f5e26c6e5a0 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vmulh.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vmulh.ll
@@ -71,6 +71,203 @@ entry:
   ret <4 x i32> %s2
 }
 
+define arm_aapcs_vfpcc <8 x i32> @vmulhs_v8i32(<8 x i32> %s0, <8 x i32> %s1) {
+; CHECK-LABEL: vmulhs_v8i32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmulh.s32 q0, q0, q2
+; CHECK-NEXT:    vmulh.s32 q1, q1, q3
+; CHECK-NEXT:    bx lr
+entry:
+  %s0s = sext <8 x i32> %s0 to <8 x i64>
+  %s1s = sext <8 x i32> %s1 to <8 x i64>
+  %m = mul <8 x i64> %s0s, %s1s
+  %s = ashr <8 x i64> %m, <i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32>
+  %s2 = trunc <8 x i64> %s to <8 x i32>
+  ret <8 x i32> %s2
+}
+
+define arm_aapcs_vfpcc <8 x i32> @vmulhu_v8i32(<8 x i32> %s0, <8 x i32> %s1) {
+; CHECK-LABEL: vmulhu_v8i32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmulh.u32 q0, q0, q2
+; CHECK-NEXT:    vmulh.u32 q1, q1, q3
+; CHECK-NEXT:    bx lr
+entry:
+  %s0s = zext <8 x i32> %s0 to <8 x i64>
+  %s1s = zext <8 x i32> %s1 to <8 x i64>
+  %m = mul <8 x i64> %s0s, %s1s
+  %s = lshr <8 x i64> %m, <i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32>
+  %s2 = trunc <8 x i64> %s to <8 x i32>
+  ret <8 x i32> %s2
+}
+
+define arm_aapcs_vfpcc <16 x i32> @vmulhs_v16i32(<16 x i32> %s0, <16 x i32> %s1) {
+; CHECK-LABEL: vmulhs_v16i32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    .vsave {d11, d12, d13, d14, d15}
+; CHECK-NEXT:    vpush {d11, d12, d13, d14, d15}
+; CHECK-NEXT:    .vsave {d9}
+; CHECK-NEXT:    vpush {d9}
+; CHECK-NEXT:    add r1, sp, #48
+; CHECK-NEXT:    vmov r0, s0
+; CHECK-NEXT:    vldrw.u32 q6, [r1]
+; CHECK-NEXT:    vmov.f32 s18, s1
+; CHECK-NEXT:    vmov.f32 s0, s2
+; CHECK-NEXT:    vmov r1, s24
+; CHECK-NEXT:    vmov.f32 s22, s25
+; CHECK-NEXT:    vmov.f32 s2, s3
+; CHECK-NEXT:    vmov.f32 s24, s26
+; CHECK-NEXT:    vmov.f32 s26, s27
+; CHECK-NEXT:    vmullb.s32 q7, q0, q6
+; CHECK-NEXT:    smmul r0, r0, r1
+; CHECK-NEXT:    vmov r1, s29
+; CHECK-NEXT:    vmov q0[2], q0[0], r0, r1
+; CHECK-NEXT:    vmov r0, s18
+; CHECK-NEXT:    vmov r1, s22
+; CHECK-NEXT:    vmov.f32 s18, s5
+; CHECK-NEXT:    smmul r0, r0, r1
+; CHECK-NEXT:    vmov r1, s31
+; CHECK-NEXT:    vmov q0[3], q0[1], r0, r1
+; CHECK-NEXT:    add r1, sp, #64
+; CHECK-NEXT:    vldrw.u32 q6, [r1]
+; CHECK-NEXT:    vmov r0, s4
+; CHECK-NEXT:    vmov.f32 s4, s6
+; CHECK-NEXT:    vmov r1, s24
+; CHECK-NEXT:    vmov.f32 s22, s25
+; CHECK-NEXT:    vmov.f32 s6, s7
+; CHECK-NEXT:    vmov.f32 s24, s26
+; CHECK-NEXT:    vmov.f32 s26, s27
+; CHECK-NEXT:    vmullb.s32 q7, q1, q6
+; CHECK-NEXT:    smmul r0, r0, r1
+; CHECK-NEXT:    vmov r1, s29
+; CHECK-NEXT:    vmov q1[2], q1[0], r0, r1
+; CHECK-NEXT:    vmov r0, s18
+; CHECK-NEXT:    vmov r1, s22
+; CHECK-NEXT:    vmov.f32 s18, s9
+; CHECK-NEXT:    smmul r0, r0, r1
+; CHECK-NEXT:    vmov r1, s31
+; CHECK-NEXT:    vmov q1[3], q1[1], r0, r1
+; CHECK-NEXT:    add r1, sp, #80
+; CHECK-NEXT:    vldrw.u32 q6, [r1]
+; CHECK-NEXT:    vmov r0, s8
+; CHECK-NEXT:    vmov.f32 s8, s10
+; CHECK-NEXT:    vmov r1, s24
+; CHECK-NEXT:    vmov.f32 s22, s25
+; CHECK-NEXT:    vmov.f32 s10, s11
+; CHECK-NEXT:    vmov.f32 s24, s26
+; CHECK-NEXT:    vmov.f32 s26, s27
+; CHECK-NEXT:    vmullb.s32 q7, q2, q6
+; CHECK-NEXT:    smmul r0, r0, r1
+; CHECK-NEXT:    vmov r1, s29
+; CHECK-NEXT:    vmov q2[2], q2[0], r0, r1
+; CHECK-NEXT:    vmov r0, s18
+; CHECK-NEXT:    vmov r1, s22
+; CHECK-NEXT:    vmov.f32 s18, s13
+; CHECK-NEXT:    smmul r0, r0, r1
+; CHECK-NEXT:    vmov r1, s31
+; CHECK-NEXT:    vmov q2[3], q2[1], r0, r1
+; CHECK-NEXT:    add r1, sp, #96
+; CHECK-NEXT:    vldrw.u32 q6, [r1]
+; CHECK-NEXT:    vmov r0, s12
+; CHECK-NEXT:    vmov.f32 s12, s14
+; CHECK-NEXT:    vmov r1, s24
+; CHECK-NEXT:    vmov.f32 s22, s25
+; CHECK-NEXT:    vmov.f32 s14, s15
+; CHECK-NEXT:    vmov.f32 s24, s26
+; CHECK-NEXT:    vmov.f32 s26, s27
+; CHECK-NEXT:    vmullb.s32 q7, q3, q6
+; CHECK-NEXT:    smmul r0, r0, r1
+; CHECK-NEXT:    vmov r1, s29
+; CHECK-NEXT:    vmov q3[2], q3[0], r0, r1
+; CHECK-NEXT:    vmov r0, s18
+; CHECK-NEXT:    vmov r1, s22
+; CHECK-NEXT:    smmul r0, r0, r1
+; CHECK-NEXT:    vmov r1, s31
+; CHECK-NEXT:    vmov q3[3], q3[1], r0, r1
+; CHECK-NEXT:    vpop {d9}
+; CHECK-NEXT:    vpop {d11, d12, d13, d14, d15}
+; CHECK-NEXT:    bx lr
+entry:
+  %s0s = sext <16 x i32> %s0 to <16 x i64>
+  %s1s = sext <16 x i32> %s1 to <16 x i64>
+  %m = mul <16 x i64> %s0s, %s1s
+  %s = ashr <16 x i64> %m, <i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32>
+  %s2 = trunc <16 x i64> %s to <16 x i32>
+  ret <16 x i32> %s2
+}
+
+define arm_aapcs_vfpcc <16 x i32> @vmulhu_v16i32(<16 x i32> %s0, <16 x i32> %s1) {
+; CHECK-LABEL: vmulhu_v16i32:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11, d12, d13, d14, d15}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11, d12, d13, d14, d15}
+; CHECK-NEXT:    add r0, sp, #64
+; CHECK-NEXT:    vmov.f32 s24, s2
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vmov.f32 s26, s3
+; CHECK-NEXT:    vmov.f32 s2, s1
+; CHECK-NEXT:    add r0, sp, #80
+; CHECK-NEXT:    vmov.f32 s28, s18
+; CHECK-NEXT:    vmov.f32 s30, s19
+; CHECK-NEXT:    vmov.f32 s18, s17
+; CHECK-NEXT:    vmullb.u32 q5, q6, q7
+; CHECK-NEXT:    vmullb.u32 q6, q0, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vmov.f32 s0, s25
+; CHECK-NEXT:    add r0, sp, #96
+; CHECK-NEXT:    vmov.f32 s1, s27
+; CHECK-NEXT:    vmov.f32 s24, s6
+; CHECK-NEXT:    vmov.f32 s26, s7
+; CHECK-NEXT:    vmov.f32 s28, s18
+; CHECK-NEXT:    vmov.f32 s30, s19
+; CHECK-NEXT:    vmov.f32 s6, s5
+; CHECK-NEXT:    vmov.f32 s18, s17
+; CHECK-NEXT:    vmov.f32 s2, s21
+; CHECK-NEXT:    vmov.f32 s3, s23
+; CHECK-NEXT:    vmullb.u32 q5, q6, q7
+; CHECK-NEXT:    vmullb.u32 q6, q1, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vmov.f32 s4, s25
+; CHECK-NEXT:    add r0, sp, #112
+; CHECK-NEXT:    vmov.f32 s5, s27
+; CHECK-NEXT:    vmov.f32 s24, s10
+; CHECK-NEXT:    vmov.f32 s26, s11
+; CHECK-NEXT:    vmov.f32 s28, s18
+; CHECK-NEXT:    vmov.f32 s30, s19
+; CHECK-NEXT:    vmov.f32 s10, s9
+; CHECK-NEXT:    vmov.f32 s18, s17
+; CHECK-NEXT:    vmov.f32 s6, s21
+; CHECK-NEXT:    vmov.f32 s7, s23
+; CHECK-NEXT:    vmullb.u32 q5, q6, q7
+; CHECK-NEXT:    vmullb.u32 q6, q2, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vmov.f32 s8, s25
+; CHECK-NEXT:    vmov.f32 s9, s27
+; CHECK-NEXT:    vmov.f32 s24, s14
+; CHECK-NEXT:    vmov.f32 s26, s15
+; CHECK-NEXT:    vmov.f32 s28, s18
+; CHECK-NEXT:    vmov.f32 s30, s19
+; CHECK-NEXT:    vmov.f32 s14, s13
+; CHECK-NEXT:    vmov.f32 s18, s17
+; CHECK-NEXT:    vmov.f32 s10, s21
+; CHECK-NEXT:    vmov.f32 s11, s23
+; CHECK-NEXT:    vmullb.u32 q5, q6, q7
+; CHECK-NEXT:    vmullb.u32 q6, q3, q4
+; CHECK-NEXT:    vmov.f32 s14, s21
+; CHECK-NEXT:    vmov.f32 s12, s25
+; CHECK-NEXT:    vmov.f32 s13, s27
+; CHECK-NEXT:    vmov.f32 s15, s23
+; CHECK-NEXT:    vpop {d8, d9, d10, d11, d12, d13, d14, d15}
+; CHECK-NEXT:    bx lr
+entry:
+  %s0s = zext <16 x i32> %s0 to <16 x i64>
+  %s1s = zext <16 x i32> %s1 to <16 x i64>
+  %m = mul <16 x i64> %s0s, %s1s
+  %s = lshr <16 x i64> %m, <i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32>
+  %s2 = trunc <16 x i64> %s to <16 x i32>
+  ret <16 x i32> %s2
+}
+
 define arm_aapcs_vfpcc <4 x i16> @vmulhs_v4i16(<4 x i16> %s0, <4 x i16> %s1) {
 ; CHECK-LABEL: vmulhs_v4i16:
 ; CHECK:       @ %bb.0: @ %entry
@@ -129,6 +326,124 @@ entry:
   ret <8 x i16> %s2
 }
 
+define arm_aapcs_vfpcc <16 x i16> @vmulhs_v16i16(<16 x i16> %s0, <16 x i16> %s1) {
+; CHECK-LABEL: vmulhs_v16i16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmulh.s16 q0, q0, q2
+; CHECK-NEXT:    vmulh.s16 q1, q1, q3
+; CHECK-NEXT:    bx lr
+entry:
+  %s0s = sext <16 x i16> %s0 to <16 x i32>
+  %s1s = sext <16 x i16> %s1 to <16 x i32>
+  %m = mul <16 x i32> %s0s, %s1s
+  %s = ashr <16 x i32> %m, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
+  %s2 = trunc <16 x i32> %s to <16 x i16>
+  ret <16 x i16> %s2
+}
+
+define arm_aapcs_vfpcc <16 x i16> @vmulhu_v16i16(<16 x i16> %s0, <16 x i16> %s1) {
+; CHECK-LABEL: vmulhu_v16i16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmulh.u16 q0, q0, q2
+; CHECK-NEXT:    vmulh.u16 q1, q1, q3
+; CHECK-NEXT:    bx lr
+entry:
+  %s0s = zext <16 x i16> %s0 to <16 x i32>
+  %s1s = zext <16 x i16> %s1 to <16 x i32>
+  %m = mul <16 x i32> %s0s, %s1s
+  %s = lshr <16 x i32> %m, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
+  %s2 = trunc <16 x i32> %s to <16 x i16>
+  ret <16 x i16> %s2
+}
+
+define arm_aapcs_vfpcc <32 x i16> @vmulhs_v32i16(<32 x i16> %s0, <32 x i16> %s1) {
+; CHECK-LABEL: vmulhs_v32i16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    add r0, sp, #32
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    add r0, sp, #48
+; CHECK-NEXT:    vmullt.s16 q5, q0, q4
+; CHECK-NEXT:    vmullb.s16 q0, q0, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vshr.u32 q5, q5, #16
+; CHECK-NEXT:    vshr.u32 q0, q0, #16
+; CHECK-NEXT:    add r0, sp, #64
+; CHECK-NEXT:    vmovnt.i32 q0, q5
+; CHECK-NEXT:    vmullt.s16 q5, q1, q4
+; CHECK-NEXT:    vmullb.s16 q1, q1, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vshr.u32 q5, q5, #16
+; CHECK-NEXT:    vshr.u32 q1, q1, #16
+; CHECK-NEXT:    add r0, sp, #80
+; CHECK-NEXT:    vmovnt.i32 q1, q5
+; CHECK-NEXT:    vmullt.s16 q5, q2, q4
+; CHECK-NEXT:    vmullb.s16 q2, q2, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vshr.u32 q5, q5, #16
+; CHECK-NEXT:    vshr.u32 q2, q2, #16
+; CHECK-NEXT:    vmovnt.i32 q2, q5
+; CHECK-NEXT:    vmullt.s16 q5, q3, q4
+; CHECK-NEXT:    vmullb.s16 q3, q3, q4
+; CHECK-NEXT:    vshr.u32 q5, q5, #16
+; CHECK-NEXT:    vshr.u32 q3, q3, #16
+; CHECK-NEXT:    vmovnt.i32 q3, q5
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    bx lr
+entry:
+  %s0s = sext <32 x i16> %s0 to <32 x i32>
+  %s1s = sext <32 x i16> %s1 to <32 x i32>
+  %m = mul <32 x i32> %s0s, %s1s
+  %s = ashr <32 x i32> %m, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
+  %s2 = trunc <32 x i32> %s to <32 x i16>
+  ret <32 x i16> %s2
+}
+
+define arm_aapcs_vfpcc <32 x i16> @vmulhu_v32i16(<32 x i16> %s0, <32 x i16> %s1) {
+; CHECK-LABEL: vmulhu_v32i16:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    add r0, sp, #32
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    add r0, sp, #48
+; CHECK-NEXT:    vmullt.u16 q5, q0, q4
+; CHECK-NEXT:    vmullb.u16 q0, q0, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vshr.u32 q5, q5, #16
+; CHECK-NEXT:    vshr.u32 q0, q0, #16
+; CHECK-NEXT:    add r0, sp, #64
+; CHECK-NEXT:    vmovnt.i32 q0, q5
+; CHECK-NEXT:    vmullt.u16 q5, q1, q4
+; CHECK-NEXT:    vmullb.u16 q1, q1, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vshr.u32 q5, q5, #16
+; CHECK-NEXT:    vshr.u32 q1, q1, #16
+; CHECK-NEXT:    add r0, sp, #80
+; CHECK-NEXT:    vmovnt.i32 q1, q5
+; CHECK-NEXT:    vmullt.u16 q5, q2, q4
+; CHECK-NEXT:    vmullb.u16 q2, q2, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vshr.u32 q5, q5, #16
+; CHECK-NEXT:    vshr.u32 q2, q2, #16
+; CHECK-NEXT:    vmovnt.i32 q2, q5
+; CHECK-NEXT:    vmullt.u16 q5, q3, q4
+; CHECK-NEXT:    vmullb.u16 q3, q3, q4
+; CHECK-NEXT:    vshr.u32 q5, q5, #16
+; CHECK-NEXT:    vshr.u32 q3, q3, #16
+; CHECK-NEXT:    vmovnt.i32 q3, q5
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    bx lr
+entry:
+  %s0s = zext <32 x i16> %s0 to <32 x i32>
+  %s1s = zext <32 x i16> %s1 to <32 x i32>
+  %m = mul <32 x i32> %s0s, %s1s
+  %s = lshr <32 x i32> %m, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
+  %s2 = trunc <32 x i32> %s to <32 x i16>
+  ret <32 x i16> %s2
+}
+
 define arm_aapcs_vfpcc <4 x i8> @vmulhs_v4i8(<4 x i8> %s0, <4 x i8> %s1) {
 ; CHECK-LABEL: vmulhs_v4i8:
 ; CHECK:       @ %bb.0: @ %entry
@@ -224,19 +539,137 @@ entry:
   ret <16 x i8> %s2
 }
 
+define arm_aapcs_vfpcc <32 x i8> @vmulhs_v32i8(<32 x i8> %s0, <32 x i8> %s1) {
+; CHECK-LABEL: vmulhs_v32i8:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmulh.s8 q0, q0, q2
+; CHECK-NEXT:    vmulh.s8 q1, q1, q3
+; CHECK-NEXT:    bx lr
+entry:
+  %s0s = sext <32 x i8> %s0 to <32 x i16>
+  %s1s = sext <32 x i8> %s1 to <32 x i16>
+  %m = mul <32 x i16> %s0s, %s1s
+  %s = ashr <32 x i16> %m, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
+  %s2 = trunc <32 x i16> %s to <32 x i8>
+  ret <32 x i8> %s2
+}
+
+define arm_aapcs_vfpcc <32 x i8> @vmulhu_v32i8(<32 x i8> %s0, <32 x i8> %s1) {
+; CHECK-LABEL: vmulhu_v32i8:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    vmulh.u8 q0, q0, q2
+; CHECK-NEXT:    vmulh.u8 q1, q1, q3
+; CHECK-NEXT:    bx lr
+entry:
+  %s0s = zext <32 x i8> %s0 to <32 x i16>
+  %s1s = zext <32 x i8> %s1 to <32 x i16>
+  %m = mul <32 x i16> %s0s, %s1s
+  %s = lshr <32 x i16> %m, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
+  %s2 = trunc <32 x i16> %s to <32 x i8>
+  ret <32 x i8> %s2
+}
+
+define arm_aapcs_vfpcc <64 x i8> @vmulhs_v64i8(<64 x i8> %s0, <64 x i8> %s1) {
+; CHECK-LABEL: vmulhs_v64i8:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    add r0, sp, #32
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    add r0, sp, #48
+; CHECK-NEXT:    vmullt.s8 q5, q0, q4
+; CHECK-NEXT:    vmullb.s8 q0, q0, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vshr.u16 q5, q5, #8
+; CHECK-NEXT:    vshr.u16 q0, q0, #8
+; CHECK-NEXT:    add r0, sp, #64
+; CHECK-NEXT:    vmovnt.i16 q0, q5
+; CHECK-NEXT:    vmullt.s8 q5, q1, q4
+; CHECK-NEXT:    vmullb.s8 q1, q1, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vshr.u16 q5, q5, #8
+; CHECK-NEXT:    vshr.u16 q1, q1, #8
+; CHECK-NEXT:    add r0, sp, #80
+; CHECK-NEXT:    vmovnt.i16 q1, q5
+; CHECK-NEXT:    vmullt.s8 q5, q2, q4
+; CHECK-NEXT:    vmullb.s8 q2, q2, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vshr.u16 q5, q5, #8
+; CHECK-NEXT:    vshr.u16 q2, q2, #8
+; CHECK-NEXT:    vmovnt.i16 q2, q5
+; CHECK-NEXT:    vmullt.s8 q5, q3, q4
+; CHECK-NEXT:    vmullb.s8 q3, q3, q4
+; CHECK-NEXT:    vshr.u16 q5, q5, #8
+; CHECK-NEXT:    vshr.u16 q3, q3, #8
+; CHECK-NEXT:    vmovnt.i16 q3, q5
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    bx lr
+entry:
+  %s0s = sext <64 x i8> %s0 to <64 x i16>
+  %s1s = sext <64 x i8> %s1 to <64 x i16>
+  %m = mul <64 x i16> %s0s, %s1s
+  %s = ashr <64 x i16> %m, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
+  %s2 = trunc <64 x i16> %s to <64 x i8>
+  ret <64 x i8> %s2
+}
+
+define arm_aapcs_vfpcc <64 x i8> @vmulhu_v64i8(<64 x i8> %s0, <64 x i8> %s1) {
+; CHECK-LABEL: vmulhu_v64i8:
+; CHECK:       @ %bb.0: @ %entry
+; CHECK-NEXT:    .vsave {d8, d9, d10, d11}
+; CHECK-NEXT:    vpush {d8, d9, d10, d11}
+; CHECK-NEXT:    add r0, sp, #32
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    add r0, sp, #48
+; CHECK-NEXT:    vmullt.u8 q5, q0, q4
+; CHECK-NEXT:    vmullb.u8 q0, q0, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vshr.u16 q5, q5, #8
+; CHECK-NEXT:    vshr.u16 q0, q0, #8
+; CHECK-NEXT:    add r0, sp, #64
+; CHECK-NEXT:    vmovnt.i16 q0, q5
+; CHECK-NEXT:    vmullt.u8 q5, q1, q4
+; CHECK-NEXT:    vmullb.u8 q1, q1, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vshr.u16 q5, q5, #8
+; CHECK-NEXT:    vshr.u16 q1, q1, #8
+; CHECK-NEXT:    add r0, sp, #80
+; CHECK-NEXT:    vmovnt.i16 q1, q5
+; CHECK-NEXT:    vmullt.u8 q5, q2, q4
+; CHECK-NEXT:    vmullb.u8 q2, q2, q4
+; CHECK-NEXT:    vldrw.u32 q4, [r0]
+; CHECK-NEXT:    vshr.u16 q5, q5, #8
+; CHECK-NEXT:    vshr.u16 q2, q2, #8
+; CHECK-NEXT:    vmovnt.i16 q2, q5
+; CHECK-NEXT:    vmullt.u8 q5, q3, q4
+; CHECK-NEXT:    vmullb.u8 q3, q3, q4
+; CHECK-NEXT:    vshr.u16 q5, q5, #8
+; CHECK-NEXT:    vshr.u16 q3, q3, #8
+; CHECK-NEXT:    vmovnt.i16 q3, q5
+; CHECK-NEXT:    vpop {d8, d9, d10, d11}
+; CHECK-NEXT:    bx lr
+entry:
+  %s0s = zext <64 x i8> %s0 to <64 x i16>
+  %s1s = zext <64 x i8> %s1 to <64 x i16>
+  %m = mul <64 x i16> %s0s, %s1s
+  %s = lshr <64 x i16> %m, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
+  %s2 = trunc <64 x i16> %s to <64 x i8>
+  ret <64 x i8> %s2
+}
+
 define void @vmulh_s8(ptr nocapture readonly %x, ptr nocapture readonly %y, ptr noalias nocapture %z, i32 %n) {
 ; CHECK-LABEL: vmulh_s8:
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    .save {r7, lr}
 ; CHECK-NEXT:    push {r7, lr}
 ; CHECK-NEXT:    mov.w lr, #64
-; CHECK-NEXT:  .LBB14_1: @ %vector.body
+; CHECK-NEXT:  .LBB26_1: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrb.u8 q0, [r0], #16
 ; CHECK-NEXT:    vldrb.u8 q1, [r1], #16
 ; CHECK-NEXT:    vmulh.s8 q0, q1, q0
 ; CHECK-NEXT:    vstrb.8 q0, [r2], #16
-; CHECK-NEXT:    le lr, .LBB14_1
+; CHECK-NEXT:    le lr, .LBB26_1
 ; CHECK-NEXT:  @ %bb.2: @ %for.cond.cleanup
 ; CHECK-NEXT:    pop {r7, pc}
 entry:
@@ -269,13 +702,13 @@ define void @vmulh_s16(ptr nocapture readonly %x, ptr nocapture readonly %y, ptr
 ; CHECK-NEXT:    .save {r7, lr}
 ; CHECK-NEXT:    push {r7, lr}
 ; CHECK-NEXT:    mov.w lr, #128
-; CHECK-NEXT:  .LBB15_1: @ %vector.body
+; CHECK-NEXT:  .LBB27_1: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrh.u16 q0, [r0], #16
 ; CHECK-NEXT:    vldrh.u16 q1, [r1], #16
 ; CHECK-NEXT:    vmulh.s16 q0, q1, q0
 ; CHECK-NEXT:    vstrb.8 q0, [r2], #16
-; CHECK-NEXT:    le lr, .LBB15_1
+; CHECK-NEXT:    le lr, .LBB27_1
 ; CHECK-NEXT:  @ %bb.2: @ %for.cond.cleanup
 ; CHECK-NEXT:    pop {r7, pc}
 entry:
@@ -308,13 +741,13 @@ define void @vmulh_s32(ptr nocapture readonly %x, ptr nocapture readonly %y, ptr
 ; CHECK-NEXT:    .save {r7, lr}
 ; CHECK-NEXT:    push {r7, lr}
 ; CHECK-NEXT:    mov.w lr, #256
-; CHECK-NEXT:  .LBB16_1: @ %vector.body
+; CHECK-NEXT:  .LBB28_1: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrw.u32 q0, [r0], #16
 ; CHECK-NEXT:    vldrw.u32 q1, [r1], #16
 ; CHECK-NEXT:    vmulh.s32 q0, q1, q0
 ; CHECK-NEXT:    vstrb.8 q0, [r2], #16
-; CHECK-NEXT:    le lr, .LBB16_1
+; CHECK-NEXT:    le lr, .LBB28_1
 ; CHECK-NEXT:  @ %bb.2: @ %for.cond.cleanup
 ; CHECK-NEXT:    pop {r7, pc}
 entry:
@@ -347,13 +780,13 @@ define void @vmulh_u8(ptr nocapture readonly %x, ptr nocapture readonly %y, ptr
 ; CHECK-NEXT:    .save {r7, lr}
 ; CHECK-NEXT:    push {r7, lr}
 ; CHECK-NEXT:    mov.w lr, #64
-; CHECK-NEXT:  .LBB17_1: @ %vector.body
+; CHECK-NEXT:  .LBB29_1: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrb.u8 q0, [r0], #16
 ; CHECK-NEXT:    vldrb.u8 q1, [r1], #16
 ; CHECK-NEXT:    vmulh.u8 q0, q1, q0
 ; CHECK-NEXT:    vstrb.8 q0, [r2], #16
-; CHECK-NEXT:    le lr, .LBB17_1
+; CHECK-NEXT:    le lr, .LBB29_1
 ; CHECK-NEXT:  @ %bb.2: @ %for.cond.cleanup
 ; CHECK-NEXT:    pop {r7, pc}
 entry:
@@ -386,13 +819,13 @@ define void @vmulh_u16(ptr nocapture readonly %x, ptr nocapture readonly %y, ptr
 ; CHECK-NEXT:    .save {r7, lr}
 ; CHECK-NEXT:    push {r7, lr}
 ; CHECK-NEXT:    mov.w lr, #128
-; CHECK-NEXT:  .LBB18_1: @ %vector.body
+; CHECK-NEXT:  .LBB30_1: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrh.u16 q0, [r0], #16
 ; CHECK-NEXT:    vldrh.u16 q1, [r1], #16
 ; CHECK-NEXT:    vmulh.u16 q0, q1, q0
 ; CHECK-NEXT:    vstrb.8 q0, [r2], #16
-; CHECK-NEXT:    le lr, .LBB18_1
+; CHECK-NEXT:    le lr, .LBB30_1
 ; CHECK-NEXT:  @ %bb.2: @ %for.cond.cleanup
 ; CHECK-NEXT:    pop {r7, pc}
 entry:
@@ -425,13 +858,13 @@ define void @vmulh_u32(ptr nocapture readonly %x, ptr nocapture readonly %y, ptr
 ; CHECK-NEXT:    .save {r7, lr}
 ; CHECK-NEXT:    push {r7, lr}
 ; CHECK-NEXT:    mov.w lr, #256
-; CHECK-NEXT:  .LBB19_1: @ %vector.body
+; CHECK-NEXT:  .LBB31_1: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrw.u32 q0, [r0], #16
 ; CHECK-NEXT:    vldrw.u32 q1, [r1], #16
 ; CHECK-NEXT:    vmulh.u32 q0, q1, q0
 ; CHECK-NEXT:    vstrb.8 q0, [r2], #16
-; CHECK-NEXT:    le lr, .LBB19_1
+; CHECK-NEXT:    le lr, .LBB31_1
 ; CHECK-NEXT:  @ %bb.2: @ %for.cond.cleanup
 ; CHECK-NEXT:    pop {r7, pc}
 entry:
@@ -467,15 +900,15 @@ define void @vmulh_s32_pred(ptr noalias nocapture %d, ptr noalias nocapture read
 ; CHECK-NEXT:    cmp r3, #1
 ; CHECK-NEXT:    it lt
 ; CHECK-NEXT:    poplt {r7, pc}
-; CHECK-NEXT:  .LBB20_1: @ %vector.ph
+; CHECK-NEXT:  .LBB32_1: @ %vector.ph
 ; CHECK-NEXT:    dlstp.32 lr, r3
-; CHECK-NEXT:  .LBB20_2: @ %vector.body
+; CHECK-NEXT:  .LBB32_2: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrw.u32 q0, [r1], #16
 ; CHECK-NEXT:    vldrw.u32 q1, [r2], #16
 ; CHECK-NEXT:    vmulh.s32 q0, q1, q0
 ; CHECK-NEXT:    vstrw.32 q0, [r0], #16
-; CHECK-NEXT:    letp lr, .LBB20_2
+; CHECK-NEXT:    letp lr, .LBB32_2
 ; CHECK-NEXT:  @ %bb.3: @ %for.cond.cleanup
 ; CHECK-NEXT:    pop {r7, pc}
 entry:
@@ -517,15 +950,15 @@ define void @vmulh_u32_pred(ptr noalias nocapture %d, ptr noalias nocapture read
 ; CHECK-NEXT:    cmp r3, #1
 ; CHECK-NEXT:    it lt
 ; CHECK-NEXT:    poplt {r7, pc}
-; CHECK-NEXT:  .LBB21_1: @ %vector.ph
+; CHECK-NEXT:  .LBB33_1: @ %vector.ph
 ; CHECK-NEXT:    dlstp.32 lr, r3
-; CHECK-NEXT:  .LBB21_2: @ %vector.body
+; CHECK-NEXT:  .LBB33_2: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrw.u32 q0, [r1], #16
 ; CHECK-NEXT:    vldrw.u32 q1, [r2], #16
 ; CHECK-NEXT:    vmulh.u32 q0, q1, q0
 ; CHECK-NEXT:    vstrw.32 q0, [r0], #16
-; CHECK-NEXT:    letp lr, .LBB21_2
+; CHECK-NEXT:    letp lr, .LBB33_2
 ; CHECK-NEXT:  @ %bb.3: @ %for.cond.cleanup
 ; CHECK-NEXT:    pop {r7, pc}
 entry:
@@ -567,15 +1000,15 @@ define void @vmulh_s16_pred(ptr noalias nocapture %d, ptr noalias nocapture read
 ; CHECK-NEXT:    cmp r3, #1
 ; CHECK-NEXT:    it lt
 ; CHECK-NEXT:    poplt {r7, pc}
-; CHECK-NEXT:  .LBB22_1: @ %vector.ph
+; CHECK-NEXT:  .LBB34_1: @ %vector.ph
 ; CHECK-NEXT:    dlstp.16 lr, r3
-; CHECK-NEXT:  .LBB22_2: @ %vector.body
+; CHECK-NEXT:  .LBB34_2: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrh.u16 q0, [r1], #16
 ; CHECK-NEXT:    vldrh.u16 q1, [r2], #16
 ; CHECK-NEXT:    vmulh.s16 q0, q1, q0
 ; CHECK-NEXT:    vstrh.16 q0, [r0], #16
-; CHECK-NEXT:    letp lr, .LBB22_2
+; CHECK-NEXT:    letp lr, .LBB34_2
 ; CHECK-NEXT:  @ %bb.3: @ %for.cond.cleanup
 ; CHECK-NEXT:    pop {r7, pc}
 entry:
@@ -617,15 +1050,15 @@ define void @vmulh_u16_pred(ptr noalias nocapture %d, ptr noalias nocapture read
 ; CHECK-NEXT:    cmp r3, #1
 ; CHECK-NEXT:    it lt
 ; CHECK-NEXT:    poplt {r7, pc}
-; CHECK-NEXT:  .LBB23_1: @ %vector.ph
+; CHECK-NEXT:  .LBB35_1: @ %vector.ph
 ; CHECK-NEXT:    dlstp.16 lr, r3
-; CHECK-NEXT:  .LBB23_2: @ %vector.body
+; CHECK-NEXT:  .LBB35_2: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrh.u16 q0, [r1], #16
 ; CHECK-NEXT:    vldrh.u16 q1, [r2], #16
 ; CHECK-NEXT:    vmulh.u16 q0, q1, q0
 ; CHECK-NEXT:    vstrh.16 q0, [r0], #16
-; CHECK-NEXT:    letp lr, .LBB23_2
+; CHECK-NEXT:    letp lr, .LBB35_2
 ; CHECK-NEXT:  @ %bb.3: @ %for.cond.cleanup
 ; CHECK-NEXT:    pop {r7, pc}
 entry:
@@ -667,15 +1100,15 @@ define void @vmulh_s8_pred(ptr noalias nocapture %d, ptr noalias nocapture reado
 ; CHECK-NEXT:    cmp r3, #1
 ; CHECK-NEXT:    it lt
 ; CHECK-NEXT:    poplt {r7, pc}
-; CHECK-NEXT:  .LBB24_1: @ %vector.ph
+; CHECK-NEXT:  .LBB36_1: @ %vector.ph
 ; CHECK-NEXT:    dlstp.8 lr, r3
-; CHECK-NEXT:  .LBB24_2: @ %vector.body
+; CHECK-NEXT:  .LBB36_2: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrb.u8 q0, [r1], #16
 ; CHECK-NEXT:    vldrb.u8 q1, [r2], #16
 ; CHECK-NEXT:    vmulh.s8 q0, q1, q0
 ; CHECK-NEXT:    vstrb.8 q0, [r0], #16
-; CHECK-NEXT:    letp lr, .LBB24_2
+; CHECK-NEXT:    letp lr, .LBB36_2
 ; CHECK-NEXT:  @ %bb.3: @ %for.cond.cleanup
 ; CHECK-NEXT:    pop {r7, pc}
 entry:
@@ -717,15 +1150,15 @@ define void @vmulh_u8_pred(ptr noalias nocapture %d, ptr noalias nocapture reado
 ; CHECK-NEXT:    cmp r3, #1
 ; CHECK-NEXT:    it lt
 ; CHECK-NEXT:    poplt {r7, pc}
-; CHECK-NEXT:  .LBB25_1: @ %vector.ph
+; CHECK-NEXT:  .LBB37_1: @ %vector.ph
 ; CHECK-NEXT:    dlstp.8 lr, r3
-; CHECK-NEXT:  .LBB25_2: @ %vector.body
+; CHECK-NEXT:  .LBB37_2: @ %vector.body
 ; CHECK-NEXT:    @ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vldrb.u8 q0, [r1], #16
 ; CHECK-NEXT:    vldrb.u8 q1, [r2], #16
 ; CHECK-NEXT:    vmulh.u8 q0, q1, q0
 ; CHECK-NEXT:    vstrb.8 q0, [r0], #16
-; CHECK-NEXT:    letp lr, .LBB25_2
+; CHECK-NEXT:    letp lr, .LBB37_2
 ; CHECK-NEXT:  @ %bb.3: @ %for.cond.cleanup
 ; CHECK-NEXT:    pop {r7, pc}
 entry:
diff --git a/llvm/test/CodeGen/WebAssembly/masked-shifts.ll b/llvm/test/CodeGen/WebAssembly/masked-shifts.ll
index 5bcb023e546b5..8f90fa68e8fbd 100644
--- a/llvm/test/CodeGen/WebAssembly/masked-shifts.ll
+++ b/llvm/test/CodeGen/WebAssembly/masked-shifts.ll
@@ -18,6 +18,21 @@ define i32 @shl_i32(i32 %v, i32 %x) {
   ret i32 %a
 }
 
+define i64 @shl_i64_zext(i64 %v, i32 %x) {
+; CHECK-LABEL: shl_i64_zext:
+; CHECK:         .functype shl_i64_zext (i64, i32) -> (i64)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    local.get 0
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i64.extend_i32_u
+; CHECK-NEXT:    i64.shl
+; CHECK-NEXT:    # fallthrough-return
+  %m = and i32 %x, 63
+  %z = zext i32 %m to i64
+  %a = shl i64 %v, %z
+  ret i64 %a
+}
+
 define i32 @sra_i32(i32 %v, i32 %x) {
 ; CHECK-LABEL: sra_i32:
 ; CHECK:         .functype sra_i32 (i32, i32) -> (i32)
@@ -31,6 +46,21 @@ define i32 @sra_i32(i32 %v, i32 %x) {
   ret i32 %a
 }
 
+define i64 @sra_i64_zext(i64 %v, i32 %x) {
+; CHECK-LABEL: sra_i64_zext:
+; CHECK:         .functype sra_i64_zext (i64, i32) -> (i64)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    local.get 0
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i64.extend_i32_u
+; CHECK-NEXT:    i64.shr_s
+; CHECK-NEXT:    # fallthrough-return
+  %m = and i32 %x, 63
+  %z = zext i32 %m to i64
+  %a = ashr i64 %v, %z
+  ret i64 %a
+}
+
 define i32 @srl_i32(i32 %v, i32 %x) {
 ; CHECK-LABEL: srl_i32:
 ; CHECK:         .functype srl_i32 (i32, i32) -> (i32)
@@ -44,6 +74,21 @@ define i32 @srl_i32(i32 %v, i32 %x) {
   ret i32 %a
 }
 
+define i64 @srl_i64_zext(i64 %v, i32 %x) {
+; CHECK-LABEL: srl_i64_zext:
+; CHECK:         .functype srl_i64_zext (i64, i32) -> (i64)
+; CHECK-NEXT:  # %bb.0:
+; CHECK-NEXT:    local.get 0
+; CHECK-NEXT:    local.get 1
+; CHECK-NEXT:    i64.extend_i32_u
+; CHECK-NEXT:    i64.shr_u
+; CHECK-NEXT:    # fallthrough-return
+  %m = and i32 %x, 63
+  %z = zext i32 %m to i64
+  %a = lshr i64 %v, %z
+  ret i64 %a
+}
+
 define i64 @shl_i64(i64 %v, i64 %x) {
 ; CHECK-LABEL: shl_i64:
 ; CHECK:         .functype shl_i64 (i64, i64) -> (i64)
diff --git a/llvm/test/CodeGen/X86/O0-pipeline.ll b/llvm/test/CodeGen/X86/O0-pipeline.ll
index 78a02b11b17bb..9223348abbcb9 100644
--- a/llvm/test/CodeGen/X86/O0-pipeline.ll
+++ b/llvm/test/CodeGen/X86/O0-pipeline.ll
@@ -7,9 +7,11 @@
 
 ; CHECK-LABEL: Pass Arguments:
 ; CHECK-NEXT: Target Library Information
+; CHECK-NEXT: Runtime Library Function Analysis
 ; CHECK-NEXT: Target Pass Configuration
 ; CHECK-NEXT: Machine Module Information
 ; CHECK-NEXT: Target Transform Information
+; CHECK-NEXT: Library Function Lowering Analysis
 ; CHECK-NEXT: Create Garbage Collector Module Metadata
 ; CHECK-NEXT: Assumption Cache Tracker
 ; CHECK-NEXT: Profile summary info
diff --git a/llvm/test/CodeGen/X86/addcarry.ll b/llvm/test/CodeGen/X86/addcarry.ll
index f8a04f8514988..ee4482062df31 100644
--- a/llvm/test/CodeGen/X86/addcarry.ll
+++ b/llvm/test/CodeGen/X86/addcarry.ll
@@ -1517,18 +1517,9 @@ define i1 @pr84831(i64 %arg) {
 define void @pr169691(ptr %p0, i64 %implicit, i1 zeroext %carry) {
 ; CHECK-LABEL: pr169691:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    movq (%rdi), %rax
-; CHECK-NEXT:    addq %rsi, %rax
-; CHECK-NEXT:    setb %cl
-; CHECK-NEXT:    movl %edx, %edx
-; CHECK-NEXT:    addq %rax, %rdx
-; CHECK-NEXT:    setb %al
-; CHECK-NEXT:    orb %cl, %al
-; CHECK-NEXT:    movq %rdx, (%rdi)
-; CHECK-NEXT:    addq 8(%rdi), %rsi
-; CHECK-NEXT:    movzbl %al, %eax
-; CHECK-NEXT:    addq %rsi, %rax
-; CHECK-NEXT:    movq %rax, 8(%rdi)
+; CHECK-NEXT:    addb $-1, %dl
+; CHECK-NEXT:    adcq %rsi, (%rdi)
+; CHECK-NEXT:    adcq %rsi, 8(%rdi)
 ; CHECK-NEXT:    retq
   %a0 = load i64, ptr %p0, align 8
   %uaddo0 = call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 %a0, i64 %implicit)
diff --git a/llvm/test/CodeGen/X86/avx512-skx-insert-subvec.ll b/llvm/test/CodeGen/X86/avx512-skx-insert-subvec.ll
index a24c1d8c2fcc4..7fb20418aeda4 100644
--- a/llvm/test/CodeGen/X86/avx512-skx-insert-subvec.ll
+++ b/llvm/test/CodeGen/X86/avx512-skx-insert-subvec.ll
@@ -52,13 +52,12 @@ define <8 x i1> @test3(<4 x i1> %a) {
 define <8 x i1> @test4(<4 x i1> %a, <4 x i1>%b) {
 ; CHECK-LABEL: test4:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vpslld $31, %xmm1, %xmm1
-; CHECK-NEXT:    vpmovd2m %xmm1, %k0
-; CHECK-NEXT:    vpslld $31, %xmm0, %xmm0
-; CHECK-NEXT:    vpmovd2m %xmm0, %k1
-; CHECK-NEXT:    kshiftlb $4, %k0, %k0
-; CHECK-NEXT:    korb %k0, %k1, %k0
+; CHECK-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; CHECK-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; CHECK-NEXT:    vpslld $31, %ymm0, %ymm0
+; CHECK-NEXT:    vpmovd2m %ymm0, %k0
 ; CHECK-NEXT:    vpmovm2w %k0, %xmm0
+; CHECK-NEXT:    vzeroupper
 ; CHECK-NEXT:    retq
 
   %res = shufflevector <4 x i1> %a, <4 x i1> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
@@ -68,13 +67,12 @@ define <8 x i1> @test4(<4 x i1> %a, <4 x i1>%b) {
 define <4 x i1> @test5(<2 x i1> %a, <2 x i1>%b) {
 ; CHECK-LABEL: test5:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vpsllq $63, %xmm1, %xmm1
-; CHECK-NEXT:    vpmovq2m %xmm1, %k0
-; CHECK-NEXT:    vpsllq $63, %xmm0, %xmm0
-; CHECK-NEXT:    vpmovq2m %xmm0, %k1
-; CHECK-NEXT:    kshiftlb $2, %k0, %k0
-; CHECK-NEXT:    korw %k0, %k1, %k0
+; CHECK-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; CHECK-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; CHECK-NEXT:    vpsllq $63, %ymm0, %ymm0
+; CHECK-NEXT:    vpmovq2m %ymm0, %k0
 ; CHECK-NEXT:    vpmovm2d %k0, %xmm0
+; CHECK-NEXT:    vzeroupper
 ; CHECK-NEXT:    retq
 
   %res = shufflevector <2 x i1> %a, <2 x i1> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
diff --git a/llvm/test/CodeGen/X86/combine-fceil.ll b/llvm/test/CodeGen/X86/combine-fceil.ll
new file mode 100644
index 0000000000000..a3f55e8f64b80
--- /dev/null
+++ b/llvm/test/CodeGen/X86/combine-fceil.ll
@@ -0,0 +1,193 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX1
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX,AVX512
+
+define <4 x double> @concat_ceil_v4f64_v2f64(<2 x double> %a0, <2 x double> %a1) {
+; SSE-LABEL: concat_ceil_v4f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $10, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $10, %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_ceil_v4f64_v2f64:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundpd $10, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <2 x double> @llvm.ceil.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.ceil.v2f64(<2 x double> %a1)
+  %res  = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  ret <4 x double> %res
+}
+
+define <8 x float> @concat_ceil_v8f32_v4f32(<4 x float> %a0, <4 x float> %a1) {
+; SSE-LABEL: concat_ceil_v8f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $10, %xmm0, %xmm0
+; SSE-NEXT:    roundps $10, %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_ceil_v8f32_v4f32:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundps $10, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %a1)
+  %res  = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x float> %res
+}
+
+define <8 x double> @concat_ceil_v8f64_v2f64(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2, <2 x double> %a3) {
+; SSE-LABEL: concat_ceil_v8f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $10, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $10, %xmm1, %xmm1
+; SSE-NEXT:    roundpd $10, %xmm2, %xmm2
+; SSE-NEXT:    roundpd $10, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_ceil_v8f64_v2f64:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundpd $10, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundpd $10, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_ceil_v8f64_v2f64:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundpd $10, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundpd $10, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_ceil_v8f64_v2f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $10, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <2 x double> @llvm.ceil.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.ceil.v2f64(<2 x double> %a1)
+  %v2 = call <2 x double> @llvm.ceil.v2f64(<2 x double> %a2)
+  %v3 = call <2 x double> @llvm.ceil.v2f64(<2 x double> %a3)
+  %r01 = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %r23 = shufflevector <2 x double> %v2, <2 x double> %v3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %res  = shufflevector <4 x double> %r01, <4 x double> %r23, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_ceil_v16f32_v4f32(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2, <4 x float> %a3) {
+; SSE-LABEL: concat_ceil_v16f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $10, %xmm0, %xmm0
+; SSE-NEXT:    roundps $10, %xmm1, %xmm1
+; SSE-NEXT:    roundps $10, %xmm2, %xmm2
+; SSE-NEXT:    roundps $10, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_ceil_v16f32_v4f32:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundps $10, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundps $10, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_ceil_v16f32_v4f32:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundps $10, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundps $10, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_ceil_v16f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $10, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %a1)
+  %v2 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %a2)
+  %v3 = call <4 x float> @llvm.ceil.v4f32(<4 x float> %a3)
+  %r01 = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r23 = shufflevector <4 x float> %v2, <4 x float> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %res  = shufflevector <8 x float> %r01, <8 x float> %r23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
+
+define <8 x double> @concat_ceil_v8f64_v4f64(<4 x double> %a0, <4 x double> %a1) {
+; SSE-LABEL: concat_ceil_v8f64_v4f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $10, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $10, %xmm1, %xmm1
+; SSE-NEXT:    roundpd $10, %xmm2, %xmm2
+; SSE-NEXT:    roundpd $10, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_ceil_v8f64_v4f64:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundpd $10, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundpd $10, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_ceil_v8f64_v4f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $10, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x double> @llvm.ceil.v4f64(<4 x double> %a0)
+  %v1 = call <4 x double> @llvm.ceil.v4f64(<4 x double> %a1)
+  %res  = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_ceil_v16f32_v8f32(<8 x float> %a0, <8 x float> %a1) {
+; SSE-LABEL: concat_ceil_v16f32_v8f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $10, %xmm0, %xmm0
+; SSE-NEXT:    roundps $10, %xmm1, %xmm1
+; SSE-NEXT:    roundps $10, %xmm2, %xmm2
+; SSE-NEXT:    roundps $10, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_ceil_v16f32_v8f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundps $10, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundps $10, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_ceil_v16f32_v8f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $10, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <8 x float> @llvm.ceil.v8f32(<8 x float> %a0)
+  %v1 = call <8 x float> @llvm.ceil.v8f32(<8 x float> %a1)
+  %res  = shufflevector <8 x float> %v0, <8 x float> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
diff --git a/llvm/test/CodeGen/X86/combine-fcmp.ll b/llvm/test/CodeGen/X86/combine-fcmp.ll
new file mode 100644
index 0000000000000..f2666f69949b7
--- /dev/null
+++ b/llvm/test/CodeGen/X86/combine-fcmp.ll
@@ -0,0 +1,330 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64    | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge | FileCheck %s --check-prefixes=AVX1OR2,AVX1
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=AVX1OR2,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX512
+
+define i4 @concat_fcmp_v4f64_v2f64(<2 x double> %a0, <2 x double> %a1) {
+; SSE-LABEL: concat_fcmp_v4f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    xorpd %xmm2, %xmm2
+; SSE-NEXT:    xorpd %xmm3, %xmm3
+; SSE-NEXT:    cmpltpd %xmm0, %xmm3
+; SSE-NEXT:    cmpltpd %xmm1, %xmm2
+; SSE-NEXT:    shufps {{.*#+}} xmm3 = xmm3[0,2],xmm2[0,2]
+; SSE-NEXT:    movmskps %xmm3, %eax
+; SSE-NEXT:    # kill: def $al killed $al killed $eax
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_fcmp_v4f64_v2f64:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vxorpd %xmm2, %xmm2, %xmm2
+; AVX1OR2-NEXT:    vcmpltpd %xmm0, %xmm2, %xmm0
+; AVX1OR2-NEXT:    vcmpltpd %xmm1, %xmm2, %xmm1
+; AVX1OR2-NEXT:    vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
+; AVX1OR2-NEXT:    vmovmskps %xmm0, %eax
+; AVX1OR2-NEXT:    # kill: def $al killed $al killed $eax
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_fcmp_v4f64_v2f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vxorpd %xmm2, %xmm2, %xmm2
+; AVX512-NEXT:    vcmpltpd %xmm0, %xmm2, %k0
+; AVX512-NEXT:    vcmpltpd %xmm1, %xmm2, %k1
+; AVX512-NEXT:    kshiftlb $2, %k1, %k1
+; AVX512-NEXT:    korw %k1, %k0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    # kill: def $al killed $al killed $eax
+; AVX512-NEXT:    retq
+  %v0 = fcmp ogt <2 x double> %a0, zeroinitializer
+  %v1 = fcmp ogt <2 x double> %a1, zeroinitializer
+  %v = shufflevector <2 x i1> %v0, <2 x i1> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %r = bitcast <4 x i1> %v to i4
+  ret i4 %r
+}
+
+define i8 @concat_fcmp_v8f32_v4f32(<4 x float> %a0, <4 x float> %a1) {
+; SSE-LABEL: concat_fcmp_v8f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    xorps %xmm2, %xmm2
+; SSE-NEXT:    cmpeqps %xmm2, %xmm0
+; SSE-NEXT:    cmpeqps %xmm2, %xmm1
+; SSE-NEXT:    packssdw %xmm1, %xmm0
+; SSE-NEXT:    packsswb %xmm0, %xmm0
+; SSE-NEXT:    pmovmskb %xmm0, %eax
+; SSE-NEXT:    # kill: def $al killed $al killed $eax
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_fcmp_v8f32_v4f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vxorps %xmm2, %xmm2, %xmm2
+; AVX1OR2-NEXT:    vcmpeqps %xmm2, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vcmpeqps %xmm2, %xmm1, %xmm1
+; AVX1OR2-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vpmovmskb %xmm0, %eax
+; AVX1OR2-NEXT:    # kill: def $al killed $al killed $eax
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_fcmp_v8f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vxorps %xmm1, %xmm1, %xmm1
+; AVX512-NEXT:    vcmpeqps %ymm1, %ymm0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    # kill: def $al killed $al killed $eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = fcmp oeq <4 x float> %a0, zeroinitializer
+  %v1 = fcmp oeq <4 x float> %a1, zeroinitializer
+  %v = shufflevector <4 x i1> %v0, <4 x i1> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r = bitcast <8 x i1> %v to i8
+  ret i8 %r
+}
+
+define i8 @concat_fcmp_v8f64_v2f64(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2, <2 x double> %a3) {
+; SSE-LABEL: concat_fcmp_v8f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    xorpd %xmm4, %xmm4
+; SSE-NEXT:    cmpltpd %xmm4, %xmm0
+; SSE-NEXT:    cmpltpd %xmm4, %xmm1
+; SSE-NEXT:    packssdw %xmm1, %xmm0
+; SSE-NEXT:    cmpltpd %xmm4, %xmm2
+; SSE-NEXT:    cmpltpd %xmm4, %xmm3
+; SSE-NEXT:    packssdw %xmm3, %xmm2
+; SSE-NEXT:    packssdw %xmm0, %xmm0
+; SSE-NEXT:    packssdw %xmm2, %xmm2
+; SSE-NEXT:    packsswb %xmm2, %xmm0
+; SSE-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,3,2,3]
+; SSE-NEXT:    pmovmskb %xmm0, %eax
+; SSE-NEXT:    # kill: def $al killed $al killed $eax
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_fcmp_v8f64_v2f64:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vxorpd %xmm4, %xmm4, %xmm4
+; AVX1OR2-NEXT:    vcmpltpd %xmm4, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vcmpltpd %xmm4, %xmm1, %xmm1
+; AVX1OR2-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vcmpltpd %xmm4, %xmm2, %xmm1
+; AVX1OR2-NEXT:    vcmpltpd %xmm4, %xmm3, %xmm2
+; AVX1OR2-NEXT:    vpackssdw %xmm2, %xmm1, %xmm1
+; AVX1OR2-NEXT:    vpackssdw %xmm1, %xmm1, %xmm1
+; AVX1OR2-NEXT:    vpackssdw %xmm0, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,3,0,3]
+; AVX1OR2-NEXT:    vpmovmskb %xmm0, %eax
+; AVX1OR2-NEXT:    # kill: def $al killed $al killed $eax
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_fcmp_v8f64_v2f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vxorpd %xmm1, %xmm1, %xmm1
+; AVX512-NEXT:    vcmpltpd %zmm1, %zmm0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    # kill: def $al killed $al killed $eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = fcmp olt <2 x double> %a0, zeroinitializer
+  %v1 = fcmp olt <2 x double> %a1, zeroinitializer
+  %v2 = fcmp olt <2 x double> %a2, zeroinitializer
+  %v3 = fcmp olt <2 x double> %a3, zeroinitializer
+  %v01 = shufflevector <2 x i1> %v0, <2 x i1> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %v23 = shufflevector <2 x i1> %v2, <2 x i1> %v3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %v = shufflevector <4 x i1> %v01, <4 x i1> %v23, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r = bitcast <8 x i1> %v to i8
+  ret i8 %r
+}
+
+define i16 @concat_fcmp_v16f32_v4f32(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2, <4 x float> %a3) {
+; SSE-LABEL: concat_fcmp_v16f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    xorps %xmm4, %xmm4
+; SSE-NEXT:    xorps %xmm5, %xmm5
+; SSE-NEXT:    cmpleps %xmm0, %xmm5
+; SSE-NEXT:    xorps %xmm0, %xmm0
+; SSE-NEXT:    cmpleps %xmm1, %xmm0
+; SSE-NEXT:    packssdw %xmm0, %xmm5
+; SSE-NEXT:    xorps %xmm0, %xmm0
+; SSE-NEXT:    cmpleps %xmm2, %xmm0
+; SSE-NEXT:    cmpleps %xmm3, %xmm4
+; SSE-NEXT:    packssdw %xmm4, %xmm0
+; SSE-NEXT:    packsswb %xmm0, %xmm5
+; SSE-NEXT:    pmovmskb %xmm5, %eax
+; SSE-NEXT:    # kill: def $ax killed $ax killed $eax
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_fcmp_v16f32_v4f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vxorps %xmm4, %xmm4, %xmm4
+; AVX1OR2-NEXT:    vcmpleps %xmm0, %xmm4, %xmm0
+; AVX1OR2-NEXT:    vcmpleps %xmm1, %xmm4, %xmm1
+; AVX1OR2-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vcmpleps %xmm2, %xmm4, %xmm1
+; AVX1OR2-NEXT:    vcmpleps %xmm3, %xmm4, %xmm2
+; AVX1OR2-NEXT:    vpackssdw %xmm2, %xmm1, %xmm1
+; AVX1OR2-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vpmovmskb %xmm0, %eax
+; AVX1OR2-NEXT:    # kill: def $ax killed $ax killed $eax
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_fcmp_v16f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vxorps %xmm1, %xmm1, %xmm1
+; AVX512-NEXT:    vcmpleps %zmm0, %zmm1, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = fcmp oge <4 x float> %a0, zeroinitializer
+  %v1 = fcmp oge <4 x float> %a1, zeroinitializer
+  %v2 = fcmp oge <4 x float> %a2, zeroinitializer
+  %v3 = fcmp oge <4 x float> %a3, zeroinitializer
+  %v01 = shufflevector <4 x i1> %v0, <4 x i1> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %v23 = shufflevector <4 x i1> %v2, <4 x i1> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %v = shufflevector <8 x i1> %v01, <8 x i1> %v23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  %r = bitcast <16 x i1> %v to i16
+  ret i16 %r
+}
+
+define i8 @concat_fcmp_v8f64_v4f64(<4 x double> %a0, <4 x double> %a1) {
+; SSE-LABEL: concat_fcmp_v8f64_v4f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    xorpd %xmm4, %xmm4
+; SSE-NEXT:    movapd %xmm1, %xmm5
+; SSE-NEXT:    cmpneqpd %xmm4, %xmm5
+; SSE-NEXT:    cmpordpd %xmm4, %xmm1
+; SSE-NEXT:    andpd %xmm5, %xmm1
+; SSE-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
+; SSE-NEXT:    movapd %xmm0, %xmm5
+; SSE-NEXT:    cmpneqpd %xmm4, %xmm5
+; SSE-NEXT:    cmpordpd %xmm4, %xmm0
+; SSE-NEXT:    andpd %xmm5, %xmm0
+; SSE-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
+; SSE-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
+; SSE-NEXT:    movapd %xmm3, %xmm1
+; SSE-NEXT:    cmpneqpd %xmm4, %xmm1
+; SSE-NEXT:    cmpordpd %xmm4, %xmm3
+; SSE-NEXT:    andpd %xmm1, %xmm3
+; SSE-NEXT:    movapd %xmm2, %xmm1
+; SSE-NEXT:    cmpneqpd %xmm4, %xmm1
+; SSE-NEXT:    cmpordpd %xmm4, %xmm2
+; SSE-NEXT:    andpd %xmm1, %xmm2
+; SSE-NEXT:    packssdw %xmm3, %xmm2
+; SSE-NEXT:    packssdw %xmm2, %xmm2
+; SSE-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
+; SSE-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,1,3,4,5,6,7]
+; SSE-NEXT:    punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; SSE-NEXT:    packsswb %xmm0, %xmm0
+; SSE-NEXT:    pmovmskb %xmm0, %eax
+; SSE-NEXT:    # kill: def $al killed $al killed $eax
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_fcmp_v8f64_v4f64:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vxorpd %xmm2, %xmm2, %xmm2
+; AVX1-NEXT:    vcmpneq_oqpd %ymm2, %ymm0, %ymm0
+; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm3
+; AVX1-NEXT:    vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm3[0,2]
+; AVX1-NEXT:    vcmpneq_oqpd %ymm2, %ymm1, %ymm1
+; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm2
+; AVX1-NEXT:    vshufps {{.*#+}} xmm1 = xmm1[0,2],xmm2[0,2]
+; AVX1-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
+; AVX1-NEXT:    vpsrlw $8, %xmm0, %xmm0
+; AVX1-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
+; AVX1-NEXT:    vpmovmskb %xmm0, %eax
+; AVX1-NEXT:    # kill: def $al killed $al killed $eax
+; AVX1-NEXT:    vzeroupper
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_fcmp_v8f64_v4f64:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vxorpd %xmm2, %xmm2, %xmm2
+; AVX2-NEXT:    vcmpneq_oqpd %ymm2, %ymm0, %ymm0
+; AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm3
+; AVX2-NEXT:    vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm3[0,2]
+; AVX2-NEXT:    vcmpneq_oqpd %ymm2, %ymm1, %ymm1
+; AVX2-NEXT:    vextractf128 $1, %ymm1, %xmm2
+; AVX2-NEXT:    vshufps {{.*#+}} xmm1 = xmm1[0,2],xmm2[0,2]
+; AVX2-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    vpshufb {{.*#+}} xmm0 = xmm0[1,3,5,7,9,11,13,15,u,u,u,u,u,u,u,u]
+; AVX2-NEXT:    vpmovmskb %xmm0, %eax
+; AVX2-NEXT:    # kill: def $al killed $al killed $eax
+; AVX2-NEXT:    vzeroupper
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_fcmp_v8f64_v4f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vxorpd %xmm1, %xmm1, %xmm1
+; AVX512-NEXT:    vcmpneq_oqpd %zmm1, %zmm0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    # kill: def $al killed $al killed $eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = fcmp one <4 x double> %a0, zeroinitializer
+  %v1 = fcmp one <4 x double> %a1, zeroinitializer
+  %v = shufflevector <4 x i1> %v0, <4 x i1> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r = bitcast <8 x i1> %v to i8
+  ret i8 %r
+}
+
+define i16 @concat_fcmp_v16f32_v8f32(<8 x float> %a0, <8 x float> %a1) {
+; SSE-LABEL: concat_fcmp_v16f32_v8f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    xorps %xmm4, %xmm4
+; SSE-NEXT:    cmpleps %xmm4, %xmm1
+; SSE-NEXT:    cmpleps %xmm4, %xmm0
+; SSE-NEXT:    packssdw %xmm1, %xmm0
+; SSE-NEXT:    cmpleps %xmm4, %xmm3
+; SSE-NEXT:    cmpleps %xmm4, %xmm2
+; SSE-NEXT:    packssdw %xmm3, %xmm2
+; SSE-NEXT:    packsswb %xmm2, %xmm0
+; SSE-NEXT:    pmovmskb %xmm0, %eax
+; SSE-NEXT:    # kill: def $ax killed $ax killed $eax
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_fcmp_v16f32_v8f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vxorps %xmm2, %xmm2, %xmm2
+; AVX1OR2-NEXT:    vcmpleps %ymm2, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vextractf128 $1, %ymm0, %xmm3
+; AVX1OR2-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vcmpleps %ymm2, %ymm1, %ymm1
+; AVX1OR2-NEXT:    vextractf128 $1, %ymm1, %xmm2
+; AVX1OR2-NEXT:    vpackssdw %xmm2, %xmm1, %xmm1
+; AVX1OR2-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vpmovmskb %xmm0, %eax
+; AVX1OR2-NEXT:    # kill: def $ax killed $ax killed $eax
+; AVX1OR2-NEXT:    vzeroupper
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_fcmp_v16f32_v8f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vxorps %xmm1, %xmm1, %xmm1
+; AVX512-NEXT:    vcmpleps %zmm1, %zmm0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = fcmp ole <8 x float> %a0, zeroinitializer
+  %v1 = fcmp ole <8 x float> %a1, zeroinitializer
+  %v = shufflevector <8 x i1> %v0, <8 x i1> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  %r = bitcast <16 x i1> %v to i16
+  ret i16 %r
+}
diff --git a/llvm/test/CodeGen/X86/combine-ffloor.ll b/llvm/test/CodeGen/X86/combine-ffloor.ll
new file mode 100644
index 0000000000000..5cde95ec7aa4f
--- /dev/null
+++ b/llvm/test/CodeGen/X86/combine-ffloor.ll
@@ -0,0 +1,193 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX1
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX,AVX512
+
+define <4 x double> @concat_floor_v4f64_v2f64(<2 x double> %a0, <2 x double> %a1) {
+; SSE-LABEL: concat_floor_v4f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $9, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $9, %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_floor_v4f64_v2f64:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundpd $9, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <2 x double> @llvm.floor.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.floor.v2f64(<2 x double> %a1)
+  %res  = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  ret <4 x double> %res
+}
+
+define <8 x float> @concat_floor_v8f32_v4f32(<4 x float> %a0, <4 x float> %a1) {
+; SSE-LABEL: concat_floor_v8f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $9, %xmm0, %xmm0
+; SSE-NEXT:    roundps $9, %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_floor_v8f32_v4f32:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundps $9, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <4 x float> @llvm.floor.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.floor.v4f32(<4 x float> %a1)
+  %res  = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x float> %res
+}
+
+define <8 x double> @concat_floor_v8f64_v2f64(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2, <2 x double> %a3) {
+; SSE-LABEL: concat_floor_v8f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $9, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $9, %xmm1, %xmm1
+; SSE-NEXT:    roundpd $9, %xmm2, %xmm2
+; SSE-NEXT:    roundpd $9, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_floor_v8f64_v2f64:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundpd $9, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundpd $9, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_floor_v8f64_v2f64:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundpd $9, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundpd $9, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_floor_v8f64_v2f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $9, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <2 x double> @llvm.floor.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.floor.v2f64(<2 x double> %a1)
+  %v2 = call <2 x double> @llvm.floor.v2f64(<2 x double> %a2)
+  %v3 = call <2 x double> @llvm.floor.v2f64(<2 x double> %a3)
+  %r01 = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %r23 = shufflevector <2 x double> %v2, <2 x double> %v3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %res  = shufflevector <4 x double> %r01, <4 x double> %r23, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_floor_v16f32_v4f32(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2, <4 x float> %a3) {
+; SSE-LABEL: concat_floor_v16f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $9, %xmm0, %xmm0
+; SSE-NEXT:    roundps $9, %xmm1, %xmm1
+; SSE-NEXT:    roundps $9, %xmm2, %xmm2
+; SSE-NEXT:    roundps $9, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_floor_v16f32_v4f32:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundps $9, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundps $9, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_floor_v16f32_v4f32:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundps $9, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundps $9, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_floor_v16f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $9, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x float> @llvm.floor.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.floor.v4f32(<4 x float> %a1)
+  %v2 = call <4 x float> @llvm.floor.v4f32(<4 x float> %a2)
+  %v3 = call <4 x float> @llvm.floor.v4f32(<4 x float> %a3)
+  %r01 = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r23 = shufflevector <4 x float> %v2, <4 x float> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %res  = shufflevector <8 x float> %r01, <8 x float> %r23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
+
+define <8 x double> @concat_floor_v8f64_v4f64(<4 x double> %a0, <4 x double> %a1) {
+; SSE-LABEL: concat_floor_v8f64_v4f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $9, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $9, %xmm1, %xmm1
+; SSE-NEXT:    roundpd $9, %xmm2, %xmm2
+; SSE-NEXT:    roundpd $9, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_floor_v8f64_v4f64:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundpd $9, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundpd $9, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_floor_v8f64_v4f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $9, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x double> @llvm.floor.v4f64(<4 x double> %a0)
+  %v1 = call <4 x double> @llvm.floor.v4f64(<4 x double> %a1)
+  %res  = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_floor_v16f32_v8f32(<8 x float> %a0, <8 x float> %a1) {
+; SSE-LABEL: concat_floor_v16f32_v8f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $9, %xmm0, %xmm0
+; SSE-NEXT:    roundps $9, %xmm1, %xmm1
+; SSE-NEXT:    roundps $9, %xmm2, %xmm2
+; SSE-NEXT:    roundps $9, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_floor_v16f32_v8f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundps $9, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundps $9, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_floor_v16f32_v8f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $9, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <8 x float> @llvm.floor.v8f32(<8 x float> %a0)
+  %v1 = call <8 x float> @llvm.floor.v8f32(<8 x float> %a1)
+  %res  = shufflevector <8 x float> %v0, <8 x float> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
diff --git a/llvm/test/CodeGen/X86/combine-fnearbyint.ll b/llvm/test/CodeGen/X86/combine-fnearbyint.ll
new file mode 100644
index 0000000000000..fde136af7c4c2
--- /dev/null
+++ b/llvm/test/CodeGen/X86/combine-fnearbyint.ll
@@ -0,0 +1,193 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX1
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX,AVX512
+
+define <4 x double> @concat_nearbyint_v4f64_v2f64(<2 x double> %a0, <2 x double> %a1) {
+; SSE-LABEL: concat_nearbyint_v4f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $12, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $12, %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_nearbyint_v4f64_v2f64:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundpd $12, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <2 x double> @llvm.nearbyint.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.nearbyint.v2f64(<2 x double> %a1)
+  %res  = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  ret <4 x double> %res
+}
+
+define <8 x float> @concat_nearbyint_v8f32_v4f32(<4 x float> %a0, <4 x float> %a1) {
+; SSE-LABEL: concat_nearbyint_v8f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $12, %xmm0, %xmm0
+; SSE-NEXT:    roundps $12, %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_nearbyint_v8f32_v4f32:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundps $12, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %a1)
+  %res  = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x float> %res
+}
+
+define <8 x double> @concat_nearbyint_v8f64_v2f64(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2, <2 x double> %a3) {
+; SSE-LABEL: concat_nearbyint_v8f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $12, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $12, %xmm1, %xmm1
+; SSE-NEXT:    roundpd $12, %xmm2, %xmm2
+; SSE-NEXT:    roundpd $12, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_nearbyint_v8f64_v2f64:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundpd $12, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundpd $12, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_nearbyint_v8f64_v2f64:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundpd $12, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundpd $12, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_nearbyint_v8f64_v2f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $12, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <2 x double> @llvm.nearbyint.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.nearbyint.v2f64(<2 x double> %a1)
+  %v2 = call <2 x double> @llvm.nearbyint.v2f64(<2 x double> %a2)
+  %v3 = call <2 x double> @llvm.nearbyint.v2f64(<2 x double> %a3)
+  %r01 = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %r23 = shufflevector <2 x double> %v2, <2 x double> %v3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %res  = shufflevector <4 x double> %r01, <4 x double> %r23, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_nearbyint_v16f32_v4f32(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2, <4 x float> %a3) {
+; SSE-LABEL: concat_nearbyint_v16f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $12, %xmm0, %xmm0
+; SSE-NEXT:    roundps $12, %xmm1, %xmm1
+; SSE-NEXT:    roundps $12, %xmm2, %xmm2
+; SSE-NEXT:    roundps $12, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_nearbyint_v16f32_v4f32:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundps $12, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundps $12, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_nearbyint_v16f32_v4f32:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundps $12, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundps $12, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_nearbyint_v16f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $12, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %a1)
+  %v2 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %a2)
+  %v3 = call <4 x float> @llvm.nearbyint.v4f32(<4 x float> %a3)
+  %r01 = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r23 = shufflevector <4 x float> %v2, <4 x float> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %res  = shufflevector <8 x float> %r01, <8 x float> %r23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
+
+define <8 x double> @concat_nearbyint_v8f64_v4f64(<4 x double> %a0, <4 x double> %a1) {
+; SSE-LABEL: concat_nearbyint_v8f64_v4f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $12, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $12, %xmm1, %xmm1
+; SSE-NEXT:    roundpd $12, %xmm2, %xmm2
+; SSE-NEXT:    roundpd $12, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_nearbyint_v8f64_v4f64:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundpd $12, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundpd $12, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_nearbyint_v8f64_v4f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $12, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x double> @llvm.nearbyint.v4f64(<4 x double> %a0)
+  %v1 = call <4 x double> @llvm.nearbyint.v4f64(<4 x double> %a1)
+  %res  = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_nearbyint_v16f32_v8f32(<8 x float> %a0, <8 x float> %a1) {
+; SSE-LABEL: concat_nearbyint_v16f32_v8f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $12, %xmm0, %xmm0
+; SSE-NEXT:    roundps $12, %xmm1, %xmm1
+; SSE-NEXT:    roundps $12, %xmm2, %xmm2
+; SSE-NEXT:    roundps $12, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_nearbyint_v16f32_v8f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundps $12, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundps $12, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_nearbyint_v16f32_v8f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $12, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <8 x float> @llvm.nearbyint.v8f32(<8 x float> %a0)
+  %v1 = call <8 x float> @llvm.nearbyint.v8f32(<8 x float> %a1)
+  %res  = shufflevector <8 x float> %v0, <8 x float> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
diff --git a/llvm/test/CodeGen/X86/combine-frint.ll b/llvm/test/CodeGen/X86/combine-frint.ll
new file mode 100644
index 0000000000000..1c52529e8386c
--- /dev/null
+++ b/llvm/test/CodeGen/X86/combine-frint.ll
@@ -0,0 +1,193 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX1
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX,AVX512
+
+define <4 x double> @concat_rint_v4f64_v2f64(<2 x double> %a0, <2 x double> %a1) {
+; SSE-LABEL: concat_rint_v4f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $4, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $4, %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_rint_v4f64_v2f64:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundpd $4, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <2 x double> @llvm.rint.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.rint.v2f64(<2 x double> %a1)
+  %res  = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  ret <4 x double> %res
+}
+
+define <8 x float> @concat_rint_v8f32_v4f32(<4 x float> %a0, <4 x float> %a1) {
+; SSE-LABEL: concat_rint_v8f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $4, %xmm0, %xmm0
+; SSE-NEXT:    roundps $4, %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_rint_v8f32_v4f32:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundps $4, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <4 x float> @llvm.rint.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.rint.v4f32(<4 x float> %a1)
+  %res  = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x float> %res
+}
+
+define <8 x double> @concat_rint_v8f64_v2f64(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2, <2 x double> %a3) {
+; SSE-LABEL: concat_rint_v8f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $4, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $4, %xmm1, %xmm1
+; SSE-NEXT:    roundpd $4, %xmm2, %xmm2
+; SSE-NEXT:    roundpd $4, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_rint_v8f64_v2f64:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundpd $4, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundpd $4, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_rint_v8f64_v2f64:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundpd $4, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundpd $4, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_rint_v8f64_v2f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $4, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <2 x double> @llvm.rint.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.rint.v2f64(<2 x double> %a1)
+  %v2 = call <2 x double> @llvm.rint.v2f64(<2 x double> %a2)
+  %v3 = call <2 x double> @llvm.rint.v2f64(<2 x double> %a3)
+  %r01 = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %r23 = shufflevector <2 x double> %v2, <2 x double> %v3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %res  = shufflevector <4 x double> %r01, <4 x double> %r23, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_rint_v16f32_v4f32(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2, <4 x float> %a3) {
+; SSE-LABEL: concat_rint_v16f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $4, %xmm0, %xmm0
+; SSE-NEXT:    roundps $4, %xmm1, %xmm1
+; SSE-NEXT:    roundps $4, %xmm2, %xmm2
+; SSE-NEXT:    roundps $4, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_rint_v16f32_v4f32:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundps $4, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundps $4, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_rint_v16f32_v4f32:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundps $4, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundps $4, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_rint_v16f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $4, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x float> @llvm.rint.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.rint.v4f32(<4 x float> %a1)
+  %v2 = call <4 x float> @llvm.rint.v4f32(<4 x float> %a2)
+  %v3 = call <4 x float> @llvm.rint.v4f32(<4 x float> %a3)
+  %r01 = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r23 = shufflevector <4 x float> %v2, <4 x float> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %res  = shufflevector <8 x float> %r01, <8 x float> %r23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
+
+define <8 x double> @concat_rint_v8f64_v4f64(<4 x double> %a0, <4 x double> %a1) {
+; SSE-LABEL: concat_rint_v8f64_v4f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $4, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $4, %xmm1, %xmm1
+; SSE-NEXT:    roundpd $4, %xmm2, %xmm2
+; SSE-NEXT:    roundpd $4, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_rint_v8f64_v4f64:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundpd $4, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundpd $4, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_rint_v8f64_v4f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $4, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x double> @llvm.rint.v4f64(<4 x double> %a0)
+  %v1 = call <4 x double> @llvm.rint.v4f64(<4 x double> %a1)
+  %res  = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_rint_v16f32_v8f32(<8 x float> %a0, <8 x float> %a1) {
+; SSE-LABEL: concat_rint_v16f32_v8f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $4, %xmm0, %xmm0
+; SSE-NEXT:    roundps $4, %xmm1, %xmm1
+; SSE-NEXT:    roundps $4, %xmm2, %xmm2
+; SSE-NEXT:    roundps $4, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_rint_v16f32_v8f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundps $4, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundps $4, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_rint_v16f32_v8f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $4, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <8 x float> @llvm.rint.v8f32(<8 x float> %a0)
+  %v1 = call <8 x float> @llvm.rint.v8f32(<8 x float> %a1)
+  %res  = shufflevector <8 x float> %v0, <8 x float> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
diff --git a/llvm/test/CodeGen/X86/combine-fround.ll b/llvm/test/CodeGen/X86/combine-fround.ll
new file mode 100644
index 0000000000000..42dbaf234dbc7
--- /dev/null
+++ b/llvm/test/CodeGen/X86/combine-fround.ll
@@ -0,0 +1,419 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge | FileCheck %s --check-prefixes=AVX,AVX1
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=AVX,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX,AVX512
+
+define <4 x double> @concat_round_v4f64_v2f64(<2 x double> %a0, <2 x double> %a1) {
+; SSE-LABEL: concat_round_v4f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    movapd {{.*#+}} xmm2 = [-0.0E+0,-0.0E+0]
+; SSE-NEXT:    movapd %xmm0, %xmm3
+; SSE-NEXT:    andpd %xmm2, %xmm3
+; SSE-NEXT:    movapd {{.*#+}} xmm4 = [4.9999999999999994E-1,4.9999999999999994E-1]
+; SSE-NEXT:    orpd %xmm4, %xmm3
+; SSE-NEXT:    addpd %xmm0, %xmm3
+; SSE-NEXT:    roundpd $11, %xmm3, %xmm0
+; SSE-NEXT:    andpd %xmm1, %xmm2
+; SSE-NEXT:    orpd %xmm4, %xmm2
+; SSE-NEXT:    addpd %xmm1, %xmm2
+; SSE-NEXT:    roundpd $11, %xmm2, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_round_v4f64_v2f64:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vandpd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm1
+; AVX1-NEXT:    vorpd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm1, %ymm1
+; AVX1-NEXT:    vaddpd %ymm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundpd $11, %ymm0, %ymm0
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_round_v4f64_v2f64:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vbroadcastsd {{.*#+}} ymm1 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; AVX2-NEXT:    vandpd %ymm1, %ymm0, %ymm1
+; AVX2-NEXT:    vbroadcastsd {{.*#+}} ymm2 = [4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1]
+; AVX2-NEXT:    vorpd %ymm2, %ymm1, %ymm1
+; AVX2-NEXT:    vaddpd %ymm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundpd $11, %ymm0, %ymm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_round_v4f64_v2f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vpbroadcastq {{.*#+}} ymm2 = [4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1]
+; AVX512-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vpternlogq {{.*#+}} ymm2 = ymm2 | (ymm0 & m64bcst)
+; AVX512-NEXT:    vaddpd %ymm2, %ymm0, %ymm0
+; AVX512-NEXT:    vroundpd $11, %ymm0, %ymm0
+; AVX512-NEXT:    retq
+  %v0 = call <2 x double> @llvm.round.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.round.v2f64(<2 x double> %a1)
+  %res  = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  ret <4 x double> %res
+}
+
+define <8 x float> @concat_round_v8f32_v4f32(<4 x float> %a0, <4 x float> %a1) {
+; SSE-LABEL: concat_round_v8f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    movaps {{.*#+}} xmm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; SSE-NEXT:    movaps %xmm0, %xmm3
+; SSE-NEXT:    andps %xmm2, %xmm3
+; SSE-NEXT:    movaps {{.*#+}} xmm4 = [4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1]
+; SSE-NEXT:    orps %xmm4, %xmm3
+; SSE-NEXT:    addps %xmm0, %xmm3
+; SSE-NEXT:    roundps $11, %xmm3, %xmm0
+; SSE-NEXT:    andps %xmm1, %xmm2
+; SSE-NEXT:    orps %xmm4, %xmm2
+; SSE-NEXT:    addps %xmm1, %xmm2
+; SSE-NEXT:    roundps $11, %xmm2, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_round_v8f32_v4f32:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm1
+; AVX1-NEXT:    vorps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm1, %ymm1
+; AVX1-NEXT:    vaddps %ymm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundps $11, %ymm0, %ymm0
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_round_v8f32_v4f32:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vbroadcastss {{.*#+}} ymm1 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; AVX2-NEXT:    vandps %ymm1, %ymm0, %ymm1
+; AVX2-NEXT:    vbroadcastss {{.*#+}} ymm2 = [4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1]
+; AVX2-NEXT:    vorps %ymm2, %ymm1, %ymm1
+; AVX2-NEXT:    vaddps %ymm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundps $11, %ymm0, %ymm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_round_v8f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vpbroadcastd {{.*#+}} ymm2 = [4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1]
+; AVX512-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vpternlogd {{.*#+}} ymm2 = ymm2 | (ymm0 & m32bcst)
+; AVX512-NEXT:    vaddps %ymm2, %ymm0, %ymm0
+; AVX512-NEXT:    vroundps $11, %ymm0, %ymm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x float> @llvm.round.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.round.v4f32(<4 x float> %a1)
+  %res  = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x float> %res
+}
+
+define <8 x double> @concat_round_v8f64_v2f64(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2, <2 x double> %a3) {
+; SSE-LABEL: concat_round_v8f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    movapd {{.*#+}} xmm4 = [-0.0E+0,-0.0E+0]
+; SSE-NEXT:    movapd %xmm0, %xmm5
+; SSE-NEXT:    andpd %xmm4, %xmm5
+; SSE-NEXT:    movapd {{.*#+}} xmm6 = [4.9999999999999994E-1,4.9999999999999994E-1]
+; SSE-NEXT:    orpd %xmm6, %xmm5
+; SSE-NEXT:    addpd %xmm0, %xmm5
+; SSE-NEXT:    roundpd $11, %xmm5, %xmm0
+; SSE-NEXT:    movapd %xmm1, %xmm5
+; SSE-NEXT:    andpd %xmm4, %xmm5
+; SSE-NEXT:    orpd %xmm6, %xmm5
+; SSE-NEXT:    addpd %xmm1, %xmm5
+; SSE-NEXT:    roundpd $11, %xmm5, %xmm1
+; SSE-NEXT:    movapd %xmm2, %xmm5
+; SSE-NEXT:    andpd %xmm4, %xmm5
+; SSE-NEXT:    orpd %xmm6, %xmm5
+; SSE-NEXT:    addpd %xmm2, %xmm5
+; SSE-NEXT:    roundpd $11, %xmm5, %xmm2
+; SSE-NEXT:    andpd %xmm3, %xmm4
+; SSE-NEXT:    orpd %xmm6, %xmm4
+; SSE-NEXT:    addpd %xmm3, %xmm4
+; SSE-NEXT:    roundpd $11, %xmm4, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_round_v8f64_v2f64:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vmovapd {{.*#+}} ymm1 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; AVX1-NEXT:    vandpd %ymm1, %ymm0, %ymm4
+; AVX1-NEXT:    vmovapd {{.*#+}} ymm5 = [4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1]
+; AVX1-NEXT:    vorpd %ymm5, %ymm4, %ymm4
+; AVX1-NEXT:    vaddpd %ymm4, %ymm0, %ymm0
+; AVX1-NEXT:    vroundpd $11, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX1-NEXT:    vandpd %ymm1, %ymm2, %ymm1
+; AVX1-NEXT:    vorpd %ymm5, %ymm1, %ymm1
+; AVX1-NEXT:    vaddpd %ymm1, %ymm2, %ymm1
+; AVX1-NEXT:    vroundpd $11, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_round_v8f64_v2f64:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vbroadcastsd {{.*#+}} ymm1 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; AVX2-NEXT:    vandpd %ymm1, %ymm0, %ymm4
+; AVX2-NEXT:    vbroadcastsd {{.*#+}} ymm5 = [4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1]
+; AVX2-NEXT:    vorpd %ymm5, %ymm4, %ymm4
+; AVX2-NEXT:    vaddpd %ymm4, %ymm0, %ymm0
+; AVX2-NEXT:    vroundpd $11, %ymm0, %ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX2-NEXT:    vandpd %ymm1, %ymm2, %ymm1
+; AVX2-NEXT:    vorpd %ymm5, %ymm1, %ymm1
+; AVX2-NEXT:    vaddpd %ymm1, %ymm2, %ymm1
+; AVX2-NEXT:    vroundpd $11, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_round_v8f64_v2f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinserti128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vpbroadcastq {{.*#+}} zmm1 = [4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1]
+; AVX512-NEXT:    vinserti64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vpternlogq {{.*#+}} zmm1 = zmm1 | (zmm0 & m64bcst)
+; AVX512-NEXT:    vaddpd %zmm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $11, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <2 x double> @llvm.round.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.round.v2f64(<2 x double> %a1)
+  %v2 = call <2 x double> @llvm.round.v2f64(<2 x double> %a2)
+  %v3 = call <2 x double> @llvm.round.v2f64(<2 x double> %a3)
+  %r01 = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %r23 = shufflevector <2 x double> %v2, <2 x double> %v3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %res  = shufflevector <4 x double> %r01, <4 x double> %r23, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_round_v16f32_v4f32(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2, <4 x float> %a3) {
+; SSE-LABEL: concat_round_v16f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    movaps {{.*#+}} xmm4 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; SSE-NEXT:    movaps %xmm0, %xmm5
+; SSE-NEXT:    andps %xmm4, %xmm5
+; SSE-NEXT:    movaps {{.*#+}} xmm6 = [4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1]
+; SSE-NEXT:    orps %xmm6, %xmm5
+; SSE-NEXT:    addps %xmm0, %xmm5
+; SSE-NEXT:    roundps $11, %xmm5, %xmm0
+; SSE-NEXT:    movaps %xmm1, %xmm5
+; SSE-NEXT:    andps %xmm4, %xmm5
+; SSE-NEXT:    orps %xmm6, %xmm5
+; SSE-NEXT:    addps %xmm1, %xmm5
+; SSE-NEXT:    roundps $11, %xmm5, %xmm1
+; SSE-NEXT:    movaps %xmm2, %xmm5
+; SSE-NEXT:    andps %xmm4, %xmm5
+; SSE-NEXT:    orps %xmm6, %xmm5
+; SSE-NEXT:    addps %xmm2, %xmm5
+; SSE-NEXT:    roundps $11, %xmm5, %xmm2
+; SSE-NEXT:    andps %xmm3, %xmm4
+; SSE-NEXT:    orps %xmm6, %xmm4
+; SSE-NEXT:    addps %xmm3, %xmm4
+; SSE-NEXT:    roundps $11, %xmm4, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_round_v16f32_v4f32:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vmovaps {{.*#+}} ymm1 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; AVX1-NEXT:    vandps %ymm1, %ymm0, %ymm4
+; AVX1-NEXT:    vmovaps {{.*#+}} ymm5 = [4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1]
+; AVX1-NEXT:    vorps %ymm5, %ymm4, %ymm4
+; AVX1-NEXT:    vaddps %ymm4, %ymm0, %ymm0
+; AVX1-NEXT:    vroundps $11, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX1-NEXT:    vandps %ymm1, %ymm2, %ymm1
+; AVX1-NEXT:    vorps %ymm5, %ymm1, %ymm1
+; AVX1-NEXT:    vaddps %ymm1, %ymm2, %ymm1
+; AVX1-NEXT:    vroundps $11, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_round_v16f32_v4f32:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vbroadcastss {{.*#+}} ymm1 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; AVX2-NEXT:    vandps %ymm1, %ymm0, %ymm4
+; AVX2-NEXT:    vbroadcastss {{.*#+}} ymm5 = [4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1]
+; AVX2-NEXT:    vorps %ymm5, %ymm4, %ymm4
+; AVX2-NEXT:    vaddps %ymm4, %ymm0, %ymm0
+; AVX2-NEXT:    vroundps $11, %ymm0, %ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX2-NEXT:    vandps %ymm1, %ymm2, %ymm1
+; AVX2-NEXT:    vorps %ymm5, %ymm1, %ymm1
+; AVX2-NEXT:    vaddps %ymm1, %ymm2, %ymm1
+; AVX2-NEXT:    vroundps $11, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_round_v16f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinserti128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vpbroadcastd {{.*#+}} zmm1 = [4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1]
+; AVX512-NEXT:    vinserti64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vpternlogd {{.*#+}} zmm1 = zmm1 | (zmm0 & m32bcst)
+; AVX512-NEXT:    vaddps %zmm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $11, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x float> @llvm.round.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.round.v4f32(<4 x float> %a1)
+  %v2 = call <4 x float> @llvm.round.v4f32(<4 x float> %a2)
+  %v3 = call <4 x float> @llvm.round.v4f32(<4 x float> %a3)
+  %r01 = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r23 = shufflevector <4 x float> %v2, <4 x float> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %res  = shufflevector <8 x float> %r01, <8 x float> %r23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
+
+define <8 x double> @concat_round_v8f64_v4f64(<4 x double> %a0, <4 x double> %a1) {
+; SSE-LABEL: concat_round_v8f64_v4f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    movapd {{.*#+}} xmm4 = [-0.0E+0,-0.0E+0]
+; SSE-NEXT:    movapd %xmm0, %xmm5
+; SSE-NEXT:    andpd %xmm4, %xmm5
+; SSE-NEXT:    movapd {{.*#+}} xmm6 = [4.9999999999999994E-1,4.9999999999999994E-1]
+; SSE-NEXT:    orpd %xmm6, %xmm5
+; SSE-NEXT:    addpd %xmm0, %xmm5
+; SSE-NEXT:    roundpd $11, %xmm5, %xmm0
+; SSE-NEXT:    movapd %xmm1, %xmm5
+; SSE-NEXT:    andpd %xmm4, %xmm5
+; SSE-NEXT:    orpd %xmm6, %xmm5
+; SSE-NEXT:    addpd %xmm1, %xmm5
+; SSE-NEXT:    roundpd $11, %xmm5, %xmm1
+; SSE-NEXT:    movapd %xmm2, %xmm5
+; SSE-NEXT:    andpd %xmm4, %xmm5
+; SSE-NEXT:    orpd %xmm6, %xmm5
+; SSE-NEXT:    addpd %xmm2, %xmm5
+; SSE-NEXT:    roundpd $11, %xmm5, %xmm2
+; SSE-NEXT:    andpd %xmm3, %xmm4
+; SSE-NEXT:    orpd %xmm6, %xmm4
+; SSE-NEXT:    addpd %xmm3, %xmm4
+; SSE-NEXT:    roundpd $11, %xmm4, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_round_v8f64_v4f64:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vmovapd {{.*#+}} ymm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; AVX1-NEXT:    vandpd %ymm2, %ymm0, %ymm3
+; AVX1-NEXT:    vmovapd {{.*#+}} ymm4 = [4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1]
+; AVX1-NEXT:    vorpd %ymm4, %ymm3, %ymm3
+; AVX1-NEXT:    vaddpd %ymm3, %ymm0, %ymm0
+; AVX1-NEXT:    vroundpd $11, %ymm0, %ymm0
+; AVX1-NEXT:    vandpd %ymm2, %ymm1, %ymm2
+; AVX1-NEXT:    vorpd %ymm4, %ymm2, %ymm2
+; AVX1-NEXT:    vaddpd %ymm2, %ymm1, %ymm1
+; AVX1-NEXT:    vroundpd $11, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_round_v8f64_v4f64:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vbroadcastsd {{.*#+}} ymm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; AVX2-NEXT:    vandpd %ymm2, %ymm0, %ymm3
+; AVX2-NEXT:    vbroadcastsd {{.*#+}} ymm4 = [4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1]
+; AVX2-NEXT:    vorpd %ymm4, %ymm3, %ymm3
+; AVX2-NEXT:    vaddpd %ymm3, %ymm0, %ymm0
+; AVX2-NEXT:    vroundpd $11, %ymm0, %ymm0
+; AVX2-NEXT:    vandpd %ymm2, %ymm1, %ymm2
+; AVX2-NEXT:    vorpd %ymm4, %ymm2, %ymm2
+; AVX2-NEXT:    vaddpd %ymm2, %ymm1, %ymm1
+; AVX2-NEXT:    vroundpd $11, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_round_v8f64_v4f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vpbroadcastq {{.*#+}} zmm2 = [4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1,4.9999999999999994E-1]
+; AVX512-NEXT:    vinserti64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vpternlogq {{.*#+}} zmm2 = zmm2 | (zmm0 & m64bcst)
+; AVX512-NEXT:    vaddpd %zmm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $11, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x double> @llvm.round.v4f64(<4 x double> %a0)
+  %v1 = call <4 x double> @llvm.round.v4f64(<4 x double> %a1)
+  %res  = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_round_v16f32_v8f32(<8 x float> %a0, <8 x float> %a1) {
+; SSE-LABEL: concat_round_v16f32_v8f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    movaps {{.*#+}} xmm4 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; SSE-NEXT:    movaps %xmm0, %xmm5
+; SSE-NEXT:    andps %xmm4, %xmm5
+; SSE-NEXT:    movaps {{.*#+}} xmm6 = [4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1]
+; SSE-NEXT:    orps %xmm6, %xmm5
+; SSE-NEXT:    addps %xmm0, %xmm5
+; SSE-NEXT:    roundps $11, %xmm5, %xmm0
+; SSE-NEXT:    movaps %xmm1, %xmm5
+; SSE-NEXT:    andps %xmm4, %xmm5
+; SSE-NEXT:    orps %xmm6, %xmm5
+; SSE-NEXT:    addps %xmm1, %xmm5
+; SSE-NEXT:    roundps $11, %xmm5, %xmm1
+; SSE-NEXT:    movaps %xmm2, %xmm5
+; SSE-NEXT:    andps %xmm4, %xmm5
+; SSE-NEXT:    orps %xmm6, %xmm5
+; SSE-NEXT:    addps %xmm2, %xmm5
+; SSE-NEXT:    roundps $11, %xmm5, %xmm2
+; SSE-NEXT:    andps %xmm3, %xmm4
+; SSE-NEXT:    orps %xmm6, %xmm4
+; SSE-NEXT:    addps %xmm3, %xmm4
+; SSE-NEXT:    roundps $11, %xmm4, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_round_v16f32_v8f32:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vmovaps {{.*#+}} ymm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; AVX1-NEXT:    vandps %ymm2, %ymm0, %ymm3
+; AVX1-NEXT:    vmovaps {{.*#+}} ymm4 = [4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1]
+; AVX1-NEXT:    vorps %ymm4, %ymm3, %ymm3
+; AVX1-NEXT:    vaddps %ymm3, %ymm0, %ymm0
+; AVX1-NEXT:    vroundps $11, %ymm0, %ymm0
+; AVX1-NEXT:    vandps %ymm2, %ymm1, %ymm2
+; AVX1-NEXT:    vorps %ymm4, %ymm2, %ymm2
+; AVX1-NEXT:    vaddps %ymm2, %ymm1, %ymm1
+; AVX1-NEXT:    vroundps $11, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_round_v16f32_v8f32:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vbroadcastss {{.*#+}} ymm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; AVX2-NEXT:    vandps %ymm2, %ymm0, %ymm3
+; AVX2-NEXT:    vbroadcastss {{.*#+}} ymm4 = [4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1]
+; AVX2-NEXT:    vorps %ymm4, %ymm3, %ymm3
+; AVX2-NEXT:    vaddps %ymm3, %ymm0, %ymm0
+; AVX2-NEXT:    vroundps $11, %ymm0, %ymm0
+; AVX2-NEXT:    vandps %ymm2, %ymm1, %ymm2
+; AVX2-NEXT:    vorps %ymm4, %ymm2, %ymm2
+; AVX2-NEXT:    vaddps %ymm2, %ymm1, %ymm1
+; AVX2-NEXT:    vroundps $11, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_round_v16f32_v8f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vpbroadcastd {{.*#+}} zmm2 = [4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1,4.9999997E-1]
+; AVX512-NEXT:    vinserti64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vpternlogd {{.*#+}} zmm2 = zmm2 | (zmm0 & m32bcst)
+; AVX512-NEXT:    vaddps %zmm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $11, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <8 x float> @llvm.round.v8f32(<8 x float> %a0)
+  %v1 = call <8 x float> @llvm.round.v8f32(<8 x float> %a1)
+  %res  = shufflevector <8 x float> %v0, <8 x float> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; AVX: {{.*}}
diff --git a/llvm/test/CodeGen/X86/combine-froundeven.ll b/llvm/test/CodeGen/X86/combine-froundeven.ll
new file mode 100644
index 0000000000000..4bf1e86d887ae
--- /dev/null
+++ b/llvm/test/CodeGen/X86/combine-froundeven.ll
@@ -0,0 +1,193 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX1
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX,AVX512
+
+define <4 x double> @concat_roundeven_v4f64_v2f64(<2 x double> %a0, <2 x double> %a1) {
+; SSE-LABEL: concat_roundeven_v4f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $8, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $8, %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_roundeven_v4f64_v2f64:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundpd $8, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <2 x double> @llvm.roundeven.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.roundeven.v2f64(<2 x double> %a1)
+  %res  = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  ret <4 x double> %res
+}
+
+define <8 x float> @concat_roundeven_v8f32_v4f32(<4 x float> %a0, <4 x float> %a1) {
+; SSE-LABEL: concat_roundeven_v8f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $8, %xmm0, %xmm0
+; SSE-NEXT:    roundps $8, %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_roundeven_v8f32_v4f32:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundps $8, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <4 x float> @llvm.roundeven.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.roundeven.v4f32(<4 x float> %a1)
+  %res  = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x float> %res
+}
+
+define <8 x double> @concat_roundeven_v8f64_v2f64(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2, <2 x double> %a3) {
+; SSE-LABEL: concat_roundeven_v8f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $8, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $8, %xmm1, %xmm1
+; SSE-NEXT:    roundpd $8, %xmm2, %xmm2
+; SSE-NEXT:    roundpd $8, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_roundeven_v8f64_v2f64:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundpd $8, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundpd $8, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_roundeven_v8f64_v2f64:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundpd $8, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundpd $8, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_roundeven_v8f64_v2f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $8, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <2 x double> @llvm.roundeven.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.roundeven.v2f64(<2 x double> %a1)
+  %v2 = call <2 x double> @llvm.roundeven.v2f64(<2 x double> %a2)
+  %v3 = call <2 x double> @llvm.roundeven.v2f64(<2 x double> %a3)
+  %r01 = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %r23 = shufflevector <2 x double> %v2, <2 x double> %v3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %res  = shufflevector <4 x double> %r01, <4 x double> %r23, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_roundeven_v16f32_v4f32(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2, <4 x float> %a3) {
+; SSE-LABEL: concat_roundeven_v16f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $8, %xmm0, %xmm0
+; SSE-NEXT:    roundps $8, %xmm1, %xmm1
+; SSE-NEXT:    roundps $8, %xmm2, %xmm2
+; SSE-NEXT:    roundps $8, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_roundeven_v16f32_v4f32:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundps $8, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundps $8, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_roundeven_v16f32_v4f32:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundps $8, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundps $8, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_roundeven_v16f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $8, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x float> @llvm.roundeven.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.roundeven.v4f32(<4 x float> %a1)
+  %v2 = call <4 x float> @llvm.roundeven.v4f32(<4 x float> %a2)
+  %v3 = call <4 x float> @llvm.roundeven.v4f32(<4 x float> %a3)
+  %r01 = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r23 = shufflevector <4 x float> %v2, <4 x float> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %res  = shufflevector <8 x float> %r01, <8 x float> %r23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
+
+define <8 x double> @concat_roundeven_v8f64_v4f64(<4 x double> %a0, <4 x double> %a1) {
+; SSE-LABEL: concat_roundeven_v8f64_v4f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $8, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $8, %xmm1, %xmm1
+; SSE-NEXT:    roundpd $8, %xmm2, %xmm2
+; SSE-NEXT:    roundpd $8, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_roundeven_v8f64_v4f64:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundpd $8, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundpd $8, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_roundeven_v8f64_v4f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $8, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x double> @llvm.roundeven.v4f64(<4 x double> %a0)
+  %v1 = call <4 x double> @llvm.roundeven.v4f64(<4 x double> %a1)
+  %res  = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_roundeven_v16f32_v8f32(<8 x float> %a0, <8 x float> %a1) {
+; SSE-LABEL: concat_roundeven_v16f32_v8f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $8, %xmm0, %xmm0
+; SSE-NEXT:    roundps $8, %xmm1, %xmm1
+; SSE-NEXT:    roundps $8, %xmm2, %xmm2
+; SSE-NEXT:    roundps $8, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_roundeven_v16f32_v8f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundps $8, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundps $8, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_roundeven_v16f32_v8f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $8, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <8 x float> @llvm.roundeven.v8f32(<8 x float> %a0)
+  %v1 = call <8 x float> @llvm.roundeven.v8f32(<8 x float> %a1)
+  %res  = shufflevector <8 x float> %v0, <8 x float> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
diff --git a/llvm/test/CodeGen/X86/combine-fsqrt.ll b/llvm/test/CodeGen/X86/combine-fsqrt.ll
new file mode 100644
index 0000000000000..f30eac16b7b1b
--- /dev/null
+++ b/llvm/test/CodeGen/X86/combine-fsqrt.ll
@@ -0,0 +1,174 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64    | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge | FileCheck %s --check-prefixes=AVX,AVX1OR2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=AVX,AVX1OR2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX,AVX512
+
+define <4 x double> @concat_sqrt_v4f64_v2f64(<2 x double> %a0, <2 x double> %a1) {
+; SSE-LABEL: concat_sqrt_v4f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    sqrtpd %xmm0, %xmm0
+; SSE-NEXT:    sqrtpd %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_sqrt_v4f64_v2f64:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vsqrtpd %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <2 x double> @llvm.sqrt.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.sqrt.v2f64(<2 x double> %a1)
+  %res  = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  ret <4 x double> %res
+}
+
+define <8 x float> @concat_sqrt_v8f32_v4f32(<4 x float> %a0, <4 x float> %a1) {
+; SSE-LABEL: concat_sqrt_v8f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    sqrtps %xmm0, %xmm0
+; SSE-NEXT:    sqrtps %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_sqrt_v8f32_v4f32:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vsqrtps %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <4 x float> @llvm.sqrt.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.sqrt.v4f32(<4 x float> %a1)
+  %res  = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x float> %res
+}
+
+define <8 x double> @concat_sqrt_v8f64_v2f64(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2, <2 x double> %a3) {
+; SSE-LABEL: concat_sqrt_v8f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    sqrtpd %xmm0, %xmm0
+; SSE-NEXT:    sqrtpd %xmm1, %xmm1
+; SSE-NEXT:    sqrtpd %xmm2, %xmm2
+; SSE-NEXT:    sqrtpd %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_sqrt_v8f64_v2f64:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1OR2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1OR2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vsqrtpd %ymm0, %ymm0
+; AVX1OR2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1OR2-NEXT:    vsqrtpd %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_sqrt_v8f64_v2f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vsqrtpd %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <2 x double> @llvm.sqrt.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.sqrt.v2f64(<2 x double> %a1)
+  %v2 = call <2 x double> @llvm.sqrt.v2f64(<2 x double> %a2)
+  %v3 = call <2 x double> @llvm.sqrt.v2f64(<2 x double> %a3)
+  %r01 = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %r23 = shufflevector <2 x double> %v2, <2 x double> %v3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %res  = shufflevector <4 x double> %r01, <4 x double> %r23, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_sqrt_v16f32_v4f32(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2, <4 x float> %a3) {
+; SSE-LABEL: concat_sqrt_v16f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    sqrtps %xmm0, %xmm0
+; SSE-NEXT:    sqrtps %xmm1, %xmm1
+; SSE-NEXT:    sqrtps %xmm2, %xmm2
+; SSE-NEXT:    sqrtps %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_sqrt_v16f32_v4f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1OR2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1OR2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vsqrtps %ymm0, %ymm0
+; AVX1OR2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1OR2-NEXT:    vsqrtps %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_sqrt_v16f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vsqrtps %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x float> @llvm.sqrt.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.sqrt.v4f32(<4 x float> %a1)
+  %v2 = call <4 x float> @llvm.sqrt.v4f32(<4 x float> %a2)
+  %v3 = call <4 x float> @llvm.sqrt.v4f32(<4 x float> %a3)
+  %r01 = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r23 = shufflevector <4 x float> %v2, <4 x float> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %res  = shufflevector <8 x float> %r01, <8 x float> %r23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
+
+define <8 x double> @concat_sqrt_v8f64_v4f64(<4 x double> %a0, <4 x double> %a1) {
+; SSE-LABEL: concat_sqrt_v8f64_v4f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    sqrtpd %xmm0, %xmm0
+; SSE-NEXT:    sqrtpd %xmm1, %xmm1
+; SSE-NEXT:    sqrtpd %xmm2, %xmm2
+; SSE-NEXT:    sqrtpd %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_sqrt_v8f64_v4f64:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vsqrtpd %ymm0, %ymm0
+; AVX1OR2-NEXT:    vsqrtpd %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_sqrt_v8f64_v4f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vsqrtpd %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %a0)
+  %v1 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> %a1)
+  %res  = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_sqrt_v16f32_v8f32(<8 x float> %a0, <8 x float> %a1) {
+; SSE-LABEL: concat_sqrt_v16f32_v8f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    sqrtps %xmm0, %xmm0
+; SSE-NEXT:    sqrtps %xmm1, %xmm1
+; SSE-NEXT:    sqrtps %xmm2, %xmm2
+; SSE-NEXT:    sqrtps %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_sqrt_v16f32_v8f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vsqrtps %ymm0, %ymm0
+; AVX1OR2-NEXT:    vsqrtps %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_sqrt_v16f32_v8f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vsqrtps %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %a0)
+  %v1 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %a1)
+  %res  = shufflevector <8 x float> %v0, <8 x float> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
diff --git a/llvm/test/CodeGen/X86/combine-ftrunc.ll b/llvm/test/CodeGen/X86/combine-ftrunc.ll
new file mode 100644
index 0000000000000..3dde226db73df
--- /dev/null
+++ b/llvm/test/CodeGen/X86/combine-ftrunc.ll
@@ -0,0 +1,193 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX1
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX,AVX512
+
+define <4 x double> @concat_trunc_v4f64_v2f64(<2 x double> %a0, <2 x double> %a1) {
+; SSE-LABEL: concat_trunc_v4f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $11, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $11, %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_trunc_v4f64_v2f64:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundpd $11, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <2 x double> @llvm.trunc.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.trunc.v2f64(<2 x double> %a1)
+  %res  = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  ret <4 x double> %res
+}
+
+define <8 x float> @concat_trunc_v8f32_v4f32(<4 x float> %a0, <4 x float> %a1) {
+; SSE-LABEL: concat_trunc_v8f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $11, %xmm0, %xmm0
+; SSE-NEXT:    roundps $11, %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_trunc_v8f32_v4f32:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundps $11, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <4 x float> @llvm.trunc.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.trunc.v4f32(<4 x float> %a1)
+  %res  = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x float> %res
+}
+
+define <8 x double> @concat_trunc_v8f64_v2f64(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2, <2 x double> %a3) {
+; SSE-LABEL: concat_trunc_v8f64_v2f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $11, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $11, %xmm1, %xmm1
+; SSE-NEXT:    roundpd $11, %xmm2, %xmm2
+; SSE-NEXT:    roundpd $11, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_trunc_v8f64_v2f64:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundpd $11, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundpd $11, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_trunc_v8f64_v2f64:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundpd $11, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundpd $11, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_trunc_v8f64_v2f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $11, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <2 x double> @llvm.trunc.v2f64(<2 x double> %a0)
+  %v1 = call <2 x double> @llvm.trunc.v2f64(<2 x double> %a1)
+  %v2 = call <2 x double> @llvm.trunc.v2f64(<2 x double> %a2)
+  %v3 = call <2 x double> @llvm.trunc.v2f64(<2 x double> %a3)
+  %r01 = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %r23 = shufflevector <2 x double> %v2, <2 x double> %v3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %res  = shufflevector <4 x double> %r01, <4 x double> %r23, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_trunc_v16f32_v4f32(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2, <4 x float> %a3) {
+; SSE-LABEL: concat_trunc_v16f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $11, %xmm0, %xmm0
+; SSE-NEXT:    roundps $11, %xmm1, %xmm1
+; SSE-NEXT:    roundps $11, %xmm2, %xmm2
+; SSE-NEXT:    roundps $11, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_trunc_v16f32_v4f32:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundps $11, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundps $11, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_trunc_v16f32_v4f32:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundps $11, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundps $11, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_trunc_v16f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $11, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x float> @llvm.trunc.v4f32(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.trunc.v4f32(<4 x float> %a1)
+  %v2 = call <4 x float> @llvm.trunc.v4f32(<4 x float> %a2)
+  %v3 = call <4 x float> @llvm.trunc.v4f32(<4 x float> %a3)
+  %r01 = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r23 = shufflevector <4 x float> %v2, <4 x float> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %res  = shufflevector <8 x float> %r01, <8 x float> %r23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
+
+define <8 x double> @concat_trunc_v8f64_v4f64(<4 x double> %a0, <4 x double> %a1) {
+; SSE-LABEL: concat_trunc_v8f64_v4f64:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundpd $11, %xmm0, %xmm0
+; SSE-NEXT:    roundpd $11, %xmm1, %xmm1
+; SSE-NEXT:    roundpd $11, %xmm2, %xmm2
+; SSE-NEXT:    roundpd $11, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_trunc_v8f64_v4f64:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundpd $11, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundpd $11, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_trunc_v8f64_v4f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $11, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x double> @llvm.trunc.v4f64(<4 x double> %a0)
+  %v1 = call <4 x double> @llvm.trunc.v4f64(<4 x double> %a1)
+  %res  = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_trunc_v16f32_v8f32(<8 x float> %a0, <8 x float> %a1) {
+; SSE-LABEL: concat_trunc_v16f32_v8f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    roundps $11, %xmm0, %xmm0
+; SSE-NEXT:    roundps $11, %xmm1, %xmm1
+; SSE-NEXT:    roundps $11, %xmm2, %xmm2
+; SSE-NEXT:    roundps $11, %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_trunc_v16f32_v8f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundps $11, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundps $11, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_trunc_v16f32_v8f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $11, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <8 x float> @llvm.trunc.v8f32(<8 x float> %a0)
+  %v1 = call <8 x float> @llvm.trunc.v8f32(<8 x float> %a1)
+  %res  = shufflevector <8 x float> %v0, <8 x float> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
diff --git a/llvm/test/CodeGen/X86/combine-icmp.ll b/llvm/test/CodeGen/X86/combine-icmp.ll
new file mode 100644
index 0000000000000..dba583905c2c5
--- /dev/null
+++ b/llvm/test/CodeGen/X86/combine-icmp.ll
@@ -0,0 +1,905 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64    | FileCheck %s --check-prefixes=SSE,SSE2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=SSE,SSE42
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge | FileCheck %s --check-prefixes=AVX1OR2,AVX1
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=AVX1OR2,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX512
+
+define i4 @concat_icmp_v4i64_v2i64(<2 x i64> %a0, <2 x i64> %a1) {
+; SSE2-LABEL: concat_icmp_v4i64_v2i64:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    pxor %xmm2, %xmm2
+; SSE2-NEXT:    pcmpeqd %xmm2, %xmm0
+; SSE2-NEXT:    pcmpeqd %xmm2, %xmm1
+; SSE2-NEXT:    movdqa %xmm0, %xmm2
+; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[1,3],xmm1[1,3]
+; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
+; SSE2-NEXT:    andps %xmm2, %xmm0
+; SSE2-NEXT:    movmskps %xmm0, %eax
+; SSE2-NEXT:    xorl $15, %eax
+; SSE2-NEXT:    # kill: def $al killed $al killed $eax
+; SSE2-NEXT:    retq
+;
+; SSE42-LABEL: concat_icmp_v4i64_v2i64:
+; SSE42:       # %bb.0:
+; SSE42-NEXT:    pxor %xmm2, %xmm2
+; SSE42-NEXT:    pcmpeqq %xmm2, %xmm0
+; SSE42-NEXT:    pcmpeqq %xmm2, %xmm1
+; SSE42-NEXT:    packssdw %xmm1, %xmm0
+; SSE42-NEXT:    movmskps %xmm0, %eax
+; SSE42-NEXT:    xorl $15, %eax
+; SSE42-NEXT:    # kill: def $al killed $al killed $eax
+; SSE42-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_icmp_v4i64_v2i64:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
+; AVX1OR2-NEXT:    vpcmpeqq %xmm2, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vpcmpeqq %xmm2, %xmm1, %xmm1
+; AVX1OR2-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vmovmskps %xmm0, %eax
+; AVX1OR2-NEXT:    xorl $15, %eax
+; AVX1OR2-NEXT:    # kill: def $al killed $al killed $eax
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_icmp_v4i64_v2i64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vptestmq %xmm0, %xmm0, %k0
+; AVX512-NEXT:    vptestmq %xmm1, %xmm1, %k1
+; AVX512-NEXT:    kshiftlb $2, %k1, %k1
+; AVX512-NEXT:    korw %k1, %k0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    # kill: def $al killed $al killed $eax
+; AVX512-NEXT:    retq
+  %v0 = icmp ne <2 x i64> %a0, zeroinitializer
+  %v1 = icmp ne <2 x i64> %a1, zeroinitializer
+  %v = shufflevector <2 x i1> %v0, <2 x i1> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %r = bitcast <4 x i1> %v to i4
+  ret i4 %r
+}
+
+define i8 @concat_icmp_v8i32_v4i32(<4 x i32> %a0, <4 x i32> %a1) {
+; SSE-LABEL: concat_icmp_v8i32_v4i32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    pxor %xmm2, %xmm2
+; SSE-NEXT:    pcmpeqd %xmm2, %xmm0
+; SSE-NEXT:    pcmpeqd %xmm2, %xmm1
+; SSE-NEXT:    packssdw %xmm1, %xmm0
+; SSE-NEXT:    packsswb %xmm0, %xmm0
+; SSE-NEXT:    pmovmskb %xmm0, %eax
+; SSE-NEXT:    # kill: def $al killed $al killed $eax
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_icmp_v8i32_v4i32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
+; AVX1OR2-NEXT:    vpcmpeqd %xmm2, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vpcmpeqd %xmm2, %xmm1, %xmm1
+; AVX1OR2-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vpacksswb %xmm0, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vpmovmskb %xmm0, %eax
+; AVX1OR2-NEXT:    # kill: def $al killed $al killed $eax
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_icmp_v8i32_v4i32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vptestnmd %ymm0, %ymm0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    # kill: def $al killed $al killed $eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = icmp eq <4 x i32> %a0, zeroinitializer
+  %v1 = icmp eq <4 x i32> %a1, zeroinitializer
+  %v = shufflevector <4 x i1> %v0, <4 x i1> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r = bitcast <8 x i1> %v to i8
+  ret i8 %r
+}
+
+define i16 @concat_icmp_v16i16_v8i16(<8 x i16> %a0, <8 x i16> %a1) {
+; SSE2-LABEL: concat_icmp_v16i16_v8i16:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa {{.*#+}} xmm2 = [2,2,2,2,2,2,2,2]
+; SSE2-NEXT:    movdqa %xmm2, %xmm3
+; SSE2-NEXT:    psubusw %xmm0, %xmm3
+; SSE2-NEXT:    pxor %xmm0, %xmm0
+; SSE2-NEXT:    pcmpeqw %xmm0, %xmm3
+; SSE2-NEXT:    psubusw %xmm1, %xmm2
+; SSE2-NEXT:    pcmpeqw %xmm0, %xmm2
+; SSE2-NEXT:    packsswb %xmm2, %xmm3
+; SSE2-NEXT:    pmovmskb %xmm3, %eax
+; SSE2-NEXT:    # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT:    retq
+;
+; SSE42-LABEL: concat_icmp_v16i16_v8i16:
+; SSE42:       # %bb.0:
+; SSE42-NEXT:    movdqa {{.*#+}} xmm2 = [2,2,2,2,2,2,2,2]
+; SSE42-NEXT:    movdqa %xmm0, %xmm3
+; SSE42-NEXT:    pmaxuw %xmm2, %xmm3
+; SSE42-NEXT:    pcmpeqw %xmm0, %xmm3
+; SSE42-NEXT:    pmaxuw %xmm1, %xmm2
+; SSE42-NEXT:    pcmpeqw %xmm1, %xmm2
+; SSE42-NEXT:    packsswb %xmm2, %xmm3
+; SSE42-NEXT:    pmovmskb %xmm3, %eax
+; SSE42-NEXT:    # kill: def $ax killed $ax killed $eax
+; SSE42-NEXT:    retq
+;
+; AVX1-LABEL: concat_icmp_v16i16_v8i16:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm2 = [2,2,2,2,2,2,2,2]
+; AVX1-NEXT:    vpmaxuw %xmm2, %xmm0, %xmm3
+; AVX1-NEXT:    vpcmpeqw %xmm3, %xmm0, %xmm0
+; AVX1-NEXT:    vpmaxuw %xmm2, %xmm1, %xmm2
+; AVX1-NEXT:    vpcmpeqw %xmm2, %xmm1, %xmm1
+; AVX1-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX1-NEXT:    vpmovmskb %xmm0, %eax
+; AVX1-NEXT:    # kill: def $ax killed $ax killed $eax
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_icmp_v16i16_v8i16:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpbroadcastd {{.*#+}} xmm2 = [2,2,2,2,2,2,2,2]
+; AVX2-NEXT:    vpmaxuw %xmm2, %xmm0, %xmm3
+; AVX2-NEXT:    vpcmpeqw %xmm3, %xmm0, %xmm0
+; AVX2-NEXT:    vpmaxuw %xmm2, %xmm1, %xmm2
+; AVX2-NEXT:    vpcmpeqw %xmm2, %xmm1, %xmm1
+; AVX2-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    vpmovmskb %xmm0, %eax
+; AVX2-NEXT:    # kill: def $ax killed $ax killed $eax
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_icmp_v16i16_v8i16:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vpcmpnleuw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = icmp ugt <8 x i16> %a0, splat (i16 1)
+  %v1 = icmp ugt <8 x i16> %a1, splat (i16 1)
+  %v = shufflevector <8 x i1> %v0, <8 x i1> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  %r = bitcast <16 x i1> %v to i16
+  ret i16 %r
+}
+
+define i32 @concat_icmp_v32i8_v16i8(<16 x i8> %a0, <16 x i8> %a1) {
+; SSE-LABEL: concat_icmp_v32i8_v16i8:
+; SSE:       # %bb.0:
+; SSE-NEXT:    movdqa {{.*#+}} xmm2 = [5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5]
+; SSE-NEXT:    pcmpgtb %xmm2, %xmm0
+; SSE-NEXT:    pcmpgtb %xmm2, %xmm1
+; SSE-NEXT:    pmovmskb %xmm0, %ecx
+; SSE-NEXT:    pmovmskb %xmm1, %eax
+; SSE-NEXT:    shll $16, %eax
+; SSE-NEXT:    orl %ecx, %eax
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_icmp_v32i8_v16i8:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm2 = [5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5]
+; AVX1-NEXT:    vpcmpgtb %xmm2, %xmm0, %xmm0
+; AVX1-NEXT:    vpcmpgtb %xmm2, %xmm1, %xmm1
+; AVX1-NEXT:    vpmovmskb %xmm0, %ecx
+; AVX1-NEXT:    vpmovmskb %xmm1, %eax
+; AVX1-NEXT:    shll $16, %eax
+; AVX1-NEXT:    orl %ecx, %eax
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_icmp_v32i8_v16i8:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vpcmpgtb {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; AVX2-NEXT:    vpmovmskb %ymm0, %eax
+; AVX2-NEXT:    vzeroupper
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_icmp_v32i8_v16i8:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vpcmpgtb {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = icmp sgt <16 x i8> %a0, splat (i8 5)
+  %v1 = icmp sgt <16 x i8> %a1, splat (i8 5)
+  %v = shufflevector <16 x i1> %v0, <16 x i1> %v1, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  %r = bitcast <32 x i1> %v to i32
+  ret i32 %r
+}
+
+define i8 @concat_icmp_v8i64_v2i64(<2 x i64> %a0, <2 x i64> %a1, <2 x i64> %a2, <2 x i64> %a3) {
+; SSE2-LABEL: concat_icmp_v8i64_v2i64:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [9223372039002259456,9223372039002259456]
+; SSE2-NEXT:    pxor %xmm4, %xmm0
+; SSE2-NEXT:    pxor %xmm4, %xmm1
+; SSE2-NEXT:    pxor %xmm4, %xmm2
+; SSE2-NEXT:    pxor %xmm4, %xmm3
+; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm3[0,2,2,3]
+; SSE2-NEXT:    movdqa {{.*#+}} xmm5 = [2147483776,2147483776,2147483776,2147483648]
+; SSE2-NEXT:    movdqa %xmm5, %xmm7
+; SSE2-NEXT:    pcmpgtd %xmm6, %xmm7
+; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm3[1,3,3,3]
+; SSE2-NEXT:    pcmpeqd %xmm4, %xmm3
+; SSE2-NEXT:    pand %xmm7, %xmm3
+; SSE2-NEXT:    pshuflw {{.*#+}} xmm3 = xmm3[0,1,0,2,4,5,6,7]
+; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm2[0,2,2,3]
+; SSE2-NEXT:    movdqa %xmm5, %xmm7
+; SSE2-NEXT:    pcmpgtd %xmm6, %xmm7
+; SSE2-NEXT:    pshufd {{.*#+}} xmm2 = xmm2[1,3,3,3]
+; SSE2-NEXT:    pcmpeqd %xmm4, %xmm2
+; SSE2-NEXT:    pand %xmm7, %xmm2
+; SSE2-NEXT:    pshuflw {{.*#+}} xmm2 = xmm2[0,1,0,2,4,5,6,7]
+; SSE2-NEXT:    punpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
+; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm1[0,2,2,3]
+; SSE2-NEXT:    movdqa %xmm5, %xmm6
+; SSE2-NEXT:    pcmpgtd %xmm3, %xmm6
+; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,3,3]
+; SSE2-NEXT:    pcmpeqd %xmm4, %xmm1
+; SSE2-NEXT:    pand %xmm6, %xmm1
+; SSE2-NEXT:    pshuflw {{.*#+}} xmm1 = xmm1[0,2,2,3,4,5,6,7]
+; SSE2-NEXT:    pshufd {{.*#+}} xmm3 = xmm0[0,2,2,3]
+; SSE2-NEXT:    pcmpgtd %xmm3, %xmm5
+; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,3,3]
+; SSE2-NEXT:    pcmpeqd %xmm4, %xmm0
+; SSE2-NEXT:    pand %xmm5, %xmm0
+; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,2,3,4,5,6,7]
+; SSE2-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; SSE2-NEXT:    packsswb %xmm2, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,3,2,3]
+; SSE2-NEXT:    pmovmskb %xmm0, %eax
+; SSE2-NEXT:    # kill: def $al killed $al killed $eax
+; SSE2-NEXT:    retq
+;
+; SSE42-LABEL: concat_icmp_v8i64_v2i64:
+; SSE42:       # %bb.0:
+; SSE42-NEXT:    movdqa {{.*#+}} xmm4 = [9223372036854775808,9223372036854775808]
+; SSE42-NEXT:    pxor %xmm4, %xmm0
+; SSE42-NEXT:    movdqa {{.*#+}} xmm5 = [9223372036854775936,9223372036854775936]
+; SSE42-NEXT:    movdqa %xmm5, %xmm6
+; SSE42-NEXT:    pcmpgtq %xmm0, %xmm6
+; SSE42-NEXT:    pxor %xmm4, %xmm1
+; SSE42-NEXT:    movdqa %xmm5, %xmm0
+; SSE42-NEXT:    pcmpgtq %xmm1, %xmm0
+; SSE42-NEXT:    packssdw %xmm0, %xmm6
+; SSE42-NEXT:    pxor %xmm4, %xmm2
+; SSE42-NEXT:    movdqa %xmm5, %xmm0
+; SSE42-NEXT:    pcmpgtq %xmm2, %xmm0
+; SSE42-NEXT:    pxor %xmm4, %xmm3
+; SSE42-NEXT:    pcmpgtq %xmm3, %xmm5
+; SSE42-NEXT:    packssdw %xmm5, %xmm0
+; SSE42-NEXT:    packssdw %xmm6, %xmm6
+; SSE42-NEXT:    packssdw %xmm0, %xmm0
+; SSE42-NEXT:    packsswb %xmm0, %xmm6
+; SSE42-NEXT:    pshufd {{.*#+}} xmm0 = xmm6[0,3,2,3]
+; SSE42-NEXT:    pmovmskb %xmm0, %eax
+; SSE42-NEXT:    # kill: def $al killed $al killed $eax
+; SSE42-NEXT:    retq
+;
+; AVX1-LABEL: concat_icmp_v8i64_v2i64:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vmovddup {{.*#+}} xmm4 = [9223372036854775808,9223372036854775808]
+; AVX1-NEXT:    # xmm4 = mem[0,0]
+; AVX1-NEXT:    vpxor %xmm4, %xmm0, %xmm0
+; AVX1-NEXT:    vmovddup {{.*#+}} xmm5 = [9223372036854775936,9223372036854775936]
+; AVX1-NEXT:    # xmm5 = mem[0,0]
+; AVX1-NEXT:    vpcmpgtq %xmm0, %xmm5, %xmm0
+; AVX1-NEXT:    vpxor %xmm4, %xmm1, %xmm1
+; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm5, %xmm1
+; AVX1-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
+; AVX1-NEXT:    vpxor %xmm4, %xmm2, %xmm1
+; AVX1-NEXT:    vpcmpgtq %xmm1, %xmm5, %xmm1
+; AVX1-NEXT:    vpxor %xmm4, %xmm3, %xmm2
+; AVX1-NEXT:    vpcmpgtq %xmm2, %xmm5, %xmm2
+; AVX1-NEXT:    vpackssdw %xmm2, %xmm1, %xmm1
+; AVX1-NEXT:    vpackssdw %xmm1, %xmm1, %xmm1
+; AVX1-NEXT:    vpackssdw %xmm0, %xmm0, %xmm0
+; AVX1-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX1-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,3,0,3]
+; AVX1-NEXT:    vpmovmskb %xmm0, %eax
+; AVX1-NEXT:    # kill: def $al killed $al killed $eax
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_icmp_v8i64_v2i64:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpbroadcastq {{.*#+}} xmm4 = [9223372036854775808,9223372036854775808]
+; AVX2-NEXT:    vpxor %xmm4, %xmm0, %xmm0
+; AVX2-NEXT:    vpbroadcastq {{.*#+}} xmm5 = [9223372036854775936,9223372036854775936]
+; AVX2-NEXT:    vpcmpgtq %xmm0, %xmm5, %xmm0
+; AVX2-NEXT:    vpxor %xmm4, %xmm1, %xmm1
+; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm5, %xmm1
+; AVX2-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    vpxor %xmm4, %xmm2, %xmm1
+; AVX2-NEXT:    vpcmpgtq %xmm1, %xmm5, %xmm1
+; AVX2-NEXT:    vpxor %xmm4, %xmm3, %xmm2
+; AVX2-NEXT:    vpcmpgtq %xmm2, %xmm5, %xmm2
+; AVX2-NEXT:    vpackssdw %xmm2, %xmm1, %xmm1
+; AVX2-NEXT:    vpackssdw %xmm1, %xmm1, %xmm1
+; AVX2-NEXT:    vpackssdw %xmm0, %xmm0, %xmm0
+; AVX2-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    vpshufd {{.*#+}} xmm0 = xmm0[0,3,0,3]
+; AVX2-NEXT:    vpmovmskb %xmm0, %eax
+; AVX2-NEXT:    # kill: def $al killed $al killed $eax
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_icmp_v8i64_v2i64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinserti128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinserti64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vpcmpltuq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %zmm0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    # kill: def $al killed $al killed $eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = icmp ult <2 x i64> %a0, splat (i64 128)
+  %v1 = icmp ult <2 x i64> %a1, splat (i64 128)
+  %v2 = icmp ult <2 x i64> %a2, splat (i64 128)
+  %v3 = icmp ult <2 x i64> %a3, splat (i64 128)
+  %v01 = shufflevector <2 x i1> %v0, <2 x i1> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %v23 = shufflevector <2 x i1> %v2, <2 x i1> %v3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %v = shufflevector <4 x i1> %v01, <4 x i1> %v23, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r = bitcast <8 x i1> %v to i8
+  ret i8 %r
+}
+
+define i16 @concat_icmp_v16i32_v4i32(<4 x i32> %a0, <4 x i32> %a1, <4 x i32> %a2, <4 x i32> %a3) {
+; SSE-LABEL: concat_icmp_v16i32_v4i32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    pxor %xmm4, %xmm4
+; SSE-NEXT:    pcmpgtd %xmm4, %xmm0
+; SSE-NEXT:    pcmpgtd %xmm4, %xmm1
+; SSE-NEXT:    packssdw %xmm1, %xmm0
+; SSE-NEXT:    pcmpgtd %xmm4, %xmm2
+; SSE-NEXT:    pcmpgtd %xmm4, %xmm3
+; SSE-NEXT:    packssdw %xmm3, %xmm2
+; SSE-NEXT:    packsswb %xmm2, %xmm0
+; SSE-NEXT:    pmovmskb %xmm0, %eax
+; SSE-NEXT:    # kill: def $ax killed $ax killed $eax
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_icmp_v16i32_v4i32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vpxor %xmm4, %xmm4, %xmm4
+; AVX1OR2-NEXT:    vpcmpgtd %xmm4, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vpcmpgtd %xmm4, %xmm1, %xmm1
+; AVX1OR2-NEXT:    vpackssdw %xmm1, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vpcmpgtd %xmm4, %xmm2, %xmm1
+; AVX1OR2-NEXT:    vpcmpgtd %xmm4, %xmm3, %xmm2
+; AVX1OR2-NEXT:    vpackssdw %xmm2, %xmm1, %xmm1
+; AVX1OR2-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX1OR2-NEXT:    vpmovmskb %xmm0, %eax
+; AVX1OR2-NEXT:    # kill: def $ax killed $ax killed $eax
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_icmp_v16i32_v4i32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinserti128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinserti64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vpxor %xmm1, %xmm1, %xmm1
+; AVX512-NEXT:    vpcmpgtd %zmm1, %zmm0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = icmp sgt <4 x i32> %a0, zeroinitializer
+  %v1 = icmp sgt <4 x i32> %a1, zeroinitializer
+  %v2 = icmp sgt <4 x i32> %a2, zeroinitializer
+  %v3 = icmp sgt <4 x i32> %a3, zeroinitializer
+  %v01 = shufflevector <4 x i1> %v0, <4 x i1> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %v23 = shufflevector <4 x i1> %v2, <4 x i1> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %v = shufflevector <8 x i1> %v01, <8 x i1> %v23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  %r = bitcast <16 x i1> %v to i16
+  ret i16 %r
+}
+
+define i32 @concat_icmp_v32i16_v8i16(<8 x i16> %a0, <8 x i16> %a1, <8 x i16> %a2, <8 x i16> %a3) {
+; SSE-LABEL: concat_icmp_v32i16_v8i16:
+; SSE:       # %bb.0:
+; SSE-NEXT:    pxor %xmm4, %xmm4
+; SSE-NEXT:    pcmpeqw %xmm4, %xmm0
+; SSE-NEXT:    pcmpeqw %xmm4, %xmm1
+; SSE-NEXT:    packsswb %xmm1, %xmm0
+; SSE-NEXT:    pcmpeqw %xmm4, %xmm2
+; SSE-NEXT:    pcmpeqw %xmm4, %xmm3
+; SSE-NEXT:    packsswb %xmm3, %xmm2
+; SSE-NEXT:    pmovmskb %xmm0, %ecx
+; SSE-NEXT:    xorl $65535, %ecx # imm = 0xFFFF
+; SSE-NEXT:    pmovmskb %xmm2, %eax
+; SSE-NEXT:    notl %eax
+; SSE-NEXT:    shll $16, %eax
+; SSE-NEXT:    orl %ecx, %eax
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_icmp_v32i16_v8i16:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vpxor %xmm4, %xmm4, %xmm4
+; AVX1-NEXT:    vpcmpeqw %xmm4, %xmm0, %xmm0
+; AVX1-NEXT:    vpcmpeqw %xmm4, %xmm1, %xmm1
+; AVX1-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX1-NEXT:    vpcmpeqw %xmm4, %xmm2, %xmm1
+; AVX1-NEXT:    vpcmpeqw %xmm4, %xmm3, %xmm2
+; AVX1-NEXT:    vpacksswb %xmm2, %xmm1, %xmm1
+; AVX1-NEXT:    vpmovmskb %xmm0, %ecx
+; AVX1-NEXT:    xorl $65535, %ecx # imm = 0xFFFF
+; AVX1-NEXT:    vpmovmskb %xmm1, %eax
+; AVX1-NEXT:    notl %eax
+; AVX1-NEXT:    shll $16, %eax
+; AVX1-NEXT:    orl %ecx, %eax
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_icmp_v32i16_v8i16:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinserti128 $1, %xmm3, %ymm2, %ymm2
+; AVX2-NEXT:    vpxor %xmm3, %xmm3, %xmm3
+; AVX2-NEXT:    vpcmpeqw %ymm3, %ymm2, %ymm2
+; AVX2-NEXT:    vpacksswb %ymm2, %ymm2, %ymm2
+; AVX2-NEXT:    vpcmpeqd %ymm4, %ymm4, %ymm4
+; AVX2-NEXT:    vpxor %ymm4, %ymm2, %ymm2
+; AVX2-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vpcmpeqw %ymm3, %ymm0, %ymm0
+; AVX2-NEXT:    vpacksswb %ymm0, %ymm0, %ymm0
+; AVX2-NEXT:    vpxor %ymm4, %ymm0, %ymm0
+; AVX2-NEXT:    vpblendd {{.*#+}} ymm0 = ymm0[0,1],ymm2[2,3],ymm0[4,5],ymm2[6,7]
+; AVX2-NEXT:    vpermq {{.*#+}} ymm0 = ymm0[0,2,1,3]
+; AVX2-NEXT:    vpmovmskb %ymm0, %eax
+; AVX2-NEXT:    vzeroupper
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_icmp_v32i16_v8i16:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinserti128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinserti64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vptestmw %zmm0, %zmm0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = icmp ne <8 x i16> %a0, zeroinitializer
+  %v1 = icmp ne <8 x i16> %a1, zeroinitializer
+  %v2 = icmp ne <8 x i16> %a2, zeroinitializer
+  %v3 = icmp ne <8 x i16> %a3, zeroinitializer
+  %v01 = shufflevector <8 x i1> %v0, <8 x i1> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  %v23 = shufflevector <8 x i1> %v2, <8 x i1> %v3, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  %v = shufflevector <16 x i1> %v01, <16 x i1> %v23, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  %r = bitcast <32 x i1> %v to i32
+  ret i32 %r
+}
+
+define i64 @concat_icmp_v64i8_v16i8(<16 x i8> %a0, <16 x i8> %a1, <16 x i8> %a2, <16 x i8> %a3) {
+; SSE-LABEL: concat_icmp_v64i8_v16i8:
+; SSE:       # %bb.0:
+; SSE-NEXT:    movdqa {{.*#+}} xmm4 = [16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16]
+; SSE-NEXT:    movdqa %xmm0, %xmm5
+; SSE-NEXT:    pmaxub %xmm4, %xmm5
+; SSE-NEXT:    pcmpeqb %xmm0, %xmm5
+; SSE-NEXT:    pmovmskb %xmm5, %eax
+; SSE-NEXT:    movdqa %xmm1, %xmm0
+; SSE-NEXT:    pmaxub %xmm4, %xmm0
+; SSE-NEXT:    pcmpeqb %xmm1, %xmm0
+; SSE-NEXT:    pmovmskb %xmm0, %ecx
+; SSE-NEXT:    shll $16, %ecx
+; SSE-NEXT:    orl %eax, %ecx
+; SSE-NEXT:    movdqa %xmm2, %xmm0
+; SSE-NEXT:    pmaxub %xmm4, %xmm0
+; SSE-NEXT:    pcmpeqb %xmm2, %xmm0
+; SSE-NEXT:    pmovmskb %xmm0, %edx
+; SSE-NEXT:    pmaxub %xmm3, %xmm4
+; SSE-NEXT:    pcmpeqb %xmm3, %xmm4
+; SSE-NEXT:    pmovmskb %xmm4, %eax
+; SSE-NEXT:    shll $16, %eax
+; SSE-NEXT:    orl %edx, %eax
+; SSE-NEXT:    shlq $32, %rax
+; SSE-NEXT:    orq %rcx, %rax
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_icmp_v64i8_v16i8:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm4 = [16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16]
+; AVX1-NEXT:    vpmaxub %xmm4, %xmm0, %xmm5
+; AVX1-NEXT:    vpcmpeqb %xmm5, %xmm0, %xmm0
+; AVX1-NEXT:    vpmaxub %xmm4, %xmm1, %xmm5
+; AVX1-NEXT:    vpcmpeqb %xmm5, %xmm1, %xmm1
+; AVX1-NEXT:    vpmaxub %xmm4, %xmm2, %xmm5
+; AVX1-NEXT:    vpcmpeqb %xmm5, %xmm2, %xmm2
+; AVX1-NEXT:    vpmaxub %xmm4, %xmm3, %xmm4
+; AVX1-NEXT:    vpcmpeqb %xmm4, %xmm3, %xmm3
+; AVX1-NEXT:    vpmovmskb %xmm0, %eax
+; AVX1-NEXT:    vpmovmskb %xmm1, %ecx
+; AVX1-NEXT:    shll $16, %ecx
+; AVX1-NEXT:    orl %eax, %ecx
+; AVX1-NEXT:    vpmovmskb %xmm2, %edx
+; AVX1-NEXT:    vpmovmskb %xmm3, %eax
+; AVX1-NEXT:    shll $16, %eax
+; AVX1-NEXT:    orl %edx, %eax
+; AVX1-NEXT:    shlq $32, %rax
+; AVX1-NEXT:    orq %rcx, %rax
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_icmp_v64i8_v16i8:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpbroadcastd {{.*#+}} xmm4 = [16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16]
+; AVX2-NEXT:    vpmaxub %xmm4, %xmm0, %xmm5
+; AVX2-NEXT:    vpcmpeqb %xmm5, %xmm0, %xmm0
+; AVX2-NEXT:    vpmaxub %xmm4, %xmm1, %xmm5
+; AVX2-NEXT:    vpcmpeqb %xmm5, %xmm1, %xmm1
+; AVX2-NEXT:    vpmaxub %xmm4, %xmm2, %xmm5
+; AVX2-NEXT:    vpcmpeqb %xmm5, %xmm2, %xmm2
+; AVX2-NEXT:    vpmaxub %xmm4, %xmm3, %xmm4
+; AVX2-NEXT:    vpcmpeqb %xmm4, %xmm3, %xmm3
+; AVX2-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vpmovmskb %ymm0, %ecx
+; AVX2-NEXT:    vinserti128 $1, %xmm3, %ymm2, %ymm0
+; AVX2-NEXT:    vpmovmskb %ymm0, %eax
+; AVX2-NEXT:    shlq $32, %rax
+; AVX2-NEXT:    orq %rcx, %rax
+; AVX2-NEXT:    vzeroupper
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_icmp_v64i8_v16i8:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinserti128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinserti128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinserti64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vpcmpnleub {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %zmm0, %k0
+; AVX512-NEXT:    kmovq %k0, %rax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = icmp ugt <16 x i8> %a0, splat (i8 15)
+  %v1 = icmp ugt <16 x i8> %a1, splat (i8 15)
+  %v2 = icmp ugt <16 x i8> %a2, splat (i8 15)
+  %v3 = icmp ugt <16 x i8> %a3, splat (i8 15)
+  %v01 = shufflevector <16 x i1> %v0, <16 x i1> %v1, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  %v23 = shufflevector <16 x i1> %v2, <16 x i1> %v3, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  %v = shufflevector <32 x i1> %v01, <32 x i1> %v23, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
+  %r = bitcast <64 x i1> %v to i64
+  ret i64 %r
+}
+
+define i8 @concat_icmp_v8i64_v4i64(<4 x i64> %a0, <4 x i64> %a1) {
+; SSE2-LABEL: concat_icmp_v8i64_v4i64:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    pxor %xmm4, %xmm4
+; SSE2-NEXT:    pcmpeqd %xmm4, %xmm1
+; SSE2-NEXT:    pshufd {{.*#+}} xmm5 = xmm1[0,2,2,3]
+; SSE2-NEXT:    pcmpeqd %xmm4, %xmm0
+; SSE2-NEXT:    pshufd {{.*#+}} xmm6 = xmm0[0,2,2,3]
+; SSE2-NEXT:    punpcklwd {{.*#+}} xmm6 = xmm6[0],xmm5[0],xmm6[1],xmm5[1],xmm6[2],xmm5[2],xmm6[3],xmm5[3]
+; SSE2-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[1,3,3,2]
+; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[1,3,3,2]
+; SSE2-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
+; SSE2-NEXT:    pand %xmm6, %xmm0
+; SSE2-NEXT:    pcmpeqd %xmm4, %xmm3
+; SSE2-NEXT:    pcmpeqd %xmm4, %xmm2
+; SSE2-NEXT:    movdqa %xmm2, %xmm1
+; SSE2-NEXT:    shufps {{.*#+}} xmm1 = xmm1[1,3],xmm3[1,3]
+; SSE2-NEXT:    shufps {{.*#+}} xmm2 = xmm2[0,2],xmm3[0,2]
+; SSE2-NEXT:    andps %xmm1, %xmm2
+; SSE2-NEXT:    packssdw %xmm2, %xmm2
+; SSE2-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
+; SSE2-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,1,3,4,5,6,7]
+; SSE2-NEXT:    punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; SSE2-NEXT:    packsswb %xmm0, %xmm0
+; SSE2-NEXT:    pmovmskb %xmm0, %eax
+; SSE2-NEXT:    # kill: def $al killed $al killed $eax
+; SSE2-NEXT:    retq
+;
+; SSE42-LABEL: concat_icmp_v8i64_v4i64:
+; SSE42:       # %bb.0:
+; SSE42-NEXT:    pxor %xmm4, %xmm4
+; SSE42-NEXT:    pcmpeqq %xmm4, %xmm1
+; SSE42-NEXT:    pshufd {{.*#+}} xmm1 = xmm1[0,2,2,3]
+; SSE42-NEXT:    pcmpeqq %xmm4, %xmm0
+; SSE42-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
+; SSE42-NEXT:    punpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
+; SSE42-NEXT:    pcmpeqq %xmm4, %xmm3
+; SSE42-NEXT:    pcmpeqq %xmm4, %xmm2
+; SSE42-NEXT:    packssdw %xmm3, %xmm2
+; SSE42-NEXT:    packssdw %xmm2, %xmm2
+; SSE42-NEXT:    pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
+; SSE42-NEXT:    pshuflw {{.*#+}} xmm0 = xmm0[0,2,1,3,4,5,6,7]
+; SSE42-NEXT:    punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; SSE42-NEXT:    packsswb %xmm0, %xmm0
+; SSE42-NEXT:    pmovmskb %xmm0, %eax
+; SSE42-NEXT:    # kill: def $al killed $al killed $eax
+; SSE42-NEXT:    retq
+;
+; AVX1-LABEL: concat_icmp_v8i64_v4i64:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm2
+; AVX1-NEXT:    vpxor %xmm3, %xmm3, %xmm3
+; AVX1-NEXT:    vpcmpeqq %xmm3, %xmm2, %xmm2
+; AVX1-NEXT:    vpcmpeqq %xmm3, %xmm0, %xmm0
+; AVX1-NEXT:    vpackssdw %xmm2, %xmm0, %xmm0
+; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm2
+; AVX1-NEXT:    vpcmpeqq %xmm3, %xmm2, %xmm2
+; AVX1-NEXT:    vpcmpeqq %xmm3, %xmm1, %xmm1
+; AVX1-NEXT:    vpackssdw %xmm2, %xmm1, %xmm1
+; AVX1-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0],xmm3[1],xmm1[2],xmm3[3],xmm1[4],xmm3[5],xmm1[6],xmm3[7]
+; AVX1-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0],xmm3[1],xmm0[2],xmm3[3],xmm0[4],xmm3[5],xmm0[6],xmm3[7]
+; AVX1-NEXT:    vpackusdw %xmm1, %xmm0, %xmm0
+; AVX1-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX1-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
+; AVX1-NEXT:    vpmovmskb %xmm0, %eax
+; AVX1-NEXT:    # kill: def $al killed $al killed $eax
+; AVX1-NEXT:    vzeroupper
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_icmp_v8i64_v4i64:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
+; AVX2-NEXT:    vpcmpeqq %ymm2, %ymm0, %ymm0
+; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm3
+; AVX2-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
+; AVX2-NEXT:    vpcmpeqq %ymm2, %ymm1, %ymm1
+; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm2
+; AVX2-NEXT:    vpackssdw %xmm2, %xmm1, %xmm1
+; AVX2-NEXT:    vpxor %xmm2, %xmm2, %xmm2
+; AVX2-NEXT:    vpblendw {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3],xmm1[4],xmm2[5],xmm1[6],xmm2[7]
+; AVX2-NEXT:    vpblendw {{.*#+}} xmm0 = xmm0[0],xmm2[1],xmm0[2],xmm2[3],xmm0[4],xmm2[5],xmm0[6],xmm2[7]
+; AVX2-NEXT:    vpackusdw %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpackuswb %xmm0, %xmm0, %xmm0
+; AVX2-NEXT:    vpmovmskb %xmm0, %eax
+; AVX2-NEXT:    # kill: def $al killed $al killed $eax
+; AVX2-NEXT:    vzeroupper
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_icmp_v8i64_v4i64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinserti64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vptestnmq %zmm0, %zmm0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    # kill: def $al killed $al killed $eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = icmp eq <4 x i64> %a0, zeroinitializer
+  %v1 = icmp eq <4 x i64> %a1, zeroinitializer
+  %v = shufflevector <4 x i1> %v0, <4 x i1> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r = bitcast <8 x i1> %v to i8
+  ret i8 %r
+}
+
+define i16 @concat_icmp_v16i32_v8i32(<8 x i32> %a0, <8 x i32> %a1) {
+; SSE2-LABEL: concat_icmp_v16i32_v8i32:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa {{.*#+}} xmm4 = [2147483648,2147483648,2147483648,2147483648]
+; SSE2-NEXT:    pxor %xmm4, %xmm1
+; SSE2-NEXT:    movdqa {{.*#+}} xmm5 = [2147483649,2147483649,2147483649,2147483649]
+; SSE2-NEXT:    pcmpgtd %xmm5, %xmm1
+; SSE2-NEXT:    pxor %xmm4, %xmm0
+; SSE2-NEXT:    pcmpgtd %xmm5, %xmm0
+; SSE2-NEXT:    packssdw %xmm1, %xmm0
+; SSE2-NEXT:    pxor %xmm4, %xmm3
+; SSE2-NEXT:    pcmpgtd %xmm5, %xmm3
+; SSE2-NEXT:    pxor %xmm4, %xmm2
+; SSE2-NEXT:    pcmpgtd %xmm5, %xmm2
+; SSE2-NEXT:    packssdw %xmm3, %xmm2
+; SSE2-NEXT:    packsswb %xmm2, %xmm0
+; SSE2-NEXT:    pmovmskb %xmm0, %eax
+; SSE2-NEXT:    # kill: def $ax killed $ax killed $eax
+; SSE2-NEXT:    retq
+;
+; SSE42-LABEL: concat_icmp_v16i32_v8i32:
+; SSE42:       # %bb.0:
+; SSE42-NEXT:    movdqa {{.*#+}} xmm4 = [2,2,2,2]
+; SSE42-NEXT:    movdqa %xmm1, %xmm5
+; SSE42-NEXT:    pmaxud %xmm4, %xmm5
+; SSE42-NEXT:    pcmpeqd %xmm1, %xmm5
+; SSE42-NEXT:    movdqa %xmm0, %xmm1
+; SSE42-NEXT:    pmaxud %xmm4, %xmm1
+; SSE42-NEXT:    pcmpeqd %xmm0, %xmm1
+; SSE42-NEXT:    packssdw %xmm5, %xmm1
+; SSE42-NEXT:    movdqa %xmm3, %xmm0
+; SSE42-NEXT:    pmaxud %xmm4, %xmm0
+; SSE42-NEXT:    pcmpeqd %xmm3, %xmm0
+; SSE42-NEXT:    pmaxud %xmm2, %xmm4
+; SSE42-NEXT:    pcmpeqd %xmm2, %xmm4
+; SSE42-NEXT:    packssdw %xmm0, %xmm4
+; SSE42-NEXT:    packsswb %xmm4, %xmm1
+; SSE42-NEXT:    pmovmskb %xmm1, %eax
+; SSE42-NEXT:    # kill: def $ax killed $ax killed $eax
+; SSE42-NEXT:    retq
+;
+; AVX1-LABEL: concat_icmp_v16i32_v8i32:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm2
+; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm3 = [2,2,2,2]
+; AVX1-NEXT:    vpmaxud %xmm3, %xmm2, %xmm4
+; AVX1-NEXT:    vpcmpeqd %xmm4, %xmm2, %xmm2
+; AVX1-NEXT:    vpmaxud %xmm3, %xmm0, %xmm4
+; AVX1-NEXT:    vpcmpeqd %xmm4, %xmm0, %xmm0
+; AVX1-NEXT:    vpackssdw %xmm2, %xmm0, %xmm0
+; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm2
+; AVX1-NEXT:    vpmaxud %xmm3, %xmm2, %xmm4
+; AVX1-NEXT:    vpcmpeqd %xmm4, %xmm2, %xmm2
+; AVX1-NEXT:    vpmaxud %xmm3, %xmm1, %xmm3
+; AVX1-NEXT:    vpcmpeqd %xmm3, %xmm1, %xmm1
+; AVX1-NEXT:    vpackssdw %xmm2, %xmm1, %xmm1
+; AVX1-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX1-NEXT:    vpmovmskb %xmm0, %eax
+; AVX1-NEXT:    # kill: def $ax killed $ax killed $eax
+; AVX1-NEXT:    vzeroupper
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_icmp_v16i32_v8i32:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpbroadcastd {{.*#+}} ymm2 = [2,2,2,2,2,2,2,2]
+; AVX2-NEXT:    vpmaxud %ymm2, %ymm0, %ymm3
+; AVX2-NEXT:    vpcmpeqd %ymm3, %ymm0, %ymm0
+; AVX2-NEXT:    vextracti128 $1, %ymm0, %xmm3
+; AVX2-NEXT:    vpackssdw %xmm3, %xmm0, %xmm0
+; AVX2-NEXT:    vpmaxud %ymm2, %ymm1, %ymm2
+; AVX2-NEXT:    vpcmpeqd %ymm2, %ymm1, %ymm1
+; AVX2-NEXT:    vextracti128 $1, %ymm1, %xmm2
+; AVX2-NEXT:    vpackssdw %xmm2, %xmm1, %xmm1
+; AVX2-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    vpmovmskb %xmm0, %eax
+; AVX2-NEXT:    # kill: def $ax killed $ax killed $eax
+; AVX2-NEXT:    vzeroupper
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_icmp_v16i32_v8i32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinserti64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vpcmpnleud {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to16}, %zmm0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    # kill: def $ax killed $ax killed $eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = icmp ugt <8 x i32> %a0, splat (i32 1)
+  %v1 = icmp ugt <8 x i32> %a1, splat (i32 1)
+  %v = shufflevector <8 x i1> %v0, <8 x i1> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  %r = bitcast <16 x i1> %v to i16
+  ret i16 %r
+}
+
+define i32 @concat_icmp_v32i16_v16i16(<16 x i16> %a0, <16 x i16> %a1) {
+; SSE-LABEL: concat_icmp_v32i16_v16i16:
+; SSE:       # %bb.0:
+; SSE-NEXT:    movdqa {{.*#+}} xmm4 = [5,5,5,5,5,5,5,5]
+; SSE-NEXT:    pcmpgtw %xmm4, %xmm1
+; SSE-NEXT:    pcmpgtw %xmm4, %xmm0
+; SSE-NEXT:    packsswb %xmm1, %xmm0
+; SSE-NEXT:    pcmpgtw %xmm4, %xmm3
+; SSE-NEXT:    pcmpgtw %xmm4, %xmm2
+; SSE-NEXT:    packsswb %xmm3, %xmm2
+; SSE-NEXT:    pmovmskb %xmm0, %ecx
+; SSE-NEXT:    pmovmskb %xmm2, %eax
+; SSE-NEXT:    shll $16, %eax
+; SSE-NEXT:    orl %ecx, %eax
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_icmp_v32i16_v16i16:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm2
+; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm3 = [5,5,5,5,5,5,5,5]
+; AVX1-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm2
+; AVX1-NEXT:    vpcmpgtw %xmm3, %xmm0, %xmm0
+; AVX1-NEXT:    vpacksswb %xmm2, %xmm0, %xmm0
+; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm2
+; AVX1-NEXT:    vpcmpgtw %xmm3, %xmm2, %xmm2
+; AVX1-NEXT:    vpcmpgtw %xmm3, %xmm1, %xmm1
+; AVX1-NEXT:    vpacksswb %xmm2, %xmm1, %xmm1
+; AVX1-NEXT:    vpmovmskb %xmm0, %ecx
+; AVX1-NEXT:    vpmovmskb %xmm1, %eax
+; AVX1-NEXT:    shll $16, %eax
+; AVX1-NEXT:    orl %ecx, %eax
+; AVX1-NEXT:    vzeroupper
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_icmp_v32i16_v16i16:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpbroadcastd {{.*#+}} ymm2 = [5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5]
+; AVX2-NEXT:    vpcmpgtw %ymm2, %ymm0, %ymm0
+; AVX2-NEXT:    vpcmpgtw %ymm2, %ymm1, %ymm1
+; AVX2-NEXT:    vpacksswb %ymm1, %ymm0, %ymm0
+; AVX2-NEXT:    vpermq {{.*#+}} ymm0 = ymm0[0,2,1,3]
+; AVX2-NEXT:    vpmovmskb %ymm0, %eax
+; AVX2-NEXT:    vzeroupper
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_icmp_v32i16_v16i16:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinserti64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vpcmpgtw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %zmm0, %k0
+; AVX512-NEXT:    kmovd %k0, %eax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = icmp sgt <16 x i16> %a0, splat (i16 5)
+  %v1 = icmp sgt <16 x i16> %a1, splat (i16 5)
+  %v = shufflevector <16 x i1> %v0, <16 x i1> %v1, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  %r = bitcast <32 x i1> %v to i32
+  ret i32 %r
+}
+
+define i64 @concat_icmp_v64i8_v32i8(<32 x i8> %a0, <32 x i8> %a1) {
+; SSE-LABEL: concat_icmp_v64i8_v32i8:
+; SSE:       # %bb.0:
+; SSE-NEXT:    movdqa {{.*#+}} xmm4 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
+; SSE-NEXT:    movdqa %xmm4, %xmm5
+; SSE-NEXT:    pcmpgtb %xmm0, %xmm5
+; SSE-NEXT:    pmovmskb %xmm5, %eax
+; SSE-NEXT:    movdqa %xmm4, %xmm0
+; SSE-NEXT:    pcmpgtb %xmm1, %xmm0
+; SSE-NEXT:    pmovmskb %xmm0, %ecx
+; SSE-NEXT:    shll $16, %ecx
+; SSE-NEXT:    orl %eax, %ecx
+; SSE-NEXT:    movdqa %xmm4, %xmm0
+; SSE-NEXT:    pcmpgtb %xmm2, %xmm0
+; SSE-NEXT:    pmovmskb %xmm0, %edx
+; SSE-NEXT:    pcmpgtb %xmm3, %xmm4
+; SSE-NEXT:    pmovmskb %xmm4, %eax
+; SSE-NEXT:    shll $16, %eax
+; SSE-NEXT:    orl %edx, %eax
+; SSE-NEXT:    shlq $32, %rax
+; SSE-NEXT:    orq %rcx, %rax
+; SSE-NEXT:    retq
+;
+; AVX1-LABEL: concat_icmp_v64i8_v32i8:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vbroadcastss {{.*#+}} xmm2 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
+; AVX1-NEXT:    vpcmpgtb %xmm0, %xmm2, %xmm3
+; AVX1-NEXT:    vpmovmskb %xmm3, %eax
+; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm0
+; AVX1-NEXT:    vpcmpgtb %xmm0, %xmm2, %xmm0
+; AVX1-NEXT:    vpmovmskb %xmm0, %ecx
+; AVX1-NEXT:    shll $16, %ecx
+; AVX1-NEXT:    orl %eax, %ecx
+; AVX1-NEXT:    vpcmpgtb %xmm1, %xmm2, %xmm0
+; AVX1-NEXT:    vpmovmskb %xmm0, %edx
+; AVX1-NEXT:    vextractf128 $1, %ymm1, %xmm0
+; AVX1-NEXT:    vpcmpgtb %xmm0, %xmm2, %xmm0
+; AVX1-NEXT:    vpmovmskb %xmm0, %eax
+; AVX1-NEXT:    shll $16, %eax
+; AVX1-NEXT:    orl %edx, %eax
+; AVX1-NEXT:    shlq $32, %rax
+; AVX1-NEXT:    orq %rcx, %rax
+; AVX1-NEXT:    vzeroupper
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_icmp_v64i8_v32i8:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpbroadcastd {{.*#+}} ymm2 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
+; AVX2-NEXT:    vpcmpgtb %ymm0, %ymm2, %ymm0
+; AVX2-NEXT:    vpmovmskb %ymm0, %ecx
+; AVX2-NEXT:    vpcmpgtb %ymm1, %ymm2, %ymm0
+; AVX2-NEXT:    vpmovmskb %ymm0, %eax
+; AVX2-NEXT:    shlq $32, %rax
+; AVX2-NEXT:    orq %rcx, %rax
+; AVX2-NEXT:    vzeroupper
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_icmp_v64i8_v32i8:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinserti64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vpcmpltb {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %zmm0, %k0
+; AVX512-NEXT:    kmovq %k0, %rax
+; AVX512-NEXT:    vzeroupper
+; AVX512-NEXT:    retq
+  %v0 = icmp slt <32 x i8> %a0, splat (i8 1)
+  %v1 = icmp slt <32 x i8> %a1, splat (i8 1)
+  %v = shufflevector <32 x i1> %v0, <32 x i1> %v1, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
+  %r = bitcast <64 x i1> %v to i64
+  ret i64 %r
+}
diff --git a/llvm/test/CodeGen/X86/combine-rcp.ll b/llvm/test/CodeGen/X86/combine-rcp.ll
new file mode 100644
index 0000000000000..4647516528bf3
--- /dev/null
+++ b/llvm/test/CodeGen/X86/combine-rcp.ll
@@ -0,0 +1,65 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64    | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge | FileCheck %s --check-prefixes=AVX,AVX1OR2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=AVX,AVX1OR2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX,AVX512
+
+define <8 x float> @concat_rcp_v8f32_v4f32(<4 x float> %a0, <4 x float> %a1) {
+; SSE-LABEL: concat_rcp_v8f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    rcpps %xmm0, %xmm0
+; SSE-NEXT:    rcpps %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_rcp_v8f32_v4f32:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vrcpps %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <4 x float> @llvm.x86.sse.rcp.ps(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.x86.sse.rcp.ps(<4 x float> %a1)
+  %res  = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x float> %res
+}
+
+; Ensure we don't convert rcpps to rcp14ps
+define <16 x float> @concat_rcp_v16f32_v4f32(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2, <4 x float> %a3) {
+; SSE-LABEL: concat_rcp_v16f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    rcpps %xmm0, %xmm0
+; SSE-NEXT:    rcpps %xmm1, %xmm1
+; SSE-NEXT:    rcpps %xmm2, %xmm2
+; SSE-NEXT:    rcpps %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_rcp_v16f32_v4f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1OR2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1OR2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vrcpps %ymm0, %ymm0
+; AVX1OR2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1OR2-NEXT:    vrcpps %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_rcp_v16f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vrcpps %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX512-NEXT:    vrcpps %ymm1, %ymm1
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x float> @llvm.x86.sse.rcp.ps(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.x86.sse.rcp.ps(<4 x float> %a1)
+  %v2 = call <4 x float> @llvm.x86.sse.rcp.ps(<4 x float> %a2)
+  %v3 = call <4 x float> @llvm.x86.sse.rcp.ps(<4 x float> %a3)
+  %r01 = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r23 = shufflevector <4 x float> %v2, <4 x float> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %res  = shufflevector <8 x float> %r01, <8 x float> %r23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
diff --git a/llvm/test/CodeGen/X86/combine-rndscale.ll b/llvm/test/CodeGen/X86/combine-rndscale.ll
new file mode 100644
index 0000000000000..b557dd8106d8e
--- /dev/null
+++ b/llvm/test/CodeGen/X86/combine-rndscale.ll
@@ -0,0 +1,162 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX1
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=AVX,AVX1OR2,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX,AVX512
+
+define <4 x double> @concat_roundpd_v4f64_v2f64(<2 x double> %a0, <2 x double> %a1) {
+; AVX-LABEL: concat_roundpd_v4f64_v2f64:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundpd $4, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <2 x double> @llvm.x86.sse41.round.pd(<2 x double> %a0, i32 4)
+  %v1 = call <2 x double> @llvm.x86.sse41.round.pd(<2 x double> %a1, i32 4)
+  %res  = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  ret <4 x double> %res
+}
+
+define <8 x float> @concat_roundps_v8f32_v4f32(<4 x float> %a0, <4 x float> %a1) {
+; AVX-LABEL: concat_roundps_v8f32_v4f32:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vroundps $4, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> %a0, i32 4)
+  %v1 = call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> %a1, i32 4)
+  %res  = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x float> %res
+}
+
+define <8 x double> @concat_roundpd_v8f64_v2f64(<2 x double> %a0, <2 x double> %a1, <2 x double> %a2, <2 x double> %a3) {
+; AVX1-LABEL: concat_roundpd_v8f64_v2f64:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundpd $4, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundpd $4, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_roundpd_v8f64_v2f64:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundpd $4, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundpd $4, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_roundpd_v8f64_v2f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $4, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <2 x double> @llvm.x86.sse41.round.pd(<2 x double> %a0, i32 4)
+  %v1 = call <2 x double> @llvm.x86.sse41.round.pd(<2 x double> %a1, i32 4)
+  %v2 = call <2 x double> @llvm.x86.sse41.round.pd(<2 x double> %a2, i32 4)
+  %v3 = call <2 x double> @llvm.x86.sse41.round.pd(<2 x double> %a3, i32 4)
+  %r01 = shufflevector <2 x double> %v0, <2 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %r23 = shufflevector <2 x double> %v2, <2 x double> %v3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+  %res  = shufflevector <4 x double> %r01, <4 x double> %r23, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_roundps_v16f32_v4f32(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2, <4 x float> %a3) {
+; AVX1-LABEL: concat_roundps_v16f32_v4f32:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT:    vroundps $4, %ymm0, %ymm0
+; AVX1-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1-NEXT:    vroundps $4, %ymm1, %ymm1
+; AVX1-NEXT:    retq
+;
+; AVX2-LABEL: concat_roundps_v16f32_v4f32:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-NEXT:    vroundps $4, %ymm0, %ymm0
+; AVX2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX2-NEXT:    vroundps $4, %ymm1, %ymm1
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: concat_roundps_v16f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm2
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $4, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> %a0, i32 4)
+  %v1 = call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> %a1, i32 4)
+  %v2 = call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> %a2, i32 4)
+  %v3 = call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> %a3, i32 4)
+  %r01 = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r23 = shufflevector <4 x float> %v2, <4 x float> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %res  = shufflevector <8 x float> %r01, <8 x float> %r23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
+
+define <8 x double> @concat_roundpd_v8f64_v4f64(<4 x double> %a0, <4 x double> %a1) {
+; AVX1OR2-LABEL: concat_roundpd_v8f64_v4f64:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundpd $4, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundpd $4, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_roundpd_v8f64_v4f64:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscalepd $4, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x double> @llvm.x86.avx.round.pd.256(<4 x double> %a0, i32 4)
+  %v1 = call <4 x double> @llvm.x86.avx.round.pd.256(<4 x double> %a1, i32 4)
+  %res  = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x double> %res
+}
+
+define <16 x float> @concat_roundps_v16f32_v8f32(<8 x float> %a0, <8 x float> %a1) {
+; AVX1OR2-LABEL: concat_roundps_v16f32_v8f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    vroundps $4, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vroundps $4, %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_roundps_v16f32_v8f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    vrndscaleps $4, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <8 x float> @llvm.x86.avx.round.ps.256(<8 x float> %a0, i32 4)
+  %v1 = call <8 x float> @llvm.x86.avx.round.ps.256(<8 x float> %a1, i32 4)
+  %res  = shufflevector <8 x float> %v0, <8 x float> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
+
+; negative test - rounding mode mismatch
+define <8 x float> @concat_roundps_v8f32_v4f32_mismatch(<4 x float> %a0, <4 x float> %a1) {
+; AVX-LABEL: concat_roundps_v8f32_v4f32_mismatch:
+; AVX:       # %bb.0:
+; AVX-NEXT:    vroundps $0, %xmm0, %xmm0
+; AVX-NEXT:    vroundps $4, %xmm1, %xmm1
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> %a0, i32 0)
+  %v1 = call <4 x float> @llvm.x86.sse41.round.ps(<4 x float> %a1, i32 4)
+  %res  = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x float> %res
+}
diff --git a/llvm/test/CodeGen/X86/combine-rsqrt.ll b/llvm/test/CodeGen/X86/combine-rsqrt.ll
new file mode 100644
index 0000000000000..b373458654419
--- /dev/null
+++ b/llvm/test/CodeGen/X86/combine-rsqrt.ll
@@ -0,0 +1,65 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64    | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=sandybridge | FileCheck %s --check-prefixes=AVX,AVX1OR2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=AVX,AVX1OR2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX,AVX512
+
+define <8 x float> @concat_rsqrt_v8f32_v4f32(<4 x float> %a0, <4 x float> %a1) {
+; SSE-LABEL: concat_rsqrt_v8f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    rsqrtps %xmm0, %xmm0
+; SSE-NEXT:    rsqrtps %xmm1, %xmm1
+; SSE-NEXT:    retq
+;
+; AVX-LABEL: concat_rsqrt_v8f32_v4f32:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX-NEXT:    vrsqrtps %ymm0, %ymm0
+; AVX-NEXT:    retq
+  %v0 = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %a1)
+  %res  = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x float> %res
+}
+
+; Ensure we don't convert rsqrtps to rsqrt14ps
+define <16 x float> @concat_rsqrt_v16f32_v4f32(<4 x float> %a0, <4 x float> %a1, <4 x float> %a2, <4 x float> %a3) {
+; SSE-LABEL: concat_rsqrt_v16f32_v4f32:
+; SSE:       # %bb.0:
+; SSE-NEXT:    rsqrtps %xmm0, %xmm0
+; SSE-NEXT:    rsqrtps %xmm1, %xmm1
+; SSE-NEXT:    rsqrtps %xmm2, %xmm2
+; SSE-NEXT:    rsqrtps %xmm3, %xmm3
+; SSE-NEXT:    retq
+;
+; AVX1OR2-LABEL: concat_rsqrt_v16f32_v4f32:
+; AVX1OR2:       # %bb.0:
+; AVX1OR2-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX1OR2-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX1OR2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1OR2-NEXT:    vrsqrtps %ymm0, %ymm0
+; AVX1OR2-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX1OR2-NEXT:    vrsqrtps %ymm1, %ymm1
+; AVX1OR2-NEXT:    retq
+;
+; AVX512-LABEL: concat_rsqrt_v16f32_v4f32:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    # kill: def $xmm2 killed $xmm2 def $ymm2
+; AVX512-NEXT:    # kill: def $xmm0 killed $xmm0 def $ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX512-NEXT:    vrsqrtps %ymm0, %ymm0
+; AVX512-NEXT:    vinsertf128 $1, %xmm3, %ymm2, %ymm1
+; AVX512-NEXT:    vrsqrtps %ymm1, %ymm1
+; AVX512-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512-NEXT:    retq
+  %v0 = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %a0)
+  %v1 = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %a1)
+  %v2 = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %a2)
+  %v3 = call <4 x float> @llvm.x86.sse.rsqrt.ps(<4 x float> %a3)
+  %r01 = shufflevector <4 x float> %v0, <4 x float> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %r23 = shufflevector <4 x float> %v2, <4 x float> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %res  = shufflevector <8 x float> %r01, <8 x float> %r23, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x float> %res
+}
diff --git a/llvm/test/CodeGen/X86/dag-combine-counter.ll b/llvm/test/CodeGen/X86/dag-combine-counter.ll
index 4cc3c71b2328c..9b565860e2726 100644
--- a/llvm/test/CodeGen/X86/dag-combine-counter.ll
+++ b/llvm/test/CodeGen/X86/dag-combine-counter.ll
@@ -1,8 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
 ; RUN: llc -mtriple=x86_64-- -debug-counter=dagcombine=0-5 < %s | FileCheck %s
 
-; REQUIRES: asserts
-
 define i32 @test(i32 %x) {
 ; CHECK-LABEL: test:
 ; CHECK:       # %bb.0:
diff --git a/llvm/test/CodeGen/X86/fmaxnum.ll b/llvm/test/CodeGen/X86/fmaxnum.ll
index 150bef01bdbe0..6a03628d9f078 100644
--- a/llvm/test/CodeGen/X86/fmaxnum.ll
+++ b/llvm/test/CodeGen/X86/fmaxnum.ll
@@ -676,15 +676,44 @@ define float @test_maxnum_neg_inf_nnan(float %x, float %y) nounwind {
 
 ; Test SNaN quieting
 define float @test_maxnum_snan(float %x) {
-; SSE-LABEL: test_maxnum_snan:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movss {{.*#+}} xmm0 = [NaN,0.0E+0,0.0E+0,0.0E+0]
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_maxnum_snan:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movss {{.*#+}} xmm2 = [NaN,0.0E+0,0.0E+0,0.0E+0]
+; SSE2-NEXT:    movaps %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andps %xmm2, %xmm3
+; SSE2-NEXT:    maxss %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
+; SSE2-NEXT:    retq
 ;
-; AVX-LABEL: test_maxnum_snan:
-; AVX:       # %bb.0:
-; AVX-NEXT:    vmovss {{.*#+}} xmm0 = [NaN,0.0E+0,0.0E+0,0.0E+0]
-; AVX-NEXT:    retq
+; SSE4-LABEL: test_maxnum_snan:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movss {{.*#+}} xmm1 = [NaN,0.0E+0,0.0E+0,0.0E+0]
+; SSE4-NEXT:    maxss %xmm0, %xmm1
+; SSE4-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE4-NEXT:    blendvps %xmm0, {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE4-NEXT:    movaps %xmm1, %xmm0
+; SSE4-NEXT:    retq
+;
+; AVX1-LABEL: test_maxnum_snan:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vmovss {{.*#+}} xmm1 = [NaN,0.0E+0,0.0E+0,0.0E+0]
+; AVX1-NEXT:    vmaxss %xmm0, %xmm1, %xmm1
+; AVX1-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX1-NEXT:    vblendvps %xmm0, {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm0
+; AVX1-NEXT:    retq
+;
+; AVX512-LABEL: test_maxnum_snan:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vmovss {{.*#+}} xmm2 = [NaN,0.0E+0,0.0E+0,0.0E+0]
+; AVX512-NEXT:    vmaxss %xmm0, %xmm2, %xmm1
+; AVX512-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovss %xmm2, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vmovaps %xmm1, %xmm0
+; AVX512-NEXT:    retq
   %r = call float @llvm.maxnum.f32(float 0x7ff4000000000000, float %x)
   ret float %r
 }
diff --git a/llvm/test/CodeGen/X86/fminnum.ll b/llvm/test/CodeGen/X86/fminnum.ll
index 4aa1a618be758..5c882c99d4f14 100644
--- a/llvm/test/CodeGen/X86/fminnum.ll
+++ b/llvm/test/CodeGen/X86/fminnum.ll
@@ -676,15 +676,44 @@ define float @test_minnum_inf_nnan(float %x, float %y) nounwind {
 
 ; Test SNaN quieting
 define float @test_minnum_snan(float %x) {
-; SSE-LABEL: test_minnum_snan:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movss {{.*#+}} xmm0 = [NaN,0.0E+0,0.0E+0,0.0E+0]
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_minnum_snan:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movss {{.*#+}} xmm2 = [NaN,0.0E+0,0.0E+0,0.0E+0]
+; SSE2-NEXT:    movaps %xmm0, %xmm1
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm3
+; SSE2-NEXT:    andps %xmm2, %xmm3
+; SSE2-NEXT:    minss %xmm0, %xmm2
+; SSE2-NEXT:    andnps %xmm2, %xmm1
+; SSE2-NEXT:    orps %xmm3, %xmm1
+; SSE2-NEXT:    movaps %xmm1, %xmm0
+; SSE2-NEXT:    retq
 ;
-; AVX-LABEL: test_minnum_snan:
-; AVX:       # %bb.0:
-; AVX-NEXT:    vmovss {{.*#+}} xmm0 = [NaN,0.0E+0,0.0E+0,0.0E+0]
-; AVX-NEXT:    retq
+; SSE4-LABEL: test_minnum_snan:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movss {{.*#+}} xmm1 = [NaN,0.0E+0,0.0E+0,0.0E+0]
+; SSE4-NEXT:    minss %xmm0, %xmm1
+; SSE4-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE4-NEXT:    blendvps %xmm0, {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE4-NEXT:    movaps %xmm1, %xmm0
+; SSE4-NEXT:    retq
+;
+; AVX1-LABEL: test_minnum_snan:
+; AVX1:       # %bb.0:
+; AVX1-NEXT:    vmovss {{.*#+}} xmm1 = [NaN,0.0E+0,0.0E+0,0.0E+0]
+; AVX1-NEXT:    vminss %xmm0, %xmm1, %xmm1
+; AVX1-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm0
+; AVX1-NEXT:    vblendvps %xmm0, {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm0
+; AVX1-NEXT:    retq
+;
+; AVX512-LABEL: test_minnum_snan:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vmovss {{.*#+}} xmm2 = [NaN,0.0E+0,0.0E+0,0.0E+0]
+; AVX512-NEXT:    vminss %xmm0, %xmm2, %xmm1
+; AVX512-NEXT:    vcmpunordss %xmm0, %xmm0, %k1
+; AVX512-NEXT:    vmovss %xmm2, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT:    vmovaps %xmm1, %xmm0
+; AVX512-NEXT:    retq
   %r = call float @llvm.minnum.f32(float 0x7ff4000000000000, float %x)
   ret float %r
 }
diff --git a/llvm/test/CodeGen/X86/kmov.ll b/llvm/test/CodeGen/X86/kmov.ll
index 8b1e69a97d545..5d216a218cf9b 100644
--- a/llvm/test/CodeGen/X86/kmov.ll
+++ b/llvm/test/CodeGen/X86/kmov.ll
@@ -477,16 +477,13 @@ define <32 x i1> @invert_i64_mask_extract_32(i64 %mask) {
 ; X64-AVX512-LABEL: invert_i64_mask_extract_32:
 ; X64-AVX512:       # %bb.0:
 ; X64-AVX512-NEXT:    kmovq %rdi, %k0
-; X64-AVX512-NEXT:    knotb %k0, %k1
-; X64-AVX512-NEXT:    kshiftrd $8, %k0, %k2
-; X64-AVX512-NEXT:    knotb %k2, %k2
-; X64-AVX512-NEXT:    kunpckbw %k1, %k2, %k1
+; X64-AVX512-NEXT:    kshiftrd $8, %k0, %k1
+; X64-AVX512-NEXT:    kunpckbw %k0, %k1, %k1
 ; X64-AVX512-NEXT:    kshiftrd $16, %k0, %k2
-; X64-AVX512-NEXT:    knotb %k2, %k2
 ; X64-AVX512-NEXT:    kshiftrd $24, %k0, %k0
-; X64-AVX512-NEXT:    knotb %k0, %k0
 ; X64-AVX512-NEXT:    kunpckbw %k2, %k0, %k0
 ; X64-AVX512-NEXT:    kunpckwd %k1, %k0, %k0
+; X64-AVX512-NEXT:    knotd %k0, %k0
 ; X64-AVX512-NEXT:    vpmovm2b %k0, %ymm0
 ; X64-AVX512-NEXT:    retq
 ;
@@ -495,18 +492,16 @@ define <32 x i1> @invert_i64_mask_extract_32(i64 %mask) {
 ; X64-KNL-NEXT:    movl %edi, %eax
 ; X64-KNL-NEXT:    shrl $16, %eax
 ; X64-KNL-NEXT:    kmovw %eax, %k0
-; X64-KNL-NEXT:    knotw %k0, %k0
 ; X64-KNL-NEXT:    movl %edi, %eax
 ; X64-KNL-NEXT:    shrl $24, %eax
 ; X64-KNL-NEXT:    kmovw %eax, %k1
-; X64-KNL-NEXT:    knotw %k1, %k1
-; X64-KNL-NEXT:    kunpckbw %k0, %k1, %k1
+; X64-KNL-NEXT:    kunpckbw %k0, %k1, %k0
+; X64-KNL-NEXT:    knotw %k0, %k1
 ; X64-KNL-NEXT:    kmovw %edi, %k0
-; X64-KNL-NEXT:    knotw %k0, %k0
 ; X64-KNL-NEXT:    shrl $8, %edi
 ; X64-KNL-NEXT:    kmovw %edi, %k2
-; X64-KNL-NEXT:    knotw %k2, %k2
-; X64-KNL-NEXT:    kunpckbw %k0, %k2, %k2
+; X64-KNL-NEXT:    kunpckbw %k0, %k2, %k0
+; X64-KNL-NEXT:    knotw %k0, %k2
 ; X64-KNL-NEXT:    vpternlogd {{.*#+}} zmm0 {%k2} {z} = -1
 ; X64-KNL-NEXT:    vpmovdb %zmm0, %xmm0
 ; X64-KNL-NEXT:    vpternlogd {{.*#+}} zmm1 {%k1} {z} = -1
@@ -586,27 +581,20 @@ define <64 x i1> @invert_i64_mask_extract_64(i64 %mask) {
 ; X64-AVX512:       # %bb.0:
 ; X64-AVX512-NEXT:    kmovq %rdi, %k0
 ; X64-AVX512-NEXT:    kshiftrq $32, %k0, %k1
-; X64-AVX512-NEXT:    knotb %k1, %k1
 ; X64-AVX512-NEXT:    kshiftrq $40, %k0, %k2
-; X64-AVX512-NEXT:    knotb %k2, %k2
 ; X64-AVX512-NEXT:    kunpckbw %k1, %k2, %k1
 ; X64-AVX512-NEXT:    kshiftrq $48, %k0, %k2
-; X64-AVX512-NEXT:    knotb %k2, %k2
 ; X64-AVX512-NEXT:    kshiftrq $56, %k0, %k3
-; X64-AVX512-NEXT:    knotb %k3, %k3
 ; X64-AVX512-NEXT:    kunpckbw %k2, %k3, %k2
 ; X64-AVX512-NEXT:    kunpckwd %k1, %k2, %k1
-; X64-AVX512-NEXT:    knotb %k0, %k2
-; X64-AVX512-NEXT:    kshiftrd $8, %k0, %k3
-; X64-AVX512-NEXT:    knotb %k3, %k3
-; X64-AVX512-NEXT:    kunpckbw %k2, %k3, %k2
+; X64-AVX512-NEXT:    kshiftrd $8, %k0, %k2
+; X64-AVX512-NEXT:    kunpckbw %k0, %k2, %k2
 ; X64-AVX512-NEXT:    kshiftrd $16, %k0, %k3
-; X64-AVX512-NEXT:    knotb %k3, %k3
 ; X64-AVX512-NEXT:    kshiftrd $24, %k0, %k0
-; X64-AVX512-NEXT:    knotb %k0, %k0
 ; X64-AVX512-NEXT:    kunpckbw %k3, %k0, %k0
 ; X64-AVX512-NEXT:    kunpckwd %k2, %k0, %k0
 ; X64-AVX512-NEXT:    kunpckdq %k0, %k1, %k0
+; X64-AVX512-NEXT:    knotq %k0, %k0
 ; X64-AVX512-NEXT:    vpmovm2b %k0, %zmm0
 ; X64-AVX512-NEXT:    retq
 ;
@@ -614,38 +602,34 @@ define <64 x i1> @invert_i64_mask_extract_64(i64 %mask) {
 ; X64-KNL:       # %bb.0:
 ; X64-KNL-NEXT:    movq %rdi, %rax
 ; X64-KNL-NEXT:    kmovw %esi, %k0
-; X64-KNL-NEXT:    knotw %k0, %k0
 ; X64-KNL-NEXT:    movl %esi, %ecx
 ; X64-KNL-NEXT:    shrl $8, %ecx
 ; X64-KNL-NEXT:    kmovw %ecx, %k1
-; X64-KNL-NEXT:    knotw %k1, %k1
 ; X64-KNL-NEXT:    kunpckbw %k0, %k1, %k0
+; X64-KNL-NEXT:    knotw %k0, %k0
 ; X64-KNL-NEXT:    movl %esi, %ecx
 ; X64-KNL-NEXT:    shrl $16, %ecx
 ; X64-KNL-NEXT:    kmovw %ecx, %k1
-; X64-KNL-NEXT:    knotw %k1, %k1
 ; X64-KNL-NEXT:    movl %esi, %ecx
 ; X64-KNL-NEXT:    shrl $24, %ecx
 ; X64-KNL-NEXT:    kmovw %ecx, %k2
-; X64-KNL-NEXT:    knotw %k2, %k2
 ; X64-KNL-NEXT:    kunpckbw %k1, %k2, %k1
+; X64-KNL-NEXT:    knotw %k1, %k1
 ; X64-KNL-NEXT:    movq %rsi, %rcx
 ; X64-KNL-NEXT:    shrq $32, %rcx
 ; X64-KNL-NEXT:    kmovw %ecx, %k2
-; X64-KNL-NEXT:    knotw %k2, %k2
 ; X64-KNL-NEXT:    movq %rsi, %rcx
 ; X64-KNL-NEXT:    shrq $40, %rcx
 ; X64-KNL-NEXT:    kmovw %ecx, %k3
-; X64-KNL-NEXT:    knotw %k3, %k3
 ; X64-KNL-NEXT:    kunpckbw %k2, %k3, %k2
+; X64-KNL-NEXT:    knotw %k2, %k2
 ; X64-KNL-NEXT:    movq %rsi, %rcx
 ; X64-KNL-NEXT:    shrq $48, %rcx
 ; X64-KNL-NEXT:    kmovw %ecx, %k3
-; X64-KNL-NEXT:    knotw %k3, %k3
 ; X64-KNL-NEXT:    shrq $56, %rsi
 ; X64-KNL-NEXT:    kmovw %esi, %k4
-; X64-KNL-NEXT:    knotw %k4, %k4
 ; X64-KNL-NEXT:    kunpckbw %k3, %k4, %k3
+; X64-KNL-NEXT:    knotw %k3, %k3
 ; X64-KNL-NEXT:    kmovw %k3, 6(%rdi)
 ; X64-KNL-NEXT:    kmovw %k2, 4(%rdi)
 ; X64-KNL-NEXT:    kmovw %k1, 2(%rdi)
diff --git a/llvm/test/CodeGen/X86/masked_store_trunc_ssat.ll b/llvm/test/CodeGen/X86/masked_store_trunc_ssat.ll
index 18d394e1281b4..57b0577ac7cc9 100644
--- a/llvm/test/CodeGen/X86/masked_store_trunc_ssat.ll
+++ b/llvm/test/CodeGen/X86/masked_store_trunc_ssat.ll
@@ -4,9 +4,9 @@
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx     | FileCheck %s --check-prefixes=AVX,AVX1
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx2    | FileCheck %s --check-prefixes=AVX,AVX2
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512f | FileCheck %s --check-prefixes=AVX512,AVX512F
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512vl | FileCheck %s --check-prefixes=AVX512VL,AVX512FVL
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512vl | FileCheck %s --check-prefixes=AVX512FVL
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512bw | FileCheck %s --check-prefixes=AVX512,AVX512BW
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512vl,avx512bw | FileCheck %s --check-prefixes=AVX512VL,AVX512BWVL
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512vl,avx512bw | FileCheck %s --check-prefixes=AVX512BWVL
 
 define void @truncstore_v8i64_v8i32(<8 x i64> %x, ptr %p, <8 x i32> %mask) {
 ; SSE2-LABEL: truncstore_v8i64_v8i32:
@@ -350,14 +350,21 @@ define void @truncstore_v8i64_v8i32(<8 x i64> %x, ptr %p, <8 x i32> %mask) {
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
 ;
-; AVX512VL-LABEL: truncstore_v8i64_v8i32:
-; AVX512VL:       # %bb.0:
-; AVX512VL-NEXT:    vptestmd %ymm1, %ymm1, %k1
-; AVX512VL-NEXT:    vpminsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %zmm0, %zmm0
-; AVX512VL-NEXT:    vpmaxsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %zmm0, %zmm0
-; AVX512VL-NEXT:    vpmovqd %zmm0, (%rdi) {%k1}
-; AVX512VL-NEXT:    vzeroupper
-; AVX512VL-NEXT:    retq
+; AVX512FVL-LABEL: truncstore_v8i64_v8i32:
+; AVX512FVL:       # %bb.0:
+; AVX512FVL-NEXT:    vptestmd %ymm1, %ymm1, %k1
+; AVX512FVL-NEXT:    vpminsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %zmm0, %zmm0
+; AVX512FVL-NEXT:    vpmaxsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %zmm0, %zmm0
+; AVX512FVL-NEXT:    vpmovqd %zmm0, (%rdi) {%k1}
+; AVX512FVL-NEXT:    vzeroupper
+; AVX512FVL-NEXT:    retq
+;
+; AVX512BWVL-LABEL: truncstore_v8i64_v8i32:
+; AVX512BWVL:       # %bb.0:
+; AVX512BWVL-NEXT:    vptestmd %ymm1, %ymm1, %k1
+; AVX512BWVL-NEXT:    vpmovsqd %zmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vzeroupper
+; AVX512BWVL-NEXT:    retq
   %a = icmp ne <8 x i32> %mask, zeroinitializer
   %b = icmp slt <8 x i64> %x, <i64 2147483647, i64 2147483647, i64 2147483647, i64 2147483647, i64 2147483647, i64 2147483647, i64 2147483647, i64 2147483647>
   %c = select <8 x i1> %b, <8 x i64> %x, <8 x i64> <i64 2147483647, i64 2147483647, i64 2147483647, i64 2147483647, i64 2147483647, i64 2147483647, i64 2147483647, i64 2147483647>
@@ -964,9 +971,7 @@ define void @truncstore_v8i64_v8i16(<8 x i64> %x, ptr %p, <8 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v8i64_v8i16:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %ymm1, %ymm1, %k1
-; AVX512BWVL-NEXT:    vpminsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmaxsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmovqw %zmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovsqw %zmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <8 x i32> %mask, zeroinitializer
@@ -1572,9 +1577,7 @@ define void @truncstore_v8i64_v8i8(<8 x i64> %x, ptr %p, <8 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v8i64_v8i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %ymm1, %ymm1, %k1
-; AVX512BWVL-NEXT:    vpminsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmaxsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmovqb %zmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovsqb %zmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <8 x i32> %mask, zeroinitializer
@@ -1788,14 +1791,21 @@ define void @truncstore_v4i64_v4i32(<4 x i64> %x, ptr %p, <4 x i32> %mask) {
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
 ;
-; AVX512VL-LABEL: truncstore_v4i64_v4i32:
-; AVX512VL:       # %bb.0:
-; AVX512VL-NEXT:    vptestmd %xmm1, %xmm1, %k1
-; AVX512VL-NEXT:    vpminsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %ymm0, %ymm0
-; AVX512VL-NEXT:    vpmaxsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %ymm0, %ymm0
-; AVX512VL-NEXT:    vpmovqd %ymm0, (%rdi) {%k1}
-; AVX512VL-NEXT:    vzeroupper
-; AVX512VL-NEXT:    retq
+; AVX512FVL-LABEL: truncstore_v4i64_v4i32:
+; AVX512FVL:       # %bb.0:
+; AVX512FVL-NEXT:    vptestmd %xmm1, %xmm1, %k1
+; AVX512FVL-NEXT:    vpminsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %ymm0, %ymm0
+; AVX512FVL-NEXT:    vpmaxsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %ymm0, %ymm0
+; AVX512FVL-NEXT:    vpmovqd %ymm0, (%rdi) {%k1}
+; AVX512FVL-NEXT:    vzeroupper
+; AVX512FVL-NEXT:    retq
+;
+; AVX512BWVL-LABEL: truncstore_v4i64_v4i32:
+; AVX512BWVL:       # %bb.0:
+; AVX512BWVL-NEXT:    vptestmd %xmm1, %xmm1, %k1
+; AVX512BWVL-NEXT:    vpmovsqd %ymm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vzeroupper
+; AVX512BWVL-NEXT:    retq
   %a = icmp ne <4 x i32> %mask, zeroinitializer
   %b = icmp slt <4 x i64> %x, <i64 2147483647, i64 2147483647, i64 2147483647, i64 2147483647>
   %c = select <4 x i1> %b, <4 x i64> %x, <4 x i64> <i64 2147483647, i64 2147483647, i64 2147483647, i64 2147483647>
@@ -2141,9 +2151,7 @@ define void @truncstore_v4i64_v4i16(<4 x i64> %x, ptr %p, <4 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v4i64_v4i16:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmaxsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmovqw %ymm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovsqw %ymm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <4 x i32> %mask, zeroinitializer
@@ -2495,9 +2503,7 @@ define void @truncstore_v4i64_v4i8(<4 x i64> %x, ptr %p, <4 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v4i64_v4i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmaxsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmovqb %ymm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovsqb %ymm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <4 x i32> %mask, zeroinitializer
@@ -2641,13 +2647,19 @@ define void @truncstore_v2i64_v2i32(<2 x i64> %x, ptr %p, <2 x i64> %mask) {
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
 ;
-; AVX512VL-LABEL: truncstore_v2i64_v2i32:
-; AVX512VL:       # %bb.0:
-; AVX512VL-NEXT:    vptestmq %xmm1, %xmm1, %k1
-; AVX512VL-NEXT:    vpminsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to2}, %xmm0, %xmm0
-; AVX512VL-NEXT:    vpmaxsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to2}, %xmm0, %xmm0
-; AVX512VL-NEXT:    vpmovqd %xmm0, (%rdi) {%k1}
-; AVX512VL-NEXT:    retq
+; AVX512FVL-LABEL: truncstore_v2i64_v2i32:
+; AVX512FVL:       # %bb.0:
+; AVX512FVL-NEXT:    vptestmq %xmm1, %xmm1, %k1
+; AVX512FVL-NEXT:    vpminsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to2}, %xmm0, %xmm0
+; AVX512FVL-NEXT:    vpmaxsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to2}, %xmm0, %xmm0
+; AVX512FVL-NEXT:    vpmovqd %xmm0, (%rdi) {%k1}
+; AVX512FVL-NEXT:    retq
+;
+; AVX512BWVL-LABEL: truncstore_v2i64_v2i32:
+; AVX512BWVL:       # %bb.0:
+; AVX512BWVL-NEXT:    vptestmq %xmm1, %xmm1, %k1
+; AVX512BWVL-NEXT:    vpmovsqd %xmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    retq
   %a = icmp ne <2 x i64> %mask, zeroinitializer
   %b = icmp slt <2 x i64> %x, <i64 2147483647, i64 2147483647>
   %c = select <2 x i1> %b, <2 x i64> %x, <2 x i64> <i64 2147483647, i64 2147483647>
@@ -2832,9 +2844,7 @@ define void @truncstore_v2i64_v2i16(<2 x i64> %x, ptr %p, <2 x i64> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v2i64_v2i16:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmq %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to2}, %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmaxsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to2}, %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmovqw %xmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovsqw %xmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <2 x i64> %mask, zeroinitializer
   %b = icmp slt <2 x i64> %x, <i64 32767, i64 32767>
@@ -3018,9 +3028,7 @@ define void @truncstore_v2i64_v2i8(<2 x i64> %x, ptr %p, <2 x i64> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v2i64_v2i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmq %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to2}, %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmaxsq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to2}, %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmovqb %xmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovsqb %xmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <2 x i64> %mask, zeroinitializer
   %b = icmp slt <2 x i64> %x, <i64 127, i64 127>
@@ -3816,9 +3824,7 @@ define void @truncstore_v16i32_v16i16(<16 x i32> %x, ptr %p, <16 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v16i32_v16i16:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %zmm1, %zmm1, %k1
-; AVX512BWVL-NEXT:    vpminsd {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to16}, %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmaxsd {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to16}, %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmovdw %zmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovsdw %zmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <16 x i32> %mask, zeroinitializer
@@ -4594,9 +4600,7 @@ define void @truncstore_v16i32_v16i8(<16 x i32> %x, ptr %p, <16 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v16i32_v16i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %zmm1, %zmm1, %k1
-; AVX512BWVL-NEXT:    vpminsd {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to16}, %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmaxsd {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to16}, %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmovdb %zmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovsdb %zmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <16 x i32> %mask, zeroinitializer
@@ -5034,9 +5038,7 @@ define void @truncstore_v8i32_v8i16(<8 x i32> %x, ptr %p, <8 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v8i32_v8i16:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %ymm1, %ymm1, %k1
-; AVX512BWVL-NEXT:    vpminsd {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmaxsd {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmovdw %ymm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovsdw %ymm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <8 x i32> %mask, zeroinitializer
@@ -5473,9 +5475,7 @@ define void @truncstore_v8i32_v8i8(<8 x i32> %x, ptr %p, <8 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v8i32_v8i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %ymm1, %ymm1, %k1
-; AVX512BWVL-NEXT:    vpminsd {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmaxsd {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmovdb %ymm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovsdb %ymm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <8 x i32> %mask, zeroinitializer
@@ -5686,9 +5686,7 @@ define void @truncstore_v4i32_v4i16(<4 x i32> %x, ptr %p, <4 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v4i32_v4i16:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminsd {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmaxsd {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmovdw %xmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovsdw %xmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <4 x i32> %mask, zeroinitializer
   %b = icmp slt <4 x i32> %x, <i32 32767, i32 32767, i32 32767, i32 32767>
@@ -5904,9 +5902,7 @@ define void @truncstore_v4i32_v4i8(<4 x i32> %x, ptr %p, <4 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v4i32_v4i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminsd {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmaxsd {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmovdb %xmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovsdb %xmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <4 x i32> %mask, zeroinitializer
   %b = icmp slt <4 x i32> %x, <i32 127, i32 127, i32 127, i32 127>
@@ -7332,9 +7328,7 @@ define void @truncstore_v32i16_v32i8(<32 x i16> %x, ptr %p, <32 x i8> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v32i16_v32i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmb %ymm1, %ymm1, %k1
-; AVX512BWVL-NEXT:    vpminsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmaxsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmovwb %zmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovswb %zmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <32 x i8> %mask, zeroinitializer
@@ -8083,9 +8077,7 @@ define void @truncstore_v16i16_v16i8(<16 x i16> %x, ptr %p, <16 x i8> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v16i16_v16i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmb %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmaxsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmovwb %ymm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovswb %ymm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <16 x i8> %mask, zeroinitializer
@@ -8445,9 +8437,7 @@ define void @truncstore_v8i16_v8i8(<8 x i16> %x, ptr %p, <8 x i16> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v8i16_v8i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmw %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmaxsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmovwb %xmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovswb %xmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <8 x i16> %mask, zeroinitializer
   %b = icmp slt <8 x i16> %x, <i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127>
diff --git a/llvm/test/CodeGen/X86/masked_store_trunc_usat.ll b/llvm/test/CodeGen/X86/masked_store_trunc_usat.ll
index 4c4b6e78d1f8c..0386d9531723d 100644
--- a/llvm/test/CodeGen/X86/masked_store_trunc_usat.ll
+++ b/llvm/test/CodeGen/X86/masked_store_trunc_usat.ll
@@ -4,9 +4,9 @@
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx     | FileCheck %s --check-prefixes=AVX,AVX1
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx2    | FileCheck %s --check-prefixes=AVX,AVX2
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512f | FileCheck %s --check-prefixes=AVX512,AVX512F
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512vl | FileCheck %s --check-prefixes=AVX512VL,AVX512FVL
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512vl | FileCheck %s --check-prefixes=AVX512FVL
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512bw | FileCheck %s --check-prefixes=AVX512,AVX512BW
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512vl,avx512bw | FileCheck %s --check-prefixes=AVX512VL,AVX512BWVL
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512vl,avx512bw | FileCheck %s --check-prefixes=AVX512BWVL
 
 define void @truncstore_v8i64_v8i32(<8 x i64> %x, ptr %p, <8 x i32> %mask) {
 ; SSE2-LABEL: truncstore_v8i64_v8i32:
@@ -281,13 +281,20 @@ define void @truncstore_v8i64_v8i32(<8 x i64> %x, ptr %p, <8 x i32> %mask) {
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
 ;
-; AVX512VL-LABEL: truncstore_v8i64_v8i32:
-; AVX512VL:       # %bb.0:
-; AVX512VL-NEXT:    vptestmd %ymm1, %ymm1, %k1
-; AVX512VL-NEXT:    vpminuq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %zmm0, %zmm0
-; AVX512VL-NEXT:    vpmovqd %zmm0, (%rdi) {%k1}
-; AVX512VL-NEXT:    vzeroupper
-; AVX512VL-NEXT:    retq
+; AVX512FVL-LABEL: truncstore_v8i64_v8i32:
+; AVX512FVL:       # %bb.0:
+; AVX512FVL-NEXT:    vptestmd %ymm1, %ymm1, %k1
+; AVX512FVL-NEXT:    vpminuq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %zmm0, %zmm0
+; AVX512FVL-NEXT:    vpmovqd %zmm0, (%rdi) {%k1}
+; AVX512FVL-NEXT:    vzeroupper
+; AVX512FVL-NEXT:    retq
+;
+; AVX512BWVL-LABEL: truncstore_v8i64_v8i32:
+; AVX512BWVL:       # %bb.0:
+; AVX512BWVL-NEXT:    vptestmd %ymm1, %ymm1, %k1
+; AVX512BWVL-NEXT:    vpmovusqd %zmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vzeroupper
+; AVX512BWVL-NEXT:    retq
   %a = icmp ne <8 x i32> %mask, zeroinitializer
   %b = icmp ult <8 x i64> %x, <i64 4294967295, i64 4294967295, i64 4294967295, i64 4294967295, i64 4294967295, i64 4294967295, i64 4294967295, i64 4294967295>
   %c = select <8 x i1> %b, <8 x i64> %x, <8 x i64> <i64 4294967295, i64 4294967295, i64 4294967295, i64 4294967295, i64 4294967295, i64 4294967295, i64 4294967295, i64 4294967295>
@@ -829,8 +836,7 @@ define void @truncstore_v8i64_v8i16(<8 x i64> %x, ptr %p, <8 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v8i64_v8i16:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %ymm1, %ymm1, %k1
-; AVX512BWVL-NEXT:    vpminuq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmovqw %zmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovusqw %zmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <8 x i32> %mask, zeroinitializer
@@ -1367,8 +1373,7 @@ define void @truncstore_v8i64_v8i8(<8 x i64> %x, ptr %p, <8 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v8i64_v8i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %ymm1, %ymm1, %k1
-; AVX512BWVL-NEXT:    vpminuq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmovqb %zmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovusqb %zmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <8 x i32> %mask, zeroinitializer
@@ -1547,13 +1552,20 @@ define void @truncstore_v4i64_v4i32(<4 x i64> %x, ptr %p, <4 x i32> %mask) {
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
 ;
-; AVX512VL-LABEL: truncstore_v4i64_v4i32:
-; AVX512VL:       # %bb.0:
-; AVX512VL-NEXT:    vptestmd %xmm1, %xmm1, %k1
-; AVX512VL-NEXT:    vpminuq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %ymm0, %ymm0
-; AVX512VL-NEXT:    vpmovqd %ymm0, (%rdi) {%k1}
-; AVX512VL-NEXT:    vzeroupper
-; AVX512VL-NEXT:    retq
+; AVX512FVL-LABEL: truncstore_v4i64_v4i32:
+; AVX512FVL:       # %bb.0:
+; AVX512FVL-NEXT:    vptestmd %xmm1, %xmm1, %k1
+; AVX512FVL-NEXT:    vpminuq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %ymm0, %ymm0
+; AVX512FVL-NEXT:    vpmovqd %ymm0, (%rdi) {%k1}
+; AVX512FVL-NEXT:    vzeroupper
+; AVX512FVL-NEXT:    retq
+;
+; AVX512BWVL-LABEL: truncstore_v4i64_v4i32:
+; AVX512BWVL:       # %bb.0:
+; AVX512BWVL-NEXT:    vptestmd %xmm1, %xmm1, %k1
+; AVX512BWVL-NEXT:    vpmovusqd %ymm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vzeroupper
+; AVX512BWVL-NEXT:    retq
   %a = icmp ne <4 x i32> %mask, zeroinitializer
   %b = icmp ult <4 x i64> %x, <i64 4294967295, i64 4294967295, i64 4294967295, i64 4294967295>
   %c = select <4 x i1> %b, <4 x i64> %x, <4 x i64> <i64 4294967295, i64 4294967295, i64 4294967295, i64 4294967295>
@@ -1868,8 +1880,7 @@ define void @truncstore_v4i64_v4i16(<4 x i64> %x, ptr %p, <4 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v4i64_v4i16:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminuq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmovqw %ymm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovusqw %ymm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <4 x i32> %mask, zeroinitializer
@@ -2188,8 +2199,7 @@ define void @truncstore_v4i64_v4i8(<4 x i64> %x, ptr %p, <4 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v4i64_v4i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminuq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmovqb %ymm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovusqb %ymm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <4 x i32> %mask, zeroinitializer
@@ -2304,12 +2314,18 @@ define void @truncstore_v2i64_v2i32(<2 x i64> %x, ptr %p, <2 x i64> %mask) {
 ; AVX512-NEXT:    vzeroupper
 ; AVX512-NEXT:    retq
 ;
-; AVX512VL-LABEL: truncstore_v2i64_v2i32:
-; AVX512VL:       # %bb.0:
-; AVX512VL-NEXT:    vptestmq %xmm1, %xmm1, %k1
-; AVX512VL-NEXT:    vpminuq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to2}, %xmm0, %xmm0
-; AVX512VL-NEXT:    vpmovqd %xmm0, (%rdi) {%k1}
-; AVX512VL-NEXT:    retq
+; AVX512FVL-LABEL: truncstore_v2i64_v2i32:
+; AVX512FVL:       # %bb.0:
+; AVX512FVL-NEXT:    vptestmq %xmm1, %xmm1, %k1
+; AVX512FVL-NEXT:    vpminuq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to2}, %xmm0, %xmm0
+; AVX512FVL-NEXT:    vpmovqd %xmm0, (%rdi) {%k1}
+; AVX512FVL-NEXT:    retq
+;
+; AVX512BWVL-LABEL: truncstore_v2i64_v2i32:
+; AVX512BWVL:       # %bb.0:
+; AVX512BWVL-NEXT:    vptestmq %xmm1, %xmm1, %k1
+; AVX512BWVL-NEXT:    vpmovusqd %xmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    retq
   %a = icmp ne <2 x i64> %mask, zeroinitializer
   %b = icmp ult <2 x i64> %x, <i64 4294967295, i64 4294967295>
   %c = select <2 x i1> %b, <2 x i64> %x, <2 x i64> <i64 4294967295, i64 4294967295>
@@ -2470,8 +2486,7 @@ define void @truncstore_v2i64_v2i16(<2 x i64> %x, ptr %p, <2 x i64> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v2i64_v2i16:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmq %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminuq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to2}, %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmovqw %xmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovusqw %xmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <2 x i64> %mask, zeroinitializer
   %b = icmp ult <2 x i64> %x, <i64 65535, i64 65535>
@@ -2630,8 +2645,7 @@ define void @truncstore_v2i64_v2i8(<2 x i64> %x, ptr %p, <2 x i64> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v2i64_v2i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmq %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminuq {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to2}, %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmovqb %xmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovusqb %xmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <2 x i64> %mask, zeroinitializer
   %b = icmp ult <2 x i64> %x, <i64 255, i64 255>
@@ -3457,8 +3471,7 @@ define void @truncstore_v16i32_v16i16(<16 x i32> %x, ptr %p, <16 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v16i32_v16i16:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %zmm1, %zmm1, %k1
-; AVX512BWVL-NEXT:    vpminud {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to16}, %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmovdw %zmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovusdw %zmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <16 x i32> %mask, zeroinitializer
@@ -4273,8 +4286,7 @@ define void @truncstore_v16i32_v16i8(<16 x i32> %x, ptr %p, <16 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v16i32_v16i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %zmm1, %zmm1, %k1
-; AVX512BWVL-NEXT:    vpminud {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to16}, %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmovdb %zmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovusdb %zmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <16 x i32> %mask, zeroinitializer
@@ -4737,8 +4749,7 @@ define void @truncstore_v8i32_v8i16(<8 x i32> %x, ptr %p, <8 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v8i32_v8i16:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %ymm1, %ymm1, %k1
-; AVX512BWVL-NEXT:    vpminud {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmovdw %ymm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovusdw %ymm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <8 x i32> %mask, zeroinitializer
@@ -5194,8 +5205,7 @@ define void @truncstore_v8i32_v8i8(<8 x i32> %x, ptr %p, <8 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v8i32_v8i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %ymm1, %ymm1, %k1
-; AVX512BWVL-NEXT:    vpminud {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmovdb %ymm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovusdb %ymm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <8 x i32> %mask, zeroinitializer
@@ -5455,8 +5465,7 @@ define void @truncstore_v4i32_v4i16(<4 x i32> %x, ptr %p, <4 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v4i32_v4i16:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminud {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmovdw %xmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovusdw %xmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <4 x i32> %mask, zeroinitializer
   %b = icmp ult <4 x i32> %x, <i32 65535, i32 65535, i32 65535, i32 65535>
@@ -5717,8 +5726,7 @@ define void @truncstore_v4i32_v4i8(<4 x i32> %x, ptr %p, <4 x i32> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v4i32_v4i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmd %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminud {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to4}, %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmovdb %xmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovusdb %xmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <4 x i32> %mask, zeroinitializer
   %b = icmp ult <4 x i32> %x, <i32 255, i32 255, i32 255, i32 255>
@@ -7171,8 +7179,7 @@ define void @truncstore_v32i16_v32i8(<32 x i16> %x, ptr %p, <32 x i8> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v32i16_v32i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmb %ymm1, %ymm1, %k1
-; AVX512BWVL-NEXT:    vpminuw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %zmm0, %zmm0
-; AVX512BWVL-NEXT:    vpmovwb %zmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovuswb %zmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <32 x i8> %mask, zeroinitializer
@@ -7935,8 +7942,7 @@ define void @truncstore_v16i16_v16i8(<16 x i16> %x, ptr %p, <16 x i8> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v16i16_v16i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmb %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminuw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
-; AVX512BWVL-NEXT:    vpmovwb %ymm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovuswb %ymm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    vzeroupper
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <16 x i8> %mask, zeroinitializer
@@ -8302,8 +8308,7 @@ define void @truncstore_v8i16_v8i8(<8 x i16> %x, ptr %p, <8 x i16> %mask) {
 ; AVX512BWVL-LABEL: truncstore_v8i16_v8i8:
 ; AVX512BWVL:       # %bb.0:
 ; AVX512BWVL-NEXT:    vptestmw %xmm1, %xmm1, %k1
-; AVX512BWVL-NEXT:    vpminuw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
-; AVX512BWVL-NEXT:    vpmovwb %xmm0, (%rdi) {%k1}
+; AVX512BWVL-NEXT:    vpmovuswb %xmm0, (%rdi) {%k1}
 ; AVX512BWVL-NEXT:    retq
   %a = icmp ne <8 x i16> %mask, zeroinitializer
   %b = icmp ult <8 x i16> %x, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
diff --git a/llvm/test/CodeGen/X86/opt-pipeline.ll b/llvm/test/CodeGen/X86/opt-pipeline.ll
index 276232e27c000..9f08658e067ab 100644
--- a/llvm/test/CodeGen/X86/opt-pipeline.ll
+++ b/llvm/test/CodeGen/X86/opt-pipeline.ll
@@ -13,9 +13,11 @@
 
 ; CHECK-LABEL: Pass Arguments:
 ; CHECK-NEXT: Target Library Information
+; CHECK-NEXT: Runtime Library Function Analysis
 ; CHECK-NEXT: Target Pass Configuration
 ; CHECK-NEXT: Machine Module Information
 ; CHECK-NEXT: Target Transform Information
+; CHECK-NEXT: Library Function Lowering Analysis
 ; CHECK-NEXT: Assumption Cache Tracker
 ; CHECK-NEXT: Type-Based Alias Analysis
 ; CHECK-NEXT: Scoped NoAlias Alias Analysis
diff --git a/llvm/test/CodeGen/X86/pr114360.ll b/llvm/test/CodeGen/X86/pr114360.ll
index cf510854cce66..41cf06a77571d 100644
--- a/llvm/test/CodeGen/X86/pr114360.ll
+++ b/llvm/test/CodeGen/X86/pr114360.ll
@@ -1,5 +1,4 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; REQUIRES: asserts
 ; RUN: llc < %s -mtriple=x86_64-- -debug-counter=dagcombine=0 | FileCheck %s
 
 ; BUG: shrinkAndImmediate folds away the AND after the ZEXT has already been folded away to SUBREG_TO_REG losing implicit zext.
diff --git a/llvm/test/CodeGen/X86/prefer-avx256-mask-extend.ll b/llvm/test/CodeGen/X86/prefer-avx256-mask-extend.ll
index ad08eaffab383..7e00d679d56b2 100644
--- a/llvm/test/CodeGen/X86/prefer-avx256-mask-extend.ll
+++ b/llvm/test/CodeGen/X86/prefer-avx256-mask-extend.ll
@@ -43,25 +43,23 @@ define <16 x i8> @testv16i1_sext_v16i8(ptr %p, ptr %q) {
 ; AVX256-LABEL: testv16i1_sext_v16i8:
 ; AVX256:       # %bb.0:
 ; AVX256-NEXT:    vmovdqa (%rdi), %ymm0
-; AVX256-NEXT:    vptestnmd %ymm0, %ymm0, %k1
-; AVX256-NEXT:    vmovdqa (%rsi), %ymm0
-; AVX256-NEXT:    vptestnmd %ymm0, %ymm0, %k2
+; AVX256-NEXT:    vinserti64x4 $1, (%rsi), %zmm0, %zmm0
+; AVX256-NEXT:    vptestnmd %zmm0, %zmm0, %k1
 ; AVX256-NEXT:    vpcmpeqd %ymm0, %ymm0, %ymm0
-; AVX256-NEXT:    vmovdqa32 %ymm0, %ymm1 {%k2} {z}
+; AVX256-NEXT:    vmovdqa32 %ymm0, %ymm1 {%k1} {z}
 ; AVX256-NEXT:    vpmovdw %ymm1, %xmm1
+; AVX256-NEXT:    kshiftrw $8, %k1, %k1
 ; AVX256-NEXT:    vmovdqa32 %ymm0, %ymm0 {%k1} {z}
 ; AVX256-NEXT:    vpmovdw %ymm0, %xmm0
-; AVX256-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX256-NEXT:    vpacksswb %xmm0, %xmm1, %xmm0
 ; AVX256-NEXT:    vzeroupper
 ; AVX256-NEXT:    retq
 ;
 ; AVX512VL-LABEL: testv16i1_sext_v16i8:
 ; AVX512VL:       # %bb.0:
 ; AVX512VL-NEXT:    vmovdqa (%rdi), %ymm0
-; AVX512VL-NEXT:    vptestnmd %ymm0, %ymm0, %k0
-; AVX512VL-NEXT:    vmovdqa (%rsi), %ymm0
-; AVX512VL-NEXT:    vptestnmd %ymm0, %ymm0, %k1
-; AVX512VL-NEXT:    kunpckbw %k0, %k1, %k1
+; AVX512VL-NEXT:    vinserti64x4 $1, (%rsi), %zmm0, %zmm0
+; AVX512VL-NEXT:    vptestnmd %zmm0, %zmm0, %k1
 ; AVX512VL-NEXT:    vpternlogd {{.*#+}} zmm0 {%k1} {z} = -1
 ; AVX512VL-NEXT:    vpmovdb %zmm0, %xmm0
 ; AVX512VL-NEXT:    vzeroupper
@@ -70,10 +68,8 @@ define <16 x i8> @testv16i1_sext_v16i8(ptr %p, ptr %q) {
 ; AVX512F-LABEL: testv16i1_sext_v16i8:
 ; AVX512F:       # %bb.0:
 ; AVX512F-NEXT:    vmovdqa (%rdi), %ymm0
-; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
-; AVX512F-NEXT:    vmovdqa (%rsi), %ymm0
+; AVX512F-NEXT:    vinserti64x4 $1, (%rsi), %zmm0, %zmm0
 ; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k1
-; AVX512F-NEXT:    kunpckbw %k0, %k1, %k1
 ; AVX512F-NEXT:    vpternlogd {{.*#+}} zmm0 {%k1} {z} = -1
 ; AVX512F-NEXT:    vpmovdb %zmm0, %xmm0
 ; AVX512F-NEXT:    vzeroupper
@@ -91,13 +87,13 @@ define <16 x i16> @testv16i1_sext_v16i16(ptr %p, ptr %q) {
 ; AVX256-LABEL: testv16i1_sext_v16i16:
 ; AVX256:       # %bb.0:
 ; AVX256-NEXT:    vmovdqa (%rdi), %ymm0
-; AVX256-NEXT:    vptestnmd %ymm0, %ymm0, %k1
-; AVX256-NEXT:    vmovdqa (%rsi), %ymm0
-; AVX256-NEXT:    vptestnmd %ymm0, %ymm0, %k2
+; AVX256-NEXT:    vinserti64x4 $1, (%rsi), %zmm0, %zmm0
+; AVX256-NEXT:    vptestnmd %zmm0, %zmm0, %k1
 ; AVX256-NEXT:    vpcmpeqd %ymm0, %ymm0, %ymm0
 ; AVX256-NEXT:    vmovdqa32 %ymm0, %ymm1 {%k1} {z}
 ; AVX256-NEXT:    vpmovdw %ymm1, %xmm1
-; AVX256-NEXT:    vmovdqa32 %ymm0, %ymm0 {%k2} {z}
+; AVX256-NEXT:    kshiftrw $8, %k1, %k1
+; AVX256-NEXT:    vmovdqa32 %ymm0, %ymm0 {%k1} {z}
 ; AVX256-NEXT:    vpmovdw %ymm0, %xmm0
 ; AVX256-NEXT:    vinserti128 $1, %xmm0, %ymm1, %ymm0
 ; AVX256-NEXT:    retq
@@ -105,10 +101,8 @@ define <16 x i16> @testv16i1_sext_v16i16(ptr %p, ptr %q) {
 ; AVX512VL-LABEL: testv16i1_sext_v16i16:
 ; AVX512VL:       # %bb.0:
 ; AVX512VL-NEXT:    vmovdqa (%rdi), %ymm0
-; AVX512VL-NEXT:    vptestnmd %ymm0, %ymm0, %k0
-; AVX512VL-NEXT:    vmovdqa (%rsi), %ymm0
-; AVX512VL-NEXT:    vptestnmd %ymm0, %ymm0, %k1
-; AVX512VL-NEXT:    kunpckbw %k0, %k1, %k1
+; AVX512VL-NEXT:    vinserti64x4 $1, (%rsi), %zmm0, %zmm0
+; AVX512VL-NEXT:    vptestnmd %zmm0, %zmm0, %k1
 ; AVX512VL-NEXT:    vpternlogd {{.*#+}} zmm0 {%k1} {z} = -1
 ; AVX512VL-NEXT:    vpmovdw %zmm0, %ymm0
 ; AVX512VL-NEXT:    retq
@@ -116,10 +110,8 @@ define <16 x i16> @testv16i1_sext_v16i16(ptr %p, ptr %q) {
 ; AVX512F-LABEL: testv16i1_sext_v16i16:
 ; AVX512F:       # %bb.0:
 ; AVX512F-NEXT:    vmovdqa (%rdi), %ymm0
-; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
-; AVX512F-NEXT:    vmovdqa (%rsi), %ymm0
+; AVX512F-NEXT:    vinserti64x4 $1, (%rsi), %zmm0, %zmm0
 ; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k1
-; AVX512F-NEXT:    kunpckbw %k0, %k1, %k1
 ; AVX512F-NEXT:    vpternlogd {{.*#+}} zmm0 {%k1} {z} = -1
 ; AVX512F-NEXT:    vpmovdw %zmm0, %ymm0
 ; AVX512F-NEXT:    retq
@@ -173,27 +165,25 @@ define <16 x i8> @testv16i1_zext_v16i8(ptr %p, ptr %q) {
 ; AVX256-LABEL: testv16i1_zext_v16i8:
 ; AVX256:       # %bb.0:
 ; AVX256-NEXT:    vmovdqa (%rdi), %ymm0
-; AVX256-NEXT:    vptestnmd %ymm0, %ymm0, %k1
-; AVX256-NEXT:    vmovdqa (%rsi), %ymm0
-; AVX256-NEXT:    vptestnmd %ymm0, %ymm0, %k2
+; AVX256-NEXT:    vinserti64x4 $1, (%rsi), %zmm0, %zmm0
+; AVX256-NEXT:    vptestnmd %zmm0, %zmm0, %k1
 ; AVX256-NEXT:    vpcmpeqd %ymm0, %ymm0, %ymm0
-; AVX256-NEXT:    vmovdqa32 %ymm0, %ymm1 {%k2} {z}
+; AVX256-NEXT:    vmovdqa32 %ymm0, %ymm1 {%k1} {z}
 ; AVX256-NEXT:    vpmovdw %ymm1, %xmm1
 ; AVX256-NEXT:    vpsrlw $15, %xmm1, %xmm1
+; AVX256-NEXT:    kshiftrw $8, %k1, %k1
 ; AVX256-NEXT:    vmovdqa32 %ymm0, %ymm0 {%k1} {z}
 ; AVX256-NEXT:    vpmovdw %ymm0, %xmm0
 ; AVX256-NEXT:    vpsrlw $15, %xmm0, %xmm0
-; AVX256-NEXT:    vpackuswb %xmm1, %xmm0, %xmm0
+; AVX256-NEXT:    vpackuswb %xmm0, %xmm1, %xmm0
 ; AVX256-NEXT:    vzeroupper
 ; AVX256-NEXT:    retq
 ;
 ; AVX512VL-LABEL: testv16i1_zext_v16i8:
 ; AVX512VL:       # %bb.0:
 ; AVX512VL-NEXT:    vmovdqa (%rdi), %ymm0
-; AVX512VL-NEXT:    vptestnmd %ymm0, %ymm0, %k0
-; AVX512VL-NEXT:    vmovdqa (%rsi), %ymm0
-; AVX512VL-NEXT:    vptestnmd %ymm0, %ymm0, %k1
-; AVX512VL-NEXT:    kunpckbw %k0, %k1, %k1
+; AVX512VL-NEXT:    vinserti64x4 $1, (%rsi), %zmm0, %zmm0
+; AVX512VL-NEXT:    vptestnmd %zmm0, %zmm0, %k1
 ; AVX512VL-NEXT:    vpbroadcastd {{.*#+}} zmm0 {%k1} {z} = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
 ; AVX512VL-NEXT:    vpmovdb %zmm0, %xmm0
 ; AVX512VL-NEXT:    vzeroupper
@@ -202,10 +192,8 @@ define <16 x i8> @testv16i1_zext_v16i8(ptr %p, ptr %q) {
 ; AVX512F-LABEL: testv16i1_zext_v16i8:
 ; AVX512F:       # %bb.0:
 ; AVX512F-NEXT:    vmovdqa (%rdi), %ymm0
-; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
-; AVX512F-NEXT:    vmovdqa (%rsi), %ymm0
+; AVX512F-NEXT:    vinserti64x4 $1, (%rsi), %zmm0, %zmm0
 ; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k1
-; AVX512F-NEXT:    kunpckbw %k0, %k1, %k1
 ; AVX512F-NEXT:    vpbroadcastd {{.*#+}} zmm0 {%k1} {z} = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
 ; AVX512F-NEXT:    vpmovdb %zmm0, %xmm0
 ; AVX512F-NEXT:    vzeroupper
@@ -223,13 +211,13 @@ define <16 x i16> @testv16i1_zext_v16i16(ptr %p, ptr %q) {
 ; AVX256-LABEL: testv16i1_zext_v16i16:
 ; AVX256:       # %bb.0:
 ; AVX256-NEXT:    vmovdqa (%rdi), %ymm0
-; AVX256-NEXT:    vptestnmd %ymm0, %ymm0, %k1
-; AVX256-NEXT:    vmovdqa (%rsi), %ymm0
-; AVX256-NEXT:    vptestnmd %ymm0, %ymm0, %k2
+; AVX256-NEXT:    vinserti64x4 $1, (%rsi), %zmm0, %zmm0
+; AVX256-NEXT:    vptestnmd %zmm0, %zmm0, %k1
 ; AVX256-NEXT:    vpcmpeqd %ymm0, %ymm0, %ymm0
 ; AVX256-NEXT:    vmovdqa32 %ymm0, %ymm1 {%k1} {z}
 ; AVX256-NEXT:    vpmovdw %ymm1, %xmm1
-; AVX256-NEXT:    vmovdqa32 %ymm0, %ymm0 {%k2} {z}
+; AVX256-NEXT:    kshiftrw $8, %k1, %k1
+; AVX256-NEXT:    vmovdqa32 %ymm0, %ymm0 {%k1} {z}
 ; AVX256-NEXT:    vpmovdw %ymm0, %xmm0
 ; AVX256-NEXT:    vinserti128 $1, %xmm0, %ymm1, %ymm0
 ; AVX256-NEXT:    vpsrlw $15, %ymm0, %ymm0
@@ -238,10 +226,8 @@ define <16 x i16> @testv16i1_zext_v16i16(ptr %p, ptr %q) {
 ; AVX512VL-LABEL: testv16i1_zext_v16i16:
 ; AVX512VL:       # %bb.0:
 ; AVX512VL-NEXT:    vmovdqa (%rdi), %ymm0
-; AVX512VL-NEXT:    vptestnmd %ymm0, %ymm0, %k0
-; AVX512VL-NEXT:    vmovdqa (%rsi), %ymm0
-; AVX512VL-NEXT:    vptestnmd %ymm0, %ymm0, %k1
-; AVX512VL-NEXT:    kunpckbw %k0, %k1, %k1
+; AVX512VL-NEXT:    vinserti64x4 $1, (%rsi), %zmm0, %zmm0
+; AVX512VL-NEXT:    vptestnmd %zmm0, %zmm0, %k1
 ; AVX512VL-NEXT:    vpternlogd {{.*#+}} zmm0 {%k1} {z} = -1
 ; AVX512VL-NEXT:    vpmovdw %zmm0, %ymm0
 ; AVX512VL-NEXT:    vpsrlw $15, %ymm0, %ymm0
@@ -250,10 +236,8 @@ define <16 x i16> @testv16i1_zext_v16i16(ptr %p, ptr %q) {
 ; AVX512F-LABEL: testv16i1_zext_v16i16:
 ; AVX512F:       # %bb.0:
 ; AVX512F-NEXT:    vmovdqa (%rdi), %ymm0
-; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k0
-; AVX512F-NEXT:    vmovdqa (%rsi), %ymm0
+; AVX512F-NEXT:    vinserti64x4 $1, (%rsi), %zmm0, %zmm0
 ; AVX512F-NEXT:    vptestnmd %zmm0, %zmm0, %k1
-; AVX512F-NEXT:    kunpckbw %k0, %k1, %k1
 ; AVX512F-NEXT:    vpternlogd {{.*#+}} zmm0 {%k1} {z} = -1
 ; AVX512F-NEXT:    vpmovdw %zmm0, %ymm0
 ; AVX512F-NEXT:    vpsrlw $15, %ymm0, %ymm0
diff --git a/llvm/test/CodeGen/X86/prefer-avx256-mask-shuffle.ll b/llvm/test/CodeGen/X86/prefer-avx256-mask-shuffle.ll
index 3699c7f75c861..93384341e03a4 100644
--- a/llvm/test/CodeGen/X86/prefer-avx256-mask-shuffle.ll
+++ b/llvm/test/CodeGen/X86/prefer-avx256-mask-shuffle.ll
@@ -18,26 +18,23 @@ define <16 x i1> @shuf16i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0(ptr %a, ptr %b) {
 ; AVX256VL-NEXT:    vpcmpeqd %ymm0, %ymm0, %ymm0
 ; AVX256VL-NEXT:    vmovdqa32 %ymm0, %ymm1 {%k2} {z}
 ; AVX256VL-NEXT:    vpmovdw %ymm1, %xmm1
-; AVX256VL-NEXT:    vmovdqa32 %ymm0, %ymm2 {%k1} {z}
-; AVX256VL-NEXT:    vpmovdw %ymm2, %xmm2
-; AVX256VL-NEXT:    vpblendw {{.*#+}} xmm3 = xmm2[0,1],xmm1[2],xmm2[3],xmm1[4],xmm2[5,6,7]
-; AVX256VL-NEXT:    vpshufb {{.*#+}} xmm3 = xmm3[6,7,12,13,4,5,8,9,6,7,14,15,14,15,0,1]
-; AVX256VL-NEXT:    vpmovsxwd %xmm3, %ymm3
-; AVX256VL-NEXT:    vpslld $31, %ymm3, %ymm3
-; AVX256VL-NEXT:    vptestmd %ymm3, %ymm3, %k1
-; AVX256VL-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[0,2,1,3]
-; AVX256VL-NEXT:    vpshufb {{.*#+}} xmm2 = xmm2[6,7,12,13,2,3,u,u,6,7,u,u,14,15,0,1]
-; AVX256VL-NEXT:    vpblendw {{.*#+}} xmm1 = xmm2[0,1,2],xmm1[3],xmm2[4],xmm1[5],xmm2[6,7]
-; AVX256VL-NEXT:    vpmovsxwd %xmm1, %ymm1
-; AVX256VL-NEXT:    vpslld $31, %ymm1, %ymm1
-; AVX256VL-NEXT:    vptestmd %ymm1, %ymm1, %k0
-; AVX256VL-NEXT:    kunpckbw %k1, %k0, %k0
-; AVX256VL-NEXT:    kshiftrw $8, %k0, %k2
-; AVX256VL-NEXT:    vmovdqa32 %ymm0, %ymm1 {%k2} {z}
+; AVX256VL-NEXT:    vpshufd {{.*#+}} xmm2 = xmm1[0,2,1,3]
+; AVX256VL-NEXT:    vmovdqa32 %ymm0, %ymm3 {%k1} {z}
+; AVX256VL-NEXT:    vpmovdw %ymm3, %xmm3
+; AVX256VL-NEXT:    vpshufb {{.*#+}} xmm4 = xmm3[6,7,12,13,2,3,u,u,6,7,u,u,14,15,0,1]
+; AVX256VL-NEXT:    vpblendw {{.*#+}} xmm2 = xmm4[0,1,2],xmm2[3],xmm4[4],xmm2[5],xmm4[6,7]
+; AVX256VL-NEXT:    vpblendw {{.*#+}} xmm1 = xmm3[0,1],xmm1[2],xmm3[3],xmm1[4],xmm3[5,6,7]
+; AVX256VL-NEXT:    vpshufb {{.*#+}} xmm1 = xmm1[6,7,12,13,4,5,8,9,6,7,14,15,14,15,0,1]
+; AVX256VL-NEXT:    vinserti128 $1, %xmm2, %ymm1, %ymm1
+; AVX256VL-NEXT:    vpmovsxwd %ymm1, %zmm1
+; AVX256VL-NEXT:    vpslld $31, %zmm1, %zmm1
+; AVX256VL-NEXT:    vptestmd %zmm1, %zmm1, %k1
+; AVX256VL-NEXT:    vmovdqa32 %ymm0, %ymm1 {%k1} {z}
 ; AVX256VL-NEXT:    vpmovdw %ymm1, %xmm1
+; AVX256VL-NEXT:    kshiftrw $8, %k1, %k1
 ; AVX256VL-NEXT:    vmovdqa32 %ymm0, %ymm0 {%k1} {z}
 ; AVX256VL-NEXT:    vpmovdw %ymm0, %xmm0
-; AVX256VL-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX256VL-NEXT:    vpacksswb %xmm0, %xmm1, %xmm0
 ; AVX256VL-NEXT:    vzeroupper
 ; AVX256VL-NEXT:    retq
 ;
@@ -135,14 +132,12 @@ define <32 x i1> @shuf32i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0_3_6_22_12_3_7_7_0
 ; AVX256VL-NEXT:    vextracti128 $1, %ymm0, %xmm1
 ; AVX256VL-NEXT:    vpmovsxbd %xmm1, %ymm1
 ; AVX256VL-NEXT:    vptestmd %ymm1, %ymm1, %k1
-; AVX256VL-NEXT:    vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
-; AVX256VL-NEXT:    vpmovsxbd %xmm1, %ymm1
-; AVX256VL-NEXT:    vptestmd %ymm1, %ymm1, %k2
-; AVX256VL-NEXT:    vpmovsxbd %xmm0, %ymm0
-; AVX256VL-NEXT:    vptestmd %ymm0, %ymm0, %k3
+; AVX256VL-NEXT:    vpmovsxbd %xmm0, %zmm0
+; AVX256VL-NEXT:    vptestmd %zmm0, %zmm0, %k2
 ; AVX256VL-NEXT:    vpcmpeqd %ymm0, %ymm0, %ymm0
-; AVX256VL-NEXT:    vmovdqa32 %ymm0, %ymm1 {%k3} {z}
+; AVX256VL-NEXT:    vmovdqa32 %ymm0, %ymm1 {%k2} {z}
 ; AVX256VL-NEXT:    vpmovdw %ymm1, %xmm1
+; AVX256VL-NEXT:    kshiftrw $8, %k2, %k2
 ; AVX256VL-NEXT:    vmovdqa32 %ymm0, %ymm2 {%k2} {z}
 ; AVX256VL-NEXT:    vpmovdw %ymm2, %xmm2
 ; AVX256VL-NEXT:    vinserti128 $1, %xmm2, %ymm1, %ymm1
@@ -153,20 +148,15 @@ define <32 x i1> @shuf32i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0_3_6_22_12_3_7_7_0
 ; AVX256VL-NEXT:    vpmovdw %ymm2, %xmm2
 ; AVX256VL-NEXT:    vpermq {{.*#+}} ymm2 = ymm2[1,1,1,1]
 ; AVX256VL-NEXT:    vpternlogq {{.*#+}} ymm2 = (ymm2 & ~mem) | ymm1
-; AVX256VL-NEXT:    vpmovsxwd %xmm2, %ymm1
-; AVX256VL-NEXT:    vpslld $31, %ymm1, %ymm1
-; AVX256VL-NEXT:    vptestmd %ymm1, %ymm1, %k1
-; AVX256VL-NEXT:    vextracti128 $1, %ymm2, %xmm1
-; AVX256VL-NEXT:    vpmovsxwd %xmm1, %ymm1
-; AVX256VL-NEXT:    vpslld $31, %ymm1, %ymm1
-; AVX256VL-NEXT:    vptestmd %ymm1, %ymm1, %k0
-; AVX256VL-NEXT:    kunpckbw %k1, %k0, %k0
-; AVX256VL-NEXT:    kshiftrw $8, %k0, %k2
-; AVX256VL-NEXT:    vmovdqa32 %ymm0, %ymm1 {%k2} {z}
+; AVX256VL-NEXT:    vpmovsxwd %ymm2, %zmm1
+; AVX256VL-NEXT:    vpslld $31, %zmm1, %zmm1
+; AVX256VL-NEXT:    vptestmd %zmm1, %zmm1, %k1
+; AVX256VL-NEXT:    vmovdqa32 %ymm0, %ymm1 {%k1} {z}
 ; AVX256VL-NEXT:    vpmovdw %ymm1, %xmm1
+; AVX256VL-NEXT:    kshiftrw $8, %k1, %k1
 ; AVX256VL-NEXT:    vmovdqa32 %ymm0, %ymm0 {%k1} {z}
 ; AVX256VL-NEXT:    vpmovdw %ymm0, %xmm0
-; AVX256VL-NEXT:    vpacksswb %xmm1, %xmm0, %xmm0
+; AVX256VL-NEXT:    vpacksswb %xmm0, %xmm1, %xmm0
 ; AVX256VL-NEXT:    vinserti128 $1, %xmm0, %ymm0, %ymm0
 ; AVX256VL-NEXT:    retq
 ;
diff --git a/llvm/test/Instrumentation/AllocToken/hot-cold-new.ll b/llvm/test/Instrumentation/AllocToken/hot-cold-new.ll
new file mode 100644
index 0000000000000..36f3df1096fe4
--- /dev/null
+++ b/llvm/test/Instrumentation/AllocToken/hot-cold-new.ll
@@ -0,0 +1,20 @@
+; Manually add instcombine to ensure the hot/cold transformation happens before
+; the LTO pipeline. The default LTO pipeline includes MemProfRemoveInfo which
+; strips the memprof attributes unless the summary index indicates support.
+; RUN: opt < %s -passes='function(instcombine),thinlto<O2>' -optimize-hot-cold-new -S | FileCheck %s
+; RUN: opt < %s -passes='function(instcombine),lto<O2>' -optimize-hot-cold-new -S | FileCheck %s
+; RUN: opt < %s -passes='function(instcombine),alloc-token' -optimize-hot-cold-new -S | FileCheck %s
+
+target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
+
+declare ptr @_Znwm(i64)
+
+define ptr @new_hot() sanitize_alloc_token {
+; CHECK-LABEL: @new_hot(
+; CHECK: call {{.*}} @__alloc_token__Znwm12__hot_cold_t(i64 10, i8 -2, i64 2689373973731826898){{.*}} !alloc_token
+  %ret = call ptr @_Znwm(i64 10) #0, !alloc_token !0
+  ret ptr %ret
+}
+
+attributes #0 = { builtin allocsize(0) "memprof"="hot" }
+!0 = !{!"int", i1 false}
diff --git a/llvm/test/Instrumentation/AllocToken/module-flags.ll b/llvm/test/Instrumentation/AllocToken/module-flags.ll
new file mode 100644
index 0000000000000..7b86510fe6eaf
--- /dev/null
+++ b/llvm/test/Instrumentation/AllocToken/module-flags.ll
@@ -0,0 +1,35 @@
+; Test that all supported module flags are retrieved correctly.
+;
+; RUN: opt < %s -passes='inferattrs,alloc-token' -S | FileCheck %s --check-prefixes=CHECK,DEFAULT
+; RUN: opt < %s -passes='inferattrs,alloc-token' -alloc-token-max=2 -alloc-token-fast-abi=0 -alloc-token-extended=0 -S | FileCheck %s --check-prefixes=CHECK,OVERRIDE
+
+target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
+
+declare ptr @_Znwm(i64)
+declare ptr @malloc(i64)
+declare ptr @my_malloc(i64)
+
+define void @test() sanitize_alloc_token {
+; CHECK-LABEL: define void @test(
+; DEFAULT: call ptr @__alloc_token_0_malloc(i64 8)
+; DEFAULT: call ptr @__alloc_token_1__Znwm(i64 8)
+; DEFAULT: call ptr @__alloc_token_2_malloc(i64 8)
+; DEFAULT: call ptr @__alloc_token_0_my_malloc(i64 8)
+; OVERRIDE: call ptr @__alloc_token_malloc(i64 8, i64 0)
+; OVERRIDE: call ptr @__alloc_token__Znwm(i64 8, i64 1)
+; OVERRIDE: call ptr @__alloc_token_malloc(i64 8, i64 0)
+; OVERRIDE: call ptr @my_malloc(i64 8)
+  %1 = call ptr @malloc(i64 8)
+  %2 = call ptr @_Znwm(i64 8)
+  %3 = call ptr @malloc(i64 8)
+  %4 = call ptr @my_malloc(i64 8), !alloc_token !0
+  ret void
+}
+
+!0 = !{!"int", i1 0}
+
+!llvm.module.flags = !{!1, !2, !3, !4}
+!1 = !{i32 1, !"alloc-token-mode", !"increment"}
+!2 = !{i32 1, !"alloc-token-max", i64 3}
+!3 = !{i32 1, !"alloc-token-fast-abi", i64 1}
+!4 = !{i32 1, !"alloc-token-extended", i64 1}
diff --git a/llvm/test/Instrumentation/RealtimeSanitizer/rtsan_attrib_declare.ll b/llvm/test/Instrumentation/RealtimeSanitizer/rtsan_attrib_declare.ll
new file mode 100644
index 0000000000000..3526a010ce489
--- /dev/null
+++ b/llvm/test/Instrumentation/RealtimeSanitizer/rtsan_attrib_declare.ll
@@ -0,0 +1,11 @@
+; RUN: opt < %s -passes='rtsan' -S | FileCheck %s
+
+declare void @declared_realtime_function() sanitize_realtime #0
+
+declare void @declared_blocking_function() sanitize_realtime_blocking #0
+
+; RealtimeSanitizer pass should ignore attributed functions that are just declarations
+; CHECK: declared_realtime_function
+; CHECK-EMPTY:
+; CHECK: declared_blocking_function
+; CHECK-EMPTY:
diff --git a/llvm/test/LTO/X86/alloc-token-hot-cold-new.ll b/llvm/test/LTO/X86/alloc-token-hot-cold-new.ll
new file mode 100644
index 0000000000000..7f7a8e45b7da0
--- /dev/null
+++ b/llvm/test/LTO/X86/alloc-token-hot-cold-new.ll
@@ -0,0 +1,25 @@
+; RUN: opt -module-summary -o %t.thin.bc %s
+; RUN: llvm-lto2 run %t.thin.bc -o %t.thin.out \
+; RUN:   -r=%t.thin.bc,main,plx \
+; RUN:   -r=%t.thin.bc,_Znwm, \
+; RUN:   -r=%t.thin.bc,sink,pl \
+; RUN:   -supports-hot-cold-new -optimize-hot-cold-new
+; RUN: llvm-objdump -d -r %t.thin.out.1 | FileCheck %s
+
+target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-unknown-linux-gnu"
+
+declare ptr @_Znwm(i64)
+
+@sink = global ptr null
+
+; CHECK-LABEL: <main>:
+; CHECK: callq
+; CHECK-NEXT: R_X86_64_PLT32 __alloc_token__Znwm12__hot_cold_t
+define void @main() sanitize_alloc_token {
+  %call = call ptr @_Znwm(i64 8) #0
+  store volatile ptr %call, ptr @sink
+  ret void
+}
+
+attributes #0 = { builtin allocsize(0) "memprof"="hot" }
diff --git a/llvm/test/LTO/X86/alloc-token.ll b/llvm/test/LTO/X86/alloc-token.ll
new file mode 100644
index 0000000000000..f9c921992c52e
--- /dev/null
+++ b/llvm/test/LTO/X86/alloc-token.ll
@@ -0,0 +1,27 @@
+; --- Full LTO ---
+; RUN: llvm-as %s -o %t.bc
+; RUN: llvm-lto -exported-symbol=main -o %t.out %t.bc
+; RUN: llvm-objdump -d -r %t.out | FileCheck %s
+; --- ThinLTO ---
+; RUN: opt -module-summary -o %t.thin.bc %s
+; RUN: llvm-lto2 run %t.thin.bc -o %t.thin.out \
+; RUN:   -r=%t.thin.bc,main,plx \
+; RUN:   -r=%t.thin.bc,_Znwm, \
+; RUN:   -r=%t.thin.bc,sink,pl
+; RUN: llvm-objdump -d -r %t.thin.out.1 | FileCheck %s
+
+target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-unknown-linux-gnu"
+
+declare ptr @_Znwm(i64)
+
+@sink = global ptr null
+
+; CHECK-LABEL: <main>:
+; CHECK: callq
+; CHECK-NEXT: R_X86_64_PLT32 __alloc_token__Znwm
+define void @main() sanitize_alloc_token {
+  %call = call ptr @_Znwm(i64 8)
+  store volatile ptr %call, ptr @sink
+  ret void
+}
diff --git a/llvm/test/MC/AArch64/seh-large-func-multi-epilog.s b/llvm/test/MC/AArch64/seh-large-func-multi-epilog.s
index c2d7f94f7b11f..8c6864fe9e196 100644
--- a/llvm/test/MC/AArch64/seh-large-func-multi-epilog.s
+++ b/llvm/test/MC/AArch64/seh-large-func-multi-epilog.s
@@ -198,7 +198,7 @@ multi_epilog:
 	.seh_save_regp x25, 192
 	stp	x27, x28, [sp, #176]
 	.seh_save_regp x27, 176
-	mov	x29, fp
+	mov	x29, sp
 	.seh_set_fp
 	.seh_endprologue
         .rept 30
@@ -210,13 +210,13 @@ multi_epilog:
 	.seh_startepilogue
 	mov	sp, x29
 	.seh_set_fp
-	stp	x27, x28, [sp, #176]
+	ldp	x27, x28, [sp, #176]
 	.seh_save_regp x27, 176
-	stp	x25, x26, [sp, #192]
+	ldp	x25, x26, [sp, #192]
 	.seh_save_regp x25, 192
-	stp	x23, x24, [sp, #208]
+	ldp	x23, x24, [sp, #208]
 	.seh_save_regp x23, 208
-	stp	x21, x22, [sp, #224]
+	ldp	x21, x22, [sp, #224]
 	.seh_save_regp x21, 224
 	ldp	x19, x20, [sp, #240]
 	.seh_save_regp x19, 240
@@ -226,11 +226,11 @@ multi_epilog:
 	ret
 // epilog2 - a subsequence at the end of prolog, can use prolog's opcodes.
 	.seh_startepilogue
-	stp	x25, x26, [sp, #192]
+	ldp	x25, x26, [sp, #192]
 	.seh_save_regp x25, 192
-	stp	x23, x24, [sp, #208]
+	ldp	x23, x24, [sp, #208]
 	.seh_save_regp x23, 208
-	stp	x21, x22, [sp, #224]
+	ldp	x21, x22, [sp, #224]
 	.seh_save_regp x21, 224
 	ldp	x19, x20, [sp, #240]
 	.seh_save_regp x19, 240
@@ -242,9 +242,9 @@ multi_epilog:
 	.seh_startepilogue
 	mov	sp, x29
 	.seh_set_fp
-	stp	x23, x24, [sp, #208]
+	ldp	x23, x24, [sp, #208]
 	.seh_save_regp x23, 208
-	stp	x21, x22, [sp, #224]
+	ldp	x21, x22, [sp, #224]
 	.seh_save_regp x21, 224
 	ldp	x19, x20, [sp, #240]
 	.seh_save_regp x19, 240
@@ -261,13 +261,13 @@ multi_epilog:
 	.seh_startepilogue
 	mov	sp, x29
 	.seh_set_fp
-	stp	x27, x28, [sp, #176]
+	ldp	x27, x28, [sp, #176]
 	.seh_save_regp x27, 176
-	stp	x25, x26, [sp, #192]
+	ldp	x25, x26, [sp, #192]
 	.seh_save_regp x25, 192
-	stp	x23, x24, [sp, #208]
+	ldp	x23, x24, [sp, #208]
 	.seh_save_regp x23, 208
-	stp	x21, x22, [sp, #224]
+	ldp	x21, x22, [sp, #224]
 	.seh_save_regp x21, 224
 	ldp	x19, x20, [sp, #240]
 	.seh_save_regp x19, 240
@@ -277,11 +277,11 @@ multi_epilog:
 	ret
 // epilog5 - same as epilog2, its start index should be: 1 + epilog2's index.
 	.seh_startepilogue
-	stp	x25, x26, [sp, #192]
+	ldp	x25, x26, [sp, #192]
 	.seh_save_regp x25, 192
-	stp	x23, x24, [sp, #208]
+	ldp	x23, x24, [sp, #208]
 	.seh_save_regp x23, 208
-	stp	x21, x22, [sp, #224]
+	ldp	x21, x22, [sp, #224]
 	.seh_save_regp x21, 224
 	ldp	x19, x20, [sp, #240]
 	.seh_save_regp x19, 240
@@ -294,9 +294,9 @@ multi_epilog:
 	.seh_startepilogue
 	mov	sp, x29
 	.seh_set_fp
-	stp	x23, x24, [sp, #208]
+	ldp	x23, x24, [sp, #208]
 	.seh_save_regp x23, 208
-	stp	x21, x22, [sp, #224]
+	ldp	x21, x22, [sp, #224]
 	.seh_save_regp x21, 224
 	ldp	x19, x20, [sp, #240]
 	.seh_save_regp x19, 240
diff --git a/llvm/test/MC/AArch64/seh-packed-epilog.s b/llvm/test/MC/AArch64/seh-packed-epilog.s
index 85ac8e80dbdda..9fee71a10d445 100644
--- a/llvm/test/MC/AArch64/seh-packed-epilog.s
+++ b/llvm/test/MC/AArch64/seh-packed-epilog.s
@@ -126,7 +126,7 @@ func:
     .seh_set_fp
     ldp x29, x30, [sp, #16]
     .seh_save_fplr 16
-    ldp x29, x30, [sp, #-48]!
+    ldp x29, x30, [sp], #48
     .seh_save_fplr_x 48
     ldp x21, x22, [sp, #16]
     .seh_save_next
diff --git a/llvm/test/MC/AArch64/seh-packed-unwind.s b/llvm/test/MC/AArch64/seh-packed-unwind.s
index 5b86ab4bc0d49..cbb4762667633 100644
--- a/llvm/test/MC/AArch64/seh-packed-unwind.s
+++ b/llvm/test/MC/AArch64/seh-packed-unwind.s
@@ -295,6 +295,26 @@
 // CHECK-NEXT:       end
 // CHECK-NEXT:     ]
 // CHECK-NEXT:   }
+// CHECK-NEXT:   RuntimeFunction {
+// CHECK-NEXT:     Function: func19
+// CHECK-NEXT:     Fragment: No
+// CHECK-NEXT:     FunctionLength: 32
+// CHECK-NEXT:     RegF: 0
+// CHECK-NEXT:     RegI: 1
+// CHECK-NEXT:     HomedParameters: No
+// CHECK-NEXT:     CR: 1
+// CHECK-NEXT:     FrameSize: 80
+// CHECK-NEXT:     Prologue [
+// CHECK-NEXT:       sub sp, sp, #64
+// CHECK-NEXT:       stp x19, lr, [sp]
+// CHECK-NEXT:       sub sp, sp, #16
+// CHECK-NEXT:       end
+// CHECK-NEXT:     ]
+// CHECK-NEXT:   }
+// CHECK-NEXT:   RuntimeFunction {
+// CHECK-NEXT:     Function: notpacked_func20
+// CHECK-NEXT:     ExceptionRecord:
+// CHECK-NEXT:     ExceptionData {
 // CHECK:        RuntimeFunction {
 // CHECK-NEXT:     Function: nonpacked1
 // CHECK-NEXT:     ExceptionRecord:
@@ -374,6 +394,11 @@
 // CHECK-NEXT:     Function: nonpacked16
 // CHECK-NEXT:     ExceptionRecord:
 // CHECK-NEXT:     ExceptionData {
+// CHECK:            EpiloguePacked: Yes
+// CHECK:        RuntimeFunction {
+// CHECK-NEXT:     Function: nonpacked17
+// CHECK-NEXT:     ExceptionRecord:
+// CHECK-NEXT:     ExceptionData {
 // CHECK:            EpiloguePacked: Yes
 
 
@@ -809,12 +834,65 @@ func18:
     ret
     .seh_endproc
 
+func19:
+    .seh_proc func19
+    sub sp, sp, #16
+    .seh_stackalloc 16
+    stp x19, lr, [sp]
+    .seh_save_lrpair x19, 0
+    sub sp,  sp,  #64
+    .seh_stackalloc 64
+    .seh_endprologue
+    nop
+    .seh_startepilogue
+    add sp, sp, #64
+    .seh_stackalloc 64
+    ldp x19, lr, [sp]
+    .seh_save_lrpair x19, 0
+    add sp,  sp,  #16
+    .seh_stackalloc 16
+    .seh_endepilogue
+    ret
+    .seh_endproc
+
+notpacked_func20:
+    // This function is expressible with packed unwind info, but older
+    // versions of Windows unwind cases with CR=01, RegI=1, RegF>0
+    // incorrectly; therefore, we choose not to pack this case.
+    .seh_proc notpacked_func20
+    sub sp, sp, #48
+    .seh_stackalloc 48
+    stp x19, lr, [sp]
+    .seh_save_lrpair x19, 0
+    stp d8,  d9,  [sp, #16]
+    .seh_save_fregp d8, 16
+    str d10,      [sp, #32]
+    .seh_save_freg d10, 32
+    sub sp,  sp,  #64
+    .seh_stackalloc 64
+    .seh_endprologue
+    nop
+    .seh_startepilogue
+    add sp, sp, #64
+    .seh_stackalloc 64
+    ldr d10,      [sp, #32]
+    .seh_save_freg d10, 32
+    ldp d8,  d9,  [sp, #16]
+    .seh_save_fregp d8, 16
+    ldp x19, lr, [sp]
+    .seh_save_lrpair x19, 0
+    add sp,  sp,  #48
+    .seh_stackalloc 48
+    .seh_endepilogue
+    ret
+    .seh_endproc
+
 nonpacked1:
     .seh_proc nonpacked1
     // Can't be packed; can't save integer registers after float registers.
     stp d8,  d9,  [sp, #-32]!
     .seh_save_fregp_x d8, 32
-    stp x19, x20, [sp, #16]!
+    stp x19, x20, [sp, #16]
     .seh_save_regp x19, 16
     .seh_endprologue
     nop
@@ -932,7 +1010,7 @@ nonpacked6:
     .seh_startepilogue
     mov sp,  x29
     .seh_set_fp
-    ldp x29, lr,  [sp], #32
+    ldp x29, lr,  [sp], #16
     .seh_save_fplr_x 16
     ldr lr, [sp, #16]
     .seh_save_reg lr, 16
@@ -1157,3 +1235,34 @@ nonpacked16:
     .seh_endepilogue
     br      x9
     .seh_endproc
+
+nonpacked17:
+    .seh_proc nonpacked17
+    // Can't be packed; more predecrement for SavSZ than used for
+    // corresponding RegI/RegF/LR saves
+    sub sp, sp, #64
+    .seh_stackalloc 64
+    stp x19, lr, [sp]
+    .seh_save_lrpair x19, 0
+    stp d8,  d9,  [sp, #16]
+    .seh_save_fregp d8, 16
+    str d10,      [sp, #32]
+    .seh_save_freg d10, 32
+    sub sp,  sp,  #64
+    .seh_stackalloc 64
+    .seh_endprologue
+    nop
+    .seh_startepilogue
+    add sp, sp, #64
+    .seh_stackalloc 64
+    ldr d10,      [sp, #32]
+    .seh_save_freg d10, 32
+    ldp d8,  d9,  [sp, #16]
+    .seh_save_fregp d8, 16
+    ldp x19, lr, [sp]
+    .seh_save_lrpair x19, 0
+    add sp,  sp,  #64
+    .seh_stackalloc 64
+    .seh_endepilogue
+    ret
+    .seh_endproc
diff --git a/llvm/test/MC/AArch64/seh.s b/llvm/test/MC/AArch64/seh.s
index 5e194568f62dd..95411391710b4 100644
--- a/llvm/test/MC/AArch64/seh.s
+++ b/llvm/test/MC/AArch64/seh.s
@@ -1,6 +1,7 @@
 // This test checks that the SEH directives emit the correct unwind data.
 
-// RUN: llvm-mc -triple aarch64-pc-win32 -filetype=obj %s | llvm-readobj -S -r -u - | FileCheck %s
+// RUN: llvm-mc -triple aarch64-pc-win32 -filetype=obj %s -o %t.o
+// RUN: llvm-readobj -S -r -u %t.o | FileCheck %s
 
 // Check that the output assembler directives also can be parsed, and
 // that they produce equivalent output:
@@ -20,7 +21,7 @@
 // CHECK-NEXT:   }
 // CHECK:        Section {
 // CHECK:          Name: .xdata
-// CHECK:          RawDataSize: 100
+// CHECK:          RawDataSize: 108
 // CHECK:          RelocationCount: 1
 // CHECK:          Characteristics [
 // CHECK-NEXT:       ALIGN_4BYTES
@@ -30,7 +31,7 @@
 // CHECK-NEXT:   }
 // CHECK:        Section {
 // CHECK:          Name: .pdata
-// CHECK:          RelocationCount: 2
+// CHECK:          RelocationCount: 4
 // CHECK:          Characteristics [
 // CHECK-NEXT:       ALIGN_4BYTES
 // CHECK-NEXT:       CNT_INITIALIZED_DATA
@@ -41,11 +42,13 @@
 
 // CHECK-NEXT: Relocations [
 // CHECK-NEXT:   Section (4) .xdata {
-// CHECK-NEXT:     0x58 IMAGE_REL_ARM64_ADDR32NB __C_specific_handler
+// CHECK-NEXT:     0x54 IMAGE_REL_ARM64_ADDR32NB __C_specific_handler
 // CHECK-NEXT:   }
 // CHECK-NEXT:   Section (5) .pdata {
 // CHECK-NEXT:     0x0 IMAGE_REL_ARM64_ADDR32NB .text
 // CHECK-NEXT:     0x4 IMAGE_REL_ARM64_ADDR32NB .xdata
+// CHECK-NEXT:     0x8 IMAGE_REL_ARM64_ADDR32NB .text
+// CHECK-NEXT:     0xC IMAGE_REL_ARM64_ADDR32NB .xdata
 // CHECK-NEXT:   }
 // CHECK-NEXT: ]
 
@@ -54,7 +57,7 @@
 // CHECK-NEXT:     Function: func
 // CHECK-NEXT:     ExceptionRecord: .xdata
 // CHECK-NEXT:     ExceptionData {
-// CHECK-NEXT:       FunctionLength: 172
+// CHECK-NEXT:       FunctionLength: 148
 // CHECK:            Prologue [
 // CHECK-NEXT:         0xe716c3            ; str p6, [sp, #3, mul vl]
 // CHECK-NEXT:         0xe703c5            ; str z11, [sp, #5, mul vl]
@@ -72,11 +75,6 @@
 // CHECK-NEXT:         0xe74104            ; stp x1, x2, [sp, #64]
 // CHECK-NEXT:         0xe70008            ; str x0, [sp, #64]
 // CHECK-NEXT:         0xfc                ; pacibsp
-// CHECK-NEXT:         0xec                ; clear unwound to call
-// CHECK-NEXT:         0xeb                ; EC context
-// CHECK-NEXT:         0xea                ; context
-// CHECK-NEXT:         0xe9                ; machine frame
-// CHECK-NEXT:         0xe8                ; trap frame
 // CHECK-NEXT:         0xe3                ; nop
 // CHECK-NEXT:         0xe202              ; add fp, sp, #16
 // CHECK-NEXT:         0xdd41              ; str d13, [sp, #8]
@@ -99,8 +97,8 @@
 // CHECK-NEXT:       ]
 // CHECK-NEXT:       EpilogueScopes [
 // CHECK-NEXT:         EpilogueScope {
-// CHECK-NEXT:           StartOffset: 41
-// CHECK-NEXT:           EpilogueStartIndex: 77
+// CHECK-NEXT:           StartOffset: 35
+// CHECK-NEXT:           EpilogueStartIndex: 72
 // CHECK-NEXT:           Opcodes [
 // CHECK-NEXT:             0x01                ; add sp, #16
 // CHECK-NEXT:             0xe4                ; end
@@ -113,9 +111,28 @@
 // CHECK-NEXT:       ]
 // CHECK-NEXT:     }
 // CHECK-NEXT:   }
+// CHECK-NEXT:   RuntimeFunction {
+// CHECK-NEXT:     Function: customfunc
+// CHECK-NEXT:     ExceptionRecord: .xdata
+// CHECK-NEXT:     ExceptionData {
+// CHECK-NEXT:       FunctionLength: 24
+// CHECK:            Prologue [
+// CHECK-NEXT:         0xec                ; clear unwound to call
+// CHECK-NEXT:         0xeb                ; EC context
+// CHECK-NEXT:         0xea                ; context
+// CHECK-NEXT:         0xe9                ; machine frame
+// CHECK-NEXT:         0xe8                ; trap frame
+// CHECK-NEXT:         0xe4                ; end
+// CHECK-NEXT:       ]
+// CHECK-NEXT:       EpilogueScopes [
+// CHECK-NEXT:       ]
+// CHECK-NEXT:     }
+// CHECK-NEXT:   }
 // CHECK-NEXT: ]
 
 
+    .arch_extension sve
+
     .text
     .globl func
     .def func
@@ -124,8 +141,8 @@
     .endef
     .seh_proc func
 func:
-    sub sp, sp, #24
-    .seh_stackalloc 24
+    sub sp, sp, #16
+    .seh_stackalloc 16
     mov x29, sp
     .seh_set_fp
     stp x29, x30, [sp, #-32]!
@@ -160,54 +177,43 @@ func:
     .seh_add_fp 16
     nop
     .seh_nop
-    nop
-    .seh_trap_frame
-    nop
-    .seh_pushframe
-    nop
-    .seh_context
-    nop
-    .seh_ec_context
-    nop
-    .seh_clear_unwound_to_call
     pacibsp
     .seh_pac_sign_lr
-    nop
+    str x0, [sp, #64]
     .seh_save_any_reg x0, 64
-    nop
+    stp x1, x2, [sp, #64]
     .seh_save_any_reg_p x1, 64
-    nop
+    str d29, [sp, #64]
     .seh_save_any_reg d29, 64
-    nop
+    stp d4, d5, [sp, #64]
     .seh_save_any_reg_p d4, 64
-    nop
+    str q30, [sp, #64]
     .seh_save_any_reg q30, 64
-    nop
+    stp q3, q4, [sp, #64]
     .seh_save_any_reg_p q3, 64
-    nop
+    str x30, [sp, #-64]!
     .seh_save_any_reg_x lr, 64
-    nop
+    stp x29, x30, [sp, #-64]!
     .seh_save_any_reg_px fp, 64
-    nop
+    str d31, [sp, #-64]!
     .seh_save_any_reg_x d31, 64
-    nop
+    stp d2, d3, [sp, #-64]!
     .seh_save_any_reg_px d2, 64
-    nop
+    str q29, [sp, #-64]!
     .seh_save_any_reg_x q29, 64
-    nop
+    stp q9, q10, [sp, #-64]!
     .seh_save_any_reg_px q9, 64
-    nop
+    addvl sp, sp, #-5
     .seh_allocz 5
-    nop
+    str z11, [sp, #5, mul vl]
     .seh_save_zreg z11, 5
-    nop
+    str p6, [sp, #3, mul vl]
     .seh_save_preg p6, 3
-    nop
     .seh_endprologue
     nop
     .seh_startepilogue
-    add sp, sp, #24
-    .seh_stackalloc 24
+    add sp, sp, #16
+    .seh_stackalloc 16
     .seh_endepilogue
     ret
     .seh_handler __C_specific_handler, @except
@@ -216,6 +222,22 @@ func:
     .text
     .seh_endproc
 
+    .seh_proc customfunc
+customfunc:
+    nop
+    .seh_trap_frame
+    nop
+    .seh_pushframe
+    nop
+    .seh_context
+    nop
+    .seh_ec_context
+    nop
+    .seh_clear_unwound_to_call
+    .seh_endprologue
+    ret
+    .seh_endproc
+
     // Function with no .seh directives; no pdata/xdata entries are
     // generated.
     .globl smallFunc
diff --git a/llvm/test/MC/PowerPC/fixup-out-of-range.s b/llvm/test/MC/PowerPC/fixup-out-of-range.s
new file mode 100644
index 0000000000000..a036b4e232815
--- /dev/null
+++ b/llvm/test/MC/PowerPC/fixup-out-of-range.s
@@ -0,0 +1,91 @@
+# RUN: not llvm-mc -triple powerpc64le-unknown-unknown -filetype=obj %s 2>&1 >/dev/null | FileCheck %s
+
+# CHECK: error: branch target out of range (32772 not between -32768 and 32764)
+brcond14_out_of_range_hi:
+    beq 0, brcond14_target
+    .space 0x8000
+
+brcond14_target:
+    blr
+
+# CHECK: error: branch target out of range (-32772 not between -32768 and 32764)
+brcond14_out_of_range_lo:
+    .space 0x8004
+    beq 0, brcond14_out_of_range_lo
+
+# CHECK: error: branch target not a multiple of four (5)
+brcond14_misaligned:
+    beq 0, brcond14_misaligned_target
+    .byte 0
+
+brcond14_misaligned_target:
+    blr
+
+
+
+# CHECK: error: branch target out of range (32772 not between -32768 and 32764)
+brcond14abs_out_of_range_hi:
+    beqa 0, brcond14abs_target-.
+    .space 0x8000
+
+brcond14abs_target:
+    blr
+
+# CHECK: error: branch target out of range (-32772 not between -32768 and 32764)
+brcond14abs_out_of_range_lo:
+    .space 0x8004
+    beqa 0, brcond14abs_out_of_range_lo-.
+
+# CHECK: error: branch target not a multiple of four (5)
+brcond14abs_misaligned:
+    beqa 0, brcond14abs_misaligned_target-.
+    .byte 0
+
+brcond14abs_misaligned_target:
+    blr
+
+
+
+# CHECK: error: branch target out of range (33554436 not between -33554432 and 33554428)
+br24_out_of_range_hi:
+    b br24_target
+    .space 0x2000000
+
+br24_target:
+    blr
+
+# CHECK: error: branch target out of range (-33554436 not between -33554432 and 33554428)
+br24_out_of_range_lo:
+    .space 0x2000004
+    b br24_out_of_range_lo
+
+# CHECK: error: branch target not a multiple of four (5)
+br24_misaligned:
+    b br24_misaligned_target
+    .byte 0
+
+br24_misaligned_target:
+    blr
+
+
+
+# CHECK: error: branch target out of range (33554436 not between -33554432 and 33554428)
+br24abs_out_of_range_hi:
+    ba br24abs_target-.
+    .space 0x2000000
+
+br24abs_target:
+    blr
+
+# CHECK: error: branch target out of range (-33554436 not between -33554432 and 33554428)
+br24abs_out_of_range_lo:
+    .space 0x2000004
+    ba br24abs_out_of_range_lo-.
+
+# CHECK: error: branch target not a multiple of four (5)
+br24abs_misaligned:
+    ba br24abs_misaligned_target-.
+    .byte 0
+
+br24abs_misaligned_target:
+    blr
diff --git a/llvm/test/MC/RISCV/corev/XCVelw-pseudo.s b/llvm/test/MC/RISCV/corev/XCVelw-pseudo.s
new file mode 100644
index 0000000000000..172ebfde9f338
--- /dev/null
+++ b/llvm/test/MC/RISCV/corev/XCVelw-pseudo.s
@@ -0,0 +1,11 @@
+# RUN: llvm-mc %s -triple=riscv32 --mattr=+xcvelw | FileCheck %s
+
+# CHECK: .Lpcrel_hi0:
+# CHECK: auipc a2, %pcrel_hi(a_symbol)
+# CHECK: cv.elw a2, %pcrel_lo(.Lpcrel_hi0)(a2)
+cv.elw a2, a_symbol
+
+# CHECK: .Lpcrel_hi1:
+# CHECK: auipc a3, %pcrel_hi(a_symbol)
+# CHECK: cv.elw a3, %pcrel_lo(.Lpcrel_hi1)(a3)
+cv.elw a3, a_symbol
diff --git a/llvm/test/MachineVerifier/AMDGPU/verifier-copyLanemask-invalid-lanemask.mir b/llvm/test/MachineVerifier/AMDGPU/verifier-copyLanemask-invalid-lanemask.mir
new file mode 100644
index 0000000000000..b7d775f7a1b35
--- /dev/null
+++ b/llvm/test/MachineVerifier/AMDGPU/verifier-copyLanemask-invalid-lanemask.mir
@@ -0,0 +1,37 @@
+# RUN: not --crash llc -o - -mtriple=amdgcn-amd-amdhsa -run-pass=none %s 2>&1 | FileCheck %s
+
+---
+name: test_copy_lanemask_instruction_0
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0, $vgpr1
+
+    %0:vgpr_32 = IMPLICIT_DEF
+
+    ; CHECK: *** Bad machine code: COPY_LANEMASK must read at least one lane ***
+    $vgpr2 = COPY_LANEMASK $vgpr0, lanemask(0)
+
+    ; CHECK: *** Bad machine code: COPY_LANEMASK attempts to read from the lanes that don't exist in the source register ***
+    $vgpr3 = COPY_LANEMASK $vgpr1, lanemask(0xFFFFFFFFFFFFFFFF)
+
+    ; CHECK: *** Bad machine code: COPY_LANEMASK cannot be used to do full copy ***
+    $vgpr4_vgpr5 = COPY_LANEMASK $vgpr2_vgpr3, lanemask(0x000000000000000F)
+
+    ; CHECK: *** Bad machine code: COPY_LANEMASK cannot be used to do full copy ***
+    %1:vgpr_32 = COPY_LANEMASK %0, lanemask(0x0000000000000003)
+
+    ; CHECK: *** Bad machine code: COPY_LANEMASK attempts to read from the lanes that don't exist in the source register ***
+    %2:vgpr_32 = COPY_LANEMASK %1, lanemask(0x0000000FFFFFFFFF)
+
+    ; CHECK: *** Bad machine code: COPY_LANEMASK attempts to read from the lanes that don't exist in the source register ***
+    %3:vreg_64 = COPY_LANEMASK $vgpr4_vgpr5, lanemask(0x00000000000000FF)
+
+    ; CHECK: *** Bad machine code: COPY_LANEMASK cannot be used to do full copy ***
+    $vgpr6_vgpr7 = COPY_LANEMASK %3, lanemask(0x000000000000000F)
+
+    ; CHECK: *** Bad machine code: COPY_LANEMASK must not use a subregister index ***
+    %4:vgpr_32 = COPY_LANEMASK %3.sub0, lanemask(0x0000000000000003)
+
+    S_ENDPGM 0
+...
diff --git a/llvm/test/MachineVerifier/AMDGPU/verifier-copyLanemask-missing-lanemask.mir b/llvm/test/MachineVerifier/AMDGPU/verifier-copyLanemask-missing-lanemask.mir
new file mode 100644
index 0000000000000..0b461107f5b5f
--- /dev/null
+++ b/llvm/test/MachineVerifier/AMDGPU/verifier-copyLanemask-missing-lanemask.mir
@@ -0,0 +1,19 @@
+# RUN: not --crash llc -o - -mtriple=amdgcn-amd-amdhsa -run-pass=none %s 2>&1 | FileCheck %s
+
+# CHECK: *** Bad machine code: Too few operands ***
+# CHECK-NEXT: - function:    test_copy_lanemask_instruction_1
+# CHECK-NEXT: - basic block: %bb.0
+# CHECK-NEXT: - instruction: $vgpr2 = COPY_LANEMASK %0:vgpr_32
+
+---
+name: test_copy_lanemask_instruction_1
+tracksRegLiveness: true
+body:             |
+  bb.0:
+    liveins: $vgpr0, $vgpr1
+
+    %0:vgpr_32 = COPY $vgpr0
+    $vgpr2 = COPY_LANEMASK %0
+    S_ENDPGM 0
+...
+
diff --git a/llvm/test/Other/X86/debugcounter-divrempairs.ll b/llvm/test/Other/X86/debugcounter-divrempairs.ll
index c196dcd01eca4..ca7e6f5ae4d02 100644
--- a/llvm/test/Other/X86/debugcounter-divrempairs.ll
+++ b/llvm/test/Other/X86/debugcounter-divrempairs.ll
@@ -1,4 +1,3 @@
-; REQUIRES: asserts
 ; RUN: opt < %s -passes=div-rem-pairs -debug-counter=div-rem-pairs-transform=1 \
 ; RUN: -S -mtriple=x86_64-unknown-unknown    | FileCheck %s
 ;; Test that, with debug counters on, we only skip the first div-rem-pairs opportunity, optimize one after it,
diff --git a/llvm/test/Other/X86/debugcounter-partiallyinlinelibcalls.ll b/llvm/test/Other/X86/debugcounter-partiallyinlinelibcalls.ll
index 8024c05feea3b..13c111d119d76 100644
--- a/llvm/test/Other/X86/debugcounter-partiallyinlinelibcalls.ll
+++ b/llvm/test/Other/X86/debugcounter-partiallyinlinelibcalls.ll
@@ -1,4 +1,3 @@
-; REQUIRES: asserts
 ; RUN: opt -S -debug-counter=partially-inline-libcalls-transform=1 \
 ; RUN:     -passes=partially-inline-libcalls -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s
 ;; Test that, with debug counters on, we will skip the first optimization opportunity, perform next 1,
diff --git a/llvm/test/Other/debugcounter-dce.ll b/llvm/test/Other/debugcounter-dce.ll
index 3b1dfb453593a..ad35766c5dff7 100644
--- a/llvm/test/Other/debugcounter-dce.ll
+++ b/llvm/test/Other/debugcounter-dce.ll
@@ -1,4 +1,3 @@
-; REQUIRES: asserts
 ; RUN: opt -passes=dce -S -debug-counter=dce-transform=1-2  < %s | FileCheck %s --check-prefixes=CHECK,NO-PRINT
 ; RUN: opt -passes=dce -S -debug-counter=dce-transform=1-2 -print-debug-counter-queries < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,PRINT
 ;; Test that, with debug counters on, we will skip the first DCE opportunity, perform next 2,
diff --git a/llvm/test/Other/debugcounter-earlycse.ll b/llvm/test/Other/debugcounter-earlycse.ll
index d3628c760ca33..9ef5d515e1c0f 100644
--- a/llvm/test/Other/debugcounter-earlycse.ll
+++ b/llvm/test/Other/debugcounter-earlycse.ll
@@ -1,4 +1,3 @@
-; REQUIRES: asserts
 ; RUN: opt -S -debug-counter=early-cse=1 -passes=early-cse -earlycse-debug-hash < %s 2>&1 | FileCheck %s
 ;; Test that, with debug counters on, we only optimize the second CSE opportunity.
 define i32 @test(i32 %a, i32 %b) {
diff --git a/llvm/test/Other/debugcounter-newgvn.ll b/llvm/test/Other/debugcounter-newgvn.ll
index fba21bd5de962..eb3b792cb42b3 100644
--- a/llvm/test/Other/debugcounter-newgvn.ll
+++ b/llvm/test/Other/debugcounter-newgvn.ll
@@ -1,5 +1,4 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
-; REQUIRES: asserts
 ; RUN: opt -S -debug-counter=newgvn-vn=1-2 -passes=newgvn  < %s 2>&1 | FileCheck %s
 ;; Test that, with debug counters on, we don't value number the first instruction, only the second and third,
 ;; which means we do not discover the return is constant.
diff --git a/llvm/test/Other/debugcounter-predicateinfo.ll b/llvm/test/Other/debugcounter-predicateinfo.ll
index 4f9440875e21a..d91b0f8904cb0 100644
--- a/llvm/test/Other/debugcounter-predicateinfo.ll
+++ b/llvm/test/Other/debugcounter-predicateinfo.ll
@@ -1,5 +1,4 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
-; REQUIRES: asserts
 ; RUN: opt -debug-counter=predicateinfo-rename=1 -passes=print-predicateinfo < %s 2>&1 | FileCheck %s
 ;; Test that, with debug counters on, we don't rename the first info, only the second
 define fastcc void @barney() {
diff --git a/llvm/test/Other/debugcounter-slsr.ll b/llvm/test/Other/debugcounter-slsr.ll
index a9ca45222a5cc..0e24f493c3bc8 100644
--- a/llvm/test/Other/debugcounter-slsr.ll
+++ b/llvm/test/Other/debugcounter-slsr.ll
@@ -1,4 +1,3 @@
-; REQUIRES: asserts
 ; RUN: opt -passes=slsr -S -debug-counter=slsr-counter=1  < %s | FileCheck %s
 
 ; Test that, with debug counters on, we will skip the first slsr opportunity.
diff --git a/llvm/test/Other/new-pm-O0-defaults.ll b/llvm/test/Other/new-pm-O0-defaults.ll
index 278a89261691a..a7f43d1fc4591 100644
--- a/llvm/test/Other/new-pm-O0-defaults.ll
+++ b/llvm/test/Other/new-pm-O0-defaults.ll
@@ -9,13 +9,13 @@
 
 ; RUN: opt -disable-verify -verify-analysis-invalidation=0 -debug-pass-manager \
 ; RUN:     -passes='default<O0>' -S %s 2>&1 \
-; RUN:     | FileCheck %s --check-prefixes=CHECK,CHECK-DEFAULT,CHECK-CORO
+; RUN:     | FileCheck %s --check-prefixes=CHECK,CHECK-DEFAULT,CHECK-CORO,CHECK-ALLOCTOKEN
 ; RUN: opt -disable-verify -verify-analysis-invalidation=0 -debug-pass-manager -enable-matrix \
 ; RUN:     -passes='default<O0>' -S %s 2>&1 \
-; RUN:     | FileCheck %s --check-prefixes=CHECK,CHECK-DEFAULT,CHECK-MATRIX,CHECK-CORO
+; RUN:     | FileCheck %s --check-prefixes=CHECK,CHECK-DEFAULT,CHECK-MATRIX,CHECK-CORO,CHECK-ALLOCTOKEN
 ; RUN: opt -disable-verify -verify-analysis-invalidation=0 -debug-pass-manager -debug-info-for-profiling \
 ; RUN:     -passes='default<O0>' -S %s 2>&1 \
-; RUN:     | FileCheck %s --check-prefixes=CHECK,CHECK-DIS,CHECK-CORO
+; RUN:     | FileCheck %s --check-prefixes=CHECK,CHECK-DIS,CHECK-CORO,CHECK-ALLOCTOKEN
 ; RUN: opt -disable-verify -verify-analysis-invalidation=0 -debug-pass-manager \
 ; RUN:     -passes='thinlto-pre-link<O0>' -S %s 2>&1 \
 ; RUN:     | FileCheck %s --check-prefixes=CHECK,CHECK-DEFAULT,CHECK-PRE-LINK,CHECK-CORO
@@ -41,10 +41,13 @@
 ; CHECK-MATRIX: Running pass: LowerMatrixIntrinsicsPass
 ; CHECK-MATRIX-NEXT: Running analysis: TargetIRAnalysis
 ; CHECK-CORO-NEXT: Running pass: CoroConditionalWrapper
+; CHECK-ALLOCTOKEN-NEXT: Running pass: AllocTokenPass
 ; CHECK-PRE-LINK: Running pass: CanonicalizeAliasesPass
 ; CHECK-PRE-LINK-NEXT: Running pass: NameAnonGlobalPass
 ; CHECK-THINLTO: Running pass: LowerTypeTestsPass
 ; CHECK-THINLTO-NEXT: Running pass: CoroConditionalWrapper
+; CHECK-THINLTO-NEXT: Running pass: AllocTokenPass
+; CHECK-THINLTO-NEXT: Running analysis: InnerAnalysisManagerProxy
 ; CHECK-THINLTO-NEXT: Running pass: EliminateAvailableExternallyPass
 ; CHECK-THINLTO-NEXT: Running pass: GlobalDCEPass
 ; CHECK-LTO: Running pass: CrossDSOCFIPass on [module]
@@ -53,6 +56,7 @@
 ; CHECK-LTO-NEXT: Running pass: LowerTypeTestsPass
 ; CHECK-LTO-NEXT: Running pass: LowerTypeTestsPass
 ; CHECK-LTO-NEXT: CoroConditionalWrapper
+; CHECK-LTO-NEXT: Running pass: AllocTokenPass
 ; CHECK-CORO-NEXT: Running pass: AnnotationRemarksPass
 ; CHECK-CORO-NEXT: Running analysis: TargetLibraryAnalysis
 ; CHECK-LTO-NEXT: Running pass: AnnotationRemarksPass
diff --git a/llvm/test/Other/new-pm-defaults.ll b/llvm/test/Other/new-pm-defaults.ll
index 1f437a662cc96..f074b2fdd3ab8 100644
--- a/llvm/test/Other/new-pm-defaults.ll
+++ b/llvm/test/Other/new-pm-defaults.ll
@@ -285,6 +285,7 @@
 ; CHECK-O-NEXT: Running pass: DivRemPairsPass
 ; CHECK-O-NEXT: Running pass: TailCallElimPass
 ; CHECK-O-NEXT: Running pass: SimplifyCFGPass
+; CHECK-DEFAULT-NEXT: Running pass: AllocToken
 ; CHECK-EP-OPTIMIZER-LAST: Running pass: NoOpModulePass
 ; CHECK-HOT-COLD-SPLIT-NEXT: Running pass: HotColdSplittingPass
 ; CHECK-IR-OUTLINER-NEXT: Running pass: IROutlinerPass
diff --git a/llvm/test/Other/new-pm-lto-defaults.ll b/llvm/test/Other/new-pm-lto-defaults.ll
index c865d77c86d77..de0feca55e5b2 100644
--- a/llvm/test/Other/new-pm-lto-defaults.ll
+++ b/llvm/test/Other/new-pm-lto-defaults.ll
@@ -163,6 +163,7 @@
 ; CHECK-O23SZ-NEXT: Running pass: CGProfilePass
 ; CHECK-O1-NEXT: Running pass: CoroConditionalWrapper
 ; CHECK-O23SZ-NEXT: Running pass: CoroCleanupPass
+; CHECK-O-NEXT: Running pass: AllocTokenPass
 ; CHECK-EP-NEXT: Running pass: NoOpModulePass
 ; CHECK-O-NEXT: Running pass: AnnotationRemarksPass on foo
 ; CHECK-O-NEXT: Running pass: PrintModulePass
diff --git a/llvm/test/Other/new-pm-thinlto-postlink-defaults.ll b/llvm/test/Other/new-pm-thinlto-postlink-defaults.ll
index 2d8b8f1b22091..b0d08316de4f0 100644
--- a/llvm/test/Other/new-pm-thinlto-postlink-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-postlink-defaults.ll
@@ -203,6 +203,7 @@
 ; CHECK-POSTLINK-O-NEXT: Running pass: DivRemPairsPass
 ; CHECK-POSTLINK-O-NEXT: Running pass: TailCallElimPass
 ; CHECK-POSTLINK-O-NEXT: Running pass: SimplifyCFGPass
+; CHECK-POSTLINK-O-NEXT: Running pass: AllocTokenPass
 ; CHECK-POST-EP-OPT-LAST-NEXT: Running pass: NoOpModulePass
 ; CHECK-POSTLINK-O-NEXT: Running pass: GlobalDCEPass
 ; CHECK-POSTLINK-O-NEXT: Running pass: ConstantMergePass
diff --git a/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll b/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
index 7cacc17c7ab9a..6b3e82a752899 100644
--- a/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
@@ -188,6 +188,7 @@
 ; CHECK-O-NEXT: Running pass: DivRemPairsPass
 ; CHECK-O-NEXT: Running pass: TailCallElimPass
 ; CHECK-O-NEXT: Running pass: SimplifyCFGPass
+; CHECK-O-NEXT: Running pass: AllocTokenPass
 ; CHECK-O-NEXT: Running pass: GlobalDCEPass
 ; CHECK-O-NEXT: Running pass: ConstantMergePass
 ; CHECK-O-NEXT: Running pass: CGProfilePass
diff --git a/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll b/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
index ef6cd8354ae3d..88dc18f605ce2 100644
--- a/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
+++ b/llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
@@ -197,6 +197,7 @@
 ; CHECK-O-NEXT: Running pass: DivRemPairsPass
 ; CHECK-O-NEXT: Running pass: TailCallElimPass
 ; CHECK-O-NEXT: Running pass: SimplifyCFGPass
+; CHECK-O-NEXT: Running pass: AllocTokenPass
 ; CHECK-O-NEXT: Running pass: GlobalDCEPass
 ; CHECK-O-NEXT: Running pass: ConstantMergePass
 ; CHECK-O-NEXT: Running pass: CGProfilePass
diff --git a/llvm/test/Other/print-debug-counter.ll b/llvm/test/Other/print-debug-counter.ll
index 0bba811f71c6d..2bef843475923 100644
--- a/llvm/test/Other/print-debug-counter.ll
+++ b/llvm/test/Other/print-debug-counter.ll
@@ -1,5 +1,3 @@
-; REQUIRES: asserts
-
 ; RUN: opt -S -debug-counter=early-cse=1 -passes=early-cse,newgvn,instcombine -earlycse-debug-hash \
 ; RUN:        -debug-counter=newgvn-vn=1-2 \
 ; RUN:        -print-debug-counter < %s 2>&1 | FileCheck %s
diff --git a/llvm/test/TableGen/GlobalISelCombinerEmitter/match-table-cxx.td b/llvm/test/TableGen/GlobalISelCombinerEmitter/match-table-cxx.td
index a8488ca3b8e6a..28017700a0448 100644
--- a/llvm/test/TableGen/GlobalISelCombinerEmitter/match-table-cxx.td
+++ b/llvm/test/TableGen/GlobalISelCombinerEmitter/match-table-cxx.td
@@ -96,7 +96,7 @@ def MyCombiner: GICombiner<"GenMyCombiner", [
 
 // CHECK:      const uint8_t *GenMyCombiner::getMatchTable() const {
 // CHECK-NEXT:   constexpr static uint8_t MatchTable0[] = {
-// CHECK-NEXT:      /*   0 */ GIM_SwitchOpcode, /*MI*/0, /*[*/GIMT_Encode2(104), GIMT_Encode2(216), /*)*//*default:*//*Label 5*/ GIMT_Encode4(524),
+// CHECK-NEXT:      /*   0 */ GIM_SwitchOpcode, /*MI*/0, /*[*/GIMT_Encode2(105), GIMT_Encode2(217), /*)*//*default:*//*Label 5*/ GIMT_Encode4(524),
 // CHECK-NEXT:      /* 10 */ /*TargetOpcode::G_STORE*//*Label 0*/ GIMT_Encode4(458), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0),
 // CHECK-NEXT:      /* 182 */ /*TargetOpcode::G_SEXT*//*Label 1*/ GIMT_Encode4(476), GIMT_Encode4(0),
 // CHECK-NEXT:      /* 190 */ /*TargetOpcode::G_ZEXT*//*Label 2*/ GIMT_Encode4(488), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0), GIMT_Encode4(0),
diff --git a/llvm/test/Transforms/AggressiveInstCombine/X86/or-load.ll b/llvm/test/Transforms/AggressiveInstCombine/X86/or-load.ll
index 46ec9e0a50842..f62a1ca15729b 100644
--- a/llvm/test/Transforms/AggressiveInstCombine/X86/or-load.ll
+++ b/llvm/test/Transforms/AggressiveInstCombine/X86/or-load.ll
@@ -2505,3 +2505,56 @@ entry:
   %or = or disjoint i32 %shl, %conv.2
   ret i32 %or
 }
+
+ at g = global i64 1060856922120
+
+; Make sure we use the correct memory location for alias analysis.
+define i64 @loadcombine_consecutive_mayalias(ptr %p) {
+; LE-LABEL: @loadcombine_consecutive_mayalias(
+; LE-NEXT:  entry:
+; LE-NEXT:    [[LOAD3:%.*]] = load i32, ptr [[P:%.*]], align 4
+; LE-NEXT:    [[GEP1:%.*]] = getelementptr i8, ptr [[P]], i64 4
+; LE-NEXT:    store i8 0, ptr getelementptr inbounds nuw (i8, ptr @g, i64 4), align 4
+; LE-NEXT:    [[LOAD2:%.*]] = load i32, ptr [[GEP1]], align 4
+; LE-NEXT:    [[TMP0:%.*]] = zext i32 [[LOAD2]] to i64
+; LE-NEXT:    [[TMP1:%.*]] = shl i64 [[TMP0]], 32
+; LE-NEXT:    [[ZEXT3:%.*]] = zext i32 [[LOAD3]] to i64
+; LE-NEXT:    [[LOAD1:%.*]] = or i64 [[TMP1]], [[ZEXT3]]
+; LE-NEXT:    [[RES:%.*]] = lshr i64 [[LOAD1]], 32
+; LE-NEXT:    ret i64 [[RES]]
+;
+; BE-LABEL: @loadcombine_consecutive_mayalias(
+; BE-NEXT:  entry:
+; BE-NEXT:    [[LOAD1:%.*]] = load i32, ptr [[P:%.*]], align 4
+; BE-NEXT:    [[GEP1:%.*]] = getelementptr i8, ptr [[P]], i64 4
+; BE-NEXT:    [[GEP2:%.*]] = getelementptr i8, ptr [[P]], i64 5
+; BE-NEXT:    store i8 0, ptr getelementptr inbounds nuw (i8, ptr @g, i64 4), align 4
+; BE-NEXT:    [[LOAD2:%.*]] = load i8, ptr [[GEP1]], align 4
+; BE-NEXT:    [[LOAD3:%.*]] = load i24, ptr [[GEP2]], align 1
+; BE-NEXT:    [[ZEXT1:%.*]] = zext i24 [[LOAD3]] to i64
+; BE-NEXT:    [[SHL1:%.*]] = shl i64 [[ZEXT1]], 40
+; BE-NEXT:    [[ZEXT2:%.*]] = zext i8 [[LOAD2]] to i64
+; BE-NEXT:    [[SHL2:%.*]] = shl i64 [[ZEXT2]], 32
+; BE-NEXT:    [[OR1:%.*]] = or i64 [[SHL1]], [[SHL2]]
+; BE-NEXT:    [[ZEXT3:%.*]] = zext i32 [[LOAD1]] to i64
+; BE-NEXT:    [[OR2:%.*]] = or i64 [[OR1]], [[ZEXT3]]
+; BE-NEXT:    [[RES:%.*]] = lshr i64 [[OR2]], 32
+; BE-NEXT:    ret i64 [[RES]]
+;
+entry:
+  %load1 = load i32, ptr %p, align 4
+  %gep1 = getelementptr i8, ptr %p, i64 4
+  %gep2 = getelementptr i8, ptr %p, i64 5
+  store i8 0, ptr getelementptr inbounds nuw (i8, ptr @g, i64 4), align 4
+  %load2 = load i8, ptr %gep1, align 4
+  %load3 = load i24, ptr %gep2, align 1
+  %zext1 = zext i24 %load3 to i64
+  %shl1 = shl i64 %zext1, 40
+  %zext2 = zext i8 %load2 to i64
+  %shl2 = shl i64 %zext2, 32
+  %or1 = or i64 %shl1, %shl2
+  %zext3 = zext i32 %load1 to i64
+  %or2 = or i64 %or1, %zext3
+  %res = lshr i64 %or2, 32
+  ret i64 %res
+}
diff --git a/llvm/test/Transforms/Attributor/dereferenceable-1.ll b/llvm/test/Transforms/Attributor/dereferenceable-1.ll
index 246a8c42ba912..5bff2a2e6b208 100644
--- a/llvm/test/Transforms/Attributor/dereferenceable-1.ll
+++ b/llvm/test/Transforms/Attributor/dereferenceable-1.ll
@@ -555,10 +555,12 @@ cont2:
 ;        *ptr = 4;
 ;    }
 ;  }
+;
+; FIXME: %ptr should be dereferenceable(4)
 define dso_local void @rec-branch-1(i32 %a, i32 %b, i32 %c, ptr %ptr) {
 ; CHECK: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(argmem: write)
 ; CHECK-LABEL: define {{[^@]+}}@rec-branch-1
-; CHECK-SAME: (i32 [[A:%.*]], i32 [[B:%.*]], i32 [[C:%.*]], ptr nofree nonnull writeonly align 4 captures(none) dereferenceable(4) [[PTR:%.*]]) #[[ATTR3]] {
+; CHECK-SAME: (i32 [[A:%.*]], i32 [[B:%.*]], i32 [[C:%.*]], ptr nofree writeonly captures(none) [[PTR:%.*]]) #[[ATTR3]] {
 ; CHECK-NEXT:  entry:
 ; CHECK-NEXT:    [[TOBOOL:%.*]] = icmp eq i32 [[A]], 0
 ; CHECK-NEXT:    br i1 [[TOBOOL]], label [[IF_ELSE3:%.*]], label [[IF_THEN:%.*]]
@@ -628,10 +630,11 @@ if.end8:                                          ; preds = %if.then5, %if.else6
 ;        rec-branch-2(1, 1, 1, ptr);
 ;    }
 ;  }
+; FIXME: %ptr should be dereferenceable(4)
 define dso_local void @rec-branch-2(i32 %a, i32 %b, i32 %c, ptr %ptr) {
 ; CHECK: Function Attrs: nofree nosync nounwind memory(argmem: write)
 ; CHECK-LABEL: define {{[^@]+}}@rec-branch-2
-; CHECK-SAME: (i32 [[A:%.*]], i32 [[B:%.*]], i32 [[C:%.*]], ptr nofree nonnull writeonly align 4 captures(none) dereferenceable(4) [[PTR:%.*]]) #[[ATTR5:[0-9]+]] {
+; CHECK-SAME: (i32 [[A:%.*]], i32 [[B:%.*]], i32 [[C:%.*]], ptr nofree writeonly captures(none) [[PTR:%.*]]) #[[ATTR5:[0-9]+]] {
 ; CHECK-NEXT:  entry:
 ; CHECK-NEXT:    [[TOBOOL:%.*]] = icmp eq i32 [[A]], 0
 ; CHECK-NEXT:    br i1 [[TOBOOL]], label [[IF_ELSE3:%.*]], label [[IF_THEN:%.*]]
@@ -651,7 +654,7 @@ define dso_local void @rec-branch-2(i32 %a, i32 %b, i32 %c, ptr %ptr) {
 ; CHECK-NEXT:    store i32 3, ptr [[PTR]], align 4
 ; CHECK-NEXT:    br label [[IF_END8]]
 ; CHECK:       if.else6:
-; CHECK-NEXT:    tail call void @rec-branch-2(i32 noundef 1, i32 noundef 1, i32 noundef 1, ptr nofree nonnull writeonly align 4 captures(none) dereferenceable(4) [[PTR]]) #[[ATTR8:[0-9]+]]
+; CHECK-NEXT:    tail call void @rec-branch-2(i32 noundef 1, i32 noundef 1, i32 noundef 1, ptr nofree writeonly captures(none) [[PTR]]) #[[ATTR8:[0-9]+]]
 ; CHECK-NEXT:    br label [[IF_END8]]
 ; CHECK:       if.end8:
 ; CHECK-NEXT:    ret void
diff --git a/llvm/test/Transforms/Attributor/nofpclass.ll b/llvm/test/Transforms/Attributor/nofpclass.ll
index a9ebdaa397015..d82dc412f5e36 100644
--- a/llvm/test/Transforms/Attributor/nofpclass.ll
+++ b/llvm/test/Transforms/Attributor/nofpclass.ll
@@ -2667,15 +2667,10 @@ define [4 x float] @constant_aggregate_zero() {
 }
 
 define <vscale x 4 x float> @scalable_splat_pnorm() {
-; CHECK-CV: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
-; CHECK-CV-LABEL: define noundef <vscale x 4 x float> @scalable_splat_pnorm
-; CHECK-CV-SAME: () #[[ATTR3]] {
-; CHECK-CV-NEXT:    ret <vscale x 4 x float> splat (float 1.000000e+00)
-;
-; CHECK-CI: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
-; CHECK-CI-LABEL: define noundef nofpclass(nan inf zero sub nnorm) <vscale x 4 x float> @scalable_splat_pnorm
-; CHECK-CI-SAME: () #[[ATTR3]] {
-; CHECK-CI-NEXT:    ret <vscale x 4 x float> splat (float 1.000000e+00)
+; CHECK: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
+; CHECK-LABEL: define noundef nofpclass(nan inf zero sub nnorm) <vscale x 4 x float> @scalable_splat_pnorm
+; CHECK-SAME: () #[[ATTR3]] {
+; CHECK-NEXT:    ret <vscale x 4 x float> splat (float 1.000000e+00)
 ;
   ret <vscale x 4 x float> splat (float 1.0)
 }
@@ -2689,6 +2684,19 @@ define <vscale x 4 x float> @scalable_splat_zero() {
   ret <vscale x 4 x float> zeroinitializer
 }
 
+define <vscale x 4 x float> @scalable_splat_nnan(float nofpclass(nan) %x) {
+; CHECK: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
+; CHECK-LABEL: define nofpclass(nan) <vscale x 4 x float> @scalable_splat_nnan
+; CHECK-SAME: (float nofpclass(nan) [[X:%.*]]) #[[ATTR3]] {
+; CHECK-NEXT:    [[HEAD:%.*]] = insertelement <vscale x 4 x float> poison, float [[X]], i32 0
+; CHECK-NEXT:    [[SPLAT:%.*]] = shufflevector <vscale x 4 x float> [[HEAD]], <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    ret <vscale x 4 x float> [[SPLAT]]
+;
+  %head = insertelement <vscale x 4 x float> poison, float %x, i32 0
+  %splat = shufflevector <vscale x 4 x float> %head, <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer
+  ret <vscale x 4 x float> %splat
+}
+
 ; Verify we do not derive 'nofpclass(inf zero sub norm)' for the argument __x.
 ; See https://github.com/llvm/llvm-project/issues/78507
 
@@ -2989,5 +2997,7 @@ attributes #5 = { "denormal-fp-math"="ieee,positive-zero" }
 ;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
 ; CGSCC-CI: {{.*}}
 ; CGSCC-CV: {{.*}}
+; CHECK-CI: {{.*}}
+; CHECK-CV: {{.*}}
 ; TUNIT-CI: {{.*}}
 ; TUNIT-CV: {{.*}}
diff --git a/llvm/test/Transforms/Attributor/nonnull.ll b/llvm/test/Transforms/Attributor/nonnull.ll
index 57a6d09af64fa..2ff8a3fa3a688 100644
--- a/llvm/test/Transforms/Attributor/nonnull.ll
+++ b/llvm/test/Transforms/Attributor/nonnull.ll
@@ -32,27 +32,16 @@ define ptr @test2(ptr nonnull %p) {
 }
 
 define ptr @test2A(i1 %c, ptr %ret) {
-; TUNIT: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(inaccessiblemem: write)
-; TUNIT-LABEL: define {{[^@]+}}@test2A
-; TUNIT-SAME: (i1 noundef [[C:%.*]], ptr nofree nonnull readnone returned "no-capture-maybe-returned" [[RET:%.*]]) #[[ATTR2:[0-9]+]] {
-; TUNIT-NEXT:    br i1 [[C]], label [[A:%.*]], label [[B:%.*]]
-; TUNIT:       A:
-; TUNIT-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR15:[0-9]+]] [ "nonnull"(ptr [[RET]]) ]
-; TUNIT-NEXT:    ret ptr [[RET]]
-; TUNIT:       B:
-; TUNIT-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR15]] [ "nonnull"(ptr [[RET]]) ]
-; TUNIT-NEXT:    ret ptr [[RET]]
-;
-; CGSCC: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(inaccessiblemem: write)
-; CGSCC-LABEL: define {{[^@]+}}@test2A
-; CGSCC-SAME: (i1 noundef [[C:%.*]], ptr nofree nonnull readnone returned "no-capture-maybe-returned" [[RET:%.*]]) #[[ATTR2:[0-9]+]] {
-; CGSCC-NEXT:    br i1 [[C]], label [[A:%.*]], label [[B:%.*]]
-; CGSCC:       A:
-; CGSCC-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR16:[0-9]+]] [ "nonnull"(ptr [[RET]]) ]
-; CGSCC-NEXT:    ret ptr [[RET]]
-; CGSCC:       B:
-; CGSCC-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR16]] [ "nonnull"(ptr [[RET]]) ]
-; CGSCC-NEXT:    ret ptr [[RET]]
+; CHECK: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(inaccessiblemem: write)
+; CHECK-LABEL: define {{[^@]+}}@test2A
+; CHECK-SAME: (i1 noundef [[C:%.*]], ptr nofree nonnull readnone returned "no-capture-maybe-returned" [[RET:%.*]]) #[[ATTR2:[0-9]+]] {
+; CHECK-NEXT:    br i1 [[C]], label [[A:%.*]], label [[B:%.*]]
+; CHECK:       A:
+; CHECK-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR16:[0-9]+]] [ "nonnull"(ptr [[RET]]) ]
+; CHECK-NEXT:    ret ptr [[RET]]
+; CHECK:       B:
+; CHECK-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR16]] [ "nonnull"(ptr [[RET]]) ]
+; CHECK-NEXT:    ret ptr [[RET]]
 ;
   br i1 %c, label %A, label %B
 A:
@@ -64,27 +53,16 @@ B:
 }
 
 define ptr @test2B(i1 %c, ptr %ret) {
-; TUNIT: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(inaccessiblemem: write)
-; TUNIT-LABEL: define {{[^@]+}}@test2B
-; TUNIT-SAME: (i1 noundef [[C:%.*]], ptr nofree nonnull readnone returned dereferenceable(4) "no-capture-maybe-returned" [[RET:%.*]]) #[[ATTR2]] {
-; TUNIT-NEXT:    br i1 [[C]], label [[A:%.*]], label [[B:%.*]]
-; TUNIT:       A:
-; TUNIT-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR15]] [ "dereferenceable"(ptr [[RET]], i32 4) ]
-; TUNIT-NEXT:    ret ptr [[RET]]
-; TUNIT:       B:
-; TUNIT-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR15]] [ "dereferenceable"(ptr [[RET]], i32 4) ]
-; TUNIT-NEXT:    ret ptr [[RET]]
-;
-; CGSCC: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(inaccessiblemem: write)
-; CGSCC-LABEL: define {{[^@]+}}@test2B
-; CGSCC-SAME: (i1 noundef [[C:%.*]], ptr nofree nonnull readnone returned dereferenceable(4) "no-capture-maybe-returned" [[RET:%.*]]) #[[ATTR2]] {
-; CGSCC-NEXT:    br i1 [[C]], label [[A:%.*]], label [[B:%.*]]
-; CGSCC:       A:
-; CGSCC-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR16]] [ "dereferenceable"(ptr [[RET]], i32 4) ]
-; CGSCC-NEXT:    ret ptr [[RET]]
-; CGSCC:       B:
-; CGSCC-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR16]] [ "dereferenceable"(ptr [[RET]], i32 4) ]
-; CGSCC-NEXT:    ret ptr [[RET]]
+; CHECK: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(inaccessiblemem: write)
+; CHECK-LABEL: define {{[^@]+}}@test2B
+; CHECK-SAME: (i1 noundef [[C:%.*]], ptr nofree nonnull readnone returned dereferenceable(4) "no-capture-maybe-returned" [[RET:%.*]]) #[[ATTR2]] {
+; CHECK-NEXT:    br i1 [[C]], label [[A:%.*]], label [[B:%.*]]
+; CHECK:       A:
+; CHECK-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR16]] [ "dereferenceable"(ptr [[RET]], i32 4) ]
+; CHECK-NEXT:    ret ptr [[RET]]
+; CHECK:       B:
+; CHECK-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR16]] [ "dereferenceable"(ptr [[RET]], i32 4) ]
+; CHECK-NEXT:    ret ptr [[RET]]
 ;
   br i1 %c, label %A, label %B
 A:
@@ -295,21 +273,13 @@ define ptr @test9(ptr %a, i64 %n) {
 ; ATTRIBUTOR_OPM: define ptr @test10
 ; ATTRIBUTOR_NPM: define nonnull ptr @test10
 define ptr @test10(ptr %a, i64 %n) {
-; TUNIT: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(inaccessiblemem: write)
-; TUNIT-LABEL: define {{[^@]+}}@test10
-; TUNIT-SAME: (ptr nofree readnone "no-capture-maybe-returned" [[A:%.*]], i64 [[N:%.*]]) #[[ATTR2]] {
-; TUNIT-NEXT:    [[CMP:%.*]] = icmp ne i64 [[N]], 0
-; TUNIT-NEXT:    call void @llvm.assume(i1 noundef [[CMP]]) #[[ATTR15]]
-; TUNIT-NEXT:    [[B:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[N]]
-; TUNIT-NEXT:    ret ptr [[B]]
-;
-; CGSCC: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(inaccessiblemem: write)
-; CGSCC-LABEL: define {{[^@]+}}@test10
-; CGSCC-SAME: (ptr nofree readnone "no-capture-maybe-returned" [[A:%.*]], i64 [[N:%.*]]) #[[ATTR2]] {
-; CGSCC-NEXT:    [[CMP:%.*]] = icmp ne i64 [[N]], 0
-; CGSCC-NEXT:    call void @llvm.assume(i1 noundef [[CMP]]) #[[ATTR16]]
-; CGSCC-NEXT:    [[B:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[N]]
-; CGSCC-NEXT:    ret ptr [[B]]
+; CHECK: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(inaccessiblemem: write)
+; CHECK-LABEL: define {{[^@]+}}@test10
+; CHECK-SAME: (ptr nofree readnone "no-capture-maybe-returned" [[A:%.*]], i64 [[N:%.*]]) #[[ATTR2]] {
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ne i64 [[N]], 0
+; CHECK-NEXT:    call void @llvm.assume(i1 noundef [[CMP]]) #[[ATTR16]]
+; CHECK-NEXT:    [[B:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[N]]
+; CHECK-NEXT:    ret ptr [[B]]
 ;
   %cmp = icmp ne i64 %n, 0
   call void @llvm.assume(i1 %cmp)
@@ -422,22 +392,50 @@ declare nonnull ptr @nonnull()
 
 
 define internal ptr @f1(ptr %arg) {
-; CGSCC: Function Attrs: mustprogress nofree nosync nounwind willreturn memory(argmem: read)
+; FIXME: missing nonnull. It should be nonnull @f1(ptr nonnull readonly %arg)
+; TUNIT: Function Attrs: nofree nosync nounwind memory(argmem: read)
+; TUNIT-LABEL: define {{[^@]+}}@f1
+; TUNIT-SAME: (ptr nofree readonly [[ARG:%.*]]) #[[ATTR6:[0-9]+]] {
+; TUNIT-NEXT:  bb:
+; TUNIT-NEXT:    [[TMP:%.*]] = icmp eq ptr [[ARG]], null
+; TUNIT-NEXT:    br i1 [[TMP]], label [[BB9:%.*]], label [[BB1:%.*]]
+; TUNIT:       bb1:
+; TUNIT-NEXT:    [[TMP2:%.*]] = load i32, ptr [[ARG]], align 4
+; TUNIT-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[TMP2]], 0
+; TUNIT-NEXT:    br i1 [[TMP3]], label [[BB6:%.*]], label [[BB4:%.*]]
+; TUNIT:       bb4:
+; TUNIT-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[ARG]], i64 1
+; TUNIT-NEXT:    [[TMP5B:%.*]] = tail call ptr @f3(ptr nofree nonnull readonly [[TMP5]]) #[[ATTR17:[0-9]+]]
+; TUNIT-NEXT:    [[TMP5C:%.*]] = getelementptr inbounds i32, ptr [[TMP5B]], i64 -1
+; TUNIT-NEXT:    br label [[BB9]]
+; TUNIT:       bb6:
+; TUNIT-NEXT:    [[TMP7:%.*]] = tail call ptr @f2(ptr nofree nonnull readonly align 4 dereferenceable(4) [[ARG]]) #[[ATTR17]]
+; TUNIT-NEXT:    ret ptr [[TMP7]]
+; TUNIT:       bb9:
+; TUNIT-NEXT:    [[TMP10:%.*]] = phi ptr [ [[TMP5C]], [[BB4]] ], [ inttoptr (i64 4 to ptr), [[BB:%.*]] ]
+; TUNIT-NEXT:    ret ptr [[TMP10]]
+;
+; CGSCC: Function Attrs: nofree nosync nounwind memory(argmem: read)
 ; CGSCC-LABEL: define {{[^@]+}}@f1
-; CGSCC-SAME: (ptr nofree nonnull readonly align 4 captures(none) dereferenceable(4) [[ARG:%.*]]) #[[ATTR5:[0-9]+]] {
+; CGSCC-SAME: (ptr nofree readonly [[ARG:%.*]]) #[[ATTR5:[0-9]+]] {
 ; CGSCC-NEXT:  bb:
-; CGSCC-NEXT:    br label [[BB1:%.*]]
+; CGSCC-NEXT:    [[TMP:%.*]] = icmp eq ptr [[ARG]], null
+; CGSCC-NEXT:    br i1 [[TMP]], label [[BB9:%.*]], label [[BB1:%.*]]
 ; CGSCC:       bb1:
-; CGSCC-NEXT:    [[TMP2:%.*]] = load i32, ptr [[ARG]], align 4, !invariant.load [[META0:![0-9]+]]
+; CGSCC-NEXT:    [[TMP2:%.*]] = load i32, ptr [[ARG]], align 4
 ; CGSCC-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[TMP2]], 0
 ; CGSCC-NEXT:    br i1 [[TMP3]], label [[BB6:%.*]], label [[BB4:%.*]]
 ; CGSCC:       bb4:
-; CGSCC-NEXT:    [[TMP5C:%.*]] = getelementptr inbounds i32, ptr undef, i64 -1
-; CGSCC-NEXT:    br label [[BB9:%.*]]
+; CGSCC-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[ARG]], i64 1
+; CGSCC-NEXT:    [[TMP5B:%.*]] = tail call ptr @f3(ptr nofree nonnull readonly [[TMP5]]) #[[ATTR17:[0-9]+]]
+; CGSCC-NEXT:    [[TMP5C:%.*]] = getelementptr inbounds i32, ptr [[TMP5B]], i64 -1
+; CGSCC-NEXT:    br label [[BB9]]
 ; CGSCC:       bb6:
-; CGSCC-NEXT:    ret ptr undef
+; CGSCC-NEXT:    [[TMP7:%.*]] = tail call ptr @f2(ptr nofree nonnull readonly align 4 dereferenceable(4) [[ARG]]) #[[ATTR17]]
+; CGSCC-NEXT:    ret ptr [[TMP7]]
 ; CGSCC:       bb9:
-; CGSCC-NEXT:    ret ptr undef
+; CGSCC-NEXT:    [[TMP10:%.*]] = phi ptr [ [[TMP5C]], [[BB4]] ], [ inttoptr (i64 4 to ptr), [[BB:%.*]] ]
+; CGSCC-NEXT:    ret ptr [[TMP10]]
 ;
 
 bb:
@@ -465,11 +463,19 @@ bb9:                                              ; preds = %bb4, %bb
 }
 
 define internal ptr @f2(ptr %arg) {
-; CGSCC: Function Attrs: mustprogress nofree nosync nounwind willreturn memory(none)
+; TUNIT: Function Attrs: nofree nosync nounwind memory(argmem: read)
+; TUNIT-LABEL: define {{[^@]+}}@f2
+; TUNIT-SAME: (ptr nofree nonnull readonly align 4 dereferenceable(4) [[ARG:%.*]]) #[[ATTR6]] {
+; TUNIT-NEXT:  bb:
+; TUNIT-NEXT:    [[TMP:%.*]] = tail call ptr @f1(ptr nofree readonly [[ARG]]) #[[ATTR17]]
+; TUNIT-NEXT:    ret ptr [[TMP]]
+;
+; CGSCC: Function Attrs: nofree nosync nounwind memory(argmem: read)
 ; CGSCC-LABEL: define {{[^@]+}}@f2
-; CGSCC-SAME: (ptr noalias nofree nonnull readnone align 4 captures(none) dereferenceable(4) [[ARG:%.*]]) #[[ATTR6:[0-9]+]] {
+; CGSCC-SAME: (ptr nofree nonnull readonly align 4 dereferenceable(4) [[ARG:%.*]]) #[[ATTR5]] {
 ; CGSCC-NEXT:  bb:
-; CGSCC-NEXT:    ret ptr undef
+; CGSCC-NEXT:    [[TMP:%.*]] = tail call ptr @f1(ptr nofree readonly [[ARG]]) #[[ATTR17]]
+; CGSCC-NEXT:    ret ptr [[TMP]]
 ;
 bb:
   %tmp = tail call ptr @f1(ptr %arg)
@@ -478,17 +484,19 @@ bb:
 
 define dso_local noalias ptr @f3(ptr %arg) {
 ; FIXME: missing nonnull. It should be nonnull @f3(ptr nonnull readonly %arg)
-; TUNIT: Function Attrs: mustprogress nofree nosync nounwind willreturn memory(none)
+; TUNIT: Function Attrs: nofree nosync nounwind memory(argmem: read)
 ; TUNIT-LABEL: define {{[^@]+}}@f3
-; TUNIT-SAME: (ptr nofree readnone captures(none) [[ARG:%.*]]) #[[ATTR3]] {
+; TUNIT-SAME: (ptr nofree readonly [[ARG:%.*]]) #[[ATTR6]] {
 ; TUNIT-NEXT:  bb:
-; TUNIT-NEXT:    ret ptr undef
+; TUNIT-NEXT:    [[TMP:%.*]] = call ptr @f1(ptr nofree readonly [[ARG]]) #[[ATTR17]]
+; TUNIT-NEXT:    ret ptr [[TMP]]
 ;
-; CGSCC: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
+; CGSCC: Function Attrs: nofree nosync nounwind memory(argmem: read)
 ; CGSCC-LABEL: define {{[^@]+}}@f3
-; CGSCC-SAME: (ptr nofree readnone captures(none) [[ARG:%.*]]) #[[ATTR1]] {
+; CGSCC-SAME: (ptr nofree readonly [[ARG:%.*]]) #[[ATTR5]] {
 ; CGSCC-NEXT:  bb:
-; CGSCC-NEXT:    ret ptr undef
+; CGSCC-NEXT:    [[TMP:%.*]] = call ptr @f1(ptr nofree readonly [[ARG]]) #[[ATTR17]]
+; CGSCC-NEXT:    ret ptr [[TMP]]
 ;
 bb:
 ; FIXME: missing nonnull. It should be @f1(ptr nonnull readonly %arg)
@@ -521,26 +529,26 @@ declare void @fun3(ptr, ptr, ptr) #1
 define void @f16(ptr %a, ptr %b, i8 %c) {
 ; TUNIT: Function Attrs: mustprogress nounwind willreturn
 ; TUNIT-LABEL: define {{[^@]+}}@f16
-; TUNIT-SAME: (ptr nonnull [[A:%.*]], ptr [[B:%.*]], i8 [[C:%.*]]) #[[ATTR7:[0-9]+]] {
+; TUNIT-SAME: (ptr nonnull [[A:%.*]], ptr [[B:%.*]], i8 [[C:%.*]]) #[[ATTR8:[0-9]+]] {
 ; TUNIT-NEXT:    [[CMP:%.*]] = icmp eq i8 [[C]], 0
 ; TUNIT-NEXT:    br i1 [[CMP]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
 ; TUNIT:       if.then:
-; TUNIT-NEXT:    tail call void @fun2(ptr nonnull [[A]], ptr nonnull [[B]]) #[[ATTR6:[0-9]+]]
+; TUNIT-NEXT:    tail call void @fun2(ptr nonnull [[A]], ptr nonnull [[B]]) #[[ATTR7:[0-9]+]]
 ; TUNIT-NEXT:    ret void
 ; TUNIT:       if.else:
-; TUNIT-NEXT:    tail call void @fun2(ptr nonnull [[A]], ptr [[B]]) #[[ATTR6]]
+; TUNIT-NEXT:    tail call void @fun2(ptr nonnull [[A]], ptr [[B]]) #[[ATTR7]]
 ; TUNIT-NEXT:    ret void
 ;
 ; CGSCC: Function Attrs: mustprogress nounwind willreturn
 ; CGSCC-LABEL: define {{[^@]+}}@f16
-; CGSCC-SAME: (ptr nonnull [[A:%.*]], ptr [[B:%.*]], i8 [[C:%.*]]) #[[ATTR8:[0-9]+]] {
+; CGSCC-SAME: (ptr nonnull [[A:%.*]], ptr [[B:%.*]], i8 [[C:%.*]]) #[[ATTR7:[0-9]+]] {
 ; CGSCC-NEXT:    [[CMP:%.*]] = icmp eq i8 [[C]], 0
 ; CGSCC-NEXT:    br i1 [[CMP]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
 ; CGSCC:       if.then:
-; CGSCC-NEXT:    tail call void @fun2(ptr nonnull [[A]], ptr nonnull [[B]]) #[[ATTR7:[0-9]+]]
+; CGSCC-NEXT:    tail call void @fun2(ptr nonnull [[A]], ptr nonnull [[B]]) #[[ATTR6:[0-9]+]]
 ; CGSCC-NEXT:    ret void
 ; CGSCC:       if.else:
-; CGSCC-NEXT:    tail call void @fun2(ptr nonnull [[A]], ptr [[B]]) #[[ATTR7]]
+; CGSCC-NEXT:    tail call void @fun2(ptr nonnull [[A]], ptr [[B]]) #[[ATTR6]]
 ; CGSCC-NEXT:    ret void
 ;
   %cmp = icmp eq i8 %c, 0
@@ -563,32 +571,32 @@ define void @f17(ptr %a, i8 %c) {
 ;
 ; TUNIT: Function Attrs: mustprogress nounwind willreturn
 ; TUNIT-LABEL: define {{[^@]+}}@f17
-; TUNIT-SAME: (ptr nonnull [[A:%.*]], i8 [[C:%.*]]) #[[ATTR7]] {
+; TUNIT-SAME: (ptr nonnull [[A:%.*]], i8 [[C:%.*]]) #[[ATTR8]] {
 ; TUNIT-NEXT:    [[CMP:%.*]] = icmp eq i8 [[C]], 0
 ; TUNIT-NEXT:    br i1 [[CMP]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
 ; TUNIT:       if.then:
-; TUNIT-NEXT:    tail call void @fun0() #[[ATTR6]]
+; TUNIT-NEXT:    tail call void @fun0() #[[ATTR7]]
 ; TUNIT-NEXT:    br label [[CONT:%.*]]
 ; TUNIT:       if.else:
-; TUNIT-NEXT:    tail call void @fun0() #[[ATTR6]]
+; TUNIT-NEXT:    tail call void @fun0() #[[ATTR7]]
 ; TUNIT-NEXT:    br label [[CONT]]
 ; TUNIT:       cont:
-; TUNIT-NEXT:    tail call void @fun1(ptr nonnull [[A]]) #[[ATTR6]]
+; TUNIT-NEXT:    tail call void @fun1(ptr nonnull [[A]]) #[[ATTR7]]
 ; TUNIT-NEXT:    ret void
 ;
 ; CGSCC: Function Attrs: mustprogress nounwind willreturn
 ; CGSCC-LABEL: define {{[^@]+}}@f17
-; CGSCC-SAME: (ptr nonnull [[A:%.*]], i8 [[C:%.*]]) #[[ATTR8]] {
+; CGSCC-SAME: (ptr nonnull [[A:%.*]], i8 [[C:%.*]]) #[[ATTR7]] {
 ; CGSCC-NEXT:    [[CMP:%.*]] = icmp eq i8 [[C]], 0
 ; CGSCC-NEXT:    br i1 [[CMP]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
 ; CGSCC:       if.then:
-; CGSCC-NEXT:    tail call void @fun0() #[[ATTR7]]
+; CGSCC-NEXT:    tail call void @fun0() #[[ATTR6]]
 ; CGSCC-NEXT:    br label [[CONT:%.*]]
 ; CGSCC:       if.else:
-; CGSCC-NEXT:    tail call void @fun0() #[[ATTR7]]
+; CGSCC-NEXT:    tail call void @fun0() #[[ATTR6]]
 ; CGSCC-NEXT:    br label [[CONT]]
 ; CGSCC:       cont:
-; CGSCC-NEXT:    tail call void @fun1(ptr nonnull [[A]]) #[[ATTR7]]
+; CGSCC-NEXT:    tail call void @fun1(ptr nonnull [[A]]) #[[ATTR6]]
 ; CGSCC-NEXT:    ret void
 ;
   %cmp = icmp eq i8 %c, 0
@@ -617,50 +625,50 @@ cont:
 define void @f18(ptr %a, ptr %b, i8 %c) {
 ; TUNIT: Function Attrs: mustprogress nounwind willreturn
 ; TUNIT-LABEL: define {{[^@]+}}@f18
-; TUNIT-SAME: (ptr nonnull [[A:%.*]], ptr [[B:%.*]], i8 [[C:%.*]]) #[[ATTR7]] {
+; TUNIT-SAME: (ptr nonnull [[A:%.*]], ptr [[B:%.*]], i8 [[C:%.*]]) #[[ATTR8]] {
 ; TUNIT-NEXT:    [[CMP1:%.*]] = icmp eq i8 [[C]], 0
 ; TUNIT-NEXT:    br i1 [[CMP1]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
 ; TUNIT:       if.then:
-; TUNIT-NEXT:    tail call void @fun0() #[[ATTR6]]
+; TUNIT-NEXT:    tail call void @fun0() #[[ATTR7]]
 ; TUNIT-NEXT:    br label [[CONT:%.*]]
 ; TUNIT:       if.else:
-; TUNIT-NEXT:    tail call void @fun0() #[[ATTR6]]
+; TUNIT-NEXT:    tail call void @fun0() #[[ATTR7]]
 ; TUNIT-NEXT:    br label [[CONT]]
 ; TUNIT:       cont:
 ; TUNIT-NEXT:    [[CMP2:%.*]] = icmp eq i8 [[C]], 1
 ; TUNIT-NEXT:    br i1 [[CMP2]], label [[CONT_THEN:%.*]], label [[CONT_ELSE:%.*]]
 ; TUNIT:       cont.then:
-; TUNIT-NEXT:    tail call void @fun1(ptr nonnull [[B]]) #[[ATTR6]]
+; TUNIT-NEXT:    tail call void @fun1(ptr nonnull [[B]]) #[[ATTR7]]
 ; TUNIT-NEXT:    br label [[CONT2:%.*]]
 ; TUNIT:       cont.else:
-; TUNIT-NEXT:    tail call void @fun0() #[[ATTR6]]
+; TUNIT-NEXT:    tail call void @fun0() #[[ATTR7]]
 ; TUNIT-NEXT:    br label [[CONT2]]
 ; TUNIT:       cont2:
-; TUNIT-NEXT:    tail call void @fun1(ptr nonnull [[A]]) #[[ATTR6]]
+; TUNIT-NEXT:    tail call void @fun1(ptr nonnull [[A]]) #[[ATTR7]]
 ; TUNIT-NEXT:    ret void
 ;
 ; CGSCC: Function Attrs: mustprogress nounwind willreturn
 ; CGSCC-LABEL: define {{[^@]+}}@f18
-; CGSCC-SAME: (ptr nonnull [[A:%.*]], ptr [[B:%.*]], i8 [[C:%.*]]) #[[ATTR8]] {
+; CGSCC-SAME: (ptr nonnull [[A:%.*]], ptr [[B:%.*]], i8 [[C:%.*]]) #[[ATTR7]] {
 ; CGSCC-NEXT:    [[CMP1:%.*]] = icmp eq i8 [[C]], 0
 ; CGSCC-NEXT:    br i1 [[CMP1]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
 ; CGSCC:       if.then:
-; CGSCC-NEXT:    tail call void @fun0() #[[ATTR7]]
+; CGSCC-NEXT:    tail call void @fun0() #[[ATTR6]]
 ; CGSCC-NEXT:    br label [[CONT:%.*]]
 ; CGSCC:       if.else:
-; CGSCC-NEXT:    tail call void @fun0() #[[ATTR7]]
+; CGSCC-NEXT:    tail call void @fun0() #[[ATTR6]]
 ; CGSCC-NEXT:    br label [[CONT]]
 ; CGSCC:       cont:
 ; CGSCC-NEXT:    [[CMP2:%.*]] = icmp eq i8 [[C]], 1
 ; CGSCC-NEXT:    br i1 [[CMP2]], label [[CONT_THEN:%.*]], label [[CONT_ELSE:%.*]]
 ; CGSCC:       cont.then:
-; CGSCC-NEXT:    tail call void @fun1(ptr nonnull [[B]]) #[[ATTR7]]
+; CGSCC-NEXT:    tail call void @fun1(ptr nonnull [[B]]) #[[ATTR6]]
 ; CGSCC-NEXT:    br label [[CONT2:%.*]]
 ; CGSCC:       cont.else:
-; CGSCC-NEXT:    tail call void @fun0() #[[ATTR7]]
+; CGSCC-NEXT:    tail call void @fun0() #[[ATTR6]]
 ; CGSCC-NEXT:    br label [[CONT2]]
 ; CGSCC:       cont2:
-; CGSCC-NEXT:    tail call void @fun1(ptr nonnull [[A]]) #[[ATTR7]]
+; CGSCC-NEXT:    tail call void @fun1(ptr nonnull [[A]]) #[[ATTR6]]
 ; CGSCC-NEXT:    ret void
 ;
   %cmp1 = icmp eq i8 %c, 0
@@ -849,17 +857,11 @@ define i8 @parent6(ptr %a, ptr %b) {
 ; The nonnull callsite is guaranteed to execute, so the argument must be nonnull throughout the parent.
 
 define i8 @parent7(ptr %a) {
-; TUNIT-LABEL: define {{[^@]+}}@parent7
-; TUNIT-SAME: (ptr nonnull [[A:%.*]]) {
-; TUNIT-NEXT:    [[RET:%.*]] = call i8 @use1safecall(ptr nonnull readonly [[A]]) #[[ATTR16:[0-9]+]]
-; TUNIT-NEXT:    call void @use1nonnull(ptr nonnull [[A]])
-; TUNIT-NEXT:    ret i8 [[RET]]
-;
-; CGSCC-LABEL: define {{[^@]+}}@parent7
-; CGSCC-SAME: (ptr nonnull [[A:%.*]]) {
-; CGSCC-NEXT:    [[RET:%.*]] = call i8 @use1safecall(ptr nonnull readonly [[A]]) #[[ATTR17:[0-9]+]]
-; CGSCC-NEXT:    call void @use1nonnull(ptr nonnull [[A]])
-; CGSCC-NEXT:    ret i8 [[RET]]
+; CHECK-LABEL: define {{[^@]+}}@parent7
+; CHECK-SAME: (ptr nonnull [[A:%.*]]) {
+; CHECK-NEXT:    [[RET:%.*]] = call i8 @use1safecall(ptr nonnull readonly [[A]]) #[[ATTR18:[0-9]+]]
+; CHECK-NEXT:    call void @use1nonnull(ptr nonnull [[A]])
+; CHECK-NEXT:    ret i8 [[RET]]
 ;
 
 
@@ -929,13 +931,13 @@ define ptr @gep1_no_null_opt(ptr %p) #0 {
 ; Shouldn't be able to derive nonnull based on gep.
 ; TUNIT: Function Attrs: mustprogress nofree norecurse nosync nounwind null_pointer_is_valid willreturn memory(none)
 ; TUNIT-LABEL: define {{[^@]+}}@gep1_no_null_opt
-; TUNIT-SAME: (ptr nofree readnone "no-capture-maybe-returned" [[P:%.*]]) #[[ATTR9:[0-9]+]] {
+; TUNIT-SAME: (ptr nofree readnone "no-capture-maybe-returned" [[P:%.*]]) #[[ATTR10:[0-9]+]] {
 ; TUNIT-NEXT:    [[Q:%.*]] = getelementptr inbounds i32, ptr [[P]], i32 1
 ; TUNIT-NEXT:    ret ptr [[Q]]
 ;
 ; CGSCC: Function Attrs: mustprogress nofree norecurse nosync nounwind null_pointer_is_valid willreturn memory(none)
 ; CGSCC-LABEL: define {{[^@]+}}@gep1_no_null_opt
-; CGSCC-SAME: (ptr nofree readnone "no-capture-maybe-returned" [[P:%.*]]) #[[ATTR10:[0-9]+]] {
+; CGSCC-SAME: (ptr nofree readnone "no-capture-maybe-returned" [[P:%.*]]) #[[ATTR9:[0-9]+]] {
 ; CGSCC-NEXT:    [[Q:%.*]] = getelementptr inbounds i32, ptr [[P]], i32 1
 ; CGSCC-NEXT:    ret ptr [[Q]]
 ;
@@ -981,8 +983,8 @@ define ptr @g1() {
 ;
 ; CGSCC: Function Attrs: mustprogress nofree nosync nounwind willreturn memory(none)
 ; CGSCC-LABEL: define {{[^@]+}}@g1
-; CGSCC-SAME: () #[[ATTR6]] {
-; CGSCC-NEXT:    [[C:%.*]] = call noundef nonnull align 4 ptr @g2() #[[ATTR18:[0-9]+]]
+; CGSCC-SAME: () #[[ATTR10:[0-9]+]] {
+; CGSCC-NEXT:    [[C:%.*]] = call noundef nonnull align 4 ptr @g2() #[[ATTR19:[0-9]+]]
 ; CGSCC-NEXT:    ret ptr [[C]]
 ;
   %c = call ptr @g2()
@@ -1043,32 +1045,21 @@ define internal void @control(ptr dereferenceable(4) %a) {
 }
 ; Avoid nonnull as we do not touch naked functions
 define internal void @naked(ptr dereferenceable(4) %a) naked {
-; TUNIT: Function Attrs: naked
-; TUNIT-LABEL: define {{[^@]+}}@naked
-; TUNIT-SAME: (ptr noundef nonnull dereferenceable(4) [[A:%.*]]) #[[ATTR10:[0-9]+]] {
-; TUNIT-NEXT:    ret void
-;
-; CGSCC: Function Attrs: naked
-; CGSCC-LABEL: define {{[^@]+}}@naked
-; CGSCC-SAME: (ptr noundef nonnull dereferenceable(4) [[A:%.*]]) #[[ATTR11:[0-9]+]] {
-; CGSCC-NEXT:    ret void
+; CHECK: Function Attrs: naked
+; CHECK-LABEL: define {{[^@]+}}@naked
+; CHECK-SAME: (ptr noundef nonnull dereferenceable(4) [[A:%.*]]) #[[ATTR11:[0-9]+]] {
+; CHECK-NEXT:    ret void
 ;
   ret void
 }
 ; Avoid nonnull as we do not touch optnone
 define internal void @optnone(ptr dereferenceable(4) %a) optnone noinline {
 ;
-; TUNIT: Function Attrs: noinline optnone
-; TUNIT-LABEL: define {{[^@]+}}@optnone
-; TUNIT-SAME: (ptr noundef nonnull dereferenceable(4) [[A:%.*]]) #[[ATTR11:[0-9]+]] {
-; TUNIT-NEXT:    call void @use_i32_ptr(ptr nofree noundef nonnull captures(none) [[A]])
-; TUNIT-NEXT:    ret void
-;
-; CGSCC: Function Attrs: noinline optnone
-; CGSCC-LABEL: define {{[^@]+}}@optnone
-; CGSCC-SAME: (ptr noundef nonnull dereferenceable(4) [[A:%.*]]) #[[ATTR12:[0-9]+]] {
-; CGSCC-NEXT:    call void @use_i32_ptr(ptr nofree noundef nonnull captures(none) [[A]])
-; CGSCC-NEXT:    ret void
+; CHECK: Function Attrs: noinline optnone
+; CHECK-LABEL: define {{[^@]+}}@optnone
+; CHECK-SAME: (ptr noundef nonnull dereferenceable(4) [[A:%.*]]) #[[ATTR12:[0-9]+]] {
+; CHECK-NEXT:    call void @use_i32_ptr(ptr nofree noundef nonnull captures(none) [[A]])
+; CHECK-NEXT:    ret void
 ;
   call void @use_i32_ptr(ptr %a)
   ret void
@@ -1107,32 +1098,32 @@ define i32 @nonnull_exec_ctx_1(ptr %a, i32 %b) {
 ;
 ; TUNIT: Function Attrs: mustprogress nounwind willreturn
 ; TUNIT-LABEL: define {{[^@]+}}@nonnull_exec_ctx_1
-; TUNIT-SAME: (ptr [[A:%.*]], i32 [[B:%.*]]) #[[ATTR7]] {
+; TUNIT-SAME: (ptr [[A:%.*]], i32 [[B:%.*]]) #[[ATTR8]] {
 ; TUNIT-NEXT:  en:
 ; TUNIT-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[B]], 0
 ; TUNIT-NEXT:    br i1 [[TMP3]], label [[EX:%.*]], label [[HD:%.*]]
 ; TUNIT:       ex:
-; TUNIT-NEXT:    [[TMP5:%.*]] = tail call i32 @g(ptr nonnull [[A]]) #[[ATTR6]]
+; TUNIT-NEXT:    [[TMP5:%.*]] = tail call i32 @g(ptr nonnull [[A]]) #[[ATTR7]]
 ; TUNIT-NEXT:    ret i32 [[TMP5]]
 ; TUNIT:       hd:
 ; TUNIT-NEXT:    [[TMP7:%.*]] = phi i32 [ [[TMP8:%.*]], [[HD]] ], [ 0, [[EN:%.*]] ]
-; TUNIT-NEXT:    tail call void @h(ptr [[A]]) #[[ATTR6]]
+; TUNIT-NEXT:    tail call void @h(ptr [[A]]) #[[ATTR7]]
 ; TUNIT-NEXT:    [[TMP8]] = add nuw i32 [[TMP7]], 1
 ; TUNIT-NEXT:    [[TMP9:%.*]] = icmp eq i32 [[TMP8]], [[B]]
 ; TUNIT-NEXT:    br i1 [[TMP9]], label [[EX]], label [[HD]]
 ;
 ; CGSCC: Function Attrs: mustprogress nounwind willreturn
 ; CGSCC-LABEL: define {{[^@]+}}@nonnull_exec_ctx_1
-; CGSCC-SAME: (ptr [[A:%.*]], i32 [[B:%.*]]) #[[ATTR8]] {
+; CGSCC-SAME: (ptr [[A:%.*]], i32 [[B:%.*]]) #[[ATTR7]] {
 ; CGSCC-NEXT:  en:
 ; CGSCC-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[B]], 0
 ; CGSCC-NEXT:    br i1 [[TMP3]], label [[EX:%.*]], label [[HD:%.*]]
 ; CGSCC:       ex:
-; CGSCC-NEXT:    [[TMP5:%.*]] = tail call i32 @g(ptr nonnull [[A]]) #[[ATTR7]]
+; CGSCC-NEXT:    [[TMP5:%.*]] = tail call i32 @g(ptr nonnull [[A]]) #[[ATTR6]]
 ; CGSCC-NEXT:    ret i32 [[TMP5]]
 ; CGSCC:       hd:
 ; CGSCC-NEXT:    [[TMP7:%.*]] = phi i32 [ [[TMP8:%.*]], [[HD]] ], [ 0, [[EN:%.*]] ]
-; CGSCC-NEXT:    tail call void @h(ptr [[A]]) #[[ATTR7]]
+; CGSCC-NEXT:    tail call void @h(ptr [[A]]) #[[ATTR6]]
 ; CGSCC-NEXT:    [[TMP8]] = add nuw i32 [[TMP7]], 1
 ; CGSCC-NEXT:    [[TMP9:%.*]] = icmp eq i32 [[TMP8]], [[B]]
 ; CGSCC-NEXT:    br i1 [[TMP9]], label [[EX]], label [[HD]]
@@ -1157,16 +1148,16 @@ define i32 @nonnull_exec_ctx_1b(ptr %a, i32 %b) {
 ;
 ; TUNIT: Function Attrs: mustprogress nounwind willreturn
 ; TUNIT-LABEL: define {{[^@]+}}@nonnull_exec_ctx_1b
-; TUNIT-SAME: (ptr [[A:%.*]], i32 [[B:%.*]]) #[[ATTR7]] {
+; TUNIT-SAME: (ptr [[A:%.*]], i32 [[B:%.*]]) #[[ATTR8]] {
 ; TUNIT-NEXT:  en:
 ; TUNIT-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[B]], 0
 ; TUNIT-NEXT:    br i1 [[TMP3]], label [[EX:%.*]], label [[HD:%.*]]
 ; TUNIT:       ex:
-; TUNIT-NEXT:    [[TMP5:%.*]] = tail call i32 @g(ptr nonnull [[A]]) #[[ATTR6]]
+; TUNIT-NEXT:    [[TMP5:%.*]] = tail call i32 @g(ptr nonnull [[A]]) #[[ATTR7]]
 ; TUNIT-NEXT:    ret i32 [[TMP5]]
 ; TUNIT:       hd:
 ; TUNIT-NEXT:    [[TMP7:%.*]] = phi i32 [ [[TMP8:%.*]], [[HD2:%.*]] ], [ 0, [[EN:%.*]] ]
-; TUNIT-NEXT:    tail call void @h(ptr [[A]]) #[[ATTR6]]
+; TUNIT-NEXT:    tail call void @h(ptr [[A]]) #[[ATTR7]]
 ; TUNIT-NEXT:    br label [[HD2]]
 ; TUNIT:       hd2:
 ; TUNIT-NEXT:    [[TMP8]] = add nuw i32 [[TMP7]], 1
@@ -1175,16 +1166,16 @@ define i32 @nonnull_exec_ctx_1b(ptr %a, i32 %b) {
 ;
 ; CGSCC: Function Attrs: mustprogress nounwind willreturn
 ; CGSCC-LABEL: define {{[^@]+}}@nonnull_exec_ctx_1b
-; CGSCC-SAME: (ptr [[A:%.*]], i32 [[B:%.*]]) #[[ATTR8]] {
+; CGSCC-SAME: (ptr [[A:%.*]], i32 [[B:%.*]]) #[[ATTR7]] {
 ; CGSCC-NEXT:  en:
 ; CGSCC-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[B]], 0
 ; CGSCC-NEXT:    br i1 [[TMP3]], label [[EX:%.*]], label [[HD:%.*]]
 ; CGSCC:       ex:
-; CGSCC-NEXT:    [[TMP5:%.*]] = tail call i32 @g(ptr nonnull [[A]]) #[[ATTR7]]
+; CGSCC-NEXT:    [[TMP5:%.*]] = tail call i32 @g(ptr nonnull [[A]]) #[[ATTR6]]
 ; CGSCC-NEXT:    ret i32 [[TMP5]]
 ; CGSCC:       hd:
 ; CGSCC-NEXT:    [[TMP7:%.*]] = phi i32 [ [[TMP8:%.*]], [[HD2:%.*]] ], [ 0, [[EN:%.*]] ]
-; CGSCC-NEXT:    tail call void @h(ptr [[A]]) #[[ATTR7]]
+; CGSCC-NEXT:    tail call void @h(ptr [[A]]) #[[ATTR6]]
 ; CGSCC-NEXT:    br label [[HD2]]
 ; CGSCC:       hd2:
 ; CGSCC-NEXT:    [[TMP8]] = add nuw i32 [[TMP7]], 1
@@ -1214,7 +1205,7 @@ define i32 @nonnull_exec_ctx_2(ptr %a, i32 %b) willreturn nounwind {
 ;
 ; TUNIT: Function Attrs: mustprogress nounwind willreturn
 ; TUNIT-LABEL: define {{[^@]+}}@nonnull_exec_ctx_2
-; TUNIT-SAME: (ptr nonnull [[A:%.*]], i32 [[B:%.*]]) #[[ATTR7]] {
+; TUNIT-SAME: (ptr nonnull [[A:%.*]], i32 [[B:%.*]]) #[[ATTR8]] {
 ; TUNIT-NEXT:  en:
 ; TUNIT-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[B]], 0
 ; TUNIT-NEXT:    br i1 [[TMP3]], label [[EX:%.*]], label [[HD:%.*]]
@@ -1230,7 +1221,7 @@ define i32 @nonnull_exec_ctx_2(ptr %a, i32 %b) willreturn nounwind {
 ;
 ; CGSCC: Function Attrs: mustprogress nounwind willreturn
 ; CGSCC-LABEL: define {{[^@]+}}@nonnull_exec_ctx_2
-; CGSCC-SAME: (ptr nonnull [[A:%.*]], i32 [[B:%.*]]) #[[ATTR8]] {
+; CGSCC-SAME: (ptr nonnull [[A:%.*]], i32 [[B:%.*]]) #[[ATTR7]] {
 ; CGSCC-NEXT:  en:
 ; CGSCC-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[B]], 0
 ; CGSCC-NEXT:    br i1 [[TMP3]], label [[EX:%.*]], label [[HD:%.*]]
@@ -1264,7 +1255,7 @@ define i32 @nonnull_exec_ctx_2b(ptr %a, i32 %b) willreturn nounwind {
 ;
 ; TUNIT: Function Attrs: mustprogress nounwind willreturn
 ; TUNIT-LABEL: define {{[^@]+}}@nonnull_exec_ctx_2b
-; TUNIT-SAME: (ptr nonnull [[A:%.*]], i32 [[B:%.*]]) #[[ATTR7]] {
+; TUNIT-SAME: (ptr nonnull [[A:%.*]], i32 [[B:%.*]]) #[[ATTR8]] {
 ; TUNIT-NEXT:  en:
 ; TUNIT-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[B]], 0
 ; TUNIT-NEXT:    br i1 [[TMP3]], label [[EX:%.*]], label [[HD:%.*]]
@@ -1282,7 +1273,7 @@ define i32 @nonnull_exec_ctx_2b(ptr %a, i32 %b) willreturn nounwind {
 ;
 ; CGSCC: Function Attrs: mustprogress nounwind willreturn
 ; CGSCC-LABEL: define {{[^@]+}}@nonnull_exec_ctx_2b
-; CGSCC-SAME: (ptr nonnull [[A:%.*]], i32 [[B:%.*]]) #[[ATTR8]] {
+; CGSCC-SAME: (ptr nonnull [[A:%.*]], i32 [[B:%.*]]) #[[ATTR7]] {
 ; CGSCC-NEXT:  en:
 ; CGSCC-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[B]], 0
 ; CGSCC-NEXT:    br i1 [[TMP3]], label [[EX:%.*]], label [[HD:%.*]]
@@ -1401,8 +1392,8 @@ declare ptr @strrchr(ptr %0, i32 %1) nofree nounwind readonly willreturn
 define ptr @mybasename(ptr nofree readonly %str) {
 ; TUNIT: Function Attrs: mustprogress nofree nosync nounwind willreturn memory(read)
 ; TUNIT-LABEL: define {{[^@]+}}@mybasename
-; TUNIT-SAME: (ptr nofree readonly [[STR:%.*]]) #[[ATTR13:[0-9]+]] {
-; TUNIT-NEXT:    [[CALL:%.*]] = call ptr @strrchr(ptr nofree readonly [[STR]], i32 noundef 47) #[[ATTR17:[0-9]+]]
+; TUNIT-SAME: (ptr nofree readonly [[STR:%.*]]) #[[ATTR14:[0-9]+]] {
+; TUNIT-NEXT:    [[CALL:%.*]] = call ptr @strrchr(ptr nofree readonly [[STR]], i32 noundef 47) #[[ATTR19:[0-9]+]]
 ; TUNIT-NEXT:    [[TOBOOL:%.*]] = icmp ne ptr [[CALL]], null
 ; TUNIT-NEXT:    [[ADD_PTR:%.*]] = getelementptr inbounds i8, ptr [[CALL]], i64 1
 ; TUNIT-NEXT:    [[COND:%.*]] = select i1 [[TOBOOL]], ptr [[ADD_PTR]], ptr [[STR]]
@@ -1411,7 +1402,7 @@ define ptr @mybasename(ptr nofree readonly %str) {
 ; CGSCC: Function Attrs: mustprogress nofree nosync nounwind willreturn memory(read)
 ; CGSCC-LABEL: define {{[^@]+}}@mybasename
 ; CGSCC-SAME: (ptr nofree readonly [[STR:%.*]]) #[[ATTR14:[0-9]+]] {
-; CGSCC-NEXT:    [[CALL:%.*]] = call ptr @strrchr(ptr nofree readonly [[STR]], i32 noundef 47) #[[ATTR19:[0-9]+]]
+; CGSCC-NEXT:    [[CALL:%.*]] = call ptr @strrchr(ptr nofree readonly [[STR]], i32 noundef 47) #[[ATTR20:[0-9]+]]
 ; CGSCC-NEXT:    [[TOBOOL:%.*]] = icmp ne ptr [[CALL]], null
 ; CGSCC-NEXT:    [[ADD_PTR:%.*]] = getelementptr inbounds i8, ptr [[CALL]], i64 1
 ; CGSCC-NEXT:    [[COND:%.*]] = select i1 [[TOBOOL]], ptr [[ADD_PTR]], ptr [[STR]]
@@ -1434,7 +1425,7 @@ define void @nonnull_assume_pos(ptr %arg) {
 ;
 ; TUNIT-LABEL: define {{[^@]+}}@nonnull_assume_pos
 ; TUNIT-SAME: (ptr nofree nonnull readnone captures(none) [[ARG:%.*]]) {
-; TUNIT-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR15]] [ "nonnull"(ptr [[ARG]]) ]
+; TUNIT-NEXT:    call void @llvm.assume(i1 noundef true) #[[ATTR16]] [ "nonnull"(ptr [[ARG]]) ]
 ; TUNIT-NEXT:    call void @use_i8_ptr(ptr noalias nofree nonnull readnone captures(none) [[ARG]]) #[[ATTR5]]
 ; TUNIT-NEXT:    [[TMP1:%.*]] = call ptr @unknown()
 ; TUNIT-NEXT:    ret void
@@ -1563,14 +1554,14 @@ define void @phi_caller(ptr %p) {
 ; TUNIT: Function Attrs: nounwind
 ; TUNIT-LABEL: define {{[^@]+}}@phi_caller
 ; TUNIT-SAME: (ptr nofree [[P:%.*]]) #[[ATTR5]] {
-; TUNIT-NEXT:    [[C:%.*]] = call nonnull ptr @phi(ptr noalias nofree readnone [[P]]) #[[ATTR18:[0-9]+]]
+; TUNIT-NEXT:    [[C:%.*]] = call nonnull ptr @phi(ptr noalias nofree readnone [[P]]) #[[ATTR20:[0-9]+]]
 ; TUNIT-NEXT:    call void @use_i8_ptr(ptr noalias nofree nonnull readnone captures(none) [[C]]) #[[ATTR5]]
 ; TUNIT-NEXT:    ret void
 ;
 ; CGSCC: Function Attrs: nounwind
 ; CGSCC-LABEL: define {{[^@]+}}@phi_caller
 ; CGSCC-SAME: (ptr nofree [[P:%.*]]) #[[ATTR4]] {
-; CGSCC-NEXT:    [[C:%.*]] = call nonnull ptr @phi(ptr noalias nofree readnone [[P]]) #[[ATTR20:[0-9]+]]
+; CGSCC-NEXT:    [[C:%.*]] = call nonnull ptr @phi(ptr noalias nofree readnone [[P]]) #[[ATTR21:[0-9]+]]
 ; CGSCC-NEXT:    call void @use_i8_ptr(ptr noalias nofree nonnull readnone captures(none) [[C]]) #[[ATTR4]]
 ; CGSCC-NEXT:    ret void
 ;
@@ -1603,14 +1594,14 @@ define void @multi_ret_caller(ptr %p) {
 ; TUNIT: Function Attrs: nounwind
 ; TUNIT-LABEL: define {{[^@]+}}@multi_ret_caller
 ; TUNIT-SAME: (ptr nofree [[P:%.*]]) #[[ATTR5]] {
-; TUNIT-NEXT:    [[C:%.*]] = call nonnull ptr @multi_ret(ptr noalias nofree readnone [[P]]) #[[ATTR18]]
+; TUNIT-NEXT:    [[C:%.*]] = call nonnull ptr @multi_ret(ptr noalias nofree readnone [[P]]) #[[ATTR20]]
 ; TUNIT-NEXT:    call void @use_i8_ptr(ptr noalias nofree nonnull readnone captures(none) [[C]]) #[[ATTR5]]
 ; TUNIT-NEXT:    ret void
 ;
 ; CGSCC: Function Attrs: nounwind
 ; CGSCC-LABEL: define {{[^@]+}}@multi_ret_caller
 ; CGSCC-SAME: (ptr nofree [[P:%.*]]) #[[ATTR4]] {
-; CGSCC-NEXT:    [[C:%.*]] = call nonnull ptr @multi_ret(ptr noalias nofree readnone [[P]]) #[[ATTR20]]
+; CGSCC-NEXT:    [[C:%.*]] = call nonnull ptr @multi_ret(ptr noalias nofree readnone [[P]]) #[[ATTR21]]
 ; CGSCC-NEXT:    call void @use_i8_ptr(ptr noalias nofree nonnull readnone captures(none) [[C]]) #[[ATTR4]]
 ; CGSCC-NEXT:    ret void
 ;
@@ -1622,31 +1613,18 @@ define void @multi_ret_caller(ptr %p) {
 ; From https://github.com/llvm/llvm-project/pull/85810
 @G = internal global i64 1, align 8
 define dso_local ptr @update_global_in_alive_bb() {
-; TUNIT: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn
-; TUNIT-LABEL: define {{[^@]+}}@update_global_in_alive_bb
-; TUNIT-SAME: () #[[ATTR14:[0-9]+]] {
-; TUNIT-NEXT:  entry:
-; TUNIT-NEXT:    [[TMP0:%.*]] = load i64, ptr @G, align 8
-; TUNIT-NEXT:    [[CMP:%.*]] = icmp ne i64 [[TMP0]], 0
-; TUNIT-NEXT:    br i1 [[CMP]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
-; TUNIT:       if.then:
-; TUNIT-NEXT:    store i64 0, ptr @G, align 8
-; TUNIT-NEXT:    ret ptr inttoptr (i64 5 to ptr)
-; TUNIT:       if.else:
-; TUNIT-NEXT:    ret ptr null
-;
-; CGSCC: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn
-; CGSCC-LABEL: define {{[^@]+}}@update_global_in_alive_bb
-; CGSCC-SAME: () #[[ATTR15:[0-9]+]] {
-; CGSCC-NEXT:  entry:
-; CGSCC-NEXT:    [[TMP0:%.*]] = load i64, ptr @G, align 8
-; CGSCC-NEXT:    [[CMP:%.*]] = icmp ne i64 [[TMP0]], 0
-; CGSCC-NEXT:    br i1 [[CMP]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
-; CGSCC:       if.then:
-; CGSCC-NEXT:    store i64 0, ptr @G, align 8
-; CGSCC-NEXT:    ret ptr inttoptr (i64 5 to ptr)
-; CGSCC:       if.else:
-; CGSCC-NEXT:    ret ptr null
+; CHECK: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn
+; CHECK-LABEL: define {{[^@]+}}@update_global_in_alive_bb
+; CHECK-SAME: () #[[ATTR15:[0-9]+]] {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr @G, align 8
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ne i64 [[TMP0]], 0
+; CHECK-NEXT:    br i1 [[CMP]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
+; CHECK:       if.then:
+; CHECK-NEXT:    store i64 0, ptr @G, align 8
+; CHECK-NEXT:    ret ptr inttoptr (i64 5 to ptr)
+; CHECK:       if.else:
+; CHECK-NEXT:    ret ptr null
 ;
 entry:
   %0 = load i64, ptr @G, align 8
@@ -1662,47 +1640,48 @@ if.else:
 attributes #0 = { null_pointer_is_valid }
 attributes #1 = { nounwind willreturn}
 ;.
+; TUNIT: attributes #[[ATTR0:[0-9]+]] = { nocallback nofree nosync nounwind willreturn memory(inaccessiblemem: write) }
+; TUNIT: attributes #[[ATTR1]] = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) }
+; TUNIT: attributes #[[ATTR2]] = { mustprogress nofree norecurse nosync nounwind willreturn memory(inaccessiblemem: write) }
+; TUNIT: attributes #[[ATTR3]] = { mustprogress nofree nosync nounwind willreturn memory(none) }
+; TUNIT: attributes #[[ATTR4]] = { noreturn }
+; TUNIT: attributes #[[ATTR5]] = { nounwind }
+; TUNIT: attributes #[[ATTR6]] = { nofree nosync nounwind memory(argmem: read) }
+; TUNIT: attributes #[[ATTR7]] = { nounwind willreturn }
+; TUNIT: attributes #[[ATTR8]] = { mustprogress nounwind willreturn }
+; TUNIT: attributes #[[ATTR9:[0-9]+]] = { nounwind willreturn memory(read) }
+; TUNIT: attributes #[[ATTR10]] = { mustprogress nofree norecurse nosync nounwind null_pointer_is_valid willreturn memory(none) }
+; TUNIT: attributes #[[ATTR11]] = { naked }
+; TUNIT: attributes #[[ATTR12]] = { noinline optnone }
+; TUNIT: attributes #[[ATTR13:[0-9]+]] = { nofree nounwind willreturn memory(read) }
+; TUNIT: attributes #[[ATTR14]] = { mustprogress nofree nosync nounwind willreturn memory(read) }
+; TUNIT: attributes #[[ATTR15]] = { mustprogress nofree norecurse nosync nounwind willreturn }
+; TUNIT: attributes #[[ATTR16]] = { nofree willreturn memory(write) }
+; TUNIT: attributes #[[ATTR17]] = { nofree nosync nounwind memory(read) }
+; TUNIT: attributes #[[ATTR18]] = { nosync willreturn memory(read) }
+; TUNIT: attributes #[[ATTR19]] = { nofree nosync willreturn memory(read) }
+; TUNIT: attributes #[[ATTR20]] = { nofree nosync nounwind willreturn memory(none) }
+;.
 ; CGSCC: attributes #[[ATTR0:[0-9]+]] = { nocallback nofree nosync nounwind willreturn memory(inaccessiblemem: write) }
 ; CGSCC: attributes #[[ATTR1]] = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) }
 ; CGSCC: attributes #[[ATTR2]] = { mustprogress nofree norecurse nosync nounwind willreturn memory(inaccessiblemem: write) }
 ; CGSCC: attributes #[[ATTR3]] = { noreturn }
 ; CGSCC: attributes #[[ATTR4]] = { nounwind }
-; CGSCC: attributes #[[ATTR5]] = { mustprogress nofree nosync nounwind willreturn memory(argmem: read) }
-; CGSCC: attributes #[[ATTR6]] = { mustprogress nofree nosync nounwind willreturn memory(none) }
-; CGSCC: attributes #[[ATTR7]] = { nounwind willreturn }
-; CGSCC: attributes #[[ATTR8]] = { mustprogress nounwind willreturn }
-; CGSCC: attributes #[[ATTR9:[0-9]+]] = { nounwind willreturn memory(read) }
-; CGSCC: attributes #[[ATTR10]] = { mustprogress nofree norecurse nosync nounwind null_pointer_is_valid willreturn memory(none) }
+; CGSCC: attributes #[[ATTR5]] = { nofree nosync nounwind memory(argmem: read) }
+; CGSCC: attributes #[[ATTR6]] = { nounwind willreturn }
+; CGSCC: attributes #[[ATTR7]] = { mustprogress nounwind willreturn }
+; CGSCC: attributes #[[ATTR8:[0-9]+]] = { nounwind willreturn memory(read) }
+; CGSCC: attributes #[[ATTR9]] = { mustprogress nofree norecurse nosync nounwind null_pointer_is_valid willreturn memory(none) }
+; CGSCC: attributes #[[ATTR10]] = { mustprogress nofree nosync nounwind willreturn memory(none) }
 ; CGSCC: attributes #[[ATTR11]] = { naked }
 ; CGSCC: attributes #[[ATTR12]] = { noinline optnone }
 ; CGSCC: attributes #[[ATTR13:[0-9]+]] = { nofree nounwind willreturn memory(read) }
 ; CGSCC: attributes #[[ATTR14]] = { mustprogress nofree nosync nounwind willreturn memory(read) }
 ; CGSCC: attributes #[[ATTR15]] = { mustprogress nofree norecurse nosync nounwind willreturn }
 ; CGSCC: attributes #[[ATTR16]] = { nofree willreturn memory(write) }
-; CGSCC: attributes #[[ATTR17]] = { nosync willreturn memory(read) }
-; CGSCC: attributes #[[ATTR18]] = { nofree nosync willreturn }
-; CGSCC: attributes #[[ATTR19]] = { nofree nosync willreturn memory(read) }
-; CGSCC: attributes #[[ATTR20]] = { nofree willreturn }
-;.
-; TUNIT: attributes #[[ATTR0:[0-9]+]] = { nocallback nofree nosync nounwind willreturn memory(inaccessiblemem: write) }
-; TUNIT: attributes #[[ATTR1]] = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) }
-; TUNIT: attributes #[[ATTR2]] = { mustprogress nofree norecurse nosync nounwind willreturn memory(inaccessiblemem: write) }
-; TUNIT: attributes #[[ATTR3]] = { mustprogress nofree nosync nounwind willreturn memory(none) }
-; TUNIT: attributes #[[ATTR4]] = { noreturn }
-; TUNIT: attributes #[[ATTR5]] = { nounwind }
-; TUNIT: attributes #[[ATTR6]] = { nounwind willreturn }
-; TUNIT: attributes #[[ATTR7]] = { mustprogress nounwind willreturn }
-; TUNIT: attributes #[[ATTR8:[0-9]+]] = { nounwind willreturn memory(read) }
-; TUNIT: attributes #[[ATTR9]] = { mustprogress nofree norecurse nosync nounwind null_pointer_is_valid willreturn memory(none) }
-; TUNIT: attributes #[[ATTR10]] = { naked }
-; TUNIT: attributes #[[ATTR11]] = { noinline optnone }
-; TUNIT: attributes #[[ATTR12:[0-9]+]] = { nofree nounwind willreturn memory(read) }
-; TUNIT: attributes #[[ATTR13]] = { mustprogress nofree nosync nounwind willreturn memory(read) }
-; TUNIT: attributes #[[ATTR14]] = { mustprogress nofree norecurse nosync nounwind willreturn }
-; TUNIT: attributes #[[ATTR15]] = { nofree willreturn memory(write) }
-; TUNIT: attributes #[[ATTR16]] = { nosync willreturn memory(read) }
-; TUNIT: attributes #[[ATTR17]] = { nofree nosync willreturn memory(read) }
-; TUNIT: attributes #[[ATTR18]] = { nofree nosync nounwind willreturn memory(none) }
-;.
-; CGSCC: [[META0]] = !{}
+; CGSCC: attributes #[[ATTR17]] = { nofree nosync nounwind memory(read) }
+; CGSCC: attributes #[[ATTR18]] = { nosync willreturn memory(read) }
+; CGSCC: attributes #[[ATTR19]] = { nofree nosync willreturn }
+; CGSCC: attributes #[[ATTR20]] = { nofree nosync willreturn memory(read) }
+; CGSCC: attributes #[[ATTR21]] = { nofree willreturn }
 ;.
diff --git a/llvm/test/Transforms/Attributor/value-simplify-pointer-info.ll b/llvm/test/Transforms/Attributor/value-simplify-pointer-info.ll
index 2235f194af8ea..3e07fe42261e9 100644
--- a/llvm/test/Transforms/Attributor/value-simplify-pointer-info.ll
+++ b/llvm/test/Transforms/Attributor/value-simplify-pointer-info.ll
@@ -1267,7 +1267,7 @@ entry:
 define void @noalias_arg_simplifiable_2(ptr %Bytes) {
 ; TUNIT: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn
 ; TUNIT-LABEL: define void @noalias_arg_simplifiable_2(
-; TUNIT-SAME: ptr nofree nonnull captures(none) dereferenceable(24) [[BYTES:%.*]]) #[[ATTR3]] {
+; TUNIT-SAME: ptr nofree captures(none) [[BYTES:%.*]]) #[[ATTR3]] {
 ; TUNIT-NEXT:  [[ENTRY:.*]]:
 ; TUNIT-NEXT:    br label %[[FOR_COND:.*]]
 ; TUNIT:       [[FOR_COND]]:
@@ -1344,7 +1344,7 @@ define void @noalias_arg_simplifiable_2(ptr %Bytes) {
 ;
 ; CGSCC: Function Attrs: mustprogress nofree nosync nounwind willreturn
 ; CGSCC-LABEL: define void @noalias_arg_simplifiable_2(
-; CGSCC-SAME: ptr nofree nonnull align 4 captures(none) dereferenceable(1024) [[BYTES:%.*]]) #[[ATTR3]] {
+; CGSCC-SAME: ptr nofree captures(none) [[BYTES:%.*]]) #[[ATTR3]] {
 ; CGSCC-NEXT:  [[ENTRY:.*]]:
 ; CGSCC-NEXT:    br label %[[FOR_COND:.*]]
 ; CGSCC:       [[FOR_COND]]:
@@ -1399,7 +1399,7 @@ define void @noalias_arg_simplifiable_2(ptr %Bytes) {
 ; CGSCC-NEXT:    [[ARRAYIDX24:%.*]] = getelementptr inbounds i8, ptr [[BYTES]], i64 1023
 ; CGSCC-NEXT:    store i8 0, ptr [[ARRAYIDX24]], align 1, !tbaa [[CHAR_TBAA15]]
 ; CGSCC-NEXT:    [[ARRAYIDX25:%.*]] = getelementptr inbounds i8, ptr [[BYTES]], i64 500
-; CGSCC-NEXT:    call void @write_arg(ptr nofree noundef nonnull writeonly align 4 captures(none) dereferenceable(524) [[ARRAYIDX25]], i32 noundef 0) #[[ATTR21]]
+; CGSCC-NEXT:    call void @write_arg(ptr nofree noundef nonnull writeonly align 4 captures(none) dereferenceable(4) [[ARRAYIDX25]], i32 noundef 0) #[[ATTR21]]
 ; CGSCC-NEXT:    br label %[[FOR_COND27:.*]]
 ; CGSCC:       [[FOR_COND27]]:
 ; CGSCC-NEXT:    [[INDVARS_IV12:%.*]] = phi i64 [ [[INDVARS_IV_NEXT13:%.*]], %[[FOR_INC35:.*]] ], [ 0, %[[FOR_END23]] ]
diff --git a/llvm/test/Transforms/Attributor/willreturn.ll b/llvm/test/Transforms/Attributor/willreturn.ll
index 543f33ee0621b..d65480b05759a 100644
--- a/llvm/test/Transforms/Attributor/willreturn.ll
+++ b/llvm/test/Transforms/Attributor/willreturn.ll
@@ -238,7 +238,7 @@ define void @only_exit() local_unnamed_addr #0 {
 define void @conditional_exit(i32 %0, ptr nocapture readonly %1) local_unnamed_addr #0 {
 ; CHECK: Function Attrs: noinline nounwind uwtable
 ; CHECK-LABEL: define {{[^@]+}}@conditional_exit
-; CHECK-SAME: (i32 [[TMP0:%.*]], ptr nofree nonnull readonly align 4 captures(none) dereferenceable(4) [[TMP1:%.*]]) local_unnamed_addr #[[ATTR7:[0-9]+]] {
+; CHECK-SAME: (i32 [[TMP0:%.*]], ptr nofree readonly captures(none) [[TMP1:%.*]]) local_unnamed_addr #[[ATTR7:[0-9]+]] {
 ; CHECK-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[TMP0]], 0
 ; CHECK-NEXT:    br i1 [[TMP3]], label [[TMP5:%.*]], label [[TMP4:%.*]]
 ; CHECK:       4:
diff --git a/llvm/test/Transforms/DeadStoreElimination/debug-counter.ll b/llvm/test/Transforms/DeadStoreElimination/debug-counter.ll
index ffa10e37c76f6..a38a845fc63c3 100644
--- a/llvm/test/Transforms/DeadStoreElimination/debug-counter.ll
+++ b/llvm/test/Transforms/DeadStoreElimination/debug-counter.ll
@@ -1,7 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 
-; REQUIRES: asserts
-
 ; Eliminates store to %R in the entry block.
 ; RUN: opt < %s -passes=dse -debug-counter=dse-memoryssa=0 -S | FileCheck --check-prefix=SKIP0-COUNT1 %s
 
diff --git a/llvm/test/Transforms/ExpandFp/AMDGPU/frem-inf.ll b/llvm/test/Transforms/ExpandFp/AMDGPU/frem-inf.ll
index f70f0d25f172d..54ece8d52f08a 100644
--- a/llvm/test/Transforms/ExpandFp/AMDGPU/frem-inf.ll
+++ b/llvm/test/Transforms/ExpandFp/AMDGPU/frem-inf.ll
@@ -1,5 +1,5 @@
-; RUN: opt -mtriple=amdgcn -passes="expand-fp<O0>" %s -S -o - | FileCheck --check-prefixes CHECK %s
-; RUN: opt -mtriple=amdgcn -passes="expand-fp<O1>" %s -S -o - | FileCheck --check-prefixes CHECK,OPT1 %s
+; RUN: opt -mtriple=amdgcn -passes="require<libcall-lowering-info>,expand-fp<O0>" %s -S -o - | FileCheck --check-prefixes CHECK %s
+; RUN: opt -mtriple=amdgcn -passes="require<libcall-lowering-info>,expand-fp<O1>" %s -S -o - | FileCheck --check-prefixes CHECK,OPT1 %s
 
 ; Check the handling of potentially infinite numerators in the frem
 ; expansion at different optimization levels and with different
diff --git a/llvm/test/Transforms/ExpandFp/AMDGPU/frem.ll b/llvm/test/Transforms/ExpandFp/AMDGPU/frem.ll
index 4c0f9db147c96..5cd6f1e8a6086 100644
--- a/llvm/test/Transforms/ExpandFp/AMDGPU/frem.ll
+++ b/llvm/test/Transforms/ExpandFp/AMDGPU/frem.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
-; RUN: opt -mtriple=amdgcn -passes="expand-fp<O1>" %s -S -o - | FileCheck %s
+; RUN: opt -mtriple=amdgcn -passes="require<libcall-lowering-info>,expand-fp<O1>" %s -S -o - | FileCheck %s
 
 define amdgpu_kernel void @frem_f16(ptr addrspace(1) %out, ptr addrspace(1) %in1,
 ; CHECK-LABEL: define amdgpu_kernel void @frem_f16(
diff --git a/llvm/test/Transforms/ExpandFp/AMDGPU/missing-analysis.ll b/llvm/test/Transforms/ExpandFp/AMDGPU/missing-analysis.ll
new file mode 100644
index 0000000000000..2d5f2a7223e3a
--- /dev/null
+++ b/llvm/test/Transforms/ExpandFp/AMDGPU/missing-analysis.ll
@@ -0,0 +1,6 @@
+; RUN: not opt -mtriple=amdgcn -passes=expand-fp -disable-output %s 2>&1 | FileCheck %s
+
+; CHECK: 'LibcallLoweringModuleAnalysis' analysis required
+define void @empty() {
+  ret void
+}
diff --git a/llvm/test/Transforms/ExpandFp/AMDGPU/pass-parameters.ll b/llvm/test/Transforms/ExpandFp/AMDGPU/pass-parameters.ll
index 03cafd4ff1160..794d5805291b0 100644
--- a/llvm/test/Transforms/ExpandFp/AMDGPU/pass-parameters.ll
+++ b/llvm/test/Transforms/ExpandFp/AMDGPU/pass-parameters.ll
@@ -1,18 +1,18 @@
-; RUN: opt -mtriple=amdgcn -passes="expand-fp<O0>" %s -S -o /dev/null
-; RUN: opt -mtriple=amdgcn -passes="expand-fp<O1>" %s -S -o /dev/null
-; RUN: opt -mtriple=amdgcn -passes="expand-fp<O2>" %s -S -o /dev/null
-; RUN: opt -mtriple=amdgcn -passes="expand-fp<O3>" %s -S -o /dev/null
+; RUN: opt -mtriple=amdgcn -passes="require<libcall-lowering-info>,expand-fp<O0>" %s -S -disable-output
+; RUN: opt -mtriple=amdgcn -passes="require<libcall-lowering-info>,expand-fp<O1>" %s -S -disable-output
+; RUN: opt -mtriple=amdgcn -passes="require<libcall-lowering-info>,expand-fp<O2>" %s -S -disable-output
+; RUN: opt -mtriple=amdgcn -passes="require<libcall-lowering-info>,expand-fp<O3>" %s -S -disable-output
 
-; RUN: not opt -mtriple=amdgcn -passes="expand-fp<O4>" %s -S -o /dev/null 2>&1 | FileCheck --check-prefix=TOO-LARGE %s
+; RUN: not opt -mtriple=amdgcn -passes="require<libcall-lowering-info>,expand-fp<O4>" %s -S -disable-output 2>&1 | FileCheck --check-prefix=TOO-LARGE %s
 ; TOO-LARGE: {{.*}}invalid optimization level for expand-fp pass: 4
 
-; RUN: not opt -mtriple=amdgcn -passes="expand-fp<Os>" %s -S -o /dev/null 2>&1 | FileCheck --check-prefix=NON-NUMERIC %s
+; RUN: not opt -mtriple=amdgcn -passes="require<libcall-lowering-info>,expand-fp<Os>" %s -S -disable-output 2>&1 | FileCheck --check-prefix=NON-NUMERIC %s
 ; NON-NUMERIC: {{.*}}invalid expand-fp pass parameter
 
-; RUN: not opt -mtriple=amdgcn -passes="expand-fp<O-1>" %s -S -o /dev/null 2>&1 | FileCheck --check-prefix=NEGATIVE %s
+; RUN: not opt -mtriple=amdgcn -passes="require<libcall-lowering-info>,expand-fp<O-1>" %s -S -disable-output 2>&1 | FileCheck --check-prefix=NEGATIVE %s
 ; NEGATIVE: {{.*}}invalid expand-fp pass parameter 'O-1'
 
-; RUN: not opt -mtriple=amdgcn -passes="expand-fp<foo>" %s -S -o /dev/null 2>&1 | FileCheck --check-prefix=NO-O-PREFIX %s
+; RUN: not opt -mtriple=amdgcn -passes="require<libcall-lowering-info>,expand-fp<foo>" %s -S -disable-output 2>&1 | FileCheck --check-prefix=NO-O-PREFIX %s
 ; NO-O-PREFIX: {{.*}}invalid expand-fp pass parameter 'foo'
 
 define void @empty() {
diff --git a/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-fptosi129.ll b/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-fptosi129.ll
index f5bf8bb61a16e..0cf8829aec037 100644
--- a/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-fptosi129.ll
+++ b/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-fptosi129.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -S -mtriple=x86_64-- --expand-fp < %s | FileCheck %s
-; RUN: opt -S -mtriple=x86_64-- -passes=expand-fp < %s | FileCheck %s
+; RUN: opt -S -mtriple=x86_64-- -passes='require<libcall-lowering-info>,expand-fp' < %s | FileCheck %s
 
 define i129 @halftosi129(half %a) {
 ; CHECK-LABEL: @halftosi129(
diff --git a/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-fptoui129.ll b/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-fptoui129.ll
index 94ed32abe46f8..055e3e0dc261d 100644
--- a/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-fptoui129.ll
+++ b/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-fptoui129.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -S -mtriple=x86_64-- --expand-fp < %s | FileCheck %s
-; RUN: opt -S -mtriple=x86_64-- -passes=expand-fp < %s | FileCheck %s
+; RUN: opt -S -mtriple=x86_64-- -passes='require<libcall-lowering-info>,expand-fp' < %s | FileCheck %s
 
 define i129 @halftoui129(half %a) {
 ; CHECK-LABEL: @halftoui129(
diff --git a/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-si129tofp.ll b/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-si129tofp.ll
index 8820b873f3818..af053e82a62a4 100644
--- a/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-si129tofp.ll
+++ b/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-si129tofp.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -S -mtriple=x86_64-- --expand-fp < %s | FileCheck %s
-; RUN: opt -S -mtriple=x86_64-- -passes=expand-fp < %s | FileCheck %s
+; RUN: opt -S -mtriple=x86_64-- -passes='require<libcall-lowering-info>,expand-fp' < %s | FileCheck %s
 
 define half @si129tohalf(i129 %a) {
 ; CHECK-LABEL: @si129tohalf(
diff --git a/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-ui129tofp.ll b/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-ui129tofp.ll
index b58d88bc02c79..ede9b2a4cd049 100644
--- a/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-ui129tofp.ll
+++ b/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-ui129tofp.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -S -mtriple=x86_64-- --expand-fp < %s | FileCheck %s
-; RUN: opt -S -mtriple=x86_64-- -passes=expand-fp < %s | FileCheck %s
+; RUN: opt -S -mtriple=x86_64-- -passes='require<libcall-lowering-info>,expand-fp' < %s | FileCheck %s
 
 define half @ui129tohalf(i129 %a) {
 ; CHECK-LABEL: @ui129tohalf(
diff --git a/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-optnone.ll b/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-optnone.ll
index 78bc0006fda23..e78eaeb70fbf1 100644
--- a/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-optnone.ll
+++ b/llvm/test/Transforms/ExpandLargeFpConvert/X86/expand-large-fp-optnone.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
 ; RUN: opt -S -mtriple=x86_64-- --expand-fp < %s | FileCheck %s
-; RUN: opt -S -mtriple=x86_64-- -passes=expand-fp < %s | FileCheck %s
+; RUN: opt -S -mtriple=x86_64-- -passes='require<libcall-lowering-info>,expand-fp' < %s | FileCheck %s
 
 ; expand-fp must also run with optnone
 
diff --git a/llvm/test/Transforms/FunctionAttrs/nonnull.ll b/llvm/test/Transforms/FunctionAttrs/nonnull.ll
index e06fb1cfd9656..9d5ae1606f2e3 100644
--- a/llvm/test/Transforms/FunctionAttrs/nonnull.ll
+++ b/llvm/test/Transforms/FunctionAttrs/nonnull.ll
@@ -360,6 +360,7 @@ declare nonnull ptr @nonnull()
 
 
 define internal ptr @f1(ptr %arg) {
+; FIXME: Missing nonnull. This should be nonnull @f1(ptr nonnull readonly %arg).
 ; FNATTRS-LABEL: define internal nonnull ptr @f1(
 ; FNATTRS-SAME: ptr readonly captures(address_is_null) [[ARG:%.*]]) #[[ATTR4:[0-9]+]] {
 ; FNATTRS-NEXT:  bb:
@@ -382,7 +383,7 @@ define internal ptr @f1(ptr %arg) {
 ; FNATTRS-NEXT:    ret ptr [[TMP10]]
 ;
 ; ATTRIBUTOR-LABEL: define internal ptr @f1(
-; ATTRIBUTOR-SAME: ptr nofree nonnull readonly [[ARG:%.*]]) #[[ATTR4:[0-9]+]] {
+; ATTRIBUTOR-SAME: ptr nofree readonly [[ARG:%.*]]) #[[ATTR4:[0-9]+]] {
 ; ATTRIBUTOR-NEXT:  bb:
 ; ATTRIBUTOR-NEXT:    [[TMP:%.*]] = icmp eq ptr [[ARG]], null
 ; ATTRIBUTOR-NEXT:    br i1 [[TMP]], label [[BB9:%.*]], label [[BB1:%.*]]
diff --git a/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll b/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll
index 156c49c8b6677..a7fcc667dbedc 100644
--- a/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll
+++ b/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll
@@ -1,4 +1,4 @@
-; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --filter "call i32 @(test_single_bb_resolver|test_multi_bb_resolver|test_caller_feats_not_implied|test_non_fmv_caller|test_priority|test_alternative_names|test_unrelated_callers)" --version 4
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --filter "call i32 @(test_single_bb_resolver|test_multi_bb_resolver|test_caller_feats_not_implied|test_non_fmv_caller|test_priority|test_alternative_names|test_unrelated_callers|test_known_bits)" --version 4
 
 ; REQUIRES: aarch64-registered-target
 
@@ -14,6 +14,7 @@ $test_non_fmv_caller.resolver = comdat any
 $test_priority.resolver = comdat any
 $test_alternative_names.resolver = comdat any
 $test_unrelated_callers.resolver = comdat any
+$test_known_bits.resolver = comdat any
 $caller1.resolver = comdat any
 $caller2.resolver = comdat any
 $caller3.resolver = comdat any
@@ -21,6 +22,7 @@ $caller6.resolver = comdat any
 $caller7.resolver = comdat any
 $caller8.resolver = comdat any
 $caller9.resolver = comdat any
+$caller11.resolver = comdat any
 
 @__aarch64_cpu_features = external local_unnamed_addr global { i64 }
 
@@ -31,6 +33,7 @@ $caller9.resolver = comdat any
 @test_priority = weak_odr ifunc i32 (), ptr @test_priority.resolver
 @test_alternative_names = weak_odr ifunc i32 (), ptr @test_alternative_names.resolver
 @test_unrelated_callers = weak_odr ifunc i32 (), ptr @test_unrelated_callers.resolver
+@test_known_bits = weak_odr ifunc i32 (), ptr @test_known_bits.resolver
 @caller1 = weak_odr ifunc i32 (), ptr @caller1.resolver
 @caller2 = weak_odr ifunc i32 (), ptr @caller2.resolver
 @caller3 = weak_odr ifunc i32 (), ptr @caller3.resolver
@@ -38,6 +41,7 @@ $caller9.resolver = comdat any
 @caller7 = weak_odr ifunc i32 (), ptr @caller7.resolver
 @caller8 = weak_odr ifunc i32 (), ptr @caller8.resolver
 @caller9 = weak_odr ifunc i32 (), ptr @caller9.resolver
+@caller11 = weak_odr ifunc i32 (), ptr @caller11.resolver
 
 declare void @__init_cpu_features_resolver() local_unnamed_addr
 
@@ -509,7 +513,7 @@ entry:
 define dso_local i32 @caller8._Msve2() #2 {
 ; CHECK-LABEL: define dso_local i32 @caller8._Msve2(
 ; CHECK-SAME: ) #[[ATTR2]] {
-; CHECK:    [[CALL:%.*]] = tail call i32 @test_unrelated_callers()
+; CHECK:    [[CALL:%.*]] = tail call i32 @test_unrelated_callers._Msve2()
 ;
 entry:
   %call = tail call i32 @test_unrelated_callers()
@@ -591,6 +595,89 @@ entry:
   ret i32 %call
 }
 
+declare i32 @test_known_bits._Mmops() #3
+declare i32 @test_known_bits._Maes() #6
+declare i32 @test_known_bits.default() #0
+
+define weak_odr ptr @test_known_bits.resolver() comdat {
+; CHECK-LABEL: define weak_odr ptr @test_known_bits.resolver() comdat {
+resolver_entry:
+  tail call void @__init_cpu_features_resolver()
+  %0 = load i64, ptr @__aarch64_cpu_features, align 8
+  %1 = and i64 %0, 576460752303423488
+  %.not = icmp eq i64 %1, 0
+  %2 = and i64 %0, 33536
+  %3 = icmp eq i64 %2, 33536
+  %test_known_bits._Maes.test_known_bits.default = select i1 %3, ptr @test_known_bits._Maes, ptr @test_known_bits.default
+  %common.ret.op = select i1 %.not, ptr %test_known_bits._Maes.test_known_bits.default, ptr @test_known_bits._Mmops
+  ret ptr %common.ret.op
+}
+
+define i32 @caller11._MmopsMsve2() #4 {
+; CHECK-LABEL: define i32 @caller11._MmopsMsve2(
+; CHECK-SAME: ) #[[ATTR4]] {
+; CHECK:    [[CALL:%.*]] = tail call i32 @test_known_bits._Mmops()
+;
+entry:
+  %call = tail call i32 @test_known_bits()
+  ret i32 %call
+}
+
+define i32 @caller11._Msme() #5 {
+; CHECK-LABEL: define i32 @caller11._Msme(
+; CHECK-SAME: ) #[[ATTR5:[0-9]+]] {
+; CHECK:    [[CALL:%.*]] = tail call i32 @test_known_bits()
+;
+entry:
+  %call = tail call i32 @test_known_bits()
+  ret i32 %call
+}
+
+define noundef i32 @caller11._MaesMsve2() #19 {
+; CHECK-LABEL: define noundef i32 @caller11._MaesMsve2(
+; CHECK-SAME: ) #[[ATTR19:[0-9]+]] {
+; CHECK:    [[CALL:%.*]] = tail call i32 @test_known_bits._Maes()
+;
+entry:
+  %call = tail call i32 @test_known_bits()
+  ret i32 %call
+}
+
+define i32 @caller11.default() #0 {
+; CHECK-LABEL: define i32 @caller11.default(
+; CHECK-SAME: ) #[[ATTR0]] {
+; CHECK:    [[CALL:%.*]] = tail call i32 @test_known_bits()
+;
+entry:
+  %call = tail call i32 @test_known_bits()
+  ret i32 %call
+}
+
+define weak_odr ptr @caller11.resolver() comdat {
+; CHECK-LABEL: define weak_odr ptr @caller11.resolver() comdat {
+resolver_entry:
+  tail call void @__init_cpu_features_resolver()
+  %0 = load i64, ptr @__aarch64_cpu_features, align 8
+  %1 = and i64 %0, 576460822096707840
+  %2 = icmp eq i64 %1, 576460822096707840
+  br i1 %2, label %common.ret, label %resolver_else
+
+common.ret:                                       ; preds = %resolver_else2, %resolver_else, %resolver_entry
+  %common.ret.op = phi ptr [ @caller11._MmopsMsve2, %resolver_entry ], [ @caller11._Msme, %resolver_else ], [ %caller11._MaesMsve2.caller11.default, %resolver_else2 ]
+  ret ptr %common.ret.op
+
+resolver_else:                                    ; preds = %resolver_entry
+  %3 = and i64 %0, 4398180795136
+  %4 = icmp eq i64 %3, 4398180795136
+  br i1 %4, label %common.ret, label %resolver_else2
+
+resolver_else2:                                   ; preds = %resolver_else
+  %5 = and i64 %0, 69793317632
+  %6 = icmp eq i64 %5, 69793317632
+  %caller11._MaesMsve2.caller11.default = select i1 %6, ptr @caller11._MaesMsve2, ptr @caller11.default
+  br label %common.ret
+}
+
 attributes #0 = { "fmv-features" }
 attributes #1 = { "fmv-features"="sve" }
 attributes #2 = { "fmv-features"="sve2" }
@@ -610,3 +697,4 @@ attributes #15 = { "fmv-features"="flagm2,frintts" }
 attributes #16 = { "fmv-features"="rcpc2" }
 attributes #17 = { "fmv-features"="frintts" }
 attributes #18 = { "target-features"="+fp-armv8,+mops,+neon,+outline-atomics,+sve,+v8a" }
+attributes #19 = { "fmv-features"="aes,sve2" }
diff --git a/llvm/test/Transforms/IndVarSimplify/AArch64/widen-loop-comp.ll b/llvm/test/Transforms/IndVarSimplify/AArch64/widen-loop-comp.ll
index 257816650017a..d4498baf0577a 100644
--- a/llvm/test/Transforms/IndVarSimplify/AArch64/widen-loop-comp.ll
+++ b/llvm/test/Transforms/IndVarSimplify/AArch64/widen-loop-comp.ll
@@ -97,7 +97,7 @@ define void @test2(ptr %a, ptr %b, i8 %limit, i1 %arg) {
 ; CHECK-LABEL: @test2(
 ; CHECK-NEXT:  entry:
 ; CHECK-NEXT:    [[CONV:%.*]] = zext i8 [[LIMIT:%.*]] to i32
-; CHECK-NEXT:    br i1 %arg, label [[FOR_COND1_PREHEADER_PREHEADER:%.*]], label [[FOR_COND1_PREHEADER_US_PREHEADER:%.*]]
+; CHECK-NEXT:    br i1 [[ARG:%.*]], label [[FOR_COND1_PREHEADER_PREHEADER:%.*]], label [[FOR_COND1_PREHEADER_US_PREHEADER:%.*]]
 ; CHECK:       for.cond1.preheader.us.preheader:
 ; CHECK-NEXT:    [[SMAX:%.*]] = call i32 @llvm.smax.i32(i32 [[CONV]], i32 1)
 ; CHECK-NEXT:    br label [[FOR_COND1_PREHEADER_US:%.*]]
@@ -237,8 +237,7 @@ define i32 @test4(i32 %a) {
 ; CHECK-NEXT:    [[CONV3:%.*]] = trunc i32 [[OR]] to i8
 ; CHECK-NEXT:    [[CALL:%.*]] = call i32 @fn1(i8 signext [[CONV3]])
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nsw i32 [[INDVARS_IV]], -1
-; CHECK-NEXT:    [[TMP0:%.*]] = trunc nuw i32 [[INDVARS_IV_NEXT]] to i8
-; CHECK-NEXT:    [[CMP:%.*]] = icmp sgt i8 [[TMP0]], -14
+; CHECK-NEXT:    [[CMP:%.*]] = icmp samesign ugt i32 [[INDVARS_IV_NEXT]], 242
 ; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_BODY]], label [[FOR_END:%.*]]
 ; CHECK:       for.end:
 ; CHECK-NEXT:    ret i32 0
diff --git a/llvm/test/Transforms/IndVarSimplify/ARM/code-size.ll b/llvm/test/Transforms/IndVarSimplify/ARM/code-size.ll
index 2003b1a72206d..7080707bc1de9 100644
--- a/llvm/test/Transforms/IndVarSimplify/ARM/code-size.ll
+++ b/llvm/test/Transforms/IndVarSimplify/ARM/code-size.ll
@@ -4,13 +4,12 @@
 
 define i32 @remove_loop(i32 %size) #0 {
 ; CHECK-V8M-LABEL: @remove_loop(
-; CHECK-V8M-SAME: i32 [[SIZE:%.*]]) #[[ATTR0:[0-9]+]] {
 ; CHECK-V8M-NEXT:  entry:
-; CHECK-V8M-NEXT:    br label %[[WHILE_COND:.*]]
+; CHECK-V8M-NEXT:    br label [[WHILE_COND:%.*]]
 ; CHECK-V8M:       while.cond:
-; CHECK-V8M-NEXT:    br i1 false, label %[[WHILE_COND]], label %[[WHILE_END:.*]]
+; CHECK-V8M-NEXT:    br i1 false, label [[WHILE_COND]], label [[WHILE_END:%.*]]
 ; CHECK-V8M:       while.end:
-; CHECK-V8M-NEXT:    [[TMP0:%.*]] = add i32 [[SIZE]], 31
+; CHECK-V8M-NEXT:    [[TMP0:%.*]] = add i32 [[SIZE:%.*]], 31
 ; CHECK-V8M-NEXT:    [[UMIN:%.*]] = call i32 @llvm.umin.i32(i32 [[SIZE]], i32 31)
 ; CHECK-V8M-NEXT:    [[TMP1:%.*]] = sub i32 [[TMP0]], [[UMIN]]
 ; CHECK-V8M-NEXT:    [[TMP2:%.*]] = lshr i32 [[TMP1]], 5
@@ -19,13 +18,12 @@ define i32 @remove_loop(i32 %size) #0 {
 ; CHECK-V8M-NEXT:    ret i32 [[TMP4]]
 ;
 ; CHECK-V8A-LABEL: @remove_loop(
-; CHECK-V8A-SAME: i32 [[SIZE:%.*]]) #[[ATTR0:[0-9]+]] {
 ; CHECK-V8A-NEXT:  entry:
-; CHECK-V8A-NEXT:    br label %[[WHILE_COND:.*]]
+; CHECK-V8A-NEXT:    br label [[WHILE_COND:%.*]]
 ; CHECK-V8A:       while.cond:
-; CHECK-V8A-NEXT:    br i1 false, label %[[WHILE_COND]], label %[[WHILE_END:.*]]
+; CHECK-V8A-NEXT:    br i1 false, label [[WHILE_COND]], label [[WHILE_END:%.*]]
 ; CHECK-V8A:       while.end:
-; CHECK-V8A-NEXT:    [[TMP0:%.*]] = add i32 [[SIZE]], 31
+; CHECK-V8A-NEXT:    [[TMP0:%.*]] = add i32 [[SIZE:%.*]], 31
 ; CHECK-V8A-NEXT:    [[UMIN:%.*]] = call i32 @llvm.umin.i32(i32 [[SIZE]], i32 31)
 ; CHECK-V8A-NEXT:    [[TMP1:%.*]] = sub i32 [[TMP0]], [[UMIN]]
 ; CHECK-V8A-NEXT:    [[TMP2:%.*]] = lshr i32 [[TMP1]], 5
@@ -751,7 +749,7 @@ define i32 @different_ivs(ptr %array, i32 %length, i32 %n) #0 {
 ; CHECK-V8M-NEXT:    [[ARRAY_I:%.*]] = load i32, ptr [[ARRAY_I_PTR]], align 4
 ; CHECK-V8M-NEXT:    [[LOOP_ACC_NEXT]] = add i32 [[LOOP_ACC]], [[ARRAY_I]]
 ; CHECK-V8M-NEXT:    [[I_NEXT]] = add nuw nsw i64 [[I]], 1
-; CHECK-V8M-NEXT:    [[CONTINUE:%.*]] = icmp ult i64 [[I_NEXT]], [[N64]]
+; CHECK-V8M-NEXT:    [[CONTINUE:%.*]] = icmp samesign ult i64 [[I_NEXT]], [[N64]]
 ; CHECK-V8M-NEXT:    br i1 [[CONTINUE]], label [[LOOP]], label [[EXIT:%.*]]
 ; CHECK-V8M:       exit:
 ; CHECK-V8M-NEXT:    [[RESULT:%.*]] = phi i32 [ [[LOOP_ACC_NEXT]], [[GUARDED]] ]
@@ -780,7 +778,7 @@ define i32 @different_ivs(ptr %array, i32 %length, i32 %n) #0 {
 ; CHECK-V8A-NEXT:    [[ARRAY_I:%.*]] = load i32, ptr [[ARRAY_I_PTR]], align 4
 ; CHECK-V8A-NEXT:    [[LOOP_ACC_NEXT]] = add i32 [[LOOP_ACC]], [[ARRAY_I]]
 ; CHECK-V8A-NEXT:    [[I_NEXT]] = add nuw nsw i64 [[I]], 1
-; CHECK-V8A-NEXT:    [[CONTINUE:%.*]] = icmp ult i64 [[I_NEXT]], [[N64]]
+; CHECK-V8A-NEXT:    [[CONTINUE:%.*]] = icmp samesign ult i64 [[I_NEXT]], [[N64]]
 ; CHECK-V8A-NEXT:    br i1 [[CONTINUE]], label [[LOOP]], label [[EXIT:%.*]]
 ; CHECK-V8A:       exit:
 ; CHECK-V8A-NEXT:    [[RESULT:%.*]] = phi i32 [ [[LOOP_ACC_NEXT]], [[GUARDED]] ]
diff --git a/llvm/test/Transforms/IndVarSimplify/ARM/indvar-unroll-imm-cost.ll b/llvm/test/Transforms/IndVarSimplify/ARM/indvar-unroll-imm-cost.ll
index 2261423766792..1cec2dd83988b 100644
--- a/llvm/test/Transforms/IndVarSimplify/ARM/indvar-unroll-imm-cost.ll
+++ b/llvm/test/Transforms/IndVarSimplify/ARM/indvar-unroll-imm-cost.ll
@@ -60,7 +60,7 @@ define dso_local arm_aapcscc void @test(ptr nocapture %pDest, ptr nocapture read
 ; CHECK-NEXT:    [[ADD_PTR23]] = getelementptr inbounds i16, ptr [[PSRCB_ADDR_173]], i32 4
 ; CHECK-NEXT:    [[INCDEC_PTR]] = getelementptr inbounds i32, ptr [[PDEST_ADDR_175]], i32 1
 ; CHECK-NEXT:    [[ADD24]] = add nuw nsw i32 [[J_076]], 4
-; CHECK-NEXT:    [[CMP2:%.*]] = icmp ult i32 [[ADD24]], [[TMP0]]
+; CHECK-NEXT:    [[CMP2:%.*]] = icmp samesign ult i32 [[ADD24]], [[TMP0]]
 ; CHECK-NEXT:    br i1 [[CMP2]], label [[FOR_BODY3]], label [[FOR_END_LOOPEXIT:%.*]]
 ; CHECK:       for.end.loopexit:
 ; CHECK-NEXT:    [[ADD_PTR_LCSSA:%.*]] = phi ptr [ [[ADD_PTR]], [[FOR_BODY3]] ]
diff --git a/llvm/test/Transforms/IndVarSimplify/X86/eliminate-trunc.ll b/llvm/test/Transforms/IndVarSimplify/X86/eliminate-trunc.ll
index 565ac5c8743d4..a506739ad6cc8 100644
--- a/llvm/test/Transforms/IndVarSimplify/X86/eliminate-trunc.ll
+++ b/llvm/test/Transforms/IndVarSimplify/X86/eliminate-trunc.ll
@@ -414,7 +414,7 @@ define void @test_08(i32 %n) {
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 1, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
 ; CHECK-NEXT:    [[TMP0:%.*]] = icmp slt i64 [[IV]], [[SEXT]]
-; CHECK-NEXT:    [[TMP1:%.*]] = icmp ult i64 [[IV]], [[ZEXT]]
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp samesign ult i64 [[IV]], [[ZEXT]]
 ; CHECK-NEXT:    [[CMP:%.*]] = and i1 [[TMP0]], [[TMP1]]
 ; CHECK-NEXT:    br i1 [[CMP]], label [[LOOP]], label [[EXIT:%.*]]
 ; CHECK:       exit:
@@ -600,7 +600,7 @@ define void @test_13b(i32 %n) {
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 2
-; CHECK-NEXT:    [[TMP0:%.*]] = icmp ult i64 [[IV]], 1024
+; CHECK-NEXT:    [[TMP0:%.*]] = icmp samesign ult i64 [[IV]], 1024
 ; CHECK-NEXT:    br i1 [[TMP0]], label [[LOOP]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
@@ -625,7 +625,7 @@ define void @test_13c(i32 %n) {
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 2
-; CHECK-NEXT:    [[TMP0:%.*]] = icmp ult i64 [[IV]], 1024
+; CHECK-NEXT:    [[TMP0:%.*]] = icmp samesign ult i64 [[IV]], 1024
 ; CHECK-NEXT:    br i1 [[TMP0]], label [[LOOP]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
diff --git a/llvm/test/Transforms/IndVarSimplify/X86/iv-widen.ll b/llvm/test/Transforms/IndVarSimplify/X86/iv-widen.ll
index cc0f2587266a2..45bb66d1d7d80 100644
--- a/llvm/test/Transforms/IndVarSimplify/X86/iv-widen.ll
+++ b/llvm/test/Transforms/IndVarSimplify/X86/iv-widen.ll
@@ -16,7 +16,7 @@ declare void @use(i64 %x)
 define void @loop_0(ptr %a, i1 %arg) {
 ; CHECK-LABEL: @loop_0(
 ; CHECK-NEXT:  Prologue:
-; CHECK-NEXT:    br i1 %arg, label [[B18_PREHEADER:%.*]], label [[B6:%.*]]
+; CHECK-NEXT:    br i1 [[ARG:%.*]], label [[B18_PREHEADER:%.*]], label [[B6:%.*]]
 ; CHECK:       B18.preheader:
 ; CHECK-NEXT:    br label [[B18:%.*]]
 ; CHECK:       B18:
@@ -70,7 +70,7 @@ exit24:                      ; preds = %B18
 define void @loop_0_dead(ptr %a, i1 %arg) {
 ; CHECK-LABEL: @loop_0_dead(
 ; CHECK-NEXT:  Prologue:
-; CHECK-NEXT:    br i1 %arg, label [[B18_PREHEADER:%.*]], label [[B6:%.*]]
+; CHECK-NEXT:    br i1 [[ARG:%.*]], label [[B18_PREHEADER:%.*]], label [[B6:%.*]]
 ; CHECK:       B18.preheader:
 ; CHECK-NEXT:    br label [[B18:%.*]]
 ; CHECK:       B18:
diff --git a/llvm/test/Transforms/IndVarSimplify/X86/pr59615.ll b/llvm/test/Transforms/IndVarSimplify/X86/pr59615.ll
index 17b7b9d40b07a..1e5a4156686ad 100644
--- a/llvm/test/Transforms/IndVarSimplify/X86/pr59615.ll
+++ b/llvm/test/Transforms/IndVarSimplify/X86/pr59615.ll
@@ -7,7 +7,7 @@ target triple = "x86_64-unknown-linux-gnu"
 define void @test() {
 ; CHECK-LABEL: @test(
 ; CHECK-NEXT:  bb:
-; CHECK-NEXT:    [[VAR:%.*]] = load atomic i32, ptr addrspace(1) poison unordered, align 8, !range [[RNG0:![0-9]+]], !invariant.load !1, !noundef !1
+; CHECK-NEXT:    [[VAR:%.*]] = load atomic i32, ptr addrspace(1) poison unordered, align 8, !range [[RNG0:![0-9]+]], !invariant.load [[META1:![0-9]+]], !noundef [[META1]]
 ; CHECK-NEXT:    [[VAR2:%.*]] = icmp eq i32 [[VAR]], 0
 ; CHECK-NEXT:    br i1 [[VAR2]], label [[BB18:%.*]], label [[BB19:%.*]]
 ; CHECK:       bb3:
@@ -16,9 +16,9 @@ define void @test() {
 ; CHECK:       bb7:
 ; CHECK-NEXT:    ret void
 ; CHECK:       bb8:
-; CHECK-NEXT:    [[VAR9:%.*]] = load atomic i32, ptr addrspace(1) poison unordered, align 8, !range [[RNG0]], !invariant.load !1, !noundef !1
+; CHECK-NEXT:    [[VAR9:%.*]] = load atomic i32, ptr addrspace(1) poison unordered, align 8, !range [[RNG0]], !invariant.load [[META1]], !noundef [[META1]]
 ; CHECK-NEXT:    [[TMP0:%.*]] = zext i32 [[VAR9]] to i64
-; CHECK-NEXT:    [[VAR10:%.*]] = icmp ult i64 [[INDVARS_IV]], [[TMP0]]
+; CHECK-NEXT:    [[VAR10:%.*]] = icmp samesign ult i64 [[INDVARS_IV]], [[TMP0]]
 ; CHECK-NEXT:    br i1 [[VAR10]], label [[BB12]], label [[BB11:%.*]]
 ; CHECK:       bb11:
 ; CHECK-NEXT:    ret void
diff --git a/llvm/test/Transforms/IndVarSimplify/backedge-on-min-max.ll b/llvm/test/Transforms/IndVarSimplify/backedge-on-min-max.ll
index c4b9a4e711b64..577edc3650d90 100644
--- a/llvm/test/Transforms/IndVarSimplify/backedge-on-min-max.ll
+++ b/llvm/test/Transforms/IndVarSimplify/backedge-on-min-max.ll
@@ -535,7 +535,7 @@ define void @min.unsigned.3(ptr %a, i32 %n) {
 ; CHECK-NEXT:    store i32 [[IDX]], ptr [[ADDR]], align 4
 ; CHECK-NEXT:    br label [[LATCH]]
 ; CHECK:       latch:
-; CHECK-NEXT:    [[BE_COND:%.*]] = icmp ult i32 [[IDX_INC]], [[UMIN]]
+; CHECK-NEXT:    [[BE_COND:%.*]] = icmp samesign ult i32 [[IDX_INC]], [[UMIN]]
 ; CHECK-NEXT:    br i1 [[BE_COND]], label [[LOOP]], label [[EXIT_LOOPEXIT:%.*]]
 ; CHECK:       exit.loopexit:
 ; CHECK-NEXT:    br label [[EXIT]]
@@ -586,7 +586,7 @@ define void @min.unsigned.4(ptr %a, i32 %n) {
 ; CHECK-NEXT:    store i32 [[IDX]], ptr [[ADDR]], align 4
 ; CHECK-NEXT:    br label [[LATCH]]
 ; CHECK:       latch:
-; CHECK-NEXT:    [[BE_COND:%.*]] = icmp ult i32 [[IDX_INC]], [[UMIN]]
+; CHECK-NEXT:    [[BE_COND:%.*]] = icmp samesign ult i32 [[IDX_INC]], [[UMIN]]
 ; CHECK-NEXT:    br i1 [[BE_COND]], label [[LOOP]], label [[EXIT_LOOPEXIT:%.*]]
 ; CHECK:       exit.loopexit:
 ; CHECK-NEXT:    br label [[EXIT]]
diff --git a/llvm/test/Transforms/IndVarSimplify/canonicalize-cmp.ll b/llvm/test/Transforms/IndVarSimplify/canonicalize-cmp.ll
index 4b52479fc6c4d..6ac09fafcb7a4 100644
--- a/llvm/test/Transforms/IndVarSimplify/canonicalize-cmp.ll
+++ b/llvm/test/Transforms/IndVarSimplify/canonicalize-cmp.ll
@@ -21,7 +21,7 @@ define i32 @test_01(i32 %a, i32 %b, ptr %p) {
 ; CHECK-NEXT:    store i32 [[A:%.*]], ptr [[P]], align 4
 ; CHECK-NEXT:    br label [[MERGE]]
 ; CHECK:       merge:
-; CHECK-NEXT:    [[CMP2:%.*]] = icmp ult i32 [[IV]], 100
+; CHECK-NEXT:    [[CMP2:%.*]] = icmp samesign ult i32 [[IV]], 100
 ; CHECK-NEXT:    br i1 [[CMP2]], label [[B3:%.*]], label [[B4:%.*]]
 ; CHECK:       b3:
 ; CHECK-NEXT:    store i32 [[IV]], ptr [[P]], align 4
@@ -89,7 +89,7 @@ define i32 @test_02(i32 %a, i32 %b, ptr %p) {
 ; CHECK-NEXT:    store i32 [[A:%.*]], ptr [[P]], align 4
 ; CHECK-NEXT:    br label [[MERGE]]
 ; CHECK:       merge:
-; CHECK-NEXT:    [[CMP2:%.*]] = icmp ugt i32 100, [[IV]]
+; CHECK-NEXT:    [[CMP2:%.*]] = icmp samesign ugt i32 100, [[IV]]
 ; CHECK-NEXT:    br i1 [[CMP2]], label [[B3:%.*]], label [[B4:%.*]]
 ; CHECK:       b3:
 ; CHECK-NEXT:    store i32 [[IV]], ptr [[P]], align 4
diff --git a/llvm/test/Transforms/IndVarSimplify/constant_result.ll b/llvm/test/Transforms/IndVarSimplify/constant_result.ll
index 1eb5bb9a4dc14..61c9b030a60dd 100644
--- a/llvm/test/Transforms/IndVarSimplify/constant_result.ll
+++ b/llvm/test/Transforms/IndVarSimplify/constant_result.ll
@@ -12,7 +12,7 @@ define i16 @foo() {
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [400 x i16], ptr @Y, i16 0, i16 [[I]]
 ; CHECK-NEXT:    store i16 0, ptr [[ARRAYIDX]], align 1
 ; CHECK-NEXT:    [[INC]] = add nuw nsw i16 [[I]], 1
-; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i16 [[INC]], 400
+; CHECK-NEXT:    [[CMP:%.*]] = icmp samesign ult i16 [[INC]], 400
 ; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_BODY]], label [[FOR_END:%.*]]
 ; CHECK:       for.end:
 ; CHECK-NEXT:    ret i16 400
diff --git a/llvm/test/Transforms/IndVarSimplify/cycled_phis.ll b/llvm/test/Transforms/IndVarSimplify/cycled_phis.ll
index 9843a7ec028b6..42729fca78789 100644
--- a/llvm/test/Transforms/IndVarSimplify/cycled_phis.ll
+++ b/llvm/test/Transforms/IndVarSimplify/cycled_phis.ll
@@ -144,7 +144,7 @@ define i32 @start.from.sibling.iv(ptr %len.ptr, ptr %sibling.len.ptr) {
 ; CHECK-NEXT:    br label [[SIBLING_LOOP:%.*]]
 ; CHECK:       sibling.loop:
 ; CHECK-NEXT:    [[SIBLING_IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[SIBLING_IV_NEXT:%.*]], [[SIBLING_BACKEDGE:%.*]] ]
-; CHECK-NEXT:    [[SIBLING_RC:%.*]] = icmp ult i32 [[SIBLING_IV]], [[SIBLING_LEN]]
+; CHECK-NEXT:    [[SIBLING_RC:%.*]] = icmp samesign ult i32 [[SIBLING_IV]], [[SIBLING_LEN]]
 ; CHECK-NEXT:    br i1 [[SIBLING_RC]], label [[SIBLING_BACKEDGE]], label [[FAILED_SIBLING:%.*]]
 ; CHECK:       sibling.backedge:
 ; CHECK-NEXT:    [[SIBLING_IV_NEXT]] = add nuw nsw i32 [[SIBLING_IV]], 1
@@ -235,7 +235,7 @@ define i32 @start.from.sibling.iv.wide(ptr %len.ptr, ptr %sibling.len.ptr) {
 ; CHECK-NEXT:    br label [[SIBLING_LOOP:%.*]]
 ; CHECK:       sibling.loop:
 ; CHECK-NEXT:    [[SIBLING_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[SIBLING_IV_NEXT:%.*]], [[SIBLING_BACKEDGE:%.*]] ]
-; CHECK-NEXT:    [[SIBLING_RC:%.*]] = icmp ult i64 [[SIBLING_IV]], [[SIBLING_LEN_WIDE]]
+; CHECK-NEXT:    [[SIBLING_RC:%.*]] = icmp samesign ult i64 [[SIBLING_IV]], [[SIBLING_LEN_WIDE]]
 ; CHECK-NEXT:    br i1 [[SIBLING_RC]], label [[SIBLING_BACKEDGE]], label [[FAILED_SIBLING:%.*]]
 ; CHECK:       sibling.backedge:
 ; CHECK-NEXT:    [[SIBLING_IV_NEXT]] = add nuw nsw i64 [[SIBLING_IV]], 1
@@ -331,7 +331,7 @@ define i32 @start.from.sibling.iv.wide.cycled.phis(ptr %len.ptr, ptr %sibling.le
 ; CHECK-NEXT:    br label [[SIBLING_LOOP:%.*]]
 ; CHECK:       sibling.loop:
 ; CHECK-NEXT:    [[SIBLING_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[SIBLING_IV_NEXT:%.*]], [[SIBLING_BACKEDGE:%.*]] ]
-; CHECK-NEXT:    [[SIBLING_RC:%.*]] = icmp ult i64 [[SIBLING_IV]], [[SIBLING_LEN_WIDE]]
+; CHECK-NEXT:    [[SIBLING_RC:%.*]] = icmp samesign ult i64 [[SIBLING_IV]], [[SIBLING_LEN_WIDE]]
 ; CHECK-NEXT:    br i1 [[SIBLING_RC]], label [[SIBLING_BACKEDGE]], label [[FAILED_SIBLING:%.*]]
 ; CHECK:       sibling.backedge:
 ; CHECK-NEXT:    [[SIBLING_IV_NEXT]] = add nuw nsw i64 [[SIBLING_IV]], 1
@@ -449,7 +449,7 @@ define i32 @start.from.sibling.iv.wide.cycled.phis.complex.phis(ptr %len.ptr, pt
 ; CHECK-NEXT:    br label [[SIBLING_LOOP:%.*]]
 ; CHECK:       sibling.loop:
 ; CHECK-NEXT:    [[SIBLING_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[SIBLING_IV_NEXT:%.*]], [[SIBLING_BACKEDGE:%.*]] ]
-; CHECK-NEXT:    [[SIBLING_RC:%.*]] = icmp ult i64 [[SIBLING_IV]], [[SIBLING_LEN_WIDE]]
+; CHECK-NEXT:    [[SIBLING_RC:%.*]] = icmp samesign ult i64 [[SIBLING_IV]], [[SIBLING_LEN_WIDE]]
 ; CHECK-NEXT:    br i1 [[SIBLING_RC]], label [[SIBLING_BACKEDGE]], label [[FAILED_SIBLING:%.*]]
 ; CHECK:       sibling.backedge:
 ; CHECK-NEXT:    [[SIBLING_IV_NEXT]] = add nuw nsw i64 [[SIBLING_IV]], 1
diff --git a/llvm/test/Transforms/IndVarSimplify/debugloc-rem-subst.ll b/llvm/test/Transforms/IndVarSimplify/debugloc-rem-subst.ll
index 121eec75c1b3c..4502416a19477 100644
--- a/llvm/test/Transforms/IndVarSimplify/debugloc-rem-subst.ll
+++ b/llvm/test/Transforms/IndVarSimplify/debugloc-rem-subst.ll
@@ -51,7 +51,7 @@ bb2:                                              ; preds = %bb2, %bb1
 !8 = !DILocation(line: 1, column: 1, scope: !5)
 ;.
 ; CHECK: [[META0:![0-9]+]] = distinct !DICompileUnit(language: DW_LANG_C, file: [[META1:![0-9]+]], producer: "debugify", isOptimized: true, runtimeVersion: 0, emissionKind: FullDebug)
-; CHECK: [[META1]] = !DIFile(filename: "llvm/test/Transforms/IndVarSimplify/debugloc-rem-subst.ll", directory: {{.*}})
+; CHECK: [[META1]] = !DIFile(filename: "{{.*}}debugloc-rem-subst.ll", directory: {{.*}})
 ; CHECK: [[DBG5]] = distinct !DISubprogram(name: "widget", linkageName: "widget", scope: null, file: [[META1]], line: 1, type: [[META6:![0-9]+]], scopeLine: 1, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: [[META0]])
 ; CHECK: [[META6]] = !DISubroutineType(types: [[META7:![0-9]+]])
 ; CHECK: [[META7]] = !{}
diff --git a/llvm/test/Transforms/IndVarSimplify/dont-recompute.ll b/llvm/test/Transforms/IndVarSimplify/dont-recompute.ll
index b4cd98cd234f0..6a809fe45d660 100644
--- a/llvm/test/Transforms/IndVarSimplify/dont-recompute.ll
+++ b/llvm/test/Transforms/IndVarSimplify/dont-recompute.ll
@@ -211,7 +211,7 @@ define void @test6(i32 %m, ptr %p) nounwind uwtable {
 ; CHECK-NEXT:    [[ADD]] = add i32 [[A_05]], [[M:%.*]]
 ; CHECK-NEXT:    [[SOFT_USE:%.*]] = add i32 [[ADD]], 123
 ; CHECK-NEXT:    [[PIDX:%.*]] = getelementptr i32, ptr [[P:%.*]], i32 [[ADD]]
-; CHECK-NEXT:    store i32 [[SOFT_USE]], ptr [[PIDX]]
+; CHECK-NEXT:    store i32 [[SOFT_USE]], ptr [[PIDX]], align 4
 ; CHECK-NEXT:    [[INC]] = add nuw nsw i32 [[I_06]], 1
 ; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i32 [[INC]], 186
 ; CHECK-NEXT:    br i1 [[EXITCOND]], label [[FOR_END:%.*]], label [[FOR_BODY]]
diff --git a/llvm/test/Transforms/IndVarSimplify/eliminate-exit.ll b/llvm/test/Transforms/IndVarSimplify/eliminate-exit.ll
index b24650830778f..b20891d2f9ed8 100644
--- a/llvm/test/Transforms/IndVarSimplify/eliminate-exit.ll
+++ b/llvm/test/Transforms/IndVarSimplify/eliminate-exit.ll
@@ -193,7 +193,7 @@ define void @mixed_width(i32 %len) {
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
-; CHECK-NEXT:    [[CMP1:%.*]] = icmp ult i64 [[IV]], [[LEN_ZEXT]]
+; CHECK-NEXT:    [[CMP1:%.*]] = icmp samesign ult i64 [[IV]], [[LEN_ZEXT]]
 ; CHECK-NEXT:    br i1 [[CMP1]], label [[BACKEDGE]], label [[EXIT:%.*]]
 ; CHECK:       backedge:
 ; CHECK-NEXT:    call void @side_effect()
@@ -221,6 +221,220 @@ exit:
 }
 
 define void @many_exits([100 x i64] %len) {
+; CHECK-LABEL: @many_exits(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[LEN1:%.*]] = extractvalue [100 x i64] [[LEN:%.*]], 1
+; CHECK-NEXT:    [[LEN2:%.*]] = extractvalue [100 x i64] [[LEN]], 2
+; CHECK-NEXT:    [[LEN3:%.*]] = extractvalue [100 x i64] [[LEN]], 3
+; CHECK-NEXT:    [[LEN4:%.*]] = extractvalue [100 x i64] [[LEN]], 4
+; CHECK-NEXT:    [[LEN5:%.*]] = extractvalue [100 x i64] [[LEN]], 5
+; CHECK-NEXT:    [[LEN6:%.*]] = extractvalue [100 x i64] [[LEN]], 6
+; CHECK-NEXT:    [[LEN7:%.*]] = extractvalue [100 x i64] [[LEN]], 7
+; CHECK-NEXT:    [[LEN8:%.*]] = extractvalue [100 x i64] [[LEN]], 8
+; CHECK-NEXT:    [[LEN9:%.*]] = extractvalue [100 x i64] [[LEN]], 9
+; CHECK-NEXT:    [[LEN10:%.*]] = extractvalue [100 x i64] [[LEN]], 10
+; CHECK-NEXT:    [[LEN11:%.*]] = extractvalue [100 x i64] [[LEN]], 11
+; CHECK-NEXT:    [[LEN12:%.*]] = extractvalue [100 x i64] [[LEN]], 12
+; CHECK-NEXT:    [[LEN13:%.*]] = extractvalue [100 x i64] [[LEN]], 13
+; CHECK-NEXT:    [[LEN14:%.*]] = extractvalue [100 x i64] [[LEN]], 14
+; CHECK-NEXT:    [[LEN15:%.*]] = extractvalue [100 x i64] [[LEN]], 15
+; CHECK-NEXT:    [[LEN16:%.*]] = extractvalue [100 x i64] [[LEN]], 16
+; CHECK-NEXT:    [[LEN17:%.*]] = extractvalue [100 x i64] [[LEN]], 17
+; CHECK-NEXT:    [[LEN18:%.*]] = extractvalue [100 x i64] [[LEN]], 18
+; CHECK-NEXT:    [[LEN19:%.*]] = extractvalue [100 x i64] [[LEN]], 19
+; CHECK-NEXT:    [[LEN20:%.*]] = extractvalue [100 x i64] [[LEN]], 20
+; CHECK-NEXT:    [[LEN21:%.*]] = extractvalue [100 x i64] [[LEN]], 21
+; CHECK-NEXT:    [[LEN22:%.*]] = extractvalue [100 x i64] [[LEN]], 22
+; CHECK-NEXT:    [[LEN23:%.*]] = extractvalue [100 x i64] [[LEN]], 23
+; CHECK-NEXT:    [[LEN24:%.*]] = extractvalue [100 x i64] [[LEN]], 24
+; CHECK-NEXT:    [[LEN25:%.*]] = extractvalue [100 x i64] [[LEN]], 25
+; CHECK-NEXT:    [[LEN26:%.*]] = extractvalue [100 x i64] [[LEN]], 26
+; CHECK-NEXT:    [[LEN27:%.*]] = extractvalue [100 x i64] [[LEN]], 27
+; CHECK-NEXT:    [[LEN28:%.*]] = extractvalue [100 x i64] [[LEN]], 28
+; CHECK-NEXT:    [[LEN29:%.*]] = extractvalue [100 x i64] [[LEN]], 29
+; CHECK-NEXT:    [[LEN30:%.*]] = extractvalue [100 x i64] [[LEN]], 30
+; CHECK-NEXT:    [[LEN31:%.*]] = extractvalue [100 x i64] [[LEN]], 31
+; CHECK-NEXT:    [[LEN32:%.*]] = extractvalue [100 x i64] [[LEN]], 32
+; CHECK-NEXT:    [[LEN33:%.*]] = extractvalue [100 x i64] [[LEN]], 33
+; CHECK-NEXT:    [[LEN34:%.*]] = extractvalue [100 x i64] [[LEN]], 34
+; CHECK-NEXT:    [[LEN35:%.*]] = extractvalue [100 x i64] [[LEN]], 35
+; CHECK-NEXT:    [[LEN36:%.*]] = extractvalue [100 x i64] [[LEN]], 36
+; CHECK-NEXT:    [[LEN37:%.*]] = extractvalue [100 x i64] [[LEN]], 37
+; CHECK-NEXT:    [[LEN38:%.*]] = extractvalue [100 x i64] [[LEN]], 38
+; CHECK-NEXT:    [[LEN39:%.*]] = extractvalue [100 x i64] [[LEN]], 39
+; CHECK-NEXT:    br label [[LOOP:%.*]]
+; CHECK:       loop:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
+; CHECK-NEXT:    [[LEN0:%.*]] = extractvalue [100 x i64] [[LEN]], 0
+; CHECK-NEXT:    [[EARLY0:%.*]] = icmp eq i64 [[IV]], [[LEN0]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY0]], label [[EXIT:%.*]], label [[CONT0:%.*]]
+; CHECK:       cont0:
+; CHECK-NEXT:    [[EARLY1:%.*]] = icmp eq i64 [[IV]], [[LEN1]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY1]], label [[EXIT]], label [[CONT1:%.*]]
+; CHECK:       cont1:
+; CHECK-NEXT:    [[EARLY2:%.*]] = icmp eq i64 [[IV]], [[LEN2]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY2]], label [[EXIT]], label [[CONT2:%.*]]
+; CHECK:       cont2:
+; CHECK-NEXT:    [[EARLY3:%.*]] = icmp eq i64 [[IV]], [[LEN3]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY3]], label [[EXIT]], label [[CONT3:%.*]]
+; CHECK:       cont3:
+; CHECK-NEXT:    [[EARLY4:%.*]] = icmp eq i64 [[IV]], [[LEN4]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY4]], label [[EXIT]], label [[CONT4:%.*]]
+; CHECK:       cont4:
+; CHECK-NEXT:    [[EARLY5:%.*]] = icmp eq i64 [[IV]], [[LEN5]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY5]], label [[EXIT]], label [[CONT5:%.*]]
+; CHECK:       cont5:
+; CHECK-NEXT:    [[EARLY6:%.*]] = icmp eq i64 [[IV]], [[LEN6]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY6]], label [[EXIT]], label [[CONT6:%.*]]
+; CHECK:       cont6:
+; CHECK-NEXT:    [[EARLY7:%.*]] = icmp eq i64 [[IV]], [[LEN7]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY7]], label [[EXIT]], label [[CONT7:%.*]]
+; CHECK:       cont7:
+; CHECK-NEXT:    [[EARLY8:%.*]] = icmp eq i64 [[IV]], [[LEN8]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY8]], label [[EXIT]], label [[CONT8:%.*]]
+; CHECK:       cont8:
+; CHECK-NEXT:    [[EARLY9:%.*]] = icmp eq i64 [[IV]], [[LEN9]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY9]], label [[EXIT]], label [[CONT9:%.*]]
+; CHECK:       cont9:
+; CHECK-NEXT:    [[EARLY10:%.*]] = icmp eq i64 [[IV]], [[LEN10]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY10]], label [[EXIT]], label [[CONT10:%.*]]
+; CHECK:       cont10:
+; CHECK-NEXT:    [[EARLY11:%.*]] = icmp eq i64 [[IV]], [[LEN11]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY11]], label [[EXIT]], label [[CONT11:%.*]]
+; CHECK:       cont11:
+; CHECK-NEXT:    [[EARLY12:%.*]] = icmp eq i64 [[IV]], [[LEN12]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY12]], label [[EXIT]], label [[CONT12:%.*]]
+; CHECK:       cont12:
+; CHECK-NEXT:    [[EARLY13:%.*]] = icmp eq i64 [[IV]], [[LEN13]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY13]], label [[EXIT]], label [[CONT13:%.*]]
+; CHECK:       cont13:
+; CHECK-NEXT:    [[EARLY14:%.*]] = icmp eq i64 [[IV]], [[LEN14]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY14]], label [[EXIT]], label [[CONT14:%.*]]
+; CHECK:       cont14:
+; CHECK-NEXT:    [[EARLY15:%.*]] = icmp eq i64 [[IV]], [[LEN15]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY15]], label [[EXIT]], label [[CONT15:%.*]]
+; CHECK:       cont15:
+; CHECK-NEXT:    [[EARLY16:%.*]] = icmp eq i64 [[IV]], [[LEN16]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY16]], label [[EXIT]], label [[CONT16:%.*]]
+; CHECK:       cont16:
+; CHECK-NEXT:    [[EARLY17:%.*]] = icmp eq i64 [[IV]], [[LEN17]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY17]], label [[EXIT]], label [[CONT17:%.*]]
+; CHECK:       cont17:
+; CHECK-NEXT:    [[EARLY18:%.*]] = icmp eq i64 [[IV]], [[LEN18]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY18]], label [[EXIT]], label [[CONT18:%.*]]
+; CHECK:       cont18:
+; CHECK-NEXT:    [[EARLY19:%.*]] = icmp eq i64 [[IV]], [[LEN19]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY19]], label [[EXIT]], label [[CONT19:%.*]]
+; CHECK:       cont19:
+; CHECK-NEXT:    [[EARLY20:%.*]] = icmp eq i64 [[IV]], [[LEN20]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY20]], label [[EXIT]], label [[CONT20:%.*]]
+; CHECK:       cont20:
+; CHECK-NEXT:    [[EARLY21:%.*]] = icmp eq i64 [[IV]], [[LEN21]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY21]], label [[EXIT]], label [[CONT21:%.*]]
+; CHECK:       cont21:
+; CHECK-NEXT:    [[EARLY22:%.*]] = icmp eq i64 [[IV]], [[LEN22]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY22]], label [[EXIT]], label [[CONT22:%.*]]
+; CHECK:       cont22:
+; CHECK-NEXT:    [[EARLY23:%.*]] = icmp eq i64 [[IV]], [[LEN23]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY23]], label [[EXIT]], label [[CONT23:%.*]]
+; CHECK:       cont23:
+; CHECK-NEXT:    [[EARLY24:%.*]] = icmp eq i64 [[IV]], [[LEN24]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY24]], label [[EXIT]], label [[CONT24:%.*]]
+; CHECK:       cont24:
+; CHECK-NEXT:    [[EARLY25:%.*]] = icmp eq i64 [[IV]], [[LEN25]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY25]], label [[EXIT]], label [[CONT25:%.*]]
+; CHECK:       cont25:
+; CHECK-NEXT:    [[EARLY26:%.*]] = icmp eq i64 [[IV]], [[LEN26]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY26]], label [[EXIT]], label [[CONT26:%.*]]
+; CHECK:       cont26:
+; CHECK-NEXT:    [[EARLY27:%.*]] = icmp eq i64 [[IV]], [[LEN27]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY27]], label [[EXIT]], label [[CONT27:%.*]]
+; CHECK:       cont27:
+; CHECK-NEXT:    [[EARLY28:%.*]] = icmp eq i64 [[IV]], [[LEN28]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY28]], label [[EXIT]], label [[CONT28:%.*]]
+; CHECK:       cont28:
+; CHECK-NEXT:    [[EARLY29:%.*]] = icmp eq i64 [[IV]], [[LEN29]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY29]], label [[EXIT]], label [[CONT29:%.*]]
+; CHECK:       cont29:
+; CHECK-NEXT:    [[EARLY30:%.*]] = icmp eq i64 [[IV]], [[LEN30]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY30]], label [[EXIT]], label [[CONT30:%.*]]
+; CHECK:       cont30:
+; CHECK-NEXT:    [[EARLY31:%.*]] = icmp eq i64 [[IV]], [[LEN31]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY31]], label [[EXIT]], label [[CONT31:%.*]]
+; CHECK:       cont31:
+; CHECK-NEXT:    [[EARLY32:%.*]] = icmp eq i64 [[IV]], [[LEN32]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY32]], label [[EXIT]], label [[CONT32:%.*]]
+; CHECK:       cont32:
+; CHECK-NEXT:    [[EARLY33:%.*]] = icmp eq i64 [[IV]], [[LEN33]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY33]], label [[EXIT]], label [[CONT33:%.*]]
+; CHECK:       cont33:
+; CHECK-NEXT:    [[EARLY34:%.*]] = icmp eq i64 [[IV]], [[LEN34]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY34]], label [[EXIT]], label [[CONT34:%.*]]
+; CHECK:       cont34:
+; CHECK-NEXT:    [[EARLY35:%.*]] = icmp eq i64 [[IV]], [[LEN35]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY35]], label [[EXIT]], label [[CONT35:%.*]]
+; CHECK:       cont35:
+; CHECK-NEXT:    [[EARLY36:%.*]] = icmp eq i64 [[IV]], [[LEN36]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY36]], label [[EXIT]], label [[CONT36:%.*]]
+; CHECK:       cont36:
+; CHECK-NEXT:    [[EARLY37:%.*]] = icmp eq i64 [[IV]], [[LEN37]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY37]], label [[EXIT]], label [[CONT37:%.*]]
+; CHECK:       cont37:
+; CHECK-NEXT:    [[EARLY38:%.*]] = icmp eq i64 [[IV]], [[LEN38]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY38]], label [[EXIT]], label [[CONT38:%.*]]
+; CHECK:       cont38:
+; CHECK-NEXT:    [[EARLY39:%.*]] = icmp eq i64 [[IV]], [[LEN39]]
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    br i1 [[EARLY39]], label [[EXIT]], label [[CONT39:%.*]]
+; CHECK:       cont39:
+; CHECK-NEXT:    br label [[BACKEDGE]]
+; CHECK:       backedge:
+; CHECK-NEXT:    call void @side_effect()
+; CHECK-NEXT:    [[CMP2:%.*]] = icmp samesign ult i64 [[IV]], 999
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    br i1 [[CMP2]], label [[LOOP]], label [[EXIT]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret void
+;
 entry:
   br label %loop
 loop:
@@ -457,7 +671,7 @@ define i32 @exit_cond_depends_on_inner_loop() {
 ; CHECK-NEXT:    br i1 [[OUTER_COND_1]], label [[EXIT:%.*]], label [[OUTER_LATCH]]
 ; CHECK:       outer.latch:
 ; CHECK-NEXT:    [[IV_OUTER_NEXT]] = add nuw nsw i32 [[IV_OUTER]], 1
-; CHECK-NEXT:    [[OUTER_COND_2:%.*]] = icmp ult i32 [[IV_OUTER]], 100
+; CHECK-NEXT:    [[OUTER_COND_2:%.*]] = icmp samesign ult i32 [[IV_OUTER]], 100
 ; CHECK-NEXT:    br i1 [[OUTER_COND_2]], label [[OUTER_HEADER]], label [[EXIT]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    [[X_RES:%.*]] = phi i32 [ [[X_LCSSA]], [[OUTER_EXITING_1]] ], [ -1, [[OUTER_LATCH]] ]
diff --git a/llvm/test/Transforms/IndVarSimplify/eliminate-sat.ll b/llvm/test/Transforms/IndVarSimplify/eliminate-sat.ll
index 9fcfc7c9b349a..dc0e49efb091f 100644
--- a/llvm/test/Transforms/IndVarSimplify/eliminate-sat.ll
+++ b/llvm/test/Transforms/IndVarSimplify/eliminate-sat.ll
@@ -13,7 +13,7 @@ define void @uadd_sat(ptr %p) {
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[I_INC:%.*]], [[LOOP]] ]
 ; CHECK-NEXT:    [[SAT1:%.*]] = add nuw nsw i32 [[I]], 1
-; CHECK-NEXT:    store volatile i32 [[SAT1]], ptr [[P:%.*]]
+; CHECK-NEXT:    store volatile i32 [[SAT1]], ptr [[P:%.*]], align 4
 ; CHECK-NEXT:    [[I_INC]] = add nuw nsw i32 [[I]], 1
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ne i32 [[I_INC]], 100
 ; CHECK-NEXT:    br i1 [[CMP]], label [[LOOP]], label [[END:%.*]]
@@ -42,7 +42,7 @@ define void @sadd_sat(ptr %p) {
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[I_INC:%.*]], [[LOOP]] ]
 ; CHECK-NEXT:    [[SAT1:%.*]] = add nuw nsw i32 [[I]], 1
-; CHECK-NEXT:    store volatile i32 [[SAT1]], ptr [[P:%.*]]
+; CHECK-NEXT:    store volatile i32 [[SAT1]], ptr [[P:%.*]], align 4
 ; CHECK-NEXT:    [[I_INC]] = add nuw nsw i32 [[I]], 1
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ne i32 [[I_INC]], 100
 ; CHECK-NEXT:    br i1 [[CMP]], label [[LOOP]], label [[END:%.*]]
@@ -71,7 +71,7 @@ define void @usub_sat(ptr %p) {
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[I:%.*]] = phi i32 [ 1, [[ENTRY:%.*]] ], [ [[I_INC:%.*]], [[LOOP]] ]
 ; CHECK-NEXT:    [[SAT1:%.*]] = sub nuw nsw i32 [[I]], 1
-; CHECK-NEXT:    store volatile i32 [[SAT1]], ptr [[P:%.*]]
+; CHECK-NEXT:    store volatile i32 [[SAT1]], ptr [[P:%.*]], align 4
 ; CHECK-NEXT:    [[I_INC]] = add nuw nsw i32 [[I]], 1
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ne i32 [[I_INC]], 100
 ; CHECK-NEXT:    br i1 [[CMP]], label [[LOOP]], label [[END:%.*]]
@@ -100,7 +100,7 @@ define void @ssub_sat(ptr %p) {
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[I_INC:%.*]], [[LOOP]] ]
 ; CHECK-NEXT:    [[SAT1:%.*]] = sub nsw i32 [[I]], 1
-; CHECK-NEXT:    store volatile i32 [[SAT1]], ptr [[P:%.*]]
+; CHECK-NEXT:    store volatile i32 [[SAT1]], ptr [[P:%.*]], align 4
 ; CHECK-NEXT:    [[I_INC]] = add nuw nsw i32 [[I]], 1
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ne i32 [[I_INC]], 100
 ; CHECK-NEXT:    br i1 [[CMP]], label [[LOOP]], label [[END:%.*]]
diff --git a/llvm/test/Transforms/IndVarSimplify/exit_value_tests.ll b/llvm/test/Transforms/IndVarSimplify/exit_value_tests.ll
index 66a4cbbb23b01..52be86c9b0988 100644
--- a/llvm/test/Transforms/IndVarSimplify/exit_value_tests.ll
+++ b/llvm/test/Transforms/IndVarSimplify/exit_value_tests.ll
@@ -201,7 +201,7 @@ define i32 @neg_unroll_phi_select_constant_nonzero(i32 %arg) {
 ; CHECK-NEXT:    [[SELECTOR:%.*]] = phi i32 [ [[ARG:%.*]], [[ENTRY]] ], [ [[F:%.*]], [[LOOP]] ]
 ; CHECK-NEXT:    [[F]] = call i32 @f()
 ; CHECK-NEXT:    [[I_NEXT]] = add nuw nsw i32 [[I]], 1
-; CHECK-NEXT:    [[C:%.*]] = icmp ult i32 [[I]], 4
+; CHECK-NEXT:    [[C:%.*]] = icmp samesign ult i32 [[I]], 4
 ; CHECK-NEXT:    br i1 [[C]], label [[LOOP]], label [[LOOPEXIT:%.*]]
 ; CHECK:       loopexit:
 ; CHECK-NEXT:    [[SELECTOR_LCSSA:%.*]] = phi i32 [ [[SELECTOR]], [[LOOP]] ]
diff --git a/llvm/test/Transforms/IndVarSimplify/floating-point-small-iv.ll b/llvm/test/Transforms/IndVarSimplify/floating-point-small-iv.ll
index d2c7cc4128306..07c9d353d9753 100644
--- a/llvm/test/Transforms/IndVarSimplify/floating-point-small-iv.ll
+++ b/llvm/test/Transforms/IndVarSimplify/floating-point-small-iv.ll
@@ -13,7 +13,7 @@ define void @sitofp_fptosi_range() {
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [16777219 x i32], ptr @array, i64 0, i64 [[IDXPROM]]
 ; CHECK-NEXT:    store i32 [[IV_INT]], ptr [[ARRAYIDX]], align 4
 ; CHECK-NEXT:    [[DEC_INT]] = add nsw i32 [[IV_INT]], -1
-; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i32 [[DEC_INT]], 0
+; CHECK-NEXT:    [[CMP:%.*]] = icmp samesign ugt i32 [[DEC_INT]], 0
 ; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_BODY]], label [[CLEANUP:%.*]]
 ; CHECK:       cleanup:
 ; CHECK-NEXT:    ret void
@@ -49,7 +49,7 @@ define void @sitofp_fptosi_range_overflow() {
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [16777219 x i32], ptr @array, i64 0, i64 [[IDXPROM]]
 ; CHECK-NEXT:    store i32 [[CONV]], ptr [[ARRAYIDX]], align 4
 ; CHECK-NEXT:    [[DEC_INT]] = add nsw i32 [[IV_INT]], -1
-; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i32 [[DEC_INT]], 0
+; CHECK-NEXT:    [[CMP:%.*]] = icmp samesign ugt i32 [[DEC_INT]], 0
 ; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_BODY]], label [[CLEANUP:%.*]]
 ; CHECK:       cleanup:
 ; CHECK-NEXT:    ret void
@@ -84,7 +84,7 @@ define void @sitofp_fptosi_range_trunc() {
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [16777219 x i32], ptr @array, i64 0, i64 [[IV_INT]]
 ; CHECK-NEXT:    store i32 [[IV_INT_TRUNC]], ptr [[ARRAYIDX]], align 4
 ; CHECK-NEXT:    [[DEC_INT]] = add nsw i64 [[IV_INT]], -1
-; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[DEC_INT]], 0
+; CHECK-NEXT:    [[CMP:%.*]] = icmp samesign ugt i64 [[DEC_INT]], 0
 ; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_BODY]], label [[CLEANUP:%.*]]
 ; CHECK:       cleanup:
 ; CHECK-NEXT:    ret void
@@ -155,7 +155,7 @@ define void @sitofp_fptoui_range_zext() {
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [16777219 x i32], ptr @array, i64 0, i64 [[IV_INT_ZEXT]]
 ; CHECK-NEXT:    store i32 [[IV_INT_ZEXT1]], ptr [[ARRAYIDX]], align 4
 ; CHECK-NEXT:    [[DEC_INT]] = add nsw i16 [[IV_INT]], -1
-; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i16 [[DEC_INT]], 0
+; CHECK-NEXT:    [[CMP:%.*]] = icmp samesign ugt i16 [[DEC_INT]], 0
 ; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_BODY]], label [[CLEANUP:%.*]]
 ; CHECK:       cleanup:
 ; CHECK-NEXT:    ret void
@@ -191,7 +191,7 @@ define void @sitofp_fptoui_range_zext_postinc() {
 ; CHECK-NEXT:    [[INC_INT_ZEXT:%.*]] = zext i16 [[INC_INT]] to i64
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [16777219 x i32], ptr @array, i64 0, i64 [[INC_INT_ZEXT]]
 ; CHECK-NEXT:    store i32 [[INC_INT_ZEXT1]], ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i16 [[INC_INT]], 200
+; CHECK-NEXT:    [[CMP:%.*]] = icmp samesign ult i16 [[INC_INT]], 200
 ; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_BODY]], label [[CLEANUP:%.*]]
 ; CHECK:       cleanup:
 ; CHECK-NEXT:    ret void
@@ -227,7 +227,7 @@ define void @uitofp_fptosi_range_zext() {
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [16777219 x i32], ptr @array, i64 0, i64 [[IV_INT_ZEXT]]
 ; CHECK-NEXT:    store i32 [[IV_INT_ZEXT1]], ptr [[ARRAYIDX]], align 4
 ; CHECK-NEXT:    [[INC_INT]] = add nuw nsw i16 [[IV_INT]], 2
-; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i16 [[INC_INT]], 200
+; CHECK-NEXT:    [[CMP:%.*]] = icmp samesign ult i16 [[INC_INT]], 200
 ; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_BODY]], label [[CLEANUP:%.*]]
 ; CHECK:       cleanup:
 ; CHECK-NEXT:    ret void
@@ -329,7 +329,7 @@ define void @uitofp_fptoui_range () {
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [16777219 x i32], ptr @array, i64 0, i64 [[IDXPROM]]
 ; CHECK-NEXT:    store i32 [[IV_INT]], ptr [[ARRAYIDX]], align 4
 ; CHECK-NEXT:    [[DEC_INT]] = add nsw i32 [[IV_INT]], -1
-; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i32 [[DEC_INT]], 3
+; CHECK-NEXT:    [[CMP:%.*]] = icmp samesign ugt i32 [[DEC_INT]], 3
 ; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_BODY]], label [[CLEANUP:%.*]]
 ; CHECK:       cleanup:
 ; CHECK-NEXT:    ret void
@@ -390,7 +390,7 @@ define void @uitofp_fptosi_range () {
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [16777219 x i32], ptr @array, i64 0, i64 [[IDXPROM]]
 ; CHECK-NEXT:    store i32 [[IV_INT]], ptr [[ARRAYIDX]], align 4
 ; CHECK-NEXT:    [[DEC_INT]] = add nsw i32 [[IV_INT]], -1
-; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i32 [[DEC_INT]], 3
+; CHECK-NEXT:    [[CMP:%.*]] = icmp samesign ugt i32 [[DEC_INT]], 3
 ; CHECK-NEXT:    br i1 [[CMP]], label [[FOR_BODY]], label [[CLEANUP:%.*]]
 ; CHECK:       cleanup:
 ; CHECK-NEXT:    ret void
diff --git a/llvm/test/Transforms/IndVarSimplify/invalidate-modified-lcssa-phi.ll b/llvm/test/Transforms/IndVarSimplify/invalidate-modified-lcssa-phi.ll
index 0538c1c64de34..72c292a5f2bcf 100644
--- a/llvm/test/Transforms/IndVarSimplify/invalidate-modified-lcssa-phi.ll
+++ b/llvm/test/Transforms/IndVarSimplify/invalidate-modified-lcssa-phi.ll
@@ -133,7 +133,7 @@ define i16 @test_pr58515_invalidate_loop_disposition(ptr %a) {
 ; CHECK-NEXT:    [[SUM:%.*]] = phi i16 [ 0, [[ENTRY]] ], [ [[SUM_NEXT:%.*]], [[LOOP]] ]
 ; CHECK-NEXT:    [[SUM_NEXT]] = add i16 [[SEL]], [[SUM]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i16 [[IV]], 1
-; CHECK-NEXT:    [[C_2:%.*]] = icmp ult i16 [[IV]], 9
+; CHECK-NEXT:    [[C_2:%.*]] = icmp samesign ult i16 [[IV]], 9
 ; CHECK-NEXT:    br i1 [[C_2]], label [[LOOP]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    [[LCSSA:%.*]] = phi i16 [ [[SUM_NEXT]], [[LOOP]] ]
diff --git a/llvm/test/Transforms/IndVarSimplify/loop-predication.ll b/llvm/test/Transforms/IndVarSimplify/loop-predication.ll
index 3246220da87b1..8ccd227bed4ff 100644
--- a/llvm/test/Transforms/IndVarSimplify/loop-predication.ll
+++ b/llvm/test/Transforms/IndVarSimplify/loop-predication.ll
@@ -659,7 +659,7 @@ define i32 @different_ivs(ptr %array, i32 %length, i32 %n) {
 ; CHECK-NEXT:    [[ARRAY_I:%.*]] = load i32, ptr [[ARRAY_I_PTR]], align 4
 ; CHECK-NEXT:    [[LOOP_ACC_NEXT]] = add i32 [[LOOP_ACC]], [[ARRAY_I]]
 ; CHECK-NEXT:    [[I_NEXT]] = add nuw nsw i64 [[I]], 1
-; CHECK-NEXT:    [[CONTINUE:%.*]] = icmp ult i64 [[I_NEXT]], [[N64]]
+; CHECK-NEXT:    [[CONTINUE:%.*]] = icmp samesign ult i64 [[I_NEXT]], [[N64]]
 ; CHECK-NEXT:    br i1 [[CONTINUE]], label [[LOOP]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    [[RESULT:%.*]] = phi i32 [ [[LOOP_ACC_NEXT]], [[GUARDED]] ]
@@ -722,7 +722,7 @@ define i32 @different_ivs2(ptr %array, i32 %length, i32 %n) {
 ; CHECK-NEXT:    [[LOOP_ACC_NEXT]] = add i32 [[LOOP_ACC]], [[ARRAY_I]]
 ; CHECK-NEXT:    [[I_NEXT]] = add nuw nsw i64 [[I]], 1
 ; CHECK-NEXT:    [[J_NEXT]] = sub nuw i32 [[J]], 1
-; CHECK-NEXT:    [[CONTINUE:%.*]] = icmp ult i64 [[I_NEXT]], [[N64]]
+; CHECK-NEXT:    [[CONTINUE:%.*]] = icmp samesign ult i64 [[I_NEXT]], [[N64]]
 ; CHECK-NEXT:    br i1 [[CONTINUE]], label [[LOOP]], label [[EXIT_LOOPEXIT:%.*]]
 ; CHECK:       exit.loopexit:
 ; CHECK-NEXT:    [[LOOP_ACC_NEXT_LCSSA:%.*]] = phi i32 [ [[LOOP_ACC_NEXT]], [[GUARDED]] ]
diff --git a/llvm/test/Transforms/IndVarSimplify/monotonic_checks.ll b/llvm/test/Transforms/IndVarSimplify/monotonic_checks.ll
index a1c07b0a24638..1f8bf5fecb248 100644
--- a/llvm/test/Transforms/IndVarSimplify/monotonic_checks.ll
+++ b/llvm/test/Transforms/IndVarSimplify/monotonic_checks.ll
@@ -6,7 +6,7 @@
 define i32 @test_01(ptr %p) {
 ; CHECK-LABEL: @test_01(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[LEN:%.*]] = load i32, ptr [[P:%.*]], align 4, [[RNG0:!range !.*]]
+; CHECK-NEXT:    [[LEN:%.*]] = load i32, ptr [[P:%.*]], align 4, !range [[RNG0:![0-9]+]]
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[LEN]], [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
@@ -45,7 +45,7 @@ exit:
 define i32 @test_01_neg(ptr %p) {
 ; CHECK-LABEL: @test_01_neg(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[LEN:%.*]] = load i32, ptr [[P:%.*]], align 4, [[RNG0]]
+; CHECK-NEXT:    [[LEN:%.*]] = load i32, ptr [[P:%.*]], align 4, !range [[RNG0]]
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[LEN]], [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
@@ -86,7 +86,7 @@ exit:
 define i32 @test_02(ptr %p) {
 ; CHECK-LABEL: @test_02(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[LEN:%.*]] = load i32, ptr [[P:%.*]], align 4, [[RNG1:!range !.*]]
+; CHECK-NEXT:    [[LEN:%.*]] = load i32, ptr [[P:%.*]], align 4, !range [[RNG1:![0-9]+]]
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[LEN]], [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
@@ -125,7 +125,7 @@ exit:
 define i32 @test_02_neg(ptr %p) {
 ; CHECK-LABEL: @test_02_neg(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[LEN:%.*]] = load i32, ptr [[P:%.*]], align 4, [[RNG1]]
+; CHECK-NEXT:    [[LEN:%.*]] = load i32, ptr [[P:%.*]], align 4, !range [[RNG1]]
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[LEN]], [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
@@ -164,7 +164,7 @@ exit:
 define i32 @test_03(ptr %p) {
 ; CHECK-LABEL: @test_03(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[LEN:%.*]] = load i32, ptr [[P:%.*]], align 4, [[RNG2:!range !.*]]
+; CHECK-NEXT:    [[LEN:%.*]] = load i32, ptr [[P:%.*]], align 4, !range [[RNG2:![0-9]+]]
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[LEN]], [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
@@ -202,7 +202,7 @@ exit:
 define i32 @test_04(ptr %p) {
 ; CHECK-LABEL: @test_04(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[LEN:%.*]] = load i32, ptr [[P:%.*]], align 4, [[RNG2]]
+; CHECK-NEXT:    [[LEN:%.*]] = load i32, ptr [[P:%.*]], align 4, !range [[RNG2]]
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[LEN]], [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
diff --git a/llvm/test/Transforms/IndVarSimplify/negative_ranges.ll b/llvm/test/Transforms/IndVarSimplify/negative_ranges.ll
index b7c7457ff9c6b..4acc5b04bc29a 100644
--- a/llvm/test/Transforms/IndVarSimplify/negative_ranges.ll
+++ b/llvm/test/Transforms/IndVarSimplify/negative_ranges.ll
@@ -11,7 +11,7 @@ define i32 @test_01(ptr %p, ptr %s) {
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[START]], [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
-; CHECK-NEXT:    [[C1:%.*]] = icmp slt i32 [[IV]], [[END]]
+; CHECK-NEXT:    [[C1:%.*]] = icmp samesign ult i32 [[IV]], [[END]]
 ; CHECK-NEXT:    br i1 [[C1]], label [[GUARDED:%.*]], label [[SIDE_EXIT:%.*]]
 ; CHECK:       guarded:
 ; CHECK-NEXT:    br i1 true, label [[BACKEDGE]], label [[SIDE_EXIT]]
@@ -58,7 +58,7 @@ define i32 @test_02(ptr %p, ptr %s) {
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[START]], [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
-; CHECK-NEXT:    [[C1:%.*]] = icmp ult i32 [[IV]], [[END]]
+; CHECK-NEXT:    [[C1:%.*]] = icmp samesign ult i32 [[IV]], [[END]]
 ; CHECK-NEXT:    br i1 [[C1]], label [[GUARDED:%.*]], label [[SIDE_EXIT:%.*]]
 ; CHECK:       guarded:
 ; CHECK-NEXT:    br i1 true, label [[BACKEDGE]], label [[SIDE_EXIT]]
diff --git a/llvm/test/Transforms/IndVarSimplify/post-inc-range.ll b/llvm/test/Transforms/IndVarSimplify/post-inc-range.ll
index bbdee0267effb..6d0451a5a6493 100644
--- a/llvm/test/Transforms/IndVarSimplify/post-inc-range.ll
+++ b/llvm/test/Transforms/IndVarSimplify/post-inc-range.ll
@@ -121,7 +121,7 @@ define void @test_range_metadata(ptr %array_length_ptr, ptr %base,
 ; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_INC:%.*]] ], [ [[TMP0]], [[FOR_BODY_LR_PH:%.*]] ]
 ; CHECK-NEXT:    [[ARRAY_LENGTH:%.*]] = load i32, ptr [[ARRAY_LENGTH_PTR:%.*]], align 4, !range [[RNG0:![0-9]+]]
 ; CHECK-NEXT:    [[TMP2:%.*]] = zext i32 [[ARRAY_LENGTH]] to i64
-; CHECK-NEXT:    [[WITHIN_LIMITS:%.*]] = icmp ult i64 [[INDVARS_IV]], [[TMP2]]
+; CHECK-NEXT:    [[WITHIN_LIMITS:%.*]] = icmp samesign ult i64 [[INDVARS_IV]], [[TMP2]]
 ; CHECK-NEXT:    br i1 [[WITHIN_LIMITS]], label [[CONTINUE:%.*]], label [[FOR_END:%.*]]
 ; CHECK:       continue:
 ; CHECK-NEXT:    br label [[FOR_INC]]
@@ -174,7 +174,7 @@ define void @test_neg(ptr %array_length_ptr, ptr %base,
 ; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_INC:%.*]] ], [ [[TMP0]], [[FOR_BODY_LR_PH:%.*]] ]
 ; CHECK-NEXT:    [[ARRAY_LENGTH:%.*]] = load i32, ptr [[ARRAY_LENGTH_PTR:%.*]], align 4
 ; CHECK-NEXT:    [[TMP1:%.*]] = zext i32 [[ARRAY_LENGTH]] to i64
-; CHECK-NEXT:    [[WITHIN_LIMITS:%.*]] = icmp ult i64 [[INDVARS_IV]], [[TMP1]]
+; CHECK-NEXT:    [[WITHIN_LIMITS:%.*]] = icmp samesign ult i64 [[INDVARS_IV]], [[TMP1]]
 ; CHECK-NEXT:    br i1 [[WITHIN_LIMITS]], label [[CONTINUE:%.*]], label [[FOR_END:%.*]]
 ; CHECK:       continue:
 ; CHECK-NEXT:    br label [[FOR_INC]]
@@ -232,7 +232,7 @@ define void @test_transitive_use(ptr %base, i32 %limit, i32 %start) {
 ; CHECK-NEXT:    br i1 [[EXITCOND]], label [[CONTINUE:%.*]], label [[FOR_END:%.*]]
 ; CHECK:       continue:
 ; CHECK-NEXT:    [[TMP3:%.*]] = mul nuw nsw i64 [[INDVARS_IV]], 3
-; CHECK-NEXT:    [[MUL_WITHIN:%.*]] = icmp ult i64 [[TMP3]], 64
+; CHECK-NEXT:    [[MUL_WITHIN:%.*]] = icmp samesign ult i64 [[TMP3]], 64
 ; CHECK-NEXT:    br i1 [[MUL_WITHIN]], label [[GUARDED:%.*]], label [[CONTINUE_2:%.*]]
 ; CHECK:       guarded:
 ; CHECK-NEXT:    [[TMP4:%.*]] = add nuw nsw i64 [[TMP3]], 1
@@ -297,7 +297,7 @@ define void @test_guard_one_bb(ptr %base, i32 %limit, i32 %start) {
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.body:
 ; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[TMP0]], [[FOR_BODY_LR_PH:%.*]] ]
-; CHECK-NEXT:    [[WITHIN_LIMITS:%.*]] = icmp ult i64 [[INDVARS_IV]], 64
+; CHECK-NEXT:    [[WITHIN_LIMITS:%.*]] = icmp samesign ult i64 [[INDVARS_IV]], 64
 ; CHECK-NEXT:    call void (i1, ...) @llvm.experimental.guard(i1 [[WITHIN_LIMITS]]) [ "deopt"() ]
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp slt i64 [[INDVARS_IV_NEXT]], [[TMP1]]
@@ -337,7 +337,7 @@ define void @test_guard_in_the_same_bb(ptr %base, i32 %limit, i32 %start) {
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.body:
 ; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_INC:%.*]] ], [ [[TMP0]], [[FOR_BODY_LR_PH:%.*]] ]
-; CHECK-NEXT:    [[WITHIN_LIMITS:%.*]] = icmp ult i64 [[INDVARS_IV]], 64
+; CHECK-NEXT:    [[WITHIN_LIMITS:%.*]] = icmp samesign ult i64 [[INDVARS_IV]], 64
 ; CHECK-NEXT:    br label [[FOR_INC]]
 ; CHECK:       for.inc:
 ; CHECK-NEXT:    call void (i1, ...) @llvm.experimental.guard(i1 [[WITHIN_LIMITS]]) [ "deopt"() ]
@@ -382,7 +382,7 @@ define void @test_guard_in_idom(ptr %base, i32 %limit, i32 %start) {
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.body:
 ; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_INC:%.*]] ], [ [[TMP0]], [[FOR_BODY_LR_PH:%.*]] ]
-; CHECK-NEXT:    [[WITHIN_LIMITS:%.*]] = icmp ult i64 [[INDVARS_IV]], 64
+; CHECK-NEXT:    [[WITHIN_LIMITS:%.*]] = icmp samesign ult i64 [[INDVARS_IV]], 64
 ; CHECK-NEXT:    call void (i1, ...) @llvm.experimental.guard(i1 [[WITHIN_LIMITS]]) [ "deopt"() ]
 ; CHECK-NEXT:    br label [[FOR_INC]]
 ; CHECK:       for.inc:
@@ -427,9 +427,9 @@ define void @test_guard_merge_ranges(ptr %base, i32 %limit, i32 %start) {
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
 ; CHECK:       for.body:
 ; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[TMP0]], [[FOR_BODY_LR_PH:%.*]] ]
-; CHECK-NEXT:    [[WITHIN_LIMITS_1:%.*]] = icmp ult i64 [[INDVARS_IV]], 64
+; CHECK-NEXT:    [[WITHIN_LIMITS_1:%.*]] = icmp samesign ult i64 [[INDVARS_IV]], 64
 ; CHECK-NEXT:    call void (i1, ...) @llvm.experimental.guard(i1 [[WITHIN_LIMITS_1]]) [ "deopt"() ]
-; CHECK-NEXT:    [[WITHIN_LIMITS_2:%.*]] = icmp ult i64 [[INDVARS_IV]], 2147483647
+; CHECK-NEXT:    [[WITHIN_LIMITS_2:%.*]] = icmp samesign ult i64 [[INDVARS_IV]], 2147483647
 ; CHECK-NEXT:    call void (i1, ...) @llvm.experimental.guard(i1 [[WITHIN_LIMITS_2]]) [ "deopt"() ]
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp slt i64 [[INDVARS_IV_NEXT]], [[TMP1]]
diff --git a/llvm/test/Transforms/IndVarSimplify/pr38674.ll b/llvm/test/Transforms/IndVarSimplify/pr38674.ll
index e701c4df82072..3b8197a0ffd9e 100644
--- a/llvm/test/Transforms/IndVarSimplify/pr38674.ll
+++ b/llvm/test/Transforms/IndVarSimplify/pr38674.ll
@@ -81,7 +81,7 @@ define i32 @test_02(i32 %x) {
 ; CHECK-NEXT:    [[ZEXT:%.*]] = mul i32 [[X:%.*]], 1
 ; CHECK-NEXT:    br label [[FOR_BODY6:%.*]]
 ; CHECK:       for.cond4:
-; CHECK-NEXT:    [[CMP5:%.*]] = icmp ult i32 [[INC:%.*]], 2
+; CHECK-NEXT:    [[CMP5:%.*]] = icmp samesign ult i32 [[INC:%.*]], 2
 ; CHECK-NEXT:    br i1 [[CMP5]], label [[FOR_BODY6]], label [[FOR_END:%.*]]
 ; CHECK:       for.body6:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ 0, [[FOR_COND4_PREHEADER]] ], [ [[INC]], [[FOR_COND4:%.*]] ]
diff --git a/llvm/test/Transforms/IndVarSimplify/pr39673.ll b/llvm/test/Transforms/IndVarSimplify/pr39673.ll
index 7b093b34b91ad..27ada6b9bde81 100644
--- a/llvm/test/Transforms/IndVarSimplify/pr39673.ll
+++ b/llvm/test/Transforms/IndVarSimplify/pr39673.ll
@@ -8,7 +8,7 @@ define i16 @constant() {
 ; CHECK:       loop1:
 ; CHECK-NEXT:    [[L1:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[L1_ADD:%.*]], [[LOOP1]] ]
 ; CHECK-NEXT:    [[L1_ADD]] = add nuw nsw i16 [[L1]], 1
-; CHECK-NEXT:    [[CMP1:%.*]] = icmp ult i16 [[L1_ADD]], 2
+; CHECK-NEXT:    [[CMP1:%.*]] = icmp samesign ult i16 [[L1_ADD]], 2
 ; CHECK-NEXT:    br i1 [[CMP1]], label [[LOOP1]], label [[LOOP2_PREHEADER:%.*]]
 ; CHECK:       loop2.preheader:
 ; CHECK-NEXT:    br label [[LOOP2:%.*]]
@@ -18,7 +18,7 @@ define i16 @constant() {
 ; CHECK-NEXT:    [[L2_ADD]] = add nuw nsw i16 [[L2]], 1
 ; CHECK-NEXT:    tail call void @foo(i16 [[K2]])
 ; CHECK-NEXT:    [[K2_ADD]] = add nuw nsw i16 [[K2]], 1
-; CHECK-NEXT:    [[CMP2:%.*]] = icmp ult i16 [[L2_ADD]], 2
+; CHECK-NEXT:    [[CMP2:%.*]] = icmp samesign ult i16 [[L2_ADD]], 2
 ; CHECK-NEXT:    br i1 [[CMP2]], label [[LOOP2]], label [[LOOP2_END:%.*]]
 ; CHECK:       loop2.end:
 ; CHECK-NEXT:    ret i16 184
@@ -59,7 +59,7 @@ define i16 @dom_argument(i16 %arg1, i16 %arg2) {
 ; CHECK:       loop1:
 ; CHECK-NEXT:    [[L1:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[L1_ADD:%.*]], [[LOOP1]] ]
 ; CHECK-NEXT:    [[L1_ADD]] = add nuw nsw i16 [[L1]], 1
-; CHECK-NEXT:    [[CMP1:%.*]] = icmp ult i16 [[L1_ADD]], 2
+; CHECK-NEXT:    [[CMP1:%.*]] = icmp samesign ult i16 [[L1_ADD]], 2
 ; CHECK-NEXT:    br i1 [[CMP1]], label [[LOOP1]], label [[LOOP2_PREHEADER:%.*]]
 ; CHECK:       loop2.preheader:
 ; CHECK-NEXT:    br label [[LOOP2:%.*]]
@@ -69,7 +69,7 @@ define i16 @dom_argument(i16 %arg1, i16 %arg2) {
 ; CHECK-NEXT:    [[L2_ADD]] = add nuw nsw i16 [[L2]], 1
 ; CHECK-NEXT:    tail call void @foo(i16 [[K2]])
 ; CHECK-NEXT:    [[K2_ADD]] = add nuw nsw i16 [[K2]], 1
-; CHECK-NEXT:    [[CMP2:%.*]] = icmp ult i16 [[L2_ADD]], 2
+; CHECK-NEXT:    [[CMP2:%.*]] = icmp samesign ult i16 [[L2_ADD]], 2
 ; CHECK-NEXT:    br i1 [[CMP2]], label [[LOOP2]], label [[LOOP2_END:%.*]]
 ; CHECK:       loop2.end:
 ; CHECK-NEXT:    [[K2_ADD_LCSSA:%.*]] = phi i16 [ [[K2_ADD]], [[LOOP2]] ]
@@ -118,7 +118,7 @@ define i16 @dummy_phi_outside_loop(i16 %arg) {
 ; CHECK-NEXT:    [[L2_ADD]] = add nuw nsw i16 [[L2]], 1
 ; CHECK-NEXT:    tail call void @foo(i16 [[K2]])
 ; CHECK-NEXT:    [[K2_ADD]] = add nuw nsw i16 [[K2]], 1
-; CHECK-NEXT:    [[CMP2:%.*]] = icmp ult i16 [[L2_ADD]], 2
+; CHECK-NEXT:    [[CMP2:%.*]] = icmp samesign ult i16 [[L2_ADD]], 2
 ; CHECK-NEXT:    br i1 [[CMP2]], label [[LOOP2]], label [[LOOP2_END:%.*]]
 ; CHECK:       loop2.end:
 ; CHECK-NEXT:    [[K2_ADD_LCSSA:%.*]] = phi i16 [ [[K2_ADD]], [[LOOP2]] ]
@@ -152,7 +152,7 @@ define i16 @neg_loop_carried(i16 %arg) {
 ; CHECK:       loop1:
 ; CHECK-NEXT:    [[L1:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[L1_ADD:%.*]], [[LOOP1]] ]
 ; CHECK-NEXT:    [[L1_ADD]] = add nuw nsw i16 [[L1]], 1
-; CHECK-NEXT:    [[CMP1:%.*]] = icmp ult i16 [[L1_ADD]], 2
+; CHECK-NEXT:    [[CMP1:%.*]] = icmp samesign ult i16 [[L1_ADD]], 2
 ; CHECK-NEXT:    br i1 [[CMP1]], label [[LOOP1]], label [[LOOP2_PREHEADER:%.*]]
 ; CHECK:       loop2.preheader:
 ; CHECK-NEXT:    [[TMP0:%.*]] = add i16 [[ARG:%.*]], 2
@@ -163,7 +163,7 @@ define i16 @neg_loop_carried(i16 %arg) {
 ; CHECK-NEXT:    [[L2_ADD]] = add nuw nsw i16 [[L2]], 1
 ; CHECK-NEXT:    tail call void @foo(i16 [[K2]])
 ; CHECK-NEXT:    [[K2_ADD]] = add nuw nsw i16 [[K2]], 1
-; CHECK-NEXT:    [[CMP2:%.*]] = icmp ult i16 [[L2_ADD]], 2
+; CHECK-NEXT:    [[CMP2:%.*]] = icmp samesign ult i16 [[L2_ADD]], 2
 ; CHECK-NEXT:    br i1 [[CMP2]], label [[LOOP2]], label [[LOOP2_END:%.*]]
 ; CHECK:       loop2.end:
 ; CHECK-NEXT:    [[K2_ADD_LCSSA:%.*]] = phi i16 [ [[K2_ADD]], [[LOOP2]] ]
diff --git a/llvm/test/Transforms/IndVarSimplify/pr56242.ll b/llvm/test/Transforms/IndVarSimplify/pr56242.ll
index a52b683ba510d..22e4467297f5a 100644
--- a/llvm/test/Transforms/IndVarSimplify/pr56242.ll
+++ b/llvm/test/Transforms/IndVarSimplify/pr56242.ll
@@ -20,7 +20,7 @@ define void @test(ptr %arr) {
 ; CHECK-NEXT:    br label [[LOOP_LATCH]]
 ; CHECK:       loop.latch:
 ; CHECK-NEXT:    [[IV_INC]] = add nuw nsw i64 [[IV]], 1
-; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i64 [[IV_INC]], 16
+; CHECK-NEXT:    [[CMP:%.*]] = icmp samesign ult i64 [[IV_INC]], 16
 ; CHECK-NEXT:    br i1 [[CMP]], label [[LOOP_HEADER]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
diff --git a/llvm/test/Transforms/IndVarSimplify/pr57247.ll b/llvm/test/Transforms/IndVarSimplify/pr57247.ll
index 867856a0f48c6..c7bc9977ef9fd 100644
--- a/llvm/test/Transforms/IndVarSimplify/pr57247.ll
+++ b/llvm/test/Transforms/IndVarSimplify/pr57247.ll
@@ -15,7 +15,7 @@ define i32 @test_01() {
 ; CHECK-NEXT:    br i1 [[CHECK_1]], label [[INNER_LATCH]], label [[EXIT:%.*]]
 ; CHECK:       inner.latch:
 ; CHECK-NEXT:    [[ADD_I]] = add nuw nsw i64 [[STOREMERGE611_I]], 1
-; CHECK-NEXT:    [[CMP5_I:%.*]] = icmp ult i64 [[STOREMERGE611_I]], 11
+; CHECK-NEXT:    [[CMP5_I:%.*]] = icmp samesign ult i64 [[STOREMERGE611_I]], 11
 ; CHECK-NEXT:    br i1 [[CMP5_I]], label [[INNER_LOOP]], label [[OUTER_LATCH]]
 ; CHECK:       outer.latch:
 ; CHECK-NEXT:    [[IV_NEXT]] = add nsw i32 [[IV]], -1
@@ -55,14 +55,14 @@ define i32 @test_02() {
 ; CHECK-NEXT:    br label [[OUTER_LOOP:%.*]]
 ; CHECK:       outer.loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[OUTER_LATCH:%.*]] ]
-; CHECK-NEXT:    [[CHECK_1:%.*]] = icmp ult i32 [[IV]], 2147483640
+; CHECK-NEXT:    [[CHECK_1:%.*]] = icmp samesign ult i32 [[IV]], 2147483640
 ; CHECK-NEXT:    br label [[INNER_LOOP:%.*]]
 ; CHECK:       inner.loop:
 ; CHECK-NEXT:    [[STOREMERGE611_I:%.*]] = phi i64 [ 0, [[OUTER_LOOP]] ], [ [[ADD_I:%.*]], [[INNER_LATCH:%.*]] ]
 ; CHECK-NEXT:    br i1 [[CHECK_1]], label [[INNER_LATCH]], label [[EXIT:%.*]]
 ; CHECK:       inner.latch:
 ; CHECK-NEXT:    [[ADD_I]] = add nuw nsw i64 [[STOREMERGE611_I]], 1
-; CHECK-NEXT:    [[CMP5_I:%.*]] = icmp ult i64 [[STOREMERGE611_I]], 11
+; CHECK-NEXT:    [[CMP5_I:%.*]] = icmp samesign ult i64 [[STOREMERGE611_I]], 11
 ; CHECK-NEXT:    br i1 [[CMP5_I]], label [[INNER_LOOP]], label [[OUTER_LATCH]]
 ; CHECK:       outer.latch:
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw i32 [[IV]], 10
@@ -109,7 +109,7 @@ define i32 @test_03() {
 ; CHECK-NEXT:    br i1 [[CHECK_1]], label [[INNER_LATCH]], label [[EXIT:%.*]]
 ; CHECK:       inner.latch:
 ; CHECK-NEXT:    [[ADD_I]] = add nuw nsw i64 [[STOREMERGE611_I]], 1
-; CHECK-NEXT:    [[CMP5_I:%.*]] = icmp ult i64 [[STOREMERGE611_I]], 11
+; CHECK-NEXT:    [[CMP5_I:%.*]] = icmp samesign ult i64 [[STOREMERGE611_I]], 11
 ; CHECK-NEXT:    br i1 [[CMP5_I]], label [[INNER_LOOP]], label [[OUTER_LATCH]]
 ; CHECK:       outer.latch:
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw i32 [[IV]], 10
diff --git a/llvm/test/Transforms/IndVarSimplify/pr62992.ll b/llvm/test/Transforms/IndVarSimplify/pr62992.ll
index c8f47b57f1eda..afc2c003ee987 100644
--- a/llvm/test/Transforms/IndVarSimplify/pr62992.ll
+++ b/llvm/test/Transforms/IndVarSimplify/pr62992.ll
@@ -14,7 +14,7 @@ define i32 @test(i32 %arg) {
 ; CHECK-NEXT:    br i1 false, label [[IF:%.*]], label [[LOOP_LATCH:%.*]]
 ; CHECK:       if:
 ; CHECK-NEXT:    [[DIV:%.*]] = udiv i32 7, [[ARG]]
-; CHECK-NEXT:    [[CMP2:%.*]] = icmp ult i32 1, [[DIV]]
+; CHECK-NEXT:    [[CMP2:%.*]] = icmp samesign ult i32 1, [[DIV]]
 ; CHECK-NEXT:    call void @use(i1 [[CMP2]])
 ; CHECK-NEXT:    br label [[LOOP_LATCH]]
 ; CHECK:       loop.latch:
diff --git a/llvm/test/Transforms/IndVarSimplify/sharpen-range.ll b/llvm/test/Transforms/IndVarSimplify/sharpen-range.ll
index 4dd4e9831c966..e29e29cf40e34 100644
--- a/llvm/test/Transforms/IndVarSimplify/sharpen-range.ll
+++ b/llvm/test/Transforms/IndVarSimplify/sharpen-range.ll
@@ -87,7 +87,7 @@ loop.begin:
 ; CHECK: loop.begin:
   %i.01 = phi i64 [ 2, %entry ], [ %add, %loop.end ]
   %cmp = icmp ugt i64 %i.01, 1
-; CHECK: %cmp = icmp ugt i64 %i.01, 1
+; CHECK: %cmp = icmp samesign ugt i64 %i.01, 1
   br i1 %cmp, label %loop, label %loop.end
 
 loop:
diff --git a/llvm/test/Transforms/IndVarSimplify/shift-range-checks.ll b/llvm/test/Transforms/IndVarSimplify/shift-range-checks.ll
index 1334d671d5a69..703199f7daa32 100644
--- a/llvm/test/Transforms/IndVarSimplify/shift-range-checks.ll
+++ b/llvm/test/Transforms/IndVarSimplify/shift-range-checks.ll
@@ -124,7 +124,7 @@ define void @test_03(ptr %p, i32 %shift) {
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
-; CHECK-NEXT:    [[LESS_THAN_SHIFTED:%.*]] = icmp ult i32 [[IV]], [[X_SHIFTED]]
+; CHECK-NEXT:    [[LESS_THAN_SHIFTED:%.*]] = icmp samesign ult i32 [[IV]], [[X_SHIFTED]]
 ; CHECK-NEXT:    br i1 [[LESS_THAN_SHIFTED]], label [[GUARDED:%.*]], label [[FAILURE:%.*]]
 ; CHECK:       guarded:
 ; CHECK-NEXT:    br i1 true, label [[BACKEDGE]], label [[NEVER_HAPPENS:%.*]]
@@ -180,7 +180,7 @@ define void @test_04(ptr %p, i32 %shift) {
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
-; CHECK-NEXT:    [[LESS_THAN_SHIFTED:%.*]] = icmp ugt i32 [[X_SHIFTED]], [[IV]]
+; CHECK-NEXT:    [[LESS_THAN_SHIFTED:%.*]] = icmp samesign ugt i32 [[X_SHIFTED]], [[IV]]
 ; CHECK-NEXT:    br i1 [[LESS_THAN_SHIFTED]], label [[GUARDED:%.*]], label [[FAILURE:%.*]]
 ; CHECK:       guarded:
 ; CHECK-NEXT:    br i1 true, label [[BACKEDGE]], label [[NEVER_HAPPENS:%.*]]
diff --git a/llvm/test/Transforms/IndVarSimplify/simplify-pointer-arithmetic.ll b/llvm/test/Transforms/IndVarSimplify/simplify-pointer-arithmetic.ll
index 7c3562943e16e..0c33b1138041e 100644
--- a/llvm/test/Transforms/IndVarSimplify/simplify-pointer-arithmetic.ll
+++ b/llvm/test/Transforms/IndVarSimplify/simplify-pointer-arithmetic.ll
@@ -22,7 +22,7 @@ define i1 @can_simplify_ult_i32_ptr_len_zext(ptr %p.base, i32 %len) {
 ; CHECK-NEXT:    [[P:%.*]] = phi ptr [ [[P_INC:%.*]], [[LATCH:%.*]] ], [ [[P_BASE]], [[HEADER_PREHEADER]] ]
 ; CHECK-NEXT:    [[I:%.*]] = phi i64 [ [[I_INC:%.*]], [[LATCH]] ], [ 0, [[HEADER_PREHEADER]] ]
 ; CHECK-NEXT:    [[I_INC]] = add nuw nsw i64 [[I]], 1
-; CHECK-NEXT:    [[I_ULT_EXT:%.*]] = icmp ult i64 [[I]], [[EXT]]
+; CHECK-NEXT:    [[I_ULT_EXT:%.*]] = icmp samesign ult i64 [[I]], [[EXT]]
 ; CHECK-NEXT:    br i1 [[I_ULT_EXT]], label [[LATCH]], label [[TRAP_LOOPEXIT:%.*]]
 ; CHECK:       latch:
 ; CHECK-NEXT:    [[P_INC]] = getelementptr inbounds i32, ptr [[P]], i64 1
@@ -128,7 +128,7 @@ define i1 @cannot_simplify_ult_i32_ptr_len_zext(ptr %p.base, i32 %len) {
 ; CHECK-NEXT:    [[P:%.*]] = phi ptr [ [[P_INC:%.*]], [[LATCH:%.*]] ], [ [[P_BASE]], [[HEADER_PREHEADER]] ]
 ; CHECK-NEXT:    [[I:%.*]] = phi i64 [ [[I_INC:%.*]], [[LATCH]] ], [ 1, [[HEADER_PREHEADER]] ]
 ; CHECK-NEXT:    [[I_INC]] = add nuw nsw i64 [[I]], 1
-; CHECK-NEXT:    [[I_ULT_EXT:%.*]] = icmp ult i64 [[I]], [[EXT]]
+; CHECK-NEXT:    [[I_ULT_EXT:%.*]] = icmp samesign ult i64 [[I]], [[EXT]]
 ; CHECK-NEXT:    br i1 [[I_ULT_EXT]], label [[LATCH]], label [[TRAP_LOOPEXIT:%.*]]
 ; CHECK:       latch:
 ; CHECK-NEXT:    [[P_INC]] = getelementptr inbounds i32, ptr [[P]], i64 1
@@ -181,7 +181,7 @@ define i1 @can_simplify_ule_i32_ptr_len_zext(ptr %p.base, i32 %len) {
 ; CHECK-NEXT:    [[P:%.*]] = phi ptr [ [[P_INC:%.*]], [[LATCH:%.*]] ], [ [[P_BASE]], [[HEADER_PREHEADER]] ]
 ; CHECK-NEXT:    [[I:%.*]] = phi i64 [ [[I_INC:%.*]], [[LATCH]] ], [ 1, [[HEADER_PREHEADER]] ]
 ; CHECK-NEXT:    [[I_INC]] = add nuw nsw i64 [[I]], 1
-; CHECK-NEXT:    [[I_ULT_EXT:%.*]] = icmp ule i64 [[I]], [[EXT]]
+; CHECK-NEXT:    [[I_ULT_EXT:%.*]] = icmp samesign ule i64 [[I]], [[EXT]]
 ; CHECK-NEXT:    br i1 [[I_ULT_EXT]], label [[LATCH]], label [[TRAP_LOOPEXIT:%.*]]
 ; CHECK:       latch:
 ; CHECK-NEXT:    [[P_INC]] = getelementptr inbounds i32, ptr [[P]], i64 1
@@ -236,7 +236,7 @@ define i1 @can_simplify_uge_i32_ptr_len_zext(ptr %p.base, i32 %len) {
 ; CHECK-NEXT:    [[P:%.*]] = phi ptr [ [[P_INC:%.*]], [[LATCH:%.*]] ], [ [[P_BASE]], [[HEADER_PREHEADER]] ]
 ; CHECK-NEXT:    [[I:%.*]] = phi i64 [ [[I_INC:%.*]], [[LATCH]] ], [ 0, [[HEADER_PREHEADER]] ]
 ; CHECK-NEXT:    [[I_INC]] = add nuw nsw i64 [[I]], 1
-; CHECK-NEXT:    [[I_UGE_EXT:%.*]] = icmp uge i64 [[I]], [[EXT]]
+; CHECK-NEXT:    [[I_UGE_EXT:%.*]] = icmp samesign uge i64 [[I]], [[EXT]]
 ; CHECK-NEXT:    br i1 [[I_UGE_EXT]], label [[TRAP_LOOPEXIT:%.*]], label [[LATCH]]
 ; CHECK:       latch:
 ; CHECK-NEXT:    [[P_INC]] = getelementptr inbounds i32, ptr [[P]], i64 1
@@ -340,7 +340,7 @@ define i1 @cannot_simplify_uge_i32_ptr_len_zext_step_2(ptr %p.base, i32 %len) {
 ; CHECK-NEXT:    [[P:%.*]] = phi ptr [ [[P_INC:%.*]], [[LATCH:%.*]] ], [ [[P_BASE]], [[HEADER_PREHEADER]] ]
 ; CHECK-NEXT:    [[I:%.*]] = phi i64 [ [[I_INC:%.*]], [[LATCH]] ], [ 0, [[HEADER_PREHEADER]] ]
 ; CHECK-NEXT:    [[I_INC]] = add nuw nsw i64 [[I]], 2
-; CHECK-NEXT:    [[I_UGE_EXT:%.*]] = icmp uge i64 [[I]], [[EXT]]
+; CHECK-NEXT:    [[I_UGE_EXT:%.*]] = icmp samesign uge i64 [[I]], [[EXT]]
 ; CHECK-NEXT:    br i1 [[I_UGE_EXT]], label [[TRAP_LOOPEXIT:%.*]], label [[LATCH]]
 ; CHECK:       latch:
 ; CHECK-NEXT:    [[P_INC]] = getelementptr inbounds i32, ptr [[P]], i64 1
diff --git a/llvm/test/Transforms/IndVarSimplify/skip-predication-convergence.ll b/llvm/test/Transforms/IndVarSimplify/skip-predication-convergence.ll
index 59b84a3c082c2..e08307ff27f36 100644
--- a/llvm/test/Transforms/IndVarSimplify/skip-predication-convergence.ll
+++ b/llvm/test/Transforms/IndVarSimplify/skip-predication-convergence.ll
@@ -13,7 +13,7 @@ define void @loop(i32 %tid, ptr %array) #0 {
 ; CHECK:       for.cond.i:
 ; CHECK-NEXT:    [[I_0_I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INC_I:%.*]], [[FOR_BODY_I:%.*]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token [[TMP0]]) ]
-; CHECK-NEXT:    [[CMP_I:%.*]] = icmp ult i32 [[I_0_I]], 8
+; CHECK-NEXT:    [[CMP_I:%.*]] = icmp samesign ult i32 [[I_0_I]], 8
 ; CHECK-NEXT:    br i1 [[CMP_I]], label [[FOR_BODY_I]], label [[EXIT_LOOPEXIT:%.*]]
 ; CHECK:       for.body.i:
 ; CHECK-NEXT:    [[CMP1_I:%.*]] = icmp eq i32 [[I_0_I]], [[TID:%.*]]
diff --git a/llvm/test/Transforms/IndVarSimplify/skip-predication-nested-convergence.ll b/llvm/test/Transforms/IndVarSimplify/skip-predication-nested-convergence.ll
index 0944205839aca..4d630a9bbb501 100644
--- a/llvm/test/Transforms/IndVarSimplify/skip-predication-nested-convergence.ll
+++ b/llvm/test/Transforms/IndVarSimplify/skip-predication-nested-convergence.ll
@@ -15,7 +15,7 @@ define void @nested(i32 %tidx, i32 %tidy, ptr %array) #0 {
 ; CHECK:       for.cond.i:
 ; CHECK-NEXT:    [[I_0_I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INC10_I:%.*]], [[CLEANUP_I:%.*]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token [[TMP0]]) ]
-; CHECK-NEXT:    [[CMP_I:%.*]] = icmp ult i32 [[I_0_I]], 8
+; CHECK-NEXT:    [[CMP_I:%.*]] = icmp samesign ult i32 [[I_0_I]], 8
 ; CHECK-NEXT:    br i1 [[CMP_I]], label [[FOR_COND1_I_PREHEADER:%.*]], label [[EXIT:%.*]]
 ; CHECK:       for.cond1.i.preheader:
 ; CHECK-NEXT:    [[CMP5_I:%.*]] = icmp eq i32 [[I_0_I]], [[TIDX]]
@@ -23,7 +23,7 @@ define void @nested(i32 %tidx, i32 %tidy, ptr %array) #0 {
 ; CHECK:       for.cond1.i:
 ; CHECK-NEXT:    [[J_0_I:%.*]] = phi i32 [ [[INC_I:%.*]], [[FOR_BODY4_I:%.*]] ], [ 0, [[FOR_COND1_I_PREHEADER]] ]
 ; CHECK-NEXT:    [[TMP2:%.*]] = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token [[TMP1]]) ]
-; CHECK-NEXT:    [[CMP2_I:%.*]] = icmp ult i32 [[J_0_I]], 8
+; CHECK-NEXT:    [[CMP2_I:%.*]] = icmp samesign ult i32 [[J_0_I]], 8
 ; CHECK-NEXT:    br i1 [[CMP2_I]], label [[FOR_BODY4_I]], label [[CLEANUP_I_LOOPEXIT:%.*]]
 ; CHECK:       for.body4.i:
 ; CHECK-NEXT:    [[CMP6_I:%.*]] = icmp eq i32 [[J_0_I]], [[TIDY]]
diff --git a/llvm/test/Transforms/IndVarSimplify/turn-to-invariant.ll b/llvm/test/Transforms/IndVarSimplify/turn-to-invariant.ll
index 326ee75e135b0..d3a5d4c443cb9 100644
--- a/llvm/test/Transforms/IndVarSimplify/turn-to-invariant.ll
+++ b/llvm/test/Transforms/IndVarSimplify/turn-to-invariant.ll
@@ -852,7 +852,7 @@ define i32 @test_litter_conditions_constant(i32 %start, i32 %len) {
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[START]], [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[BACKEDGE:%.*]] ]
 ; CHECK-NEXT:    [[CANONICAL_IV:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[CANONICAL_IV_NEXT:%.*]], [[BACKEDGE]] ]
-; CHECK-NEXT:    [[CONSTANT_CHECK:%.*]] = icmp ult i32 [[CANONICAL_IV]], 65635
+; CHECK-NEXT:    [[CONSTANT_CHECK:%.*]] = icmp samesign ult i32 [[CANONICAL_IV]], 65635
 ; CHECK-NEXT:    br i1 [[CONSTANT_CHECK]], label [[CONSTANT_CHECK_PASSED:%.*]], label [[CONSTANT_CHECK_FAILED:%.*]]
 ; CHECK:       constant_check_passed:
 ; CHECK-NEXT:    [[ZERO_CHECK:%.*]] = icmp ne i32 [[IV]], 0
diff --git a/llvm/test/Transforms/IndVarSimplify/widen-nonnegative-countdown.ll b/llvm/test/Transforms/IndVarSimplify/widen-nonnegative-countdown.ll
index 9c8983421029f..d24442287cf4d 100644
--- a/llvm/test/Transforms/IndVarSimplify/widen-nonnegative-countdown.ll
+++ b/llvm/test/Transforms/IndVarSimplify/widen-nonnegative-countdown.ll
@@ -20,7 +20,7 @@ define void @zext_postinc_constant_start(ptr %A) {
 ; CHECK-NEXT:    [[ARRAYIDX_US:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[INDVARS_IV]]
 ; CHECK-NEXT:    tail call void @use_ptr(ptr [[ARRAYIDX_US]])
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
-; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp ugt i64 [[INDVARS_IV_NEXT]], 6
+; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp samesign ugt i64 [[INDVARS_IV_NEXT]], 6
 ; CHECK-NEXT:    br i1 [[CMP2_US]], label [[FOR_BODY]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
@@ -51,7 +51,7 @@ define void @zext_preinc_constant_start(ptr %A) {
 ; CHECK-NEXT:    [[ARRAYIDX_US:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[INDVARS_IV]]
 ; CHECK-NEXT:    tail call void @use_ptr(ptr [[ARRAYIDX_US]])
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
-; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp ugt i64 [[INDVARS_IV]], 6
+; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp samesign ugt i64 [[INDVARS_IV]], 6
 ; CHECK-NEXT:    br i1 [[CMP2_US]], label [[FOR_BODY]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
@@ -158,7 +158,7 @@ define void @sext_postinc_constant_start(ptr %A) {
 ; CHECK-NEXT:    [[ARRAYIDX_US:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[INDVARS_IV]]
 ; CHECK-NEXT:    tail call void @use_ptr(ptr [[ARRAYIDX_US]])
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
-; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp ugt i64 [[INDVARS_IV_NEXT]], 6
+; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp samesign ugt i64 [[INDVARS_IV_NEXT]], 6
 ; CHECK-NEXT:    br i1 [[CMP2_US]], label [[FOR_BODY]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
@@ -189,7 +189,7 @@ define void @sext_preinc_constant_start(ptr %A) {
 ; CHECK-NEXT:    [[ARRAYIDX_US:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[INDVARS_IV]]
 ; CHECK-NEXT:    tail call void @use_ptr(ptr [[ARRAYIDX_US]])
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
-; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp ugt i64 [[INDVARS_IV]], 6
+; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp samesign ugt i64 [[INDVARS_IV]], 6
 ; CHECK-NEXT:    br i1 [[CMP2_US]], label [[FOR_BODY]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
@@ -300,7 +300,7 @@ define void @zext_postinc_constant_start_offset_constant_one(ptr %A) {
 ; CHECK-NEXT:    [[ARRAYIDX_US:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[TMP0]]
 ; CHECK-NEXT:    tail call void @use_ptr(ptr [[ARRAYIDX_US]])
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
-; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp ugt i64 [[INDVARS_IV_NEXT]], 6
+; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp samesign ugt i64 [[INDVARS_IV_NEXT]], 6
 ; CHECK-NEXT:    br i1 [[CMP2_US]], label [[FOR_BODY]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
@@ -333,7 +333,7 @@ define void @zext_preinc_constant_start_offset_constant_one(ptr %A) {
 ; CHECK-NEXT:    [[ARRAYIDX_US:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[TMP0]]
 ; CHECK-NEXT:    tail call void @use_ptr(ptr [[ARRAYIDX_US]])
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
-; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp ugt i64 [[INDVARS_IV]], 6
+; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp samesign ugt i64 [[INDVARS_IV]], 6
 ; CHECK-NEXT:    br i1 [[CMP2_US]], label [[FOR_BODY]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
@@ -448,7 +448,7 @@ define void @sext_postinc_constant_start_offset_constant_one(ptr %A) {
 ; CHECK-NEXT:    [[ARRAYIDX_US:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[TMP0]]
 ; CHECK-NEXT:    tail call void @use_ptr(ptr [[ARRAYIDX_US]])
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
-; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp ugt i64 [[INDVARS_IV_NEXT]], 6
+; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp samesign ugt i64 [[INDVARS_IV_NEXT]], 6
 ; CHECK-NEXT:    br i1 [[CMP2_US]], label [[FOR_BODY]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
@@ -481,7 +481,7 @@ define void @sext_preinc_constant_start_offset_constant_one(ptr %A) {
 ; CHECK-NEXT:    [[ARRAYIDX_US:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[TMP0]]
 ; CHECK-NEXT:    tail call void @use_ptr(ptr [[ARRAYIDX_US]])
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
-; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp ugt i64 [[INDVARS_IV]], 6
+; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp samesign ugt i64 [[INDVARS_IV]], 6
 ; CHECK-NEXT:    br i1 [[CMP2_US]], label [[FOR_BODY]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
@@ -600,7 +600,7 @@ define void @zext_postinc_constant_start_offset_constant_minus_one(ptr %A) {
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
 ; CHECK-NEXT:    [[ARRAYIDX_US:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[INDVARS_IV_NEXT]]
 ; CHECK-NEXT:    tail call void @use_ptr(ptr [[ARRAYIDX_US]])
-; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp ugt i64 [[INDVARS_IV_NEXT]], 6
+; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp samesign ugt i64 [[INDVARS_IV_NEXT]], 6
 ; CHECK-NEXT:    br i1 [[CMP2_US]], label [[FOR_BODY]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
@@ -632,7 +632,7 @@ define void @zext_preinc_constant_start_offset_constant_minus_one(ptr %A) {
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
 ; CHECK-NEXT:    [[ARRAYIDX_US:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[INDVARS_IV_NEXT]]
 ; CHECK-NEXT:    tail call void @use_ptr(ptr [[ARRAYIDX_US]])
-; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp ugt i64 [[INDVARS_IV]], 6
+; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp samesign ugt i64 [[INDVARS_IV]], 6
 ; CHECK-NEXT:    br i1 [[CMP2_US]], label [[FOR_BODY]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
@@ -744,7 +744,7 @@ define void @sext_postinc_constant_start_offset_constant_minus_one(ptr %A) {
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
 ; CHECK-NEXT:    [[ARRAYIDX_US:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[INDVARS_IV_NEXT]]
 ; CHECK-NEXT:    tail call void @use_ptr(ptr [[ARRAYIDX_US]])
-; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp ugt i64 [[INDVARS_IV_NEXT]], 6
+; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp samesign ugt i64 [[INDVARS_IV_NEXT]], 6
 ; CHECK-NEXT:    br i1 [[CMP2_US]], label [[FOR_BODY]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
@@ -776,7 +776,7 @@ define void @sext_preinc_constant_start_offset_constant_minus_one(ptr %A) {
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
 ; CHECK-NEXT:    [[ARRAYIDX_US:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[INDVARS_IV_NEXT]]
 ; CHECK-NEXT:    tail call void @use_ptr(ptr [[ARRAYIDX_US]])
-; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp ugt i64 [[INDVARS_IV]], 6
+; CHECK-NEXT:    [[CMP2_US:%.*]] = icmp samesign ugt i64 [[INDVARS_IV]], 6
 ; CHECK-NEXT:    br i1 [[CMP2_US]], label [[FOR_BODY]], label [[EXIT:%.*]]
 ; CHECK:       exit:
 ; CHECK-NEXT:    ret void
diff --git a/llvm/test/Transforms/InstCombine/known-bits.ll b/llvm/test/Transforms/InstCombine/known-bits.ll
index f8c97d86a9230..da2123a5dfe74 100644
--- a/llvm/test/Transforms/InstCombine/known-bits.ll
+++ b/llvm/test/Transforms/InstCombine/known-bits.ll
@@ -2425,6 +2425,23 @@ exit:
   ret i8 %or2
 }
 
+define <vscale x 4 x i32> @scalable_add_to_disjoint_or(i8 %x, <vscale x 4 x i32> range(i32 0, 256) %rhs) {
+; CHECK-LABEL: @scalable_add_to_disjoint_or(
+; CHECK-NEXT:    [[EXTX:%.*]] = zext i8 [[X:%.*]] to i32
+; CHECK-NEXT:    [[SHIFT:%.*]] = shl nuw nsw i32 [[EXTX]], 8
+; CHECK-NEXT:    [[INSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[SHIFT]], i64 0
+; CHECK-NEXT:    [[SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[INSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    [[ADD:%.*]] = or disjoint <vscale x 4 x i32> [[SPLAT]], [[RHS:%.*]]
+; CHECK-NEXT:    ret <vscale x 4 x i32> [[ADD]]
+;
+  %extx = zext i8 %x to i32
+  %shift = shl nuw nsw i32 %extx, 8
+  %insert = insertelement <vscale x 4 x i32> poison, i32 %shift, i32 0
+  %splat = shufflevector <vscale x 4 x i32> %insert, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+  %add = add <vscale x 4 x i32> %splat, %rhs
+  ret <vscale x 4 x i32> %add
+}
+
 declare void @dummy()
 declare void @use(i1)
 declare void @sink(i8)
diff --git a/llvm/test/Transforms/InstCombine/saturating-add-sub.ll b/llvm/test/Transforms/InstCombine/saturating-add-sub.ll
index c0ad5818e448a..1294f867f07c0 100644
--- a/llvm/test/Transforms/InstCombine/saturating-add-sub.ll
+++ b/llvm/test/Transforms/InstCombine/saturating-add-sub.ll
@@ -2671,3 +2671,19 @@ define i8 @neg_neg_constant(i8 %x, i8 %y) {
   %s = select i1 %cmp, i8 127, i8 %d
   ret i8 %s
 }
+
+; Make sure we don't crash in this case.
+define i32 @pr153053_strict_pred_with_nonconstant_rhs(i32 %x, i32 %y) {
+; CHECK-LABEL: @pr153053_strict_pred_with_nonconstant_rhs(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[CMP:%.*]] = icmp slt i32 [[X:%.*]], [[Y:%.*]]
+; CHECK-NEXT:    [[ADD:%.*]] = add i32 [[X]], 1
+; CHECK-NEXT:    [[RES:%.*]] = select i1 [[CMP]], i32 [[ADD]], i32 2147483647
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  %cmp = icmp slt i32 %x, %y
+  %add = add i32 %x, 1
+  %res = select i1 %cmp, i32 %add, i32 2147483647
+  ret i32 %res
+}
diff --git a/llvm/test/Transforms/InstCombine/simplify-demanded-fpclass.ll b/llvm/test/Transforms/InstCombine/simplify-demanded-fpclass.ll
index df60078dbf452..a7ff967d3123b 100644
--- a/llvm/test/Transforms/InstCombine/simplify-demanded-fpclass.ll
+++ b/llvm/test/Transforms/InstCombine/simplify-demanded-fpclass.ll
@@ -10,6 +10,8 @@ declare float @llvm.trunc.f32(float)
 declare float @llvm.arithmetic.fence.f32(float)
 declare float @llvm.minnum.f32(float, float)
 declare float @llvm.maxnum.f32(float, float)
+declare float @llvm.minimumnum.f32(float, float)
+declare float @llvm.maximumnum.f32(float, float)
 
 
 define float @ninf_user_select_inf(i1 %cond, float %x, float %y) {
@@ -1314,7 +1316,7 @@ define nofpclass(pinf) float @ret_nofpclass_pinf__minnum_ninf(i1 %cond, float %x
 ; CHECK-SAME: (i1 [[COND:%.*]], float [[X:%.*]]) {
 ; CHECK-NEXT:    ret float 0xFFF0000000000000
 ;
-  %min = call float @llvm.minnum.f32(float %x, float 0xFFF0000000000000)
+  %min = call float @llvm.minimumnum.f32(float %x, float 0xFFF0000000000000)
   ret float %min
 }
 
@@ -1335,6 +1337,6 @@ define nofpclass(ninf) float @ret_nofpclass_ninf__maxnum_pinf(i1 %cond, float %x
 ; CHECK-SAME: (i1 [[COND:%.*]], float [[X:%.*]]) {
 ; CHECK-NEXT:    ret float 0x7FF0000000000000
 ;
-  %max = call float @llvm.maxnum.f32(float %x, float 0x7FF0000000000000)
+  %max = call float @llvm.maximumnum.f32(float %x, float 0x7FF0000000000000)
   ret float %max
 }
diff --git a/llvm/test/Transforms/InstSimplify/ConstProp/min-max.ll b/llvm/test/Transforms/InstSimplify/ConstProp/min-max.ll
index a633d29179896..84bec15d6ed32 100644
--- a/llvm/test/Transforms/InstSimplify/ConstProp/min-max.ll
+++ b/llvm/test/Transforms/InstSimplify/ConstProp/min-max.ll
@@ -97,7 +97,8 @@ define float @minnum_float_qnan_p0() {
 
 define float @minnum_float_p0_snan() {
 ; CHECK-LABEL: @minnum_float_p0_snan(
-; CHECK-NEXT:    ret float 0x7FFC000000000000
+; CHECK-NEXT:    [[MIN:%.*]] = call float @llvm.minnum.f32(float 0.000000e+00, float 0x7FF4000000000000)
+; CHECK-NEXT:    ret float [[MIN]]
 ;
   %min = call float @llvm.minnum.f32(float 0.0, float 0x7FF4000000000000)
   ret float %min
@@ -105,7 +106,8 @@ define float @minnum_float_p0_snan() {
 
 define float @minnum_float_snan_p0() {
 ; CHECK-LABEL: @minnum_float_snan_p0(
-; CHECK-NEXT:    ret float 0x7FFC000000000000
+; CHECK-NEXT:    [[MIN:%.*]] = call float @llvm.minnum.f32(float 0x7FF4000000000000, float 0.000000e+00)
+; CHECK-NEXT:    ret float [[MIN]]
 ;
   %min = call float @llvm.minnum.f32(float 0x7FF4000000000000, float 0.0)
   ret float %min
@@ -205,7 +207,8 @@ define float @maxnum_float_qnan_p0() {
 
 define float @maxnum_float_p0_snan() {
 ; CHECK-LABEL: @maxnum_float_p0_snan(
-; CHECK-NEXT:    ret float 0x7FFC000000000000
+; CHECK-NEXT:    [[MAX:%.*]] = call float @llvm.maxnum.f32(float 0.000000e+00, float 0x7FF4000000000000)
+; CHECK-NEXT:    ret float [[MAX]]
 ;
   %max = call float @llvm.maxnum.f32(float 0.0, float 0x7FF4000000000000)
   ret float %max
@@ -213,7 +216,8 @@ define float @maxnum_float_p0_snan() {
 
 define float @maxnum_float_snan_p0() {
 ; CHECK-LABEL: @maxnum_float_snan_p0(
-; CHECK-NEXT:    ret float 0x7FFC000000000000
+; CHECK-NEXT:    [[MAX:%.*]] = call float @llvm.maxnum.f32(float 0x7FF4000000000000, float 0.000000e+00)
+; CHECK-NEXT:    ret float [[MAX]]
 ;
   %max = call float @llvm.maxnum.f32(float 0x7FF4000000000000, float 0.0)
   ret float %max
diff --git a/llvm/test/Transforms/InstSimplify/fminmax-folds.ll b/llvm/test/Transforms/InstSimplify/fminmax-folds.ll
index 091e85920c0df..7544f7190df89 100644
--- a/llvm/test/Transforms/InstSimplify/fminmax-folds.ll
+++ b/llvm/test/Transforms/InstSimplify/fminmax-folds.ll
@@ -43,11 +43,13 @@ define void @minmax_qnan_f32(float %x, ptr %minnum_res, ptr %maxnum_res, ptr %mi
 ; Note that maxnum/minnum return qnan here for snan inputs, unlike maximumnum/minimumnum
 define void @minmax_snan_f32(float %x, ptr %minnum_res, ptr %maxnum_res, ptr %minimum_res, ptr %maximum_res, ptr %minimumnum_res, ptr %maximumnum_res) {
 ; CHECK-LABEL: @minmax_snan_f32(
-; CHECK-NEXT:    store float 0x7FFC000000000000, ptr [[MINNUM_RES:%.*]], align 4
-; CHECK-NEXT:    store float 0x7FFC000000000000, ptr [[MAXNUM_RES:%.*]], align 4
+; CHECK-NEXT:    [[MINNUM:%.*]] = call float @llvm.minnum.f32(float [[X:%.*]], float 0x7FF4000000000000)
+; CHECK-NEXT:    store float [[MINNUM]], ptr [[MINNUM_RES:%.*]], align 4
+; CHECK-NEXT:    [[MAXNUM:%.*]] = call float @llvm.maxnum.f32(float [[X]], float 0x7FF4000000000000)
+; CHECK-NEXT:    store float [[MAXNUM]], ptr [[MAXNUM_RES:%.*]], align 4
 ; CHECK-NEXT:    store float 0x7FFC000000000000, ptr [[MINIMUM_RES:%.*]], align 4
 ; CHECK-NEXT:    store float 0x7FFC000000000000, ptr [[MAXIMUM_RES:%.*]], align 4
-; CHECK-NEXT:    store float [[X:%.*]], ptr [[MINIMUMNUM_RES:%.*]], align 4
+; CHECK-NEXT:    store float [[X]], ptr [[MINIMUMNUM_RES:%.*]], align 4
 ; CHECK-NEXT:    store float [[X]], ptr [[MAXIMUMNUM_RES:%.*]], align 4
 ; CHECK-NEXT:    ret void
 ;
@@ -98,11 +100,13 @@ define void @minmax_qnan_nxv2f64_op0(<vscale x 2 x double> %x, ptr %minnum_res,
 ; Note that maxnum/minnum return qnan here for snan inputs, unlike maximumnum/minimumnum
 define void @minmax_snan_nxv2f64_op1(<vscale x 2 x double> %x, ptr %minnum_res, ptr %maxnum_res, ptr %minimum_res, ptr %maximum_res, ptr %minimumnum_res, ptr %maximumnum_res) {
 ; CHECK-LABEL: @minmax_snan_nxv2f64_op1(
-; CHECK-NEXT:    store <vscale x 2 x double> splat (double 0x7FFC00DEAD00DEAD), ptr [[MINNUM_RES:%.*]], align 16
-; CHECK-NEXT:    store <vscale x 2 x double> splat (double 0x7FFC00DEAD00DEAD), ptr [[MAXNUM_RES:%.*]], align 16
+; CHECK-NEXT:    [[MINNUM:%.*]] = call <vscale x 2 x double> @llvm.minnum.nxv2f64(<vscale x 2 x double> splat (double 0x7FF400DEAD00DEAD), <vscale x 2 x double> [[X:%.*]])
+; CHECK-NEXT:    store <vscale x 2 x double> [[MINNUM]], ptr [[MINNUM_RES:%.*]], align 16
+; CHECK-NEXT:    [[MAXNUM:%.*]] = call <vscale x 2 x double> @llvm.maxnum.nxv2f64(<vscale x 2 x double> splat (double 0x7FF400DEAD00DEAD), <vscale x 2 x double> [[X]])
+; CHECK-NEXT:    store <vscale x 2 x double> [[MAXNUM]], ptr [[MAXNUM_RES:%.*]], align 16
 ; CHECK-NEXT:    store <vscale x 2 x double> splat (double 0x7FFC00DEAD00DEAD), ptr [[MINIMUM_RES:%.*]], align 16
 ; CHECK-NEXT:    store <vscale x 2 x double> splat (double 0x7FFC00DEAD00DEAD), ptr [[MAXIMUM_RES:%.*]], align 16
-; CHECK-NEXT:    store <vscale x 2 x double> [[X:%.*]], ptr [[MINIMUMNUM_RES:%.*]], align 16
+; CHECK-NEXT:    store <vscale x 2 x double> [[X]], ptr [[MINIMUMNUM_RES:%.*]], align 16
 ; CHECK-NEXT:    store <vscale x 2 x double> [[X]], ptr [[MAXIMUMNUM_RES:%.*]], align 16
 ; CHECK-NEXT:    ret void
 ;
@@ -255,7 +259,8 @@ define void @minmax_pos_inf_f32(float %x, ptr %minnum_res, ptr %maxnum_res, ptr
 ; CHECK-LABEL: @minmax_pos_inf_f32(
 ; CHECK-NEXT:    [[MINNUM:%.*]] = call float @llvm.minnum.f32(float [[X:%.*]], float 0x7FF0000000000000)
 ; CHECK-NEXT:    store float [[MINNUM]], ptr [[MINNUM_RES:%.*]], align 4
-; CHECK-NEXT:    store float 0x7FF0000000000000, ptr [[MAXNUM_RES:%.*]], align 4
+; CHECK-NEXT:    [[MAXNUM:%.*]] = call float @llvm.maxnum.f32(float [[X]], float 0x7FF0000000000000)
+; CHECK-NEXT:    store float [[MAXNUM]], ptr [[MAXNUM_RES:%.*]], align 4
 ; CHECK-NEXT:    store float [[X]], ptr [[MINIMUM_RES:%.*]], align 4
 ; CHECK-NEXT:    [[MAXIMUM:%.*]] = call float @llvm.maximum.f32(float [[X]], float 0x7FF0000000000000)
 ; CHECK-NEXT:    store float [[MAXIMUM]], ptr [[MAXIMUM_RES:%.*]], align 4
@@ -322,8 +327,9 @@ define void @minmax_pos_inf_nnan_v2f32(<2 x float> %x, ptr %minnum_res, ptr %max
 ; Can only optimize minnum, maximum, and minimumnum without the nnan flag
 define void @minmax_neg_inf_f32(float %x, ptr %minnum_res, ptr %maxnum_res, ptr %minimum_res, ptr %maximum_res, ptr %minimumnum_res, ptr %maximumnum_res) {
 ; CHECK-LABEL: @minmax_neg_inf_f32(
-; CHECK-NEXT:    store float 0xFFF0000000000000, ptr [[MINNUM_RES:%.*]], align 4
-; CHECK-NEXT:    [[MAXNUM:%.*]] = call float @llvm.maxnum.f32(float [[X:%.*]], float 0xFFF0000000000000)
+; CHECK-NEXT:    [[MINNUM:%.*]] = call float @llvm.minnum.f32(float [[X:%.*]], float 0xFFF0000000000000)
+; CHECK-NEXT:    store float [[MINNUM]], ptr [[MINNUM_RES:%.*]], align 4
+; CHECK-NEXT:    [[MAXNUM:%.*]] = call float @llvm.maxnum.f32(float [[X]], float 0xFFF0000000000000)
 ; CHECK-NEXT:    store float [[MAXNUM]], ptr [[MAXNUM_RES:%.*]], align 4
 ; CHECK-NEXT:    [[MINIMUM:%.*]] = call float @llvm.minimum.f32(float [[X]], float 0xFFF0000000000000)
 ; CHECK-NEXT:    store float [[MINIMUM]], ptr [[MINIMUM_RES:%.*]], align 4
@@ -427,7 +433,8 @@ define void @minmax_largest_f32_ninf(float %x, ptr %minnum_res, ptr %maxnum_res,
 ; CHECK-LABEL: @minmax_largest_f32_ninf(
 ; CHECK-NEXT:    [[MINNUM:%.*]] = call ninf float @llvm.minnum.f32(float [[X:%.*]], float 0x47EFFFFFE0000000)
 ; CHECK-NEXT:    store float [[MINNUM]], ptr [[MINNUM_RES:%.*]], align 4
-; CHECK-NEXT:    store float 0x47EFFFFFE0000000, ptr [[MAXNUM_RES:%.*]], align 4
+; CHECK-NEXT:    [[MAXNUM:%.*]] = call ninf float @llvm.maxnum.f32(float [[X]], float 0x47EFFFFFE0000000)
+; CHECK-NEXT:    store float [[MAXNUM]], ptr [[MAXNUM_RES:%.*]], align 4
 ; CHECK-NEXT:    store float [[X]], ptr [[MINIMUM_RES:%.*]], align 4
 ; CHECK-NEXT:    [[MAXIMUM:%.*]] = call ninf float @llvm.maximum.f32(float [[X]], float 0x47EFFFFFE0000000)
 ; CHECK-NEXT:    store float [[MAXIMUM]], ptr [[MAXIMUM_RES:%.*]], align 4
@@ -528,8 +535,9 @@ define void @minmax_neg_largest_f32(float %x, ptr %minnum_res, ptr %maxnum_res,
 ; We can optimize minnum, maximum, and minimumnum if we know ninf is set
 define void @minmax_neg_largest_f32_ninf(float %x, ptr %minnum_res, ptr %maxnum_res, ptr %minimum_res, ptr %maximum_res, ptr %minimumnum_res, ptr %maximumnum_res) {
 ; CHECK-LABEL: @minmax_neg_largest_f32_ninf(
-; CHECK-NEXT:    store float 0xC7EFFFFFE0000000, ptr [[MINNUM_RES:%.*]], align 4
-; CHECK-NEXT:    [[MAXNUM:%.*]] = call ninf float @llvm.maxnum.f32(float [[X:%.*]], float 0xC7EFFFFFE0000000)
+; CHECK-NEXT:    [[MINNUM:%.*]] = call ninf float @llvm.minnum.f32(float [[X:%.*]], float 0xC7EFFFFFE0000000)
+; CHECK-NEXT:    store float [[MINNUM]], ptr [[MINNUM_RES:%.*]], align 4
+; CHECK-NEXT:    [[MAXNUM:%.*]] = call ninf float @llvm.maxnum.f32(float [[X]], float 0xC7EFFFFFE0000000)
 ; CHECK-NEXT:    store float [[MAXNUM]], ptr [[MAXNUM_RES:%.*]], align 4
 ; CHECK-NEXT:    [[MINIMUM:%.*]] = call ninf float @llvm.minimum.f32(float [[X]], float 0xC7EFFFFFE0000000)
 ; CHECK-NEXT:    store float [[MINIMUM]], ptr [[MINIMUM_RES:%.*]], align 4
@@ -632,7 +640,8 @@ define void @minmax_mixed_pos_inf_poison_snan_v3f32(<3 x float> %x, ptr %minnum_
 ; CHECK-LABEL: @minmax_mixed_pos_inf_poison_snan_v3f32(
 ; CHECK-NEXT:    [[MINNUM:%.*]] = call nnan <3 x float> @llvm.minnum.v3f32(<3 x float> <float poison, float 0x7FF0000000000000, float 0x7FF4000000000000>, <3 x float> [[X:%.*]])
 ; CHECK-NEXT:    store <3 x float> [[MINNUM]], ptr [[MINNUM_RES:%.*]], align 16
-; CHECK-NEXT:    store <3 x float> <float poison, float 0x7FF0000000000000, float 0x7FFC000000000000>, ptr [[MAXNUM_RES:%.*]], align 16
+; CHECK-NEXT:    [[MAXNUM:%.*]] = call nnan <3 x float> @llvm.maxnum.v3f32(<3 x float> <float poison, float 0x7FF0000000000000, float 0x7FF4000000000000>, <3 x float> [[X]])
+; CHECK-NEXT:    store <3 x float> [[MAXNUM]], ptr [[MAXNUM_RES:%.*]], align 16
 ; CHECK-NEXT:    [[MINIMUM:%.*]] = call nnan <3 x float> @llvm.minimum.v3f32(<3 x float> <float poison, float 0x7FF0000000000000, float 0x7FF4000000000000>, <3 x float> [[X]])
 ; CHECK-NEXT:    store <3 x float> [[MINIMUM]], ptr [[MINIMUM_RES:%.*]], align 16
 ; CHECK-NEXT:    store <3 x float> <float poison, float 0x7FF0000000000000, float 0x7FFC000000000000>, ptr [[MAXIMUM_RES:%.*]], align 16
diff --git a/llvm/test/Transforms/LICM/lnicm.ll b/llvm/test/Transforms/LICM/lnicm.ll
index 814f964666305..e331ab7d39e83 100644
--- a/llvm/test/Transforms/LICM/lnicm.ll
+++ b/llvm/test/Transforms/LICM/lnicm.ll
@@ -3,6 +3,9 @@
 ; RUN: opt -aa-pipeline=basic-aa -passes='loop-mssa(lnicm),loop(loop-interchange)' -cache-line-size=64 -S %s | FileCheck %s --check-prefixes LNICM
 ; RUN: opt -aa-pipeline=basic-aa -passes='loop-mssa(licm),loop(loop-interchange)' -cache-line-size=64 -S %s | FileCheck %s --check-prefixes LICM
 
+; XFAIL: *
+; Loop interchange currently does not happen because dependence analysis fails.
+
 ; This test represents the following function:
 ; void test(int n, int m, int x[m][n], int y[n], int *z) {
 ;   for (int k = 0; k < n; k++) {
diff --git a/llvm/test/Transforms/LoopDistribute/laa-invalidation.ll b/llvm/test/Transforms/LoopDistribute/laa-invalidation.ll
index 62c5627ac2d38..54b29d279818a 100644
--- a/llvm/test/Transforms/LoopDistribute/laa-invalidation.ll
+++ b/llvm/test/Transforms/LoopDistribute/laa-invalidation.ll
@@ -26,7 +26,7 @@ define void @test_pr50940(ptr %A, ptr %B) {
 ; CHECK-NEXT:    store i16 0, ptr [[GEP_A_3]], align 1
 ; CHECK-NEXT:    store i16 1, ptr [[B]], align 1
 ; CHECK-NEXT:    [[IV_NEXT_LVER_ORIG]] = add nuw nsw i16 [[IV_LVER_ORIG]], 1
-; CHECK-NEXT:    [[C_1_LVER_ORIG:%.*]] = icmp ult i16 [[IV_LVER_ORIG]], 38
+; CHECK-NEXT:    [[C_1_LVER_ORIG:%.*]] = icmp samesign ult i16 [[IV_LVER_ORIG]], 38
 ; CHECK-NEXT:    br i1 [[C_1_LVER_ORIG]], label [[INNER_LVER_ORIG]], label [[EXIT_LOOPEXIT:%.*]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK:       inner.ph3.ldist1:
 ; CHECK-NEXT:    br label [[INNER_LDIST1:%.*]]
@@ -35,7 +35,7 @@ define void @test_pr50940(ptr %A, ptr %B) {
 ; CHECK-NEXT:    [[L_LDIST1:%.*]] = load <2 x i16>, ptr [[UGLYGEP]], align 1, !alias.scope [[META2:![0-9]+]], !noalias [[META5:![0-9]+]]
 ; CHECK-NEXT:    store i16 0, ptr [[GEP_A_3]], align 1, !alias.scope [[META2]], !noalias [[META5]]
 ; CHECK-NEXT:    [[IV_NEXT_LDIST1]] = add nuw nsw i16 [[IV_LDIST1]], 1
-; CHECK-NEXT:    [[C_1_LDIST1:%.*]] = icmp ult i16 [[IV_LDIST1]], 38
+; CHECK-NEXT:    [[C_1_LDIST1:%.*]] = icmp samesign ult i16 [[IV_LDIST1]], 38
 ; CHECK-NEXT:    br i1 [[C_1_LDIST1]], label [[INNER_LDIST1]], label [[INNER_PH3:%.*]]
 ; CHECK:       inner.ph3:
 ; CHECK-NEXT:    br label [[INNER:%.*]]
@@ -43,7 +43,7 @@ define void @test_pr50940(ptr %A, ptr %B) {
 ; CHECK-NEXT:    [[IV:%.*]] = phi i16 [ 0, [[INNER_PH3]] ], [ [[IV_NEXT:%.*]], [[INNER]] ]
 ; CHECK-NEXT:    store i16 1, ptr [[B]], align 1, !alias.scope [[META5]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i16 [[IV]], 1
-; CHECK-NEXT:    [[C_1:%.*]] = icmp ult i16 [[IV]], 38
+; CHECK-NEXT:    [[C_1:%.*]] = icmp samesign ult i16 [[IV]], 38
 ; CHECK-NEXT:    br i1 [[C_1]], label [[INNER]], label [[EXIT_LOOPEXIT4:%.*]]
 ; CHECK:       outer.latch:
 ; CHECK-NEXT:    br label [[OUTER_HEADER]]
diff --git a/llvm/test/Transforms/LoopStrengthReduce/AArch64/non-cmp-cond.ll b/llvm/test/Transforms/LoopStrengthReduce/AArch64/non-cmp-cond.ll
new file mode 100644
index 0000000000000..5590208865bd2
--- /dev/null
+++ b/llvm/test/Transforms/LoopStrengthReduce/AArch64/non-cmp-cond.ll
@@ -0,0 +1,205 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 6
+; RUN: opt -loop-reduce %s -S -o - | FileCheck %s
+
+target triple = "aarch64-unknown-linux-gnu"
+
+; Tests where the loop termination condition is not generated by a compare.
+
+; The call to get.active.lane.mask in the loop should use the postincrement
+; value of %iv.
+define void @lane_mask(ptr %dst, i64 %n) #0 {
+; CHECK-LABEL: define void @lane_mask(
+; CHECK-SAME: ptr [[DST:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[VSCALE:%.*]] = tail call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[VSCALEX4:%.*]] = shl i64 [[VSCALE]], 2
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[TMP1:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[TMP0:%.*]] = shl i64 [[IV]], 2
+; CHECK-NEXT:    [[SCEVGEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP0]]
+; CHECK-NEXT:    tail call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> splat (i32 1), ptr align 4 [[SCEVGEP]], <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT:    [[TMP1]] = add i64 [[IV]], [[VSCALEX4]]
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP1]], i64 [[N]])
+; CHECK-NEXT:    [[COND:%.*]] = extractelement <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+; CHECK-NEXT:    br i1 [[COND]], label %[[LOOP]], label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %vscale = tail call i64 @llvm.vscale.i64()
+  %vscalex4 = shl i64 %vscale, 2
+  %active.lane.mask.entry = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 %n)
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %active.lane.mask = phi <vscale x 4 x i1> [ %active.lane.mask.entry, %entry ], [ %active.lane.mask.next, %loop ]
+  %gep = getelementptr inbounds nuw i32, ptr %dst, i64 %iv
+  tail call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> splat (i32 1), ptr %gep, i32 4, <vscale x 4 x i1> %active.lane.mask)
+  %iv.next = add i64 %iv, %vscalex4
+  %active.lane.mask.next = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %iv.next, i64 %n)
+  %cond = extractelement <vscale x 4 x i1> %active.lane.mask.next, i64 0
+  br i1 %cond, label %loop, label %exit
+
+exit:
+  ret void
+}
+
+; The store between the call and the branch shouldn't prevent the
+; postincrement value from being used.
+define void @lane_mask_not_last(ptr %dst, i64 %n) #0 {
+; CHECK-LABEL: define void @lane_mask_not_last(
+; CHECK-SAME: ptr [[DST:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[VSCALE:%.*]] = tail call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[VSCALEX4:%.*]] = shl i64 [[VSCALE]], 2
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[TMP0:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[ENTRY]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[TMP1:%.*]] = shl i64 [[IV]], 2
+; CHECK-NEXT:    [[SCEVGEP:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP0]] = add i64 [[IV]], [[VSCALEX4]]
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[TMP0]], i64 [[N]])
+; CHECK-NEXT:    tail call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> splat (i32 1), ptr align 4 [[SCEVGEP]], <vscale x 4 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT:    [[COND:%.*]] = extractelement <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+; CHECK-NEXT:    br i1 [[COND]], label %[[LOOP]], label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %vscale = tail call i64 @llvm.vscale.i64()
+  %vscalex4 = shl i64 %vscale, 2
+  %active.lane.mask.entry = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 %n)
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %active.lane.mask = phi <vscale x 4 x i1> [ %active.lane.mask.entry, %entry ], [ %active.lane.mask.next, %loop ]
+  %gep = getelementptr inbounds nuw i32, ptr %dst, i64 %iv
+  %iv.next = add i64 %iv, %vscalex4
+  %active.lane.mask.next = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %iv.next, i64 %n)
+  tail call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> splat (i32 1), ptr %gep, i32 4, <vscale x 4 x i1> %active.lane.mask)
+  %cond = extractelement <vscale x 4 x i1> %active.lane.mask.next, i64 0
+  br i1 %cond, label %loop, label %exit
+
+exit:
+  ret void
+}
+
+; The call to cmp_fn in the loop should use the postincrement value of %iv.
+define void @uses_cmp_fn(ptr %dst, i64 %n) {
+; CHECK-LABEL: define void @uses_cmp_fn(
+; CHECK-SAME: ptr [[DST:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[LSR_IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[LSR_IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[TMP0:%.*]] = shl i64 [[LSR_IV]], 2
+; CHECK-NEXT:    [[LSR_IV1:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP0]]
+; CHECK-NEXT:    store i32 0, ptr [[LSR_IV1]], align 4
+; CHECK-NEXT:    [[LSR_IV_NEXT]] = add i64 [[LSR_IV]], 1
+; CHECK-NEXT:    [[COND:%.*]] = tail call i1 @cmp_fn(i64 [[LSR_IV_NEXT]])
+; CHECK-NEXT:    br i1 [[COND]], label %[[LOOP]], label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep = getelementptr inbounds nuw i32, ptr %dst, i64 %iv
+  store i32 0, ptr %gep, align 4
+  %iv.next = add i64 %iv, 1
+  %cond = tail call i1 @cmp_fn(i64 %iv.next)
+  br i1 %cond, label %loop, label %exit
+
+exit:
+  ret void
+}
+
+; The store between the call and the branch shouldn't prevent the
+; postincrement value from being used.
+define void @uses_cmp_fn_not_last(ptr %dst, i64 %n) {
+; CHECK-LABEL: define void @uses_cmp_fn_not_last(
+; CHECK-SAME: ptr [[DST:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[LSR_IV:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[TMP0:%.*]] = shl i64 [[IV]], 2
+; CHECK-NEXT:    [[LSR_IV1:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP0]]
+; CHECK-NEXT:    [[LSR_IV]] = add i64 [[IV]], 1
+; CHECK-NEXT:    [[COND:%.*]] = tail call i1 @cmp_fn(i64 [[LSR_IV]])
+; CHECK-NEXT:    store i32 0, ptr [[LSR_IV1]], align 4
+; CHECK-NEXT:    br i1 [[COND]], label %[[LOOP]], label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep = getelementptr inbounds nuw i32, ptr %dst, i64 %iv
+  %iv.next = add i64 %iv, 1
+  %cond = tail call i1 @cmp_fn(i64 %iv.next)
+  store i32 0, ptr %gep, align 4
+  br i1 %cond, label %loop, label %exit
+
+exit:
+  ret void
+}
+
+; cmp2 will use a preincrement induction variable as it isn't directly the loop
+; termination condition.
+; FIXME: We could potentially handle this by examining the operands of the 'and'
+; instruction.
+define void @cmp_and(ptr %dst, i64 %n) {
+; CHECK-LABEL: define void @cmp_and(
+; CHECK-SAME: ptr [[DST:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[N]], -1
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[LSR_IV1:%.*]] = phi ptr [ [[SCEVGEP:%.*]], %[[LOOP]] ], [ [[DST]], %[[ENTRY]] ]
+; CHECK-NEXT:    [[LSR_IV:%.*]] = phi i64 [ [[LSR_IV_NEXT:%.*]], %[[LOOP]] ], [ [[TMP0]], %[[ENTRY]] ]
+; CHECK-NEXT:    [[VAL:%.*]] = load i64, ptr [[LSR_IV1]], align 8
+; CHECK-NEXT:    [[CMP1:%.*]] = icmp ne i64 [[VAL]], [[N]]
+; CHECK-NEXT:    [[CMP2:%.*]] = icmp ne i64 [[LSR_IV]], 0
+; CHECK-NEXT:    [[COND:%.*]] = and i1 [[CMP1]], [[CMP2]]
+; CHECK-NEXT:    [[LSR_IV_NEXT]] = add i64 [[LSR_IV]], -1
+; CHECK-NEXT:    [[SCEVGEP]] = getelementptr i8, ptr [[LSR_IV1]], i64 4
+; CHECK-NEXT:    br i1 [[COND]], label %[[LOOP]], label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep = getelementptr inbounds nuw i32, ptr %dst, i64 %iv
+  %val = load i64, ptr %gep, align 8
+  %iv.next = add i64 %iv, 1
+  %cmp1 = icmp ne i64 %val, %n
+  %cmp2 = icmp ne i64 %iv.next, %n
+  %cond = and i1 %cmp1, %cmp2
+  br i1 %cond, label %loop, label %exit
+
+exit:
+  ret void
+}
+
+
+declare i64 @llvm.vscale.i64()
+declare <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64, i64)
+declare void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32>, ptr captures(none), i32 immarg, <vscale x 4 x i1>)
+declare i1 @cmp_fn(i64)
+
+attributes #0 = { "target-features"="+sve2" }
diff --git a/llvm/test/Transforms/LoopStrengthReduce/AArch64/prefer-all.ll b/llvm/test/Transforms/LoopStrengthReduce/AArch64/prefer-all.ll
index 1944a9c800355..5fe72ea0d4fea 100644
--- a/llvm/test/Transforms/LoopStrengthReduce/AArch64/prefer-all.ll
+++ b/llvm/test/Transforms/LoopStrengthReduce/AArch64/prefer-all.ll
@@ -230,8 +230,6 @@ exit:
 
 ; The control-flow before and after the load of qval shouldn't prevent postindex
 ; addressing from happening.
-; FIXME: We choose postindex addressing, but the scevgep is placed in for.inc so
-; during codegen we will fail to actually generate a postindex load.
 define void @middle_block_load(ptr %p, ptr %q, i64 %n) {
 ; CHECK-LABEL: define void @middle_block_load(
 ; CHECK-SAME: ptr [[P:%.*]], ptr [[Q:%.*]], i64 [[N:%.*]]) {
@@ -254,6 +252,7 @@ define void @middle_block_load(ptr %p, ptr %q, i64 %n) {
 ; CHECK:       [[IF_END]]:
 ; CHECK-NEXT:    [[QVAL:%.*]] = load i32, ptr [[LSR_IV1]], align 4
 ; CHECK-NEXT:    [[CMP2:%.*]] = icmp sgt i32 [[QVAL]], 0
+; CHECK-NEXT:    [[SCEVGEP]] = getelementptr i8, ptr [[LSR_IV1]], i64 4
 ; CHECK-NEXT:    br i1 [[CMP2]], label %[[IF_THEN2:.*]], label %[[IF_ELSE2:.*]]
 ; CHECK:       [[IF_THEN2]]:
 ; CHECK-NEXT:    tail call void @otherfn1()
@@ -263,7 +262,6 @@ define void @middle_block_load(ptr %p, ptr %q, i64 %n) {
 ; CHECK-NEXT:    br label %[[FOR_INC]]
 ; CHECK:       [[FOR_INC]]:
 ; CHECK-NEXT:    [[LSR_IV_NEXT]] = add i64 [[LSR_IV]], -1
-; CHECK-NEXT:    [[SCEVGEP]] = getelementptr i8, ptr [[LSR_IV1]], i64 4
 ; CHECK-NEXT:    [[CMP3:%.*]] = icmp eq i64 [[LSR_IV_NEXT]], 0
 ; CHECK-NEXT:    br i1 [[CMP3]], label %[[EXIT:.*]], label %[[FOR_BODY]]
 ; CHECK:       [[EXIT]]:
diff --git a/llvm/test/Transforms/LoopUnroll/AArch64/apple-unrolling.ll b/llvm/test/Transforms/LoopUnroll/AArch64/apple-unrolling.ll
index 2e4fc55a8f16d..e3dabfaedbdef 100644
--- a/llvm/test/Transforms/LoopUnroll/AArch64/apple-unrolling.ll
+++ b/llvm/test/Transforms/LoopUnroll/AArch64/apple-unrolling.ll
@@ -3,6 +3,7 @@
 ; RUN: opt -p loop-unroll -mcpu=apple-m2 -S %s | FileCheck --check-prefix=APPLE %s
 ; RUN: opt -p loop-unroll -mcpu=apple-m3 -S %s | FileCheck --check-prefix=APPLE %s
 ; RUN: opt -p loop-unroll -mcpu=apple-m4 -S %s | FileCheck --check-prefix=APPLE %s
+; RUN: opt -p loop-unroll -mcpu=apple-a17 -S %s | FileCheck --check-prefix=APPLE-A17 %s
 ; RUN: opt -p loop-unroll -mcpu=cortex-a57 -S %s | FileCheck --check-prefix=OTHER %s
 
 target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-n32:64-S128-Fn32"
@@ -20,56 +21,56 @@ define void @small_load_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale) {
 ; APPLE-NEXT:    [[UNROLL_ITER:%.*]] = sub i64 [[N]], [[XTRAITER]]
 ; APPLE-NEXT:    br label %[[LOOP:.*]]
 ; APPLE:       [[LOOP]]:
-; APPLE-NEXT:    [[IV_EPIL:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[IV_NEXT_7:%.*]], %[[LOOP]] ]
+; APPLE-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[IV_NEXT_7:%.*]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[NITER:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[NITER_NEXT_7:%.*]], %[[LOOP]] ]
-; APPLE-NEXT:    [[SCALED_IV_EPIL:%.*]] = mul nuw nsw i64 [[IV_EPIL]], [[SCALE]]
-; APPLE-NEXT:    [[GEP_SRC_EPIL:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_EPIL]]
-; APPLE-NEXT:    [[L_EPIL:%.*]] = load float, ptr [[GEP_SRC_EPIL]], align 4
-; APPLE-NEXT:    [[GEP_DST_EPIL:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_EPIL]]
-; APPLE-NEXT:    store float [[L_EPIL]], ptr [[GEP_DST_EPIL]], align 4
-; APPLE-NEXT:    [[IV_NEXT_EPIL:%.*]] = add nuw nsw i64 [[IV_EPIL]], 1
-; APPLE-NEXT:    [[SCALED_IV_1:%.*]] = mul nuw nsw i64 [[IV_NEXT_EPIL]], [[SCALE]]
+; APPLE-NEXT:    [[SCALED_IV:%.*]] = mul nuw nsw i64 [[IV]], [[SCALE]]
+; APPLE-NEXT:    [[GEP_SRC:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV]]
+; APPLE-NEXT:    [[L:%.*]] = load float, ptr [[GEP_SRC]], align 4
+; APPLE-NEXT:    [[GEP_DST:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV]]
+; APPLE-NEXT:    store float [[L]], ptr [[GEP_DST]], align 4
+; APPLE-NEXT:    [[IV_NEXT:%.*]] = add nuw nsw i64 [[IV]], 1
+; APPLE-NEXT:    [[SCALED_IV_1:%.*]] = mul nuw nsw i64 [[IV_NEXT]], [[SCALE]]
 ; APPLE-NEXT:    [[GEP_SRC_1:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_1]]
 ; APPLE-NEXT:    [[L_1:%.*]] = load float, ptr [[GEP_SRC_1]], align 4
-; APPLE-NEXT:    [[GEP_DST_1:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_NEXT_EPIL]]
+; APPLE-NEXT:    [[GEP_DST_1:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_NEXT]]
 ; APPLE-NEXT:    store float [[L_1]], ptr [[GEP_DST_1]], align 4
-; APPLE-NEXT:    [[IV_NEXT_1:%.*]] = add nuw nsw i64 [[IV_EPIL]], 2
+; APPLE-NEXT:    [[IV_NEXT_1:%.*]] = add nuw nsw i64 [[IV]], 2
 ; APPLE-NEXT:    [[SCALED_IV_2:%.*]] = mul nuw nsw i64 [[IV_NEXT_1]], [[SCALE]]
 ; APPLE-NEXT:    [[GEP_SRC_2:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_2]]
 ; APPLE-NEXT:    [[L_2:%.*]] = load float, ptr [[GEP_SRC_2]], align 4
 ; APPLE-NEXT:    [[GEP_DST_2:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_NEXT_1]]
 ; APPLE-NEXT:    store float [[L_2]], ptr [[GEP_DST_2]], align 4
-; APPLE-NEXT:    [[IV_NEXT_2:%.*]] = add nuw nsw i64 [[IV_EPIL]], 3
+; APPLE-NEXT:    [[IV_NEXT_2:%.*]] = add nuw nsw i64 [[IV]], 3
 ; APPLE-NEXT:    [[SCALED_IV_3:%.*]] = mul nuw nsw i64 [[IV_NEXT_2]], [[SCALE]]
 ; APPLE-NEXT:    [[GEP_SRC_3:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_3]]
 ; APPLE-NEXT:    [[L_3:%.*]] = load float, ptr [[GEP_SRC_3]], align 4
 ; APPLE-NEXT:    [[GEP_DST_3:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_NEXT_2]]
 ; APPLE-NEXT:    store float [[L_3]], ptr [[GEP_DST_3]], align 4
-; APPLE-NEXT:    [[IV_NEXT_3:%.*]] = add nuw nsw i64 [[IV_EPIL]], 4
+; APPLE-NEXT:    [[IV_NEXT_3:%.*]] = add nuw nsw i64 [[IV]], 4
 ; APPLE-NEXT:    [[SCALED_IV_4:%.*]] = mul nuw nsw i64 [[IV_NEXT_3]], [[SCALE]]
 ; APPLE-NEXT:    [[GEP_SRC_4:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_4]]
 ; APPLE-NEXT:    [[L_4:%.*]] = load float, ptr [[GEP_SRC_4]], align 4
 ; APPLE-NEXT:    [[GEP_DST_4:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_NEXT_3]]
 ; APPLE-NEXT:    store float [[L_4]], ptr [[GEP_DST_4]], align 4
-; APPLE-NEXT:    [[IV_NEXT_4:%.*]] = add nuw nsw i64 [[IV_EPIL]], 5
+; APPLE-NEXT:    [[IV_NEXT_4:%.*]] = add nuw nsw i64 [[IV]], 5
 ; APPLE-NEXT:    [[SCALED_IV_5:%.*]] = mul nuw nsw i64 [[IV_NEXT_4]], [[SCALE]]
 ; APPLE-NEXT:    [[GEP_SRC_5:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_5]]
 ; APPLE-NEXT:    [[L_5:%.*]] = load float, ptr [[GEP_SRC_5]], align 4
 ; APPLE-NEXT:    [[GEP_DST_5:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_NEXT_4]]
 ; APPLE-NEXT:    store float [[L_5]], ptr [[GEP_DST_5]], align 4
-; APPLE-NEXT:    [[IV_NEXT_5:%.*]] = add nuw nsw i64 [[IV_EPIL]], 6
+; APPLE-NEXT:    [[IV_NEXT_5:%.*]] = add nuw nsw i64 [[IV]], 6
 ; APPLE-NEXT:    [[SCALED_IV_6:%.*]] = mul nuw nsw i64 [[IV_NEXT_5]], [[SCALE]]
 ; APPLE-NEXT:    [[GEP_SRC_6:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_6]]
 ; APPLE-NEXT:    [[L_6:%.*]] = load float, ptr [[GEP_SRC_6]], align 4
 ; APPLE-NEXT:    [[GEP_DST_6:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_NEXT_5]]
 ; APPLE-NEXT:    store float [[L_6]], ptr [[GEP_DST_6]], align 4
-; APPLE-NEXT:    [[IV_NEXT_6:%.*]] = add nuw nsw i64 [[IV_EPIL]], 7
+; APPLE-NEXT:    [[IV_NEXT_6:%.*]] = add nuw nsw i64 [[IV]], 7
 ; APPLE-NEXT:    [[SCALED_IV_7:%.*]] = mul nuw nsw i64 [[IV_NEXT_6]], [[SCALE]]
 ; APPLE-NEXT:    [[GEP_SRC_7:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_7]]
 ; APPLE-NEXT:    [[L_7:%.*]] = load float, ptr [[GEP_SRC_7]], align 4
 ; APPLE-NEXT:    [[GEP_DST_7:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_NEXT_6]]
 ; APPLE-NEXT:    store float [[L_7]], ptr [[GEP_DST_7]], align 4
-; APPLE-NEXT:    [[IV_NEXT_7]] = add nuw nsw i64 [[IV_EPIL]], 8
+; APPLE-NEXT:    [[IV_NEXT_7]] = add nuw nsw i64 [[IV]], 8
 ; APPLE-NEXT:    [[NITER_NEXT_7]] = add i64 [[NITER]], 8
 ; APPLE-NEXT:    [[NITER_NCMP_7:%.*]] = icmp eq i64 [[NITER_NEXT_7]], [[UNROLL_ITER]]
 ; APPLE-NEXT:    br i1 [[NITER_NCMP_7]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP]]
@@ -83,15 +84,15 @@ define void @small_load_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale) {
 ; APPLE-NEXT:    call void @llvm.assume(i1 [[LCMP_MOD1]])
 ; APPLE-NEXT:    br label %[[LOOP_EPIL:.*]]
 ; APPLE:       [[LOOP_EPIL]]:
-; APPLE-NEXT:    [[IV_EPIL1:%.*]] = phi i64 [ [[IV_EPIL_INIT]], %[[LOOP_EPIL_PREHEADER]] ], [ [[IV_NEXT_EPIL1:%.*]], %[[LOOP_EPIL]] ]
+; APPLE-NEXT:    [[IV_EPIL:%.*]] = phi i64 [ [[IV_EPIL_INIT]], %[[LOOP_EPIL_PREHEADER]] ], [ [[IV_NEXT_EPIL:%.*]], %[[LOOP_EPIL]] ]
 ; APPLE-NEXT:    [[EPIL_ITER:%.*]] = phi i64 [ 0, %[[LOOP_EPIL_PREHEADER]] ], [ [[EPIL_ITER_NEXT:%.*]], %[[LOOP_EPIL]] ]
-; APPLE-NEXT:    [[SCALED_IV_EPIL1:%.*]] = mul nuw nsw i64 [[IV_EPIL1]], [[SCALE]]
-; APPLE-NEXT:    [[GEP_SRC_EPIL1:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_EPIL1]]
-; APPLE-NEXT:    [[L_EPIL1:%.*]] = load float, ptr [[GEP_SRC_EPIL1]], align 4
-; APPLE-NEXT:    [[GEP_DST_EPIL1:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_EPIL1]]
-; APPLE-NEXT:    store float [[L_EPIL1]], ptr [[GEP_DST_EPIL1]], align 4
-; APPLE-NEXT:    [[IV_NEXT_EPIL1]] = add nuw nsw i64 [[IV_EPIL1]], 1
-; APPLE-NEXT:    [[EC_EPIL:%.*]] = icmp eq i64 [[IV_NEXT_EPIL1]], [[N]]
+; APPLE-NEXT:    [[SCALED_IV_EPIL:%.*]] = mul nuw nsw i64 [[IV_EPIL]], [[SCALE]]
+; APPLE-NEXT:    [[GEP_SRC_EPIL:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_EPIL]]
+; APPLE-NEXT:    [[L_EPIL:%.*]] = load float, ptr [[GEP_SRC_EPIL]], align 4
+; APPLE-NEXT:    [[GEP_DST_EPIL:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_EPIL]]
+; APPLE-NEXT:    store float [[L_EPIL]], ptr [[GEP_DST_EPIL]], align 4
+; APPLE-NEXT:    [[IV_NEXT_EPIL]] = add nuw nsw i64 [[IV_EPIL]], 1
+; APPLE-NEXT:    [[EC_EPIL:%.*]] = icmp eq i64 [[IV_NEXT_EPIL]], [[N]]
 ; APPLE-NEXT:    [[EPIL_ITER_NEXT]] = add i64 [[EPIL_ITER]], 1
 ; APPLE-NEXT:    [[EPIL_ITER_CMP:%.*]] = icmp ne i64 [[EPIL_ITER_NEXT]], [[XTRAITER]]
 ; APPLE-NEXT:    br i1 [[EPIL_ITER_CMP]], label %[[LOOP_EPIL]], label %[[EXIT_EPILOG_LCSSA:.*]], !llvm.loop [[LOOP0:![0-9]+]]
@@ -100,6 +101,23 @@ define void @small_load_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale) {
 ; APPLE:       [[EXIT]]:
 ; APPLE-NEXT:    ret void
 ;
+; APPLE-A17-LABEL: define void @small_load_store_loop(
+; APPLE-A17-SAME: ptr [[SRC:%.*]], ptr [[DST:%.*]], i64 [[N:%.*]], i64 [[SCALE:%.*]]) #[[ATTR0:[0-9]+]] {
+; APPLE-A17-NEXT:  [[ENTRY:.*]]:
+; APPLE-A17-NEXT:    br label %[[LOOP:.*]]
+; APPLE-A17:       [[LOOP]]:
+; APPLE-A17-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    [[SCALED_IV:%.*]] = mul nuw nsw i64 [[IV]], [[SCALE]]
+; APPLE-A17-NEXT:    [[GEP_SRC:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV]]
+; APPLE-A17-NEXT:    [[L:%.*]] = load float, ptr [[GEP_SRC]], align 4
+; APPLE-A17-NEXT:    [[GEP_DST:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV]]
+; APPLE-A17-NEXT:    store float [[L]], ptr [[GEP_DST]], align 4
+; APPLE-A17-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; APPLE-A17-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; APPLE-A17-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
+; APPLE-A17:       [[EXIT]]:
+; APPLE-A17-NEXT:    ret void
+;
 ; OTHER-LABEL: define void @small_load_store_loop(
 ; OTHER-SAME: ptr [[SRC:%.*]], ptr [[DST:%.*]], i64 [[N:%.*]], i64 [[SCALE:%.*]]) #[[ATTR0:[0-9]+]] {
 ; OTHER-NEXT:  [[ENTRY:.*]]:
@@ -129,19 +147,19 @@ define void @small_load_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale) {
 ; OTHER-NEXT:    [[NITER_NCMP_1:%.*]] = icmp eq i64 [[NITER_NEXT_1]], [[UNROLL_ITER]]
 ; OTHER-NEXT:    br i1 [[NITER_NCMP_1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP]]
 ; OTHER:       [[EXIT_UNR_LCSSA]]:
-; OTHER-NEXT:    [[IV_UNR1:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
+; OTHER-NEXT:    [[IV_UNR:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
 ; OTHER-NEXT:    [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
 ; OTHER-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER]], label %[[EXIT:.*]]
 ; OTHER:       [[LOOP_EPIL_PREHEADER]]:
-; OTHER-NEXT:    [[IV_UNR:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR1]], %[[EXIT_UNR_LCSSA]] ]
+; OTHER-NEXT:    [[IV_EPIL_INIT:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR]], %[[EXIT_UNR_LCSSA]] ]
 ; OTHER-NEXT:    [[LCMP_MOD1:%.*]] = icmp ne i64 [[XTRAITER]], 0
 ; OTHER-NEXT:    call void @llvm.assume(i1 [[LCMP_MOD1]])
 ; OTHER-NEXT:    br label %[[LOOP_EPIL:.*]]
 ; OTHER:       [[LOOP_EPIL]]:
-; OTHER-NEXT:    [[SCALED_IV_EPIL:%.*]] = mul nuw nsw i64 [[IV_UNR]], [[SCALE]]
+; OTHER-NEXT:    [[SCALED_IV_EPIL:%.*]] = mul nuw nsw i64 [[IV_EPIL_INIT]], [[SCALE]]
 ; OTHER-NEXT:    [[GEP_SRC_EPIL:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_EPIL]]
 ; OTHER-NEXT:    [[L_EPIL:%.*]] = load float, ptr [[GEP_SRC_EPIL]], align 4
-; OTHER-NEXT:    [[GEP_DST_EPIL:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_UNR]]
+; OTHER-NEXT:    [[GEP_DST_EPIL:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_EPIL_INIT]]
 ; OTHER-NEXT:    store float [[L_EPIL]], ptr [[GEP_DST_EPIL]], align 4
 ; OTHER-NEXT:    br label %[[EXIT]]
 ; OTHER:       [[EXIT]]:
@@ -197,25 +215,43 @@ define void @load_op_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale, float %k
 ; APPLE-NEXT:    [[NITER_NCMP_1:%.*]] = icmp eq i64 [[NITER_NEXT_1]], [[UNROLL_ITER]]
 ; APPLE-NEXT:    br i1 [[NITER_NCMP_1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP]]
 ; APPLE:       [[EXIT_UNR_LCSSA]]:
-; APPLE-NEXT:    [[IV_UNR1:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
+; APPLE-NEXT:    [[IV_UNR:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
 ; APPLE-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER]], label %[[EXIT:.*]]
 ; APPLE:       [[LOOP_EPIL_PREHEADER]]:
-; APPLE-NEXT:    [[IV_UNR:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR1]], %[[EXIT_UNR_LCSSA]] ]
+; APPLE-NEXT:    [[IV_EPIL_INIT:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR]], %[[EXIT_UNR_LCSSA]] ]
 ; APPLE-NEXT:    [[LCMP_MOD1:%.*]] = icmp ne i64 [[XTRAITER]], 0
 ; APPLE-NEXT:    call void @llvm.assume(i1 [[LCMP_MOD1]])
 ; APPLE-NEXT:    br label %[[LOOP_EPIL:.*]]
 ; APPLE:       [[LOOP_EPIL]]:
-; APPLE-NEXT:    [[SCALED_IV_EPIL:%.*]] = mul nuw nsw i64 [[IV_UNR]], [[SCALE]]
+; APPLE-NEXT:    [[SCALED_IV_EPIL:%.*]] = mul nuw nsw i64 [[IV_EPIL_INIT]], [[SCALE]]
 ; APPLE-NEXT:    [[GEP_SRC_EPIL:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_EPIL]]
 ; APPLE-NEXT:    [[L_EPIL:%.*]] = load float, ptr [[GEP_SRC_EPIL]], align 4
 ; APPLE-NEXT:    [[O_EPIL:%.*]] = fadd float [[L_EPIL]], [[K]]
-; APPLE-NEXT:    [[GEP_DST_EPIL:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_UNR]]
+; APPLE-NEXT:    [[GEP_DST_EPIL:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_EPIL_INIT]]
 ; APPLE-NEXT:    store float [[O_EPIL]], ptr [[GEP_DST_EPIL]], align 4
 ; APPLE-NEXT:    br label %[[EXIT]]
 ; APPLE:       [[EXIT]]:
 ; APPLE-NEXT:    ret void
 ;
+; APPLE-A17-LABEL: define void @load_op_store_loop(
+; APPLE-A17-SAME: ptr [[SRC:%.*]], ptr [[DST:%.*]], i64 [[N:%.*]], i64 [[SCALE:%.*]], float [[K:%.*]]) #[[ATTR0]] {
+; APPLE-A17-NEXT:  [[ENTRY:.*]]:
+; APPLE-A17-NEXT:    br label %[[LOOP:.*]]
+; APPLE-A17:       [[LOOP]]:
+; APPLE-A17-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    [[SCALED_IV:%.*]] = mul nuw nsw i64 [[IV]], [[SCALE]]
+; APPLE-A17-NEXT:    [[GEP_SRC:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV]]
+; APPLE-A17-NEXT:    [[L:%.*]] = load float, ptr [[GEP_SRC]], align 4
+; APPLE-A17-NEXT:    [[O:%.*]] = fadd float [[L]], [[K]]
+; APPLE-A17-NEXT:    [[GEP_DST:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV]]
+; APPLE-A17-NEXT:    store float [[O]], ptr [[GEP_DST]], align 4
+; APPLE-A17-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; APPLE-A17-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; APPLE-A17-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
+; APPLE-A17:       [[EXIT]]:
+; APPLE-A17-NEXT:    ret void
+;
 ; OTHER-LABEL: define void @load_op_store_loop(
 ; OTHER-SAME: ptr [[SRC:%.*]], ptr [[DST:%.*]], i64 [[N:%.*]], i64 [[SCALE:%.*]], float [[K:%.*]]) #[[ATTR0]] {
 ; OTHER-NEXT:  [[ENTRY:.*]]:
@@ -247,20 +283,20 @@ define void @load_op_store_loop(ptr %src, ptr %dst, i64 %N, i64 %scale, float %k
 ; OTHER-NEXT:    [[NITER_NCMP_1:%.*]] = icmp eq i64 [[NITER_NEXT_1]], [[UNROLL_ITER]]
 ; OTHER-NEXT:    br i1 [[NITER_NCMP_1]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP]]
 ; OTHER:       [[EXIT_UNR_LCSSA]]:
-; OTHER-NEXT:    [[IV_UNR1:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
+; OTHER-NEXT:    [[IV_UNR:%.*]] = phi i64 [ [[IV_NEXT_1]], %[[LOOP]] ]
 ; OTHER-NEXT:    [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
 ; OTHER-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER]], label %[[EXIT:.*]]
 ; OTHER:       [[LOOP_EPIL_PREHEADER]]:
-; OTHER-NEXT:    [[IV_UNR:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR1]], %[[EXIT_UNR_LCSSA]] ]
+; OTHER-NEXT:    [[IV_EPIL_INIT:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR]], %[[EXIT_UNR_LCSSA]] ]
 ; OTHER-NEXT:    [[LCMP_MOD1:%.*]] = icmp ne i64 [[XTRAITER]], 0
 ; OTHER-NEXT:    call void @llvm.assume(i1 [[LCMP_MOD1]])
 ; OTHER-NEXT:    br label %[[LOOP_EPIL:.*]]
 ; OTHER:       [[LOOP_EPIL]]:
-; OTHER-NEXT:    [[SCALED_IV_EPIL:%.*]] = mul nuw nsw i64 [[IV_UNR]], [[SCALE]]
+; OTHER-NEXT:    [[SCALED_IV_EPIL:%.*]] = mul nuw nsw i64 [[IV_EPIL_INIT]], [[SCALE]]
 ; OTHER-NEXT:    [[GEP_SRC_EPIL:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV_EPIL]]
 ; OTHER-NEXT:    [[L_EPIL:%.*]] = load float, ptr [[GEP_SRC_EPIL]], align 4
 ; OTHER-NEXT:    [[O_EPIL:%.*]] = fadd float [[L_EPIL]], [[K]]
-; OTHER-NEXT:    [[GEP_DST_EPIL:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_UNR]]
+; OTHER-NEXT:    [[GEP_DST_EPIL:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV_EPIL_INIT]]
 ; OTHER-NEXT:    store float [[O_EPIL]], ptr [[GEP_DST_EPIL]], align 4
 ; OTHER-NEXT:    br label %[[EXIT]]
 ; OTHER:       [[EXIT]]:
@@ -312,6 +348,32 @@ define void @load_op_store_loop_multiblock(ptr %src, ptr %dst, i64 %N, i64 %scal
 ; APPLE:       [[EXIT]]:
 ; APPLE-NEXT:    ret void
 ;
+; APPLE-A17-LABEL: define void @load_op_store_loop_multiblock(
+; APPLE-A17-SAME: ptr [[SRC:%.*]], ptr [[DST:%.*]], i64 [[N:%.*]], i64 [[SCALE:%.*]], float [[K:%.*]]) #[[ATTR0]] {
+; APPLE-A17-NEXT:  [[ENTRY:.*]]:
+; APPLE-A17-NEXT:    br label %[[LOOP:.*]]
+; APPLE-A17:       [[LOOP]]:
+; APPLE-A17-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOPCONT:.*]] ]
+; APPLE-A17-NEXT:    [[SCALED_IV:%.*]] = mul nuw nsw i64 [[IV]], [[SCALE]]
+; APPLE-A17-NEXT:    [[GEP_SRC:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 [[SCALED_IV]]
+; APPLE-A17-NEXT:    [[L1:%.*]] = load float, ptr [[GEP_SRC]], align 4
+; APPLE-A17-NEXT:    [[AND:%.*]] = and i64 [[IV]], 1
+; APPLE-A17-NEXT:    [[ODD:%.*]] = icmp eq i64 [[AND]], 1
+; APPLE-A17-NEXT:    br i1 [[ODD]], label %[[LOOPODD:.*]], label %[[LOOPCONT]]
+; APPLE-A17:       [[LOOPCONT]]:
+; APPLE-A17-NEXT:    [[D:%.*]] = phi float [ [[L2:%.*]], %[[LOOPODD]] ], [ [[L1]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    [[O:%.*]] = fadd float [[D]], [[K]]
+; APPLE-A17-NEXT:    [[GEP_DST:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 [[IV]]
+; APPLE-A17-NEXT:    store float [[O]], ptr [[GEP_DST]], align 4
+; APPLE-A17-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; APPLE-A17-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; APPLE-A17-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
+; APPLE-A17:       [[LOOPODD]]:
+; APPLE-A17-NEXT:    [[L2]] = fneg float [[L1]]
+; APPLE-A17-NEXT:    br label %[[LOOPCONT]]
+; APPLE-A17:       [[EXIT]]:
+; APPLE-A17-NEXT:    ret void
+;
 ; OTHER-LABEL: define void @load_op_store_loop_multiblock(
 ; OTHER-SAME: ptr [[SRC:%.*]], ptr [[DST:%.*]], i64 [[N:%.*]], i64 [[SCALE:%.*]], float [[K:%.*]]) #[[ATTR0]] {
 ; OTHER-NEXT:  [[ENTRY:.*]]:
@@ -380,58 +442,58 @@ define void @early_continue_dep_on_load_large(ptr %p.1, ptr %p.2, i64 %N, i32 %x
 ; APPLE-NEXT:    [[UNROLL_ITER:%.*]] = sub i64 [[TMP0]], [[XTRAITER]]
 ; APPLE-NEXT:    br label %[[LOOP_HEADER:.*]]
 ; APPLE:       [[LOOP_HEADER]]:
-; APPLE-NEXT:    [[IV_EPIL:%.*]] = phi i64 [ 1, %[[ENTRY_NEW]] ], [ [[IV_NEXT_3:%.*]], %[[LOOP_LATCH_3:.*]] ]
+; APPLE-NEXT:    [[IV:%.*]] = phi i64 [ 1, %[[ENTRY_NEW]] ], [ [[IV_NEXT_3:%.*]], %[[LOOP_LATCH_3:.*]] ]
 ; APPLE-NEXT:    [[NITER:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[NITER_NEXT_3:%.*]], %[[LOOP_LATCH_3]] ]
-; APPLE-NEXT:    [[GEP_EPIL:%.*]] = getelementptr { i32, i8, i8, [2 x i8] }, ptr [[P_1]], i64 [[IV_EPIL]]
-; APPLE-NEXT:    [[L_1_EPIL:%.*]] = load i32, ptr [[GEP_EPIL]], align 4
-; APPLE-NEXT:    [[CMP6_NOT_EPIL:%.*]] = icmp sgt i32 [[L_1_EPIL]], [[T_1]]
-; APPLE-NEXT:    br i1 [[CMP6_NOT_EPIL]], label %[[THEN:.*]], label %[[LOOP_LATCH:.*]]
+; APPLE-NEXT:    [[GEP:%.*]] = getelementptr { i32, i8, i8, [2 x i8] }, ptr [[P_1]], i64 [[IV]]
+; APPLE-NEXT:    [[L_1:%.*]] = load i32, ptr [[GEP]], align 4
+; APPLE-NEXT:    [[C_1:%.*]] = icmp sgt i32 [[L_1]], [[T_1]]
+; APPLE-NEXT:    br i1 [[C_1]], label %[[THEN:.*]], label %[[LOOP_LATCH:.*]]
 ; APPLE:       [[THEN]]:
-; APPLE-NEXT:    [[GEP_4_EPIL:%.*]] = getelementptr inbounds nuw i8, ptr [[GEP_EPIL]], i64 4
-; APPLE-NEXT:    [[L_2_EPIL:%.*]] = load i8, ptr [[GEP_4_EPIL]], align 4
-; APPLE-NEXT:    [[OR_COND_EPIL:%.*]] = icmp ugt i8 [[L_2_EPIL]], 7
-; APPLE-NEXT:    br i1 [[OR_COND_EPIL]], label %[[MERGE:.*]], label %[[ELSE:.*]]
+; APPLE-NEXT:    [[GEP_4:%.*]] = getelementptr inbounds nuw i8, ptr [[GEP]], i64 4
+; APPLE-NEXT:    [[L_2:%.*]] = load i8, ptr [[GEP_4]], align 4
+; APPLE-NEXT:    [[C_2:%.*]] = icmp ugt i8 [[L_2]], 7
+; APPLE-NEXT:    br i1 [[C_2]], label %[[MERGE:.*]], label %[[ELSE:.*]]
 ; APPLE:       [[ELSE]]:
-; APPLE-NEXT:    [[CONV_I_EPIL:%.*]] = zext nneg i8 [[L_2_EPIL]] to i64
-; APPLE-NEXT:    [[ARRAYIDX_I_EPIL:%.*]] = getelementptr inbounds [9 x i8], ptr @A, i64 0, i64 [[CONV_I_EPIL]]
-; APPLE-NEXT:    [[TMP27:%.*]] = load i8, ptr [[ARRAYIDX_I_EPIL]], align 1
-; APPLE-NEXT:    [[IDXPROM_I_EPIL:%.*]] = sext i8 [[TMP27]] to i64
-; APPLE-NEXT:    [[ARRAYIDX_I37_EPIL:%.*]] = getelementptr inbounds [8 x i32], ptr @B, i64 0, i64 [[IDXPROM_I_EPIL]]
-; APPLE-NEXT:    [[TMP28:%.*]] = load i32, ptr [[ARRAYIDX_I37_EPIL]], align 4
-; APPLE-NEXT:    [[ARRAYIDX_I42_EPIL:%.*]] = getelementptr inbounds [8 x i32], ptr @C, i64 0, i64 [[IDXPROM_I_EPIL]]
-; APPLE-NEXT:    [[TMP29:%.*]] = load i32, ptr [[ARRAYIDX_I42_EPIL]], align 4
+; APPLE-NEXT:    [[CONV_I:%.*]] = zext nneg i8 [[L_2]] to i64
+; APPLE-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds [9 x i8], ptr @A, i64 0, i64 [[CONV_I]]
+; APPLE-NEXT:    [[L_3:%.*]] = load i8, ptr [[GEP_A]], align 1
+; APPLE-NEXT:    [[IDXPROM_I:%.*]] = sext i8 [[L_3]] to i64
+; APPLE-NEXT:    [[GEP_B:%.*]] = getelementptr inbounds [8 x i32], ptr @B, i64 0, i64 [[IDXPROM_I]]
+; APPLE-NEXT:    [[L_4:%.*]] = load i32, ptr [[GEP_B]], align 4
+; APPLE-NEXT:    [[GEP_C:%.*]] = getelementptr inbounds [8 x i32], ptr @C, i64 0, i64 [[IDXPROM_I]]
+; APPLE-NEXT:    [[L_5:%.*]] = load i32, ptr [[GEP_C]], align 4
 ; APPLE-NEXT:    br label %[[MERGE]]
 ; APPLE:       [[MERGE]]:
-; APPLE-NEXT:    [[RETVAL_0_I3851_EPIL:%.*]] = phi i32 [ 0, %[[THEN]] ], [ [[TMP28]], %[[ELSE]] ]
-; APPLE-NEXT:    [[RETVAL_0_I43_EPIL:%.*]] = phi i32 [ 0, %[[THEN]] ], [ [[TMP29]], %[[ELSE]] ]
-; APPLE-NEXT:    [[ADD14_EPIL:%.*]] = add nsw i32 [[RETVAL_0_I43_EPIL]], [[X]]
-; APPLE-NEXT:    [[MUL15_EPIL:%.*]] = mul nsw i32 [[ADD14_EPIL]], [[WIDTH]]
-; APPLE-NEXT:    [[TMP30:%.*]] = trunc nuw nsw i64 [[IV_EPIL]] to i32
-; APPLE-NEXT:    [[ADD16_EPIL:%.*]] = add nsw i32 [[RETVAL_0_I3851_EPIL]], [[TMP30]]
-; APPLE-NEXT:    [[ADD17_EPIL:%.*]] = add nsw i32 [[ADD16_EPIL]], [[MUL15_EPIL]]
-; APPLE-NEXT:    [[IDXPROM18_EPIL:%.*]] = sext i32 [[ADD17_EPIL]] to i64
-; APPLE-NEXT:    [[ARRAYIDX19_EPIL:%.*]] = getelementptr inbounds { i32, i8, i8, [2 x i8] }, ptr [[P_2]], i64 [[IDXPROM18_EPIL]]
-; APPLE-NEXT:    [[TMP31:%.*]] = load i32, ptr [[ARRAYIDX19_EPIL]], align 4
-; APPLE-NEXT:    [[SUB_EPIL:%.*]] = sub nsw i32 [[X]], [[RETVAL_0_I43_EPIL]]
-; APPLE-NEXT:    [[MUL21_EPIL:%.*]] = mul nsw i32 [[SUB_EPIL]], [[WIDTH]]
-; APPLE-NEXT:    [[SUB22_EPIL:%.*]] = sub i32 [[TMP30]], [[RETVAL_0_I3851_EPIL]]
-; APPLE-NEXT:    [[ADD23_EPIL:%.*]] = add nsw i32 [[SUB22_EPIL]], [[MUL21_EPIL]]
-; APPLE-NEXT:    [[IDXPROM24_EPIL:%.*]] = sext i32 [[ADD23_EPIL]] to i64
-; APPLE-NEXT:    [[ARRAYIDX25_EPIL:%.*]] = getelementptr inbounds { i32, i8, i8, [2 x i8] }, ptr [[P_2]], i64 [[IDXPROM24_EPIL]]
-; APPLE-NEXT:    [[TMP32:%.*]] = load i32, ptr [[ARRAYIDX25_EPIL]], align 4
-; APPLE-NEXT:    [[CMP27_EPIL:%.*]] = icmp sgt i32 [[L_1_EPIL]], [[TMP31]]
-; APPLE-NEXT:    [[CMP28_EPIL:%.*]] = icmp sgt i32 [[L_1_EPIL]], [[TMP32]]
-; APPLE-NEXT:    [[AND34_EPIL:%.*]] = and i1 [[CMP27_EPIL]], [[CMP28_EPIL]]
-; APPLE-NEXT:    br i1 [[AND34_EPIL]], label %[[STORE_RES:.*]], label %[[LOOP_LATCH]]
+; APPLE-NEXT:    [[MERGE_1:%.*]] = phi i32 [ 0, %[[THEN]] ], [ [[L_4]], %[[ELSE]] ]
+; APPLE-NEXT:    [[MERGE_2:%.*]] = phi i32 [ 0, %[[THEN]] ], [ [[L_5]], %[[ELSE]] ]
+; APPLE-NEXT:    [[ADD14:%.*]] = add nsw i32 [[MERGE_2]], [[X]]
+; APPLE-NEXT:    [[MUL15:%.*]] = mul nsw i32 [[ADD14]], [[WIDTH]]
+; APPLE-NEXT:    [[TMP3:%.*]] = trunc nuw nsw i64 [[IV]] to i32
+; APPLE-NEXT:    [[ADD16:%.*]] = add nsw i32 [[MERGE_1]], [[TMP3]]
+; APPLE-NEXT:    [[ADD17:%.*]] = add nsw i32 [[ADD16]], [[MUL15]]
+; APPLE-NEXT:    [[IDXPROM18:%.*]] = sext i32 [[ADD17]] to i64
+; APPLE-NEXT:    [[GEP_P_2:%.*]] = getelementptr inbounds { i32, i8, i8, [2 x i8] }, ptr [[P_2]], i64 [[IDXPROM18]]
+; APPLE-NEXT:    [[L_6:%.*]] = load i32, ptr [[GEP_P_2]], align 4
+; APPLE-NEXT:    [[SUB:%.*]] = sub nsw i32 [[X]], [[MERGE_2]]
+; APPLE-NEXT:    [[MUL21:%.*]] = mul nsw i32 [[SUB]], [[WIDTH]]
+; APPLE-NEXT:    [[SUB22:%.*]] = sub i32 [[TMP3]], [[MERGE_1]]
+; APPLE-NEXT:    [[ADD23:%.*]] = add nsw i32 [[SUB22]], [[MUL21]]
+; APPLE-NEXT:    [[IDXPROM24:%.*]] = sext i32 [[ADD23]] to i64
+; APPLE-NEXT:    [[GEP_P2_1:%.*]] = getelementptr inbounds { i32, i8, i8, [2 x i8] }, ptr [[P_2]], i64 [[IDXPROM24]]
+; APPLE-NEXT:    [[L_7:%.*]] = load i32, ptr [[GEP_P2_1]], align 4
+; APPLE-NEXT:    [[C_3:%.*]] = icmp sgt i32 [[L_1]], [[L_6]]
+; APPLE-NEXT:    [[C_4:%.*]] = icmp sgt i32 [[L_1]], [[L_7]]
+; APPLE-NEXT:    [[AND34:%.*]] = and i1 [[C_3]], [[C_4]]
+; APPLE-NEXT:    br i1 [[AND34]], label %[[STORE_RES:.*]], label %[[LOOP_LATCH]]
 ; APPLE:       [[STORE_RES]]:
-; APPLE-NEXT:    [[CMP32_EPIL:%.*]] = icmp sgt i32 [[L_1_EPIL]], [[T_2]]
-; APPLE-NEXT:    [[GEP_5_EPIL:%.*]] = getelementptr inbounds nuw i8, ptr [[GEP_EPIL]], i64 5
-; APPLE-NEXT:    [[RES_EPIL:%.*]] = select i1 [[CMP32_EPIL]], i8 1, i8 2
-; APPLE-NEXT:    store i8 [[RES_EPIL]], ptr [[GEP_5_EPIL]], align 1
+; APPLE-NEXT:    [[C_5:%.*]] = icmp sgt i32 [[L_1]], [[T_2]]
+; APPLE-NEXT:    [[GEP_5:%.*]] = getelementptr inbounds nuw i8, ptr [[GEP]], i64 5
+; APPLE-NEXT:    [[RES:%.*]] = select i1 [[C_5]], i8 1, i8 2
+; APPLE-NEXT:    store i8 [[RES]], ptr [[GEP_5]], align 1
 ; APPLE-NEXT:    br label %[[LOOP_LATCH]]
 ; APPLE:       [[LOOP_LATCH]]:
-; APPLE-NEXT:    [[IV_NEXT_EPIL:%.*]] = add nuw nsw i64 [[IV_EPIL]], 1
-; APPLE-NEXT:    [[GEP_1:%.*]] = getelementptr { i32, i8, i8, [2 x i8] }, ptr [[P_1]], i64 [[IV_NEXT_EPIL]]
+; APPLE-NEXT:    [[IV_NEXT:%.*]] = add nuw nsw i64 [[IV]], 1
+; APPLE-NEXT:    [[GEP_1:%.*]] = getelementptr { i32, i8, i8, [2 x i8] }, ptr [[P_1]], i64 [[IV_NEXT]]
 ; APPLE-NEXT:    [[L_1_1:%.*]] = load i32, ptr [[GEP_1]], align 4
 ; APPLE-NEXT:    [[C_1_1:%.*]] = icmp sgt i32 [[L_1_1]], [[T_1]]
 ; APPLE-NEXT:    br i1 [[C_1_1]], label %[[THEN_1:.*]], label %[[LOOP_LATCH_1:.*]]
@@ -455,7 +517,7 @@ define void @early_continue_dep_on_load_large(ptr %p.1, ptr %p.2, i64 %N, i32 %x
 ; APPLE-NEXT:    [[MERGE_2_1:%.*]] = phi i32 [ 0, %[[THEN_1]] ], [ [[L_5_1]], %[[ELSE_1]] ]
 ; APPLE-NEXT:    [[ADD14_1:%.*]] = add nsw i32 [[MERGE_2_1]], [[X]]
 ; APPLE-NEXT:    [[MUL15_1:%.*]] = mul nsw i32 [[ADD14_1]], [[WIDTH]]
-; APPLE-NEXT:    [[TMP4:%.*]] = trunc nuw nsw i64 [[IV_NEXT_EPIL]] to i32
+; APPLE-NEXT:    [[TMP4:%.*]] = trunc nuw nsw i64 [[IV_NEXT]] to i32
 ; APPLE-NEXT:    [[ADD16_1:%.*]] = add nsw i32 [[MERGE_1_1]], [[TMP4]]
 ; APPLE-NEXT:    [[ADD17_1:%.*]] = add nsw i32 [[ADD16_1]], [[MUL15_1]]
 ; APPLE-NEXT:    [[IDXPROM18_1:%.*]] = sext i32 [[ADD17_1]] to i64
@@ -479,7 +541,7 @@ define void @early_continue_dep_on_load_large(ptr %p.1, ptr %p.2, i64 %N, i32 %x
 ; APPLE-NEXT:    store i8 [[RES_1]], ptr [[GEP_5_1]], align 1
 ; APPLE-NEXT:    br label %[[LOOP_LATCH_1]]
 ; APPLE:       [[LOOP_LATCH_1]]:
-; APPLE-NEXT:    [[IV_NEXT_1:%.*]] = add nuw nsw i64 [[IV_EPIL]], 2
+; APPLE-NEXT:    [[IV_NEXT_1:%.*]] = add nuw nsw i64 [[IV]], 2
 ; APPLE-NEXT:    [[GEP_2:%.*]] = getelementptr { i32, i8, i8, [2 x i8] }, ptr [[P_1]], i64 [[IV_NEXT_1]]
 ; APPLE-NEXT:    [[L_1_2:%.*]] = load i32, ptr [[GEP_2]], align 4
 ; APPLE-NEXT:    [[C_1_2:%.*]] = icmp sgt i32 [[L_1_2]], [[T_1]]
@@ -528,7 +590,7 @@ define void @early_continue_dep_on_load_large(ptr %p.1, ptr %p.2, i64 %N, i32 %x
 ; APPLE-NEXT:    store i8 [[RES_2]], ptr [[GEP_5_2]], align 1
 ; APPLE-NEXT:    br label %[[LOOP_LATCH_2]]
 ; APPLE:       [[LOOP_LATCH_2]]:
-; APPLE-NEXT:    [[IV_NEXT_2:%.*]] = add nuw nsw i64 [[IV_EPIL]], 3
+; APPLE-NEXT:    [[IV_NEXT_2:%.*]] = add nuw nsw i64 [[IV]], 3
 ; APPLE-NEXT:    [[GEP_3:%.*]] = getelementptr { i32, i8, i8, [2 x i8] }, ptr [[P_1]], i64 [[IV_NEXT_2]]
 ; APPLE-NEXT:    [[L_1_3:%.*]] = load i32, ptr [[GEP_3]], align 4
 ; APPLE-NEXT:    [[C_1_3:%.*]] = icmp sgt i32 [[L_1_3]], [[T_1]]
@@ -577,7 +639,7 @@ define void @early_continue_dep_on_load_large(ptr %p.1, ptr %p.2, i64 %N, i32 %x
 ; APPLE-NEXT:    store i8 [[RES_3]], ptr [[GEP_5_3]], align 1
 ; APPLE-NEXT:    br label %[[LOOP_LATCH_3]]
 ; APPLE:       [[LOOP_LATCH_3]]:
-; APPLE-NEXT:    [[IV_NEXT_3]] = add nuw nsw i64 [[IV_EPIL]], 4
+; APPLE-NEXT:    [[IV_NEXT_3]] = add nuw nsw i64 [[IV]], 4
 ; APPLE-NEXT:    [[NITER_NEXT_3]] = add i64 [[NITER]], 4
 ; APPLE-NEXT:    [[NITER_NCMP_3:%.*]] = icmp eq i64 [[NITER_NEXT_3]], [[UNROLL_ITER]]
 ; APPLE-NEXT:    br i1 [[NITER_NCMP_3]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP_HEADER]]
@@ -591,58 +653,58 @@ define void @early_continue_dep_on_load_large(ptr %p.1, ptr %p.2, i64 %N, i32 %x
 ; APPLE-NEXT:    call void @llvm.assume(i1 [[LCMP_MOD1]])
 ; APPLE-NEXT:    br label %[[LOOP_HEADER_EPIL:.*]]
 ; APPLE:       [[LOOP_HEADER_EPIL]]:
-; APPLE-NEXT:    [[IV_EPIL1:%.*]] = phi i64 [ [[IV_EPIL_INIT]], %[[LOOP_HEADER_EPIL_PREHEADER]] ], [ [[IV_NEXT_EPIL1:%.*]], %[[LOOP_LATCH_EPIL:.*]] ]
+; APPLE-NEXT:    [[IV_EPIL:%.*]] = phi i64 [ [[IV_EPIL_INIT]], %[[LOOP_HEADER_EPIL_PREHEADER]] ], [ [[IV_NEXT_EPIL:%.*]], %[[LOOP_LATCH_EPIL:.*]] ]
 ; APPLE-NEXT:    [[EPIL_ITER:%.*]] = phi i64 [ 0, %[[LOOP_HEADER_EPIL_PREHEADER]] ], [ [[EPIL_ITER_NEXT:%.*]], %[[LOOP_LATCH_EPIL]] ]
-; APPLE-NEXT:    [[GEP_EPIL1:%.*]] = getelementptr { i32, i8, i8, [2 x i8] }, ptr [[P_1]], i64 [[IV_EPIL1]]
-; APPLE-NEXT:    [[L_1_EPIL1:%.*]] = load i32, ptr [[GEP_EPIL1]], align 4
-; APPLE-NEXT:    [[C_1_EPIL:%.*]] = icmp sgt i32 [[L_1_EPIL1]], [[T_1]]
+; APPLE-NEXT:    [[GEP_EPIL:%.*]] = getelementptr { i32, i8, i8, [2 x i8] }, ptr [[P_1]], i64 [[IV_EPIL]]
+; APPLE-NEXT:    [[L_1_EPIL:%.*]] = load i32, ptr [[GEP_EPIL]], align 4
+; APPLE-NEXT:    [[C_1_EPIL:%.*]] = icmp sgt i32 [[L_1_EPIL]], [[T_1]]
 ; APPLE-NEXT:    br i1 [[C_1_EPIL]], label %[[THEN_EPIL:.*]], label %[[LOOP_LATCH_EPIL]]
 ; APPLE:       [[THEN_EPIL]]:
-; APPLE-NEXT:    [[GEP_4_EPIL1:%.*]] = getelementptr inbounds nuw i8, ptr [[GEP_EPIL1]], i64 4
-; APPLE-NEXT:    [[L_2_EPIL1:%.*]] = load i8, ptr [[GEP_4_EPIL1]], align 4
-; APPLE-NEXT:    [[C_2_EPIL:%.*]] = icmp ugt i8 [[L_2_EPIL1]], 7
+; APPLE-NEXT:    [[GEP_4_EPIL:%.*]] = getelementptr inbounds nuw i8, ptr [[GEP_EPIL]], i64 4
+; APPLE-NEXT:    [[L_2_EPIL:%.*]] = load i8, ptr [[GEP_4_EPIL]], align 4
+; APPLE-NEXT:    [[C_2_EPIL:%.*]] = icmp ugt i8 [[L_2_EPIL]], 7
 ; APPLE-NEXT:    br i1 [[C_2_EPIL]], label %[[MERGE_EPIL:.*]], label %[[ELSE_EPIL:.*]]
 ; APPLE:       [[ELSE_EPIL]]:
-; APPLE-NEXT:    [[CONV_I_EPIL1:%.*]] = zext nneg i8 [[L_2_EPIL1]] to i64
-; APPLE-NEXT:    [[GEP_A_EPIL:%.*]] = getelementptr inbounds [9 x i8], ptr @A, i64 0, i64 [[CONV_I_EPIL1]]
+; APPLE-NEXT:    [[CONV_I_EPIL:%.*]] = zext nneg i8 [[L_2_EPIL]] to i64
+; APPLE-NEXT:    [[GEP_A_EPIL:%.*]] = getelementptr inbounds [9 x i8], ptr @A, i64 0, i64 [[CONV_I_EPIL]]
 ; APPLE-NEXT:    [[L_3_EPIL:%.*]] = load i8, ptr [[GEP_A_EPIL]], align 1
-; APPLE-NEXT:    [[IDXPROM_I_EPIL1:%.*]] = sext i8 [[L_3_EPIL]] to i64
-; APPLE-NEXT:    [[GEP_B_EPIL:%.*]] = getelementptr inbounds [8 x i32], ptr @B, i64 0, i64 [[IDXPROM_I_EPIL1]]
+; APPLE-NEXT:    [[IDXPROM_I_EPIL:%.*]] = sext i8 [[L_3_EPIL]] to i64
+; APPLE-NEXT:    [[GEP_B_EPIL:%.*]] = getelementptr inbounds [8 x i32], ptr @B, i64 0, i64 [[IDXPROM_I_EPIL]]
 ; APPLE-NEXT:    [[L_4_EPIL:%.*]] = load i32, ptr [[GEP_B_EPIL]], align 4
-; APPLE-NEXT:    [[GEP_C_EPIL:%.*]] = getelementptr inbounds [8 x i32], ptr @C, i64 0, i64 [[IDXPROM_I_EPIL1]]
+; APPLE-NEXT:    [[GEP_C_EPIL:%.*]] = getelementptr inbounds [8 x i32], ptr @C, i64 0, i64 [[IDXPROM_I_EPIL]]
 ; APPLE-NEXT:    [[L_5_EPIL:%.*]] = load i32, ptr [[GEP_C_EPIL]], align 4
 ; APPLE-NEXT:    br label %[[MERGE_EPIL]]
 ; APPLE:       [[MERGE_EPIL]]:
 ; APPLE-NEXT:    [[MERGE_1_EPIL:%.*]] = phi i32 [ 0, %[[THEN_EPIL]] ], [ [[L_4_EPIL]], %[[ELSE_EPIL]] ]
 ; APPLE-NEXT:    [[MERGE_2_EPIL:%.*]] = phi i32 [ 0, %[[THEN_EPIL]] ], [ [[L_5_EPIL]], %[[ELSE_EPIL]] ]
-; APPLE-NEXT:    [[ADD14_EPIL1:%.*]] = add nsw i32 [[MERGE_2_EPIL]], [[X]]
-; APPLE-NEXT:    [[MUL15_EPIL1:%.*]] = mul nsw i32 [[ADD14_EPIL1]], [[WIDTH]]
-; APPLE-NEXT:    [[TMP7:%.*]] = trunc nuw nsw i64 [[IV_EPIL1]] to i32
-; APPLE-NEXT:    [[ADD16_EPIL1:%.*]] = add nsw i32 [[MERGE_1_EPIL]], [[TMP7]]
-; APPLE-NEXT:    [[ADD17_EPIL1:%.*]] = add nsw i32 [[ADD16_EPIL1]], [[MUL15_EPIL1]]
-; APPLE-NEXT:    [[IDXPROM18_EPIL1:%.*]] = sext i32 [[ADD17_EPIL1]] to i64
-; APPLE-NEXT:    [[GEP_P_2_EPIL:%.*]] = getelementptr inbounds { i32, i8, i8, [2 x i8] }, ptr [[P_2]], i64 [[IDXPROM18_EPIL1]]
+; APPLE-NEXT:    [[ADD14_EPIL:%.*]] = add nsw i32 [[MERGE_2_EPIL]], [[X]]
+; APPLE-NEXT:    [[MUL15_EPIL:%.*]] = mul nsw i32 [[ADD14_EPIL]], [[WIDTH]]
+; APPLE-NEXT:    [[TMP7:%.*]] = trunc nuw nsw i64 [[IV_EPIL]] to i32
+; APPLE-NEXT:    [[ADD16_EPIL:%.*]] = add nsw i32 [[MERGE_1_EPIL]], [[TMP7]]
+; APPLE-NEXT:    [[ADD17_EPIL:%.*]] = add nsw i32 [[ADD16_EPIL]], [[MUL15_EPIL]]
+; APPLE-NEXT:    [[IDXPROM18_EPIL:%.*]] = sext i32 [[ADD17_EPIL]] to i64
+; APPLE-NEXT:    [[GEP_P_2_EPIL:%.*]] = getelementptr inbounds { i32, i8, i8, [2 x i8] }, ptr [[P_2]], i64 [[IDXPROM18_EPIL]]
 ; APPLE-NEXT:    [[L_6_EPIL:%.*]] = load i32, ptr [[GEP_P_2_EPIL]], align 4
-; APPLE-NEXT:    [[SUB_EPIL1:%.*]] = sub nsw i32 [[X]], [[MERGE_2_EPIL]]
-; APPLE-NEXT:    [[MUL21_EPIL1:%.*]] = mul nsw i32 [[SUB_EPIL1]], [[WIDTH]]
-; APPLE-NEXT:    [[SUB22_EPIL1:%.*]] = sub i32 [[TMP7]], [[MERGE_1_EPIL]]
-; APPLE-NEXT:    [[ADD23_EPIL1:%.*]] = add nsw i32 [[SUB22_EPIL1]], [[MUL21_EPIL1]]
-; APPLE-NEXT:    [[IDXPROM24_EPIL1:%.*]] = sext i32 [[ADD23_EPIL1]] to i64
-; APPLE-NEXT:    [[GEP_P2_1_EPIL:%.*]] = getelementptr inbounds { i32, i8, i8, [2 x i8] }, ptr [[P_2]], i64 [[IDXPROM24_EPIL1]]
+; APPLE-NEXT:    [[SUB_EPIL:%.*]] = sub nsw i32 [[X]], [[MERGE_2_EPIL]]
+; APPLE-NEXT:    [[MUL21_EPIL:%.*]] = mul nsw i32 [[SUB_EPIL]], [[WIDTH]]
+; APPLE-NEXT:    [[SUB22_EPIL:%.*]] = sub i32 [[TMP7]], [[MERGE_1_EPIL]]
+; APPLE-NEXT:    [[ADD23_EPIL:%.*]] = add nsw i32 [[SUB22_EPIL]], [[MUL21_EPIL]]
+; APPLE-NEXT:    [[IDXPROM24_EPIL:%.*]] = sext i32 [[ADD23_EPIL]] to i64
+; APPLE-NEXT:    [[GEP_P2_1_EPIL:%.*]] = getelementptr inbounds { i32, i8, i8, [2 x i8] }, ptr [[P_2]], i64 [[IDXPROM24_EPIL]]
 ; APPLE-NEXT:    [[L_7_EPIL:%.*]] = load i32, ptr [[GEP_P2_1_EPIL]], align 4
-; APPLE-NEXT:    [[C_3_EPIL:%.*]] = icmp sgt i32 [[L_1_EPIL1]], [[L_6_EPIL]]
-; APPLE-NEXT:    [[C_4_EPIL:%.*]] = icmp sgt i32 [[L_1_EPIL1]], [[L_7_EPIL]]
-; APPLE-NEXT:    [[AND34_EPIL1:%.*]] = and i1 [[C_3_EPIL]], [[C_4_EPIL]]
-; APPLE-NEXT:    br i1 [[AND34_EPIL1]], label %[[STORE_RES_EPIL:.*]], label %[[LOOP_LATCH_EPIL]]
+; APPLE-NEXT:    [[C_3_EPIL:%.*]] = icmp sgt i32 [[L_1_EPIL]], [[L_6_EPIL]]
+; APPLE-NEXT:    [[C_4_EPIL:%.*]] = icmp sgt i32 [[L_1_EPIL]], [[L_7_EPIL]]
+; APPLE-NEXT:    [[AND34_EPIL:%.*]] = and i1 [[C_3_EPIL]], [[C_4_EPIL]]
+; APPLE-NEXT:    br i1 [[AND34_EPIL]], label %[[STORE_RES_EPIL:.*]], label %[[LOOP_LATCH_EPIL]]
 ; APPLE:       [[STORE_RES_EPIL]]:
-; APPLE-NEXT:    [[C_5_EPIL:%.*]] = icmp sgt i32 [[L_1_EPIL1]], [[T_2]]
-; APPLE-NEXT:    [[GEP_5_EPIL1:%.*]] = getelementptr inbounds nuw i8, ptr [[GEP_EPIL1]], i64 5
-; APPLE-NEXT:    [[RES_EPIL1:%.*]] = select i1 [[C_5_EPIL]], i8 1, i8 2
-; APPLE-NEXT:    store i8 [[RES_EPIL1]], ptr [[GEP_5_EPIL1]], align 1
+; APPLE-NEXT:    [[C_5_EPIL:%.*]] = icmp sgt i32 [[L_1_EPIL]], [[T_2]]
+; APPLE-NEXT:    [[GEP_5_EPIL:%.*]] = getelementptr inbounds nuw i8, ptr [[GEP_EPIL]], i64 5
+; APPLE-NEXT:    [[RES_EPIL:%.*]] = select i1 [[C_5_EPIL]], i8 1, i8 2
+; APPLE-NEXT:    store i8 [[RES_EPIL]], ptr [[GEP_5_EPIL]], align 1
 ; APPLE-NEXT:    br label %[[LOOP_LATCH_EPIL]]
 ; APPLE:       [[LOOP_LATCH_EPIL]]:
-; APPLE-NEXT:    [[IV_NEXT_EPIL1]] = add nuw nsw i64 [[IV_EPIL1]], 1
-; APPLE-NEXT:    [[EC_EPIL:%.*]] = icmp eq i64 [[IV_NEXT_EPIL1]], [[N]]
+; APPLE-NEXT:    [[IV_NEXT_EPIL]] = add nuw nsw i64 [[IV_EPIL]], 1
+; APPLE-NEXT:    [[EC_EPIL:%.*]] = icmp eq i64 [[IV_NEXT_EPIL]], [[N]]
 ; APPLE-NEXT:    [[EPIL_ITER_NEXT]] = add i64 [[EPIL_ITER]], 1
 ; APPLE-NEXT:    [[EPIL_ITER_CMP:%.*]] = icmp ne i64 [[EPIL_ITER_NEXT]], [[XTRAITER]]
 ; APPLE-NEXT:    br i1 [[EPIL_ITER_CMP]], label %[[LOOP_HEADER_EPIL]], label %[[EXIT_EPILOG_LCSSA:.*]], !llvm.loop [[LOOP2:![0-9]+]]
@@ -651,6 +713,66 @@ define void @early_continue_dep_on_load_large(ptr %p.1, ptr %p.2, i64 %N, i32 %x
 ; APPLE:       [[EXIT]]:
 ; APPLE-NEXT:    ret void
 ;
+; APPLE-A17-LABEL: define void @early_continue_dep_on_load_large(
+; APPLE-A17-SAME: ptr [[P_1:%.*]], ptr [[P_2:%.*]], i64 [[N:%.*]], i32 [[X:%.*]], i32 [[WIDTH:%.*]], i32 [[T_1:%.*]], i32 [[T_2:%.*]]) #[[ATTR0]] {
+; APPLE-A17-NEXT:  [[ENTRY:.*]]:
+; APPLE-A17-NEXT:    br label %[[LOOP_HEADER:.*]]
+; APPLE-A17:       [[LOOP_HEADER]]:
+; APPLE-A17-NEXT:    [[IV:%.*]] = phi i64 [ 1, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
+; APPLE-A17-NEXT:    [[GEP:%.*]] = getelementptr { i32, i8, i8, [2 x i8] }, ptr [[P_1]], i64 [[IV]]
+; APPLE-A17-NEXT:    [[L_1:%.*]] = load i32, ptr [[GEP]], align 4
+; APPLE-A17-NEXT:    [[C_1:%.*]] = icmp sgt i32 [[L_1]], [[T_1]]
+; APPLE-A17-NEXT:    br i1 [[C_1]], label %[[THEN:.*]], label %[[LOOP_LATCH]]
+; APPLE-A17:       [[THEN]]:
+; APPLE-A17-NEXT:    [[GEP_4:%.*]] = getelementptr inbounds nuw i8, ptr [[GEP]], i64 4
+; APPLE-A17-NEXT:    [[L_2:%.*]] = load i8, ptr [[GEP_4]], align 4
+; APPLE-A17-NEXT:    [[C_2:%.*]] = icmp ugt i8 [[L_2]], 7
+; APPLE-A17-NEXT:    br i1 [[C_2]], label %[[MERGE:.*]], label %[[ELSE:.*]]
+; APPLE-A17:       [[ELSE]]:
+; APPLE-A17-NEXT:    [[CONV_I:%.*]] = zext nneg i8 [[L_2]] to i64
+; APPLE-A17-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds [9 x i8], ptr @A, i64 0, i64 [[CONV_I]]
+; APPLE-A17-NEXT:    [[L_3:%.*]] = load i8, ptr [[GEP_A]], align 1
+; APPLE-A17-NEXT:    [[IDXPROM_I:%.*]] = sext i8 [[L_3]] to i64
+; APPLE-A17-NEXT:    [[GEP_B:%.*]] = getelementptr inbounds [8 x i32], ptr @B, i64 0, i64 [[IDXPROM_I]]
+; APPLE-A17-NEXT:    [[L_4:%.*]] = load i32, ptr [[GEP_B]], align 4
+; APPLE-A17-NEXT:    [[GEP_C:%.*]] = getelementptr inbounds [8 x i32], ptr @C, i64 0, i64 [[IDXPROM_I]]
+; APPLE-A17-NEXT:    [[L_5:%.*]] = load i32, ptr [[GEP_C]], align 4
+; APPLE-A17-NEXT:    br label %[[MERGE]]
+; APPLE-A17:       [[MERGE]]:
+; APPLE-A17-NEXT:    [[MERGE_1:%.*]] = phi i32 [ 0, %[[THEN]] ], [ [[L_4]], %[[ELSE]] ]
+; APPLE-A17-NEXT:    [[MERGE_2:%.*]] = phi i32 [ 0, %[[THEN]] ], [ [[L_5]], %[[ELSE]] ]
+; APPLE-A17-NEXT:    [[ADD14:%.*]] = add nsw i32 [[MERGE_2]], [[X]]
+; APPLE-A17-NEXT:    [[MUL15:%.*]] = mul nsw i32 [[ADD14]], [[WIDTH]]
+; APPLE-A17-NEXT:    [[TMP0:%.*]] = trunc nuw nsw i64 [[IV]] to i32
+; APPLE-A17-NEXT:    [[ADD16:%.*]] = add nsw i32 [[MERGE_1]], [[TMP0]]
+; APPLE-A17-NEXT:    [[ADD17:%.*]] = add nsw i32 [[ADD16]], [[MUL15]]
+; APPLE-A17-NEXT:    [[IDXPROM18:%.*]] = sext i32 [[ADD17]] to i64
+; APPLE-A17-NEXT:    [[GEP_P_2:%.*]] = getelementptr inbounds { i32, i8, i8, [2 x i8] }, ptr [[P_2]], i64 [[IDXPROM18]]
+; APPLE-A17-NEXT:    [[L_6:%.*]] = load i32, ptr [[GEP_P_2]], align 4
+; APPLE-A17-NEXT:    [[SUB:%.*]] = sub nsw i32 [[X]], [[MERGE_2]]
+; APPLE-A17-NEXT:    [[MUL21:%.*]] = mul nsw i32 [[SUB]], [[WIDTH]]
+; APPLE-A17-NEXT:    [[SUB22:%.*]] = sub i32 [[TMP0]], [[MERGE_1]]
+; APPLE-A17-NEXT:    [[ADD23:%.*]] = add nsw i32 [[SUB22]], [[MUL21]]
+; APPLE-A17-NEXT:    [[IDXPROM24:%.*]] = sext i32 [[ADD23]] to i64
+; APPLE-A17-NEXT:    [[GEP_P2_1:%.*]] = getelementptr inbounds { i32, i8, i8, [2 x i8] }, ptr [[P_2]], i64 [[IDXPROM24]]
+; APPLE-A17-NEXT:    [[L_7:%.*]] = load i32, ptr [[GEP_P2_1]], align 4
+; APPLE-A17-NEXT:    [[C_3:%.*]] = icmp sgt i32 [[L_1]], [[L_6]]
+; APPLE-A17-NEXT:    [[C_4:%.*]] = icmp sgt i32 [[L_1]], [[L_7]]
+; APPLE-A17-NEXT:    [[AND34:%.*]] = and i1 [[C_3]], [[C_4]]
+; APPLE-A17-NEXT:    br i1 [[AND34]], label %[[STORE_RES:.*]], label %[[LOOP_LATCH]]
+; APPLE-A17:       [[STORE_RES]]:
+; APPLE-A17-NEXT:    [[C_5:%.*]] = icmp sgt i32 [[L_1]], [[T_2]]
+; APPLE-A17-NEXT:    [[GEP_5:%.*]] = getelementptr inbounds nuw i8, ptr [[GEP]], i64 5
+; APPLE-A17-NEXT:    [[RES:%.*]] = select i1 [[C_5]], i8 1, i8 2
+; APPLE-A17-NEXT:    store i8 [[RES]], ptr [[GEP_5]], align 1
+; APPLE-A17-NEXT:    br label %[[LOOP_LATCH]]
+; APPLE-A17:       [[LOOP_LATCH]]:
+; APPLE-A17-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; APPLE-A17-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; APPLE-A17-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP_HEADER]]
+; APPLE-A17:       [[EXIT]]:
+; APPLE-A17-NEXT:    ret void
+;
 ; OTHER-LABEL: define void @early_continue_dep_on_load_large(
 ; OTHER-SAME: ptr [[P_1:%.*]], ptr [[P_2:%.*]], i64 [[N:%.*]], i32 [[X:%.*]], i32 [[WIDTH:%.*]], i32 [[T_1:%.*]], i32 [[T_2:%.*]]) #[[ATTR0]] {
 ; OTHER-NEXT:  [[ENTRY:.*]]:
@@ -813,6 +935,23 @@ define i32 @test_add_reduction_unroll_partial(ptr %a, i64 noundef %n) {
 ; APPLE-NEXT:    [[BIN_RDX2:%.*]] = add i32 [[RDX_NEXT_3]], [[BIN_RDX1]]
 ; APPLE-NEXT:    ret i32 [[BIN_RDX2]]
 ;
+; APPLE-A17-LABEL: define i32 @test_add_reduction_unroll_partial(
+; APPLE-A17-SAME: ptr [[A:%.*]], i64 noundef [[N:%.*]]) #[[ATTR0]] {
+; APPLE-A17-NEXT:  [[ENTRY:.*]]:
+; APPLE-A17-NEXT:    br label %[[LOOP:.*]]
+; APPLE-A17:       [[LOOP]]:
+; APPLE-A17-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    [[RDX:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[RDX_NEXT:%.*]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV]]
+; APPLE-A17-NEXT:    [[TMP0:%.*]] = load i32, ptr [[GEP_A]], align 2
+; APPLE-A17-NEXT:    [[RDX_NEXT]] = add nuw nsw i32 [[RDX]], [[TMP0]]
+; APPLE-A17-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; APPLE-A17-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 1024
+; APPLE-A17-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
+; APPLE-A17:       [[EXIT]]:
+; APPLE-A17-NEXT:    [[RES:%.*]] = phi i32 [ [[RDX_NEXT]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    ret i32 [[RES]]
+;
 ; OTHER-LABEL: define i32 @test_add_reduction_unroll_partial(
 ; OTHER-SAME: ptr [[A:%.*]], i64 noundef [[N:%.*]]) #[[ATTR0]] {
 ; OTHER-NEXT:  [[ENTRY:.*]]:
@@ -826,11 +965,11 @@ define i32 @test_add_reduction_unroll_partial(ptr %a, i64 noundef %n) {
 ; OTHER-NEXT:    [[IV_NEXT:%.*]] = add nuw nsw i64 [[IV]], 1
 ; OTHER-NEXT:    [[GEP_A_1:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_NEXT]]
 ; OTHER-NEXT:    [[TMP1:%.*]] = load i32, ptr [[GEP_A_1]], align 2
-; OTHER-NEXT:    [[RDX_2:%.*]] = add nuw nsw i32 [[RDX_NEXT]], [[TMP1]]
+; OTHER-NEXT:    [[RDX_NEXT_1:%.*]] = add nuw nsw i32 [[RDX_NEXT]], [[TMP1]]
 ; OTHER-NEXT:    [[IV_NEXT_1:%.*]] = add nuw nsw i64 [[IV]], 2
 ; OTHER-NEXT:    [[GEP_A_2:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_NEXT_1]]
 ; OTHER-NEXT:    [[TMP2:%.*]] = load i32, ptr [[GEP_A_2]], align 2
-; OTHER-NEXT:    [[RDX_NEXT_2:%.*]] = add nuw nsw i32 [[RDX_2]], [[TMP2]]
+; OTHER-NEXT:    [[RDX_NEXT_2:%.*]] = add nuw nsw i32 [[RDX_NEXT_1]], [[TMP2]]
 ; OTHER-NEXT:    [[IV_NEXT_2:%.*]] = add nuw nsw i64 [[IV]], 3
 ; OTHER-NEXT:    [[GEP_A_3:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_NEXT_2]]
 ; OTHER-NEXT:    [[TMP3:%.*]] = load i32, ptr [[GEP_A_3]], align 2
@@ -839,8 +978,8 @@ define i32 @test_add_reduction_unroll_partial(ptr %a, i64 noundef %n) {
 ; OTHER-NEXT:    [[EC_3:%.*]] = icmp eq i64 [[IV_NEXT_3]], 1024
 ; OTHER-NEXT:    br i1 [[EC_3]], label %[[EXIT:.*]], label %[[LOOP]]
 ; OTHER:       [[EXIT]]:
-; OTHER-NEXT:    [[BIN_RDX2:%.*]] = phi i32 [ [[RDX_NEXT_3]], %[[LOOP]] ]
-; OTHER-NEXT:    ret i32 [[BIN_RDX2]]
+; OTHER-NEXT:    [[RES:%.*]] = phi i32 [ [[RDX_NEXT_3]], %[[LOOP]] ]
+; OTHER-NEXT:    ret i32 [[RES]]
 ;
 entry:
   br label %loop
@@ -886,6 +1025,29 @@ define i32 @test_add_reduction_multi_block(ptr %a, i64 noundef %n) {
 ; APPLE-NEXT:    [[RES:%.*]] = phi i32 [ [[RDX_NEXT]], %[[LOOP_LATCH]] ]
 ; APPLE-NEXT:    ret i32 [[RES]]
 ;
+; APPLE-A17-LABEL: define i32 @test_add_reduction_multi_block(
+; APPLE-A17-SAME: ptr [[A:%.*]], i64 noundef [[N:%.*]]) #[[ATTR0]] {
+; APPLE-A17-NEXT:  [[ENTRY:.*]]:
+; APPLE-A17-NEXT:    br label %[[LOOP:.*]]
+; APPLE-A17:       [[LOOP]]:
+; APPLE-A17-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
+; APPLE-A17-NEXT:    [[RDX:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[RDX_NEXT:%.*]], %[[LOOP_LATCH]] ]
+; APPLE-A17-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV]]
+; APPLE-A17-NEXT:    [[TMP0:%.*]] = load i32, ptr [[GEP_A]], align 2
+; APPLE-A17-NEXT:    [[C:%.*]] = call i1 @cond()
+; APPLE-A17-NEXT:    br i1 [[C]], label %[[THEN:.*]], label %[[LOOP_LATCH]]
+; APPLE-A17:       [[THEN]]:
+; APPLE-A17-NEXT:    store i32 0, ptr [[GEP_A]], align 4
+; APPLE-A17-NEXT:    br label %[[LOOP_LATCH]]
+; APPLE-A17:       [[LOOP_LATCH]]:
+; APPLE-A17-NEXT:    [[RDX_NEXT]] = add nuw nsw i32 [[RDX]], [[TMP0]]
+; APPLE-A17-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; APPLE-A17-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 1024
+; APPLE-A17-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
+; APPLE-A17:       [[EXIT]]:
+; APPLE-A17-NEXT:    [[RES:%.*]] = phi i32 [ [[RDX_NEXT]], %[[LOOP_LATCH]] ]
+; APPLE-A17-NEXT:    ret i32 [[RES]]
+;
 ; OTHER-LABEL: define i32 @test_add_reduction_multi_block(
 ; OTHER-SAME: ptr [[A:%.*]], i64 noundef [[N:%.*]]) #[[ATTR0]] {
 ; OTHER-NEXT:  [[ENTRY:.*]]:
@@ -942,19 +1104,19 @@ define i32 @test_add_and_mul_reduction_unroll_partial(ptr %a, i64 noundef %n) {
 ; APPLE-NEXT:    br label %[[LOOP:.*]]
 ; APPLE:       [[LOOP]]:
 ; APPLE-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT_3:%.*]], %[[LOOP]] ]
-; APPLE-NEXT:    [[RDX_1:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[BIN_RDX3:%.*]], %[[LOOP]] ]
+; APPLE-NEXT:    [[RDX_1:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[RDX_NEXT_1:%.*]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[RDX_21:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[RDX_NEXT_2:%.*]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[RDX_3:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[RDX_NEXT_3:%.*]], %[[LOOP]] ]
-; APPLE-NEXT:    [[RDX:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[RES_2:%.*]], %[[LOOP]] ]
+; APPLE-NEXT:    [[RDX:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[RDX_NEXT:%.*]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[RDX_2:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[RDX_2_NEXT_3:%.*]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV]]
 ; APPLE-NEXT:    [[TMP0:%.*]] = load i32, ptr [[GEP_A]], align 2
-; APPLE-NEXT:    [[RES_2]] = add i32 [[RDX]], [[TMP0]]
+; APPLE-NEXT:    [[RDX_NEXT]] = add i32 [[RDX]], [[TMP0]]
 ; APPLE-NEXT:    [[RDX_2_NEXT:%.*]] = mul i32 [[RDX_2]], [[TMP0]]
 ; APPLE-NEXT:    [[IV_NEXT:%.*]] = add nuw nsw i64 [[IV]], 1
 ; APPLE-NEXT:    [[GEP_A_1:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_NEXT]]
 ; APPLE-NEXT:    [[TMP1:%.*]] = load i32, ptr [[GEP_A_1]], align 2
-; APPLE-NEXT:    [[BIN_RDX3]] = add i32 [[RDX_1]], [[TMP1]]
+; APPLE-NEXT:    [[RDX_NEXT_1]] = add i32 [[RDX_1]], [[TMP1]]
 ; APPLE-NEXT:    [[RDX_2_NEXT_1:%.*]] = mul i32 [[RDX_2_NEXT]], [[TMP1]]
 ; APPLE-NEXT:    [[IV_NEXT_1:%.*]] = add nuw nsw i64 [[IV]], 2
 ; APPLE-NEXT:    [[GEP_A_2:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_NEXT_1]]
@@ -971,12 +1133,33 @@ define i32 @test_add_and_mul_reduction_unroll_partial(ptr %a, i64 noundef %n) {
 ; APPLE-NEXT:    br i1 [[EC_3]], label %[[EXIT:.*]], label %[[LOOP]]
 ; APPLE:       [[EXIT]]:
 ; APPLE-NEXT:    [[RES_1:%.*]] = phi i32 [ [[RDX_NEXT_3]], %[[LOOP]] ]
-; APPLE-NEXT:    [[RES_3:%.*]] = phi i32 [ [[RDX_2_NEXT_3]], %[[LOOP]] ]
+; APPLE-NEXT:    [[RES_2:%.*]] = phi i32 [ [[RDX_2_NEXT_3]], %[[LOOP]] ]
+; APPLE-NEXT:    [[BIN_RDX:%.*]] = add i32 [[RDX_NEXT_1]], [[RDX_NEXT]]
+; APPLE-NEXT:    [[BIN_RDX2:%.*]] = add i32 [[RDX_NEXT_2]], [[BIN_RDX]]
+; APPLE-NEXT:    [[BIN_RDX3:%.*]] = add i32 [[RDX_NEXT_3]], [[BIN_RDX2]]
 ; APPLE-NEXT:    [[SUM:%.*]] = add i32 [[BIN_RDX3]], [[RES_2]]
-; APPLE-NEXT:    [[BIN_RDX2:%.*]] = add i32 [[RDX_NEXT_2]], [[SUM]]
-; APPLE-NEXT:    [[BIN_RDX4:%.*]] = add i32 [[RDX_NEXT_3]], [[BIN_RDX2]]
-; APPLE-NEXT:    [[SUM1:%.*]] = add i32 [[BIN_RDX4]], [[RES_3]]
-; APPLE-NEXT:    ret i32 [[SUM1]]
+; APPLE-NEXT:    ret i32 [[SUM]]
+;
+; APPLE-A17-LABEL: define i32 @test_add_and_mul_reduction_unroll_partial(
+; APPLE-A17-SAME: ptr [[A:%.*]], i64 noundef [[N:%.*]]) #[[ATTR0]] {
+; APPLE-A17-NEXT:  [[ENTRY:.*]]:
+; APPLE-A17-NEXT:    br label %[[LOOP:.*]]
+; APPLE-A17:       [[LOOP]]:
+; APPLE-A17-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    [[RDX:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[RDX_NEXT:%.*]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    [[RDX_2:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[RDX_2_NEXT:%.*]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV]]
+; APPLE-A17-NEXT:    [[TMP0:%.*]] = load i32, ptr [[GEP_A]], align 2
+; APPLE-A17-NEXT:    [[RDX_NEXT]] = add nuw nsw i32 [[RDX]], [[TMP0]]
+; APPLE-A17-NEXT:    [[RDX_2_NEXT]] = mul i32 [[RDX_2]], [[TMP0]]
+; APPLE-A17-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; APPLE-A17-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 1024
+; APPLE-A17-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
+; APPLE-A17:       [[EXIT]]:
+; APPLE-A17-NEXT:    [[RES_1:%.*]] = phi i32 [ [[RDX_NEXT]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    [[RES_2:%.*]] = phi i32 [ [[RDX_2_NEXT]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    [[SUM:%.*]] = add i32 [[RES_1]], [[RES_2]]
+; APPLE-A17-NEXT:    ret i32 [[SUM]]
 ;
 ; OTHER-LABEL: define i32 @test_add_and_mul_reduction_unroll_partial(
 ; OTHER-SAME: ptr [[A:%.*]], i64 noundef [[N:%.*]]) #[[ATTR0]] {
@@ -999,9 +1182,9 @@ define i32 @test_add_and_mul_reduction_unroll_partial(ptr %a, i64 noundef %n) {
 ; OTHER-NEXT:    [[EC_1:%.*]] = icmp eq i64 [[IV_NEXT_1]], 1024
 ; OTHER-NEXT:    br i1 [[EC_1]], label %[[EXIT:.*]], label %[[LOOP]]
 ; OTHER:       [[EXIT]]:
-; OTHER-NEXT:    [[BIN_RDX:%.*]] = phi i32 [ [[RDX_NEXT_1]], %[[LOOP]] ]
+; OTHER-NEXT:    [[RES_1:%.*]] = phi i32 [ [[RDX_NEXT_1]], %[[LOOP]] ]
 ; OTHER-NEXT:    [[RES_2:%.*]] = phi i32 [ [[RDX_2_NEXT_1]], %[[LOOP]] ]
-; OTHER-NEXT:    [[SUM:%.*]] = add i32 [[BIN_RDX]], [[RES_2]]
+; OTHER-NEXT:    [[SUM:%.*]] = add i32 [[RES_1]], [[RES_2]]
 ; OTHER-NEXT:    ret i32 [[SUM]]
 ;
 entry:
@@ -1039,28 +1222,28 @@ define i32 @test_add_reduction_runtime(ptr %a, i64 noundef %n) {
 ; APPLE-NEXT:    [[UNROLL_ITER:%.*]] = sub i64 [[N]], [[XTRAITER]]
 ; APPLE-NEXT:    br label %[[LOOP:.*]]
 ; APPLE:       [[LOOP]]:
-; APPLE-NEXT:    [[IV_EPIL:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[IV_NEXT_3:%.*]], %[[LOOP]] ]
+; APPLE-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[IV_NEXT_3:%.*]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[RDX_1:%.*]] = phi i32 [ 0, %[[ENTRY_NEW]] ], [ [[RDX_NEXT_1:%.*]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[RDX_2:%.*]] = phi i32 [ 0, %[[ENTRY_NEW]] ], [ [[RDX_NEXT_2:%.*]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[RDX_3:%.*]] = phi i32 [ 0, %[[ENTRY_NEW]] ], [ [[RDX_NEXT_3:%.*]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[RDX:%.*]] = phi i32 [ 0, %[[ENTRY_NEW]] ], [ [[RDX_NEXT:%.*]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[NITER:%.*]] = phi i64 [ 0, %[[ENTRY_NEW]] ], [ [[NITER_NEXT_3:%.*]], %[[LOOP]] ]
-; APPLE-NEXT:    [[GEP_A_EPIL:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_EPIL]]
-; APPLE-NEXT:    [[TMP6:%.*]] = load i32, ptr [[GEP_A_EPIL]], align 2
-; APPLE-NEXT:    [[RDX_NEXT]] = add i32 [[RDX]], [[TMP6]]
-; APPLE-NEXT:    [[IV_NEXT:%.*]] = add nuw nsw i64 [[IV_EPIL]], 1
+; APPLE-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV]]
+; APPLE-NEXT:    [[TMP2:%.*]] = load i32, ptr [[GEP_A]], align 2
+; APPLE-NEXT:    [[RDX_NEXT]] = add i32 [[RDX]], [[TMP2]]
+; APPLE-NEXT:    [[IV_NEXT:%.*]] = add nuw nsw i64 [[IV]], 1
 ; APPLE-NEXT:    [[GEP_A_1:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_NEXT]]
 ; APPLE-NEXT:    [[TMP3:%.*]] = load i32, ptr [[GEP_A_1]], align 2
 ; APPLE-NEXT:    [[RDX_NEXT_1]] = add i32 [[RDX_1]], [[TMP3]]
-; APPLE-NEXT:    [[IV_NEXT_1:%.*]] = add nuw nsw i64 [[IV_EPIL]], 2
+; APPLE-NEXT:    [[IV_NEXT_1:%.*]] = add nuw nsw i64 [[IV]], 2
 ; APPLE-NEXT:    [[GEP_A_2:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_NEXT_1]]
 ; APPLE-NEXT:    [[TMP4:%.*]] = load i32, ptr [[GEP_A_2]], align 2
 ; APPLE-NEXT:    [[RDX_NEXT_2]] = add i32 [[RDX_2]], [[TMP4]]
-; APPLE-NEXT:    [[IV_NEXT_2:%.*]] = add nuw nsw i64 [[IV_EPIL]], 3
+; APPLE-NEXT:    [[IV_NEXT_2:%.*]] = add nuw nsw i64 [[IV]], 3
 ; APPLE-NEXT:    [[GEP_A_3:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_NEXT_2]]
 ; APPLE-NEXT:    [[TMP5:%.*]] = load i32, ptr [[GEP_A_3]], align 2
 ; APPLE-NEXT:    [[RDX_NEXT_3]] = add i32 [[RDX_3]], [[TMP5]]
-; APPLE-NEXT:    [[IV_NEXT_3]] = add nuw nsw i64 [[IV_EPIL]], 4
+; APPLE-NEXT:    [[IV_NEXT_3]] = add nuw nsw i64 [[IV]], 4
 ; APPLE-NEXT:    [[NITER_NEXT_3]] = add nuw i64 [[NITER]], 4
 ; APPLE-NEXT:    [[NITER_NCMP_3:%.*]] = icmp eq i64 [[NITER_NEXT_3]], [[UNROLL_ITER]]
 ; APPLE-NEXT:    br i1 [[NITER_NCMP_3]], label %[[EXIT_UNR_LCSSA:.*]], label %[[LOOP]]
@@ -1069,24 +1252,24 @@ define i32 @test_add_reduction_runtime(ptr %a, i64 noundef %n) {
 ; APPLE-NEXT:    [[IV_UNR:%.*]] = phi i64 [ [[IV_NEXT_3]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[RDX_UNR:%.*]] = phi i32 [ [[RDX_NEXT_3]], %[[LOOP]] ]
 ; APPLE-NEXT:    [[BIN_RDX:%.*]] = add i32 [[RDX_NEXT_1]], [[RDX_NEXT]]
-; APPLE-NEXT:    [[BIN_RDX2:%.*]] = add i32 [[RDX_NEXT_2]], [[BIN_RDX]]
-; APPLE-NEXT:    [[BIN_RDX3:%.*]] = add i32 [[RDX_NEXT_3]], [[BIN_RDX2]]
+; APPLE-NEXT:    [[BIN_RDX3:%.*]] = add i32 [[RDX_NEXT_2]], [[BIN_RDX]]
+; APPLE-NEXT:    [[BIN_RDX4:%.*]] = add i32 [[RDX_NEXT_3]], [[BIN_RDX3]]
 ; APPLE-NEXT:    [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
 ; APPLE-NEXT:    br i1 [[LCMP_MOD]], label %[[LOOP_EPIL_PREHEADER]], label %[[EXIT:.*]]
 ; APPLE:       [[LOOP_EPIL_PREHEADER]]:
 ; APPLE-NEXT:    [[IV_EPIL_INIT:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_UNR]], %[[EXIT_UNR_LCSSA]] ]
-; APPLE-NEXT:    [[RDX_EPIL_INIT:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[BIN_RDX3]], %[[EXIT_UNR_LCSSA]] ]
+; APPLE-NEXT:    [[RDX_EPIL_INIT:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[BIN_RDX4]], %[[EXIT_UNR_LCSSA]] ]
 ; APPLE-NEXT:    [[LCMP_MOD2:%.*]] = icmp ne i64 [[XTRAITER]], 0
 ; APPLE-NEXT:    call void @llvm.assume(i1 [[LCMP_MOD2]])
 ; APPLE-NEXT:    br label %[[LOOP_EPIL:.*]]
 ; APPLE:       [[LOOP_EPIL]]:
-; APPLE-NEXT:    [[IV_EPIL1:%.*]] = phi i64 [ [[IV_EPIL_INIT]], %[[LOOP_EPIL_PREHEADER]] ], [ [[IV_NEXT_EPIL:%.*]], %[[LOOP_EPIL]] ]
+; APPLE-NEXT:    [[IV_EPIL:%.*]] = phi i64 [ [[IV_EPIL_INIT]], %[[LOOP_EPIL_PREHEADER]] ], [ [[IV_NEXT_EPIL:%.*]], %[[LOOP_EPIL]] ]
 ; APPLE-NEXT:    [[RDX_EPIL:%.*]] = phi i32 [ [[RDX_EPIL_INIT]], %[[LOOP_EPIL_PREHEADER]] ], [ [[RDX_NEXT_EPIL:%.*]], %[[LOOP_EPIL]] ]
 ; APPLE-NEXT:    [[EPIL_ITER:%.*]] = phi i64 [ 0, %[[LOOP_EPIL_PREHEADER]] ], [ [[EPIL_ITER_NEXT:%.*]], %[[LOOP_EPIL]] ]
-; APPLE-NEXT:    [[GEP_A_EPIL1:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_EPIL1]]
-; APPLE-NEXT:    [[TMP7:%.*]] = load i32, ptr [[GEP_A_EPIL1]], align 2
-; APPLE-NEXT:    [[RDX_NEXT_EPIL]] = add nuw nsw i32 [[RDX_EPIL]], [[TMP7]]
-; APPLE-NEXT:    [[IV_NEXT_EPIL]] = add nuw nsw i64 [[IV_EPIL1]], 1
+; APPLE-NEXT:    [[GEP_A_EPIL:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_EPIL]]
+; APPLE-NEXT:    [[TMP6:%.*]] = load i32, ptr [[GEP_A_EPIL]], align 2
+; APPLE-NEXT:    [[RDX_NEXT_EPIL]] = add nuw nsw i32 [[RDX_EPIL]], [[TMP6]]
+; APPLE-NEXT:    [[IV_NEXT_EPIL]] = add nuw nsw i64 [[IV_EPIL]], 1
 ; APPLE-NEXT:    [[EC_EPIL:%.*]] = icmp eq i64 [[IV_NEXT_EPIL]], [[N]]
 ; APPLE-NEXT:    [[EPIL_ITER_NEXT]] = add i64 [[EPIL_ITER]], 1
 ; APPLE-NEXT:    [[EPIL_ITER_CMP:%.*]] = icmp ne i64 [[EPIL_ITER_NEXT]], [[XTRAITER]]
@@ -1095,9 +1278,26 @@ define i32 @test_add_reduction_runtime(ptr %a, i64 noundef %n) {
 ; APPLE-NEXT:    [[RES_PH1:%.*]] = phi i32 [ [[RDX_NEXT_EPIL]], %[[LOOP_EPIL]] ]
 ; APPLE-NEXT:    br label %[[EXIT]]
 ; APPLE:       [[EXIT]]:
-; APPLE-NEXT:    [[RES:%.*]] = phi i32 [ [[BIN_RDX3]], %[[EXIT_UNR_LCSSA]] ], [ [[RES_PH1]], %[[EXIT_EPILOG_LCSSA]] ]
+; APPLE-NEXT:    [[RES:%.*]] = phi i32 [ [[BIN_RDX4]], %[[EXIT_UNR_LCSSA]] ], [ [[RES_PH1]], %[[EXIT_EPILOG_LCSSA]] ]
 ; APPLE-NEXT:    ret i32 [[RES]]
 ;
+; APPLE-A17-LABEL: define i32 @test_add_reduction_runtime(
+; APPLE-A17-SAME: ptr [[A:%.*]], i64 noundef [[N:%.*]]) #[[ATTR0]] {
+; APPLE-A17-NEXT:  [[ENTRY:.*]]:
+; APPLE-A17-NEXT:    br label %[[LOOP:.*]]
+; APPLE-A17:       [[LOOP]]:
+; APPLE-A17-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    [[RDX:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[RDX_NEXT:%.*]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV]]
+; APPLE-A17-NEXT:    [[TMP0:%.*]] = load i32, ptr [[GEP_A]], align 2
+; APPLE-A17-NEXT:    [[RDX_NEXT]] = add nuw nsw i32 [[RDX]], [[TMP0]]
+; APPLE-A17-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; APPLE-A17-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; APPLE-A17-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
+; APPLE-A17:       [[EXIT]]:
+; APPLE-A17-NEXT:    [[RES:%.*]] = phi i32 [ [[RDX_NEXT]], %[[LOOP]] ]
+; APPLE-A17-NEXT:    ret i32 [[RES]]
+;
 ; OTHER-LABEL: define i32 @test_add_reduction_runtime(
 ; OTHER-SAME: ptr [[A:%.*]], i64 noundef [[N:%.*]]) #[[ATTR0]] {
 ; OTHER-NEXT:  [[ENTRY:.*]]:
@@ -1118,11 +1318,11 @@ define i32 @test_add_reduction_runtime(ptr %a, i64 noundef %n) {
 ; OTHER-NEXT:    [[IV_NEXT:%.*]] = add nuw nsw i64 [[IV]], 1
 ; OTHER-NEXT:    [[GEP_A_1:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_NEXT]]
 ; OTHER-NEXT:    [[TMP3:%.*]] = load i32, ptr [[GEP_A_1]], align 2
-; OTHER-NEXT:    [[RDX_2:%.*]] = add nuw nsw i32 [[RDX_NEXT]], [[TMP3]]
+; OTHER-NEXT:    [[RDX_NEXT_1:%.*]] = add nuw nsw i32 [[RDX_NEXT]], [[TMP3]]
 ; OTHER-NEXT:    [[IV_NEXT_1:%.*]] = add nuw nsw i64 [[IV]], 2
 ; OTHER-NEXT:    [[GEP_A_2:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_NEXT_1]]
 ; OTHER-NEXT:    [[TMP4:%.*]] = load i32, ptr [[GEP_A_2]], align 2
-; OTHER-NEXT:    [[RDX_NEXT_2:%.*]] = add nuw nsw i32 [[RDX_2]], [[TMP4]]
+; OTHER-NEXT:    [[RDX_NEXT_2:%.*]] = add nuw nsw i32 [[RDX_NEXT_1]], [[TMP4]]
 ; OTHER-NEXT:    [[IV_NEXT_2:%.*]] = add nuw nsw i64 [[IV]], 3
 ; OTHER-NEXT:    [[GEP_A_3:%.*]] = getelementptr inbounds nuw i32, ptr [[A]], i64 [[IV_NEXT_2]]
 ; OTHER-NEXT:    [[TMP5:%.*]] = load i32, ptr [[GEP_A_3]], align 2
diff --git a/llvm/test/Transforms/LoopUnroll/peel-multiple-unreachable-exits.ll b/llvm/test/Transforms/LoopUnroll/peel-multiple-unreachable-exits.ll
index d8be878a23586..b2d7f3b21a59a 100644
--- a/llvm/test/Transforms/LoopUnroll/peel-multiple-unreachable-exits.ll
+++ b/llvm/test/Transforms/LoopUnroll/peel-multiple-unreachable-exits.ll
@@ -43,7 +43,7 @@ define void @peel_unreachable_exit_and_latch_exit(ptr %ptr, i32 %N, i32 %x) {
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i32, ptr [[PTR]], i32 [[IV]]
 ; CHECK-NEXT:    store i32 [[M]], ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i32 [[IV]], 1
-; CHECK-NEXT:    [[C_3:%.*]] = icmp ult i32 [[IV]], 1000
+; CHECK-NEXT:    [[C_3:%.*]] = icmp samesign ult i32 [[IV]], 1000
 ; CHECK-NEXT:    br i1 [[C_3]], label [[LOOP_HEADER]], label [[EXIT_LOOPEXIT:%.*]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK:       exit.loopexit:
 ; CHECK-NEXT:    br label [[EXIT]]
@@ -172,7 +172,7 @@ define void @peel_unreachable_and_multiple_reachable_exits(ptr %ptr, i32 %N, i32
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i32, ptr [[PTR]], i32 [[IV]]
 ; CHECK-NEXT:    store i32 [[M]], ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i32 [[IV]], 1
-; CHECK-NEXT:    [[C_4:%.*]] = icmp ult i32 [[IV]], 1000
+; CHECK-NEXT:    [[C_4:%.*]] = icmp samesign ult i32 [[IV]], 1000
 ; CHECK-NEXT:    br i1 [[C_4]], label [[LOOP_HEADER]], label [[EXIT_LOOPEXIT]], !llvm.loop [[LOOP2:![0-9]+]]
 ; CHECK:       exit.loopexit:
 ; CHECK-NEXT:    br label [[EXIT]]
@@ -256,7 +256,7 @@ define void @peel_exits_to_blocks_branch_to_unreachable_block(ptr %ptr, i32 %N,
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i32, ptr [[PTR]], i32 [[IV]]
 ; CHECK-NEXT:    store i32 [[M]], ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i32 [[IV]], 1
-; CHECK-NEXT:    [[C_3:%.*]] = icmp ult i32 [[IV]], 1000
+; CHECK-NEXT:    [[C_3:%.*]] = icmp samesign ult i32 [[IV]], 1000
 ; CHECK-NEXT:    br i1 [[C_3]], label [[LOOP_HEADER]], label [[EXIT_LOOPEXIT:%.*]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK:       exit.loopexit:
 ; CHECK-NEXT:    br label [[EXIT]]
diff --git a/llvm/test/Transforms/LoopUnroll/peel-to-turn-invariant-accesses-dereferenceable.ll b/llvm/test/Transforms/LoopUnroll/peel-to-turn-invariant-accesses-dereferenceable.ll
index 1098de0acd1a9..5bfbd5e98ba8d 100644
--- a/llvm/test/Transforms/LoopUnroll/peel-to-turn-invariant-accesses-dereferenceable.ll
+++ b/llvm/test/Transforms/LoopUnroll/peel-to-turn-invariant-accesses-dereferenceable.ll
@@ -41,7 +41,7 @@ define i32 @peel_readonly_to_make_loads_derefenceable(ptr %ptr, i32 %N, ptr %inv
 ; CHECK-NEXT:    [[LV:%.*]] = load i32, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[SUM_NEXT]] = add i32 [[SUM]], [[LV]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i32 [[IV]], 1
-; CHECK-NEXT:    [[C_3:%.*]] = icmp ult i32 [[IV]], 1000
+; CHECK-NEXT:    [[C_3:%.*]] = icmp samesign ult i32 [[IV]], 1000
 ; CHECK-NEXT:    br i1 [[C_3]], label [[LOOP_HEADER]], label [[EXIT_LOOPEXIT:%.*]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK:       exit.loopexit:
 ; CHECK-NEXT:    [[SUM_NEXT_LCSSA_PH:%.*]] = phi i32 [ [[SUM_NEXT]], [[LOOP_LATCH]] ]
diff --git a/llvm/test/Transforms/LoopUnroll/runtime-loop-multiexit-dom-verify.ll b/llvm/test/Transforms/LoopUnroll/runtime-loop-multiexit-dom-verify.ll
index de54852313456..bfbda4ad4036e 100644
--- a/llvm/test/Transforms/LoopUnroll/runtime-loop-multiexit-dom-verify.ll
+++ b/llvm/test/Transforms/LoopUnroll/runtime-loop-multiexit-dom-verify.ll
@@ -20,7 +20,7 @@ define i64 @test1() {
 ; CHECK:       header:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 2, [[PREHEADER]] ], [ [[ADD_IV_3:%.*]], [[LATCH_3:%.*]] ]
 ; CHECK-NEXT:    [[ADD_IV:%.*]] = add nuw nsw i64 [[IV]], 2
-; CHECK-NEXT:    [[CMP1:%.*]] = icmp ult i64 [[ADD_IV]], [[TRIP]]
+; CHECK-NEXT:    [[CMP1:%.*]] = icmp samesign ult i64 [[ADD_IV]], [[TRIP]]
 ; CHECK-NEXT:    br i1 [[CMP1]], label [[LATCH:%.*]], label [[HEADEREXIT:%.*]]
 ; CHECK:       latch:
 ; CHECK-NEXT:    [[SHFT:%.*]] = ashr i64 [[ADD_IV]], 1
@@ -28,7 +28,7 @@ define i64 @test1() {
 ; CHECK-NEXT:    br i1 [[CMP2]], label [[HEADER_1:%.*]], label [[LATCHEXIT:%.*]]
 ; CHECK:       header.1:
 ; CHECK-NEXT:    [[ADD_IV_1:%.*]] = add nuw nsw i64 [[IV]], 4
-; CHECK-NEXT:    [[CMP1_1:%.*]] = icmp ult i64 [[ADD_IV_1]], [[TRIP]]
+; CHECK-NEXT:    [[CMP1_1:%.*]] = icmp samesign ult i64 [[ADD_IV_1]], [[TRIP]]
 ; CHECK-NEXT:    br i1 [[CMP1_1]], label [[LATCH_1:%.*]], label [[HEADEREXIT]]
 ; CHECK:       latch.1:
 ; CHECK-NEXT:    [[SHFT_1:%.*]] = ashr i64 [[ADD_IV_1]], 1
@@ -36,7 +36,7 @@ define i64 @test1() {
 ; CHECK-NEXT:    br i1 [[CMP2_1]], label [[HEADER_2:%.*]], label [[LATCHEXIT]]
 ; CHECK:       header.2:
 ; CHECK-NEXT:    [[ADD_IV_2:%.*]] = add nuw nsw i64 [[IV]], 6
-; CHECK-NEXT:    [[CMP1_2:%.*]] = icmp ult i64 [[ADD_IV_2]], [[TRIP]]
+; CHECK-NEXT:    [[CMP1_2:%.*]] = icmp samesign ult i64 [[ADD_IV_2]], [[TRIP]]
 ; CHECK-NEXT:    br i1 [[CMP1_2]], label [[LATCH_2:%.*]], label [[HEADEREXIT]]
 ; CHECK:       latch.2:
 ; CHECK-NEXT:    [[SHFT_2:%.*]] = ashr i64 [[ADD_IV_2]], 1
@@ -44,7 +44,7 @@ define i64 @test1() {
 ; CHECK-NEXT:    br i1 [[CMP2_2]], label [[HEADER_3:%.*]], label [[LATCHEXIT]]
 ; CHECK:       header.3:
 ; CHECK-NEXT:    [[ADD_IV_3]] = add nuw nsw i64 [[IV]], 8
-; CHECK-NEXT:    [[CMP1_3:%.*]] = icmp ult i64 [[ADD_IV_3]], [[TRIP]]
+; CHECK-NEXT:    [[CMP1_3:%.*]] = icmp samesign ult i64 [[ADD_IV_3]], [[TRIP]]
 ; CHECK-NEXT:    br i1 [[CMP1_3]], label [[LATCH_3]], label [[HEADEREXIT]]
 ; CHECK:       latch.3:
 ; CHECK-NEXT:    [[SHFT_3:%.*]] = ashr i64 [[ADD_IV_3]], 1
@@ -102,7 +102,7 @@ define  void @test2(i1 %cond, i32 %n) {
 ; CHECK:       header:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 2, [[PREHEADER]] ], [ [[ADD_IV_3:%.*]], [[LATCH_3:%.*]] ]
 ; CHECK-NEXT:    [[ADD_IV:%.*]] = add nuw nsw i64 [[IV]], 2
-; CHECK-NEXT:    [[CMP1:%.*]] = icmp ult i64 [[ADD_IV]], [[TRIP]]
+; CHECK-NEXT:    [[CMP1:%.*]] = icmp samesign ult i64 [[ADD_IV]], [[TRIP]]
 ; CHECK-NEXT:    br i1 [[CMP1]], label [[LATCH:%.*]], label [[HEADEREXIT:%.*]]
 ; CHECK:       latch:
 ; CHECK-NEXT:    [[SHFT:%.*]] = ashr i64 [[ADD_IV]], 1
@@ -110,7 +110,7 @@ define  void @test2(i1 %cond, i32 %n) {
 ; CHECK-NEXT:    br i1 [[CMP2]], label [[HEADER_1:%.*]], label [[LATCHEXIT:%.*]]
 ; CHECK:       header.1:
 ; CHECK-NEXT:    [[ADD_IV_1:%.*]] = add nuw nsw i64 [[IV]], 4
-; CHECK-NEXT:    [[CMP1_1:%.*]] = icmp ult i64 [[ADD_IV_1]], [[TRIP]]
+; CHECK-NEXT:    [[CMP1_1:%.*]] = icmp samesign ult i64 [[ADD_IV_1]], [[TRIP]]
 ; CHECK-NEXT:    br i1 [[CMP1_1]], label [[LATCH_1:%.*]], label [[HEADEREXIT]]
 ; CHECK:       latch.1:
 ; CHECK-NEXT:    [[SHFT_1:%.*]] = ashr i64 [[ADD_IV_1]], 1
@@ -118,7 +118,7 @@ define  void @test2(i1 %cond, i32 %n) {
 ; CHECK-NEXT:    br i1 [[CMP2_1]], label [[HEADER_2:%.*]], label [[LATCHEXIT]]
 ; CHECK:       header.2:
 ; CHECK-NEXT:    [[ADD_IV_2:%.*]] = add nuw nsw i64 [[IV]], 6
-; CHECK-NEXT:    [[CMP1_2:%.*]] = icmp ult i64 [[ADD_IV_2]], [[TRIP]]
+; CHECK-NEXT:    [[CMP1_2:%.*]] = icmp samesign ult i64 [[ADD_IV_2]], [[TRIP]]
 ; CHECK-NEXT:    br i1 [[CMP1_2]], label [[LATCH_2:%.*]], label [[HEADEREXIT]]
 ; CHECK:       latch.2:
 ; CHECK-NEXT:    [[SHFT_2:%.*]] = ashr i64 [[ADD_IV_2]], 1
@@ -126,7 +126,7 @@ define  void @test2(i1 %cond, i32 %n) {
 ; CHECK-NEXT:    br i1 [[CMP2_2]], label [[HEADER_3:%.*]], label [[LATCHEXIT]]
 ; CHECK:       header.3:
 ; CHECK-NEXT:    [[ADD_IV_3]] = add nuw nsw i64 [[IV]], 8
-; CHECK-NEXT:    [[CMP1_3:%.*]] = icmp ult i64 [[ADD_IV_3]], [[TRIP]]
+; CHECK-NEXT:    [[CMP1_3:%.*]] = icmp samesign ult i64 [[ADD_IV_3]], [[TRIP]]
 ; CHECK-NEXT:    br i1 [[CMP1_3]], label [[LATCH_3]], label [[HEADEREXIT]]
 ; CHECK:       latch.3:
 ; CHECK-NEXT:    [[SHFT_3:%.*]] = ashr i64 [[ADD_IV_3]], 1
@@ -179,7 +179,7 @@ define i64 @test3(i32 %n) {
 ; CHECK:       header:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 2, [[PREHEADER]] ], [ [[ADD_IV_3:%.*]], [[LATCH_3:%.*]] ]
 ; CHECK-NEXT:    [[ADD_IV:%.*]] = add nuw nsw i64 [[IV]], 2
-; CHECK-NEXT:    [[CMP1:%.*]] = icmp ult i64 [[ADD_IV]], [[TRIP]]
+; CHECK-NEXT:    [[CMP1:%.*]] = icmp samesign ult i64 [[ADD_IV]], [[TRIP]]
 ; CHECK-NEXT:    br i1 [[CMP1]], label [[LATCH:%.*]], label [[HEADEREXIT:%.*]]
 ; CHECK:       latch:
 ; CHECK-NEXT:    [[SHFT:%.*]] = ashr i64 [[ADD_IV]], 1
@@ -187,7 +187,7 @@ define i64 @test3(i32 %n) {
 ; CHECK-NEXT:    br i1 [[CMP2]], label [[HEADER_1:%.*]], label [[LATCHEXIT:%.*]]
 ; CHECK:       header.1:
 ; CHECK-NEXT:    [[ADD_IV_1:%.*]] = add nuw nsw i64 [[IV]], 4
-; CHECK-NEXT:    [[CMP1_1:%.*]] = icmp ult i64 [[ADD_IV_1]], [[TRIP]]
+; CHECK-NEXT:    [[CMP1_1:%.*]] = icmp samesign ult i64 [[ADD_IV_1]], [[TRIP]]
 ; CHECK-NEXT:    br i1 [[CMP1_1]], label [[LATCH_1:%.*]], label [[HEADEREXIT]]
 ; CHECK:       latch.1:
 ; CHECK-NEXT:    [[SHFT_1:%.*]] = ashr i64 [[ADD_IV_1]], 1
@@ -195,7 +195,7 @@ define i64 @test3(i32 %n) {
 ; CHECK-NEXT:    br i1 [[CMP2_1]], label [[HEADER_2:%.*]], label [[LATCHEXIT]]
 ; CHECK:       header.2:
 ; CHECK-NEXT:    [[ADD_IV_2:%.*]] = add nuw nsw i64 [[IV]], 6
-; CHECK-NEXT:    [[CMP1_2:%.*]] = icmp ult i64 [[ADD_IV_2]], [[TRIP]]
+; CHECK-NEXT:    [[CMP1_2:%.*]] = icmp samesign ult i64 [[ADD_IV_2]], [[TRIP]]
 ; CHECK-NEXT:    br i1 [[CMP1_2]], label [[LATCH_2:%.*]], label [[HEADEREXIT]]
 ; CHECK:       latch.2:
 ; CHECK-NEXT:    [[SHFT_2:%.*]] = ashr i64 [[ADD_IV_2]], 1
@@ -203,7 +203,7 @@ define i64 @test3(i32 %n) {
 ; CHECK-NEXT:    br i1 [[CMP2_2]], label [[HEADER_3:%.*]], label [[LATCHEXIT]]
 ; CHECK:       header.3:
 ; CHECK-NEXT:    [[ADD_IV_3]] = add nuw nsw i64 [[IV]], 8
-; CHECK-NEXT:    [[CMP1_3:%.*]] = icmp ult i64 [[ADD_IV_3]], [[TRIP]]
+; CHECK-NEXT:    [[CMP1_3:%.*]] = icmp samesign ult i64 [[ADD_IV_3]], [[TRIP]]
 ; CHECK-NEXT:    br i1 [[CMP1_3]], label [[LATCH_3]], label [[HEADEREXIT]]
 ; CHECK:       latch.3:
 ; CHECK-NEXT:    [[SHFT_3:%.*]] = ashr i64 [[ADD_IV_3]], 1
@@ -265,8 +265,8 @@ define void @test4(i16 %c3) {
 ; CHECK-NEXT:    br label [[EXITING_PROL:%.*]]
 ; CHECK:       exiting.prol:
 ; CHECK-NEXT:    switch i16 [[C3:%.*]], label [[DEFAULT_LOOPEXIT_LOOPEXIT1:%.*]] [
-; CHECK-NEXT:    i16 45, label [[OTHEREXIT_LOOPEXIT2:%.*]]
-; CHECK-NEXT:    i16 95, label [[LATCH_PROL]]
+; CHECK-NEXT:      i16 45, label [[OTHEREXIT_LOOPEXIT2:%.*]]
+; CHECK-NEXT:      i16 95, label [[LATCH_PROL]]
 ; CHECK-NEXT:    ]
 ; CHECK:       latch.prol:
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT_PROL]] = add nuw nsw i64 [[INDVARS_IV_PROL]], 1
@@ -288,33 +288,33 @@ define void @test4(i16 %c3) {
 ; CHECK-NEXT:    br label [[EXITING:%.*]]
 ; CHECK:       exiting:
 ; CHECK-NEXT:    switch i16 [[C3]], label [[DEFAULT_LOOPEXIT_LOOPEXIT:%.*]] [
-; CHECK-NEXT:    i16 45, label [[OTHEREXIT_LOOPEXIT:%.*]]
-; CHECK-NEXT:    i16 95, label [[LATCH:%.*]]
+; CHECK-NEXT:      i16 45, label [[OTHEREXIT_LOOPEXIT:%.*]]
+; CHECK-NEXT:      i16 95, label [[LATCH:%.*]]
 ; CHECK-NEXT:    ]
 ; CHECK:       latch:
 ; CHECK-NEXT:    br label [[EXITING_1:%.*]]
 ; CHECK:       exiting.1:
 ; CHECK-NEXT:    switch i16 [[C3]], label [[DEFAULT_LOOPEXIT_LOOPEXIT]] [
-; CHECK-NEXT:    i16 45, label [[OTHEREXIT_LOOPEXIT]]
-; CHECK-NEXT:    i16 95, label [[LATCH_1:%.*]]
+; CHECK-NEXT:      i16 45, label [[OTHEREXIT_LOOPEXIT]]
+; CHECK-NEXT:      i16 95, label [[LATCH_1:%.*]]
 ; CHECK-NEXT:    ]
 ; CHECK:       latch.1:
 ; CHECK-NEXT:    br label [[EXITING_2:%.*]]
 ; CHECK:       exiting.2:
 ; CHECK-NEXT:    switch i16 [[C3]], label [[DEFAULT_LOOPEXIT_LOOPEXIT]] [
-; CHECK-NEXT:    i16 45, label [[OTHEREXIT_LOOPEXIT]]
-; CHECK-NEXT:    i16 95, label [[LATCH_2:%.*]]
+; CHECK-NEXT:      i16 45, label [[OTHEREXIT_LOOPEXIT]]
+; CHECK-NEXT:      i16 95, label [[LATCH_2:%.*]]
 ; CHECK-NEXT:    ]
 ; CHECK:       latch.2:
 ; CHECK-NEXT:    br label [[EXITING_3:%.*]]
 ; CHECK:       exiting.3:
 ; CHECK-NEXT:    switch i16 [[C3]], label [[DEFAULT_LOOPEXIT_LOOPEXIT]] [
-; CHECK-NEXT:    i16 45, label [[OTHEREXIT_LOOPEXIT]]
-; CHECK-NEXT:    i16 95, label [[LATCH_3]]
+; CHECK-NEXT:      i16 45, label [[OTHEREXIT_LOOPEXIT]]
+; CHECK-NEXT:      i16 95, label [[LATCH_3]]
 ; CHECK-NEXT:    ]
 ; CHECK:       latch.3:
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT_3]] = add nuw nsw i64 [[INDVARS_IV]], 4
-; CHECK-NEXT:    [[C2_3:%.*]] = icmp ult i64 [[INDVARS_IV_NEXT_3]], [[C1]]
+; CHECK-NEXT:    [[C2_3:%.*]] = icmp samesign ult i64 [[INDVARS_IV_NEXT_3]], [[C1]]
 ; CHECK-NEXT:    br i1 [[C2_3]], label [[HEADER]], label [[LATCHEXIT_UNR_LCSSA:%.*]], !llvm.loop [[LOOP5:![0-9]+]]
 ; CHECK:       latchexit.unr-lcssa:
 ; CHECK-NEXT:    br label [[LATCHEXIT]]
diff --git a/llvm/test/Transforms/LoopUnroll/runtime-loop-multiple-exits.ll b/llvm/test/Transforms/LoopUnroll/runtime-loop-multiple-exits.ll
index 6835e9b1d6f8c..7a330f77685b2 100644
--- a/llvm/test/Transforms/LoopUnroll/runtime-loop-multiple-exits.ll
+++ b/llvm/test/Transforms/LoopUnroll/runtime-loop-multiple-exits.ll
@@ -4651,7 +4651,7 @@ define void @test8() {
 ; PROLOG-NEXT:    %i4.7 = add nuw nsw i64 %i3, 8
 ; PROLOG-NEXT:    br i1 false, label %outerloop.loopexit.loopexit, label %latch.7
 ; PROLOG:       latch.7:
-; PROLOG-NEXT:    %i6.7 = icmp ult i64 %i4.7, 100
+; PROLOG-NEXT:    %i6.7 = icmp samesign ult i64 %i4.7, 100
 ; PROLOG-NEXT:    br i1 %i6.7, label %innerH, label %exit.unr-lcssa
 ; PROLOG:       exit.unr-lcssa:
 ; PROLOG-NEXT:    br label %exit
@@ -4685,7 +4685,7 @@ define void @test8() {
 ; PROLOG-BLOCK-NEXT:    %i4.1.1 = add nuw nsw i64 %i3.1, 2
 ; PROLOG-BLOCK-NEXT:    br i1 false, label %outerloop.loopexit.loopexit.1, label %latch.1.1
 ; PROLOG-BLOCK:       latch.1.1:
-; PROLOG-BLOCK-NEXT:    %i6.1.1 = icmp ult i64 %i4.1.1, 100
+; PROLOG-BLOCK-NEXT:    %i6.1.1 = icmp samesign ult i64 %i4.1.1, 100
 ; PROLOG-BLOCK-NEXT:    br i1 %i6.1.1, label %innerH.1, label %exit.unr-lcssa.loopexit2, !llvm.loop !12
 ; PROLOG-BLOCK:       outerloop.loopexit.loopexit.1:
 ; PROLOG-BLOCK-NEXT:    br label %outerloop.loopexit.1
@@ -4718,7 +4718,7 @@ define void @test8() {
 ; PROLOG-BLOCK-NEXT:    %i4.1 = add nuw nsw i64 %i3, 2
 ; PROLOG-BLOCK-NEXT:    br i1 false, label %outerloop.loopexit.loopexit, label %latch.1
 ; PROLOG-BLOCK:       latch.1:
-; PROLOG-BLOCK-NEXT:    %i6.1 = icmp ult i64 %i4.1, 100
+; PROLOG-BLOCK-NEXT:    %i6.1 = icmp samesign ult i64 %i4.1, 100
 ; PROLOG-BLOCK-NEXT:    br i1 %i6.1, label %innerH, label %exit.unr-lcssa.loopexit, !llvm.loop !12
 ; PROLOG-BLOCK:       exit.unr-lcssa.loopexit:
 ; PROLOG-BLOCK-NEXT:    br label %exit.unr-lcssa
diff --git a/llvm/test/Transforms/LoopUnroll/unroll-header-exiting-with-phis.ll b/llvm/test/Transforms/LoopUnroll/unroll-header-exiting-with-phis.ll
index ff8e6efbbaee1..a6a9ca47ceb07 100644
--- a/llvm/test/Transforms/LoopUnroll/unroll-header-exiting-with-phis.ll
+++ b/llvm/test/Transforms/LoopUnroll/unroll-header-exiting-with-phis.ll
@@ -66,7 +66,7 @@ define i16 @partial_unroll(ptr %A) {
 ; CHECK-NEXT:    br label [[FOR_COND_CLEANUP3_1]]
 ; CHECK:       for.cond.cleanup3.1:
 ; CHECK-NEXT:    [[INC9_1:%.*]] = add nuw nsw i64 [[I_0]], 2
-; CHECK-NEXT:    [[CMP_2:%.*]] = icmp ult i64 [[INC9_1]], 200
+; CHECK-NEXT:    [[CMP_2:%.*]] = icmp samesign ult i64 [[INC9_1]], 200
 ; CHECK-NEXT:    br i1 [[CMP_2]], label [[FOR_COND_CLEANUP3_2]], label [[FOR_COND_CLEANUP:%.*]]
 ; CHECK:       for.cond.cleanup3.2:
 ; CHECK-NEXT:    [[INC9_2]] = add nuw nsw i64 [[I_0]], 3
diff --git a/llvm/test/Transforms/LoopUnrollAndJam/unroll-and-jam.ll b/llvm/test/Transforms/LoopUnrollAndJam/unroll-and-jam.ll
index 9ee51cfbcb590..a3d2fcb5ab946 100644
--- a/llvm/test/Transforms/LoopUnrollAndJam/unroll-and-jam.ll
+++ b/llvm/test/Transforms/LoopUnrollAndJam/unroll-and-jam.ll
@@ -638,18 +638,18 @@ define i32 @test6() #0 {
 ; CHECK:       [[FOR_LATCH]]:
 ; CHECK-NEXT:    br i1 false, label %[[FOR_OUTER]], label %[[FOR_END_UNR_LCSSA:.*]], !llvm.loop [[LOOP7:![0-9]+]]
 ; CHECK:       [[FOR_END_UNR_LCSSA]]:
-; CHECK-NEXT:    [[DOTLCSSA_LCSSA_PH_PH:%.*]] = phi i32 [ 2, %[[FOR_LATCH]] ]
-; CHECK-NEXT:    [[INC_LCSSA_LCSSA_PH_PH:%.*]] = phi i32 [ 7, %[[FOR_LATCH]] ]
-; CHECK-NEXT:    [[P0_UNR_PH:%.*]] = phi i32 [ 2, %[[FOR_LATCH]] ]
+; CHECK-NEXT:    [[DOTLCSSA_LCSSA_PH:%.*]] = phi i32 [ 2, %[[FOR_LATCH]] ]
+; CHECK-NEXT:    [[INC_LCSSA_LCSSA_PH:%.*]] = phi i32 [ 7, %[[FOR_LATCH]] ]
+; CHECK-NEXT:    [[P0_UNR:%.*]] = phi i32 [ 2, %[[FOR_LATCH]] ]
 ; CHECK-NEXT:    br i1 true, label %[[FOR_OUTER_EPIL_PREHEADER]], label %[[FOR_END:.*]]
 ; CHECK:       [[FOR_OUTER_EPIL_PREHEADER]]:
-; CHECK-NEXT:    [[P0_UNR:%.*]] = phi i32 [ [[F_PROMOTED10]], %[[ENTRY]] ], [ [[P0_UNR_PH]], %[[FOR_END_UNR_LCSSA]] ]
+; CHECK-NEXT:    [[P0_EPIL_INIT:%.*]] = phi i32 [ [[F_PROMOTED10]], %[[ENTRY]] ], [ [[P0_UNR]], %[[FOR_END_UNR_LCSSA]] ]
 ; CHECK-NEXT:    call void @llvm.assume(i1 true)
 ; CHECK-NEXT:    br label %[[FOR_OUTER_EPIL:.*]]
 ; CHECK:       [[FOR_OUTER_EPIL]]:
 ; CHECK-NEXT:    br label %[[FOR_INNER_EPIL:.*]]
 ; CHECK:       [[FOR_INNER_EPIL]]:
-; CHECK-NEXT:    [[P1_EPIL:%.*]] = phi i32 [ [[P0_UNR]], %[[FOR_OUTER_EPIL]] ], [ 2, %[[FOR_INNER_EPIL]] ]
+; CHECK-NEXT:    [[P1_EPIL:%.*]] = phi i32 [ [[P0_EPIL_INIT]], %[[FOR_OUTER_EPIL]] ], [ 2, %[[FOR_INNER_EPIL]] ]
 ; CHECK-NEXT:    [[INC_SINK8_EPIL:%.*]] = phi i32 [ 0, %[[FOR_OUTER_EPIL]] ], [ [[INC_EPIL:%.*]], %[[FOR_INNER_EPIL]] ]
 ; CHECK-NEXT:    [[INC_EPIL]] = add nuw nsw i32 [[INC_SINK8_EPIL]], 1
 ; CHECK-NEXT:    [[EXITCOND_EPIL:%.*]] = icmp ne i32 [[INC_EPIL]], 7
@@ -658,8 +658,8 @@ define i32 @test6() #0 {
 ; CHECK-NEXT:    [[DOTLCSSA_EPIL:%.*]] = phi i32 [ [[P1_EPIL]], %[[FOR_INNER_EPIL]] ]
 ; CHECK-NEXT:    br label %[[FOR_END]]
 ; CHECK:       [[FOR_END]]:
-; CHECK-NEXT:    [[DOTLCSSA_LCSSA:%.*]] = phi i32 [ [[DOTLCSSA_LCSSA_PH_PH]], %[[FOR_END_UNR_LCSSA]] ], [ [[DOTLCSSA_EPIL]], %[[FOR_LATCH_EPIL]] ]
-; CHECK-NEXT:    [[INC_LCSSA_LCSSA:%.*]] = phi i32 [ [[INC_LCSSA_LCSSA_PH_PH]], %[[FOR_END_UNR_LCSSA]] ], [ 7, %[[FOR_LATCH_EPIL]] ]
+; CHECK-NEXT:    [[DOTLCSSA_LCSSA:%.*]] = phi i32 [ [[DOTLCSSA_LCSSA_PH]], %[[FOR_END_UNR_LCSSA]] ], [ [[DOTLCSSA_EPIL]], %[[FOR_LATCH_EPIL]] ]
+; CHECK-NEXT:    [[INC_LCSSA_LCSSA:%.*]] = phi i32 [ [[INC_LCSSA_LCSSA_PH]], %[[FOR_END_UNR_LCSSA]] ], [ 7, %[[FOR_LATCH_EPIL]] ]
 ; CHECK-NEXT:    ret i32 0
 ;
 entry:
@@ -1324,9 +1324,9 @@ define signext i16 @test10(i32 %k) #0 {
 ; CHECK-NEXT:    [[STOREMERGE_4_LCSSA_3:%.*]] = phi i64 [ [[STOREMERGE_4_3:%.*]], %[[FOR_INC21_3]] ]
 ; CHECK-NEXT:    br i1 false, label %[[FOR_BODY]], label %[[FOR_END26_UNR_LCSSA:.*]], !llvm.loop [[LOOP13:![0-9]+]]
 ; CHECK:       [[FOR_END26_UNR_LCSSA]]:
-; CHECK-NEXT:    [[DEC_LCSSA_LCSSA_PH_PH:%.*]] = phi i64 [ 0, %[[FOR_INC24]] ]
-; CHECK-NEXT:    [[STOREMERGE_4_LCSSA_LCSSA_PH_PH:%.*]] = phi i64 [ [[STOREMERGE_4_LCSSA_3]], %[[FOR_INC24]] ]
-; CHECK-NEXT:    [[STOREMERGE_5_LCSSA_LCSSA_PH_PH:%.*]] = phi i32 [ 0, %[[FOR_INC24]] ]
+; CHECK-NEXT:    [[DEC_LCSSA_LCSSA_PH:%.*]] = phi i64 [ 0, %[[FOR_INC24]] ]
+; CHECK-NEXT:    [[STOREMERGE_4_LCSSA_LCSSA_PH:%.*]] = phi i64 [ [[STOREMERGE_4_LCSSA_3]], %[[FOR_INC24]] ]
+; CHECK-NEXT:    [[STOREMERGE_5_LCSSA_LCSSA_PH:%.*]] = phi i32 [ 0, %[[FOR_INC24]] ]
 ; CHECK-NEXT:    br i1 true, label %[[FOR_BODY_EPIL_PREHEADER]], label %[[FOR_END26:.*]]
 ; CHECK:       [[FOR_BODY_EPIL_PREHEADER]]:
 ; CHECK-NEXT:    call void @llvm.assume(i1 true)
@@ -1353,9 +1353,9 @@ define signext i16 @test10(i32 %k) #0 {
 ; CHECK-NEXT:    [[STOREMERGE_4_LCSSA_EPIL:%.*]] = phi i64 [ [[STOREMERGE_4_EPIL]], %[[FOR_INC21_EPIL]] ]
 ; CHECK-NEXT:    br label %[[FOR_END26]]
 ; CHECK:       [[FOR_END26]]:
-; CHECK-NEXT:    [[DEC_LCSSA_LCSSA:%.*]] = phi i64 [ [[DEC_LCSSA_LCSSA_PH_PH]], %[[FOR_END26_UNR_LCSSA]] ], [ 0, %[[FOR_INC24_EPIL]] ]
-; CHECK-NEXT:    [[STOREMERGE_4_LCSSA_LCSSA:%.*]] = phi i64 [ [[STOREMERGE_4_LCSSA_LCSSA_PH_PH]], %[[FOR_END26_UNR_LCSSA]] ], [ [[STOREMERGE_4_LCSSA_EPIL]], %[[FOR_INC24_EPIL]] ]
-; CHECK-NEXT:    [[STOREMERGE_5_LCSSA_LCSSA:%.*]] = phi i32 [ [[STOREMERGE_5_LCSSA_LCSSA_PH_PH]], %[[FOR_END26_UNR_LCSSA]] ], [ 0, %[[FOR_INC24_EPIL]] ]
+; CHECK-NEXT:    [[DEC_LCSSA_LCSSA:%.*]] = phi i64 [ [[DEC_LCSSA_LCSSA_PH]], %[[FOR_END26_UNR_LCSSA]] ], [ 0, %[[FOR_INC24_EPIL]] ]
+; CHECK-NEXT:    [[STOREMERGE_4_LCSSA_LCSSA:%.*]] = phi i64 [ [[STOREMERGE_4_LCSSA_LCSSA_PH]], %[[FOR_END26_UNR_LCSSA]] ], [ [[STOREMERGE_4_LCSSA_EPIL]], %[[FOR_INC24_EPIL]] ]
+; CHECK-NEXT:    [[STOREMERGE_5_LCSSA_LCSSA:%.*]] = phi i32 [ [[STOREMERGE_5_LCSSA_LCSSA_PH]], %[[FOR_END26_UNR_LCSSA]] ], [ 0, %[[FOR_INC24_EPIL]] ]
 ; CHECK-NEXT:    store i64 [[DEC_LCSSA_LCSSA]], ptr @g, align 8
 ; CHECK-NEXT:    ret i16 0
 ; CHECK:       [[FOR_BODY2_SPLIT2_1]]:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/force-target-instruction-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/force-target-instruction-cost.ll
index 21b21774d18cf..91c65ba8f6267 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/force-target-instruction-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/force-target-instruction-cost.ll
@@ -380,7 +380,7 @@ for.end:
   ret void
 }
 
-define void @loop_with_freeze_and_conditional_srem(ptr %dst, ptr %keyinfo, ptr %invariant.ptr, i32 %divisor) #1 {
+define void @loop_with_freeze_and_conditional_srem(ptr %dst, ptr %keyinfo, ptr %invariant.ptr, i32 %divisor) {
 ; COMMON-LABEL: define void @loop_with_freeze_and_conditional_srem(
 ; COMMON-SAME: ptr [[DST:%.*]], ptr [[KEYINFO:%.*]], ptr [[INVARIANT_PTR:%.*]], i32 [[DIVISOR:%.*]]) {
 ; COMMON-NEXT:  [[ENTRY:.*]]:
@@ -433,7 +433,165 @@ exit:                                             ; preds = %loop.latch
   ret void
 }
 
+define void @interleave_group(ptr %dst) #1 {
+; COST1-LABEL: define void @interleave_group(
+; COST1-SAME: ptr [[DST:%.*]]) #[[ATTR1:[0-9]+]] {
+; COST1-NEXT:  [[ITER_CHECK:.*:]]
+; COST1-NEXT:    br i1 false, label %[[VEC_EPILOG_SCALAR_PH:.*]], label %[[VECTOR_MAIN_LOOP_ITER_CHECK:.*]]
+; COST1:       [[VECTOR_MAIN_LOOP_ITER_CHECK]]:
+; COST1-NEXT:    br i1 false, label %[[VEC_EPILOG_PH:.*]], label %[[VECTOR_PH:.*]]
+; COST1:       [[VECTOR_PH]]:
+; COST1-NEXT:    br label %[[VECTOR_BODY:.*]]
+; COST1:       [[VECTOR_BODY]]:
+; COST1-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; COST1-NEXT:    [[TMP0:%.*]] = add i64 [[INDEX]], 16
+; COST1-NEXT:    [[TMP1:%.*]] = mul i64 [[INDEX]], 3
+; COST1-NEXT:    [[TMP2:%.*]] = mul i64 [[TMP0]], 3
+; COST1-NEXT:    [[TMP3:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP1]]
+; COST1-NEXT:    [[TMP4:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP2]]
+; COST1-NEXT:    store <48 x i8> zeroinitializer, ptr [[TMP3]], align 1
+; COST1-NEXT:    store <48 x i8> zeroinitializer, ptr [[TMP4]], align 1
+; COST1-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 32
+; COST1-NEXT:    [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 96
+; COST1-NEXT:    br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP16:![0-9]+]]
+; COST1:       [[MIDDLE_BLOCK]]:
+; COST1-NEXT:    br i1 false, [[EXIT:label %.*]], label %[[VEC_EPILOG_ITER_CHECK:.*]]
+; COST1:       [[VEC_EPILOG_ITER_CHECK]]:
+; COST1-NEXT:    br i1 false, label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]], !prof [[PROF4]]
+; COST1:       [[VEC_EPILOG_PH]]:
+; COST1-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ 96, %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
+; COST1-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[BC_RESUME_VAL]], i64 0
+; COST1-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; COST1-NEXT:    [[INDUCTION:%.*]] = add <4 x i64> [[BROADCAST_SPLAT]], <i64 0, i64 1, i64 2, i64 3>
+; COST1-NEXT:    br label %[[VEC_EPILOG_VECTOR_BODY:.*]]
+; COST1:       [[VEC_EPILOG_VECTOR_BODY]]:
+; COST1-NEXT:    [[INDEX1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT2:%.*]], %[[VEC_EPILOG_VECTOR_BODY]] ]
+; COST1-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ [[INDUCTION]], %[[VEC_EPILOG_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VEC_EPILOG_VECTOR_BODY]] ]
+; COST1-NEXT:    [[TMP6:%.*]] = mul <4 x i64> [[VEC_IND]], splat (i64 3)
+; COST1-NEXT:    [[TMP7:%.*]] = extractelement <4 x i64> [[TMP6]], i32 0
+; COST1-NEXT:    [[TMP8:%.*]] = extractelement <4 x i64> [[TMP6]], i32 1
+; COST1-NEXT:    [[TMP9:%.*]] = extractelement <4 x i64> [[TMP6]], i32 2
+; COST1-NEXT:    [[TMP10:%.*]] = extractelement <4 x i64> [[TMP6]], i32 3
+; COST1-NEXT:    [[TMP11:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP7]]
+; COST1-NEXT:    [[TMP12:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP8]]
+; COST1-NEXT:    [[TMP13:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP9]]
+; COST1-NEXT:    [[TMP14:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP10]]
+; COST1-NEXT:    [[TMP15:%.*]] = getelementptr i8, ptr [[TMP11]], i64 2
+; COST1-NEXT:    [[TMP16:%.*]] = getelementptr i8, ptr [[TMP12]], i64 2
+; COST1-NEXT:    [[TMP17:%.*]] = getelementptr i8, ptr [[TMP13]], i64 2
+; COST1-NEXT:    [[TMP18:%.*]] = getelementptr i8, ptr [[TMP14]], i64 2
+; COST1-NEXT:    store i8 0, ptr [[TMP15]], align 1
+; COST1-NEXT:    store i8 0, ptr [[TMP16]], align 1
+; COST1-NEXT:    store i8 0, ptr [[TMP17]], align 1
+; COST1-NEXT:    store i8 0, ptr [[TMP18]], align 1
+; COST1-NEXT:    [[TMP19:%.*]] = getelementptr i8, ptr [[TMP11]], i64 1
+; COST1-NEXT:    [[TMP20:%.*]] = getelementptr i8, ptr [[TMP12]], i64 1
+; COST1-NEXT:    [[TMP21:%.*]] = getelementptr i8, ptr [[TMP13]], i64 1
+; COST1-NEXT:    [[TMP22:%.*]] = getelementptr i8, ptr [[TMP14]], i64 1
+; COST1-NEXT:    store i8 0, ptr [[TMP19]], align 1
+; COST1-NEXT:    store i8 0, ptr [[TMP20]], align 1
+; COST1-NEXT:    store i8 0, ptr [[TMP21]], align 1
+; COST1-NEXT:    store i8 0, ptr [[TMP22]], align 1
+; COST1-NEXT:    store i8 0, ptr [[TMP11]], align 1
+; COST1-NEXT:    store i8 0, ptr [[TMP12]], align 1
+; COST1-NEXT:    store i8 0, ptr [[TMP13]], align 1
+; COST1-NEXT:    store i8 0, ptr [[TMP14]], align 1
+; COST1-NEXT:    [[INDEX_NEXT2]] = add nuw i64 [[INDEX1]], 4
+; COST1-NEXT:    [[VEC_IND_NEXT]] = add <4 x i64> [[VEC_IND]], splat (i64 4)
+; COST1-NEXT:    [[TMP23:%.*]] = icmp eq i64 [[INDEX_NEXT2]], 100
+; COST1-NEXT:    br i1 [[TMP23]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP17:![0-9]+]]
+; COST1:       [[VEC_EPILOG_MIDDLE_BLOCK]]:
+; COST1-NEXT:    br i1 false, [[EXIT]], label %[[VEC_EPILOG_SCALAR_PH]]
+; COST1:       [[VEC_EPILOG_SCALAR_PH]]:
+;
+; COST10-LABEL: define void @interleave_group(
+; COST10-SAME: ptr [[DST:%.*]]) #[[ATTR1:[0-9]+]] {
+; COST10-NEXT:  [[ITER_CHECK:.*:]]
+; COST10-NEXT:    br i1 false, label %[[VEC_EPILOG_SCALAR_PH:.*]], label %[[VECTOR_MAIN_LOOP_ITER_CHECK:.*]]
+; COST10:       [[VECTOR_MAIN_LOOP_ITER_CHECK]]:
+; COST10-NEXT:    br i1 false, label %[[VEC_EPILOG_PH:.*]], label %[[VECTOR_PH:.*]]
+; COST10:       [[VECTOR_PH]]:
+; COST10-NEXT:    br label %[[VECTOR_BODY:.*]]
+; COST10:       [[VECTOR_BODY]]:
+; COST10-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; COST10-NEXT:    [[TMP0:%.*]] = mul i64 [[INDEX]], 3
+; COST10-NEXT:    [[TMP1:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP0]]
+; COST10-NEXT:    store <48 x i8> zeroinitializer, ptr [[TMP1]], align 1
+; COST10-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
+; COST10-NEXT:    [[TMP2:%.*]] = icmp eq i64 [[INDEX_NEXT]], 96
+; COST10-NEXT:    br i1 [[TMP2]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP16:![0-9]+]]
+; COST10:       [[MIDDLE_BLOCK]]:
+; COST10-NEXT:    br i1 false, [[EXIT:label %.*]], label %[[VEC_EPILOG_ITER_CHECK:.*]]
+; COST10:       [[VEC_EPILOG_ITER_CHECK]]:
+; COST10-NEXT:    br i1 false, label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]], !prof [[PROF4]]
+; COST10:       [[VEC_EPILOG_PH]]:
+; COST10-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ 96, %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
+; COST10-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[BC_RESUME_VAL]], i64 0
+; COST10-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; COST10-NEXT:    [[INDUCTION:%.*]] = add <4 x i64> [[BROADCAST_SPLAT]], <i64 0, i64 1, i64 2, i64 3>
+; COST10-NEXT:    br label %[[VEC_EPILOG_VECTOR_BODY:.*]]
+; COST10:       [[VEC_EPILOG_VECTOR_BODY]]:
+; COST10-NEXT:    [[INDEX1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT2:%.*]], %[[VEC_EPILOG_VECTOR_BODY]] ]
+; COST10-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ [[INDUCTION]], %[[VEC_EPILOG_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VEC_EPILOG_VECTOR_BODY]] ]
+; COST10-NEXT:    [[TMP3:%.*]] = mul <4 x i64> [[VEC_IND]], splat (i64 3)
+; COST10-NEXT:    [[TMP4:%.*]] = extractelement <4 x i64> [[TMP3]], i32 0
+; COST10-NEXT:    [[TMP5:%.*]] = extractelement <4 x i64> [[TMP3]], i32 1
+; COST10-NEXT:    [[TMP6:%.*]] = extractelement <4 x i64> [[TMP3]], i32 2
+; COST10-NEXT:    [[TMP7:%.*]] = extractelement <4 x i64> [[TMP3]], i32 3
+; COST10-NEXT:    [[TMP8:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP4]]
+; COST10-NEXT:    [[TMP9:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP5]]
+; COST10-NEXT:    [[TMP10:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP6]]
+; COST10-NEXT:    [[TMP11:%.*]] = getelementptr i8, ptr [[DST]], i64 [[TMP7]]
+; COST10-NEXT:    [[TMP12:%.*]] = getelementptr i8, ptr [[TMP8]], i64 2
+; COST10-NEXT:    [[TMP13:%.*]] = getelementptr i8, ptr [[TMP9]], i64 2
+; COST10-NEXT:    [[TMP14:%.*]] = getelementptr i8, ptr [[TMP10]], i64 2
+; COST10-NEXT:    [[TMP15:%.*]] = getelementptr i8, ptr [[TMP11]], i64 2
+; COST10-NEXT:    store i8 0, ptr [[TMP12]], align 1
+; COST10-NEXT:    store i8 0, ptr [[TMP13]], align 1
+; COST10-NEXT:    store i8 0, ptr [[TMP14]], align 1
+; COST10-NEXT:    store i8 0, ptr [[TMP15]], align 1
+; COST10-NEXT:    [[TMP16:%.*]] = getelementptr i8, ptr [[TMP8]], i64 1
+; COST10-NEXT:    [[TMP17:%.*]] = getelementptr i8, ptr [[TMP9]], i64 1
+; COST10-NEXT:    [[TMP18:%.*]] = getelementptr i8, ptr [[TMP10]], i64 1
+; COST10-NEXT:    [[TMP19:%.*]] = getelementptr i8, ptr [[TMP11]], i64 1
+; COST10-NEXT:    store i8 0, ptr [[TMP16]], align 1
+; COST10-NEXT:    store i8 0, ptr [[TMP17]], align 1
+; COST10-NEXT:    store i8 0, ptr [[TMP18]], align 1
+; COST10-NEXT:    store i8 0, ptr [[TMP19]], align 1
+; COST10-NEXT:    store i8 0, ptr [[TMP8]], align 1
+; COST10-NEXT:    store i8 0, ptr [[TMP9]], align 1
+; COST10-NEXT:    store i8 0, ptr [[TMP10]], align 1
+; COST10-NEXT:    store i8 0, ptr [[TMP11]], align 1
+; COST10-NEXT:    [[INDEX_NEXT2]] = add nuw i64 [[INDEX1]], 4
+; COST10-NEXT:    [[VEC_IND_NEXT]] = add <4 x i64> [[VEC_IND]], splat (i64 4)
+; COST10-NEXT:    [[TMP20:%.*]] = icmp eq i64 [[INDEX_NEXT2]], 100
+; COST10-NEXT:    br i1 [[TMP20]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP17:![0-9]+]]
+; COST10:       [[VEC_EPILOG_MIDDLE_BLOCK]]:
+; COST10-NEXT:    br i1 false, [[EXIT]], label %[[VEC_EPILOG_SCALAR_PH]]
+; COST10:       [[VEC_EPILOG_SCALAR_PH]]:
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %iv.3 = mul i64 %iv, 3
+  %gep.0 = getelementptr i8, ptr %dst, i64 %iv.3
+  %gep.2 = getelementptr i8, ptr %gep.0, i64 2
+  store i8 0, ptr %gep.2, align 1
+  %gep.1 = getelementptr i8, ptr %gep.0, i64 1
+  store i8 0, ptr %gep.1, align 1
+  store i8 0, ptr %gep.0, align 1
+  %iv.next = add i64 %iv, 1
+  %ec = icmp eq i64 %iv, 100
+  br i1 %ec, label %exit, label %loop
+
+exit:
+  ret void
+}
+
 attributes #0 = { "target-features"="+neon,+sve" vscale_range(1,16) }
+attributes #1 = { "target-cpu"="neoverse-512tvb" }
 
 declare void @llvm.assume(i1 noundef)
 declare i64 @llvm.umin.i64(i64, i64)
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/pr60831-sve-inv-store-crash.ll b/llvm/test/Transforms/LoopVectorize/AArch64/pr60831-sve-inv-store-crash.ll
index 88e035ebf3be8..131b3d1b02727 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/pr60831-sve-inv-store-crash.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/pr60831-sve-inv-store-crash.ll
@@ -15,25 +15,22 @@ define void @test_invar_gep(ptr %dst) #0 {
 ; CHECK-NEXT:    [[TMP3:%.*]] = mul nuw i64 [[TMP2]], 4
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 100, [[TMP3]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 100, [[N_MOD_VF]]
+; CHECK-NEXT:    [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
+; CHECK-NEXT:    [[TMP4:%.*]] = mul nsw <vscale x 4 x i64> [[TMP5]], splat (i64 1)
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add nsw <vscale x 4 x i64> zeroinitializer, [[TMP4]]
+; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP3]], i64 0
+; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
-; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX]], i64 0
-; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP10:%.*]] = add <vscale x 4 x i64> zeroinitializer, [[TMP5]]
-; CHECK-NEXT:    [[TMP4:%.*]] = mul <vscale x 4 x i64> [[TMP10]], splat (i64 1)
-; CHECK-NEXT:    [[TMP9:%.*]] = add <vscale x 4 x i64> [[DOTSPLAT]], [[TMP4]]
-; CHECK-NEXT:    [[TMP6:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT:    [[TMP7:%.*]] = add i64 [[INDEX]], 1
-; CHECK-NEXT:    [[TMP8:%.*]] = add i64 [[INDEX]], 2
-; CHECK-NEXT:    [[TMP11:%.*]] = add i64 [[INDEX]], 3
+; CHECK-NEXT:    [[TMP9:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP15:%.*]] = call i32 @llvm.vscale.i32()
 ; CHECK-NEXT:    [[TMP16:%.*]] = mul nuw i32 [[TMP15]], 4
 ; CHECK-NEXT:    [[TMP17:%.*]] = sub i32 [[TMP16]], 1
 ; CHECK-NEXT:    [[TMP18:%.*]] = extractelement <vscale x 4 x i64> [[TMP9]], i32 [[TMP17]]
 ; CHECK-NEXT:    store i64 [[TMP18]], ptr [[TMP14:%.*]], align 1
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP3]]
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nsw <vscale x 4 x i64> [[TMP9]], [[DOTSPLAT]]
 ; CHECK-NEXT:    [[TMP19:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK:       middle.block:
@@ -60,38 +57,26 @@ define void @test_invar_gep(ptr %dst) #0 {
 ; IC2:       vector.ph:
 ; IC2-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
 ; IC2-NEXT:    [[TMP11:%.*]] = mul nuw i64 [[TMP2]], 4
+; IC2-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP11]], i64 0
+; IC2-NEXT:    [[TMP21:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
 ; IC2-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP11]], 2
 ; IC2-NEXT:    [[N_MOD_VF:%.*]] = urem i64 100, [[TMP3]]
 ; IC2-NEXT:    [[N_VEC:%.*]] = sub i64 100, [[N_MOD_VF]]
+; IC2-NEXT:    [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
+; IC2-NEXT:    [[TMP12:%.*]] = mul nsw <vscale x 4 x i64> [[TMP5]], splat (i64 1)
+; IC2-NEXT:    [[INDUCTION:%.*]] = add nsw <vscale x 4 x i64> zeroinitializer, [[TMP12]]
 ; IC2-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; IC2:       vector.body:
 ; IC2-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; IC2-NEXT:    [[BROADCAST_SPLAT:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
-; IC2-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX]], i64 0
-; IC2-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; IC2-NEXT:    [[DOTSPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP11]], i64 0
-; IC2-NEXT:    [[VEC_IND:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; IC2-NEXT:    [[TMP5:%.*]] = add <vscale x 4 x i64> [[VEC_IND]], [[BROADCAST_SPLAT]]
-; IC2-NEXT:    [[TMP21:%.*]] = mul <vscale x 4 x i64> [[TMP5]], splat (i64 1)
+; IC2-NEXT:    [[DOTSPLAT:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; IC2-NEXT:    [[TMP22:%.*]] = add <vscale x 4 x i64> [[DOTSPLAT]], [[TMP21]]
-; IC2-NEXT:    [[TMP23:%.*]] = add i64 [[TMP11]], 0
-; IC2-NEXT:    [[TMP24:%.*]] = mul i64 [[TMP23]], 1
-; IC2-NEXT:    [[TMP25:%.*]] = add i64 [[INDEX]], [[TMP24]]
-; IC2-NEXT:    [[TMP12:%.*]] = add i64 [[TMP11]], 1
-; IC2-NEXT:    [[TMP13:%.*]] = mul i64 [[TMP12]], 1
-; IC2-NEXT:    [[TMP14:%.*]] = add i64 [[INDEX]], [[TMP13]]
-; IC2-NEXT:    [[TMP15:%.*]] = add i64 [[TMP11]], 2
-; IC2-NEXT:    [[TMP16:%.*]] = mul i64 [[TMP15]], 1
-; IC2-NEXT:    [[TMP17:%.*]] = add i64 [[INDEX]], [[TMP16]]
-; IC2-NEXT:    [[TMP18:%.*]] = add i64 [[TMP11]], 3
-; IC2-NEXT:    [[TMP19:%.*]] = mul i64 [[TMP18]], 1
-; IC2-NEXT:    [[TMP20:%.*]] = add i64 [[INDEX]], [[TMP19]]
 ; IC2-NEXT:    [[TMP6:%.*]] = call i32 @llvm.vscale.i32()
 ; IC2-NEXT:    [[TMP7:%.*]] = mul nuw i32 [[TMP6]], 4
 ; IC2-NEXT:    [[TMP8:%.*]] = sub i32 [[TMP7]], 1
 ; IC2-NEXT:    [[TMP9:%.*]] = extractelement <vscale x 4 x i64> [[TMP22]], i32 [[TMP8]]
 ; IC2-NEXT:    store i64 [[TMP9]], ptr [[DST:%.*]], align 1
 ; IC2-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP3]]
+; IC2-NEXT:    [[VEC_IND_NEXT]] = add nsw <vscale x 4 x i64> [[TMP22]], [[TMP21]]
 ; IC2-NEXT:    [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; IC2-NEXT:    br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; IC2:       middle.block:
@@ -139,26 +124,24 @@ define void @test_invar_gep_var_start(i64 %start, ptr %dst) #0 {
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], [[TMP4]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = add i64 [[START]], [[N_VEC]]
+; CHECK-NEXT:    [[TMP6:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
+; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[START]], i64 0
+; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP13:%.*]] = mul nsw <vscale x 4 x i64> [[TMP6]], splat (i64 1)
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add nsw <vscale x 4 x i64> [[DOTSPLAT]], [[TMP13]]
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP4]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = add i64 [[START]], [[INDEX]]
-; CHECK-NEXT:    [[TMP6:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
-; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[OFFSET_IDX]], i64 0
-; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP14:%.*]] = add <vscale x 4 x i64> zeroinitializer, [[TMP6]]
-; CHECK-NEXT:    [[TMP15:%.*]] = mul <vscale x 4 x i64> [[TMP14]], splat (i64 1)
-; CHECK-NEXT:    [[TMP7:%.*]] = add <vscale x 4 x i64> [[DOTSPLAT]], [[TMP15]]
-; CHECK-NEXT:    [[TMP16:%.*]] = add i64 [[OFFSET_IDX]], 0
-; CHECK-NEXT:    [[TMP17:%.*]] = add i64 [[OFFSET_IDX]], 1
-; CHECK-NEXT:    [[TMP18:%.*]] = add i64 [[OFFSET_IDX]], 2
-; CHECK-NEXT:    [[TMP13:%.*]] = add i64 [[OFFSET_IDX]], 3
+; CHECK-NEXT:    [[TMP7:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP8:%.*]] = call i32 @llvm.vscale.i32()
 ; CHECK-NEXT:    [[TMP9:%.*]] = mul nuw i32 [[TMP8]], 4
 ; CHECK-NEXT:    [[TMP10:%.*]] = sub i32 [[TMP9]], 1
 ; CHECK-NEXT:    [[TMP11:%.*]] = extractelement <vscale x 4 x i64> [[TMP7]], i32 [[TMP10]]
 ; CHECK-NEXT:    store i64 [[TMP11]], ptr [[DST:%.*]], align 1
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP4]]
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nsw <vscale x 4 x i64> [[TMP7]], [[BROADCAST_SPLAT2]]
 ; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
 ; CHECK:       middle.block:
@@ -187,40 +170,29 @@ define void @test_invar_gep_var_start(i64 %start, ptr %dst) #0 {
 ; IC2:       vector.ph:
 ; IC2-NEXT:    [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
 ; IC2-NEXT:    [[TMP4:%.*]] = mul nuw i64 [[TMP3]], 4
+; IC2-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP4]], i64 0
+; IC2-NEXT:    [[TMP9:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
 ; IC2-NEXT:    [[TMP5:%.*]] = mul i64 [[TMP4]], 2
 ; IC2-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP0]], [[TMP5]]
 ; IC2-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP0]], [[N_MOD_VF]]
 ; IC2-NEXT:    [[TMP6:%.*]] = add i64 [[START]], [[N_VEC]]
+; IC2-NEXT:    [[BROADCAST_SPLAT:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
+; IC2-NEXT:    [[DOTSPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[START]], i64 0
+; IC2-NEXT:    [[VEC_IND:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
+; IC2-NEXT:    [[TMP8:%.*]] = mul nsw <vscale x 4 x i64> [[BROADCAST_SPLAT]], splat (i64 1)
+; IC2-NEXT:    [[INDUCTION:%.*]] = add nsw <vscale x 4 x i64> [[VEC_IND]], [[TMP8]]
 ; IC2-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; IC2:       vector.body:
 ; IC2-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; IC2-NEXT:    [[OFFSET_IDX:%.*]] = add i64 [[START]], [[INDEX]]
-; IC2-NEXT:    [[BROADCAST_SPLAT:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
-; IC2-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[OFFSET_IDX]], i64 0
-; IC2-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; IC2-NEXT:    [[DOTSPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP4]], i64 0
-; IC2-NEXT:    [[VEC_IND:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; IC2-NEXT:    [[TMP11:%.*]] = add <vscale x 4 x i64> [[VEC_IND]], [[BROADCAST_SPLAT]]
-; IC2-NEXT:    [[TMP9:%.*]] = mul <vscale x 4 x i64> [[TMP11]], splat (i64 1)
+; IC2-NEXT:    [[DOTSPLAT:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; IC2-NEXT:    [[TMP10:%.*]] = add <vscale x 4 x i64> [[DOTSPLAT]], [[TMP9]]
-; IC2-NEXT:    [[TMP23:%.*]] = add i64 [[TMP4]], 0
-; IC2-NEXT:    [[TMP24:%.*]] = mul i64 [[TMP23]], 1
-; IC2-NEXT:    [[TMP25:%.*]] = add i64 [[OFFSET_IDX]], [[TMP24]]
-; IC2-NEXT:    [[TMP26:%.*]] = add i64 [[TMP4]], 1
-; IC2-NEXT:    [[TMP27:%.*]] = mul i64 [[TMP26]], 1
-; IC2-NEXT:    [[TMP28:%.*]] = add i64 [[OFFSET_IDX]], [[TMP27]]
-; IC2-NEXT:    [[TMP17:%.*]] = add i64 [[TMP4]], 2
-; IC2-NEXT:    [[TMP18:%.*]] = mul i64 [[TMP17]], 1
-; IC2-NEXT:    [[TMP19:%.*]] = add i64 [[OFFSET_IDX]], [[TMP18]]
-; IC2-NEXT:    [[TMP20:%.*]] = add i64 [[TMP4]], 3
-; IC2-NEXT:    [[TMP21:%.*]] = mul i64 [[TMP20]], 1
-; IC2-NEXT:    [[TMP22:%.*]] = add i64 [[OFFSET_IDX]], [[TMP21]]
 ; IC2-NEXT:    [[TMP12:%.*]] = call i32 @llvm.vscale.i32()
 ; IC2-NEXT:    [[TMP13:%.*]] = mul nuw i32 [[TMP12]], 4
 ; IC2-NEXT:    [[TMP14:%.*]] = sub i32 [[TMP13]], 1
 ; IC2-NEXT:    [[TMP15:%.*]] = extractelement <vscale x 4 x i64> [[TMP10]], i32 [[TMP14]]
 ; IC2-NEXT:    store i64 [[TMP15]], ptr [[DST:%.*]], align 1
 ; IC2-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP5]]
+; IC2-NEXT:    [[VEC_IND_NEXT]] = add nsw <vscale x 4 x i64> [[TMP10]], [[TMP9]]
 ; IC2-NEXT:    [[TMP16:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; IC2-NEXT:    br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
 ; IC2:       middle.block:
@@ -269,36 +241,34 @@ define void @test_invar_gep_var_start_step_2(i64 %start, ptr %dst) #0 {
 ; CHECK-NEXT:    [[TMP6:%.*]] = mul nuw i64 [[TMP5]], 4
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], [[TMP6]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
-; CHECK-NEXT:    [[TMP7:%.*]] = mul i64 [[N_VEC]], 2
-; CHECK-NEXT:    [[TMP8:%.*]] = add i64 [[START]], [[TMP7]]
-; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
-; CHECK:       vector.body:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[TMP9:%.*]] = mul i64 [[INDEX]], 2
+; CHECK-NEXT:    [[TMP9:%.*]] = mul i64 [[N_VEC]], 2
 ; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = add i64 [[START]], [[TMP9]]
 ; CHECK-NEXT:    [[TMP10:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
-; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[OFFSET_IDX]], i64 0
+; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[START]], i64 0
 ; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP11:%.*]] = add <vscale x 4 x i64> zeroinitializer, [[TMP10]]
-; CHECK-NEXT:    [[TMP18:%.*]] = mul <vscale x 4 x i64> [[TMP11]], splat (i64 2)
-; CHECK-NEXT:    [[TMP12:%.*]] = add <vscale x 4 x i64> [[DOTSPLAT]], [[TMP18]]
-; CHECK-NEXT:    [[TMP19:%.*]] = add i64 [[OFFSET_IDX]], 0
-; CHECK-NEXT:    [[TMP20:%.*]] = add i64 [[OFFSET_IDX]], 2
-; CHECK-NEXT:    [[TMP21:%.*]] = add i64 [[OFFSET_IDX]], 4
-; CHECK-NEXT:    [[TMP22:%.*]] = add i64 [[OFFSET_IDX]], 6
+; CHECK-NEXT:    [[TMP18:%.*]] = mul nsw <vscale x 4 x i64> [[TMP10]], splat (i64 2)
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add nsw <vscale x 4 x i64> [[DOTSPLAT]], [[TMP18]]
+; CHECK-NEXT:    [[TMP11:%.*]] = mul nsw i64 2, [[TMP6]]
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP11]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP12:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP13:%.*]] = call i32 @llvm.vscale.i32()
 ; CHECK-NEXT:    [[TMP14:%.*]] = mul nuw i32 [[TMP13]], 4
 ; CHECK-NEXT:    [[TMP15:%.*]] = sub i32 [[TMP14]], 1
 ; CHECK-NEXT:    [[TMP16:%.*]] = extractelement <vscale x 4 x i64> [[TMP12]], i32 [[TMP15]]
 ; CHECK-NEXT:    store i64 [[TMP16]], ptr [[DST:%.*]], align 1
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP6]]
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nsw <vscale x 4 x i64> [[TMP12]], [[BROADCAST_SPLAT2]]
 ; CHECK-NEXT:    [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
 ; CHECK:       middle.block:
 ; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; CHECK:       scalar.ph:
-; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[TMP8]], [[MIDDLE_BLOCK]] ], [ [[START]], [[ENTRY:%.*]] ]
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[OFFSET_IDX]], [[MIDDLE_BLOCK]] ], [ [[START]], [[ENTRY:%.*]] ]
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
@@ -322,49 +292,38 @@ define void @test_invar_gep_var_start_step_2(i64 %start, ptr %dst) #0 {
 ; IC2:       vector.ph:
 ; IC2-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
 ; IC2-NEXT:    [[TMP6:%.*]] = mul nuw i64 [[TMP5]], 4
+; IC2-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP6]], i64 0
+; IC2-NEXT:    [[BROADCAST_SPLAT1:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
 ; IC2-NEXT:    [[TMP7:%.*]] = mul i64 [[TMP6]], 2
 ; IC2-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], [[TMP7]]
 ; IC2-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
-; IC2-NEXT:    [[TMP8:%.*]] = mul i64 [[N_VEC]], 2
-; IC2-NEXT:    [[TMP9:%.*]] = add i64 [[START]], [[TMP8]]
-; IC2-NEXT:    br label [[VECTOR_BODY:%.*]]
-; IC2:       vector.body:
-; IC2-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; IC2-NEXT:    [[TMP10:%.*]] = mul i64 [[INDEX]], 2
+; IC2-NEXT:    [[TMP10:%.*]] = mul i64 [[N_VEC]], 2
 ; IC2-NEXT:    [[OFFSET_IDX:%.*]] = add i64 [[START]], [[TMP10]]
+; IC2-NEXT:    [[TMP13:%.*]] = mul <vscale x 4 x i64> [[BROADCAST_SPLAT1]], splat (i64 2)
 ; IC2-NEXT:    [[TMP11:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
-; IC2-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[OFFSET_IDX]], i64 0
-; IC2-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; IC2-NEXT:    [[DOTSPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP6]], i64 0
+; IC2-NEXT:    [[DOTSPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[START]], i64 0
 ; IC2-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
-; IC2-NEXT:    [[TMP16:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT]], [[TMP11]]
-; IC2-NEXT:    [[TMP13:%.*]] = mul <vscale x 4 x i64> [[TMP16]], splat (i64 2)
+; IC2-NEXT:    [[TMP12:%.*]] = mul nsw <vscale x 4 x i64> [[TMP11]], splat (i64 2)
+; IC2-NEXT:    [[INDUCTION:%.*]] = add nsw <vscale x 4 x i64> [[BROADCAST_SPLAT]], [[TMP12]]
+; IC2-NEXT:    br label [[VECTOR_BODY:%.*]]
+; IC2:       vector.body:
+; IC2-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; IC2-NEXT:    [[DOTSPLAT:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; IC2-NEXT:    [[TMP14:%.*]] = add <vscale x 4 x i64> [[DOTSPLAT]], [[TMP13]]
-; IC2-NEXT:    [[TMP15:%.*]] = add i64 [[TMP6]], 0
-; IC2-NEXT:    [[TMP27:%.*]] = mul i64 [[TMP15]], 2
-; IC2-NEXT:    [[TMP28:%.*]] = add i64 [[OFFSET_IDX]], [[TMP27]]
-; IC2-NEXT:    [[TMP29:%.*]] = add i64 [[TMP6]], 1
-; IC2-NEXT:    [[TMP30:%.*]] = mul i64 [[TMP29]], 2
-; IC2-NEXT:    [[TMP31:%.*]] = add i64 [[OFFSET_IDX]], [[TMP30]]
-; IC2-NEXT:    [[TMP32:%.*]] = add i64 [[TMP6]], 2
-; IC2-NEXT:    [[TMP22:%.*]] = mul i64 [[TMP32]], 2
-; IC2-NEXT:    [[TMP23:%.*]] = add i64 [[OFFSET_IDX]], [[TMP22]]
-; IC2-NEXT:    [[TMP24:%.*]] = add i64 [[TMP6]], 3
-; IC2-NEXT:    [[TMP25:%.*]] = mul i64 [[TMP24]], 2
-; IC2-NEXT:    [[TMP26:%.*]] = add i64 [[OFFSET_IDX]], [[TMP25]]
 ; IC2-NEXT:    [[TMP17:%.*]] = call i32 @llvm.vscale.i32()
 ; IC2-NEXT:    [[TMP18:%.*]] = mul nuw i32 [[TMP17]], 4
 ; IC2-NEXT:    [[TMP19:%.*]] = sub i32 [[TMP18]], 1
 ; IC2-NEXT:    [[TMP20:%.*]] = extractelement <vscale x 4 x i64> [[TMP14]], i32 [[TMP19]]
 ; IC2-NEXT:    store i64 [[TMP20]], ptr [[DST:%.*]], align 1
 ; IC2-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP7]]
+; IC2-NEXT:    [[VEC_IND_NEXT]] = add nsw <vscale x 4 x i64> [[TMP14]], [[TMP13]]
 ; IC2-NEXT:    [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; IC2-NEXT:    br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
 ; IC2:       middle.block:
 ; IC2-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
 ; IC2-NEXT:    br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; IC2:       scalar.ph:
-; IC2-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[TMP9]], [[MIDDLE_BLOCK]] ], [ [[START]], [[ENTRY:%.*]] ]
+; IC2-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[OFFSET_IDX]], [[MIDDLE_BLOCK]] ], [ [[START]], [[ENTRY:%.*]] ]
 ; IC2-NEXT:    br label [[LOOP:%.*]]
 ; IC2:       loop:
 ; IC2-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll
index 403fc9f316d35..20409f66fc51f 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll
@@ -73,3 +73,26 @@ exit:
   %1 = select i1 %all.off, i32 1, i32 %0
   ret i32 %1
 }
+
+define i32 @select_vpinst_for_tail_folding(i8 %n) {
+; CHECK: LV: Checking a loop in 'select_vpinst_for_tail_folding'
+; CHECK: Cost of 1 for VF 2: EMIT vp<{{.+}}> = select vp<{{.+}}>, ir<%red.next>, ir<%red>
+; CHECK: Cost of 1 for VF 4: EMIT vp<{{.+}}> = select vp<{{.+}}>, ir<%red.next>, ir<%red>
+; CHECK: LV: Selecting VF: 4
+
+entry:
+  %c = icmp ne i8 %n, 0
+  %ext = zext i1 %c to i32
+  br label %loop
+
+loop:
+  %iv = phi i32 [ %ext, %entry ], [ %iv.next, %loop ]
+  %red = phi i32 [ 0, %entry ], [ %red.next, %loop ]
+  %iv.next = add i32 %iv, 1
+  %red.next = mul i32 %red, %iv
+  %ec = icmp eq i32 %iv, 12
+  br i1 %ec, label %exit, label %loop
+
+exit:
+  ret i32 %red.next
+}
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/select-index.ll b/llvm/test/Transforms/LoopVectorize/AArch64/select-index.ll
index 32d419cc0934a..56d34a61be1db 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/select-index.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/select-index.ll
@@ -47,11 +47,58 @@ define i64 @test_vectorize_select_umin_last_idx(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_umin_last_idx(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <2 x i64> [ <i64 0, i64 1>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <2 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP7:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <2 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP8:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI2:%.*]] = phi <2 x i64> [ splat (i64 100), %[[VECTOR_PH]] ], [ [[TMP5:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI3:%.*]] = phi <2 x i64> [ splat (i64 100), %[[VECTOR_PH]] ], [ [[TMP6:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[STEP_ADD:%.*]] = add <2 x i64> [[VEC_IND]], splat (i64 2)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr i64, ptr [[GEP]], i64 2
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i64>, ptr [[GEP]], align 8
+; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <2 x i64>, ptr [[TMP2]], align 8
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp uge <2 x i64> [[VEC_PHI2]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp uge <2 x i64> [[VEC_PHI3]], [[WIDE_LOAD4]]
+; CHECK-NEXT:    [[TMP5]] = call <2 x i64> @llvm.umin.v2i64(<2 x i64> [[VEC_PHI2]], <2 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP6]] = call <2 x i64> @llvm.umin.v2i64(<2 x i64> [[VEC_PHI3]], <2 x i64> [[WIDE_LOAD4]])
+; CHECK-NEXT:    [[TMP7]] = select <2 x i1> [[TMP3]], <2 x i64> [[VEC_IND]], <2 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[TMP8]] = select <2 x i1> [[TMP4]], <2 x i64> [[STEP_ADD]], <2 x i64> [[VEC_PHI1]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <2 x i64> [[STEP_ADD]], splat (i64 2)
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP9]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[RDX_MINMAX:%.*]] = call <2 x i64> @llvm.umin.v2i64(<2 x i64> [[TMP5]], <2 x i64> [[TMP6]])
+; CHECK-NEXT:    [[TMP10:%.*]] = call i64 @llvm.vector.reduce.umin.v2i64(<2 x i64> [[RDX_MINMAX]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[TMP10]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i64> [[BROADCAST_SPLATINSERT]], <2 x i64> poison, <2 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP11:%.*]] = icmp eq <2 x i64> [[TMP5]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq <2 x i64> [[TMP6]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP13:%.*]] = select <2 x i1> [[TMP11]], <2 x i64> [[TMP7]], <2 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP14:%.*]] = select <2 x i1> [[TMP12]], <2 x i64> [[TMP8]], <2 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[RDX_MINMAX5:%.*]] = call <2 x i64> @llvm.smax.v2i64(<2 x i64> [[TMP13]], <2 x i64> [[TMP14]])
+; CHECK-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vector.reduce.smax.v2i64(<2 x i64> [[RDX_MINMAX5]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP15]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP15]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX6:%.*]] = phi i64 [ [[TMP10]], %[[MIDDLE_BLOCK]] ], [ 100, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 100, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX6]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 8
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp uge i64 [[MIN_VAL]], [[L]]
@@ -59,9 +106,9 @@ define i64 @test_vectorize_select_umin_last_idx(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -131,11 +178,58 @@ define i64 @test_vectorize_select_smin_last_idx(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_smin_last_idx(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <2 x i64> [ <i64 0, i64 1>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <2 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP7:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <2 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP8:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI2:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP5:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI3:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP6:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[STEP_ADD:%.*]] = add <2 x i64> [[VEC_IND]], splat (i64 2)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr i64, ptr [[GEP]], i64 2
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i64>, ptr [[GEP]], align 8
+; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <2 x i64>, ptr [[TMP2]], align 8
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp sge <2 x i64> [[VEC_PHI2]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp sge <2 x i64> [[VEC_PHI3]], [[WIDE_LOAD4]]
+; CHECK-NEXT:    [[TMP5]] = call <2 x i64> @llvm.smin.v2i64(<2 x i64> [[VEC_PHI2]], <2 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP6]] = call <2 x i64> @llvm.smin.v2i64(<2 x i64> [[VEC_PHI3]], <2 x i64> [[WIDE_LOAD4]])
+; CHECK-NEXT:    [[TMP7]] = select <2 x i1> [[TMP3]], <2 x i64> [[VEC_IND]], <2 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[TMP8]] = select <2 x i1> [[TMP4]], <2 x i64> [[STEP_ADD]], <2 x i64> [[VEC_PHI1]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <2 x i64> [[STEP_ADD]], splat (i64 2)
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP9]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[RDX_MINMAX:%.*]] = call <2 x i64> @llvm.smin.v2i64(<2 x i64> [[TMP5]], <2 x i64> [[TMP6]])
+; CHECK-NEXT:    [[TMP10:%.*]] = call i64 @llvm.vector.reduce.smin.v2i64(<2 x i64> [[RDX_MINMAX]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[TMP10]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i64> [[BROADCAST_SPLATINSERT]], <2 x i64> poison, <2 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP11:%.*]] = icmp eq <2 x i64> [[TMP5]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq <2 x i64> [[TMP6]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP13:%.*]] = select <2 x i1> [[TMP11]], <2 x i64> [[TMP7]], <2 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP14:%.*]] = select <2 x i1> [[TMP12]], <2 x i64> [[TMP8]], <2 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[RDX_MINMAX5:%.*]] = call <2 x i64> @llvm.smax.v2i64(<2 x i64> [[TMP13]], <2 x i64> [[TMP14]])
+; CHECK-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vector.reduce.smax.v2i64(<2 x i64> [[RDX_MINMAX5]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP15]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP15]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX6:%.*]] = phi i64 [ [[TMP10]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX6]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 8
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp sge i64 [[MIN_VAL]], [[L]]
@@ -143,9 +237,9 @@ define i64 @test_vectorize_select_smin_last_idx(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP5:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -215,11 +309,58 @@ define i64 @test_vectorize_select_umax_last_idx(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_umax_last_idx(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <2 x i64> [ <i64 0, i64 1>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <2 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP7:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <2 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP8:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI2:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP5:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI3:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP6:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[STEP_ADD:%.*]] = add <2 x i64> [[VEC_IND]], splat (i64 2)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr i64, ptr [[GEP]], i64 2
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i64>, ptr [[GEP]], align 8
+; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <2 x i64>, ptr [[TMP2]], align 8
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp ule <2 x i64> [[VEC_PHI2]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp ule <2 x i64> [[VEC_PHI3]], [[WIDE_LOAD4]]
+; CHECK-NEXT:    [[TMP5]] = call <2 x i64> @llvm.umax.v2i64(<2 x i64> [[VEC_PHI2]], <2 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP6]] = call <2 x i64> @llvm.umax.v2i64(<2 x i64> [[VEC_PHI3]], <2 x i64> [[WIDE_LOAD4]])
+; CHECK-NEXT:    [[TMP7]] = select <2 x i1> [[TMP3]], <2 x i64> [[VEC_IND]], <2 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[TMP8]] = select <2 x i1> [[TMP4]], <2 x i64> [[STEP_ADD]], <2 x i64> [[VEC_PHI1]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <2 x i64> [[STEP_ADD]], splat (i64 2)
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP9]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[RDX_MINMAX:%.*]] = call <2 x i64> @llvm.umax.v2i64(<2 x i64> [[TMP5]], <2 x i64> [[TMP6]])
+; CHECK-NEXT:    [[TMP10:%.*]] = call i64 @llvm.vector.reduce.umax.v2i64(<2 x i64> [[RDX_MINMAX]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[TMP10]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i64> [[BROADCAST_SPLATINSERT]], <2 x i64> poison, <2 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP11:%.*]] = icmp eq <2 x i64> [[TMP5]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq <2 x i64> [[TMP6]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP13:%.*]] = select <2 x i1> [[TMP11]], <2 x i64> [[TMP7]], <2 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP14:%.*]] = select <2 x i1> [[TMP12]], <2 x i64> [[TMP8]], <2 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[RDX_MINMAX5:%.*]] = call <2 x i64> @llvm.smax.v2i64(<2 x i64> [[TMP13]], <2 x i64> [[TMP14]])
+; CHECK-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vector.reduce.smax.v2i64(<2 x i64> [[RDX_MINMAX5]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP15]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP15]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX6:%.*]] = phi i64 [ [[TMP10]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX6]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 8
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ule i64 [[MIN_VAL]], [[L]]
@@ -227,9 +368,9 @@ define i64 @test_vectorize_select_umax_last_idx(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP7:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -299,11 +440,58 @@ define i64 @test_vectorize_select_smax_last_idx(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_smax_last_idx(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <2 x i64> [ <i64 0, i64 1>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <2 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP7:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <2 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP8:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI2:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP5:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI3:%.*]] = phi <2 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP6:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[STEP_ADD:%.*]] = add <2 x i64> [[VEC_IND]], splat (i64 2)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr i64, ptr [[GEP]], i64 2
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i64>, ptr [[GEP]], align 8
+; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <2 x i64>, ptr [[TMP2]], align 8
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp sle <2 x i64> [[VEC_PHI2]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp sle <2 x i64> [[VEC_PHI3]], [[WIDE_LOAD4]]
+; CHECK-NEXT:    [[TMP5]] = call <2 x i64> @llvm.smax.v2i64(<2 x i64> [[VEC_PHI2]], <2 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP6]] = call <2 x i64> @llvm.smax.v2i64(<2 x i64> [[VEC_PHI3]], <2 x i64> [[WIDE_LOAD4]])
+; CHECK-NEXT:    [[TMP7]] = select <2 x i1> [[TMP3]], <2 x i64> [[VEC_IND]], <2 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[TMP8]] = select <2 x i1> [[TMP4]], <2 x i64> [[STEP_ADD]], <2 x i64> [[VEC_PHI1]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <2 x i64> [[STEP_ADD]], splat (i64 2)
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP9]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[RDX_MINMAX:%.*]] = call <2 x i64> @llvm.smax.v2i64(<2 x i64> [[TMP5]], <2 x i64> [[TMP6]])
+; CHECK-NEXT:    [[TMP10:%.*]] = call i64 @llvm.vector.reduce.smax.v2i64(<2 x i64> [[RDX_MINMAX]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[TMP10]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i64> [[BROADCAST_SPLATINSERT]], <2 x i64> poison, <2 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP11:%.*]] = icmp eq <2 x i64> [[TMP5]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq <2 x i64> [[TMP6]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP13:%.*]] = select <2 x i1> [[TMP11]], <2 x i64> [[TMP7]], <2 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP14:%.*]] = select <2 x i1> [[TMP12]], <2 x i64> [[TMP8]], <2 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[RDX_MINMAX5:%.*]] = call <2 x i64> @llvm.smax.v2i64(<2 x i64> [[TMP13]], <2 x i64> [[TMP14]])
+; CHECK-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vector.reduce.smax.v2i64(<2 x i64> [[RDX_MINMAX5]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP15]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP15]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX6:%.*]] = phi i64 [ [[TMP10]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX6]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 8
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp sle i64 [[MIN_VAL]], [[L]]
@@ -311,9 +499,9 @@ define i64 @test_vectorize_select_smax_last_idx(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP9:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -383,22 +571,71 @@ define i32 @test_multi_use_reduction_with_trunc_iv(ptr %src, i32 %n) {
 ; CHECK-NEXT:    [[PRE:%.*]] = icmp eq i32 [[N]], 0
 ; CHECK-NEXT:    br i1 [[PRE]], label %[[EXIT:.*]], label %[[LOOP_PREHEADER:.*]]
 ; CHECK:       [[LOOP_PREHEADER]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N_EXT]], 8
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_EXT]], 8
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_EXT]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[TMP0:%.*]] = add i64 1, [[N_VEC]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP7:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i32> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP8:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI2:%.*]] = phi <4 x i32> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP5:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI3:%.*]] = phi <4 x i32> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP6:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i32> [ <i32 1, i32 2, i32 3, i32 4>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[STEP_ADD:%.*]] = add <4 x i32> [[VEC_IND]], splat (i32 4)
+; CHECK-NEXT:    [[IV:%.*]] = add i64 1, [[INDEX]]
+; CHECK-NEXT:    [[GEP_SRC:%.*]] = getelementptr i32, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr i32, ptr [[GEP_SRC]], i64 4
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[GEP_SRC]], align 4
+; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <4 x i32>, ptr [[TMP2]], align 4
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp ugt <4 x i32> [[WIDE_LOAD]], [[VEC_PHI2]]
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp ugt <4 x i32> [[WIDE_LOAD4]], [[VEC_PHI3]]
+; CHECK-NEXT:    [[TMP5]] = call <4 x i32> @llvm.umin.v4i32(<4 x i32> [[WIDE_LOAD]], <4 x i32> [[VEC_PHI2]])
+; CHECK-NEXT:    [[TMP6]] = call <4 x i32> @llvm.umin.v4i32(<4 x i32> [[WIDE_LOAD4]], <4 x i32> [[VEC_PHI3]])
+; CHECK-NEXT:    [[TMP7]] = select <4 x i1> [[TMP3]], <4 x i32> [[VEC_PHI]], <4 x i32> [[VEC_IND]]
+; CHECK-NEXT:    [[TMP8]] = select <4 x i1> [[TMP4]], <4 x i32> [[VEC_PHI1]], <4 x i32> [[STEP_ADD]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <4 x i32> [[STEP_ADD]], splat (i32 4)
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP9]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[RDX_MINMAX:%.*]] = call <4 x i32> @llvm.umin.v4i32(<4 x i32> [[TMP5]], <4 x i32> [[TMP6]])
+; CHECK-NEXT:    [[TMP10:%.*]] = call i32 @llvm.vector.reduce.umin.v4i32(<4 x i32> [[RDX_MINMAX]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[TMP10]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP11:%.*]] = icmp eq <4 x i32> [[TMP5]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq <4 x i32> [[TMP6]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP13:%.*]] = select <4 x i1> [[TMP11]], <4 x i32> [[TMP7]], <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP14:%.*]] = select <4 x i1> [[TMP12]], <4 x i32> [[TMP8]], <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[RDX_MINMAX5:%.*]] = call <4 x i32> @llvm.umax.v4i32(<4 x i32> [[TMP13]], <4 x i32> [[TMP14]])
+; CHECK-NEXT:    [[TMP15:%.*]] = call i32 @llvm.vector.reduce.umax.v4i32(<4 x i32> [[RDX_MINMAX5]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i32 [[TMP15]], 0
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i32 [[TMP15]], i32 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N_EXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT_LOOPEXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[TMP0]], %[[MIDDLE_BLOCK]] ], [ 1, %[[LOOP_PREHEADER]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i32 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[LOOP_PREHEADER]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX6:%.*]] = phi i32 [ [[TMP10]], %[[MIDDLE_BLOCK]] ], [ 0, %[[LOOP_PREHEADER]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[IV_NEXT:%.*]], %[[LOOP]] ], [ 1, %[[LOOP_PREHEADER]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i32 [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ], [ 0, %[[LOOP_PREHEADER]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i32 [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ], [ 0, %[[LOOP_PREHEADER]] ]
-; CHECK-NEXT:    [[GEP_SRC:%.*]] = getelementptr i32, ptr [[SRC]], i64 [[IV]]
-; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[GEP_SRC]], align 4
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[IV_NEXT:%.*]], %[[LOOP]] ], [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i32 [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ], [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i32 [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ], [ [[BC_MERGE_RDX6]], %[[SCALAR_PH]] ]
+; CHECK-NEXT:    [[GEP_SRC1:%.*]] = getelementptr i32, ptr [[SRC]], i64 [[IV1]]
+; CHECK-NEXT:    [[L:%.*]] = load i32, ptr [[GEP_SRC1]], align 4
 ; CHECK-NEXT:    [[C_0:%.*]] = icmp ugt i32 [[L]], [[MIN_VAL]]
 ; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i32 @llvm.umin.i32(i32 [[L]], i32 [[MIN_VAL]])
-; CHECK-NEXT:    [[IV_TRUNC:%.*]] = trunc i64 [[IV]] to i32
+; CHECK-NEXT:    [[IV_TRUNC:%.*]] = trunc i64 [[IV1]] to i32
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[C_0]], i32 [[MIN_IDX]], i32 [[IV_TRUNC]]
-; CHECK-NEXT:    [[IV_NEXT]] = add i64 [[IV]], 1
-; CHECK-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV]], [[N_EXT]]
-; CHECK-NEXT:    br i1 [[EC]], label %[[EXIT_LOOPEXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    [[IV_NEXT]] = add i64 [[IV1]], 1
+; CHECK-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV1]], [[N_EXT]]
+; CHECK-NEXT:    br i1 [[EC]], label %[[EXIT_LOOPEXIT]], label %[[LOOP]], !llvm.loop [[LOOP11:![0-9]+]]
 ; CHECK:       [[EXIT_LOOPEXIT]]:
-; CHECK-NEXT:    [[MIN_IDX_NEXT_LCSSA:%.*]] = phi i32 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX_NEXT_LCSSA:%.*]] = phi i32 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    br label %[[EXIT]]
 ; CHECK:       [[EXIT]]:
 ; CHECK-NEXT:    [[RES:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT_LCSSA]], %[[EXIT_LOOPEXIT]] ]
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll
index 955b4d45d7222..8935010e71676 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll
@@ -1197,8 +1197,8 @@ define void @PR27626_5(ptr %a, i32 %x, i32 %y, i32 %z, i64 %n) #1 {
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 4 x i64> [[VEC_IND]], splat (i64 -1)
-; CHECK-NEXT:    [[TMP14:%.*]] = add <vscale x 4 x i64> [[VEC_IND]], splat (i64 -3)
+; CHECK-NEXT:    [[TMP13:%.*]] = add nsw <vscale x 4 x i64> [[VEC_IND]], splat (i64 -1)
+; CHECK-NEXT:    [[TMP14:%.*]] = add nsw <vscale x 4 x i64> [[VEC_IND]], splat (i64 -3)
 ; CHECK-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], <vscale x 4 x i64> [[VEC_IND]]
 ; CHECK-NEXT:    [[TMP16:%.*]] = getelementptr inbounds i32, ptr [[A]], <vscale x 4 x i64> [[TMP13]]
 ; CHECK-NEXT:    [[TMP17:%.*]] = getelementptr inbounds i32, ptr [[A]], <vscale x 4 x i64> [[TMP14]]
@@ -1286,7 +1286,7 @@ define void @PR34743(ptr %a, ptr %b, i64 %n) #1 {
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[VECTOR_RECUR:%.*]] = phi <vscale x 4 x i16> [ [[VECTOR_RECUR_INIT]], [[VECTOR_PH]] ], [ [[WIDE_MASKED_GATHER4:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[TMP15]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[TMP18:%.*]] = add nuw nsw <vscale x 4 x i64> [[VEC_IND]], splat (i64 1)
+; CHECK-NEXT:    [[TMP18:%.*]] = or disjoint <vscale x 4 x i64> [[VEC_IND]], splat (i64 1)
 ; CHECK-NEXT:    [[TMP19:%.*]] = add nuw nsw <vscale x 4 x i64> [[VEC_IND]], splat (i64 2)
 ; CHECK-NEXT:    [[TMP20:%.*]] = getelementptr inbounds i16, ptr [[A]], <vscale x 4 x i64> [[TMP18]]
 ; CHECK-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i16> @llvm.masked.gather.nxv4i16.nxv4p0(<vscale x 4 x ptr> align 4 [[TMP20]], <vscale x 4 x i1> splat (i1 true), <vscale x 4 x i16> poison), !alias.scope [[META34:![0-9]+]]
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/gather-scatter-cost.ll b/llvm/test/Transforms/LoopVectorize/RISCV/gather-scatter-cost.ll
index 877484f5159fd..36ebd422b5d7b 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/gather-scatter-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/gather-scatter-cost.ll
@@ -219,3 +219,119 @@ loop:
 exit:
   ret void
 }
+
+; Test for https://github.com/llvm/llvm-project/issues/169948.
+define i8 @mixed_gather_scatters(ptr %A, ptr %B, ptr %C) #0 {
+; RVA23-LABEL: @mixed_gather_scatters(
+; RVA23-NEXT:  entry:
+; RVA23-NEXT:    br label [[VECTOR_PH:%.*]]
+; RVA23:       vector.ph:
+; RVA23-NEXT:    br label [[VECTOR_BODY:%.*]]
+; RVA23:       vector.body:
+; RVA23-NEXT:    [[VEC_PHI:%.*]] = phi <vscale x 2 x i8> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP14:%.*]], [[VECTOR_BODY]] ]
+; RVA23-NEXT:    [[AVL:%.*]] = phi i32 [ 10, [[VECTOR_PH]] ], [ [[AVL_NEXT:%.*]], [[VECTOR_BODY]] ]
+; RVA23-NEXT:    [[TMP0:%.*]] = call i32 @llvm.experimental.get.vector.length.i32(i32 [[AVL]], i32 2, i1 true)
+; RVA23-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[A:%.*]], align 8
+; RVA23-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x ptr> poison, ptr [[TMP1]], i64 0
+; RVA23-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x ptr> [[BROADCAST_SPLATINSERT]], <vscale x 2 x ptr> poison, <vscale x 2 x i32> zeroinitializer
+; RVA23-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 2 x i64> @llvm.vp.gather.nxv2i64.nxv2p0(<vscale x 2 x ptr> align 8 [[BROADCAST_SPLAT]], <vscale x 2 x i1> splat (i1 true), i32 [[TMP0]])
+; RVA23-NEXT:    [[TMP2:%.*]] = icmp sgt <vscale x 2 x i64> [[WIDE_MASKED_GATHER]], zeroinitializer
+; RVA23-NEXT:    [[TMP3:%.*]] = zext <vscale x 2 x i1> [[TMP2]] to <vscale x 2 x i8>
+; RVA23-NEXT:    [[TMP4:%.*]] = or <vscale x 2 x i8> [[VEC_PHI]], [[TMP3]]
+; RVA23-NEXT:    [[TMP5:%.*]] = load ptr, ptr [[B:%.*]], align 8
+; RVA23-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 2 x ptr> poison, ptr [[TMP5]], i64 0
+; RVA23-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 2 x ptr> [[BROADCAST_SPLATINSERT1]], <vscale x 2 x ptr> poison, <vscale x 2 x i32> zeroinitializer
+; RVA23-NEXT:    [[WIDE_MASKED_GATHER3:%.*]] = call <vscale x 2 x i64> @llvm.vp.gather.nxv2i64.nxv2p0(<vscale x 2 x ptr> align 8 [[BROADCAST_SPLAT2]], <vscale x 2 x i1> splat (i1 true), i32 [[TMP0]])
+; RVA23-NEXT:    [[TMP6:%.*]] = icmp sgt <vscale x 2 x i64> [[WIDE_MASKED_GATHER3]], zeroinitializer
+; RVA23-NEXT:    [[TMP7:%.*]] = zext <vscale x 2 x i1> [[TMP6]] to <vscale x 2 x i8>
+; RVA23-NEXT:    [[TMP8:%.*]] = or <vscale x 2 x i8> [[TMP4]], [[TMP7]]
+; RVA23-NEXT:    [[TMP9:%.*]] = or <vscale x 2 x i8> [[TMP8]], splat (i8 1)
+; RVA23-NEXT:    [[TMP10:%.*]] = load ptr, ptr [[C:%.*]], align 8
+; RVA23-NEXT:    [[BROADCAST_SPLATINSERT4:%.*]] = insertelement <vscale x 2 x ptr> poison, ptr [[TMP10]], i64 0
+; RVA23-NEXT:    [[BROADCAST_SPLAT5:%.*]] = shufflevector <vscale x 2 x ptr> [[BROADCAST_SPLATINSERT4]], <vscale x 2 x ptr> poison, <vscale x 2 x i32> zeroinitializer
+; RVA23-NEXT:    [[WIDE_MASKED_GATHER6:%.*]] = call <vscale x 2 x i64> @llvm.vp.gather.nxv2i64.nxv2p0(<vscale x 2 x ptr> align 8 [[BROADCAST_SPLAT5]], <vscale x 2 x i1> splat (i1 true), i32 [[TMP0]])
+; RVA23-NEXT:    [[TMP11:%.*]] = icmp sgt <vscale x 2 x i64> [[WIDE_MASKED_GATHER6]], zeroinitializer
+; RVA23-NEXT:    [[TMP12:%.*]] = zext <vscale x 2 x i1> [[TMP11]] to <vscale x 2 x i8>
+; RVA23-NEXT:    [[TMP13:%.*]] = or <vscale x 2 x i8> [[TMP9]], [[TMP12]]
+; RVA23-NEXT:    [[TMP14]] = call <vscale x 2 x i8> @llvm.vp.merge.nxv2i8(<vscale x 2 x i1> splat (i1 true), <vscale x 2 x i8> [[TMP13]], <vscale x 2 x i8> [[VEC_PHI]], i32 [[TMP0]])
+; RVA23-NEXT:    [[AVL_NEXT]] = sub nuw i32 [[AVL]], [[TMP0]]
+; RVA23-NEXT:    [[TMP15:%.*]] = icmp eq i32 [[AVL_NEXT]], 0
+; RVA23-NEXT:    br i1 [[TMP15]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP11:![0-9]+]]
+; RVA23:       middle.block:
+; RVA23-NEXT:    [[TMP16:%.*]] = call i8 @llvm.vector.reduce.or.nxv2i8(<vscale x 2 x i8> [[TMP14]])
+; RVA23-NEXT:    br label [[EXIT:%.*]]
+; RVA23:       exit:
+; RVA23-NEXT:    ret i8 [[TMP16]]
+;
+; RVA23ZVL1024B-LABEL: @mixed_gather_scatters(
+; RVA23ZVL1024B-NEXT:  entry:
+; RVA23ZVL1024B-NEXT:    br label [[VECTOR_PH:%.*]]
+; RVA23ZVL1024B:       vector.ph:
+; RVA23ZVL1024B-NEXT:    br label [[VECTOR_BODY:%.*]]
+; RVA23ZVL1024B:       vector.body:
+; RVA23ZVL1024B-NEXT:    [[VEC_PHI:%.*]] = phi <vscale x 1 x i8> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP14:%.*]], [[VECTOR_BODY]] ]
+; RVA23ZVL1024B-NEXT:    [[AVL:%.*]] = phi i32 [ 10, [[VECTOR_PH]] ], [ [[AVL_NEXT:%.*]], [[VECTOR_BODY]] ]
+; RVA23ZVL1024B-NEXT:    [[TMP0:%.*]] = call i32 @llvm.experimental.get.vector.length.i32(i32 [[AVL]], i32 1, i1 true)
+; RVA23ZVL1024B-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[A:%.*]], align 8
+; RVA23ZVL1024B-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 1 x ptr> poison, ptr [[TMP1]], i64 0
+; RVA23ZVL1024B-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 1 x ptr> [[BROADCAST_SPLATINSERT]], <vscale x 1 x ptr> poison, <vscale x 1 x i32> zeroinitializer
+; RVA23ZVL1024B-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 1 x i64> @llvm.vp.gather.nxv1i64.nxv1p0(<vscale x 1 x ptr> align 8 [[BROADCAST_SPLAT]], <vscale x 1 x i1> splat (i1 true), i32 [[TMP0]])
+; RVA23ZVL1024B-NEXT:    [[TMP2:%.*]] = icmp sgt <vscale x 1 x i64> [[WIDE_MASKED_GATHER]], zeroinitializer
+; RVA23ZVL1024B-NEXT:    [[TMP3:%.*]] = zext <vscale x 1 x i1> [[TMP2]] to <vscale x 1 x i8>
+; RVA23ZVL1024B-NEXT:    [[TMP4:%.*]] = or <vscale x 1 x i8> [[VEC_PHI]], [[TMP3]]
+; RVA23ZVL1024B-NEXT:    [[TMP5:%.*]] = load ptr, ptr [[B:%.*]], align 8
+; RVA23ZVL1024B-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 1 x ptr> poison, ptr [[TMP5]], i64 0
+; RVA23ZVL1024B-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 1 x ptr> [[BROADCAST_SPLATINSERT1]], <vscale x 1 x ptr> poison, <vscale x 1 x i32> zeroinitializer
+; RVA23ZVL1024B-NEXT:    [[WIDE_MASKED_GATHER3:%.*]] = call <vscale x 1 x i64> @llvm.vp.gather.nxv1i64.nxv1p0(<vscale x 1 x ptr> align 8 [[BROADCAST_SPLAT2]], <vscale x 1 x i1> splat (i1 true), i32 [[TMP0]])
+; RVA23ZVL1024B-NEXT:    [[TMP6:%.*]] = icmp sgt <vscale x 1 x i64> [[WIDE_MASKED_GATHER3]], zeroinitializer
+; RVA23ZVL1024B-NEXT:    [[TMP7:%.*]] = zext <vscale x 1 x i1> [[TMP6]] to <vscale x 1 x i8>
+; RVA23ZVL1024B-NEXT:    [[TMP8:%.*]] = or <vscale x 1 x i8> [[TMP4]], [[TMP7]]
+; RVA23ZVL1024B-NEXT:    [[TMP9:%.*]] = or <vscale x 1 x i8> [[TMP8]], splat (i8 1)
+; RVA23ZVL1024B-NEXT:    [[TMP10:%.*]] = load ptr, ptr [[C:%.*]], align 8
+; RVA23ZVL1024B-NEXT:    [[BROADCAST_SPLATINSERT4:%.*]] = insertelement <vscale x 1 x ptr> poison, ptr [[TMP10]], i64 0
+; RVA23ZVL1024B-NEXT:    [[BROADCAST_SPLAT5:%.*]] = shufflevector <vscale x 1 x ptr> [[BROADCAST_SPLATINSERT4]], <vscale x 1 x ptr> poison, <vscale x 1 x i32> zeroinitializer
+; RVA23ZVL1024B-NEXT:    [[WIDE_MASKED_GATHER6:%.*]] = call <vscale x 1 x i64> @llvm.vp.gather.nxv1i64.nxv1p0(<vscale x 1 x ptr> align 8 [[BROADCAST_SPLAT5]], <vscale x 1 x i1> splat (i1 true), i32 [[TMP0]])
+; RVA23ZVL1024B-NEXT:    [[TMP11:%.*]] = icmp sgt <vscale x 1 x i64> [[WIDE_MASKED_GATHER6]], zeroinitializer
+; RVA23ZVL1024B-NEXT:    [[TMP12:%.*]] = zext <vscale x 1 x i1> [[TMP11]] to <vscale x 1 x i8>
+; RVA23ZVL1024B-NEXT:    [[TMP13:%.*]] = or <vscale x 1 x i8> [[TMP9]], [[TMP12]]
+; RVA23ZVL1024B-NEXT:    [[TMP14]] = call <vscale x 1 x i8> @llvm.vp.merge.nxv1i8(<vscale x 1 x i1> splat (i1 true), <vscale x 1 x i8> [[TMP13]], <vscale x 1 x i8> [[VEC_PHI]], i32 [[TMP0]])
+; RVA23ZVL1024B-NEXT:    [[AVL_NEXT]] = sub nuw i32 [[AVL]], [[TMP0]]
+; RVA23ZVL1024B-NEXT:    [[TMP15:%.*]] = icmp eq i32 [[AVL_NEXT]], 0
+; RVA23ZVL1024B-NEXT:    br i1 [[TMP15]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP11:![0-9]+]]
+; RVA23ZVL1024B:       middle.block:
+; RVA23ZVL1024B-NEXT:    [[TMP16:%.*]] = call i8 @llvm.vector.reduce.or.nxv1i8(<vscale x 1 x i8> [[TMP14]])
+; RVA23ZVL1024B-NEXT:    br label [[EXIT:%.*]]
+; RVA23ZVL1024B:       exit:
+; RVA23ZVL1024B-NEXT:    ret i8 [[TMP16]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop ]
+  %accum = phi i8 [ 0, %entry ], [ %or.4, %loop ]
+  %ptr.0 = load ptr, ptr %A, align 8
+  %val.0 = load i64, ptr %ptr.0, align 8
+  %cmp.0 = icmp sgt i64 %val.0, 0
+  %ext.0 = zext i1 %cmp.0 to i8
+  %or.0 = or i8 %accum, %ext.0
+  %ptr.1 = load ptr, ptr %B, align 8
+  %val.1 = load i64, ptr %ptr.1, align 8
+  %cmp.1 = icmp sgt i64 %val.1, 0
+  %ext.1 = zext i1 %cmp.1 to i8
+  %or.1 = or i8 %or.0, %ext.1
+  %or.2 = or i8 %or.1, 1
+  %ptr.4 = load ptr, ptr %C, align 8
+  %val.4 = load i64, ptr %ptr.4, align 8
+  %cmp.4 = icmp sgt i64 %val.4, 0
+  %ext.4 = zext i1 %cmp.4 to i8
+  %or.4 = or i8 %or.2, %ext.4
+  %iv.next = add i32 %iv, 1
+  %exitcond = icmp eq i32 %iv, 9
+  br i1 %exitcond, label %exit, label %loop
+
+exit:
+  ret i8 %or.4
+}
+
+attributes #0 = { "target-features"="+zve64x,+zvl256b" }
diff --git a/llvm/test/Transforms/LoopVectorize/X86/uniformshift.ll b/llvm/test/Transforms/LoopVectorize/X86/uniformshift.ll
index 166875dd55aae..02c0b676374f4 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/uniformshift.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/uniformshift.ll
@@ -5,7 +5,7 @@
 ; CHECK: LV: Found an estimated cost of 1 for VF 1 For instruction:   %shift = ashr i32 %val, %k
 ; CHECK: Cost of 2 for VF 2: WIDEN ir<%shift> = ashr ir<%val>, ir<%k>
 ; CHECK: Cost of 2 for VF 4: WIDEN ir<%shift> = ashr ir<%val>, ir<%k>
-define void @foo(ptr nocapture %p, i32 %k) local_unnamed_addr #0 {
+define void @foo(ptr nocapture %p, i32 %k) local_unnamed_addr {
 entry:
   br label %body
 
@@ -21,5 +21,64 @@ body:
 
 exit:
   ret void
+}
+
+; CHECK: 'shift_and_masked_load_store'
+; CHECK: Cost of 1 for VF 2: CLONE ir<%shifted> = lshr vp<{{.+}}>, ir<2>
+; CHECK: Cost of 1 for VF 4: CLONE ir<%shifted> = lshr vp<{{.+}}>, ir<2>
+; CHECK: Cost of 4 for VF 8: WIDEN ir<%shifted> = lshr ir<%iv>, ir<2>
+define void @shift_and_masked_load_store(i64 %trip.count) #0 {
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %shifted = lshr i64 %iv, 2
+  %masked.idx = and i64 %shifted, 1
+  %load.ptr = getelementptr i16, ptr poison, i64 %masked.idx
+  %val = load i16, ptr %load.ptr, align 2
+  %store.idx = shl nuw i64 %iv, 2
+  %store.ptr = getelementptr i8, ptr poison, i64 %store.idx
+  store i16 %val, ptr %store.ptr, align 2
+  %iv.next = add i64 %iv, 1
+  %cmp = icmp eq i64 %iv, %trip.count
+  br i1 %cmp, label %exit, label %loop
 
+exit:
+  ret void
 }
+
+define i64 @sdiv_arg_outer_iv(ptr noalias %dst, ptr %src) {
+; CHECK: 'sdiv_arg_outer_iv'
+; CHECK: Cost of 0 for VF 2: CLONE ir<%div> = sdiv ir<%add.offset>, ir<8>
+; CHECK: Cost of 0 for VF 4: CLONE ir<%div> = sdiv ir<%add.offset>, ir<8>
+; CHECK: Cost of 0 for VF 8: CLONE ir<%div> = sdiv ir<%add.offset>, ir<8>
+; CHECK: Cost of 0 for VF 16: REPLICATE ir<%div> = sdiv ir<%add.offset>, ir<8>
+entry:
+  br label %outer.header
+
+outer.header:
+  %outer.iv = phi i32 [ 0, %entry ], [ %outer.iv.next, %outer.latch ]
+  %offset = shl nsw i32 %outer.iv, 7
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %outer.header ], [ %iv.next, %loop ]
+  %iv.trunc = trunc i64 %iv to i32
+  %add.offset = add i32 %offset, %iv.trunc
+  %div = sdiv i32 %add.offset, 8
+  %div.ext = sext i32 %div to i64
+  %gep.src = getelementptr i8, ptr %src, i64 %div.ext
+  %l = load i8, ptr %gep.src, align 1
+  %gep.dst = getelementptr i8, ptr %dst, i64 %iv
+  store i8 %l, ptr %gep.dst, align 1
+  %iv.next = add i64 %iv, 1
+  %ec = icmp eq i64 %iv, 64
+  br i1 %ec, label %outer.latch, label %loop
+
+outer.latch:
+  %outer.iv.next = add nsw i32 %outer.iv, 1
+  br label %outer.header
+}
+
+attributes #0 = { "target-features"="+avx2" "tune-cpu"="alderlake" }
diff --git a/llvm/test/Transforms/LoopVectorize/diag-disabled-vectorization-msgs.ll b/llvm/test/Transforms/LoopVectorize/diag-disabled-vectorization-msgs.ll
new file mode 100644
index 0000000000000..31454401876b6
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/diag-disabled-vectorization-msgs.ll
@@ -0,0 +1,94 @@
+; REQUIRES: asserts
+
+; TEST 1
+; Checks that we emit only the correct debug messages and
+; optimization remark when the loop vectorizer is disabled by loop metadata.
+; RUN: opt -S -passes=loop-vectorize -pass-remarks=loop-vectorize \
+; RUN:     -pass-remarks-missed=loop-vectorize \
+; RUN:     -pass-remarks-analysis=loop-vectorize -debug \
+; RUN:     < %s 2>&1 | FileCheck --check-prefixes=METADATA,ALL %s
+; TEST 2
+; Checks that we emit only the correct debug messages and
+; optimization remark when the loop is not vectorized due to the
+; vectorize-forced-only pass option being set.
+; Strip metadata for FORCEDONLY run, keep it for METADATA run
+; RUN: sed 's/,[[:space:]]*!llvm\.loop[[:space:]]*!0//' %s | \
+; RUN: opt -S -passes='loop-vectorize<vectorize-forced-only>' \
+; RUN:   -pass-remarks=loop-vectorize \
+; RUN:   -pass-remarks-missed=loop-vectorize \
+; RUN:   -pass-remarks-analysis=loop-vectorize -debug \
+; RUN:   2>&1 | FileCheck --check-prefixes=FORCEDONLY,ALL %s
+; TEST 3
+; Checks that we emit only the correct debug messages and
+; optimization remark when the loop vectorizer is disabled by loop metadata
+; that requests no loop transformations.
+; RUN: opt -S -passes=loop-vectorize -pass-remarks=loop-vectorize \
+; RUN:     -pass-remarks-missed=loop-vectorize \
+; RUN:     -pass-remarks-analysis=loop-vectorize -debug \
+; RUN:     -force-vector-interleave=1 -force-vector-width=2 \
+; RUN:     < %s 2>&1 | FileCheck --check-prefix=ALL %s
+
+; ALL-LABEL: 'disabled_loop_vectorization' from <stdin>
+; ALL-NOT: LV: We can vectorize this loop
+; ALL-NOT: LV: Not vectorizing: loop hasDisableAllTransformsHint
+; ALL-NOT: LV: Not vectorizing: Disabled/already vectorized
+; ALL-NOT: LV: Not vectorizing: Cannot prove legality
+;
+; METADATA-NOT: LV: Not vectorizing: VectorizeOnlyWhenForced is set
+; METADATA: LV: Loop hints: force=disabled
+; METADATA: LV: Not vectorizing: #pragma vectorize disable.
+; METADATA: remark:
+; METADATA-SAME: loop not vectorized: vectorization is explicitly disabled
+;
+; FORCEDONLY-NOT: LV: Not vectorizing: #pragma vectorize disable
+; FORCEDONLY: LV: Loop hints: force=?
+; FORCEDONLY: LV: Not vectorizing: VectorizeOnlyWhenForced is set, and no #pragma vectorize enable
+; FORCEDONLY: remark:
+; FORCEDONLY-SAME: loop not vectorized: only vectorizing loops that explicitly request it
+;
+; ALL: LV: Loop hints prevent vectorization
+define void @disabled_loop_vectorization(ptr %src) {
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %inc, %loop ]
+  %arrayidx = getelementptr inbounds nuw double, ptr %src, i64 %iv
+  store double 0.0, ptr %arrayidx, align 8
+  %inc = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %inc, 15
+  br i1 %exitcond.not, label %exit, label %loop, !llvm.loop !0
+
+exit:
+  ret void
+}
+!0 = distinct !{!0, !1}
+!1 = !{!"llvm.loop.vectorize.enable", i1 false}
+
+; ALL-LABEL: 'disable_nonforced' from <stdin>
+; ALL-NOT: LV: We can vectorize this loop
+; ALL-NOT: LV: Not vectorizing: #pragma vectorize disable.
+; ALL-NOT: LV: Not vectorizing: VectorizeOnlyWhenForced is set
+; ALL-NOT: LV: Not vectorizing: Disabled/already vectorized
+; ALL-NOT: LV: Not vectorizing: Cannot prove legality
+; ALL: LV: Loop hints: force=disabled
+; ALL: LV: Not vectorizing: loop hasDisableAllTransformsHint.
+; ALL: remark:
+; ALL-SAME: loop not vectorized: loop transformations are disabled
+; ALL: LV: Loop hints prevent vectorization
+define void @disable_nonforced(ptr nocapture %a, i32 %n) {
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [ %iv.next, %loop ], [ 0, %entry ]
+  %arrayidx = getelementptr inbounds i32, ptr %a, i32 %iv
+  store i32 %iv, ptr %arrayidx, align 4
+  %iv.next = add i32 %iv, 1
+  %exitcond = icmp eq i32 %iv.next, %n
+  br i1 %exitcond, label %end, label %loop, !llvm.loop !2
+
+end:
+  ret void
+}
+!2 = !{!2, !{!"llvm.loop.disable_nonforced"}}
diff --git a/llvm/test/Transforms/LoopVectorize/hoist-predicated-loads-with-predicated-stores.ll b/llvm/test/Transforms/LoopVectorize/hoist-predicated-loads-with-predicated-stores.ll
index 87942911e915f..cdbe9bb555834 100644
--- a/llvm/test/Transforms/LoopVectorize/hoist-predicated-loads-with-predicated-stores.ll
+++ b/llvm/test/Transforms/LoopVectorize/hoist-predicated-loads-with-predicated-stores.ll
@@ -21,13 +21,12 @@ define void @test_stores_noalias_via_rt_checks_after_loads(ptr %dst, ptr %src, p
 ; CHECK:       [[VECTOR_PH]]:
 ; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
 ; CHECK:       [[VECTOR_BODY]]:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE11:.*]] ]
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP4:%.*]] = add i32 [[INDEX]], 0
 ; CHECK-NEXT:    [[TMP5:%.*]] = add i32 [[INDEX]], 1
 ; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[COND]], i32 [[TMP4]]
 ; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[TMP6]], align 4, !alias.scope [[META0:![0-9]+]]
 ; CHECK-NEXT:    [[TMP7:%.*]] = icmp ule <2 x i32> [[WIDE_LOAD]], splat (i32 11)
-; CHECK-NEXT:    [[TMP8:%.*]] = xor <2 x i1> [[TMP7]], splat (i1 true)
 ; CHECK-NEXT:    [[TMP10:%.*]] = getelementptr inbounds i32, ptr [[SRC]], i32 [[TMP4]]
 ; CHECK-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i32, ptr [[SRC]], i32 [[TMP5]]
 ; CHECK-NEXT:    [[TMP9:%.*]] = load i32, ptr [[TMP10]], align 4, !alias.scope [[META3:![0-9]+]]
@@ -35,39 +34,14 @@ define void @test_stores_noalias_via_rt_checks_after_loads(ptr %dst, ptr %src, p
 ; CHECK-NEXT:    [[TMP13:%.*]] = insertelement <2 x i32> poison, i32 [[TMP9]], i32 0
 ; CHECK-NEXT:    [[TMP17:%.*]] = insertelement <2 x i32> [[TMP13]], i32 [[TMP16]], i32 1
 ; CHECK-NEXT:    [[TMP19:%.*]] = sub <2 x i32> [[TMP17]], splat (i32 5)
-; CHECK-NEXT:    [[TMP20:%.*]] = extractelement <2 x i1> [[TMP8]], i32 0
-; CHECK-NEXT:    br i1 [[TMP20]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
-; CHECK:       [[PRED_STORE_IF]]:
+; CHECK-NEXT:    [[TMP36:%.*]] = add <2 x i32> [[TMP17]], splat (i32 10)
 ; CHECK-NEXT:    [[TMP21:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
-; CHECK-NEXT:    [[TMP22:%.*]] = extractelement <2 x i32> [[TMP19]], i32 0
-; CHECK-NEXT:    store i32 [[TMP22]], ptr [[TMP21]], align 4, !alias.scope [[META5:![0-9]+]], !noalias [[META7:![0-9]+]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE]]
-; CHECK:       [[PRED_STORE_CONTINUE]]:
-; CHECK-NEXT:    [[TMP23:%.*]] = extractelement <2 x i1> [[TMP8]], i32 1
-; CHECK-NEXT:    br i1 [[TMP23]], label %[[PRED_STORE_IF6:.*]], label %[[PRED_STORE_CONTINUE7:.*]]
-; CHECK:       [[PRED_STORE_IF6]]:
 ; CHECK-NEXT:    [[TMP24:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP5]]
-; CHECK-NEXT:    [[TMP25:%.*]] = extractelement <2 x i32> [[TMP19]], i32 1
-; CHECK-NEXT:    store i32 [[TMP25]], ptr [[TMP24]], align 4, !alias.scope [[META5]], !noalias [[META7]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE7]]
-; CHECK:       [[PRED_STORE_CONTINUE7]]:
-; CHECK-NEXT:    [[TMP36:%.*]] = add <2 x i32> [[TMP17]], splat (i32 10)
-; CHECK-NEXT:    [[TMP37:%.*]] = extractelement <2 x i1> [[TMP7]], i32 0
-; CHECK-NEXT:    br i1 [[TMP37]], label %[[PRED_STORE_IF8:.*]], label %[[PRED_STORE_CONTINUE9:.*]]
-; CHECK:       [[PRED_STORE_IF8]]:
-; CHECK-NEXT:    [[TMP38:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
-; CHECK-NEXT:    [[TMP39:%.*]] = extractelement <2 x i32> [[TMP36]], i32 0
-; CHECK-NEXT:    store i32 [[TMP39]], ptr [[TMP38]], align 4, !alias.scope [[META5]], !noalias [[META7]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE9]]
-; CHECK:       [[PRED_STORE_CONTINUE9]]:
-; CHECK-NEXT:    [[TMP40:%.*]] = extractelement <2 x i1> [[TMP7]], i32 1
-; CHECK-NEXT:    br i1 [[TMP40]], label %[[PRED_STORE_IF10:.*]], label %[[PRED_STORE_CONTINUE11]]
-; CHECK:       [[PRED_STORE_IF10]]:
-; CHECK-NEXT:    [[TMP41:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP5]]
-; CHECK-NEXT:    [[TMP42:%.*]] = extractelement <2 x i32> [[TMP36]], i32 1
-; CHECK-NEXT:    store i32 [[TMP42]], ptr [[TMP41]], align 4, !alias.scope [[META5]], !noalias [[META7]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE11]]
-; CHECK:       [[PRED_STORE_CONTINUE11]]:
+; CHECK-NEXT:    [[TMP14:%.*]] = select <2 x i1> [[TMP7]], <2 x i32> [[TMP36]], <2 x i32> [[TMP19]]
+; CHECK-NEXT:    [[TMP18:%.*]] = extractelement <2 x i32> [[TMP14]], i32 0
+; CHECK-NEXT:    store i32 [[TMP18]], ptr [[TMP21]], align 4, !alias.scope [[META5:![0-9]+]], !noalias [[META7:![0-9]+]]
+; CHECK-NEXT:    [[TMP20:%.*]] = extractelement <2 x i32> [[TMP14]], i32 1
+; CHECK-NEXT:    store i32 [[TMP20]], ptr [[TMP24]], align 4, !alias.scope [[META5]], !noalias [[META7]]
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
 ; CHECK-NEXT:    [[TMP43:%.*]] = icmp eq i32 [[INDEX_NEXT]], 100
 ; CHECK-NEXT:    br i1 [[TMP43]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
@@ -134,7 +108,7 @@ define void @test_aliasing_store(ptr %dst, ptr %src, ptr %cond) {
 ; CHECK:       [[VECTOR_PH]]:
 ; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
 ; CHECK:       [[VECTOR_BODY]]:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE21:.*]] ]
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_LOAD_CONTINUE15:.*]] ]
 ; CHECK-NEXT:    [[TMP4:%.*]] = add i32 [[INDEX]], 0
 ; CHECK-NEXT:    [[TMP5:%.*]] = add i32 [[INDEX]], 1
 ; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[COND]], i32 [[TMP4]]
@@ -162,57 +136,32 @@ define void @test_aliasing_store(ptr %dst, ptr %src, ptr %cond) {
 ; CHECK:       [[PRED_LOAD_CONTINUE11]]:
 ; CHECK-NEXT:    [[TMP18:%.*]] = phi <2 x i32> [ [[TMP13]], %[[PRED_LOAD_CONTINUE]] ], [ [[TMP17]], %[[PRED_LOAD_IF10]] ]
 ; CHECK-NEXT:    [[TMP19:%.*]] = sub <2 x i32> [[TMP18]], splat (i32 5)
-; CHECK-NEXT:    [[TMP20:%.*]] = extractelement <2 x i1> [[TMP8]], i32 0
-; CHECK-NEXT:    br i1 [[TMP20]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
-; CHECK:       [[PRED_STORE_IF]]:
-; CHECK-NEXT:    [[TMP21:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
-; CHECK-NEXT:    [[TMP22:%.*]] = extractelement <2 x i32> [[TMP19]], i32 0
-; CHECK-NEXT:    store i32 [[TMP22]], ptr [[TMP21]], align 4, !alias.scope [[META19:![0-9]+]], !noalias [[META12]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE]]
-; CHECK:       [[PRED_STORE_CONTINUE]]:
-; CHECK-NEXT:    [[TMP23:%.*]] = extractelement <2 x i1> [[TMP8]], i32 1
-; CHECK-NEXT:    br i1 [[TMP23]], label %[[PRED_STORE_IF12:.*]], label %[[PRED_STORE_CONTINUE13:.*]]
-; CHECK:       [[PRED_STORE_IF12]]:
-; CHECK-NEXT:    [[TMP24:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP5]]
-; CHECK-NEXT:    [[TMP25:%.*]] = extractelement <2 x i32> [[TMP19]], i32 1
-; CHECK-NEXT:    store i32 [[TMP25]], ptr [[TMP24]], align 4, !alias.scope [[META19]], !noalias [[META12]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE13]]
-; CHECK:       [[PRED_STORE_CONTINUE13]]:
 ; CHECK-NEXT:    [[TMP26:%.*]] = extractelement <2 x i1> [[TMP7]], i32 0
-; CHECK-NEXT:    br i1 [[TMP26]], label %[[PRED_LOAD_IF14:.*]], label %[[PRED_LOAD_CONTINUE15:.*]]
-; CHECK:       [[PRED_LOAD_IF14]]:
+; CHECK-NEXT:    br i1 [[TMP26]], label %[[PRED_LOAD_IF12:.*]], label %[[PRED_LOAD_CONTINUE13:.*]]
+; CHECK:       [[PRED_LOAD_IF12]]:
 ; CHECK-NEXT:    [[TMP27:%.*]] = getelementptr inbounds i32, ptr [[SRC]], i32 [[TMP4]]
 ; CHECK-NEXT:    [[TMP28:%.*]] = load i32, ptr [[TMP27]], align 4, !alias.scope [[META15]], !noalias [[META17]]
 ; CHECK-NEXT:    [[TMP29:%.*]] = insertelement <2 x i32> poison, i32 [[TMP28]], i32 0
-; CHECK-NEXT:    br label %[[PRED_LOAD_CONTINUE15]]
-; CHECK:       [[PRED_LOAD_CONTINUE15]]:
-; CHECK-NEXT:    [[TMP30:%.*]] = phi <2 x i32> [ poison, %[[PRED_STORE_CONTINUE13]] ], [ [[TMP29]], %[[PRED_LOAD_IF14]] ]
+; CHECK-NEXT:    br label %[[PRED_LOAD_CONTINUE13]]
+; CHECK:       [[PRED_LOAD_CONTINUE13]]:
+; CHECK-NEXT:    [[TMP30:%.*]] = phi <2 x i32> [ poison, %[[PRED_LOAD_CONTINUE11]] ], [ [[TMP29]], %[[PRED_LOAD_IF12]] ]
 ; CHECK-NEXT:    [[TMP31:%.*]] = extractelement <2 x i1> [[TMP7]], i32 1
-; CHECK-NEXT:    br i1 [[TMP31]], label %[[PRED_LOAD_IF16:.*]], label %[[PRED_LOAD_CONTINUE17:.*]]
-; CHECK:       [[PRED_LOAD_IF16]]:
+; CHECK-NEXT:    br i1 [[TMP31]], label %[[PRED_LOAD_IF14:.*]], label %[[PRED_LOAD_CONTINUE15]]
+; CHECK:       [[PRED_LOAD_IF14]]:
 ; CHECK-NEXT:    [[TMP32:%.*]] = getelementptr inbounds i32, ptr [[SRC]], i32 [[TMP5]]
 ; CHECK-NEXT:    [[TMP33:%.*]] = load i32, ptr [[TMP32]], align 4, !alias.scope [[META15]], !noalias [[META17]]
 ; CHECK-NEXT:    [[TMP34:%.*]] = insertelement <2 x i32> [[TMP30]], i32 [[TMP33]], i32 1
-; CHECK-NEXT:    br label %[[PRED_LOAD_CONTINUE17]]
-; CHECK:       [[PRED_LOAD_CONTINUE17]]:
-; CHECK-NEXT:    [[TMP35:%.*]] = phi <2 x i32> [ [[TMP30]], %[[PRED_LOAD_CONTINUE15]] ], [ [[TMP34]], %[[PRED_LOAD_IF16]] ]
+; CHECK-NEXT:    br label %[[PRED_LOAD_CONTINUE15]]
+; CHECK:       [[PRED_LOAD_CONTINUE15]]:
+; CHECK-NEXT:    [[TMP35:%.*]] = phi <2 x i32> [ [[TMP30]], %[[PRED_LOAD_CONTINUE13]] ], [ [[TMP34]], %[[PRED_LOAD_IF14]] ]
 ; CHECK-NEXT:    [[TMP36:%.*]] = add <2 x i32> [[TMP35]], splat (i32 10)
-; CHECK-NEXT:    [[TMP37:%.*]] = extractelement <2 x i1> [[TMP7]], i32 0
-; CHECK-NEXT:    br i1 [[TMP37]], label %[[PRED_STORE_IF18:.*]], label %[[PRED_STORE_CONTINUE19:.*]]
-; CHECK:       [[PRED_STORE_IF18]]:
-; CHECK-NEXT:    [[TMP38:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
-; CHECK-NEXT:    [[TMP39:%.*]] = extractelement <2 x i32> [[TMP36]], i32 0
-; CHECK-NEXT:    store i32 [[TMP39]], ptr [[TMP38]], align 4, !alias.scope [[META19]], !noalias [[META12]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE19]]
-; CHECK:       [[PRED_STORE_CONTINUE19]]:
-; CHECK-NEXT:    [[TMP40:%.*]] = extractelement <2 x i1> [[TMP7]], i32 1
-; CHECK-NEXT:    br i1 [[TMP40]], label %[[PRED_STORE_IF20:.*]], label %[[PRED_STORE_CONTINUE21]]
-; CHECK:       [[PRED_STORE_IF20]]:
+; CHECK-NEXT:    [[TMP40:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
 ; CHECK-NEXT:    [[TMP41:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP5]]
-; CHECK-NEXT:    [[TMP42:%.*]] = extractelement <2 x i32> [[TMP36]], i32 1
-; CHECK-NEXT:    store i32 [[TMP42]], ptr [[TMP41]], align 4, !alias.scope [[META19]], !noalias [[META12]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE21]]
-; CHECK:       [[PRED_STORE_CONTINUE21]]:
+; CHECK-NEXT:    [[TMP37:%.*]] = select <2 x i1> [[TMP7]], <2 x i32> [[TMP36]], <2 x i32> [[TMP19]]
+; CHECK-NEXT:    [[TMP38:%.*]] = extractelement <2 x i32> [[TMP37]], i32 0
+; CHECK-NEXT:    store i32 [[TMP38]], ptr [[TMP40]], align 4, !alias.scope [[META19:![0-9]+]], !noalias [[META12]]
+; CHECK-NEXT:    [[TMP39:%.*]] = extractelement <2 x i32> [[TMP37]], i32 1
+; CHECK-NEXT:    store i32 [[TMP39]], ptr [[TMP41]], align 4, !alias.scope [[META19]], !noalias [[META12]]
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
 ; CHECK-NEXT:    [[TMP43:%.*]] = icmp eq i32 [[INDEX_NEXT]], 100
 ; CHECK-NEXT:    br i1 [[TMP43]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP20:![0-9]+]]
@@ -289,13 +238,12 @@ define void @test_noalias_store_via_runtime_checks(ptr %dst, ptr %dst.1, ptr %sr
 ; CHECK:       [[VECTOR_PH]]:
 ; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
 ; CHECK:       [[VECTOR_BODY]]:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE28:.*]] ]
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE20:.*]] ]
 ; CHECK-NEXT:    [[TMP4:%.*]] = add i32 [[INDEX]], 0
 ; CHECK-NEXT:    [[TMP5:%.*]] = add i32 [[INDEX]], 1
 ; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[COND]], i32 [[TMP4]]
 ; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[TMP6]], align 4, !alias.scope [[META22:![0-9]+]]
-; CHECK-NEXT:    [[TMP7:%.*]] = icmp ule <2 x i32> [[WIDE_LOAD]], splat (i32 11)
-; CHECK-NEXT:    [[TMP8:%.*]] = xor <2 x i1> [[TMP7]], splat (i1 true)
+; CHECK-NEXT:    [[TMP8:%.*]] = icmp ugt <2 x i32> [[WIDE_LOAD]], splat (i32 11)
 ; CHECK-NEXT:    [[TMP9:%.*]] = extractelement <2 x i1> [[TMP8]], i32 0
 ; CHECK-NEXT:    br i1 [[TMP9]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
 ; CHECK:       [[PRED_STORE_IF]]:
@@ -304,7 +252,7 @@ define void @test_noalias_store_via_runtime_checks(ptr %dst, ptr %dst.1, ptr %sr
 ; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE]]
 ; CHECK:       [[PRED_STORE_CONTINUE]]:
 ; CHECK-NEXT:    [[TMP15:%.*]] = extractelement <2 x i1> [[TMP8]], i32 1
-; CHECK-NEXT:    br i1 [[TMP15]], label %[[PRED_STORE_IF19:.*]], label %[[PRED_STORE_CONTINUE20:.*]]
+; CHECK-NEXT:    br i1 [[TMP15]], label %[[PRED_STORE_IF19:.*]], label %[[PRED_STORE_CONTINUE20]]
 ; CHECK:       [[PRED_STORE_IF19]]:
 ; CHECK-NEXT:    [[TMP16:%.*]] = getelementptr inbounds i32, ptr [[DST_1]], i32 [[TMP5]]
 ; CHECK-NEXT:    store i32 10, ptr [[TMP16]], align 4, !alias.scope [[META25]], !noalias [[META27]]
@@ -317,39 +265,14 @@ define void @test_noalias_store_via_runtime_checks(ptr %dst, ptr %dst.1, ptr %sr
 ; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <2 x i32> poison, i32 [[TMP11]], i32 0
 ; CHECK-NEXT:    [[TMP19:%.*]] = insertelement <2 x i32> [[TMP14]], i32 [[TMP18]], i32 1
 ; CHECK-NEXT:    [[TMP21:%.*]] = sub <2 x i32> [[TMP19]], splat (i32 5)
-; CHECK-NEXT:    [[TMP22:%.*]] = extractelement <2 x i1> [[TMP8]], i32 0
-; CHECK-NEXT:    br i1 [[TMP22]], label %[[PRED_STORE_IF21:.*]], label %[[PRED_STORE_CONTINUE22:.*]]
-; CHECK:       [[PRED_STORE_IF21]]:
+; CHECK-NEXT:    [[TMP38:%.*]] = add <2 x i32> [[TMP19]], splat (i32 10)
 ; CHECK-NEXT:    [[TMP23:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
-; CHECK-NEXT:    [[TMP24:%.*]] = extractelement <2 x i32> [[TMP21]], i32 0
-; CHECK-NEXT:    store i32 [[TMP24]], ptr [[TMP23]], align 4, !alias.scope [[META31:![0-9]+]], !noalias [[META32:![0-9]+]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE22]]
-; CHECK:       [[PRED_STORE_CONTINUE22]]:
-; CHECK-NEXT:    [[TMP25:%.*]] = extractelement <2 x i1> [[TMP8]], i32 1
-; CHECK-NEXT:    br i1 [[TMP25]], label %[[PRED_STORE_IF23:.*]], label %[[PRED_STORE_CONTINUE24:.*]]
-; CHECK:       [[PRED_STORE_IF23]]:
 ; CHECK-NEXT:    [[TMP26:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP5]]
-; CHECK-NEXT:    [[TMP27:%.*]] = extractelement <2 x i32> [[TMP21]], i32 1
-; CHECK-NEXT:    store i32 [[TMP27]], ptr [[TMP26]], align 4, !alias.scope [[META31]], !noalias [[META32]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE24]]
-; CHECK:       [[PRED_STORE_CONTINUE24]]:
-; CHECK-NEXT:    [[TMP38:%.*]] = add <2 x i32> [[TMP19]], splat (i32 10)
-; CHECK-NEXT:    [[TMP39:%.*]] = extractelement <2 x i1> [[TMP7]], i32 0
-; CHECK-NEXT:    br i1 [[TMP39]], label %[[PRED_STORE_IF25:.*]], label %[[PRED_STORE_CONTINUE26:.*]]
-; CHECK:       [[PRED_STORE_IF25]]:
-; CHECK-NEXT:    [[TMP40:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
-; CHECK-NEXT:    [[TMP41:%.*]] = extractelement <2 x i32> [[TMP38]], i32 0
-; CHECK-NEXT:    store i32 [[TMP41]], ptr [[TMP40]], align 4, !alias.scope [[META31]], !noalias [[META32]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE26]]
-; CHECK:       [[PRED_STORE_CONTINUE26]]:
-; CHECK-NEXT:    [[TMP42:%.*]] = extractelement <2 x i1> [[TMP7]], i32 1
-; CHECK-NEXT:    br i1 [[TMP42]], label %[[PRED_STORE_IF27:.*]], label %[[PRED_STORE_CONTINUE28]]
-; CHECK:       [[PRED_STORE_IF27]]:
-; CHECK-NEXT:    [[TMP43:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP5]]
-; CHECK-NEXT:    [[TMP44:%.*]] = extractelement <2 x i32> [[TMP38]], i32 1
-; CHECK-NEXT:    store i32 [[TMP44]], ptr [[TMP43]], align 4, !alias.scope [[META31]], !noalias [[META32]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE28]]
-; CHECK:       [[PRED_STORE_CONTINUE28]]:
+; CHECK-NEXT:    [[TMP22:%.*]] = select <2 x i1> [[TMP8]], <2 x i32> [[TMP21]], <2 x i32> [[TMP38]]
+; CHECK-NEXT:    [[TMP24:%.*]] = extractelement <2 x i32> [[TMP22]], i32 0
+; CHECK-NEXT:    store i32 [[TMP24]], ptr [[TMP23]], align 4, !alias.scope [[META31:![0-9]+]], !noalias [[META32:![0-9]+]]
+; CHECK-NEXT:    [[TMP20:%.*]] = extractelement <2 x i32> [[TMP22]], i32 1
+; CHECK-NEXT:    store i32 [[TMP20]], ptr [[TMP26]], align 4, !alias.scope [[META31]], !noalias [[META32]]
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
 ; CHECK-NEXT:    [[TMP45:%.*]] = icmp eq i32 [[INDEX_NEXT]], 100
 ; CHECK-NEXT:    br i1 [[TMP45]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP33:![0-9]+]]
@@ -418,7 +341,7 @@ define void @test_memory_op_between_loads_alias(ptr %dst, ptr %src, ptr %cond, p
 ; CHECK:       [[VECTOR_PH]]:
 ; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
 ; CHECK:       [[VECTOR_BODY]]:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE17:.*]] ]
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_LOAD_CONTINUE15:.*]] ]
 ; CHECK-NEXT:    [[TMP4:%.*]] = add i32 [[INDEX]], 0
 ; CHECK-NEXT:    [[TMP5:%.*]] = add i32 [[INDEX]], 1
 ; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[COND]], i32 [[TMP4]]
@@ -446,40 +369,31 @@ define void @test_memory_op_between_loads_alias(ptr %dst, ptr %src, ptr %cond, p
 ; CHECK:       [[PRED_LOAD_CONTINUE11]]:
 ; CHECK-NEXT:    [[TMP18:%.*]] = phi <2 x i32> [ [[TMP13]], %[[PRED_LOAD_CONTINUE]] ], [ [[TMP17]], %[[PRED_LOAD_IF10]] ]
 ; CHECK-NEXT:    [[TMP19:%.*]] = add <2 x i32> [[TMP18]], splat (i32 10)
-; CHECK-NEXT:    [[TMP20:%.*]] = extractelement <2 x i1> [[TMP8]], i32 0
-; CHECK-NEXT:    br i1 [[TMP20]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
-; CHECK:       [[PRED_STORE_IF]]:
-; CHECK-NEXT:    [[TMP21:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
-; CHECK-NEXT:    [[TMP22:%.*]] = extractelement <2 x i32> [[TMP19]], i32 0
-; CHECK-NEXT:    store i32 [[TMP22]], ptr [[TMP21]], align 4, !alias.scope [[META42:![0-9]+]], !noalias [[META35]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE]]
-; CHECK:       [[PRED_STORE_CONTINUE]]:
-; CHECK-NEXT:    [[TMP23:%.*]] = extractelement <2 x i1> [[TMP8]], i32 1
-; CHECK-NEXT:    br i1 [[TMP23]], label %[[PRED_STORE_IF12:.*]], label %[[PRED_STORE_CONTINUE13:.*]]
-; CHECK:       [[PRED_STORE_IF12]]:
-; CHECK-NEXT:    [[TMP24:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP5]]
-; CHECK-NEXT:    [[TMP25:%.*]] = extractelement <2 x i32> [[TMP19]], i32 1
-; CHECK-NEXT:    store i32 [[TMP25]], ptr [[TMP24]], align 4, !alias.scope [[META42]], !noalias [[META35]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE13]]
-; CHECK:       [[PRED_STORE_CONTINUE13]]:
 ; CHECK-NEXT:    [[TMP26:%.*]] = extractelement <2 x i1> [[TMP7]], i32 0
-; CHECK-NEXT:    br i1 [[TMP26]], label %[[PRED_STORE_IF14:.*]], label %[[PRED_STORE_CONTINUE15:.*]]
-; CHECK:       [[PRED_STORE_IF14]]:
+; CHECK-NEXT:    br i1 [[TMP26]], label %[[PRED_LOAD_IF12:.*]], label %[[PRED_LOAD_CONTINUE13:.*]]
+; CHECK:       [[PRED_LOAD_IF12]]:
 ; CHECK-NEXT:    [[TMP27:%.*]] = getelementptr inbounds i32, ptr [[SRC]], i32 [[TMP4]]
-; CHECK-NEXT:    [[TMP32:%.*]] = load i32, ptr [[TMP27]], align 4, !alias.scope [[META38]], !noalias [[META40]]
-; CHECK-NEXT:    [[TMP29:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
-; CHECK-NEXT:    store i32 [[TMP32]], ptr [[TMP29]], align 4, !alias.scope [[META42]], !noalias [[META35]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE15]]
-; CHECK:       [[PRED_STORE_CONTINUE15]]:
+; CHECK-NEXT:    [[TMP20:%.*]] = load i32, ptr [[TMP27]], align 4, !alias.scope [[META38]], !noalias [[META40]]
+; CHECK-NEXT:    [[TMP23:%.*]] = insertelement <2 x i32> poison, i32 [[TMP20]], i32 0
+; CHECK-NEXT:    br label %[[PRED_LOAD_CONTINUE13]]
+; CHECK:       [[PRED_LOAD_CONTINUE13]]:
+; CHECK-NEXT:    [[TMP22:%.*]] = phi <2 x i32> [ poison, %[[PRED_LOAD_CONTINUE11]] ], [ [[TMP23]], %[[PRED_LOAD_IF12]] ]
 ; CHECK-NEXT:    [[TMP30:%.*]] = extractelement <2 x i1> [[TMP7]], i32 1
-; CHECK-NEXT:    br i1 [[TMP30]], label %[[PRED_STORE_IF16:.*]], label %[[PRED_STORE_CONTINUE17]]
-; CHECK:       [[PRED_STORE_IF16]]:
+; CHECK-NEXT:    br i1 [[TMP30]], label %[[PRED_LOAD_IF14:.*]], label %[[PRED_LOAD_CONTINUE15]]
+; CHECK:       [[PRED_LOAD_IF14]]:
 ; CHECK-NEXT:    [[TMP31:%.*]] = getelementptr inbounds i32, ptr [[SRC]], i32 [[TMP5]]
-; CHECK-NEXT:    [[TMP28:%.*]] = load i32, ptr [[TMP31]], align 4, !alias.scope [[META38]], !noalias [[META40]]
-; CHECK-NEXT:    [[TMP33:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP5]]
-; CHECK-NEXT:    store i32 [[TMP28]], ptr [[TMP33]], align 4, !alias.scope [[META42]], !noalias [[META35]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE17]]
-; CHECK:       [[PRED_STORE_CONTINUE17]]:
+; CHECK-NEXT:    [[TMP25:%.*]] = load i32, ptr [[TMP31]], align 4, !alias.scope [[META38]], !noalias [[META40]]
+; CHECK-NEXT:    [[TMP32:%.*]] = insertelement <2 x i32> [[TMP22]], i32 [[TMP25]], i32 1
+; CHECK-NEXT:    br label %[[PRED_LOAD_CONTINUE15]]
+; CHECK:       [[PRED_LOAD_CONTINUE15]]:
+; CHECK-NEXT:    [[TMP33:%.*]] = phi <2 x i32> [ [[TMP22]], %[[PRED_LOAD_CONTINUE13]] ], [ [[TMP32]], %[[PRED_LOAD_IF14]] ]
+; CHECK-NEXT:    [[TMP36:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
+; CHECK-NEXT:    [[TMP37:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP5]]
+; CHECK-NEXT:    [[TMP28:%.*]] = select <2 x i1> [[TMP7]], <2 x i32> [[TMP33]], <2 x i32> [[TMP19]]
+; CHECK-NEXT:    [[TMP29:%.*]] = extractelement <2 x i32> [[TMP28]], i32 0
+; CHECK-NEXT:    store i32 [[TMP29]], ptr [[TMP36]], align 4, !alias.scope [[META42:![0-9]+]], !noalias [[META35]]
+; CHECK-NEXT:    [[TMP35:%.*]] = extractelement <2 x i32> [[TMP28]], i32 1
+; CHECK-NEXT:    store i32 [[TMP35]], ptr [[TMP37]], align 4, !alias.scope [[META42]], !noalias [[META35]]
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
 ; CHECK-NEXT:    [[TMP34:%.*]] = icmp eq i32 [[INDEX_NEXT]], 100
 ; CHECK-NEXT:    br i1 [[TMP34]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP43:![0-9]+]]
@@ -559,13 +473,12 @@ define void @test_memory_op_between_loads_no_alias_via_rt_checks(ptr %dst, ptr %
 ; CHECK:       [[VECTOR_PH]]:
 ; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
 ; CHECK:       [[VECTOR_BODY]]:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE28:.*]] ]
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE20:.*]] ]
 ; CHECK-NEXT:    [[TMP4:%.*]] = add i32 [[INDEX]], 0
 ; CHECK-NEXT:    [[TMP5:%.*]] = add i32 [[INDEX]], 1
 ; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[COND]], i32 [[TMP4]]
 ; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[TMP6]], align 4, !alias.scope [[META45:![0-9]+]]
-; CHECK-NEXT:    [[TMP7:%.*]] = icmp ule <2 x i32> [[WIDE_LOAD]], splat (i32 11)
-; CHECK-NEXT:    [[TMP8:%.*]] = xor <2 x i1> [[TMP7]], splat (i1 true)
+; CHECK-NEXT:    [[TMP8:%.*]] = icmp ugt <2 x i32> [[WIDE_LOAD]], splat (i32 11)
 ; CHECK-NEXT:    [[TMP9:%.*]] = extractelement <2 x i1> [[TMP8]], i32 0
 ; CHECK-NEXT:    br i1 [[TMP9]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
 ; CHECK:       [[PRED_STORE_IF]]:
@@ -574,7 +487,7 @@ define void @test_memory_op_between_loads_no_alias_via_rt_checks(ptr %dst, ptr %
 ; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE]]
 ; CHECK:       [[PRED_STORE_CONTINUE]]:
 ; CHECK-NEXT:    [[TMP15:%.*]] = extractelement <2 x i1> [[TMP8]], i32 1
-; CHECK-NEXT:    br i1 [[TMP15]], label %[[PRED_STORE_IF19:.*]], label %[[PRED_STORE_CONTINUE20:.*]]
+; CHECK-NEXT:    br i1 [[TMP15]], label %[[PRED_STORE_IF19:.*]], label %[[PRED_STORE_CONTINUE20]]
 ; CHECK:       [[PRED_STORE_IF19]]:
 ; CHECK-NEXT:    [[TMP16:%.*]] = getelementptr inbounds i32, ptr [[DST_1]], i32 [[TMP5]]
 ; CHECK-NEXT:    store i32 0, ptr [[TMP16]], align 4, !alias.scope [[META48]], !noalias [[META50]]
@@ -587,36 +500,13 @@ define void @test_memory_op_between_loads_no_alias_via_rt_checks(ptr %dst, ptr %
 ; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <2 x i32> poison, i32 [[TMP11]], i32 0
 ; CHECK-NEXT:    [[TMP19:%.*]] = insertelement <2 x i32> [[TMP14]], i32 [[TMP18]], i32 1
 ; CHECK-NEXT:    [[TMP21:%.*]] = add <2 x i32> [[TMP19]], splat (i32 10)
-; CHECK-NEXT:    [[TMP22:%.*]] = extractelement <2 x i1> [[TMP8]], i32 0
-; CHECK-NEXT:    br i1 [[TMP22]], label %[[PRED_STORE_IF21:.*]], label %[[PRED_STORE_CONTINUE22:.*]]
-; CHECK:       [[PRED_STORE_IF21]]:
 ; CHECK-NEXT:    [[TMP23:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
-; CHECK-NEXT:    [[TMP24:%.*]] = extractelement <2 x i32> [[TMP21]], i32 0
-; CHECK-NEXT:    store i32 [[TMP24]], ptr [[TMP23]], align 4, !alias.scope [[META54:![0-9]+]], !noalias [[META55:![0-9]+]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE22]]
-; CHECK:       [[PRED_STORE_CONTINUE22]]:
-; CHECK-NEXT:    [[TMP25:%.*]] = extractelement <2 x i1> [[TMP8]], i32 1
-; CHECK-NEXT:    br i1 [[TMP25]], label %[[PRED_STORE_IF23:.*]], label %[[PRED_STORE_CONTINUE24:.*]]
-; CHECK:       [[PRED_STORE_IF23]]:
 ; CHECK-NEXT:    [[TMP26:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP5]]
-; CHECK-NEXT:    [[TMP27:%.*]] = extractelement <2 x i32> [[TMP21]], i32 1
-; CHECK-NEXT:    store i32 [[TMP27]], ptr [[TMP26]], align 4, !alias.scope [[META54]], !noalias [[META55]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE24]]
-; CHECK:       [[PRED_STORE_CONTINUE24]]:
-; CHECK-NEXT:    [[TMP28:%.*]] = extractelement <2 x i1> [[TMP7]], i32 0
-; CHECK-NEXT:    br i1 [[TMP28]], label %[[PRED_STORE_IF25:.*]], label %[[PRED_STORE_CONTINUE26:.*]]
-; CHECK:       [[PRED_STORE_IF25]]:
-; CHECK-NEXT:    [[TMP31:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
-; CHECK-NEXT:    store i32 [[TMP11]], ptr [[TMP31]], align 4, !alias.scope [[META54]], !noalias [[META55]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE26]]
-; CHECK:       [[PRED_STORE_CONTINUE26]]:
-; CHECK-NEXT:    [[TMP32:%.*]] = extractelement <2 x i1> [[TMP7]], i32 1
-; CHECK-NEXT:    br i1 [[TMP32]], label %[[PRED_STORE_IF27:.*]], label %[[PRED_STORE_CONTINUE28]]
-; CHECK:       [[PRED_STORE_IF27]]:
-; CHECK-NEXT:    [[TMP35:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP5]]
-; CHECK-NEXT:    store i32 [[TMP18]], ptr [[TMP35]], align 4, !alias.scope [[META54]], !noalias [[META55]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE28]]
-; CHECK:       [[PRED_STORE_CONTINUE28]]:
+; CHECK-NEXT:    [[TMP20:%.*]] = select <2 x i1> [[TMP8]], <2 x i32> [[TMP21]], <2 x i32> [[TMP19]]
+; CHECK-NEXT:    [[TMP22:%.*]] = extractelement <2 x i32> [[TMP20]], i32 0
+; CHECK-NEXT:    store i32 [[TMP22]], ptr [[TMP23]], align 4, !alias.scope [[META54:![0-9]+]], !noalias [[META55:![0-9]+]]
+; CHECK-NEXT:    [[TMP24:%.*]] = extractelement <2 x i32> [[TMP20]], i32 1
+; CHECK-NEXT:    store i32 [[TMP24]], ptr [[TMP26]], align 4, !alias.scope [[META54]], !noalias [[META55]]
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
 ; CHECK-NEXT:    [[TMP36:%.*]] = icmp eq i32 [[INDEX_NEXT]], 100
 ; CHECK-NEXT:    br i1 [[TMP36]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP56:![0-9]+]]
@@ -685,45 +575,37 @@ define void @test_stores_not_sunk_due_to_aliasing_load(ptr %dst, ptr %alias, ptr
 ; CHECK:       [[VECTOR_PH]]:
 ; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
 ; CHECK:       [[VECTOR_BODY]]:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE11:.*]] ]
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_LOAD_CONTINUE7:.*]] ]
 ; CHECK-NEXT:    [[TMP4:%.*]] = add i32 [[INDEX]], 0
 ; CHECK-NEXT:    [[TMP5:%.*]] = add i32 [[INDEX]], 1
 ; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[COND]], i32 [[TMP4]]
 ; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[TMP6]], align 4, !alias.scope [[META58:![0-9]+]]
-; CHECK-NEXT:    [[TMP10:%.*]] = icmp ule <2 x i32> [[WIDE_LOAD]], splat (i32 11)
-; CHECK-NEXT:    [[TMP7:%.*]] = xor <2 x i1> [[TMP10]], splat (i1 true)
+; CHECK-NEXT:    [[TMP7:%.*]] = icmp ugt <2 x i32> [[WIDE_LOAD]], splat (i32 11)
 ; CHECK-NEXT:    [[TMP8:%.*]] = extractelement <2 x i1> [[TMP7]], i32 0
-; CHECK-NEXT:    br i1 [[TMP8]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
-; CHECK:       [[PRED_STORE_IF]]:
+; CHECK-NEXT:    br i1 [[TMP8]], label %[[PRED_LOAD_IF:.*]], label %[[PRED_LOAD_CONTINUE:.*]]
+; CHECK:       [[PRED_LOAD_IF]]:
 ; CHECK-NEXT:    [[TMP9:%.*]] = getelementptr inbounds i32, ptr [[ALIAS]], i32 [[TMP4]]
-; CHECK-NEXT:    [[TMP15:%.*]] = load i32, ptr [[TMP9]], align 4, !alias.scope [[META61:![0-9]+]]
-; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
-; CHECK-NEXT:    store i32 [[TMP15]], ptr [[TMP12]], align 4, !alias.scope [[META63:![0-9]+]], !noalias [[META65:![0-9]+]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE]]
-; CHECK:       [[PRED_STORE_CONTINUE]]:
+; CHECK-NEXT:    [[TMP10:%.*]] = load i32, ptr [[TMP9]], align 4, !alias.scope [[META61:![0-9]+]]
+; CHECK-NEXT:    [[TMP15:%.*]] = insertelement <2 x i32> poison, i32 [[TMP10]], i32 0
+; CHECK-NEXT:    br label %[[PRED_LOAD_CONTINUE]]
+; CHECK:       [[PRED_LOAD_CONTINUE]]:
+; CHECK-NEXT:    [[TMP20:%.*]] = phi <2 x i32> [ poison, %[[VECTOR_BODY]] ], [ [[TMP15]], %[[PRED_LOAD_IF]] ]
 ; CHECK-NEXT:    [[TMP13:%.*]] = extractelement <2 x i1> [[TMP7]], i32 1
-; CHECK-NEXT:    br i1 [[TMP13]], label %[[PRED_STORE_IF6:.*]], label %[[PRED_STORE_CONTINUE7:.*]]
-; CHECK:       [[PRED_STORE_IF6]]:
+; CHECK-NEXT:    br i1 [[TMP13]], label %[[PRED_LOAD_IF6:.*]], label %[[PRED_LOAD_CONTINUE7]]
+; CHECK:       [[PRED_LOAD_IF6]]:
 ; CHECK-NEXT:    [[TMP14:%.*]] = getelementptr inbounds i32, ptr [[ALIAS]], i32 [[TMP5]]
 ; CHECK-NEXT:    [[TMP11:%.*]] = load i32, ptr [[TMP14]], align 4, !alias.scope [[META61]]
-; CHECK-NEXT:    [[TMP16:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP5]]
-; CHECK-NEXT:    store i32 [[TMP11]], ptr [[TMP16]], align 4, !alias.scope [[META63]], !noalias [[META65]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE7]]
-; CHECK:       [[PRED_STORE_CONTINUE7]]:
-; CHECK-NEXT:    [[TMP17:%.*]] = extractelement <2 x i1> [[TMP10]], i32 0
-; CHECK-NEXT:    br i1 [[TMP17]], label %[[PRED_STORE_IF8:.*]], label %[[PRED_STORE_CONTINUE9:.*]]
-; CHECK:       [[PRED_STORE_IF8]]:
+; CHECK-NEXT:    [[TMP12:%.*]] = insertelement <2 x i32> [[TMP20]], i32 [[TMP11]], i32 1
+; CHECK-NEXT:    br label %[[PRED_LOAD_CONTINUE7]]
+; CHECK:       [[PRED_LOAD_CONTINUE7]]:
+; CHECK-NEXT:    [[TMP22:%.*]] = phi <2 x i32> [ [[TMP20]], %[[PRED_LOAD_CONTINUE]] ], [ [[TMP12]], %[[PRED_LOAD_IF6]] ]
 ; CHECK-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP4]]
-; CHECK-NEXT:    store i32 10, ptr [[TMP18]], align 4, !alias.scope [[META63]], !noalias [[META65]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE9]]
-; CHECK:       [[PRED_STORE_CONTINUE9]]:
-; CHECK-NEXT:    [[TMP20:%.*]] = extractelement <2 x i1> [[TMP10]], i32 1
-; CHECK-NEXT:    br i1 [[TMP20]], label %[[PRED_STORE_IF10:.*]], label %[[PRED_STORE_CONTINUE11]]
-; CHECK:       [[PRED_STORE_IF10]]:
 ; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP5]]
-; CHECK-NEXT:    store i32 10, ptr [[TMP19]], align 4, !alias.scope [[META63]], !noalias [[META65]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE11]]
-; CHECK:       [[PRED_STORE_CONTINUE11]]:
+; CHECK-NEXT:    [[TMP16:%.*]] = select <2 x i1> [[TMP7]], <2 x i32> [[TMP22]], <2 x i32> splat (i32 10)
+; CHECK-NEXT:    [[TMP17:%.*]] = extractelement <2 x i32> [[TMP16]], i32 0
+; CHECK-NEXT:    store i32 [[TMP17]], ptr [[TMP18]], align 4, !alias.scope [[META63:![0-9]+]], !noalias [[META65:![0-9]+]]
+; CHECK-NEXT:    [[TMP23:%.*]] = extractelement <2 x i32> [[TMP16]], i32 1
+; CHECK-NEXT:    store i32 [[TMP23]], ptr [[TMP19]], align 4, !alias.scope [[META63]], !noalias [[META65]]
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
 ; CHECK-NEXT:    [[TMP21:%.*]] = icmp eq i32 [[INDEX_NEXT]], 100
 ; CHECK-NEXT:    br i1 [[TMP21]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP66:![0-9]+]]
@@ -873,7 +755,7 @@ define void @sink_multiple_store_groups_noalias_via_scev(ptr %dst, ptr %src) {
 ; CHECK-NEXT:  [[ENTRY:.*:]]
 ; CHECK-NEXT:    br label %[[VECTOR_MEMCHECK:.*]]
 ; CHECK:       [[VECTOR_MEMCHECK]]:
-; CHECK-NEXT:    [[SCEVGEP:%.*]] = getelementptr i8, ptr [[DST]], i64 12688
+; CHECK-NEXT:    [[SCEVGEP:%.*]] = getelementptr i8, ptr [[DST]], i64 12696
 ; CHECK-NEXT:    [[SCEVGEP8:%.*]] = getelementptr i8, ptr [[SRC]], i64 12828
 ; CHECK-NEXT:    [[BOUND1:%.*]] = icmp ult ptr [[DST]], [[SCEVGEP8]]
 ; CHECK-NEXT:    [[BOUND2:%.*]] = icmp ult ptr [[SRC]], [[SCEVGEP]]
@@ -882,88 +764,59 @@ define void @sink_multiple_store_groups_noalias_via_scev(ptr %dst, ptr %src) {
 ; CHECK:       [[VECTOR_PH]]:
 ; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
 ; CHECK:       [[VECTOR_BODY]]:
-; CHECK-NEXT:    [[INDEX1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE9:.*]] ]
+; CHECK-NEXT:    [[INDEX1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE7:.*]] ]
 ; CHECK-NEXT:    [[INDEX:%.*]] = mul i64 [[INDEX1]], 16
 ; CHECK-NEXT:    [[IV:%.*]] = add i64 [[INDEX]], 0
 ; CHECK-NEXT:    [[TMP17:%.*]] = add i64 [[INDEX]], 16
 ; CHECK-NEXT:    [[GEP_SRC:%.*]] = getelementptr double, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[TMP22:%.*]] = getelementptr double, ptr [[SRC]], i64 [[TMP17]]
-; CHECK-NEXT:    [[TMP23:%.*]] = insertelement <2 x ptr> poison, ptr [[GEP_SRC]], i32 0
-; CHECK-NEXT:    [[TMP24:%.*]] = insertelement <2 x ptr> [[TMP23]], ptr [[TMP22]], i32 1
 ; CHECK-NEXT:    [[GEP_FLAG:%.*]] = getelementptr i8, ptr [[GEP_SRC]], i64 152
 ; CHECK-NEXT:    [[TMP26:%.*]] = getelementptr i8, ptr [[TMP22]], i64 152
 ; CHECK-NEXT:    [[TMP27:%.*]] = load i32, ptr [[GEP_FLAG]], align 4, !alias.scope [[META78:![0-9]+]]
 ; CHECK-NEXT:    [[TMP28:%.*]] = load i32, ptr [[TMP26]], align 4, !alias.scope [[META78]]
 ; CHECK-NEXT:    [[TMP29:%.*]] = insertelement <2 x i32> poison, i32 [[TMP27]], i32 0
 ; CHECK-NEXT:    [[TMP30:%.*]] = insertelement <2 x i32> [[TMP29]], i32 [[TMP28]], i32 1
-; CHECK-NEXT:    [[TMP31:%.*]] = icmp eq <2 x i32> [[TMP30]], zeroinitializer
+; CHECK-NEXT:    [[TMP10:%.*]] = icmp eq <2 x i32> [[TMP30]], zeroinitializer
 ; CHECK-NEXT:    [[TMP13:%.*]] = load double, ptr [[GEP_SRC]], align 8, !alias.scope [[META78]]
 ; CHECK-NEXT:    [[TMP14:%.*]] = load double, ptr [[TMP22]], align 8, !alias.scope [[META78]]
 ; CHECK-NEXT:    [[TMP15:%.*]] = insertelement <2 x double> poison, double [[TMP13]], i32 0
 ; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = insertelement <2 x double> [[TMP15]], double [[TMP14]], i32 1
-; CHECK-NEXT:    [[TMP33:%.*]] = xor <2 x i1> [[TMP31]], splat (i1 true)
+; CHECK-NEXT:    [[TMP16:%.*]] = xor <2 x i1> [[TMP10]], splat (i1 true)
 ; CHECK-NEXT:    [[TMP34:%.*]] = fadd <2 x double> [[WIDE_LOAD]], splat (double 8.000000e+00)
-; CHECK-NEXT:    [[GEP_DST1_ELSE:%.*]] = getelementptr double, ptr [[DST]], i64 [[IV]]
-; CHECK-NEXT:    [[TMP37:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP17]]
-; CHECK-NEXT:    [[TMP38:%.*]] = insertelement <2 x ptr> poison, ptr [[GEP_DST1_ELSE]], i32 0
-; CHECK-NEXT:    [[TMP39:%.*]] = insertelement <2 x ptr> [[TMP38]], ptr [[TMP37]], i32 1
-; CHECK-NEXT:    [[TMP40:%.*]] = extractelement <2 x i1> [[TMP33]], i32 0
-; CHECK-NEXT:    br i1 [[TMP40]], label %[[PRED_LOAD_IF:.*]], label %[[PRED_LOAD_CONTINUE:.*]]
-; CHECK:       [[PRED_LOAD_IF]]:
-; CHECK-NEXT:    [[TMP41:%.*]] = extractelement <2 x double> [[TMP34]], i32 0
-; CHECK-NEXT:    store double [[TMP41]], ptr [[GEP_DST1_ELSE]], align 8, !alias.scope [[META81:![0-9]+]], !noalias [[META78]]
-; CHECK-NEXT:    [[GEP_SRC_16:%.*]] = getelementptr i8, ptr [[GEP_SRC]], i64 16
-; CHECK-NEXT:    [[TMP43:%.*]] = load double, ptr [[GEP_SRC_16]], align 8, !alias.scope [[META78]]
-; CHECK-NEXT:    [[TMP44:%.*]] = insertelement <2 x double> poison, double [[TMP43]], i32 0
-; CHECK-NEXT:    br label %[[PRED_LOAD_CONTINUE]]
-; CHECK:       [[PRED_LOAD_CONTINUE]]:
-; CHECK-NEXT:    [[TMP45:%.*]] = phi <2 x double> [ poison, %[[VECTOR_BODY]] ], [ [[TMP44]], %[[PRED_LOAD_IF]] ]
-; CHECK-NEXT:    [[TMP46:%.*]] = extractelement <2 x i1> [[TMP33]], i32 1
-; CHECK-NEXT:    br i1 [[TMP46]], label %[[PRED_LOAD_IF2:.*]], label %[[PRED_LOAD_CONTINUE3:.*]]
-; CHECK:       [[PRED_LOAD_IF2]]:
-; CHECK-NEXT:    [[TMP47:%.*]] = extractelement <2 x double> [[TMP34]], i32 1
-; CHECK-NEXT:    store double [[TMP47]], ptr [[TMP37]], align 8, !alias.scope [[META81]], !noalias [[META78]]
-; CHECK-NEXT:    [[TMP48:%.*]] = getelementptr i8, ptr [[TMP22]], i64 16
-; CHECK-NEXT:    [[TMP49:%.*]] = load double, ptr [[TMP48]], align 8, !alias.scope [[META78]]
-; CHECK-NEXT:    [[TMP50:%.*]] = insertelement <2 x double> [[TMP45]], double [[TMP49]], i32 1
-; CHECK-NEXT:    br label %[[PRED_LOAD_CONTINUE3]]
-; CHECK:       [[PRED_LOAD_CONTINUE3]]:
-; CHECK-NEXT:    [[TMP51:%.*]] = phi <2 x double> [ [[TMP45]], %[[PRED_LOAD_CONTINUE]] ], [ [[TMP50]], %[[PRED_LOAD_IF2]] ]
-; CHECK-NEXT:    [[TMP53:%.*]] = fmul <2 x double> splat (double 2.000000e+01), [[TMP51]]
-; CHECK-NEXT:    [[TMP54:%.*]] = extractelement <2 x i1> [[TMP33]], i32 0
-; CHECK-NEXT:    br i1 [[TMP54]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
+; CHECK-NEXT:    [[TMP24:%.*]] = extractelement <2 x i1> [[TMP16]], i32 0
+; CHECK-NEXT:    br i1 [[TMP24]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
 ; CHECK:       [[PRED_STORE_IF]]:
-; CHECK-NEXT:    [[GEP_DST2_ELSE:%.*]] = getelementptr i8, ptr [[GEP_DST1_ELSE]], i64 8
-; CHECK-NEXT:    [[TMP56:%.*]] = extractelement <2 x double> [[TMP53]], i32 0
-; CHECK-NEXT:    store double [[TMP56]], ptr [[GEP_DST2_ELSE]], align 8, !alias.scope [[META81]], !noalias [[META78]]
+; CHECK-NEXT:    [[TMP18:%.*]] = getelementptr double, ptr [[DST]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP19:%.*]] = extractelement <2 x double> [[TMP34]], i32 0
+; CHECK-NEXT:    store double [[TMP19]], ptr [[TMP18]], align 8, !alias.scope [[META81:![0-9]+]], !noalias [[META78]]
 ; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE]]
 ; CHECK:       [[PRED_STORE_CONTINUE]]:
-; CHECK-NEXT:    [[TMP57:%.*]] = extractelement <2 x i1> [[TMP33]], i32 1
-; CHECK-NEXT:    br i1 [[TMP57]], label %[[PRED_STORE_IF4:.*]], label %[[PRED_STORE_CONTINUE5:.*]]
+; CHECK-NEXT:    [[TMP20:%.*]] = extractelement <2 x i1> [[TMP16]], i32 1
+; CHECK-NEXT:    br i1 [[TMP20]], label %[[PRED_STORE_IF2:.*]], label %[[PRED_STORE_CONTINUE3:.*]]
+; CHECK:       [[PRED_STORE_IF2]]:
+; CHECK-NEXT:    [[TMP21:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP17]]
+; CHECK-NEXT:    [[TMP33:%.*]] = extractelement <2 x double> [[TMP34]], i32 1
+; CHECK-NEXT:    store double [[TMP33]], ptr [[TMP21]], align 8, !alias.scope [[META81]], !noalias [[META78]]
+; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE3]]
+; CHECK:       [[PRED_STORE_CONTINUE3]]:
+; CHECK-NEXT:    [[TMP23:%.*]] = extractelement <2 x i1> [[TMP10]], i32 0
+; CHECK-NEXT:    br i1 [[TMP23]], label %[[PRED_STORE_IF4:.*]], label %[[PRED_STORE_CONTINUE5:.*]]
 ; CHECK:       [[PRED_STORE_IF4]]:
-; CHECK-NEXT:    [[TMP58:%.*]] = getelementptr i8, ptr [[TMP37]], i64 8
-; CHECK-NEXT:    [[TMP59:%.*]] = extractelement <2 x double> [[TMP53]], i32 1
-; CHECK-NEXT:    store double [[TMP59]], ptr [[TMP58]], align 8, !alias.scope [[META81]], !noalias [[META78]]
+; CHECK-NEXT:    [[TMP31:%.*]] = getelementptr double, ptr [[DST]], i64 [[IV]]
+; CHECK-NEXT:    store double [[TMP13]], ptr [[TMP31]], align 8, !alias.scope [[META81]], !noalias [[META78]]
+; CHECK-NEXT:    [[TMP37:%.*]] = getelementptr i8, ptr [[TMP31]], i64 16
+; CHECK-NEXT:    store double 1.000000e+01, ptr [[TMP37]], align 8, !alias.scope [[META81]], !noalias [[META78]]
 ; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE5]]
 ; CHECK:       [[PRED_STORE_CONTINUE5]]:
-; CHECK-NEXT:    [[TMP60:%.*]] = extractelement <2 x i1> [[TMP31]], i32 0
-; CHECK-NEXT:    br i1 [[TMP60]], label %[[PRED_STORE_IF6:.*]], label %[[PRED_STORE_CONTINUE7:.*]]
+; CHECK-NEXT:    [[TMP25:%.*]] = extractelement <2 x i1> [[TMP10]], i32 1
+; CHECK-NEXT:    br i1 [[TMP25]], label %[[PRED_STORE_IF6:.*]], label %[[PRED_STORE_CONTINUE7]]
 ; CHECK:       [[PRED_STORE_IF6]]:
-; CHECK-NEXT:    [[TMP62:%.*]] = getelementptr double, ptr [[DST]], i64 [[IV]]
-; CHECK-NEXT:    store double [[TMP13]], ptr [[TMP62]], align 8, !alias.scope [[META81]], !noalias [[META78]]
-; CHECK-NEXT:    [[TMP64:%.*]] = getelementptr i8, ptr [[TMP62]], i64 8
-; CHECK-NEXT:    store double 1.000000e+01, ptr [[TMP64]], align 8, !alias.scope [[META81]], !noalias [[META78]]
+; CHECK-NEXT:    [[TMP32:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP17]]
+; CHECK-NEXT:    store double [[TMP14]], ptr [[TMP32]], align 8, !alias.scope [[META81]], !noalias [[META78]]
+; CHECK-NEXT:    [[TMP47:%.*]] = getelementptr i8, ptr [[TMP32]], i64 16
+; CHECK-NEXT:    store double 1.000000e+01, ptr [[TMP47]], align 8, !alias.scope [[META81]], !noalias [[META78]]
 ; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE7]]
 ; CHECK:       [[PRED_STORE_CONTINUE7]]:
-; CHECK-NEXT:    [[TMP66:%.*]] = extractelement <2 x i1> [[TMP31]], i32 1
-; CHECK-NEXT:    br i1 [[TMP66]], label %[[PRED_STORE_IF8:.*]], label %[[PRED_STORE_CONTINUE9]]
-; CHECK:       [[PRED_STORE_IF8]]:
-; CHECK-NEXT:    [[TMP68:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP17]]
-; CHECK-NEXT:    store double [[TMP14]], ptr [[TMP68]], align 8, !alias.scope [[META81]], !noalias [[META78]]
-; CHECK-NEXT:    [[TMP70:%.*]] = getelementptr i8, ptr [[TMP68]], i64 8
-; CHECK-NEXT:    store double 1.000000e+01, ptr [[TMP70]], align 8, !alias.scope [[META81]], !noalias [[META78]]
-; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE9]]
-; CHECK:       [[PRED_STORE_CONTINUE9]]:
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX1]], 2
 ; CHECK-NEXT:    [[TMP52:%.*]] = icmp eq i64 [[INDEX_NEXT]], 100
 ; CHECK-NEXT:    br i1 [[TMP52]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP83:![0-9]+]]
@@ -983,6 +836,117 @@ loop:
   %v.1 = load double, ptr %gep.src, align 8
   br i1 %cmp, label %then, label %else
 
+then:
+  %gep.dst1.then = getelementptr double, ptr %dst, i64 %iv
+  store double %v.1, ptr %gep.dst1.then, align 8
+  %gep.dst2.then = getelementptr i8, ptr %gep.dst1.then, i64 16
+  store double 10.0, ptr %gep.dst2.then, align 8
+  br label %loop.latch
+
+else:
+  %r.1 = fadd double %v.1, 8.0
+  %gep.dst1.else = getelementptr double, ptr %dst, i64 %iv
+  store double %r.1, ptr %gep.dst1.else, align 8
+  br label %loop.latch
+
+loop.latch:
+  %iv.next = add i64 %iv, 16
+  %exit.cond = icmp eq i64 %iv.next, 1600
+  br i1 %exit.cond, label %exit, label %loop
+
+exit:
+  ret void
+}
+
+; Same as @sink_multiple_store_groups_noalias_via_scev, but the offset between
+; store groups is only 8, which means they alias across VFs.
+define void @sink_multiple_store_groups_alias_via_scev(ptr %dst, ptr %src) {
+; CHECK-LABEL: define void @sink_multiple_store_groups_alias_via_scev(
+; CHECK-SAME: ptr [[DST:%.*]], ptr [[SRC:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    br label %[[VECTOR_MEMCHECK:.*]]
+; CHECK:       [[VECTOR_MEMCHECK]]:
+; CHECK-NEXT:    [[SCEVGEP:%.*]] = getelementptr i8, ptr [[DST]], i64 12688
+; CHECK-NEXT:    [[SCEVGEP1:%.*]] = getelementptr i8, ptr [[SRC]], i64 12828
+; CHECK-NEXT:    [[BOUND0:%.*]] = icmp ult ptr [[DST]], [[SCEVGEP1]]
+; CHECK-NEXT:    [[BOUND1:%.*]] = icmp ult ptr [[SRC]], [[SCEVGEP]]
+; CHECK-NEXT:    [[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]], [[BOUND1]]
+; CHECK-NEXT:    br i1 [[FOUND_CONFLICT]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE7:.*]] ]
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 16
+; CHECK-NEXT:    [[IV:%.*]] = add i64 [[OFFSET_IDX]], 0
+; CHECK-NEXT:    [[TMP1:%.*]] = add i64 [[OFFSET_IDX]], 16
+; CHECK-NEXT:    [[GEP_SRC:%.*]] = getelementptr double, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr double, ptr [[SRC]], i64 [[TMP1]]
+; CHECK-NEXT:    [[GEP_FLAG:%.*]] = getelementptr i8, ptr [[GEP_SRC]], i64 152
+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr i8, ptr [[TMP3]], i64 152
+; CHECK-NEXT:    [[TMP8:%.*]] = load i32, ptr [[GEP_FLAG]], align 4, !alias.scope [[META85:![0-9]+]]
+; CHECK-NEXT:    [[TMP9:%.*]] = load i32, ptr [[TMP7]], align 4, !alias.scope [[META85]]
+; CHECK-NEXT:    [[TMP10:%.*]] = insertelement <2 x i32> poison, i32 [[TMP8]], i32 0
+; CHECK-NEXT:    [[TMP11:%.*]] = insertelement <2 x i32> [[TMP10]], i32 [[TMP9]], i32 1
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq <2 x i32> [[TMP11]], zeroinitializer
+; CHECK-NEXT:    [[TMP13:%.*]] = load double, ptr [[GEP_SRC]], align 8, !alias.scope [[META85]]
+; CHECK-NEXT:    [[TMP14:%.*]] = load double, ptr [[TMP3]], align 8, !alias.scope [[META85]]
+; CHECK-NEXT:    [[TMP15:%.*]] = insertelement <2 x double> poison, double [[TMP13]], i32 0
+; CHECK-NEXT:    [[TMP16:%.*]] = insertelement <2 x double> [[TMP15]], double [[TMP14]], i32 1
+; CHECK-NEXT:    [[TMP17:%.*]] = xor <2 x i1> [[TMP12]], splat (i1 true)
+; CHECK-NEXT:    [[TMP18:%.*]] = fadd <2 x double> [[TMP16]], splat (double 8.000000e+00)
+; CHECK-NEXT:    [[TMP36:%.*]] = extractelement <2 x i1> [[TMP17]], i32 0
+; CHECK-NEXT:    br i1 [[TMP36]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
+; CHECK:       [[PRED_STORE_IF]]:
+; CHECK-NEXT:    [[TMP20:%.*]] = getelementptr double, ptr [[DST]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP19:%.*]] = extractelement <2 x double> [[TMP18]], i32 0
+; CHECK-NEXT:    store double [[TMP19]], ptr [[TMP20]], align 8, !alias.scope [[META88:![0-9]+]], !noalias [[META85]]
+; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE]]
+; CHECK:       [[PRED_STORE_CONTINUE]]:
+; CHECK-NEXT:    [[TMP39:%.*]] = extractelement <2 x i1> [[TMP17]], i32 1
+; CHECK-NEXT:    br i1 [[TMP39]], label %[[PRED_STORE_IF2:.*]], label %[[PRED_STORE_CONTINUE3:.*]]
+; CHECK:       [[PRED_STORE_IF2]]:
+; CHECK-NEXT:    [[TMP21:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP22:%.*]] = extractelement <2 x double> [[TMP18]], i32 1
+; CHECK-NEXT:    store double [[TMP22]], ptr [[TMP21]], align 8, !alias.scope [[META88]], !noalias [[META85]]
+; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE3]]
+; CHECK:       [[PRED_STORE_CONTINUE3]]:
+; CHECK-NEXT:    [[TMP42:%.*]] = extractelement <2 x i1> [[TMP12]], i32 0
+; CHECK-NEXT:    br i1 [[TMP42]], label %[[PRED_STORE_IF4:.*]], label %[[PRED_STORE_CONTINUE5:.*]]
+; CHECK:       [[PRED_STORE_IF4]]:
+; CHECK-NEXT:    [[TMP43:%.*]] = getelementptr double, ptr [[DST]], i64 [[IV]]
+; CHECK-NEXT:    store double [[TMP13]], ptr [[TMP43]], align 8, !alias.scope [[META88]], !noalias [[META85]]
+; CHECK-NEXT:    [[TMP44:%.*]] = getelementptr i8, ptr [[TMP43]], i64 8
+; CHECK-NEXT:    store double 1.000000e+01, ptr [[TMP44]], align 8, !alias.scope [[META88]], !noalias [[META85]]
+; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE5]]
+; CHECK:       [[PRED_STORE_CONTINUE5]]:
+; CHECK-NEXT:    [[TMP45:%.*]] = extractelement <2 x i1> [[TMP12]], i32 1
+; CHECK-NEXT:    br i1 [[TMP45]], label %[[PRED_STORE_IF6:.*]], label %[[PRED_STORE_CONTINUE7]]
+; CHECK:       [[PRED_STORE_IF6]]:
+; CHECK-NEXT:    [[TMP46:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP1]]
+; CHECK-NEXT:    store double [[TMP14]], ptr [[TMP46]], align 8, !alias.scope [[META88]], !noalias [[META85]]
+; CHECK-NEXT:    [[TMP47:%.*]] = getelementptr i8, ptr [[TMP46]], i64 8
+; CHECK-NEXT:    store double 1.000000e+01, ptr [[TMP47]], align 8, !alias.scope [[META88]], !noalias [[META85]]
+; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE7]]
+; CHECK:       [[PRED_STORE_CONTINUE7]]:
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
+; CHECK-NEXT:    [[TMP48:%.*]] = icmp eq i64 [[INDEX_NEXT]], 100
+; CHECK-NEXT:    br i1 [[TMP48]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP90:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br [[EXIT:label %.*]]
+; CHECK:       [[SCALAR_PH]]:
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop.latch ]
+  %gep.src = getelementptr double, ptr %src, i64 %iv
+  %gep.flag = getelementptr i8, ptr %gep.src, i64 152
+  %c = load i32, ptr %gep.flag, align 4
+  %cmp = icmp eq i32 %c, 0
+  %v.1 = load double, ptr %gep.src, align 8
+  br i1 %cmp, label %then, label %else
+
 then:
   %gep.dst1.then = getelementptr double, ptr %dst, i64 %iv
   store double %v.1, ptr %gep.dst1.then, align 8
@@ -994,11 +958,6 @@ else:
   %r.1 = fadd double %v.1, 8.0
   %gep.dst1.else = getelementptr double, ptr %dst, i64 %iv
   store double %r.1, ptr %gep.dst1.else, align 8
-  %gep.src.16 = getelementptr i8, ptr %gep.src, i64 16
-  %v.3 = load double, ptr %gep.src.16, align 8
-  %r.2 = fmul double 20.0, %v.3
-  %gep.dst2.else = getelementptr i8, ptr %gep.dst1.else, i64 8
-  store double %r.2, ptr %gep.dst2.else, align 8
   br label %loop.latch
 
 loop.latch:
@@ -1084,3 +1043,124 @@ loop.latch:
 exit:
   ret void
 }
+
+; Test with 3 predicated stores to the same address, but with different
+; (non-complementary) predicates.
+define void @test_three_stores_with_different_predicates(ptr %dst, ptr %src, ptr %cond) {
+; CHECK-LABEL: define void @test_three_stores_with_different_predicates(
+; CHECK-SAME: ptr [[DST:%.*]], ptr [[SRC:%.*]], ptr [[COND:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    br label %[[VECTOR_MEMCHECK:.*]]
+; CHECK:       [[VECTOR_MEMCHECK]]:
+; CHECK-NEXT:    [[SCEVGEP:%.*]] = getelementptr i8, ptr [[DST]], i64 400
+; CHECK-NEXT:    [[SCEVGEP1:%.*]] = getelementptr i8, ptr [[COND]], i64 400
+; CHECK-NEXT:    [[BOUND0:%.*]] = icmp ult ptr [[DST]], [[SCEVGEP1]]
+; CHECK-NEXT:    [[BOUND1:%.*]] = icmp ult ptr [[COND]], [[SCEVGEP]]
+; CHECK-NEXT:    [[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]], [[BOUND1]]
+; CHECK-NEXT:    br i1 [[FOUND_CONFLICT]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[PRED_STORE_CONTINUE11:.*]] ]
+; CHECK-NEXT:    [[TMP0:%.*]] = add i32 [[INDEX]], 0
+; CHECK-NEXT:    [[TMP1:%.*]] = add i32 [[INDEX]], 1
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[COND]], i32 [[TMP0]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[TMP2]], align 4, !alias.scope [[META92:![0-9]+]]
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp ule <2 x i32> [[WIDE_LOAD]], splat (i32 11)
+; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <2 x i1> [[TMP3]], i32 0
+; CHECK-NEXT:    br i1 [[TMP4]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
+; CHECK:       [[PRED_STORE_IF]]:
+; CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP0]]
+; CHECK-NEXT:    store i32 1, ptr [[TMP5]], align 4, !alias.scope [[META95:![0-9]+]], !noalias [[META92]]
+; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE]]
+; CHECK:       [[PRED_STORE_CONTINUE]]:
+; CHECK-NEXT:    [[TMP6:%.*]] = extractelement <2 x i1> [[TMP3]], i32 1
+; CHECK-NEXT:    br i1 [[TMP6]], label %[[PRED_STORE_IF2:.*]], label %[[PRED_STORE_CONTINUE3:.*]]
+; CHECK:       [[PRED_STORE_IF2]]:
+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP1]]
+; CHECK-NEXT:    store i32 1, ptr [[TMP7]], align 4, !alias.scope [[META95]], !noalias [[META92]]
+; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE3]]
+; CHECK:       [[PRED_STORE_CONTINUE3]]:
+; CHECK-NEXT:    [[TMP8:%.*]] = xor <2 x i1> [[TMP3]], splat (i1 true)
+; CHECK-NEXT:    [[TMP9:%.*]] = or <2 x i1> [[TMP3]], [[TMP8]]
+; CHECK-NEXT:    [[TMP10:%.*]] = icmp ule <2 x i32> [[WIDE_LOAD]], splat (i32 10)
+; CHECK-NEXT:    [[TMP11:%.*]] = select <2 x i1> [[TMP9]], <2 x i1> [[TMP10]], <2 x i1> zeroinitializer
+; CHECK-NEXT:    [[TMP12:%.*]] = extractelement <2 x i1> [[TMP11]], i32 0
+; CHECK-NEXT:    br i1 [[TMP12]], label %[[PRED_STORE_IF4:.*]], label %[[PRED_STORE_CONTINUE5:.*]]
+; CHECK:       [[PRED_STORE_IF4]]:
+; CHECK-NEXT:    [[TMP13:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP0]]
+; CHECK-NEXT:    store i32 2, ptr [[TMP13]], align 4, !alias.scope [[META95]], !noalias [[META92]]
+; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE5]]
+; CHECK:       [[PRED_STORE_CONTINUE5]]:
+; CHECK-NEXT:    [[TMP14:%.*]] = extractelement <2 x i1> [[TMP11]], i32 1
+; CHECK-NEXT:    br i1 [[TMP14]], label %[[PRED_STORE_IF6:.*]], label %[[PRED_STORE_CONTINUE7:.*]]
+; CHECK:       [[PRED_STORE_IF6]]:
+; CHECK-NEXT:    [[TMP15:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP1]]
+; CHECK-NEXT:    store i32 2, ptr [[TMP15]], align 4, !alias.scope [[META95]], !noalias [[META92]]
+; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE7]]
+; CHECK:       [[PRED_STORE_CONTINUE7]]:
+; CHECK-NEXT:    [[TMP16:%.*]] = icmp ule <2 x i32> [[WIDE_LOAD]], splat (i32 9)
+; CHECK-NEXT:    [[TMP17:%.*]] = select <2 x i1> [[TMP9]], <2 x i1> [[TMP16]], <2 x i1> zeroinitializer
+; CHECK-NEXT:    [[TMP18:%.*]] = extractelement <2 x i1> [[TMP17]], i32 0
+; CHECK-NEXT:    br i1 [[TMP18]], label %[[PRED_STORE_IF8:.*]], label %[[PRED_STORE_CONTINUE9:.*]]
+; CHECK:       [[PRED_STORE_IF8]]:
+; CHECK-NEXT:    [[TMP19:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP0]]
+; CHECK-NEXT:    store i32 3, ptr [[TMP19]], align 4, !alias.scope [[META95]], !noalias [[META92]]
+; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE9]]
+; CHECK:       [[PRED_STORE_CONTINUE9]]:
+; CHECK-NEXT:    [[TMP20:%.*]] = extractelement <2 x i1> [[TMP17]], i32 1
+; CHECK-NEXT:    br i1 [[TMP20]], label %[[PRED_STORE_IF10:.*]], label %[[PRED_STORE_CONTINUE11]]
+; CHECK:       [[PRED_STORE_IF10]]:
+; CHECK-NEXT:    [[TMP21:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[TMP1]]
+; CHECK-NEXT:    store i32 3, ptr [[TMP21]], align 4, !alias.scope [[META95]], !noalias [[META92]]
+; CHECK-NEXT:    br label %[[PRED_STORE_CONTINUE11]]
+; CHECK:       [[PRED_STORE_CONTINUE11]]:
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
+; CHECK-NEXT:    [[TMP22:%.*]] = icmp eq i32 [[INDEX_NEXT]], 100
+; CHECK-NEXT:    br i1 [[TMP22]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP97:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br [[EXIT:label %.*]]
+; CHECK:       [[SCALAR_PH]]:
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop.latch ]
+  %gep.cond = getelementptr inbounds i32, ptr %cond, i32 %iv
+  %c = load i32, ptr %gep.cond, align 4
+  %c.0 = icmp ule i32 %c, 11
+  br i1 %c.0, label %then.0, label %continue.0
+
+then.0:
+  %gep.dst.then.0 = getelementptr inbounds i32, ptr %dst, i32 %iv
+  store i32 1, ptr %gep.dst.then.0, align 4
+  br label %continue.0
+
+continue.0:
+  %c.1 = icmp ule i32 %c, 10
+  br i1 %c.1, label %then.1, label %continue.1
+
+then.1:
+  %gep.dst.then.1 = getelementptr inbounds i32, ptr %dst, i32 %iv
+  store i32 2, ptr %gep.dst.then.1, align 4
+  br label %continue.1
+
+continue.1:
+  %c.2 = icmp ule i32 %c, 9
+  br i1 %c.2, label %then.2, label %loop.latch
+
+then.2:
+  %gep.dst.then.2 = getelementptr inbounds i32, ptr %dst, i32 %iv
+  store i32 3, ptr %gep.dst.then.2, align 4
+  br label %loop.latch
+
+loop.latch:
+  %iv.next = add nuw nsw i32 %iv, 1
+  %ec = icmp eq i32 %iv.next, 100
+  br i1 %ec, label %exit, label %loop
+
+exit:
+  ret void
+}
+
diff --git a/llvm/test/Transforms/LoopVectorize/iv_outside_user.ll b/llvm/test/Transforms/LoopVectorize/iv_outside_user.ll
index b4fd06316a2e5..4f19a7c586bc3 100644
--- a/llvm/test/Transforms/LoopVectorize/iv_outside_user.ll
+++ b/llvm/test/Transforms/LoopVectorize/iv_outside_user.ll
@@ -152,59 +152,106 @@ for.end:
   ret ptr %ptr.phi
 }
 
-define ptr @both(i32 %k)  {
-; CHECK-LABEL: define ptr @both(
-; CHECK-SAME: i32 [[K:%.*]]) {
-; CHECK-NEXT:  [[ENTRY:.*]]:
-; CHECK-NEXT:    [[BASE:%.*]] = getelementptr inbounds i32, ptr undef, i64 1
-; CHECK-NEXT:    [[TMP0:%.*]] = add i32 [[K]], -1
-; CHECK-NEXT:    [[TMP1:%.*]] = zext i32 [[TMP0]] to i64
-; CHECK-NEXT:    [[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 1
-; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 2
-; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
-; CHECK:       [[VECTOR_PH]]:
-; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 2
-; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
-; CHECK-NEXT:    [[IND_END:%.*]] = trunc i64 [[N_VEC]] to i32
-; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[N_VEC]], 4
-; CHECK-NEXT:    [[IND_END1:%.*]] = getelementptr i8, ptr [[BASE]], i64 [[TMP3]]
-; CHECK-NEXT:    [[TMP4:%.*]] = mul i64 [[N_VEC]], 4
-; CHECK-NEXT:    [[IND_END2:%.*]] = getelementptr i8, ptr undef, i64 [[TMP4]]
-; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
-; CHECK:       [[VECTOR_BODY]]:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
-; CHECK-NEXT:    [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; CHECK-NEXT:    br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], {{!llvm.loop ![0-9]+}}
-; CHECK:       [[MIDDLE_BLOCK]]:
-; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
-; CHECK-NEXT:    [[IND_ESCAPE:%.*]] = getelementptr i8, ptr [[IND_END1]], i64 -4
-; CHECK-NEXT:    br i1 [[CMP_N]], label %[[FOR_END:.*]], label %[[SCALAR_PH]]
-; CHECK:       [[SCALAR_PH]]:
-; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i32 [ [[IND_END]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
-; CHECK-NEXT:    [[BC_RESUME_VAL1:%.*]] = phi ptr [ [[IND_END1]], %[[MIDDLE_BLOCK]] ], [ [[BASE]], %[[ENTRY]] ]
-; CHECK-NEXT:    [[BC_RESUME_VAL2:%.*]] = phi ptr [ [[IND_END2]], %[[MIDDLE_BLOCK]] ], [ undef, %[[ENTRY]] ]
-; CHECK-NEXT:    br label %[[FOR_BODY:.*]]
-; CHECK:       [[FOR_BODY]]:
-; CHECK-NEXT:    [[INC_PHI:%.*]] = phi i32 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INC:%.*]], %[[FOR_BODY]] ]
-; CHECK-NEXT:    [[INC_LAG1:%.*]] = phi ptr [ [[BC_RESUME_VAL1]], %[[SCALAR_PH]] ], [ [[TMP:%.*]], %[[FOR_BODY]] ]
-; CHECK-NEXT:    [[INC_LAG2:%.*]] = phi ptr [ [[BC_RESUME_VAL2]], %[[SCALAR_PH]] ], [ [[INC_LAG1]], %[[FOR_BODY]] ]
-; CHECK-NEXT:    [[TMP]] = getelementptr inbounds i32, ptr [[INC_LAG1]], i64 1
-; CHECK-NEXT:    [[INC]] = add nsw i32 [[INC_PHI]], 1
-; CHECK-NEXT:    [[CMP:%.*]] = icmp eq i32 [[INC]], [[K]]
-; CHECK-NEXT:    br i1 [[CMP]], label %[[FOR_END]], label %[[FOR_BODY]], {{!llvm.loop ![0-9]+}}
-; CHECK:       [[FOR_END]]:
-; CHECK-NEXT:    [[INC_LAG1_LCSSA:%.*]] = phi ptr [ [[INC_LAG1]], %[[FOR_BODY]] ], [ [[IND_ESCAPE]], %[[MIDDLE_BLOCK]] ]
-; CHECK-NEXT:    ret ptr [[INC_LAG1_LCSSA]]
+define ptr @both(ptr %p, i32 %k)  {
+; VEC-LABEL: define ptr @both(
+; VEC-SAME: ptr [[P:%.*]], i32 [[K:%.*]]) {
+; VEC-NEXT:  [[ENTRY:.*]]:
+; VEC-NEXT:    [[BASE:%.*]] = getelementptr inbounds i32, ptr [[P]], i64 1
+; VEC-NEXT:    [[TMP0:%.*]] = add i32 [[K]], -1
+; VEC-NEXT:    [[TMP1:%.*]] = zext i32 [[TMP0]] to i64
+; VEC-NEXT:    [[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 1
+; VEC-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 2
+; VEC-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; VEC:       [[VECTOR_PH]]:
+; VEC-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 2
+; VEC-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
+; VEC-NEXT:    [[TMP3:%.*]] = trunc i64 [[N_VEC]] to i32
+; VEC-NEXT:    [[TMP4:%.*]] = mul i64 [[N_VEC]], 4
+; VEC-NEXT:    [[TMP5:%.*]] = getelementptr i8, ptr [[BASE]], i64 [[TMP4]]
+; VEC-NEXT:    br label %[[VECTOR_BODY:.*]]
+; VEC:       [[VECTOR_BODY]]:
+; VEC-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; VEC-NEXT:    [[POINTER_PHI:%.*]] = phi ptr [ [[BASE]], %[[VECTOR_PH]] ], [ [[PTR_IND:%.*]], %[[VECTOR_BODY]] ]
+; VEC-NEXT:    [[VECTOR_GEP:%.*]] = getelementptr i8, ptr [[POINTER_PHI]], <2 x i64> <i64 0, i64 4>
+; VEC-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
+; VEC-NEXT:    [[PTR_IND]] = getelementptr i8, ptr [[POINTER_PHI]], i64 8
+; VEC-NEXT:    [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; VEC-NEXT:    br i1 [[TMP6]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], {{!llvm.loop ![0-9]+}}
+; VEC:       [[MIDDLE_BLOCK]]:
+; VEC-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <2 x ptr> [[VECTOR_GEP]], i32 1
+; VEC-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
+; VEC-NEXT:    [[IND_ESCAPE:%.*]] = getelementptr i8, ptr [[TMP5]], i64 -4
+; VEC-NEXT:    br i1 [[CMP_N]], label %[[FOR_END:.*]], label %[[SCALAR_PH]]
+; VEC:       [[SCALAR_PH]]:
+; VEC-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i32 [ [[TMP3]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; VEC-NEXT:    [[BC_RESUME_VAL1:%.*]] = phi ptr [ [[TMP5]], %[[MIDDLE_BLOCK]] ], [ [[BASE]], %[[ENTRY]] ]
+; VEC-NEXT:    [[SCALAR_RECUR_INIT:%.*]] = phi ptr [ [[VECTOR_RECUR_EXTRACT]], %[[MIDDLE_BLOCK]] ], [ [[BASE]], %[[ENTRY]] ]
+; VEC-NEXT:    br label %[[FOR_BODY:.*]]
+; VEC:       [[FOR_BODY]]:
+; VEC-NEXT:    [[INC_PHI:%.*]] = phi i32 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INC:%.*]], %[[FOR_BODY]] ]
+; VEC-NEXT:    [[INC_LAG1:%.*]] = phi ptr [ [[BC_RESUME_VAL1]], %[[SCALAR_PH]] ], [ [[TMP:%.*]], %[[FOR_BODY]] ]
+; VEC-NEXT:    [[INC_LAG2:%.*]] = phi ptr [ [[SCALAR_RECUR_INIT]], %[[SCALAR_PH]] ], [ [[INC_LAG1]], %[[FOR_BODY]] ]
+; VEC-NEXT:    [[TMP]] = getelementptr inbounds i32, ptr [[INC_LAG1]], i64 1
+; VEC-NEXT:    [[INC]] = add nsw i32 [[INC_PHI]], 1
+; VEC-NEXT:    [[CMP:%.*]] = icmp eq i32 [[INC]], [[K]]
+; VEC-NEXT:    br i1 [[CMP]], label %[[FOR_END]], label %[[FOR_BODY]], {{!llvm.loop ![0-9]+}}
+; VEC:       [[FOR_END]]:
+; VEC-NEXT:    [[INC_LAG1_LCSSA:%.*]] = phi ptr [ [[INC_LAG1]], %[[FOR_BODY]] ], [ [[IND_ESCAPE]], %[[MIDDLE_BLOCK]] ]
+; VEC-NEXT:    ret ptr [[INC_LAG1_LCSSA]]
+;
+; INTERLEAVE-LABEL: define ptr @both(
+; INTERLEAVE-SAME: ptr [[P:%.*]], i32 [[K:%.*]]) {
+; INTERLEAVE-NEXT:  [[ENTRY:.*]]:
+; INTERLEAVE-NEXT:    [[BASE:%.*]] = getelementptr inbounds i32, ptr [[P]], i64 1
+; INTERLEAVE-NEXT:    [[TMP0:%.*]] = add i32 [[K]], -1
+; INTERLEAVE-NEXT:    [[TMP1:%.*]] = zext i32 [[TMP0]] to i64
+; INTERLEAVE-NEXT:    [[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 1
+; INTERLEAVE-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 2
+; INTERLEAVE-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; INTERLEAVE:       [[VECTOR_PH]]:
+; INTERLEAVE-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 2
+; INTERLEAVE-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
+; INTERLEAVE-NEXT:    [[TMP3:%.*]] = trunc i64 [[N_VEC]] to i32
+; INTERLEAVE-NEXT:    [[TMP6:%.*]] = mul i64 [[N_VEC]], 4
+; INTERLEAVE-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[BASE]], i64 [[TMP6]]
+; INTERLEAVE-NEXT:    br label %[[VECTOR_BODY:.*]]
+; INTERLEAVE:       [[VECTOR_BODY]]:
+; INTERLEAVE-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; INTERLEAVE-NEXT:    [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 4
+; INTERLEAVE-NEXT:    [[TMP8:%.*]] = add i64 [[OFFSET_IDX]], 4
+; INTERLEAVE-NEXT:    [[NEXT_GEP1:%.*]] = getelementptr i8, ptr [[BASE]], i64 [[TMP8]]
+; INTERLEAVE-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
+; INTERLEAVE-NEXT:    [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; INTERLEAVE-NEXT:    br i1 [[TMP7]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], {{!llvm.loop ![0-9]+}}
+; INTERLEAVE:       [[MIDDLE_BLOCK]]:
+; INTERLEAVE-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
+; INTERLEAVE-NEXT:    [[IND_ESCAPE:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i64 -4
+; INTERLEAVE-NEXT:    br i1 [[CMP_N]], label %[[FOR_END:.*]], label %[[SCALAR_PH]]
+; INTERLEAVE:       [[SCALAR_PH]]:
+; INTERLEAVE-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i32 [ [[TMP3]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; INTERLEAVE-NEXT:    [[BC_RESUME_VAL1:%.*]] = phi ptr [ [[NEXT_GEP]], %[[MIDDLE_BLOCK]] ], [ [[BASE]], %[[ENTRY]] ]
+; INTERLEAVE-NEXT:    [[SCALAR_RECUR_INIT:%.*]] = phi ptr [ [[NEXT_GEP1]], %[[MIDDLE_BLOCK]] ], [ [[BASE]], %[[ENTRY]] ]
+; INTERLEAVE-NEXT:    br label %[[FOR_BODY:.*]]
+; INTERLEAVE:       [[FOR_BODY]]:
+; INTERLEAVE-NEXT:    [[INC_PHI:%.*]] = phi i32 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INC:%.*]], %[[FOR_BODY]] ]
+; INTERLEAVE-NEXT:    [[INC_LAG1:%.*]] = phi ptr [ [[BC_RESUME_VAL1]], %[[SCALAR_PH]] ], [ [[TMP:%.*]], %[[FOR_BODY]] ]
+; INTERLEAVE-NEXT:    [[INC_LAG2:%.*]] = phi ptr [ [[SCALAR_RECUR_INIT]], %[[SCALAR_PH]] ], [ [[INC_LAG1]], %[[FOR_BODY]] ]
+; INTERLEAVE-NEXT:    [[TMP]] = getelementptr inbounds i32, ptr [[INC_LAG1]], i64 1
+; INTERLEAVE-NEXT:    [[INC]] = add nsw i32 [[INC_PHI]], 1
+; INTERLEAVE-NEXT:    [[CMP:%.*]] = icmp eq i32 [[INC]], [[K]]
+; INTERLEAVE-NEXT:    br i1 [[CMP]], label %[[FOR_END]], label %[[FOR_BODY]], {{!llvm.loop ![0-9]+}}
+; INTERLEAVE:       [[FOR_END]]:
+; INTERLEAVE-NEXT:    [[INC_LAG1_LCSSA:%.*]] = phi ptr [ [[INC_LAG1]], %[[FOR_BODY]] ], [ [[IND_ESCAPE]], %[[MIDDLE_BLOCK]] ]
+; INTERLEAVE-NEXT:    ret ptr [[INC_LAG1_LCSSA]]
 ;
 entry:
-  %base = getelementptr inbounds i32, ptr undef, i64 1
+  %base = getelementptr inbounds i32, ptr %p, i64 1
   br label %for.body
 
 for.body:
   %inc.phi = phi i32 [ 0, %entry ], [ %inc, %for.body ]
   %inc.lag1 = phi ptr [ %base, %entry ], [ %tmp, %for.body]
-  %inc.lag2 = phi ptr [ undef, %entry ], [ %inc.lag1, %for.body]
+  %inc.lag2 = phi ptr [ %base, %entry ], [ %inc.lag1, %for.body]
   %tmp = getelementptr inbounds i32, ptr %inc.lag1, i64 1
   %inc = add nsw i32 %inc.phi, 1
   %cmp = icmp eq i32 %inc, %k
diff --git a/llvm/test/Transforms/LoopVectorize/pr58811-scev-expansion.ll b/llvm/test/Transforms/LoopVectorize/pr58811-scev-expansion.ll
index 879c7ae5c3c43..5fb426ff7c183 100644
--- a/llvm/test/Transforms/LoopVectorize/pr58811-scev-expansion.ll
+++ b/llvm/test/Transforms/LoopVectorize/pr58811-scev-expansion.ll
@@ -1,54 +1,133 @@
-; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 6
+; RUN: opt -passes=loop-vectorize -force-vector-width=2 -force-vector-interleave=1 -S %s | FileCheck --check-prefix=VF2 %s
 ; RUN: opt -passes=loop-vectorize -force-vector-width=4 -force-vector-interleave=1 -S %s | FileCheck %s
 
-define void @test1_pr58811() {
-; CHECK-LABEL: @test1_pr58811(
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    br label [[LOOP_1_PREHEADER:%.*]]
-; CHECK:       loop.1.preheader:
-; CHECK-NEXT:    [[IV_1_PH:%.*]] = phi i32 [ [[SUB93_2:%.*]], [[UNREACHABLE_BB:%.*]] ], [ 0, [[ENTRY:%.*]] ]
+define void @test1_pr58811(ptr %dst) {
+; VF2-LABEL: define void @test1_pr58811(
+; VF2-SAME: ptr [[DST:%.*]]) {
+; VF2-NEXT:  [[ENTRY:.*]]:
+; VF2-NEXT:    br label %[[LOOP_1_PREHEADER:.*]]
+; VF2:       [[LOOP_1_PREHEADER]]:
+; VF2-NEXT:    [[IV_1_PH:%.*]] = phi i32 [ [[SUB93_2:%.*]], %[[UNREACHABLE_BB:.*]] ], [ 0, %[[ENTRY]] ]
+; VF2-NEXT:    [[TMP0:%.*]] = sub i32 0, [[IV_1_PH]]
+; VF2-NEXT:    br label %[[LOOP_1:.*]]
+; VF2:       [[LOOP_1]]:
+; VF2-NEXT:    [[INDUCTION_IV:%.*]] = phi i32 [ [[INDUCTION_IV_NEXT:%.*]], %[[LOOP_1]] ], [ [[TMP0]], %[[LOOP_1_PREHEADER]] ]
+; VF2-NEXT:    [[IV_1:%.*]] = phi i32 [ [[IV_1_NEXT:%.*]], %[[LOOP_1]] ], [ [[IV_1_PH]], %[[LOOP_1_PREHEADER]] ]
+; VF2-NEXT:    [[IV_2:%.*]] = phi i32 [ [[IV_2_NEXT:%.*]], %[[LOOP_1]] ], [ 0, %[[LOOP_1_PREHEADER]] ]
+; VF2-NEXT:    [[TMP1:%.*]] = mul nuw nsw i32 [[IV_2]], -1
+; VF2-NEXT:    [[IV_2_NEXT]] = add i32 [[IV_2]], 1
+; VF2-NEXT:    [[IV_1_NEXT]] = add i32 [[IV_2]], [[IV_1]]
+; VF2-NEXT:    [[INDUCTION_IV_NEXT]] = add i32 [[INDUCTION_IV]], [[TMP1]]
+; VF2-NEXT:    br i1 false, label %[[LOOP_1]], label %[[LOOP_2_PREHEADER:.*]]
+; VF2:       [[LOOP_2_PREHEADER]]:
+; VF2-NEXT:    [[IV_1_LCSSA:%.*]] = phi i32 [ [[IV_1]], %[[LOOP_1]] ]
+; VF2-NEXT:    br label %[[VECTOR_PH:.*]]
+; VF2:       [[VECTOR_PH]]:
+; VF2-NEXT:    [[TMP2:%.*]] = mul i32 198, [[INDUCTION_IV]]
+; VF2-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i32> poison, i32 [[INDUCTION_IV]], i64 0
+; VF2-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i32> [[BROADCAST_SPLATINSERT]], <2 x i32> poison, <2 x i32> zeroinitializer
+; VF2-NEXT:    [[TMP3:%.*]] = mul <2 x i32> <i32 0, i32 1>, [[BROADCAST_SPLAT]]
+; VF2-NEXT:    [[INDUCTION:%.*]] = add <2 x i32> zeroinitializer, [[TMP3]]
+; VF2-NEXT:    [[TMP4:%.*]] = mul i32 [[INDUCTION_IV]], 2
+; VF2-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <2 x i32> poison, i32 [[TMP4]], i64 0
+; VF2-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <2 x i32> [[BROADCAST_SPLATINSERT1]], <2 x i32> poison, <2 x i32> zeroinitializer
+; VF2-NEXT:    br label %[[VECTOR_BODY:.*]]
+; VF2:       [[VECTOR_BODY]]:
+; VF2-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; VF2-NEXT:    [[VEC_IND:%.*]] = phi <2 x i32> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; VF2-NEXT:    [[OFFSET_IDX:%.*]] = trunc i32 [[INDEX]] to i16
+; VF2-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[DST]], i16 [[OFFSET_IDX]]
+; VF2-NEXT:    store <2 x i32> [[VEC_IND]], ptr [[TMP5]], align 4
+; VF2-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
+; VF2-NEXT:    [[VEC_IND_NEXT]] = add <2 x i32> [[VEC_IND]], [[BROADCAST_SPLAT2]]
+; VF2-NEXT:    [[TMP6:%.*]] = icmp eq i32 [[INDEX_NEXT]], 198
+; VF2-NEXT:    br i1 [[TMP6]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; VF2:       [[MIDDLE_BLOCK]]:
+; VF2-NEXT:    br label %[[SCALAR_PH:.*]]
+; VF2:       [[SCALAR_PH]]:
+; VF2-NEXT:    br label %[[LOOP_2:.*]]
+; VF2:       [[LOOP_2]]:
+; VF2-NEXT:    [[IV_3:%.*]] = phi i16 [ [[IV_3_NEXT:%.*]], %[[LOOP_2]] ], [ 198, %[[SCALAR_PH]] ]
+; VF2-NEXT:    [[IV_4:%.*]] = phi i32 [ [[IV_4_NEXT:%.*]], %[[LOOP_2]] ], [ [[TMP2]], %[[SCALAR_PH]] ]
+; VF2-NEXT:    [[GEP_DST:%.*]] = getelementptr inbounds i32, ptr [[DST]], i16 [[IV_3]]
+; VF2-NEXT:    store i32 [[IV_4]], ptr [[GEP_DST]], align 4
+; VF2-NEXT:    [[IV_4_NEXT]] = sub i32 [[IV_4]], [[IV_1_LCSSA]]
+; VF2-NEXT:    [[IV_3_NEXT]] = add i16 [[IV_3]], 1
+; VF2-NEXT:    [[CMP88_1:%.*]] = icmp ult i16 [[IV_3]], 198
+; VF2-NEXT:    br i1 [[CMP88_1]], label %[[LOOP_2]], label %[[LOOP_3_PREHEADER:.*]], !llvm.loop [[LOOP3:![0-9]+]]
+; VF2:       [[LOOP_3_PREHEADER]]:
+; VF2-NEXT:    [[IV_4_LCSSA:%.*]] = phi i32 [ [[IV_4]], %[[LOOP_2]] ]
+; VF2-NEXT:    br label %[[LOOP_3:.*]]
+; VF2:       [[LOOP_3]]:
+; VF2-NEXT:    [[IV_5:%.*]] = phi i32 [ [[SUB93_2]], %[[LOOP_3]] ], [ 0, %[[LOOP_3_PREHEADER]] ]
+; VF2-NEXT:    [[SUB93_2]] = sub i32 [[IV_5]], [[IV_4_LCSSA]]
+; VF2-NEXT:    br label %[[LOOP_3]]
+; VF2:       [[UNREACHABLE_BB]]:
+; VF2-NEXT:    br label %[[LOOP_1_PREHEADER]]
+;
+; CHECK-LABEL: define void @test1_pr58811(
+; CHECK-SAME: ptr [[DST:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP_1_PREHEADER:.*]]
+; CHECK:       [[LOOP_1_PREHEADER]]:
+; CHECK-NEXT:    [[IV_1_PH:%.*]] = phi i32 [ [[SUB93_2:%.*]], %[[UNREACHABLE_BB:.*]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    [[TMP0:%.*]] = sub i32 0, [[IV_1_PH]]
-; CHECK-NEXT:    br label [[LOOP_1:%.*]]
-; CHECK:       loop.1:
-; CHECK-NEXT:    [[INDUCTION_IV:%.*]] = phi i32 [ [[INDUCTION_IV_NEXT:%.*]], [[LOOP_1]] ], [ [[TMP0]], [[LOOP_1_PREHEADER]] ]
-; CHECK-NEXT:    [[IV_1:%.*]] = phi i32 [ [[IV_1_NEXT:%.*]], [[LOOP_1]] ], [ [[IV_1_PH]], [[LOOP_1_PREHEADER]] ]
-; CHECK-NEXT:    [[IV_2:%.*]] = phi i32 [ [[IV_2_NEXT:%.*]], [[LOOP_1]] ], [ 0, [[LOOP_1_PREHEADER]] ]
+; CHECK-NEXT:    br label %[[LOOP_1:.*]]
+; CHECK:       [[LOOP_1]]:
+; CHECK-NEXT:    [[INDUCTION_IV:%.*]] = phi i32 [ [[INDUCTION_IV_NEXT:%.*]], %[[LOOP_1]] ], [ [[TMP0]], %[[LOOP_1_PREHEADER]] ]
+; CHECK-NEXT:    [[IV_1:%.*]] = phi i32 [ [[IV_1_NEXT:%.*]], %[[LOOP_1]] ], [ [[IV_1_PH]], %[[LOOP_1_PREHEADER]] ]
+; CHECK-NEXT:    [[IV_2:%.*]] = phi i32 [ [[IV_2_NEXT:%.*]], %[[LOOP_1]] ], [ 0, %[[LOOP_1_PREHEADER]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = mul nuw nsw i32 [[IV_2]], -1
 ; CHECK-NEXT:    [[IV_2_NEXT]] = add i32 [[IV_2]], 1
 ; CHECK-NEXT:    [[IV_1_NEXT]] = add i32 [[IV_2]], [[IV_1]]
 ; CHECK-NEXT:    [[INDUCTION_IV_NEXT]] = add i32 [[INDUCTION_IV]], [[TMP1]]
-; CHECK-NEXT:    br i1 false, label [[LOOP_1]], label [[LOOP_2_PREHEADER:%.*]]
-; CHECK:       loop.2.preheader:
-; CHECK-NEXT:    [[IV_1_LCSSA:%.*]] = phi i32 [ [[IV_1]], [[LOOP_1]] ]
-; CHECK-NEXT:    br label [[VECTOR_PH:%.*]]
-; CHECK:       vector.ph:
+; CHECK-NEXT:    br i1 false, label %[[LOOP_1]], label %[[LOOP_2_PREHEADER:.*]]
+; CHECK:       [[LOOP_2_PREHEADER]]:
+; CHECK-NEXT:    [[IV_1_LCSSA:%.*]] = phi i32 [ [[IV_1]], %[[LOOP_1]] ]
+; CHECK-NEXT:    br label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
 ; CHECK-NEXT:    [[IND_END:%.*]] = mul i32 196, [[INDUCTION_IV]]
-; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
-; CHECK:       vector.body:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[INDUCTION_IV]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP3:%.*]] = mul <4 x i32> <i32 0, i32 1, i32 2, i32 3>, [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add <4 x i32> zeroinitializer, [[TMP3]]
+; CHECK-NEXT:    [[TMP4:%.*]] = mul i32 [[INDUCTION_IV]], 4
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <4 x i32> poison, i32 [[TMP4]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT1]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i32> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = trunc i32 [[INDEX]] to i16
+; CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[DST]], i16 [[OFFSET_IDX]]
+; CHECK-NEXT:    store <4 x i32> [[VEC_IND]], ptr [[TMP5]], align 4
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <4 x i32> [[VEC_IND]], [[BROADCAST_SPLAT2]]
 ; CHECK-NEXT:    [[TMP2:%.*]] = icmp eq i32 [[INDEX_NEXT]], 196
-; CHECK-NEXT:    br i1 [[TMP2]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
-; CHECK:       middle.block:
-; CHECK-NEXT:    br label [[SCALAR_PH:%.*]]
-; CHECK:       scalar.ph:
-; CHECK-NEXT:    br label [[LOOP_2:%.*]]
-; CHECK:       loop.2:
-; CHECK-NEXT:    [[IV_3:%.*]] = phi i16 [ [[IV_3_NEXT:%.*]], [[LOOP_2]] ], [ 196, [[SCALAR_PH]] ]
-; CHECK-NEXT:    [[IV_4:%.*]] = phi i32 [ [[IV_4_NEXT:%.*]], [[LOOP_2]] ], [ [[IND_END]], [[SCALAR_PH]] ]
+; CHECK-NEXT:    br i1 [[TMP2]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[SCALAR_PH:.*]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    br label %[[LOOP_2:.*]]
+; CHECK:       [[LOOP_2]]:
+; CHECK-NEXT:    [[IV_3:%.*]] = phi i16 [ [[IV_3_NEXT:%.*]], %[[LOOP_2]] ], [ 196, %[[SCALAR_PH]] ]
+; CHECK-NEXT:    [[IV_4:%.*]] = phi i32 [ [[IV_4_NEXT:%.*]], %[[LOOP_2]] ], [ [[IND_END]], %[[SCALAR_PH]] ]
+; CHECK-NEXT:    [[GEP_DST:%.*]] = getelementptr inbounds i32, ptr [[DST]], i16 [[IV_3]]
+; CHECK-NEXT:    store i32 [[IV_4]], ptr [[GEP_DST]], align 4
 ; CHECK-NEXT:    [[IV_4_NEXT]] = sub i32 [[IV_4]], [[IV_1_LCSSA]]
 ; CHECK-NEXT:    [[IV_3_NEXT]] = add i16 [[IV_3]], 1
 ; CHECK-NEXT:    [[CMP88_1:%.*]] = icmp ult i16 [[IV_3]], 198
-; CHECK-NEXT:    br i1 [[CMP88_1]], label [[LOOP_2]], label [[LOOP_3_PREHEADER:%.*]], !llvm.loop [[LOOP3:![0-9]+]]
-; CHECK:       loop.3.preheader:
-; CHECK-NEXT:    [[IV_4_LCSSA:%.*]] = phi i32 [ [[IV_4]], [[LOOP_2]] ]
-; CHECK-NEXT:    br label [[LOOP_3:%.*]]
-; CHECK:       loop.3:
-; CHECK-NEXT:    [[IV_5:%.*]] = phi i32 [ [[SUB93_2]], [[LOOP_3]] ], [ 0, [[LOOP_3_PREHEADER]] ]
+; CHECK-NEXT:    br i1 [[CMP88_1]], label %[[LOOP_2]], label %[[LOOP_3_PREHEADER:.*]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       [[LOOP_3_PREHEADER]]:
+; CHECK-NEXT:    [[IV_4_LCSSA:%.*]] = phi i32 [ [[IV_4]], %[[LOOP_2]] ]
+; CHECK-NEXT:    br label %[[LOOP_3:.*]]
+; CHECK:       [[LOOP_3]]:
+; CHECK-NEXT:    [[IV_5:%.*]] = phi i32 [ [[SUB93_2]], %[[LOOP_3]] ], [ 0, %[[LOOP_3_PREHEADER]] ]
 ; CHECK-NEXT:    [[SUB93_2]] = sub i32 [[IV_5]], [[IV_4_LCSSA]]
-; CHECK-NEXT:    br label [[LOOP_3]]
-; CHECK:       unreachable.bb:
-; CHECK-NEXT:    br label [[LOOP_1_PREHEADER]]
+; CHECK-NEXT:    br label %[[LOOP_3]]
+; CHECK:       [[UNREACHABLE_BB]]:
+; CHECK-NEXT:    br label %[[LOOP_1_PREHEADER]]
 ;
 entry:
   br label %loop.1.preheader
@@ -71,6 +150,8 @@ loop.2.preheader:
 loop.2:
   %iv.3 = phi i16 [ %iv.3.next, %loop.2 ], [ 0, %loop.2.preheader ]
   %iv.4 = phi i32 [ %iv.4.next, %loop.2 ], [ 0, %loop.2.preheader ]
+  %gep.dst = getelementptr inbounds i32, ptr %dst, i16 %iv.3
+  store i32 %iv.4, ptr %gep.dst
   %iv.4.next = sub i32 %iv.4, %iv.1.lcssa
   %iv.3.next = add i16 %iv.3, 1
   %cmp88.1 = icmp ult i16 %iv.3, 198
@@ -89,56 +170,136 @@ unreachable.bb:                                   ; No predecessors!
   br label %loop.1.preheader
 }
 
-define void @test2_pr58811() {
-; CHECK-LABEL: @test2_pr58811(
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    br label [[LOOP_1_HEADER:%.*]]
-; CHECK:       loop.1.header.loopexit:
-; CHECK-NEXT:    [[SUB93_2_LCSSA:%.*]] = phi i32 [ [[SUB93_2:%.*]], [[LOOP_4:%.*]] ]
-; CHECK-NEXT:    br label [[LOOP_1_HEADER]]
-; CHECK:       loop.1.header:
-; CHECK-NEXT:    [[P_1:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[SUB93_2_LCSSA]], [[LOOP_1_HEADER_LOOPEXIT:%.*]] ]
+define void @test2_pr58811(ptr %dst) {
+; VF2-LABEL: define void @test2_pr58811(
+; VF2-SAME: ptr [[DST:%.*]]) {
+; VF2-NEXT:  [[ENTRY:.*]]:
+; VF2-NEXT:    br label %[[LOOP_1_HEADER:.*]]
+; VF2:       [[LOOP_1_HEADER_LOOPEXIT:.*]]:
+; VF2-NEXT:    [[SUB93_2_LCSSA:%.*]] = phi i32 [ [[SUB93_2:%.*]], %[[LOOP_4:.*]] ]
+; VF2-NEXT:    br label %[[LOOP_1_HEADER]]
+; VF2:       [[LOOP_1_HEADER]]:
+; VF2-NEXT:    [[P_1:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[SUB93_2_LCSSA]], %[[LOOP_1_HEADER_LOOPEXIT]] ]
+; VF2-NEXT:    [[TMP0:%.*]] = mul i32 [[P_1]], -1
+; VF2-NEXT:    br label %[[LOOP_2:.*]]
+; VF2:       [[LOOP_2]]:
+; VF2-NEXT:    [[INDUCTION_IV:%.*]] = phi i32 [ [[INDUCTION_IV_NEXT:%.*]], %[[LOOP_2]] ], [ [[TMP0]], %[[LOOP_1_HEADER]] ]
+; VF2-NEXT:    [[IV_2:%.*]] = phi i32 [ [[P_1]], %[[LOOP_1_HEADER]] ], [ [[ADD101:%.*]], %[[LOOP_2]] ]
+; VF2-NEXT:    [[IV_3:%.*]] = phi i32 [ 0, %[[LOOP_1_HEADER]] ], [ [[SUB93:%.*]], %[[LOOP_2]] ]
+; VF2-NEXT:    [[TMP1:%.*]] = mul nuw nsw i32 [[IV_3]], -1
+; VF2-NEXT:    [[SUB93]] = add i32 [[IV_3]], 1
+; VF2-NEXT:    [[ADD101]] = add i32 [[IV_3]], [[IV_2]]
+; VF2-NEXT:    [[INDUCTION_IV_NEXT]] = add i32 [[INDUCTION_IV]], [[TMP1]]
+; VF2-NEXT:    br i1 false, label %[[LOOP_2]], label %[[LOOP_3_PREHEADER:.*]]
+; VF2:       [[LOOP_3_PREHEADER]]:
+; VF2-NEXT:    [[INDUCTION_IV_LCSSA:%.*]] = phi i32 [ [[INDUCTION_IV]], %[[LOOP_2]] ]
+; VF2-NEXT:    [[IV_2_LCSSA:%.*]] = phi i32 [ [[IV_2]], %[[LOOP_2]] ]
+; VF2-NEXT:    br label %[[VECTOR_PH:.*]]
+; VF2:       [[VECTOR_PH]]:
+; VF2-NEXT:    [[TMP2:%.*]] = mul i32 198, [[INDUCTION_IV_LCSSA]]
+; VF2-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i32> poison, i32 [[INDUCTION_IV_LCSSA]], i64 0
+; VF2-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i32> [[BROADCAST_SPLATINSERT]], <2 x i32> poison, <2 x i32> zeroinitializer
+; VF2-NEXT:    [[TMP3:%.*]] = mul <2 x i32> <i32 0, i32 1>, [[BROADCAST_SPLAT]]
+; VF2-NEXT:    [[INDUCTION:%.*]] = add <2 x i32> zeroinitializer, [[TMP3]]
+; VF2-NEXT:    [[TMP4:%.*]] = mul i32 [[INDUCTION_IV_LCSSA]], 2
+; VF2-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <2 x i32> poison, i32 [[TMP4]], i64 0
+; VF2-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <2 x i32> [[BROADCAST_SPLATINSERT1]], <2 x i32> poison, <2 x i32> zeroinitializer
+; VF2-NEXT:    br label %[[VECTOR_BODY:.*]]
+; VF2:       [[VECTOR_BODY]]:
+; VF2-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; VF2-NEXT:    [[VEC_IND:%.*]] = phi <2 x i32> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; VF2-NEXT:    [[OFFSET_IDX:%.*]] = trunc i32 [[INDEX]] to i16
+; VF2-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[DST]], i16 [[OFFSET_IDX]]
+; VF2-NEXT:    store <2 x i32> [[VEC_IND]], ptr [[TMP5]], align 4
+; VF2-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
+; VF2-NEXT:    [[VEC_IND_NEXT]] = add <2 x i32> [[VEC_IND]], [[BROADCAST_SPLAT2]]
+; VF2-NEXT:    [[TMP6:%.*]] = icmp eq i32 [[INDEX_NEXT]], 198
+; VF2-NEXT:    br i1 [[TMP6]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; VF2:       [[MIDDLE_BLOCK]]:
+; VF2-NEXT:    br label %[[SCALAR_PH:.*]]
+; VF2:       [[SCALAR_PH]]:
+; VF2-NEXT:    br label %[[LOOP_3:.*]]
+; VF2:       [[LOOP_3]]:
+; VF2-NEXT:    [[IV_4:%.*]] = phi i16 [ [[INC_1:%.*]], %[[LOOP_3]] ], [ 198, %[[SCALAR_PH]] ]
+; VF2-NEXT:    [[IV_5:%.*]] = phi i32 [ [[SUB93_1:%.*]], %[[LOOP_3]] ], [ [[TMP2]], %[[SCALAR_PH]] ]
+; VF2-NEXT:    [[GEP_DST:%.*]] = getelementptr inbounds i32, ptr [[DST]], i16 [[IV_4]]
+; VF2-NEXT:    store i32 [[IV_5]], ptr [[GEP_DST]], align 4
+; VF2-NEXT:    [[SUB93_1]] = sub i32 [[IV_5]], [[IV_2_LCSSA]]
+; VF2-NEXT:    [[INC_1]] = add i16 [[IV_4]], 1
+; VF2-NEXT:    [[CMP88_1:%.*]] = icmp ult i16 [[IV_4]], 198
+; VF2-NEXT:    br i1 [[CMP88_1]], label %[[LOOP_3]], label %[[LOOP_4_PREHEADER:.*]], !llvm.loop [[LOOP5:![0-9]+]]
+; VF2:       [[LOOP_4_PREHEADER]]:
+; VF2-NEXT:    [[IV_5_LCSSA:%.*]] = phi i32 [ [[IV_5]], %[[LOOP_3]] ]
+; VF2-NEXT:    br label %[[LOOP_4]]
+; VF2:       [[LOOP_4]]:
+; VF2-NEXT:    [[IV_6:%.*]] = phi i32 [ [[SUB93_2]], %[[LOOP_4]] ], [ 0, %[[LOOP_4_PREHEADER]] ]
+; VF2-NEXT:    [[SUB93_2]] = sub i32 [[IV_6]], [[IV_5_LCSSA]]
+; VF2-NEXT:    br i1 false, label %[[LOOP_4]], label %[[LOOP_1_HEADER_LOOPEXIT]]
+;
+; CHECK-LABEL: define void @test2_pr58811(
+; CHECK-SAME: ptr [[DST:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP_1_HEADER:.*]]
+; CHECK:       [[LOOP_1_HEADER_LOOPEXIT:.*]]:
+; CHECK-NEXT:    [[SUB93_2_LCSSA:%.*]] = phi i32 [ [[SUB93_2:%.*]], %[[LOOP_4:.*]] ]
+; CHECK-NEXT:    br label %[[LOOP_1_HEADER]]
+; CHECK:       [[LOOP_1_HEADER]]:
+; CHECK-NEXT:    [[P_1:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[SUB93_2_LCSSA]], %[[LOOP_1_HEADER_LOOPEXIT]] ]
 ; CHECK-NEXT:    [[TMP0:%.*]] = mul i32 [[P_1]], -1
-; CHECK-NEXT:    br label [[LOOP_2:%.*]]
-; CHECK:       loop.2:
-; CHECK-NEXT:    [[INDUCTION_IV:%.*]] = phi i32 [ [[INDUCTION_IV_NEXT:%.*]], [[LOOP_2]] ], [ [[TMP0]], [[LOOP_1_HEADER]] ]
-; CHECK-NEXT:    [[IV_2:%.*]] = phi i32 [ [[P_1]], [[LOOP_1_HEADER]] ], [ [[ADD101:%.*]], [[LOOP_2]] ]
-; CHECK-NEXT:    [[IV_3:%.*]] = phi i32 [ 0, [[LOOP_1_HEADER]] ], [ [[SUB93:%.*]], [[LOOP_2]] ]
+; CHECK-NEXT:    br label %[[LOOP_2:.*]]
+; CHECK:       [[LOOP_2]]:
+; CHECK-NEXT:    [[INDUCTION_IV:%.*]] = phi i32 [ [[INDUCTION_IV_NEXT:%.*]], %[[LOOP_2]] ], [ [[TMP0]], %[[LOOP_1_HEADER]] ]
+; CHECK-NEXT:    [[IV_2:%.*]] = phi i32 [ [[P_1]], %[[LOOP_1_HEADER]] ], [ [[ADD101:%.*]], %[[LOOP_2]] ]
+; CHECK-NEXT:    [[IV_3:%.*]] = phi i32 [ 0, %[[LOOP_1_HEADER]] ], [ [[SUB93:%.*]], %[[LOOP_2]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = mul nuw nsw i32 [[IV_3]], -1
 ; CHECK-NEXT:    [[SUB93]] = add i32 [[IV_3]], 1
 ; CHECK-NEXT:    [[ADD101]] = add i32 [[IV_3]], [[IV_2]]
 ; CHECK-NEXT:    [[INDUCTION_IV_NEXT]] = add i32 [[INDUCTION_IV]], [[TMP1]]
-; CHECK-NEXT:    br i1 false, label [[LOOP_2]], label [[LOOP_3_PREHEADER:%.*]]
-; CHECK:       loop.3.preheader:
-; CHECK-NEXT:    [[INDUCTION_IV_LCSSA:%.*]] = phi i32 [ [[INDUCTION_IV]], [[LOOP_2]] ]
-; CHECK-NEXT:    [[IV_2_LCSSA:%.*]] = phi i32 [ [[IV_2]], [[LOOP_2]] ]
-; CHECK-NEXT:    br label [[VECTOR_PH:%.*]]
-; CHECK:       vector.ph:
+; CHECK-NEXT:    br i1 false, label %[[LOOP_2]], label %[[LOOP_3_PREHEADER:.*]]
+; CHECK:       [[LOOP_3_PREHEADER]]:
+; CHECK-NEXT:    [[INDUCTION_IV_LCSSA:%.*]] = phi i32 [ [[INDUCTION_IV]], %[[LOOP_2]] ]
+; CHECK-NEXT:    [[IV_2_LCSSA:%.*]] = phi i32 [ [[IV_2]], %[[LOOP_2]] ]
+; CHECK-NEXT:    br label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
 ; CHECK-NEXT:    [[IND_END:%.*]] = mul i32 196, [[INDUCTION_IV_LCSSA]]
-; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
-; CHECK:       vector.body:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[INDUCTION_IV_LCSSA]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP3:%.*]] = mul <4 x i32> <i32 0, i32 1, i32 2, i32 3>, [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add <4 x i32> zeroinitializer, [[TMP3]]
+; CHECK-NEXT:    [[TMP4:%.*]] = mul i32 [[INDUCTION_IV_LCSSA]], 4
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <4 x i32> poison, i32 [[TMP4]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT1]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i32> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = trunc i32 [[INDEX]] to i16
+; CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[DST]], i16 [[OFFSET_IDX]]
+; CHECK-NEXT:    store <4 x i32> [[VEC_IND]], ptr [[TMP5]], align 4
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <4 x i32> [[VEC_IND]], [[BROADCAST_SPLAT2]]
 ; CHECK-NEXT:    [[TMP2:%.*]] = icmp eq i32 [[INDEX_NEXT]], 196
-; CHECK-NEXT:    br i1 [[TMP2]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
-; CHECK:       middle.block:
-; CHECK-NEXT:    br label [[SCALAR_PH:%.*]]
-; CHECK:       scalar.ph:
-; CHECK-NEXT:    br label [[LOOP_3:%.*]]
-; CHECK:       loop.3:
-; CHECK-NEXT:    [[IV_4:%.*]] = phi i16 [ [[INC_1:%.*]], [[LOOP_3]] ], [ 196, [[SCALAR_PH]] ]
-; CHECK-NEXT:    [[IV_5:%.*]] = phi i32 [ [[SUB93_1:%.*]], [[LOOP_3]] ], [ [[IND_END]], [[SCALAR_PH]] ]
+; CHECK-NEXT:    br i1 [[TMP2]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[SCALAR_PH:.*]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    br label %[[LOOP_3:.*]]
+; CHECK:       [[LOOP_3]]:
+; CHECK-NEXT:    [[IV_4:%.*]] = phi i16 [ [[INC_1:%.*]], %[[LOOP_3]] ], [ 196, %[[SCALAR_PH]] ]
+; CHECK-NEXT:    [[IV_5:%.*]] = phi i32 [ [[SUB93_1:%.*]], %[[LOOP_3]] ], [ [[IND_END]], %[[SCALAR_PH]] ]
+; CHECK-NEXT:    [[GEP_DST:%.*]] = getelementptr inbounds i32, ptr [[DST]], i16 [[IV_4]]
+; CHECK-NEXT:    store i32 [[IV_5]], ptr [[GEP_DST]], align 4
 ; CHECK-NEXT:    [[SUB93_1]] = sub i32 [[IV_5]], [[IV_2_LCSSA]]
 ; CHECK-NEXT:    [[INC_1]] = add i16 [[IV_4]], 1
 ; CHECK-NEXT:    [[CMP88_1:%.*]] = icmp ult i16 [[IV_4]], 198
-; CHECK-NEXT:    br i1 [[CMP88_1]], label [[LOOP_3]], label [[LOOP_4_PREHEADER:%.*]], !llvm.loop [[LOOP5:![0-9]+]]
-; CHECK:       loop.4.preheader:
-; CHECK-NEXT:    [[IV_5_LCSSA:%.*]] = phi i32 [ [[IV_5]], [[LOOP_3]] ]
-; CHECK-NEXT:    br label [[LOOP_4]]
-; CHECK:       loop.4:
-; CHECK-NEXT:    [[IV_6:%.*]] = phi i32 [ [[SUB93_2]], [[LOOP_4]] ], [ 0, [[LOOP_4_PREHEADER]] ]
+; CHECK-NEXT:    br i1 [[CMP88_1]], label %[[LOOP_3]], label %[[LOOP_4_PREHEADER:.*]], !llvm.loop [[LOOP5:![0-9]+]]
+; CHECK:       [[LOOP_4_PREHEADER]]:
+; CHECK-NEXT:    [[IV_5_LCSSA:%.*]] = phi i32 [ [[IV_5]], %[[LOOP_3]] ]
+; CHECK-NEXT:    br label %[[LOOP_4]]
+; CHECK:       [[LOOP_4]]:
+; CHECK-NEXT:    [[IV_6:%.*]] = phi i32 [ [[SUB93_2]], %[[LOOP_4]] ], [ 0, %[[LOOP_4_PREHEADER]] ]
 ; CHECK-NEXT:    [[SUB93_2]] = sub i32 [[IV_6]], [[IV_5_LCSSA]]
-; CHECK-NEXT:    br i1 false, label [[LOOP_4]], label [[LOOP_1_HEADER_LOOPEXIT]]
+; CHECK-NEXT:    br i1 false, label %[[LOOP_4]], label %[[LOOP_1_HEADER_LOOPEXIT]]
 ;
 entry:
   br label %loop.1.header
@@ -157,6 +318,8 @@ loop.2:
 loop.3:
   %iv.4 = phi i16 [ 0, %loop.2 ], [ %inc.1, %loop.3 ]
   %iv.5 = phi i32 [ 0, %loop.2 ], [ %sub93.1, %loop.3 ]
+  %gep.dst = getelementptr inbounds i32, ptr %dst, i16 %iv.4
+  store i32 %iv.5, ptr %gep.dst
   %sub93.1 = sub i32 %iv.5, %iv.2
   %inc.1 = add i16 %iv.4, 1
   %cmp88.1 = icmp ult i16 %iv.4, 198
@@ -168,53 +331,130 @@ loop.4:
   br i1 false, label %loop.4, label %loop.1.header
 }
 
-define void @test3_pr58811() {
-; CHECK-LABEL: @test3_pr58811(
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    br label [[LOOP_1_HEADER:%.*]]
-; CHECK:       loop.1.header:
-; CHECK-NEXT:    [[P_1:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[SUB93_2:%.*]], [[LOOP_1_LATCH:%.*]] ]
+define void @test3_pr58811(ptr %dst) {
+; VF2-LABEL: define void @test3_pr58811(
+; VF2-SAME: ptr [[DST:%.*]]) {
+; VF2-NEXT:  [[ENTRY:.*]]:
+; VF2-NEXT:    br label %[[LOOP_1_HEADER:.*]]
+; VF2:       [[LOOP_1_HEADER]]:
+; VF2-NEXT:    [[P_1:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[SUB93_2:%.*]], %[[LOOP_1_LATCH:.*]] ]
+; VF2-NEXT:    [[REM85:%.*]] = urem i32 1, [[P_1]]
+; VF2-NEXT:    br label %[[LOOP_2:.*]]
+; VF2:       [[LOOP_2]]:
+; VF2-NEXT:    [[P_2:%.*]] = phi i32 [ 1, %[[LOOP_1_HEADER]] ], [ 0, %[[LOOP_2]] ]
+; VF2-NEXT:    [[ADD101:%.*]] = add i32 [[REM85]], [[P_2]]
+; VF2-NEXT:    br i1 false, label %[[LOOP_2]], label %[[LOOP_3_PREHEADER:.*]]
+; VF2:       [[LOOP_3_PREHEADER]]:
+; VF2-NEXT:    [[ADD101_LCSSA:%.*]] = phi i32 [ [[ADD101]], %[[LOOP_2]] ]
+; VF2-NEXT:    [[TMP0:%.*]] = udiv i32 1, [[P_1]]
+; VF2-NEXT:    [[TMP1:%.*]] = mul nuw i32 [[P_1]], [[TMP0]]
+; VF2-NEXT:    [[TMP2:%.*]] = add i32 [[TMP1]], -1
+; VF2-NEXT:    [[TMP3:%.*]] = sub i32 [[TMP2]], [[P_2]]
+; VF2-NEXT:    br label %[[VECTOR_PH:.*]]
+; VF2:       [[VECTOR_PH]]:
+; VF2-NEXT:    [[TMP15:%.*]] = mul i32 198, [[TMP3]]
+; VF2-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i32> poison, i32 [[TMP3]], i64 0
+; VF2-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i32> [[BROADCAST_SPLATINSERT]], <2 x i32> poison, <2 x i32> zeroinitializer
+; VF2-NEXT:    [[TMP5:%.*]] = mul <2 x i32> <i32 0, i32 1>, [[BROADCAST_SPLAT]]
+; VF2-NEXT:    [[INDUCTION:%.*]] = add <2 x i32> zeroinitializer, [[TMP5]]
+; VF2-NEXT:    [[TMP6:%.*]] = mul i32 [[TMP3]], 2
+; VF2-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <2 x i32> poison, i32 [[TMP6]], i64 0
+; VF2-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <2 x i32> [[BROADCAST_SPLATINSERT1]], <2 x i32> poison, <2 x i32> zeroinitializer
+; VF2-NEXT:    br label %[[VECTOR_BODY:.*]]
+; VF2:       [[VECTOR_BODY]]:
+; VF2-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; VF2-NEXT:    [[VEC_IND:%.*]] = phi <2 x i32> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; VF2-NEXT:    [[OFFSET_IDX:%.*]] = trunc i32 [[INDEX]] to i16
+; VF2-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[DST]], i16 [[OFFSET_IDX]]
+; VF2-NEXT:    store <2 x i32> [[VEC_IND]], ptr [[TMP7]], align 4
+; VF2-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
+; VF2-NEXT:    [[VEC_IND_NEXT]] = add <2 x i32> [[VEC_IND]], [[BROADCAST_SPLAT2]]
+; VF2-NEXT:    [[TMP22:%.*]] = icmp eq i32 [[INDEX_NEXT]], 198
+; VF2-NEXT:    br i1 [[TMP22]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; VF2:       [[MIDDLE_BLOCK]]:
+; VF2-NEXT:    br label %[[SCALAR_PH:.*]]
+; VF2:       [[SCALAR_PH]]:
+; VF2-NEXT:    br label %[[LOOP_3:.*]]
+; VF2:       [[LOOP_3]]:
+; VF2-NEXT:    [[IV_3:%.*]] = phi i16 [ [[INC_1:%.*]], %[[LOOP_3]] ], [ 198, %[[SCALAR_PH]] ]
+; VF2-NEXT:    [[IV_4:%.*]] = phi i32 [ [[SUB93_1:%.*]], %[[LOOP_3]] ], [ [[TMP15]], %[[SCALAR_PH]] ]
+; VF2-NEXT:    [[GEP_DST:%.*]] = getelementptr inbounds i32, ptr [[DST]], i16 [[IV_3]]
+; VF2-NEXT:    store i32 [[IV_4]], ptr [[GEP_DST]], align 4
+; VF2-NEXT:    [[SUB93_1]] = sub i32 [[IV_4]], [[ADD101_LCSSA]]
+; VF2-NEXT:    [[INC_1]] = add i16 [[IV_3]], 1
+; VF2-NEXT:    [[CMP88_1:%.*]] = icmp ult i16 [[IV_3]], 198
+; VF2-NEXT:    br i1 [[CMP88_1]], label %[[LOOP_3]], label %[[LOOP_4_PREHEADER:.*]], !llvm.loop [[LOOP7:![0-9]+]]
+; VF2:       [[LOOP_4_PREHEADER]]:
+; VF2-NEXT:    [[IV_4_LCSSA:%.*]] = phi i32 [ [[IV_4]], %[[LOOP_3]] ]
+; VF2-NEXT:    br label %[[LOOP_4:.*]]
+; VF2:       [[LOOP_4]]:
+; VF2-NEXT:    [[IV_5:%.*]] = phi i32 [ [[SUB93_2]], %[[LOOP_4]] ], [ 0, %[[LOOP_4_PREHEADER]] ]
+; VF2-NEXT:    [[SUB93_2]] = sub i32 [[IV_5]], [[IV_4_LCSSA]]
+; VF2-NEXT:    br label %[[LOOP_4]]
+; VF2:       [[LOOP_1_LATCH]]:
+; VF2-NEXT:    br label %[[LOOP_1_HEADER]]
+;
+; CHECK-LABEL: define void @test3_pr58811(
+; CHECK-SAME: ptr [[DST:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP_1_HEADER:.*]]
+; CHECK:       [[LOOP_1_HEADER]]:
+; CHECK-NEXT:    [[P_1:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[SUB93_2:%.*]], %[[LOOP_1_LATCH:.*]] ]
 ; CHECK-NEXT:    [[REM85:%.*]] = urem i32 1, [[P_1]]
-; CHECK-NEXT:    br label [[LOOP_2:%.*]]
-; CHECK:       loop.2:
-; CHECK-NEXT:    [[P_2:%.*]] = phi i32 [ 1, [[LOOP_1_HEADER]] ], [ 0, [[LOOP_2]] ]
+; CHECK-NEXT:    br label %[[LOOP_2:.*]]
+; CHECK:       [[LOOP_2]]:
+; CHECK-NEXT:    [[P_2:%.*]] = phi i32 [ 1, %[[LOOP_1_HEADER]] ], [ 0, %[[LOOP_2]] ]
 ; CHECK-NEXT:    [[ADD101:%.*]] = add i32 [[REM85]], [[P_2]]
-; CHECK-NEXT:    br i1 false, label [[LOOP_2]], label [[LOOP_3_PREHEADER:%.*]]
-; CHECK:       loop.3.preheader:
-; CHECK-NEXT:    [[ADD101_LCSSA:%.*]] = phi i32 [ [[ADD101]], [[LOOP_2]] ]
+; CHECK-NEXT:    br i1 false, label %[[LOOP_2]], label %[[LOOP_3_PREHEADER:.*]]
+; CHECK:       [[LOOP_3_PREHEADER]]:
+; CHECK-NEXT:    [[ADD101_LCSSA:%.*]] = phi i32 [ [[ADD101]], %[[LOOP_2]] ]
 ; CHECK-NEXT:    [[TMP0:%.*]] = udiv i32 1, [[P_1]]
 ; CHECK-NEXT:    [[TMP1:%.*]] = mul nuw i32 [[P_1]], [[TMP0]]
 ; CHECK-NEXT:    [[TMP2:%.*]] = add i32 [[TMP1]], -1
 ; CHECK-NEXT:    [[TMP3:%.*]] = sub i32 [[TMP2]], [[P_2]]
-; CHECK-NEXT:    br label [[VECTOR_PH:%.*]]
-; CHECK:       vector.ph:
+; CHECK-NEXT:    br label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
 ; CHECK-NEXT:    [[IND_END:%.*]] = mul i32 196, [[TMP3]]
-; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
-; CHECK:       vector.body:
-; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[TMP3]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP5:%.*]] = mul <4 x i32> <i32 0, i32 1, i32 2, i32 3>, [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add <4 x i32> zeroinitializer, [[TMP5]]
+; CHECK-NEXT:    [[TMP6:%.*]] = mul i32 [[TMP3]], 4
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <4 x i32> poison, i32 [[TMP6]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT1]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i32> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = trunc i32 [[INDEX]] to i16
+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[DST]], i16 [[OFFSET_IDX]]
+; CHECK-NEXT:    store <4 x i32> [[VEC_IND]], ptr [[TMP7]], align 4
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <4 x i32> [[VEC_IND]], [[BROADCAST_SPLAT2]]
 ; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq i32 [[INDEX_NEXT]], 196
-; CHECK-NEXT:    br i1 [[TMP4]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
-; CHECK:       middle.block:
-; CHECK-NEXT:    br label [[SCALAR_PH:%.*]]
-; CHECK:       scalar.ph:
-; CHECK-NEXT:    br label [[LOOP_3:%.*]]
-; CHECK:       loop.3:
-; CHECK-NEXT:    [[IV_3:%.*]] = phi i16 [ [[INC_1:%.*]], [[LOOP_3]] ], [ 196, [[SCALAR_PH]] ]
-; CHECK-NEXT:    [[IV_4:%.*]] = phi i32 [ [[SUB93_1:%.*]], [[LOOP_3]] ], [ [[IND_END]], [[SCALAR_PH]] ]
+; CHECK-NEXT:    br i1 [[TMP4]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[SCALAR_PH:.*]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    br label %[[LOOP_3:.*]]
+; CHECK:       [[LOOP_3]]:
+; CHECK-NEXT:    [[IV_3:%.*]] = phi i16 [ [[INC_1:%.*]], %[[LOOP_3]] ], [ 196, %[[SCALAR_PH]] ]
+; CHECK-NEXT:    [[IV_4:%.*]] = phi i32 [ [[SUB93_1:%.*]], %[[LOOP_3]] ], [ [[IND_END]], %[[SCALAR_PH]] ]
+; CHECK-NEXT:    [[GEP_DST:%.*]] = getelementptr inbounds i32, ptr [[DST]], i16 [[IV_3]]
+; CHECK-NEXT:    store i32 [[IV_4]], ptr [[GEP_DST]], align 4
 ; CHECK-NEXT:    [[SUB93_1]] = sub i32 [[IV_4]], [[ADD101_LCSSA]]
 ; CHECK-NEXT:    [[INC_1]] = add i16 [[IV_3]], 1
 ; CHECK-NEXT:    [[CMP88_1:%.*]] = icmp ult i16 [[IV_3]], 198
-; CHECK-NEXT:    br i1 [[CMP88_1]], label [[LOOP_3]], label [[LOOP_4_PREHEADER:%.*]], !llvm.loop [[LOOP7:![0-9]+]]
-; CHECK:       loop.4.preheader:
-; CHECK-NEXT:    [[IV_4_LCSSA:%.*]] = phi i32 [ [[IV_4]], [[LOOP_3]] ]
-; CHECK-NEXT:    br label [[LOOP_4:%.*]]
-; CHECK:       loop.4:
-; CHECK-NEXT:    [[IV_5:%.*]] = phi i32 [ [[SUB93_2]], [[LOOP_4]] ], [ 0, [[LOOP_4_PREHEADER]] ]
+; CHECK-NEXT:    br i1 [[CMP88_1]], label %[[LOOP_3]], label %[[LOOP_4_PREHEADER:.*]], !llvm.loop [[LOOP7:![0-9]+]]
+; CHECK:       [[LOOP_4_PREHEADER]]:
+; CHECK-NEXT:    [[IV_4_LCSSA:%.*]] = phi i32 [ [[IV_4]], %[[LOOP_3]] ]
+; CHECK-NEXT:    br label %[[LOOP_4:.*]]
+; CHECK:       [[LOOP_4]]:
+; CHECK-NEXT:    [[IV_5:%.*]] = phi i32 [ [[SUB93_2]], %[[LOOP_4]] ], [ 0, %[[LOOP_4_PREHEADER]] ]
 ; CHECK-NEXT:    [[SUB93_2]] = sub i32 [[IV_5]], [[IV_4_LCSSA]]
-; CHECK-NEXT:    br label [[LOOP_4]]
-; CHECK:       loop.1.latch:
-; CHECK-NEXT:    br label [[LOOP_1_HEADER]]
+; CHECK-NEXT:    br label %[[LOOP_4]]
+; CHECK:       [[LOOP_1_LATCH]]:
+; CHECK-NEXT:    br label %[[LOOP_1_HEADER]]
 ;
 entry:
   br label %loop.1.header
@@ -232,6 +472,8 @@ loop.2:
 loop.3:
   %iv.3 = phi i16 [ 0, %loop.2 ], [ %inc.1, %loop.3 ]
   %iv.4 = phi i32 [ 0, %loop.2 ], [ %sub93.1, %loop.3 ]
+  %gep.dst = getelementptr inbounds i32, ptr %dst, i16 %iv.3
+  store i32 %iv.4, ptr %gep.dst
   %sub93.1 = sub i32 %iv.4, %add101
   %inc.1 = add i16 %iv.3, 1
   %cmp88.1 = icmp ult i16 %iv.3, 198
@@ -245,3 +487,110 @@ loop.4:
 loop.1.latch:                                 ; No predecessors!
   br label %loop.1.header
 }
+
+define void @iv_start_from_shl_of_previous_iv(ptr %dst) {
+; VF2-LABEL: define void @iv_start_from_shl_of_previous_iv(
+; VF2-SAME: ptr [[DST:%.*]]) {
+; VF2-NEXT:  [[ENTRY:.*:]]
+; VF2-NEXT:    br label %[[VECTOR_PH:.*]]
+; VF2:       [[VECTOR_PH]]:
+; VF2-NEXT:    br label %[[VECTOR_BODY:.*]]
+; VF2:       [[VECTOR_BODY]]:
+; VF2-NEXT:    store <2 x i8> zeroinitializer, ptr [[DST]], align 1
+; VF2-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; VF2:       [[MIDDLE_BLOCK]]:
+; VF2-NEXT:    br label %[[LOOP_1_EXIT:.*]]
+; VF2:       [[LOOP_1_EXIT]]:
+; VF2-NEXT:    [[IV_1_SHL:%.*]] = shl i64 1, 1
+; VF2-NEXT:    br label %[[VECTOR_PH1:.*]]
+; VF2:       [[VECTOR_PH1]]:
+; VF2-NEXT:    [[TMP0:%.*]] = add i64 [[IV_1_SHL]], 98
+; VF2-NEXT:    br label %[[VECTOR_BODY2:.*]]
+; VF2:       [[VECTOR_BODY2]]:
+; VF2-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH1]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY2]] ]
+; VF2-NEXT:    [[OFFSET_IDX:%.*]] = add i64 [[IV_1_SHL]], [[INDEX]]
+; VF2-NEXT:    [[TMP1:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX]]
+; VF2-NEXT:    store <2 x i8> splat (i8 1), ptr [[TMP1]], align 1
+; VF2-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
+; VF2-NEXT:    [[TMP2:%.*]] = icmp eq i64 [[INDEX_NEXT]], 98
+; VF2-NEXT:    br i1 [[TMP2]], label %[[MIDDLE_BLOCK3:.*]], label %[[VECTOR_BODY2]], !llvm.loop [[LOOP8:![0-9]+]]
+; VF2:       [[MIDDLE_BLOCK3]]:
+; VF2-NEXT:    br label %[[SCALAR_PH:.*]]
+; VF2:       [[SCALAR_PH]]:
+; VF2-NEXT:    br label %[[LOOP_2:.*]]
+; VF2:       [[LOOP_2]]:
+; VF2-NEXT:    [[IV_2:%.*]] = phi i64 [ [[TMP0]], %[[SCALAR_PH]] ], [ [[IV_2_NEXT:%.*]], %[[LOOP_2]] ]
+; VF2-NEXT:    [[GEP_2:%.*]] = getelementptr i8, ptr [[DST]], i64 [[IV_2]]
+; VF2-NEXT:    store i8 1, ptr [[GEP_2]], align 1
+; VF2-NEXT:    [[IV_2_NEXT]] = add i64 [[IV_2]], 1
+; VF2-NEXT:    [[CMP_2:%.*]] = icmp eq i64 [[IV_2]], 100
+; VF2-NEXT:    br i1 [[CMP_2]], label %[[EXIT:.*]], label %[[LOOP_2]], !llvm.loop [[LOOP9:![0-9]+]]
+; VF2:       [[EXIT]]:
+; VF2-NEXT:    ret void
+;
+; CHECK-LABEL: define void @iv_start_from_shl_of_previous_iv(
+; CHECK-SAME: ptr [[DST:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP_1:.*]]
+; CHECK:       [[LOOP_1]]:
+; CHECK-NEXT:    [[IV_1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_1_NEXT:%.*]], %[[LOOP_1]] ]
+; CHECK-NEXT:    [[GEP_1:%.*]] = getelementptr i8, ptr [[DST]], i64 [[IV_1]]
+; CHECK-NEXT:    store i8 0, ptr [[GEP_1]], align 1
+; CHECK-NEXT:    [[IV_1_NEXT]] = add i64 [[IV_1]], 1
+; CHECK-NEXT:    [[CMP_1:%.*]] = icmp eq i64 [[IV_1]], 0
+; CHECK-NEXT:    br i1 [[CMP_1]], label %[[LOOP_1]], label %[[LOOP_1_EXIT:.*]]
+; CHECK:       [[LOOP_1_EXIT]]:
+; CHECK-NEXT:    [[IV_1_LCSSA:%.*]] = phi i64 [ [[IV_1]], %[[LOOP_1]] ]
+; CHECK-NEXT:    [[IV_1_SHL:%.*]] = shl i64 [[IV_1_LCSSA]], 1
+; CHECK-NEXT:    br label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[TMP2:%.*]] = add i64 [[IV_1_SHL]], 96
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = add i64 [[IV_1_SHL]], [[INDEX]]
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr i8, ptr [[DST]], i64 [[OFFSET_IDX]]
+; CHECK-NEXT:    store <4 x i8> splat (i8 1), ptr [[TMP0]], align 1
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp eq i64 [[INDEX_NEXT]], 96
+; CHECK-NEXT:    br i1 [[TMP1]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[SCALAR_PH:.*]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    br label %[[LOOP_2:.*]]
+; CHECK:       [[LOOP_2]]:
+; CHECK-NEXT:    [[IV_2:%.*]] = phi i64 [ [[TMP2]], %[[SCALAR_PH]] ], [ [[IV_2_NEXT:%.*]], %[[LOOP_2]] ]
+; CHECK-NEXT:    [[GEP_2:%.*]] = getelementptr i8, ptr [[DST]], i64 [[IV_2]]
+; CHECK-NEXT:    store i8 1, ptr [[GEP_2]], align 1
+; CHECK-NEXT:    [[IV_2_NEXT]] = add i64 [[IV_2]], 1
+; CHECK-NEXT:    [[CMP_2:%.*]] = icmp eq i64 [[IV_2]], 100
+; CHECK-NEXT:    br i1 [[CMP_2]], label %[[EXIT:.*]], label %[[LOOP_2]], !llvm.loop [[LOOP9:![0-9]+]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop.1
+
+loop.1:
+  %iv.1 = phi i64 [ 0, %entry ], [ %iv.1.next, %loop.1 ]
+  %gep.1 = getelementptr i8, ptr %dst, i64 %iv.1
+  store i8 0, ptr %gep.1, align 1
+  %iv.1.next = add i64 %iv.1, 1
+  %cmp.1 = icmp eq i64 %iv.1, 0
+  br i1 %cmp.1, label %loop.1, label %loop.1.exit
+
+loop.1.exit:
+  %iv.1.shl = shl i64 %iv.1, 1
+  br label %loop.2
+
+loop.2:
+  %iv.2 = phi i64 [ %iv.1.shl, %loop.1.exit ], [ %iv.2.next, %loop.2 ]
+  %gep.2 = getelementptr i8, ptr %dst, i64 %iv.2
+  store i8 1, ptr %gep.2, align 1
+  %iv.2.next = add i64 %iv.2, 1
+  %cmp.2 = icmp eq i64 %iv.2, 100
+  br i1 %cmp.2, label %exit, label %loop.2
+
+exit:
+  ret void
+}
diff --git a/llvm/test/Transforms/LoopVectorize/select-index-interleaving.ll b/llvm/test/Transforms/LoopVectorize/select-index-interleaving.ll
index 9d97c7f9cec09..1638360d08e99 100644
--- a/llvm/test/Transforms/LoopVectorize/select-index-interleaving.ll
+++ b/llvm/test/Transforms/LoopVectorize/select-index-interleaving.ll
@@ -47,11 +47,58 @@ define i64 @test_vectorize_select_umin_last_idx(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_umin_last_idx(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 8
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 8
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP7:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP8:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI2:%.*]] = phi <4 x i64> [ splat (i64 50), %[[VECTOR_PH]] ], [ [[TMP5:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI3:%.*]] = phi <4 x i64> [ splat (i64 50), %[[VECTOR_PH]] ], [ [[TMP6:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[STEP_ADD:%.*]] = add <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr i64, ptr [[GEP]], i64 4
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <4 x i64>, ptr [[TMP2]], align 4
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp uge <4 x i64> [[VEC_PHI2]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp uge <4 x i64> [[VEC_PHI3]], [[WIDE_LOAD4]]
+; CHECK-NEXT:    [[TMP5]] = call <4 x i64> @llvm.umin.v4i64(<4 x i64> [[VEC_PHI2]], <4 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP6]] = call <4 x i64> @llvm.umin.v4i64(<4 x i64> [[VEC_PHI3]], <4 x i64> [[WIDE_LOAD4]])
+; CHECK-NEXT:    [[TMP7]] = select <4 x i1> [[TMP3]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[TMP8]] = select <4 x i1> [[TMP4]], <4 x i64> [[STEP_ADD]], <4 x i64> [[VEC_PHI1]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV]], 8
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[STEP_ADD]], splat (i64 4)
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP9]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[RDX_MINMAX:%.*]] = call <4 x i64> @llvm.umin.v4i64(<4 x i64> [[TMP5]], <4 x i64> [[TMP6]])
+; CHECK-NEXT:    [[TMP10:%.*]] = call i64 @llvm.vector.reduce.umin.v4i64(<4 x i64> [[RDX_MINMAX]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP10]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP11:%.*]] = icmp eq <4 x i64> [[TMP5]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq <4 x i64> [[TMP6]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP13:%.*]] = select <4 x i1> [[TMP11]], <4 x i64> [[TMP7]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP14:%.*]] = select <4 x i1> [[TMP12]], <4 x i64> [[TMP8]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[RDX_MINMAX5:%.*]] = call <4 x i64> @llvm.smax.v4i64(<4 x i64> [[TMP13]], <4 x i64> [[TMP14]])
+; CHECK-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[RDX_MINMAX5]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP15]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP15]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX6:%.*]] = phi i64 [ [[TMP10]], %[[MIDDLE_BLOCK]] ], [ 50, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 50, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX6]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp uge i64 [[MIN_VAL]], [[L]]
@@ -59,9 +106,9 @@ define i64 @test_vectorize_select_umin_last_idx(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -131,11 +178,58 @@ define i64 @test_vectorize_select_smin_last_idx(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_smin_last_idx(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 8
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 8
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP7:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP8:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI2:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP5:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI3:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP6:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[STEP_ADD:%.*]] = add <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr i64, ptr [[GEP]], i64 4
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <4 x i64>, ptr [[TMP2]], align 4
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp sge <4 x i64> [[VEC_PHI2]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp sge <4 x i64> [[VEC_PHI3]], [[WIDE_LOAD4]]
+; CHECK-NEXT:    [[TMP5]] = call <4 x i64> @llvm.smin.v4i64(<4 x i64> [[VEC_PHI2]], <4 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP6]] = call <4 x i64> @llvm.smin.v4i64(<4 x i64> [[VEC_PHI3]], <4 x i64> [[WIDE_LOAD4]])
+; CHECK-NEXT:    [[TMP7]] = select <4 x i1> [[TMP3]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[TMP8]] = select <4 x i1> [[TMP4]], <4 x i64> [[STEP_ADD]], <4 x i64> [[VEC_PHI1]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV]], 8
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[STEP_ADD]], splat (i64 4)
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP9]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[RDX_MINMAX:%.*]] = call <4 x i64> @llvm.smin.v4i64(<4 x i64> [[TMP5]], <4 x i64> [[TMP6]])
+; CHECK-NEXT:    [[TMP10:%.*]] = call i64 @llvm.vector.reduce.smin.v4i64(<4 x i64> [[RDX_MINMAX]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP10]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP11:%.*]] = icmp eq <4 x i64> [[TMP5]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq <4 x i64> [[TMP6]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP13:%.*]] = select <4 x i1> [[TMP11]], <4 x i64> [[TMP7]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP14:%.*]] = select <4 x i1> [[TMP12]], <4 x i64> [[TMP8]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[RDX_MINMAX5:%.*]] = call <4 x i64> @llvm.smax.v4i64(<4 x i64> [[TMP13]], <4 x i64> [[TMP14]])
+; CHECK-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[RDX_MINMAX5]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP15]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP15]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX6:%.*]] = phi i64 [ [[TMP10]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX6]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp sge i64 [[MIN_VAL]], [[L]]
@@ -143,9 +237,9 @@ define i64 @test_vectorize_select_smin_last_idx(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP5:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -215,11 +309,58 @@ define i64 @test_vectorize_select_umax_last_idx(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_umax_last_idx(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 8
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 8
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP7:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP8:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI2:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP5:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI3:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP6:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[STEP_ADD:%.*]] = add <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr i64, ptr [[GEP]], i64 4
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <4 x i64>, ptr [[TMP2]], align 4
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp ule <4 x i64> [[VEC_PHI2]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp ule <4 x i64> [[VEC_PHI3]], [[WIDE_LOAD4]]
+; CHECK-NEXT:    [[TMP5]] = call <4 x i64> @llvm.umax.v4i64(<4 x i64> [[VEC_PHI2]], <4 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP6]] = call <4 x i64> @llvm.umax.v4i64(<4 x i64> [[VEC_PHI3]], <4 x i64> [[WIDE_LOAD4]])
+; CHECK-NEXT:    [[TMP7]] = select <4 x i1> [[TMP3]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[TMP8]] = select <4 x i1> [[TMP4]], <4 x i64> [[STEP_ADD]], <4 x i64> [[VEC_PHI1]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV]], 8
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[STEP_ADD]], splat (i64 4)
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP9]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[RDX_MINMAX:%.*]] = call <4 x i64> @llvm.umax.v4i64(<4 x i64> [[TMP5]], <4 x i64> [[TMP6]])
+; CHECK-NEXT:    [[TMP10:%.*]] = call i64 @llvm.vector.reduce.umax.v4i64(<4 x i64> [[RDX_MINMAX]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP10]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP11:%.*]] = icmp eq <4 x i64> [[TMP5]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq <4 x i64> [[TMP6]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP13:%.*]] = select <4 x i1> [[TMP11]], <4 x i64> [[TMP7]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP14:%.*]] = select <4 x i1> [[TMP12]], <4 x i64> [[TMP8]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[RDX_MINMAX5:%.*]] = call <4 x i64> @llvm.smax.v4i64(<4 x i64> [[TMP13]], <4 x i64> [[TMP14]])
+; CHECK-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[RDX_MINMAX5]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP15]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP15]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX6:%.*]] = phi i64 [ [[TMP10]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX6]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ule i64 [[MIN_VAL]], [[L]]
@@ -227,9 +368,9 @@ define i64 @test_vectorize_select_umax_last_idx(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP7:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -299,11 +440,58 @@ define i64 @test_vectorize_select_smax_last_idx(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_smax_last_idx(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 8
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 8
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP7:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP8:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI2:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP5:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI3:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP6:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[STEP_ADD:%.*]] = add <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr i64, ptr [[GEP]], i64 4
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <4 x i64>, ptr [[TMP2]], align 4
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp sle <4 x i64> [[VEC_PHI2]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp sle <4 x i64> [[VEC_PHI3]], [[WIDE_LOAD4]]
+; CHECK-NEXT:    [[TMP5]] = call <4 x i64> @llvm.smax.v4i64(<4 x i64> [[VEC_PHI2]], <4 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP6]] = call <4 x i64> @llvm.smax.v4i64(<4 x i64> [[VEC_PHI3]], <4 x i64> [[WIDE_LOAD4]])
+; CHECK-NEXT:    [[TMP7]] = select <4 x i1> [[TMP3]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[TMP8]] = select <4 x i1> [[TMP4]], <4 x i64> [[STEP_ADD]], <4 x i64> [[VEC_PHI1]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV]], 8
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[STEP_ADD]], splat (i64 4)
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP9]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[RDX_MINMAX:%.*]] = call <4 x i64> @llvm.smax.v4i64(<4 x i64> [[TMP5]], <4 x i64> [[TMP6]])
+; CHECK-NEXT:    [[TMP10:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[RDX_MINMAX]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP10]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP11:%.*]] = icmp eq <4 x i64> [[TMP5]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq <4 x i64> [[TMP6]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP13:%.*]] = select <4 x i1> [[TMP11]], <4 x i64> [[TMP7]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP14:%.*]] = select <4 x i1> [[TMP12]], <4 x i64> [[TMP8]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[RDX_MINMAX5:%.*]] = call <4 x i64> @llvm.smax.v4i64(<4 x i64> [[TMP13]], <4 x i64> [[TMP14]])
+; CHECK-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[RDX_MINMAX5]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP15]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP15]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX6:%.*]] = phi i64 [ [[TMP10]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX6]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp sle i64 [[MIN_VAL]], [[L]]
@@ -311,9 +499,9 @@ define i64 @test_vectorize_select_smax_last_idx(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP9:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
diff --git a/llvm/test/Transforms/LoopVectorize/select-smax-last-index.ll b/llvm/test/Transforms/LoopVectorize/select-smax-last-index.ll
index 0e27efd788fd6..2117704950b84 100644
--- a/llvm/test/Transforms/LoopVectorize/select-smax-last-index.ll
+++ b/llvm/test/Transforms/LoopVectorize/select-smax-last-index.ll
@@ -5,11 +5,46 @@ define i64 @test_vectorize_select_smax_idx(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_smax_idx(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP4:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP1]], align 4
+; CHECK-NEXT:    [[TMP2:%.*]] = icmp sle <4 x i64> [[VEC_PHI1]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP3]] = call <4 x i64> @llvm.smax.v4i64(<4 x i64> [[VEC_PHI1]], <4 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP4]] = select <4 x i1> [[TMP2]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV1]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[TMP6:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP3]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP6]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq <4 x i64> [[TMP3]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP8:%.*]] = select <4 x i1> [[TMP7]], <4 x i64> [[TMP4]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP9:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP8]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP9]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP9]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX2:%.*]] = phi i64 [ [[TMP6]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX2]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp sle i64 [[MIN_VAL]], [[L]]
@@ -17,9 +52,9 @@ define i64 @test_vectorize_select_smax_idx(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MAX_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV]], i64 [[MAX_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -47,11 +82,46 @@ define i64 @test_vectorize_select_smax_idx_cond_flipped(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_smax_idx_cond_flipped(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP4:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[TMP2:%.*]] = icmp sge <4 x i64> [[WIDE_LOAD]], [[VEC_PHI1]]
+; CHECK-NEXT:    [[TMP3]] = call <4 x i64> @llvm.smax.v4i64(<4 x i64> [[VEC_PHI1]], <4 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP4]] = select <4 x i1> [[TMP2]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[TMP6:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP3]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP6]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq <4 x i64> [[TMP3]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP8:%.*]] = select <4 x i1> [[TMP7]], <4 x i64> [[TMP4]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP9:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP8]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP9]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP9]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX2:%.*]] = phi i64 [ [[TMP6]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX2]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp sge i64 [[L]], [[MIN_VAL]]
@@ -59,9 +129,9 @@ define i64 @test_vectorize_select_smax_idx_cond_flipped(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MAX_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MAX_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP5:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -219,11 +289,46 @@ define i64 @test_vectorize_select_smax_idx_min_ops_switched(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_smax_idx_min_ops_switched(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP4:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP1]], align 4
+; CHECK-NEXT:    [[TMP2:%.*]] = icmp sle <4 x i64> [[VEC_PHI1]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP3]] = call <4 x i64> @llvm.smax.v4i64(<4 x i64> [[WIDE_LOAD]], <4 x i64> [[VEC_PHI1]])
+; CHECK-NEXT:    [[TMP4]] = select <4 x i1> [[TMP2]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV1]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[TMP6:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP3]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP6]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq <4 x i64> [[TMP3]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP8:%.*]] = select <4 x i1> [[TMP7]], <4 x i64> [[TMP4]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP9:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP8]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP9]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP9]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX2:%.*]] = phi i64 [ [[TMP6]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX2]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp sle i64 [[MIN_VAL]], [[L]]
@@ -231,9 +336,9 @@ define i64 @test_vectorize_select_smax_idx_min_ops_switched(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MAX_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV]], i64 [[MAX_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP7:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -551,5 +656,47 @@ exit:
   ret i64 %res
 }
 
+define i64 @test_vectorize_select_smax_idx_inc(ptr %src, i64 %n) {
+; CHECK-LABEL: define i64 @test_vectorize_select_smax_idx_inc(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MAX_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MAX_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp sle i64 [[MAX_VAL]], [[L]]
+; CHECK-NEXT:    [[MAX_VAL_NEXT]] = tail call i64 @llvm.smax.i64(i64 [[MAX_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[MAX_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV_NEXT]], i64 [[MAX_IDX]]
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i64 [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %max.idx = phi i64 [ 0, %entry ], [ %max.idx.next, %loop ]
+  %max.val = phi i64 [ 0, %entry ], [ %max.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp sle i64 %max.val, %l
+  %max.val.next = tail call i64 @llvm.smax.i64(i64 %max.val, i64 %l)
+  %iv.next = add nuw nsw i64 %iv, 1
+  %max.idx.next = select i1 %cmp, i64 %iv.next, i64 %max.idx
+  %exitcond.not = icmp eq i64 %iv.next, %n
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i64 [ %max.idx.next, %loop ]
+  ret i64 %res
+}
+
 declare i64 @llvm.smax.i64(i64, i64)
 declare i16 @llvm.smax.i16(i16, i16)
diff --git a/llvm/test/Transforms/LoopVectorize/select-smin-first-index.ll b/llvm/test/Transforms/LoopVectorize/select-smin-first-index.ll
new file mode 100644
index 0000000000000..49d6ac548c330
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/select-smin-first-index.ll
@@ -0,0 +1,262 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 5
+; RUN: opt -passes=loop-vectorize -force-vector-width=4 -force-vector-interleave=1 -S %s | FileCheck %s
+
+; Test cases for selecting the first index with the minimum value.
+
+define i64 @test_vectorize_select_smin_first_idx(ptr %src, i64 %n) {
+; CHECK-LABEL: define i64 @test_vectorize_select_smin_first_idx(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp sgt i64 [[MIN_VAL]], [[L]]
+; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i64 @llvm.smin.i64(i64 [[MIN_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MIN_IDX]]
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i64 [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
+  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp sgt i64 %min.val, %l
+  %min.val.next = tail call i64 @llvm.smin.i64(i64 %min.val, i64 %l)
+  %min.idx.next = select i1 %cmp, i64 %iv, i64 %min.idx
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, %n
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i64 [ %min.idx.next, %loop ]
+  ret i64 %res
+}
+
+define i64 @test_vectorize_select_smin_first_idx_signed_sentinel_possible(ptr %src, i64 %n) {
+; CHECK-LABEL: define i64 @test_vectorize_select_smin_first_idx_signed_sentinel_possible(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[INDEX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[INDEX]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[TMP0]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp sgt i64 [[MIN_VAL]], [[L]]
+; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i64 @llvm.smin.i64(i64 [[MIN_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[INDEX]], i64 [[MIN_IDX]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw nsw i64 [[INDEX]], 1
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq i64 [[INDEX_NEXT]], 100
+; CHECK-NEXT:    br i1 [[TMP4]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i64 [[RDX_SELECT]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
+  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp sgt i64 %min.val, %l
+  %min.val.next = tail call i64 @llvm.smin.i64(i64 %min.val, i64 %l)
+  %min.idx.next = select i1 %cmp, i64 %iv, i64 %min.idx
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, 100
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i64 [ %min.idx.next, %loop ]
+  ret i64 %res
+}
+
+define i64 @test_vectorize_select_smin_first_idx_cond_flipped(ptr %src, i64 %n) {
+; CHECK-LABEL: define i64 @test_vectorize_select_smin_first_idx_cond_flipped(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp slt i64 [[L]], [[MIN_VAL]]
+; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i64 @llvm.smin.i64(i64 [[MIN_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV]], i64 [[MIN_IDX]]
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i64 [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
+  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp slt i64 %l, %min.val
+  %min.val.next = tail call i64 @llvm.smin.i64(i64 %min.val, i64 %l)
+  %min.idx.next = select i1 %cmp, i64 %iv, i64 %min.idx
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, %n
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i64 [ %min.idx.next, %loop ]
+  ret i64 %res
+}
+
+define i32 @test_vectorize_select_smin_first_idx_trunc_may_match_sentinel(ptr %src, i64 %n) {
+; CHECK-LABEL: define i32 @test_vectorize_select_smin_first_idx_trunc_may_match_sentinel(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp sgt i64 [[MIN_VAL]], [[L]]
+; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i64 @llvm.smin.i64(i64 [[MIN_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[T:%.*]] = trunc i64 [[IV]] to i32
+; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i32 [[T]], i32 [[MIN_IDX]]
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi i32 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %min.idx = phi i32 [ 0, %entry ], [ %min.idx.next, %loop ]
+  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp sgt i64 %min.val, %l
+  %min.val.next = tail call i64 @llvm.smin.i64(i64 %min.val, i64 %l)
+  %t = trunc i64 %iv to i32
+  %min.idx.next = select i1 %cmp, i32 %t, i32 %min.idx
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, %n
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i32 [ %min.idx.next, %loop ]
+  ret i32 %res
+}
+
+define i32 @test_vectorize_select_smin_first_idx_trunc_valid(ptr %src, i64 %n) {
+; CHECK-LABEL: define i32 @test_vectorize_select_smin_first_idx_trunc_valid(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp sgt i64 [[MIN_VAL]], [[L]]
+; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i64 @llvm.smin.i64(i64 [[MIN_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[T:%.*]] = trunc i64 [[IV]] to i32
+; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i32 [[T]], i32 [[MIN_IDX]]
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 100
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi i32 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %min.idx = phi i32 [ 0, %entry ], [ %min.idx.next, %loop ]
+  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp sgt i64 %min.val, %l
+  %min.val.next = tail call i64 @llvm.smin.i64(i64 %min.val, i64 %l)
+  %t = trunc i64 %iv to i32
+  %min.idx.next = select i1 %cmp, i32 %t, i32 %min.idx
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, 100
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i32 [ %min.idx.next, %loop ]
+  ret i32 %res
+}
+
+define i64 @test_vectorize_select_smin_idx_iv_start_different(ptr %src, i64 %n) {
+; CHECK-LABEL: define i64 @test_vectorize_select_smin_idx_iv_start_different(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 20, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp sgt i64 [[MIN_VAL]], [[L]]
+; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i64 @llvm.smin.i64(i64 [[MIN_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV]], i64 [[MIN_IDX]]
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 1000
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i64 [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 20, %entry ], [ %iv.next, %loop ]
+  %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
+  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp sgt i64 %min.val, %l
+  %min.val.next = tail call i64 @llvm.smin.i64(i64 %min.val, i64 %l)
+  %min.idx.next = select i1 %cmp, i64 %iv, i64 %min.idx
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, 1000
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i64 [ %min.idx.next, %loop ]
+  ret i64 %res
+}
+
+
diff --git a/llvm/test/Transforms/LoopVectorize/select-smin-last-index.ll b/llvm/test/Transforms/LoopVectorize/select-smin-last-index.ll
index f9ef340a3e2f8..96769255ada22 100644
--- a/llvm/test/Transforms/LoopVectorize/select-smin-last-index.ll
+++ b/llvm/test/Transforms/LoopVectorize/select-smin-last-index.ll
@@ -7,11 +7,46 @@ define i64 @test_vectorize_select_smin_idx(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_smin_idx(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP2:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp sge <4 x i64> [[VEC_PHI1]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP2]] = call <4 x i64> @llvm.smin.v4i64(<4 x i64> [[VEC_PHI1]], <4 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP3]] = select <4 x i1> [[TMP1]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP4]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vector.reduce.smin.v4i64(<4 x i64> [[TMP2]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP5]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP6:%.*]] = icmp eq <4 x i64> [[TMP2]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP7:%.*]] = select <4 x i1> [[TMP6]], <4 x i64> [[TMP3]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP8:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP7]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP8]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP8]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX2:%.*]] = phi i64 [ [[TMP5]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX2]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp sge i64 [[MIN_VAL]], [[L]]
@@ -19,9 +54,9 @@ define i64 @test_vectorize_select_smin_idx(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -49,11 +84,46 @@ define i64 @test_vectorize_select_smin_idx_cond_flipped(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_smin_idx_cond_flipped(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP2:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP1]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp sle <4 x i64> [[WIDE_LOAD]], [[VEC_PHI1]]
+; CHECK-NEXT:    [[TMP2]] = call <4 x i64> @llvm.smin.v4i64(<4 x i64> [[VEC_PHI1]], <4 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP3]] = select <4 x i1> [[TMP1]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV1]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP4]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vector.reduce.smin.v4i64(<4 x i64> [[TMP2]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP5]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP6:%.*]] = icmp eq <4 x i64> [[TMP2]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP7:%.*]] = select <4 x i1> [[TMP6]], <4 x i64> [[TMP3]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP8:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP7]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP8]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP8]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX2:%.*]] = phi i64 [ [[TMP5]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX2]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp sle i64 [[L]], [[MIN_VAL]]
@@ -61,9 +131,9 @@ define i64 @test_vectorize_select_smin_idx_cond_flipped(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP5:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -221,11 +291,46 @@ define i64 @test_vectorize_select_smin_idx_min_ops_switched(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_smin_idx_min_ops_switched(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP2:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp sge <4 x i64> [[VEC_PHI1]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP2]] = call <4 x i64> @llvm.smin.v4i64(<4 x i64> [[WIDE_LOAD]], <4 x i64> [[VEC_PHI1]])
+; CHECK-NEXT:    [[TMP3]] = select <4 x i1> [[TMP1]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP4]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vector.reduce.smin.v4i64(<4 x i64> [[TMP2]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP5]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP6:%.*]] = icmp eq <4 x i64> [[TMP2]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP7:%.*]] = select <4 x i1> [[TMP6]], <4 x i64> [[TMP3]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP8:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP7]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP8]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP8]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX2:%.*]] = phi i64 [ [[TMP5]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX2]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp sge i64 [[MIN_VAL]], [[L]]
@@ -233,9 +338,9 @@ define i64 @test_vectorize_select_smin_idx_min_ops_switched(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP7:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -553,5 +658,48 @@ exit:
   ret i64 %res
 }
 
+define i64 @test_vectorize_select_smin_idx_inc(ptr %src, i64 %n) {
+; CHECK-LABEL: define i64 @test_vectorize_select_smin_idx_inc(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp sge i64 [[MIN_VAL]], [[L]]
+; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i64 @llvm.smin.i64(i64 [[MIN_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV_NEXT]], i64 [[MIN_IDX]]
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i64 [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
+  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp sge i64 %min.val, %l
+  %min.val.next = tail call i64 @llvm.smin.i64(i64 %min.val, i64 %l)
+  %iv.next = add nuw nsw i64 %iv, 1
+  %min.idx.next = select i1 %cmp, i64 %iv.next, i64 %min.idx
+  %exitcond.not = icmp eq i64 %iv.next, %n
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i64 [ %min.idx.next, %loop ]
+  ret i64 %res
+}
+
+
 declare i64 @llvm.smin.i64(i64, i64)
 declare i16 @llvm.smin.i16(i16, i16)
diff --git a/llvm/test/Transforms/LoopVectorize/select-umax-last-index.ll b/llvm/test/Transforms/LoopVectorize/select-umax-last-index.ll
index 54281daf26790..1bb3d017a67cd 100644
--- a/llvm/test/Transforms/LoopVectorize/select-umax-last-index.ll
+++ b/llvm/test/Transforms/LoopVectorize/select-umax-last-index.ll
@@ -5,11 +5,46 @@ define i64 @test_vectorize_select_umax_idx(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_umax_idx(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP4:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP1]], align 4
+; CHECK-NEXT:    [[TMP2:%.*]] = icmp ule <4 x i64> [[VEC_PHI1]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP3]] = call <4 x i64> @llvm.umax.v4i64(<4 x i64> [[VEC_PHI1]], <4 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP4]] = select <4 x i1> [[TMP2]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV1]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[TMP6:%.*]] = call i64 @llvm.vector.reduce.umax.v4i64(<4 x i64> [[TMP3]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP6]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq <4 x i64> [[TMP3]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP8:%.*]] = select <4 x i1> [[TMP7]], <4 x i64> [[TMP4]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP9:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP8]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP9]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP9]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX2:%.*]] = phi i64 [ [[TMP6]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX2]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ule i64 [[MIN_VAL]], [[L]]
@@ -17,9 +52,9 @@ define i64 @test_vectorize_select_umax_idx(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MAX_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV]], i64 [[MAX_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -47,11 +82,46 @@ define i64 @test_vectorize_select_umax_idx_cond_flipped(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_umax_idx_cond_flipped(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP4:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[TMP2:%.*]] = icmp uge <4 x i64> [[WIDE_LOAD]], [[VEC_PHI1]]
+; CHECK-NEXT:    [[TMP3]] = call <4 x i64> @llvm.umax.v4i64(<4 x i64> [[VEC_PHI1]], <4 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP4]] = select <4 x i1> [[TMP2]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[TMP6:%.*]] = call i64 @llvm.vector.reduce.umax.v4i64(<4 x i64> [[TMP3]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP6]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq <4 x i64> [[TMP3]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP8:%.*]] = select <4 x i1> [[TMP7]], <4 x i64> [[TMP4]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP9:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP8]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP9]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP9]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX2:%.*]] = phi i64 [ [[TMP6]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX2]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp uge i64 [[L]], [[MIN_VAL]]
@@ -59,9 +129,9 @@ define i64 @test_vectorize_select_umax_idx_cond_flipped(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MAX_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MAX_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP5:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -219,11 +289,46 @@ define i64 @test_vectorize_select_umax_idx_min_ops_switched(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_umax_idx_min_ops_switched(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP4:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP1]], align 4
+; CHECK-NEXT:    [[TMP2:%.*]] = icmp ule <4 x i64> [[VEC_PHI1]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP3]] = call <4 x i64> @llvm.umax.v4i64(<4 x i64> [[WIDE_LOAD]], <4 x i64> [[VEC_PHI1]])
+; CHECK-NEXT:    [[TMP4]] = select <4 x i1> [[TMP2]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV1]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[TMP6:%.*]] = call i64 @llvm.vector.reduce.umax.v4i64(<4 x i64> [[TMP3]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP6]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq <4 x i64> [[TMP3]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP8:%.*]] = select <4 x i1> [[TMP7]], <4 x i64> [[TMP4]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP9:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP8]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP9]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP9]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX2:%.*]] = phi i64 [ [[TMP6]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX2]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ule i64 [[MIN_VAL]], [[L]]
@@ -231,9 +336,9 @@ define i64 @test_vectorize_select_umax_idx_min_ops_switched(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MAX_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV]], i64 [[MAX_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP7:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -551,5 +656,47 @@ exit:
   ret i64 %res
 }
 
+define i64 @test_vectorize_select_umax_idx_inc(ptr %src, i64 %n) {
+; CHECK-LABEL: define i64 @test_vectorize_select_umax_idx_inc(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MAX_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MAX_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MAX_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MAX_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ule i64 [[MAX_VAL]], [[L]]
+; CHECK-NEXT:    [[MAX_VAL_NEXT]] = tail call i64 @llvm.umax.i64(i64 [[MAX_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[MAX_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV_NEXT]], i64 [[MAX_IDX]]
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MAX_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i64 [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %max.idx = phi i64 [ 0, %entry ], [ %max.idx.next, %loop ]
+  %max.val = phi i64 [ 0, %entry ], [ %max.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp ule i64 %max.val, %l
+  %max.val.next = tail call i64 @llvm.umax.i64(i64 %max.val, i64 %l)
+  %iv.next = add nuw nsw i64 %iv, 1
+  %max.idx.next = select i1 %cmp, i64 %iv.next, i64 %max.idx
+  %exitcond.not = icmp eq i64 %iv.next, %n
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i64 [ %max.idx.next, %loop ]
+  ret i64 %res
+}
+
 declare i64 @llvm.umax.i64(i64, i64)
 declare i16 @llvm.umax.i16(i16, i16)
diff --git a/llvm/test/Transforms/LoopVectorize/select-umin-first-index.ll b/llvm/test/Transforms/LoopVectorize/select-umin-first-index.ll
index 283dc075a9aee..ce6c23225b13f 100644
--- a/llvm/test/Transforms/LoopVectorize/select-umin-first-index.ll
+++ b/llvm/test/Transforms/LoopVectorize/select-umin-first-index.ll
@@ -11,7 +11,7 @@ define i64 @test_vectorize_select_umin_idx(ptr %src, i64 %n) {
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -1, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[MIN_VAL]], [[L]]
@@ -30,7 +30,7 @@ entry:
 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
-  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %min.val = phi i64 [ -1, %entry ], [ %min.val.next, %loop ]
   %gep = getelementptr i64, ptr %src, i64 %iv
   %l = load i64, ptr %gep
   %cmp = icmp ugt i64 %min.val, %l
@@ -45,6 +45,48 @@ exit:
   ret i64 %res
 }
 
+define i64 @test_vectorize_select_umin_idx_signed_sentinel_possible(ptr %src, i64 %n) {
+; CHECK-LABEL: define i64 @test_vectorize_select_umin_idx_signed_sentinel_possible(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[INDEX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -2, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[INDEX]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[TMP0]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[MIN_VAL]], [[L]]
+; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i64 @llvm.umin.i64(i64 [[MIN_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[INDEX]], i64 [[MIN_IDX]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw nsw i64 [[INDEX]], 1
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq i64 [[INDEX_NEXT]], 100
+; CHECK-NEXT:    br i1 [[TMP4]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i64 [[RDX_SELECT]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
+  %min.val = phi i64 [ -2, %entry ], [ %min.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp ugt i64 %min.val, %l
+  %min.val.next = tail call i64 @llvm.umin.i64(i64 %min.val, i64 %l)
+  %min.idx.next = select i1 %cmp, i64 %iv, i64 %min.idx
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, 100
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i64 [ %min.idx.next, %loop ]
+  ret i64 %res
+}
+
 define i64 @test_vectorize_select_umin_idx_cond_flipped(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_umin_idx_cond_flipped(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
@@ -53,7 +95,7 @@ define i64 @test_vectorize_select_umin_idx_cond_flipped(ptr %src, i64 %n) {
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -1, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i64 [[L]], [[MIN_VAL]]
@@ -72,7 +114,7 @@ entry:
 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
-  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %min.val = phi i64 [ -1, %entry ], [ %min.val.next, %loop ]
   %gep = getelementptr i64, ptr %src, i64 %iv
   %l = load i64, ptr %gep
   %cmp = icmp ult i64 %l, %min.val
@@ -95,7 +137,7 @@ define i64 @test_vectorize_select_umin_idx_select_ops_flipped(ptr %src, i64 %n)
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 100, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i64 [[L]], [[MIN_VAL]]
@@ -114,7 +156,7 @@ entry:
 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
-  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %min.val = phi i64 [ 100, %entry ], [ %min.val.next, %loop ]
   %gep = getelementptr i64, ptr %src, i64 %iv
   %l = load i64, ptr %gep
   %cmp = icmp ult i64 %l, %min.val
@@ -137,7 +179,7 @@ define i64 @test_vectorize_select_umin_via_select_idx(ptr %src, i64 %n) {
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 100, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[MIN_VAL]], [[L]]
@@ -156,7 +198,7 @@ entry:
 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
-  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %min.val = phi i64 [ 100, %entry ], [ %min.val.next, %loop ]
   %gep = getelementptr i64, ptr %src, i64 %iv
   %l = load i64, ptr %gep
   %cmp = icmp ugt i64 %min.val, %l
@@ -179,7 +221,7 @@ define i64 @test_vectorize_select_umin_idx_all_exit_inst(ptr %src, ptr %umin, i6
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -20, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[MIN_VAL]], [[L]]
@@ -200,7 +242,7 @@ entry:
 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
-  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %min.val = phi i64 [ -20, %entry ], [ %min.val.next, %loop ]
   %gep = getelementptr i64, ptr %src, i64 %iv
   %l = load i64, ptr %gep
   %cmp = icmp ugt i64 %min.val, %l
@@ -225,7 +267,7 @@ define i64 @test_vectorize_select_umin_idx_min_ops_switched(ptr %src, i64 %n) {
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -1, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[MIN_VAL]], [[L]]
@@ -244,7 +286,7 @@ entry:
 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
-  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %min.val = phi i64 [ -1, %entry ], [ %min.val.next, %loop ]
   %gep = getelementptr i64, ptr %src, i64 %iv
   %l = load i64, ptr %gep
   %cmp = icmp ugt i64 %min.val, %l
@@ -267,7 +309,7 @@ define i64 @test_not_vectorize_select_no_min_reduction(ptr %src, i64 %n) {
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[RED_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[RED_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RED_VAL:%.*]] = phi i64 [ -1, %[[ENTRY]] ], [ [[RED_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[RED_VAL]], [[L]]
@@ -286,7 +328,7 @@ entry:
 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
-  %red.val = phi i64 [ 0, %entry ], [ %red.val.next, %loop ]
+  %red.val = phi i64 [ -1, %entry ], [ %red.val.next, %loop ]
   %gep = getelementptr i64, ptr %src, i64 %iv
   %l = load i64, ptr %gep
   %cmp = icmp ugt i64 %red.val, %l
@@ -309,7 +351,7 @@ define i64 @test_cmp_and_umin_use_different_values(ptr %src, i64 %x, i64 %n) {
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -1, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[MIN_VAL]], [[X]]
@@ -328,7 +370,7 @@ entry:
 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
-  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %min.val = phi i64 [ -1, %entry ], [ %min.val.next, %loop ]
   %gep = getelementptr i64, ptr %src, i64 %iv
   %l = load i64, ptr %gep
   %cmp = icmp ugt i64 %min.val, %x
@@ -351,7 +393,7 @@ define i32 @test_vectorize_select_umin_idx_with_trunc(ptr %src, i64 %n) {
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -1, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[MIN_VAL]], [[L]]
@@ -371,7 +413,7 @@ entry:
 loop:
   %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
   %min.idx = phi i32 [ 0, %entry ], [ %min.idx.next, %loop ]
-  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %min.val = phi i64 [ -1, %entry ], [ %min.val.next, %loop ]
   %gep = getelementptr i64, ptr %src, i64 %iv
   %l = load i64, ptr %gep
   %cmp = icmp ugt i64 %min.val, %l
@@ -387,6 +429,50 @@ exit:
   ret i32 %res
 }
 
+define i32 @test_vectorize_select_umin_idx_with_trunc_valid(ptr %src, i64 %n) {
+; CHECK-LABEL: define i32 @test_vectorize_select_umin_idx_with_trunc_valid(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -1, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[MIN_VAL]], [[L]]
+; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i64 @llvm.umin.i64(i64 [[MIN_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[TRUNC:%.*]] = trunc i64 [[IV]] to i32
+; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i32 [[TRUNC]], i32 [[MIN_IDX]]
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 100
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi i32 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %min.idx = phi i32 [ 0, %entry ], [ %min.idx.next, %loop ]
+  %min.val = phi i64 [ -1, %entry ], [ %min.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp ugt i64 %min.val, %l
+  %min.val.next = tail call i64 @llvm.umin.i64(i64 %min.val, i64 %l)
+  %trunc = trunc i64 %iv to i32
+  %min.idx.next = select i1 %cmp, i32 %trunc, i32 %min.idx
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, 100
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i32 [ %min.idx.next, %loop ]
+  ret i32 %res
+}
+
 define ptr @test_with_ptr_index(ptr %start, ptr %end) {
 ; CHECK-LABEL: define ptr @test_with_ptr_index(
 ; CHECK-SAME: ptr [[START:%.*]], ptr [[END:%.*]]) {
@@ -395,7 +481,7 @@ define ptr @test_with_ptr_index(ptr %start, ptr %end) {
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[IV:%.*]] = phi ptr [ [[START]], %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi ptr [ null, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -1, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[IV]], align 4
 ; CHECK-NEXT:    [[CMP7_US:%.*]] = icmp ult i64 [[L]], [[MIN_VAL]]
 ; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i64 @llvm.umin.i64(i64 [[MIN_VAL]], i64 [[L]])
@@ -413,7 +499,7 @@ entry:
 loop:
   %iv = phi ptr [ %start, %entry ], [ %iv.next, %loop ]
   %min.idx = phi ptr [ null, %entry ], [ %min.idx.next, %loop ]
-  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %min.val = phi i64 [ -1, %entry ], [ %min.val.next, %loop ]
   %l = load i64, ptr %iv
   %cmp7.us = icmp ult i64 %l, %min.val
   %min.val.next = tail call i64 @llvm.umin.i64(i64 %min.val, i64 %l)
@@ -435,7 +521,7 @@ define i64 @test_no_vectorize_select_iv_decrement(ptr %src) {
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 1000, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -1, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[MIN_VAL]], [[L]]
@@ -454,7 +540,7 @@ entry:
 loop:
   %iv = phi i64 [ 1000, %entry ], [ %iv.next, %loop ]
   %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
-  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %min.val = phi i64 [ -1, %entry ], [ %min.val.next, %loop ]
   %gep = getelementptr i64, ptr %src, i64 %iv
   %l = load i64, ptr %gep
   %cmp = icmp ugt i64 %min.val, %l
@@ -477,7 +563,7 @@ define i64 @test_no_vectorize_select_iv_sub(ptr %src) {
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 1000, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -1, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[MIN_VAL]], [[L]]
@@ -496,7 +582,7 @@ entry:
 loop:
   %iv = phi i64 [ 1000, %entry ], [ %iv.next, %loop ]
   %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
-  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %min.val = phi i64 [ -1, %entry ], [ %min.val.next, %loop ]
   %gep = getelementptr i64, ptr %src, i64 %iv
   %l = load i64, ptr %gep
   %cmp = icmp ugt i64 %min.val, %l
@@ -519,7 +605,7 @@ define i64 @test_no_vectorize_select_iv_mul(ptr %src) {
 ; CHECK:       [[LOOP]]:
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 1, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -1, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[MIN_VAL]], [[L]]
@@ -538,7 +624,7 @@ entry:
 loop:
   %iv = phi i64 [ 1, %entry ], [ %iv.next, %loop ]
   %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
-  %min.val = phi i64 [ 0, %entry ], [ %min.val.next, %loop ]
+  %min.val = phi i64 [ -1, %entry ], [ %min.val.next, %loop ]
   %gep = getelementptr i64, ptr %src, i64 %iv
   %l = load i64, ptr %gep
   %cmp = icmp ugt i64 %min.val, %l
@@ -553,5 +639,93 @@ exit:
   ret i64 %res
 }
 
+define i64 @test_vectorize_select_umin_idx_wraps(ptr %src, i64 %n, i64 %start) {
+; CHECK-LABEL: define i64 @test_vectorize_select_umin_idx_wraps(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]], i64 [[START:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IDX:%.*]] = phi i64 [ [[START]], %[[ENTRY]] ], [ [[IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -1, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[MIN_VAL]], [[L]]
+; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i64 @llvm.umin.i64(i64 [[MIN_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IDX]], i64 [[MIN_IDX]]
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[IDX_NEXT]] = add i64 [[IDX]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i64 [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %idx = phi i64 [ %start, %entry ], [ %idx.next, %loop ]
+  %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
+  %min.val = phi i64 [ -1, %entry ], [ %min.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp ugt i64 %min.val, %l
+  %min.val.next = tail call i64 @llvm.umin.i64(i64 %min.val, i64 %l)
+  %min.idx.next = select i1 %cmp, i64 %idx, i64 %min.idx
+  %iv.next = add nuw nsw i64 %iv, 1
+  %idx.next = add i64 %idx, 1
+  %exitcond.not = icmp eq i64 %iv.next, %n
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i64 [ %min.idx.next, %loop ]
+  ret i64 %res
+}
+
+define i64 @test_vectorize_select_umin_idx_iv_start_different(ptr %src, i64 %n) {
+; CHECK-LABEL: define i64 @test_vectorize_select_umin_idx_iv_start_different(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 10, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ -1, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[MIN_VAL]], [[L]]
+; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i64 @llvm.umin.i64(i64 [[MIN_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV]], i64 [[MIN_IDX]]
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 10000
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i64 [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 10, %entry ], [ %iv.next, %loop ]
+  %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
+  %min.val = phi i64 [ -1, %entry ], [ %min.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp ugt i64 %min.val, %l
+  %min.val.next = tail call i64 @llvm.umin.i64(i64 %min.val, i64 %l)
+  %min.idx.next = select i1 %cmp, i64 %iv, i64 %min.idx
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, 10000
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i64 [ %min.idx.next, %loop ]
+  ret i64 %res
+}
+
 declare i64 @llvm.umin.i64(i64, i64)
 declare i16 @llvm.umin.i16(i16, i16)
diff --git a/llvm/test/Transforms/LoopVectorize/select-umin-last-index.ll b/llvm/test/Transforms/LoopVectorize/select-umin-last-index.ll
index da5ff7246a0c0..3e69c1907269b 100644
--- a/llvm/test/Transforms/LoopVectorize/select-umin-last-index.ll
+++ b/llvm/test/Transforms/LoopVectorize/select-umin-last-index.ll
@@ -7,11 +7,46 @@ define i64 @test_vectorize_select_umin_idx(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_umin_idx(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP4:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ splat (i64 140), %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP1]], align 4
+; CHECK-NEXT:    [[TMP2:%.*]] = icmp uge <4 x i64> [[VEC_PHI1]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP3]] = call <4 x i64> @llvm.umin.v4i64(<4 x i64> [[VEC_PHI1]], <4 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP4]] = select <4 x i1> [[TMP2]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV1]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP5]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[TMP6:%.*]] = call i64 @llvm.vector.reduce.umin.v4i64(<4 x i64> [[TMP3]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP6]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq <4 x i64> [[TMP3]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP8:%.*]] = select <4 x i1> [[TMP7]], <4 x i64> [[TMP4]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP9:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP8]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP9]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP9]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX2:%.*]] = phi i64 [ [[TMP6]], %[[MIDDLE_BLOCK]] ], [ 140, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 140, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX2]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp uge i64 [[MIN_VAL]], [[L]]
@@ -19,9 +54,9 @@ define i64 @test_vectorize_select_umin_idx(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -49,11 +84,46 @@ define i64 @test_vectorize_select_umin_idx_cond_flipped(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_umin_idx_cond_flipped(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ splat (i64 130), %[[VECTOR_PH]] ], [ [[TMP2:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP1]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp ule <4 x i64> [[WIDE_LOAD]], [[VEC_PHI1]]
+; CHECK-NEXT:    [[TMP2]] = call <4 x i64> @llvm.umin.v4i64(<4 x i64> [[VEC_PHI1]], <4 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP3]] = select <4 x i1> [[TMP1]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV1]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP4]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vector.reduce.umin.v4i64(<4 x i64> [[TMP2]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP5]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP6:%.*]] = icmp eq <4 x i64> [[TMP2]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP7:%.*]] = select <4 x i1> [[TMP6]], <4 x i64> [[TMP3]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP8:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP7]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP8]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP8]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX2:%.*]] = phi i64 [ [[TMP5]], %[[MIDDLE_BLOCK]] ], [ 130, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 130, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX2]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ule i64 [[L]], [[MIN_VAL]]
@@ -61,9 +131,9 @@ define i64 @test_vectorize_select_umin_idx_cond_flipped(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP5:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -91,11 +161,46 @@ define i64 @test_vectorize_select_umin_idx_select_ops_flipped(ptr %src, i64 %n)
 ; CHECK-LABEL: define i64 @test_vectorize_select_umin_idx_select_ops_flipped(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ splat (i64 120), %[[VECTOR_PH]] ], [ [[TMP2:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP1]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp ugt <4 x i64> [[WIDE_LOAD]], [[VEC_PHI1]]
+; CHECK-NEXT:    [[TMP2]] = call <4 x i64> @llvm.umin.v4i64(<4 x i64> [[VEC_PHI1]], <4 x i64> [[WIDE_LOAD]])
+; CHECK-NEXT:    [[TMP3]] = select <4 x i1> [[TMP1]], <4 x i64> [[VEC_PHI]], <4 x i64> [[VEC_IND]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV1]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP4]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vector.reduce.umin.v4i64(<4 x i64> [[TMP2]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP5]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP6:%.*]] = icmp eq <4 x i64> [[TMP2]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP7:%.*]] = select <4 x i1> [[TMP6]], <4 x i64> [[TMP3]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP8:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP7]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP8]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP8]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX2:%.*]] = phi i64 [ [[TMP5]], %[[MIDDLE_BLOCK]] ], [ 120, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 120, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX2]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ugt i64 [[L]], [[MIN_VAL]]
@@ -103,9 +208,9 @@ define i64 @test_vectorize_select_umin_idx_select_ops_flipped(ptr %src, i64 %n)
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[MIN_IDX]], i64 [[IV]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP7:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -221,11 +326,46 @@ define i64 @test_vectorize_select_umin_idx_min_ops_switched(ptr %src, i64 %n) {
 ; CHECK-LABEL: define i64 @test_vectorize_select_umin_idx_min_ops_switched(
 ; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
 ; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 4
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <4 x i64> [ splat (i64 -9223372036854775808), %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_PHI1:%.*]] = phi <4 x i64> [ splat (i64 90), %[[VECTOR_PH]] ], [ [[TMP2:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp uge <4 x i64> [[VEC_PHI1]], [[WIDE_LOAD]]
+; CHECK-NEXT:    [[TMP2]] = call <4 x i64> @llvm.umin.v4i64(<4 x i64> [[WIDE_LOAD]], <4 x i64> [[VEC_PHI1]])
+; CHECK-NEXT:    [[TMP3]] = select <4 x i1> [[TMP1]], <4 x i64> [[VEC_IND]], <4 x i64> [[VEC_PHI]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[IV]], 4
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <4 x i64> [[VEC_IND]], splat (i64 4)
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP4]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vector.reduce.umin.v4i64(<4 x i64> [[TMP2]])
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> poison, i64 [[TMP5]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP6:%.*]] = icmp eq <4 x i64> [[TMP2]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP7:%.*]] = select <4 x i1> [[TMP6]], <4 x i64> [[TMP3]], <4 x i64> splat (i64 -9223372036854775808)
+; CHECK-NEXT:    [[TMP8:%.*]] = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> [[TMP7]])
+; CHECK-NEXT:    [[RDX_SELECT_CMP:%.*]] = icmp ne i64 [[TMP8]], -9223372036854775808
+; CHECK-NEXT:    [[RDX_SELECT:%.*]] = select i1 [[RDX_SELECT_CMP]], i64 [[TMP8]], i64 0
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX:%.*]] = phi i64 [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT:    [[BC_MERGE_RDX2:%.*]] = phi i64 [ [[TMP5]], %[[MIDDLE_BLOCK]] ], [ 90, %[[ENTRY]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 90, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[IV1:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ [[BC_MERGE_RDX2]], %[[SCALAR_PH]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
 ; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV1]]
 ; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP1]], align 4
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp uge i64 [[MIN_VAL]], [[L]]
@@ -233,9 +373,9 @@ define i64 @test_vectorize_select_umin_idx_min_ops_switched(ptr %src, i64 %n) {
 ; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV1]], i64 [[MIN_IDX]]
 ; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV1]], 1
 ; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
-; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP9:![0-9]+]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ], [ [[RDX_SELECT]], %[[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT:    ret i64 [[RES]]
 ;
 entry:
@@ -597,5 +737,48 @@ exit:
   ret i64 %res
 }
 
+define i64 @test_vectorize_select_umin_idx_inc(ptr %src, i64 %n) {
+; CHECK-LABEL: define i64 @test_vectorize_select_umin_idx_inc(
+; CHECK-SAME: ptr [[SRC:%.*]], i64 [[N:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_IDX:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[MIN_IDX_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[MIN_VAL:%.*]] = phi i64 [ 140, %[[ENTRY]] ], [ [[MIN_VAL_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr i64, ptr [[SRC]], i64 [[IV]]
+; CHECK-NEXT:    [[L:%.*]] = load i64, ptr [[GEP]], align 4
+; CHECK-NEXT:    [[CMP:%.*]] = icmp uge i64 [[MIN_VAL]], [[L]]
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[MIN_VAL_NEXT]] = tail call i64 @llvm.umin.i64(i64 [[MIN_VAL]], i64 [[L]])
+; CHECK-NEXT:    [[MIN_IDX_NEXT]] = select i1 [[CMP]], i64 [[IV_NEXT]], i64 [[MIN_IDX]]
+; CHECK-NEXT:    [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT:    br i1 [[EXITCOND_NOT]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    [[RES:%.*]] = phi i64 [ [[MIN_IDX_NEXT]], %[[LOOP]] ]
+; CHECK-NEXT:    ret i64 [[RES]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %min.idx = phi i64 [ 0, %entry ], [ %min.idx.next, %loop ]
+  %min.val = phi i64 [ 140, %entry ], [ %min.val.next, %loop ]
+  %gep = getelementptr i64, ptr %src, i64 %iv
+  %l = load i64, ptr %gep
+  %cmp = icmp uge i64 %min.val, %l
+  %iv.next = add nuw nsw i64 %iv, 1
+  %min.val.next = tail call i64 @llvm.umin.i64(i64 %min.val, i64 %l)
+  %min.idx.next = select i1 %cmp, i64 %iv.next, i64 %min.idx
+  %exitcond.not = icmp eq i64 %iv.next, %n
+  br i1 %exitcond.not, label %exit, label %loop
+
+exit:
+  %res = phi i64 [ %min.idx.next, %loop ]
+  ret i64 %res
+}
+
+
 declare i64 @llvm.umin.i64(i64, i64)
 declare i16 @llvm.umin.i16(i16, i16)
diff --git a/llvm/test/Transforms/LoopVectorize/struct-return.ll b/llvm/test/Transforms/LoopVectorize/struct-return.ll
index f2e2e2846614b..70c6c7e900c51 100644
--- a/llvm/test/Transforms/LoopVectorize/struct-return.ll
+++ b/llvm/test/Transforms/LoopVectorize/struct-return.ll
@@ -29,8 +29,9 @@ define void @struct_return_f32_widen(ptr noalias %in, ptr noalias writeonly %out
 ; CHECK-NEXT:    [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1024
 ; CHECK-NEXT:    br i1 [[TMP6]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK:       [[MIDDLE_BLOCK]]:
-; CHECK-NEXT:    br [[EXIT:label %.*]]
-; CHECK:       [[SCALAR_PH:.*:]]
+; CHECK-NEXT:    br label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
 ;
 entry:
   br label %for.body
@@ -77,8 +78,9 @@ define void @struct_return_f64_widen(ptr noalias %in, ptr noalias writeonly %out
 ; CHECK-NEXT:    [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1024
 ; CHECK-NEXT:    br i1 [[TMP6]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK:       [[MIDDLE_BLOCK]]:
-; CHECK-NEXT:    br [[EXIT:label %.*]]
-; CHECK:       [[SCALAR_PH:.*:]]
+; CHECK-NEXT:    br label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
 ;
 entry:
   br label %for.body
@@ -232,8 +234,9 @@ define void @struct_return_i32_three_results_widen(ptr noalias %in, ptr noalias
 ; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1024
 ; CHECK-NEXT:    br i1 [[TMP4]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
 ; CHECK:       [[MIDDLE_BLOCK]]:
-; CHECK-NEXT:    br [[EXIT:label %.*]]
-; CHECK:       [[SCALAR_PH:.*:]]
+; CHECK-NEXT:    br label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
 ;
 entry:
   br label %for.body
@@ -273,7 +276,7 @@ define void @scalarized_predicated_struct_return(ptr %a) {
 ; CHECK-NEXT:    br i1 [[TMP2]], label %[[PRED_STORE_IF:.*]], label %[[PRED_STORE_CONTINUE:.*]]
 ; CHECK:       [[PRED_STORE_IF]]:
 ; CHECK-NEXT:    [[TMP3:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 0
-; CHECK-NEXT:    [[TMP4:%.*]] = tail call { i64, i64 } @bar_i64(i64 [[TMP3]]) #[[ATTR4:[0-9]+]]
+; CHECK-NEXT:    [[TMP4:%.*]] = tail call { i64, i64 } @bar_i64(i64 [[TMP3]]) #[[ATTR2:[0-9]+]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = extractvalue { i64, i64 } [[TMP4]], 0
 ; CHECK-NEXT:    [[TMP6:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 0
 ; CHECK-NEXT:    [[TMP7:%.*]] = udiv i64 [[TMP5]], [[TMP6]]
@@ -286,7 +289,7 @@ define void @scalarized_predicated_struct_return(ptr %a) {
 ; CHECK-NEXT:    br i1 [[TMP10]], label %[[PRED_STORE_IF1:.*]], label %[[PRED_STORE_CONTINUE2]]
 ; CHECK:       [[PRED_STORE_IF1]]:
 ; CHECK-NEXT:    [[TMP11:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 1
-; CHECK-NEXT:    [[TMP12:%.*]] = tail call { i64, i64 } @bar_i64(i64 [[TMP11]]) #[[ATTR4]]
+; CHECK-NEXT:    [[TMP12:%.*]] = tail call { i64, i64 } @bar_i64(i64 [[TMP11]]) #[[ATTR2]]
 ; CHECK-NEXT:    [[TMP13:%.*]] = extractvalue { i64, i64 } [[TMP12]], 0
 ; CHECK-NEXT:    [[TMP14:%.*]] = extractelement <2 x i64> [[WIDE_LOAD]], i32 1
 ; CHECK-NEXT:    [[TMP15:%.*]] = udiv i64 [[TMP13]], [[TMP14]]
@@ -299,8 +302,9 @@ define void @scalarized_predicated_struct_return(ptr %a) {
 ; CHECK-NEXT:    [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1024
 ; CHECK-NEXT:    br i1 [[TMP18]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
 ; CHECK:       [[MIDDLE_BLOCK]]:
-; CHECK-NEXT:    br [[EXIT:label %.*]]
-; CHECK:       [[SCALAR_PH:.*:]]
+; CHECK-NEXT:    br label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
 ;
 entry:
   br label %for.body
@@ -385,7 +389,7 @@ define void @negative_mixed_element_type_struct_return(ptr noalias %in, ptr noal
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[FOR_BODY]] ]
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds float, ptr [[IN]], i64 [[IV]]
 ; CHECK-NEXT:    [[IN_VAL:%.*]] = load float, ptr [[ARRAYIDX]], align 4
-; CHECK-NEXT:    [[CALL:%.*]] = tail call { float, i32 } @baz(float [[IN_VAL]]) #[[ATTR5:[0-9]+]]
+; CHECK-NEXT:    [[CALL:%.*]] = tail call { float, i32 } @baz(float [[IN_VAL]]) #[[ATTR3:[0-9]+]]
 ; CHECK-NEXT:    [[EXTRACT_A:%.*]] = extractvalue { float, i32 } [[CALL]], 0
 ; CHECK-NEXT:    [[EXTRACT_B:%.*]] = extractvalue { float, i32 } [[CALL]], 1
 ; CHECK-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds float, ptr [[OUT_A]], i64 [[IV]]
@@ -433,7 +437,7 @@ define void @negative_named_struct_return(ptr noalias readonly %in, ptr noalias
 ; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[FOR_BODY]] ]
 ; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds double, ptr [[IN]], i64 [[IV]]
 ; CHECK-NEXT:    [[IN_VAL:%.*]] = load double, ptr [[ARRAYIDX]], align 8
-; CHECK-NEXT:    [[CALL:%.*]] = tail call [[NAMED_STRUCT:%.*]] @[[BAR_NAMED:[a-zA-Z0-9_$\"\\.-]*[a-zA-Z_$\"\\.-][a-zA-Z0-9_$\"\\.-]*]](double [[IN_VAL]]) #[[ATTR6:[0-9]+]]
+; CHECK-NEXT:    [[CALL:%.*]] = tail call [[NAMED_STRUCT:%.*]] @[[BAR_NAMED:[a-zA-Z0-9_$\"\\.-]*[a-zA-Z_$\"\\.-][a-zA-Z0-9_$\"\\.-]*]](double [[IN_VAL]]) #[[ATTR4:[0-9]+]]
 ; CHECK-NEXT:    [[EXTRACT_A:%.*]] = extractvalue [[NAMED_STRUCT]] [[CALL]], 0
 ; CHECK-NEXT:    [[EXTRACT_B:%.*]] = extractvalue [[NAMED_STRUCT]] [[CALL]], 1
 ; CHECK-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds double, ptr [[OUT_A]], i64 [[IV]]
diff --git a/llvm/test/Transforms/LowerConstantIntrinsics/builtin-object-size-idxsize.ll b/llvm/test/Transforms/LowerConstantIntrinsics/builtin-object-size-idxsize.ll
new file mode 100644
index 0000000000000..5925fc97a2524
--- /dev/null
+++ b/llvm/test/Transforms/LowerConstantIntrinsics/builtin-object-size-idxsize.ll
@@ -0,0 +1,262 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -passes=lower-constant-intrinsics -S < %s | FileCheck %s
+
+; Some extra tests using 16-bit pointers and a 16-bit index type size. This
+; allows us, for example, to test what happens when the index type used in a
+; getelementptr does not match the index type size (e.g. when not running the
+; full opt pipeline before the lower-constant-intrinsics pass).
+
+target datalayout = "e-p:16:16:16"
+
+
+define i32 @possible_out_of_bounds_gep_i8(i1 %c0, i1 %c1) {
+; CHECK-LABEL: define i32 @possible_out_of_bounds_gep_i8(
+; CHECK-SAME: i1 [[C0:%.*]], i1 [[C1:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[OBJ:%.*]] = alloca [5 x i8], align 1
+; CHECK-NEXT:    [[OFFSET:%.*]] = select i1 [[C0]], i8 2, i8 10
+; CHECK-NEXT:    [[PTR_SLIDE:%.*]] = getelementptr i8, ptr [[OBJ]], i8 [[OFFSET]]
+; CHECK-NEXT:    [[RES:%.*]] = select i1 [[C1]], i32 3, i32 0
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  %obj = alloca [5 x i8]
+  %offset = select i1 %c0, i8 2, i8 10
+  %ptr.slide = getelementptr i8, ptr %obj, i8 %offset
+  %objsize_max = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 false, i1 true, i1 false)
+  %objsize_min = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 true, i1 true, i1 false)
+  %res = select i1 %c1, i32 %objsize_max, i32 %objsize_min
+  ret i32 %res
+}
+
+define i32 @possible_out_of_bounds_gep_i16(i1 %c0, i1 %c1) {
+; CHECK-LABEL: define i32 @possible_out_of_bounds_gep_i16(
+; CHECK-SAME: i1 [[C0:%.*]], i1 [[C1:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[OBJ:%.*]] = alloca [5 x i8], align 1
+; CHECK-NEXT:    [[OFFSET:%.*]] = select i1 [[C0]], i16 2, i16 10
+; CHECK-NEXT:    [[PTR_SLIDE:%.*]] = getelementptr i8, ptr [[OBJ]], i16 [[OFFSET]]
+; CHECK-NEXT:    [[RES:%.*]] = select i1 [[C1]], i32 3, i32 0
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  %obj = alloca [5 x i8]
+  %offset = select i1 %c0, i16 2, i16 10
+  %ptr.slide = getelementptr i8, ptr %obj, i16 %offset
+  %objsize_max = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 false, i1 true, i1 false)
+  %objsize_min = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 true, i1 true, i1 false)
+  %res = select i1 %c1, i32 %objsize_max, i32 %objsize_min
+  ret i32 %res
+}
+
+define i32 @possible_out_of_bounds_gep_i32(i1 %c0, i1 %c1) {
+; CHECK-LABEL: define i32 @possible_out_of_bounds_gep_i32(
+; CHECK-SAME: i1 [[C0:%.*]], i1 [[C1:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[OBJ:%.*]] = alloca [5 x i8], align 1
+; CHECK-NEXT:    [[OFFSET:%.*]] = select i1 [[C0]], i32 2, i32 10
+; CHECK-NEXT:    [[PTR_SLIDE:%.*]] = getelementptr i8, ptr [[OBJ]], i32 [[OFFSET]]
+; CHECK-NEXT:    [[RES:%.*]] = select i1 [[C1]], i32 3, i32 0
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  %obj = alloca [5 x i8]
+  %offset = select i1 %c0, i32 2, i32 10
+  %ptr.slide = getelementptr i8, ptr %obj, i32 %offset
+  %objsize_max = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 false, i1 true, i1 false)
+  %objsize_min = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 true, i1 true, i1 false)
+  %res = select i1 %c1, i32 %objsize_max, i32 %objsize_min
+  ret i32 %res
+}
+
+; SROA would produce IR like this if applied to @possible_out_of_bounds_gep_i16.
+; FIXME: The %objsize_min result here is invalid.
+define i32 @possible_out_of_bounds_gep_i16_sroa(i1 %c0, i1 %c1) {
+; CHECK-LABEL: define i32 @possible_out_of_bounds_gep_i16_sroa(
+; CHECK-SAME: i1 [[C0:%.*]], i1 [[C1:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[OBJ:%.*]] = alloca [5 x i8], align 1
+; CHECK-NEXT:    [[DOTSROA_GEP:%.*]] = getelementptr i8, ptr [[OBJ]], i16 2
+; CHECK-NEXT:    [[DOTSROA_GEP1:%.*]] = getelementptr i8, ptr [[OBJ]], i16 10
+; CHECK-NEXT:    [[OFFSET_SROA_SEL:%.*]] = select i1 [[C0]], ptr [[DOTSROA_GEP]], ptr [[DOTSROA_GEP1]]
+; CHECK-NEXT:    [[RES:%.*]] = select i1 [[C1]], i32 3, i32 65531
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  %obj = alloca [5 x i8], align 1
+  %.sroa.gep = getelementptr i8, ptr %obj, i16 2
+  %.sroa.gep1 = getelementptr i8, ptr %obj, i16 10
+  %offset.sroa.sel = select i1 %c0, ptr %.sroa.gep, ptr %.sroa.gep1
+  %objsize_max = call i32 @llvm.objectsize.i32.p0(ptr %offset.sroa.sel, i1 false, i1 true, i1 false)
+  %objsize_min = call i32 @llvm.objectsize.i32.p0(ptr %offset.sroa.sel, i1 true, i1 true, i1 false)
+  %res = select i1 %c1, i32 %objsize_max, i32 %objsize_min
+  ret i32 %res
+}
+
+; Indices are truncated to the pointer size in a gep, so "i32 -65526" should
+; be treated as "i16 10" and we expect the same result as for
+; @possible_out_of_bounds_gep_i16 above.
+define i32 @possible_out_of_bounds_gep_i32_trunc(i1 %c0, i1 %c1) {
+; CHECK-LABEL: define i32 @possible_out_of_bounds_gep_i32_trunc(
+; CHECK-SAME: i1 [[C0:%.*]], i1 [[C1:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[OBJ:%.*]] = alloca [5 x i8], align 1
+; CHECK-NEXT:    [[OFFSET:%.*]] = select i1 [[C0]], i32 2, i32 -65526
+; CHECK-NEXT:    [[PTR_SLIDE:%.*]] = getelementptr i8, ptr [[OBJ]], i32 [[OFFSET]]
+; CHECK-NEXT:    [[RES:%.*]] = select i1 [[C1]], i32 3, i32 0
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  %obj = alloca [5 x i8]
+  %offset = select i1 %c0, i32 2, i32 -65526  ; 0xffff000a
+  %ptr.slide = getelementptr i8, ptr %obj, i32 %offset
+  %objsize_max = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 false, i1 true, i1 false)
+  %objsize_min = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 true, i1 true, i1 false)
+  %res = select i1 %c1, i32 %objsize_max, i32 %objsize_min
+  ret i32 %res
+}
+
+define i32 @out_of_bounds_gep_i8(i1 %c0, i1 %c1) {
+; CHECK-LABEL: define i32 @out_of_bounds_gep_i8(
+; CHECK-SAME: i1 [[C0:%.*]], i1 [[C1:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[OBJ:%.*]] = alloca [5 x i8], align 1
+; CHECK-NEXT:    [[PTR_SLIDE:%.*]] = getelementptr i8, ptr [[OBJ]], i8 -128
+; CHECK-NEXT:    [[RES:%.*]] = select i1 [[C1]], i32 -1, i32 0
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  %obj = alloca [5 x i8]
+  %ptr.slide = getelementptr i8, ptr %obj, i8 -128
+  %objsize_max = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 false, i1 true, i1 false)
+  %objsize_min = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 true, i1 true, i1 false)
+  %res = select i1 %c1, i32 %objsize_max, i32 %objsize_min
+  ret i32 %res
+}
+
+define i32 @out_of_bounds_gep_i32(i1 %c0, i1 %c1) {
+; CHECK-LABEL: define i32 @out_of_bounds_gep_i32(
+; CHECK-SAME: i1 [[C0:%.*]], i1 [[C1:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[OBJ:%.*]] = alloca [5 x i8], align 1
+; CHECK-NEXT:    [[PTR_SLIDE:%.*]] = getelementptr i8, ptr [[OBJ]], i32 10
+; CHECK-NEXT:    ret i32 0
+;
+entry:
+  %obj = alloca [5 x i8]
+  %ptr.slide = getelementptr i8, ptr %obj, i32 10
+  %objsize_max = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 false, i1 true, i1 false)
+  %objsize_min = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 true, i1 true, i1 false)
+  %res = select i1 %c1, i32 %objsize_max, i32 %objsize_min
+  ret i32 %res
+}
+
+define i32 @out_of_bounds_gep_i32_trunc(i1 %c0, i1 %c1) {
+; CHECK-LABEL: define i32 @out_of_bounds_gep_i32_trunc(
+; CHECK-SAME: i1 [[C0:%.*]], i1 [[C1:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[OBJ:%.*]] = alloca [5 x i8], align 1
+; CHECK-NEXT:    [[PTR_SLIDE:%.*]] = getelementptr i8, ptr [[OBJ]], i32 -65526
+; CHECK-NEXT:    ret i32 0
+;
+entry:
+  %obj = alloca [5 x i8]
+  %ptr.slide = getelementptr i8, ptr %obj, i32 -65526
+  %objsize_max = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 false, i1 true, i1 false)
+  %objsize_min = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 true, i1 true, i1 false)
+  %res = select i1 %c1, i32 %objsize_max, i32 %objsize_min
+  ret i32 %res
+}
+
+; In this test the index will be out-of-bounds (both 10 and -2 are
+; out-of-bounds), but the current analysis won't detect that. The analysis
+; finds that %offset is in the range [-2, 10], which includes valid
+; offsets that aren't out-of-bounds. Therefore we can expect to get -1 for
+; %objsize_max, even though a more advanced analysis could derive that we
+; are out-of-bounds (returning 0 also for %objsize_max).
+define i32 @out_of_bounds_gep_i16_pos_neg(i1 %c0, i1 %c1) {
+; CHECK-LABEL: define i32 @out_of_bounds_gep_i16_pos_neg(
+; CHECK-SAME: i1 [[C0:%.*]], i1 [[C1:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[OBJ:%.*]] = alloca [5 x i8], align 1
+; CHECK-NEXT:    [[OFFSET:%.*]] = select i1 [[C0]], i32 10, i32 -2
+; CHECK-NEXT:    [[PTR_SLIDE:%.*]] = getelementptr i8, ptr [[OBJ]], i32 [[OFFSET]]
+; CHECK-NEXT:    [[RES:%.*]] = select i1 [[C1]], i32 -1, i32 0
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  %obj = alloca [5 x i8]
+  %offset = select i1 %c0, i32 10, i32 -2
+  %ptr.slide = getelementptr i8, ptr %obj, i32 %offset
+  %objsize_max = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 false, i1 true, i1 false)
+  %objsize_min = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 true, i1 true, i1 false)
+  %res = select i1 %c1, i32 %objsize_max, i32 %objsize_min
+  ret i32 %res
+}
+
+; With a 16-bit index size, %offset is either 32767 or -32768. Thus, when
+; aggregating the possible offsets we know that they are in the range [-32768,
+; 32767], which includes valid offsets that aren't out-of-bounds. This is
+; similar to the out_of_bounds_gep_i16_pos_neg test above, and the current
+; implementation is expected to derive the result -1 for %objsize_max.
+define i32 @out_of_bounds_gep_i32_trunc_select(i1 %c0, i1 %c1) {
+; CHECK-LABEL: define i32 @out_of_bounds_gep_i32_trunc_select(
+; CHECK-SAME: i1 [[C0:%.*]], i1 [[C1:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[OBJ:%.*]] = alloca [5 x i8], align 1
+; CHECK-NEXT:    [[OFFSET:%.*]] = select i1 [[C0]], i32 32767, i32 32768
+; CHECK-NEXT:    [[PTR_SLIDE:%.*]] = getelementptr i8, ptr [[OBJ]], i32 [[OFFSET]]
+; CHECK-NEXT:    [[RES:%.*]] = select i1 [[C1]], i32 -1, i32 0
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  %obj = alloca [5 x i8]
+  %offset = select i1 %c0, i32 32767, i32 32768
+  %ptr.slide = getelementptr i8, ptr %obj, i32 %offset
+  %objsize_max = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 false, i1 true, i1 false)
+  %objsize_min = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 true, i1 true, i1 false)
+  %res = select i1 %c1, i32 %objsize_max, i32 %objsize_min
+  ret i32 %res
+}
+
+; FIXME: Is 3 really correct for %objsize_min here?
+define i32 @possible_out_of_bounds_gep_i8_neg(i1 %c0, i1 %c1) {
+; CHECK-LABEL: define i32 @possible_out_of_bounds_gep_i8_neg(
+; CHECK-SAME: i1 [[C0:%.*]], i1 [[C1:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[OBJ:%.*]] = alloca [5 x i8], align 1
+; CHECK-NEXT:    [[OFFSET:%.*]] = select i1 [[C0]], i8 2, i8 -10
+; CHECK-NEXT:    [[PTR_SLIDE:%.*]] = getelementptr i8, ptr [[OBJ]], i8 [[OFFSET]]
+; CHECK-NEXT:    [[RES:%.*]] = select i1 [[C1]], i32 -1, i32 3
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  %obj = alloca [5 x i8]
+  %offset = select i1 %c0, i8 2, i8 -10
+  %ptr.slide = getelementptr i8, ptr %obj, i8 %offset
+  %objsize_max = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 false, i1 true, i1 false)
+  %objsize_min = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 true, i1 true, i1 false)
+  %res = select i1 %c1, i32 %objsize_max, i32 %objsize_min
+  ret i32 %res
+}
+
+; FIXME: Is 3 really correct for %objsize_min here?
+define i32 @possible_out_of_bounds_gep_i16_neg(i1 %c0, i1 %c1) {
+; CHECK-LABEL: define i32 @possible_out_of_bounds_gep_i16_neg(
+; CHECK-SAME: i1 [[C0:%.*]], i1 [[C1:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[OBJ:%.*]] = alloca [5 x i8], align 1
+; CHECK-NEXT:    [[OFFSET:%.*]] = select i1 [[C0]], i16 2, i16 -10
+; CHECK-NEXT:    [[PTR_SLIDE:%.*]] = getelementptr i8, ptr [[OBJ]], i16 [[OFFSET]]
+; CHECK-NEXT:    [[RES:%.*]] = select i1 [[C1]], i32 -1, i32 3
+; CHECK-NEXT:    ret i32 [[RES]]
+;
+entry:
+  %obj = alloca [5 x i8]
+  %offset = select i1 %c0, i16 2, i16 -10
+  %ptr.slide = getelementptr i8, ptr %obj, i16 %offset
+  %objsize_max = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 false, i1 true, i1 false)
+  %objsize_min = call i32 @llvm.objectsize.i32.p0(ptr %ptr.slide, i1 true, i1 true, i1 false)
+  %res = select i1 %c1, i32 %objsize_max, i32 %objsize_min
+  ret i32 %res
+}
diff --git a/llvm/test/Transforms/LowerConstantIntrinsics/builtin-object-size-phi.ll b/llvm/test/Transforms/LowerConstantIntrinsics/builtin-object-size-phi.ll
index 564311da64a81..d294650d7f3e2 100644
--- a/llvm/test/Transforms/LowerConstantIntrinsics/builtin-object-size-phi.ll
+++ b/llvm/test/Transforms/LowerConstantIntrinsics/builtin-object-size-phi.ll
@@ -200,8 +200,12 @@ if.end:
   ret i64 %size
 }
 
-define i64 @pick_negative_offset_different_width(i32 %n) {
-; CHECK-LABEL: @pick_negative_offset_different_width(
+; FIXME: The result here looks weird. Either we reference into buffer0 with
+;        an oob offset, or we reference buffer1 (8 bytes) with a 4-byte
+;        offset. The result 5 is wrong in both cases. It would probably be
+;        better to return -1 here since we do not know if we have an oob pointer.
+define i64 @pick_negative_offset_different_width_index_maybe_too_small(i32 %n, i1 %c) {
+; CHECK-LABEL: @pick_negative_offset_different_width_index_maybe_too_small(
 ; CHECK-NEXT:  entry:
 ; CHECK-NEXT:    [[BUFFER0:%.*]] = alloca i8, i64 4, align 1
 ; CHECK-NEXT:    [[BUFFER1:%.*]] = alloca i8, i64 8, align 1
@@ -216,7 +220,8 @@ define i64 @pick_negative_offset_different_width(i32 %n) {
 ; CHECK:       if.end:
 ; CHECK-NEXT:    [[P:%.*]] = phi ptr [ [[OFFSETED0]], [[IF_ELSE]] ], [ [[OFFSETED1]], [[IF_END]] ]
 ; CHECK-NEXT:    [[POFFSETED:%.*]] = getelementptr i8, ptr [[P]], i64 -2
-; CHECK-NEXT:    ret i64 5
+; CHECK-NEXT:    [[SIZE:%.*]] = select i1 [[C:%.*]], i64 5, i64 0
+; CHECK-NEXT:    ret i64 [[SIZE]]
 ;
 entry:
   %buffer0 = alloca i8, i64 4
@@ -235,7 +240,51 @@ if.else:
 if.end:
   %p = phi ptr [ %offseted0, %if.then ], [ %offseted1, %if.else ]
   %poffseted = getelementptr i8, ptr %p, i64 -2
-  %size = call i64 @llvm.objectsize.i64.p0(ptr %poffseted, i1 false, i1 false, i1 false)
+  %sizemax = call i64 @llvm.objectsize.i64.p0(ptr %poffseted, i1 false, i1 false, i1 false)
+  %sizemin = call i64 @llvm.objectsize.i64.p0(ptr %poffseted, i1 true, i1 false, i1 false)
+  %size = select i1 %c, i64 %sizemax, i64 %sizemin
+  ret i64 %size
+}
+
+define i64 @pick_negative_offset_different_width_index_maybe_too_large(i32 %n, i1 %c) {
+; CHECK-LABEL: @pick_negative_offset_different_width_index_maybe_too_large(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[BUFFER0:%.*]] = alloca i8, i64 4, align 1
+; CHECK-NEXT:    [[BUFFER1:%.*]] = alloca i8, i64 8, align 1
+; CHECK-NEXT:    [[COND:%.*]] = icmp eq i32 [[N:%.*]], 0
+; CHECK-NEXT:    br i1 [[COND]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
+; CHECK:       if.then:
+; CHECK-NEXT:    [[OFFSETED0:%.*]] = getelementptr i8, ptr [[BUFFER0]], i64 1
+; CHECK-NEXT:    br label [[IF_END:%.*]]
+; CHECK:       if.else:
+; CHECK-NEXT:    [[OFFSETED1:%.*]] = getelementptr i8, ptr [[BUFFER1]], i64 6
+; CHECK-NEXT:    br label [[IF_END]]
+; CHECK:       if.end:
+; CHECK-NEXT:    [[P:%.*]] = phi ptr [ [[OFFSETED0]], [[IF_THEN]] ], [ [[OFFSETED1]], [[IF_ELSE]] ]
+; CHECK-NEXT:    [[POFFSETED:%.*]] = getelementptr i8, ptr [[P]], i64 2
+; CHECK-NEXT:    [[SIZE:%.*]] = select i1 [[C:%.*]], i64 1, i64 0
+; CHECK-NEXT:    ret i64 [[SIZE]]
+;
+entry:
+  %buffer0 = alloca i8, i64 4
+  %buffer1 = alloca i8, i64 8
+  %cond = icmp eq i32 %n, 0
+  br i1 %cond, label %if.then, label %if.else
+
+if.then:
+  %offseted0 = getelementptr i8, ptr %buffer0, i64 1
+  br label %if.end
+
+if.else:
+  %offseted1 = getelementptr i8, ptr %buffer1, i64 6
+  br label %if.end
+
+if.end:
+  %p = phi ptr [ %offseted0, %if.then ], [ %offseted1, %if.else ]
+  %poffseted = getelementptr i8, ptr %p, i64 2
+  %sizemax = call i64 @llvm.objectsize.i64.p0(ptr %poffseted, i1 false, i1 false, i1 false)
+  %sizemin = call i64 @llvm.objectsize.i64.p0(ptr %poffseted, i1 true, i1 false, i1 false)
+  %size = select i1 %c, i64 %sizemax, i64 %sizemin
   ret i64 %size
 }
 
diff --git a/llvm/test/Transforms/LowerTypeTests/function.ll b/llvm/test/Transforms/LowerTypeTests/function.ll
index ab3cfb6acccf8..fa7c7bbcdabd3 100644
--- a/llvm/test/Transforms/LowerTypeTests/function.ll
+++ b/llvm/test/Transforms/LowerTypeTests/function.ll
@@ -1,7 +1,7 @@
-; RUN: opt -S -passes=lowertypetests -mtriple=i686-unknown-linux-gnu %s | FileCheck --check-prefixes=X86,X86-LINUX,NATIVE,JT8 %s
-; RUN: opt -S -passes=lowertypetests -mtriple=x86_64-unknown-linux-gnu %s | FileCheck --check-prefixes=X86,X86-LINUX,NATIVE,JT8 %s
-; RUN: opt -S -passes=lowertypetests -mtriple=i686-pc-win32 %s | FileCheck --check-prefixes=X86,X86-WIN32,NATIVE,JT8 %s
-; RUN: opt -S -passes=lowertypetests -mtriple=x86_64-pc-win32 %s | FileCheck --check-prefixes=X86,X86-WIN32,NATIVE,JT8 %s
+; RUN: opt -S -passes=lowertypetests -mtriple=i686-unknown-linux-gnu %s | FileCheck --check-prefixes=X86,NATIVE,JT8 %s
+; RUN: opt -S -passes=lowertypetests -mtriple=x86_64-unknown-linux-gnu %s | FileCheck --check-prefixes=X86,NATIVE,JT8 %s
+; RUN: opt -S -passes=lowertypetests -mtriple=i686-pc-win32 %s | FileCheck --check-prefixes=X86,NATIVE,JT8 %s
+; RUN: opt -S -passes=lowertypetests -mtriple=x86_64-pc-win32 %s | FileCheck --check-prefixes=X86,NATIVE,JT8 %s
 ; RUN: opt -S -passes=lowertypetests -mtriple=riscv32-unknown-linux-gnu %s | FileCheck --check-prefixes=RISCV,NATIVE,JT8 %s
 ; RUN: opt -S -passes=lowertypetests -mtriple=riscv64-unknown-linux-gnu %s | FileCheck --check-prefixes=RISCV,NATIVE,JT8 %s
 ; RUN: opt -S -passes=lowertypetests -mtriple=wasm32-unknown-unknown %s | FileCheck --check-prefix=WASM32 %s
@@ -114,8 +114,7 @@ define i1 @foo(ptr %p) {
 
 ; NATIVE-SAME: "s"(ptr @g.cfi)
 
-; X86-LINUX: attributes #[[ATTR]] = { naked nocf_check noinline }
-; X86-WIN32: attributes #[[ATTR]] = { nocf_check noinline }
+; X86: attributes #[[ATTR]] = { naked nocf_check noinline }
 ; ARM: attributes #[[ATTR]] = { naked noinline
 ; THUMB: attributes #[[ATTR]] = { naked noinline "target-cpu"="cortex-a8" "target-features"="+thumb-mode" }
 ; THUMBV6M: attributes #[[ATTR]] = { naked noinline "target-features"="+thumb-mode" }
diff --git a/llvm/test/Transforms/OpenMP/parallel_region_merging.ll b/llvm/test/Transforms/OpenMP/parallel_region_merging.ll
index 83452e72b56b9..1bbac5cc3154b 100644
--- a/llvm/test/Transforms/OpenMP/parallel_region_merging.ll
+++ b/llvm/test/Transforms/OpenMP/parallel_region_merging.ll
@@ -4880,6 +4880,8 @@ entry:
 ; CHECK2:       omp.par.merged.split:
 ; CHECK2-NEXT:    br label [[OMP_REGION_BODY_SPLIT:%.*]]
 ; CHECK2:       omp_region.body.split:
+; CHECK2-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+; CHECK2:       omp_region.finalize:
 ; CHECK2-NEXT:    call void @__kmpc_end_master(ptr @[[GLOB2]], i32 [[OMP_GLOBAL_THREAD_NUM]])
 ; CHECK2-NEXT:    br label [[OMP_REGION_END]]
 ; CHECK2:       omp.par.exit.exitStub:
@@ -4974,6 +4976,8 @@ entry:
 ; CHECK2:       omp.par.merged.split:
 ; CHECK2-NEXT:    br label [[OMP_REGION_BODY_SPLIT:%.*]]
 ; CHECK2:       omp_region.body.split:
+; CHECK2-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+; CHECK2:       omp_region.finalize:
 ; CHECK2-NEXT:    call void @__kmpc_end_master(ptr @[[GLOB2]], i32 [[OMP_GLOBAL_THREAD_NUM]])
 ; CHECK2-NEXT:    br label [[OMP_REGION_END]]
 ; CHECK2:       omp.par.exit.exitStub:
@@ -5070,6 +5074,8 @@ entry:
 ; CHECK2:       omp.par.merged.split:
 ; CHECK2-NEXT:    br label [[OMP_REGION_BODY_SPLIT:%.*]]
 ; CHECK2:       omp_region.body.split:
+; CHECK2-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+; CHECK2:       omp_region.finalize:
 ; CHECK2-NEXT:    call void @__kmpc_end_master(ptr @[[GLOB2]], i32 [[OMP_GLOBAL_THREAD_NUM]])
 ; CHECK2-NEXT:    br label [[OMP_REGION_END]]
 ; CHECK2:       omp.par.exit.exitStub:
@@ -5157,6 +5163,8 @@ entry:
 ; CHECK2:       omp.par.merged.split:
 ; CHECK2-NEXT:    br label [[OMP_REGION_BODY_SPLIT:%.*]]
 ; CHECK2:       omp_region.body.split:
+; CHECK2-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+; CHECK2:       omp_region.finalize:
 ; CHECK2-NEXT:    call void @__kmpc_end_master(ptr @[[GLOB2]], i32 [[OMP_GLOBAL_THREAD_NUM]])
 ; CHECK2-NEXT:    br label [[OMP_REGION_END]]
 ; CHECK2:       omp.par.exit.exitStub:
@@ -5254,6 +5262,8 @@ entry:
 ; CHECK2:       omp.par.merged.split:
 ; CHECK2-NEXT:    br label [[OMP_REGION_BODY_SPLIT:%.*]]
 ; CHECK2:       omp_region.body.split:
+; CHECK2-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+; CHECK2:       omp_region.finalize:
 ; CHECK2-NEXT:    call void @__kmpc_end_master(ptr @[[GLOB2]], i32 [[OMP_GLOBAL_THREAD_NUM]])
 ; CHECK2-NEXT:    br label [[OMP_REGION_END]]
 ; CHECK2:       omp.par.exit.exitStub:
@@ -5434,6 +5444,8 @@ entry:
 ; CHECK2:       omp.par.merged.split:
 ; CHECK2-NEXT:    br label [[OMP_REGION_BODY_SPLIT:%.*]]
 ; CHECK2:       omp_region.body.split:
+; CHECK2-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+; CHECK2:       omp_region.finalize:
 ; CHECK2-NEXT:    call void @__kmpc_end_master(ptr @[[GLOB2]], i32 [[OMP_GLOBAL_THREAD_NUM]])
 ; CHECK2-NEXT:    br label [[OMP_REGION_END]]
 ; CHECK2:       omp.par.exit.exitStub:
@@ -5624,8 +5636,10 @@ entry:
 ; CHECK2:       omp.par.region.split:
 ; CHECK2-NEXT:    br label [[OMP_PAR_PRE_FINALIZE:%.*]]
 ; CHECK2:       omp.par.pre_finalize:
-; CHECK2-NEXT:    br label [[OMP_PAR_OUTLINED_EXIT_EXITSTUB:%.*]]
-; CHECK2:       omp_region.body5:
+; CHECK2-NEXT:    br label [[FINI:%.*]]
+; CHECK2:       .fini:
+; CHECK2-NEXT:    br label [[OMP_PAR_EXIT_EXITSTUB:.*]]
+; CHECK2:       omp_region.body6:
 ; CHECK2-NEXT:    br label [[SEQ_PAR_MERGED2:%.*]]
 ; CHECK2:       seq.par.merged2:
 ; CHECK2-NEXT:    [[ADD_SEQ_OUTPUT_LOAD:%.*]] = load i32, ptr [[LOADGEP_ADD_SEQ_OUTPUT_ALLOC]], align 4
@@ -5634,7 +5648,9 @@ entry:
 ; CHECK2-NEXT:    br label [[OMP_PAR_MERGED_SPLIT_SPLIT_SPLIT:%.*]]
 ; CHECK2:       omp.par.merged.split.split.split:
 ; CHECK2-NEXT:    br label [[OMP_REGION_BODY5_SPLIT:%.*]]
-; CHECK2:       omp_region.body5.split:
+; CHECK2:       omp_region.body6.split:
+; CHECK2-NEXT:    br label [[OMP_REGION_FINALIZE5:%.*]]
+; CHECK2:       omp_region.finalize{{.*}}:
 ; CHECK2-NEXT:    call void @__kmpc_end_master(ptr @[[GLOB2]], i32 [[OMP_GLOBAL_THREAD_NUM3]])
 ; CHECK2-NEXT:    br label [[OMP_REGION_END4]]
 ; CHECK2:       omp_region.body:
@@ -5646,6 +5662,8 @@ entry:
 ; CHECK2:       omp.par.merged.split:
 ; CHECK2-NEXT:    br label [[OMP_REGION_BODY_SPLIT:%.*]]
 ; CHECK2:       omp_region.body.split:
+; CHECK2-NEXT:    br label [[OMP_REGION_FINALIZE:%.*]]
+; CHECK2:       omp_region.finalize:
 ; CHECK2-NEXT:    call void @__kmpc_end_master(ptr @[[GLOB2]], i32 [[OMP_GLOBAL_THREAD_NUM]])
 ; CHECK2-NEXT:    br label [[OMP_REGION_END]]
 ; CHECK2:       omp.par.exit.exitStub:
diff --git a/llvm/test/Transforms/SCCP/get_vector_length-intrinsic.ll b/llvm/test/Transforms/SCCP/get_vector_length-intrinsic.ll
new file mode 100644
index 0000000000000..d0741161e729e
--- /dev/null
+++ b/llvm/test/Transforms/SCCP/get_vector_length-intrinsic.ll
@@ -0,0 +1,147 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 6
+; RUN: opt < %s -p sccp -S | FileCheck %s
+
+define i1 @result_le_count() {
+; CHECK-LABEL: define i1 @result_le_count() {
+; CHECK-NEXT:    ret i1 true
+;
+  %x = call i32 @llvm.experimental.get.vector.length(i32 3, i32 4, i1 false)
+  %res = icmp ule i32 %x, 3
+  ret i1 %res
+}
+
+define i1 @result_le_max_lanes(i32 %count) {
+; CHECK-LABEL: define i1 @result_le_max_lanes(
+; CHECK-SAME: i32 [[COUNT:%.*]]) {
+; CHECK-NEXT:    [[X:%.*]] = call i32 @llvm.experimental.get.vector.length.i32(i32 [[COUNT]], i32 3, i1 false)
+; CHECK-NEXT:    ret i1 true
+;
+  %x = call i32 @llvm.experimental.get.vector.length(i32 %count, i32 3, i1 false)
+  %res = icmp ule i32 %x, 3
+  ret i1 %res
+}
+
+define i1 @result_le_max_lanes_scalable(i32 %count) vscale_range(2, 4) {
+; CHECK-LABEL: define i1 @result_le_max_lanes_scalable(
+; CHECK-SAME: i32 [[COUNT:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:    [[X:%.*]] = call i32 @llvm.experimental.get.vector.length.i32(i32 [[COUNT]], i32 4, i1 true)
+; CHECK-NEXT:    ret i1 true
+;
+  %x = call i32 @llvm.experimental.get.vector.length(i32 %count, i32 4, i1 true)
+  %res = icmp ule i32 %x, 16
+  ret i1 %res
+}
+
+define i32 @count_le_max_lanes() {
+; CHECK-LABEL: define i32 @count_le_max_lanes() {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    br label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret i32 4
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [4, %entry], [%iv.next, %loop]
+  %x = call i32 @llvm.experimental.get.vector.length(i32 %iv, i32 4, i1 false)
+  %iv.next = sub i32 %iv, %x
+  %ec = icmp eq i32 %iv.next, 0
+  br i1 %ec, label %exit, label %loop
+
+exit:
+  ret i32 %x
+}
+
+; Can't simplify because %iv isn't <= max lanes.
+define i32 @count_not_le_max_lanes() {
+; CHECK-LABEL: define range(i32 0, 5) i32 @count_not_le_max_lanes() {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ 6, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[X:%.*]] = call i32 @llvm.experimental.get.vector.length.i32(i32 [[IV]], i32 4, i1 false)
+; CHECK-NEXT:    [[IV_NEXT]] = sub i32 [[IV]], [[X]]
+; CHECK-NEXT:    [[EC:%.*]] = icmp eq i32 [[IV_NEXT]], 0
+; CHECK-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret i32 [[X]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [6, %entry], [%iv.next, %loop]
+  %x = call i32 @llvm.experimental.get.vector.length(i32 %iv, i32 4, i1 false)
+  %iv.next = sub i32 %iv, %x
+  %ec = icmp eq i32 %iv.next, 0
+  br i1 %ec, label %exit, label %loop
+
+exit:
+  ret i32 %x
+}
+
+define i32 @count_le_max_lanes_scalable_known() vscale_range(4, 8) {
+; CHECK-LABEL: define i32 @count_le_max_lanes_scalable_known(
+; CHECK-SAME: ) #[[ATTR1:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    br label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret i32 16
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [16, %entry], [%iv.next, %loop]
+  %x = call i32 @llvm.experimental.get.vector.length(i32 %iv, i32 4, i1 true)
+  %iv.next = sub i32 %iv, %x
+  %ec = icmp eq i32 %iv.next, 0
+  br i1 %ec, label %exit, label %loop
+
+exit:
+  ret i32 %x
+}
+
+; Can't simplify because %iv isn't guaranteed <= max lanes.
+define i32 @count_le_max_lanes_scalable_unknown() {
+; CHECK-LABEL: define range(i32 0, -1) i32 @count_le_max_lanes_scalable_unknown() {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ 16, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[X:%.*]] = call i32 @llvm.experimental.get.vector.length.i32(i32 [[IV]], i32 4, i1 true)
+; CHECK-NEXT:    [[IV_NEXT]] = sub i32 [[IV]], [[X]]
+; CHECK-NEXT:    [[EC:%.*]] = icmp eq i32 [[IV_NEXT]], 0
+; CHECK-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret i32 [[X]]
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [16, %entry], [%iv.next, %loop]
+  %x = call i32 @llvm.experimental.get.vector.length(i32 %iv, i32 4, i1 true)
+  %iv.next = sub i32 %iv, %x
+  %ec = icmp eq i32 %iv.next, 0
+  br i1 %ec, label %exit, label %loop
+
+exit:
+  ret i32 %x
+}
+
+define i1 @result_le_overflow() {
+; CHECK-LABEL: define i1 @result_le_overflow() {
+; CHECK-NEXT:    [[X:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 4294967296, i32 4, i1 false)
+; CHECK-NEXT:    [[RES:%.*]] = icmp ule i32 [[X]], 3
+; CHECK-NEXT:    ret i1 [[RES]]
+;
+  %x = call i32 @llvm.experimental.get.vector.length(i64 u0x100000000, i32 4, i1 false)
+  %res = icmp ule i32 %x, 3
+  ret i1 %res
+}
diff --git a/llvm/test/Transforms/SLPVectorizer/X86/debug-counter.ll b/llvm/test/Transforms/SLPVectorizer/X86/debug-counter.ll
index b42f36f105636..48f7ce2d58c42 100644
--- a/llvm/test/Transforms/SLPVectorizer/X86/debug-counter.ll
+++ b/llvm/test/Transforms/SLPVectorizer/X86/debug-counter.ll
@@ -3,7 +3,6 @@
 ; RUN: opt -S -passes=slp-vectorizer -mtriple=x86_64-unknown-linux -debug-counter=slp-vectorized=1 -slp-threshold=-99999 < %s | FileCheck %s --check-prefix=COUNT1
 ; RUN: opt -S -passes=slp-vectorizer -mtriple=x86_64-unknown-linux -debug-counter=slp-vectorized=2 -slp-threshold=-99999 < %s | FileCheck %s --check-prefix=COUNT2
 ; RUN: opt -S -passes=slp-vectorizer -mtriple=x86_64-unknown-linux -debug-counter=slp-vectorized=0-1 -slp-threshold=-99999 < %s | FileCheck %s --check-prefix=COUNT-1
-; REQUIRES: asserts
 
 define void @blam(ptr %arg, double %load2, i1 %fcmp3) {
 ; COUNT0-LABEL: define void @blam
diff --git a/llvm/test/Transforms/SimplifyCFG/skip-merging-duplicate-convergence-instrinsics.ll b/llvm/test/Transforms/SimplifyCFG/skip-merging-duplicate-convergence-instrinsics.ll
new file mode 100644
index 0000000000000..368ae96d0c3c2
--- /dev/null
+++ b/llvm/test/Transforms/SimplifyCFG/skip-merging-duplicate-convergence-instrinsics.ll
@@ -0,0 +1,68 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt < %s -S -passes=simplifycfg | FileCheck %s
+
+declare token @llvm.experimental.convergence.entry() #0
+
+define void @nested(i32 %tidx, i32 %tidy, ptr %array) #0 {
+; CHECK-LABEL: @nested(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[TMP0:%.*]] = tail call token @llvm.experimental.convergence.entry()
+; CHECK-NEXT:    [[TMP1:%.*]] = or i32 [[TIDY:%.*]], [[TIDX:%.*]]
+; CHECK-NEXT:    [[OR_COND_I:%.*]] = icmp eq i32 [[TMP1]], 0
+; CHECK-NEXT:    br label [[FOR_COND_I:%.*]]
+; CHECK:       for.cond.i:
+; CHECK-NEXT:    [[TMP2:%.*]] = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token [[TMP0]]) ]
+; CHECK-NEXT:    br label [[FOR_COND1_I:%.*]]
+; CHECK:       for.cond1.i:
+; CHECK-NEXT:    [[CMP2_I:%.*]] = phi i1 [ false, [[FOR_BODY4_I:%.*]] ], [ true, [[FOR_COND_I]] ]
+; CHECK-NEXT:    [[TMP3:%.*]] = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token [[TMP2]]) ]
+; CHECK-NEXT:    br i1 [[CMP2_I]], label [[FOR_BODY4_I]], label [[EXIT:%.*]]
+; CHECK:       for.body4.i:
+; CHECK-NEXT:    br i1 [[OR_COND_I]], label [[IF_THEN_I:%.*]], label [[FOR_COND1_I]]
+; CHECK:       if.then.i:
+; CHECK-NEXT:    [[TEST_VAL:%.*]] = call spir_func i32 @func_test(i32 0) [ "convergencectrl"(token [[TMP3]]) ]
+; CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i32, ptr [[ARRAY:%.*]], i32 0
+; CHECK-NEXT:    store i32 [[TEST_VAL]], ptr [[TMP4]], align 4
+; CHECK-NEXT:    br label [[EXIT]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %0 = tail call token @llvm.experimental.convergence.entry()
+  %2 = or i32 %tidy, %tidx
+  %or.cond.i = icmp eq i32 %2, 0
+  br label %for.cond.i
+
+for.cond.i:
+  %3 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %0) ]
+  br label %for.cond1.i
+
+for.cond1.i:
+  %cmp2.i = phi i1 [ false, %for.body4.i ], [ true, %for.cond.i ]
+  %4 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %3) ]
+  br i1 %cmp2.i, label %for.body4.i, label %cleanup.i.loopexit
+
+for.body4.i:
+  br i1 %or.cond.i, label %if.then.i, label %for.cond1.i
+
+if.then.i:
+  %test.val = call spir_func i32 @func_test(i32 0) [ "convergencectrl"(token %4) ]
+  %5 = getelementptr inbounds i32, ptr %array, i32 0
+  store i32 %test.val, ptr %5, align 4
+  br label %cleanup.i
+
+cleanup.i.loopexit:
+  br label %cleanup.i
+
+cleanup.i:
+  br label %exit
+
+exit:
+  ret void
+}
+
+declare token @llvm.experimental.convergence.loop() #0
+
+declare i32 @func_test(i32) #0
+
+attributes #0 = { convergent }
diff --git a/llvm/test/Transforms/Util/assume-builder-counter.ll b/llvm/test/Transforms/Util/assume-builder-counter.ll
index c11a69a2c3cd7..f33ee8b98a0a6 100644
--- a/llvm/test/Transforms/Util/assume-builder-counter.ll
+++ b/llvm/test/Transforms/Util/assume-builder-counter.ll
@@ -1,5 +1,4 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --function-signature
-; REQUIRES: asserts
 
 ; RUN: opt -passes='assume-builder,verify' --enable-knowledge-retention --debug-counter=assume-builder-counter=5 -S %s | FileCheck %s --check-prefixes=COUNTER1
 ; RUN: opt -passes='assume-builder,verify' --enable-knowledge-retention --debug-counter=assume-builder-counter=1-3 -S %s | FileCheck %s --check-prefixes=COUNTER2
diff --git a/llvm/test/Transforms/VectorCombine/X86/shuffle-of-fma-const.ll b/llvm/test/Transforms/VectorCombine/X86/shuffle-of-fma-const.ll
new file mode 100644
index 0000000000000..b05f851a846f4
--- /dev/null
+++ b/llvm/test/Transforms/VectorCombine/X86/shuffle-of-fma-const.ll
@@ -0,0 +1,48 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4
+; RUN: opt < %s -passes=vector-combine -S -mtriple=x86_64-- -mcpu=x86-64    | FileCheck %s --check-prefixes=CHECK,SSE
+; RUN: opt < %s -passes=vector-combine -S -mtriple=x86_64-- -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=CHECK,AVX
+
+define <4 x float> @shuffle_fma_const_chain(<4 x float> %a0) {
+; CHECK-LABEL: define <4 x float> @shuffle_fma_const_chain(
+; CHECK-SAME: <4 x float> [[A0:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:    [[F:%.*]] = tail call noundef <4 x float> @llvm.fma.v4f32(<4 x float> [[A0]], <4 x float> splat (float 0x3F8DE8D040000000), <4 x float> splat (float 0xBFB3715EE0000000))
+; CHECK-NEXT:    [[RES:%.*]] = shufflevector <4 x float> [[F]], <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+; CHECK-NEXT:    ret <4 x float> [[RES]]
+;
+  %f = tail call noundef <4 x float> @llvm.fma.v4f32(<4 x float> %a0, <4 x float> splat (float 0x3F8DE8D040000000), <4 x float> splat (float 0xBFB3715EE0000000))
+  %res = shufflevector <4 x float> %f, <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+  ret <4 x float> %res
+}
+
+define <8 x float> @concat_fma_const_chain(<4 x float> %a0, <4 x float> %a1) {
+; CHECK-LABEL: define <8 x float> @concat_fma_const_chain(
+; CHECK-SAME: <4 x float> [[A0:%.*]], <4 x float> [[A1:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:    [[TMP1:%.*]] = shufflevector <4 x float> [[A0]], <4 x float> [[A1]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT:    [[RES:%.*]] = call <8 x float> @llvm.fma.v8f32(<8 x float> [[TMP1]], <8 x float> splat (float 0x3F8DE8D040000000), <8 x float> splat (float 0xBFB3715EE0000000))
+; CHECK-NEXT:    ret <8 x float> [[RES]]
+;
+  %l = tail call noundef <4 x float> @llvm.fma.v4f32(<4 x float> %a0, <4 x float> splat (float 0x3F8DE8D040000000), <4 x float> splat (float 0xBFB3715EE0000000))
+  %h = tail call noundef <4 x float> @llvm.fma.v4f32(<4 x float> %a1, <4 x float> splat (float 0x3F8DE8D040000000), <4 x float> splat (float 0xBFB3715EE0000000))
+  %res = shufflevector <4 x float> %l, <4 x float> %h, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  ret <8 x float> %res
+}
+
+define <8 x float> @interleave_fma_const_chain(<4 x float> %a0, <4 x float> %a1) {
+; SSE-LABEL: define <8 x float> @interleave_fma_const_chain(
+; SSE-SAME: <4 x float> [[A0:%.*]], <4 x float> [[A1:%.*]]) #[[ATTR0]] {
+; SSE-NEXT:    [[L:%.*]] = tail call noundef <4 x float> @llvm.fma.v4f32(<4 x float> [[A0]], <4 x float> splat (float 0x3F8DE8D040000000), <4 x float> splat (float 0xBFB3715EE0000000))
+; SSE-NEXT:    [[H:%.*]] = tail call noundef <4 x float> @llvm.fma.v4f32(<4 x float> [[A1]], <4 x float> splat (float 0x3F8DE8D040000000), <4 x float> splat (float 0xBFB3715EE0000000))
+; SSE-NEXT:    [[RES:%.*]] = shufflevector <4 x float> [[L]], <4 x float> [[H]], <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
+; SSE-NEXT:    ret <8 x float> [[RES]]
+;
+; AVX-LABEL: define <8 x float> @interleave_fma_const_chain(
+; AVX-SAME: <4 x float> [[A0:%.*]], <4 x float> [[A1:%.*]]) #[[ATTR0]] {
+; AVX-NEXT:    [[TMP1:%.*]] = shufflevector <4 x float> [[A0]], <4 x float> [[A1]], <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
+; AVX-NEXT:    [[RES:%.*]] = call <8 x float> @llvm.fma.v8f32(<8 x float> [[TMP1]], <8 x float> splat (float 0x3F8DE8D040000000), <8 x float> splat (float 0xBFB3715EE0000000))
+; AVX-NEXT:    ret <8 x float> [[RES]]
+;
+  %l = tail call noundef <4 x float> @llvm.fma.v4f32(<4 x float> %a0, <4 x float> splat (float 0x3F8DE8D040000000), <4 x float> splat (float 0xBFB3715EE0000000))
+  %h = tail call noundef <4 x float> @llvm.fma.v4f32(<4 x float> %a1, <4 x float> splat (float 0x3F8DE8D040000000), <4 x float> splat (float 0xBFB3715EE0000000))
+  %res = shufflevector <4 x float> %l, <4 x float> %h, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
+  ret <8 x float> %res
+}
diff --git a/llvm/test/Transforms/WholeProgramDevirt/calls-to-devirt.ll b/llvm/test/Transforms/WholeProgramDevirt/calls-to-devirt.ll
new file mode 100644
index 0000000000000..7393e5d335816
--- /dev/null
+++ b/llvm/test/Transforms/WholeProgramDevirt/calls-to-devirt.ll
@@ -0,0 +1,79 @@
+; Devirt calls debug counter is not explicitly set. Expect 3 remark messages.
+; RUN: opt -S -passes=wholeprogramdevirt -wholeprogramdevirt-summary-action=import \
+; RUN:   -pass-remarks=wholeprogramdevirt \
+; RUN:   -wholeprogramdevirt-read-summary=%S/Inputs/import-single-impl.yaml \
+; RUN:   -print-debug-counter-queries < %s  2>&1 \
+; RUN:   | grep "remark" | count 3
+; Devirt calls debug counter is set to 1. Expect one remark message.
+; RUN: opt -S -passes=wholeprogramdevirt -wholeprogramdevirt-summary-action=import \
+; RUN:   -pass-remarks=wholeprogramdevirt -debug-counter=calls-to-devirt=0 \
+; RUN:   -wholeprogramdevirt-read-summary=%S/Inputs/import-single-impl.yaml \
+; RUN:   -print-debug-counter-queries < %s  2>&1 \
+; RUN:   | FileCheck --check-prefix=CHECK-SINGLE %s
+; Devirt calls debug counter is set outside the range of calls. Expect no remark message.
+; RUN: opt -S -passes=wholeprogramdevirt -wholeprogramdevirt-summary-action=import \
+; RUN:   -pass-remarks=wholeprogramdevirt -debug-counter=calls-to-devirt=9999 \
+; RUN:   -wholeprogramdevirt-read-summary=%S/Inputs/import-single-impl.yaml \
+; RUN:   -print-debug-counter-queries < %s 2>&1  \
+; RUN:   | FileCheck -implicit-check-not="remark" --check-prefix=CHECK-NONE %s
+
+; CHECK-SINGLE: DebugCounter calls-to-devirt=0 execute
+; CHECK-SINGLE: remark
+; CHECK-SINGLE-SAME: devirtualized a call
+; CHECK-SINGLE: DebugCounter calls-to-devirt=1 skip
+; CHECK-SINGLE: DebugCounter calls-to-devirt=2 skip
+
+; CHECK-NONE: DebugCounter calls-to-devirt=0 skip
+; CHECK-NONE: DebugCounter calls-to-devirt=1 skip
+; CHECK-NONE: DebugCounter calls-to-devirt=2 skip
+
+target datalayout = "e-p:64:64"
+target triple = "x86_64-unknown-linux-gnu"
+
+define i32 @call1(ptr %obj) #0 {
+  %vtable = load ptr, ptr %obj
+  %p = call i1 @llvm.type.test(ptr %vtable, metadata !"typeid1")
+  call void @llvm.assume(i1 %p)
+  %fptr = load ptr, ptr %vtable
+  %result = call i32 %fptr(ptr %obj, i32 1)
+  ret i32 %result
+}
+
+define i1 @call2(ptr %obj, i32 %arg1) #0 {
+  %vtable = load ptr, ptr %obj
+  %pair = call {ptr, i1} @llvm.type.checked.load(ptr %vtable, i32 8, metadata !"typeid2")
+  %fptr = extractvalue {ptr, i1} %pair, 0
+  %p = extractvalue {ptr, i1} %pair, 1
+  br i1 %p, label %cont, label %trap
+
+cont:
+  %result = call i1 %fptr(ptr %obj, i32 %arg1)
+  ret i1 %result
+
+trap:
+  call void @llvm.trap()
+  unreachable
+}
+
+define i1 @call3(ptr %obj) #0 {
+  %vtable = load ptr, ptr %obj
+  %pair = call {ptr, i1} @llvm.type.checked.load(ptr %vtable, i32 8, metadata !"typeid2")
+  %fptr = extractvalue {ptr, i1} %pair, 0
+  %p = extractvalue {ptr, i1} %pair, 1
+  br i1 %p, label %cont, label %trap
+
+cont:
+  %result = call i1 %fptr(ptr %obj, i32 3)
+  ret i1 %result
+
+trap:
+  call void @llvm.trap()
+  unreachable
+}
+
+declare void @llvm.assume(i1)
+declare void @llvm.trap()
+declare {ptr, i1} @llvm.type.checked.load(ptr, i32, metadata)
+declare i1 @llvm.type.test(ptr, metadata)
+
+attributes #0 = { "target-features"="+retpoline" }
diff --git a/llvm/test/Transforms/WholeProgramDevirt/import-indir.ll b/llvm/test/Transforms/WholeProgramDevirt/import-indir.ll
index e4d6f1d52b540..2c33059b6e126 100644
--- a/llvm/test/Transforms/WholeProgramDevirt/import-indir.ll
+++ b/llvm/test/Transforms/WholeProgramDevirt/import-indir.ll
@@ -92,7 +92,7 @@ define i1 @f1(ptr %obj) {
 }
 
 ; CHECK: define i1 @f2
-define i1 @f2(ptr %obj) {
+define i1 @f2(ptr %obj, i32 %arg1) {
   %vtable = load ptr, ptr %obj
   %pair = call {ptr, i1} @llvm.type.checked.load(ptr %vtable, i32 4, metadata !"typeid1")
   %fptr = extractvalue {ptr, i1} %pair, 0
@@ -103,7 +103,7 @@ define i1 @f2(ptr %obj) {
 
 cont:
   ; CHECK: call i1 %
-  %result = call i1 %fptr(ptr %obj, i32 undef)
+  %result = call i1 %fptr(ptr %obj, i32 %arg1)
   ret i1 %result
 
 trap:
diff --git a/llvm/test/Transforms/WholeProgramDevirt/import.ll b/llvm/test/Transforms/WholeProgramDevirt/import.ll
index de25bc10a7c12..812ffbdf7f3fb 100644
--- a/llvm/test/Transforms/WholeProgramDevirt/import.ll
+++ b/llvm/test/Transforms/WholeProgramDevirt/import.ll
@@ -8,12 +8,6 @@
 ; RUN: opt -S -passes=wholeprogramdevirt -wholeprogramdevirt-summary-action=import -wholeprogramdevirt-read-summary=%S/Inputs/import-vcp-branch-funnel.yaml < %s | FileCheck --check-prefixes=CHECK,VCP,VCP-X86,VCP64,BRANCH-FUNNEL %s
 ; RUN: opt -S -passes=wholeprogramdevirt -wholeprogramdevirt-summary-action=import -wholeprogramdevirt-read-summary=%S/Inputs/import-branch-funnel.yaml < %s | FileCheck --check-prefixes=CHECK,BRANCH-FUNNEL,BRANCH-FUNNEL-NOVCP %s
 
-; Cutoff value is not explicitly set. Expect 3 remark messages.
-; RUN: opt -S -passes=wholeprogramdevirt -wholeprogramdevirt-summary-action=import -pass-remarks=wholeprogramdevirt -wholeprogramdevirt-read-summary=%S/Inputs/import-single-impl.yaml < %s  2>&1 | grep "single-impl" | count 3
-; Cutoff value is set to 1. Expect one remark messages.
-; RUN: opt -S -passes=wholeprogramdevirt -wholeprogramdevirt-summary-action=import -pass-remarks=wholeprogramdevirt -wholeprogramdevirt-cutoff=1  -wholeprogramdevirt-read-summary=%S/Inputs/import-single-impl.yaml < %s  2>&1 | grep "single-impl" | count 1
-; Cutoff value is explicitly set to zero. Expect no remark message.
-; RUN: opt -S -passes=wholeprogramdevirt -wholeprogramdevirt-summary-action=import -pass-remarks=wholeprogramdevirt -wholeprogramdevirt-cutoff=0  -wholeprogramdevirt-read-summary=%S/Inputs/import-single-impl.yaml < %s 2>&1  | FileCheck -implicit-check-not="remark" %s
 target datalayout = "e-p:64:64"
 target triple = "x86_64-unknown-linux-gnu"
 
@@ -46,7 +40,7 @@ define i32 @call1(ptr %obj) #0 {
 ; constant propagation.
 
 ; CHECK: define i1 @call2
-define i1 @call2(ptr %obj) #0 {
+define i1 @call2(ptr %obj, i32 %arg1) #0 {
   %vtable = load ptr, ptr %obj
   %pair = call {ptr, i1} @llvm.type.checked.load(ptr %vtable, i32 8, metadata !"typeid2")
   %fptr = extractvalue {ptr, i1} %pair, 0
@@ -57,8 +51,8 @@ define i1 @call2(ptr %obj) #0 {
 cont:
   ; SINGLE-IMPL: call i1 @singleimpl2
   ; INDIR: call i1 %
-  ; BRANCH-FUNNEL: call i1 @__typeid_typeid2_8_branch_funnel(ptr nest %vtable, ptr %obj, i32 undef)
-  %result = call i1 %fptr(ptr %obj, i32 undef)
+  ; BRANCH-FUNNEL: call i1 @__typeid_typeid2_8_branch_funnel(ptr nest %vtable, ptr %obj, i32 %arg1)
+  %result = call i1 %fptr(ptr %obj, i32 %arg1)
   ret i1 %result
 
 trap:
diff --git a/llvm/test/Transforms/WholeProgramDevirt/uniform-retval-invoke.ll b/llvm/test/Transforms/WholeProgramDevirt/uniform-retval-invoke.ll
index 88d539294777e..ca42368330fcf 100644
--- a/llvm/test/Transforms/WholeProgramDevirt/uniform-retval-invoke.ll
+++ b/llvm/test/Transforms/WholeProgramDevirt/uniform-retval-invoke.ll
@@ -15,7 +15,7 @@ define i32 @vf2(ptr %this) readnone {
 }
 
 ; CHECK: define i32 @call
-define i32 @call(ptr %obj) personality ptr undef {
+define i32 @call(ptr %obj) personality ptr @__gxx_personality_v0 {
   %vtable = load ptr, ptr %obj
   %p = call i1 @llvm.type.test(ptr %vtable, metadata !"typeid")
   call void @llvm.assume(i1 %p)
@@ -35,5 +35,6 @@ ret:
 
 declare i1 @llvm.type.test(ptr, metadata)
 declare void @llvm.assume(i1)
+declare i32 @__gxx_personality_v0(...)
 
 !0 = !{i32 0, !"typeid"}
diff --git a/llvm/test/tools/llvm-exegesis/RISCV/rvv/filter.test b/llvm/test/tools/llvm-exegesis/RISCV/rvv/filter.test
index 858569e4b0ef5..3815e281876e2 100644
--- a/llvm/test/tools/llvm-exegesis/RISCV/rvv/filter.test
+++ b/llvm/test/tools/llvm-exegesis/RISCV/rvv/filter.test
@@ -1,7 +1,9 @@
+# TODO(mshockwave): We use a fixed seed for this test because snippet
+# generation sometimes fails to produce any snippet when it cannot assign
+# unique def and use registers.
 # RUN: llvm-exegesis -mtriple=riscv64 -mcpu=sifive-x280 -benchmark-phase=assemble-measured-code --mode=inverse_throughput --opcode-name=PseudoVNCLIPU_WX_M1_MASK \
-# RUN:    --riscv-filter-config='vtype = {VXRM: rod, AVL: VLMAX, SEW: e(8|16), Policy: ta/mu}' --max-configs-per-opcode=1000 --min-instructions=10 | FileCheck %s
-# Sometimes it'll fail to generate any snippet because it's unable to assign unique def and use registers.
-# ALLOW_RETRIES: 2
+# RUN:    --riscv-filter-config='vtype = {VXRM: rod, AVL: VLMAX, SEW: e(8|16), Policy: ta/mu}' --max-configs-per-opcode=1000 --min-instructions=10 \
+# RUN:    -random-generator-seed=5 | FileCheck %s
 
 # CHECK: config:          'vtype = {VXRM: rod, AVL: VLMAX, SEW: e8, Policy: ta/mu}'
 # CHECK: config:          'vtype = {VXRM: rod, AVL: VLMAX, SEW: e16, Policy: ta/mu}'
diff --git a/llvm/test/tools/llvm-exegesis/X86/snippet-generator-seed.test b/llvm/test/tools/llvm-exegesis/X86/snippet-generator-seed.test
new file mode 100644
index 0000000000000..54a0e4946fcd8
--- /dev/null
+++ b/llvm/test/tools/llvm-exegesis/X86/snippet-generator-seed.test
@@ -0,0 +1,16 @@
+# REQUIRES: exegesis-can-measure-latency, x86_64-linux
+
+# Check that the snippet we generate is exactly the same between runs when we
+# use a fixed RNG seed.
+
+# RUN: llvm-exegesis -mtriple=x86_64-unknown-unknown -mode=latency -opcode-name=ADD64rr --benchmark-phase=prepare-snippet -random-generator-seed=5 | FileCheck %s
+
+# CHECK: ---
+# CHECK: mode:            latency
+# CHECK: key:
+# CHECK:   instructions:
+# CHECK:     - 'ADD64rr RCX RCX RAX'
+# CHECK:   config:          ''
+# CHECK:   register_initial_values:
+# CHECK:     - 'RCX=0x0'
+# CHECK:     - 'RAX=0x0'
diff --git a/llvm/test/tools/llvm-readobj/COFF/arm64-packed-unwind.s b/llvm/test/tools/llvm-readobj/COFF/arm64-packed-unwind.s
index d9953ccc3f3d8..489d385468b70 100644
--- a/llvm/test/tools/llvm-readobj/COFF/arm64-packed-unwind.s
+++ b/llvm/test/tools/llvm-readobj/COFF/arm64-packed-unwind.s
@@ -105,11 +105,7 @@
 // CHECK-NEXT:     CR: 0
 // CHECK-NEXT:     FrameSize: 112
 // CHECK-NEXT:     Prologue [
-// CHECK-NEXT:       sub sp, sp, #48
-// CHECK-NEXT:       stp x6, x7, [sp, #48]
-// CHECK-NEXT:       stp x4, x5, [sp, #32]
-// CHECK-NEXT:       stp x2, x3, [sp, #16]
-// CHECK-NEXT:       stp x0, x1, [sp, #-64]!
+// CHECK-NEXT:       sub sp, sp, #112
 // CHECK-NEXT:       end
 // CHECK-NEXT:     ]
 // CHECK-NEXT:   }
@@ -139,7 +135,8 @@
 // CHECK-NEXT:     FrameSize: 32
 // CHECK-NEXT:     Prologue [
 // CHECK-NEXT:       sub sp, sp, #16
-// CHECK-NEXT:       INVALID!
+// CHECK-NEXT:       stp x19, lr, [sp]
+// CHECK-NEXT:       sub sp, sp, #16
 // CHECK-NEXT:       end
 // CHECK-NEXT:     ]
 // CHECK-NEXT:   }
@@ -275,6 +272,37 @@
 // CHECK-NEXT:       end
 // CHECK-NEXT:     ]
 // CHECK-NEXT:   }
+// CHECK-NEXT:   RuntimeFunction {
+// CHECK-NEXT:     Function: func17
+// CHECK-NEXT:     Fragment: No
+// CHECK-NEXT:     FunctionLength: 44
+// CHECK-NEXT:     RegF: 0
+// CHECK-NEXT:     RegI: 0
+// CHECK-NEXT:     HomedParameters: Yes
+// CHECK-NEXT:     CR: 3
+// CHECK-NEXT:     FrameSize: 96
+// CHECK-NEXT:     Prologue [
+// CHECK-NEXT:       mov x29, sp
+// CHECK-NEXT:       stp x29, lr, [sp, #-96]!
+// CHECK-NEXT:       end
+// CHECK-NEXT:     ]
+// CHECK-NEXT:   }
+// CHECK-NEXT:   RuntimeFunction {
+// CHECK-NEXT:     Function: func18
+// CHECK-NEXT:     Fragment: No
+// CHECK-NEXT:     FunctionLength: 44
+// CHECK-NEXT:     RegF: 0
+// CHECK-NEXT:     RegI: 0
+// CHECK-NEXT:     HomedParameters: Yes
+// CHECK-NEXT:     CR: 3
+// CHECK-NEXT:     FrameSize: 528
+// CHECK-NEXT:     Prologue [
+// CHECK-NEXT:       mov x29, sp
+// CHECK-NEXT:       stp x29, lr, [sp, #0]
+// CHECK-NEXT:       sub sp, sp, #528
+// CHECK-NEXT:       end
+// CHECK-NEXT:     ]
+// CHECK-NEXT:   }
 // CHECK-NEXT: ]
 
         .text
@@ -295,6 +323,8 @@ func13:
 func14:
 func15:
 func16:
+func17:
+func18:
         ret
 
         .section .pdata,"dr"
@@ -330,3 +360,7 @@ func16:
         .long 0x11820019 // FunctionLength=6  RegF=0 RegI=2 H=0 CR=0 FrameSize=34
         .long func16 at IMGREL
         .long 0x03b00039 // FunctionLength=14 RegF=0 RegI=0 H=1 CR=1 FrameSize=7
+        .long func17 at IMGREL
+        .long 0x0370002d // FunctionLength=11 RegF=0 RegI=0 H=1 CR=3 FrameSize=6
+        .long func18 at IMGREL
+        .long 0x10f0002d // FunctionLength=11 RegF=0 RegI=0 H=1 CR=3 FrameSize=6
diff --git a/llvm/tools/llc/NewPMDriver.cpp b/llvm/tools/llc/NewPMDriver.cpp
index 7ba17e5b82095..6d4989e278fc1 100644
--- a/llvm/tools/llc/NewPMDriver.cpp
+++ b/llvm/tools/llc/NewPMDriver.cpp
@@ -14,8 +14,10 @@
 
 #include "NewPMDriver.h"
 #include "llvm/Analysis/CGSCCPassManager.h"
+#include "llvm/Analysis/RuntimeLibcallInfo.h"
 #include "llvm/Analysis/TargetLibraryInfo.h"
 #include "llvm/CodeGen/CommandFlags.h"
+#include "llvm/CodeGen/LibcallLoweringInfo.h"
 #include "llvm/CodeGen/MIRParser/MIRParser.h"
 #include "llvm/CodeGen/MIRPrinter.h"
 #include "llvm/CodeGen/MachineFunctionAnalysis.h"
@@ -136,6 +138,16 @@ int llvm::compileModuleWithNewPM(
   SI.registerCallbacks(PIC, &MAM);
 
   FAM.registerPass([&] { return TargetLibraryAnalysis(TLII); });
+
+  MAM.registerPass([&] {
+    const TargetOptions &Options = Target->Options;
+    return RuntimeLibraryAnalysis(
+        M->getTargetTriple(), Target->Options.ExceptionModel,
+        Target->Options.FloatABIType, Target->Options.EABIVersion,
+        Options.MCOptions.ABIName, Target->Options.VecLib);
+  });
+  MAM.registerPass([&] { return LibcallLoweringModuleAnalysis(); });
+
   MAM.registerPass([&] { return MachineModuleAnalysis(MMI); });
 
   ModulePassManager MPM;
diff --git a/llvm/tools/llc/llc.cpp b/llvm/tools/llc/llc.cpp
index ce1ce5d68c137..613780ecbfb40 100644
--- a/llvm/tools/llc/llc.cpp
+++ b/llvm/tools/llc/llc.cpp
@@ -16,6 +16,7 @@
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/ScopeExit.h"
 #include "llvm/ADT/Statistic.h"
+#include "llvm/Analysis/RuntimeLibcallInfo.h"
 #include "llvm/Analysis/TargetLibraryInfo.h"
 #include "llvm/CodeGen/CommandFlags.h"
 #include "llvm/CodeGen/LinkAllAsmWriterComponents.h"
@@ -355,13 +356,11 @@ static std::unique_ptr<ToolOutputFile> GetOutputStream(Triple::OSType OS) {
   if (!Binary)
     OpenFlags |= sys::fs::OF_TextWithCRLF;
   auto FDOut = std::make_unique<ToolOutputFile>(OutputFilename, EC, OpenFlags);
-  if (EC) {
+  if (EC)
     reportError(EC.message());
-    return nullptr;
-  }
-
   return FDOut;
 }
+
 // main - Entry point for the llc compiler.
 //
 int main(int argc, char **argv) {
@@ -729,6 +728,10 @@ static int compileModule(char **argv, LLVMContext &Context,
   // Build up all of the passes that we want to do to the module.
   legacy::PassManager PM;
   PM.add(new TargetLibraryInfoWrapperPass(TLII));
+  PM.add(new RuntimeLibraryInfoWrapper(
+      M->getTargetTriple(), Target->Options.ExceptionModel,
+      Target->Options.FloatABIType, Target->Options.EABIVersion,
+      Options.MCOptions.ABIName, Target->Options.VecLib));
 
   {
     raw_pwrite_stream *OS = &Out->os();
diff --git a/llvm/tools/llvm-exegesis/lib/SnippetGenerator.cpp b/llvm/tools/llvm-exegesis/lib/SnippetGenerator.cpp
index 7023f1bfae193..86d4e197b6063 100644
--- a/llvm/tools/llvm-exegesis/lib/SnippetGenerator.cpp
+++ b/llvm/tools/llvm-exegesis/lib/SnippetGenerator.cpp
@@ -21,9 +21,17 @@
 #include "llvm/Support/FormatVariadic.h"
 #include "llvm/Support/Program.h"
 
+#define DEBUG_TYPE "snippet-generator"
+
 namespace llvm {
 namespace exegesis {
 
+static cl::opt<unsigned>
+    RandomGeneratorSeed("random-generator-seed",
+                        cl::desc("The seed value to use for the random number "
+                                 "generator when generating snippets."),
+                        cl::init(0));
+
 std::vector<CodeTemplate> getSingleton(CodeTemplate &&CT) {
   std::vector<CodeTemplate> Result;
   Result.push_back(std::move(CT));
@@ -188,7 +196,11 @@ generateUnconstrainedCodeTemplates(const InstructionTemplate &Variant,
 
 std::mt19937 &randomGenerator() {
   static std::random_device RandomDevice;
-  static std::mt19937 RandomGenerator(RandomDevice());
+  unsigned RandomSeed = RandomGeneratorSeed.getNumOccurrences()
+                            ? RandomGeneratorSeed
+                            : RandomDevice();
+  LLVM_DEBUG(dbgs() << "Using random seed " << RandomSeed << ".\n");
+  static std::mt19937 RandomGenerator(RandomSeed);
   return RandomGenerator;
 }
 
diff --git a/llvm/tools/llvm-readobj/ARMWinEHPrinter.cpp b/llvm/tools/llvm-readobj/ARMWinEHPrinter.cpp
index c6e409c63ef3a..32e3d059f44d3 100644
--- a/llvm/tools/llvm-readobj/ARMWinEHPrinter.cpp
+++ b/llvm/tools/llvm-readobj/ARMWinEHPrinter.cpp
@@ -1404,6 +1404,12 @@ bool Decoder::dumpPackedARM64Entry(const object::COFFObjectFile &COFF,
     FpSZ += 8;
   int SavSZ = (IntSZ + FpSZ + 8 * 8 * RF.H() + 0xf) & ~0xf;
   int LocSZ = (RF.FrameSize() << 4) - SavSZ;
+  bool Homing = RF.H();
+
+  if (RF.H() && RF.RegI() == 0 && RF.RegF() == 0 && RF.CR() != 1) {
+    LocSZ += SavSZ;
+    Homing = false;
+  }
 
   if (RF.CR() == 2 || RF.CR() == 3) {
     SW.startLine() << "mov x29, sp\n";
@@ -1419,18 +1425,11 @@ bool Decoder::dumpPackedARM64Entry(const object::COFFObjectFile &COFF,
   } else if ((RF.CR() != 3 && RF.CR() != 2 && LocSZ > 0) || LocSZ > 512) {
     SW.startLine() << format("sub sp, sp, #%d\n", LocSZ);
   }
-  if (RF.H()) {
+  if (Homing) {
     SW.startLine() << format("stp x6, x7, [sp, #%d]\n", SavSZ - 16);
     SW.startLine() << format("stp x4, x5, [sp, #%d]\n", SavSZ - 32);
     SW.startLine() << format("stp x2, x3, [sp, #%d]\n", SavSZ - 48);
-    if (RF.RegI() > 0 || RF.RegF() > 0 || RF.CR() == 1) {
-      SW.startLine() << format("stp x0, x1, [sp, #%d]\n", SavSZ - 64);
-    } else {
-      // This case isn't documented; if neither RegI nor RegF nor CR=1
-      // have decremented the stack pointer by SavSZ, we need to do it here
-      // (as the final stack adjustment of LocSZ excludes SavSZ).
-      SW.startLine() << format("stp x0, x1, [sp, #-%d]!\n", SavSZ);
-    }
+    SW.startLine() << format("stp x0, x1, [sp, #%d]\n", SavSZ - 64);
   }
   int FloatRegs = RF.RegF() > 0 ? RF.RegF() + 1 : 0;
   for (int I = (FloatRegs + 1) / 2 - 1; I >= 0; I--) {
@@ -1457,10 +1456,14 @@ bool Decoder::dumpPackedARM64Entry(const object::COFFObjectFile &COFF,
       // The last register, an odd register without a pair
       if (RF.CR() == 1) {
         if (I == 0) { // If this is the only register pair
-          // CR=1 combined with RegI=1 doesn't map to a documented case;
-          // it doesn't map to any regular unwind info opcode, and the
-          // actual unwinder doesn't support it.
-          SW.startLine() << "INVALID!\n";
+          // CR=1 combined with RegI=1 maps to a special case; there's
+          // no unwind info opcode that saves a GPR together with LR
+          // with writeback to sp (no save_lrpair_x).
+          // Instead, this case expands to two instructions: a preceding
+          // (in prologue execution order) "sub sp, sp, #16", followed
+          // by a regular "stp x19, lr, [sp]" (save_lrpair).
+          SW.startLine() << format("stp x%d, lr, [sp]\n", 19);
+          SW.startLine() << format("sub sp, sp, #%d\n", SavSZ);
         } else
           SW.startLine() << format("stp x%d, lr, [sp, #%d]\n", 19 + 2 * I,
                                    16 * I);
@@ -1478,9 +1481,6 @@ bool Decoder::dumpPackedARM64Entry(const object::COFFObjectFile &COFF,
                                19 + 2 * I + 1, 16 * I);
     }
   }
-  // CR=2 is yet undocumented, see
-  // https://github.com/MicrosoftDocs/cpp-docs/pull/4202 for upstream
-  // progress on getting it documented.
   if (RF.CR() == 2)
     SW.startLine() << "pacibsp\n";
   SW.startLine() << "end\n";
diff --git a/llvm/tools/opt/NewPMDriver.cpp b/llvm/tools/opt/NewPMDriver.cpp
index 01d7ac8e3f959..3209b652b44b4 100644
--- a/llvm/tools/opt/NewPMDriver.cpp
+++ b/llvm/tools/opt/NewPMDriver.cpp
@@ -21,6 +21,7 @@
 #include "llvm/Analysis/RuntimeLibcallInfo.h"
 #include "llvm/Analysis/TargetLibraryInfo.h"
 #include "llvm/Bitcode/BitcodeWriterPass.h"
+#include "llvm/CodeGen/LibcallLoweringInfo.h"
 #include "llvm/Config/llvm-config.h"
 #include "llvm/IR/Dominators.h"
 #include "llvm/IR/LLVMContext.h"
@@ -352,9 +353,9 @@ static void registerEPCallbacks(PassBuilder &PB) {
 
 bool llvm::runPassPipeline(
     StringRef Arg0, Module &M, TargetMachine *TM, TargetLibraryInfoImpl *TLII,
-    RTLIB::RuntimeLibcallsInfo &RTLCI, ToolOutputFile *Out,
-    ToolOutputFile *ThinLTOLinkOut, ToolOutputFile *OptRemarkFile,
-    StringRef PassPipeline, ArrayRef<PassPlugin> PassPlugins,
+    ToolOutputFile *Out, ToolOutputFile *ThinLTOLinkOut,
+    ToolOutputFile *OptRemarkFile, StringRef PassPipeline,
+    ArrayRef<PassPlugin> PassPlugins,
     ArrayRef<std::function<void(PassBuilder &)>> PassBuilderCallbacks,
     OutputKind OK, VerifierKind VK, bool ShouldPreserveAssemblyUseListOrder,
     bool ShouldPreserveBitcodeUseListOrder, bool EmitSummaryIndex,
@@ -410,14 +411,24 @@ bool llvm::runPassPipeline(
       P->CSAction = PGOOptions::CSIRUse;
     }
   }
-  if (TM)
-    TM->setPGOOption(P);
 
   LoopAnalysisManager LAM;
   FunctionAnalysisManager FAM;
   CGSCCAnalysisManager CGAM;
   ModuleAnalysisManager MAM;
-  MAM.registerPass([&] { return RuntimeLibraryAnalysis(std::move(RTLCI)); });
+
+  if (TM) {
+    TM->setPGOOption(P);
+
+    MAM.registerPass([&] {
+      const TargetOptions &Options = TM->Options;
+      return RuntimeLibraryAnalysis(M.getTargetTriple(), Options.ExceptionModel,
+                                    Options.FloatABIType, Options.EABIVersion,
+                                    Options.MCOptions.ABIName, Options.VecLib);
+    });
+
+    MAM.registerPass([&] { return LibcallLoweringModuleAnalysis(); });
+  }
 
   PassInstrumentationCallbacks PIC;
   PrintPassOptions PrintPassOpts;
diff --git a/llvm/tools/opt/NewPMDriver.h b/llvm/tools/opt/NewPMDriver.h
index 31da61b9c0cae..042d5d4bbfe47 100644
--- a/llvm/tools/opt/NewPMDriver.h
+++ b/llvm/tools/opt/NewPMDriver.h
@@ -31,10 +31,6 @@ class TargetMachine;
 class ToolOutputFile;
 class TargetLibraryInfoImpl;
 
-namespace RTLIB {
-struct RuntimeLibcallsInfo;
-}
-
 extern cl::opt<bool> DebugifyEach;
 extern cl::opt<std::string> DebugifyExport;
 
@@ -71,9 +67,9 @@ void printPasses(raw_ostream &OS);
 /// nullptr.
 bool runPassPipeline(
     StringRef Arg0, Module &M, TargetMachine *TM, TargetLibraryInfoImpl *TLII,
-    RTLIB::RuntimeLibcallsInfo &RTLCI, ToolOutputFile *Out,
-    ToolOutputFile *ThinLinkOut, ToolOutputFile *OptRemarkFile,
-    StringRef PassPipeline, ArrayRef<PassPlugin> PassPlugins,
+    ToolOutputFile *Out, ToolOutputFile *ThinLinkOut,
+    ToolOutputFile *OptRemarkFile, StringRef PassPipeline,
+    ArrayRef<PassPlugin> PassPlugins,
     ArrayRef<std::function<void(PassBuilder &)>> PassBuilderCallbacks,
     opt_tool::OutputKind OK, opt_tool::VerifierKind VK,
     bool ShouldPreserveAssemblyUseListOrder,
diff --git a/llvm/tools/opt/optdriver.cpp b/llvm/tools/opt/optdriver.cpp
index f8be9f16aada6..ac318e6bc1eb4 100644
--- a/llvm/tools/opt/optdriver.cpp
+++ b/llvm/tools/opt/optdriver.cpp
@@ -657,6 +657,13 @@ optMain(int argc, char **argv,
     return 1;
   }
 
+  TargetOptions CodeGenFlagsOptions;
+  const TargetOptions *Options = TM ? &TM->Options : &CodeGenFlagsOptions;
+  if (!TM) {
+    CodeGenFlagsOptions =
+        codegen::InitTargetOptionsFromCodeGenFlags(ModuleTriple);
+  }
+
   // Override function attributes based on CPUStr, FeaturesStr, and command line
   // flags.
   codegen::setFunctionAttributes(CPUStr, FeaturesStr, *M);
@@ -674,13 +681,8 @@ optMain(int argc, char **argv,
       M->addModuleFlag(Module::Error, "UnifiedLTO", 1);
   }
 
-  VectorLibrary VecLib = codegen::getVectorLibrary();
   // Add an appropriate TargetLibraryInfo pass for the module's triple.
-  TargetLibraryInfoImpl TLII(ModuleTriple, VecLib);
-
-  RTLIB::RuntimeLibcallsInfo RTLCI(ModuleTriple, codegen::getExceptionModel(),
-                                   codegen::getFloatABIForCalls(),
-                                   codegen::getEABIVersion(), ABIName, VecLib);
+  TargetLibraryInfoImpl TLII(ModuleTriple, Options->VecLib);
 
   // The -disable-simplify-libcalls flag actually disables all builtin optzns.
   if (DisableSimplifyLibCalls)
@@ -756,7 +758,7 @@ optMain(int argc, char **argv,
     // string. Hand off the rest of the functionality to the new code for that
     // layer.
     if (!runPassPipeline(
-            argv[0], *M, TM.get(), &TLII, RTLCI, Out.get(), ThinLinkOut.get(),
+            argv[0], *M, TM.get(), &TLII, Out.get(), ThinLinkOut.get(),
             RemarksFile.get(), Pipeline, PluginList, PassBuilderCallbacks, OK,
             VK, /* ShouldPreserveAssemblyUseListOrder */ false,
             /* ShouldPreserveBitcodeUseListOrder */ true, EmitSummaryIndex,
@@ -804,6 +806,9 @@ optMain(int argc, char **argv,
       (VerifyDebugInfoPreserve && !VerifyEachDebugInfoPreserve);
 
   Passes.add(new TargetLibraryInfoWrapperPass(TLII));
+  Passes.add(new RuntimeLibraryInfoWrapper(
+      ModuleTriple, Options->ExceptionModel, Options->FloatABIType,
+      Options->EABIVersion, Options->MCOptions.ABIName, Options->VecLib));
 
   // Add internal analysis passes from the target machine.
   Passes.add(createTargetTransformInfoWrapperPass(TM ? TM->getTargetIRAnalysis()
diff --git a/llvm/unittests/ADT/MapVectorTest.cpp b/llvm/unittests/ADT/MapVectorTest.cpp
index e0589445e3271..b11d4603b90b7 100644
--- a/llvm/unittests/ADT/MapVectorTest.cpp
+++ b/llvm/unittests/ADT/MapVectorTest.cpp
@@ -307,6 +307,24 @@ TEST(MapVectorTest, AtTest) {
   EXPECT_EQ(ConstMV.at(1), 12);
 }
 
+TEST(MapVectorTest, KeysValuesIterator) {
+  MapVector<int, int> MV;
+
+  MV.insert(std::make_pair(1, 11));
+  MV.insert(std::make_pair(2, 12));
+  MV.insert(std::make_pair(3, 13));
+  MV.insert(std::make_pair(4, 14));
+  MV.insert(std::make_pair(5, 15));
+  MV.insert(std::make_pair(6, 16));
+
+  EXPECT_THAT(MV.keys(), testing::ElementsAre(1, 2, 3, 4, 5, 6));
+  EXPECT_THAT(MV.values(), testing::ElementsAre(11, 12, 13, 14, 15, 16));
+
+  const MapVector<int, int> &ConstMV = MV;
+  EXPECT_THAT(ConstMV.keys(), testing::ElementsAre(1, 2, 3, 4, 5, 6));
+  EXPECT_THAT(ConstMV.values(), testing::ElementsAre(11, 12, 13, 14, 15, 16));
+}
+
 template <class IntType> struct MapVectorMappedTypeTest : ::testing::Test {
   using int_type = IntType;
 };
diff --git a/llvm/unittests/ADT/SetVectorTest.cpp b/llvm/unittests/ADT/SetVectorTest.cpp
index ff3c876deb458..6230472553c38 100644
--- a/llvm/unittests/ADT/SetVectorTest.cpp
+++ b/llvm/unittests/ADT/SetVectorTest.cpp
@@ -52,7 +52,7 @@ TEST(SetVector, ContainsTest) {
 }
 
 TEST(SetVector, ConstPtrKeyTest) {
-  SetVector<int *, SmallVector<int *, 8>, SmallPtrSet<const int *, 8>> S, T;
+  SetVector<int *> S, T;
   int i, j, k, m, n;
 
   S.insert(&i);
diff --git a/llvm/unittests/CAS/CASTestConfig.h b/llvm/unittests/CAS/CASTestConfig.h
index b1c0e59ff2b92..20a95dd2f6aa6 100644
--- a/llvm/unittests/CAS/CASTestConfig.h
+++ b/llvm/unittests/CAS/CASTestConfig.h
@@ -15,6 +15,11 @@
 #include "gtest/gtest.h"
 #include <memory>
 
+#ifdef _WIN32
+#include "llvm/Support/VersionTuple.h"
+#include "llvm/Support/Windows/WindowsSupport.h"
+#endif
+
 namespace llvm::unittest::cas {
 class MockEnv {
   void anchor();
@@ -68,6 +73,10 @@ class CASTest
   }
 
   void SetUp() override {
+#ifdef _WIN32
+    if (llvm::GetWindowsOSVersion() < llvm::VersionTuple(10, 0, 0, 17763))
+      GTEST_SKIP() << "CAS tests skipped on older Windows versions";
+#endif
     NextCASIndex = 0;
     setMaxOnDiskCASMappingSize();
   }
diff --git a/llvm/unittests/CodeGen/GlobalISel/InstructionSelectTest.cpp b/llvm/unittests/CodeGen/GlobalISel/InstructionSelectTest.cpp
index 7fbccf7160e17..223798342b3ee 100644
--- a/llvm/unittests/CodeGen/GlobalISel/InstructionSelectTest.cpp
+++ b/llvm/unittests/CodeGen/GlobalISel/InstructionSelectTest.cpp
@@ -59,10 +59,8 @@ TEST_F(AArch64GISelMITest, TestInstructionSelectErase) {
     GTEST_SKIP();
 
   legacy::PassManager PM;
-  std::unique_ptr<TargetPassConfig> TPC(TM->createPassConfig(PM));
 
   EraseMockInstructionSelector ISel;
-  ISel.TPC = TPC.get();
   for (auto &MI : *EntryMBB) {
     ISel.MIs.push_back(&MI);
   }
diff --git a/llvm/unittests/CodeGen/MachineOperandTest.cpp b/llvm/unittests/CodeGen/MachineOperandTest.cpp
index 0373c7a0f629b..c0b2b1895975a 100644
--- a/llvm/unittests/CodeGen/MachineOperandTest.cpp
+++ b/llvm/unittests/CodeGen/MachineOperandTest.cpp
@@ -288,6 +288,23 @@ TEST(MachineOperandTest, PrintGlobalAddress) {
   }
 }
 
+TEST(MachineOperandTest, PrintLaneMask) {
+  // Create a MachineOperand with a lanemask and print it.
+  LaneBitmask LaneMask = LaneBitmask(12);
+  MachineOperand MO = MachineOperand::CreateLaneMask(LaneMask);
+
+  // Checking some preconditions on the newly created
+  // MachineOperand.
+  ASSERT_TRUE(MO.isLaneMask());
+  ASSERT_EQ(MO.getLaneMask(), LaneMask);
+
+  std::string str;
+  // Print a MachineOperand that is a lane mask in hex representation.
+  raw_string_ostream OS(str);
+  MO.print(OS, /*TRI=*/nullptr);
+  ASSERT_EQ(str, "lanemask(0x000000000000000C)");
+}
+
 TEST(MachineOperandTest, PrintRegisterLiveOut) {
   // Create a MachineOperand with a register live out list and print it.
   uint32_t Mask = 0;
diff --git a/llvm/unittests/ExecutionEngine/Orc/CMakeLists.txt b/llvm/unittests/ExecutionEngine/Orc/CMakeLists.txt
index 7b563d7bcc68c..e5b5633c5a096 100644
--- a/llvm/unittests/ExecutionEngine/Orc/CMakeLists.txt
+++ b/llvm/unittests/ExecutionEngine/Orc/CMakeLists.txt
@@ -18,6 +18,7 @@ set(LLVM_LINK_COMPONENTS
   )
 
 add_llvm_unittest(OrcJITTests
+  CallableTraitsHelperTest.cpp
   CoreAPIsTest.cpp
   ExecutorAddressTest.cpp
   ExecutionSessionWrapperFunctionCallsTest.cpp
diff --git a/llvm/unittests/ExecutionEngine/Orc/CallableTraitsHelperTest.cpp b/llvm/unittests/ExecutionEngine/Orc/CallableTraitsHelperTest.cpp
new file mode 100644
index 0000000000000..bfb3d8eefbff3
--- /dev/null
+++ b/llvm/unittests/ExecutionEngine/Orc/CallableTraitsHelperTest.cpp
@@ -0,0 +1,70 @@
+//===- CallableTraitsHelperTest.cpp ---------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Tests for llvm::orc::CallableTraitsHelper APIs.
+//
+// NOTE: All tests in this file are testing compile-time functionality, so the
+//       tests at runtime all end up being noops. That's fine -- those are
+//       cheap.
+//===----------------------------------------------------------------------===//
+
+#include "llvm/ExecutionEngine/Orc/CallableTraitsHelper.h"
+#include "gtest/gtest.h"
+
+using namespace llvm;
+using namespace llvm::orc;
+
+static void freeVoidVoid() {}
+
+TEST(CallableTraitsHelperTest, FreeVoidVoid) {
+  (void)freeVoidVoid;
+  typedef CallableArgInfo<decltype(freeVoidVoid)> CAI;
+  static_assert(std::is_void_v<CAI::ReturnType>);
+  static_assert(std::is_same_v<CAI::ArgsTupleType, std::tuple<>>);
+}
+
+static int freeBinaryOp(int, float) { return 0; }
+
+TEST(CallableTraitsHelperTest, FreeBinaryOp) {
+  (void)freeBinaryOp;
+  typedef CallableArgInfo<decltype(freeBinaryOp)> CAI;
+  static_assert(std::is_same_v<CAI::ReturnType, int>);
+  static_assert(std::is_same_v<CAI::ArgsTupleType, std::tuple<int, float>>);
+}
+
+TEST(CallableTraitsHelperTest, VoidVoidObj) {
+  auto VoidVoid = []() {};
+  typedef CallableArgInfo<decltype(VoidVoid)> CAI;
+  static_assert(std::is_void_v<CAI::ReturnType>);
+  static_assert(std::is_same_v<CAI::ArgsTupleType, std::tuple<>>);
+}
+
+TEST(CallableTraitsHelperTest, BinaryOpObj) {
+  auto BinaryOp = [](int X, float Y) -> int { return X + Y; };
+  typedef CallableArgInfo<decltype(BinaryOp)> CAI;
+  static_assert(std::is_same_v<CAI::ReturnType, int>);
+  static_assert(std::is_same_v<CAI::ArgsTupleType, std::tuple<int, float>>);
+}
+
+TEST(CallableTraitsHelperTest, PreservesLValueRef) {
+  auto RefOp = [](int &) {};
+  typedef CallableArgInfo<decltype(RefOp)> CAI;
+  static_assert(std::is_same_v<CAI::ArgsTupleType, std::tuple<int &>>);
+}
+
+TEST(CallableTraitsHelperTest, PreservesLValueRefConstness) {
+  auto RefOp = [](const int &) {};
+  typedef CallableArgInfo<decltype(RefOp)> CAI;
+  static_assert(std::is_same_v<CAI::ArgsTupleType, std::tuple<const int &>>);
+}
+
+TEST(CallableTraitsHelperTest, PreservesRValueRef) {
+  auto RefOp = [](int &&) {};
+  typedef CallableArgInfo<decltype(RefOp)> CAI;
+  static_assert(std::is_same_v<CAI::ArgsTupleType, std::tuple<int &&>>);
+}
diff --git a/llvm/unittests/Frontend/OpenMPIRBuilderTest.cpp b/llvm/unittests/Frontend/OpenMPIRBuilderTest.cpp
index 1f35b7a5cfaa4..4595590a083d3 100644
--- a/llvm/unittests/Frontend/OpenMPIRBuilderTest.cpp
+++ b/llvm/unittests/Frontend/OpenMPIRBuilderTest.cpp
@@ -428,8 +428,8 @@ TEST_F(OpenMPIRBuilderTest, CreateCancel) {
                        OMPBuilder.createCancel(Loc, nullptr, OMPD_parallel));
   Builder.restoreIP(NewIP);
   EXPECT_FALSE(M->global_empty());
-  EXPECT_EQ(M->size(), 4U);
-  EXPECT_EQ(F->size(), 4U);
+  EXPECT_EQ(M->size(), 3U);
+  EXPECT_EQ(F->size(), 5U);
   EXPECT_EQ(BB->size(), 4U);
 
   CallInst *GTID = dyn_cast<CallInst>(&BB->front());
@@ -449,23 +449,16 @@ TEST_F(OpenMPIRBuilderTest, CreateCancel) {
   Instruction *CancelBBTI = Cancel->getParent()->getTerminator();
   EXPECT_EQ(CancelBBTI->getNumSuccessors(), 2U);
   EXPECT_EQ(CancelBBTI->getSuccessor(0), NewIP.getBlock());
-  EXPECT_EQ(CancelBBTI->getSuccessor(1)->size(), 3U);
-  CallInst *GTID1 = dyn_cast<CallInst>(&CancelBBTI->getSuccessor(1)->front());
-  EXPECT_NE(GTID1, nullptr);
-  EXPECT_EQ(GTID1->arg_size(), 1U);
-  EXPECT_EQ(GTID1->getCalledFunction()->getName(), "__kmpc_global_thread_num");
-  EXPECT_FALSE(GTID1->getCalledFunction()->doesNotAccessMemory());
-  EXPECT_FALSE(GTID1->getCalledFunction()->doesNotFreeMemory());
-  CallInst *Barrier = dyn_cast<CallInst>(GTID1->getNextNode());
-  EXPECT_NE(Barrier, nullptr);
-  EXPECT_EQ(Barrier->arg_size(), 2U);
-  EXPECT_EQ(Barrier->getCalledFunction()->getName(), "__kmpc_cancel_barrier");
-  EXPECT_FALSE(Barrier->getCalledFunction()->doesNotAccessMemory());
-  EXPECT_FALSE(Barrier->getCalledFunction()->doesNotFreeMemory());
-  EXPECT_TRUE(Barrier->use_empty());
+  EXPECT_EQ(CancelBBTI->getSuccessor(1)->size(), 1U);
   EXPECT_EQ(CancelBBTI->getSuccessor(1)->getTerminator()->getNumSuccessors(),
             1U);
-  EXPECT_EQ(CancelBBTI->getSuccessor(1)->getTerminator()->getSuccessor(0), CBB);
+  // Successor 1 of the cancel branch: .cncl -> .fini -> CBB
+  EXPECT_EQ(CancelBBTI->getSuccessor(1)
+                ->getTerminator()
+                ->getSuccessor(0)
+                ->getTerminator()
+                ->getSuccessor(0),
+            CBB);
 
   EXPECT_EQ(cast<CallInst>(Cancel)->getArgOperand(1), GTID);
 
@@ -498,7 +491,7 @@ TEST_F(OpenMPIRBuilderTest, CreateCancelIfCond) {
   Builder.restoreIP(NewIP);
   EXPECT_FALSE(M->global_empty());
   EXPECT_EQ(M->size(), 4U);
-  EXPECT_EQ(F->size(), 7U);
+  EXPECT_EQ(F->size(), 10U);
   EXPECT_EQ(BB->size(), 1U);
   ASSERT_TRUE(isa<BranchInst>(BB->getTerminator()));
   ASSERT_EQ(BB->getTerminator()->getNumSuccessors(), 2U);
@@ -524,23 +517,15 @@ TEST_F(OpenMPIRBuilderTest, CreateCancelIfCond) {
   EXPECT_EQ(CancelBBTI->getSuccessor(0)->size(), 1U);
   EXPECT_EQ(CancelBBTI->getSuccessor(0)->getUniqueSuccessor(),
             NewIP.getBlock());
-  EXPECT_EQ(CancelBBTI->getSuccessor(1)->size(), 3U);
-  CallInst *GTID1 = dyn_cast<CallInst>(&CancelBBTI->getSuccessor(1)->front());
-  EXPECT_NE(GTID1, nullptr);
-  EXPECT_EQ(GTID1->arg_size(), 1U);
-  EXPECT_EQ(GTID1->getCalledFunction()->getName(), "__kmpc_global_thread_num");
-  EXPECT_FALSE(GTID1->getCalledFunction()->doesNotAccessMemory());
-  EXPECT_FALSE(GTID1->getCalledFunction()->doesNotFreeMemory());
-  CallInst *Barrier = dyn_cast<CallInst>(GTID1->getNextNode());
-  EXPECT_NE(Barrier, nullptr);
-  EXPECT_EQ(Barrier->arg_size(), 2U);
-  EXPECT_EQ(Barrier->getCalledFunction()->getName(), "__kmpc_cancel_barrier");
-  EXPECT_FALSE(Barrier->getCalledFunction()->doesNotAccessMemory());
-  EXPECT_FALSE(Barrier->getCalledFunction()->doesNotFreeMemory());
-  EXPECT_TRUE(Barrier->use_empty());
+  EXPECT_EQ(CancelBBTI->getSuccessor(1)->size(), 1U);
   EXPECT_EQ(CancelBBTI->getSuccessor(1)->getTerminator()->getNumSuccessors(),
             1U);
-  EXPECT_EQ(CancelBBTI->getSuccessor(1)->getTerminator()->getSuccessor(0), CBB);
+  EXPECT_EQ(CancelBBTI->getSuccessor(1)
+                ->getTerminator()
+                ->getSuccessor(0)
+                ->getTerminator()
+                ->getSuccessor(0),
+            CBB);
 
   EXPECT_EQ(cast<CallInst>(Cancel)->getArgOperand(1), GTID);
 
@@ -572,7 +557,7 @@ TEST_F(OpenMPIRBuilderTest, CreateCancelBarrier) {
   Builder.restoreIP(NewIP);
   EXPECT_FALSE(M->global_empty());
   EXPECT_EQ(M->size(), 3U);
-  EXPECT_EQ(F->size(), 4U);
+  EXPECT_EQ(F->size(), 5U);
   EXPECT_EQ(BB->size(), 4U);
 
   CallInst *GTID = dyn_cast<CallInst>(&BB->front());
@@ -595,7 +580,11 @@ TEST_F(OpenMPIRBuilderTest, CreateCancelBarrier) {
   EXPECT_EQ(BarrierBBTI->getSuccessor(1)->size(), 1U);
   EXPECT_EQ(BarrierBBTI->getSuccessor(1)->getTerminator()->getNumSuccessors(),
             1U);
-  EXPECT_EQ(BarrierBBTI->getSuccessor(1)->getTerminator()->getSuccessor(0),
+  EXPECT_EQ(BarrierBBTI->getSuccessor(1)
+                ->getTerminator()
+                ->getSuccessor(0)
+                ->getTerminator()
+                ->getSuccessor(0),
             CBB);
 
   EXPECT_EQ(cast<CallInst>(Barrier)->getArgOperand(1), GTID);
@@ -1291,8 +1280,8 @@ TEST_F(OpenMPIRBuilderTest, ParallelCancelBarrier) {
 
   EXPECT_EQ(NumBodiesGenerated, 1U);
   EXPECT_EQ(NumPrivatizedVars, 0U);
-  EXPECT_EQ(NumFinalizationPoints, 2U);
-  EXPECT_TRUE(FakeDestructor->hasNUses(2));
+  EXPECT_EQ(NumFinalizationPoints, 1U);
+  EXPECT_TRUE(FakeDestructor->hasNUses(1));
 
   Builder.restoreIP(AfterIP);
   Builder.CreateRetVoid();
@@ -2916,7 +2905,8 @@ TEST_F(OpenMPIRBuilderTest, MasterDirective) {
   BranchInst *EntryBr = cast<BranchInst>(EntryBB->getTerminator());
   EXPECT_TRUE(EntryBr->isConditional());
   EXPECT_EQ(EntryBr->getSuccessor(0), ThenBB);
-  BasicBlock *ExitBB = ThenBB->getUniqueSuccessor();
+  BasicBlock *FinalizeBB = ThenBB->getUniqueSuccessor();
+  BasicBlock *ExitBB = FinalizeBB->getUniqueSuccessor();
   EXPECT_EQ(EntryBr->getSuccessor(1), ExitBB);
 
   CmpInst *CondInst = cast<CmpInst>(EntryBr->getCondition());
@@ -2928,7 +2918,7 @@ TEST_F(OpenMPIRBuilderTest, MasterDirective) {
   EXPECT_TRUE(isa<GlobalVariable>(MasterEntryCI->getArgOperand(0)));
 
   CallInst *MasterEndCI = nullptr;
-  for (auto &FI : *ThenBB) {
+  for (auto &FI : *FinalizeBB) {
     Instruction *cur = &FI;
     if (isa<CallInst>(cur)) {
       MasterEndCI = cast<CallInst>(cur);
@@ -2998,7 +2988,8 @@ TEST_F(OpenMPIRBuilderTest, MaskedDirective) {
   BranchInst *EntryBr = cast<BranchInst>(EntryBB->getTerminator());
   EXPECT_TRUE(EntryBr->isConditional());
   EXPECT_EQ(EntryBr->getSuccessor(0), ThenBB);
-  BasicBlock *ExitBB = ThenBB->getUniqueSuccessor();
+  BasicBlock *FinalizeBB = ThenBB->getUniqueSuccessor();
+  BasicBlock *ExitBB = FinalizeBB->getUniqueSuccessor();
   EXPECT_EQ(EntryBr->getSuccessor(1), ExitBB);
 
   CmpInst *CondInst = cast<CmpInst>(EntryBr->getCondition());
@@ -3010,7 +3001,7 @@ TEST_F(OpenMPIRBuilderTest, MaskedDirective) {
   EXPECT_TRUE(isa<GlobalVariable>(MaskedEntryCI->getArgOperand(0)));
 
   CallInst *MaskedEndCI = nullptr;
-  for (auto &FI : *ThenBB) {
+  for (auto &FI : *FinalizeBB) {
     Instruction *cur = &FI;
     if (isa<CallInst>(cur)) {
       MaskedEndCI = cast<CallInst>(cur);
@@ -3062,6 +3053,9 @@ TEST_F(OpenMPIRBuilderTest, CriticalDirective) {
                                 FINICB_WRAPPER(FiniCB), "testCRT", nullptr));
   Builder.restoreIP(AfterIP);
 
+  BasicBlock *FinalizeBB = EntryBB->getUniqueSuccessor();
+  EXPECT_NE(FinalizeBB, nullptr);
+
   CallInst *CriticalEntryCI = nullptr;
   for (auto &EI : *EntryBB) {
     Instruction *cur = &EI;
@@ -3078,7 +3072,7 @@ TEST_F(OpenMPIRBuilderTest, CriticalDirective) {
   EXPECT_TRUE(isa<GlobalVariable>(CriticalEntryCI->getArgOperand(0)));
 
   CallInst *CriticalEndCI = nullptr;
-  for (auto &FI : *EntryBB) {
+  for (auto &FI : *FinalizeBB) {
     Instruction *cur = &FI;
     if (isa<CallInst>(cur)) {
       CriticalEndCI = cast<CallInst>(cur);
@@ -3312,6 +3306,9 @@ TEST_F(OpenMPIRBuilderTest, OrderedDirectiveThreads) {
                                           FINICB_WRAPPER(FiniCB), true));
   Builder.restoreIP(AfterIP);
 
+  BasicBlock *FinalizeBB = EntryBB->getUniqueSuccessor();
+  EXPECT_NE(FinalizeBB, nullptr);
+
   Builder.CreateRetVoid();
   OMPBuilder.finalize();
   EXPECT_FALSE(verifyModule(*M, &errs()));
@@ -3334,7 +3331,7 @@ TEST_F(OpenMPIRBuilderTest, OrderedDirectiveThreads) {
   EXPECT_TRUE(isa<GlobalVariable>(OrderedEntryCI->getArgOperand(0)));
 
   CallInst *OrderedEndCI = nullptr;
-  for (auto &FI : *EntryBB) {
+  for (auto &FI : *FinalizeBB) {
     Instruction *Cur = &FI;
     if (isa<CallInst>(Cur)) {
       OrderedEndCI = cast<CallInst>(Cur);
@@ -3508,7 +3505,8 @@ TEST_F(OpenMPIRBuilderTest, SingleDirective) {
   BranchInst *EntryBr = cast<BranchInst>(EntryBB->getTerminator());
   EXPECT_TRUE(EntryBr->isConditional());
   EXPECT_EQ(EntryBr->getSuccessor(0), ThenBB);
-  BasicBlock *ExitBB = ThenBB->getUniqueSuccessor();
+  BasicBlock *FinalizeBB = ThenBB->getUniqueSuccessor();
+  BasicBlock *ExitBB = FinalizeBB->getUniqueSuccessor();
   EXPECT_EQ(EntryBr->getSuccessor(1), ExitBB);
 
   CmpInst *CondInst = cast<CmpInst>(EntryBr->getCondition());
@@ -3520,7 +3518,7 @@ TEST_F(OpenMPIRBuilderTest, SingleDirective) {
   EXPECT_TRUE(isa<GlobalVariable>(SingleEntryCI->getArgOperand(0)));
 
   CallInst *SingleEndCI = nullptr;
-  for (auto &FI : *ThenBB) {
+  for (auto &FI : *FinalizeBB) {
     Instruction *cur = &FI;
     if (isa<CallInst>(cur)) {
       SingleEndCI = cast<CallInst>(cur);
@@ -3601,7 +3599,8 @@ TEST_F(OpenMPIRBuilderTest, SingleDirectiveNowait) {
   BranchInst *EntryBr = cast<BranchInst>(EntryBB->getTerminator());
   EXPECT_TRUE(EntryBr->isConditional());
   EXPECT_EQ(EntryBr->getSuccessor(0), ThenBB);
-  BasicBlock *ExitBB = ThenBB->getUniqueSuccessor();
+  BasicBlock *FinalizeBB = ThenBB->getUniqueSuccessor();
+  BasicBlock *ExitBB = FinalizeBB->getUniqueSuccessor();
   EXPECT_EQ(EntryBr->getSuccessor(1), ExitBB);
 
   CmpInst *CondInst = cast<CmpInst>(EntryBr->getCondition());
@@ -3613,7 +3612,7 @@ TEST_F(OpenMPIRBuilderTest, SingleDirectiveNowait) {
   EXPECT_TRUE(isa<GlobalVariable>(SingleEntryCI->getArgOperand(0)));
 
   CallInst *SingleEndCI = nullptr;
-  for (auto &FI : *ThenBB) {
+  for (auto &FI : *FinalizeBB) {
     Instruction *cur = &FI;
     if (isa<CallInst>(cur)) {
       SingleEndCI = cast<CallInst>(cur);
@@ -3724,7 +3723,8 @@ TEST_F(OpenMPIRBuilderTest, SingleDirectiveCopyPrivate) {
   BranchInst *EntryBr = cast<BranchInst>(EntryBB->getTerminator());
   EXPECT_TRUE(EntryBr->isConditional());
   EXPECT_EQ(EntryBr->getSuccessor(0), ThenBB);
-  BasicBlock *ExitBB = ThenBB->getUniqueSuccessor();
+  BasicBlock *FinalizeBB = ThenBB->getUniqueSuccessor();
+  BasicBlock *ExitBB = FinalizeBB->getUniqueSuccessor();
   EXPECT_EQ(EntryBr->getSuccessor(1), ExitBB);
 
   CmpInst *CondInst = cast<CmpInst>(EntryBr->getCondition());
@@ -3743,25 +3743,28 @@ TEST_F(OpenMPIRBuilderTest, SingleDirectiveCopyPrivate) {
   EXPECT_EQ(PrivLI->getPointerOperand(), PrivAI);
   // icmp
   EXPECT_TRUE(ThenBBI.next<ICmpInst>());
+
+  // check FinalizeBB
+  BBInstIter FinalizeBBI(FinalizeBB);
   // store 1, DidIt
-  auto *DidItSI = ThenBBI.next<StoreInst>();
+  auto *DidItSI = FinalizeBBI.next<StoreInst>();
   EXPECT_NE(DidItSI, nullptr);
   EXPECT_EQ(DidItSI->getValueOperand(),
             ConstantInt::get(Type::getInt32Ty(Ctx), 1));
   Value *DidIt = DidItSI->getPointerOperand();
   // call __kmpc_end_single
-  auto *SingleEndCI = ThenBBI.next<CallInst>();
+  auto *SingleEndCI = FinalizeBBI.next<CallInst>();
   EXPECT_NE(SingleEndCI, nullptr);
   EXPECT_EQ(SingleEndCI->getCalledFunction()->getName(), "__kmpc_end_single");
   EXPECT_EQ(SingleEndCI->arg_size(), 2U);
   EXPECT_TRUE(isa<GlobalVariable>(SingleEndCI->getArgOperand(0)));
   EXPECT_EQ(SingleEndCI->getArgOperand(1), SingleEntryCI->getArgOperand(1));
   // br ExitBB
-  auto *ExitBBBI = ThenBBI.next<BranchInst>();
+  auto *ExitBBBI = FinalizeBBI.next<BranchInst>();
   EXPECT_NE(ExitBBBI, nullptr);
   EXPECT_TRUE(ExitBBBI->isUnconditional());
   EXPECT_EQ(ExitBBBI->getOperand(0), ExitBB);
-  EXPECT_FALSE(ThenBBI.hasNext());
+  EXPECT_FALSE(FinalizeBBI.hasNext());
 
   // check ExitBB
   BBInstIter ExitBBI(ExitBB);
diff --git a/llvm/unittests/IR/ConstantRangeTest.cpp b/llvm/unittests/IR/ConstantRangeTest.cpp
index 53d581c8db7c9..13712a76d3edf 100644
--- a/llvm/unittests/IR/ConstantRangeTest.cpp
+++ b/llvm/unittests/IR/ConstantRangeTest.cpp
@@ -449,6 +449,9 @@ TEST_F(ConstantRangeTest, Trunc) {
   // trunc([7, 1), 3->2) = [3, 1)
   ConstantRange SevenOne(APInt(3, 7), APInt(3, 1));
   EXPECT_EQ(SevenOne.truncate(2), ConstantRange(APInt(2, 3), APInt(2, 1)));
+
+  ConstantRange Nop = Full.truncate(Full.getBitWidth());
+  EXPECT_EQ(Full, Nop);
 }
 
 TEST_F(ConstantRangeTest, TruncNuw) {
@@ -527,6 +530,9 @@ TEST_F(ConstantRangeTest, ZExt) {
   // zext([5, 0), 3->7) = [5, 8)
   ConstantRange FiveZero(APInt(3, 5), APInt(3, 0));
   EXPECT_EQ(FiveZero.zeroExtend(7), ConstantRange(APInt(7, 5), APInt(7, 8)));
+
+  ConstantRange Nop = Full.zeroExtend(Full.getBitWidth());
+  EXPECT_EQ(Full, Nop);
 }
 
 TEST_F(ConstantRangeTest, SExt) {
@@ -550,6 +556,9 @@ TEST_F(ConstantRangeTest, SExt) {
 
   EXPECT_EQ(ConstantRange(APInt(16, 0x0200), APInt(16, 0x8000)).signExtend(19),
             ConstantRange(APInt(19, 0x0200), APInt(19, 0x8000)));
+
+  ConstantRange Nop = Full.signExtend(Full.getBitWidth());
+  EXPECT_EQ(Full, Nop);
 }
 
 TEST_F(ConstantRangeTest, IntersectWith) {
diff --git a/llvm/unittests/IR/IntrinsicsTest.cpp b/llvm/unittests/IR/IntrinsicsTest.cpp
index cfd99ed542162..87d922d22eaac 100644
--- a/llvm/unittests/IR/IntrinsicsTest.cpp
+++ b/llvm/unittests/IR/IntrinsicsTest.cpp
@@ -30,14 +30,12 @@
 using namespace llvm;
 
 namespace {
-
 class IntrinsicsTest : public ::testing::Test {
+protected:
   LLVMContext Context;
   std::unique_ptr<Module> M;
   BasicBlock *BB = nullptr;
 
-  void TearDown() override { M.reset(); }
-
   void SetUp() override {
     M = std::make_unique<Module>("Test", Context);
     auto F = M->getOrInsertFunction(
@@ -46,6 +44,8 @@ class IntrinsicsTest : public ::testing::Test {
     EXPECT_NE(BB, nullptr);
   }
 
+  void TearDown() override { M.reset(); }
+
 public:
   Instruction *makeIntrinsic(Intrinsic::ID ID) const {
     IRBuilder<> Builder(BB);
@@ -197,4 +197,212 @@ TEST(IntrinsicAttributes, TestGetFnAttributesBug) {
   AttributeSet AS = getFnAttributes(Context, experimental_guard);
   EXPECT_FALSE(AS.hasAttributes());
 }
+
+// Tests non-overloaded intrinsic declaration.
+TEST_F(IntrinsicsTest, NonOverloadedIntrinsic) {
+  Type *RetTy = Type::getVoidTy(Context);
+  SmallVector<Type *, 1> ArgTys;
+  ArgTys.push_back(Type::getInt1Ty(Context));
+
+  Function *F = Intrinsic::getOrInsertDeclaration(M.get(), Intrinsic::assume,
+                                                  RetTy, ArgTys);
+
+  ASSERT_NE(F, nullptr);
+  EXPECT_EQ(F->getIntrinsicID(), Intrinsic::assume);
+  EXPECT_EQ(F->getReturnType(), RetTy);
+  EXPECT_EQ(F->arg_size(), 1u);
+  EXPECT_FALSE(F->isVarArg());
+  EXPECT_EQ(F->getName(), "llvm.assume");
+}
+
+// Tests overloaded intrinsic with automatic type resolution for scalar types.
+TEST_F(IntrinsicsTest, OverloadedIntrinsicScalar) {
+  Type *RetTy = Type::getInt32Ty(Context);
+  SmallVector<Type *, 2> ArgTys;
+  ArgTys.push_back(Type::getInt32Ty(Context));
+  ArgTys.push_back(Type::getInt32Ty(Context));
+
+  Function *F = Intrinsic::getOrInsertDeclaration(M.get(), Intrinsic::umax,
+                                                  RetTy, ArgTys);
+
+  ASSERT_NE(F, nullptr);
+  EXPECT_EQ(F->getIntrinsicID(), Intrinsic::umax);
+  EXPECT_EQ(F->getReturnType(), RetTy);
+  EXPECT_EQ(F->arg_size(), 2u);
+  EXPECT_FALSE(F->isVarArg());
+  EXPECT_EQ(F->getName(), "llvm.umax.i32");
+}
+
+// Tests overloaded intrinsic with automatic type resolution for vector types.
+TEST_F(IntrinsicsTest, OverloadedIntrinsicVector) {
+  Type *RetTy = FixedVectorType::get(Type::getInt32Ty(Context), 4);
+  SmallVector<Type *, 2> ArgTys;
+  ArgTys.push_back(RetTy);
+  ArgTys.push_back(RetTy);
+
+  Function *F = Intrinsic::getOrInsertDeclaration(M.get(), Intrinsic::umax,
+                                                  RetTy, ArgTys);
+
+  ASSERT_NE(F, nullptr);
+  EXPECT_EQ(F->getIntrinsicID(), Intrinsic::umax);
+  EXPECT_EQ(F->getReturnType(), RetTy);
+  EXPECT_EQ(F->arg_size(), 2u);
+  EXPECT_FALSE(F->isVarArg());
+  EXPECT_EQ(F->getName(), "llvm.umax.v4i32");
+}
+
+// Tests overloaded intrinsic with automatic type resolution for addrspace.
+TEST_F(IntrinsicsTest, OverloadedIntrinsicAddressSpace) {
+  Type *RetTy = Type::getVoidTy(Context);
+  SmallVector<Type *, 4> ArgTys;
+  ArgTys.push_back(PointerType::get(Context, 1)); // ptr addrspace(1)
+  ArgTys.push_back(Type::getInt32Ty(Context));    // rw
+  ArgTys.push_back(Type::getInt32Ty(Context));    // locality
+  ArgTys.push_back(Type::getInt32Ty(Context));    // cache type
+
+  Function *F = Intrinsic::getOrInsertDeclaration(M.get(), Intrinsic::prefetch,
+                                                  RetTy, ArgTys);
+
+  ASSERT_NE(F, nullptr);
+  EXPECT_EQ(F->getIntrinsicID(), Intrinsic::prefetch);
+  EXPECT_EQ(F->getReturnType(), RetTy);
+  EXPECT_EQ(F->arg_size(), 4u);
+  EXPECT_FALSE(F->isVarArg());
+  EXPECT_EQ(F->getName(), "llvm.prefetch.p1");
+}
+
+// Tests vararg intrinsic declaration.
+TEST_F(IntrinsicsTest, VarArgIntrinsicStatepoint) {
+  Type *RetTy = Type::getTokenTy(Context);
+  SmallVector<Type *, 5> ArgTys;
+  ArgTys.push_back(Type::getInt64Ty(Context));    // ID
+  ArgTys.push_back(Type::getInt32Ty(Context));    // NumPatchBytes
+  ArgTys.push_back(PointerType::get(Context, 0)); // Target
+  ArgTys.push_back(Type::getInt32Ty(Context));    // NumCallArgs
+  ArgTys.push_back(Type::getInt32Ty(Context));    // Flags
+
+  Function *F = Intrinsic::getOrInsertDeclaration(
+      M.get(), Intrinsic::experimental_gc_statepoint, RetTy, ArgTys);
+
+  ASSERT_NE(F, nullptr);
+  EXPECT_EQ(F->getIntrinsicID(), Intrinsic::experimental_gc_statepoint);
+  EXPECT_EQ(F->getReturnType(), RetTy);
+  EXPECT_EQ(F->arg_size(), 5u);
+  EXPECT_TRUE(F->isVarArg()) << "experimental_gc_statepoint must be vararg";
+  EXPECT_EQ(F->getName(), "llvm.experimental.gc.statepoint.p0");
+}
+
+// Tests that different overloads create different declarations.
+TEST_F(IntrinsicsTest, DifferentOverloads) {
+  // i32 version
+  Type *RetTy32 = Type::getInt32Ty(Context);
+  SmallVector<Type *, 2> ArgTys32;
+  ArgTys32.push_back(Type::getInt32Ty(Context));
+  ArgTys32.push_back(Type::getInt32Ty(Context));
+
+  Function *Func32 = Intrinsic::getOrInsertDeclaration(M.get(), Intrinsic::umax,
+                                                       RetTy32, ArgTys32);
+
+  // i64 version
+  Type *RetTy64 = Type::getInt64Ty(Context);
+  SmallVector<Type *, 2> ArgTys64;
+  ArgTys64.push_back(Type::getInt64Ty(Context));
+  ArgTys64.push_back(Type::getInt64Ty(Context));
+
+  Function *Func64 = Intrinsic::getOrInsertDeclaration(M.get(), Intrinsic::umax,
+                                                       RetTy64, ArgTys64);
+
+  EXPECT_NE(Func32, Func64)
+      << "Different overloads should be different functions";
+  EXPECT_EQ(Func32->getName(), "llvm.umax.i32");
+  EXPECT_EQ(Func64->getName(), "llvm.umax.i64");
+}
+
+// Tests IRBuilder::CreateIntrinsic with overloaded scalar type.
+TEST_F(IntrinsicsTest, IRBuilderCreateIntrinsicScalar) {
+  IRBuilder<> Builder(BB);
+
+  Type *RetTy = Type::getInt32Ty(Context);
+  SmallVector<Value *, 2> Args;
+  Args.push_back(ConstantInt::get(Type::getInt32Ty(Context), 10));
+  Args.push_back(ConstantInt::get(Type::getInt32Ty(Context), 20));
+
+  CallInst *CI = Builder.CreateIntrinsic(RetTy, Intrinsic::umax, Args);
+
+  ASSERT_NE(CI, nullptr);
+  EXPECT_EQ(CI->getIntrinsicID(), Intrinsic::umax);
+  EXPECT_EQ(CI->getType(), RetTy);
+  EXPECT_EQ(CI->arg_size(), 2u);
+  EXPECT_FALSE(CI->getCalledFunction()->isVarArg());
+}
+
+// Tests IRBuilder::CreateIntrinsic with overloaded vector type.
+TEST_F(IntrinsicsTest, IRBuilderCreateIntrinsicVector) {
+  IRBuilder<> Builder(BB);
+
+  Type *RetTy = FixedVectorType::get(Type::getInt32Ty(Context), 4);
+  SmallVector<Value *, 2> Args;
+  Args.push_back(Constant::getNullValue(RetTy));
+  Args.push_back(Constant::getNullValue(RetTy));
+
+  CallInst *CI = Builder.CreateIntrinsic(RetTy, Intrinsic::umax, Args);
+
+  ASSERT_NE(CI, nullptr);
+  EXPECT_EQ(CI->getIntrinsicID(), Intrinsic::umax);
+  EXPECT_EQ(CI->getType(), RetTy);
+  EXPECT_EQ(CI->arg_size(), 2u);
+  EXPECT_FALSE(CI->getCalledFunction()->isVarArg());
+}
+
+// Tests IRBuilder::CreateIntrinsic with overloaded address space.
+TEST_F(IntrinsicsTest, IRBuilderCreateIntrinsicAddressSpace) {
+  IRBuilder<> Builder(BB);
+
+  Type *RetTy = Type::getVoidTy(Context);
+  SmallVector<Value *, 4> Args;
+  Args.push_back(Constant::getNullValue(
+      PointerType::get(Context, 1))); // ptr addrspace(1) null
+  Args.push_back(ConstantInt::get(Type::getInt32Ty(Context), 0)); // rw
+  Args.push_back(ConstantInt::get(Type::getInt32Ty(Context), 3)); // locality
+  Args.push_back(ConstantInt::get(Type::getInt32Ty(Context), 1)); // cache type
+
+  CallInst *CI = Builder.CreateIntrinsic(RetTy, Intrinsic::prefetch, Args);
+
+  ASSERT_NE(CI, nullptr);
+  EXPECT_EQ(CI->getIntrinsicID(), Intrinsic::prefetch);
+  EXPECT_EQ(CI->getType(), RetTy);
+  EXPECT_EQ(CI->arg_size(), 4u);
+  EXPECT_FALSE(CI->getCalledFunction()->isVarArg());
+  EXPECT_EQ(CI->getCalledFunction()->getName(), "llvm.prefetch.p1");
+}
+
+// Tests IRBuilder::CreateIntrinsic with vararg intrinsic.
+TEST_F(IntrinsicsTest, IRBuilderCreateIntrinsicVarArg) {
+  IRBuilder<> Builder(BB);
+
+  // Create a dummy function to call through statepoint
+  FunctionType *DummyFnTy = FunctionType::get(Type::getVoidTy(Context), false);
+  Function *DummyFn = Function::Create(DummyFnTy, GlobalValue::ExternalLinkage,
+                                       "dummy", M.get());
+
+  Type *RetTy = Type::getTokenTy(Context);
+  SmallVector<Value *, 5> Args;
+  Args.push_back(ConstantInt::get(Type::getInt64Ty(Context), 0)); // ID
+  Args.push_back(
+      ConstantInt::get(Type::getInt32Ty(Context), 0)); // NumPatchBytes
+  Args.push_back(DummyFn);                             // Target
+  Args.push_back(ConstantInt::get(Type::getInt32Ty(Context), 0)); // NumCallArgs
+  Args.push_back(ConstantInt::get(Type::getInt32Ty(Context), 0)); // Flags
+
+  CallInst *CI = Builder.CreateIntrinsic(
+      RetTy, Intrinsic::experimental_gc_statepoint, Args);
+
+  ASSERT_NE(CI, nullptr);
+  EXPECT_EQ(CI->getIntrinsicID(), Intrinsic::experimental_gc_statepoint);
+  EXPECT_EQ(CI->getType(), RetTy);
+  EXPECT_EQ(CI->arg_size(), 5u);
+  EXPECT_TRUE(CI->getCalledFunction()->isVarArg())
+      << "experimental_gc_statepoint must be vararg";
+}
+
 } // end namespace
diff --git a/llvm/unittests/Transforms/Utils/MemTransferLowering.cpp b/llvm/unittests/Transforms/Utils/MemTransferLowering.cpp
index dd03b4f2ae971..752029e54f394 100644
--- a/llvm/unittests/Transforms/Utils/MemTransferLowering.cpp
+++ b/llvm/unittests/Transforms/Utils/MemTransferLowering.cpp
@@ -120,7 +120,8 @@ TEST_F(MemTransferLowerTest, MemCpyKnownLength) {
         MemCpyInst *MemCpyI = cast<MemCpyInst>(Inst);
         auto &SE = FAM.getResult<ScalarEvolutionAnalysis>(F);
         expandMemCpyAsLoop(MemCpyI, TTI, &SE);
-        auto *CopyLoopBB = getBasicBlockByName(F, "load-store-loop");
+        auto *CopyLoopBB =
+            getBasicBlockByName(F, "static-memcpy-expansion-main-body");
         Instruction *LoadInst =
             getInstructionByOpcode(*CopyLoopBB, Instruction::Load, 1);
         EXPECT_NE(nullptr, LoadInst->getMetadata(LLVMContext::MD_alias_scope));
@@ -203,7 +204,8 @@ TEST_F(MemTransferLowerTest, AtomicMemCpyKnownLength) {
         AnyMemCpyInst *MemCpyI = cast<AnyMemCpyInst>(Inst);
         auto &SE = FAM.getResult<ScalarEvolutionAnalysis>(F);
         expandAtomicMemCpyAsLoop(MemCpyI, TTI, &SE);
-        auto *CopyLoopBB = getBasicBlockByName(F, "load-store-loop");
+        auto *CopyLoopBB =
+            getBasicBlockByName(F, "static-memcpy-expansion-main-body");
         Instruction *LoadInst =
             getInstructionByOpcode(*CopyLoopBB, Instruction::Load, 1);
         EXPECT_TRUE(LoadInst->isAtomic());
@@ -248,7 +250,8 @@ TEST_F(MemTransferLowerTest, AtomicMemCpyUnKnownLength) {
         auto *MemCpyI = cast<AnyMemCpyInst>(Inst);
         auto &SE = FAM.getResult<ScalarEvolutionAnalysis>(F);
         expandAtomicMemCpyAsLoop(MemCpyI, TTI, &SE);
-        auto *CopyLoopBB = getBasicBlockByName(F, "loop-memcpy-expansion");
+        auto *CopyLoopBB =
+            getBasicBlockByName(F, "dynamic-memcpy-expansion-main-body");
         Instruction *LoadInst =
             getInstructionByOpcode(*CopyLoopBB, Instruction::Load, 1);
         EXPECT_TRUE(LoadInst->isAtomic());
diff --git a/llvm/utils/gn/secondary/bolt/unittests/BUILD.gn b/llvm/utils/gn/secondary/bolt/unittests/BUILD.gn
index 9c5be5966b3fc..eded7696e9e8b 100644
--- a/llvm/utils/gn/secondary/bolt/unittests/BUILD.gn
+++ b/llvm/utils/gn/secondary/bolt/unittests/BUILD.gn
@@ -2,6 +2,7 @@ group("unittests") {
   deps = [
     "Core:CoreTests",
     "Profile:ProfileTests",
+    "Passes:PassTests",
   ]
   testonly = true
 }
diff --git a/llvm/utils/gn/secondary/bolt/unittests/Passes/BUILD.gn b/llvm/utils/gn/secondary/bolt/unittests/Passes/BUILD.gn
new file mode 100644
index 0000000000000..9b2abd4eb71d8
--- /dev/null
+++ b/llvm/utils/gn/secondary/bolt/unittests/Passes/BUILD.gn
@@ -0,0 +1,48 @@
+import("//llvm/lib/Target/targets.gni")
+import("//third-party/unittest/unittest.gni")
+
+unittest("PassTests") {
+  configs += [ "//llvm/utils/gn/build:bolt_code" ]
+  deps = [
+    "//bolt/include/bolt/Core:TargetConfig.def",
+    "//bolt/lib/Core",
+    "//bolt/lib/Passes",
+    "//bolt/lib/Profile",
+    "//bolt/lib/Rewrite",
+    "//bolt/lib/Utils",
+    "//llvm/lib/DebugInfo/DWARF",
+    "//llvm/lib/MC",
+    "//llvm/lib/Object",
+    "//llvm/lib/Target:TargetsToBuild",
+  ]
+  sources = [ "InsertNegateRAState.cpp" ]
+
+  defines = []
+  include_dirs = []
+  if (llvm_build_AArch64) {
+    defines += [ "AARCH64_AVAILABLE" ]
+
+    # This target reaches into the internal headers of LLVM's AArch64 library.
+    # That target doesn't expect that, so it doesn't use public_deps for
+    # tblgen-generated headers used only in internal headers (...which this
+    # target here questionably includes). So depend on the target that generates
+    # those headers here.
+    include_dirs += [ "//llvm/lib/Target/AArch64" ]
+    deps += [
+      "//llvm/lib/Target/AArch64:AArch64GenSDNodeInfo",
+      "//llvm/lib/Target/AArch64/MCTargetDesc",
+      "//llvm/lib/Target/AArch64/Utils",
+    ]
+  }
+  if (llvm_build_X86) {
+    defines += [ "X86_AVAILABLE" ]
+
+    # This target reaches into the internal headers of LLVM's X86 library.
+    # That target doesn't expect that, so it doesn't use public_deps for
+    # tblgen-generated headers used only in internal headers (...which this
+    # target here questionably includes). So depend on the target that generates
+    # those headers here.
+    include_dirs += [ "//llvm/lib/Target/X86" ]
+    deps += [ "//llvm/lib/Target/X86/MCTargetDesc" ]
+  }
+}
diff --git a/llvm/utils/gn/secondary/clang-tools-extra/clangd/test/BUILD.gn b/llvm/utils/gn/secondary/clang-tools-extra/clangd/test/BUILD.gn
index d40ce6424fe81..e1d02c0ce0696 100644
--- a/llvm/utils/gn/secondary/clang-tools-extra/clangd/test/BUILD.gn
+++ b/llvm/utils/gn/secondary/clang-tools-extra/clangd/test/BUILD.gn
@@ -39,6 +39,7 @@ write_lit_config("lit_site_cfg") {
     "CLANGD_ENABLE_REMOTE=0",
     "CLANGD_TIDY_CHECKS=1",
     "LLVM_HOST_TRIPLE=$llvm_current_triple",
+    "LLVM_INCLUDE_BENCHMARKS=",
     "LLVM_LIT_TOOLS_DIR=",  # Intentionally empty, matches cmake build.
     "LLVM_TOOLS_DIR=" + rebase_path("$root_out_dir/bin"),
     "LLVM_TARGET_TRIPLE=$llvm_target_triple",
diff --git a/llvm/utils/gn/secondary/libcxx/include/BUILD.gn b/llvm/utils/gn/secondary/libcxx/include/BUILD.gn
index 0e7c18415a59d..f4405664d0dad 100644
--- a/llvm/utils/gn/secondary/libcxx/include/BUILD.gn
+++ b/llvm/utils/gn/secondary/libcxx/include/BUILD.gn
@@ -40,6 +40,8 @@ if (current_toolchain == default_toolchain) {
       "_LIBCPP_INSTRUMENTED_WITH_ASAN=",
       "_LIBCPP_ABI_DEFINES=",
       "_LIBCPP_HARDENING_MODE_DEFAULT=_LIBCPP_HARDENING_MODE_NONE",
+      "_LIBCPP_LIBC_PICOLIBC=",
+      "_LIBCPP_LIBC_NEWLIB=",
       "_LIBCPP_PSTL_BACKEND_LIBDISPATCH=",
       "_LIBCPP_PSTL_BACKEND_SERIAL=",
       "_LIBCPP_PSTL_BACKEND_STD_THREAD=1",
diff --git a/llvm/utils/gn/secondary/lldb/source/Plugins/Language/CPlusPlus/BUILD.gn b/llvm/utils/gn/secondary/lldb/source/Plugins/Language/CPlusPlus/BUILD.gn
index 0537e51302819..6d0e7c0aa0c86 100644
--- a/llvm/utils/gn/secondary/lldb/source/Plugins/Language/CPlusPlus/BUILD.gn
+++ b/llvm/utils/gn/secondary/lldb/source/Plugins/Language/CPlusPlus/BUILD.gn
@@ -58,6 +58,7 @@ static_library("CPlusPlus") {
     "LibCxxVariant.cpp",
     "LibCxxVector.cpp",
     "LibStdcpp.cpp",
+    "LibStdcppSpan.cpp",
     "LibStdcppTuple.cpp",
     "LibStdcppUniquePointer.cpp",
     "MSVCUndecoratedNameParser.cpp",
diff --git a/llvm/utils/gn/secondary/lldb/source/Target/BUILD.gn b/llvm/utils/gn/secondary/lldb/source/Target/BUILD.gn
index 679373d741661..ac63bbc6ee3b3 100644
--- a/llvm/utils/gn/secondary/lldb/source/Target/BUILD.gn
+++ b/llvm/utils/gn/secondary/lldb/source/Target/BUILD.gn
@@ -35,6 +35,7 @@ static_library("Target") {
   sources = [
     "ABI.cpp",
     "AssertFrameRecognizer.cpp",
+    "BorrowedStackFrame.cpp",
     "CoreFileMemoryRanges.cpp",
     "DynamicRegisterInfo.cpp",
     "ExecutionContext.cpp",
diff --git a/llvm/utils/gn/secondary/lldb/source/Utility/BUILD.gn b/llvm/utils/gn/secondary/lldb/source/Utility/BUILD.gn
index 5faa365bb7bdb..47143ea4e15b1 100644
--- a/llvm/utils/gn/secondary/lldb/source/Utility/BUILD.gn
+++ b/llvm/utils/gn/secondary/lldb/source/Utility/BUILD.gn
@@ -60,6 +60,7 @@ static_library("Utility") {
     "UserIDResolver.cpp",
     "VASprintf.cpp",
     "VMRange.cpp",
+    "VirtualDataExtractor.cpp",
     "XcodeSDK.cpp",
     "ZipFile.cpp",
   ]
diff --git a/llvm/utils/gn/secondary/llvm/lib/Target/AArch64/BUILD.gn b/llvm/utils/gn/secondary/llvm/lib/Target/AArch64/BUILD.gn
index 4319bd92d163e..52601f108a059 100644
--- a/llvm/utils/gn/secondary/llvm/lib/Target/AArch64/BUILD.gn
+++ b/llvm/utils/gn/secondary/llvm/lib/Target/AArch64/BUILD.gn
@@ -76,6 +76,7 @@ tablegen("AArch64GenSDNodeInfo") {
   visibility = [
     ":LLVMAArch64CodeGen",
     "//bolt/unittests/Core:CoreTests",
+    "//bolt/unittests/Passes:PassTests",
     "//llvm/unittests/Target/AArch64:AArch64Tests",
   ]
   args = [ "-gen-sd-node-info" ]
diff --git a/llvm/utils/gn/secondary/llvm/unittests/ExecutionEngine/Orc/BUILD.gn b/llvm/utils/gn/secondary/llvm/unittests/ExecutionEngine/Orc/BUILD.gn
index 111e4c997de92..c5f038c20e140 100644
--- a/llvm/utils/gn/secondary/llvm/unittests/ExecutionEngine/Orc/BUILD.gn
+++ b/llvm/utils/gn/secondary/llvm/unittests/ExecutionEngine/Orc/BUILD.gn
@@ -15,6 +15,7 @@ unittest("OrcJITTests") {
     "//llvm/lib/Testing/Support",
   ]
   sources = [
+    "CallableTraitsHelperTest.cpp",
     "CoreAPIsTest.cpp",
     "EPCGenericJITLinkMemoryManagerTest.cpp",
     "EPCGenericMemoryAccessTest.cpp",
diff --git a/llvm/utils/lit/lit/TestRunner.py b/llvm/utils/lit/lit/TestRunner.py
index 9525320f133c6..7a233b238f7e2 100644
--- a/llvm/utils/lit/lit/TestRunner.py
+++ b/llvm/utils/lit/lit/TestRunner.py
@@ -21,7 +21,6 @@
 import lit.ShUtil as ShUtil
 import lit.Test as Test
 import lit.util
-from lit.util import to_bytes, to_string, to_unicode
 from lit.BooleanExpression import BooleanExpression
 
 
@@ -391,18 +390,14 @@ def executeBuiltinEcho(cmd, shenv):
     # Some tests have un-redirected echo commands to help debug test failures.
     # Buffer our output and return it to the caller.
     is_redirected = True
-    encode = lambda x: x
     if stdout == subprocess.PIPE:
         is_redirected = False
         stdout = StringIO()
     elif kIsWindows:
-        # Reopen stdout in binary mode to avoid CRLF translation. The versions
-        # of echo we are replacing on Windows all emit plain LF, and the LLVM
-        # tests now depend on this.
-        # When we open as binary, however, this also means that we have to write
-        # 'bytes' objects to stdout instead of 'str' objects.
-        encode = lit.util.to_bytes
-        stdout = open(stdout.name, stdout.mode + "b")
+        # Reopen stdout with `newline=""` to avoid CRLF translation.
+        # The versions of echo we are replacing on Windows all emit plain LF,
+        # and the LLVM tests now depend on this.
+        stdout = open(stdout.name, stdout.mode, encoding="utf-8", newline="")
         opened_files.append((None, None, stdout, None))
 
     # Implement echo flags. We only support -e and -n, and not yet in
@@ -423,16 +418,15 @@ def maybeUnescape(arg):
         if not interpret_escapes:
             return arg
 
-        arg = lit.util.to_bytes(arg)
-        return arg.decode("unicode_escape")
+        return arg.encode("utf-8").decode("unicode_escape")
 
     if args:
         for arg in args[:-1]:
-            stdout.write(encode(maybeUnescape(arg)))
-            stdout.write(encode(" "))
-        stdout.write(encode(maybeUnescape(args[-1])))
+            stdout.write(maybeUnescape(arg))
+            stdout.write(" ")
+        stdout.write(maybeUnescape(args[-1]))
     if write_newline:
-        stdout.write(encode("\n"))
+        stdout.write("\n")
 
     for (name, mode, f, path) in opened_files:
         f.close()
@@ -463,7 +457,7 @@ def executeBuiltinMkdir(cmd, cmd_shenv):
     exitCode = 0
     for dir in args:
         dir = pathlib.Path(dir)
-        cwd = pathlib.Path(to_unicode(cmd_shenv.cwd))
+        cwd = pathlib.Path(cmd_shenv.cwd)
         if not dir.is_absolute():
             dir = lit.util.abs_path_preserve_drive(cwd / dir)
         if parent:
@@ -508,8 +502,6 @@ def on_rm_error(func, path, exc_info):
     exitCode = 0
     for path in args:
         cwd = cmd_shenv.cwd
-        path = to_unicode(path) if kIsWindows else to_bytes(path)
-        cwd = to_unicode(cwd) if kIsWindows else to_bytes(cwd)
         if not os.path.isabs(path):
             path = lit.util.abs_path_preserve_drive(os.path.join(cwd, path))
         if force and not os.path.exists(path):
@@ -718,10 +710,7 @@ def processRedirects(cmd, stdin_source, cmd_shenv, opened_files):
         else:
             # Make sure relative paths are relative to the cwd.
             redir_filename = os.path.join(cmd_shenv.cwd, name)
-            redir_filename = (
-                to_unicode(redir_filename) if kIsWindows else to_bytes(redir_filename)
-            )
-            fd = open(redir_filename, mode)
+            fd = open(redir_filename, mode, encoding="utf-8")
         # Workaround a Win32 and/or subprocess bug when appending.
         #
         # FIXME: Actually, this is probably an instance of PR6753.
@@ -1083,14 +1072,14 @@ def _executeShCmd(cmd, shenv, results, timeoutHelper):
             if out is None:
                 out = ""
             else:
-                out = to_string(out.decode("utf-8", errors="replace"))
+                out = out.decode("utf-8", errors="replace")
         except:
             out = str(out)
         try:
             if err is None:
                 err = ""
             else:
-                err = to_string(err.decode("utf-8", errors="replace"))
+                err = err.decode("utf-8", errors="replace")
         except:
             err = str(err)
 
@@ -1284,7 +1273,7 @@ def executeScriptInternal(
 
         # Add the command output, if redirected.
         for (name, path, data) in result.outputFiles:
-            data = to_string(data.decode("utf-8", errors="replace"))
+            data = data.decode("utf-8", errors="replace")
             out += formatOutput(f"redirected output from '{name}'", data, limit=1024)
         if result.stdout.strip():
             out += formatOutput("command stdout", result.stdout)
@@ -1340,13 +1329,6 @@ def executeScript(test, litConfig, tmpBase, commands, cwd):
         script += ".bat"
 
     # Write script file
-    mode = "w"
-    open_kwargs = {}
-    if litConfig.isWindows and not isWin32CMDEXE:
-        mode += "b"  # Avoid CRLFs when writing bash scripts.
-    else:
-        open_kwargs["encoding"] = "utf-8"
-    f = open(script, mode, **open_kwargs)
     if isWin32CMDEXE:
         for i, ln in enumerate(commands):
             match = re.fullmatch(kPdbgRegex, ln)
@@ -1355,8 +1337,9 @@ def executeScript(test, litConfig, tmpBase, commands, cwd):
                 commands[i] = match.expand(
                     "echo '\\1' > nul && " if command else "echo '\\1' > nul"
                 )
-        f.write("@echo on\n")
-        f.write("\n at if %ERRORLEVEL% NEQ 0 EXIT\n".join(commands))
+        with open(script, "w", encoding="utf-8") as f:
+            f.write("@echo on\n")
+            f.write("\n at if %ERRORLEVEL% NEQ 0 EXIT\n".join(commands))
     else:
         for i, ln in enumerate(commands):
             match = re.fullmatch(kPdbgRegex, ln)
@@ -1395,8 +1378,6 @@ def executeScript(test, litConfig, tmpBase, commands, cwd):
                 # seen the latter manage to terminate the shell running lit.
                 if command:
                     commands[i] += f" && {{ {command}; }}"
-        if test.config.pipefail:
-            f.write(b"set -o pipefail;" if mode == "wb" else "set -o pipefail;")
 
         # Manually export any DYLD_* variables used by dyld on macOS because
         # otherwise they are lost when the shell executable is run, before the
@@ -1406,14 +1387,14 @@ def executeScript(test, litConfig, tmpBase, commands, cwd):
             for k, v in test.config.environment.items()
             if k.startswith("DYLD_")
         )
-        f.write(bytes(env_str, "utf-8") if mode == "wb" else env_str)
-        f.write(b"set -x;" if mode == "wb" else "set -x;")
-        if mode == "wb":
-            f.write(bytes("{ " + "; } &&\n{ ".join(commands) + "; }", "utf-8"))
-        else:
+
+        with open(script, "w", encoding="utf-8", newline="") as f:
+            if test.config.pipefail:
+                f.write("set -o pipefail;")
+            f.write(env_str)
+            f.write("set -x;")
             f.write("{ " + "; } &&\n{ ".join(commands) + "; }")
-    f.write(b"\n" if mode == "wb" else "\n")
-    f.close()
+            f.write("\n")
 
     if isWin32CMDEXE:
         command = ["cmd", "/c", script]
@@ -1449,19 +1430,11 @@ def parseIntegratedTestScriptCommands(source_path, keywords):
     (line_number, command_type, line).
     """
 
-    # This code is carefully written to be dual compatible with Python 2.5+ and
-    # Python 3 without requiring input files to always have valid codings. The
-    # trick we use is to open the file in binary mode and use the regular
-    # expression library to find the commands, with it scanning strings in
-    # Python2 and bytes in Python3.
-    #
-    # Once we find a match, we do require each script line to be decodable to
-    # UTF-8, so we convert the outputs to UTF-8 before returning. This way the
-    # remaining code can work with "strings" agnostic of the executing Python
-    # version.
+    # We use `bytes` for scanning input files to avoid requiring them to always
+    # have valid codings.
 
     keywords_re = re.compile(
-        to_bytes("(%s)(.*)\n" % ("|".join(re.escape(k) for k in keywords),))
+        b"(%s)(.*)\n" % (b"|".join(re.escape(k.encode("utf-8")) for k in keywords),)
     )
 
     f = open(source_path, "rb")
@@ -1470,8 +1443,8 @@ def parseIntegratedTestScriptCommands(source_path, keywords):
         data = f.read()
 
         # Ensure the data ends with a newline.
-        if not data.endswith(to_bytes("\n")):
-            data = data + to_bytes("\n")
+        if not data.endswith(b"\n"):
+            data = data + b"\n"
 
         # Iterate over the matches.
         line_number = 1
@@ -1480,15 +1453,11 @@ def parseIntegratedTestScriptCommands(source_path, keywords):
             # Compute the updated line number by counting the intervening
             # newlines.
             match_position = match.start()
-            line_number += data.count(
-                to_bytes("\n"), last_match_position, match_position
-            )
+            line_number += data.count(b"\n", last_match_position, match_position)
             last_match_position = match_position
 
             # Convert the keyword and line to UTF-8 strings and yield the
-            # command. Note that we take care to return regular strings in
-            # Python 2, to avoid other code having to differentiate between the
-            # str and unicode types.
+            # command.
             #
             # Opening the file in binary mode prevented Windows \r newline
             # characters from being converted to Unix \n newlines, so manually
@@ -1496,8 +1465,8 @@ def parseIntegratedTestScriptCommands(source_path, keywords):
             keyword, ln = match.groups()
             yield (
                 line_number,
-                to_string(keyword.decode("utf-8")),
-                to_string(ln.decode("utf-8").rstrip("\r")),
+                keyword.decode("utf-8"),
+                ln.decode("utf-8").rstrip("\r"),
             )
     finally:
         f.close()
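
The executeBuiltinEcho change above relies on Python's text-mode newline handling instead of reopening stdout in binary mode. A minimal sketch of the behavior it assumes (illustrative only, not part of the patch): opening a text stream with newline="" disables newline translation, so writing plain str objects still produces LF-only output on Windows.

    import os, tempfile

    path = os.path.join(tempfile.mkdtemp(), "out.txt")
    # newline="" disables newline translation on write, so "\n" stays a single
    # LF byte even on Windows while the stream still accepts str objects.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("line\n")
    with open(path, "rb") as f:
        assert f.read() == b"line\n"
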
diff --git a/llvm/utils/lit/lit/builtin_commands/diff.py b/llvm/utils/lit/lit/builtin_commands/diff.py
index f2b5869b35889..a32a31d50ada8 100644
--- a/llvm/utils/lit/lit/builtin_commands/diff.py
+++ b/llvm/utils/lit/lit/builtin_commands/diff.py
@@ -8,7 +8,6 @@
 import sys
 
 import util
-from util import to_string
 
 
 class DiffFlags:
@@ -67,10 +66,9 @@ def compareTwoBinaryFiles(flags, filepaths, filelines):
         filepaths[1].encode(),
         n=flags.num_context_lines,
     )
-    diffs = [diff.decode(errors="backslashreplace") for diff in diffs]
 
     for diff in diffs:
-        sys.stdout.write(to_string(diff))
+        sys.stdout.write(diff.decode(errors="backslashreplace"))
         exitCode = 1
     return exitCode
 
@@ -117,7 +115,7 @@ def compose2(f, g):
         filepaths[1],
         n=flags.num_context_lines,
     ):
-        sys.stdout.write(to_string(diff))
+        sys.stdout.write(diff)
         exitCode = 1
     return exitCode
 
diff --git a/llvm/utils/lit/lit/formats/googletest.py b/llvm/utils/lit/lit/formats/googletest.py
index 172cd0beee4a1..01820da38c954 100644
--- a/llvm/utils/lit/lit/formats/googletest.py
+++ b/llvm/utils/lit/lit/formats/googletest.py
@@ -43,7 +43,7 @@ def get_num_tests(self, path, litConfig, localConfig):
             return None
         return sum(
             map(
-                lambda line: lit.util.to_string(line).startswith("  "),
+                lambda line: line.startswith(b"  "),
                 out.splitlines(False),
             )
         )
diff --git a/llvm/utils/lit/lit/llvm/config.py b/llvm/utils/lit/lit/llvm/config.py
index f212928caee1b..28a7ab25a0c3b 100644
--- a/llvm/utils/lit/lit/llvm/config.py
+++ b/llvm/utils/lit/lit/llvm/config.py
@@ -226,7 +226,7 @@ def _find_git_windows_unix_tools(self, tools_needed):
                         continue
 
                     # We found it, stop enumerating.
-                    return lit.util.to_string(candidate_path)
+                    return candidate_path
             except:
                 continue
 
@@ -287,8 +287,8 @@ def get_process_output(self, command):
                 env=self.config.environment,
             )
             stdout, stderr = cmd.communicate()
-            stdout = lit.util.to_string(stdout)
-            stderr = lit.util.to_string(stderr)
+            stdout = stdout.decode("utf-8", errors="replace")
+            stderr = stderr.decode("utf-8", errors="replace")
             return (stdout, stderr)
         except OSError:
             self.lit_config.fatal("Could not run process %s" % command)
diff --git a/llvm/utils/lit/lit/reports.py b/llvm/utils/lit/lit/reports.py
index 1b43ab9357b37..6f8a782a40aa8 100755
--- a/llvm/utils/lit/lit/reports.py
+++ b/llvm/utils/lit/lit/reports.py
@@ -29,10 +29,10 @@ def write_results(self, tests, elapsed):
             fd, _ = tempfile.mkstemp(
                 suffix=ext, prefix=f"{filename}.", dir=os.path.dirname(self.output_file)
             )
-            report_file = os.fdopen(fd, "w")
+            report_file = os.fdopen(fd, "w", encoding="utf-8")
         else:
             # Overwrite if the results already exist.
-            report_file = open(self.output_file, "w")
+            report_file = open(self.output_file, "w", encoding="utf-8")
 
         with report_file:
             self._write_results_to_file(tests, elapsed, report_file)
diff --git a/llvm/utils/lit/lit/run.py b/llvm/utils/lit/lit/run.py
index 3fc4a1b9b40bd..9c54511bfd625 100644
--- a/llvm/utils/lit/lit/run.py
+++ b/llvm/utils/lit/lit/run.py
@@ -7,6 +7,14 @@
 import lit.util
 import lit.worker
 
+# Windows has a limit of 60 workers per pool.
+# This is defined in the multiprocessing module implementation.
+# See: https://github.com/python/cpython/blob/6bc65c30ff1fd0b581a2c93416496fc720bc442c/Lib/concurrent/futures/process.py#L669-L672
+WINDOWS_MAX_WORKERS_PER_POOL = 60
+
+
+def _ceilDiv(a, b):
+    return (a + b - 1) // b
 
 class MaxFailuresError(Exception):
     pass
@@ -72,25 +80,65 @@ def _execute(self, deadline):
             if v is not None
         }
 
-        pool = multiprocessing.Pool(
-            self.workers, lit.worker.initialize, (self.lit_config, semaphores)
+        # Windows has a limit of 60 workers per pool, so we need to use multiple
+        # pools when more workers are requested than the limit. The limit can be
+        # overridden with the LIT_WINDOWS_MAX_WORKERS_PER_POOL environment variable.
+        max_workers_per_pool = (
+            WINDOWS_MAX_WORKERS_PER_POOL if os.name == "nt" else self.workers
+        )
+        max_workers_per_pool = int(
+            os.getenv("LIT_WINDOWS_MAX_WORKERS_PER_POOL", max_workers_per_pool)
         )
 
-        async_results = [
-            pool.apply_async(
-                lit.worker.execute, args=[test], callback=self.progress_callback
+        num_pools = max(1, _ceilDiv(self.workers, max_workers_per_pool))
+
+        # Distribute self.workers across num_pools as evenly as possible
+        workers_per_pool_list = [self.workers // num_pools] * num_pools
+        for pool_idx in range(self.workers % num_pools):
+            workers_per_pool_list[pool_idx] += 1
+
+        if num_pools > 1:
+            self.lit_config.note(
+                "Using %d pools balancing %d workers total distributed as %s (Windows worker limit workaround)"
+                % (num_pools, self.workers, workers_per_pool_list)
             )
-            for test in self.tests
-        ]
-        pool.close()
+
+        # Create multiple pools
+        pools = []
+        for pool_size in workers_per_pool_list:
+            pool = multiprocessing.Pool(
+                pool_size, lit.worker.initialize, (self.lit_config, semaphores)
+            )
+            pools.append(pool)
+
+        # Distribute tests across pools
+        tests_per_pool = _ceilDiv(len(self.tests), num_pools)
+        async_results = []
+
+        for pool_idx, pool in enumerate(pools):
+            start_idx = pool_idx * tests_per_pool
+            end_idx = min(start_idx + tests_per_pool, len(self.tests))
+            for test in self.tests[start_idx:end_idx]:
+                ar = pool.apply_async(
+                    lit.worker.execute, args=[test], callback=self.progress_callback
+                )
+                async_results.append(ar)
+
+        # Close all pools
+        for pool in pools:
+            pool.close()
 
         try:
             self._wait_for(async_results, deadline)
         except:
-            pool.terminate()
+            # Terminate all pools on exception
+            for pool in pools:
+                pool.terminate()
             raise
         finally:
-            pool.join()
+            # Join all pools
+            for pool in pools:
+                pool.join()
 
     def _wait_for(self, async_results, deadline):
         timeout = deadline - time.time()
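
As a quick sanity check of the pool-sizing arithmetic introduced in run.py above, the distribution can be reproduced in isolation. The sketch below is illustrative only (the helper names are not part of the patch); it mirrors _ceilDiv plus the even split of workers across pools, and the assertions match the distributions the new windows-pools.py test expects.

    def ceil_div(a, b):
        return (a + b - 1) // b

    def distribute(workers, max_per_pool):
        # Same math as run.py: number of pools, then spread workers evenly.
        num_pools = max(1, ceil_div(workers, max_per_pool))
        per_pool = [workers // num_pools] * num_pools
        for i in range(workers % num_pools):
            per_pool[i] += 1
        return per_pool

    assert distribute(20, 15) == [10, 10]     # -j100 capped to 20 tests, 15 per pool
    assert distribute(17, 5) == [5, 4, 4, 4]  # -j17, 5 workers per pool
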
diff --git a/llvm/utils/lit/lit/util.py b/llvm/utils/lit/lit/util.py
index e4e031b3e0898..a800f1f6e1419 100644
--- a/llvm/utils/lit/lit/util.py
+++ b/llvm/utils/lit/lit/util.py
@@ -33,76 +33,6 @@ def make_word_regex(word):
     return r"\b" + word + r"\b"
 
 
-def to_bytes(s):
-    """Return the parameter as type 'bytes', possibly encoding it.
-
-    In Python2, the 'bytes' type is the same as 'str'. In Python3, they
-    are distinct.
-
-    """
-    if isinstance(s, bytes):
-        # In Python2, this branch is taken for both 'str' and 'bytes'.
-        # In Python3, this branch is taken only for 'bytes'.
-        return s
-    # In Python2, 's' is a 'unicode' object.
-    # In Python3, 's' is a 'str' object.
-    # Encode to UTF-8 to get 'bytes' data.
-    return s.encode("utf-8")
-
-
-def to_string(b):
-    """Return the parameter as type 'str', possibly encoding it.
-
-    In Python2, the 'str' type is the same as 'bytes'. In Python3, the
-    'str' type is (essentially) Python2's 'unicode' type, and 'bytes' is
-    distinct.
-
-    """
-    if isinstance(b, str):
-        # In Python2, this branch is taken for types 'str' and 'bytes'.
-        # In Python3, this branch is taken only for 'str'.
-        return b
-    if isinstance(b, bytes):
-        # In Python2, this branch is never taken ('bytes' is handled as 'str').
-        # In Python3, this is true only for 'bytes'.
-        try:
-            return b.decode("utf-8")
-        except UnicodeDecodeError:
-            # If the value is not valid Unicode, return the default
-            # repr-line encoding.
-            return str(b)
-
-    # By this point, here's what we *don't* have:
-    #
-    #  - In Python2:
-    #    - 'str' or 'bytes' (1st branch above)
-    #  - In Python3:
-    #    - 'str' (1st branch above)
-    #    - 'bytes' (2nd branch above)
-    #
-    # The last type we might expect is the Python2 'unicode' type. There is no
-    # 'unicode' type in Python3 (all the Python3 cases were already handled). In
-    # order to get a 'str' object, we need to encode the 'unicode' object.
-    try:
-        return b.encode("utf-8")
-    except AttributeError:
-        raise TypeError("not sure how to convert %s to %s" % (type(b), str))
-
-
-def to_unicode(s):
-    """Return the parameter as type which supports unicode, possibly decoding
-    it.
-
-    In Python2, this is the unicode type. In Python3 it's the str type.
-
-    """
-    if isinstance(s, bytes):
-        # In Python2, this branch is taken for both 'str' and 'bytes'.
-        # In Python3, this branch is taken only for 'bytes'.
-        return s.decode("utf-8")
-    return s
-
-
 def usable_core_count():
     """Return the number of cores the current process can use, if supported.
     Otherwise, return the total number of cores (like `os.cpu_count()`).
@@ -114,11 +44,6 @@ def usable_core_count():
     except AttributeError:
         n = os.cpu_count() or 1
 
-    # On Windows with more than 60 processes, multiprocessing's call to
-    # _winapi.WaitForMultipleObjects() prints an error and lit hangs.
-    if platform.system() == "Windows":
-        return min(n, 60)
-
     return n
 
 def abs_path_preserve_drive(path):
@@ -341,7 +266,7 @@ def executeCommand(
 
     """
     if input is not None:
-        input = to_bytes(input)
+        input = input.encode("utf-8")
     err_out = subprocess.STDOUT if redirect_stderr else subprocess.PIPE
     p = subprocess.Popen(
         command,
@@ -377,8 +302,8 @@ def killProcess():
             timerObject.cancel()
 
     # Ensure the resulting output is always of string type.
-    out = to_string(out)
-    err = "" if redirect_stderr else to_string(err)
+    out = out.decode("utf-8", errors="replace")
+    err = "" if redirect_stderr else err.decode("utf-8", errors="replace")
 
     if hitTimeOut[0]:
         raise ExecuteCommandTimeoutException(
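
With Python 2 support gone, the removed to_bytes/to_string/to_unicode helpers collapse to plain encode/decode calls, which is what the call sites above now do directly. Illustrative equivalences only (not part of the patch):

    data = "π".encode("utf-8")                     # roughly what to_bytes() did
    text = data.decode("utf-8", errors="replace")  # roughly what to_string() did
    assert text == "π"
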
diff --git a/llvm/utils/lit/tests/lit.cfg b/llvm/utils/lit/tests/lit.cfg
index 1382f7ef4ab00..922fbd6c9c872 100644
--- a/llvm/utils/lit/tests/lit.cfg
+++ b/llvm/utils/lit/tests/lit.cfg
@@ -9,6 +9,10 @@ import lit.formats
 from lit.llvm import llvm_config
 
 # Configuration file for the 'lit' test runner.
+# Note that lit can be tested in a "standalone" mode, where it is run
+# without using the LLVM build directory.
+# In this case the llvm_config object is not available, and we expect
+# tools like FileCheck to be available in the PATH.
 
 # name: The name of this test suite.
 config.name = "lit"
diff --git a/llvm/utils/lit/tests/windows-pools.py b/llvm/utils/lit/tests/windows-pools.py
new file mode 100644
index 0000000000000..6ba43ecca6284
--- /dev/null
+++ b/llvm/utils/lit/tests/windows-pools.py
@@ -0,0 +1,27 @@
+# Create a directory with 20 files and check the number of pools and workers per pool that lit will use.
+
+# RUN: rm -Rf %t.dir && mkdir -p %t.dir
+# RUN: %{python} -c "for i in range(20): open(rf'%t.dir/file{i}.txt', 'w').write('RUN:')"
+
+# RUN:  echo "import lit.formats" > %t.dir/lit.cfg
+# RUN:  echo "config.name = \"top-level-suite\"" >> %t.dir/lit.cfg
+# RUN:  echo "config.suffixes = [\".txt\"]" >> %t.dir/lit.cfg
+# RUN:  echo "config.test_format = lit.formats.ShTest()" >> %t.dir/lit.cfg
+
+
+# 15 workers per pool max, 100 workers total max: we expect lit to cap the number of workers at the number of test files
+# RUN: env "LIT_WINDOWS_MAX_WORKERS_PER_POOL=15" %{lit} -s %t.dir/ -j100 > %t.out 2>&1
+# CHECK: Using 2 pools balancing 20 workers total distributed as [10, 10]
+# CHECK: Passed: 20
+
+# 5 workers per pool max, 17 workers total max
+# RUN: env "LIT_WINDOWS_MAX_WORKERS_PER_POOL=5" %{lit} -s %t.dir/ -j17 >> %t.out 2>&1
+# CHECK: Using 4 pools balancing 17 workers total distributed as [5, 4, 4, 4]
+# CHECK: Passed: 20
+
+# 19 workers per pool max, 19 workers total max
+# RUN: env "LIT_WINDOWS_MAX_WORKERS_PER_POOL=19" %{lit} -s %t.dir/ -j19 >> %t.out 2>&1
+# CHECK-NOT: workers total distributed as
+# CHECK: Passed: 20
+
+# RUN: cat %t.out | FileCheck %s
diff --git a/llvm/utils/profcheck-xfail.txt b/llvm/utils/profcheck-xfail.txt
index 835025d1e319e..980f99687c4cc 100644
--- a/llvm/utils/profcheck-xfail.txt
+++ b/llvm/utils/profcheck-xfail.txt
@@ -17,7 +17,6 @@ CodeGen/RISCV/zmmul.ll
 CodeGen/WebAssembly/memory-interleave.ll
 CodeGen/X86/AMX/amx-low-intrinsics.ll
 CodeGen/X86/masked_gather_scatter.ll
-CodeGen/X86/nocfivalue.ll
 DebugInfo/AArch64/ir-outliner.ll
 DebugInfo/assignment-tracking/X86/hotcoldsplit.ll
 DebugInfo/Generic/block-asan.ll
@@ -148,9 +147,6 @@ Transforms/ExpandLargeFpConvert/X86/expand-large-fp-convert-ui129tofp.ll
 Transforms/ExpandMemCmp/AArch64/memcmp.ll
 Transforms/ExpandMemCmp/X86/memcmp.ll
 Transforms/ExpandMemCmp/X86/memcmp-x32.ll
-Transforms/ExpandVariadics/expand-va-intrinsic-split-linkage.ll
-Transforms/ExpandVariadics/expand-va-intrinsic-split-simple.ll
-Transforms/ExpandVariadics/intrinsics.ll
 Transforms/FixIrreducible/basic.ll
 Transforms/FixIrreducible/bug45623.ll
 Transforms/FixIrreducible/callbr.ll
@@ -472,9 +468,6 @@ Transforms/LoopDeletion/invalidate-scev-after-hoisting.ll
 Transforms/LoopIdiom/AArch64/byte-compare-index.ll
 Transforms/LoopIdiom/AArch64/find-first-byte.ll
 Transforms/LoopIdiom/RISCV/byte-compare-index.ll
-Transforms/LoopUnroll/peel-last-iteration-expansion-cost.ll
-Transforms/LoopUnroll/peel-last-iteration-with-guards.ll
-Transforms/LoopUnroll/peel-last-iteration-with-variable-trip-count.ll
 Transforms/LowerAtomic/atomic-load.ll
 Transforms/LowerAtomic/atomic-swap.ll
 Transforms/LowerConstantIntrinsics/builtin-object-size-phi.ll
@@ -505,41 +498,15 @@ Transforms/LowerSwitch/do-not-handle-impossible-values.ll
 Transforms/LowerSwitch/feature.ll
 Transforms/LowerSwitch/fold-popular-case-to-unreachable-default.ll
 Transforms/LowerSwitch/pr59316.ll
-Transforms/LowerTypeTests/aarch64-jumptable.ll
-Transforms/LowerTypeTests/blockaddress-2.ll
-Transforms/LowerTypeTests/blockaddress.ll
-Transforms/LowerTypeTests/cfi-annotation.ll
 Transforms/LowerTypeTests/cfi-coff-comdat-rename.ll
-Transforms/LowerTypeTests/cfi-direct-call1.ll
-Transforms/LowerTypeTests/cfi-icall-alias.ll
-Transforms/LowerTypeTests/cfi-nounwind-direct-call.ll
-Transforms/LowerTypeTests/cfi-unwind-direct-call.ll
-Transforms/LowerTypeTests/export-alias.ll
-Transforms/LowerTypeTests/export-cross-dso-cfi.ll
-Transforms/LowerTypeTests/export-icall.ll
-Transforms/LowerTypeTests/export-rename-local.ll
-Transforms/LowerTypeTests/export-symver.ll
-Transforms/LowerTypeTests/function-arm-thumb.ll
-Transforms/LowerTypeTests/function-disjoint.ll
-Transforms/LowerTypeTests/function-ext.ll
 Transforms/LowerTypeTests/function.ll
-Transforms/LowerTypeTests/function-thumb-bti.ll
 Transforms/LowerTypeTests/function-weak.ll
-Transforms/LowerTypeTests/icall-branch-funnel.ll
 Transforms/LowerTypeTests/import.ll
-Transforms/LowerTypeTests/nocfivalue.ll
-Transforms/LowerTypeTests/pr37625.ll
-Transforms/LowerTypeTests/section.ll
 Transforms/LowerTypeTests/simple.ll
-Transforms/LowerTypeTests/x86-jumptable.ll
-Transforms/MemCpyOpt/memset-memcpy-dbgloc.ll
-Transforms/MemCpyOpt/memset-memcpy-redundant-memset.ll
-Transforms/MemCpyOpt/opaque-ptr.ll
 Transforms/MergeFunc/2011-02-08-RemoveEqual.ll
 Transforms/MergeFunc/apply_function_attributes.ll
 Transforms/MergeFunc/call-and-invoke-with-ranges-attr.ll
 Transforms/MergeFunc/call-and-invoke-with-ranges.ll
-Transforms/MergeFunc/cfi-thunk-merging.ll
 Transforms/MergeFunc/comdat.ll
 Transforms/MergeFunc/crash-cast-arrays.ll
 Transforms/MergeFunc/crash.ll
@@ -572,10 +539,6 @@ Transforms/MergeFunc/ranges-multiple.ll
 Transforms/MergeFunc/self-referential-global.ll
 Transforms/MergeFunc/unnamed-addr-reprocessing.ll
 Transforms/MergeFunc/vector-GEP-crash.ll
-Transforms/MergeICmps/X86/alias-merge-blocks.ll
-Transforms/MergeICmps/X86/entry-block-shuffled-2.ll
-Transforms/MergeICmps/X86/entry-block-shuffled.ll
-Transforms/MergeICmps/X86/pr59740.ll
 Transforms/OpenMP/always_inline_device.ll
 Transforms/OpenMP/custom_state_machines.ll
 Transforms/OpenMP/custom_state_machines_remarks.ll
diff --git a/mlir/docs/Dialects/Linalg/OpDSL.md b/mlir/docs/Dialects/Linalg/OpDSL.md
index b892bbe427a18..37604fc17dd9b 100644
--- a/mlir/docs/Dialects/Linalg/OpDSL.md
+++ b/mlir/docs/Dialects/Linalg/OpDSL.md
@@ -311,16 +311,17 @@ An example for a rank polymorphic operation is `fill`:
 
 ```python
 @linalg_structured_op
-def fill(value=ScalarDef(T1),
-         O=TensorDef(U, output=True)):
-  O[None] = TypeFn.cast_signed(U, value)
+def fill(value=ScalarDef(T),
+         O=TensorDef(T, output=True)):
+  O[None] = value
 ```
 
-The operation sets the elements of the output tensor `O` to `value`. All
-operands are either scalars or rank zero tensors that are accessed using the
-index `None`. The operation thus performs a scalar computation that trivially
-extends to a multi-dimensional pointwise computation. As a result, we may use
-`fill` with arbitrary ranked output tensors:
+The operation sets the elements of the output tensor `O` to `value`. The value
+type must match the element type of the output tensor. All operands are either
+scalars or rank zero tensors that are accessed using the index `None`. The
+operation thus performs a scalar computation that trivially extends to a
+multi-dimensional pointwise computation. As a result, we may use `fill` with
+arbitrary ranked output tensors:
 
 ```python
 tensor_2d = tensor.EmptyOp([4, 8], f32)
diff --git a/mlir/include/mlir/Analysis/Presburger/IntegerRelation.h b/mlir/include/mlir/Analysis/Presburger/IntegerRelation.h
index f86535740fec9..f9b99121476eb 100644
--- a/mlir/include/mlir/Analysis/Presburger/IntegerRelation.h
+++ b/mlir/include/mlir/Analysis/Presburger/IntegerRelation.h
@@ -196,6 +196,7 @@ class IntegerRelation {
   inline DynamicAPInt atIneq(unsigned i, unsigned j) const {
     return inequalities(i, j);
   }
+
   /// The same, but casts to int64_t. This is unsafe and will assert-fail if the
   /// value does not fit in an int64_t.
   inline int64_t atIneq64(unsigned i, unsigned j) const {
@@ -209,6 +210,19 @@ class IntegerRelation {
     return getNumInequalities() + getNumEqualities();
   }
 
+  // Unified indexing into the constraints. Index into the inequalities
+  // if i < getNumInequalities() and into the equalities otherwise.
+  inline DynamicAPInt atConstraint(unsigned i, unsigned j) const {
+    assert(i < getNumConstraints());
+    unsigned numIneqs = getNumInequalities();
+    return i < numIneqs ? atIneq(i, j) : atEq(i - numIneqs, j);
+  }
+  inline DynamicAPInt &atConstraint(unsigned i, unsigned j) {
+    assert(i < getNumConstraints());
+    unsigned numIneqs = getNumInequalities();
+    return i < numIneqs ? atIneq(i, j) : atEq(i - numIneqs, j);
+  }
+
   unsigned getNumDomainVars() const { return space.getNumDomainVars(); }
   unsigned getNumRangeVars() const { return space.getNumRangeVars(); }
   unsigned getNumSymbolVars() const { return space.getNumSymbolVars(); }
diff --git a/mlir/include/mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h b/mlir/include/mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h
index 4c8abea680b66..48982ac6efe7c 100644
--- a/mlir/include/mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h
+++ b/mlir/include/mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h
@@ -27,7 +27,7 @@ class MMAMatrixType;
 #define GEN_PASS_DECL_CONVERTGPUOPSTONVVMOPS
 #include "mlir/Conversion/Passes.h.inc"
 
-LLVM::LLVMStructType convertMMAToLLVMType(gpu::MMAMatrixType type);
+Type convertMMAToLLVMType(gpu::MMAMatrixType type);
 
 /// Configure target to convert from the GPU dialect to NVVM.
 void configureGpuToNVVMConversionLegality(ConversionTarget &target);
diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
index e07c72b839e7c..16eaf28ddd95b 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
@@ -80,15 +80,15 @@ def AMDGPU_AddressSpaceAttr : EnumAttr<AMDGPU_Dialect, AMDGPU_AddressSpace,
   let assemblyFormat = "`<` $value `>`";
 }
 
+//===----------------------------------------------------------------------===//
+// AMDGPU Type definitions
+//===----------------------------------------------------------------------===//
+
 class AMDGPU_Type<string name, string typeMnemonic, list<Trait> traits = []>
     : TypeDef<AMDGPU_Dialect, name, traits> {
   let mnemonic = typeMnemonic;
 }
 
-//===----------------------------------------------------------------------===//
-// AMDGPU Type definitions
-//===----------------------------------------------------------------------===//
-
 def AMDGPU_TDMBaseType : AMDGPU_Type<"TDMBase", "tdm_base"> {
   let summary = "Pair of base addresses that move data between LDS and global storage.";
   let description = [{
@@ -104,6 +104,15 @@ def AMDGPU_TDMBaseType : AMDGPU_Type<"TDMBase", "tdm_base"> {
   let assemblyFormat = "`<` $elementType `>`";
 }
 
+def AMDGPU_TDMDescriptorType : AMDGPU_Type<"TDMDescriptor", "tdm_descriptor"> {
+  let summary = "Descriptors used in tensor store/load operations.";
+  let description = [{
+    This type is opaque and corresponds to the two or four descriptor groups
+    used in tensor_load_to_lds or tensor_store_from_lds.
+  }];
+
+}
+
 //===----------------------------------------------------------------------===//
 // AMDGPU Op definitions
 //===----------------------------------------------------------------------===//
@@ -1220,16 +1229,13 @@ def AMDGPU_ScaledMFMAOp :
 
 def AMDGPU_MakeDmaBaseOp :
     AMDGPU_Op<"make_dma_base", [Pure, AttrSizedOperandSegments]>,
-    Arguments<(ins
-                   Arg<AnyMemRef, "buffer to read from">:$src,
-                   Variadic<Index>:$srcIndices,
-                   Arg<AnyMemRef, "buffer to write to">:$dst,
-                   Variadic<Index>:$dstIndices)>,
+    Arguments<(ins Arg<AnyMemRef>:$global,
+                   Variadic<Index>:$global_indices,
+                   Arg<AnyMemRef>:$lds,
+                   Variadic<Index>:$lds_indices)>,
     Results<(outs AMDGPU_TDMBaseType: $base)> {
 
   // TODO:
-  // * Add verifiers such that one of the memrefs is from LDS and the other global.
-  // * Add verifiers to make sure that the type is in the correct direction.
   // * Add verifiers to make sure that the number of indices do not exceed the number of dimensions.
 
   let summary = "Pair of based addresses used when moving tiles between LDS and global memory.";
@@ -1240,12 +1246,109 @@ def AMDGPU_MakeDmaBaseOp :
     This operation creates a value corresponding to the tensor descriptor (D#) group 0
     found in TensorLoadToLDSOp and TensorStoreFromLDSOp in the rocdl dialect.
 
+    For example:
+
+    ```mlir
+      %base = amdgpu.make_dma_base %global[%idx0, %idx1], %lds[%idx2, %idx3] : memref<64x64xi32>, memref<64x64xi32, #gpu.address_space<workgroup>> -> !amdgpu.tdm_base<i32>
+      %descriptor = amdgpu.make_dma_descriptor %base globalSize [2, 2] globalStride [2, 1] sharedSize [2, 2] : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
+      amdgpu.tensor_load_to_lds %descriptor : !amdgpu.tdm_descriptor
+    ```
+
+    to
+
+    ```mlir
+      // pseudo-code
+      %global_base = llvm.extractvalue %global_memref[1]
+      %global_address = llvm.get_element_ptr ...
+
+      %lds_base = llvm.extractvalue %lds_memref[1]
+      %lds_address = llvm.get_element_ptr ...
+
+      // Definition of %base
+      %undef = llvm.mlir.undef : vector<4xi32>
+      %v0 = llvm.insertelement %c1, %undef[0] : vector<4xi32>
+      %v1 = llvm.insertelement %lds_address, %v0[1] : vector<4xi32>
+      %v2 = llvm.insertelement %global_address_low, %v1[2] : vector<4xi32>
+      %base = llvm.insertelement %global_address_high, %v2[3] : vector<4xi32>
+
+      rocdl.tensor.load.to.lds %base, %dgroup1, %dgroup2, %dgroup3 cachepolicy 0 : vector<4xi32>, vector<8xi32>
+    ```
+
     These tensor DMA operations were introduced in gfx1250.
   }];
 
   let assemblyFormat = [{
-    $src `[` $srcIndices `]` `,` $dst `[` $dstIndices `]` attr-dict `:` type($src) `,` type($dst) `to` type(results)
+    $global `[` $global_indices `]` `,` $lds `[` $lds_indices `]` attr-dict `:` type($global) `,` type($lds) `->` type(results)
   }];
+
+  let hasVerifier = 1;
+}
+
+def AMDGPU_MakeDmaDescriptorOp :
+  AMDGPU_Op<"make_dma_descriptor", [Pure, AttrSizedOperandSegments]>,
+  Arguments<(ins
+    AMDGPU_TDMBaseType: $base,
+    Variadic<Index>: $global_dynamic_sizes,
+    DenseI64ArrayAttr: $global_static_sizes,
+    Variadic<Index>: $global_dynamic_strides,
+    DenseI64ArrayAttr: $global_static_strides,
+    Variadic<Index>: $shared_dynamic_sizes,
+    DenseI64ArrayAttr: $shared_static_sizes,
+    Optional<Index>: $pad,
+    Optional<Index>: $pad_every,
+    Optional<AnyMemRef>: $atomic_barrier_address,
+    Variadic<Index>: $atomic_barrier_indices,
+    Optional<Index>: $global_increment,
+    Optional<Index>: $lds_increment,
+    Optional<Index>: $iteration_count)>,
+  Results<(outs AMDGPU_TDMDescriptorType: $desc)> {
+
+  let summary = "Make all descriptor groups needed by TensorLoadToLDS/TensorStoreFromLDS.";
+  let description = [{
+     Make all descriptor groups needed by tensor memory operations.
+
+     The $base operand corresponds to the pair of base addresses; one must be
+     an address in LDS while the other must be a global memory location.
+
+     $global_{static/dynamic}_sizes determine the size of the tensor.
+     $global_{static/dynamic}_strides determine the strides of the tensor.
+     $shared_{static/dynamic}_sizes determine the size of the tile.
+
+     Padding can be applied to the LDS address when copying from memory to LDS,
+     but not when copying from LDS to memory.
+     The values in the padded target addresses remain the same as before the operation was applied.
+
+     2D and 3D tensors may be iterated over by setting $global_increment, $lds_increment, and $iteration_count.
+     $global_increment determines how much to increment the starting global memory address per iteration in units of the $base's element type.
+     $lds_increment determines how much to increment the starting LDS address per iteration in units of the $base's element type.
+     $iteration_count determines how many times to iterate.
+
+     ```mlir
+      // Example of moving a two-dimensional tensor to LDS.
+      %base = amdgpu.make_dma_base %global[0, 0], %lds[0, 0] : memref<64x64xi32>, memref<64x64xi32, #gpu.address_space<workgroup>> -> !amdgpu.tdm_base<i32>
+      %descriptor = amdgpu.make_dma_descriptor %base globalSize [64, 64] globalStride [64, 1] sharedSize [64, 64] : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
+      amdgpu.tensor_load_to_lds %descriptor : !amdgpu.tdm_descriptor
+
+      // Example of moving a two-dimensional tensor to LDS where padding is applied after every integer.
+      %base = amdgpu.make_dma_base %global[0, 0], %lds[0, 0] : memref<32x32xi32>, memref<64x64xi32, #gpu.address_space<workgroup>> -> !amdgpu.tdm_base<i32>
+      %descriptor = amdgpu.make_dma_descriptor %base globalSize [32, 32] globalStride [32, 1] sharedSize [64, 64] padShared(%pad every %pad_every) : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
+      amdgpu.tensor_load_to_lds %descriptor : !amdgpu.tdm_descriptor
+     ```
+  }];
+
+  let assemblyFormat = [{
+    $base
+    `globalSize` custom<DynamicIndexList>($global_dynamic_sizes, $global_static_sizes)
+    `globalStride` custom<DynamicIndexList>($global_dynamic_strides, $global_static_strides)
+    `sharedSize` custom<DynamicIndexList>($shared_dynamic_sizes, $shared_static_sizes)
+    ( `padShared` `(` $pad^ `every` $pad_every `)` )?
+    ( `atomicBarrier` `(` $atomic_barrier_address^ `[` $atomic_barrier_indices `]`
+                      `:` type($atomic_barrier_address) `)`)?
+    ( `iterate` $global_increment^ `,` $lds_increment `,` $iteration_count )?
+    attr-dict `:` qualified(type($base)) `->` type(results)
+  }];
+
+  let hasVerifier = 1;
 }
 
 #endif // AMDGPU
diff --git a/mlir/include/mlir/Dialect/EmitC/IR/EmitC.td b/mlir/include/mlir/Dialect/EmitC/IR/EmitC.td
index 4c1db58da45f0..c1820904f2665 100644
--- a/mlir/include/mlir/Dialect/EmitC/IR/EmitC.td
+++ b/mlir/include/mlir/Dialect/EmitC/IR/EmitC.td
@@ -116,6 +116,37 @@ def EmitC_FileOp
   let skipDefaultBuilders = 1;
 }
 
+def EmitC_AddressOfOp : EmitC_Op<"address_of", [
+    CExpressionInterface,
+    TypesMatchWith<"input and result reference the same type", "reference", "result",
+                   "emitc::PointerType::get(::llvm::cast<emitc::LValueType>($_self).getValueType())">
+]> {
+  let summary = "Address operation";
+  let description = [{
+    This operation models the C & (address of) operator for a single operand,
+    which must be an emitc.lvalue, and returns an emitc pointer to its location.
+
+    Example:
+
+    ```mlir
+    // Custom form of applying the & operator.
+    %0 = emitc.address_of %arg0 : !emitc.lvalue<i32>
+    ```
+  }];
+  let arguments = (ins EmitC_LValueType:$reference);
+  let results = (outs EmitC_PointerType:$result);
+  let assemblyFormat = [{
+    $reference `:` qualified(type($reference)) attr-dict
+  }];
+  let hasVerifier = 1;
+
+  let extraClassDeclaration = [{
+    bool hasSideEffects() {
+      return false;
+    }
+  }];
+}
+
 def EmitC_AddOp : EmitC_BinaryOp<"add", []> {
   let summary = "Addition operation";
   let description = [{
@@ -140,7 +171,7 @@ def EmitC_AddOp : EmitC_BinaryOp<"add", []> {
 }
 
 def EmitC_ApplyOp : EmitC_Op<"apply", [CExpressionInterface]> {
-  let summary = "Apply operation";
+  let summary = "Deprecated (use address_of/dereference)";
   let description = [{
     With the `emitc.apply` operation the operators & (address of) and * (contents of)
     can be applied to a single operand.
@@ -439,6 +470,31 @@ def EmitC_ConstantOp
   }];
 }
 
+def EmitC_DereferenceOp : EmitC_Op<"dereference", [
+    TypesMatchWith<"input and result reference the same type", "pointer", "result",
+                   "emitc::LValueType::get(::llvm::cast<emitc::PointerType>($_self).getPointee())">
+]> {
+  let summary = "Dereference operation";
+  let description = [{
+    This operation models the C * (dereference) operator for a single operand,
+    which must be of !emitc.ptr<> type, and returns an !emitc.lvalue<>
+    referring to the value pointed to by the pointer.
+
+    Example:
+
+    ```mlir
+    // Custom form of the dereference operator.
+    %0 = emitc.dereference %arg0 : !emitc.ptr<i32>
+    ```
+  }];
+  let arguments = (ins EmitC_PointerType:$pointer);
+  let results = (outs EmitC_LValueType:$result);
+  let assemblyFormat = [{
+    $pointer `:` qualified(type($pointer)) attr-dict
+  }];
+  let hasVerifier = 1;
+}
+
 def EmitC_DivOp : EmitC_BinaryOp<"div", []> {
   let summary = "Division operation";
   let description = [{
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td b/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
index 860f893367203..2c29bb8a01a41 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
@@ -114,7 +114,7 @@ def GPU_MMAMatrix : DialectType<
   GPU_Dialect, IsMMAMatrixTypePred, "MMAMatrix type">;
 
 // Memref type acceptable to gpu.subgroup_mma_{load|store}_matrix ops.
-def GPU_MMAMemRef : MemRefOf<[I8, I32, F16, F32, VectorOfRankAndType<[1], [I8, I32, F16, F32]>]>;
+def GPU_MMAMemRef : MemRefOf<[I8, I32, F16, F32, F64, VectorOfRankAndType<[1], [I8, I32, F16, F32, F64]>]>;
 
 class MMAMatrixOf<list<Type> allowedTypes> :
   ContainerType<AnyTypeOf<allowedTypes>, IsMMAMatrixTypePred,
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
index a6c6038e1e224..5c7df25c58cde 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
@@ -1872,7 +1872,7 @@ def GPU_SubgroupMmaStoreMatrixOp : GPU_Op<"subgroup_mma_store_matrix",
     ```
   }];
 
-  let arguments = (ins Arg<MMAMatrixOf<[SI8, UI8, I32, F16, F32]>>:$src,
+  let arguments = (ins Arg<MMAMatrixOf<[SI8, UI8, I32, F16, F32, F64]>>:$src,
                   Arg<GPU_MMAMemRef, "",[MemWriteAt<0, FullEffect>]>:$dstMemref,
                   Variadic<Index>:$indices,
                   IndexAttr:$leadDimension,
@@ -1919,9 +1919,9 @@ def GPU_SubgroupMmaComputeOp
     ```
   }];
 
-  let arguments = (ins Arg<MMAMatrixOf<[SI8, UI8, F16, F32]>>:$opA,
-                  Arg<MMAMatrixOf<[SI8, UI8, F16, F32]>>:$opB,
-                  Arg<MMAMatrixOf<[I32, F16, F32]>>:$opC,
+  let arguments = (ins Arg<MMAMatrixOf<[SI8, UI8, F16, F32, F64]>>:$opA,
+                  Arg<MMAMatrixOf<[SI8, UI8, F16, F32, F64]>>:$opB,
+                  Arg<MMAMatrixOf<[I32, F16, F32, F64]>>:$opC,
                   OptionalAttr<UnitAttr>:$a_transpose,
                   OptionalAttr<UnitAttr>:$b_transpose);
 
diff --git a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
index 8a3d07043013e..a96d65d3fcacd 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
+++ b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
@@ -889,10 +889,7 @@ def NVVM_MBarrierArriveDropNocompleteOp : NVVM_Op<"mbarrier.arrive_drop.nocomple
   }];
 }
 
-def NVVM_MBarrierArriveExpectTxOp : NVVM_PTXBuilder_Op<"mbarrier.arrive.expect_tx">,  
-  Arguments<(ins
-    AnyTypeOf<[LLVM_PointerGeneric, LLVM_PointerShared]>:$addr,
-    I32:$txcount, PtxPredicate:$predicate)> {
+def NVVM_MBarrierArriveExpectTxOp : NVVM_PTXBuilder_Op<"mbarrier.arrive.expect_tx"> {
   let summary = "MBarrier Arrive with Expected Transaction Count";
   let description = [{
     The `nvvm.mbarrier.arrive.expect_tx` operation performs an expect-tx operation 
@@ -903,11 +900,11 @@ def NVVM_MBarrierArriveExpectTxOp : NVVM_PTXBuilder_Op<"mbarrier.arrive.expect_t
     threads within the CTA. When other threads perform corresponding acquire operations 
     (like 'mbarrier.test.wait'), they synchronize with this release pattern.
 
-    This operation first performs an expect-tx operation with the specified transaction 
-    count, then performs an arrive-on operation with an implicit count of 1. The 
-    expect-tx operation increases the tx-count of the *mbarrier object* by the specified 
-    expectCount value, setting the current phase to expect and tracks the completion 
-    of additional asynchronous transactions.
+    This operation first performs an expect-tx operation with the specified transaction
+    count, then performs an arrive-on operation with an implicit count of 1. The
+    expect-tx operation increases the expect-count of the *mbarrier object* by the
+    specified value (i.e. `txcount`), setting the current phase to expect and
+    tracking the completion of additional asynchronous transactions.
 
     The operation takes the following operands:
     - `addr`: A pointer to the memory location of the *mbarrier object*. Uses generic 
@@ -915,11 +912,86 @@ def NVVM_MBarrierArriveExpectTxOp : NVVM_PTXBuilder_Op<"mbarrier.arrive.expect_t
     - `txcount`: An unsigned integer specifying the expected transaction count 
       for the expect-tx operation. This represents the number of asynchronous transactions 
       expected to complete before the barrier phase completes.
-    - `predicate`: Optional predicate for conditional execution.
+    - `scope`: This specifies the set of threads that directly observe the memory
+      synchronizing effect of the `mbarrier.test.wait` operation.
+    - `relaxed`: When set to true, the `arrive` operation has relaxed memory semantics
+      and does not provide any ordering or visibility guarantees.
+    - `predicate`: Optional predicate for conditional execution used only when lowering to
+      inline-ptx.
 
-    [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-arrive)
+    [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-arrive-drop)
+  }];
+
+  let results = (outs Optional<I64>:$res);
+  let arguments = (ins
+    AnyTypeOf<[LLVM_PointerGeneric, LLVM_PointerShared, LLVM_PointerSharedCluster]>:$addr,
+    I32:$txcount,
+    DefaultValuedAttr<MemScopeKindAttr, "MemScopeKind::CTA">:$scope,
+    DefaultValuedAttr<BoolAttr, "false">:$relaxed,
+    PtxPredicate:$predicate);
+
+  let assemblyFormat = "$addr `,` $txcount (`,` `predicate` `=` $predicate^)? attr-dict `:` type(operands) (`->` type($res)^)?";
+  let hasVerifier = 1;
+
+  let extraClassDeclaration = [{
+    bool hasIntrinsic() { return !getPredicate(); }
+
+    bool getAsmValues(RewriterBase &rewriter,
+      llvm::SmallVectorImpl<std::pair<mlir::Value, mlir::NVVM::PTXRegisterMod>> &asmValues);
+
+    static mlir::NVVM::IDArgPair
+    getIntrinsicIDAndArgs(Operation &op, LLVM::ModuleTranslation &mt,
+                          llvm::IRBuilderBase& builder);
+  }];
+
+  string llvmBuilder = [{
+    auto [id, args] = NVVM::MBarrierArriveExpectTxOp::getIntrinsicIDAndArgs(
+                      *op, moduleTranslation, builder);
+
+    int addrSpace = llvm::cast<LLVMPointerType>(op.getAddr().getType()).getAddressSpace();
+    if (addrSpace != NVVM::NVVMMemorySpace::SharedCluster)
+      $res = createIntrinsicCall(builder, id, args);
+    else
+      createIntrinsicCall(builder, id, args);
+  }];
+}
+
+def NVVM_MBarrierArriveDropExpectTxOp : NVVM_Op<"mbarrier.arrive_drop.expect_tx"> {
+  let summary = "MBarrier arrive_drop with expected transaction count";
+  let description = [{
+    The `nvvm.mbarrier.arrive_drop.expect_tx` operation is similar to the
+    `nvvm.mbarrier.arrive.expect_tx` operation except that it performs an
+    `arrive_drop` operation instead of only an `arrive` operation.
+
+    [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-arrive-drop)
+  }];
+
+  let results = (outs Optional<I64>:$res);
+  let arguments = (ins
+    AnyTypeOf<[LLVM_PointerGeneric, LLVM_PointerShared, LLVM_PointerSharedCluster]>:$addr,
+    I32:$txcount,
+    DefaultValuedAttr<MemScopeKindAttr, "MemScopeKind::CTA">:$scope,
+    DefaultValuedAttr<BoolAttr, "false">:$relaxed);
+
+  let assemblyFormat = "$addr `,` $txcount attr-dict `:` type(operands) (`->` type($res)^)?";
+  let hasVerifier = 1;
+
+  let extraClassDeclaration = [{
+    static mlir::NVVM::IDArgPair
+    getIntrinsicIDAndArgs(Operation &op, LLVM::ModuleTranslation &mt,
+                          llvm::IRBuilderBase& builder);
+  }];
+
+  string llvmBuilder = [{
+    auto [id, args] = NVVM::MBarrierArriveDropExpectTxOp::getIntrinsicIDAndArgs(
+                      *op, moduleTranslation, builder);
+
+    int addrSpace = llvm::cast<LLVMPointerType>(op.getAddr().getType()).getAddressSpace();
+    if (addrSpace != NVVM::NVVMMemorySpace::SharedCluster)
+      $res = createIntrinsicCall(builder, id, args);
+    else
+      createIntrinsicCall(builder, id, args);
   }];
-  let assemblyFormat = "$addr `,` $txcount (`,` `predicate` `=` $predicate^)? attr-dict `:` type(operands)";
 }
 
 def NVVM_MBarrierTryWaitParityOp : NVVM_PTXBuilder_Op<"mbarrier.try_wait.parity">,  
@@ -980,31 +1052,35 @@ def NVVM_MBarrierTryWaitParityOp : NVVM_PTXBuilder_Op<"mbarrier.try_wait.parity"
   let assemblyFormat = "$addr `,` $phase `,` $ticks attr-dict `:` type(operands)";
 }
 
-def NVVM_MBarrierTestWaitOp : NVVM_Op<"mbarrier.test.wait">,
-  Results<(outs I1:$res)>,
-  Arguments<(ins AnyTypeOf<[LLVM_PointerGeneric, LLVM_PointerShared]>:$addr,
-                 I64:$state)> {
+def NVVM_MBarrierTestWaitOp : NVVM_Op<"mbarrier.test.wait"> {
   let summary = "MBarrier Non-Blocking Test Wait Operation";
   let description = [{
-    The `nvvm.mbarrier.test.wait` operation performs a non-blocking test for the 
+    The `nvvm.mbarrier.test.wait` operation performs a non-blocking test for the
     completion of a specific phase of an *mbarrier object*. It uses the default
-    `.acquire.cta` semantics. This acquire pattern establishes memory ordering for 
-    operations occurring in program order after this wait instruction by making 
-    operations from other threads in the CTA visible to subsequent operations in the current 
-    thread. When this wait completes, it synchronizes with the corresponding release 
-    pattern from the `mbarrier.arrive` operation, establishing memory ordering within 
+    `.acquire.cta` semantics. This acquire pattern establishes memory ordering for
+    operations occurring in program order after this wait instruction by making
+    operations from other threads in the CTA visible to subsequent operations in the current
+    thread. When this wait completes, it synchronizes with the corresponding release
+    pattern from the `mbarrier.arrive` operation, establishing memory ordering within
     the CTA.
 
-    This operation tests whether the mbarrier phase specified by the state operand 
-    has completed. It is a non-blocking instruction that immediately returns the 
+    This operation tests whether the mbarrier phase specified by the state operand
+    has completed. It is a non-blocking instruction that immediately returns the
     completion status without suspending the executing thread.
 
     The operation takes the following operands:
-    - `addr`: A pointer to the memory location of the *mbarrier object*. Uses generic 
+    - `addr`: A pointer to the memory location of the *mbarrier object*. Uses generic
       addressing, but the address must still be in the shared memory space.
-    - `state`: An opaque value returned by a previous `mbarrier.arrive` 
-      operation on the same *mbarrier object* during the current or immediately 
-      preceding phase.
+    - `stateOrPhase`: This argument represents a `state` when it is a 64-bit value
+      and represents a `phase` when it is a 32-bit value. The `state` is an opaque
+      value returned by a previous `mbarrier.arrive` operation on the same
+      *mbarrier object* during the current or immediately preceding phase.
+      The `phase` is an integer specifying the phase parity (0 or 1).
+      Even phases have parity 0, odd phases have parity 1.
+    - `scope`: This specifies the set of threads that directly observe the memory
+      synchronizing effect of the `mbarrier.test.wait` operation.
+    - `relaxed`: When set to true, the `test.wait` operation has relaxed memory semantics
+      and does not provide any ordering or visibility guarantees.
 
     The operation returns a boolean value indicating whether the specified phase 
     has completed:
@@ -1031,7 +1107,15 @@ def NVVM_MBarrierTestWaitOp : NVVM_Op<"mbarrier.test.wait">,
     [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-mbarrier-test-wait-try-wait)
   }];
 
-  let assemblyFormat = "$addr `,` $state attr-dict `:` type(operands) `->` type($res)";
+  let results = (outs I1:$res);
+  let arguments = (ins
+    AnyTypeOf<[LLVM_PointerGeneric, LLVM_PointerShared]>:$addr,
+    AnyTypeOf<[I64, I32]>:$stateOrPhase,
+    DefaultValuedAttr<MemScopeKindAttr, "MemScopeKind::CTA">:$scope,
+    DefaultValuedAttr<BoolAttr, "false">:$relaxed);
+
+  let assemblyFormat = "$addr `,` $stateOrPhase attr-dict `:` type(operands) `->` type($res)";
+  let hasVerifier = 1;
 
   let extraClassDeclaration = [{
     static mlir::NVVM::IDArgPair
diff --git a/mlir/include/mlir/Dialect/Linalg/IR/LinalgNamedStructuredOps.yaml b/mlir/include/mlir/Dialect/Linalg/IR/LinalgNamedStructuredOps.yaml
index 9aae1b850c3a0..521afc991063f 100644
--- a/mlir/include/mlir/Dialect/Linalg/IR/LinalgNamedStructuredOps.yaml
+++ b/mlir/include/mlir/Dialect/Linalg/IR/LinalgNamedStructuredOps.yaml
@@ -6054,9 +6054,9 @@ metadata: !LinalgOpMetadata
   doc: |-
     Fills the output tensor with the given value.
 
-    Works for arbitrary ranked output tensors since the operation performs scalar
-    accesses only and is thus rank polymorphic. Numeric casting is performed on
-    the value operand, promoting it to the same data type as the output.
+    Works for arbitrary ranked output tensors since the operation performs
+    scalar accesses only and is thus rank polymorphic. The value operand
+    type must match the element type of the output.
   implements:
   - LinalgFillOpInterface
   defines:
@@ -6066,11 +6066,11 @@ structured_op: !LinalgStructuredOpConfig
   - !LinalgOperandDefConfig
     name: value
     kind: scalar
-    type_var: T1
+    type_var: T
   - !LinalgOperandDefConfig
     name: O
     kind: output_tensor
-    type_var: U
+    type_var: T
     shape_map: affine_map<() -> ()>
   indexing_maps: !LinalgIndexingMapsConfig
     static_indexing_maps:
@@ -6081,13 +6081,7 @@ structured_op: !LinalgStructuredOpConfig
   - !ScalarAssign
     arg: O
     value: !ScalarExpression
-      scalar_fn:
-        kind: type
-        fn_name: cast_signed
-        type_var: U
-        operands:
-        - !ScalarExpression
-          scalar_arg: value
+      scalar_arg: value
 --- !LinalgOpConfig
 metadata: !LinalgOpMetadata
   name: fill_rng_2d
diff --git a/mlir/include/mlir/Dialect/OpenACC/OpenACCOps.td b/mlir/include/mlir/Dialect/OpenACC/OpenACCOps.td
index b8317b4a1d2ec..77d1a6f8d53b5 100644
--- a/mlir/include/mlir/Dialect/OpenACC/OpenACCOps.td
+++ b/mlir/include/mlir/Dialect/OpenACC/OpenACCOps.td
@@ -3232,6 +3232,18 @@ def OpenACC_RoutineOp : OpenACC_Op<"routine", [IsolatedFromAbove]> {
       OptionalAttr<DeviceTypeArrayAttr>:$gangDimDeviceType);
 
   let extraClassDeclaration = [{
+    // 'create' function to generate an 'empty' routine.
+    static RoutineOp create(::mlir::OpBuilder & builder,
+                            ::mlir::Location location,
+                            ::llvm::StringRef sym_name,
+                            mlir::SymbolRefAttr func_name, bool implicit) {
+      return create(builder, location, sym_name, func_name, /*bindIDName=*/{},
+                    /*bindStrName=*/{}, /*bindIdNameDeviceType=*/{},
+                    /*bindStrnameDeviceType=*/{}, /*worker=*/{}, /*vector=*/{},
+                    /*seq=*/{}, /*nohost=*/false, implicit, /*gang=*/{},
+                    /*gangDim=*/{}, /*gangDimDeviceType=*/{});
+    }
+
     static StringRef getGangDimKeyword() { return "dim"; }
 
     /// Return true if the op has the worker attribute for the
@@ -3267,6 +3279,13 @@ def OpenACC_RoutineOp : OpenACC_Op<"routine", [IsolatedFromAbove]> {
 
     std::optional<::std::variant<mlir::SymbolRefAttr, mlir::StringAttr>> getBindNameValue();
     std::optional<::std::variant<mlir::SymbolRefAttr, mlir::StringAttr>> getBindNameValue(mlir::acc::DeviceType deviceType);
+
+    // Add an entry to the 'seq' attribute for each additional device type.
+    void addSeq(MLIRContext *, llvm::ArrayRef<DeviceType>);
+    // Add an entry to the 'vector' attribute for each additional device type.
+    void addVector(MLIRContext *, llvm::ArrayRef<DeviceType>);
+    // Add an entry to the 'worker' attribute for each additional device type.
+    void addWorker(MLIRContext *, llvm::ArrayRef<DeviceType>);
   }];
 
   let assemblyFormat = [{
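For illustration, a minimal sketch (not part of the patch) of using the new convenience `create` overload together with `addSeq`; `builder`, `loc`, and `funcSym` are assumed to be in scope, and the symbol name and device type below are made up:

```cpp
// Build an implicit acc.routine and mark it `seq` for one device type.
mlir::acc::RoutineOp routine = mlir::acc::RoutineOp::create(
    builder, loc, /*sym_name=*/"acc_routine_0", /*func_name=*/funcSym,
    /*implicit=*/true);
routine.addSeq(builder.getContext(), {mlir::acc::DeviceType::Nvidia});
```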
diff --git a/mlir/include/mlir/Dialect/OpenACC/OpenACCTypeInterfaces.td b/mlir/include/mlir/Dialect/OpenACC/OpenACCTypeInterfaces.td
index d1bbc7f206ce6..3f11bf6fbfce3 100644
--- a/mlir/include/mlir/Dialect/OpenACC/OpenACCTypeInterfaces.td
+++ b/mlir/include/mlir/Dialect/OpenACC/OpenACCTypeInterfaces.td
@@ -176,6 +176,50 @@ def OpenACC_PointerLikeTypeInterface : TypeInterface<"PointerLikeType"> {
         return false;
       }]
     >,
+    InterfaceMethod<
+      /*description=*/[{
+        Generates a load operation from the pointer-like type. This dereferences
+        the pointer and returns the loaded value.
+
+        The `srcPtr` parameter is the pointer to load from. If the current type is
+        represented in a way that it does not capture the pointee type, `valueType`
+        must be passed in to provide the necessary type information.
+
+        Returns the loaded value, or an empty Value if load generation failed.
+      }],
+      /*retTy=*/"::mlir::Value",
+      /*methodName=*/"genLoad",
+      /*args=*/(ins "::mlir::OpBuilder &":$builder,
+                    "::mlir::Location":$loc,
+                    "::mlir::TypedValue<::mlir::acc::PointerLikeType>":$srcPtr,
+                    "::mlir::Type":$valueType),
+      /*methodBody=*/"",
+      /*defaultImplementation=*/[{
+        return {};
+      }]
+    >,
+    InterfaceMethod<
+      /*description=*/[{
+        Generates a store operation to the pointer-like type. This stores a value
+        to the memory location pointed to by the pointer.
+
+        The `destPtr` parameter is the pointer to store to. The `valueToStore`
+        parameter is the value to be stored. The type information is derived from
+        the valueToStore parameter itself.
+
+        Returns true if store was successfully generated, false otherwise.
+      }],
+      /*retTy=*/"bool",
+      /*methodName=*/"genStore",
+      /*args=*/(ins "::mlir::OpBuilder &":$builder,
+                    "::mlir::Location":$loc,
+                    "::mlir::Value":$valueToStore,
+                    "::mlir::TypedValue<::mlir::acc::PointerLikeType>":$destPtr),
+      /*methodBody=*/"",
+      /*defaultImplementation=*/[{
+        return false;
+      }]
+    >,
   ];
 }
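For illustration, a minimal sketch (not part of the patch) of driving the new `genLoad`/`genStore` hooks; `builder`, `loc`, `srcPtr`, `dstPtr` (both `TypedValue<acc::PointerLikeType>`), and `valueType` are assumed to be in scope:

```cpp
// Load through the source pointer, then store into the destination pointer.
// Either hook may decline (empty Value / false), so keep a fallback path.
auto srcTy = mlir::cast<mlir::acc::PointerLikeType>(srcPtr.getType());
auto dstTy = mlir::cast<mlir::acc::PointerLikeType>(dstPtr.getType());
mlir::Value loaded = srcTy.genLoad(builder, loc, srcPtr, valueType);
if (!loaded || !dstTy.genStore(builder, loc, loaded, dstPtr)) {
  // Fall back to a dialect-specific copy.
}
```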
 
diff --git a/mlir/include/mlir/Dialect/OpenACC/Transforms/Passes.td b/mlir/include/mlir/Dialect/OpenACC/Transforms/Passes.td
index 713aaabee65f0..b37cc282d4555 100644
--- a/mlir/include/mlir/Dialect/OpenACC/Transforms/Passes.td
+++ b/mlir/include/mlir/Dialect/OpenACC/Transforms/Passes.td
@@ -136,4 +136,20 @@ def ACCImplicitRoutine : Pass<"acc-implicit-routine", "mlir::ModuleOp"> {
   ];
 }
 
+def ACCLegalizeSerial : Pass<"acc-legalize-serial", "mlir::func::FuncOp"> {
+  let summary = "Legalize OpenACC serial constructs";
+  let description = [{
+    This pass converts `acc.serial` constructs into `acc.parallel` constructs
+    with `num_gangs(1)`, `num_workers(1)`, and `vector_length(1)`.
+
+    This transformation simplifies processing of acc regions by unifying the
+    handling of serial and parallel constructs. Since an OpenACC serial region
+    executes sequentially (like a parallel region with a single gang, worker,
+    and vector), this conversion is semantically equivalent while enabling code
+    reuse in later compilation stages.
+  }];
+  let dependentDialects = ["mlir::acc::OpenACCDialect",
+      "mlir::arith::ArithDialect"];
+}
+
 #endif // MLIR_DIALECT_OPENACC_TRANSFORMS_PASSES
diff --git a/mlir/include/mlir/Dialect/SPIRV/IR/SPIRVBase.td b/mlir/include/mlir/Dialect/SPIRV/IR/SPIRVBase.td
index 7b363fac6e627..ecbbf39a534e1 100644
--- a/mlir/include/mlir/Dialect/SPIRV/IR/SPIRVBase.td
+++ b/mlir/include/mlir/Dialect/SPIRV/IR/SPIRVBase.td
@@ -792,7 +792,7 @@ def SPIRV_C_FPGABufferLocationINTEL                     : I32EnumAttrCase<"FPGAB
     Extension<[SPV_INTEL_fpga_buffer_location]>
   ];
 }
-def SPIRV_C_ArbitraryPrecisionFixedPointINTEL           : I32EnumAttrCase<"ArbitraryPrecisionFixedPointINTEL", 5922> {
+def SPIRV_C_ArbitraryPrecisionFixedPointINTEL          : I32EnumAttrCase<"ArbitraryPrecisionFixedPointINTEL", 5922> {
   list<Availability> availability = [
     Extension<[SPV_INTEL_arbitrary_precision_fixed_point]>
   ];
diff --git a/mlir/include/mlir/Transforms/Passes.td b/mlir/include/mlir/Transforms/Passes.td
index 28b4a01cf0ecd..55addfdb693e4 100644
--- a/mlir/include/mlir/Transforms/Passes.td
+++ b/mlir/include/mlir/Transforms/Passes.td
@@ -248,6 +248,7 @@ def RemoveDeadValues : Pass<"remove-dead-values"> {
     ```
   }];
   let constructor = "mlir::createRemoveDeadValuesPass()";
+  let dependentDialects = ["ub::UBDialect"];
 }
 
 def PrintIRPass : Pass<"print-ir"> {
diff --git a/mlir/lib/Analysis/Presburger/Barvinok.cpp b/mlir/lib/Analysis/Presburger/Barvinok.cpp
index 75d592e976edf..c31b27794f01e 100644
--- a/mlir/lib/Analysis/Presburger/Barvinok.cpp
+++ b/mlir/lib/Analysis/Presburger/Barvinok.cpp
@@ -178,13 +178,13 @@ mlir::presburger::detail::solveParametricEquations(FracMatrix equations) {
   for (unsigned i = 0; i < d; ++i) {
     // First ensure that the diagonal element is nonzero, by swapping
     // it with a row that is non-zero at column i.
-    if (equations(i, i) != 0)
-      continue;
-    for (unsigned j = i + 1; j < d; ++j) {
-      if (equations(j, i) == 0)
-        continue;
-      equations.swapRows(j, i);
-      break;
+    if (equations(i, i) == 0) {
+      for (unsigned j = i + 1; j < d; ++j) {
+        if (equations(j, i) == 0)
+          continue;
+        equations.swapRows(j, i);
+        break;
+      }
     }
 
     Fraction diagElement = equations(i, i);
diff --git a/mlir/lib/Analysis/Presburger/IntegerRelation.cpp b/mlir/lib/Analysis/Presburger/IntegerRelation.cpp
index 949fc2db79809..188ee0bb91b5c 100644
--- a/mlir/lib/Analysis/Presburger/IntegerRelation.cpp
+++ b/mlir/lib/Analysis/Presburger/IntegerRelation.cpp
@@ -1112,15 +1112,29 @@ unsigned IntegerRelation::gaussianEliminateVars(unsigned posStart,
   return posLimit - posStart;
 }
 
+static std::optional<unsigned>
+findEqualityWithNonZeroAfterRow(IntegerRelation &rel, unsigned fromRow,
+                                unsigned colIdx) {
+  assert(fromRow < rel.getNumEqualities() && colIdx < rel.getNumCols() &&
+         "position out of bounds");
+  for (unsigned rowIdx = fromRow, e = rel.getNumEqualities(); rowIdx < e;
+       ++rowIdx) {
+    if (rel.atEq(rowIdx, colIdx) != 0)
+      return rowIdx;
+  }
+  return std::nullopt;
+}
+
 bool IntegerRelation::gaussianEliminate() {
   gcdTightenInequalities();
   unsigned firstVar = 0, vars = getNumVars();
   unsigned nowDone, eqs;
   std::optional<unsigned> pivotRow;
   for (nowDone = 0, eqs = getNumEqualities(); nowDone < eqs; ++nowDone) {
-    // Finds the first non-empty column.
+    // Finds the first column with a nonzero entry in an equality that has
+    // not yet been processed.
     for (; firstVar < vars; ++firstVar) {
-      if ((pivotRow = findConstraintWithNonZeroAt(firstVar, /*isEq=*/true)))
+      if ((pivotRow =
+               findEqualityWithNonZeroAfterRow(*this, nowDone, firstVar)))
         break;
     }
     // The matrix has been normalized to row echelon form.
@@ -1143,6 +1157,10 @@ bool IntegerRelation::gaussianEliminate() {
       inequalities.normalizeRow(i);
     }
     gcdTightenInequalities();
+
+    // The column is finished. Tell the next iteration to start at the next
+    // column.
+    firstVar++;
   }
 
   // No redundant rows.
diff --git a/mlir/lib/Analysis/Presburger/Matrix.cpp b/mlir/lib/Analysis/Presburger/Matrix.cpp
index bb6056487512a..83a2c280c3d4e 100644
--- a/mlir/lib/Analysis/Presburger/Matrix.cpp
+++ b/mlir/lib/Analysis/Presburger/Matrix.cpp
@@ -255,20 +255,13 @@ void Matrix<T>::fillRow(unsigned row, const T &value) {
 }
 
 // moveColumns is implemented by moving the columns adjacent to the source range
-// to their final position. When moving right (i.e. dstPos > srcPos), the range
-// of the adjacent columns is [srcPos + num, dstPos + num). When moving left
-// (i.e. dstPos < srcPos) the range of the adjacent columns is [dstPos, srcPos).
-// First, zeroed out columns are inserted in the final positions of the adjacent
-// columns. Then, the adjacent columns are moved to their final positions by
-// swapping them with the zeroed columns. Finally, the now zeroed adjacent
-// columns are deleted.
+// to their final position.
 template <typename T>
 void Matrix<T>::moveColumns(unsigned srcPos, unsigned num, unsigned dstPos) {
   if (num == 0)
     return;
 
-  int offset = dstPos - srcPos;
-  if (offset == 0)
+  if (dstPos == srcPos)
     return;
 
   assert(srcPos + num <= getNumColumns() &&
@@ -276,23 +269,19 @@ void Matrix<T>::moveColumns(unsigned srcPos, unsigned num, unsigned dstPos) {
   assert(dstPos + num <= getNumColumns() &&
          "move destination range exceeds matrix columns");
 
-  unsigned insertCount = offset > 0 ? offset : -offset;
-  unsigned finalAdjStart = offset > 0 ? srcPos : srcPos + num;
-  unsigned curAdjStart = offset > 0 ? srcPos + num : dstPos;
-  // TODO: This can be done using std::rotate.
-  // Insert new zero columns in the positions where the adjacent columns are to
-  // be moved.
-  insertColumns(finalAdjStart, insertCount);
-  // Update curAdjStart if insertion of new columns invalidates it.
-  if (finalAdjStart < curAdjStart)
-    curAdjStart += insertCount;
-
-  // Swap the adjacent columns with inserted zero columns.
-  for (unsigned i = 0; i < insertCount; ++i)
-    swapColumns(finalAdjStart + i, curAdjStart + i);
-
-  // Delete the now redundant zero columns.
-  removeColumns(curAdjStart, insertCount);
+  unsigned numRows = getNumRows();
+  // std::rotate(start, middle, end) permutes the elements of [start, end) to
+  // [middle, end) + [start, middle). NOTE: at() asserts on out-of-range
+  // indices, so end pointers such as &at(i, srcPos + num) would trigger an
+  // assert; they are formed as &at(i, srcPos) + num instead.
+  if (dstPos > srcPos) {
+    for (unsigned i = 0; i < numRows; ++i) {
+      std::rotate(&at(i, srcPos), &at(i, srcPos) + num, &at(i, dstPos) + num);
+    }
+    return;
+  }
+  for (unsigned i = 0; i < numRows; ++i) {
+    std::rotate(&at(i, dstPos), &at(i, srcPos), &at(i, srcPos) + num);
+  }
 }
 
 template <typename T>
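For illustration, the right-move case of the rewritten `moveColumns` reduces to a single `std::rotate` per row. A standalone sketch (not part of the patch) with made-up values:

```cpp
#include <algorithm>
#include <cstdio>

int main() {
  // One row with 6 columns; move the 2-column block starting at column 1 so
  // that it starts at column 3 (dstPos > srcPos).
  int row[6] = {0, 1, 2, 3, 4, 5};
  unsigned srcPos = 1, num = 2, dstPos = 3;
  std::rotate(row + srcPos, row + srcPos + num, row + dstPos + num);
  for (int v : row)
    std::printf("%d ", v); // prints: 0 3 4 1 2 5
  return 0;
}
```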
diff --git a/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp b/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
index b9a5e7d7f6eac..2b6938712dad2 100644
--- a/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
+++ b/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
@@ -2264,6 +2264,77 @@ struct AMDGPUPermlaneLowering : public ConvertOpToLLVMPattern<PermlaneSwapOp> {
   }
 };
 
+struct AMDGPUMakeDmaBaseLowering
+    : public ConvertOpToLLVMPattern<MakeDmaBaseOp> {
+  using ConvertOpToLLVMPattern::ConvertOpToLLVMPattern;
+
+  AMDGPUMakeDmaBaseLowering(const LLVMTypeConverter &converter, Chipset chipset)
+      : ConvertOpToLLVMPattern<MakeDmaBaseOp>(converter), chipset(chipset) {}
+  Chipset chipset;
+
+  LogicalResult
+  matchAndRewrite(MakeDmaBaseOp op, OpAdaptor adaptor,
+                  ConversionPatternRewriter &rewriter) const override {
+    if (chipset < kGfx1250)
+      return op->emitOpError("make_dma_base is only supported on gfx1250");
+
+    Location loc = op.getLoc();
+
+    ValueRange ldsIndices = adaptor.getLdsIndices();
+    Value lds = adaptor.getLds();
+    auto ldsMemRefType = cast<MemRefType>(op.getLds().getType());
+
+    Value ldsPtr =
+        getStridedElementPtr(rewriter, loc, ldsMemRefType, lds, ldsIndices);
+
+    ValueRange globalIndices = adaptor.getGlobalIndices();
+    Value global = adaptor.getGlobal();
+    auto globalMemRefType = cast<MemRefType>(op.getGlobal().getType());
+
+    Value globalPtr = getStridedElementPtr(rewriter, loc, globalMemRefType,
+                                           global, globalIndices);
+
+    Type i32 = rewriter.getI32Type();
+    Type i64 = rewriter.getI64Type();
+
+    Value castForLdsAddr = LLVM::PtrToIntOp::create(rewriter, loc, i32, ldsPtr);
+    Value castForGlobalAddr =
+        LLVM::PtrToIntOp::create(rewriter, loc, i64, globalPtr);
+
+    Value lowHalf =
+        LLVM::TruncOp::create(rewriter, loc, i32, castForGlobalAddr);
+
+    Value shift = LLVM::LShrOp::create(rewriter, loc, castForGlobalAddr,
+                                       createI64Constant(rewriter, loc, 32));
+
+    Value highHalf = LLVM::TruncOp::create(rewriter, loc, i32, shift);
+
+    Value mask = createI32Constant(rewriter, loc, (1ull << 25) - 1);
+    Value validHighHalf = LLVM::AndOp::create(rewriter, loc, highHalf, mask);
+
+    Value typeField = createI32Constant(rewriter, loc, 2 << 30);
+    Value highHalfPlusType =
+        LLVM::OrOp::create(rewriter, loc, validHighHalf, typeField);
+
+    Value c0 = createI32Constant(rewriter, loc, 0);
+    Value c1 = createI32Constant(rewriter, loc, 1);
+    Value c2 = createI32Constant(rewriter, loc, 2);
+    Value c3 = createI32Constant(rewriter, loc, 3);
+
+    Type v4i32 = this->typeConverter->convertType(VectorType::get(4, i32));
+    Value result = LLVM::PoisonOp::create(rewriter, loc, v4i32);
+    result = LLVM::InsertElementOp::create(rewriter, loc, result, c1, c0);
+    result = LLVM::InsertElementOp::create(rewriter, loc, result,
+                                           castForLdsAddr, c1);
+    result = LLVM::InsertElementOp::create(rewriter, loc, result, lowHalf, c2);
+    result = LLVM::InsertElementOp::create(rewriter, loc, result,
+                                           highHalfPlusType, c3);
+
+    rewriter.replaceOp(op, result);
+    return success();
+  }
+};
+
 struct ConvertAMDGPUToROCDLPass
     : public impl::ConvertAMDGPUToROCDLPassBase<ConvertAMDGPUToROCDLPass> {
   using Base::Base;
@@ -2278,6 +2349,10 @@ struct ConvertAMDGPUToROCDLPass
 
     RewritePatternSet patterns(ctx);
     LLVMTypeConverter converter(ctx);
+    converter.addConversion([&](TDMBaseType type) -> Type {
+      Type i32 = IntegerType::get(type.getContext(), 32);
+      return converter.convertType(VectorType::get(4, i32));
+    });
     populateAMDGPUToROCDLConversionPatterns(converter, patterns, *maybeChipset);
     LLVMConversionTarget target(getContext());
     target.addIllegalDialect<::mlir::amdgpu::AMDGPUDialect>();
@@ -2333,6 +2408,7 @@ void mlir::populateAMDGPUToROCDLConversionPatterns(LLVMTypeConverter &converter,
            ScaledExtPackedOpLowering, PackedScaledTruncOpLowering,
            PackedTrunc2xFp8OpLowering, PackedStochRoundFp8OpLowering,
            GatherToLDSOpLowering, TransposeLoadOpLowering,
-           AMDGPUPermlaneLowering>(converter, chipset);
+           AMDGPUPermlaneLowering, AMDGPUMakeDmaBaseLowering>(converter,
+                                                              chipset);
   patterns.add<AMDGPUSwizzleBitModeLowering>(converter);
 }
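For illustration, the bit layout that `AMDGPUMakeDmaBaseLowering` produces for the global-address lanes can be reproduced with plain integer arithmetic. A standalone sketch (not part of the patch) with a made-up address:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  uint64_t globalAddr = 0x00007f123456789aULL;  // hypothetical address
  uint32_t lane2 = static_cast<uint32_t>(globalAddr);  // low 32 bits
  uint32_t high = static_cast<uint32_t>(globalAddr >> 32);
  uint32_t validHigh = high & ((1u << 25) - 1);  // keep only the low 25 bits
  uint32_t lane3 = validHigh | (2u << 30);       // OR in the type field
  std::printf("lane2=0x%08x lane3=0x%08x\n", lane2, lane3);
  return 0;
}
```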
diff --git a/mlir/lib/Conversion/ArithToAPFloat/ArithToAPFloat.cpp b/mlir/lib/Conversion/ArithToAPFloat/ArithToAPFloat.cpp
index 81fbdb1611deb..5c68236526b7d 100644
--- a/mlir/lib/Conversion/ArithToAPFloat/ArithToAPFloat.cpp
+++ b/mlir/lib/Conversion/ArithToAPFloat/ArithToAPFloat.cpp
@@ -41,15 +41,17 @@ static FuncOp createFnDecl(OpBuilder &b, SymbolOpInterface symTable,
 }
 
 /// Helper function to look up or create the symbol for a runtime library
-/// function with the given parameter types. Always returns an int64_t.
+/// function with the given parameter types. Returns an int64_t, unless a
+/// different result type is specified.
 static FailureOr<FuncOp>
 lookupOrCreateApFloatFn(OpBuilder &b, SymbolOpInterface symTable,
                         StringRef name, TypeRange paramTypes,
-                        SymbolTableCollection *symbolTables = nullptr) {
-  auto i64Type = IntegerType::get(symTable->getContext(), 64);
-
+                        SymbolTableCollection *symbolTables = nullptr,
+                        Type resultType = {}) {
+  if (!resultType)
+    resultType = IntegerType::get(symTable->getContext(), 64);
   std::string funcName = (llvm::Twine("_mlir_apfloat_") + name).str();
-  auto funcT = FunctionType::get(b.getContext(), paramTypes, {i64Type});
+  auto funcT = FunctionType::get(b.getContext(), paramTypes, {resultType});
   FailureOr<FuncOp> func =
       lookupFnDecl(symTable, funcName, funcT, symbolTables);
   // Failed due to type mismatch.
@@ -308,6 +310,188 @@ struct IntToFpConversion final : OpRewritePattern<OpTy> {
   bool isUnsigned;
 };
 
+struct CmpFOpToAPFloatConversion final : OpRewritePattern<arith::CmpFOp> {
+  CmpFOpToAPFloatConversion(MLIRContext *context, SymbolOpInterface symTable,
+                            PatternBenefit benefit = 1)
+      : OpRewritePattern<arith::CmpFOp>(context, benefit), symTable(symTable) {}
+
+  LogicalResult matchAndRewrite(arith::CmpFOp op,
+                                PatternRewriter &rewriter) const override {
+    // Get APFloat function from runtime library.
+    auto i1Type = IntegerType::get(symTable->getContext(), 1);
+    auto i8Type = IntegerType::get(symTable->getContext(), 8);
+    auto i32Type = IntegerType::get(symTable->getContext(), 32);
+    auto i64Type = IntegerType::get(symTable->getContext(), 64);
+    FailureOr<FuncOp> fn =
+        lookupOrCreateApFloatFn(rewriter, symTable, "compare",
+                                {i32Type, i64Type, i64Type}, nullptr, i8Type);
+    if (failed(fn))
+      return fn;
+
+    // Cast operands to 64-bit integers.
+    rewriter.setInsertionPoint(op);
+    Location loc = op.getLoc();
+    auto floatTy = cast<FloatType>(op.getLhs().getType());
+    auto intWType = rewriter.getIntegerType(floatTy.getWidth());
+    Value lhsBits = arith::ExtUIOp::create(
+        rewriter, loc, i64Type,
+        arith::BitcastOp::create(rewriter, loc, intWType, op.getLhs()));
+    Value rhsBits = arith::ExtUIOp::create(
+        rewriter, loc, i64Type,
+        arith::BitcastOp::create(rewriter, loc, intWType, op.getRhs()));
+
+    // Call APFloat function.
+    Value semValue = getSemanticsValue(rewriter, loc, floatTy);
+    SmallVector<Value> params = {semValue, lhsBits, rhsBits};
+    Value comparisonResult =
+        func::CallOp::create(rewriter, loc, TypeRange(i8Type),
+                             SymbolRefAttr::get(*fn), params)
+            ->getResult(0);
+
+    // Generate an i1 SSA value that is "true" if the comparison result matches
+    // the given `val`.
+    auto checkResult = [&](llvm::APFloat::cmpResult val) {
+      return arith::CmpIOp::create(
+          rewriter, loc, arith::CmpIPredicate::eq, comparisonResult,
+          arith::ConstantOp::create(
+              rewriter, loc, i8Type,
+              rewriter.getIntegerAttr(i8Type, static_cast<int8_t>(val)))
+              .getResult());
+    };
+    // Generate an i1 SSA value that is "true" if the comparison result matches
+    // any of the given `vals`.
+    std::function<Value(ArrayRef<llvm::APFloat::cmpResult>)> checkResults =
+        [&](ArrayRef<llvm::APFloat::cmpResult> vals) {
+          Value first = checkResult(vals.front());
+          if (vals.size() == 1)
+            return first;
+          Value rest = checkResults(vals.drop_front());
+          return arith::OrIOp::create(rewriter, loc, first, rest).getResult();
+        };
+
+    // This switch-case statement was taken from arith::applyCmpPredicate.
+    Value result;
+    switch (op.getPredicate()) {
+    case arith::CmpFPredicate::AlwaysFalse:
+      result = arith::ConstantOp::create(rewriter, loc, i1Type,
+                                         rewriter.getIntegerAttr(i1Type, 0))
+                   .getResult();
+      break;
+    case arith::CmpFPredicate::OEQ:
+      result = checkResult(llvm::APFloat::cmpEqual);
+      break;
+    case arith::CmpFPredicate::OGT:
+      result = checkResult(llvm::APFloat::cmpGreaterThan);
+      break;
+    case arith::CmpFPredicate::OGE:
+      result = checkResults(
+          {llvm::APFloat::cmpGreaterThan, llvm::APFloat::cmpEqual});
+      break;
+    case arith::CmpFPredicate::OLT:
+      result = checkResult(llvm::APFloat::cmpLessThan);
+      break;
+    case arith::CmpFPredicate::OLE:
+      result =
+          checkResults({llvm::APFloat::cmpLessThan, llvm::APFloat::cmpEqual});
+      break;
+    case arith::CmpFPredicate::ONE:
+      // Not cmpUnordered and not cmpEqual.
+      result = checkResults(
+          {llvm::APFloat::cmpLessThan, llvm::APFloat::cmpGreaterThan});
+      break;
+    case arith::CmpFPredicate::ORD:
+      // Not cmpUnordered.
+      result = checkResults({llvm::APFloat::cmpLessThan,
+                             llvm::APFloat::cmpGreaterThan,
+                             llvm::APFloat::cmpEqual});
+      break;
+    case arith::CmpFPredicate::UEQ:
+      result =
+          checkResults({llvm::APFloat::cmpUnordered, llvm::APFloat::cmpEqual});
+      break;
+    case arith::CmpFPredicate::UGT:
+      result = checkResults(
+          {llvm::APFloat::cmpUnordered, llvm::APFloat::cmpGreaterThan});
+      break;
+    case arith::CmpFPredicate::UGE:
+      result = checkResults({llvm::APFloat::cmpUnordered,
+                             llvm::APFloat::cmpGreaterThan,
+                             llvm::APFloat::cmpEqual});
+      break;
+    case arith::CmpFPredicate::ULT:
+      result = checkResults(
+          {llvm::APFloat::cmpUnordered, llvm::APFloat::cmpLessThan});
+      break;
+    case arith::CmpFPredicate::ULE:
+      result =
+          checkResults({llvm::APFloat::cmpUnordered, llvm::APFloat::cmpLessThan,
+                        llvm::APFloat::cmpEqual});
+      break;
+    case arith::CmpFPredicate::UNE:
+      // Not cmpEqual.
+      result = checkResults({llvm::APFloat::cmpLessThan,
+                             llvm::APFloat::cmpGreaterThan,
+                             llvm::APFloat::cmpUnordered});
+      break;
+    case arith::CmpFPredicate::UNO:
+      result = checkResult(llvm::APFloat::cmpUnordered);
+      break;
+    case arith::CmpFPredicate::AlwaysTrue:
+      result = arith::ConstantOp::create(rewriter, loc, i1Type,
+                                         rewriter.getIntegerAttr(i1Type, 1))
+                   .getResult();
+      break;
+    }
+    rewriter.replaceOp(op, result);
+    return success();
+  }
+
+  SymbolOpInterface symTable;
+};
+
+struct NegFOpToAPFloatConversion final : OpRewritePattern<arith::NegFOp> {
+  NegFOpToAPFloatConversion(MLIRContext *context, SymbolOpInterface symTable,
+                            PatternBenefit benefit = 1)
+      : OpRewritePattern<arith::NegFOp>(context, benefit), symTable(symTable) {}
+
+  LogicalResult matchAndRewrite(arith::NegFOp op,
+                                PatternRewriter &rewriter) const override {
+    // Get APFloat function from runtime library.
+    auto i32Type = IntegerType::get(symTable->getContext(), 32);
+    auto i64Type = IntegerType::get(symTable->getContext(), 64);
+    FailureOr<FuncOp> fn =
+        lookupOrCreateApFloatFn(rewriter, symTable, "neg", {i32Type, i64Type});
+    if (failed(fn))
+      return fn;
+
+    // Cast operand to 64-bit integer.
+    rewriter.setInsertionPoint(op);
+    Location loc = op.getLoc();
+    auto floatTy = cast<FloatType>(op.getOperand().getType());
+    auto intWType = rewriter.getIntegerType(floatTy.getWidth());
+    Value operandBits = arith::ExtUIOp::create(
+        rewriter, loc, i64Type,
+        arith::BitcastOp::create(rewriter, loc, intWType, op.getOperand()));
+
+    // Call APFloat function.
+    Value semValue = getSemanticsValue(rewriter, loc, floatTy);
+    SmallVector<Value> params = {semValue, operandBits};
+    Value negatedBits =
+        func::CallOp::create(rewriter, loc, TypeRange(i64Type),
+                             SymbolRefAttr::get(*fn), params)
+            ->getResult(0);
+
+    // Truncate result to the original width.
+    Value truncatedBits = arith::TruncIOp::create(rewriter, loc, intWType,
+                                                  negatedBits);
+    Value result =
+        arith::BitcastOp::create(rewriter, loc, floatTy, truncatedBits);
+    rewriter.replaceOp(op, result);
+    return success();
+  }
+
+  SymbolOpInterface symTable;
+};
+
 namespace {
 struct ArithToAPFloatConversionPass final
     : impl::ArithToAPFloatConversionPassBase<ArithToAPFloatConversionPass> {
@@ -329,8 +513,17 @@ void ArithToAPFloatConversionPass::runOnOperation() {
       context, "divide", getOperation());
   patterns.add<BinaryArithOpToAPFloatConversion<arith::RemFOp>>(
       context, "remainder", getOperation());
+  patterns.add<BinaryArithOpToAPFloatConversion<arith::MinNumFOp>>(
+      context, "minnum", getOperation());
+  patterns.add<BinaryArithOpToAPFloatConversion<arith::MaxNumFOp>>(
+      context, "maxnum", getOperation());
+  patterns.add<BinaryArithOpToAPFloatConversion<arith::MinimumFOp>>(
+      context, "minimum", getOperation());
+  patterns.add<BinaryArithOpToAPFloatConversion<arith::MaximumFOp>>(
+      context, "maximum", getOperation());
   patterns
-      .add<FpToFpConversion<arith::ExtFOp>, FpToFpConversion<arith::TruncFOp>>(
+      .add<FpToFpConversion<arith::ExtFOp>, FpToFpConversion<arith::TruncFOp>,
+           CmpFOpToAPFloatConversion, NegFOpToAPFloatConversion>(
           context, getOperation());
   patterns.add<FpToIntConversion<arith::FPToSIOp>>(context, getOperation(),
                                                    /*isUnsigned=*/false);
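For illustration, the predicate mapping used by `CmpFOpToAPFloatConversion` follows `llvm::APFloat::compare` directly. A standalone sketch (not part of the patch) of the OGE case, assuming LLVM's ADT headers and Support library are available:

```cpp
#include "llvm/ADT/APFloat.h"
#include <cassert>

int main() {
  llvm::APFloat lhs(2.0f), rhs(1.0f);
  llvm::APFloat::cmpResult r = lhs.compare(rhs);
  // OGE holds when the result is cmpGreaterThan or cmpEqual; an unordered
  // result (a NaN operand) makes OGE false and UGE true.
  bool oge =
      r == llvm::APFloat::cmpGreaterThan || r == llvm::APFloat::cmpEqual;
  assert(oge);
  return 0;
}
```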
diff --git a/mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp b/mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp
index 99c059cb03299..6254de81780f5 100644
--- a/mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp
+++ b/mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp
@@ -17,6 +17,7 @@
 #include "mlir/Dialect/LLVMIR/LLVMDialect.h"
 #include "mlir/Dialect/LLVMIR/NVVMDialect.h"
 #include "mlir/IR/TypeUtilities.h"
+#include "mlir/IR/Types.h"
 
 using namespace mlir;
 
@@ -57,7 +58,8 @@ static NVVM::MMATypes getElementType(gpu::MMAMatrixType type) {
   if (type.getElementType().isF32())
     return type.getOperand() == "COp" ? NVVM::MMATypes::f32
                                       : NVVM::MMATypes::tf32;
-
+  if (type.getElementType().isF64())
+    return NVVM::MMATypes::f64;
   if (type.getElementType().isSignedInteger(8))
     return NVVM::MMATypes::s8;
   if (type.getElementType().isUnsignedInteger(8))
@@ -212,8 +214,13 @@ struct WmmaMmaOpToNVVMLowering
     // then passed on to the intrinsic call. Emit llvm ops to extract individual
     // values form lowered memrefs.
     SmallVector<Value> unpackedOps;
-
     auto unpackOp = [&](Value operand) {
+      // f64 a and b fragments are not structs but scalars.
+      if (!isa<LLVM::LLVMStructType>(operand.getType())) {
+        unpackedOps.push_back(operand);
+        return;
+      }
+      // Every other type is lowered to an LLVM struct; extract its values.
       auto structType = cast<LLVM::LLVMStructType>(operand.getType());
       for (size_t i = 0, e = structType.getBody().size(); i < e; ++i) {
         Value toUse = LLVM::ExtractValueOp::create(rewriter, loc, operand, i);
@@ -276,10 +283,16 @@ struct WmmaConstantOpToNVVMLowering
       return failure();
     Location loc = subgroupMmaConstantOp.getLoc();
     Value cst = adaptor.getOperands()[0];
-    LLVM::LLVMStructType type = convertMMAToLLVMType(
+    Type type = convertMMAToLLVMType(
         cast<gpu::MMAMatrixType>(subgroupMmaConstantOp.getType()));
+    // If the element is not a struct, it means it's a scalar f64.
+    auto structType = dyn_cast<LLVM::LLVMStructType>(type);
+    if (!structType) {
+      rewriter.replaceOp(subgroupMmaConstantOp, cst);
+      return success();
+    }
     // If the element type is a vector create a vector from the operand.
-    if (auto vecType = dyn_cast<VectorType>(type.getBody()[0])) {
+    if (auto vecType = dyn_cast<VectorType>(structType.getBody()[0])) {
       Value vecCst = LLVM::PoisonOp::create(rewriter, loc, vecType);
       for (int64_t vecEl = 0; vecEl < vecType.getNumElements(); vecEl++) {
         Value idx = LLVM::ConstantOp::create(rewriter, loc,
@@ -289,8 +302,8 @@ struct WmmaConstantOpToNVVMLowering
       }
       cst = vecCst;
     }
-    Value matrixStruct = LLVM::PoisonOp::create(rewriter, loc, type);
-    for (size_t i : llvm::seq(size_t(0), type.getBody().size())) {
+    Value matrixStruct = LLVM::PoisonOp::create(rewriter, loc, structType);
+    for (size_t i : llvm::seq(size_t(0), structType.getBody().size())) {
       matrixStruct =
           LLVM::InsertValueOp::create(rewriter, loc, matrixStruct, cst, i);
     }
@@ -354,10 +367,24 @@ struct WmmaElementwiseOpToNVVMLowering
       return failure();
     Location loc = subgroupMmaElementwiseOp.getLoc();
     size_t numOperands = adaptor.getOperands().size();
-    LLVM::LLVMStructType destType = convertMMAToLLVMType(
+    Type destType = convertMMAToLLVMType(
         cast<gpu::MMAMatrixType>(subgroupMmaElementwiseOp.getType()));
-    Value matrixStruct = LLVM::PoisonOp::create(rewriter, loc, destType);
-    for (size_t i = 0, e = destType.getBody().size(); i < e; ++i) {
+
+    // If the element is not a struct, it means it's a scalar f64.
+    LLVM::LLVMStructType structDestTy =
+        dyn_cast<LLVM::LLVMStructType>(destType);
+    if (!structDestTy) {
+      SmallVector<Value> operands;
+      for (auto operand : adaptor.getOperands()) {
+        operands.push_back(operand);
+      }
+      Value element = createScalarOp(
+          rewriter, loc, subgroupMmaElementwiseOp.getOpType(), operands);
+      rewriter.replaceOp(subgroupMmaElementwiseOp, element);
+      return success();
+    }
+    Value matrixStruct = LLVM::PoisonOp::create(rewriter, loc, structDestTy);
+    for (size_t i = 0, e = structDestTy.getBody().size(); i < e; ++i) {
       SmallVector<Value> extractedOperands;
       for (size_t opIdx = 0; opIdx < numOperands; opIdx++) {
         extractedOperands.push_back(LLVM::ExtractValueOp::create(
@@ -377,13 +404,18 @@ struct WmmaElementwiseOpToNVVMLowering
 } // namespace
 
 /// Return the LLVMStructureType corresponding to the MMAMatrixType `type`.
-LLVM::LLVMStructType mlir::convertMMAToLLVMType(gpu::MMAMatrixType type) {
+Type mlir::convertMMAToLLVMType(gpu::MMAMatrixType type) {
   NVVM::MMAFrag frag = convertOperand(type.getOperand());
   NVVM::MMATypes eltType = getElementType(type);
   auto nRow = type.getShape()[0];
   auto nCol = type.getShape()[1];
   std::pair<Type, unsigned> typeInfo =
       NVVM::inferMMAType(eltType, frag, nRow, nCol, type.getContext());
+  // Special handling for f64 a and b fragments: they are single scalars, not structs.
+  Type f64Ty = Float64Type::get(type.getContext());
+  if (typeInfo.first == f64Ty && typeInfo.second == 1) {
+    return f64Ty;
+  }
   return LLVM::LLVMStructType::getLiteral(
       type.getContext(), SmallVector<Type, 8>(typeInfo.second, typeInfo.first));
 }
diff --git a/mlir/lib/Conversion/MemRefToEmitC/MemRefToEmitC.cpp b/mlir/lib/Conversion/MemRefToEmitC/MemRefToEmitC.cpp
index 11f866c103639..0a382d812f362 100644
--- a/mlir/lib/Conversion/MemRefToEmitC/MemRefToEmitC.cpp
+++ b/mlir/lib/Conversion/MemRefToEmitC/MemRefToEmitC.cpp
@@ -122,7 +122,7 @@ static Value calculateMemrefTotalSizeBytes(Location loc, MemRefType memrefType,
   return totalSizeBytes.getResult();
 }
 
-static emitc::ApplyOp
+static emitc::AddressOfOp
 createPointerFromEmitcArray(Location loc, OpBuilder &builder,
                             TypedValue<emitc::ArrayType> arrayValue) {
 
@@ -133,9 +133,9 @@ createPointerFromEmitcArray(Location loc, OpBuilder &builder,
   llvm::SmallVector<mlir::Value> indices(arrayType.getRank(), zeroIndex);
   emitc::SubscriptOp subPtr =
       emitc::SubscriptOp::create(builder, loc, arrayValue, ValueRange(indices));
-  emitc::ApplyOp ptr = emitc::ApplyOp::create(
+  emitc::AddressOfOp ptr = emitc::AddressOfOp::create(
       builder, loc, emitc::PointerType::get(arrayType.getElementType()),
-      builder.getStringAttr("&"), subPtr);
+      subPtr);
 
   return ptr;
 }
@@ -225,12 +225,12 @@ struct ConvertCopy final : public OpConversionPattern<memref::CopyOp> {
 
     auto srcArrayValue =
         cast<TypedValue<emitc::ArrayType>>(operands.getSource());
-    emitc::ApplyOp srcPtr =
+    emitc::AddressOfOp srcPtr =
         createPointerFromEmitcArray(loc, rewriter, srcArrayValue);
 
     auto targetArrayValue =
         cast<TypedValue<emitc::ArrayType>>(operands.getTarget());
-    emitc::ApplyOp targetPtr =
+    emitc::AddressOfOp targetPtr =
         createPointerFromEmitcArray(loc, rewriter, targetArrayValue);
 
     emitc::CallOpaqueOp memCpyCall = emitc::CallOpaqueOp::create(
@@ -319,8 +319,8 @@ struct ConvertGetGlobal final
       emitc::GetGlobalOp globalLValue = emitc::GetGlobalOp::create(
           rewriter, op.getLoc(), lvalueType, operands.getNameAttr());
       emitc::PointerType pointerType = emitc::PointerType::get(resultTy);
-      rewriter.replaceOpWithNewOp<emitc::ApplyOp>(
-          op, pointerType, rewriter.getStringAttr("&"), globalLValue);
+      rewriter.replaceOpWithNewOp<emitc::AddressOfOp>(op, pointerType,
+                                                      globalLValue);
       return success();
     }
     rewriter.replaceOpWithNewOp<emitc::GetGlobalOp>(op, resultTy,
diff --git a/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp b/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
index 3a70f787da124..64a7f562af0e5 100644
--- a/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
+++ b/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
@@ -923,7 +923,11 @@ struct NVGPUMBarrierArriveExpectTxLowering
                        adaptor.getMbarId(), rewriter);
     Value txcount = truncToI32(b, adaptor.getTxcount());
     rewriter.replaceOpWithNewOp<NVVM::MBarrierArriveExpectTxOp>(
-        op, barrier, txcount, adaptor.getPredicate());
+        op, Type{},       // return-value is optional and is void by default
+        barrier, txcount, // barrier and txcount
+        NVVM::MemScopeKind::CTA, // default scope is CTA
+        false,                   // relaxed-semantics is false
+        adaptor.getPredicate());
     return success();
   }
 };
diff --git a/mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp b/mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp
index cdc10c60a42ae..8b58c3b1dd182 100644
--- a/mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp
+++ b/mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp
@@ -705,6 +705,62 @@ LogicalResult TransposeLoadOp::verify() {
   return success();
 }
 
+//===----------------------------------------------------------------------===//
+// MakeDmaBaseOp
+//===----------------------------------------------------------------------===//
+
+LogicalResult MakeDmaBaseOp::verify() {
+  MemRefType ldsType = cast<MemRefType>(getLds().getType());
+  MemRefType globalType = cast<MemRefType>(getGlobal().getType());
+  if (!hasWorkgroupMemorySpace(ldsType.getMemorySpace())) {
+    return emitOpError(
+        "lds memref must have workgroup address space attribute.");
+  }
+  if (!hasGlobalMemorySpace(globalType.getMemorySpace())) {
+    return emitOpError(
+        "global memref must have global address space attribute.");
+  }
+  return success();
+}
+
+//===----------------------------------------------------------------------===//
+// MakeDmaDescriptorOp
+//===----------------------------------------------------------------------===//
+
+LogicalResult MakeDmaDescriptorOp::verify() {
+  ArrayRef<int64_t> globalStaticStrides = getGlobalStaticStrides();
+
+  if (globalStaticStrides.empty()) {
+    return emitOpError("strides must not be empty.");
+  }
+  if (globalStaticStrides.back() != 1) {
+    return emitOpError("strides for the innermost dimension must be 1.");
+  }
+
+  ArrayRef<int64_t> globalStaticSizes = getGlobalStaticSizes();
+  size_t rank = globalStaticSizes.size();
+  if (rank != globalStaticStrides.size()) {
+    return emitOpError("strides and sizes must have same rank.");
+  }
+
+  ArrayRef<int64_t> sharedStaticSizes = getSharedStaticSizes();
+  if (rank != sharedStaticSizes.size()) {
+    return emitOpError("tensor must have same rank as tile.");
+  }
+
+  if (Value atomicBarrierAddress = getAtomicBarrierAddress()) {
+    MemRefType atomicBarrierAddressType =
+        cast<MemRefType>(atomicBarrierAddress.getType());
+    bool barrierInLDS =
+        hasWorkgroupMemorySpace(atomicBarrierAddressType.getMemorySpace());
+    if (!barrierInLDS) {
+      return emitOpError("atomic barrier address must be in LDS.");
+    }
+  }
+
+  return success();
+}
+
 //===----------------------------------------------------------------------===//
 // ScaledMFMAOp
 //===----------------------------------------------------------------------===//
diff --git a/mlir/lib/Dialect/AMDGPU/Transforms/MaskedloadToLoad.cpp b/mlir/lib/Dialect/AMDGPU/Transforms/MaskedloadToLoad.cpp
index f15c63c166e0a..89ef51f922cad 100644
--- a/mlir/lib/Dialect/AMDGPU/Transforms/MaskedloadToLoad.cpp
+++ b/mlir/lib/Dialect/AMDGPU/Transforms/MaskedloadToLoad.cpp
@@ -33,19 +33,18 @@ using namespace mlir::amdgpu;
 
 /// This pattern supports lowering of: `vector.maskedload` to `vector.load`
 /// and `arith.select` if the memref is in buffer address space.
-static LogicalResult baseInBufferAddrSpace(PatternRewriter &rewriter,
-                                           vector::MaskedLoadOp maskedOp) {
-  auto memRefType = dyn_cast<MemRefType>(maskedOp.getBase().getType());
+static LogicalResult hasBufferAddressSpace(Type type) {
+  auto memRefType = dyn_cast<MemRefType>(type);
   if (!memRefType)
-    return rewriter.notifyMatchFailure(maskedOp, "not a memref source");
+    return failure();
 
   Attribute addrSpace = memRefType.getMemorySpace();
   if (!isa_and_nonnull<amdgpu::AddressSpaceAttr>(addrSpace))
-    return rewriter.notifyMatchFailure(maskedOp, "no address space");
+    return failure();
 
   if (dyn_cast<amdgpu::AddressSpaceAttr>(addrSpace).getValue() !=
       amdgpu::AddressSpace::FatRawBuffer)
-    return rewriter.notifyMatchFailure(maskedOp, "not in buffer address space");
+    return failure();
 
   return success();
 }
@@ -83,10 +82,11 @@ struct MaskedLoadLowering final : OpRewritePattern<vector::MaskedLoadOp> {
   LogicalResult matchAndRewrite(vector::MaskedLoadOp maskedOp,
                                 PatternRewriter &rewriter) const override {
     if (maskedOp->hasAttr(kMaskedloadNeedsMask))
-      return failure();
+      return rewriter.notifyMatchFailure(maskedOp, "already rewritten");
 
-    if (failed(baseInBufferAddrSpace(rewriter, maskedOp))) {
-      return failure();
+    if (failed(hasBufferAddressSpace(maskedOp.getBase().getType()))) {
+      return rewriter.notifyMatchFailure(
+          maskedOp, "isn't a load from a fat buffer resource");
     }
 
     // Check if this is either a full inbounds load or an empty, oob load. If
@@ -176,9 +176,14 @@ struct FullMaskedLoadToConditionalLoad
 
   LogicalResult matchAndRewrite(vector::MaskedLoadOp loadOp,
                                 PatternRewriter &rewriter) const override {
+    if (succeeded(hasBufferAddressSpace(loadOp.getBase().getType())))
+      return rewriter.notifyMatchFailure(
+          loadOp, "buffer loads are handled by a more specialized pattern");
+
     FailureOr<Value> maybeCond = matchFullMask(rewriter, loadOp.getMask());
     if (failed(maybeCond)) {
-      return failure();
+      return rewriter.notifyMatchFailure(loadOp,
+                                         "isn't loading a broadcasted scalar");
     }
 
     Value cond = maybeCond.value();
@@ -203,6 +208,15 @@ struct FullMaskedStoreToConditionalStore
 
   LogicalResult matchAndRewrite(vector::MaskedStoreOp storeOp,
                                 PatternRewriter &rewriter) const override {
+    // A condition-free implementation of fully masked stores requires
+    // 1) an accessor for the num_records field on buffer resources/fat pointers
+    // 2) knowledge that said field will always be set accurately - that is,
+    // that writes at offsets x < num_records wouldn't trap, which is
+    // something a pattern user would need to assert or we'd need to prove.
+    //
+    // Therefore, conditional stores to buffers still go down this path at
+    // present.
+
     FailureOr<Value> maybeCond = matchFullMask(rewriter, storeOp.getMask());
     if (failed(maybeCond)) {
       return failure();
diff --git a/mlir/lib/Dialect/Affine/Utils/LoopUtils.cpp b/mlir/lib/Dialect/Affine/Utils/LoopUtils.cpp
index 4743941deff3f..8f1249e3afaf0 100644
--- a/mlir/lib/Dialect/Affine/Utils/LoopUtils.cpp
+++ b/mlir/lib/Dialect/Affine/Utils/LoopUtils.cpp
@@ -1711,6 +1711,12 @@ LogicalResult mlir::affine::coalesceLoops(MutableArrayRef<AffineForOp> loops) {
   outermost.getBody()->getOperations().splice(
       Block::iterator(secondOutermostLoop.getOperation()),
       innermost.getBody()->getOperations());
+  for (auto [iter, init] :
+       llvm::zip_equal(secondOutermostLoop.getRegionIterArgs(),
+                       secondOutermostLoop.getInits())) {
+    iter.replaceAllUsesWith(init);
+    iter.dropAllUses();
+  }
   secondOutermostLoop.erase();
   return success();
 }
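
For readers unfamiliar with the utility being fixed above: when a coalesced nest carries iter_args, the erased inner loop's region iter args can still be referenced by the spliced body, so they are now rewired to the corresponding init values first. A minimal sketch of the kind of nest involved (names, bounds, and types are illustrative, not taken from the patch):

  func.func @nest(%init: f32) -> f32 {
    %c1 = arith.constant 1.0 : f32
    %r = affine.for %i = 0 to 4 iter_args(%acc0 = %init) -> (f32) {
      // During coalescing, uses of %acc1 are redirected to its init %acc0
      // before the inner affine.for is erased.
      %r1 = affine.for %j = 0 to 4 iter_args(%acc1 = %acc0) -> (f32) {
        %v = arith.addf %acc1, %c1 : f32
        affine.yield %v : f32
      }
      affine.yield %r1 : f32
    }
    return %r : f32
  }
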
diff --git a/mlir/lib/Dialect/ControlFlow/IR/CMakeLists.txt b/mlir/lib/Dialect/ControlFlow/IR/CMakeLists.txt
index 58551bb435c86..05a787fa53ec3 100644
--- a/mlir/lib/Dialect/ControlFlow/IR/CMakeLists.txt
+++ b/mlir/lib/Dialect/ControlFlow/IR/CMakeLists.txt
@@ -12,4 +12,5 @@ add_mlir_dialect_library(MLIRControlFlowDialect
   MLIRControlFlowInterfaces
   MLIRIR
   MLIRSideEffectInterfaces
+  MLIRUBDialect
   )
diff --git a/mlir/lib/Dialect/ControlFlow/IR/ControlFlowOps.cpp b/mlir/lib/Dialect/ControlFlow/IR/ControlFlowOps.cpp
index f1da1a125e9ef..d2078d8ab5ca5 100644
--- a/mlir/lib/Dialect/ControlFlow/IR/ControlFlowOps.cpp
+++ b/mlir/lib/Dialect/ControlFlow/IR/ControlFlowOps.cpp
@@ -12,6 +12,7 @@
 #include "mlir/Dialect/Arith/IR/Arith.h"
 #include "mlir/Dialect/Bufferization/IR/BufferDeallocationOpInterface.h"
 #include "mlir/Dialect/Bufferization/IR/BufferizableOpInterface.h"
+#include "mlir/Dialect/UB/IR/UBOps.h"
 #include "mlir/IR/AffineExpr.h"
 #include "mlir/IR/AffineMap.h"
 #include "mlir/IR/Builders.h"
@@ -445,6 +446,37 @@ struct CondBranchTruthPropagation : public OpRewritePattern<CondBranchOp> {
     return success(replaced);
   }
 };
+
+/// If the destination block of a conditional branch contains only
+/// ub.unreachable, unconditionally branch to the other destination.
+struct DropUnreachableCondBranch : public OpRewritePattern<CondBranchOp> {
+  using OpRewritePattern<CondBranchOp>::OpRewritePattern;
+
+  LogicalResult matchAndRewrite(CondBranchOp condbr,
+                                PatternRewriter &rewriter) const override {
+    // If the "true" destination is unreachable, branch to the "false"
+    // destination.
+    Block *trueDest = condbr.getTrueDest();
+    Block *falseDest = condbr.getFalseDest();
+    if (llvm::hasSingleElement(*trueDest) &&
+        isa<ub::UnreachableOp>(trueDest->getTerminator())) {
+      rewriter.replaceOpWithNewOp<BranchOp>(condbr, falseDest,
+                                            condbr.getFalseOperands());
+      return success();
+    }
+
+    // If the "false" destination is unreachable, branch to the "true"
+    // destination.
+    if (llvm::hasSingleElement(*falseDest) &&
+        isa<ub::UnreachableOp>(falseDest->getTerminator())) {
+      rewriter.replaceOpWithNewOp<BranchOp>(condbr, trueDest,
+                                            condbr.getTrueOperands());
+      return success();
+    }
+
+    return failure();
+  }
+};
 } // namespace
 
 void CondBranchOp::getCanonicalizationPatterns(RewritePatternSet &results,
@@ -452,7 +484,7 @@ void CondBranchOp::getCanonicalizationPatterns(RewritePatternSet &results,
   results.add<SimplifyConstCondBranchPred, SimplifyPassThroughCondBranch,
               SimplifyCondBranchIdenticalSuccessors,
               SimplifyCondBranchFromCondBranchOnSameCondition,
-              CondBranchTruthPropagation>(context);
+              CondBranchTruthPropagation, DropUnreachableCondBranch>(context);
 }
 
 SuccessorOperands CondBranchOp::getSuccessorOperands(unsigned index) {
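
A sketch of the rewrite the new DropUnreachableCondBranch pattern performs (function and value names are illustrative):

  // Before: the "true" successor contains only ub.unreachable.
  func.func @drop_unreachable(%cond: i1, %arg: i32) -> i32 {
    cf.cond_br %cond, ^bb1, ^bb2(%arg : i32)
  ^bb1:
    ub.unreachable
  ^bb2(%x: i32):
    return %x : i32
  }

  // After canonicalization, the conditional branch is replaced with
  //   cf.br ^bb2(%arg : i32)
  // and ^bb1 becomes dead.
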
diff --git a/mlir/lib/Dialect/EmitC/IR/EmitC.cpp b/mlir/lib/Dialect/EmitC/IR/EmitC.cpp
index d478220221f7a..b0566dd10f490 100644
--- a/mlir/lib/Dialect/EmitC/IR/EmitC.cpp
+++ b/mlir/lib/Dialect/EmitC/IR/EmitC.cpp
@@ -225,6 +225,21 @@ FailureOr<SmallVector<ReplacementItem>> parseFormatString(
   return items;
 }
 
+//===----------------------------------------------------------------------===//
+// AddressOfOp
+//===----------------------------------------------------------------------===//
+
+LogicalResult AddressOfOp::verify() {
+  emitc::LValueType referenceType = getReference().getType();
+  emitc::PointerType resultType = getResult().getType();
+
+  if (referenceType.getValueType() != resultType.getPointee())
+    return emitOpError("requires result to be a pointer to the type "
+                       "referenced by operand");
+
+  return success();
+}
+
 //===----------------------------------------------------------------------===//
 // AddOp
 //===----------------------------------------------------------------------===//
@@ -379,6 +394,20 @@ LogicalResult emitc::ConstantOp::verify() {
 
 OpFoldResult emitc::ConstantOp::fold(FoldAdaptor adaptor) { return getValue(); }
 
+//===----------------------------------------------------------------------===//
+// DereferenceOp
+//===----------------------------------------------------------------------===//
+
+LogicalResult DereferenceOp::verify() {
+  emitc::PointerType pointerType = getPointer().getType();
+
+  if (pointerType.getPointee() != getResult().getType().getValueType())
+    return emitOpError("requires result to be an lvalue of the type "
+                       "pointed to by operand");
+
+  return success();
+}
+
 //===----------------------------------------------------------------------===//
 // ExpressionOp
 //===----------------------------------------------------------------------===//
diff --git a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
index 6c6d8d2bad55d..61a630aa88960 100644
--- a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -208,7 +208,7 @@ Type MMAMatrixType::getElementType() const { return getImpl()->elementType; }
 StringRef MMAMatrixType::getOperand() const { return getImpl()->getOperand(); }
 
 bool MMAMatrixType::isValidElementType(Type elementType) {
-  return elementType.isF16() || elementType.isF32() ||
+  return elementType.isF16() || elementType.isF32() || elementType.isF64() ||
          elementType.isUnsignedInteger(8) || elementType.isSignedInteger(8) ||
          elementType.isInteger(32);
 }
@@ -225,7 +225,7 @@ MMAMatrixType::verifyInvariants(function_ref<InFlightDiagnostic()> emitError,
 
   if (!MMAMatrixType::isValidElementType(elementType))
     return emitError()
-           << "MMAMatrixType elements must be SI8, UI8, I32, F16, or F32";
+           << "MMAMatrixType elements must be SI8, UI8, I32, F16, F32, or F64";
 
   return success();
 }
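
With f64 accepted as an MMAMatrixType element type, IR such as the following becomes legal (the 8x8 shape is illustrative; the shapes actually supported are constrained by the NVVM lowering). Per the GPUToNVVM change earlier in this patch, f64 a/b fragments then lower to a plain f64 scalar instead of a one-element LLVM struct.

  %cst = arith.constant 0.0 : f64
  %c = gpu.subgroup_mma_constant_matrix %cst : !gpu.mma_matrix<8x8xf64, "COp">
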
diff --git a/mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp b/mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp
index d3c305555fde8..ada4223ac12de 100644
--- a/mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp
+++ b/mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp
@@ -252,10 +252,10 @@ LogicalResult CpAsyncBulkGlobalToSharedClusterOp::verify() {
 static LogicalResult verifyMBarrierArriveLikeOp(Operation *op, Value addr,
                                                 NVVM::MemScopeKind scope,
                                                 Value retVal = nullptr) {
-  bool isSharedCluster = isPtrInSharedClusterSpace(addr);
   if (scope != NVVM::MemScopeKind::CTA && scope != NVVM::MemScopeKind::CLUSTER)
     return op->emitError("mbarrier scope must be either CTA or Cluster");
 
+  bool isSharedCluster = isPtrInSharedClusterSpace(addr);
   bool hasRetValue = static_cast<bool>(retVal);
   if (isSharedCluster && hasRetValue)
     return op->emitError(
@@ -274,6 +274,34 @@ LogicalResult MBarrierArriveDropOp::verify() {
                                     getRes());
 }
 
+LogicalResult MBarrierArriveExpectTxOp::verify() {
+  // The inline-PTX version of this op does not support all features.
+  // With a predicate, this op lowers to inline PTX, so verify and
+  // error out if any unsupported feature is used.
+  if (getPredicate()) {
+    if (getScope() != NVVM::MemScopeKind::CTA)
+      return emitError("mbarrier scope must be CTA when using predicate");
+
+    if (isPtrInSharedClusterSpace(getAddr()))
+      return emitError("mbarrier in shared_cluster space is not supported when "
+                       "using predicate");
+
+    if (getRes())
+      return emitError("return-value is not supported when using predicate");
+
+    if (getRelaxed())
+      return emitError("mbarrier with relaxed semantics is not supported when "
+                       "using predicate");
+  }
+  return verifyMBarrierArriveLikeOp(getOperation(), getAddr(), getScope(),
+                                    getRes());
+}
+
+LogicalResult MBarrierArriveDropExpectTxOp::verify() {
+  return verifyMBarrierArriveLikeOp(getOperation(), getAddr(), getScope(),
+                                    getRes());
+}
+
 LogicalResult MBarrierExpectTxOp::verify() {
   return verifyMBarrierArriveLikeOp(getOperation(), getAddr(), getScope());
 }
@@ -282,6 +310,10 @@ LogicalResult MBarrierCompleteTxOp::verify() {
   return verifyMBarrierArriveLikeOp(getOperation(), getAddr(), getScope());
 }
 
+LogicalResult MBarrierTestWaitOp::verify() {
+  return verifyMBarrierArriveLikeOp(getOperation(), getAddr(), getScope());
+}
+
 LogicalResult ConvertFloatToTF32Op::verify() {
   using RndMode = NVVM::FPRoundingMode;
   switch (getRnd()) {
@@ -2576,6 +2608,87 @@ mlir::NVVM::IDArgPair MBarrierArriveDropOp::getIntrinsicIDAndArgs(
   return {id, {mbar, count}};
 }
 
+bool MBarrierArriveExpectTxOp::getAsmValues(
+    RewriterBase &rewriter,
+    llvm::SmallVectorImpl<std::pair<mlir::Value, mlir::NVVM::PTXRegisterMod>>
+        &asmValues) {
+  // Add all the operands, but not the attrs, to the asmValues list.
+  // The attrs are only used to select the right intrinsic variant during
+  // intrinsic lowering, so we ignore them when generating inline PTX.
+  for (auto val : getOperands())
+    asmValues.push_back({val, mlir::NVVM::PTXRegisterMod::Read});
+
+  return false;
+}
+
+mlir::NVVM::IDArgPair MBarrierArriveExpectTxOp::getIntrinsicIDAndArgs(
+    Operation &op, LLVM::ModuleTranslation &mt, llvm::IRBuilderBase &builder) {
+  auto thisOp = cast<NVVM::MBarrierArriveExpectTxOp>(op);
+
+  bool isClusterSpace = isPtrInSharedClusterSpace(thisOp.getAddr());
+  bool isClusterScope = thisOp.getScope() == NVVM::MemScopeKind::CLUSTER;
+  // bit-0: Space
+  // bit-1: Scope
+  size_t index = ((isClusterScope ? 1 : 0) << 1) | (isClusterSpace ? 1 : 0);
+
+  // clang-format off
+  static constexpr llvm::Intrinsic::ID IDs[] = {
+      llvm::Intrinsic::nvvm_mbarrier_arrive_expect_tx_scope_cta_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_arrive_expect_tx_scope_cta_space_cluster,
+      llvm::Intrinsic::nvvm_mbarrier_arrive_expect_tx_scope_cluster_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_arrive_expect_tx_scope_cluster_space_cluster};
+  static constexpr llvm::Intrinsic::ID relaxedIDs[] = {
+      llvm::Intrinsic::nvvm_mbarrier_arrive_expect_tx_relaxed_scope_cta_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_arrive_expect_tx_relaxed_scope_cta_space_cluster,
+      llvm::Intrinsic::nvvm_mbarrier_arrive_expect_tx_relaxed_scope_cluster_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_arrive_expect_tx_relaxed_scope_cluster_space_cluster};
+  // clang-format on
+  auto id = thisOp.getRelaxed() ? relaxedIDs[index] : IDs[index];
+
+  // Tidy up the intrinsic args.
+  llvm::Value *txcount = mt.lookupValue(thisOp.getTxcount());
+  llvm::Value *mbar = mt.lookupValue(thisOp.getAddr());
+  bool needCast = isPtrInGenericSpace(thisOp.getAddr());
+  if (needCast)
+    mbar = castPtrToAddrSpace(builder, mbar, NVVMMemorySpace::Shared);
+
+  return {id, {mbar, txcount}};
+}
+
+mlir::NVVM::IDArgPair MBarrierArriveDropExpectTxOp::getIntrinsicIDAndArgs(
+    Operation &op, LLVM::ModuleTranslation &mt, llvm::IRBuilderBase &builder) {
+  auto thisOp = cast<NVVM::MBarrierArriveDropExpectTxOp>(op);
+
+  bool isClusterSpace = isPtrInSharedClusterSpace(thisOp.getAddr());
+  bool isClusterScope = thisOp.getScope() == NVVM::MemScopeKind::CLUSTER;
+  // bit-0: Space
+  // bit-1: Scope
+  size_t index = ((isClusterScope ? 1 : 0) << 1) | (isClusterSpace ? 1 : 0);
+
+  // clang-format off
+  static constexpr llvm::Intrinsic::ID IDs[] = {
+      llvm::Intrinsic::nvvm_mbarrier_arrive_drop_expect_tx_scope_cta_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_arrive_drop_expect_tx_scope_cta_space_cluster,
+      llvm::Intrinsic::nvvm_mbarrier_arrive_drop_expect_tx_scope_cluster_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_arrive_drop_expect_tx_scope_cluster_space_cluster};
+  static constexpr llvm::Intrinsic::ID relaxedIDs[] = {
+      llvm::Intrinsic::nvvm_mbarrier_arrive_drop_expect_tx_relaxed_scope_cta_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_arrive_drop_expect_tx_relaxed_scope_cta_space_cluster,
+      llvm::Intrinsic::nvvm_mbarrier_arrive_drop_expect_tx_relaxed_scope_cluster_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_arrive_drop_expect_tx_relaxed_scope_cluster_space_cluster};
+  // clang-format on
+  auto id = thisOp.getRelaxed() ? relaxedIDs[index] : IDs[index];
+
+  // Tidy up the intrinsic args.
+  llvm::Value *txcount = mt.lookupValue(thisOp.getTxcount());
+  llvm::Value *mbar = mt.lookupValue(thisOp.getAddr());
+  bool needCast = isPtrInGenericSpace(thisOp.getAddr());
+  if (needCast)
+    mbar = castPtrToAddrSpace(builder, mbar, NVVMMemorySpace::Shared);
+
+  return {id, {mbar, txcount}};
+}
+
 mlir::NVVM::IDArgPair MBarrierArriveNocompleteOp::getIntrinsicIDAndArgs(
     Operation &op, LLVM::ModuleTranslation &mt, llvm::IRBuilderBase &builder) {
   auto thisOp = cast<NVVM::MBarrierArriveNocompleteOp>(op);
@@ -2609,16 +2722,34 @@ mlir::NVVM::IDArgPair MBarrierArriveDropNocompleteOp::getIntrinsicIDAndArgs(
 mlir::NVVM::IDArgPair MBarrierTestWaitOp::getIntrinsicIDAndArgs(
     Operation &op, LLVM::ModuleTranslation &mt, llvm::IRBuilderBase &builder) {
   auto thisOp = cast<NVVM::MBarrierTestWaitOp>(op);
-  bool isShared = isPtrInSharedCTASpace(thisOp.getAddr());
-  llvm::Intrinsic::ID id = isShared
-                               ? llvm::Intrinsic::nvvm_mbarrier_test_wait_shared
-                               : llvm::Intrinsic::nvvm_mbarrier_test_wait;
-  // Fill the Intrinsic Args
-  llvm::SmallVector<llvm::Value *> args;
-  args.push_back(mt.lookupValue(thisOp.getAddr()));
-  args.push_back(mt.lookupValue(thisOp.getState()));
+  bool isPhaseParity = thisOp.getStateOrPhase().getType().isInteger(32);
+  bool isClusterScope = thisOp.getScope() == NVVM::MemScopeKind::CLUSTER;
+  // bit-0: isPhaseParity
+  // bit-1: Scope
+  size_t index = ((isClusterScope ? 1 : 0) << 1) | (isPhaseParity ? 1 : 0);
 
-  return {id, std::move(args)};
+  // clang-format off
+  static constexpr llvm::Intrinsic::ID IDs[] = {
+      llvm::Intrinsic::nvvm_mbarrier_test_wait_scope_cta_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_test_wait_parity_scope_cta_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_test_wait_scope_cluster_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_test_wait_parity_scope_cluster_space_cta};
+  static constexpr llvm::Intrinsic::ID relaxedIDs[] = {
+      llvm::Intrinsic::nvvm_mbarrier_test_wait_relaxed_scope_cta_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_test_wait_parity_relaxed_scope_cta_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_test_wait_relaxed_scope_cluster_space_cta,
+      llvm::Intrinsic::nvvm_mbarrier_test_wait_parity_relaxed_scope_cluster_space_cta};
+  // clang-format on
+  auto id = thisOp.getRelaxed() ? relaxedIDs[index] : IDs[index];
+
+  // Tidy up the intrinsic args.
+  llvm::Value *mbar = mt.lookupValue(thisOp.getAddr());
+  llvm::Value *input = mt.lookupValue(thisOp.getStateOrPhase());
+  bool needCast = isPtrInGenericSpace(thisOp.getAddr());
+  if (needCast)
+    mbar = castPtrToAddrSpace(builder, mbar, NVVMMemorySpace::Shared);
+
+  return {id, {mbar, input}};
 }
 
 mlir::NVVM::IDArgPair CpAsyncMBarrierArriveOp::getIntrinsicIDAndArgs(
@@ -4707,16 +4838,20 @@ LogicalResult NVVMTargetAttr::verifyTarget(Operation *gpuModule) {
                      "Minimum NVVM target SM version is sm_20");
   }
 
-  gpuModuleOp->walk([&](Operation *op) {
-    if (auto reqOp = llvm::dyn_cast<NVVM::RequiresSMInterface>(op)) {
-      const NVVMCheckSMVersion requirement = reqOp.getRequiredMinSMVersion();
-      if (!requirement.isCompatibleWith(targetSMVersion)) {
-        op->emitOpError() << "is not supported on " << getChip();
-        return WalkResult::interrupt();
-      }
-    }
-    return WalkResult::advance();
-  });
+  if (gpuModuleOp
+          ->walk([&](Operation *op) {
+            if (auto reqOp = llvm::dyn_cast<NVVM::RequiresSMInterface>(op)) {
+              const NVVMCheckSMVersion requirement =
+                  reqOp.getRequiredMinSMVersion();
+              if (!requirement.isCompatibleWith(targetSMVersion)) {
+                op->emitOpError() << "is not supported on " << getChip();
+                return WalkResult::interrupt();
+              }
+            }
+            return WalkResult::advance();
+          })
+          .wasInterrupted())
+    return failure();
 
   return success();
 }
diff --git a/mlir/lib/Dialect/Linalg/IR/LinalgInterfaces.cpp b/mlir/lib/Dialect/Linalg/IR/LinalgInterfaces.cpp
index dcc1ef9e997ea..b4b1347493529 100644
--- a/mlir/lib/Dialect/Linalg/IR/LinalgInterfaces.cpp
+++ b/mlir/lib/Dialect/Linalg/IR/LinalgInterfaces.cpp
@@ -1057,12 +1057,15 @@ LogicalResult mlir::linalg::detail::verifyConvolutionInterface(Operation *op) {
 // FillOpInterface implementation
 //===----------------------------------------------------------------------===//
 
+namespace {
 enum class MatchFillResult {
   Success = 0,
   NotLinalgOp,
   WrongNumOperands,
-  NotScalarInput
+  NotScalarInput,
+  TypeMismatch
 };
+} // namespace
 
 static MatchFillResult isFillInterfaceImpl(Operation *op) {
   auto linalgOp = dyn_cast<linalg::LinalgOp>(op);
@@ -1075,17 +1078,33 @@ static MatchFillResult isFillInterfaceImpl(Operation *op) {
   if (!linalgOp.isScalar(value))
     return MatchFillResult::NotScalarInput;
 
+  // Check that the scalar input type matches the output element type.
+  OpOperand *output = linalgOp.getDpsInitOperand(0);
+  Type scalarType = value->get().getType();
+  Type outputElementType = getElementTypeOrSelf(output->get().getType());
+  if (scalarType != outputElementType)
+    return MatchFillResult::TypeMismatch;
+
   return MatchFillResult::Success;
 }
 
 LogicalResult mlir::linalg::detail::verifyFillInterface(Operation *op) {
-  auto res = isFillInterfaceImpl(op);
+  MatchFillResult res = isFillInterfaceImpl(op);
   if (res == MatchFillResult::NotLinalgOp)
     return op->emitError("expected a LinalgOp");
   if (res == MatchFillResult::WrongNumOperands)
     return op->emitError("expected op with 1 input and 1 output");
   if (res == MatchFillResult::NotScalarInput)
     return op->emitError("expected op with scalar input");
+  if (res == MatchFillResult::TypeMismatch) {
+    auto linalgOp = cast<linalg::LinalgOp>(op);
+    Type scalarType = linalgOp.getDpsInputOperand(0)->get().getType();
+    Type outputElementType =
+        getElementTypeOrSelf(linalgOp.getDpsInitOperand(0)->get().getType());
+    return op->emitOpError("expected fill value type (")
+           << scalarType << ") to match output element type ("
+           << outputElementType << ")";
+  }
 
   return success();
 }
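
As an illustration of the new FillOpInterface check (values and shapes made up for the example), a fill whose scalar operand type differs from the output element type is now rejected:

  %cst = arith.constant 0.0 : f16
  // error: 'linalg.fill' op expected fill value type (f16) to match output
  // element type (f32)
  %0 = linalg.fill ins(%cst : f16) outs(%init : tensor<4x4xf32>) -> tensor<4x4xf32>
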
diff --git a/mlir/lib/Dialect/Linalg/Transforms/ElementwiseOpFusion.cpp b/mlir/lib/Dialect/Linalg/Transforms/ElementwiseOpFusion.cpp
index 05fc7cbbb90af..421ab5e3760a7 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/ElementwiseOpFusion.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/ElementwiseOpFusion.cpp
@@ -1038,6 +1038,62 @@ class FoldWithProducerReshapeOpByExpansion
   ControlFusionFn controlFoldingReshapes;
 };
 
+/// Carries information about a padded dimension.
+struct PadDimInfo {
+  // The resulting shape after padding each dimension.
+  SmallVector<int64_t> paddedShape;
+
+  // Low and high padding amounts for each dimension.
+  SmallVector<OpFoldResult> lowPad;
+  SmallVector<OpFoldResult> highPad;
+};
+
+/// Computes the expanded padding information for the given pad operation based
+/// on the provided expanded shape and reassociation indices. Returns a
+/// PadDimInfo containing the low and high padding amounts and the padded
+/// size for each dimension, or failure if the expansion is not possible.
+static FailureOr<PadDimInfo>
+computeExpandedPadding(tensor::PadOp padOp, ArrayRef<int64_t> expandedShape,
+                       ArrayRef<ReassociationIndices> reassociations,
+                       PatternRewriter &rewriter) {
+  // If the padding value depends on the index values of the pad operation,
+  // then it may not be valid to expand the dimensions, since it will change
+  // the index values on which the padding value depends. This is not currently
+  // supported by the pad expansion patterns, but it could be implemented
+  // similarly to the expansion of linalg.generic ops with linalg.index ops in
+  // the body, as is done in `updateExpandedGenericOpRegion`.
+  if (!padOp.getConstantPaddingValue())
+    return failure();
+
+  // Expanded dimensions cannot have padding because the resulting padding may
+  // not be representable by a tensor.pad op. There are some special cases where
+  // it is possible (like expanding unit dims), but supporting these cases is
+  // NYI, so disallow it for now.
+  ArrayRef<int64_t> low = padOp.getStaticLow();
+  ArrayRef<int64_t> high = padOp.getStaticHigh();
+  for (auto [reInd, l, h] : llvm::zip_equal(reassociations, low, high)) {
+    if (reInd.size() != 1 && (l != 0 || h != 0))
+      return failure();
+  }
+
+  SmallVector<OpFoldResult> mixedLowPad(padOp.getMixedLowPad());
+  SmallVector<OpFoldResult> mixedHighPad(padOp.getMixedHighPad());
+  ArrayRef<int64_t> paddedShape = padOp.getResultType().getShape();
+  PadDimInfo padDimInfo;
+  padDimInfo.paddedShape.assign(expandedShape);
+  padDimInfo.lowPad.assign(expandedShape.size(), rewriter.getIndexAttr(0));
+  padDimInfo.highPad.assign(expandedShape.size(), rewriter.getIndexAttr(0));
+  for (auto [idx, reInd] : llvm::enumerate(reassociations)) {
+    if (reInd.size() == 1) {
+      padDimInfo.paddedShape[reInd[0]] = paddedShape[idx];
+      padDimInfo.lowPad[reInd[0]] = mixedLowPad[idx];
+      padDimInfo.highPad[reInd[0]] = mixedHighPad[idx];
+    }
+  }
+
+  return padDimInfo;
+}
+
 class FoldPadWithProducerReshapeOpByExpansion
     : public OpRewritePattern<tensor::PadOp> {
 public:
@@ -1053,46 +1109,96 @@ class FoldPadWithProducerReshapeOpByExpansion
         padOp.getSource().getDefiningOp<tensor::CollapseShapeOp>();
     if (!reshapeOp)
       return failure();
-    if (!reshapeOp->hasOneUse())
-      return failure();
 
     if (!controlFoldingReshapes(&padOp.getSourceMutable())) {
       return rewriter.notifyMatchFailure(padOp,
                                          "fusion blocked by control function");
     }
 
-    ArrayRef<int64_t> low = padOp.getStaticLow();
-    ArrayRef<int64_t> high = padOp.getStaticHigh();
+    RankedTensorType expandedType = reshapeOp.getSrcType();
     SmallVector<ReassociationIndices> reassociations =
         reshapeOp.getReassociationIndices();
+    FailureOr<PadDimInfo> maybeExpandedPadding = computeExpandedPadding(
+        padOp, expandedType.getShape(), reassociations, rewriter);
+    if (failed(maybeExpandedPadding))
+      return failure();
+    PadDimInfo &expandedPadding = maybeExpandedPadding.value();
 
-    for (auto [reInd, l, h] : llvm::zip_equal(reassociations, low, high)) {
-      if (reInd.size() != 1 && (l != 0 || h != 0))
-        return failure();
+    Location loc = padOp->getLoc();
+    RankedTensorType expandedPaddedType =
+        padOp.getResultType().clone(expandedPadding.paddedShape);
+
+    auto newPadOp = tensor::PadOp::create(
+        rewriter, loc, expandedPaddedType, reshapeOp.getSrc(),
+        expandedPadding.lowPad, expandedPadding.highPad,
+        padOp.getConstantPaddingValue(), padOp.getNofold());
+
+    rewriter.replaceOpWithNewOp<tensor::CollapseShapeOp>(
+        padOp, padOp.getResultType(), newPadOp.getResult(), reassociations);
+
+    return success();
+  }
+
+private:
+  ControlFusionFn controlFoldingReshapes;
+};
+
+class FoldReshapeWithProducerPadOpByExpansion
+    : public OpRewritePattern<tensor::ExpandShapeOp> {
+public:
+  FoldReshapeWithProducerPadOpByExpansion(MLIRContext *context,
+                                          ControlFusionFn foldReshapes,
+                                          PatternBenefit benefit = 1)
+      : OpRewritePattern<tensor::ExpandShapeOp>(context, benefit),
+        controlFoldingReshapes(std::move(foldReshapes)) {}
+
+  LogicalResult matchAndRewrite(tensor::ExpandShapeOp expandOp,
+                                PatternRewriter &rewriter) const override {
+    tensor::PadOp padOp = expandOp.getSrc().getDefiningOp<tensor::PadOp>();
+    if (!padOp)
+      return failure();
+
+    if (!controlFoldingReshapes(&expandOp.getSrcMutable())) {
+      return rewriter.notifyMatchFailure(expandOp,
+                                         "fusion blocked by control function");
     }
 
-    SmallVector<OpFoldResult> newLow, newHigh;
-    RankedTensorType expandedType = reshapeOp.getSrcType();
-    RankedTensorType paddedType = padOp.getResultType();
-    SmallVector<int64_t> expandedPaddedShape(expandedType.getShape());
+    RankedTensorType expandedType = expandOp.getResultType();
+    SmallVector<ReassociationIndices> reassociations =
+        expandOp.getReassociationIndices();
+    FailureOr<PadDimInfo> maybeExpandedPadding = computeExpandedPadding(
+        padOp, expandedType.getShape(), reassociations, rewriter);
+    if (failed(maybeExpandedPadding))
+      return failure();
+    PadDimInfo &expandedPadding = maybeExpandedPadding.value();
+
+    Location loc = expandOp->getLoc();
+    SmallVector<OpFoldResult> newExpandedSizes = expandOp.getMixedOutputShape();
+    SmallVector<int64_t> newExpandedShape(expandedType.getShape());
+    rewriter.setInsertionPointAfterValue(padOp.getSource());
+    SmallVector<OpFoldResult> padSrcSizes =
+        tensor::getMixedSizes(rewriter, loc, padOp.getSource());
     for (auto [idx, reInd] : llvm::enumerate(reassociations)) {
+      // We know that any reassociation with multiple dims is not padded because
+      // of the requirements of computeExpandedPadding.
       if (reInd.size() == 1) {
-        expandedPaddedShape[reInd[0]] = paddedType.getShape()[idx];
-      }
-      for (size_t i = 0; i < reInd.size(); ++i) {
-        newLow.push_back(padOp.getMixedLowPad()[idx]);
-        newHigh.push_back(padOp.getMixedHighPad()[idx]);
+        newExpandedShape[reInd[0]] = padOp.getSourceType().getDimSize(idx);
+        newExpandedSizes[reInd[0]] = padSrcSizes[idx];
       }
     }
-
-    Location loc = padOp->getLoc();
-    RankedTensorType expandedPaddedType = paddedType.clone(expandedPaddedShape);
+    RankedTensorType newExpandedType = expandedType.clone(newExpandedShape);
+    auto newExpandOp = tensor::ExpandShapeOp::create(
+        rewriter, loc, newExpandedType, padOp.getSource(), reassociations,
+        newExpandedSizes);
+    RankedTensorType expandedPaddedType =
+        padOp.getResultType().clone(expandedPadding.paddedShape);
+    rewriter.setInsertionPoint(expandOp);
     auto newPadOp = tensor::PadOp::create(
-        rewriter, loc, expandedPaddedType, reshapeOp.getSrc(), newLow, newHigh,
+        rewriter, loc, expandedPaddedType, newExpandOp.getResult(),
+        expandedPadding.lowPad, expandedPadding.highPad,
         padOp.getConstantPaddingValue(), padOp.getNofold());
 
-    rewriter.replaceOpWithNewOp<tensor::CollapseShapeOp>(
-        padOp, padOp.getResultType(), newPadOp.getResult(), reassociations);
+    rewriter.replaceOp(expandOp, newPadOp.getResult());
 
     return success();
   }
@@ -1921,6 +2027,62 @@ struct FoldReshapeWithGenericOpByCollapsing
   ControlFusionFn controlFoldingReshapes;
 };
 
+/// Computes the collapsed padding information for the given pad operation
+/// based on the provided reassociation indices. Returns a
+/// PadDimInfo containing the low and high padding amounts and the collapsed
+/// shape for each dimension, or failure if the collapse is not possible.
+static FailureOr<PadDimInfo>
+computeCollapsedPadding(tensor::PadOp padOp,
+                        ArrayRef<ReassociationIndices> reassociations,
+                        PatternRewriter &rewriter) {
+  // If the padding value depends on the index values of the pad operation,
+  // then it may not be valid to collapse the dimensions, since it will change
+  // the index values on which the padding value depends. This is not currently
+  // supported by the pad collapsing patterns, but it could be implemented
+  // similarly to the collapsing of linalg.generic ops with linalg.index ops in
+  // the body, as is done in `generateCollapsedIndexingRegion`.
+  if (!padOp.getConstantPaddingValue())
+    return failure();
+
+  // Collapsed dimensions cannot have padding because this can produce strided
+  // padding that isn't representable by a tensor.pad op. There are some special
+  // cases where it is possible (like collapsing unit dims), but supporting
+  // these cases is NYI, so disallow it for now.
+  ArrayRef<int64_t> low = padOp.getStaticLow();
+  ArrayRef<int64_t> high = padOp.getStaticHigh();
+  for (auto [idx, reInd] : llvm::enumerate(reassociations)) {
+    for (int64_t dim : reInd) {
+      if ((low[dim] != 0 || high[dim] != 0) && reInd.size() != 1)
+        return failure();
+    }
+  }
+
+  // Initialize padding values for the collapsed tensor to zero.
+  ArrayRef<int64_t> expandedPaddedShape = padOp.getType().getShape();
+  PadDimInfo padDimInfo;
+  padDimInfo.lowPad.assign(reassociations.size(), rewriter.getIndexAttr(0));
+  padDimInfo.highPad.assign(reassociations.size(), rewriter.getIndexAttr(0));
+
+  // Update padding for dimensions that are not being collapsed, and compute
+  // the collapsed padded shape.
+  SmallVector<OpFoldResult> mixedLowPad(padOp.getMixedLowPad());
+  SmallVector<OpFoldResult> mixedHighPad(padOp.getMixedHighPad());
+  for (auto [idx, reInd] : llvm::enumerate(reassociations)) {
+    if (reInd.size() == 1) {
+      padDimInfo.lowPad[idx] = mixedLowPad[reInd[0]];
+      padDimInfo.highPad[idx] = mixedHighPad[reInd[0]];
+    }
+    SaturatedInteger collapsedSize = SaturatedInteger::wrap(1);
+    for (int64_t dim : reInd) {
+      collapsedSize =
+          collapsedSize * SaturatedInteger::wrap(expandedPaddedShape[dim]);
+    }
+    padDimInfo.paddedShape.push_back(collapsedSize.asInteger());
+  }
+
+  return padDimInfo;
+}
+
 class FoldPadWithProducerReshapeOpByCollapsing
     : public OpRewritePattern<tensor::PadOp> {
 public:
@@ -1936,57 +2098,40 @@ class FoldPadWithProducerReshapeOpByCollapsing
         padOp.getSource().getDefiningOp<tensor::ExpandShapeOp>();
     if (!reshapeOp)
       return failure();
-    if (!reshapeOp->hasOneUse())
-      return failure();
 
     if (!controlFoldingReshapes(&padOp.getSourceMutable())) {
       return rewriter.notifyMatchFailure(padOp,
                                          "fusion blocked by control function");
     }
 
-    ArrayRef<int64_t> low = padOp.getStaticLow();
-    ArrayRef<int64_t> high = padOp.getStaticHigh();
     SmallVector<ReassociationIndices> reassociations =
         reshapeOp.getReassociationIndices();
+    FailureOr<PadDimInfo> maybeCollapsedPadding =
+        computeCollapsedPadding(padOp, reassociations, rewriter);
+    if (failed(maybeCollapsedPadding))
+      return failure();
+    PadDimInfo &collapsedPadding = maybeCollapsedPadding.value();
 
-    for (auto reInd : reassociations) {
-      if (reInd.size() == 1)
-        continue;
-      if (llvm::any_of(reInd, [&](int64_t ind) {
-            return low[ind] != 0 || high[ind] != 0;
-          })) {
-        return failure();
-      }
-    }
-
-    SmallVector<OpFoldResult> newLow, newHigh;
-    RankedTensorType collapsedType = reshapeOp.getSrcType();
-    RankedTensorType paddedType = padOp.getResultType();
-    SmallVector<int64_t> collapsedPaddedShape(collapsedType.getShape());
-    SmallVector<OpFoldResult> expandedPaddedSizes(
-        getMixedValues(reshapeOp.getStaticOutputShape(),
-                       reshapeOp.getOutputShape(), rewriter));
+    SmallVector<OpFoldResult> expandedPaddedSizes =
+        reshapeOp.getMixedOutputShape();
     AffineExpr d0, d1, d2;
     bindDims(rewriter.getContext(), d0, d1, d2);
     auto addMap = AffineMap::get(3, 0, {d0 + d1 + d2});
     Location loc = reshapeOp->getLoc();
-    for (auto [idx, reInd] : llvm::enumerate(reassociations)) {
-      OpFoldResult l = padOp.getMixedLowPad()[reInd[0]];
-      OpFoldResult h = padOp.getMixedHighPad()[reInd[0]];
+    for (auto [reInd, l, h] :
+         llvm::zip_equal(reassociations, collapsedPadding.lowPad,
+                         collapsedPadding.highPad)) {
       if (reInd.size() == 1) {
-        collapsedPaddedShape[idx] = paddedType.getShape()[reInd[0]];
-        OpFoldResult paddedSize = affine::makeComposedFoldedAffineApply(
+        expandedPaddedSizes[reInd[0]] = affine::makeComposedFoldedAffineApply(
             rewriter, loc, addMap, {l, h, expandedPaddedSizes[reInd[0]]});
-        expandedPaddedSizes[reInd[0]] = paddedSize;
       }
-      newLow.push_back(l);
-      newHigh.push_back(h);
     }
 
     RankedTensorType collapsedPaddedType =
-        paddedType.clone(collapsedPaddedShape);
+        padOp.getType().clone(collapsedPadding.paddedShape);
     auto newPadOp = tensor::PadOp::create(
-        rewriter, loc, collapsedPaddedType, reshapeOp.getSrc(), newLow, newHigh,
+        rewriter, loc, collapsedPaddedType, reshapeOp.getSrc(),
+        collapsedPadding.lowPad, collapsedPadding.highPad,
         padOp.getConstantPaddingValue(), padOp.getNofold());
 
     rewriter.replaceOpWithNewOp<tensor::ExpandShapeOp>(
@@ -2000,6 +2145,52 @@ class FoldPadWithProducerReshapeOpByCollapsing
   ControlFusionFn controlFoldingReshapes;
 };
 
+class FoldReshapeWithProducerPadOpByCollapsing
+    : public OpRewritePattern<tensor::CollapseShapeOp> {
+public:
+  FoldReshapeWithProducerPadOpByCollapsing(MLIRContext *context,
+                                           ControlFusionFn foldReshapes,
+                                           PatternBenefit benefit = 1)
+      : OpRewritePattern<tensor::CollapseShapeOp>(context, benefit),
+        controlFoldingReshapes(std::move(foldReshapes)) {}
+
+  LogicalResult matchAndRewrite(tensor::CollapseShapeOp reshapeOp,
+                                PatternRewriter &rewriter) const override {
+    tensor::PadOp padOp = reshapeOp.getSrc().getDefiningOp<tensor::PadOp>();
+    if (!padOp)
+      return failure();
+
+    if (!controlFoldingReshapes(&reshapeOp.getSrcMutable())) {
+      return rewriter.notifyMatchFailure(padOp,
+                                         "fusion blocked by control function");
+    }
+
+    SmallVector<ReassociationIndices> reassociations =
+        reshapeOp.getReassociationIndices();
+    RankedTensorType collapsedPaddedType = reshapeOp.getResultType();
+    FailureOr<PadDimInfo> maybeCollapsedPadding =
+        computeCollapsedPadding(padOp, reassociations, rewriter);
+    if (failed(maybeCollapsedPadding))
+      return failure();
+    PadDimInfo &collapsedPadding = maybeCollapsedPadding.value();
+
+    Location loc = reshapeOp->getLoc();
+    auto newCollapseOp = tensor::CollapseShapeOp::create(
+        rewriter, loc, padOp.getSource(), reassociations);
+
+    auto newPadOp = tensor::PadOp::create(
+        rewriter, loc, collapsedPaddedType, newCollapseOp.getResult(),
+        collapsedPadding.lowPad, collapsedPadding.highPad,
+        padOp.getConstantPaddingValue(), padOp.getNofold());
+
+    rewriter.replaceOp(reshapeOp, newPadOp.getResult());
+    return success();
+  }
+
+private:
+  ControlFusionFn controlFoldingReshapes;
+};
+
 /// Pattern to collapse dimensions.
 template <typename LinalgType>
 class CollapseLinalgDimensions : public OpRewritePattern<LinalgType> {
@@ -2239,6 +2430,8 @@ void mlir::linalg::populateFoldReshapeOpsByExpansionPatterns(
                                                     controlFoldingReshapes);
   patterns.add<FoldPadWithProducerReshapeOpByExpansion>(patterns.getContext(),
                                                         controlFoldingReshapes);
+  patterns.add<FoldReshapeWithProducerPadOpByExpansion>(patterns.getContext(),
+                                                        controlFoldingReshapes);
   patterns.add<FoldWithProducerReshapeOpByExpansion>(patterns.getContext(),
                                                      controlFoldingReshapes);
 }
@@ -2250,6 +2443,8 @@ void mlir::linalg::populateFoldReshapeOpsByCollapsingPatterns(
                                                       controlFoldingReshapes);
   patterns.add<FoldPadWithProducerReshapeOpByCollapsing>(
       patterns.getContext(), controlFoldingReshapes);
+  patterns.add<FoldReshapeWithProducerPadOpByCollapsing>(
+      patterns.getContext(), controlFoldingReshapes);
   patterns.add<FoldReshapeWithGenericOpByCollapsing>(patterns.getContext(),
                                                      controlFoldingReshapes);
 }
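
A sketch of what the new FoldReshapeWithProducerPadOpByExpansion pattern enables (shapes and padding amounts are illustrative): the expand_shape is moved onto the pad's unpadded source and the padding is re-applied on the expanded type. As required by computeExpandedPadding, only dimensions that are not split may carry padding.

  // Before:
  %pad = tensor.pad %src low[0, 2] high[0, 2] {
  ^bb0(%i: index, %j: index):
    tensor.yield %cst : f32
  } : tensor<8x16xf32> to tensor<8x20xf32>
  %exp = tensor.expand_shape %pad [[0, 1], [2]] output_shape [2, 4, 20]
      : tensor<8x20xf32> into tensor<2x4x20xf32>

  // After: expand the unpadded source, then pad only the last dimension.
  %exp2 = tensor.expand_shape %src [[0, 1], [2]] output_shape [2, 4, 16]
      : tensor<8x16xf32> into tensor<2x4x16xf32>
  %pad2 = tensor.pad %exp2 low[0, 0, 2] high[0, 0, 2] {
  ^bb0(%i: index, %j: index, %k: index):
    tensor.yield %cst : f32
  } : tensor<2x4x16xf32> to tensor<2x4x20xf32>
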
diff --git a/mlir/lib/Dialect/Linalg/Utils/Utils.cpp b/mlir/lib/Dialect/Linalg/Utils/Utils.cpp
index 6b85e6ba0ede2..01e6e1e248658 100644
--- a/mlir/lib/Dialect/Linalg/Utils/Utils.cpp
+++ b/mlir/lib/Dialect/Linalg/Utils/Utils.cpp
@@ -416,10 +416,6 @@ static bool matchConvDimAddExprPattern(ArrayAttr indexingMaps, unsigned iDim,
   return false;
 }
 
-// ---------------------------------------------
-// Matchers for specific convolution operation.
-// ---------------------------------------------
-
 /// Returns true if the given indexing maps matches with the expected indexing
 /// maps.
 static bool convLayoutMatches(ArrayRef<ArrayRef<AffineExpr>> mapListExpected,
@@ -434,6 +430,105 @@ static bool convLayoutMatches(ArrayRef<ArrayRef<AffineExpr>> mapListExpected,
                           })));
 }
 
+/// Enum representing pooling operation types used by ConvMatcherBuilder.
+enum class PoolingType {
+  None,
+  MaxSigned,
+  MaxUnsigned,
+  MinSigned,
+  MinUnsigned,
+  Sum
+};
+
+/// Helper class for building convolution op matchers with minimal boilerplate.
+/// Reduces repetitive code across Conv1D/2D/3D and Depthwise variants as well
+/// as Pooling ops.
+///
+/// Usage: Create an instance with the op, spatial rank, and output pointers for
+/// extracted dilations/strides. Then chain matchStride() calls for each spatial
+/// dimension, followed by matchMaps() to verify indexing maps, and finally
+/// matchBody() to verify the operation body pattern.
+///
+/// The `matched` flag starts as `true` and is set to `false` if any match step
+/// fails. This allows chaining multiple match calls; once any match fails, all
+/// subsequent calls become no-ops and the final result is `false`.
+///
+/// The `dilations` and `strides` pointers are output parameters that get
+/// populated with the extracted dilation and stride values from the operation's
+/// indexing maps during matchStride() calls. These values are initially set to
+/// 1 for each spatial dimension and updated as patterns are matched.
+class ConvMatcherBuilder {
+  LinalgOp op;
+  MLIRContext *ctx;
+  SmallVector<int64_t> *dilations, *strides;
+  ArrayAttr indexingMaps;
+  PoolingType poolingType;
+  bool matched = true;
+
+public:
+  ConvMatcherBuilder(LinalgOp op, unsigned spatialRank, SmallVector<int64_t> *d,
+                     SmallVector<int64_t> *s,
+                     PoolingType poolingType = PoolingType::None)
+      : op(op), ctx(op->getContext()), dilations(d), strides(s),
+        indexingMaps(op.getIndexingMaps()), poolingType(poolingType) {
+    *dilations = SmallVector<int64_t>(spatialRank, 1);
+    *strides = SmallVector<int64_t>(spatialRank, 1);
+  }
+
+  /// Get affine dimension expression for dimension `i`.
+  AffineExpr dim(unsigned i) { return getAffineDimExpr(i, ctx); }
+
+  /// Build strided expression: base * stride[idx] + kernel * dilation[idx].
+  AffineExpr strided(AffineExpr base, AffineExpr kernel, unsigned idx) {
+    return base * (*strides)[idx] + kernel * (*dilations)[idx];
+  }
+
+  /// Match stride/dilation pattern for a spatial dimension.
+  /// Returns *this for method chaining.
+  ConvMatcherBuilder &matchStride(unsigned iDim, unsigned fDim, unsigned oDim,
+                                  unsigned idx) {
+    if (matched) {
+      matched &= matchConvDimAddExprPattern(indexingMaps, iDim, fDim, oDim,
+                                            (*dilations)[idx], (*strides)[idx]);
+    }
+    return *this;
+  }
+
+  /// Match expected indexing maps layout. Returns *this for method chaining.
+  ConvMatcherBuilder &matchMaps(ArrayRef<ArrayRef<AffineExpr>> maps) {
+    if (matched)
+      matched &= convLayoutMatches(maps, indexingMaps, ctx);
+    return *this;
+  }
+
+  /// Match body pattern. This should be called last.
+  bool matchBody() {
+    if (!matched)
+      return false;
+    Block *body = op.getBlock();
+    auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
+    switch (poolingType) {
+    case PoolingType::None:
+      return bodyMatcherForConvolutionOps(yieldOp.getOperand(0), body);
+    case PoolingType::MaxSigned:
+      return bodyMatcherForMaxSignedPoolOps(yieldOp.getOperand(0), body);
+    case PoolingType::MaxUnsigned:
+      return bodyMatcherForMaxUnsignedPoolOps(yieldOp.getOperand(0), body);
+    case PoolingType::MinSigned:
+      return bodyMatcherForMinSignedPoolOps(yieldOp.getOperand(0), body);
+    case PoolingType::MinUnsigned:
+      return bodyMatcherForMinUnsignedPoolOps(yieldOp.getOperand(0), body);
+    case PoolingType::Sum:
+      return bodyMatcherForSumPoolOps(yieldOp.getOperand(0), body);
+    }
+    return false;
+  }
+};
+
+//===----------------------------------------------------------------------===//
+// Matchers for specific convolution operations.
+//===----------------------------------------------------------------------===//
+
 // #inputMap = affine_map<(W, w) -> (W + w)>
 // #filterMap = affine_map<(W, w) -> (w)>
 // #outputMap = affine_map<(W, w) -> (W)>
@@ -447,29 +542,15 @@ bool isaConvolutionOpOfType<linalg::Conv1DOp>(LinalgOp op,
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(1, 1);
-  *strides = SmallVector<int64_t>(1, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr W = getAffineDimExpr(0, context);
-  AffineExpr w = getAffineDimExpr(1, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/0, /*fDim=*/0,
-                                  /*oDim=*/0, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{W * (*strides)[0] + w * (*dilations)[0]},
-           /*filterMap=*/{w},
-           /*outputMap=*/{W}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForConvolutionOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/1, dilations, strides);
+  AffineExpr W = m.dim(0);
+  AffineExpr w = m.dim(1);
+
+  return m.matchStride(/*iDim=*/0, /*fDim=*/0, /*oDim=*/0, /*idx=*/0)
+      .matchMaps({/*inputMap=*/{m.strided(W, w, 0)},
+                  /*filterMap=*/{w},
+                  /*outputMap=*/{W}})
+      .matchBody();
 }
 
 // #inputMap  = affine_map<(N, W, F, w, c) -> (N, W + w, c)>
@@ -485,32 +566,18 @@ bool isaConvolutionOpOfType<linalg::Conv1DNwcWcfOp>(
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(1, 1);
-  *strides = SmallVector<int64_t>(1, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr N = getAffineDimExpr(0, context);
-  AffineExpr W = getAffineDimExpr(1, context);
-  AffineExpr F = getAffineDimExpr(2, context);
-  AffineExpr w = getAffineDimExpr(3, context);
-  AffineExpr c = getAffineDimExpr(4, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/1, /*fDim=*/0,
-                                  /*oDim=*/1, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{N, W * (*strides)[0] + w * (*dilations)[0], c},
-           /*filterMap=*/{w, c, F},
-           /*outputMap=*/{N, W, F}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForConvolutionOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/1, dilations, strides);
+  AffineExpr N = m.dim(0);
+  AffineExpr W = m.dim(1);
+  AffineExpr F = m.dim(2);
+  AffineExpr w = m.dim(3);
+  AffineExpr c = m.dim(4);
+
+  return m.matchStride(/*iDim=*/1, /*fDim=*/0, /*oDim=*/1, /*idx=*/0)
+      .matchMaps({/*inputMap=*/{N, m.strided(W, w, 0), c},
+                  /*filterMap=*/{w, c, F},
+                  /*outputMap=*/{N, W, F}})
+      .matchBody();
 }
 
 // #inputMap  = affine_map<(N, F, W, c, w) -> (N, c, W + w)>
@@ -526,32 +593,18 @@ bool isaConvolutionOpOfType<linalg::Conv1DNcwFcwOp>(
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(1, 1);
-  *strides = SmallVector<int64_t>(1, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr N = getAffineDimExpr(0, context);
-  AffineExpr F = getAffineDimExpr(1, context);
-  AffineExpr W = getAffineDimExpr(2, context);
-  AffineExpr c = getAffineDimExpr(3, context);
-  AffineExpr w = getAffineDimExpr(4, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/2, /*fDim=*/2,
-                                  /*oDim=*/2, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{N, c, W * (*strides)[0] + w * (*dilations)[0]},
-           /*filterMap=*/{F, c, w},
-           /*outputMap=*/{N, F, W}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForConvolutionOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/1, dilations, strides);
+  AffineExpr N = m.dim(0);
+  AffineExpr F = m.dim(1);
+  AffineExpr W = m.dim(2);
+  AffineExpr c = m.dim(3);
+  AffineExpr w = m.dim(4);
+
+  return m.matchStride(/*iDim=*/2, /*fDim=*/2, /*oDim=*/2, /*idx=*/0)
+      .matchMaps({/*inputMap=*/{N, c, m.strided(W, w, 0)},
+                  /*filterMap=*/{F, c, w},
+                  /*outputMap=*/{N, F, W}})
+      .matchBody();
 }
 
 // #inputMap  = affine_map<(H, W, h, w) -> (H + h, W + w)>
@@ -567,36 +620,18 @@ bool isaConvolutionOpOfType<linalg::Conv2DOp>(LinalgOp op,
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(2, 1);
-  *strides = SmallVector<int64_t>(2, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr H = getAffineDimExpr(0, context);
-  AffineExpr W = getAffineDimExpr(1, context);
-  AffineExpr h = getAffineDimExpr(2, context);
-  AffineExpr w = getAffineDimExpr(3, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: H * stride + h * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/0, /*fDim=*/0,
-                                  /*oDim=*/0, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/1, /*fDim=*/1,
-                                  /*oDim=*/1, (*dilations)[1], (*strides)[1]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{H * (*strides)[0] + h * (*dilations)[0],
-                         W * (*strides)[1] + w * (*dilations)[1]},
-           /*filterMap=*/{h, w},
-           /*outputMap=*/{H, W}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForConvolutionOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/2, dilations, strides);
+  AffineExpr H = m.dim(0);
+  AffineExpr W = m.dim(1);
+  AffineExpr h = m.dim(2);
+  AffineExpr w = m.dim(3);
+
+  return m.matchStride(/*iDim=*/0, /*fDim=*/0, /*oDim=*/0, /*idx=*/0)
+      .matchStride(/*iDim=*/1, /*fDim=*/1, /*oDim=*/1, /*idx=*/1)
+      .matchMaps({/*inputMap=*/{m.strided(H, h, 0), m.strided(W, w, 1)},
+                  /*filterMap=*/{h, w},
+                  /*outputMap=*/{H, W}})
+      .matchBody();
 }
 
 // #inputMap  = affine_map<(D, H, W, d, h, w) -> (D + d, H + h, W + w)>
@@ -612,43 +647,22 @@ bool isaConvolutionOpOfType<linalg::Conv3DOp>(LinalgOp op,
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(3, 1);
-  *strides = SmallVector<int64_t>(3, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr D = getAffineDimExpr(0, context);
-  AffineExpr H = getAffineDimExpr(1, context);
-  AffineExpr W = getAffineDimExpr(2, context);
-  AffineExpr d = getAffineDimExpr(3, context);
-  AffineExpr h = getAffineDimExpr(4, context);
-  AffineExpr w = getAffineDimExpr(5, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: D * stride + d * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/0, /*fDim=*/0,
-                                  /*oDim=*/0, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match: H * stride + h * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/1, /*fDim=*/1,
-                                  /*oDim=*/1, (*dilations)[1], (*strides)[1]))
-    return false;
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/2, /*fDim=*/2,
-                                  /*oDim=*/2, (*dilations)[2], (*strides)[2]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{D * (*strides)[0] + d * (*dilations)[0],
-                         H * (*strides)[1] + h * (*dilations)[1],
-                         W * (*strides)[2] + w * (*dilations)[2]},
-           /*filterMap=*/{d, h, w},
-           /*outputMap=*/{D, H, W}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForConvolutionOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/3, dilations, strides);
+  AffineExpr D = m.dim(0);
+  AffineExpr H = m.dim(1);
+  AffineExpr W = m.dim(2);
+  AffineExpr d = m.dim(3);
+  AffineExpr h = m.dim(4);
+  AffineExpr w = m.dim(5);
+
+  return m.matchStride(/*iDim=*/0, /*fDim=*/0, /*oDim=*/0, /*idx=*/0)
+      .matchStride(/*iDim=*/1, /*fDim=*/1, /*oDim=*/1, /*idx=*/1)
+      .matchStride(/*iDim=*/2, /*fDim=*/2, /*oDim=*/2, /*idx=*/2)
+      .matchMaps({/*inputMap=*/{m.strided(D, d, 0), m.strided(H, h, 1),
+                                m.strided(W, w, 2)},
+                  /*filterMap=*/{d, h, w},
+                  /*outputMap=*/{D, H, W}})
+      .matchBody();
 }
 
 // #inputMap  = affine_map<(N, W, C, w) -> (N, C, W + w)>
@@ -664,31 +678,17 @@ bool isaConvolutionOpOfType<linalg::DepthwiseConv1DNcwCwOp>(
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(1, 1);
-  *strides = SmallVector<int64_t>(1, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr N = getAffineDimExpr(0, context);
-  AffineExpr W = getAffineDimExpr(1, context);
-  AffineExpr C = getAffineDimExpr(2, context);
-  AffineExpr w = getAffineDimExpr(3, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/2, /*fDim=*/1,
-                                  /*oDim=*/2, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{N, C, W * (*strides)[0] + w * (*dilations)[0]},
-           /*filterMap=*/{C, w},
-           /*outputMap=*/{N, C, W}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForConvolutionOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/1, dilations, strides);
+  AffineExpr N = m.dim(0);
+  AffineExpr W = m.dim(1);
+  AffineExpr C = m.dim(2);
+  AffineExpr w = m.dim(3);
+
+  return m.matchStride(/*iDim=*/2, /*fDim=*/1, /*oDim=*/2, /*idx=*/0)
+      .matchMaps({/*inputMap=*/{N, C, m.strided(W, w, 0)},
+                  /*filterMap=*/{C, w},
+                  /*outputMap=*/{N, C, W}})
+      .matchBody();
 }
 
 // #inputMap = affine_map<(N, W, C, w) -> (N, W + w, C)>
@@ -704,31 +704,17 @@ bool isaConvolutionOpOfType<linalg::DepthwiseConv1DNwcWcOp>(
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(1, 1);
-  *strides = SmallVector<int64_t>(1, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr N = getAffineDimExpr(0, context);
-  AffineExpr W = getAffineDimExpr(1, context);
-  AffineExpr C = getAffineDimExpr(2, context);
-  AffineExpr w = getAffineDimExpr(3, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/1, /*fDim=*/0,
-                                  /*oDim=*/1, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{N, W * (*strides)[0] + w * (*dilations)[0], C},
-           /*filterMap=*/{w, C},
-           /*outputMap=*/{N, W, C}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForConvolutionOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/1, dilations, strides);
+  AffineExpr N = m.dim(0);
+  AffineExpr W = m.dim(1);
+  AffineExpr C = m.dim(2);
+  AffineExpr w = m.dim(3);
+
+  return m.matchStride(/*iDim=*/1, /*fDim=*/0, /*oDim=*/1, /*idx=*/0)
+      .matchMaps({/*inputMap=*/{N, m.strided(W, w, 0), C},
+                  /*filterMap=*/{w, C},
+                  /*outputMap=*/{N, W, C}})
+      .matchBody();
 }
 
 // #inputMap  = affine_map<(N, W, C, CM, w) -> (N, W + w, C)>
@@ -744,32 +730,18 @@ bool isaConvolutionOpOfType<linalg::DepthwiseConv1DNwcWcmOp>(
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(1, 1);
-  *strides = SmallVector<int64_t>(1, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr N = getAffineDimExpr(0, context);
-  AffineExpr W = getAffineDimExpr(1, context);
-  AffineExpr C = getAffineDimExpr(2, context);
-  AffineExpr CM = getAffineDimExpr(3, context);
-  AffineExpr w = getAffineDimExpr(4, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/1, /*fDim=*/0,
-                                  /*oDim=*/1, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{N, W * (*strides)[0] + w * (*dilations)[0], C},
-           /*filterMap=*/{w, C, CM},
-           /*outputMap=*/{N, W, C, CM}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForConvolutionOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/1, dilations, strides);
+  AffineExpr N = m.dim(0);
+  AffineExpr W = m.dim(1);
+  AffineExpr C = m.dim(2);
+  AffineExpr CM = m.dim(3);
+  AffineExpr w = m.dim(4);
+
+  return m.matchStride(/*iDim=*/1, /*fDim=*/0, /*oDim=*/1, /*idx=*/0)
+      .matchMaps({/*inputMap=*/{N, m.strided(W, w, 0), C},
+                  /*filterMap=*/{w, C, CM},
+                  /*outputMap=*/{N, W, C, CM}})
+      .matchBody();
 }
 
 // #inputMap = affine_map<(N, H, W, C, h, w) -> (N, C, H + h, W + w)>
@@ -785,38 +757,20 @@ bool isaConvolutionOpOfType<linalg::DepthwiseConv2DNchwChwOp>(
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(2, 1);
-  *strides = SmallVector<int64_t>(2, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr N = getAffineDimExpr(0, context);
-  AffineExpr H = getAffineDimExpr(1, context);
-  AffineExpr W = getAffineDimExpr(2, context);
-  AffineExpr C = getAffineDimExpr(3, context);
-  AffineExpr h = getAffineDimExpr(4, context);
-  AffineExpr w = getAffineDimExpr(5, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: H * stride + h * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/2, /*fDim=*/1,
-                                  /*oDim=*/2, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/3, /*fDim=*/2,
-                                  /*oDim=*/3, (*dilations)[1], (*strides)[1]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{N, C, H * (*strides)[0] + h * (*dilations)[0],
-                         W * (*strides)[1] + w * (*dilations)[1]},
-           /*filterMap=*/{C, h, w},
-           /*outputMap=*/{N, C, H, W}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForConvolutionOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/2, dilations, strides);
+  AffineExpr N = m.dim(0);
+  AffineExpr H = m.dim(1);
+  AffineExpr W = m.dim(2);
+  AffineExpr C = m.dim(3);
+  AffineExpr h = m.dim(4);
+  AffineExpr w = m.dim(5);
+
+  return m.matchStride(/*iDim=*/2, /*fDim=*/1, /*oDim=*/2, /*idx=*/0)
+      .matchStride(/*iDim=*/3, /*fDim=*/2, /*oDim=*/3, /*idx=*/1)
+      .matchMaps({/*inputMap=*/{N, C, m.strided(H, h, 0), m.strided(W, w, 1)},
+                  /*filterMap=*/{C, h, w},
+                  /*outputMap=*/{N, C, H, W}})
+      .matchBody();
 }
 
 // #inputMap = affine_map<(N, D, H, W, CM, d, h, w, C)
@@ -835,46 +789,25 @@ bool isaConvolutionOpOfType<linalg::DepthwiseConv3DNdhwcDhwcmOp>(
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(3, 1);
-  *strides = SmallVector<int64_t>(3, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr N = getAffineDimExpr(0, context);
-  AffineExpr D = getAffineDimExpr(1, context);
-  AffineExpr H = getAffineDimExpr(2, context);
-  AffineExpr W = getAffineDimExpr(3, context);
-  AffineExpr CM = getAffineDimExpr(4, context);
-  AffineExpr d = getAffineDimExpr(5, context);
-  AffineExpr h = getAffineDimExpr(6, context);
-  AffineExpr w = getAffineDimExpr(7, context);
-  AffineExpr C = getAffineDimExpr(8, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: D * stride + d * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/1, /*fDim=*/0,
-                                  /*oDim=*/1, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match: H * stride + h * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/2, /*fDim=*/1,
-                                  /*oDim=*/2, (*dilations)[1], (*strides)[1]))
-    return false;
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/3, /*fDim=*/2,
-                                  /*oDim=*/3, (*dilations)[2], (*strides)[2]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{N, D * (*strides)[0] + d * (*dilations)[0],
-                         H * (*strides)[1] + h * (*dilations)[1],
-                         W * (*strides)[2] + w * (*dilations)[2], C},
-           /*filterMap=*/{d, h, w, C, CM},
-           /*outputMap=*/{N, D, H, W, C, CM}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForConvolutionOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/3, dilations, strides);
+  AffineExpr N = m.dim(0);
+  AffineExpr D = m.dim(1);
+  AffineExpr H = m.dim(2);
+  AffineExpr W = m.dim(3);
+  AffineExpr CM = m.dim(4);
+  AffineExpr d = m.dim(5);
+  AffineExpr h = m.dim(6);
+  AffineExpr w = m.dim(7);
+  AffineExpr C = m.dim(8);
+
+  return m.matchStride(/*iDim=*/1, /*fDim=*/0, /*oDim=*/1, /*idx=*/0)
+      .matchStride(/*iDim=*/2, /*fDim=*/1, /*oDim=*/2, /*idx=*/1)
+      .matchStride(/*iDim=*/3, /*fDim=*/2, /*oDim=*/3, /*idx=*/2)
+      .matchMaps({/*inputMap=*/{N, m.strided(D, d, 0), m.strided(H, h, 1),
+                                m.strided(W, w, 2), C},
+                  /*filterMap=*/{d, h, w, C, CM},
+                  /*outputMap=*/{N, D, H, W, C, CM}})
+      .matchBody();
 }
 
 // #inputMap = affine_map<(N, H, W, C, h, w) -> (N, H + h, W + w, C)>
@@ -890,38 +823,21 @@ bool isaConvolutionOpOfType<linalg::PoolingNhwcMaxOp>(
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(2, 1);
-  *strides = SmallVector<int64_t>(2, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr N = getAffineDimExpr(0, context);
-  AffineExpr H = getAffineDimExpr(1, context);
-  AffineExpr W = getAffineDimExpr(2, context);
-  AffineExpr C = getAffineDimExpr(3, context);
-  AffineExpr h = getAffineDimExpr(4, context);
-  AffineExpr w = getAffineDimExpr(5, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: H * stride + h * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/1, /*fDim=*/0,
-                                  /*oDim=*/1, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/2, /*fDim=*/1,
-                                  /*oDim=*/2, (*dilations)[1], (*strides)[1]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{N, H * (*strides)[0] + h * (*dilations)[0],
-                         W * (*strides)[1] + w * (*dilations)[1], C},
-           /*filterMap=*/{h, w},
-           /*outputMap=*/{N, H, W, C}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForMaxSignedPoolOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/2, dilations, strides,
+                       PoolingType::MaxSigned);
+  AffineExpr N = m.dim(0);
+  AffineExpr H = m.dim(1);
+  AffineExpr W = m.dim(2);
+  AffineExpr C = m.dim(3);
+  AffineExpr h = m.dim(4);
+  AffineExpr w = m.dim(5);
+
+  return m.matchStride(/*iDim=*/1, /*fDim=*/0, /*oDim=*/1, /*idx=*/0)
+      .matchStride(/*iDim=*/2, /*fDim=*/1, /*oDim=*/2, /*idx=*/1)
+      .matchMaps({/*inputMap=*/{N, m.strided(H, h, 0), m.strided(W, w, 1), C},
+                  /*filterMap=*/{h, w},
+                  /*outputMap=*/{N, H, W, C}})
+      .matchBody();
 }
 
 // #inputMap = affine_map<(N, H, W, C, h, w) -> (N, H + h, W + w, C)>
@@ -937,38 +853,21 @@ bool isaConvolutionOpOfType<linalg::PoolingNhwcMinOp>(
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(2, 1);
-  *strides = SmallVector<int64_t>(2, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr N = getAffineDimExpr(0, context);
-  AffineExpr H = getAffineDimExpr(1, context);
-  AffineExpr W = getAffineDimExpr(2, context);
-  AffineExpr C = getAffineDimExpr(3, context);
-  AffineExpr h = getAffineDimExpr(4, context);
-  AffineExpr w = getAffineDimExpr(5, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: H * stride + h * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/1, /*fDim=*/0,
-                                  /*oDim=*/1, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/2, /*fDim=*/1,
-                                  /*oDim=*/2, (*dilations)[1], (*strides)[1]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{N, H * (*strides)[0] + h * (*dilations)[0],
-                         W * (*strides)[1] + w * (*dilations)[1], C},
-           /*filterMap=*/{h, w},
-           /*outputMap=*/{N, H, W, C}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForMinSignedPoolOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/2, dilations, strides,
+                       PoolingType::MinSigned);
+  AffineExpr N = m.dim(0);
+  AffineExpr H = m.dim(1);
+  AffineExpr W = m.dim(2);
+  AffineExpr C = m.dim(3);
+  AffineExpr h = m.dim(4);
+  AffineExpr w = m.dim(5);
+
+  return m.matchStride(/*iDim=*/1, /*fDim=*/0, /*oDim=*/1, /*idx=*/0)
+      .matchStride(/*iDim=*/2, /*fDim=*/1, /*oDim=*/2, /*idx=*/1)
+      .matchMaps({/*inputMap=*/{N, m.strided(H, h, 0), m.strided(W, w, 1), C},
+                  /*filterMap=*/{h, w},
+                  /*outputMap=*/{N, H, W, C}})
+      .matchBody();
 }
 
 // #inputMap = affine_map<(N, H, W, C, h, w) -> (N, H + h, W + w, C)>
@@ -984,38 +883,21 @@ bool isaConvolutionOpOfType<linalg::PoolingNhwcSumOp>(
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(2, 1);
-  *strides = SmallVector<int64_t>(2, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr N = getAffineDimExpr(0, context);
-  AffineExpr H = getAffineDimExpr(1, context);
-  AffineExpr W = getAffineDimExpr(2, context);
-  AffineExpr C = getAffineDimExpr(3, context);
-  AffineExpr h = getAffineDimExpr(4, context);
-  AffineExpr w = getAffineDimExpr(5, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: H * stride + h * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/1, /*fDim=*/0,
-                                  /*oDim=*/1, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/2, /*fDim=*/1,
-                                  /*oDim=*/2, (*dilations)[1], (*strides)[1]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{N, H * (*strides)[0] + h * (*dilations)[0],
-                         W * (*strides)[1] + w * (*dilations)[1], C},
-           /*filterMap=*/{h, w},
-           /*outputMap=*/{N, H, W, C}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForSumPoolOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/2, dilations, strides,
+                       PoolingType::Sum);
+  AffineExpr N = m.dim(0);
+  AffineExpr H = m.dim(1);
+  AffineExpr W = m.dim(2);
+  AffineExpr C = m.dim(3);
+  AffineExpr h = m.dim(4);
+  AffineExpr w = m.dim(5);
+
+  return m.matchStride(/*iDim=*/1, /*fDim=*/0, /*oDim=*/1, /*idx=*/0)
+      .matchStride(/*iDim=*/2, /*fDim=*/1, /*oDim=*/2, /*idx=*/1)
+      .matchMaps({/*inputMap=*/{N, m.strided(H, h, 0), m.strided(W, w, 1), C},
+                  /*filterMap=*/{h, w},
+                  /*outputMap=*/{N, H, W, C}})
+      .matchBody();
 }
 
 // #inputMap = affine_map<(N, H, W, C, h, w) -> (N, H + h, W + w, C)>
@@ -1031,38 +913,21 @@ bool isaConvolutionOpOfType<linalg::PoolingNhwcMaxUnsignedOp>(
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(2, 1);
-  *strides = SmallVector<int64_t>(2, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr N = getAffineDimExpr(0, context);
-  AffineExpr H = getAffineDimExpr(1, context);
-  AffineExpr W = getAffineDimExpr(2, context);
-  AffineExpr C = getAffineDimExpr(3, context);
-  AffineExpr h = getAffineDimExpr(4, context);
-  AffineExpr w = getAffineDimExpr(5, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: H * stride + h * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/1, /*fDim=*/0,
-                                  /*oDim=*/1, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/2, /*fDim=*/1,
-                                  /*oDim=*/2, (*dilations)[1], (*strides)[1]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{N, H * (*strides)[0] + h * (*dilations)[0],
-                         W * (*strides)[1] + w * (*dilations)[1], C},
-           /*filterMap=*/{h, w},
-           /*outputMap=*/{N, H, W, C}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForMaxUnsignedPoolOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/2, dilations, strides,
+                       PoolingType::MaxUnsigned);
+  AffineExpr N = m.dim(0);
+  AffineExpr H = m.dim(1);
+  AffineExpr W = m.dim(2);
+  AffineExpr C = m.dim(3);
+  AffineExpr h = m.dim(4);
+  AffineExpr w = m.dim(5);
+
+  return m.matchStride(/*iDim=*/1, /*fDim=*/0, /*oDim=*/1, /*idx=*/0)
+      .matchStride(/*iDim=*/2, /*fDim=*/1, /*oDim=*/2, /*idx=*/1)
+      .matchMaps({/*inputMap=*/{N, m.strided(H, h, 0), m.strided(W, w, 1), C},
+                  /*filterMap=*/{h, w},
+                  /*outputMap=*/{N, H, W, C}})
+      .matchBody();
 }
 
 // #inputMap = affine_map<(N, H, W, C, h, w) -> (N, H + h, W + w, C)>
@@ -1078,38 +943,21 @@ bool isaConvolutionOpOfType<linalg::PoolingNhwcMinUnsignedOp>(
   assert(isaConvolutionOpInterface(op) &&
          "expected op to implement ConvolutionOpInterface");
 
-  *dilations = SmallVector<int64_t>(2, 1);
-  *strides = SmallVector<int64_t>(2, 1);
-  MLIRContext *context = op->getContext();
-  AffineExpr N = getAffineDimExpr(0, context);
-  AffineExpr H = getAffineDimExpr(1, context);
-  AffineExpr W = getAffineDimExpr(2, context);
-  AffineExpr C = getAffineDimExpr(3, context);
-  AffineExpr h = getAffineDimExpr(4, context);
-  AffineExpr w = getAffineDimExpr(5, context);
-  ArrayAttr indexingMaps = op.getIndexingMaps();
-  // First fetch dilations/strides :-
-  // Match: H * stride + h * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/1, /*fDim=*/0,
-                                  /*oDim=*/1, (*dilations)[0], (*strides)[0]))
-    return false;
-  // Match: W * stride + w * dilation
-  if (!matchConvDimAddExprPattern(indexingMaps, /*iDim=*/2, /*fDim=*/1,
-                                  /*oDim=*/2, (*dilations)[1], (*strides)[1]))
-    return false;
-  // Match expected indexing maps
-  if (!convLayoutMatches(
-          {/*inputMap=*/{N, H * (*strides)[0] + h * (*dilations)[0],
-                         W * (*strides)[1] + w * (*dilations)[1], C},
-           /*filterMap=*/{h, w},
-           /*outputMap=*/{N, H, W, C}},
-          indexingMaps, context))
-    return false;
-  // Match body
-  Block *body = op.getBlock();
-  auto yieldOp = cast<linalg::YieldOp>(body->getTerminator());
-  Value yieldVal = yieldOp.getOperand(0);
-  return bodyMatcherForMinUnsignedPoolOps(yieldVal, body);
+  ConvMatcherBuilder m(op, /*spatialRank=*/2, dilations, strides,
+                       PoolingType::MinUnsigned);
+  AffineExpr N = m.dim(0);
+  AffineExpr H = m.dim(1);
+  AffineExpr W = m.dim(2);
+  AffineExpr C = m.dim(3);
+  AffineExpr h = m.dim(4);
+  AffineExpr w = m.dim(5);
+
+  return m.matchStride(/*iDim=*/1, /*fDim=*/0, /*oDim=*/1, /*idx=*/0)
+      .matchStride(/*iDim=*/2, /*fDim=*/1, /*oDim=*/2, /*idx=*/1)
+      .matchMaps({/*inputMap=*/{N, m.strided(H, h, 0), m.strided(W, w, 1), C},
+                  /*filterMap=*/{h, w},
+                  /*outputMap=*/{N, H, W, C}})
+      .matchBody();
 }
 
 Value makeComposedPadHighOp(OpBuilder &b, Location loc, RankedTensorType type,
diff --git a/mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp b/mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
index 2a857eddbb932..a2e16abf0cb13 100644
--- a/mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
+++ b/mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
@@ -675,7 +675,7 @@ MmaSyncBuilder::buildMemRefLoads(OpBuilder &b, Location loc,
 Value MmaSyncBuilder::buildMmaSyncMemRefLoadOperand(
     OpBuilder &b, Location loc, OpFoldResult laneId, Value memref,
     IndexCalculator indexFn, ArrayRef<int64_t> vectorShape) {
-  auto loads = buildMemRefLoads(b, loc, laneId, memref, std::move(indexFn));
+  auto loads = buildMemRefLoads(b, loc, laneId, memref, indexFn);
 
   Type elementType = getElementTypeOrSelf(memref.getType());
   auto vt = VectorType::get(vectorShape, elementType);
@@ -727,7 +727,7 @@ SmallVector<Operation *> MmaSyncBuilder::buildMmaSyncMemRefStoreOperand(
       [&](Value v, int64_t linearIdx, ArrayRef<int64_t> indices) {
         toStore.push_back(v);
       });
-  return buildMemRefStores(b, loc, toStore, laneId, memref, std::move(indexFn));
+  return buildMemRefStores(b, loc, toStore, laneId, memref, indexFn);
 }
 
 static std::tuple<SmallVector<int64_t>, SmallVector<int64_t>,
diff --git a/mlir/lib/Dialect/OpenACC/IR/OpenACC.cpp b/mlir/lib/Dialect/OpenACC/IR/OpenACC.cpp
index 841d1d781f1a1..7039bbe1d11ec 100644
--- a/mlir/lib/Dialect/OpenACC/IR/OpenACC.cpp
+++ b/mlir/lib/Dialect/OpenACC/IR/OpenACC.cpp
@@ -203,12 +203,68 @@ struct MemRefPointerLikeModel
 
     return false;
   }
+
+  mlir::Value genLoad(Type pointer, OpBuilder &builder, Location loc,
+                      TypedValue<PointerLikeType> srcPtr,
+                      Type valueType) const {
+    // Load from a memref - only valid for scalar memrefs (rank 0).
+    // This is because the address computation for memrefs is part of the load
+    // (and not computed separately), but the API does not have arguments for
+    // indexing.
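+    // For example, with srcPtr of type memref<f32> this emits
+    //   %v = memref.load %srcPtr[] : memref<f32>
+    // whereas any ranked memref (e.g. memref<4xf32>) would need indices that
+    // this interface cannot provide.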
+    auto memrefValue = dyn_cast_if_present<TypedValue<MemRefType>>(srcPtr);
+    if (!memrefValue)
+      return {};
+
+    auto memrefTy = memrefValue.getType();
+
+    // Only load from scalar memrefs (rank 0)
+    if (memrefTy.getRank() != 0)
+      return {};
+
+    return memref::LoadOp::create(builder, loc, memrefValue);
+  }
+
+  bool genStore(Type pointer, OpBuilder &builder, Location loc,
+                Value valueToStore, TypedValue<PointerLikeType> destPtr) const {
+    // Store to a memref - only valid for scalar memrefs (rank 0)
+    // This is because the address computation for memrefs is part of the store
+    // (and not computed separately), but the API does not have arguments for
+    // indexing.
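+    // For example, with destPtr of type memref<f32> this emits
+    //   memref.store %valueToStore, %destPtr[] : memref<f32>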
+    auto memrefValue = dyn_cast_if_present<TypedValue<MemRefType>>(destPtr);
+    if (!memrefValue)
+      return false;
+
+    auto memrefTy = memrefValue.getType();
+
+    // Only store to scalar memrefs (rank 0)
+    if (memrefTy.getRank() != 0)
+      return false;
+
+    memref::StoreOp::create(builder, loc, valueToStore, memrefValue);
+    return true;
+  }
 };
 
 struct LLVMPointerPointerLikeModel
     : public PointerLikeType::ExternalModel<LLVMPointerPointerLikeModel,
                                             LLVM::LLVMPointerType> {
   Type getElementType(Type pointer) const { return Type(); }
+
+  mlir::Value genLoad(Type pointer, OpBuilder &builder, Location loc,
+                      TypedValue<PointerLikeType> srcPtr,
+                      Type valueType) const {
+    // For LLVM pointers, we need the valueType to determine what to load
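+    // For instance, loading an f32 through an opaque pointer produces
+    //   %v = llvm.load %srcPtr : !llvm.ptr -> f32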
+    if (!valueType)
+      return {};
+
+    return LLVM::LoadOp::create(builder, loc, valueType, srcPtr);
+  }
+
+  bool genStore(Type pointer, OpBuilder &builder, Location loc,
+                Value valueToStore, TypedValue<PointerLikeType> destPtr) const {
+    LLVM::StoreOp::create(builder, loc, valueToStore, destPtr);
+    return true;
+  }
 };
 
 struct MemrefAddressOfGlobalModel
@@ -4293,6 +4349,24 @@ RoutineOp::getGangDimValue(mlir::acc::DeviceType deviceType) {
   return std::nullopt;
 }
 
+void RoutineOp::addSeq(MLIRContext *context,
+                       llvm::ArrayRef<DeviceType> effectiveDeviceTypes) {
+  setSeqAttr(addDeviceTypeAffectedOperandHelper(context, getSeqAttr(),
+                                                effectiveDeviceTypes));
+}
+
+void RoutineOp::addVector(MLIRContext *context,
+                          llvm::ArrayRef<DeviceType> effectiveDeviceTypes) {
+  setVectorAttr(addDeviceTypeAffectedOperandHelper(context, getVectorAttr(),
+                                                   effectiveDeviceTypes));
+}
+
+void RoutineOp::addWorker(MLIRContext *context,
+                          llvm::ArrayRef<DeviceType> effectiveDeviceTypes) {
+  setWorkerAttr(addDeviceTypeAffectedOperandHelper(context, getWorkerAttr(),
+                                                   effectiveDeviceTypes));
+}
+
 //===----------------------------------------------------------------------===//
 // InitOp
 //===----------------------------------------------------------------------===//
diff --git a/mlir/lib/Dialect/OpenACC/Transforms/ACCLegalizeSerial.cpp b/mlir/lib/Dialect/OpenACC/Transforms/ACCLegalizeSerial.cpp
new file mode 100644
index 0000000000000..f41ce276f994f
--- /dev/null
+++ b/mlir/lib/Dialect/OpenACC/Transforms/ACCLegalizeSerial.cpp
@@ -0,0 +1,117 @@
+//===- ACCLegalizeSerial.cpp - Legalize ACC Serial region -----------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass converts acc.serial into acc.parallel with num_gangs(1)
+// num_workers(1) vector_length(1).
+//
+// This transformation simplifies processing of acc regions by unifying the
+// handling of serial and parallel constructs. Since an OpenACC serial region
+// executes sequentially (like a parallel region with a single gang, worker, and
+// vector), this conversion is semantically equivalent while enabling code reuse
+// in later compilation stages.
+//
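+// For example (data clauses, async/wait operands, and region bodies
+// abbreviated; syntax approximate), a construct such as
+//
+//   acc.serial {
+//     ...
+//   }
+//
+// becomes
+//
+//   %c1 = arith.constant 1 : i32
+//   acc.parallel num_gangs({%c1 : i32}) num_workers(%c1 : i32)
+//       vector_length(%c1 : i32) {
+//     ...
+//   }
+//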
+//===----------------------------------------------------------------------===//
+
+#include "mlir/Dialect/OpenACC/Transforms/Passes.h"
+
+#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/Dialect/OpenACC/OpenACC.h"
+#include "mlir/IR/Builders.h"
+#include "mlir/IR/BuiltinAttributes.h"
+#include "mlir/IR/Location.h"
+#include "mlir/IR/MLIRContext.h"
+#include "mlir/IR/PatternMatch.h"
+#include "mlir/IR/Region.h"
+#include "mlir/IR/Value.h"
+#include "mlir/Support/LLVM.h"
+#include "mlir/Support/LogicalResult.h"
+#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
+#include "llvm/Support/Debug.h"
+
+namespace mlir {
+namespace acc {
+#define GEN_PASS_DEF_ACCLEGALIZESERIAL
+#include "mlir/Dialect/OpenACC/Transforms/Passes.h.inc"
+} // namespace acc
+} // namespace mlir
+
+#define DEBUG_TYPE "acc-legalize-serial"
+
+namespace {
+using namespace mlir;
+
+struct ACCSerialOpConversion : public OpRewritePattern<acc::SerialOp> {
+  using OpRewritePattern<acc::SerialOp>::OpRewritePattern;
+
+  LogicalResult matchAndRewrite(acc::SerialOp serialOp,
+                                PatternRewriter &rewriter) const override {
+
+    const Location loc = serialOp.getLoc();
+
+    // Create a container holding the constant value 1 for use as the
+    // num_gangs, num_workers, and vector_length operands.
+    llvm::SmallVector<mlir::Value> numValues;
+    auto value = arith::ConstantIntOp::create(rewriter, loc, 1, 32);
+    numValues.push_back(value);
+
+    // num_gangs takes a variadic list of values per device_type, so record
+    // how many values belong to this entry in a segment attribute.
+    llvm::SmallVector<int32_t> numGangsSegments;
+    numGangsSegments.push_back(numValues.size());
+    auto gangSegmentsAttr = rewriter.getDenseI32ArrayAttr(numGangsSegments);
+
+    // Create a device_type attribute set to `none`, which ensures that the
+    // parallelism values above apply to the default device_type clause.
+    llvm::SmallVector<mlir::Attribute> crtDeviceTypes;
+    auto crtDeviceTypeAttr = mlir::acc::DeviceTypeAttr::get(
+        rewriter.getContext(), mlir::acc::DeviceType::None);
+    crtDeviceTypes.push_back(crtDeviceTypeAttr);
+    auto devTypeAttr =
+        mlir::ArrayAttr::get(rewriter.getContext(), crtDeviceTypes);
+
+    LLVM_DEBUG(llvm::dbgs() << "acc.serial OP: " << serialOp << "\n");
+
+    // Create a new acc.parallel op with the same operands, additionally
+    // passing the num_gangs, num_workers, and vector_length values built
+    // above.
+    acc::ParallelOp parOp = acc::ParallelOp::create(
+        rewriter, loc, serialOp.getAsyncOperands(),
+        serialOp.getAsyncOperandsDeviceTypeAttr(), serialOp.getAsyncOnlyAttr(),
+        serialOp.getWaitOperands(), serialOp.getWaitOperandsSegmentsAttr(),
+        serialOp.getWaitOperandsDeviceTypeAttr(),
+        serialOp.getHasWaitDevnumAttr(), serialOp.getWaitOnlyAttr(), numValues,
+        gangSegmentsAttr, devTypeAttr, numValues, devTypeAttr, numValues,
+        devTypeAttr, serialOp.getIfCond(), serialOp.getSelfCond(),
+        serialOp.getSelfAttrAttr(), serialOp.getReductionOperands(),
+        serialOp.getPrivateOperands(), serialOp.getFirstprivateOperands(),
+        serialOp.getDataClauseOperands(), serialOp.getDefaultAttrAttr(),
+        serialOp.getCombinedAttr());
+
+    parOp.getRegion().takeBody(serialOp.getRegion());
+
+    LLVM_DEBUG(llvm::dbgs() << "acc.parallel OP: " << parOp << "\n");
+    rewriter.replaceOp(serialOp, parOp);
+
+    return success();
+  }
+};
+
+class ACCLegalizeSerial
+    : public mlir::acc::impl::ACCLegalizeSerialBase<ACCLegalizeSerial> {
+public:
+  using ACCLegalizeSerialBase<ACCLegalizeSerial>::ACCLegalizeSerialBase;
+  void runOnOperation() override {
+    func::FuncOp funcOp = getOperation();
+    MLIRContext *context = funcOp.getContext();
+    RewritePatternSet patterns(context);
+    patterns.insert<ACCSerialOpConversion>(context);
+    (void)applyPatternsGreedily(funcOp, std::move(patterns));
+  }
+};
+
+} // namespace
diff --git a/mlir/lib/Dialect/OpenACC/Transforms/CMakeLists.txt b/mlir/lib/Dialect/OpenACC/Transforms/CMakeLists.txt
index 2c6da87c66a11..10a1796972044 100644
--- a/mlir/lib/Dialect/OpenACC/Transforms/CMakeLists.txt
+++ b/mlir/lib/Dialect/OpenACC/Transforms/CMakeLists.txt
@@ -2,6 +2,7 @@ add_mlir_dialect_library(MLIROpenACCTransforms
   ACCImplicitData.cpp
   ACCImplicitDeclare.cpp
   ACCImplicitRoutine.cpp
+  ACCLegalizeSerial.cpp
   LegalizeDataValues.cpp
 
   ADDITIONAL_HEADER_DIRS
diff --git a/mlir/lib/Dialect/SCF/IR/CMakeLists.txt b/mlir/lib/Dialect/SCF/IR/CMakeLists.txt
index 423e1c3e1e042..b111117410ba3 100644
--- a/mlir/lib/Dialect/SCF/IR/CMakeLists.txt
+++ b/mlir/lib/Dialect/SCF/IR/CMakeLists.txt
@@ -19,5 +19,5 @@ add_mlir_dialect_library(MLIRSCFDialect
   MLIRSideEffectInterfaces
   MLIRTensorDialect
   MLIRValueBoundsOpInterface
+  MLIRTransformUtils
   )
-
diff --git a/mlir/lib/Dialect/SCF/IR/SCF.cpp b/mlir/lib/Dialect/SCF/IR/SCF.cpp
index 881e256a8797b..bb07291036667 100644
--- a/mlir/lib/Dialect/SCF/IR/SCF.cpp
+++ b/mlir/lib/Dialect/SCF/IR/SCF.cpp
@@ -26,6 +26,7 @@
 #include "mlir/Interfaces/ParallelCombiningOpInterface.h"
 #include "mlir/Interfaces/ValueBoundsOpInterface.h"
 #include "mlir/Transforms/InliningUtils.h"
+#include "mlir/Transforms/RegionUtils.h"
 #include "llvm/ADT/MapVector.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SmallPtrSet.h"
@@ -3687,6 +3688,133 @@ LogicalResult scf::WhileOp::verify() {
 }
 
 namespace {
+/// Move an scf.if op that is directly before the scf.condition op in the while
+/// before region, and whose condition matches the condition of the
+/// scf.condition op, down into the while after region.
+///
+/// scf.while (..) : (...) -> ... {
+///  %additional_used_values = ...
+///  %cond = ...
+///  ...
+///  %res = scf.if %cond -> (...) {
+///    use(%additional_used_values)
+///    ... // then block
+///    scf.yield %then_value
+///  } else {
+///    scf.yield %else_value
+///  }
+///  scf.condition(%cond) %res, ...
+/// } do {
+/// ^bb0(%res_arg, ...):
+///    use(%res_arg)
+///    ...
+///
+/// becomes
+/// scf.while (..) : (...) -> ... {
+///  %additional_used_values = ...
+///  %cond = ...
+///  ...
+///  scf.condition(%cond) %else_value, ..., %additional_used_values
+/// } do {
+/// ^bb0(%res_arg, ..., %additional_args):
+///    use(%additional_args)
+///    ... // if then block
+///    use(%then_value)
+///    ...
+struct WhileMoveIfDown : public OpRewritePattern<scf::WhileOp> {
+  using OpRewritePattern<scf::WhileOp>::OpRewritePattern;
+
+  LogicalResult matchAndRewrite(scf::WhileOp op,
+                                PatternRewriter &rewriter) const override {
+    auto conditionOp = op.getConditionOp();
+
+    // Only support an ifOp directly before the condition for now. Relaxing
+    // this would require us to:
+    // - check that the body does not have side-effects conflicting with
+    //    operations between the if and the condition.
+    // - check that results of the if operation are only used as arguments to
+    //    the condition.
+    auto ifOp = dyn_cast_or_null<scf::IfOp>(conditionOp->getPrevNode());
+
+    // Check that the ifOp is directly before the conditionOp and that it
+    // matches the condition of the conditionOp. Also ensure that the ifOp has
+    // no else block with content, as that would complicate the transformation.
+    // TODO: support else blocks with content.
+    if (!ifOp || ifOp.getCondition() != conditionOp.getCondition() ||
+        (ifOp.elseBlock() && !ifOp.elseBlock()->without_terminator().empty()))
+      return failure();
+
+    assert((ifOp->use_empty() || (llvm::all_equal(ifOp->getUsers()) &&
+                                  *ifOp->user_begin() == conditionOp)) &&
+           "ifOp has unexpected uses");
+
+    Location loc = op.getLoc();
+
+    // Replace uses of ifOp results in the conditionOp with the yielded values
+    // from the ifOp branches.
+    for (auto [idx, arg] : llvm::enumerate(conditionOp.getArgs())) {
+      auto it = llvm::find(ifOp->getResults(), arg);
+      if (it != ifOp->getResults().end()) {
+        size_t ifOpIdx = it.getIndex();
+        Value thenValue = ifOp.thenYield()->getOperand(ifOpIdx);
+        Value elseValue = ifOp.elseYield()->getOperand(ifOpIdx);
+
+        rewriter.replaceAllUsesWith(ifOp->getResults()[ifOpIdx], elseValue);
+        rewriter.replaceAllUsesWith(op.getAfterArguments()[idx], thenValue);
+      }
+    }
+
+    // Collect additional used values from before region.
+    SetVector<Value> additionalUsedValuesSet;
+    visitUsedValuesDefinedAbove(ifOp.getThenRegion(), [&](OpOperand *operand) {
+      if (&op.getBefore() == operand->get().getParentRegion())
+        additionalUsedValuesSet.insert(operand->get());
+    });
+
+    // Create new whileOp with additional used values as results.
+    auto additionalUsedValues = additionalUsedValuesSet.getArrayRef();
+    auto additionalValueTypes = llvm::map_to_vector(
+        additionalUsedValues, [](Value val) { return val.getType(); });
+    size_t additionalValueSize = additionalUsedValues.size();
+    SmallVector<Type> newResultTypes(op.getResultTypes());
+    newResultTypes.append(additionalValueTypes);
+
+    auto newWhileOp =
+        scf::WhileOp::create(rewriter, loc, newResultTypes, op.getInits());
+
+    rewriter.modifyOpInPlace(newWhileOp, [&] {
+      newWhileOp.getBefore().takeBody(op.getBefore());
+      newWhileOp.getAfter().takeBody(op.getAfter());
+      newWhileOp.getAfter().addArguments(
+          additionalValueTypes,
+          SmallVector<Location>(additionalValueSize, loc));
+    });
+
+    rewriter.modifyOpInPlace(conditionOp, [&] {
+      conditionOp.getArgsMutable().append(additionalUsedValues);
+    });
+
+    // Replace uses of additional used values inside the ifOp then region with
+    // the whileOp after region arguments.
+    rewriter.replaceUsesWithIf(
+        additionalUsedValues,
+        newWhileOp.getAfterArguments().take_back(additionalValueSize),
+        [&](OpOperand &use) {
+          return ifOp.getThenRegion().isAncestor(
+              use.getOwner()->getParentRegion());
+        });
+
+    // Inline ifOp then region into new whileOp after region.
+    rewriter.eraseOp(ifOp.thenYield());
+    rewriter.inlineBlockBefore(ifOp.thenBlock(), newWhileOp.getAfterBody(),
+                               newWhileOp.getAfterBody()->begin());
+    rewriter.eraseOp(ifOp);
+    rewriter.replaceOp(op,
+                       newWhileOp->getResults().drop_back(additionalValueSize));
+    return success();
+  }
+};
+
 /// Replace uses of the condition within the do block with true, since otherwise
 /// the block would not be evaluated.
 ///
@@ -4399,7 +4527,8 @@ void WhileOp::getCanonicalizationPatterns(RewritePatternSet &results,
   results.add<RemoveLoopInvariantArgsFromBeforeBlock,
               RemoveLoopInvariantValueYielded, WhileConditionTruth,
               WhileCmpCond, WhileUnusedResult, WhileRemoveDuplicatedResults,
-              WhileRemoveUnusedArgs, WhileOpAlignBeforeArgs>(context);
+              WhileRemoveUnusedArgs, WhileOpAlignBeforeArgs, WhileMoveIfDown>(
+      context);
 }
 
 //===----------------------------------------------------------------------===//
diff --git a/mlir/lib/Dialect/SCF/Transforms/UpliftWhileToFor.cpp b/mlir/lib/Dialect/SCF/Transforms/UpliftWhileToFor.cpp
index 9f242f9e62b8e..ec1044aaa42ac 100644
--- a/mlir/lib/Dialect/SCF/Transforms/UpliftWhileToFor.cpp
+++ b/mlir/lib/Dialect/SCF/Transforms/UpliftWhileToFor.cpp
@@ -19,83 +19,6 @@
 using namespace mlir;
 
 namespace {
-/// Move an scf.if op that is directly before the scf.condition op in the while
-/// before region, and whose condition matches the condition of the
-/// scf.condition op, down into the while after region.
-///
-/// scf.while (%init) : (...) -> ... {
-///   %cond = ...
-///   %res = scf.if %cond -> (...) {
-///     use1(%init)
-///     %then_val = ...
-///      ... // then block
-///     scf.yield %then_val
-///   } else {
-///     scf.yield %init
-///   }
-///   scf.condition(%cond) %res
-/// } do {
-/// ^bb0(%arg):
-///   use2(%arg)
-///    ...
-///
-/// becomes
-/// scf.while (%init) : (...) -> ... {
-///   %cond = ...
-///   scf.condition(%cond) %init
-/// } do {
-/// ^bb0(%arg): :
-///   use1(%arg)
-///    ... // if then block
-///   %then_val = ...
-///   use2(%then_val)
-///    ...
-struct WhileMoveIfDown : public OpRewritePattern<scf::WhileOp> {
-  using OpRewritePattern<scf::WhileOp>::OpRewritePattern;
-
-  LogicalResult matchAndRewrite(scf::WhileOp op,
-                                PatternRewriter &rewriter) const override {
-    // Check that the first opeation produces one result and that result must
-    // have exactly two uses (these two uses come from the `scf.if` and
-    // `scf.condition` operations).
-    Operation &condOp = op.getBeforeBody()->front();
-    if (condOp.getNumResults() != 1 || !condOp.getResult(0).hasNUses(2))
-      return failure();
-
-    Value condVal = condOp.getResult(0);
-    auto ifOp = dyn_cast<scf::IfOp>(condOp.getNextNode());
-    if (!ifOp || ifOp.getCondition() != condVal)
-      return failure();
-
-    auto term = dyn_cast<scf::ConditionOp>(ifOp->getNextNode());
-    if (!term || term.getCondition() != condVal)
-      return failure();
-
-    // Check that if results and else yield operands match the scf.condition op
-    // arguments and while before arguments respectively.
-    if (!llvm::equal(ifOp->getResults(), term.getArgs()) ||
-        !llvm::equal(ifOp.elseYield()->getOperands(), op.getBeforeArguments()))
-      return failure();
-
-    // Update uses and move the if op into the after region.
-    rewriter.replaceAllUsesWith(op.getAfterArguments(),
-                                ifOp.thenYield()->getOperands());
-    rewriter.replaceUsesWithIf(op.getBeforeArguments(), op.getAfterArguments(),
-                               [&](OpOperand &use) {
-                                 return ifOp.getThenRegion().isAncestor(
-                                     use.getOwner()->getParentRegion());
-                               });
-    rewriter.modifyOpInPlace(
-        term, [&]() { term.getArgsMutable().assign(op.getBeforeArguments()); });
-
-    rewriter.eraseOp(ifOp.thenYield());
-    rewriter.inlineBlockBefore(ifOp.thenBlock(), op.getAfterBody(),
-                               op.getAfterBody()->begin());
-    rewriter.eraseOp(ifOp);
-    return success();
-  }
-};
-
 struct UpliftWhileOp : public OpRewritePattern<scf::WhileOp> {
   using OpRewritePattern::OpRewritePattern;
 
@@ -344,5 +267,5 @@ FailureOr<scf::ForOp> mlir::scf::upliftWhileToForLoop(RewriterBase &rewriter,
 }
 
 void mlir::scf::populateUpliftWhileToForPatterns(RewritePatternSet &patterns) {
-  patterns.add<UpliftWhileOp, WhileMoveIfDown>(patterns.getContext());
+  patterns.add<UpliftWhileOp>(patterns.getContext());
 }
diff --git a/mlir/lib/Dialect/Tensor/Transforms/BufferizableOpInterfaceImpl.cpp b/mlir/lib/Dialect/Tensor/Transforms/BufferizableOpInterfaceImpl.cpp
index c607ece418dff..310e72587eb81 100644
--- a/mlir/lib/Dialect/Tensor/Transforms/BufferizableOpInterfaceImpl.cpp
+++ b/mlir/lib/Dialect/Tensor/Transforms/BufferizableOpInterfaceImpl.cpp
@@ -1132,35 +1132,22 @@ struct ConcatOpInterface
 
     // Extract the dimension for the concat op
     uint64_t concatDim = concatOp.getDim();
-    bool dynamicConcatDim = false;
 
     SmallVector<OpFoldResult> offsets(tensorType.getRank(),
                                       rewriter.getIndexAttr(0));
     SmallVector<OpFoldResult> strides(tensorType.getRank(),
                                       rewriter.getIndexAttr(1));
-    SmallVector<OpFoldResult> sizes;
-
-    for (const auto &[dimIdx, dimSize] :
-         llvm::enumerate(tensorType.getShape())) {
-      if (dimSize == ShapedType::kDynamic) {
-        auto dimOp = memref::DimOp::create(rewriter, loc, dstBuffer, dimIdx);
-        sizes.push_back(dimOp.getResult());
-        if (dimIdx == concatDim)
-          dynamicConcatDim = true;
-      } else {
-        sizes.push_back(rewriter.getIndexAttr(dimSize));
-      }
-    }
-
-    int64_t concatDimOffset = 0;
-    std::optional<Value> dynamicOffset;
-    std::optional<Value> dynamicSize;
-    if (dynamicConcatDim) {
-      // One or more operands have dynamic size, so we must accumulate the
-      // offset with arith ops.
-      dynamicOffset = arith::ConstantIndexOp::create(rewriter, loc, 0);
-    }
+    SmallVector<OpFoldResult> sizes =
+        memref::getMixedSizes(rewriter, loc, dstBuffer);
+
+    AffineExpr s0, s1;
+    bindSymbols(rewriter.getContext(), s0, s1);
+    auto sum = [&](OpFoldResult v1, OpFoldResult v2) {
+      return affine::makeComposedFoldedAffineApply(rewriter, loc, s0 + s1,
+                                                   {v1, v2});
+    };
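+    // `sum` composes two mixed attribute-or-value operands into one folded
+    // result: if both are static it folds to an index attribute, otherwise an
+    // affine.apply is materialized.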
 
+    OpFoldResult concatDimOffset = rewriter.getIndexAttr(0);
     for (auto operand : concatOp.getInputs()) {
       // Get the buffer for the operand.
       FailureOr<Value> srcBuffer = getBuffer(rewriter, operand, options, state);
@@ -1171,18 +1158,10 @@ struct ConcatOpInterface
       // so the offset on that axis must accumulate through the loop, and the
       // size must change to the size of the current operand.
       auto operandTensorType = cast<RankedTensorType>(operand.getType());
-      int64_t operandConcatDimSize = operandTensorType.getDimSize(concatDim);
-
-      if (dynamicConcatDim) {
-        offsets[concatDim] = dynamicOffset.value();
-        dynamicSize =
-            memref::DimOp::create(rewriter, loc, *srcBuffer, concatDim)
-                .getResult();
-        sizes[concatDim] = dynamicSize.value();
-      } else {
-        sizes[concatDim] = rewriter.getIndexAttr(operandConcatDimSize);
-        offsets[concatDim] = rewriter.getIndexAttr(concatDimOffset);
-      }
+      offsets[concatDim] = concatDimOffset;
+      OpFoldResult concatDimSize =
+          memref::getMixedSize(rewriter, loc, *srcBuffer, concatDim);
+      sizes[concatDim] = concatDimSize;
 
       // Create a subview of the destination buffer.
       auto dstMemrefType = cast<MemRefType>(memrefType);
@@ -1197,12 +1176,7 @@ struct ConcatOpInterface
       if (failed(options.createMemCpy(rewriter, loc, *srcBuffer, subview)))
         return failure();
 
-      if (dynamicConcatDim) {
-        dynamicOffset = arith::AddIOp::create(
-            rewriter, loc, dynamicOffset.value(), dynamicSize.value());
-      } else {
-        concatDimOffset += operandConcatDimSize;
-      }
+      concatDimOffset = sum(concatDimOffset, concatDimSize);
     }
 
     replaceOpWithBufferizedValues(rewriter, op, dstBuffer);
diff --git a/mlir/lib/Dialect/XeGPU/IR/XeGPUDialect.cpp b/mlir/lib/Dialect/XeGPU/IR/XeGPUDialect.cpp
index fb5d1e758dbd1..7ab2e612ed890 100644
--- a/mlir/lib/Dialect/XeGPU/IR/XeGPUDialect.cpp
+++ b/mlir/lib/Dialect/XeGPU/IR/XeGPUDialect.cpp
@@ -8,7 +8,6 @@
 
 #include "mlir/Dialect/Affine/Utils.h"
 #include "mlir/Dialect/Arith/Utils/Utils.h"
-#include "mlir/Dialect/Index/IR/IndexOps.h"
 #include "mlir/Dialect/Utils/IndexingUtils.h"
 #include "mlir/Dialect/XeGPU/IR/XeGPU.h"
 #include "mlir/Dialect/XeGPU/uArch/IntelGpuXe2.h"
@@ -61,7 +60,7 @@ genCoordinates(OpBuilder &builder, Location loc,
   // Get the offset of `subShape` within a distribution unit.
   SmallVector<Value> distUnitLocalOffset = llvm::map_to_vector(
       llvm::zip(delinearizedId, subShape), [&](const auto &t) -> Value {
-        return builder.createOrFold<index::MulOp>(
+        return builder.createOrFold<arith::MulIOp>(
             loc, std::get<0>(t),
             builder.createOrFold<arith::ConstantIndexOp>(loc, std::get<1>(t)));
       });
@@ -84,7 +83,7 @@ genCoordinates(OpBuilder &builder, Location loc,
     // Do not go beyond `srcShape` bounds.
     SmallVector<Value> mods = llvm::map_to_vector(
         llvm::zip_equal(adds, srcShape), [&](const auto &t) -> Value {
-          return builder.createOrFold<index::RemUOp>(
+          return builder.createOrFold<arith::RemUIOp>(
               loc, std::get<0>(t),
               arith::ConstantIndexOp::create(builder, loc, std::get<1>(t)));
         });
@@ -343,7 +342,7 @@ LayoutAttr::delinearizeId(OpBuilder &builder, Location loc, Value linearId) {
     /// e.g., linearId=22, dimSize=4: 22 % 4 = 2 (we're at position 2 within
     /// this dimension)
     result[dimIdx] =
-        builder.createOrFold<index::RemUOp>(loc, remaining, dimSizeVal);
+        builder.createOrFold<arith::RemUIOp>(loc, remaining, dimSizeVal);
 
     /// Update remaining for the next dimension by removing what we've already
     /// processed. Division tells us "how many complete groups of this dimension
@@ -352,7 +351,7 @@ LayoutAttr::delinearizeId(OpBuilder &builder, Location loc, Value linearId) {
     /// no next dimension to process
     if (i < order.size() - 1) {
       remaining =
-          builder.createOrFold<index::DivUOp>(loc, remaining, dimSizeVal);
+          builder.createOrFold<arith::DivUIOp>(loc, remaining, dimSizeVal);
     }
   }
   return result;
diff --git a/mlir/lib/Dialect/XeGPU/Transforms/XeGPUPropagateLayout.cpp b/mlir/lib/Dialect/XeGPU/Transforms/XeGPUPropagateLayout.cpp
index f2b0e71c9397f..59a1ad9dbe189 100644
--- a/mlir/lib/Dialect/XeGPU/Transforms/XeGPUPropagateLayout.cpp
+++ b/mlir/lib/Dialect/XeGPU/Transforms/XeGPUPropagateLayout.cpp
@@ -517,8 +517,7 @@ void LayoutInfoPropagation::visitPrefetchNdOp(
     auto [bWidth, bHeight, bCount] = blockWHC.value();
     SmallVector<int> instData;
     int instWidth = xegpu::getLargestDivisor(
-        static_cast<int>(tdescTy.getDimSize(tdescTy.getRank() - 1)), bWidth,
-        bCount);
+        static_cast<int>(tdescTy.getDimSize(tdescTy.getRank() - 1)), bWidth);
     if (instWidth == -1)
       prefetch.emitWarning(
           "No suitable instruction multiple found for the given shape.");
@@ -759,8 +758,7 @@ void LayoutInfoPropagation::visitStoreNdOp(
     auto [bWidth, bHeight, bCount] = blockWHC.value();
     SmallVector<int> instData;
     int instWidth = xegpu::getLargestDivisor(
-        static_cast<int>(dataTy.getDimSize(dataTy.getRank() - 1)), bWidth,
-        bCount);
+        static_cast<int>(dataTy.getDimSize(dataTy.getRank() - 1)), bWidth);
     if (instWidth == -1)
       store.emitWarning(
           "No suitable instruction multiple found for the given shape.");
diff --git a/mlir/lib/Dialect/XeGPU/Utils/XeGPUUtils.cpp b/mlir/lib/Dialect/XeGPU/Utils/XeGPUUtils.cpp
index 91432b1c11304..9f126fe8c2415 100644
--- a/mlir/lib/Dialect/XeGPU/Utils/XeGPUUtils.cpp
+++ b/mlir/lib/Dialect/XeGPU/Utils/XeGPUUtils.cpp
@@ -12,7 +12,6 @@
 
 #include "mlir/Dialect/XeGPU/Utils/XeGPUUtils.h"
 #include "mlir/Dialect/GPU/IR/GPUDialect.h"
-#include "mlir/Dialect/Index/IR/IndexOps.h"
 #include "mlir/Dialect/LLVMIR/XeVMDialect.h"
 #include "mlir/Dialect/SCF/Transforms/Patterns.h"
 #include "mlir/Dialect/Utils/IndexingUtils.h"
@@ -527,7 +526,7 @@ SmallVector<OpFoldResult> xegpu::addElementwise(OpBuilder &builder,
   for (auto [l, r] : llvm::zip_equal(lhs, rhs)) {
     auto lval = getValueOrCreateConstantIndexOp(builder, loc, l);
     auto rval = getValueOrCreateConstantIndexOp(builder, loc, r);
-    results.push_back(builder.createOrFold<index::AddOp>(loc, lval, rval));
+    results.push_back(builder.createOrFold<arith::AddIOp>(loc, lval, rval));
   }
   return results;
 }
diff --git a/mlir/lib/ExecutionEngine/APFloatWrappers.cpp b/mlir/lib/ExecutionEngine/APFloatWrappers.cpp
index 44980ccd77491..f3e38eb8ffa2d 100644
--- a/mlir/lib/ExecutionEngine/APFloatWrappers.cpp
+++ b/mlir/lib/ExecutionEngine/APFloatWrappers.cpp
@@ -131,4 +131,45 @@ MLIR_APFLOAT_WRAPPERS_EXPORT uint64_t _mlir_apfloat_convert_from_int(
                           llvm::RoundingMode::NearestTiesToEven);
   return result.bitcastToAPInt().getZExtValue();
 }
+
+MLIR_APFLOAT_WRAPPERS_EXPORT int8_t _mlir_apfloat_compare(int32_t semantics,
+                                                          uint64_t a,
+                                                          uint64_t b) {
+  const llvm::fltSemantics &sem = llvm::APFloatBase::EnumToSemantics(
+      static_cast<llvm::APFloatBase::Semantics>(semantics));
+  unsigned bitWidth = llvm::APFloatBase::semanticsSizeInBits(sem);
+  llvm::APFloat x(sem, llvm::APInt(bitWidth, a));
+  llvm::APFloat y(sem, llvm::APInt(bitWidth, b));
+  return static_cast<int8_t>(x.compare(y));
+}
+
+MLIR_APFLOAT_WRAPPERS_EXPORT uint64_t _mlir_apfloat_neg(int32_t semantics,
+                                                         uint64_t a) {
+  const llvm::fltSemantics &sem = llvm::APFloatBase::EnumToSemantics(
+      static_cast<llvm::APFloatBase::Semantics>(semantics));
+  unsigned bitWidth = llvm::APFloatBase::semanticsSizeInBits(sem);
+  llvm::APFloat x(sem, llvm::APInt(bitWidth, a));
+  x.changeSign();
+  return x.bitcastToAPInt().getZExtValue();
+}
+
+/// Min/max operations.
+#define APFLOAT_MIN_MAX_OP(OP)                                                 \
+  MLIR_APFLOAT_WRAPPERS_EXPORT uint64_t _mlir_apfloat_##OP(                    \
+      int32_t semantics, uint64_t a, uint64_t b) {                             \
+    const llvm::fltSemantics &sem = llvm::APFloatBase::EnumToSemantics(        \
+        static_cast<llvm::APFloatBase::Semantics>(semantics));                 \
+    unsigned bitWidth = llvm::APFloatBase::semanticsSizeInBits(sem);           \
+    llvm::APFloat lhs(sem, llvm::APInt(bitWidth, a));                          \
+    llvm::APFloat rhs(sem, llvm::APInt(bitWidth, b));                          \
+    llvm::APFloat result = llvm::OP(lhs, rhs);                                 \
+    return result.bitcastToAPInt().getZExtValue();                             \
+  }
+
+APFLOAT_MIN_MAX_OP(minimum)
+APFLOAT_MIN_MAX_OP(maximum)
+APFLOAT_MIN_MAX_OP(minnum)
+APFLOAT_MIN_MAX_OP(maxnum)
+
+#undef APFLOAT_MIN_MAX_OP
 }
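
A rough host-side sketch of how these wrappers are meant to be called: the float payload travels as its zero-extended bit pattern together with an llvm::APFloatBase::Semantics enum value (2 is S_IEEEsingle, matching the constants in the lowering tests further down). Illustrative only; the real call sites are generated by the arith-to-apfloat conversion.

#include <bit>
#include <cstdint>
#include <iostream>

// Declarations matching the wrappers above; link against the wrappers library.
extern "C" uint64_t _mlir_apfloat_neg(int32_t semantics, uint64_t a);
extern "C" uint64_t _mlir_apfloat_minnum(int32_t semantics, uint64_t a,
                                         uint64_t b);

int main() {
  constexpr int32_t kIEEESingle = 2; // APFloatBase::Semantics::S_IEEEsingle
  uint64_t bits = std::bit_cast<uint32_t>(1.5f);
  uint64_t neg = _mlir_apfloat_neg(kIEEESingle, bits);
  std::cout << std::bit_cast<float>(static_cast<uint32_t>(neg)) << '\n'; // -1.5

  uint64_t a = std::bit_cast<uint32_t>(2.0f);
  uint64_t b = std::bit_cast<uint32_t>(-3.0f);
  uint64_t m = _mlir_apfloat_minnum(kIEEESingle, a, b);
  std::cout << std::bit_cast<float>(static_cast<uint32_t>(m)) << '\n'; // -3
}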
diff --git a/mlir/lib/Target/Cpp/TranslateToCpp.cpp b/mlir/lib/Target/Cpp/TranslateToCpp.cpp
index bcce0a5145221..a37ea5e87a316 100644
--- a/mlir/lib/Target/Cpp/TranslateToCpp.cpp
+++ b/mlir/lib/Target/Cpp/TranslateToCpp.cpp
@@ -70,6 +70,7 @@ static inline LogicalResult interleaveCommaWithError(const Container &c,
 /// imply higher precedence.
 static FailureOr<int> getOperatorPrecedence(Operation *operation) {
   return llvm::TypeSwitch<Operation *, FailureOr<int>>(operation)
+      .Case<emitc::AddressOfOp>([&](auto op) { return 15; })
       .Case<emitc::AddOp>([&](auto op) { return 12; })
       .Case<emitc::ApplyOp>([&](auto op) { return 15; })
       .Case<emitc::BitwiseAndOp>([&](auto op) { return 7; })
@@ -396,6 +397,15 @@ static bool shouldBeInlined(ExpressionOp expressionOp) {
   return false;
 }
 
+static LogicalResult printOperation(CppEmitter &emitter,
+                                    emitc::DereferenceOp dereferenceOp) {
+  std::string out;
+  llvm::raw_string_ostream ss(out);
+  ss << "*" << emitter.getOrCreateName(dereferenceOp.getPointer());
+  emitter.cacheDeferredOpResult(dereferenceOp.getResult(), out);
+  return success();
+}
+
 static LogicalResult printOperation(CppEmitter &emitter,
                                     emitc::GetFieldOp getFieldOp) {
   emitter.cacheDeferredOpResult(getFieldOp.getResult(),
@@ -479,6 +489,17 @@ static LogicalResult printConstantOp(CppEmitter &emitter, Operation *operation,
   return emitter.emitAttribute(operation->getLoc(), value);
 }
 
+static LogicalResult printOperation(CppEmitter &emitter,
+                                    emitc::AddressOfOp addressOfOp) {
+  raw_ostream &os = emitter.ostream();
+  Operation &op = *addressOfOp.getOperation();
+
+  if (failed(emitter.emitAssignPrefix(op)))
+    return failure();
+  os << "&";
+  return emitter.emitOperand(addressOfOp.getReference());
+}
+
 static LogicalResult printOperation(CppEmitter &emitter,
                                     emitc::ConstantOp constantOp) {
   Operation *operation = constantOp.getOperation();
@@ -1768,14 +1789,14 @@ LogicalResult CppEmitter::emitOperation(Operation &op, bool trailingSemicolon) {
           .Case<cf::BranchOp, cf::CondBranchOp>(
               [&](auto op) { return printOperation(*this, op); })
           // EmitC ops.
-          .Case<emitc::AddOp, emitc::ApplyOp, emitc::AssignOp,
-                emitc::BitwiseAndOp, emitc::BitwiseLeftShiftOp,
+          .Case<emitc::AddressOfOp, emitc::AddOp, emitc::ApplyOp,
+                emitc::AssignOp, emitc::BitwiseAndOp, emitc::BitwiseLeftShiftOp,
                 emitc::BitwiseNotOp, emitc::BitwiseOrOp,
                 emitc::BitwiseRightShiftOp, emitc::BitwiseXorOp, emitc::CallOp,
                 emitc::CallOpaqueOp, emitc::CastOp, emitc::ClassOp,
                 emitc::CmpOp, emitc::ConditionalOp, emitc::ConstantOp,
-                emitc::DeclareFuncOp, emitc::DivOp, emitc::DoOp,
-                emitc::ExpressionOp, emitc::FieldOp, emitc::FileOp,
+                emitc::DeclareFuncOp, emitc::DereferenceOp, emitc::DivOp,
+                emitc::DoOp, emitc::ExpressionOp, emitc::FieldOp, emitc::FileOp,
                 emitc::ForOp, emitc::FuncOp, emitc::GetFieldOp,
                 emitc::GetGlobalOp, emitc::GlobalOp, emitc::IfOp,
                 emitc::IncludeOp, emitc::LiteralOp, emitc::LoadOp,
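
For reference, the C++ the two new printers are expected to produce is just the plain & and * operators: emitc.address_of is printed as an assigned `&` expression (with ApplyOp-like precedence 15), while emitc.dereference is cached as a deferred `*ptr` expression and spliced into its uses. A rough sketch with illustrative names, not actual emitter output:

#include <cstdint>
#include <cstdio>

int main() {
  int32_t value = 42;
  int32_t *ptr = &value; // what emitc.address_of on an lvalue prints as
  *ptr = 7;              // what a use of emitc.dereference prints as
  std::printf("%d\n", value); // prints 7
}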
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index 0d5b553c8e652..869bde69d5cdc 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -194,8 +194,8 @@ class LinearClauseProcessor {
           builder.CreateLoad(linearOrigVars[index]->getAllocatedType(),
 
                              linearPreconditionVars[index]);
-      auto mulInst = builder.CreateMul(loopInductionVar, linearSteps[index]);
-      auto addInst = builder.CreateAdd(linearVarStart, mulInst);
+      auto *mulInst = builder.CreateMul(loopInductionVar, linearSteps[index]);
+      auto *addInst = builder.CreateAdd(linearVarStart, mulInst);
       builder.CreateStore(addInst, linearLoopBodyTemps[index]);
     }
   }
@@ -1968,7 +1968,7 @@ static bool teamsReductionContainedInDistribute(omp::TeamsOp teamsOp) {
   // If we are going to use distribute reduction then remove any debug uses of
   // the reduction parameters in teamsOp. Otherwise they will be left without
   // any mapped value in moduleTranslation and will eventually error out.
-  for (auto use : debugUses)
+  for (auto *use : debugUses)
     use->erase();
   return true;
 }
@@ -2729,6 +2729,7 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
   ArrayRef<bool> isByRef = getIsByRef(opInst.getReductionByref());
   assert(isByRef.size() == opInst.getNumReductionVars());
   llvm::OpenMPIRBuilder *ompBuilder = moduleTranslation.getOpenMPBuilder();
+  bool isCancellable = constructIsCancellable(opInst);
 
   if (failed(checkImplementationStatus(*opInst)))
     return failure();
@@ -2867,6 +2868,18 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
                                   privateVarsInfo.privatizers)))
       return llvm::make_error<PreviouslyReportedError>();
 
+    // If we could be performing cancellation, add the cancellation barrier on
+    // the way out of the outlined region.
+    if (isCancellable) {
+      auto IPOrErr = ompBuilder->createBarrier(
+          llvm::OpenMPIRBuilder::LocationDescription(builder),
+          llvm::omp::Directive::OMPD_unknown,
+          /* ForceSimpleCall */ false,
+          /* CheckCancelFlag */ false);
+      if (!IPOrErr)
+        return IPOrErr.takeError();
+    }
+
     builder.restoreIP(oldIP);
     return llvm::Error::success();
   };
@@ -2880,7 +2893,6 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
   auto pbKind = llvm::omp::OMP_PROC_BIND_default;
   if (auto bind = opInst.getProcBindKind())
     pbKind = getProcBindKind(*bind);
-  bool isCancellable = constructIsCancellable(opInst);
 
   llvm::OpenMPIRBuilder::InsertPointTy allocaIP =
       findAllocaInsertPoint(builder, moduleTranslation);
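
The reason isCancellable is now computed up front is that the body-generation callback needs it: when one thread runs `#pragma omp cancel parallel`, every thread in the team has to observe the cancellation before leaving the outlined region, which is what the barrier emitted above provides. A minimal cancellable region of the kind this targets (standard OpenMP C++, illustrative):

#include <cstdio>
#include <omp.h>

int main() {
  // Run with OMP_CANCELLATION=true for the cancel request to take effect.
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0) {
      #pragma omp cancel parallel // request cancellation of the region
    }
    // Other threads pick up the cancellation here and branch to the end of
    // the region, where the cancellation barrier synchronizes the team.
    #pragma omp cancellation point parallel
    std::printf("thread %d was not cancelled\n", omp_get_thread_num());
  }
  return 0;
}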
diff --git a/mlir/lib/Target/SPIRV/Deserialization/Deserializer.cpp b/mlir/lib/Target/SPIRV/Deserialization/Deserializer.cpp
index 252be796488c5..50883d9ed5e75 100644
--- a/mlir/lib/Target/SPIRV/Deserialization/Deserializer.cpp
+++ b/mlir/lib/Target/SPIRV/Deserialization/Deserializer.cpp
@@ -346,6 +346,7 @@ LogicalResult spirv::Deserializer::processDecoration(ArrayRef<uint32_t> words) {
   case spirv::Decoration::Constant:
   case spirv::Decoration::Invariant:
   case spirv::Decoration::Patch:
+  case spirv::Decoration::Coherent:
     if (words.size() != 2) {
       return emitError(unknownLoc, "OpDecoration with ")
              << decorationName << "needs a single target <id>";
@@ -2831,6 +2832,23 @@ LogicalResult spirv::Deserializer::wireUpBlockArgument() {
             branchCondOp.getFalseBlock());
 
       branchCondOp.erase();
+    } else if (auto switchOp = dyn_cast<spirv::SwitchOp>(op)) {
+      if (target == switchOp.getDefaultTarget()) {
+        SmallVector<ValueRange> targetOperands(switchOp.getTargetOperands());
+        DenseIntElementsAttr literals =
+            switchOp.getLiterals().value_or(DenseIntElementsAttr());
+        spirv::SwitchOp::create(
+            opBuilder, switchOp.getLoc(), switchOp.getSelector(),
+            switchOp.getDefaultTarget(), blockArgs, literals,
+            switchOp.getTargets(), targetOperands);
+        switchOp.erase();
+      } else {
+        SuccessorRange targets = switchOp.getTargets();
+        auto it = llvm::find(targets, target);
+        assert(it != targets.end());
+        size_t index = std::distance(targets.begin(), it);
+        switchOp.getTargetOperandsMutable(index).assign(blockArgs);
+      }
     } else {
       return emitError(unknownLoc, "unimplemented terminator for Phi creation");
     }
@@ -2851,7 +2869,7 @@ LogicalResult spirv::Deserializer::wireUpBlockArgument() {
   return success();
 }
 
-LogicalResult spirv::Deserializer::splitConditionalBlocks() {
+LogicalResult spirv::Deserializer::splitSelectionHeader() {
   // Create a copy, so we can modify keys in the original.
   BlockMergeInfoMap blockMergeInfoCopy = blockMergeInfo;
   for (auto it = blockMergeInfoCopy.begin(), e = blockMergeInfoCopy.end();
@@ -2868,7 +2886,7 @@ LogicalResult spirv::Deserializer::splitConditionalBlocks() {
     Operation *terminator = block->getTerminator();
     assert(terminator);
 
-    if (!isa<spirv::BranchConditionalOp>(terminator))
+    if (!isa<spirv::BranchConditionalOp, spirv::SwitchOp>(terminator))
       continue;
 
     // Check if the current header block is a merge block of another construct.
@@ -2878,10 +2896,10 @@ LogicalResult spirv::Deserializer::splitConditionalBlocks() {
         splitHeaderMergeBlock = true;
     }
 
-    // Do not split a block that only contains a conditional branch, unless it
-    // is also a merge block of another construct - in that case we want to
-    // split the block. We do not want two constructs to share header / merge
-    // block.
+    // Do not split a block that only contains a conditional branch / switch,
+    // unless it is also a merge block of another construct - in that case we
+    // want to split the block. We do not want two constructs to share header /
+    // merge block.
     if (!llvm::hasSingleElement(*block) || splitHeaderMergeBlock) {
       Block *newBlock = block->splitBlock(terminator);
       OpBuilder builder(block, block->end());
@@ -2919,13 +2937,10 @@ LogicalResult spirv::Deserializer::structurizeControlFlow() {
     logger.startLine() << "\n";
   });
 
-  if (failed(splitConditionalBlocks())) {
+  if (failed(splitSelectionHeader())) {
     return failure();
   }
 
-  // TODO: This loop is non-deterministic. Iteration order may vary between runs
-  // for the same shader as the key to the map is a pointer. See:
-  // https://github.com/llvm/llvm-project/issues/128547
   while (!blockMergeInfo.empty()) {
     Block *headerBlock = blockMergeInfo.begin()->first;
     BlockMergeInfo mergeInfo = blockMergeInfo.begin()->second;
diff --git a/mlir/lib/Target/SPIRV/Deserialization/Deserializer.h b/mlir/lib/Target/SPIRV/Deserialization/Deserializer.h
index 243e6fd70ae43..50c935036158c 100644
--- a/mlir/lib/Target/SPIRV/Deserialization/Deserializer.h
+++ b/mlir/lib/Target/SPIRV/Deserialization/Deserializer.h
@@ -58,7 +58,9 @@ struct DebugLine {
 };
 
 /// Map from a selection/loop's header block to its merge (and continue) target.
-using BlockMergeInfoMap = DenseMap<Block *, BlockMergeInfo>;
+/// Use `MapVector<>` to ensure a deterministic iteration order with a pointer
+/// key.
+using BlockMergeInfoMap = llvm::MapVector<Block *, BlockMergeInfo>;
 
 /// A "deferred struct type" is a struct type with one or more member types not
 /// known when the Deserializer first encounters the struct. This happens, for
@@ -278,11 +280,11 @@ class Deserializer {
     return opBuilder.getStringAttr(attrName);
   }
 
-  /// Move a conditional branch into a separate basic block to avoid unnecessary
-  /// sinking of defs that may be required outside a selection region. This
-  /// function also ensures that a single block cannot be a header block of one
-  /// selection construct and the merge block of another.
-  LogicalResult splitConditionalBlocks();
+  /// Move a conditional branch or a switch into a separate basic block to avoid
+  /// unnecessary sinking of defs that may be required outside a selection
+  /// region. This function also ensures that a single block cannot be a header
+  /// block of one selection construct and the merge block of another.
+  LogicalResult splitSelectionHeader();
 
   //===--------------------------------------------------------------------===//
   // Type
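
The MapVector switch is what allows removing the TODO about non-determinism in Deserializer.cpp: with Block * keys, DenseMap iteration order can change from run to run, while MapVector iterates in insertion order. A small sketch of the difference, with illustrative stand-in keys:

#include "llvm/ADT/MapVector.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  int headers[3]; // stand-ins for Block * header blocks
  llvm::MapVector<int *, int> blockMergeInfo;
  blockMergeInfo[&headers[2]] = 2;
  blockMergeInfo[&headers[0]] = 0;
  blockMergeInfo[&headers[1]] = 1;
  // Always visits 2 0 1 -- the order the headers were recorded in --
  // regardless of where `headers` happens to be allocated.
  for (auto &entry : blockMergeInfo)
    llvm::outs() << entry.second << ' ';
  llvm::outs() << '\n';
}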
diff --git a/mlir/lib/Target/SPIRV/Serialization/Serializer.cpp b/mlir/lib/Target/SPIRV/Serialization/Serializer.cpp
index 4e03a809bd0bc..c879a2b3e0207 100644
--- a/mlir/lib/Target/SPIRV/Serialization/Serializer.cpp
+++ b/mlir/lib/Target/SPIRV/Serialization/Serializer.cpp
@@ -373,6 +373,7 @@ LogicalResult Serializer::processDecorationAttr(Location loc, uint32_t resultID,
   case spirv::Decoration::Block:
   case spirv::Decoration::Invariant:
   case spirv::Decoration::Patch:
+  case spirv::Decoration::Coherent:
     // For unit attributes and decoration attributes, the args list
     // has no values so we do nothing.
     if (isa<UnitAttr, DecorationAttr>(attr))
@@ -1443,7 +1444,20 @@ LogicalResult Serializer::emitPhiForBlockArguments(Block *block) {
         assert(branchCondOp.getFalseTarget() == block);
         blockOperands = branchCondOp.getFalseTargetOperands();
       }
-
+      assert(!blockOperands->empty() &&
+             "expected non-empty block operand range");
+      predecessors.emplace_back(spirvPredecessor, *blockOperands);
+    } else if (auto switchOp = dyn_cast<spirv::SwitchOp>(terminator)) {
+      std::optional<OperandRange> blockOperands;
+      if (block == switchOp.getDefaultTarget()) {
+        blockOperands = switchOp.getDefaultOperands();
+      } else {
+        SuccessorRange targets = switchOp.getTargets();
+        auto it = llvm::find(targets, block);
+        assert(it != targets.end());
+        size_t index = std::distance(targets.begin(), it);
+        blockOperands = switchOp.getTargetOperands(index);
+      }
       assert(!blockOperands->empty() &&
              "expected non-empty block operand range");
       predecessors.emplace_back(spirvPredecessor, *blockOperands);
diff --git a/mlir/lib/Transforms/CMakeLists.txt b/mlir/lib/Transforms/CMakeLists.txt
index 54b67f5c7a91e..06161293e907f 100644
--- a/mlir/lib/Transforms/CMakeLists.txt
+++ b/mlir/lib/Transforms/CMakeLists.txt
@@ -39,4 +39,5 @@ add_mlir_library(MLIRTransforms
   MLIRSideEffectInterfaces
   MLIRSupport
   MLIRTransformUtils
+  MLIRUBDialect
   )
diff --git a/mlir/lib/Transforms/RemoveDeadValues.cpp b/mlir/lib/Transforms/RemoveDeadValues.cpp
index 989c614ef6617..e9ced064c9884 100644
--- a/mlir/lib/Transforms/RemoveDeadValues.cpp
+++ b/mlir/lib/Transforms/RemoveDeadValues.cpp
@@ -33,6 +33,7 @@
 
 #include "mlir/Analysis/DataFlow/DeadCodeAnalysis.h"
 #include "mlir/Analysis/DataFlow/LivenessAnalysis.h"
+#include "mlir/Dialect/UB/IR/UBOps.h"
 #include "mlir/IR/Builders.h"
 #include "mlir/IR/BuiltinAttributes.h"
 #include "mlir/IR/Dialect.h"
@@ -260,6 +261,22 @@ static SmallVector<OpOperand *> operandsToOpOperands(OperandRange operands) {
 static void processSimpleOp(Operation *op, RunLivenessAnalysis &la,
                             DenseSet<Value> &nonLiveSet,
                             RDVFinalCleanupList &cl) {
+  // Operations that have dead operands can be erased regardless of their
+  // side effects. The liveness analysis would not have marked an SSA value as
+  // "dead" if it had a side-effecting user that is reachable.
+  bool hasDeadOperand =
+      markLives(op->getOperands(), nonLiveSet, la).flip().any();
+  if (hasDeadOperand) {
+    LDBG() << "Simple op has dead operands, so the op must be dead: "
+           << OpWithFlags(op, OpPrintingFlags().skipRegions());
+    assert(!hasLive(op->getResults(), nonLiveSet, la) &&
+           "expected the op to have no live results");
+    cl.operations.push_back(op);
+    collectNonLiveValues(nonLiveSet, op->getResults(),
+                         BitVector(op->getNumResults(), true));
+    return;
+  }
+
   if (!isMemoryEffectFree(op) || hasLive(op->getResults(), nonLiveSet, la)) {
     LDBG() << "Simple op is not memory effect free or has live results, "
               "preserving it: "
@@ -361,6 +378,8 @@ static void processFuncOp(FunctionOpInterface funcOp, Operation *module,
   // block other than the entry block, because every block has a terminator.
   for (Block &block : funcOp.getBlocks()) {
     Operation *returnOp = block.getTerminator();
+    if (!returnOp->hasTrait<OpTrait::ReturnLike>())
+      continue;
     if (returnOp && returnOp->getNumOperands() == numReturns)
       cl.operands.push_back({returnOp, nonLiveRets});
   }
@@ -700,7 +719,11 @@ static void processRegionBranchOp(RegionBranchOpInterface regionBranchOp,
 }
 
 /// Steps to process a `BranchOpInterface` operation:
-/// Iterate through each successor block of `branchOp`.
+///
+/// When a non-forwarded operand is dead (e.g., the condition value of a
+/// conditional branch op), the entire operation is dead.
+///
+/// Otherwise, iterate through each successor block of `branchOp`.
 /// (1) For each successor block, gather all operands from all successors.
 /// (2) Fetch their associated liveness analysis data and collect for future
 ///     removal.
@@ -711,7 +734,22 @@ static void processBranchOp(BranchOpInterface branchOp, RunLivenessAnalysis &la,
                             DenseSet<Value> &nonLiveSet,
                             RDVFinalCleanupList &cl) {
   LDBG() << "Processing branch op: " << *branchOp;
+
+  // Check for dead non-forwarded operands.
+  BitVector deadNonForwardedOperands =
+      markLives(branchOp->getOperands(), nonLiveSet, la).flip();
   unsigned numSuccessors = branchOp->getNumSuccessors();
+  for (unsigned succIdx = 0; succIdx < numSuccessors; ++succIdx) {
+    SuccessorOperands successorOperands =
+        branchOp.getSuccessorOperands(succIdx);
+    // Remove all non-forwarded operands from the bit vector.
+    for (OpOperand &opOperand : successorOperands.getMutableForwardedOperands())
+      deadNonForwardedOperands[opOperand.getOperandNumber()] = false;
+  }
+  if (deadNonForwardedOperands.any()) {
+    cl.operations.push_back(branchOp.getOperation());
+    return;
+  }
 
   for (unsigned succIdx = 0; succIdx < numSuccessors; ++succIdx) {
     Block *successorBlock = branchOp->getSuccessor(succIdx);
@@ -786,9 +824,14 @@ static void cleanUpDeadVals(RDVFinalCleanupList &list) {
 
   // 3. Operations
   LDBG() << "Cleaning up " << list.operations.size() << " operations";
-  for (auto &op : list.operations) {
+  for (Operation *op : list.operations) {
     LDBG() << "Erasing operation: "
            << OpWithFlags(op, OpPrintingFlags().skipRegions());
+    if (op->hasTrait<OpTrait::IsTerminator>()) {
+      // When erasing a terminator, insert an unreachable op in its place.
+      OpBuilder b(op);
+      ub::UnreachableOp::create(b, op->getLoc());
+    }
     op->dropAllUses();
     op->erase();
   }
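
Two ideas in this file are worth spelling out. First, the new early exits in processSimpleOp and processBranchOp treat any operand the liveness analysis marked dead as proof that the op itself is dead, since liveness never marks a value dead while a reachable side-effecting user exists. Second, erasing a terminator now leaves an ub.unreachable behind so the block stays well formed. A sketch of the dead-operand test, using llvm::BitVector the same way markLives(...).flip().any() does (illustrative data, not the pass itself):

#include "llvm/ADT/BitVector.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  // Pretend liveness analysis found operand #1 of a three-operand op dead.
  llvm::BitVector liveOperands(3, true);
  liveOperands[1] = false;

  llvm::BitVector deadOperands = liveOperands;
  deadOperands.flip(); // one bit per *dead* operand
  bool hasDeadOperand = deadOperands.any();

  llvm::outs() << (hasDeadOperand ? "erase the op and its results\n"
                                  : "keep the op\n");
}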
diff --git a/mlir/python/mlir/dialects/linalg/opdsl/ops/core_named_ops.py b/mlir/python/mlir/dialects/linalg/opdsl/ops/core_named_ops.py
index fd4a5a848f1e3..9c24f94fcf612 100644
--- a/mlir/python/mlir/dialects/linalg/opdsl/ops/core_named_ops.py
+++ b/mlir/python/mlir/dialects/linalg/opdsl/ops/core_named_ops.py
@@ -1729,16 +1729,16 @@ def pooling_ndhwc_min(
 
 
 @linalg_structured_op
-def fill(value=ScalarDef(T1), O=TensorDef(U, output=True)):
+def fill(value=ScalarDef(T), O=TensorDef(T, output=True)):
     """Fills the output tensor with the given value.
 
     Works for arbitrary ranked output tensors since the operation performs scalar
-    accesses only and is thus rank polymorphic. Numeric casting is performed on
-    the value operand, promoting it to the same data type as the output.
+    accesses only and is thus rank polymorphic. The value type must match the
+    element type of the output tensor or memref.
     """
     implements(FillOpInterface)
     defines(Canonicalizer)
-    O[None] = TypeFn.cast_signed(U, value)
+    O[None] = value
 
 
 @linalg_structured_op
diff --git a/mlir/test/Conversion/AMDGPUToROCDL/cvt_scale_pk-gfx1250.mlir b/mlir/test/Conversion/AMDGPUToROCDL/gfx1250.mlir
similarity index 81%
rename from mlir/test/Conversion/AMDGPUToROCDL/cvt_scale_pk-gfx1250.mlir
rename to mlir/test/Conversion/AMDGPUToROCDL/gfx1250.mlir
index d2391140ce056..27daea58f8f92 100644
--- a/mlir/test/Conversion/AMDGPUToROCDL/cvt_scale_pk-gfx1250.mlir
+++ b/mlir/test/Conversion/AMDGPUToROCDL/gfx1250.mlir
@@ -162,3 +162,51 @@ func.func @amdgpu.scaled_ext_packed816_invalid_dst_elem_type(%v: vector<16xf6E3M
   %ret0 = amdgpu.scaled_ext_packed816 %v scale(%scale) blockSize(32) firstScaleLane(0) firstScaleByte(0) : vector<16xf6E3M2FN>, vector<4xf8E8M0FNU> -> vector<16xf64>
   return %ret0: vector<16xf64>
 }
+
+// -----
+
+#gpu_global_addrspace = 1
+#gpu_lds_addrspace = 3
+#amdgpu_fat_buffer_addrspace = 7
+
+// CHECK-LABEL: func @make_dma_base
+// CHECK-SAME: (%[[IDX:.+]]: index, %[[MEM:.+]]: memref<8xi32, 1>, %[[SMEM:.+]]: memref<8xi32, 3>)
+func.func @make_dma_base(%idx: index, %mem: memref<8xi32, #gpu_global_addrspace>, %smem: memref<8xi32,#gpu_lds_addrspace>) -> (!amdgpu.tdm_base<i32>) {
+  // CHECK-DAG: %[[INT:.+]] = builtin.unrealized_conversion_cast %[[IDX]] : index to i64
+  // CHECK-DAG: %[[MEMREF_DESC_MEM:.+]] = builtin.unrealized_conversion_cast %[[MEM]] : memref<8xi32, 1>
+  // CHECK-DAG: %[[MEMREF_DESC_SMEM:.+]] = builtin.unrealized_conversion_cast %[[SMEM]] : memref<8xi32, 3>
+
+  // CHECK-DAG: %[[MEM_BASE_PTR:.+]] = llvm.extractvalue %[[MEMREF_DESC_MEM]][1] : !llvm.struct<(ptr<1>
+  // CHECK-DAG: %[[SMEM_BASE_PTR:.+]] = llvm.extractvalue %[[MEMREF_DESC_SMEM]][1] : !llvm.struct<(ptr<3>
+
+  // CHECK-DAG: %[[MEM_BASE_OFFSET:.+]] = llvm.getelementptr %[[MEM_BASE_PTR]][%[[INT]]]
+  // CHECK-DAG: %[[SMEM_BASE_OFFSET:.+]] = llvm.getelementptr %[[SMEM_BASE_PTR]][%[[INT]]]
+
+  // CHECK-DAG: %[[MEM_INT:.+]] = llvm.ptrtoint %[[MEM_BASE_OFFSET]] : !llvm.ptr<1> to i64
+  // CHECK-DAG: %[[SMEM_INT:.+]] = llvm.ptrtoint %[[SMEM_BASE_OFFSET]] : !llvm.ptr<3> to i32
+
+  // CHECK: %[[MEM_INT_LOW:.+]] = llvm.trunc %[[MEM_INT]] : i64 to i32
+  // CHECK-DAG: %[[SHIFT:.+]] = llvm.mlir.constant(32 : i64)
+  // CHECK: %[[SHIFTED_MEM_INT:.+]] = llvm.lshr %[[MEM_INT]], %[[SHIFT]]
+  // CHECK: %[[MEM_INT_HIGH:.+]] = llvm.trunc %[[SHIFTED_MEM_INT]] : i64 to i32
+  // CHECK-DAG: %[[MASK:.+]] = llvm.mlir.constant(33554431 : i32)
+  // CHECK: %[[VALID_MEM_INT_HIGH:.+]] = llvm.and %[[MEM_INT_HIGH]], %[[MASK]]
+
+  // CHECK-DAG: %[[TYPE_FIELD:.+]] = llvm.mlir.constant(-2147483648 : i32)
+  // CHECK: %[[MEM_INT_HIGH_TYPE:.+]] = llvm.or %[[VALID_MEM_INT_HIGH]], %[[TYPE_FIELD]]
+
+  // CHECK-DAG: %[[C0:.+]] = llvm.mlir.constant(0 : i32) : i32
+  // CHECK-DAG: %[[C1:.+]] = llvm.mlir.constant(1 : i32) : i32
+  // CHECK-DAG: %[[C2:.+]] = llvm.mlir.constant(2 : i32) : i32
+  // CHECK-DAG: %[[C3:.+]] = llvm.mlir.constant(3 : i32) : i32
+
+  // CHECK: %[[V4I32_0_0:.+]] = llvm.mlir.poison : vector<4xi32>
+  // CHECK: %[[V4I32_0_1:.+]] = llvm.insertelement %[[C1]], %[[V4I32_0_0]][%[[C0]] : i32]
+  // CHECK: %[[V4I32_0_2:.+]] = llvm.insertelement %[[SMEM_INT]], %[[V4I32_0_1]][%[[C1]] : i32]
+  // CHECK: %[[V4I32_0_3:.+]] = llvm.insertelement %[[MEM_INT_LOW]], %[[V4I32_0_2]][%[[C2]] : i32]
+  // CHECK: %[[V4I32_0_4:.+]] = llvm.insertelement %[[MEM_INT_HIGH_TYPE]], %[[V4I32_0_3]][%[[C3]] : i32]
+
+  %0 = amdgpu.make_dma_base %mem[%idx], %smem[%idx] : memref<8xi32, #gpu_global_addrspace>, memref<8xi32, #gpu_lds_addrspace> -> !amdgpu.tdm_base<i32>
+
+  func.return %0 : !amdgpu.tdm_base<i32>
+}
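
The long CHECK sequence in this test is mostly bit packing: the 64-bit global address is split into low and high 32-bit words, the high word is masked to 25 bits (33554431 = 0x1FFFFFF) and tagged with the descriptor type bit (-2147483648 = 0x80000000), and the pieces are inserted into a <4 x i32> vector together with the 32-bit LDS address. A standalone sketch of that packing arithmetic, with constants taken from the test and illustrative names:

#include <cstdint>
#include <cstdio>

int main() {
  uint64_t globalAddr = 0x0000123456789ABCULL; // example global pointer bits
  uint32_t low = static_cast<uint32_t>(globalAddr);
  uint32_t high = static_cast<uint32_t>(globalAddr >> 32);
  high &= 0x1FFFFFFu;  // keep the 25 valid high address bits
  high |= 0x80000000u; // set the descriptor type field in the top bit
  std::printf("low=0x%08x high=0x%08x\n", low, high);
}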
diff --git a/mlir/test/Conversion/ArithToApfloat/arith-to-apfloat.mlir b/mlir/test/Conversion/ArithToApfloat/arith-to-apfloat.mlir
index d71d81dddcd4f..950d2cecefa95 100644
--- a/mlir/test/Conversion/ArithToApfloat/arith-to-apfloat.mlir
+++ b/mlir/test/Conversion/ArithToApfloat/arith-to-apfloat.mlir
@@ -198,3 +198,68 @@ func.func @uitofp(%arg0: i32) {
   %0 = arith.uitofp %arg0 : i32 to f4E2M1FN
   return
 }
+
+// -----
+
+// CHECK: func.func private @_mlir_apfloat_compare(i32, i64, i64) -> i8
+// CHECK: %[[sem:.*]] = arith.constant 18 : i32
+// CHECK: %[[cmp:.*]] = call @_mlir_apfloat_compare(%[[sem]], %{{.*}}, %{{.*}}) : (i32, i64, i64) -> i8
+// CHECK: %[[c3:.*]] = arith.constant 3 : i8
+// CHECK: %[[is_unordered:.*]] = arith.cmpi eq, %[[cmp]], %[[c3]] : i8
+// CHECK: %[[c0:.*]] = arith.constant 0 : i8
+// CHECK: %[[is_lt:.*]] = arith.cmpi eq, %[[cmp]], %[[c0]] : i8
+// CHECK: arith.ori %[[is_unordered]], %[[is_lt]] : i1
+func.func @cmpf(%arg0: f4E2M1FN, %arg1: f4E2M1FN) {
+  %0 = arith.cmpf "ult", %arg0, %arg1 : f4E2M1FN
+  return
+}
+
+// -----
+
+// CHECK: func.func private @_mlir_apfloat_neg(i32, i64) -> i64
+// CHECK: %[[sem:.*]] = arith.constant 2 : i32
+// CHECK: %[[res:.*]] = call @_mlir_apfloat_neg(%[[sem]], %{{.*}}) : (i32, i64) -> i64
+func.func @negf(%arg0: f32) {
+  %0 = arith.negf %arg0 : f32
+  return
+}
+
+// -----
+
+// CHECK: func.func private @_mlir_apfloat_minimum(i32, i64, i64) -> i64
+// CHECK: %[[sem:.*]] = arith.constant 2 : i32
+// CHECK: %[[res:.*]] = call @_mlir_apfloat_minimum(%[[sem]], %{{.*}}, %{{.*}}) : (i32, i64, i64) -> i64
+func.func @minimumf(%arg0: f32, %arg1: f32) {
+  %0 = arith.minimumf %arg0, %arg1 : f32
+  return
+}
+
+// -----
+
+// CHECK: func.func private @_mlir_apfloat_maximum(i32, i64, i64) -> i64
+// CHECK: %[[sem:.*]] = arith.constant 2 : i32
+// CHECK: %[[res:.*]] = call @_mlir_apfloat_maximum(%[[sem]], %{{.*}}, %{{.*}}) : (i32, i64, i64) -> i64
+func.func @maximumf(%arg0: f32, %arg1: f32) {
+  %0 = arith.maximumf %arg0, %arg1 : f32
+  return
+}
+
+// -----
+
+// CHECK: func.func private @_mlir_apfloat_minnum(i32, i64, i64) -> i64
+// CHECK: %[[sem:.*]] = arith.constant 2 : i32
+// CHECK: %[[res:.*]] = call @_mlir_apfloat_minnum(%[[sem]], %{{.*}}, %{{.*}}) : (i32, i64, i64) -> i64
+func.func @minnumf(%arg0: f32, %arg1: f32) {
+  %0 = arith.minnumf %arg0, %arg1 : f32
+  return
+}
+
+// -----
+
+// CHECK: func.func private @_mlir_apfloat_maxnum(i32, i64, i64) -> i64
+// CHECK: %[[sem:.*]] = arith.constant 2 : i32
+// CHECK: %[[res:.*]] = call @_mlir_apfloat_maxnum(%[[sem]], %{{.*}}, %{{.*}}) : (i32, i64, i64) -> i64
+func.func @maxnumf(%arg0: f32, %arg1: f32) {
+  %0 = arith.maxnumf %arg0, %arg1 : f32
+  return
+}
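
The magic numbers in the cmpf lowering above come from llvm::APFloatBase::cmpResult (cmpLessThan = 0, cmpEqual = 1, cmpGreaterThan = 2, cmpUnordered = 3), which is what _mlir_apfloat_compare returns as an i8. "ult" is unordered-or-less-than, hence the two cmpi eq checks against 3 and 0 followed by arith.ori. A small C++ check of that decoding (assumes linking against LLVMSupport):

#include "llvm/ADT/APFloat.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  llvm::APFloat a(1.0f), b(2.0f);
  auto cmp = static_cast<int8_t>(a.compare(b)); // same value the wrapper returns
  bool ult = (cmp == 3) || (cmp == 0);          // unordered || less-than
  llvm::outs() << "ult(1.0, 2.0) = " << (ult ? "true" : "false") << '\n';
}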
diff --git a/mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir b/mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
index 82c02c1d6ee63..a0801443057ea 100644
--- a/mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
+++ b/mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
@@ -80,6 +80,28 @@ gpu.module @test_module {
 
 // -----
 
+gpu.module @test_module {
+
+  // CHECK-LABEL: func @gpu_wmma_f64_load_op() ->
+  // CHECK-SAME: f64
+  // CHECK32-LABEL: func @gpu_wmma_f64_load_op() ->
+  func.func @gpu_wmma_f64_load_op() -> (!gpu.mma_matrix<8x4xf64, "AOp">) {
+    %wg = memref.alloca() {alignment = 32} : memref<32x32xf64, 3>
+    %i = arith.constant 16 : index
+    %j = arith.constant 16 : index
+    %0 = gpu.subgroup_mma_load_matrix %wg[%i, %j] {leadDimension = 32 : index} : memref<32x32xf64, 3> -> !gpu.mma_matrix<8x4xf64, "AOp">
+    return %0 : !gpu.mma_matrix<8x4xf64, "AOp">
+    // CHECK: %[[MUL:.*]] = llvm.mul %{{.*}}, %{{.*}} : i64
+    // CHECK: %[[ADD:.*]] = llvm.add %[[MUL]], %{{.*}} : i64
+    // CHECK: %[[GEP:.*]] = llvm.getelementptr %{{.*}}[%[[ADD]]] : (!llvm.ptr<3>, i64) -> !llvm.ptr<3>, f64
+    // CHECK: %[[C32_I32:.*]] = llvm.mlir.constant(32 : index) : i32
+    // CHECK: %[[LOAD:.*]] = nvvm.wmma.load %[[GEP]], %[[C32_I32]] {eltype = #nvvm.mma_type<f64>, frag = #nvvm.mma_frag<a>, k = 4 : i32, layout = #nvvm.mma_layout<row>, m = 8 : i32, n = 8 : i32} : (!llvm.ptr<3>) -> f64
+    // CHECK: llvm.return %[[LOAD]] : f64
+  }
+}
+
+// -----
+
 gpu.module @test_module {
 
   // CHECK-LABEL: func @gpu_wmma_store_op
diff --git a/mlir/test/Conversion/MemRefToEmitC/memref-to-emitc-alloc-copy.mlir b/mlir/test/Conversion/MemRefToEmitC/memref-to-emitc-alloc-copy.mlir
index d1b751dc590f6..19e1c7ae4263e 100644
--- a/mlir/test/Conversion/MemRefToEmitC/memref-to-emitc-alloc-copy.mlir
+++ b/mlir/test/Conversion/MemRefToEmitC/memref-to-emitc-alloc-copy.mlir
@@ -26,14 +26,14 @@ func.func @alloc_copy(%arg0: memref<999xi32>) {
 // CHECK:           %[[UNREALIZED_CONVERSION_CAST_1:.*]] = builtin.unrealized_conversion_cast %[[CAST_0]] : !emitc.ptr<i32> to !emitc.array<999xi32>
 // CHECK:           %[[VAL_1:.*]] = "emitc.constant"() <{value = 0 : index}> : () -> index
 // CHECK:           %[[SUBSCRIPT_0:.*]] = emitc.subscript %[[UNREALIZED_CONVERSION_CAST_0]]{{\[}}%[[VAL_1]]] : (!emitc.array<999xi32>, index) -> !emitc.lvalue<i32>
-// CHECK:           %[[APPLY_0:.*]] = emitc.apply "&"(%[[SUBSCRIPT_0]]) : (!emitc.lvalue<i32>) -> !emitc.ptr<i32>
+// CHECK:           %[[ADDRESS_OF_0:.*]] = emitc.address_of %[[SUBSCRIPT_0]] : !emitc.lvalue<i32>
 // CHECK:           %[[VAL_2:.*]] = "emitc.constant"() <{value = 0 : index}> : () -> index
 // CHECK:           %[[SUBSCRIPT_1:.*]] = emitc.subscript %[[UNREALIZED_CONVERSION_CAST_1]]{{\[}}%[[VAL_2]]] : (!emitc.array<999xi32>, index) -> !emitc.lvalue<i32>
-// CHECK:           %[[APPLY_1:.*]] = emitc.apply "&"(%[[SUBSCRIPT_1]]) : (!emitc.lvalue<i32>) -> !emitc.ptr<i32>
+// CHECK:           %[[ADDRESS_OF_1:.*]] = emitc.address_of %[[SUBSCRIPT_1]] : !emitc.lvalue<i32>
 // CHECK:           %[[CALL_OPAQUE_2:.*]] = emitc.call_opaque "sizeof"() {args = [i32]} : () -> !emitc.size_t
 // CHECK:           %[[VAL_3:.*]] = "emitc.constant"() <{value = 999 : index}> : () -> index
 // CHECK:           %[[MUL_1:.*]] = emitc.mul %[[CALL_OPAQUE_2]], %[[VAL_3]] : (!emitc.size_t, index) -> !emitc.size_t
-// CHECK:           emitc.call_opaque "memcpy"(%[[APPLY_1]], %[[APPLY_0]], %[[MUL_1]]) : (!emitc.ptr<i32>, !emitc.ptr<i32>, !emitc.size_t) -> ()
+// CHECK:           emitc.call_opaque "memcpy"(%[[ADDRESS_OF_1]], %[[ADDRESS_OF_0]], %[[MUL_1]]) : (!emitc.ptr<i32>, !emitc.ptr<i32>, !emitc.size_t) -> ()
 // CHECK:           %[[CALL_OPAQUE_3:.*]] = emitc.call_opaque "sizeof"() {args = [i32]} : () -> !emitc.size_t
 // CHECK:           %[[VAL_4:.*]] = "emitc.constant"() <{value = 999 : index}> : () -> index
 // CHECK:           %[[MUL_2:.*]] = emitc.mul %[[CALL_OPAQUE_3]], %[[VAL_4]] : (!emitc.size_t, index) -> !emitc.size_t
@@ -42,13 +42,13 @@ func.func @alloc_copy(%arg0: memref<999xi32>) {
 // CHECK:           %[[UNREALIZED_CONVERSION_CAST_2:.*]] = builtin.unrealized_conversion_cast %[[CAST_1]] : !emitc.ptr<i32> to !emitc.array<999xi32>
 // CHECK:           %[[VAL_5:.*]] = "emitc.constant"() <{value = 0 : index}> : () -> index
 // CHECK:           %[[SUBSCRIPT_2:.*]] = emitc.subscript %[[UNREALIZED_CONVERSION_CAST_0]]{{\[}}%[[VAL_5]]] : (!emitc.array<999xi32>, index) -> !emitc.lvalue<i32>
-// CHECK:           %[[APPLY_2:.*]] = emitc.apply "&"(%[[SUBSCRIPT_2]]) : (!emitc.lvalue<i32>) -> !emitc.ptr<i32>
+// CHECK:           %[[ADDRESS_OF_2:.*]] = emitc.address_of %[[SUBSCRIPT_2]] : !emitc.lvalue<i32>
 // CHECK:           %[[VAL_6:.*]] = "emitc.constant"() <{value = 0 : index}> : () -> index
 // CHECK:           %[[SUBSCRIPT_3:.*]] = emitc.subscript %[[UNREALIZED_CONVERSION_CAST_2]]{{\[}}%[[VAL_6]]] : (!emitc.array<999xi32>, index) -> !emitc.lvalue<i32>
-// CHECK:           %[[APPLY_3:.*]] = emitc.apply "&"(%[[SUBSCRIPT_3]]) : (!emitc.lvalue<i32>) -> !emitc.ptr<i32>
+// CHECK:           %[[ADDRESS_OF_3:.*]] = emitc.address_of %[[SUBSCRIPT_3]] : !emitc.lvalue<i32>
 // CHECK:           %[[CALL_OPAQUE_5:.*]] = emitc.call_opaque "sizeof"() {args = [i32]} : () -> !emitc.size_t
 // CHECK:           %[[VAL_7:.*]] = "emitc.constant"() <{value = 999 : index}> : () -> index
 // CHECK:           %[[MUL_3:.*]] = emitc.mul %[[CALL_OPAQUE_5]], %[[VAL_7]] : (!emitc.size_t, index) -> !emitc.size_t
-// CHECK:           emitc.call_opaque "memcpy"(%[[APPLY_3]], %[[APPLY_2]], %[[MUL_3]]) : (!emitc.ptr<i32>, !emitc.ptr<i32>, !emitc.size_t) -> ()
+// CHECK:           emitc.call_opaque "memcpy"(%[[ADDRESS_OF_3]], %[[ADDRESS_OF_2]], %[[MUL_3]]) : (!emitc.ptr<i32>, !emitc.ptr<i32>, !emitc.size_t) -> ()
 // CHECK:           return
 // CHECK:         }
diff --git a/mlir/test/Conversion/MemRefToEmitC/memref-to-emitc-copy.mlir b/mlir/test/Conversion/MemRefToEmitC/memref-to-emitc-copy.mlir
index 829f267743d93..3de2d25f2b0d4 100644
--- a/mlir/test/Conversion/MemRefToEmitC/memref-to-emitc-copy.mlir
+++ b/mlir/test/Conversion/MemRefToEmitC/memref-to-emitc-copy.mlir
@@ -17,14 +17,14 @@ func.func @copying(%arg0 : memref<9x4x5x7xf32>, %arg1 : memref<9x4x5x7xf32>) {
 // CHECK:           %[[UNREALIZED_CONVERSION_CAST_1:.*]] = builtin.unrealized_conversion_cast %[[ARG0]] : memref<9x4x5x7xf32> to !emitc.array<9x4x5x7xf32>
 // CHECK:           %[[VAL_0:.*]] = "emitc.constant"() <{value = 0 : index}> : () -> index
 // CHECK:           %[[SUBSCRIPT_0:.*]] = emitc.subscript %[[UNREALIZED_CONVERSION_CAST_1]]{{\[}}%[[VAL_0]], %[[VAL_0]], %[[VAL_0]], %[[VAL_0]]] : (!emitc.array<9x4x5x7xf32>, index, index, index, index) -> !emitc.lvalue<f32>
-// CHECK:           %[[APPLY_0:.*]] = emitc.apply "&"(%[[SUBSCRIPT_0]]) : (!emitc.lvalue<f32>) -> !emitc.ptr<f32>
+// CHECK:           %[[ADDRESS_OF_0:.*]] = emitc.address_of %[[SUBSCRIPT_0]] : !emitc.lvalue<f32>
 // CHECK:           %[[VAL_1:.*]] = "emitc.constant"() <{value = 0 : index}> : () -> index
 // CHECK:           %[[SUBSCRIPT_1:.*]] = emitc.subscript %[[UNREALIZED_CONVERSION_CAST_0]]{{\[}}%[[VAL_1]], %[[VAL_1]], %[[VAL_1]], %[[VAL_1]]] : (!emitc.array<9x4x5x7xf32>, index, index, index, index) -> !emitc.lvalue<f32>
-// CHECK:           %[[APPLY_1:.*]] = emitc.apply "&"(%[[SUBSCRIPT_1]]) : (!emitc.lvalue<f32>) -> !emitc.ptr<f32>
+// CHECK:           %[[ADDRESS_OF_1:.*]] = emitc.address_of %[[SUBSCRIPT_1]] : !emitc.lvalue<f32>
 // CHECK:           %[[CALL_OPAQUE_0:.*]] = emitc.call_opaque "sizeof"() {args = [f32]} : () -> !emitc.size_t
 // CHECK:           %[[VAL_2:.*]] = "emitc.constant"() <{value = 1260 : index}> : () -> index
 // CHECK:           %[[MUL_0:.*]] = emitc.mul %[[CALL_OPAQUE_0]], %[[VAL_2]] : (!emitc.size_t, index) -> !emitc.size_t
-// CHECK:           emitc.call_opaque "memcpy"(%[[APPLY_1]], %[[APPLY_0]], %[[MUL_0]]) : (!emitc.ptr<f32>, !emitc.ptr<f32>, !emitc.size_t) -> ()
+// CHECK:           emitc.call_opaque "memcpy"(%[[ADDRESS_OF_1]], %[[ADDRESS_OF_0]], %[[MUL_0]]) : (!emitc.ptr<f32>, !emitc.ptr<f32>, !emitc.size_t) -> ()
 // CHECK:           return
 // CHECK:         }
 
diff --git a/mlir/test/Conversion/MemRefToEmitC/memref-to-emitc.mlir b/mlir/test/Conversion/MemRefToEmitC/memref-to-emitc.mlir
index 2b4eda37903d4..c7b043b8e2370 100644
--- a/mlir/test/Conversion/MemRefToEmitC/memref-to-emitc.mlir
+++ b/mlir/test/Conversion/MemRefToEmitC/memref-to-emitc.mlir
@@ -53,7 +53,7 @@ module @globals {
     // CHECK-NEXT: emitc.get_global @public_global : !emitc.array<3x7xf32>
     %0 = memref.get_global @public_global : memref<3x7xf32>
     // CHECK-NEXT: emitc.get_global @__constant_xi32 : !emitc.lvalue<i32>
-    // CHECK-NEXT: emitc.apply "&"(%1) : (!emitc.lvalue<i32>) -> !emitc.ptr<i32>
+    // CHECK-NEXT: emitc.address_of %1 : !emitc.lvalue<i32>
     %1 = memref.get_global @__constant_xi32 : memref<i32>
     return
   }
diff --git a/mlir/test/Conversion/NVVMToLLVM/nvvm-to-llvm.mlir b/mlir/test/Conversion/NVVMToLLVM/nvvm-to-llvm.mlir
index a94fcb4856db4..fbf8d9efb3bc7 100644
--- a/mlir/test/Conversion/NVVMToLLVM/nvvm-to-llvm.mlir
+++ b/mlir/test/Conversion/NVVMToLLVM/nvvm-to-llvm.mlir
@@ -16,8 +16,6 @@ llvm.func @init_mbarrier(%barrier_gen : !llvm.ptr, %barrier : !llvm.ptr<3>, %cou
 
 // CHECK-LABEL: @init_mbarrier_arrive_expect_tx
 llvm.func @init_mbarrier_arrive_expect_tx(%barrier : !llvm.ptr<3>, %txcount : i32, %pred : i1) {
-  //CHECK: llvm.inline_asm has_side_effects asm_dialect = att "mbarrier.arrive.expect_tx.shared.b64 _, [$0], $1;", "r,r"
-  nvvm.mbarrier.arrive.expect_tx %barrier, %txcount : !llvm.ptr<3>, i32
   //CHECK:  llvm.inline_asm has_side_effects asm_dialect = att "@$2 mbarrier.arrive.expect_tx.shared.b64 _, [$0], $1;", "r,r,b"
   nvvm.mbarrier.arrive.expect_tx %barrier, %txcount, predicate = %pred : !llvm.ptr<3>, i32, i1 
   llvm.return
@@ -25,8 +23,6 @@ llvm.func @init_mbarrier_arrive_expect_tx(%barrier : !llvm.ptr<3>, %txcount : i3
 
 // CHECK-LABEL: @init_mbarrier_arrive_expect_tx_generic
 llvm.func @init_mbarrier_arrive_expect_tx_generic(%barrier : !llvm.ptr, %txcount : i32, %pred : i1) {
-  // CHECK: llvm.inline_asm has_side_effects asm_dialect = att "mbarrier.arrive.expect_tx.b64 _, [$0], $1;", "l,r" 
-  nvvm.mbarrier.arrive.expect_tx %barrier, %txcount : !llvm.ptr, i32
   // CHECK: llvm.inline_asm has_side_effects asm_dialect = att "@$2 mbarrier.arrive.expect_tx.b64 _, [$0], $1;", "l,r,b"
   nvvm.mbarrier.arrive.expect_tx %barrier, %txcount, predicate = %pred : !llvm.ptr, i32, i1 
   llvm.return
@@ -544,8 +540,8 @@ func.func @elect_one_leader_sync() {
 
 // -----
 
-// CHECK-LABEL: @init_mbarrier_arrive_expect_tx
-llvm.func @init_mbarrier_arrive_expect_tx(%desc : !llvm.ptr, %pred : i1) {
+// CHECK-LABEL: @test_nvvm_prefetch
+llvm.func @test_nvvm_prefetch(%desc : !llvm.ptr, %pred : i1) {
   //CHECK: nvvm.prefetch tensormap, %{{.*}}
   nvvm.prefetch tensormap, %desc : !llvm.ptr
   //CHECK: llvm.inline_asm has_side_effects asm_dialect = att "@$1 prefetch.tensormap [$0];", "l,b"
diff --git a/mlir/test/Conversion/UBToSPIRV/ub-to-spirv.mlir b/mlir/test/Conversion/UBToSPIRV/ub-to-spirv.mlir
index edbe8b8001bba..9c277cf99b9a8 100644
--- a/mlir/test/Conversion/UBToSPIRV/ub-to-spirv.mlir
+++ b/mlir/test/Conversion/UBToSPIRV/ub-to-spirv.mlir
@@ -1,4 +1,4 @@
-// RUN: mlir-opt -split-input-file -convert-ub-to-spirv -verify-diagnostics %s | FileCheck %s
+// RUN: mlir-opt -split-input-file -convert-ub-to-spirv %s | FileCheck %s
 
 module attributes {
   spirv.target_env = #spirv.target_env<
@@ -22,15 +22,17 @@ func.func @check_poison() {
 
 // -----
 
-// No successful test because the dialect conversion framework does not convert
-// unreachable blocks.
-
 module attributes {
   spirv.target_env = #spirv.target_env<
     #spirv.vce<v1.0, [Int8, Int16, Int64, Float16, Float64, Shader], []>, #spirv.resource_limits<>>
 } {
-func.func @check_unrechable() {
-// expected-error@+1{{cannot be used in reachable block}}
-  spirv.Unreachable
+// CHECK-LABEL: @check_unreachable
+func.func @check_unreachable(%c: i1) {
+  cf.cond_br %c, ^bb1, ^bb2
+^bb1:
+// CHECK: spirv.Unreachable
+  ub.unreachable
+^bb2:
+  return
 }
 }
diff --git a/mlir/test/Dialect/AMDGPU/invalid.mlir b/mlir/test/Dialect/AMDGPU/invalid.mlir
index 61fdf29a78cbd..b915bfa324c77 100644
--- a/mlir/test/Dialect/AMDGPU/invalid.mlir
+++ b/mlir/test/Dialect/AMDGPU/invalid.mlir
@@ -354,3 +354,64 @@ func.func @scaled_mfma_invalid_k(%arg0 : vector<4xf8E8M0FNU>, %arg1 : vector<32x
   %0 = amdgpu.scaled_mfma 32x32x32 (%arg0[0] * %arg1) * (%arg0[1] * %arg1) + %arg2 : vector<4xf8E8M0FNU>, vector<32xf4E2M1FN>, vector<4xf8E8M0FNU>, vector<32xf4E2M1FN>, vector<16xf32>
   func.return %0 : vector<16xf32>
 }
+
+// -----
+
+func.func @make_dma_base_invalid_addressspace(%idx: index, %mem: memref<8xi32>) {
+  // expected-error@+1 {{'amdgpu.make_dma_base' op lds memref must have workgroup address space attribute.}}
+  amdgpu.make_dma_base %mem[%idx], %mem[%idx] : memref<8xi32>, memref<8xi32> -> !amdgpu.tdm_base<i32>
+}
+
+// -----
+
+func.func @make_dma_base_invalid_addressspace(%idx: index, %smem : memref<8xi32, #gpu.address_space<workgroup>>) {
+  // expected-error@+1 {{'amdgpu.make_dma_base' op global memref must have global address space attribute.}}
+  amdgpu.make_dma_base %smem[%idx], %smem[%idx] : memref<8xi32, #gpu.address_space<workgroup>>, memref<8xi32, #gpu.address_space<workgroup>> -> !amdgpu.tdm_base<i32>
+}
+
+// -----
+
+func.func @make_dma_base_invalid_barrier(%base: !amdgpu.tdm_base<i32>, %barrier: memref<8xi32>, %idx: index) {
+  // expected-error@+1 {{'amdgpu.make_dma_descriptor' op atomic barrier address must be in LDS.}}
+  amdgpu.make_dma_descriptor %base globalSize [0] globalStride [1] sharedSize [0] atomicBarrier(%barrier[%idx] : memref<8xi32>) : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
+}
+
+// -----
+
+// CHECK-LABEL: func @make_dma_descriptor_invalid_empty_strides
+// CHECK-SAME: (%[[BASE:.+]]: !amdgpu.tdm_base<i32>)
+func.func @make_dma_descriptor_invalid_empty_strides(%base: !amdgpu.tdm_base<i32>) {
+  // expected-error@+1 {{'amdgpu.make_dma_descriptor' op strides must not be empty.}}
+  amdgpu.make_dma_descriptor %base globalSize [0] globalStride [] sharedSize [0] : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
+  func.return
+}
+
+// -----
+
+// CHECK-LABEL: func @make_dma_descriptor_invalid_innermost_stride
+// CHECK-SAME: (%[[BASE:.+]]: !amdgpu.tdm_base<i32>)
+func.func @make_dma_descriptor_invalid_innermost_stride(%base: !amdgpu.tdm_base<i32>) {
+  // expected-error@+1 {{'amdgpu.make_dma_descriptor' op strides for the innermost dimension must be 1.}}
+  amdgpu.make_dma_descriptor %base globalSize [2, 2] globalStride [1, 2] sharedSize [0] : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
+  func.return
+}
+
+// -----
+
+// CHECK-LABEL: func @make_dma_descriptor_invalid_size_and_stride_sizes
+// CHECK-SAME: (%[[BASE:.+]]: !amdgpu.tdm_base<i32>)
+func.func @make_dma_descriptor_invalid_size_and_stride_sizes(%base: !amdgpu.tdm_base<i32>) {
+  // expected-error@+1 {{'amdgpu.make_dma_descriptor' op strides and sizes must have same rank.}}
+  amdgpu.make_dma_descriptor %base globalSize [1] globalStride [1, 1] sharedSize [0] : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
+  func.return
+}
+
+// -----
+
+// CHECK-LABEL: func @make_dma_descriptor_invalid_shared_and_global_rank
+// CHECK-SAME: (%[[BASE:.+]]: !amdgpu.tdm_base<i32>)
+func.func @make_dma_descriptor_invalid_shared_and_global_rank(%base: !amdgpu.tdm_base<i32>) {
+  // expected-error@+1 {{'amdgpu.make_dma_descriptor' op tensor must have same rank as tile.}}
+  amdgpu.make_dma_descriptor %base globalSize [4, 4] globalStride [1, 1] sharedSize [2] : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
+  func.return
+}
diff --git a/mlir/test/Dialect/AMDGPU/ops.mlir b/mlir/test/Dialect/AMDGPU/ops.mlir
index 653f9f64d24f4..3260bd4a8df9a 100644
--- a/mlir/test/Dialect/AMDGPU/ops.mlir
+++ b/mlir/test/Dialect/AMDGPU/ops.mlir
@@ -689,11 +689,60 @@ func.func @memory_counter_wait() {
 // CHECK-LABEL: func @make_dma_base
 // CHECK-SAME: (%[[IDX:.+]]: index, %[[MEM:.+]]: memref<8xi32>, %[[SMEM:.+]]: memref<8xi32, #gpu.address_space<workgroup>>)
 func.func @make_dma_base(%idx: index, %mem: memref<8xi32>, %smem: memref<8xi32, #gpu.address_space<workgroup>>) {
-  // CHECK: amdgpu.make_dma_base %[[MEM]][%[[IDX]]], %[[SMEM]][%[[IDX]]] : memref<8xi32>, memref<8xi32, #gpu.address_space<workgroup>> to !amdgpu.tdm_base<i32>
-  amdgpu.make_dma_base %mem[%idx], %smem[%idx] : memref<8xi32>, memref<8xi32, #gpu.address_space<workgroup>> to !amdgpu.tdm_base<i32>
+  // CHECK: amdgpu.make_dma_base %[[MEM]][%[[IDX]]], %[[SMEM]][%[[IDX]]] : memref<8xi32>, memref<8xi32, #gpu.address_space<workgroup>> -> !amdgpu.tdm_base<i32>
+  amdgpu.make_dma_base %mem[%idx], %smem[%idx] : memref<8xi32>, memref<8xi32, #gpu.address_space<workgroup>> -> !amdgpu.tdm_base<i32>
+  func.return
+}
+
+// CHECK-LABEL: func @make_dma_descriptor
+// CHECK-SAME: (%[[BASE:.+]]: !amdgpu.tdm_base<i32>, %[[BARRIER:.+]]: memref<8xi32, #gpu.address_space<workgroup>>, %[[IDX:.+]]: index)
+func.func @make_dma_descriptor(%base: !amdgpu.tdm_base<i32>, %barrier: memref<8xi32, #gpu.address_space<workgroup>>, %idx: index) {
+
+  // CHECK: amdgpu.make_dma_descriptor %[[BASE]]
+  amdgpu.make_dma_descriptor %base
+        // CHECK-SAME: globalSize [0]
+        globalSize [0]
+        // CHECK-SAME: globalStride [1]
+        globalStride [1]
+        // CHECK-SAME: sharedSize [0] : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
+        sharedSize [0] : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
+
+  // CHECK: amdgpu.make_dma_descriptor %[[BASE]]
+  amdgpu.make_dma_descriptor %base
+        // CHECK-SAME: globalSize [0]
+        globalSize [0]
+        // CHECK-SAME: globalStride [1]
+        globalStride [1]
+        // CHECK-SAME: sharedSize [0]
+        sharedSize [0]
+        // CHECK-SAME: padShared(%[[IDX]] every %[[IDX]])
+        padShared(%idx every %idx)
+        : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
+
+  // CHECK: amdgpu.make_dma_descriptor %[[BASE]]
+  amdgpu.make_dma_descriptor %base
+        // CHECK-SAME: globalSize [0]
+        globalSize [0]
+        // CHECK-SAME: globalStride [1]
+        globalStride [1]
+        // CHECK-SAME: sharedSize [0]
+        sharedSize [0]
+        // CHECK-SAME: atomicBarrier(%[[BARRIER]][%[[IDX]]] : memref<8xi32, #gpu.address_space<workgroup>>)
+        atomicBarrier(%barrier[%idx] : memref<8xi32, #gpu.address_space<workgroup>>)
+        : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
+
+  // CHECK: amdgpu.make_dma_descriptor %[[BASE]]
+  amdgpu.make_dma_descriptor %base
+        // CHECK-SAME: globalSize [0]
+        globalSize [0]
+        // CHECK-SAME: globalStride [1]
+        globalStride [1]
+        // CHECK-SAME: sharedSize [0]
+        sharedSize [0]
+        // CHECK-SAME: iterate %[[IDX]], %[[IDX]], %[[IDX]]
+        iterate %idx, %idx, %idx
+        : !amdgpu.tdm_base<i32> -> !amdgpu.tdm_descriptor
 
-  // CHECK: amdgpu.make_dma_base %[[SMEM]][%[[IDX]]], %[[MEM]][%[[IDX]]] : memref<8xi32, #gpu.address_space<workgroup>>, memref<8xi32> to !amdgpu.tdm_base<i32>
-  amdgpu.make_dma_base %smem[%idx], %mem[%idx] : memref<8xi32, #gpu.address_space<workgroup>>, memref<8xi32> to !amdgpu.tdm_base<i32>
   func.return
 }
 
diff --git a/mlir/test/Dialect/Affine/loop-coalescing.mlir b/mlir/test/Dialect/Affine/loop-coalescing.mlir
index 3be14eaf5c326..6a825320ff20f 100644
--- a/mlir/test/Dialect/Affine/loop-coalescing.mlir
+++ b/mlir/test/Dialect/Affine/loop-coalescing.mlir
@@ -416,3 +416,31 @@ func.func @test_loops_do_not_get_coalesced() {
 // CHECK-NEXT: }
 // CHECK-NEXT: }
 // CHECK-NEXT: return
+
+// -----
+
+// CHECK-LABEL: func @inner_loop_has_iter_args
+// CHECK-SAME:    %[[ALLOC:.*]]: memref<?xi64>)
+func.func @inner_loop_has_iter_args(%alloc : memref<?xi64>) {
+  %c17 = arith.constant 17 : index
+  affine.for %arg0 = 0 to 79 {
+    %0 = affine.for %arg1 = 0 to 64 iter_args(%arg2 = %alloc) -> (memref<?xi64>) {
+      %1 = arith.remui %arg1, %c17 : index
+      %2 = arith.index_cast %arg1 : index to i64
+      memref.store %2, %arg2[%1] : memref<?xi64>
+      affine.yield %arg2 : memref<?xi64>
+    }
+  }
+  return
+}
+
+// CHECK: %[[CONSTANT_0:.*]] = arith.constant 17 : index
+// CHECK: %[[APPLY_0:.*]] = affine.apply affine_map<() -> (79)>()
+// CHECK: %[[APPLY_1:.*]] = affine.apply affine_map<() -> (64)>()
+// CHECK: %[[APPLY_2:.*]] = affine.apply affine_map<(d0)[s0] -> (d0 * s0)>(%[[APPLY_0]]){{\[}}%[[APPLY_1]]]
+// CHECK: affine.for %[[IV:.*]] = 0 to %[[APPLY_2]] {
+// CHECK: %[[APPLY_3:.*]] = affine.apply affine_map<(d0)[s0] -> (d0 mod s0)>(%[[IV]]){{\[}}%[[APPLY_1]]]
+// CHECK:   %[[REMUI_0:.*]] = arith.remui %[[APPLY_3]], %[[CONSTANT_0]] : index
+// CHECK:   %[[INDEX_CAST_0:.*]] = arith.index_cast %[[APPLY_3]] : index to i64
+// CHECK:   memref.store %[[INDEX_CAST_0]], %[[ALLOC]]{{\[}}%[[REMUI_0]]] : memref<?xi64>
+// CHECK: }
diff --git a/mlir/test/Dialect/Affine/value-bounds-reification.mlir b/mlir/test/Dialect/Affine/value-bounds-reification.mlir
index 817614be50533..2e801028057a1 100644
--- a/mlir/test/Dialect/Affine/value-bounds-reification.mlir
+++ b/mlir/test/Dialect/Affine/value-bounds-reification.mlir
@@ -36,13 +36,13 @@ func.func @reify_through_chain(%sz0: index, %sz2: index) -> (index, index, index
 //       CHECK:   "test.some_use"(%[[c5]])
 //       CHECK:   %[[c5:.*]] = arith.constant 5 : index
 //       CHECK:   "test.some_use"(%[[c5]])
-func.func @reify_slice_bound(%t: tensor<?x?xi32>, %idx: index, %ub: index, %f: f32) {
+func.func @reify_slice_bound(%t: tensor<?x?xi32>, %idx: index, %ub: index, %f: i32) {
   %c0 = arith.constant 0 : index
   %c4 = arith.constant 4 : index
   scf.for %iv = %c0 to %ub step %c4 {
     %sz = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 4)>(%iv)[%ub]
     %slice = tensor.extract_slice %t[%idx, %iv] [1, %sz] [1, 1] : tensor<?x?xi32> to tensor<1x?xi32>
-    %filled = linalg.fill ins(%f : f32) outs(%slice : tensor<1x?xi32>) -> tensor<1x?xi32>
+    %filled = linalg.fill ins(%f : i32) outs(%slice : tensor<1x?xi32>) -> tensor<1x?xi32>
 
     %bound = "test.reify_bound"(%filled) {dim = 1, type = "UB"} : (tensor<1x?xi32>) -> (index)
     "test.some_use"(%bound) : (index) -> ()
diff --git a/mlir/test/Dialect/ControlFlow/canonicalize.mlir b/mlir/test/Dialect/ControlFlow/canonicalize.mlir
index 17f7d28ba59fb..21a16784b81b2 100644
--- a/mlir/test/Dialect/ControlFlow/canonicalize.mlir
+++ b/mlir/test/Dialect/ControlFlow/canonicalize.mlir
@@ -634,3 +634,25 @@ func.func @unsimplified_cycle_2(%c : i1) {
 ^bb7:
   cf.br ^bb6
 }
+
+// CHECK-LABEL: @drop_unreachable_branch_1
+//  CHECK-NEXT:   "test.foo"() : () -> ()
+//  CHECK-NEXT:   return
+func.func @drop_unreachable_branch_1(%c: i1) {
+  cf.cond_br %c, ^bb1, ^bb2
+^bb1:
+  "test.foo"() : () -> ()
+  return
+^bb2:
+  ub.unreachable
+}
+
+// CHECK-LABEL: @drop_unreachable_branch_2
+//  CHECK-NEXT:   ub.unreachable
+func.func @drop_unreachable_branch_2(%c: i1) {
+  cf.cond_br %c, ^bb1, ^bb2
+^bb1:
+  ub.unreachable
+^bb2:
+  ub.unreachable
+}
diff --git a/mlir/test/Dialect/EmitC/invalid_ops.mlir b/mlir/test/Dialect/EmitC/invalid_ops.mlir
index f285196d466ce..d1601bed29ca9 100644
--- a/mlir/test/Dialect/EmitC/invalid_ops.mlir
+++ b/mlir/test/Dialect/EmitC/invalid_ops.mlir
@@ -914,3 +914,19 @@ func.func @test_for_unmatch_type(%arg0: index) {
   ) : (index, index, index) -> ()
   return
 }
+
+// -----
+
+func.func @address_of(%arg0: !emitc.lvalue<i32>) {
+  // expected-error @+1 {{failed to verify that input and result reference the same type}}
+  %1 = "emitc.address_of"(%arg0) : (!emitc.lvalue<i32>) -> !emitc.ptr<i8>
+  return
+}
+
+// -----
+
+func.func @dereference(%arg0: !emitc.ptr<i32>) {
+  // expected-error @+1 {{failed to verify that input and result reference the same type}}
+  %1 = "emitc.dereference"(%arg0) : (!emitc.ptr<i32>) -> !emitc.lvalue<i8>
+  return
+}
diff --git a/mlir/test/Dialect/EmitC/ops.mlir b/mlir/test/Dialect/EmitC/ops.mlir
index 1259748dfce84..b2c8b843ec14b 100644
--- a/mlir/test/Dialect/EmitC/ops.mlir
+++ b/mlir/test/Dialect/EmitC/ops.mlir
@@ -355,3 +355,13 @@ func.func @do(%arg0 : !emitc.ptr<i32>) {
 
   return
 }
+
+func.func @address_of(%arg0: !emitc.lvalue<i32>) {
+  %1 = emitc.address_of %arg0 : !emitc.lvalue<i32>
+  return
+}
+
+func.func @dereference(%arg0: !emitc.ptr<i32>) {
+  %1 = emitc.dereference %arg0 : !emitc.ptr<i32>
+  return
+}
diff --git a/mlir/test/Dialect/GPU/invalid.mlir b/mlir/test/Dialect/GPU/invalid.mlir
index 35381dab7b200..26bcf948bc85d 100644
--- a/mlir/test/Dialect/GPU/invalid.mlir
+++ b/mlir/test/Dialect/GPU/invalid.mlir
@@ -688,7 +688,7 @@ func.func @mmamatrix_operand_type(){
 func.func @mmamatrix_invalid_element_type(){
     %wg = memref.alloca() {alignment = 32} : memref<32x32xf16, 3>
     %i = arith.constant 16 : index
-    // expected-error @+1 {{MMAMatrixType elements must be SI8, UI8, I32, F16, or F32}}
+    // expected-error @+1 {{MMAMatrixType elements must be SI8, UI8, I32, F16, F32, or F64}}
     %0 = gpu.subgroup_mma_load_matrix %wg[%i, %i] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xbf16, "AOp">
     return
 }
@@ -708,7 +708,7 @@ func.func @mmaLoadOp_identity_layout(){
 // -----
 
 func.func @mma_invalid_memref_type(%src: memref<32x4xvector<4x8xf32>>, %i: index) {
-    // expected-error @+1 {{operand #0 must be memref of 8-bit signless integer or 32-bit signless integer or 16-bit float or 32-bit float or vector of 8-bit signless integer or 32-bit signless integer or 16-bit float or 32-bit float values of ranks 1 values}}
+    // expected-error @+1 {{operand #0 must be memref of 8-bit signless integer or 32-bit signless integer or 16-bit float or 32-bit float or 64-bit float or vector of 8-bit signless integer or 32-bit signless integer or 16-bit float or 32-bit float or 64-bit float values of ranks 1 values}}
     %0 = gpu.subgroup_mma_load_matrix %src[%i, %i] {leadDimension = 4 : index} : memref<32x4xvector<4x8xf32>> -> !gpu.mma_matrix<16x16xf16, "AOp">
     return
 }
diff --git a/mlir/test/Dialect/LLVMIR/nvvm-target-invalid.mlir b/mlir/test/Dialect/LLVMIR/nvvm-target-invalid.mlir
new file mode 100644
index 0000000000000..c2cfa7689978b
--- /dev/null
+++ b/mlir/test/Dialect/LLVMIR/nvvm-target-invalid.mlir
@@ -0,0 +1,11 @@
+// RUN: not mlir-opt %s 2>&1 | FileCheck %s
+// CHECK: 'nvvm.tcgen05.alloc' op is not supported on sm_90
+
+module {
+    gpu.module @mod [#nvvm.target<chip = "sm_90">] {
+        func.func @tcgen05_alloc(%arg0: !llvm.ptr<7>, %arg1: i32) {
+             nvvm.tcgen05.alloc %arg0, %arg1 : !llvm.ptr<7>, i32
+             return
+        }
+    }
+}
diff --git a/mlir/test/Dialect/LLVMIR/nvvm.mlir b/mlir/test/Dialect/LLVMIR/nvvm.mlir
index cd7bd37da5763..6f67a50c1a946 100644
--- a/mlir/test/Dialect/LLVMIR/nvvm.mlir
+++ b/mlir/test/Dialect/LLVMIR/nvvm.mlir
@@ -464,19 +464,6 @@ llvm.func private @mbarrier_arrive_nocomplete_shared(%barrier: !llvm.ptr<3>) {
   llvm.return
 }
 
-llvm.func private @mbarrier_test_wait(%barrier: !llvm.ptr, %token : i64) -> i1 {  
-  // CHECK:   nvvm.mbarrier.test.wait %{{.*}}
-  %isComplete = nvvm.mbarrier.test.wait %barrier, %token : !llvm.ptr, i64 -> i1
-  llvm.return %isComplete : i1
-}
-
-llvm.func private @mbarrier_test_wait_shared(%barrier: !llvm.ptr<3>, %token : i64) {
-  %count = nvvm.read.ptx.sreg.ntid.x : i32
-  // CHECK:   nvvm.mbarrier.test.wait %{{.*}}
-  %isComplete = nvvm.mbarrier.test.wait %barrier, %token : !llvm.ptr<3>, i64 -> i1
-  llvm.return
-}
-
 // CHECK-LABEL: @wgmma_fence_aligned
 func.func @wgmma_fence_aligned() {
   // CHECK: nvvm.wgmma.fence.aligned
diff --git a/mlir/test/Dialect/Linalg/fuse-with-reshape-by-collapsing.mlir b/mlir/test/Dialect/Linalg/fuse-with-reshape-by-collapsing.mlir
index 2bf3d21c35526..77c7d7d69a77d 100644
--- a/mlir/test/Dialect/Linalg/fuse-with-reshape-by-collapsing.mlir
+++ b/mlir/test/Dialect/Linalg/fuse-with-reshape-by-collapsing.mlir
@@ -594,6 +594,24 @@ func.func @fuse_by_collapsing_pad(%arg0 : tensor<2x12x5x336x9xi32>) -> tensor<8x
 
 // -----
 
+func.func @no_fuse_by_collapsing_pad_non_constant_padding(%arg0 : tensor<2x12xi32>) -> tensor<8x3x4xi32> {
+  %expand = tensor.expand_shape %arg0 [[0], [1, 2]] output_shape [2, 3, 4] : tensor<2x12xi32> into tensor<2x3x4xi32>
+  %cst = arith.constant 0 : i32
+  %padded_0 = tensor.pad %expand low[1, 0, 0] high[5, 0, 0] {
+  ^bb0(%arg1: index, %arg2: index, %arg3: index):
+    %pad_val = arith.index_cast %arg1 : index to i32
+    tensor.yield %pad_val : i32
+  } : tensor<2x3x4xi32> to tensor<8x3x4xi32>
+  return %padded_0 : tensor<8x3x4xi32>
+}
+//      CHECK: func @no_fuse_by_collapsing_pad_non_constant_padding(
+// CHECK-SAME:   %[[ARG0:.+]]: tensor<2x12xi32>)
+//      CHECK:   %[[EXPAND:.+]] = tensor.expand_shape %[[ARG0]]
+//      CHECK:   %[[PAD:.+]] = tensor.pad %[[EXPAND]]
+//      CHECK:   return %[[PAD]]
+
+// -----
+
 func.func @no_fuse_by_collapsing_pad(%arg0 : tensor<2x12x5x336x9xi32>) -> tensor<8x5x4x17x6x7x8x14xi32> {
   %expand = tensor.expand_shape %arg0 [[0], [1, 2], [3], [4, 5, 6], [7]] output_shape [2, 3, 4, 5, 6, 7, 8, 9] : tensor<2x12x5x336x9xi32> into tensor<2x3x4x5x6x7x8x9xi32>
   %cst = arith.constant 0 : i32
@@ -639,6 +657,63 @@ func.func @fuse_by_collapsing_dynamic_pad(%arg0 : tensor<?x?x?x?xf32>,
 // CHECK-SAME:       output_shape [%[[PAD_SIZE0]], %[[S1]], %[[S2]], %[[PAD_SIZE1]], %[[S4]], %[[S5]]] : tensor<?x?x?x?xf32> into tensor<?x?x?x?x?x?xf32>
 //      CHECK:   return %[[EXPAND]]
 
+// -----
+
+func.func @collapse_shape_with_producer_pad(%arg0: tensor<2x3x4x5x6x7x8x9xi32>) -> tensor<8x12x17x336x14xi32> {
+  %cst = arith.constant 0 : i32
+  %padded = tensor.pad %arg0 low[1, 0, 0, 8, 0, 0, 0, 3] high[5, 0, 0, 4, 0, 0, 0, 2] {
+  ^bb0(%arg1: index, %arg2: index, %arg3: index, %arg4: index,
+       %arg5: index, %arg6: index, %arg7: index, %arg8: index):
+    tensor.yield %cst : i32
+  } : tensor<2x3x4x5x6x7x8x9xi32> to tensor<8x3x4x17x6x7x8x14xi32>
+  %collapsed = tensor.collapse_shape %padded [[0], [1, 2], [3], [4, 5, 6], [7]]
+    : tensor<8x3x4x17x6x7x8x14xi32> into tensor<8x12x17x336x14xi32>
+  return %collapsed : tensor<8x12x17x336x14xi32>
+}
+//      CHECK: func @collapse_shape_with_producer_pad
+// CHECK-SAME:   %[[ARG0:.+]]: tensor<2x3x4x5x6x7x8x9xi32>
+//      CHECK:   %[[COLLAPSE:.+]] = tensor.collapse_shape %[[ARG0]] {{\[}}[0], [1, 2], [3], [4, 5, 6], [7]]
+//      CHECK:   %[[PAD:.+]] = tensor.pad %[[COLLAPSE]] low[1, 0, 8, 0, 3] high[5, 0, 4, 0, 2]
+//      CHECK:   return %[[PAD]]
+
+// -----
+
+func.func @collapse_shape_with_producer_pad_dynamic(%arg0: tensor<?x?x?x?x?x?xf32>,
+    %l0 : index, %l1 : index, %h0 : index, %h1 : index) -> tensor<?x?x?x?xf32> {
+  %cst = arith.constant 0.0 : f32
+  %padded = tensor.pad %arg0 low[%l0, 0, 0, %l1, 0, 0] high[%h0, 0, 0, %h1, 0, 0] {
+  ^bb0(%arg1: index, %arg2: index, %arg3: index, %arg4: index, %arg5: index, %arg6: index):
+    tensor.yield %cst : f32
+  } : tensor<?x?x?x?x?x?xf32> to tensor<?x?x?x?x?x?xf32>
+  %collapsed = tensor.collapse_shape %padded [[0], [1, 2], [3], [4, 5]]
+    : tensor<?x?x?x?x?x?xf32> into tensor<?x?x?x?xf32>
+  return %collapsed : tensor<?x?x?x?xf32>
+}
+//      CHECK: func @collapse_shape_with_producer_pad_dynamic
+// CHECK-SAME:   %[[ARG0:.+]]: tensor<?x?x?x?x?x?xf32>
+// CHECK-SAME:   %[[L0:.+]]: index, %[[L1:.+]]: index, %[[H0:.+]]: index, %[[H1:.+]]: index
+//      CHECK:   %[[COLLAPSE:.+]] = tensor.collapse_shape %[[ARG0]] {{\[}}[0], [1, 2], [3], [4, 5]]
+//      CHECK:   %[[PAD:.+]] = tensor.pad %[[COLLAPSE]] low[%[[L0]], 0, %[[L1]], 0] high[%[[H0]], 0, %[[H1]], 0]
+//      CHECK:   return %[[PAD]]
+
+// -----
+
+func.func @collapse_shape_with_producer_pad_non_constant_padding(%arg0 : tensor<2x3x4xi32>) -> tensor<8x12xi32> {
+  %cst = arith.constant 0 : i32
+  %padded_0 = tensor.pad %arg0 low[1, 0, 0] high[5, 0, 0] {
+  ^bb0(%arg1: index, %arg2: index, %arg3: index):
+    %pad_val = arith.index_cast %arg1 : index to i32
+    tensor.yield %pad_val : i32
+  } : tensor<2x3x4xi32> to tensor<8x3x4xi32>
+  %collapsed = tensor.collapse_shape %padded_0 [[0], [1, 2]] : tensor<8x3x4xi32> into tensor<8x12xi32>
+  return %collapsed : tensor<8x12xi32>
+}
+//      CHECK: func @collapse_shape_with_producer_pad_non_constant_padding(
+// CHECK-SAME:   %[[ARG0:.+]]: tensor<2x3x4xi32>)
+//      CHECK:   %[[PAD:.+]] = tensor.pad %[[ARG0]]
+//      CHECK:   %[[COLLAPSED:.+]] = tensor.collapse_shape %[[PAD]]
+//      CHECK:   return %[[COLLAPSED]]
+
 // -----
 // Static problem sizes. Checks all aspects of fusion by collapsing with bubbling up collapse shapes.
 #map0 = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7) -> (d0, d1, d2, d3, d4, d5, d6, d7)>
diff --git a/mlir/test/Dialect/Linalg/fusion-elementwise-ops.mlir b/mlir/test/Dialect/Linalg/fusion-elementwise-ops.mlir
index bc55c12c02f29..6f1a422324e08 100644
--- a/mlir/test/Dialect/Linalg/fusion-elementwise-ops.mlir
+++ b/mlir/test/Dialect/Linalg/fusion-elementwise-ops.mlir
@@ -921,30 +921,6 @@ func.func @fold_fill_generic_basic(%arg0: tensor<?xf32>) -> (tensor<?xf32>) {
 
 // -----
 
-// CHECK-LABEL: func @fold_fill_generic_different_dtype
-//  CHECK-SAME: (%[[ARG0:.*]]: tensor<?xf16>) -> tensor<?xf16> {
-//   CHECK-NOT: linalg.fill
-//       CHECK: %[[GENERIC_OP:.*]] = linalg.generic
-//  CHECK-SAME: ins(%[[ARG0]] : tensor<?xf16>)
-//  CHECK-SAME: outs({{.*}} : tensor<?xf16>) {
-#map0 = affine_map<(d0) -> (d0)>
-func.func @fold_fill_generic_different_dtype(%arg0: tensor<?xf16>) -> (tensor<?xf16>) {
-  %c0 = arith.constant 0 : index
-  %cst = arith.constant 7.0 : f32
-  %0 = tensor.dim %arg0, %c0 : tensor<?xf16>
-  %1 = tensor.empty(%0) : tensor<?xf16>
-  %2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<?xf16>) -> tensor<?xf16>
-  %3 = tensor.empty(%0) : tensor<?xf16>
-  %4 = linalg.generic {indexing_maps = [#map0, #map0, #map0], iterator_types=["parallel"]} ins(%arg0, %2 : tensor<?xf16>, tensor<?xf16>) outs (%3:tensor<?xf16>) {
-  ^bb0(%arg1: f16, %arg2: f16, %arg3: f16):
-    %5 = arith.addf  %arg1, %arg2 : f16
-        linalg.yield %5 : f16
-  } -> tensor<?xf16>
-  return %4 : tensor<?xf16>
-}
-
-// -----
-
 // CHECK-LABEL: func @fold_fill_generic_mixedaccess
 //   CHECK-NOT: linalg.fill
 //       CHECK: %[[GENERIC_OP:.*]] = linalg.generic
@@ -1079,4 +1055,4 @@ module {
 // CHECK-NOT:     linalg.generic
 // CHECK:         tensor.expand_shape
 // CHECK:         linalg.generic {{.*}}, iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel", "reduction"]}
-// CHECK-SAME:     ins(%[[ARG0]], %[[FUSED]]#1 : tensor<1x1x2x1xf32>, tensor<4x1x1x1xf32>)
\ No newline at end of file
+// CHECK-SAME:     ins(%[[ARG0]], %[[FUSED]]#1 : tensor<1x1x2x1xf32>, tensor<4x1x1x1xf32>)
diff --git a/mlir/test/Dialect/Linalg/generalize-named-polymorphic-ops.mlir b/mlir/test/Dialect/Linalg/generalize-named-polymorphic-ops.mlir
index 290c6c7c36f76..4526dc90fad2e 100644
--- a/mlir/test/Dialect/Linalg/generalize-named-polymorphic-ops.mlir
+++ b/mlir/test/Dialect/Linalg/generalize-named-polymorphic-ops.mlir
@@ -380,8 +380,8 @@ func.func @generalize_pooling_nwc_sum_i32(%input : tensor<1x16x1xi32>, %shape: t
 
 // -----
 
-func.func @generalize_fill_0d(%value: f64, %O: tensor<f32>) -> tensor<f32> {
-  %0 = linalg.fill ins(%value: f64) outs(%O : tensor<f32>) -> tensor<f32>
+func.func @generalize_fill_0d(%value: f32, %O: tensor<f32>) -> tensor<f32> {
+  %0 = linalg.fill ins(%value: f32) outs(%O : tensor<f32>) -> tensor<f32>
   return %0: tensor<f32>
 }
 
@@ -394,8 +394,8 @@ func.func @generalize_fill_0d(%value: f64, %O: tensor<f32>) -> tensor<f32> {
 
 // -----
 
-func.func @generalize_fill_2d(%value: f64, %O: memref<16x32xf32>) {
-  linalg.fill ins(%value: f64) outs(%O : memref<16x32xf32>)
+func.func @generalize_fill_2d(%value: f32, %O: memref<16x32xf32>) {
+  linalg.fill ins(%value: f32) outs(%O : memref<16x32xf32>)
   return
 }
 
diff --git a/mlir/test/Dialect/Linalg/invalid.mlir b/mlir/test/Dialect/Linalg/invalid.mlir
index fabc8e610612d..1f554e6c45da7 100644
--- a/mlir/test/Dialect/Linalg/invalid.mlir
+++ b/mlir/test/Dialect/Linalg/invalid.mlir
@@ -352,6 +352,24 @@ func.func @illegal_fill_tensor_with_memref_return
 
 // -----
 
+func.func @illegal_fill_element_type_truncation(%arg0 : tensor<2xf32>, %arg1 : f64) -> tensor<2xf32>
+{
+  // expected-error @+1 {{'linalg.fill' op expected fill value type ('f64') to match output element type ('f32')}}
+  %0 = linalg.fill ins(%arg1 : f64) outs(%arg0 : tensor<2xf32>) -> tensor<2xf32>
+  return %0 : tensor<2xf32>
+}
+
+// -----
+
+func.func @illegal_fill_element_type_extension(%arg0 : tensor<2xi32>, %arg1 : i16) -> tensor<2xi32>
+{
+  // expected-error @+1 {{'linalg.fill' op expected fill value type ('i16') to match output element type ('i32')}}
+  %0 = linalg.fill ins(%arg1 : i16) outs(%arg0 : tensor<2xi32>) -> tensor<2xi32>
+  return %0 : tensor<2xi32>
+}
+
+// -----
+
 func.func @illegal_fill_value_type(%arg0 : tensor<2x2xf32>, %arg1 : tensor<2xf32>) -> tensor<2x2xf32>
 {
   // expected-error @+1 {{expected op with scalar input}}
diff --git a/mlir/test/Dialect/Linalg/reshape_fusion.mlir b/mlir/test/Dialect/Linalg/reshape_fusion.mlir
index 67b4f2b32bad5..3fb7225069983 100644
--- a/mlir/test/Dialect/Linalg/reshape_fusion.mlir
+++ b/mlir/test/Dialect/Linalg/reshape_fusion.mlir
@@ -822,6 +822,23 @@ func.func @fuse_by_expanding_pad(%arg0 : tensor<2x3x4x5x6x7x8x9xi32>) -> tensor<
 
 // -----
 
+func.func @no_fuse_by_expanding_pad_non_constant_padding(%arg0 : tensor<2x3x4xi32>) -> tensor<8x12xi32> {
+  %collapse = tensor.collapse_shape %arg0 [[0], [1, 2]] : tensor<2x3x4xi32> into tensor<2x12xi32>
+  %padded_0 = tensor.pad %collapse low[1, 0] high[5, 0] {
+  ^bb0(%arg1: index, %arg2: index):
+    %pad_val = arith.index_cast %arg1 : index to i32
+    tensor.yield %pad_val : i32
+  } : tensor<2x12xi32> to tensor<8x12xi32>
+  return %padded_0 : tensor<8x12xi32>
+}
+//      CHECK: func @no_fuse_by_expanding_pad_non_constant_padding(
+// CHECK-SAME:   %[[ARG0:.+]]: tensor<2x3x4xi32>)
+//      CHECK:   %[[COLLAPSE:.+]] = tensor.collapse_shape %[[ARG0]]
+//      CHECK:   %[[PAD:.+]] = tensor.pad %[[COLLAPSE]]
+//      CHECK:   return %[[PAD]]
+
+// -----
+
 func.func @no_fuse_by_expanding_pad(%arg0 : tensor<2x3x4x5x6x7x8x9xi32>) -> tensor<8x12x17x339x14xi32> {
   %collapse = tensor.collapse_shape %arg0 [[0], [1, 2], [3], [4, 5, 6], [7]] : tensor<2x3x4x5x6x7x8x9xi32> into tensor<2x12x5x336x9xi32>
   %cst = arith.constant 0 : i32
@@ -863,6 +880,64 @@ func.func @fuse_by_expanding_dynamic_pad(%arg0 : tensor<?x?x?x?x?x?xi32>, %l0: i
 
 // -----
 
+func.func @expand_shape_with_producer_pad(%arg0: tensor<2x12x5x336x9xi32>) -> tensor<8x3x4x17x6x7x8x14xi32> {
+  %cst = arith.constant 0 : i32
+  %padded = tensor.pad %arg0 low[1, 0, 8, 0, 3] high[5, 0, 4, 0, 2] {
+  ^bb0(%arg1: index, %arg2: index, %arg3: index, %arg4: index, %arg5: index):
+    tensor.yield %cst : i32
+  } : tensor<2x12x5x336x9xi32> to tensor<8x12x17x336x14xi32>
+  %expanded = tensor.expand_shape %padded [[0], [1, 2], [3], [4, 5, 6], [7]] output_shape [8, 3, 4, 17, 6, 7, 8, 14]
+    : tensor<8x12x17x336x14xi32> into tensor<8x3x4x17x6x7x8x14xi32>
+  return %expanded : tensor<8x3x4x17x6x7x8x14xi32>
+}
+//      CHECK: func @expand_shape_with_producer_pad
+// CHECK-SAME:   %[[ARG0:.+]]: tensor<2x12x5x336x9xi32>
+//      CHECK:   %[[EXPAND:.+]] = tensor.expand_shape %[[ARG0]] {{\[}}[0], [1, 2], [3], [4, 5, 6], [7]] output_shape [2, 3, 4, 5, 6, 7, 8, 9]
+//      CHECK:   %[[PAD:.+]] = tensor.pad %[[EXPAND]] low[1, 0, 0, 8, 0, 0, 0, 3] high[5, 0, 0, 4, 0, 0, 0, 2]
+//      CHECK:   return %[[PAD]]
+
+// -----
+
+func.func @expand_shape_with_producer_pad_dynamic(%arg0: tensor<?x?x?x?xf32>,
+    %s0: index, %s1: index, %s2: index, %s3: index, %s4: index, %s5: index,
+    %l0: index, %l1: index, %h0: index, %h1: index) -> tensor<?x?x?x?x?x?xf32> {
+  %cst = arith.constant 0.0 : f32
+  %padded = tensor.pad %arg0 low[%l0, 0, %l1, 0] high[%h0, 0, %h1, 0] {
+  ^bb0(%arg1: index, %arg2: index, %arg3: index, %arg4: index):
+    tensor.yield %cst : f32
+  } : tensor<?x?x?x?xf32> to tensor<?x?x?x?xf32>
+  %expanded = tensor.expand_shape %padded [[0], [1, 2], [3], [4, 5]] output_shape [%s0, %s1, %s2, %s3, %s4, %s5]
+    : tensor<?x?x?x?xf32> into tensor<?x?x?x?x?x?xf32>
+  return %expanded : tensor<?x?x?x?x?x?xf32>
+}
+//      CHECK: func @expand_shape_with_producer_pad_dynamic
+// CHECK-SAME:   %[[ARG0:.+]]: tensor<?x?x?x?xf32>
+// CHECK-SAME:   %[[S0:.+]]: index, %[[S1:.+]]: index, %[[S2:.+]]: index, %[[S3:.+]]: index, %[[S4:.+]]: index, %[[S5:.+]]: index, %[[L0:.+]]: index, %[[L1:.+]]: index, %[[H0:.+]]: index, %[[H1:.+]]: index
+//      CHECK:   %[[DIM0:.+]] = tensor.dim %[[ARG0]], %[[C0:.+]] : tensor<?x?x?x?xf32>
+//      CHECK:   %[[DIM2:.+]] = tensor.dim %[[ARG0]], %[[C2:.+]] : tensor<?x?x?x?xf32>
+//      CHECK:   %[[EXPAND:.+]] = tensor.expand_shape %[[ARG0]] {{\[}}[0], [1, 2], [3], [4, 5]] output_shape [%[[DIM0]], %[[S1]], %[[S2]], %[[DIM2]], %[[S4]], %[[S5]]]
+//      CHECK:   %[[PAD:.+]] = tensor.pad %[[EXPAND]] low[%[[L0]], 0, 0, %[[L1]], 0, 0] high[%[[H0]], 0, 0, %[[H1]], 0, 0]
+//      CHECK:   return %[[PAD]]
+
+// -----
+
+func.func @expand_shape_with_producer_pad_non_constant_padding(%arg0 : tensor<2x12xi32>) -> tensor<8x3x4xi32> {
+  %padded_0 = tensor.pad %arg0 low[1, 0] high[5, 0] {
+  ^bb0(%arg1: index, %arg2: index):
+    %pad_val = arith.index_cast %arg1 : index to i32
+    tensor.yield %pad_val : i32
+  } : tensor<2x12xi32> to tensor<8x12xi32>
+  %expand = tensor.expand_shape %padded_0 [[0], [1, 2]] output_shape [8, 3, 4] : tensor<8x12xi32> into tensor<8x3x4xi32>
+  return %expand : tensor<8x3x4xi32>
+}
+//      CHECK: func @expand_shape_with_producer_pad_non_constant_padding(
+// CHECK-SAME:   %[[ARG0:.+]]: tensor<2x12xi32>)
+//      CHECK:   %[[PAD:.+]] = tensor.pad %[[ARG0]]
+//      CHECK:   %[[EXPAND:.+]] = tensor.expand_shape %[[PAD]]
+//      CHECK:   return %[[EXPAND]]
+
+// -----
+
 func.func @move_operand_deps(%arg0 : tensor<?x128xf16>,
     %arg1 : tensor<4x?x32x128xf16>, %empty : tensor<4x?x32x128xf16>) -> tensor<4x?x32x8x16xf16> {
   %c0 = arith.constant 0 : index
diff --git a/mlir/test/Dialect/OpenACC/legalize-serial.mlir b/mlir/test/Dialect/OpenACC/legalize-serial.mlir
new file mode 100644
index 0000000000000..774c6b6f65ce3
--- /dev/null
+++ b/mlir/test/Dialect/OpenACC/legalize-serial.mlir
@@ -0,0 +1,164 @@
+// RUN: mlir-opt %s -acc-legalize-serial | FileCheck %s
+
+acc.private.recipe @privatization_memref_10_f32 : memref<10xf32> init {
+^bb0(%arg0: memref<10xf32>):
+  %0 = memref.alloc() : memref<10xf32>
+  acc.yield %0 : memref<10xf32>
+} destroy {
+^bb0(%arg0: memref<10xf32>):
+  memref.dealloc %arg0 : memref<10xf32>
+  acc.terminator
+}
+
+acc.private.recipe @privatization_memref_10_10_f32 : memref<10x10xf32> init {
+^bb0(%arg0: memref<10x10xf32>):
+  %0 = memref.alloc() : memref<10x10xf32>
+  acc.yield %0 : memref<10x10xf32>
+} destroy {
+^bb0(%arg0: memref<10x10xf32>):
+  memref.dealloc %arg0 : memref<10x10xf32>
+  acc.terminator
+}
+
+acc.firstprivate.recipe @firstprivatization_memref_10xf32 : memref<10xf32> init {
+^bb0(%arg0: memref<10xf32>):
+  %0 = memref.alloc() : memref<10xf32>
+  acc.yield %0 : memref<10xf32>
+} copy {
+^bb0(%arg0: memref<10xf32>, %arg1: memref<10xf32>):
+  acc.terminator
+} destroy {
+^bb0(%arg0: memref<10xf32>):
+  memref.dealloc %arg0 : memref<10xf32>
+  acc.terminator
+}
+
+acc.reduction.recipe @reduction_add_i64 : i64 reduction_operator<add> init {
+^bb0(%0: i64):
+  %1 = arith.constant 0 : i64
+  acc.yield %1 : i64
+} combiner {
+^bb0(%0: i64, %1: i64):
+  %2 = arith.addi %0, %1 : i64
+  acc.yield %2 : i64
+}
+
+acc.reduction.recipe @reduction_add_memref_i64 : memref<i64> reduction_operator<add> init {
+^bb0(%arg0: memref<i64>):
+  %0 = memref.alloca() : memref<i64>
+  %c0 = arith.constant 0 : i64
+  memref.store %c0, %0[] : memref<i64>
+  acc.yield %0 : memref<i64>
+} combiner {
+^bb0(%arg0: memref<i64>, %arg1: memref<i64>):
+  %0 = memref.load %arg0[] : memref<i64>
+  %1 = memref.load %arg1[] : memref<i64>
+  %2 = arith.addi %0, %1 : i64
+  memref.store %2, %arg0[] : memref<i64>
+  acc.terminator
+}
+
+// CHECK:   func.func @testserialop(%[[VAL_0:.*]]: memref<10xf32>, %[[VAL_1:.*]]: memref<10xf32>, %[[VAL_2:.*]]: memref<10x10xf32>) {
+// CHECK:           %[[VAL_3:.*]] = arith.constant 1 : i64
+// CHECK:           %[[VAL_4:.*]] = arith.constant 1 : i32
+// CHECK:           %[[VAL_5:.*]] = arith.constant 1 : index
+// CHECK:           acc.parallel async(%[[VAL_3]] : i64) num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) {
+// CHECK:           }
+// CHECK:           acc.parallel async(%[[VAL_4]] : i32) num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) {
+// CHECK:           }
+// CHECK:           acc.parallel async(%[[VAL_5]] : index) num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) {
+// CHECK:           }
+// CHECK:           acc.parallel num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) wait({%[[VAL_3]] : i64}) {
+// CHECK:           }
+// CHECK:           acc.parallel num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) wait({%[[VAL_4]] : i32}) {
+// CHECK:           }
+// CHECK:           acc.parallel num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) wait({%[[VAL_5]] : index}) {
+// CHECK:           }
+// CHECK:           acc.parallel num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) wait({%[[VAL_3]] : i64, %[[VAL_4]] : i32, %[[VAL_5]] : index}) {
+// CHECK:           }
+// CHECK:           %[[VAL_6:.*]] = acc.firstprivate varPtr(%[[VAL_1]] : memref<10xf32>) recipe(@firstprivatization_memref_10xf32) -> memref<10xf32>
+// CHECK:           %[[VAL_9:.*]] = acc.private varPtr(%[[VAL_2]] : memref<10x10xf32>) recipe(@privatization_memref_10_10_f32) -> memref<10x10xf32>
+// CHECK:           acc.parallel firstprivate(%[[VAL_6]] : memref<10xf32>) num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) private(%[[VAL_9]] : memref<10x10xf32>) vector_length(%[[VAL_4]] : i32) {
+// CHECK:           }
+// CHECK:           %[[VAL_7:.*]] = acc.copyin varPtr(%[[VAL_0]] : memref<10xf32>) -> memref<10xf32> {dataClause = #acc<data_clause acc_copy>}
+// CHECK:           acc.parallel dataOperands(%[[VAL_7]] : memref<10xf32>) num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) {
+// CHECK:           }
+// CHECK:           %[[I64MEM:.*]] = memref.alloca() : memref<i64>
+// CHECK:           memref.store %[[VAL_3]], %[[I64MEM]][] : memref<i64>
+// CHECK:           %[[VAL_10:.*]] = acc.reduction varPtr(%[[I64MEM]] : memref<i64>) recipe(@reduction_add_memref_i64) -> memref<i64>
+// CHECK:           acc.parallel num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) reduction(%[[VAL_10]] : memref<i64>) {
+// CHECK:           }
+// CHECK:           acc.parallel combined(loop) num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) {
+// CHECK:             acc.loop combined(serial) control(%{{.*}} : index) = (%[[VAL_5]] : index) to (%[[VAL_5]] : index) step (%[[VAL_5]] : index) {
+// CHECK:               acc.yield
+// CHECK:             } attributes {seq = [#acc.device_type<none>]}
+// CHECK:             acc.terminator
+// CHECK:           }
+// CHECK:           acc.parallel num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) {
+// CHECK:           } attributes {defaultAttr = #acc<defaultvalue none>}
+// CHECK:           acc.parallel num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) {
+// CHECK:           } attributes {defaultAttr = #acc<defaultvalue present>}
+// CHECK:           acc.parallel num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) {
+// CHECK:           }
+// CHECK:           acc.parallel num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) {
+// CHECK:           }
+// CHECK:           acc.parallel num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) {
+// CHECK:           } attributes {selfAttr}
+// CHECK:           acc.parallel num_gangs({%[[VAL_4]] : i32}) num_workers(%[[VAL_4]] : i32) vector_length(%[[VAL_4]] : i32) {
+// CHECK:             acc.yield
+// CHECK:           } attributes {selfAttr}
+// CHECK:           return
+// CHECK:         }
+
+func.func @testserialop(%a: memref<10xf32>, %b: memref<10xf32>, %c: memref<10x10xf32>) -> () {
+  %i64value = arith.constant 1 : i64
+  %i32value = arith.constant 1 : i32
+  %idxValue = arith.constant 1 : index
+  acc.serial async(%i64value: i64) {
+  }
+  acc.serial async(%i32value: i32) {
+  }
+  acc.serial async(%idxValue: index) {
+  }
+  acc.serial wait({%i64value: i64}) {
+  }
+  acc.serial wait({%i32value: i32}) {
+  }
+  acc.serial wait({%idxValue: index}) {
+  }
+  acc.serial wait({%i64value : i64, %i32value : i32, %idxValue : index}) {
+  }
+  %firstprivate = acc.firstprivate varPtr(%b : memref<10xf32>) recipe(@firstprivatization_memref_10xf32) -> memref<10xf32>
+  %c_private = acc.private varPtr(%c : memref<10x10xf32>) recipe(@privatization_memref_10_10_f32) -> memref<10x10xf32>
+  acc.serial private(%c_private : memref<10x10xf32>) firstprivate(%firstprivate : memref<10xf32>) {
+  }
+  %copyinfromcopy = acc.copyin varPtr(%a : memref<10xf32>) -> memref<10xf32> {dataClause = #acc<data_clause acc_copy>}
+  acc.serial dataOperands(%copyinfromcopy : memref<10xf32>) {
+  }
+  %i64mem = memref.alloca() : memref<i64>
+  memref.store %i64value, %i64mem[] : memref<i64>
+  %i64reduction = acc.reduction varPtr(%i64mem : memref<i64>) recipe(@reduction_add_memref_i64) -> memref<i64>
+  acc.serial reduction(%i64reduction : memref<i64>) {
+  }
+  acc.serial combined(loop) {
+    acc.loop combined(serial) control(%arg3 : index) = (%idxValue : index) to (%idxValue : index) step (%idxValue : index) {
+      acc.yield
+    } attributes {seq = [#acc.device_type<none>]}
+    acc.terminator
+  }
+  acc.serial {
+  } attributes {defaultAttr = #acc<defaultvalue none>}
+  acc.serial {
+  } attributes {defaultAttr = #acc<defaultvalue present>}
+  acc.serial {
+  } attributes {asyncAttr}
+  acc.serial {
+  } attributes {waitAttr}
+  acc.serial {
+  } attributes {selfAttr}
+  acc.serial {
+    acc.yield
+  } attributes {selfAttr}
+  return
+}
+
diff --git a/mlir/test/Dialect/OpenACC/pointer-like-interface-load.mlir b/mlir/test/Dialect/OpenACC/pointer-like-interface-load.mlir
new file mode 100644
index 0000000000000..36df6a1d1bbe3
--- /dev/null
+++ b/mlir/test/Dialect/OpenACC/pointer-like-interface-load.mlir
@@ -0,0 +1,29 @@
+// RUN: mlir-opt %s --split-input-file --pass-pipeline="builtin.module(func.func(test-acc-pointer-like-interface{test-mode=load}))" 2>&1 | FileCheck %s
+
+func.func @test_memref_load_scalar() {
+  %ptr = memref.alloca() {test.ptr} : memref<f32>
+  // CHECK: Successfully generated load for operation: %[[PTR:.*]] = memref.alloca() {test.ptr} : memref<f32>
+  // CHECK: Loaded value type: f32
+  // CHECK: Generated: %{{.*}} = memref.load %[[PTR]][] : memref<f32>
+  return
+}
+
+// -----
+
+func.func @test_memref_load_int() {
+  %ptr = memref.alloca() {test.ptr} : memref<i64>
+  // CHECK: Successfully generated load for operation: %[[PTR:.*]] = memref.alloca() {test.ptr} : memref<i64>
+  // CHECK: Loaded value type: i64
+  // CHECK: Generated: %{{.*}} = memref.load %[[PTR]][] : memref<i64>
+  return
+}
+
+// -----
+
+func.func @test_memref_load_dynamic() {
+  %c10 = arith.constant 10 : index
+  %ptr = memref.alloc(%c10) {test.ptr} : memref<?xf32>
+  // CHECK: Failed to generate load for operation: %[[PTR:.*]] = memref.alloc(%{{.*}}) {test.ptr} : memref<?xf32>
+  return
+}
+
diff --git a/mlir/test/Dialect/OpenACC/pointer-like-interface-store.mlir b/mlir/test/Dialect/OpenACC/pointer-like-interface-store.mlir
new file mode 100644
index 0000000000000..0fee43102d6d9
--- /dev/null
+++ b/mlir/test/Dialect/OpenACC/pointer-like-interface-store.mlir
@@ -0,0 +1,39 @@
+// RUN: mlir-opt %s --split-input-file --pass-pipeline="builtin.module(func.func(test-acc-pointer-like-interface{test-mode=store}))" 2>&1 | FileCheck %s
+
+func.func @test_memref_store_scalar() {
+  %ptr = memref.alloca() {test.ptr} : memref<f32>
+  // CHECK: Successfully generated store for operation: %[[PTR:.*]] = memref.alloca() {test.ptr} : memref<f32>
+  // CHECK: Generated: %[[VAL:.*]] = arith.constant 4.200000e+01 : f32
+  // CHECK: Generated: memref.store %[[VAL]], %[[PTR]][] : memref<f32>
+  return
+}
+
+// -----
+
+func.func @test_memref_store_int() {
+  %ptr = memref.alloca() {test.ptr} : memref<i32>
+  // CHECK: Successfully generated store for operation: %[[PTR:.*]] = memref.alloca() {test.ptr} : memref<i32>
+  // CHECK: Generated: %[[VAL:.*]] = arith.constant 42 : i32
+  // CHECK: Generated: memref.store %[[VAL]], %[[PTR]][] : memref<i32>
+  return
+}
+
+// -----
+
+func.func @test_memref_store_i64() {
+  %ptr = memref.alloca() {test.ptr} : memref<i64>
+  // CHECK: Successfully generated store for operation: %[[PTR:.*]] = memref.alloca() {test.ptr} : memref<i64>
+  // CHECK: Generated: %[[VAL:.*]] = arith.constant 42 : i64
+  // CHECK: Generated: memref.store %[[VAL]], %[[PTR]][] : memref<i64>
+  return
+}
+
+// -----
+
+func.func @test_memref_store_dynamic() {
+  %c10 = arith.constant 10 : index
+  %ptr = memref.alloc(%c10) {test.ptr} : memref<?xf32>
+  // CHECK: Failed to generate store for operation: %[[PTR:.*]] = memref.alloc(%{{.*}}) {test.ptr} : memref<?xf32>
+  return
+}
+
diff --git a/mlir/test/Dialect/SCF/canonicalize.mlir b/mlir/test/Dialect/SCF/canonicalize.mlir
index 084c3fc065de3..ac590fc0c47b9 100644
--- a/mlir/test/Dialect/SCF/canonicalize.mlir
+++ b/mlir/test/Dialect/SCF/canonicalize.mlir
@@ -974,6 +974,56 @@ func.func @replace_if_with_cond3(%arg0 : i1, %arg2: i64) -> (i32, i64) {
 
 // -----
 
+// CHECK-LABEL: @while_move_if_down
+func.func @while_move_if_down() -> i32 {
+  %defined_outside = "test.get_some_value0" () : () -> (i32)
+  %0 = scf.while () : () -> (i32) {
+    %used_value = "test.get_some_value1" () : () -> (i32)
+    %used_by_subregion = "test.get_some_value2" () : () -> (i32)
+    %else_value = "test.get_some_value3" () : () -> (i32)
+    %condition = "test.condition"() : () -> i1
+    %res = scf.if %condition -> (i32) {
+      "test.use0" (%defined_outside) : (i32) -> ()
+      "test.use1" (%used_value) : (i32) -> ()
+      test.alloca_scope_region {
+        "test.use2" (%used_by_subregion) : (i32) -> ()
+      }
+      %then_value = "test.get_some_value4" () : () -> (i32)
+      scf.yield %then_value : i32
+    } else {
+      scf.yield %else_value : i32
+    }
+    scf.condition(%condition) %res : i32
+  } do {
+  ^bb0(%res_arg: i32):
+    "test.use3" (%res_arg) : (i32) -> ()
+    scf.yield
+  }
+  return %0 : i32
+}
+// CHECK:           %[[defined_outside:.*]] = "test.get_some_value0"() : () -> i32
+// CHECK:           %[[WHILE_RES:.*]]:3 = scf.while : () -> (i32, i32, i32) {
+// CHECK:             %[[used_value:.*]] = "test.get_some_value1"() : () -> i32
+// CHECK:             %[[used_by_subregion:.*]] = "test.get_some_value2"() : () -> i32
+// CHECK:             %[[else_value:.*]] = "test.get_some_value3"() : () -> i32
+// CHECK:             %[[condition:.*]] = "test.condition"() : () -> i1
+// CHECK:             scf.condition(%[[condition]]) %[[else_value]], %[[used_value]], %[[used_by_subregion]] : i32, i32, i32
+// CHECK:           } do {
+// CHECK:           ^bb0(%[[res_arg:.*]]: i32, %[[used_value_arg:.*]]: i32, %[[used_by_subregion_arg:.*]]: i32):
+// CHECK:             "test.use0"(%[[defined_outside]]) : (i32) -> ()
+// CHECK:             "test.use1"(%[[used_value_arg]]) : (i32) -> ()
+// CHECK:             test.alloca_scope_region {
+// CHECK:               "test.use2"(%[[used_by_subregion_arg]]) : (i32) -> ()
+// CHECK:             }
+// CHECK:             %[[then_value:.*]] = "test.get_some_value4"() : () -> i32
+// CHECK:             "test.use3"(%[[then_value]]) : (i32) -> ()
+// CHECK:             scf.yield
+// CHECK:           }
+// CHECK:           return %[[WHILE_RES]]#0 : i32
+// CHECK:         }
+
+// -----
+
 // CHECK-LABEL: @while_cond_true
 func.func @while_cond_true() -> i1 {
   %0 = scf.while () : () -> i1 {
diff --git a/mlir/test/Dialect/SCF/uplift-while.mlir b/mlir/test/Dialect/SCF/uplift-while.mlir
index 736112824c515..cbe2ce5076ad2 100644
--- a/mlir/test/Dialect/SCF/uplift-while.mlir
+++ b/mlir/test/Dialect/SCF/uplift-while.mlir
@@ -185,34 +185,3 @@ func.func @uplift_while(%arg0: index, %arg1: index, %arg2: index) -> (i32, f32)
 //       CHECK:     %[[T2:.*]] = "test.test2"(%[[ARG2]]) : (f32) -> f32
 //       CHECK:     scf.yield %[[T1]], %[[T2]] : i32, f32
 //       CHECK:     return %[[RES]]#0, %[[RES]]#1 : i32, f32
-
-// -----
-
-func.func @uplift_while(%low: index, %upper: index, %val : i32) -> i32 {
-  %c1 = arith.constant 1 : index
-  %1:2 = scf.while (%iv = %low, %iter = %val) : (index, i32) -> (index, i32) {
-    %2 = arith.cmpi slt, %iv, %upper : index
-    %3:2 = scf.if %2 -> (index, i32) {
-      %4 = "test.test"(%iter) : (i32) -> i32
-      %5 = arith.addi %iv, %c1 : index
-      scf.yield %5, %4 : index, i32
-    } else {
-      scf.yield %iv, %iter : index, i32
-    }
-    scf.condition(%2) %3#0, %3#1 : index, i32
-  } do {
-  ^bb0(%arg0: index, %arg1: i32):
-    scf.yield %arg0, %arg1 : index, i32
-  }
-  return %1#1 : i32
-}
-
-// CHECK-LABEL:   func.func @uplift_while(
-// CHECK-SAME:      %[[ARG0:.*]]: index, %[[ARG1:.*]]: index, %[[ARG2:.*]]: i32) -> i32 {
-// CHECK:           %[[CONSTANT_0:.*]] = arith.constant 1 : index
-// CHECK:           %[[FOR_0:.*]] = scf.for %[[VAL_0:.*]] = %[[ARG0]] to %[[ARG1]] step %[[CONSTANT_0]] iter_args(%[[VAL_1:.*]] = %[[ARG2]]) -> (i32) {
-// CHECK:             %[[VAL_2:.*]] = "test.test"(%[[VAL_1]]) : (i32) -> i32
-// CHECK:             scf.yield %[[VAL_2]] : i32
-// CHECK:           }
-// CHECK:           return %[[FOR_0]] : i32
-// CHECK:         }
diff --git a/mlir/test/Dialect/Tensor/bufferize.mlir b/mlir/test/Dialect/Tensor/bufferize.mlir
index 5eb2360a29b8f..be8ce20d8f154 100644
--- a/mlir/test/Dialect/Tensor/bufferize.mlir
+++ b/mlir/test/Dialect/Tensor/bufferize.mlir
@@ -678,11 +678,9 @@ func.func @tensor.concat_different_shapes(%f: tensor<8x4xf32>, %g: tensor<8x5xf3
 // CHECK-DAG:       %[[G_DIM:.*]] = memref.dim %[[G_MEMREF]], %[[c1]]
 // CHECK:           %[[ALLOC:.*]] = memref.alloc
 // CHECK-SAME:                                    memref<8x?xf32>
-// CHECK-DAG:       %[[OFFSET:.*]] = arith.constant 0 : index
-// CHECK:           %[[SUBVIEW1:.*]] = memref.subview %[[ALLOC]][0, %[[OFFSET]]] [8, %[[F_DIM]]] [1, 1]
+// CHECK:           %[[SUBVIEW1:.*]] = memref.subview %[[ALLOC]][0, 0] [8, %[[F_DIM]]] [1, 1]
 // CHECK:           memref.copy %[[F_MEMREF]], %[[SUBVIEW1]]
-// CHECK:           %[[OFFSET_2:.*]] = arith.addi %[[OFFSET]], %[[F_DIM]] : index
-// CHECK:           %[[SUBVIEW2:.*]] = memref.subview %[[ALLOC]][0, %[[OFFSET_2]]] [8, %[[G_DIM]]] [1, 1]
+// CHECK:           %[[SUBVIEW2:.*]] = memref.subview %[[ALLOC]][0, %[[F_DIM]]] [8, %[[G_DIM]]] [1, 1]
 // CHECK:           memref.copy %[[G_MEMREF]], %[[SUBVIEW2]]
 // CHECK:           %[[RET:.*]] = bufferization.to_tensor %[[ALLOC]]
 // CHECK:           return %[[RET]]
@@ -706,10 +704,9 @@ func.func @tensor.concat_dynamic(%f: tensor<8x?xf32>, %g: tensor<8x?xf32>) -> te
 // CHECK:           %[[ALLOC:.*]] = memref.alloc
 // CHECK-SAME:                                    memref<?x?xf32>
 // CHECK-DAG:       %[[NON_CONCAT_DIM:.*]] = memref.dim %[[ALLOC]], %[[c0]]
-// CHECK:           %[[SUBVIEW1:.*]] = memref.subview %[[ALLOC]][0, %[[c0]]] [%[[NON_CONCAT_DIM]], %[[F_DIM]]] [1, 1]
+// CHECK:           %[[SUBVIEW1:.*]] = memref.subview %[[ALLOC]][0, 0] [%[[NON_CONCAT_DIM]], %[[F_DIM]]] [1, 1]
 // CHECK:           memref.copy %[[F_MEMREF]], %[[SUBVIEW1]]
-// CHECK:           %[[OFFSET_2:.*]] = arith.addi %[[c0]], %[[F_DIM]] : index
-// CHECK:           %[[SUBVIEW2:.*]] = memref.subview %[[ALLOC]][0, %[[OFFSET_2]]] [%[[NON_CONCAT_DIM]], %[[G_DIM]]] [1, 1]
+// CHECK:           %[[SUBVIEW2:.*]] = memref.subview %[[ALLOC]][0, %[[F_DIM]]] [%[[NON_CONCAT_DIM]], %[[G_DIM]]] [1, 1]
 // CHECK:           memref.copy %[[G_MEMREF]], %[[SUBVIEW2]]
 // CHECK:           %[[RET:.*]] = bufferization.to_tensor %[[ALLOC]]
 // CHECK:           return %[[RET]]
@@ -721,6 +718,35 @@ func.func @tensor.concat_dynamic_nonconcat_dim(%f: tensor<?x?xf32>, %g: tensor<?
 
 // -----
 
+// CHECK:  #[[$sum_map:.+]] = affine_map<()[s0, s1] -> (s0 + s1)>
+
+// CHECK-LABEL:   func @tensor.concat_mixed_dynamic_static(
+// CHECK-SAME:        %[[F:.*]]: tensor<8x?xf32>, %[[G:.*]]: tensor<8x?xf32>,
+// CHECK-SAME:        %[[H:.*]]: tensor<8x2xf32>)
+// CHECK-DAG:       %[[F_MEMREF:.*]] = bufferization.to_buffer %[[F]]
+// CHECK-DAG:       %[[G_MEMREF:.*]] = bufferization.to_buffer %[[G]]
+// CHECK-DAG:       %[[H_MEMREF:.*]] = bufferization.to_buffer %[[H]]
+// CHECK-DAG:       %[[ALLOC:.*]] = memref.alloc() {alignment = 64 : i64} : memref<8x10xf32>
+// CHECK-DAG:       %[[c1:.*]] = arith.constant 1 : index
+// CHECK:           %[[F_DIM:.*]] = memref.dim %[[F_MEMREF]], %[[c1]]
+// CHECK:           %[[SUBVIEW1:.*]] = memref.subview %[[ALLOC]][0, 0] [8, %[[F_DIM]]] [1, 1]
+// CHECK:           memref.copy %[[F_MEMREF]], %[[SUBVIEW1]]
+// CHECK:           %[[G_DIM:.*]] = memref.dim %[[G_MEMREF]], %[[c1]]
+// CHECK:           %[[SUBVIEW2:.*]] = memref.subview %[[ALLOC]][0, %[[F_DIM]]] [8, %[[G_DIM]]] [1, 1]
+// CHECK:           memref.copy %[[G_MEMREF]], %[[SUBVIEW2]]
+// CHECK:           %[[OFFSET:.*]] = affine.apply #[[$sum_map]]()[%[[F_DIM]], %[[G_DIM]]]
+// CHECK:           %[[SUBVIEW3:.*]] = memref.subview %[[ALLOC]][0, %[[OFFSET]]] [8, 2] [1, 1]
+// CHECK:           memref.copy %[[H_MEMREF]], %[[SUBVIEW3]]
+// CHECK:           %[[RET:.*]] = bufferization.to_tensor %[[ALLOC]]
+// CHECK:           return %[[RET]]
+// CHECK:         }
+func.func @tensor.concat_mixed_dynamic_static(%f: tensor<8x?xf32>, %g: tensor<8x?xf32>, %h: tensor<8x2xf32>) -> tensor<8x10xf32> {
+  %0 = tensor.concat dim(1) %f, %g, %h : (tensor<8x?xf32>, tensor<8x?xf32>, tensor<8x2xf32>) -> tensor<8x10xf32>
+  return %0 : tensor<8x10xf32>
+}
+
+// -----
+
 // CHECK-LABEL: func @tensor.splat_dynamic(
 // CHECK-SAME:  %[[F:[a-zA-Z0-9_]+]]: f32
 // CHECK-SAME:  %[[M:[a-zA-Z0-9_]+]]: index
diff --git a/mlir/test/Dialect/Vector/vector-sink.mlir b/mlir/test/Dialect/Vector/vector-sink.mlir
index 577b06df42929..beaba52af1841 100644
--- a/mlir/test/Dialect/Vector/vector-sink.mlir
+++ b/mlir/test/Dialect/Vector/vector-sink.mlir
@@ -780,7 +780,7 @@ func.func @negative_extract_load_scalable(%arg0: memref<?xf32>, %arg1: index) ->
 }
 
 //-----------------------------------------------------------------------------
-// [Pattern: StoreOpFromSplatOrBroadcast]
+// [Pattern: StoreOpFromBroadcast]
 //-----------------------------------------------------------------------------
 
 // CHECK-LABEL: @store_splat
diff --git a/mlir/test/Dialect/XeGPU/propagate-layout-inst-data.mlir b/mlir/test/Dialect/XeGPU/propagate-layout-inst-data.mlir
index d911baa49acbb..32fb3178a8af2 100644
--- a/mlir/test/Dialect/XeGPU/propagate-layout-inst-data.mlir
+++ b/mlir/test/Dialect/XeGPU/propagate-layout-inst-data.mlir
@@ -6,6 +6,8 @@
 // CHECK: %[[CST:.*]] = arith.constant dense<0.000000e+00> : vector<8x16xf32>
 // CHECK: %[[TDESC_SRC:.*]] = xegpu.create_nd_tdesc %[[ARG0]] : memref<8x32xf32> -> !xegpu.tensor_desc<8x32xf32, #xegpu.layout<inst_data = [8, 16]>>
 // CHECK: %[[TDESC_DST:.*]] = xegpu.create_nd_tdesc %[[ARG1]] : memref<8x32xf32> -> !xegpu.tensor_desc<8x32xf32, #xegpu.layout<inst_data = [8, 16]>>
+// CHECK: xegpu.prefetch_nd %[[TDESC_SRC]] <{l1_hint = #xegpu.cache_hint<cached>, l2_hint = #xegpu.cache_hint<uncached>, layout = #xegpu.layout<inst_data = [8, 16]>}> :
+// CHECK-SAME: !xegpu.tensor_desc<8x32xf32, #xegpu.layout<inst_data = [8, 16]>>
 // CHECK: %[[LOADED:.*]] = xegpu.load_nd %0 <{layout = #xegpu.layout<inst_data = [8, 16]>}> {layout_result_0 = #xegpu.layout<inst_data = [8, 16]>} :
 // CHECK-SAME: !xegpu.tensor_desc<8x32xf32, #xegpu.layout<inst_data = [8, 16]>> -> vector<8x32xf32>
 // CHECK: xegpu.store_nd %[[LOADED]], %[[TDESC_DST]] <{layout = #xegpu.layout<inst_data = [8, 16]>}> : vector<8x32xf32>, !xegpu.tensor_desc<8x32xf32, #xegpu.layout<inst_data = [8, 16]>>
@@ -16,6 +18,7 @@ func.func @load_store_no_array_len(%arg0: memref<8x32xf32>, %arg1: memref<8x32xf
   %cst = arith.constant dense<0.000000e+00> : vector<8x16xf32>
   %0 = xegpu.create_nd_tdesc %arg0 : memref<8x32xf32> -> !xegpu.tensor_desc<8x32xf32>
   %1 = xegpu.create_nd_tdesc %arg1 : memref<8x32xf32> -> !xegpu.tensor_desc<8x32xf32>
+  xegpu.prefetch_nd %0 <{l1_hint = #xegpu.cache_hint<cached>, l2_hint = #xegpu.cache_hint<uncached>}>: !xegpu.tensor_desc<8x32xf32>
   %2 = xegpu.load_nd %0  : !xegpu.tensor_desc<8x32xf32> -> vector<8x32xf32>
   xegpu.store_nd %2, %1  : vector<8x32xf32>, !xegpu.tensor_desc<8x32xf32>
   return
diff --git a/mlir/test/Dialect/XeGPU/subgroup-distribute.mlir b/mlir/test/Dialect/XeGPU/subgroup-distribute.mlir
index 8fd3cca5594cb..22177f8f6a15f 100644
--- a/mlir/test/Dialect/XeGPU/subgroup-distribute.mlir
+++ b/mlir/test/Dialect/XeGPU/subgroup-distribute.mlir
@@ -271,11 +271,11 @@ gpu.module @xevm_module{
 // CHECK: %[[C2:.*]] = arith.constant 2 : index
 // CHECK: %[[C8:.*]] = arith.constant 8 : index
 // CHECK: %[[LANE_ID:.*]] = gpu.lane_id
-// CHECK: %[[REMU1:.*]] = index.remu %[[LANE_ID]], %[[C8]]
-// CHECK: %[[DIVU:.*]] = index.divu %[[LANE_ID]], %[[C8]]
-// CHECK: %[[REMU2:.*]] = index.remu %[[DIVU]], %[[C2]]
-// CHECK: %[[REMU3:.*]] = index.remu %[[REMU2]], %[[C2]]
-// CHECK: %[[REMU4:.*]] = index.remu %[[REMU1]], %[[C8]]
+// CHECK: %[[REMU1:.*]] = arith.remui %[[LANE_ID]], %[[C8]]
+// CHECK: %[[DIVU:.*]] = arith.divui %[[LANE_ID]], %[[C8]]
+// CHECK: %[[REMU2:.*]] = arith.remui %[[DIVU]], %[[C2]]
+// CHECK: %[[REMU3:.*]] = arith.remui %[[REMU2]], %[[C2]]
+// CHECK: %[[REMU4:.*]] = arith.remui %[[REMU1]], %[[C8]]
 // CHECK: %[[MAT:.*]] = xegpu.load_matrix %arg0[%[[REMU3]], %[[REMU4]]] : !xegpu.mem_desc<32x32xf32>, index, index -> vector<1x1xf32>
 // CHECK: xegpu.store_matrix %[[MAT]], %arg0[%[[REMU3]], %[[REMU4]]] : vector<1x1xf32>, !xegpu.mem_desc<32x32xf32>, index, index
 gpu.module @xevm_module{
@@ -294,13 +294,13 @@ gpu.module @xevm_module{
 // CHECK: %[[C4:.*]] = arith.constant 4 : index
 // CHECK: %[[C1:.*]] = arith.constant 1 : index
 // CHECK: %[[LANE_ID:.*]] = gpu.lane_id
-// CHECK: %[[REMU1:.*]] = index.remu %[[LANE_ID]], %[[C4]]
-// CHECK: %[[DIVU:.*]] = index.divu %[[LANE_ID]], %[[C4]]
-// CHECK: %[[REMU2:.*]] = index.remu %[[DIVU]], %[[C4]]
-// CHECK: %[[MUL:.*]] = index.mul %[[REMU2]], %[[C2]]
-// CHECK: %[[REMU3:.*]] = index.remu %[[MUL]], %[[C8]]
-// CHECK: %[[REMU4:.*]] = index.remu %[[REMU1]], %[[C4]]
-// CHECK: %[[ADD:.*]] = index.add %[[REMU4]], %[[C1]]
+// CHECK: %[[REMU1:.*]] = arith.remui %[[LANE_ID]], %[[C4]]
+// CHECK: %[[DIVU:.*]] = arith.divui %[[LANE_ID]], %[[C4]]
+// CHECK: %[[REMU2:.*]] = arith.remui %[[DIVU]], %[[C4]]
+// CHECK: %[[MUL:.*]] = arith.muli %[[REMU2]], %[[C2]]
+// CHECK: %[[REMU3:.*]] = arith.remui %[[MUL]], %[[C8]]
+// CHECK: %[[REMU4:.*]] = arith.remui %[[REMU1]], %[[C4]]
+// CHECK: %[[ADD:.*]] = arith.addi %[[REMU4]], %[[C1]]
 // CHECK: %[[MAT:.*]] = xegpu.load_matrix %arg0[%[[REMU3]], %[[ADD]]] : !xegpu.mem_desc<32x32xf32>, index, index -> vector<2x1xf32>
 // CHECK: xegpu.store_matrix %[[MAT]], %arg0[%[[REMU3]], %[[ADD]]] : vector<2x1xf32>, !xegpu.mem_desc<32x32xf32>, index, index
 gpu.module @xevm_module{
diff --git a/mlir/test/Dialect/XeGPU/xegpu-attr-interface.mlir b/mlir/test/Dialect/XeGPU/xegpu-attr-interface.mlir
index 02c5f71d5c83d..8ce6d4dfd439e 100644
--- a/mlir/test/Dialect/XeGPU/xegpu-attr-interface.mlir
+++ b/mlir/test/Dialect/XeGPU/xegpu-attr-interface.mlir
@@ -3,10 +3,10 @@
 gpu.module @test {
   gpu.func @slice_attr() -> vector<128xindex> {
     // CHECK-DAG: %[[SGID:.*]] = gpu.subgroup_id : index
-    // CHECK-DAG: %[[DIVU:.*]] = index.divu %[[SGID]], %[[C8:.*]]
-    // CHECK-DAG: %[[REMU:.*]] = index.remu %[[DIVU]], %[[C4:.*]]
-    // CHECK-DAG: %[[MUL:.*]] = index.mul %[[REMU]], %[[C32:.*]]
-    // CHECK-DAG: %[[MOD:.*]] = index.remu %[[MUL]], %[[C128:.*]]
+    // CHECK-DAG: %[[DIVU:.*]] = arith.divui %[[SGID]], %[[C8:.*]]
+    // CHECK-DAG: %[[REMU:.*]] = arith.remui %[[DIVU]], %[[C4:.*]]
+    // CHECK-DAG: %[[MUL:.*]] = arith.muli %[[REMU]], %[[C32:.*]]
+    // CHECK-DAG: %[[MOD:.*]] = arith.remui %[[MUL]], %[[C128:.*]]
     // CHECK-DAG: %[[BASE:.*]] = vector.step : vector<32xindex>
     // CHECK-DAG: %[[CAST:.*]] = vector.broadcast %[[MOD]] : index to vector<32xindex>
     // CHECK-DAG: %[[ADD:.*]] = arith.addi %[[BASE]], %[[CAST]] : vector<32xindex>
@@ -16,11 +16,10 @@ gpu.module @test {
 
   gpu.func @nested_slice_attr() -> vector<128xindex> {
     // CHECK-DAG: %[[SGID:.*]] = gpu.subgroup_id : index
-    // CHECK-DAG: %[[DIVU1:.*]] = index.divu %[[SGID]], %[[C1:.*]]
-    // CHECK-DAG: %[[DIVU2:.*]] = index.divu %[[DIVU1]], %[[C8:.*]]
-    // CHECK-DAG: %[[REMU:.*]] = index.remu %[[DIVU2]], %[[C4:.*]]
-    // CHECK-DAG: %[[MUL:.*]] = index.mul %[[REMU]], %[[C32:.*]]
-    // CHECK-DAG: %[[MOD:.*]] = index.remu %[[MUL]], %[[C128:.*]]
+    // CHECK-DAG: %[[DIVU2:.*]] = arith.divui %[[SGID]], %[[C8:.*]]
+    // CHECK-DAG: %[[REMU:.*]] = arith.remui %[[DIVU2]], %[[C4:.*]]
+    // CHECK-DAG: %[[MUL:.*]] = arith.muli %[[REMU]], %[[C32:.*]]
+    // CHECK-DAG: %[[MOD:.*]] = arith.remui %[[MUL]], %[[C128:.*]]
     // CHECK-DAG: %[[BASE:.*]] = vector.step : vector<32xindex>
     // CHECK-DAG: %[[CAST:.*]] = vector.broadcast %[[MOD]] : index to vector<32xindex>
     // CHECK-DAG: %[[ADD:.*]] = arith.addi %[[BASE]], %[[CAST]] : vector<32xindex>
@@ -29,4 +28,3 @@ gpu.module @test {
   }
 
 }
-
diff --git a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir
index 01134d8eaabec..4829af3612de3 100644
--- a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir
+++ b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-rr.mlir
@@ -16,18 +16,18 @@ gpu.module @test_round_robin_assignment {
   gpu.func @create_nd_tdesc_with_shared_data(%src: memref<256x128xf32>) {
     // CHECK: %[[SGID:.*]] = gpu.subgroup_id : index
     // CHECK: %[[C4:.*]] = arith.constant 4 : index
-    // CHECK: %[[IDX:.*]] = index.remu %[[SGID]], %[[C4]]
-    // CHECK: %[[IDY_DIV:.*]] = index.divu %[[SGID]], %[[C4]]
+    // CHECK: %[[IDX:.*]] = arith.remui %[[SGID]], %[[C4]]
+    // CHECK: %[[IDY_DIV:.*]] = arith.divui %[[SGID]], %[[C4]]
     // CHECK: %[[C8:.*]] = arith.constant 8 : index
-    // CHECK: %[[IDY:.*]] = index.remu %[[IDY_DIV]], %[[C8]]
+    // CHECK: %[[IDY:.*]] = arith.remui %[[IDY_DIV]], %[[C8]]
     // CHECK: %[[C16:.*]] = arith.constant 16 : index
-    // CHECK: %[[LY:.*]] = index.mul %[[IDY]], %[[C16]]
+    // CHECK: %[[LY:.*]] = arith.muli %[[IDY]], %[[C16]]
     // CHECK: %[[C64:.*]] = arith.constant 64 : index
-    // CHECK: %[[LX:.*]] = index.mul %[[IDX]], %[[C64]]
+    // CHECK: %[[LX:.*]] = arith.muli %[[IDX]], %[[C64]]
     // CHECK: %[[C128:.*]] = arith.constant 128 : index
-    // CHECK: %[[OFFY:.*]] = index.remu %[[LY]], %[[C128]]
+    // CHECK: %[[OFFY:.*]] = arith.remui %[[LY]], %[[C128]]
     // CHECK: %[[C64_1:.*]] = arith.constant 64 : index
-    // CHECK: %[[OFFX:.*]] = index.remu %[[LX]], %[[C64_1]]
+    // CHECK: %[[OFFX:.*]] = arith.remui %[[LX]], %[[C64_1]]
     // CHECK: xegpu.create_nd_tdesc %[[ARG_0]][%[[OFFY]], %[[OFFX]]] : memref<256x128xf32> -> !xegpu.tensor_desc<16x64xf32>
     %tdesc = xegpu.create_nd_tdesc %src[0, 0] : memref<256x128xf32>
       -> !xegpu.tensor_desc<128x64xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [16, 64]>>
diff --git a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops-rr.mlir b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops-rr.mlir
index 1cddccb5fbbd1..eae51a16053d8 100644
--- a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops-rr.mlir
+++ b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops-rr.mlir
@@ -90,30 +90,27 @@ gpu.module @test_distribution {
     gpu.return
   }
 
+  // CHECK-LABEL: non_splat_constant
   gpu.func @non_splat_constant() {
-    // CHECK-DAG: %[[BASECST:.*]] = arith.constant dense<{{.*}}> : vector<2x1xindex>
+    // CHECK-DAG: %[[CST:.*]] = arith.constant dense<{{.*}}0{{.*}}, {{.*}}16{{.*}}> : vector<2x1xindex>
     // CHECK-DAG: %[[SGID:.*]] = gpu.subgroup_id : index
-    // CHECK-DAG: %[[REMU1:.*]] = index.remu %[[SGID]], %[[C1:.*]]
-    // CHECK-DAG: %[[DIVU:.*]] = index.divu %[[SGID]], %[[C1:.*]]
-    // CHECK-DAG: %[[REMU2:.*]] = index.remu %[[DIVU]], %[[C8:.*]]
-    // CHECK-DAG: %[[MUL:.*]] = index.mul %[[REMU2]], %[[C2:.*]]
-    // CHECK-DAG: %[[REMU3:.*]] = index.remu %[[MUL]], %[[C32:.*]]
-    // CHECK-DAG: %[[REMU4:.*]] = index.remu %[[REMU1]], %[[C1:.*]]
-    // CHECK-DAG: %[[ADD16:.*]] = arith.addi %[[MUL]], %[[C16:.*]] : index
-    // CHECK-DAG: %[[REMU5:.*]] = index.remu %[[ADD16]], %[[C32:.*]]
-    // CHECK-DAG: %[[REMU6:.*]] = index.remu %[[REMU1]], %[[C1:.*]]
-    // CHECK-DAG: %[[STRIDE1:.*]] = arith.muli %[[REMU3]], %[[C16:.*]] : index
-    // CHECK-DAG: %[[ADDSTRIDES:.*]] = arith.addi %[[C0:.*]], %[[STRIDE1]] : index
-    // CHECK-DAG: %[[STRIDE2:.*]] = arith.muli %[[REMU4]], %[[C0:.*]] : index
-    // CHECK-DAG: %[[ADDSTRIDES1:.*]] = arith.addi %[[ADDSTRIDES]], %[[STRIDE2]] : index
-    // CHECK-DAG: %[[BCAST1:.*]] = vector.broadcast %[[ADDSTRIDES1]] : index to vector<2x1xindex>
-    // CHECK-DAG: %[[RESULT1:.*]] = arith.addi %[[BASECST]], %[[BCAST1]] : vector<2x1xindex>
-    // CHECK-DAG: %[[STRIDE3:.*]] = arith.muli %[[REMU5]], %[[C16:.*]] : index
-    // CHECK-DAG: %[[ADDSTRIDES2:.*]] = arith.addi %[[C0:.*]], %[[STRIDE3]] : index
-    // CHECK-DAG: %[[STRIDE4:.*]] = arith.muli %[[REMU6]], %[[C0:.*]] : index
-    // CHECK-DAG: %[[ADDSTRIDES3:.*]] = arith.addi %[[ADDSTRIDES2]], %[[STRIDE4]] : index
-    // CHECK-DAG: %[[BCAST2:.*]] = vector.broadcast %[[ADDSTRIDES3]] : index to vector<2x1xindex>
-    // CHECK-DAG: %[[RESULT2:.*]] = arith.addi %[[BASECST]], %[[BCAST2]] : vector<2x1xindex>
+    // CHECK-DAG: %[[T1:.*]] = arith.remui %[[SGID]], %[[C8:.*]] : index
+    // CHECK-DAG: %[[T2:.*]] = arith.muli %[[T1]], %[[C2:.*]] : index
+    // CHECK-DAG: %[[T3:.*]] = arith.remui %[[T2]], %[[C32:.*]] : index
+    // CHECK-DAG: %[[T4:.*]] = arith.addi %[[T2]], %[[C16:.*]] : index
+    // CHECK-DAG: %[[T5:.*]] = arith.remui %[[T4]], %[[C32_6:.*]] : index
+    // CHECK-DAG: %[[T6:.*]] = arith.muli %[[T3]], %[[C16_10:.*]] : index
+    // CHECK-DAG: %[[T7:.*]] = arith.addi %[[C0_11:.*]], %[[T6]] : index
+    // CHECK-DAG: %[[T8:.*]] = arith.muli %[[C0_4:.*]], %[[C0_9:.*]] : index
+    // CHECK-DAG: %[[T9:.*]] = arith.addi %[[T7]], %[[T8]] : index
+    // CHECK-DAG: %[[T10:.*]] = vector.broadcast %[[T9]] : index to vector<2x1xindex>
+    // CHECK-DAG: %[[T11:.*]] = arith.addi %[[CST]], %[[T10]] : vector<2x1xindex>
+    // CHECK-DAG: %[[T12:.*]] = arith.muli %[[T5]], %[[C16_10:.*]] : index
+    // CHECK-DAG: %[[T13:.*]] = arith.addi %[[C0_12:.*]], %[[T12]] : index
+    // CHECK-DAG: %[[T14:.*]] = arith.muli %[[C0_8:.*]], %[[C0_9:.*]] : index
+    // CHECK-DAG: %[[T15:.*]] = arith.addi %[[T13]], %[[T14]] : index
+    // CHECK-DAG: %[[T16:.*]] = vector.broadcast %[[T15]] : index to vector<2x1xindex>
+    // CHECK-DAG: %[[T17:.*]] = arith.addi %[[CST]], %[[T16]] : vector<2x1xindex>
     %cst_2 = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [8, 1], sg_data = [2, 1]>} dense<[[0], [16], [32], [48], [64], [80], [96], [112], [128], [144], [160], [176], [192], [208], [224], [240], [256], [272], [288], [304], [320], [336], [352], [368], [384], [400], [416], [432], [448], [464], [480], [496]]> : vector<32x1xindex>
     gpu.return
   }
@@ -139,4 +136,3 @@ gpu.module @test_distribution {
     gpu.return
   }
 }
-
diff --git a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops.mlir b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops.mlir
index 574b365443a0a..98920d61c4f58 100644
--- a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops.mlir
+++ b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg-unify-ops.mlir
@@ -27,17 +27,17 @@ gpu.module @test_distribution {
     //CHECK: %[[TDESC:.*]] = xegpu.create_nd_tdesc %{{.*}} : memref<256x128xf32> -> !xegpu.tensor_desc<32x32xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
     //CHECK-DAG: %[[SGID:.*]] = gpu.subgroup_id : index
     //CHECK-DAG: %[[C4:.*]] = arith.constant 4 : index
-    //CHECK-DAG: %[[SGIDX:.*]] = index.remu %[[SGID]], %[[C4]]
-    //CHECK-DAG: %[[SGIDY_TMP:.*]] = index.divu %[[SGID]], %[[C4]]
+    //CHECK-DAG: %[[SGIDX:.*]] = arith.remui %[[SGID]], %[[C4]]
+    //CHECK-DAG: %[[SGIDY_TMP:.*]] = arith.divui %[[SGID]], %[[C4]]
     //CHECK-DAG: %[[C8:.*]] = arith.constant 8 : index
-    //CHECK-DAG: %[[SGIDY:.*]] = index.remu %[[SGIDY_TMP]], %[[C8]]
+    //CHECK-DAG: %[[SGIDY:.*]] = arith.remui %[[SGIDY_TMP]], %[[C8]]
     //CHECK-DAG: %[[C32:.*]] = arith.constant 32 : index
-    //CHECK-DAG: %[[L_OFF_Y:.*]] = index.mul %[[SGIDY]], %[[C32]]
-    //CHECK-DAG: %[[L_OFF_X:.*]] = index.mul %[[SGIDX]], %[[C32]]
+    //CHECK-DAG: %[[L_OFF_Y:.*]] = arith.muli %[[SGIDY]], %[[C32]] : index
+    //CHECK-DAG: %[[L_OFF_X:.*]] = arith.muli %[[SGIDX]], %[[C32_1:.*]] : index
     //CHECK-DAG: %[[C256:.*]] = arith.constant 256 : index
-    //CHECK-DAG: %[[OFF_Y:.*]] = index.remu %[[L_OFF_Y]], %[[C256]]
+    //CHECK-DAG: %[[OFF_Y:.*]] = arith.remui %[[L_OFF_Y]], %[[C256]] : index
     //CHECK-DAG: %[[C128:.*]] = arith.constant 128 : index
-    //CHECK-DAG: %[[OFF_X:.*]] = index.remu %[[L_OFF_X]], %[[C128]]
+    //CHECK-DAG: %[[OFF_X:.*]] = arith.remui %[[L_OFF_X]], %[[C128]] : index
     //CHECK-DAG: %[[LOAD:.*]] = xegpu.load_nd %[[TDESC]][{{%.*}}, {{%.*}}] : !xegpu.tensor_desc<32x32xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>> -> vector<32x32xf32>
     %tdesc = xegpu.create_nd_tdesc %src : memref<256x128xf32>
       -> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
@@ -293,7 +293,7 @@ gpu.module @test_distribution {
     %val = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [32], sg_data = [8], inst_data = [8]>} dense<25.5> : vector<256xf16>
     %offset = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [32], sg_data = [8], inst_data = [8]>} dense<0> : vector<256xindex>
     %mask = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [32], sg_data = [8], inst_data = [8]>} dense<1> : vector<256xi1>
-    xegpu.store %val, %dest[%offset], %mask {chunk_size = 1, layout_operand_0 = #xegpu.layout<sg_layout = [32], sg_data = [8], inst_data = [8]>, 
+    xegpu.store %val, %dest[%offset], %mask {chunk_size = 1, layout_operand_0 = #xegpu.layout<sg_layout = [32], sg_data = [8], inst_data = [8]>,
                                              layout_operand_2 = #xegpu.layout<sg_layout = [32], sg_data = [8], inst_data = [8]>,
                                              layout_operand_3 = #xegpu.layout<sg_layout = [32], sg_data = [8], inst_data = [8]>,
                                              l1_hint = #xegpu.cache_hint<cached>}
@@ -321,18 +321,18 @@ gpu.module @test_distribution {
     //CHECK: [[mdesc:%.+]] = xegpu.create_mem_desc [[arg0]] : memref<32768xi8, 3> -> !xegpu.mem_desc<64x128xf32>
     //CHECK: [[sgid:%.+]] = gpu.subgroup_id : index
     //CHECK: [[c4:%.+]] = arith.constant 4 : index
-    //CHECK: [[sgidx:%.+]] = index.remu [[sgid]], [[c4]]
-    //CHECK: [[sgidy_tmp:%.+]] = index.divu [[sgid]], [[c4]]
+    //CHECK: [[sgidx:%.+]] = arith.remui [[sgid]], [[c4]] : index
+    //CHECK: [[sgidy_tmp:%.+]] = arith.divui [[sgid]], [[c4]] : index
     //CHECK: [[c2:%.+]] = arith.constant 2 : index
-    //CHECK: [[sgidy:%.+]] = index.remu [[sgidy_tmp]], [[c2]]
+    //CHECK: [[sgidy:%.+]] = arith.remui [[sgidy_tmp]], [[c2]] : index
     //CHECK: [[c32:%.+]] = arith.constant 32 : index
-    //CHECK: [[l_off_y:%.+]] = index.mul [[sgidy]], [[c32]]
+    //CHECK: [[l_off_y:%.+]] = arith.muli [[sgidy]], [[c32]] : index
     //CHECK: [[c32_0:%.+]] = arith.constant 32 : index
-    //CHECK: [[l_off_x:%.+]] = index.mul [[sgidx]], [[c32_0]]
+    //CHECK: [[l_off_x:%.+]] = arith.muli [[sgidx]], [[c32_0]] : index
     //CHECK: [[c64:%.+]] = arith.constant 64 : index
-    //CHECK: [[off_y:%.+]] = index.remu [[l_off_y]], [[c64]]
+    //CHECK: [[off_y:%.+]] = arith.remui [[l_off_y]], [[c64]] : index
     //CHECK: [[c128:%.+]] = arith.constant 128 : index
-    //CHECK: [[off_x:%.+]] = index.remu [[l_off_x]], [[c128]]
+    //CHECK: [[off_x:%.+]] = arith.remui [[l_off_x]], [[c128]] : index
     //CHECK: xegpu.load_matrix [[mdesc]][[[off_y]], [[off_x]]] <{layout = #xegpu.layout<lane_layout = [2, 8], lane_data = [1, 1]>}>: !xegpu.mem_desc<64x128xf32>, index, index -> vector<32x32xf32>
     %0 = xegpu.create_mem_desc %arg0 : memref<32768xi8, 3> -> !xegpu.mem_desc<64x128xf32>
     %1 = xegpu.load_matrix %0[0, 0] <{layout = #xegpu.layout<sg_layout = [2, 4], sg_data = [32, 32], lane_layout = [2, 8], lane_data = [1, 1]>}>: !xegpu.mem_desc<64x128xf32> -> vector<64x128xf32>
@@ -346,18 +346,18 @@ gpu.module @test_distribution {
     //CHECK: [[mdesc:%.+]] = xegpu.create_mem_desc [[arg0]] : memref<32768xi8, 3> -> !xegpu.mem_desc<64x128xf32>
     //CHECK: [[sgid:%.+]] = gpu.subgroup_id : index
     //CHECK: [[c4:%.+]] = arith.constant 4 : index
-    //CHECK: [[sgidx:%.+]] = index.remu [[sgid]], [[c4]]
-    //CHECK: [[sgidy_tmp:%.+]] = index.divu [[sgid]], [[c4]]
+    //CHECK: [[sgidx:%.+]] = arith.remui [[sgid]], [[c4]] : index
+    //CHECK: [[sgidy_tmp:%.+]] = arith.divui [[sgid]], [[c4]] : index
     //CHECK: [[c2:%.+]] = arith.constant 2 : index
-    //CHECK: [[sgidy:%.+]] = index.remu [[sgidy_tmp]], [[c2]]
+    //CHECK: [[sgidy:%.+]] = arith.remui [[sgidy_tmp]], [[c2]] : index
     //CHECK: [[c32:%.+]] = arith.constant 32 : index
-    //CHECK: [[l_off_y:%.+]] = index.mul [[sgidy]], [[c32]]
+    //CHECK: [[l_off_y:%.+]] = arith.muli [[sgidy]], [[c32]] : index
     //CHECK: [[c32_0:%.+]] = arith.constant 32 : index
-    //CHECK: [[l_off_x:%.+]] = index.mul [[sgidx]], [[c32_0]]
+    //CHECK: [[l_off_x:%.+]] = arith.muli [[sgidx]], [[c32_0]] : index
     //CHECK: [[c64:%.+]] = arith.constant 64 : index
-    //CHECK: [[off_y:%.+]] = index.remu [[l_off_y]], [[c64]]
+    //CHECK: [[off_y:%.+]] = arith.remui [[l_off_y]], [[c64]] : index
     //CHECK: [[c128:%.+]] = arith.constant 128 : index
-    //CHECK: [[off_x:%.+]] = index.remu [[l_off_x]], [[c128]]
+    //CHECK: [[off_x:%.+]] = arith.remui [[l_off_x]], [[c128]] : index
     //CHECK: xegpu.store_matrix [[cst]], [[mdesc]][[[off_y]], [[off_x]]] : vector<32x32xf32>, !xegpu.mem_desc<64x128xf32>, index, index
     %cst = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [2, 4], sg_data = [32, 32]>} dense<1.0> : vector<64x128xf32>
     %mdesc = xegpu.create_mem_desc %arg0 : memref<32768xi8, 3> -> !xegpu.mem_desc<64x128xf32>
@@ -409,14 +409,14 @@ gpu.module @test_distribution {
   gpu.func @vector_step_op_slice_attr() {
     //CHECK: [[sgId:%.+]] = gpu.subgroup_id : index
     //CHECK: [[c8:%.+]] = arith.constant 8 : index
-    //CHECK: [[sgidx:%.+]] = index.remu [[sgId]], [[c8]]
-    //CHECK: [[sgidy_tmp:%.+]] = index.divu [[sgId]], [[c8]]
+    //CHECK: [[sgidx:%.+]] = arith.remui [[sgId]], [[c8]] : index
+    //CHECK: [[sgidy_tmp:%.+]] = arith.divui [[sgId]], [[c8]] : index
     //CHECK: [[c4:%.+]] = arith.constant 4 : index
-    //CHECK: [[sgidy:%.+]] = index.remu [[sgidy_tmp]], [[c4]]
+    //CHECK: [[sgidy:%.+]] = arith.remui [[sgidy_tmp]], [[c4]] : index
     //CHECK: [[c32:%.+]] = arith.constant 32 : index
-    //CHECK: [[LY:%.+]] = index.mul [[sgidy]], [[c32]]
+    //CHECK: [[LY:%.+]] = arith.muli [[sgidy]], [[c32]] : index
     //CHECK: [[c128:%.+]] = arith.constant 128 : index
-    //CHECK: [[MODY:%.+]] = index.remu [[LY]], [[c128]]
+    //CHECK: [[MODY:%.+]] = arith.remui [[LY]], [[c128]] : index
     //CHECK: [[BASE:%.+]] = vector.step : vector<32xindex>
     //CHECK: [[CAST:%.+]] = vector.broadcast [[MODY]] : index to vector<32xindex>
     //CHECK: [[ADD:%.+]] = arith.addi [[BASE]], [[CAST]] : vector<32xindex>
@@ -427,11 +427,11 @@ gpu.module @test_distribution {
   gpu.func @vector_step_op_layout_attr() {
     //CHECK: [[sgId:%.+]] = gpu.subgroup_id : index
     //CHECK: [[c16:%.+]] = arith.constant 16 : index
-    //CHECK: [[sgidx:%.+]] = index.remu [[sgId]], [[c16]]
+    //CHECK: [[sgidx:%.+]] = arith.remui [[sgId]], [[c16]] : index
     //CHECK: [[c8:%.+]] = arith.constant 8 : index
-    //CHECK: [[LOCALY:%.+]] = index.mul [[sgidx]], [[c8]]
+    //CHECK: [[LOCALY:%.+]] = arith.muli [[sgidx]], [[c8]] : index
     //CHECK: [[c128:%.+]] = arith.constant 128 : index
-    //CHECK: [[MODY:%.+]] = index.remu [[LOCALY]], [[c128]]
+    //CHECK: [[MODY:%.+]] = arith.remui [[LOCALY]], [[c128]] : index
     //CHECK: [[BASE:%.+]] = vector.step : vector<8xindex>
     //CHECK: [[CAST:%.+]] = vector.broadcast [[MODY]] : index to vector<8xindex>
     //CHECK: [[ADD:%.+]] = arith.addi [[BASE]], [[CAST]] : vector<8xindex>
@@ -479,18 +479,15 @@ gpu.module @test_distribution {
   // CHECK-LABEL: non_splat_constant_2D
   gpu.func @non_splat_constant_2D() {
     // CHECK-DAG: %[[CST:.*]] = arith.constant dense<0> : vector<1x1xindex>
-    // CHECK-DAG: %[[SGID:.*]] = gpu.subgroup_id : index
-    // CHECK-DAG: %[[SGIDX:.*]] = index.remu %[[SGID]], %{{.*}}
-    // CHECK-DAG: %[[SGIDY_TMP:.*]] = index.divu %[[SGID]], %{{.*}}
-    // CHECK-DAG: %[[SGIDY:.*]] = index.remu %[[SGIDY_TMP]], %{{.*}}
-    // CHECK-DAG: %[[IDY:.*]] = index.remu %[[SGIDY]], %{{.*}}
-    // CHECK-DAG: %[[IDX:.*]] = index.remu %[[SGIDX]], %{{.*}}
-    // CHECK-DAG: %[[STRIDECOL:.*]] = arith.muli %[[IDY]], %[[C16:.*]] : index
-    // CHECK-DAG: %[[ADD:.*]] = arith.addi %[[C0:.*]], %[[STRIDECOL]] : index
-    // CHECK-DAG: %[[STRIDEROW:.*]] = arith.muli %[[IDX]], %[[C0:.*]] : index
-    // CHECK-DAG: %[[ADDSTRIDES:.*]] = arith.addi %[[ADD]], %[[STRIDEROW]] : index
-    // CHECK-DAG: %[[BCAST:.*]] = vector.broadcast %[[ADDSTRIDES]] : index to vector<1x1xindex>
-    // CHECK-DAG: arith.addi %[[CST]], %[[BCAST]] : vector<1x1xindex>
+    // CHECK-DAG: %[[T0:.*]] = gpu.subgroup_id : index
+    // CHECK-DAG: %[[T1:.*]] = arith.remui %[[T0]], %[[C32:.*]] : index
+    // CHECK-DAG: %[[T2:.*]] = arith.remui %[[T1]], %[[C32_4:.*]] : index
+    // CHECK-DAG: %[[T3:.*]] = arith.muli %[[T2]], %[[C16:.*]] : index
+    // CHECK-DAG: %[[T4:.*]] = arith.addi %[[C0_8:.*]], %[[T3]] : index
+    // CHECK-DAG: %[[T5:.*]] = arith.muli %[[C0_6:.*]], %[[C0_7:.*]] : index
+    // CHECK-DAG: %[[T6:.*]] = arith.addi %[[T4]], %[[T5]] : index
+    // CHECK-DAG: %[[T7:.*]] = vector.broadcast %[[T6]] : index to vector<1x1xindex>
+    // CHECK-DAG: %[[T8:.*]] = arith.addi %[[CST]], %[[T7]] : vector<1x1xindex>
     %cst = arith.constant {layout_result_0 = #xegpu.layout<sg_layout = [32, 1], sg_data = [1, 1]>} dense<[[0], [16], [32], [48], [64], [80], [96], [112], [128], [144], [160], [176], [192], [208], [224], [240], [256], [272], [288], [304], [320], [336], [352], [368], [384], [400], [416], [432], [448], [464], [480], [496]]> : vector<32x1xindex>
     gpu.return
   }
@@ -499,13 +496,13 @@ gpu.module @test_distribution {
   gpu.func @non_splat_constant_2D_non_unit_dim() {
     // CHECK-DAG: %[[BASECST:.*]] = arith.constant dense<{{\[}}{{\[}}0, 16{{\]}}, {{\[}}8, 24{{\]}}{{\]}}> : vector<2x2xindex>
     // CHECK-DAG: %[[SGID:.*]] = gpu.subgroup_id : index
-    // CHECK-DAG: %[[SGIDX:.*]] = index.remu %[[SGID]], %{{.*}}
-    // CHECK-DAG: %[[SGIDY_TMP:.*]] = index.divu %[[SGID]], %{{.*}}
-    // CHECK-DAG: %[[SGIDY:.*]] = index.remu %[[SGIDY_TMP]], %{{.*}}
-    // CHECK-DAG: %[[MULY:.*]] = index.mul %[[SGIDY]], %[[C2:.*]]
-    // CHECK-DAG: %[[MULX:.*]] = index.mul %[[SGIDX]], %{{.*}}
-    // CHECK-DAG: %[[REMU_Y:.*]] = index.remu %[[MULY]], %[[C8:.*]]
-    // CHECK-DAG: %[[REMU_X:.*]] = index.remu %[[MULX]], %{{.*}}
+    // CHECK-DAG: %[[SGIDX:.*]] = arith.remui %[[SGID]], %{{.*}}
+    // CHECK-DAG: %[[SGIDY_TMP:.*]] = arith.divui %[[SGID]], %{{.*}}
+    // CHECK-DAG: %[[SGIDY:.*]] = arith.remui %[[SGIDY_TMP]], %{{.*}}
+    // CHECK-DAG: %[[MULY:.*]] = arith.muli %[[SGIDY]], %[[C2:.*]] : index
+    // CHECK-DAG: %[[MULX:.*]] = arith.muli %[[SGIDX]], %{{.*}} : index
+    // CHECK-DAG: %[[REMU_Y:.*]] = arith.remui %[[MULY]], %[[C8:.*]] : index
+    // CHECK-DAG: %[[REMU_X:.*]] = arith.remui %[[MULX]], %{{.*}} : index
     // CHECK-DAG: %[[MUL5:.*]] = arith.muli %[[REMU_Y]], %{{.*}} : index
     // CHECK-DAG: %[[ADD:.*]] = arith.addi %[[C0:.*]], %[[MUL5]] : index
     // CHECK-DAG: %[[MUL6:.*]] = arith.muli %[[REMU_X]], %[[C16:.*]] : index
@@ -529,8 +526,8 @@ gpu.module @test_distribution {
   gpu.func @non_splat_constant() {
     // CHECK-DAG: %[[CST:.*]] = arith.constant dense<0> : vector<1xindex>
     // CHECK-DAG: %[[SGID:.*]] = gpu.subgroup_id : index
-    // CHECK-DAG: %[[REMU:.*]] = index.remu %[[SGID]], %{{.*}}
-    // CHECK-DAG: %[[REMU2:.*]] = index.remu %[[REMU]], %{{.*}}
+    // CHECK-DAG: %[[REMU:.*]] = arith.remui %[[SGID]], %{{.*}}
+    // CHECK-DAG: %[[REMU2:.*]] = arith.remui %[[REMU]], %{{.*}}
     // CHECK-DAG: %[[MUL:.*]] = arith.muli %[[REMU2]], %[[C16:.*]] : index
     // CHECK-DAG: %[[ADDSTRIDES:.*]] = arith.addi %[[C0:.*]], %[[MUL]] : index
     // CHECK-DAG: %[[BCAST:.*]] = vector.broadcast %[[ADDSTRIDES]] : index to vector<1xindex>
@@ -551,9 +548,9 @@ gpu.module @test_distribution {
   // CHECK-LABEL: vector_mask_1D
   gpu.func @vector_mask_1D() {
     // CHECK-DAG: %[[SGID:.*]] = gpu.subgroup_id : index
-    // CHECK-DAG: %[[REMU:.*]] = index.remu %[[SGID]], %[[C2:.*]]
-    // CHECK-DAG: %[[MUL:.*]] = index.mul %[[REMU]], %[[C16:.*]]
-    // CHECK-DAG: %[[REMU2:.*]] = index.remu %[[MUL]], %[[C32:.*]]
+    // CHECK-DAG: %[[REMU:.*]] = arith.remui %[[SGID]], %[[C2:.*]]
+    // CHECK-DAG: %[[MUL:.*]] = arith.muli %[[REMU]], %[[C16:.*]] : index
+    // CHECK-DAG: %[[REMU2:.*]] = arith.remui %[[MUL]], %[[C32:.*]] : index
     // CHECK-DAG: %[[SUB:.*]] = arith.subi %[[C8:.*]], %[[REMU2]] : index
     // CHECK-DAG: %[[MAX:.*]] = arith.maxsi %[[SUB]], %[[C0:.*]] : index
     // CHECK-DAG: %[[MIN:.*]] = arith.minsi %[[MAX]], %[[C16:.*]] : index
@@ -565,13 +562,13 @@ gpu.module @test_distribution {
   // CHECK-LABEL: vector_mask_2D
   gpu.func @vector_mask_2D() {
     // CHECK-DAG: %[[SGID:.*]] = gpu.subgroup_id : index
-    // CHECK-DAG: %[[SGIDX:.*]] = index.remu %[[SGID]], %[[C4:.*]]
-    // CHECK-DAG: %[[SGIDY_TMP:.*]] = index.divu %[[SGID]], %[[C4:.*]]
-    // CHECK-DAG: %[[SGIDY:.*]] = index.remu %[[SGIDY_TMP]], %[[C8:.*]]
-    // CHECK-DAG: %[[ROW:.*]] = index.mul %[[SGIDY]], %[[C32:.*]]
-    // CHECK-DAG: %[[COL:.*]] = index.mul %[[SGIDX]], %[[C32:.*]]
-    // CHECK-DAG: %[[MODROW:.*]] = index.remu %[[ROW]], %[[C256:.*]]
-    // CHECK-DAG: %[[MODCOL:.*]] = index.remu %[[COL]], %[[C128:.*]]
+    // CHECK-DAG: %[[SGIDX:.*]] = arith.remui %[[SGID]], %[[C4:.*]]
+    // CHECK-DAG: %[[SGIDY_TMP:.*]] = arith.divui %[[SGID]], %[[C4:.*]]
+    // CHECK-DAG: %[[SGIDY:.*]] = arith.remui %[[SGIDY_TMP]], %[[C8:.*]]
+    // CHECK-DAG: %[[ROW:.*]] = arith.muli %[[SGIDY]], %[[C32:.*]] : index
+    // CHECK-DAG: %[[COL:.*]] = arith.muli %[[SGIDX]], %[[C32:.*]] : index
+    // CHECK-DAG: %[[MODROW:.*]] = arith.remui %[[ROW]], %[[C256:.*]] : index
+    // CHECK-DAG: %[[MODCOL:.*]] = arith.remui %[[COL]], %[[C128:.*]] : index
     // CHECK-DAG: %[[SUBROW:.*]] = arith.subi %[[C16:.*]], %[[MODROW]] : index
     // CHECK-DAG: %[[MAXROW:.*]] = arith.maxsi %[[SUBROW]], %[[C4:.*]] : index
     // CHECK-DAG: %[[MINROW:.*]] = arith.minsi %[[MAXROW]], %[[C32:.*]] : index
diff --git a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir
index 5ce3d1d0fb5d6..a8015cced7eb4 100644
--- a/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir
+++ b/mlir/test/Dialect/XeGPU/xegpu-wg-to-sg.mlir
@@ -5,13 +5,13 @@ gpu.module @test_1_1_assignment {
   // CHECK-SAME: %[[ARG_0:.*]]: memref<256x128xf32>
   gpu.func @create_nd_tdesc(%src: memref<256x128xf32>) {
     // CHECK-DAG: %[[SGID:.*]] = gpu.subgroup_id : index
-    // CHECK-DAG: %[[REMUX:.*]] = index.remu %[[SGID]], %[[C4:.*]]
-    // CHECK-DAG: %[[DIVU:.*]] = index.divu %[[SGID]], %[[C4:.*]]
-    // CHECK-DAG: %[[REMUY:.*]] = index.remu %[[DIVU]], %[[C8:.*]]
-    // CHECK-DAG: %[[MULY:.*]] = index.mul %[[REMUY]], %[[C32:.*]]
-    // CHECK-DAG: %[[MULX:.*]] = index.mul %[[REMUX]], %[[C32:.*]]
-    // CHECK-DAG: %[[MODY:.*]] = index.remu %[[MULY]], %[[C256:.*]]
-    // CHECK-DAG: %[[MODX:.*]] = index.remu %[[MULX]], %[[C128:.*]]
+    // CHECK-DAG: %[[REMUX:.*]] = arith.remui %[[SGID]], %[[C4:.*]]
+    // CHECK-DAG: %[[DIVU:.*]] = arith.divui %[[SGID]], %[[C4:.*]]
+    // CHECK-DAG: %[[REMUY:.*]] = arith.remui %[[DIVU]], %[[C8:.*]]
+    // CHECK-DAG: %[[MULY:.*]] = arith.muli %[[REMUY]], %[[C32:.*]]
+    // CHECK-DAG: %[[MULX:.*]] = arith.muli %[[REMUX]], %[[C32:.*]]
+    // CHECK-DAG: %[[MODY:.*]] = arith.remui %[[MULY]], %[[C256:.*]]
+    // CHECK-DAG: %[[MODX:.*]] = arith.remui %[[MULX]], %[[C128:.*]]
     // CHECK-DAG: %[[TDESC:.*]] = xegpu.create_nd_tdesc %[[ARG_0]][%[[MODY]], %[[MODX]]] : memref<256x128xf32> -> !xegpu.tensor_desc<32x32xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
     %tdesc = xegpu.create_nd_tdesc %src[0, 0] : memref<256x128xf32>
       -> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
@@ -22,13 +22,13 @@ gpu.module @test_1_1_assignment {
   // CHECK-SAME: %[[ARG_0:.*]]: memref<3x256x128xf32>
   gpu.func @create_nd_tdesc_from_higher_rank_memref(%src: memref<3x256x128xf32>) {
     // CHECK-DAG: %[[SGID:.*]] = gpu.subgroup_id : index
-    // CHECK-DAG: %[[REMUX:.*]] = index.remu %[[SGID]], %[[C4:.*]]
-    // CHECK-DAG: %[[DIVU:.*]] = index.divu %[[SGID]], %[[C4:.*]]
-    // CHECK-DAG: %[[REMUY:.*]] = index.remu %[[DIVU]], %[[C8:.*]]
-    // CHECK-DAG: %[[MULY:.*]] = index.mul %[[REMUY]], %[[C32:.*]]
-    // CHECK-DAG: %[[MULX:.*]] = index.mul %[[REMUX]], %[[C32:.*]]
-    // CHECK-DAG: %[[MODY:.*]] = index.remu %[[MULY]], %[[C256:.*]]
-    // CHECK-DAG: %[[MODX:.*]] = index.remu %[[MULX]], %[[C128:.*]]
+    // CHECK-DAG: %[[REMUX:.*]] = arith.remui %[[SGID]], %[[C4:.*]]
+    // CHECK-DAG: %[[DIVU:.*]] = arith.divui %[[SGID]], %[[C4:.*]]
+    // CHECK-DAG: %[[REMUY:.*]] = arith.remui %[[DIVU]], %[[C8:.*]]
+    // CHECK-DAG: %[[MULY:.*]] = arith.muli %[[REMUY]], %[[C32:.*]]
+    // CHECK-DAG: %[[MULX:.*]] = arith.muli %[[REMUX]], %[[C32:.*]]
+    // CHECK-DAG: %[[MODY:.*]] = arith.remui %[[MULY]], %[[C256:.*]]
+    // CHECK-DAG: %[[MODX:.*]] = arith.remui %[[MULX]], %[[C128:.*]]
     // CHECK-DAG: %[[TDESC:.*]] = xegpu.create_nd_tdesc %[[ARG_0]][1, %[[MODY]], %[[MODX]]] : memref<3x256x128xf32> -> !xegpu.tensor_desc<32x32xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
     %tdesc = xegpu.create_nd_tdesc %src[1, 0, 0] : memref<3x256x128xf32>
       -> !xegpu.tensor_desc<256x128xf32, #xegpu.layout<sg_layout = [8, 4], sg_data = [32, 32], lane_layout = [1, 16], lane_data = [1, 1]>>
diff --git a/mlir/test/Integration/Dialect/Arith/CPU/test-apfloat-emulation.mlir b/mlir/test/Integration/Dialect/Arith/CPU/test-apfloat-emulation.mlir
index 8046610d479a8..7f72dd5931488 100644
--- a/mlir/test/Integration/Dialect/Arith/CPU/test-apfloat-emulation.mlir
+++ b/mlir/test/Integration/Dialect/Arith/CPU/test-apfloat-emulation.mlir
@@ -43,6 +43,18 @@ func.func @entry() {
   %cvt = arith.truncf %b2 : f32 to f8E4M3FN
   vector.print %cvt : f8E4M3FN
 
+  // CHECK-NEXT: -2.25
+  %negated = arith.negf %cvt : f8E4M3FN
+  vector.print %negated : f8E4M3FN
+
+  // CHECK-NEXT: -2.25
+  %min = arith.minimumf %cvt, %negated : f8E4M3FN
+  vector.print %min : f8E4M3FN
+
+  // CHECK-NEXT: 1
+  %cmp1 = arith.cmpf "olt", %cvt, %c1 : f8E4M3FN
+  vector.print %cmp1 : i1
+
   // CHECK-NEXT: 1
   // Bit pattern: 01, interpreted as signed integer: 1
   %cvt_int_signed = arith.fptosi %cvt : f8E4M3FN to i2
diff --git a/mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/matmul-transpose-a.mlir b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/matmul-transpose-a.mlir
index 9d043573091eb..d26853d14aec7 100644
--- a/mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/matmul-transpose-a.mlir
+++ b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/matmul-transpose-a.mlir
@@ -22,7 +22,7 @@ func.func @matmul_transpose_a(%A : tensor<?x?xf32>, %B : tensor<?x?xf32>, %C : t
 }
 
 func.func @main() {
-  %c0 = arith.constant 0 : i32
+  %c0 = arith.constant 0.0 : f32
   %c7 = arith.constant 7 : index
 
   %A = arith.constant dense<[
@@ -44,7 +44,7 @@ func.func @main() {
   %A_dyn = tensor.cast %A : tensor<13x7xf32> to tensor<?x?xf32>
 
   %C_init = bufferization.alloc_tensor(%c7, %c7) : tensor<?x?xf32>
-  %C = linalg.fill ins(%c0 : i32) outs(%C_init : tensor<?x?xf32>) -> tensor<?x?xf32>
+  %C = linalg.fill ins(%c0 : f32) outs(%C_init : tensor<?x?xf32>) -> tensor<?x?xf32>
 
   // CHECK: Unranked Memref {{.*}} rank = 2 offset = 0 sizes = [7, 7] strides = [7, 1] data =
   // CHECK: [32955, 33514, 34073, 34632, 35191, 35750, 36309]
diff --git a/mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/matmul.mlir b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/matmul.mlir
index ad7dbb9f7e126..e2c0f1d22fea1 100644
--- a/mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/matmul.mlir
+++ b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/matmul.mlir
@@ -16,7 +16,7 @@ func.func @matmul(%A : tensor<?x?xf32>, %B : tensor<?x?xf32>, %C : tensor<?x?xf3
 }
 
 func.func @main() {
-  %c0 = arith.constant 0 : i32
+  %c0 = arith.constant 0.0 : f32
   %c7 = arith.constant 7 : index
 
   %A = arith.constant dense<[
@@ -37,7 +37,7 @@ func.func @main() {
   %B_dyn = tensor.cast %B : tensor<13x7xf32> to tensor<?x?xf32>
 
   %C_init = bufferization.alloc_tensor(%c7, %c7) : tensor<?x?xf32>
-  %C = linalg.fill ins(%c0 : i32) outs(%C_init : tensor<?x?xf32>) -> tensor<?x?xf32>
+  %C = linalg.fill ins(%c0 : f32) outs(%C_init : tensor<?x?xf32>) -> tensor<?x?xf32>
 
   // CHECK: Unranked Memref {{.*}} rank = 2 offset = 0 sizes = [7, 7] strides = [7, 1] data =
   // CHECK: [32955, 33514, 34073, 34632, 35191, 35750, 36309]
diff --git a/mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/multi-tile-matmul.mlir b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/multi-tile-matmul.mlir
index 243f9e5cde9f5..007189ad9d578 100644
--- a/mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/multi-tile-matmul.mlir
+++ b/mlir/test/Integration/Dialect/Linalg/CPU/ArmSME/multi-tile-matmul.mlir
@@ -29,7 +29,7 @@ func.func @main() {
   %c128 = arith.constant 128 : i32
   func.call @setArmSVLBits(%c128) : (i32) -> ()
 
-  %c0 = arith.constant 0 : i32
+  %c0 = arith.constant 0.0 : f32
   %c7 = arith.constant 7 : index
 
   %A = arith.constant dense<[
@@ -50,7 +50,7 @@ func.func @main() {
   %B_dyn = tensor.cast %B : tensor<13x7xf32> to tensor<?x?xf32>
 
   %C_init = bufferization.alloc_tensor(%c7, %c7) : tensor<?x?xf32>
-  %C = linalg.fill ins(%c0 : i32) outs(%C_init : tensor<?x?xf32>) -> tensor<?x?xf32>
+  %C = linalg.fill ins(%c0 : f32) outs(%C_init : tensor<?x?xf32>) -> tensor<?x?xf32>
 
   // CHECK: Unranked Memref {{.*}} rank = 2 offset = 0 sizes = [7, 7] strides = [7, 1] data =
   // CHECK: [32955, 33514, 34073, 34632, 35191, 35750, 36309]
diff --git a/mlir/test/Integration/Dialect/Linalg/CPU/runtime-verification.mlir b/mlir/test/Integration/Dialect/Linalg/CPU/runtime-verification.mlir
index 610ed63168d87..c90476e1ff61d 100644
--- a/mlir/test/Integration/Dialect/Linalg/CPU/runtime-verification.mlir
+++ b/mlir/test/Integration/Dialect/Linalg/CPU/runtime-verification.mlir
@@ -80,10 +80,10 @@ func.func @main() {
   %c64x57 = arith.constant dense<0.0> : tensor<16x29xf32>
   %c3x4 = arith.constant dense<0.0> : tensor<3x4xf32>
 
-  // TODO: BROKEN CHK: ERROR: Runtime op verification failed
-  // TODO: BROKEN CHK-NEXT: linalg.generic
-  // TODO: BROKEN CHK-NEXT: unexpected negative result on dimension #0 of input/output operand #0
-  // TODO: BROKEN func.call @reverse_from_3(%d5x) : (tensor<?xf32>) -> (tensor<?xf32>)
+  // CHECK: ERROR: Runtime op verification failed
+  // CHECK-NEXT: linalg.generic
+  // CHECK-NEXT: unexpected negative result on dimension #0 of input/output operand #0
+  func.call @reverse_from_3(%d5x) : (tensor<?xf32>) -> (tensor<?xf32>)
 
   %c0x = arith.constant dense<1.0> : tensor<0xf32>
   %d0x = tensor.cast %c0x : tensor<0xf32> to tensor<?xf32>
diff --git a/mlir/test/Integration/Dialect/Linalg/CPU/test-matmul-masked-vec.mlir b/mlir/test/Integration/Dialect/Linalg/CPU/test-matmul-masked-vec.mlir
index 8fa32d7aeb586..bbda8d4e99d04 100644
--- a/mlir/test/Integration/Dialect/Linalg/CPU/test-matmul-masked-vec.mlir
+++ b/mlir/test/Integration/Dialect/Linalg/CPU/test-matmul-masked-vec.mlir
@@ -27,8 +27,8 @@ func.func @main() {
   %A_dyn = tensor.cast %A : tensor<8x2xf32> to tensor<?x?xf32>
   %B_dyn = tensor.cast %B : tensor<2x4xf32> to tensor<?x?xf32>
 
-  %c0_i32 = arith.constant  0 : i32
-  %C_init = linalg.fill ins(%c0_i32 : i32) outs(%C_dyn : tensor<?x?xf32>) -> tensor<?x?xf32>
+  %c0_f32 = arith.constant 0.0 : f32
+  %C_init = linalg.fill ins(%c0_f32 : f32) outs(%C_dyn : tensor<?x?xf32>) -> tensor<?x?xf32>
 
   %res = linalg.matmul ins(%A_dyn, %B_dyn: tensor<?x?xf32>, tensor<?x?xf32>)
             outs(%C_init: tensor<?x?xf32>) -> tensor<?x?xf32>
diff --git a/mlir/test/Integration/Dialect/Transform/match_matmul.mlir b/mlir/test/Integration/Dialect/Transform/match_matmul.mlir
index a374d9a611258..e3fee917cdeaa 100644
--- a/mlir/test/Integration/Dialect/Transform/match_matmul.mlir
+++ b/mlir/test/Integration/Dialect/Transform/match_matmul.mlir
@@ -63,11 +63,11 @@ func.func @matmul_simple(%lhs: tensor<10x20xf16>, %rhs: tensor<20x15xf32>) -> te
 }
 
 func.func @matmul_with_extra_ops_in_func(%lhs: tensor<10x20xf32>, %rhs: tensor<20x15xf32>) -> tensor<10x15xf32> {
-  %cst = arith.constant 0.0 : f64
+  %cst = arith.constant 0.0 : f32
   %empty = tensor.empty() : tensor<10x15xf32>
 
   // expected-remark @below {{fill}}
-  %fill = linalg.fill ins(%cst : f64) outs(%empty : tensor<10x15xf32>) -> tensor<10x15xf32>
+  %fill = linalg.fill ins(%cst : f32) outs(%empty : tensor<10x15xf32>) -> tensor<10x15xf32>
 
   %real_lhs = linalg.mul
     ins(%lhs, %lhs : tensor<10x20xf32>, tensor<10x20xf32>) outs(%lhs : tensor<10x20xf32>) -> tensor<10x20xf32>
diff --git a/mlir/test/Integration/Dialect/XeGPU/LANE/load_store_subview.mlir b/mlir/test/Integration/Dialect/XeGPU/LANE/load_store_subview.mlir
new file mode 100644
index 0000000000000..c4608acb7b7b5
--- /dev/null
+++ b/mlir/test/Integration/Dialect/XeGPU/LANE/load_store_subview.mlir
@@ -0,0 +1,63 @@
+// RUN: mlir-opt %s --gpu-lower-to-xevm-pipeline="xegpu-op-level=lane" \
+// RUN: | mlir-runner \
+// RUN:   --shared-libs=%mlir_levelzero_runtime \
+// RUN:   --shared-libs=%mlir_runner_utils \
+// RUN:   --entry-point-result=void \
+// RUN: | FileCheck %s
+
+module @subview attributes {gpu.container_module} {
+  gpu.module @kernel {
+    gpu.func @subview(%src: memref<256xf32>, %dst: memref<256xf32>) kernel {
+      %src_subview = memref.subview %src[5] [251] [1] : memref<256xf32> to memref<251xf32, strided<[1], offset: 5>>
+      %dst_subview = memref.subview %dst[10] [246] [1] : memref<256xf32> to memref<246xf32, strided<[1], offset: 10>>
+      %lane_id = gpu.lane_id
+      %mask = arith.constant 1 : i1
+      %loaded = xegpu.load %src_subview[%lane_id], %mask : memref<251xf32, strided<[1], offset: 5>>, index, i1 -> f32
+      xegpu.store %loaded, %dst_subview[%lane_id], %mask : f32, memref<246xf32, strided<[1], offset: 10>>, index, i1
+      gpu.return
+    }
+  }
+  func.func @test(%src: memref<256xf32>, %dst: memref<256xf32>) -> memref<256xf32> {
+    %memref_src = gpu.alloc  () : memref<256xf32>
+    gpu.memcpy %memref_src, %src : memref<256xf32>, memref<256xf32>
+    %memref_dst = gpu.alloc  () : memref<256xf32>
+    gpu.memcpy %memref_dst, %dst : memref<256xf32>, memref<256xf32>
+    %c1 = arith.constant 1 : index
+    %c16 = arith.constant 16 : index
+    gpu.launch_func @kernel::@subview blocks in (%c1, %c1, %c1) threads in (%c16, %c1, %c1) args(%memref_src : memref<256xf32>, %memref_dst : memref<256xf32>)
+    gpu.wait // Wait for the kernel to finish.
+    gpu.memcpy %dst, %memref_dst : memref<256xf32>, memref<256xf32>
+    gpu.dealloc %memref_src : memref<256xf32>
+    gpu.dealloc %memref_dst : memref<256xf32>
+    return %dst : memref<256xf32>
+  }
+  func.func @main() {
+    %c0 = arith.constant 0 : index
+    %c1 = arith.constant 1 : index
+    %c256 = arith.constant 256 : index
+    %memref_src = memref.alloc() : memref<256xf32>
+    %memref_dst = memref.alloc() : memref<256xf32>
+    // Initialize source memref
+    scf.for %i = %c0 to %c256 step %c1 {
+      %val = arith.index_cast %i : index to i32
+      %val_float = arith.sitofp %val : i32 to f32
+      memref.store %val_float, %memref_src[%i] : memref<256xf32>
+    }
+    // Initialize destination memref to zero
+    scf.for %i = %c0 to %c256 step %c1 {
+      %zero = arith.constant 0.0 : f32
+      memref.store %zero, %memref_dst[%i] : memref<256xf32>
+    }
+    // Call test function
+    %gpu_result = call @test(%memref_src, %memref_dst) : (memref<256xf32>, memref<256xf32>) -> memref<256xf32>
+    %gpu_result_casted = memref.cast %gpu_result : memref<256xf32> to memref<*xf32>
+    // CHECK: Unranked Memref base@ = 0x{{[0-9a-f]+}}
+    // CHECK: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
+    call @printMemrefF32(%gpu_result_casted) : (memref<*xf32>) -> ()
+    // Deallocate memrefs
+    memref.dealloc %memref_src : memref<256xf32>
+    memref.dealloc %memref_dst : memref<256xf32>
+    return
+  }
+  func.func private @printMemrefF32(memref<*xf32>) attributes {llvm.emit_c_interface}
+}
diff --git a/mlir/test/Integration/GPU/CUDA/TensorCore/sm80/wmma-matmul-f64.mlir b/mlir/test/Integration/GPU/CUDA/TensorCore/sm80/wmma-matmul-f64.mlir
new file mode 100644
index 0000000000000..a016a60022699
--- /dev/null
+++ b/mlir/test/Integration/GPU/CUDA/TensorCore/sm80/wmma-matmul-f64.mlir
@@ -0,0 +1,72 @@
+// RUN: mlir-opt %s \
+// RUN: | mlir-opt -gpu-lower-to-nvvm-pipeline="cubin-chip=sm_80 cubin-format=%gpu_compilation_format" \
+// RUN: | mlir-runner \
+// RUN:   --shared-libs=%mlir_cuda_runtime \
+// RUN:   --shared-libs=%mlir_runner_utils \
+// RUN:   --entry-point-result=void \
+// RUN: | FileCheck %s
+
+#map0 = affine_map<(d0, d1) -> (d1, d0)>
+
+func.func @main() {
+  %a = memref.alloc() : memref<8x4xf64>
+  %b = memref.alloc() : memref<4x8xf64>
+  %c = memref.alloc() : memref<8x8xf64>
+  %d = memref.alloc() : memref<8x8xf64>
+
+  %f1 = arith.constant 1.0e+00 : f64
+  %fcst = arith.constant 3.14e+00 : f64
+  %c0 = arith.constant 0 : index
+  %c8 = arith.constant 8 : index
+  %c4 = arith.constant 4 : index
+  %c1 = arith.constant 1 : index
+  %c32 = arith.constant 32 : index
+
+  // Initialize the input matrices with ones.
+  scf.for %arg0 = %c0 to %c8 step %c1 {
+    scf.for %arg1 = %c0 to %c4 step %c1 {
+      memref.store %f1, %a[%arg0, %arg1] : memref<8x4xf64>
+      memref.store %f1, %b[%arg1, %arg0] : memref<4x8xf64>
+    }
+  }
+  // Initialize the accumulator matrix with a constant.
+  scf.for %arg0 = %c0 to %c8 step %c1 {
+    scf.for %arg1 = %c0 to %c8 step %c1 {
+      memref.store %fcst, %c[%arg0, %arg1] : memref<8x8xf64>
+    }
+  }
+
+  %2 = memref.cast %a : memref<8x4xf64> to memref<*xf64>
+  %20 = memref.cast %b : memref<4x8xf64> to memref<*xf64>
+  %33 = memref.cast %c : memref<8x8xf64> to memref<*xf64>
+  %34 = memref.cast %d : memref<8x8xf64> to memref<*xf64>
+
+  gpu.host_register %2 : memref<*xf64>
+  gpu.host_register %20 : memref<*xf64>
+  gpu.host_register %33 : memref<*xf64>
+
+  gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
+             threads(%tx, %ty, %tz) in (%block_x = %c32, %block_y = %c1, %block_z = %c1) {
+    %A = gpu.subgroup_mma_load_matrix %a[%c0, %c0] {leadDimension = 4 : index} : memref<8x4xf64> -> !gpu.mma_matrix<8x4xf64, "AOp">
+    %B = gpu.subgroup_mma_load_matrix %b[%c0, %c0] {leadDimension = 8 : index} : memref<4x8xf64> -> !gpu.mma_matrix<4x8xf64, "BOp">
+    %C = gpu.subgroup_mma_load_matrix %c[%c0, %c0] {leadDimension = 8 : index} : memref<8x8xf64> -> !gpu.mma_matrix<8x8xf64, "COp">
+
+    %R = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<8x4xf64, "AOp">, !gpu.mma_matrix<4x8xf64, "BOp"> -> !gpu.mma_matrix<8x8xf64, "COp">
+
+    gpu.subgroup_mma_store_matrix %R, %d[%c0, %c0] {leadDimension = 8 : index}: !gpu.mma_matrix<8x8xf64, "COp">, memref<8x8xf64>
+    gpu.terminator
+  }
+  // Print the memref after computation.
+  call @printMemrefF64(%34) : (memref<*xf64>) -> ()
+  // CHECK: [7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14],
+  // CHECK-NEXT: [7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14],
+  // CHECK-NEXT: [7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14],
+  // CHECK-NEXT: [7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14],
+  // CHECK-NEXT: [7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14],
+  // CHECK-NEXT: [7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14],
+  // CHECK-NEXT: [7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14],
+  // CHECK-NEXT: [7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14,   7.14]
+  return
+}
+
+func.func private @printMemrefF64(memref<*xf64>)
diff --git a/mlir/test/Target/Cpp/common-cpp.mlir b/mlir/test/Target/Cpp/common-cpp.mlir
index 294e6af65bf14..abf85c8e9a359 100644
--- a/mlir/test/Target/Cpp/common-cpp.mlir
+++ b/mlir/test/Target/Cpp/common-cpp.mlir
@@ -105,6 +105,25 @@ func.func @apply() -> !emitc.ptr<i32> {
   return %1 : !emitc.ptr<i32>
 }
 
+
+// CHECK-LABEL: void address_of() {
+func.func @address_of() {
+  // CHECK-NEXT: int32_t [[V1:[^ ]*]];
+  %0 = "emitc.variable"() <{value = #emitc.opaque<"">}> : () -> !emitc.lvalue<i32>
+  // CHECK-NEXT: int32_t* [[V2:[^ ]*]] = &[[V1]];
+  %1 = emitc.address_of %0 : !emitc.lvalue<i32>
+  return
+}
+
+// CHECK-LABEL: void dereference
+// CHECK-SAME:                   (int32_t* [[ARG0:[^ ]*]]) {
+func.func @dereference(%arg0: !emitc.ptr<i32>) {
+  // CHECK: int32_t [[V1:[^ ]*]] = *[[ARG0]];
+  %2 = emitc.dereference %arg0 : !emitc.ptr<i32>
+  emitc.load %2 : !emitc.lvalue<i32>
+  return
+}
+
 // CHECK: void array_type(int32_t v1[3], float v2[10][20])
 func.func @array_type(%arg0: !emitc.array<3xi32>, %arg1: !emitc.array<10x20xf32>) {
   return
diff --git a/mlir/test/Target/Cpp/expressions.mlir b/mlir/test/Target/Cpp/expressions.mlir
index 9f1c816ddabbc..2de94d0b11fc8 100644
--- a/mlir/test/Target/Cpp/expressions.mlir
+++ b/mlir/test/Target/Cpp/expressions.mlir
@@ -314,14 +314,14 @@ func.func @different_expressions(%arg0: i32, %arg1: i32, %arg2: i32, %arg3: i32)
   return %v_load : i32
 }
 
-// CPP-DEFAULT:      int32_t expression_with_dereference(int32_t [[VAL_1:v[0-9]+]], int32_t* [[VAL_2]]) {
+// CPP-DEFAULT:      int32_t expression_with_dereference_apply(int32_t [[VAL_1:v[0-9]+]], int32_t* [[VAL_2]]) {
 // CPP-DEFAULT-NEXT:   return *([[VAL_2]] - [[VAL_1]]);
 // CPP-DEFAULT-NEXT: }
 
-// CPP-DECLTOP:      int32_t expression_with_dereference(int32_t [[VAL_1:v[0-9]+]], int32_t* [[VAL_2]]) {
+// CPP-DECLTOP:      int32_t expression_with_dereference_apply(int32_t [[VAL_1:v[0-9]+]], int32_t* [[VAL_2]]) {
 // CPP-DECLTOP-NEXT:   return *([[VAL_2]] - [[VAL_1]]);
 // CPP-DECLTOP-NEXT: }
-emitc.func @expression_with_dereference(%arg1: i32, %arg2: !emitc.ptr<i32>) -> i32 {
+emitc.func @expression_with_dereference_apply(%arg1: i32, %arg2: !emitc.ptr<i32>) -> i32 {
   %c = emitc.expression %arg1, %arg2 : (i32, !emitc.ptr<i32>) -> i32 {
     %e = emitc.sub %arg2, %arg1 : (!emitc.ptr<i32>, i32) -> !emitc.ptr<i32>
     %d = emitc.apply "*"(%e) : (!emitc.ptr<i32>) -> i32
@@ -330,6 +330,28 @@ emitc.func @expression_with_dereference(%arg1: i32, %arg2: !emitc.ptr<i32>) -> i
   return %c : i32
 }
 
+// CPP-DEFAULT:      bool expression_with_address_taken_apply(int32_t [[VAL_1:v[0-9]+]], int32_t [[VAL_2:v[0-9]+]], int32_t* [[VAL_3]]) {
+// CPP-DEFAULT-NEXT:   int32_t [[VAL_4:v[0-9]+]] = 42;
+// CPP-DEFAULT-NEXT:   return &[[VAL_4]] - [[VAL_2]] < [[VAL_3]];
+// CPP-DEFAULT-NEXT: }
+
+// CPP-DECLTOP:      bool expression_with_address_taken_apply(int32_t [[VAL_1:v[0-9]+]], int32_t [[VAL_2:v[0-9]+]], int32_t* [[VAL_3]]) {
+// CPP-DECLTOP-NEXT:   int32_t [[VAL_4:v[0-9]+]];
+// CPP-DECLTOP-NEXT:   [[VAL_4]] = 42;
+// CPP-DECLTOP-NEXT:   return &[[VAL_4]] - [[VAL_2]] < [[VAL_3]];
+// CPP-DECLTOP-NEXT: }
+
+func.func @expression_with_address_taken_apply(%arg0: i32, %arg1: i32, %arg2: !emitc.ptr<i32>) -> i1 {
+  %a = "emitc.variable"(){value = 42 : i32} : () -> !emitc.lvalue<i32>
+  %c = emitc.expression %arg1, %arg2, %a : (i32, !emitc.ptr<i32>, !emitc.lvalue<i32>) -> i1 {
+    %d = emitc.apply "&"(%a) : (!emitc.lvalue<i32>) -> !emitc.ptr<i32>
+    %e = emitc.sub %d, %arg1 : (!emitc.ptr<i32>, i32) -> !emitc.ptr<i32>
+    %f = emitc.cmp lt, %e, %arg2 : (!emitc.ptr<i32>, !emitc.ptr<i32>) -> i1
+    emitc.yield %f : i1
+  }
+  return %c : i1
+}
+
 // CPP-DEFAULT:      bool expression_with_address_taken(int32_t [[VAL_1:v[0-9]+]], int32_t [[VAL_2:v[0-9]+]], int32_t* [[VAL_3]]) {
 // CPP-DEFAULT-NEXT:   int32_t [[VAL_4:v[0-9]+]] = 42;
 // CPP-DEFAULT-NEXT:   return &[[VAL_4]] - [[VAL_2]] < [[VAL_3]];
@@ -344,7 +366,7 @@ emitc.func @expression_with_dereference(%arg1: i32, %arg2: !emitc.ptr<i32>) -> i
 func.func @expression_with_address_taken(%arg0: i32, %arg1: i32, %arg2: !emitc.ptr<i32>) -> i1 {
   %a = "emitc.variable"(){value = 42 : i32} : () -> !emitc.lvalue<i32>
   %c = emitc.expression %arg1, %arg2, %a : (i32, !emitc.ptr<i32>, !emitc.lvalue<i32>) -> i1 {
-    %d = emitc.apply "&"(%a) : (!emitc.lvalue<i32>) -> !emitc.ptr<i32>
+    %d = emitc.address_of %a : !emitc.lvalue<i32>
     %e = emitc.sub %d, %arg1 : (!emitc.ptr<i32>, i32) -> !emitc.ptr<i32>
     %f = emitc.cmp lt, %e, %arg2 : (!emitc.ptr<i32>, !emitc.ptr<i32>) -> i1
     emitc.yield %f : i1
diff --git a/mlir/test/Target/LLVMIR/allocatable_gpu_reduction.mlir b/mlir/test/Target/LLVMIR/allocatable_gpu_reduction.mlir
index df606150b760a..95d12f304aca0 100644
--- a/mlir/test/Target/LLVMIR/allocatable_gpu_reduction.mlir
+++ b/mlir/test/Target/LLVMIR/allocatable_gpu_reduction.mlir
@@ -1,3 +1,5 @@
+// Tests single-team by-ref GPU reductions.
+
 // RUN: mlir-translate -mlir-to-llvmir %s | FileCheck %s
 
 module attributes {dlti.dl_spec = #dlti.dl_spec<"dlti.alloca_memory_space" = 5 : ui64, "dlti.global_memory_space" = 1 : ui64>, llvm.target_triple = "amdgcn-amd-amdhsa", omp.is_gpu = true, omp.is_target_device = true} {
diff --git a/mlir/test/Target/LLVMIR/allocatable_gpu_reduction_teams.mlir b/mlir/test/Target/LLVMIR/allocatable_gpu_reduction_teams.mlir
new file mode 100644
index 0000000000000..1c73a49b0bf9f
--- /dev/null
+++ b/mlir/test/Target/LLVMIR/allocatable_gpu_reduction_teams.mlir
@@ -0,0 +1,121 @@
+// Tests cross-teams by-ref GPU reductions.
+
+// RUN: mlir-translate -mlir-to-llvmir %s | FileCheck %s
+
+module attributes {dlti.dl_spec = #dlti.dl_spec<"dlti.alloca_memory_space" = 5 : ui64, "dlti.global_memory_space" = 1 : ui64>, llvm.target_triple = "amdgcn-amd-amdhsa", omp.is_gpu = true, omp.is_target_device = true} {
+  omp.private {type = private} @_QFfooEi_private_i32 : i32
+  omp.declare_reduction @add_reduction_byref_box_heap_f32 : !llvm.ptr attributes {byref_element_type = f32} alloc {
+    %0 = llvm.mlir.constant(1 : i64) : i64
+    %1 = llvm.alloca %0 x !llvm.struct<(ptr, i64, i32, i8, i8, i8, i8)> : (i64) -> !llvm.ptr<5>
+    %2 = llvm.addrspacecast %1 : !llvm.ptr<5> to !llvm.ptr
+    omp.yield(%2 : !llvm.ptr)
+  } init {
+  ^bb0(%arg0: !llvm.ptr, %arg1: !llvm.ptr):
+    omp.yield(%arg1 : !llvm.ptr)
+  } combiner {
+  ^bb0(%arg0: !llvm.ptr, %arg1: !llvm.ptr):
+    %0 = llvm.mlir.constant(1 : i32) : i32
+    %1 = llvm.alloca %0 x !llvm.struct<(ptr, i64, i32, i8, i8, i8, i8)> {alignment = 8 : i64} : (i32) -> !llvm.ptr<5>
+    %2 = llvm.addrspacecast %1 : !llvm.ptr<5> to !llvm.ptr
+    %3 = llvm.mlir.constant(1 : i32) : i32
+    %4 = llvm.alloca %3 x !llvm.struct<(ptr, i64, i32, i8, i8, i8, i8)> {alignment = 8 : i64} : (i32) -> !llvm.ptr<5>
+    %5 = llvm.addrspacecast %4 : !llvm.ptr<5> to !llvm.ptr
+    %6 = llvm.mlir.constant(24 : i32) : i32
+    "llvm.intr.memcpy"(%5, %arg0, %6) <{isVolatile = false}> : (!llvm.ptr, !llvm.ptr, i32) -> ()
+    %7 = llvm.mlir.constant(24 : i32) : i32
+    "llvm.intr.memcpy"(%2, %arg1, %7) <{isVolatile = false}> : (!llvm.ptr, !llvm.ptr, i32) -> ()
+    %8 = llvm.getelementptr %5[0, 0] : (!llvm.ptr) -> !llvm.ptr, !llvm.struct<(ptr, i64, i32, i8, i8, i8, i8)>
+    %9 = llvm.load %8 : !llvm.ptr -> !llvm.ptr
+    %10 = llvm.getelementptr %2[0, 0] : (!llvm.ptr) -> !llvm.ptr, !llvm.struct<(ptr, i64, i32, i8, i8, i8, i8)>
+    %11 = llvm.load %10 : !llvm.ptr -> !llvm.ptr
+    %12 = llvm.load %9 : !llvm.ptr -> f32
+    %13 = llvm.load %11 : !llvm.ptr -> f32
+    %14 = llvm.fadd %12, %13 {fastmathFlags = #llvm.fastmath<contract>} : f32
+    llvm.store %14, %9 : f32, !llvm.ptr
+    omp.yield(%arg0 : !llvm.ptr)
+  } data_ptr_ptr {
+  ^bb0(%arg0: !llvm.ptr):
+    %0 = llvm.getelementptr %arg0[0, 0] : (!llvm.ptr) -> !llvm.ptr, !llvm.struct<(ptr, i64, i32, i8, i8, i8, i8)>
+    omp.yield(%0 : !llvm.ptr)
+  }
+
+  llvm.func @foo_() {
+    %0 = llvm.mlir.constant(1 : i64) : i64
+    %4 = llvm.alloca %0 x i1 : (i64) -> !llvm.ptr<5>
+    %5 = llvm.addrspacecast %4 : !llvm.ptr<5> to !llvm.ptr
+    %8 = llvm.getelementptr %5[0, 0] : (!llvm.ptr) -> !llvm.ptr, !llvm.struct<(ptr, i64, i32, i8, i8, i8, i8)>
+    %9 = omp.map.info var_ptr(%5 : !llvm.ptr, f32) map_clauses(tofrom) capture(ByRef) var_ptr_ptr(%8 : !llvm.ptr) -> !llvm.ptr {name = ""}
+    %10 = omp.map.info var_ptr(%5 : !llvm.ptr, !llvm.struct<(ptr, i64, i32, i8, i8, i8, i8)>) map_clauses(always, descriptor, to, attach) capture(ByRef) members(%9 : [0] : !llvm.ptr) -> !llvm.ptr {name = "scalar_alloc"}
+    omp.target map_entries(%10 -> %arg0 : !llvm.ptr) {
+      %14 = llvm.mlir.constant(1000000 : i32) : i32
+      %15 = llvm.mlir.constant(1 : i32) : i32
+      omp.teams reduction(byref @add_reduction_byref_box_heap_f32 %arg0 -> %arg3 : !llvm.ptr) {
+        omp.parallel {
+          omp.distribute {
+            omp.wsloop reduction(byref @add_reduction_byref_box_heap_f32 %arg3 -> %arg5 : !llvm.ptr) {
+              omp.loop_nest (%arg6) : i32 = (%15) to (%14) inclusive step (%15) {
+                omp.yield
+              }
+            } {omp.composite}
+          } {omp.composite}
+          omp.terminator
+        } {omp.composite}
+        omp.terminator
+      }
+      omp.terminator
+    }
+    llvm.return
+  }
+}
+
+// CHECK: %[[GLOBALIZED_LOCALS:.*]] = type { float }
+
+// CHECK: define internal void @_omp_reduction_list_to_global_copy_func({{.*}}) {{.*}} {
+// CHECK:   %[[RED_ARR_LIST:.*]] = getelementptr inbounds [1 x ptr], ptr %{{.*}}, i64 0, i64 0
+// CHECK:   %[[RED_ELEM_PTR:.*]] = load ptr, ptr %[[RED_ARR_LIST]], align 8
+// CHECK:   %[[GLOB_ELEM_PTR:.*]] = getelementptr inbounds %[[GLOBALIZED_LOCALS]], ptr %{{.*}}, i32 0, i32 0
+// CHECK:   %[[ALLOC_PTR_PTR:.*]] = getelementptr { ptr, i64, i32, i8, i8, i8, i8 }, ptr %[[RED_ELEM_PTR]], i32 0, i32 0
+// CHECK:   %[[ALLOC_PTR:.*]] = load ptr, ptr %[[ALLOC_PTR_PTR]], align 8
+// CHECK:   %[[ALLOC_VAL:.*]] = load float, ptr %[[ALLOC_PTR]], align 4
+// Verify that the actual value managed by the descriptor is stored in the globalized
+// locals array, rather than a pointer to the descriptor or a pointer to the value.
+// CHECK:   store float %[[ALLOC_VAL]], ptr %[[GLOB_ELEM_PTR]], align 4
+// CHECK: }
+
+// CHECK: define internal void @_omp_reduction_list_to_global_reduce_func({{.*}}) {{.*}} {
+// Allocate a descriptor to manage the element retrieved from the globalized local array.
+// CHECK:   %[[ALLOC_DESC:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8 }, align 8, addrspace(5)
+// CHECK:   %[[ALLOC_DESC_ASCAST:.*]] = addrspacecast ptr addrspace(5) %[[ALLOC_DESC]] to ptr
+
+// CHECK:   %[[RED_ARR_LIST:.*]] = getelementptr inbounds [1 x ptr], ptr %{{.*}}, i64 0, i64 0
+// CHECK:   %[[GLOB_ELEM_PTR:.*]] = getelementptr inbounds %[[GLOBALIZED_LOCALS]], ptr %{{.*}}, i32 0, i32 0
+// CHECK:   %[[ALLOC_PTR_PTR:.*]] = getelementptr { ptr, i64, i32, i8, i8, i8, i8 }, ptr %[[ALLOC_DESC_ASCAST]], i32 0, i32 0
+// Store the pointer to the globalized local element into the locally allocated descriptor.
+// CHECK:   store ptr %[[GLOB_ELEM_PTR]], ptr %[[ALLOC_PTR_PTR]], align 8
+// CHECK:   store ptr %[[ALLOC_DESC_ASCAST]], ptr %[[RED_ARR_LIST]], align 8
+// CHECK: }
+
+// CHECK: define internal void @_omp_reduction_global_to_list_copy_func({{.*}}) {{.*}} {
+// CHECK:   %[[RED_ARR_LIST:.*]] = getelementptr inbounds [1 x ptr], ptr %{{.*}}, i64 0, i64 0
+// CHECK:   %[[RED_ELEM_PTR:.*]] = load ptr, ptr %[[RED_ARR_LIST]], align 8
+// CHECK:   %[[GLOB_ELEM_PTR:.*]] = getelementptr inbounds %[[GLOBALIZED_LOCALS]], ptr %{{.*}}, i32 0, i32 0
+// CHECK:   %[[ALLOC_PTR_PTR:.*]] = getelementptr { ptr, i64, i32, i8, i8, i8, i8 }, ptr %[[RED_ELEM_PTR]], i32 0, i32 0
+// Similar to _omp_reduction_list_to_global_copy_func(...), but in the reverse direction,
+// i.e. the globalized locals array is copied from rather than copied to.
+// CHECK:   %[[ALLOC_PTR:.*]] = load ptr, ptr %[[ALLOC_PTR_PTR]], align 8
+// CHECK:   %[[ALLOC_VAL:.*]] = load float, ptr %[[GLOB_ELEM_PTR]], align 4
+// CHECK:   store float %[[ALLOC_VAL]], ptr %[[ALLOC_PTR]], align 4
+// CHECK: }
+
+// CHECK: define internal void @_omp_reduction_global_to_list_reduce_func({{.*}}) {{.*}} {
+// Allocate a descriptor to manage the element retrieved from the globalized local array.
+// CHECK:   %[[ALLOC_DESC:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8 }, align 8, addrspace(5)
+// CHECK:   %[[ALLOC_DESC_ASCAST:.*]] = addrspacecast ptr addrspace(5) %[[ALLOC_DESC]] to ptr
+
+// CHECK:   %[[RED_ARR_LIST:.*]] = getelementptr inbounds [1 x ptr], ptr %{{.*}}, i64 0, i64 0
+// CHECK:   %[[GLOB_ELEM_PTR:.*]] = getelementptr inbounds %[[GLOBALIZED_LOCALS]], ptr %{{.*}}, i32 0, i32 0
+// CHECK:   %[[ALLOC_PTR_PTR:.*]] = getelementptr { ptr, i64, i32, i8, i8, i8, i8 }, ptr %[[ALLOC_DESC_ASCAST]], i32 0, i32 0
+// Store the pointer to the globalized local element into the locally allocated descriptor.
+// CHECK:   store ptr %[[GLOB_ELEM_PTR]], ptr %[[ALLOC_PTR_PTR]], align 8
+// CHECK:   store ptr %[[ALLOC_DESC_ASCAST]], ptr %[[RED_ARR_LIST]], align 8
+// CHECK: }
diff --git a/mlir/test/Target/LLVMIR/nvvm/mbar_arr_drop_expect_tx.mlir b/mlir/test/Target/LLVMIR/nvvm/mbar_arr_drop_expect_tx.mlir
new file mode 100644
index 0000000000000..4b3cafec08a39
--- /dev/null
+++ b/mlir/test/Target/LLVMIR/nvvm/mbar_arr_drop_expect_tx.mlir
@@ -0,0 +1,68 @@
+// RUN: mlir-translate -mlir-to-llvmir %s | FileCheck %s
+
+llvm.func @mbarrier_arrive_drop_expect_tx_generic(%barrier: !llvm.ptr, %txcount : i32) {
+  // CHECK-LABEL: define void @mbarrier_arrive_drop_expect_tx_generic(ptr %0, i32 %1) {
+  // CHECK-NEXT: %3 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3) %3, i32 %1)
+  // CHECK-NEXT: %5 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %6 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3) %5, i32 %1)
+  // CHECK-NEXT: %7 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %8 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %7, i32 %1)
+  // CHECK-NEXT: %9 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %10 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %9, i32 %1)
+  // CHECK-NEXT: %11 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %12 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %11, i32 %1)
+  // CHECK-NEXT: %13 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %14 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %13, i32 %1)
+  // CHECK-NEXT: ret void
+  // CHECK-NEXT: }
+  %0 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount : !llvm.ptr, i32 -> i64
+  %1 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cta>} : !llvm.ptr, i32 -> i64
+  %2 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cluster>} : !llvm.ptr, i32 -> i64
+
+  %3 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {relaxed = true} : !llvm.ptr, i32 -> i64
+  %4 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cta>, relaxed = true} : !llvm.ptr, i32 -> i64
+  %5 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cluster>, relaxed = true} : !llvm.ptr, i32 -> i64
+  llvm.return
+}
+
+llvm.func @mbarrier_arrive_drop_expect_tx_shared(%barrier: !llvm.ptr<3>, %txcount : i32) {
+  // CHECK-LABEL: define void @mbarrier_arrive_drop_expect_tx_shared(ptr addrspace(3) %0, i32 %1) {
+  // CHECK-NEXT: %3 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: %4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: %5 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: %6 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: %7 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: %8 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: ret void
+  // CHECK-NEXT: }
+  %0 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount : !llvm.ptr<3>, i32 -> i64
+  %1 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cta>} : !llvm.ptr<3>, i32 -> i64
+  %2 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cluster>} : !llvm.ptr<3>, i32 -> i64
+
+  %3 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {relaxed = true} : !llvm.ptr<3>, i32 -> i64
+  %4 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cta>, relaxed = true} : !llvm.ptr<3>, i32 -> i64
+  %5 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cluster>, relaxed = true} : !llvm.ptr<3>, i32 -> i64
+  llvm.return
+}
+
+llvm.func @mbarrier_arrive_drop_expect_tx_shared_cluster(%barrier: !llvm.ptr<7>, %txcount : i32) {
+  // CHECK-LABEL: define void @mbarrier_arrive_drop_expect_tx_shared_cluster(ptr addrspace(7) %0, i32 %1) {
+  // CHECK-NEXT: call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %0, i32 %1)
+  // CHECK-NEXT: call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %0, i32 %1)
+  // CHECK-NEXT: call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %0, i32 %1)
+  // CHECK-NEXT: call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %0, i32 %1)
+  // CHECK-NEXT: call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %0, i32 %1)
+  // CHECK-NEXT: call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %0, i32 %1)
+  // CHECK-NEXT: ret void
+  // CHECK-NEXT: }
+  nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount : !llvm.ptr<7>, i32
+  nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cta>} : !llvm.ptr<7>, i32
+  nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cluster>} : !llvm.ptr<7>, i32
+
+  nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {relaxed = true} : !llvm.ptr<7>, i32
+  nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cta>, relaxed = true} : !llvm.ptr<7>, i32
+  nvvm.mbarrier.arrive_drop.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cluster>, relaxed = true} : !llvm.ptr<7>, i32
+  llvm.return
+}
+
diff --git a/mlir/test/Target/LLVMIR/nvvm/mbar_arr_expect_tx.mlir b/mlir/test/Target/LLVMIR/nvvm/mbar_arr_expect_tx.mlir
new file mode 100644
index 0000000000000..b5389bdd30267
--- /dev/null
+++ b/mlir/test/Target/LLVMIR/nvvm/mbar_arr_expect_tx.mlir
@@ -0,0 +1,68 @@
+// RUN: mlir-translate -mlir-to-llvmir %s | FileCheck %s
+
+llvm.func @mbarrier_arrive_expect_tx_generic(%barrier: !llvm.ptr, %txcount : i32) {
+  // CHECK-LABEL: define void @mbarrier_arrive_expect_tx_generic(ptr %0, i32 %1) {
+  // CHECK-NEXT: %3 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %4 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3) %3, i32 %1)
+  // CHECK-NEXT: %5 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %6 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3) %5, i32 %1)
+  // CHECK-NEXT: %7 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %8 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %7, i32 %1)
+  // CHECK-NEXT: %9 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %10 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %9, i32 %1)
+  // CHECK-NEXT: %11 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %12 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %11, i32 %1)
+  // CHECK-NEXT: %13 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %14 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %13, i32 %1)
+  // CHECK-NEXT: ret void
+  // CHECK-NEXT: }
+  %0 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount : !llvm.ptr, i32 -> i64
+  %1 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cta>} : !llvm.ptr, i32 -> i64
+  %2 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cluster>} : !llvm.ptr, i32 -> i64
+
+  %3 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {relaxed = true} : !llvm.ptr, i32 -> i64
+  %4 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cta>, relaxed = true} : !llvm.ptr, i32 -> i64
+  %5 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cluster>, relaxed = true} : !llvm.ptr, i32 -> i64
+  llvm.return
+}
+
+llvm.func @mbarrier_arrive_expect_tx_shared(%barrier: !llvm.ptr<3>, %txcount : i32) {
+  // CHECK-LABEL: define void @mbarrier_arrive_expect_tx_shared(ptr addrspace(3) %0, i32 %1) {
+  // CHECK-NEXT: %3 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: %4 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: %5 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: %6 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: %7 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: %8 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: ret void
+  // CHECK-NEXT: }
+  %0 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount : !llvm.ptr<3>, i32 -> i64
+  %1 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cta>} : !llvm.ptr<3>, i32 -> i64
+  %2 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cluster>} : !llvm.ptr<3>, i32 -> i64
+
+  %3 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {relaxed = true} : !llvm.ptr<3>, i32 -> i64
+  %4 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cta>, relaxed = true} : !llvm.ptr<3>, i32 -> i64
+  %5 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cluster>, relaxed = true} : !llvm.ptr<3>, i32 -> i64
+  llvm.return
+}
+
+llvm.func @mbarrier_arrive_expect_tx_shared_cluster(%barrier: !llvm.ptr<7>, %txcount : i32) {
+  // CHECK-LABEL: define void @mbarrier_arrive_expect_tx_shared_cluster(ptr addrspace(7) %0, i32 %1) {
+  // CHECK-NEXT: call void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %0, i32 %1)
+  // CHECK-NEXT: call void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %0, i32 %1)
+  // CHECK-NEXT: call void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %0, i32 %1)
+  // CHECK-NEXT: call void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %0, i32 %1)
+  // CHECK-NEXT: call void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %0, i32 %1)
+  // CHECK-NEXT: call void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %0, i32 %1)
+  // CHECK-NEXT: ret void
+  // CHECK-NEXT: }
+  nvvm.mbarrier.arrive.expect_tx %barrier, %txcount : !llvm.ptr<7>, i32
+  nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cta>} : !llvm.ptr<7>, i32
+  nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cluster>} : !llvm.ptr<7>, i32
+
+  nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {relaxed = true} : !llvm.ptr<7>, i32
+  nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cta>, relaxed = true} : !llvm.ptr<7>, i32
+  nvvm.mbarrier.arrive.expect_tx %barrier, %txcount {scope = #nvvm.mem_scope<cluster>, relaxed = true} : !llvm.ptr<7>, i32
+  llvm.return
+}
+
diff --git a/mlir/test/Target/LLVMIR/nvvm/mbar_init.mlir b/mlir/test/Target/LLVMIR/nvvm/mbar_init.mlir
index ae9c7f29bc7a5..9c1d1cc0cdc31 100644
--- a/mlir/test/Target/LLVMIR/nvvm/mbar_init.mlir
+++ b/mlir/test/Target/LLVMIR/nvvm/mbar_init.mlir
@@ -54,23 +54,3 @@ llvm.func @mbarrier_inval_shared(%barrier: !llvm.ptr<3>) {
   nvvm.mbarrier.inval %barrier : !llvm.ptr<3>
   llvm.return
 }
-
-llvm.func @mbarrier_test_wait(%barrier: !llvm.ptr, %token : i64) -> i1 {
-  // CHECK-LABEL: define i1 @mbarrier_test_wait(ptr %0, i64 %1) {
-  // CHECK-NEXT: %3 = call i1 @llvm.nvvm.mbarrier.test.wait(ptr %0, i64 %1)
-  // CHECK-NEXT: ret i1 %3
-  // CHECK-NEXT: }
-  %isComplete = nvvm.mbarrier.test.wait %barrier, %token : !llvm.ptr, i64 -> i1
-  llvm.return %isComplete : i1
-}
-
-llvm.func @mbarrier_test_wait_shared(%barrier: !llvm.ptr<3>, %token : i64) {
-  // CHECK-LABEL: define void @mbarrier_test_wait_shared(ptr addrspace(3) %0, i64 %1) {
-  // CHECK-NEXT: %3 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
-  // CHECK-NEXT: %4 = call i1 @llvm.nvvm.mbarrier.test.wait.shared(ptr addrspace(3) %0, i64 %1)
-  // CHECK-NEXT: ret void
-  // CHECK-NEXT: }
-  %count = nvvm.read.ptx.sreg.ntid.x : i32
-  %isComplete = nvvm.mbarrier.test.wait %barrier, %token : !llvm.ptr<3>, i64 -> i1
-  llvm.return
-}
diff --git a/mlir/test/Target/LLVMIR/nvvm/mbar_invalid.mlir b/mlir/test/Target/LLVMIR/nvvm/mbar_invalid.mlir
index 4ad76248b7e25..d8cb9853f3374 100644
--- a/mlir/test/Target/LLVMIR/nvvm/mbar_invalid.mlir
+++ b/mlir/test/Target/LLVMIR/nvvm/mbar_invalid.mlir
@@ -47,3 +47,76 @@ llvm.func @mbarrier_complete_tx_scope(%barrier: !llvm.ptr<3>, %tx_count: i32) {
   nvvm.mbarrier.complete_tx %barrier, %tx_count {scope = #nvvm.mem_scope<sys>} : !llvm.ptr<3>, i32
   llvm.return
 }
+
+// -----
+
+llvm.func @mbarrier_arr_expect_tx(%barrier: !llvm.ptr<3>, %tx_count: i32) {
+  // expected-error @below {{mbarrier scope must be either CTA or Cluster}}
+  %1 = nvvm.mbarrier.arrive.expect_tx %barrier, %tx_count {scope = #nvvm.mem_scope<gpu>} : !llvm.ptr<3>, i32 -> i64
+  llvm.return
+}
+
+// -----
+
+llvm.func @mbarrier_arr_expect_tx_cluster(%barrier: !llvm.ptr<7>, %tx_count: i32) {
+  // expected-error @below {{mbarrier in shared_cluster space cannot return any value}}
+  %1 = nvvm.mbarrier.arrive.expect_tx %barrier, %tx_count {scope = #nvvm.mem_scope<cta>} : !llvm.ptr<7>, i32 -> i64
+  llvm.return
+}
+
+// -----
+
+llvm.func @init_mbarrier_arrive_expect_tx_asm_ret(%barrier : !llvm.ptr<3>, %txcount : i32, %pred : i1) {
+  // expected-error @below {{return-value is not supported when using predicate}}
+  %1 = nvvm.mbarrier.arrive.expect_tx %barrier, %txcount, predicate = %pred : !llvm.ptr<3>, i32, i1 -> i64 
+  llvm.return
+}
+
+// -----
+
+llvm.func @init_mbarrier_arrive_expect_tx_asm_relaxed(%barrier : !llvm.ptr<3>, %txcount : i32, %pred : i1) {
+  // expected-error @below {{mbarrier with relaxed semantics is not supported when using predicate}}
+  nvvm.mbarrier.arrive.expect_tx %barrier, %txcount, predicate = %pred {relaxed = true} : !llvm.ptr<3>, i32, i1
+  llvm.return
+}
+
+// -----
+
+llvm.func @init_mbarrier_arrive_expect_tx_asm_cta(%barrier : !llvm.ptr<3>, %txcount : i32, %pred : i1) {
+  // expected-error @below {{mbarrier scope must be CTA when using predicate}}
+  nvvm.mbarrier.arrive.expect_tx %barrier, %txcount, predicate = %pred {scope = #nvvm.mem_scope<cluster>} : !llvm.ptr<3>, i32, i1
+  llvm.return
+}
+
+// -----
+
+llvm.func @init_mbarrier_arrive_expect_tx_asm_cluster(%barrier : !llvm.ptr<7>, %txcount : i32, %pred : i1) {
+  // expected-error @below {{mbarrier in shared_cluster space is not supported when using predicate}}
+  nvvm.mbarrier.arrive.expect_tx %barrier, %txcount, predicate = %pred : !llvm.ptr<7>, i32, i1
+  llvm.return
+}
+
+// -----
+
+llvm.func @mbarrier_arr_drop_expect_tx(%barrier: !llvm.ptr<3>, %tx_count: i32) {
+  // expected-error @below {{mbarrier scope must be either CTA or Cluster}}
+  %1 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %tx_count {scope = #nvvm.mem_scope<gpu>} : !llvm.ptr<3>, i32 -> i64
+  llvm.return
+}
+
+// -----
+
+llvm.func @mbarrier_arr_drop_expect_tx_cluster(%barrier: !llvm.ptr<7>, %tx_count: i32) {
+  // expected-error @below {{mbarrier in shared_cluster space cannot return any value}}
+  %1 = nvvm.mbarrier.arrive_drop.expect_tx %barrier, %tx_count {scope = #nvvm.mem_scope<cta>} : !llvm.ptr<7>, i32 -> i64
+  llvm.return
+}
+
+// -----
+
+llvm.func @mbarrier_test_wait(%barrier: !llvm.ptr<3>, %phase: i32) {
+  // expected-error @below {{mbarrier scope must be either CTA or Cluster}}
+  %1 = nvvm.mbarrier.test.wait %barrier, %phase {scope = #nvvm.mem_scope<gpu>} : !llvm.ptr<3>, i32 -> i1
+  llvm.return
+}
+
diff --git a/mlir/test/Target/LLVMIR/nvvm/mbar_test_wait.mlir b/mlir/test/Target/LLVMIR/nvvm/mbar_test_wait.mlir
new file mode 100644
index 0000000000000..21ab72eeab167
--- /dev/null
+++ b/mlir/test/Target/LLVMIR/nvvm/mbar_test_wait.mlir
@@ -0,0 +1,73 @@
+// RUN: mlir-translate -mlir-to-llvmir %s | FileCheck %s
+
+llvm.func @mbarrier_test_wait_state(%barrier: !llvm.ptr, %state : i64) {
+  // CHECK-LABEL: define void @mbarrier_test_wait_state(ptr %0, i64 %1) {
+  // CHECK-NEXT: %3 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %4 = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3) %3, i64 %1)
+  // CHECK-NEXT: %5 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %6 = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cluster.space.cta(ptr addrspace(3) %5, i64 %1)
+  // CHECK-NEXT: %7 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %8 = call i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cta.space.cta(ptr addrspace(3) %7, i64 %1)
+  // CHECK-NEXT: %9 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %10 = call i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3) %9, i64 %1)
+  // CHECK-NEXT: ret void
+  // CHECK-NEXT: }
+  %0 = nvvm.mbarrier.test.wait %barrier, %state : !llvm.ptr, i64 -> i1
+  %1 = nvvm.mbarrier.test.wait %barrier, %state {scope = #nvvm.mem_scope<cluster>} : !llvm.ptr, i64 -> i1
+
+  %2 = nvvm.mbarrier.test.wait %barrier, %state {relaxed = true} : !llvm.ptr, i64 -> i1
+  %3 = nvvm.mbarrier.test.wait %barrier, %state {relaxed = true, scope = #nvvm.mem_scope<cluster>} : !llvm.ptr, i64 -> i1
+  llvm.return
+}
+
+llvm.func @mbarrier_test_wait_shared_state(%barrier: !llvm.ptr<3>, %state : i64) {
+  // CHECK-LABEL: define void @mbarrier_test_wait_shared_state(ptr addrspace(3) %0, i64 %1) {
+  // CHECK-NEXT: %3 = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3) %0, i64 %1)
+  // CHECK-NEXT: %4 = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cluster.space.cta(ptr addrspace(3) %0, i64 %1)
+  // CHECK-NEXT: %5 = call i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cta.space.cta(ptr addrspace(3) %0, i64 %1)
+  // CHECK-NEXT: %6 = call i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3) %0, i64 %1)
+  // CHECK-NEXT: ret void
+  // CHECK-NEXT: }
+  %0 = nvvm.mbarrier.test.wait %barrier, %state : !llvm.ptr<3>, i64 -> i1
+  %1 = nvvm.mbarrier.test.wait %barrier, %state {scope = #nvvm.mem_scope<cluster>} : !llvm.ptr<3>, i64 -> i1
+
+  %2 = nvvm.mbarrier.test.wait %barrier, %state {relaxed = true} : !llvm.ptr<3>, i64 -> i1
+  %3 = nvvm.mbarrier.test.wait %barrier, %state {relaxed = true, scope = #nvvm.mem_scope<cluster>} : !llvm.ptr<3>, i64 -> i1
+  llvm.return
+}
+
+llvm.func @mbarrier_test_wait_phase(%barrier: !llvm.ptr, %phase : i32) {
+  // CHECK-LABEL: define void @mbarrier_test_wait_phase(ptr %0, i32 %1) {
+  // CHECK-NEXT: %3 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %4 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cta.space.cta(ptr addrspace(3) %3, i32 %1)
+  // CHECK-NEXT: %5 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %6 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cluster.space.cta(ptr addrspace(3) %5, i32 %1)
+  // CHECK-NEXT: %7 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %8 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3) %7, i32 %1)
+  // CHECK-NEXT: %9 = addrspacecast ptr %0 to ptr addrspace(3)
+  // CHECK-NEXT: %10 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3) %9, i32 %1)
+  // CHECK-NEXT: ret void
+  // CHECK-NEXT: }
+  %0 = nvvm.mbarrier.test.wait %barrier, %phase : !llvm.ptr, i32 -> i1
+  %1 = nvvm.mbarrier.test.wait %barrier, %phase {scope = #nvvm.mem_scope<cluster>} : !llvm.ptr, i32 -> i1
+
+  %2 = nvvm.mbarrier.test.wait %barrier, %phase {relaxed = true} : !llvm.ptr, i32 -> i1
+  %3 = nvvm.mbarrier.test.wait %barrier, %phase {relaxed = true, scope = #nvvm.mem_scope<cluster>} : !llvm.ptr, i32 -> i1
+  llvm.return
+}
+
+llvm.func @mbarrier_test_wait_shared_phase(%barrier: !llvm.ptr<3>, %phase : i32) {
+  // CHECK-LABEL: define void @mbarrier_test_wait_shared_phase(ptr addrspace(3) %0, i32 %1) {
+  // CHECK-NEXT: %3 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cta.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: %4 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cluster.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: %5 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: %6 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3) %0, i32 %1)
+  // CHECK-NEXT: ret void
+  // CHECK-NEXT: }
+  %0 = nvvm.mbarrier.test.wait %barrier, %phase : !llvm.ptr<3>, i32 -> i1
+  %1 = nvvm.mbarrier.test.wait %barrier, %phase {scope = #nvvm.mem_scope<cluster>} : !llvm.ptr<3>, i32 -> i1
+
+  %2 = nvvm.mbarrier.test.wait %barrier, %phase {relaxed = true} : !llvm.ptr<3>, i32 -> i1
+  %3 = nvvm.mbarrier.test.wait %barrier, %phase {relaxed = true, scope = #nvvm.mem_scope<cluster>} : !llvm.ptr<3>, i32 -> i1
+  llvm.return
+}
diff --git a/mlir/test/Target/LLVMIR/openmp-barrier-cancel.mlir b/mlir/test/Target/LLVMIR/openmp-barrier-cancel.mlir
index c4b245667a1f3..6585549de7f96 100644
--- a/mlir/test/Target/LLVMIR/openmp-barrier-cancel.mlir
+++ b/mlir/test/Target/LLVMIR/openmp-barrier-cancel.mlir
@@ -29,22 +29,24 @@ llvm.func @test() {
 // CHECK:         %[[VAL_14:.*]] = icmp eq i32 %[[VAL_13]], 0
 // CHECK:         br i1 %[[VAL_14]], label %[[VAL_15:.*]], label %[[VAL_16:.*]]
 // CHECK:       omp.par.region1.cncl:                             ; preds = %[[VAL_11]]
-// CHECK:         %[[VAL_17:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
-// CHECK:         %[[VAL_18:.*]] = call i32 @__kmpc_cancel_barrier(ptr @2, i32 %[[VAL_17]])
-// CHECK:         br label %[[VAL_19:.*]]
+// CHECK:         br label %[[FINI:.*]]
+// CHECK:       .fini:
+// CHECK:         %[[TID:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
+// CHECK:         %[[CNCL_BARRIER:.*]] = call i32 @__kmpc_cancel_barrier(ptr @2, i32 %[[TID]])
+// CHECK:         br label %[[EXIT_STUB:.*]]
 // CHECK:       omp.par.region1.split:                            ; preds = %[[VAL_11]]
 // CHECK:         %[[VAL_20:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
 // CHECK:         %[[VAL_21:.*]] = call i32 @__kmpc_cancel_barrier(ptr @3, i32 %[[VAL_20]])
 // CHECK:         %[[VAL_22:.*]] = icmp eq i32 %[[VAL_21]], 0
 // CHECK:         br i1 %[[VAL_22]], label %[[VAL_23:.*]], label %[[VAL_24:.*]]
 // CHECK:       omp.par.region1.split.cncl:                       ; preds = %[[VAL_15]]
-// CHECK:         br label %[[VAL_19]]
+// CHECK:         br label %[[FINI]]
 // CHECK:       omp.par.region1.split.cont:                       ; preds = %[[VAL_15]]
 // CHECK:         br label %[[VAL_25:.*]]
 // CHECK:       omp.region.cont:                                  ; preds = %[[VAL_23]]
 // CHECK:         br label %[[VAL_26:.*]]
 // CHECK:       omp.par.pre_finalize:                             ; preds = %[[VAL_25]]
-// CHECK:         br label %[[VAL_19]]
-// CHECK:       omp.par.exit.exitStub:                            ; preds = %[[VAL_26]], %[[VAL_24]], %[[VAL_16]]
+// CHECK:         br label %[[FINI]]
+// CHECK:       omp.par.exit.exitStub:
 // CHECK:         ret void
 
diff --git a/mlir/test/Target/LLVMIR/openmp-cancel.mlir b/mlir/test/Target/LLVMIR/openmp-cancel.mlir
index 21241702ad569..a6911f80d43b7 100644
--- a/mlir/test/Target/LLVMIR/openmp-cancel.mlir
+++ b/mlir/test/Target/LLVMIR/openmp-cancel.mlir
@@ -24,16 +24,18 @@ llvm.func @cancel_parallel() {
 // CHECK:         %[[VAL_15:.*]] = icmp eq i32 %[[VAL_14]], 0
 // CHECK:         br i1 %[[VAL_15]], label %[[VAL_16:.*]], label %[[VAL_17:.*]]
 // CHECK:       omp.par.region1.cncl:                             ; preds = %[[VAL_12]]
+// CHECK:         br label %[[VAL_20:.*]]
+// CHECK:       .fini:
 // CHECK:         %[[VAL_18:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
 // CHECK:         %[[VAL_19:.*]] = call i32 @__kmpc_cancel_barrier(ptr @2, i32 %[[VAL_18]])
-// CHECK:         br label %[[VAL_20:.*]]
+// CHECK:         br label %[[EXIT_STUB:.*]]
 // CHECK:       omp.par.region1.split:                            ; preds = %[[VAL_12]]
 // CHECK:         br label %[[VAL_21:.*]]
 // CHECK:       omp.region.cont:                                  ; preds = %[[VAL_16]]
 // CHECK:         br label %[[VAL_22:.*]]
 // CHECK:       omp.par.pre_finalize:                             ; preds = %[[VAL_21]]
 // CHECK:         br label %[[VAL_20]]
-// CHECK:       omp.par.exit.exitStub:                            ; preds = %[[VAL_22]], %[[VAL_17]]
+// CHECK:       omp.par.exit.exitStub:
 // CHECK:         ret void
 
 llvm.func @cancel_parallel_if(%arg0 : i1) {
@@ -58,27 +60,36 @@ llvm.func @cancel_parallel_if(%arg0 : i1) {
 // CHECK:       omp.par.region:                                   ; preds = %[[VAL_17]]
 // CHECK:         br label %[[VAL_20:.*]]
 // CHECK:       omp.par.region1:                                  ; preds = %[[VAL_19]]
-// CHECK:         br i1 %[[VAL_16]], label %[[VAL_21:.*]], label %[[VAL_22:.*]]
+// CHECK:         br i1 %[[VAL_16]], label %[[SPLIT:.*]], label %[[VAL_22:.*]]
 // CHECK:       3:                                                ; preds = %[[VAL_20]]
-// CHECK:         br label %[[VAL_23:.*]]
-// CHECK:       4:                                                ; preds = %[[VAL_22]], %[[VAL_24:.*]]
+// CHECK:         %[[GTN:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
+// CHECK:         %[[NOT_CANCELLED:.*]] = call i32 @__kmpc_cancellationpoint(ptr @1, i32 %[[GTN]], i32 1)
+// CHECK:         %[[COND:.*]] = icmp eq i32 %[[NOT_CANCELLED]], 0
+// CHECK:         br i1 %[[COND]], label %[[VAL_23:.*]], label %[[CNCL:.*]]
+// CHECK:       .cncl:
+// CHECK:         br label %[[FINI:.*]]
+// CHECK:       .fini:
+// CHECK:         %[[VAL_32:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
+// CHECK:         %[[VAL_33:.*]] = call i32 @__kmpc_cancel_barrier(ptr @2, i32 %[[VAL_32]])
+// CHECK:         br label %[[EXIT_STUB:.*]]
+// CHECK:       .split:
+// CHECK:         br label %[[SEVEN:.*]]
+// CHECK:       7:
 // CHECK:         br label %[[VAL_25:.*]]
-// CHECK:       omp.region.cont:                                  ; preds = %[[VAL_23]]
+// CHECK:       omp.region.cont:
 // CHECK:         br label %[[VAL_26:.*]]
 // CHECK:       omp.par.pre_finalize:                             ; preds = %[[VAL_25]]
 // CHECK:         br label %[[VAL_27:.*]]
-// CHECK:       5:                                                ; preds = %[[VAL_20]]
+// CHECK:       8:                                                ; preds = %[[VAL_20]]
 // CHECK:         %[[VAL_28:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
 // CHECK:         %[[VAL_29:.*]] = call i32 @__kmpc_cancel(ptr @1, i32 %[[VAL_28]], i32 1)
 // CHECK:         %[[VAL_30:.*]] = icmp eq i32 %[[VAL_29]], 0
-// CHECK:         br i1 %[[VAL_30]], label %[[VAL_24]], label %[[VAL_31:.*]]
-// CHECK:       .cncl:                                            ; preds = %[[VAL_21]]
-// CHECK:         %[[VAL_32:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
-// CHECK:         %[[VAL_33:.*]] = call i32 @__kmpc_cancel_barrier(ptr @2, i32 %[[VAL_32]])
-// CHECK:         br label %[[VAL_27]]
-// CHECK:       .split:                                           ; preds = %[[VAL_21]]
-// CHECK:         br label %[[VAL_23]]
-// CHECK:       omp.par.exit.exitStub:                            ; preds = %[[VAL_31]], %[[VAL_26]]
+// CHECK:         br i1 %[[VAL_30]], label %[[SPLIT5:.*]], label %[[VAL_31:.*]]
+// CHECK:       .cncl{{.*}}:
+// CHECK:         br label %[[FINI]]
+// CHECK:       .split{{.*}}:
+// CHECK:         br label %[[SEVEN]]
+// CHECK:       omp.par.exit.exitStub:
 // CHECK:         ret void
 
 llvm.func @cancel_sections_if(%cond : i1) {
@@ -132,11 +143,16 @@ llvm.func @cancel_sections_if(%cond : i1) {
 // CHECK:         %[[VAL_30:.*]] = call i32 @__kmpc_cancel(ptr @1, i32 %[[VAL_29]], i32 3)
 // CHECK:         %[[VAL_31:.*]] = icmp eq i32 %[[VAL_30]], 0
 // CHECK:         br i1 %[[VAL_31]], label %[[VAL_32:.*]], label %[[VAL_33:.*]]
-// CHECK:       .split:                                           ; preds = %[[VAL_27]]
+// CHECK:       .split{{.*}}:                                     ; preds = %[[VAL_27]]
 // CHECK:         br label %[[VAL_34:.*]]
 // CHECK:       12:                                               ; preds = %[[VAL_25]]
+// CHECK:         %[[GTN:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
+// CHECK:         %[[CANCEL_POINT:.*]] = call i32 @__kmpc_cancellationpoint(ptr @1, i32 %[[GTN]], i32 3)
+// CHECK:         %[[COND:.*]] = icmp eq i32 %[[CANCEL_POINT]], 0
+// CHECK:         br i1 %[[COND]], label %[[SPLIT:.*]], label %[[CNCL:.*]]
+// CHECK:       .split{{.*}}:
 // CHECK:         br label %[[VAL_34]]
-// CHECK:       13:                                               ; preds = %[[VAL_28]], %[[VAL_32]]
+// CHECK:       15:
 // CHECK:         br label %[[VAL_35:.*]]
 // CHECK:       omp.region.cont:                                  ; preds = %[[VAL_34]]
 // CHECK:         br label %[[VAL_23]]
@@ -145,17 +161,17 @@ llvm.func @cancel_sections_if(%cond : i1) {
 // CHECK:       omp_section_loop.inc:                             ; preds = %[[VAL_23]]
 // CHECK:         %[[VAL_15]] = add nuw i32 %[[VAL_14]], 1
 // CHECK:         br label %[[VAL_12]]
-// CHECK:       omp_section_loop.exit:                            ; preds = %[[VAL_33]], %[[VAL_16]]
+// CHECK:       omp_section_loop.exit:
 // CHECK:         call void @__kmpc_for_static_fini(ptr @1, i32 %[[VAL_7]])
 // CHECK:         %[[VAL_36:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
 // CHECK:         call void @__kmpc_barrier(ptr @2, i32 %[[VAL_36]])
 // CHECK:         br label %[[VAL_37:.*]]
 // CHECK:       omp_section_loop.after:                           ; preds = %[[VAL_19]]
-// CHECK:         br label %[[VAL_38:.*]]
-// CHECK:       omp_section_loop.aftersections.fini:              ; preds = %[[VAL_37]]
 // CHECK:         ret void
-// CHECK:       .cncl:                                            ; preds = %[[VAL_27]]
-// CHECK:         br label %[[VAL_19]]
+// CHECK:       .cncl:
+// CHECK:         br label %[[OMP_SECTION_LOOP_EXIT:.*]]
+// CHECK:       .cncl{{.*}}:
+// CHECK:         br label %[[OMP_SECTION_LOOP_EXIT:.*]]
 
 llvm.func @cancel_wsloop_if(%lb : i32, %ub : i32, %step : i32, %cond : i1) {
   omp.wsloop {
@@ -221,18 +237,23 @@ llvm.func @cancel_wsloop_if(%lb : i32, %ub : i32, %step : i32, %cond : i1) {
 // CHECK:         %[[VAL_47:.*]] = call i32 @__kmpc_cancel(ptr @1, i32 %[[VAL_46]], i32 2)
 // CHECK:         %[[VAL_48:.*]] = icmp eq i32 %[[VAL_47]], 0
 // CHECK:         br i1 %[[VAL_48]], label %[[VAL_49:.*]], label %[[VAL_50:.*]]
-// CHECK:       .split:                                           ; preds = %[[VAL_44]]
+// CHECK:       .split{{.*}}:
 // CHECK:         br label %[[VAL_51:.*]]
-// CHECK:       28:                                               ; preds = %[[VAL_42]]
+// CHECK:       28:
+// CHECK:         %[[GTN:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
+// CHECK:         %[[CANCEL_POINT:.*]] = call i32 @__kmpc_cancellationpoint(ptr @1, i32 %[[GTN]], i32 2)
+// CHECK:         %[[COND:.*]] = icmp eq i32 %[[CANCEL_POINT]], 0
+// CHECK:         br i1 %[[COND]], label %[[SPLIT3:.*]], label %[[CNCL4:.*]]
+// CHECK:       .split{{.*}}:
 // CHECK:         br label %[[VAL_51]]
-// CHECK:       29:                                               ; preds = %[[VAL_45]], %[[VAL_49]]
+// CHECK:       31:
 // CHECK:         br label %[[VAL_52:.*]]
 // CHECK:       omp.region.cont1:                                 ; preds = %[[VAL_51]]
 // CHECK:         br label %[[VAL_32]]
 // CHECK:       omp_loop.inc:                                     ; preds = %[[VAL_52]]
 // CHECK:         %[[VAL_34]] = add nuw i32 %[[VAL_33]], 1
 // CHECK:         br label %[[VAL_31]]
-// CHECK:       omp_loop.exit:                                    ; preds = %[[VAL_50]], %[[VAL_35]]
+// CHECK:       omp_loop.exit:
 // CHECK:         call void @__kmpc_for_static_fini(ptr @1, i32 %[[VAL_26]])
 // CHECK:         %[[VAL_53:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
 // CHECK:         call void @__kmpc_barrier(ptr @2, i32 %[[VAL_53]])
@@ -241,8 +262,12 @@ llvm.func @cancel_wsloop_if(%lb : i32, %ub : i32, %step : i32, %cond : i1) {
 // CHECK:         br label %[[VAL_55:.*]]
 // CHECK:       omp.region.cont:                                  ; preds = %[[VAL_54]]
 // CHECK:         ret void
-// CHECK:       .cncl:                                            ; preds = %[[VAL_44]]
-// CHECK:         br label %[[VAL_38]]
+// CHECK:       .cncl{{.*}}:
+// CHECK:         br label %[[FINI:.*]]
+// CHECK:       .fini:
+// CHECK:         br label %[[OMP_LOOP_EXIT:.*]]
+// CHECK:       .cncl{{.*}}:
+// CHECK:         br label %[[FINI:.*]]
 
 omp.private {type = firstprivate} @i32_priv : i32 copy {
 ^bb0(%arg0: !llvm.ptr, %arg1: !llvm.ptr):
diff --git a/mlir/test/Target/LLVMIR/openmp-cancellation-point.mlir b/mlir/test/Target/LLVMIR/openmp-cancellation-point.mlir
index 5e0d3f9f7e293..93fa2064ab99a 100644
--- a/mlir/test/Target/LLVMIR/openmp-cancellation-point.mlir
+++ b/mlir/test/Target/LLVMIR/openmp-cancellation-point.mlir
@@ -24,16 +24,18 @@ llvm.func @cancellation_point_parallel() {
 // CHECK:         %[[VAL_15:.*]] = icmp eq i32 %[[VAL_14]], 0
 // CHECK:         br i1 %[[VAL_15]], label %[[VAL_16:.*]], label %[[VAL_17:.*]]
 // CHECK:       omp.par.region1.cncl:                             ; preds = %[[VAL_12]]
+// CHECK:         br label %[[FINI:.*]]
+// CHECK:       .fini:
 // CHECK:         %[[VAL_18:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
 // CHECK:         %[[VAL_19:.*]] = call i32 @__kmpc_cancel_barrier(ptr @2, i32 %[[VAL_18]])
-// CHECK:         br label %[[VAL_20:.*]]
+// CHECK:         br label %[[EXIT_STUB:.*]]
 // CHECK:       omp.par.region1.split:                            ; preds = %[[VAL_12]]
 // CHECK:         br label %[[VAL_21:.*]]
 // CHECK:       omp.region.cont:                                  ; preds = %[[VAL_16]]
 // CHECK:         br label %[[VAL_22:.*]]
 // CHECK:       omp.par.pre_finalize:                             ; preds = %[[VAL_21]]
-// CHECK:         br label %[[VAL_20]]
-// CHECK:       omp.par.exit.exitStub:                            ; preds = %[[VAL_22]], %[[VAL_17]]
+// CHECK:         br label %[[FINI]]
+// CHECK:       omp.par.exit.exitStub:
 // CHECK:         ret void
 
 llvm.func @cancellation_point_sections() {
@@ -94,14 +96,12 @@ llvm.func @cancellation_point_sections() {
 // CHECK:       omp_section_loop.inc:                             ; preds = %[[VAL_46]]
 // CHECK:         %[[VAL_38]] = add nuw i32 %[[VAL_37]], 1
 // CHECK:         br label %[[VAL_35]]
-// CHECK:       omp_section_loop.exit:                            ; preds = %[[VAL_53]], %[[VAL_39]]
+// CHECK:       omp_section_loop.exit:
 // CHECK:         call void @__kmpc_for_static_fini(ptr @1, i32 %[[VAL_30]])
 // CHECK:         %[[VAL_55:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
 // CHECK:         call void @__kmpc_barrier(ptr @2, i32 %[[VAL_55]])
 // CHECK:         br label %[[VAL_56:.*]]
 // CHECK:       omp_section_loop.after:                           ; preds = %[[VAL_42]]
-// CHECK:         br label %[[VAL_57:.*]]
-// CHECK:       omp_section_loop.aftersections.fini:              ; preds = %[[VAL_56]]
 // CHECK:         ret void
 // CHECK:       omp.section.region.cncl:                          ; preds = %[[VAL_48]]
 // CHECK:         br label %[[VAL_42]]
@@ -175,7 +175,7 @@ llvm.func @cancellation_point_wsloop(%lb : i32, %ub : i32, %step : i32) {
 // CHECK:       omp_loop.inc:                                     ; preds = %[[VAL_106]]
 // CHECK:         %[[VAL_92]] = add nuw i32 %[[VAL_91]], 1
 // CHECK:         br label %[[VAL_89]]
-// CHECK:       omp_loop.exit:                                    ; preds = %[[VAL_105]], %[[VAL_93]]
+// CHECK:       omp_loop.exit:
 // CHECK:         call void @__kmpc_for_static_fini(ptr @1, i32 %[[VAL_84]])
 // CHECK:         %[[VAL_107:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
 // CHECK:         call void @__kmpc_barrier(ptr @2, i32 %[[VAL_107]])
diff --git a/mlir/test/Target/LLVMIR/openmp-outline-infinite-loop.mlir b/mlir/test/Target/LLVMIR/openmp-outline-infinite-loop.mlir
index faccfc678adfe..99f37c7e79be8 100644
--- a/mlir/test/Target/LLVMIR/openmp-outline-infinite-loop.mlir
+++ b/mlir/test/Target/LLVMIR/openmp-outline-infinite-loop.mlir
@@ -21,9 +21,11 @@ llvm.func @parallel_infinite_loop() -> () {
 // CHECK:       omp.region.cont:                                  ; No predecessors!
 // CHECK:         br label %[[VAL_4:.*]]
 // CHECK:       omp.par.pre_finalize:                             ; preds = %[[VAL_5:.*]]
-// CHECK:         br label %[[VAL_6:.*]]
-// CHECK:       omp.par.exit:                                     ; preds = %[[VAL_4]]
+// CHECK:         br label %[[FINI:.*]]
+// CHECK:       [[OMP_PAR_EXIT:omp.par.exit]]:                                     ; preds = %[[FINI]]
 // CHECK:         ret void
+// CHECK:       [[FINI]]:
+// CHECK:         br label %[[OMP_PAR_EXIT]]
 // CHECK:       }
 
 // CHECK-LABEL: define internal void @parallel_infinite_loop..omp_par(
diff --git a/mlir/test/Target/LLVMIR/openmp-parallel-reduction-multiblock.mlir b/mlir/test/Target/LLVMIR/openmp-parallel-reduction-multiblock.mlir
index 887d2977e45cc..c79c369b69d7f 100644
--- a/mlir/test/Target/LLVMIR/openmp-parallel-reduction-multiblock.mlir
+++ b/mlir/test/Target/LLVMIR/openmp-parallel-reduction-multiblock.mlir
@@ -108,6 +108,8 @@ llvm.func @missordered_blocks_(%arg0: !llvm.ptr {fir.bindc_name = "x"}, %arg1: !
 // CHECK:       reduce.finalize:                                  ; preds = %[[VAL_49]], %[[VAL_43]]
 // CHECK:         br label %[[VAL_53:.*]]
 // CHECK:       omp.par.pre_finalize:                             ; preds = %[[VAL_48]]
+// CHECK:         br label %[[FINI:.*]]
+// CHECK:       .fini:
 // CHECK:         %[[VAL_54:.*]] = load ptr, ptr %[[VAL_20]], align 8
 // CHECK:         %[[VAL_55:.*]] = load ptr, ptr %[[VAL_21]], align 8
 // CHECK:         br label %[[VAL_56:.*]]
@@ -115,5 +117,5 @@ llvm.func @missordered_blocks_(%arg0: !llvm.ptr {fir.bindc_name = "x"}, %arg1: !
 // CHECK:         br label %[[VAL_38]]
 // CHECK:       omp.reduction.neutral1:                           ; preds = %[[VAL_25]]
 // CHECK:         br label %[[VAL_30]]
-// CHECK:       omp.par.exit.exitStub:                            ; preds = %[[VAL_53]]
+// CHECK:       omp.par.exit.exitStub:                            ; preds = %[[FINI]]
 // CHECK:         ret void
diff --git a/mlir/test/Target/LLVMIR/openmp-reduction-array-sections.mlir b/mlir/test/Target/LLVMIR/openmp-reduction-array-sections.mlir
index b302b4b20edd5..13f52f054869e 100644
--- a/mlir/test/Target/LLVMIR/openmp-reduction-array-sections.mlir
+++ b/mlir/test/Target/LLVMIR/openmp-reduction-array-sections.mlir
@@ -127,8 +127,6 @@ llvm.func @sectionsreduction_(%arg0: !llvm.ptr {fir.bindc_name = "x"}) attribute
 // CHECK:         call void @__kmpc_barrier(ptr @2, i32 %[[VAL_36]])
 // CHECK:         br label %[[VAL_37:.*]]
 // CHECK:       omp_section_loop.after:                           ; preds = %[[VAL_35]]
-// CHECK:         br label %[[VAL_38:.*]]
-// CHECK:       omp_section_loop.aftersections.fini:              ; preds = %[[VAL_37]]
 // CHECK:         %[[VAL_39:.*]] = getelementptr inbounds [1 x ptr], ptr %[[VAL_14]], i64 0, i64 0
 // CHECK:         store ptr %[[VAL_21]], ptr %[[VAL_39]], align 8
 // CHECK:         %[[VAL_40:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
@@ -137,9 +135,9 @@ llvm.func @sectionsreduction_(%arg0: !llvm.ptr {fir.bindc_name = "x"}) attribute
 // CHECK:           i32 1, label %[[VAL_43:.*]]
 // CHECK:           i32 2, label %[[VAL_44:.*]]
 // CHECK:         ]
-// CHECK:       reduce.switch.atomic:                             ; preds = %[[VAL_38]]
+// CHECK:       reduce.switch.atomic:                             ; preds = %[[VAL_37]]
 // CHECK:         unreachable
-// CHECK:       reduce.switch.nonatomic:                          ; preds = %[[VAL_38]]
+// CHECK:       reduce.switch.nonatomic:                          ; preds = %[[VAL_37]]
 // CHECK:         %[[VAL_45:.*]] = load ptr, ptr %[[VAL_21]], align 8
 // CHECK:         br label %[[VAL_46:.*]]
 // CHECK:       omp.reduction.nonatomic.body:                     ; preds = %[[VAL_43]]
@@ -157,7 +155,7 @@ llvm.func @sectionsreduction_(%arg0: !llvm.ptr {fir.bindc_name = "x"}) attribute
 // CHECK:       omp.reduction.nonatomic.body17:                   ; preds = %[[VAL_47]]
 // CHECK:         %[[VAL_50]] = sub i64 %[[VAL_49]], 1
 // CHECK:         br label %[[VAL_47]]
-// CHECK:       reduce.finalize:                                  ; preds = %[[VAL_53]], %[[VAL_38]]
+// CHECK:       reduce.finalize:                                  ; preds = %[[VAL_53]], %[[VAL_37]]
 // CHECK:         %[[VAL_55:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
 // CHECK:         call void @__kmpc_barrier(ptr @2, i32 %[[VAL_55]])
 // CHECK:         %[[VAL_56:.*]] = load ptr, ptr %[[VAL_21]], align 8
@@ -173,7 +171,9 @@ llvm.func @sectionsreduction_(%arg0: !llvm.ptr {fir.bindc_name = "x"}) attribute
 // CHECK:       omp.region.cont:                                  ; preds = %[[VAL_62]]
 // CHECK:         br label %[[VAL_64:.*]]
 // CHECK:       omp.par.pre_finalize:                             ; preds = %[[VAL_63]]
-// CHECK:         br label %[[VAL_65:.*]]
+// CHECK:         br label %[[FINI:.fini.*]]
+// CHECK:       [[FINI]]:
+// CHECK:         br label %[[EXIT:.*]]
 // CHECK:       omp.reduction.cleanup21:                          ; preds = %[[VAL_57]]
 // CHECK:         br label %[[VAL_61]]
 // CHECK:       omp_section_loop.body:                            ; preds = %[[VAL_32]]
@@ -219,5 +219,5 @@ llvm.func @sectionsreduction_(%arg0: !llvm.ptr {fir.bindc_name = "x"}) attribute
 // CHECK:       omp_section_loop.inc:                             ; preds = %[[VAL_69]]
 // CHECK:         %[[VAL_31]] = add nuw i32 %[[VAL_30]], 1
 // CHECK:         br label %[[VAL_28]]
-// CHECK:       omp.par.exit.exitStub:                            ; preds = %[[VAL_64]]
+// CHECK:       omp.par.exit.exitStub:                            ; preds = %[[FINI]]
 // CHECK:         ret void
diff --git a/mlir/test/Target/LLVMIR/openmp-reduction-init-arg.mlir b/mlir/test/Target/LLVMIR/openmp-reduction-init-arg.mlir
index a714ca68a1e95..cb30d3b2f4473 100644
--- a/mlir/test/Target/LLVMIR/openmp-reduction-init-arg.mlir
+++ b/mlir/test/Target/LLVMIR/openmp-reduction-init-arg.mlir
@@ -96,8 +96,10 @@ module {
 // CHECK:       reduce.finalize:                                  ; preds = %[[VAL_34]], %[[VAL_28]]
 // CHECK:         br label %[[VAL_38:.*]]
 // CHECK:       omp.par.pre_finalize:                             ; preds = %[[VAL_33]]
+// CHECK:         br label %[[FINI:.*]]
+// CHECK:       [[FINI]]:
 // CHECK:         br label %[[VAL_39:.*]]
-// CHECK:       omp.par.exit.exitStub:                            ; preds = %[[VAL_38]]
+// CHECK:       omp.par.exit.exitStub:                            ; preds = %[[FINI]]
 // CHECK:         ret void
 // CHECK:         %[[VAL_40:.*]] = getelementptr inbounds [2 x ptr], ptr %[[VAL_41:.*]], i64 0, i64 0
 // CHECK:         %[[VAL_42:.*]] = load ptr, ptr %[[VAL_40]], align 8
diff --git a/mlir/test/Target/LLVMIR/openmp-reduction-sections.mlir b/mlir/test/Target/LLVMIR/openmp-reduction-sections.mlir
index 19da6f8517fcd..00f6c1b02206e 100644
--- a/mlir/test/Target/LLVMIR/openmp-reduction-sections.mlir
+++ b/mlir/test/Target/LLVMIR/openmp-reduction-sections.mlir
@@ -86,8 +86,6 @@ llvm.func @sections_(%arg0: !llvm.ptr {fir.bindc_name = "x"}) attributes {fir.in
 // CHECK:         call void @__kmpc_barrier(ptr @2, i32 %[[VAL_40]])
 // CHECK:         br label %[[VAL_41:.*]]
 // CHECK:       omp_section_loop.after:                           ; preds = %[[VAL_39]]
-// CHECK:         br label %[[VAL_42:.*]]
-// CHECK:       omp_section_loop.aftersections.fini:              ; preds = %[[VAL_41]]
 // CHECK:         %[[VAL_43:.*]] = getelementptr inbounds [1 x ptr], ptr %[[VAL_21]], i64 0, i64 0
 // CHECK:         store ptr %[[VAL_20]], ptr %[[VAL_43]], align 8
 // CHECK:         %[[VAL_44:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
@@ -96,23 +94,25 @@ llvm.func @sections_(%arg0: !llvm.ptr {fir.bindc_name = "x"}) attributes {fir.in
 // CHECK:           i32 1, label %[[VAL_47:.*]]
 // CHECK:           i32 2, label %[[VAL_48:.*]]
 // CHECK:         ]
-// CHECK:       reduce.switch.atomic:                             ; preds = %[[VAL_42]]
+// CHECK:       reduce.switch.atomic:                             ; preds = %[[VAL_41]]
 // CHECK:         unreachable
-// CHECK:       reduce.switch.nonatomic:                          ; preds = %[[VAL_42]]
+// CHECK:       reduce.switch.nonatomic:                          ; preds = %[[VAL_41]]
 // CHECK:         %[[VAL_49:.*]] = load float, ptr %[[VAL_11]], align 4
 // CHECK:         %[[VAL_50:.*]] = load float, ptr %[[VAL_20]], align 4
 // CHECK:         %[[VAL_51:.*]] = fadd contract float %[[VAL_49]], %[[VAL_50]]
 // CHECK:         store float %[[VAL_51]], ptr %[[VAL_11]], align 4
 // CHECK:         call void @__kmpc_end_reduce(ptr @1, i32 %[[VAL_44]], ptr @.gomp_critical_user_.reduction.var)
 // CHECK:         br label %[[VAL_46]]
-// CHECK:       reduce.finalize:                                  ; preds = %[[VAL_47]], %[[VAL_42]]
+// CHECK:       reduce.finalize:                                  ; preds = %[[VAL_47]], %[[VAL_41]]
 // CHECK:         %[[VAL_52:.*]] = call i32 @__kmpc_global_thread_num(ptr @1)
 // CHECK:         call void @__kmpc_barrier(ptr @2, i32 %[[VAL_52]])
 // CHECK:         br label %[[VAL_53:.*]]
 // CHECK:       omp.region.cont:                                  ; preds = %[[VAL_46]]
 // CHECK:         br label %[[VAL_54:.*]]
 // CHECK:       omp.par.pre_finalize:                             ; preds = %[[VAL_53]]
-// CHECK:         br label %[[VAL_55:.*]]
+// CHECK:         br label %[[FINI:.fini.*]]
+// CHECK:       [[FINI]]:
+// CHECK:         br label %[[EXIT:.*]]
 // CHECK:       omp_section_loop.body:                            ; preds = %[[VAL_36]]
 // CHECK:         %[[VAL_56:.*]] = add i32 %[[VAL_34]], %[[VAL_28]]
 // CHECK:         %[[VAL_57:.*]] = mul i32 %[[VAL_56]], 1
@@ -144,8 +144,10 @@ llvm.func @sections_(%arg0: !llvm.ptr {fir.bindc_name = "x"}) attributes {fir.in
 // CHECK:       omp_section_loop.inc:                             ; preds = %[[VAL_59]]
 // CHECK:         %[[VAL_35]] = add nuw i32 %[[VAL_34]], 1
 // CHECK:         br label %[[VAL_32]]
-// CHECK:       omp.par.exit.exitStub:                            ; preds = %[[VAL_54]]
+// CHECK:       omp.par.exit.exitStub:                            ; preds = %[[FINI]]
 // CHECK:         ret void
+
+// CHECK-LABEL: define internal void @.omp.reduction.func
 // CHECK:         %[[VAL_70:.*]] = getelementptr inbounds [1 x ptr], ptr %[[VAL_71:.*]], i64 0, i64 0
 // CHECK:         %[[VAL_72:.*]] = load ptr, ptr %[[VAL_70]], align 8
 // CHECK:         %[[VAL_73:.*]] = load float, ptr %[[VAL_72]], align 4
diff --git a/mlir/test/Target/SPIRV/consecutive-selection.spv b/mlir/test/Target/SPIRV/consecutive-selection.spvasm
similarity index 100%
rename from mlir/test/Target/SPIRV/consecutive-selection.spv
rename to mlir/test/Target/SPIRV/consecutive-selection.spvasm
diff --git a/mlir/test/Target/SPIRV/decorations.mlir b/mlir/test/Target/SPIRV/decorations.mlir
index 712fd17623402..29b5d4fd5c743 100644
--- a/mlir/test/Target/SPIRV/decorations.mlir
+++ b/mlir/test/Target/SPIRV/decorations.mlir
@@ -77,6 +77,13 @@ spirv.module Logical GLSL450 requires #spirv.vce<v1.0, [Shader, Linkage], []> {
 
 // -----
 
+spirv.module Logical GLSL450 requires #spirv.vce<v1.0, [Shader, Linkage], []> {
+  // CHECK: coherent
+  spirv.GlobalVariable @var {coherent} : !spirv.ptr<vector<2xf32>, Output>
+}
+
+// -----
+
 spirv.module Logical GLSL450 requires #spirv.vce<v1.0, [Shader, Linkage], []> {
   // CHECK: linkage_attributes = #spirv.linkage_attributes<linkage_name = "outSideGlobalVar1", linkage_type = <Import>>
   spirv.GlobalVariable @var1 {
diff --git a/mlir/test/Target/SPIRV/selection.mlir b/mlir/test/Target/SPIRV/selection.mlir
index 3f762920015aa..d0ad118b01c8a 100644
--- a/mlir/test/Target/SPIRV/selection.mlir
+++ b/mlir/test/Target/SPIRV/selection.mlir
@@ -288,3 +288,61 @@ spirv.module Logical GLSL450 requires #spirv.vce<v1.0, [Shader], []> {
   spirv.EntryPoint "GLCompute" @main
   spirv.ExecutionMode @main "LocalSize", 1, 1, 1
 }
+
+// -----
+
+// Selection with switch and block operands
+
+spirv.module Logical GLSL450 requires #spirv.vce<v1.5, [Shader], []> {
+// CHECK-LABEL: @selection_switch_operands
+  spirv.func @selection_switch_operands(%selector : si32) "None" {
+    %cst1 = spirv.Constant 1.000000e+00 : f32
+    %vec0 = spirv.Undef : vector<3xf32>
+// CHECK: {{%.*}} = spirv.CompositeInsert {{%.*}}, {{%.*}}[0 : i32] : f32 into vector<3xf32>
+    %vec1 = spirv.CompositeInsert %cst1, %vec0[0 : i32] : f32 into vector<3xf32>
+    spirv.Branch ^bb1
+  ^bb1:
+// CHECK: {{%.*}} = spirv.mlir.selection -> vector<3xf32> {
+    %vec4 = spirv.mlir.selection -> vector<3xf32> {
+// CHECK-NEXT: spirv.Switch {{%.*}} : si32, [
+// CHECK-NEXT: default: ^[[DEFAULT:.+]]({{%.*}} : vector<3xf32>),
+// CHECK-NEXT: 0: ^[[CASE0:.+]]({{%.*}} : vector<3xf32>),
+// CHECK-NEXT: 1: ^[[CASE1:.+]]({{%.*}} : vector<3xf32>)
+      spirv.Switch %selector : si32, [
+        default: ^bb3(%vec1 : vector<3xf32>),
+        0: ^bb1(%vec1 : vector<3xf32>),
+        1: ^bb2(%vec1 : vector<3xf32>)
+      ]
+// CHECK: ^[[CASE0]]({{%.*}}: vector<3xf32>)
+    ^bb1(%vecbb1: vector<3xf32>):
+      %cst3 = spirv.Constant 3.000000e+00 : f32
+// CHECK: {{%.*}} = spirv.CompositeInsert {{%.*}}, {{%.*}}[1 : i32] : f32 into vector<3xf32>
+      %vec2 = spirv.CompositeInsert %cst3, %vecbb1[1 : i32] : f32 into vector<3xf32>
+// CHECK-NEXT: spirv.Branch ^[[DEFAULT]]({{%.*}} : vector<3xf32>)
+      spirv.Branch ^bb3(%vec2 : vector<3xf32>)
+// CHECK-NEXT: ^[[CASE1]]({{%.*}}: vector<3xf32>)
+    ^bb2(%vecbb2: vector<3xf32>):
+      %cst4 = spirv.Constant 4.000000e+00 : f32
+// CHECK: {{%.*}} = spirv.CompositeInsert {{%.*}}, {{%.*}}[1 : i32] : f32 into vector<3xf32>
+      %vec3 = spirv.CompositeInsert %cst4, %vecbb2[1 : i32] : f32 into vector<3xf32>
+// CHECK-NEXT: spirv.Branch ^[[DEFAULT]]({{%.*}} : vector<3xf32>)
+      spirv.Branch ^bb3(%vec3 : vector<3xf32>)
+// CHECK-NEXT: ^[[DEFAULT]]({{%.*}}: vector<3xf32>)
+    ^bb3(%vecbb3: vector<3xf32>):
+// CHECK-NEXT: spirv.mlir.merge {{%.*}} : vector<3xf32>
+      spirv.mlir.merge %vecbb3 : vector<3xf32>
+// CHECK-NEXT: }
+    }
+    %cst2 = spirv.Constant 2.000000e+00 : f32
+// CHECK: {{%.*}} = spirv.CompositeInsert {{%.*}}, {{%.*}}[2 : i32] : f32 into vector<3xf32>
+    %vec5 = spirv.CompositeInsert %cst2, %vec4[2 : i32] : f32 into vector<3xf32>
+    spirv.Return
+  }
+
+  spirv.func @main() -> () "None" {
+    spirv.Return
+  }
+
+  spirv.EntryPoint "GLCompute" @main
+  spirv.ExecutionMode @main "LocalSize", 1, 1, 1
+}
diff --git a/mlir/test/Target/SPIRV/selection.spv b/mlir/test/Target/SPIRV/selection.spvasm
similarity index 100%
rename from mlir/test/Target/SPIRV/selection.spv
rename to mlir/test/Target/SPIRV/selection.spvasm
diff --git a/mlir/test/Target/SPIRV/selection_switch.spvasm b/mlir/test/Target/SPIRV/selection_switch.spvasm
new file mode 100644
index 0000000000000..81fecf307eb7a
--- /dev/null
+++ b/mlir/test/Target/SPIRV/selection_switch.spvasm
@@ -0,0 +1,69 @@
+; RUN: %if spirv-tools %{ spirv-as --target-env spv1.0 %s -o - | mlir-translate --deserialize-spirv - -o - | FileCheck %s %}
+
+; This test is analogous to selection.spvasm but tests the switch op.
+
+; CHECK:      spirv.module Logical GLSL450 requires #spirv.vce<v1.0, [Shader], []> {
+; CHECK-NEXT:   spirv.func @switch({{%.*}}: si32) "None" {
+; CHECK:          {{%.*}} = spirv.Constant 1.000000e+00 : f32
+; CHECK-NEXT:     {{%.*}} = spirv.Undef : vector<3xf32>
+; CHECK-NEXT:     {{%.*}} = spirv.CompositeInsert {{%.*}}, {{%.*}}[0 : i32] : f32 into vector<3xf32>
+; CHECK-NEXT:     spirv.Branch ^[[bb:.+]]
+; CHECK-NEXT:   ^[[bb:.+]]:
+; CHECK-NEXT:     {{%.*}} = spirv.mlir.selection -> vector<3xf32> {
+; CHECK-NEXT:     spirv.Switch {{%.*}} : si32, [
+; CHECK-NEXT:       default: ^[[bb:.+]]({{%.*}}: vector<3xf32>),
+; CHECK-NEXT:       0: ^[[bb:.+]]({{%.*}}: vector<3xf32>),
+; CHECK-NEXT:       1: ^[[bb:.+]]({{%.*}}: vector<3xf32>)
+; CHECK:          ^[[bb:.+]]({{%.*}}: vector<3xf32>):
+; CHECK:            spirv.Branch ^[[bb:.+]]({{%.*}}: vector<3xf32>)
+; CHECK-NEXT:     ^[[bb:.+]]({{%.*}}: vector<3xf32>):
+; CHECK:            spirv.Branch ^[[bb:.+]]({{%.*}}: vector<3xf32>)
+; CHECK-NEXT:     ^[[bb:.+]]({{%.*}}: vector<3xf32>):
+; CHECK-NEXT:       spirv.mlir.merge {{%.*}} : vector<3xf32>
+; CHECK-NEXT:     }
+; CHECK:          spirv.Return
+; CHECK-NEXT:   }
+; CHECK:      }
+
+               OpCapability Shader
+               OpMemoryModel Logical GLSL450
+               OpEntryPoint GLCompute %main "main"
+               OpExecutionMode %main LocalSize 1 1 1
+               OpName %switch "switch"
+               OpName %main "main"
+       %void = OpTypeVoid
+        %int = OpTypeInt 32 1
+          %1 = OpTypeFunction %void %int
+      %float = OpTypeFloat 32
+    %float_1 = OpConstant %float 1
+    %v3float = OpTypeVector %float 3
+          %9 = OpUndef %v3float
+    %float_3 = OpConstant %float 3
+    %float_4 = OpConstant %float 4
+    %float_2 = OpConstant %float 2
+         %25 = OpTypeFunction %void
+     %switch = OpFunction %void None %1
+          %5 = OpFunctionParameter %int
+          %6 = OpLabel
+               OpBranch %12
+         %12 = OpLabel
+         %11 = OpCompositeInsert %v3float %float_1 %9 0
+               OpSelectionMerge %15 None
+               OpSwitch %5 %15 0 %13 1 %14
+         %13 = OpLabel
+         %16 = OpPhi %v3float %11 %12
+         %18 = OpCompositeInsert %v3float %float_3 %16 1
+               OpBranch %15
+         %14 = OpLabel
+         %19 = OpPhi %v3float %11 %12
+         %21 = OpCompositeInsert %v3float %float_4 %19 1
+               OpBranch %15
+         %15 = OpLabel
+         %22 = OpPhi %v3float %21 %14 %18 %13 %11 %12
+         %24 = OpCompositeInsert %v3float %float_2 %22 2
+               OpReturn
+               OpFunctionEnd
+       %main = OpFunction %void None %25
+         %27 = OpLabel
+               OpReturn
+               OpFunctionEnd
diff --git a/mlir/test/Transforms/remove-dead-values.mlir b/mlir/test/Transforms/remove-dead-values.mlir
index 4bae85dcf4f7d..71306676d48e9 100644
--- a/mlir/test/Transforms/remove-dead-values.mlir
+++ b/mlir/test/Transforms/remove-dead-values.mlir
@@ -118,6 +118,17 @@ func.func @main(%arg0 : i32) {
 
 // -----
 
+// CHECK-LABEL: func.func private @clean_func_op_remove_side_effecting_op() {
+// CHECK-NEXT:    return
+// CHECK-NEXT:  }
+func.func private @clean_func_op_remove_side_effecting_op(%arg0: i32) -> (i32) {
+  // vector.print has a side effect but the op is dead.
+  vector.print %arg0 : i32
+  return %arg0 : i32
+}
+
+// -----
+
 // %arg0 is not live because it is never used. %arg1 is not live because its
 // user `arith.addi` doesn't have any uses and the value that it is forwarded to
 // (%non_live_0) also doesn't have any uses.
@@ -687,3 +698,19 @@ func.func @op_block_have_dead_arg(%arg0: index, %arg1: index, %arg2: i1) {
   // CHECK-NEXT: return
   return
 }
+
+// -----
+
+// CHECK-LABEL: func private @remove_dead_branch_op()
+// CHECK-NEXT:    ub.unreachable
+// CHECK-NEXT:  ^{{.*}}:
+// CHECK-NEXT:    return
+// CHECK-NEXT:  ^{{.*}}:
+// CHECK-NEXT:    return
+func.func private @remove_dead_branch_op(%c: i1, %arg0: i64, %arg1: i64) -> (i64) {
+  cf.cond_br %c, ^bb1, ^bb2
+^bb1:
+  return %arg0 : i64
+^bb2:
+  return %arg1 : i64
+}
diff --git a/mlir/test/lib/Dialect/OpenACC/TestPointerLikeTypeInterface.cpp b/mlir/test/lib/Dialect/OpenACC/TestPointerLikeTypeInterface.cpp
index 027b0a1a8b80b..3ff0dc85b2152 100644
--- a/mlir/test/lib/Dialect/OpenACC/TestPointerLikeTypeInterface.cpp
+++ b/mlir/test/lib/Dialect/OpenACC/TestPointerLikeTypeInterface.cpp
@@ -46,7 +46,7 @@ struct TestPointerLikeTypeInterfacePass
 
   Pass::Option<std::string> testMode{
       *this, "test-mode",
-      llvm::cl::desc("Test mode: walk, alloc, copy, or free"),
+      llvm::cl::desc("Test mode: walk, alloc, copy, free, load, or store"),
       llvm::cl::init("walk")};
 
   StringRef getArgument() const override {
@@ -75,6 +75,10 @@ struct TestPointerLikeTypeInterfacePass
   void testGenCopy(Operation *srcOp, Operation *destOp, Value srcResult,
                    Value destResult, PointerLikeType pointerType,
                    OpBuilder &builder);
+  void testGenLoad(Operation *op, Value result, PointerLikeType pointerType,
+                   OpBuilder &builder);
+  void testGenStore(Operation *op, Value result, PointerLikeType pointerType,
+                    OpBuilder &builder, Value providedValue = {});
 
   struct PointerCandidate {
     Operation *op;
@@ -92,9 +96,12 @@ void TestPointerLikeTypeInterfacePass::runOnOperation() {
   auto func = getOperation();
   OpBuilder builder(&getContext());
 
-  if (testMode == "alloc" || testMode == "free") {
+  if (testMode == "alloc" || testMode == "free" || testMode == "load" ||
+      testMode == "store") {
     // Collect all candidates first
     SmallVector<PointerCandidate> candidates;
+    // For store mode, also look for a test value to use
+    Value testValue;
     func.walk([&](Operation *op) {
       if (op->hasAttr("test.ptr")) {
         for (auto result : op->getResults()) {
@@ -105,6 +112,11 @@ void TestPointerLikeTypeInterfacePass::runOnOperation() {
           }
         }
       }
+      // Collect value marked with test.value for store tests
+      if (testMode == "store" && op->hasAttr("test.value")) {
+        if (op->getNumResults() > 0)
+          testValue = op->getResult(0);
+      }
     });
 
     // Now test all candidates
@@ -115,6 +127,12 @@ void TestPointerLikeTypeInterfacePass::runOnOperation() {
       else if (testMode == "free")
         testGenFree(candidate.op, candidate.result, candidate.pointerType,
                     builder);
+      else if (testMode == "load")
+        testGenLoad(candidate.op, candidate.result, candidate.pointerType,
+                    builder);
+      else if (testMode == "store")
+        testGenStore(candidate.op, candidate.result, candidate.pointerType,
+                     builder, testValue);
     }
   } else if (testMode == "copy") {
     // Collect all source and destination candidates
@@ -292,6 +310,105 @@ void TestPointerLikeTypeInterfacePass::testGenCopy(
   }
 }
 
+void TestPointerLikeTypeInterfacePass::testGenLoad(Operation *op, Value result,
+                                                   PointerLikeType pointerType,
+                                                   OpBuilder &builder) {
+  Location loc = op->getLoc();
+
+  // Create a new builder with the listener and set insertion point
+  OperationTracker tracker;
+  OpBuilder newBuilder(op->getContext());
+  newBuilder.setListener(&tracker);
+  newBuilder.setInsertionPointAfter(op);
+
+  // Call the genLoad API
+  auto typedResult = cast<TypedValue<PointerLikeType>>(result);
+  Value loadRes = pointerType.genLoad(newBuilder, loc, typedResult, Type());
+
+  if (loadRes) {
+    llvm::errs() << "Successfully generated load for operation: ";
+    op->print(llvm::errs());
+    llvm::errs() << "\n";
+    llvm::errs() << "\tLoaded value type: ";
+    loadRes.getType().print(llvm::errs());
+    llvm::errs() << "\n";
+
+    // Print all operations that were inserted
+    for (Operation *insertedOp : tracker.insertedOps) {
+      llvm::errs() << "\tGenerated: ";
+      insertedOp->print(llvm::errs());
+      llvm::errs() << "\n";
+    }
+  } else {
+    llvm::errs() << "Failed to generate load for operation: ";
+    op->print(llvm::errs());
+    llvm::errs() << "\n";
+  }
+}
+
+void TestPointerLikeTypeInterfacePass::testGenStore(Operation *op, Value result,
+                                                    PointerLikeType pointerType,
+                                                    OpBuilder &builder,
+                                                    Value providedValue) {
+  Location loc = op->getLoc();
+
+  // Create a new builder with the listener and set insertion point
+  OperationTracker tracker;
+  OpBuilder newBuilder(op->getContext());
+  newBuilder.setListener(&tracker);
+  newBuilder.setInsertionPointAfter(op);
+
+  // Use provided value if available, otherwise create a constant
+  Value valueToStore = providedValue;
+  if (!valueToStore) {
+    // Create a test value to store - use a constant matching the element type
+    Type elementType = pointerType.getElementType();
+    if (!elementType) {
+      llvm::errs() << "Failed to generate store for operation: ";
+      op->print(llvm::errs());
+      llvm::errs() << "\n";
+      return;
+    }
+
+    if (elementType.isIntOrIndex()) {
+      auto attr = newBuilder.getIntegerAttr(elementType, 42);
+      valueToStore =
+          arith::ConstantOp::create(newBuilder, loc, elementType, attr);
+    } else if (auto floatType = dyn_cast<FloatType>(elementType)) {
+      auto attr = newBuilder.getFloatAttr(floatType, 42.0);
+      valueToStore =
+          arith::ConstantOp::create(newBuilder, loc, floatType, attr);
+    } else {
+      llvm::errs() << "Failed to generate store for operation: ";
+      op->print(llvm::errs());
+      llvm::errs() << "\n";
+      return;
+    }
+  }
+
+  // Call the genStore API
+  auto typedResult = cast<TypedValue<PointerLikeType>>(result);
+  bool success =
+      pointerType.genStore(newBuilder, loc, valueToStore, typedResult);
+
+  if (success) {
+    llvm::errs() << "Successfully generated store for operation: ";
+    op->print(llvm::errs());
+    llvm::errs() << "\n";
+
+    // Print all operations that were inserted
+    for (Operation *insertedOp : tracker.insertedOps) {
+      llvm::errs() << "\tGenerated: ";
+      insertedOp->print(llvm::errs());
+      llvm::errs() << "\n";
+    }
+  } else {
+    llvm::errs() << "Failed to generate store for operation: ";
+    op->print(llvm::errs());
+    llvm::errs() << "\n";
+  }
+}
+
 } // namespace
 
 //===----------------------------------------------------------------------===//
diff --git a/mlir/test/lib/Dialect/Test/TestOpDefs.cpp b/mlir/test/lib/Dialect/Test/TestOpDefs.cpp
index c153211c68f92..868926520af05 100644
--- a/mlir/test/lib/Dialect/Test/TestOpDefs.cpp
+++ b/mlir/test/lib/Dialect/Test/TestOpDefs.cpp
@@ -1637,3 +1637,14 @@ test::TestCreateTensorOp::getBufferType(
 
   return convertTensorToBuffer(getOperation(), options, type);
 }
+
+// Define a custom builder for ManyRegionsOp declared in TestOps.td.
+//  OpBuilder<(ins "::std::unique_ptr<::mlir::Region>":$firstRegion,
+//                 "::std::unique_ptr<::mlir::Region>":$secondRegion)>
+void test::ManyRegionsOp::build(
+    mlir::OpBuilder &builder, mlir::OperationState &state,
+    llvm::SmallVectorImpl<std::unique_ptr<mlir::Region>> &&regions) {
+  for (auto &&regionPtr : std::move(regions))
+    state.addRegion(std::move(regionPtr));
+  ManyRegionsOp::build(builder, state, {}, regions.size());
+}
diff --git a/mlir/test/lib/Dialect/Test/TestOps.td b/mlir/test/lib/Dialect/Test/TestOps.td
index 670223984fd95..5417ae94f00d7 100644
--- a/mlir/test/lib/Dialect/Test/TestOps.td
+++ b/mlir/test/lib/Dialect/Test/TestOps.td
@@ -2352,6 +2352,24 @@ def IsolatedGraphRegionOp : TEST_Op<"isolated_graph_region",  [
   let assemblyFormat = "attr-dict-with-keyword $region";
 }
 
+def ManyRegionsOp : TEST_Op<"many_regions", []> {
+  let summary = "operation created with move-only objects";
+  let description = [{
+    Test op with multiple regions and a `create` function that takes
+    move-only parameters.
+  }];
+
+  let regions = (region VariadicRegion<AnyRegion>:$regions);
+  let builders =
+      [OpBuilder<(ins "::std::unique_ptr<::mlir::Region>":$singleRegion), [{
+      $_state.addRegion(std::move(singleRegion));
+      build($_builder, $_state, {}, /*regionsCount=*/1);
+    }]>,
+       // Defined in TestOpDefs.cpp.
+       OpBuilder<(ins "::llvm::SmallVectorImpl<::std::unique_ptr<::mlir::"
+                      "Region>>&&":$regions)>];
+}
+
 def AffineScopeOp : TEST_Op<"affine_scope", [AffineScope]> {
   let summary =  "affine scope operation";
   let description = [{
diff --git a/mlir/test/lit.cfg.py b/mlir/test/lit.cfg.py
index 7081c51994ec1..675ded35d98f3 100644
--- a/mlir/test/lit.cfg.py
+++ b/mlir/test/lit.cfg.py
@@ -44,7 +44,7 @@
     ".test",
     ".pdll",
     ".c",
-    ".spv",
+    ".spvasm",
 ]
 
 # test_source_root: The root path where tests are located.
diff --git a/mlir/test/mlir-tblgen/op-decl-and-defs.td b/mlir/test/mlir-tblgen/op-decl-and-defs.td
index 0e87373b2f6c2..80dedb8475b9e 100644
--- a/mlir/test/mlir-tblgen/op-decl-and-defs.td
+++ b/mlir/test/mlir-tblgen/op-decl-and-defs.td
@@ -235,14 +235,14 @@ def NS_FOp : NS_Op<"op_with_all_types_constraint",
 
 // DEFS: FOp FOp::create(::mlir::OpBuilder &builder, ::mlir::Location location, ::mlir::Value a) {
 // DEFS:   ::mlir::OperationState __state__(location, getOperationName());
-// DEFS:   build(builder, __state__, a);
+// DEFS:   build(builder, __state__, std::forward<decltype(a)>(a));
 // DEFS:   auto __res__ = ::llvm::dyn_cast<FOp>(builder.create(__state__));
 // DEFS:   assert(__res__ && "builder didn't return the right type");
 // DEFS:   return __res__;
 // DEFS: }
 
 // DEFS: FOp FOp::create(::mlir::ImplicitLocOpBuilder &builder, ::mlir::Value a) {
-// DEFS:   return create(builder, builder.getLoc(), a);
+// DEFS:   return create(builder, builder.getLoc(), std::forward<decltype(a)>(a));
 // DEFS: }
 
 def NS_GOp : NS_Op<"op_with_fixed_return_type", []> {
diff --git a/mlir/test/python/integration/dialects/linalg/opsrun.py b/mlir/test/python/integration/dialects/linalg/opsrun.py
index 8f202318146ee..8eff573f98ad3 100644
--- a/mlir/test/python/integration/dialects/linalg/opsrun.py
+++ b/mlir/test/python/integration/dialects/linalg/opsrun.py
@@ -25,13 +25,13 @@ def log(*args):
   %O1 = memref.alloc() : memref<16xi32>
   %O2 = memref.alloc() : memref<4x16xi32>
 
-  %val0 = arith.constant 1.0 : f32
-  %val1 = arith.constant 2.0 : f32
-  %val2 = arith.constant 3.0 : f32
+  %val0 = arith.constant 1 : i32
+  %val1 = arith.constant 2 : i32
+  %val2 = arith.constant 3 : i32
 
-  call @fill_0d_on_buffers(%val0, %O0) : (f32, memref<i32>) -> ()
-  call @fill_1d_on_buffers(%val1, %O1) : (f32, memref<16xi32>) -> ()
-  call @fill_2d_on_buffers(%val2, %O2) : (f32, memref<4x16xi32>) -> ()
+  call @fill_0d_on_buffers(%val0, %O0) : (i32, memref<i32>) -> ()
+  call @fill_1d_on_buffers(%val1, %O1) : (i32, memref<16xi32>) -> ()
+  call @fill_2d_on_buffers(%val2, %O2) : (i32, memref<4x16xi32>) -> ()
 
   %c0 = arith.constant 0 : index
   %res0 = memref.load %O0[] : memref<i32>
@@ -149,19 +149,18 @@ def transform(module, boilerplate):
 def test_fill_builtin():
     with Context() as ctx, Location.unknown():
         module = Module.create()
-        f32 = F32Type.get()
         i32 = IntegerType.get_signless(32)
         with InsertionPoint(module.body):
 
-            @func.FuncOp.from_py_func(f32, MemRefType.get([], i32))
+            @func.FuncOp.from_py_func(i32, MemRefType.get([], i32))
             def fill_0d_on_buffers(value, out):
                 linalg.fill(value, outs=[out])
 
-            @func.FuncOp.from_py_func(f32, MemRefType.get([16], i32))
+            @func.FuncOp.from_py_func(i32, MemRefType.get([16], i32))
             def fill_1d_on_buffers(value, out):
                 linalg.fill(value, outs=[out])
 
-            @func.FuncOp.from_py_func(f32, MemRefType.get([4, 16], i32))
+            @func.FuncOp.from_py_func(i32, MemRefType.get([4, 16], i32))
             def fill_2d_on_buffers(value, out):
                 linalg.fill(value, outs=[out])
 
@@ -184,19 +183,18 @@ def fill_2d_on_buffers(value, out):
 def test_fill_generic():
     with Context() as ctx, Location.unknown():
         module = Module.create()
-        f32 = F32Type.get()
         i32 = IntegerType.get_signless(32)
         with InsertionPoint(module.body):
 
-            @func.FuncOp.from_py_func(f32, MemRefType.get([], i32))
+            @func.FuncOp.from_py_func(i32, MemRefType.get([], i32))
             def fill_0d_on_buffers(value, out):
                 linalg.fill(value, outs=[out], emit_generic=True)
 
-            @func.FuncOp.from_py_func(f32, MemRefType.get([16], i32))
+            @func.FuncOp.from_py_func(i32, MemRefType.get([16], i32))
             def fill_1d_on_buffers(value, out):
                 linalg.fill(value, outs=[out], emit_generic=True)
 
-            @func.FuncOp.from_py_func(f32, MemRefType.get([4, 16], i32))
+            @func.FuncOp.from_py_func(i32, MemRefType.get([4, 16], i32))
             def fill_2d_on_buffers(value, out):
                 linalg.fill(value, outs=[out], emit_generic=True)
 
diff --git a/mlir/tools/mlir-tblgen/OpDefinitionsGen.cpp b/mlir/tools/mlir-tblgen/OpDefinitionsGen.cpp
index 3b10842f2a127..dbae5d9258d04 100644
--- a/mlir/tools/mlir-tblgen/OpDefinitionsGen.cpp
+++ b/mlir/tools/mlir-tblgen/OpDefinitionsGen.cpp
@@ -2641,7 +2641,14 @@ void OpEmitter::genInlineCreateBody(
   std::string nonBuilderStateArgs = "";
   if (!nonBuilderStateArgsList.empty()) {
     llvm::raw_string_ostream nonBuilderStateArgsOS(nonBuilderStateArgs);
-    interleaveComma(nonBuilderStateArgsList, nonBuilderStateArgsOS);
+    interleave(
+        nonBuilderStateArgsList,
+        [&](StringRef name) {
+          nonBuilderStateArgsOS << "std::forward<decltype(" << name << ")>("
+                                << name << ')';
+        },
+        [&] { nonBuilderStateArgsOS << ", "; });
+
     nonBuilderStateArgs = ", " + nonBuilderStateArgs;
   }
   if (cWithLoc)
diff --git a/mlir/unittests/Analysis/Presburger/BarvinokTest.cpp b/mlir/unittests/Analysis/Presburger/BarvinokTest.cpp
index eaf04379cb529..4ca999878df2c 100644
--- a/mlir/unittests/Analysis/Presburger/BarvinokTest.cpp
+++ b/mlir/unittests/Analysis/Presburger/BarvinokTest.cpp
@@ -301,3 +301,12 @@ TEST(BarvinokTest, computeNumTermsPolytope) {
   gf = count[0].second;
   EXPECT_EQ(gf.getNumerators().size(), 24u);
 }
+
+TEST(BarvinokTest, solveParametricEquations) {
+  FracMatrix equations = makeFracMatrix(2, 3, {{2, 3, -4}, {2, 6, -7}});
+  auto maybeSolution = solveParametricEquations(equations);
+  ASSERT_TRUE(maybeSolution.has_value());
+  FracMatrix solution = *maybeSolution;
+  EXPECT_EQ(solution.at(0, 0), Fraction(1, 2));
+  EXPECT_EQ(solution.at(1, 0), 1);
+}
diff --git a/mlir/unittests/Analysis/Presburger/IntegerRelationTest.cpp b/mlir/unittests/Analysis/Presburger/IntegerRelationTest.cpp
index 9ae90a4841f3c..599db4cc74983 100644
--- a/mlir/unittests/Analysis/Presburger/IntegerRelationTest.cpp
+++ b/mlir/unittests/Analysis/Presburger/IntegerRelationTest.cpp
@@ -725,3 +725,18 @@ TEST(IntegerRelationTest, addLocalModulo) {
     EXPECT_TRUE(rel.containsPointNoLocal({x, x % 32}));
   }
 }
+
+TEST(IntegerRelationTest, simplify) {
+  IntegerRelation rel =
+      parseRelationFromSet("(x, y, z): (2*x + y - 4*z - 3 == 0, "
+                           "3*x - y - 3*z + 2 == 0, x + 3*y - 5*z - 8 == 0,"
+                           "x - y + z >= 0)",
+                           2);
+  IntegerRelation copy = rel;
+  rel.simplify();
+
+  EXPECT_TRUE(rel.isEqual(copy));
+  // The third equality is redundant and should be removed.
+  // It can be obtained from 2 times the first equality minus the second.
+  EXPECT_TRUE(rel.getNumEqualities() == 2);
+}
diff --git a/mlir/unittests/IR/SymbolTableTest.cpp b/mlir/unittests/IR/SymbolTableTest.cpp
index 4b3545bce1952..864eb40898335 100644
--- a/mlir/unittests/IR/SymbolTableTest.cpp
+++ b/mlir/unittests/IR/SymbolTableTest.cpp
@@ -77,7 +77,7 @@ namespace {
 
 TEST_F(ReplaceAllSymbolUsesTest, OperationInModuleOp) {
   // Symbol as `Operation *`, rename within module.
-  testReplaceAllSymbolUses([&](auto symbolTable, auto module, auto fooOp,
+  testReplaceAllSymbolUses([&](const auto &symbolTable, auto module, auto fooOp,
                                auto barOp) -> LogicalResult {
     return symbolTable.replaceAllSymbolUses(
         barOp, StringAttr::get(context.get(), "baz"), module);
@@ -86,7 +86,7 @@ TEST_F(ReplaceAllSymbolUsesTest, OperationInModuleOp) {
 
 TEST_F(ReplaceAllSymbolUsesTest, StringAttrInModuleOp) {
   // Symbol as `StringAttr`, rename within module.
-  testReplaceAllSymbolUses([&](auto symbolTable, auto module, auto fooOp,
+  testReplaceAllSymbolUses([&](const auto &symbolTable, auto module, auto fooOp,
                                auto barOp) -> LogicalResult {
     return symbolTable.replaceAllSymbolUses(
         StringAttr::get(context.get(), "bar"),
@@ -96,7 +96,7 @@ TEST_F(ReplaceAllSymbolUsesTest, StringAttrInModuleOp) {
 
 TEST_F(ReplaceAllSymbolUsesTest, OperationInModuleBody) {
   // Symbol as `Operation *`, rename within module body.
-  testReplaceAllSymbolUses([&](auto symbolTable, auto module, auto fooOp,
+  testReplaceAllSymbolUses([&](const auto &symbolTable, auto module, auto fooOp,
                                auto barOp) -> LogicalResult {
     return symbolTable.replaceAllSymbolUses(
         barOp, StringAttr::get(context.get(), "baz"), &module->getRegion(0));
@@ -105,7 +105,7 @@ TEST_F(ReplaceAllSymbolUsesTest, OperationInModuleBody) {
 
 TEST_F(ReplaceAllSymbolUsesTest, StringAttrInModuleBody) {
   // Symbol as `StringAttr`, rename within module body.
-  testReplaceAllSymbolUses([&](auto symbolTable, auto module, auto fooOp,
+  testReplaceAllSymbolUses([&](const auto &symbolTable, auto module, auto fooOp,
                                auto barOp) -> LogicalResult {
     return symbolTable.replaceAllSymbolUses(
         StringAttr::get(context.get(), "bar"),
@@ -115,7 +115,7 @@ TEST_F(ReplaceAllSymbolUsesTest, StringAttrInModuleBody) {
 
 TEST_F(ReplaceAllSymbolUsesTest, OperationInFuncOp) {
   // Symbol as `Operation *`, rename within function.
-  testReplaceAllSymbolUses([&](auto symbolTable, auto module, auto fooOp,
+  testReplaceAllSymbolUses([&](const auto &symbolTable, auto module, auto fooOp,
                                auto barOp) -> LogicalResult {
     return symbolTable.replaceAllSymbolUses(
         barOp, StringAttr::get(context.get(), "baz"), fooOp);
@@ -124,7 +124,7 @@ TEST_F(ReplaceAllSymbolUsesTest, OperationInFuncOp) {
 
 TEST_F(ReplaceAllSymbolUsesTest, StringAttrInFuncOp) {
   // Symbol as `StringAttr`, rename within function.
-  testReplaceAllSymbolUses([&](auto symbolTable, auto module, auto fooOp,
+  testReplaceAllSymbolUses([&](const auto &symbolTable, auto module, auto fooOp,
                                auto barOp) -> LogicalResult {
     return symbolTable.replaceAllSymbolUses(
         StringAttr::get(context.get(), "bar"),
diff --git a/offload/include/OpenMP/omp.h b/offload/include/OpenMP/omp.h
index 768ca46a9bed0..8db0a0518dc99 100644
--- a/offload/include/OpenMP/omp.h
+++ b/offload/include/OpenMP/omp.h
@@ -30,6 +30,14 @@
 
 extern "C" {
 
+/// Definitions
+///{
+
+// See definition in OpenMP (omp.h.var/omp_lib.(F90|h).var)
+#define omp_invalid_device -2
+
+///}
+
 /// Type declarations
 ///{
 
diff --git a/offload/include/Shared/Debug.h b/offload/include/Shared/Debug.h
index 41613a37c3548..9e657e64484c0 100644
--- a/offload/include/Shared/Debug.h
+++ b/offload/include/Shared/Debug.h
@@ -39,6 +39,7 @@
 #define OMPTARGET_SHARED_DEBUG_H
 
 #include <atomic>
+#include <cstdarg>
 #include <mutex>
 #include <string>
 
@@ -78,17 +79,6 @@ inline std::atomic<uint32_t> &getInfoLevelInternal() {
 
 inline uint32_t getInfoLevel() { return getInfoLevelInternal().load(); }
 
-inline uint32_t getDebugLevel() {
-  static uint32_t DebugLevel = 0;
-  static std::once_flag Flag{};
-  std::call_once(Flag, []() {
-    if (char *EnvStr = getenv("LIBOMPTARGET_DEBUG"))
-      DebugLevel = std::stoi(EnvStr);
-  });
-
-  return DebugLevel;
-}
-
 #undef USED
 #undef GCC_VERSION
 
@@ -147,46 +137,11 @@ inline uint32_t getDebugLevel() {
     fprintf(_stdDst, __VA_ARGS__);                                             \
   } while (0)
 
-// Debugging messages
-#ifdef OMPTARGET_DEBUG
-#include <stdio.h>
-
-#define DEBUGP(prefix, ...)                                                    \
-  {                                                                            \
-    fprintf(stderr, "%s --> ", prefix);                                        \
-    fprintf(stderr, __VA_ARGS__);                                              \
-  }
-
-/// Emit a message for debugging
-#define DP(...)                                                                \
-  do {                                                                         \
-    if (getDebugLevel() > 0) {                                                 \
-      DEBUGP(DEBUG_PREFIX, __VA_ARGS__);                                       \
-    }                                                                          \
-  } while (false)
-
-/// Emit a message for debugging or failure if debugging is disabled
-#define REPORT(...)                                                            \
-  do {                                                                         \
-    if (getDebugLevel() > 0) {                                                 \
-      DP(__VA_ARGS__);                                                         \
-    } else {                                                                   \
-      FAILURE_MESSAGE(__VA_ARGS__);                                            \
-    }                                                                          \
-  } while (false)
-#else
-#define DEBUGP(prefix, ...)                                                    \
-  {}
-#define DP(...)                                                                \
-  {}
-#define REPORT(...) FAILURE_MESSAGE(__VA_ARGS__);
-#endif // OMPTARGET_DEBUG
-
 /// Emit a message giving the user extra information about the runtime if
 #define INFO(_flags, _id, ...)                                                 \
   do {                                                                         \
-    if (getDebugLevel() > 0) {                                                 \
-      DEBUGP(DEBUG_PREFIX, __VA_ARGS__);                                       \
+    if (::llvm::offload::debug::isDebugEnabled()) {                            \
+      DP(__VA_ARGS__);                                                         \
     } else if (getInfoLevel() & _flags) {                                      \
       INFO_MESSAGE(_id, __VA_ARGS__);                                          \
     }                                                                          \
@@ -203,17 +158,92 @@ inline uint32_t getDebugLevel() {
 
 namespace llvm::offload::debug {
 
-#ifdef OMPTARGET_DEBUG
+/// A raw_ostream that tracks `\n` and prints the prefix after each
+/// newline. Based on raw_ldbg_ostream from Support/DebugLog.h.
+class LLVM_ABI odbg_ostream final : public raw_ostream {
+public:
+  enum IfLevel : uint32_t;
+  enum OnlyLevel : uint32_t;
 
-struct DebugFilter {
-  StringRef Type;
-  uint32_t Level;
-};
+private:
+  std::string Prefix;
+  raw_ostream &Os;
+  uint32_t BaseLevel;
+  bool ShouldPrefixNextString;
+  bool ShouldEmitNewLineOnDestruction;
+  bool NeedEndNewLine = false;
 
-struct DebugSettings {
-  bool Enabled = false;
-  uint32_t DefaultLevel = 1;
-  llvm::SmallVector<DebugFilter> Filters;
+  /// If the stream is muted, writes to it are ignored
+  bool Muted = false;
+
+  /// Split the line on newlines and insert the prefix before each
+  /// newline. Forward everything to the underlying stream.
+  void write_impl(const char *Ptr, size_t Size) final {
+    if (Muted)
+      return;
+
+    NeedEndNewLine = false;
+    auto Str = StringRef(Ptr, Size);
+    auto Eol = Str.find('\n');
+    // Handle `\n` occurring in the string, ensuring the prefix is printed at
+    // the beginning of each line.
+    while (Eol != StringRef::npos) {
+      // Take the line up to the newline (including the newline).
+      StringRef Line = Str.take_front(Eol + 1);
+      if (!Line.empty())
+        writeWithPrefix(Line);
+      // We printed a newline, record here to print a prefix.
+      ShouldPrefixNextString = true;
+      Str = Str.drop_front(Eol + 1);
+      Eol = Str.find('\n');
+    }
+    if (!Str.empty()) {
+      writeWithPrefix(Str);
+      NeedEndNewLine = true;
+    }
+  }
+  void emitPrefix() { Os.write(Prefix.c_str(), Prefix.size()); }
+  void writeWithPrefix(StringRef Str) {
+    if (ShouldPrefixNextString) {
+      emitPrefix();
+      ShouldPrefixNextString = false;
+    }
+    Os.write(Str.data(), Str.size());
+  }
+
+public:
+  explicit odbg_ostream(std::string Prefix, raw_ostream &Os, uint32_t BaseLevel,
+                        bool ShouldPrefixNextString = true,
+                        bool ShouldEmitNewLineOnDestruction = true)
+      : Prefix(std::move(Prefix)), Os(Os), BaseLevel(BaseLevel),
+        ShouldPrefixNextString(ShouldPrefixNextString),
+        ShouldEmitNewLineOnDestruction(ShouldEmitNewLineOnDestruction) {
+    SetUnbuffered();
+  }
+  ~odbg_ostream() final {
+    if (ShouldEmitNewLineOnDestruction && NeedEndNewLine)
+      Os << '\n';
+  }
+  odbg_ostream(const odbg_ostream &) = delete;
+  odbg_ostream &operator=(const odbg_ostream &) = delete;
+  odbg_ostream(odbg_ostream &&other) : Os(other.Os) {
+    Prefix = std::move(other.Prefix);
+    BaseLevel = other.BaseLevel;
+    ShouldPrefixNextString = other.ShouldPrefixNextString;
+    ShouldEmitNewLineOnDestruction = other.ShouldEmitNewLineOnDestruction;
+    NeedEndNewLine = other.NeedEndNewLine;
+    Muted = other.Muted;
+  }
+
+  /// Forward the current_pos method to the underlying stream.
+  uint64_t current_pos() const final { return Os.tell(); }
+
+  /// Some of the `<<` operators expect an lvalue, so we trick the type
+  /// system.
+  odbg_ostream &asLvalue() { return *this; }
+
+  void shouldMute(const IfLevel Filter) { Muted = Filter > BaseLevel; }
+  void shouldMute(const OnlyLevel Filter) { Muted = BaseLevel != Filter; }
 };
 
 /// dbgs - Return a circular-buffered debug stream.
@@ -228,6 +258,19 @@ struct DebugSettings {
   return thestrm.strm;
 }
 
+#ifdef OMPTARGET_DEBUG
+
+struct DebugFilter {
+  StringRef Type;
+  uint32_t Level;
+};
+
+struct DebugSettings {
+  bool Enabled = false;
+  uint32_t DefaultLevel = 1;
+  llvm::SmallVector<DebugFilter> Filters;
+};
+
 [[maybe_unused]] static DebugFilter parseDebugFilter(StringRef Filter) {
   size_t Pos = Filter.find(':');
   if (Pos == StringRef::npos)
@@ -309,80 +352,6 @@ shouldPrintDebug(const char *Component, const char *Type, uint32_t &Level) {
   return false;
 }
 
-/// A raw_ostream that tracks `\n` and print the prefix after each
-/// newline. Based on raw_ldbg_ostream from Support/DebugLog.h
-class LLVM_ABI odbg_ostream final : public raw_ostream {
-public:
-  enum IfLevel : uint32_t;
-  enum OnlyLevel : uint32_t;
-
-private:
-  std::string Prefix;
-  raw_ostream &Os;
-  uint32_t BaseLevel;
-  bool ShouldPrefixNextString;
-  bool ShouldEmitNewLineOnDestruction;
-
-  /// If the stream is muted, writes to it are ignored
-  bool Muted = false;
-
-  /// Split the line on newlines and insert the prefix before each
-  /// newline. Forward everything to the underlying stream.
-  void write_impl(const char *Ptr, size_t Size) final {
-    if (Muted)
-      return;
-
-    auto Str = StringRef(Ptr, Size);
-    auto Eol = Str.find('\n');
-    // Handle `\n` occurring in the string, ensure to print the prefix at the
-    // beginning of each line.
-    while (Eol != StringRef::npos) {
-      // Take the line up to the newline (including the newline).
-      StringRef Line = Str.take_front(Eol + 1);
-      if (!Line.empty())
-        writeWithPrefix(Line);
-      // We printed a newline, record here to print a prefix.
-      ShouldPrefixNextString = true;
-      Str = Str.drop_front(Eol + 1);
-      Eol = Str.find('\n');
-    }
-    if (!Str.empty())
-      writeWithPrefix(Str);
-  }
-  void emitPrefix() { Os.write(Prefix.c_str(), Prefix.size()); }
-  void writeWithPrefix(StringRef Str) {
-    if (ShouldPrefixNextString) {
-      emitPrefix();
-      ShouldPrefixNextString = false;
-    }
-    Os.write(Str.data(), Str.size());
-  }
-
-public:
-  explicit odbg_ostream(std::string Prefix, raw_ostream &Os, uint32_t BaseLevel,
-                        bool ShouldPrefixNextString = true,
-                        bool ShouldEmitNewLineOnDestruction = false)
-      : Prefix(std::move(Prefix)), Os(Os), BaseLevel(BaseLevel),
-        ShouldPrefixNextString(ShouldPrefixNextString),
-        ShouldEmitNewLineOnDestruction(ShouldEmitNewLineOnDestruction) {
-    SetUnbuffered();
-  }
-  ~odbg_ostream() final {
-    if (ShouldEmitNewLineOnDestruction)
-      Os << '\n';
-  }
-
-  /// Forward the current_pos method to the underlying stream.
-  uint64_t current_pos() const final { return Os.tell(); }
-
-  /// Some of the `<<` operators expect an lvalue, so we trick the type
-  /// system.
-  odbg_ostream &asLvalue() { return *this; }
-
-  void shouldMute(const IfLevel Filter) { Muted = Filter > BaseLevel; }
-  void shouldMute(const OnlyLevel Filter) { Muted = BaseLevel != Filter; }
-};
-
 /// Compute the prefix for the debug log in the form of:
 /// "Component --> "
 [[maybe_unused]] static std::string computePrefix(StringRef Component,
@@ -463,6 +432,8 @@ static inline raw_ostream &operator<<(raw_ostream &Os,
 
 #else
 
+inline bool isDebugEnabled() { return false; }
+
 #define ODBG_NULL                                                              \
   for (bool _c = false; _c; _c = false)                                        \
   ::llvm::nulls()
@@ -479,4 +450,94 @@ static inline raw_ostream &operator<<(raw_ostream &Os,
 
 } // namespace llvm::offload::debug
 
+namespace llvm::omp::target::debug {
+using namespace llvm::offload::debug;
+
+enum OmpDebugLevel : uint32_t {
+  ODL_Default = 1,
+  ODL_Error = ODL_Default,
+  ODL_Detailed = 2,
+  ODL_Verbose = 3,
+  ODL_VeryVerbose = 4,
+  ODL_Dumping = 5
+};
+
+/* Debug types to use in libomptarget */
+constexpr const char *ODT_Init = "Init";
+constexpr const char *ODT_Mapping = "Mapping";
+constexpr const char *ODT_Kernel = "Kernel";
+constexpr const char *ODT_DataTransfer = "DataTransfer";
+constexpr const char *ODT_Sync = "Sync";
+constexpr const char *ODT_Deinit = "Deinit";
+constexpr const char *ODT_Error = "Error";
+constexpr const char *ODT_KernelArgs = "KernelArgs";
+constexpr const char *ODT_MappingExists = "MappingExists";
+constexpr const char *ODT_DumpTable = "DumpTable";
+constexpr const char *ODT_MappingChanged = "MappingChanged";
+constexpr const char *ODT_PluginKernel = "PluginKernel";
+constexpr const char *ODT_EmptyMapping = "EmptyMapping";
+
+static inline odbg_ostream reportErrorStream() {
+#ifdef OMPTARGET_DEBUG
+  if (::llvm::offload::debug::isDebugEnabled()) {
+    uint32_t RealLevel = ODL_Error;
+    if (::llvm::offload::debug::shouldPrintDebug(GETNAME(TARGET_NAME),
+                                                 (ODT_Error), RealLevel))
+      return odbg_ostream{
+          ::llvm::offload::debug::computePrefix(DEBUG_PREFIX, ODT_Error),
+          ::llvm::offload::debug::dbgs(), RealLevel};
+    else
+      return odbg_ostream{"", ::llvm::nulls(), 1};
+  }
+#endif
+  return odbg_ostream{GETNAME(TARGET_NAME) " error: ",
+                      ::llvm::offload::debug::dbgs(), ODL_Error};
+}
+
+#ifdef OMPTARGET_DEBUG
+// Deprecated debug print macros
+[[maybe_unused]] static std::string formatToStr(const char *format, ...) {
+  va_list args;
+  va_start(args, format);
+  size_t len = std::vsnprintf(NULL, 0, format, args);
+  va_end(args);
+  llvm::SmallVector<char, 128> vec(len + 1);
+  va_start(args, format);
+  std::vsnprintf(&vec[0], len + 1, format, args);
+  va_end(args);
+  return &vec[0];
+}
+
+// helper macro to support old DP and REPORT macros with printf syntax
+#define FORMAT_TO_STR(Format, ...)                                             \
+  ::llvm::omp::target::debug::formatToStr(Format __VA_OPT__(, ) __VA_ARGS__)
+
+#define DP(...) ODBG() << FORMAT_TO_STR(__VA_ARGS__);
+
+#define REPORT_INT_OLD(...)                                                    \
+  do {                                                                         \
+    if (::llvm::offload::debug::isDebugEnabled()) {                            \
+      ODBG(::llvm::omp::target::debug::ODT_Error,                              \
+           ::llvm::omp::target::debug::ODL_Error)                              \
+          << FORMAT_TO_STR(__VA_ARGS__);                                       \
+    } else {                                                                   \
+      FAILURE_MESSAGE(__VA_ARGS__);                                            \
+    }                                                                          \
+  } while (false)
+
+#else
+#define DP(...)                                                                \
+  {                                                                            \
+  }
+#define REPORT_INT_OLD(...) FAILURE_MESSAGE(__VA_ARGS__);
+#endif // OMPTARGET_DEBUG
+
+// This is used for the new style REPORT macro
+#define REPORT_INT() ::llvm::omp::target::debug::reportErrorStream()
+
+// Make REPORT compatible with old and new syntax
+#define REPORT(...) REPORT_INT##__VA_OPT__(_OLD)(__VA_ARGS__)
+
+} // namespace llvm::omp::target::debug
+
 #endif // OMPTARGET_SHARED_DEBUG_H
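
For reference, the prefix-per-line behavior that odbg_ostream implements
(emit the prefix again after every newline written to the stream) can be
summarized with the standalone sketch below. It is illustrative only, does
not use the LLVM raw_ostream API, and the class name is hypothetical.

#include <iostream>
#include <string>

// Minimal illustration of the prefixing logic: whenever a newline is
// written, the next chunk of output is preceded by the prefix again.
class PrefixedStream {
  std::string Prefix;
  std::ostream &Os;
  bool ShouldPrefixNext = true;

public:
  PrefixedStream(std::string Prefix, std::ostream &Os)
      : Prefix(std::move(Prefix)), Os(Os) {}

  void write(const std::string &Str) {
    size_t Start = 0;
    while (Start < Str.size()) {
      size_t Eol = Str.find('\n', Start);
      size_t End = (Eol == std::string::npos) ? Str.size() : Eol + 1;
      if (ShouldPrefixNext) {
        Os << Prefix;
        ShouldPrefixNext = false;
      }
      Os << Str.substr(Start, End - Start);
      // A newline was just emitted, so the next write starts a new line.
      if (Eol != std::string::npos)
        ShouldPrefixNext = true;
      Start = End;
    }
  }
};

int main() {
  PrefixedStream S("Kernel --> ", std::cout);
  S.write("first line\nsecond line\n");
  // Prints:
  //   Kernel --> first line
  //   Kernel --> second line
}
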
diff --git a/offload/include/omptarget.h b/offload/include/omptarget.h
index f8b7d52fe4ef9..030589929ca8b 100644
--- a/offload/include/omptarget.h
+++ b/offload/include/omptarget.h
@@ -272,6 +272,8 @@ extern "C" {
 void ompx_dump_mapping_tables(void);
 int omp_get_num_devices(void);
 int omp_get_device_num(void);
+int omp_get_device_from_uid(const char *DeviceUid);
+const char *omp_get_uid_from_device(int DeviceNum);
 int omp_get_initial_device(void);
 void *omp_target_alloc(size_t Size, int DeviceNum);
 void omp_target_free(void *DevicePtr, int DeviceNum);
diff --git a/offload/libomptarget/OffloadRTL.cpp b/offload/libomptarget/OffloadRTL.cpp
index 0ae325bf496d9..3a18d76aaae15 100644
--- a/offload/libomptarget/OffloadRTL.cpp
+++ b/offload/libomptarget/OffloadRTL.cpp
@@ -19,6 +19,7 @@
 #ifdef OMPT_SUPPORT
 extern void llvm::omp::target::ompt::connectLibrary();
 #endif
+using namespace llvm::omp::target::debug;
 
 static std::mutex PluginMtx;
 static uint32_t RefCount = 0;
@@ -35,7 +36,7 @@ void initRuntime() {
 
   RefCount++;
   if (RefCount == 1) {
-    ODBG() << "Init offload library!";
+    ODBG(ODT_Init) << "Init offload library!";
 #ifdef OMPT_SUPPORT
     // Initialize OMPT first
     llvm::omp::target::ompt::connectLibrary();
@@ -54,12 +55,12 @@ void deinitRuntime() {
   assert(PM && "Runtime not initialized");
 
   if (RefCount == 1) {
-    DP("Deinit offload library!\n");
+    ODBG(ODT_Deinit) << "Deinit offload library!";
     // RTL deinitialization has started
     RTLAlive = false;
     while (RTLOngoingSyncs > 0) {
-      DP("Waiting for ongoing syncs to finish, count: %d\n",
-         RTLOngoingSyncs.load());
+      ODBG(ODT_Sync) << "Waiting for ongoing syncs to finish, count:"
+                     << RTLOngoingSyncs.load();
       std::this_thread::sleep_for(std::chrono::milliseconds(100));
     }
     PM->deinit();
diff --git a/offload/libomptarget/OpenMP/API.cpp b/offload/libomptarget/OpenMP/API.cpp
index dd83a3ccd08e6..6e85e5764449c 100644
--- a/offload/libomptarget/OpenMP/API.cpp
+++ b/offload/libomptarget/OpenMP/API.cpp
@@ -40,6 +40,8 @@ EXTERN void ompx_dump_mapping_tables() {
 using namespace llvm::omp::target::ompt;
 #endif
 
+using GenericDeviceTy = llvm::omp::target::plugin::GenericDeviceTy;
+
 void *targetAllocExplicit(size_t Size, int DeviceNum, int Kind,
                           const char *Name);
 void targetFreeExplicit(void *DevicePtr, int DeviceNum, int Kind,
@@ -68,6 +70,62 @@ EXTERN int omp_get_device_num(void) {
   return HostDevice;
 }
 
+static inline bool is_initial_device_uid(const char *DeviceUid) {
+  return strcmp(DeviceUid, GenericPluginTy::getHostDeviceUid()) == 0;
+}
+
+EXTERN int omp_get_device_from_uid(const char *DeviceUid) {
+  TIMESCOPE();
+  OMPT_IF_BUILT(ReturnAddressSetterRAII RA(__builtin_return_address(0)));
+
+  if (!DeviceUid) {
+    DP("Call to omp_get_device_from_uid returning omp_invalid_device\n");
+    return omp_invalid_device;
+  }
+  if (is_initial_device_uid(DeviceUid)) {
+    DP("Call to omp_get_device_from_uid returning initial device number %d\n",
+       omp_get_initial_device());
+    return omp_get_initial_device();
+  }
+
+  int DeviceNum = omp_invalid_device;
+
+  auto ExclusiveDevicesAccessor = PM->getExclusiveDevicesAccessor();
+  for (const DeviceTy &Device : PM->devices(ExclusiveDevicesAccessor)) {
+    const char *Uid = Device.RTL->getDevice(Device.RTLDeviceID).getDeviceUid();
+    if (Uid && strcmp(DeviceUid, Uid) == 0) {
+      DeviceNum = Device.DeviceID;
+      break;
+    }
+  }
+
+  DP("Call to omp_get_device_from_uid returning %d\n", DeviceNum);
+  return DeviceNum;
+}
+
+EXTERN const char *omp_get_uid_from_device(int DeviceNum) {
+  TIMESCOPE();
+  OMPT_IF_BUILT(ReturnAddressSetterRAII RA(__builtin_return_address(0)));
+
+  if (DeviceNum == omp_invalid_device) {
+    DP("Call to omp_get_uid_from_device returning nullptr\n");
+    return nullptr;
+  }
+  if (DeviceNum == omp_get_initial_device()) {
+    DP("Call to omp_get_uid_from_device returning initial device UID\n");
+    return GenericPluginTy::getHostDeviceUid();
+  }
+
+  auto DeviceOrErr = PM->getDevice(DeviceNum);
+  if (!DeviceOrErr)
+    FATAL_MESSAGE(DeviceNum, "%s", toString(DeviceOrErr.takeError()).c_str());
+
+  const char *Uid =
+      DeviceOrErr->RTL->getDevice(DeviceOrErr->RTLDeviceID).getDeviceUid();
+  DP("Call to omp_get_uid_from_device returning %s\n", Uid);
+  return Uid;
+}
+
 EXTERN int omp_get_initial_device(void) {
   TIMESCOPE();
   OMPT_IF_BUILT(ReturnAddressSetterRAII RA(__builtin_return_address(0)));
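
A minimal host-side use of the two new entry points, mirroring what the
added tests exercise (this assumes an offload-enabled build with this patch
applied; the program is illustrative only):

#include <omp.h>
#include <cstdio>

int main() {
  // Query every offload device plus the initial (host) device.
  for (int d = 0; d <= omp_get_num_devices(); ++d) {
    const char *Uid = omp_get_uid_from_device(d);
    if (!Uid)
      continue;
    // The round trip should map the UID back to the same device number.
    std::printf("device %d: uid=%s round-trip=%d\n", d, Uid,
                omp_get_device_from_uid(Uid));
  }
  // A null or unknown UID maps to omp_invalid_device (-2).
  std::printf("unknown uid -> %d\n", omp_get_device_from_uid("no-such-uid"));
  return 0;
}
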
diff --git a/offload/libomptarget/device.cpp b/offload/libomptarget/device.cpp
index bf271b2a24aac..659ef689f67e1 100644
--- a/offload/libomptarget/device.cpp
+++ b/offload/libomptarget/device.cpp
@@ -38,6 +38,7 @@ using namespace llvm::omp::target::ompt;
 #endif
 
 using namespace llvm::omp::target::plugin;
+using namespace llvm::omp::target::debug;
 
 int HostDataToTargetTy::addEventIfNecessary(DeviceTy &Device,
                                             AsyncInfoTy &AsyncInfo) const {
@@ -48,7 +49,7 @@ int HostDataToTargetTy::addEventIfNecessary(DeviceTy &Device,
   void *Event = getEvent();
   bool NeedNewEvent = Event == nullptr;
   if (NeedNewEvent && Device.createEvent(&Event) != OFFLOAD_SUCCESS) {
-    REPORT("Failed to create event\n");
+    REPORT() << "Failed to create event";
     return OFFLOAD_FAIL;
   }
 
@@ -56,7 +57,7 @@ int HostDataToTargetTy::addEventIfNecessary(DeviceTy &Device,
   // know if the target support event. But if a target doesn't,
   // recordEvent should always return success.
   if (Device.recordEvent(Event, AsyncInfo) != OFFLOAD_SUCCESS) {
-    REPORT("Failed to set dependence on event " DPxMOD "\n", DPxPTR(Event));
+    REPORT() << "Failed to set dependence on event " << Event;
     return OFFLOAD_FAIL;
   }
 
@@ -315,21 +316,21 @@ int32_t DeviceTy::dataFence(AsyncInfoTy &AsyncInfo) {
 }
 
 int32_t DeviceTy::notifyDataMapped(void *HstPtr, int64_t Size) {
-  DP("Notifying about new mapping: HstPtr=" DPxMOD ", Size=%" PRId64 "\n",
-     DPxPTR(HstPtr), Size);
+  ODBG(ODT_Mapping) << "Notifying about new mapping: HstPtr=" << HstPtr
+                    << ", Size=" << Size;
 
   if (RTL->data_notify_mapped(RTLDeviceID, HstPtr, Size)) {
-    REPORT("Notifying about data mapping failed.\n");
+    REPORT() << "Notifying about data mapping failed.";
     return OFFLOAD_FAIL;
   }
   return OFFLOAD_SUCCESS;
 }
 
 int32_t DeviceTy::notifyDataUnmapped(void *HstPtr) {
-  DP("Notifying about an unmapping: HstPtr=" DPxMOD "\n", DPxPTR(HstPtr));
+  ODBG(ODT_Mapping) << "Notifying about an unmapping: HstPtr=" << HstPtr;
 
   if (RTL->data_notify_unmapped(RTLDeviceID, HstPtr)) {
-    REPORT("Notifying about data unmapping failed.\n");
+    REPORT() << "Notifying about data unmapping failed.";
     return OFFLOAD_FAIL;
   }
   return OFFLOAD_SUCCESS;
diff --git a/offload/libomptarget/exports b/offload/libomptarget/exports
index 910a5b6c827a7..2ebc23e3cf60a 100644
--- a/offload/libomptarget/exports
+++ b/offload/libomptarget/exports
@@ -40,6 +40,8 @@ VERS1.0 {
     omp_get_mapped_ptr;
     omp_get_num_devices;
     omp_get_device_num;
+    omp_get_device_from_uid;
+    omp_get_uid_from_device;
     omp_get_initial_device;
     omp_target_alloc;
     omp_target_free;
diff --git a/offload/test/api/omp_device_uid.c b/offload/test/api/omp_device_uid.c
new file mode 100644
index 0000000000000..2a41d8d04ef8a
--- /dev/null
+++ b/offload/test/api/omp_device_uid.c
@@ -0,0 +1,76 @@
+// RUN: %libomptarget-compile-run-and-check-generic
+
+#include <omp.h>
+#include <stdio.h>
+#include <string.h>
+
+int test_omp_device_uid(int device_num) {
+  const char *device_uid = omp_get_uid_from_device(device_num);
+  if (device_uid == NULL) {
+    printf("FAIL for device %d: omp_get_uid_from_device returned NULL\n",
+           device_num);
+    return 0;
+  }
+
+  int device_num_from_uid = omp_get_device_from_uid(device_uid);
+  if (device_num_from_uid != device_num) {
+    printf(
+        "FAIL for device %d: omp_get_device_from_uid returned %d (UID: %s)\n",
+        device_num, device_num_from_uid, device_uid);
+    return 0;
+  }
+
+  if (device_num == omp_get_initial_device())
+    return 1;
+
+  int success = 1;
+
+// Note that the following code may be executed on the host if offloading
+// falls back to the host device
+#pragma omp target map(tofrom : success) device(device_num)
+  {
+    int device_num = omp_get_device_num();
+
+    // omp_get_uid_from_device() in the device runtime is a dummy function
+    // returning NULL
+    const char *device_uid = omp_get_uid_from_device(device_num);
+
+    // omp_get_device_from_uid() in the device runtime is a dummy function
+    // returning omp_invalid_device.
+    int device_num_from_uid = omp_get_device_from_uid(device_uid);
+
+    // Depending on whether we're executing on the device or the host, we either
+    // got NULL as the device UID or the correct device UID.  Consequently,
+    // omp_get_device_from_uid() either returned omp_invalid_device or the
+    // correct device number (aka omp_get_initial_device()).
+    if (device_uid ? device_num_from_uid != device_num
+                   : device_num_from_uid != omp_invalid_device) {
+      printf("FAIL for device %d (target): omp_get_device_from_uid returned %d "
+             "(UID: %s)\n",
+             device_num, device_num_from_uid, device_uid);
+      success = 0;
+    }
+  }
+
+  return success;
+}
+
+int main() {
+  int num_devices = omp_get_num_devices();
+  int num_failed = 0;
+  // (also test initial device aka num_devices)
+  for (int i = 0; i < num_devices + 1; i++) {
+    if (!test_omp_device_uid(i)) {
+      printf("FAIL for device %d\n", i);
+      num_failed++;
+    }
+  }
+  if (num_failed) {
+    printf("FAIL\n");
+    return 1;
+  }
+  printf("PASS\n");
+  return 0;
+}
+
+// CHECK: PASS
diff --git a/openmp/device/include/DeviceTypes.h b/openmp/device/include/DeviceTypes.h
index 2e5d92380f040..213ccfe58b4fb 100644
--- a/openmp/device/include/DeviceTypes.h
+++ b/openmp/device/include/DeviceTypes.h
@@ -21,6 +21,9 @@ template <typename T> using Constant = __gpu_constant T;
 template <typename T> using Local = __gpu_local T;
 template <typename T> using Global = __gpu_local T;
 
+// See definition in OpenMP (omp.h.var/omp_lib.(F90|h).var)
+#define omp_invalid_device -2
+
 enum omp_proc_bind_t {
   omp_proc_bind_false = 0,
   omp_proc_bind_true = 1,
diff --git a/openmp/device/include/Interface.h b/openmp/device/include/Interface.h
index c4bfaaa2404b4..71c3b1fc06d40 100644
--- a/openmp/device/include/Interface.h
+++ b/openmp/device/include/Interface.h
@@ -130,6 +130,10 @@ int omp_get_num_devices(void);
 
 int omp_get_device_num(void);
 
+int omp_get_device_from_uid(const char *DeviceUid);
+
+const char *omp_get_uid_from_device(int DeviceNum);
+
 int omp_get_num_teams(void);
 
 int omp_get_team_num();
diff --git a/openmp/device/src/State.cpp b/openmp/device/src/State.cpp
index 9f38cf26f8c6f..985e6b169137f 100644
--- a/openmp/device/src/State.cpp
+++ b/openmp/device/src/State.cpp
@@ -403,6 +403,12 @@ int omp_get_num_devices(void) { return config::getNumDevices(); }
 
 int omp_get_device_num(void) { return config::getDeviceNum(); }
 
+int omp_get_device_from_uid(const char *DeviceUid) {
+  return omp_invalid_device;
+}
+
+const char *omp_get_uid_from_device(int DeviceNum) { return nullptr; }
+
 int omp_get_num_teams(void) { return mapping::getNumberOfBlocksInKernel(); }
 
 int omp_get_team_num() { return mapping::getBlockIdInKernel(); }
diff --git a/openmp/runtime/src/dllexports b/openmp/runtime/src/dllexports
index 3983dae80c9f5..00becd1a657fd 100644
--- a/openmp/runtime/src/dllexports
+++ b/openmp/runtime/src/dllexports
@@ -544,6 +544,8 @@ kmp_set_disp_num_buffers                    890
     omp_get_devices_all_allocator           819
     omp_get_memspace_num_resources          820
     omp_get_submemspace                     821
+    omp_get_device_from_uid                 822
+    omp_get_uid_from_device                 823
     %ifndef stub
         __kmpc_set_default_allocator
         __kmpc_get_default_allocator
diff --git a/openmp/runtime/src/exports_so.txt b/openmp/runtime/src/exports_so.txt
index 124c80a1422b4..d826882d98804 100644
--- a/openmp/runtime/src/exports_so.txt
+++ b/openmp/runtime/src/exports_so.txt
@@ -105,6 +105,8 @@ OMP_4.5 {
 } OMP_4.0;
 OMP_5.0 {
 } OMP_4.5;
+OMP_6.0 {
+} OMP_5.0;
 
 # sets up GCC GOMP_ version dependency chain
 GOMP_1.0 {
diff --git a/openmp/runtime/src/exports_test_so.txt b/openmp/runtime/src/exports_test_so.txt
index c0a08e6d3b23b..02ef8ecd52b5a 100644
--- a/openmp/runtime/src/exports_test_so.txt
+++ b/openmp/runtime/src/exports_test_so.txt
@@ -36,6 +36,8 @@ OMP_4.5 {
 } OMP_4.0;
 OMP_5.0 {
 } OMP_4.5;
+OMP_6.0 {
+} OMP_5.0;
 
 # sets up GCC GOMP_ version dependency chain
 GOMP_1.0 {
diff --git a/openmp/runtime/src/include/omp.h.var b/openmp/runtime/src/include/omp.h.var
index 74f385feb3ea5..e98df731ad888 100644
--- a/openmp/runtime/src/include/omp.h.var
+++ b/openmp/runtime/src/include/omp.h.var
@@ -536,6 +536,11 @@
 
     /* OpenMP 5.2 */
     extern int __KAI_KMPC_CONVENTION omp_in_explicit_task(void);
+    #define omp_invalid_device -2
+
+    /* OpenMP 6.0 */
+    extern int   __KAI_KMPC_CONVENTION  omp_get_device_from_uid(const char *DeviceUid);
+    extern const char *   __KAI_KMPC_CONVENTION  omp_get_uid_from_device(int DeviceNum);
 
     /* LLVM Extensions */
     extern void *llvm_omp_target_dynamic_shared_alloc(void);
diff --git a/openmp/runtime/src/kmp_csupport.cpp b/openmp/runtime/src/kmp_csupport.cpp
index 3ca32ba583fe2..a92fc46374c27 100644
--- a/openmp/runtime/src/kmp_csupport.cpp
+++ b/openmp/runtime/src/kmp_csupport.cpp
@@ -1780,7 +1780,7 @@ void __kmpc_end_critical(ident_t *loc, kmp_int32 global_tid,
   if (ompt_enabled.ompt_callback_mutex_released) {
     ompt_callbacks.ompt_callback(ompt_callback_mutex_released)(
         ompt_mutex_critical, (ompt_wait_id_t)(uintptr_t)lck,
-        OMPT_LOAD_RETURN_ADDRESS(0));
+        OMPT_LOAD_RETURN_ADDRESS(global_tid));
   }
 #endif
 
diff --git a/openmp/runtime/src/kmp_ftn_entry.h b/openmp/runtime/src/kmp_ftn_entry.h
index 2b0063eb23a0a..625101b067daf 100644
--- a/openmp/runtime/src/kmp_ftn_entry.h
+++ b/openmp/runtime/src/kmp_ftn_entry.h
@@ -1543,13 +1543,40 @@ int FTN_STDCALL KMP_EXPAND_NAME(FTN_GET_MAX_TASK_PRIORITY)(void) {
 #endif
 }
 
-// This function will be defined in libomptarget. When libomptarget is not
-// loaded, we assume we are on the host and return KMP_HOST_DEVICE.
+// These functions will be defined in libomptarget. When libomptarget is not
+// loaded, we assume we are on the host.
 // Compiler/libomptarget will handle this if called inside target.
 int FTN_STDCALL FTN_GET_DEVICE_NUM(void) KMP_WEAK_ATTRIBUTE_EXTERNAL;
 int FTN_STDCALL FTN_GET_DEVICE_NUM(void) {
   return KMP_EXPAND_NAME(FTN_GET_INITIAL_DEVICE)();
 }
+const char *FTN_STDCALL KMP_EXPAND_NAME(FTN_GET_UID_FROM_DEVICE)(int device_num)
+    KMP_WEAK_ATTRIBUTE_EXTERNAL;
+const char *FTN_STDCALL
+KMP_EXPAND_NAME(FTN_GET_UID_FROM_DEVICE)(int device_num) {
+#if KMP_OS_DARWIN || KMP_OS_WASI || defined(KMP_STUB)
+  return nullptr;
+#else
+  const char *(*fptr)(int);
+  if ((*(void **)(&fptr) = KMP_DLSYM_NEXT("omp_get_uid_from_device")))
+    return (*fptr)(device_num);
+  // Returns the same string as used by libomptarget
+  return "HOST";
+#endif
+}
+int FTN_STDCALL KMP_EXPAND_NAME(FTN_GET_DEVICE_FROM_UID)(const char *device_uid)
+    KMP_WEAK_ATTRIBUTE_EXTERNAL;
+int FTN_STDCALL
+KMP_EXPAND_NAME(FTN_GET_DEVICE_FROM_UID)(const char *device_uid) {
+#if KMP_OS_DARWIN || KMP_OS_WASI || defined(KMP_STUB)
+  return omp_invalid_device;
+#else
+  int (*fptr)(const char *);
+  if ((*(void **)(&fptr) = KMP_DLSYM_NEXT("omp_get_device_from_uid")))
+    return (*fptr)(device_uid);
+  return KMP_EXPAND_NAME(FTN_GET_INITIAL_DEVICE)();
+#endif
+}
 
 // Compiler will ensure that this is only called from host in sequential region
 int FTN_STDCALL KMP_EXPAND_NAME(FTN_PAUSE_RESOURCE)(kmp_pause_status_t kind,
@@ -1906,6 +1933,10 @@ KMP_VERSION_SYMBOL(FTN_SET_AFFINITY_FORMAT, 50, "OMP_5.0");
 // KMP_VERSION_SYMBOL(FTN_GET_SUPPORTED_ACTIVE_LEVELS, 50, "OMP_5.0");
 // KMP_VERSION_SYMBOL(FTN_FULFILL_EVENT, 50, "OMP_5.0");
 
+// OMP_6.0 versioned symbols
+KMP_VERSION_SYMBOL(FTN_GET_UID_FROM_DEVICE, 60, "OMP_6.0");
+KMP_VERSION_SYMBOL(FTN_GET_DEVICE_FROM_UID, 60, "OMP_6.0");
+
 #endif // KMP_USE_VERSION_SYMBOLS
 
 #ifdef __cplusplus
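
The KMP_DLSYM_NEXT lookups above resolve the libomptarget definitions at run
time and fall back to host behavior when libomptarget is not loaded. A rough
standalone equivalent of that pattern, assuming a POSIX dlsym with RTLD_NEXT
(the real macro covers additional platform cases), is sketched below; the
wrapper name is hypothetical.

#include <dlfcn.h>

// Prefer a definition provided later in the symbol lookup order
// (libomptarget); otherwise report the host UID, the same string the
// runtime uses for the initial device.
extern "C" const char *omp_get_uid_from_device_fallback(int DeviceNum) {
  using FnTy = const char *(*)(int);
  if (auto Fn = reinterpret_cast<FnTy>(
          dlsym(RTLD_NEXT, "omp_get_uid_from_device")))
    return Fn(DeviceNum);
  return "HOST";
}
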
diff --git a/openmp/runtime/src/kmp_ftn_os.h b/openmp/runtime/src/kmp_ftn_os.h
index ae0ed067235e5..c439a058f22b4 100644
--- a/openmp/runtime/src/kmp_ftn_os.h
+++ b/openmp/runtime/src/kmp_ftn_os.h
@@ -140,6 +140,8 @@
 #define FTN_GET_MEMSPACE_NUM_RESOURCES omp_get_memspace_num_resources
 #define FTN_GET_SUBMEMSPACE omp_get_submemspace
 #define FTN_GET_DEVICE_NUM omp_get_device_num
+#define FTN_GET_UID_FROM_DEVICE omp_get_uid_from_device
+#define FTN_GET_DEVICE_FROM_UID omp_get_device_from_uid
 #define FTN_SET_AFFINITY_FORMAT omp_set_affinity_format
 #define FTN_GET_AFFINITY_FORMAT omp_get_affinity_format
 #define FTN_DISPLAY_AFFINITY omp_display_affinity
@@ -289,6 +291,8 @@
 #define FTN_ALLOC omp_alloc_
 #define FTN_FREE omp_free_
 #define FTN_GET_DEVICE_NUM omp_get_device_num_
+#define FTN_GET_UID_FROM_DEVICE omp_get_uid_from_device_
+#define FTN_GET_DEVICE_FROM_UID omp_get_device_from_uid_
 #define FTN_SET_AFFINITY_FORMAT omp_set_affinity_format_
 #define FTN_GET_AFFINITY_FORMAT omp_get_affinity_format_
 #define FTN_DISPLAY_AFFINITY omp_display_affinity_
@@ -436,6 +440,8 @@
 #define FTN_GET_MEMSPACE_NUM_RESOURCES OMP_GET_MEMSPACE_NUM_RESOURCES
 #define FTN_GET_SUBMEMSPACE OMP_GET_SUBMEMSPACE
 #define FTN_GET_DEVICE_NUM OMP_GET_DEVICE_NUM
+#define FTN_GET_UID_FROM_DEVICE OMP_GET_UID_FROM_DEVICE
+#define FTN_GET_DEVICE_FROM_UID OMP_GET_DEVICE_FROM_UID
 #define FTN_SET_AFFINITY_FORMAT OMP_SET_AFFINITY_FORMAT
 #define FTN_GET_AFFINITY_FORMAT OMP_GET_AFFINITY_FORMAT
 #define FTN_DISPLAY_AFFINITY OMP_DISPLAY_AFFINITY
@@ -585,6 +591,8 @@
 #define FTN_ALLOC OMP_ALLOC_
 #define FTN_FREE OMP_FREE_
 #define FTN_GET_DEVICE_NUM OMP_GET_DEVICE_NUM_
+#define FTN_GET_UID_FROM_DEVICE OMP_GET_UID_FROM_DEVICE_
+#define FTN_GET_DEVICE_FROM_UID OMP_GET_DEVICE_FROM_UID_
 #define FTN_SET_AFFINITY_FORMAT OMP_SET_AFFINITY_FORMAT_
 #define FTN_GET_AFFINITY_FORMAT OMP_GET_AFFINITY_FORMAT_
 #define FTN_DISPLAY_AFFINITY OMP_DISPLAY_AFFINITY_
diff --git a/openmp/runtime/test/api/omp_device_uid.c b/openmp/runtime/test/api/omp_device_uid.c
new file mode 100644
index 0000000000000..40a1cbb644c7b
--- /dev/null
+++ b/openmp/runtime/test/api/omp_device_uid.c
@@ -0,0 +1,77 @@
+// RUN: %libomp-compile-and-run 2>&1 | FileCheck %s
+// Linking fails for icc 18
+// UNSUPPORTED: icc-18
+
+#include <omp_testsuite.h>
+#include <string.h>
+
+int test_omp_device_uid(int device_num) {
+  const char *device_uid = omp_get_uid_from_device(device_num);
+  if (device_uid == NULL) {
+    printf("FAIL for device %d: omp_get_uid_from_device returned NULL\n",
+           device_num);
+    return 0;
+  }
+
+  int device_num_from_uid = omp_get_device_from_uid(device_uid);
+  if (device_num_from_uid != device_num) {
+    printf(
+        "FAIL for device %d: omp_get_device_from_uid returned %d (UID: %s)\n",
+        device_num, device_num_from_uid, device_uid);
+    return 0;
+  }
+
+  if (device_num == omp_get_initial_device())
+    return 1;
+
+  int success = 1;
+
+// Note that the following code may be executed on the host if offloading
+// falls back to the host device
+#pragma omp target map(tofrom : success) device(device_num)
+  {
+    int device_num = omp_get_device_num();
+
+    // omp_get_uid_from_device() in the device runtime is a dummy function
+    // returning NULL
+    const char *device_uid = omp_get_uid_from_device(device_num);
+
+    // omp_get_device_from_uid() in the device runtime is a dummy function
+    // returning omp_invalid_device.
+    int device_num_from_uid = omp_get_device_from_uid(device_uid);
+
+    // Depending on whether we're executing on the device or the host, we either
+    // got NULL as the device UID or the correct device UID.  Consequently,
+    // omp_get_device_from_uid() either returned omp_invalid_device or the
+    // correct device number (aka omp_get_initial_device()).
+    if (device_uid ? device_num_from_uid != device_num
+                   : device_num_from_uid != omp_invalid_device) {
+      printf("FAIL for device %d (target): omp_get_device_from_uid returned %d "
+             "(UID: %s)\n",
+             device_num, device_num_from_uid, device_uid);
+      success = 0;
+    }
+  }
+
+  return success;
+}
+
+int main() {
+  int num_devices = omp_get_num_devices();
+  int num_failed = 0;
+  // (also test initial device aka num_devices)
+  for (int i = 0; i < num_devices + 1; i++) {
+    if (!test_omp_device_uid(i)) {
+      printf("FAIL for device %d\n", i);
+      num_failed++;
+    }
+  }
+  if (num_failed) {
+    printf("FAIL\n");
+    return 1;
+  }
+  printf("PASS\n");
+  return 0;
+}
+
+// CHECK: PASS
diff --git a/runtimes/cmake/Modules/WarningFlags.cmake b/runtimes/cmake/Modules/WarningFlags.cmake
index 43ef76561cc54..c253b9b117bc4 100644
--- a/runtimes/cmake/Modules/WarningFlags.cmake
+++ b/runtimes/cmake/Modules/WarningFlags.cmake
@@ -2,7 +2,7 @@ include(HandleFlags)
 
 # Warning flags ===============================================================
 function(cxx_add_warning_flags target enable_werror enable_pedantic)
-  target_compile_definitions(${target} PUBLIC -D_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
+  target_compile_definitions(${target} PRIVATE -D_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
   if (MSVC)
     # -W4 is the cl.exe/clang-cl equivalent of -Wall. (In cl.exe and clang-cl,
     # -Wall is equivalent to -Weverything in GCC style compiler drivers.)
diff --git a/third-party/benchmark/bindings/python/build_defs.bzl b/third-party/benchmark/bindings/python/build_defs.bzl
index b0c1b0f5807e3..d520eda616393 100644
--- a/third-party/benchmark/bindings/python/build_defs.bzl
+++ b/third-party/benchmark/bindings/python/build_defs.bzl
@@ -2,6 +2,8 @@
 This file contains some build definitions for C++ extensions used in the Google Benchmark Python bindings.
 """
 
+load("//third_party/bazel_rules/rules_cc/cc:cc_binary.bzl", "cc_binary")
+
 _SHARED_LIB_SUFFIX = {
     "//conditions:default": ".so",
     "//:windows": ".dll",
@@ -10,7 +12,7 @@ _SHARED_LIB_SUFFIX = {
 def py_extension(name, srcs, hdrs = [], copts = [], features = [], deps = []):
     for shared_lib_suffix in _SHARED_LIB_SUFFIX.values():
         shared_lib_name = name + shared_lib_suffix
-        native.cc_binary(
+        cc_binary(
             name = shared_lib_name,
             linkshared = True,
             linkstatic = True,
diff --git a/utils/bazel/configure.bzl b/utils/bazel/configure.bzl
index b976f3955febf..adbd3c6539037 100644
--- a/utils/bazel/configure.bzl
+++ b/utils/bazel/configure.bzl
@@ -4,9 +4,6 @@
 
 """Helper macros to configure the LLVM overlay project."""
 
-# Directory of overlay files relative to WORKSPACE
-DEFAULT_OVERLAY_PATH = "llvm-project-overlay"
-
 DEFAULT_TARGETS = [
     "AArch64",
     "AMDGPU",
@@ -30,43 +27,54 @@ DEFAULT_TARGETS = [
     "XCore",
 ]
 
+
+MAX_TRAVERSAL_STEPS = 1000000  # Upper bound on total visited dirs, guarding against cycles.
+
 def _overlay_directories(repository_ctx):
-    src_path = repository_ctx.path(Label("@llvm-raw//:WORKSPACE")).dirname
-    bazel_path = src_path.get_child("utils").get_child("bazel")
-    overlay_path = bazel_path.get_child("llvm-project-overlay")
-    script_path = bazel_path.get_child("overlay_directories.py")
-
-    python_bin = repository_ctx.which("python3")
-    if not python_bin:
-        # Windows typically just defines "python" as python3. The script itself
-        # contains a check to ensure python3.
-        python_bin = repository_ctx.which("python")
-
-    if not python_bin:
-        fail("Failed to find python3 binary")
-
-    cmd = [
-        python_bin,
-        script_path,
-        "--src",
-        src_path,
-        "--overlay",
-        overlay_path,
-        "--target",
-        ".",
-    ]
-    exec_result = repository_ctx.execute(cmd, timeout = 20)
-
-    if exec_result.return_code != 0:
-        fail(("Failed to execute overlay script: '{cmd}'\n" +
-              "Exited with code {return_code}\n" +
-              "stdout:\n{stdout}\n" +
-              "stderr:\n{stderr}\n").format(
-            cmd = " ".join([str(arg) for arg in cmd]),
-            return_code = exec_result.return_code,
-            stdout = exec_result.stdout,
-            stderr = exec_result.stderr,
-        ))
+    src_root = repository_ctx.path(Label("@llvm-raw//:WORKSPACE")).dirname
+    overlay_root = src_root.get_child("utils/bazel/llvm-project-overlay")
+    target_root = repository_ctx.path(".")
+
+    # Tries to minimize the number of symlinks created (that is, does not symlink
+    # every single file). Symlinks every file in the overlay directory. Only symlinks
+    # individual files in the source directory if their parent directory is also
+    # contained in the overlay directory tree.
+
+    stack = ["."]
+    for _ in range(MAX_TRAVERSAL_STEPS):
+        rel_dir = stack.pop()
+
+        overlay_dirs = set()
+
+        # Symlink overlay files; overlay dirs are handled in later iterations.
+        for entry in overlay_root.get_child(rel_dir).readdir():
+            name = entry.basename
+            full_rel_path = rel_dir + "/" + name
+
+            if entry.is_dir:
+                stack.append(full_rel_path)
+                overlay_dirs.add(name)
+            else:
+                src_path = overlay_root.get_child(full_rel_path)
+                dst_path = target_root.get_child(full_rel_path)
+                repository_ctx.symlink(src_path, dst_path)
+
+        # Symlink source dirs (if not themselves overlaid) and files.
+        for src_entry in src_root.get_child(rel_dir).readdir():
+            name = src_entry.basename
+            if name in overlay_dirs:
+                # Skip: overlay has a directory with this name
+                continue
+
+            repository_ctx.symlink(src_entry, target_root.get_child(rel_dir + "/" + name))
+
+        if not stack:
+            return
+
+    fail("overlay_directories: exceeded MAX_TRAVERSAL_STEPS ({}). " +
+         "Tree too large or a cycle in the filesystem?".format(
+             MAX_TRAVERSAL_STEPS,
+         ))
 
 def _extract_cmake_settings(repository_ctx, llvm_cmake):
     # The list to be written to vars.bzl
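
The new Starlark traversal replaces the deleted overlay_directories.py with
an iterative, stack-based walk over the overlay tree. The same idea,
expressed as a self-contained sketch using std::filesystem instead of
Bazel's repository_ctx (purely illustrative; it assumes overlay files never
collide with same-named source files, as in the real tree):

#include <filesystem>
#include <set>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Symlink every file from `overlay`; symlink entries from `src` only when
// the overlay does not provide a directory of the same name at that level,
// in which case the two trees are merged by recursing into it.
void overlayDirectories(const fs::path &src, const fs::path &overlay,
                        const fs::path &target) {
  std::vector<fs::path> stack = {""};
  while (!stack.empty()) {
    fs::path rel = stack.back();
    stack.pop_back();
    fs::create_directories(target / rel);

    std::set<std::string> overlayDirs;
    for (const auto &entry : fs::directory_iterator(overlay / rel)) {
      fs::path relChild = rel / entry.path().filename();
      if (entry.is_directory()) {
        stack.push_back(relChild);
        overlayDirs.insert(entry.path().filename().string());
      } else {
        fs::create_symlink(entry.path(), target / relChild);
      }
    }
    for (const auto &entry : fs::directory_iterator(src / rel)) {
      std::string name = entry.path().filename().string();
      if (overlayDirs.count(name))
        continue; // Overlaid directory: merged by the traversal, not symlinked.
      fs::create_symlink(entry.path(), target / rel / name);
    }
  }
}
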
diff --git a/utils/bazel/llvm-project-overlay/clang/BUILD.bazel b/utils/bazel/llvm-project-overlay/clang/BUILD.bazel
index 020b2aa68a357..0f256e6272055 100644
--- a/utils/bazel/llvm-project-overlay/clang/BUILD.bazel
+++ b/utils/bazel/llvm-project-overlay/clang/BUILD.bazel
@@ -2,8 +2,6 @@
 # See https://llvm.org/LICENSE.txt for license information.
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 
-load("@com_google_protobuf//bazel:cc_proto_library.bzl", "cc_proto_library")
-load("@com_google_protobuf//bazel:proto_library.bzl", "proto_library")
 load("@rules_cc//cc:defs.bzl", "cc_binary", "cc_library")
 load("@rules_python//python:defs.bzl", "py_binary")
 load(
@@ -2493,64 +2491,3 @@ cc_library(
         "//llvm:TargetParser",
     ],
 )
-
-cc_binary(
-    name = "clang-fuzzer-dictionary",
-    srcs = ["tools/clang-fuzzer/dictionary/dictionary.c"],
-    deps = [":basic"],
-)
-
-genrule(
-    name = "fuzzer-dictionary",
-    outs = ["fuzzer-dictionary.txt"],
-    cmd = "$(location :clang-fuzzer-dictionary) > $@",
-    tools = [":clang-fuzzer-dictionary"],
-)
-
-cc_library(
-    name = "handle-cxx",
-    srcs = ["tools/clang-fuzzer/handle-cxx/handle_cxx.cpp"],
-    hdrs = ["tools/clang-fuzzer/handle-cxx/handle_cxx.h"],
-    deps = [
-        ":codegen",
-        ":frontend",
-        ":lex",
-        ":tooling",
-        "//llvm:Option",
-        "//llvm:Support",
-    ],
-)
-
-proto_library(
-    name = "cxx-proto",
-    srcs = ["tools/clang-fuzzer/cxx_proto.proto"],
-)
-
-cc_proto_library(
-    name = "cxx_cc_proto",
-    deps = [":cxx-proto"],
-)
-
-cc_library(
-    name = "proto-to-cxx-lib",
-    srcs = ["tools/clang-fuzzer/proto-to-cxx/proto_to_cxx.cpp"],
-    hdrs = ["tools/clang-fuzzer/proto-to-cxx/proto_to_cxx.h"],
-    includes = ["tools/clang-fuzzer"],
-    deps = [":cxx_cc_proto"],
-)
-
-cc_binary(
-    name = "clang-proto-to-cxx",
-    srcs = ["tools/clang-fuzzer/proto-to-cxx/proto_to_cxx_main.cpp"],
-    deps = [":proto-to-cxx-lib"],
-)
-
-cc_library(
-    name = "clang-fuzzer-initialize",
-    srcs = ["tools/clang-fuzzer/fuzzer-initialize/fuzzer_initialize.cpp"],
-    hdrs = ["tools/clang-fuzzer/fuzzer-initialize/fuzzer_initialize.h"],
-    deps = [
-        "//llvm:Core",
-        "//llvm:Support",
-    ],
-)
diff --git a/utils/bazel/llvm-project-overlay/clang/tools/clang-fuzzer/BUILD.bazel b/utils/bazel/llvm-project-overlay/clang/tools/clang-fuzzer/BUILD.bazel
new file mode 100644
index 0000000000000..bee4f0e143f89
--- /dev/null
+++ b/utils/bazel/llvm-project-overlay/clang/tools/clang-fuzzer/BUILD.bazel
@@ -0,0 +1,70 @@
+# This file is licensed under the Apache License v2.0 with LLVM Exceptions.
+# See https://llvm.org/LICENSE.txt for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+load("@com_google_protobuf//bazel:cc_proto_library.bzl", "cc_proto_library")
+load("@com_google_protobuf//bazel:proto_library.bzl", "proto_library")
+load("@rules_cc//cc:defs.bzl", "cc_binary", "cc_library")
+
+package(default_visibility = ["//visibility:public"])
+
+cc_binary(
+    name = "clang-fuzzer-dictionary",
+    srcs = ["dictionary/dictionary.c"],
+    deps = ["//clang:basic"],
+)
+
+genrule(
+    name = "fuzzer-dictionary",
+    outs = ["fuzzer-dictionary.txt"],
+    cmd = "$(location :clang-fuzzer-dictionary) > $@",
+    tools = [":clang-fuzzer-dictionary"],
+)
+
+cc_library(
+    name = "handle-cxx",
+    srcs = ["handle-cxx/handle_cxx.cpp"],
+    hdrs = ["handle-cxx/handle_cxx.h"],
+    deps = [
+        "//clang:codegen",
+        "//clang:frontend",
+        "//clang:lex",
+        "//clang:tooling",
+        "//llvm:Option",
+        "//llvm:Support",
+    ],
+)
+
+proto_library(
+    name = "cxx-proto",
+    srcs = ["cxx_proto.proto"],
+)
+
+cc_proto_library(
+    name = "cxx_cc_proto",
+    deps = [":cxx-proto"],
+)
+
+cc_library(
+    name = "proto-to-cxx-lib",
+    srcs = ["proto-to-cxx/proto_to_cxx.cpp"],
+    hdrs = ["proto-to-cxx/proto_to_cxx.h"],
+    includes = ["."],
+    deps = [":cxx_cc_proto"],
+)
+
+cc_binary(
+    name = "clang-proto-to-cxx",
+    srcs = ["proto-to-cxx/proto_to_cxx_main.cpp"],
+    deps = [":proto-to-cxx-lib"],
+)
+
+cc_library(
+    name = "clang-fuzzer-initialize",
+    srcs = ["fuzzer-initialize/fuzzer_initialize.cpp"],
+    hdrs = ["fuzzer-initialize/fuzzer_initialize.h"],
+    deps = [
+        "//llvm:Core",
+        "//llvm:Support",
+    ],
+)
diff --git a/utils/bazel/llvm-project-overlay/libc/BUILD.bazel b/utils/bazel/llvm-project-overlay/libc/BUILD.bazel
index e3962d9b5f95b..e063de015506b 100644
--- a/utils/bazel/llvm-project-overlay/libc/BUILD.bazel
+++ b/utils/bazel/llvm-project-overlay/libc/BUILD.bazel
@@ -1799,11 +1799,9 @@ libc_support_library(
     name = "__support_wctype_utils",
     hdrs = ["src/__support/wctype_utils.h"],
     deps = [
-        ":__support_cpp_optional",
         ":__support_macros_attributes",
         ":__support_macros_config",
         ":types_wchar_t",
-        ":types_wint_t",
     ],
 )
 
@@ -7472,7 +7470,6 @@ libc_function(
     deps = [
         ":__support_common",
         ":__support_macros_config",
-        ":__support_wctype_utils",
         ":hdr_wchar_macros",
         ":types_wint_t",
     ],
@@ -7704,7 +7701,6 @@ libc_function(
         ":__support_common",
         ":__support_macros_config",
         ":__support_macros_null_check",
-        ":__support_wctype_utils",
         ":hdr_stdio_macros",
         ":types_wint_t",
     ],
diff --git a/utils/bazel/llvm-project-overlay/lldb/source/Plugins/BUILD.bazel b/utils/bazel/llvm-project-overlay/lldb/source/Plugins/BUILD.bazel
index da39e58ac70ed..3af1c6dd8c918 100644
--- a/utils/bazel/llvm-project-overlay/lldb/source/Plugins/BUILD.bazel
+++ b/utils/bazel/llvm-project-overlay/lldb/source/Plugins/BUILD.bazel
@@ -2091,6 +2091,7 @@ cc_library(
         "//lldb:Target",
         "//lldb:TargetHeaders",
         "//lldb:Utility",
+        "//llvm:Support",
     ],
 )
 
@@ -2142,11 +2143,15 @@ cc_library(
         ":PluginObjectFilePlaceholder",
         ":PluginProcessUtility",
         "//lldb:Core",
+        "//lldb:CoreHeaders",
+        "//lldb:Headers",
         "//lldb:Host",
         "//lldb:InterpreterHeaders",
+        "//lldb:SymbolHeaders",
         "//lldb:Target",
         "//lldb:TargetHeaders",
         "//lldb:Utility",
+        "//llvm:Support",
     ],
 )
 
diff --git a/utils/bazel/llvm-project-overlay/llvm/BUILD.bazel b/utils/bazel/llvm-project-overlay/llvm/BUILD.bazel
index 1428299076fb3..8e9b51b58f4f5 100644
--- a/utils/bazel/llvm-project-overlay/llvm/BUILD.bazel
+++ b/utils/bazel/llvm-project-overlay/llvm/BUILD.bazel
@@ -1243,42 +1243,12 @@ cc_library(
     ],
 )
 
-AnalysisFpExcSrcs = [
-    "lib/Analysis/ConstantFolding.cpp",
-]
-
-cc_library(
-    name = "AnalysisFpExc",
-    srcs = AnalysisFpExcSrcs,
-    hdrs = glob(
-        [
-            "include/llvm/Analysis/*.h",
-            "include/llvm/Analysis/Utils/*.h",
-        ],
-    ),
-    copts = llvm_copts + ["-ftrapping-math"],
-    textual_hdrs = glob([
-        "include/llvm/Analysis/*.def",
-    ]),
-    deps = [
-        ":BinaryFormat",
-        ":Core",
-        ":Object",
-        ":ProfileData",
-        ":Support",
-        ":TargetParser",
-        ":config",
-        ":target_library_info_gen",
-    ],
-)
-
 cc_library(
     name = "Analysis",
     srcs = glob(
         [
             "lib/Analysis/*.cpp",
         ],
-        exclude = AnalysisFpExcSrcs,
     ),
     hdrs = glob(
         [
@@ -1288,12 +1258,11 @@ cc_library(
     ) + [
         "include/llvm-c/Analysis.h",
     ],
-    copts = llvm_copts,
+    copts = llvm_copts + ["-ftrapping-math"],
     textual_hdrs = glob([
         "include/llvm/Analysis/*.def",
     ]),
     deps = [
-        ":AnalysisFpExc",
         ":BinaryFormat",
         ":Core",
         ":FrontendHLSL",
diff --git a/utils/bazel/llvm-project-overlay/mlir/BUILD.bazel b/utils/bazel/llvm-project-overlay/mlir/BUILD.bazel
index ed020c6f0a2d8..1820ff108ba3b 100644
--- a/utils/bazel/llvm-project-overlay/mlir/BUILD.bazel
+++ b/utils/bazel/llvm-project-overlay/mlir/BUILD.bazel
@@ -4444,6 +4444,7 @@ cc_library(
         ":SCFIncGen",
         ":SideEffectInterfaces",
         ":TensorDialect",
+        ":TransformUtils",
         ":ValueBoundsOpInterface",
         ":ViewLikeInterface",
         "//llvm:Support",
@@ -4795,6 +4796,7 @@ cc_library(
         ":IR",
         ":InliningUtils",
         ":SideEffectInterfaces",
+        ":UBDialect",
         "//llvm:Support",
     ],
 )
@@ -10407,6 +10409,7 @@ cc_library(
         ":OpenACCDialect",
         ":OpenACCPassIncGen",
         ":Pass",
+        ":Support",
         ":TransformUtils",
         ":ViewLikeInterface",
         "//llvm:Support",
diff --git a/utils/bazel/overlay_directories.py b/utils/bazel/overlay_directories.py
deleted file mode 100755
index 526a78e978e5d..0000000000000
--- a/utils/bazel/overlay_directories.py
+++ /dev/null
@@ -1,99 +0,0 @@
-#!/bin/python3
-
-# This file is licensed under the Apache License v2.0 with LLVM Exceptions.
-# See https://llvm.org/LICENSE.txt for license information.
-# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-"""Overlays two directories into a target directory using symlinks.
-
-Tries to minimize the number of symlinks created (that is, does not symlink
-every single file). Symlinks every file in the overlay directory. Only symlinks
-individual files in the source directory if their parent directory is also
-contained in the overlay directory tree.
-"""
-
-import argparse
-import errno
-import os
-import sys
-
-
-def _check_python_version():
-    if sys.version_info[0] < 3:
-        raise RuntimeError(
-            "Must be invoked with a python 3 interpreter but was %s" % sys.executable
-        )
-
-
-def _check_dir_exists(path):
-    if not os.path.isdir(path):
-        raise OSError(errno.ENOENT, os.strerror(errno.ENOENT), path)
-
-
-def parse_arguments():
-    parser = argparse.ArgumentParser(
-        description="""
-    Overlays two directories into a target directory using symlinks.
-
-    Tries to minimize the number of symlinks created (that is, does not symlink
-    every single file). Symlinks every file in the overlay directory. Only
-    symlinks individual files in the source directory if their parent directory
-    is also contained in the overlay directory tree.
-    """
-    )
-    parser.add_argument(
-        "--src",
-        required=True,
-        help="Directory that contains most of the content to symlink.",
-    )
-    parser.add_argument(
-        "--overlay",
-        required=True,
-        help="Directory to overlay on top of the source directory.",
-    )
-    parser.add_argument(
-        "--target",
-        required=True,
-        help="Directory in which to place the fused symlink directories.",
-    )
-
-    args = parser.parse_args()
-
-    _check_dir_exists(args.target)
-    _check_dir_exists(args.overlay)
-    _check_dir_exists(args.src)
-
-    return args
-
-
-def _symlink_abs(from_path, to_path):
-    os.symlink(os.path.abspath(from_path), os.path.abspath(to_path))
-
-
-def main(args):
-    for root, dirs, files in os.walk(args.overlay):
-        # We could do something more intelligent here and only symlink individual
-        # files if the directory is present in both overlay and src. This could also
-        # be generalized to an arbitrary number of directories without any
-        # "src/overlay" distinction. In the current use case we only have two and
-        # the overlay directory is always small, so putting that off for now.
-        rel_root = os.path.relpath(root, start=args.overlay)
-        if rel_root != ".":
-            os.mkdir(os.path.join(args.target, rel_root))
-
-        for file in files:
-            relpath = os.path.join(rel_root, file)
-            _symlink_abs(
-                os.path.join(args.overlay, relpath), os.path.join(args.target, relpath)
-            )
-
-        for src_entry in os.listdir(os.path.join(args.src, rel_root)):
-            if src_entry not in dirs:
-                relpath = os.path.join(rel_root, src_entry)
-                _symlink_abs(
-                    os.path.join(args.src, relpath), os.path.join(args.target, relpath)
-                )
-
-
-if __name__ == "__main__":
-    _check_python_version()
-    main(parse_arguments())

>From c5d666dc19ab86edb9dbec0caf1c85d966b06a5f Mon Sep 17 00:00:00 2001
From: Naveen Seth Hanig <naveen at linux.fritz.box>
Date: Wed, 3 Dec 2025 23:32:36 +0100
Subject: [PATCH 3/3] Address jansvoboda11 review feedback

---
 clang/include/clang/DependencyScanning/DependencyScannerImpl.h  | 2 +-
 .../clang/DependencyScanning/DependencyScanningFilesystem.h     | 2 +-
 .../clang/DependencyScanning/DependencyScanningService.h        | 2 +-
 .../include/clang/DependencyScanning/DependencyScanningUtils.h  | 2 +-
 .../include/clang/DependencyScanning/DependencyScanningWorker.h | 2 +-
 clang/include/clang/DependencyScanning/InProcessModuleCache.h   | 2 +-
 clang/include/clang/DependencyScanning/ModuleDepCollector.h     | 2 +-
 clang/include/clang/Tooling/DependencyScanningTool.h            | 2 +-
 8 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/clang/include/clang/DependencyScanning/DependencyScannerImpl.h b/clang/include/clang/DependencyScanning/DependencyScannerImpl.h
index 0a0808dd9b93e..f3cece28f90e5 100644
--- a/clang/include/clang/DependencyScanning/DependencyScannerImpl.h
+++ b/clang/include/clang/DependencyScanning/DependencyScannerImpl.h
@@ -1,4 +1,4 @@
-//===- DependencyScannerImpl.h - Implements dependency scanning *- C++ -*--===//
+//===----------------------------------------------------------------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
diff --git a/clang/include/clang/DependencyScanning/DependencyScanningFilesystem.h b/clang/include/clang/DependencyScanning/DependencyScanningFilesystem.h
index a4516ff77509d..2162222a66643 100644
--- a/clang/include/clang/DependencyScanning/DependencyScanningFilesystem.h
+++ b/clang/include/clang/DependencyScanning/DependencyScanningFilesystem.h
@@ -1,4 +1,4 @@
-//===- DependencyScanningFilesystem.h - Optimized Scanning FS ---*- C++ -*-===//
+//===----------------------------------------------------------------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
diff --git a/clang/include/clang/DependencyScanning/DependencyScanningService.h b/clang/include/clang/DependencyScanning/DependencyScanningService.h
index 96dd33c28cf5a..371b862996706 100644
--- a/clang/include/clang/DependencyScanning/DependencyScanningService.h
+++ b/clang/include/clang/DependencyScanning/DependencyScanningService.h
@@ -1,4 +1,4 @@
-//===- DependencyScanningService.h - Scanning Service -----------*- C++ -*-===//
+//===----------------------------------------------------------------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
diff --git a/clang/include/clang/DependencyScanning/DependencyScanningUtils.h b/clang/include/clang/DependencyScanning/DependencyScanningUtils.h
index e2fb5ad3e5cf3..7840f031901c3 100644
--- a/clang/include/clang/DependencyScanning/DependencyScanningUtils.h
+++ b/clang/include/clang/DependencyScanning/DependencyScanningUtils.h
@@ -1,4 +1,4 @@
-//===- DependencyScanningUtils.h - Common Scanning Utilities ----*- C++ -*-===//
+//===----------------------------------------------------------------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
diff --git a/clang/include/clang/DependencyScanning/DependencyScanningWorker.h b/clang/include/clang/DependencyScanning/DependencyScanningWorker.h
index 8d91c78c72322..9585691607ca9 100644
--- a/clang/include/clang/DependencyScanning/DependencyScanningWorker.h
+++ b/clang/include/clang/DependencyScanning/DependencyScanningWorker.h
@@ -1,4 +1,4 @@
-//===- DependencyScanningWorker.h - Thread-Safe Scanning Worker -*- C++ -*-===//
+//===----------------------------------------------------------------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
diff --git a/clang/include/clang/DependencyScanning/InProcessModuleCache.h b/clang/include/clang/DependencyScanning/InProcessModuleCache.h
index c0e8f00b7fb59..0585348fa7d1d 100644
--- a/clang/include/clang/DependencyScanning/InProcessModuleCache.h
+++ b/clang/include/clang/DependencyScanning/InProcessModuleCache.h
@@ -1,4 +1,4 @@
-//===- InProcessModuleCache.h - Implicit Module Cache -----------*- C++ -*-===//
+//===----------------------------------------------------------------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
diff --git a/clang/include/clang/DependencyScanning/ModuleDepCollector.h b/clang/include/clang/DependencyScanning/ModuleDepCollector.h
index 0243f7abcbe10..8f665daf03c69 100644
--- a/clang/include/clang/DependencyScanning/ModuleDepCollector.h
+++ b/clang/include/clang/DependencyScanning/ModuleDepCollector.h
@@ -1,4 +1,4 @@
-//===- ModuleDepCollector.h - Callbacks to collect deps ---------*- C++ -*-===//
+//===----------------------------------------------------------------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
diff --git a/clang/include/clang/Tooling/DependencyScanningTool.h b/clang/include/clang/Tooling/DependencyScanningTool.h
index 8e03f6e949689..0ac142a3fc673 100644
--- a/clang/include/clang/Tooling/DependencyScanningTool.h
+++ b/clang/include/clang/Tooling/DependencyScanningTool.h
@@ -1,4 +1,4 @@
-//===- DependencyScanningTool.h - clang-scan-deps service -----------------===//
+//===----------------------------------------------------------------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.


