[llvm] [MTE] [NFC] use vector to collect globals to tag (#120283) (PR #142330)
via llvm-commits
llvm-commits at lists.llvm.org
Sun Jun 1 22:03:54 PDT 2025
https://github.com/jofrn created https://github.com/llvm/llvm-project/pull/142330
[MTE] [NFC] use vector to collect globals to tag (#120283)
The same pattern caused test failures in the HWASan pass, so it is brittle.
Let's go with the simpler approach.
[DOCS] Rename LLVM Security Group to LLVM Security Response Group. (#116986)
Rename LLVM Security Group to LLVM Security Response Group. Take the
opportunity to canonicalise security group and Security Group to LLVM
Security Response Group.
At the 2024-11-19 LLVM Security Group meeting [1] we discussed that in
practice the LLVM Security Group was performing an incident response
role, but it was not proactively adding additional testing, fuzzing and
hardening. We do not want projects that use LLVM to see the LLVM
Security Group as guaranteeing security for LLVM.
We decided that it would be useful to rename the group to LLVM Security
Response Group as that reflects the work that it is doing.
There may be a case for a proactive security group with a different
remit, but this is out of scope of this commit.
[1]
https://discourse.llvm.org/t/llvm-security-group-public-sync-ups/62735/32
[DOCS] Remove bullet point on improving security over time. (#116980)
Remove the 6th bullet point "Strive to improve security over time, for
example by adding additional testing, fuzzing and hardening after fixing
issues."
At the security group meeting on 2024-11-19 we discussed the role the
security group was performing in practice. We are in effect acting as a
security response group, dealing with issues raised via the process
given in the LLVM Security group page. We are not proactively adding
additional testing, fuzzing and hardening. While this could be considered
an aspirational goal, it may give the impression that the LLVM Security
Group is handling, or at worst guaranteeing, security for the LLVM project
when in practice it is not.
Meeting notes:
https://discourse.llvm.org/t/llvm-security-group-public-sync-ups/62735/32
[Github] Add LLVM Premerge Checks to the watchlist (#120230)
LLVM Premerge Checks is running on the new GCP cluster. Tracking its
metrics will allow us to determine the stability of the presubmit and
make sure the new infra is working as intended.
---------
Signed-off-by: Nathan Gauër <brioche at google.com>
[SPIR-V] Fix issue #120078 and simplify parsing of floating point decoration tips in demangled function names (#120128)
This PR fixes https://github.com/llvm/llvm-project/issues/120078 and
improves/simplifies parsing of the demangled function name that aims to
detect a tip for floating point decorations. The latter improvement
also fixes a complaint from `LLVM_USE_SANITIZER=Address`.
[AArch64] Prevent unnecessary truncation in bool vector reduce code generation (#120096)
Prevent unnecessarily truncating results of 128 bit wide vector
comparisons to 64 bit wide vector values in boolean vector reduce
operations.
[LoopVectorize] Enable more early exit vectorisation tests (#117008)
PR #112138 introduced initial support for dispatching to
multiple exit blocks via split middle blocks. This patch
fixes a few issues so that we can enable more tests to use
the new enable-early-exit-vectorization flag. Fixes are:
1. The code to bail out for any loop live-out values happens
too late. This is because collectUsersInExitBlocks ignores
induction variables, which get dealt with in fixupIVUsers.
I've moved the check much earlier in processLoop by looking
for outside users of loop-defined values.
2. We shouldn't yet be interleaving when vectorising loops
with uncountable early exits, since we've not added support
for this yet.
3. Similarly, we also shouldn't be creating vector epilogues.
4. Similarly, we shouldn't enable tail-folding.
5. The existing implementation doesn't yet support loops
that require scalar epilogues, although I plan to add that
as part of PR #88385.
6. The new split middle blocks weren't being added to the
parent loop.
[flang][HLFIR] fix FORALL issue 120190 (#120236)
Fix #120190.
The hlfir.forall lowering code was not properly checking for forall
index references in the mask value computation before trying to hoist it:
it was only looking at the ops directly nested in the hlfir.forall_mask
region, but not at operations indirectly nested. This triggered bogus
hoisting in #120190, leading to undefined behavior (a reference to
uninitialized data). The added regression test would die at compile time
with a dominance error.
Fix this by doing a deep walk of the region operations instead. Also
clean up the region cloning to use without_terminator.
[llvm][RISCV] Set ScalableVector stack id in proper place (#117862)
Without this patch, the ScalableVector frame index property is used before
it is assigned. More precisely, let's take a look at
RISCVFrameLowering::assignCalleeSavedSpillSlots. In this function we
divide callee saved registers into scalar and vector ones, based on the
ScalableVector property of their frame indexes:
```
...
const auto &UnmanagedCSI = getUnmanagedCSI(*MF, CSI);
const auto &RVVCSI = getRVVCalleeSavedInfo(*MF, CSI);
...
```
But we assign ScalableVector property several lines below:
```
...
auto storeRegToStackSlot = [&](decltype(UnmanagedCSI) CSInfo) {
  for (auto &CS : CSInfo) {
    // Insert the spill to the stack frame.
    Register Reg = CS.getReg();
    const TargetRegisterClass *RC = TRI->getMinimalPhysRegClass(Reg);
    TII.storeRegToStackSlot(MBB, MI, Reg, !MBB.isLiveIn(Reg),
                            CS.getFrameIdx(), RC, TRI, Register());
  }
};
storeRegToStackSlot(UnmanagedCSI);
...
```
Because of this, the list of RVV callee-saved registers will always be empty.
Currently this problem doesn't manifest, but if you slightly change the
code and, for example, put some instructions between the scalar and vector
spills, the resulting code will be ill-formed.
[LV] Fixup check lines after 13107cb09441.
[lldb][NFC] clang-format MainLoopPosix.cpp
Since AIX support is about to change this.
[Clang] Implement CWG2813: Class member access with prvalues (#120223)
This is a rebase of #95112 with my own feedback applied, as @MitalAshok has
been inactive for a while.
It's fairly important this makes Clang 20, as it is a blocker for #107451.
---
[CWG2813](https://cplusplus.github.io/CWG/issues/2813.html)
prvalue.member_fn(expression-list) now will not materialize a temporary
for prvalue if member_fn is an explicit object member function, and
prvalue will bind directly to the object parameter.
The E1 in E1.static_member is now a discarded-value expression, so if E1
was a call to a [[nodiscard]] function, there will now be a warning.
This also affects C++98 with [[gnu::warn_unused_result]] functions.
This should not affect C where TemporaryMaterializationConversion is a
no-op.
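For illustration, a minimal sketch of the two behavior changes (hypothetical types; requires C++23 for the explicit object parameter):
```cpp
struct Widget {
  void poke(this Widget self) {}     // explicit object member function
  static int status() { return 0; }
};

[[nodiscard]] Widget make() { return {}; }

void demo() {
  make().poke();    // CWG2813: no temporary is materialized for the prvalue;
                    // it binds directly to the explicit object parameter.
  make().status();  // make() is now a discarded-value expression, so the
                    // [[nodiscard]] on make() warns here.
}
```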
Closes #100314
Fixes #100341
---------
Co-authored-by: Mital Ashok <mital at mitalashok.co.uk>
[lldb] Add lldb/source/Host/posix/MainLoopPosix.cpp to git blame ignores
[VFABI] Add support for vector functions that return struct types (#119000)
This patch updates the `VFABIDemangler` to support vector functions that
return struct types. For example, a vector variant of `sincos` that
returns a vector of sine values and a vector of cosine values within a
struct.
This patch also adds some helpers for vectorizing types (including
struct types). Some of these are used in the `VFABIDemangler`, and
others will be used in subsequent patches, so this patch simply adds
tests for them.
[X86] combineKSHIFT - fold kshiftr(kshiftr/extract_subvector(X,C1),C2) --> kshiftr(X,C1+C2) (#115528)
Merge serial KSHIFTR nodes, possibly separated by EXTRACT_SUBVECTOR, to allow mask instructions to be computed in parallel.
[gn build] Port 1ee740a79620
[github/CODEOWNERS] Add yota9 as BOLT reviewer
[ARM] Reduce loop unroll when low overhead branching is available (#120065)
For processors with low overhead branching (LOB), runtime unrolling the
innermost loop is often detrimental to performance. In these cases the
loop remainder gets unrolled into a series of compare-and-jump blocks,
which in deeply nested loops get executed multiple times, negating the
benefits of LOB.
This is particularly noticeable when the loop trip count of the innermost
loop varies within the outer loop, such as in the case of triangular
matrix decompositions.
In these cases we will prefer to not unroll the innermost loop, with the
intention for it to be executed as a low overhead loop.
Add support for single reductions in ComplexDeinterleavingPass (#112875)
The Complex Deinterleaving pass assumes that all values emitted will
result in complex numbers. This patch aims to remove that assumption and
adds support for emitting just the real or imaginary components, not
both.
Reland [Clang] skip default argument instantiation for non-defining friend declarations to meet [dcl.fct.default] p4 (#115487)
This fixes a crash when instantiating default arguments for templated
friend function declarations which lack a definition.
There are implementation limits which prevent us from finding the
pattern for such functions, and this causes difficulties
setting up the instantiation scope for the function parameters.
This patch skips instantiating the default argument in these cases,
which causes a minor regression in error recovery, but otherwise avoids
the crash.
The previous attempt #113777 accidentally skipped all default argument
constructions, causing some regressions. This patch resolves that by
moving the guard to InstantiateDefaultArgument() where the handling of
templates takes place.
Fixes https://github.com/llvm/llvm-project/issues/113324
[AMDGPU] Modify Dyn Alloca test to account for Machine-Verifier bug (#120393)
The machine verifier crashes in kernel functions,
but fails gracefully in device functions.
This is due to the buffer resource descriptor selected
during G-ISEL, before the fallback path.
Device functions use `$sgpr0_sgpr1_sgpr2_sgpr3`,
while kernel functions select `$private_rsrc_reg`,
where the machine verifier complains:
`$private_rsrc_reg is not a SReg_128 register.`
Modify the test case to capture both behaviors. This is related to
https://github.com/llvm/llvm-project/pull/120063
[clang-tidy] use local config (#120004)
Follow-up patch for #119948.
[NVPTX] fix nvcl-param-align.ll
fix for f9c8c01d38f8fbea81db99ab90b7d0f2bdcc8b4d
SourceCoverageViewHTML.cpp: Reformat JS
Introduce CounterMappingRegion::isBranch(). NFC.
llvm-cov: Refactor SourceCoverageView::renderBranchView().
NFC except for calculating `Total`. I've replaced
`(uint64_t)+(uint64_t)` with `(double)+(double)`.
This is still inexact with large numbers `(1LL << 53)` but is expected to prevent possible overflow.
[SCEV] Bail out on mixed int/pointer in SCEVWrapPredicate::implies.
Fixes a crash when trying to extend the pointer start value to a narrow
integer type after b6c29fdffd65.
LLVMContext: rem constexpr to unblock build w/ gcc (#120402)
Address issues observed in buildbots with older GCC versions:
https://lab.llvm.org/buildbot/#/builders/140/builds/13302
[X86] LowerShift - track the number and location of constant shift elements. (#120270)
We have several vector shift lowering strategies that have to analyse
the distribution of non-uniform constant vector shift amounts; at the
moment there is very little sharing of data between these analyses.
This patch creates a SmallDenseMap of the different LEGAL constant shift
amounts used, with a mask of which elements they are used in. So far
I've only updated the shuffle(immshift(x,c1),immshift(x,c2)) lowering
pattern to use it for clarity; there are several more that can be done in
followups. It's hoped that the proposed patch #117980 can be simplified
after this patch as well.
vec_shift6.ll - the existing shuffle(immshift(x,c1),immshift(x,c2))
lowering bails on out-of-range shift amounts, while this patch now skips
them and treats them as UNDEF - this means we manage to fold more cases
that previously would have had to lower to a SHL->MUL pattern, including
some legalized cases.
[TableGen][GISel] Import more "multi-level" patterns (#120332)
Previously, if the destination DAG has an untyped leaf, we would import
the pattern only if that leaf is defined by the *top-level* source DAG.
This is an unnecessary restriction.
Here is an example of such pattern:
```
def : Pat<(add (mul v8i16:$vA, v8i16:$vB), v8i16:$vC),
          (VMLADDUHM $vA, $vB, $vC)>;
```
Previously, it failed to import because `add` defines neither
`$vA` nor `$vB`.
This change reduces the number of skipped patterns as follows:
```
AArch64: 8695 -> 8548 (-147)
AMDGPU: 11333 -> 11240 (-93)
ARM: 4297 -> 4278 (-1)
PowerPC: 3955 -> 3010 (-945)
```
Other GISel-enabled targets are unaffected.
[LLVM][AsmPrinter] Add vector ConstantInt/FP support to emitGlobalConstantImpl. (#120077)
This fixes a failure path for fixed-length vector globals when
ConstantInt/FP is used to represent splats instead of
ConstantDataVector.
[Exegesis][RISCV] Add RISCV support for llvm-exegesis (#89047)
This patch also makes the following amendments to core exegesis:
* Added a distinction between the regular register aliasing check and
registers used as a memory address in an instruction.
* Added a scratch memory space pointer register.
* General exegesis options were amended:
  * mattr - new option to pass a list of enabled target features
The llvm-exegesis RISC-V port is the result of a team effort; everyone
involved is listed below.
Co-authored-by: Konstantin Vladimirov
<konstantin.vladimirov at syntacore.com>
Co-authored-by: Dmitrii Petrov <dmitrii.petrov at syntacore.com>
Co-authored-by: Dmitry Bushev <dmitry.bushev at syntacore.com>
Co-authored-by: Mark Goncharov <mark.goncharov at syntacore.com>
Co-authored-by: Anastasiya Chernikova
<anastasiya.chernikova at syntacore.com>
---------
Co-authored-by: Dmitry Bushev <dmitry.bushev at syntacore.com>
[X86] urem-seteq-illegal-types.ll - regenerate VPTERNLOG comment
Fix unused variable warning. NFC.
Revert "[Exegesis][RISCV] Add RISCV support for llvm-exegesis (#89047)"
This reverts commit bc3eee11ea6f771bf007c4921a34c1dfee040471.
These tests are failing because they have no `REQUIRES` line.
[Xtensa] Implement Code Density Option. (#119639)
The Code Density option adds 16-bit encoding for frequently used
instructions.
[InstCombine] Drop samesign flags in `foldLogOpOfMaskedICmps_NotAllZeros_BMask_Mixed` (#120373)
Counterexamples: https://alive2.llvm.org/ce/z/6Ks8Qz
Closes https://github.com/llvm/llvm-project/issues/120361.
[lldb][AIX] Header Parsing for XCOFF Object File in AIX (#116338)
This PR is in reference to porting LLDB on AIX.
Link to discussions on llvm discourse and github:
1. https://discourse.llvm.org/t/port-lldb-to-ibm-aix/80640
2. https://github.com/llvm/llvm-project/issues/101657
The complete changes for porting are present in this draft PR:
https://github.com/llvm/llvm-project/pull/102601
Added XCOFF Object File Header Parsing for AIX.
Details about XCOFF file format on AIX:
[XCOFF](https://www.ibm.com/docs/en/aix/7.3?topic=formats-xcoff-object-file-format)
Reapply "[NFC][AMDGPU] Pre-commit clang and llvm tests for dynamic allocas" (#120410)
This reapplies commit https://github.com/llvm/llvm-project/pull/120063.
A machine-verifier bug was causing a crash in the previous commit.
This has been addressed in
https://github.com/llvm/llvm-project/pull/120393.
[AMDGPU] Use -triple instead of -arch in MC tests
[Python] Use raw string literals for regexes (#120401)
Previously these backslashes were not followed by a valid escape
sequence character, so they were treated as literal backslashes, which was
the intended behaviour of the code. However, Python as of 3.12 has started
warning about these, so we should use raw string literals for regexes so
that backslashes are always interpreted literally. I've done this for
every regex in this file for consistency, including the ones which do
not contain backslashes.
[mlir][SCF] Unify tileUsingFor and tileReductionUsingFor implementation (#120115)
This patch unifies the tiling implementation for tileUsingFor and
tileReductionUsingFor. This is done by passing an additional option to
SCFTilingOptions, allowing it to set how reduction dimensions should be
tiled. Currently, there are 3 different options for reduction tiling:
FullReduction (old tileUsingFor), PartialReductionOuterReduction (old
tileReductionUsingFor) and PartialReductionOuterParallel
(linalg::tileReductionUsingForall, this isn't implemented in this
patch).
The patch makes tileReductionUsingFor use the tileUsingFor
implementation with the new reduction tiling options.
There are no test changes because the implementation was doing almost
exactly the same thing. This was also tested in IREE (which uses both
these APIs heavily) and there were no test changes.
Revert "[VectorCombine] Combine scalar fneg with insert/extract to vector fneg when length is different" (#120422)
Reverts llvm/llvm-project#115209 - investigating a reported regression
[VPlan] Handle exit phis with multiple operands in addUsersInExitBlocks. (#120260)
Currently the addUsersInExitBlocks incorrectly assumes exit phis only
have a single operand, which may not be the case for loops with early
exits when they share a common exit block.
Also further relax the assertion in fixupIVUsers to allow exit values if
they come from the loop latch/middle.block.
PR: https://github.com/llvm/llvm-project/pull/120260
[OpenMP][Clang] Migrate OpenMP UserDefinedMapper from Clang to OMPIRBuilder (#110001)
This patch migrates the OpenMP UserDefinedMapper codegen from Clang to
the OpenMPIRBuilder. I will be adding further patches in the near future
so that OpenMP dialect in MLIR can make use of these.
[flang] Add UNSIGNED (#113504)
Implement the UNSIGNED extension type and operations under control of a
language feature flag (-funsigned).
This is nearly identical to the UNSIGNED feature that has been available
in Sun Fortran for years, and now implemented in GNU Fortran for
gfortran 15, and proposed for ISO standardization in J3/24-116.txt.
See the new documentation for details; but in short, this is C's
unsigned type, with guaranteed modular arithmetic for +, -, and *, and
the related transformational intrinsic functions SUM et al.
Revert "Add support for single reductions in ComplexDeinterleavingPass (#112875)"
This reverts commit b3eede5e1fa7ab742b86e9be22db7bccd2505b8a.
This has been breaking most AArch64 stage2 builds for 4+ hours,
reverting to get the bots back to green.
https://lab.llvm.org/buildbot/#/builders/41/builds/4172
https://lab.llvm.org/buildbot/#/builders/4/builds/4281
https://lab.llvm.org/buildbot/#/builders/199/builds/263
https://lab.llvm.org/buildbot/#/builders/198/builds/334
https://lab.llvm.org/buildbot/#/builders/143/builds/4276
https://lab.llvm.org/buildbot/#/builders/17/builds/4725
[Exegesis][RISCV] Add RISCV support for llvm-exegesis (#120419)
This patch also makes the following amendments to core exegesis:
* Added a distinction between the regular register aliasing check and
registers used as a memory address in an instruction.
* Added a scratch memory space pointer register.
* General exegesis options were amended:
  * mattr - new option to pass a list of enabled target features
The llvm-exegesis RISC-V port is the result of a team effort; everyone
involved is listed below.
Co-authored-by: Konstantin Vladimirov
<konstantin.vladimirov at syntacore.com>
Co-authored-by: Dmitrii Petrov <dmitrii.petrov at syntacore.com>
Co-authored-by: Dmitry Bushev <dmitry.bushev at syntacore.com>
Co-authored-by: Mark Goncharov <mark.goncharov at syntacore.com>
Co-authored-by: Anastasiya Chernikova
<anastasiya.chernikova at syntacore.com>
---------
Co-authored-by: Anastasiya Chernikova <anastasiya.chernikova at syntacore.com>
Fix #110001 build error.
[TableGen][GISel] Improve dead register handling (#120426)
A dead implicit def wasn't marked as dead if it was also an implicit use.
The new approach should also be more straightforward and simplifies
future changes for supporting optional defs and physical register defs.
Pull Request: https://github.com/llvm/llvm-project/pull/120426
[DirectX] Split resource info into type and binding info. NFC (#119773)
This splits the DXILResourceAnalysis pass into TypeAnalysis and
BindingAnalysis passes. The type analysis pass is made immutable and
populated lazily so that it can be used earlier in the pipeline without
needing to carefully maintain the invariants of the binding analysis.
Fixes #118400
[X86] LowerShift - don't prematurely lower to x86 vector shift imm instructions (#120282)
When splitting 2 unique amount shifts to shuffle(shift(x,c1),shift(x,c2)), don't use getTargetVShiftByConstNode directly to lower, use generic shifts to ensure we make use of any further canonicalization: shl(X,1) to add(X,X) etc. - this can have notably better throughput on some x86 targets.
Noticed on #120270
[Clang] Set `__cpp_explicit_this_parameter` (#107451)
There are not a lot of outstanding known issues
with deducing this (besides #95112), so it
seems reasonable to claim full support.
Fixes #82780
[clang-doc] Use LangOpts when printing types (#120308)
The implementation in the clang-doc serializer failed to take in the
LangOpts from the declaration. As a result, we'd do things like print
`_Bool` instead of `bool`, even in C++ code.
Fixes #62970
Reland 2de78815604e9027efd93cac27c517bf732587d2 (#119650) (#120454)
[NFC] Move DroppedVariableStats to its own file and redesign it to be
extensible. (#115563)
Move DroppedVariableStats code to its own file and change the class to
have an extensible design so that we can use it to add dropped
statistics to MIR passes and the instruction selector.
[libc][docs] convert stdio.h to docgen (#120334)
Add info from n3220 and POSIX.1-2024.
[flang][NFC] static assert intrinsic table is sorted (#120399)
This invariant is relied upon below when searching for an intrinsic
implementation. Currently, if the map is not sorted, the compiler will
just silently assume there is no such implementation.
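As a rough sketch of the idea (not the actual flang table or check), a constexpr sortedness test can be wired into a static_assert:
```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical miniature name table; the real one lives in flang's
// intrinsic-procedure implementation.
constexpr std::array<std::string_view, 4> intrinsicNames{"abs", "acos", "aint",
                                                         "anint"};

template <std::size_t N>
constexpr bool isSorted(const std::array<std::string_view, N> &Names) {
  for (std::size_t I = 1; I < N; ++I)
    if (!(Names[I - 1] < Names[I]))
      return false;
  return true;
}

static_assert(isSorted(intrinsicNames),
              "intrinsic table must be sorted so binary search stays valid");
```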
[DirectX] Introduce the DXILResourceAccess pass (#116726)
This pass transforms resource access via `llvm.dx.resource.getpointer`
into buffer loads and stores.
Fixes #114848.
[lld] Move BPSectionOrderer from MachO to Common for reuse in ELF (#117514)
Add lld/Common/BPSectionOrdererBase from MachO for reuse in ELF
[DirectX] Create symbols for resource handles (#119775)
We need to create symbols with "the original shape of resource and
element type" to put in the resource metadata in order to generate valid
DXIL.
Note that DXC generally doesn't emit an actual symbol outside of library
shaders (it emits an undef of a pointer to the type), but since we have
to deal with opaque pointers we would need a way to smuggle the type
through to match that. Instead, we simply emit symbols for now.
Fixed #116849
Revert "[Exegesis][RISCV] Add RISCV support for llvm-exegesis (#120419)"
This reverts commit 6993d32c77a78ac0e6eee0e4bffd714a455e776b.
Reason: buildbot breakage
(https://lab.llvm.org/buildbot/#/builders/51/builds/7908)
CCACHE_CPP2=yes CCACHE_HASHDIR=yes /usr/bin/ccache /home/b/sanitizer-aarch64-linux/build/llvm_build0/bin/clang++ -DGTEST_HAS_RTTI=0 -DLLVM_BUILD_STATIC -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/b/sanitizer-aarch64-linux/build/build_default/tools/llvm-exegesis/lib/RISCV -I/home/b/sanitizer-aarch64-linux/build/llvm-project/llvm/tools/llvm-exegesis/lib/RISCV -I/home/b/sanitizer-aarch64-linux/build/build_default/include -I/home/b/sanitizer-aarch64-linux/build/llvm-project/llvm/include -I/home/b/sanitizer-aarch64-linux/build/llvm-project/llvm/lib/Target/RISCV -I/home/b/sanitizer-aarch64-linux/build/build_default/lib/Target/RISCV -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG -std=c++17 -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -MD -MT tools/llvm-exegesis/lib/RISCV/CMakeFiles/LLVMExegesisRISCV.dir/Target.cpp.o -MF tools/llvm-exegesis/lib/RISCV/CMakeFiles/LLVMExegesisRISCV.dir/Target.cpp.o.d -o tools/llvm-exegesis/lib/RISCV/CMakeFiles/LLVMExegesisRISCV.dir/Target.cpp.o -c /home/b/sanitizer-aarch64-linux/build/llvm-project/llvm/tools/llvm-exegesis/lib/RISCV/Target.cpp
In file included from /home/b/sanitizer-aarch64-linux/build/llvm-project/llvm/tools/llvm-exegesis/lib/RISCV/Target.cpp:139:
/home/b/sanitizer-aarch64-linux/build/build_default/lib/Target/RISCV/RISCVGenAsmMatcher.inc:239:19: error: unused function 'MatchRegisterName' [-Werror,-Wunused-function]
239 | static MCRegister MatchRegisterName(StringRef Name) {
| ^~~~~~~~~~~~~~~~~
/home/b/sanitizer-aarch64-linux/build/build_default/lib/Target/RISCV/RISCVGenAsmMatcher.inc:568:19: error: unused function 'MatchRegisterAltName' [-Werror,-Wunused-function]
568 | static MCRegister MatchRegisterAltName(StringRef Name) {
| ^~~~~~~~~~~~~~~~~~~~
[AMDGPU][True16][MC] test update for v_ldexp_f16 in true16 (#119313)
This is an NFC change. Update the mc test for v_ldexp_f16 in true16 format.
The MC source change was done by a previous patch and automatically enabled
by the t16 pseudo.
[AMDGPU][True16][MC] test update for v_subrev_f16 in true16 (#119315)
This is an NFC change. Update the mc test for v_subrev_f16 in true16 format.
The MC source change was done by a previous patch and automatically enabled
by the t16 pseudo.
Revert "[InstCombine] Infer nuw for gep inbounds from base of object" (#120460)
Reverts llvm/llvm-project#119225 due to the lack of sanitizer support,
large potential of breaking code containing latent UB, non-trivial
localization and investigation, and what seems to be a bad interaction
with msan (a test is in the works).
Related discussions:
https://github.com/llvm/llvm-project/pull/119225#issuecomment-2551904822
https://github.com/llvm/llvm-project/pull/118472#issuecomment-2549986255
[NFC] update gfx12 vop test to use sed instead of grep (#120458)
changes from https://github.com/llvm/llvm-project/pull/119778 break the
AIX clang ppc64 bot:
https://lab.llvm.org/buildbot/#/builders/64/builds/1714 as `grep -o` is
not supported on AIX and is not POSIX compatible as per:
https://www.unix.com/man-page/posix/1p/grep/
Co-authored-by: Mark Danial <mark.danial at ibm.com>
[PhaseOrdering] Update test for #120460
[AMDGPU][True16][MC] true16 for v_pack_b32_f16 (#119630)
Support true16 format for v_pack_b32_f16 in MC.
Since we are replacing `v_pack_b32_f16` with
`v_pack_b32_f16_t16/v_pack_b32_f16_fake16` in Post-GFX11, we have to update
the CodeGen pattern for `v_pack_b32_f16_fake16` to get the CodeGen tests
passing. No pattern is modified or created; we just replace
`v_pack_b32_f16` with the fake16 format.
Some of the true16 CodeGen tests are impacted since `v_pack_b32_f16`
selection is removed in Post-GFX11 while `v_pack_b32_f16_t16` is not
yet supported. The CodeGen patch for `v_pack_b32_f16_t16` will be done
in a following patch.
[clang] Change initialization of a vector from undef to poison [NFC] (#120446)
It is fully initialized with insertelements.
[driver] Fix sanitizer libc++ runtime linking (#120370)
1. -f[no-]sanitize-link-c++-runtime is supposed to
override the default behavior implied from `CCCIsCXX`
2. Take into account -nostdlib++ (unblocks #108357)
3. Fix typo hasFlag vs hasArg.
[gn build] Port 5717a99d8de4
[gn build] Port 79e859e049c7
[AMDGPU][MC] Disallow op_sel in some VOP3P dot instructions (#100485)
In v_dot4 and v_dot8 instructions with 4- or 8-bit packed data (e.g.,
v_dot4_u32_u8, v_dot8_u32_u4), the op_sel modifier should not be
allowed.
[MemRef] Migrate away from PointerUnion::{is,get} (NFC) (#120382)
Note that PointerUnion::{is,get} have been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
I'm not touching PointerUnion::dyn_cast for now because it's a bit
complicated; we could blindly migrate it to dyn_cast_if_present, but
we should probably use dyn_cast when the operand is known to be
non-null.
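A minimal sketch of what the migration looks like in practice, using a toy PointerUnion rather than the MemRef code itself:
```cpp
#include "llvm/ADT/PointerUnion.h"

// Reads a value out of either pointee; purely illustrative.
int readValue(llvm::PointerUnion<int *, float *> PU) {
  // Before (soft deprecated):
  //   if (PU.is<int *>())
  //     return *PU.get<int *>();
  // After:
  if (llvm::isa<int *>(PU))
    return *llvm::cast<int *>(PU);
  return static_cast<int>(*llvm::cast<float *>(PU));
}
```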
[memprof] Move Frame::hash and hashCallStack to IndexedMemProfData (NFC) (#120365)
Now that IndexedMemProfData::{addFrame,addCallStack} are the only
callers of Frame::hash and hashCallStack, respectively, this patch
moves those functions into IndexedMemProfData and makes them private.
With this patch, we can obtain FrameId and CallStackId only through
addFrame and addCallStack, respectively.
[DirectX] Lower ops after translating metadata (#120157)
Move the DXILOpLoweringPass after DXILTranslateMetadata, and add asserts
in DXILShaderFlags to ensure it isn't scheduled after op lowering. This
will allow us to rely on DirectX intrinsics in the shader flags analysis
rather than having to recover information from lowered operations.
Fixes #120119.
[mlir python] Port Python core code to nanobind. (#118583)
Why? https://nanobind.readthedocs.io/en/latest/why.html says it better
than I can, but my primary motivation for this change is to improve MLIR
IR construction time from JAX.
For a complicated Google-internal LLM model in JAX, this change improves
the MLIR
lowering time by around 5s (out of around 30s), which is a significant
speedup for simply switching binding frameworks.
To a large extent, this is a mechanical change, for instance changing
`pybind11::`
to `nanobind::`.
Notes:
* this PR needs Nanobind 2.4.0, because it needs a bug fix
(https://github.com/wjakob/nanobind/pull/806) that landed in that
release.
* this PR does not port the in-tree dialect extension modules. They can
be ported in a future PR.
* I removed the py::sibling() annotations from def_static and def_class
in `PybindAdapters.h`. These ask pybind11 to try to form an overload
with an existing method, but it's not possible to form mixed
pybind11/nanobind overloads this way and the parent class is now
defined in nanobind. Better solutions may be possible here.
* nanobind does not contain an exact equivalent of pybind11's buffer
protocol support. It was not hard to add a nanobind implementation of a
similar API.
* nanobind is pickier about casting to std::vector<bool>, expecting that
the input is a sequence of bool types, not truthy values. In a couple of
places I added code to support truthy values during casting.
* nanobind distinguishes bytes (`nb::bytes`) from strings (e.g.,
`std::string`). This required nb::bytes overloads in a few places.
Revert "[mlir python] Port Python core code to nanobind. (#118583)"
This reverts commit 41bd35b58bb482fd466aa4b13aa44a810ad6470f.
Breakage detected, rolling back.
[flang] Don't needlessly instantiate distinct UNSIGNED cases for FINDLOC (#120471)
The FINDLOC runtime doesn't need to distinguish between INTEGER and
UNSIGNED data, so use the code for INTEGER also for UNSIGNED.
[flang][cuda] Using nvvm intrinsics for the syncthread and threadfence families of calls (#120020)
[VPlan] Don't use VPlan ctor taking trip count in most unit tests (NFC).
Update tests to use constructor not passing a trip count VPValue. The
tests don't need that and are simpler as a result.
[libc++] Remove some unused includes (#120219)
[DirectX] TypedUAVLoadAdditionalFormats shader flag (#120477)
Set the TypedUAVLoadAddtionalFormats flag if the shader contains a load
from a multicomponent UAV.
Fixes #114557
[clang-format] Don't change breaking before CtorInitializerColon (#119522)
Don't change breaking before CtorInitializerColon with `ColumnLimit: 0`.
Fixes #119519.
[clang-format] Fix a bug in annotating arrows after init braces (#119958)
Fixes #59066.
[MemProf] Skip unmatched callers when cloning (#120455)
Don't unnecessarily clone for a caller that wasn't matched to a call
instruction.
This necessitated updating a couple of tests that were either
unnecessarily cloning or unnecessarily processing an allocation and
hinting it not cold.
[MemProf] Add quotes around FileCheck pattern (#120481)
Some bots are failing with 2916352936097a35cdcaaf38a9097465adbf8cf5,
likely due to the escapes in the FileCheck pattern. Add extra quotes to
try to fix this.
E.g. https://lab.llvm.org/buildbot/#/builders/46/builds/9442
[llvm][Support] Use __NR_gettid on Linux for compat with older glibc (#120007)
[DirectX] Bug fix for Data Scalarization crash (#118426)
Two bugs here. First, calling `Inst->getFunction()` has undefined
behavior if the instruction is not attached to a function. I suspect
`replaceAllUsesWith` was leaving the GEPs in a weird ghost-parent
situation. I switched up the visitor to be able to `eraseFromParent` as
part of visiting, and then everything started working.
The second bug was in `DXILFlattenArrays.cpp`. I was unaware that you
can have multidimensional arrays of `zeroinitializer` and `undef`, so I
fixed up the initializer handling to cover these two cases.
fixes #117273
[mlir][bufferization]-Replace only one use in TensorEmptyElimination (#118958)
In many cases emptyTensorElimination cannot transform or eliminate
the empty tensor that is being inserted into the
`SubsetInsertionOpInterface`.
Two major reasons for that:
1- Failing to find a legal/suitable insertion point for the
`subsetExtract` that is about to replace the empty tensor. However, we
may handle this issue by moving the values responsible for building the
`subsetExtract` near the empty tensor (which is about to be eliminated),
thus increasing the probability of finding a legal insertion point.
2- The EmptyTensorElimination transform replaces all of the tensor.empty's
uses at once in one apply, rather than replacing only the specific use
that was visited in the use-def chain (when traversing from the
tensor.insert_slice). Replacing all the uses of the tensor.empty may lead
to additional read effects after bufferization of the specific subset
extract/subview, which should not be the case.
Both cases may result in many copies in the subsequent bufferization
that cannot be canonicalized away.
The first case can be noticed when a `tensor.empty` is followed by a
`SubsetInsertionOpInterface` (or in simple words a `tensor.insert_slice`),
which has been lowered from `tensor/tosa.concat`.
The second case can be noticed when a `tensor.empty` has many uses: the
transformation is applied only once, since all the uses are replaced at
once.
The first commit in the PR only adds the lit tests for the cases shown
above (NFC), to emphasize how the transform works; the coming MRs will
contain slight changes to handle these cases.
In the second commit in this PR, we replace only the specific use
that was visited in the `use-def` chain (when traversing from the
`tensor.insert_slice`'s source).
[VPlan] Move initial VPlan block creation to constructor. (NFC)
This sets up the initial blocks needed to initialize a VPlan directly
in the constructor. This will allow tracking of all created blocks
directly in VPlan, simplifying block deletion.
[mlir] Add predicates to tablegen-defined properties (#120176)
Give the properties from tablegen a `predicate` field that holds the
predicate that the property needs to satisfy, if one exists, and hook
that field up to verifier generation.
[memprof] Undrift MemProfRecord (#120138)
This patch undrifts source locations in MemProfRecord before readMemprof
starts the matching process.
The theory of operation is as follows:
1. Collect the lists of direct calls, one from the IR and the other
from the profile.
2. Compute the correspondence (called undrift map in the patch)
between the two lists with longestCommonSequence.
3. Apply the undrift map just before readMemprof consumes
MemProfRecord.
The new functionality is gated by a flag that is off by default.
[SLP] Check if instructions exist after vectorization (#120434)
Fixes #120433.
[mlir][IR] Fix bug in AffineExpr simplifier `lhs % rhs` where `lhs = lhs floordiv rhs` (#119245)
Fixes an issue where the `SimpleAffineExprFlattener` would simplify
`lhs % rhs` to just `-(lhs floordiv rhs)` instead of
`lhs - (lhs floordiv rhs)`
if `lhs` happened to be equal to `lhs floordiv rhs`.
The reported failure case was
`(d0, d1) -> (((d1 - (d1 + 2)) floordiv 8) % 8)`
from https://github.com/llvm/llvm-project/issues/114654.
Note that many paths that simplify AffineMaps (e.g. the AffineApplyOp
folder and canonicalization) would not observe this bug because of
slightly different paths taken by the code. Slightly different
grouping of the terms could also result in avoiding the bug.
Resolves https://github.com/llvm/llvm-project/issues/114654.
[APINotes] Avoid assertion failure with expensive checks (#120487)
Found assertion failures when using EXPENSIVE_CHECKS and running lit
tests for APINotes:
Assertion `left.first != right.first && "two entries for the same
version"' failed.
It seems like std::is_sorted verifies that the comparison function
is irreflexive (comp(a,a) == false) when using expensive checks. So we
would get callbacks to the lambda used for comparison even for vectors
with a single element in APINotesReader::VersionedInfo<T>::VersionedInfo,
with "left" and "right" being the same object. Therefore the assert
checking that we never find equal values would fail.
The fix makes sure that we skip the check for equal values when "left" and
"right" are the same object.
[Exegesis][RISCV] Add RISCV support for llvm-exegesis (#120467)
This patch also makes the following amendments to core exegesis:
* Added a distinction between the regular register aliasing check and
registers used as a memory address in an instruction.
* Added a scratch memory space pointer register.
* General exegesis options were amended:
  * mattr - new option to pass a list of enabled target features
The llvm-exegesis RISC-V port is the result of a team effort; everyone
involved is listed below.
Co-authored-by: Konstantin Vladimirov
<konstantin.vladimirov at syntacore.com>
Co-authored-by: Dmitrii Petrov <dmitrii.petrov at syntacore.com>
Co-authored-by: Dmitry Bushev <dmitry.bushev at syntacore.com>
Co-authored-by: Mark Goncharov <mark.goncharov at syntacore.com>
Co-authored-by: Anastasiya Chernikova
<anastasiya.chernikova at syntacore.com>
Original pr: #89047
---------
Co-authored-by: Kazu Hirata <kazu at google.com>
[AMDGPU][True16][MC] true16 for v_cvt_pknorm_i16/u16_f16 (#119605)
Support true16 format for v_cvt_pknorm_i16/u16_f16 in MC.
[AMDGPU][True16][MC] true16 for v_div_fixup_f16 (#119613)
Support true16 format for v_div_fixup_f16 in MC.
[AMDGPU][True16][MC] true16 for v_minmax/maxmin_f16 (#119586)
Support true16 format for v_minmax/maxmin_f16 in MC.
Since we are replacing `v_minmax/maxmin_f16` with `v_minmax/maxmin_f16_t16
/ v_minmax/maxmin_f16_fake16` in Post-GFX11, we have to update the CodeGen
pattern for `v_minmax/maxmin_f16` to get the CodeGen tests passing.
[OpenACC] Implement 'wait' construct
The arguments to this are the same as for the 'wait' clause, so this
reuses all of that infrastructure. So all this has to do is support a
pair of clauses that are already implemented (if and async), plus create
an AST node. This patch does so, and adds proper testing.
[ubsan] Add runtime test for -fsanitize=local-bounds (#120038)
[ubsan] Add -fsanitize-merge (and -fno-sanitize-merge) (#120464)
'-mllvm -ubsan-unique-traps'
(https://github.com/llvm/llvm-project/pull/65972) applies to all UBSan
checks. This patch introduces -fsanitize-merge (defaults to on,
maintaining the status quo behavior) and -fno-sanitize-merge (equivalent
to '-mllvm -ubsan-unique-traps'), with the option to selectively
applying non-merged handlers to a subset of UBSan checks (e.g.,
-fno-sanitize-merge=bool,enum).
N.B. we do not use "trap" in the argument name since
https://github.com/llvm/llvm-project/pull/119302 has generalized
-ubsan-unique-traps to work for non-trap modes (min-rt and regular rt).
This patch does not remove the -ubsan-unique-traps flag; that will
override -f(no-)sanitize-merge.
[Coverage] Resurrect Branch:FalseCnt in SwitchStmt that was pruned in #112694 (#120418)
I missed that FalseCnt for each Case was used to calculate the percentage
in the SwitchStmt. For the moment, I am resurrecting them.
In `!HasDefaultCase`, the pair of Counters shall be `[CaseCountSum,
FalseCnt]` (the reverse of what it was before #112694).
I think it can be considered as the False count on the SwitchStmt.
FalseCnt shall be folded (same as current impl) in the coming
SingleByteCoverage changes, since percentage would not make sense.
Allow `CoverageMapping::getCoverageForFile()` to show Branches also outside functions (#120416)
Fixes #119952
Revert "[ubsan] Add -fsanitize-merge (and -fno-sanitize-merge) (#120464)"
This reverts commit 7eaf4708098c216bf432fc7e0bc79c3771e793a4.
Reason: buildbot breakage (e.g.,
https://lab.llvm.org/buildbot/#/builders/144/builds/14299/steps/6/logs/FAIL__Clang__ubsan-trap-debugloc_c)
[llvm][CodeGen] Intrinsic `llvm.powi.*` code gen for vector arguments (#118242)
Scalarize vector FPOWI instead of promoting the type. This allows the
scalar FPOWIs to be visited and converted to libcalls before promoting
the type.
FIXME: This should be done in LegalizeVectorOps/LegalizeDAG, but call
lowering needs the unpromoted EVT.
Without this patch, in some backends, such as RISCV64 and LoongArch64,
the i32 type is illegal and will be promoted. This causes the exponent type
check to fail when an ISD::FPOWI node generates a libcall.
Fix https://github.com/llvm/llvm-project/issues/118079
Revert "[driver] Fix sanitizer libc++ runtime linking (#120370)"
This reverts commit 9af5de320b77d3757ea2b7e3d85c67f88dfbabb5.
Reason: buildbot breakage
(https://lab.llvm.org/buildbot/#/builders/24/builds/3394/steps/10/logs/stdio)
"Unexpectedly Passed Tests (1):
llvm-libc++-shared.cfg.in :: libcxx/language.support/support.dynamic/libcpp_deallocate.sh.cpp"
Reapply "[ubsan] Add -fsanitize-merge (and -fno-sanitize-merge) (#120…464)" (#120511)
This reverts commit 2691b964150c77a9e6967423383ad14a7693095e. This
reapply fixes the buildbot breakage of the original patch, by updating
clang/test/CodeGen/ubsan-trap-debugloc.c to specify -fsanitize-merge
(the default, which is merge, is applied by the driver but not
clang_cc1).
This reapply also expands clang/test/CodeGen/ubsan-trap-merge.c.
----
Original commit message:
'-mllvm -ubsan-unique-traps'
(https://github.com/llvm/llvm-project/pull/65972) applies to all UBSan
checks. This patch introduces -fsanitize-merge (defaults to on,
maintaining the status quo behavior) and -fno-sanitize-merge (equivalent
to '-mllvm -ubsan-unique-traps'), with the option to selectively
apply non-merged handlers to a subset of UBSan checks (e.g.,
-fno-sanitize-merge=bool,enum).
N.B. we do not use "trap" in the argument name since
https://github.com/llvm/llvm-project/pull/119302 has generalized
-ubsan-unique-traps to work for non-trap modes (min-rt and regular rt).
This patch does not remove the -ubsan-unique-traps flag; that will
override -f(no-)sanitize-merge.
[flang][cuda] Allocate descriptor in managed memory when emboxing device memory (#120485)
When emboxing memory that comes from CUFMemAlloc, we need to allocate
the descriptor in managed memory as it might be passed to a kernel.
[gn] port 8e8692a5420370 (RISCV support for llvm-exegesis)
[mlir python] Port Python core code to nanobind. (#120473)
Relands #118583, with a fix for Python 3.8 compatibility. It was not
possible to set the buffer protocol accessers via slots in Python 3.8.
Why? https://nanobind.readthedocs.io/en/latest/why.html says it better
than I can, but my primary motivation for this change is to improve MLIR
IR construction time from JAX.
For a complicated Google-internal LLM model in JAX, this change improves
the MLIR
lowering time by around 5s (out of around 30s), which is a significant
speedup for simply switching binding frameworks.
To a large extent, this is a mechanical change, for instance changing
`pybind11::` to `nanobind::`.
Notes:
* this PR needs Nanobind 2.4.0, because it needs a bug fix
(https://github.com/wjakob/nanobind/pull/806) that landed in that
release.
* this PR does not port the in-tree dialect extension modules. They can
be ported in a future PR.
* I removed the py::sibling() annotations from def_static and def_class
in `PybindAdapters.h`. These ask pybind11 to try to form an overload
with an existing method, but it's not possible to form mixed
pybind11/nanobind overloads this way and the parent class is now
defined in nanobind. Better solutions may be possible here.
* nanobind does not contain an exact equivalent of pybind11's buffer
protocol support. It was not hard to add a nanobind implementation of a
similar API.
* nanobind is pickier about casting to std::vector<bool>, expecting that
the input is a sequence of bool types, not truthy values. In a couple of
places I added code to support truthy values during casting.
* nanobind distinguishes bytes (`nb::bytes`) from strings (e.g.,
`std::string`). This required nb::bytes overloads in a few places.
[RISCV] Custom legalize vp.merge for mask vectors. (#120479)
The default legalization uses vmslt with a vector of XLen to compute a
mask. This doesn't work if the type isn't legal. For fixed vectors it
will scalarize. For scalable vectors it crashes the compiler.
This patch uses an alternate strategy that promotes the i1 vector to an
i8 vector and does the merge. I don't claim this to be the best
lowering. I wrote it quickly almost 3 years ago when a crash was
reported in our downstream.
Fixes #120405.
[Sema] Fix tautological bounds check warning with -fwrapv (#120480)
The tautological bounds check warning added in #120222 does not take
into account whether signed integer overflow is well defined or not,
which could result in a developer removing a bounds check that may not
actually be always false because of different overflow semantics.
```c
int check(const int* foo, unsigned int idx)
{
  return foo + idx < foo;
}
```
```
$ clang -O2 -c test.c
test.c:3:19: warning: pointer comparison always evaluates to false [-Wtautological-compare]
3 | return foo + idx < foo;
| ^
1 warning generated.
# Bounds check is eliminated without -fwrapv, warning was correct
$ llvm-objdump -dr test.o
...
0000000000000000 <check>:
0: 31 c0 xorl %eax, %eax
2: c3 retq
```
```
$ clang -O2 -fwrapv -c test.c
test.c:3:19: warning: pointer comparison always evaluates to false [-Wtautological-compare]
3 | return foo + idx < foo;
| ^
1 warning generated.
# Bounds check remains, warning was wrong
$ llvm-objdump -dr test.o
0000000000000000 <check>:
0: 89 f0 movl %esi, %eax
2: 48 8d 0c 87 leaq (%rdi,%rax,4), %rcx
6: 31 c0 xorl %eax, %eax
8: 48 39 f9 cmpq %rdi, %rcx
b: 0f 92 c0 setb %al
e: c3 retq
```
[ADT] Add a unittest for the ScopedHashTable class (#120183)
The ScopedHashTable class is particularly useful for developing string
tables for parsers and code converters. For instance, the MLIRGen class
from the toy example for MLIR actively uses this class to define scopes
for declared variables. To demonstrate common use cases for the
ScopedHashTable class, as well as to check its behavior in different
situations, a unit test has been added.
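A small usage sketch of the scoping behavior the test exercises (illustrative only, not the test itself): bindings inserted inside a scope disappear when that scope is destroyed.
```cpp
#include "llvm/ADT/ScopedHashTable.h"
#include "llvm/ADT/StringRef.h"
#include <cassert>

void scopesDemo() {
  llvm::ScopedHashTable<llvm::StringRef, int> Symbols;
  llvm::ScopedHashTable<llvm::StringRef, int>::ScopeTy Outer(Symbols);
  Symbols.insert("x", 1);
  {
    llvm::ScopedHashTable<llvm::StringRef, int>::ScopeTy Inner(Symbols);
    Symbols.insert("x", 2); // shadows the outer binding
    assert(Symbols.lookup("x") == 2);
  } // Inner is destroyed; the shadowing binding is removed.
  assert(Symbols.lookup("x") == 1);
}
```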
Signed-off-by: Pavel Samolysov <samolisov at gmail.com>
[gn build] Port 1cc926b8b697
[clang-format] Fix a crash caused by commit f03bf8c45f43
[ADT] Fix warnings
This patch fixes warnings of the form:
llvm/unittests/ADT/ScopedHashTableTest.cpp:41:20: error:
'ScopedHashTableScope' may not intend to support class template
argument deduction [-Werror,-Wctad-maybe-unsupported]
[SelectionDAG] Rename SDNode::uses() to users(). (#120499)
This function is most often used in range based loops or algorithms
where the iterator is implicitly dereferenced. The dereference returns
an SDNode * of the user rather than SDUse * so users() is a better name.
I've long been annoyed that we can't write a range-based loop over
SDUse when we need getOperandNo. I plan to rename use_iterator to
user_iterator and add a use_iterator that returns SDUse& on dereference.
This will make it more like IR.
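A hedged sketch of the renamed API in use; before this patch the same loop was spelled with N->uses():
```cpp
#include "llvm/CodeGen/SelectionDAGNodes.h"

void visitUsers(llvm::SDNode *N) {
  // Iterating users(): each element dereferences to the using node
  // (SDNode *), not to the SDUse edge.
  for (llvm::SDNode *User : N->users()) {
    (void)User; // inspect or transform the user here
  }
}
```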
[Coroutines][Docs] Add a discussion on the handling of certain parameter attribs (#117183)
ByVal arguments and Swifterror require special handling in the coroutine
passes. The goal of this section is to provide a description of how
these parameter attributes are handled.
[RISCV] Add software pipeliner support (#117546)
This patch adds basic support for `MachinePipeliner` and disables
it by default.
The functionality should be OK and all llvm-test-suite tests have
passed.
[Clang] Don't assume unexpanded PackExpansions' size when expanding packs (#120380)
CheckParameterPacksForExpansion() previously assumed that template
arguments don't include PackExpansion types when attempting another pack
expansion (i.e. when NumExpansions is present). However, this assumption
doesn't hold for type aliases, whose substitution might involve
unexpanded packs. This can lead to incorrect diagnostics during
substitution because the pack size is not yet determined.
To address this, this patch calculates the minimum pack size (ignoring
unexpanded PackExpansionTypes) and compares it to the previously
expanded size. If the minimum pack size is smaller, then there's still a
chance for future substitution to expand it to a correct size, so we
don't diagnose it too eagerly.
Fixes #61415
Fixes #32252
Fixes #17042
[SelectionDAG] Replace findGlueUse in SelectionDAGISel with SDNode::getGluedUser. NFC (#120512)
[SelectionDAG] Add SDNode::user_begin() and use it in some places (#120509)
Most of these are just places that want the first user and aren't
iterating over the whole list.
While there I changed some use_size() == 1 to hasOneUse() which
is more efficient.
This is part of an effort to rename use_iterator to user_iterator
and provide a use_iterator that dereferences to SDUse&. This patch
helps reduce the diff on later patches.
[RISCV][MCA] Move sifive-x280 tests to directory SiFiveX280 (#120522)
[llvm-mc] --no-exec-stack: replace initSection with switchSection. NFC
AsmParser will call initSection unless -n is specified.
It is not good to call initSection twice.
[NFC] Move DroppedVariableStats code to Analysis (#120502)
This is done because the CodeGen and Passes libraries both link
against Analysis; it avoids adding a dependency between CodeGen and
Passes if we want to extend the DroppedVariableStats code for MIR stats
as well, as seen in https://github.com/llvm/llvm-project/pull/120501
[gn build] Port e389492d6a00
[LLVM] Update BPF maintainer (#120429)
Nowadays yonghong-song and eddyz87 are more involved with LLVM
BPF development than 4ast, so update the maintainer list to reflect
this.
[LLVM] Move Bigcheese to inactive maintainer for Windows object tools (#120425)
Bigcheese isn't actively working on Windows support in object tools
anymore, so move him to the inactive maintainer list. I'm also not
aware of anyone else who is actively involved in this area currently,
so I'm dropping the category entirely for now.
[LLVM] Update maintainers for binary utilities (#120428)
We currently list jakehehrlich as the maintainer for llvm-objcopy /
ObjCopy, but he hasn't been involved with LLVM for more than 5 years.
Convert the llvm-object category into a broader binary utilities
category and add jh7370 and MaskRay as the new maintainers.
Add a pass to collect dropped var stats for MIR (#120501)
Reland "Add a pass to collect dropped var stats for MIR" (#117044)
I am trying to reland https://github.com/llvm/llvm-project/pull/115566
I also moved the DroppedVariableStats code to the Analysis lib
This is part of a stack of patches with
https://github.com/llvm/llvm-project/pull/120502 being the first one in
the stack
Revert "Add a pass to collect dropped var stats for MIR (#120501)"
This reverts commit 223c7648468cd4f649a578d3f9cbc27a63523192.
Reverted due to buildbot failure:
flang-aarch64-libcxx
Linking CXX shared library lib/libLLVMAnalysis.so.20.0git
FAILED: lib/libLLVMAnalysis.so.20.0git
[RISCV] Add scheduling model for mips p8700 CPU (#119885)
Depends on #119882.
[LLVM] Update ADT/Support maintainers (#120423)
Nominate dwblaikie and kuhar as new maintainers for ADT/Support,
replacing chandlerc.
Revert "[RISCV] Add scheduling model for mips p8700 CPU" (#120537)
Reverts llvm/llvm-project#119885
llvm-project/llvm/lib/Target/RISCV/RISCVSchedMIPSP8700.td:20:5:
error: Processor does not define resources for WriteFCvtF32ToF16
def MIPSP8700Model : SchedMachineModel {
[TOSA] Don't run validation pass on non TOSA operations (#120205)
This commit ensures the validation pass is not run on operations from
other dialects. In doing so, operations from other dialects that, for
example, use types not supported by TOSA don't result in an error.
Signed-off-by: Luke Hutton <luke.hutton at arm.com>
Reapply "[driver] Fix sanitizer libc++ runtime linking (#120370)" (#120538)
Reland without item 2 from #120370 to avoid breaking libc++ tests.
This reverts commit 60a2f32cf5ce75c9a2511d7fc2b0f24699605912.
[Clang] Re-write codegen for atomic_test_and_set and atomic_clear (#120449)
Re-write the sema and codegen for the atomic_test_and_set and
atomic_clear builtin functions to go via AtomicExpr, like the other
atomic builtins do. This simplifies the code, because AtomicExpr already
handles things like generating code to dynamically select the memory
ordering, which was duplicated for these builtins. This also fixes a few
crash bugs, one when passing an integer to the pointer argument, and one
when using an array.
This also adds diagnostics for the memory orderings which are not valid
for atomic_clear according to
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html, which
were missing before.
Fixes #111293.
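For reference, a couple of well-formed calls to the builtins in question (a sketch, not the patch's test cases); only relaxed, release and seq_cst orderings are valid for the clear builtin:
```cpp
// Returns the previous value; true means the flag was already set.
bool spinTryLock(char *Flag) {
  return __atomic_test_and_set(Flag, __ATOMIC_ACQUIRE);
}

// Clears the flag; acquire/acq_rel orderings are rejected here.
void spinUnlock(char *Flag) {
  __atomic_clear(Flag, __ATOMIC_RELEASE);
}
```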
[lldb][AIX] clang-format changes for ProcessLauncherPosixFork.cpp (#120459)
This PR is in reference to porting LLDB on AIX.
Link to discussions on llvm discourse and github:
1. https://discourse.llvm.org/t/port-lldb-to-ibm-aix/80640
2. https://github.com/llvm/llvm-project/issues/101657
The complete changes for porting are present in this draft PR:
https://github.com/llvm/llvm-project/pull/102601
Added clang-format changes for ProcessLauncherPosixFork.cpp which will
be followed by ptrace changes in:
- https://github.com/llvm/llvm-project/pull/120390
[AArch64][SME2] Extend getRegAllocationHints for ZPRStridedOrContiguousReg (#119865)
ZPR2StridedOrContiguous loads used by a FORM_TRANSPOSED_REG_TUPLE
pseudo should attempt to assign a strided register to avoid unnecessary
copies, even though this may overlap with the list of SVE callee-saved registers.
[PAC][MC][ELF][AArch64] Support signed TLSDESC (#120010)
Support the following relocations and assembly operators:
- `R_AARCH64_AUTH_TLSDESC_ADR_PAGE21` (`:tlsdesc_auth:` for `adrp`)
- `R_AARCH64_AUTH_TLSDESC_LD64_LO12` (`:tlsdesc_auth_lo12:` for `ldr`)
- `R_AARCH64_AUTH_TLSDESC_ADD_LO12` (`:tlsdesc_auth_lo12:` for `add`)
[X86] LowerShift - directly initialize SmallVector with build vector operands. NFC.
Don't push_back the operands separately.
[X86] ExtendToType - directly initialize SmallVector with build vector operands. NFC.
Don't push_back the operands separately.
[PS5][Driver] Pass user search paths to linker before implicit ones (#119875)
Responsibility for setting up implicit library search paths was recently
transferred to the PS5 driver (llvm#109796). Prior to this, SIE private
patches in lld performed this function. During the transition, I failed
to maintain the order in which implicit and user-supplied search paths
were supplied/considered. This change ensures user-supplied search paths
appear before any implicit ones on the link line.
SIE tracker: TOOLCHAIN-17490
[bazel] port 79e859e049c77b5190a54fc1ecf1d262e3ef9f11
Revert "[NFC] Move DroppedVariableStats code to Analysis (#120502)"
that introduces a circular dependency of analysis -> codegen -> target
This reverts commit e389492d6a00e1c49a034e13343098541ebd03c6.
[gn build] Port cffe22a93726
[LoopVectorize] Use new single string variant of reportVectorizationFailure (#120414)
[AArch64] Tweak truncate costs for some scalable vector types (#119542)
== We were previously returning an invalid cost when truncating
anything to <vscale x 2 x i1>, which is incorrect since we can
generate perfectly good code for this.
== The costs for truncating legal or unpacked types to predicates
seemed overly optimistic. For example, when truncating
<vscale x 8 x i16> to <vscale x 8 x i1> we typically do
something like
and z0.h, z0.h, #0x1
cmpne p0.h, p0/z, z0.h, #0
I guess it might depend upon whether the input value is
generated in the same block or not and if we can avoid the
inreg zero-extend. However, it feels safe to take the more
conservative cost here.
== The costs for some truncates such as
trunc <vscale x 2 x i32> %a to <vscale x 2 x i16>
were 1, whereas in actual fact they are free and no instructions
are required.
== Also, for this
trunc <vscale x 8 x i32> %a to <vscale x 8 x i16>
it's just a single uzp1 instruction so I reduced the cost to 1.
In general, I've added costs for all cases where the destination
type is legal or unpacked. One unfortunate side effect of this
is the costs for some fixed-width truncates when using SVE now
look too optimistic.
[ARM] Fix BF16 lowering with FullFP16
This adds test coverage for bf16 instructions, making sure that lowering bf16
works with and without +fullfp16.
[Clang] Fix crash in __builtin_assume_aligned (#114217)
The CodeGen for __builtin_assume_aligned assumes that the first argument
is a pointer, so crashes if the int-conversion error is downgraded or
disabled. Emit a non-downgradable error if the argument is not a
pointer, like we currently do for __builtin_launder.
Fixes #110914.
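A hedged sketch of the problematic call shape (illustrative only, not a test
from the patch):

  // In C modes the int-conversion error could be downgraded (e.g. with
  // -Wno-error=int-conversion), letting this reach CodeGen and crash.
  // With this change the non-pointer argument is a hard, non-downgradable
  // error, mirroring the handling of __builtin_launder.
  void *f(int p) { return __builtin_assume_aligned(p, 16); } // rejected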
[AArch64] Fixup destructive floating-point precision conversions (#118788)
This patch changes the zeroing forms of `FCVTXNT`, `FCVTNT`, and
`BFCVTNT` such that their destination operand is also listed as a dag
input. These narrowing down-conversions leave the even elements of the
destination vector unchanged, regardless of the predicate type.
This patch also makes the merging form of `BFCVTNT` non-movprfx'able.
- `FCVTXNT` - [Arm
Developer](https://developer.arm.com/documentation/ddi0602/2024-09/SVE-Instructions/FCVTXNT--Floating-point-down-convert--rounding-to-odd--top--predicated--?lang=en)
- `FCVTNT` - [Arm
Developer](https://developer.arm.com/documentation/ddi0602/2024-09/SVE-Instructions/FCVTNT--predicated---Floating-point-down-convert-and-narrow--top--predicated--?lang=en)
- `BFCVTNT` - [Arm
Developer](https://developer.arm.com/documentation/ddi0602/2024-09/SVE-Instructions/BFCVTNT--Floating-point-down-convert-and-narrow-to-BFloat16--top--predicated--?lang=en)
[LLParser] Remove redundant code (NFC) (#120478)
ARM: Handle vldrh and vstrh in stack access hooks (#120527)
[AMDGPU] Remove unneeded use of !dag. NFC. (#120546)
[analyzer][NFC] Introduce APSIntPtr, a safe wrapper of APSInt (1/4) (#120435)
One could create dangling APSInt references in various ways in the past; these were sometimes assumed to be persisted in the BasicValueFactory.
One should always use BasicValueFactory to create persistent APSInts that can be referenced by ConcreteInts, SymIntExprs, and similar long-lived objects.
If one used a temporary or local variable for this, the reference would dangle.
To enforce this contract between the analyzer's BasicValueFactory and the users of APSInts, let's have a dedicated strong type for this.
The idea is that APSIntPtr is always owned by the BasicValueFactory, and that is the only component that can construct it.
These PRs are all NFC - besides fixing dangling APSInt references.
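A rough sketch of the strong-type idea (names mirror the summary; the real
class in the analyzer sources may differ in detail):

  #include "llvm/ADT/APSInt.h"

  class APSIntPtr {
    const llvm::APSInt *Ptr;           // always points into the BasicValueFactory
    explicit APSIntPtr(const llvm::APSInt &V) : Ptr(&V) {}
    friend class BasicValueFactory;    // the only component allowed to construct it
  public:
    const llvm::APSInt &get() const { return *Ptr; }
  };

Because only BasicValueFactory can mint an APSIntPtr, a ConcreteInt or
SymIntExpr holding one can no longer be built from a temporary APSInt.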
[LoopVectorizer] Add support for partial reductions (#92418)
Following on from https://github.com/llvm/llvm-project/pull/94499, this
patch adds support to the Loop Vectorizer to emit the partial reduction
intrinsics where they may be beneficial for the target.
---------
Co-authored-by: Samuel Tebbs <samuel.tebbs at arm.com>
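For context, the classic loop shape these intrinsics target is an integer dot
product; a hedged C++ sketch follows (whether the vectorizer actually emits
partial reductions for it depends on the target and its cost model):

  #include <cstdint>

  // Widening multiply feeding a 32-bit accumulator; on targets with suitable
  // dot-product style instructions this accumulation can become a partial
  // reduction instead of a full widened reduction.
  int dot(const std::int8_t *a, const std::int8_t *b, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i)
      sum += int(a[i]) * int(b[i]);
    return sum;
  }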
[analyzer][NFC] Migrate nonloc::ConcreteInt to use APSIntPtr (2/4) (#120436)
[analyzer][NFC] Migrate loc::ConcreteInt to use APSIntPtr (3/4) (#120437)
[analyzer][NFC] Migrate {SymInt,IntSym}Expr to use APSIntPtr (4/4) (#120438)
[clang] NFC, simplify the shouldLifetimeExtendThroughPath.
[FMV][AArch64] Emit mangled default version if explicitly specified. (#120022)
Currently we need at least one more version other than the default to
trigger FMV. However, we would like a header file declaration
__attribute__((target_version("default"))) void f(void);
to guarantee that there will be an f.default symbol, as sketched below.
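A hedged sketch of the scenario (the symbol name follows the summary; the
exact emission details are an assumption about this patch, not a quote from it):

  // Header: the default version is explicitly specified.
  __attribute__((target_version("default"))) void f(void);

  // Defining TU. With this change the mangled f.default is emitted even
  // though no other target_version of f appears in this translation unit.
  __attribute__((target_version("default"))) void f(void) {}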
[X86] Put R20/R21/R28/R29 later in GR64 list (#120510)
Because these registers require an extra byte to encode in certain
memory forms, putting them later in the list will reduce code size when
EGPR is enabled. This also aligns with the order used in the GR8, GR16 and GR32 lists.
Example:
movq (%r20), %r11 # encoding: [0xd5,0x1c,0x8b,0x1c,0x24]
movq (%r22), %r11 # encoding: [0xd5,0x1c,0x8b,0x1e]
[analyzer] Handle [[assume(cond)]] as __builtin_assume(cond) (#116462)
Resolves #100762
Gist of the change:
1. All the symbol analysis, constraint manager and expression parsing
logic was already present, but the previous code didn't "visit" the
expressions within `assume()`. Once those expressions are parsed, the
rest "just works": the SVals are evaluated, leaning on the same logic
that makes code using `__builtin_assume` work (a minimal sketch follows
this entry).
2. Skip adding an expression to the CFG if it has side effects (similar
to CGStmt.cpp; TODO: add link).
3. Add an additional test case for ternary operator handling and modify
CFG.cpp's VisitGuardedExpr code to `continue` if the `ProgramPoint`
is a `StmtPoint`.
---------
Co-authored-by: Balazs Benics <benicsbalazs at gmail.com>
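A minimal sketch of the equivalence described in point 1 above (illustrative
only, not a test from the patch):

  int g(int x) {
    [[assume(x > 0)]];    // now handled like __builtin_assume(x > 0)
    if (x == 0)
      return 10 / x;      // infeasible under the assumption, so the analyzer
                          // no longer reports a division-by-zero path here
    return x;
  }

Per point 2, an assumption whose expression has side effects (for example
[[assume(++x > 0)]]) is not added to the CFG and is effectively ignored.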
[X86] getShuffleCost - when splitting shuffles, if a whole vector source is just copied we should treat this as free. (#120561)
If the shuffle split results in referencing a single legalised whole vector (i.e. no permutation), then this can be treated as free.
We already do something similar for broadcasts / whole subvector insertion + extraction - it's purely an issue for register allocation.
[AArch64][SVE] Use SVE for scalar FP converts in streaming[-compatible] functions (1/n) (#118505)
In streaming[-compatible] functions, use SVE for scalar FP conversions
to/from integer types. This can help avoid moves between FPRs and GPRs,
which could be costly.
This patch also updates definitions of SCVTF_ZPmZ_StoD and
UCVTF_ZPmZ_StoD to disallow lowering to them from ISD nodes, as doing so
requires creating a [U|S]INT_TO_FP_MERGE_PASSTHRU node with inconsistent
types.
Follow up to #112213.
Note: This PR does not include support for f64 <-> i32 conversions (like
#112564), which needs a bit more work to support.
Reland "[RISCV] Add scheduling model for mips p8700 CPU" (#120550)
This patch introduces a scheduling model for the MIPS p8700, an
out-of-order RISC-V processor. The model includes pipelines for the
following units:
- 2 Integer Arithmetic/Logical Units (ALU and AL2)
- Multiply/Divide Unit (MDU)
- Branch Unit (CTI)
- Load/Store Unit (LSU)
- Short Floating-Point Pipe (FPUS)
- Long Floating-Point Pipe (FPUL)
For additional details, refer to the official product page:
https://mips.com/products/hardware/p8700/.
Also adds `UnsupportedSchedZfhmin` to handle cases like
`WriteFCvtF16ToF32` that previously caused build failures.
[Clang][AArch64] Add signed index/offset variants of sve2p1 qword stores (#120549)
This patch adds signed offset/index variants to the SVE2p1 quadword
store intrinsics, in accordance with
https://github.com/ARM-software/acle/pull/359.
[lldb][AIX] GetOpt support in AIX (#120574)
This PR is in reference to porting LLDB on AIX.
Link to discussions on llvm discourse and github:
1. https://discourse.llvm.org/t/port-lldb-to-ibm-aix/80640
2. https://github.com/llvm/llvm-project/issues/101657
The complete changes for porting are present in this draft PR:
https://github.com/llvm/llvm-project/pull/102601
>From 382744ee05d770abbafbca0a56ee8d1657234056 Mon Sep 17 00:00:00 2001
From: jofrn <jofernau at amd.com>
Date: Wed, 18 Dec 2024 03:33:03 -0500
Subject: [PATCH 1/8] IR/Verifier: Allow vector type in atomic load and store
Vector types on atomics are assumed to be invalid by the verifier. However,
these types can be valid if they are lowered by codegen.
commit-id:72529270
---
llvm/docs/LangRef.rst | 8 ++++----
llvm/docs/ReleaseNotes.md | 1 +
llvm/lib/IR/Verifier.cpp | 14 ++++++++------
llvm/test/Assembler/atomic.ll | 19 +++++++++++++++++++
llvm/test/Verifier/atomics.ll | 15 ++++++++-------
5 files changed, 40 insertions(+), 17 deletions(-)
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 8c0a046d3a7e9..d0badb292647b 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -11244,8 +11244,8 @@ If the ``load`` is marked as ``atomic``, it takes an extra :ref:`ordering
<ordering>` and optional ``syncscope("<target-scope>")`` argument. The
``release`` and ``acq_rel`` orderings are not valid on ``load`` instructions.
Atomic loads produce :ref:`defined <memmodel>` results when they may see
-multiple atomic stores. The type of the pointee must be an integer, pointer, or
-floating-point type whose bit width is a power of two greater than or equal to
+multiple atomic stores. The type of the pointee must be an integer, pointer,
+floating-point, or vector type whose bit width is a power of two greater than or equal to
eight and less than or equal to a target-specific size limit. ``align`` must be
explicitly specified on atomic loads. Note: if the alignment is not greater or
equal to the size of the `<value>` type, the atomic operation is likely to
@@ -11385,8 +11385,8 @@ If the ``store`` is marked as ``atomic``, it takes an extra :ref:`ordering
<ordering>` and optional ``syncscope("<target-scope>")`` argument. The
``acquire`` and ``acq_rel`` orderings aren't valid on ``store`` instructions.
Atomic loads produce :ref:`defined <memmodel>` results when they may see
-multiple atomic stores. The type of the pointee must be an integer, pointer, or
-floating-point type whose bit width is a power of two greater than or equal to
+multiple atomic stores. The type of the pointee must be an integer, pointer,
+floating-point, or vector type whose bit width is a power of two greater than or equal to
eight and less than or equal to a target-specific size limit. ``align`` must be
explicitly specified on atomic stores. Note: if the alignment is not greater or
equal to the size of the `<value>` type, the atomic operation is likely to
diff --git a/llvm/docs/ReleaseNotes.md b/llvm/docs/ReleaseNotes.md
index 10efc889fcc7f..56f2c92311f24 100644
--- a/llvm/docs/ReleaseNotes.md
+++ b/llvm/docs/ReleaseNotes.md
@@ -65,6 +65,7 @@ Changes to the LLVM IR
removed:
* `mul`
+* A `load atomic` may now be used with vector types.
* Updated semantics of `llvm.type.checked.load.relative` to match that of
`llvm.load.relative`.
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
index fdb4ddaafbbcc..68d42cfadd169 100644
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -4323,9 +4323,10 @@ void Verifier::visitLoadInst(LoadInst &LI) {
Check(LI.getOrdering() != AtomicOrdering::Release &&
LI.getOrdering() != AtomicOrdering::AcquireRelease,
"Load cannot have Release ordering", &LI);
- Check(ElTy->isIntOrPtrTy() || ElTy->isFloatingPointTy(),
- "atomic load operand must have integer, pointer, or floating point "
- "type!",
+ Check(ElTy->getScalarType()->isIntOrPtrTy() ||
+ ElTy->getScalarType()->isFloatingPointTy(),
+ "atomic load operand must have integer, pointer, floating point, "
+ "or vector type!",
ElTy, &LI);
checkAtomicMemAccessSize(ElTy, &LI);
} else {
@@ -4349,9 +4350,10 @@ void Verifier::visitStoreInst(StoreInst &SI) {
Check(SI.getOrdering() != AtomicOrdering::Acquire &&
SI.getOrdering() != AtomicOrdering::AcquireRelease,
"Store cannot have Acquire ordering", &SI);
- Check(ElTy->isIntOrPtrTy() || ElTy->isFloatingPointTy(),
- "atomic store operand must have integer, pointer, or floating point "
- "type!",
+ Check(ElTy->getScalarType()->isIntOrPtrTy() ||
+ ElTy->getScalarType()->isFloatingPointTy(),
+ "atomic store operand must have integer, pointer, floating point, "
+ "or vector type!",
ElTy, &SI);
checkAtomicMemAccessSize(ElTy, &SI);
} else {
diff --git a/llvm/test/Assembler/atomic.ll b/llvm/test/Assembler/atomic.ll
index 39f33f9fdcacb..6609edc2953cc 100644
--- a/llvm/test/Assembler/atomic.ll
+++ b/llvm/test/Assembler/atomic.ll
@@ -52,6 +52,25 @@ define void @f(ptr %x) {
; CHECK: atomicrmw volatile usub_sat ptr %x, i32 10 syncscope("agent") monotonic
atomicrmw volatile usub_sat ptr %x, i32 10 syncscope("agent") monotonic
+ ; CHECK : load atomic <1 x i32>, ptr %x unordered, align 4
+ load atomic <1 x i32>, ptr %x unordered, align 4
+ ; CHECK : store atomic <1 x i32> splat (i32 3), ptr %x release, align 4
+ store atomic <1 x i32> <i32 3>, ptr %x release, align 4
+ ; CHECK : load atomic <2 x i32>, ptr %x unordered, align 4
+ load atomic <2 x i32>, ptr %x unordered, align 4
+ ; CHECK : store atomic <2 x i32> <i32 3, i32 4>, ptr %x release, align 4
+ store atomic <2 x i32> <i32 3, i32 4>, ptr %x release, align 4
+
+ ; CHECK : load atomic <2 x ptr>, ptr %x unordered, align 4
+ load atomic <2 x ptr>, ptr %x unordered, align 4
+ ; CHECK : store atomic <2 x ptr> zeroinitializer, ptr %x release, align 4
+ store atomic <2 x ptr> zeroinitializer, ptr %x release, align 4
+
+ ; CHECK : load atomic <2 x float>, ptr %x unordered, align 4
+ load atomic <2 x float>, ptr %x unordered, align 4
+ ; CHECK : store atomic <2 x float> <float 3.0, float 4.0>, ptr %x release, align 4
+ store atomic <2 x float> <float 3.0, float 4.0>, ptr %x release, align 4
+
; CHECK: fence syncscope("singlethread") release
fence syncscope("singlethread") release
; CHECK: fence seq_cst
diff --git a/llvm/test/Verifier/atomics.ll b/llvm/test/Verifier/atomics.ll
index f835b98b24345..17bf5a0528d73 100644
--- a/llvm/test/Verifier/atomics.ll
+++ b/llvm/test/Verifier/atomics.ll
@@ -1,14 +1,15 @@
; RUN: not opt -passes=verify < %s 2>&1 | FileCheck %s
+; CHECK: atomic store operand must have integer, pointer, floating point, or vector type!
+; CHECK: atomic load operand must have integer, pointer, floating point, or vector type!
-; CHECK: atomic store operand must have integer, pointer, or floating point type!
-; CHECK: atomic load operand must have integer, pointer, or floating point type!
+%ty = type { i32 };
-define void @foo(ptr %P, <1 x i64> %v) {
- store atomic <1 x i64> %v, ptr %P unordered, align 8
+define void @foo(ptr %P, %ty %v) {
+ store atomic %ty %v, ptr %P unordered, align 8
ret void
}
-define <1 x i64> @bar(ptr %P) {
- %v = load atomic <1 x i64>, ptr %P unordered, align 8
- ret <1 x i64> %v
+define %ty @bar(ptr %P) {
+ %v = load atomic %ty, ptr %P unordered, align 8
+ ret %ty %v
}
>From 23959886eba5bf21c189dbc569f19a827db54848 Mon Sep 17 00:00:00 2001
From: jofrn <jofernau at amd.com>
Date: Wed, 18 Dec 2024 03:37:17 -0500
Subject: [PATCH 2/8] [SelectionDAG] Legalize <1 x T> vector types for atomic
load
`load atomic <1 x T>` is not valid. This change legalizes
vector types of atomic load via scalarization in SelectionDAG
so that it can, for example, translate from `v1i32` to `i32`.
commit-id:5c36cc8c
---
llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h | 1 +
.../SelectionDAG/LegalizeVectorTypes.cpp | 15 ++
llvm/test/CodeGen/X86/atomic-load-store.ll | 250 +++++++++++++++++-
3 files changed, 257 insertions(+), 9 deletions(-)
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h b/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
index dd9af47da5287..d24b4517a460d 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
@@ -879,6 +879,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
SDValue ScalarizeVecRes_UnaryOpWithExtraInput(SDNode *N);
SDValue ScalarizeVecRes_INSERT_VECTOR_ELT(SDNode *N);
SDValue ScalarizeVecRes_LOAD(LoadSDNode *N);
+ SDValue ScalarizeVecRes_ATOMIC_LOAD(AtomicSDNode *N);
SDValue ScalarizeVecRes_SCALAR_TO_VECTOR(SDNode *N);
SDValue ScalarizeVecRes_VSELECT(SDNode *N);
SDValue ScalarizeVecRes_SELECT(SDNode *N);
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index 4d844f0036a75..d6cbf2211f053 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -65,6 +65,9 @@ void DAGTypeLegalizer::ScalarizeVectorResult(SDNode *N, unsigned ResNo) {
R = ScalarizeVecRes_UnaryOpWithExtraInput(N);
break;
case ISD::INSERT_VECTOR_ELT: R = ScalarizeVecRes_INSERT_VECTOR_ELT(N); break;
+ case ISD::ATOMIC_LOAD:
+ R = ScalarizeVecRes_ATOMIC_LOAD(cast<AtomicSDNode>(N));
+ break;
case ISD::LOAD: R = ScalarizeVecRes_LOAD(cast<LoadSDNode>(N));break;
case ISD::SCALAR_TO_VECTOR: R = ScalarizeVecRes_SCALAR_TO_VECTOR(N); break;
case ISD::SIGN_EXTEND_INREG: R = ScalarizeVecRes_InregOp(N); break;
@@ -455,6 +458,18 @@ SDValue DAGTypeLegalizer::ScalarizeVecRes_INSERT_VECTOR_ELT(SDNode *N) {
return Op;
}
+SDValue DAGTypeLegalizer::ScalarizeVecRes_ATOMIC_LOAD(AtomicSDNode *N) {
+ SDValue Result = DAG.getAtomicLoad(
+ ISD::NON_EXTLOAD, SDLoc(N), N->getMemoryVT().getVectorElementType(),
+ N->getValueType(0).getVectorElementType(), N->getChain(), N->getBasePtr(),
+ N->getMemOperand());
+
+ // Legalize the chain result - switch anything that used the old chain to
+ // use the new one.
+ ReplaceValueWith(SDValue(N, 1), Result.getValue(1));
+ return Result;
+}
+
SDValue DAGTypeLegalizer::ScalarizeVecRes_LOAD(LoadSDNode *N) {
assert(N->isUnindexed() && "Indexed vector load?");
diff --git a/llvm/test/CodeGen/X86/atomic-load-store.ll b/llvm/test/CodeGen/X86/atomic-load-store.ll
index 45277ce3d26c4..4f5cb5a4e9247 100644
--- a/llvm/test/CodeGen/X86/atomic-load-store.ll
+++ b/llvm/test/CodeGen/X86/atomic-load-store.ll
@@ -1,12 +1,12 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -mcpu=x86-64 | FileCheck %s --check-prefixes=CHECK,CHECK-O3
-; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=CHECK,CHECK-O3
-; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=CHECK,CHECK-O3
-; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=CHECK,CHECK-O3
+; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=CHECK,CHECK-SSE-O3
+; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=CHECK,CHECK-AVX-O3
+; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=CHECK,CHECK-AVX-O3
; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -O0 -mcpu=x86-64 | FileCheck %s --check-prefixes=CHECK,CHECK-O0
-; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -O0 -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=CHECK,CHECK-O0
-; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -O0 -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=CHECK,CHECK-O0
-; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -O0 -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=CHECK,CHECK-O0
+; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -O0 -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=CHECK,CHECK-SSE-O0
+; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -O0 -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=CHECK,CHECK-AVX-O0
+; RUN: llc < %s -mtriple=x86_64-- -verify-machineinstrs -O0 -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=CHECK,CHECK-AVX-O0
define void @test1(ptr %ptr, i32 %val1) {
; CHECK-LABEL: test1:
@@ -34,6 +34,238 @@ define i32 @test3(ptr %ptr) {
%val = load atomic i32, ptr %ptr seq_cst, align 4
ret i32 %val
}
-;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
-; CHECK-O0: {{.*}}
-; CHECK-O3: {{.*}}
+
+define <1 x i32> @atomic_vec1_i32(ptr %x) {
+; CHECK-LABEL: atomic_vec1_i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: movl (%rdi), %eax
+; CHECK-NEXT: retq
+ %ret = load atomic <1 x i32>, ptr %x acquire, align 4
+ ret <1 x i32> %ret
+}
+
+define <1 x i8> @atomic_vec1_i8(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec1_i8:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movzbl (%rdi), %eax
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec1_i8:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movzbl (%rdi), %eax
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec1_i8:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movzbl (%rdi), %eax
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec1_i8:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movb (%rdi), %al
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec1_i8:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movb (%rdi), %al
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec1_i8:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movb (%rdi), %al
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <1 x i8>, ptr %x acquire, align 1
+ ret <1 x i8> %ret
+}
+
+define <1 x i16> @atomic_vec1_i16(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec1_i16:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec1_i16:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec1_i16:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec1_i16:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movw (%rdi), %ax
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec1_i16:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movw (%rdi), %ax
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec1_i16:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movw (%rdi), %ax
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <1 x i16>, ptr %x acquire, align 2
+ ret <1 x i16> %ret
+}
+
+define <1 x i32> @atomic_vec1_i8_zext(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec1_i8_zext:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movzbl (%rdi), %eax
+; CHECK-O3-NEXT: movzbl %al, %eax
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec1_i8_zext:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movzbl (%rdi), %eax
+; CHECK-SSE-O3-NEXT: movzbl %al, %eax
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec1_i8_zext:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movzbl (%rdi), %eax
+; CHECK-AVX-O3-NEXT: movzbl %al, %eax
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec1_i8_zext:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movb (%rdi), %al
+; CHECK-O0-NEXT: movzbl %al, %eax
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec1_i8_zext:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movb (%rdi), %al
+; CHECK-SSE-O0-NEXT: movzbl %al, %eax
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec1_i8_zext:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movb (%rdi), %al
+; CHECK-AVX-O0-NEXT: movzbl %al, %eax
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <1 x i8>, ptr %x acquire, align 1
+ %zret = zext <1 x i8> %ret to <1 x i32>
+ ret <1 x i32> %zret
+}
+
+define <1 x i64> @atomic_vec1_i16_sext(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec1_i16_sext:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-O3-NEXT: movswq %ax, %rax
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec1_i16_sext:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-SSE-O3-NEXT: movswq %ax, %rax
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec1_i16_sext:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-AVX-O3-NEXT: movswq %ax, %rax
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec1_i16_sext:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movw (%rdi), %ax
+; CHECK-O0-NEXT: movswq %ax, %rax
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec1_i16_sext:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movw (%rdi), %ax
+; CHECK-SSE-O0-NEXT: movswq %ax, %rax
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec1_i16_sext:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movw (%rdi), %ax
+; CHECK-AVX-O0-NEXT: movswq %ax, %rax
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <1 x i16>, ptr %x acquire, align 2
+ %sret = sext <1 x i16> %ret to <1 x i64>
+ ret <1 x i64> %sret
+}
+
+define <1 x ptr addrspace(270)> @atomic_vec1_ptr270(ptr %x) {
+; CHECK-LABEL: atomic_vec1_ptr270:
+; CHECK: # %bb.0:
+; CHECK-NEXT: movl (%rdi), %eax
+; CHECK-NEXT: retq
+ %ret = load atomic <1 x ptr addrspace(270)>, ptr %x acquire, align 4
+ ret <1 x ptr addrspace(270)> %ret
+}
+
+define <1 x bfloat> @atomic_vec1_bfloat(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec1_bfloat:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-O3-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec1_bfloat:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-SSE-O3-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec1_bfloat:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-AVX-O3-NEXT: vpinsrw $0, %eax, %xmm0, %xmm0
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec1_bfloat:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movw (%rdi), %cx
+; CHECK-O0-NEXT: # implicit-def: $eax
+; CHECK-O0-NEXT: movw %cx, %ax
+; CHECK-O0-NEXT: # implicit-def: $xmm0
+; CHECK-O0-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec1_bfloat:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movw (%rdi), %cx
+; CHECK-SSE-O0-NEXT: # implicit-def: $eax
+; CHECK-SSE-O0-NEXT: movw %cx, %ax
+; CHECK-SSE-O0-NEXT: # implicit-def: $xmm0
+; CHECK-SSE-O0-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec1_bfloat:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movw (%rdi), %cx
+; CHECK-AVX-O0-NEXT: # implicit-def: $eax
+; CHECK-AVX-O0-NEXT: movw %cx, %ax
+; CHECK-AVX-O0-NEXT: # implicit-def: $xmm0
+; CHECK-AVX-O0-NEXT: vpinsrw $0, %eax, %xmm0, %xmm0
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <1 x bfloat>, ptr %x acquire, align 2
+ ret <1 x bfloat> %ret
+}
+
+define <1 x ptr> @atomic_vec1_ptr_align(ptr %x) nounwind {
+; CHECK-LABEL: atomic_vec1_ptr_align:
+; CHECK: # %bb.0:
+; CHECK-NEXT: movq (%rdi), %rax
+; CHECK-NEXT: retq
+ %ret = load atomic <1 x ptr>, ptr %x acquire, align 8
+ ret <1 x ptr> %ret
+}
+
+define <1 x i64> @atomic_vec1_i64_align(ptr %x) nounwind {
+; CHECK-LABEL: atomic_vec1_i64_align:
+; CHECK: # %bb.0:
+; CHECK-NEXT: movq (%rdi), %rax
+; CHECK-NEXT: retq
+ %ret = load atomic <1 x i64>, ptr %x acquire, align 8
+ ret <1 x i64> %ret
+}
>From 9a088dc48e1646ef1ff4e4d6a86dbdfd2fc43679 Mon Sep 17 00:00:00 2001
From: jofrn <jofernau at amd.com>
Date: Wed, 18 Dec 2024 03:38:23 -0500
Subject: [PATCH 3/8] [X86] Manage atomic load of fp -> int promotion in DAG
When lowering atomic <1 x T> vector types with floats, selection can fail since
this pattern is unsupported. To support this, floats can be cast to
an integer type of the same size.
commit-id:f9d761c5
---
llvm/lib/Target/X86/X86ISelLowering.cpp | 4 +
llvm/test/CodeGen/X86/atomic-load-store.ll | 117 +++++++++++++++++++++
2 files changed, 121 insertions(+)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index b1a3e3c006bb3..776d3c0a42e2f 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -2653,6 +2653,10 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
setOperationAction(Op, MVT::f32, Promote);
}
+ setOperationPromotedToType(ISD::ATOMIC_LOAD, MVT::f16, MVT::i16);
+ setOperationPromotedToType(ISD::ATOMIC_LOAD, MVT::f32, MVT::i32);
+ setOperationPromotedToType(ISD::ATOMIC_LOAD, MVT::f64, MVT::i64);
+
// We have target-specific dag combine patterns for the following nodes:
setTargetDAGCombine({ISD::VECTOR_SHUFFLE,
ISD::SCALAR_TO_VECTOR,
diff --git a/llvm/test/CodeGen/X86/atomic-load-store.ll b/llvm/test/CodeGen/X86/atomic-load-store.ll
index 4f5cb5a4e9247..9fab8b98b4af0 100644
--- a/llvm/test/CodeGen/X86/atomic-load-store.ll
+++ b/llvm/test/CodeGen/X86/atomic-load-store.ll
@@ -269,3 +269,120 @@ define <1 x i64> @atomic_vec1_i64_align(ptr %x) nounwind {
%ret = load atomic <1 x i64>, ptr %x acquire, align 8
ret <1 x i64> %ret
}
+
+define <1 x half> @atomic_vec1_half(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec1_half:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-O3-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec1_half:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-SSE-O3-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec1_half:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-AVX-O3-NEXT: vpinsrw $0, %eax, %xmm0, %xmm0
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec1_half:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movw (%rdi), %cx
+; CHECK-O0-NEXT: # implicit-def: $eax
+; CHECK-O0-NEXT: movw %cx, %ax
+; CHECK-O0-NEXT: # implicit-def: $xmm0
+; CHECK-O0-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec1_half:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movw (%rdi), %cx
+; CHECK-SSE-O0-NEXT: # implicit-def: $eax
+; CHECK-SSE-O0-NEXT: movw %cx, %ax
+; CHECK-SSE-O0-NEXT: # implicit-def: $xmm0
+; CHECK-SSE-O0-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec1_half:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movw (%rdi), %cx
+; CHECK-AVX-O0-NEXT: # implicit-def: $eax
+; CHECK-AVX-O0-NEXT: movw %cx, %ax
+; CHECK-AVX-O0-NEXT: # implicit-def: $xmm0
+; CHECK-AVX-O0-NEXT: vpinsrw $0, %eax, %xmm0, %xmm0
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <1 x half>, ptr %x acquire, align 2
+ ret <1 x half> %ret
+}
+
+define <1 x float> @atomic_vec1_float(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec1_float:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec1_float:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec1_float:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec1_float:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec1_float:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec1_float:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <1 x float>, ptr %x acquire, align 4
+ ret <1 x float> %ret
+}
+
+define <1 x double> @atomic_vec1_double_align(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec1_double_align:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec1_double_align:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec1_double_align:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec1_double_align:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec1_double_align:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec1_double_align:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <1 x double>, ptr %x acquire, align 8
+ ret <1 x double> %ret
+}
>From 620e182faad8f780d6f353db4b4b01c25de03be2 Mon Sep 17 00:00:00 2001
From: jofrn <jofernau at amd.com>
Date: Wed, 18 Dec 2024 03:40:32 -0500
Subject: [PATCH 4/8] [X86] Add atomic vector tests for unaligned >1 sizes.
Unaligned atomic vectors with size >1 are lowered to calls.
Adding their tests separately here.
commit-id:a06a5cc6
---
llvm/test/CodeGen/X86/atomic-load-store.ll | 588 +++++++++++++++++++++
1 file changed, 588 insertions(+)
diff --git a/llvm/test/CodeGen/X86/atomic-load-store.ll b/llvm/test/CodeGen/X86/atomic-load-store.ll
index 9fab8b98b4af0..3e7b73a65fe07 100644
--- a/llvm/test/CodeGen/X86/atomic-load-store.ll
+++ b/llvm/test/CodeGen/X86/atomic-load-store.ll
@@ -270,6 +270,82 @@ define <1 x i64> @atomic_vec1_i64_align(ptr %x) nounwind {
ret <1 x i64> %ret
}
+define <1 x ptr> @atomic_vec1_ptr(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec1_ptr:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: pushq %rax
+; CHECK-O3-NEXT: movq %rdi, %rsi
+; CHECK-O3-NEXT: movq %rsp, %rdx
+; CHECK-O3-NEXT: movl $8, %edi
+; CHECK-O3-NEXT: movl $2, %ecx
+; CHECK-O3-NEXT: callq __atomic_load at PLT
+; CHECK-O3-NEXT: movq (%rsp), %rax
+; CHECK-O3-NEXT: popq %rcx
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec1_ptr:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: pushq %rax
+; CHECK-SSE-O3-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O3-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O3-NEXT: movl $8, %edi
+; CHECK-SSE-O3-NEXT: movl $2, %ecx
+; CHECK-SSE-O3-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O3-NEXT: movq (%rsp), %rax
+; CHECK-SSE-O3-NEXT: popq %rcx
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec1_ptr:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: pushq %rax
+; CHECK-AVX-O3-NEXT: movq %rdi, %rsi
+; CHECK-AVX-O3-NEXT: movq %rsp, %rdx
+; CHECK-AVX-O3-NEXT: movl $8, %edi
+; CHECK-AVX-O3-NEXT: movl $2, %ecx
+; CHECK-AVX-O3-NEXT: callq __atomic_load at PLT
+; CHECK-AVX-O3-NEXT: movq (%rsp), %rax
+; CHECK-AVX-O3-NEXT: popq %rcx
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec1_ptr:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: pushq %rax
+; CHECK-O0-NEXT: movq %rdi, %rsi
+; CHECK-O0-NEXT: movl $8, %edi
+; CHECK-O0-NEXT: movq %rsp, %rdx
+; CHECK-O0-NEXT: movl $2, %ecx
+; CHECK-O0-NEXT: callq __atomic_load at PLT
+; CHECK-O0-NEXT: movq (%rsp), %rax
+; CHECK-O0-NEXT: popq %rcx
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec1_ptr:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: pushq %rax
+; CHECK-SSE-O0-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O0-NEXT: movl $8, %edi
+; CHECK-SSE-O0-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O0-NEXT: movl $2, %ecx
+; CHECK-SSE-O0-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O0-NEXT: movq (%rsp), %rax
+; CHECK-SSE-O0-NEXT: popq %rcx
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec1_ptr:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: pushq %rax
+; CHECK-AVX-O0-NEXT: movq %rdi, %rsi
+; CHECK-AVX-O0-NEXT: movl $8, %edi
+; CHECK-AVX-O0-NEXT: movq %rsp, %rdx
+; CHECK-AVX-O0-NEXT: movl $2, %ecx
+; CHECK-AVX-O0-NEXT: callq __atomic_load at PLT
+; CHECK-AVX-O0-NEXT: movq (%rsp), %rax
+; CHECK-AVX-O0-NEXT: popq %rcx
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <1 x ptr>, ptr %x acquire, align 4
+ ret <1 x ptr> %ret
+}
+
define <1 x half> @atomic_vec1_half(ptr %x) {
; CHECK-O3-LABEL: atomic_vec1_half:
; CHECK-O3: # %bb.0:
@@ -386,3 +462,515 @@ define <1 x double> @atomic_vec1_double_align(ptr %x) nounwind {
%ret = load atomic <1 x double>, ptr %x acquire, align 8
ret <1 x double> %ret
}
+
+define <1 x i64> @atomic_vec1_i64(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec1_i64:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: pushq %rax
+; CHECK-O3-NEXT: movq %rdi, %rsi
+; CHECK-O3-NEXT: movq %rsp, %rdx
+; CHECK-O3-NEXT: movl $8, %edi
+; CHECK-O3-NEXT: movl $2, %ecx
+; CHECK-O3-NEXT: callq __atomic_load at PLT
+; CHECK-O3-NEXT: movq (%rsp), %rax
+; CHECK-O3-NEXT: popq %rcx
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec1_i64:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: pushq %rax
+; CHECK-SSE-O3-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O3-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O3-NEXT: movl $8, %edi
+; CHECK-SSE-O3-NEXT: movl $2, %ecx
+; CHECK-SSE-O3-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O3-NEXT: movq (%rsp), %rax
+; CHECK-SSE-O3-NEXT: popq %rcx
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec1_i64:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: pushq %rax
+; CHECK-AVX-O3-NEXT: movq %rdi, %rsi
+; CHECK-AVX-O3-NEXT: movq %rsp, %rdx
+; CHECK-AVX-O3-NEXT: movl $8, %edi
+; CHECK-AVX-O3-NEXT: movl $2, %ecx
+; CHECK-AVX-O3-NEXT: callq __atomic_load at PLT
+; CHECK-AVX-O3-NEXT: movq (%rsp), %rax
+; CHECK-AVX-O3-NEXT: popq %rcx
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec1_i64:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: pushq %rax
+; CHECK-O0-NEXT: movq %rdi, %rsi
+; CHECK-O0-NEXT: movl $8, %edi
+; CHECK-O0-NEXT: movq %rsp, %rdx
+; CHECK-O0-NEXT: movl $2, %ecx
+; CHECK-O0-NEXT: callq __atomic_load at PLT
+; CHECK-O0-NEXT: movq (%rsp), %rax
+; CHECK-O0-NEXT: popq %rcx
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec1_i64:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: pushq %rax
+; CHECK-SSE-O0-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O0-NEXT: movl $8, %edi
+; CHECK-SSE-O0-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O0-NEXT: movl $2, %ecx
+; CHECK-SSE-O0-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O0-NEXT: movq (%rsp), %rax
+; CHECK-SSE-O0-NEXT: popq %rcx
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec1_i64:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: pushq %rax
+; CHECK-AVX-O0-NEXT: movq %rdi, %rsi
+; CHECK-AVX-O0-NEXT: movl $8, %edi
+; CHECK-AVX-O0-NEXT: movq %rsp, %rdx
+; CHECK-AVX-O0-NEXT: movl $2, %ecx
+; CHECK-AVX-O0-NEXT: callq __atomic_load at PLT
+; CHECK-AVX-O0-NEXT: movq (%rsp), %rax
+; CHECK-AVX-O0-NEXT: popq %rcx
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <1 x i64>, ptr %x acquire, align 4
+ ret <1 x i64> %ret
+}
+
+define <1 x double> @atomic_vec1_double(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec1_double:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: pushq %rax
+; CHECK-O3-NEXT: movq %rdi, %rsi
+; CHECK-O3-NEXT: movq %rsp, %rdx
+; CHECK-O3-NEXT: movl $8, %edi
+; CHECK-O3-NEXT: movl $2, %ecx
+; CHECK-O3-NEXT: callq __atomic_load at PLT
+; CHECK-O3-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-O3-NEXT: popq %rax
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec1_double:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: pushq %rax
+; CHECK-SSE-O3-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O3-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O3-NEXT: movl $8, %edi
+; CHECK-SSE-O3-NEXT: movl $2, %ecx
+; CHECK-SSE-O3-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O3-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-SSE-O3-NEXT: popq %rax
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec1_double:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: pushq %rax
+; CHECK-AVX-O3-NEXT: movq %rdi, %rsi
+; CHECK-AVX-O3-NEXT: movq %rsp, %rdx
+; CHECK-AVX-O3-NEXT: movl $8, %edi
+; CHECK-AVX-O3-NEXT: movl $2, %ecx
+; CHECK-AVX-O3-NEXT: callq __atomic_load at PLT
+; CHECK-AVX-O3-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-AVX-O3-NEXT: popq %rax
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec1_double:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: pushq %rax
+; CHECK-O0-NEXT: movq %rdi, %rsi
+; CHECK-O0-NEXT: movl $8, %edi
+; CHECK-O0-NEXT: movq %rsp, %rdx
+; CHECK-O0-NEXT: movl $2, %ecx
+; CHECK-O0-NEXT: callq __atomic_load at PLT
+; CHECK-O0-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-O0-NEXT: popq %rax
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec1_double:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: pushq %rax
+; CHECK-SSE-O0-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O0-NEXT: movl $8, %edi
+; CHECK-SSE-O0-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O0-NEXT: movl $2, %ecx
+; CHECK-SSE-O0-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O0-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-SSE-O0-NEXT: popq %rax
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec1_double:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: pushq %rax
+; CHECK-AVX-O0-NEXT: movq %rdi, %rsi
+; CHECK-AVX-O0-NEXT: movl $8, %edi
+; CHECK-AVX-O0-NEXT: movq %rsp, %rdx
+; CHECK-AVX-O0-NEXT: movl $2, %ecx
+; CHECK-AVX-O0-NEXT: callq __atomic_load at PLT
+; CHECK-AVX-O0-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-AVX-O0-NEXT: popq %rax
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <1 x double>, ptr %x acquire, align 4
+ ret <1 x double> %ret
+}
+
+define <2 x i32> @atomic_vec2_i32(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec2_i32:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: pushq %rax
+; CHECK-O3-NEXT: movq %rdi, %rsi
+; CHECK-O3-NEXT: movq %rsp, %rdx
+; CHECK-O3-NEXT: movl $8, %edi
+; CHECK-O3-NEXT: movl $2, %ecx
+; CHECK-O3-NEXT: callq __atomic_load at PLT
+; CHECK-O3-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-O3-NEXT: popq %rax
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec2_i32:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: pushq %rax
+; CHECK-SSE-O3-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O3-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O3-NEXT: movl $8, %edi
+; CHECK-SSE-O3-NEXT: movl $2, %ecx
+; CHECK-SSE-O3-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O3-NEXT: movsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-SSE-O3-NEXT: popq %rax
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec2_i32:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: pushq %rax
+; CHECK-AVX-O3-NEXT: movq %rdi, %rsi
+; CHECK-AVX-O3-NEXT: movq %rsp, %rdx
+; CHECK-AVX-O3-NEXT: movl $8, %edi
+; CHECK-AVX-O3-NEXT: movl $2, %ecx
+; CHECK-AVX-O3-NEXT: callq __atomic_load at PLT
+; CHECK-AVX-O3-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero
+; CHECK-AVX-O3-NEXT: popq %rax
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec2_i32:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: pushq %rax
+; CHECK-O0-NEXT: movq %rdi, %rsi
+; CHECK-O0-NEXT: movl $8, %edi
+; CHECK-O0-NEXT: movq %rsp, %rdx
+; CHECK-O0-NEXT: movl $2, %ecx
+; CHECK-O0-NEXT: callq __atomic_load at PLT
+; CHECK-O0-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
+; CHECK-O0-NEXT: popq %rax
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec2_i32:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: pushq %rax
+; CHECK-SSE-O0-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O0-NEXT: movl $8, %edi
+; CHECK-SSE-O0-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O0-NEXT: movl $2, %ecx
+; CHECK-SSE-O0-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O0-NEXT: movq {{.*#+}} xmm0 = mem[0],zero
+; CHECK-SSE-O0-NEXT: popq %rax
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec2_i32:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: pushq %rax
+; CHECK-AVX-O0-NEXT: movq %rdi, %rsi
+; CHECK-AVX-O0-NEXT: movl $8, %edi
+; CHECK-AVX-O0-NEXT: movq %rsp, %rdx
+; CHECK-AVX-O0-NEXT: movl $2, %ecx
+; CHECK-AVX-O0-NEXT: callq __atomic_load at PLT
+; CHECK-AVX-O0-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
+; CHECK-AVX-O0-NEXT: popq %rax
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <2 x i32>, ptr %x acquire, align 4
+ ret <2 x i32> %ret
+}
+
+define <4 x float> @atomic_vec4_float(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec4_float:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: subq $24, %rsp
+; CHECK-O3-NEXT: movq %rdi, %rsi
+; CHECK-O3-NEXT: movq %rsp, %rdx
+; CHECK-O3-NEXT: movl $16, %edi
+; CHECK-O3-NEXT: movl $2, %ecx
+; CHECK-O3-NEXT: callq __atomic_load at PLT
+; CHECK-O3-NEXT: movaps (%rsp), %xmm0
+; CHECK-O3-NEXT: addq $24, %rsp
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec4_float:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: subq $24, %rsp
+; CHECK-SSE-O3-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O3-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O3-NEXT: movl $16, %edi
+; CHECK-SSE-O3-NEXT: movl $2, %ecx
+; CHECK-SSE-O3-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O3-NEXT: movaps (%rsp), %xmm0
+; CHECK-SSE-O3-NEXT: addq $24, %rsp
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec4_float:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: subq $24, %rsp
+; CHECK-AVX-O3-NEXT: movq %rdi, %rsi
+; CHECK-AVX-O3-NEXT: movq %rsp, %rdx
+; CHECK-AVX-O3-NEXT: movl $16, %edi
+; CHECK-AVX-O3-NEXT: movl $2, %ecx
+; CHECK-AVX-O3-NEXT: callq __atomic_load at PLT
+; CHECK-AVX-O3-NEXT: vmovaps (%rsp), %xmm0
+; CHECK-AVX-O3-NEXT: addq $24, %rsp
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec4_float:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: subq $24, %rsp
+; CHECK-O0-NEXT: movq %rdi, %rsi
+; CHECK-O0-NEXT: movl $16, %edi
+; CHECK-O0-NEXT: movq %rsp, %rdx
+; CHECK-O0-NEXT: movl $2, %ecx
+; CHECK-O0-NEXT: callq __atomic_load at PLT
+; CHECK-O0-NEXT: movaps (%rsp), %xmm0
+; CHECK-O0-NEXT: addq $24, %rsp
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec4_float:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: subq $24, %rsp
+; CHECK-SSE-O0-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O0-NEXT: movl $16, %edi
+; CHECK-SSE-O0-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O0-NEXT: movl $2, %ecx
+; CHECK-SSE-O0-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O0-NEXT: movaps (%rsp), %xmm0
+; CHECK-SSE-O0-NEXT: addq $24, %rsp
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec4_float:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: subq $24, %rsp
+; CHECK-AVX-O0-NEXT: movq %rdi, %rsi
+; CHECK-AVX-O0-NEXT: movl $16, %edi
+; CHECK-AVX-O0-NEXT: movq %rsp, %rdx
+; CHECK-AVX-O0-NEXT: movl $2, %ecx
+; CHECK-AVX-O0-NEXT: callq __atomic_load at PLT
+; CHECK-AVX-O0-NEXT: vmovaps (%rsp), %xmm0
+; CHECK-AVX-O0-NEXT: addq $24, %rsp
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <4 x float>, ptr %x acquire, align 4
+ ret <4 x float> %ret
+}
+
+define <8 x double> @atomic_vec8_double(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec8_double:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: subq $72, %rsp
+; CHECK-O3-NEXT: movq %rdi, %rsi
+; CHECK-O3-NEXT: movq %rsp, %rdx
+; CHECK-O3-NEXT: movl $64, %edi
+; CHECK-O3-NEXT: movl $2, %ecx
+; CHECK-O3-NEXT: callq __atomic_load at PLT
+; CHECK-O3-NEXT: movaps (%rsp), %xmm0
+; CHECK-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1
+; CHECK-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm2
+; CHECK-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm3
+; CHECK-O3-NEXT: addq $72, %rsp
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec8_double:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: subq $72, %rsp
+; CHECK-SSE-O3-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O3-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O3-NEXT: movl $64, %edi
+; CHECK-SSE-O3-NEXT: movl $2, %ecx
+; CHECK-SSE-O3-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O3-NEXT: movaps (%rsp), %xmm0
+; CHECK-SSE-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1
+; CHECK-SSE-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm2
+; CHECK-SSE-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm3
+; CHECK-SSE-O3-NEXT: addq $72, %rsp
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec8_double:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: subq $72, %rsp
+; CHECK-O0-NEXT: movq %rdi, %rsi
+; CHECK-O0-NEXT: movl $64, %edi
+; CHECK-O0-NEXT: movq %rsp, %rdx
+; CHECK-O0-NEXT: movl $2, %ecx
+; CHECK-O0-NEXT: callq __atomic_load at PLT
+; CHECK-O0-NEXT: movapd (%rsp), %xmm0
+; CHECK-O0-NEXT: movapd {{[0-9]+}}(%rsp), %xmm1
+; CHECK-O0-NEXT: movapd {{[0-9]+}}(%rsp), %xmm2
+; CHECK-O0-NEXT: movapd {{[0-9]+}}(%rsp), %xmm3
+; CHECK-O0-NEXT: addq $72, %rsp
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec8_double:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: subq $72, %rsp
+; CHECK-SSE-O0-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O0-NEXT: movl $64, %edi
+; CHECK-SSE-O0-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O0-NEXT: movl $2, %ecx
+; CHECK-SSE-O0-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O0-NEXT: movapd (%rsp), %xmm0
+; CHECK-SSE-O0-NEXT: movapd {{[0-9]+}}(%rsp), %xmm1
+; CHECK-SSE-O0-NEXT: movapd {{[0-9]+}}(%rsp), %xmm2
+; CHECK-SSE-O0-NEXT: movapd {{[0-9]+}}(%rsp), %xmm3
+; CHECK-SSE-O0-NEXT: addq $72, %rsp
+; CHECK-SSE-O0-NEXT: retq
+ %ret = load atomic <8 x double>, ptr %x acquire, align 4
+ ret <8 x double> %ret
+}
+
+define <16 x bfloat> @atomic_vec16_bfloat(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec16_bfloat:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: subq $40, %rsp
+; CHECK-O3-NEXT: movq %rdi, %rsi
+; CHECK-O3-NEXT: movq %rsp, %rdx
+; CHECK-O3-NEXT: movl $32, %edi
+; CHECK-O3-NEXT: movl $2, %ecx
+; CHECK-O3-NEXT: callq __atomic_load at PLT
+; CHECK-O3-NEXT: movaps (%rsp), %xmm0
+; CHECK-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1
+; CHECK-O3-NEXT: addq $40, %rsp
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec16_bfloat:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: subq $40, %rsp
+; CHECK-SSE-O3-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O3-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O3-NEXT: movl $32, %edi
+; CHECK-SSE-O3-NEXT: movl $2, %ecx
+; CHECK-SSE-O3-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O3-NEXT: movaps (%rsp), %xmm0
+; CHECK-SSE-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1
+; CHECK-SSE-O3-NEXT: addq $40, %rsp
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec16_bfloat:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: subq $40, %rsp
+; CHECK-AVX-O3-NEXT: movq %rdi, %rsi
+; CHECK-AVX-O3-NEXT: movq %rsp, %rdx
+; CHECK-AVX-O3-NEXT: movl $32, %edi
+; CHECK-AVX-O3-NEXT: movl $2, %ecx
+; CHECK-AVX-O3-NEXT: callq __atomic_load at PLT
+; CHECK-AVX-O3-NEXT: vmovups (%rsp), %ymm0
+; CHECK-AVX-O3-NEXT: addq $40, %rsp
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec16_bfloat:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: subq $40, %rsp
+; CHECK-O0-NEXT: movq %rdi, %rsi
+; CHECK-O0-NEXT: movl $32, %edi
+; CHECK-O0-NEXT: movq %rsp, %rdx
+; CHECK-O0-NEXT: movl $2, %ecx
+; CHECK-O0-NEXT: callq __atomic_load at PLT
+; CHECK-O0-NEXT: movaps (%rsp), %xmm0
+; CHECK-O0-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1
+; CHECK-O0-NEXT: addq $40, %rsp
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec16_bfloat:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: subq $40, %rsp
+; CHECK-SSE-O0-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O0-NEXT: movl $32, %edi
+; CHECK-SSE-O0-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O0-NEXT: movl $2, %ecx
+; CHECK-SSE-O0-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O0-NEXT: movaps (%rsp), %xmm0
+; CHECK-SSE-O0-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1
+; CHECK-SSE-O0-NEXT: addq $40, %rsp
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec16_bfloat:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: subq $40, %rsp
+; CHECK-AVX-O0-NEXT: movq %rdi, %rsi
+; CHECK-AVX-O0-NEXT: movl $32, %edi
+; CHECK-AVX-O0-NEXT: movq %rsp, %rdx
+; CHECK-AVX-O0-NEXT: movl $2, %ecx
+; CHECK-AVX-O0-NEXT: callq __atomic_load at PLT
+; CHECK-AVX-O0-NEXT: vmovups (%rsp), %ymm0
+; CHECK-AVX-O0-NEXT: addq $40, %rsp
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <16 x bfloat>, ptr %x acquire, align 4
+ ret <16 x bfloat> %ret
+}
+
+define <32 x half> @atomic_vec32_half(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec32_half:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: subq $72, %rsp
+; CHECK-O3-NEXT: movq %rdi, %rsi
+; CHECK-O3-NEXT: movq %rsp, %rdx
+; CHECK-O3-NEXT: movl $64, %edi
+; CHECK-O3-NEXT: movl $2, %ecx
+; CHECK-O3-NEXT: callq __atomic_load at PLT
+; CHECK-O3-NEXT: movaps (%rsp), %xmm0
+; CHECK-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1
+; CHECK-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm2
+; CHECK-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm3
+; CHECK-O3-NEXT: addq $72, %rsp
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec32_half:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: subq $72, %rsp
+; CHECK-SSE-O3-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O3-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O3-NEXT: movl $64, %edi
+; CHECK-SSE-O3-NEXT: movl $2, %ecx
+; CHECK-SSE-O3-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O3-NEXT: movaps (%rsp), %xmm0
+; CHECK-SSE-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1
+; CHECK-SSE-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm2
+; CHECK-SSE-O3-NEXT: movaps {{[0-9]+}}(%rsp), %xmm3
+; CHECK-SSE-O3-NEXT: addq $72, %rsp
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec32_half:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: subq $72, %rsp
+; CHECK-O0-NEXT: movq %rdi, %rsi
+; CHECK-O0-NEXT: movl $64, %edi
+; CHECK-O0-NEXT: movq %rsp, %rdx
+; CHECK-O0-NEXT: movl $2, %ecx
+; CHECK-O0-NEXT: callq __atomic_load at PLT
+; CHECK-O0-NEXT: movaps (%rsp), %xmm0
+; CHECK-O0-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1
+; CHECK-O0-NEXT: movaps {{[0-9]+}}(%rsp), %xmm2
+; CHECK-O0-NEXT: movaps {{[0-9]+}}(%rsp), %xmm3
+; CHECK-O0-NEXT: addq $72, %rsp
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec32_half:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: subq $72, %rsp
+; CHECK-SSE-O0-NEXT: movq %rdi, %rsi
+; CHECK-SSE-O0-NEXT: movl $64, %edi
+; CHECK-SSE-O0-NEXT: movq %rsp, %rdx
+; CHECK-SSE-O0-NEXT: movl $2, %ecx
+; CHECK-SSE-O0-NEXT: callq __atomic_load at PLT
+; CHECK-SSE-O0-NEXT: movaps (%rsp), %xmm0
+; CHECK-SSE-O0-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1
+; CHECK-SSE-O0-NEXT: movaps {{[0-9]+}}(%rsp), %xmm2
+; CHECK-SSE-O0-NEXT: movaps {{[0-9]+}}(%rsp), %xmm3
+; CHECK-SSE-O0-NEXT: addq $72, %rsp
+; CHECK-SSE-O0-NEXT: retq
+ %ret = load atomic <32 x half>, ptr %x acquire, align 4
+ ret <32 x half> %ret
+}
>From 768b1a91595c23b3902aee20b8142b2897ac3840 Mon Sep 17 00:00:00 2001
From: jofrn <jofernau at amd.com>
Date: Thu, 19 Dec 2024 11:19:39 -0500
Subject: [PATCH 5/8] [SelectionDAG] Widen <2 x T> vector types for atomic load
Vector types of 2 elements must be widened. This change does this
for vector types of atomic load in SelectionDAG
so that it can translate aligned vectors with more than one element.
commit-id:2894ccd1
---
llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h | 1 +
.../SelectionDAG/LegalizeVectorTypes.cpp | 97 ++++--
llvm/test/CodeGen/X86/atomic-load-store.ll | 286 ++++++++++++++++++
3 files changed, 361 insertions(+), 23 deletions(-)
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h b/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
index d24b4517a460d..b6e018ba0e454 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
@@ -1068,6 +1068,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
SDValue WidenVecRes_EXTRACT_SUBVECTOR(SDNode* N);
SDValue WidenVecRes_INSERT_SUBVECTOR(SDNode *N);
SDValue WidenVecRes_INSERT_VECTOR_ELT(SDNode* N);
+ SDValue WidenVecRes_ATOMIC_LOAD(AtomicSDNode *N);
SDValue WidenVecRes_LOAD(SDNode* N);
SDValue WidenVecRes_VP_LOAD(VPLoadSDNode *N);
SDValue WidenVecRes_VP_STRIDED_LOAD(VPStridedLoadSDNode *N);
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index d6cbf2211f053..42763aab5bb55 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -4622,6 +4622,9 @@ void DAGTypeLegalizer::WidenVectorResult(SDNode *N, unsigned ResNo) {
break;
case ISD::EXTRACT_SUBVECTOR: Res = WidenVecRes_EXTRACT_SUBVECTOR(N); break;
case ISD::INSERT_VECTOR_ELT: Res = WidenVecRes_INSERT_VECTOR_ELT(N); break;
+ case ISD::ATOMIC_LOAD:
+ Res = WidenVecRes_ATOMIC_LOAD(cast<AtomicSDNode>(N));
+ break;
case ISD::LOAD: Res = WidenVecRes_LOAD(N); break;
case ISD::STEP_VECTOR:
case ISD::SPLAT_VECTOR:
@@ -6003,6 +6006,74 @@ SDValue DAGTypeLegalizer::WidenVecRes_INSERT_VECTOR_ELT(SDNode *N) {
N->getOperand(1), N->getOperand(2));
}
+/// Either return the same load or provide appropriate casts
+/// from the load and return that.
+static SDValue coerceLoadedValue(SDValue LdOp, EVT FirstVT, EVT WidenVT,
+ TypeSize LdWidth, TypeSize FirstVTWidth,
+ SDLoc dl, SelectionDAG &DAG) {
+ assert(TypeSize::isKnownLE(LdWidth, FirstVTWidth));
+ TypeSize WidenWidth = WidenVT.getSizeInBits();
+ if (!FirstVT.isVector()) {
+ unsigned NumElts =
+ WidenWidth.getFixedValue() / FirstVTWidth.getFixedValue();
+ EVT NewVecVT = EVT::getVectorVT(*DAG.getContext(), FirstVT, NumElts);
+ SDValue VecOp = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, NewVecVT, LdOp);
+ return DAG.getNode(ISD::BITCAST, dl, WidenVT, VecOp);
+ }
+ assert(FirstVT == WidenVT);
+ return LdOp;
+}
+
+static std::optional<EVT> findMemType(SelectionDAG &DAG,
+ const TargetLowering &TLI, unsigned Width,
+ EVT WidenVT, unsigned Align,
+ unsigned WidenEx);
+
+SDValue DAGTypeLegalizer::WidenVecRes_ATOMIC_LOAD(AtomicSDNode *LD) {
+ EVT WidenVT =
+ TLI.getTypeToTransformTo(*DAG.getContext(), LD->getValueType(0));
+ EVT LdVT = LD->getMemoryVT();
+ SDLoc dl(LD);
+ assert(LdVT.isVector() && WidenVT.isVector() && "Expected vectors");
+ assert(LdVT.isScalableVector() == WidenVT.isScalableVector() &&
+ "Must be scalable");
+ assert(LdVT.getVectorElementType() == WidenVT.getVectorElementType() &&
+ "Expected equivalent element types");
+
+ // Load information
+ SDValue Chain = LD->getChain();
+ SDValue BasePtr = LD->getBasePtr();
+ MachineMemOperand::Flags MMOFlags = LD->getMemOperand()->getFlags();
+ AAMDNodes AAInfo = LD->getAAInfo();
+
+ TypeSize LdWidth = LdVT.getSizeInBits();
+ TypeSize WidenWidth = WidenVT.getSizeInBits();
+ TypeSize WidthDiff = WidenWidth - LdWidth;
+
+ // Find the vector type that can load from.
+ std::optional<EVT> FirstVT =
+ findMemType(DAG, TLI, LdWidth.getKnownMinValue(), WidenVT, /*LdAlign=*/0,
+ WidthDiff.getKnownMinValue());
+
+ if (!FirstVT)
+ return SDValue();
+
+ SmallVector<EVT, 8> MemVTs;
+ TypeSize FirstVTWidth = FirstVT->getSizeInBits();
+
+ SDValue LdOp = DAG.getAtomicLoad(ISD::NON_EXTLOAD, dl, *FirstVT, *FirstVT,
+ Chain, BasePtr, LD->getMemOperand());
+
+ // Load the element with one instruction.
+ SDValue Result = coerceLoadedValue(LdOp, *FirstVT, WidenVT, LdWidth,
+ FirstVTWidth, dl, DAG);
+
+ // Modified the chain - switch anything that used the old chain to use
+ // the new one.
+ ReplaceValueWith(SDValue(LD, 1), LdOp.getValue(1));
+ return Result;
+}
+
SDValue DAGTypeLegalizer::WidenVecRes_LOAD(SDNode *N) {
LoadSDNode *LD = cast<LoadSDNode>(N);
ISD::LoadExtType ExtType = LD->getExtensionType();
@@ -7894,29 +7965,9 @@ SDValue DAGTypeLegalizer::GenWidenVectorLoads(SmallVectorImpl<SDValue> &LdChain,
LdChain.push_back(LdOp.getValue(1));
// Check if we can load the element with one instruction.
- if (MemVTs.empty()) {
- assert(TypeSize::isKnownLE(LdWidth, FirstVTWidth));
- if (!FirstVT->isVector()) {
- unsigned NumElts =
- WidenWidth.getFixedValue() / FirstVTWidth.getFixedValue();
- EVT NewVecVT = EVT::getVectorVT(*DAG.getContext(), *FirstVT, NumElts);
- SDValue VecOp = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, NewVecVT, LdOp);
- return DAG.getNode(ISD::BITCAST, dl, WidenVT, VecOp);
- }
- if (FirstVT == WidenVT)
- return LdOp;
-
- // TODO: We don't currently have any tests that exercise this code path.
- assert(WidenWidth.getFixedValue() % FirstVTWidth.getFixedValue() == 0);
- unsigned NumConcat =
- WidenWidth.getFixedValue() / FirstVTWidth.getFixedValue();
- SmallVector<SDValue, 16> ConcatOps(NumConcat);
- SDValue UndefVal = DAG.getUNDEF(*FirstVT);
- ConcatOps[0] = LdOp;
- for (unsigned i = 1; i != NumConcat; ++i)
- ConcatOps[i] = UndefVal;
- return DAG.getNode(ISD::CONCAT_VECTORS, dl, WidenVT, ConcatOps);
- }
+ if (MemVTs.empty())
+ return coerceLoadedValue(LdOp, *FirstVT, WidenVT, LdWidth, FirstVTWidth, dl,
+ DAG);
// Load vector by using multiple loads from largest vector to scalar.
SmallVector<SDValue, 16> LdOps;
diff --git a/llvm/test/CodeGen/X86/atomic-load-store.ll b/llvm/test/CodeGen/X86/atomic-load-store.ll
index 3e7b73a65fe07..ff5391f44bbe3 100644
--- a/llvm/test/CodeGen/X86/atomic-load-store.ll
+++ b/llvm/test/CodeGen/X86/atomic-load-store.ll
@@ -270,6 +270,212 @@ define <1 x i64> @atomic_vec1_i64_align(ptr %x) nounwind {
ret <1 x i64> %ret
}
+define <2 x i8> @atomic_vec2_i8(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec2_i8:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-O3-NEXT: movd %eax, %xmm0
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec2_i8:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-SSE-O3-NEXT: movd %eax, %xmm0
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec2_i8:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movzwl (%rdi), %eax
+; CHECK-AVX-O3-NEXT: vmovd %eax, %xmm0
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec2_i8:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movw (%rdi), %cx
+; CHECK-O0-NEXT: # implicit-def: $eax
+; CHECK-O0-NEXT: movw %cx, %ax
+; CHECK-O0-NEXT: movd %eax, %xmm0
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec2_i8:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movw (%rdi), %cx
+; CHECK-SSE-O0-NEXT: # implicit-def: $eax
+; CHECK-SSE-O0-NEXT: movw %cx, %ax
+; CHECK-SSE-O0-NEXT: movd %eax, %xmm0
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec2_i8:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movw (%rdi), %cx
+; CHECK-AVX-O0-NEXT: # implicit-def: $eax
+; CHECK-AVX-O0-NEXT: movw %cx, %ax
+; CHECK-AVX-O0-NEXT: vmovd %eax, %xmm0
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <2 x i8>, ptr %x acquire, align 4
+ ret <2 x i8> %ret
+}
+
+define <2 x i16> @atomic_vec2_i16(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec2_i16:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movl (%rdi), %eax
+; CHECK-O3-NEXT: movd %eax, %xmm0
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec2_i16:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movl (%rdi), %eax
+; CHECK-SSE-O3-NEXT: movd %eax, %xmm0
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec2_i16:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movl (%rdi), %eax
+; CHECK-AVX-O3-NEXT: vmovd %eax, %xmm0
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec2_i16:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movl (%rdi), %eax
+; CHECK-O0-NEXT: movd %eax, %xmm0
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec2_i16:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movl (%rdi), %eax
+; CHECK-SSE-O0-NEXT: movd %eax, %xmm0
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec2_i16:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movl (%rdi), %eax
+; CHECK-AVX-O0-NEXT: vmovd %eax, %xmm0
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <2 x i16>, ptr %x acquire, align 4
+ ret <2 x i16> %ret
+}
+
+define <2 x ptr addrspace(270)> @atomic_vec2_ptr270(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec2_ptr270:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movq (%rdi), %rax
+; CHECK-O3-NEXT: movq %rax, %xmm0
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec2_ptr270:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movq (%rdi), %rax
+; CHECK-SSE-O3-NEXT: movq %rax, %xmm0
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec2_ptr270:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movq (%rdi), %rax
+; CHECK-AVX-O3-NEXT: vmovq %rax, %xmm0
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec2_ptr270:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movq (%rdi), %rax
+; CHECK-O0-NEXT: movq %rax, %xmm0
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec2_ptr270:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movq (%rdi), %rax
+; CHECK-SSE-O0-NEXT: movq %rax, %xmm0
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec2_ptr270:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movq (%rdi), %rax
+; CHECK-AVX-O0-NEXT: vmovq %rax, %xmm0
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <2 x ptr addrspace(270)>, ptr %x acquire, align 8
+ ret <2 x ptr addrspace(270)> %ret
+}
+
+define <2 x i32> @atomic_vec2_i32_align(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec2_i32_align:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movq (%rdi), %rax
+; CHECK-O3-NEXT: movq %rax, %xmm0
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec2_i32_align:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movq (%rdi), %rax
+; CHECK-SSE-O3-NEXT: movq %rax, %xmm0
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec2_i32_align:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movq (%rdi), %rax
+; CHECK-AVX-O3-NEXT: vmovq %rax, %xmm0
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec2_i32_align:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movq (%rdi), %rax
+; CHECK-O0-NEXT: movq %rax, %xmm0
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec2_i32_align:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movq (%rdi), %rax
+; CHECK-SSE-O0-NEXT: movq %rax, %xmm0
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec2_i32_align:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movq (%rdi), %rax
+; CHECK-AVX-O0-NEXT: vmovq %rax, %xmm0
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <2 x i32>, ptr %x acquire, align 8
+ ret <2 x i32> %ret
+}
+
+define <2 x float> @atomic_vec2_float_align(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec2_float_align:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movq (%rdi), %rax
+; CHECK-O3-NEXT: movq %rax, %xmm0
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec2_float_align:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movq (%rdi), %rax
+; CHECK-SSE-O3-NEXT: movq %rax, %xmm0
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec2_float_align:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movq (%rdi), %rax
+; CHECK-AVX-O3-NEXT: vmovq %rax, %xmm0
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec2_float_align:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movq (%rdi), %rax
+; CHECK-O0-NEXT: movq %rax, %xmm0
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec2_float_align:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movq (%rdi), %rax
+; CHECK-SSE-O0-NEXT: movq %rax, %xmm0
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec2_float_align:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movq (%rdi), %rax
+; CHECK-AVX-O0-NEXT: vmovq %rax, %xmm0
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <2 x float>, ptr %x acquire, align 8
+ ret <2 x float> %ret
+}
+
define <1 x ptr> @atomic_vec1_ptr(ptr %x) nounwind {
; CHECK-O3-LABEL: atomic_vec1_ptr:
; CHECK-O3: # %bb.0:
@@ -691,6 +897,86 @@ define <2 x i32> @atomic_vec2_i32(ptr %x) nounwind {
ret <2 x i32> %ret
}
+define <4 x i8> @atomic_vec4_i8(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec4_i8:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movl (%rdi), %eax
+; CHECK-O3-NEXT: movd %eax, %xmm0
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec4_i8:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movl (%rdi), %eax
+; CHECK-SSE-O3-NEXT: movd %eax, %xmm0
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec4_i8:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movl (%rdi), %eax
+; CHECK-AVX-O3-NEXT: vmovd %eax, %xmm0
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec4_i8:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movl (%rdi), %eax
+; CHECK-O0-NEXT: movd %eax, %xmm0
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec4_i8:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movl (%rdi), %eax
+; CHECK-SSE-O0-NEXT: movd %eax, %xmm0
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec4_i8:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movl (%rdi), %eax
+; CHECK-AVX-O0-NEXT: vmovd %eax, %xmm0
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <4 x i8>, ptr %x acquire, align 4
+ ret <4 x i8> %ret
+}
+
+define <4 x i16> @atomic_vec4_i16(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec4_i16:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movq (%rdi), %rax
+; CHECK-O3-NEXT: movq %rax, %xmm0
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec4_i16:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movq (%rdi), %rax
+; CHECK-SSE-O3-NEXT: movq %rax, %xmm0
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec4_i16:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movq (%rdi), %rax
+; CHECK-AVX-O3-NEXT: vmovq %rax, %xmm0
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec4_i16:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movq (%rdi), %rax
+; CHECK-O0-NEXT: movq %rax, %xmm0
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec4_i16:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movq (%rdi), %rax
+; CHECK-SSE-O0-NEXT: movq %rax, %xmm0
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec4_i16:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movq (%rdi), %rax
+; CHECK-AVX-O0-NEXT: vmovq %rax, %xmm0
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <4 x i16>, ptr %x acquire, align 8
+ ret <4 x i16> %ret
+}
+
define <4 x float> @atomic_vec4_float(ptr %x) nounwind {
; CHECK-O3-LABEL: atomic_vec4_float:
; CHECK-O3: # %bb.0:
From b2b23bfbb221870a308501d5f5839b59b3fb2370 Mon Sep 17 00:00:00 2001
From: jofrn <jofernau at amd.com>
Date: Sun, 1 Jun 2025 16:23:05 -0400
Subject: [PATCH 6/8] [X86] Remove extra MOV after widening atomic load
This change adds patterns to optimize out the extra MOV
that is left after widening the atomic load.
commit-id:45989503
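As a minimal sketch of the case being targeted (it mirrors the atomic_vec2_i16
test updated below; the function name is invented for illustration):

  define <2 x i16> @example(ptr %p) {
    %v = load atomic <2 x i16>, ptr %p acquire, align 4
    ret <2 x i16> %v
  }

After widening, the load becomes a scalar atomic_load_32 feeding a
scalar_to_vector node; previously this was selected as a GPR load plus a
separate movd, and with the new patterns it folds into a single load
straight into the vector register (MOVDI2PDIrm).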
---
llvm/lib/Target/X86/X86InstrCompiler.td | 7 +
llvm/test/CodeGen/X86/atomic-load-store.ll | 192 +++------------------
2 files changed, 35 insertions(+), 164 deletions(-)
diff --git a/llvm/lib/Target/X86/X86InstrCompiler.td b/llvm/lib/Target/X86/X86InstrCompiler.td
index 927b2c8b22f05..26b76dd1ca83a 100644
--- a/llvm/lib/Target/X86/X86InstrCompiler.td
+++ b/llvm/lib/Target/X86/X86InstrCompiler.td
@@ -1204,6 +1204,13 @@ def : Pat<(i16 (atomic_load_nonext_16 addr:$src)), (MOV16rm addr:$src)>;
def : Pat<(i32 (atomic_load_nonext_32 addr:$src)), (MOV32rm addr:$src)>;
def : Pat<(i64 (atomic_load_nonext_64 addr:$src)), (MOV64rm addr:$src)>;
+def : Pat<(v4i32 (scalar_to_vector (i32 (zext (i16 (atomic_load_16 addr:$src)))))),
+ (MOVDI2PDIrm addr:$src)>; // load atomic <2 x i8>
+def : Pat<(v4i32 (scalar_to_vector (i32 (atomic_load_32 addr:$src)))),
+ (MOVDI2PDIrm addr:$src)>; // load atomic <2 x i16>
+def : Pat<(v2i64 (scalar_to_vector (i64 (atomic_load_64 addr:$src)))),
+ (MOV64toPQIrm addr:$src)>; // load atomic <2 x i32,float>
+
// Floating point loads/stores.
def : Pat<(atomic_store_32 (i32 (bitconvert (f32 FR32:$src))), addr:$dst),
(MOVSSmr addr:$dst, FR32:$src)>, Requires<[UseSSE1]>;
diff --git a/llvm/test/CodeGen/X86/atomic-load-store.ll b/llvm/test/CodeGen/X86/atomic-load-store.ll
index ff5391f44bbe3..4b818b6cfa57e 100644
--- a/llvm/test/CodeGen/X86/atomic-load-store.ll
+++ b/llvm/test/CodeGen/X86/atomic-load-store.ll
@@ -319,159 +319,60 @@ define <2 x i8> @atomic_vec2_i8(ptr %x) {
define <2 x i16> @atomic_vec2_i16(ptr %x) {
; CHECK-O3-LABEL: atomic_vec2_i16:
; CHECK-O3: # %bb.0:
-; CHECK-O3-NEXT: movl (%rdi), %eax
-; CHECK-O3-NEXT: movd %eax, %xmm0
+; CHECK-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
; CHECK-O3-NEXT: retq
;
; CHECK-SSE-O3-LABEL: atomic_vec2_i16:
; CHECK-SSE-O3: # %bb.0:
-; CHECK-SSE-O3-NEXT: movl (%rdi), %eax
-; CHECK-SSE-O3-NEXT: movd %eax, %xmm0
+; CHECK-SSE-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
; CHECK-SSE-O3-NEXT: retq
;
; CHECK-AVX-O3-LABEL: atomic_vec2_i16:
; CHECK-AVX-O3: # %bb.0:
-; CHECK-AVX-O3-NEXT: movl (%rdi), %eax
-; CHECK-AVX-O3-NEXT: vmovd %eax, %xmm0
+; CHECK-AVX-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
; CHECK-AVX-O3-NEXT: retq
;
; CHECK-O0-LABEL: atomic_vec2_i16:
; CHECK-O0: # %bb.0:
-; CHECK-O0-NEXT: movl (%rdi), %eax
-; CHECK-O0-NEXT: movd %eax, %xmm0
+; CHECK-O0-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; CHECK-O0-NEXT: retq
;
; CHECK-SSE-O0-LABEL: atomic_vec2_i16:
; CHECK-SSE-O0: # %bb.0:
-; CHECK-SSE-O0-NEXT: movl (%rdi), %eax
-; CHECK-SSE-O0-NEXT: movd %eax, %xmm0
+; CHECK-SSE-O0-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; CHECK-SSE-O0-NEXT: retq
;
; CHECK-AVX-O0-LABEL: atomic_vec2_i16:
; CHECK-AVX-O0: # %bb.0:
-; CHECK-AVX-O0-NEXT: movl (%rdi), %eax
-; CHECK-AVX-O0-NEXT: vmovd %eax, %xmm0
+; CHECK-AVX-O0-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; CHECK-AVX-O0-NEXT: retq
%ret = load atomic <2 x i16>, ptr %x acquire, align 4
ret <2 x i16> %ret
}
define <2 x ptr addrspace(270)> @atomic_vec2_ptr270(ptr %x) {
-; CHECK-O3-LABEL: atomic_vec2_ptr270:
-; CHECK-O3: # %bb.0:
-; CHECK-O3-NEXT: movq (%rdi), %rax
-; CHECK-O3-NEXT: movq %rax, %xmm0
-; CHECK-O3-NEXT: retq
-;
-; CHECK-SSE-O3-LABEL: atomic_vec2_ptr270:
-; CHECK-SSE-O3: # %bb.0:
-; CHECK-SSE-O3-NEXT: movq (%rdi), %rax
-; CHECK-SSE-O3-NEXT: movq %rax, %xmm0
-; CHECK-SSE-O3-NEXT: retq
-;
-; CHECK-AVX-O3-LABEL: atomic_vec2_ptr270:
-; CHECK-AVX-O3: # %bb.0:
-; CHECK-AVX-O3-NEXT: movq (%rdi), %rax
-; CHECK-AVX-O3-NEXT: vmovq %rax, %xmm0
-; CHECK-AVX-O3-NEXT: retq
-;
-; CHECK-O0-LABEL: atomic_vec2_ptr270:
-; CHECK-O0: # %bb.0:
-; CHECK-O0-NEXT: movq (%rdi), %rax
-; CHECK-O0-NEXT: movq %rax, %xmm0
-; CHECK-O0-NEXT: retq
-;
-; CHECK-SSE-O0-LABEL: atomic_vec2_ptr270:
-; CHECK-SSE-O0: # %bb.0:
-; CHECK-SSE-O0-NEXT: movq (%rdi), %rax
-; CHECK-SSE-O0-NEXT: movq %rax, %xmm0
-; CHECK-SSE-O0-NEXT: retq
-;
-; CHECK-AVX-O0-LABEL: atomic_vec2_ptr270:
-; CHECK-AVX-O0: # %bb.0:
-; CHECK-AVX-O0-NEXT: movq (%rdi), %rax
-; CHECK-AVX-O0-NEXT: vmovq %rax, %xmm0
-; CHECK-AVX-O0-NEXT: retq
+; CHECK-LABEL: atomic_vec2_ptr270:
+; CHECK: # %bb.0:
+; CHECK-NEXT: movq (%rdi), %xmm0
+; CHECK-NEXT: retq
%ret = load atomic <2 x ptr addrspace(270)>, ptr %x acquire, align 8
ret <2 x ptr addrspace(270)> %ret
}
define <2 x i32> @atomic_vec2_i32_align(ptr %x) {
-; CHECK-O3-LABEL: atomic_vec2_i32_align:
-; CHECK-O3: # %bb.0:
-; CHECK-O3-NEXT: movq (%rdi), %rax
-; CHECK-O3-NEXT: movq %rax, %xmm0
-; CHECK-O3-NEXT: retq
-;
-; CHECK-SSE-O3-LABEL: atomic_vec2_i32_align:
-; CHECK-SSE-O3: # %bb.0:
-; CHECK-SSE-O3-NEXT: movq (%rdi), %rax
-; CHECK-SSE-O3-NEXT: movq %rax, %xmm0
-; CHECK-SSE-O3-NEXT: retq
-;
-; CHECK-AVX-O3-LABEL: atomic_vec2_i32_align:
-; CHECK-AVX-O3: # %bb.0:
-; CHECK-AVX-O3-NEXT: movq (%rdi), %rax
-; CHECK-AVX-O3-NEXT: vmovq %rax, %xmm0
-; CHECK-AVX-O3-NEXT: retq
-;
-; CHECK-O0-LABEL: atomic_vec2_i32_align:
-; CHECK-O0: # %bb.0:
-; CHECK-O0-NEXT: movq (%rdi), %rax
-; CHECK-O0-NEXT: movq %rax, %xmm0
-; CHECK-O0-NEXT: retq
-;
-; CHECK-SSE-O0-LABEL: atomic_vec2_i32_align:
-; CHECK-SSE-O0: # %bb.0:
-; CHECK-SSE-O0-NEXT: movq (%rdi), %rax
-; CHECK-SSE-O0-NEXT: movq %rax, %xmm0
-; CHECK-SSE-O0-NEXT: retq
-;
-; CHECK-AVX-O0-LABEL: atomic_vec2_i32_align:
-; CHECK-AVX-O0: # %bb.0:
-; CHECK-AVX-O0-NEXT: movq (%rdi), %rax
-; CHECK-AVX-O0-NEXT: vmovq %rax, %xmm0
-; CHECK-AVX-O0-NEXT: retq
+; CHECK-LABEL: atomic_vec2_i32_align:
+; CHECK: # %bb.0:
+; CHECK-NEXT: movq (%rdi), %xmm0
+; CHECK-NEXT: retq
%ret = load atomic <2 x i32>, ptr %x acquire, align 8
ret <2 x i32> %ret
}
define <2 x float> @atomic_vec2_float_align(ptr %x) {
-; CHECK-O3-LABEL: atomic_vec2_float_align:
-; CHECK-O3: # %bb.0:
-; CHECK-O3-NEXT: movq (%rdi), %rax
-; CHECK-O3-NEXT: movq %rax, %xmm0
-; CHECK-O3-NEXT: retq
-;
-; CHECK-SSE-O3-LABEL: atomic_vec2_float_align:
-; CHECK-SSE-O3: # %bb.0:
-; CHECK-SSE-O3-NEXT: movq (%rdi), %rax
-; CHECK-SSE-O3-NEXT: movq %rax, %xmm0
-; CHECK-SSE-O3-NEXT: retq
-;
-; CHECK-AVX-O3-LABEL: atomic_vec2_float_align:
-; CHECK-AVX-O3: # %bb.0:
-; CHECK-AVX-O3-NEXT: movq (%rdi), %rax
-; CHECK-AVX-O3-NEXT: vmovq %rax, %xmm0
-; CHECK-AVX-O3-NEXT: retq
-;
-; CHECK-O0-LABEL: atomic_vec2_float_align:
-; CHECK-O0: # %bb.0:
-; CHECK-O0-NEXT: movq (%rdi), %rax
-; CHECK-O0-NEXT: movq %rax, %xmm0
-; CHECK-O0-NEXT: retq
-;
-; CHECK-SSE-O0-LABEL: atomic_vec2_float_align:
-; CHECK-SSE-O0: # %bb.0:
-; CHECK-SSE-O0-NEXT: movq (%rdi), %rax
-; CHECK-SSE-O0-NEXT: movq %rax, %xmm0
-; CHECK-SSE-O0-NEXT: retq
-;
-; CHECK-AVX-O0-LABEL: atomic_vec2_float_align:
-; CHECK-AVX-O0: # %bb.0:
-; CHECK-AVX-O0-NEXT: movq (%rdi), %rax
-; CHECK-AVX-O0-NEXT: vmovq %rax, %xmm0
-; CHECK-AVX-O0-NEXT: retq
+; CHECK-LABEL: atomic_vec2_float_align:
+; CHECK: # %bb.0:
+; CHECK-NEXT: movq (%rdi), %xmm0
+; CHECK-NEXT: retq
%ret = load atomic <2 x float>, ptr %x acquire, align 8
ret <2 x float> %ret
}
@@ -900,79 +801,42 @@ define <2 x i32> @atomic_vec2_i32(ptr %x) nounwind {
define <4 x i8> @atomic_vec4_i8(ptr %x) nounwind {
; CHECK-O3-LABEL: atomic_vec4_i8:
; CHECK-O3: # %bb.0:
-; CHECK-O3-NEXT: movl (%rdi), %eax
-; CHECK-O3-NEXT: movd %eax, %xmm0
+; CHECK-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
; CHECK-O3-NEXT: retq
;
; CHECK-SSE-O3-LABEL: atomic_vec4_i8:
; CHECK-SSE-O3: # %bb.0:
-; CHECK-SSE-O3-NEXT: movl (%rdi), %eax
-; CHECK-SSE-O3-NEXT: movd %eax, %xmm0
+; CHECK-SSE-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
; CHECK-SSE-O3-NEXT: retq
;
; CHECK-AVX-O3-LABEL: atomic_vec4_i8:
; CHECK-AVX-O3: # %bb.0:
-; CHECK-AVX-O3-NEXT: movl (%rdi), %eax
-; CHECK-AVX-O3-NEXT: vmovd %eax, %xmm0
+; CHECK-AVX-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
; CHECK-AVX-O3-NEXT: retq
;
; CHECK-O0-LABEL: atomic_vec4_i8:
; CHECK-O0: # %bb.0:
-; CHECK-O0-NEXT: movl (%rdi), %eax
-; CHECK-O0-NEXT: movd %eax, %xmm0
+; CHECK-O0-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; CHECK-O0-NEXT: retq
;
; CHECK-SSE-O0-LABEL: atomic_vec4_i8:
; CHECK-SSE-O0: # %bb.0:
-; CHECK-SSE-O0-NEXT: movl (%rdi), %eax
-; CHECK-SSE-O0-NEXT: movd %eax, %xmm0
+; CHECK-SSE-O0-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; CHECK-SSE-O0-NEXT: retq
;
; CHECK-AVX-O0-LABEL: atomic_vec4_i8:
; CHECK-AVX-O0: # %bb.0:
-; CHECK-AVX-O0-NEXT: movl (%rdi), %eax
-; CHECK-AVX-O0-NEXT: vmovd %eax, %xmm0
+; CHECK-AVX-O0-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
; CHECK-AVX-O0-NEXT: retq
%ret = load atomic <4 x i8>, ptr %x acquire, align 4
ret <4 x i8> %ret
}
define <4 x i16> @atomic_vec4_i16(ptr %x) nounwind {
-; CHECK-O3-LABEL: atomic_vec4_i16:
-; CHECK-O3: # %bb.0:
-; CHECK-O3-NEXT: movq (%rdi), %rax
-; CHECK-O3-NEXT: movq %rax, %xmm0
-; CHECK-O3-NEXT: retq
-;
-; CHECK-SSE-O3-LABEL: atomic_vec4_i16:
-; CHECK-SSE-O3: # %bb.0:
-; CHECK-SSE-O3-NEXT: movq (%rdi), %rax
-; CHECK-SSE-O3-NEXT: movq %rax, %xmm0
-; CHECK-SSE-O3-NEXT: retq
-;
-; CHECK-AVX-O3-LABEL: atomic_vec4_i16:
-; CHECK-AVX-O3: # %bb.0:
-; CHECK-AVX-O3-NEXT: movq (%rdi), %rax
-; CHECK-AVX-O3-NEXT: vmovq %rax, %xmm0
-; CHECK-AVX-O3-NEXT: retq
-;
-; CHECK-O0-LABEL: atomic_vec4_i16:
-; CHECK-O0: # %bb.0:
-; CHECK-O0-NEXT: movq (%rdi), %rax
-; CHECK-O0-NEXT: movq %rax, %xmm0
-; CHECK-O0-NEXT: retq
-;
-; CHECK-SSE-O0-LABEL: atomic_vec4_i16:
-; CHECK-SSE-O0: # %bb.0:
-; CHECK-SSE-O0-NEXT: movq (%rdi), %rax
-; CHECK-SSE-O0-NEXT: movq %rax, %xmm0
-; CHECK-SSE-O0-NEXT: retq
-;
-; CHECK-AVX-O0-LABEL: atomic_vec4_i16:
-; CHECK-AVX-O0: # %bb.0:
-; CHECK-AVX-O0-NEXT: movq (%rdi), %rax
-; CHECK-AVX-O0-NEXT: vmovq %rax, %xmm0
-; CHECK-AVX-O0-NEXT: retq
+; CHECK-LABEL: atomic_vec4_i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: movq (%rdi), %xmm0
+; CHECK-NEXT: retq
%ret = load atomic <4 x i16>, ptr %x acquire, align 8
ret <4 x i16> %ret
}
From 6eb10b5af424ebb18d4e30a4bbfaf8f304e68f60 Mon Sep 17 00:00:00 2001
From: jofrn <jofernau at amd.com>
Date: Sun, 1 Jun 2025 16:23:26 -0400
Subject: [PATCH 7/8] [X86] Cast atomic vectors in IR to support floats
This commit casts floats to ints in an atomic load during AtomicExpand to support
floating-point types. It is also required to support 128-bit vectors in SSE/AVX.
commit-id:80b9b6a7
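A minimal sketch of the intended effect in IR (value names and alignment are
illustrative; it mirrors the <2 x half> cases covered by the tests in this series):

  ; before AtomicExpand
  %v = load atomic <2 x half>, ptr %p acquire, align 4

  ; after AtomicExpand, with shouldCastAtomicLoadInIR returning CastToInteger
  %i = load atomic i32, ptr %p acquire, align 4
  %v = bitcast i32 %i to <2 x half>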
---
llvm/lib/Target/X86/X86ISelLowering.cpp | 7 +
llvm/lib/Target/X86/X86ISelLowering.h | 2 +
llvm/test/CodeGen/X86/atomic-load-store.ll | 181 +++++++++++++++++++--
3 files changed, 172 insertions(+), 18 deletions(-)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 776d3c0a42e2f..3debf30da0a29 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -32070,6 +32070,13 @@ X86TargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *AI) const {
}
}
+TargetLowering::AtomicExpansionKind
+X86TargetLowering::shouldCastAtomicLoadInIR(LoadInst *LI) const {
+ if (LI->getType()->getScalarType()->isFloatingPointTy())
+ return AtomicExpansionKind::CastToInteger;
+ return AtomicExpansionKind::None;
+}
+
LoadInst *
X86TargetLowering::lowerIdempotentRMWIntoFencedLoad(AtomicRMWInst *AI) const {
unsigned NativeWidth = Subtarget.is64Bit() ? 64 : 32;
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 5cb6b3e493a32..43cddb2b53bd6 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1839,6 +1839,8 @@ namespace llvm {
shouldExpandAtomicRMWInIR(AtomicRMWInst *AI) const override;
TargetLoweringBase::AtomicExpansionKind
shouldExpandLogicAtomicRMWInIR(AtomicRMWInst *AI) const;
+ TargetLoweringBase::AtomicExpansionKind
+ shouldCastAtomicLoadInIR(LoadInst *LI) const override;
void emitBitTestAtomicRMWIntrinsic(AtomicRMWInst *AI) const override;
void emitCmpArithAtomicRMWIntrinsic(AtomicRMWInst *AI) const override;
diff --git a/llvm/test/CodeGen/X86/atomic-load-store.ll b/llvm/test/CodeGen/X86/atomic-load-store.ll
index 4b818b6cfa57e..039edcbf83544 100644
--- a/llvm/test/CodeGen/X86/atomic-load-store.ll
+++ b/llvm/test/CodeGen/X86/atomic-load-store.ll
@@ -207,19 +207,19 @@ define <1 x bfloat> @atomic_vec1_bfloat(ptr %x) {
; CHECK-O3-LABEL: atomic_vec1_bfloat:
; CHECK-O3: # %bb.0:
; CHECK-O3-NEXT: movzwl (%rdi), %eax
-; CHECK-O3-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-O3-NEXT: movd %eax, %xmm0
; CHECK-O3-NEXT: retq
;
; CHECK-SSE-O3-LABEL: atomic_vec1_bfloat:
; CHECK-SSE-O3: # %bb.0:
; CHECK-SSE-O3-NEXT: movzwl (%rdi), %eax
-; CHECK-SSE-O3-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-SSE-O3-NEXT: movd %eax, %xmm0
; CHECK-SSE-O3-NEXT: retq
;
; CHECK-AVX-O3-LABEL: atomic_vec1_bfloat:
; CHECK-AVX-O3: # %bb.0:
; CHECK-AVX-O3-NEXT: movzwl (%rdi), %eax
-; CHECK-AVX-O3-NEXT: vpinsrw $0, %eax, %xmm0, %xmm0
+; CHECK-AVX-O3-NEXT: vmovd %eax, %xmm0
; CHECK-AVX-O3-NEXT: retq
;
; CHECK-O0-LABEL: atomic_vec1_bfloat:
@@ -227,8 +227,7 @@ define <1 x bfloat> @atomic_vec1_bfloat(ptr %x) {
; CHECK-O0-NEXT: movw (%rdi), %cx
; CHECK-O0-NEXT: # implicit-def: $eax
; CHECK-O0-NEXT: movw %cx, %ax
-; CHECK-O0-NEXT: # implicit-def: $xmm0
-; CHECK-O0-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-O0-NEXT: movd %eax, %xmm0
; CHECK-O0-NEXT: retq
;
; CHECK-SSE-O0-LABEL: atomic_vec1_bfloat:
@@ -236,8 +235,7 @@ define <1 x bfloat> @atomic_vec1_bfloat(ptr %x) {
; CHECK-SSE-O0-NEXT: movw (%rdi), %cx
; CHECK-SSE-O0-NEXT: # implicit-def: $eax
; CHECK-SSE-O0-NEXT: movw %cx, %ax
-; CHECK-SSE-O0-NEXT: # implicit-def: $xmm0
-; CHECK-SSE-O0-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-SSE-O0-NEXT: movd %eax, %xmm0
; CHECK-SSE-O0-NEXT: retq
;
; CHECK-AVX-O0-LABEL: atomic_vec1_bfloat:
@@ -245,8 +243,7 @@ define <1 x bfloat> @atomic_vec1_bfloat(ptr %x) {
; CHECK-AVX-O0-NEXT: movw (%rdi), %cx
; CHECK-AVX-O0-NEXT: # implicit-def: $eax
; CHECK-AVX-O0-NEXT: movw %cx, %ax
-; CHECK-AVX-O0-NEXT: # implicit-def: $xmm0
-; CHECK-AVX-O0-NEXT: vpinsrw $0, %eax, %xmm0, %xmm0
+; CHECK-AVX-O0-NEXT: vmovd %eax, %xmm0
; CHECK-AVX-O0-NEXT: retq
%ret = load atomic <1 x bfloat>, ptr %x acquire, align 2
ret <1 x bfloat> %ret
@@ -377,6 +374,74 @@ define <2 x float> @atomic_vec2_float_align(ptr %x) {
ret <2 x float> %ret
}
+define <2 x half> @atomic_vec2_half(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec2_half:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec2_half:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec2_half:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec2_half:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec2_half:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec2_half:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <2 x half>, ptr %x acquire, align 4
+ ret <2 x half> %ret
+}
+
+define <2 x bfloat> @atomic_vec2_bfloat(ptr %x) {
+; CHECK-O3-LABEL: atomic_vec2_bfloat:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec2_bfloat:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec2_bfloat:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec2_bfloat:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec2_bfloat:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec2_bfloat:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: movd {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <2 x bfloat>, ptr %x acquire, align 4
+ ret <2 x bfloat> %ret
+}
+
define <1 x ptr> @atomic_vec1_ptr(ptr %x) nounwind {
; CHECK-O3-LABEL: atomic_vec1_ptr:
; CHECK-O3: # %bb.0:
@@ -457,19 +522,19 @@ define <1 x half> @atomic_vec1_half(ptr %x) {
; CHECK-O3-LABEL: atomic_vec1_half:
; CHECK-O3: # %bb.0:
; CHECK-O3-NEXT: movzwl (%rdi), %eax
-; CHECK-O3-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-O3-NEXT: movd %eax, %xmm0
; CHECK-O3-NEXT: retq
;
; CHECK-SSE-O3-LABEL: atomic_vec1_half:
; CHECK-SSE-O3: # %bb.0:
; CHECK-SSE-O3-NEXT: movzwl (%rdi), %eax
-; CHECK-SSE-O3-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-SSE-O3-NEXT: movd %eax, %xmm0
; CHECK-SSE-O3-NEXT: retq
;
; CHECK-AVX-O3-LABEL: atomic_vec1_half:
; CHECK-AVX-O3: # %bb.0:
; CHECK-AVX-O3-NEXT: movzwl (%rdi), %eax
-; CHECK-AVX-O3-NEXT: vpinsrw $0, %eax, %xmm0, %xmm0
+; CHECK-AVX-O3-NEXT: vmovd %eax, %xmm0
; CHECK-AVX-O3-NEXT: retq
;
; CHECK-O0-LABEL: atomic_vec1_half:
@@ -477,8 +542,7 @@ define <1 x half> @atomic_vec1_half(ptr %x) {
; CHECK-O0-NEXT: movw (%rdi), %cx
; CHECK-O0-NEXT: # implicit-def: $eax
; CHECK-O0-NEXT: movw %cx, %ax
-; CHECK-O0-NEXT: # implicit-def: $xmm0
-; CHECK-O0-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-O0-NEXT: movd %eax, %xmm0
; CHECK-O0-NEXT: retq
;
; CHECK-SSE-O0-LABEL: atomic_vec1_half:
@@ -486,8 +550,7 @@ define <1 x half> @atomic_vec1_half(ptr %x) {
; CHECK-SSE-O0-NEXT: movw (%rdi), %cx
; CHECK-SSE-O0-NEXT: # implicit-def: $eax
; CHECK-SSE-O0-NEXT: movw %cx, %ax
-; CHECK-SSE-O0-NEXT: # implicit-def: $xmm0
-; CHECK-SSE-O0-NEXT: pinsrw $0, %eax, %xmm0
+; CHECK-SSE-O0-NEXT: movd %eax, %xmm0
; CHECK-SSE-O0-NEXT: retq
;
; CHECK-AVX-O0-LABEL: atomic_vec1_half:
@@ -495,8 +558,7 @@ define <1 x half> @atomic_vec1_half(ptr %x) {
; CHECK-AVX-O0-NEXT: movw (%rdi), %cx
; CHECK-AVX-O0-NEXT: # implicit-def: $eax
; CHECK-AVX-O0-NEXT: movw %cx, %ax
-; CHECK-AVX-O0-NEXT: # implicit-def: $xmm0
-; CHECK-AVX-O0-NEXT: vpinsrw $0, %eax, %xmm0, %xmm0
+; CHECK-AVX-O0-NEXT: vmovd %eax, %xmm0
; CHECK-AVX-O0-NEXT: retq
%ret = load atomic <1 x half>, ptr %x acquire, align 2
ret <1 x half> %ret
@@ -841,6 +903,89 @@ define <4 x i16> @atomic_vec4_i16(ptr %x) nounwind {
ret <4 x i16> %ret
}
+define <4 x half> @atomic_vec4_half(ptr %x) nounwind {
+; CHECK-LABEL: atomic_vec4_half:
+; CHECK: # %bb.0:
+; CHECK-NEXT: movq (%rdi), %xmm0
+; CHECK-NEXT: retq
+ %ret = load atomic <4 x half>, ptr %x acquire, align 8
+ ret <4 x half> %ret
+}
+
+define <4 x bfloat> @atomic_vec4_bfloat(ptr %x) nounwind {
+; CHECK-LABEL: atomic_vec4_bfloat:
+; CHECK: # %bb.0:
+; CHECK-NEXT: movq (%rdi), %xmm0
+; CHECK-NEXT: retq
+ %ret = load atomic <4 x bfloat>, ptr %x acquire, align 8
+ ret <4 x bfloat> %ret
+}
+
+define <4 x float> @atomic_vec4_float_align(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec4_float_align:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: pushq %rax
+; CHECK-O3-NEXT: movl $2, %esi
+; CHECK-O3-NEXT: callq __atomic_load_16@PLT
+; CHECK-O3-NEXT: movq %rdx, %xmm1
+; CHECK-O3-NEXT: movq %rax, %xmm0
+; CHECK-O3-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; CHECK-O3-NEXT: popq %rax
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec4_float_align:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: pushq %rbx
+; CHECK-SSE-O3-NEXT: xorl %eax, %eax
+; CHECK-SSE-O3-NEXT: xorl %edx, %edx
+; CHECK-SSE-O3-NEXT: xorl %ecx, %ecx
+; CHECK-SSE-O3-NEXT: xorl %ebx, %ebx
+; CHECK-SSE-O3-NEXT: lock cmpxchg16b (%rdi)
+; CHECK-SSE-O3-NEXT: movq %rdx, %xmm1
+; CHECK-SSE-O3-NEXT: movq %rax, %xmm0
+; CHECK-SSE-O3-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; CHECK-SSE-O3-NEXT: popq %rbx
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec4_float_align:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: vmovaps (%rdi), %xmm0
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec4_float_align:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: pushq %rax
+; CHECK-O0-NEXT: movl $2, %esi
+; CHECK-O0-NEXT: callq __atomic_load_16@PLT
+; CHECK-O0-NEXT: movq %rdx, %xmm1
+; CHECK-O0-NEXT: movq %rax, %xmm0
+; CHECK-O0-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; CHECK-O0-NEXT: popq %rax
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec4_float_align:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: pushq %rbx
+; CHECK-SSE-O0-NEXT: xorl %eax, %eax
+; CHECK-SSE-O0-NEXT: movl %eax, %ebx
+; CHECK-SSE-O0-NEXT: movq %rbx, %rax
+; CHECK-SSE-O0-NEXT: movq %rbx, %rdx
+; CHECK-SSE-O0-NEXT: movq %rbx, %rcx
+; CHECK-SSE-O0-NEXT: lock cmpxchg16b (%rdi)
+; CHECK-SSE-O0-NEXT: movq %rdx, %xmm1
+; CHECK-SSE-O0-NEXT: movq %rax, %xmm0
+; CHECK-SSE-O0-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; CHECK-SSE-O0-NEXT: popq %rbx
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec4_float_align:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: vmovaps (%rdi), %xmm0
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <4 x float>, ptr %x acquire, align 16
+ ret <4 x float> %ret
+}
+
define <4 x float> @atomic_vec4_float(ptr %x) nounwind {
; CHECK-O3-LABEL: atomic_vec4_float:
; CHECK-O3: # %bb.0:
From 70a2b52ffeae81869323a0874c51a7947ec2fe71 Mon Sep 17 00:00:00 2001
From: jofrn <jofernau at amd.com>
Date: Fri, 20 Dec 2024 06:14:28 -0500
Subject: [PATCH 8/8] [AtomicExpand] Add bitcasts when expanding load atomic
vector
AtomicExpand fails for an aligned `load atomic <n x T>` because it
does not find a compatible library call. This change adds the appropriate
bitcasts so that the call can be lowered. It also adds 128-bit lowering
support in TableGen for SSE/AVX.
commit-id:f430c1af
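For example, an aligned `load atomic <2 x ptr>` now expands to the sized
libcall plus the added casts (a sketch mirroring atomic_vec2_ptr_align in the
updated expand-atomic-non-integer.ll test):

  %r  = call i128 @__atomic_load_16(ptr %x, i32 2)
  %iv = bitcast i128 %r to <2 x i64>
  %pv = inttoptr <2 x i64> %iv to <2 x ptr>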
---
.../include/llvm/Target/TargetSelectionDAG.td | 14 +
llvm/lib/CodeGen/AtomicExpandPass.cpp | 15 +-
llvm/lib/Target/X86/X86InstrCompiler.td | 5 +
llvm/test/CodeGen/ARM/atomic-load-store.ll | 51 ++++
llvm/test/CodeGen/X86/atomic-load-store.ll | 93 +++++++
.../X86/expand-atomic-non-integer.ll | 263 ++++++++++++------
6 files changed, 359 insertions(+), 82 deletions(-)
diff --git a/llvm/include/llvm/Target/TargetSelectionDAG.td b/llvm/include/llvm/Target/TargetSelectionDAG.td
index d56216e8b637d..43da39bdc44cd 100644
--- a/llvm/include/llvm/Target/TargetSelectionDAG.td
+++ b/llvm/include/llvm/Target/TargetSelectionDAG.td
@@ -1904,6 +1904,20 @@ def atomic_load_64 :
let MemoryVT = i64;
}
+def atomic_load_128_v2i64 :
+ PatFrag<(ops node:$ptr),
+ (atomic_load node:$ptr)> {
+ let IsAtomic = true;
+ let MemoryVT = v2i64;
+}
+
+def atomic_load_128_v4i32 :
+ PatFrag<(ops node:$ptr),
+ (atomic_load node:$ptr)> {
+ let IsAtomic = true;
+ let MemoryVT = v4i32;
+}
+
def atomic_load_nonext_8 :
PatFrag<(ops node:$ptr), (atomic_load_nonext node:$ptr)> {
let IsAtomic = true; // FIXME: Should be IsLoad and/or IsAtomic?
diff --git a/llvm/lib/CodeGen/AtomicExpandPass.cpp b/llvm/lib/CodeGen/AtomicExpandPass.cpp
index c376de877ac7d..70f59eafc6ecb 100644
--- a/llvm/lib/CodeGen/AtomicExpandPass.cpp
+++ b/llvm/lib/CodeGen/AtomicExpandPass.cpp
@@ -2066,9 +2066,18 @@ bool AtomicExpandImpl::expandAtomicOpToLibcall(
I->replaceAllUsesWith(V);
} else if (HasResult) {
Value *V;
- if (UseSizedLibcall)
- V = Builder.CreateBitOrPointerCast(Result, I->getType());
- else {
+ if (UseSizedLibcall) {
+ // Add bitcasts from Result's scalar type to I's <n x ptr> vector type
+ auto *PtrTy = dyn_cast<PointerType>(I->getType()->getScalarType());
+ auto *VTy = dyn_cast<VectorType>(I->getType());
+ if (VTy && PtrTy && !Result->getType()->isVectorTy()) {
+ unsigned AS = PtrTy->getAddressSpace();
+ Value *BC = Builder.CreateBitCast(
+ Result, VTy->getWithNewType(DL.getIntPtrType(Ctx, AS)));
+ V = Builder.CreateIntToPtr(BC, I->getType());
+ } else
+ V = Builder.CreateBitOrPointerCast(Result, I->getType());
+ } else {
V = Builder.CreateAlignedLoad(I->getType(), AllocaResult,
AllocaAlignment);
Builder.CreateLifetimeEnd(AllocaResult, SizeVal64);
diff --git a/llvm/lib/Target/X86/X86InstrCompiler.td b/llvm/lib/Target/X86/X86InstrCompiler.td
index 26b76dd1ca83a..3143015b7ec66 100644
--- a/llvm/lib/Target/X86/X86InstrCompiler.td
+++ b/llvm/lib/Target/X86/X86InstrCompiler.td
@@ -1211,6 +1211,11 @@ def : Pat<(v4i32 (scalar_to_vector (i32 (atomic_load_32 addr:$src)))),
def : Pat<(v2i64 (scalar_to_vector (i64 (atomic_load_64 addr:$src)))),
(MOV64toPQIrm addr:$src)>; // load atomic <2 x i32,float>
+def : Pat<(v2i64 (atomic_load_128_v2i64 addr:$src)),
+ (VMOVAPDrm addr:$src)>; // load atomic <2 x i64>
+def : Pat<(v4i32 (atomic_load_128_v4i32 addr:$src)),
+ (VMOVAPDrm addr:$src)>; // load atomic <4 x i32>
+
// Floating point loads/stores.
def : Pat<(atomic_store_32 (i32 (bitconvert (f32 FR32:$src))), addr:$dst),
(MOVSSmr addr:$dst, FR32:$src)>, Requires<[UseSSE1]>;
diff --git a/llvm/test/CodeGen/ARM/atomic-load-store.ll b/llvm/test/CodeGen/ARM/atomic-load-store.ll
index 560dfde356c29..eaa2ffd9b2731 100644
--- a/llvm/test/CodeGen/ARM/atomic-load-store.ll
+++ b/llvm/test/CodeGen/ARM/atomic-load-store.ll
@@ -983,3 +983,54 @@ define void @store_atomic_f64__seq_cst(ptr %ptr, double %val1) {
store atomic double %val1, ptr %ptr seq_cst, align 8
ret void
}
+
+define <1 x ptr> @atomic_vec1_ptr(ptr %x) #0 {
+; ARM-LABEL: atomic_vec1_ptr:
+; ARM: @ %bb.0:
+; ARM-NEXT: ldr r0, [r0]
+; ARM-NEXT: dmb ish
+; ARM-NEXT: bx lr
+;
+; ARMOPTNONE-LABEL: atomic_vec1_ptr:
+; ARMOPTNONE: @ %bb.0:
+; ARMOPTNONE-NEXT: ldr r0, [r0]
+; ARMOPTNONE-NEXT: dmb ish
+; ARMOPTNONE-NEXT: bx lr
+;
+; THUMBTWO-LABEL: atomic_vec1_ptr:
+; THUMBTWO: @ %bb.0:
+; THUMBTWO-NEXT: ldr r0, [r0]
+; THUMBTWO-NEXT: dmb ish
+; THUMBTWO-NEXT: bx lr
+;
+; THUMBONE-LABEL: atomic_vec1_ptr:
+; THUMBONE: @ %bb.0:
+; THUMBONE-NEXT: push {r7, lr}
+; THUMBONE-NEXT: movs r1, #0
+; THUMBONE-NEXT: mov r2, r1
+; THUMBONE-NEXT: bl __sync_val_compare_and_swap_4
+; THUMBONE-NEXT: pop {r7, pc}
+;
+; ARMV4-LABEL: atomic_vec1_ptr:
+; ARMV4: @ %bb.0:
+; ARMV4-NEXT: push {r11, lr}
+; ARMV4-NEXT: mov r1, #2
+; ARMV4-NEXT: bl __atomic_load_4
+; ARMV4-NEXT: pop {r11, lr}
+; ARMV4-NEXT: mov pc, lr
+;
+; ARMV6-LABEL: atomic_vec1_ptr:
+; ARMV6: @ %bb.0:
+; ARMV6-NEXT: ldr r0, [r0]
+; ARMV6-NEXT: mov r1, #0
+; ARMV6-NEXT: mcr p15, #0, r1, c7, c10, #5
+; ARMV6-NEXT: bx lr
+;
+; THUMBM-LABEL: atomic_vec1_ptr:
+; THUMBM: @ %bb.0:
+; THUMBM-NEXT: ldr r0, [r0]
+; THUMBM-NEXT: dmb sy
+; THUMBM-NEXT: bx lr
+ %ret = load atomic <1 x ptr>, ptr %x acquire, align 4
+ ret <1 x ptr> %ret
+}
diff --git a/llvm/test/CodeGen/X86/atomic-load-store.ll b/llvm/test/CodeGen/X86/atomic-load-store.ll
index 039edcbf83544..9665d4173e23d 100644
--- a/llvm/test/CodeGen/X86/atomic-load-store.ll
+++ b/llvm/test/CodeGen/X86/atomic-load-store.ll
@@ -860,6 +860,53 @@ define <2 x i32> @atomic_vec2_i32(ptr %x) nounwind {
ret <2 x i32> %ret
}
+; Move td records to AtomicExpand
+define <2 x ptr> @atomic_vec2_ptr_align(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec2_ptr_align:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: pushq %rax
+; CHECK-O3-NEXT: movl $2, %esi
+; CHECK-O3-NEXT: callq __atomic_load_16@PLT
+; CHECK-O3-NEXT: movq %rdx, %xmm1
+; CHECK-O3-NEXT: movq %rax, %xmm0
+; CHECK-O3-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; CHECK-O3-NEXT: popq %rax
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec2_ptr_align:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: vmovaps (%rdi), %xmm0
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec2_ptr_align:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: vmovaps (%rdi), %xmm0
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec2_ptr_align:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: pushq %rax
+; CHECK-O0-NEXT: movl $2, %esi
+; CHECK-O0-NEXT: callq __atomic_load_16@PLT
+; CHECK-O0-NEXT: movq %rdx, %xmm1
+; CHECK-O0-NEXT: movq %rax, %xmm0
+; CHECK-O0-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; CHECK-O0-NEXT: popq %rax
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec2_ptr_align:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: vmovapd (%rdi), %xmm0
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec2_ptr_align:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: vmovapd (%rdi), %xmm0
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <2 x ptr>, ptr %x acquire, align 16
+ ret <2 x ptr> %ret
+}
+
define <4 x i8> @atomic_vec4_i8(ptr %x) nounwind {
; CHECK-O3-LABEL: atomic_vec4_i8:
; CHECK-O3: # %bb.0:
@@ -903,6 +950,52 @@ define <4 x i16> @atomic_vec4_i16(ptr %x) nounwind {
ret <4 x i16> %ret
}
+define <4 x ptr addrspace(270)> @atomic_vec4_ptr270(ptr %x) nounwind {
+; CHECK-O3-LABEL: atomic_vec4_ptr270:
+; CHECK-O3: # %bb.0:
+; CHECK-O3-NEXT: pushq %rax
+; CHECK-O3-NEXT: movl $2, %esi
+; CHECK-O3-NEXT: callq __atomic_load_16@PLT
+; CHECK-O3-NEXT: movq %rdx, %xmm1
+; CHECK-O3-NEXT: movq %rax, %xmm0
+; CHECK-O3-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; CHECK-O3-NEXT: popq %rax
+; CHECK-O3-NEXT: retq
+;
+; CHECK-SSE-O3-LABEL: atomic_vec4_ptr270:
+; CHECK-SSE-O3: # %bb.0:
+; CHECK-SSE-O3-NEXT: vmovaps (%rdi), %xmm0
+; CHECK-SSE-O3-NEXT: retq
+;
+; CHECK-AVX-O3-LABEL: atomic_vec4_ptr270:
+; CHECK-AVX-O3: # %bb.0:
+; CHECK-AVX-O3-NEXT: vmovaps (%rdi), %xmm0
+; CHECK-AVX-O3-NEXT: retq
+;
+; CHECK-O0-LABEL: atomic_vec4_ptr270:
+; CHECK-O0: # %bb.0:
+; CHECK-O0-NEXT: pushq %rax
+; CHECK-O0-NEXT: movl $2, %esi
+; CHECK-O0-NEXT: callq __atomic_load_16@PLT
+; CHECK-O0-NEXT: movq %rdx, %xmm1
+; CHECK-O0-NEXT: movq %rax, %xmm0
+; CHECK-O0-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
+; CHECK-O0-NEXT: popq %rax
+; CHECK-O0-NEXT: retq
+;
+; CHECK-SSE-O0-LABEL: atomic_vec4_ptr270:
+; CHECK-SSE-O0: # %bb.0:
+; CHECK-SSE-O0-NEXT: vmovapd (%rdi), %xmm0
+; CHECK-SSE-O0-NEXT: retq
+;
+; CHECK-AVX-O0-LABEL: atomic_vec4_ptr270:
+; CHECK-AVX-O0: # %bb.0:
+; CHECK-AVX-O0-NEXT: vmovapd (%rdi), %xmm0
+; CHECK-AVX-O0-NEXT: retq
+ %ret = load atomic <4 x ptr addrspace(270)>, ptr %x acquire, align 16
+ ret <4 x ptr addrspace(270)> %ret
+}
+
define <4 x half> @atomic_vec4_half(ptr %x) nounwind {
; CHECK-LABEL: atomic_vec4_half:
; CHECK: # %bb.0:
diff --git a/llvm/test/Transforms/AtomicExpand/X86/expand-atomic-non-integer.ll b/llvm/test/Transforms/AtomicExpand/X86/expand-atomic-non-integer.ll
index 5929c153d5961..f5c8baa3e931e 100644
--- a/llvm/test/Transforms/AtomicExpand/X86/expand-atomic-non-integer.ll
+++ b/llvm/test/Transforms/AtomicExpand/X86/expand-atomic-non-integer.ll
@@ -1,153 +1,258 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt -S %s -passes=atomic-expand -mtriple=x86_64-linux-gnu | FileCheck %s
; This file tests the functions `llvm::convertAtomicLoadToIntegerType` and
-; `llvm::convertAtomicStoreToIntegerType`. If X86 stops using this
+; `llvm::convertAtomicStoreToIntegerType`. If X86 stops using this
; functionality, please move this test to a target which still is.
define float @float_load_expand(ptr %ptr) {
-; CHECK-LABEL: @float_load_expand
-; CHECK: %1 = load atomic i32, ptr %ptr unordered, align 4
-; CHECK: %2 = bitcast i32 %1 to float
-; CHECK: ret float %2
+; CHECK-LABEL: define float @float_load_expand(
+; CHECK-SAME: ptr [[PTR:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = load atomic i32, ptr [[PTR]] unordered, align 4
+; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32 [[TMP1]] to float
+; CHECK-NEXT: ret float [[TMP2]]
+;
%res = load atomic float, ptr %ptr unordered, align 4
ret float %res
}
define float @float_load_expand_seq_cst(ptr %ptr) {
-; CHECK-LABEL: @float_load_expand_seq_cst
-; CHECK: %1 = load atomic i32, ptr %ptr seq_cst, align 4
-; CHECK: %2 = bitcast i32 %1 to float
-; CHECK: ret float %2
+; CHECK-LABEL: define float @float_load_expand_seq_cst(
+; CHECK-SAME: ptr [[PTR:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = load atomic i32, ptr [[PTR]] seq_cst, align 4
+; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32 [[TMP1]] to float
+; CHECK-NEXT: ret float [[TMP2]]
+;
%res = load atomic float, ptr %ptr seq_cst, align 4
ret float %res
}
define float @float_load_expand_vol(ptr %ptr) {
-; CHECK-LABEL: @float_load_expand_vol
-; CHECK: %1 = load atomic volatile i32, ptr %ptr unordered, align 4
-; CHECK: %2 = bitcast i32 %1 to float
-; CHECK: ret float %2
+; CHECK-LABEL: define float @float_load_expand_vol(
+; CHECK-SAME: ptr [[PTR:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = load atomic volatile i32, ptr [[PTR]] unordered, align 4
+; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32 [[TMP1]] to float
+; CHECK-NEXT: ret float [[TMP2]]
+;
%res = load atomic volatile float, ptr %ptr unordered, align 4
ret float %res
}
define float @float_load_expand_addr1(ptr addrspace(1) %ptr) {
-; CHECK-LABEL: @float_load_expand_addr1
-; CHECK: %1 = load atomic i32, ptr addrspace(1) %ptr unordered, align 4
-; CHECK: %2 = bitcast i32 %1 to float
-; CHECK: ret float %2
+; CHECK-LABEL: define float @float_load_expand_addr1(
+; CHECK-SAME: ptr addrspace(1) [[PTR:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = load atomic i32, ptr addrspace(1) [[PTR]] unordered, align 4
+; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32 [[TMP1]] to float
+; CHECK-NEXT: ret float [[TMP2]]
+;
%res = load atomic float, ptr addrspace(1) %ptr unordered, align 4
ret float %res
}
define void @float_store_expand(ptr %ptr, float %v) {
-; CHECK-LABEL: @float_store_expand
-; CHECK: %1 = bitcast float %v to i32
-; CHECK: store atomic i32 %1, ptr %ptr unordered, align 4
+; CHECK-LABEL: define void @float_store_expand(
+; CHECK-SAME: ptr [[PTR:%.*]], float [[V:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast float [[V]] to i32
+; CHECK-NEXT: store atomic i32 [[TMP1]], ptr [[PTR]] unordered, align 4
+; CHECK-NEXT: ret void
+;
store atomic float %v, ptr %ptr unordered, align 4
ret void
}
define void @float_store_expand_seq_cst(ptr %ptr, float %v) {
-; CHECK-LABEL: @float_store_expand_seq_cst
-; CHECK: %1 = bitcast float %v to i32
-; CHECK: store atomic i32 %1, ptr %ptr seq_cst, align 4
+; CHECK-LABEL: define void @float_store_expand_seq_cst(
+; CHECK-SAME: ptr [[PTR:%.*]], float [[V:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast float [[V]] to i32
+; CHECK-NEXT: store atomic i32 [[TMP1]], ptr [[PTR]] seq_cst, align 4
+; CHECK-NEXT: ret void
+;
store atomic float %v, ptr %ptr seq_cst, align 4
ret void
}
define void @float_store_expand_vol(ptr %ptr, float %v) {
-; CHECK-LABEL: @float_store_expand_vol
-; CHECK: %1 = bitcast float %v to i32
-; CHECK: store atomic volatile i32 %1, ptr %ptr unordered, align 4
+; CHECK-LABEL: define void @float_store_expand_vol(
+; CHECK-SAME: ptr [[PTR:%.*]], float [[V:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast float [[V]] to i32
+; CHECK-NEXT: store atomic volatile i32 [[TMP1]], ptr [[PTR]] unordered, align 4
+; CHECK-NEXT: ret void
+;
store atomic volatile float %v, ptr %ptr unordered, align 4
ret void
}
define void @float_store_expand_addr1(ptr addrspace(1) %ptr, float %v) {
-; CHECK-LABEL: @float_store_expand_addr1
-; CHECK: %1 = bitcast float %v to i32
-; CHECK: store atomic i32 %1, ptr addrspace(1) %ptr unordered, align 4
+; CHECK-LABEL: define void @float_store_expand_addr1(
+; CHECK-SAME: ptr addrspace(1) [[PTR:%.*]], float [[V:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast float [[V]] to i32
+; CHECK-NEXT: store atomic i32 [[TMP1]], ptr addrspace(1) [[PTR]] unordered, align 4
+; CHECK-NEXT: ret void
+;
store atomic float %v, ptr addrspace(1) %ptr unordered, align 4
ret void
}
define void @pointer_cmpxchg_expand(ptr %ptr, ptr %v) {
-; CHECK-LABEL: @pointer_cmpxchg_expand
-; CHECK: %1 = ptrtoint ptr %v to i64
-; CHECK: %2 = cmpxchg ptr %ptr, i64 0, i64 %1 seq_cst monotonic
-; CHECK: %3 = extractvalue { i64, i1 } %2, 0
-; CHECK: %4 = extractvalue { i64, i1 } %2, 1
-; CHECK: %5 = inttoptr i64 %3 to ptr
-; CHECK: %6 = insertvalue { ptr, i1 } poison, ptr %5, 0
-; CHECK: %7 = insertvalue { ptr, i1 } %6, i1 %4, 1
+; CHECK-LABEL: define void @pointer_cmpxchg_expand(
+; CHECK-SAME: ptr [[PTR:%.*]], ptr [[V:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint ptr [[V]] to i64
+; CHECK-NEXT: [[TMP2:%.*]] = cmpxchg ptr [[PTR]], i64 0, i64 [[TMP1]] seq_cst monotonic, align 8
+; CHECK-NEXT: [[TMP3:%.*]] = extractvalue { i64, i1 } [[TMP2]], 0
+; CHECK-NEXT: [[TMP4:%.*]] = extractvalue { i64, i1 } [[TMP2]], 1
+; CHECK-NEXT: [[TMP5:%.*]] = inttoptr i64 [[TMP3]] to ptr
+; CHECK-NEXT: [[TMP6:%.*]] = insertvalue { ptr, i1 } poison, ptr [[TMP5]], 0
+; CHECK-NEXT: [[TMP7:%.*]] = insertvalue { ptr, i1 } [[TMP6]], i1 [[TMP4]], 1
+; CHECK-NEXT: ret void
+;
cmpxchg ptr %ptr, ptr null, ptr %v seq_cst monotonic
ret void
}
define void @pointer_cmpxchg_expand2(ptr %ptr, ptr %v) {
-; CHECK-LABEL: @pointer_cmpxchg_expand2
-; CHECK: %1 = ptrtoint ptr %v to i64
-; CHECK: %2 = cmpxchg ptr %ptr, i64 0, i64 %1 release monotonic
-; CHECK: %3 = extractvalue { i64, i1 } %2, 0
-; CHECK: %4 = extractvalue { i64, i1 } %2, 1
-; CHECK: %5 = inttoptr i64 %3 to ptr
-; CHECK: %6 = insertvalue { ptr, i1 } poison, ptr %5, 0
-; CHECK: %7 = insertvalue { ptr, i1 } %6, i1 %4, 1
+; CHECK-LABEL: define void @pointer_cmpxchg_expand2(
+; CHECK-SAME: ptr [[PTR:%.*]], ptr [[V:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint ptr [[V]] to i64
+; CHECK-NEXT: [[TMP2:%.*]] = cmpxchg ptr [[PTR]], i64 0, i64 [[TMP1]] release monotonic, align 8
+; CHECK-NEXT: [[TMP3:%.*]] = extractvalue { i64, i1 } [[TMP2]], 0
+; CHECK-NEXT: [[TMP4:%.*]] = extractvalue { i64, i1 } [[TMP2]], 1
+; CHECK-NEXT: [[TMP5:%.*]] = inttoptr i64 [[TMP3]] to ptr
+; CHECK-NEXT: [[TMP6:%.*]] = insertvalue { ptr, i1 } poison, ptr [[TMP5]], 0
+; CHECK-NEXT: [[TMP7:%.*]] = insertvalue { ptr, i1 } [[TMP6]], i1 [[TMP4]], 1
+; CHECK-NEXT: ret void
+;
cmpxchg ptr %ptr, ptr null, ptr %v release monotonic
ret void
}
define void @pointer_cmpxchg_expand3(ptr %ptr, ptr %v) {
-; CHECK-LABEL: @pointer_cmpxchg_expand3
-; CHECK: %1 = ptrtoint ptr %v to i64
-; CHECK: %2 = cmpxchg ptr %ptr, i64 0, i64 %1 seq_cst seq_cst
-; CHECK: %3 = extractvalue { i64, i1 } %2, 0
-; CHECK: %4 = extractvalue { i64, i1 } %2, 1
-; CHECK: %5 = inttoptr i64 %3 to ptr
-; CHECK: %6 = insertvalue { ptr, i1 } poison, ptr %5, 0
-; CHECK: %7 = insertvalue { ptr, i1 } %6, i1 %4, 1
+; CHECK-LABEL: define void @pointer_cmpxchg_expand3(
+; CHECK-SAME: ptr [[PTR:%.*]], ptr [[V:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint ptr [[V]] to i64
+; CHECK-NEXT: [[TMP2:%.*]] = cmpxchg ptr [[PTR]], i64 0, i64 [[TMP1]] seq_cst seq_cst, align 8
+; CHECK-NEXT: [[TMP3:%.*]] = extractvalue { i64, i1 } [[TMP2]], 0
+; CHECK-NEXT: [[TMP4:%.*]] = extractvalue { i64, i1 } [[TMP2]], 1
+; CHECK-NEXT: [[TMP5:%.*]] = inttoptr i64 [[TMP3]] to ptr
+; CHECK-NEXT: [[TMP6:%.*]] = insertvalue { ptr, i1 } poison, ptr [[TMP5]], 0
+; CHECK-NEXT: [[TMP7:%.*]] = insertvalue { ptr, i1 } [[TMP6]], i1 [[TMP4]], 1
+; CHECK-NEXT: ret void
+;
cmpxchg ptr %ptr, ptr null, ptr %v seq_cst seq_cst
ret void
}
define void @pointer_cmpxchg_expand4(ptr %ptr, ptr %v) {
-; CHECK-LABEL: @pointer_cmpxchg_expand4
-; CHECK: %1 = ptrtoint ptr %v to i64
-; CHECK: %2 = cmpxchg weak ptr %ptr, i64 0, i64 %1 seq_cst seq_cst
-; CHECK: %3 = extractvalue { i64, i1 } %2, 0
-; CHECK: %4 = extractvalue { i64, i1 } %2, 1
-; CHECK: %5 = inttoptr i64 %3 to ptr
-; CHECK: %6 = insertvalue { ptr, i1 } poison, ptr %5, 0
-; CHECK: %7 = insertvalue { ptr, i1 } %6, i1 %4, 1
+; CHECK-LABEL: define void @pointer_cmpxchg_expand4(
+; CHECK-SAME: ptr [[PTR:%.*]], ptr [[V:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint ptr [[V]] to i64
+; CHECK-NEXT: [[TMP2:%.*]] = cmpxchg weak ptr [[PTR]], i64 0, i64 [[TMP1]] seq_cst seq_cst, align 8
+; CHECK-NEXT: [[TMP3:%.*]] = extractvalue { i64, i1 } [[TMP2]], 0
+; CHECK-NEXT: [[TMP4:%.*]] = extractvalue { i64, i1 } [[TMP2]], 1
+; CHECK-NEXT: [[TMP5:%.*]] = inttoptr i64 [[TMP3]] to ptr
+; CHECK-NEXT: [[TMP6:%.*]] = insertvalue { ptr, i1 } poison, ptr [[TMP5]], 0
+; CHECK-NEXT: [[TMP7:%.*]] = insertvalue { ptr, i1 } [[TMP6]], i1 [[TMP4]], 1
+; CHECK-NEXT: ret void
+;
cmpxchg weak ptr %ptr, ptr null, ptr %v seq_cst seq_cst
ret void
}
define void @pointer_cmpxchg_expand5(ptr %ptr, ptr %v) {
-; CHECK-LABEL: @pointer_cmpxchg_expand5
-; CHECK: %1 = ptrtoint ptr %v to i64
-; CHECK: %2 = cmpxchg volatile ptr %ptr, i64 0, i64 %1 seq_cst seq_cst
-; CHECK: %3 = extractvalue { i64, i1 } %2, 0
-; CHECK: %4 = extractvalue { i64, i1 } %2, 1
-; CHECK: %5 = inttoptr i64 %3 to ptr
-; CHECK: %6 = insertvalue { ptr, i1 } poison, ptr %5, 0
-; CHECK: %7 = insertvalue { ptr, i1 } %6, i1 %4, 1
+; CHECK-LABEL: define void @pointer_cmpxchg_expand5(
+; CHECK-SAME: ptr [[PTR:%.*]], ptr [[V:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint ptr [[V]] to i64
+; CHECK-NEXT: [[TMP2:%.*]] = cmpxchg volatile ptr [[PTR]], i64 0, i64 [[TMP1]] seq_cst seq_cst, align 8
+; CHECK-NEXT: [[TMP3:%.*]] = extractvalue { i64, i1 } [[TMP2]], 0
+; CHECK-NEXT: [[TMP4:%.*]] = extractvalue { i64, i1 } [[TMP2]], 1
+; CHECK-NEXT: [[TMP5:%.*]] = inttoptr i64 [[TMP3]] to ptr
+; CHECK-NEXT: [[TMP6:%.*]] = insertvalue { ptr, i1 } poison, ptr [[TMP5]], 0
+; CHECK-NEXT: [[TMP7:%.*]] = insertvalue { ptr, i1 } [[TMP6]], i1 [[TMP4]], 1
+; CHECK-NEXT: ret void
+;
cmpxchg volatile ptr %ptr, ptr null, ptr %v seq_cst seq_cst
ret void
}
-define void @pointer_cmpxchg_expand6(ptr addrspace(1) %ptr,
- ptr addrspace(2) %v) {
-; CHECK-LABEL: @pointer_cmpxchg_expand6
-; CHECK: %1 = ptrtoint ptr addrspace(2) %v to i64
-; CHECK: %2 = cmpxchg ptr addrspace(1) %ptr, i64 0, i64 %1 seq_cst seq_cst
-; CHECK: %3 = extractvalue { i64, i1 } %2, 0
-; CHECK: %4 = extractvalue { i64, i1 } %2, 1
-; CHECK: %5 = inttoptr i64 %3 to ptr addrspace(2)
-; CHECK: %6 = insertvalue { ptr addrspace(2), i1 } poison, ptr addrspace(2) %5, 0
-; CHECK: %7 = insertvalue { ptr addrspace(2), i1 } %6, i1 %4, 1
+define void @pointer_cmpxchg_expand6(ptr addrspace(1) %ptr,
+; CHECK-LABEL: define void @pointer_cmpxchg_expand6(
+; CHECK-SAME: ptr addrspace(1) [[PTR:%.*]], ptr addrspace(2) [[V:%.*]]) {
+; CHECK-NEXT: [[TMP1:%.*]] = ptrtoint ptr addrspace(2) [[V]] to i64
+; CHECK-NEXT: [[TMP2:%.*]] = cmpxchg ptr addrspace(1) [[PTR]], i64 0, i64 [[TMP1]] seq_cst seq_cst, align 8
+; CHECK-NEXT: [[TMP3:%.*]] = extractvalue { i64, i1 } [[TMP2]], 0
+; CHECK-NEXT: [[TMP4:%.*]] = extractvalue { i64, i1 } [[TMP2]], 1
+; CHECK-NEXT: [[TMP5:%.*]] = inttoptr i64 [[TMP3]] to ptr addrspace(2)
+; CHECK-NEXT: [[TMP6:%.*]] = insertvalue { ptr addrspace(2), i1 } poison, ptr addrspace(2) [[TMP5]], 0
+; CHECK-NEXT: [[TMP7:%.*]] = insertvalue { ptr addrspace(2), i1 } [[TMP6]], i1 [[TMP4]], 1
+; CHECK-NEXT: ret void
+;
+ ptr addrspace(2) %v) {
cmpxchg ptr addrspace(1) %ptr, ptr addrspace(2) null, ptr addrspace(2) %v seq_cst seq_cst
ret void
}
+define <2 x ptr> @atomic_vec2_ptr_align(ptr %x) nounwind {
+; CHECK-LABEL: define <2 x ptr> @atomic_vec2_ptr_align(
+; CHECK-SAME: ptr [[X:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[TMP1:%.*]] = call i128 @__atomic_load_16(ptr [[X]], i32 2)
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast i128 [[TMP1]] to <2 x i64>
+; CHECK-NEXT: [[TMP7:%.*]] = inttoptr <2 x i64> [[TMP6]] to <2 x ptr>
+; CHECK-NEXT: ret <2 x ptr> [[TMP7]]
+;
+ %ret = load atomic <2 x ptr>, ptr %x acquire, align 16
+ ret <2 x ptr> %ret
+}
+
+define <4 x ptr addrspace(270)> @atomic_vec4_ptr_align(ptr %x) nounwind {
+; CHECK-LABEL: define <4 x ptr addrspace(270)> @atomic_vec4_ptr_align(
+; CHECK-SAME: ptr [[X:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[TMP1:%.*]] = call i128 @__atomic_load_16(ptr [[X]], i32 2)
+; CHECK-NEXT: [[TMP2:%.*]] = bitcast i128 [[TMP1]] to <4 x i32>
+; CHECK-NEXT: [[TMP3:%.*]] = inttoptr <4 x i32> [[TMP2]] to <4 x ptr addrspace(270)>
+; CHECK-NEXT: ret <4 x ptr addrspace(270)> [[TMP3]]
+;
+ %ret = load atomic <4 x ptr addrspace(270)>, ptr %x acquire, align 16
+ ret <4 x ptr addrspace(270)> %ret
+}
+
+define <2 x i16> @atomic_vec2_i16(ptr %x) nounwind {
+; CHECK-LABEL: define <2 x i16> @atomic_vec2_i16(
+; CHECK-SAME: ptr [[X:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[RET:%.*]] = load atomic <2 x i16>, ptr [[X]] acquire, align 8
+; CHECK-NEXT: ret <2 x i16> [[RET]]
+;
+ %ret = load atomic <2 x i16>, ptr %x acquire, align 8
+ ret <2 x i16> %ret
+}
+
+define <2 x half> @atomic_vec2_half(ptr %x) nounwind {
+; CHECK-LABEL: define <2 x half> @atomic_vec2_half(
+; CHECK-SAME: ptr [[X:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[TMP1:%.*]] = load atomic i32, ptr [[X]] acquire, align 8
+; CHECK-NEXT: [[RET:%.*]] = bitcast i32 [[TMP1]] to <2 x half>
+; CHECK-NEXT: ret <2 x half> [[RET]]
+;
+ %ret = load atomic <2 x half>, ptr %x acquire, align 8
+ ret <2 x half> %ret
+}
+
+define <4 x i32> @atomic_vec4_i32(ptr %x) nounwind {
+; CHECK-LABEL: define <4 x i32> @atomic_vec4_i32(
+; CHECK-SAME: ptr [[X:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[TMP1:%.*]] = call i128 @__atomic_load_16(ptr [[X]], i32 2)
+; CHECK-NEXT: [[TMP2:%.*]] = bitcast i128 [[TMP1]] to <4 x i32>
+; CHECK-NEXT: ret <4 x i32> [[TMP2]]
+;
+ %ret = load atomic <4 x i32>, ptr %x acquire, align 16
+ ret <4 x i32> %ret
+}
+
+define <4 x float> @atomic_vec4_float(ptr %x) nounwind {
+; CHECK-LABEL: define <4 x float> @atomic_vec4_float(
+; CHECK-SAME: ptr [[X:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[TMP1:%.*]] = call i128 @__atomic_load_16(ptr [[X]], i32 2)
+; CHECK-NEXT: [[TMP2:%.*]] = bitcast i128 [[TMP1]] to <4 x float>
+; CHECK-NEXT: ret <4 x float> [[TMP2]]
+;
+ %ret = load atomic <4 x float>, ptr %x acquire, align 16
+ ret <4 x float> %ret
+}