[clang] [mlir] [CIR][AArch64] Add lowering for predicated SVE svdup builtins (zeroing) (PR #175976)
Andrzej Warzyński via cfe-commits
cfe-commits at lists.llvm.org
Thu Feb 5 07:04:56 PST 2026
https://github.com/banach-space updated https://github.com/llvm/llvm-project/pull/175976
>From fc98105ef968dd8e3c1b371f72cd073504b45c13 Mon Sep 17 00:00:00 2001
From: Andrzej Warzynski <andrzej.warzynski at arm.com>
Date: Wed, 14 Jan 2026 13:40:20 +0000
Subject: [PATCH 1/3] [mlir] Fix alignment for predicate (i1) vectors
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Legal scalable predicate vectors (legal in the LLVM sense), e.g.
vector<[16]xi1> (or <vscale x 16 x i1>, using LLVM syntax), ought to have
alignment 2 rather than 16; see e.g. [1].
MLIR currently computes the vector “size in bits” as:
```cpp
vecType.getNumElements()
* dataLayout.getTypeSize(vecType.getElementType()) * 8
```
but `getTypeSize()` returns a size in *bytes* (rounded up from bits), so for
`i1` it returns 1. Multiplying by 8 converts that storage byte back to 8 bits
per element, which overestimates predicate vector sizes.
Instead, use:
```cpp
vecType.getNumElements()
* dataLayout.getTypeSizeInBits(vecType.getElementType())
```
For `vector<[16]xi1>` this changes:
* [before]: `16 * (1 byte * 8) = 128 bits`
* [after]: `16 * 1 bit = 16 bits`
This is a very small update that, based on the available tests, only
affects types like `vector<[16]xi1>`. It aligns MLIR with LLVM, making
sure that the corresponding alignment is 2 rather than 16. For context,
LLVM computes the alignment in this case via `getTypeStoreSize`, which
for `16 x i1` returns 2 bytes. Perhaps MLIR should follow a similar path
in the future.
[1] https://developer.arm.com/documentation/ddi0602/2025-12/SVE-Instructions/LDR--predicate---Load-predicate-register-?lang=en
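For illustration, here is a minimal sketch (not part of this patch; the empty data layout string and the helper are assumptions for illustration only) of how LLVM's `DataLayout` arrives at 16 bits / 2 bytes for a legal scalable predicate vector:

```cpp
// Sketch: query LLVM's DataLayout for <vscale x 16 x i1>.
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  llvm::LLVMContext ctx;
  llvm::DataLayout dl("");  // default data layout string (assumption)
  auto *predTy = llvm::ScalableVectorType::get(llvm::Type::getInt1Ty(ctx),
                                               /*MinNumElts=*/16);

  // 16 bits (scalable): one bit per element, not one byte per element.
  llvm::TypeSize bits = dl.getTypeSizeInBits(predTy);
  // 2 bytes (scalable): the store size from which the alignment is derived.
  llvm::TypeSize storeBytes = dl.getTypeStoreSize(predTy);

  llvm::errs() << bits.getKnownMinValue() << " bits, "
               << storeBytes.getKnownMinValue() << " bytes\n";
}
```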
---
mlir/lib/Interfaces/DataLayoutInterfaces.cpp | 2 +-
mlir/test/Interfaces/DataLayoutInterfaces/query.mlir | 6 ++++++
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/mlir/lib/Interfaces/DataLayoutInterfaces.cpp b/mlir/lib/Interfaces/DataLayoutInterfaces.cpp
index 782384999c70c..a6922ee5f4b5b 100644
--- a/mlir/lib/Interfaces/DataLayoutInterfaces.cpp
+++ b/mlir/lib/Interfaces/DataLayoutInterfaces.cpp
@@ -78,7 +78,7 @@ mlir::detail::getDefaultTypeSizeInBits(Type type, const DataLayout &dataLayout,
if (auto vecType = dyn_cast<VectorType>(type)) {
uint64_t baseSize = vecType.getNumElements() / vecType.getShape().back() *
llvm::PowerOf2Ceil(vecType.getShape().back()) *
- dataLayout.getTypeSize(vecType.getElementType()) * 8;
+ dataLayout.getTypeSizeInBits(vecType.getElementType());
return llvm::TypeSize::get(baseSize, vecType.isScalable());
}
diff --git a/mlir/test/Interfaces/DataLayoutInterfaces/query.mlir b/mlir/test/Interfaces/DataLayoutInterfaces/query.mlir
index 5df32555000ad..97ef8b2a8ae1c 100644
--- a/mlir/test/Interfaces/DataLayoutInterfaces/query.mlir
+++ b/mlir/test/Interfaces/DataLayoutInterfaces/query.mlir
@@ -44,6 +44,12 @@ func.func @no_layout_builtin() {
// CHECK: preferred = 16
// CHECK: size = {minimal_size = 16 : index, scalable}
"test.data_layout_query"() : () -> vector<[4]xi32>
+ // CHECK: alignment = 2
+ // CHECK: bitsize = {minimal_size = 16 : index, scalable}
+ // CHECK: index = 0
+ // CHECK: preferred = 2
+ // CHECK: size = {minimal_size = 2 : index, scalable}
+ "test.data_layout_query"() : () -> vector<[16]xi1>
return
}
>From 8bd05e287993b5bcef0600656fe68c2a66f8ac07 Mon Sep 17 00:00:00 2001
From: Andrzej Warzynski <andrzej.warzynski at arm.com>
Date: Wed, 14 Jan 2026 14:56:55 +0000
Subject: [PATCH 2/3] [CIR][AArch64] Add lowering for predicated SVE svdup
builtins (zeroing)
This PR adds CIR lowering support for predicated SVE `svdup` builtins on
AArch64. The corresponding ACLE intrinsics are documented at:
https://developer.arm.com/architectures/instruction-sets/intrinsics
This change focuses on the zeroing-predicated variants (suffix `_z`, e.g.
`svdup_n_f32_z`), which lower to the LLVM SVE `dup` intrinsic with a
`zeroinitializer` passthrough operand.
IMPLEMENTATION NOTES
--------------------
* The CIR type converter is extended to support `BuiltinType::SveBool`,
which is lowered to `cir.vector<[16] x i1>`, matching current Clang
behaviour and ensuring compatibility with existing LLVM SVE lowering.
* Added logic that reinterprets a `cir.vector<[16] x i1>` predicate to match
  the element type of the operation being lowered. This is done by calling
  `@llvm.aarch64.sve.convert.from.svbool` (see the sketch after this list).
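As an illustration of that conversion, here is a minimal sketch using the plain LLVM `IRBuilder` API (an assumption for illustration only; the patch itself emits the equivalent `cir.call_llvm_intrinsic` op) that narrows a 16-lane svbool predicate to the 8-lane form expected by a 16-bit-element operation:

```cpp
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicsAArch64.h"

// Sketch only: reinterpret a <vscale x 16 x i1> svbool predicate as the
// <vscale x 8 x i1> predicate expected by a 16-bit-element SVE operation.
llvm::Value *narrowSvboolTo8Lanes(llvm::IRBuilder<> &builder,
                                  llvm::Value *pg16) {
  auto *narrowTy =
      llvm::ScalableVectorType::get(builder.getInt1Ty(), /*MinNumElts=*/8);
  return builder.CreateIntrinsic(
      narrowTy, llvm::Intrinsic::aarch64_sve_convert_from_svbool, {pg16});
}
```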
TEST NOTES
----------
Compared to the unpredicated `svdup` tests (#174433), the new tests perform
more explicit checks to verify:
* Correct argument usage
* Correct return value + type
This helped validate differences between the default Clang lowering and the
CIR-based lowering. Once all `svdup` variants are implemented, the tests will
be unified.
EXAMPLE LOWERING
----------------
The following example illustrates that the CIR lowering produces LLVM IR
equivalent to that of the default Clang path.
Input:
```c
svint8_t test_svdup_n_s8(svbool_t pg, int8_t op) {
return svdup_n_s8_z(pg, op);
}
```
OUTPUT 1 (default):
```llvm
define dso_local <vscale x 16 x i8> @test(<vscale x 16 x i1> %pg, i8 noundef %op) #0 {
entry:
%pg.addr = alloca <vscale x 16 x i1>, align 2
%op.addr = alloca i8, align 1
store <vscale x 16 x i1> %pg, ptr %pg.addr, align 2
store i8 %op, ptr %op.addr, align 1
%0 = load <vscale x 16 x i1>, ptr %pg.addr, align 2
%1 = load i8, ptr %op.addr, align 1
%2 = call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> zeroinitializer, <vscale x 16 x i1> %0, i8 %1)
ret <vscale x 16 x i8> %2
}
```
OUTPUT 2 (via `-fclangir`):
```llvm
; Function Attrs: noinline
define dso_local <vscale x 16 x i8> @test(<vscale x 16 x i1> %0, i8 %1) #0 {
%3 = alloca <vscale x 16 x i1>, i64 1, align 2
%4 = alloca i8, i64 1, align 1
%5 = alloca <vscale x 16 x i8>, i64 1, align 16
store <vscale x 16 x i1> %0, ptr %3, align 2
store i8 %1, ptr %4, align 1
%6 = load <vscale x 16 x i1>, ptr %3, align 2
%7 = load i8, ptr %4, align 1
%8 = call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> zeroinitializer, <vscale x 16 x i1> %6, i8 %7)
store <vscale x 16 x i8> %8, ptr %5, align 16
%9 = load <vscale x 16 x i8>, ptr %5, align 16
ret <vscale x 16 x i8> %9
}
```
**DEPENDS ON:** https://github.com/llvm/llvm-project/pull/175961
---
.../lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp | 95 +++-
clang/lib/CIR/CodeGen/CIRGenFunction.h | 2 +
clang/lib/CIR/CodeGen/CIRGenTypes.cpp | 4 +
.../CodeGenBuiltins/AArch64/acle_sve_dup.c | 477 +++++++++++++++++-
4 files changed, 564 insertions(+), 14 deletions(-)
diff --git a/clang/lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp b/clang/lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp
index 93089eb585aa7..d59d3bebe0bb0 100644
--- a/clang/lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp
@@ -126,6 +126,81 @@ bool CIRGenFunction::getAArch64SVEProcessedOperands(
return true;
}
+// Reinterpret the input predicate so that it can be used to correctly isolate
+// the elements of the specified datatype.
+mlir::Value CIRGenFunction::emitSVEpredicateCast(mlir::Value *pred,
+ unsigned minNumElts,
+ mlir::Location loc) {
+
+ // TODO: Handle "aarch64.svcount" once we get round to supporting SME.
+
+ auto retTy = cir::VectorType::get(builder.getUIntNTy(1), minNumElts,
+ /*is_scalable=*/true);
+ if (pred->getType() == retTy)
+ return *pred;
+
+ unsigned intID;
+ mlir::Type intrinsicTy;
+ switch (minNumElts) {
+ default:
+ llvm_unreachable("unsupported element count!");
+ case 1:
+ case 2:
+ case 4:
+ case 8:
+ intID = Intrinsic::aarch64_sve_convert_from_svbool;
+ intrinsicTy = retTy;
+ break;
+ case 16:
+ intID = Intrinsic::aarch64_sve_convert_to_svbool;
+ intrinsicTy = pred->getType();
+ break;
+ }
+
+ std::string llvmIntrName(Intrinsic::getBaseName(intID));
+ llvmIntrName.erase(0, /*std::strlen(".llvm")=*/5);
+ auto call = emitIntrinsicCallOp(builder, loc, llvmIntrName, retTy,
+ mlir::ValueRange{*pred});
+ assert(call.getType() == retTy && "Unexpected return type!");
+ return call;
+}
+
+// Return the minimum element count for the given SVE element type.
+static unsigned getSVEMinEltCount(const clang::SVETypeFlags::EltType &sveType) {
+ switch (sveType) {
+ default:
+ llvm_unreachable("Invalid SVETypeFlag!");
+
+ case SVETypeFlags::EltTyInt8:
+ return 16;
+ case SVETypeFlags::EltTyInt16:
+ return 8;
+ case SVETypeFlags::EltTyInt32:
+ return 4;
+ case SVETypeFlags::EltTyInt64:
+ return 2;
+
+ case SVETypeFlags::EltTyMFloat8:
+ return 16;
+ case SVETypeFlags::EltTyFloat16:
+ case SVETypeFlags::EltTyBFloat16:
+ return 8;
+ case SVETypeFlags::EltTyFloat32:
+ return 4;
+ case SVETypeFlags::EltTyFloat64:
+ return 2;
+
+ case SVETypeFlags::EltTyBool8:
+ return 16;
+ case SVETypeFlags::EltTyBool16:
+ return 8;
+ case SVETypeFlags::EltTyBool32:
+ return 4;
+ case SVETypeFlags::EltTyBool64:
+ return 2;
+ }
+}
+
std::optional<mlir::Value>
CIRGenFunction::emitAArch64SVEBuiltinExpr(unsigned builtinID,
const CallExpr *expr) {
@@ -171,10 +246,12 @@ CIRGenFunction::emitAArch64SVEBuiltinExpr(unsigned builtinID,
std::string("unimplemented AArch64 builtin call: ") +
getContext().BuiltinInfo.getName(builtinID));
- if (typeFlags.getMergeType() == SVETypeFlags::MergeZeroExp)
- cgm.errorNYI(expr->getSourceRange(),
- std::string("unimplemented AArch64 builtin call: ") +
- getContext().BuiltinInfo.getName(builtinID));
+ // Zero-ing predication
+ if (typeFlags.getMergeType() == SVETypeFlags::MergeZeroExp) {
+ auto null = builder.getNullValue(convertType(expr->getType()),
+ getLoc(expr->getExprLoc()));
+ ops.insert(ops.begin(), null);
+ }
if (typeFlags.getMergeType() == SVETypeFlags::MergeAnyExp)
cgm.errorNYI(expr->getSourceRange(),
@@ -194,11 +271,11 @@ CIRGenFunction::emitAArch64SVEBuiltinExpr(unsigned builtinID,
// Predicates must match the main datatype.
for (mlir::Value &op : ops)
- if (auto predTy = dyn_cast<mlir::VectorType>(op.getType()))
- if (predTy.getElementType().isInteger(1))
- cgm.errorNYI(expr->getSourceRange(),
- std::string("unimplemented AArch64 builtin call: ") +
- getContext().BuiltinInfo.getName(builtinID));
+ if (auto predTy = dyn_cast<cir::VectorType>(op.getType()))
+ if (auto cirInt = dyn_cast<cir::IntType>(predTy.getElementType()))
+ if (cirInt.getWidth() == 1)
+ op = emitSVEpredicateCast(
+ &op, getSVEMinEltCount(typeFlags.getEltType()), loc);
// Splat scalar operand to vector (intrinsics with _n infix)
if (typeFlags.hasSplatOperand()) {
diff --git a/clang/lib/CIR/CodeGen/CIRGenFunction.h b/clang/lib/CIR/CodeGen/CIRGenFunction.h
index 5fe1d9a4f2b76..86d2a8c4ac089 100644
--- a/clang/lib/CIR/CodeGen/CIRGenFunction.h
+++ b/clang/lib/CIR/CodeGen/CIRGenFunction.h
@@ -1269,6 +1269,8 @@ class CIRGenFunction : public CIRGenTypeCache {
bool getAArch64SVEProcessedOperands(unsigned builtinID, const CallExpr *expr,
SmallVectorImpl<mlir::Value> &ops,
clang::SVETypeFlags typeFlags);
+ mlir::Value emitSVEpredicateCast(mlir::Value *pred, unsigned minNumElts,
+ mlir::Location loc);
std::optional<mlir::Value>
emitAArch64BuiltinExpr(unsigned builtinID, const CallExpr *expr,
ReturnValueSlot returnValue,
diff --git a/clang/lib/CIR/CodeGen/CIRGenTypes.cpp b/clang/lib/CIR/CodeGen/CIRGenTypes.cpp
index 985c2901a7b04..f6220c616ed60 100644
--- a/clang/lib/CIR/CodeGen/CIRGenTypes.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenTypes.cpp
@@ -373,6 +373,10 @@ mlir::Type CIRGenTypes::convertType(QualType type) {
resultType = cir::VectorType::get(builder.getDoubleTy(), 2,
/*is_scalable=*/true);
break;
+ case BuiltinType::SveBool:
+ resultType = cir::VectorType::get(builder.getUIntNTy(1), 16,
+ /*is_scalable=*/true);
+ break;
// Unsigned integral types.
case BuiltinType::Char8:
diff --git a/clang/test/CIR/CodeGenBuiltins/AArch64/acle_sve_dup.c b/clang/test/CIR/CodeGenBuiltins/AArch64/acle_sve_dup.c
index 3e0a892d6b368..60a2992ab14ad 100644
--- a/clang/test/CIR/CodeGenBuiltins/AArch64/acle_sve_dup.c
+++ b/clang/test/CIR/CodeGenBuiltins/AArch64/acle_sve_dup.c
@@ -1,13 +1,13 @@
// REQUIRES: aarch64-registered-target
-
+//
// RUN: %clang_cc1 -triple aarch64 -target-feature +sve -disable-O0-optnone -Werror -Wall -fclangir -emit-cir -o - %s | FileCheck %s --check-prefixes=ALL,CIR
// RUN: %clang_cc1 -DSVE_OVERLOADED_FORMS -triple aarch64 -target-feature +sve -disable-O0-optnone -Werror -Wall -fclangir -emit-cir -o - %s | FileCheck %s --check-prefixes=ALL,CIR
-// RUN: %clang_cc1 -triple aarch64 -target-feature +sve -disable-O0-optnone -Werror -Wall -fclangir -emit-llvm -o - %s | FileCheck %s --check-prefixes=ALL,LLVM_OGCG_CIR
-// RUN: %clang_cc1 -DSVE_OVERLOADED_FORMS -triple aarch64 -target-feature +sve -disable-O0-optnone -Werror -Wall -fclangir -emit-llvm -o - %s | FileCheck %s --check-prefixes=ALL,LLVM_OGCG_CIR
+// RUN: %clang_cc1 -triple aarch64 -target-feature +sve -disable-O0-optnone -Werror -Wall -fclangir -emit-llvm -o - %s | FileCheck %s --check-prefixes=ALL,LLVM_OGCG_CIR,LLVM_VIA_CIR
+// RUN: %clang_cc1 -DSVE_OVERLOADED_FORMS -triple aarch64 -target-feature +sve -disable-O0-optnone -Werror -Wall -fclangir -emit-llvm -o - %s | FileCheck %s --check-prefixes=ALL,LLVM_OGCG_CIR,LLVM_VIA_CIR
-// RUN: %clang_cc1 -triple aarch64 -target-feature +sve -disable-O0-optnone -Werror -Wall -emit-llvm -o - %s | FileCheck %s --check-prefixes=ALL,LLVM_OGCG_CIR
-// RUN: %clang_cc1 -DSVE_OVERLOADED_FORMS -triple aarch64 -target-feature +sve -disable-O0-optnone -Werror -Wall -emit-llvm -o - %s | FileCheck %s --check-prefixes=ALL,LLVM_OGCG_CIR
+// RUN: %clang_cc1 -triple aarch64 -target-feature +sve -disable-O0-optnone -Werror -Wall -emit-llvm -o - %s | FileCheck %s --check-prefixes=ALL,LLVM_OGCG_CIR,LLVM_DIRECT
+// RUN: %clang_cc1 -DSVE_OVERLOADED_FORMS -triple aarch64 -target-feature +sve -disable-O0-optnone -Werror -Wall -emit-llvm -o - %s | FileCheck %s --check-prefixes=ALL,LLVM_OGCG_CIR,LLVM_DIRECT
#include <arm_sve.h>
#if defined __ARM_FEATURE_SME
@@ -209,3 +209,470 @@ svfloat64_t test_svdup_n_f64(float64_t op) MODE_ATTR
// LLVM_OGCG_CIR: [[RES:%.*]] = call <vscale x 2 x double> @llvm.aarch64.sve.dup.x.nxv2f64(double [[OP_LOAD]])
return SVE_ACLE_FUNC(svdup,_n,_f64,)(op);
}
+
+// ALL-LABEL: @test_svdup_n_s8_z
+svint8_t test_svdup_n_s8_z(svbool_t pg, int8_t op) MODE_ATTR
+{
+// CIR-SAME: %[[PG:.*]]: !cir.vector<[16] x !cir.int<u, 1>>
+// CIR-SAME: %[[OP:.*]]: !s8i
+// CIR-SAME: -> !cir.vector<[16] x !s8i>
+// CIR: %[[ALLOCA_PG:.*]] = cir.alloca !cir.vector<[16] x !cir.int<u, 1>>
+// CIR: %[[ALLOCA_OP:.*]] = cir.alloca !s8i
+// CIR: %[[ALLOCA_RES:.*]] = cir.alloca !cir.vector<[16] x !s8i>
+// CIR: cir.store %[[PG]], %[[ALLOCA_PG]]
+// CIR: cir.store %[[OP]], %[[ALLOCA_OP]]
+// CIR: %[[LOAD_PG:.*]] = cir.load align(2) %[[ALLOCA_PG]]
+// CIR: %[[LOAD_OP:.*]] = cir.load align(1) %[[ALLOCA_OP]]
+// CIR: %[[CONST_0:.*]] = cir.const #cir.zero : !cir.vector<[16] x !s8i>
+// CIR: %[[CONVERT_PG:.*]] = cir.call_llvm_intrinsic "aarch64.sve.dup" %[[CONST_0]], %[[LOAD_PG]], %[[LOAD_OP]]
+// CIR-SAME: -> !cir.vector<[16] x !s8i>
+// CIR: cir.store %[[CONVERT_PG]], %[[ALLOCA_RES]]
+// CIR: %[[RES:.*]] = cir.load %[[ALLOCA_RES]]
+// CIR: cir.return %[[RES]]
+
+// LLVM_OGCG_CIR-SAME: <vscale x 16 x i1> [[PG:%.*]], i8 {{(noundef)?[[:space:]]?}}[[OP:%.*]])
+// LLVM_OGCG_CIR: [[PG_ADDR:%.*]] = alloca <vscale x 16 x i1>,{{([[:space:]]?i64 1,)?}} align 2
+// LLVM_OGCG_CIR: [[OP_ADDR:%.*]] = alloca i8,{{([[:space:]]?i64 1,)?}} align 1
+//
+// LLVM_VIA_CIR: [[RES_ADDR:%.*]] = alloca <vscale x 16 x i8>,{{([[:space:]]?i64 1,)?}} align 16
+//
+// LLVM_OGCG_CIR: store <vscale x 16 x i1> [[PG]], ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: store i8 [[OP]], ptr [[OP_ADDR]], align 1
+// LLVM_OGCG_CIR: [[TMP0:%.*]] = load <vscale x 16 x i1>, ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP1:%.*]] = load i8, ptr [[OP_ADDR]], align 1
+// LLVM_OGCG_CIR: [[TMP2:%.*]] = call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> zeroinitializer, <vscale x 16 x i1> [[TMP0]], i8 [[TMP1]])
+//
+// LLVM_DIRECT: ret {{.*}} [[TMP2]]
+//
+// LLVM_VIA_CIR: store {{.*}} [[TMP2]], ptr [[RES_ADDR]]
+// LLVM_VIA_CIR: [[RES:%.*]] = load {{.*}} [[RES_ADDR]]
+// LLVM_VIA_CIR: ret {{.*}} [[RES]]
+ return SVE_ACLE_FUNC(svdup,_n,_s8_z,)(pg, op);
+}
+
+// ALL-LABEL: @test_svdup_n_s16_z(
+svint16_t test_svdup_n_s16_z(svbool_t pg, int16_t op) MODE_ATTR
+{
+// CIR-SAME: %[[PG:.*]]: !cir.vector<[16] x !cir.int<u, 1>>
+// CIR-SAME: %[[OP:.*]]: !s16i
+// CIR-SAME: -> !cir.vector<[8] x !s16i>
+// CIR: %[[ALLOCA_PG:.*]] = cir.alloca !cir.vector<[16] x !cir.int<u, 1>>
+// CIR: %[[ALLOCA_OP:.*]] = cir.alloca !s16i
+// CIR: %[[ALLOCA_RES:.*]] = cir.alloca !cir.vector<[8] x !s16i>
+// CIR: cir.store %[[PG]], %[[ALLOCA_PG]]
+// CIR: cir.store %[[OP]], %[[ALLOCA_OP]]
+// CIR: %[[LOAD_PG:.*]] = cir.load align(2) %[[ALLOCA_PG]]
+// CIR: %[[LOAD_OP:.*]] = cir.load align(2) %[[ALLOCA_OP]]
+// CIR: %[[CONST_0:.*]] = cir.const #cir.zero : !cir.vector<[8] x !s16i>
+// CIR: %[[CONVERT_PG:.*]] = cir.call_llvm_intrinsic "aarch64.sve.convert.from.svbool" %[[LOAD_PG]]
+// CIR-SAME: -> !cir.vector<[8] x !cir.int<u, 1>>
+// CIR: %[[CALL_DUP:.*]] = cir.call_llvm_intrinsic "aarch64.sve.dup" %[[CONST_0]], %[[CONVERT_PG]], %[[LOAD_OP]]
+// CIR-SAME: -> !cir.vector<[8] x !s16i>
+// CIR: cir.store %[[CALL_DUP]], %[[ALLOCA_RES]]
+// CIR: %[[RES:.*]] = cir.load %[[ALLOCA_RES]]
+// CIR: cir.return %[[RES]]
+
+// LLVM_OGCG_CIR-SAME: <vscale x 16 x i1> [[PG:%.*]], i16 {{(noundef)?[[:space:]]?}}[[OP:%.*]])
+// LLVM_OGCG_CIR: [[PG_ADDR:%.*]] = alloca <vscale x 16 x i1>,{{([[:space:]]?i64 1,)?}} align 2
+// LLVM_OGCG_CIR: [[OP_ADDR:%.*]] = alloca i16,{{([[:space:]]?i64 1,)?}} align 2
+//
+// LLVM_VIA_CIR: [[RES_ADDR:%.*]] = alloca <vscale x 8 x i16>,{{([[:space:]]?i64 1,)?}} align 16
+//
+// LLVM_OGCG_CIR: store <vscale x 16 x i1> [[PG]], ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: store i16 [[OP]], ptr [[OP_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP0:%.*]] = load <vscale x 16 x i1>, ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP1:%.*]] = load i16, ptr [[OP_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP2:%.*]] = call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> [[TMP0]])
+// LLVM_OGCG_CIR: [[TMP3:%.*]] = call <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16> zeroinitializer, <vscale x 8 x i1> [[TMP2]], i16 [[TMP1]])
+//
+// LLVM_DIRECT: ret {{.*}} [[TMP3]]
+//
+// LLVM_VIA_CIR: store {{.*}} [[TMP3]], ptr [[RES_ADDR]]
+// LLVM_VIA_CIR: [[RES:%.*]] = load {{.*}} [[RES_ADDR]]
+// LLVM_VIA_CIR: ret {{.*}} [[RES]]
+ return SVE_ACLE_FUNC(svdup,_n,_s16_z,)(pg, op);
+}
+
+// ALL-LABEL: @test_svdup_n_s32_z(
+svint32_t test_svdup_n_s32_z(svbool_t pg, int32_t op) MODE_ATTR
+{
+// CIR-SAME: %[[PG:.*]]: !cir.vector<[16] x !cir.int<u, 1>>
+// CIR-SAME: %[[OP:.*]]: !s32i
+// CIR-SAME: -> !cir.vector<[4] x !s32i>
+// CIR: %[[ALLOCA_PG:.*]] = cir.alloca !cir.vector<[16] x !cir.int<u, 1>>
+// CIR: %[[ALLOCA_OP:.*]] = cir.alloca !s32i
+// CIR: %[[ALLOCA_RES:.*]] = cir.alloca !cir.vector<[4] x !s32i>
+// CIR: cir.store %[[PG]], %[[ALLOCA_PG]]
+// CIR: cir.store %[[OP]], %[[ALLOCA_OP]]
+// CIR: %[[LOAD_PG:.*]] = cir.load align(2) %[[ALLOCA_PG]]
+// CIR: %[[LOAD_OP:.*]] = cir.load align(4) %[[ALLOCA_OP]]
+// CIR: %[[CONST_0:.*]] = cir.const #cir.zero : !cir.vector<[4] x !s32i>
+// CIR: %[[CONVERT_PG:.*]] = cir.call_llvm_intrinsic "aarch64.sve.convert.from.svbool" %[[LOAD_PG]]
+// CIR-SAME: -> !cir.vector<[4] x !cir.int<u, 1>>
+// CIR: %[[CALL_DUP:.*]] = cir.call_llvm_intrinsic "aarch64.sve.dup" %[[CONST_0]], %[[CONVERT_PG]], %[[LOAD_OP]]
+// CIR-SAME: -> !cir.vector<[4] x !s32i>
+// CIR: cir.store %[[CALL_DUP]], %[[ALLOCA_RES]]
+// CIR: %[[RES:.*]] = cir.load %[[ALLOCA_RES]]
+// CIR: cir.return %[[RES]]
+
+// LLVM_OGCG_CIR-SAME: <vscale x 16 x i1> [[PG:%.*]], i32 {{(noundef)?[[:space:]]?}}[[OP:%.*]])
+// LLVM_OGCG_CIR: [[PG_ADDR:%.*]] = alloca <vscale x 16 x i1>,{{([[:space:]]?i64 1,)?}} align 2
+// LLVM_OGCG_CIR: [[OP_ADDR:%.*]] = alloca i32,{{([[:space:]]?i64 1,)?}} align 4
+//
+// LLVM_VIA_CIR: [[RES_ADDR:%.*]] = alloca <vscale x 4 x i32>,{{([[:space:]]?i64 1,)?}} align 16
+//
+// LLVM_OGCG_CIR: store <vscale x 16 x i1> [[PG]], ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: store i32 [[OP]], ptr [[OP_ADDR]], align 4
+// LLVM_OGCG_CIR: [[TMP0:%.*]] = load <vscale x 16 x i1>, ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP1:%.*]] = load i32, ptr [[OP_ADDR]], align 4
+// LLVM_OGCG_CIR: [[TMP2:%.*]] = call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> [[TMP0]])
+// LLVM_OGCG_CIR: [[TMP3:%.*]] = call <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32> zeroinitializer, <vscale x 4 x i1> [[TMP2]], i32 [[TMP1]])
+//
+// LLVM_DIRECT: ret {{.*}} [[TMP3]]
+//
+// LLVM_VIA_CIR: store {{.*}} [[TMP3]], ptr [[RES_ADDR]]
+// LLVM_VIA_CIR: [[RES:%.*]] = load {{.*}} [[RES_ADDR]]
+// LLVM_VIA_CIR: ret {{.*}} [[RES]]
+ return SVE_ACLE_FUNC(svdup,_n,_s32_z,)(pg, op);
+}
+
+// ALL-LABEL: @test_svdup_n_s64_z(
+svint64_t test_svdup_n_s64_z(svbool_t pg, int64_t op) MODE_ATTR
+{
+// CIR-SAME: %[[PG:.*]]: !cir.vector<[16] x !cir.int<u, 1>>
+// CIR-SAME: %[[OP:.*]]: !s64i
+// CIR-SAME: -> !cir.vector<[2] x !s64i>
+// CIR: %[[ALLOCA_PG:.*]] = cir.alloca !cir.vector<[16] x !cir.int<u, 1>>
+// CIR: %[[ALLOCA_OP:.*]] = cir.alloca !s64i
+// CIR: %[[ALLOCA_RES:.*]] = cir.alloca !cir.vector<[2] x !s64i>
+// CIR: cir.store %[[PG]], %[[ALLOCA_PG]]
+// CIR: cir.store %[[OP]], %[[ALLOCA_OP]]
+// CIR: %[[LOAD_PG:.*]] = cir.load align(2) %[[ALLOCA_PG]]
+// CIR: %[[LOAD_OP:.*]] = cir.load align(8) %[[ALLOCA_OP]]
+// CIR: %[[CONST_0:.*]] = cir.const #cir.zero : !cir.vector<[2] x !s64i>
+// CIR: %[[CONVERT_PG:.*]] = cir.call_llvm_intrinsic "aarch64.sve.convert.from.svbool" %[[LOAD_PG]]
+// CIR-SAME: -> !cir.vector<[2] x !cir.int<u, 1>>
+// CIR: %[[CALL_DUP:.*]] = cir.call_llvm_intrinsic "aarch64.sve.dup" %[[CONST_0]], %[[CONVERT_PG]], %[[LOAD_OP]]
+// CIR-SAME: -> !cir.vector<[2] x !s64i>
+// CIR: cir.store %[[CALL_DUP]], %[[ALLOCA_RES]]
+// CIR: %[[RES:.*]] = cir.load %[[ALLOCA_RES]]
+// CIR: cir.return %[[RES]]
+
+// LLVM_OGCG_CIR-SAME: <vscale x 16 x i1> [[PG:%.*]], i64 {{(noundef)?[[:space:]]?}}[[OP:%.*]])
+// LLVM_OGCG_CIR: [[PG_ADDR:%.*]] = alloca <vscale x 16 x i1>,{{([[:space:]]?i64 1,)?}} align 2
+// LLVM_OGCG_CIR: [[OP_ADDR:%.*]] = alloca i64,{{([[:space:]]?i64 1,)?}} align 8
+//
+// LLVM_VIA_CIR: [[RES_ADDR:%.*]] = alloca <vscale x 2 x i64>,{{([[:space:]]?i64 1,)?}} align 16
+//
+// LLVM_OGCG_CIR: store <vscale x 16 x i1> [[PG]], ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: store i64 [[OP]], ptr [[OP_ADDR]], align 8
+// LLVM_OGCG_CIR: [[TMP0:%.*]] = load <vscale x 16 x i1>, ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP1:%.*]] = load i64, ptr [[OP_ADDR]], align 8
+// LLVM_OGCG_CIR: [[TMP2:%.*]] = call <vscale x 2 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv2i1(<vscale x 16 x i1> [[TMP0]])
+// LLVM_OGCG_CIR: [[TMP3:%.*]] = call <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64> zeroinitializer, <vscale x 2 x i1> [[TMP2]], i64 [[TMP1]])
+//
+// LLVM_DIRECT: ret {{.*}} [[TMP3]]
+//
+// LLVM_VIA_CIR: store {{.*}} [[TMP3]], ptr [[RES_ADDR]]
+// LLVM_VIA_CIR: [[RES:%.*]] = load {{.*}} [[RES_ADDR]]
+// LLVM_VIA_CIR: ret {{.*}} [[RES]]
+ return SVE_ACLE_FUNC(svdup,_n,_s64_z,)(pg, op);
+}
+
+// ALL-LABEL: @test_svdup_n_u8_z(
+svuint8_t test_svdup_n_u8_z(svbool_t pg, uint8_t op) MODE_ATTR
+{
+// CIR-SAME: %[[PG:.*]]: !cir.vector<[16] x !cir.int<u, 1>>
+// CIR-SAME: %[[OP:.*]]: !u8i
+// CIR-SAME: -> !cir.vector<[16] x !u8i>
+// CIR: %[[ALLOCA_PG:.*]] = cir.alloca !cir.vector<[16] x !cir.int<u, 1>>
+// CIR: %[[ALLOCA_OP:.*]] = cir.alloca !u8i
+// CIR: %[[ALLOCA_RES:.*]] = cir.alloca !cir.vector<[16] x !u8i>
+// CIR: cir.store %[[PG]], %[[ALLOCA_PG]]
+// CIR: cir.store %[[OP]], %[[ALLOCA_OP]]
+// CIR: %[[LOAD_PG:.*]] = cir.load align(2) %[[ALLOCA_PG]]
+// CIR: %[[LOAD_OP:.*]] = cir.load align(1) %[[ALLOCA_OP]]
+// CIR: %[[CONST_0:.*]] = cir.const #cir.zero : !cir.vector<[16] x !u8i>
+// CIR: %[[CONVERT_PG:.*]] = cir.call_llvm_intrinsic "aarch64.sve.dup" %[[CONST_0]], %[[LOAD_PG]], %[[LOAD_OP]]
+// CIR-SAME: -> !cir.vector<[16] x !u8i>
+// CIR: cir.store %[[CONVERT_PG]], %[[ALLOCA_RES]]
+// CIR: %[[RES:.*]] = cir.load %[[ALLOCA_RES]]
+// CIR: cir.return %[[RES]]
+
+// LLVM_OGCG_CIR-SAME: <vscale x 16 x i1> [[PG:%.*]], i8 {{(noundef)?[[:space:]]?}}[[OP:%.*]])
+// LLVM_OGCG_CIR: [[PG_ADDR:%.*]] = alloca <vscale x 16 x i1>,{{([[:space:]]?i64 1,)?}} align 2
+// LLVM_OGCG_CIR: [[OP_ADDR:%.*]] = alloca i8,{{([[:space:]]?i64 1,)?}} align 1
+//
+// LLVM_VIA_CIR: [[RES_ADDR:%.*]] = alloca <vscale x 16 x i8>,{{([[:space:]]?i64 1,)?}} align 16
+//
+// LLVM_OGCG_CIR: store <vscale x 16 x i1> [[PG]], ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: store i8 [[OP]], ptr [[OP_ADDR]], align 1
+// LLVM_OGCG_CIR: [[TMP0:%.*]] = load <vscale x 16 x i1>, ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP1:%.*]] = load i8, ptr [[OP_ADDR]], align 1
+// LLVM_OGCG_CIR: [[TMP2:%.*]] = call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> zeroinitializer, <vscale x 16 x i1> [[TMP0]], i8 [[TMP1]])
+//
+// LLVM_DIRECT: ret {{.*}} [[TMP2]]
+//
+// LLVM_VIA_CIR: store {{.*}} [[TMP2]], ptr [[RES_ADDR]]
+// LLVM_VIA_CIR: [[RES:%.*]] = load {{.*}} [[RES_ADDR]]
+// LLVM_VIA_CIR: ret {{.*}} [[RES]]
+ return SVE_ACLE_FUNC(svdup,_n,_u8_z,)(pg, op);
+}
+
+// ALL-LABEL: @test_svdup_n_u16_z(
+svuint16_t test_svdup_n_u16_z(svbool_t pg, uint16_t op) MODE_ATTR
+{
+// CIR-SAME: %[[PG:.*]]: !cir.vector<[16] x !cir.int<u, 1>>
+// CIR-SAME: %[[OP:.*]]: !u16i
+// CIR-SAME: -> !cir.vector<[8] x !u16i>
+// CIR: %[[ALLOCA_PG:.*]] = cir.alloca !cir.vector<[16] x !cir.int<u, 1>>
+// CIR: %[[ALLOCA_OP:.*]] = cir.alloca !u16i
+// CIR: %[[ALLOCA_RES:.*]] = cir.alloca !cir.vector<[8] x !u16i>
+// CIR: cir.store %[[PG]], %[[ALLOCA_PG]]
+// CIR: cir.store %[[OP]], %[[ALLOCA_OP]]
+// CIR: %[[LOAD_PG:.*]] = cir.load align(2) %[[ALLOCA_PG]]
+// CIR: %[[LOAD_OP:.*]] = cir.load align(2) %[[ALLOCA_OP]]
+// CIR: %[[CONST_0:.*]] = cir.const #cir.zero : !cir.vector<[8] x !u16i>
+// CIR: %[[CONVERT_PG:.*]] = cir.call_llvm_intrinsic "aarch64.sve.convert.from.svbool" %[[LOAD_PG]]
+// CIR-SAME: -> !cir.vector<[8] x !cir.int<u, 1>>
+// CIR: %[[CALL_DUP:.*]] = cir.call_llvm_intrinsic "aarch64.sve.dup" %[[CONST_0]], %[[CONVERT_PG]], %[[LOAD_OP]]
+// CIR-SAME: -> !cir.vector<[8] x !u16i>
+// CIR: cir.store %[[CALL_DUP]], %[[ALLOCA_RES]]
+// CIR: %[[RES:.*]] = cir.load %[[ALLOCA_RES]]
+// CIR: cir.return %[[RES]]
+
+// LLVM_OGCG_CIR-SAME: <vscale x 16 x i1> [[PG:%.*]], i16 {{(noundef)?[[:space:]]?}}[[OP:%.*]])
+// LLVM_OGCG_CIR: [[PG_ADDR:%.*]] = alloca <vscale x 16 x i1>,{{([[:space:]]?i64 1,)?}} align 2
+// LLVM_OGCG_CIR: [[OP_ADDR:%.*]] = alloca i16,{{([[:space:]]?i64 1,)?}} align 2
+//
+// LLVM_VIA_CIR: [[RES_ADDR:%.*]] = alloca <vscale x 8 x i16>,{{([[:space:]]?i64 1,)?}} align 16
+//
+// LLVM_OGCG_CIR: store <vscale x 16 x i1> [[PG]], ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: store i16 [[OP]], ptr [[OP_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP0:%.*]] = load <vscale x 16 x i1>, ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP1:%.*]] = load i16, ptr [[OP_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP2:%.*]] = call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> [[TMP0]])
+// LLVM_OGCG_CIR: [[TMP3:%.*]] = call <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16> zeroinitializer, <vscale x 8 x i1> [[TMP2]], i16 [[TMP1]])
+//
+// LLVM_DIRECT: ret {{.*}} [[TMP3]]
+//
+// LLVM_VIA_CIR: store {{.*}} [[TMP3]], ptr [[RES_ADDR]]
+// LLVM_VIA_CIR: [[RES:%.*]] = load {{.*}} [[RES_ADDR]]
+// LLVM_VIA_CIR: ret {{.*}} [[RES]]
+ return SVE_ACLE_FUNC(svdup,_n,_u16_z,)(pg, op);
+}
+
+// ALL-LABEL: @test_svdup_n_u32_z(
+svuint32_t test_svdup_n_u32_z(svbool_t pg, uint32_t op) MODE_ATTR
+{
+// CIR-SAME: %[[PG:.*]]: !cir.vector<[16] x !cir.int<u, 1>>
+// CIR-SAME: %[[OP:.*]]: !u32i
+// CIR-SAME: -> !cir.vector<[4] x !u32i>
+// CIR: %[[ALLOCA_PG:.*]] = cir.alloca !cir.vector<[16] x !cir.int<u, 1>>
+// CIR: %[[ALLOCA_OP:.*]] = cir.alloca !u32i
+// CIR: %[[ALLOCA_RES:.*]] = cir.alloca !cir.vector<[4] x !u32i>
+// CIR: cir.store %[[PG]], %[[ALLOCA_PG]]
+// CIR: cir.store %[[OP]], %[[ALLOCA_OP]]
+// CIR: %[[LOAD_PG:.*]] = cir.load align(2) %[[ALLOCA_PG]]
+// CIR: %[[LOAD_OP:.*]] = cir.load align(4) %[[ALLOCA_OP]]
+// CIR: %[[CONST_0:.*]] = cir.const #cir.zero : !cir.vector<[4] x !u32i>
+// CIR: %[[CONVERT_PG:.*]] = cir.call_llvm_intrinsic "aarch64.sve.convert.from.svbool" %[[LOAD_PG]]
+// CIR-SAME: -> !cir.vector<[4] x !cir.int<u, 1>>
+// CIR: %[[CALL_DUP:.*]] = cir.call_llvm_intrinsic "aarch64.sve.dup" %[[CONST_0]], %[[CONVERT_PG]], %[[LOAD_OP]]
+// CIR-SAME: -> !cir.vector<[4] x !u32i>
+// CIR: cir.store %[[CALL_DUP]], %[[ALLOCA_RES]]
+// CIR: %[[RES:.*]] = cir.load %[[ALLOCA_RES]]
+// CIR: cir.return %[[RES]]
+
+// LLVM_OGCG_CIR-SAME: <vscale x 16 x i1> [[PG:%.*]], i32 {{(noundef)?[[:space:]]?}}[[OP:%.*]])
+// LLVM_OGCG_CIR: [[PG_ADDR:%.*]] = alloca <vscale x 16 x i1>,{{([[:space:]]?i64 1,)?}} align 2
+// LLVM_OGCG_CIR: [[OP_ADDR:%.*]] = alloca i32,{{([[:space:]]?i64 1,)?}} align 4
+//
+// LLVM_VIA_CIR: [[RES_ADDR:%.*]] = alloca <vscale x 4 x i32>,{{([[:space:]]?i64 1,)?}} align 16
+//
+// LLVM_OGCG_CIR: store <vscale x 16 x i1> [[PG]], ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: store i32 [[OP]], ptr [[OP_ADDR]], align 4
+// LLVM_OGCG_CIR: [[TMP0:%.*]] = load <vscale x 16 x i1>, ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP1:%.*]] = load i32, ptr [[OP_ADDR]], align 4
+// LLVM_OGCG_CIR: [[TMP2:%.*]] = call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> [[TMP0]])
+// LLVM_OGCG_CIR: [[TMP3:%.*]] = call <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32> zeroinitializer, <vscale x 4 x i1> [[TMP2]], i32 [[TMP1]])
+//
+// LLVM_DIRECT: ret {{.*}} [[TMP3]]
+//
+// LLVM_VIA_CIR: store {{.*}} [[TMP3]], ptr [[RES_ADDR]]
+// LLVM_VIA_CIR: [[RES:%.*]] = load {{.*}} [[RES_ADDR]]
+// LLVM_VIA_CIR: ret {{.*}} [[RES]]
+ return SVE_ACLE_FUNC(svdup,_n,_u32_z,)(pg, op);
+}
+
+// ALL-LABEL: @test_svdup_n_u64_z(
+svuint64_t test_svdup_n_u64_z(svbool_t pg, uint64_t op) MODE_ATTR
+{
+// CIR-SAME: %[[PG:.*]]: !cir.vector<[16] x !cir.int<u, 1>>
+// CIR-SAME: %[[OP:.*]]: !u64i
+// CIR-SAME: -> !cir.vector<[2] x !u64i>
+// CIR: %[[ALLOCA_PG:.*]] = cir.alloca !cir.vector<[16] x !cir.int<u, 1>>
+// CIR: %[[ALLOCA_OP:.*]] = cir.alloca !u64i
+// CIR: %[[ALLOCA_RES:.*]] = cir.alloca !cir.vector<[2] x !u64i>
+// CIR: cir.store %[[PG]], %[[ALLOCA_PG]]
+// CIR: cir.store %[[OP]], %[[ALLOCA_OP]]
+// CIR: %[[LOAD_PG:.*]] = cir.load align(2) %[[ALLOCA_PG]]
+// CIR: %[[LOAD_OP:.*]] = cir.load align(8) %[[ALLOCA_OP]]
+// CIR: %[[CONST_0:.*]] = cir.const #cir.zero : !cir.vector<[2] x !u64i>
+// CIR: %[[CONVERT_PG:.*]] = cir.call_llvm_intrinsic "aarch64.sve.convert.from.svbool" %[[LOAD_PG]]
+// CIR-SAME: -> !cir.vector<[2] x !cir.int<u, 1>>
+// CIR: %[[CALL_DUP:.*]] = cir.call_llvm_intrinsic "aarch64.sve.dup" %[[CONST_0]], %[[CONVERT_PG]], %[[LOAD_OP]]
+// CIR-SAME: -> !cir.vector<[2] x !u64i>
+// CIR: cir.store %[[CALL_DUP]], %[[ALLOCA_RES]]
+// CIR: %[[RES:.*]] = cir.load %[[ALLOCA_RES]]
+// CIR: cir.return %[[RES]]
+
+// LLVM_OGCG_CIR-SAME: <vscale x 16 x i1> [[PG:%.*]], i64 {{(noundef)?[[:space:]]?}}[[OP:%.*]])
+// LLVM_OGCG_CIR: [[PG_ADDR:%.*]] = alloca <vscale x 16 x i1>,{{([[:space:]]?i64 1,)?}} align 2
+// LLVM_OGCG_CIR: [[OP_ADDR:%.*]] = alloca i64,{{([[:space:]]?i64 1,)?}} align 8
+//
+// LLVM_VIA_CIR: [[RES_ADDR:%.*]] = alloca <vscale x 2 x i64>,{{([[:space:]]?i64 1,)?}} align 16
+//
+// LLVM_OGCG_CIR: store <vscale x 16 x i1> [[PG]], ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: store i64 [[OP]], ptr [[OP_ADDR]], align 8
+// LLVM_OGCG_CIR: [[TMP0:%.*]] = load <vscale x 16 x i1>, ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP1:%.*]] = load i64, ptr [[OP_ADDR]], align 8
+// LLVM_OGCG_CIR: [[TMP2:%.*]] = call <vscale x 2 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv2i1(<vscale x 16 x i1> [[TMP0]])
+// LLVM_OGCG_CIR: [[TMP3:%.*]] = call <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64> zeroinitializer, <vscale x 2 x i1> [[TMP2]], i64 [[TMP1]])
+//
+// LLVM_DIRECT: ret {{.*}} [[TMP3]]
+//
+// LLVM_VIA_CIR: store {{.*}} [[TMP3]], ptr [[RES_ADDR]]
+// LLVM_VIA_CIR: [[RES:%.*]] = load {{.*}} [[RES_ADDR]]
+// LLVM_VIA_CIR: ret {{.*}} [[RES]]
+ return SVE_ACLE_FUNC(svdup,_n,_u64_z,)(pg, op);
+}
+
+// ALL-LABEL: @test_svdup_n_f16_z(
+svfloat16_t test_svdup_n_f16_z(svbool_t pg, float16_t op) MODE_ATTR
+{
+// CIR-SAME: %[[PG:.*]]: !cir.vector<[16] x !cir.int<u, 1>>
+// CIR-SAME: %[[OP:.*]]: !cir.f16
+// CIR-SAME: -> !cir.vector<[8] x !cir.f16>
+// CIR: %[[ALLOCA_PG:.*]] = cir.alloca !cir.vector<[16] x !cir.int<u, 1>>
+// CIR: %[[ALLOCA_OP:.*]] = cir.alloca !cir.f16
+// CIR: %[[ALLOCA_RES:.*]] = cir.alloca !cir.vector<[8] x !cir.f16>
+// CIR: cir.store %[[PG]], %[[ALLOCA_PG]]
+// CIR: cir.store %[[OP]], %[[ALLOCA_OP]]
+// CIR: %[[LOAD_PG:.*]] = cir.load align(2) %[[ALLOCA_PG]]
+// CIR: %[[LOAD_OP:.*]] = cir.load align(2) %[[ALLOCA_OP]]
+// CIR: %[[CONST_0:.*]] = cir.const #cir.zero : !cir.vector<[8] x !cir.f16>
+// CIR: %[[CONVERT_PG:.*]] = cir.call_llvm_intrinsic "aarch64.sve.convert.from.svbool" %[[LOAD_PG]]
+// CIR-SAME: -> !cir.vector<[8] x !cir.int<u, 1>>
+// CIR: %[[CALL_DUP:.*]] = cir.call_llvm_intrinsic "aarch64.sve.dup" %[[CONST_0]], %[[CONVERT_PG]], %[[LOAD_OP]]
+// CIR-SAME: -> !cir.vector<[8] x !cir.f16>
+// CIR: cir.store %[[CALL_DUP]], %[[ALLOCA_RES]]
+// CIR: %[[RES:.*]] = cir.load %[[ALLOCA_RES]]
+// CIR: cir.return %[[RES]]
+
+// LLVM_OGCG_CIR-SAME: <vscale x 16 x i1> [[PG:%.*]], half {{(noundef)?[[:space:]]?}}[[OP:%.*]])
+// LLVM_OGCG_CIR: [[PG_ADDR:%.*]] = alloca <vscale x 16 x i1>,{{([[:space:]]?i64 1,)?}} align 2
+// LLVM_OGCG_CIR: [[OP_ADDR:%.*]] = alloca half,{{([[:space:]]?i64 1,)?}} align 2
+//
+// LLVM_VIA_CIR: [[RES_ADDR:%.*]] = alloca <vscale x 8 x half>,{{([[:space:]]?i64 1,)?}} align 16
+//
+// LLVM_OGCG_CIR: store <vscale x 16 x i1> [[PG]], ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: store half [[OP]], ptr [[OP_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP0:%.*]] = load <vscale x 16 x i1>, ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP1:%.*]] = load half, ptr [[OP_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP2:%.*]] = call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> [[TMP0]])
+// LLVM_OGCG_CIR: [[TMP3:%.*]] = call <vscale x 8 x half> @llvm.aarch64.sve.dup.nxv8f16(<vscale x 8 x half> zeroinitializer, <vscale x 8 x i1> [[TMP2]], half [[TMP1]])
+//
+// LLVM_DIRECT: ret {{.*}} [[TMP3]]
+//
+// LLVM_VIA_CIR: store {{.*}} [[TMP3]], ptr [[RES_ADDR]]
+// LLVM_VIA_CIR: [[RES:%.*]] = load {{.*}} [[RES_ADDR]]
+// LLVM_VIA_CIR: ret {{.*}} [[RES]]
+ return SVE_ACLE_FUNC(svdup,_n,_f16_z,)(pg, op);
+}
+
+// ALL-LABEL: @test_svdup_n_f32_z(
+svfloat32_t test_svdup_n_f32_z(svbool_t pg, float32_t op) MODE_ATTR
+{
+// CIR-SAME: %[[PG:.*]]: !cir.vector<[16] x !cir.int<u, 1>>
+// CIR-SAME: %[[OP:.*]]: !cir.float
+// CIR-SAME: -> !cir.vector<[4] x !cir.float>
+// CIR: %[[ALLOCA_PG:.*]] = cir.alloca !cir.vector<[16] x !cir.int<u, 1>>
+// CIR: %[[ALLOCA_OP:.*]] = cir.alloca !cir.float
+// CIR: %[[ALLOCA_RES:.*]] = cir.alloca !cir.vector<[4] x !cir.float>
+// CIR: cir.store %[[PG]], %[[ALLOCA_PG]]
+// CIR: cir.store %[[OP]], %[[ALLOCA_OP]]
+// CIR: %[[LOAD_PG:.*]] = cir.load align(2) %[[ALLOCA_PG]]
+// CIR: %[[LOAD_OP:.*]] = cir.load align(4) %[[ALLOCA_OP]]
+// CIR: %[[CONST_0:.*]] = cir.const #cir.zero : !cir.vector<[4] x !cir.float>
+// CIR: %[[CONVERT_PG:.*]] = cir.call_llvm_intrinsic "aarch64.sve.convert.from.svbool" %[[LOAD_PG]]
+// CIR-SAME: -> !cir.vector<[4] x !cir.int<u, 1>>
+// CIR: %[[CALL_DUP:.*]] = cir.call_llvm_intrinsic "aarch64.sve.dup" %[[CONST_0]], %[[CONVERT_PG]], %[[LOAD_OP]]
+// CIR-SAME: -> !cir.vector<[4] x !cir.float>
+// CIR: cir.store %[[CALL_DUP]], %[[ALLOCA_RES]]
+// CIR: %[[RES:.*]] = cir.load %[[ALLOCA_RES]]
+// CIR: cir.return %[[RES]]
+
+// LLVM_OGCG_CIR-SAME: <vscale x 16 x i1> [[PG:%.*]], float {{(noundef)?[[:space:]]?}}[[OP:%.*]])
+// LLVM_OGCG_CIR: [[PG_ADDR:%.*]] = alloca <vscale x 16 x i1>,{{([[:space:]]?i64 1,)?}} align 2
+// LLVM_OGCG_CIR: [[OP_ADDR:%.*]] = alloca float,{{([[:space:]]?i64 1,)?}} align 4
+//
+// LLVM_VIA_CIR: [[RES_ADDR:%.*]] = alloca <vscale x 4 x float>,{{([[:space:]]?i64 1,)?}} align 16
+//
+// LLVM_OGCG_CIR: store <vscale x 16 x i1> [[PG]], ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: store float [[OP]], ptr [[OP_ADDR]], align 4
+// LLVM_OGCG_CIR: [[TMP0:%.*]] = load <vscale x 16 x i1>, ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP1:%.*]] = load float, ptr [[OP_ADDR]], align 4
+// LLVM_OGCG_CIR: [[TMP2:%.*]] = call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> [[TMP0]])
+// LLVM_OGCG_CIR: [[TMP3:%.*]] = call <vscale x 4 x float> @llvm.aarch64.sve.dup.nxv4f32(<vscale x 4 x float> zeroinitializer, <vscale x 4 x i1> [[TMP2]], float [[TMP1]])
+//
+// LLVM_DIRECT: ret {{.*}} [[TMP3]]
+//
+// LLVM_VIA_CIR: store {{.*}} [[TMP3]], ptr [[RES_ADDR]]
+// LLVM_VIA_CIR: [[RES:%.*]] = load {{.*}} [[RES_ADDR]]
+// LLVM_VIA_CIR: ret {{.*}} [[RES]]
+ return SVE_ACLE_FUNC(svdup,_n,_f32_z,)(pg, op);
+}
+
+// ALL-LABEL: @test_svdup_n_f64_z(
+svfloat64_t test_svdup_n_f64_z(svbool_t pg, float64_t op) MODE_ATTR
+{
+// CIR-SAME: %[[PG:.*]]: !cir.vector<[16] x !cir.int<u, 1>>
+// CIR-SAME: %[[OP:.*]]: !cir.double
+// CIR-SAME: -> !cir.vector<[2] x !cir.double>
+// CIR: %[[ALLOCA_PG:.*]] = cir.alloca !cir.vector<[16] x !cir.int<u, 1>>
+// CIR: %[[ALLOCA_OP:.*]] = cir.alloca !cir.double
+// CIR: %[[ALLOCA_RES:.*]] = cir.alloca !cir.vector<[2] x !cir.double>
+// CIR: cir.store %[[PG]], %[[ALLOCA_PG]]
+// CIR: cir.store %[[OP]], %[[ALLOCA_OP]]
+// CIR: %[[LOAD_PG:.*]] = cir.load align(2) %[[ALLOCA_PG]]
+// CIR: %[[LOAD_OP:.*]] = cir.load align(8) %[[ALLOCA_OP]]
+// CIR: %[[CONST_0:.*]] = cir.const #cir.zero : !cir.vector<[2] x !cir.double>
+// CIR: %[[CONVERT_PG:.*]] = cir.call_llvm_intrinsic "aarch64.sve.convert.from.svbool" %[[LOAD_PG]]
+// CIR-SAME: -> !cir.vector<[2] x !cir.int<u, 1>>
+// CIR: %[[CALL_DUP:.*]] = cir.call_llvm_intrinsic "aarch64.sve.dup" %[[CONST_0]], %[[CONVERT_PG]], %[[LOAD_OP]]
+// CIR-SAME: -> !cir.vector<[2] x !cir.double>
+// CIR: cir.store %[[CALL_DUP]], %[[ALLOCA_RES]]
+// CIR: %[[RES:.*]] = cir.load %[[ALLOCA_RES]]
+// CIR: cir.return %[[RES]]
+
+// LLVM_OGCG_CIR-SAME: <vscale x 16 x i1> [[PG:%.*]], double {{(noundef)?[[:space:]]?}}[[OP:%.*]])
+// LLVM_OGCG_CIR: [[PG_ADDR:%.*]] = alloca <vscale x 16 x i1>,{{([[:space:]]?i64 1,)?}} align 2
+// LLVM_OGCG_CIR: [[OP_ADDR:%.*]] = alloca double,{{([[:space:]]?i64 1,)?}} align 8
+//
+// LLVM_VIA_CIR: [[RES_ADDR:%.*]] = alloca <vscale x 2 x double>,{{([[:space:]]?i64 1,)?}} align 16
+//
+// LLVM_OGCG_CIR: store <vscale x 16 x i1> [[PG]], ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: store double [[OP]], ptr [[OP_ADDR]], align 8
+// LLVM_OGCG_CIR: [[TMP0:%.*]] = load <vscale x 16 x i1>, ptr [[PG_ADDR]], align 2
+// LLVM_OGCG_CIR: [[TMP1:%.*]] = load double, ptr [[OP_ADDR]], align 8
+// LLVM_OGCG_CIR: [[TMP2:%.*]] = call <vscale x 2 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv2i1(<vscale x 16 x i1> [[TMP0]])
+// LLVM_OGCG_CIR: [[TMP3:%.*]] = call <vscale x 2 x double> @llvm.aarch64.sve.dup.nxv2f64(<vscale x 2 x double> zeroinitializer, <vscale x 2 x i1> [[TMP2]], double [[TMP1]])
+//
+// LLVM_DIRECT: ret {{.*}} [[TMP3]]
+//
+// LLVM_VIA_CIR: store {{.*}} [[TMP3]], ptr [[RES_ADDR]]
+// LLVM_VIA_CIR: [[RES:%.*]] = load {{.*}} [[RES_ADDR]]
+// LLVM_VIA_CIR: ret {{.*}} [[RES]]
+ return SVE_ACLE_FUNC(svdup,_n,_f64_z,)(pg, op);
+}
>From f52d74caa60a0ad0ce4a4a9185466bda4453b6d6 Mon Sep 17 00:00:00 2001
From: Andrzej Warzynski <andrzej.warzynski at arm.com>
Date: Thu, 5 Feb 2026 15:04:34 +0000
Subject: [PATCH 3/3] Address comments from Andy
---
.../lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp | 30 ++++++++++---------
clang/lib/CIR/CodeGen/CIRGenFunction.h | 2 +-
2 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/clang/lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp b/clang/lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp
index d59d3bebe0bb0..14964d8bb19cd 100644
--- a/clang/lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp
+++ b/clang/lib/CIR/CodeGen/CIRGenBuiltinAArch64.cpp
@@ -126,9 +126,15 @@ bool CIRGenFunction::getAArch64SVEProcessedOperands(
return true;
}
+static llvm::StringRef getLLVMIntrNameNoPrefix(llvm::Intrinsic::ID intrID) {
+ llvm::StringRef llvmIntrName = llvm::Intrinsic::getBaseName(intrID);
+ assert(llvmIntrName.starts_with("llvm.") && "Not an LLVM intrinsic!");
+ return llvmIntrName.drop_front(/*strlen("llvm.")=*/5);
+}
+
// Reinterpret the input predicate so that it can be used to correctly isolate
// the elements of the specified datatype.
-mlir::Value CIRGenFunction::emitSVEpredicateCast(mlir::Value *pred,
+mlir::Value CIRGenFunction::emitSVEPredicateCast(mlir::Value pred,
unsigned minNumElts,
mlir::Location loc) {
@@ -136,8 +142,8 @@ mlir::Value CIRGenFunction::emitSVEpredicateCast(mlir::Value *pred,
auto retTy = cir::VectorType::get(builder.getUIntNTy(1), minNumElts,
/*is_scalable=*/true);
- if (pred->getType() == retTy)
- return *pred;
+ if (pred.getType() == retTy)
+ return pred;
unsigned intID;
mlir::Type intrinsicTy;
@@ -153,14 +159,13 @@ mlir::Value CIRGenFunction::emitSVEpredicateCast(mlir::Value *pred,
break;
case 16:
intID = Intrinsic::aarch64_sve_convert_to_svbool;
- intrinsicTy = pred->getType();
+ intrinsicTy = pred.getType();
break;
}
- std::string llvmIntrName(Intrinsic::getBaseName(intID));
- llvmIntrName.erase(0, /*std::strlen(".llvm")=*/5);
+ llvm::StringRef llvmIntrName = getLLVMIntrNameNoPrefix(intID);
auto call = emitIntrinsicCallOp(builder, loc, llvmIntrName, retTy,
- mlir::ValueRange{*pred});
+ mlir::ValueRange{pred});
assert(call.getType() == retTy && "Unexpected return type!");
return call;
}
@@ -274,8 +279,8 @@ CIRGenFunction::emitAArch64SVEBuiltinExpr(unsigned builtinID,
if (auto predTy = dyn_cast<cir::VectorType>(op.getType()))
if (auto cirInt = dyn_cast<cir::IntType>(predTy.getElementType()))
if (cirInt.getWidth() == 1)
- op = emitSVEpredicateCast(
- &op, getSVEMinEltCount(typeFlags.getEltType()), loc);
+ op = emitSVEPredicateCast(
+ op, getSVEMinEltCount(typeFlags.getEltType()), loc);
// Splat scalar operand to vector (intrinsics with _n infix)
if (typeFlags.hasSplatOperand()) {
@@ -310,11 +315,8 @@ CIRGenFunction::emitAArch64SVEBuiltinExpr(unsigned builtinID,
getContext().BuiltinInfo.getName(builtinID));
}
- std::string llvmIntrName(Intrinsic::getBaseName(
- (llvm::Intrinsic::ID)builtinIntrInfo->llvmIntrinsic));
-
- llvmIntrName.erase(0, /*std::strlen(".llvm")=*/5);
-
+ llvm::StringRef llvmIntrName = getLLVMIntrNameNoPrefix(
+ (llvm::Intrinsic::ID)builtinIntrInfo->llvmIntrinsic);
auto retTy = convertType(expr->getType());
auto call = emitIntrinsicCallOp(builder, loc, llvmIntrName, retTy,
diff --git a/clang/lib/CIR/CodeGen/CIRGenFunction.h b/clang/lib/CIR/CodeGen/CIRGenFunction.h
index 86d2a8c4ac089..859751d3deeeb 100644
--- a/clang/lib/CIR/CodeGen/CIRGenFunction.h
+++ b/clang/lib/CIR/CodeGen/CIRGenFunction.h
@@ -1269,7 +1269,7 @@ class CIRGenFunction : public CIRGenTypeCache {
bool getAArch64SVEProcessedOperands(unsigned builtinID, const CallExpr *expr,
SmallVectorImpl<mlir::Value> &ops,
clang::SVETypeFlags typeFlags);
- mlir::Value emitSVEpredicateCast(mlir::Value *pred, unsigned minNumElts,
+ mlir::Value emitSVEPredicateCast(mlir::Value pred, unsigned minNumElts,
mlir::Location loc);
std::optional<mlir::Value>
emitAArch64BuiltinExpr(unsigned builtinID, const CallExpr *expr,