[Lldb-commits] [lldb] [lldb][debugserver] Read/write SME registers on arm64 (PR #119171)
via lldb-commits
lldb-commits at lists.llvm.org
Sun Dec 8 22:31:21 PST 2024
llvmbot wrote:
@llvm/pr-subscribers-lldb
Author: Jason Molenda (jasonmolenda)
<details>
<summary>Changes</summary>
The Apple M4 line of cores includes the Scalable Matrix Extension (SME). The M4s do not implement the Scalable Vector Extension (SVE), although the processor is in Streaming SVE (SSVE) Mode while SME is being used. The most obvious side effects of being in SSVE Mode on the M4 cores are that NEON instructions cannot be used, and watchpoints may report false positives because the address comparisons are done at a lower granularity.
When SSVE mode is enabled, the kernel provides the Streaming Vector Length (SVL) register, which is at most 64 bytes on the M4. Also provided are SVCR (with bits indicating whether SSVE mode and SME mode are enabled) and TPIDR2. Then come the SVE registers Z0..Z31 (each SVL bytes long), P0..P15 (each SVL/8 bytes), the ZA matrix register (SVL*SVL bytes), and, because the M4 supports SME2, the ZT0 register (64 bytes).
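For reference, here is a minimal sketch of how those sizes relate to SVL; the function name is illustrative only and not part of the patch:

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative only: register storage sizes implied by a given Streaming
// Vector Length (SVL), expressed in bytes.
static void print_sme_register_sizes(uint32_t svl_bytes) {
  uint32_t z_bytes = svl_bytes;              // each of Z0..Z31
  uint32_t p_bytes = svl_bytes / 8;          // each of P0..P15, one bit per Z byte
  uint32_t za_bytes = svl_bytes * svl_bytes; // ZA is an SVL x SVL byte matrix
  uint32_t zt0_bytes = 64;                   // ZT0 is fixed at 512 bits (SME2)
  printf("svl=%u: z=%u p=%u za=%u zt0=%u\n", svl_bytes, z_bytes, p_bytes,
         za_bytes, zt0_bytes);
}

int main() {
  print_sme_register_sizes(64);  // M4 maximum SVL
  print_sme_register_sizes(256); // AArch64 architectural maximum SVL
}
```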
When SSVE/SME are disabled, none of these registers are provided by the kernel - reads and writes of them will fail.
Unlike Linux, lldb cannot modify the SVL through a thread_set_state call, or change the processor's SSVE/SME state. There is also no way today for a process to request a smaller SVL, so the work David did to handle VL/SVL changing while stepping through a process is not an issue on Darwin today. But debugserver should provide everything necessary so we can reuse all of David's work on resizing the register contexts in lldb if that changes in the future. debugserver sends svl, svcr, and tpidr2 in the expedited registers when a thread stops, if SSVE or SME mode is enabled (i.e. if the kernel allows it to read the ARM_SME_STATE register set).
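A hedged sketch of that conditional expediting follows; the struct and helper names are placeholders for illustration, not the actual RNBRemote.cpp changes:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical sketch only -- the names below are illustrative placeholders.
struct SMEState {
  bool read_ok;   // whether the ARM_SME_STATE thread_get_state succeeded
  uint64_t svcr;  // PSTATE.SM is bit 0, PSTATE.ZA is bit 1
  uint64_t tpidr2;
  uint64_t svl;   // streaming vector length in bytes
};

// Placeholder for "append a register:value pair to the stop-reply packet".
static void AppendHexRegister(std::string &reply, const char *name, uint64_t v) {
  char buf[64];
  snprintf(buf, sizeof(buf), "%s:%016llx;", name, (unsigned long long)v);
  reply += buf;
}

static void AppendSMEExpeditedRegisters(const SMEState &sme, std::string &reply) {
  if (!sme.read_ok)
    return; // kernel would not provide ARM_SME_STATE; expedite nothing
  if ((sme.svcr & 0x3) == 0)
    return; // neither SSVE nor SME mode is active
  AppendHexRegister(reply, "svcr", sme.svcr);
  AppendHexRegister(reply, "tpidr2", sme.tpidr2);
  AppendHexRegister(reply, "svl", sme.svl);
}

int main() {
  SMEState sme{true, /*svcr=*/0x3, /*tpidr2=*/0, /*svl=*/64};
  std::string reply = "T05";
  AppendSMEExpeditedRegisters(sme, reply);
  printf("%s\n", reply.c_str());
}
```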
While the maximum SVL is 64 bytes on the M4, the AArch64 architectural maximum SVL is 256 bytes, which would give us a 64 KiB ZA register. If debugserver sized all of its register contexts assuming the largest possible SVL, we could easily use 2 MB more memory for the register contexts of all threads in a process -- and on iOS et al, processes must run within a small memory allotment, and this would push us over it.
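For a rough sense of the numbers (back-of-the-envelope only; the thread count is an assumption for illustration):

```cpp
#include <cstdio>

int main() {
  // Architectural maximum SVL is 256 bytes, so ZA alone would be
  // 256 * 256 = 65536 bytes (64 KiB) per thread if sized pessimistically.
  const unsigned max_svl = 256;
  const unsigned za_bytes = max_svl * max_svl;    // 65536
  const unsigned z_bytes = 32 * max_svl;          // 8192 for Z0..Z31
  const unsigned per_thread = za_bytes + z_bytes; // ~72 KiB
  const unsigned threads = 30;                    // a modestly threaded app
  printf("~%u KiB of register context for %u threads\n",
         per_thread * threads / 1024, threads);   // roughly 2 MiB
}
```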
Much of the work in debugserver was changing the arm64 register context from a static compile-time array of register sets to one initialized at runtime when debugserver is running on a machine with SME. ZA storage is only allocated up to the machine's actual maximum SVL. The size of the 32 SVE Z registers is less significant, so those are statically allocated at the architecturally largest possible SVL.
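A minimal sketch of that sizing idea, under the assumption of a simplified context type (the real code builds the full debugserver register set tables):

```cpp
#include <cstdint>
#include <vector>

// Simplified sketch, not the actual DNBArchImplARM64.cpp code: size the ZA
// storage from the machine's real maximum SVL (queried once at startup)
// instead of assuming the 256-byte architectural maximum.
struct ThreadRegisterContext {
  static constexpr uint32_t kArchMaxSVL = 256; // bytes
  std::vector<uint8_t> za;                     // machine_max_svl^2 bytes
  uint8_t z[32][kArchMaxSVL];                  // Z regs: statically max-sized
  explicit ThreadRegisterContext(uint32_t machine_max_svl)
      : za(machine_max_svl * machine_max_svl) {}
};

int main() {
  ThreadRegisterContext ctx(64); // e.g. M4: ZA storage is 4 KiB, not 64 KiB
  return ctx.za.size() == 4096 ? 0 : 1;
}
```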
Also, debugserver includes information about registers that share the same part of the register file: e.g. S0 and D0 are the lower parts of the 128-bit NEON V0 register, and when running on an SME machine, V0 is in turn the lower 128 bits of the SVE Z0 register. So the register maps used when defining the VFP registers must differ depending on the runtime state of the CPU.
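Conceptually, the containment chain looks like this; a hedged sketch, not the patch's actual register map tables:

```cpp
#include <cstdio>

// Without SME: s0 and d0 are the low parts of v0, which is the root.
// With SME: v0 is itself the low 128 bits of z0, so z0 becomes the root.
struct RegOverlap {
  const char *name;
  const char *contained_in; // nullptr if this register is the root
};

static const RegOverlap kVfpMapNoSME[] = {
    {"v0", nullptr}, {"d0", "v0"}, {"s0", "v0"}};
static const RegOverlap kVfpMapWithSME[] = {
    {"z0", nullptr}, {"v0", "z0"}, {"d0", "v0"}, {"s0", "v0"}};

int main() {
  for (const auto &r : kVfpMapWithSME)
    printf("%s is contained in %s\n", r.name,
           r.contained_in ? r.contained_in : "(nothing; it is the root)");
}
```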
I also changed register reading in debugserver: formerly, when debugserver was asked to read a register and the thread_get_state read of that register failed, it would return all zeros. That behavior is necessary when constructing a `g` packet that gets all registers -- because there is no separation between register bytes, the offsets are fixed. But when we are asked for a single register (e.g. Z0) while not in SSVE/SME mode, this should return an error.
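A hedged sketch of that distinction (function names are illustrative, not the actual debugserver API):

```cpp
#include <cstdint>
#include <cstring>

// Single-register read ('p' packet): a failed thread_get_state should surface
// as an error rather than fake zero bytes.
bool ReadOneRegister(bool thread_get_state_ok, const uint8_t *src,
                     uint8_t *dst, size_t len) {
  if (!thread_get_state_ok)
    return false; // report an error to the debugger
  memcpy(dst, src, len);
  return true;
}

// All-registers read ('g' packet): the bytes must occupy their fixed offsets,
// so unreadable registers are zero-filled to keep the layout intact.
void FillRegisterIntoGPacket(bool thread_get_state_ok, const uint8_t *src,
                             uint8_t *dst, size_t len) {
  if (thread_get_state_ok)
    memcpy(dst, src, len);
  else
    memset(dst, 0, len);
}

int main() {
  uint8_t src[4] = {1, 2, 3, 4}, dst[4];
  bool ok = ReadOneRegister(/*thread_get_state_ok=*/false, src, dst, sizeof(dst));
  FillRegisterIntoGPacket(/*thread_get_state_ok=*/false, src, dst, sizeof(dst));
  return ok ? 1 : 0;
}
```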
This does mean that when you are running on an SME-capable machine, but not in SME mode, and do `register read -a`, lldb will report that 48 SVE registers and 5 SME registers were unavailable. But that only happens when `-a` is used.
Register reading and writing depends on new register flavor support in thread_get_state/thread_set_state in the kernel, which is not yet in a release; the test case I wrote is skipped on current OSes. I pilfered the SME register setup from some of David's existing SME test files; there were a few Linux-specific details in those tests that made them hard to reuse directly on Darwin.
rdar://121608074
---
Patch is 67.81 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/119171.diff
9 Files Affected:
- (modified) lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp (+19)
- (added) lldb/test/API/macosx/sme-registers/Makefile (+5)
- (added) lldb/test/API/macosx/sme-registers/TestSMERegistersDarwin.py (+164)
- (added) lldb/test/API/macosx/sme-registers/main.c (+123)
- (modified) lldb/tools/debugserver/source/DNBDefs.h (+15-10)
- (modified) lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp (+720-186)
- (modified) lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.h (+66-7)
- (added) lldb/tools/debugserver/source/MacOSX/arm64/sme_thread_status.h (+86)
- (modified) lldb/tools/debugserver/source/RNBRemote.cpp (+49-38)
``````````diff
diff --git a/lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp b/lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp
index 181ba4e7d87721..6a072354972acd 100644
--- a/lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp
+++ b/lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp
@@ -100,6 +100,25 @@ bool ArchitectureAArch64::ReconfigureRegisterInfo(DynamicRegisterInfo &reg_info,
if (reg_value != fail_value && reg_value <= 32)
svg_reg_value = reg_value;
}
+ if (!svg_reg_value) {
+ const RegisterInfo *darwin_svg_reg_info = reg_info.GetRegisterInfo("svl");
+ if (darwin_svg_reg_info) {
+ uint32_t svg_reg_num = darwin_svg_reg_info->kinds[eRegisterKindLLDB];
+ uint64_t reg_value =
+ reg_context.ReadRegisterAsUnsigned(svg_reg_num, fail_value);
+ // UpdateARM64SVERegistersInfos and UpdateARM64SMERegistersInfos
+ // expect the number of 8-byte granules; darwin provides number of
+ // bytes.
+ if (reg_value != fail_value && reg_value <= 256) {
+ svg_reg_value = reg_value / 8;
+ // Apple hardware only implements Streaming SVE mode, so
+ // the non-streaming Vector Length is not reported by the
+ // kernel. Set both svg and vg to this svl value.
+ if (!vg_reg_value)
+ vg_reg_value = reg_value / 8;
+ }
+ }
+ }
if (!vg_reg_value && !svg_reg_value)
return false;
diff --git a/lldb/test/API/macosx/sme-registers/Makefile b/lldb/test/API/macosx/sme-registers/Makefile
new file mode 100644
index 00000000000000..d4173d262ed270
--- /dev/null
+++ b/lldb/test/API/macosx/sme-registers/Makefile
@@ -0,0 +1,5 @@
+C_SOURCES := main.c
+
+CFLAGS_EXTRAS := -mcpu=apple-m4
+
+include Makefile.rules
diff --git a/lldb/test/API/macosx/sme-registers/TestSMERegistersDarwin.py b/lldb/test/API/macosx/sme-registers/TestSMERegistersDarwin.py
new file mode 100644
index 00000000000000..82a5eb0dc81a6b
--- /dev/null
+++ b/lldb/test/API/macosx/sme-registers/TestSMERegistersDarwin.py
@@ -0,0 +1,164 @@
+import lldb
+from lldbsuite.test.lldbtest import *
+from lldbsuite.test.decorators import *
+import lldbsuite.test.lldbutil as lldbutil
+import os
+
+
+class TestSMERegistersDarwin(TestBase):
+
+ NO_DEBUG_INFO_TESTCASE = True
+ mydir = TestBase.compute_mydir(__file__)
+
+ @skipIfRemote
+ @skipUnlessDarwin
+ @skipUnlessFeature("hw.optional.arm.FEAT_SME")
+ @skipUnlessFeature("hw.optional.arm.FEAT_SME2")
+ # thread_set_state/thread_get_state only avail in macOS 15.4+
+ @skipIf(macos_version=["<", "15.4"])
+ def test(self):
+ """Test that we can read the contents of the SME/SVE registers on Darwin"""
+ self.build()
+ (target, process, thread, bkpt) = lldbutil.run_to_source_breakpoint(
+ self, "break here", lldb.SBFileSpec("main.c")
+ )
+ frame = thread.GetFrameAtIndex(0)
+ self.assertTrue(frame.IsValid())
+
+ if self.TraceOn():
+ self.runCmd("reg read -a")
+
+ svl_reg = frame.register["svl"]
+ svl = svl_reg.GetValueAsUnsigned()
+
+ # SSVE and SME modes should be enabled (reflecting PSTATE.SM and PSTATE.ZA)
+ svcr = frame.register["svcr"]
+ self.assertEqual(svcr.GetValueAsUnsigned(), 3)
+
+ z0 = frame.register["z0"]
+ self.assertEqual(z0.GetNumChildren(), svl)
+ self.assertEqual(z0.GetChildAtIndex(0).GetValueAsUnsigned(), 0x1)
+ self.assertEqual(z0.GetChildAtIndex(svl - 1).GetValueAsUnsigned(), 0x1)
+
+ z31 = frame.register["z31"]
+ self.assertEqual(z31.GetNumChildren(), svl)
+ self.assertEqual(z31.GetChildAtIndex(0).GetValueAsUnsigned(), 32)
+ self.assertEqual(z31.GetChildAtIndex(svl - 1).GetValueAsUnsigned(), 32)
+
+ p0 = frame.register["p0"]
+ self.assertEqual(p0.GetNumChildren(), svl / 8)
+ self.assertEqual(p0.GetChildAtIndex(0).GetValueAsUnsigned(), 0xFF)
+ self.assertEqual(
+ p0.GetChildAtIndex(p0.GetNumChildren() - 1).GetValueAsUnsigned(), 0xFF
+ )
+
+ p15 = frame.register["p15"]
+ self.assertEqual(p15.GetNumChildren(), svl / 8)
+ self.assertEqual(p15.GetChildAtIndex(0).GetValueAsUnsigned(), 0xFF)
+ self.assertEqual(
+ p15.GetChildAtIndex(p15.GetNumChildren() - 1).GetValueAsUnsigned(), 0xFF
+ )
+
+ za = frame.register["za"]
+ self.assertEqual(za.GetNumChildren(), (svl * svl))
+ za_0 = za.GetChildAtIndex(0)
+ self.assertEqual(za_0.GetValueAsUnsigned(), 4)
+ za_final = za.GetChildAtIndex(za.GetNumChildren() - 1)
+ self.assertEqual(za_final.GetValueAsUnsigned(), 67)
+
+ zt0 = frame.register["zt0"]
+ self.assertEqual(zt0.GetNumChildren(), 64)
+ zt0_0 = zt0.GetChildAtIndex(0)
+ self.assertEqual(zt0_0.GetValueAsUnsigned(), 0)
+ zt0_final = zt0.GetChildAtIndex(63)
+ self.assertEqual(zt0_final.GetValueAsUnsigned(), 63)
+
+ z0_old_values = []
+ z0_new_str = '"{'
+ for i in range(svl):
+ z0_old_values.append(z0.GetChildAtIndex(i).GetValueAsUnsigned())
+ z0_new_str = z0_new_str + ("0x%02x " % (z0_old_values[i] + 5))
+ z0_new_str = z0_new_str + '}"'
+ self.runCmd("reg write z0 %s" % z0_new_str)
+
+ z31_old_values = []
+ z31_new_str = '"{'
+ for i in range(svl):
+ z31_old_values.append(z31.GetChildAtIndex(i).GetValueAsUnsigned())
+ z31_new_str = z31_new_str + ("0x%02x " % (z31_old_values[i] + 3))
+ z31_new_str = z31_new_str + '}"'
+ self.runCmd("reg write z31 %s" % z31_new_str)
+
+ p0_old_values = []
+ p0_new_str = '"{'
+ for i in range(int(svl / 8)):
+ p0_old_values.append(p0.GetChildAtIndex(i).GetValueAsUnsigned())
+ p0_new_str = p0_new_str + ("0x%02x " % (p0_old_values[i] - 5))
+ p0_new_str = p0_new_str + '}"'
+ self.runCmd("reg write p0 %s" % p0_new_str)
+
+ p15_old_values = []
+ p15_new_str = '"{'
+ for i in range(int(svl / 8)):
+ p15_old_values.append(p15.GetChildAtIndex(i).GetValueAsUnsigned())
+ p15_new_str = p15_new_str + ("0x%02x " % (p15_old_values[i] - 8))
+ p15_new_str = p15_new_str + '}"'
+ self.runCmd("reg write p15 %s" % p15_new_str)
+
+ za_old_values = []
+ za_new_str = '"{'
+ for i in range(svl * svl):
+ za_old_values.append(za.GetChildAtIndex(i).GetValueAsUnsigned())
+ za_new_str = za_new_str + ("0x%02x " % (za_old_values[i] + 7))
+ za_new_str = za_new_str + '}"'
+ self.runCmd("reg write za %s" % za_new_str)
+
+ zt0_old_values = []
+ zt0_new_str = '"{'
+ for i in range(64):
+ zt0_old_values.append(zt0.GetChildAtIndex(i).GetValueAsUnsigned())
+ zt0_new_str = zt0_new_str + ("0x%02x " % (zt0_old_values[i] + 2))
+ zt0_new_str = zt0_new_str + '}"'
+ self.runCmd("reg write zt0 %s" % zt0_new_str)
+
+ thread.StepInstruction(False)
+ frame = thread.GetFrameAtIndex(0)
+
+ if self.TraceOn():
+ self.runCmd("reg read -a")
+
+ z0 = frame.register["z0"]
+ for i in range(z0.GetNumChildren()):
+ self.assertEqual(
+ z0_old_values[i] + 5, z0.GetChildAtIndex(i).GetValueAsUnsigned()
+ )
+
+ z31 = frame.register["z31"]
+ for i in range(z31.GetNumChildren()):
+ self.assertEqual(
+ z31_old_values[i] + 3, z31.GetChildAtIndex(i).GetValueAsUnsigned()
+ )
+
+ p0 = frame.register["p0"]
+ for i in range(p0.GetNumChildren()):
+ self.assertEqual(
+ p0_old_values[i] - 5, p0.GetChildAtIndex(i).GetValueAsUnsigned()
+ )
+
+ p15 = frame.register["p15"]
+ for i in range(p15.GetNumChildren()):
+ self.assertEqual(
+ p15_old_values[i] - 8, p15.GetChildAtIndex(i).GetValueAsUnsigned()
+ )
+
+ za = frame.register["za"]
+ for i in range(za.GetNumChildren()):
+ self.assertEqual(
+ za_old_values[i] + 7, za.GetChildAtIndex(i).GetValueAsUnsigned()
+ )
+
+ zt0 = frame.register["zt0"]
+ for i in range(zt0.GetNumChildren()):
+ self.assertEqual(
+ zt0_old_values[i] + 2, zt0.GetChildAtIndex(i).GetValueAsUnsigned()
+ )
diff --git a/lldb/test/API/macosx/sme-registers/main.c b/lldb/test/API/macosx/sme-registers/main.c
new file mode 100644
index 00000000000000..00bbb4a5551622
--- /dev/null
+++ b/lldb/test/API/macosx/sme-registers/main.c
@@ -0,0 +1,123 @@
+/// BUILT with
+/// xcrun -sdk macosx.internal clang -mcpu=apple-m4 -g sme.c -o sme
+
+
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+
+void write_sve_regs() {
+ asm volatile("ptrue p0.b\n\t");
+ asm volatile("ptrue p1.h\n\t");
+ asm volatile("ptrue p2.s\n\t");
+ asm volatile("ptrue p3.d\n\t");
+ asm volatile("pfalse p4.b\n\t");
+ asm volatile("ptrue p5.b\n\t");
+ asm volatile("ptrue p6.h\n\t");
+ asm volatile("ptrue p7.s\n\t");
+ asm volatile("ptrue p8.d\n\t");
+ asm volatile("pfalse p9.b\n\t");
+ asm volatile("ptrue p10.b\n\t");
+ asm volatile("ptrue p11.h\n\t");
+ asm volatile("ptrue p12.s\n\t");
+ asm volatile("ptrue p13.d\n\t");
+ asm volatile("pfalse p14.b\n\t");
+ asm volatile("ptrue p15.b\n\t");
+
+ asm volatile("cpy z0.b, p0/z, #1\n\t");
+ asm volatile("cpy z1.b, p5/z, #2\n\t");
+ asm volatile("cpy z2.b, p10/z, #3\n\t");
+ asm volatile("cpy z3.b, p15/z, #4\n\t");
+ asm volatile("cpy z4.b, p0/z, #5\n\t");
+ asm volatile("cpy z5.b, p5/z, #6\n\t");
+ asm volatile("cpy z6.b, p10/z, #7\n\t");
+ asm volatile("cpy z7.b, p15/z, #8\n\t");
+ asm volatile("cpy z8.b, p0/z, #9\n\t");
+ asm volatile("cpy z9.b, p5/z, #10\n\t");
+ asm volatile("cpy z10.b, p10/z, #11\n\t");
+ asm volatile("cpy z11.b, p15/z, #12\n\t");
+ asm volatile("cpy z12.b, p0/z, #13\n\t");
+ asm volatile("cpy z13.b, p5/z, #14\n\t");
+ asm volatile("cpy z14.b, p10/z, #15\n\t");
+ asm volatile("cpy z15.b, p15/z, #16\n\t");
+ asm volatile("cpy z16.b, p0/z, #17\n\t");
+ asm volatile("cpy z17.b, p5/z, #18\n\t");
+ asm volatile("cpy z18.b, p10/z, #19\n\t");
+ asm volatile("cpy z19.b, p15/z, #20\n\t");
+ asm volatile("cpy z20.b, p0/z, #21\n\t");
+ asm volatile("cpy z21.b, p5/z, #22\n\t");
+ asm volatile("cpy z22.b, p10/z, #23\n\t");
+ asm volatile("cpy z23.b, p15/z, #24\n\t");
+ asm volatile("cpy z24.b, p0/z, #25\n\t");
+ asm volatile("cpy z25.b, p5/z, #26\n\t");
+ asm volatile("cpy z26.b, p10/z, #27\n\t");
+ asm volatile("cpy z27.b, p15/z, #28\n\t");
+ asm volatile("cpy z28.b, p0/z, #29\n\t");
+ asm volatile("cpy z29.b, p5/z, #30\n\t");
+ asm volatile("cpy z30.b, p10/z, #31\n\t");
+ asm volatile("cpy z31.b, p15/z, #32\n\t");
+}
+
+#define MAX_VL_BYTES 256
+void set_za_register(int svl, int value_offset) {
+ uint8_t data[MAX_VL_BYTES];
+
+ // ldr za will actually wrap the selected vector row, by the number of rows
+ // you have. So setting one that didn't exist would actually set one that did.
+ // That's why we need the streaming vector length here.
+ for (int i = 0; i < svl; ++i) {
+ // This may involve instructions that require the smefa64 extension.
+ for (int j = 0; j < MAX_VL_BYTES; j++)
+ data[j] = i + value_offset;
+ // Each one of these loads a VL sized row of ZA.
+ asm volatile("mov w12, %w0\n\t"
+ "ldr za[w12, 0], [%1]\n\t" ::"r"(i),
+ "r"(&data)
+ : "w12");
+ }
+}
+
+static uint16_t
+arm_sme_svl_b(void)
+{
+ uint64_t ret = 0;
+ asm volatile (
+ "rdsvl %[ret], #1"
+ : [ret] "=r"(ret)
+ );
+ return (uint16_t)ret;
+}
+
+
+// lldb/test/API/commands/register/register/aarch64_sme_z_registers/save_restore/main.c
+void
+arm_sme2_set_zt0() {
+#define ZTO_LEN (512 / 8)
+ uint8_t data[ZTO_LEN];
+ for (unsigned i = 0; i < ZTO_LEN; ++i)
+ data[i] = i + 0;
+
+ asm volatile("ldr zt0, [%0]" ::"r"(&data));
+#undef ZT0_LEN
+}
+
+int main()
+{
+
+ printf("Enable SME mode\n");
+
+ asm volatile ("smstart");
+
+ write_sve_regs();
+
+ set_za_register(arm_sme_svl_b(), 4);
+
+ arm_sme2_set_zt0();
+
+ int c = 10; // break here
+ c += 5;
+ c += 5;
+
+ asm volatile ("smstop");
+}
diff --git a/lldb/tools/debugserver/source/DNBDefs.h b/lldb/tools/debugserver/source/DNBDefs.h
index dacee652b3ebfc..df8ca809d412c7 100644
--- a/lldb/tools/debugserver/source/DNBDefs.h
+++ b/lldb/tools/debugserver/source/DNBDefs.h
@@ -312,16 +312,21 @@ struct DNBRegisterValue {
uint64_t uint64;
float float32;
double float64;
- int8_t v_sint8[64];
- int16_t v_sint16[32];
- int32_t v_sint32[16];
- int64_t v_sint64[8];
- uint8_t v_uint8[64];
- uint16_t v_uint16[32];
- uint32_t v_uint32[16];
- uint64_t v_uint64[8];
- float v_float32[16];
- double v_float64[8];
+ // AArch64 SME's ZA register max size is 64k, this object must be
+ // large enough to hold that much data. The current Apple cores
+ // have a much smaller maximum ZA reg size, but there are not
+ // multiple copies of this object so increase the static size to
+ // maximum possible.
+ int8_t v_sint8[65536];
+ int16_t v_sint16[32768];
+ int32_t v_sint32[16384];
+ int64_t v_sint64[8192];
+ uint8_t v_uint8[65536];
+ uint16_t v_uint16[32768];
+ uint32_t v_uint32[16384];
+ uint64_t v_uint64[8192];
+ float v_float32[16384];
+ double v_float64[8192];
void *pointer;
char *c_str;
} value;
diff --git a/lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp b/lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp
index b6f52cb5cf496d..ba2a8116d68bec 100644
--- a/lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp
+++ b/lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp
@@ -93,6 +93,55 @@ DNBArchMachARM64::SoftwareBreakpointOpcode(nub_size_t byte_size) {
uint32_t DNBArchMachARM64::GetCPUType() { return CPU_TYPE_ARM64; }
+static std::once_flag g_cpu_has_sme_once;
+bool DNBArchMachARM64::CPUHasSME() {
+ static bool g_has_sme = false;
+ std::call_once(g_cpu_has_sme_once, []() {
+ int ret = 0;
+ size_t size = sizeof(ret);
+ if (sysctlbyname("hw.optional.arm.FEAT_SME", &ret, &size, NULL, 0) != -1)
+ g_has_sme = ret == 1;
+ });
+ return g_has_sme;
+}
+
+static std::once_flag g_cpu_has_sme2_once;
+bool DNBArchMachARM64::CPUHasSME2() {
+ static bool g_has_sme2 = false;
+ std::call_once(g_cpu_has_sme2_once, []() {
+ int ret = 0;
+ size_t size = sizeof(ret);
+ if (sysctlbyname("hw.optional.arm.FEAT_SME2", &ret, &size, NULL, 0) != -1)
+ g_has_sme2 = ret == 1;
+ });
+ return g_has_sme2;
+}
+
+static std::once_flag g_sme_max_svl_once;
+unsigned int DNBArchMachARM64::GetSMEMaxSVL() {
+ static unsigned int g_sme_max_svl = 0;
+ std::call_once(g_sme_max_svl_once, []() {
+ if (CPUHasSME()) {
+ unsigned int ret = 0;
+ size_t size = sizeof(ret);
+ if (sysctlbyname("hw.optional.arm.sme_max_svl_b", &ret, &size, NULL, 0) !=
+ -1)
+ g_sme_max_svl = ret;
+ else
+ g_sme_max_svl = get_svl_bytes();
+ }
+ });
+ return g_sme_max_svl;
+}
+
+// This function can only be called on systems with hw.optional.arm.FEAT_SME
+// It will return the maximum SVL length for this process.
+uint16_t __attribute__((target("sme"))) DNBArchMachARM64::get_svl_bytes(void) {
+ uint64_t ret = 0;
+ asm volatile("rdsvl %[ret], #1" : [ret] "=r"(ret));
+ return (uint16_t)ret;
+}
+
static uint64_t clear_pac_bits(uint64_t value) {
uint32_t addressing_bits = 0;
if (!DNBGetAddressingBits(addressing_bits))
@@ -415,6 +464,103 @@ kern_return_t DNBArchMachARM64::GetDBGState(bool force) {
return kret;
}
+kern_return_t DNBArchMachARM64::GetSVEState(bool force) {
+ int set = e_regSetSVE;
+ // Check if we have valid cached registers
+ if (!force && m_state.GetError(set, Read) == KERN_SUCCESS)
+ return KERN_SUCCESS;
+
+ if (!CPUHasSME())
+ return KERN_INVALID_ARGUMENT;
+
+ // Read the registers from our thread
+ mach_msg_type_number_t count = ARM_SVE_Z_STATE_COUNT;
+ kern_return_t kret =
+ ::thread_get_state(m_thread->MachPortNumber(), ARM_SVE_Z_STATE1,
+ (thread_state_t)&m_state.context.sve.z[0], &count);
+ m_state.SetError(set, Read, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Read SVE registers z0..z15 return value %d",
+ kret);
+ if (kret != KERN_SUCCESS)
+ return kret;
+
+ count = ARM_SVE_Z_STATE_COUNT;
+ kret = thread_get_state(m_thread->MachPortNumber(), ARM_SVE_Z_STATE2,
+ (thread_state_t)&m_state.context.sve.z[16], &count);
+ m_state.SetError(set, Read, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Read SVE registers z16..z31 return value %d",
+ kret);
+ if (kret != KERN_SUCCESS)
+ return kret;
+
+ count = ARM_SVE_P_STATE_COUNT;
+ kret = thread_get_state(m_thread->MachPortNumber(), ARM_SVE_P_STATE,
+ (thread_state_t)&m_state.context.sve.p[0], &count);
+ m_state.SetError(set, Read, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Read SVE registers p0..p15 return value %d",
+ kret);
+
+ return kret;
+}
+
+kern_return_t DNBArchMachARM64::GetSMEState(bool force) {
+ int set = e_regSetSME;
+ // Check if we have valid cached registers
+ if (!force && m_state.GetError(set, Read) == KERN_SUCCESS)
+ return KERN_SUCCESS;
+
+ if (!CPUHasSME())
+ return KERN_INVALID_ARGUMENT;
+
+ // Read the registers from our thread
+ mach_msg_type_number_t count = ARM_SME_STATE_COUNT;
+ kern_return_t kret =
+ ::thread_get_state(m_thread->MachPortNumber(), ARM_SME_STATE,
+ (thread_state_t)&m_state.context.sme.svcr, &count);
+ m_state.SetError(set, Read, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Read ARM_SME_STATE return value %d", kret);
+ if (kret != KERN_SUCCESS)
+ return kret;
+
+ memset(m_state.context.sme.za.data(), 0, m_state.context.sme.za.size());
+
+ size_t za_size = m_state.context.sme.svl_b * m_state.context.sme.svl_b;
+ const size_t max_chunk_size = 4096;
+ int n_chunks;
+ size_t chunk_size;
+ if (za_size <= max_chunk_size) {
+ n_chunks = 1;
+ chunk_size = za_size;
+ } else {
+ n_chunks = za_size / max_chunk_size;
+ chunk_size = max_chunk_size;
+ }
+ for (int i = 0; i < n_chunks; i++) {
+ count = ARM_SME_ZA_STATE_COUNT;
+ arm_sme_za_state_t za_state;
+ kret = thread_get_state(m_thread->MachPortNumber(), ARM_SME_ZA_STATE1 + i,
+ (thread_state_t)&za_state, &count);
+ m_state.SetError(set, Read, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Read ARM_SME_STATE return value %d", kret);
+ if (kret != KERN_SUCCESS)
+ return kret;
+ memcpy(m_state.context.sme.za.data() + (i * chunk_size), &za_state,
+ chunk_size);
+ }
+
+ if (CPUHasSME2()) {
+ count = ARM_SME2_STATE;
+ kret = thread_get_state(m_thread->MachPortNumber(), ARM_SME2_STATE,
+ (thread_state_t)&m_state.context.sme.zt0, &count);
+ m_state.SetError(set, Read, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Read ARM_SME2_STATE return value %d", kret);
+ if (kret != KERN_SUCCESS)
+ return kret;
+ }
+
+ return kret;
+}
+
kern_return_t DNBArchMachARM64::SetGPRState() {
int set = e_regSetGPR;
kern_return_t kret = ::thread_set_state(
@@ -441,6 +587,80 @@ kern_return_t DNBArchMachARM64::SetVFPState() {
return kret; // Return the error code
}
+kern_return_t DNBArchMachARM64::SetSVEState() {
+ if (!CPUHasSME())
+ return KERN_INVALID_ARGUMENT;
+
+ int set = e_regSetSVE;
+ kern_return_t kret = thread_set_state(
+ m_thread->MachPortNumber(), ARM_SVE_Z_STATE1,
+ (thread_state_t)&m_state.context.sve.z[0], ARM_SVE_Z_STATE_COUNT);
+ m_state.SetError(set, Write, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Write ARM_SVE_Z_STATE1 return value %d", kret);
+ if (kret != KERN_SUCCESS)
+ return kret;
+
+ kret = thread_set_state(m_thread->MachPortNumber(), ARM_SVE_Z_STATE2,
+ (thread_state_t)&m_state.context.sve.z[16],
+ ARM_SVE_Z_STATE_COUNT);
+ m_state.SetError(set, Write, kret);
+ DNBLogThreadedIf(LOG_TH...
[truncated]
``````````
</details>
https://github.com/llvm/llvm-project/pull/119171
More information about the lldb-commits mailing list