[llvm] [RFC] [compiler-rt] make MSan up to 20x faster on AMD CPUs (PR #171993)

Azat Khuzhin via llvm-commits llvm-commits at lists.llvm.org
Fri Dec 12 03:23:48 PST 2025


https://github.com/azat created https://github.com/llvm/llvm-project/pull/171993

I noticed that on AMD CPUs (so far I've tested Zen 3 and Zen 4c, AMD EPYC 9R14) a simple program under MSan is up to 20x slower than with this patch:

    #include <stdio.h>
    #include <time.h>
    #include <stdint.h>

    uint64_t factorial(int n) {
        if (n <= 1) return 1;
        return n * factorial(n - 1);
    }

    int main() {
        const int iterations = 100000000;
        clock_t start = clock();

        for (int i = 0; i < iterations; i++) {
            volatile uint64_t result = factorial(20);
        }

        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("Direct loop:          %.3f seconds\n", elapsed);
        return 0;
    }

The trigger here is the `volatile` store, but the underlying problem appears to be cache conflicts: `result` and its shadow address conflict and keep evicting each other, which causes a huge number of L1 cache misses:

   Performance counter stats for './factorial-test-original':

         212,850,471      L1-dcache-loads
         200,634,333      L1-dcache-load-misses            #   94.26% of all L1-dcache accesses
     <not supported>      L1-dcache-stores

         1.232666099 seconds time elapsed

         1.228437000 seconds user
         0.000994000 seconds sys

To avoid these conflicts we can offset the shadow addresses by the size of one cache line (64 bytes). Here are the results, a 20x improvement:

    $ /usr/bin/clang++ -fsanitize=memory -O3 factorial-test.c -o factorial-test-original
    $ ./factorial-test-original
    Direct loop:          1.223 seconds
    $ clang++ -fsanitize=memory -O3 factorial-test.c -o factorial-test-patched
    $ ./factorial-test-patched
    Direct loop:          0.060 seconds

I've also tested on an Intel CPU (Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz), and performance there looks the same before and after the patch.

Curious about what you think about this!

From 840320ea1f130888308060405995cfcc22a2075e Mon Sep 17 00:00:00 2001
From: Azat Khuzhin <a3at.mail at gmail.com>
Date: Fri, 12 Dec 2025 09:32:15 +0100
Subject: [PATCH] [RFC] [compiler-rt] make MSan up to 20x faster on AMD CPUs

I noticed that on AMD CPUs (so far I've tested Zen 3 and Zen 4c, AMD
EPYC 9R14) a simple program under MSan is up to 20x slower than with
this patch:

    #include <stdio.h>
    #include <time.h>
    #include <stdint.h>

    uint64_t factorial(int n) {
        if (n <= 1) return 1;
        return n * factorial(n - 1);
    }

    int main() {
        const int iterations = 100000000;
        clock_t start = clock();

        for (int i = 0; i < iterations; i++) {
            volatile uint64_t result = factorial(20);
        }

        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("Direct loop:          %.3f seconds\n", elapsed);
        return 0;
    }

The trigger here is the `volatile` store, but the underlying problem
appears to be cache conflicts: `result` and its shadow address conflict
and keep evicting each other, which causes a huge number of L1 cache
misses:

   Performance counter stats for './factorial-test-original':

         212,850,471      L1-dcache-loads
         200,634,333      L1-dcache-load-misses            #   94.26% of all L1-dcache accesses
     <not supported>      L1-dcache-stores

         1.232666099 seconds time elapsed

         1.228437000 seconds user
         0.000994000 seconds sys

To avoid these conflicts we can offset the shadow addresses by the size
of one cache line (64 bytes). Here are the results, a 20x improvement:

    $ /usr/bin/clang++ -fsanitize=memory -O3 factorial-test.c -o factorial-test-original
    $ ./factorial-test-original
    Direct loop:          1.223 seconds
    $ clang++ -fsanitize=memory -O3 factorial-test.c -o factorial-test-patched
    $ ./factorial-test-patched
    Direct loop:          0.060 seconds

I've also tested on an Intel CPU (Intel(R) Xeon(R) Platinum 8124M CPU @
3.00GHz), and performance there looks the same before and after the
patch.

Curious about what you think about this!
---
 llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp b/llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
index 32ee16c89b4fe..1e8253276555e 100644
--- a/llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
+++ b/llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
@@ -442,7 +442,7 @@ static const MemoryMapParams Linux_I386_MemoryMapParams = {
 // x86_64 Linux
 static const MemoryMapParams Linux_X86_64_MemoryMapParams = {
     0,              // AndMask (not used)
-    0x500000000000, // XorMask
+    0x500000000040, // XorMask
     0,              // ShadowBase (not used)
     0x100000000000, // OriginBase
 };
