[libc-commits] [libc] [libc] NVPTX Profiling Draft (PR #92009)

Thu May 16 10:36:18 PDT 2024

================
@@ -0,0 +1,108 @@
+//===------------- NVPTX implementation of timing utils ---------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIBC_UTILS_GPU_TIMING_NVPTX
+#define LLVM_LIBC_UTILS_GPU_TIMING_NVPTX
+
+#include "src/__support/GPU/utils.h"
+#include "src/__support/common.h"
+#include "src/__support/macros/attributes.h"
+#include "src/__support/macros/config.h"
+
+#include <stdint.h>
+
+namespace LIBC_NAMESPACE {
+
+// Returns the overhead associated with calling the profiling region. This
+// allows us to substract the constant-time overhead from the latency to
+// obtain a true result. This can vary with system load.
+[[gnu::noinline]] static uint64_t overhead() {
+  volatile uint32_t x = 1;
+  uint32_t y = x;
+  gpu::sync_threads();
+  uint64_t start = gpu::processor_clock();
+  asm volatile("" ::"r"(y), "llr"(start));
+  uint32_t result = y;
+  asm volatile("or.b32 %[v_reg], %[v_reg], 0;" ::[v_reg] "r"(result) :);
+  uint64_t stop = gpu::processor_clock();
+  gpu::sync_threads();
+  volatile auto storage = result;
+  return stop - start;
+}
+
+// Stimulate a simple function and obtain its latency in clock cycles on the
+// system. This function cannot be inlined or else it will disturb the very
+// deliccate balance of hard-coded dependencies.
+//
+// FIXME: This does not work in general on NVPTX because of further
+// optimizations ptxas performs. The only way to get consistent results is to
+// pass and extra "SHELL:-Xcuda-ptxas -O0" to CMake's compiler flag. This
+// negatively implacts performance but it is at least stable.
+template <typename F, typename T>
+[[gnu::noinline]] static LIBC_INLINE uint64_t latency(F f, T t) {
----------------
jameshu15869 wrote:

Removing `[[gnu::noinline]]` ends up modifying the measurement - I checked and including/not including `[[gnu::noinline]]` give the same PTX but different SASS. Is that something the compiler can even do?

Also, keeping `[[gnu::noinline]]` says that `isalnum()` takes 1 cycle while removing `[[gnu::noinline]]` calculates the overhead around 45 cycles (I'm calculating the overhead here by doing `latency(isalnum()) - latency(single_input_output_mock_function())`). I'm thinking the 1 cycles seems to make more sense, but which number seems more reasonable?

https://github.com/llvm/llvm-project/pull/92009