[llvm] [CI] Extend metrics container to log BuildKite metrics (PR #129699)
Nathan Gauër via llvm-commits
llvm-commits at lists.llvm.org
Wed Mar 5 05:19:09 PST 2025
================
@@ -35,6 +48,145 @@ class GaugeMetric:
time_ns: int
+# Fetches a page of the build list using the GraphQL BuildKite API.
+# Returns the BUILDKITE_GRAPHQL_BUILDS_PER_PAGE last **finished** builds by
+# default, or the BUILDKITE_GRAPHQL_BUILDS_PER_PAGE **finished** builds older
+# than the one pointed to by |cursor| if provided.
+# The |cursor| value is taken from the previous page returned by the API.
+# The returned data has the following format:
+# [
+#   {
+#     "cursor": <value>,
+#     "number": <build-number>,
+#   }
+# ]
+def buildkite_fetch_page_build_list(buildkite_token, after_cursor=None):
+    BUILDKITE_GRAPHQL_QUERY = """
+    query OrganizationShowQuery {{
+      organization(slug: "llvm-project") {{
+        pipelines(search: "Github pull requests", first: 1) {{
+          edges {{
+            node {{
+              builds (state: [FAILED, PASSED], first: {PAGE_SIZE}, after: {AFTER}) {{
+                edges {{
+                  cursor
+                  node {{
+                    number
+                  }}
+                }}
+              }}
+            }}
+          }}
+        }}
+      }}
+    }}
+    """
+    data = BUILDKITE_GRAPHQL_QUERY.format(
+        PAGE_SIZE=BUILDKITE_GRAPHQL_BUILDS_PER_PAGE,
+        AFTER="null" if after_cursor is None else '"{}"'.format(after_cursor),
+    )
+    data = data.replace("\n", "").replace('"', '\\"')
+    data = '{ "query": "' + data + '" }'
+    url = "https://graphql.buildkite.com/v1"
+    headers = {
+        "Authorization": "Bearer " + buildkite_token,
+        "Content-Type": "application/json",
+    }
+    r = requests.post(url, data=data, headers=headers)
+    data = r.json()
+    # De-nest the build list.
+    builds = data["data"]["organization"]["pipelines"]["edges"][0]["node"]["builds"][
+        "edges"
+    ]
+    # Fold cursor info into the node dictionary.
+    return [{**x["node"], "cursor": x["cursor"]} for x in builds]
+
+
+# Returns all the info associated with the provided |build_number|.
+# Note: for unknown reasons, graphql returns no jobs for a given build, while
+# this endpoint does, hence why this uses this API instead of graphql.
+def buildkite_get_build_info(build_number):
+    URL = "https://buildkite.com/llvm-project/github-pull-requests/builds/{}.json"
+    return requests.get(URL.format(build_number)).json()
+
+
+# Returns the last BUILDKITE_GRAPHQL_BUILDS_PER_PAGE builds by default, or
+# all builds up to, but not including, the one pointed to by |last_cursor|.
+def buildkite_get_builds_up_to(buildkite_token, last_cursor=None):
+    output = []
+    cursor = None
+
+    while True:
+        page = buildkite_fetch_page_build_list(buildkite_token, cursor)
+        # No cursor provided, return the first page.
+        if last_cursor is None:
+            return page
+
+        # Cursor has been provided, check if present in this page.
+        match_index = next(
+            (i for i, x in enumerate(page) if x["cursor"] == last_cursor), None
+        )
+        # Not present, continue loading more pages.
+        if match_index is None:
+            output += page
+            cursor = page[-1]["cursor"]
+            continue
+        # Cursor found, keep results up to cursor
+        output += page[:match_index]
+        return output
+
+
+# Returns a (metrics, cursor) tuple.
+# Returns the BuildKite workflow metrics up to the build pointed to by |last_cursor|.
+# If |last_cursor| is None, no metrics are returned.
+# The returned cursor is either:
+# - the last processed build.
+# - the last build if no initial cursor was provided.
+def buildkite_get_metrics(buildkite_token, last_cursor=None):
+    builds = buildkite_get_builds_up_to(buildkite_token, last_cursor)
+    # Don't return any metrics if last_cursor is None.
+    # This happens when the program starts.
----------------
Keenuts wrote:
Because we have no state, we need to start somewhere.
On the first iteration there is no cursor, so we only want to initialize it, not record any metrics. By default we therefore log no metrics and just return the starting cursor.
On the next iteration, we start logging the builds between now and the previous cursor.
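
The bootstrap behaviour described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `get_metrics`, `fake_fetch_builds_up_to` and `FAKE_BUILDS` are hypothetical names, the "metric" is just the build number, and the sketch assumes builds are listed newest first.

```python
# Hypothetical in-memory stand-in for the BuildKite build list, newest first.
FAKE_BUILDS = [{"number": n, "cursor": f"c{n}"} for n in (5, 4, 3, 2, 1)]


def fake_fetch_builds_up_to(last_cursor=None, page_size=2):
    """Return the first page if no cursor is given, else every build
    listed before the one identified by last_cursor (exclusive)."""
    if last_cursor is None:
        return FAKE_BUILDS[:page_size]
    output = []
    for build in FAKE_BUILDS:
        if build["cursor"] == last_cursor:
            break
        output.append(build)
    return output


def get_metrics(fetch_builds_up_to, last_cursor=None):
    """Return a (metrics, cursor) tuple.

    With no cursor, nothing is recorded: the call only picks a starting
    point. With a cursor, every build seen since that cursor is reported.
    """
    builds = fetch_builds_up_to(last_cursor)
    if last_cursor is None:
        # First iteration: initialize the cursor, log nothing.
        return [], (builds[0]["cursor"] if builds else None)
    if not builds:
        # Nothing new since the last iteration: keep the same cursor.
        return [], last_cursor
    metrics = [build["number"] for build in builds]
    return metrics, builds[0]["cursor"]
```

A caller would keep the returned cursor between iterations and pass it back in on the next call, so the first call yields no metrics and later calls yield only the builds finished since then.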
We could also consider loading the first page and logging it. Worst case, we write duplicate entries in Grafana, which is not a problem since we would log the same value twice (and Prometheus deduplicates).
BUT there is one badly documented bit: influxdb+prometheus refuse to store a metric with a timestamp older than ~2 hours (this is from experiments, I haven't found it written in the docs).
So if we start by recording the full page, we might have a bad write.
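
One way to avoid such a bad write would be to drop samples that are too old before submitting them. The sketch below is only an illustration under assumptions: `drop_stale_samples` and `MAX_SAMPLE_AGE_NS` are hypothetical names, samples are plain dictionaries with a `time_ns` field (mirroring the `time_ns` attribute visible in the diff), and the 2-hour limit is the empirical value mentioned above, not a documented one.

```python
import time

# Empirical limit from the discussion above (~2 hours), in nanoseconds.
MAX_SAMPLE_AGE_NS = 2 * 60 * 60 * 10**9


def drop_stale_samples(samples, now_ns=None, max_age_ns=MAX_SAMPLE_AGE_NS):
    """Keep only the samples whose timestamp is at most max_age_ns old."""
    if now_ns is None:
        now_ns = time.time_ns()
    return [s for s in samples if now_ns - s["time_ns"] <= max_age_ns]
```

Passing `now_ns` explicitly keeps the function deterministic, which makes it easy to check against fixed timestamps.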
On a side note, I'm considering adding a GCP R/W bucket to store some state, like the last recorded GitHub/BuildKite job, so we have some way to recover when the container stops (given that we cannot record values older than 2 hours).
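
To illustrate the kind of state this would involve, here is a rough sketch that persists the last cursor as JSON. It uses a local file purely as a stand-in for the bucket, so nothing here reflects an actual GCP API; `load_state`, `save_state` and the `buildkite_cursor` key are hypothetical names.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or an empty dict if nothing was saved yet."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_state(path, state):
    """Write the state to a temporary file, then atomically swap it in,
    so a container stopped mid-write does not leave a truncated file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)
```

On startup the container would call `load_state` and resume from the stored cursor if present, falling back to the no-cursor initialization otherwise.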
https://github.com/llvm/llvm-project/pull/129699