How to Benchmark Your CI Like You Benchmark Your App

Teams apply performance benchmarking rigorously to production and almost never to CI. Here's how to close that gap — with the same tools you're already using.


If your API started returning responses 40% slower than last month, you'd know within hours. Your alerting would fire, your dashboards would show the regression, and someone would have a postmortem started by end of day.

Last month your CI builds got 40% slower. Do you know?

The benchmarking gap

Application performance benchmarking is a mature discipline: establish a baseline, measure against it continuously, alert on regressions, trace root causes. Teams apply this rigorously to production.

CI gets none of it. CI benchmarking is usually "someone noticed it felt slow" or "our bill went up."

The gap exists for a boring reason: CI tools don't expose the right data. They log pass/fail and duration per run. They don't expose the metrics that make benchmarking useful — aggregated percentiles, step-level trends, cross-run correlation.

What you need before you can benchmark

Good CI benchmarks require the same three things as application benchmarks:

  1. A stable baseline — what does "normal" look like? Not just mean duration, but p50/p95/p99 by workflow, branch, and step.
  2. Continuous collection — metrics emitted after every run, not sampled or manually queried.
  3. Correlation data — enough context to tie a regression to a cause: which commit, which step, which runner.
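The three requirements boil down to one per-step record that carries its own correlation context. A minimal sketch — every field name here is an assumption about what your CI system can export, not a real schema:

```python
from dataclasses import dataclass

@dataclass
class StepRun:
    workflow: str      # e.g. "ci"
    branch: str        # e.g. "main"
    step: str          # e.g. "compile"
    commit: str        # correlation: which commit triggered this run
    runner: str        # correlation: which runner executed it
    duration_s: float  # what the baseline and percentiles are built from

# Continuous collection means one record per step per run,
# emitted automatically after every run -- never sampled or queried by hand.
runs = [
    StepRun("ci", "main", "compile", "abc123", "runner-1", 412.0),
    StepRun("ci", "main", "test",    "abc123", "runner-1", 655.0),
]
```

With records shaped like this, every question later in this post — percentiles by step, regressions by commit, trends by branch — is a filter and an aggregation.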

Translating app benchmarking patterns to CI

Latency percentiles, not averages

A 12-minute average build time can hide a p95 of 28 minutes — the tail that's actually killing your developers' flow state. Track p50, p95, and p99 for your critical workflows the same way you track them for your endpoints.

Step-level granularity

"The build is slow" is the CI equivalent of "the app is slow." Useful benchmark data lives at the step level: install, compile, test, lint, deploy. Each step should have its own baseline and its own alert threshold.
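Per-step baselines turn "the build is slow" into "compile is slow." A minimal sketch, with made-up step names, baselines, and a single shared tolerance — in practice each step would earn its own threshold:

```python
# Hypothetical per-step p95 baselines in seconds (not real data).
BASELINE_P95_S = {"install": 45, "compile": 420, "test": 660, "lint": 30, "deploy": 90}

def regressed_steps(latest: dict[str, float], tolerance: float = 0.10) -> list[str]:
    """Return steps whose latest duration exceeds the baseline p95 by > tolerance."""
    return [step for step, dur in latest.items()
            if dur > BASELINE_P95_S[step] * (1 + tolerance)]

latest_run = {"install": 44, "compile": 500, "test": 650, "lint": 29, "deploy": 88}
print(regressed_steps(latest_run))  # compile: 500 > 420 * 1.1 -> ["compile"]
```

Note what the aggregate hides: total duration moved by barely a minute, but the step-level view pins the entire regression on compile.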

Regression detection on merge

The right time to catch a CI regression is in the PR that caused it, not three sprints later. A benchmark that compares a PR's build times against the rolling baseline on the target branch catches regressions in the same place you catch them in performance testing.
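The comparison itself is simple. A sketch under loose assumptions — medians over a handful of PR runs are noisy, so treat this as the shape of the check, not a production-grade detector:

```python
import statistics

def pr_regression(pr_durations: list[float],
                  baseline_durations: list[float],
                  threshold: float = 0.10) -> tuple[bool, float]:
    """Flag a PR whose median build time exceeds the target branch's
    rolling-baseline median by more than `threshold` (e.g. 0.10 = 10%)."""
    ratio = statistics.median(pr_durations) / statistics.median(baseline_durations) - 1
    return ratio > threshold, ratio

# Two PR runs vs. the last five runs on the target branch (minutes, illustrative).
flagged, ratio = pr_regression([13.9, 14.2], [11.8, 12.1, 12.0, 12.3, 11.9])
print(flagged, f"{ratio:+.0%}")
```

Wire the boolean into a PR status check and the regression surfaces exactly where the offending change is reviewed.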

Trend analysis over time

Week-over-week build time trends tell you whether you're accumulating CI debt. A 2% per-week creep in compile time is invisible run-to-run and devastating over a quarter. Plot it.
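The "devastating over a quarter" claim is just compounding. The arithmetic, assuming a 13-week quarter:

```python
weekly_creep = 0.02     # 2% per week, invisible run-to-run
weeks_in_quarter = 13

# Creep compounds: (1.02)^13 - 1, not 13 * 0.02.
total_regression = (1 + weekly_creep) ** weeks_in_quarter - 1
print(f"{total_regression:.0%}")  # prints 29%
```

Nearly a third slower by quarter's end, and no single week ever looked alarming — which is why the trend line, not the per-run number, is the thing to plot.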

The practical setup

If your CI emits metrics to Prometheus, most of this is a few queries away.

p95 build duration for a specific workflow over the last 7 days:

histogram_quantile(0.95,
  sum(rate(ci_runner_step_duration_seconds_bucket{workflow="ci"}[7d])) by (le, step)
)

Week-over-week regression — compare this week's p95 to last week's:

(
  histogram_quantile(0.95, sum(rate(ci_runner_step_duration_seconds_bucket[7d])) by (le))
  /
  histogram_quantile(0.95, sum(rate(ci_runner_step_duration_seconds_bucket[7d] offset 7d)) by (le))
) - 1

Set an alert at >10% regression. That's it. You now have CI benchmarking.

The benchmark mindset shift

Your CI pipeline is software infrastructure. It has performance characteristics. Those characteristics change over time as your codebase, dependencies, and team scale. Benchmarking it isn't optional overhead — it's the minimum bar for treating CI as a first-class system.

The teams with the fastest pipelines aren't the ones who started with fast pipelines. They're the ones who noticed when they got slow.

Connect your repos to Rewire and get the Prometheus metrics your CI benchmarks need — no YAML changes required.