Every year, the DORA State of DevOps report publishes benchmarks that classify engineering teams as Elite, High, Medium, or Low performers. These benchmarks have become the de facto standard for measuring software delivery performance. Engineering leaders reference them in board decks, managers use them to set OKRs, and consultants wield them to justify transformation programs.
There is just one problem: for most teams, these benchmarks are actively misleading.
Not because the underlying research is bad. It is not. The DORA metrics themselves (deployment frequency, lead time for changes, change failure rate, and mean time to recovery) capture genuinely important aspects of software delivery. The problem lies in how organizations apply universal benchmarks to wildly different delivery contexts.
The One-Size-Fits-All Trap
The DORA benchmarks define “Elite” performance as deploying on demand (multiple times per day), with lead times under one hour, change failure rates below 5%, and recovery times under one hour. These numbers were derived by surveying thousands of organizations across industries, team sizes, and delivery models.
But think about what that means in practice. A two-person startup deploying a single microservice to a serverless platform and a 200-person engineering organization shipping regulated financial software are measured against the same yardstick.
Consider a team that ships a SaaS product on a two-week release train. They have deliberately chosen this cadence because their enterprise customers need predictable release schedules, their QA process requires integration testing across multiple services, and their support team needs time to prepare documentation and training materials. By the DORA benchmark, this team is “Medium” at best. Deploying once every two weeks puts them squarely outside “Elite” territory.
But what if they hit their release schedule 100% of the time? What if every two-week release goes out on time, with minimal defects, and their customers love the predictability? Is that team really underperforming?
Three Delivery Models, Three Realities
The fundamental issue is that modern software organizations operate under at least three distinct delivery models, each with legitimately different cadences and success criteria.
Continuous Deployment (CD)
Teams practicing CD merge to main and deploy automatically, often dozens of times per day. For these teams, the traditional DORA benchmarks make reasonable sense. Deployment frequency should be high, lead times should be short, and the emphasis is on fast feedback loops and rapid recovery.
Scheduled Releases
Many teams ship on a fixed cadence (weekly, biweekly, monthly, or quarterly). This is not a sign of immaturity. It is a deliberate choice driven by coordination needs, compliance requirements, customer expectations, or the nature of the product itself. Mobile apps submitted to app stores, embedded systems with firmware updates, and enterprise platforms with contractual SLAs all fall into this category.
For scheduled release teams, “deployment frequency” as a raw number is meaningless. What matters is whether they hit their planned release dates consistently and whether the content they planned to ship actually made it into each release.
Event-Driven Delivery
Some teams deploy in response to specific triggers: a security patch, a customer escalation, a regulatory change, or a market event. Their cadence is intentionally irregular. Measuring them on deployment frequency is like measuring a fire department on how many times they leave the station. The number itself tells you nothing about performance.
The Case for Intent-Based Metrics
If universal benchmarks do not work, what does? The answer is measuring performance against your own stated intentions.
Intent-based metrics start with a simple question: what did you say you were going to do, and did you do it?
For a CD team, the intent might be: “We deploy every merged PR to production within 30 minutes.” The metric then becomes: what percentage of merged PRs were deployed within that window? If you hit 95%, you are performing well, regardless of whether your raw deployment count matches someone else’s.
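To make that concrete, here is a minimal sketch of how a CD team might compute adherence to such a deployment SLA. The data shape and field names are hypothetical; in practice the timestamps would come from your version control and CI/CD tooling.

```python
from datetime import datetime, timedelta

# Hypothetical export: one record per merged PR, with the merge time and the
# time the change reached production (None if it never deployed).
merged_prs = [
    {"pr": 101, "merged_at": datetime(2024, 5, 1, 9, 0),  "deployed_at": datetime(2024, 5, 1, 9, 18)},
    {"pr": 102, "merged_at": datetime(2024, 5, 1, 11, 0), "deployed_at": datetime(2024, 5, 1, 11, 50)},
    {"pr": 103, "merged_at": datetime(2024, 5, 1, 14, 0), "deployed_at": None},
]

SLA = timedelta(minutes=30)  # the team's stated intent

def sla_adherence(prs, sla):
    """Share of merged PRs that reached production within the stated window."""
    within = sum(
        1 for pr in prs
        if pr["deployed_at"] is not None and pr["deployed_at"] - pr["merged_at"] <= sla
    )
    return within / len(prs) if prs else 0.0

print(f"Deployment SLA adherence: {sla_adherence(merged_prs, SLA):.0%}")
```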
For a scheduled release team, the intent might be: “We ship a release every two weeks on Thursday.” The metric becomes your release hit rate: what percentage of planned releases actually shipped on time? A team that hits 100% of their biweekly releases is performing exceptionally well, even though their “deployment frequency” number looks unimpressive next to a CD team.
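A release hit rate can be computed the same way from a release calendar. Again, a rough sketch with made-up dates; the only inputs are the planned date and the actual ship date of each release.

```python
from datetime import date

# Hypothetical release calendar: each planned biweekly release and the date it
# actually shipped (None if it slipped out of the cycle entirely).
planned_releases = [
    {"planned": date(2024, 5, 2),  "shipped": date(2024, 5, 2)},
    {"planned": date(2024, 5, 16), "shipped": date(2024, 5, 17)},  # shipped a day late
    {"planned": date(2024, 5, 30), "shipped": date(2024, 5, 30)},
]

def release_hit_rate(releases):
    """Share of planned releases that shipped on the planned date."""
    on_time = sum(1 for r in releases if r["shipped"] == r["planned"])
    return on_time / len(releases) if releases else 0.0

print(f"Release hit rate: {release_hit_rate(planned_releases):.0%}")
```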
This approach has several advantages over universal benchmarks.
It respects context. Different teams have different delivery models for legitimate reasons. Intent-based metrics honor those differences instead of penalizing them.
It measures reliability. Hitting your commitments consistently is the foundation of trust: with customers, with stakeholders, and across teams that depend on your releases. A team that promises weekly releases and delivers them every week is more reliable than a team that aims for daily deploys but only ships three times a week.
It creates meaningful improvement targets. If your release hit rate is 70%, you know exactly what to focus on: figure out why three out of ten planned releases slip, and fix the root causes. That is a much more actionable insight than “you need to deploy more often.”
It aligns engineering metrics with business outcomes. Business leaders do not care about deployments per day in the abstract. They care about whether the team can predictably deliver what it commits to. Intent-based metrics speak that language.
Where Traditional DORA Metrics Still Matter
To be clear, the individual DORA metrics remain valuable. Lead time for changes tells you about the friction in your delivery pipeline. Change failure rate reveals the effectiveness of your testing and review processes. Mean time to recovery reflects your operational maturity and incident response capability.
The issue is not with the metrics themselves but with the benchmarks, specifically the idea that there is a single “Elite” standard that every team should aspire to. It is entirely possible for a team to have a long lead time by DORA standards (say, two weeks) while still being highly performant in the context of their delivery model.
The most useful approach combines the diagnostic power of DORA metrics with the contextual accuracy of intent-based measurement. Use lead time, change failure rate, and MTTR to identify bottlenecks and areas for improvement within your pipeline. Use intent-based metrics (release hit rate, deployment SLA adherence, planned vs. actual delivery) to measure overall performance.
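As a sketch of how the two layers can sit side by side, the snippet below derives lead time and change failure rate from a hypothetical change log. These numbers are tracked as trends for diagnosis, while the intent-based figures above are what you report as performance; nothing here is compared to an external benchmark table.

```python
from datetime import datetime
from statistics import median

# Hypothetical change log: commit-to-production timestamps plus whether the
# change caused a failure in production.
changes = [
    {"committed": datetime(2024, 5, 1, 9, 0),  "in_prod": datetime(2024, 5, 2, 9, 0),  "caused_failure": False},
    {"committed": datetime(2024, 5, 3, 10, 0), "in_prod": datetime(2024, 5, 6, 15, 0), "caused_failure": True},
    {"committed": datetime(2024, 5, 7, 8, 0),  "in_prod": datetime(2024, 5, 8, 12, 0), "caused_failure": False},
]

lead_times_hours = [(c["in_prod"] - c["committed"]).total_seconds() / 3600 for c in changes]
change_failure_rate = sum(c["caused_failure"] for c in changes) / len(changes)

# Diagnostic signals: watch the trend over time rather than a benchmark tier.
print(f"Median lead time: {median(lead_times_hours):.1f} h")
print(f"Change failure rate: {change_failure_rate:.0%}")
```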
Rethinking the Leaderboard
One of the most damaging effects of universal benchmarks is the implicit leaderboard they create. Measuring every team in your organization against the same “Elite” standard produces perverse incentives. Teams might deploy more frequently not because it improves outcomes but because it improves their metrics. They might break up releases into smaller, less meaningful increments to juice their deployment frequency numbers.
Worse, teams that have thoughtfully chosen a scheduled release model might feel pressure to adopt continuous deployment even when it does not suit their product, their customers, or their operational reality. Chasing someone else’s definition of “Elite” can actively degrade delivery quality.
A healthier model is one where each team defines its delivery profile (the cadence, the expectations, the commitments) and then measures itself against that profile. Performance becomes a question of consistency and improvement over time, not a comparison against an arbitrary external standard.
The goal is not to deploy the most. It is to consistently deliver on your commitments. Define your delivery profile, measure against your own intentions, and improve over time. That is the difference between chasing someone else’s leaderboard and building a genuinely high-performing team.
Making the Shift
If you are currently using DORA benchmarks to evaluate your teams, here is a practical path forward.
First, catalog your delivery models. For each team or service, identify whether they practice continuous deployment, scheduled releases, event-driven delivery, or some hybrid. Document the intended cadence and any constraints that shape it.
Second, define intent-based targets. For each delivery model, define what “good” looks like. For CD teams, that might be a deployment SLA (e.g., merged PRs reach production within one hour). For scheduled teams, that might be a release hit rate target (e.g., 90% of planned releases ship on time).
Third, keep the diagnostic metrics. Continue tracking lead time, change failure rate, and MTTR. These are useful signals for pipeline health regardless of your delivery model. Just stop comparing them to universal benchmarks.
Fourth, review and adjust. Intent-based metrics only work if the intents are realistic and regularly revisited. A team that always hits 100% might need a more ambitious target. A team that consistently misses might need to adjust their delivery model or address systemic obstacles.
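To tie the four steps together, here is a minimal sketch of what a delivery-profile catalog and a review pass over it might look like. Every team name, field, and threshold is an illustrative assumption, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class DeliveryProfile:
    """One team's stated delivery intent (all values below are illustrative)."""
    team: str
    model: str      # "continuous", "scheduled", or "event-driven"
    cadence: str    # e.g. "on merge", "biweekly Thursday", "on trigger"
    metric: str     # the intent-based metric this team is measured on
    target: float   # e.g. 0.90 == 90% of planned releases ship on time

profiles = [
    DeliveryProfile("payments", "continuous", "on merge", "deployment SLA adherence", 0.95),
    DeliveryProfile("platform", "scheduled", "biweekly Thursday", "release hit rate", 0.90),
]

# Observed values would come from calculations like the ones earlier; hard-coded here.
observed = {"payments": 0.97, "platform": 0.72}

def review(profile, value):
    """Step four: flag profiles whose targets or delivery models need revisiting."""
    if value >= profile.target:
        status = "meeting intent"
        if value >= 0.99:
            status += " (consider a more ambitious target)"
    else:
        status = f"missing intent by {profile.target - value:.0%} (investigate root causes)"
    return f"{profile.team}: {profile.metric} = {value:.0%}, target {profile.target:.0%} -> {status}"

for p in profiles:
    print(review(p, observed[p.team]))
```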
Looking Ahead
The engineering metrics landscape is maturing. The early days of “just measure deployment frequency” are giving way to more nuanced approaches that account for the diversity of modern software delivery. The SPACE framework introduced dimensions like satisfaction and communication alongside raw throughput metrics. Intent-based measurement takes this further by making the team’s own goals the yardstick.
The best engineering organizations are not the ones that deploy the most. They are the ones that consistently deliver on their commitments, learn from their misses, and continuously refine their delivery model to better serve their customers.
That is what performance measurement should capture, and that is exactly what intent-based metrics are designed to do.