Beyond Winning: Spotify’s Experiments with Learning Framework


TL;DR

  • Spotify’s experimentation platform, Confidence, scaled product decision-making across hundreds of teams, evolving from a focus on experiment velocity to maximizing experiment quality and learning.

  • We developed the Experiments with Learning (EwL) metric to measure success: A successful experiment yields enough valid information to inform product decisions, not just one that finds a “winner.”

  • Our learning rate (~64%) far exceeds our win rate, emphasizing that most value comes from understanding what doesn’t work or detecting regressions, not just shipping improvements.

  • The EwL framework helps identify improvement areas for teams and platforms, guides resource allocation, and drives innovation while avoiding bad product decisions.

Introduction

At Spotify we developed Confidence, an experimentation platform, to scale product development and decision-making company-wide. This raised the natural question of how to measure experiment success. To capture the intrinsic value of experimentation, we developed Experiments with Learning (EwL).

Initially, our priority was to boost experiment velocity by providing teams with a safety net to mitigate bad decisions. But as teams increasingly adopted experimentation, our priority shifted to improving experiment quality to maximize insights from each test. 

Experimentation entails more than just optimization: On a more fundamental level, its purpose is to power well-informed product decisions. The goal of our EwL framework is to identify and celebrate experiments that provide the most meaningful insights.

From 10 million to 696 million users

Experimentation has been essential to Spotify since the company’s infancy, but in those early days our experiments were painfully time-consuming and manual. A major platform investment in 2018 established a new goal: All teams with user-facing missions should regularly experiment to learn from users and mitigate the risks of making bad decisions. With that goal came the need to help more teams experiment more frequently and in better ways.

To achieve this, we formed an engagement team — a center of excellence for internal customer success. Surveying the organization, we found that around 40 teams were regularly experimenting and set a target of expanding that group to about 300. 

The engagement team worked bottom-up across the company to establish foundations for an experiment-driven culture. With this view, we decided we should:

  • Build the tech: SDKs, instrumentation, analytical requirements.

  • Upskill teams: Ensure teams are conceptually able to experiment — unpack what experiments are to build understanding and trust.

  • Establish experiments as the norm: Build expectations among management and stakeholders that experiments should accompany new features and changes.


Figure 1: The Experiment-Needs Pyramid. Only teams that are technically able, know how, and are incentivized to experiment will experiment.

The new platform gave most Spotify teams the ability to run experiments. But having the capability alone doesn’t drive adoption — building a culture of experimentation takes more. So the engagement team concentrated on developing training materials, running internal sessions, and developing best practices teams could rely on. In parallel, we streamlined and simplified the platform itself. Our experience tells us that a user-friendly interface with curated options and features is essential to adoption. Having more tools than needed adds friction and fragmentation, hindering experimentation at scale.

By the end of 2021, the number of onboarded teams had surpassed 200. A year later, close to 300 teams were experimenting regularly.

From quantity to quality

At Spotify, experimentation has always been more than optimization. One of the most important use cases is risk management, and any measure of experimentation quality must capture that. That’s why we find win rates alone (the proportion of experiments finding winning treatments) too narrow. You can have a 0% win rate but still get huge business value from experimentation if you’re learning and mitigating the risks of making bad decisions.

In most products, experiment results aren’t balanced — they skew negative, as shown in Figure 2 below. That’s especially true for mature products like Spotify, where years of optimization mean that users are familiar with (and protective of) the experience. To make things better, you need a great idea and flawless execution. To make things worse, all it takes is a bug.


Figure 2: A schematic representation of the distribution of effects in a general experimentation program.

Confidence played a key role in this shift toward quality: Its wide adoption let us raise the bar by platformizing best practices, strengthening designs, and validating setups. But to boost the impact of experimentation across Spotify, we first had to understand why some teams weren’t learning from their tests — and start by clarifying why we experiment in the first place.

Experiments with Learning

The reasons we experiment

At Spotify, using experimentation to mitigate risk is key to protect teams from poor product decisions. Many changes, like system updates, refactors, or new infrastructure, aren’t intended to improve the user experience directly. They often carry only downside risk: Slower load times, higher crash rates, or increased network usage can all hurt engagement.

When changes do aim to improve the experience, we also care about quantifying impact. For example, new AI features may bring added costs, so we need reliable estimates to weigh benefits against trade-offs. Even when experiments don’t show significant improvements, the precision of the results lets us rule out effects of meaningful size.

Bottom line: A “successful experiment” isn’t just one that finds a win — it’s one that delivers trustworthy insights to guide product decisions. That broader definition became essential for advancing our experimentation culture.

The Experiments with Learning framework

The most fundamental motivation for experimentation at Spotify is our desire to learn. Our measurement of experiment success should quantify how much we are learning from our experiments. To reflect our high ambitions, our bar for what constitutes learning is high. 

What counts as an Experiment with Learning?

An Experiment with Learning is one that produces valid and decision-ready results (see the sketch after this list):

  • Valid: The setup and data collection worked as intended (e.g., traffic flowed, metrics were measured, no sample ratio mismatches).

  • Decision-ready:

    • Success: A key metric improves without regressions → Ship it.

    • Regression detected: Metrics worsen → Abort and iterate.

    • Neutral but informative: No effect but the test was strong enough to detect one if it existed → Iterate, abandon, or ship if infra-only.
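
To make this decision logic concrete, here is a minimal sketch of how an experiment could be classified under these rules. It is illustrative only, not Confidence’s actual implementation; the Experiment fields and helper names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Outcome(Enum):
    INVALID = "no learning: failed health checks"
    UNPOWERED = "no learning: neutral but underpowered"
    SHIP = "learning: success, ship it"
    REGRESSION = "learning: regression detected, abort and iterate"
    NEUTRAL_POWERED = "learning: neutral but informative"

@dataclass
class Experiment:
    passed_health_checks: bool         # traffic flowed, metrics measured, no sample ratio mismatch
    any_success_metric_improved: bool  # at least one success metric significantly better
    any_metric_regressed: bool         # any metric, including guardrails, significantly worse
    all_metrics_powered: bool          # every relevant metric reached its planned power

def classify(exp: Experiment) -> Outcome:
    """Classify an experiment according to the EwL rules sketched above."""
    if not exp.passed_health_checks:
        return Outcome.INVALID            # invalid setup: results can't inform a decision
    if exp.any_metric_regressed:
        return Outcome.REGRESSION         # abort and iterate
    if exp.any_success_metric_improved:
        return Outcome.SHIP               # a win with no regressions
    if exp.all_metrics_powered:
        return Outcome.NEUTRAL_POWERED    # reliably "no effect of the hypothesized size"
    return Outcome.UNPOWERED              # neutral, but too little data to conclude anything

def is_experiment_with_learning(exp: Experiment) -> bool:
    return classify(exp) in {Outcome.SHIP, Outcome.REGRESSION, Outcome.NEUTRAL_POWERED}
```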


Figure 3: Overview of Confidence, showing a Sankey plot of Spotify R&D experiments from the past six months — classified by whether they produced learning and reasons why.

Experiments without learning are not informative for decisions

A valid experiment is a prerequisite for learning. Unsuccessful experiments suffer from either an invalid implementation or insufficient data to draw accurate conclusions.

Failed health checks: invalid experiments

For an experiment to inform decisions, it must be correctly implemented. Tests that fail health checks can only produce partial or misleading results. Confidence helps detect these issues early by monitoring fail rates and providing tools to prevent them. Common causes include misconfigured serving (no users see the new experience), sample ratio mismatches, or pre-exposure differences that break fair comparisons.
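
As an illustration, the sample ratio mismatch check mentioned above can be approximated with a standard chi-square goodness-of-fit test. This is a generic sketch, not Confidence’s internal check, and the 0.001 threshold is an assumed convention rather than a Spotify default.

```python
from scipy.stats import chisquare

def has_sample_ratio_mismatch(observed_counts, expected_ratios, alpha=0.001):
    """Flag a sample ratio mismatch: the observed split across groups
    deviates from the intended allocation more than chance would explain."""
    total = sum(observed_counts)
    expected_counts = [total * ratio for ratio in expected_ratios]
    _, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)
    return p_value < alpha

# Example: a 50/50 test where the treatment group is suspiciously small.
print(has_sample_ratio_mismatch([50_310, 48_100], [0.5, 0.5]))  # True -> investigate before trusting results
```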

Unpowered: neutral metrics, insufficient data

At Spotify, we hold a high bar for decision-making: Experiments must be powered across all relevant metrics to count as informative. Because product decisions count on the collective evidence, we don’t treat an experiment as “powered” unless every metric is. This strict rule means a neutral experiment that falls short on even one metric counts as “no learning.” While loosening this standard might boost our EwL rate, we prefer it to reflect our ambitions.
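
Whether a single metric counts as powered can be checked with a standard power calculation. The sketch below uses statsmodels for a proportion metric; the 5% significance level and 80% power target are conventional assumptions here, not a statement of Spotify’s actual thresholds.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def is_metric_powered(baseline_rate, mde_relative, n_per_group,
                      alpha=0.05, target_power=0.80):
    """True if a proportion metric reaches the target power for the
    minimum detectable effect (MDE) the team planned for."""
    effect_size = proportion_effectsize(baseline_rate * (1 + mde_relative),
                                        baseline_rate)
    achieved_power = NormalIndPower().power(effect_size=effect_size,
                                            nobs1=n_per_group,
                                            alpha=alpha,
                                            ratio=1.0)
    return achieved_power >= target_power

# Per the strict rule above, the experiment only counts as powered if every metric is.
powered_by_metric = {"stream_starts": True, "crash_rate": True, "search_clicks": False}
experiment_is_powered = all(powered_by_metric.values())  # False -> classified as "no learning"
```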


Figure 4: Unpowered Experiments: Count by Percent of Powered Metrics.

Figure 4 shows the percentage of metrics powered in experiments classified as neutral unpowered. While some lack power for all metrics, most have several powered — just not enough across the board. Without all metrics powered, experimenters can’t reliably tell whether no effect exists or if they simply collected too little data. Therefore, we can’t guarantee the risk level of the experiment, and we classify it as “no learning.”

Aborted: experiments that end early

We can’t read experimenters’ minds, but we can ask. When experiments are stopped, we prompt experimenters to share a brief reason, turning open-ended outcomes into direct feedback on their needs. These answers give us further insight, allowing us to separate “Aborted” from “Unpowered” in the EwL Sankey chart (Figure 3).

Experiments with Learning are informative for decisions

Ship: win rate

Positive decisions aren’t the full story, but they’re important. Many tests aim to show meaningful improvements, which requires at least one success metric to improve significantly — and no signs of regression in any metric, including guardrails.

Abort due to regression

Less exciting, but often more important are experiments that detect regressions — significant shifts in the wrong direction. These insights help teams make better product decisions and are classified as EwLs. They may not be the wins experimenters hoped for, but they’re valuable for the business and reflect Spotify’s fail fast, learn fast approach, where early failure detection fuels faster iteration.

Powered, no recommendation: neutral metrics but informative

Not every experiment moves the needle, and many end with neutral results. When all success metrics are neutral and the test is adequately powered, we still classify it as an EwL. In these cases, the experiment reliably shows that the intended changes had no effect — and that effects of the hypothesized size would likely have been detected.

Spotify’s learning rate

Before looking at the numbers, it’s important to understand which experiments are included when we evaluate learning rate. The actual value of the EwL metric is highly influenced by which experiments are included in the calculation, and meaningful comparisons between experiment programs can only be made if the criteria are the same. Our criterion is any experiment run within Spotify R&D. This excludes evaluations of individual ad campaigns since our focus is experimentation for product development. 

We count restarts of the same experiment separately since they usually occur when there are bugs or misconfigurations. While counting all of “the same” iterations as one would increase the learning rate, it would also mask valuable signals. Our definition emphasizes transparency and highlights areas for improvement. 

Under these inclusion criteria, Spotify’s average learning rate is 64%, with variation across teams from 16% to 76%. If we wanted the metric to reflect the proportion of ideas, rather than iterations, leading to learning, duplicates could be removed.
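
For completeness, the learning and win rates we report are simple proportions over the experiments that meet these inclusion criteria. A minimal sketch, reusing the hypothetical classify helper from the earlier sketch:

```python
def win_and_learning_rate(experiments):
    """Win rate and learning rate over the included experiments
    (all R&D experiments, with restarts counted separately)."""
    outcomes = [classify(exp) for exp in experiments]
    total = len(outcomes)
    win_rate = sum(o is Outcome.SHIP for o in outcomes) / total
    learning_rate = sum(o in {Outcome.SHIP, Outcome.REGRESSION,
                              Outcome.NEUTRAL_POWERED} for o in outcomes) / total
    return win_rate, learning_rate
```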


Figure 5: (Left) Win rate and learning rate across the Spotify org. (Right) The proportion of experiments run in the different parts of the org.

Figure 5 shows win and learning rates across different parts of Spotify. Our experimentation isn’t one single program but several, varying in maturity — some refined over a decade, others just beginning.

The gap between win rate and learning rate highlights the value of the EwL framework. In the two orgs running ~80% of Spotify experiments, the learning rate is ~64%, but the win rate is ~12%. Most of our learning doesn’t come from wins — it comes from discovering what not to ship. Without experiments, we’d have shipped more harmful or neutral than helpful changes.

For a mature product like Spotify, preventing regressions is just as important as chasing wins. That’s why Confidence automatically monitors key business metrics: Spotting regressions is itself a major success, protecting user satisfaction and sustaining the business.

Low win rates in our most experiment-heavy orgs may look surprising, but they’re not. These areas are already highly optimized, so improvements are harder to find. Many changes aren’t meant to boost user experience directly but to keep Spotify scaling smoothly as hundreds of millions join — work that often surfaces regressions rather than “wins.”

Just as important, the EwL framework highlights residuals — experiments with no learning. Even these teach us about our experimenters and our platform. That’s why EwL is central to improving both Confidence and our practices. Next, we’ll show how it drives innovation, operational efficiency, and better experimentation across Spotify.

How does Spotify use the metric?

The EwL framework, developed by our platform insights team and experimentation product owners, gives us a clear view of experimentation quality and business impact. For Confidence users, it shows how effectively teams are running experiments. For the broader platform org, it also serves as a productivity signal, along with DevOps productivity metrics.

Strategic impact: guiding innovation and investment

The EwL framework’s biggest value is strategic — showing the return on learning from innovation efforts. For example, a stable learning rate but declining win rate suggests strong experiment quality but also that the experience may be hitting diminishing returns and need bolder bets.

It’s equally important to identify programs that struggle with failed experiments, as well as those where learning rates are high but returns are low, so that testing bandwidth isn’t wasted. Testing capacity is one of the true currencies of innovation speed: Using it efficiently is crucial for getting ahead.

Operational excellence: managing testing bandwidth

Testing all our ideas is a constant challenge. Even with Spotify’s high app traffic, we constantly struggle to test more ideas with higher precision. Last year, more than 58 teams ran 520 experiments on Spotify’s mobile home screen alone. This seems common among big tech companies. Mark Zuckerberg recently discussed in a podcast how one of the main AI-innovation limitations is A/B testing throughput. The EwL framework helps us triage high-traffic app surfaces. We can analyze all programs running on important app surfaces and identify both promising programs that should get more testing bandwidth and programs with diminishing returns that can be scaled down.

Improving practice and Confidence

Another use case is tracking experimentation practices. If the learning rate is low, the framework directly hints at why. Is it poor planning leading to many underpowered experiments? Are there technical integration and setup issues leading to invalid experiments? These patterns are typically easy to recognize, enabling action. We’ve seen that small things like adding experiment reviewers can drastically improve learning rates, both because practices improve when there are more eyes on experiments and because ideas get discussed and scrutinized more before tests start.

Customer feedback also informs us as platform owners. Their input helps distinguish between friction caused by team maturity (e.g., training gaps) and platform needs (e.g., tooling accuracy or integrations). In the latter case, we evolve Confidence to better support experimenters.

Over time, the EwL metric has driven many improvements in Confidence. For example, to address the challenge of designing and powering experiments, we refined the sample size calculator so it helps teams plan experiments and run them with a high chance of getting an answer. If health checks fail at high rates for a Confidence customer, it can signal that setting up robust experiments is too hard in that tech stack and that we need better integrations or documentation. This feedback loop has long helped us improve both practices and tooling — and adapt experimentation processes, access controls, and review requirements across different parts of Spotify.
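
A planning-time counterpart to the power check shown earlier is solving for the sample size needed to detect the hypothesized effect. The sketch below again uses statsmodels, with conventional 80% power and 5% significance defaults as assumptions; it is not Confidence’s actual calculator.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def required_users_per_group(baseline_rate, mde_relative,
                             alpha=0.05, power=0.80):
    """Users needed per group to detect a relative lift of `mde_relative`
    on a proportion metric at the given power and significance level."""
    effect_size = proportion_effectsize(baseline_rate * (1 + mde_relative),
                                        baseline_rate)
    n_per_group = NormalIndPower().solve_power(effect_size=effect_size,
                                               alpha=alpha, power=power,
                                               ratio=1.0)
    return int(round(n_per_group))

# Example: detecting a 1% relative lift on a 40% baseline rate.
print(required_users_per_group(0.40, 0.01))  # roughly 236,000 users per group
```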

Successful learning and avoiding gamification

So what is a good value for the EwL metric? We’ve mainly focused on what we can improve given the metric’s results rather than aspiring to specific values. That said, setting a target would give us reference points for contrasting local learning rates across the organization. There’s also context to consider when pushing for more learning outcomes. Just as we run health checks on key business metrics for each experiment, we should define guardrails for this metric — things we’re not willing to sacrifice to improve the learning rate. Key guardrails to avoid gamification of the EwL metric include:

  • Win rate: maintaining a healthy share of positive results

  • Experiment volume: keeping the number of completed experiments high so progress doesn’t slow

  • Precision: ensuring effect sizes are well defined and estimates remain reliable

For example, raising minimum detectable effect sizes could reduce “no learning” outcomes by making more tests appear neutral and powered, but at the cost of precision. EwL helps us balance these trade-offs without undermining product improvement speed or experiment quality. 

We also believe some experiments without learning are healthy for an innovation culture. Driving their share all the way to zero would inevitably mean adding friction that slows innovation and iteration. Striking the right balance between precision and iteration speed is key — some failed experiments are natural and acceptable parts of fostering fast-moving, innovative cultures.

Conclusion

Our EwL framework redefines experimental success by shifting focus from a narrow win rate to the broader value of generating decision-informing insights. By classifying any valid experiment that detects a win, a regression, or a conclusive neutral result as a “learning,” we acknowledge that the primary value of experimentation comes from mitigating risk and understanding what doesn’t work. Our EwL rate of ~64% and win rate of ~12% clearly show there’s more to learn than just what wins. Our EwL metric is a strategic tool that helps us guide investment, manage testing capacity, improve experimentation practices, and evolve Confidence — fostering an innovative culture that learns quickly from every outcome.

Mentions: The Experiments with Learning framework was a collaboration among many people at Spotify. Special thanks to Lizzie Eardley, Caroline Thordenberg, and Johan Rydberg.

Want to experiment like Spotify? Learn more about Confidence. Confidence is coming to Spotify Portal! Sign up for our webinar to learn more.