Coming Soon: Confidence — An Experimentation Platform from Spotify

August 3, 2023. Published by Tyson Singer, Head of Technology and Platforms

TL;DR: Spotify is releasing a new commercial product for software development teams: a version of our homegrown experimentation platform that we’re calling Confidence. Based on everything we’ve learned over the last 10+ years about what it takes to enable experimentation at scale, the platform makes it easy for teams to set up, run, coordinate, and analyze their own user tests, from simple A/B testing to the most complex use cases, so they can quickly validate their ideas and optimize them for impact. We’ve designed Confidence to be flexible, extensible, and customizable, with the goal of making the platform simple to get started with and impossible to outgrow, so that experimentation becomes an integral part of your org, like it is for ours.

Note: Confidence is available now as a private beta only. Sign up for the waitlist to be eligible for an invite and to get updates on features, demos, release dates, and other product news.  

One platform for a million ideas

Spotify’s data scientists and engineers have been developing and honing our product testing methods for years. Whether it’s automatically coordinating simultaneous A/B tests or orchestrating the rollout of an AI recommendation system across mobile, desktop, and web, the platform we built scales experimentation best practices and capabilities to all our teams. Soon this experimentation platform will be available to any company that wants to build, test, and iterate ideas the way we do at Spotify: quickly, reliably, and with confidence. 

How did we get here? We didn’t realize it when we started down this road years ago with our first homegrown experimentation tools, but we’ve been on a decade-long journey to take A/B testing to the next level. As our vision for what Spotify could be expanded, so did our need for an experimentation platform that could scale and keep pace with us. So that’s what we’ve built. And it’s what enables the experimentation culture at Spotify to thrive today.

Building a culture of experimentation

With hundreds of squads and thousands of developers, designers, data scientists, and PMs, there is no shortage of ideas at Spotify. “What if we used different playlist art for different regions?” “What if you could preview the most interesting parts of podcast episodes just by swiping through them?” “What if every listener had their own personal DJ?” As a product-focused company, we are always looking for ways to add value and deliver a great experience for our users, from listeners to creators to advertisers. 

We don’t want to slow down the flow of those ideas or get in the way of the development cycle. Our philosophy is “Think it, build it, ship it, tweak it.” Shipping more ideas faster gets us to the best ideas faster. But how do we know which ideas are great? And which ideas are just learning experiences for the next idea? 

Data or it didn’t happen

We’ve come a long way from being just a music player. From playlists using Algotorial technology to our annual Wrapped campaign to an AI DJ — even the most ambitious ideas at Spotify got their start as just another idea in an ocean of ideas. Each one is a tiny spark: bright, shiny, new — and totally unproven. 

What we’ve learned over time is that, no matter how exciting the idea or how compelling the information behind it, if you’re not running controlled experiments, then you’re not confronting your ideas with reality. Customer feedback, intuition, and creativity are all essential tools for bringing innovations to market. But without a solid scientific method and the engineering infrastructure to help make data-informed decisions, your teams will be forever chasing ideas instead of shipping and improving the ones that have the most impact on your users and your business.

From a handful of experiments to hundreds

Our early experimentation efforts began more than a decade ago. In the early 2010s (let’s carbon-date it to around when Adele’s “Rolling in the Deep” was climbing the charts), a few data scientists and engineers started conducting small A/B tests internally. These tests were manual and error-prone, but we believed in the importance of experimenting and wanted to get better at it. 

So, we decided to build our own A/B testing platform, which we called ABBA. ABBA was pretty basic: it did feature flagging and analysis for a set of standardized metrics. The simplicity and flexibility of ABBA unlocked a wave of experimentation across the company. We grew from running fewer than 20 priority experiments per year to running hundreds of experiments per year across multiple squads. 
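ABBA’s internals were never published, but the core of any feature-flagging A/B tool is deterministic bucketing: hash a user ID together with an experiment name so the same user always lands in the same variant, with no assignment state to store. Here’s a minimal sketch of that idea in Python; the function and experiment names are hypothetical, not ABBA’s actual interface.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants: list) -> str:
    """Deterministically bucket a user into a variant.

    Hashing user_id together with the experiment name gives each
    experiment an independent, repeatable split: the same user always
    sees the same variant, and no assignment state needs to be stored.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Example: a 50/50 A/B test behind a feature flag
variant = assign_variant("user-123", "new-home-screen", ["control", "treatment"])
show_new_home = variant == "treatment"
```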

More testing is not the same as better testing

Around 2018 (when Drake’s “God’s Plan” was the top-streaming song), Spotify launched a revamped free tier, and with the sudden influx of users, we were presented with even more experimentation opportunities. By this time, we had migrated to Google Cloud, and access to all the raw processing power of BigQuery made getting test results faster and easier. So, as the business continued to grow, we continued to increase what we tested. 

Then a funny thing started to happen: the more testing we did, the more we could see flaws in the testing methods themselves. Our teams were getting bogged down by restarting experiments, manually calculating statistical analyses in notebooks, and coordinating test groups in spreadsheets. Tying features to specific tests via feature flags also started to prove restrictive. As the bottlenecks, workarounds, and errors continued to pile up, it was clear that we were running into the limits of what ABBA could do for us.

We needed to be able to scale our testing methods across more devices and software platforms, new and more complex use cases (testing a personalized, context-aware recommendation system powered by machine learning is very different from testing the color of a button), an expanding user base (with more cloud processing power came more data management issues), and, most importantly, a growing number of teams, which meant more experiments crashing into each other than ever. In short, we needed to learn how to experiment both better and faster. We needed to learn how to experiment at scale.

Learning how to learn better

So we took everything we learned from ABBA and started over. We began building new tools and incorporating more advanced testing methods into how we work. We also started to automate some of these scientific best practices so that it was easier for teams to set up controlled experiments themselves, without having to coordinate or schedule test groups with others. And that’s where our new Experimentation Platform (EP for short) came in. 

Our data platform team introduced two major improvements with EP: (1) a new Metrics Catalog that made analyzing metrics self-service and eliminated the need for data scientists to run analysis manually in notebooks, and (2) a coordination engine that allowed us to run many mutually exclusive experiments at the same time, including managing holdback groups. 
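The coordination engine itself isn’t public, but a common way to enforce mutual exclusivity is to partition a shared hash space (a “layer”): experiments in the same layer claim disjoint slices of it, and a reserved slice acts as the holdback. The sketch below illustrates that idea under those assumptions; the salts, bucket count, and traffic shares are all hypothetical.

```python
import hashlib
from typing import Optional

BUCKETS = 1000  # resolution of the shared hash space

def bucket(user_id: str, salt: str) -> int:
    """Deterministically map a user into one of BUCKETS positions per salt."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % BUCKETS

def assign_exclusive(user_id: str, layer_salt: str,
                     experiments: dict, holdback: float = 0.05) -> Optional[str]:
    """Assign a user to at most one experiment in a shared layer.

    Experiments in the same layer occupy disjoint bucket ranges, so no
    user can receive two treatments at once. The top `holdback` slice
    of the space is kept out of every experiment.
    """
    position = bucket(user_id, layer_salt) / BUCKETS  # in [0, 1)
    if position >= 1.0 - holdback:
        return None  # holdback group: never exposed to any experiment
    start = 0.0
    for name, share in experiments.items():
        if start <= position < start + share:
            return name
        start += share
    return None  # unallocated traffic

# Two mutually exclusive experiments sharing one layer
assigned = assign_exclusive("user-123", "home-layer",
                            {"exp-artwork": 0.4, "exp-preview": 0.4})
```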

With EP, any team at Spotify can run any kind of experiment with the confidence that, at the end of it, they’ll have insights they can trust and use to move forward in an informed way.

From hundreds of experiments to thousands

Once we made it easier for teams to create, run, manage, and analyze experiments on their own and in a scientifically reliable way, naturally, more teams ran more experiments. And so the culture of experimentation at Spotify grew even more. By the time we turned off ABBA in 2020, we’d gone from running hundreds of experiments to running thousands of experiments per year across virtually every aspect of our business. 

That culture of experimentation is ingrained throughout our engineering organization and how we build solutions — not just in how we develop features for our apps, but also in how we improve backend services and data pipelines. This virtuous cycle of learning about what is working and what isn’t — many teams testing and shipping, testing and shipping — is what we were able to unlock at scale using the internal experimentation platform we built for ourselves. And now we’re making a commercial version available to everyone. 

We 💚 platforms

This tune may be familiar if you’ve followed the evolution of Backstage — our homegrown developer portal, which we open sourced and donated to the CNCF three years ago. That, too, was a platform — a way to unlock the potential of many independent teams by bringing them together on a shared set of tooling and principles in order to solve common problems. 

As with Backstage, providing a great developer experience is key to the platform’s success: making sure the best way for your developers to do something is also the easiest and most supported way. As our own teams adopted this way of doing things, we’ve come to think of experimentation not as a tool our teams pick up and sometimes use, but as a capability they always possess. 

That’s what we’re aiming to deliver with Confidence, the latest iteration of our Experimentation Platform. Scientific best practices are built right into the platform so that many different teams can run many different experiments reliably and quickly at scale. 

An experimentation platform that scales with you

There is seldom a one-size-fits-all solution to experimentation. If you’re serious about using A/B testing to validate ideas against real user behavior and about working in a data-informed way, you need a platform that works across a wide range of needs and use cases. From usability to messaging to advertising to acquisition funnels and beyond, Confidence can help you find answers to questions of every shape and size.

Extensible and customizable

Throughout our journey, we’ve compared notes with other companies struggling to scale reliable experimentation practices within their organizations. Often these companies have outgrown their existing A/B testing tools (whether purchased off-the-shelf or built internally) and are now seeking greater customization for how they run experiments. 

But not every company is at the same point in this journey. Confidence is designed to bring value whether you’ve outgrown your current testing platform or are looking for a quick, easy way to get started with A/B testing that will scale with you as your needs change. 

One platform, available three ways

To make it easier to meet your needs, the Confidence platform will be available to customers in three ways:

1. Managed service. Want to get up and running quickly and with the lowest technical overhead? Run the experimentation platform as a standalone web service managed by our team.

2. Backstage plugin. Already have a Backstage instance running (or want to get started)? Get all the features of Confidence as a plugin next to your other developer tools. This is how we run our experimentation platform at Spotify.

3. APIs. Need more customization? Want to build a bandit or do switchback testing? Integrate the Confidence platform into your own infrastructure with maximum flexibility and extensibility. Confidence provides the capabilities to build what you need (see the sketch after this list for one example).
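The Confidence APIs aren’t public yet, so as a concrete stand-in, here is a minimal, self-contained epsilon-greedy bandit in Python: the kind of adaptive allocation you could drive through such APIs once variant serving and reward logging are wired up. Everything here (the class, variant names, rewards) is illustrative, not Confidence’s actual interface.

```python
import random

class EpsilonGreedyBandit:
    """Epsilon-greedy allocation over experiment variants.

    With probability epsilon, explore a random variant; otherwise
    exploit the variant with the best observed mean reward so far.
    """

    def __init__(self, variants, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {v: 0 for v in variants}
        self.rewards = {v: 0.0 for v in variants}

    def _mean_reward(self, variant):
        # Unexplored variants get +inf so each is tried at least once.
        n = self.counts[variant]
        return self.rewards[variant] / n if n else float("inf")

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))
        return max(self.counts, key=self._mean_reward)

    def record(self, variant, reward):
        self.counts[variant] += 1
        self.rewards[variant] += reward

bandit = EpsilonGreedyBandit(["control", "swipe-preview", "tap-preview"])
chosen = bandit.choose()           # serve this variant to the next user
bandit.record(chosen, reward=1.0)  # e.g., 1.0 if the user engaged
```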

We believe these three options will make it easy to access and grow with the platform, no matter what your company’s needs are today or tomorrow. 

Sign up for the beta

We’re really excited to share this new platform with you. Confidence is currently available for select customers in private beta. You can sign up for the private beta waitlist on our website and join our mailing list on the same form to get updates on all things Confidence. 

Appendix: Engineering better experimentation 

Learn more about experimentation at Spotify — including a little light reading on automated salting and bucket reuse, choosing sequential testing frameworks, comparing quantiles at scale, and how we scale other scientific best practices across the org — all right here on the Spotify Engineering blog: 

  • Spotify’s New Experimentation Platform (Part 1): How we went from our first A/B testing tool, ABBA, to building EP, the internal experimentation platform we use today and that Confidence is based on. Learn why we replaced feature “flags” with “properties” for Remote Configuration, how we moved from notebook-based analyses to the Metrics Catalog, and how we manage and orchestrate experiments using the Experiment Planner.
  • Spotify’s New Experimentation Platform (Part 2): More features of our internal platform, including: coordinating many experiments at once while preserving exclusivity and holdbacks, using our “salt machine” to automatically reshuffle users without the need to stop and restart experiments, the importance of setting up both success and guardrail metrics up front, and how validity checks and gradual rollouts further protect you from errors and unexpected regressions. 
  • Comparing Quantiles at Scale in Online A/B Testing: How we use the Poisson bootstrap algorithm and quantile estimators to easily calculate bootstrap confidence intervals for difference-in-quantiles in A/B tests with hundreds of millions of observations (a minimal sketch of the core idea follows this list).
  • Experimenting at Scale, the Spotify Home Way: How we use our internal Experimentation Platform to run over 250 tests a year on our Home screen alone, coordinating the work of dozens of teams, each one inventing new kinds of personalized experiences for hundreds of millions of users.
  • Experimenting with Machine Learning to Target In-App Messaging: We believed that we could use machine learning to determine who should receive in-app messages and that this more precise targeting would improve user experience without harming business metrics. To find out if our hypothesis was correct, we used uplift modeling to try to directly model the effect of in-app messaging on user behavior.
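As a taste of the quantile post above, here is a deliberately naive Python sketch of a Poisson bootstrap confidence interval for a difference in quantiles. Each observation receives an independent Poisson(1) weight, which approximates classic multinomial resampling while parallelizing trivially; the linked article describes a far faster approach (sampling order-statistic indices directly) that this sketch does not implement. The data below is synthetic.

```python
import numpy as np

def poisson_bootstrap_quantile_diff(control, treatment, q=0.5,
                                    n_boot=2000, alpha=0.05, seed=42):
    """CI for the q-th quantile difference (treatment minus control).

    Poisson bootstrap: rather than drawing n observations with
    replacement, weight each observation by an independent Poisson(1)
    draw. np.repeat materializes the weighted resample, which is the
    slow-but-clear way to do it.
    """
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        w_c = rng.poisson(1, size=len(control))
        w_t = rng.poisson(1, size=len(treatment))
        q_c = np.quantile(np.repeat(control, w_c), q)
        q_t = np.quantile(np.repeat(treatment, w_t), q)
        diffs[i] = q_t - q_c
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

# 95% CI for the median difference on synthetic data
rng = np.random.default_rng(0)
control = rng.exponential(1.0, size=10_000)
treatment = rng.exponential(1.05, size=10_000)  # hypothetical 5% shift
low, high = poisson_bootstrap_quantile_diff(control, treatment)
```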

Hear about it from the people who lived it. Listen to Spotify’s experimentation journey on the NerdOut@Spotify podcast:

  • Episode 20: The Rise and Fall of ABBA: Host Dave Zolotusky talks with Mark Grey, a senior staff engineer and 10-year Spotify veteran, about our very first A/B testing tool, ABBA, and early lessons about doing product experimentation at scale.
  • Episode 21: The Man Who Killed ABBA: Dave and Mark are joined by another longtime Spotify engineer, Dima Kunin. They talk about why we replaced ABBA with Spotify’s current internal Experimentation Platform, how we built it, and how it enabled our teams to go from running hundreds of experiments to thousands.
