How We Improved Developer Productivity for Our DevOps Teams
Across Spotify, our teams diligently strive to fulfill our mission to “unlock the potential of human creativity by giving millions of creative artists the opportunity to live off their work, and billions of fans the opportunity to enjoy and be inspired by it”. As product managers in the Platform Developer Experience (PDX) Tribe, part of Spotify’s Technology Infrastructure Group, we focus on unlocking the creativity of engineers by building tools and establishing best practices that automate processes to make Spotify a true DevOps company. Doing so helps our teams experiment, learn, and launch features quickly.
Focusing on speed, for us, means quickly turning our ideas into products and experimenting to improve the user experience, growing into new markets, and remaining competitive as a content streaming provider. As part of that effort, PDX builds CI/CD tools, web tools, testing tools, and other tools for developers to improve their productivity. These tools give engineers the ability to focus on creating and running experiments to get data that helps determine if their idea had the desired effect.
We need to experiment quickly
We need to give developers at Spotify the right tools that will help them deploy various types of A/B tests to users fast. Our vision is that teams can come up with an idea in the morning, ship their experiment during the day, and receive data on it by the end of the day.
To understand how we can help teams at Spotify ship experiments fast, we need to understand how they work.
Spotify has an “ops in squad” model where every team owns and is responsible for a feature end-to-end, working with both development and operations to build all aspects of the feature. The teams are highly autonomous and consist of individuals who have all the competencies needed to ship a feature. Though this allows us to operate quickly, we’re finding that it has also created fragmentation among our toolsets and engineering practices.
The products we’ve built were created to reduce the number of technology choices any team needs to make and automate processes like building data pipelines, backend services, and websites, thus eliminating “distractions” from the work itself.
Automating the workflow to decrease build times
Previously, it took a web developer about 14 days to build a campaign site, which required heavy lifting to manage integrations between different technologies. To create a website, you used to have to complete these steps:
- Create your repo in GitHub.
- Make the developer portal aware of it, manually.
- Set up a Google Cloud project, manually.
- Set up Jenkins CI and manually create a pipeline.
- Read a bunch of documentation on how to deploy your website.
- Ask your friends on Slack.
- Pull out your hair…
There was clearly an opportunity for us to build a new process, automate the project setup, and allow our engineers to move faster. As an infrastructure team, we consistently push to create abstractions that reduce low-variance work. We have accomplished this by developing a set of tools: Golden Paths, Backstage, Tingle CI/CD, and our test-certification program. Our engineers can then focus on delivering quality code and can produce more frequent releases with confidence. Automating the workflow with these tools decreased the time required to get a basic service up and running from 14 days to less than 5 minutes. In total, a developer could set up the framework for a site like Spotify Wrapped, with a URL, repository, and CI/CD, in one day.
Let’s start with the Golden Path
In an effort to reduce fragmentation in our infrastructure, we introduced a set of best practices dubbed Golden Paths. Our Golden Paths were created to help engineers start new projects quickly. They use a wizard format to reduce the number of decisions engineers have to make to build backend services, application features, data pipelines, machine learning projects, and web apps. Golden Paths are also accompanied by documentation that describes the best engineering practices for each type of project.
For example, by just following the wizard in our developer portal, Backstage, a web engineer can set up a skeleton for a new website and start developing it. This is a huge reduction in the amount of manual work that a developer previously had to do. Once the project setup is done, a GCP project, a repository in GitHub Enterprise (GHE), and a CI/CD configuration have been created for the project. The developer can then start committing code to GHE. It used to take about 14 days just to get started building a new website. Now it takes less than five minutes!
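To make the wizard idea concrete, here is a minimal sketch of how a single template choice could expand into the setup steps a developer used to do by hand. It is an illustration only, not Backstage's actual implementation: the `ProjectSkeleton` class, the `plan_project` function, and the template names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical template names, loosely based on the project types in this post.
TEMPLATES = {"backend-service", "data-pipeline", "website"}


@dataclass
class ProjectSkeleton:
    """The resources one wizard run would request (illustrative only)."""
    name: str
    template: str
    steps: List[str] = field(default_factory=list)


def plan_project(name: str, template: str) -> ProjectSkeleton:
    """Return the ordered setup steps that the wizard would automate."""
    if template not in TEMPLATES:
        raise ValueError(f"unknown template: {template}")
    skeleton = ProjectSkeleton(name=name, template=template)
    skeleton.steps = [
        f"create GCP project for {name}",
        f"create GitHub Enterprise repository {name}",
        f"register {name} in the developer portal",
        f"add CI/CD configuration from the {template} template",
    ]
    return skeleton
```

The point of the sketch is that the engineer makes one decision (the template) and every other step follows from it, which is how a wizard reduces the number of choices a team has to make.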
Automatically building websites
When code is committed to GitHub, the CI/CD system automatically builds the project, publishes feedback in GitHub, and sends the developer a notification about the status of the build and when the deployment is complete.
The system building the website is Tingle. Tingle was created in 2017 to serve as our centralized CI/CD system for backend, data, and web services. Our vision for this tool was to provide a common CI/CD experience for all kinds of jobs to help teams set up new projects quickly without needing to understand how to configure a build pipeline.
Tingle automatically builds, tests, packages, and deploys changes to production in the normal GitHub workflow. Results from the build are then presented in GitHub and in our developer portal Backstage. If all tests pass, the component is deployed to production. No setup or configuration is required with Tingle. We’ve replaced over 200 stand-alone Jenkins machines that were previously managed by our feature teams. Now, the system runs tens of thousands of builds every week.
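The build, test, package, and deploy flow described above can be sketched as a gated pipeline in which a later stage runs only if every earlier stage succeeded. This is a simplified illustration rather than Tingle's real code; `run_pipeline` and the stage callables are hypothetical names.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (succeeded, log), where log records the result of each
    stage that actually ran.
    """
    log: List[str] = []
    for name, action in stages:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            # Later stages (for example deploy) never run after a failure.
            return False, log
    return True, log
```

With stages ordered as build, test, package, deploy, a failing test stage means the deploy stage is never reached, which matches the rule that a component is deployed only if all tests pass.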
Helping engineers confidently perform continuous deployment
It’s not enough to create these infrastructure tools and then hand them over to our feature teams as they ride off into the sunset to build new products. Our job as an infrastructure team is to support our feature teams and provide developers with insights regarding the quality of their code. This is where the test-certification program we created for each major engineering discipline benefits engineers.
Using a gamified experience, we encourage developers to subject their code to the appropriate tests. We also inform them when their code contains unreliable tests (also known as flaky tests). When a service fulfills the certification requirements, a badge is automatically displayed next to it. This informs users that the service is being maintained and follows best practices for quality controls. Additionally, we provide reports on build times, code coverage, and reliability of test suites to give developers insight into the quality of their code. Since 2018, we’ve noticed that teams that have invested time in test certification have seen a drastic drop in blocking bugs and reactive work.
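One common way to spot a flaky test is to look for a test that both passed and failed against the same commit, since the code did not change between runs. The sketch below shows that heuristic together with a toy badge rule. Both are assumptions made for illustration: the post does not describe how flaky tests are detected or what the certification requirements are, and `find_flaky_tests`, `earns_badge`, and the 80% coverage threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# One record per test execution: (test name, commit id, passed?)
Run = Tuple[str, str, bool]


def find_flaky_tests(runs: Iterable[Run]) -> Set[str]:
    """Flag tests that both passed and failed on the same commit."""
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for test, commit, passed in runs:
        outcomes[(test, commit)].add(passed)
    return {test for (test, _), seen in outcomes.items() if len(seen) == 2}


def earns_badge(coverage: float, flaky: Set[str],
                min_coverage: float = 0.8) -> bool:
    """Hypothetical rule: enough code coverage and no flaky tests."""
    return coverage >= min_coverage and not flaky
```

A test that fails on one commit and passes on a different one is not flagged here, because that difference may reflect a real code change rather than unreliability.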
DevOps operations for autonomous teams
Creating Golden Paths, Tingle, and the test-certification program helps us standardize our engineering efforts for creating new websites and backend services. But, as mentioned above, our teams are autonomous and are free to choose the tools and methods that fit with how they operate. This presents a unique challenge when establishing best practices, and our engineering community is currently combating fragmentation to improve productivity. Keeping this in mind, we wanted to keep these services and methods flexible so teams can choose how they want to use them.
We provided three options for how to utilize our services:
Recapping our efforts to improve our DevOps organization
We’ve worked on reducing the number of decisions an engineer has to make, and reduced the complexity of our technology by automating the integration of the underlying infrastructure. Consistently creating tools and services that allow engineers to work further up the stack improves our DevOps organization. And improving the infrastructure for autonomous teams is necessary to achieve the scale and speed that we are aiming for as an engineering community. Using the tools referenced above, our feature teams can now focus on collaborating and innovating to make Spotify’s services better for fans and artists.
For websites and backend services, we reduced the time from set-up to the first build and deployment from 14 days to 5 minutes by:
- Launching the Golden Paths best practices and standardizing the configuration for building and deploying projects.
- Building Tingle, our CI/CD tool that automatically builds and deploys projects.
- Creating the test-certification program to give engineers insights into the quality of their code.
The tools described in this blog post were developed by the teams responsible for infrastructure and developer tooling. A special shout-out and thank you to the following teams for their amazing work: Web-infra, Test-infra, Pipedream, Tools, and Pulp-fiction.