Incident Report: Spotify Outage on March 8, 2022

March 11, 2022 Published by Spotify Engineering

On March 8, we experienced a global outage triggered by issues in a cloud-hosted service discovery system used at Spotify. We were made aware of issues with login at 18:12 UTC / 13:12 ET and started implementing fixes to critical systems at 18:39 UTC / 13:39 ET. This outage affected our users and we apologize for the inconvenience it may have caused. Our service has now fully recovered.

What happened?

The Spotify backend consists of multiple microservices that communicate with each other. For microservices to be able to find each other, we use multiple service discovery technologies. Most of our services use a DNS-based service discovery system; however, some of our services use an xDS-based traffic control plane and discovery system called Traffic Director.

On March 8, Google Cloud Traffic Director experienced an outage. This, in combination with a bug in a gRPC client library, caused the Spotify outage that affected many of our users: if you were logged out of a Spotify app, you were unable to log back in.

As soon as the problem was discovered, we rolled out configuration changes to revert the affected systems to our DNS-based service discovery, and they recovered gradually. See the timeline below for more details.

Timeline

18:12 UTC / 13:12 ET – Reports of users being logged out of the different client apps started to surface.

18:39 UTC / 13:39 ET – We began putting remediations in place to restore the affected systems.

20:35 UTC / 15:35 ET – Incident fully mitigated at Spotify.

Where do we go from here?

In the short term:

  • We are working with Google Cloud to better understand how issues with Traffic Director resulted in a large outage affecting Spotify’s users.
  • We will add monitoring and alerting so that we catch similar service discovery problems earlier.

We will continue to invest in resiliency by identifying and implementing additional safety nets in terms of monitoring, automatic error detection, and self-recovery.
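
As an illustration only, one kind of self-recovery safety net is a client-side lookup that tries a primary discovery backend and falls back to a secondary one when the first fails. The sketch below is a minimal example under stated assumptions and is not Spotify's implementation: the names resolve_with_fallback, control_plane_lookup, and dns_lookup are hypothetical, and the two lookup functions are stand-ins that simulate an unavailable control plane and a working DNS-based lookup.

```python
from typing import Callable, List, Sequence

# A resolver maps a service name to a list of "host:port" endpoints.
Resolver = Callable[[str], List[str]]


class DiscoveryError(Exception):
    """Raised when no resolver can return endpoints for a service."""


def resolve_with_fallback(service: str, resolvers: Sequence[Resolver]) -> List[str]:
    """Try each resolver in order and return the first non-empty result."""
    errors = []
    for resolver in resolvers:
        try:
            endpoints = resolver(service)
        except Exception as exc:
            # A failing backend must not stop us from trying the next one.
            errors.append(exc)
            continue
        if endpoints:
            return endpoints
    raise DiscoveryError(f"no endpoints found for {service!r}: {errors}")


# Hypothetical stand-in: simulates a control plane that is down.
def control_plane_lookup(service: str) -> List[str]:
    raise ConnectionError("control plane unavailable")


# Hypothetical stand-in: simulates a working DNS-based lookup.
def dns_lookup(service: str) -> List[str]:
    return [f"{service}.example.internal:443"]


if __name__ == "__main__":
    print(resolve_with_fallback("login", [control_plane_lookup, dns_lookup]))
```

The order of the resolver list encodes the preference, so switching the primary backend is a configuration change rather than a code change. A real client would also need timeouts and health checks, which this sketch leaves out.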

