Switching user database on a running system

Introduction

All Spotify users are now stored in a Cassandra database instead of Postgres. The final switch was made on May 11th, and we, the team responsible for user login at Spotify, would like to tell you a little bit about this endeavour.

Ode to PostgreSQL

In the past year Spotify has added more than 35 million new active users (20 Million Reasons to Say Thanks). User details, such as username, country, and email, are stored in a user database. Every time a user logs in that database is queried. Every time a user is created, upgrades to premium, accepts a license or connects to Facebook, this user database is accessed. This means that it’s a busy database and a core piece of the Spotify Infrastructure.

Our trusty Postgres database has been serving us well for years but it is now handling a few times the number of records it was originally designed for. And the dataset grows, every day, by an ever increasing rate.

Changing such a piece of infrastructure is scary. At the same time, we really didn’t know for how long Postgres would keep working.

Single point of failure

Our Postgres setup also suffered from another issue. Reads were distributed over all data centers, but writes only took place in London. On a single machine. To quote one of our network engineers, Loke Berne: “From a network operations perspective it’s so nice to get rid of the horror rack once and for all”. He was referring to the rack containing the user database write master. To have a single machine being the master of the user database is not fun. All changes made to user information, such as new accounts being created and users upgrading to premium for 75 million active users was handled by that single poor machine. A failure of that machine would have resulted in all those operations failing until the hot standby machine was promoted to master and traffic was redirected.

The inevitable

As the team responsible for the login service we had known for quite some time that the existing solution would not scale with our user growth. We could feel it in our bones that the old mare might last the winter, but next one could get rough. We all knew it was time to put it down but pulling the trigger was scary.

attachment_a7b3749e339da0e3241acea3c8e0e1e4 — *“You want to do what!?!”*

The shark attack that started it all

In September 2013 the Atlantic cable between our datacenter in London and the one in Ashburn broke. Rumors said that a shark ate the cable. Regardless of the cause, it resulted in a major drop in new users during a week. If new users could have been created in other sites, this network problem would have been much less of an issue.

We already knew that having a single point of failure in London was a problem, but that week in September made it very clear that it was not only a theoretical failing of our design. The single point of failure had a tangible business cost that could easily be measured in euros and dollars. We had thought about using Cassandra for a long time but had never gotten around to focus on it. It got very clear that it was time to find a solution.

attachment_ef8ee6e7165c5665a789f5eb08b3bca0 — *Note: Blaming the netsplit on a shark may or may not be purely fictitious. Any connection with this shark and these events are purely coincidental.*

Changing Engines on the Running Car

“When you do things right, people won’t be sure you’ve done anything at all.”

– God entity from the “Godfellas” episode of Futurama.

This quote applies quite well when performing infrastructure changes. It’s not always feasible but it’s something to aim for. Data migration can be especially tricky. We didn’t want to shut down the entire system while we migrated as that would break login and account creation for our users. Instead we had to do a seamless transition. This meant running with both storage solutions for a while, using Postgres as master storage while darkloading the Cassandra storage. By darkloading we mean handling production requests in parallel, ignoring the results.

This had multiple benefits:

We made sure our new storage solutions had the right capacity for handling the load. We knew the capacity needs for Postgres already, but Cassandra is different so it needed to be measured on its own.
We were actually running the Cassandra storage code, so we could find all the bugs, robustness and scalability issues before we made the switch.
Since we darkloaded, it was ok for Cassandra to fail. The main process would log the error but ignore the results.
We could keep the data in sync between the old and the new storage system.

Migrating existing accounts

In addition to darkloading writes we also had to migrate all existing users. To do this we ran a job that iterated over all users and copied them from Postgres to Cassandra. We had to make sure to minimize race conditions, a darkload write could happen at the same time as an account was migrated.

Because of how Postgres works (replication stops during a long running query) we had to migrate in a special way. We made sure that all writes, both new accounts and account updates, were properly darkloaded before we ran the migration script.

Create a stream of usernames by doing a long running query against a read slave.
For each username:
1. If the username exists in Cassandra don’t migrate. Remember we assume approximately 100% darkloading.
2. If the usernames doesn’t exists we want to migrate it.
3. Query anotherread slave to get the user data for this username (Remember that replication stopped during a long running query, so querying the first slave would mean that we would risk getting stale data if the user had mutated after the long running query started)
4. Fetch data from read slave 2 and insert it into Cassandra (from this point and on all mutations that are aimed for this username will end up in both Postgres and Cassandra)

attachment_0cbe0bf1baf24b4ca873952a7c9022b2 — *Migration in the good old days.*

Verifying correctness

We also needed a script to verify that the storage systems were in sync. This did a similar iteration of users, fetched from both storages and compared and spat out useful statistics about which kinds of differences we saw along with their frequencies. This turned out to be very useful for finding bugs.

Switching

When doing the actual switch we were helped by the microservices architecture that Spotify employs throughout it’s backend. The idea is that actual calls to the user database is wrapped in a RESTful service. That way only a single service needs have knowledge of the specifics of the storage layer. The actual switch then simply meant:

A configuration change to switch the master / darkload role.
Making sure this new configuration has updated on all service machines.
Restarting all the service instances at the same time.

Restarting all the instances at the same time is important as this minimizes the risk of conflicts. What if we would have some services running with Postgres as master and some with Cassandra as master? We might get conflicting accounts created. The same username could point to different accounts!

Restarting all instances at the same time does have some downsides. During the restart, which lasts for a couple of seconds, it was impossible to create accounts or login. Fortunately our clients are clever and will automatically retry login.

Doing the switch this way also gave us a good rollback plan. If something went wrong we could fairly easily revert to using Postgres as master in the same way.

So how did it actually turn out?

First, we made sure we were in a good state. Minimizing the number of current inconsistencies by running verification scripts. We then flipped the switch and looked at logs and graphs. And then there was silence. We didn’t really see anything exciting or surprising happening. In fact, it was one of the most boring deploys we’ve done. The only thing left to do was to manually trigger the repair of a few remaining inconsistencies.

Bumps along the road

During the Cassandra migration we bumped into a lot of obstacles. We’ve probably forgot half of them, but we would like to explain some of the bigger ones we had to handle.

Paxos broken?

Cassandra introduced a fancy feature called LWT “Lightweight Transactions” or “conditional inserts” in version 1.2. It basically allows you to guarantee key uniqueness. In the Spotify realm it is important that one username belongs to one user and one user only. We don’t want allow collisions/races if two users concurrently creates an account and decides to have the same username.

So to avoid collisions we decided to use LWT, that under the hood is using a famous distributed consensus algorithm called Paxos. The Cassandra implementation requires a quorum of the replicas to be available and involves four round trips, i.e a quite expensive operation.

We did a benchmark and came to the conclusion that using Paxos when we create new accounts wouldn’t be too expensive. When we tried this live we noticed that a lot of the account creations failed and Paxos claimed that the username/account already existed.

We filed an upstream ticket CASSANDRA-9086 and waited for replies.

The response was that this was an intended behaviour, and we were able to handle this in our service code.

Paxos requires all nodes for CAS

This was a funny bug. As stated earlier, the Paxos algorithm requires a quorum of nodes to be able to reach consensus. If that fails, the operations fails and you get back some additional information. E.g. you get back the number of nodes that were able to take part in the Paxos round.

We noticed that some account creations failed and that the error messages stated that all replicas were needed, not quorum number of replicas. Turned out to be a bug (CASSANDRA-8640) in the Cassandra version we were using. So upgrading the cluster simply solved the issue.

Java driver (revert of JAVA-425, in 2.10.0)

We use an open source Cassandra client built by Datastax, the company that employs most Cassandra committers. After a couple of weeks, and some additional load added, we noticed that the number of connections from our service to the Cassandra cluster dropped, and new connections were not re-established.

This bug turned out to affect live traffic. We use the asynchronous java driver API, not using a separate thread pool. When no new connections were established, the API started blocking. Consequently we hit our concurrency limit and alerts were triggered off hours. Fortunately manual restarts kept the service going while we tracked down the root cause.

Turned out to be a new regression (introduced in JAVA-425) in the client version we were using. Upgrading to latest version, which had reverted JAVA-425, fixed our issue.

Dump dependencies

When moving to Cassandra we also needed to start making daily dumps of the Cassandra database. This is something that is/was being done from the Postgres database, as a dependency for a lot of big data batch jobs (used for personalisation, business analytics, etc.). Dumping from Cassandra introduced subtle differences from the Postgres dump, and not all batch jobs were able to handle the differences. This is something that is still not sorted 100%, but are actively being worked at.

Wrapup

Software projects are being known for taking longer time than expected, and this was no exception. But we did manage to make the switch almost without any excitement. Normally when things just work, we sometimes can’t stop to think that something must be horribly wrong. This was not the case this time, as we hit a lot of the problems in the darkloading phase almost without affecting end users. We also ran verification scripts, tailed logs, and looked at graphs so many times. So by the time we made the switch, we were confident that no boulders would come rolling our way.

This blog post was presented to you as a cooperation from the Login Team at Spotify:

Johanna Albinsson (hannis@spotify.com) Nick Barkas (snb@spotify.com) Kristofer Karlsson (krka@spotify.com) Magnus Melander (magnusm@spotify.com) Roger Schildmeijer (roger@spotify.com) Marcus Vesterlund (mzeo@spotify.com) Alan Wright (alan@spotify.com)