The Rise (and Lessons Learned) of ML Models to Personalize Content on Home (Part II)

November 18, 2021 Published by Annie Edmundson, Engineer

In Part I of this two-part series, we talked about the challenges we faced with the models we use to recommend content on Home, including:

  • The Podcast Model: Predicts podcasts a listener is likely to listen to in the Shows you might like shelf. 
  • The Shortcuts Model: Predicts the listener’s next familiar listen in the Shortcuts feature. 
  • The Playlists Model: Predicts the playlists a new listener is likely to listen to in the Try something else shelf.  

In this part of the series, we’ll highlight how and why we evaluate our models with different tools, and the hurdles to maintaining these models in production. 

Trust but verify your recommendations… with dashboards

So let’s talk about what we do with that data — specifically, how we run experiments, and maybe more importantly, how we evaluate our models’ performance.

Making experimentation simpler

How it started: experimenting on a siloed platform

Not that long ago, after transforming our training data, we would run experiments on a siloed platform built specifically for model experimentation and used almost exclusively within our team. We did this for both the initial Podcast Model and the Shortcuts Model. This platform could easily launch hundreds of experiments by using a configuration file to specify hyperparameters (it also supported a grid search over specified hyperparameters). And since everything submitted to run an experiment was a script, it supported custom evaluation metrics, something that has always been important on our team. While it provided these necessary features, it wasn’t scalable, wasn’t maintained, and had an incomplete UI. Sometimes the compute instances would lose their connection to the API (which was maintained via a periodic ping) and end up as ghost workers: still running, but not connected to anything.
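To make the grid search idea concrete, here is a minimal sketch of how a configuration of hyperparameter lists can be expanded into one experiment per combination. The hyperparameter names and values are hypothetical, and this is not the internal platform's actual configuration format.

```python
from itertools import product

def expand_grid(grid):
    """Return one dict per combination of the listed hyperparameter values."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

# Hypothetical configuration; names and values are illustrative only.
config = {
    "learning_rate": [0.01, 0.1],
    "hidden_units": [64, 128, 256],
    "dropout": [0.0, 0.5],
}

experiments = expand_grid(config)
print(len(experiments))  # 12, i.e. 2 * 3 * 2 combinations
print(experiments[0])    # {'dropout': 0.0, 'hidden_units': 64, 'learning_rate': 0.01}
```

Each resulting dict could then be handed to a training script as one experiment.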

How it’s going: integration with Spotify ML ecosystem

Spotify’s managed Kubeflow clusters provide a more scalable approach, modular components, and are compatible with other parts of the Spotify ML infrastructure, so it was an obvious choice to move our experimentation to this platform. Training our models using Kubeflow pipelines is easy and efficient, but running the evaluation we needed and tracking those results were our biggest pain points for two reasons: 

  1. As Spotify’s SDK for Kubeflow uses Tensorflow Model Analysis (TFMA), comparing the performance of a non-ML heuristic algorithm to that of a trained model is challenging to set up and requires extra infrastructure. 
  2. We often have custom evaluation metrics that are specific to the model’s task, but they are far more difficult to implement in TFMA than in vanilla Python.

Evaluating models against simpler (non-ML) solutions

As I alluded to earlier, we don’t typically begin with an ML solution to a problem. We first identify a heuristic, or rule-based, solution and the most appropriate way to evaluate it.

The first step — having a baseline for comparison

We are often tasked with creating better recommendations for content X, but what are “better recommendations,” and what are they better than? Having a baseline helps answer these questions, giving us something to compare our models against. And a good baseline, usually a heuristic or rule-based solution, is quick and efficient, even if it is not the optimal solution.

Take the Shortcuts Model as an example. We created an initial heuristic that recommended, simply, the most frequently played items from a listener’s short-term listening history. We improved the heuristic over many iterations, then compared it to the performance of the models we trained. Being able to compare these heuristics to the models gave us confidence to say that having a model was an improvement over the heuristics and was worth the extra effort of maintaining, deploying, and monitoring these models.
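A minimal sketch of what such an initial frequency heuristic could look like. The window size, the number of returned items, and the item identifiers are assumptions for illustration, not the values used in Shortcuts.

```python
from collections import Counter

def shortcuts_baseline(plays, window=50, k=6):
    """Recommend the k most frequently played items among the last `window` plays.

    plays: item ids ordered from oldest to newest (hypothetical input format).
    """
    recent = plays[-window:]
    return [item for item, _ in Counter(recent).most_common(k)]

history = ["playlist:a", "album:b", "playlist:a", "show:c", "album:b", "playlist:a"]
print(shortcuts_baseline(history, k=2))  # ['playlist:a', 'album:b']
```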

Comparing model performance to baseline performance is difficult

After establishing our baseline and training our model(s), the difficulty lies in how we compare them evenly. In a perfect world we would run infinite A/B tests with hundreds of test cells to compare the performance of all our solutions in the real world, on real listeners. Since it’s not a perfect world, we need reliable offline metrics that act as a proxy for the online metrics we can’t get in those A/B tests.  

When evaluating our recommendation models, we typically use normalized discounted cumulative gain (NDCG@k) as our metric, which can be implemented using Spotify’s Python SDK for Kubeflow pipelines. The question then becomes: how do we do the same for our heuristic? As we’ve mentioned before, transformation logic consistency is paramount, and so is evaluation logic consistency: ideally, we’d use the same evaluation logic and the same evaluation test set. Unfortunately, our heuristics are generally written in a Java service and are covered by unit tests that check correctness, not recommendation performance.

For fairly simple heuristics, we found a way to “train a model” so that its output is the heuristic rule’s output. This allowed us to use the same evaluation and evaluation test set as the models we were comparing against. We took this same approach when coming up with a solution to recommendations in the Try something else shelf for new users on Home. We computed a popularity heuristic based on a listener’s demographics in Tensorflow Transform (TFT) and used the model as a lookup utility (with a fake loss).  
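The idea of exposing a heuristic through a model-like interface can be sketched in plain Python. This is not the TFT implementation described above; the bucket names, playlist ids, and the `LookupRecommender` class are hypothetical and only illustrate how a popularity table can sit behind a `predict` call so that it can share an evaluation path with trained models.

```python
from collections import Counter, defaultdict

def build_popularity_table(listens, k):
    """Map each demographic bucket to its k most-listened playlists.

    listens: iterable of (bucket, playlist_id) pairs (hypothetical schema).
    """
    counts = defaultdict(Counter)
    for bucket, playlist_id in listens:
        counts[bucket][playlist_id] += 1
    return {b: [p for p, _ in c.most_common(k)] for b, c in counts.items()}

class LookupRecommender:
    """Exposes the heuristic through a model-like predict() method."""

    def __init__(self, table):
        self.table = table

    def predict(self, bucket):
        # Unknown buckets get no recommendations in this sketch.
        return self.table.get(bucket, [])

listens = [
    ("18-24/SE", "pop_hits"), ("18-24/SE", "pop_hits"), ("18-24/SE", "indie_mix"),
    ("35-44/US", "classic_rock"),
]
heuristic = LookupRecommender(build_popularity_table(listens, k=2))
print(heuristic.predict("18-24/SE"))  # ['pop_hits', 'indie_mix']
```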

We can’t always fit our problem into such a simple heuristic, as was the case for Shortcuts. The logic used in most of the Shortcuts heuristics was too complex to write in Tensorflow, so we implemented a completely separate offline evaluation pipeline that would gather recommendations made by models and heuristics, and apply custom evaluation functions for comparison.  
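A separate evaluation pipeline of this kind can be sketched as a function that treats models and heuristics identically: each is just a callable that returns ranked items. The recommenders, the test set layout, and the hit-rate metric below are assumptions chosen for brevity, not the pipeline we actually run.

```python
def hit_rate(recommended, relevant):
    """1.0 if any recommended item was actually listened to, else 0.0."""
    return 1.0 if any(item in relevant for item in recommended) else 0.0

def evaluate(recommenders, test_set, metric, k):
    """Score every recommender on the same test set with the same metric.

    recommenders: name -> callable(features) returning a ranked list of item ids.
    test_set: list of (features, set of relevant item ids) pairs.
    """
    scores = {}
    for name, recommend in recommenders.items():
        per_example = [metric(recommend(f)[:k], relevant) for f, relevant in test_set]
        scores[name] = sum(per_example) / len(per_example)
    return scores

# Hypothetical test set and recommenders.
test_set = [({"recent": ["a", "b"]}, {"a"}), ({"recent": ["c"]}, {"d"})]
recommenders = {
    "heuristic": lambda features: features["recent"],
    "model": lambda features: ["d", "a"],
}
print(evaluate(recommenders, test_set, hit_rate, k=2))  # {'heuristic': 0.5, 'model': 1.0}
```

Because the metric is an ordinary function, swapping in a task-specific evaluation only means passing a different callable.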

Adding freedom and flexibility to our evaluation tools

As mentioned earlier, there’s a second pain point we run into often: using custom evaluation metrics in TFMA.

TFMA is sometimes too rigid

Spotify’s SDK for Kubeflow only supports evaluations using TFMA, which provides fairly basic metrics out of the box, such as precision, recall, and accuracy. The metric we use most often is NDCG@k; TFMA provides NDCG, but not NDCG@k. Implementing metrics in TFMA is notoriously difficult: it takes ~120 lines of code to implement NDCG@k in TFMA, but only a single line of code using scikit-learn in Python.
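For reference, NDCG@k is small enough to write out in plain Python. The sketch below uses linear gains and a log2 rank discount and takes the relevance labels in the order the model ranked the items; it is an illustration, not the code we run in production.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """DCG@k of the given ranking divided by the DCG@k of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([1, 0, 0], k=3))  # 1.0, the relevant item is already ranked first
print(ndcg_at_k([0, 1, 0], k=1))  # 0.0, the relevant item falls outside the top 1
```

With scikit-learn, `sklearn.metrics.ndcg_score(y_true, y_score, k=k)` computes the same kind of score from true relevances and predicted scores.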

Most recently, we were experimenting with a model that predicts the next playlist that a new user will listen to, and as we have very little information about new users, we wanted to ensure that the model was not just predicting the most popular content. To do so, we were going to evaluate the model with a diversity metric that measured the difference between specific characteristics of items in each playlist. This was nearly impossible to implement in TFMA, so our team contributed to the Python SDK for Kubeflow to support any custom Python evaluation. We have been using this and running our experiments via Kubeflow pipelines since October 2020. 
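As an illustration of how simple such a metric is in plain Python, here is one possible diversity measure: the mean pairwise Jaccard distance between the characteristic sets of the recommended items. This is an assumed definition for the sake of example, not necessarily the metric we used, and the tags are hypothetical.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 minus the overlap ratio of two sets; 0.0 when both sets are empty."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def intra_list_diversity(item_tags):
    """Mean pairwise distance between the tag sets of recommended items."""
    pairs = list(combinations(item_tags, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

print(intra_list_diversity([{"pop"}, {"pop"}]))          # 0.0, identical items
print(intra_list_diversity([{"pop"}, {"jazz"}]))         # 1.0, nothing in common
print(intra_list_diversity([{"pop", "rock"}, {"rock"}])) # 0.5
```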

Compare and track experiment results

In the pre-Kubeflow world, our experimentation platform allowed for a way to track and compare models — now, we are using Spotify’s internal UI for machine learning, as it easily integrates with our Kubeflow runs. We can view and compare the evaluation scores of our experiments — both NDCG and custom metrics — in the UI. We’ve been using this for a number of our models, and it allows us to track our model deployments as well.

Looking at more than just the numbers for evaluating recommendations

I’ve mostly mentioned what metrics we use and why they are important, but there is another incredibly useful way we evaluate our models — sometimes more useful than what a metric can reveal.

We build custom dashboards to manually evaluate the recs

Based on past issues, we know that evaluation metrics don’t show the whole picture of how well a model is recommending content. Sometimes, the best way to evaluate a model is by seeing what content it recommends given a specific set of features about a listener. And for this reason, our team built a dashboard that does exactly that. To load a model, we simply supply its storage location, and the dashboard supports comparing multiple models given a set of features. We often test and evaluate the recommendations that a new model will provide before deploying it to production by making predictions with different sets of feature values; this gives us an intuition for what content will be recommended to users with those feature values. This has helped us find glaring issues; for example, when developing and testing a new model, we found that it would recommend the same popular playlist to listeners in all European countries. Having this knowledge allowed us to fix and improve the model before deploying it to production.

Most recently, we have been working on a new model to recommend albums a listener might like based on their locality and what they like to listen to. We have been running experiments comparing evaluation metric values, but we have also been looking at the recommendations on our dashboard. This dashboard gives you the ability to try different features and compare the recommendations across different models — all before the models are used to recommend content to our listeners. At the beginning stages of experimentation and modeling for this project, we noticed that the same album was recommended as the first item no matter what input features (such as user’s country, followed artists, etc.) were used for testing, meaning this album would have been recommended to everyone as the first recommendation. Without this dashboard as a tool, it would have been more challenging to identify this issue and remediate it before the model went live.

While our offline metrics might indicate poor performance, they don’t tell us anything about what the reason might be, whereas this dashboard can show the quality of our recommendations and is extremely useful in finding issues like this.
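The kind of issue described above can also be checked programmatically. The sketch below probes a model with several feature sets and reports whether the top recommendation never changes. The two model functions are hypothetical stand-ins for loaded models, and the feature names are illustrative.

```python
def shared_top_item(predict, feature_sets):
    """Return the top-ranked item if every feature set gets the same one, else None."""
    top_items = {predict(features)[0] for features in feature_sets}
    return next(iter(top_items)) if len(top_items) == 1 else None

# Hypothetical stand-ins for two loaded models.
def stuck_model(features):
    return ["album_x", "album_y"]  # ignores its input entirely

def healthy_model(features):
    return ["album_" + features["country"].lower(), "album_x"]

feature_sets = [{"country": "SE"}, {"country": "BR"}, {"country": "JP"}]
print(shared_top_item(stuck_model, feature_sets))    # album_x
print(shared_top_item(healthy_model, feature_sets))  # None
```

A check like this complements, rather than replaces, looking at the recommendations by eye.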

Through the use of task-specific custom evaluations and dashboards to show evaluation metrics and recommendations per feature set, we have been able to gain deep insight into how our models are behaving, and make our models a little less of a black box. 

The struggles of automated model retraining and deployment

Let’s dive into our last topic, which is all about maintaining models in production: retraining and automatic deployment.

But do we actually need to retrain our models?

It would be really nice if we could train a model once, deploy it, and then not have to do anything except monitor its online performance. Sadly, we’ve never seen this in reality.

Sometimes the model’s task requires frequent retraining

Since we first deployed the Podcast Model in Home, we have always had retraining set up for it — and that’s because it only recommends podcast shows that it has seen in training data. So if we didn’t retrain it, it wouldn’t recommend any newly published shows.  

The rest of the time, it just becomes a tech debt monster

But in some cases, retraining isn’t necessarily required to capture the full set of possible candidates. For the Shortcuts Model, we didn’t have retraining set up because it only recommends content that the listener has previously listened to (which is always in the serving features). But while retraining wasn’t needed for the Shortcuts Model to operate, the lack of it became one of the biggest sources of ML tech debt. We did not implement retraining for Shortcuts because it wasn’t needed to launch the feature, but in hindsight, investing that time up front would have saved us time and effort in the long run.

It wasn’t until many months after the launch of the model that we saw issues with the quality of recommendations in Shortcuts caused by the lack of retraining. Some of the features for this model describe the type of content that a listener has listened to (whether it’s a personalized playlist, an album, and so on), and a new type of content was introduced after the model was last trained. As a result, the model didn’t recommend this new type of content in Shortcuts. This starts to look like the same scenario as the Podcast Model described above, but we also saw a second problem: migrating to different tools and platforms was difficult because the model had been trained using older versions of libraries.

Implement for the short term while waiting for the long-term solution

Once upon a time, we only had that singular Podcast Model, which was used to generate batch predictions, not real-time predictions. We had a Scio pipeline that used Zoltar to predict podcast recommendations for all listeners, and we stored these predictions in our Bigtable instance that holds all of our content recommendations. This was a great start, but fairly inflexible when it came to when and how often we could make predictions for a given listener — and this is important because the listeners’ features could change if they listen to new content or follow new artists, which could provide better information to the model.  

Building a recommender service for the short term

Consequently, we built a new service to serve this model and enable online predictions. We could get fresh recommendations for a listener almost instantly, and we could get these recommendations at useful times, such as when a listener follows a new artist. While moving from offline predictions to online predictions was a great improvement, and an important step in making a better product, we knew we were only going to be in this state for the short term. Spotify’s online serving platform was on the horizon, but not yet ready; the benefits of building a short-term, less-than-optimal solution outweighed the benefits of waiting to serve online models until Spotify’s serving solution was production ready.

With that said, let’s talk about some challenges we faced in building this recommender service, such as how to refresh the local version of a deployed model. Our solution was to poll our internal storage directory every 10 minutes to check if there was a new revision of the model; if so, the service would pull the model down from where it was stored and start using that model to make predictions. Never mind that we only retrained weekly, or that at times some machines would have the new revision of a model while others still had the older one (although this was not something we worried about in our specific use case).
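The polling approach can be sketched as follows. The `latest_revision` and `load_model` callables are hypothetical placeholders for the storage directory lookup and the model download, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class ModelRefresher:
    """Polls for a new model revision at most once per interval."""

    def __init__(self, latest_revision, load_model, interval_seconds=600):
        self._latest_revision = latest_revision  # hypothetical: returns newest revision id
        self._load_model = load_model            # hypothetical: revision id -> model object
        self._interval = interval_seconds
        self._next_poll = 0.0
        self.revision = None
        self.model = None

    def maybe_refresh(self, now):
        """Return True if a new revision was loaded at time `now` (in seconds)."""
        if now < self._next_poll:
            return False  # too soon since the last poll
        self._next_poll = now + self._interval
        revision = self._latest_revision()
        if revision == self.revision:
            return False  # nothing new was published
        self.model = self._load_model(revision)
        self.revision = revision
        return True

revisions = ["v1"]
refresher = ModelRefresher(lambda: revisions[-1], lambda rev: "model-" + rev)
print(refresher.maybe_refresh(now=0))    # True, first load
revisions.append("v2")
print(refresher.maybe_refresh(now=300))  # False, still inside the 10-minute window
print(refresher.maybe_refresh(now=600))  # True, picks up v2
```

Since every machine polls on its own schedule, different machines can briefly serve different revisions, which matches the caveat above.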

The pain of manually deploying models

This was really a solution to serving models online, and less of a solution to a better process of serving models. Each time we wanted to deploy a model we had to: 1) copy the model to a specific storage location, 2) manually generate a pointer in our internal storage directory for that location, and 3) add this pointer to our recommender service along with the logic to fetch and transform features for the model. If we were to retrain the model, we would have to repeat each of those steps.  

Obviously, this was a cumbersome process, but because we had this short-term solution, we were able to deploy four models to production and test many others in A/B tests.

CI/CD — but make it for model training and deployment

While this recommender service lived a long life of about 10 months, the next obvious step was to migrate to Spotify’s model serving platform, which enabled us to automate retraining and deployment of retrained models.  

Automating feature transformations without Tensorflow Transform

The first step in automating retraining is automating the curation of the train and test datasets: fetching the correct features and performing the necessary feature transformations. While feature transformations are generally handled automatically via TFT in a Kubeflow pipeline, we don’t perform our feature transformations in TFT (and therefore not in our experiment pipeline) because many of the transformations we perform on the data are fairly complex and would be unnecessarily difficult to do in Tensorflow.

But because the serving platform provides feature logging, we enabled logging of already-transformed features. We then apply the correct labels to these logged features and separate them into train and test sets. These actions are all performed in scheduled pipelines that run weekly and produce weekly datasets for our models to use.
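A rough sketch of the labeling and splitting step, assuming logged features and labels can be joined on a request id. The id-based join, the hash-based split, and the field names are all assumptions for illustration; the actual pipelines are not shown in this post.

```python
import hashlib

def label_and_split(logged_features, labels, test_fraction=0.2):
    """Join logged features with labels, then split deterministically.

    logged_features: request_id -> dict of already-transformed features.
    labels: request_id -> observed label (hypothetical schema).
    """
    train, test = [], []
    for request_id, features in logged_features.items():
        if request_id not in labels:
            continue  # no outcome was observed for this prediction
        example = {**features, "label": labels[request_id]}
        # Hashing the id keeps an example in the same split on every run.
        bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
        (test if bucket < test_fraction * 100 else train).append(example)
    return train, test

logged = {"r1": {"plays_7d": 3}, "r2": {"plays_7d": 0}, "r3": {"plays_7d": 9}}
observed = {"r1": 1, "r3": 0}
train, test = label_and_split(logged, observed)
print(len(train) + len(test))  # 2, because r2 has no label and is dropped
```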

Migrating from our short-term solution to a long-term solution

In order to enable feature logging, we had to migrate to the new online model serving platform from our recommender service using Zoltar. It was a matter of dark loading all prediction traffic to the new deployment and then running a simple rollout to start directing traffic to our new deployment instead of using Zoltar to make predictions in our own service. This was an easy migration and provided the benefits that the online serving platform offers — feature logging, faster predictions / lower latencies, less code managed by our team — and it also supports pushing a new model version (from a Kubeflow pipeline), as opposed to constantly polling for a new model version. 

Continuous retraining and automatic deployment

Now that our models are all deployed via the Spotify serving platform, we can employ CI/CD. We can schedule our models to be retrained via a Kubeflow pipeline, and as part of that pipeline we can ensure that a “bad” model is not automatically deployed by specifying that it should: 1) check that the evaluation score is greater than our configured threshold, and 2) automatically push the model to our serving infrastructure only if it is. This automates a lot of the processes that we had to perform manually not long ago.
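The deployment gate itself is a small piece of logic. In this sketch, `push` stands in for whatever call hands the model to the serving infrastructure; it and the scores are hypothetical.

```python
def deploy_if_good(score, threshold, push):
    """Push the retrained model only when its score beats the threshold."""
    if score > threshold:
        push()
        return True
    return False

pushed = []
deploy_if_good(0.42, threshold=0.40, push=lambda: pushed.append("candidate"))
deploy_if_good(0.31, threshold=0.40, push=lambda: pushed.append("regressed"))
print(pushed)  # ['candidate']
```

A model that scores below the threshold is simply not pushed, so the previously deployed version keeps serving.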

Enabling CI/CD for retraining and model deployment is hard, but it’s becoming easier with the new tools available and makes the quality and reliability of our models better. And at first glance, you might not think you need retraining for a model because of the task it performs, but without it, your model could make predictions in unpredictable ways and increase your tech debt.   


Our ML stack has come a long way in recent years, but it’s not perfect by any means. There are still a number of challenges we are tackling: data versioning, model versioning, moving feature transformations to Tensorflow Transform, and finding better ways to compare offline metrics across both ML and non-ML solutions. But the stack has decreased the time it takes for us to iterate, experiment, and deploy quality models.

We have adopted and/or built the components we need to successfully and efficiently manage our data, experiment with different models, and support continuous integration and deployment throughout the deployment and retraining processes. Our ML stack has enabled us to launch numerous models that serve millions of listeners on Home every day.

If you are interested in joining us and helping improve how we recommend content on Home, we are hiring!