Large-Scale Generation of ML Podcast Previews at Spotify with Google Dataflow
Integrating the Podz ML pipeline into Spotify
As of March 8, 2023, Spotify has started serving short previews for music, podcasts, and audiobooks on the home feed. (You can see the announcement at Stream On on YouTube, starting at 19:15.) This is a huge lift for in-app content discovery — a move away from making listening decisions based on static content, such as cover art, to using the audio content itself as a teaser. It has been an effort across many teams at the company. The path to generating the home feed podcast previews for millions of episodes on Google Dataflow, making use of the Apache Beam Python SDK, has been a complicated one.
In 2021, Spotify acquired Podz to accelerate audio discovery. Podz had been working on approaches to producing 60-second podcast previews, leveraging the latest advancements in natural language processing (NLP) and audio machine learning (ML), including transformers. At the time of the acquisition, the Podz stack was composed of a number of microservices, each hosting an individual model. It also relied on a central API that would ingest and transcribe new episode content, process each episode by calling the ML services according to directed acyclic graph (DAG) logic, and surface the previews to a mobile app UI based on user-preview affinity. The ML services were sufficient to generate preview offsets for a few thousand podcast episodes per day. However, we found that the Spotify podcast catalog was growing at a rate of hundreds of thousands of episodes a day!
To scale up to Spotify’s needs, we had to adopt a fully managed pipeline system that could operate at scale, robustly, and with a lower latency per podcast episode. We could have managed our own Kubernetes service, but that would have been a lot of work: updates, scaling, security, and reliability of the resources, among other things. Instead we chose Google Dataflow, a managed pipeline execution engine built on top of open source Apache Beam. Dataflow manages Dockerized computational graphs (DAGs) over sets of inputs in batch or streaming mode, is optimized for low latency, and autoscales depending on demand. Beam building blocks include operations like ParDo and Map, which the Dataflow engine optimizes by fusing them into pipeline stages, enabling distributed parallel processing of the input data.
Determining our method for generating podcast previews was an interesting challenge.
Raw audio source data
The source of the content is raw audio and transcription data, which needs to run through various preprocessing tasks to prepare it for input into the models. We used source transforms to read and process this data from our storage system, which included steps to deduplicate content to avoid unnecessary work, followed by the ML steps and, finally, the handoff to downstream storage.
Many models and frameworks
There are over a half dozen models within the pipeline that need to be built as an ensemble, including fine-tuned language models and sound event detection. The models are trained with different ML frameworks, including TensorFlow, PyTorch, scikit-learn, and Gensim. Most of the frameworks out there! This introduced three challenges:
- Finding the best way to assemble these models into a pipeline shape: for example, as different nodes in a single Apache Beam transform, or as separate transforms within the Apache Beam DAG.
- Choosing the right hardware (machine size and GPUs) to achieve our latency and throughput goals.
- Deciding on the best way to deal with library dependencies when using all these frameworks together.
For challenges 1 and 2, we factored in the size of the models and decided to create a few transforms, each with its own ensemble. We chose NVIDIA T4 GPUs, which gave us 16 GB of GPU memory to work with, and placed the bulk of the models in a single transform. This reduced the code complexity of the pipeline as well as data transfer across models. However, it also meant a lot of swapping of models in and out of GPU memory, and it compromised pipeline visibility, since errors and logs for the bulk of our logic were located within the same very dense ParDo operation. We used fusion breaks in between these transforms to ensure the GPU only loaded one stage of models at a time.
As for challenge 3, we made use of Dataflow’s ability to apply custom containers to help solve for the dependency challenges. Generating Docker images to properly resolve the dependencies across all these models together with Dataflow was difficult. We used Poetry to resolve dependencies ahead of Dockerization, and we referred to the Dataflow SDK dependencies page for guidance about which versions of our model frameworks and data processing dependencies would be compatible with Python SDK versions for Dataflow. However, locally resolved and Dockerized dependencies would still run into runtime resolution errors, the solution for which we discuss in detail later.
Generalization of the pipeline
Our result was an hourly scheduled batch Beam pipeline, Dockerized in a custom container. The next challenge was to maintain multiple versions of the pipeline. In addition to the current production pipeline, we needed a staging (experimental) pipeline for new candidate versions of the preview generation, and a lightweight “fallback” pipeline for content with lower viewership and for content that the production pipeline failed to produce timely output for. To maintain all of these similar pipelines together, we developed a common DAG structure with the same code template for all of them. We also added a number of Dataflow-native timers and failure counters to profile our code and to measure improvements over various sources of latency and errors over time.
The cost of running a large number of GPU machines year-round is high, so we applied some tricks to mitigate this cost while also improving latency. We initially disabled autoscaling because we judged it unnecessary for batch jobs with fixed-size hourly input partitions: we could estimate the number of workers needed to complete each input within the hour, then spin up and tear down exactly that number of workers.
However, latency from queueing up inputs hourly, as well as from machine spin-up and teardown, meant we had to optimize further. A nicety of Apache Beam is that, because it supports both batch and streaming with the same API, little code change is needed to switch between the two. We therefore moved to a streaming system, which brought a larger improvement in cost and latency, and a reversal of some of our previous decisions: in streaming mode, autoscaling is effective, so we re-enabled it. The job can determine how many resources it needs dynamically, based on input traffic. This also reduces the time and cost of spinning up and tearing down machines between hourly partitions.
From two hours to two minutes
With batch pipelines, we were able to produce a preview two hours after the episode had been ingested. We wanted to bring this latency down to minutes for two reasons:
1. Some episodes are time sensitive, such as daily news.
2. Listeners of very popular shows expect the release of a brand new episode at a particular time, on a particular day of the week.
We decided to explore Apache Beam and Dataflow further by making use of a library, Klio. Klio is an open source project by Spotify designed to process audio files easily, and it has a track record of successfully processing music audio at scale. Moreover, Klio is a framework to build both streaming and batch data pipelines, and we knew that producing podcast previews in a streaming fashion would reduce the generation latency.
In a matter of one week, we had the first version of our streaming podcast preview pipeline up and running. Next, we worked on monitoring and observability. We started logging successful and failing inputs into a BigQuery table, and in the latter case, we also logged exception messages. With Google Cloud Dashboards and Google Metrics Explorer, we were able to quickly build dashboards to tell us the size of the backlog in our Pub/Sub queues and to set up alerts in case the backlog grew too large.
To measure the impact of this change, we used the median preview latency, that is, the time between the episode ingestion and the completion of the podcast preview generation. We compared two distinct weeks: one in which only the batch pipeline was producing previews, and another in which only the Klio streaming pipeline was producing previews. The Klio streaming pipeline reduced the median preview latency from 111.7 minutes to 3.7 minutes. That means the new pipeline generates previews 30 times faster than the batch pipeline. Quite an improvement!
Below, you can see preview latency distribution for both the Klio streaming and the batch pipeline. Because there is quite a long tail, we have plotted just the leftmost 80% of episodes processed.
Whether using batch or streaming pipelines, we had to tackle some problems when running pipelines on Dataflow. One was the pipeline dependency management issues encountered when using a custom Docker container (via the --sdk_container_image flag) or when running the job inside of a VPN, a common practice at many large companies, including Spotify. Debugging such errors is difficult because logging is often insufficient for tracking down the error, and cloud workers may shut down or restart immediately after an error.
In one case, after upgrading the Apache Beam SDK from version 2.35.0 to 2.40.0, the pipeline started to fail within five minutes of starting the jobs, with the error “SDK harness disconnected.” There were no error logs stating the underlying issue. The “SDK harness disconnected” error means that the process running the pipeline code has crashed, and the cause of a crash might be buried in a stream of info logs.
The upgraded pipeline did succeed at running on small input sizes (BigQuery input source), but started to fail with larger input sizes, for example, greater than 10,000 rows from the same BigQuery input source. Because both jobs ran on the exact same configuration, we initially suspected an out of memory (OOM) error. That was ruled out based on the historical evidence that the same pipeline had handled a much larger throughput before upgrading.
Upon further debugging, we noticed the bug happened only if the package grpcio==1.34.1 was installed together with google-cloud-pubsublite==1.4.2, and the bug didn’t manifest when either of the packages was updated to its latest version. Given that we didn’t use Pub/Sub functionality in our pipeline, the solution was to uninstall google-cloud-pubsublite in the Docker image after all other dependencies had been installed.
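In Dockerfile terms, the workaround looked roughly like the following; the base image tag and requirements file are illustrative:

```dockerfile
# Custom SDK container passed to Dataflow via --sdk_container_image.
FROM apache/beam_python3.8_sdk:2.40.0

COPY requirements.txt .
# Install pinned dependencies, then drop the package whose transitive pin
# (grpcio==1.34.1) crashed the SDK harness; the pipeline never used Pub/Sub Lite.
RUN pip install -r requirements.txt && pip uninstall -y google-cloud-pubsublite
```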
To summarize, when upgrading to new versions of the Apache Beam SDK, the pipeline using custom containers in a VPN might fail due to subtle changes in transitive dependency resolution. Debugging such issues is difficult because not all errors are logged inside the Dataflow workers. It is important to have visibility into version changes for all installed packages, including transitive dependencies, and manual inspection of prebuilt custom containers is required to identify the difference. A future improvement would be providing more logging, tooling, and guidance for dependency management, especially for custom containers. Another mitigating practice would be to make use of the GPU development workflow described in the Dataflow documentation. The idea is to recreate a local environment that emulates the managed service as closely as possible.
Our future with Dataflow
Now that the Podcast Previews feature has launched, we are closely monitoring the behavior of our pipelines and adjusting our preview generation to match observed user engagement. We are also excited to try out two new Dataflow features. The first, Dataflow Prime Right Fitting, would allow us to specify resource requirements for each Dataflow step or ParDo, instead of a single set of requirements for the entire pipeline. This would improve resource utilization by allocating fewer resources to steps that are less computationally expensive, such as reading inputs and writing outputs.
We are also excited about the new RunInference API, which would allow us to break our dense ParDo down into smaller steps. Code paths whose models currently swap in and out of GPU memory could instead run on machines dedicated to individual models for the duration of the job. This feature would also standardize our model inference metrics and provide a means to further increase pipeline throughput by intelligently batching our inputs.
It has been a long journey to build an effective, flexible, and scalable system to generate podcast previews at Spotify for millions of users, and along the way, we’ve learned many valuable lessons. Using managed data pipeline tools, such as Google Dataflow, adds value by lowering the bar to build and maintain infrastructure, allowing us to focus on the algorithms and the pipeline. Streaming has been shown to be a far superior system, despite requiring a little extra work. Finally, the job is never done, and we want to continue to push the boundaries of what’s possible both with data engineering and data science — enhancing the services that we provide for our users and creators.
We’d like to thank everyone at Spotify and Google who helped us with this effort, including Tim Chagnon (Senior Machine Learning Engineer, Spotify), Seye Ojumu (Senior Engineering Manager, Spotify), Matthew Solomon (Senior Engineer, Spotify), Ian Lozinski (Senior Engineer, Spotify), Jackson Deane (PM, Spotify), Peter Sobot (Staff ML Engineer, Spotify), and Valentyn Tymofieiev (Software Engineer, Google).
Apache Beam, the Beam logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation.