Unleashing ML Innovation at Spotify with Ray

As the field of machine learning (ML) continues to evolve and its impact on society and various aspects of our lives grows, it is becoming increasingly important for practitioners and innovators to consider a broader range of perspectives when building ML models and applications. This desire is driving the need for a more flexible and scalable ML infrastructure.

At Spotify, we strongly believe in a diverse and collaborative approach to building ML applications. Gone are the days when ML was the domain of only a small group of researchers and engineers. We want to democratize our ML efforts such that contributors of all backgrounds, including engineers, data scientists, and researchers, can leverage their unique perspectives, skills, and expertise to further ML at Spotify. As a result, we expect to see an increase in well-represented ML advancements at Spotify in the coming years — and the right infrastructure will play a crucial role in supporting this growth.

Background

Spotify founded its machine learning (ML) platform in 2018 to provide a gold standard for reliable and responsible production ML. As an ML platform team, we aim to empower our users to spend less time maintaining bespoke ML infrastructure and more time focusing on solving business problems through novel model development.

Our centralized infrastructure now serves over half of our internal ML practitioners and ML teams. Internal research has shown, however, that our platform tools are not yet well suited to every kind of ML practitioner. While the majority of our ML engineers use our centralized tooling, fewer data and research scientists do. We believe solving the following user needs can help all kinds of ML innovators at Spotify:

Broadening production support for ML frameworks beyond TensorFlow to support novel ML solutions for Spotify
Providing a more user-friendly way for users to access GPU and distributed compute
Accelerating the user journey for ML research and prototyping
Providing solutions to productionize more advanced ML paradigms, such as reinforcement learning (RL) and graph neural networks (GNN) workflows

Spotify’s ML infrastructure today

Our goal for Spotify’s ML Platform has always been to create a seamless user experience for ML practitioners who want to take an ML application from development to production. In early 2020, our ML Platform expanded to cover the ML production workflow for Spotify’s ML practitioners with four core product offerings:

ML Home, a place where ML engineers can store ML project information and access metadata related to the ML application lifecycle
Jukebox, our solution for powering feature engineering based on TensorFlow Transform
Spotify Kubeflow, our managed version of the open-source Kubeflow Pipelines platform with TensorFlow Extended (TFX) as the ML workflow standardization
Salem, our standard for model serving and on-device ML applications

Lifecycle of an ML project at Spotify

ML at Spotify resembles a funnel. At the widest end, we have a big volume of ML activities undertaken by data and research scientists to quickly prove high-potential ideas. Their tasks, tools, and methods are diverse and heterogeneous and difficult to standardize — in fact, it’s suboptimal to standardize their methods at this point in the lifecycle. As the funnel narrows and high-potential ideas prove out, data engineers and ML engineers take over. Standardizing their tasks and tools is an optimization both from a user experience standpoint and from a business perspective; ML engineers can spend less time building redundant tooling, and our business benefits from having proven, reliable, and innovative ML ideas launched to production faster.

We built our platform for ML engineers first because their use cases and needs were easier to standardize. But that focus came at a cost: it meant less flexibility for innovation at the earlier stages of the lifecycle.

Transforming ML development at Spotify

In 2022, our team set out to refresh the two-to-three-year strategy and vision for Spotify’s ML Platform. A big component of that strategy was to better serve the needs of innovators focused on the earlier stages of the ML lifecycle and enable a seamless transition from development to production.

The next evolution of Spotify’s ML infrastructure

At Spotify, ML practitioners all share a similar ML user journey. They want to start their ML projects by prototyping on their local machines or in a notebook, and they need access to large computing resources like dozens of CPUs or GPUs. They want to easily create and scale end-to-end ML workflows in Python, access a diverse set of modern ML libraries, and seamlessly integrate with the rest of the Spotify engineering ecosystem with minimal code changes and infrastructure knowledge.

Current-vs-future-ML-Infra — *The larger ML practitioner umbrella at Spotify refers to the following roles: research scientists, data scientists, data engineers, and ML engineers focused on ML-related tasks.*

To better meet the needs of our users and improve productivity across the entire ML lifecycle, we need flexible infrastructure that meets the majority of our users where they already are. We need a platform that helps day-one Spotifiers feel productive, whether they’re a data scientist in customer service testing ideas fast or an ML engineer on an advanced personalization team concentrated on hardening production workflows. Our current platform experience is heavily weighted towards a single user journey: an ML engineer using TensorFlow/TFX for supervised learning production applications. To better support a broader range of practitioners, we need to lower the barrier to entry and embrace more diverse ML tooling while maintaining scalability and performance in end-to-end ML workflows.

Introducing Ray

After extensive prototyping and investigation, we believe Ray addresses those needs.

Ray is an open-source, unified framework for scaling AI and Python applications. It’s tailored for ML development with its rich ML ecosystem integration. It easily scales compute-heavy workloads such as feature preprocessing, deep learning, hyperparameter tuning, and batch predictions — all with minimal code changes. Ray is widely adopted across the ML industry. OpenAI cofounder and CTO Greg Brockman said at Ray Summit 2022, “We’re using [Ray] to train our largest models. So it has been very, very helpful for us in terms of just being able to scale up to a pretty unprecedented scale and to not go crazy.” With Ray, ML developers no longer need to completely change their code and framework of choice to achieve scale for production applications, easing the transition from local development to a distributed computing environment.

Incorporating Ray into the Spotify ecosystem

We built a centralized Spotify-Ray platform because we want our ML practitioners to solve ML problems and not have to devote their time to managing Ray or underlying infrastructure. The platform consists of server-side infrastructure, client-side SDK and CLI, and integrations with the rest of the Spotify ecosystem. We designed it to cater to the needs of all types of ML practitioners, not only ML engineers. We optimized for accessibility, flexibility, availability, and performance.

Initial-Architecture-Spotify-Ray — The initial architecture design of Spotify’s managed Ray platform. Spotify-Ray empowers novel ML solutions via streamlined acceleration of myriad modern ML libraries, using ML Platform’s centralized managed infrastructure as a foundation.

Accessibility

We wanted users to have an amazing onboarding experience with a gradual learning curve. We optimized for progressive disclosure of complexity, providing sensible defaults for common use cases and flexible abstractions over underlying Ray and Kubernetes complexity that accommodate both new users and “power users” alike. This lets ML practitioners focus on their business logic right away. With a single CLI command, users can create their own Ray cluster with preinstalled ML tools, ready-to-run notebook tutorials, VS Code server for in-browser editing, and SSH access.

$ sp-ray create cluster my-cluster \
    -n ray-playground \
    --with-tutorials \
    --vscode-server \
    --gpus-per-worker 1

Created cluster my-cluster in namespace ray-playground
Uploaded tutorial notebooks
sp-ray version          0.3.0
server ray version      2.2.0
server python version   3.8.13
service account         ...
head IP                 1.2.3.4
server                  ray://1.2.3.4:10001
dashboard               http://1.2.3.4:8265
notebook server         http://1.2.3.4:8081
OpenVSCode server       http://1.2.3.4:3000
workers                 1

head group
  replicas              1
  CPUs                  15
  GPUs                  0
  memory                48Gi

worker groups
  worker
    replicas            1
    CPUs                15
    GPUs                1
    GPU type            t4
    memory              48Gi

Users can list, describe, scale, customize, and delete Ray clusters too.


$ sp-ray get cluster -n ray-playground
NAME          CREATED          WORKERS
my-cluster    2 seconds ago    1

# show useful, human-readable cluster info
$ sp-ray describe cluster -n ray-playground my-cluster
sp-ray version          0.3.0
server ray version      2.2.0
server python version   3.8.13
service account         ...
head IP                 1.2.3.4
server                  ray://1.2.3.4:10001
dashboard               http://1.2.3.4:8265
notebook server         http://1.2.3.4:8081
OpenVSCode server       http://1.2.3.4:3000
workers                 1

head group
  replicas              1
  CPUs                  15
  GPUs                  0
  memory                48Gi

worker groups
  worker
    replicas            1
    CPUs                15
    GPUs                1
    GPU type            t4
    memory              48Gi

# easy to customize basic options
$ sp-ray create cluster my-cluster \
    --cpus-in-head 4 \
    --memory-in-head 10Gi \
    --gpus-per-worker 1 \
    --worker-gpu-type a100

# scale worker groups
$ sp-ray scale cluster -n ray-playground my-cluster \
    --worker-group group1 --replicas N

# allow K8s YAML for advanced config like multiple worker groups
$ sp-ray create cluster -n ray-playground my-cluster \
    --file ray-cluster.yaml

$ sp-ray delete cluster -n ray-playground my-cluster

Under the hood, we use Google Kubernetes Engine (GKE) and the open-source KubeRay operator. Our CLI creates a custom Kubernetes Ray cluster resource that tells KubeRay to create a new Ray cluster. Users start with a shared playground namespace to learn and experiment with minimal setup. Once they’re ready, they create their own namespace. Our multi-tenancy team management process grants permissions, configures resources, and manages contributors. It generates all Kubernetes resources based on a team configuration file and deploys them to the cluster to set up the namespace.
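To make the mechanism more concrete, here is a minimal, illustrative sketch of the kind of RayCluster custom resource a CLI could hand to KubeRay. It is not the actual sp-ray implementation. The field layout follows the open-source KubeRay RayCluster schema as commonly documented; the helper name `build_ray_cluster_manifest`, the container image, the API version string, and the default sizes (copied from the sample CLI output above) are assumptions for illustration only.

```python
import json
from typing import Any, Dict


def build_ray_cluster_manifest(
    name: str,
    namespace: str,
    worker_replicas: int = 1,
    gpus_per_worker: int = 0,
    image: str = "rayproject/ray:2.2.0",  # assumed image, for illustration
) -> Dict[str, Any]:
    """Build a RayCluster custom resource as a plain dict (hypothetical helper)."""

    def pod_template(container: str, cpus: int, memory: str, gpus: int = 0) -> Dict[str, Any]:
        # Kubernetes resource limits; GPUs are only requested when asked for.
        limits = {"cpu": str(cpus), "memory": memory}
        if gpus:
            limits["nvidia.com/gpu"] = str(gpus)
        return {
            "spec": {
                "containers": [
                    {"name": container, "image": image, "resources": {"limits": limits}}
                ]
            }
        }

    return {
        "apiVersion": "ray.io/v1alpha1",  # assumed; depends on the KubeRay release
        "kind": "RayCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "rayVersion": "2.2.0",
            "headGroupSpec": {
                "rayStartParams": {"dashboard-host": "0.0.0.0"},
                "template": pod_template("ray-head", cpus=15, memory="48Gi"),
            },
            "workerGroupSpecs": [
                {
                    "groupName": "worker",
                    "replicas": worker_replicas,
                    "minReplicas": worker_replicas,
                    "maxReplicas": worker_replicas,
                    "rayStartParams": {},
                    "template": pod_template(
                        "ray-worker", cpus=15, memory="48Gi", gpus=gpus_per_worker
                    ),
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_ray_cluster_manifest("my-cluster", "ray-playground", gpus_per_worker=1)
    print(json.dumps(manifest, indent=2))
```

Once a resource of this shape exists in the cluster, the operator is responsible for creating the head and worker pods, which is why the client only needs to describe the desired state.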

In addition to the CLI, we created a Python SDK with equivalent features. The SDK lets users programmatically manage their Ray clusters.

import time
from datetime import datetime
from typing import Final

from spotify_ray.logger import LOGGER
from spotify_ray.models.ray_cluster import RayCluster

CLUSTER_NAME: Final[str] = f"sp-ray-test-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
NAMESPACE: Final[str] = "hyperkube"
NUM_WORKERS: Final[int] = 1


def main():
    # Create new cluster
    cluster = RayCluster.create_cluster(
        name=CLUSTER_NAME,
        namespace=NAMESPACE,
        cpus_in_head=1,
        memory_in_head="2Gi",
        cpus_per_worker=1,
        memory_per_worker="2Gi",
        worker_replicas=NUM_WORKERS,
        await_ready=True,  # makes the function block until cluster is ready
    )

    LOGGER.info(f"Cluster head IP: {cluster.head_ip}")
    LOGGER.info(f"Cluster has {cluster.num_workers} workers")

    cluster.scale_worker_group(worker_group="worker", replicas=2)

    time.sleep(60)

    # Get an existing cluster
    cluster = RayCluster.get_cluster(name=CLUSTER_NAME, namespace=NAMESPACE)

    cluster.delete()


if __name__ == "__main__":
    main()

Flexibility

Users can easily make use of state-of-the-art ML libraries and select computing resources to support their workloads for research and prototyping. As a result of using Ray, our platform supports all the major ML frameworks like PyTorch, TensorFlow, and XGBoost. Computing resource configuration is abstracted in a unified and user-friendly way. If users don’t want the default computing resources, they can easily customize them. For example, they can request a specific type and number of GPUs.
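As a rough illustration of this kind of abstraction, the sketch below turns a friendly request such as "one A100 GPU per worker" into the Kubernetes fields GKE uses to schedule GPU pods. The `WorkerResources` class, the alias table, and the defaults are hypothetical. The `nvidia.com/gpu` resource name and the `cloud.google.com/gke-accelerator` node label are standard on GKE, but how Spotify-Ray performs this mapping internally is not described in this post.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Hypothetical alias table: short GPU names mapped to GKE accelerator label values.
GPU_ALIASES: Dict[str, str] = {
    "t4": "nvidia-tesla-t4",
    "a100": "nvidia-tesla-a100",
}


@dataclass(frozen=True)
class WorkerResources:
    """User-facing resource request with defaults (illustrative values)."""

    cpus: int = 15
    memory: str = "48Gi"
    gpus: int = 0
    gpu_type: Optional[str] = None  # falls back to "t4" when GPUs are requested

    def to_pod_fields(self) -> Dict[str, Any]:
        """Translate the request into pod-level resource limits and a node selector."""
        limits = {"cpu": str(self.cpus), "memory": self.memory}
        node_selector: Dict[str, str] = {}
        if self.gpus > 0:
            gpu_type = self.gpu_type or "t4"
            if gpu_type not in GPU_ALIASES:
                raise ValueError(f"Unsupported GPU type: {gpu_type}")
            limits["nvidia.com/gpu"] = str(self.gpus)
            node_selector["cloud.google.com/gke-accelerator"] = GPU_ALIASES[gpu_type]
        return {"resources": {"limits": limits}, "nodeSelector": node_selector}


if __name__ == "__main__":
    # Defaults need no arguments; customization only names what changes.
    print(WorkerResources().to_pod_fields())
    print(WorkerResources(gpus=1, gpu_type="a100").to_pod_fields())
```

The design point is that users state only what differs from the defaults, and the platform owns the translation into cluster-specific details.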

Availability

We build on top of managed GKE’s availability instead of managing Kubernetes clusters ourselves. We isolate workloads by giving each Ray worker its own GKE node, and we isolate teams by giving each their own Kubernetes namespace.

Performance

We leverage GKE’s image streaming feature to speed up image pulls. We’ve decreased the time it takes to pull large GPU-based container images from several minutes to just a few seconds.

A Ray-based path to production

In the name of building a minimum viable version of Spotify-Ray, we chose to prioritize early-stage prototyping and experimentation — the “mouth” of the funnel, in other words. However, we see promise in Ray as the backbone of a powerful path to production for ML practitioners at Spotify. With Spotify-Ray’s native Flyte integration and high-level APIs in the works to streamline and accelerate canonical MLOps tasks, e.g., data loading, artifact logging, experiment tracking, and pipeline orchestration, we believe Ray can significantly shorten the time-to-production for ML applications at Spotify. We’re excited to work together with our internal ML practitioners to achieve this vision.

Problems-Solutions-Spotify-Ray — *A summary of how Ray addresses Spotify ML practitioner needs.*

Use case: Graph learning for content recommendations

In a recent research project, Spotify’s Tech Research team, a group previously underserved by ML Platform, experimented with using graph learning technologies for recommendations. Unlike past research projects, which were typically prototyped with ad hoc tooling and then implemented for production scenarios, the graph learning implementation needed to be production-ready so the team could quickly assess GNNs for Spotify’s business use cases. Our ML researchers needed infrastructure that was flexible and easy to productionize quickly. This prompted Tech Research to use graph learning on Spotify-Ray to generate content recommendations.

Following promising offline results on internal datasets, the Tech Research team ran an A/B test to understand how GNN-based algorithms changed our home page’s “Shows you might like” recommendations. Conducting these A/B tests was challenging since the GNN workflows are different from typical ML workflows. Tech Research adopted Spotify-Ray for their infrastructural needs to overcome these challenges and implemented a set of components to train and deploy GNN models at scale.

Data creation (graph construction): Constructing graphs for real-world applications is an iterative process and requires tools that can easily transform large amounts of data with simple Python functions. We leveraged Ray Datasets to create the graph from our data warehouse. Ray Datasets provide flexible APIs to perform common transformations such as mapping on distributed data.
Feature preprocessing: The graph constructed in the previous step consisted of nodes and edges along with corresponding features. We leveraged Ray AIR’s default preprocessors and extended their base API to perform feature transformations such as standardization, categorical transforms, bucketing, etc.
Graph learning: We feed the graph and preprocessed features into a graph learning algorithm implemented in PyG. Ray trainers easily extend to different ML frameworks like PyG and allow us to seamlessly distribute our training.
Inference at scale and evaluation: Finally, we implemented custom predictors for batch prediction and evaluators using Ray Datasets.
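The shape of these four steps can be sketched on a toy scale in plain Python. This is only a stand-in to show how the stages connect: the real pipeline used Ray Datasets, Ray AIR preprocessors, and PyG, none of which appear here. The data, the function names, and the single untrained mean-aggregation step (a very simplified form of message passing) are all invented for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Set, Tuple

# Toy (user, show) interaction records; values are invented.
INTERACTIONS: List[Tuple[str, str]] = [
    ("user_a", "show_1"),
    ("user_a", "show_2"),
    ("user_b", "show_2"),
]
# One raw numeric feature per node; values are invented.
RAW_FEATURES: Dict[str, float] = {
    "user_a": 10.0, "user_b": 30.0, "show_1": 20.0, "show_2": 40.0,
}


def build_graph(interactions: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Step 1: build an undirected adjacency list from interaction pairs."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for user, show in interactions:
        adjacency[user].add(show)
        adjacency[show].add(user)
    return adjacency


def standardize(features: Dict[str, float]) -> Dict[str, float]:
    """Step 2: scale a feature to zero mean and unit variance."""
    mu = mean(features.values())
    sigma = pstdev(features.values()) or 1.0  # guard against constant features
    return {node: (value - mu) / sigma for node, value in features.items()}


def aggregate(adjacency: Dict[str, Set[str]], features: Dict[str, float]) -> Dict[str, float]:
    """Step 3: one round of mean-neighbor aggregation (no learned weights)."""
    return {
        node: 0.5 * features[node] + 0.5 * mean(features[n] for n in neighbors)
        for node, neighbors in adjacency.items()
    }


def recommend(user: str, adjacency: Dict[str, Set[str]], embeddings: Dict[str, float]) -> List[str]:
    """Step 4: rank unseen shows by closeness to the user's embedding."""
    candidates = [
        node for node in embeddings
        if node.startswith("show_") and node not in adjacency[user]
    ]
    return sorted(candidates, key=lambda show: abs(embeddings[user] - embeddings[show]))


if __name__ == "__main__":
    graph = build_graph(INTERACTIONS)
    embeddings = aggregate(graph, standardize(RAW_FEATURES))
    print(recommend("user_b", graph, embeddings))
```

In a distributed setting, each of these functions would correspond to a transformation over partitioned data rather than an in-memory dictionary, which is where a framework like Ray earns its keep.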

Using the above components, this team built an end-to-end pipeline for generating show recommendations using GNN-based models and successfully launched an A/B test in less than three months, a feat that was extremely challenging for Tech Research in the past given the previously supported ML infrastructure. The A/B test resulted in significant metric improvements and an improved user experience on our home page’s “Shows you might like.”

Looking ahead

Spotify’s ML practitioners’ demand for PyTorch has grown considerably, particularly for emerging use cases in the NLP and GNN spaces. We plan to use Ray to support and scale PyTorch to meet this growing demand and help our diverse users feel productive, no matter their role on the team.

While bringing in a new framework carries the risk of fragmentation, with better foundational building blocks in place, we can work toward creating a more flexible, representative, and responsible ML platform experience that comprehensively unleashes ML innovation at Spotify.

Acknowledgments

Our work bringing Ray into the Spotify ecosystem would not have been possible without the fantastic work of the ML Workflows team at Spotify, our teammates from ML Platform, and generous collaboration with the Anyscale team. Thank you to the individuals whose work made unleashing ML innovation at Spotify possible: Jonathan Jin, Mike Seid, Joshua Baer, Richard Liaw, Dmitri Gekhtman, Abdullah Mobeen, Maria Cipollone, Olga Ianiuk, Sara Leary, Grace Glenn, Omar Delarosa, Maisha Lopa, Shawn Lin, and Andrew Martin. Thanks also to Union.ai for their support on the Flyte integration.

PyTorch, the PyTorch logo, and any related marks are trademarks of The Linux Foundation.

TensorFlow, the TensorFlow logo, and any related marks are trademarks of Google Inc.
