Data Platform Explained Part II 

May 28, 2024 Published by Anastasia Khlebnikova (Senior Engineer) and Carol Cunha (Product Manager)

Check out Data Platform Explained Part I, where we started sharing the journey of building a data platform, its building blocks, and the motivation for investing in a platformized solution at Spotify.

Introduction

In Data Platform Explained Part I, we shared the first steps in the journey to build a data platform, the signals that indicate it's time to start building one, and how we are organized to succeed at it. In this article, we take one step further into the why, what, and how of our data platform. We introduce the domains underneath it that are responsible for the platform's building blocks, looking at scalability, the tooling we use and provide, and the value each building block brings to a data platform. Finally, we share our strategy for navigating the complexity of a data ecosystem: building a strong community around it.

Data Collection 

When it comes to scalability, Spotify's Data Collection platform collects more than 1 trillion events per day. Its event delivery architecture is constantly evolving and has gone through numerous iterations. To learn more about the event delivery evolution, its inception, and subsequent improvements, check out this blog post.

Data Collection is needed so we can: 

  • Understand what content is relevant to Spotify users 
  • Directly respond to user feedback
  • Have a deeper understanding of user interactions to enhance their experience
Figure 1: The event delivery infrastructure is a significant topic that deserves its own dedicated article (coming soon). Nevertheless, here’s an overview of the main components handled by our event delivery infrastructure.

When a team at Spotify decides to instrument their functionality with event delivery, aside from writing code using our SDKs, they only need to define the event schemas. The infrastructure then automatically deploys a new set of event-specific components (such as Pub/Sub queues, anonymization pipelines, and streaming jobs) using K8s operators. Any change to the event schema triggers the deployment of the corresponding resources. Anonymization solutions, including internal key-handling systems, are covered in detail in this article.
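To give a rough feel for this schema-first flow, here is a minimal sketch of what an event definition could look like. The event name and fields below are hypothetical and do not reflect our internal SDK or schema format; the point is that the producing team only declares the schema, and the platform derives the event-specific resources from it.

```scala
// Hypothetical event definition, for illustration only.
// The real internal SDK, annotations, and field names differ.
package com.example.events

final case class PlaybackStarted(
  trackId: String,        // identifier of the track being played
  deviceType: String,     // e.g. "mobile", "desktop", "speaker"
  timestampMillis: Long   // client-side event time
)
```

From a definition like this, the infrastructure can provision the queues, anonymization pipelines, and streaming jobs for that event type, and redeploy them whenever the schema changes.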

The balance between centralized and distributed ownership allows most updates to be managed by consumers of the consumption dataset, without requiring intervention from the infrastructure team.

Today, over 1,800 different event types, or signals representing interactions from Spotify users, are being published. In terms of team structure, the data collection area is organized to focus on the event delivery infrastructure, on supporting and enhancing the client SDKs for event transmission, and on building the high-quality datasets that represent the user journey, as well as the infrastructure behind them.

Data Management & Data Processing

Our Data Processing efforts focus on empowering Spotify to utilize data effectively, while Data Management is dedicated to ensuring data integrity through tooling and collaboration. With more than 38,000 actively scheduled pipelines handling both hourly and daily tasks, scalability is a key consideration. Together, these domains are essential for Spotify to manage its extensive data and pipelines: it's crucial to maintain data traceability (lineage), searchability (metadata), and accessibility, while implementing access controls and retention policies to manage storage costs and comply with regulations. These functions enable Spotify to extract maximum value from its data assets while upholding operational efficiency and regulatory standards.

Figure 2: These domains, like Event Delivery, warrant their own comprehensive blog posts. This article provides a closer look at the tools we use and our organizational structure.

The scheduling and orchestration of workflows are essential components of Data Processing. Once a workflow is picked up by the scheduler, it's executed on BigQuery or on Flink or Dataflow clusters. Most pipelines utilize Scio, a Scala API for Apache Beam.
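For readers unfamiliar with Scio, a minimal pipeline looks like the sketch below. The input and output paths are placeholders; real pipelines read from and write to managed data endpoints rather than raw files, but the overall shape, with business logic expressed as Beam transforms through Scio's Scala API, is the same.

```scala
import com.spotify.scio._

// Minimal Scio pipeline: counts words in a text file.
object WordCount {
  def main(cmdlineArgs: Array[String]): Unit = {
    // Parse pipeline options and command-line arguments.
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args("input"))                              // read input lines
      .flatMap(_.split("""\s+""").filter(_.nonEmpty))       // tokenize into words
      .countByValue                                         // count occurrences per word
      .map { case (word, count) => s"$word\t$count" }       // format the output
      .saveAsTextFile(args("output"))                       // write results

    sc.run().waitUntilFinish()                              // submit and wait
  }
}
```

The same code runs locally for testing or on Dataflow in production, depending on the runner options passed to it, which is what makes it a good fit for a centrally scheduled and orchestrated platform.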

Data pipelines generate data endpoints, each adhering to a specific schema and possibly containing multiple partitions. These endpoints are equipped with retention policies, access controls, lineage tracking, and quality checks.

Defining a workflow or endpoint involves custom K8s operators, which help us easily deploy and maintain complex structures. In this way, the resource definition lives in the same repo as the pipeline code and gets deployed and maintained by the code owners.
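As a purely illustrative sketch of what such a resource definition captures (the actual internal resource format, names, and fields are not shown here), an endpoint declaration bundles the properties described above so the operator can provision and enforce them next to the pipeline code:

```scala
// Hypothetical data model of an endpoint definition, for illustration only.
// The real definitions are internal K8s custom resources with different fields.
final case class RetentionPolicy(days: Int)                       // how long partitions are kept

final case class AccessControl(readers: List[String],             // who may read the data
                               owners: List[String])              // who owns and maintains it

final case class DataEndpoint(
  name: String,                // logical name of the dataset
  schemaRef: String,           // reference to the schema the data must conform to
  partitioning: String,        // e.g. "hourly" or "daily"
  retention: RetentionPolicy,  // drives automatic deletion of old partitions
  access: AccessControl,       // access controls applied to the endpoint
  qualityChecks: List[String]  // checks run on each newly produced partition
)
```

Keeping this declaration in the same repo as the pipeline means the team that changes the pipeline also owns the corresponding changes to retention, access, and quality configuration.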

Monitoring options include alerts for data lateness and for long-running or failing workflows and endpoints. Backstage integration facilitates easy resource management, monitoring, cost analysis, and quality assurance.

Building a culture around the data platform

Building a data platform is non-trivial. It needs to be flexible enough to satisfy a variety of different use cases while aligning with cost-effectiveness and return-on-investment goals, and at the same time keep the developer experience lean. The data platform needs to be easy to onboard to and have seamless upgrade paths (nobody likes to be disrupted by platform upgrades and breaking changes). And the platform needs to be reliable: if teams are expected to build business-critical logic on top of the platform, we have to treat the platform itself as business critical as well. 

There are multiple ways to elevate engagement with your product:

  • Documentation (that is easy to find). We have all been in the situation of "I remember reading about it, but I don't remember where." It should be easier to find the documentation than to ask a question (considering the waiting time).
  • Onboard teams. There is no better way to learn about your product than to start using it yourself. Go to your users and embed with them. Learn about different use cases, make sure that your product is easy to use in all possible environments, and bring the learnings back to the platform.
  • Fleetshift the changes. People love evolving and making changes to their infrastructure and having their code highlighted as deprecated, right? Not really. That is why we should automate as much toil and as many migrations as possible. Plan to deal with risks. Make time to support your customers.
  • Build a community where people are free to ask questions and where there are dedicated goalies to answer those questions. Answering community questions should not be left to free will; it should be encouraged and taken seriously. At Spotify we have a Slack channel, #data-support, where all data questions are addressed.

Wrapping up 

Our Data Platform has come a long way, and it continues to evolve. At the very beginning, we were a few people on one team. We ran the pipelines on premises, operating the largest Hadoop cluster in Europe. We are now 100+ engineers building the Spotify data platform on GCP, with data collection, management, and processing capabilities.

There is no formula or script for setting up a data platform. A good way to start is by aligning your organizational needs with your investments. These needs become the drivers for your platform's building blocks, and they may change over time. Make sure the challenges are clear, define clear goals, and set clear expectations; this will help you get the right support from your organization and stay on the path to success.

Get closer to your users and have a clear channel through which customers and stakeholders can reach out and give you direct feedback; it will set the stage for creating a community around your platform. Finally, you do not have to start big: just start somewhere, then evolve, iterate, and learn.

