Encoding Your Domain Expert: The Context Layer Behind Spotify's Data Assistant

At Spotify, data problems used to follow a specific pattern. You'd look for the relevant dashboard and find there wasn't one. You'd message the corresponding data expert on Slack and wait until they had time to help. But with thousands of teams moving fast, the demand for data insights had quietly outpaced what any individual expert could handle alone.
To solve this problem, we started developing an AI data assistant. But Spotify has over 70,000 datasets, amounting to petabytes of data, and no single individual can claim knowledge of all of it. Simply putting every schema into an LLM doesn’t work at this scale either.
For one, context windows are limited. Even a million tokens cannot accommodate a whole data warehouse: spread evenly over 70,000 datasets, that is roughly 14 tokens per dataset. Secondly, schemas do not convey all the information. A column’s INT64 type says nothing about the fact that values below 100 are legacy test data, defined differently from real data, or about what is meant by “active user.” Give a model that many tables, and it will confidently select the wrong one.
We needed something in between. A layer that captures what actually matters about a slice of the warehouse, owned by people who own and understand the domain.
Our data agent
Spotify’s data assistant was built to solve this problem. Ask a question in plain English and get reliable data within seconds. Since August 2025, it has been used by over 2,100 Spotifiers across 13,000+ conversations and 60,000+ messages, drawing on 177 clusters that cover advertising, podcasts, music, audiobooks, finance, creator tools, and more than a dozen other fields. More than a quarter of these users had never written SQL before.
When a question comes in, the agent picks the appropriate context, writes the SQL query, runs it against our warehouse, and returns the answer alongside the query and its sources. It follows a ReAct [1] loop, reasoning and acting in steps, adjusting based on what each tool call returns. You can read how the result was produced, not just what it was.
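The reason-then-act loop described above can be sketched in a few lines. This is a minimal illustration, not the assistant's actual implementation: the names `Step`, `Answer`, `run_agent`, `scripted_policy`, and the tool names `pick_context` and `run_sql` are all hypothetical, the LLM is replaced by a fixed script, and an in-memory SQLite table stands in for the warehouse.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    thought: str   # the model's reasoning for this step
    action: str    # name of a tool, or "finish"
    argument: str  # input for the tool, or the final answer text


@dataclass
class Answer:
    text: str
    sql: Optional[str]
    sources: list
    trace: list = field(default_factory=list)  # (Step, observation) pairs


def run_agent(question, decide, tools, max_steps=5):
    """Alternate reasoning and acting until the policy finishes or the budget runs out."""
    trace, sources, sql = [], [], None
    for _ in range(max_steps):
        step = decide(question, trace)  # reason: choose the next action
        if step.action == "finish":
            return Answer(step.argument, sql, sources, trace)
        observation = tools[step.action](step.argument)  # act: call the tool
        if step.action == "pick_context":
            sources.append(observation)
        elif step.action == "run_sql":
            sql = step.argument
        trace.append((step, observation))  # the next decision sees this result
    return Answer("No answer within the step budget.", sql, sources, trace)


def scripted_policy(question, trace):
    """Stand-in for the LLM (an assumption): a fixed script keyed on steps taken."""
    if len(trace) == 0:
        return Step("Find the cluster that covers this.", "pick_context", question)
    if len(trace) == 1:
        return Step("Query the table.", "run_sql",
                    "SELECT COUNT(*) FROM plays WHERE country = 'SE'")
    return Step("The result answers the question.", "finish", trace[-1][1])


# Toy warehouse: an in-memory SQLite table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (user_id INTEGER, country TEXT)")
conn.executemany("INSERT INTO plays VALUES (?, ?)", [(1, "SE"), (2, "SE"), (3, "US")])

tools = {
    "pick_context": lambda q: "example-cluster",
    "run_sql": lambda sql: str(conn.execute(sql).fetchall()),
}

answer = run_agent("How many plays came from Sweden?", scripted_policy, tools)
print(answer.text, answer.sql, answer.sources)
```

Returning the query, the sources, and the full trace alongside the answer text is what lets a reader check how a result was produced.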
We built it into the surfaces where people already work: a Slack bot for quick questions in the middle of a thread, an MCP server for IDEs and AI tools, and a dedicated web UI for interactive exploration. When no knowledge base covers the topic, the data agent tells you so. That transparency is what makes the answers it gives reliable.
But the interesting part isn't the model. It's how we make sure the answers are trustworthy. That comes down to context and ownership.
The cluster model
At Spotify, we call data domains clusters. A cluster can be tied to an initiative, an organization, or an ad hoc interest. This flexibility lets any insights team build a cluster around their topics, while also letting them know if the domain is already covered. Each cluster is owned by a named team of domain experts and consists of three components:
Datasets: the data warehouse tables that are relevant, with full schema and profiling. We capture column cardinality, samples of common values, and partition structure. When the model generates a WHERE clause, it helps to know that `country` has values like 'US', 'GB', 'SE' rather than guessing.
Pairs: vetted question-and-SQL examples. This is the few-shot mechanism powering the data agent. A domain expert writes or approves each pair, picking examples that teach the patterns they'd want a colleague to follow. Pairs teach the LLM both how to query the data and what the data means.
Docs: additional business context. This could be terminology, gotchas, definitions that vary by team, and guidance on which columns to use and which to avoid.
The curation is owned by the data experts: the data scientists and analytics engineers who know how the data is modeled and how to query it efficiently. They decide how to split their domain into clusters, which tables to include, and which examples are important.
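As a rough sketch of how the three components could fit together, the following models a cluster as plain dataclasses and assembles a few-shot prompt from it. Every class, field, and function name here is hypothetical, and the table, cardinality, and sample values are made up for illustration; the real storage format and prompt layout are not described in this post.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnProfile:
    name: str
    dtype: str
    cardinality: int     # approximate number of distinct values
    sample_values: list  # a few common values, to ground WHERE clauses


@dataclass
class Dataset:
    table: str
    partition_column: str
    columns: list


@dataclass
class Pair:
    question: str
    sql: str
    approved_by: str     # the domain expert who vetted this example


@dataclass
class Cluster:
    name: str
    owner_team: str
    datasets: list = field(default_factory=list)
    pairs: list = field(default_factory=list)
    docs: list = field(default_factory=list)


def build_prompt(cluster, question):
    """Lay out schema profiles, docs, and vetted pairs, then the new question."""
    lines = [f"Cluster: {cluster.name} (owned by {cluster.owner_team})", "", "Tables:"]
    for ds in cluster.datasets:
        lines.append(f"- {ds.table} (partitioned by {ds.partition_column})")
        for col in ds.columns:
            samples = ", ".join(repr(v) for v in col.sample_values)
            lines.append(f"    {col.name} {col.dtype}, ~{col.cardinality} distinct, e.g. {samples}")
    lines += ["", "Notes:"] + [f"- {doc}" for doc in cluster.docs]
    lines += ["", "Examples:"]
    for pair in cluster.pairs:
        lines += [f"Q: {pair.question}", f"SQL: {pair.sql}"]
    lines += ["", f"Q: {question}", "SQL:"]
    return "\n".join(lines)


example = Cluster(
    name="example-podcasts",
    owner_team="example-insights-team",
    datasets=[Dataset("podcast_plays", "play_date", [
        ColumnProfile("country", "STRING", 180, ["US", "GB", "SE"]),
        ColumnProfile("user_id", "INT64", 1000000, [101, 2048, 77310]),
    ])],
    pairs=[Pair("How many plays came from Sweden?",
                "SELECT COUNT(*) FROM podcast_plays WHERE country = 'SE' AND user_id >= 100",
                approved_by="example-curator")],
    docs=["user_id values below 100 are legacy test data; exclude them."],
)

prompt = build_prompt(example, "How many plays came from Great Britain?")
print(prompt)
```

With the sample values in the prompt, the model can see that Great Britain is encoded as 'GB', and the vetted pair shows the test-data filter a colleague would be expected to apply.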
Human judgement
The obvious shortcut was to skip the curator. Our data warehouse holds the complete query history of every data expert who has ever used it. From there, generating question-SQL pairs is straightforward: take a query, ask an LLM to infer the question it was written for, and use those pairs to teach the model how to generate the SQL. These are real queries that people actually wrote to answer their questions: domain knowledge turned into data. It looks like a way to scale.
But the issue here is trust. At Spotify's scale, an overconfident wrong guess can sway a decision in the wrong direction. We wanted the examples that influence the assistant’s behavior to be reviewed and marked as canonical by those familiar with the data.
So we tested it. During our curation phase, we took actual queries that data scientists had run against each domain in our data warehouse, presented them together with their inferred questions, and asked the cluster curators to pick the good examples.
They accepted only 12.5% of the proposed pairs.
The other 87.5% were ad-hoc exploration, debugging sessions, one-off answers no one would ask again, queries that used the wrong table, or queries that were technically correct but taught the wrong pattern. Query history is rich. Most of it is noise. And the signal doesn't label itself.
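To make the review step concrete, here is a small sketch of how curator decisions on candidate pairs could be recorded and summarized. The `Candidate` and `Rejection` types are hypothetical, the rejection reasons mirror the categories listed above, and the eight sample decisions are invented so that the arithmetic reproduces a 12.5% acceptance rate (1 in 8).

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Rejection(Enum):
    AD_HOC_EXPLORATION = "ad-hoc exploration"
    DEBUGGING = "debugging session"
    ONE_OFF = "one-off answer"
    WRONG_TABLE = "wrong table"
    WRONG_PATTERN = "correct but teaches the wrong pattern"


@dataclass
class Candidate:
    inferred_question: str  # question an LLM inferred from the logged query
    sql: str
    rejection: Optional[Rejection] = None  # None means the curator accepted it


def acceptance_rate(candidates):
    """Fraction of reviewed candidates that curators accepted."""
    if not candidates:
        return 0.0
    accepted = sum(1 for c in candidates if c.rejection is None)
    return accepted / len(candidates)


def rejection_breakdown(candidates):
    """Count rejected candidates per reason."""
    return Counter(c.rejection.value for c in candidates if c.rejection is not None)


# Invented review outcomes: one accepted candidate out of eight.
decisions = [None, Rejection.AD_HOC_EXPLORATION, Rejection.AD_HOC_EXPLORATION,
             Rejection.DEBUGGING, Rejection.ONE_OFF, Rejection.ONE_OFF,
             Rejection.WRONG_TABLE, Rejection.WRONG_PATTERN]
reviewed = [Candidate(f"question {i}", f"SELECT {i}", d) for i, d in enumerate(decisions)]

print(f"accepted: {acceptance_rate(reviewed):.1%}")
print(rejection_breakdown(reviewed))
```

Only the accepted candidates would become pairs in a cluster; the breakdown is the kind of summary that shows why raw query history cannot be used unfiltered.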
That's why every example runs through an expert. The model reasons over context. It doesn't decide what's true about the data; the experts do. This isn’t about replacing the people who know best how to work with our data. It’s about giving them more leverage and shipping their expertise in a more scalable way.
Keeping clusters healthy
Data changes, business logic shifts, and context that was accurate last month can be wrong today. Schemas evolve, columns get renamed, tables get deprecated and replaced. Vedder, our data assistant, needs that information to stay current without requiring constant manual attention.
That’s why each cluster has a health score made up of signals we calculate and monitor continuously. How healthy is the underlying data used in the cluster? How many of its curated pairs are still valid after recent schema changes? If a column gets renamed, the pairs referencing it degrade immediately. How well does the context cover the questions people are actually asking? How reproducible is the generated SQL? And a handful of others. If any of these degrade, the cluster’s health score reflects that and actions are suggested.
Data experts see the score and the underlying signals on their cluster dashboard, and use them to decide where to spend curation time.
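One way to picture such a score is as a weighted average of signals, each normalized to the range 0 to 1. The sketch below is an assumption, not the actual formula: the signal names, the equal weights, and the idea that each pair declares the columns it references are all illustrative. It shows how a single column rename could pull the pair-validity signal, and with it the overall score, down.

```python
def pair_validity(pair_columns, current_columns):
    """Share of pairs whose referenced columns all still exist in the schema."""
    if not pair_columns:
        return 1.0
    valid = sum(1 for cols in pair_columns.values() if set(cols) <= current_columns)
    return valid / len(pair_columns)


def health_score(signals, weights):
    """Weighted mean of signals, each expected to lie in [0, 1]."""
    total_weight = sum(weights[name] for name in signals)
    return sum(signals[name] * weights[name] for name in signals) / total_weight


# Hypothetical pairs, keyed by name, with the columns each one references.
pair_columns = {
    "weekly_plays": ["country", "play_date"],
    "top_shows": ["show_id", "country"],
    "active_users": ["user_id", "is_active"],
    "revenue": ["amount"],
}

# Schema after a hypothetical rename of `country` to `country_code`.
current_columns = {"country_code", "play_date", "show_id", "user_id", "is_active", "amount"}

signals = {
    "data_health": 1.0,
    "pair_validity": pair_validity(pair_columns, current_columns),
    "coverage": 0.75,         # made-up value
    "reproducibility": 0.75,  # made-up value
}
weights = {name: 1.0 for name in signals}  # equal weights, purely for illustration

print(signals["pair_validity"], health_score(signals, weights))
```

Here the rename invalidates two of the four pairs, so the validity signal drops to 0.5. A real check would have to extract column references from the SQL itself rather than rely on declared lists.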
Closing the loop
Every conversation with Vedder becomes a data point that feeds back into the system. Vedder logs every conversation and query, and the questions, answers, generated SQL and user feedback are shown to cluster owners.
This is how we scale the knowledge of a data scientist. Every question-SQL pair they approve and every doc they clarify helps the next user get more accurate insights. The answers are only as trustworthy as the context behind them, and that context needs tending.
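The feedback loop described in this section could be sketched as a log of conversation turns grouped per cluster for its owners to review. The `LoggedTurn` record, the "up"/"down" feedback encoding, and the `review_queue` function are assumptions made for illustration; the post only states that questions, answers, generated SQL, and user feedback are shown to cluster owners.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoggedTurn:
    cluster: str
    question: str
    generated_sql: str
    answer: str
    feedback: Optional[str] = None  # "up", "down", or None if the user gave none


def review_queue(turns):
    """Group logged turns by cluster, with negatively rated turns listed first."""
    by_cluster = defaultdict(list)
    for turn in turns:
        by_cluster[turn.cluster].append(turn)
    for items in by_cluster.values():
        # False sorts before True, so "down" turns move to the front;
        # the sort is stable, so the original order is otherwise kept.
        items.sort(key=lambda turn: turn.feedback != "down")
    return dict(by_cluster)


log = [
    LoggedTurn("podcasts", "Plays in Sweden?", "SELECT 1", "42", "up"),
    LoggedTurn("ads", "Spend last week?", "SELECT 2", "7"),
    LoggedTurn("podcasts", "Active users?", "SELECT 3", "9", "down"),
]

queue = review_queue(log)
for cluster, items in queue.items():
    print(cluster, [(t.question, t.feedback) for t in items])
```

A negatively rated turn is a natural candidate for a new vetted pair or a clarified doc, which is how the logged conversation feeds back into the context.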
Beyond Spotify
Spotify has a strong data foundation with well-maintained datasets, a data catalog, and data scientists who care about their domains. That made Vedder possible, but the architecture isn't Spotify-specific.
The core idea remains valid: the people who best understand a data domain are the best ones to curate the context the model sees. Humans and LLMs can only understand raw schemas to a certain extent; context and understanding are what enable insights at scale. The role of our data experts becomes more strategic. They spend less time answering one-off questions and more time shaping the knowledge layer that answers thousands.
Context curation is the foundation. But what if the knowledge lies outside the schema? What if it exists in documentation and definitions of processes within the organization? These are some of the questions we are exploring next.
Citations
[1] https://arxiv.org/abs/2210.03629



