Introducing Voyager: Spotify’s New Nearest-Neighbor Search Library

October 25, 2023 Published by Peter Sobot, Staff ML Engineer

For the past decade, Spotify has used approximate nearest-neighbor search technology to power our personalization, recommendation, and search systems.

These technologies allow engineers and researchers to build systems that recommend similar items (like similar tracks, artists, or albums) without needing to run slow and expensive machine learning algorithms in real time.

Spotify led the pack by building and open sourcing Annoy, our hugely popular nearest-neighbor search library, back in 2013. Since then, Annoy has served us extremely well, powering features like Discover Weekly, Home, and countless others.

The evolving nearest-neighbor ecosystem

Over the past decade, the state of the art in nearest-neighbor search has advanced considerably. Annoy is now solidly in the middle of the pack, and there are systems out there that can produce results twice as accurate in the same amount of time or similar-quality results in one-tenth the time.
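To make "accuracy" concrete: approximate nearest-neighbor libraries are usually compared by recall, the fraction of the true nearest neighbors that the approximate search actually returns. The sketch below is a minimal illustration using only the Python standard library; the helper names (exact_neighbors, recall_at_k) and the toy vectors are hypothetical and are not taken from any Spotify library.

```python
import math

def exact_neighbors(query, vectors, k):
    """Brute-force search: rank every vector by Euclidean distance to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return ranked[:k]

def recall_at_k(approximate_ids, exact_ids):
    """Fraction of the true nearest neighbors that the approximate result contains."""
    exact = set(exact_ids)
    return len(exact & set(approximate_ids)) / len(exact)

# Hypothetical toy data: four 2-D vectors and a query at the origin.
vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
query = (0.0, 0.0)

truth = exact_neighbors(query, vectors, k=3)

# Pretend an approximate index returned these IDs, missing one true neighbor.
approximate = [0, 1, 3]
recall = recall_at_k(approximate, truth)
print(truth, round(recall, 3))
```

Brute-force search like this is exact but scales linearly with the number of vectors, which is why approximate indexes trade a little recall for much lower query time.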

In addition to technical advances, the nearest-neighbor search ecosystem is growing quickly: many vendors offer nearest-neighbor search as part of their database offerings (Weaviate, Pinecone, Vespa, Chromadb), and many traditional database engines are also adding support for vector-based search (e.g., pgvector in PostgreSQL).

While accuracy and speed are two major factors in comparing nearest-neighbor technologies, others are important to Spotify’s engineers:

  • Flexibility: Each application of nearest-neighbor search has different needs and constraints. Being able to customize every part of the search algorithm is incredibly useful as it can allow a balance to be struck among maximum performance, maximum throughput, minimum latency, and minimum cost.
  • Statelessness: Many of Spotify’s systems use nearest-neighbor search in memory, enabling stateless deployments (via Kubernetes) and almost entirely removing the cost and maintenance burden of running a stateful database cluster.
  • Language support: Backend and data engineers at Spotify prefer to deploy production systems on JVM-based languages, like Java and Scala, to maximize performance — while many machine learning use cases operate with Python. Many of the newer nearest-neighbor technologies have poor support for languages other than Python, or they provide client libraries for many languages but require running a database process alongside the deployment.
  • Cost: The most advanced nearest-neighbor algorithms are incredibly fast and produce extremely high-quality output. But to do so, they often require large amounts of memory. For most use cases, extreme accuracy can be sacrificed for a significant decrease in cost.

Voyager: Spotify’s solution

Since 2018, many teams across Spotify have been experimenting with an open source library for nearest-neighbor search called hnswlib (pronounced “h-n-s-w lib”). This library offers a tenfold speed increase over Annoy, and it was very useful as we scaled up to use cases that required higher-dimensional embeddings.

However, as we deployed it at scale, we identified a significant number of changes we wanted to make to hnswlib. These included modifications to its on-disk data format and its API, as well as substantial architectural changes to make the codebase easier to maintain. Changes like these break backward compatibility, which is very hard to do when a software package has lots of users, and hnswlib already has more than 700,000 downloads every month.

To work around this issue, we decided to build a new package entirely. We call it Voyager — because it searches through vector spaces, just like NASA’s Voyager space probes.

Voyager is a new nearest-neighbor search library based on hnswlib, intended to succeed Annoy as Spotify’s recommended nearest-neighbor search library for production use. Voyager combines the increased accuracy and speed from HNSW with well-tested, well-documented, and production-ready bindings for both Java and Python.

What Voyager offers

Voyager’s philosophy is to offer a rock-solid, stable, production-ready library that allows anybody to add nearest-neighbor index lookup to their application, in Python or Java. Its features include:

  • More than 10 times the speed of Annoy (at the same recall) or up to 50% more accuracy (at the same speed)
  • Up to 4 times less memory usage than Annoy (thanks to E4M3 8-bit floating point)
  • Fully multithreaded index creation and querying
  • Fully supported Python and Java bindings with identical interfaces
  • Production-ready, fault-tolerant index files with corruption detection
  • Google Cloud Platform–compatible stream-based I/O (stream indices from Google Cloud Services!)
  • Built-in support for string-based identifiers (i.e., query by URI)
  • 16 times less memory usage versus hnswlib at index creation time
  • Near dependency-free install: only NumPy (any version) is required in Python, and there are no Java dependencies
  • macOS, Windows, and Linux support for both x86 and arm64 CPUs
  • Full Python and Java documentation
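As a rough illustration of where the "up to 4 times less memory" figure can come from, the arithmetic below compares raw vector storage at 32 bits per value with storage at 8 bits per value. This is a back-of-the-envelope sketch, not a measurement: the index size and dimensionality are hypothetical, the helper name is made up for this example, and the calculation ignores graph links and any other per-item overhead a real index carries.

```python
def raw_vector_bytes(num_vectors, num_dimensions, bytes_per_value):
    """Bytes needed to store the vector values alone, ignoring index overhead."""
    return num_vectors * num_dimensions * bytes_per_value

# Hypothetical index: 10 million vectors with 128 dimensions each.
NUM_VECTORS = 10_000_000
NUM_DIMENSIONS = 128

float32_bytes = raw_vector_bytes(NUM_VECTORS, NUM_DIMENSIONS, 4)  # 32-bit floats
e4m3_bytes = raw_vector_bytes(NUM_VECTORS, NUM_DIMENSIONS, 1)     # 8-bit floats

print(f"float32: {float32_bytes / 1e9:.2f} GB")
print(f"8-bit:   {e4m3_bytes / 1e9:.2f} GB")
print(f"ratio:   {float32_bytes / e4m3_bytes:.0f}x")
```

The saving is not free: an 8-bit float has far less precision and a smaller representable range than a 32-bit float, so whether it suits a given set of embeddings is worth checking against recall on real data.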

Voyager in production

This blog post isn’t an announcement of what we plan to do; rather, it’s an announcement of what we’ve been doing for almost a year. Voyager is battle tested, having been used by many teams at Spotify to serve production traffic since 2022. And now it’s open source on GitHub for all to use.

To try out Voyager, check out spotify.github.io/voyager, or just run pip install voyager to get started in Python.

You can also hear more about Voyager in episode 23 of NerdOut@Spotify.

