Spydra – Ephemeral Hadoop Clusters Using Google Cloud Platform

April 28, 2017 | Published by Hannu Varjoranta

We are happy to announce a new open-source project from Spotify called Spydra. Spydra makes it easy to run data processing jobs on Google Cloud Platform (GCP) Dataproc, automating cluster life-cycle management and simplifying the troubleshooting of failed jobs. It supports both ephemeral clusters, created only for the duration of a single job execution, and static, long-lived clusters.
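To give a flavor of what this looks like in practice, submitting a job with an ephemeral cluster boils down to a single command along the lines of the sketch below. The flag names, client id, and jar are illustrative placeholders; the authoritative options and the full JSON-based configuration are documented in the project repository.

    # Hypothetical example: Spydra creates a Dataproc cluster, submits the job,
    # and tears the cluster down once the job has finished.
    spydra submit --clientid simple-spydra-test \
        --jar hadoop-mapreduce-examples.jar pi 8 100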

Spydra is part of Spotify’s effort to migrate its data infrastructure to GCP. The principles and design of Spydra are based on our experience scaling and maintaining a Hadoop cluster of over 2,500 nodes and over 100 PB of capacity, running about 20,000 independent jobs per day. Spydra supports submitting data processing jobs to Dataproc as well as to existing on-premise Hadoop infrastructure, and was designed to ease the migration to GCP while supporting the parallel use of on-premise Hadoop and Dataproc.

Stay tuned for more updates coming around Spydra soon!

Spydra is available on GitHub: https://github.com/spotify/spydra

