Automated Incident Response Infrastructure in GCP

April 4, 2019 Published by Spotify Engineering

Incident responders want to have as much information as possible to ease the investigation and triage process. Additionally, intrusion detection engineers want to know about forensic artifacts and map server baselines (running processes, storage artifacts on disk) on a large fleet of servers in order to quickly identify anomalies.

This is difficult at Spotify’s scale of infrastructure, given the questions that our security team has to answer. If an attacker compromised a set of machines with malware that produces a unique file artifact, we’d like some way to remotely pull process memory from all running machines that have that artifact. If your fleet is on the order of tens of thousands of hosts, doing this by hand quickly becomes tiresome and error-prone.

This is why we deployed Google’s GRR (GRR Rapid Response), an incident response framework focused on conducting remote live forensics, including investigations of disk artifacts, memory retrieval from live machines, and remote binary management.

One issue we had when we started experimenting with GRR was bootstrapping a usable instance for testing and extension purposes. There were few tools available for engineers to deploy it for a large enterprise, and much of our time was spent searching for an easy-to-use, scalable solution.

GRR Architecture Diagram

After searching, the Spotify Security team realized that a complete solution was not available, so we decided to publicly release, for anyone to use, the Terraform-based GRR server deployment in GCP that works for our evaluation purposes. Terraform is a tool that lets developers define and deploy resources with reproducible, standardized configurations. Although this module is currently in testing, users are able to deploy the following GRR architecture with ease:

GRR’s server component typically consists of the following three pieces:

  • GRR Frontend: Frontends are in charge of sending and receiving messages from GRR clients.
  • GRR Worker: Workers are in charge of inspecting messages from clients and executing flows that have been scheduled on the GRR server.
  • GRR Admin UI: The Admin UI is a graphic interface that allows incident responders to easily schedule hunts.

Using this Terraform module, GCE instance groups running Google’s Container-Optimized OS are created for each GRR component. All AdminUI and Frontend machines sit behind GCP Load Balancers, and health checks are automatically bootstrapped for all machines. All communication between users and the AdminUI goes through an HTTPS load balancer and is authenticated using Google Identity-Aware Proxy. Additionally, the Terraform module provides plug-and-play support for the code signing, client encryption, and client verification keys that are required to set up the server. Clients are verified to servers using the CA key pair, and communication is encrypted with the client encryption key pair. For more information about how GRR manages communication security, please refer to the official documentation here.

Communication between the AdminUI, Frontend, and Workers is managed via a single Google Cloud SQL database instance. Official documentation for GRR is here.

We also wanted to further harden GRR by managing access control under Google’s Identity-Aware Proxy product without having to maintain users in both GCP and the GRR user database. To that end, we contributed an authentication hook so that users can authenticate into GRR with the JWT provided by Google’s Identity-Aware Proxy, should they choose to place GRR behind it.
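To make the idea concrete, here is a minimal sketch of what such a hook has to do. It is not the code we contributed to GRR: the names IapAuthError, identity_from_request, and verify_with_google_auth are hypothetical, and mapping the verified email claim straight to a username is an assumption. What comes from Identity-Aware Proxy itself is the x-goog-iap-jwt-assertion request header, the https://cloud.google.com/iap issuer, and the public key URL. The google-auth call follows Google's published IAP sample and is imported lazily, so the rest of the sketch runs without that library installed.

```python
from typing import Any, Callable, Mapping

# Values defined by Identity-Aware Proxy.
IAP_JWT_HEADER = "x-goog-iap-jwt-assertion"
IAP_ISSUER = "https://cloud.google.com/iap"
IAP_CERTS_URL = "https://www.gstatic.com/iap/verify/public_key"

# A verifier takes (token, expected_audience) and returns the decoded claims,
# raising ValueError if the token is invalid.
Verifier = Callable[[str, str], Mapping[str, Any]]


class IapAuthError(Exception):
    """Hypothetical error: the request has no IAP-verified identity."""


def verify_with_google_auth(token: str, audience: str) -> Mapping[str, Any]:
    """Verify signature, expiry, and audience using the google-auth library."""
    # Imported here so the sketch stays importable without google-auth.
    from google.auth.transport import requests
    from google.oauth2 import id_token

    return id_token.verify_token(
        token,
        requests.Request(),
        audience=audience,
        certs_url=IAP_CERTS_URL,
    )


def identity_from_request(
    headers: Mapping[str, str],
    audience: str,
    verify: Verifier = verify_with_google_auth,
) -> str:
    """Return the email of the IAP-authenticated user, or raise IapAuthError."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    token = next(
        (value for name, value in headers.items()
         if name.lower() == IAP_JWT_HEADER),
        None,
    )
    if not token:
        raise IapAuthError("missing IAP assertion header")

    try:
        claims = verify(token, audience)
    except ValueError as exc:
        raise IapAuthError("IAP assertion failed verification") from exc

    if claims.get("iss") != IAP_ISSUER:
        raise IapAuthError("unexpected token issuer")

    email = claims.get("email")
    if not email:
        raise IapAuthError("IAP assertion carries no email claim")
    # Assumption: the email is used directly as the GRR username.
    return email
```

The audience argument is the string IAP assigns to the backend; for a load-balanced backend service it has the form /projects/PROJECT_NUMBER/global/backendServices/SERVICE_ID. Passing the verifier in as a parameter keeps the header handling and claim checks testable without network access.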

With this current setup, we can easily provision GRR agents on hosts throughout Spotify’s internal network, run automated scans, and conduct remote memory forensics. This new level of visibility gives us an easy way to manage large numbers of disk and memory artifacts, as well as a framework to ingest these artifacts into our intrusion detection pipeline. GRR supports process memory dumping as well as signature search with YARA, allowing incident responders to quickly identify in-memory artifacts and back up anomalous processes for later analysis.
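As a rough illustration of what signature search over process memory means, the sketch below scans already-collected memory dumps for fixed byte patterns and reports where each one occurs. It is a deliberately simplified stand-in and does not use YARA or GRR: real YARA rules support far richer conditions than literal byte strings. The names Hit and scan_dumps, and the idea of keying dumps by process ID, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    """One occurrence of a signature inside one process's memory dump."""
    pid: int
    signature: str
    offset: int


def scan_dumps(dumps: Dict[int, bytes], signatures: Dict[str, bytes]) -> List[Hit]:
    """Find every occurrence of every signature in every dump.

    dumps maps a process ID to its dumped memory; signatures maps a
    signature name to the literal byte pattern to look for.
    """
    hits: List[Hit] = []
    # Sort both inputs so the result order is deterministic.
    for pid, memory in sorted(dumps.items()):
        for name, pattern in sorted(signatures.items()):
            if not pattern:
                continue  # an empty pattern would match at every offset
            offset = memory.find(pattern)
            while offset != -1:
                hits.append(Hit(pid=pid, signature=name, offset=offset))
                # Resume one byte later so overlapping matches are reported.
                offset = memory.find(pattern, offset + 1)
    return hits
```

Calling scan_dumps with a dictionary of dumps and a dictionary of named patterns returns a list of Hit records, which could then be used to decide which processes to preserve for later analysis.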

There’s still lots of future direction in the GRR universe, and we’re excited to continue evaluating how it fits in our incident response process. In the meantime, we want to make drop-in installation available for users in the GCP environment. A final note: We’d like to thank Google and the open source DFIR community for their support in understanding GRR’s technical details, architecture, and deployment issues. We couldn’t have done it without them!

If any of this sounds interesting to you, our team is currently hiring for engineers in the intrusion detection, forensics, and incident response space. We’re always excited to test new technologies, and we’re proud to say that engineers at Spotify have the opportunity to push an exciting field further. Please check out www.spotifyjobs.com for more information about joining the band!

