How We Automated Content Marketing to Acquire Users at Scale
Spotify runs paid marketing campaigns across the globe on various digital ad platforms like Facebook, Google UAC (display banners), TikTok, and more. Being efficient with our marketing budget is critical for maximizing the return on ad spend so that we can continue to develop ads that communicate the value of Spotify to users and non-users alike. Running and managing paid marketing campaigns at a global scale is not an easy job — as humans and marketers, it’s incredibly difficult to catch all the possible edge cases. And so we asked ourselves, How can we combine a scalable approach — using tools like automated creative generation, machine learning, and ad interaction data — with Spotify’s unmatched content library? That combination could allow us to:
- Better convey Spotify’s values
- Make our performance marketing more efficient
- Handle the scale of the tens of thousands of ads we run globally
Prior to 2019, we conducted a few tests of an off-platform content marketing hypothesis that resulted in varying levels of success. A manual test in H1 2019 again demonstrated the potential of bringing in incremental users with content ads; the next step was to focus on scalability. Could we build a system that automatically generated content-based ads, loaded them into our digital marketing channels, observed how they performed, and made changes on the fly? At the time, off-the-shelf components could get us partway there, but we still lacked capabilities like creative asset generation at scale and a method for reliably estimating and fine-tuning our target metric — cost per registration (CPR) per market (basically, the average amount of money we want to spend to acquire a new user). So we began to brainstorm. First, we thought about the system behavior we sought and conceived a basic loop of five stages (sketched in code after the list):
- Ingest
- Rank
- Deploy
- Learn
- Repeat
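In very rough pseudocode, the loop looks something like the sketch below. This is only an illustration of the shape of the system; every function here is a hypothetical placeholder, not our actual implementation.

```python
# A minimal, hypothetical skeleton of the five-stage loop; each function is a stub.

def ingest() -> list[dict]:
    # Ingest: pull catalog metadata plus ad/MMP performance metrics (stubbed out here).
    return []

def rank(content: list[dict]) -> list[dict]:
    # Rank: order candidate artist/template combinations per market (stubbed out here).
    return content

def deploy(ranked: list[dict]) -> None:
    # Deploy: render creatives for the top-ranked content and traffic them to ad platforms.
    print(f"deploying {len(ranked)} creatives")

def learn() -> None:
    # Learn: fold the newest observed performance back into the ranking model.
    pass

def run_daily_cycle() -> None:
    deploy(rank(ingest()))
    learn()
    # Repeat: in production this cycle is triggered on a regular (e.g., daily) schedule.

if __name__ == "__main__":
    run_daily_cycle()
```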
The technical approach to automated content ad generation
With this in mind, we set about automating each step of the Acquisition System one at a time. We first focused on the content generation element; for any given marketing opportunity on, say, Facebook, there are dozens of combinations of aspect ratios and sizes for ad slots. When you factor in styling and graphical elements like cover art, you can find yourself looking at hundreds of individual creative ad assets for a single campaign in a single language or region, and that’s even before you tackle issues like internationalization. We knew from the start that if we couldn’t scale content generation, the off-platform side of the effort would stall at the starting gate.
Initially, we used some rather basic templating, building a Java-based backend service that would retrieve content elements from our metadata services and layer them into a static image with basic styling using our color picker.
This service was able to generate the static imagery we needed, but it limited us to very simplistic static templates.
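To make the idea concrete, here is a rough sketch of that kind of static templating in Python with Pillow. The real service was a Java backend calling our internal metadata services; the file paths, dimensions, and background color below are hypothetical stand-ins.

```python
from PIL import Image, ImageDraw, ImageFont

def render_static_ad(cover_art_path: str, artist_name: str, out_path: str,
                     size: tuple[int, int] = (1080, 1080),
                     background: str = "#1ED760") -> None:
    # Start from a solid background chosen by the color picker.
    canvas = Image.new("RGB", size, background)
    # Layer the cover art (fetched from metadata services in the real system) onto the template.
    cover = Image.open(cover_art_path).resize((600, 600))
    canvas.paste(cover, ((size[0] - 600) // 2, 120))
    # Add simple text styling for the artist name.
    draw = ImageDraw.Draw(canvas)
    draw.text((240, 820), artist_name, font=ImageFont.load_default(), fill="white")
    canvas.save(out_path)

# Hypothetical inputs: one template, one aspect ratio, one artist.
render_static_ad("cover.jpg", "Example Artist", "ad_1080x1080.png")
```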
What about animation?
Static ads can be useful in certain scenarios — e.g., when the end user’s bandwidth is at a premium — but in most cases, we want motion to help engage the user. Simple animations can be generated with knowledge of translation routines and Bezier curves, but more complex creative treatments, which needed to be described through JSON files, were beyond the simple system we had built around hardcoded templates. Additionally, we encountered issues with right-to-left rendering for languages like Arabic, and there were creative limitations for our designers. We considered several off-the-shelf solutions already in use at Spotify, like Lottie and Blender. However, at the time, Lottie could not support our templating needs at scale, and Blender was an unfamiliar tool for our creative teams, who were looking to push the bounds of compelling content quickly.
Adobe After Effects — the motion graphics tool — seemed like a natural choice, given its familiarity to our teams and its ability to provide the creative freedom our designers needed to generate work that could really catch someone’s attention. Most importantly, it provided us a mechanism to create templates that could then be used to generate the dozens of aspect ratios and sizes we needed. There was just one problem: After Effects is a desktop tool, and we were unable to fire up a desktop application in a GCP compute instance anytime we wanted.
Or could we?
Adobe provides a handy helper executable that lets users fire up the render pipeline with a command line, for example:
aerender -project c:\projects\project_1.aep -comp "Composition_1" -s 1 -e 10
-RStemplate "Multi-Machine Settings" -OMtemplate "Multi-Machine Sequence"
-output c:\output\project_1\frames[####].psd
It felt like we were getting somewhere!
Unfortunately, it still required things like a host node with After Effects installed, compositing assets available locally to the OS image, and so forth. So while it was close, After Effects still wasn’t quite what we wanted.
Our next option? Enter nexrender, an open source project that extends aerender into a first-class batchable system. Here, we can script file movements from network locations, specify multiple output formats all at once, and manage headless aerender nodes to slice up our batch jobs efficiently.
With nexrender, we were finally able to combine all the pieces we needed to automate the creation of visuals.
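To give a flavor of what that looks like, a nexrender job is essentially a JSON document describing the After Effects template, the dynamic assets to swap in, and the desired outputs. The sketch below assembles such a job in Python and submits it to a nexrender server; the URLs, port, secret, composition, and layer names are all hypothetical, and the exact HTTP endpoint may differ depending on your nexrender version and deployment.

```python
import requests

# Hypothetical template/asset locations; layer names must match layers in the AE template.
job = {
    "template": {
        "src": "https://storage.example.com/templates/story_9x16.aep",  # After Effects project
        "composition": "main",                                          # composition to render
    },
    "assets": [
        {"type": "image", "layerName": "cover_art",
         "src": "https://storage.example.com/assets/cover.jpg"},
        {"type": "data", "layerName": "artist_name",
         "property": "Source Text", "value": "Example Artist"},
    ],
}

# Submit the job to a nexrender server, which farms it out to headless aerender workers.
resp = requests.post(
    "http://nexrender.internal:3050/api/v1/jobs",   # placeholder host/port
    json=job,
    headers={"nexrender-secret": "changeme"},       # placeholder secret
)
resp.raise_for_status()
print("queued render job:", resp.json().get("uid"))
```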
Now that we had the creative elements, what content would we include? And where? And what ad do we play for, say, a Gen Z audience in the EMEA region? How do we figure all this out?
Content ranking
Content ranking is where the magic of data and machine learning (ML) comes into the picture. We leveraged the power of ML, combined with the valuable data sources available to us, to rank the content on a daily basis. We then fed the ranked content to the ad creative generation system to attract specific target audiences on different marketing channels, encouraging them to join Spotify and enjoy what we have to offer.
It’s standard industry practice to collect data points on key ad performance metrics including clicks, impressions, app installs, registrations, subscriptions, etc., so that marketing campaigns can be optimized to earn the best return on spend possible. We started designing and implementing data pipeline components that would query different ad platform APIs as well as mobile measurement partner (MMP) APIs to fetch these ad performance metrics and attribution data, allowing us to prepare a quality dataset. As alluded to above, one of the unique selling propositions (USPs) of this project was to leverage Spotify’s vast content catalog data about all the artists on our platform around the world. Combining the local popularity of the artists per country from this catalog with the collected performance metrics of the campaign ads would give the content ranking model solid data to train on and churn out quality rankings.
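As a simplified sketch of that dataset preparation, the snippet below joins per-artist ad performance (the kind of data pulled from ad platform and MMP APIs) with per-country artist popularity from the catalog. The column names, numbers, and the pandas-based approach are illustrative only, not our production pipeline code.

```python
import pandas as pd

# Illustrative rows; in reality these come from ad platform / MMP APIs and the content catalog.
ad_metrics = pd.DataFrame({
    "artist_id":     ["a1", "a2", "a1"],
    "country":       ["SE", "SE", "BR"],
    "impressions":   [120_000, 80_000, 64_000],
    "clicks":        [2_400, 1_100, 900],
    "registrations": [310, 140, 95],
    "spend_usd":     [450.0, 290.0, 180.0],
})
catalog_popularity = pd.DataFrame({
    "artist_id":        ["a1", "a2"],
    "country":          ["SE", "SE"],
    "local_popularity": [0.92, 0.73],
})

# Join ad performance with local artist popularity to get one training row per artist/country.
training = ad_metrics.merge(catalog_popularity, on=["artist_id", "country"], how="left")
training["ctr"] = training["clicks"] / training["impressions"]
training["cpr"] = training["spend_usd"] / training["registrations"]
print(training)
```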
But these content rankings, that is, which artist to feature in which country’s marketing campaign ad, also had another dimension: the ad creative template. Having a system for combining artists with templates was just the first part of the challenge. Knowing which of the hundreds of thousands of artists to feature, and which template to select from a myriad of design choices, was the other part.
Now, you might be asking, Couldn’t we create all the possibilities and then let the already sophisticated audience-targeting algorithms built into social media and search platforms optimize that selection for us based on the best-performing generated ad creatives?
While these algorithms may work when you have a human-scale number of four to eight ads to optimize across, they are not necessarily designed for the orders of magnitude more options (potentially unlimited!) that we would want to consider. As a result, a preranking step is required to identify the best ad creatives for trafficking, after which we measure their performance to optimize the campaign.
Content ranking, the heuristic way
So how do we actually go about preranking? Let’s first simplify the problem to only choosing a particular set of artists to feature (it’s illustrative, and there’s plenty of complexity here already). For any ad campaign, we would like the set of artists we choose to maximize the reach, conversion, and cost efficiency of the campaign they are featured in. At Spotify, we know a lot about our artists — for instance, their popularity among our users. However, we don’t know how well they reach or convert, or what they cost, inside a third-party platform. We also don’t know directly which users are being shown our ads, so the conversion rate or cost per impression on one day might not mean the same thing on a different day. Finally, cost from one day to another could vary based on market forces like demand or music-streaming competitors.
With these issues in mind, we needed to find a way to estimate the combination of reach, conversion, and cost efficiency of the different artists in each campaign, without actually having interpretable numbers from each.
Our first insight was to leverage the information from the ad platforms themselves, since those calculations weren’t directly available to us. Specifically, we were interested in each platform’s particular ad-placement algorithms and how they rebalance ads for each of their users. We used the share of registrations coming from each artist as a quality score for that artist.
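Concretely, that quality score is just each artist’s fraction of the registrations attributed to a campaign. A tiny sketch with made-up numbers:

```python
# Share-of-registrations quality score: each artist's fraction of a campaign's registrations.
registrations = {"artist_a": 310, "artist_b": 140, "artist_c": 95}

total = sum(registrations.values())
quality_score = {artist: regs / total for artist, regs in registrations.items()}
print(quality_score)  # roughly {'artist_a': 0.57, 'artist_b': 0.26, 'artist_c': 0.17}
```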
From there, we could apply the algorithms we built to solve the preranking problem, but before we were able to dive into it, we needed to take two factors into consideration:
- How do we combine the kind of quality score we described with other information we have about the artist, e.g., popularity?
- How do we estimate what the quality scores of different artists are if we haven’t yet observed them?
To answer these questions, we looked to our first heuristic.
Our first heuristic combined three calculations or data points to automatically decide which artists to feature in campaigns: popularity, share of registrations, and diversity. First, we used popularity to build up a set of eight artists to place into a campaign. We then observed how these artists performed using the share-of-registrations metric described above. Once we had these observations, we used our knowledge graph of artist similarity to help predict the quality score similar artists might have. Finally, we evaluated the performance and optimized our choices based on both popularity and differentiation from the other artists and ads that remained. This gave the heuristic a controlled way to explore a diverse pool of ads and artists.
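A toy version of that heuristic might look like the sketch below: each candidate gets an equal-weight blend of popularity, an observed-or-estimated quality score, and diversity relative to the artists already picked (the fixed one-third weighting discussed in the next section). The numbers and the similarity lookup are hypothetical stand-ins for our artist knowledge graph.

```python
# Toy greedy selection mixing popularity, quality (share of registrations), and diversity.
popularity = {"a1": 0.9, "a2": 0.8, "a3": 0.7, "a4": 0.6}
observed_quality = {"a1": 0.57, "a2": 0.26}          # share of registrations where we have data
similarity = {("a3", "a1"): 0.8, ("a3", "a2"): 0.2,  # made-up symmetric similarities in [0, 1]
              ("a4", "a1"): 0.1, ("a4", "a2"): 0.7,
              ("a1", "a2"): 0.3, ("a3", "a4"): 0.2}

def sim(x: str, y: str) -> float:
    return similarity.get((x, y), similarity.get((y, x), 0.0))

def estimated_quality(artist: str) -> float:
    # Use the observed quality score if we have one; otherwise borrow from similar artists
    # via a similarity-weighted mean.
    if artist in observed_quality:
        return observed_quality[artist]
    weights = {o: sim(artist, o) for o in observed_quality}
    total = sum(weights.values()) or 1.0
    return sum(w * observed_quality[o] for o, w in weights.items()) / total

def diversity(artist: str, chosen: list[str]) -> float:
    # 1.0 for the first pick, otherwise distance from the most similar artist already chosen.
    return 1.0 if not chosen else 1.0 - max(sim(artist, c) for c in chosen)

def pick_artists(k: int = 2) -> list[str]:
    chosen: list[str] = []
    candidates = set(popularity)
    while len(chosen) < k and candidates:
        # Equal one-third weighting of the three factors.
        best = max(candidates,
                   key=lambda a: (popularity[a] + estimated_quality(a) + diversity(a, chosen)) / 3)
        chosen.append(best)
        candidates.remove(best)
    return chosen

print(pick_artists())
```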
Content ranking, the ML way
While the first heuristic used a simple, fixed way of weighing these three factors (i.e., one-third each), we started wondering whether there was a better way for each factor to contribute to predicting the quality of artists used in the ads. So we transformed this question into a supervised ML problem, where features of each artist were used to predict the share of registrations, and popularity was now added as one of the features to learn from. This also gave us a way to add features we had not considered before in the heuristic model, such as those of the campaign itself — metadata like campaign market, ad creative dimension, operating system, template theme and variation, etc.
The heuristic allowed us to get off the ground with a running start, but the ML solution allowed us to combine the covariates of the problem into a more powerful algorithm that netted us 9% more monthly active users (MAUs) over the heuristic during the lifetime of its operation.
We used the XGBoost library (via the Spotify Kubeflow managed service offered by the Platform mission), which implements the ML algorithm using a gradient boosting framework internally. The model was trained on data points for various features related to campaign-level information, artist metadata, and the ad creative template data. Each day, the model was trained on the historical data of these features over a certain lookback window and predicted two main target variables.
In the case of the predictive model for Spotify free-tier ads:
- reg_percentage: Percentage of Spotify user registrations that the ranked artist will contribute to
- relative_cpr_ratio: Ratio/share of the ranked artist in the overall CPR of a marketing campaign
In the case of the predictive model for Spotify premium-tier ads:
- sub_percentage: Percentage of Spotify premium user subscriptions that the ranked artist will contribute to
- relative_cps_ratio: Ratio/share of the ranked artist in the overall cost per subscription (CPS) of a marketing campaign
We decided to use relative metrics instead of absolute metrics because raw metrics — such as number of registration or subscription events and their unit costs — depend on external market factors (such as supply and demand for ads on platforms) that are hard to model.
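For illustration, a stripped-down version of the daily training step for the free-tier model could look like the sketch below: features describing the campaign, the artist, and the creative template are regressed against the relative target (reg_percentage here). The feature names, synthetic data, and hyperparameters are hypothetical; the production model ran on the Kubeflow-based managed service mentioned above.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

# Synthetic stand-in for the lookback-window training data; real features cover campaign
# metadata, artist metadata, and ad creative template data.
rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({
    "artist_popularity":  rng.uniform(0, 1, n),
    "market":             rng.integers(0, 20, n),   # categorical features encoded as integers here
    "os":                 rng.integers(0, 2, n),
    "template_variation": rng.integers(0, 12, n),
    "creative_aspect":    rng.integers(0, 5, n),
})
# Target: the artist's share of the campaign's registrations (reg_percentage).
df["reg_percentage"] = np.clip(0.5 * df["artist_popularity"] + rng.normal(0, 0.05, n), 0, 1)

features = df.drop(columns=["reg_percentage"])
dtrain = xgb.DMatrix(features, label=df["reg_percentage"])

# Gradient-boosted trees regressing the relative target; hyperparameters are illustrative.
params = {"objective": "reg:squarederror", "max_depth": 4, "eta": 0.1}
model = xgb.train(params, dtrain, num_boost_round=200)

# Score a handful of candidate artist/template combinations and rank them by predicted share.
scores = model.predict(xgb.DMatrix(features.head(10)))
print(sorted(enumerate(scores), key=lambda kv: kv[1], reverse=True)[:3])
```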
Once the model was ready, we proved our hypothesis that an ML model would outperform the heuristic model by running an A/B test with the heuristic model set as the control and the ML model set as the treatment in two regions for a duration of three weeks. The results were clear, with the ML model achieving a 4% and 14% cheaper CPR than the heuristic model in the two regions. This was mainly due to the fact that the ads generated from the predicted rankings from the ML model had an 11% to 12% higher click-through rate (CTR) than the heuristic model, since the ML model was trained with richer training data with a higher number of features.
With the test results this clear, the natural choice was for us to productionize the ML model to take care of content ranking across all active regions where we ran marketing campaigns.
The solution we productionized is an end-to-end architecture covering the full loop described above: ingesting performance data, ranking content, generating creatives, and deploying ads.
Technical challenges & takeaways
Throughout our collective effort on this project, we found successes and difficulties. Here are some of our most noteworthy challenges and takeaways:
Challenge 1: Orchestrating the generation of assets.
Moving from Java templates to After Effects turned asset generation from something that could be done inline in an API call to something that needed rendering asynchronously. Scaling the render workers up and down in response to the volume of assets to be generated was also a challenge.
Takeaway 1: Changes affecting the architecture can and will happen.
Systems are initially designed with certain assumptions based on the information we have at the time. When new information inevitably comes along, it’s a good idea to assess how to incorporate the new requirements into the design as nonintrusively as possible and to move incrementally toward that. Systems built with modular components can make things easier, though it will never be possible to predict all the eventualities that can put pressure on the design.
Challenge 2: Dependency on ad platform APIs for ingesting ad performance metrics.
To feed the ML content ranking model with good-quality training data, the data pipelines have to fetch ad performance metrics from platform APIs on a daily basis. If there’s an issue retrieving data from these platform APIs, our ranking flows might break, or ultimately suggest content that is not ideal.
For example, in a couple of instances, a Facebook API outage caused disruptions in our data pipelines, which resulted in the ML model not being able to train and churn out content rankings until Facebook’s marketing API was back to normal.
Takeaway 2: Always find a backup solution when external dependencies break.
It’s important to have a backup solution in case of an unforeseen situation, especially when an external dependency is involved. As a workaround for such a scenario, we decided to keep the previous day’s artist rankings in place, since the ML model would not have insight into the latest day’s ad performance to update the rankings with.
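A sketch of that fallback behavior, using hypothetical helpers in place of the real pipeline steps and storage layer: if the latest metrics can’t be fetched, the pipeline re-publishes the previous day’s rankings rather than producing none at all.

```python
from datetime import date, timedelta

# Hypothetical helpers standing in for the real pipeline steps and storage layer.
def fetch_ad_metrics(day: date) -> dict:
    raise ConnectionError("ad platform API outage")   # simulate the outage case

def rank_with_model(metrics: dict) -> list[str]:
    return ["a1", "a2", "a3"]

def load_rankings(day: date) -> list[str]:
    return ["a1", "a3", "a2"]                          # the previously published rankings

def publish_rankings(day: date, rankings: list[str]) -> None:
    print(f"{day}: {rankings}")

def daily_ranking(today: date) -> None:
    try:
        metrics = fetch_ad_metrics(today)
        rankings = rank_with_model(metrics)
    except ConnectionError:
        # External dependency broke: carry forward the previous day's rankings unchanged.
        rankings = load_rankings(today - timedelta(days=1))
    publish_rankings(today, rankings)

daily_ranking(date.today())
```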
Challenge 3: Identifier for Advertisers (IDFA) implications from iOS version 14.5 onward.
In the summer of 2021, Apple’s IDFA changes took effect, changing the landscape of the ad tech industry forever in terms of which data points advertisers can collect and use. For us, this meant we could no longer rely on getting user-level/log-level ad performance data to optimize the campaigns. But because our ML model trained on data aggregated over a lookback window, the change did not affect us adversely.
Takeaway 3: Anticipate changes in the industry and assess the system accordingly.
The moment we were made aware that the IDFA changes would be activated for iOS version 14.5 onward, we evaluated our ML model output via offline analyses for any negative impact on model performance. It turned out in our favor: the model performance would not be impacted negatively. So it’s always a good idea to set up ways to assess the system in case of disruptions and to think about other solutions.
Challenge 4: MMP migration from Adjust to Branch.
Spotify decided to migrate from Adjust to Branch as its preferred deep linking and attribution partner, as described in detail in this Spotify Engineering blog post. This meant updating all our data pipelines and our ML model to consume Branch-powered ad metrics and then calibrating the content ranking system in the best possible manner.
Takeaway 4: Design a system that allows for updates on the fly and ensures the same performance.
We carved out time to verify that consuming ad performance attribution data from Branch instead of Adjust did not have performance implications for the ML ranking model, and we came up with detailed technical specifications for the tasks involved. Once we confirmed with our stakeholders (Performance Marketing and Marketing Analytics) that it was OK to cut over from Adjust to Branch and that the ad metrics were flowing correctly from Branch, we made the necessary changes in our data pipelines and ML model to complete the migration. As expected, the performance of the ML model remained as strong as before.
Challenge 5: Incorporating diversity of artists in the ML model.
It turned out that naive ways to encode the diversity of the group of artists in the campaign into our supervised learning algorithm did not help — but this problem, known as slate recommendation, is difficult. It’s a very interesting challenge with broad application, and we’re always looking for new problem solvers!
Takeaway 5: In ML, being able to iterate from simpler to more complex is a superpower.
We no longer live in a world where the teams that write ML algorithms can wait until the core functionality of the platforms and the many systems around them stabilizes. Backends, pipelines, processes, and even policies of entire industries (cue IDFA) evolve and change. At the same time, the most impactful systems still need to react to data. So creating ML systems is a ballet, sometimes requiring quiet heuristic methods with few moving parts, other times bringing in the whole orchestra of cross-platform tech, like artist embedding vectors. The flow that supports this range of development is critical.
Conclusion & acknowledgements
Our journey to building a complex automated content management system started with a tiny hypothesis: “We can make Spotify’s Performance Marketing more efficient by leveraging the power of engineering and content.” This product was particularly challenging, requiring cross-functional efforts between Engineering and Marketing to solve complex, real-world problems. In the end, we developed an end-to-end automated system that could generate content ads and optimize them continuously. We are very proud to say that we are among only a handful of tech companies that have fully automated the Performance Marketing cycle globally. We thank all the team members who worked hard to bring this product to the world.