Fixed-Power Designs: It’s Not IF You Peek, It’s WHAT You Peek at
TL;DR Sometimes we can't reliably estimate the sample size required to power an experiment before starting it. To alleviate this problem, we could run a sequential test or an A/A test. However, sequential tests are typically less sensitive and introduce bias into the treatment effect estimator, while A/A tests prolong the experiment and still don't guarantee an accurate sample size calculation. In this blog post, we present highlights from our recent paper (Nordin and Schultzberg, 2024), where we introduce an alternative that we call the "fixed-power design." In a fixed-power design, you start the experiment without an estimated sample size, estimate the required sample size from the outcome data currently available in the experiment, and stop when your current sample size is larger than the required sample size. We show that fixed-power designs can be analyzed using nonsequential methods without any corrections: the point estimator is consistent, and the treatment effect confidence interval has asymptotically nominal coverage. Not all forms of peeking inflate the false positive rate of fixed-sample inference.
Introduction
There are many reasons why companies use online experiments, for example:
- To identify the best version of a product
- To quantify the impact of a product change
- To detect regressions before a bug reaches all users
Online experiments let you do all these things while managing the risk of making the wrong decisions.
At Spotify, some of the goals of experimentation are to learn what works and how well it works, and to stop the things that don't work early. However, the extent to which we can achieve these goals depends on the experimental design and analysis. For example, certain designs promote early stopping but pay a price in terms of power, reducing the overall chance of finding true effects. Other designs instead focus on maximizing power, but at the expense of prolonged runtime, because they don't allow early stopping. In the next section, we dig into the most common designs for A/B tests, discuss the limitations of common approaches, and introduce a new design that mitigates some of those limitations.
Sequential experimental designs: more than just sequential testing
Two of the most fundamental concerns of experimental design are when to stop the experiment and when to analyze the results. Experimental designs can be roughly split into two categories: fixed-sample designs and sequential designs.
Fixed-sample designs
In fixed-sample designs, the experimenter leverages power analyses, also known as sample size calculations, to set a predetermined sample size. The power analysis produces a required sample size that the experiment should meet. If it does, the comparison will have high enough precision to limit the risk of missing effects of a certain size of interest. The experiment collects data until the predecided sample size is met, at which point the statistical analysis is performed.
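As a concrete illustration, here's a minimal Python sketch of such a pre-experiment power analysis for a two-sided, two-sample comparison of means, using the standard normal-approximation formula. The inputs (a standard deviation of 10 taken from historical data and a minimal detectable effect of 0.5) are hypothetical planning values, not Spotify defaults.

```python
from scipy.stats import norm

def required_sample_size(sigma, mde, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided, two-sample z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value of the test
    z_beta = norm.ppf(power)           # quantile matching the desired power
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2

# Hypothetical planning inputs: sd estimated from historical data,
# and the smallest effect we care about detecting.
print(required_sample_size(sigma=10.0, mde=0.5))  # roughly 6,280 per group
```

In a fixed-sample design, this number is computed once, before the experiment starts, and the experiment runs until it's reached.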
Sequential designs
In sequential designs, the sample size isn't predetermined.1 In principle, a sequential design is one where users enter the experiment sequentially and the experiment stops according to a rule based on the available data. The stopping rule is evaluated repeatedly during the experiment and therefore determines the sample size only indirectly. The most common sequential design, often simply called "sequential testing," stops once the test detects a significant result.
In the context of online experimentation, many recommend using sequential tests to detect regressions but recommend against using sequential tests for the shipping decision, due to power and bias concerns with sequential tests. See, for example, Fan et al. (2004) and our previous blog post on comparing sequential testing methods.
Using a hybrid design
In practice, many companies in fact use a hybrid design. That is, the treatment effect is estimated and evaluated using statistical tests derived for fixed-sample designs. However, the design is sequential, because the experiment runs until the current sample size exceeds the estimated required sample size. The estimated required sample size, in turn, uses an estimate of the variance derived from the data collected so far in the experiment. In Nordin and Schultzberg (2024), we call this a fixed-power design: you sample new users until the power criterion, according to the currently available data, is met.
The graph above shows an example of how a fixed-power design can play out. To keep the graph easy to read, the sample size is kept small. In this case, the experiment stops very close to the true required sample size. In large samples (that is, when the experiment is powered for small effects), the required sample size estimator becomes precise, and the region in which the current sample size crosses the estimated required sample size is often quite narrow.
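To make the stopping rule concrete, here's a minimal simulation sketch of a fixed-power design, assuming normally distributed outcomes and reusing the normal-approximation sample size formula from above. All names and parameters are illustrative choices, not the paper's notation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
alpha, power, mde = 0.05, 0.8, 0.5  # illustrative design parameters
z_sum = norm.ppf(1 - alpha / 2) + norm.ppf(power)

control, treatment = [], []
batch = 100    # new users arriving per group between checks
burn_in = 5    # collect a minimum amount of data before the first check

for check in range(1, 10_000):
    control.extend(rng.normal(0.0, 10.0, batch))
    treatment.extend(rng.normal(0.5, 10.0, batch))
    if check < burn_in:
        continue
    # Peek only at the variance: estimate the current required sample size.
    var_hat = (np.var(control, ddof=1) + np.var(treatment, ddof=1)) / 2
    n_required = 2 * z_sum ** 2 * var_hat / mde ** 2
    if len(control) >= n_required:  # power criterion met -> stop
        break

# Analyze with the standard fixed-sample estimator and CI, no corrections.
diff = np.mean(treatment) - np.mean(control)
se = np.sqrt(2 * var_hat / len(control))
z_crit = norm.ppf(1 - alpha / 2)
print(len(control), diff, (diff - z_crit * se, diff + z_crit * se))
```

Note that the stopping rule never looks at the estimated treatment effect or its significance, only at the sample variance.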
Fixed-power designs: a summary of Nordin and Schultzberg (2024)
In Nordin and Schultzberg (2024), we investigate the properties of the difference-in-means average treatment effect estimator in sequential designs where the stopping rule is based on the precision of the treatment effect estimator. We express the precision in two ways: as the confidence interval (CI) width, and as the current required sample size for a given hypothetical treatment effect. As shown in our paper, stopping based on the required sample size is equivalent to stopping based on the confidence interval width, as both are transformations of the variance of the treatment effect estimator.
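To see why the two stopping rules coincide, consider the standard normal-approximation formulas for a two-sample comparison of means (a sketch in our own notation, not the paper's). With a hypothetical effect $\delta$, significance level $\alpha$, power $1-\beta$, pooled variance estimate $\hat{\sigma}^2$, and $n$ users per group,

$$
\hat{n}_{\text{req}} = \frac{2\hat{\sigma}^2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2},
\qquad
\text{CI half-width} = z_{1-\alpha/2}\sqrt{\frac{2\hat{\sigma}^2}{n}},
$$

so

$$
n \ge \hat{n}_{\text{req}}
\quad\Longleftrightarrow\quad
z_{1-\alpha/2}\sqrt{\frac{2\hat{\sigma}^2}{n}} \;\le\; \frac{z_{1-\alpha/2}}{z_{1-\alpha/2} + z_{1-\beta}}\,\delta .
$$

That is, stopping once the current sample size exceeds the estimated required sample size is the same as stopping once the CI half-width falls below a fixed threshold.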
These kinds of stopping rules based on precision are common in practice, with some experimentation vendors even selling them. However, to the best of our knowledge, the statistical implications of precision-based stopping rules haven’t been rigorously investigated.
The fixed-power design can make your peeking-problem alarm bells go off. It is, after all, a stopping rule that uses outcome data to decide whether to stop. Stopping rules based on outcome data are the reason we use sequential testing in the first place, so it's not unreasonable to expect corrections to be required if you stop based on the required sample size, too. However, in Nordin and Schultzberg (2024), we show that not all stopping rules based on outcome data are equally problematic for statistical inference. Our research shows that stopping based on functions of the sample variance is much less problematic than stopping based on, for example, significance.
What aspects of the outcome data we peek at determine what effects, if any, peeking has on inference about the estimand we're interested in. In our paper, we show that under a fixed-power design, the following hold:
- The difference-in-means estimator consistently estimates the average treatment effect.
- The fixed-sample confidence interval for the average treatment effect has asymptotically correct coverage.
This means that in large samples, we can use standard inference even when we stop based on the current estimated required sample size. No further adjustments are necessary to guarantee correct inference.
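As an illustrative sanity check of this large-sample claim (not a substitute for the formal results in the paper), a small Monte Carlo simulation under assumed normal outcomes and illustrative parameters estimates the empirical coverage of the standard fixed-sample interval under fixed-power stopping:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
alpha, power, mde = 0.05, 0.8, 0.5   # illustrative design parameters
effect, sd = 0.5, 2.0                # assumed true effect and outcome sd
z_sum = norm.ppf(1 - alpha / 2) + norm.ppf(power)
z_crit = norm.ppf(1 - alpha / 2)

covered, reps = 0, 2000
for _ in range(reps):
    c = list(rng.normal(0.0, sd, 200))
    t = list(rng.normal(effect, sd, 200))
    while True:
        var_hat = (np.var(c, ddof=1) + np.var(t, ddof=1)) / 2
        if len(c) >= 2 * z_sum ** 2 * var_hat / mde ** 2:
            break  # fixed-power stopping rule met
        c.extend(rng.normal(0.0, sd, 50))
        t.extend(rng.normal(effect, sd, 50))
    diff = np.mean(t) - np.mean(c)
    half_width = z_crit * np.sqrt(2 * var_hat / len(c))
    covered += (diff - half_width <= effect <= diff + half_width)

print(covered / reps)  # should land close to the nominal 0.95
```

In this toy setup, the fraction of intervals covering the true effect lands close to the nominal level, even though the stopping time depends on the observed data.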
In the paper, we also propose conservative finite-sample versions of the fixed-power design and the fixed-width confidence interval design.
Pre-experiment sample size calculation is hard
Fixed-power designs let us peek at the required sample size without adjusting inference. Why is this important? Can’t we use historical data for a power analysis to determine the required sample size?
Estimating the required sample size during experiments as a complement to pre-experiment power analyses is often necessary, as historical data can fail to accurately describe the outcome distributions. At Spotify, for example, the diverse and ever-changing user base, especially in new markets and with new-user experiments, makes historical comparisons unreliable. Adding to that, historical data won’t reflect treatment effects since new variants haven’t been tested, and assuming homogeneous treatment effects across a diverse customer base is unrealistic. Users with different listening habits will likely respond differently to the same feature changes.
Leveraging observed outcomes during experiments can enhance sample size accuracy and inform the experimenter early on if their initial planning is accurate. With the guarantees of fixed-power designs, we can plan according to a required sample size calculated before the experiment, revise it during the experiment, and finally stop at the right time — all while relying on standard fixed-sample inference.
Sequential testing versus fixed-sample testing
Sequential tests give valid inference under any stopping rule, so why not just rely on sequential tests and peek at the required sample size as much as we want?2
As has been discussed in many places (Larsen et al., 2024; our previous blog post), there are two main reasons:
- Unbiased point estimators. Sequential tests that stop at significance yield biased estimators that overestimate the effect size. Moreover, the idea of stopping at the first significant result stands in stark contrast to the common advice not to trust underpowered experiments.
- Power. In most situations, experiments must run for at least a given period. This could, for example, be to observe users in the experiment for long enough to rule out novelty effects. Another reason, common at Spotify, is to avoid issues with day-of-week seasonality. Using sequential testing in situations where we don't intend to stop at the first significant result is a waste of power. If stopping is prohibited during a large part of the experiment, sequential tests that don't take this into account will be highly conservative.
With the fixed-power design, we get the benefits from fixed-sample designs, but with the added ability to inform the stopping based on a continuous power analysis of the experiment.
| Sequential design | Traditional fixed-sample design | Fixed-power design (Nordin and Schultzberg, 2024) |
| --- | --- | --- |
| Sequential tests allow early stopping in experiments, with a stopping rule based on significance or any other function of the data. | Fixed-sample tests require the sample size to be fixed ahead of time. | Fixed-power designs estimate the current required sample size from outcome data during the experiment. |
| Sequential tests bound false positive rates and coverage at least at the intended level under early stopping. | To achieve a certain precision, you need to estimate the variance of the outcome(s) from historical data before the experiment starts in order to plan the sample size. | Fixed-power designs stop when the current sample size is larger than the estimated required sample size. |
| Sequential tests are conservative if you always want to run the experiment until you reach a certain precision, because they adjust for early stopping on significance (even if you don't use it). | The difference-in-means estimator is unbiased, and the standard fixed-sample CI has the right coverage. | Under a fixed-power design, the standard difference-in-means estimator is consistent, and the fixed-sample CI has asymptotically nominal coverage. |
| Sequential tests (with early stopping) give biased difference-in-means estimators. | | |
Summary
In the ever-evolving landscape of online experiments, determining the optimal time to stop an experiment remains a substantial challenge. Traditional methods, such as fixed-sample and sequential test designs, each have limitations. Fixed-sample designs predetermine the sample size but don’t allow adjustments based on incoming data, while sequential test designs can adjust but may affect the power and bias of the results.
In our recent paper, Nordin and Schultzberg (2024), we introduce an innovative approach called the "fixed-power design." This method lets an experiment start without a predefined sample size to reach. Instead, the required sample size is estimated from ongoing outcome data, and the experiment concludes when the current sample size surpasses this estimate. Crucially, this design supports standard nonsequential inference: the point estimator is consistent, and the confidence interval maintains asymptotically nominal coverage. This means that the fixed-power design allows sequential stopping without losing the power benefits of nonsequential tests.
This design is particularly advantageous in environments like Spotify, where the user base is diverse and constantly changing. Traditional pre-experiment calculations based on historical data often fall short because they don’t account for the variability in treatment effects across different user segments or for new user experiences.
The fixed-power design strikes a practical balance between the rigidity of fixed-sample designs and the flexibility of sequential test designs, supporting reliable decision-making in product development. At the same time, many challenges remain. Although the fixed-power design makes it possible to adjust the required sample size in real time, not knowing the required sample size at the planning stage is still problematic. At Spotify, where we run tens of thousands of experiments, there are always limits to how large an experiment a team can run. If it's detected during the experiment that the required sample size is much larger than the team had anticipated, it's not always possible to run the experiment longer or increase the proportion of the population it targets, because of other conflicting experiments. In this situation, fixed-power designs offer a way to know early in an experiment whether the data is in line with the pre-experiment power analysis.
Acknowledgments: This blog post is based on a paper written in collaboration with Mattias Nordin, Department of Statistics, Uppsala University, Sweden.
Get access to Spotify's decision engine via Confidence. In Confidence, you can always access the current required sample size and the current powered effect while an experiment is live. This means that you can use a fixed-power design by simply starting the experiment using a fixed-sample design. As usual, we ensure that the statistics are all in order for any results we show you, so you can focus on building a great product.
Want to learn more about Confidence? Check out the Confidence blog for more posts on Confidence and its functionality. Confidence is currently available in private beta. If you haven’t signed up already, sign up today, and we’ll be in touch.
1 In some types of sequential tests, the maximum sample size needs to be predecided, but not the stopping sample size.
2 At least so-called always-valid sequential tests.
Tags: Data, experimentation