What are the principles that make an experimentation system sound? We’ll discuss in a future blog post how to make the experimentation-analysis pipeline correct by construction, and what correctness means in this context. First, however, we need to look at the set of basic assumptions that allow us to reason about the interactions between concurrently running experiments, variables or parameters not represented in a particular experimental procedure, and the parameters of interest. The underlying experimentation system must reflect these assumptions; if it does not, then all of our analyses may be invalid.

Many institutions have invested in experimentation systems over the years. Some of these experimentation systems are vertically integrated, stand-alone applications. Some cordon off a critical part of the experimentation-analysis pipeline and interface with modules used by other systems. Here’s a quick sample of available technologies (dear reader, if you can suggest more, please do so below, in the comments).

System	Year	Venue	Institution	Purpose
Gatekeeper	2015	SOSP	Facebook, Inc.	Configuration Management
PlanOut	2014	WWW	Facebook, Inc.	Experiment Scripting Language
Trebuchet	2014	Unpublished	Airbnb, Inc.	Experiment Management
3X	2013	Unpublished	Stanford Univ.	Experiment Management
Feature	2013	Unpublished	Etsy	Experiment Management
psiTurk	2012	Unpublished	New York Univ.	Experiment Management
TurkServer	2012	HCOMP	Harvard Univ.	Experiment Management
Google Analytics	2012?	Unpublished	Google	A/B tests
Layers	2010	KDD	Google, Inc.; Cloudera, Inc.	Experiment Management
Experimentation Platform	2006+	KDD	Microsoft	Unclear
Unnamed System	2004	Unpublished	Amazon.com, Inc.	Automated A/B tests

Here we describe the issues these systems have uncovered and attempt to state the necessary and sufficient assumptions an experimentation system must satisfy for the kinds of analyses we are proposing to be valid. Any analysis we do statically is based upon a model of the dynamic part of the system. Randomization may be very sensitive to incorrect modelling of the dynamic part of the system, especially when it involves human behavior. We discussed some of these issues in a prior post about the basic requirements for crowdsourcing systems.

At the very start of the experimentation-analysis pipeline, we must have a reliable way to sample the population. The PlanOut paper describes a system at Facebook that allocates “segments” to experiments. Segments appear to be constant-sized groups of userids. If we were operating over a much smaller population, say on my new business’s website, then we might allocate users as they arrive. However, in the face of Internet-scale experimentation, we would want to allocate users in chunks. In a lot of ways, the segmentation logic sounds a bit like memory allocation, where the experimenter “frees” a segment by ending the experiment. The Facebook implementation sounds like it’s done mostly over accounts, but you could imagine a general system that allocates segments according to whatever ids are relevant (and for all I know, they could have multiple segment allocators).
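
To make the memory-allocation analogy concrete, here is a minimal sketch of a segment allocator. It is not the Facebook implementation: the class name `SegmentAllocator`, its methods, and the choice of SHA-1 hashing are all assumptions made for illustration.

```python
import hashlib


class SegmentAllocator:
    """Toy allocator (hypothetical design): user ids hash into a fixed
    number of segments, and whole segments are handed to experiments."""

    def __init__(self, num_segments):
        self.num_segments = num_segments
        self.free = set(range(num_segments))  # segments not in any experiment
        self.owner = {}                       # segment -> experiment name

    def segment_of(self, user_id):
        # Deterministic hash, so a user always lands in the same segment.
        digest = hashlib.sha1(str(user_id).encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_segments

    def allocate(self, experiment, count):
        # Takes the lowest-numbered free segments for simplicity; a real
        # system would more plausibly pick free segments at random.
        if count > len(self.free):
            raise ValueError("not enough free segments")
        chosen = sorted(self.free)[:count]
        for seg in chosen:
            self.free.remove(seg)
            self.owner[seg] = experiment
        return chosen

    def release(self, experiment):
        # Ending an experiment "frees" its segments, much like free().
        for seg in [s for s, e in self.owner.items() if e == experiment]:
            del self.owner[seg]
            self.free.add(seg)

    def experiment_for(self, user_id):
        # Returns None when the user's segment is not in any experiment.
        return self.owner.get(self.segment_of(user_id))
```

Because a segment belongs to at most one experiment at a time in this sketch, two experiments never share users; swapping the id being hashed (account, cookie, device) gives the "whatever ids are relevant" generalization.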

However, each company/institution/service may institute limitations on sampling policies for their products, since bad experiences have serious consequences for the reputation of the company/institution/service. For example, Netflix treats existing and new users as different pools for their experimentation system. Airbnb must account for non-users and unconventional workflow.

On its surface, controlled experimentation is easy: once we have selected our participants, we just randomly allocate them to variants. However, this simple act of allocation becomes much more thorny when we consider the complexity of users interacting with an online system. Before we can even reason about what might happen when “things go wrong,” we must first consider what it means for “things to go right.”

The Google Layers paper gives an informal specification for a massive experimentation system requiring both a large number of participants and a large number of concurrent experiments. In a perfect world, all parameters could be varied independently; this would allow many experimenters to run their experiments concurrently, so long as the parameters of interest do not overlap. Formally, suppose experimenter \(A\) is doing an experiment using some set of parameters whose names are represented by \(\lbrace A_1, \ldots, A_n\rbrace\) and experimenter \(B\) is doing an experiment at the same time using some set of parameters whose names are represented by \(\lbrace B_1, \ldots, B_m\rbrace\). So long as the two sets are disjoint, that is, \(\lbrace A_1, \ldots, A_n\rbrace \cap \lbrace B_1, \ldots, B_m\rbrace = \emptyset\), the two experimenters may run both of their experiments at the same time, on the same population.
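
The ideal-world condition amounts to a disjointness check on parameter names. A minimal sketch, where the function name and the example parameter names are hypothetical:

```python
def can_run_concurrently(params_a, params_b):
    """True when the two experiments share no parameter names."""
    return set(params_a).isdisjoint(params_b)


# Hypothetical parameter names for two experimenters.
experiment_a = {"button_text_color", "button_label"}
experiment_b = {"ranking_algorithm", "page_size"}
compatible = can_run_concurrently(experiment_a, experiment_b)
```

This check only captures name overlap; it says nothing about parameters that have different names but still interact.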

However, since we don’t live in an ideal world, we need to deal with the fact that many times, parameters of interest may not be varied independently. For example, if \(A_i\) is text-color for buttons and \(B_j\) is background-color for buttons, we will have a disastrous situation if we allow text-color and background-color to be set to the same value. The quick-and-dirty way of handling this is to just ensure that this never happens. The smarter way is what the Layers paper suggests:

The solution we propose in this paper is to partition the parameters into subsets, and each subset contains parameters that cannot be varied independently of each other. A subset is associated with a layer that contains experiments, and traffic diversion into experiments in different layers is orthogonal.
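
One way to read this quote is as two steps: group dependent parameters into layers, then divert traffic in each layer with an independent hash. The sketch below is an interpretation, not the paper's implementation. Grouping by union-find, the function names `build_layers` and `bucket`, and salting a SHA-1 hash with the layer id are all assumptions.

```python
import hashlib


def build_layers(parameters, dependent_pairs):
    """Group parameters so that any two that cannot be varied
    independently end up in the same layer (union-find)."""
    parent = {p: p for p in parameters}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in dependent_pairs:
        parent[find(a)] = find(b)

    layers = {}
    for p in parameters:
        layers.setdefault(find(p), set()).add(p)
    return list(layers.values())


def bucket(user_id, layer_id, num_buckets=1000):
    """Salting the hash with the layer id makes a user's bucket in one
    layer effectively unrelated to their bucket in another layer."""
    key = f"{layer_id}:{user_id}".encode("utf-8")
    return int(hashlib.sha1(key).hexdigest(), 16) % num_buckets


layers = build_layers(
    ["text_color", "background_color", "font_size"],
    [("text_color", "background_color")],
)
```

Here `text_color` and `background_color` share a layer, so a single experiment controls both and can rule out identical values, while `font_size` sits in its own layer and can be varied by someone else on the same users.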

The vast majority of experimentation systems run A/B tests on a target population. This means no conditional random assignment, no complex routing through external services, and in most cases, randomization on userids alone. Under these conditions, we can almost always analyze the experiment variants using the average treatment effect.
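
For a simple two-variant A/B test, the usual estimate of the average treatment effect is the difference in mean outcomes between the variants. A minimal sketch, assuming outcomes arrive as `(variant, y)` pairs with the hypothetical labels `"treatment"` and `"control"`:

```python
from statistics import mean


def average_treatment_effect(outcomes):
    """Difference-in-means estimate of the average treatment effect.

    `outcomes` is an iterable of (variant, y) pairs; the labels
    "treatment" and "control" are assumptions of this sketch.
    """
    outcomes = list(outcomes)  # allow generators to be scanned twice
    treated = [y for variant, y in outcomes if variant == "treatment"]
    control = [y for variant, y in outcomes if variant == "control"]
    return mean(treated) - mean(control)
```

This estimate is only meaningful when assignment really is unconditional and random, which is exactly the set of conditions described above.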

Future Work

Linking analyses properly. One of the things we’d like to be able to do in the experimentation-analysis pipeline is to formally hook up the generated contrasts with the variable of interest \(Y\) during the analysis phase. The Google Layers authors bring up an important point about analysis and metrics:

Standardized metrics should be easily available for all experiments so that experiment comparisons are fair: two experimenters should use the same filters to remove robot traffic when calculating a metric such as [click-through rate].

We are agnostic about what \(Y\) is when we analyze PlanOut scripts. However, we may want a library of \(Y\)s for users to choose from. That is, instead of allowing users to build their query for \(Y\) using raw user data, such as a database query of results, they would need to either use a predefined metric or implement a new one with a standard name.
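
A library of \(Y\)s could be as simple as a registry keyed by standard names, with shared filters applied inside each metric. This is a sketch of the idea only; the `METRICS` registry, the `metric` decorator, and the event fields `type` and `is_robot` are hypothetical.

```python
METRICS = {}  # standard metric name -> function computing it


def metric(name):
    """Register a metric under a standard name; duplicates are rejected."""
    def register(fn):
        if name in METRICS:
            raise ValueError(f"metric {name!r} is already defined")
        METRICS[name] = fn
        return fn
    return register


def is_human(event):
    # One shared robot filter, so every metric removes the same traffic.
    return not event.get("is_robot", False)


@metric("click_through_rate")
def click_through_rate(events):
    """Human clicks divided by human impressions (0.0 if no impressions)."""
    views = [e for e in events if is_human(e) and e["type"] == "impression"]
    clicks = [e for e in events if is_human(e) and e["type"] == "click"]
    return len(clicks) / len(views) if views else 0.0
```

An analysis would then look up `METRICS["click_through_rate"]` by name rather than writing its own query, so two experimenters comparing results are guaranteed to have used the same filters.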
