The Potential Outcomes Framework (aka the Neyman-Rubin Causal Model) is arguably the most widely used framework for causal inference in the social sciences. This post gives an accessible introduction to the framework’s key elements — interventions, potential outcomes, estimands, assignment mechanisms, and estimators.
An informal first look: a game and a story
A game with boxes
Suppose we play the following game. I put two boxes in front of you, one labelled \(0\) and the other labelled \(1\). Each box contains a slip of paper on which I’ve written some number. Denote by \(y(0)\) the number in the box labelled \(0\) and \(y(1)\) the number in the box labelled \(1\), both of which are unknown to you. Your goal is to guess the difference between the number in box \(1\) and the number in box \(0\), denoted:
\[ \tau = y(1) - y(0). \]
The game is played as follows: you must choose a box and open it to read the number it contains. The trick, however, is that the boxes are rigged so that the moment you open one of the boxes, the other box self-destructs so you can never know the number it contained.
In general, a variant of the game is played in which \(n\) pairs of boxes are arranged in \(n\) rows indexed \(i = 1, \ldots, n\). Each box is sealed and contains a slip of paper with a number. Denote by \(y_i(0)\) the number in the box labelled \(0\) in the \(i^{th}\) row, and \(y_i(1)\) the number in the box labelled \(1\) in the same row. This time, the goal is to guess the average difference:
\[ \tau = \frac{1}{n} \sum_{i=1}^n \{ y_i(1) - y_i(0)\}. \]
You get to open a single box in each row — the rule being, as above, that the moment you open one of the boxes in row \(i\) the other box in the same row self-destructs. The name of this game, as you’ve probably guessed, is causal inference. Now let’s look at another way of motivating this framework.
A story with a genie
Guillaume is in a bit of a bind at the moment: he must give a talk in one hour but currently has a throbbing headache. He contemplates taking an aspirin pill, but he is unsure of the effect it would have on his headache. Fortunately, he met a genie a long time ago who promised to grant him a wish, and this seems like a good time to use it. He summons the genie and asks what the effect of taking the pill would be on his headache. After pointing out that he could simply have asked him to cure his headache, the genie tells Guillaume that although he is all-powerful, he cannot answer the question because he doesn’t know what “the effect of the pill” even means! Guillaume scratches his head and starts to babble about causal effects, but the genie interrupts him:
Genie: Look, here’s what we’re going to do. If you were all-powerful like me, describe precisely the quantity that you would like to know.
Guillaume: Ok, if I were all-powerful, here is what I would do. First, I would take the aspirin, wait an hour, and then assess the state of my headache.
Genie: Wait, how would you “assess” the state of your headache?
Guillaume: I have a special scale for rating headaches, from 1 (feeling great) to 10 (feeling awful). You’re all-powerful, so you should know my scale…
Genie: Ah, right. Ok, sure, so you’d get a number between 1 and 10. Then what would you do?
Guillaume: Well, after that I would go back in time to the exact moment I took the aspirin pill, but this time I would not take it. I would then wait an hour and rate the state of my headache: this would give me another number. What I call a causal effect is the difference between these two numbers.
Genie: I see. Wait a sec. Done — the answer is -4, which means the aspirin would help…
Guillaume: Got it, thank you for your help!
Potential outcomes in a nutshell
Hopefully, the connection between (the first version of) the game and the story is clear. The number written in the box labelled \(1\) corresponds to the strength of the headache should Guillaume take the pill, and the number written in the box labelled \(0\) corresponds to the strength of the headache should he not take the pill. So we can write \(y(1)\) for the strength of Guillaume’s headache (an hour from now) should he take the pill, and \(y(0)\) for the strength of his headache should he not take the pill. The causal effect of the pill is then defined as \(\tau = y(1) - y(0)\).
The box metaphor crystallizes the key intuition behind the potential outcomes framework. Before Guillaume decides which action to take (pill or no pill), \(y(1)\) and \(y(0)\) exist as potentialities: they are the potential states of the headache (one hour from now) if he takes the pill (resp. doesn’t take the pill). Like the numbers in the boxes, they are well-defined and exist prior to Guillaume’s action: they are the “potential outcomes” of his action. Like the numbers in the boxes, they are both unknown prior to the action. Like the numbers in the boxes, one of the two will be revealed depending on which action is taken, while the other will never be knowable.
The quantities \(y(1)\) and \(y(0)\) are known as the potential outcomes. If Guillaume takes the pill, then he observes \(y(1)\), and \(y(0)\) becomes counterfactual; if he does not take the pill, he observes \(y(0)\) and \(y(1)\) becomes counterfactual. The fundamental problem of causal inference is that in either case, the counterfactual outcome forever remains unknown.
This, in a nutshell, summarizes the intuition behind potential outcomes. The next section formalizes this intuition, and introduces the remaining components of the framework.
The Rubin Causal Model
Assignments and potential outcomes
Consider a population of \(n\) units indexed \(i = 1, \ldots, n\). Suppose that each unit may receive one of two interventions — for concreteness, let’s call one of them treatment and the other control. For each unit \(i\), denote by \(W_i\) its treatment indicator, where \(W_i = 1\) if unit \(i\) is assigned to treatment and \(W_i = 0\) if unit \(i\) is assigned to control. The vector of treatment indicators is denoted \(\vec{W} = (W_1, \ldots, W_n)\). Since the treatment will be assigned at random, \(\vec{W}\) will be a random vector. We will denote by an uppercase \(\vec{W}\) the random variable, and by a lowercase \(\vec{w} = (w_1, \ldots, w_n)\) a specific realization of the assignment. We then use the notation \(Pr(\vec{W} = \vec{w})\) to denote the probability that the random assignment vector \(\vec{W}\) takes the value \(\vec{w}\). The distribution of \(\vec{W}\), called the assignment mechanism, will be discussed in more detail below.
Note: Throughout, we will stick to the convention that uppercase letters denote random variables, while lowercase letters denote constants or specific realizations of random variables.
In the examples above, we introduced the potential outcomes intuitively, as describing the response of a unit if it were assigned a specific treatment. More formally, for each unit \(i\) consider a function \(y_i: \{0,1\}^n \rightarrow \mathbb{R}\), so that for any assignment \(\vec{w} \in \{0,1\}^n\), \(y_i(\vec{w})\) is the potential outcome of unit \(i\) if unit \(1\) were assigned treatment \(w_1\), unit \(2\) treatment \(w_2\), and so on.
In our aspirin example, there is a single unit \((n=1)\), so the assignment vector \(\vec{w}\) reduces to a scalar \(w_1 \in \{0,1\}\), and the potential outcomes for the single unit \(i=1\) are \(y_1(w_1)\) for \(w_1 \in \{0,1\}\); in this case, the outcomes of unit \(i=1\) depend only on the unit’s own treatment assignment \(w_1\), so there are only two potential outcomes: \(y_1(1)\) and \(y_1(0)\). When there is more than one unit, the outcome of any unit may, a priori, depend on the treatment assigned to any other unit: this is captured by the notation \(y_i(\vec{w})\). In particular, each unit has not two but \(|\{0,1\}^n| = 2^n\) potential outcomes.
The no-interference assumption
Let’s go back, once again, to our aspirin example, but suppose that we have not just a single individual but \(n\) units participating in the drug trial. Following the framework so far, we can write the potential outcomes for unit \(i\) as \(y_i(\vec{w})\) for \(\vec{w} \in \{0,1\}^n\). As stated above, this allows unit \(i\)’s outcome to depend on another unit’s treatment assignment, say unit \(j\)’s. In the context of the drug trial, though, this seems like overkill. Barring very special circumstances (e.g. the units enrolled in the trial know each other), it seems reasonable to assume that the outcome of the trial for unit \(i\) depends only on the treatment assigned to unit \(i\) itself — that is, it should depend on \(\vec{w}\) only through \(w_i\). This assumption is known as the no-interference assumption, and can be formally stated as follows:
Assumption (No Interference): For all \(i=1, \ldots, n\), it holds that: \[ \forall \vec{w}, \vec{w}' \in \{0,1\}^n, \quad w_i = w_i' \quad \Rightarrow \quad y_i(\vec{w}) = y_i(\vec{w}') \] With a slight abuse of notation, we can write \(y_i(\vec{w}) = y_i(w_i)\).
In words, the assumption says that if you are a trial participant, the status of your headache depends only on whether you took the aspirin — not on whether the other trial participants took it. This assumption is reasonable in a broad range of settings, and it simplifies matters considerably. Indeed, under the no-interference assumption, each unit \(i\) has only two potential outcomes: its outcome under treatment, \(y_i(1)\), and its outcome under control, \(y_i(0)\).
The Science
If there is no interference, then all the information about a causal problem is contained in the vectors of treatment and control potential outcomes. Taken together, they form what is sometimes called the Science Table, or just the Science, denoted \(\underline{y} = (\vec{y}(1), \vec{y}(0))\). An example of the Science for 6 units is given below.
i | \(y_i(0)\) | \(y_i(1)\) |
---|---|---|
1 | 2 | 3 |
2 | 1 | 1 |
3 | 2 | 4 |
4 | 0 | 1 |
5 | 3 | 3 |
6 | 0 | 2 |
This table gives, for each unit \(i\), its potential outcome under control, \(y_i(0)\), and its potential outcome under treatment, \(y_i(1)\). For instance, for unit \(i = 4\), we read \(y_4(0) = 0\) and \(y_4(1)=1\).
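To make this concrete, here is a minimal sketch of the Science table above as a pair of NumPy arrays (the names `y0` and `y1` are ours, purely for illustration):

```python
import numpy as np

# The Science table above: row i of the table corresponds to index i-1.
y0 = np.array([2, 1, 2, 0, 3, 0])  # control potential outcomes y_i(0)
y1 = np.array([3, 1, 4, 1, 3, 2])  # treatment potential outcomes y_i(1)
```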
An important fact about the Science is that we can never observe it fully. Indeed, if a unit \(i\) is assigned to treatment (\(w_i=1\)), then we only observe its treatment potential outcome, \(y_i(1)\). If, however, it is assigned to control (\(w_i=0\)), then we only observe its control potential outcome, \(y_i(0)\).
Suppose, for instance, that \(\vec{W} = (1, 1, 0, 1, 0, 0)\); then we would only observe a partial version of the table, with missing elements:
i | \(W_i\) | \(y_i(0)\) | \(y_i(1)\) |
---|---|---|---|
1 | 1 | ? | 3 |
2 | 1 | ? | 1 |
3 | 0 | 2 | ? |
4 | 1 | ? | 1 |
5 | 0 | 3 | ? |
6 | 0 | 0 | ? |
This is what makes causal inference a challenging problem — we will get back to this point below. Still, it is useful to consider the full Science table conceptually because it allows us to rigorously define causal effects.
Causal Estimands
An estimand can be defined as a “quantity we would compute if we were omniscient.” In the current context, being omniscient would mean knowing the entire Science Table \(\underline{y}\), so we can think of an estimand as a function of the Science, say \(\tau(\underline{y})\). The simplest quantity that satisfies this definition is the so-called individual treatment effect (ITE) for unit \(i\), defined as:
\[ \tau_i = y_i(1) - y_i(0), \quad i = 1, \ldots, n. \]
As its name indicates, the ITE \(\tau_i\) is the causal effect of the treatment on unit \(i\): it compares the response of unit \(i\) if it were assigned to receive the treatment to the response of unit \(i\) if it were assigned to receive the control. Since it depends on both the treatment and the control potential outcome of the same unit \(i\), the ITE can never be computed in practice – this is what makes it an estimand, a quantity that we could only compute if we were omniscient.
If we have more than a single unit, we are generally interested in some sort of average effect of the treatment. The Average Treatment Effect (ATE) is defined as:
\[ \tau^{ATE} = \frac{1}{n} \sum_{i=1}^n \{ y_i(1) - y_i(0)\} = \frac{1}{n} \sum_{i=1}^n \tau_i \]
What makes these estimands causal is that they are based on contrasts between potential outcomes.
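As an illustration, if we were omniscient and knew the full Science table from the example above, computing the ITEs and the ATE would take a couple of lines. A minimal sketch, reusing the illustrative `y0` and `y1` arrays:

```python
import numpy as np

# The (never fully observable) Science table from the example above.
y0 = np.array([2, 1, 2, 0, 3, 0])  # y_i(0)
y1 = np.array([3, 1, 4, 1, 3, 2])  # y_i(1)

tau_i = y1 - y0         # individual treatment effects: [1, 0, 2, 1, 0, 2]
tau_ate = tau_i.mean()  # average treatment effect: 1.0
```

For this small example the ATE is 1: on average, treatment raises the outcome by one point.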
Estimating causal effects
In the previous section, we have introduced the key pieces of the Rubin Causal Model:
- an experimental population,
- an intervention,
- the potential outcomes, and
- the causal estimands — that is, the quantities that we would like to learn.
So far, we have remained in the realm of the potential — all the quantities introduced exist prior to the experiment being actually conducted. We have defined our objective — our causal estimand — but have said nothing about how one would actually go about estimating that quantity. To do so, we need to move to the realm of the observed.
Observed Outcomes
An important fact about the Science, which we have alluded to repeatedly, is that we can never observe it fully. Indeed, for each unit \(i\), we only observe the unique potential outcome associated with the treatment to which the unit is assigned. The observed outcome, denoted \(Y_i\), can therefore be written:
\[ Y_i = y_i(W_i) = W_i \,y_i(1) + (1-W_i) \,y_i(0) \]
Since \(W_i\) is a random variable, the observed outcome \(Y_i\) is also a random quantity — hence, we write it in uppercase. We will denote by \(\vec{Y} = y(\vec{W}) = (Y_1, \ldots, Y_n)\) the vector of observed outcomes. Once the experiment has been run, the analyst observes only two quantities: the observed assignment \(\vec{W}\) and the observed outcomes \(\vec{Y}\). For the Science table we displayed in the previous section and the assignment vector \(\vec{W} = (1, 1, 0, 1, 0, 0)\), the observed outcome vector is \(\vec{Y} = (3, 1, 2, 1, 3, 0)\).
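As a sketch, the observed outcome vector in our running example can be computed directly from the formula above (again with the illustrative `y0` and `y1` arrays standing in for the Science):

```python
import numpy as np

y0 = np.array([2, 1, 2, 0, 3, 0])  # y_i(0)
y1 = np.array([3, 1, 4, 1, 3, 2])  # y_i(1)
W = np.array([1, 1, 0, 1, 0, 0])   # realized assignment

# Y_i = W_i * y_i(1) + (1 - W_i) * y_i(0)
Y = W * y1 + (1 - W) * y0          # -> array([3, 1, 2, 1, 3, 0])
```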
Assignment Mechanism
When we introduced the assignment vector \(\vec{W}\), we said that it was a random quantity, but we didn’t say much about its distribution \(Pr(\vec{W})\) other than that it is called the assignment mechanism (or the design).
There are, of course, many possible assignment mechanisms for an experiment on a population of \(n\) individuals — as many, in fact, as there are distributions with support \(\{0,1\}^n\). To keep things simple, we introduce just two of the simplest and most popular such distributions.
Definition (Bernoulli Design): We say that \(\vec{W}\) is assigned according to a Bernoulli design (or Bernoulli assignment mechanism) with parameter \(\pi\) if each unit \(i\) is assigned to treatment independently with probability \(\pi\). That is,
\[ W_i \overset{i.i.d}{\sim} Bernoulli(\pi). \]
Definition (Completely Randomized Design): We say that \(\vec{W}\) is assigned according to a Completely Randomized Design with parameter \(n_1\) if all assignments with exactly \(n_1\) treated units are equally likely.
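Here is a quick sketch of how one might draw an assignment vector from each design with NumPy; the values of \(\pi\) and \(n_1\) are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Bernoulli design: each unit is treated independently with probability pi.
pi = 0.5
W_bernoulli = rng.binomial(1, pi, size=n)

# Completely randomized design: exactly n_1 treated units, with all
# assignments of that form equally likely.
n_1 = 3
W_crd = np.zeros(n, dtype=int)
W_crd[rng.choice(n, size=n_1, replace=False)] = 1
```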
Estimators
A causal estimand, such as \(\tau^{ATE}\), depends on the entire Science table, meaning we can never compute it directly. In practice, we estimate it using the data that we observe: the observed assignment \(\vec{W}\) and the observed outcomes \(\vec{Y}\).
An estimator is a function \(\hat{\tau}\) of the observed data, say \(\hat{\tau}(\vec{W}, \vec{Y})\). It can be thought of as a “data-driven” guess for the estimand of interest; indeed, since it depends only on observed data, an estimator can always be computed. It is only a guess, though, because it uses only the observed portion \(\vec{Y}\) of the Science table \(\underline{y}\). Note that, since it depends on the random assignment vector \(\vec{W}\), the estimator \(\hat{\tau}(\vec{W}, \vec{Y})\) is itself a random variable.
Choosing an appropriate estimator for a given estimand is an interesting topic but is beyond this post’s scope. If the causal estimand of interest is \(\tau^{ATE}(\underline{y})\) as defined above, then a natural estimator for it is the difference in means:
\[ \hat{\tau}^{DiM}(\vec{W},\vec{Y}) = \frac{1}{n_1(\vec{W})} \sum_{i=1}^n W_i Y_i - \frac{1}{n_0(\vec{W})} \sum_{i=1}^n (1-W_i) Y_i \]
where \(n_1(\vec{W}) = \sum_{i=1}^n W_i\) and \(n_0(\vec{W}) = n - n_1(\vec{W})\). For the science table we displayed in the previous section and assignment \(\vec{W} = (1, 1, 0, 1, 0, 0)\), we saw that the observed outcome vector was \(\vec{Y} = (3, 1, 2, 1, 3, 0)\) and therefore the estimate is:
\[ \hat{\tau}^{DiM} = \frac{1}{3} (3 + 1 + 1) - \frac{1}{3}(2 + 3 + 0) = 0 \]
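As a sanity check, here is a small sketch of the difference-in-means estimator applied to the observed data from the example; it reproduces the value of 0 computed above (the function name is ours):

```python
import numpy as np

def difference_in_means(W, Y):
    """Mean outcome of treated units minus mean outcome of control units."""
    W, Y = np.asarray(W), np.asarray(Y)
    return Y[W == 1].mean() - Y[W == 0].mean()

W = np.array([1, 1, 0, 1, 0, 0])
Y = np.array([3, 1, 2, 1, 3, 0])
print(difference_in_means(W, Y))  # 0.0
```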
If the treatment was assigned according to a Completely Randomized Design as defined above, then \(\hat{\tau}^{DiM}\) is, in the precise sense described below, a good estimator.
Randomization-based inference: a primer
So far, we haven’t talked about models — or about randomness, really. That’s because the Rubin Causal Model is largely agnostic about these considerations. The RCM helps us define clearly what causal quantity we are after, and to separate it from what we actually observe.
Once this has been established, we can take a number of paths to estimate causal effects and assess the uncertainty of those estimates. Here we give a primer on one such approach that is particularly natural and helpful in randomized experiments. Specifically, suppose that the treatment \(\vec{W}\) is assigned according to a completely randomized design. As mentioned above, apart from \(Pr(\vec{W})\), which we have just specified, we have not assumed any model. In particular, we have not assumed that the potential outcomes follow any distribution, nor that they are independent draws from some distribution, nor even that they are random. At the same time, we have not precluded the potential outcomes from having been randomly drawn from some distribution. That is the wonderful nature of the potential outcomes framework: it requires no assumptions on the outcomes!
Let’s see how far we can push this idea by considering \(\underline{y}\) to be fixed (and, prior to the experiment, unknown). Notice that the observed outcomes \(\vec{Y}\) are still random, since they depend on \(\vec{W}\), which is itself random. We can then state the following result.
Proposition: If \(\vec{W}\) is assigned according to a completely randomized design, then the difference in means estimator \(\hat{\tau}^{DiM}\) is unbiased for the average treatment effect \(\tau^{ATE}\).
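We won’t prove this proposition here, but our running example is small enough to illustrate it by brute force: under a completely randomized design with \(n_1 = 3\), there are only \(\binom{6}{3} = 20\) possible assignments, all equally likely, so we can enumerate them, compute \(\hat{\tau}^{DiM}\) for each, and average. A sketch, again using the illustrative `y0` and `y1` arrays:

```python
from itertools import combinations

import numpy as np

y0 = np.array([2, 1, 2, 0, 3, 0])  # y_i(0), held fixed
y1 = np.array([3, 1, 4, 1, 3, 2])  # y_i(1), held fixed
n, n_1 = 6, 3

estimates = []
for treated in combinations(range(n), n_1):  # all 20 equally likely assignments
    W = np.zeros(n, dtype=int)
    W[list(treated)] = 1
    Y = W * y1 + (1 - W) * y0                # outcomes observed under this assignment
    estimates.append(Y[W == 1].mean() - Y[W == 0].mean())

print(np.mean(estimates))  # ~1.0 (up to floating point): the true ATE
print(np.mean(y1 - y0))    # 1.0
```

The average of the 20 possible estimates matches the true ATE, which is exactly what unbiasedness means here: the only randomness comes from the design itself.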
Much more can be said about the properties of \(\hat{\tau}^{DiM}\) in completely randomized experiments — future posts will explore these properties in greater detail.
References
There are many fantastic books on the topic of causal inference; we have found the following particularly useful:
Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.