# Causal Inference

Exercise
Suppose that your company's ad spending and revenue are found to have the relationship shown in the figure below. What are some potential explanations for why these values are positively associated?

Solution. Perhaps both revenue and ad spending are associated with a third variable, such as proximity to the holiday season. Or maybe management decides to spend more on ads when they have more revenue. Or maybe more ad spending results in more ad impressions and leads to increased sales.

*Association does not imply causation* is a cautionary mantra, and as such it raises an important question: how can we use statistics to discern causation? There are many applications in business and science where the distinction between association and causation is exactly what we're interested in.

We will develop the counterfactual model for describing causation mathematically. The idea is to model causal relationships using random variables which describe potential outcomes.

For example, suppose you choose to drive rather than take the train to work, and you end up being late. It's natural to wonder: *would I have been late if I'd taken the train?* In your mind, you're pondering two random variables: $C(\mathrm{train})$, the amount of time the trip would have taken if you'd chosen the train, and $C(\mathrm{car})$, the amount of time it was going to take since you drove. You would model both of these as random variables since you don't know their values at the outset of the trip. When your journey is complete, you've been able to observe the value of one of these random variables, but not the other. Given your decision $X$, your observed outcome is $Y = C(X)$.

To simplify, let's let $C(\mathrm{train})$ be 0 if you're on time and 1 if you're late. Similarly, we let $C(\mathrm{car})$ be 0 if you're on time and 1 if you're late. Also, we'll use $C(\mathrm{train})$ and $C(0)$ interchangeably, as well as $C(\mathrm{car})$ and $C(1)$ (in other words, we encode train and car as 0 and 1, respectively).
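This setup is compact enough to express directly in code. Here's a minimal sketch in Julia of one day in the counterfactual model; the lateness probabilities are hypothetical:

```julia
# One day's counterfactual pair: index 1 plays the role of C(0) (train)
# and index 2 plays the role of C(1) (car). Probabilities are made up.
C = (rand() < 0.2 ? 1 : 0,   # C(0): late by train 20% of the time
     rand() < 0.5 ? 1 : 0)   # C(1): late by car 50% of the time
X = rand(0:1)                # the decision: 0 = train, 1 = car
Y = C[X + 1]                 # observed outcome Y = C(X); the other entry is never seen
```

Only `Y` and `X` are ever recorded in practice; the unchosen entry of `C` is the missing counterfactual.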

Exercise
Suppose that the joint distribution of $X$, $Y$, and $(C(0), C(1))$ is the uniform distribution on the rows of a table compatible with the following one:

| $X$ | $Y$ | $C(0)$ | $C(1)$ |
|-----|-----|--------|--------|
| 0   | 0   | 0      | *      |
| 0   | 0   | 0      | *      |
| 0   | 0   | 0      | *      |
| 0   | 0   | 0      | *      |
| 1   | 1   | *      | 1      |
| 1   | 1   | *      | 1      |
| 1   | 1   | *      | 1      |
| 1   | 1   | *      | 1      |

Note that the asterisks indicate counterfactual outcomes which are not observed.

We define the association to be

$$\alpha = \mathbb{E}[Y \mid X = 1] - \mathbb{E}[Y \mid X = 0],$$

and the average causal effect to be

$$\theta = \mathbb{E}[C(1)] - \mathbb{E}[C(0)].$$
Find the association as well as the largest and smallest possible values for the average causal effect. Describe a scenario in which the probability measure giving rise to each of these extreme average causal effect values might be plausible.

Solution.

The association is $\alpha = \mathbb{E}[Y \mid X = 1] - \mathbb{E}[Y \mid X = 0] = 1 - 0 = 1$, while the largest possible value for the average causal effect occurs when the $C(1)$ column is all ones and the $C(0)$ column is all zeros. That gives an average causal effect of $\theta = 1$. The smallest possible value is zero, which occurs if the first four rows have all zeros in the last two columns and the last four rows have all ones.

Interpretation-wise, this makes sense. If every row of the table ends with $C(0) = 0$ and $C(1) = 1$, that means that taking the train always results in our being on time, while taking the car always results in our being late. The value of $X$ in that case definitely has a causal effect. Conversely, if the top half of the table has all zeros in the last two columns and the bottom half has all ones, then on the days we took the train we would have been on time regardless of our mode of transit, and on the days we took the car we would have been late no matter what. So there is no causal effect in that case, and $\theta$ is appropriately equal to 0.
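The bookkeeping in this solution can be checked mechanically. The sketch below encodes the table described above (four train rows with $X = Y = 0$, four car rows with $X = Y = 1$, with `nothing` standing in for the asterisks) and computes the association along with the extreme values of the average causal effect:

```julia
# rows are (X, Y, C0, C1); `nothing` marks an unobserved counterfactual
rows = vcat([(0, 0, 0, nothing) for _ in 1:4],
            [(1, 1, nothing, 1) for _ in 1:4])

# association: E[Y | X = 1] - E[Y | X = 0]
y1 = [r[2] for r in rows if r[1] == 1]
y0 = [r[2] for r in rows if r[1] == 0]
assoc = sum(y1) / length(y1) - sum(y0) / length(y0)

# θ = E[C(1)] - E[C(0)], with the stars filled in the most and least
# favorable ways for the causal effect
fillstar(c, v) = c === nothing ? v : c
θmax = sum(fillstar(r[4], 1) - fillstar(r[3], 0) for r in rows) / length(rows)
θmin = sum(fillstar(r[4], 0) - fillstar(r[3], 1) for r in rows) / length(rows)
# assoc == 1.0, θmax == 1.0, θmin == 0.0
```

The association is pinned down by the observed columns, while the causal effect can be anywhere between 0 and 1 depending on how the stars are filled in.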

The punch line of Problem 2 is still negative: it tells us that the missing counterfactual outcomes can make it impossible to use association to say something about the causal effect. However, this is not always the case:

Exercise
Suppose that you flip a coin every day to determine whether to take the train or the car. In other words, suppose that $X$ is independent of $(C(0), C(1))$. Show that in that case, we have $\alpha = \theta$.

Solution. We have

$$\begin{aligned}
\alpha &= \mathbb{E}[Y \mid X = 1] - \mathbb{E}[Y \mid X = 0] \\
&= \mathbb{E}[C(1) \mid X = 1] - \mathbb{E}[C(0) \mid X = 0] \\
&= \mathbb{E}[C(1)] - \mathbb{E}[C(0)] \\
&= \theta,
\end{aligned}$$

where the second equality uses $Y = C(X)$ and the third uses the independence of $X$ and $(C(0), C(1))$.
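A quick simulation illustrates this identity. In the sketch below (the lateness probabilities are hypothetical), the treatment is a fair coin flip, so $X$ is independent of the counterfactual pair, and the simulated association and causal effect agree:

```julia
n = 10^6
c0 = [rand() < 0.3 ? 1 : 0 for _ in 1:n]       # C(0): P(late | train) = 0.3
c1 = [rand() < 0.6 ? 1 : 0 for _ in 1:n]       # C(1): P(late | car) = 0.6
x  = [rand(0:1) for _ in 1:n]                  # coin-flip treatment
y  = [x[i] == 1 ? c1[i] : c0[i] for i in 1:n]  # observed outcome Y = C(X)

# association, computed from observables only
assoc = sum(y[x .== 1]) / sum(x .== 1) - sum(y[x .== 0]) / sum(x .== 0)
# average causal effect, computed from the full counterfactuals
ace = (sum(c1) - sum(c0)) / n
# both are close to 0.6 - 0.3 = 0.3
```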

A study in which the treatment value is not randomly assigned is called an observational study. Observational studies are subject to confounding from variables such as the weather in the scenario described in Problem 2. In that situation, the weather was associated with both $X$ and $Y$, and that non-independence caused the association $\alpha$ to differ from the average causal effect $\theta$.

However, if $X$ and $(C(0), C(1))$ are independent conditioned on $Z$, and if we record the value of $Z$ as well as $X$ and $Y$ in our study, then we can obtain an unbiased estimate of the causal effect from unbiased estimators of the association: we estimate the association separately within each $Z$ group and then average the results, weighting each group by its probability. The resulting quantity,

$$\sum_z \big(\mathbb{E}[Y \mid X = 1, Z = z] - \mathbb{E}[Y \mid X = 0, Z = z]\big)\,\mathbb{P}(Z = z),$$

is called the adjusted treatment effect.
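Here is a sketch of the adjusted treatment effect in action, with made-up numbers: bad weather makes us both more likely to drive and more likely to be late, but given the weather the treatment is independent of the counterfactual pair. The naive association overstates the causal effect, while the adjusted treatment effect recovers it:

```julia
mean_(v) = sum(v) / length(v)

function simulate(n)
    z  = [rand(0:1) for _ in 1:n]                                  # weather: 1 = good
    x  = [rand() < (z[i] == 1 ? 0.2 : 0.8) ? 1 : 0 for i in 1:n]   # drive more when bad
    c0 = [rand() < (z[i] == 1 ? 0.1 : 0.5) ? 1 : 0 for i in 1:n]   # late by train
    c1 = [rand() < (z[i] == 1 ? 0.2 : 0.7) ? 1 : 0 for i in 1:n]   # late by car
    y  = [x[i] == 1 ? c1[i] : c0[i] for i in 1:n]                  # observed outcome
    x, y, z, c0, c1
end

# average the within-group associations, weighted by each group's frequency
function adjusted_effect(x, y, z)
    adj = 0.0
    for zval in (0, 1)
        idx = z .== zval
        within = mean_(y[idx .& (x .== 1)]) - mean_(y[idx .& (x .== 0)])
        adj += within * mean_(idx)
    end
    adj
end

x, y, z, c0, c1 = simulate(10^6)
assoc = mean_(y[x .== 1]) - mean_(y[x .== 0])   # ≈ 0.42, inflated by confounding
adj   = adjusted_effect(x, y, z)                # ≈ 0.15
ace   = mean_(c1) - mean_(c0)                   # ≈ 0.15, the true causal effect
```

The within-group effects here are $0.1$ (good weather) and $0.2$ (bad weather), each weighted by $1/2$, so the adjusted effect matches the true $\theta = 0.15$ even though the raw association is nearly three times larger.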

Exercise
Suppose that the probability measure on $(X, Y, C(0), C(1), Z)$ is uniform on the rows of the following table ($Z = 1$ means good weather and $Z = 0$ means bad weather).

(a) Compute the association $\alpha$.

(b) Compute the average causal effect $\theta$.

(c) Show that $X$ and $(C(0), C(1))$ are conditionally independent given $Z$, and compute the adjusted treatment effect.

(a) The association is equal to $\alpha = \mathbb{E}[Y \mid X = 1] - \mathbb{E}[Y \mid X = 0]$: the average of the $Y$ column over the rows with $X = 1$, minus the average over the rows with $X = 0$.

(b) The average causal effect is equal to $\theta = \mathbb{E}[C(1)] - \mathbb{E}[C(0)]$: the difference between the averages of the $C(1)$ and $C(0)$ columns.

(c) The conditional distribution of $(C(0), C(1))$ given $X = 0$ and $Z = 0$ places half its probability mass on each of the same two pairs as the conditional distribution of $(C(0), C(1))$ given $X = 1$ and $Z = 0$. So $X$ and $(C(0), C(1))$ are conditionally independent given $Z = 0$. A similar calculation shows that $X$ and $(C(0), C(1))$ are conditionally independent given $Z = 1$. So $X$ and $(C(0), C(1))$ are conditionally independent given $Z$.

The adjusted treatment effect is the average of the within-group association for $Z = 0$ and the within-group association for $Z = 1$, weighted by the proportion of rows with each $Z$ value. It is indeed equal to the average causal effect $\theta$.

## Continuous random variables

Although we've focused on binary random variables, essentially the same analysis carries over to continuous random variables. If $X$ is real-valued, then the counterfactual vector $(C(0), C(1))$ becomes a counterfactual process $\{C(x) : x \in \mathbb{R}\}$, which specifies the outcome that would result from each possible value $x$ of the treatment. As in the binary case, only one value of the random function $C$ is ever seen for a given observation: namely, $Y = C(X)$.

Example
Suppose that $X$ is a $\mathrm{Uniform}([0,10])$ random variable, and that $U$ and $V$ are $\mathrm{Uniform}([0,1])$ and $\mathrm{Uniform}([-5,5])$ random variables (respectively), with $X$, $U$, and $V$ independent. Suppose that $Y = C(X)$, where

$$C(x) = \begin{cases} 5 + U & \text{if } X + V < 5, \\ x + \sin(Ux) & \text{otherwise.} \end{cases}$$

Plot several instances of $x \mapsto C(x)$ over $0 \leq x \leq 10$.

Solution.

```julia
using Plots, Distributions

plot(xlabel = "x", ylabel = "C(x)")
for i in 1:10
    # each realization of (X, U, V) determines one instance of the function C
    X = rand(Uniform(0, 10))
    U = rand(Uniform(0, 1))
    V = rand(Uniform(-5, 5))
    plot!(0:0.01:10, X + V < 5 ? (x -> 5 + U) : (x -> x + sin(U*x)))
end
current()
```

Example
Draw 1000 observations from the joint distribution of $X$ and $Y$, and make a scatter plot.

Solution.

```julia
# draw 1000 observations of (X, Y), where Y = C(X)
points = Tuple{Float64, Float64}[]
for i in 1:1000
    U = rand(Uniform(0, 1))
    V = rand(Uniform(-5, 5))
    X = rand(Uniform(0, 10))
    Y = X + V < 5 ? 5 + U : X + sin(U*X)
    push!(points, (X, Y))
end
scatter(points, ms = 1.5, msw = 0.5, color = :LightSeaGreen, markeralpha = 0.5)
```

Example
The causal regression function is $x \mapsto \mathbb{E}[C(x)]$. Find the causal regression function in the example above.

Solution.

```julia
using SymPy

@vars x u
# E[C(x)] = (1/2)·E[5 + U] + (1/2)·E[x + sin(Ux)], with U ~ Uniform(0, 1)
f = 1//2 * integrate(x + sin(u*x), (u, 0, 1)) + 1//2 * integrate(5 + u, (u, 0, 1))
plot!(0.01:0.01:10, x -> f(x), lw = 2, color = :purple)
```

Exercise
How does the causal regression function compare to the ordinary regression function $x \mapsto \mathbb{E}[Y \mid X = x]$? Feel free to eyeball the regression function from the scatter plot.

Solution. The causal regression function weights the $X + V < 5$ event and its complement equally along the whole range $0 \leq x \leq 10$, rather than giving more weight to the former condition when $x$ is close to 0 and more to the latter when $x$ is close to 10.

You can imagine the distinction between the regression function and the causal regression function by visualizing a person sitting at a particular value $x$ and watching a sequence of observations. For the causal regression function, they record the value $C(x)$ of every observation's counterfactual process at $x$. For the ordinary regression function, they wait until they see an observation whose $X$ value is very close to $x$, and only then do they record the observed outcome $Y$ for that observation.

When $x$ is close to the extremes in this example, the additional conditioning on $X$ performed in the ordinary regression obscures the causal relationship between $X$ and $Y$.

The formula for the adjusted treatment effect in the continuous case becomes

$$\theta(x) = \int \mathbb{E}[Y \mid X = x, Z = z] \, f_Z(z) \, \mathrm{d}z,$$

where $f_Z$ is the density of $Z$ (note that this is the same idea as in the discrete case: we're averaging the $z$-specific estimates $\mathbb{E}[Y \mid X = x, Z = z]$, weighted by how frequently those $z$-values occur).

And as in the discrete case, the adjusted treatment effect is equal to the causal regression function if $X$ and the process $C$ are conditionally independent given $Z$. This implies that, again assuming the conditional independence of $X$ and $C$ given $Z$, if $\hat{r}(x, z)$ is a consistent estimator of the regression function $\mathbb{E}[Y \mid X = x, Z = z]$, then $\frac{1}{n} \sum_{i=1}^n \hat{r}(x, Z_i)$ is a consistent estimator of $\theta(x)$.

If the regression function of $Y$ given $X$ and $Z$ is linear (that is, $\mathbb{E}[Y \mid X = x, Z = z] = \beta_0 + \beta_1 x + \beta_2 z$), then we can control for $Z$ merely by including it as a feature in an ordinary least squares regression. In other words, if $C$ is independent of $X$ given $Z$, then $\hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 \overline{Z}$ is a consistent estimator of $\theta(x)$, where $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2$ are the OLS coefficients and $\overline{Z}$ is the sample mean of $Z$.
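This claim is easy to see in a simulation. The linear model below is hypothetical: $Z$ drives both the treatment $X$ and the outcome $Y$, and the true coefficient on $X$ is 2. Including $Z$ as a feature recovers that coefficient, while omitting it does not:

```julia
using LinearAlgebra

n = 10^5
Z = [2 * rand() for _ in 1:n]                     # confounder, Uniform(0, 2)
X = [Z[i] + rand() for i in 1:n]                  # treatment, correlated with Z
Y = [2 * X[i] + 3 * Z[i] + randn() for i in 1:n]  # causal coefficient on X is 2

b_with    = [ones(n) X Z] \ Y   # OLS including Z: b_with[2] ≈ 2
b_without = [ones(n) X] \ Y     # OLS omitting Z: b_without[2] ≈ 4.4 (biased)
```

The biased limit here is the usual omitted-variable formula $\beta_1 + \beta_2 \operatorname{Cov}(X, Z)/\operatorname{Var}(X) = 2 + 3 \cdot \tfrac{1/3}{5/12} = 4.4$.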

Exercise
Suppose that and are independent random variables, and that

(a) Calculate the regression function $\mathbb{E}[Y \mid X = x, Z = z]$.

(b) Calculate $\theta(x)$.

(c) Suppose that $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2$ are the OLS coefficients obtained by regressing $Y$ on the features $X$ and $Z$. Show that $\hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 \overline{Z}$ is a consistent estimator of $\theta(x)$.

(d) Show that if $\hat{\beta}_0 + \hat{\beta}_1 x$ is the OLS estimator obtained with $X$ as the lone regressor, then it does not converge to $\theta(x)$ as the sample size tends to infinity.

Solution. (a) We have

(b) We have .

(c) We have $\theta(x) = \beta_0 + \beta_1 x + \beta_2 \mathbb{E}[Z]$. By consistency of the OLS estimator, we have $\hat{\beta}_0 \to \beta_0$ and $\hat{\beta}_1 \to \beta_1$ as the sample size tends to $\infty$, as well as $\hat{\beta}_2 \to \beta_2$. By the law of large numbers, we have $\overline{Z} \to \mathbb{E}[Z]$. Therefore, $\hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 \overline{Z}$ converges to $\theta(x)$ as the sample size goes to $\infty$.

(d) With $X$ as the lone regressor, the OLS slope converges to $\beta_1 + \beta_2 \operatorname{Cov}(X, Z)/\operatorname{Var}(X)$ as the sample size goes to $\infty$, which differs from $\beta_1$ since $\operatorname{Cov}(X, Z) \neq 0$. So the fitted line does not converge to $\theta(x)$.

## Conclusion

We conclude by noting that the conditional independence of $X$ and $(C(0), C(1))$ given a proposed confounding variable $Z$ isn't directly supportable by statistical evidence, since we won't have observations from the joint distribution of $X$ and $(C(0), C(1))$ (so many of $C$'s values are unobserved). The argument must be made that the list of variables controlled for is reasonably exhaustive, and we would typically hope to see the same conclusion supported by a variety of studies before believing it with very high confidence.

Congratulations! You've finished the Data Gymnasia Statistics Course.