# StatisticsEstimating Joint Densities

The following scenario will be our running example throughout this section. We will begin by assuming that we know the joint distribution of two random variables and , and then we will explore what happens when we drop that assumption.

Example
Suppose that is a random variable which represents the number of hours a student studies for an exam, and is a random variable which represents their performance on the exam. Suppose that the joint density of and is given by

Given that a student's number of hours of preparation is known to be , what is the best prediction for (as measured by mean squared error)?

Solution. The best guess of a random variable (as measured by mean squared error) is the expectation of the random variable. Likewise, the best guess of given is the conditional expectation of given . Inspecting the density function, we recognize that restricting it to a vertical line at position gives an expression which is proportional to the Gaussian density centered at . Therefore,

A graph of this function is shown in red in the figure below:

We call the regression function for the problem of predicting given the value of .

Given the distribution of , finding the regression function is a probability problem. However, in practice the probability measure is typically only known empirically, that is, by way of a collection of independent observations sampled from the measure. For example, to understand the relationship between exam scores and studying hours, we would record the values for many past exams (see Figure 1.3 below).

Exercise
Describe an example of a random experiment where the probability measure may be reasonably inferred without the need for experimentation or data collection. What is the difference between such an experiment and one for which an empirical approach is required?

Solution. There are many random experiments for which it is reasonable to use a measure derived from principle rather than from experimentation. For example, if we are drawing a card from a well-shuffled deck, or drawing a marble from a well-stirred urn, then we would assume that the probability measure is uniform. This assumption would lead to more accurate predictions than if we used an inferred measure based on repeated observations of the experiment. By extension, if we have many random variables whose distributions are known from principle and which are known from principle to be independent, then we know that their average is approximately normally distributed.

The key difference between these examples and those requiring an empirical approach is the use of symmetry. There is no a priori reason to believe that, for example, the possible scores on an exam are equally likely, or that an exam score is a sum of independent random variables.

You can convince yourself that , and indeed the whole joint distribution of and , is approximately recoverable from the samples shown in Figure 1.3: if we draw a curve roughly through the middle of the point cloud, it would be pretty close to the graph of . If we shade the region occupied by the point cloud, darker where there are more points and lighter where there are fewer, it will be pretty close to the heat map of the density function.

We will want to sharpen this intuition into a computable algorithm, but the picture gives us reason to be optimistic.

Exercise
In the context of Figure 1.3, why doesn't it work to approximate the conditional distribution of given using all of the samples which are along the vertical line at position ?

Solution. Because for almost all values of , there will be no points along the vertical line at position !

### Kernel Density Estimation

The empirical distribution associated with a collection of observations from two random variables and is the probability measure which assigns mass to each location . The previous exercise illustrated shortcomings of using the empirical distribution directly for this problem: the mass is not appropriately spread out in the rectangle. It makes more sense to conclude from the presence of an observation at a point that the unknown distribution of has some mass near , not necessarily exactly at .

We can capture this idea mathematically by spreading out units of probability mass around each sample . We have to choose a function—called the kernel—to specify how the mass is spread out.

There are many kernel functions in common use; we will base our kernel on the tricube function, which is defined by

We define ; The parameter can be tuned to adjust how tightly the measure is concentrated at the center. Experiment with the slider below to get a feel for how changing affects the shape of the graph of .

λ=${λ} We define the kernel function as a product of two copies of , one for each coordinate (see Figure 1.4 for a graph): Exercise Is the probability mass more or less tightly concentrated around the origin when is small? Solution. Note that as soon as either or is larger than . Therefore, all of the probability mass is less than units from the origin. So small corresponds to tight concentration of the mass near zero. Now, let's place small piles of probability mass with the shape shown in Figure 1.4 at each sample point. The resulting approximation of the joint density is Graphs of for various values of are shown below. We can see that the best match is going to be somewhere between the and extremes (which are too concentrated and too diffuse, respectively). Exercise Estimate the best value of by eyeballing the figure above (the values of in the first four figures are 0.25, 1, 3, and 10. The last figure shows the actual density). Solution. In the picture, it looks like the estimated measure is taking individual points too seriously, since there are lots of points with more mass around them than in the gaps between observations. In the picture, the mass looks way too spread out. Among the ones shown, appears to be the best. We can take advantage of our knowledge of the density function to determine which value works best. In practice, we would not have access to this information, but let's postpone dealing with that issue till the next section. We measure the difference between and by evaluating both functions at each point in a square grid and finding the sum of the squares of these differences. This sum is approximately proportional to . using LinearAlgebra, Statistics, Roots, Optim, Plots, Random Random.seed!(1234) n = 1000 # number of samples to draw # the true regression function r(x) = 2 + 1/50*x*(30-x) # the true density function σy = 3/2 f(x,y) = 3/4000 * 1/√(2π*σy^2) * x*(20-x)*exp(-1/(2σy^2)*(y-r(x))^2) # x values and y values for a grid xs = 0:1/2^3:20 ys = 0:1/2^3:10 # F(t) is the CDF of X F(t) = -t^3/4000 + 3t^2/400 "Sample from the distribution of X (inverse CDF trick)" function sampleX() U = rand() find_zero(t->F(t)-U,(0,20),Bisection()) end "Sample from joint distribution of X and Y" function sampleXY(r,σ) X = sampleX() Y = r(X) + σ*randn() (X,Y) end # Sample n observations sample = [sampleXY(r,σy) for i=1:n] D(u) = abs(u) < 1 ? 70/81*(1-abs(u)^3)^3 : 0 # tri-cube function D(λ,u) = 1/λ*D(u/λ) # scaled tri-cube K(λ,x,y) = D(λ,x) * D(λ,y) # kernel kde(λ,x,y,observations) = sum(K(λ,x-Xi,y-Yi) for (Xi,Yi) in observations)/length(observations) # optimize takes functions which accept vector arguments, so we treat # λ as a one-entry vector L(λ) = sum((f(x,y) - kde(λ,x,y,sample))^2 for x=xs,y=ys)*step(xs)*step(ys) # minimize L using the BFGS method λ_best = optimize(λ->L(first(λ)),[1.0],BFGS())  We find that the best value comes out to about . This is significantly smaller than our "eyeball" value of 3 from earlier, but not terribly far off. ### Cross-validation Since we only know the density function in this case because we're in the controlled environment of an exercise, we need to think about how we could have chosen the optimal value without that information. One idea, which we will find is applicable in many statistical contexts, is to reserve one observation and form an estimator which only uses the other observations. We evaluate this density approximation at the reserved sample point, and we repeat all of these steps for each sample in the data set. If the resulting density value is consistently very small, then our density estimator is being consistently "surprised" by the location of the reserved sample. This suggests that our value is too large or too small. This idea is called cross-validation. Exercise Experiment with the sliders below to adjust the bandwidth and the omitted point i to find a value of you find satisfactory (set i to 7 to include all six points). λ=${λ}

i=\${i}

The density value at the omitted point is small when is too small because , and the value is small when is too large because .

We've decided we want the densities at the omitted points to be not too small, but we haven't specified an objective function to say exactly what we mean by that. It turns out that we can do this in a principled way, starting from the idea of integrated squared error.

Let's aim to minimize the loss (or error, or risk) of our estimator, which we define to be the integrated squared difference between and the true density . We can write this integral as

using . The third term does not involve , so minimizing is the same as minimizing , which we will call .

Recalling the expectation formula

we recognize the second term in the top equation as , where is an observation from the true density which is independent of . To get observations which are independent of the estimator, we will define to be the density estimator obtained using the observations other than the one. Since the observations are drawn from the joint density of and are assumed to be independent, we can suggest the approximation

We call

the cross-validation loss estimator. Let's find the value of which minimizes for the present example:

"Evaluate the summation Σᵢ f⁽⁻ⁱ⁾(Xᵢ,Yᵢ) in J(λ)'s second term"
function kdeCV(λ,i,observations)
x,y = observations[i]
newobservations = copy(observation)
deleteat!(newobservations,i)
kde(λ,x,y,newobservations)
end

# first line approximates ∫f̂², the second line approximates -(2/n)∫f̂f
J(λ) = sum([kde(λ,x,y,observations)^2 for x=xs,y=ys])*step(xs)*step(ys) -
2/length(observations)*sum(kdeCV(λ,i,observations) for i=1:length(observations))
λ_best_cv = optimize(λ->J(first(λ)),[1.0],BFGS())


The cross-validated minimizing value is . Recall that the true minimizer is . So in this case, cross validation gets us quite close to the optimal value without needing to know the actual density. In fact, the cross-validation estimator performs well in general given enough data, in the following sense:

Theorem (Stone's Theorem)
Suppose that is a bounded probability density function. Let be the kernel density estimator with bandwidth obtained by cross-validation, and let be the kernel density estimator with optimal bandwidth . Then

converges in probability to 1 as . Furthermore, there are constants and such that and for large .

### Nonparametric Regression

With an estimate of the joint density of and in hand, we can turn to the problem of estimating the regression function . If we restrict the density estimate to the vertical line at position , we find that the conditional density is

So the conditional expectation of given is

Let's try to simplify this expression. Looking at Figure 1.6, we can see by the symmetry of the function that each observation contributes to the numerator of the above equation. To arrive at the same conclusion by manipulating the formula directly, we write

Then is the average of the probability measure with density centered at . Since is symmetric, this average is the center point .

Applying a similar idea to the denominator (note that instead of , we get the total mass 1 from integrating ), we find that

The graph of this function is shown in Figure 1.7. We see that it matches the graph of quite closely except near the ends of the interval.

λ = first(λ_best_cv.minimizer)
r̂(x) = sum(D(λ,x-Xi)*Yi for (Xi,Yi) in observations)/sum(D(λ,x-Xi) for (Xi,Yi) in observations)
scatter([x for (x,y) in observations],
[y for (x,y) in observations],label="observations",
markersize=2, legend = :bottomright)
plot!(0:0.2:20, r̂, label = L"\hat{r}", ylims = (0,10), linewidth = 3)

The approximate integrated squared error of this estimator is sum((r(x)-r̂(x))^2 for x in xs)*step(xs) = 1.90.

Kernel density estimation is just one approach to estimating a joint density, and the Nadaraya-Watson estimator is just one approach to estimating a regression function. In the machine learning course following this one, we will explore a wide variety of machine learning models which take quite different approaches, and each will have its own strengths and weaknesses.  Bruno