Bayesian Inference and Graphical Models: Bayesian networks

Consider the following probabilistic narrative about an individual's health outcome.

(i) A person becomes a smoker with probability 18%.
(ii) They exercise regularly with probability 40% if they are a non-smoker or with probability 25% if they are a smoker.
(iii) Independently of the above, with probability 15% they have a gene which predisposes them to lung cancer.
(iv) Their conditional probability of contracting lung cancer, given the indicator random variables I_1, I_2, and I_3 of the events described in (i), (ii), and (iii) respectively, is given by 0.025 + 0.1I_1 - 0.02I_2 + 0.1I_3.
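The four steps of the story translate directly into a sampling procedure. The following is a minimal sketch, assuming only the probabilities stated above; the function name `sample_person` and the tuple layout are illustrative choices, not part of the original text.

```julia
using Random

# Draw one sample (smokes, exercises, gene, cancer) by following steps (i)-(iv).
# `sample_person` is a hypothetical helper name used only for this sketch.
function sample_person(rng = Random.default_rng())
    smokes    = rand(rng) < 0.18                       # (i)
    exercises = rand(rng) < (smokes ? 0.25 : 0.40)     # (ii) depends on smoking
    gene      = rand(rng) < 0.15                       # (iii) independent of the above
    # (iv) conditional probability of lung cancer given the three indicators
    p_cancer  = 0.025 + 0.1 * smokes - 0.02 * exercises + 0.1 * gene
    cancer    = rand(rng) < p_cancer
    (smokes, exercises, gene, cancer)
end

# Estimate the overall lung cancer probability from many simulated individuals
samples = [sample_person() for _ in 1:100_000]
cancer_rate = count(s -> s[4], samples) / length(samples)
```

By linearity of expectation, the exact value that `cancer_rate` approximates is 0.025 + 0.1(0.18) - 0.02(0.373) + 0.1(0.15) = 0.05054, where 0.373 = (0.82)(0.40) + (0.18)(0.25) is the marginal probability of exercising.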

We can visualize this story with a diagram in which each of the four indicator random variables is a node, and arrows are drawn to indicate the dependencies specified in the story.

A Bayesian network

Exercise
Is this the only such diagram consistent with the specified probability measure on the four random variables?

Solution. No. Nothing about smoking and exercising requires that we sample the smoking indicator first and then sample the exercising indicator from its conditional distribution given smoking. We could have done it the other way around, with the arrow pointing from the exercise node to the smoking node.
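To make the reversed order concrete, Bayes' rule gives the conditional distribution of the smoking indicator given the exercise indicator. This is a small sketch using the numbers from the story; the variable names are illustrative.

```julia
# Probabilities stated in the story
p_smoke             = 0.18
p_ex_given_smoke    = 0.25
p_ex_given_nonsmoke = 0.40

# Marginal probability of exercising (law of total probability)
p_ex = p_smoke * p_ex_given_smoke + (1 - p_smoke) * p_ex_given_nonsmoke

# Conditional probabilities of smoking given the exercise indicator (Bayes' rule)
p_smoke_given_ex   = p_smoke * p_ex_given_smoke / p_ex
p_smoke_given_noex = p_smoke * (1 - p_ex_given_smoke) / (1 - p_ex)
```

Here `p_ex` is 0.373, and the two conditional probabilities are approximately 0.121 and 0.215. Sampling the exercise indicator first with probability `p_ex`, and then the smoking indicator with these conditional probabilities, produces the same joint distribution on the two indicators.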

The diagram tells us that having the gene is independent of smoking and exercising (since those nodes have no common ancestors in the diagram). If we included another descendant of the "smokes" node, like "develops premature wrinkles", then that would communicate that premature wrinkles and lung cancer, while not independent, are conditionally independent given the smoking random variable.

Gaussian mixture models

Consider a distribution on \mathbb{R}^n whose density function can be written as a linear combination of d multivariate Gaussian densities, with nonnegative coefficients that sum to 1. For example, with n = 2 and d = 2:

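using Plots, Distributions
# Mixture density: weights 0.55 and 0.45 on two bivariate Gaussian components
f(x,y) = 0.55 * pdf(MvNormal([2.2, -0.4], [0.4 0.2; 0.2 0.4]), [x,y]) +
         0.45 * pdf(MvNormal([0.1, -4.3], [1.5 -0.1; -0.1 0.5]), [x,y])
# Heatmap and surface views of the density on a grid over [-6, 6] x [-6, 6]
p1 = heatmap(-6:0.05:6, -6:0.05:6, f)
p2 = surface(-6:0.05:6, -6:0.05:6, f)
plot(p1, p2, size = (650, 300))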

Such a distribution is called a Gaussian mixture model. We can sample from a GMM of the form \alpha_1 f_1(x) + \alpha_2 f_2(x) + \ldots + \alpha_d f_d(x) by simulating a random variable Z which takes values in \{1, 2, \ldots, d\} with probability \alpha_k for each element k, and then drawing X from a multivariate normal distribution with mean \mu_Z and covariance \Sigma_Z (where \mu_k and \Sigma_k are the mean and covariance of f_k).
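The two-stage sampling procedure just described can be sketched as follows. This version uses only Julia's standard library: it draws Z by inverting the cumulative sum of the weights and draws X through a Cholesky factor of the covariance, which is one common way to sample a multivariate normal and not something the text prescribes. The names `sample_component` and `sample_gmm` are hypothetical, and the parameters are those of the two-component density plotted above.

```julia
using LinearAlgebra, Random

# Parameters of the two-component example density
weights = [0.55, 0.45]
means   = [[2.2, -0.4], [0.1, -4.3]]
covs    = [[0.4 0.2; 0.2 0.4], [1.5 -0.1; -0.1 0.5]]

# Draw Z from {1, ..., d} with probabilities `weights` (inverse-CDF method)
function sample_component(rng, weights)
    u = rand(rng)
    k = findfirst(c -> u < c, cumsum(weights))
    k === nothing ? length(weights) : k    # guard against rounding in cumsum
end

# Draw one (Z, X) pair: pick a component Z, then draw X from N(mean_Z, cov_Z)
function sample_gmm(rng, weights, means, covs)
    z = sample_component(rng, weights)
    L = cholesky(Symmetric(covs[z])).L     # cov_Z = L * L'
    x = means[z] + L * randn(rng, length(means[z]))
    (z, x)
end

rng = MersenneTwister(0)
draws = [sample_gmm(rng, weights, means, covs) for _ in 1:1000]
```

Each element of `draws` is a pair of the component label and the sampled point; discarding the labels leaves a sample from the mixture itself.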

Exercise
Explain how you might estimate the means, covariances, and \alpha values based on the observations shown. Feel free to use your own visual intuition as part of the algorithm.

Solution. We identify the two clusters visually, and we associate each point with one cluster or the other. Then we estimate the means and covariances using the sample means and sample covariances of the two clusters, and we estimate the \alpha's as the proportions of points belonging to each cluster.
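Once each point has a hand-assigned cluster label, the estimates in this solution are a few lines of code. The sketch below assumes the observations are stored as rows of a matrix and the labels as integers in 1:d; the function name `estimate_gmm` and the toy data are hypothetical, since the actual observations appear only in the figure.

```julia
using Statistics

# Estimate mixture weights, means, and covariances from labeled points.
# `points` has one observation per row; `labels[i]` in 1:d is the cluster
# chosen by hand for row i.
function estimate_gmm(points::AbstractMatrix, labels::AbstractVector{<:Integer}, d::Integer)
    n = size(points, 1)
    weights = [count(==(k), labels) / n for k in 1:d]                    # proportions
    means   = [vec(mean(points[labels .== k, :], dims = 1)) for k in 1:d] # sample means
    covs    = [cov(points[labels .== k, :]) for k in 1:d]                 # sample covariances
    (weights, means, covs)
end

# Toy data (made up for illustration): four points in cluster 1, two in cluster 2
toy_points = [0.0 0.0; 2.0 0.0; 0.0 2.0; 2.0 2.0; 10.0 10.0; 12.0 10.0]
toy_labels = [1, 1, 1, 1, 2, 2]
w_hat, mu_hat, cov_hat = estimate_gmm(toy_points, toy_labels, 2)
```

Note that `cov` applies the usual n - 1 correction, so each cluster needs at least two points for its covariance estimate to be defined.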

In the next section (on Expectation-Maximization), we'll talk about how to do this in a way that doesn't require a human to hand-pick the Z value for each point.

Hidden Markov Models

The second example of a Bayesian network we'll look at is the Hidden Markov Model (HMM). An HMM consists of a Markov chain Z_1, \ldots, Z_n together with a collection of random variables X_1, \ldots, X_n with the property that the conditional distribution of X_j given all of the other random variables depends only on Z_j. Represented as a Bayes net, the hidden Markov model looks like this:

Bayes net for a hidden Markov model

Example
Simulate a hidden Markov model and plot the vector of Z's and the vector of X's on the same graph.

Solution.

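using Plots, OffsetArrays
# Transition matrix indexed by state (0 or 1): P[i, j] = P(next = j | current = i)
P = OffsetArray([0.2 0.8
                 1/3 2/3], 0:1, 0:1)

n = 100  # number of time steps

# Simulate n steps of the chain, starting from a uniformly random state
function markov_chain(P, n)
    Z = [rand(0:1)]
    for i in 1:n-1
        current_state = Z[end]
        # Move to state 0 with probability P[current_state, 0], otherwise to state 1
        push!(Z, rand() < P[current_state, 0] ? 0 : 1)
    end
    Z
end

Z = markov_chain(P, n)
X = Z + randn(n)  # observations: hidden state plus standard normal noise

plot(Z, size = (500, 100), legend = false)
plot!(X)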

The kinds of questions we'll want to answer for hidden Markov models include:

  1. Given observations for the X's, but not the Z's, which model parameters (including the transition probabilities for the Markov chain and any parameters for the conditional distribution of X_j given Z_j) maximize the likelihood of the observed data?

  2. Given values for the parameters of the model and given observations for the X's, what is the conditional distribution of the Z's?

Exercise
Consider a hidden Markov model for which the transition matrix P takes the form \left[\begin{array}{cc}{q} & {1-q} \\ {1-q} & {q}\end{array}\right] and for which the conditional distribution of X_j given Z_j is a normal distribution with mean Z_j and variance \sigma^2.

Given the observed X values shown, how many times would you guess the underlying Markov chain changed its state (from 0 to 1, or from 1 to 0)? Also, does it appear as though \sigma^2 is large or small?

Solution. It looks like the sequence of Z's was most likely this path (which switches 8 times):

Furthermore, it appears that \sigma^2 is probably pretty small, since the differences between the X's and Z's are small.

In the next section we'll talk about a more principled method for inferring model parameters and the conditional distribution of the Z's given the observed X's.

We close this section with an example showing how to use Bayes nets to calculate likelihood values.

Example
Find the likelihood of the following data for the hidden Markov model described above, with n = 3, q = 0.7, and \sigma^2 = 1. Suppose Z_1 is uniformly distributed on \{0,1\}.

\begin{align*} \begin{array}{ccc}{z_{1}=0} & {z_{2}=1} & {z_{3}=1} \\ {x_{1}=0.2} & {x_{2}=-0.4} & {x_{3}=0.85}\end{array} \end{align*}

Solution. The probability of observing Z_1 = 0 is 1/2. The probability of observing Z_1 = 0 and Z_2 = 1 is (1/2)(1-q). The probability of observing all three of the given Z values is (1/2)(1-q)(q).

The conditional probability of seeing an x_1 value close to 0.2 given \{Z_1=0\} is proportional to the value of the standard Gaussian density at 0.2, which is \frac{1}{\sqrt{2\pi}}\operatorname{e}^{-0.2^2/2}. Likewise, the likelihood gets a factor of \frac{1}{\sqrt{2\pi}}\operatorname{e}^{-(1-(-0.4))^2/2} for X_2 and a factor of \frac{1}{\sqrt{2\pi}}\operatorname{e}^{-(1-(0.85))^2/2} for X_3, given the values for Z_2 and Z_3 under consideration. All together, the likelihood is

\begin{align*}(1/2)(1-q)(q)\frac{1}{\sqrt{2\pi}}\operatorname{e}^{-0.2^2/2}\frac{1}{\sqrt{2\pi}}\operatorname{e}^{-(1-(-0.4))^2/2}\frac{1}{\sqrt{2\pi}}\operatorname{e}^{-(1-(0.85))^2/2}\end{align*}
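As a numerical check, the product above can be evaluated directly. This is a small sketch with illustrative names (`normal_pdf`, `transition`); it assumes, as in the example, q = 0.7, \sigma^2 = 1, and Z_1 uniform on \{0,1\}.

```julia
# Standard normal density; with variance 1, X_j given Z_j = z has density normal_pdf(x - z)
normal_pdf(t) = exp(-t^2 / 2) / sqrt(2 * pi)

q = 0.7
z = [0, 1, 1]
x = [0.2, -0.4, 0.85]

# Transition probability for the symmetric two-state chain: stay with probability q
transition(a, b) = a == b ? q : 1 - q

prior     = 1 / 2                                               # Z_1 uniform on {0, 1}
chain     = prod(transition(z[j], z[j+1]) for j in 1:length(z)-1)  # (1 - q) * q
emissions = prod(normal_pdf(x[j] - z[j]) for j in eachindex(x))
likelihood = prior * chain * emissions
```

With these values the likelihood comes out to roughly 0.0024.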

More generally, we can compute the likelihood for any complete set of values in a Bayes net by traversing the diagram starting from a root node (a node with no incoming arrows) and including a factor for each conditional probability mass or density value encountered at each node.
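The same traversal applies to the health example from the start of this section. The sketch below multiplies one factor per node, visiting each node after its parents; `bern` and `joint_probability` are hypothetical helper names, and the probabilities are the ones stated in the story.

```julia
# P(indicator = value) for an indicator that is true with probability p
bern(p, value::Bool) = value ? p : 1 - p

# Probability of one complete assignment of the four indicators
function joint_probability(smokes::Bool, exercises::Bool, gene::Bool, cancer::Bool)
    p  = bern(0.18, smokes)                       # root node: smoking
    p *= bern(0.15, gene)                         # root node: gene
    p *= bern(smokes ? 0.25 : 0.40, exercises)    # child of the smoking node
    p_cancer = 0.025 + 0.1 * smokes - 0.02 * exercises + 0.1 * gene
    p *= bern(p_cancer, cancer)                   # child of the other three nodes
    p
end

# Example: a smoker who does not exercise, lacks the gene, and develops lung cancer
p_example = joint_probability(true, false, false, true)
```

Summing `joint_probability` over all 16 assignments gives 1, which is a quick sanity check that the factors define a probability measure.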
