Undirected Graphical Models


Similar to Bayesian networks, undirected graphical models also provide a visual representation of the structure of a joint probability distribution from which conditional independence relationships can be inferred. However, there are significant representational and computational differences between these models.

Markov Random Fields (MRFs)

A Markov random field is an undirected graph $G = (V,E)$ together with a collection of potential functions defined over cliques of the graph $G$. An undirected graph is defined just as a directed graph is, except that the edges no longer have an associated direction. A clique in an undirected graph is a completely connected subgraph (i.e., there is an edge between every pair of vertices in the subgraph). See Figure 1 for an example of a clique.


Figure 1. An undirected graph $G$ with four vertices $A,B,C,$ and $D$. The subgraph over vertices $A, B,$ and $C$ is a clique.

As before, each vertex in an MRF corresponds to a random variable. Each clique, $C\subseteq V$, in the graph $G$ is associated with a nonnegative, real-valued potential function, $\psi_C$. The potential functions need not correspond to conditional probability distributions of any kind. A joint probability distribution, $p$, factorizes with respect to an undirected graph $G$ if there exist potential functions defined on the cliques of the graph $G$ such that $p$ can be expressed as follows.

\[p(x_1,\ldots, x_{|V|}) = \frac{1}{Z} \prod_{C\in \mathcal{C}(G)} \psi_{C}(x_C)\]

Here, $\mathcal{C}(G)$ is the set of all cliques in the graph $G$. The normalizing constant or partition function, $Z$, is chosen so that the product of potential functions defines a proper probability distribution.

\[Z = \sum_{x_1,\ldots,x_{|V|}} \prod_{C\in \mathcal{C}(G)} \psi_{C}(x_C)\]

If such a factorization exists, then the random variables are said to form a Markov random field with respect to the graph $G$. In general, we can restrict ourselves to models with potential functions only over the maximal cliques of $G$. A clique is maximal if it is not contained in a larger clique. As an example, in Figure 1, both sets $\{A,B\}$ and $\{A,B,C\}$ are cliques of the graph. However, as $\{A,B\} \subset \{A,B,C\}$, the clique $\{A,B\}$ is not a maximal clique. We can restrict our attention to maximal cliques because, if we are given a factorization of a joint probability distribution that contains both a potential function $\psi_{AB}$ and a potential function $\psi_{ABC}$, we could create a new potential function $\psi'_{ABC} \triangleq \psi_{ABC}\cdot\psi_{AB}$ and remove the old potential functions, yielding a new factorization of the joint probability distribution over only the maximal cliques.


Figure 2. An undirected graph $G$ with four vertices $A,B,C,$ and $D$. A joint distribution $p(x_A, x_B, x_C, x_D)$ factorizes over $G$ if there exist potential functions such that \[p(x_A, x_B, x_C, x_D) = \frac{1}{Z} \psi_{AB}(x_A,x_B)\psi_{BC}(x_B,x_C)\psi_{CD}(x_C,x_D),\] where $Z$ is the normalization constant and the $\psi$'s are nonnegative real-valued functions.
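To make the factorization concrete, the following Python sketch evaluates the factorization of Figure 2 by brute force. The specific potential values are arbitrary choices made for this illustration; any nonnegative values would do.

```python
import itertools

# Arbitrary nonnegative potential tables for the chain A - B - C - D of Figure 2.
psi_AB = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 0.5}
psi_BC = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 3.0}
psi_CD = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def unnormalized(a, b, c, d):
    """Product of the clique potentials for one joint assignment."""
    return psi_AB[(a, b)] * psi_BC[(b, c)] * psi_CD[(c, d)]

# The partition function Z sums the product of potentials over all assignments.
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=4))

def p(a, b, c, d):
    """Joint probability p(x_A, x_B, x_C, x_D) under the factorization."""
    return unnormalized(a, b, c, d) / Z

# Sanity check: dividing by Z yields a proper probability distribution.
total = sum(p(*x) for x in itertools.product([0, 1], repeat=4))
assert abs(total - 1.0) < 1e-12
```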

To see how the MRF framework can be used to easily express certain types of probability distributions that would have been more challenging to express in Bayesian networks, consider the problem of constructing a probability distribution over the independent sets of a given graph $H = (V_H, E_H)$. Recall that an independent set in $H$ is a subset of $V_H$ such that no two elements of the chosen subset are joined by an edge in $E_H$. Number the vertices of $H$ from $1$ to $|V_H|$. We will represent each possible subset of $V_H$ as a binary vector, $x$, whose $i^{th}$ component is equal to $1$ if the $i^{th}$ vertex of $H$ is selected to be in the subset and zero otherwise. We can express the uniform distribution over independent sets of $H$ as

\[p(x_1,\ldots, x_{|V_H|}) = \frac{1}{Z} \prod_{(i,j)\in E_H} 1_{x_i + x_j \leq 1}.\]

Here, $1_{x_i + x_j \leq 1}$ is an indicator function for the constraint that $x_i + x_j \leq 1$. That is, $1_{x_i + x_j \leq 1} = 1$ if $x_i + x_j \leq 1$ and zero otherwise. As the goal is to construct the uniform probability distribution over independent sets, the probability of any vector $x'$ whose corresponding subset is not an independent set should be zero. The potential functions enforce exactly this constraint: if two adjacent vertices in $H$, say $i$ and $j$, are both selected, i.e., $x'_i = x'_j = 1$, then $1_{x'_i + x'_j \leq 1} = 0$, and thus the joint probability of the vector $x'$ is zero. However, if $x'$ defines an independent set, then the joint probability is $\frac{1}{Z}$. Finally, $Z = \sum_{x\in\{0,1\}^{|V_H|}} \prod_{(i,j)\in E_H} 1_{x_i + x_j \leq 1}$ is simply the total number of independent sets in the graph $H$, which means that, with this choice of potential functions, $p$ is the uniform probability distribution over independent sets of $H$.
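This construction can be verified by brute force. The sketch below uses a small hypothetical graph $H$, a cycle on four vertices chosen only for illustration; the computed $Z$ equals the number of independent sets of $H$, and each independent set receives probability $1/Z$.

```python
import itertools

# Hypothetical graph H: a cycle on four vertices (chosen only for illustration).
n = 4
E_H = [(0, 1), (1, 2), (2, 3), (3, 0)]

def unnormalized(x):
    """Product of indicator potentials 1_{x_i + x_j <= 1} over the edges of H."""
    weight = 1
    for i, j in E_H:
        weight *= 1 if x[i] + x[j] <= 1 else 0
    return weight

# Z counts the independent sets of H (7 for the four-cycle).
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=n))

# Each independent set has probability 1/Z; all other subsets have probability 0.
for x in itertools.product([0, 1], repeat=n):
    print(x, unnormalized(x) / Z)
```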

Conditional Independence and Graph Separation

As in the case of Bayesian networks, independence relationships that are a consequence of the factorization can be inferred from the undirected graph itself. However, as the edges carry no directions, the separation conditions are somewhat easier to verify. Consider three mutually disjoint sets of vertices $S_1,S_2,S_3\subseteq V$. The set $S_1$ is graph separated from $S_2$ given $S_3$ if every path from a vertex $v_1\in S_1$ to a vertex $v_2\in S_2$ contains at least one vertex from the set $S_3$. Equivalently, $S_1$ and $S_2$ are graph separated given $S_3$ if removing the vertices in the set $S_3$ and all of their incident edges from the graph $G$ results in a new graph in which no connected component simultaneously contains a vertex of $S_1$ and a vertex of $S_2$.
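The second characterization translates directly into a simple algorithm: delete the vertices of $S_3$ and check whether any vertex of $S_1$ can still reach a vertex of $S_2$. Below is a minimal Python sketch; the example graph at the bottom (edges $A-B$, $A-C$, $B-C$, $C-D$) is hypothetical and chosen only to show the function in use.

```python
from collections import deque

def graph_separated(adj, S1, S2, S3):
    """Return True if S1 and S2 are graph separated given S3.

    adj maps each vertex to the set of its neighbors.  The vertices in S3
    are treated as deleted, and a breadth-first search checks whether any
    vertex of S1 can still reach a vertex of S2.
    """
    S1, S2, S3 = set(S1), set(S2), set(S3)
    visited = set(S3)              # deleted vertices are never expanded
    frontier = deque(S1 - S3)
    visited |= set(frontier)
    while frontier:
        v = frontier.popleft()
        if v in S2:
            return False           # found a path from S1 to S2 avoiding S3
        for u in adj[v]:
            if u not in visited:
                visited.add(u)
                frontier.append(u)
    return True

# Hypothetical example graph with edges A-B, A-C, B-C, C-D.
adj = {'A': {'B', 'C'}, 'B': {'A', 'C'}, 'C': {'A', 'B', 'D'}, 'D': {'C'}}
print(graph_separated(adj, {'A'}, {'D'}, {'C'}))  # True: every A-D path uses C
print(graph_separated(adj, {'A'}, {'D'}, {'B'}))  # False: the path A-C-D avoids B
```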


Figure 3. An undirected graph on six vertices.
Is $D$ graph separated from $A$ given $B$ and $E$ in the graph in Figure 3?
Yes, as every path from $D$ to $A$ must pass through the neighbors of $A$, namely either $B$ or $E.$
Is $C$ graph separated from $A$ given $B$ and $D$ in the graph in Figure 3?
No, the path $C - E - A$ does not contain $B$ or $D.$

Theorem 1: For mutually disjoint sets $S_1,S_2,S_3\subseteq V$, if $S_1$ is graph separated from $S_2$ given $S_3$ in an undirected graph $G = (V,E)$, then for any probability distribution that factorizes over the maximal cliques of $G$, $X_{S_1} \perp X_{S_2} | X_{S_3}$.

To begin, we have that the joint distribution $p$ can be expressed as

\[p(x_1,\ldots, x_{|V|}) = \frac{1}{Z} \prod_{C\in \mathcal{C}(G)} \psi_{C}(x_C).\]

By the graph separation assumption, there can be no edges between vertices in the set $S_1$ and vertices in the set $S_2$; therefore, no clique $C\in\mathcal{C}(G)$ can contain both a vertex of $S_1$ and a vertex of $S_2$. Further, if we remove the vertices in $S_3$ and all of their incident edges from the graph $G$ to form a new graph $G'$, graph separation implies that no connected component of $G'$ contains both a vertex of $S_1$ and a vertex of $S_2$. This lets us partition the vertices of $V \setminus (S_1\cup S_2\cup S_3)$ into two sets: the set $T_1$ of those vertices that are in a connected component of $G'$ with some vertex of $S_1$ and the set $T_2$ of those vertices that are not in such a connected component of $G'$. Now, divide the cliques of $G$ into two sets: those cliques, $\mathcal{C}_1$, that contain at least one vertex of $S_1\cup T_1$ and those cliques, $\mathcal{C}_2$, that do not. With these definitions,

\[p(x_1,\ldots, x_{|V|}) = \frac{1}{Z} \left[\prod_{C\in \mathcal{C}_1} \psi_{C}(x_C)\right] \left[\prod_{C\in \mathcal{C}_2} \psi_{C}(x_C)\right].\]

Note that if $C\in\mathcal{C}_1$, then $C$ can only contain vertices in $S_1$, $T_1$, and $S_3$. By construction, $C$ contains some vertex $v\in S_1\cup T_1$; if $C$ also contained a vertex $u\in S_2\cup T_2$, then the edge between $u$ and $v$ would survive in $G'$ (neither endpoint lies in $S_3$), placing $u$ in the same connected component of $G'$ as a vertex of $S_1$. This is impossible: if $u\in S_2$, it would contradict graph separation, and if $u\in T_2$, then $u$ would belong to $T_1$. Similarly, if $C\in\mathcal{C}_2$, then $C$ contains no vertex of $S_1\cup T_1$ and hence can only contain vertices in $S_2$, $T_2$, and $S_3$. Using these observations, define

\[f(x_{S_1}, x_{T_1}, x_{S_3}) \triangleq \prod_{C\in \mathcal{C}_1} \psi_{C}(x_C)\]

and

\[g(x_{S_2}, x_{T_2}, x_{S_3}) \triangleq \prod_{C\in \mathcal{C}_2} \psi_{C}(x_C).\]

This allows us to write

\[p(x_1,\ldots, x_{|V|}) = \frac{1}{Z} f(x_{S_1}, x_{T_1}, x_{S_3})\cdot g(x_{S_2}, x_{T_2}, x_{S_3}).\]

Now,

\begin{align*} p(x_{S_1}, x_{S_2} | x_{S_3}) &= \frac{\sum_{x''_{T_1}, x''_{T_2}}\frac{1}{Z} f(x_{S_1}, x''_{T_1}, x_{S_3})\cdot g(x_{S_2}, x''_{T_2}, x_{S_3})}{\sum_{x'_{T_1}, x'_{T_2}, x'_{S_1}, x'_{S_2}}\frac{1}{Z} f(x'_{S_1}, x'_{T_1}, x_{S_3})\cdot g(x'_{S_2}, x'_{T_2}, x_{S_3})}\\ &= \left[\frac{\sum_{x''_{T_1}} f(x_{S_1}, x''_{T_1}, x_{S_3})}{\sum_{x'_{T_1}, x'_{S_1}}f(x'_{S_1}, x'_{T_1}, x_{S_3})}\right]\cdot\left[\frac{\sum_{x''_{T_2}}g(x_{S_2}, x''_{T_2}, x_{S_3})}{\sum_{x'_{T_2},x'_{S_2}}g(x'_{S_2}, x'_{T_2}, x_{S_3})}\right]. \end{align*}

Further,

\begin{align*} p(x_{S_1} | x_{S_3}) &= \frac{\sum_{x''_{T_1}, x''_{T_2}, x''_{S_2}}\frac{1}{Z} f(x_{S_1}, x''_{T_1}, x_{S_3})\cdot g(x''_{S_2}, x''_{T_2}, x_{S_3})}{\sum_{x'_{T_1}, x'_{T_2}, x'_{S_1}, x'_{S_2}}\frac{1}{Z} f(x'_{S_1}, x'_{T_1}, x_{S_3})\cdot g(x'_{S_2}, x'_{T_2}, x_{S_3})}\\ &= \left[\frac{\sum_{x''_{T_1}} f(x_{S_1}, x''_{T_1}, x_{S_3})}{\sum_{x'_{T_1}, x'_{S_1}}f(x'_{S_1}, x'_{T_1}, x_{S_3})}\right]\cdot\left[\frac{\sum_{x''_{T_2}, x''_{S_2}}g(x''_{S_2}, x''_{T_2}, x_{S_3})}{\sum_{x'_{T_2},x'_{S_2}}g(x'_{S_2}, x'_{T_2}, x_{S_3})}\right]\\ &= \frac{\sum_{x''_{T_1}} f(x_{S_1}, x''_{T_1}, x_{S_3})}{\sum_{x'_{T_1}, x'_{S_1}}f(x'_{S_1}, x'_{T_1}, x_{S_3})} \end{align*}

and

\begin{align*} p(x_{S_2} | x_{S_3}) &= \frac{\sum_{x''_{T_1}, x''_{T_2}, x''_{S_1}}\frac{1}{Z} f(x''_{S_1}, x''_{T_1}, x_{S_3})\cdot g(x_{S_2}, x''_{T_2}, x_{S_3})}{\sum_{x'_{T_1}, x'_{T_2}, x'_{S_1}, x'_{S_2}}\frac{1}{Z} f(x'_{S_1}, x'_{T_1}, x_{S_3})\cdot g(x'_{S_2}, x'_{T_2}, x_{S_3})}\\ &= \left[\frac{\sum_{x''_{T_1},x''_{S_1} } f(x''_{S_1}, x''_{T_1}, x_{S_3})}{\sum_{x'_{T_1}, x'_{S_1}}f(x'_{S_1}, x'_{T_1}, x_{S_3})}\right]\cdot\left[\frac{\sum_{x''_{T_2}}g(x_{S_2}, x''_{T_2}, x_{S_3})}{\sum_{x'_{T_2},x'_{S_2}}g(x'_{S_2}, x'_{T_2}, x_{S_3})}\right]\\ &= \frac{\sum_{x''_{T_2}}g(x_{S_2}, x''_{T_2}, x_{S_3})}{\sum_{x'_{T_2},x'_{S_2}}g(x'_{S_2}, x'_{T_2}, x_{S_3})}, \end{align*}

which demonstrates that $p(x_{S_1}, x_{S_2} | x_{S_3}) = p(x_{S_1} | x_{S_3})\cdot p(x_{S_2} | x_{S_3})$. In other words, $X_{S_1} \perp X_{S_2} | X_{S_3}$.
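To see Theorem 1 in action numerically, the sketch below reuses the chain factorization of Figure 2 with arbitrary potential values (an assumption made for this illustration) and checks by brute force that $X_A \perp X_C, X_D \mid X_B$, as predicted by the fact that $B$ graph separates $A$ from $\{C, D\}$.

```python
import itertools

# Arbitrary nonnegative potentials for the chain A - B - C - D of Figure 2.
psi_AB = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 0.5}
psi_BC = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 3.0}
psi_CD = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def unnormalized(a, b, c, d):
    return psi_AB[(a, b)] * psi_BC[(b, c)] * psi_CD[(c, d)]

Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=4))
joint = {x: unnormalized(*x) / Z for x in itertools.product([0, 1], repeat=4)}

# B graph separates A from {C, D}, so p(a, c, d | b) = p(a | b) * p(c, d | b).
for b in (0, 1):
    p_b = sum(v for (a, bb, c, d), v in joint.items() if bb == b)
    for a, c, d in itertools.product([0, 1], repeat=3):
        p_acd_given_b = joint[(a, b, c, d)] / p_b
        p_a_given_b = sum(joint[(a, b, c2, d2)]
                          for c2 in (0, 1) for d2 in (0, 1)) / p_b
        p_cd_given_b = sum(joint[(a2, b, c, d)] for a2 in (0, 1)) / p_b
        assert abs(p_acd_given_b - p_a_given_b * p_cd_given_b) < 1e-9
print("conditional independence verified")
```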

As a simple example of the kinds of independence relationships implied by graph separation, consider the local Markov independence assumptions: for each $i\in V$, the random variable $X_i$ is independent of its non-neighbors given its neighbors in the graph $G$. A vertex $i\in V$ is a neighbor of a vertex $j\in V$ if there is an edge $(i,j)\in E$. We say that an undirected graph $G$ is an I-map for a probability distribution $p$ if every conditional independence relationship implied by graph separation in $G$ holds in $p$. With this terminology, Theorem 1 says that if $p$ factorizes over the maximal cliques of $G$, then $G$ is an I-map for $p$. The converse is known as the Hammersley-Clifford theorem: if $G$ is an I-map for some strictly positive probability distribution $p$, then $p$ factorizes over the maximal cliques of $G$. The proof of the Hammersley-Clifford theorem is quite a bit more technical than the proof of Theorem 1, and we omit it here.

Bayesian Networks as MRFs

Every Bayesian network can be converted into a MRF via a process known as moralization. The intuition is that we can simply treat the conditional probability distributions that arise in the factorization over a directed acyclic graph as the potential functions over an undirected graph that is obtained from the directed graph by: 1) dropping the direction of the arrows and 2) adding undirected edges so that the subgraph formed by a node and its parents from the directed graph forms a clique in the new undirected graph.
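The following is a minimal sketch of moralization, assuming (purely for illustration) that the Bayesian network is given as a dictionary mapping each node to the list of its parents.

```python
from itertools import combinations

def moralize(parents):
    """Moralize a directed acyclic graph given as {node: list of parents}.

    Returns the edge set of the moral graph: every directed edge with its
    direction dropped, plus an edge between every pair of parents of a
    common child, so that each node together with its parents forms a clique.
    """
    edges = set()
    for child, pa in parents.items():
        # 1) Drop the direction of each edge parent -> child.
        for p in pa:
            edges.add(frozenset((p, child)))
        # 2) "Marry" the parents: connect every pair of parents of the child.
        for p1, p2 in combinations(pa, 2):
            edges.add(frozenset((p1, p2)))
    return edges

# Hypothetical example: the v-structure A -> C <- B.
print(moralize({'A': [], 'B': [], 'C': ['A', 'B']}))
# The moral graph has edges A-C, B-C, and the added edge A-B.
```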

