A Brief Introduction to Graphical Models

Approximate MAP Inference


Recall that the MAP inference task is to compute the maximizing assignment or value of a (conditional) probability distribution. \begin{eqnarray} {\arg \max}_{x} p(x) \end{eqnarray} While the MAP problem is NP-hard in general, we saw how to find the MAP assignment in linear time when the probability distribution factorizes over a tree-structured MRF by performing variable elimination starting at the leaves of the tree and working our way to the root. However, if the graph is not a tree, no elimination order may lead to an efficient MAP inference scheme. In this section, we will take a closer look at MAP inference in loopy MRFs, i.e., MRFs that have at least one cycle, and develop two different methods to perform approximate MAP inference.

An Upper Bound on the MAP Inference Problem

Our first approximate MAP inference scheme is based on the idea of reparameterization. The potential functions in a given factorization are, in general, not unique: the same joint probability distribution may factorize in many different ways over the same undirected graph. Note that this is not true for Bayesian networks, where the conditional probability distributions are uniquely determined for a given joint probability distribution. Suppose that $p$ is a probability distribution that factorizes over the set of cliques $C$ of some graph $G = (V, E)$. In other words, $p(x) = \frac{1}{Z} \prod_{c\in C} \psi_c(x_c)$. Construct a collection of messages, that is, positive functions $m_{c\rightarrow i}(x_i)$ for each $c \in C$, each $i\in c$, and each assignment to the random variable $X_i$. By multiplying and dividing the joint probability distribution by $\prod_{c\in C} \prod_{i\in c} m_{c\rightarrow i}(x_i)$, we obtain

\begin{align*} p(x) &= \frac{1}{Z} \prod_{c\in C} \psi_c(x_c)\\ &= \frac{1}{Z} \left[\prod_{i\in V}\prod_{c\in C : i\in c} m_{c\rightarrow i}(x_i)\right]\left[\prod_{c\in C} \frac{\psi_c(x_c)}{\prod_{j\in c} m_{c\rightarrow j}(x_j)}\right]. \end{align*}

Each choice of messages yields a different reparameterization of the joint probability distribution with new potentials

\[\phi'_i(x_i) \triangleq \prod_{c\in C : i\in c} m_{c\rightarrow i}(x_i)\] and \[\psi'_c(x_c) \triangleq \frac{\psi_c(x_c)}{\prod_{j\in c} m_{c\rightarrow j}(x_j)}.\]
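As a quick numerical sanity check, the following sketch (assuming NumPy is available; the 3-cycle model, potentials, and messages are arbitrary illustrative choices) verifies that the new potentials $\phi'_i$ and $\psi'_c$ multiply back to the original product of potentials for every assignment:

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)

# A small MRF over three binary variables on a 3-cycle with pairwise cliques.
# The potentials and messages are arbitrary positive values for illustration.
cliques = [(0, 1), (1, 2), (0, 2)]
psi = {c: rng.uniform(0.5, 2.0, size=(2, 2)) for c in cliques}
m = {(c, i): rng.uniform(0.5, 2.0, size=2) for c in cliques for i in c}

gaps = []
for x in itertools.product((0, 1), repeat=3):
    original = np.prod([psi[c][x[c[0]], x[c[1]]] for c in cliques])
    # prod_i phi'_i(x_i): the product of all messages arriving at each node.
    phi_new = np.prod([m[(c, i)][x[i]] for c in cliques for i in c])
    # prod_c psi'_c(x_c): each potential divided by the messages it sent.
    psi_new = np.prod([psi[c][x[c[0]], x[c[1]]]
                       / (m[(c, c[0])][x[c[0]]] * m[(c, c[1])][x[c[1]]])
                       for c in cliques])
    gaps.append(abs(original - phi_new * psi_new))

max_gap = max(gaps)  # numerically zero: the reparameterization is exact
```

Because every message that is multiplied in through $\phi'$ is divided back out through $\psi'$, the product is unchanged regardless of how the messages are chosen.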

Although each of these reparameterizations represents the same joint probability distribution, they can be used to obtain different upper bounds on the MAP problem. To see this, recall that the maximum of a product is at most the product of the maxima: for nonnegative functions $f$ and $g$ over the same set of variables, $\max_x f(x)g(x) \leq \left[\max_x f(x)\right]\left[\max_x g(x)\right]$. This inequality is tight whenever $f$ and $g$ are both maximized at the same assignment. Applying the inequality to the above reparameterization yields the following inequality for the MAP problem.

\begin{align*} \max_x p(x) &= \max_x \frac{1}{Z} \prod_{c\in C} \psi_c(x_c)\\ &= \max_x \frac{1}{Z} \left[\prod_{i\in V}\prod_{c\in C : i\in c} m_{c\rightarrow i}(x_i)\right]\left[\prod_{c\in C} \frac{\psi_c(x_c)}{\prod_{j\in c} m_{c\rightarrow j}(x_j)}\right]\\ &\leq \frac{1}{Z} \left[\prod_{i\in V}\max_{x_i} \prod_{c\in C : i\in c} m_{c\rightarrow i}(x_i)\right]\left[\prod_{c\in C} \max_{x_c}\frac{\psi_c(x_c)}{\prod_{j\in c} m_{c\rightarrow j}(x_j)}\right] \end{align*}
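The inequality can be checked numerically. This sketch (assuming NumPy; the 3-cycle model, potentials, and messages are illustrative choices) computes the exact MAP value by enumeration and the right-hand side of the bound for a random set of messages:

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)

# A toy 3-cycle MRF over binary variables with arbitrary pairwise potentials.
cliques = [(0, 1), (1, 2), (0, 2)]
psi = {c: rng.uniform(0.5, 2.0, size=(2, 2)) for c in cliques}
m = {(c, i): rng.uniform(0.5, 2.0, size=2) for c in cliques for i in c}

def unnormalized(x):
    return np.prod([psi[c][x[c[0]], x[c[1]]] for c in cliques])

assignments = list(itertools.product((0, 1), repeat=3))
Z = sum(unnormalized(x) for x in assignments)
map_val = max(unnormalized(x) for x in assignments) / Z  # exact MAP value

# Right-hand side of the inequality: maximize each factor separately.
node_term = np.prod([
    max(np.prod([m[(c, i)][xi] for c in cliques if i in c]) for xi in (0, 1))
    for i in range(3)
])
clique_term = np.prod([
    max(psi[c][x0, x1] / (m[(c, c[0])][x0] * m[(c, c[1])][x1])
        for x0 in (0, 1) for x1 in (0, 1))
    for c in cliques
])
bound = node_term * clique_term / Z  # upper bounds map_val for any messages
```

Rerunning with different random messages changes `bound` but never pushes it below `map_val`, which is exactly what motivates minimizing the bound over the messages.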

As the right-hand side of this expression can be different for different choices of the messages, we could try to minimize this upper bound over the set of all positive messages. While it appears that we have replaced an NP-hard problem with an even more difficult optimization problem, this turns out not to be the case. Consider the following function obtained by taking the logarithm of the upper bound.

\[U(\log m) \triangleq -\log Z + \sum_{i\in V} \max_{x_i} \left[\sum_{c\in C : i\in c} \log m_{c\rightarrow i}(x_i)\right] + \sum_{c\in C} \max_{x_c}\left[\log {\psi_c(x_c)} - {\sum_{j\in c} \log m_{c\rightarrow j}(x_j)}\right]\]
Theorem 1: $U$ is a convex function of the log messages. That is, $U(\lambda \log m + (1-\lambda) \log m') \leq \lambda U(\log m) + (1-\lambda) U(\log m')$ for any $\lambda \in [0,1]$ and any two different reparameterizations given by $m$ and $m'$.
Proof: The inequality below follows because the maximum of a sum is at most the sum of the maxima.
\begin{align*} U(\lambda \log m + (1-\lambda) \log m') &= -\log Z + \sum_{i\in V} \max_{x_i} \left[\sum_{c\in C : i\in c} \lambda \log m_{c\rightarrow i}(x_i) + (1-\lambda) \log m'_{c\rightarrow i}(x_i)\right]\\ & + \sum_{c\in C} \max_{x_c}\left[\log {\psi_c(x_c)} - {\sum_{j\in c}\left( \lambda \log m_{c\rightarrow j}(x_j) + (1-\lambda) \log m'_{c\rightarrow j}(x_j) \right)}\right]\\ &\leq -\log Z + \sum_{i\in V} \lambda \max_{x_i} \left[\sum_{c\in C : i\in c} \log m_{c\rightarrow i}(x_i)\right] + \sum_{i\in V} (1-\lambda) \max_{x_i} \left[\sum_{c\in C : i\in c} \log m'_{c\rightarrow i}(x_i)\right]\\ & + \sum_{c\in C} \lambda \max_{x_c} \left[\log {\psi_c(x_c)} - {\sum_{j\in c}\log m_{c\rightarrow j}(x_j) }\right] + \sum_{c\in C} (1-\lambda) \max_{x_c} \left[\log {\psi_c(x_c)} - {\sum_{j\in c}\log m'_{c\rightarrow j}(x_j) }\right]\\ & = \lambda U(\log m) + (1-\lambda) U(\log m') \end{align*}
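Convexity can also be spot-checked numerically. This sketch (assuming NumPy; the 3-cycle model, potentials, the two sets of log messages, and $\lambda = 0.3$ are illustrative choices) evaluates $U$ at a convex combination of two log-message vectors and compares it to the convex combination of the values:

```python
import itertools

import numpy as np

rng = np.random.default_rng(3)

# A toy 3-cycle MRF over binary variables with arbitrary pairwise potentials.
cliques = [(0, 1), (1, 2), (0, 2)]
psi = {c: rng.uniform(0.5, 2.0, size=(2, 2)) for c in cliques}
logZ = np.log(sum(np.prod([psi[c][x[c[0]], x[c[1]]] for c in cliques])
                  for x in itertools.product((0, 1), repeat=3)))

def U(logm):
    """The log upper bound as a function of the log messages."""
    val = -logZ
    for i in range(3):  # node terms: max over x_i of the incoming log messages
        val += max(sum(logm[(c, i)][xi] for c in cliques if i in c)
                   for xi in (0, 1))
    for c in cliques:   # clique terms: max over x_c of log psi minus messages
        val += max(np.log(psi[c][x0, x1])
                   - logm[(c, c[0])][x0] - logm[(c, c[1])][x1]
                   for x0 in (0, 1) for x1 in (0, 1))
    return val

lm1 = {(c, i): rng.normal(size=2) for c in cliques for i in c}
lm2 = {(c, i): rng.normal(size=2) for c in cliques for i in c}
lam = 0.3
mix = {k: lam * lm1[k] + (1 - lam) * lm2[k] for k in lm1}

lhs = U(mix)                              # U at the convex combination
rhs = lam * U(lm1) + (1 - lam) * U(lm2)   # convex combination of the values
```

A single numerical check is of course no substitute for the proof above, but it is a useful guard against implementation mistakes when coding up the bound.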

Minimizing a convex function can be done via standard methods from continuous optimization. Gradient descent, one of the simplest of these methods, attempts to find a global optimum of a convex function $f$ by starting at a fixed point and repeatedly following the negative gradient of the function $f$. As the gradient always points in an increasing direction, the negative gradient always points in a decreasing direction. More formally, to minimize a differentiable convex function $f$, gradient descent starts at an initial point $x^{(0)}$ and iterates the following for each $t \in \{1,2,\ldots\}$ \[x^{(t)} = x^{(t-1)} - \gamma_t \nabla f(x^{(t-1)}),\] where $\gamma_t\in\mathbb{R}$ is the step size for the $t^{th}$ iteration. Common choices for the step size include a constant step size $\gamma_t = a$ for all $t > 0$ and some $a > 0$, or a diminishing step size rule such as $\gamma_t = 2/(2+t)$ for all $t > 0$. With the diminishing step size rule, the gradient descent algorithm is guaranteed to converge to a minimum of the convex function $f$. When the function $f$ is convex but not differentiable, the same strategy can be used with the notable exception that instead of the gradient at the point $x$ we substitute a subgradient at the point $x$. A subgradient of a convex function $f$ at a point $x$ is the slope of any tangent line (more generally, any supporting hyperplane) through that point that everywhere underestimates the function. As with the case of gradient descent, subgradient descent with the diminishing step size rule above is guaranteed to converge to a global minimum of $f$.
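The procedure above can be sketched on a small nondifferentiable example. The objective $f(x) = |x - 3| + |x + 1|$, the starting point, and the iteration count below are arbitrary illustrative choices; $f$ attains its minimum value $4$ at every $x \in [-1, 3]$:

```python
# Subgradient descent with the diminishing step size gamma_t = 2/(2+t) on the
# nondifferentiable convex function f(x) = |x - 3| + |x + 1|.

def f(x):
    return abs(x - 3) + abs(x + 1)

def subgrad(x):
    # A subgradient of |x - a| is sign(x - a) away from the kink and 0 at it.
    g = 0.0
    g += 1.0 if x > 3 else (-1.0 if x < 3 else 0.0)
    g += 1.0 if x > -1 else (-1.0 if x < -1 else 0.0)
    return g

x = 10.0
best = f(x)
for t in range(1, 2001):
    x = x - (2.0 / (2.0 + t)) * subgrad(x)  # step along the negative subgradient
    best = min(best, f(x))
# best approaches the minimum value 4
```

Note that, unlike gradient descent, subgradient descent is not a descent method at every step, which is why the best value seen so far is tracked separately.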

In summary, we have approximated the NP-hard MAP problem with a convex minimization problem. In general, the minimum value of the upper bound need not agree with the MAP value. In such a situation, we can only use the upper bound as an estimate of the MAP value and assignment.

Approximate MAP Inference via Linear Programming

An alternative approach to approximating the MAP problem begins with the simple observation that the maximum value of a function is at least as large as its average value. Or, more generally, given a function $f(x)$ and a probability distribution $q(x)$, \begin{eqnarray} \max_{x} f(x) \geq \sum_x q(x)f(x). \end{eqnarray} The previous inequality is tight (i.e., is satisfied with equality) whenever $q(x)$ is a probability distribution that puts all of its mass on some assignment $x^* \in \arg\max_x f(x)$. Combining this with the observation that $\log(x)$ is a monotonically increasing function, we can reformulate the MAP problem as \begin{eqnarray} {\arg \max}_{x} p(x) = {\arg \max_{q\in Q}} \sum_x q(x) \log p(x) \end{eqnarray} where $Q$ is the set of all probability distributions over the random vector $x$.
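Both the inequality and its tightness condition are easy to confirm numerically. In this sketch (assuming NumPy; the 8 function values and the random distribution are illustrative choices), a point mass on the maximizer closes the gap exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

f = rng.uniform(size=8)        # values f(x) over 8 possible assignments
q = rng.uniform(size=8)
q /= q.sum()                   # an arbitrary distribution over assignments

gap = f.max() - (q * f).sum()  # max_x f(x) - E_q[f(x)] is nonnegative

# Equality holds when q puts all of its mass on a maximizing assignment.
q_star = np.zeros(8)
q_star[f.argmax()] = 1.0
tight_gap = f.max() - (q_star * f).sum()  # exactly zero
```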

Now, suppose that $p$ is a probability distribution that factorizes over the set of cliques $C$ of some graph $G = (V, E)$. Applying the above argument, the MAP problem can be expressed as follows. \begin{align*} {\arg \max}_{x} \log p(x) &= {\arg \max_{q\in Q}} \sum_x q(x) \log p(x)\\ & = {\arg \max_{q\in Q}} \sum_x q(x) \log \left(\frac{1}{Z}\prod_{c\in C} \psi_c(x_c)\right)\\ & = {\arg \max_{q\in Q}} \left[-\log Z + \sum_x \sum_{c\in C} q(x) \log \psi_c(x_c)\right]\\ & = {\arg \max_{q\in Q}} \sum_x \sum_{c\in C} q(x) \log \psi_c(x_c)\\ & = {\arg \max_{q\in Q}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c) \end{align*} where $q_c(x_c) \equiv \sum_{x':x'_c = x_c} q(x')$ is the marginal distribution of $q$ over the variables in the clique $c\in C$.

At first glance, the new optimization problem \begin{eqnarray} {\arg \max_{q\in Q}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c) \end{eqnarray} appears to be more difficult than the original MAP optimization problem as, instead of optimizing over all possible assignments to the random vector $x$, we must now optimize over $Q$, the set of all possible probability distributions over $x$. As we observed earlier, we do not need to optimize over the entire set $Q$: it would suffice to optimize over only those distributions that place all of their mass on a single assignment. If $x\in \{1,\ldots,d\}^n$, then there are $d^n$ possible distributions with this property. We also observe that the optimization problem only requires certain marginals of the distribution $q$ and not the entire joint distribution. Every marginal distribution that arises from some joint distribution $q$ must satisfy several constraints. First, the marginal distributions must agree on their overlap. That is, if $i \in c$ and $i \in c'$, then \begin{align*} \sum_{x'_c:x'_i = x_i} q_c(x'_c) = q_i(x_i) = \sum_{x'_{c'}:x'_i = x_i} q_{c'}(x'_{c'}). \end{align*} Second, to be a probability distribution, each marginal $q_i$ must sum to one. \begin{align*} \sum_{x_i} q_i(x_i) = 1 \end{align*}

The marginal polytope, $M$, is defined to be the convex hull of the collections of marginals $(q_{i\in V}, q_{c\in C})$ arising from joint distributions that place all of their mass on a single assignment. More formally, the collection of marginals $(q_{i\in V}, q_{c\in C})$ are the marginals of some joint probability distribution that places all of its mass on a single assignment if and only if \begin{align*} &\text{For all }c\in C, i\in c, x_i\in\{1,\ldots,d\},& \sum_{x'_c:x'_i = x_i} q_c(x'_c) = q_i(x_i)\\ &\text{For all }i\in V,& \sum_{x_i} q_i(x_i) = 1\\ &\text{For all }i\in V, x_i\in\{1,\ldots,d\},& q_i(x_i)\in \{0,1\}\\ &\text{For all }c\in C, x_c\in\{1,\ldots,d\}^{|c|},& q_c(x_c)\in \{0,1\}. \end{align*}

Theorem: $(q_{i\in V}, q_{c\in C})$ satisfy the constraints if and only if there exists a joint probability distribution $q'\in Q$ that places all of its mass on a single assignment and whose marginals are given by $(q_{i\in V}, q_{c\in C})$.

The proof of the theorem is relatively straightforward. First, given some $(q_{i\in V}, q_{c\in C})$ satisfying the constraints, we can construct $q'\in Q$ as follows. For each $i$, choose $y_i \in \arg \max_{x_i} q_i(x_i)$. As $q_i$ is maximized at the unique choice of $x_i \in \{1,\dots, d\}$ such that $q_i(x_i) = 1$, $y_i$ is uniquely determined. Let $q'$ be the probability distribution that places all of its mass on the assignment $y$. We can easily verify that the marginals of $q'$ are exactly those given by $q_{i\in V}$ and $q_{c\in C}$.

For the other direction, given a $q' \in Q$ that places all of its mass on a single assignment, we select $q_i(x_i) = q'_i(x_i)$ for all $i\in V$ and all $x_i\in\{1,\ldots,d\}$ and $q_c(x_c) = q'_c(x_c)$ for all $c\in C, x_c\in\{1,\ldots,d\}^{|c|}$. This choice of marginals satisfies the constraints by construction.

The above argument allows us to reformulate the MAP problem as \begin{eqnarray} {\arg \max_{q\in Q}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c) = {\arg \max_{(q_{i\in V}, q_{c\in C}) \in M}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c). \end{eqnarray} For simplicity we often write this as ${\arg \max_{q \in M}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c)$.

The Local Marginal Polytope

Finally, we have all of the ingredients that we need in order to construct an approximate version of the MAP optimization problem. The optimization problem \begin{eqnarray} {\arg \max_{(q_{i\in V}, q_{c\in C}) \in M}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c) \end{eqnarray} is sometimes referred to as an integer programming problem. Integer programming problems are constrained optimization problems in which you are asked to maximize a linear function over vectors of integers subject to a series of linear constraints. Integer programming problems are in general NP-hard. However, linear programming problems, in which you are asked to maximize a linear function over vectors of real numbers subject to a series of linear constraints, are known to be solvable in polynomial time. We can always obtain a linear programming problem from an integer programming problem by relaxing the integrality constraint. Specifically, consider the local marginal polytope, $L$, which is obtained by relaxing the integrality constraint in the marginal polytope. A collection of marginals satisfies $(q_{i\in V}, q_{c\in C}) \in L$ if \begin{align*} &\text{For all }c\in C, i\in c, x_i\in\{1,\ldots,d\},& \sum_{x'_c:x'_i = x_i} q_c(x'_c) = q_i(x_i)\\ &\text{For all }i\in V,& \sum_{x_i} q_i(x_i) = 1\\ &\text{For all }i\in V, x_i\in\{1,\ldots,d\},& q_i(x_i)\in [0,1]\\ &\text{For all }c\in C, x_c\in\{1,\ldots,d\}^{|c|},& q_c(x_c)\in [0,1]. \end{align*}
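To see the relaxation in action, the following sketch solves the local-polytope LP for a small model (assuming SciPy is available; the 3-cycle graph and the antiferromagnetic potentials, where $\log \psi_c$ is $1$ when the endpoints disagree and $0$ otherwise, are illustrative choices) and compares the LP value to the exact MAP value:

```python
import itertools

import numpy as np
from scipy.optimize import linprog

# Antiferromagnetic pairwise potentials on a 3-cycle over binary variables.
edges = [(0, 1), (1, 2), (0, 2)]
logpsi = {e: np.array([[0.0, 1.0], [1.0, 0.0]]) for e in edges}

def node_var(i, x):            # column index of q_i(x)
    return 2 * i + x

def edge_var(e_idx, x0, x1):   # column index of q_c(x0, x1)
    return 6 + 4 * e_idx + 2 * x0 + x1

n_vars = 18
c = np.zeros(n_vars)           # linprog minimizes, so negate the objective
for e_idx, e in enumerate(edges):
    for x0 in (0, 1):
        for x1 in (0, 1):
            c[edge_var(e_idx, x0, x1)] = -logpsi[e][x0, x1]

A_eq, b_eq = [], []
for i in range(3):             # each node marginal sums to one
    row = np.zeros(n_vars)
    row[node_var(i, 0)] = row[node_var(i, 1)] = 1.0
    A_eq.append(row)
    b_eq.append(1.0)
for e_idx, (i, j) in enumerate(edges):  # clique marginals agree with node marginals
    for xi in (0, 1):
        row = np.zeros(n_vars)
        for xj in (0, 1):
            row[edge_var(e_idx, xi, xj)] = 1.0
        row[node_var(i, xi)] = -1.0
        A_eq.append(row)
        b_eq.append(0.0)
    for xj in (0, 1):
        row = np.zeros(n_vars)
        for xi in (0, 1):
            row[edge_var(e_idx, xi, xj)] = 1.0
        row[node_var(j, xj)] = -1.0
        A_eq.append(row)
        b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0.0, 1.0)] * n_vars)
lp_value = -res.fun

# The exact MAP value of sum_c log psi_c over integer assignments.
map_value = max(sum(logpsi[e][x[e[0]], x[e[1]]] for e in edges)
                for x in itertools.product((0, 1), repeat=3))
```

On an odd cycle at most two of the three edges can disagree, so the integer optimum is $2$, while the fractional solution $q_i = (1/2, 1/2)$ with each edge marginal split evenly between the two disagreeing configurations achieves $3$: the relaxation is a strict upper bound here.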

Notice that the only difference between $M$ and $L$ is in the final two constraints. As $M \subseteq L$, we must have that \begin{eqnarray} {\max_{(q_{i\in V}, q_{c\in C}) \in M}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c) \leq {\max_{(q_{i\in V}, q_{c\in C}) \in L}} \sum_{c\in C} \sum_{x_c} q_c(x_c) \log \psi_c(x_c). \end{eqnarray} Sometimes these two optimization problems are equivalent (for example, the relaxation is tight whenever the graph is a tree), though this is typically not the case. However, we are often willing to settle for an upper bound on the MAP problem in practice, as linear programming problems can be solved in polynomial time while the exact MAP inference problem does not typically admit a polynomial time solution for arbitrary MRFs.
