On statistical learning in graphical models, and how to make your own [this is a work in progress...]
I’ve been doing a lot of work on graphical models, and at this point if I don’t write it down I’m gonna blow. I just recently came up with an idea for a model that might work for natural language processing, so I thought that for this post I’ll go through the development of that model, and see if it works. Part of the reason why I’m writing this is that the model isn’t working right now (it just shrieks “@@@@@@@@@@”). That means I have to do some math, and if I’m going to do that why not write about it.
So, the goal will be to develop what the probabilistic people in natural language processing call a “language model” – don’t ask me why. This just means a probability distribution on strings (i.e. lists of letters and punctuation and numbers and so on). It’s what you have in your head that makes you like “hi how are you” more than “1$…aaaXD”. Actually I like the second one, but the point is we’re trying to make a machine that doesn’t. Anyway, once you have a language model, you can do lots of things, which hopefully we’ll get to if this particular model works. (I don’t know if it will.)
The idea here is that if we go through the theory and then apply it to develop a new model, then maybe You Can Too TM. In a move that is pretty typical for me, I misjudged the scope of the article that I was going for, and I ended up laying out a lot of theory around graphical models and Boltzmann machines. It’s a lot, so feel free to skip things. The actual new model is developed at the end of the post using the theory.
Graphical models and Gibbs measures
If we are going to develop a language model, then we are going to have to build a probability distribution over, say, strings of length 200. Say that we only allowed lowercase letters and maybe some punctuation in our strings, so that our whole alphabet has size around 30. Then already our state space has size $30^{200}$. In general, this huge state space will be very hard to explore, and ordinary strategies for sampling from or studying our probability distribution (like rejection or importance sampling) will not work at all. But, what if we knew something nice about the structure of the interdependence of our 200 variables? For instance, a reasonable assumption for a language model to make is that letters which are very far away from each other have very little correlation – $n$-gram models use this assumption and let that distance be $n$.
It would be nice to find a way to express that sort of structure in such a way that it could be exploited. Graphical models formalize this in an abstract setting by using a graph to encode the interdependence of the variables, and it does so in such a way that statistical ideas like marginal distributions and independence can be expressed neatly in the language of graphs.
For an example, let $V=\{1,\dots,n\}$ index some random variables $X_V=(X_1,\dots,X_n)$ with joint probability mass function $p(x_V)$. What would it mean for these variables to respect a graph like, say if $n=5$,
In other words, we’d like to say something about the pmf $p$ that would guarantee that observing $X_5$ doesn’t tell us anything interesting about $X_1$ if we already knew $X_4$. Well,
Def 0. (Gibbs random field, pmf/pdf version) Let $\mathcal{C}$ be the set of cliques in a graph $G=(V,E)$. Then we say that a collection of random variables $X_V$ with joint probability mass function or probability density function $p$ is a Gibbs random field with respect to $G$ if $p$ is of the form $$p(x_V) = \frac{1}{Z_\beta} e^{-\beta H(x_V)},$$ given that the energy $H$ factorizes into clique potentials: $$H(x_V)=\sum_{C\in\mathcal{C}} V_C(x_C).$$
Here, if $A\subset V$, we use the notation $x_A$ to indicate the vector $x_A=(x_i;i\in A)$.
For completeness, let’s record the measure-theoretic definition. We won’t be using it here so feel free to skip it, but it can be helpful to have it when your graphical model mixes discrete variables with continuous ones.
Def 1. (Gibbs random field) For the same $G$, we say that a collection of random variables $X_V\thicksim\mu$ is a Gibbs random field with respect to $G$ if $\mu$ can be written $$\mu(dx_V)=\frac{\mu_0(dx_V)}{Z_\beta} e^{-\beta H(x_V)},$$ for some energy $H$ that factorizes over the cliques as above.
Here, $\mu_0$ is the “base” or “prior” measure that appears in the constrained maximum-entropy derivation of the Gibbs measure (notice, this is the same as in exponential families -- it’s the measure that $X_V$ would obey if there were no constraint on the energy statistic), but we won’t care about it here, since it’ll just be Lebesgue measure or counting measure on whatever our state space is. Also, $Z$ is just a normalizing constant, and $\beta$ is the “inverse temperature,” which acts like an inverse variance parameter. See footnote:Klenke 538 for more info and a large deviations derivation.
There are two main reasons to care about Gibbs random fields. First, the measures that they obey (Gibbs or Boltzmann distributions) show up in statistical physics: under reasonable assumptions, physical systems that have a notion of potential energy will have the statistics of this distribution. For more details and a plain-language derivation, I like Terry Tao’s post footnote:https://terrytao.wordpress.com/2007/08/20/math-doesnt-suck-and-the-chayes-mckellar-winn-theorem/.
Second, and more to the point here, they have a lot of nice properties from a statistical point of view. For one, they have the following nice conditional independence property:
Def 2a. (Global Markov property) We say that $X_V$ has the global Markov property on $G$ if for any $A,B,S\subset V$ so that $S$ separates $A$ and $B$ (i.e., for any path from $A$ to $B$ in $G$, the path must pass through $S$), $X_A\perp X_B\mid X_S$, or in other words, $X_A$ is conditionally independent of $X_B$ given $X_S$.
Using the graphical example from above, for instance, we see what happens if we condition on $X_4=x_4$:
This is what happens in Def 2a. when $S=\{4\}$. Since the node 4 separates the graph into partitions $A=\{1,2,3\}$ and $B=\{5\}$, we can say that $X_1,X_2,X_3$ are independent of $X_5$ given $X_4$, or in symbols $X_1,X_2,X_3\perp X_5\mid X_4$.
The global Markov property directly implies a local property:
Def 2b. (Local Markov property) We say that $X_V$ has the local Markov property on $G$ if for any node $i\in V$, $X_i$ is conditionally independent from the rest of the graph given its neighbors.
In a picture,
Here, we are expressing that $X_2\perp X_3,X_5\mid X_1,X_4$, or in words, $X_2$ is conditionally independent from $X_3$ and $X_5$ given its neighbors $X_1$ and $X_4$. I hope that these figures (footnote:tikz-cd) have given a feel for how easy it is to understand the dependence structure of random variables that respect a simple graph.
It’s not so hard to check that a Gibbs measure satisfies these properties on its graph. If $X_V\thicksim p$ where $p$ is a Gibbs density/pmf, then $p$ factorizes into clique factors: $$p(x_V)=\frac{1}{Z_\beta}\prod_{C\in\mathcal{C}} \psi_C(x_C),\qquad \psi_C(x_C)=e^{-\beta V_C(x_C)}.$$ This factorization means that $X_V$ is a “Markov random field” in addition to being a Gibbs random field, and the Markov properties follow directly from this factorization footnote:mrfwiki. The details of the equivalence between Markov random fields and Gibbs random fields are given in the Hammersley-Clifford theorem footnote:HCThm.
This means that in situations where $V$ is large and $X_V$ is hard to sample directly, there is still a nice way to sample from the distribution, and this method becomes easier when $G$ is sparse.
The Gibbs sampler
The Gibbs sampler is a simpler alternative to methods like Metropolis-Hastings, first named in an amazing paper by Stuart and Donald Geman footnote:thatpaper. It’s best explained algorithmically. So, say that you had a pair of variables $X,Y$ whose joint distribution $p(x,y)$ is unknown or hard to sample from, but where the conditional distributions $p(x\mid y)$ and $p(y\mid x)$ are easy to sample. (This might sound contrived, but the language model we’re working towards absolutely fits into this category, so we will be using this a lot.) How can we sample faithfully from $p(x,y)$ using these conditional distributions?
Well, consider the following scheme. For the first step, given some initialization $x_0$ for $X$, sample $y_0\thicksim p(\cdot\mid x_0)$. Then at the $i$th step, sample $x_{i}\thicksim p(\cdot\mid y_{i-1})$, and then sample $y_{i}\thicksim p(\cdot \mid x_i)$. The claim is that the Markov chain $(X_i,Y_i)$ will approximate samples from $p(x,y)$ as $i$ grows.
It is easy to see that $p$ is an invariant distribution for this Markov chain: indeed, if it were the case that $x_0,y_0$ were samples from $p(X,Y)$, then if $X_1\thicksim p(X\mid Y=y_0)$, clearly $X_1,Y_0\thicksim p(X\mid Y)p(Y)$, since $Y_0$ on its own must be distributed according to its marginal $p(Y)$. By the same logic, $X_1$ is distributed according to its marginal $p(X)$, so that if $Y_1$ is chosen according to $p(Y\mid X=x_1)$, then the pair $X_1,Y_1\thicksim p(Y\mid X)p(X)=p(X,Y)$.
The proof that this invariant distribution is the limiting distribution of the Markov chain is more involved and can be found in the Geman paper, but to me this is the main intuition.
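To make the scheme concrete, here is a minimal sketch of the two-variable Gibbs sampler in NumPy (a toy of my own, not anything from footnote:thatpaper), run on a case where the conditionals are known exactly: a standard bivariate Gaussian with correlation $\rho$, for which $X\mid Y=y\thicksim N(\rho y,\,1-\rho^2)$ and symmetrically for $Y\mid X=x$.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho=0.8, n_steps=10_000, x0=0.0, seed=0):
    """Gibbs-sample a standard bivariate Gaussian with correlation rho.

    The conditionals are X | Y=y ~ N(rho*y, 1-rho^2) and Y | X=x ~ N(rho*x, 1-rho^2),
    which is exactly the "easy conditionals, awkward joint" situation in the text.
    """
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho**2)
    samples = np.empty((n_steps, 2))
    x = x0
    y = rng.normal(rho * x, sd)          # y_0 ~ p(. | x_0)
    for i in range(n_steps):
        x = rng.normal(rho * y, sd)      # x_i ~ p(. | y_{i-1})
        y = rng.normal(rho * x, sd)      # y_i ~ p(. | x_i)
        samples[i] = (x, y)
    return samples

if __name__ == "__main__":
    s = gibbs_bivariate_gaussian()
    # after a burn-in, the empirical correlation should be close to rho = 0.8
    print(np.corrcoef(s[1000:].T)[0, 1])
```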
This sampler is especially useful in the context of graphical models, and more so when the graph has some nice regularity. The Gemans make great use of this to create a parallel algorithm for sampling from their model, which would otherwise be very hard to study. For a simpler example of the utility, consider a lattice model (you might think of an Ising model), i.e. say that $G$ is a piece of a lattice: take $V=\{1,\dots,n\}\times\{1,\dots,n\}$ and let $E=\{((h,i),(j,k)):\lvert h-j\rvert + \lvert i-k\rvert=1\}$ be the nearest-neighbor edge system on $V$. Say that $X_V$ form a Gibbs random field with respect to that lattice (here we let $n=4$):
Even the relatively simple Gibbs distribution of the free Ising model, $$p(x_V)=\frac{1}{Z_\beta}\exp\left( \beta\, {\textstyle\sum_{(u,v)\in E} x_u x_v }\right)$$ is hard to sample. But in general, any Gibbs random field on this graph can be easily sampled with the Gibbs sampler: if $N_u$ denotes the neighbors of the node $u=(i,j)$, then to sample from $p$, we can proceed as follows: let $x_V^0$ be a random initialization. Then, looping through each site $u\in V$, sample the variable $X_u\thicksim p(X_u\mid X_{N_u})$, making sure to use the new samples where available and the initialization elsewhere. Then let $x_V^1$ be the result of this first round of sampling, and repeat. This algorithm is “local” in the sense that each node only needs to care about its neighbors, so the loop can be parallelized simply (as long as you take care not to sample two neighboring nodes at the same time).
Later, we will be dealing with bipartite graphs with lots of nice structure that make this sampling even easier from a computational point of view.
The Ising model
One important domain where graphical models have been used is to study what are known as “spin glasses.” Don’t ask me why they are called glasses, I don’t know, maybe it’s a metaphor. The word “spin” shows up because these are models of the emergence of magnetism, and magnetism is what happens when all of the little particles in a chunk of material align their spins. Spin is probably also a metaphor.
In real magnets, these spins can be oriented in any direction in God’s green three-dimensional space. But to simplify things, some physicists study the case in which each particle’s spin can only point “up” or “down” -- this is already hard to deal with. The spins are then represented by Bernoulli random variables, and in the simplest model (the Ising model footnote:ising), a joint pmf is chosen that respects the “lattice” or “grid” graph in arbitrary dimension. The grid is meant to mimic a crystalline structure, and keeps things simple. Since each site depends only on its neighbors in the graph, Gibbs sampling can be done quickly and in parallel to study this system, which turns out to yield some interesting physical results about phase changes in magnets.
Now, this is an article about applications of graphical models to natural language processing, so I would not blame you for feeling like we are whacking around in the weeds. But this is not true. We are not whacking around in the weeds. The Ising model is the opposite of the weeds. It is a natural language processing algorithm, and that’s not a metaphor, or at least it won’t be when we get done with it. Yes indeed. We’re going to go into more detail about the Ising model. That’s right. I am going to give you the particulars.
Let’s take a look at the Ising pmf, working in two dimensions for a nice mix of simplicity and interesting behavior. First, let’s name our spin: Let $z_V$ be a matrix of Bernoulli random variables $z_{i,j}\in\{0,1\}$. Here, we’re letting $z_{i,j}=0$ represent a spin pointing down at position $(i,j)$ in the lattice -- a 1 means spin up, and $V$ indicates the nodes in our grid.
Now, how should spins interact? Well, by the local Markov property, we know that $$p(z_{i,j}\mid z_{V\setminus\{(i,j)\}})=p(z_{i,j}\mid z_{N_{i,j}}),$$ where $$N_{i,j}=\{(i',j')\mid \lvert i'-i\rvert + \lvert j' - j\rvert =1 \}$$ is the set of neighbors of the node at position $(i,j)$. In other words, this node’s interaction with the material as a whole is mediated through its direct neighbors in the lattice. And, since this is a magnet, the idea is to choose this conditional distribution such that $z_{i,j}$ is likely to align its spin with those of its neighbors. Indeed, following our Gibbs-measure style, let $$p(z_{i,j}\mid z_{N_{i,j}})=\frac{1}{Z_{i,j}}\exp(-\beta H_{i,j}).$$ Here $H_{i,j}$ is the “local” magnetic potential at our site $i,j$, $$H_{i,j}=-z_{i,j}\sum_{\alpha\in N_{i,j}} z_\alpha,$$ and $Z_{i,j}$ is the normalizing constant for this conditional distribution.
So, does this conditional distribution have the property that $z_{i,j}$ will try to synchronize with its neighbors? Well, let’s compute the conditional probabilities that $z_{i,j}$ is up or down as a function of its neighbors (absorbing the inverse temperature into the energy for now, i.e. taking $\beta=1$). \begin{align*} p(z_{i,j}=1\mid z_{N_{i,j}}) &= \frac{1}{Z_{i,j}} \exp\left({\textstyle \sum_{\alpha\in N_{i,j}} z_\alpha}\right)\\\\ p(z_{i,j}=0\mid z_{N_{i,j}}) &= \frac{1}{Z_{i,j}}e^0=\frac{1}{Z_{i,j}}. \end{align*} Since these two probabilities sum to one, we have $Z_{i,j}=1 + \exp\left({\textstyle \sum_{\alpha\in N_{i,j}} z_\alpha}\right)$. In other words, $$p(z_{i,j}=1\mid z_{N_{i,j}})=\sigma\left({\textstyle \sum_{\alpha\in N_{i,j}} z_\alpha}\right),$$ where $\sigma$ is the sigmoid function $\sigma(x)=e^{x}/(e^{x}+1)$ that often appears in Bernoulli pmfs. This function is increasing, which indicates that local synchrony is encouraged. I should also note that Ising is usually done with $\text{down}=-1$ instead of 0, which is even better for local synchrony.
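And here is a sketch (my own illustration, nothing from footnote:thatpaper) of what one Gibbs sweep of this model can look like in code, with $\beta$ kept explicit, free boundary conditions, the $\{0,1\}$ spin convention, and the checkerboard update order that keeps neighbors from being resampled at the same time:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep_ising(z, beta, rng):
    """One Gibbs sweep of the free {0,1}-Ising model on an n-by-n grid.

    Uses p(z_ij = 1 | neighbors) = sigmoid(beta * sum of neighboring spins).
    Sites are updated in two checkerboard half-sweeps, so no two neighbors are
    ever resampled simultaneously and each half-sweep is embarrassingly parallel.
    """
    n = z.shape[0]
    ii, jj = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    for parity in (0, 1):
        mask = (ii + jj) % 2 == parity
        # sum of the (up to 4) neighboring spins, with free boundary conditions
        nbr = np.zeros(z.shape)
        nbr[1:, :] += z[:-1, :]
        nbr[:-1, :] += z[1:, :]
        nbr[:, 1:] += z[:, :-1]
        nbr[:, :-1] += z[:, 1:]
        p_up = sigmoid(beta * nbr)
        z[mask] = (rng.random(z.shape) < p_up)[mask]
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = (rng.random((32, 32)) < 0.5).astype(np.int8)
    for _ in range(200):
        z = gibbs_sweep_ising(z, beta=1.0, rng=rng)
    # with {0,1} spins, large beta drags the mean toward 1; beta near 0 keeps it near 0.5
    print(z.mean())
```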
I thought that giving the local conditional distributions would be a nice place to start for getting some intuition on the model -- in particular, these tell you how Gibbs sampling would evolve. Since this distribution can be sampled by a long Gibbs sampling run, we have sort of developed an intuition that the samples from an Ising model should have some nice synchrony properties with high probability.
But, can we find a joint pmf for $z_V$ that yields these conditional distributions? Indeed, if we define the energy $$H(z_V)=-\frac{1}{2}\sum_{\alpha\in V}\sum_{\gamma\in N_\alpha} z_\alpha z_\gamma$$ (the factor of $\frac{1}{2}$ corrects for each neighboring pair being counted twice, and we use $\gamma$ for the second index so it doesn’t collide with the inverse temperature), and from here the pmf $$p(z_V)=\frac{1}{Z}e^{-\beta H(z_V)},$$ then we can quickly check that this is a Gibbs random field on the lattice graph with conditional pmfs as above.
For something with such a simple description, the Ising model has a really amazing complexity. Some of the main results involve qualitative differences in the model’s behavior for different values of the inverse temperature $\beta$. For low $\beta$ (i.e., high temperature, high variance), the model is close to uniformly random. For large $\beta$, the model concentrates on its modes (i.e., the minima of $H$). It turns out that there is a critical temperature somewhere in the middle, where the material undergoes a phase change from disorder to magnetic order. There are also theorems about annealing, one of which can be found in a slightly more general setting in footnote:geman.
The complexity of these models indicates that they are good at handling information, especially when some form of synchrony is useful as an inductive bias. So, we’ll now start to tweak the model so that it can learn to model something other than a chunk of magnetic material.
Statistical learning in Gibbs random fields
Inspired by this and some related work in machine learning footnote:history, Ackley, Hinton, and Sejnowski footnote:ackley generalized the Ising model to one that can fit its pmf to match a dataset, which they called a Boltzmann machine. It turns out that their results can be extended to work for general Gibbs measures, so I will present the important results in that context, but we will stick to Ising/Boltzmann machines in the development of the theory, and as an important special case. Then the language model we’d like to develop will appear as a sort of “flavor” of Boltzmann machine, and we’ll already have the theory in place regarding how it should learn and be sampled and all that stuff.
So, here is my sort of imagined derivation of the Boltzmann machine from the Ising model. The units in the Ising model are hooked together like the atoms in a crystal. How can they be hooked together like the neurons in a brain? Well, first, let’s enumerate the sites in the Ising model’s lattice, so that the vertices are now named $V=\{1,\dots,n\}$ for some $n$, instead of being 2-d indices. Then the lattice graph has some $n\times n$ adjacency matrix $W$, where $W_{\alpha\beta}=1$ if and only if $\alpha$ and $\beta$ represent neighbors in the original lattice. So, there should be exactly four 1s along each row for interior sites (and fewer at the boundary), etc.
Let’s rewrite the Ising model’s distribution in terms of $W$. Indeed, we can see that $$H(z_V)=-\frac{1}{2} z_V W z_V^T,$$ so that the energy is just the quadratic form on $\mathbb{R}^n$ induced by $W$ (the $\frac{1}{2}$ is the same double-counting correction as before). Then if we fix the notation $$x_\alpha = [W z_V^T]_\alpha,$$ we can rewrite $H_\alpha=-z_\alpha x_\alpha$, and simplify our conditional distribution to $$p(z_\alpha=1\mid z_{V\setminus\{\alpha\}})=\sigma(x_\alpha).$$ Not bad.
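Just as a sanity check on this rewrite (again my own toy, not from any reference), we can build the lattice adjacency matrix $W$ and confirm numerically that the quadratic form $-\tfrac{1}{2} z W z^T$ agrees with the edge-sum energy:

```python
import numpy as np

def lattice_adjacency(n):
    """Adjacency matrix W of the n-by-n grid, with sites enumerated row-major."""
    W = np.zeros((n * n, n * n))
    idx = lambda i, j: i * n + j
    for i in range(n):
        for j in range(n):
            if i + 1 < n:
                W[idx(i, j), idx(i + 1, j)] = W[idx(i + 1, j), idx(i, j)] = 1
            if j + 1 < n:
                W[idx(i, j), idx(i, j + 1)] = W[idx(i, j + 1), idx(i, j)] = 1
    return W

n = 4
W = lattice_adjacency(n)
print(W.sum(axis=0))   # interior sites have degree 4, edge sites 3, corners 2

rng = np.random.default_rng(0)
z = (rng.random(n * n) < 0.5).astype(float)

# energy as the quadratic form -1/2 z W z^T ...
H_quad = -0.5 * z @ W @ z
# ... equals the sum over edges of -z_u z_v
H_edges = -sum(z[u] * z[v] for u in range(n * n) for v in range(u + 1, n * n) if W[u, v])
assert np.isclose(H_quad, H_edges)
print(H_quad)
```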
Now, this is only an Ising model as long as $W$ remains fixed. But what if we consider $W$ as a parameter to our distribution $p(z_V; W)$, and try to fit $W$ to match some data? For example, say that we want to learn the MNIST dataset of handwritten digits, which is a dataset of images with 784 pixels. Then we’d like to learn some $W$ with $V=\{1,\dots,784\}$ such that samples from the distribution look like handwritten digits.
This can be done, but it should be noted that sometimes it helps to add some extra “hidden” nodes $z_H$ to our collection of “visible” nodes $z_V$, for $H=\{785,\dots,784 + n_H\}$, so that the random vector grows to $z=(z_1,\dots,z_{784},z_{785},\dots,z_{784+n_H})$. Let’s also grow $W$ to be an $n\times n$ matrix for $n=n_V+n_H$, where in this example the number of visible units is $n_V=784$. Then the Hamiltonian becomes $H(z)=-\frac{1}{2}z W z^T$ just like above, and similarly for $x_\alpha$ and so on.
Adding more units gives the pmf more degrees of freedom, and now the idea is to have the marginal pmf for the visible units $p(z_V;W)$ fit the dataset, and let the “hidden” units $z_H$ do whatever they have to do to make that happen.
Ackley, Hinton, and Sejnowski came up with a learning algorithm for these models; its modern refinement, which estimates the gradient with short Gibbs-sampling chains, goes by the very Hintony name “contrastive divergence” (footnote:cdpapers). I’d like to first present and prove the result in the most general setting, and then we’ll discuss what that means for training a Boltzmann machine.
The most general version of contrastive divergence that I think it’s useful to consider is for $z$ (now not necessarily a binary vector) to be Gibbs distributed according to some parameters $\theta$, with distribution $$p(z;\theta)\,dz=\frac{1}{Z}\exp(-\beta H(z;\theta))\,dz.$$ Here, since we are trying to be general, we’re letting $z$ be absolutely continuous against some base measure $dz$, and $p$ is the density there. But it should help to keep in mind the special case where $p$ is just a pmf, like in the Boltzmann machine, where $z$ is a binary vector $z\in\{0,1\}^{n}$.
First, we need a quick result about the form taken by marginal distributions in Gibbs random fields, since we want to talk about the marginal over the visible units.
Def 3. (Free energy) Let $A$ be a set of vertices in the graph, and write $z_{\setminus A}$ for the units outside of $A$. Then the “free energy” $F_A(z_A)$ of the units $z_A$ is given by$$F_A(z_A)=-\log\int_{z_{\setminus A}} e^{-\beta H(z_A,z_{\setminus A})}\,dz_{\setminus A}.$$
If it seems weird to you to be taking an integral in the case of binary units, you can just think of that integral $\int\cdot \,dz_{\setminus A}$ as a sum $\sum_{z_{\setminus A}}$ over all possible values of the marginalized-out units (for us, $A=V$ and those are the hidden units) -- this is because in the Boltzmann case, the base measure $dz$ is just counting measure.
It follows directly from this definition that:
Lemma 4. (Marginals in Gibbs distributions) For $A$ as before, if $Z_A$ is a normalizing constant, we have $$p(z_A)=\frac{1}{Z_A} e^{-F_A(z_A)}.$$
So, the free energy takes the role of the Hamiltonian/energy $H$ when we talk about marginals. We can quickly check Lemma 4 by plugging in: \begin{align*} Z_A p(z_A) &= e^{-F_A(z_A)}\\\\ &= \exp\left( \log\left( \int_{z_{\setminus A}} e^{-\beta H(z_A,z_{\setminus A})} \,dz_{\setminus A} \right) \right)\\\\ &= \int_{z_{\setminus A}} e^{-\beta H(z_A,z_{\setminus A})} \,dz_{\setminus A}, \end{align*} which agrees with the standard definition of the marginal. Maybe it should also be noted that $F_V$ is implicitly a function of $\beta$ -- we are marginalizing the distribution $p_\beta(z)$, and my notation has let $\beta$ get a little lost, since it’s not so important right now.
Contrastive divergence will work with the free energy for the visible units $F_V$. And guess what, I think we’re ready to state it. It’s not really a theorem, but more of a bag of practical results that combine to make a learning algorithm, some of which deserve proof and others of which are well-known.
“Theorem”/Algorithm 5. (Contrastive divergence) Say that $\mathcal{D}$ is the distribution of the dataset, and consider the maximum-likelihood problem of finding $$\theta^*={\arg\max}_{\theta} \mathbb{E}_{z_V^+\thicksim \mathcal{D}}[\log p(z_V^+;\theta)].$$In other words, we’d like to find $\theta$ that maximizes the model’s expected log likelihood of the visible units, assuming that the visible units are distributed according to the data. We note that this is the same as finding the minimizer of the Kullback-Leibler divergence:$$D_{KL}(\mathcal{D}\mathbin{\|} p(z_V;\theta))=\mathbb{E}^+[\log\mathcal{D}(z_V^+)] - \mathbb{E}^+[\log p(z_V^+;\theta)],$$since $\mathcal{D}$ does not depend on $\theta$. (In general, we’ll use superscript $+$ to indicate that a quantity is distributed according to the data, and $-$ to indicate samples from the model, and similarly $\mathbb{E}^{\pm}$ to indicate expectations taken against the data distribution or the model’s distribution.)
To find $\theta^*$, we should not hesitate to pull gradient ascent out of our toolbox. And luckily, we can write down a formula for the gradient:
Main Result 5a. For the gradient ascent, we claim that $$\frac{\partial}{\partial\theta} \mathbb{E}_{z_V^+\thicksim \mathcal{D}}[\log p(z_V^+;\theta)] = -\mathbb{E}_{z_V^+\thicksim \mathcal{D}}\left[\tfrac{d F_V(z_V^+)}{d\theta}\right] + \mathbb{E}_{z_V^-\thicksim p(z_V;\theta)}\left[ \tfrac{d F_V(z_V^-)}{d\theta}\right].$$
All that remains is to estimate these expectations. I say “estimate” since we probably can’t compute expectations against $\mathcal{D}$ (we might not even know what $\mathcal{D}$ is), and since we can’t even sample directly from $p(z_V)$. In practice, one uses a single sample $z_V^+\thicksim\mathcal{D}$ (or a small minibatch) to estimate the first expectation Monte-Carlo style, and then uses a Gibbs sampling Markov chain starting from $z_V^+$ to try to get a sample $z_V^-$ from the model’s distribution, which is then used to estimate the second expectation. So, each step of gradient ascent requires a little bit of MCMC.
What requires proof is the main result in the middle there. See footnote:ackley for the proof in the special case of a Boltzmann machine, but in general the proof is not hard. Below, for brevity we fix the notation $\mathbb{E}^+$ for expectations when $z$ comes from the data, and $\mathbb{E}^-$ when $z$ comes from the model.
Proof 5a. We just have to compute the derivative of the expected log likelihood. First, let’s expand the expression that we are differentiating:\begin{align*}\mathbb{E}^+[\log p(z_V^+;\theta)]&=\mathbb{E}^+[\log(\exp(-F_V(z_V^+))) - \log Z_V]\\\\ &=-\mathbb{E}^+[F_V(z_V^+)] - \log Z_V.\end{align*}Here, we have removed the constant $\log Z_V$ from the expectation. Already we can see that differentiating the first term gives the right result. So, it remains to show that $\frac{\partial}{\partial\theta}\log Z_V=-\mathbb{E}^-[\frac{\partial}{\partial\theta} F_V(z_V^-)]$. Indeed, differentiating under the integral sign,\begin{align*} \frac{\partial}{\partial\theta}\log Z_V &=\frac{1}{Z_V}\frac{\partial}{\partial\theta}Z_V\\\\ &=\frac{1}{Z_V}\int_{z_V} \frac{\partial}{\partial\theta} e^{-F_V(z_V;\theta)}\,dz_V\\\\ &=\int_{z_V}\frac{1}{Z_V} e^{-F_V(z_V;\theta)}\left(-\frac{\partial}{\partial\theta} F_V(z_V;\theta)\right)dz_V\\\\ &=-\int_{z_V} p(z_V;\theta) \frac{\partial}{\partial\theta} F_V(z_V;\theta)\,dz_V\\\\ &=-\mathbb{E}^-\left[ \tfrac{d F_V(z_V;\theta)}{d\theta} \right], \end{align*} as desired.
The derivative of free energy will have to be computed on a case-by-case basis, and in some cases the integrals involved may not be solvable. Now, let’s take a look at how this works for a typical Boltzmann machine.
Contrastive divergence for restricted Boltzmann machines
Recall that the Boltzmann machine is a Gibbs random field with energy$$H(z)=-\frac{1}{2} z W z^T.$$I haven’t said much about these yet, so let’s clear some things up. First off, we take $W$ to be a symmetric matrix: the quadratic form only sees the symmetric part of $W$ anyway, and symmetry is what makes the pairwise clique potentials (and hence the Gibbs random field structure) well defined. Second, it’s typical to fix the diagonal elements $W_{ii}=0$, although this can be dropped if necessary. (From here on we also absorb the inverse temperature into the weights, i.e. we set $\beta=1$.)
This is the most general flavor of Bernoulli Boltzmann machine, with each node connected to all of the others -- we’ll call it a “fully-connected” Boltzmann machine. These tend not to be used in practice. One reason for that is that Gibbs sampling cannot be done in parallel, since the model respects only the fully connected graph, which makes everything really slow.
It would be more practical to use a graph with sparser dependencies. The most popular flavor is to pick a bipartite graph, where the partitions are the visible and hidden nodes. In other words, any edge in the graph connects a visible unit with a hidden one. This is called a “restricted” Boltzmann machine, and in this case $W$ has the special block form $$W=\left[\begin{array}{c|c} 0 & M \\\\ \hline M^T & 0 \end{array} \right].$$Here, we’ve enforced the bipartite and symmetric structure with notation, and we let $M$ be an $n_V\times n_H$ matrix. Then our energy can be written\begin{align*}H(z)&=-\frac{1}{2} z W z^T\\\\&= -\frac{1}{2} z_V M z_H^T - \frac{1}{2} z_H M^T z_V^T\\\\&=-z_V M z_H^T.\end{align*}Also, we can rewrite the “inputs” to units as $x_i=[M z_H^T]_i$ for $i\in V$ and $x_i=[M^T z_V^T]_i$ for $i\in H$ -- this will be useful notation to have later.
OK. Now, we’d like to compute the expression in Main Result 5a so that we can do gradient ascent and train our RBM. Let’s begin by calculating the free energy of the visible units. Starting from the definition,\begin{align*} F_V(z_V) &=-\log \sum_{z_H} e^{-H(z)}\\\\ &=-\log \sum_{z_H} \exp\bigg( z_V M z_H^T \bigg)\\\\ &=-\log \sum_{z_H} \exp\bigg( \sum_{i=1}^{n_V} \sum_{j=1}^{n_H} z^V_i M_{ij} z^H_j\bigg)\\\\ &=-\log\left[ \sum_{z^H_1=0}^1\dotsm\sum_{z^H_{n_H}=0}^1 \left( \prod_{j=1}^{n_H} \exp\!\bigg( \sum_{i=1}^{n_V} z^V_i M_{ij} z^H_j\bigg)\right)\right]. \end{align*}Here, pause to notice that we have $n_H$ sums on the outside, and that each sum only cares about one hidden unit. But then many of the terms in the product in the summand are constants from the perspective of that sum, so we can interchange the product with the sums (each factor only involves its own $z^H_j$) and then move the product through the logarithm:\begin{align*} F_V(z_V) &=-\log\left[ \prod_{j=1}^{n_H} \sum_{z^H_j=0}^1 \exp\!\bigg( \sum_{i=1}^{n_V} z^V_i M_{ij} z^H_j\bigg)\right]\\\\ &=-\sum_{j=1}^{n_H} \log \left( 1 + \exp\!\bigg( \sum_{i=1}^{n_V} M_{ij} z^V_i \bigg)\right)\\\\ &=-\sum_{j=1}^{n_H} \log \left( 1 + e^{x_j}\right). \end{align*} Not so bad, after all. In particular, we’ll be comfortable differentiating this with respect to the weights $M_{ij}$. Which, well, that’s what we’re doing next. These calculations might be sort of boring, and you know, the details are not that important, but we’re gonna be doing something similar later to train the new model, so it seems nice to see the way it goes in the classic model first.
Anyway, now to take the derivative. We’ll write $\partial_{M_{ij}}=\frac{\partial}{\partial M_{ij}}$ for short, rename the summation index inside the exponentials to $l$ (so it doesn’t collide with the fixed $i$ in $M_{ij}$), and work from the second to last line in the previous display. \begin{align*} \partial_{M_{ij}} F_V(z_V) &= - \partial_{M_{ij}} \sum_{k=1}^{n_H} \log \left( 1 + \exp\!\bigg( \sum_{l=1}^{n_V} M_{lk} z^V_l \bigg)\right)\\\\ &= - \partial_{M_{ij}} \log \left( 1 + \exp\!\bigg( \sum_{l=1}^{n_V} M_{lj} z^V_l \bigg)\right) \\\\ &= - \frac{1}{1 + \exp\!\left( \sum_{l=1}^{n_V} M_{lj} z^V_l \right)} \partial_{M_{ij}} \exp\!\bigg( \sum_{l=1}^{n_V} M_{lj} z^V_l \bigg)\\\\ &= - \frac{\exp\!\left( \sum_{l=1}^{n_V} M_{lj} z^V_l \right)}{1 + \exp\!\left( \sum_{l=1}^{n_V} M_{lj} z^V_l \right)} z^V_i. \end{align*} Here, we pause to notice that the fraction in the last line is exactly $$\frac{e^{x_j}}{1+e^{x_j}}=\sigma(x_j)=p(z^H_j=1\mid z_V)=\mathbb{E}[Z^H_j\mid z_V].$$This might seem like a coincidence that just happened to pop out of the derivation, but as far as I can tell, this conditional expectation always shows up in this part of the gradient in all sorts of Boltzmann machines with similar connectivity (e.g., in footnote:honglak, footnote:zemel). I don’t know if this has been proven, but I think it’s a safe bet and a good way to check your math if you’re making your own model. Plugging this in, we find $$\partial_{M_{ij}} F_V(z_V) = -z^V_i \mathbb{E}[Z^H_j \mid z_V].$$ Now, let’s plug this result back into our formula for the gradient to finish up. By the averaging property of conditional expectations,\begin{align*} \frac{\partial}{\partial M_{ij}} \mathbb{E}^+[\log p(z_V^+;\theta)] &= -\mathbb{E}^+[\partial_{M_{ij}} F_V(z_V^+)] + \mathbb{E}^-[ \partial_{M_{ij}} F_V(z_V^-)]\\\\ &= \mathbb{E}^+[ z^V_i \mathbb{E}[Z^H_j \mid z_V] ] - \mathbb{E}^-[ z^V_i \mathbb{E}[Z^H_j \mid z_V] ]\\\\ &= \mathbb{E}^+[ z^V_i z^H_j ] - \mathbb{E}^-[ z^V_i z^H_j ]. \end{align*}The update has a nice Hebbian reading: the “positive phase” term $\mathbb{E}^+[z^V_i z^H_j]$ pushes up the weight between units that fire together when the visible layer is clamped to the data, and the “negative phase” term $\mathbb{E}^-[z^V_i z^H_j]$ pushes it back down according to the correlations the model produces when it runs free. Learning stops when the two correlation statistics agree.
That’s nice, but it ought to be mentioned that there are a lot of practical considerations to take into account if you want to train a really good model, since the gradients are noisy and the search space is large and unconstrained. footnote:practical is a good place to start.
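To make the learning rule concrete, here is a minimal NumPy sketch (mine, with hypothetical names) of one CD-1 step for a bias-free Bernoulli RBM, using exactly the positive-phase-minus-negative-phase gradient above. It leaves out the practical machinery -- bias terms, momentum, weight decay, sensible initialization -- that footnote:practical covers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(M, v_data, lr, rng):
    """One CD-1 step for a bias-free Bernoulli RBM with energy H = -v M h^T.

    v_data: (batch, n_V) visible vectors drawn from the data.
    Returns the updated weight matrix M of shape (n_V, n_H).
    """
    # positive phase: E^+[ v_i h_j ], using the conditional mean sigma(x_H) for h
    ph_data = sigmoid(v_data @ M)                       # p(h_j = 1 | v)
    pos = v_data.T @ ph_data

    # negative phase: one Gibbs step v -> h -> v' -> h', started at the data
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    pv_model = sigmoid(h @ M.T)                         # p(v_i = 1 | h)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(v_model @ M)
    neg = v_model.T @ ph_model

    # gradient ascent on the expected log likelihood
    return M + lr * (pos - neg) / v_data.shape[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_V, n_H = 784, 128
    M = 0.01 * rng.standard_normal((n_V, n_H))
    fake_batch = (rng.random((32, n_V)) < 0.1).astype(float)  # stand-in for real data
    M = cd1_update(M, fake_batch, lr=0.05, rng=rng)
    print(M.shape)
```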
To finish, it should be noted that these RBMs can be stacked into what’s called a “deep belief network” -- I think this is another Hinton name. What that means is that more hidden layers are added, so that the graph becomes $(n+1)$-partite, where $n$ is the number of hidden layers (we add 1 for the visible layer). Then the energy function becomes$$H(z_V,z_{H_1},\dots,z_{H_n})=-z_V M_1 z_{H_1}^T - \dots - z_{H_{n-1}} M_n z_{H_n}^T.$$ These networks are tough to train directly using the free-energy differentiating method above (although the math is similar). In practice, each pair of layers is trained as if it were an isolated RBM; this is called greedy layer-wise training, and is described in footnote:greedylayerwise. There’s a good tutorial on DBNs here as well footnote:http://deeplearning.net/tutorial/DBN.html.
One important thing to note about DBNs is that when you are doing Gibbs sampling in the middle of the network, you have to sample from the conditional distribution of your layer given both the layer above and the layer below. So, there is a little more bookkeeping involved. There are other methods for sampling these networks too (footnote:honglak, footnote:wheredoeshintontalkaboutdeepdreamthing).
Developing langmod-dbn
So, we’ll be working on a DBN to model language, since probably a single layer will not be enough to model anything very interesting. But, since we’ll be using greedy layer-wise training, we can just describe how a “langmod-rbm” looks, and that will take us most of the way there.
OK, so. We want to develop a family of probability distributions $p_n(z^{(n)}_V)$, on strings $z_V^{(n)}$ of length $n$, one probability distribution for each possible length $n$. These strings will come from an “alphabet” $\mathcal{A}$. For example, we might have $\mathcal{A}=\{a,b,c,\dots\}$, or we might allow capital letters or punctuation, etc. So, a length-$n$ string is then just some element of $\mathcal{A}^n$. For short, let’s let $N=\lvert\mathcal{A}\rvert$ be the size of the alphabet.
Our starting point is going to be the RBM as above. The first change is that we will focus on the special case where $M$ is not a general linear transform, but rather a convolution (actually cross-correlation) with some kernel. This is just a choice of the form that $M$ takes, so we are in the same arena so far. There is a good deal of work around convolutional Boltzmann machines for image processing tasks (especially footnote:honglak), but I’m not aware of any work using a convolutional Boltzmann machine for natural language.
(The thing is that we’re going to do sort of a strange convolution, so if you’re familiar with them this will be a little weird. If you’re not, well I’m going to do my best not to assume you are, but it might help to understand discrete cross-correlations a bit. If this is confusing, you might take a look at footnote:https://en.wikipedia.org/wiki/Cross-correlation, and at Andrej Karpathy’s lecture notes on convolutional neural networks footnote:http://cs231n.github.io/convolutional-networks/ for a machine learning point of view.)
Now, it does not make much sense to encode $z_V$ (we’ll use $z_V$ as shorthand for $z_V^{(n)}$) as just an element of $\mathcal{A}^n$, since $\mathcal{A}$ is not an alphabet of numbers. But we can make letters into numbers by fixing a function $\phi:\mathcal{A}\to \{1,\dots,N\}$, so that we can now associate an integer to each letter.
But the choice of $\phi$ is pretty arbitrary, and it does not make much sense to let $z_V$ just be a vector of integers like this. Rather, let’s convert strings into 2-d arrays using what’s called a one-hot encoding. The one-hot encoding of a letter $a$ is an $N$-d vector $e(a)$, where $e(a)_i=1$ if $i=\phi(a)$, and $e(a)_i=0$ if $i\neq\phi(a)$. So, we take a letter to a binary vector with a single one in the natural position.
Then we can encode a whole string $s=a_1a_2\dotsm a_n$ to its one-hot encoding naturally by letting $$e(s)=(e(a_1)\ e(a_2)\ \dotsm\ e(a_n)).$$ So, we’re imagining strings as “images” of a sort -- the width of the image is $n$, and the height is $N$. So, we’ll define an energy function on $z_V$, where $z_V$ is no longer a binary vector like above, but rather an $n\times N$ binary image with a single one in each column (that is, one active letter unit per position of the string). This way, the hidden units can “tell” what letters are sending them input.
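In code, the one-hot encoding is just this (a sketch with an arbitrary example alphabet; note that I store string positions along the first axis, so that the array indexing matches the $z^V_{ij}$ notation used below):

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz ."        # example alphabet of size N = 28
PHI = {a: i for i, a in enumerate(ALPHABET)}     # phi: letter -> index

def one_hot(s):
    """Encode a string as an n-by-N binary array with one active letter unit per position."""
    n, N = len(s), len(ALPHABET)
    z = np.zeros((n, N), dtype=np.int8)
    for i, a in enumerate(s):
        z[i, PHI[a]] = 1
    return z

z_V = one_hot("hi how are you")
print(z_V.shape)        # (14, 28)
print(z_V.sum(axis=1))  # all ones: exactly one letter per position
```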
Now, instead of having each hidden unit receive input from all of the visible units, as in an ordinary RBM, we’ll say that each hidden unit receives input from some $k$ consecutive columns of visible units (i.e. $k$ consecutive letters), for some choice of $k$. We’ll also assume that the input is “stationary,” just like Shannon did in his early language modelling work footnote:shannon. This means we are assuming that a priori, the distribution of each letter in the string is not influenced by its position in the string. In other words, the statistics of the string are invariant under translations.
But since the input is translation invariant, each hidden unit should expect statistically similar input. So, it makes sense to let the hidden units be connected to their $k$ input letters by the same exact weights. Here’s a graphic of “sliding” those weights across the visible field to produce the inputs $x_j$ to the hidden layer:
This figure footnote:figurepeople takes a little explaining, but what it shows is exactly how we compute the cross-correlation of the visible field and the weights. Here, the blue grid represents the visible field. The weights (we’ll be calling them the “kernel” or the “filter”, and they’re represented by the shadow on the blue grid) are sliding along the width dimension of the input, and we have chosen a kernel width $k=3$ in the figure. In general, the kernel height is arbitrary, but here we’ve chosen it to be exactly the height of the visible field, which is the alphabet size $N$. This way each row of the kernel is always lined up against the same row of the visible field, so that it is always activated by the same letter, and it can learn the patterns associated with that letter.
At this point, it might help to put down a working definition of cross-correlation, which should be thought of as something like an inner product. (You can think of $u$ as the $n\times N$ visible field, and $v$ as the $k\times N$ filter.)
Def 6. (Cross-correlation) Let $u$ be an $n_1\times n_2$ matrix, and let $v$ be an $m_1\times m_2$ matrix, such that $n_1\geq m_1$ and $n_2\geq m_2$. Then the cross-correlation $u*v$ is the $(n_1-m_1+1)\times(n_2-m_2+1)$ matrix with elements$$[u*v]_{ij}=\sum_{p=0}^{m_1-1}\sum_{q=0}^{m_2-1} u_{i+p,j+q}v_{pq}.$$
This differs from a convolution, which is the same but with $v$ reflected on all of its axes first. For some reason machine learning people just call cross-correlations convolutions, so that’s what I’m doing too.
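Here is Def 6 written out directly in NumPy, just for transparency (in practice you would reach for an optimized convolution routine rather than this toy double loop):

```python
import numpy as np

def cross_correlate(u, v):
    """Valid-mode 2-d cross-correlation of u (n1 x n2) with v (m1 x m2), as in Def 6."""
    n1, n2 = u.shape
    m1, m2 = v.shape
    out = np.zeros((n1 - m1 + 1, n2 - m2 + 1))
    for i in range(n1 - m1 + 1):
        for j in range(n2 - m2 + 1):
            # inner product of v with the (m1 x m2) window of u anchored at (i, j)
            out[i, j] = np.sum(u[i:i + m1, j:j + m2] * v)
    return out

u = np.arange(20.0).reshape(4, 5)
v = np.ones((2, 3))
print(cross_correlate(u, v))   # a (3, 3) output, since (4-2+1) x (5-3+1)
```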
We are extremely close to writing down the energy function for our new network. I keep doing that and then realizing I forgot like 10 things that needed to be said first. I think the last thing that needs saying is this: we’ve shown how to get one row of hidden units, with one filter. In a good network, these units will all be excited by whatever length-$k$ substrings of the input have high correlation with the filter. This will hopefully be some common feature of length-$k$ strings -- for small $k$, this could be something like a phoneme -- I don’t know. Maybe the filter would recognize the short building blocks of words, like “nee” or “bun,” little word-clusters that show up a lot.
But this filter is only going to be sensitive to one feature, and we’d like our model to work with lots of features. So, let’s add more rows of hidden units. Each row will be just like the one we’ve already described, but each row will use its own filter. We’ll stack the rows together, so that the hidden layer is now image-shaped. Let’s let $c$ be the number of filters, and we’ll let $K$ be our filter bank -- this is going to be a 3-d array of shape $c\times k\times N$.
Then, the input to the hidden unit $z^H_{ij}$ is given by the cross correlation of the $j$th filter with the input. If we use $*$ also to mean cross-correlation with a filter bank, we have:$$x^H_{ij}=[z_V*K]_{ij}:=\sum_{p=0}^{k-1} \sum_{q=0}^{N-1} z^V_{i+p,q} K_{j,p,q}.$$Then the hidden layer $z_H$ will have shape $(n-k+1)\times c$.
This leads us to a natural definition for the energy function. Since cross-correlation is “like an inner product,” we’ll use it to define a quadratic form just like we did in the RBM:\begin{align*}H(z_V,z_H) &=-\langle z_V * K, z_H\rangle\\\\ &:= - \sum_{i=0}^{n-k}\sum_{j=1}^c x^H_{ij} z^H_{ij}\\\\ &= -\sum_{i=0}^{n-k}\sum_{j=1}^c z^H_{ij} \sum_{p=0}^{k-1} \sum_{q=0}^{N-1} z^V_{i+p,q} K_{j,p,q}.\end{align*}So, there we have the heart of our new convolutional Boltzmann machine, since the probability is basically determined by the energy.
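In code (again a sketch of mine, with a full-height kernel as in the figure), the hidden inputs $x^H$ and the energy look like this:

```python
import numpy as np

def hidden_inputs(z_V, K):
    """x^H[i, j] = sum_{p, q} z_V[i+p, q] * K[j, p, q] for a full-height filter bank.

    z_V: (n, N) one-hot string; K: (c, k, N) filter bank.
    Returns x_H with shape (n - k + 1, c).
    """
    n, N = z_V.shape
    c, k, _ = K.shape
    x_H = np.zeros((n - k + 1, c))
    for i in range(n - k + 1):
        window = z_V[i:i + k, :]                               # k consecutive letters
        x_H[i, :] = np.tensordot(K, window, axes=([1, 2], [0, 1]))
    return x_H

def energy(z_V, z_H, K):
    """H(z_V, z_H) = -<z_V * K, z_H>."""
    return -np.sum(hidden_inputs(z_V, K) * z_H)

rng = np.random.default_rng(0)
n, N, c, k = 20, 28, 8, 3
z_V = np.zeros((n, N))
z_V[np.arange(n), rng.integers(0, N, n)] = 1                   # random one-hot "string"
K = 0.1 * rng.standard_normal((c, k, N))
z_H = (rng.random((n - k + 1, c)) < 0.5).astype(float)
print(energy(z_V, z_H, K))
```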
Categorical input units
There’s only one change left to get us to something that could be called a language model. Right now, the machine has a little problem, which is that it’s possible for a sample from the machine to have two or more units on in some column of the sample. We didn’t use a two-hot encoding for our strings, though, so that sample can’t be understood as a string. We have to find some way to make sure this never happens.
The answer is pretty simple: we just restrict our state space. So far, we had been looking at $z\in\{0,1\}^{V\cup H}$, the set of all possible binary configurations of all the units. But let’s restrict our probability distribution to the set$$\Omega=\bigg\{z\in\{0,1\}^{V\cup H} : \forall i\,\sum_{j} z^V_{ij}= 1 \bigg\}.$$Then our distribution $p(z)$ is still a Gibbs random field on this state space, but we’re left with a question: how do we decide which of the units in each column should be active? The visible units in a column are no longer conditionally independent, so this manipulation of the state space has changed the graph that our model respects by linking the visible columns into cliques.
So, let’s compute the conditional distribution of the visible units given the hidden units. Since the energy is the same, we still have that $$p(z^V_{ij}=1\mid z_H)=\frac{1}{Z_{i}} e^{x^V_{ij}},$$ where $x^V_{ij}$ is the input to the visible unit at position $(i,j)$ from the hidden layer (the coefficient of $z^V_{ij}$ in $-H$), and $Z_i$ is some normalizing constant. The question is, what is the normalizing constant? Notice that it is shared by the whole column: on the restricted state space, the configuration of column $i$ is determined by which single unit is on, so $Z_i$ normalizes one categorical distribution over the $N$ letters rather than $N$ separate Bernoulli distributions. Summing the probabilities above over $j$ and setting the total to one gives $$Z_i=\sum_{k} e^{x^V_{ik}}.$$So, now we have found our conditional distribution $$p(z^V_{ij}=1\mid z_H)=\frac{e^{x^V_{ij}}}{\sum_k e^{x^V_{ik}}}.$$ But this is exactly the softmax footnote:softmax function that often appears when sigmoid-Bernoulli random variables are generalized to categorical random variables! That’s nice and encouraging, and it gives us a simple sampling recipe. Just compute the softmax of $x^V$ over columns, and then run a categorical sampler over each column of the result.
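In code, sampling the visible layer given the hidden layer then looks something like this (a sketch; here `x_V` stands for the array of inputs $x^V_{ij}$ to the visible units, however those get computed for your particular architecture):

```python
import numpy as np

def sample_categorical_visibles(x_V, rng):
    """Sample one-hot visible columns from p(z^V_{ij} = 1 | z_H) = softmax_j(x^V_{ij}).

    x_V: (n, N) array of visible inputs; returns an (n, N) one-hot sample,
    with exactly one active letter unit per position.
    """
    # softmax over the letter dimension, shifted for numerical stability
    logits = x_V - x_V.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    # one categorical draw per position
    choices = np.array([rng.choice(p.shape[1], p=row) for row in p])
    z_V = np.zeros_like(x_V)
    z_V[np.arange(x_V.shape[0]), choices] = 1
    return z_V

rng = np.random.default_rng(0)
x_V = rng.standard_normal((14, 28))
z_V = sample_categorical_visibles(x_V, rng)
print(z_V.sum(axis=1))   # all ones: one letter per position
```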
It might be worth noting that this trick of changing the distribution of the input units is a common one: often for real-valued data, people will use Gaussian visible units and leave the rest of the network Bernoulli footnote:GaussBerRBM. Also, I should note that I was inspired to do this by the similar state-space restriction that Lee et al. used to construct their probabilistic max pooling layer footnote:honglak -- this is a pretty direct adaptation of that trick.
Lastly, it might be worth noting that you could do the same thing to the hidden layer that we have just done to the visible layer -- this could be helpful, since it would keep the network activity sparse. Also, in the language domain, maybe it’s appropriate to think of hidden features as mutually exclusive in some way (you can’t have two phonemes at once). I would bet that this assumption starts making less sense as you go deeper in the net. The problem with this idea is that unlike what we’ve just done in the visible layer, changing the state space of the hidden units ends up changing the integral that computes the free energy of the visible units, which means we would have to do that again. So for now, let’s not do this.
Training the language model
Now, we’d like to get this thing learning. But for once, we are in for a lucky break. Notice that this categorical input units business hasn’t changed anything about the free energy -- that integral is still the same. So, since the learning rule is basically determined by the free energy, that means that this categorical input convolutional RBM has the same exact learning rule as a regular old convolutional RBM.
And, we’re in for another break too. Cross-correlations are linear maps, so a convolutional RBM is just a special case of the generic RBM whose learning rule we’ve computed above. In particular, there exists a matrix $M_K$ so that if $z_V'$ is $z_V$ flattened from an image to a flat vector, then $M_K z_V'=(z_V*K)'$. The thing is that $M_K$ is very sparse, so we need to figure out the update rule for the elements of $M_K$ that are not fixed to 0.
Well, first of all, our free energy hasn’t changed, but we can write it in a more convolutional style: $$F_V(z_V)=-\sum_{i=0}^{n-k}\sum_{j=1}^c \log(1 + e^{x^H_{ij}}).$$ So, all we need to do is compute $\partial_{K_{ors}}$ of this expression ($K$ is hidden in $x^H_{ij}$). This is quick, so I’ll write it out:\begin{align*}\partial_{K_{ors}} F_V(z_V) &=-\partial_{K_{ors}} \sum_{i=0}^{n-k}\sum_{j=1}^c\log(1+e^{x^H_{ij}})\\\\ &=-\sum_{i=0}^{n-k} \partial_{K_{ors}} \log(1+e^{x^H_{io}})\\\\ &=-\sum_{i=0}^{n-k}\frac{1}{1+e^{x^H_{io}}} \partial_{K_{ors}} e^{x^H_{io}}\\\\ &=-\sum_{i=0}^{n-k}\sigma(x^H_{io}) \partial_{K_{ors}} x^H_{io}\\\\ &=-\sum_{i=0}^{n-k}\sigma(x^H_{io}) z^V_{i+r,s},\end{align*}where the last line follows directly from the definition of $x^H_{io}$ given above.
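As a sketch (with names of my own), the corresponding update for the filter bank just evaluates $-\partial_{K}F_V$ at a data sample (positive phase) and at a model sample from a short Gibbs chain (negative phase); the Gibbs chain itself is elided here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def minus_dF_dK(z_V, K):
    """Returns -dF_V/dK at z_V: entry [o, r, s] is sum_i sigmoid(x^H_{io}) * z^V_{i+r, s}."""
    n, N = z_V.shape
    c, k, _ = K.shape
    # hidden inputs x^H_{io}, the same cross-correlation that appears in the energy
    x_H = np.array([[np.sum(z_V[i:i + k] * K[o]) for o in range(c)]
                    for i in range(n - k + 1)])               # shape (n-k+1, c)
    s_H = sigmoid(x_H)                                         # E[z^H_{io} | z_V]
    grad = np.zeros_like(K)
    for i in range(n - k + 1):
        grad += s_H[i][:, None, None] * z_V[i:i + k][None, :, :]
    return grad

def cd_kernel_step(K, zV_data, zV_model, lr):
    """Gradient ascent on the log likelihood: positive (data) phase minus negative (model) phase."""
    return K + lr * (minus_dF_dK(zV_data, K) - minus_dF_dK(zV_model, K))

rng = np.random.default_rng(0)
n, N, c, k = 20, 28, 8, 3

def random_one_hot():
    z = np.zeros((n, N))
    z[np.arange(n), rng.integers(0, N, n)] = 1
    return z

K = 0.1 * rng.standard_normal((c, k, N))
# zV_model stands in for a sample from a short Gibbs chain started at the data (CD-1)
K = cd_kernel_step(K, random_one_hot(), random_one_hot(), lr=0.05)
print(K.shape)
```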
But on the other hand, what if we did want to have categorical columns in the hidden layer? It’s plausible to me to think that only one “language feature” might be active at a given time, and this has the added bonus of keeping the hidden layer activities sparse. This amounts to imposing on the hidden layer the same restriction that we earlier imposed on the state space for the visible layer, i.e. we choose the new state space$$\Omega'=\bigg\{z\in\{0,1\}^{V\cup H} : \forall i\,\sum_{j} z^V_{ij}= 1 = \sum_j z^H_{ij} \bigg\}.$$Then, of course, we sample the hidden units in the same way that we just discussed sampling categorical visible units.
But now we’ve changed the state space of $z_H$, which in turn changes the behavior of the $\int \, dz_H$ that appears in the definition of free energy. So if we want to train a net like this, we’ll need to re-compute the free energy and see what happens to the learning rule.
The gradient of the free energy that we computed a moment ago is sort of the convolutional version of the outer product that appeared in the RBM, and it’s a special case of the gradient needed to train a convolutional DBN like footnote:honglak, which allows for a kernel that is not full-height like ours.
Many thanks especially to Matt Ricci for working through all of this material with me, and to Professors Matt Harrison, Nicolas Garcia Trillos, Govind Menon, and Kavita Ramanan for teaching me the math.
Remaining notes to self, since this is still a work in progress:
We could just use an ordinary DBN and make sure to sample from the conditional distribution with only one active unit per column in the visible layer. Let's try that. But just for fun, let's also encode that constraint into the model; maybe it'll make it more structured and able to deal with sparsity problems automatically.
Modify the sliding-kernel picture to have a full-height kernel, show the alphabet on the side, stuff like that. Cite that guy's tool if it works.
At this point you have as much intuition as I do about whether this thing should work. Let's try it out. More layers might be tough.
Mention more Boltzmann "mods" like the DUBM, or the more standard Gaussian and Gaussian-Bernoulli units. Mention the Geman-Geman model.
On two proofs of the central limit theorem
Two of my statistics professors, G. Menon and N.G. Trillos, said sort of contradictory things about the central limit theorem. I’m going to try to write them both down and explain what the difference was, and maybe it will be interesting. First, let's set up the context. The classic(al) central limit theorem is
Theorem 1. (Classical CLT) Let $X,X_1,X_2,\dotsc$ be i.i.d. random variables with $\mathbb{E}[X]=0$ and $\operatorname{var}(X)=1$. If $S_n$ is the random walk $S_n=\sum_{i=1}^n X_i$, then the process $\frac{1}{\sqrt{n}}S_n$ converges in distribution to the centered Gaussian $N(0,1)$.
This can be proved in a few lines, that boil down to: “Say that a bunch of little fluctuations are happening, and that none of them is large enough to stand out. Then in the average as more fluctuations occur, they add up to a smooth blob of a fluctuation.” The proof is carried out entirely with characteristic functions and I’m about to write it down, but I’ll try to explain myself first.
Prof. Menon’s remark was that it wasn’t until Paul Lévy and others developed characteristic functions that we understood “what makes the CLT tick.” Looking at the history of the CLT (there is a very detailed book by H. Fischer [1]) gives the sense that this is true, since before Lévy there was a whole pile of different central limit theorems for different special cases. Characteristic functions seem like the right language for this theorem, in that they can show it for all of the cases in the same breath. So, here’s how they show up in the proof. This is ripped from Wiki [2], so don’t get too excited.
Proof. (Theorem 1) Since the $X_i$ are identically distributed, we can let $\phi$ be their shared characteristic function. Then $\phi(0)=\mathbb{E}_{X} [e^0]=1$. These are centered random variables, so $\phi'(0)=0$ (Fourier transform exchanges position and momentum operators). On the other hand, they have fixed variance $1$, so that $\phi''(0)=-1$. Then by Taylor, $$\phi(t)=1-\frac{t^2}{2}+o(t^2).$$Then if $Y=X/\sqrt{n}$ and $Y_i=X_i/\sqrt{n}$, using the scaling property of characteristic functions, we have $$\phi_{Y}(t)=\phi(\tfrac{t}{\sqrt{n}})=1-\frac{t^2}{2n}+o(\tfrac{t^2}{n}).$$ Then if we take $\tilde S_n=\sum_{i=1}^n Y_i=\frac{1}{\sqrt{n}}S_n$, by independence we have that \begin{align*}\phi_{\tilde S_n}(t)&=\mathbb{E}[e^{it(Y_1+\dotsm+Y_n)}]\\\\ &=(\phi_Y(t))^n\\\\ &=\left(1-\frac{t^2}{2n}+o(\tfrac{t^2}{n})\right)^n,\end{align*}and this will famously approach$$\lim_{n\to\infty} \phi_{\tilde S_n}(t)=e^{-\frac{t^2}{2}}.$$By Lévy continuity, this is the end. $\blacksquare$
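(Purely as a numerical sanity check on the computation, and nothing deeper: we can watch $(1-t^2/2n)^n$ approach $e^{-t^2/2}$, and compare normalized sums of centered coin flips against a standard Gaussian.)

```python
import numpy as np

rng = np.random.default_rng(0)

# the pointwise limit of the (approximate) characteristic functions
t = 1.7
for n in (10, 100, 1000, 10_000):
    print(n, (1 - t**2 / (2 * n))**n, np.exp(-t**2 / 2))

# Monte Carlo: normalized sums of n centered, variance-1 coin flips
n, reps = 1000, 200_000
S = (2.0 * rng.binomial(n, 0.5, size=reps) - n) / np.sqrt(n)   # sum of n +/-1 flips, scaled
print(S.mean(), S.var())                 # should be near 0 and 1
print((S <= 1.0).mean(), "vs", 0.8413)   # P(S <= 1) vs Phi(1) ~ 0.8413
```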
On the one hand, this proof makes nothing but sense: higher frequency noise gets washed out on average, since we have assumed away the $\Theta(t)$ term that would arise if $\mathbb{E}[X]\neq0$. That would represent the process running off to $\pm\infty$, so of course it has got to go. Then the $n^{-1/2}$ scaling gets rid of everything except for the variance, since the “compound interest” formula $\lim (1 + x/n)^n$ for the exponential will eat all but the $\cdot/n$ term. So, the $\cdot/\sqrt{n}$ scaling is the right choice to leave behind the behavior of the process’ limiting fluctuations about its center, since it reveals the expected deviation from the mean -- which is one story, at least, for why it’s called the central limit theorem.
To me, this argument feels too descriptive. “High frequency noise washes out” is a statement of what happens, and while you would expect to find the “why it happens” in the proof, to me the argument by Taylor’s theorem does not do that; it just describes. It’s an analytical argument that does not lend itself to much in the way of interpretation, or to understanding the details of a particular case. (In [12], T. Tao mentions the “algebraic miracles” that complete moment-method proofs of the CLT, which also spectacularly and somewhat out-of-nowhere reveal the Gaussian limit. Apparently, something called the “Lindeberg swapping trick” helps deal with the mystical aspect.)
To me, the result sort of comes out of the blue, maybe because characteristic functions are able to compress so much math into such brief manipulations. For instance, looking at this proof, why should we expect a Gaussian limit? Well, it is a fixed point of the Fourier transform, so maybe we should expect a sane limit of a sequence of characteristic functions to be some flavor of Gaussian. This might be on the way to Stein’s method [3], and this understanding might become less descriptive and more intuitive if I knew more about stable distributions, but I don’t, so we’ll leave that for another day.
Still, the Gaussian has such strong ties to entropy and physics that it really should not be appearing in this “oh, hey” kind of way in a theorem about a random process. So, now I’ll try to talk about the “entropic CLT,” which to me gives this sort of nice physical interpretation, although it comes at the cost of a more complex proof.
The central limit theorem is all about the limiting behavior of the variance of a process about its center: if we assume away the possibility of a rare event or a catastrophe, how much noise is left in the limit? A theorem like Sanov’s theorem in large deviations complements the CLT by revealing the behavior in the process’ tails, rather than near its center. And, something interesting is that Sanov’s theorem, like many results in large deviations, is proved using tools from information theory. Prof. Trillos’ remark, that I have been putting off, is something to the effect that entropic machinery gives a better understanding of the CLT. Indeed, since it is powerful enough to describe the tails of the limiting distribution above, maybe it can say more about the near-equilibrium behavior as well? In this framework, the central limit theorem becomes something like the second law of thermodynamics: we’re adding more random variables, so of course entropy has to increase, and the Gaussian maximizes entropy, so there you have it.
I’ll make a couple of definitions to set up the main entropic tool, which is “convergence in relative entropy.” This is much like convergence in a metric, and it’s an interesting “metric” since it has these physical associations. It’s a different sort of convergence than the weak convergence (convergence in distribution) given by the CLT above, which says that “from the point of view of every possible statistic” $S_n\Rightarrow N(0,1)$, or put a different way, it only understands the limiting process by the action of its expectation operator on a restricted class of statistics (bounded and continuous functions). But it would be misleading if I did not note that, since the Fourier transform is an isometry, there exist hard-analytical statements like the Berry-Esseen theorem, or statements shown by Stein’s method, which strengthen the weak$^*$ convergence to uniform convergence of CDFs, with explicit bounds on rates of convergence (see [12]). I do not know too much about these results, and would be interested to know how they would mix with my discussion here.
Anyway, here are some details about relative entropy to set up Prof. Trillos’ statement.
Def 2. (Relative entropy, or Kullback-Leibler divergence) Let $\mu,\nu$ be probability measures on the same space. Then the relative entropy of $\mu$ with respect to $\nu$ is$$D(\mu \mathbin{\|} \nu)=\begin{cases} \int\log\frac{d\mu}{d\nu}\,d\mu & \text{if $\mu\ll\nu$}\\\\ \infty & \text{otherwise,}\end{cases}$$where $\frac{d\mu}{d\nu}$ is the Radon-Nikodym derivative.
Recall that this is a positive functional with identity of indiscernibles ($D(\mu\mathbin{\|} \nu)=0$ iff $\mu=\nu$), so that it would be a metric if not for its lack of symmetry (although we can symmetrize by looking at $D(\mu\mathbin{\|}\nu) + D(\nu\mathbin{\|}\mu)$) and its failure to satisfy the triangle inequality. It is convex and lower semicontinuous in both of its arguments, among other nice properties. It might as well be a metric, since the sequential characterization of the topology it generates is very nice, as in the proposition below.
Def 3. (Convergence in relative entropy) We say that probability measures $\mu_1,\mu_2,\dots$ converge to $\nu$ in relative entropy if$$\lim_{n\to\infty}D(\mu_n\mathbin\|\nu)=0.$$
Prop 4. Convergence in relative entropy implies convergence in total variation (this is Pinsker’s inequality), and in particular, convergence in distribution, uniform convergence of CDFs [4], ...
By Prop. 4 and [4], we can see that convergence in relative entropy is a little stronger than the results we got with characteristic functions above. From the weak$^*$ point of view, the class of test functions is much larger, allowing bounded and measurable functions in addition to the bounded and continuous ones from above, so that this theorem gives an identity between the limiting expectation operator and that of the Gaussian.
Over the horizon is a theorem to the effect that the $S_n$ defined in the proof of Thm. 1 above converge in relative entropy to a standard Gaussian. Since the Gaussian maximizes entropy, we would hope that this theorem can be understood as: “As you mix in more randomness, of course the entropy has to increase to its maximum, and since you’ve scaled these variables to leave behind some variance, that defines a Gaussian.” To me this version of the theorem feels better motivated, although the proofs I had seen before writing this [5] involve a lot of fiddling with Fisher information and inequalities. Here is a result that everyone would love,
Aspirational Theorem 5. (Entropic CLT) Let $\mu_1,\mu_2,\dotsc$ be centered probability measures with unit variance, let $X_i\sim\mu_i$ be independent, and let $\nu_n$ be the law of$$\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i.$$Then $D(\nu_n\mathbin\|\Phi)\to0$, where $\Phi$ is the standard Gaussian measure $d\Phi(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\,dx$.
This is false as stated: how would one deal with the absolute continuity requirement in the definition of relative entropy if, say, the $\mu_i$ were Bernoulli like in de Moivre’s theorem? In that case each $\nu_n$ is supported on finitely many points, so $\nu_n\not\ll\Phi$ and $D(\nu_n\mathbin\|\Phi)=\infty$ for every $n$. (In Thm. 6 below, this is the requirement that the relative entropy be finite for some $n$.) So, it’s a question how far this approach can be taken.
Prof. Trillos pointed his class to A. Barron’s classic 1986 paper [5], where convergence in relative entropy was first established (for densities). Barron’s result is:
Theorem 6. (Barron 1986 Entropic CLT) Let $f_n$ be the probability density function of the standardized sum of iid random variables$$S_n=\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n\left(X_i-\mathbb{E}[X_i]\right).$$Then $D(f_n\mathbin\|\phi)\to0$, where $\phi$ is the standard normal density, provided that the relative entropy is finite for some $n$. What’s more, the sequence $n D(f_n\mathbin\|\phi)$ is shown to be subadditive.
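Just to see Thm. 6 in action (this says nothing about the proof), here is a rough Monte Carlo sketch: standardized sums of iid Exponential(1) variables, with the relative entropy to the standard normal estimated by a crude histogram plug-in. The helper kl_to_standard_normal, the bin choices, and the sample sizes are all made up, so treat the numbers as qualitative only.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def kl_to_standard_normal(samples, n_bins=60):
    """Crude plug-in estimate of D(f || phi) from samples, via a shared binning."""
    edges = np.linspace(-5.0, 5.0, n_bins + 1)
    counts, _ = np.histogram(samples, bins=edges)
    p = counts / len(samples)          # empirical bin masses (tails are ignored)
    q = np.diff(norm.cdf(edges))       # standard normal bin masses
    mask = p > 0                       # 0 * log(0) = 0 by convention
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Exponential(1) has mean 1 and variance 1, so S_n = sum_i (X_i - 1) / sqrt(n)
# is the standardized sum of Thm. 6; the estimate should drift toward 0 as n grows.
for n in [1, 2, 5, 20, 100]:
    X = rng.exponential(1.0, size=(100_000, n))
    S = (X - 1.0).sum(axis=1) / np.sqrt(n)
    print(n, round(kl_to_standard_normal(S), 4))
```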
Recall that
Def 7. (Subadditive sequence) A sequence $(a_i)_{i\in\mathbb{N}}$ is said to be subadditive if for any $i,j\in\mathbb{N}$,$$a_{i+j}\leq a_i + a_j.$$
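The reason subadditivity is worth announcing, at least as I read it, is Fekete’s lemma: for a subadditive sequence,$$\lim_{n\to\infty}\frac{a_n}{n}=\inf_{k\in\mathbb{N}}\frac{a_k}{k}.$$Applied to $a_n=n D(f_n\mathbin\|\phi)$, this says that $D(f_n\mathbin\|\phi)$ converges to $\inf_k D(f_k\mathbin\|\phi)$, so the remaining work in an argument like Barron’s is to show that this infimum is actually $0$. (That is my reading of the role it plays, anyway.)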
Barron builds on work by Linnik and especially by L.D. Brown [10]. Brown had shown weak$^*$ convergence, and Barron extends his entropic argument to show convergence in relative entropy. The argument relies on the interesting relationship between the Fisher information, defined in a second, and relative entropy.
Def 8. (Fisher information) Let $g$ be the $C^1$ density of a random variable $Y$ with variance $\sigma^2$, let $\rho=\frac{g'}{g}$ be the score function (derivative of the log likelihood!), and let $\rho_\phi$ be the score of the standard Gaussian, $\rho_\phi(y)=-y$. Then the Fisher information is$$I(Y)=\mathbb{E}[\rho^2(Y)],$$and the standardized Fisher information is $$J(Y)=\sigma^2\mathbb{E}[(\rho(Y)-\rho_\phi(Y))^2].$$
The important property is that $J\geq0$ with equality iff $g=\phi$, so that the Gaussian minimizes Fisher information for a given variance. This seems similar to $D(\cdot\mathbin\|\cdot)$, and it turns out that they are related:
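As a quick check of Def 8 against this claim, take $Y\sim N(\mu,\sigma^2)$ with density $g$:$$\rho(y)=\frac{g'(y)}{g(y)}=-\frac{y-\mu}{\sigma^2},\qquad I(Y)=\mathbb{E}\!\left[\frac{(Y-\mu)^2}{\sigma^4}\right]=\frac{1}{\sigma^2},$$and in the standardized case $\mu=0$, $\sigma=1$ the score coincides with $\rho_\phi(y)=-y$, so $J(Y)=0$, as it should.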
Prop 9. (Barron 1986 Lemma 1, Relative entropy is an integral of Fisher informations) Let $X$ be any random variable with finite variance and a $C^1$ density, and let $Z$ be an independent normal random variable with the same first and second moments as $X$. Then$$D(X\mathbin\|\Phi)=\int_0^1 J(\sqrt{t} X + \sqrt{1-t} Z)\,\frac{1}{2t}dt.$$
There is also an interesting relationship between Fisher information and entropy production in the heat equation, which is not super relevant here but worth working out (it was a homework problem from Prof. Trillos): if $f_t(x)$ solves the heat equation $\partial_t f_t=\partial_x^2 f_t$, then $\frac{d}{dt}H(f_t)=I(f_t)$.
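Here is the formal computation, ignoring boundary terms and integrability issues, with $H(f)=-\int f\log f\,dx$ and the normalization $\partial_t f_t=\partial_x^2 f_t$ (the other common normalization $\partial_t f_t=\frac{1}{2}\partial_x^2 f_t$ picks up a factor of $\frac{1}{2}$):$$\frac{d}{dt}H(f_t)=-\int (\partial_t f_t)(\log f_t+1)\,dx=-\int (\partial_x^2 f_t)\log f_t\,dx=\int \partial_x f_t\,\partial_x\log f_t\,dx=\int\frac{(\partial_x f_t)^2}{f_t}\,dx=I(f_t),$$using $\int \partial_x^2 f_t\,dx=0$ to kill the constant term and one integration by parts.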
The rest of Barron’s argument follows by convolution inequalities for Fisher informations also used by Brown, intuition from the Cramér–Rao bound [11], and sequential arguments involving subadditivity. It would be healthy for me to repeat this argument, but it builds on so many previous results (inequalities by Kullback, “de Bruijn’s identity”, results on Fisher information by Stam and Blachman) that I have not seen before, and which seem worth working through, that it may be quite a project. I include this stuff to give a feeling for the relative complexity of these arguments compared to the classical argument above. Of course, there is a huge amount of work hidden behind manipulations of characteristic functions, so the net amount of math is probably similar. In particular, the way that characteristic functions are able to deal with absolute continuity starts to seem pretty good.
A separate tactic might be this: for a centered density $f$ with unit variance, $D(f\mathbin\|\phi)=H(\phi)-H(f)$, so showing that $D(f_n\mathbin\|\phi)\to0$ is the same as showing that $H(f_n)\to H(\phi)=\frac{1}{2}\log(2\pi e)$. Can we make an argument at the level of measures by showing $H(\mu_n)\to H(\phi)$? As mentioned in [6], this probably will not work, since the entropy does not have great upper semicontinuity properties, even as a functional acting on densities. Maybe we need a “characteristic functions” toolbox for information theoretic methods. Are there assumptions under which the entropy can be computed using only the characteristic function, without inverting to get the measure?
That basically finishes this chapter, and here I am, digging around to see what’s happened since ‘86. Carlen’s 1991 paper [6] provides an interesting statistical-mechanical interpretation of this and some related results, considering the entropy as a Lyapunov functional in a more dynamical context. It discusses Ising models and things that I am interested in, so that’s nice, and since the authors want to deal with the discrete variables of interest there, they have to handle the absolute continuity requirement, which they do apparently by adding a little Gaussian noise. This post has gotten long enough, but this Carlen paper seems very cool. I hope to write another post soon going through the methods there, since they bring in a lot of ideas from Linnik, Brown, and Barron, and are interested in generalizations. I also like their physical point of view: they are thinking about a CLT for the renormalization group method applied to block rescaling of the Ising model. Hey, that sounds pretty cool.
Everyone’s favorite T. Tao comments on this MathOverflow post [7], and cites two papers by a group of authors [8]. I haven’t looked at these. He also has notes on the CLT [12], which are well worth reading, although they focus more on moment methods. These notes appear in his book on random matrices, which I’m enjoying working through. It has a lot more than just matrices going on. Another answer on [7] points to O. Johnson’s book [9], which is all about this stuff.
[1] H. Fischer, A History of the Central Limit Theorem. https://www.springer.com/us/book/9780387878560
[2] https://en.wikipedia.org/wiki/Central_limit_theorem#Proof_of_classical_CLT
[3] https://math.stackexchange.com/questions/813748/is-there-a-proof-for-the-central-limit-theorem-via-some-fixed-point-theorem
[4] If $f$ is bounded and measurable and $\mu_1,\mu_2,\dotsc\overset{TV}{\to}\nu$, then $\int f\,d\mu_n\to\int f\,d\nu$.
[5] Barron, Andrew R. Entropy and the Central Limit Theorem. Ann. Probab. 14 (1986), no. 1, 336–342. doi:10.1214/aop/1176992632. https://projecteuclid.org/euclid.aop/1176992632
[6] Carlen, E. A.; Soffer, A. Entropy production by block variable summation and central limit theorems. Comm. Math. Phys. 140 (1991), no. 2, 339–371. https://projecteuclid.org/euclid.cmp/1104247985
[7] https://mathoverflow.net/questions/182752/central-limit-theorem-via-maximal-entropy
[8] Artstein et al., 2004(a,b): https://mathscinet.ams.org/mathscinet-getitem?mr=2128238 and https://mathscinet.ams.org/mathscinet-getitem?mr=2083473
[9] O. Johnson, Entropy and the Central Limit Theorem (a textbook). https://doi.org/10.1142/p341
[10] L. D. Brown, A proof of the central limit theorem motivated by the Cramér–Rao inequality. Online here.
[11] https://en.wikipedia.org/wiki/Cramér-Rao_bound
[12] Notes 2 and Notes 4 from Terry Tao. At least the first link appears in his book on random matrices, but I haven’t encountered the second there.