Boltzmann Machines
Here we give an introduction to Boltzmann machines and present a Tiny Restricted Boltzmann Machine that runs in the browser.
Boltzmann Machines are used for unsupervised learning, which means they can learn from data without being told what to look for.
They can be used for generating new data that is similar to the data they were trained on, also known as generative AI.
A Boltzmann Machine is a type of neural network that tries to learn patterns by mimicking how energy works in physics.
Each neuron can be on or off, and the machine is made up of many of these neurons connected to each other.
Some neurons are visible (we can see them and even set their state), and some are hidden (we can't see them).
The connections between neurons are called weights, and they can be positive or negative.
A General Boltzmann Machine has connections between all neurons. This makes it powerful, but its training involves calculating an $O(2^n)$ term.
A Restricted Boltzmann Machine is a special case where there are no connections within the visible layer or within the hidden layer, only between the two layers. This makes it faster to train and understand.
The energy of a configuration of the visible and hidden units is defined as:
$$E(v,h) = -\sum_{i=1}^{m}\sum_{j=1}^{n} w_{ij} v_i h_j - \sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} c_j h_j$$
where $v$ is the visible layer, $h$ is the hidden layer, $w$ is the weight matrix, and $b$ and $c$ are the biases for the visible and hidden layers, respectively.
The visualisation on the right randomises the weights, biases and activation values of a Boltzmann machine and calculates its energy.
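As a concrete illustration, the energy of a configuration can be computed in a few lines of NumPy. This is a minimal sketch with made-up sizes and random parameters, mirroring what the visualisation does; the variable names simply follow the symbols in the equation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: m visible units, n hidden units.
m, n = 6, 4

# Random weights, biases and binary states, as in the visualisation.
W = rng.normal(scale=0.5, size=(m, n))   # weights w_ij
b = rng.normal(scale=0.5, size=m)        # visible biases b_i
c = rng.normal(scale=0.5, size=n)        # hidden biases c_j
v = rng.integers(0, 2, size=m)           # visible states (0/1)
h = rng.integers(0, 2, size=n)           # hidden states (0/1)

def energy(v, h, W, b, c):
    """E(v, h) = -v^T W h - b^T v - c^T h (the double sum written with matrices)."""
    return -(v @ W @ h) - b @ v - c @ h

print(energy(v, h, W, b, c))
```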
During training, the machine is given examples (e.g., images, text) and adjusts its weights to lower the energy of those samples.
It effectively learns $P(v)$, the probability of visible units $v$, which is proportional to $e^{-E(v)}$.
After training, it can sample new data from the learned distribution using Gibbs sampling.
These samples are new (never seen during training) but statistically similar to the training data.
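A minimal sketch of what Gibbs sampling could look like for an RBM, assuming binary (0/1) units; the conditional probabilities below follow from the restricted structure, where each hidden unit depends only on the visible layer and vice versa. The function name and parameters are illustrative, not taken from the simulator.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample(W, b, c, steps=1000, rng=None):
    """Draw an approximate sample of v from a trained RBM by alternating
    between sampling the hidden layer given the visible layer and vice versa."""
    rng = rng or np.random.default_rng()
    m, n = W.shape
    v = rng.integers(0, 2, size=m).astype(float)     # random starting state
    for _ in range(steps):
        # P(h_j = 1 | v) = sigmoid(c_j + sum_i w_ij v_i)
        h = (rng.random(n) < sigmoid(c + v @ W)).astype(float)
        # P(v_i = 1 | h) = sigmoid(b_i + sum_j w_ij h_j)
        v = (rng.random(m) < sigmoid(b + W @ h)).astype(float)
    return v
```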
Starting with a Boltzmann machine as defined earlier, we want to derive the contrastive divergence algorithm for training. The goal is to adjust the weights of the network to minimize the energy of the training data.
We have:
Energy function:
$$E(v,h) = -\sum_{i=1}^{m}\sum_{j=1}^{n} w_{ij} v_i h_j - \sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} c_j h_j$$
Joint distribution:
$$P(v,h) = \frac{1}{Z} e^{-E(v,h)}$$
where $Z$ is the partition function, which normalizes the distribution.
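For a very small model, $Z$ can be computed by brute force, which makes the definition concrete and shows why exact training is intractable for larger networks. A sketch, assuming binary units and illustrative parameter shapes:

```python
import itertools
import numpy as np

def partition_function(W, b, c):
    """Brute-force Z: sum exp(-E(v, h)) over every visible/hidden configuration.
    Only feasible for tiny models, hence the O(2^n) cost mentioned earlier."""
    m, n = W.shape
    Z = 0.0
    for v_bits in itertools.product([0, 1], repeat=m):
        v = np.array(v_bits)
        for h_bits in itertools.product([0, 1], repeat=n):
            h = np.array(h_bits)
            Z += np.exp(v @ W @ h + b @ v + c @ h)   # exp(-E(v, h))
    return Z

def joint_probability(v, h, W, b, c):
    """P(v, h) = exp(-E(v, h)) / Z."""
    return np.exp(v @ W @ h + b @ v + c @ h) / partition_function(W, b, c)

# Tiny example with 3 visible and 2 hidden units.
rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=2)
print(joint_probability(np.array([1, 0, 1]), np.array([0, 1]), W, b, c))
```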
We train the RBM by maximizing the likelihood of the training data, i.e. maximizing $\log P(v)$.
The marginal likelihood of the visible layer is given by:
$$P(v) = \sum_h P(v,h)$$
Then the log-likelihood is:
$$\log P(v) = \log \sum_h \frac{1}{Z} e^{-E(v,h)} = \log \sum_h e^{-E(v,h)} - \log Z$$
Differentiating with respect to the weights $w_{ij}$ gives:
$$\frac{\partial \log P(v)}{\partial w_{ij}} = \frac{1}{\sum_h e^{-E(v,h)}} \sum_h \left(-\frac{\partial E(v,h)}{\partial w_{ij}}\right) e^{-E(v,h)} - \frac{1}{Z} \sum_{v,h} \left(-\frac{\partial E(v,h)}{\partial w_{ij}}\right) e^{-E(v,h)}$$
Similar forms exist for the biases $b_i$ and $c_j$.
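To get from this expression to the familiar update rule, note from the energy function that $\frac{\partial E(v,h)}{\partial w_{ij}} = -v_i h_j$, so the two terms are exactly expectations of $v_i h_j$ under the conditional and joint distributions:
$$\frac{\partial \log P(v)}{\partial w_{ij}} = \sum_h P(h \mid v)\, v_i h_j - \sum_{v,h} P(v,h)\, v_i h_j = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$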
Since we are maximizing the log-likelihood, we perform gradient ascent and update each weight as
$$w_{ij} \leftarrow w_{ij} + \eta \frac{\partial \log P(v)}{\partial w_{ij}}$$
Therefore we get our weight update rule:
$$\Delta w_{ij} = \eta \frac{\partial \log P(v)}{\partial w_{ij}} = \eta\left(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}\right)$$
A similar process can be followed for the biases $b_i$ and $c_j$.
$$\Delta b_i = \eta\left(\langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{model}}\right) \qquad \Delta c_j = \eta\left(\langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{model}}\right)$$
where $\langle \cdot \rangle_{\text{data}}$ is the expectation with respect to the training data and $\langle \cdot \rangle_{\text{model}}$ is the expectation with respect to the model distribution.
The next step is to approximate the model expectation using Gibbs sampling. In contrastive divergence the chain is started at a training example: sample $h$ from $P(h \mid v)$, sample a reconstruction $v'$ from $P(v \mid h)$, and repeat for $k$ steps (often just one, known as CD-1).
Once those steps are done, we update the weights and biases according to:
$$\Delta w_{ij} = \eta\left(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}\right)$$
$$\Delta b_i = \eta\left(\langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{model}}\right)$$
$$\Delta c_j = \eta\left(\langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{model}}\right)$$
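Putting the pieces together, a single CD-1 update might look like the sketch below. It assumes binary units, one Gibbs step and a mini-batch of visible vectors; the function and variable names are illustrative rather than taken from the simulator.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b, c, eta=0.1, rng=None):
    """One contrastive divergence (CD-1) step.
    V is a (batch, m) array of binary training vectors; returns updated (W, b, c)."""
    rng = rng or np.random.default_rng()

    # Positive phase: hidden activations driven by the data, giving <v_i h_j>_data.
    ph_data = sigmoid(V @ W + c)                     # P(h = 1 | v) for each example
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Negative phase: one Gibbs step to approximate <v_i h_j>_model.
    pv_model = sigmoid(h_data @ W.T + b)             # P(v = 1 | h)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(v_model @ W + c)              # P(h = 1 | v'), kept as probabilities

    batch = V.shape[0]
    W += eta * (V.T @ ph_data - v_model.T @ ph_model) / batch
    b += eta * (V - v_model).mean(axis=0)
    c += eta * (ph_data - ph_model).mean(axis=0)
    return W, b, c

# Example: fit a tiny RBM to three 4-pixel "images" (illustrative data only).
rng = np.random.default_rng(0)
W, b, c = 0.01 * rng.normal(size=(4, 3)), np.zeros(4), np.zeros(3)
V = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
for _ in range(1000):
    W, b, c = cd1_update(V, W, b, c, rng=rng)
```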