Boltzmann Machines
Here we give an introduction to Boltzmann machines and present a Tiny Restricted Boltzmann Machine that runs in the browser.
Boltzmann Machines are used for unsupervised learning, which means they can learn from data without being told what to look for.
They can be used for generating new data that is similar to the data they were trained on, also known as generative AI.
A Boltzmann Machine is a type of neural network that tries to learn patterns by mimicking how energy works in physics.
Each neuron can be on or off, and the machine is made up of many of these neurons connected to each other.
Some neurons are visible (we can see them and even set their state), and some are hidden (we can't see them).
The connections between neurons are called weights, and they can be positive or negative.
A General Boltzmann Machine has connections between all neurons. This makes it powerful, but its training involves calculating an $O(2^n)$ term.
A Restricted Boltzmann Machine is a special case where there are no connections within the visible layer or within the hidden layer, only between the two layers. This makes it faster to train and understand.
The energy of a configuration of the visible and hidden units is defined as:
$$E(v,h) = -\sum_{i=1}^{m}\sum_{j=1}^{n} w_{ij} v_i h_j - \sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} c_j h_j$$
where $v$ is the visible layer, $h$ is the hidden layer, $w$ is the weight matrix, and $b$ and $c$ are the biases for the visible and hidden layers, respectively.
The visualisation on the right randomises the weights, biases and activation values of a Boltzmann machine and calculates its energy.
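As a concrete illustration, the energy of a configuration can be computed in a few lines of NumPy. This is a minimal sketch with made-up sizes and random parameters, mirroring what the visualisation does; the variable names simply follow the symbols in the equation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: m visible units, n hidden units.
m, n = 6, 4

# Random weights, biases and binary states, as in the visualisation.
W = rng.normal(scale=0.5, size=(m, n))   # weights w_ij
b = rng.normal(scale=0.5, size=m)        # visible biases b_i
c = rng.normal(scale=0.5, size=n)        # hidden biases c_j
v = rng.integers(0, 2, size=m)           # visible states (0/1)
h = rng.integers(0, 2, size=n)           # hidden states (0/1)

def energy(v, h, W, b, c):
    """E(v, h) = -v^T W h - b^T v - c^T h (the double sum written with matrices)."""
    return -(v @ W @ h) - b @ v - c @ h

print(energy(v, h, W, b, c))
```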
During training, the machine is given examples (e.g., images, text) and adjusts its weights to lower the energy of those samples.
It effectively learns $P(v)$, the probability of visible units $v$, which is proportional to $e^{-E(v)}$.
After training, it can sample new data from the learned distribution using Gibbs sampling.
These samples are new (never seen during training) but statistically similar to the training data.
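A minimal sketch of what Gibbs sampling could look like for an RBM, assuming binary (0/1) units; the conditional probabilities below follow from the restricted structure, where each hidden unit depends only on the visible layer and vice versa. The function name and parameters are illustrative, not taken from the simulator.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample(W, b, c, steps=1000, rng=None):
    """Draw an approximate sample of v from a trained RBM by alternating
    between sampling the hidden layer given the visible layer and vice versa."""
    rng = rng or np.random.default_rng()
    m, n = W.shape
    v = rng.integers(0, 2, size=m).astype(float)     # random starting state
    for _ in range(steps):
        # P(h_j = 1 | v) = sigmoid(c_j + sum_i w_ij v_i)
        h = (rng.random(n) < sigmoid(c + v @ W)).astype(float)
        # P(v_i = 1 | h) = sigmoid(b_i + sum_j w_ij h_j)
        v = (rng.random(m) < sigmoid(b + W @ h)).astype(float)
    return v
```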
Starting with a Boltzmann machine as defined earlier, we want to derive the contrastive divergence algorithm for training. The goal is to adjust the weights of the network to minimize the energy of the training data.
We have:
Energy function:
$$E(v,h) = -\sum_{i=1}^{m}\sum_{j=1}^{n} w_{ij} v_i h_j - \sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} c_j h_j$$
Joint distribution:
$$P(v,h) = \frac{1}{Z} e^{-E(v,h)}$$
where $Z$ is the partition function, which normalizes the distribution.
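For a very small model, $Z$ can be computed by brute force, which makes the definition concrete and shows why exact training is intractable for larger networks. A sketch, assuming binary units and illustrative parameter shapes:

```python
import itertools
import numpy as np

def partition_function(W, b, c):
    """Brute-force Z: sum exp(-E(v, h)) over every visible/hidden configuration.
    Only feasible for tiny models, hence the O(2^n) cost mentioned earlier."""
    m, n = W.shape
    Z = 0.0
    for v_bits in itertools.product([0, 1], repeat=m):
        v = np.array(v_bits)
        for h_bits in itertools.product([0, 1], repeat=n):
            h = np.array(h_bits)
            Z += np.exp(v @ W @ h + b @ v + c @ h)   # exp(-E(v, h))
    return Z

def joint_probability(v, h, W, b, c):
    """P(v, h) = exp(-E(v, h)) / Z."""
    return np.exp(v @ W @ h + b @ v + c @ h) / partition_function(W, b, c)

# Tiny example with 3 visible and 2 hidden units.
rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=2)
print(joint_probability(np.array([1, 0, 1]), np.array([0, 1]), W, b, c))
```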
We train the RBM by maximizing the likelihood of the training data, i.e. maximizing $\log P(v)$.
The marginal likelihood of the visible layer is given by:
$$P(v) = \sum_h P(v,h)$$
Then the log-likelihood is:
$$\log P(v) = \log \sum_h \frac{1}{Z} e^{-E(v,h)} = \log \sum_h e^{-E(v,h)} - \log Z$$
Differentiating with respect to the weights $w_{ij}$ gives:
$$\frac{\partial \log P(v)}{\partial w_{ij}} = \frac{1}{\sum_h e^{-E(v,h)}} \sum_h \left(-\frac{\partial E(v,h)}{\partial w_{ij}}\right) e^{-E(v,h)} - \frac{1}{Z} \sum_{v,h} \left(-\frac{\partial E(v,h)}{\partial w_{ij}}\right) e^{-E(v,h)}$$
Similar forms exist for the biases $b_i$ and $c_j$.
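To get from this expression to the familiar update rule, note from the energy function that $\frac{\partial E(v,h)}{\partial w_{ij}} = -v_i h_j$, so the two terms are exactly expectations of $v_i h_j$ under the conditional and joint distributions:
$$\frac{\partial \log P(v)}{\partial w_{ij}} = \sum_h P(h \mid v)\, v_i h_j - \sum_{v,h} P(v,h)\, v_i h_j = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$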
Since we are maximizing the log-likelihood, we perform gradient ascent and update each weight as
$$w_{ij} \leftarrow w_{ij} + \eta \frac{\partial \log P(v)}{\partial w_{ij}}$$
Therefore we get our weight update rule:
$$\Delta w_{ij} = \eta \frac{\partial \log P(v)}{\partial w_{ij}} = \eta\left(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}\right)$$
A similar process can be followed for the biases $b_i$ and $c_j$.
$$\Delta b_i = \eta\left(\langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{model}}\right) \qquad \Delta c_j = \eta\left(\langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{model}}\right)$$
where $\langle \cdot \rangle_{\text{data}}$ is the expectation with respect to the training data and $\langle \cdot \rangle_{\text{model}}$ is the expectation with respect to the model distribution.
The next step is to approximate the model expectation using Gibbs sampling. In contrastive divergence the chain is started at a training example: sample $h$ from $P(h \mid v)$, sample a reconstruction $v'$ from $P(v \mid h)$, and repeat for $k$ steps (often just one, known as CD-1).
Once those steps are done, we update the weights and biases according to:
$$\Delta w_{ij} = \eta\left(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}\right)$$
$$\Delta b_i = \eta\left(\langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{model}}\right)$$
$$\Delta c_j = \eta\left(\langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{model}}\right)$$
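Putting the pieces together, a single CD-1 update might look like the sketch below. It assumes binary units, one Gibbs step and a mini-batch of visible vectors; the function and variable names are illustrative rather than taken from the simulator.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b, c, eta=0.1, rng=None):
    """One contrastive divergence (CD-1) step.
    V is a (batch, m) array of binary training vectors; returns updated (W, b, c)."""
    rng = rng or np.random.default_rng()

    # Positive phase: hidden activations driven by the data, giving <v_i h_j>_data.
    ph_data = sigmoid(V @ W + c)                     # P(h = 1 | v) for each example
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Negative phase: one Gibbs step to approximate <v_i h_j>_model.
    pv_model = sigmoid(h_data @ W.T + b)             # P(v = 1 | h)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(v_model @ W + c)              # P(h = 1 | v'), kept as probabilities

    batch = V.shape[0]
    W += eta * (V.T @ ph_data - v_model.T @ ph_model) / batch
    b += eta * (V - v_model).mean(axis=0)
    c += eta * (ph_data - ph_model).mean(axis=0)
    return W, b, c

# Example: fit a tiny RBM to three 4-pixel "images" (illustrative data only).
rng = np.random.default_rng(0)
W, b, c = 0.01 * rng.normal(size=(4, 3)), np.zeros(4), np.zeros(3)
V = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
for _ in range(1000):
    W, b, c = cd1_update(V, W, b, c, rng=rng)
```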