Understanding Evidence Lower Bound (ELBO) in Variational Inference

In probabilistic machine learning, understanding complex data often involves probabilistic models. However, when working with real-world datasets, computing the exact likelihood under these models is usually intractable. This is where Variational Inference (VI) and the Evidence Lower Bound (ELBO) come into play. In this post, we’ll break down what the ELBO is and how it helps in variational inference.

What is ELBO?

The Evidence Lower Bound (ELBO) is a tool for approximating complex probability distributions. In Variational Inference, we want to find a simple distribution that is close to the true, often intractable, posterior distribution. The ELBO lets us do this by providing a lower bound on the log-likelihood of our observed data (the “evidence”).

Why is ELBO Important?

In Variational Inference, our goal is to maximize the likelihood of observing the data given our model, represented as \(\log p(x)\), but calculating \(\log p(x)\) directly is often computationally infeasible. ELBO offers a practical alternative by providing a bound we can maximize instead.

In mathematical terms, maximizing the ELBO brings the variational distribution closer to the true posterior. This approach is beneficial in models where exact inference is intractable, like in Variational Autoencoders (VAEs) and Bayesian deep learning.

The Math Behind ELBO

Let’s break down the components of ELBO. Given a dataset \(x\) and a latent variable \(z\), we aim to approximate the true posterior \(p(z\vert x)\) with a simpler distribution \(q(z\vert x)\). ELBO is derived from the following decomposition of the log-likelihood:

\[\log p(x) = \text{ELBO} + \text{KL}(q(z\vert x) \parallel p(z\vert x)) \tag{1} \label{eq:1}\]

Here:

  • KL Divergence \(\text{KL}(q(z\vert x) \parallel p(z\vert x))\): This term measures how far the variational distribution \(q(z\vert x)\) is from the true posterior \(p(z\vert x)\). When we minimize it, \(q(z\vert x)\) becomes a good approximation of \(p(z\vert x)\).
  • ELBO: This is the Evidence Lower Bound. Since the log evidence \(\log p(x)\) is fixed and the KL divergence is non-negative, maximizing the ELBO minimizes the KL divergence, bringing \(q(z\vert x)\) closer to \(p(z\vert x)\).

Rearranging equation \(\eqref{eq:1}\), we get:

\[\text{ELBO} = \log p(x) - \text{KL}(q(z\vert x) \parallel p(z \vert x)) \tag{2} \label{eq:2}\]

Since the KL divergence is always non-negative, ELBO serves as a lower bound to the log evidence \(\log p(x)\).

Another common expression for ELBO is:

\[\text{ELBO} = \mathbb{E}_{q(z\vert x)} \left[ \log p(x, z) - \log q(z\vert x) \right] \tag{3} \label{eq:3}\]
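To see why equation \(\eqref{eq:3}\) agrees with equation \(\eqref{eq:2}\), write out the KL term and use Bayes’ rule, \(\log p(z\vert x) = \log p(x, z) - \log p(x)\):

\[\begin{aligned}
\text{ELBO} &= \log p(x) - \mathbb{E}_{q(z\vert x)} \left[ \log q(z\vert x) - \log p(z\vert x) \right] \\
&= \log p(x) - \mathbb{E}_{q(z\vert x)} \left[ \log q(z\vert x) - \log p(x, z) + \log p(x) \right] \\
&= \mathbb{E}_{q(z\vert x)} \left[ \log p(x, z) - \log q(z\vert x) \right],
\end{aligned}\]

since \(\log p(x)\) does not depend on \(z\) and cancels.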

Or equivalently:

\[\text{ELBO} = \mathbb{E}_{q(z\vert x)} \left[ \log p(x\vert z) \right] - \text{KL}(q(z\vert x) \parallel p(z)) \tag{4} \label{eq:4}\]
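Equation \(\eqref{eq:4}\) follows from \(\eqref{eq:3}\) by splitting the joint as \(\log p(x, z) = \log p(x\vert z) + \log p(z)\) and recognizing the second piece as a KL divergence:

\[\mathbb{E}_{q(z\vert x)} \left[ \log p(x\vert z) \right] + \mathbb{E}_{q(z\vert x)} \left[ \log p(z) - \log q(z\vert x) \right] = \mathbb{E}_{q(z\vert x)} \left[ \log p(x\vert z) \right] - \text{KL}(q(z\vert x) \parallel p(z))\]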

In this formulation, equation \(\eqref{eq:4}\) splits the ELBO into two terms:

  • Data Likelihood Term: \(\mathbb{E}_{q(z\vert x)} \left[ \log p(x\vert z) \right]\) represents how well the model explains the data.
  • Regularization Term: \(\text{KL}(q(z\vert x) \parallel p(z))\) ensures that the approximate posterior \(q(z\vert x)\) doesn’t stray too far from the prior \(p(z)\).
  • Maximizing the ELBO therefore balances these two terms: \(q(z\vert x)\) should explain the data well (a large \(\mathbb{E}_{q(z\vert x)} \left[ \log p(x\vert z) \right]\)) while staying close to the prior (a small \(\text{KL}(q(z\vert x) \parallel p(z))\)), as the short sketch below makes concrete.
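
Here is a minimal sketch of equation \(\eqref{eq:4}\) in Python/NumPy for an assumed toy model: a standard normal prior \(p(z)\), a Gaussian likelihood \(p(x\vert z)\) with mean \(z\) and variance \(\sigma_x^2\), and a Gaussian variational distribution \(q(z\vert x)\) with mean \(\mu_q\) and standard deviation \(s_q\). The data likelihood term is estimated by Monte Carlo sampling from \(q(z\vert x)\), and the regularization term uses the closed-form KL divergence between two Gaussians. The model, the function name `elbo`, and the numbers are illustrative assumptions, not a definitive implementation.

```python
import numpy as np

def elbo(x, mu_q, s_q, sigma_x=1.0, n_samples=5000, seed=0):
    """Monte Carlo estimate of the ELBO in equation (4) for a toy model:
    prior p(z) = N(0, 1), likelihood p(x|z) = N(z, sigma_x^2),
    variational distribution q(z|x) = N(mu_q, s_q^2)."""
    rng = np.random.default_rng(seed)

    # Data likelihood term: E_{q(z|x)}[log p(x|z)], estimated by sampling z ~ q(z|x)
    z = rng.normal(mu_q, s_q, size=n_samples)
    log_lik = -0.5 * np.log(2 * np.pi * sigma_x**2) - (x - z) ** 2 / (2 * sigma_x**2)
    data_term = log_lik.mean()

    # Regularization term: KL(N(mu_q, s_q^2) || N(0, 1)), available in closed form
    kl_term = 0.5 * (s_q**2 + mu_q**2 - 1.0 - np.log(s_q**2))

    return data_term - kl_term

print(elbo(x=1.5, mu_q=0.75, s_q=0.7))
```

In a VAE, the same two terms appear in the training objective, except that \(\mu_q\) and \(s_q\) are produced by an encoder network and \(\log p(x\vert z)\) by a decoder network.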

Intuition Behind ELBO

You can think of ELBO as a balance between two forces:

  1. Data Reconstruction: We want the model to explain or reconstruct the data well, captured by the term \(\mathbb{E}_{q(z\vert x)} \left[ \log p(x\vert z) \right]\).
  2. Regularization: We want the approximate distribution \(q(z\vert x)\) to stay close to our prior beliefs about \(z\), enforced by \(\text{KL}(q(z\vert x) \parallel p(z))\).

By maximizing ELBO, we’re effectively pushing the variational approximation to balance these two forces, resulting in a model that is both effective and interpretable.

Why Maximize ELBO?

Maximizing the ELBO is essential because it brings our approximating distribution \(q(z\vert x)\) closer to the true posterior \(p(z\vert x)\). This, in turn, makes our inference more accurate without requiring us to calculate \(\log p(x)\) directly. In practice, this makes the ELBO a core component in training models like VAEs, where the negative ELBO serves as the loss function guiding the model’s learning process.
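
To see this in action, here is a small sketch for an assumed conjugate toy model in which the exact posterior is known: a standard normal prior \(p(z)\) and a Gaussian likelihood \(p(x\vert z)\) with mean \(z\) and variance \(\sigma^2\), so that \(p(z\vert x)\) is itself Gaussian. Maximizing the (closed-form) ELBO over the parameters of a Gaussian \(q(z\vert x)\) should recover that exact posterior; the model, the function name `negative_elbo`, and the numbers are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy conjugate model: p(z) = N(0, 1), p(x|z) = N(z, sigma^2).
# Its exact posterior is p(z|x) = N(x / (1 + sigma^2), sigma^2 / (1 + sigma^2)).
x, sigma = 1.5, 0.8

def negative_elbo(params):
    """Negative ELBO of equation (4) for q(z|x) = N(m, s^2), with s = exp(log_s)."""
    m, log_s = params
    s2 = np.exp(2 * log_s)
    # E_q[log p(x|z)] has a closed form for a Gaussian likelihood:
    data_term = -0.5 * np.log(2 * np.pi * sigma**2) - ((x - m) ** 2 + s2) / (2 * sigma**2)
    kl_term = 0.5 * (s2 + m**2 - 1.0 - np.log(s2))  # KL(q(z|x) || N(0, 1))
    return -(data_term - kl_term)

result = minimize(negative_elbo, x0=np.array([0.0, 0.0]))
m_opt, s_opt = result.x[0], np.exp(result.x[1])

print("variational mean:", m_opt, " exact posterior mean:", x / (1 + sigma**2))
print("variational std: ", s_opt, " exact posterior std: ", sigma / np.sqrt(1 + sigma**2))
```

Because the true posterior lies inside the Gaussian variational family here, the KL term in equation \(\eqref{eq:1}\) can be driven to zero and the ELBO meets \(\log p(x)\) exactly; in a VAE, the same objective is maximized with stochastic gradient methods over network parameters instead of a generic optimizer.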


The Evidence Lower Bound (ELBO) is a powerful concept in probabilistic machine learning, making it possible to approximate difficult distributions effectively. By maximizing the ELBO, we improve our model’s ability to approximate the true distribution without directly computing intractable terms. This makes it invaluable in areas like Variational Autoencoders (VAEs), Bayesian inference, and any setting where exact inference is challenging.



