Introduction to Stein Variational Gradient Descent

The Stein Variational Gradient Descent (SVGD) algorithm was first introduced by Liu and Wang [LW16], whose idea is to transport a set of $N$ particles $\{x_i\}_{i=1}^N$ in $\mathbb R^d$ so that their empirical measure

$$
\mu^N:=\frac{1}{N}\sum_{i=1}^N\delta_{x_i}
$$

approximates the target probability measure

$$
\rho_\infty (x)\ dx=Z^{-1} e^{-V(x)}\ dx
$$

with an unknown normalization factor $Z$.

At discrete times, the particles are updated via the map
$$
x \mapsto T(x) = x + \varepsilon \varphi(x),
\tag{1}
$$

where $\varepsilon$ is a small time step size and $\varphi$ is a velocity field, chosen so as to give the "fastest decay" of the Kullback-Leibler (KL) divergence between the push-forward measure $T_\sharp\mu^N$ and the target $\rho_\infty$. Recall that the KL divergence, or relative entropy, $\operatorname{KL}(\mu\Vert\nu)$ between probability measures $\mu$ and $\nu$ is

$$
\operatorname{KL}(\mu\Vert\nu)
=
\int \log\left(\frac{d\mu}{d\nu}\right)\frac{d\mu}{d\nu}d\nu,
$$

if $\mu$ is absolutely continuous with respect to $\nu$, and we set $\operatorname{KL}(\mu\Vert\nu)=+\infty$ otherwise.

This idea of SVGD can be formalized as choosing the velocity field $\varphi$ to solve the variational problem

$$
\sup _{ \varphi\in\mathcal H } \left\{-\partial_\varepsilon \operatorname{KL} (T_\sharp \mu^N \Vert\rho_\infty) \big| _{\varepsilon=0} \ \big|\ \Vert\varphi\Vert _{\mathcal H} \le 1 \right\}\tag{2}
$$

at each time step, where $\mathcal H$ is a suitable space of vector fields. At first sight, (2) is not well-defined: the empirical measure $T_\sharp\mu^N$ is singular with respect to $\rho_\infty$, so that $\operatorname{KL}(T_\sharp\mu^N\Vert\rho_\infty)=+\infty$. However, as shown in [LW16], (2) can be given meaning through the following observation.

Lemma 1. If $\mu$ is absolutely continuous with respect to $\rho$ and $\operatorname{KL}(T_\sharp\mu\Vert\rho)<\infty$, then
$$
-\partial_\varepsilon \operatorname{KL}(T_\sharp \mu \Vert\rho) \big| _{\varepsilon=0}
=
\mathbb E_\mu[S_\rho\varphi],
$$

where $S_\rho$ is the so-called Stein operator defined by

$$
S_\rho\varphi
:=
\nabla\log\rho(x)\cdot\varphi(x)+\nabla\cdot\varphi(x).
$$

Proof : Denote $\mu_\varepsilon := T_{\varepsilon\sharp}\mu$ with $T_\varepsilon(x)=x+\varepsilon\varphi(x)$ and, with a slight abuse of notation, denote its density by $\mu_\varepsilon$ as well. Since the mass is transported along the velocity field $\varphi$, the density satisfies, to first order in $\varepsilon$, the continuity equation:

$$
\partial_\varepsilon \mu_\varepsilon
+
\nabla \cdot (\mu_\varepsilon \varphi)
=
0 .
$$

Differentiating the relative entropy directly, we therefore have

$$
\begin{aligned}
\partial_\varepsilon \operatorname{KL}(\mu_\varepsilon \Vert \rho)
&=
\partial_\varepsilon
\int
\mu_\varepsilon
\log \frac{\mu_\varepsilon}{\rho}
\ dx =
\int
\left(
\log \frac{\mu_\varepsilon}{\rho}
+
1
\right)
\partial_\varepsilon \mu_\varepsilon
\ dx \\\
&\xlongequal{\text{continuity equation}}
-
\int
\left(
\log \frac{\mu_\varepsilon}{\rho}
+
1
\right)
\nabla \cdot (\mu_\varepsilon \varphi)
\ dx \\\
&\xlongequal{\text{integration by parts}}
\int
\nabla
\log \frac{\mu_\varepsilon}{\rho}
\cdot
\mu_\varepsilon \varphi
\ dx .
\end{aligned}
$$

Evaluating at $\varepsilon=0$, we obtain

$$
\begin{aligned}
-\partial_\varepsilon \operatorname{KL} (\mu _\varepsilon \Vert \rho) \vert _{\varepsilon = 0}
& = -\int \nabla \log \frac{\mu}{\rho}\cdot \varphi \ d\mu = -\int \varphi \cdot \nabla \log \mu \ d \mu +\int \varphi \cdot \nabla \log \rho \ d\mu \\\
& = -\int \varphi \cdot \nabla \mu \ dx +\int \varphi \cdot \nabla \log \rho \ d\mu \\\
&\xlongequal{\text{integration by parts}} \int \nabla \cdot \varphi \ d\mu +\int \varphi \cdot \nabla \log \rho \ d\mu \\\
&=\mathbb E _{ \mu }\left[ \nabla \cdot \varphi + \varphi \cdot \nabla \log \rho\right]= \mathbb E _{ \mu } [ S_\rho \varphi ],
\end{aligned}
$$

which gives the desired result. $\square$
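Lemma 1 can be sanity-checked numerically in a case where everything is available in closed form. The sketch below assumes a one-dimensional setting with $\mu=N(m,s^2)$, $\rho=N(0,1)$ and the affine field $\varphi(x)=ax+b$; these choices, and all function names, are illustrative assumptions and not part of the original argument. Since an affine map sends Gaussians to Gaussians, $\operatorname{KL}(T_\sharp\mu\Vert\rho)$ is explicit, and its derivative at $\varepsilon=0$ can be compared with $\mathbb E_\mu[S_\rho\varphi]$.

```python
import math


def kl_gauss_to_std_normal(mean, std):
    """Closed form of KL( N(mean, std^2) || N(0, 1) )."""
    var = std ** 2
    return 0.5 * (var + mean ** 2 - 1.0 - math.log(var))


def kl_after_push(eps, m, s, a, b):
    """KL(T#mu || rho) for mu = N(m, s^2), rho = N(0, 1), T(x) = x + eps*(a*x + b)."""
    # An affine map sends a Gaussian to a Gaussian with these parameters.
    new_mean = m + eps * (a * m + b)
    new_std = s * abs(1.0 + eps * a)
    return kl_gauss_to_std_normal(new_mean, new_std)


def stein_expectation(m, s, a, b):
    """E_mu[S_rho phi] with grad log rho(x) = -x and phi(x) = a*x + b."""
    # S_rho phi(x) = -x*(a*x + b) + a, and E_mu[x^2] = s^2 + m^2.
    return -a * (s ** 2 + m ** 2) - b * m + a


# Illustrative parameter values (assumptions).
m, s, a, b = 0.7, 1.3, 0.4, -0.2
h = 1e-6  # finite-difference step in epsilon

# Central difference approximation of -d/d(eps) KL at eps = 0.
lhs = -(kl_after_push(h, m, s, a, b) - kl_after_push(-h, m, s, a, b)) / (2 * h)
rhs = stein_expectation(m, s, a, b)
print(lhs, rhs)
```

The two printed numbers should agree up to finite-difference error, which is what Lemma 1 predicts for this example.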

In view of (2) and Lemma 1, one is led to define the Stein discrepancy

$$
\mathrm{SD} (\mu,\rho,\mathcal H) := \sup _{\varphi\in\mathcal H} \left\{ \mathbb E_\mu[S_\rho\varphi] \ \big|\ \Vert \varphi\Vert _{\mathcal H}\le 1 \right\}, \tag{3}
$$

which satisfies $\mathrm{SD}(\mu,\rho,\mathcal H)\ge 0$, with equality if and only if $\mu=\rho$, provided that the space $\mathcal H$ is sufficiently rich.

For the empirical measure $\mu^N$, the objective function $\mathbb E_{\mu^N}[S_{\rho_\infty}\varphi]$ in (3) may be well-defined and finite even though $\operatorname{KL}(T_\sharp\mu^N\Vert\rho_\infty)=+\infty$.

Furthermore, [LW16] showed that if the space $\mathcal H$ is chosen to be a reproducing kernel Hilbert space with a positive definite kernel $K$, then the velocity field optimizing (3) can be characterized explicitly.

Lemma 2. Let $\mathcal H_K$ be a reproducing kernel Hilbert space with a positive definite kernel $K:\mathbb R^d\times \mathbb R^d\to \mathbb R$ and $\mathcal H:=\mathcal H_K^d$. For all $\varphi,\psi:\mathbb R^d\to\mathbb R^d$ with $\varphi=(\varphi_1,\cdots,\varphi_d)$ and $\psi=(\psi_1,\cdots,\psi_d)$, the inner product on $\mathcal H$ is defined by

$$
\langle \varphi,\psi\rangle_{\mathcal H}:= \sum_{i=1}^d \langle \varphi_i,\psi_i\rangle_{\mathcal H_K}.
$$

Then the optimal velocity field $\varphi^\star_{\mu,\rho}$ of (3) can be characterized by

$$
\varphi^\star_{\mu,\rho}(\cdot) \propto \mathbb E_{x\sim\mu}[S_\rho K(x,\cdot)] := \int_{\mathbb R^d} \left( \nabla\log\rho(x)K(x,\cdot)+\nabla_x K(x,\cdot) \right) \mu(dx).
$$

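Proof : By the definition of the Stein operator and the reproducing property $\varphi_i(x)=\langle \varphi_i(\cdot),K(x,\cdot)\rangle_{\mathcal H_K}$, which for a smooth kernel extends to derivatives as $\partial_{x_i}\varphi_i(x)=\langle \varphi_i(\cdot),\partial_{x_i}K(x,\cdot)\rangle_{\mathcal H_K}$, we have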
$$
\begin{aligned}
S_\rho \varphi(x)&=\nabla \log \rho(x)\cdot \varphi(x)+\nabla\cdot \varphi(x)\\\
&=\sum_{i=1}^d \left( \partial_i \log \rho(x)\cdot \varphi_i(x) +\partial_{x_i} \varphi_i(x)\right)\\\
&=\sum_{i=1}^d \left[ \langle \varphi_i(\cdot), \partial_i\log \rho(x) K(x,\cdot)\rangle_{\mathcal H_K}+\langle \varphi_i(\cdot),\partial_{x_i} K(x,\cdot)\rangle_{\mathcal H_K}\right]\\\
&= \sum_{i=1}^d \langle \varphi_i(\cdot), \xi_i(x,\cdot)\rangle_{\mathcal H_K}=\langle \varphi(\cdot),\xi(x,\cdot)\rangle_{\mathcal H},
\end{aligned}
$$

where

$$
\xi_i(x,\cdot):= \partial_i\log \rho(x) K(x,\cdot)+\partial_{x_i} K(x,\cdot)\quad \text{and}\quad \xi(x,\cdot):= \nabla\log \rho(x) K(x,\cdot)+\nabla_x K(x,\cdot)= S_\rho K(x,\cdot).
$$

Therefore, we have

$$
\mathbb E_\mu[S_\rho\varphi]=\langle \varphi(\cdot),\mathbb E_\mu[S_\rho K(x,\cdot)]\rangle_{\mathcal H}.
$$

Then, by the definition of the Stein discrepancy,

$$
\mathrm{SD}(\mu,\rho,\mathcal H)=\sup_{\Vert\varphi\Vert_{\mathcal H}\le 1} \langle \varphi(\cdot),\mathbb E_\mu[S_\rho K(x,\cdot)]\rangle_{\mathcal H},
$$

we see, by the Cauchy-Schwarz inequality, that the optimal vector field must satisfy

$$
\varphi^\star_{\mu,\rho}(\cdot)
\propto
\mathbb E_{x\sim\mu}[S_\rho K(x,\cdot)]
:=
\int_{\mathbb R^d}
\left(
\nabla\log\rho(x)K(x,\cdot)+\nabla_x K(x,\cdot)
\right)
\mu(dx),
$$

which gives the desired result. $\square$
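As a short worked step (a sketch, assuming that $g:=\mathbb E_{x\sim\mu}[S_\rho K(x,\cdot)]$ belongs to $\mathcal H$ and is nonzero; the symbol $g$ is introduced here only for illustration), the Cauchy-Schwarz inequality makes both the proportionality constant and the value of the supremum explicit:

$$
\langle \varphi,g\rangle _{\mathcal H}\le \Vert\varphi\Vert _{\mathcal H}\,\Vert g\Vert _{\mathcal H}\le \Vert g\Vert _{\mathcal H}
\quad\Longrightarrow\quad
\varphi^\star_{\mu,\rho}=\frac{g}{\Vert g\Vert _{\mathcal H}},
\qquad
\mathrm{SD}(\mu,\rho,\mathcal H)=\Vert g\Vert _{\mathcal H}.
$$

Equality in the first inequality holds exactly when $\varphi$ is a nonnegative multiple of $g$. In the update (1), the normalization $\Vert g\Vert _{\mathcal H}$ can be absorbed into the step size $\varepsilon$.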

For the empirical measure $\mu^N$ and the target distribution $\rho_{\infty}\propto e^{-V(x)}$, the optimal direction is given by

$$
\varphi^\star_{\mu^N,\rho_\infty}(\cdot)
\propto
\mathbb E_{x\sim\mu^N}[S_{\rho_\infty} K(x,\cdot)]
=
\frac{1}{N}\sum_{j=1}^N
\left(
\nabla\log\rho_\infty(x_j)K(x_j,\cdot)+\nabla_{x_j} K(x_j,\cdot)
\right).
$$

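Now assume that the kernel is translation invariant, that is, $K(x,y)=K(x-y)$ for a function $K:\mathbb R^d\to\mathbb R$ (denoted by the same letter) that is smooth, symmetric and positive definite. Since $\nabla\log\rho_\infty=-\nabla V$ and, by symmetry, $\nabla_{x_j}K(x_j-x)=\nabla K(x_j-x)=-\nabla K(x-x_j)$, we obtain

$$
\varphi^\star_{\mu^N,\rho_\infty}(x)
\propto
\frac{1}{N}\sum_{j=1}^N
\left(
-\nabla V(x_j)K(x-x_j)-\nabla K(x-x_j)
\right).
$$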

Substituting this optimal velocity into (1) and letting the step size $\varepsilon\downarrow 0$ gives the following interacting particle system in $\mathbb R^d$:

$$
\begin{aligned}
\dot{x}_i(t) &= -\frac{1}{N}\sum _{j=1}^{N}\nabla K(x_i(t)-x_j(t)) -\frac{1}{N} \sum _{j=1} ^{N}K\bigl(x_i(t)-x_j(t)\bigr)\nabla V (x_j(t)), \\\
x_i(0) &= x_i ^0 \in \mathbb R^d, \qquad i=1,\cdots,N.
\end{aligned} \tag{4}
$$

The time-discretized form of the above ODEs is called the Stein Variational Gradient Descent (SVGD).
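An explicit Euler discretization of (4) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation of [LW16]: it assumes the Gaussian kernel $K(z)=\exp(-\vert z\vert^2/(2h^2))$ with a fixed bandwidth $h$, a constant step size, and a standard Gaussian target with $V(x)=\vert x\vert^2/2$. The function names `svgd_velocity` and `svgd` and all parameter values are hypothetical.

```python
import numpy as np


def svgd_velocity(x, grad_V, bandwidth=1.0):
    """Right-hand side of (4) for particles x of shape (N, d).

    Assumes the Gaussian kernel K(z) = exp(-|z|^2 / (2 * bandwidth^2)).
    """
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]                # diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)                # shape (N, N)
    k = np.exp(-sq_dist / (2.0 * bandwidth ** 2))       # K(x_i - x_j)
    grad_k = -diff / bandwidth ** 2 * k[:, :, None]     # (grad K)(x_i - x_j)
    repulsion = -grad_k.sum(axis=1) / n                 # first sum in (4)
    drift = -(k @ grad_V(x)) / n                        # second sum in (4)
    return repulsion + drift


def svgd(x0, grad_V, step=0.1, n_steps=1000, bandwidth=1.0):
    """Explicit Euler time stepping of the particle system (4)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * svgd_velocity(x, grad_V, bandwidth)
    return x


# Example: standard Gaussian target in one dimension, so grad V(x) = x.
rng = np.random.default_rng(0)
x0 = rng.normal(loc=3.0, scale=0.5, size=(50, 1))   # particles start far from the mode
particles = svgd(x0, lambda x: x, step=0.1, n_steps=2000, bandwidth=1.0)
print(particles.mean(), particles.std())
```

With a single particle the repulsion term vanishes and the update reduces to plain gradient descent on $V$, which is a convenient sanity check. Practical implementations often choose the bandwidth adaptively; the fixed value here is only for simplicity.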

As the number of particles $N\to\infty$, the mean field limit of this interacting particle system is described by the following PDE:

$$
\begin{aligned}
&\partial_t \rho
= \nabla \cdot \bigl( \rho (K \star (\nabla \rho + \nabla V \rho)) \bigr), \\\
&\rho(0,\cdot) = \rho_0(\cdot).
\end{aligned}
\tag{5}
$$

The rigorous connection between the interacting particle system (4) and the mean field PDE (5) is established in [LLN19].
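As a quick consistency check (a sketch, assuming enough regularity and decay for the expressions to make sense), the target density is a stationary solution of (5). Indeed, since $\rho_\infty=Z^{-1}e^{-V}$,

$$
\nabla\rho_\infty+\nabla V\,\rho_\infty
=
-\nabla V\,\rho_\infty+\nabla V\,\rho_\infty
=
0,
$$

so $K\star(\nabla\rho_\infty+\nabla V\rho_\infty)=0$ and the right-hand side of (5) vanishes at $\rho=\rho_\infty$.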

Comparison with ULA. Recall that the unadjusted Langevin algorithm (ULA) for sampling from $\rho_\infty$ is the Euler-Maruyama discretization of the Langevin dynamics

$$
\operatorname{d} X_t = -\nabla V(X_t)\operatorname{d}t +\sqrt{2} \operatorname{d} B_t.
$$
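For comparison, the Euler-Maruyama step for these dynamics reads $X_{k+1}=X_k-h\nabla V(X_k)+\sqrt{2h}\,\xi_k$ with independent standard Gaussian $\xi_k$. The following minimal sketch runs many independent chains in parallel; the function name `ula`, the step size $h$, and the standard Gaussian target are illustrative assumptions.

```python
import numpy as np


def ula(x0, grad_V, step=0.01, n_steps=1000, rng=None):
    """Unadjusted Langevin algorithm: Euler-Maruyama for dX = -grad V dt + sqrt(2) dB.

    x0 has shape (n_chains, d); every row is an independent chain.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_V(x) + np.sqrt(2.0 * step) * noise
    return x


# Example: standard Gaussian target, V(x) = |x|^2 / 2, so grad V(x) = x.
rng = np.random.default_rng(0)
x0 = np.full((5000, 1), 3.0)   # all chains start away from the mode
samples = ula(x0, lambda x: x, step=0.01, n_steps=2000, rng=rng)
print(samples.mean(), samples.var())
```

Unlike the particles in (4), the chains here do not interact; the spread of the samples comes entirely from the injected noise. Because no accept/reject correction is applied, the stationary law of the discretized chain differs slightly from $\rho_\infty$ for any fixed step size.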

One advantage of ULA is that the dynamics tend to explore high-probability regions (around the local minima of $V$), while the random noise helps the dynamics escape from basins of attraction and thus promotes exploration of the entire state space. In contrast to this stochastic sampling approach, (4) may be viewed as a deterministic (albeit coupled) particle system for approximating $\rho_\infty$. Qualitatively speaking, the terms in (4) that involve $\nabla V$ tend to drive particles toward local minima of $V$ (note, however, the nonlocal interaction due to the presence of $K$). On the other hand, the terms involving $\nabla K$ are repulsive, forcing the particles to disperse; this is seen in the fact that

$$
-\frac{1}{N}\sum_{j=1}^{N}\nabla K(x_i-x_j)=-\nabla_{x_i}E(\mathbf{x}),
$$

where

$$
E(\mathbf{x})=\frac{1}{N}\sum_{i<j}K(x_i-x_j)
$$

is the interaction energy; the identity uses the symmetry of $K$ and $\nabla K(0)=0$. This interaction term in SVGD plays a role similar to that of the diffusion term in stochastic-dynamics-based sampling methods. In fact, this particle-dispersing property is essential, especially for certain funnel-shaped distributions.
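The gradient identity for the interaction energy can be checked numerically. The sketch below assumes a Gaussian kernel $K(z)=\exp(-\vert z\vert^2/(2h^2))$ (an illustrative choice that is symmetric with $\nabla K(0)=0$) and compares the repulsive term with a central finite-difference gradient of $E$; all function names are hypothetical.

```python
import numpy as np


def kernel(z, bandwidth=1.0):
    """Gaussian kernel K(z) = exp(-|z|^2 / (2 * bandwidth^2)), applied along the last axis."""
    return np.exp(-np.sum(z ** 2, axis=-1) / (2.0 * bandwidth ** 2))


def grad_kernel(z, bandwidth=1.0):
    """Gradient of the Gaussian kernel; it vanishes at z = 0."""
    return -z / bandwidth ** 2 * kernel(z, bandwidth)[..., None]


def interaction_energy(x):
    """E(x) = (1/N) * sum over pairs i < j of K(x_i - x_j)."""
    n = x.shape[0]
    i, j = np.triu_indices(n, k=1)
    return kernel(x[i] - x[j]).sum() / n


def repulsive_force(x):
    """Row i holds -(1/N) * sum_j grad K(x_i - x_j)."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]
    return -grad_kernel(diff).sum(axis=1) / n


def numerical_neg_grad(x, h=1e-6):
    """Central finite-difference approximation of -grad E, entry by entry."""
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        e = np.zeros_like(x)
        e[idx] = h
        g[idx] = -(interaction_energy(x + e) - interaction_energy(x - e)) / (2 * h)
    return g


rng = np.random.default_rng(1)
x = rng.normal(size=(6, 2))
max_gap = np.max(np.abs(repulsive_force(x) - numerical_neg_grad(x)))
print(max_gap)
```

For two particles with this kernel, the force on each particle points away from the other one, which illustrates the repulsion described above.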


Figure 1: Funnel-shaped distributions.

References

[LW16] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

[LLN19] Jianfeng Lu, Yulong Lu, and James Nolen. Scaling limit of the Stein variational gradient descent: The mean field regime. SIAM Journal on Mathematical Analysis, 51(2):648–671, 2019.

The cover image of this article was taken at Mount Kazbek, Georgia.

https://handsteinwang.github.io/2026/05/14/SVGD/

Author

Handstein Wang

Posted on

2026-05-14

Updated on

2026-05-15
